// 创建视图 data.createOrReplaceTempView("Affairs") val df1 = spark.sql("SELECT * FROM Affairs WHERE age BETWEEN 20 AND 25") df1 ...
import org.apache.spark.storage.StorageLevel 数据持久缓存到内存中 data.cache data.persist 设置缓存级别data.persist StorageLevel.DISK ONLY 清除缓存data.unpersist data.unpersist blocking true 级别 使用空间 CPU时间 是否在内存中 是否在磁盘上 备注 ...
2016-11-25 15:40 0 6230 推荐指数:
// 创建视图 data.createOrReplaceTempView("Affairs") val df1 = spark.sql("SELECT * FROM Affairs WHERE age BETWEEN 20 AND 25") df1 ...
data.groupBy("gender").agg(count($"age"),max($"age").as("maxAge"), avg($"age").as("avgAge")).show ...
val dfList = List(("Hadoop", "Java,SQL,Hive,HBase,MySQL"), ("Spark", "Scala,SQL,DataSet,MLlib,GraphX")) dfList: List[(String, String)] = List ...
import org.apache.spark.sql.functions._ // 对整个DataFrame的数据去重 data.distinct() data.dropDuplicates() // 对指定列的去重 val colArray=Array ...
collect_set去除重复元素;collect_list不去除重复元素select gender, concat_ws(',', collect_set(children)), ...
val df6 = spark.sql("select gender,children,max(age),avg(age),count(age) from Affairs group by Cube(gender,children) order by 1,2") df6.show +------+--------+--------+--------+----------+ ...
import org.apache.spark.sql.SparkSession import org.apache.spark.sql.Dataset import org.apache.spark.sql.Row import ...
Dataset是一个强类型的特定领域的对象,这种对象可以函数式或者关系操作并行地转换。每个Dataset也有一个被称为一个DataFrame的类型化视图,这种DataFrame是Row类型的Dataset,即Dataset[Row] Dataset是“懒惰”的,只在执行行动操作时触发计算 ...