I. DataFrame operations: reference links
https://blog.csdn.net/dabokele/article/details/52802150
https://www.jianshu.com/p/009126dec52f
Add / delete / update / query / merge / statistics and data processing: https://blog.csdn.net/sinat_26917383/article/details/80500349
Spark left outer join: https://blog.csdn.net/iduanyingjie/article/details/57449539
StructField, StructType, schema: https://blog.csdn.net/legotime/article/details/52643243
DataFrame, Dataset, sql.DataFrame: https://www.cnblogs.com/seaspring/p/5831677.html
Creating a DataFrame: https://blog.csdn.net/shirukai/article/details/81085642
II. Using filter on a DataFrame
val df = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 2), ("b", 3), ("c", 1))).toDF("id", "num")
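1. Filtering on an integer column
(1) Comparison operators: >, <, ===
df.filter($"num" === 2)
df.filter($"num" > 2)
df.filter($"num" < 2)
or, equivalently, with a SQL expression string:
df.filter("num = 2")
df.filter("num > 2")
df.filter("num < 2")
(2) Filtering with a variable
val ind: Int = 2
df.filter($"num" === ind)
df.filter($"num" > ind)
df.filter($"num" < ind)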
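The integer filters in this subsection can be tried end to end with the sketch below. It assumes a local SparkSession (in spark-shell, `spark` and the implicits import are already provided), and the result names `eq2`, `gt2`, `lt2`, and `eqSql` are only illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("filter-int").getOrCreate()
import spark.implicits._ // enables the $"col" syntax and toDF

val df = Seq(("a", 1), ("a", 2), ("b", 2), ("b", 3), ("c", 1)).toDF("id", "num")

val ind: Int = 2
val eq2 = df.filter($"num" === ind)   // rows whose num equals 2
val gt2 = df.filter($"num" > ind)     // rows whose num is greater than 2
val lt2 = df.filter($"num" < ind)     // rows whose num is less than 2
val eqSql = df.filter(s"num = $ind")  // same as eq2, written as a SQL expression string

eq2.show()
```

Note that column equality uses `===` rather than `==`: `==` would compare the Column object itself and return a plain Boolean instead of building a filter expression.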
2. Filtering on a string column
df.filter($"id".equalTo("a"))
(1) Filtering with a variable
val str = "a"
df.filter($"id".equalTo(str))
When the DataFrame has no column names, the default names _1, _2, ... can be used in the condition.
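As a small sketch of the default-name case, the snippet below builds a DataFrame from tuples without naming the columns and filters on `_1` and `_2`. The local SparkSession and the names `unnamed`, `byNum`, and `byId` are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("filter-default-names").getOrCreate()
import spark.implicits._

// toDF() without arguments keeps the tuple field names _1, _2 as column names.
val unnamed = Seq(("a", 1), ("a", 2), ("b", 2)).toDF()

val byNum = unnamed.filter($"_2" === 2)        // filter on the second tuple field
val byId  = unnamed.filter($"_1".equalTo("a")) // filter on the first tuple field

unnamed.printSchema()
```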
3. Combining multiple conditions
Logical connectives: && (and), || (or)
df.filter($"num" === 2 && $"id".equalTo("a"))
df.filter($"num" === 1 || $"num" === 3)
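A runnable sketch of the combined conditions follows, again assuming a local SparkSession. It also shows `isin`, which can stand in for a chain of `||` comparisons on the same column; the result names are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("filter-multi").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("b", 2), ("b", 3), ("c", 1)).toDF("id", "num")

// Both conditions must hold: only the row ("a", 2) matches.
val both = df.filter($"num" === 2 && $"id".equalTo("a"))

// Either condition may hold: ("a", 1), ("b", 3), ("c", 1) match.
val either = df.filter($"num" === 1 || $"num" === 3)

// Same result as `either`, expressed with isin.
val viaIsin = df.filter($"num".isin(1, 3))

both.show()
```

In Scala, `===` binds more tightly than `&&` and `||`, so the comparisons do not need extra parentheses here.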
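III. map cannot be used on a DataFrame or Dataset[T]: Unable to find encoder for type stored in a Dataset
Spark 2.0 and later use the new distributed collection type Dataset; DataFrame is a type alias for Dataset[Row].
How to fix map, flatMap, and other transformations failing on a Dataset returned by sql(): https://blog.51cto.com/9269309/1954540
Method 1: Transforming a Dataset requires an Encoder for the result type, so an Encoder must be in scope before calling map. The example from the official documentation: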
// No pre-defined encoders for Dataset[Map[K,V]], define explicitly
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
// Primitive types and case classes can be also defined as
// implicit val stringIntMapEncoder: Encoder[Map[String, Any]] = ExpressionEncoder()
// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagersDF.map(teenager => teenager.getValuesMap[Any](List("name", "age"))).collect()
// Array(Map("name" -> "Justin", "age" -> 19))
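To make Method 1 runnable on its own, the sketch below replaces `teenagersDF` with a hypothetical one-row DataFrame and assumes a local SparkSession. It also shows that when map returns a type with a built-in encoder, such as String, `import spark.implicits._` is enough and no explicit Encoder is needed.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// Local session for illustration only; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("map-encoder").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for teenagersDF from the documentation example.
val teenagersDF = Seq(("Justin", 19)).toDF("name", "age")

// String has a built-in encoder supplied by spark.implicits._
val names = teenagersDF.map(row => row.getAs[String]("name")).collect()

// Map[String, Any] has no built-in encoder, so one is defined explicitly with Kryo.
implicit val mapEncoder: Encoder[Map[String, Any]] = Encoders.kryo[Map[String, Any]]
val maps = teenagersDF.map(row => row.getValuesMap[Any](List("name", "age"))).collect()

println(maps.mkString(", "))
```

A Kryo encoder stores each value as a single binary column, so the resulting Dataset loses its column structure; that trade-off is usually acceptable only when the result is collected or passed on as opaque objects.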
Method 2 (not recommended): As a simpler alternative, a Dataset can also be converted to an RDD, so it is enough to change the earlier dataframe.map call to dataframe.rdd.map.
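A minimal sketch of Method 2, assuming a local SparkSession and the same sample data used in the filter section. Because `rdd.map` works on an RDD[Row], no Encoder is involved; the name `pairs` is illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("rdd-map").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("b", 2), ("b", 3), ("c", 1)).toDF("id", "num")

// df.rdd is an RDD[Row]; RDD.map needs no Encoder, so any result type is accepted.
val pairs = df.rdd
  .map(row => (row.getString(0), row.getInt(1) + 1))
  .collect()

pairs.foreach(println)
```

One plausible reason this route is not recommended is that the result is a plain RDD, so later steps run outside the Dataset API and its query optimizer.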