Adding a column to a DataFrame in Spark with a UDF


In Spark, the usual way to add a column to a DataFrame is withColumn.

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Create a DataFrame
val sparkconf = new SparkConf()
  .setMaster("local")
  .setAppName("test")
val spark = SparkSession.builder().config(sparkconf).getOrCreate()
val tempDataFrame = spark.createDataFrame(Seq(
  (1, "asf"),
  (2, "2143"),
  (3, "rfds")
)).toDF("id", "content")
// Add a column computed from the existing id column
val addColDataframe = tempDataFrame.withColumn("col", tempDataFrame("id") * 0)
addColDataframe.show(10, false)

The output is:

+---+-------+---+
|id |content|col|
+---+-------+---+
|1  |asf    |0  |
|2  |2143   |0  |
|3  |rfds   |0  |
+---+-------+---+

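As the example shows, this use of withColumn depends on the structure of the original DataFrame: the new column is computed from the numeric id column. Without such a column, this approach becomes much less flexible. Suppose the original DataFrame looks like this: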

+---+-------+
| id|content|
+---+-------+
|  a|    asf|
|  b|   2143|
|  c|   rfds|
+---+-------+

In this case, you can write a custom function, wrap it with udf, and use it to add the column:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Create a DataFrame
val sparkconf = new SparkConf()
  .setMaster("local")
  .setAppName("test")
val spark = SparkSession.builder().config(sparkconf).getOrCreate()
val tempDataFrame = spark.createDataFrame(Seq(
  ("a", "asf"),
  ("b", "2143"),
  ("c", "rfds")
)).toDF("id", "content")
// The function to wrap as a UDF: returns 1 when the argument is a String.
// The parameter is already typed as String, so this is 1 for every non-null value.
val code = (arg: String) => {
  if (arg.getClass.getName == "java.lang.String") 1 else 0
}

val addCol = udf(code)
// Add a column
val addColDataframe = tempDataFrame.withColumn("col", addCol(tempDataFrame("id")))
addColDataframe.show(10, false)

The result:

+---+-------+---+
|id |content|col|
+---+-------+---+
|a  |asf    |1  |
|b  |2143   |1  |
|c  |rfds   |1  |
+---+-------+---+
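
The function above calls arg.getClass, which would throw a NullPointerException if the id column ever held a null. Below is a minimal sketch of a null-tolerant variant, assuming that null or empty ids should map to 0. The names NullSafeUdfExample and hasValue, and the row with a null id, are made up for this illustration and are not part of the original example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object NullSafeUdfExample {
  // Hypothetical helper: 1 for a non-null, non-empty string, 0 otherwise.
  // Option(arg) turns a null into None, so no method is ever called on null.
  val hasValue: String => Int = (arg: String) =>
    if (Option(arg).exists(_.nonEmpty)) 1 else 0

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
    import spark.implicits._

    // Same shape as the DataFrame above, with one null id added (an assumption).
    val df = Seq[(String, String)](
      ("a", "asf"),
      ("b", "2143"),
      (null, "rfds")
    ).toDF("id", "content")

    val hasValueUdf = udf(hasValue)
    // Rows with a non-empty id get 1; the row with the null id gets 0.
    df.withColumn("col", hasValueUdf(col("id"))).show(10, false)

    spark.stop()
  }
}
```

Keeping the logic in a plain Scala function, separate from the udf wrapper, also makes it possible to unit-test the rule without starting a SparkSession.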

The function can also contain more logic:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Create a DataFrame
val sparkconf = new SparkConf()
  .setMaster("local")
  .setAppName("test")
val spark = SparkSession.builder().config(sparkconf).getOrCreate()
val tempDataFrame = spark.createDataFrame(Seq(
  (1, "asf"),
  (2, "2143"),
  (3, "rfds")
)).toDF("id", "content")

// Label each id as "little" (below 2) or "big"
val code: Int => String = (arg: Int) => if (arg < 2) "little" else "big"
val addCol = udf(code)
val addColDataframe = tempDataFrame.withColumn("col", addCol(tempDataFrame("id")))
addColDataframe.show(10, false)

The output is:

+---+-------+------+
|id |content|col   |
+---+-------+------+
|1  |asf    |little|
|2  |2143   |big   |
|3  |rfds   |big   |
+---+-------+------+
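
For a simple threshold like this one, Spark's built-in when/otherwise column functions may be an alternative to a UDF, since they keep the rule visible to the query optimizer. The sketch below is only an illustration of that option, not the approach used in this article; WhenOtherwiseExample and sizeLabel are hypothetical names. It should match the UDF for non-null ids, but a null id behaves differently: the comparison evaluates to null, so this version falls through to "big".

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object WhenOtherwiseExample {
  // The "little"/"big" rule written as a Column expression instead of a UDF.
  def sizeLabel(id: Column): Column =
    when(id < 2, "little").otherwise("big")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
    import spark.implicits._

    // Same data as the example above.
    val df = Seq((1, "asf"), (2, "2143"), (3, "rfds")).toDF("id", "content")

    df.withColumn("col", sizeLabel(col("id"))).show(10, false)

    spark.stop()
  }
}
```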

Passing multiple arguments:

import scala.util.Try
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val sparkconf = new SparkConf()
  .setMaster("local")
  .setAppName("test")
val spark = SparkSession.builder().config(sparkconf).getOrCreate()
val tempDataFrame = spark.createDataFrame(Seq(
  ("1", "2"),
  ("2", "3"),
  ("3", "1")
)).toDF("content1", "content2")

// Compare the two values as integers; return "error" if either cannot be parsed
val code = (arg1: String, arg2: String) => {
  Try(if (arg1.toInt > arg2.toInt) "arg1>arg2" else "arg1<=arg2").getOrElse("error")
}
val compareUdf = udf(code)

val addColDataframe = tempDataFrame.withColumn("compare", compareUdf(tempDataFrame("content1"), tempDataFrame("content2")))
addColDataframe.show(10, false)

The output is:

+--------+--------+----------+
|content1|content2|compare   |
+--------+--------+----------+
|1       |2       |arg1<=arg2|
|2       |3       |arg1<=arg2|
|3       |1       |arg1>arg2 |
+--------+--------+----------+
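
The same two-argument function can also be made callable from SQL expressions by registering it with spark.udf.register. The following is a rough sketch under that assumption; the registered name compare_udf, the view name pairs, and the object name RegisterUdfExample are chosen for this illustration only.

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession

object RegisterUdfExample {
  // Same comparison logic as above: parse both strings as integers,
  // and fall back to "error" if either one is not a valid integer.
  val compare: (String, String) => String = (arg1, arg2) =>
    Try(if (arg1.toInt > arg2.toInt) "arg1>arg2" else "arg1<=arg2").getOrElse("error")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("test").getOrCreate()
    import spark.implicits._

    val df = Seq(("1", "2"), ("2", "3"), ("3", "1")).toDF("content1", "content2")

    // Register the function under a SQL-visible name (hypothetical name).
    spark.udf.register("compare_udf", compare)

    // Use it in a SQL expression string on the DataFrame ...
    df.selectExpr("content1", "content2", "compare_udf(content1, content2) AS compare")
      .show(10, false)

    // ... or in a full SQL query against a temporary view.
    df.createOrReplaceTempView("pairs")
    spark.sql("SELECT content1, content2, compare_udf(content1, content2) AS compare FROM pairs")
      .show(10, false)

    spark.stop()
  }
}
```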

