Spark Operators: Basic RDD Transformations (1) – map, flatMap, distinct

Keywords: Spark operators, Spark RDD basic transformations, map, flatMap, distinct

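map applies the given function to every element of an RDD, turning each element into exactly one new element.

Input and output partitions correspond one to one: the resulting RDD has exactly as many partitions as the input RDD.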

hadoop fs -cat /tmp/lxw1234/1.txt
hello world
hello spark
hello hive
// Read the HDFS file into an RDD
scala> var data = sc.textFile("/tmp/lxw1234/1.txt")
data: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at :21
// Apply the map operator
scala> var mapresult = data.map(line => line.split("\\s+"))
mapresult: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at :23
// Trigger the computation and collect the result of map
scala> mapresult.collect
res0: Array[Array[String]] = Array(Array(hello, world), Array(hello, spark), Array(hello, hive))
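
The one-to-one partition correspondence of map can be checked directly. The following is a minimal sketch, not part of the original session: it assumes a local Spark installation, replaces the HDFS file with an in-memory collection, and the object name MapPartitionCount and the choice of 3 partitions are purely illustrative.

```scala
import org.apache.spark.sql.SparkSession

object MapPartitionCount {
  // Returns (input partition count, output partition count) for a map
  // over a small in-memory RDD.
  def partitionCounts(): (Int, Int) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("MapPartitionCount")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      // Assumption: an in-memory stand-in for /tmp/lxw1234/1.txt, split into 3 partitions
      val data = sc.parallelize(Seq("hello world", "hello spark", "hello hive"), numSlices = 3)
      val mapped = data.map(line => line.split("\\s+"))
      (data.getNumPartitions, mapped.getNumPartitions)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (in, out) = partitionCounts()
    println(s"input partitions = $in, output partitions = $out")
  }
}
```

Both counts should come out equal, because map is applied partition by partition and never moves data between partitions.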

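flatMap is also a transformation. Its first step is the same as map: the function is applied to every element. It then flattens the collections returned by the function, so that all of their elements end up as elements of a single RDD. The partitions themselves are not merged; as with map, the output RDD keeps the same number of partitions as the input.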

// Apply the flatMap operator
scala> var flatmapresult = data.flatMap(line => line.split("\\s+"))
flatmapresult: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at flatMap at :23
// Trigger the computation and collect the result of flatMap
scala> flatmapresult.collect
res1: Array[String] = Array(hello, world, hello, spark, hello, hive)
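
One way to picture flatMap is as a map followed by a flatten. The sketch below uses plain Scala collections rather than an RDD (an assumption made so that it runs without a cluster; RDDs do not offer a flatten method, but the semantics of flatMap are analogous). The object name FlatMapAsMapThenFlatten is illustrative only.

```scala
object FlatMapAsMapThenFlatten {
  val lines: Seq[String] = Seq("hello world", "hello spark", "hello hive")

  // Step 1: same as map, one collection of words per line
  val mapped: Seq[List[String]] = lines.map(line => line.split("\\s+").toList)

  // Step 2: flatten the per-line collections into one sequence of words
  val flattened: Seq[String] = mapped.flatten

  // flatMap performs both steps in a single call
  val direct: Seq[String] = lines.flatMap(line => line.split("\\s+"))

  def main(args: Array[String]): Unit = {
    println(flattened.mkString(", "))
    println(direct.mkString(", "))
  }
}
```

Both sequences contain the same six words in the same order as res1 above.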

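A caveat when using flatMap:
if the function returns a String, flatMap treats that string as a sequence of characters and flattens it into individual Chars.
Consider the following example (its output includes a fourth line, "hi spark", which is absent from the file listing at the top; the file was presumably extended before these commands were run):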

scala> data.map(_.toUpperCase).collect
res32: Array[String] = Array(HELLO WORLD, HELLO SPARK, HELLO HIVE, HI SPARK)
scala> data.flatMap(_.toUpperCase).collect
res33: Array[Char] = Array(H, E, L, L, O, , W, O, R, L, D, H, E, L, L, O, , S, P, A, R, K, H, E, L, L, O, , H, I, V, E, H, I, , S, P, A, R, K)

Now compare:

scala> data.map(x => x.split("\\s+")).collect
res34: Array[Array[String]] = Array(Array(hello, world), Array(hello, spark), Array(hello, hive), Array(hi, spark))
scala> data.flatMap(x => x.split("\\s+")).collect
res35: Array[String] = Array(hello, world, hello, spark, hello, hive, hi, spark)

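This time the result is what one would expect: the strings in the final result were not broken up into characters.
The reason is that the function passed in now returns an Array[String], not a String.
flatMap flattens only one level of whatever the function returns: a String is flattened into its Chars, whereas an Array[String] is flattened into its Strings, which are not broken down any further.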
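
The effect of the function's return type can be reproduced with ordinary Scala collections, where flatMap behaves analogously. This is a small sketch under that assumption; the object name FlatMapReturnTypes and the sample data are illustrative, and wrapping the result in Seq(...) is just one possible way to keep a whole string as a single element.

```scala
object FlatMapReturnTypes {
  val lines: Seq[String] = Seq("hello world", "hi spark")

  // The function returns a String, so the result is flattened into Chars
  val chars: Seq[Char] = lines.flatMap(line => line.toUpperCase)

  // The function returns an Array[String], so the result is flattened into Strings
  val words: Seq[String] = lines.flatMap(line => line.split("\\s+"))

  // Wrapping the String in a one-element Seq keeps it as a single element
  val wholeLines: Seq[String] = lines.flatMap(line => Seq(line.toUpperCase))

  def main(args: Array[String]): Unit = {
    println(chars)
    println(words)
    println(wholeLines)
  }
}
```

If each line should remain a single element, map is usually the simpler choice; the Seq(...) wrapper is mainly useful when the function sometimes needs to return zero or several elements.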

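distinct removes duplicate elements from an RDD. The order of the elements in the result is not guaranteed, as the output below shows.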

scala> data.flatMap(line => line.split("\\s+")).collect
res61: Array[String] = Array(hello, world, hello, spark, hello, hive, hi, spark)
scala> data.flatMap(line => line.split("\\s+")).distinct.collect
res62: Array[String] = Array(hive, hello, world, spark, hi)
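
Because the order of the output is not guaranteed, it is safer to compare the result of distinct as a set. The sketch below also shows one way the same deduplication could be written by hand with a key-based reduce; this is an illustration under stated assumptions, not a claim about how any particular Spark version implements distinct. It assumes a local Spark installation, and the object name DistinctSketch and the in-memory input are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object DistinctSketch {
  // Returns (words via distinct, words via a hand-written equivalent),
  // both as Sets because the output order is not guaranteed.
  def run(): (Set[String], Set[String]) = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("DistinctSketch")
      .getOrCreate()
    try {
      val sc = spark.sparkContext
      // Assumption: in-memory stand-in for the four-line input file
      val words = sc
        .parallelize(Seq("hello world", "hello spark", "hello hive", "hi spark"))
        .flatMap(line => line.split("\\s+"))

      val viaDistinct = words.distinct().collect().toSet

      // Hand-written equivalent: key by the word, keep one value per key, drop the value
      val viaReduceByKey = words
        .map(word => (word, 1))
        .reduceByKey((first, _) => first)
        .map(_._1)
        .collect()
        .toSet

      (viaDistinct, viaReduceByKey)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (a, b) = run()
    println(a)
    println(b)
  }
}
```

Both approaches group equal elements together, which requires a shuffle, so distinct is considerably more expensive than map or flatMap on large data sets.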
