Spark RDDs, and Converting Datasets and DataFrames to RDDs



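I. What Is an RDD

        An RDD (resilient distributed dataset) is a fault-tolerant collection of elements that can be operated on in parallel. What does operating in parallel mean? Take an Array of four elements: 1, 2, 3, 4. To double every element, a plain Java implementation typically loops over the array and multiplies each element by 2, so the elements are processed one after another. In Spark, the array can instead be converted into an RDD, whose elements are spread across partitions and processed concurrently.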
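The contrast described above can be sketched as follows. This is a minimal illustration, not code from the original article: the object name DoubleElementsDemo, its two helper methods, and the "local[2]" master setting are assumptions made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DoubleElementsDemo {
  // Sequential version: visit the elements one after another.
  def doubleSequentially(data: Array[Int]): Array[Int] = {
    val result = new Array[Int](data.length)
    var i = 0
    while (i < data.length) {
      result(i) = data(i) * 2
      i += 1
    }
    result
  }

  // RDD version: Spark runs the function on each partition in its own task.
  def doubleWithRdd(sc: SparkContext, data: Array[Int]): Array[Int] =
    sc.parallelize(data).map(_ * 2).collect()

  def main(args: Array[String]): Unit = {
    // "local[2]" (two local worker threads) is an assumption for this sketch.
    val conf = new SparkConf().setAppName("double-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val data = Array(1, 2, 3, 4)
      println(doubleSequentially(data).mkString(","))
      println(doubleWithRdd(sc, data).mkString(","))
    } finally {
      sc.stop()
    }
  }
}
```

Both calls return the same doubled values; the difference lies in how the work is scheduled, which only pays off once the data is large enough to be worth distributing.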

 

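II. Creating RDDs

        Spark provides two ways to create an RDD.

First run the

spark-shell

command to open the Scala shell.

We use the Spark that ships with HDP; you can also install Apache Spark yourself.

1. Parallelizing an existing collection

        For example, convert an Array into an RDD: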

val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

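Running the two statements above in the shell creates the RDD distData.

The parallelize function accepts an optional second parameter, the number of partitions of the RDD, for example: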

scala> val distDataP = sc.parallelize(data,3)
distDataP: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:26

scala>

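This converts the array into an RDD with 3 partitions. The exact split is decided by Spark; for this five-element array it comes out as {1}, {2,3}, {4,5}, which can be checked with distDataP.glom().collect().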
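To see how the elements actually land in partitions, glom() turns every partition into an array that can be collected and printed. The sketch below is illustrative only: the object name PartitionLayoutDemo, the helper partitionLayout, and the "local[2]" master are assumptions, not part of the original article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionLayoutDemo {
  // Returns the elements of each partition as a separate array.
  def partitionLayout(sc: SparkContext, data: Seq[Int], numSlices: Int): Array[Array[Int]] =
    sc.parallelize(data, numSlices).glom().collect()

  def main(args: Array[String]): Unit = {
    // "local[2]" is an assumption for this sketch.
    val conf = new SparkConf().setAppName("partition-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val layout = partitionLayout(sc, Seq(1, 2, 3, 4, 5), 3)
      layout.zipWithIndex.foreach { case (part, idx) =>
        val rendered = part.mkString("{", ",", "}")
        println(s"partition $idx: $rendered")
      }
    } finally {
      sc.stop()
    }
  }
}
```

Since glom() followed by collect() pulls every partition to the driver, it is best kept for small datasets or for debugging.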

 

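2. Referencing a dataset in an external storage system

        External storage includes shared file systems, HDFS, HBase, and any other data source that offers a Hadoop file format. For example, we read the file itemdata.data from the HDFS directory /input/mahout-demo/ and turn its contents into an RDD.

Run the following in the shell: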

scala> val fileRDD = sc.textFile("hdfs://192.168.189.21:8020/input/mahout-demo/itemdata.data")
fileRDD: org.apache.spark.rdd.RDD[String] = hdfs://192.168.189.21:8020/input/mahout-demo/itemdata.data MapPartitionsRDD[3] at textFile at <console>:24

scala> fileRDD.collect
res1: Array[String] = Array(0162381440670851711,4,7.0, 0162381440670851711,11,4.0, 0162381440670851711,32,1.0, 0162381440670851711,176,27.0, 0162381440670851711,183,11.0, 0162381440670851711,184,5.0, 0162381440670851711,207,9.0, 0162381440670851711,256,3.0, 0162381440670851711,258,4.0, 0162381440670851711,259,16.0, 0162381440670851711,260,8.0, 0162381440670851711,261,18.0, 0162381440670851711,301,1.0, 0162381440670851711,307,1.0, 0162381440670851711,477,1.0, 0162381440670851711,518,1.0, 0162381440670851711,549,3.0, 0162381440670851711,570,1.0, 0162381440670851711,826,2.0, 0357211441096952115,207,1.0, 0617721441096186493,184,1.0, 0617721441096186493,207,1.0, 1205421441071459451,5,1.0, 1214361441096861254,207,1.0, 1401731441095483081,258,1.0, 1401731441095483081,814,4.0, 14017314410954830...

 

collect triggers the computation and returns the file contents to the driver.

 

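3. Creating RDDs from an IDE (IntelliJ IDEA)

        The examples above create RDDs from the Scala shell that ships with Spark. Both approaches also work from a development tool such as IDEA.

(1) Add the Spark dependencies to pom.xml

        We use Scala 2.11, Spark 2.3.0, and Hadoop 2.7.3, all of which are stable, widely used versions: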

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.leboop</groupId>
    <artifactId>mahout</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <!-- Scala binary version (used as the artifact suffix) -->
        <scala.version>2.11</scala.version>
        <!-- Spark version -->
        <spark.version>2.3.0</spark.version>
        <!-- Hadoop version -->
        <hadoop.version>2.7.3</hadoop.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- Spark SQL, needed by the SparkSession (DataFrame and Dataset) examples -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
    </dependencies>
</project>

 

(2) Program

package com.leboop.rdd

import org.apache.spark.{SparkConf, SparkContext}

/**
  * RDD creation demo
  */
object RDDDemo {
  def main(args: Array[String]): Unit = {
    // Spark configuration
    val sparkConf = new SparkConf().setAppName("rdd-demo").setMaster("local")
    // Create the SparkContext
    val sc = new SparkContext(sparkConf)
    // Source array
    val data = Array(1, 2, 3, 4, 5)
    // Turn the array into an RDD with 4 partitions
    val rdd = sc.parallelize(data, 4)
    // Read a file from HDFS into an RDD of lines
    val fileRDD = sc.textFile("hdfs://192.168.189.21:8020/input/mahout-demo/itemdata.data")
    // Print the first 3 elements of each RDD
    rdd.take(3).foreach(println)
    fileRDD.take(3).foreach(println)
    // Release Spark resources
    sc.stop()
  }
}

 

Run the program. With the log output removed, the result is:

1
2
3
0162381440670851711,4,7.0
0162381440670851711,11,4.0
0162381440670851711,32,1.0

Process finished with exit code 0

 

 

4. Converting a DataFrame to an RDD

package com.leboop.rdd

import org.apache.spark.sql.SparkSession

/**
  * DataFrame-to-RDD demo
  */
object RDDDemo {
  def main(args: Array[String]): Unit = {
    // Create the Spark SQL entry point (for plain RDDs the entry point is SparkContext)
    val spark = SparkSession.builder().appName("spark-sql-demo").master("local").getOrCreate()
    // Read the file as CSV into a DataFrame
    val dataDF = spark.read.csv("hdfs://192.168.189.21:8020/input/mahout-demo/itemdata.data")
    // .rdd yields an RDD[Row]
    val rdd = dataDF.rdd
    rdd.take(3).foreach(println)
    // Release Spark resources
    spark.stop()
  }
}

Program output:

[0162381440670851711,4,7.0]
[0162381440670851711,11,4.0]
[0162381440670851711,32,1.0]

 

5. Converting a Dataset to an RDD

package com.leboop.rdd

import org.apache.spark.sql.SparkSession

/**
  * Dataset-to-RDD demo
  */
object RDDDemo {
  def main(args: Array[String]): Unit = {
    // Create the Spark SQL entry point (for plain RDDs the entry point is SparkContext)
    val spark = SparkSession.builder().appName("spark-sql-demo").master("local").getOrCreate()
    // Read the file as a Dataset[String], one element per line
    val dataDS = spark.read.textFile("hdfs://192.168.189.21:8020/input/mahout-demo/itemdata.data")
    // .rdd yields an RDD[String]
    val rdd = dataDS.rdd
    rdd.take(3).foreach(println)
    // Release Spark resources
    spark.stop()
  }
}

 

Program output:

0162381440670851711,4,7.0
0162381440670851711,11,4.0
0162381440670851711,32,1.0
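The two outputs differ because a DataFrame converts to an RDD of Row objects, while a Dataset[String] converts to an RDD of plain strings. The sketch below makes the element types explicit. It is an illustration under assumptions: it uses three in-memory sample lines instead of the HDFS file, and the object name ToRddDemo, its helpers, and the "local[2]" master are invented for the example. It needs the spark-sql artifact on the classpath.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

object ToRddDemo {
  // In-memory stand-in for the first lines of itemdata.data.
  val sampleLines: Seq[String] = Seq(
    "0162381440670851711,4,7.0",
    "0162381440670851711,11,4.0",
    "0162381440670851711,32,1.0"
  )

  // Dataset[String] -> RDD[String]: each element is still one raw line.
  def datasetToRdd(spark: SparkSession, lines: Seq[String]): RDD[String] = {
    import spark.implicits._
    spark.createDataset(lines).rdd
  }

  // DataFrame -> RDD[Row]: each element is a Row with one field per CSV column.
  def dataFrameToRdd(spark: SparkSession, lines: Seq[String]): RDD[Row] = {
    import spark.implicits._
    spark.read.csv(spark.createDataset(lines)).rdd
  }

  def main(args: Array[String]): Unit = {
    // "local[2]" is an assumption for this sketch.
    val spark = SparkSession.builder().appName("to-rdd-demo").master("local[2]").getOrCreate()
    try {
      datasetToRdd(spark, sampleLines).collect().foreach(println)
      // Without a schema every CSV column is read as a string.
      dataFrameToRdd(spark, sampleLines)
        .map(row => row.getString(0) + " -> " + row.getString(2))
        .collect()
        .foreach(println)
    } finally {
      spark.stop()
    }
  }
}
```

Working with Row fields by position is brittle; when the column layout is known, converting the DataFrame to a typed Dataset before calling .rdd is usually easier to maintain.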

 

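III. RDD Operators

RDDs support two types of operators: transformations and actions.

1. Transformations

The most typical transformation is map. It applies a function to every element of an RDD and returns a new RDD made up of the results.

For example, distData.map(x => x * 2) doubles every element of the distData RDD created above.

Transformations are lazy: they are not computed immediately. Spark only records the transformation and runs it when an action is applied to the result, for example collect.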
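One way to observe this laziness is to count how often the function passed to map is actually called. The sketch below uses a LongAccumulator for that purpose; the object name LazyMapDemo, the helper countMapCalls, and the "local[2]" master are assumptions for the example. Accumulators updated inside transformations can be applied more than once if a task is re-executed, so the counts are only reliable in a simple run without retries.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyMapDemo {
  // Returns (calls observed before the action, calls observed after the action).
  def countMapCalls(sc: SparkContext, data: Seq[Int]): (Long, Long) = {
    val calls = sc.longAccumulator("map calls")
    // Declaring the transformation does not run the function yet.
    val doubled = sc.parallelize(data).map { x =>
      calls.add(1)
      x * 2
    }
    val before = calls.sum
    // collect is an action, so it triggers the map computation.
    doubled.collect()
    val after = calls.sum
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    // "local[2]" is an assumption for this sketch.
    val conf = new SparkConf().setAppName("lazy-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (before, after) = countMapCalls(sc, Seq(1, 2, 3, 4))
      println(s"map calls before action: $before, after action: $after")
    } finally {
      sc.stop()
    }
  }
}
```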

 

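2. Actions

        The most typical action is reduce. The fileRDD created above was read from a file in HDFS, and each of its elements holds one line of that file. We use reduce to compute the total length of the file contents in two steps:

(1) Compute the length of each line

        Use the map transformation to turn fileRDD into a new RDD called lineLengths, whose elements are the lengths of the file's lines. Because map is a transformation, it does not trigger any computation yet.

(2) Sum the line lengths to get the total length of the file contents

        Use the reduce action to add up the elements of the lineLengths RDD and obtain the final result.

The result is 2722 (line terminators are not counted, because textFile strips them).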
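The two steps can be sketched as follows. To keep the example self-contained it runs on three in-memory lines taken from the output shown earlier rather than on the HDFS file, so it does not reproduce the 2722 figure. The object name TotalLengthDemo, the helper totalLength, and the "local[2]" master are assumptions for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TotalLengthDemo {
  // Step 1: map each line to its length. Step 2: reduce the lengths to a sum.
  // reduce throws on an empty RDD, so callers should pass at least one line.
  def totalLength(sc: SparkContext, lines: Seq[String]): Int = {
    val linesRdd = sc.parallelize(lines)      // stands in for fileRDD
    val lineLengths = linesRdd.map(_.length)  // transformation, not computed yet
    lineLengths.reduce(_ + _)                 // action, triggers the computation
  }

  def main(args: Array[String]): Unit = {
    // "local[2]" is an assumption for this sketch.
    val conf = new SparkConf().setAppName("length-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val lines = Seq(
        "0162381440670851711,4,7.0",
        "0162381440670851711,11,4.0",
        "0162381440670851711,32,1.0"
      )
      // The three lines are 25, 26 and 26 characters long.
      println(totalLength(sc, lines)) // prints 77
    } finally {
      sc.stop()
    }
  }
}
```

The function passed to reduce must be commutative and associative, which integer addition is, so the partial sums from different partitions can be combined in any order.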

 

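IV. RDD Caching

        The total-length computation above has an efficiency problem. Because map is lazy, every call to reduce recomputes the map, which is very slow when the data runs to hundreds of millions of records. Can the result of map be kept so that later reduce calls do not recompute it? Yes: the persist or cache method keeps the computed result in memory. For example:

lineLengths.persist()

 

We call persist before the first reduce. The first reduce triggers the map computation and its result is stored in memory; the second reduce does not trigger map again but works directly on the stored result.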
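The effect of persist can be made visible by counting how often the map function runs across two reduce calls. This is a hedged sketch: the object name PersistDemo, the helper mapCallsForTwoReduces, and the "local[2]" master are assumptions, and the counts assume a small dataset that fits in memory and a run without task retries.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PersistDemo {
  // Runs reduce twice on the mapped RDD and returns how many times
  // the function passed to map was called in total.
  def mapCallsForTwoReduces(sc: SparkContext, lines: Seq[String], usePersist: Boolean): Long = {
    val calls = sc.longAccumulator("map calls")
    val lineLengths = sc.parallelize(lines).map { line =>
      calls.add(1)
      line.length
    }
    // persist() only marks the RDD; the data is stored when the first action runs.
    if (usePersist) lineLengths.persist()
    lineLengths.reduce(_ + _) // first action: computes map (and stores it if persisted)
    lineLengths.reduce(_ + _) // second action: reuses the stored result if persisted
    val total = calls.sum
    if (usePersist) lineLengths.unpersist()
    total
  }

  def main(args: Array[String]): Unit = {
    // "local[2]" is an assumption for this sketch.
    val conf = new SparkConf().setAppName("persist-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val lines = Seq("a,1", "bb,2", "ccc,3")
      println("without persist: " + mapCallsForTwoReduces(sc, lines, usePersist = false))
      println("with persist:    " + mapCallsForTwoReduces(sc, lines, usePersist = true))
    } finally {
      sc.stop()
    }
  }
}
```

cache() is shorthand for persist() with the default in-memory storage level; persist also accepts an explicit StorageLevel when memory alone is not enough.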

 

V. List of RDD Operators

1. Transformations

The following table lists some of the common transformations supported by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details.

Transformation | Meaning
map(func) | Return a new distributed dataset formed by passing each element of the source through a function func.
filter(func) | Return a new dataset formed by selecting those elements of the source on which func returns true.
flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
mapPartitions(func) | Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
mapPartitionsWithIndex(func) | Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.
sample(withReplacement, fraction, seed) | Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.
union(otherDataset) | Return a new dataset that contains the union of the elements in the source dataset and the argument.
intersection(otherDataset) | Return a new RDD that contains the intersection of elements in the source dataset and the argument.
distinct([numPartitions]) | Return a new dataset that contains the distinct elements of the source dataset.
groupByKey([numPartitions]) | When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance. Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numPartitions argument to set a different number of tasks.
reduceByKey(func, [numPartitions]) | When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
aggregateByKey(zeroValue)(seqOp, combOp, [numPartitions]) | When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
sortByKey([ascending], [numPartitions]) | When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.
join(otherDataset, [numPartitions]) | When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
cogroup(otherDataset, [numPartitions]) | When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
cartesian(otherDataset) | When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements).
pipe(command, [envVars]) | Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.
coalesce(numPartitions) | Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
repartition(numPartitions) | Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
repartitionAndSortWithinPartitions(partitioner) | Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.
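As an illustration of how several of these transformations chain together, the sketch below parses comma-separated lines, keeps the well-formed ones with filter, builds key-value pairs with map, sums the values per key with reduceByKey, and orders the result with sortByKey. Treating the first field as the key and the third as a numeric value is an assumption about the sample data, and the object name PairTransformDemo, the helper sumByFirstField, and the "local[2]" master are invented for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PairTransformDemo {
  // Sums the third field per distinct first field, sorted by key.
  def sumByFirstField(sc: SparkContext, lines: Seq[String]): Array[(String, Double)] =
    sc.parallelize(lines)
      .map(_.split(","))                              // split each line into fields
      .filter(_.length == 3)                          // drop malformed lines
      .map(fields => (fields(0), fields(2).toDouble)) // build (key, value) pairs
      .reduceByKey(_ + _)                             // aggregate the values per key
      .sortByKey()                                    // ascending order by key
      .collect()

  def main(args: Array[String]): Unit = {
    // "local[2]" is an assumption for this sketch.
    val conf = new SparkConf().setAppName("pair-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val lines = Seq(
        "0162381440670851711,4,7.0",
        "0162381440670851711,11,4.0",
        "0357211441096952115,207,1.0",
        "0617721441096186493,184,1.0",
        "0617721441096186493,207,1.0"
      )
      sumByFirstField(sc, lines).foreach { case (key, total) =>
        println(key + " -> " + total)
      }
    } finally {
      sc.stop()
    }
  }
}
```

reduceByKey is used here instead of groupByKey followed by a sum because, as the table notes, it performs much better for aggregations.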

2. Actions

The following table lists some of the common actions supported by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair RDD functions doc (Scala, Java) for details.

Action | Meaning
reduce(func) | Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
collect() | Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.
count() | Return the number of elements in the dataset.
first() | Return the first element of the dataset (similar to take(1)).
take(n) | Return an array with the first n elements of the dataset.
takeSample(withReplacement, num, [seed]) | Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.
takeOrdered(n, [ordering]) | Return the first n elements of the RDD using either their natural order or a custom comparator.
saveAsTextFile(path) | Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.
saveAsSequenceFile(path) (Java and Scala) | Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).
saveAsObjectFile(path) (Java and Scala) | Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().
countByKey() | Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.
foreach(func) | Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems. Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details.
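A few of the actions above, applied to one small RDD, might look like the sketch below. The object name ActionsDemo, the Summary case class, the helper summarize, and the "local[2]" master are assumptions for the example. Each action runs its own Spark job, so on a large RDD it would be worth persisting the data first.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ActionsDemo {
  // Results of several actions on the same RDD.
  case class Summary(count: Long, first: Int, firstTwo: List[Int], smallestTwo: List[Int], sum: Int)

  // Expects a non-empty sequence: first and reduce fail on an empty RDD.
  def summarize(sc: SparkContext, data: Seq[Int]): Summary = {
    val rdd = sc.parallelize(data, 2)
    Summary(
      count = rdd.count(),                       // number of elements
      first = rdd.first(),                       // first element in partition order
      firstTwo = rdd.take(2).toList,             // first two elements in partition order
      smallestTwo = rdd.takeOrdered(2).toList,   // two smallest by natural ordering
      sum = rdd.reduce(_ + _)                    // commutative, associative aggregation
    )
  }

  def main(args: Array[String]): Unit = {
    // "local[2]" is an assumption for this sketch.
    val conf = new SparkConf().setAppName("actions-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(summarize(sc, Seq(5, 3, 1, 4, 2)))
    } finally {
      sc.stop()
    }
  }
}
```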

