Converting between RDD, DataFrame, and Dataset

This article is a repost. 2018-01-23 09:19

Conversions:

RDD, DataFrame, and Dataset share many traits, but each has scenarios it suits best, so you often need to convert between the three.


DataFrame/Dataset to RDD:

This conversion is simple:

    
val rdd1 = testDF.rdd  // RDD[Row]
val rdd2 = testDS.rdd  // RDD of the Dataset's element type
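As a minimal sketch of what these two calls return: a DataFrame gives an `RDD[Row]`, while a Dataset gives an RDD of its own element type. The object name `ToRddExample`, the `convert` helper, and the sample data are assumptions made for illustration; `Coltest` mirrors the case class used further down in this article.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

// Hypothetical row type, mirroring the Coltest case class used in this article.
case class Coltest(col1: String, col2: Int)

object ToRddExample {
  // Builds a tiny Dataset, converts it both ways, and returns what each RDD holds.
  def convert(spark: SparkSession): (Array[String], Array[Int]) = {
    import spark.implicits._

    val testDS = Seq(Coltest("a", 1), Coltest("b", 2)).toDS()
    val testDF = testDS.toDF()

    // A DataFrame is a Dataset[Row], so its RDD holds untyped Row objects.
    val rdd1: RDD[Row] = testDF.rdd
    // A Dataset[Coltest] hands back an RDD of the case class itself.
    val rdd2: RDD[Coltest] = testDS.rdd

    // Row fields are looked up by name with an explicit type.
    val names = rdd1.map(row => row.getAs[String]("col1")).collect()
    // Case class fields are accessed directly and checked at compile time.
    val numbers = rdd2.map(_.col2).collect()
    (names, numbers)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ToRddExample")
      .master("local[2]")
      .getOrCreate()
    val (names, numbers) = convert(spark)
    names.foreach(println)
    numbers.foreach(println)
    spark.stop()
  }
}
```

The practical difference is in how fields are read afterwards: `Row` access is checked only at runtime, whereas the case class RDD keeps compile-time types.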



RDD to DataFrame:

    
import spark.implicits._
val testDF = rdd.map { line =>
  (line._1, line._2)
}.toDF("col1", "col2")

Typically you pack the data for one row into a tuple and then name the columns in toDF.
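The tuple route can be sketched end to end as follows. The second method shows an alternative that is not part of the original text: `createDataFrame` with an explicit `StructType`, which may be handy when column names or types are only known at runtime. The object name, method names, and sample data are assumptions for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object RddToDataFrameExample {
  // Route 1 (the one described above): tuples plus toDF with column names.
  def viaToDF(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2)))
    rdd.map(line => (line._1, line._2)).toDF("col1", "col2")
  }

  // Route 2 (an alternative, not from the original text): an RDD[Row]
  // combined with a schema that is built programmatically.
  def viaSchema(spark: SparkSession): DataFrame = {
    val rowRdd = spark.sparkContext.parallelize(Seq(Row("a", 1), Row("b", 2)))
    val schema = StructType(Seq(
      StructField("col1", StringType, nullable = true),
      StructField("col2", IntegerType, nullable = false)
    ))
    spark.createDataFrame(rowRdd, schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddToDataFrameExample")
      .master("local[2]")
      .getOrCreate()
    viaToDF(spark).printSchema()
    viaSchema(spark).printSchema()
    spark.stop()
  }
}
```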


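RDD to Dataset:

import spark.implicits._
case class Coltest(col1: String, col2: Int) extends Serializable // define the field names and types
val testDS = rdd.map { line =>
  Coltest(line._1, line._2)
}.toDS

Note that the case class defining each row's type already provides the field names and types, so afterwards you only need to fill the case class with values.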
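A self-contained sketch of this conversion is below. The object name, the `build` helper, and the sample data are assumptions for illustration. One detail worth calling out: the case class is declared at the top level rather than inside the method, because case classes declared inside a method typically cannot get an implicit encoder.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Declared outside any method so that Spark can derive an encoder for it.
case class Coltest(col1: String, col2: Int)

object RddToDatasetExample {
  // Maps each tuple in the RDD to a Coltest and converts the result to a Dataset.
  def build(spark: SparkSession): Dataset[Coltest] = {
    import spark.implicits._
    val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2)))
    rdd.map(line => Coltest(line._1, line._2)).toDS()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddToDatasetExample")
      .master("local[2]")
      .getOrCreate()
    val testDS = build(spark)
    // The schema (col1: string, col2: int) comes from the case class.
    testDS.printSchema()
    // Typed operations can use the fields directly.
    testDS.filter(_.col2 > 1).show()
    spark.stop()
  }
}
```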



Dataset to DataFrame:

This one is also simple, because it only wraps each case class instance in a Row:

    
import spark.implicits._
val testDF = testDS.toDF
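To illustrate what "wrapping in a Row" means in practice, here is a small sketch (object name, helper, and data are assumptions for illustration). The column names carry over from the case class fields, but each element is now a generic `Row`, so values are read by name with an explicit type.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical row type, mirroring the Coltest case class used in this article.
case class Coltest(col1: String, col2: Int)

object DatasetToDataFrameExample {
  // Converts a small Dataset[Coltest] to a DataFrame.
  def convert(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val testDS = Seq(Coltest("a", 1), Coltest("b", 2)).toDS()
    testDS.toDF()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetToDataFrameExample")
      .master("local[2]")
      .getOrCreate()
    val testDF = convert(spark)
    // Column names are inherited from the case class fields.
    testDF.printSchema()
    // Each element is a Row; the field type must be stated when reading it.
    testDF.collect().foreach(row => println(row.getAs[Int]("col2")))
    spark.stop()
  }
}
```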



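DataFrame to Dataset:

    
import spark.implicits._
case class Coltest(col1: String, col2: Int) extends Serializable // define the field names and types
val testDS = testDF.as[Coltest]

With this approach you declare the type of each column in a case class and then call as to get a Dataset. This is extremely convenient when the data arrives as a DataFrame but you need to process individual fields.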
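Two details of `as` are easy to trip over, so here is a hedged sketch of both (object name, helper, and data are assumptions for illustration). First, columns are matched to case class fields by name, so the DataFrame's column order does not need to match. Second, the column types must be compatible: a `bigint` column would normally be rejected for an `Int` field at analysis time, so the sketch casts it explicitly first.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.types.IntegerType

// Hypothetical row type, mirroring the Coltest case class used in this article.
case class Coltest(col1: String, col2: Int)

object DataFrameToDatasetExample {
  def convert(spark: SparkSession): Dataset[Coltest] = {
    import spark.implicits._

    // Columns are deliberately in a different order than the case class
    // fields, and col2 starts out as a Long (bigint).
    val testDF = Seq((10L, "a"), (20L, "b")).toDF("col2", "col1")

    testDF
      // Narrow bigint to int explicitly; Spark does not do this implicitly.
      .withColumn("col2", $"col2".cast(IntegerType))
      // Fields are bound by column name, not by position.
      .as[Coltest]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameToDatasetExample")
      .master("local[2]")
      .getOrCreate()
    convert(spark).collect().foreach(println)
    spark.stop()
  }
}
```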
Important:

When using certain special operations, such as toDF and toDS, you must add import spark.implicits._ first, otherwise they are not available.
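One point that often causes confusion: `spark` in that import is not a package, it is the name of your `SparkSession` value, so the import has to follow whatever the variable is called. A minimal sketch, where the session is deliberately named `session` (the object name, helper, and data are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession

object ImplicitsExample {
  // Returns the names of the customers located in CA.
  def namesInCa(session: SparkSession): Array[String] = {
    // The import is taken from the SparkSession instance, so it uses the
    // variable's name: here session.implicits._, not spark.implicits._
    import session.implicits._

    // toDF on a local Seq, the $"..." column syntax, and the encoder needed
    // by as[String] all come from the import above.
    val df = Seq(("Acme Widgets", "CA"), ("Widget Co", "AZ")).toDF("name", "state")
    df.filter($"state" === "CA").select($"name").as[String].collect()
  }

  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder()
      .appName("ImplicitsExample")
      .master("local[2]")
      .getOrCreate()
    namesInCa(session).foreach(println)
    session.stop()
  }
}
```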

package dataframe

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

//
// Explore interoperability between DataFrame and Dataset. Note that Dataset
// is covered in much greater detail in the 'dataset' directory.
//
object DatasetConversion {

  case class Cust(id: Integer, name: String, sales: Double, discount: Double, state: String)

  case class StateSales(state: String, sales: Double)

  def main(args: Array[String]): Unit = {
    val spark =
      SparkSession.builder()
        .appName("DataFrame-DatasetConversion")
        .master("local[4]")
        .getOrCreate()

    import spark.implicits._

    // create a sequence of case class objects
    // (we defined the case class above)
    val custs = Seq(
      Cust(1, "Widget Co", 120000.00, 0.00, "AZ"),
      Cust(2, "Acme Widgets", 410500.00, 500.00, "CA"),
      Cust(3, "Widgetry", 410500.00, 200.00, "CA"),
      Cust(4, "Widgets R Us", 410500.00, 0.0, "CA"),
      Cust(5, "Ye Olde Widgete", 500.00, 0.0, "MA")
    )

    // Create the DataFrame without passing through an RDD
    val customerDF : DataFrame = spark.createDataFrame(custs)
    // println("*** DataFrame schema")
    // customerDF.printSchema()
    // println("*** DataFrame contents")
    // customerDF.show()

    // +---+---------------+--------+--------+-----+
    // | id|           name|   sales|discount|state|
    // +---+---------------+--------+--------+-----+
    // |  1|      Widget Co|120000.0|     0.0|   AZ|
    // |  2|   Acme Widgets|410500.0|   500.0|   CA|
    // |  3|       Widgetry|410500.0|   200.0|   CA|
    // |  4|   Widgets R Us|410500.0|     0.0|   CA|
    // |  5|Ye Olde Widgete|   500.0|     0.0|   MA|
    // +---+---------------+--------+--------+-----+

    // println("*** Select and filter the DataFrame")
    val smallerDF =
      customerDF.select("sales", "state").filter($"state".equalTo("CA"))

    // smallerDF.show()

    // +--------+-----+
    // |   sales|state|
    // +--------+-----+
    // |410500.0|   CA|
    // |410500.0|   CA|
    // |410500.0|   CA|
    // +--------+-----+
    ///////////////////////////////////////////////////////////////////////////////////

    // Convert it to a Dataset by specifying the type of the rows -- use a case
    // class because we have one and it's most convenient to work with. Notice
    // you have to choose a case class that matches the remaining columns.
    // BUT also notice that the columns keep their order from the DataFrame --
    // later you'll see a Dataset[StateSales] of the same type where the
    // columns have the opposite order, because of the way it was created.

    val customerDS : Dataset[StateSales] = smallerDF.as[StateSales]

    // println("*** Dataset schema")
    // customerDS.printSchema()
    // println("*** Dataset contents")
    // customerDS.show()
    // Select and other operations can be performed directly on a Dataset too,
    // but be careful to read the documentation for Dataset -- there are
    // "typed transformations", which produce a Dataset, and
    // "untyped transformations", which produce a DataFrame. In particular,
    // you need to project using a TypedColumn to get a Dataset.

    // val verySmallDS : Dataset[Double] = customerDS.select($"sales".as[Double])
    // println("*** Dataset after projecting one column")
    // verySmallDS.show()

    // +--------+
    // |   sales|
    // +--------+
    // |410500.0|
    // |410500.0|
    // |410500.0|
    // +--------+
    // If you select multiple columns on a Dataset you end up with a Dataset
    // of tuple type, but the columns keep their names.
    val tupleDS : Dataset[(String, Double)] =
      customerDS.select($"state".as[String], $"sales".as[Double])

    // println("*** Dataset after projecting two columns -- tuple version")
    // tupleDS.show()

    // +-----+--------+
    // |state|   sales|
    // +-----+--------+
    // |   CA|410500.0|
    // |   CA|410500.0|
    // |   CA|410500.0|
    // +-----+--------+
    // You can also cast back to a Dataset of a case class. Notice this time
    // the columns have the opposite order from the last Dataset[StateSales]
    // val betterDS: Dataset[StateSales] = tupleDS.as[StateSales]
    // println("*** Dataset after projecting two columns -- case class version")
    // betterDS.show()

    // +-----+--------+
    // |state|   sales|
    // +-----+--------+
    // |   CA|410500.0|
    // |   CA|410500.0|
    // |   CA|410500.0|
    // +-----+--------+
    // Converting back to a DataFrame without making other changes is really easy
    // val backToDataFrame : DataFrame = tupleDS.toDF()
    // println("*** This time as a DataFrame")
    // backToDataFrame.show()

    // +-----+--------+
    // |state|   sales|
    // +-----+--------+
    // |   CA|410500.0|
    // |   CA|410500.0|
    // |   CA|410500.0|
    // +-----+--------+
    // While converting back to a DataFrame you can rename the columns
    val renamedDataFrame : DataFrame = tupleDS.toDF("MyState", "MySales")

    println("*** Again as a DataFrame but with renamed columns")

    renamedDataFrame.show()

    // +-------+--------+
    // |MyState| MySales|
    // +-------+--------+
    // |     CA|410500.0|
    // |     CA|410500.0|
    // |     CA|410500.0|
    // +-------+--------+

  }
}
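The comments in the program above distinguish typed transformations, which keep you in a Dataset, from untyped ones, which hand back a DataFrame. The difference shows up in the result types, as this reduced sketch tries to make visible. `StateSales` matches the case class above; the object name, the `project` helper, and the two sample rows are assumptions for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class StateSales(state: String, sales: Double)

object TypedVsUntypedExample {
  // Projects the sales column in three ways and returns all three results.
  def project(spark: SparkSession): (DataFrame, Dataset[Double], Dataset[Double]) = {
    import spark.implicits._
    val ds = Seq(StateSales("CA", 410500.0), StateSales("AZ", 120000.0)).toDS()

    // Untyped: selecting by column name yields a DataFrame (Dataset[Row]).
    val untyped: DataFrame = ds.select("sales")

    // Typed: selecting a TypedColumn yields a Dataset of the column's type.
    val typedColumn: Dataset[Double] = ds.select($"sales".as[Double])

    // Typed: mapping with a Scala function also yields a Dataset.
    val mapped: Dataset[Double] = ds.map(_.sales)

    (untyped, typedColumn, mapped)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TypedVsUntypedExample")
      .master("local[2]")
      .getOrCreate()
    val (untyped, typedColumn, mapped) = project(spark)
    untyped.printSchema()
    typedColumn.printSchema()
    mapped.printSchema()
    spark.stop()
  }
}
```

The explicit type annotations are there on purpose: if the wrong kind of transformation is used, the mismatch is reported by the compiler rather than at runtime.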
