Spark Hands-On Exercise 01: XML Data Processing

Reposted article, 2018-03-07 13:03. Category: Spark in Practice

I. Requirements

Extract the account-number and model values from the XML and store them in the format account-number:model.

1. XML file data format

<activations>
  <activation timestamp="1225499258" type="phone">
    <account-number>316</account-number>
    <device-id>
      d61b6971-33e1-42f0-bb15-aa2ae3cd8680
    </device-id>
    <phone-number>5108307062</phone-number>
    <model>iFruit 1</model>
  </activation>
  …
</activations>

2. Output format:

1234:iFruit 1
987:Sorrento F00L
4566:iFruit 1
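
To make the target format concrete, here is a minimal sketch in plain Scala (no Spark) that turns the sample record from section 1 into one output line. The object name `ActivationFormatDemo` and the helpers `toAccountModel` and `formatAll` are hypothetical names chosen for illustration, and the `trim` calls are an addition that guards against stray whitespace inside elements. It assumes the scala-xml module is on the classpath.

```scala
import scala.xml.{Node, XML}

object ActivationFormatDemo {
  // Hypothetical helper: format one <activation> node as "account-number:model"
  def toAccountModel(activation: Node): String = {
    val account = (activation \ "account-number").text.trim
    val model = (activation \ "model").text.trim
    s"$account:$model"
  }

  // Parse a whole document and format every <activation> record it contains
  def formatAll(xmlString: String): Seq[String] =
    (XML.loadString(xmlString) \\ "activation").map(toAccountModel)

  def main(args: Array[String]): Unit = {
    // The sample record from section 1
    val sample =
      """<activations>
        |  <activation timestamp="1225499258" type="phone">
        |    <account-number>316</account-number>
        |    <device-id>d61b6971-33e1-42f0-bb15-aa2ae3cd8680</device-id>
        |    <phone-number>5108307062</phone-number>
        |    <model>iFruit 1</model>
        |  </activation>
        |</activations>""".stripMargin

    formatAll(sample).foreach(println) // prints 316:iFruit 1
  }
}
```

Here `\` selects direct children of a node, while `\\` searches all descendants, which is why `\\ "activation"` finds every record regardless of nesting depth.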

II. Code

import scala.xml._

// Given a string containing XML, parse it and return an iterator over
// the activation records (XML nodes) it contains
def getactivations(xmlstring: String): Iterator[Node] = {
  val nodes = XML.loadString(xmlstring) \\ "activation"
  nodes.toIterator
}

// Given an activation record (XML node), return the model name
def getmodel(activation: Node): String = {
  (activation \ "model").text
}

// Given an activation record (XML node), return the account number
def getaccount(activation: Node): String = {
  (activation \ "account-number").text
}

// wholeTextFiles creates an RDD of (file path, file content) pairs
// mydata1: (_1 = path, _2 = content)
val mydata1 = sc.wholeTextFiles("file:/home/training/training_materials/data/activations/")

// flatMap parses each file's content and flattens the results
// into a single RDD of activation nodes
val mydata2 = mydata1.flatMap(line => getactivations(line._2))

// Use the helper functions to read each node's values and build
// an RDD of account-number:model strings
val mydata3 = mydata2.map(line => getaccount(line) + ":" + getmodel(line))

// Print a sample of the data to check the output format
mydata3.take(10).foreach(println)
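
The requirements ask for the result to be stored, while the listing above only prints ten records. A possible way to write the full result is `saveAsTextFile`, sketched below as a standalone application. The object name `SaveActivations`, the helper `toAccountModel`, the `local[*]` master setting, and the output directory `file:/tmp/account-models` are all assumptions made for illustration; only the input path comes from the article.

```scala
import org.apache.spark.sql.SparkSession
import scala.xml.{Node, XML}

object SaveActivations {
  // Hypothetical helper: format one <activation> node as "account-number:model"
  def toAccountModel(activation: Node): String =
    (activation \ "account-number").text + ":" + (activation \ "model").text

  def main(args: Array[String]): Unit = {
    // Assumption: run locally; on a cluster the master is usually set by spark-submit
    val spark = SparkSession.builder()
      .appName("SaveActivations")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val input = "file:/home/training/training_materials/data/activations/"
    val output = "file:/tmp/account-models" // hypothetical output directory

    sc.wholeTextFiles(input)
      // one (path, content) pair per file; parse the content into activation nodes
      .flatMap { case (_, content) => (XML.loadString(content) \\ "activation").iterator }
      // one "account-number:model" line per activation
      .map(toAccountModel)
      // writes part-* files under the output directory
      .saveAsTextFile(output)

    spark.stop()
  }
}
```

Note that `saveAsTextFile` creates a directory of part files rather than a single file, and it fails if the output directory already exists.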

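III. Function notes

sc.wholeTextFiles(directory)

Reads a directory of text files from HDFS, from a local file system (the path must be available on every node), or from any Hadoop-supported file system URI. Each file is read as a single record and returned as a key-value pair, where the key is the file's path and the value is the file's content.

For example: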

  (file1.json,{"firstName":"Fred","lastName":"Flintstone","userid":"123"} )
  (file2.json,{"firstName":"Barney","lastName":"Rubble","userid":"234"} )
  (file3.json,... )
  (file4.json,... )
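
The pair structure can be explored with a small local experiment, sketched below. Everything in it is an assumption for illustration: the object name `WholeTextFilesDemo`, the helper `fileName`, the `local[*]` master, and the two tiny sample files written to a temporary directory. One detail worth knowing is that the key is the file's full path as a URI string (for example a `file:/...` path), not just the short name shown in the listing above, so the sketch includes a helper that strips the directory part.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object WholeTextFilesDemo {
  // Hypothetical helper: keep only the last path segment of a key
  def fileName(path: String): String =
    path.substring(path.lastIndexOf('/') + 1)

  def main(args: Array[String]): Unit = {
    // Assumption: run locally for experimentation
    val spark = SparkSession.builder()
      .appName("WholeTextFilesDemo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical sample data: two small files in a temporary directory
    val dir = Files.createTempDirectory("whole-text-files-demo")
    Files.write(dir.resolve("file1.json"),
      """{"userid":"123"}""".getBytes(StandardCharsets.UTF_8))
    Files.write(dir.resolve("file2.json"),
      """{"userid":"234"}""".getBytes(StandardCharsets.UTF_8))

    // One record per file: key = full path, value = entire file content
    val files = sc.wholeTextFiles(dir.toUri.toString)

    files.collect().foreach { case (path, content) =>
      println(s"${fileName(path)} (${content.length} characters) from $path")
    }

    spark.stop()
  }
}
```

Because each file becomes a single record, this method suits many small files such as the XML activation files in this exercise; a very large file would be loaded into memory as one value.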
