Spark Hands-On Exercise 01: XML Data Processing

Reposted article. Originally published 2018-03-07 13:03. Category: Spark in Practice

I. Requirements

Extract the account-number and model values from the XML and store them in account-number:model format.

1. XML file data format

<activations>
  <activation timestamp="1225499258" type="phone">
    <account-number>316</account-number>
    <device-id>d61b6971-33e1-42f0-bb15-aa2ae3cd8680</device-id>
    <phone-number>5108307062</phone-number>
    <model>iFruit 1</model>
  </activation>
  …
</activations>

2. Output format:

1234:iFruit 1
987:Sorrento F00L
4566:iFruit 1
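
To make the target format concrete, here is a minimal sketch that converts the single sample record shown above into one output line, without Spark. It assumes the scala-xml library is on the classpath; the names FormatSketch, sample and toLines are hypothetical and exist only for this illustration.

```scala
import scala.xml.XML

object FormatSketch {
  // Sample input: the one record from the data-format example above.
  val sample: String =
    """<activations>
      |  <activation timestamp="1225499258" type="phone">
      |    <account-number>316</account-number>
      |    <device-id>d61b6971-33e1-42f0-bb15-aa2ae3cd8680</device-id>
      |    <phone-number>5108307062</phone-number>
      |    <model>iFruit 1</model>
      |  </activation>
      |</activations>""".stripMargin

  // Turn every <activation> element into an "account-number:model" line.
  def toLines(xmlString: String): Seq[String] =
    (XML.loadString(xmlString) \\ "activation").toList.map { a =>
      (a \ "account-number").text.trim + ":" + (a \ "model").text.trim
    }

  def main(args: Array[String]): Unit =
    toLines(sample).foreach(println) // prints 316:iFruit 1
}
```

The `\\` operator searches all descendants, so it finds every activation element under the root, while `\` looks only at direct children of a single record.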

II. Code

import scala.xml._

// Given a string containing XML, parse it and return an iterator over the activation records (nodes) it contains
def getactivations(xmlstring: String): Iterator[Node] = {
  val nodes = XML.loadString(xmlstring) \\ "activation"
  nodes.toIterator
}

// Given an activation record (XML node), return the model name
def getmodel(activation: Node): String = {
  (activation \ "model").text
}

// Given an activation record (XML node), return the account number
def getaccount(activation: Node): String = {
  (activation \ "account-number").text
}

// wholeTextFiles creates an RDD of (file path, file content) pairs
val mydata1 = sc.wholeTextFiles("file:/home/training/training_materials/data/activations/")

// flatMap over the file contents to get an RDD of activation nodes
val mydata2 = mydata1.flatMap(line => getactivations(line._2))

// Use the helper functions to read each node's values and build an RDD of account-number:model strings
val mydata3 = mydata2.map(line => getaccount(line) + ":" + getmodel(line))

// Print a few records to check the output format
mydata3.take(10).foreach(println)
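
The code above only prints a sample, while the requirement asks for the result to be stored. One possible way to do that is sketched below as a standalone application instead of a spark-shell session. The object name SaveActivations, the local master setting, and the two command-line paths are assumptions made for this sketch and are not part of the original exercise.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.xml.XML

object SaveActivations {
  // Parse one whole XML file and return its "account-number:model" lines.
  def toLines(content: String): List[String] =
    (XML.loadString(content) \\ "activation").toList.map { a =>
      (a \ "account-number").text.trim + ":" + (a \ "model").text.trim
    }

  def main(args: Array[String]): Unit = {
    // Assumption: input and output directories are passed as arguments.
    require(args.length == 2, "usage: SaveActivations <inputDir> <outputDir>")
    val Array(inputDir, outputDir) = args

    // Assumption: run locally; on a cluster the master is usually set by spark-submit.
    val conf = new SparkConf().setAppName("SaveActivations").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      sc.wholeTextFiles(inputDir)
        .flatMap { case (_, content) => toLines(content) }
        .saveAsTextFile(outputDir) // the job fails if outputDir already exists
    } finally {
      sc.stop()
    }
  }
}
```

saveAsTextFile writes a directory of part files, one per partition, with one account-number:model string per line.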

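III. Function Explanation

sc.wholeTextFiles(directory)

Reads a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned as a key-value pair, where the key is the path of the file and the value is its content.

For example: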

  (file1.json,{"firstName":"Fred","lastName":"Flintstone","userid":"123"} )
  (file2.json,{"firstName":"Barney","lastName":"Rubble","userid":"234"} )
  (file3.json,... )
  (file4.json,... )
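
The per-file behaviour matters for XML because a document usually spans several lines, and sc.textFile would hand each line to the parser as a separate record. The sketch below contrasts the two calls on a throwaway directory. The object name WholeTextFilesSketch, the file names a.txt and b.txt, and the local master setting are hypothetical choices for this illustration.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object WholeTextFilesSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical setup: a temporary directory holding two small text files.
    val dir = Files.createTempDirectory("whole-text-files-demo")
    Files.write(dir.resolve("a.txt"), "line 1\nline 2\n".getBytes(StandardCharsets.UTF_8))
    Files.write(dir.resolve("b.txt"), "only line\n".getBytes(StandardCharsets.UTF_8))

    // Assumption: run locally for the demonstration.
    val conf = new SparkConf().setAppName("WholeTextFilesSketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val uri = dir.toUri.toString

      // wholeTextFiles: one record per file, as a (path, full content) pair.
      val files = sc.wholeTextFiles(uri)
      // textFile: one record per line, across all files in the directory.
      val lines = sc.textFile(uri)

      println(s"wholeTextFiles records: ${files.count()}")
      println(s"textFile records: ${lines.count()}")
      files.collect().foreach { case (path, content) =>
        println(s"$path -> ${content.length} characters")
      }
    } finally {
      sc.stop()
    }
  }
}
```

Because each file is loaded as a single string, wholeTextFiles is best suited to many small files; a very large file would have to fit in the memory of one executor.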
