# The principle is simple: first use flatMap to flatten the RDD into individual words, then use map to turn each word into a (k, 1) pair, then use groupByKey to merge the values for each key, which effectively deduplicates the keys, and finally use keys() to extract the deduplicated keys.
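To make each stage concrete, here is a minimal sketch on a tiny in-memory RDD (the dataset and variable names here are hypothetical, for illustration only; the real data and code follow below). It assumes a SparkContext named sc has already been created:

words = sc.parallelize(["hello", "hello", "world"])
pairs = words.map(lambda w: (w, 1))   # [("hello", 1), ("hello", 1), ("world", 1)]
grouped = pairs.groupByKey()          # one entry per distinct key: [("hello", <iterable>), ("world", <iterable>)]
unique = grouped.keys()               # ["hello", "world"]
print(unique.collect())               # order may vary: ['hello', 'world']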
Test data: delcp.txt
hello
hello
world
world
h
h
h
g
g
g
hello
world
world
h
h
h
g
g
g
from pyspark import SparkContext

sc = SparkContext('local', 'delcp')
# Read the input file; each RDD element is one line of text
rdd = sc.textFile("file:///usr/local/spark/mycode/TestPackage/delcp.txt")
# Split each line into words, map each word to (word, 1),
# group by key to collapse duplicates, then keep only the keys
delp = rdd.flatMap(lambda line: line.split(" ")) \
          .map(lambda a: (a, 1)) \
          .groupByKey() \
          .keys()
delp.foreach(print)
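With the test data above, the expected output is the four distinct words hello, world, h, and g (the order may vary, since foreach runs on the partitions without any ordering guarantee).

For comparison, the same result can also be obtained with the RDD's built-in distinct() transformation, which avoids building the intermediate (k, 1) pairs by hand. A minimal alternative sketch, reusing the rdd variable from the code above:

unique = rdd.flatMap(lambda line: line.split(" ")).distinct()
unique.foreach(print)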