Step 0: Install Java (the JDK).
Step 1: Download Spark and Hadoop (the example below uses the spark-2.0.1-bin-hadoop2.7 prebuilt package).
Step 2: Set the environment variables to point at the unpacked hadoop and spark directories:
Then add them to the system Path variable as well:
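To confirm the variables are actually visible from Python, a quick sanity check like the one below can help. SPARK_HOME is the variable used later in this post; HADOOP_HOME is my assumed name for the hadoop variable, so adjust it if yours differs:

import os

# Sanity check: print the Spark/Hadoop environment variables and any matching Path entries.
# SPARK_HOME is used later in this post; HADOOP_HOME is an assumed name - adjust as needed.
for name in ("SPARK_HOME", "HADOOP_HOME"):
    print(name, "=", os.environ.get(name, "<not set>"))

path_entries = os.environ.get("PATH", "").split(os.pathsep)
print([p for p in path_entries if "spark" in p.lower() or "hadoop" in p.lower()])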
Step 3: Install py4j with pip: pip install py4j
If pip is not installed yet, install it first.
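A quick way to confirm the install worked is to try importing py4j from Python (just a sanity check of my own, not part of the official steps):

# Sanity check: py4j is the bridge PySpark uses to talk to the JVM.
try:
    import py4j
    print("py4j is installed")
except ImportError:
    print("py4j is missing - run: pip install py4j")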
Example: wordcount.py
When I ran the wordcount example I got a KeyError on SPARK_HOME, so I set a temporary environment variable with os.environ. Annoyingly, it turned out that simply restarting the PyCharm IDE was enough for it to pick up the new environment variables.
from pyspark import SparkContext
import os

# Temporary workaround for the SPARK_HOME KeyError (adjust the path to your install)
os.environ["SPARK_HOME"] = r"H:\Spark\spark-2.0.1-bin-hadoop2.7"

sc = SparkContext('local')

# Two "documents", each represented as a list of words
doc = sc.parallelize([['a', 'b', 'c'], ['b', 'd', 'd']])

# Build a vocabulary: each distinct word gets an integer id, then broadcast it
words = doc.flatMap(lambda d: d).distinct().collect()
word_dict = {w: i for w, i in zip(words, range(len(words)))}
word_dict_b = sc.broadcast(word_dict)

def word_count_per_doc(d):
    # Count word occurrences within a single document, keyed by word id
    dict_tmp = {}
    wd = word_dict_b.value
    for w in d:
        dict_tmp[wd[w]] = dict_tmp.get(wd[w], 0) + 1
    return dict_tmp

print(doc.map(word_count_per_doc).collect())
print("successful!")
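For comparison, here is a minimal sketch of the more common "total counts across all documents" word count using flatMap plus reduceByKey. It reuses the sc and doc defined above and is my own variation, not part of the original example:

# Classic word count: total occurrences of each word across all documents,
# using flatMap + reduceByKey instead of the per-document broadcast approach above.
counts = (doc.flatMap(lambda d: d)             # flatten the documents into one stream of words
             .map(lambda w: (w, 1))            # pair each word with an initial count of 1
             .reduceByKey(lambda a, b: a + b)  # sum the counts per word
             .collect())
print(counts)  # e.g. [('a', 1), ('b', 2), ('c', 1), ('d', 2)] (order may vary)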