win7 + spark + hive + python integration
Accessing hive from win7 through spark's pyspark
1. Install the spark package
2. Copy the mysql driver (the JDBC jar spark needs to reach the hive metastore, e.g. into spark's jars directory)
3. Copy the hadoop configuration directory into spark's conf directory
4. Copy the hadoop and hive configuration files into conf (see the note below)
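In practice the files that matter are usually core-site.xml and hdfs-site.xml (so spark can locate the defaultFS), yarn-site.xml (so it can locate the resourcemanager), and hive-site.xml (so it can locate the hive metastore).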
5.1 In the pyspark launch script, add the HADOOP_CONF_DIR environment variable, pointing at the hadoop configuration directory:
set HADOOP_CONF_DIR=D:\myprogram\spark-2.1.0-bin-hadoop2.7\conf\ha
5.2 The following also needs to be configured:
set HADOOP_CONF_DIR=D:\myprogram\spark-2.1.0-bin-hadoop2.7\conf\ha
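If editing the launch scripts is awkward, the same variable can also be set from Python itself before the SparkContext starts; a minimal sketch, with a hypothetical placeholder path (substitute the directory from step 5.1):

import os

# hypothetical placeholder; use the hadoop config directory from step 5.1
os.environ["HADOOP_CONF_DIR"] = r"D:\path\to\hadoop\conf"

from pyspark import SparkConf, SparkContext

# the variable must be set before the SparkContext (and its JVM) is created,
# because spark reads HADOOP_CONF_DIR when building the yarn client config
sc = SparkContext(conf=SparkConf().setMaster("yarn").setAppName("win7-yarn-test"))
print(sc.master)  # should print: yarn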
6. Open up the hdfs directory permissions (the windows login name will typically not match the hdfs owner of /user):
[centos@s101 ~]$ hdfs dfs -chmod -R 777 /user
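chmod 777 on /user is a quick test-cluster workaround for the mismatch between the windows login name and the hdfs file owner; a cleaner alternative would presumably be to create a matching /user/<name> home directory, or to set HADOOP_USER_NAME to the hdfs user on the windows side.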
7. On win7, start the pyspark shell from spark's bin directory and connect to yarn:
pyspark --master yarn
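If the connection succeeds, the shell starts with sc (a SparkContext) and spark (a SparkSession) already defined; the test below relies on both.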
8. Test:
>>> rdd1 = sc.textFile("/user/centos/myspark/wc")
>>> rdd1.flatMap(lambda e:e.split(" ")).map(lambda e:(e,1)).reduceByKey(lambda a,b:a+b).collect()
[(u'9', 3), (u'1', 2), (u'3', 3), (u'5', 4), (u'7', 3), (u'0', 2), (u'8', 3), (u'2', 3), (u'4', 3), (u'6', 4)]
>>> for i in rdd1.flatMap(lambda e:e.split(" ")).map(lambda e:(e,1)).reduceByKey(lambda a,b:a+b).collect(): print i
...
(u'1', 2)
(u'9', 3)
(u'3', 3)
(u'5', 4)
(u'7', 3)
(u'0', 2)
(u'8', 3)
(u'2', 3)
(u'4', 3)
(u'6', 4)
>>> spark.sql("show databases").show()
+------------+
|databaseName|
+------------+
|     default|
|          lx|
|        udtf|
+------------+
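Note the u'...' prefixes and the print statement: this is a Python 2 interpreter. Under Python 3 the loop body would be print(i).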
Developing pyspark programs in IDEA (prerequisite: the steps above are complete)
1. Create a java or scala module
2. Open Project Structure (next to Settings), click Modules on the left, select myspark, right-click and choose Add > Python support
Click Python and specify the interpreter
3. Specify the environment variables in the run configuration
1. Open the run configuration settings
2. Add the variables there (e.g. HADOOP_CONF_DIR, as in step 5 of the setup above)
4. Import spark's python core library (the pyspark sources shipped under SPARK_HOME/python); a manual alternative is sketched below
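If you prefer not to attach the library through the IDE dialogs, the pyspark and py4j sources that ship with spark can be put on sys.path by hand; a minimal sketch, assuming the spark home from the setup above (the py4j zip name varies by release, hence the glob):

import glob
import os
import sys

spark_home = r"D:\myprogram\spark-2.1.0-bin-hadoop2.7"  # spark home from the setup above

# pyspark itself lives under SPARK_HOME/python,
# the bundled py4j zip under SPARK_HOME/python/lib
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip"))[0])

from pyspark import SparkContext  # should now import without any pip-installed spark packages

The findspark package (pip install findspark, then findspark.init()) automates the same sys.path manipulation if you would rather not hard-code the path.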
5. Test
Install the py4j bridge first: pip install py4j
#coding:utf-8
# wordcount on the same file used in the shell test above
from pyspark import SparkConf
from pyspark.context import SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("wordcount")
sc = SparkContext(conf=conf)

rdd1 = sc.textFile("/user/centos/myspark/wc")
rdd2 = rdd1.flatMap(lambda s: s.split(" ")).map(lambda s: (s, 1)).reduceByKey(lambda a, b: a + b)
lst = rdd2.collect()
for i in lst:
    print(i)

# sparksql: hive support is needed to see the hive databases
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("show databases").show()  # show() prints the table itself and returns None
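To run the same script on the cluster instead of locally, setMaster("local[*]") would be swapped for setMaster("yarn") (or omitted and supplied via spark-submit --master yarn), with HADOOP_CONF_DIR set as in step 5 of the setup.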