Spark Learning Notes (3): Using pyspark to Connect to Hive and Run SQL on Tables


Reference: connecting Spark to an external Hive deployment

To connect to an external, already-deployed Hive, a few steps are needed.

1) Copy or symlink Hive's hive-site.xml into the conf directory under the Spark installation.

2) Start the Spark shell together with the JDBC client for the Hive metastore database (i.e., the MySQL driver jar used to reach Hive's metastore):

$ bin/spark-shell  --jars mysql-connector-java-5.1.27-bin.jar

This walkthrough uses pyspark instead:

[root@hadoop02 spark]# bin/pyspark --jars /opt/module/hive/lib/mysql-connector-java-5.1.27-bin.jar

Testing interactive shell operations

With the setup done, the shell opens successfully:

[root@hadoop02 spark]# bin/pyspark --jars /opt/module/hive/lib/mysql-connector-java-5.1.27-bin.jar

Python 2.7.5 (default, Apr  2 2020, 13:16:51)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Warning: Ignoring non-spark config property: export=JAVA_HOME=/opt/module/jdk1.8.0_144
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/01/09 22:23:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/01/09 22:23:31 WARN metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.1
      /_/

Using Python version 2.7.5 (default, Apr  2 2020 13:16:51)
SparkSession available as 'spark'.
Test creating a table through Spark:
>>> from pyspark.sql import HiveContext,SparkSession
>>> hive_sql=HiveContext(spark)
>>> hive_sql.sql(''' create table test_youhua.test_pyspark_creat_tbl like test_youhua.youhua1 ''')
21/01/09 22:26:48 WARN metastore.HiveMetaStore: Location: hdfs://hadoop02:9000/user/hive/warehouse/test_youhua.db/test_pyspark_creat_tbl specified for non-external table:test_pyspark_creat_tbl
DataFrame[]
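
Worth noting: from Spark 2.0 onward HiveContext is deprecated, and the spark object the shell provides is already a Hive-enabled SparkSession, so the same statements can be issued through spark.sql directly. A minimal sketch (the tbl2 table name is made up for illustration):

# Spark 2.x style: no HiveContext needed, the shell's SparkSession runs Hive SQL directly.
spark.sql(''' show databases ''').show()
# Hypothetical table name, just to mirror the example above.
spark.sql(''' create table test_youhua.test_pyspark_creat_tbl2 like test_youhua.youhua1 ''')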

Querying Hive at this point confirms that the table test_youhua.test_pyspark_creat_tbl was indeed created through Spark:

[root@hadoop02 hive]# bin/hive
ls: cannot access /opt/module/spark/lib/spark-assembly-*.jar: No such file or directory

Logging initialized using configuration in jar:file:/opt/module/hive/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive> show databases;
OK
default
test_youhua
Time taken: 0.903 seconds, Fetched: 2 row(s)
hive> use test_youhua;
OK
Time taken: 0.038 seconds
hive> show tables;
OK
test_pyspark_creat_tbl
youhua1
Time taken: 0.028 seconds, Fetched: 2 row(s)
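
The same verification can also be run from the pyspark session instead of the hive CLI, for example:

# List the tables of the target database straight from pyspark.
spark.sql(''' show tables in test_youhua ''').show()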

 

To avoid adding the driver path by hand on every launch, it can be configured once in spark-defaults.conf:

spark.executor.extraClassPath   /opt/module/hive/lib/mysql-connector-java-5.1.27-bin.jar
spark.driver.extraClassPath   /opt/module/hive/lib/mysql-connector-java-5.1.27-bin.jar
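
Alternatively (an equivalent option, not from the original walkthrough), the single spark.jars property ships a jar to both the driver and executor classpaths:

spark.jars   /opt/module/hive/lib/mysql-connector-java-5.1.27-bin.jar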

Testing a spark-submit job

First, write the script:

[root@hadoop02 spark]# vim input/test_pyspark_hive.py

from pyspark.sql import SparkSession
from pyspark.sql import HiveContext

# Build a SparkSession with Hive support; enableHiveSupport() is what
# makes the session read hive-site.xml and talk to the Hive metastore.
spark = SparkSession.builder.master("local")\
    .appName('first_name')\
    .config('spark.executor.memory', '2g')\
    .config('spark.driver.memory', '2g')\
    .enableHiveSupport()\
    .getOrCreate()

hive_sql = HiveContext(spark)
# Create an empty table with the same schema as the source, then copy the rows over.
hive_sql.sql(''' create table test_youhua.test_spark_create_tbl1 like test_youhua.youhua1 ''')
hive_sql.sql(''' insert overwrite table test_youhua.test_spark_create_tbl1 select * from test_youhua.youhua1 ''')

Then submit it:

[root@hadoop02 spark]# spark-submit input/test_pyspark_hive.py

The job runs and the operation completes successfully.
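
To double-check the copy, a quick row-count comparison can be run from a pyspark session (a minimal sketch, assuming the shell's Hive-enabled spark session; not part of the original post):

# Compare row counts of the source table and the freshly written copy.
src = spark.sql(''' select count(*) as cnt from test_youhua.youhua1 ''').collect()[0].cnt
dst = spark.sql(''' select count(*) as cnt from test_youhua.test_spark_create_tbl1 ''').collect()[0].cnt
print("source rows: %d, copied rows: %d" % (src, dst))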

 

