Using SparkSQL: the Spark SQL CLI

Reposted article, 2014-09-13 16:41. Tags: Spark

Spark SQL CLI overview

The Spark SQL CLI makes it easier to query Hive directly from SparkSQL through the Hive metastore. In the current version, the Spark SQL CLI cannot yet interact with the ThriftServer.

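Before using the Spark SQL CLI, note the following:

1. Copy the hive-site.xml configuration file into the $SPARK_HOME/conf directory.

2. Add the JDBC driver jar to SPARK_CLASSPATH in $SPARK_HOME/conf/spark-env.sh (the MySQL connector in this example):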

export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/hadoop/software/mysql-connector-java-5.1.27-bin.jar
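The two prerequisites above can be checked before launching the CLI. The following is a minimal sketch, not part of Spark: the function name check_spark_sql_conf is hypothetical, and the check on spark-env.sh only looks for a SPARK_CLASSPATH assignment that mentions a .jar file; it does not verify that the jar exists or is the right driver.

```shell
# check_spark_sql_conf CONF_DIR  (hypothetical helper, not shipped with Spark)
# Reports whether the two prerequisites appear to be satisfied in CONF_DIR.
check_spark_sql_conf() {
    conf_dir="$1"
    status=0

    # Prerequisite 1: hive-site.xml has been copied into the conf directory.
    if [ -f "$conf_dir/hive-site.xml" ]; then
        echo "ok: hive-site.xml found"
    else
        echo "missing: hive-site.xml"
        status=1
    fi

    # Prerequisite 2: spark-env.sh puts some jar on SPARK_CLASSPATH.
    if grep -q 'SPARK_CLASSPATH=.*\.jar' "$conf_dir/spark-env.sh" 2>/dev/null; then
        echo "ok: SPARK_CLASSPATH lists a jar"
    else
        echo "missing: SPARK_CLASSPATH jar entry in spark-env.sh"
        status=1
    fi

    return "$status"
}
```

A typical call would be check_spark_sql_conf "$SPARK_HOME/conf"; the function returns a non-zero status if either item is missing.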

Spark SQL CLI command-line options:

cd $SPARK_HOME/bin
spark-sql --help

Usage: ./bin/spark-sql [options] [cli option]
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor.

  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 512M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --help, -h                  Show this help message and exit
  --verbose, -v               Print additional debug output

 Spark standalone with cluster deploy mode only:
  --driver-cores NUM          Cores for driver (Default: 1).
  --supervise                 If given, restarts the driver on failure.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 YARN-only:
  --executor-cores NUM        Number of cores per executor (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.

CLI options:
 -d,--define <key=value>          Variable subsitution to apply to hive
                                  commands. e.g. -d A=B or --define A=B
    --database <databasename>     Specify the database to use
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -h <hostname>                    connecting to Hive Server on remote host
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable subsitution to apply to hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -p <port>                        connecting to Hive Server on port number
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the console)
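A few of the CLI options listed above can be combined for non-interactive use. The sketch below only uses flags that appear in the help output (-e, -f, -S, --database). The run helper and the DRY_RUN variable are hypothetical additions that print each command instead of executing it, and queries.sql is a placeholder file name; the page_views table is the one used later in this article.

```shell
# Dry-run helper (hypothetical): prints the command unless DRY_RUN=0 is set.
# Printing flattens the quoting, so the output is for reading, not for pasting.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# -e: pass one quoted query string on the command line.
run spark-sql -e "SELECT count(*) FROM page_views"

# --database: choose the database to use for the query.
run spark-sql --database default -e "SELECT count(*) FROM page_views"

# -S -f: run the statements stored in a file, in silent mode.
# queries.sql is a placeholder file name.
run spark-sql -S -f queries.sql
```

Setting DRY_RUN=0 makes the helper execute the commands, which requires spark-sql to be on the PATH.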

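If no master is specified when starting spark-sql, it runs in local mode. The master can be set either to the address of a standalone cluster or to yarn.

When the master is set to yarn (spark-sql --master yarn), the whole job's execution can be monitored on the page at http://hadoop000:8088.

Note: if spark.master spark://hadoop000:7077 is configured in $SPARK_HOME/conf/spark-defaults.conf, then spark-sql runs on the standalone cluster even when no master is specified at startup.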
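The master selection described above can be sketched as a small lookup. This is a simplified illustration, not Spark's actual code: the function name effective_master is hypothetical, the rule that an explicit --master value takes priority over spark-defaults.conf is an assumption not stated in the article, and the sketch only handles whitespace-separated "spark.master VALUE" lines.

```shell
# effective_master CLI_MASTER DEFAULTS_FILE  (hypothetical helper)
# Simplified sketch of the lookup order; it ignores any other source
# Spark may consult (environment variables, for example).
effective_master() {
    cli_master="$1"
    defaults_file="$2"

    # 1. An explicit --master value is assumed to win.
    if [ -n "$cli_master" ]; then
        printf '%s\n' "$cli_master"
        return 0
    fi

    # 2. Otherwise use spark.master from spark-defaults.conf, if present.
    if [ -f "$defaults_file" ]; then
        conf_master=$(awk '$1 == "spark.master" { print $2; exit }' "$defaults_file")
        if [ -n "$conf_master" ]; then
            printf '%s\n' "$conf_master"
            return 0
        fi
    fi

    # 3. Fall back to local mode.
    printf '%s\n' "local"
}
```

For example, effective_master "" "$SPARK_HOME/conf/spark-defaults.conf" prints the configured master when one is set in the file, and local otherwise.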

Using spark-sql

Start spark-sql. Because spark.master spark://hadoop000:7077 is already configured in my spark-defaults.conf, I do not specify a master when starting spark-sql:

cd $SPARK_HOME/bin
spark-sql

SELECT track_time, url, session_id, referer, ip, end_user_id, city_id FROM page_views WHERE city_id = -1000 limit 10;

SELECT session_id, count(*) c FROM page_views group by session_id order by c desc limit 10;

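The table used by the two SQL statements above already exists in Hive here. If it does not exist in your environment, create it manually; the table creation and data loading scripts are as follows: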

create table page_views(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

load data local inpath '/home/spark/software/data/page_views.dat' overwrite into table page_views;
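Because the table is declared with tab-delimited fields and seven columns, it can help to check the data file before loading it. The following is a minimal sketch with a hypothetical function name, check_page_views_file; it only counts tab-separated fields per line and does not validate the values themselves.

```shell
# check_page_views_file FILE  (hypothetical helper)
# Prints every line whose number of tab-separated fields is not 7, the
# column count of page_views, and returns non-zero if any such line exists.
check_page_views_file() {
    awk -F '\t' '
        NF != 7 { bad++; printf "line %d: %d fields\n", NR, NF }
        END     { exit (bad > 0) }
    ' "$1"
}
```

A typical call would be check_page_views_file /home/spark/software/data/page_views.dat, run before the load data statement.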
