Hive on Spark: installation and a Scala example


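Hive installation and Hive on Spark:
1, By default Hive uses an embedded Derby database and creates its metadata in the current working directory (for example hive/bin)
2, Derby supports only a single user at a time; MySQL supports multiple concurrent users
Installing Hive:
1, Download apache-hive-1.2.2-bin.tar.gz
2, Edit $HIVE_HOME/conf/hive-env.sh and set HADOOP_HOME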
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set Hive and Hadoop environment variables here. These variables can be used
# to control the execution of Hive. It should be used by admins to configure
# the Hive installation (so that users do not have to set environment variables
# or set command line parameters to get correct behavior).
#
# The hive service being invoked (CLI etc.) is available via the environment
# variable SERVICE


# Hive Client memory usage can be an issue if a large number of clients
# are running at the same time. The flags below have been useful in 
# reducing memory usage:
#
# if [ "$SERVICE" = "cli" ]; then
#   if [ -z "$DEBUG" ]; then
#     export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:+UseParNewGC -XX:-UseGCOverheadLimit"
#   else
#     export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -Xms10m -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:-UseGCOverheadLimit"
#   fi
# fi

# The heap size of the jvm stared by hive shell script can be controlled via:
#
# export HADOOP_HEAPSIZE=1024
#
# Larger heap size may be required when running queries over large number of files or partitions. 
# By default hive shell scripts use a heap size of 256 (MB).  Larger heap size would also be 
# appropriate for hive server.

HADOOP_HOME=/home/hadoop/hadoop-2.3.0
# Set HADOOP_HOME to point to a specific hadoop install directory
# HADOOP_HOME=${bin}/../../hadoop

# Hive Configuration Directory can be controlled by:
export HIVE_CONF_DIR=/home/hadoop/hive/conf
# Folder containing extra libraries required for hive compilation/execution can be controlled by:
# export HIVE_AUX_JARS_PATH=
  3, Configure the metastore: add hive-site.xml under $HIVE_HOME/conf with the database URL, user name, password, and JDBC driver class
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>Driver class name for a JDBC metastore</description>
</property>

<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
<description>username to use against metastore database</description>
</property>

<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>x</value>
<description>password to use against metastore database</description>
</property>
</configuration>
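Before starting Hive it can help to confirm that hive-site.xml really contains the four connection properties listed above. The following is a minimal sketch, not part of Hive itself: the object name `HiveSiteCheck` and its helper functions are hypothetical, and it relies only on the XML parser that ships with the JDK.

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory

// Hypothetical helper (not a Hive API): checks a hive-site.xml string for metastore settings.
object HiveSiteCheck {
  // The four keys used in the hive-site.xml shown above.
  val requiredKeys: Seq[String] = Seq(
    "javax.jdo.option.ConnectionURL",
    "javax.jdo.option.ConnectionDriverName",
    "javax.jdo.option.ConnectionUserName",
    "javax.jdo.option.ConnectionPassword"
  )

  // Parse <property><name>..</name><value>..</value></property> entries into a Map.
  def parseProperties(xml: String): Map[String, String] = {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    val doc = builder.parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))
    val props = doc.getElementsByTagName("property")
    (0 until props.getLength).flatMap { i =>
      val elem = props.item(i).asInstanceOf[org.w3c.dom.Element]
      val names = elem.getElementsByTagName("name")
      val values = elem.getElementsByTagName("value")
      if (names.getLength > 0 && values.getLength > 0)
        Some(names.item(0).getTextContent.trim -> values.item(0).getTextContent.trim)
      else None
    }.toMap
  }

  // Return the required keys that are absent from the parsed properties.
  def missingKeys(props: Map[String, String]): Seq[String] =
    requiredKeys.filterNot(props.contains)
}
```

To check a real file, read it into a string first (for example with `scala.io.Source.fromFile`) and pass the result to `HiveSiteCheck.parseProperties`; an empty result from `missingKeys` means all four settings are present.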
  4, Set hive.log.dir (the Hive log directory) in hive-log4j.properties; to diagnose failed jobs you still need to read the MapReduce logs
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Define some default values that can be overridden by system properties
hive.log.threshold=ALL
hive.root.logger=INFO,DRFA
hive.log.dir=/home/hadoop/hive
hive.log.file=hive.log

# Define the root logger to the system property "hadoop.root.logger".
log4j.rootLogger=${hive.root.logger}, EventCounter

# Logging Threshold
log4j.threshold=${hive.log.threshold}

#
# Daily Rolling File Appender
#
# Use the PidDailyerRollingFileAppend class instead if you want to use separate log files
# for different CLI session.
#
# log4j.appender.DRFA=org.apache.hadoop.hive.ql.log.PidDailyRollingFileAppender

log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender

log4j.appender.DRFA.File=${hive.log.dir}/${hive.log.file}

# Rollver at midnight
log4j.appender.DRFA.DatePattern=.yyyy-MM-dd

# 30-day backup
#log4j.appender.DRFA.MaxBackupIndex=30
log4j.appender.DRFA.layout=org.apache.log4j.PatternLayout

# Pattern format: Date LogLevel LoggerName LogMessage
#log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
# Debugging Pattern format
log4j.appender.DRFA.layout.ConversionPattern=%d{ISO8601} %-5p [%t]: %c{2} (%F:%M(%L)) - %m%n


#
# console
# Add "console" to rootlogger above if you want to use this
#

log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} [%t]: %p %c{2}: %m%n
log4j.appender.console.encoding=UTF-8

#custom logging levels
#log4j.logger.xxx=DEBUG

#
# Event Counter Appender
# Sends counts of logging messages at different severity levels to Hadoop Metrics.
#
log4j.appender.EventCounter=org.apache.hadoop.hive.shims.HiveEventCounter


log4j.category.DataNucleus=ERROR,DRFA
log4j.category.Datastore=ERROR,DRFA
log4j.category.Datastore.Schema=ERROR,DRFA
log4j.category.JPOX.Datastore=ERROR,DRFA
log4j.category.JPOX.Plugin=ERROR,DRFA
log4j.category.JPOX.MetaData=ERROR,DRFA
log4j.category.JPOX.Query=ERROR,DRFA
log4j.category.JPOX.General=ERROR,DRFA
log4j.category.JPOX.Enhancer=ERROR,DRFA


# Silence useless ZK logs
log4j.logger.org.apache.zookeeper.server.NIOServerCnxn=WARN,DRFA
log4j.logger.org.apache.zookeeper.ClientCnxnSocketNIO=WARN,DRFA
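MySQL configuration:
1, Allow remote connections to MySQL so that the other nodes in the cluster can reach the metastore
# *.*: all tables in all databases   %: any IP address or host may connect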
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'root' WITH GRANT OPTION;
FLUSH PRIVILEGES;
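    2, Copy mysql-connector-java-5.1.28.jar into $HIVE_HOME/lib
How Hive works: a Hive query reads the files stored under the table's HDFS path, and other operations follow the same pattern
1, Hive stores the mapping between metadata (in Derby or MySQL) and the data files; a table can be queried only when both exist
2, On startup Hive automatically sets up a data warehouse; creating a database (other than default) creates a folder in the default warehouse location
    -- Create a database from bin/hive; Hive identifiers are case-insensitive
    -- This creates the folder /user/hive/warehouse/sga.db in the default warehouse location in HDFS
    create database Sga;
    -- Switch to the new database so that the table is created inside sga.db
    use Sga;
    -- This creates the folder /user/hive/warehouse/sga.db/stu in HDFS
    create table stu(id int,name string) row format delimited fields terminated by "\t";
    -- Load local data; it can then be inspected from a shell with, for example,
    -- ./hdfs dfs -cat /user/hive/warehouse/sga.db/stu/students
    load data local inpath "/home/hadoop/datas" into table stu;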
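The directory layout described above can be written down as a small function. This is only an illustrative sketch: `WarehousePaths` is a hypothetical helper, and it assumes managed tables, no custom LOCATION clause, and the warehouse root /user/hive/warehouse used in the paths above.

```scala
import java.util.Locale

// Hypothetical helper (not a Hive API): predicts where a managed table's files live in HDFS.
object WarehousePaths {
  // Warehouse root as used in the paths above; assumed, since it is configurable.
  val defaultWarehouse = "/user/hive/warehouse"

  // Tables of the default database sit directly under the warehouse root;
  // every other database gets a lower-cased "<name>.db" folder.
  def databaseDir(db: String, warehouse: String = defaultWarehouse): String = {
    val name = db.toLowerCase(Locale.ROOT)
    if (name == "default") warehouse else s"$warehouse/$name.db"
  }

  // A table is a folder inside its database folder.
  def tableDir(db: String, table: String, warehouse: String = defaultWarehouse): String =
    s"${databaseDir(db, warehouse)}/${table.toLowerCase(Locale.ROOT)}"
}
```

For instance, `WarehousePaths.tableDir("Sga", "stu")` yields the same path that the `hdfs dfs -cat` command above inspects, minus the file name.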
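Common Hive commands:
# -e: run a SQL statement from the command line
./hive -e "select * from stu;"
# -f: run a SQL file; here the output is redirected to a text file
./hive -f fhive.sql > fhiveresult.txt

# Inside the Hive CLI, use dfs to inspect the HDFS file system
dfs  -du -h / ;
# Inside the Hive CLI, use ! to run a local Linux command
!ls -a;

# Set a configuration parameter inside bin/hive
set hive.cli.print.header=true;
# Pass a configuration value when starting the CLI
./hive -hiveconf hive.cli.print.current.db=true

# Location of the Hive CLI command history
$HOME/.hivehistory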
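When the CLI has to be driven from a program, the `-e` and `--hiveconf` options above can be assembled as an argument list. The sketch below only builds the arguments; `HiveCli` and its `command` function are hypothetical names, and the path to the hive script is an assumption that depends on the installation.

```scala
// Hypothetical helper (not part of Hive): builds the argument list for a one-off query.
object HiveCli {
  def command(hiveBin: String, sql: String, conf: Map[String, String] = Map.empty): Seq[String] = {
    // One "--hiveconf key=value" pair per setting, sorted by key for a stable order.
    val confArgs = conf.toSeq.sortBy(_._1).flatMap { case (k, v) => Seq("--hiveconf", s"$k=$v") }
    Seq(hiveBin) ++ confArgs ++ Seq("-e", sql)
  }
}
```

With `import scala.sys.process._`, the resulting sequence can be executed via `HiveCli.command("./hive", "select * from stu;").!!`, which returns the query output as a string on a machine where Hive is installed.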
Maven dependencies (pom.xml)
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.38</version>
        </dependency>
        <!-- Hive compatibility module for Spark SQL -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>    
Scala code example
package Day3

import org.apache.spark.sql.{Dataset, SparkSession}

object hivesparksql {
  def main(args: Array[String]): Unit = {
    // Set the Hadoop user before the session is created so that it takes effect for HDFS access
    System.setProperty("HADOOP_USER_NAME", "root")

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName(this.getClass.getSimpleName)
      .enableHiveSupport() // enable Hive support in Spark SQL
      .getOrCreate()

    val results = spark.sql("show tables")
    results.show()
    /*
+--------+--------------+-----------+
|database|     tableName|isTemporary|
+--------+--------------+-----------+
| default|          dept|      false|
| default|           emp|      false|
| default|           stu|      false|
| default| stu_partition|      false|
| default|       student|      false|
| default| student_infos|      false|
| default|student_scores|      false|
+--------+--------------+-----------+
    */
    // Creates the table directory in HDFS under the warehouse path, e.g. /user/hive/warehouse/t_access
    spark.sql("create table if not exists t_access(username string,mount string)")

    import spark.implicits._
    val access: Dataset[String] = spark.createDataset(
      List("jie,2019-11-01", "cal,2011-11-01")
    )
    // Split each "username,mount" line into a two-column DataFrame
    val accessdf = access.map { t =>
      val fields: Array[String] = t.split(",")
      (fields(0), fields(1))
    }.toDF("username", "mount")

    // Option 1: write the data through a temporary view
    // accessdf.createTempView(viewName = "v_tmp")
    // spark.sql("insert into t_access select * from v_tmp")

    // Option 2: insert the DataFrame directly; the table name may be given as database.table
    accessdf.write.insertInto("t_access")
    spark.stop()
  }
}
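The example above splits each line and indexes the result directly, which throws an ArrayIndexOutOfBoundsException as soon as a line has fewer than two fields. A slightly more defensive parsing step might look like the sketch below; `AccessParser` and the `Access` case class are hypothetical names introduced only for illustration, and the code is plain Scala so it can be tried without a Spark session.

```scala
// Hypothetical helper: parses "username,mount" lines and skips malformed ones.
object AccessParser {
  final case class Access(username: String, mount: String)

  // Returns None unless the line has exactly two non-empty, comma-separated fields.
  def parseLine(line: String): Option[Access] =
    line.split(",", -1).map(_.trim) match {
      case Array(user, mount) if user.nonEmpty && mount.nonEmpty => Some(Access(user, mount))
      case _ => None
    }

  // Keeps only the lines that parse successfully, preserving their order.
  def parseAll(lines: Seq[String]): Seq[Access] = lines.flatMap(l => parseLine(l))
}
```

In the Spark job the same function could be applied with `access.flatMap(l => AccessParser.parseLine(l)).toDF()` before calling `insertInto`. Note that `insertInto` matches columns by position rather than by name, so the field order of the case class has to follow the column order of t_access.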
