Docker + Hadoop + Hive + Presto: Deploying a Hadoop Environment and Presto with Docker


 

Background

I. What is Presto

By distributing queries across a cluster, Presto can query massive data sets quickly and efficiently. If you need to process terabytes or petabytes of data, you will likely rely on Hadoop and HDFS to store it. As an alternative to Hive and Pig (both of which query HDFS data through MapReduce pipelines), Presto can not only access HDFS but also query other data sources, including relational databases and systems such as Cassandra.

Presto is designed for data warehousing and analytics: data analysis, large-scale aggregation, and report generation. These workloads are commonly described as online analytical processing (OLAP).

Presto is an open-source project that originated at Facebook and is maintained and improved jointly by Facebook engineers and the open-source community.

 

II. Environment and Software

  • Environment

  MacBook Pro

  • Applications

  Docker for mac: https://docs.docker.com/docker-for-mac/#check-versions

  jdk-1.8: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

  hadoop-2.7.5

  hive-2.3.3

  presto-server-0.198.tar.gz

  presto-cli-0.198-executable.jar

 

III. Building the Images

We will use Docker to start three CentOS 7 containers and install Java and Hadoop on them.

1. Install Docker for Mac and log in with your Docker Hub account:

docker login

2. Verify the installation:

docker version

3. Pull the CentOS 7 image:

docker pull centos

4. Build a CentOS image with SSH support

mkdir ~/centos-ssh
cd centos-ssh
vi Dockerfile
# Use an existing OS image as the base
FROM centos

# Image maintainer
MAINTAINER crxy

# Install openssh-server and sudo, and set sshd's UsePAM option to no
RUN yum install -y openssh-server sudo
RUN sed -i 's/UsePAM yes/UsePAM no/g' /etc/ssh/sshd_config
# Install openssh-clients
RUN yum install -y openssh-clients

# Add the root user with password root and add it to sudoers
RUN echo "root:root" | chpasswd
RUN echo "root   ALL=(ALL)       ALL" >> /etc/sudoers
# The next two lines are required on CentOS 6; without them sshd in the container refuses logins
RUN ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key
RUN ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key

# Start the sshd service and expose port 22
RUN mkdir /var/run/sshd
EXPOSE 22
CMD ["/usr/sbin/sshd", "-D"]

Build the image:

docker build -t centos-ssh .
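As an optional sanity check (the host port 2222 below is just an arbitrary choice), you can start a throwaway container from the new image and confirm that sshd accepts the root/root login:

docker run -d --name ssh-test -p 2222:22 centos-ssh
ssh root@localhost -p 2222      # log in with password "root"
docker rm -f ssh-test           # remove the test container afterwards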

5. Build an image with the JDK and Hadoop on top of centos-ssh

mkdir ~/hadoop
cd hadoop
vi Dockerfile
FROM centos-ssh
ADD jdk-8u161-linux-x64.tar.gz /usr/local/
RUN mv /usr/local/jdk1.8.0_161 /usr/local/jdk1.8
ENV JAVA_HOME /usr/local/jdk1.8
ENV PATH $JAVA_HOME/bin:$PATH

ADD hadoop-2.7.5.tar.gz /usr/local
RUN mv /usr/local/hadoop-2.7.5 /usr/local/hadoop
ENV HADOOP_HOME /usr/local/hadoop
ENV PATH $HADOOP_HOME/bin:$PATH

The JDK and Hadoop tarballs must be placed in the hadoop directory (next to the Dockerfile) before building:

docker build -t centos-hadoop .

 

IV. Setting Up the Hadoop Cluster

1. Cluster layout

We will build a three-node Hadoop cluster with one master and two slaves:

Master node:  hadoop0  ip: 172.18.0.2
Slave node 1: hadoop1  ip: 172.18.0.3
Slave node 2: hadoop2  ip: 172.18.0.4

However, a Docker container's IP address changes after a restart, so we need to assign fixed IPs to the containers.

After installation, Docker creates the following three network types by default:

➜ ~ docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
085be4855a90        bridge              bridge              local
177432e48de5        host                host                local
569f368d1561        none                null                local

When starting a container, use the --network option to choose the network type, for example:

➜ ~ docker run -itd --name test1 --network bridge centos:latest /bin/bash

bridge: bridged network

By default, containers are attached to bridge, the bridged network created when Docker is installed. On every restart a container is assigned the next available IP address in order, so its IP can change after a restart.

none: no network

With --network=none, the container is not assigned an IP address on the Docker network.

host: host network

With --network=host, the container shares the host's network stack, and the two are directly interconnected.

For example, if a web service inside the container listens on port 8080, it is reachable on the host's port 8080 without any explicit port mapping.

Creating a custom network (to assign fixed IPs)

The default networks do not support assigning a fixed IP when starting a container:

➜ ~ docker run -itd --net bridge --ip 172.17.0.10 centos:latest /bin/bash
6eb1f228cf308d1c60db30093c126acbfd0cb21d76cb448c678bab0f1a7c0df6
docker: Error response from daemon: User specified IP address is supported on user defined networks only.

Therefore we need to create a custom network. The steps are as follows:

Step 1: Create a custom network

Create a custom network and specify the subnet 172.18.0.0/16:

➜ ~ docker network create --subnet=172.18.0.0/16 mynetwork
➜ ~ docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
085be4855a90        bridge              bridge              local
177432e48de5        host                host                local
620ebbc09400        mynetwork           bridge              local
569f368d1561        none                null                local

Step 2: Create the Docker containers. Start three containers to serve as hadoop0, hadoop1, and hadoop2:

➜  ~ docker run --name hadoop0 --hostname hadoop0 --net mynetwork --ip 172.18.0.2 -d -P -p 50070:50070 -p 8088:8088  centos-hadoop
➜  ~ docker run --name hadoop1 --hostname hadoop1 --net mynetwork --ip 172.18.0.3 -d -P centos-hadoop
➜  ~ docker run --name hadoop2 --hostname hadoop2 --net mynetwork --ip 172.18.0.4 -d -P centos-hadoop

Use docker ps to check the three containers we just started:

5e0028ed6da0        centos-hadoop       "/usr/sbin/sshd -D"      16 hours ago        Up 3 hours          0.0.0.0:32771->22/tcp                                                     hadoop2
35211872eb20        centos-hadoop       "/usr/sbin/sshd -D"      16 hours ago        Up 4 hours          0.0.0.0:32769->22/tcp                                                     hadoop1
0f63a870ef2b        centos-hadoop       "/usr/sbin/sshd -D"      16 hours ago        Up 5 hours          0.0.0.0:8088->8088/tcp, 0.0.0.0:50070->50070/tcp, 0.0.0.0:32768->22/tcp   hadoop0

The three machines now have fixed IP addresses. Verify this by pinging each of the three IPs; if the pings succeed, everything is fine.
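For example, connectivity can be checked from inside hadoop0 (a sketch; the minimal CentOS image may not ship with ping, in which case install it first):

docker exec hadoop0 yum install -y iputils
docker exec hadoop0 ping -c 2 172.18.0.3
docker exec hadoop0 ping -c 2 172.18.0.4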

 

V. Configuring the Hadoop Cluster

1. First, connect to hadoop0 with:

docker exec -it hadoop0 /bin/bash

The steps below walk through the Hadoop cluster configuration.
1: Map hostnames to IPs. Edit /etc/hosts in all three containers (a scripted alternative is sketched right after the entries) and add the following lines:

172.18.0.2    hadoop0
172.18.0.3    hadoop1
172.18.0.4    hadoop2
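If you prefer not to edit each container by hand, the same three entries can be appended from the host with a small loop like the sketch below (Docker regenerates /etc/hosts when a container restarts, so this may need to be repeated):

for h in hadoop0 hadoop1 hadoop2; do
  docker exec $h sh -c 'printf "172.18.0.2\thadoop0\n172.18.0.3\thadoop1\n172.18.0.4\thadoop2\n" >> /etc/hosts'
done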

2: Set up passwordless SSH login
On hadoop0, run:

cd ~
mkdir .ssh
cd .ssh
ssh-keygen -t rsa        (press Enter at every prompt)
ssh-copy-id -i localhost
ssh-copy-id -i hadoop0
ssh-copy-id -i hadoop1
ssh-copy-id -i hadoop2

On hadoop1, run:

cd ~
cd .ssh
ssh-keygen -t rsa        (press Enter at every prompt)
ssh-copy-id -i localhost
ssh-copy-id -i hadoop1

On hadoop2, run:

cd ~
cd .ssh
ssh-keygen -t rsa        (press Enter at every prompt)
ssh-copy-id -i localhost
ssh-copy-id -i hadoop2
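To confirm that the keys were distributed correctly, a quick check from hadoop0 is:

ssh hadoop1 hostname     # should print "hadoop1" without asking for a password
ssh hadoop2 hostname     # should print "hadoop2" without asking for a password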

3: Edit the Hadoop configuration files on hadoop0.
Go to the /usr/local/hadoop/etc/hadoop directory and edit hadoop-env.sh, core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml as follows.
(1) hadoop-env.sh

export JAVA_HOME=/usr/local/jdk1.8

(2) core-site.xml

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://hadoop0:9000</value>
        </property>
        <property>
                <name>hadoop.tmp.dir</name>
                <value>/usr/local/hadoop/tmp</value>
        </property>
         <property>
                 <name>fs.trash.interval</name>
                 <value>1440</value>
        </property>
</configuration>

(3) hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>

(4) yarn-site.xml

<configuration>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
        <property> 
                <name>yarn.log-aggregation-enable</name> 
                <value>true</value> 
        </property>
</configuration>

(5) Rename the template file: mv mapred-site.xml.template mapred-site.xml
vi mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

(6) Format the NameNode
Go to the /usr/local/hadoop directory and run the format command:

bin/hdfs namenode -format

Note: the command fails if the which utility is missing; install it first:

yum install -y which

The format operation must not be repeated. If you really need to re-format, add the -force flag.

(7) Start Hadoop in pseudo-distributed mode

Command: sbin/start-all.sh

The first start asks you to confirm with yes. Then run jps to check whether the processes started correctly; if you can see the following processes, the pseudo-distributed startup succeeded:

3267 SecondaryNameNode
3003 NameNode
3664 Jps
3397 ResourceManager
3090 DataNode
3487 NodeManager

(8) Stop pseudo-distributed Hadoop

Command: sbin/stop-all.sh

(9) Tell the NodeManagers where the ResourceManager runs by adding the following to yarn-site.xml:

<property>
    <description>The hostname of the RM.</description>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop0</value>
  </property>

(10) Edit the configuration file etc/hadoop/slaves on hadoop0.
Delete the existing contents and replace them with:

hadoop1
hadoop2

(11)在hadoop0中執行命令

scp  -rq /usr/local/hadoop   hadoop1:/usr/local
scp  -rq /usr/local/hadoop   hadoop2:/usr/local

(12) Start the Hadoop cluster

Run sbin/start-all.sh

Note: this reports errors because the two slave nodes are missing the which utility; install it on both slaves:

yum install -y which

Then start the cluster again (stop it first if it is already running).

(13) Verify that the cluster is working
First check the processes.

hadoop0 should have these processes:

4643 Jps
4073 NameNode
4216 SecondaryNameNode
4381 ResourceManager

hadoop1 should have these processes:

715 NodeManager
849 Jps
645 DataNode

hadoop2 should have these processes:

456 NodeManager
589 Jps
388 DataNode

Verify the cluster by running a test job.
Create a local file:

vi a.txt
hello you
hello me

Upload a.txt to HDFS:

hdfs dfs -put a.txt /

Run the wordcount example:

cd /usr/local/hadoop/share/hadoop/mapreduce
hadoop jar hadoop-mapreduce-examples-2.7.5.jar wordcount /a.txt /out

Check the job output:
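For example (a sketch; the wildcard reads whatever output files the job produced under /out), with the two-line a.txt above the counts should come out as:

hdfs dfs -cat /out/*
hello   2
me      1
you     1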

If the output matches, the cluster is working correctly.

Access the cluster's web UIs through a browser.
Because ports 50070 and 8088 were mapped to the host when the hadoop0 container was started, the Hadoop web UIs can be reached directly from the host.
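For example, on the Mac host the following should open the two web UIs (50070 is the Hadoop 2.x NameNode UI port and 8088 the YARN ResourceManager UI port, matching the ports we published):

open http://localhost:50070     # HDFS NameNode web UI
open http://localhost:8088      # YARN ResourceManager web UI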

 

VI. Installing Hive

We will use Presto's Hive connector to query the data stored in Hive, so Hive needs to be installed first.

1. Download Hive locally and copy it into hadoop0 with:

docker cp ~/Download/apache-hive-2.3.3-bin.tar.gz <container ID>:/

2. Extract it into the target directory:

tar -zxvf apache-hive-2.3.3-bin.tar.gz
mv apache-hive-2.3.3-bin /usr/local/hive
cd /usr/local/hive

3. Configure /etc/profile. Add the following lines to it, then apply the change:

export HIVE_HOME=/usr/local/hive
export PATH=$HIVE_HOME/bin:$PATH
source /etc/profile 

4. Install a MySQL database

We will also run MySQL in a Docker container. First pull the mysql image:

docker pull mysql

Start the mysql container:

docker run --name mysql -e MYSQL_ROOT_PASSWORD=111111 --net mynetwork --ip 172.18.0.5 -d mysql

Log in to the mysql container.
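A minimal way to do this, assuming the official mysql image and the root password set above:

docker exec -it mysql mysql -uroot -p111111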

5. Create the metastore database and grant privileges on it:

create database metastore;
grant all privileges on metastore.* to 'root'@'%' identified by '111111';  -- MySQL 5.x syntax; the official mysql image typically already allows root to connect from any host

6. Download the JDBC connector

Download it from the MySQL site: Connector/J 5.1.43

After downloading, extract the archive and copy the mysql-connector-java-5.1.43-bin.jar file into the $HIVE_HOME/lib directory.

7. Edit the Hive configuration files

cd /usr/local/hive/conf

7.1 Copy the template files and rename them:

cp hive-env.sh.template hive-env.sh
cp hive-default.xml.template hive-site.xml
cp hive-log4j2.properties.template hive-log4j2.properties
cp hive-exec-log4j2.properties.template hive-exec-log4j2.properties

7.2 Edit hive-env.sh:

export JAVA_HOME=/usr/local/jdk1.8          ## Java installation path
export HADOOP_HOME=/usr/local/hadoop        ## Hadoop installation path
export HIVE_HOME=/usr/local/hive            ## Hive installation path
export HIVE_CONF_DIR=/usr/local/hive/conf   ## Hive configuration directory

7.3 Create the following directories in HDFS and open up their permissions:

hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -mkdir -p /user/hive/tmp
hdfs dfs -mkdir -p /user/hive/log
hdfs dfs -chmod -R 777 /user/hive/warehouse
hdfs dfs -chmod -R 777 /user/hive/tmp
hdfs dfs -chmod -R 777 /user/hive/log

7.4 Edit hive-site.xml:

<property>
    <name>hive.exec.scratchdir</name>
    <value>/user/hive/tmp</value>
</property>
<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
</property>
<property>
    <name>hive.querylog.location</name>
    <value>/user/hive/log</value>
</property>

<!-- MySQL database connection settings -->
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://172.18.0.5:3306/metastore?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>111111</value>
  </property>

7.5 Create the tmp directory:

mkdir -p /home/hadoop/hive/tmp

Then make the following replacements in hive-site.xml:

Replace every ${system:java.io.tmpdir} with /home/hadoop/hive/tmp/

Replace every ${system:user.name} with ${user.name}
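A sketch of the same two replacements done with sed, assuming hive-site.xml lives in /usr/local/hive/conf:

cd /usr/local/hive/conf
sed -i 's#${system:java.io.tmpdir}#/home/hadoop/hive/tmp#g' hive-site.xml
sed -i 's#${system:user.name}#${user.name}#g' hive-site.xml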

8. Initialize the Hive metastore schema:

schematool -dbType mysql -initSchema

9. Start Hive:

hive

10. Create tables in Hive

Create a file named create_table with the following statements:

CREATE TABLE IF NOT EXISTS `default`.`d_abstract_event` ( `id` BIGINT, `network_id` BIGINT, `name` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:49:25' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_bumper` ( `front_bumper_id` BIGINT, `end_bumper_id` BIGINT, `content_item_type` STRING, `content_item_id` BIGINT, `content_item_name` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:31:05' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tracking` ( `id` BIGINT, `network_id` BIGINT, `name` STRING, `creative_id` BIGINT, `creative_name` STRING, `ad_unit_id` BIGINT, `ad_unit_name` STRING, `placement_id` BIGINT, `placement_name` STRING, `io_id` BIGINT, `io_ad_group_id` BIGINT, `io_name` STRING, `campaign_id` BIGINT, `campaign_name` STRING, `campaign_status` STRING, `advertiser_id` BIGINT, `advertiser_name` STRING, `agency_id` BIGINT, `agency_name` STRING, `status` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tree_node_frequency_cap` ( `id` BIGINT, `ad_tree_node_id` BIGINT, `frequency_cap` INT, `frequency_period` INT, `frequency_cap_type` STRING, `frequency_cap_scope` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tree_node_skippable` ( `id` BIGINT, `skippable` BIGINT) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tree_node` ( `id` BIGINT, `network_id` BIGINT, `name` STRING, `internal_id` STRING, `staging_internal_id` STRING, `budget_exempt` INT, `ad_unit_id` BIGINT, `ad_unit_name` STRING, `ad_unit_type` STRING, `ad_unit_size` STRING, `placement_id` BIGINT, `placement_name` STRING, `placement_internal_id` STRING, `io_id` BIGINT, `io_ad_group_id` BIGINT, `io_name` STRING, `io_internal_id` STRING, `campaign_id` BIGINT, `campaign_name` STRING, `campaign_internal_id` STRING, `advertiser_id` BIGINT, `advertiser_name` STRING, `advertiser_internal_id` STRING, `agency_id` BIGINT, `agency_name` STRING, `agency_internal_id` STRING, `price_model` STRING, `price_type` STRING, `ad_unit_price` DECIMAL(16,2), `status` STRING, `companion_ad_package_id` BIGINT) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tree_node_staging` ( `ad_tree_node_id` BIGINT, `adapter_status` STRING, `primary_ad_tree_node_id` BIGINT, `production_ad_tree_node_id` BIGINT, `hide` INT, `ignore` INT) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tree_node_trait` ( `id` BIGINT, `ad_tree_node_id` BIGINT, `trait_type` STRING, `parameter` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_unit_ad_slot_assignment` ( `id` BIGINT, `ad_unit_id` BIGINT, `ad_slot_id` BIGINT) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_unit` ( `id` BIGINT, `name` STRING, `ad_unit_type` STRING, `height` INT, `width` INT, `size` STRING, `network_id` BIGINT, `created_type` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_advertiser` ( `id` BIGINT, `network_id` BIGINT, `name` STRING, `agency_id` BIGINT, `agency_name` STRING, `advertiser_company_id` BIGINT, `agency_company_id` BIGINT, `billing_contact_company_id` BIGINT, `address_1` STRING, `address_2` STRING, `address_3` STRING, `city` STRING, `state_region_id` BIGINT, `country_id` BIGINT, `postal_code` STRING, `email` STRING, `phone` STRING, `fax` STRING, `url` STRING, `notes` STRING, `billing_term` STRING, `meta_data` STRING, `internal_id` STRING, `active` INT, `budgeted_imp` BIGINT, `num_of_campaigns` BIGINT, `adv_category_name_list` STRING, `adv_category_id_name_list` STRING, `updated_at` TIMESTAMP, `created_at` TIMESTAMP) COMMENT 'Imported by sqoop on 2017/06/27 09:31:22' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
Run the statements through Hive:

cat create_table | hive

11. Start the metastore service

Presto uses Hive's metastore service:

nohup hive --service metastore &
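To confirm the metastore is up, check that it is listening on its default Thrift port 9083 (netstat comes from the net-tools package, which may need to be installed first):

yum install -y net-tools
netstat -nltp | grep 9083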

This completes the Hive installation.

 

VII. Installing Presto

1. Download presto-server-0.198.tar.gz

2. Extract it and create the etc directory:

tar -zxvf presto-server-0.198.tar.gz
cd presto-server-0.198
mkdir etc
cd etc

3. Edit the configuration files:

Node Properties 

etc/node.properties

node.environment=production
node.id=ffffffff-0000-0000-0000-ffffffffffff
node.data-dir=/opt/presto/data/discovery/

JVM Config

etc/jvm.config

-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError

Config Properties

etc/config.properties

coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://hadoop0:8080

Catalog configuration:

etc/catalog/hive.properties

connector.name=hive-hadoop2
hive.metastore.uri=thrift://hadoop0:9083
hive.config.resources=/usr/local/hadoop/etc/hadoop/core-site.xml,/usr/local/hadoop/etc/hadoop/hdfs-site.xml

4. Start the Presto server (from the presto-server-0.198 directory):

./bin/launcher start

5. Download presto-cli-0.198-executable.jar, rename it to presto, make it executable with chmod +x, then run it:

./presto --server localhost:8080 --catalog hive --schema default

That completes the whole setup. Let's see it in action: run show tables to list the tables we created in Hive.
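A session might look roughly like the sketch below; the table names are the ones created by the create_table script earlier, and the exact CLI output formatting may differ:

presto:default> SHOW TABLES;
       Table
--------------------
 d_abstract_event
 d_ad_bumper
 d_ad_tracking
 ...
presto:default> SELECT count(*) FROM d_abstract_event;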

 

References:

https://blog.csdn.net/xu470438000/article/details/50512442

http://www.jb51.net/article/118396.htm

https://prestodb.io/docs/current/installation/cli.html

 

