Background
I. What is Presto
Presto runs distributed queries, so it can query very large data sets quickly and efficiently. If you need to process data at the TB or PB scale, you are probably already relying on Hadoop and HDFS. As an alternative to Hive and Pig (both of which query HDFS data through MapReduce pipelines), Presto can not only read HDFS but also query other data sources, including relational databases and systems such as Cassandra.
Presto is designed for data warehousing and analytics: data analysis, large-scale aggregation, and report generation. These workloads are commonly classified as online analytical processing (OLAP).
Presto is an open-source project that originated at Facebook and is jointly maintained and improved by Facebook engineers and the open-source community.
II. Environment and prerequisites
- Environment
MacBook Pro
- Applications
Docker for Mac: https://docs.docker.com/docker-for-mac/#check-versions
jdk-1.8: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
hadoop-2.7.5
hive-2.3.3
presto-cli-0.198-executable.jar
III. Building the images
We use Docker to start three CentOS 7 containers and install Hadoop and Java on them.
1. Install Docker on the MacBook and log in with your registry account.
docker login
2. Verify the installation
docker version
3. Pull the CentOS 7 image
docker pull centos
4. Build a CentOS image with SSH enabled
mkdir ~/centos-ssh
cd ~/centos-ssh
vi Dockerfile
# Start from an existing OS image
FROM centos
# Image author
MAINTAINER crxy
# Install openssh-server and sudo, and set sshd's UsePAM option to no
RUN yum install -y openssh-server sudo
RUN sed -i 's/UsePAM yes/UsePAM no/g' /etc/ssh/sshd_config
# Install openssh-clients
RUN yum install -y openssh-clients
# Add the test user root with password root and add it to sudoers
RUN echo "root:root" | chpasswd
RUN echo "root ALL=(ALL) ALL" >> /etc/sudoers
# These two lines are required on CentOS 6; without them sshd in the resulting container refuses logins
RUN ssh-keygen -t dsa -f /etc/ssh/ssh_host_dsa_key
RUN ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key
# Start sshd and expose port 22
RUN mkdir /var/run/sshd
EXPOSE 22
CMD ["/usr/sbin/sshd", "-D"]
Build the image:
docker build -t centos-ssh .
5. Build an image with the JDK and Hadoop on top of centos-ssh
mkdir ~/hadoop
cd ~/hadoop
vi Dockerfile
FROM centos-ssh
ADD jdk-8u161-linux-x64.tar.gz /usr/local/
RUN mv /usr/local/jdk1.8.0_161 /usr/local/jdk1.8
ENV JAVA_HOME /usr/local/jdk1.8
ENV PATH $JAVA_HOME/bin:$PATH
ADD hadoop-2.7.5.tar.gz /usr/local
RUN mv /usr/local/hadoop-2.7.5 /usr/local/hadoop
ENV HADOOP_HOME /usr/local/hadoop
ENV PATH $HADOOP_HOME/bin:$PATH
The JDK and Hadoop tarballs must sit in the ~/hadoop directory next to the Dockerfile. Note that ADD unpacks the tarballs automatically, so the RUN mv lines rename the extracted directories rather than the archives.
docker build -t centos-hadoop .
IV. Setting up the Hadoop cluster
1. Cluster layout
We build a three-node Hadoop cluster: one master and two slaves.
Master: hadoop0, IP 172.18.0.2
Slave 1: hadoop1, IP 172.18.0.3
Slave 2: hadoop2, IP 172.18.0.4
Because a container gets a new IP address every time it restarts, we need to give each container a fixed IP.
After installation, Docker creates the following three network types by default:
docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
085be4855a90        bridge              bridge              local
177432e48de5        host                host                local
569f368d1561        none                null                local
When starting a container, the --network flag selects the network type, for example:
docker run -itd --name test1 --network bridge centos:latest /bin/bash
bridge: bridged network
Containers started without an explicit network use bridge, the bridged network created when Docker is installed. On each restart a container is handed the next available IP address in order, so its IP can change across restarts.
none: no network
With --network=none, the container is not assigned an IP address on the Docker network.
host: host network
With --network=host, the container shares the host's network stack, so the two can reach each other directly.
For example, a web service inside the container listening on port 8080 is reachable on the host's port 8080 without any explicit port mapping.
Creating a custom network (to assign fixed IPs)
The default networks do not support assigning a fixed IP when starting a container, as this attempt shows:
docker run -itd --net bridge --ip 172.17.0.10 centos:latest /bin/bash
6eb1f228cf308d1c60db30093c126acbfd0cb21d76cb448c678bab0f1a7c0df6
docker: Error response from daemon: User specified IP address is supported on user defined networks only.
So we need to create a user-defined network. The steps are:
Step 1: create a custom network
Create a custom network with the subnet 172.18.0.0/16:
docker network create --subnet=172.18.0.0/16 mynetwork
docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
085be4855a90        bridge              bridge              local
177432e48de5        host                host                local
620ebbc09400        mynetwork           bridge              local
569f368d1561        none                null                local
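If you want to confirm what was actually allocated, docker network inspect prints the subnet and gateway of the new network (an optional check):
# Show the subnet/gateway assigned to mynetwork
docker network inspect mynetwork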
Step 2: start the containers. Launch three containers to act as hadoop0, hadoop1, and hadoop2:
docker run --name hadoop0 --hostname hadoop0 --net mynetwork --ip 172.18.0.2 -d -P -p 50070:50070 -p 8088:8088 centos-hadoop
docker run --name hadoop1 --hostname hadoop1 --net mynetwork --ip 172.18.0.3 -d -P centos-hadoop
docker run --name hadoop2 --hostname hadoop2 --net mynetwork --ip 172.18.0.4 -d -P centos-hadoop
Check the three containers you just started with docker ps:
5e0028ed6da0   hadoop   "/usr/sbin/sshd -D"   16 hours ago   Up 3 hours   0.0.0.0:32771->22/tcp                                                     hadoop2
35211872eb20   hadoop   "/usr/sbin/sshd -D"   16 hours ago   Up 4 hours   0.0.0.0:32769->22/tcp                                                     hadoop1
0f63a870ef2b   hadoop   "/usr/sbin/sshd -D"   16 hours ago   Up 5 hours   0.0.0.0:8088->8088/tcp, 0.0.0.0:50070->50070/tcp, 0.0.0.0:32768->22/tcp   hadoop0
The three machines now have fixed IP addresses. To verify, ping each of the three IPs; if they all respond, everything is fine.
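One way to run that check, assuming ping is available in the centos image (if it is not, install it with yum install -y iputils): ping from inside one of the containers, since the bridge subnet is not directly routable from macOS itself.
# Ping hadoop1 and hadoop2 from inside hadoop0
docker exec hadoop0 ping -c 2 172.18.0.3
docker exec hadoop0 ping -c 2 172.18.0.4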
V. Configuring the Hadoop cluster
1. First, attach to hadoop0 with:
docker exec -it hadoop0 /bin/bash
The steps below walk through the Hadoop cluster configuration.
1: Map hostnames to IPs. On all three containers, run vi /etc/hosts
and add the following entries:
172.18.0.2 hadoop0
172.18.0.3 hadoop1
172.18.0.4 hadoop2
2: Set up password-less SSH login
On hadoop0, run:
cd ~
mkdir .ssh
cd .ssh
ssh-keygen -t rsa   (press Enter at every prompt)
ssh-copy-id -i localhost
ssh-copy-id -i hadoop0
ssh-copy-id -i hadoop1
ssh-copy-id -i hadoop2
On hadoop1, run:
cd ~
cd .ssh
ssh-keygen -t rsa   (press Enter at every prompt)
ssh-copy-id -i localhost
ssh-copy-id -i hadoop1
On hadoop2, run:
cd ~
cd .ssh
ssh-keygen -t rsa   (press Enter at every prompt)
ssh-copy-id -i localhost
ssh-copy-id -i hadoop2
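A minimal check that the key distribution worked: from hadoop0, run a command over SSH on each slave; it should print the remote hostname without asking for a password.
# Run on hadoop0; no password prompt should appear
ssh hadoop1 hostname
ssh hadoop2 hostname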
3: Edit the Hadoop configuration files on hadoop0
Go to the /usr/local/hadoop/etc/hadoop directory
and edit hadoop-env.sh, core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml as follows.
(1) hadoop-env.sh
export JAVA_HOME=/usr/local/jdk1.8
(2) core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop0:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.trash.interval</name>
    <value>1440</value>
  </property>
</configuration>
(3) hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
(4) yarn-site.xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>
</configuration>
(5) Rename the template file: mv mapred-site.xml.template mapred-site.xml
vi mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
(6) Format the NameNode
Go to the /usr/local/hadoop directory
and run the format command:
bin/hdfs namenode -format
Note: this will fail if the which command is missing; install it first with:
yum install -y which
Do not repeat the format operation. If you really must format again, add the -force flag.
(7) Start Hadoop in pseudo-distributed mode
Command: sbin/start-all.sh
The first start asks you to confirm with yes. Check the processes with jps; if you see the following, the pseudo-distributed start succeeded:
3267 SecondaryNameNode
3003 NameNode
3664 Jps
3397 ResourceManager
3090 DataNode
3487 NodeManager
(8) Stop the pseudo-distributed Hadoop
Command: sbin/stop-all.sh
(9) Tell the NodeManagers where the ResourceManager runs by adding this to yarn-site.xml:
<property>
  <description>The hostname of the RM.</description>
  <name>yarn.resourcemanager.hostname</name>
  <value>hadoop0</value>
</property>
(10) Edit the etc/hadoop/slaves file of the Hadoop installation on hadoop0
Delete its existing contents and replace them with:
hadoop1
hadoop2
(11) On hadoop0, copy the configured Hadoop installation to the slave nodes:
scp -rq /usr/local/hadoop hadoop1:/usr/local
scp -rq /usr/local/hadoop hadoop2:/usr/local
(12) Start the distributed Hadoop cluster
Run sbin/start-all.sh
Note: this will report errors because the two slave nodes are missing the which command; install it first.
Run the following on both slave nodes:
yum install -y which
Then start the cluster again (stop it first if it is already running).
(13) Verify that the cluster is healthy
First check the processes.
hadoop0 should be running these processes:
4643 Jps
4073 NameNode
4216 SecondaryNameNode
4381 ResourceManager
hadoop1 should be running these processes:
715 NodeManager
849 Jps
645 DataNode
hadoop2 should be running these processes:
456 NodeManager
589 Jps
388 DataNode
Then verify the cluster with a small job.
Create a local file:
vi a.txt
hello you
hello me
Upload a.txt to HDFS:
hdfs dfs -put a.txt /
Run the wordcount example:
cd /usr/local/hadoop/share/hadoop/mapreduce
hadoop jar hadoop-mapreduce-examples-2.7.5.jar wordcount /a.txt /out
Check the job output:
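If you are following along without the screenshot, the counts can be read straight from HDFS; this assumes the default output file name part-r-00000 that the wordcount example produces.
# List and print the wordcount output
hdfs dfs -ls /out
hdfs dfs -cat /out/part-r-00000
# Expected for the a.txt above:
# hello   2
# me      1
# you     1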

If the counts are there, the cluster is working.
Accessing the cluster web UIs from a browser
Because ports 50070 and 8088 of the hadoop0 container were mapped to the corresponding host ports when it was started,
the cluster's web UIs are reachable directly from the host: the NameNode UI at http://localhost:50070 and the YARN ResourceManager UI at http://localhost:8088.
VI. Installing Hive
We will query the data stored in Hive through Presto's Hive connector, so Hive needs to be installed first.
1. Download Hive locally and copy the tarball into hadoop0 (the container ID is shown by docker ps; the container name hadoop0 also works):
docker cp ~/Download/apache-hive-2.3.3-bin.tar.gz <container-id>:/
2. Extract it into the install directory:
tar -zxvf apache-hive-2.3.3-bin.tar.gz
mv apache-hive-2.3.3-bin /usr/local/hive
cd /usr/local/hive
3. Configure /etc/profile by adding the following lines:
export HIVE_HOME=/usr/local/hive
export PATH=$HIVE_HOME/bin:$PATH
source /etc/profile
4. Install MySQL
We run MySQL in a Docker container as well. First pull the MySQL image:
docker pull mysql
Start the MySQL container on the same network:
docker run --name mysql -e MYSQL_ROOT_PASSWORD=111111 --net mynetwork --ip 172.18.0.5 -d mysql
Log in to the MySQL container (a login sketch follows the next step).
5. Create the metastore database and grant privileges on it:
create database metastore;
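A sketch of this step, assuming the container name mysql and the root password 111111 used above, and a MySQL 5.x image (on MySQL 8.x the GRANT ... IDENTIFIED BY syntax is not accepted and a separate CREATE USER is needed):
# Open a mysql shell inside the container
docker exec -it mysql mysql -uroot -p111111
# Then, at the mysql> prompt:
#   CREATE DATABASE IF NOT EXISTS metastore;
#   GRANT ALL PRIVILEGES ON metastore.* TO 'root'@'%' IDENTIFIED BY '111111';
#   FLUSH PRIVILEGES;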
6. Download the MySQL JDBC connector

After downloading, extract the archive and copy the mysql-connector-java-5.1.41-bin.jar file inside it into the $HIVE_HOME/lib directory, as sketched below.
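One way to do this from the Mac, sketched with the connector version used in this article; the download URL is an assumption, so verify it on dev.mysql.com before relying on it:
# Download and unpack the connector on the host
curl -L -O https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.41.tar.gz
tar -zxvf mysql-connector-java-5.1.41.tar.gz
# Copy the jar into hadoop0, into $HIVE_HOME/lib
docker cp mysql-connector-java-5.1.41/mysql-connector-java-5.1.41-bin.jar hadoop0:/usr/local/hive/lib/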
7. Edit the Hive configuration files
cd /usr/local/hive/conf
7.1 Copy the template files and rename them:
cp hive-env.sh.template hive-env.sh
cp hive-default.xml.template hive-site.xml
cp hive-log4j2.properties.template hive-log4j2.properties
cp hive-exec-log4j2.properties.template hive-exec-log4j2.properties
7.2 Edit hive-env.sh:
export JAVA_HOME=/usr/local/jdk1.8          # Java install path
export HADOOP_HOME=/usr/local/hadoop        # Hadoop install path
export HIVE_HOME=/usr/local/hive            # Hive install path
export HIVE_CONF_DIR=/usr/local/hive/conf   # Hive configuration path
7.3 Create the following directories in HDFS and open up their permissions:
hdfs dfs -mkdir -p /user/hive/warehouse
hdfs dfs -mkdir -p /user/hive/tmp
hdfs dfs -mkdir -p /user/hive/log
hdfs dfs -chmod -R 777 /user/hive/warehouse
hdfs dfs -chmod -R 777 /user/hive/tmp
hdfs dfs -chmod -R 777 /user/hive/log
7.4 Edit hive-site.xml (note that & must be written as &amp; inside the XML value):
<property>
  <name>hive.exec.scratchdir</name>
  <value>/user/hive/tmp</value>
</property>
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
</property>
<property>
  <name>hive.querylog.location</name>
  <value>/user/hive/log</value>
</property>
<!-- MySQL connection settings -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://172.18.0.5:3306/metastore?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>111111</value>
</property>
7.5 Create the local tmp directory:
mkdir /home/hadoop/hive/tmp
and then, in hive-site.xml,
replace every ${system:java.io.tmpdir} with /home/hadoop/hive/tmp/
and replace every ${system:user.name} with ${user.name}, for example with the sed sketch below.
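If you would rather not hunt for the placeholders by hand, both substitutions can be done with sed; this is a sketch that assumes the placeholders appear exactly as in the stock hive-default.xml template:
cd /usr/local/hive/conf
# Replace the scratch-dir placeholder with the local tmp directory created above
sed -i 's|${system:java.io.tmpdir}|/home/hadoop/hive/tmp|g' hive-site.xml
# Drop the system: prefix from the user-name placeholder
sed -i 's|${system:user.name}|${user.name}|g' hive-site.xml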
8. Initialize the Hive metastore schema:
schematool -dbType mysql -initSchema
9. Start Hive:
hive
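As a quick sanity check that Hive can reach the MySQL-backed metastore, run a trivial statement non-interactively; it should print at least the default database:
hive -e 'show databases;'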
10. Create tables in Hive
Create a file named create_table containing the DDL below:
CREATE TABLE IF NOT EXISTS `default`.`d_abstract_event` ( `id` BIGINT, `network_id` BIGINT, `name` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:49:25' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_bumper` ( `front_bumper_id` BIGINT, `end_bumper_id` BIGINT, `content_item_type` STRING, `content_item_id` BIGINT, `content_item_name` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:31:05' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tracking` ( `id` BIGINT, `network_id` BIGINT, `name` STRING, `creative_id` BIGINT, `creative_name` STRING, `ad_unit_id` BIGINT, `ad_unit_name` STRING, `placement_id` BIGINT, `placement_name` STRING, `io_id` BIGINT, `io_ad_group_id` BIGINT, `io_name` STRING, `campaign_id` BIGINT, `campaign_name` STRING, `campaign_status` STRING, `advertiser_id` BIGINT, `advertiser_name` STRING, `agency_id` BIGINT, `agency_name` STRING, `status` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tree_node_frequency_cap` ( `id` BIGINT, `ad_tree_node_id` BIGINT, `frequency_cap` INT, `frequency_period` INT, `frequency_cap_type` STRING, `frequency_cap_scope` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tree_node_skippable` ( `id` BIGINT, `skippable` BIGINT) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tree_node` ( `id` BIGINT, `network_id` BIGINT, `name` STRING, `internal_id` STRING, `staging_internal_id` STRING, `budget_exempt` INT, `ad_unit_id` BIGINT, `ad_unit_name` STRING, `ad_unit_type` STRING, `ad_unit_size` STRING, `placement_id` BIGINT, `placement_name` STRING, `placement_internal_id` STRING, `io_id` BIGINT, `io_ad_group_id` BIGINT, `io_name` STRING, `io_internal_id` STRING, `campaign_id` BIGINT, `campaign_name` STRING, `campaign_internal_id` STRING, `advertiser_id` BIGINT, `advertiser_name` STRING, `advertiser_internal_id` STRING, `agency_id` BIGINT, `agency_name` STRING, `agency_internal_id` STRING, `price_model` STRING, `price_type` STRING, `ad_unit_price` DECIMAL(16,2), `status` STRING, `companion_ad_package_id` BIGINT) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tree_node_staging` ( `ad_tree_node_id` BIGINT, `adapter_status` STRING, `primary_ad_tree_node_id` BIGINT, `production_ad_tree_node_id` BIGINT, `hide` INT, `ignore` INT) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_tree_node_trait` ( `id` BIGINT, `ad_tree_node_id` BIGINT, `trait_type` STRING, `parameter` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_unit_ad_slot_assignment` ( `id` BIGINT, `ad_unit_id` BIGINT, `ad_slot_id` BIGINT) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_ad_unit` ( `id` BIGINT, `name` STRING, `ad_unit_type` STRING, `height` INT, `width` INT, `size` STRING, `network_id` BIGINT, `created_type` STRING) COMMENT 'Imported by sqoop on 2017/06/27 09:31:03' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
CREATE TABLE IF NOT EXISTS `default`.`d_advertiser` ( `id` BIGINT, `network_id` BIGINT, `name` STRING, `agency_id` BIGINT, `agency_name` STRING, `advertiser_company_id` BIGINT, `agency_company_id` BIGINT, `billing_contact_company_id` BIGINT, `address_1` STRING, `address_2` STRING, `address_3` STRING, `city` STRING, `state_region_id` BIGINT, `country_id` BIGINT, `postal_code` STRING, `email` STRING, `phone` STRING, `fax` STRING, `url` STRING, `notes` STRING, `billing_term` STRING, `meta_data` STRING, `internal_id` STRING, `active` INT, `budgeted_imp` BIGINT, `num_of_campaigns` BIGINT, `adv_category_name_list` STRING, `adv_category_id_name_list` STRING, `updated_at` TIMESTAMP, `created_at` TIMESTAMP) COMMENT 'Imported by sqoop on 2017/06/27 09:31:22' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS ORC tblproperties ("orc.compress"="SNAPPY");
cat create_table | hive
11. Start the metastore service
Presto talks to Hive through the Hive metastore service, so start it in the background:
nohup hive --service metastore &
That completes the Hive installation.
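Before moving on, it is worth confirming the metastore is actually up; a simple check is to look for the process (it listens on port 9083 by default, which the Presto catalog configuration below points at). If ps is missing in the container, install it with yum install -y procps-ng.
ps -ef | grep -i metastore | grep -v grep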
VII. Installing Presto
1. Download presto-server-0.198.tar.gz
2. Extract it and create the etc directory:
tar -zxvf presto-server-0.198.tar.gz
cd presto-server-0.198
mkdir etc
cd etc
3. Edit the configuration files:
Node Properties
etc/node.properties
node.environment=production
node.id=ffffffff-0000-0000-0000-ffffffffffff
node.data-dir=/opt/presto/data/discovery/
JVM Config
etc/jvm.config
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
Config Properties
etc/config.properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://hadoop0:8080
Catalog configuration:
etc/catalog/hive.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://hadoop0:9083
hive.config.resources=/usr/local/hadoop/etc/hadoop/core-site.xml,/usr/local/hadoop/etc/hadoop/hdfs-site.xml
4. Start the Presto server
./bin/launcher start
5. Download presto-cli-0.198-executable.jar, rename it to presto, make it executable with chmod +x, then run it:
./presto --server localhost:8080 --catalog hive --schema default
That completes the whole setup. Take a look at the result: run show tables to see the tables we created in Hive.
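For reference, the same check can be run non-interactively with the CLI's --execute flag; the table name below is just one of the tables from the create_table script and is only an example:
./presto --server localhost:8080 --catalog hive --schema default --execute 'SHOW TABLES;'
./presto --server localhost:8080 --catalog hive --schema default --execute 'SELECT COUNT(*) FROM d_ad_unit;'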


References:
https://blog.csdn.net/xu470438000/article/details/50512442
http://www.jb51.net/article/118396.htm
https://prestodb.io/docs/current/installation/cli.html
