Basic environment
Process overview
The build described here downloads the Spark 3.1.1 source package to a local Windows machine, compiles it with Maven, imports the result into IDEA, runs it there, and connects to Hive on the Linux side to run SQL.
Software versions
Linux side:
hadoop-2.6.0-cdh5.16.2
apache-hive-3.1.2-bin
Windows side (it is best to have a VPN tool installed):
java-1.8
maven-3.6.3
scala-2.12.10
spark-3.1.1 (no installation needed; just extract it to a directory of your choice)
Configuration changes
Modify the pom file
Edit the pom.xml in the Spark root directory.
(1) Add repositories
The Google mirror of Maven Central must stay first in the pom; put the repositories you add yourself after it. Adding the Aliyun repository is recommended, and for a CDH Hadoop version you can also add the Cloudera repository.
Location: around line 264 of the pom file
<repository>
  <id>gcs-maven-central-mirror</id>
  <!--
    Google Mirror of Maven Central, placed first so that it's used instead of flaky Maven Central.
    See https://storage-download.googleapis.com/maven-central/index.html
  -->
  <name>GCS Maven Central mirror</name>
  <url>https://maven-central.storage-download.googleapis.com/maven2/</url>
  <releases>
    <enabled>true</enabled>
  </releases>
  <snapshots>
    <enabled>false</enabled>
  </snapshots>
</repository>
<repository>
  <id>alimaven</id>
  <name>aliyun maven</name>
  <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
</repository>
<repository>
  <id>cloudera</id>
  <name>cloudera repository</name>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
(2) Add a module
Add the following line to the <modules> section.
Location: around line 100 of the pom file
<module>sql/hive-thriftserver</module>
Modify the build script
Set the component versions in the build script (dev/make-distribution.sh); if they are left unset, the build may get stuck. The values changed here serve the same purpose as the "-D" options on the build command line.
spark: the version currently being built;
scala: the Scala version installed on Windows; if it is a 2.11.x version, run the Scala version change step before building;
hadoop: the Hadoop version on the Linux server;
hive: 1 enables Hive support; there is no need to specify a Hive version here.
# Spark version
VERSION=3.1.1
# Scala version
SCALA_VERSION=2.12
# Hadoop version
SPARK_HADOOP_VERSION=2.6.0-cdh5.16.2
# Enable Hive
SPARK_HIVE=1
Build and run
Ideally, after the changes above the build can be started and will succeed. In practice, small differences in versions and configuration mean all sorts of errors can appear while building and running; the ones I ran into are listed one by one under Troubleshooting below.
Building the source
Go into the spark-3.1.1 root directory and open Git Bash there (right-click, Git Bash Here).
Copy the following command in and run it.
The -Dhadoop.version and -Dscala.version=2.12.10 options here correspond to the hadoop and scala settings in the build script above and do the same job; options already set in the build script can be left out of the command.
--name: set this to the Hadoop version
--pip: not needed
--tgz: build a .tgz distribution package
./dev/make-distribution.sh \
  --name 2.6.0-cdh5.16.2 \
  --tgz \
  -Phive \
  -Phive-thriftserver \
  -Pyarn \
  -Phadoop-2.7 \
  -Dhadoop.version=2.6.0-cdh5.16.2 \
  -Dscala.version=2.12.10
After roughly 20 minutes the build completes successfully.
Build output: with --tgz and --name as above, the distribution package is generated in the Spark root directory, named along the lines of spark-3.1.1-bin-2.6.0-cdh5.16.2.tgz.
Import into IDEA
(1) After importing into IDEA, Reimport the root pom.xml once so the dependencies are loaded;
(2) Rebuild the whole project once so that anything that ended up out of sync after the import is recompiled;
(3) Copy the configuration files from the server into spark-3.1/spark-3.1.1/sql/hive-thriftserver/src/main/resources.
Check hive-site.xml with particular care and make sure hive.metastore.uris is enabled.

core-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ruozedata:9000</value>
  </property>
  <!-- Hadoop temp directory. hadoop.tmp.dir is the base setting the Hadoop filesystem
       relies on; many paths depend on it. If hdfs-site.xml does not configure the
       namenode and datanode storage locations, they default to this path. -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/tmp/hadoop</value>
  </property>
  <property>
    <name>fs.trash.interval</name>
    <value>1440</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hadoop.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hadoop.groups</name>
    <value>*</value>
  </property>
</configuration>

hdfs-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <!-- Number of HDFS replicas -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>

hive-site.xml:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- Metastore database connection -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://ruozedata:3306/ruozedata_hive?createDatabaseIfNotExist=true;characterEncoding=utf-8&amp;useSSL=false;</value>
  </property>
  <!-- JDBC driver -->
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <!-- Database user -->
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <!-- Database password -->
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>ruozedata</value>
  </property>
  <!-- Enable local mode -->
  <property>
    <name>hive.exec.mode.local.auto</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://ruozedata:9083</value>
  </property>
  <property>
    <name>hive.insert.into.multilevel.dirs</name>
    <value>true</value>
  </property>
</configuration>

log4j.properties:
#log4j.rootLogger=ERROR, stdout
#
#log4j.appender.stdout=org.apache.log4j.ConsoleAppender
#log4j.appender.target=System.out
#log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
#log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n

log4j.rootLogger=INFO, stdout
#log4j.rootLogger=ERROR, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n

log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n
Run the main class
(1) On Linux, start the Hive metastore service: hive --service metastore &
(2) Start the SparkSQL CLI driver: run the main class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
(3) Query the Hive tables
spark-sql (default)> show databases;
databaseName
company
default
hive_function_analyze
skewtest
Time taken: 0.028 seconds, Fetched 10 row(s)
spark-sql (default)> select * from score;
id	name	subject
1	tom	["HuaXue","Physical","Math","Chinese"]
2	jack	["HuaXue","Animal","Computer","Java"]
3	john	["ZheXue","ZhengZhi","SiXiu","history"]
4	alice	["C++","Linux","Hadoop","Flink"]
INFO SparkSQLCLIDriver: Time taken: 1.188 seconds, Fetched 4 row(s)
spark-sql (default)>
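As an alternative check, the same thing can be verified from a small standalone program instead of the CLI. The sketch below is only for illustration (the object name HiveSmokeTest is made up); it assumes the hive-site.xml added above is on the classpath and that the score table from the session above exists. Setting the master explicitly in the builder also sidesteps the "A master URL must be set" error described later.

import org.apache.spark.sql.SparkSession

// Minimal sketch (hypothetical class name): verify the Hive connection
// without going through SparkSQLCLIDriver.
object HiveSmokeTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HiveSmokeTest")
      .master("local[2]")          // avoids the "A master URL must be set" error
      .enableHiveSupport()         // picks up hive-site.xml from the classpath
      .getOrCreate()

    spark.sql("show databases").show()
    spark.sql("select * from score").show(truncate = false)

    spark.stop()
  }
}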
Troubleshooting
Error: NoClassDefFoundError: com/google/common/cache/CacheLoader
Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/cache/CacheLoader
	at org.apache.spark.internal.Logging$.<init>(Logging.scala:189)
	at org.apache.spark.internal.Logging$.<clinit>(Logging.scala)
	at org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:108)
	at org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:105)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.initializeLogIfNecessary(SparkSQLCLIDriver.scala:57)
	at org.apache.spark.internal.Logging.initializeLogIfNecessary(Logging.scala:102)
	at org.apache.spark.internal.Logging.initializeLogIfNecessary$(Logging.scala:101)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.initializeLogIfNecessary(SparkSQLCLIDriver.scala:57)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.<init>(SparkSQLCLIDriver.scala:63)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.<clinit>(SparkSQLCLIDriver.scala)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
Caused by: java.lang.ClassNotFoundException: com.google.common.cache.CacheLoader
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	... 11 more
Fix: modify the pom files
Edit the pom.xml under the hive-thriftserver module.
Location: around line 422 of the pom file
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-server</artifactId>
  <!-- <scope>provided</scope> -->
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-servlet</artifactId>
  <!-- <scope>provided</scope> -->
</dependency>
Edit the main pom.xml.
Location: around line 84 of the pom file
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-http</artifactId>
  <version>${jetty.version}</version>
  <!-- <scope>provided</scope> -->
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-continuation</artifactId>
  <version>${jetty.version}</version>
  <!-- <scope>provided</scope> -->
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-servlet</artifactId>
  <version>${jetty.version}</version>
  <!-- <scope>provided</scope> -->
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-servlets</artifactId>
  <version>${jetty.version}</version>
  <!-- <scope>provided</scope> -->
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-proxy</artifactId>
  <version>${jetty.version}</version>
  <!-- <scope>provided</scope> -->
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-client</artifactId>
  <version>${jetty.version}</version>
  <!-- <scope>provided</scope> -->
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-util</artifactId>
  <version>${jetty.version}</version>
  <!-- <scope>provided</scope> -->
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-security</artifactId>
  <version>${jetty.version}</version>
  <!-- <scope>provided</scope> -->
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-plus</artifactId>
  <version>${jetty.version}</version>
  <!-- <scope>provided</scope> -->
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-server</artifactId>
  <version>${jetty.version}</version>
  <!-- <scope>provided</scope> -->
</dependency>
<dependency>
  <groupId>org.eclipse.jetty</groupId>
  <artifactId>jetty-webapp</artifactId>
  <version>${jetty.version}</version>
  <!-- <scope>provided</scope> -->
</dependency>
Edit the main pom.xml.
Location: around line 663 of the pom file; change the scope to compile
<dependency>
  <groupId>xml-apis</groupId>
  <artifactId>xml-apis</artifactId>
  <version>1.4.01</version>
  <scope>compile</scope>
</dependency>
Edit the main pom.xml.
Location: around line 486 of the pom file
<dependency>
  <groupId>com.google.guava</groupId>
  <artifactId>guava</artifactId>
  <version>${guava.version}</version>
  <!-- <scope>provided</scope> -->
</dependency>
Any other similar ClassNotFoundException errors have the same cause; comment out the provided scope for the dependency in question in the same way.
Reference: https://blog.csdn.net/qq_43081842/article/details/105777311
Error: A master URL must be set in your configuration
2021-04-15 00:59:20,147 ERROR [org.apache.spark.SparkContext] - Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:394)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2678)
	at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:942)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:936)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:52)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.<init>(SparkSQLCLIDriver.scala:325)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:157)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:394)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2678)
	at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:942)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:936)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:52)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.<init>(SparkSQLCLIDriver.scala:325)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:157)
	at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
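Fix (my suggestion, not spelled out in the original steps): when SparkSQLCLIDriver is launched directly from IDEA, nothing supplies a master URL. One remedy is to add -Dspark.master=local[2] (or local[*]) to the VM options of the IDEA Run Configuration; SparkConf reads JVM system properties that start with spark., so the CLI then starts against a local master.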
Error: value setRolledLogsIncludePattern is not a member of org.apache.hadoop.yarn.api.records.LogAggregationContext
Cause: Spark 3.x does not build cleanly against Hadoop 2.x here, so the source has to be patched by hand.
Fix: edit the source file spark-3.1.1\resource-managers\yarn\src\main\scala\org\apache\spark\deploy\yarn\Client.scala
Reference: https://github.com/apache/spark/pull/16884/files
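For orientation, the code that fails to compile is the log-aggregation block in Client.scala. Below is a rough sketch of one possible workaround, not the exact patch: it calls the missing setters reflectively so the file compiles against Hadoop 2.x and only logs a warning when the methods are absent. Identifiers such as sparkConf, appContext, ROLLED_LOG_INCLUDE_PATTERN, ROLLED_LOG_EXCLUDE_PATTERN and logWarning come from the surrounding file; follow the PR linked above for the authoritative change.

// Sketch only: inside Client.scala, replace the direct setter calls with reflective
// ones so the build no longer requires methods that Hadoop 2.x does not provide.
sparkConf.get(ROLLED_LOG_INCLUDE_PATTERN).foreach { includePattern =>
  try {
    val logAggregationContext = Records.newRecord(classOf[LogAggregationContext])
    logAggregationContext.getClass
      .getMethod("setRolledLogsIncludePattern", classOf[String])
      .invoke(logAggregationContext, includePattern)
    sparkConf.get(ROLLED_LOG_EXCLUDE_PATTERN).foreach { excludePattern =>
      logAggregationContext.getClass
        .getMethod("setRolledLogsExcludePattern", classOf[String])
        .invoke(logAggregationContext, excludePattern)
    }
    appContext.setLogAggregationContext(logAggregationContext)
  } catch {
    case NonFatal(e) =>
      logWarning("Ignoring rolled log aggregation settings because the running " +
        "Hadoop version does not support them: " + e.getMessage)
  }
}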