1. Preparation
1.1 Install JDK 1.8
1.2 Install Scala 2.11.8
1.3 Install IDEA
Pick whichever versions suit your needs; the installation steps above are not covered in detail in this article and may be shared in separate posts.
1.4 Notes
- Configure the JAVA_HOME and SCALA_HOME environment variables for the JDK and Scala respectively.
- Install the Scala plugin in IDEA.
- Create the project through Maven; you also need to download a Scala SDK.
- Download the Maven package, unpack it, point IDEA at its settings.xml, and set the local repository location (see the settings.xml sketch after this list).
- After the Maven project is created, right-click the project name, choose Add Framework Support, and add Scala support.
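A minimal settings.xml sketch for the local repository location; the path below is only an example, not part of the original setup:
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0">
    <!-- Example path only: point this at wherever you want downloaded jars to live. -->
    <localRepository>D:\maven\repository</localRepository>
</settings>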
2. Spark environment configuration
2.1 Add the Spark dependencies to pom.xml
The main ones are spark-core, spark-sql, spark-mllib and spark-hive; add whichever your project needs. Keep every Spark artifact on the same version and the same Scala suffix (the example below mixes 2.3.0 and 2.4.8, which should be aligned in a real project).
Use the reload button in the Maven tool window to import the dependencies; the required jars are downloaded automatically.
<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.3.0</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.4.8</version>
        <!--<scope>provided</scope>-->
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_2.11</artifactId>
        <version>2.4.8</version>
        <!--<scope>provided</scope>-->
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.11</artifactId>
        <version>2.4.8</version>
        <!--<scope>provided</scope>-->
    </dependency>
</dependencies>
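To keep the Spark modules from drifting to different versions, the version can be centralized in a Maven property; a minimal sketch, assuming you standardize on 2.4.8 (adjust to whichever version you actually run):
<properties>
    <spark.version>2.4.8</spark.version>
    <scala.binary.version>2.11</scala.binary.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <!-- spark-sql, spark-mllib and spark-hive follow the same pattern. -->
</dependencies>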
2.2 Create a Scala object and add the configuration to start the Spark environment
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

object readcsv_demo {
  def main(args: Array[String]): Unit = {
    // On Windows, point Hadoop at the local winutils installation (see 2.3.2).
    System.setProperty("hadoop.home.dir", "D:\\Regent Wan\\install\\hadoop-common-2.2.0-bin-master")
    // Run locally, using all available cores.
    lazy val cfg: SparkConf = new SparkConf().setAppName("local_demo").setMaster("local[*]")
    lazy val spark: SparkSession = SparkSession.builder().config(cfg).enableHiveSupport().getOrCreate()
    lazy val sc: SparkContext = spark.sparkContext
  }
}
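To check that the session works, it can be used right away. Below is a minimal sketch that reads a CSV file, in line with the object name above; the file path and options are placeholders, not part of the original article:
import org.apache.spark.sql.SparkSession

object readcsv_usage {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("local_demo").master("local[*]").getOrCreate()

    // Hypothetical input file; replace with a real path on your machine.
    val df = spark.read
      .option("header", "true")       // first row holds column names
      .option("inferSchema", "true")  // let Spark guess the column types
      .csv("data/demo.csv")

    df.printSchema()
    df.show(5)

    spark.stop()
  }
}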
2.3 Common problems
2.3.1 Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/SparkSession$
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/sql/SparkSession$
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.SparkSession$
Cause: the dependency snippet copied from the Maven repository when importing the Spark module includes <scope>provided</scope> by default:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.4.8</version>
    <scope>provided</scope>
</dependency>
Solution: comment out <scope>provided</scope> and reload the Maven project.
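After the fix the dependency looks like this:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.4.8</version>
    <!--<scope>provided</scope>-->
</dependency>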
2.3.2 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
21/08/24 20:27:59 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
Cause: Hadoop (winutils) is not configured on Windows.
Solution: download a Hadoop winutils package and set the hadoop.home.dir property before creating the SparkSession, e.g. System.setProperty("hadoop.home.dir", "D:\\Regent Wan\\install\\hadoop-common-2.2.0-bin-master")
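Equivalently, the HADOOP_HOME environment variable can point at the same directory; in either case winutils.exe must sit in the bin subfolder of that path. A minimal sketch for verifying the setup (object name and path are hypothetical):
import org.apache.spark.sql.SparkSession

object winutils_check {
  def main(args: Array[String]): Unit = {
    // Must run before the first Spark/Hadoop class is used.
    System.setProperty("hadoop.home.dir", "D:\\hadoop") // expects D:\hadoop\bin\winutils.exe
    val spark = SparkSession.builder().appName("winutils_check").master("local[*]").getOrCreate()
    println(spark.version) // no winutils error here means the path was picked up
    spark.stop()
  }
}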
2.3.3 The run prints a lot of INFO messages
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/08/24 20:33:23 INFO SparkContext: Running Spark version 2.3.0
21/08/24 20:33:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/08/24 20:33:23 INFO SparkContext: Submitted application: local_demo
21/08/24 20:33:23 INFO SecurityManager: Changing view acls to: Administrator
21/08/24 20:33:23 INFO SecurityManager: Changing modify acls to: Administrator
21/08/24 20:33:23 INFO SecurityManager: Changing view acls groups to:
21/08/24 20:33:23 INFO SecurityManager: Changing modify acls groups to:
21/08/24 20:33:23 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Administrator); groups with view permissions: Set(); users with modify permissions: Set(Administrator); groups with modify permissions: Set()
21/08/24 20:33:24 INFO Utils: Successfully started service 'sparkDriver' on port 12914.
21/08/24 20:33:24 INFO SparkEnv: Registering MapOutputTracker
21/08/24 20:33:25 INFO SparkEnv: Registering BlockManagerMaster
21/08/24 20:33:25 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/08/24 20:33:25 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/08/24 20:33:25 INFO DiskBlockManager: Created local directory at C:\Users\Administrator\AppData\Local\Temp\blockmgr-82e75467-4dcc-405f-9f06-94374e10f55b
21/08/24 20:33:25 INFO MemoryStore: MemoryStore started with capacity 877.2 MB
21/08/24 20:33:25 INFO SparkEnv: Registering OutputCommitCoordinator
21/08/24 20:33:25 INFO Utils: Successfully started service 'SparkUI' on port 4040.
21/08/24 20:33:25 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://john-PC:4040
21/08/24 20:33:25 INFO Executor: Starting executor ID driver on host localhost
21/08/24 20:33:25 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 12935.
21/08/24 20:33:25 INFO NettyBlockTransferService: Server created on john-PC:12935
21/08/24 20:33:25 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/08/24 20:33:25 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, john-PC, 12935, None)
21/08/24 20:33:25 INFO BlockManagerMasterEndpoint: Registering block manager john-PC:12935 with 877.2 MB RAM, BlockManagerId(driver, john-PC, 12935, None)
21/08/24 20:33:25 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, john-PC, 12935, None)
21/08/24 20:33:25 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, john-PC, 12935, None)
21/08/24 20:33:26 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/D:/Regent%20Wan/保存/InScala/spark-warehouse').
21/08/24 20:33:26 INFO SharedState: Warehouse path is 'file:/D:/Regent%20Wan/保存/InScala/spark-warehouse'.
21/08/24 20:33:27 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
21/08/24 20:33:27 INFO InMemoryFileIndex: It took 95 ms to list leaf files for 1 paths.
21/08/24 20:33:27 INFO InMemoryFileIndex: It took 2 ms to list leaf files for 1 paths.
Solution: create a log4j.properties file under the resources directory (src/main/resources, so it lands on the classpath) and add the following configuration:
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Set the default spark-shell log level to ERROR. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=ERROR
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=ERROR
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=ERROR
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=ERROR
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR