[Journey to the West] (Beginner) Installing Spark 3.0 on Windows 10 and creating a Spark program with .NET Core
1. Install Java 8 and configure its environment variables
JDK:https://www.oracle.com/java/technologies/javase/javase-jdk8-downloads.html
Verify the installation:
java -version
2. Install Python and configure its environment variables
3. Install Spark
Download: https://spark.apache.org/downloads.html
Extract the spark-3.0.1-bin-hadoop3.2.tgz archive into the D:\spark directory.
Add Hadoop support: on Windows, Spark relies on the winutils.exe helper.
Clone the https://github.com/steveloughran/winutils repository locally,
then copy the bin directory from its hadoop-3.0.0 folder into D:\hadoop.
4. Configure environment variables
- Configure the Spark environment variable:
- Configure the Hadoop environment variable:
- Add the Java, Spark, and Hadoop entries to the PATH variable
- Set the environment variable for Spark's local hostname: SPARK_LOCAL_HOSTNAME = localhost
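If you prefer the command line to the System Properties dialog, the variables above can be persisted from a Command Prompt. This is only a sketch: the variable names SPARK_HOME and HADOOP_HOME and the D:\ paths below are assumptions based on the extraction locations used in this post, so adjust them to where you actually unpacked things.

```shell
:: Windows cmd, sketch only -- paths assume the D:\ locations used in this post
setx SPARK_HOME "D:\spark\spark-3.0.1-bin-hadoop3.2"
setx HADOOP_HOME "D:\hadoop"
setx SPARK_LOCAL_HOSTNAME "localhost"
:: Append the Java, Spark, and Hadoop bin directories to PATH
setx PATH "%PATH%;%JAVA_HOME%\bin;%SPARK_HOME%\bin;%HADOOP_HOME%\bin"
```

Open a new terminal afterwards, since setx only affects newly started processes.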
Check that Spark installed successfully (see Microsoft's official documentation):
spark-submit --version
5. Run Spark
spark-shell
Create a word.txt file with the following content:
Hello Scala
Hello Spark
Hello Scala
Run the command:
scala> sc.textFile("data/word.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).collect
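As a side note for readers new to these operators: the flatMap/map/reduceByKey chain above just splits each line into words and counts how often each word occurs. The same counts can be reproduced with ordinary Unix text tools (purely illustrative; not part of the Windows setup):

```shell
# Split the sample lines into one word per line, then count duplicates --
# the same result the Scala flatMap/map/reduceByKey pipeline computes.
printf 'Hello Scala\nHello Spark\nHello Scala\n' \
  | tr ' ' '\n' | sort | uniq -c
# -> 3 Hello, 2 Scala, 1 Spark
```

Spark returns the same pairs, e.g. (Hello,3), although collect makes no guarantee about their order.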
Open http://localhost:4040/ in a browser to view the Spark UI.
With that, the local Windows Spark environment is ready.
Now it's time for the .NET Core program to take the stage.
Install .NET for Apache Spark (see Microsoft's official documentation)
Download: https://github.com/dotnet/spark/releases/tag/v1.0.0
Extract Microsoft.Spark.Worker.netcoreapp3.1.win-x64-1.0.0.zip into the D:\spark directory and set the environment variable.
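The variable to set here is DOTNET_WORKER_DIR, which tells .NET for Apache Spark where the worker binaries live. A sketch, assuming the zip extracted into a Microsoft.Spark.Worker-1.0.0 folder under D:\spark (the exact folder name depends on how you unzip):

```shell
:: Windows cmd, sketch only -- adjust the path to wherever you extracted the worker
setx DOTNET_WORKER_DIR "D:\spark\Microsoft.Spark.Worker-1.0.0"
```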
Create a "HelloSpark" Console project in Visual Studio 2019.
Contents of HelloSpark.csproj:
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>netcoreapp3.1</TargetFramework>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="Microsoft.Spark" Version="1.0.0" />
  </ItemGroup>
  <ItemGroup>
    <None Update="input.txt">
      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
    </None>
    <None Update="people.json">
      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
    </None>
  </ItemGroup>
</Project>
Contents of people.json:
{"name": "Michael"}
{"name": "Andy", "age": 30}
{"name": "Justin", "age": 19}
Contents of input.txt:
Hello World
This .NET app uses .NET for Apache Spark
This .NET app counts words with Apache Spark
Program.cs:
using Microsoft.Spark.Sql;

namespace HelloSpark
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create (or reuse) a Spark session for this application.
            var spark = SparkSession.Builder().AppName("word_count_sample").GetOrCreate();

            // Load and display the JSON file.
            DataFrame peopleFrame = spark.Read().Json("people.json");
            peopleFrame.Show();

            // Word count: split each line into words, explode into one row
            // per word, then group and count, most frequent first.
            DataFrame dataFrame = spark.Read().Text("input.txt");
            DataFrame words = dataFrame
                .Select(Functions.Split(Functions.Col("value"), " ").Alias("words"))
                .Select(Functions.Explode(Functions.Col("words")).Alias("word"))
                .GroupBy("word")
                .Count()
                .OrderBy(Functions.Col("count").Desc());
            words.Show();

            spark.Stop();
        }
    }
}
Build:
dotnet build
The command to run the .NET Core Spark program is a bit more involved:
%SPARK_HOME%\bin\spark-submit ^
  --class org.apache.spark.deploy.dotnet.DotnetRunner ^
  --master local ^
  bin\Debug\netcoreapp3.1\microsoft-spark-3-0_2.12-1.0.0.jar ^
  dotnet bin\Debug\netcoreapp3.1\HelloSpark.dll
The result of the run:
The Spark framework is no longer a JVM-only club: as of just 18 days ago, .NET has a seat at the table too!
The above is only meant as a starter .NET for Spark program. Having only just begun with Spark myself, setting up the environment and writing a demo is as far as I can take you; I hope that in the near future .NET for Spark will bring more big-data projects to the .NET stack.