Compared with pandas, the pyspark DataFrame API resembles SQL, which makes it fairly easy to pick up.
Setting up a Python 3 environment
miniconda3 is recommended.
Download: https://mirrors.bfsu.edu.cn/anaconda/miniconda/ (choose a py37 build)
conda mirror configuration: https://mirrors.bfsu.edu.cn/help/anaconda/
pip mirror configuration: https://mirrors.bfsu.edu.cn/help/pypi/
Installing miniconda is simple: just run sh minicondaxxxxxx.sh
Then pick an editor, or use PyCharm.
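
Before moving on, it can help to confirm that the environment has what the next section needs. The sketch below is only an illustration and makes two assumptions that the steps above do not state: that pyspark has been installed into this environment (for example with pip install pyspark) and that a Java runtime is on the PATH, since Spark runs on the JVM. The helper name check_environment is hypothetical.

```python
import importlib.util
import shutil
import sys


def check_environment():
    """Return a dict of simple readiness checks for running pyspark locally."""
    return {
        # The tutorial targets Python 3 (py37 in the miniconda download step)
        "python3": sys.version_info.major == 3,
        # Assumption: pyspark was installed separately, e.g. with pip
        "pyspark_installed": importlib.util.find_spec("pyspark") is not None,
        # Assumption: Spark needs a Java runtime reachable through PATH
        "java_on_path": shutil.which("java") is not None,
    }


report = check_environment()
for name, ok in report.items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```

If any entry prints "missing", fixing it first will likely save time when starting the Spark session.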
Running pyspark in local (single-machine) mode
Prepare the dataset data.csv
name,age
張三,24
李四,25
小紅,22
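
If you would rather generate the file than type it by hand, a small script along these lines should work. It is a sketch using only the standard library; the helper name write_sample_csv is hypothetical, and UTF-8 is assumed as the encoding so that the Chinese names are stored in a form Spark's CSV reader accepts by default.

```python
import csv

# The same rows as the sample dataset: a header line plus three records
ROWS = [
    ("name", "age"),
    ("張三", "24"),
    ("李四", "25"),
    ("小紅", "22"),
]


def write_sample_csv(path="data.csv"):
    """Write the sample rows to `path` as UTF-8 CSV and return the path."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        # Use plain "\n" line endings instead of the csv module default "\r\n"
        csv.writer(f, lineterminator="\n").writerows(ROWS)
    return path


write_sample_csv()
```

Running it once in the working directory creates data.csv next to your script or notebook.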
Write the following code. Running it in Jupyter works even better.
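from pyspark.sql import SparkSession

# Start a local session that uses all available cores
spark = SparkSession.builder.master("local[*]").getOrCreate()
print("\n\napp start")

# Read data.csv, treating the first row as column names
df = spark.read.option('header','true').csv("data.csv")
df.printSchema()
df.show()

# Keep only the rows whose age is below 25
df.filter("age<25").show()
spark.stop()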
20/12/05 22:14:07 WARN Utils: Your hostname, shuai-virtual-machine resolves to a loopback address: 127.0.1.1; using 192.168.153.128 instead (on interface ens33)
20/12/05 22:14:07 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/shuai/miniconda3/lib/python3.7/site-packages/pyspark/jars/spark-unsafe_2.12-3.0.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
20/12/05 22:14:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

app start
root
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)

+----+---+
|name|age|
+----+---+
|張三| 24|
|李四| 25|
|小紅| 22|
+----+---+

+----+---+
|name|age|
+----+---+
|張三| 24|
|小紅| 22|
+----+---+