Purpose
I have just started learning Spark. The installation is the CDH build, version spark-core_2.11-2.4.0-cdh6.2.1, with a CDH client deployed on a machine that is not a cluster node. Using spark-shell as the example, this article briefly analyzes how a Spark job is submitted from a CDH client, to deepen understanding of the process.
Running spark-shell
After starting spark-shell, you can see that an application has been launched on the YARN cluster. This is because CDH Spark submits jobs in yarn-client mode by default: the Driver runs locally on the client, while the job itself executes on the YARN cluster.
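To make the default explicit, or to double-check where the Driver runs, the mode can also be passed on the command line. This is only a sketch of the equivalent invocation (the yarn master additionally requires HADOOP_CONF_DIR to point at the cluster configuration, which is covered later):

# Explicit equivalent of the CDH default: Driver in the local spark-shell JVM,
# executors running as a YARN application on the cluster
spark-shell --master yarn --deploy-mode client
# Inside the shell, sc.master should then report "yarn"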
Analysis of the spark-shell startup process
Looking at the path and contents of the spark-shell command on the client, $LIB_DIR is /opt/cloudera/parcels/CDH/lib, so the script that is actually executed is /opt/cloudera/parcels/CDH/lib/spark/bin/spark-shell.
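The wrapper found on the PATH is not reproduced in full here; a minimal sketch of the relevant part, assuming the usual CDH parcel layout, is:

#!/bin/bash
# Sketch only: resolve the parcel's lib directory and delegate to the real script
LIB_DIR=/opt/cloudera/parcels/CDH/lib
exec $LIB_DIR/spark/bin/spark-shell "$@"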
Continuing with /opt/cloudera/parcels/CDH/lib/spark/bin/spark-shell, the key parts of the script are as follows:
#!/usr/bin/env bash

if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]"

SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dscala.usejavacp=true"

function main() {
  export SPARK_SUBMIT_OPTS
  "${SPARK_HOME}"/bin/spark-submit --class org.apache.spark.repl.Main --name "Spark shell" "$@"
}

main "$@"
The script above first checks whether SPARK_HOME is set; if not, it sources the find-spark-home script in the same directory. That script returns immediately if SPARK_HOME is already set. Otherwise it checks whether a find_spark_home.py file exists in the same directory: if it does, it runs it with python and uses the result; if it does not, it uses the parent of the current bin directory as SPARK_HOME. In this environment SPARK_HOME is therefore set to /opt/cloudera/parcels/CDH/lib/spark. With SPARK_HOME set, spark-shell then calls the spark-submit script.
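The find-spark-home logic described above looks roughly like this (a paraphrase of the Spark 2.4 script, with comments added):

#!/usr/bin/env bash
FIND_SPARK_HOME_PYTHON_SCRIPT="$(cd "$(dirname "$0")"; pwd)/find_spark_home.py"

# Short-circuit if SPARK_HOME is already set.
if [ ! -z "${SPARK_HOME}" ]; then
  exit 0
elif [ ! -f "$FIND_SPARK_HOME_PYTHON_SCRIPT" ]; then
  # Not a pip install: SPARK_HOME is the parent of this bin directory,
  # i.e. /opt/cloudera/parcels/CDH/lib/spark in this environment.
  export SPARK_HOME="$(cd "$(dirname "$0")"/..; pwd)"
else
  # pip-installed PySpark: let the Python helper locate SPARK_HOME.
  export SPARK_HOME=$("${PYSPARK_DRIVER_PYTHON:-python}" "$FIND_SPARK_HOME_PYTHON_SCRIPT")
fi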
Looking at the spark-submit script, it simply calls "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit with the original arguments. Continuing into the spark-class script, its main content is as follows:
if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

. "${SPARK_HOME}"/bin/load-spark-env.sh

# Find the java binary
if [ -n "${JAVA_HOME}" ]; then
  RUNNER="${JAVA_HOME}/bin/java"
else
  if [ "$(command -v java)" ]; then
    RUNNER="java"
  else
    echo "JAVA_HOME is not set" >&2
    exit 1
  fi
fi

# Find Spark jars.
if [ -d "${SPARK_HOME}/jars" ]; then
  SPARK_JARS_DIR="${SPARK_HOME}/jars"
else
  SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"
fi

if [ ! -d "$SPARK_JARS_DIR" ] && [ -z "$SPARK_TESTING$SPARK_SQL_TESTING" ]; then
  echo "Failed to find Spark jars directory ($SPARK_JARS_DIR)." 1>&2
  echo "You need to build Spark with the target \"package\" before running this program." 1>&2
  exit 1
else
  LAUNCH_CLASSPATH="$SPARK_JARS_DIR/*"
fi

# Add the launcher build dir to the classpath if requested.
if [ -n "$SPARK_PREPEND_CLASSES" ]; then
  LAUNCH_CLASSPATH="${SPARK_HOME}/launcher/target/scala-$SPARK_SCALA_VERSION/classes:$LAUNCH_CLASSPATH"
fi

build_command() {
  "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
  printf "%d\0" $?
}

# Turn off posix mode since it does not allow process substitution
set +o posix
CMD=()
while IFS= read -d '' -r ARG; do
  CMD+=("$ARG")
done < <(build_command "$@")

COUNT=${#CMD[@]}
LAST=$((COUNT - 1))
LAUNCHER_EXIT_CODE=${CMD[$LAST]}
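The tail of spark-class (omitted above) then checks the exit code appended by the launcher and execs the assembled command; roughly:

# Sketch of the remainder of spark-class
if [ $LAUNCHER_EXIT_CODE != 0 ]; then
  exit $LAUNCHER_EXIT_CODE
fi

# Drop the trailing exit code and exec the java command built by
# org.apache.spark.launcher.Main; for spark-shell this is a java invocation of
# org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main ...
CMD=("${CMD[@]:0:$LAST}")
exec "${CMD[@]}"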
In spark-class, SPARK_HOME is set first, then load-spark-env.sh is sourced, the Spark dependencies under ${SPARK_HOME}/jars are added to the launch classpath, and finally the org.apache.spark.launcher.Main class is run to build the command to execute. Continuing into load-spark-env.sh: this script mainly sets environment variables. It first sets SPARK_HOME, then sets ${SPARK_CONF_DIR} and sources the spark-env.sh found there. SPARK_CONF_DIR defaults to the conf directory under SPARK_HOME, which in this environment is /opt/cloudera/parcels/CDH/lib/spark/conf. The key content is as follows:
if [ -z "${SPARK_HOME}" ]; then
  source "$(dirname "$0")"/find-spark-home
fi

# Save SPARK_HOME in case the user's spark-env.sh overwrites it.
ORIGINAL_SPARK_HOME="$SPARK_HOME"

if [ -z "$SPARK_ENV_LOADED" ]; then
  export SPARK_ENV_LOADED=1
  export SPARK_CONF_DIR="${SPARK_CONF_DIR:-"${SPARK_HOME}"/conf}"
  if [ -f "${SPARK_CONF_DIR}/spark-env.sh" ]; then
    # Promote all variable declarations to environment (exported) variables
    set -a
    . "${SPARK_CONF_DIR}/spark-env.sh"
    set +a
  fi
fi
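The rest of load-spark-env.sh (not shown) also derives SPARK_SCALA_VERSION, which spark-class uses above for its fallback jars path; a rough paraphrase:

# Sketch: pick the Scala version based on which assembly directory exists
if [ -z "$SPARK_SCALA_VERSION" ]; then
  if [ -d "${SPARK_HOME}/assembly/target/scala-2.11" ]; then
    export SPARK_SCALA_VERSION="2.11"
  else
    export SPARK_SCALA_VERSION="2.12"
  fi
fi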
Continuing with the contents of spark-env.sh: this script hard-codes the SPARK_HOME and HADOOP_HOME directories. Also important are HADOOP_CONF_DIR and HIVE_CONF_DIR: if they are not already set, they default to the configuration files shipped with CDH; otherwise the user-supplied values are used. In our environment both variables are set in ~/.bashrc, so spark-shell knows how to reach the YARN cluster when it starts. It is recommended to set these two variables when running jobs with spark-sql or in YARN mode (a sketch is given after the script below).
#!/usr/bin/env bash
SELF="$(cd $(dirname $BASH_SOURCE) && pwd)"
if [ -z "$SPARK_CONF_DIR" ]; then
  export SPARK_CONF_DIR="$SELF"
fi

export SPARK_HOME=/opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/lib/spark

SPARK_PYTHON_PATH=""
if [ -n "$SPARK_PYTHON_PATH" ]; then
  export PYTHONPATH="$PYTHONPATH:$SPARK_PYTHON_PATH"
fi

export HADOOP_HOME=/opt/cloudera/parcels/CDH-6.2.1-1.cdh6.2.1.p0.1425774/lib/hadoop
export HADOOP_COMMON_HOME="$HADOOP_HOME"

if [ -n "$HADOOP_HOME" ]; then
  LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${HADOOP_HOME}/lib/native
fi

SPARK_EXTRA_LIB_PATH=""
if [ -n "$SPARK_EXTRA_LIB_PATH" ]; then
  LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$SPARK_EXTRA_LIB_PATH
fi
export LD_LIBRARY_PATH

HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-$SPARK_CONF_DIR/yarn-conf}
HIVE_CONF_DIR=${HIVE_CONF_DIR:-/etc/hive/conf}
if [ -d "$HIVE_CONF_DIR" ]; then
  HADOOP_CONF_DIR="$HADOOP_CONF_DIR:$HIVE_CONF_DIR"
fi
export HADOOP_CONF_DIR

PYLIB="$SPARK_HOME/python/lib"
if [ -f "$PYLIB/pyspark.zip" ]; then
  PYSPARK_ARCHIVES_PATH=
  for lib in "$PYLIB"/*.zip; do
    if [ -n "$PYSPARK_ARCHIVES_PATH" ]; then
      PYSPARK_ARCHIVES_PATH="$PYSPARK_ARCHIVES_PATH,local:$lib"
    else
      PYSPARK_ARCHIVES_PATH="local:$lib"
    fi
  done
  export PYSPARK_ARCHIVES_PATH
fi

if [ -f "$SELF/classpath.txt" ]; then
  export SPARK_DIST_CLASSPATH=$(paste -sd: "$SELF/classpath.txt")
fi
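As noted above, in this environment HADOOP_CONF_DIR and HIVE_CONF_DIR come from the user's ~/.bashrc rather than from the fallbacks in spark-env.sh. A minimal sketch of that setup (the paths are illustrative and must match your cluster's client configuration):

# ~/.bashrc on the CDH client (illustrative paths)
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_CONF_DIR=/etc/hive/conf

With these exported, spark-shell, spark-sql and spark-submit pick up the YARN and Hive settings of the target cluster when run in yarn mode.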