1. Cause:
A Hadoop cluster normally runs two or more NameNodes, but only one is active while the other(s) stay in standby. With the right client-side settings, the client does not need to know which NameNode is currently active.
2. Solution:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object Spark_HDFS {
  def main(args: Array[String]): Unit = {
    // Silence Spark's internal logging and the console progress bar
    Logger.getLogger("org").setLevel(Level.OFF)
    System.setProperty("spark.ui.showConsoleProgress", "false")
    // Access HDFS as this user (it must have permission on the target path)
    System.setProperty("HADOOP_USER_NAME", "abby")

    val ss = SparkSession
      .builder()
      .appName("spark 3.0")
      .master("local")
      .getOrCreate()
    val sc = ss.sparkContext // get the SparkContext

    // HA settings: address the nameservice "cluster" instead of a single NameNode,
    // and let the failover proxy provider work out which NameNode is active
    sc.hadoopConfiguration.set("fs.defaultFS", "hdfs://cluster")
    sc.hadoopConfiguration.set("dfs.nameservices", "cluster")
    sc.hadoopConfiguration.set("dfs.ha.namenodes.cluster", "nn1,nn2")
    sc.hadoopConfiguration.set("dfs.namenode.rpc-address.cluster.nn1", "node1:8020")
    sc.hadoopConfiguration.set("dfs.namenode.rpc-address.cluster.nn2", "node2:8020")
    sc.hadoopConfiguration.set("dfs.client.failover.proxy.provider.cluster",
      "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")

    // Read through the nameservice URI; no NameNode hostname appears in the path
    val data = sc.textFile("hdfs://cluster/46062.txt", 3)

    ss.stop()
  }
}
The hadoopConfiguration.set(...) calls above are the settings that have to be changed for the two NameNodes. The actual nameservice name ("cluster" here), the NameNode IDs, and their RPC addresses must match what is written in your Hadoop configuration (hdfs-site.xml).
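If you prefer not to mutate hadoopConfiguration after the session exists, the same HA settings can be passed through the SparkSession builder: Spark copies any property prefixed with "spark.hadoop." into the underlying Hadoop Configuration. Below is a minimal sketch of that variant; the object name Spark_HDFS_Conf and the app name are illustrative, and it assumes the same nameservice "cluster", NameNode IDs nn1/nn2, and hosts node1/node2 as above.

import org.apache.spark.sql.SparkSession

object Spark_HDFS_Conf {
  def main(args: Array[String]): Unit = {
    // Properties prefixed with "spark.hadoop." are copied into the Hadoop Configuration
    val ss = SparkSession
      .builder()
      .appName("spark 3.0 ha-conf")
      .master("local")
      .config("spark.hadoop.fs.defaultFS", "hdfs://cluster")
      .config("spark.hadoop.dfs.nameservices", "cluster")
      .config("spark.hadoop.dfs.ha.namenodes.cluster", "nn1,nn2")
      .config("spark.hadoop.dfs.namenode.rpc-address.cluster.nn1", "node1:8020")
      .config("spark.hadoop.dfs.namenode.rpc-address.cluster.nn2", "node2:8020")
      .config("spark.hadoop.dfs.client.failover.proxy.provider.cluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
      .getOrCreate()

    // Same read as before, resolved through the nameservice
    val data = ss.sparkContext.textFile("hdfs://cluster/46062.txt", 3)
    println(data.count())

    ss.stop()
  }
}

Another common option is to copy core-site.xml and hdfs-site.xml onto the application's classpath (or point HADOOP_CONF_DIR at them), in which case none of these settings need to appear in code at all.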