class SparkContext extends Logging with ExecutorAllocationClient
Main entry point for Spark functionality.
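For orientation, a SparkContext is typically built from a SparkConf. The sketch below is illustrative only; the app name and master URL are placeholders, and the later examples reuse this sc:

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: the app name and master URL are placeholder values.
val conf = new SparkConf().setAppName("demo").setMaster("local[*]")
val sc = new SparkContext(conf)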
def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)(implicit arg0: ClassTag[T]): RDD[T]
Distribute a local Scala collection to form an RDD.
Note
parallelize acts lazily. If seq is a mutable collection and is altered after the call to parallelize and before the first action on the RDD, the resultant RDD will reflect the modified collection. Pass a copy of the argument to avoid this.
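A small sketch of this pitfall, reusing the sc from above (the collection contents are illustrative):

import scala.collection.mutable.ArrayBuffer

val data = ArrayBuffer(1, 2, 3)

// parallelize is lazy: no snapshot of `data` is taken here.
val rdd = sc.parallelize(data)

// Mutating the collection before the first action can change the result:
// collect() may now return Array(1, 2, 3, 4) rather than Array(1, 2, 3).
data += 4
rdd.collect()

// Passing an immutable copy fixes the contents at call time.
val safeRdd = sc.parallelize(data.toList)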
python:
checkpoint(self)
Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with SparkContext.setCheckpointDir() and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation.
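A minimal sketch of the intended call order in Scala, assuming the sc from above; the checkpoint directory is a placeholder:

// On a cluster this must be an HDFS path; /tmp is for local runs only.
sc.setCheckpointDir("/tmp/spark-checkpoints")

val rdd = sc.parallelize(1 to 100).map(_ * 2)

// Persist first, as recommended: the checkpoint is written by a separate job
// after the first action, and a cached copy avoids recomputing the lineage.
rdd.persist()

// Must be called before any job has been executed on this RDD.
rdd.checkpoint()

// The first action computes the RDD and triggers the checkpoint write;
// afterwards the RDD's lineage is truncated to the checkpoint file.
rdd.count()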
scala:
def setCheckpointDir(directory: String): Unit
Set the directory under which RDDs are going to be checkpointed. The directory must be an HDFS path if running on a cluster.
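For instance (both directories and the namenode host/port below are hypothetical):

// Local run: any writable local directory is fine.
sc.setCheckpointDir("/tmp/spark-checkpoints")

// Cluster run: the directory must be an HDFS path.
sc.setCheckpointDir("hdfs://namenode:8020/user/spark/checkpoints")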
