RDD的cache 與 checkpoint 的區別

本文轉載自查看原文 2018-12-07 11:05 570 Hadoop/ Spark

問題：cache 與 checkpoint 的區別？

關於這個問題，Tathagata Das 有一段回答: There is a significant difference between cache and checkpoint. Cache materializes the RDD and keeps it in memory and/or disk（其實只有 memory）. But the lineage（也就是 computing chain） of RDD (that is, seq of operations that generated the RDD) will be remembered, so that if there are node failures and parts of the cached RDDs are lost, they can be regenerated. However, checkpoint saves the RDD to an HDFS file and actually forgets the lineage completely. This is allows long lineages to be truncated and the data to be saved reliably in HDFS (which is naturally fault tolerant by replication). 深入一點討論，rdd.persist(StorageLevel.DISK_ONLY) 與 checkpoint 也有區別。前者雖然可以將 RDD 的 partition 持久化到磁盤，但該 partition 由 blockManager 管理。一旦 driver program 執行結束，也就是 executor 所在進程 CoarseGrainedExecutorBackend stop，blockManager 也會 stop，被 cache 到磁盤上的 RDD 也會被清空（整個 blockManager 使用的 local 文件夾被刪除）。而 checkpoint 將 RDD 持久化到 HDFS 或本地文件夾，如果不被手動 remove 掉（話說怎么 remove checkpoint 過的 RDD？），是一直存在的，也就是說可以被下一個 driver program 使用，而 cached RDD 不能被其他 dirver program 使用。 Hadoop MapReduce 在執行 job 的時候，不停地做持久化，每個 task 運行結束做一次，每個 job 運行結束做一次（寫到 HDFS）。在 task 運行過程中也不停地在內存和磁盤間 swap 來 swap 去。可是諷刺的是，Hadoop 中的 task 太傻，中途出錯需要完全重新運行，比如 shuffle 了一半的數據存放到了磁盤，下次重新運行時仍然要重新 shuffle。Spark 好的一點在於盡量不去持久化，所以使用 pipeline，cache 等機制。用戶如果感覺 job 可能會出錯可以手動去 checkpoint 一些 critical 的 RDD，job 如果出錯，下次運行時直接從 checkpoint 中讀取數據。唯一不足的是，checkpoint 需要兩次運行 job。

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 Spark cache、checkpoint機制筆記 spark 緩存操作(cache checkpoint)與分區關於checkpoint spark RDD 的map與flatmap區別說明 Spark中RDD、DataFrame和DataSet的區別 Linux中cache和buff的區別淺談Cookie、Session與Cache的區別 Ehcache與Guava Cache的區別淺談 spark中的cache和persist的區別 jquery 中cache為true與false 的區別