Multithreaded and Multi-Node Parallel Computing in R

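Part 1: R itself is single-threaded. How do we get it to use several cores and speed up computation?

Parallel Computing with the parallel and foreach Packages

The article above covers everything you need. Put simply, load the parallel package, rewrite your own code a little, and you are done.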

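#----- An example to show how R can compute in parallel
# func walks the Collatz (3x + 1) sequence from x down to 1
func <- function(x) {
  n <- 1      # number of terms seen so far, including the starting value
  raw <- x    # remember the starting value
  while (x > 1) {
    x <- ifelse(x %% 2 == 0, x / 2, 3 * x + 1)
    n <- n + 1
  }
  return(c(raw, n))
}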

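#----
library(parallel)
# Use system.time to report how long the computation takes
system.time({
  x <- 1:1e5
  cl <- makeCluster(4)                  # start a cluster of four worker processes
  results <- parLapply(cl, x, func)     # parallel version of lapply
  res.df <- do.call('rbind', results)   # combine the results into one matrix
  stopCluster(cl)                       # shut down the cluster
})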

   user  system elapsed
  0.431   0.062  18.954

Running func over 1:1e5 (100,000 values) took only 18.954 seconds of elapsed time.
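
The timing above covers only the parallel run. To judge the gain, the same job can be timed sequentially with lapply and the two results compared. What follows is a minimal sketch, not the author's benchmark: the input is cut down to 1:2000 and the cluster to two workers (both assumptions, chosen to keep the run short), and collatz_len is a hypothetical name for a function with the same logic as func.

```r
library(parallel)

# Same logic as func: walk the Collatz sequence from x down to 1 and
# return the starting value together with the number of terms visited.
# (Uses if/else instead of ifelse, since x is always a single number.)
collatz_len <- function(x) {
  n <- 1
  raw <- x
  while (x > 1) {
    x <- if (x %% 2 == 0) x / 2 else 3 * x + 1
    n <- n + 1
  }
  c(raw, n)
}

x <- 1:2000  # assumed input size, much smaller than the 1:1e5 used in the post

# Sequential baseline
t_seq <- system.time(res_seq <- lapply(x, collatz_len))

# Parallel run on two worker processes (assumed worker count)
cl <- makeCluster(2)
t_par <- system.time(res_par <- parLapply(cl, x, collatz_len))
stopCluster(cl)

# Both approaches must produce exactly the same list
stopifnot(identical(res_seq, res_par))

# Elapsed seconds for each approach
print(c(sequential = t_seq[["elapsed"]], parallel = t_par[["elapsed"]]))
```

On an input this small the parallel run may well lose, because sending tasks to the workers and collecting the results has a cost of its own. The benefit should show up on larger inputs such as 1:1e5.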

#--- Plot the results (see Figure 1). The plot looks rather odd.
library(ggplot2)
df <- as.data.frame(res.df)
ggplot(df, aes(x = V1, y = V2)) + geom_point()

－－－－－－－－－－－－

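Figure 1: the results of func, with V1 (the starting value) on the x-axis and V2 (the count n) on the y-axis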

－－－－－－－－－－－

Figure 2: CPU usage. Four R worker processes are running and CPU usage jumps to nearly 100% at once. My poor computer.

－－－－－－－－－

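Using parallel in a web scraper: below, a scraper is used to test how well parallel performs.

Note that the package loading has to be written inside the function, because every worker needs to load the packages for itself.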

getdata <- function(i){
  # Load the packages inside the function so that every worker has them
  library(magrittr)
  library(proto)
  library(gsubfn)
  library(bitops)
  library(rvest)
  library(stringr)
  library(DBI)
  library(RSQLite)
  #library(sqldf)
  library(RCurl)
  #library(ggplot2)
  library(sp)
  library(raster)
  url <- paste0("http://www.cnblogs.com/pick/",i,"/")  # generate url
  # Footer text of every post on the page, split into lines
  combined_info <- url%>%html_session()%>%html_nodes("div.post_item div.post_item_foot")%>%html_text()%>%strsplit(split="\r\n")
  post_date <- sapply(combined_info, function(v) return(v[3]))%>%str_sub(9,24)%>%as.POSIXlt()  # get the date
  post_year <- post_date$year+1900
  post_month <- post_date$mon+1
  post_day <- post_date$mday
  post_hour <- post_date$hour
  post_weekday <- weekdays(post_date)
  # Title, link and author of every post
  title <- url%>%html_session()%>%html_nodes("div.post_item h3")%>%html_text()%>%as.character()%>%trim()
  link <- url%>%html_session()%>%html_nodes("div.post_item a.titlelnk")%>%html_attr("href")%>%as.character()
  author <- url%>%html_session()%>%html_nodes("div.post_item a.lightblue")%>%html_text()%>%as.character()%>%trim()
  author_hp <- url%>%html_session()%>%html_nodes("div.post_item a.lightblue")%>%html_attr("href")%>%as.character()
  # Recommendation, view and comment counts
  recommendation <- url%>%html_session()%>%html_nodes("div.post_item span.diggnum")%>%html_text()%>%trim()%>%as.numeric()
  article_view <- url%>%html_session()%>%html_nodes("div.post_item span.article_view")%>%html_text()%>%str_sub(4,20)
  article_view <- gsub(")","",article_view)%>%trim()%>%as.numeric()
  article_comment <- url%>%html_session()%>%html_nodes("div.post_item span.article_comment")%>%html_text()%>%str_sub(14,100)
  article_comment <- gsub(")","",article_comment)%>%trim()%>%as.numeric()
  # One row per post
  data.frame(title,recommendation,article_view,article_comment,post_date,post_weekday,post_year,post_month,post_day,post_hour,link,author,author_hp)
}
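
getdata() calls library() every time it is invoked. An alternative worth knowing is clusterEvalQ() from the parallel package, which evaluates an expression once on each worker, so the packages can be attached right after the cluster is created. The sketch below is only an illustration under assumptions: it uses the tools package (shipped with R) as a stand-in for rvest, stringr and the rest, and get_ext is a hypothetical helper that has nothing to do with the scraper.

```r
library(parallel)

cl <- makeCluster(2)  # assumed worker count

# Attach the package once on every worker instead of inside the task function.
# tools stands in for the scraping packages used by getdata().
invisible(clusterEvalQ(cl, library(tools)))

# Hypothetical task function: it relies on file_ext() from tools,
# which is available on the workers thanks to the call above.
get_ext <- function(path) file_ext(path)

exts <- parLapply(cl, c("a.txt", "b.csv"), get_ext)
stopCluster(cl)

print(unlist(exts))
```

Whether this is preferable to loading inside the function is a matter of taste: the in-function version is self-contained, while clusterEvalQ() keeps the task function short and does the setup in one place.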

#-------- Method 1: a plain loop

df <- data.frame()

system.time({
  for (i in 1:73) {
    df <- rbind(df, getdata(i))
  }
})

   user  system elapsed
 21.605   0.938  95.918

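#-------- Method 2: parallel computation
library(parallel)
system.time({
  x <- 1:73
  cl <- makeCluster(4)                   # start a cluster of four worker processes
  results <- parLapply(cl, x, getdata)   # parallel version of lapply
  jinghua <- do.call('rbind', results)   # combine the results into one data frame
  stopCluster(cl)                        # shut down the cluster
})

   user  system elapsed
  0.155   0.122  32.674

The parallel version is clearly much faster: 32.674 seconds elapsed against 95.918 for the loop, roughly a threefold speedup.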
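
One practical caveat when scraping in parallel: if a single call raises an error (a timeout, a page that fails to parse), parLapply stops with an error and the results of the other pages are lost. A common way around this is to wrap the task in tryCatch. The sketch below assumes a hypothetical fake_getdata that fails on page 3 in place of the real network request. safe_getdata is likewise an illustrative name, not part of the original code.

```r
library(parallel)

# Hypothetical stand-in for getdata(): page 3 fails, imitating a bad request.
fake_getdata <- function(i) {
  if (i == 3) stop("request failed for page ", i)
  data.frame(page = i, title = paste("post", i))
}

# Wrapper that turns an error into NULL so the other pages still come back.
# The fetch function is passed as an argument, so it travels to the workers
# together with the call.
safe_getdata <- function(i, fetch) {
  tryCatch(fetch(i), error = function(e) NULL)
}

cl <- makeCluster(2)  # assumed worker count
results <- parLapply(cl, 1:5, safe_getdata, fetch = fake_getdata)
stopCluster(cl)

# Separate the failures from the successes before combining
is_failed <- vapply(results, is.null, logical(1))
failed <- which(is_failed)
pages <- do.call(rbind, results[!is_failed])

print(failed)      # indices of the pages that need a retry
print(pages$page)  # pages that were scraped successfully
```

The failed indices can then be retried on their own instead of rerunning all 73 pages.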

－－－

The scraped data looks like this. It is information about the featured posts on cnblogs (博客園).

－－－－－－ divider －－－－－－－－－－－－－－－－－－－－－－－－－

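Part 2: Deploying R on a Linux server

I will write up the pitfalls once the deployment is finished. In the meantime, 肖楠's "WEB SCRAPING WITH R" lays out the advantages of running R on Linux: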

Why Linux?

• Network performance & mem. management → Faster

• Better parallelization support → Faster

• Unified encoding & locale → Faster (for coders)

• More recent third party libs → Faster (less bugs)
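
One plausible reading of the "better parallelization support" point (the slides do not spell it out) is that on Linux and other Unix-like systems the parallel package also offers mclapply(), which forks the running R process instead of starting fresh workers. Forked workers see the objects and loaded packages of the parent session, so no cluster setup or export step is needed. A minimal sketch, with an illustrative task and an assumed core count:

```r
library(parallel)

# Forking is not available on Windows, where mc.cores must stay at 1.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Defined in the parent session. Forked workers can read it directly.
offset <- 10

res <- mclapply(1:5, function(i) i^2 + offset, mc.cores = n_cores)
print(unlist(res))
```

With makeCluster and parLapply, the same task would need offset to be sent to the workers explicitly, for example with clusterExport().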

I am looking forward to having our analysis environment up and running.

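Part 3: Summary

－－－－－－

To speed up R, there are several places to start:

1. Drop data.frame in favor of data.table, and optimize the code.

2. Use R's own parallel package to compute in parallel and raise CPU utilization.

3. Move to a powerful server, for example one with 16 cores and 128 GB of memory.

4. Cluster several large machines with RHadoop or SparkR.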

－－－－－－－

Latest SparkR developments, kept here for reference.

Announcing SparkR: R on Spark

SparkR github

SparkR (R on Spark)

Documentation for package ‘SparkR’ version 1.4.1

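SparkR sounds impressive, but it still has a long way to go. I once worked with engineers from Transwarp on functional testing of a SparkR environment. The conclusion: getting local R code to run correctly under SparkR takes heavy changes, because the R code that SparkR expects is not the same as ordinary R code. Spark's data structure is the RDD (Resilient Distributed Dataset), a fault-tolerant, parallel data structure that lets users explicitly store data on disk or in memory and control how the data is partitioned. "Announcing SparkR: R on Spark" has just announced that version 1.4.1 will support data.frame, so I look forward to SparkR becoming easier to use.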

－－－－－－－

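Below are some decks from the China R Conferences, kept here for reference.

The 7th China R Conference (Beijing): summary [with slides]

The 7th China R Conference (Guangzhou): summary [with slides]

The 6th China R Conference (Beijing): summary

The 6th China R Conference (Shanghai): summary

The 5th China R Conference (Shanghai venue): summary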
