Getting Website Data in R (rvest): A Web Scraping Primer


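Python is often said to be the strongest tool for scraping, but dynamically loaded pages and sites that require a login are still hard work in Python, and for most ordinary scraping jobs R is very convenient. This article introduces scraping with the rvest package in R, mainly through the functions read_html(), html_nodes(), html_text(), and html_attrs().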

 

 

rvest: Easily Harvest (Scrape) Web Pages

CRAN - Package rvest (r-project.org)

tidyverse/rvest: Simple web scraping for R (github.com)

 

First, install rvest:

install.packages("rvest")

  

Once it is installed, load it:

library(rvest)

  

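Function                 Purpose
read_html()              Read an HTML page
html_nodes()             Extract all nodes that match a selector
html_node()              Return the first match for each input node, so the result has the same length as the input; on a single page this is like html_nodes()[[1]]
html_table()             Parse the tables in <table> tags into data frames; trim = TRUE by default, and header = TRUE treats the first row as column names
html_text()              Extract the text inside a tag; trim = TRUE strips leading and trailing whitespace
html_attrs(nodes)        Extract all attributes of the given nodes with their values; returns a list
html_attr(nodes, attr)   Extract the value of one attribute of the nodes
html_children()          Extract the child nodes of a node
html_session()           Create a session (renamed session() in rvest 1.0.0)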
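To make the table concrete, here is a minimal sketch that runs the main functions against a small hand-written page instead of a live site. The HTML string, the example.com URLs, and the variable names are all made up for illustration; only the rvest functions come from the table above.

```r
library(rvest)

# A tiny hand-written page (hypothetical), so the example runs offline
page <- read_html('
<html><body>
  <p class="intro">  Hello rvest  </p>
  <a href="https://example.com/a" title="first">A</a>
  <a href="https://example.com/b">B</a>
  <table>
    <tr><th>name</th><th>code</th></tr>
    <tr><td>alpha</td><td>1</td></tr>
    <tr><td>beta</td><td>2</td></tr>
  </table>
</body></html>')

# All <a> nodes, then their text and one attribute
links     <- html_nodes(page, "a")
link_text <- html_text(links)
hrefs     <- html_attr(links, "href")

# Every attribute of every link, as a list with one entry per node
all_attrs <- html_attrs(links)

# First matching node only, with surrounding whitespace trimmed
intro <- html_text(html_node(page, "p.intro"), trim = TRUE)

# A single <table> node converted to a data frame, first row used as header
tab <- html_table(html_node(page, "table"), header = TRUE)
```

Working against an inline string like this is a convenient way to test a selector before pointing it at a real URL.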

 

Examples:

1. Scraping the constituent list of the SSE Composite Index

Site: Shanghai Stock Exchange, SSE Composite Index constituent list (sse.com.cn)    http://www.sse.com.cn/market/sseindex/indexlist/s/i000001/const_list.shtml

 

 

 

 

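First use Chrome to find the XPath of the part of the page that holds the table: right-click the beginning of the table and choose "Inspect", which opens an HTML source panel on the right side of the browser.

The highlighted part is the source for the beginning of the table. Click the enclosing <table class="tablestyle"> element above it, right-click, and choose "Copy > Copy XPath" to get the following XPath: '//*[@id="content_ab"]/div[1]/table'

Then use rvest's html_nodes() to extract the part of the page identified by that XPath, and html_table() to convert the HTML table into a data frame. The result is a list of data frames; since it contains only one, take the first element.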

 

library(rvest)

## Page URL
urlb <- "http://www.sse.com.cn/market/sseindex/indexlist/s/i000001/const_list.shtml"
## XPath of the data table in the page
xpath <- '//*[@id="content_ab"]/div[1]/table'
## Read the page and extract the table node
nodes <- html_nodes(read_html(urlb), xpath = xpath)
## Convert the table nodes into a list of data frames
tables <- html_table(nodes)
restab <- tables[[1]]
head(restab)
##                      X1                     X2
## 1 浦發銀行\r\n (600000) 白雲機場\r\n (600004)
## 2 中國國貿\r\n (600007) 首創股份\r\n (600008)
##                      X3
## 1 東風汽車\r\n (600006)
## 2 上海機場\r\n (600009)

  

 

 

Each row holds three stocks. We remove the \r\n and the spaces from the data, then split it into separate name and code columns:

 

library(tidyverse)

pat1 <- "^(.*?)\\((.*?)\\)"
tab1 <- restab %>%
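  ## Combine the three columns into one character vector
  reduce(c) %>%
  ## Remove spaces and line breaks; the result is still a character vector
  stringr::str_replace_all("[[:space:]]", "") %>%
  ## Extract company name and code, one matrix row per stock; the result is a character matrix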
  stringr::str_match(pat1) 
tab <- tibble(
  name = tab1[,2],
  code = tab1[,3])
head(tab)
## # A tibble: 6 x 2
##   name     code  
##   <chr>    <chr> 
## 1 浦發銀行 600000
## 2 中國國貿 600007
## 3 包鋼股份 600010
## 4 華夏銀行 600015
## 5 上港集團 600018
## 6 上海電力 600021

  

 

str(tab)
## Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    1551 obs. of  2 variables:
##  $ name: chr  "浦發銀行" "中國國貿" "包鋼股份" "華夏銀行" ...
##  $ code: chr  "600000" "600007" "600010" "600015" ...
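The cleaning step can be checked on a single cell in base R, without the tidyverse. This is only a sketch: the cell value is copied from the head(restab) output above, and regexec() with regmatches() stands in for stringr::str_match().

```r
# One raw cell as it appears in the scraped table
cell <- "浦發銀行\r\n (600000)"

# Remove every whitespace character (spaces, \r, \n)
clean <- gsub("[[:space:]]", "", cell)

# Same pattern as pat1: lazy match up to "(", then the text inside the parentheses
pat1 <- "^(.*?)\\((.*?)\\)"
m <- regmatches(clean, regexec(pat1, clean, perl = TRUE))[[1]]

# m[1] is the whole match; m[2] and m[3] are the two capture groups
name <- m[2]
code <- m[3]
```

The capture groups land in positions 2 and 3, which is why the tibble above is built from tab1[,2] and tab1[,3].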

  

 

 

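For pages that do not follow a regular structure, you can download the page file with download.file(), strip the parts you do not need with str_replace_all() or gsub(), and locate the key lines with str_which() or grep().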
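A minimal base R sketch of that line-based approach follows. To keep it runnable offline, a temporary file with made-up content stands in for the file that download.file() would save; the "Stock:" marker and the stock names are hypothetical.

```r
# Stand-in for a downloaded page. In practice this file would come from
# download.file(url, destfile = path); the content below is hypothetical.
path <- tempfile(fileext = ".html")
writeLines(c(
  "<html><body>",
  "<div>Stock: <b>Alpha</b> (600000)</div>",
  "<div>unrelated line</div>",
  "<div>Stock: <b>Beta</b> (600004)</div>",
  "</body></html>"
), path)

lines <- readLines(path, encoding = "UTF-8")

# Locate the key lines by a fixed marker string
key <- grep("Stock:", lines, fixed = TRUE)

# Strip the HTML tags from those lines
stripped <- gsub("<[^>]+>", "", lines[key])

# Keep only the digits inside the parentheses
codes <- gsub(".*\\(([0-9]+)\\).*", "\\1", stripped)
```

This kind of text processing is fragile compared with a proper HTML parse, so it is best kept for pages where html_nodes() cannot find a usable structure.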

 

 

 

2. Scraping the 100 most popular feature films of 2016 on IMDb

Feature Film, Released between 2016-01-01 and 2016-12-31 (Sorted by Popularity Ascending) - IMDb

 

# Load the package
library('rvest')

# URL to scrape
url <- 'https://www.imdb.com/search/title/?count=100&release_date=2016,2016&title_type=feature'

# Read the HTML from the page
webpage <- read_html(url)

  

# Use a CSS selector to get the ranking section
rank_data_html <- html_nodes(webpage,'.text-primary')

# Convert the rankings to text
rank_data <- html_text(rank_data_html)

# Inspect the data
head(rank_data)

[1] "1." "2." "3." "4." "5." "6."

  

# Preprocessing: convert the rankings to numeric
rank_data<-as.numeric(rank_data)

# Check again
head(rank_data)

[1] 1 2 3 4 5 6

  

# Scrape the titles
title_data_html <- html_nodes(webpage,'.lister-item-header a')

# Convert to text
title_data <- html_text(title_data_html)

# Inspect
head(title_data)

[1] "Sing"          "Moana"         "Moonlight"     "Hacksaw Ridge"
[5] "Passengers"    "Trolls"

  

# Scrape the descriptions
description_data_html <- html_nodes(webpage,'.ratings-bar+ .text-muted')

# Convert to text
description_data <- html_text(description_data_html)

# Inspect
head(description_data)

[1] "\nIn a city of humanoid animals, a hustling theater impresario's attempt to save his theater with a singing competition becomes grander than he anticipates even as its finalists' find that their lives will never be the same."

[2] "\nIn Ancient Polynesia, when a terrible curse incurred by the Demigod Maui reaches an impetuous Chieftain's daughter's island, she answers the Ocean's call to seek out the Demigod to set things right."

[3] "\nA chronicle of the childhood, adolescence and burgeoning adulthood of a young, African-American, gay man growing up in a rough neighborhood of Miami."

[4] "\nWWII American Army Medic Desmond T. Doss, who served during the Battle of Okinawa, refuses to kill people, and becomes the first man in American history to receive the Medal of Honor without firing a shot."

[5] "\nA spacecraft traveling to a distant colony planet and transporting thousands of people has a malfunction in its sleep chambers. As a result, two passengers are awakened 90 years early."

[6] "\nAfter the Bergens invade Troll Village, Poppy, the happiest Troll ever born, and the curmudgeonly Branch set off on a journey to rescue her friends."

# Remove '\n'
description_data<-gsub("\n","",description_data)

# Check again
head(description_data)

[1] "In a city of humanoid animals, a hustling theater impresario's attempt to save his theater with a singing competition becomes grander than he anticipates even as its finalists' find that their lives will never be the same."

[2] "In Ancient Polynesia, when a terrible curse incurred by the Demigod Maui reaches an impetuous Chieftain's daughter's island, she answers the Ocean's call to seek out the Demigod to set things right."

[3] "A chronicle of the childhood, adolescence and burgeoning adulthood of a young, African-American, gay man growing up in a rough neighborhood of Miami."

[4] "WWII American Army Medic Desmond T. Doss, who served during the Battle of Okinawa, refuses to kill people, and becomes the first man in American history to receive the Medal of Honor without firing a shot."

[5] "A spacecraft traveling to a distant colony planet and transporting thousands of people has a malfunction in its sleep chambers. As a result, two passengers are awakened 90 years early."

[6] "After the Bergens invade Troll Village, Poppy, the happiest Troll ever born, and the curmudgeonly Branch set off on a journey to rescue her friends."

# Scrape the runtime section
runtime_data_html <- html_nodes(webpage,'.text-muted .runtime')

# Convert to text
runtime_data <- html_text(runtime_data_html)

# Inspect
head(runtime_data)

[1] "108 min" "107 min" "111 min" "139 min" "116 min" "92 min"

# Preprocessing: remove "min" and convert to numeric

runtime_data <- gsub(" min","",runtime_data)
runtime_data <- as.numeric(runtime_data)

# Check again
head(runtime_data)

[1] 108 107 111 139 116  92

# Scrape the genre
genre_data_html <- html_nodes(webpage,'.genre')

# Convert to text
genre_data <- html_text(genre_data_html)

# Inspect
head(genre_data)

[1] "\nAnimation, Comedy, Family "

[2] "\nAnimation, Adventure, Comedy "

[3] "\nDrama "

[4] "\nBiography, Drama, History "

[5] "\nAdventure, Drama, Romance "

[6] "\nAnimation, Adventure, Comedy "

# Remove "\n"
genre_data<-gsub("\n","",genre_data)

# Remove extra spaces
genre_data<-gsub(" ","",genre_data)

# Keep only the first genre of each film
genre_data<-gsub(",.*","",genre_data)

# Convert to a factor
genre_data<-as.factor(genre_data)

# Check again
head(genre_data)

[1] Animation Animation Drama     Biography Adventure Animation

  

# Scrape the IMDb rating
rating_data_html <- html_nodes(webpage,'.ratings-imdb-rating strong')

# Convert to text
rating_data <- html_text(rating_data_html)

# Inspect
head(rating_data)

[1] "7.2" "7.7" "7.6" "8.2" "7.0" "6.5"

# Convert to numeric
rating_data<-as.numeric(rating_data)

# Check again
head(rating_data)

[1] 7.2 7.7 7.6 8.2 7.0 6.5

# Scrape the votes section
votes_data_html <- html_nodes(webpage,'.sort-num_votes-visible span:nth-child(2)')

# Convert to text
votes_data <- html_text(votes_data_html)

# Inspect
head(votes_data)

[1] "40,603"  "91,333"  "112,609" "177,229" "148,467" "32,497"

# Remove ","
votes_data<-gsub(",", "", votes_data)

# Convert to numeric
votes_data<-as.numeric(votes_data)

# Check again
head(votes_data)

[1]  40603  91333 112609 177229 148467  32497

# Scrape the directors section
directors_data_html <- html_nodes(webpage,'.text-muted+ p a:nth-child(1)')

# Convert to text
directors_data <- html_text(directors_data_html)

# Inspect
head(directors_data)

[1] "Christophe Lourdelet" "Ron Clements"         "Barry Jenkins"
[4] "Mel Gibson"           "Morten Tyldum"        "Walt Dohrn"

# Convert to a factor
directors_data<-as.factor(directors_data)

# Scrape the actors section
actors_data_html <- html_nodes(webpage,'.lister-item-content .ghost+ a')

# Convert to text
actors_data <- html_text(actors_data_html)

# Inspect
head(actors_data)

[1] "Matthew McConaughey" "Auli'i Cravalho"     "Mahershala Ali"
[4] "Andrew Garfield"     "Jennifer Lawrence"   "Anna Kendrick"

# Convert to a factor
actors_data<-as.factor(actors_data)

  

# Scrape the metascore section
metascore_data_html <- html_nodes(webpage,'.metascore')

# Convert to text
metascore_data <- html_text(metascore_data_html)

# Inspect
head(metascore_data)

[1] "59        " "81        " "99        " "71        " "41        "
[6] "56        "

# Remove extra spaces
metascore_data<-gsub(" ","",metascore_data)

# Check the length of the metascore data
length(metascore_data)

[1] 96
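A length of 96 for 100 films means some entries have no metascore, so this vector no longer lines up with the others. One way to keep positions aligned, sketched here on made-up markup, is to select each film's container first and then call html_node() inside it, which yields NA where the element is missing. The HTML string and film names are hypothetical; the class names are the ones used in the selectors above.

```r
library(rvest)

# Hypothetical markup that mimics the listing: the second item has no metascore
page <- read_html('
<div class="lister-item-content"><h3 class="lister-item-header"><a>Film A</a></h3><span class="metascore">59        </span></div>
<div class="lister-item-content"><h3 class="lister-item-header"><a>Film B</a></h3></div>
<div class="lister-item-content"><h3 class="lister-item-header"><a>Film C</a></h3><span class="metascore">81        </span></div>
')

# One node per film
items <- html_nodes(page, ".lister-item-content")

# html_node() returns exactly one result per item, missing where nothing matches
titles    <- html_text(html_node(items, ".lister-item-header a"))
metascore <- as.numeric(html_text(html_node(items, ".metascore"), trim = TRUE))
```

Here metascore has one entry per film, with NA for the film that has no score, so it can sit in the same data frame as the titles.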

  

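Once the data is scraped, you can analyze it, draw inferences from it, and train machine learning models on it. Building on this data set, I made a few visualizations.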
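The plotting code reads from a data frame named movies_df, whose construction is not shown in this text. A minimal sketch of how it could be assembled follows, using only the first six values of each vector as printed in the outputs above. The column names are inferred from the plotting calls; Gross_Earning_in_Mil is left out because the gross earnings are never scraped in this walkthrough.

```r
# First six values of each vector, copied from the outputs shown above.
# With the full scrape, the complete vectors would be used instead,
# provided they all have the same length.
movies_df <- data.frame(
  Rank    = 1:6,
  Title   = c("Sing", "Moana", "Moonlight", "Hacksaw Ridge", "Passengers", "Trolls"),
  Runtime = c(108, 107, 111, 139, 116, 92),
  Genre   = factor(c("Animation", "Animation", "Drama",
                     "Biography", "Adventure", "Animation")),
  Rating  = c(7.2, 7.7, 7.6, 8.2, 7.0, 6.5),
  Votes   = c(40603, 91333, 112609, 177229, 148467, 32497),
  stringsAsFactors = FALSE
)

str(movies_df)
```

data.frame() stops with an error if the vectors differ in length, which is why a shorter vector such as the metascores has to be aligned first.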

 

library('ggplot2')
qplot(data = movies_df,Runtime,fill = Genre,bins = 30)

  

ggplot(movies_df,aes(x=Runtime,y=Rating))+
geom_point(aes(size=Votes,col=Genre))

  

ggplot(movies_df,aes(x=Runtime,y=Gross_Earning_in_Mil))+
geom_point(aes(size=Rating,col=Genre))

  

Turing Community (ituring.com.cn)

Beginner’s Guide on Web Scraping in R (using rvest) with example (analyticsvidhya.com)

