Web Scraping in R: CSS Selectors vs. XPath (with Code)

Reposted article. 2018-01-18 14:32. Tag: web scraping

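CSS selectors and XPath expressions are both used to locate tags (nodes) in the DOM tree. They differ mainly in the syntax used to describe where a node sits.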
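As a quick check that the two notations describe the same nodes, here is a minimal sketch. The inline HTML is made up for illustration and is not one of the article's files. The same list items are selected once with a CSS selector and once with an XPath expression.

```r
library(rvest)

# Hypothetical inline HTML, used only so the example runs without external files
doc <- read_html('
  <div id="list"><ul><li class="new">Apple</li><li>Banana</li></ul></div>
  <ul><li>Other</li></ul>
')

# CSS: "#" matches an id, ">" means direct child
css_items <- doc %>% html_nodes("div#list > ul > li") %>% html_text()

# XPath: "[@id='list']" matches an id, "/" means direct child
xpath_items <- doc %>% html_nodes(xpath = "//div[@id='list']/ul/li") %>% html_text()

# Both approaches return the same two items
same_result <- identical(css_items, xpath_items)
print(css_items)
print(same_result)
```

The `<li>` in the second list is skipped in both cases because it is not inside the `div` with id `list`.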

Extracting nodes with CSS selectors

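library(rvest)
library(stringr)  # provides str_replace_all(), used below

single_table_page <- read_html("single-table.html")
# Extract every table on the page (returns a list of data frames)
html_table(single_table_page)
# Extract only the first <table> node (returns a single data frame)
html_table(html_node(single_table_page, "table"))

products_page <- read_html("./case/products.html")
# Names of all products in the list
products_page %>% html_nodes(".product-list li .name") %>% html_text()

# Select the product items once, then pull the fields out of them
product_items <- products_page %>% html_nodes(".product-list li")
data.frame(
  name = product_items %>% html_nodes(".name") %>% html_text(),
  price = product_items %>% html_nodes(".price") %>% html_text() %>%
    str_replace_all(pattern = "\\$", replacement = "") %>%  # strip the dollar sign
    as.numeric(),
  stringsAsFactors = FALSE
)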
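The data frame above depends on every product item having both a name and a price. If one item lacks a price, `html_nodes(".price")` returns fewer values than there are items and `data.frame()` fails. The sketch below shows one possible way around this, using inline HTML whose structure is only an assumption about what products.html might look like. It uses `html_node()` (singular), which returns exactly one result per item and yields `NA` where nothing matches. In rvest 1.0.0 and later these functions are also available as `html_element()` and `html_elements()`.

```r
library(rvest)
library(stringr)

# Hypothetical markup; the real products.html is not shown in the article
html <- '
<ul class="product-list">
  <li><span class="name">Product A</span><span class="price">$199.95</span></li>
  <li><span class="name">Product B</span></li>
  <li><span class="name">Product C</span><span class="price">$79.50</span></li>
</ul>'

products_page <- read_html(html)
product_items <- products_page %>% html_nodes(".product-list li")

# html_node() keeps one entry per <li>, so the columns stay aligned
products <- data.frame(
  name = product_items %>% html_node(".name") %>% html_text(),
  price = product_items %>% html_node(".price") %>% html_text() %>%
    str_replace_all(pattern = "\\$", replacement = "") %>%
    as.numeric(),
  stringsAsFactors = FALSE
)
print(products)
```

Product B has no price element, so its row keeps the name and gets `NA` for the price instead of breaking the data frame.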

Extracting nodes with XPath

page <- read_html("./case/new-products.html")

# Find all <p> nodes
# XPath
page %>% html_nodes(xpath = "//p")
# CSS
page %>% html_nodes("p")

# Find all <li> tags that have a class attribute
# XPath
page %>% html_nodes(xpath = "//li[@class]")
# CSS
page %>% html_nodes("li[class]")

# Find the <li> tags in a <ul> directly under the <div> with id = "list"
# XPath
page %>% html_nodes(xpath = "//div[@id='list']/ul/li")
# CSS
page %>% html_nodes("div#list > ul > li")

# Find all <div> nodes that contain a <p> child node
page %>% html_nodes(xpath = "//div[p]")

# Find all <span> nodes whose class is "info-value" and whose text is "Good"
page %>% html_nodes(xpath = "//span[@class='info-value' and text()='Good']")
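The last two queries are given only in XPath form: one selects a parent by the child it contains, the other filters by text content. The following sketch runs both against a small inline document. The HTML is invented for illustration and is not the content of new-products.html.

```r
library(rvest)

# Hypothetical inline HTML standing in for new-products.html
page <- read_html('
  <div id="a"><p>Details</p><span class="info-value">Good</span></div>
  <div id="b"><span class="info-value">Bad</span></div>
  <div id="c"><span class="other">Good</span></div>
')

# <div> nodes that have a <p> child: only the first div qualifies
divs_with_p <- page %>% html_nodes(xpath = "//div[p]")
div_ids <- divs_with_p %>% html_attr("id")

# <span> nodes that match on both the class attribute and the text content
good_spans <- page %>%
  html_nodes(xpath = "//span[@class='info-value' and text()='Good']")
good_text <- good_spans %>% html_text()

print(div_ids)
print(good_text)
```

Note that `@class='info-value'` compares the whole attribute value, so a span with `class="info-value highlight"` would not match this expression, whereas the CSS selector `.info-value` would match it.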
