[UiBot Tutorial] [JS Crawler Plugin] A JS data-extraction plugin built on the browser's RunJS, wrapping extraction into simpler steps


Overview:

This plugin is a pure JavaScript script injected into the browser page via WebBrowser.RunJS. The script creates a crawler object and then supports chained extraction over objects, events, JSON, elements, nodes, regular expressions, and strings. The script can also be run directly in the browser console.

 

Update 2020-03-04:

1. Added the ForEach method, which supports custom processing and makes multi-value extraction easier.

2. Added the more method, which supports extracting multiple sibling nodes across levels in one call.

 

API reference:

/*
JS crawler wrapper. Inject the script via RunJS, then execute it to get the result.
1. Currently only Chrome is supported.
2. If Chrome is started with cross-origin checks disabled, expressions that hit an iframe cross into it automatically.
3. Supports chained extraction with XPath, CSS selectors, and JSONPath.
4. Supports chained post-processing of string results.

# Create a crawler object; the element(s) to extract from may be passed in. Defaults to document.body.
# ele: element array
new crawler(ele);

# XPath extraction. Only extracts under document; even when chained after other steps, it still extracts from document.
# query: XPath expression [required]
$x(query)

# CSS-selector extraction. Supports extracting HTML attributes, methods, and events; supports repeated chained list-template extraction.
# query: CSS-selector expression [required]. param: attribute, method, or event
$(query, param)

# JSON data extraction.
# query: JSONPath expression [required]. param: PATH (the JSON path) or VALUE (the value). Defaults to extracting the value.
$j(query, param)

# String filtering by regex. Excludes or keeps entries based on the filter condition.
# query: regular expression or string [required]
filter(query)

# String extraction by regex.
# query: regular expression or string [required]. index: number or array of numbers; the extracted matches are filtered by these indices
regex(query, index)

# String replacement.
# substr: the text to be replaced, as a regular expression or string [required]. replacement: the replacement string
replace(substr, replacement)

# String splitting.
# query: regular expression or string [required]
split(query)

# Chained extraction from a string expression, i.e. converts a chain-expression string into an executable expression.
# expression: extraction expression [required]
mix(expression)

# Exposes the extraction results for processing. Works like Array.forEach, iterating over the crawler's ele results.
# func: a JavaScript function
forEach(func)

# Extracts multiple sibling results at once; returns an array of objects or a two-dimensional array.
# exps: array or JSON whose values are crawler expressions
# types: string (mix|xpath|css|json|regex|replace|split|filter)
more(exps, types)

# Gets the extraction result, converting the ele objects into plain data.
get()

# Gets the nearest unique XPath or CSS-selector expression for an element.
# elm: the element to locate [required]. xp: XPath (true) or CSS selector (false)
getSelector(elm, xp)

# Gets the full-path XPath or CSS-selector expression for an element.
# elm: the element to locate [required]. xp: XPath (true) or CSS selector (false)
getFullSelector(elm, xp)
*/
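To make the chaining model concrete, here is a minimal, hypothetical sketch (not the plugin's actual source) of how the string-processing methods above can be built: each method transforms an internal `ele` array and returns `this`, so calls chain, and `get()` returns the accumulated result.

```javascript
// Minimal sketch of the chaining model (hypothetical, not the plugin's source):
// each method transforms this.ele and returns `this` so calls can be chained.
class MiniCrawler {
  constructor(ele) {
    // Normalize the input into an array, mirroring "ele: element array".
    this.ele = Array.isArray(ele) ? ele : [ele];
  }
  filter(query) {
    // Keep only entries matching the regex/string condition.
    const re = query instanceof RegExp ? query : new RegExp(query);
    this.ele = this.ele.filter(s => re.test(s));
    return this;
  }
  replace(substr, replacement) {
    this.ele = this.ele.map(s => s.replace(substr, replacement));
    return this;
  }
  split(query) {
    // Splitting can turn one entry into several, so flatten afterwards.
    this.ele = this.ele.flatMap(s => s.split(query));
    return this;
  }
  get() {
    return this.ele;
  }
}

const out = new MiniCrawler(["post 857: hello", "post 123: bye"])
  .filter(/857/)
  .replace("hello", "world")
  .get();
// out is ["post 857: world"]
```

The real plugin applies the same return-`this` pattern to DOM nodes and JSON values, not just strings.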

 

Plugin usage:

Requires the Chrome browser. Import the attached task into your project, then import it in the source-code view:

Import spider

 

Demo1:

Import spider
dim hWeb = ""
hWeb = WebBrowser.Create("chrome","https://forum.uibot.com.cn/",10000,{"bContinueOnError":false,"iDelayAfter":300,"iDelayBefore":200,"sBrowserPath":"","sStartArgs":""})
data = spider.xpath(hWeb,"//div[@class='media-body']//a/text()")
TracePrint(data)
data = spider.xpath(hWeb,"//div[@class='media-body']//a/@href")
TracePrint(data)
data = spider.css(hWeb,"div.media-body>div>a","innerText")
TracePrint(data)
data = spider.css(hWeb,"div.media-body>div>a","href")
TracePrint(data)
data = spider.mix(hWeb,'''$(".card-body li .media-body").more(["//a[1]/text()","//a[1]/@href","//div[1]/span[1]/text()","//div[1]/span[2]/text()"],"xpath")''')
TracePrint(data)
data = spider.mix(hWeb,'''$(".card-body li .media-body").more({"標題":"//a[1]/text()","地址":"//a[1]/@href","作者":"//div[1]/span[1]/text()","時間":"//div[1]/span[2]/text()"},"xpath")''')
TracePrint(data)
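The `more` calls in the last two lines can be pictured as running each sub-expression against every matched node and collecting the results into one record per node. Below is a hypothetical sketch of that zipping step, with plain accessor functions standing in for the XPath sub-expressions (the node objects and keys are invented for illustration):

```javascript
// Hypothetical sketch of what more({...}, "xpath") produces: one record per
// matched node, with each key filled by its sub-expression. Plain accessor
// functions stand in for the XPath expressions here.
const nodes = [
  { title: "Post A", href: "/a", author: "tom", time: "10:00" },
  { title: "Post B", href: "/b", author: "amy", time: "11:00" },
];
const exps = {
  "title": n => n.title,
  "url": n => n.href,
  "author": n => n.author,
};
function more(nodes, exps) {
  return nodes.map(n => {
    const record = {};
    for (const [key, extract] of Object.entries(exps)) record[key] = extract(n);
    return record;
  });
}
const records = more(nodes, exps);
// records[0] is { title: "Post A", url: "/a", author: "tom" }
```

Passing an array of expressions instead of an object yields a two-dimensional array rather than keyed records, as the API reference notes.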

Demo2:

dim hWeb = ""
hWeb = WebBrowser.Create("chrome","http://9pk.5566rs.com/",3600000,{"bContinueOnError":false,"iDelayAfter":300,"iDelayBefore":200,"sBrowserPath":"","sStartArgs":""})
// hWeb = WebBrowser.BindBrowser("chrome",10000,{"bContinueOnError":false,"iDelayAfter":300,"iDelayBefore":200})

data1=spider.mix(hWeb,'''$("tr").filter("今日\\\\d").more(["//td[1]/a/text()","//td[1]/a/@href","//td[3]/text()"],"xpath")''')

data = ""
For Each v In data1
data = data & Join(v,",") & "\n"
Next
dim dicts = {"":"a",\
"":"b",\
"":"c",\
"":"d",\
"":"e",\
"":"f",\
"":"g",\
"":"h",\
"":"i",\
"":"j",\
"":"k",\
"":"l",\
"":"m",\
"":"n",\
"":"o",\
"":"p",\
"":"q",\
"":"r",\
"":"s",\
"":"t",\
"":"u",\
"":"v",\
"":"w",\
"":"x",\
"":"y",\
"":"z",\
"":"A",\
"":"B",\
"":"C",\
"":"D",\
"":"E",\
"":"F",\
"":"G",\
"":"H",\
"":"I",\
"":"J",\
"":"K",\
"":"L",\
"":"M",\
"":"N",\
"":"O",\
"":"P",\
"":"Q",\
"":"R",\
"":"S",\
"":"T",\
"":"U",\
"":"V",\
"":"W",\
"":"X",\
"":"Y",\
"":"Z",\
"":"1",\
"":"2",\
"":"3",\
"":"4",\
"":"5",\
"":"6",\
"":"7",\
"":"8",\
"":"9",\
"":"0"}
For Each k,v In dicts
data = Replace(data,k,v)
Next
File.Write("d:\\aaa1.csv",data,"gbk")
 

Demo4:

Import spider

dim hWeb = ""
hWeb = WebBrowser.Create("chrome","http://9pk.5566rs.com/",3600000,{"bContinueOnError":false,"iDelayAfter":300,"iDelayBefore":200,"sBrowserPath":"","sStartArgs":""})
// hWeb = WebBrowser.BindBrowser("chrome",10000,{"bContinueOnError":false,"iDelayAfter":300,"iDelayBefore":200})

data1=spider.mix(hWeb,'''$("tr").filter("今日\\\\d").$x("//td[1]/a/text()")''')
data2=spider.mix(hWeb,'''$("tr").filter("今日\\\\d").$x("//td[1]/a/@href")''')
data3=spider.mix(hWeb,'''$("tr").filter("今日\\\\d").$x("//td[3]/text()")''')

data = ""
For Each k,v In data1
res = []
push(res,v)
push(res,data2[k])
push(res,data3[k])
data = data & Join(res,",") & "\n"
Next
dim dicts = {"":"a",\
"":"b",\
"":"c",\
"":"d",\
"":"e",\
"":"f",\
"":"g",\
"":"h",\
"":"i",\
"":"j",\
"":"k",\
"":"l",\
"":"m",\
"":"n",\
"":"o",\
"":"p",\
"":"q",\
"":"r",\
"":"s",\
"":"t",\
"":"u",\
"":"v",\
"":"w",\
"":"x",\
"":"y",\
"":"z",\
"":"A",\
"":"B",\
"":"C",\
"":"D",\
"":"E",\
"":"F",\
"":"G",\
"":"H",\
"":"I",\
"":"J",\
"":"K",\
"":"L",\
"":"M",\
"":"N",\
"":"O",\
"":"P",\
"":"Q",\
"":"R",\
"":"S",\
"":"T",\
"":"U",\
"":"V",\
"":"W",\
"":"X",\
"":"Y",\
"":"Z",\
"":"1",\
"":"2",\
"":"3",\
"":"4",\
"":"5",\
"":"6",\
"":"7",\
"":"8",\
"":"9",\
"":"0"}
For Each k,v In dicts
data = Replace(data,k,v)
Next
File.Write("d:\\aaa.csv",data,"gbk")
 

More examples:

1. Chained extraction:

new crawler().$(".subject.break-all").$x("//a").filter(/857/ig).replace(/(t+)/,'hahaha$1').get()

 

2. Event triggering:

new crawler().$(".subject.break-all","click()")

 

3. Using browser objects:

new crawler(document.querySelectorAll(".subject.break-all")).$("a").get()

 

4. Repeated blocks, sub-template extraction:

{"title":new crawler(document.querySelectorAll(".subject.break-all")).$("a").get(),

"content":new crawler(document.querySelectorAll(".subject.break-all")).$("a").get()

}

 

5. Multi-node extraction (advanced):

new crawler().$("li a").filter("tabindex").$x("img").forEach(function(a,b,c){c[b]=a.src}).get()

 

6. Shortcut multi-node extraction (advanced):

data1=spider.mix(hWeb,'''$("tr").filter("今日\\\\d").more(["//td[1]/a/text()","//td[1]/a/@href","//td[3]/text()"],"xpath")''')

 

More easter eggs are left for you to discover.

PS: This script is a personal labor of love; those who use it will find it extremely practical.

 

To download the plugin, visit the original thread: https://forum.uibot.com.cn/thread-869.htm

