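I. Library used: QueryList
II. Installation
Make sure Composer is already installed. Downloads from the default Packagist repository can be very slow, so you may want to switch to the Chinese mirror: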
composer config -g repo.packagist composer https://packagist.phpcomposer.com
Install QueryList:
composer require jaeger/querylist
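The QueryList documentation is worth a look before you continue.
III. Requirements
Given a Taobao or Tmall item link, collect the item's title, main image, shop name, and the seller's Wangwang name.
IV. The current demo works for all Tmall items, plus Taobao items whose shop name appears on the right or at the top of the page
V. Code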
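The demo tells Tmall pages apart from Taobao pages by comparing the first 24 characters of the URL. As a rough sketch of a more tolerant check, the helper below classifies a link by its host instead. The function name `page_kind` and its return values are hypothetical and not part of the original demo; it assumes item pages live on `tmall.com` or `taobao.com` or one of their subdomains.

```php
<?php
// Hypothetical helper (not in the original demo): classify an item URL by its host.
function page_kind(string $url): string
{
    // parse_url() yields null when there is no host; the cast turns that into ''.
    $host = strtolower((string) parse_url($url, PHP_URL_HOST));

    foreach (['tmall' => 'tmall.com', 'taobao' => 'taobao.com'] as $kind => $domain) {
        // Accept the bare domain or any subdomain of it.
        if ($host === $domain || substr($host, -strlen($domain) - 1) === '.' . $domain) {
            return $kind;
        }
    }

    return 'unknown';
}

echo page_kind('https://detail.tmall.com/item.htm?id=1'), "\n";
echo page_kind('https://item.taobao.com/item.htm?id=564043247193'), "\n";
```

This only separates Tmall from Taobao. Whether a Taobao page shows the shop name on the right or at the top still has to be detected from the page content, as the demo does by checking for an empty shop name.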
<?php
include "vendor/autoload.php";

use QL\QueryList;

// Decode shop names that some Taobao item pages deliver as
// HTML numeric entities (&#NNNNN;) or as \uXXXX / %uXXXX escapes.
function uni_decode($s)
{
    preg_match_all('/\&\#([0-9]{2,5})\;/', $s, $html_uni);
    preg_match_all('/[\\\%]u([0-9a-f]{4})/i', $s, $js_uni);
    $source = array_merge($html_uni[0], $js_uni[0]);

    // Convert the hexadecimal escapes to integer code points.
    $js = array();
    for ($i = 0; $i < count($js_uni[1]); $i++) {
        $js[] = hexdec($js_uni[1][$i]);
    }
    $utf8 = array_merge(array_map('intval', $html_uni[1]), $js);

    // Replace every matched escape with its UTF-8 character.
    $code = $s;
    for ($j = 0; $j < count($utf8); $j++) {
        $code = str_replace($source[$j], unicode2utf8($utf8[$j]), $code);
    }
    return $code;
}

// Encode a single Unicode code point as a UTF-8 byte sequence.
function unicode2utf8($c)
{
    $str = "";
    if ($c < 0x80) {
        $str .= chr($c);
    } else if ($c < 0x800) {
        $str .= chr(0xc0 | $c >> 6);
        $str .= chr(0x80 | $c & 0x3f);
    } else if ($c < 0x10000) {
        $str .= chr(0xe0 | $c >> 12);
        $str .= chr(0x80 | $c >> 6 & 0x3f);
        $str .= chr(0x80 | $c & 0x3f);
    } else if ($c < 0x200000) {
        $str .= chr(0xf0 | $c >> 18);
        $str .= chr(0x80 | $c >> 12 & 0x3f);
        $str .= chr(0x80 | $c >> 6 & 0x3f);
        $str .= chr(0x80 | $c & 0x3f);
    }
    return $str;
}

// Return the text between two marker strings.
function get_between($input, $start, $end)
{
    return substr(
        $input,
        strlen($start) + strpos($input, $start),
        (strlen($input) - strpos($input, $end)) * (-1)
    );
}

// Remove spaces, tabs, and line breaks.
function trimall($str)
{
    $search = array(" ", "\t", "\n", "\r");
    return str_replace($search, "", $str);
}

$url = 'https://item.taobao.com/item.htm?spm=a230r.1.14.34.47cd6ace3iAnm0&id=564043247193&ns=1&abbucket=19#detail';
// Convert from GB2312 to UTF-8 to avoid garbled text.
$ql = QueryList::get($url)->encoding('UTF-8', 'GB2312');

// Three cases need different selectors:
// 1. Tmall item links  2. Taobao pages with the shop name on the right
// 3. Taobao pages with the shop name at the top
if (substr($url, 0, 24) == 'https://detail.tmall.com') {
    $rt = [
        'img' => $ql->find('#J_ImgBooth')->attr('src'),
        'title' => $ql->find(':input[name="title"]')->attr('value'),
        'shop_name' => $ql->find('.slogo-shopname')->text()
    ];
    $rt['seller_name'] = $rt['shop_name'];
} else {
    $rt = [
        'img' => $ql->find('#J_ImgBooth')->attr('src'),
        'title' => $ql->find('.tb-main-title')->text(),
        'shop_name' => $ql->find('.tb-shop-name>dl>dd>strong>a')->text(),
        'seller_name' => $ql->find('.tb-seller-name')->text()
    ];
    // Shop name at the top: both values live in the first <script> block.
    if (!$rt['shop_name']) {
        $config = substr(trimall($ql->find('script')->eq(0)->text()), 100, 150);
        $shop_name = get_between($config, "shopName:'", "',sellerId");
        $rt['shop_name'] = uni_decode($shop_name);
        $rt['seller_name'] = get_between($config, "sellerNick:'", "',sibUrl");
    }
}
var_dump($rt['shop_name']);
echo '<hr />';
?>
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Taobao Product Scraping Demo</title>
</head>
<body>
    <h4>Title: <?php echo $rt['title']; ?></h4>
    <h4>Shop: <?php echo $rt['shop_name']; ?></h4>
    <h4>Wangwang: <?php echo $rt['seller_name']; ?></h4>
    <h4>Image:</h4>
    <img src="<?php echo $rt['img'] ?>" alt="">
</body>
</html>
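The `get_between()` helper in the listing assumes both markers are present. When `strpos()` finds nothing it returns `false`, which is then silently treated as position 0. A minimal sketch of a stricter variant follows. The name `between` and the sample `$config` string are made up for illustration and are not taken from a real Taobao page.

```php
<?php
// Hypothetical stricter variant of get_between(): returns null when a marker is missing.
function between(string $input, string $start, string $end): ?string
{
    $from = strpos($input, $start);
    if ($from === false) {
        return null; // start marker not found
    }
    $from += strlen($start);

    // Search for the end marker only after the start marker.
    $to = strpos($input, $end, $from);
    if ($to === false) {
        return null; // end marker not found
    }

    return substr($input, $from, $to - $from);
}

// Invented sample resembling the inline config the demo reads.
$config = "shopName:'demo-shop',sellerId:'1',sellerNick:'demo-nick',sibUrl:''";

var_dump(between($config, "shopName:'", "',sellerId"));
var_dump(between($config, "sellerNick:'", "',sibUrl"));
var_dump(between($config, "missing:'", "'"));
```

Returning `null` lets the caller tell a missing field apart from an empty one, which helps when Taobao changes the page layout.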
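VI. Results
1 Tmall item link
Scraped result:
2 Taobao item link with the shop name on the right
Scraped result:
3 Taobao item link with the shop name at the top (this one is a bit more work, because for this layout both the seller's Wangwang name and the shop name live in JavaScript, and the shop name is additionally escaped)
Scraped result:
VII. A recent project happened to need this, so I wrote this demo. To collect other fields, consult the QueryList manual and adapt the code to your own product requirements.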
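For the escaped shop name in case 3, one possible shortcut is to lean on PHP's built-in entity decoder instead of assembling UTF-8 bytes by hand. The sketch below assumes the name arrives as `&#NNNNN;` entities or as `\uXXXX` / `%uXXXX` escapes, the same forms `uni_decode()` handles. The function name `decode_escapes` is hypothetical. Unlike `uni_decode()`, it also decodes named entities such as `&amp;`, and it does not combine surrogate pairs.

```php
<?php
// Hypothetical alternative to uni_decode(), built on html_entity_decode().
function decode_escapes(string $s): string
{
    // Rewrite \uXXXX and %uXXXX escapes as hexadecimal HTML entities.
    $s = preg_replace('/[\\\\%]u([0-9a-f]{4})/i', '&#x$1;', $s);

    // Let PHP turn every numeric entity into its UTF-8 character.
    return html_entity_decode($s, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo decode_escapes('&#24215;&#38138;'), "\n";
echo decode_escapes('\u5e97\u94fa'), "\n";
```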