【php】實現php+puppeteer+npm爬蟲js動態渲染的淘寶、天貓HTML頁面


1、相關文檔和網站
pupp使用示例demo:http://www.querylist.cc/docs/guide/v4/Puppeteer

pupp官方原生語法大全:https://zhaoqize.github.io/puppeteer-api-zh_CN/#?product=Puppeteer&version=v5.5.0&show=api-pagewaitforselectorselector-options

華為雲pupp指定谷歌瀏覽器鏡像:https://mirrors.huaweicloud.com/

下載:

1、composer require jaeger/querylist-puppeteer:~v4
2、composer require jaeger/querylist-puppeteer:*
3、npm install @nesk/puphpeteer

后,必須上華為雲【https://mirrors.huaweicloud.com】搜索【chromium】,下載對應的puppteteer的Chrome谷歌瀏覽器

【原因:puppteteer1.7.5版本后,谷歌團隊將puppteteer和瀏覽器分開了,所以我們用npm 下的puppteteer是-core內核,瀏覽器需要單獨下載 -- 此坑花了我3天時間】

 

 

 

2、下載相關npm插件

 

3、華為雲鏡像源,下載puppteteer的瀏覽器和-core

然后就可以愉快的玩耍了。

4、puppeteer插件在win下沒問題,在linux會報錯:Failed to lanch the browser process!

 

 

解決辦法:puppeteer的github->issue 里面有這個問題,

 

 進入項目/node_modules/,找到chrome運行文件路徑(我的是:www/wwwroot/xxx/project/node_modules/puppeteer/.local-chromium/linux-818858/chrome-linux/),執行ldd chrome | grep not,查看缺失的包,然后安裝就行。

 

5、下面上一段php爬蟲的laravel5.7代碼:抓取天貓www.tmall.com模擬登陸賬號並獲取某個店鋪列表頁 -- js動態渲染數據后HTML

(備注:2020/12/25,分析天貓登陸邏輯,Jhtml->login.do->add?xx->訪問目標頁面並帶上add頁面response->headers->cookies)

<?php

namespace App\Http\Controllers;

use Illuminate\Http\Request;
use GuzzleHttp\Cookie\CookieJar;
use GuzzleHttp\Psr7\Response;
use QL\QueryList;
use QL\Ext\Chrome;
use GuzzleHttp\Client;

class HomeController extends Controller
{
    //jaeger--模擬天貓登錄,存cookie
    public function jargerTmallLogin2()
    {
        $jhtml_cookie = $this->getJhtmlCookie();

         $form_params = array_merge([
                'loginId' => config('cache.Tmall.username'),
                'password2' => 'b9c3b97c2a3ab755deddb370bf2659200236b4ab9ebfcb0a44f4c8431920f7b4799bdfc1cc368769aa66c353b000d4a6b3961332dc5ea9b112e539384baa4ec6af9d5df3c84e5ef97ca07233e61dfc84ce16fa1baabf2e9257ce586fdfbd860c630e343852db4f8103ce89539a55b7ec7761ce97fd44621a89db85b0f68dc93d',
//                'password2' => config('cache.Tmall.password'),
                'keepLogin' => 'false',
                'ua' => '137#g6E9hE9o9kYFAWPMA9Oy4vcGIn9b7zb0r37oFA7TeVJPcfsFFthJcm420Q1uI3U3JfRy0DyD4EKTXVl3Uvo/HQr2q564KWWvt2PCEZcmLm1pQiJNWwz8Xx38LQMmOIIy5lXGepSQoiuSLCaJm5VaL+z+bgl81143CHYkMr1+qEV5a7ZOj1tlFXm9iRbMZuRiEwBKg1242mSCCSygmf02VgvJvpf/maKYEs+hhsfAZ7ynOSIdPJ4K3D2F89sB2Lwn+YEOspR9rjd5DVpq1/T3wdmzE8Pivm6BH9/neUPY72dzAA+CEqWbn9AfBzFwOJMJtIqvSCQhgI5pXMIiGkvKCCOqd07xJc2MinUk0oLjZyzt8Ef5e8C11G/0TKBL0ZPeFUVr9A+AM0e70T0ZlI5qLd8wSbWvceu4/VRaE9xjMlJWsDy0g8ovyCjRUt/+4mAFyUF9qS8zDBe5DBHoJLaPjiqmQofJ+GxVBp1m1ioI+/IkYrSP3lEIvEDhQefU+GDpFoyql4ei+tiVYSUS1lQyqdicQDJO/d6MpRJs1Iey+tiVYTUS8lrDJdicBAW70sceDRZjEyIRacn+bLnq1qQipXpmQAfJ+GXppRSf1Iei+piVYTUx1lgippimQOIoEV+H99qXEnF/EPLUBSzfs/9wdbVc8ByysTdY1MM2zHHu+JZXoPpTgItyRHHJ/WSHIuscdksoU0MND1yioT+MrJE0s/vfyPAb+tBO94uaky6uUe+Nt2Fc0BBypPwGSnZVVOeLQMWTuPMmjIjCvbt9Uon/ETx86o5mwlz9hqK5JBtqLm1sdWr0DDg0DaXupn4Dt6Bir6yBzetn34hfTteaUokQO2EAtBvcmiUThn3gykXcLoVzhnwW2Hmlbh0lC123Tym6lEqpKgrfNCbZM8wS9eUDufau3qZVQi80WOWdmLmN/JCUZ8GwE2SZ9RK7dEugd9R8dz2awhZF0VqvRFNF1sMlp/6rNeMOK3tHbQlGogeDfFseAwwbXdOeNEM5BnEP+6aj+ob4nVXRn6D+Ibzm28Dk1ArFoZSraC/41e0fs0/k7CJPrNyz197hJkwlv6P29Y1YaYSLdEjlq7nYa0bHpVM/ESSApV+QMRCWcHavL0Gx6eV5+Ok6rd5cYoHDvsyNjNrbFAYJSP1Eg3KRmqoIf7Hsy0C6Qb4u36HE5qzb7txcpOgqIgalG3Ot0GcwuXZ3QkyXOpzSAn9k6U+g7FiEcTdMe7Neis7pNpqvy/6RWdS/yTv9xaG4AnKWdoUuTVTKMKgaDZAtCRYnCaD6roOJAxrMnxnD9g0ghweHx31wwzBa256fafx7ljorDiKE3ENmcmtwqjnKHADOsGiQhc==',
                'umidGetStatusVal'    =>    '255',
                'navlanguage'    =>    'zh-CN',
                'navUserAgent'    =>    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36',
                'navPlatform'    =>    'Win32'
         ], $jhtml_cookie);
        $jar = new \GuzzleHttp\Cookie\CookieJar;
        $client = new Client(['cookies' => $jar]);
        $res = $client->request('POST', 'https://login.taobao.com/newlogin/login.do?appName=taobao&fromSite=0', [
                'form_params'=>$form_params,
                'timeout' => 60,
                'verify' => false,    //不驗證ssl證書
                
            ]
        );
        $loginCookie = [];
        foreach ($jar->toArray() as $k=>$v){
            $loginCookie[$k]['name'] = $v['Name'];
            $loginCookie[$k]['value'] = $v['Value'];
//             $loginCookie[$k]['path'] = $v['Path'];
//             $loginCookie[$k]['Max-Age'] = $v['Max-Age'];
//             $loginCookie[$k]['expires'] = floatval($v['Expires']);
//             $loginCookie[$k]['secure'] = $v['Secure'];
//             $loginCookie[$k]['discard'] = $v['Discard'];
//             $loginCookie[$k]['httpOnly'] = $v['HttpOnly'];
//             $loginCookie[$k]['url'] = strstr($v['Domain'], 'http') ? $v['Domain'] : 'https://www' . $v['Domain'];
//             $loginCookie[$k]['url'] = strstr($v['Domain'], 'http') ? $v['Domain'] : 'https://www.tmall.com';
//             foreach ($v as $key=>$val){
//                 $loginCookie[$k][lcfirst($key)] = $val;
//             }
        }
        //print_r($loginCookie);
        file_put_contents(storage_path('logs/tmall2.cookie.txt'), json_encode($loginCookie, JSON_UNESCAPED_UNICODE));
        $body = (string)$res->getBody();
        $loginReturn = json_decode($body, true);
        print_r($loginReturn['content']['data']['redirectUrl']);//https://pass.tmall.com/add?_l_g=xx
        $headers = $res->getHeaders();
//         dd($loginCookie);

        $jumpCookieArr = $this->redirectUrlJump($loginReturn['content']['data']['redirectUrl'], $loginCookie);
        $jumpReCookie = [];
        foreach ($jumpCookieArr as $k=>$v){
            $jumpReCookie[$k]['name'] = $v['Name'];
            $jumpReCookie[$k]['value'] = $v['Value'];
            $jumpReCookie[$k]['path'] = $v['Path'];
            $jumpReCookie[$k]['Max-Age'] = $v['Max-Age'];
            $jumpReCookie[$k]['expires'] = floatval($v['Expires']);
            $jumpReCookie[$k]['secure'] = $v['Secure'];
            $jumpReCookie[$k]['discard'] = $v['Discard'];
            $jumpReCookie[$k]['httpOnly'] = $v['HttpOnly'];
            $jumpReCookie[$k]['url'] = strstr($v['Domain'], 'www') ? $v['Domain'] : 'www' . $v['Domain'];
            $jumpReCookie[$k]['url'] = strstr($jumpReCookie[$k]['url'], 'http://') ? $v['Domain'] : 'https://' . $v['Domain'];
        }
        

//         $jar2 = new CookieJar(false, $loginCookie);
//         $jar2 = unserialize(file_get_contents(storage_path('logs/tmall2.cookie.txt')));
//         dd($jar2);
//         $client = new Client(['cookie'=>$jar2]);
//         $client = new Client();
        /* $response = $client->request('GET', 'https://nanjirenoc.tmall.com/category.htm' , [
                    'verify' => false,    //不驗證ssl證書
                    'cookie' => $jar2,
                    'allow_redirects' => false,
                    'headers'        => ['Accept-Encoding' => 'gzip, deflate, br'],
                    'decode_content' => false,
                    'timeout' => 60
                ]
            ); */
        //print_r((string)$response->getBody());
        
        // 執行南極人店鋪列表頁js異步渲染html獲取        
        $ql = QueryList::getInstance();
        // 注冊插件,默認注冊的方法名為: chrome
        $ql->use(Chrome::class);
        $text2 = $ql->chrome(function ($page,$browser)  use($jumpReCookie) {
            $page->setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.39');
            // 設置cookie
            //print_r($jumpReCookie);
            // 動態傳參
            call_user_func_array([$page, 'setCookie'], $jumpReCookie);            //$page->setCookie();
            
            $page->goto('https://nanjirenoc.tmall.com/category.htm');
//             $page->goto('https://www.iviewui.com/components/button');
            // 等待h1元素出現
            $page->waitFor('#J_ShopAsynSearchURL');
//             $page->waitFor('h1');
            sleep(100);
            // 獲取頁面HTML內容
            $html = $page->content();
            // 關閉瀏覽器
            $browser->close();
            // 返回值一定要是頁面的HTML內容
            return $html;
        },[
                'headless' => false, // 啟動可視化Chrome瀏覽器,方便調試
                'devtools' => true, // 打開瀏覽器的開發者工具
                'timeout' => 100000,
                'ignoreHTTPSErrors' => true,
        ])
        ->removeHead()
        ->getHtml();
//         ->find('h1')->text();
        dd($text2);
        
        
        
        
        
        
        
        
        
        
        
        
        
        
        
        $response = $client->request('GET', 'https://nanjirenoc.tmall.com/category.htm' , [
                'verify' => false,    //不驗證ssl證書
                'cookie' => $jar2,
                'allow_redirects' => false,
                /* 'headers' => [
                        'Content-Type'=>'application/html;charset=UTF-8',
                        'Accept-Encoding' => 'gzip, deflate, br',
                ], */
                'timeout' => 60
        ]
                );
        $string = (string)$response->getBody();
        
        dd(iconv("gb2312","utf-8//IGNORE",$string));
        
    }
    
    public function redirectUrlJump( $jumpUrl, $loginCookie=[] )
    {
        // 將二維數組拼接成原生cookie格式
        $cookieStr = '';
        foreach ($loginCookie as $k=>$v){
            $cookieStr .= ';'.$v['name'].'='.$v['value'];
        }
        $cookieStr = substr($cookieStr, 1);
        print_r($cookieStr);
        // 自定義http header
        $retrunCookie = new \GuzzleHttp\Cookie\CookieJar;
        $client = new Client(['cookies' => $retrunCookie]);
        $res = $client->request('GET', $jumpUrl, [
                'timeout' => 60,
                'verify' => false,    //不驗證ssl證書
                'headers' => [
                        'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36',
                        // 攜帶cookie
                        'Cookie'    => $cookieStr,//'abc=111;xxx=222'
                ]
        ]);
        // 獲取響應頭部信息
        $headers = $res->getHeaders();
        file_put_contents(storage_path('logs/jumpUrlReturnCookie.log'), json_encode($retrunCookie->toArray()));
        return $retrunCookie->toArray();
        dd($headers);
    }
    
    
    
    //查看是否登錄成功URL:https://i.taobao.com/my_taobao.htm
    
    public function getJhtmlCookie()
    {
        $jar = $jar_response = new CookieJar();
        $url = 'https://login.taobao.com/member/login.jhtml?tpl_redirect_url=https://www.tmall.com&style=miniall&enup=true&newMini2=true&full_redirect=true&sub=true&from=tmall&allp=assets_css=3.0.10/login_pc.css';
        $ql = QueryList::get($url, null, [
                'cookies' => $jar
        ])
        //->encoding('UTF-8','GB2312')
        ->removeHead()
        ->getHtml();
        
        $match = [];
        preg_match_all('/\"loginFormData\"\:\{(.*?)\}/', $ql, $match);
        //print_r($match);
        $a1 = $match[1][0];
        return $loginFormData = json_decode('{' . $a1 . '}', true);
    }

}

 


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM