Crawling Product Data from JD.com


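Honestly, leaving anti-crawler countermeasures aside, plain crawling does not take much technical skill. This post is just a short memo of how I collected the data this time. During the COVID-19 period I could not go to the office, so I worked remotely from home on research and preparation, part of which was product market research. Let the data speak!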

 

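I focused on JD.com, although earlier I also crawled Tmall and Taobao. The Alibaba sites have fairly strong anti-crawler measures: after a while they kept showing a slider CAPTCHA that I could not get around. Even when my program detected the slider and I completed the verification by hand, it still would not let me through. I have not had time to look into how to bypass it; if you have solved this, please let me know.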

 

My crawler is built with Selenium + HttpClient + MySQL.

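  • Selenium is a browser automation (testing) tool; here it drives the page, for example by clicking buttons automatically. To be honest, the job could be done with jsoup alone, without Selenium. Using Selenium alone, on the other hand, is awkward in some scenarios: when content is loaded fully asynchronously, locating elements by cssSelector or XPath can fail with element-not-found errors.
  • jsoup alone can solve the problem, but the crawler becomes more complex and you end up writing a lot more code yourself.
  • So I settled on Selenium plus HttpClient. Selenium handles paging, because JD's product list pages follow a clear pattern. Paging by URL parameters (driver.get(url)) and paging by simulating a click on "Next page" are both easy, and since a real browser window is open you can see roughly what is being crawled. HttpClient fetches the price and comment data: it is a fallback for prices, while the comment data relies on it entirely.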
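
As a minimal sketch of the parameter-based paging mentioned above: the crawler code later in this post appends `page` and `s` to the search URL, using page = 2i - 1 and s = 60(i - 1) + 1 for the i-th visible page. The class and method names below are hypothetical, and the two formulas are taken from that code, not from any JD documentation.

```java
/** Hypothetical helper that mirrors the paging arithmetic used by the crawler in this post. */
public class JdPageUrl {

    /**
     * Builds the list-page URL for the i-th visible page (1-based).
     * The crawler uses page = 2*i - 1 and s = 60*(i-1) + 1.
     */
    public static String build(String rawUrl, int i) {
        if (i < 1) {
            throw new IllegalArgumentException("page index starts at 1");
        }
        int page = 2 * i - 1;
        int start = 60 * (i - 1) + 1;
        return rawUrl + "&page=" + page + "&s=" + start;
    }

    public static void main(String[] args) {
        // The base URL here is shortened for illustration.
        String base = "https://search.jd.com/Search?keyword=test&enc=utf-8";
        for (int i = 1; i <= 3; i++) {
            System.out.println(build(base, i));
        }
    }
}
```

Keeping the arithmetic in one small function makes it easy to adjust if the site changes its paging parameters.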

 

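First create a Maven project for the crawler, mainly to make pulling in dependencies easy. Only three dependencies are listed here; the code later in this post also uses fastjson, Gson, spring-jdbc and the MySQL JDBC driver, which need to be declared as well.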

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.6</version>
</dependency>
<dependency>
    <groupId>c3p0</groupId>
    <artifactId>c3p0</artifactId>
    <version>0.9.1.2</version>
</dependency>
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-chrome-driver</artifactId>
    <version>3.141.59</version>
</dependency>

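Selenium here drives a real browser, so I use the local Chrome driver. Besides the Maven dependency you also need to download the chromedriver executable (it must match your installed Chrome version) and point the webdriver.chrome.driver system property at it in the Java program. Selenium starts chromedriver as a separate process and talks to it over the WebDriver protocol (HTTP) to open and control pages; no JNI is involved.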

 

The whole Java project is a very basic main program in an ordinary Maven project; you can turn it into a web application if you need to. First, the Selenium setup.

JDSeleniumFullProxy
package com.shihuc.up.spider.jd.comment;

import com.google.common.collect.Lists;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

import java.util.List;
import java.util.concurrent.TimeUnit;

public class JDSeleniumFullProxy {

    public static ChromeDriver driver;

    static {
        try {
            // Start the browser
            getDriver();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
    public static void main(String[] args) throws InterruptedException {
        getProductsWithFullScenario();

        Thread.sleep(10000);
        System.out.println("!!!!!!!==========Well Done===========!!!!!!");

        // Close the browser
        driver.quit();
    }

    private static void getProductsWithFullScenario() {
        String urls[] = new String[] {
                /* Car phone holder */
                "https://search.jd.com/Search?keyword=%E8%BD%A6%E8%BD%BD%E6%89%8B%E6%9C%BA%E6%94%AF%E6%9E%B6&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&click=0"
        };
        String products[][] = new String[][] {
                {"jd_info_czsjzj", "jd_comment_czsjzj"}
        };
        int hmp = 40;

        JDProductDao productDao = new JDProductDao();

        // Crawl the required data
        for (int i=0; i < urls.length; i++) {
            JDSeleniumFullCrawler.getAllProducts(driver, hmp, urls[i], productDao, products[i]);
        }

        // Post-process price and sales values (price ranges, and sales counts containing '万' or '+', are converted to numbers)
        for (int i=0; i<products.length; i++) {
            productDao.updateProductForPriceSells(products[i][0]);
        }
    }

    /**
     * Create the ChromeDriver
     * @throws InterruptedException
     */
    private static void getDriver() throws InterruptedException{
        String os = System.getProperty("os.name");
        if (os.toLowerCase().startsWith("win")) {
            System.setProperty("webdriver.chrome.driver",
                    System.getProperty("user.dir") + "\\chromedriver_win32\\chromedriver.exe");
        } else {
            System.setProperty("webdriver.chrome.driver", "/usr/bin/chromedriver");
        }
        ChromeOptions options = new ChromeOptions();
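        // Hide the "Chrome is being controlled by automated test software" infobar
        options.addArguments("--disable-infobars");
        // Disable the browser's same-origin policy checks
        options.addArguments("--disable-web-security");
        // Start maximized
        options.addArguments("--start-maximized");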
        options.addArguments("--no-sandbox");
        List<String> excludeSwitches = Lists.newArrayList("enable-automation");
        options.setExperimentalOption("excludeSwitches", excludeSwitches);

        driver = new ChromeDriver(options);
        driver.manage().timeouts().implicitlyWait(5, TimeUnit.SECONDS);
        //driver.get("https://passport.jd.com/new/login.aspx");

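/**
 * None of the approaches below that simulate dragging the slider worked. The only way past
 * Taobao's anti-crawler slider was to open the Taobao login page, switch manually to the
 * Alipay login page, and scan the QR code with the Alipay mobile app.
 */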
//        while(true) {
//            if(currentIsLoginPage()){
//                System.out.println("============>>>>");
//            }else {
//                System.out.println(">>>>>>OOOOOOOOOOO");
//                break;
//            }
//            Thread.sleep(2000);
//        }
    }

    private static boolean currentIsLoginPage() {
        String url = driver.getCurrentUrl();
        if (url.contains("https://passport.jd.com/new/login.aspx")){
            return true;
        }
        return false;
    }
}

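The webdriver.chrome.driver setting in getDriver() is the path to my Chrome driver executable: chromedriver.exe sits in the chromedriver_win32 folder inside my project. Adjust the path to wherever you put the file you downloaded.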
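
A minimal sketch of the same path selection, pulled out into a function so it can be checked without starting a browser. The class name and the existence check are my own additions for illustration; the two paths are the ones used in getDriver().

```java
import java.io.File;

/** Hypothetical helper: picks the chromedriver path the same way getDriver() does. */
public class DriverPathResolver {

    /**
     * @param osName  value of the "os.name" system property
     * @param baseDir project directory (the post uses the "user.dir" system property)
     */
    public static String resolve(String osName, String baseDir) {
        if (osName.toLowerCase().startsWith("win")) {
            return baseDir + "\\chromedriver_win32\\chromedriver.exe";
        }
        return "/usr/bin/chromedriver";
    }

    public static void main(String[] args) {
        String path = resolve(System.getProperty("os.name"), System.getProperty("user.dir"));
        // Failing early gives a clearer message than a driver start-up error.
        if (!new File(path).isFile()) {
            System.err.println("chromedriver not found at " + path);
            return;
        }
        System.setProperty("webdriver.chrome.driver", path);
        System.out.println("webdriver.chrome.driver = " + path);
    }
}
```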

 

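The program above could also simulate a login, but JD.com never requires one for browsing products, however you browse. That is unlike the Alibaba sites, which guard even plain browsing against crawlers and pop up a login page every now and then, which I find rather annoying.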

 

Next comes the code that actually crawls the data with Selenium and HttpClient.

JDSeleniumFullCrawler
package com.shihuc.up.spider.jd.comment;

import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import com.shihuc.up.spider.jd.opt.JDPhoneHolder;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Set;

public class JDSeleniumFullCrawler {

    private static String COMMENT_TOTAL = "評論總數";
    private static String COMMENT_GOOD = "好評數量";
    private static String COMMENT_GENERAL = "中評數量";
    private static String COMMENT_POOL = "差評數量";
    private static String COMMENT_VIDEO = "視頻曬單";
    private static String COMMENT_AFTER = "追評數量";

    public static void getAllProducts(ChromeDriver driver, int howManyPages, String url, JDProductDao pdao, String []pname) {

        for (int i = 1; i <= howManyPages; i++) {
            getFullPageProducts(driver, i, url, pdao, pname);
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }

    public static void getFullPageProducts(ChromeDriver driver, int i, String rawUrl, JDProductDao pdao, String []pname) {
//        WebElement pageNumInput = driver.findElement(By.xpath("//*[@id=\"J_bottomPage\"]/span[2]/input"));
//        pageNumInput.clear();
//        pageNumInput.sendKeys(i + "");
//        WebElement searchSubmit = driver.findElement(By.xpath("//*[@id=\"J_bottomPage\"]/span[2]/a"));
//        searchSubmit.click();
        String url = rawUrl + "&page=" + (2*i - 1) + "&s=" + (60*(i-1) + 1);
        driver.get(url);
        getProductsProcess(driver, pdao, pname);
    }
    private static void getProductsProcess(ChromeDriver driver, JDProductDao pdao, String []pname) {
        List<WebElement> itemElements = driver.findElements(By.cssSelector("#J_goodsList .gl-item"));
        System.out.println(itemElements.size());
        String mainHandle = driver.getWindowHandle();
        String href = null;
        for(WebElement we: itemElements) {
            try {
                String weId = we.getAttribute("data-pid");
                //WebElement weHref = we.findElement(By.cssSelector(".p-name a"));
                WebElement weHref = we.findElement(By.cssSelector(".p-img a"));
                //href = weHref.getAttribute("href");
                href = "https://item.jd.com/" + weId + ".html";

                // Price and comments often cannot be read this way: the site renders them fully asynchronously
                String price = null;
                try {
                    WebElement wePrice = we.findElement(By.cssSelector(".p-price strong i"));
                    price = wePrice.getText();
                }catch (Exception ep) {
                    System.err.println("can not get the price information for pid " + weId + " ......");
                }
//                String sells = null;
//                try {
//                    WebElement weSells = we.findElement(By.cssSelector(".p-commit strong a"));
//                    sells = weSells.getText();
//                }catch (Exception ec) {
//                    System.err.println("can not get the comment information for pid " + weId + " ......");
//                }


                driver.executeScript("window.open(\"https://item.jd.com/" + weId + ".html\");");

                Set<String> handles = driver.getWindowHandles();
                String newHandle = "";
                for (String s : handles) {
                    if (s.equalsIgnoreCase(mainHandle)) {
                        continue;
                    }
                    newHandle = s;
                    break;
                }
                // Switch to the product detail window that was just opened
                driver.switchTo().window(newHandle);

                // Read the product details of interest from the current detail page
                JDProduct product = getJDProductInfos(driver);
                try {
                    if (price == null) {
                        price = JDPhoneHolder.getPrice(weId);
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }

                product.setUrl(href);
                product.setPid(weId);
                product.setPrice(price);

                //JDComment comment = getCommentByCD(driver);
                JDComment comment = getCommentByPID(weId);
                product.setComment(comment);
                int rid = pdao.addProductInfoGenId(product, pname[0]);
                pdao.addProductComments2(product, rid, pname[1]);

                // Close the product detail window that was just processed
                closeAllOtherWindows(mainHandle, driver);
            }catch(Exception eal) {
                closeAllOtherWindows(mainHandle, driver);
                eal.printStackTrace();
                System.out.println(href);
            }
        }
    }

    public static JDProduct getJDProductInfoByUrl(WebDriver driver, String url, JDProduct product) {
        System.out.println("URL: " + url);
        driver.get(url);

        WebElement weComment = driver.findElement(By.cssSelector(".comment-count .count"));
        WebElement wePrice = driver.findElement(By.cssSelector(".summary-price .price"));

        String strComment = weComment.getText();
        if (strComment.equalsIgnoreCase("0")){
            try {
                strComment = JDPhoneHolder.getCommitCountNum(product.getPid()) + "";
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        String strPrice = wePrice.getText();
        product.setPrice(strPrice);

        return product;
    }

    public static JDProduct getJDProductInfos(WebDriver driver) {
        WebElement weTitle = driver.findElement(By.cssSelector(".w div.sku-name"));
        String title = weTitle.getText();

        /**
         * Read the product brand and name. XPath is used here; in my runs it was much faster than cssSelector, though that is not a general rule
         */
        WebElement weBrand = driver.findElement(By.xpath(".//*[@id=\"parameter-brand\"]/li/a"));
        String brand = weBrand.getText();

        WebElement weName = driver.findElement(By.xpath(".//*[@id=\"detail\"]/div[2]/div[1]/div[1]/ul[2]/li[1]"));
        String name = weName.getText();
        name = name.replace("商品名稱:","").trim();

        JDProduct product = new JDProduct();
        product.setBrand(brand);
        product.setPname(name);
        product.setTitle(title);
        return product;
    }

    public static JDComment getCommentByPID(String pid) {
        JDComment comments = new JDComment();
        HashMap<String, Integer> groups = new HashMap<>();
        try {
            JSONObject commentJson =JDPhoneHolder.getComments(pid);
            JSONObject productCommentSummary = commentJson.getJSONObject("productCommentSummary");
            // getIntValue returns 0 when a field is missing, which avoids a NullPointerException on unboxing
            // Positive-rating percentage
            int goodRateShow = productCommentSummary.getIntValue("goodRateShow");
            comments.setGoodRate(goodRateShow);

            // Total number of comments
            int commentCount = productCommentSummary.getIntValue("commentCount");
            comments.setTotalc(commentCount);
            // Number of positive comments
            int goodCount = productCommentSummary.getIntValue("goodCount");
            comments.setGoodc(goodCount);
            // Number of neutral comments
            int generalCount = productCommentSummary.getIntValue("generalCount");
            comments.setGeneralc(generalCount);
            // Number of negative comments
            int poorCount = productCommentSummary.getIntValue("poorCount");
            comments.setPoorc(poorCount);
            // Number of video reviews
            int videoCount = productCommentSummary.getIntValue("videoCount");
            comments.setVideoc(videoCount);
            // Number of follow-up comments
            int afterCount = productCommentSummary.getIntValue("afterCount");
            comments.setAfterc(afterCount);

            JSONArray hotCommentTagStatistics = commentJson.getJSONArray("hotCommentTagStatistics");
            for (int i=0; i<hotCommentTagStatistics.size(); i++){
                JSONObject hotComment = hotCommentTagStatistics.getJSONObject(i);
                String name = hotComment.getString("name");
                int count = hotComment.getInteger("count");
                groups.put(name, count);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        comments.setCommentGroups(groups);
        return comments;
    }

    public static JDComment getCommentByCD(ChromeDriver driver) {
        JDComment comment = new JDComment();
        WebElement weCommentTab = driver.findElement(By.xpath("//*[@id=\"detail\"]/div[1]/ul/li[5]"));
        weCommentTab.click();
        try {
            Thread.sleep(2000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        WebElement weGoodRate = driver.findElement(By.cssSelector(".comment-percent .percent-con"));
        String goodRate = weGoodRate.getText();
        int len = goodRate.length();
        if (len > 1) {
            goodRate = goodRate.substring(0, len - 1);
        }
        int rate = Integer.valueOf(goodRate);

        List<WebElement> weGroupList = driver.findElements(By.cssSelector(".J-comment-info .percent-info .tag-list .tag-1"));
        HashMap<String, Integer> groups = new HashMap<>();
        for (WebElement we: weGroupList) {
            String rawGroup = we.getText();
            splitDescInfo(rawGroup, groups);
        }

        List<WebElement> weLevelList = driver.findElements(By.cssSelector(".J-comments-list .filter-list li"));
        HashMap<String, Integer> levels = new HashMap<>();
        for (WebElement we: weLevelList) {
            WebElement weLevel = we.findElement(By.cssSelector("a"));
            if (containsDatatab(weLevel)){
                //TODO
//                String rawLevel = weLevel.getText();
//                splitDescInfo(rawLevel, levels);
            }
        }
        comment.setGoodRate(rate);
        comment.setCommentGroups(groups);
        return comment;
    }

    private static boolean containsDatatab(WebElement we){
        // getAttribute returns null (it does not throw) when the attribute is absent
        return we.getAttribute("data-tab") != null;
    }

    private static void splitDescInfo(String desc, HashMap<String, Integer> map) {
        String info = desc;
        int commaIdx = info.indexOf("(");
        String context = info.substring(0, commaIdx);
        String strCount = info.substring(commaIdx+1, info.length() - 1);
        float count = getRealCount(strCount);
        map.put(context, (int)count);
    }

    private static float getRealCount(String rawCount) {
        float realCount;
        if (rawCount.contains("万")){
            int wanIdx = rawCount.indexOf("万");
            String strRealCount = rawCount.substring(0, wanIdx);
            realCount = Float.valueOf(strRealCount) * 10000;
        }else if (rawCount.contains("+")){
            int plusIdx = rawCount.indexOf("+");
            String strRealCount = rawCount.substring(0, plusIdx);
            realCount = Integer.valueOf(strRealCount);
        }else{
            realCount = Integer.valueOf(rawCount);
        }
        return realCount;
    }

    private static void closeAllOtherWindows(String main, ChromeDriver driver) {
        Set<String> handles = driver.getWindowHandles();
        System.out.println("------->main: " + main);
        Object []hs = handles.toArray();
        for (int i = hs.length - 1; i>0; i--) {
            System.out.println("-------->child: " + hs[i]);
            driver.switchTo().window(hs[i].toString());
            driver.close();
        }
        driver.switchTo().window(main);
    }
}
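
getRealCount() above converts count text that carries a ten-thousand suffix (万) or a trailing "+". A minimal standalone sketch of that conversion follows; it uses BigDecimal instead of float to avoid rounding surprises, which is my own choice and not what the original code does. The class name is hypothetical and the sample inputs are made up.

```java
import java.math.BigDecimal;

/** Hypothetical variant of getRealCount(): turns count text into an int. */
public class CountParser {

    // The character "wan" (U+4E07), meaning ten thousand.
    private static final char WAN = '\u4e07';

    public static int parse(String rawCount) {
        String text = rawCount.trim();
        // A trailing "+" only means "at least"; drop it.
        if (text.endsWith("+")) {
            text = text.substring(0, text.length() - 1);
        }
        // A "wan" suffix multiplies the number by 10000, so 1.2 wan is 12000.
        int wanIdx = text.indexOf(WAN);
        if (wanIdx >= 0) {
            BigDecimal value = new BigDecimal(text.substring(0, wanIdx));
            return value.multiply(BigDecimal.valueOf(10000)).intValue();
        }
        return Integer.parseInt(text);
    }

    public static void main(String[] args) {
        System.out.println(parse("1.2\u4e07+"));
        System.out.println(parse("500+"));
        System.out.println(parse("37"));
    }
}
```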

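The key point in this class is the window-switching logic. If the page you want to work on is not the one the driver's current handle points to, element lookups fail with element-not-found errors. This is a common mistake, so window handles must be managed carefully, and it is best to close each page once it has been crawled. Selenium's Java client records the handles of opened pages in order in a LinkedHashSet, so pages opened later sit later in the set, and converting the Set to an array gives a simple way to close the extra windows; note that the WebDriver protocol itself does not promise this ordering. The approach works here because the crawl scenario is simple: it only switches between the list page and detail pages.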
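
The handle bookkeeping can be sketched without a browser. The version below picks the windows to close by comparing each handle with the main one instead of relying on position in the set, which is a small variation on closeAllOtherWindows(), not the original logic. The class name and the handle strings are made up; with a real driver the set would come from driver.getWindowHandles().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: decides which window handles to close, without relying on their order. */
public class WindowHandles {

    /** Returns every handle except the main (list page) one, in iteration order. */
    public static List<String> othersThan(String mainHandle, Set<String> allHandles) {
        List<String> others = new ArrayList<>();
        for (String handle : allHandles) {
            if (!handle.equals(mainHandle)) {
                others.add(handle);
            }
        }
        return others;
    }

    public static void main(String[] args) {
        // Stand-in for driver.getWindowHandles(); the handle strings are made up.
        Set<String> handles = new LinkedHashSet<>(Arrays.asList("list-page", "detail-1", "detail-2"));
        for (String handle : othersThan("list-page", handles)) {
            // With a real driver: driver.switchTo().window(handle); driver.close();
            System.out.println("would close " + handle);
        }
        // With a real driver, finish with: driver.switchTo().window("list-page");
    }
}
```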

 

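Next is writing the crawled data to the database. I use Spring's JdbcTemplate, which is very simple. It is not as full-featured as MyBatis, but it is enough for storing some crawled data.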

JDProductDao
package com.shihuc.up.spider.jd.comment;

import com.mchange.v2.c3p0.ComboPooledDataSource;
import org.openqa.selenium.chrome.ChromeDriver;
import org.springframework.jdbc.core.BatchPreparedStatementSetter;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.PreparedStatementCreator;
import org.springframework.jdbc.core.RowCallbackHandler;
import org.springframework.jdbc.support.GeneratedKeyHolder;
import org.springframework.jdbc.support.KeyHolder;

import java.beans.PropertyVetoException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

public class JDProductDao extends JdbcTemplate{

    public JDProductDao(){

        // Set up the c3p0 connection pool
        ComboPooledDataSource ds = new ComboPooledDataSource();
        try {
            ds.setDriverClass("com.mysql.jdbc.Driver");
            ds.setUser("root");
            ds.setPassword("shihuc");
            ds.setJdbcUrl("jdbc:mysql://localhost:3306/nav?characterEncoding=utf-8");
        } catch (PropertyVetoException e) {
            e.printStackTrace();
        }
        super.setDataSource(ds);
    }

    public int addProductInfoGenId(JDProduct product, String shop) {
        KeyHolder keyHolder = new GeneratedKeyHolder();
        JDComment comment = product.getComment();
        super.update(new PreparedStatementCreator(){
            final String sql="insert into good_holder_" + shop +
                    " (pid,title,brand,pname,price,url, goodrate,totalc,goodc,generalc,poorc,videoc,afterc)" +
                    " values (?,?,?,?,?,?,?,?,?,?,?,?,?)";
            public PreparedStatement createPreparedStatement(java.sql.Connection conn) throws SQLException{
                PreparedStatement ps = conn.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS);
                ps.setString(1, product.getPid());
                ps.setString(2, product.getTitle());
                ps.setString(3, product.getBrand());
                ps.setString(4, product.getPname());
                ps.setString(5, product.getPrice());
                ps.setString(6, product.getUrl());

                ps.setInt(7, comment.getGoodRate());
                ps.setInt(8, comment.getTotalc());
                ps.setInt(9, comment.getGoodc());
                ps.setInt(10, comment.getGeneralc());
                ps.setInt(11, comment.getPoorc());
                ps.setInt(12, comment.getVideoc());
                ps.setInt(13, comment.getAfterc());
                return ps;
            }
        },keyHolder);
        return keyHolder.getKey().intValue();
    }

    public void addProductComments(JDProduct product, int rid, String shop) {
        final String sql="insert into good_holder_" + shop + " (rid,info,count) values (?,?,?)";
        List<Object []> comments = transformCommentsToObjects(rid, product.getComment());
        super.batchUpdate(sql, new BatchPreparedStatementSetter() {
            @Override
            public void setValues(PreparedStatement ps, int i)
                    throws SQLException {
                ps.setInt(1, (Integer) comments.get(i)[0]);
                ps.setString(2, (String)comments.get(i)[1]);
                ps.setInt(3, (Integer) comments.get(i)[2]);
            }
            @Override
            public int getBatchSize() {
                return comments.size();
            }
        });
    }

    public void addProductComments2(JDProduct product, int rid, String shop) {
        final String sql="insert into good_holder_" + shop + " (rid,info,count) values (?,?,?)";
        List<Object []> comments = transformCommentsToObjects(rid, product.getComment());
        super.batchUpdate(sql, comments);
    }
    private List<Object[]> transformCommentsToObjects(int rid, JDComment comments) {
        List<Object[]> list = new ArrayList<>();
        Object[] object = null;
        HashMap<String, Integer> groups = comments.getCommentGroups();
        for(String group: groups.keySet()){
            object = new Object[]{
                    rid,
                    group,
                    groups.get(group),
            };
            list.add(object);
        }
        return list ;
    }

    public List<JDProduct> updateProductForPriceSells(String tableIdx) {
        // Query the data and handle the result set with a RowCallbackHandler
        String sql = "select id, pid, price from good_holder_" + tableIdx;
        final JDProduct product = new JDProduct();

        List<JDProduct> nokProducts = new ArrayList<>();

        // Copy each result row into the product object
        super.query(sql, new Object[]{}, new RowCallbackHandler() {
            public void processRow(ResultSet rs) throws SQLException {
                product.setId(rs.getInt("id"));
                product.setPid(rs.getString("pid"));
                product.setPrice(rs.getString("price"));

                dataProcess(product, tableIdx);
            }
        });
        return nokProducts;
    }

    public void updateNokProductForPriceSells(String tableIdx, ChromeDriver driver) {
        // Query the data and handle the result set with a RowCallbackHandler
        String sql = "select id, url, price from good_holder_" + tableIdx;
        final JDProduct product = new JDProduct();

        // Copy each result row into the product object
        super.query(sql, new Object[]{}, new RowCallbackHandler() {
            public void processRow(ResultSet rs) throws SQLException {
                product.setId(rs.getInt("id"));
                product.setUrl(rs.getString("url"));
                product.setPrice(rs.getString("price"));

                if(isNokProduct(product, tableIdx)){
                    JDProduct pd = JDSeleniumFullCrawler.getJDProductInfoByUrl(driver, product.getUrl(), product);
                    reSetPriceOrSells(product.getId(), tableIdx, pd.getPrice());
                }
            }
        });
    }

    public boolean isNokProduct(JDProduct product, String tableIdx){
        String price = product.getPrice();
        String url = product.getUrl();
        if (price.equalsIgnoreCase("")) {
            System.out.println("good_holder_" + tableIdx + ", id=" + product.getId() + " data is not ok");

            if (url != null && !url.equalsIgnoreCase("")){
                return true;
            }
        }
        return false;
    }

    public void dataProcess(JDProduct product, String tableIdx) {
        String price = product.getPrice();
        double dlow = 0 , dhigh=0;
        if (price.equalsIgnoreCase("")) {
            System.out.println("good_holder_" + tableIdx + ", id=" + product.getId() + " data is not ok");
            return;
        }
        String low = "0", high = "0";
        if (price.contains("-")){
            int idx = price.indexOf("-");
            low = price.substring(0, idx);
            high = price.substring(idx+1);
        }else{
            low = price;
            high = price;
        }
        dlow = Double.valueOf(low);
        dhigh = Double.valueOf(high);

//        String countReg = "^[1-9][0-9]*";
//        Pattern p = Pattern.compile(countReg);
//        Matcher m = p.matcher(sells);
//        if (m.find()){
//            String sc = m.group();
//            sellCount = Integer.valueOf(sc);
//        }

        updateProductPriceSell(product.getId(), tableIdx, dlow, dhigh);
    }

    public void updateProductPriceSell(int id, String tableIdx, double priceLow, double priceHigh) {
        String sql = "update good_holder_" + tableIdx + " set priceLow=?,priceHigh=? where id=?";
        int rows = super.update(sql, priceLow, priceHigh,id);
        System.out.println(rows);
    }

    public void reSetPriceOrSells(int id, String tableIdx, String price) {
        String sql = "update good_holder_" + tableIdx + " set price=? where id=?";
        int rows = super.update(sql, price, id);
        System.out.println(rows);
    }
}
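
The DAO above builds table names by concatenating "good_holder_" with a suffix, because a table name cannot be bound as a `?` parameter in a prepared statement. In this post the suffixes are hard-coded, so that is fine. If they ever came from outside the program, a check along these lines might be worth adding. This is only a sketch: the class name and the allowed character set are assumptions of mine.

```java
import java.util.regex.Pattern;

/** Hypothetical guard for the "good_holder_" + suffix table names used by the DAO. */
public class TableNames {

    // Assumption: suffixes only ever contain lowercase letters, digits and underscores.
    private static final Pattern SAFE_SUFFIX = Pattern.compile("[a-z0-9_]{1,48}");

    public static String of(String suffix) {
        if (suffix == null || !SAFE_SUFFIX.matcher(suffix).matches()) {
            throw new IllegalArgumentException("unsafe table suffix: " + suffix);
        }
        return "good_holder_" + suffix;
    }

    public static void main(String[] args) {
        System.out.println(of("jd_info_czsjzj"));
        try {
            of("x; drop table t");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```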

 

Below are the model classes for product information and comment information.

JDProduct
package com.shihuc.up.spider.jd.comment;
public class JDProduct {

    private int id;
    private String pid;
    private String title;
    private String brand;
    private String pname;
    private String price;
    private String url;
    private double priceHigh;
    private double priceLow;

    private JDComment comment;

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public String getPid() {
        return pid;
    }

    public void setPid(String pid) {
        this.pid = pid;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getBrand() {
        return brand;
    }

    public void setBrand(String brand) {
        this.brand = brand;
    }

    public String getPname() {
        return pname;
    }

    public void setPname(String pname) {
        this.pname = pname;
    }

    public String getPrice() {
        return price;
    }

    public void setPrice(String price) {
        this.price = price;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public double getPriceHigh() {
        return priceHigh;
    }

    public void setPriceHigh(double priceHigh) {
        this.priceHigh = priceHigh;
    }

    public double getPriceLow() {
        return priceLow;
    }

    public void setPriceLow(double priceLow) {
        this.priceLow = priceLow;
    }

    public JDComment getComment() {
        return comment;
    }

    public void setComment(JDComment comment) {
        this.comment = comment;
    }


    @Override
    public String toString() {
        return "Product{" +
                "pid='" + pid + '\'' +
                ", title='" + title + '\'' +
                ", brand='" + brand + '\'' +
                ", pname='" + pname + '\'' +
                ", price='" + price + '\'' +
                '}';
    }
}

 

JDComment
package com.shihuc.up.spider.jd.comment;

import java.util.HashMap;

public class JDComment {
    private Integer goodRate;
    /**
     * Comment categories (tags) and the number of comments in each
     */
    private HashMap<String, Integer> commentGroups;

    // Tmall reports sales volume; Taobao, like JD, reports the cumulative comment count
    private int totalc;
    private int goodc;
    private int generalc;
    private int poorc;
    private int videoc;
    private int afterc;

    public Integer getGoodRate() {
        return goodRate;
    }

    public void setGoodRate(Integer goodRate) {
        this.goodRate = goodRate;
    }

    public HashMap<String, Integer> getCommentGroups() {
        return commentGroups;
    }

    public void setCommentGroups(HashMap<String, Integer> commentGroups) {
        this.commentGroups = commentGroups;
    }

    public int getTotalc() {
        return totalc;
    }

    public void setTotalc(int totalc) {
        this.totalc = totalc;
    }

    public int getGoodc() {
        return goodc;
    }

    public void setGoodc(int goodc) {
        this.goodc = goodc;
    }

    public int getGeneralc() {
        return generalc;
    }

    public void setGeneralc(int generalc) {
        this.generalc = generalc;
    }

    public int getPoorc() {
        return poorc;
    }

    public void setPoorc(int poorc) {
        this.poorc = poorc;
    }

    public int getVideoc() {
        return videoc;
    }

    public void setVideoc(int videoc) {
        this.videoc = videoc;
    }

    public int getAfterc() {
        return afterc;
    }

    public void setAfterc(int afterc) {
        this.afterc = afterc;
    }
}

 

One more thing to add here: the HttpClient utility class used to fetch pages for the price and comment data.

HttpClientUtils
package com.shihuc.up.spider;

import com.shihuc.up.spider.jd.opt.JDPhoneHolder;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.methods.HttpRequestBase;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class HttpClientUtils {

    // Create the HttpClient connection pool
    private static PoolingHttpClientConnectionManager connectionManager;
    static{
        connectionManager=new PoolingHttpClientConnectionManager();
        // Maximum total connections in the pool
        connectionManager.setMaxTotal(200);
        // At most 20 connections per route (target host)
        connectionManager.setDefaultMaxPerRoute(20);
    }

    private static CloseableHttpClient getCloseableHttpClient(){
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(connectionManager).build();
        return httpClient;
    }

    private static String execute(HttpRequestBase httpRequestBase) throws IOException {
        httpRequestBase.setHeader("User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0");

        // Set the timeouts (milliseconds)
        RequestConfig config = RequestConfig.custom().setConnectionRequestTimeout(10000).setConnectTimeout(10000).setSocketTimeout(15 * 1000).build();

        httpRequestBase.setConfig(config);
        CloseableHttpClient httpClient = getCloseableHttpClient();
        // Close the response so the pooled connection is always released
        try (CloseableHttpResponse response = httpClient.execute(httpRequestBase)) {
            return EntityUtils.toString(response.getEntity(), "utf-8");
        }
    }

    private static String executeReferer(HttpRequestBase httpRequestBase, String referer) throws IOException {
        httpRequestBase.setHeader("User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0");
        httpRequestBase.setHeader("Referer", referer);
        httpRequestBase.setHeader("Sec-Fetch-Mode", "no-cors");

        // Set the timeouts (milliseconds)
        RequestConfig config = RequestConfig.custom().setConnectionRequestTimeout(60000).setConnectTimeout(60000).setSocketTimeout(10 * 10000).build();

        httpRequestBase.setConfig(config);
        CloseableHttpClient httpClient = getCloseableHttpClient();
        // Close the response so the pooled connection is always released
        try (CloseableHttpResponse response = httpClient.execute(httpRequestBase)) {
            return EntityUtils.toString(response.getEntity(), "utf-8");
        }
    }

    public static String doGet(String url) throws IOException {
        HttpGet httpGet = new HttpGet(url);
        String html = execute(httpGet);
        return html;
    }

    public static String doGetReferer(String url, String referer) throws IOException {
        HttpGet httpGet = new HttpGet(url);
        String html = executeReferer(httpGet, referer);
        return html;
    }

    public static String doPost(String url, Map<String,String> params) throws IOException {
        HttpPost httpPost = new HttpPost(url);

        List<BasicNameValuePair> list = new ArrayList<>();
        for (String key : params.keySet()) {
            list.add(new BasicNameValuePair(key,params.get(key)));
        }

        UrlEncodedFormEntity entity = new UrlEncodedFormEntity(list);
        httpPost.setEntity(entity);

        return execute(httpPost);
    }

    public static void main(String args[]) {
        String pid = "4310407";
//        try {
//            JDPhoneHolder.getCommitCount(pid);
//        } catch (IOException e) {
//            e.printStackTrace();
//        }

        try {
            int commitCountNum = JDPhoneHolder.getCommitCountNum(pid);
            System.out.println("Product: " + pid + ", comment count: " + commitCountNum);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

 

The table structures used are attached here as well:

Product table:

CREATE TABLE `good_holder_jd_info_czsjzj` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `pid` varchar(32) NOT NULL COMMENT 'Product ID',
  `title` varchar(1024) NOT NULL COMMENT 'Product title/description',
  `brand` varchar(1024) NOT NULL COMMENT 'Product brand',
  `pname` varchar(1024) NOT NULL COMMENT 'Product name',
  `price` varchar(32) NOT NULL COMMENT 'Product price',
  `url` varchar(2048) NOT NULL COMMENT 'Product URL',
  `priceLow` double(16,2) DEFAULT NULL COMMENT 'Low end of the price',
  `priceHigh` double(16,2) DEFAULT NULL COMMENT 'High end of the price',
  `goodrate` int(11) DEFAULT NULL COMMENT 'Product review score',
  `totalc` int(64) DEFAULT NULL COMMENT 'Total number of comments',
  `goodc` int(11) DEFAULT NULL COMMENT 'Number of positive comments',
  `generalc` int(11) DEFAULT NULL COMMENT 'Number of neutral comments',
  `poorc` int(11) DEFAULT NULL COMMENT 'Number of negative comments',
  `videoc` int(11) DEFAULT NULL COMMENT 'Number of video reviews',
  `afterc` int(11) DEFAULT NULL COMMENT 'Number of follow-up comments',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1134 DEFAULT CHARSET=utf8mb4
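
The priceLow and priceHigh columns are filled from the raw price string by dataProcess() in the DAO: a value with a "-" is treated as a range, anything else as a single price. A minimal standalone sketch of that split follows. The class name and the choice to return null for a blank price are mine, and the sample price is made up.

```java
/** Hypothetical version of the price-splitting step in dataProcess(). */
public class PriceRange {

    /** Returns {low, high}; a single price gives low == high; blank input gives null. */
    public static double[] parse(String price) {
        if (price == null || price.trim().isEmpty()) {
            return null;
        }
        String text = price.trim();
        int idx = text.indexOf('-');
        if (idx < 0) {
            double value = Double.parseDouble(text);
            return new double[] {value, value};
        }
        double low = Double.parseDouble(text.substring(0, idx).trim());
        double high = Double.parseDouble(text.substring(idx + 1).trim());
        return new double[] {low, high};
    }

    public static void main(String[] args) {
        // Made-up price range, for illustration only.
        double[] range = parse("19.90-39.90");
        System.out.println(range[0] + " .. " + range[1]);
    }
}
```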

Comment category table (I did not crawl the full comment text, only the comment categories and their counts):

CREATE TABLE `good_holder_jd_comment_czsjzj` (
  `rid` int(11) NOT NULL COMMENT 'Primary key ID of the product record the comment data belongs to',
  `info` varchar(256) DEFAULT NULL COMMENT 'Description text',
  `count` int(11) DEFAULT NULL COMMENT 'Number of comments with this description'
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4

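The data in this comment category table corresponds to the comment tags and their counts shown in the review section of a product page (circled in red in the original screenshot, which is not available here).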
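
To make the row shape concrete, here is a small sketch of how one product's tag counts turn into (rid, info, count) rows, in the spirit of transformCommentsToObjects() in the DAO. The class name is hypothetical and the tags, counts and rid are invented for illustration; they are not crawled data.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of how one product's tag counts become (rid, info, count) rows. */
public class CommentRows {

    public static List<Object[]> toRows(int rid, Map<String, Integer> tagCounts) {
        List<Object[]> rows = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : tagCounts.entrySet()) {
            rows.add(new Object[] {rid, entry.getKey(), entry.getValue()});
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up tags and counts, for illustration only.
        Map<String, Integer> tags = new LinkedHashMap<>();
        tags.put("sturdy", 120);
        tags.put("easy to install", 85);
        for (Object[] row : toRows(1001, tags)) {
            System.out.println(row[0] + " | " + row[1] + " | " + row[2]);
        }
    }
}
```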

 

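To close the post, here is how the JD price and comment data are fetched. These are the JDPhoneHolder methods called by the crawler above: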

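// Methods of JDPhoneHolder (package com.shihuc.up.spider.jd.opt). Imports needed:
// com.alibaba.fastjson.JSON, com.alibaba.fastjson.JSONObject,
// com.google.gson.Gson, com.google.gson.GsonBuilder,
// com.shihuc.up.spider.HttpClientUtils, java.io.IOException, java.util.List, java.util.Map

// Get the price; only the product ID is needed
public static String getPrice(String pid) throws IOException {
    String priceUrl = "https://p.3.cn/prices/mgets?pduid=" + Math.random() + "&skuIds=J_" + pid;
    String priceJson = HttpClientUtils.doGet(priceUrl);
    System.out.println(priceJson);
    Gson gson = new GsonBuilder().create();
    // The response is parsed as a list of maps; the "p" entry is returned as the price
    List<Map<String, String>> list = gson.fromJson(priceJson, List.class);
    return list.get(0).get("p");
}

// Get the comment information of a product; only the product ID is needed
public static JSONObject getComments(String pid) throws IOException {
    String baseUrl = "https://sclub.jd.com/comment/productPageComments.action?score=0&sortType=5&page=1&pageSize=1&isShadowSku=0&productId=" + pid;
    String commentJson = HttpClientUtils.doGet(baseUrl);
    System.out.println(commentJson);

    return JSON.parseObject(commentJson);
}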

The URLs in these two functions are the key part. Judging from them, the JD.com site is designed in a relatively simple way.

 

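That is all for this post. With small changes, the crawler above (which mainly collects data on car phone holders) can crawl similar information for other products. Comments are welcome, and so are solutions for getting around Alibaba's anti-crawler measures!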
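
As a sketch of one such change: the search URL in getProductsWithFullScenario() is just the UTF-8 percent-encoded keyword plus a fixed set of query parameters, so a URL for another product can be built as below. The class and method names are hypothetical, and the fixed parameters are copied from the URL in this post without knowing what each one means.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper: builds a JD search URL for another keyword. */
public class JdSearchUrl {

    public static String forKeyword(String keyword) {
        try {
            // Query parameters copied from the URL used in this post.
            return "https://search.jd.com/Search?keyword=" + URLEncoder.encode(keyword, "UTF-8")
                    + "&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&click=0";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The escaped string is the keyword used in this post (car phone holder).
        System.out.println(forKeyword("\u8f66\u8f7d\u624b\u673a\u652f\u67b6"));
    }
}
```

The table suffixes in the products array, and the matching MySQL tables, would need to change along with the keyword.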

 

