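Java web crawlers: fundamentals and implementation (a hands-on walkthrough that scrapes product data from the JD.com website and saves it to a database)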

Reposted article. 2020-03-08 17:05. Tag: Java crawler

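I. What is a web crawler?

A web crawler is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Functionally, a crawler usually has three parts: data collection, processing, and storage.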
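
The three parts can be sketched with nothing but the JDK. This is only an illustration under stated assumptions: the class name `MiniCrawler`, the injected `fetcher` function and the in-memory list that stands in for a database are all hypothetical, and the "processing" stage is reduced to pulling out the page title with a regular expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MiniCrawler {
    private static final Pattern TITLE =
            Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Stage 1 (collection): maps a URL to the page source. Hypothetical hook;
    // a real crawler would perform an HTTP request here.
    private final Function<String, String> fetcher;

    // Stage 3 (storage): an in-memory list stands in for the database.
    private final List<String> store = new ArrayList<>();

    public MiniCrawler(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    // Stage 2 (processing): extract the page title, or "" when there is none.
    public static String parseTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    // Runs the three stages for one URL.
    public void crawl(String url) {
        String html = fetcher.apply(url);
        String title = parseTitle(html);
        if (!title.isEmpty()) {
            store.add(title);
        }
    }

    public List<String> stored() {
        return store;
    }

    public static void main(String[] args) {
        // A canned page replaces the network call so the sketch runs offline.
        MiniCrawler crawler = new MiniCrawler(
                url -> "<html><head><title> Demo page </title></head></html>");
        crawler.crawl("http://example.com/");
        System.out.println(crawler.stored()); // prints [Demo page]
    }
}
```

Keeping the fetch step behind a function makes it easy to swap the canned page for a real HTTP client later.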

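1. Getting started

Environment setup

(1) JDK 1.8 (2) IntelliJ IDEA (3) Maven

(4) Add the HttpClient dependency. (Look it up on the repository site and pick the most widely used version rather than the newest one.)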

 <!-- https://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.2</version>
        </dependency>

2. A first small crawler example

Here we write a test class that fetches the complete HTML source of the itcast (傳智播客) home page.

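import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;

public class CrawerFirst {
    public static void main(String[] args) throws IOException {
        // 1. "Open the browser": create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();
        // 2. "Type the address": create an HttpGet object for the GET request
        HttpGet httpGet = new HttpGet("http://www.itcast.cn");
        // 3. "Press Enter": send the request with the HttpClient and get the response
        CloseableHttpResponse response = httpClient.execute(httpGet);
        // 4. Parse the response and read the data
        // Check that the status code is 200
        if (response.getStatusLine().getStatusCode() == 200) {
            HttpEntity httpEntity = response.getEntity();
            String content = EntityUtils.toString(httpEntity, "utf-8");
            System.out.println(content);
        }
        // Release the response and the client
        response.close();
        httpClient.close();
    }
}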

This prints content, that is, the complete HTML source of the home page.

3. HttpClient

Here we use HttpClient, a Java HTTP client library, to fetch web page data.

3.1 GET requests

public static void main(String[] args) throws IOException {
    // Create the HttpClient object
    CloseableHttpClient httpClient = HttpClients.createDefault();
    // Create the HttpGet object and set the URL to visit
    HttpGet httpGet = new HttpGet("http://www.itcast.cn");
    // Send the request with httpClient and get the response
    CloseableHttpResponse response = null;
    try {
        response = httpClient.execute(httpGet);
        // Parse the response
        if (response.getStatusLine().getStatusCode() == 200) {
            // Get the response body and convert it to a string with EntityUtils
            String content = EntityUtils.toString(response.getEntity(), "utf8");
            System.out.println(content.length());
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // Close the response (it is still null if execute() threw) and the client
        if (response != null) {
            response.close();
        }
        httpClient.close();
    }
}

3.2 GET requests with parameters

Parameters are set with URIBuilder.

public class HttpGetTest {
    public static void main(String[] args) throws Exception {
        // Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();

        // The target address is: http://yun.itheima.com/search?keys=Java
        // Create the URIBuilder
        URIBuilder uriBuilder = new URIBuilder("http://yun.itheima.com/search");
        // Set the parameter
        uriBuilder.setParameter("keys", "Java");

        // Create the HttpGet object and set the URL to visit
        HttpGet httpGet = new HttpGet(uriBuilder.build());

        System.err.println("Request being sent: " + httpGet);
        // Send the request with httpClient and get the response
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            // Parse the response
            if (response.getStatusLine().getStatusCode() == 200) {
                // Get the response body and convert it to a string with EntityUtils
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the response (it is still null if execute() threw) and the client
            if (response != null) {
                response.close();
            }
            httpClient.close();
        }
    }
}
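
To make it easier to see what ends up in the request line, here is a rough JDK-only sketch of how parameters turn into a query string. It is not URIBuilder's implementation: `QueryStringSketch` and `withQuery` are hypothetical names, `java.net.URLEncoder` is swapped in for the escaping, and the exact escaping URIBuilder applies may differ in detail.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryStringSketch {
    // Appends the parameters to baseUrl as ?k1=v1&k2=v2, escaping keys and values.
    public static String withQuery(String baseUrl, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(baseUrl);
        char separator = '?';
        try {
            for (Map.Entry<String, String> e : params.entrySet()) {
                sb.append(separator)
                  .append(URLEncoder.encode(e.getKey(), "UTF-8"))
                  .append('=')
                  .append(URLEncoder.encode(e.getValue(), "UTF-8"));
                separator = '&';
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a JVM, so this is not expected to happen
            throw new IllegalStateException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("keys", "Java");
        // prints http://yun.itheima.com/search?keys=Java
        System.out.println(withQuery("http://yun.itheima.com/search", params));

        // Reserved characters are escaped: '&' becomes %26 and a space becomes '+'
        params.put("q", "a&b c");
        // prints http://yun.itheima.com/search?keys=Java&q=a%26b+c
        System.out.println(withQuery("http://yun.itheima.com/search", params));
    }
}
```

The point of going through a builder rather than concatenating strings by hand is exactly this escaping step: a value containing `&` or `=` would otherwise be read as extra parameters.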

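3.3 POST requests without parameters

A POST request without parameters differs from a GET request in only one place: how the request object is declared.

// GET request
HttpGet httpGet = new HttpGet("url-address");
// POST request
HttpPost httpPost = new HttpPost("url-address");

3.4 POST requests with parameters

With a POST request the URL carries no parameters; the parameter keys=Java is submitted in the form body instead.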

public static void main(String[] args) throws Exception {
    // Create the HttpClient object
    CloseableHttpClient httpClient = HttpClients.createDefault();

    // Target: http://yun.itheima.com/search, with keys=Java sent in the form body

    // Create the HttpPost object and set the URL to visit
    HttpPost httpPost = new HttpPost("http://yun.itheima.com/search");

    // Declare a list that holds the form parameters
    List<NameValuePair> params = new ArrayList<NameValuePair>();
    // Set the parameter
    params.add(new BasicNameValuePair("keys", "Java"));

    // Create the form entity: the first argument is the form data, the second is the encoding
    UrlEncodedFormEntity formEntity = new UrlEncodedFormEntity(params, "utf8");

    // Attach the form entity to the POST request
    httpPost.setEntity(formEntity);

    // Send the request with httpClient and get the response
    CloseableHttpResponse response = null;
    try {
        response = httpClient.execute(httpPost);
        // Parse the response
        if (response.getStatusLine().getStatusCode() == 200) {
            // Get the response body and convert it to a string with EntityUtils
            String content = EntityUtils.toString(response.getEntity(), "utf8");
            System.out.println(content.length());
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // Close the response (it is still null if execute() threw) and the client
        if (response != null) {
            response.close();
        }
        httpClient.close();
    }
}
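
To see that a form POST really carries its parameters in the body and not in the URL, the following sketch posts `keys=Java` to a throwaway local server and reports what the server received. It deliberately swaps in JDK classes (`HttpURLConnection` and the built-in `com.sun.net.httpserver.HttpServer`) instead of HttpClient so it runs without a network; the class name, the `/search` handler and the `query|body` reply format are all made up for the illustration.

```java
import com.sun.net.httpserver.HttpServer;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class FormPostSketch {
    // Reads a stream fully into a UTF-8 string.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    // Posts formBody to a local test server and returns "<query>|<body>" as seen by the server.
    public static String postForm(String formBody) {
        HttpServer server = null;
        try {
            // Port 0 lets the operating system pick a free port
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/search", exchange -> {
                String query = exchange.getRequestURI().getQuery(); // null when the URL has no parameters
                String body = readAll(exchange.getRequestBody());
                byte[] reply = (query + "|" + body).getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, reply.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(reply);
                }
            });
            server.start();

            URL url = new URL("http://127.0.0.1:" + server.getAddress().getPort() + "/search");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true); // we are going to write a request body
            conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded; charset=UTF-8");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(formBody.getBytes(StandardCharsets.UTF_8));
            }
            try (InputStream in = conn.getInputStream()) {
                return readAll(in);
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        // prints null|keys=Java : no query string in the URL, parameter in the body
        System.out.println(postForm("keys=Java"));
    }
}
```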

3.5 Connection pool

Creating a new HttpClient for every request means clients are constantly being created and destroyed. A connection pool avoids this.

public class HttpClientPoolTest {
    public static void main(String[] args) throws Exception {
        // Create the connection pool manager
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();

        // Set the maximum total number of connections
        cm.setMaxTotal(100);

        // Set the maximum number of connections per host (route)
        cm.setDefaultMaxPerRoute(10);

        // Send a request through the connection pool manager
        doGet(cm);

    }

    private static void doGet(PoolingHttpClientConnectionManager cm) throws Exception {
        // Build an HttpClient backed by the shared pool, so connections are reused
        // instead of being opened from scratch for every request
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();

        HttpGet httpGet = new HttpGet("http://www.itcast.cn");
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);

            if (response.getStatusLine().getStatusCode() == 200) {
                String content = EntityUtils.toString(response.getEntity(), "utf8");
                System.out.println(content.length());
            }

        } catch (Exception e) {
            // Keep the original exception as the cause so the failure can be diagnosed
            throw new Exception("Request failed", e);
        } finally {
            if (response != null) {
                response.close();
            }

            // Do not close the HttpClient here: closing it would also shut down
            // the shared connection manager
            //httpClient.close();
        }
    }
}

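4. Request parameters (configuring the request with RequestConfig)

Sometimes, because of the network or the target server, a request needs more time to complete, so we need to customize the relevant timeouts.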

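public class HttpConfigTest {
    public static void main(String[] args) throws IOException {
        // Create the HttpClient object
        CloseableHttpClient httpClient = HttpClients.createDefault();

        // Create the HttpGet object and set the URL to visit
        HttpGet httpGet = new HttpGet("http://www.itcast.cn");

        // Configure the request
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(1000)            // max time to establish the connection, in milliseconds
                .setConnectionRequestTimeout(500)   // max time to obtain a connection from the connection manager, in milliseconds
                .setSocketTimeout(10 * 1000)        // max time to wait for data (socket timeout), in milliseconds
                .build();

        // Apply the configuration to the request
        httpGet.setConfig(config);

        // Send the request; the timeouts above apply to this call
        try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
            System.out.println(response.getStatusLine());
        } finally {
            httpClient.close();
        }
    }
}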

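II. Jsoup

Once a page has been fetched it still has to be parsed. You could parse it with string utilities or regular expressions, but both approaches carry a high development cost, so we use a tool built specifically for parsing HTML pages.

2.1 Introduction to Jsoup

jsoup is a Java HTML parser that can parse a URL or HTML text directly. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors and jQuery-like methods.

The main features of jsoup are:

1. Parse HTML from a URL, a file or a string;

2. Find and extract data using DOM traversal or CSS selectors.

2.2 Dependencies needed for Jsoup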

 <!-- jsoup -->
        <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.2</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/junit/junit
        JUnit testing library, used only for tests -->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/commons-io/commons-io
        A utility class from this jar is used here; the same goes for commons-lang3 -->
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.4</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.commons/commons-lang3 -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.9</version>
        </dependency>

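2.3 Parsing a URL with Jsoup

Here is a small example that fetches the content of the title tag on the itheima/itcast home page.

    @Test
    public void testUrl() throws Exception {
        // Parse the URL: the first argument is the address to visit,
        // the second is the timeout for the request (in milliseconds).
        // The return value is a DOM object, which can be thought of as the fetched HTML page.
        Document doc = Jsoup.parse(new URL("http://www.itcast.cn"), 1000);
        // Use the tag selector to get the content of the title tag
        String title = doc.getElementsByTag("title").first().text(); // text of the first match
        System.out.println(title);
    }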

Note: although Jsoup can replace HttpClient and issue the request itself, it is usually not used that way. Real crawlers need multithreading, connection pools, proxies and so on, and Jsoup's support for these is limited, so Jsoup is generally used only as an HTML parsing tool.

2.4 Parsing a string with Jsoup

    @Test
    public void testString() throws Exception {
        // Read the file with the utility class to get a string
        String content = FileUtils.readFileToString(new File("D:\\IdeaProjects\\黨建項目\\client\\src\\main\\resources\\templates\\web\\demo\\student\\lzjj_test.html"), "utf8");
        // Parse the string
        Document doc = Jsoup.parse(content);
        // Get the content of the title tag
        String title = doc.getElementsByTag("title").first().text();
        System.out.println(title);
    }

2.5 Parsing a file with Jsoup

    @Test
    public void testFile() throws Exception {
        // Parse the file
        Document doc = Jsoup.parse(new File("D:\\IdeaProjects\\黨建項目\\light-client\\src\\main\\resources\\templates\\web\\demo\\student\\lzjj_test.html"), "utf8");
        String title = doc.getElementsByTag("title").first().text();
        System.out.println(title);
    }

2.6 Getting elements through the DOM

    @Test
    public void testDom() throws Exception {
        // Parse the file and get the Document object
        Document doc = Jsoup.parse(new File("D:\\IdeaProjects\\黨建項目\\light-client\\src\\main\\resources\\templates\\web\\demo\\student\\lzjj_test.html"), "utf8");
        // Get elements
        // By id
        /*Element a = doc.getElementById("a");
        System.out.println(a.text());*/
        // By tag name
        Element element = doc.getElementsByTag("td").last();
        System.out.println(element);
        // By class name
        Element test = doc.getElementsByClass("test").first();
        // By attribute name
        Elements abc = doc.getElementsByAttribute("abc");
        // By attribute name and attribute value
        Elements href = doc.getElementsByAttributeValue("href", "www.baidu.com");
    }

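2.7 Getting data from an element

The previous step obtained elements. How do we get the various pieces of data out of them?

1. Get the id from the element

2. Get the className from the element

3. Get the value of an attribute from the element (attr)

4. Get all attributes from the element (attributes)

5. Get the text content from the element (text)

    @Test
    public void testData() throws Exception {
        // Parse the file and get the Document object
        Document doc = Jsoup.parse(new File("D:\\IdeaProjects\\黨建項目\\light-client\\src\\main\\resources\\templates\\web\\demo\\student\\lzjj_test.html"), "utf8");
        Element element = doc.getElementsByTag("td").last();
        // Get the element's id
        String id = element.id();
        // Get the value of the element's class attribute (className)
        String className = element.className();
        System.out.println(className);
        // If className consists of several classes, this returns each of them separately
        Set<String> strings = element.classNames();
        for (String s : strings) {
            System.out.println(s);
        }
        // Get the value of the class attribute with attr
        String aClass = element.attr("class");
        // Get all attributes of the element
        Attributes attributes = element.attributes();
        // Get the text content of the element
        String text = element.text();
    }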

2.8 Getting elements with combined selectors

    @Test
    public void testSelectors() throws Exception {
        // Parse the file and get the Document object
        Document doc = Jsoup.parse(new File("D:\\IdeaProjects\\黨建項目\\light-client\\src\\main\\resources\\templates\\web\\demo\\student\\lzjj_test.html"), "utf8");
        // element + id
        Element element = doc.select("p#lese").first();
        // element + class
        Element ele = doc.select("p.lese").first();
        // element + attribute name
        Elements select = doc.select("p[abc]");
        // Any combination of element, class, id and attribute name
        Element first = doc.select("p[abc].lese").first();
        // Descendants of an element at any depth, e.g. .city li
        Element first1 = doc.select(".city li").first();
        // Direct children only, e.g. .city>ul>li
        Element first2 = doc.select(".city>ul>li").first();
        // parent > * : all direct children of a parent element
        Element first3 = doc.select(".city>ul>*").first();
        System.out.println(first);
    }

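III. Case study: scraping JD product information

Only part of JD's data is scraped here: product images, prices, colors and similar information.

3.1 Create the database table first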

[Image unavailable: table structure]

3.2 Add dependencies

The project is built with Spring Boot, Spring Data JPA and a scheduled task.

Create a Maven project and add the following dependencies.

<dependencies>
        <!-- springMVC -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
            <version>2.1.3.RELEASE</version>
        </dependency>
        <!-- Mysql -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>8.0.13</version>
        </dependency>
        <!-- HttpClient, used to fetch the data -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.2</version>
        </dependency>
        <!-- jsoup -->
        <!-- used to parse the data -->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.2</version>
        </dependency>
        <!-- JUnit testing library -->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.4</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.commons/commons-lang3 -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.9</version>
        </dependency>

        <!--springboot data-jpa -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
            <version>2.1.4.RELEASE</version>
        </dependency>
    </dependencies>

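3.3 Add the configuration file

Add an application.properties configuration file:

# Database configuration
# Connector/J 8.x driver class (com.mysql.jdbc.Driver is the deprecated legacy name)
spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver
spring.datasource.url=jdbc:mysql://127.0.0.1:3306/jsoup
spring.datasource.username=root
spring.datasource.password=1234
# JPA configuration
spring.jpa.database=mysql
spring.jpa.show-sql=true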

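3.4 Implementation

Start with the POJO class:

@Entity
@Table(name = "jd_item")
public class Item {
    // Primary key; IDENTITY assumes an auto-increment id column in the table
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;
    private Long spu;
    private Long sku;
    private String title;
    private double price;
    private String pic;
    private String url;
    private Date created;
    private Date updated;

    // Getters and setters omitted; ItemTask calls setSku, setSpu, setUrl and setPic
}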

3.5 Wrapping HttpClient

HttpClient is used repeatedly, so we wrap it in a utility class for convenience.

package com.qianlong.jd.util;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;
import org.springframework.stereotype.Component;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.UUID;

@Component  // registered as a Spring bean
public class HttpUtils {
    // Use a connection pool
    private PoolingHttpClientConnectionManager cm;

    // A no-arg constructor is enough because nothing needs to be passed in;
    // it exists to create and configure the connection pool
    public HttpUtils() {
        this.cm = new PoolingHttpClientConnectionManager();
        // Set the maximum total number of connections
        this.cm.setMaxTotal(100);
        // Set the maximum number of connections per host (route)
        this.cm.setDefaultMaxPerRoute(10);
    }

    /**
     * Downloads the page at the given address with a GET request.
     * @param url address of the page
     * @return the page source, or an empty string if the download failed
     */
    public String doGetHTML(String url) {
        // Get an HttpClient backed by the connection pool
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
        // Create the HttpGet object and set the URL
        HttpGet httpGet = new HttpGet(url);
        // Apply the request configuration
        httpGet.setConfig(this.getConfig());

        CloseableHttpResponse response = null;

        try {
            // Send the request with httpClient and get the response
            response = httpClient.execute(httpGet);
            // Parse the response and return the result
            if (response.getStatusLine().getStatusCode() == 200) {
                // Only read the body with EntityUtils if the response entity is not null
                if (response.getEntity() != null) {
                    String content = EntityUtils.toString(response.getEntity(), "utf8");
                    return content;
                }
            }

        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Close the response
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }

        // Non-200 status, empty body or I/O error
        return "";
    }
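    // Build the request configuration
    private RequestConfig getConfig() {
        RequestConfig config = RequestConfig.custom()
                .setConnectTimeout(1000)           // max time to establish the connection, in milliseconds
                .setConnectionRequestTimeout(500)  // max time to obtain a connection from the pool, in milliseconds
                .setSocketTimeout(500)             // max time to wait for data (socket timeout), in milliseconds
                .build();
        return config;
    }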

    /**
     * Downloads an image.
     * @param url address of the image
     * @return the generated file name of the saved image, or an empty string if the download failed
     */
    public String doGetImage(String url) {
        // Get an HttpClient backed by the connection pool
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();
        // Create the HttpGet object and set the URL
        HttpGet httpGet = new HttpGet(url);
        // Apply the request configuration
        httpGet.setConfig(this.getConfig());

        CloseableHttpResponse response = null;

        try {
            // Send the request with httpClient and get the response
            response = httpClient.execute(httpGet);
            // Parse the response and return the result
            if (response.getStatusLine().getStatusCode() == 200) {
                // Only save the body if the response entity is not null
                if (response.getEntity() != null) {
                    // Get the image's file extension
                    String extName = url.substring(url.lastIndexOf("."));
                    // Generate a new, unique name for the image
                    String picName = UUID.randomUUID().toString() + extName;
                    // Write the image into the target directory; try-with-resources closes the stream
                    try (OutputStream outputStream = new FileOutputStream(new File("D:\\suibian\\image", picName))) {
                        response.getEntity().writeTo(outputStream);
                    }
                    // Download finished: return the image name
                    return picName;
                }
            }

        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Close the response
            if (response != null) {
                try {
                    response.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }

        return "";
    }
}
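
One detail in doGetImage worth a second look is how the file name is derived: `url.substring(url.lastIndexOf("."))` works for plain image URLs but picks up a trailing query string and fails when the URL has no extension at all. A slightly more defensive variant could look like the sketch below; `ImageNameSketch`, `extensionOf` and `targetFile` are hypothetical helpers, and the example.com URLs are placeholders rather than real JD image addresses.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.UUID;

public class ImageNameSketch {
    // Returns the extension including the dot (for example ".jpg"),
    // or "" when the last path segment has no extension.
    public static String extensionOf(String url) {
        // getPath() excludes the query string and fragment
        String path = URI.create(url).getPath();
        if (path == null) {
            return "";
        }
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        // The dot only counts if it belongs to the last path segment
        return dot > slash ? path.substring(dot) : "";
    }

    // Builds the target file inside directory, with a random name and the URL's extension.
    public static Path targetFile(String directory, String url) {
        String name = UUID.randomUUID().toString() + extensionOf(url);
        return Paths.get(directory, name);
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("https://example.com/n7/abc.jpg?size=200")); // prints .jpg
        System.out.println(targetFile("images", "https://example.com/n7/abc.jpg"));
    }
}
```

Building the path with `Paths.get(directory, name)` also takes care of the separator between the directory and the file name.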

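3.6 Fetching the data

A scheduled task lets us fetch the latest data periodically.

First write the Spring Boot startup class (its location is not covered in detail here; it sits at the same level as the packages).

// Scheduled tasks must be enabled first by adding this annotation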
@EnableScheduling
@SpringBootApplication
public class Application {
    public static void main(String[] args) {
        SpringApplication.run(Application.class,args);
    }
}
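
The crawl task in this section is scheduled with `fixedDelay`, meaning the wait is measured from the end of one run to the start of the next. As a rough plain-JDK analogy (not how Spring implements it), `ScheduledExecutorService.scheduleWithFixedDelay` has the same semantics. The class and method names below are hypothetical, and the short delays are chosen only so the demo finishes quickly.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class FixedDelaySketch {
    // Runs task repeatedly, waiting delayMs after each run ENDS before the next one starts.
    // Returns true once the task has completed `times` runs within timeoutMs.
    public static boolean runWithFixedDelay(Runnable task, int times, long delayMs, long timeoutMs) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch remaining = new CountDownLatch(times);
        try {
            scheduler.scheduleWithFixedDelay(() -> {
                task.run();
                remaining.countDown();
            }, 0, delayMs, TimeUnit.MILLISECONDS);
            return remaining.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            // Stop scheduling further runs
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        AtomicInteger runs = new AtomicInteger();
        boolean finished = runWithFixedDelay(() -> runs.incrementAndGet(), 3, 50, 5000);
        System.out.println("finished=" + finished + ", runs=" + runs.get());
    }
}
```

Because the delay starts only after a run completes, a slow crawl never overlaps with the next one, which is usually what a crawler wants.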

Now for the main part: fetching the data.

package com.qianlong.jd.task;

import com.qianlong.jd.pojo.Item;
import com.qianlong.jd.service.ItemService;
import com.qianlong.jd.service.ItemServiceImpl;
import com.qianlong.jd.util.HttpUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.util.List;

@Component
public class ItemTask {
    @Autowired
    private HttpUtils httpUtils;
    @Autowired
    private ItemService itemService;

    // After a run finishes, wait 100 seconds before starting the next one
    @Scheduled(fixedDelay = 100 * 1000)
    public void itemTask() throws Exception {
        // Base search URL; the page number is appended to the trailing "page=" parameter
        String url = "https://search.jd.com/Search?keyword=iphone&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&wq=iphone&s=1&click=&page=";

        // Download the first five result pages (numbered 1, 3, 5, 7, 9)
        // and parse the phone search results page by page
        for (int i = 1; i < 10; i = i + 2) {
            String html = httpUtils.doGetHTML(url + i);
            // Parse the page, extract the product data and store it
            parse(html);
        }
        System.out.println("Phone data crawl finished");
    }

    /**
     * Parses a search result page.
     * @param html page source
     */
    private void parse(String html) throws Exception {
        System.err.println("Entered the parse method");
        // Parse the HTML into a DOM object
        Document dom = Jsoup.parse(html);
        // Get the SPU entries
        Elements elements = dom.select("div#J_goodsList>ul>li");
        for (Element element : elements) {
            // Get the SPU
            long spu = Long.parseLong(element.attr("data-spu"));
            // Get the SKU entries
            Elements elements1 = element.select("li.ps-item");
            for (Element skuEle : elements1) {
                // Get the SKU
                long sku = Long.parseLong(skuEle.select("[data-sku]").attr("data-sku"));
                System.err.println(sku);
                // Look the product up by SKU
                Item item = new Item();
                item.setSku(sku);
                List<Item> list = itemService.findAll(item);
                // If the product already exists, skip it and move on to the next one
                if (list.size() > 0) {
                    continue;
                }

                // Set the product's SPU
                item.setSpu(spu);
                // Build the URL of the product detail page
                String itemUrl = "https://item.jd.com/" + sku + ".html";
                item.setUrl(itemUrl);
                // Download the product image
                String picUrl = "https:" + skuEle.select("img[data-sku]").first().attr("data-lazy-img");
                String picName = httpUtils.doGetImage(picUrl);
                item.setPic(picName);

                // Save the data to the database
                itemService.save(item);
            }
        }
    }
}
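
Two small pieces of the task can be tried in isolation without Spring or a network connection: generating the odd page numbers (1, 3, 5, ...) and parsing numeric attributes such as data-spu, which would make `Long.parseLong` throw if the attribute were empty. The helper names below are hypothetical and the base URL is a placeholder.

```java
import java.util.ArrayList;
import java.util.List;

public class SearchPagesSketch {
    // Builds the URLs of the first `pages` result pages, assuming the odd
    // numbering 1, 3, 5, ... used in the task above.
    public static List<String> pageUrls(String baseUrl, int pages) {
        List<String> urls = new ArrayList<>();
        for (int n = 0; n < pages; n++) {
            urls.add(baseUrl + (2 * n + 1));
        }
        return urls;
    }

    // Parses a numeric attribute value; returns fallback when it is null, blank or malformed.
    public static long parseLongOr(String raw, long fallback) {
        if (raw == null) {
            return fallback;
        }
        try {
            return Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Placeholder base URL ending in the page parameter
        for (String u : pageUrls("https://example.com/Search?keyword=iphone&page=", 5)) {
            System.out.println(u); // ...page=1, ...page=3, ...page=5, ...page=7, ...page=9
        }
        System.out.println(parseLongOr("", -1L));     // prints -1
        System.out.println(parseLongOr("12345", -1L)); // prints 12345
    }
}
```

With a helper like `parseLongOr`, a list item that lacks a data-spu value can be skipped or given a default instead of aborting the whole page.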

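The case study is essentially complete at this point. What remains is the DAO layer that inserts the data into the database, which is omitted here.

That concludes the crawler. The material above covers the basics of Java crawlers and is enough for small demos such as scraping part of a website. Real crawler projects, however, use crawler frameworks such as WebMagic, which is built on HttpClient and Jsoup, makes crawler development more convenient and ships with commonly used components. To go deeper you need to study those frameworks; what is covered here is only the foundation.

The source code can be downloaded from the link below. If you find it useful, feel free to leave a comment!

Project link: https://pan.baidu.com/s/1ArXk_QlmtbhzW_wfMrerFw
Extraction code: sqms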
