A worked example of scraping data with jQuery and Python

Reposted 2016-05-04 12:09, Python

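The goal is to scrape database records of car brands, models, and series from an existing website.

First, the result: roughly 40,000 car trim records.

I created four tables in total, storing brands, series, models, and trims respectively.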
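The post does not show the table definitions, so the following is only a rough sketch of what a four-table layout could look like. Every table and column name here (car_brand, car_series, car_model, car_trim and their fields) is a hypothetical choice for illustration, and SQLite stands in for whatever database was actually used.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative
# assumptions, not the author's actual database layout.
SCHEMA = """
CREATE TABLE car_brand (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE car_series (
    id            INTEGER PRIMARY KEY,
    brand_id      INTEGER NOT NULL REFERENCES car_brand(id),
    name          TEXT NOT NULL,
    car_series_id TEXT            -- id used by the site in its URLs
);
CREATE TABLE car_model (
    id        INTEGER PRIMARY KEY,
    series_id INTEGER NOT NULL REFERENCES car_series(id),
    name      TEXT NOT NULL
);
CREATE TABLE car_trim (
    id       INTEGER PRIMARY KEY,
    model_id INTEGER NOT NULL REFERENCES car_model(id),
    title    TEXT NOT NULL        -- e.g. year and displacement
);
"""


def create_tables(conn):
    """Create the four illustrative tables on the given connection."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_tables(conn)
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)
```

Each level points at its parent through a foreign key, so one trim record can be traced back to its brand.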

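Overall process:

Using jQuery to grab the bulk content rendered on the page

When a page exposes a large amount of the needed data at once, I read the raw data with jQuery and print it record by record in the console panel. I build each record directly as an SQL statement.

Open Chrome and go to http://www.autozi.com/carBrandLetter/.html. Press F12, click the Console panel, and paste the following: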

$("tr.event_scroll").each(function () {
   var _this = $(this);
   // Main brand, e.g. Audi or BMW (read here but not used in the SQL below)
   var mainBrandName = _this.find('th>h4').text();
   var seriesList    = _this.find('.car-series-list li');
   $.each(seriesList, function (i, el) {
       // Sub-brands under each brand, e.g. Audi has imported Audi and FAW Audi
       var subBrandName = $(el).find('h4').text();
       // Each series, e.g. Audi A6, A4
       var models = $(el).find('a.carModel');
       $.each(models, function (j, element) {
            var model = $(element).text();
            var carSeriesId = getCarSeriesId($(element).attr('s_href'));
            // Build an SQL statement for inserting into the database
            getSql(subBrandName, model, carSeriesId);
       });
   });
});

// Extract the id parameter from a URL,
// e.g. http://www.autozi.com:80/carmodels.do?carSeriesId=1306030951244661 gives 1306030951244661

function getCarSeriesId(str) {
    return str.slice(str.indexOf('=')+1);
}

// Build an SQL statement for inserting into the database, e.g.
// insert into tableName(brandName, name, carSeriesId) values ("一汽奧迪", "A6", "425");

function getSql(subBrandName,model,carSeriesId) {
    var str = 'insert into tableName(brandName, name, carSeriesId) values ("'+subBrandName+'", "'+model+'", "'+carSeriesId+'");';
    console.log(str);
}

Press Enter, and the console prints the generated SQL statements.

This gave me all the car brands, sub-brands, and series.
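The two helper steps above (pulling carSeriesId out of a URL and turning a record into an insert) can also be sketched in Python. This is not the author's code: get_car_series_id and insert_series are hypothetical names, SQLite is an assumed stand-in database, and the table reuses the tableName placeholder from the SQL above. The sketch parses the query string instead of slicing at the first "=", and uses placeholders instead of string concatenation so that names containing quotes do not break the statement.

```python
import sqlite3
from urllib.parse import urlparse, parse_qs


def get_car_series_id(url):
    """Return the carSeriesId query parameter of url, or None if absent."""
    values = parse_qs(urlparse(url).query).get("carSeriesId")
    return values[0] if values else None


def insert_series(conn, brand_name, name, car_series_id):
    """Insert one series row; placeholders let the driver quote the values."""
    conn.execute(
        "INSERT INTO tableName (brandName, name, carSeriesId) VALUES (?, ?, ?)",
        (brand_name, name, car_series_id),
    )


# In-memory database, used here only to demonstrate the insert.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tableName (brandName TEXT, name TEXT, carSeriesId TEXT)")

series_id = get_car_series_id(
    "http://www.autozi.com:80/carmodels.do?carSeriesId=1306030951244661"
)
insert_series(conn, "一汽奧迪", "A6", series_id)
rows = conn.execute("SELECT brandName, name, carSeriesId FROM tableName").fetchall()
print(rows)
```

Parsing the query string keeps working if the site later adds other parameters before carSeriesId, which slicing at the first "=" would not.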

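However, the specific models, which include the year and engine displacement, cannot be obtained this way. Take the Audi A6L: there is a 2011 model with a 2.0L engine, a 2005 model with a 4.2L engine, and so on.

The site shows these in a popup.

For example, clicking A6L sends an ajax request to: http://www.autozi.com/carmodels.do?carSeriesId=425&_=1462335011762

Clicking page two sends a new ajax request to: http://www.autozi.com/carmodelsAjax.do?currentPageNo=2&carSeriesId=425&_=1462335011762

The Audi A6L has four pages in total, and carSeriesId=425 was already obtained above. To get every year and displacement of the A6L, four requests are needed, to: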

http://www.autozi.com/carmodelsAjax.do?currentPageNo=[#page]&carSeriesId=425

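[#page] takes the values 1 to 4; only the page parameter changes between requests. Requesting a page that does not exist, such as http://www.autozi.com/carmodelsAjax.do?currentPageNo=5&carSeriesId=425, returns an empty page.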
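As a small illustration of the URL pattern just described, the four page addresses can be generated like this. The function name page_url is a hypothetical helper, and the parameter order simply follows the address shown above.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.autozi.com/carmodelsAjax.do"


def page_url(car_series_id, page):
    """Build the ajax URL for one page of one series (hypothetical helper)."""
    query = urlencode({"currentPageNo": page, "carSeriesId": car_series_id})
    return f"{BASE_URL}?{query}"


# Pages 1 to 4 of the Audi A6L series (carSeriesId=425).
urls = [page_url(425, page) for page in range(1, 5)]
for url in urls:
    print(url)
```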

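I had previously learned a little about scraping web pages with Python's BeautifulSoup library, and it came in handy here.

Using Python to fetch page content

getSoup opens a page and returns its parsed HTML. The page number parameter in the request URL (currentPageNo) starts at 1. The script checks whether the returned HTML is empty; if it has content, the page number is incremented and the address is requested again.

If it is empty, the script moves on to the next series.

The script pauses 10 seconds between series, because I found that the server returns empty responses when requests are too frequent.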
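The control flow described above can be sketched independently of the network code. In this sketch collect_pages, fetch_items, and sleep_fn are hypothetical names, and the fake data exists only to exercise the loop; a real fetch_items would request and parse the page.

```python
from time import sleep


def collect_pages(series_ids, fetch_items, pause_seconds=10, sleep_fn=sleep):
    """Collect items page by page for each series id.

    fetch_items(series_id, page) is a caller-supplied function that returns
    a list of items, or an empty list when the page does not exist.
    """
    results = {}
    for series_id in series_ids:
        items = []
        page = 1
        while True:
            page_items = fetch_items(series_id, page)
            if not page_items:
                break  # empty page: no more pages for this series
            items.extend(page_items)
            page += 1
        results[series_id] = items
        sleep_fn(pause_seconds)  # pause before starting the next series
    return results


# Made-up data standing in for the site, for demonstration only.
FAKE_PAGES = {
    (425, 1): ["A6L 2011 2.0L"],
    (425, 2): ["A6L 2005 4.2L"],
}


def fake_fetch(series_id, page):
    return FAKE_PAGES.get((series_id, page), [])


pauses = []
result = collect_pages([425, 741], fake_fetch, sleep_fn=pauses.append)
print(result)
```

Passing the sleep function in makes the loop testable without waiting. One limitation of stopping on the first empty page: an empty response caused by throttling looks the same as the end of a series, so a stricter version might retry before giving up.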

#!/usr/bin/env python3
from urllib.request import urlopen
from bs4 import BeautifulSoup
from time import sleep


def getList(soup):
    # Append the title of every <a class="link"> on the page to cars.txt
    with open("cars.txt", "a+", encoding="utf-8") as fo:
        for link in soup.find_all("a", class_="link"):
            title = link.get('title')
            if title:  # skip links without a title attribute
                fo.write(title + '\n')


def getSoup(modelId, pageNumber):
    # Fill the series id and page number into the URL template, then fetch and parse the page
    tpl_url  = "http://www.autozi.com/carmodelsAjax.do?carSeriesId=[#id]&currentPageNo=[#page]"
    real_url = tpl_url.replace('[#id]', str(modelId))
    real_url = real_url.replace('[#page]', str(pageNumber))
    from_url = urlopen(real_url).read().decode('utf-8')
    soup     = BeautifulSoup(from_url, "html5lib")  # requires the html5lib package
    return soup


modelIds = [741, 1121, 357, 1055]

for modelId in modelIds:
    i = 1
    while True:
        soup    = getSoup(modelId, i)
        # An empty page has no <li>; one match is enough to know the page has content
        carList = soup.find_all('li', limit=1)
        if not carList:
            break
        getList(soup)
        i += 1
    # Pause between series; requesting too often makes the server return empty pages
    sleep(10)

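Because the script takes a long time to run, I put it on my own VPS and saved it as car.py.

Then, on the Linux command line, run nohup ./car.py & (for ./car.py to run directly, the file must be executable and begin with a Python shebang line; otherwise run it through the interpreter, e.g. nohup python3 car.py &).

This keeps the script running if the connection drops, and puts the task in the background.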
