Python Crawlers and the FAW Project [Part 1]: Notes on Problems Met While Crawling CNOOC, China Post, and State Grid

Reposted article. 2019-03-27 16:10. Tags: Python crawlers / lab work

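Project overview

CNOOC (China National Offshore Oil Corporation) was the first company I crawled, followed by State Grid and China Post. The page source of all three sites is not particularly hard to work with.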

Procurement information pages:

State Grid e-commerce platform

http://ecp.sgcc.com.cn/project_list.jsp?site=global&column_code=014001001&project_type=1

China National Offshore Oil Corporation (CNOOC)

https://buy.cnooc.com.cn/cbjyweb/001/001001/moreinfo.html

China Post

http://www.chinapost.com.cn/html1/category/181313/7294-1.htm

Project repository:

https://github.com/code-return/Crawl_faw

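Implementation and methods

1. China National Offshore Oil Corporation (CNOOC)

CNOOC's listing pages are crawler-friendly and posed no real difficulty. The implementation proceeds in this order: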

# Fetch the HTML of a page
def get_one_page(url):
    ...

# Parse a listing page
def parse_one_page(html):
    ...

# Get the maximum page number
def getMaxpage(html):
    ...

# Get the text content of a second-level (detail) page
def getContent(url):
    ...

# Main function
def main():

    url = "https://buy.cnooc.com.cn/cbjyweb/001/001001/moreinfo.html"
    html = get_one_page(url)
    parse_one_page(html)
    page_num = getMaxpage(html)
    # Build the URL of each subsequent page, fetch its source, and parse it
    for i in range(2, page_num + 1):
        next_url = url.replace('moreinfo', str(i))
        next_html = get_one_page(next_url)
        parse_one_page(next_html)

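The main function has to handle pagination. It first reads the maximum page number from the first page, then loops over the page numbers. Since the first page has already been parsed, the loop starts at page 2.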
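The URL pattern used in that loop can be isolated into a small helper, which makes it easy to check without touching the network. A minimal sketch, assuming (as the main function above does) that page N of the listing is reached by replacing `moreinfo` with N; `build_page_urls` is a hypothetical name, not part of the project:

```python
def build_page_urls(first_url, max_page):
    """Return the listing URLs for pages 2..max_page.

    Assumes page N lives at the first-page URL with 'moreinfo'
    replaced by N, as in the main() function above.
    """
    return [first_url.replace("moreinfo", str(n)) for n in range(2, max_page + 1)]


first = "https://buy.cnooc.com.cn/cbjyweb/001/001001/moreinfo.html"
for page_url in build_page_urls(first, 3):
    print(page_url)
```

With a maximum page number of 1 the list is empty, so a single-page listing needs no special case.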

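In the page-parsing function parse_one_page(html), the main work is to extract each notice's title, publication date, and href from the page, then filter the data and store it.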
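As a rough illustration of the extraction step, here is a sketch that uses only the standard library. The markup it parses (an `<li>` holding an `<a href>` and a `<span>` with the date) is an assumption made for the example, not CNOOC's real page structure, and `ListingParser` is a hypothetical helper; the project's own `parse_one_page` is elided in the listing above.

```python
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collect (title, date, href) from <li><a href=...>title</a><span>date</span></li>."""

    def __init__(self):
        super().__init__()
        self.items = []      # finished (title, date, href) tuples
        self._href = None    # href of the <a> in the current <li>
        self._title = ""
        self._date = ""
        self._in = None      # tag whose text is currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._in = "a"
        elif tag == "span":
            self._in = "span"

    def handle_data(self, data):
        if self._in == "a":
            self._title += data
        elif self._in == "span":
            self._date += data

    def handle_endtag(self, tag):
        if tag in ("a", "span"):
            self._in = None
        elif tag == "li" and self._href:
            self.items.append((self._title.strip(), self._date.strip(), self._href))
            self._href, self._title, self._date = None, "", ""


# Hypothetical sample markup, for illustration only
sample = """
<ul>
  <li><a href="/notice/1.html">Vehicle leasing tender</a><span>2019-03-01</span></li>
  <li><a href="/notice/2.html">Office supplies</a><span>2018-12-20</span></li>
</ul>
"""
parser = ListingParser()
parser.feed(sample)
for title, date, href in parser.items:
    print(date, title, href)
```

Each tuple can then be passed through the date and title filters before anything is stored.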

import datetime

def time_restrant(date):  # Date filter: was the notice published in the current year?
    today = datetime.date.today()
    thisYear, thisMonth, thisday = today.year, today.month, today.day
    year, month, day = (int(part) for part in date.split('-'))
    # Alternative windows; swap one of these conditions in to change the time range:
    # if (thisYear - year <= 1) or (thisYear - year == 2 and month >= thisMonth):  # last 24 months
    # if thisYear == year and month == thisMonth and day == thisday:  # today only
    # if thisYear == year and month == thisMonth:  # current month
    return thisYear == year
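The commented-out 24-month condition compares years and months by hand. As an alternative sketch (not the project's code), a similar check can be written by parsing the date and counting calendar months. `within_months` and its `today` parameter are hypothetical names, introduced so that the check can be exercised with a fixed date; the 'YYYY-MM-DD' format is taken from the `split('-')` calls in the function above.

```python
import datetime


def within_months(date_str, months, today=None):
    """Return True if a 'YYYY-MM-DD' date is at most `months` calendar months old."""
    today = today or datetime.date.today()
    published = datetime.datetime.strptime(date_str, "%Y-%m-%d").date()
    # Difference in calendar months, ignoring the day of the month
    age = (today.year - published.year) * 12 + (today.month - published.month)
    return 0 <= age <= months


fixed_today = datetime.date(2019, 3, 27)
print(within_months("2017-03-05", 24, today=fixed_today))  # True
print(within_months("2017-02-28", 24, today=fixed_today))  # False
```

Unlike the hand-written condition, this version rejects dates that lie in the future.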

def title_restraint(title, car_count, true_count):  # Title filter: does the title contain the wanted "車" (vehicle) content?
    global most_kw_arr
    global pos_kw_arr
    global neg_kw_arr
    car_count += 1
    if title.find(u"車") == -1:  # or title.find(u"采購公告"):
        return False, car_count, true_count
    else:
        # car_count += 1
        neg_sign = 0
        pos_sign = 0

        for neg_i in neg_kw_arr:
            if title.find(neg_i) != -1:  # a negative keyword (d_neg_kw) appears
                neg_sign = 1
                break

        for pos_i in pos_kw_arr:
            if title.find(pos_i) != -1:  # a positive keyword (d_pos_kw) appears
                pos_sign = 1
                break

        if neg_sign == 1:
            return False, car_count, true_count
        else:
            if pos_sign == 0:
                return False, car_count, true_count
            elif pos_sign == 1:
                true_count += 1
                return True, car_count, true_count
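The decision made by `title_restraint` comes down to three tests: the title must mention a vehicle, must contain no negative keyword, and must contain at least one positive keyword. Below is a compact sketch of just that decision, with the counters left out. `title_matches` and the example keyword lists are hypothetical, since the contents of `pos_kw_arr` and `neg_kw_arr` are not shown in the article.

```python
def title_matches(title, pos_keywords, neg_keywords, required="車"):
    """Return True if the title passes the three keyword tests."""
    if required not in title:
        return False
    if any(kw in title for kw in neg_keywords):
        return False
    return any(kw in title for kw in pos_keywords)


# Hypothetical keyword lists, for illustration only
pos = ["采購", "租賃"]
neg = ["維修"]
print(title_matches("公務車租賃項目", pos, neg))    # True
print(title_matches("車輛維修服務采購", pos, neg))  # False
print(title_matches("辦公用品采購", pos, neg))      # False
```

Passing the keyword lists in as arguments, rather than reading globals, keeps the function easy to test in isolation.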

Once the data has been filtered, it is stored:

def store(title, date, content, province, url):  # Store a vehicle-related notice in the nbd_message table
    title, content = removeSingleQuote(title, content)
    sql = "insert into nbd_message (title,time,content,province,href) values('%s','%s','%s','%s','%s')" % (
        title, date, content, province, url)
    return mySQL("pydb", sql, title, date, province)


def store_nbd_log(car_count, true_count, province_file):  # Store the crawl log in the nbd_spider_log table
    sql = "insert into nbd_spider_log (total_num,get_num,pro_name,spider_time) values('%d','%d','%s','%s')" % (
        car_count, true_count, province_file, str(datetime.date.today()))
    ...
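Building the INSERT with `%` formatting is the reason single quotes have to be stripped from the title and content first. An alternative worth considering is to let the database driver bind the values. The sketch below uses the standard-library `sqlite3` module with an in-memory database purely so that it runs anywhere. The project itself talks to MySQL through its own `mySQL` helper, whose internals are not shown, and `store_message` is a hypothetical name. MySQL drivers such as PyMySQL use `%s` placeholders where `sqlite3` uses `?`.

```python
import sqlite3


def store_message(conn, title, date, content, province, url):
    """Insert one row, letting the driver handle quoting."""
    conn.execute(
        "INSERT INTO nbd_message (title, time, content, province, href) "
        "VALUES (?, ?, ?, ?, ?)",
        (title, date, content, province, url),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nbd_message "
    "(title TEXT, time TEXT, content TEXT, province TEXT, href TEXT)"
)
# A title containing a single quote is stored unchanged
store_message(conn, "Driver's vehicle tender", "2019-03-27", "body",
              "CNOOC", "http://example.com/1")
print(conn.execute("SELECT title FROM nbd_message").fetchall())
```

With bound parameters the stored text is exactly what was crawled, and a quote in a title cannot break the statement.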

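That completes the pipeline.

2. China Post

China Post's pages are even simpler, but there is one catch: each unit has its own link for the tender notices of its business departments. Comparing them, I found that the first-page links of these subordinate units differ only in the last part, so I took a shortcut and added one more loop.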

def main():
    """
    The entries in urls correspond, in order, to: the group company, provincial
    postal branches, Postal Savings Bank, China Post Insurance, and units
    directly under the group company.
    """
    urls = ['7294-', '7331-', '7338-', '7345-', '7360-']
    for prefix in urls:
        strPost = '1.htm'  # URL suffix
        base_url = "http://www.chinapost.com.cn/html1/category/181313/" + prefix
        url = base_url + strPost
        html = get_one_page(url)
        # print(html)
        parse_one_page(html)
        page_num = getMaxpage(html)
        # Pages 2..page_num: substitute the current page number into the suffix
        for page in range(2, page_num + 1):
            next_url = base_url + strPost.replace('1', str(page))
            next_html = get_one_page(next_url)
            parse_one_page(next_html)
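To make the URL scheme explicit, the section prefixes can be paired with their names and the page URLs generated by a small helper. This is only a sketch: the prefix-to-section mapping follows the order given in the docstring above, and the names `SECTIONS` and `page_url` are hypothetical.

```python
BASE = "http://www.chinapost.com.cn/html1/category/181313/"

# Prefix -> section, in the order listed in the docstring of main()
SECTIONS = {
    "7294-": "group company",
    "7331-": "provincial postal branches",
    "7338-": "Postal Savings Bank",
    "7345-": "China Post Insurance",
    "7360-": "units directly under the group company",
}


def page_url(prefix, page):
    """Build the listing URL for one section and a 1-based page number."""
    return f"{BASE}{prefix}{page}.htm"


for prefix, name in SECTIONS.items():
    print(name, page_url(prefix, 1))
```

Formatting the page number directly into the URL avoids relying on a string replacement of the `1` in `1.htm`.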

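That is China Post done.

3. State Grid

State Grid was the first real problem I ran into. In each notice I needed, the href does not hold the usual link to a second-level page. Instead, the link carries two arguments to a JavaScript call.

href="javascript:void(0);" tells the hyperlink to run JavaScript rather than navigate to an address, and void(0) is an expression that does nothing, so no navigation takes place.
Why use href="javascript:void(0);"?
javascript: is a pseudo-protocol meaning that the content of the URL is executed as JavaScript. void(0) performs no action, which prevents the link from jumping to another page. This is usually done to keep the link's styling while stopping the link itself from doing anything.

<a href="javascript:void(0)" onClick="window.open()"> After the click the page stays where it is, and only the onClick handler opens the link

<a href="#" onclick="javascript:return false;"> Has the same effect, but behavior differs between browsers

The URL of the second-level page is determined by the two numbers in the onclick attribute! So I used those two onclick arguments to assemble the second-level page URL.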

import re

hrefAttr = selector.xpath("//*[@class='content']/div/table[@class='font02 tab_padd8']/tr/td/a/@onclick")

for i in range(0, len(hrefAttr)):
    # Extract the jump parameters of the second-level page, used to build its URL
    string = str(hrefAttr[i])
    numbers = re.findall(r"\d+", string)
    attr1 = numbers[0]
    attr2 = numbers[1]
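Put together, the extraction step looks roughly like this. The sketch takes an `onclick` string, pulls out the two numeric arguments, and fills them into a URL template. Both the sample `onclick` value (`showProjectDetail(...)`) and the template are made up for illustration. The article does not show the real function name or the real detail-page URL pattern, so substitute the ones observed on the site.

```python
import re


def onclick_args(onclick):
    """Return the first two runs of digits in an onclick attribute, or None."""
    numbers = re.findall(r"\d+", onclick)
    if len(numbers) < 2:
        return None
    return numbers[0], numbers[1]


# Hypothetical template; the real detail-page pattern is not given in the article
DETAIL_TEMPLATE = "http://example.com/detail/{0}/{1}.html"

sample = "showProjectDetail('014001001','9990000000010')"  # made-up value
args = onclick_args(sample)
if args is not None:
    print(DETAIL_TEMPLATE.format(*args))
```

Returning None when fewer than two numbers are found guards against links whose onclick does not follow the expected shape, which would otherwise raise an IndexError.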

Closing

Back to the grind......
