Some web scraper code


An XPath-based scraper

Scrapes the title, author, monthly-ticket count, and synopsis of popular books on Qidian, and saves the results to xiaoshuo.txt.

import requests
from lxml import etree

# Python 3 version: the output file is opened as UTF-8, which replaces the old
# reload(sys) / sys.setdefaultencoding("utf8") workaround for encoding errors.
header = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}

with open("xiaoshuo.txt", "w", encoding="utf-8") as fo:
    for i in range(1, 6):  # pages 1 to 5; the original range(5) started at page 0 despite i=1
        url = "https://www.qidian.com/all?orderId=&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=0&page=%d" % i
        data = requests.get(url, headers=header).text
        f = etree.HTML(data)

        # Links to each book's detail page, taken from the listing page
        hrefs = f.xpath('/html/body/div[1]/div[5]/div[2]/div[2]/div/ul/li/div[2]/h4/a/@href')
        for href in hrefs:
            href = "https:" + href  # the hrefs come without a scheme
            book = requests.get(href, headers=header).text
            e = etree.HTML(book)
            title = e.xpath('/html/body/div/div[6]/div[1]/div[2]/h1/em/text()')[0]
            zuozhe = e.xpath('/html/body/div/div[6]/div[1]/div[2]/h1/span/a/text()')[0]  # author
            jieshao = e.xpath('/html/body/div/div[6]/div[4]/div[1]/div[1]/div[1]/p/text()')  # synopsis lines
            yuepiao = e.xpath('//*[@id="monthCount"]/text()')[0]  # monthly tickets
            # "line" instead of "str", so the built-in str is not shadowed
            line = '<----->' + title + '<----->' + zuozhe + '<----->' + yuepiao + '\n'
            fo.write(line)
            for te in jieshao:
                fo.write(te)
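
Each `[0]` in the scraper raises an IndexError as soon as an XPath matches nothing, which long absolute paths tend to do whenever the page layout shifts. Below is a minimal sketch of a guard. The helper names `first` and `format_record` are made up for this illustration and are not part of lxml; the sketch only assumes that `xpath()` hands back a plain list of strings for `text()` queries.

```python
def first(results, default=""):
    """Return the first non-blank result, stripped, or default if there is none."""
    for item in results:
        text = item.strip()
        if text:
            return text
    return default


def format_record(title, author, tickets):
    """Build one output line in the same '<----->' layout the scraper writes."""
    return "<----->" + "<----->".join([title, author, tickets]) + "\n"


# Simulated xpath() results: a normal hit, a whitespace-only first node, and a miss.
title = first(["  Some Book \n"])
author = first(["\n  ", "Some Author"])
tickets = first([], default="0")
record = format_record(title, author, tickets)
```

In the scraper this would look like `title = first(e.xpath(...))`, so a page with a missing field produces an empty column rather than stopping the whole run.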

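A Selenium-based scraper

The goal is to scrape basic personal information from the campus network; this one is unfinished. The eventual aim is batch lookup (student IDs and passwords follow a fixed pattern).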

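from selenium import webdriver
from selenium.webdriver.common.by import By

# Known problem: the element lookups never manage to locate the button that has to be
# clicked, so the script cannot reach the next page. The next step is to try
# combining Selenium with the requests library.
# Note: the find_element_by_* helpers were removed in Selenium 4; find_element(By...) replaces them.

driver = webdriver.Chrome()
driver.get("http://cas.hdu.edu.cn/cas/login?service=http%3A%2F%2Fonce.hdu.edu.cn%2Fdcp%2Findex.jsp")
elem1 = driver.find_element(By.ID, "un")
elem2 = driver.find_element(By.ID, "pd")
elem1.send_keys("student ID")  # replace the two placeholders with your own student ID and password
elem2.send_keys("password")
driver.find_element(By.ID, 'index_login_btn').click()
driver.find_element(By.CLASS_NAME, 'quickhome_item_link').click()
print(driver.page_source)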
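
One common reason an element lookup fails right after a click is that the next page has not finished loading when the lookup runs. Whether that is what happens on this site is not confirmed. The sketch below shows the idea of an explicit wait using only the standard library: `wait_until` is a hypothetical helper, and `fake_lookup` stands in for a real driver call. Selenium ships its own `WebDriverWait` for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
        except Exception:
            result = None  # treat a failed lookup as "not ready yet"
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


# Stand-in for an element lookup that only succeeds on the third attempt.
attempts = {"count": 0}


def fake_lookup():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise LookupError("element not found yet")
    return "button"


found = wait_until(fake_lookup, timeout=2.0, interval=0.01)
```

With a real driver the condition could be a lambda wrapping the lookup for `quickhome_item_link`. If the element sits inside a frame or a new window, waiting alone will not help.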

A regular-expression-based scraper

Batch download of images from Tieba.

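import re
import urllib.request  # Python 3 home of urlopen and urlretrieve


def gethtml(url):
    """Fetch a page and return its HTML as text."""
    page = urllib.request.urlopen(url)
    # read() returns bytes; decode so the str regex below can be applied
    html = page.read().decode("utf-8", errors="replace")
    return html


def getimg(html):
    """Return every .jpg URL that sits in a src="..." attribute followed by size."""
    reg = r'src="(.+?\.jpg)" size'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    return imglist


def downimg(imglist):
    """Save the images as 0.jpg, 1.jpg, ... in the local directory."""
    local = 'D:/VScode/image/'  # this directory must already exist
    for x, img in enumerate(imglist):
        urllib.request.urlretrieve(img, local + '%s.jpg' % x)


# getimg() and downimg() are defined but not called yet; this only prints the page.
html = gethtml("https://movie.douban.com/subject/26942674/")
print(html)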
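
To see what the pattern in `getimg` actually captures, it helps to run it on a small hand-written snippet. The sample HTML and the `STRICT` variant below are made up for illustration (example.com URLs, not real Tieba markup). The sketch shows that the lazy `.+?` can run across tags when a non-.jpg image sits between two .jpg ones, and that restricting the capture to `[^"]` keeps each match inside one attribute value.

```python
import re

# Pattern from the post: lazily capture everything between src=" and .jpg" size
LOOSE = re.compile(r'src="(.+?\.jpg)" size')
# Hypothetical tighter variant: [^"] cannot cross the closing quote of the attribute
STRICT = re.compile(r'src="([^"]+\.jpg)" size')

# Made-up sample markup: two .jpg images with a .png between them
SAMPLE = (
    '<img src="https://example.com/a.jpg" size="100">'
    '<img src="https://example.com/b.png" size="200">'
    '<img src="https://example.com/c.jpg" size="300">'
)

loose_urls = LOOSE.findall(SAMPLE)
strict_urls = STRICT.findall(SAMPLE)

for label, urls in (("loose", loose_urls), ("strict", strict_urls)):
    print(label, urls)
```

The loose pattern returns two items, but the second one starts at b.png and swallows the markup up to c.jpg. The strict one returns just the two .jpg URLs.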

