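I've recently been working on an image-scraping project that involves selecting image information from web pages, so I picked up the basics as I went along and have a few notes of my own.
Baidu Images (百度圖庫) loads its results via AJAX, so parsing the JSON response is all that's needed.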
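    import json

    # Shown as the spider's parse callback (method name assumed);
    # Bd_Item is the project's own Item class.
    def parse(self, response):
        hjsons = json.loads(response.body)
        if not hjsons:
            return
        # Image entries live under the 'data' key.
        for data in hjsons.get('data', []):
            try:
                item = Bd_Item()
                item['img_url'] = data['thumbURL']
                item['img_title'] = data['fromPageTitleEnc']
                item['width'] = data['width']
                item['height'] = data['height']
            except KeyError:
                # Skip entries that lack any of the expected fields.
                continue
            yield item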
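To try the JSON handling without running a spider, here is a minimal standalone sketch. The sample payload and the `extract_images` helper are made up for illustration; only the key names (`data`, `thumbURL`, `fromPageTitleEnc`, `width`, `height`) come from the snippet above, and a real response may well look different.

```python
import json

# Hypothetical sample shaped like the fields used above; not a real response.
SAMPLE_BODY = json.dumps({
    "data": [
        {
            "thumbURL": "http://example.com/a.jpg",
            "fromPageTitleEnc": "moon",
            "width": 800,
            "height": 600,
        },
        {},  # an entry with no fields, which should be skipped
    ]
})

REQUIRED_KEYS = ("thumbURL", "fromPageTitleEnc", "width", "height")


def extract_images(body):
    """Return one dict per entry that has all the required keys."""
    payload = json.loads(body)
    images = []
    for entry in payload.get("data", []):
        # Skip incomplete entries instead of swallowing every exception.
        if not all(key in entry for key in REQUIRED_KEYS):
            continue
        images.append({
            "img_url": entry["thumbURL"],
            "img_title": entry["fromPageTitleEnc"],
            "width": entry["width"],
            "height": entry["height"],
        })
    return images


images = extract_images(SAMPLE_BODY)
print(images)
```

Checking for the keys explicitly does the same job as the `try`/`except`, but it will not hide unrelated errors.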
The cutout images on Qiantu (千圖網) are loaded page by page:
http://588ku.com/sucai/0-default-0-0-yueliang-0-1/
    # Shown as the spider's parse callback (method name assumed);
    # Qt_Item is the project's own Item class.
    def parse(self, response):
        # Each .picture-list block holds one image.
        qt_imgs = response.css('.org-img-wrap .picture-list')
        for qt_img in qt_imgs:
            item = Qt_Item()
            # data-original carries the real image URL.
            item['qtimg_url'] = qt_img.css('.img-show .lazy::attr(data-original)').extract_first()
            item['qtimg_title'] = qt_img.css('.img-show .lazy::attr(title)').extract_first()
            # The size text is stored unsplit; splitting it into
            # width and height was tried but left disabled.
            item['size'] = qt_img.css('.hover-pic-detail .pic-info .info-title::text').extract_first()
            yield item
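The size text is kept as a single string above. Here is a rough sketch of how it could be split into width and height, assuming the text looks something like `1024*768` (I have not verified the exact format on the site, and `parse_size` is a hypothetical helper). One thing to watch for: a lazy group at the very end of a pattern, as in `\*(.*?)`, matches the empty string, so the sketch uses explicit digit groups instead.

```python
import re

# Assumed format: two integers separated by '*', 'x' or '×', e.g. "1024*768".
SIZE_PATTERN = re.compile(r"(\d+)\s*[*xX×]\s*(\d+)")


def parse_size(text):
    """Return (width, height) as ints, or None if no size is found."""
    if not text:
        return None
    match = SIZE_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_size("1024*768"))
print(parse_size("no size here"))
```

Returning `None` rather than raising keeps the item usable even when the size text is missing or in some other format.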
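Miyuansu (覓元素) is much like Qiantu, but picking the image link takes a small trick. On Qiantu each image carries two links, and the data-original one can be used as is. On Miyuansu, if you select src, every extracted link turns out to be identical, and opening it shows nothing but a black image. My guess is that this is some kind of protection. But if that is the only link, what then? I tried a regex over the element and found that the match contained two links: the first one is useless, and the second is the real one, stored under the attribute data-src. That settles it: just change src to data-src in the selector and the right link comes out.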
    import re

    # Shown as the spider's parse callback (method name assumed);
    # Mys_Item is the project's own Item class.
    def parse(self, response):
        mys_imgs = response.css('.content-wrap .w1200 .f-content .i-flow-item')
        for mys_img in mys_imgs:
            size = mys_img.css('.i-title-wrap a::text').extract_first()
            if size is None:
                # No title text, so re.findall would fail; skip this entry.
                continue
            item = Mys_Item()
            # Use data-src: src is the same placeholder for every image.
            item['mysimg_url'] = mys_img.css('.img-out-wrap .img-wrap img::attr(data-src)').extract_first()
            item['mysimg_title'] = mys_img.css('.img-out-wrap .img-wrap img::attr(alt)').extract_first()
            # Keep whatever sits inside the parentheses of the title text.
            item['size'] = re.findall(r'\((.*?)\)', size)
            yield item
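To see why a regex can turn up two links, here is a small standalone sketch using only the standard library. The HTML fragment and URLs are invented, and `LazyImageParser` is a hypothetical helper, not part of Scrapy. A loose pattern like `src="(.*?)"` also matches the tail of `data-src="..."`, so it returns the placeholder first and the real link second.

```python
import re
from html.parser import HTMLParser

# Invented fragment imitating a lazy-loaded image tag.
SAMPLE_HTML = (
    '<div class="img-wrap">'
    '<img src="http://example.com/placeholder.png" '
    'data-src="http://example.com/real-1.jpg" alt="moon">'
    '</div>'
)

# The pattern matches both src="..." and the end of data-src="...".
loose_matches = re.findall(r'src="(.*?)"', SAMPLE_HTML)
print(loose_matches)


class LazyImageParser(HTMLParser):
    """Collect image URLs, preferring lazy-load attributes over src."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # Try the lazy-load attributes first, then fall back to src.
        url = (
            attributes.get("data-src")
            or attributes.get("data-original")
            or attributes.get("src")
        )
        if url:
            self.urls.append(url)


parser = LazyImageParser()
parser.feed(SAMPLE_HTML)
image_urls = parser.urls
print(image_urls)
```

In a spider the same preference order can be expressed by trying `::attr(data-src)` first and only falling back to `::attr(src)` when it returns nothing.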
This stuff is rather interesting and takes some figuring out. I'll learn more as I go, the next time I need it.