Crawling Zhihu User Information


Last time I crawled Zhihu questions and answers; this time let's crawl Zhihu user information.

 

1. Constructing the URLs

First, build the user-info URL.

Zhihu user information is all served as JSON. Once we find the URL that returns this JSON, we can request it and get our data.

 url="https://www.zhihu.com/api/v4/members/liu-qian-chi-24?include=locations,employments,gender,educations,business,voteup_count,thanked_Count,follower_count,following_count,cover_url,following_topic_count,following_question_count,following_favlists_count,following_columns_count,answer_count,articles_count,pins_count,question_count,commercial_question_count,favorite_count,favorited_count,logs_count,marked_answers_count,marked_answers_text,message_thread_token,account_status,is_active,is_force_renamed,is_bind_sina,sina_weibo_url,sina_weibo_name,show_sina_weibo,is_blocking,is_blocked,is_following,is_followed,mutual_followees_count,vote_to_count,vote_from_count,thank_to_count,thank_from_count,thanked_count,description,hosted_live_count,participated_live_count,allow_message,industry_category,org_name,org_homepage,badge[?(type=best_answerer)].topics",

This is the URL for my own profile: the liu-qian-chi-24 segment is the username, and what follows include= is the content being requested. I didn't write that include string by hand, of course; it is the parameter the browser itself sends with the request.

As you can see, this is exactly what follows include= when requesting user information. So the user-info URL template is:

user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'

The include parameter is:

user_query = 'locations,employments,gender,educations,business,voteup_count,thanked_Count,follower_count,following_count,cover_url,following_topic_count,following_question_count,following_favlists_count,following_columns_count,answer_count,articles_count,pins_count,question_count,commercial_question_count,favorite_count,favorited_count,logs_count,marked_answers_count,marked_answers_text,message_thread_token,account_status,is_active,is_force_renamed,is_bind_sina,sina_weibo_url,sina_weibo_name,show_sina_weibo,is_blocking,is_blocked,is_following,is_followed,mutual_followees_count,vote_to_count,vote_from_count,thank_to_count,thank_from_count,thanked_count,description,hosted_live_count,participated_live_count,allow_message,industry_category,org_name,org_homepage,badge[?(type=best_answerer)].topics'
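A quick way to sanity-check the template is to format it and fetch it directly. Below is a minimal sketch using the requests library; the shortened include string and the plain User-Agent are only for illustration, and without a logged-in session Zhihu may reject the call, so treat this purely as a way to look at the JSON shape:

import requests

user_url = "https://www.zhihu.com/api/v4/members/{user}?include={include}"
include = "answer_count,articles_count,gender,follower_count"  # shortened; the full user_query above works the same way

headers = {"User-Agent": "Mozilla/5.0"}  # a real logged-in cookie may also be required
resp = requests.get(user_url.format(user="liu-qian-chi-24", include=include), headers=headers)
print(resp.status_code)
if resp.ok:
    print(resp.json().get("follower_count"))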

 

Next, construct the URL for the users who follow me:

In the browser you can see that these list requests look like this (this particular one is the followees list): https://www.zhihu.com/api/v4/members/liu-qian-chi-24/followees?include=data%5B%2A%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=20&offset=0

The template has four parts to fill in: the username plus the three query parameters (include, limit, offset) visible in this first request. liu-qian-chi-24 is the username; plugging it in gives the URL template for the users who follow him:

followed_url = "https://www.zhihu.com/api/v4/members/{user}/followers?include={include}&limit={limit}&offset={offset}"

The include parameter is:

followed_query = "data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics"

 

Next, construct the URL for the users I follow:

The URL for the users I follow is built the same way; the browser shows the same include, limit, and offset parameters. That gives us:

following_url = "https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&limit={limit}&offset={offset}"

The include parameter is:

following_query = "data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics"

 

2. Simulating the Zhihu Login

Zhihu cannot be browsed without logging in, so we have to simulate a login, and a captcha may be required during the process. I didn't use a cloud captcha-solving service; this time I download the captcha image to disk and type it in by hand.

The code is as follows:

    def start_requests(self):
        """Request the login page."""
        return [scrapy.Request(url="https://www.zhihu.com/signup", callback=self.get_captcha)]

    def get_captcha(self, response):
        """This step fetches the captcha image."""
        post_data = {
            "email": "lq574343028@126.com",
            "password": "lq534293223",
            "captcha": "",  # leave the captcha empty for now so Zhihu will prompt for one
        }
        t = str(int(time.time() * 1000))
        # This is the key part, and it took me a long time to find:
        # this is the URL Zhihu serves the captcha image from on each login
        captcha_url = "https://www.zhihu.com/captcha.gif?r={0}&type=login".format(t)
        return [scrapy.FormRequest(url=captcha_url, meta={"post_data": post_data}, callback=self.after_get_captcha)]

    def after_get_captcha(self, response):
        """Save the captcha image locally and type it in by hand."""
        with open("E:/outback/zhihu/zhihu/utils/captcha.png", "wb") as f:
            f.write(response.body)
        try:
            # this line opens the image automatically
            img = Image.open("E:/outback/zhihu/zhihu/utils/captcha.png")
            img.show()
        except Exception:
            pass
        captcha = input("input captcha")
        post_data = response.meta.get("post_data", {})
        post_data["captcha"] = captcha
        post_url = "https://www.zhihu.com/login/email"
        return [scrapy.FormRequest(url=post_url, formdata=post_data,
                                   callback=self.check_login)]

    def check_login(self, response):
        """Check whether the login succeeded."""
        text_json = json.loads(response.text)
        if "msg" in text_json and text_json["msg"] == "登錄成功":
            yield scrapy.Request("https://www.zhihu.com/", dont_filter=True, callback=self.start_get_info)
        else:
            # if it failed, go through the login flow again
            # (dont_filter so the repeated signup request is not dropped by the dupefilter)
            yield scrapy.Request(url="https://www.zhihu.com/signup", dont_filter=True, callback=self.get_captcha)

    def start_get_info(self, response):
        """Once logged in, start requesting user info."""
        yield scrapy.Request(url=self.user_url.format(user="liu-qian-chi-24", include=self.user_query),
                             callback=self.parse_user)
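Note that the captcha request posts no form data; FormRequest without a formdata argument simply issues a GET, so a plain scrapy.Request would work just as well there. The meta dict is only used to carry the half-filled post_data forward to the next callback.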

 

3. Handling the Response from the User-Info URL

As you can see, the user information is just JSON, so we simply parse it.

One thing to note: is_end tells you whether there is another page. You cannot rely on whether the next URL opens, because it opens no matter what.
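For reference, the paging object in these list responses looks roughly like this (the exact values are illustrative, not copied from a real response):

paging = {
    "is_end": False,   # False means there is another page to fetch
    "is_start": True,
    "next": "https://www.zhihu.com/api/v4/members/liu-qian-chi-24/followers?include=...&limit=20&offset=20",
    "previous": "https://www.zhihu.com/api/v4/members/liu-qian-chi-24/followers?include=...&limit=20&offset=0",
    "totals": 123,
}

if not paging["is_end"]:
    next_url = paging["next"]  # safe to request the next page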

url_token is the username that appears in the profile URL; we combine it with the other parameters to build the follower and followee URLs for that user. The code is as follows:

    def parse_user(self, response):
        user_data = json.loads(response.text)
        zhihu_item = ZhihuUserItem()
        # Parse the JSON: if a key is also an Item field, copy it over.
        # Using dict.get() means a missing key will not raise an error.
        for field in zhihu_item.fields:
            if field in user_data.keys():
                zhihu_item[field] = user_data.get(field)
        yield zhihu_item
        # use url_token to build and yield the followed_url request
        yield scrapy.Request(
            url=self.followed_url.format(user=user_data.get('url_token'), include=self.followed_query,
                                         limit=20, offset=0),
            callback=self.parse_followed)
        # use url_token to build and yield the following_url request
        yield scrapy.Request(
            url=self.following_url.format(user=user_data.get('url_token'), include=self.following_query,
                                          limit=20, offset=0),
            callback=self.parse_following)

 

4. Parsing following_url and followed_url

Next, parse following_url and followed_url. The parsing works the same way as for user_url, so I won't go through it in detail; the code is as follows:

    def parse_following(self, response):
        user_data = json.loads(response.text)
        # request the next page while paging["is_end"] is False
        if "paging" in user_data.keys() and not user_data.get("paging").get("is_end"):
            next_url = user_data.get("paging").get("next")
            yield scrapy.Request(url=next_url, callback=self.parse_following)

        if "data" in user_data.keys():
            for result in user_data.get("data"):
                url_token = result.get("url_token")
                yield scrapy.Request(url=self.user_url.format(user=url_token, include=self.user_query),
                                     callback=self.parse_user)

    def parse_followed(self, response):
        user_data = json.loads(response.text)
        # request the next page while paging["is_end"] is False
        if "paging" in user_data.keys() and not user_data.get("paging").get("is_end"):
            next_url = user_data.get("paging").get("next")
            yield scrapy.Request(url=next_url, callback=self.parse_followed)

        if "data" in user_data.keys():
            for result in user_data.get("data"):
                url_token = result.get("url_token")
                yield scrapy.Request(url=self.user_url.format(user=url_token, include=self.user_query),
                                     callback=self.parse_user)

At this point the spider's main logic is done. Next, let's store the data in MongoDB.

5. Pipeline: Storing Data in MongoDB

import pymongo


class ZhihuUserMongoPipeline(object):
    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # self.db[self.collection_name].insert_one(dict(item))
        # to avoid duplicate documents, upsert keyed on the user's id
        self.db[self.collection_name].replace_one({'id': item['id']}, dict(item), upsert=True)
        return item
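One optional refinement, not in the original project: since process_item upserts on the id field, a unique index on id keeps that lookup fast and guarantees one document per user. A sketch of how open_spider could be extended:

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
        # optional: a unique index on "id" speeds up the upsert lookup
        # and enforces at most one document per Zhihu user
        self.db[self.collection_name].create_index("id", unique=True)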

6. Configuring settings

ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    "HOST": "www.zhihu.com",
    "Referer": "https://www.zhizhu.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0",
}

ITEM_PIPELINES = {
   # 'zhihu.pipelines.ZhihuPipeline': 300,
   'zhihu.pipelines.ZhihuUserMongoPipeline': 300,
}
MONGO_URI="127.0.0.1:27017"
MONGO_DATABASE="outback"

7. Writing the Item

Writing the Item is straightforward. Since I'm storing the data in MongoDB this time, I grab as many fields as I can. Each of the three URLs we request carries an include string, and together these list all of Zhihu's fields; we only need to deduplicate them (one way to do that automatically is sketched after the three strings below).

following_query = "data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics"

followed_query = "data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics"

user_query = 'locations,employments,gender,educations,business,voteup_count,thanked_Count,follower_count,following_count,cover_url,following_topic_count,following_question_count,following_favlists_count,following_columns_count,answer_count,articles_count,pins_count,question_count,commercial_question_count,favorite_count,favorited_count,logs_count,marked_answers_count,marked_answers_text,message_thread_token,account_status,is_active,is_force_renamed,is_bind_sina,sina_weibo_url,sina_weibo_name,show_sina_weibo,is_blocking,is_blocked,is_following,is_followed,mutual_followees_count,vote_to_count,vote_from_count,thank_to_count,thank_from_count,thanked_count,description,hosted_live_count,participated_live_count,allow_message,industry_category,org_name,org_homepage,badge[?(type=best_answerer)].topics'
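If you would rather not type the field list out by hand, one way is to derive it from the three include strings and deduplicate. A rough sketch (it simply strips the data[*]. prefix and skips the badge[...] selector instead of handling them properly; the strings are shortened here to keep the example brief):

user_query = "locations,employments,gender,educations,voteup_count,thanked_Count,follower_count,answer_count"
followed_query = "data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics"
following_query = followed_query

fields = set()
for query in (user_query, followed_query, following_query):
    for name in query.replace("data[*].", "").split(","):
        if "[" not in name:  # skip badge[?(type=best_answerer)].topics
            fields.add(name.strip())

print(sorted(fields))  # paste these names into ZhihuUserItem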

The Item code is as follows:

class ZhihuUserItem(scrapy.Item):
    # URL info: these fields appear in the include strings that make up the URLs
    answer_count = scrapy.Field()
    articles_count = scrapy.Field()
    gender = scrapy.Field()
    follower_count = scrapy.Field()
    is_followed = scrapy.Field()
    is_following = scrapy.Field()
    badge = scrapy.Field()
    id = scrapy.Field()

    # other information we want
    locations = scrapy.Field()
    employments = scrapy.Field()
    educations = scrapy.Field()
    business = scrapy.Field()
    voteup_count = scrapy.Field()
    thanked_Count = scrapy.Field()
    following_count = scrapy.Field()
    cover_url = scrapy.Field()
    following_topic_count = scrapy.Field()
    following_question_count = scrapy.Field()
    following_favlists_count = scrapy.Field()
    following_columns_count = scrapy.Field()
    pins_count = scrapy.Field()
    question_count = scrapy.Field()
    commercial_question_count= scrapy.Field()
    favorite_count = scrapy.Field()
    favorited_count = scrapy.Field()
    logs_count = scrapy.Field()
    marked_answers_count = scrapy.Field()
    marked_answers_text = scrapy.Field()
    message_thread_token = scrapy.Field()
    account_status = scrapy.Field()
    is_active = scrapy.Field()
    is_force_renamed = scrapy.Field()
    is_bind_sina = scrapy.Field()
    sina_weibo_url = scrapy.Field()
    sina_weibo_name = scrapy.Field()
    show_sina_weibo = scrapy.Field()
    is_blocking = scrapy.Field()
    is_blocked = scrapy.Field()
    mutual_followees_count = scrapy.Field()
    vote_to_count = scrapy.Field()
    vote_from_count = scrapy.Field()
    thank_to_count = scrapy.Field()
    thank_from_count = scrapy.Field()
    thanked_count = scrapy.Field()
    description = scrapy.Field()
    hosted_live_count = scrapy.Field()
    participated_live_count = scrapy.Field()
    allow_message = scrapy.Field()
    industry_category = scrapy.Field()
    org_name = scrapy.Field()
    org_homepage = scrapy.Field()

 

8. The main() Function

Of course, to write and debug at the same time we also need a main() script, which makes it easy to set breakpoints.

from scrapy.cmdline import execute

import sys
import os
sys.path.insert(0,os.path.dirname(os.path.abspath(__file__)))
print(os.path.dirname(os.path.abspath(__file__)))

execute(["scrapy", "crawl", "zhihu_user"])

 

That completes the project. As usual, here is the complete spider code:

# -*- coding: utf-8 -*-
import scrapy
import time
from PIL import Image
import json
from zhihu.items import ZhihuUserItem


class ZhihuUserSpider(scrapy.Spider):
    name = 'zhihu_user'
    allowed_domains = ['zhihu.com']
    start_urls = ["liu-qian-chi-24"]

    custom_settings = {
        "COOKIES_ENABLED": True
    }

    # URL for the users he follows
    following_url = "https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&limit={limit}&offset={offset}"

    # URL for the users who follow him
    followed_url = "https://www.zhihu.com/api/v4/members/{user}/followers?include={include}&limit={limit}&offset={offset}"

    # user info URL
    user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'

    # include parameter for the followees URL
    following_query = "data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics"

    # include parameter for the followers URL
    followed_query = "data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics"

    # include parameter for the user info URL
    user_query = 'locations,employments,gender,educations,business,voteup_count,thanked_Count,follower_count,following_count,cover_url,following_topic_count,following_question_count,following_favlists_count,following_columns_count,answer_count,articles_count,pins_count,question_count,commercial_question_count,favorite_count,favorited_count,logs_count,marked_answers_count,marked_answers_text,message_thread_token,account_status,is_active,is_force_renamed,is_bind_sina,sina_weibo_url,sina_weibo_name,show_sina_weibo,is_blocking,is_blocked,is_following,is_followed,mutual_followees_count,vote_to_count,vote_from_count,thank_to_count,thank_from_count,thanked_count,description,hosted_live_count,participated_live_count,allow_message,industry_category,org_name,org_homepage,badge[?(type=best_answerer)].topics'

    def start_requests(self):
        """Request the login page."""
        return [scrapy.Request(url="https://www.zhihu.com/signup", callback=self.get_captcha)]

    def get_captcha(self, response):
        """This step fetches the captcha image."""
        post_data = {
            "email": "lq573320328@126.com",
            "password": "lq132435",
            "captcha": "",  # leave the captcha empty for now so Zhihu will prompt for one
        }
        t = str(int(time.time() * 1000))
        # This is the key part, and it took me a long time to find:
        # this is the URL Zhihu serves the captcha image from on each login
        captcha_url = "https://www.zhihu.com/captcha.gif?r={0}&type=login".format(t)
        return [scrapy.FormRequest(url=captcha_url, meta={"post_data": post_data}, callback=self.after_get_captcha)]

    def after_get_captcha(self, response):
        """Save the captcha image locally and type it in by hand."""
        with open("E:/outback/zhihu/zhihu/utils/captcha.png", "wb") as f:
            f.write(response.body)
        try:
            # this line opens the image automatically
            img = Image.open("E:/outback/zhihu/zhihu/utils/captcha.png")
            img.show()
        except Exception:
            pass
        captcha = input("input captcha")
        post_data = response.meta.get("post_data", {})
        post_data["captcha"] = captcha
        post_url = "https://www.zhihu.com/login/email"
        return [scrapy.FormRequest(url=post_url, formdata=post_data,
                                   callback=self.check_login)]

    def check_login(self, response):
        """Check whether the login succeeded."""
        text_json = json.loads(response.text)
        if "msg" in text_json and text_json["msg"] == "登錄成功":
            yield scrapy.Request("https://www.zhihu.com/", dont_filter=True, callback=self.start_get_info)
        else:
            # if it failed, go through the login flow again
            # (dont_filter so the repeated signup request is not dropped by the dupefilter)
            yield scrapy.Request(url="https://www.zhihu.com/signup", dont_filter=True, callback=self.get_captcha)

    def start_get_info(self, response):
        """Once logged in, start requesting user info."""
        yield scrapy.Request(url=self.user_url.format(user="liu-qian-chi-24", include=self.user_query),
                             callback=self.parse_user)

    def parse_user(self, response):
        user_data = json.loads(response.text)
        zhihu_item = ZhihuUserItem()
        # Parse the JSON: if a key is also an Item field, copy it over.
        # Using dict.get() means a missing key will not raise an error.
        for field in zhihu_item.fields:
            if field in user_data.keys():
                zhihu_item[field] = user_data.get(field)
        yield zhihu_item
        # use url_token to build and yield the followed_url request
        yield scrapy.Request(
            url=self.followed_url.format(user=user_data.get('url_token'), include=self.followed_query,
                                         limit=20, offset=0),
            callback=self.parse_followed)
        # use url_token to build and yield the following_url request
        yield scrapy.Request(
            url=self.following_url.format(user=user_data.get('url_token'), include=self.following_query,
                                          limit=20, offset=0),
            callback=self.parse_following)

    def parse_following(self, response):
        user_data = json.loads(response.text)
        # request the next page while paging["is_end"] is False
        if "paging" in user_data.keys() and not user_data.get("paging").get("is_end"):
            next_url = user_data.get("paging").get("next")
            yield scrapy.Request(url=next_url, callback=self.parse_following)

        if "data" in user_data.keys():
            for result in user_data.get("data"):
                url_token = result.get("url_token")
                yield scrapy.Request(url=self.user_url.format(user=url_token, include=self.user_query),
                                     callback=self.parse_user)

    def parse_followed(self, response):
        user_data = json.loads(response.text)
        # request the next page while paging["is_end"] is False
        if "paging" in user_data.keys() and not user_data.get("paging").get("is_end"):
            next_url = user_data.get("paging").get("next")
            yield scrapy.Request(url=next_url, callback=self.parse_followed)

        if "data" in user_data.keys():
            for result in user_data.get("data"):
                url_token = result.get("url_token")
                yield scrapy.Request(url=self.user_url.format(user=url_token, include=self.user_query),
                                     callback=self.parse_user)

 

The project structure is as follows:

 

9. Summary

The project still has some shortcomings:

  1. The function that saves to MongoDB should be written on the Item class, with the pipeline simply calling that interface. That way, if the project contains many spiders, they can all share it (see the sketch after this list).

  2. Not every user has a value for every field; fields with no value should be filtered out before saving.

  3. There is no exception-handling mechanism. No exceptions came up while I was running the spider, so I never added one.

  4. It has not been made distributed.
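For points 1 and 2, here is a rough sketch of what that refactor could look like; the save() method, its signature, and the collection name are my own illustration, not code from this project:

import scrapy


class ZhihuUserItem(scrapy.Item):
    id = scrapy.Field()
    follower_count = scrapy.Field()
    # ... the rest of the fields from section 7 ...

    def save(self, db, collection_name="zhihu_users"):
        """Upsert this item into MongoDB, dropping empty fields first (point 2)."""
        data = {k: v for k, v in dict(self).items() if v not in (None, "", [], {})}
        db[collection_name].replace_one({"id": data["id"]}, data, upsert=True)


class ZhihuUserMongoPipeline(object):
    """The pipeline no longer knows about fields or collections; it just calls save()."""

    def process_item(self, item, spider):
        item.save(self.db)  # self.db opened in open_spider exactly as in section 5
        return item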

GitHub: https://github.com/573320328/zhihu.git. Remember to give it a star!

 

