[Crawler] Scraping a Million Zhihu User Profiles: The Redis Part


The project source is on GitHub (stars are welcome): https://github.com/wangqifan/ZhiHu

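Installing Redis

Redis has no official Windows release: the maintainers consider Linux sufficient and feel that a Windows port would slow development. Fortunately, a team at Microsoft maintains a Windows port. Many blog posts cover installing Redis, but most of them walk through a series of command-line steps. There is also an MSI installer, which installs like any ordinary program by clicking Next through the wizard: https://github.com/MSOpenTech/redis/releases/download/win-3.2.100/Redis-x64-3.2.100.msi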

Configuring Redis

A detailed walkthrough of the Redis configuration file: http://www.cnblogs.com/kreo/p/4423362.html

Locate the file redis.windows-service.conf.

 

Two settings deserve attention here.

1. Remote connections

#
# ~~~ WARNING ~~~ If the computer running Redis is directly exposed to the
# internet, binding to all the interfaces is dangerous and will expose the
# instance to everybody on the internet. So by default we uncomment the
# following bind directive, that will force Redis to listen only into
# the IPv4 loopback interface address (this means Redis will be able to
# accept connections only from clients running into the same computer it
# is running).
#
# IF YOU ARE SURE YOU WANT YOUR INSTANCE TO LISTEN TO ALL THE INTERFACES
# JUST COMMENT THE FOLLOWING LINE.
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
bind 0.0.0.0
Change bind 127.0.0.1 to bind 0.0.0.0 so that Redis accepts remote connections.

2. Memory limit

# NOTE: since Redis uses the system paging file to allocate the heap memory,
# the Working Set memory usage showed by the Windows Task Manager or by other
# tools such as ProcessExplorer will not always be accurate. For example, right
# after a background save of the RDB or the AOF files, the working set value
# may drop significantly. In order to check the correct amount of memory used
# by the redis-server to store the data, use the INFO client command. The INFO
# command shows only the memory used to store the redis data, not the extra
# memory used by the Windows process for its own requirements. The extra amount
# of memory not reported by the INFO command can be calculated subtracting the
# Peak Working Set reported by the Windows Task Manager and the used_memory_peak
# reported by the INFO command.
#
maxmemory 2000mb
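The maximum memory can be changed here. Set it generously, since Redis is fairly memory-hungry.

Wrapping the Redis connection in a class

The C# driver for Redis, ServiceStack.Redis, is installed through NuGet. The library has gone commercial: starting with version 4.0 it limits usage to 6,000 requests per hour, so installing version 3.9 is recommended.

When I started this crawler, Redis ran on a single computer, which turned out to be very sluggish. I then moved to three computers: one for the hash table, one for the UrlNext and Urltoken queues, and one for the User queue. The lab computers are very old, so it was still slow. In the end I added two more: the three lab computers hold the hash table, my own computer holds the User queue, and a junior labmate's computer was drafted to hold the task queues.

The class is named RedisCore.

IP address list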

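// Machines are addressed by a 1-based type number (ips[type - 1]);
// machines 3 to 5 hold the hash table.
public static List<string> ips = new List<string>()
{
    "59.74.169.54",
    "59.74.169.57",
    "59.74.169.52",
    "59.74.169.58",
    "59.74.169.39"
};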

 

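Wrapping queue operations

A Redis queue is built on the list data structure: pushing on the right and popping from the left gives first-in, first-out behavior.

Push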
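The right-push, left-pop rule can be illustrated without a Redis server. The following is a minimal in-memory sketch; the class ListQueueSketch and its two methods are hypothetical stand-ins that only mimic the behavior of the RPUSH and LPOP commands, not the ServiceStack.Redis API.

```csharp
using System;
using System.Collections.Generic;

// In-memory stand-in for Redis lists (hypothetical, for illustration only).
public static class ListQueueSketch
{
    private static readonly Dictionary<string, LinkedList<string>> Lists =
        new Dictionary<string, LinkedList<string>>();

    // Mimics RPUSH: append at the right end and return the new length.
    public static int RPush(string key, string value)
    {
        LinkedList<string> list;
        if (!Lists.TryGetValue(key, out list))
        {
            list = new LinkedList<string>();
            Lists[key] = list;
        }
        list.AddLast(value);
        return list.Count;
    }

    // Mimics LPOP: remove from the left end; null when the list is empty or missing.
    public static string LPop(string key)
    {
        LinkedList<string> list;
        if (!Lists.TryGetValue(key, out list) || list.Count == 0)
        {
            return null;
        }
        string head = list.First.Value;
        list.RemoveFirst();
        return head;
    }
}
```

Items pushed in the order a, b come back out in the order a, b, and a further pop yields null, which is the case the real pop wrapper has to handle.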

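public static bool PushIntoList(int type, string key, string value)
{
    bool Result = false;
    // type is the 1-based machine number; the client is disposed when the using block exits.
    using (RedisClient Redis = new RedisClient(ips[type - 1], 6379))
    {
        Redis.ConnectTimeout = 2000;
        // RPUSH appends to the right end and returns the new list length.
        Result = Redis.RPush(key, Encoding.UTF8.GetBytes(value)) > 0;
    }
    return Result;
}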

 

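Note that the client holds unmanaged resources and must be released explicitly, which is what the using block takes care of.

Pop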
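To show why a using block is enough to release the client, here is a small sketch with a hypothetical FakeClient standing in for a connection-holding class such as RedisClient. It only demonstrates the C# language guarantee that Dispose runs when the block exits, even through an exception.

```csharp
using System;

// Hypothetical stand-in for a client that owns a network connection.
public sealed class FakeClient : IDisposable
{
    public bool Disposed { get; private set; }

    public void Dispose()
    {
        Disposed = true;
    }
}

public static class DisposeSketch
{
    // Leaves the using block through an exception and returns the client
    // so the caller can inspect whether it was disposed.
    public static FakeClient UseAndFail()
    {
        FakeClient seen = null;
        try
        {
            using (var client = new FakeClient())
            {
                seen = client;
                throw new InvalidOperationException("simulated network error");
            }
        }
        catch (InvalidOperationException)
        {
            // Dispose has already run by the time control reaches this handler.
        }
        return seen;
    }
}
```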

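public static string PopFromList(int type, string key)
{
    string result = string.Empty;
    try
    {
        using (RedisClient Redis = new RedisClient(ips[type - 1], 6379))
        {
            Redis.ConnectTimeout = 2000;
            // LPOP returns null when the list is empty, so guard before decoding.
            byte[] raw = Redis.LPop(key);
            if (raw != null)
            {
                result = Encoding.UTF8.GetString(raw);
            }
        }
    }
    catch
    {
        // Connection errors are swallowed; the caller sees an empty string.
    }
    return result;
}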
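Because PopFromList returns an empty string both when the queue is empty and when an error is swallowed, a caller has to treat an empty result as "nothing available right now". The sketch below shows one possible polling loop. The names QueueWorkerSketch, Drain, maxEmptyPolls and delayMs are hypothetical, and the pop function is passed in as a delegate so the loop can run without a Redis server.

```csharp
using System;
using System.Threading;

public static class QueueWorkerSketch
{
    // Calls pop() repeatedly and passes each non-empty item to handle().
    // Stops after maxEmptyPolls consecutive empty results and returns
    // the number of items handled.
    public static int Drain(Func<string> pop, Action<string> handle,
                            int maxEmptyPolls, int delayMs)
    {
        int handled = 0;
        int emptyPolls = 0;
        while (emptyPolls < maxEmptyPolls)
        {
            string item = pop();
            if (string.IsNullOrEmpty(item))
            {
                emptyPolls++;
                Thread.Sleep(delayMs); // back off instead of spinning
                continue;
            }
            emptyPolls = 0;
            handle(item);
            handled++;
        }
        return handled;
    }
}
```

With the wrapper above, a call might look like QueueWorkerSketch.Drain(() => RedisCore.PopFromList(2, "UrlNext"), Crawl, 10, 500), where the machine number 2, the key name and the Crawl handler are placeholders rather than values taken from the project.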

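The hash table is spread over three machines. To decide which one receives a key, hash the key, take the absolute value, and take the remainder modulo 3: a remainder of 0 goes to machine 3, 1 to machine 4, and 2 to machine 5. If the field already exists in the hash, the insert fails and returns false; if it does not exist, the insert succeeds and returns true.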

public static bool InsetIntoHash(int type, string hashid, string key, string value)
{
    bool result = false;
    try
    {
        using (RedisClient Redis = new RedisClient(ips[type - 1], 6379))
        {
            Redis.ConnectTimeout = 2000;
            // Sets the field only if it is absent; true means the key was new.
            result = Redis.SetEntryInHashIfNotExists(hashid, key, value);
        }
    }
    catch { } // on error the method returns false, the same as "already present"

    return result;
}
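The machine-selection rule described above (hash the key, then take it modulo 3) is not shown in this excerpt, so the following is only a sketch of how it might look. It swaps in the FNV-1a hash as an assumption, because string.GetHashCode is not guaranteed to give the same value across machines or runtimes, and it uses an unsigned value instead of an absolute value, because Math.Abs(int.MinValue) throws an OverflowException. The names ShardSketch, StableHash and MachineFor are hypothetical.

```csharp
using System;
using System.Text;

public static class ShardSketch
{
    // FNV-1a, 32-bit: a simple hash that yields the same value on every
    // machine and runtime for the same UTF-8 input.
    public static uint StableHash(string key)
    {
        const uint offsetBasis = 2166136261;
        const uint prime = 16777619;
        uint hash = offsetBasis;
        foreach (byte b in Encoding.UTF8.GetBytes(key))
        {
            hash ^= b;
            hash = unchecked(hash * prime); // wrap around on overflow
        }
        return hash;
    }

    // Maps remainder 0, 1, 2 to machine numbers 3, 4, 5, as in the text.
    public static int MachineFor(string key)
    {
        return 3 + (int)(StableHash(key) % 3);
    }
}
```

The result can be passed straight in as the type argument, for example RedisCore.InsetIntoHash(ShardSketch.MachineFor(token), hashid, token, value), where token, hashid and value are whatever the crawler uses for deduplication. Every crawler node must use the same function, otherwise the same key could land on different machines and be treated as new twice.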
      

 

