How do you tell whether a Baidu Netdisk share link has expired? Is it really that simple?


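  I don't know how many people use netdisk search engines these days, but I have poured a lot of effort into Quzhuanpan (去轉盤網), and it now has a decent number of users. Netdisk resources all share one weakness: a shared link can expire. Even so, many search engines never check whether a link is still valid, especially those built on Google custom search, which have little technical substance; their owners care mostly about making money and rarely think about user experience. This is another of my open technical posts. I have already published almost all of the technical details behind Quzhuanpan, and this post adds one more piece: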

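      First, a recap of the earlier posts: Baidu Netdisk crawler, Java word segmentation algorithm, automatic database backup, crawling through proxy servers, invite-a-friend registration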

# coding:utf-8
"""
@author:haoning
@create time:2015.8.5
"""
import urllib2


def gethtml(url):
    """Fetch url and return the response body, or None if the request fails."""
    try:
        print "url", url
        req = urllib2.Request(url)
        response = urllib2.urlopen(req, None, 8)  # 8-second timeout; a proxy should be added here
        html = response.read()
        return html
    except Exception, e:
        print "e", e


if __name__ == '__main__':

    while 1:
        #url='http://pan.baidu.com/share/link?uk=1813251526&shareid=540167442'
        url = "http://pan.baidu.com/s/1qXQD2Pm"
        html = gethtml(url)
        print html

Result: e HTTP Error 403: Forbidden. In other words, Baidu blocks crawlers on this URL. After looking around on many sites, I happened to try the following link:

http://pan.baidu.com/share/link?uk=1813251526&shareid=540167442

if __name__ == '__main__':

    while 1:
        url = 'http://pan.baidu.com/share/link?uk=1813251526&shareid=540167442'
        #url="http://pan.baidu.com/s/1qXQD2Pm"
        html = gethtml(url)
        print html

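Result: <title>百度雲 網盤-鏈接不存在</title> (the title says "link does not exist"). Any page with this title has certainly expired, and it looks like Baidu does not block crawlers on this form of URL. Nice.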
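The two results above suggest three outcomes worth keeping apart: the request was refused, the page says the link is gone, or neither. Here is a minimal Python 3 sketch of that idea (the script above targets Python 2). The helper names `page_title` and `classify_share` are hypothetical, and the marker string is the simplified-Chinese value stored in the rule file shown later in this post; treating a 403 as "unknown" rather than "dead" is my assumption, not something the original code does.

```python
import re

# Marker decoded from the baidu entry of the rule file later in this post:
# "链接不存在", i.e. "link does not exist".
DEAD_MARKER = "\u94fe\u63a5\u4e0d\u5b58\u5728"

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def page_title(html):
    """Return the text of the first <title> element, or "" if there is none."""
    match = TITLE_RE.search(html or "")
    return match.group(1).strip() if match else ""


def classify_share(status, html):
    """Hypothetical helper: map an HTTP status and body to a coarse verdict."""
    if status == 403:
        return "blocked"  # refused by the server, so validity is unknown
    if DEAD_MARKER in page_title(html):
        return "dead"     # the page itself says the link does not exist
    return "alive"


print(classify_share(200, "<html><title>百度云 网盘-链接不存在</title></html>"))
print(classify_share(403, None))
```

Keeping "blocked" separate avoids marking a resource as expired just because the crawler was turned away.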

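Baidu Netdisk actually exposes a shared resource through two kinds of URL:

One is http://pan.baidu.com/s/1qXQD2Pm, which ends in a short code.

The other is http://pan.baidu.com/share/link?uk=1813251526&shareid=540167442, where the key parts are shareid and uk. The first form is known to be protected against crawlers; the second, for now, is not. After testing this in Python, I ported the code to Java, because Quzhuanpan is written in Java. The full Java source follows below: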
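As an aside before the Java listing, here is a small Python 3 sketch for telling the two URL forms apart and for building the uk + shareid form from its parts. The function names `build_long_url` and `share_url_kind` are hypothetical; the post does not describe any way to convert a short code into a uk/shareid pair, so this sketch does not attempt that.

```python
from urllib.parse import urlparse, parse_qs, urlencode


def build_long_url(uk, shareid):
    """Build the uk + shareid form of a share link."""
    query = urlencode({"uk": uk, "shareid": shareid})
    return "http://pan.baidu.com/share/link?" + query


def share_url_kind(url):
    """Return "short" for /s/<code> links, "long" for uk + shareid links, else "unknown"."""
    parts = urlparse(url)
    if parts.path.startswith("/s/") and len(parts.path) > len("/s/"):
        return "short"
    params = parse_qs(parts.query)
    if parts.path == "/share/link" and "uk" in params and "shareid" in params:
        return "long"
    return "unknown"


print(share_url_kind("http://pan.baidu.com/s/1qXQD2Pm"))
print(share_url_kind(build_long_url(1813251526, 540167442)))
```

A crawler that stores uk and shareid for each resource can always rebuild the second form, which is the one the check in this post relies on.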

package com.tray.common.utils;

import static org.junit.Assert.*;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Properties;
import java.util.Random;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.junit.Test;

/**
 * Resource validation utility
 * 
 * @author hui
 * 
 */
public class ResourceCheckUtil {
    // rule name -> {tag name, operator, flag text}
    private static Map<String, String[]> rules;
    static {
        loadRule();
    }

    /**
     * Load the rule set from rule.properties on the classpath
     */
    public static void loadRule() {
        // try-with-resources closes the stream even if loading fails
        try (InputStream in = ResourceCheckUtil.class.getClassLoader()
                .getResourceAsStream("rule.properties")) {
            if (in == null) {
                throw new IOException("rule.properties not found on the classpath");
            }
            Properties p = new Properties();
            p.load(in); // also decodes the Unicode escapes in the values
            Set<Object> keys = p.keySet();
            Iterator<Object> iterator = keys.iterator();
            String key = null;
            String value = null;
            String[] rule = null;
            rules = new HashMap<String, String[]>();
            while (iterator.hasNext()) {
                key = (String) iterator.next();
                value = (String) p.get(key);
                rule = value.split("\\|"); // tag|operator|flag
                rules.put(key, rule);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static String httpRequest(String url) {
        try {
            URL u = new URL(url);
            Random random = new Random();
            HttpURLConnection connection = (HttpURLConnection) u
                    .openConnection();
            connection.setConnectTimeout(3000); // 3-second connect timeout
            connection.setReadTimeout(3000); // 3-second read timeout
            connection.setDoInput(true);
            connection.setUseCaches(false);
            connection.setRequestMethod("GET");
            
            String[] user_agents = {
                    "Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11",
                    "Opera/9.25 (Windows NT 5.1; U; en)",
                    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
                    "Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)",
                    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12",
                    "Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9",
                    "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 Chrome/16.0.912.77 Safari/535.7",
                    "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0 "
            };
            // pick a random User-Agent from the whole array
            int index = random.nextInt(user_agents.length);
            /*connection.setRequestProperty("Content-Type",
                    "text/html;charset=UTF-8");*/
            connection.setRequestProperty("User-Agent", user_agents[index]);
            /*connection.setRequestProperty("Accept-Encoding","gzip, deflate, sdch");
            connection.setRequestProperty("Accept-Language","zh-CN,zh;q=0.8");
            connection.setRequestProperty("Connection","keep-alive");
            connection.setRequestProperty("Host","pan.baidu.com");
            connection.setRequestProperty("Cookie","");
            connection.setRequestProperty("Upgrade-Insecure-Requests","1");*/

            // read the whole body as UTF-8; the reader is closed automatically
            try (BufferedReader br = new BufferedReader(new InputStreamReader(
                    connection.getInputStream(), "utf-8"))) {
                StringBuilder sb = new StringBuilder();
                String line = null;
                while ((line = br.readLine()) != null) {
                    sb.append(line);
                }
                return sb.toString();
            }

        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }

        // null signals that the page could not be fetched
        return null;
    }

     @Test
     public void test7() throws Exception {
         // each pair checks a real share link, then the same link with extra characters appended
         System.out.println(isExistResource("http://pan.baidu.com/s/1jGjBmyq",
         "baidu"));
         System.out.println(isExistResource("http://pan.baidu.com/s/1jGjBmyqa",
         "baidu"));
        
         System.out.println(isExistResource("http://yunpan.cn/cQx6e6xv38jTd","360"));
         System.out.println(isExistResource("http://yunpan.cn/cQx6e6xv38jTdd",
         "360"));
        
         System.out.println(isExistResource("http://share.weiyun.com/ec4f41f0da292adb89a745200b8e8b57","weiyun"));
         System.out.println(isExistResource("http://share.weiyun.com/ec4f41f0da292adb89a745200b8e8b57dd",
         "weiyun"));
        
         System.out.println(isExistResource("http://cloud.letv.com/s/eiGLzuSes","leshi"));
         System.out.println(isExistResource("http://cloud.letv.com/s/eiGLzuSesdd",
         "leshi"));
     }

    /**
     * Get the text of a tag on the given page
     * 
     * @param url
     * @param tagName
     *            tag name
     * @return text of the first matching element, or "" if nothing matched
     */
    private static String getHtmlContent(String url, String tagName) {
        String html = httpRequest(url);
        if(html==null){
            return "";
        }
        Document doc = Jsoup.parse(html);
        //System.out.println("doc======"+doc);
        Elements tag=null;
        if(tagName.equals("<h3>")){ // for Weiyun
            tag=doc.select("h3");
        }
        else if(tagName.equals("class")){ // for 360
            tag=doc.select("div[class=tip]");
        }
        else{
            tag= doc.getElementsByTag(tagName);
        }
        //System.out.println("tag======"+tag);
        String content="";
        if(tag!=null&&!tag.isEmpty()){
            content = tag.get(0).text();
        }
        return content;
    }

    /**
     * Check a share link against the named rule.
     * 
     * @return 1 if the resource appears to exist, 0 if it appears to be gone
     *         or the check itself failed
     */
    public static int isExistResource(String url, String ruleName) {
        try {
            String[] rule = rules.get(ruleName);
            String tagName = rule[0];
            String opt = rule[1];
            String flag = rule[2];
            /*System.out.println("ruleName"+ruleName);
            System.out.println("tagName"+tagName);
            System.out.println("opt"+opt);
            System.out.println("flag"+flag);
            System.out.println("url"+url);*/
            String content = getHtmlContent(url, tagName);
            //System.out.println("content="+content);
            if(ruleName.equals("baidu")){
                // original note: "treat the upgrade page as non-existent";
                // note that this branch nevertheless returns 1
                if(content.contains("百度雲升級")){
                    return 1;
                }
            }
            // flag is spliced into the pattern, so it is interpreted as a regex
            String regex = null;
            if ("eq".equals(opt)) { // content equals flag
                regex = "^" + flag + "$";
            } else if ("bg".equals(opt)) { // content begins with flag
                regex = "^" + flag + ".*$";
            } else if ("ed".equals(opt)) { // content ends with flag
                regex = "^.*" + flag + "$";
            } else if ("like".equals(opt)) { // flag appears anywhere in content
                regex = "^.*" + flag + ".*$";
            }else if("contain".equals(opt)){
                // here flag is the "link is gone" message, so a hit means 0
                if(content.contains(flag)){
                    return 0;
                }
                else{
                    return 1;
                }
            }
            // an unknown operator leaves regex null and falls through to 0
            if(regex != null && content.matches(regex)){
                return 1;
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return 0;
    }

    // public static void main(String[] args)throws Exception {
    // final Path p = Paths.get("C:/Users/hui/Desktop/6-14/");
    // final WatchService watchService =
    // FileSystems.getDefault().newWatchService();
    // p.register(watchService, StandardWatchEventKinds.ENTRY_MODIFY);
    // new Thread(new Runnable() {
    //
    // public void run() {
    // while(true){
    // System.out.println("Checking...");
    // try {
    // WatchKey watchKey = watchService.take();
    // List<WatchEvent<?>> watchEvents = watchKey.pollEvents();
    //
    // for(WatchEvent<?> event : watchEvents){
    // //TODO take a different action depending on the event type
    // System.out.println("file ["+p.getFileName()+"/"+event.context()+"] raised event ["+event.kind()+"]");
    // }
    // watchKey.reset();
    //
    // } catch (Exception e) {
    // e.printStackTrace();
    // }
    // }
    // }
    // }).start();
    // }
    
//    @Test
//    public void testName() throws Exception {
//        System.out.println(new String("\u8BF7\u8F93\u5165\u63D0\u53D6\u7801".getBytes("utf-8"), "utf-8"));
//    }

}
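The operator branches in isExistResource are easy to misread, so here is a compact Python 3 sketch of how I read them. It is only an approximation of the Java logic: the names `rule_matches` and `is_exist_resource` are mine, the flag is spliced into a regular expression just as in the Java code, and where the Java method returns 0 for an unknown operator this sketch raises an error instead.

```python
import re


def rule_matches(content, op, flag):
    """Return True when `content` satisfies the (op, flag) pair."""
    if op == "contain":
        return flag in content
    patterns = {
        "eq": "^" + flag + "$",           # content equals flag
        "bg": "^" + flag + ".*$",         # content begins with flag
        "ed": "^.*" + flag + "$",         # content ends with flag
        "like": "^.*" + flag + ".*$",     # flag appears anywhere in content
    }
    if op not in patterns:
        raise ValueError("unknown operator: %r" % op)
    return re.fullmatch(patterns[op], content) is not None


def is_exist_resource(content, op, flag):
    """Mirror the Java return convention: 1 = resource exists, 0 = it does not."""
    matched = rule_matches(content, op, flag)
    if op == "contain":
        return 0 if matched else 1  # "contain" flags describe a dead link
    return 1 if matched else 0      # the regex operators describe a live page


print(is_exist_resource("百度云 网盘-链接不存在", "contain", "链接不存在"))
print(is_exist_resource("some file", "contain", "链接不存在"))
```

The asymmetry is the point to notice: with `contain` the flag is an error message, while with `eq`, `bg`, `ed`, and `like` the flag describes a healthy page.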

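Note that the code was originally written to support 360, Weipan, and other netdisks as well. As everyone knows, some of those services have since shut down, but the code should stay anyway. That is how a programmer ought to think: keep things extensible. The code also depends on a configuration file, which I attach here: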

360=class|contain|\u5206\u4EAB\u8005\u5DF2\u53D6\u6D88\u6B64\u5206\u4EAB
baidu=title|contain|\u94FE\u63A5\u4E0D\u5B58\u5728
weiyun=<h3>|contain|\u5206\u4EAB\u8D44\u6E90\u5DF2\u7ECF\u5220\u9664
leshi=title|ed|\u63D0\u53D6\u6587\u4EF6

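Sorry, the values are Unicode-escaped; you will have to decode them yourself. If you don't know how, search Baidu for a Unicode conversion tool.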
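If you would rather not use an online tool, the escapes can be decoded in a few lines. The following Python 3 sketch parses the four rule lines above into (tag, operator, flag) tuples; `parse_rule` is a hypothetical helper name, and the sketch assumes every value has exactly three `|`-separated fields, as the lines above do.

```python
# The four lines of rule.properties, copied verbatim as raw strings.
RULE_LINES = [
    r"360=class|contain|\u5206\u4EAB\u8005\u5DF2\u53D6\u6D88\u6B64\u5206\u4EAB",
    r"baidu=title|contain|\u94FE\u63A5\u4E0D\u5B58\u5728",
    r"weiyun=<h3>|contain|\u5206\u4EAB\u8D44\u6E90\u5DF2\u7ECF\u5220\u9664",
    r"leshi=title|ed|\u63D0\u53D6\u6587\u4EF6",
]


def parse_rule(line):
    """Split 'name=tag|op|flag' and decode the Unicode escapes in the flag."""
    name, _, value = line.partition("=")
    tag, op, flag = value.split("|")
    # The flag is plain ASCII with backslash-u escapes, so this round trip is safe.
    decoded = flag.encode("ascii").decode("unicode_escape")
    return name, (tag, op, decoded)


rules = dict(parse_rule(line) for line in RULE_LINES)
for name, rule in rules.items():
    print(name, rule)
```

Decoded, the flags read 分享者已取消此分享 ("the sharer has cancelled this share"), 链接不存在 ("link does not exist"), 分享资源已经删除 ("the shared resource has been deleted"), and 提取文件 ("extract file").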

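That completes the check for whether a link on Quzhuanpan has expired. The code is now fully public; if you like this post, please bookmark it and follow me.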

I have set up a QQ group where everyone is welcome to discuss technology. Group number: 512245829. If you prefer Weibo, just follow 轉盤娛樂.

