A Simple Program for Scraping Posts from Qiushibaike

Reposted article. 2012-05-25 15:09. Tags: Qiushibaike / regular expressions

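I have been reading Qiushibaike since 2008, and ever since I bought a smartphone I have read it on my phone. The site shows ads at both the top and the bottom of each page. I only want to read the posts, not the ads, and skipping them also saves mobile data. So I wondered whether I could write a program that fetches the posts and strips out everything else, and the code below is the result. I hope the people at Qiushibaike will not hold it against me; this was only a small experiment.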

Front-end page (Default.aspx):

<%@ Page Language="C#" AutoEventWireup="true" CodeBehind="Default.aspx.cs" Inherits="WebTest._Default" EnableViewState="false" %>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
  <title>糗事百科</title>
  <style type="text/css">
    body{margin:5px;font:12px arial,simsun;background:#fff;}
    img{border:none;}
    a{text-decoration:none;}
    .qiushi{margin:5px 0;padding:10px;border-bottom:1px solid #ece5d8;}
  </style>
</head>
<body><form id="bodyForm" runat="server"></form></body></html>

Code-behind (Default.aspx.cs):

protected void Page_Load(object sender, EventArgs e)
{
      string URI = "http://wap3.qiushibaike.com";
      // "param" carries the site-relative path to fetch; when it is missing, the home page is fetched.
      string pageInfo = (Request.QueryString["param"] ?? string.Empty).Trim();
      URI = URI + pageInfo;

      // Render the filtered markup inside the server-side form.
      bodyForm.InnerHtml = Server.HtmlDecode(getQiushi(URI));
}
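
Page_Load builds the target URL by plain string concatenation. As a minimal sketch of a stricter variant, the helper below resolves param against the site root with System.Uri and falls back to the home page when the result points at a different host. The class name TargetUriDemo, the method Build, and the same-host rule are assumptions made for illustration; they are not part of the original program.

```csharp
using System;

static class TargetUriDemo
{
    // Site root used by Page_Load in the article.
    static readonly Uri BaseUri = new Uri("http://wap3.qiushibaike.com");

    // Hypothetical helper: returns the absolute URL to fetch for a given "param".
    // Falls back to the site root when param is empty, cannot be parsed,
    // or resolves to a host other than the site itself.
    public static string Build(string param)
    {
        if (string.IsNullOrEmpty(param))
            return BaseUri.AbsoluteUri;

        Uri target;
        bool parsed = Uri.TryCreate(BaseUri, param.Trim(), out target);
        if (parsed && string.Equals(target.Host, BaseUri.Host, StringComparison.OrdinalIgnoreCase))
            return target.AbsoluteUri;

        return BaseUri.AbsoluteUri;
    }
}
```

With this helper, Page_Load could call getQiushi(TargetUriDemo.Build(Request.QueryString["param"])) instead of concatenating the two strings.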

The getQiushi method:

private string getQiushi(string URI)
{
      // Needs System.IO, System.Net, System.Text and System.Text.RegularExpressions.
      WebRequest request = WebRequest.Create(URI);
      string resultstring;
      // Dispose the response and the reader as soon as the page has been read.
      using (WebResponse result = request.GetResponse())
      using (StreamReader sr = new StreamReader(result.GetResponseStream()))
      {
        resultstring = sr.ReadToEnd();
      }
      StringBuilder responseString = new StringBuilder();

      // Matches one post.
      Regex regContent = new Regex("<div class=\"qiushi\">(?<content>[\\s\\S]+?)</div>");
      // Matches the comment block.
      Regex regComment = new Regex("<p class=\"vote\">(?<content>[\\s\\S]+?)</p>", RegexOptions.IgnoreCase);
      // Matches the poster information.
      Regex regUserInfo = new Regex("<p class=\"user\">(?<content>[\\s\\S]+?)</p>", RegexOptions.IgnoreCase);
      // Matches site-relative links; [^\"\\s]* keeps the capture inside a single attribute value.
      Regex regLinks = new Regex("(href=\")(/[^\"\\s]*)(\")", RegexOptions.IgnoreCase);
      // Match the paging links; [^\"]* stops the match from running across an earlier <a> tag.
      Regex regPrevPage = new Regex("<a href=\"([^\"]*)\">上一頁</a>");
      Regex regNextPage = new Regex("<a href=\"([^\"]*)\">下一頁</a>");
      // Matches line breaks (the original character class also removed '|' characters).
      Regex regBlankLine = new Regex(@"[\r\n]");

      MatchCollection mcContent = regContent.Matches(resultstring);
      Match mcPrevPage = regPrevPage.Match(resultstring);
      Match mcNextPage = regNextPage.Match(resultstring);
      // Emit a paging link only when the source page actually has one.
      string prevPage = mcPrevPage.Success ? "<a href=\"?param=" + mcPrevPage.Groups[1].Value + "\">上一頁</a>&nbsp;&nbsp;" : "";
      string nextPage = mcNextPage.Success ? "<a href=\"?param=" + mcNextPage.Groups[1].Value + "\">下一頁</a>" : "";

      for (int i = 0; i < mcContent.Count; i++)
      {
        string content = mcContent[i].ToString();
        content = regComment.Replace(content, "");                 // drop the comment block
        content = regUserInfo.Replace(content, "");                // drop the poster information
        content = regLinks.Replace(content, "href=\"?param=$2\""); // route links back through this page
        content = regBlankLine.Replace(content, "");               // strip line breaks

        responseString.Append(content);
      }

      responseString.Append("<div style=\"text-align:center\">" + prevPage);
      responseString.Append(nextPage + "</div>");

      return responseString.ToString();
}

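The param parameter read in Page_Load carries the target path for the previous-page, next-page, and tag links. The basic functionality now works and the ads are gone, but comments cannot be viewed.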
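
To make the role of param concrete, here is a small self-contained sketch of the link rewriting step, using a pattern in the spirit of regLinks from getQiushi. The class name LinkRewriteDemo and the sample markup are made up for illustration; the real site's paths may differ.

```csharp
using System;
using System.Text.RegularExpressions;

static class LinkRewriteDemo
{
    // Captures a site-relative href (one that starts with "/") as group 2.
    static readonly Regex RelativeHref =
        new Regex("(href=\")(/[^\"\\s]*)(\")", RegexOptions.IgnoreCase);

    // Moves the captured path into the "param" query-string value,
    // so that clicking the link reloads this page with that path.
    public static string Rewrite(string html)
    {
        return RelativeHref.Replace(html, "href=\"?param=$2\"");
    }

    static void Main()
    {
        // Hypothetical markup, not taken from the real site.
        string sample = "<a href=\"/tag/demo\">demo</a>";
        Console.WriteLine(Rewrite(sample));
        // prints: <a href="?param=/tag/demo">demo</a>
    }
}
```

When the rewritten link is clicked, Page_Load receives /tag/demo in param, appends it to the site root, and fetches that page through the same filter.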
