Python crawler: article scraper (handling \n, \t, \r, etc. in strings sensibly)

Reposted article, 2017-08-12 18:07. Tag: Python crawler

import urllib.request
import re
import time

num=input("Enter the start date (e.g. 20150101000): ")



def openpage(url):
    # Download the page and decode it as GB2312.
    with urllib.request.urlopen(url) as html:
        page=html.read().decode('gb2312')

    return page

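def getpassage(page):
    # First pass: capture everything from the opening <p> tag to the last </FONT>.
    passage = re.findall(r'<p class="MsoNormal" align="left">([\s\S]*)</FONT>',str(page))

    # Second pass: strip the remaining HTML tags. str() turns the list into its
    # repr, so control characters appear as the literal text \r, \n, \t.
    passage1=re.sub(r"</?\w+[^>]*>", "", str(passage))

    # Turn the literal escapes back into real characters, drop the list
    # brackets, and replace &nbsp; entities with spaces.
    passage2=passage1.replace('\\r', '\r').replace('\\n', ' \n').replace('\\t','\t').replace(']','').replace('[','').replace('&nbsp;','   ')

    print(passage2)

    # Append to the output file; `load` and `date` are module-level names
    # assigned in the loop below.
    with open(load,'a',encoding='utf-8') as f:
        f.write("-----------------------------"+"Date "+str(date)+"---------------------------------\n"+passage2+"----------------------------------------------------\n")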





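for i in range(1,32):
    date=int(num)+int(i)
    print(date)
    load="C:/Users/home/Desktop/新建文本文檔.txt"
    url=("http://www.hbuas.edu.cn/news/xyxw/news_"+str(date)+".htm")

    try:
        page=openpage(url)
        getpassage(page)
        print("Day "+str(i)+": article found and downloaded")
    except Exception:
        # Any failure (missing page, decode error, ...) is reported as "no article".
        print("Day "+str(i)+": no article.")
    # Pause between requests.
    time.sleep(2)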

I wrote a crawler that scrapes my school's news site.

It mainly involves regular expressions (re), urllib.request, and writing to a file.
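The urllib.request part can be sketched a little more defensively. This is only an illustration, not the code the script uses: the names `decode_page` and `fetch_page`, the 10-second timeout, and the choice to replace undecodable bytes are all assumptions of mine.

```python
import urllib.error
import urllib.request


def decode_page(raw, encoding="gb2312"):
    """Decode raw bytes, replacing undecodable bytes instead of raising."""
    return raw.decode(encoding, errors="replace")


def fetch_page(url, encoding="gb2312", timeout=10):
    """Return the decoded page text, or None if the URL cannot be fetched."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            raw = resp.read()
    except urllib.error.URLError:
        # HTTPError (for example a 404) is a subclass of URLError.
        return None
    return decode_page(raw, encoding)
```

Returning None for a missing page keeps "no article on that day" separate from other failures, which the bare try/except in the script cannot do.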

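When scraping articles, the raw result usually comes back full of clutter that hurts readability: HTML tags, &nbsp; entities, and literal \r, \n, \t sequences.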
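The literal \r, \n, \t text is easy to reproduce. In the small sketch below the page fragment is made up; the point is only that calling str() on the list returned by re.findall gives the list's repr, in which control characters are written as backslash escapes and the whole thing is wrapped in brackets and quotes.

```python
import re

# Made-up page fragment containing real CR, LF and TAB characters.
page = '<p class="MsoNormal" align="left">Line one\r\n\tLine two&nbsp;</FONT>'

matches = re.findall(r'<p class="MsoNormal" align="left">([\s\S]*)</FONT>', page)
as_text = str(matches)  # repr of the list, not the article text itself

print(as_text)  # prints: ['Line one\r\n\tLine two&nbsp;']
```

The printed line contains a backslash followed by the letter n, not an actual line break, which is why the script later replaces '\\n' with '\n'.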

Cleanup:

Two regex passes

passage = re.findall(r'<p align="left">([\s\S]*)</FONT>',str(page))       # first pass: match the article block

passage1=re.sub(r"</?\w+[^>]*>", "", str(passage))       # second pass: strip the HTML tags
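To see what each pass does, here is a small sketch on a made-up HTML fragment. It uses the pattern from the full script (with class="MsoNormal") and applies the second pass to the matched string directly instead of to str() of the list; the sample markup is an assumption, not the real page.

```python
import re

# Made-up fragment standing in for a news page.
sample = (
    '<html><body><p class="MsoNormal" align="left">'
    '<FONT size="3">First <b>paragraph</b>.</FONT></p></body></html>'
)

# Pass 1: capture everything between the opening <p> tag and the last </FONT>.
blocks = re.findall(r'<p class="MsoNormal" align="left">([\s\S]*)</FONT>', sample)

# Pass 2: remove any remaining opening or closing tags from the captured text.
text = re.sub(r"</?\w+[^>]*>", "", blocks[0])

print(text)  # prints: First paragraph.
```

Note that `[\s\S]*` is greedy, so on a page with several </FONT> tags the capture runs to the last one.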

Replacements

passage2=passage1.replace('\\r', '\r').replace('\\n', ' \n').replace('\\t','\t').replace(']','').replace('[','').replace('&nbsp;',' ')
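The chain of replace() calls is needed because str() was applied to a list. An alternative, sketched here under my own assumptions rather than taken from the original script, is to join the matches into one string so the control characters never become escape text, and to let html.unescape handle entities. The name `clean_passage` is hypothetical.

```python
import html
import re

PASSAGE_RE = re.compile(r'<p class="MsoNormal" align="left">([\s\S]*)</FONT>')
TAG_RE = re.compile(r"</?\w+[^>]*>")


def clean_passage(page):
    """Extract the article text with real line breaks and no HTML tags."""
    matches = PASSAGE_RE.findall(page)
    raw = "".join(matches)        # plain string: \r, \n, \t stay real characters
    no_tags = TAG_RE.sub("", raw)
    # html.unescape turns &nbsp; into U+00A0; map that to an ordinary space.
    return html.unescape(no_tags).replace("\xa0", " ")
```

With this approach no brackets or quotes from the list repr end up in the text, so square brackets that belong to the article itself are preserved.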

After these steps the text is written with real line breaks and without HTML tags.

Done!
