Pitfalls I hit as a Python beginner (Python 3.7.3)

Reposted article. Published 2019-07-12 18:04. Category: Python

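First, the steps I followed after installing the Python language support plugin in IDEA. Note that I am on Windows, and everything below assumes Windows.

Start by downloading the Anaconda3 installer executable.

Then run the installer. One step has a checkbox that is ticked by default; also tick the option above it that adds Anaconda to the PATH environment variable, then click Next through the remaining steps.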


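After installation, open cmd and run conda info --envs to list the environments. By default there is only one, base.

The py37 environment is one I created later with: conda create -n py37 (append python=3.7 to install that interpreter version into it)

Delete an environment (use with care): conda remove -n py37 --all

Activate an environment: conda activate base (older Windows installs use activate base)

Deactivate the current environment: conda deactivate

Once an environment is active, you are ready for takeoff.

As is customary, hello world: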

#!/usr/bin/python3
print('hello world')
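
To confirm that a script really runs on the interpreter from the activated conda environment, a quick check like the one below can help. It is only a sketch; the 3.7 threshold simply mirrors the version used in this post.

```python
import sys

# Path of the interpreter running this script; with an active conda
# environment it should point inside that environment's folder
interpreter_path = sys.executable

# Version of the running interpreter as a (major, minor) tuple
version = tuple(sys.version_info[:2])

print('interpreter:', interpreter_path)
print('version: %d.%d' % version)
if version < (3, 7):
    print('warning: older than the Python 3.7 used in this post')
```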

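1. Error:

TypeError: can't use a string pattern on a bytes-like object

Cause: in Python 3, the response returned by urlopen yields bytes from read(), not str, so a str regular-expression pattern cannot be applied to it. Decode the HTML with decode('utf-8') to turn the bytes into a str.
Fix: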

html = html.decode('utf-8')
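
The error can be reproduced without any network access. In this sketch the bytes literal stands in for what page.read() would return, and the pattern and markup are made up for illustration.

```python
import re

# Stand-in for the bytes returned by urlopen(...).read()
raw = b'<a href="page1.html">one</a><a href="page2.html">two</a>'
pattern = r'href="(.+?)"'

# A str pattern applied to bytes raises TypeError
try:
    re.findall(pattern, raw)
    error = None
except TypeError as exc:
    error = str(exc)

# After decoding, the same pattern works
html = raw.decode('utf-8')
links = re.findall(pattern, html)

print(error)
print(links)
```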

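2. Cause of the error:

When a URL is opened with urllib.request.urlopen, the server receives only a bare request for the page. It learns nothing about the browser, operating system, or hardware platform behind the request (urllib sends its own default User-Agent, Python-urllib/<version>), and requests lacking this information are usually not ordinary visits; crawlers are a typical example.

To block such traffic, some sites validate the User-Agent header of each request (it describes the hardware platform, system software, application software, and user preferences). If the User-Agent looks abnormal or is missing, the request is rejected, which is the error seen here.

So try adding User-Agent information to the request.
Fix: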

import urllib.request

def getHtml(url):
    # URLopener is deprecated, so send the browser-like headers with a Request object instead
    headers = {
        'Accept': '*/*',
        'Accept-Language': 'en-US,en;q=0.8',
        'Cache-Control': 'max-age=0',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36',
        'Connection': 'keep-alive',
        'Referer': 'http://www.baidu.com/',
    }
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as page:
        html = page.read()
    # Skip undecodable bytes instead of raising UnicodeDecodeError
    return html.decode('utf-8', 'ignore')

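3. To use one .py file from another in the same folder, simply import it by its module name (the file name without .py).

4. In Python 3, the download function urlretrieve can be imported from urllib.request as shown below. The same pattern works for other functions and cuts down on code: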
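
A minimal sketch of importing one file from another. The module name greeting_helper and its greet function are hypothetical; the file is written to a temporary folder only so that the example is self-contained. In a real project the two files would simply sit side by side.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical helper module, written to a temporary folder for the demo
folder = Path(tempfile.mkdtemp())
(folder / 'greeting_helper.py').write_text(
    "def greet(name):\n    return 'hello ' + name\n", encoding='utf-8'
)

# A plain import only searches the folders listed in sys.path
# (the folder of the running script is on that list automatically)
sys.path.insert(0, str(folder))
importlib.invalidate_caches()
import greeting_helper

message = greeting_helper.greet('world')
print(message)
```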

from  urllib.request import urlretrieve

urlretrieve(url,filename)

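5. When fetching the HTML, the call can still fail even with decode('utf-8'):

'utf-8' codec can't decode byte 0xd7 in position 309: invalid continuation byte

Fix: decode('utf-8','ignore'). This skips the bytes that cannot be decoded. The error usually means the page is not actually UTF-8 encoded (many Chinese sites use GBK), so the skipped bytes may be real text.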
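
The trade-off of 'ignore' is easy to see on a small sample. This sketch assumes the page is really GBK-encoded, which is only a guess about the site in question; the sample markup is made up.

```python
# Bytes of a page that is encoded as GBK, not UTF-8 (assumed for the demo)
raw = '<p>中文</p>'.encode('gbk')

# Strict UTF-8 decoding fails on the GBK bytes
try:
    raw.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# 'ignore' silently drops every byte that is not valid UTF-8
lossy = raw.decode('utf-8', 'ignore')

# Decoding with the matching codec keeps the text intact
correct = raw.decode('gbk')

print(strict_ok, repr(lossy), repr(correct))
```

If the real encoding is known, decoding with it is preferable to 'ignore', because no text is lost.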

6. Convert an int variable index to str: str(index)

The reverse, converting a str variable string to int: int(string) (base 10 by default)

7. Showing download progress
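
A short sketch of both conversions, including the optional base argument of int(). The values are arbitrary.

```python
index = 7
as_text = str(index)             # int -> str
count = int('42')                # str -> int, base 10 by default
from_hex = int('ff', 16)         # the optional second argument selects the base

# Text that is not an integer literal raises ValueError
try:
    int('4.2')
    parsed = True
except ValueError:
    parsed = False

print(as_text, count, from_hex, parsed)
```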

def Schedule(a,b,c):
    # a: blocks transferred so far, b: block size in bytes, c: total file size in bytes
    per = 100.0 * a * b / c
    if per>100:
        per = 100
        print('Done!')
    print('%.2f%%' % per)

urlretrieve(fileUrl,fileName,Schedule)
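
A variant of the progress hook that also copes with an unknown total size. urlretrieve passes -1 as the size when the server sends no Content-Length, which would make the percentage meaningless. To keep the sketch runnable offline it downloads from a local file:// URL; the file names and the report function are made up for the demo.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

progress = []  # percentages reported by the hook, kept for inspection

def report(block_count, block_size, total_size):
    # total_size is -1 when the length is unknown; skip reporting then
    if total_size <= 0:
        return
    percent = min(100.0, 100.0 * block_count * block_size / total_size)
    progress.append(percent)
    print('%.2f%%' % percent)

# Hypothetical local file standing in for a remote download
folder = Path(tempfile.mkdtemp())
source = folder / 'source.bin'
source.write_bytes(b'x' * 20000)
target = folder / 'copy.bin'

urlretrieve(source.as_uri(), str(target), report)
```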

8. A script that scrapes the images from a web page and saves them into a newly created folder

# coding:utf-8
import requests
from bs4 import BeautifulSoup
import os
# Name of the folder to create
FileName = 'mm'
def dd():
    if not os.path.exists(os.path.join(os.getcwd(), FileName)):     # create the folder if it is missing
        os.mkdir(os.path.join(os.getcwd(),FileName))
        print('Created a folder named', FileName)
    else:
        print('A folder named', FileName, 'already exists')
    url = 'http://www.xiaohuar.com/list-1-1.html'
    html = requests.get(url).content    # raw HTML as bytes
    # html = html.decode('utf-8')
    soup  = BeautifulSoup(html,'html.parser')   # BeautifulSoup object
    jpg_data = soup.find_all('img',width="210") # find the image tags
    index = 1
    for i in jpg_data:
        deindex = str(index) +  "a"
        data = i['src'] # URL of the image
        print("Image URL: "+data)
        if "https://www.dxsabc.com/" not in data:
            data = 'http://www.xiaohuar.com'+data
        r2 = requests.get(data)
        fpath = os.path.join(FileName,deindex)
        with open(fpath+'.jpg','wb+')as f : # write each image to disk
            f.write(r2.content)
        index += 1
    print('Saved. Go and check the images!')

if __name__== '__main__':
    dd()
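
The exists-then-mkdir check in the script can be collapsed into a single call. This is a sketch of that alternative, not what the script itself does; it uses a temporary parent folder and placeholder bytes so that it runs without network access.

```python
import os
import tempfile

base = tempfile.mkdtemp()              # throwaway parent folder for the demo
folder = os.path.join(base, 'mm')

os.makedirs(folder, exist_ok=True)     # creates the folder
os.makedirs(folder, exist_ok=True)     # second call is a no-op, not an error

# Write one file into the folder, as the scraper does for each image
image_path = os.path.join(folder, '1a.jpg')
with open(image_path, 'wb') as f:
    f.write(b'placeholder bytes, not a real image')
```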

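9. A script that scrapes the articles behind a site's .shtml links and saves them as files in a newly created folder

1) I want to check whether an <a> tag I fetched contains 'shtml':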

result = string.find(a_data,shtml)!=-1

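if result:
    pass  # contains it
else:
    pass  # does not contain it

AttributeError: module 'string' has no attribute 'find'

The function moved when Python was upgraded: the Python 2 function string.find() no longer exists in Python 3. Changing string to str is enough, as in str.find(a_data, 'shtml'), provided a_data is a str. The more common spellings are a_data.find('shtml') != -1 or simply 'shtml' in a_data. Note that the search text must be quoted.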
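
A sketch of the substring checks available on a str in Python 3. The first href is the path of an article URL used later in this post; the second is made up.

```python
href = '/hainan/hngs/201906/c0a08f5b5a7e42b2bab66212de76b050.shtml'
other = '/hainan/index.html'

by_method = href.find('shtml') != -1   # str.find returns -1 when the text is absent
by_operator = 'shtml' in href          # the idiomatic membership test
by_suffix = href.endswith('.shtml')    # stricter: only matches the file extension
missing = 'shtml' in other

print(by_method, by_operator, by_suffix, missing)
```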

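2) The approach: find every <a> tag on the page, collect all the href values, and filter them down to the suitable links.

Then loop over those links, fetch each page, and pull out the content I am after.

3) TypeError: can only concatenate str (not "int") to str

An int and a str cannot be concatenated, so convert index first: str(index)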
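
A sketch of the failure and three equivalent ways around it. The file name pattern follows the one used in mainTODO.py; the index value is arbitrary.

```python
index = 3

# Concatenating str and int directly raises TypeError
try:
    name = 'shtml' + index + '.txt'
    error = None
except TypeError as exc:
    error = str(exc)

name_a = 'shtml' + str(index) + '.txt'   # explicit conversion
name_b = 'shtml{}.txt'.format(index)     # str.format converts for you
name_c = f'shtml{index}.txt'             # f-string, Python 3.6+

print(error)
print(name_a, name_b, name_c)
```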

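4) The generated text files did contain the article body, but when I called the API the data was not passed along, so validation returned: missing required field.

Debugging showed why: at the moment I read the file, the write stream had not finished yet, so the text came back empty. The write has to complete first; only then can the file be read. In practice this usually means the file was read before the writer had been flushed or closed.

The read and the write ran in the same thread, just in the wrong order. Being short on time, I split the work into two separate runs for now. I am setting the threading question aside, since this is only an experiment.

I will come back to it after studying multithreading in Python.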
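
One common way to end up with an empty read is Python's write buffering: a small write stays in memory until the file is flushed or closed. Whether that is exactly what happened here is an assumption. The sketch below reproduces the effect with a temporary file; the file name is made up.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'article.txt')

writer = open(path, 'w', encoding='utf-8')
writer.write('title\nbody text\n')

# This small write is still in the writer's buffer, so a reader sees nothing yet
with open(path, 'r', encoding='utf-8') as reader:
    before_close = reader.read()

# Closing (or flushing) the writer pushes the text to disk
writer.close()

with open(path, 'r', encoding='utf-8') as reader:
    after_close = reader.read()

print(repr(before_close), repr(after_close))
```

Writing inside a with block closes the file at the end of the block, so any read placed after it sees the full content.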

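5) After working through a series of small problems, I finally had some results.

Unfortunately, when the run reached item 115, the program failed with:

urllib3.exceptions.MaxRetryError: HTTPConnectionPool

That is the next problem to deal with.

6) urllib3.exceptions.MaxRetryError: HTTPConnectionPool

Looking into it, I found that the check in my logic for joining shtml links was flawed, so http:// ended up duplicated in the URL, which of course could not be reached. With the check fixed, the error went away.

7) The full code. There are four files, and some of the code is only there for debugging. The main calling logic lives in mainTODO.py. I am just practising, so some of the logic is clumsy.

The code cannot be reused as is, because it does not adapt well: some of the tags are specific to the site I was targeting. For similar problems in the future, it can serve as a reference for the approach.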
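
One way to avoid this kind of duplicated prefix is to let urllib.parse.urljoin resolve each href against the page URL, instead of concatenating strings by hand. This is a sketch of that alternative, not the fix used in the scripts below; the example hrefs are hypothetical.

```python
from urllib.parse import urljoin

base = 'http://www.hainan.gov.cn/hainan/'

# A root-relative href is resolved against the host
relative = urljoin(base, '/hainan/hngs/index.shtml')

# An href that is already absolute is returned unchanged, so http:// is never doubled
absolute = urljoin(base, 'http://www.hainan.gov.cn/hainan/index.shtml')

# A bare file name is resolved against the folder of the base URL
sibling = urljoin(base, 'index.shtml')

print(relative)
print(absolute)
print(sibling)
```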

checkshtml.py

# coding:utf-8
import requests
from bs4 import BeautifulSoup
import os
import checkwebsite
import string
# Name of the folder to create
FilePath = 'I:\\test\\'
FileName = 'xhtml'
def dd(url):
    if not os.path.exists(os.path.join(os.getcwd(), FileName)):     # create the folder if it is missing
        os.mkdir(os.path.join(os.getcwd(),FileName))
        print('Created a folder named', FileName)
    else:
        print('A folder named', FileName, 'already exists')

    html = requests.get(url).content    # raw HTML as bytes
    # html = html.decode('utf-8')
    soup  = BeautifulSoup(html,'html.parser')   # BeautifulSoup object
    # jpg_data = soup.find_all(r'href="(*.+?\.shtml)"') # find the shtml links
    a_datalist = soup.find_all('a') # find every <a> tag
    index = 1
    datalist = []
    for a_data in a_datalist:
        result = a_data.get('href')
        if result is not None:
            # str1 = "Hello.python", str2 = "."
            # str1[:str1.index(str2)]  -> text before the "." (dot excluded): Hello
            # str1[str1.index(str2):]  -> text from the "." onward (dot included): .python
            if "http://www.hainan.gov.cn/" not in result:
                if "http://www"not in result:
                    result = 'http://www.hainan.gov.cn'+result
                datalist.append(result)
    return datalist


                # print(index)
        # index += 1

checkwebsite.py

# coding:utf-8
import requests
from bs4 import BeautifulSoup
import codecs
import urllib
from html5lib import HTMLParser
import os

def checkTitle(soup):
    divtitle = soup.find_all('ucaptitle')
    strtitle = str(divtitle)
    if 'ucaptitle' in strtitle:
        strtitle = strtitle[:strtitle.index("</ucaptitle>")]
        strtitle = strtitle[::-1]
        strtitle = strtitle[:strtitle.index(">eltitpacu")]
        strtitle = strtitle[::-1]
        # print(strtitle)
        return str(strtitle)
    else:
        return ''

def checkContent(soup):
    uuuu = ''
    divlist = soup.find_all('ucapcontent')
    for pfont in divlist:  # walk through every matched element
        p = pfont.find_all('p')

        for pp in p:
            start_pp = str(pp)
            p_start_pp = start_pp[:start_pp.index("</p>")]
            p_start_pp = p_start_pp[p_start_pp.index('>'):]
            result = p_start_pp[::-1]
            result = result[:result.index(">")]
            result = result[::-1]
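            # 'w' write only; 'r' read only; 'a' append to the file
            # 'w+' read and write; 'r+' read and write; 'a+' read and append
            # 'wb+' write binary data
            # Opening a file in 'w' mode overwrites whatever it already contains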
            uuuu = uuuu+str(result+'\n')
    return uuuu
            # return result


def writeTxt(result):
    f = codecs.open('I:\\data.txt','a','utf-8')
    f.write(str(result))
    f.close()

if __name__== '__main__':
    sHtmlUrl = "http://www.hainan.gov.cn/hainan/hngs/201906/c0a08f5b5a7e42b2bab66212de76b050.shtml"
    html = requests.get(sHtmlUrl).content
    soup  = BeautifulSoup(html,'html.parser')
    # writeTxt(checkTitle(soup))
    writeTxt(checkContent(soup))
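    ## This script scrapes all the target articles into a local txt file.
    ## The next file reads the local article, calls the remote POST service, and writes the summary into the txt file, so only the txt file needs to be analysed.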

readFIleAndPost.py

import codecs
import requests
import json
import checkwebsite
import time

fileUrl = 'I:\\data.txt'
def two(fileUrl):
    num = 0
    f = codecs.open(fileUrl,'r','utf-8')
    string=''
    for l in f:
        tup = l.rstrip('\n').rstrip()
        # print(tup)
        string += tup
        num = num+1
    # print (str)
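    # POST request
    url = 'http://localhost:8080/documentNew/parserNew'
    s = json.dumps({'content': string, 'keywordCount': '5','summarySize':'2'})
    try:
        r = requests.post(url, s)
    except requests.exceptions.RequestException:
        # Fill the %s placeholder, which the original message left empty
        print('[%s] HTTP request failed! Preparing to resend...' % url)
        time.sleep(2)
        r = requests.post(url, s)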

    resultStr = str(r.text)
    r.close()
    return resultStr


if __name__ == '__main__':
    checkwebsite.writeTxt(two(fileUrl))
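
The try/except above retries exactly once. A more general pattern is a small retry helper, sketched below. call_with_retry and flaky_post are hypothetical names; flaky_post only simulates a request that fails twice, so the example runs without a server.

```python
import time

def call_with_retry(func, attempts=3, delay=0.0, exceptions=(Exception,)):
    """Call func() until it succeeds or the attempts run out."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as exc:
            last_error = exc
            print('attempt %d failed: %s' % (attempt, exc))
            if attempt < attempts:
                time.sleep(delay)
    # Every attempt failed: re-raise the last error
    raise last_error

# Stand-in for requests.post(url, s): fails twice, then succeeds
calls = {'count': 0}

def flaky_post():
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('simulated network failure')
    return 'ok'

result = call_with_retry(flaky_post, attempts=3, delay=0.0)
print(result)
```

With requests, the call would be wrapped as lambda: requests.post(url, s), and exceptions narrowed to requests.exceptions.RequestException.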

mainTODO.py

import checkshtml
import checkwebsite
import readFIleAndPost
import requests
import os
import shutil
from bs4 import BeautifulSoup
import codecs

url = 'http://www.hainan.gov.cn/hainan/'
filePath = 'I:\\test\\shtml\\'
updateFilePath = 'I:\\test\\xhtml\\'
def downLoadFile(url):
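    # First collect the href values found on the target site
    resultlist = checkshtml.dd(url)
    shtmllist = []
    # Keep only the links that point to shtml pages
    for link in resultlist:
        if "shtml" in link:
            shtmllist.append(link)
    index = 1
    # Loop over the shtml links and extract the text needed from each one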
    for shtml in shtmllist:
        # shtml = "http://www.hainan.gov.cn/hainan/hngs/201906/c0a08f5b5a7e42b2bab66212de76b050.shtml"
        # shtml = "http://www.hainan.gov.cn/hainan/index.shtml"
        html = requests.get(shtml).content
        soup  = BeautifulSoup(html,'html.parser')
        urlpath = filePath+'shtml'+str(index)+'.txt'
        updateurl = updateFilePath+'shtml'+str(index)+'.txt'
        # Get the title and body text of the article
        strtitle = checkwebsite.checkTitle(soup).strip()
        strcontent = checkwebsite.checkContent(soup)
        if not (strcontent=='') :
            f1 = codecs.open(urlpath,'w','utf-8')
            f2 = codecs.open(updateurl,'w','utf-8')
            # Create and open the text files, then write the title
            f1.write(strtitle+'\n')
            f2.write(strtitle+'\n')
            print("Source file path>>>>>>"+urlpath+">>>>>>title: "+strtitle)
            print("Working copy path>>>>>>"+updateurl+">>>>>>title: "+strtitle)
            # Write the body text returned by checkwebsite.checkContent()
            f1.write(strcontent)
            f2.write(strcontent)
            # Close the file streams
            index += 1
            f1.close()
            f2.close()
            # Move on to the next link

def updateFile():
    file_name = os.listdir(updateFilePath)
    for file in file_name:
        # Get the value returned by the API
        response = readFIleAndPost.two(updateFilePath+file)
        f = codecs.open(updateFilePath+file,'a','utf-8')
        f.write(response)
        f.close()

def deleteAndCopyFile():
    del_file(updateFilePath)
    f_list = os.listdir(filePath)
    n=0
    for fileNAME in f_list:
        n += 1
        oldname = filePath + fileNAME
        newname = updateFilePath + fileNAME
        shutil.copyfile(oldname, newname)
        # print(str(n)+'.'+'copied '+fileNAME)

def  del_file(path):
    for i in os.listdir(path):
        path_file = os.path.join(path,i)
        if os.path.isfile(path_file):
            os.remove(path_file)
        else:
            del_file(path_file)

if __name__== '__main__':
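    # Download two copies of the documents locally; run once, then comment out
    # downLoadFile(url)
    # Directly modify one of the two initial copies
    # updateFile()
    # Delete the existing samples, copy the source files (filePath is the source path), then send the POST requests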
    deleteAndCopyFile()
    updateFile()

10. Reading data from a dictionary

[{'name': '張三', 'phone': '185185', 'wechat': '6546231'}, {'name': '李四', 'phone': '187169', 'wechat': 'asdsad'}]

In this version:

for studeht in self.studehts :
    # wrong: get is a method, so square brackets raise a TypeError
    name=studeht .get["name"]
    # correct
    name=studeht["name"]
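
A self-contained sketch of the three access forms, using the records shown above. The 'email' key is hypothetical and only demonstrates what happens when a key is missing.

```python
studehts = [
    {'name': '張三', 'phone': '185185', 'wechat': '6546231'},
    {'name': '李四', 'phone': '187169', 'wechat': 'asdsad'},
]
first = studehts[0]

# Square brackets on the method itself fail: get must be called with ()
try:
    first.get['name']
    subscript_ok = True
except TypeError:
    subscript_ok = False

by_key = first['name']                 # raises KeyError if the key is missing
by_get = first.get('name')             # returns None if the key is missing
missing = first.get('email')           # hypothetical key that is absent
fallback = first.get('email', 'n/a')   # explicit default instead of None

print(subscript_ok, by_key, by_get, missing, fallback)
```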
