Using builtwith in Python 3 (a detailed walkthrough)


1. First, install builtwith with pip install builtwith:

C:\Users\Administrator>pip install builtwith  
Collecting builtwith  
  Downloading builtwith-1.3.2.tar.gz  
Installing collected packages: builtwith  
  Running setup.py install for builtwith ... done  
Successfully installed builtwith-1.3.2  

2. Create a new project in PyCharm and enter the following test code:

import builtwith  
tech_used = builtwith.parse('http://www.baidu.com')  
print(tech_used)  

Running it produces the following error:

C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy  
Traceback (most recent call last):  
  File "F:/python/first/FirstPy", line 1, in <module>  
    import builtwith  
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 43  
    except Exception, e:  
                    ^  
SyntaxError: invalid syntax  
  
  
Process finished with exit code 1  

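The cause is that builtwith was written for Python 2.x, so a few places need changing. Double-click the file path in PyCharm's error output to open the file, then make the following three kinds of changes:
1. The Python 2 form "except Exception, e" is no longer supported; change it to "except Exception as e".
2. Expressions that follow print in Python 2 must be wrapped in parentheses in Python 3, because print is now a function.
3. builtwith uses Python 2's urllib2 package, which does not exist in Python 3 (its functionality moved to urllib.request and urllib.error), so the urllib2-related code has to be rewritten.
Changes 1 and 2 are straightforward. The rest of this section deals with change 3.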
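For changes 1 and 2, here is a minimal sketch of the Python 3 syntax with the Python 2 form shown in comments. The function safe_divide is a hypothetical example and is not part of builtwith.

```python
def safe_divide(a, b):
    """Divide a by b, reporting a division by zero instead of raising."""
    try:
        return a / b
    except ZeroDivisionError as e:   # Python 2 form: except ZeroDivisionError, e:
        print('Error:', e)           # Python 2 form: print 'Error:', e
        return None


result = safe_divide(4, 2)
failed = safe_divide(1, 0)   # prints an "Error: ..." line and returns None
```

The same two edits apply to every except clause and print statement in builtwith's __init__.py.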
First, replace import urllib2 with the following:

 
import urllib.request  
import urllib.error  

Then replace the urllib2 calls as follows:

request = urllib.request.Request(url, None, {'User-Agent': user_agent})  
response = urllib.request.urlopen(request)  
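Put together, the Python 3 request code might look like the sketch below. The helper names build_request and fetch, the default user agent string, and the timeout value are assumptions made for illustration; builtwith itself performs these calls inline.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent='builtwith'):
    """Build a request object carrying a custom User-Agent header."""
    # Python 2 form: urllib2.Request(url, None, {'User-Agent': user_agent})
    return urllib.request.Request(url, None, {'User-Agent': user_agent})


def fetch(url, user_agent='builtwith', timeout=10):
    """Return (headers, body_bytes), or (None, None) on a network error."""
    request = build_request(url, user_agent)
    try:
        # Python 2 form: urllib2.urlopen(request)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.headers, response.read()
    except urllib.error.URLError as e:   # HTTPError is a subclass of URLError
        print('Error:', e)
        return None, None
```

Note that fetch returns the body as bytes, not str; this matters for the next step.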

Run the project again and a different error appears:

C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy  
Traceback (most recent call last):  
  File "F:/python/first/FirstPy", line 3, in <module>  
    builtwith.parse('http://www.baidu.com')  
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 62, in builtwith  
    if contains(html, snippet):  
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 105, in contains  
    return re.compile(regex.split('\\;')[0], flags=re.IGNORECASE).search(v)  
TypeError: cannot use a string pattern on a bytes-like object  
  
  
Process finished with exit code 1  

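This happens because in Python 3 response.read() returns bytes rather than str, while builtwith's regular expressions are str patterns, so the data has to be decoded first. Change the following code: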

if html is None:  
    html = response.read()  

to:

if html is None:
    html = response.read()
    html = html.decode('utf-8')
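The underlying cause can be reproduced without builtwith. The sketch below uses a made-up HTML fragment and a simplified pattern (both are assumptions for illustration) to show a str pattern failing on bytes and succeeding once the bytes are decoded.

```python
import re

# A str pattern, like the ones builtwith compiles from its rule data.
pattern = re.compile('jquery', flags=re.IGNORECASE)

# What response.read() returns in Python 3: bytes, not str.
raw = b'<script src="jQuery.min.js"></script>'

try:
    pattern.search(raw)
    message = None
except TypeError as e:
    message = str(e)          # a str pattern cannot search a bytes object

# Decoding turns the bytes into str, after which the search works.
text = raw.decode('utf-8')
match = pattern.search(text)
```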

Run it again to get the final result:

C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy  
{'javascript-frameworks': ['jQuery']}  
  
  
Process finished with exit code 0  

However, if the site is changed to 'http://www.163.com', it fails again:

C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy  
Error: 'utf-8' codec can't decode byte 0xcd in position 500: invalid continuation byte  
Traceback (most recent call last):  
  File "F:/python/first/FirstPy", line 2, in <module>  
    tech_used = builtwith.parse('http://www.163.com')  
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 63, in builtwith  
    if contains(html, snippet):  
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 106, in contains  
    return re.compile(regex.split('\\;')[0], flags=re.IGNORECASE).search(v)  
TypeError: cannot use a string pattern on a bytes-like object  
  
  
  
Process finished with exit code 1  

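This still looks like an encoding problem: the "Error:" line shows that the page is not valid UTF-8, so the decode fails, html apparently stays as bytes, and the same TypeError follows. Changing the decode call to html.decode('gbk') makes it run successfully: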

C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy  
{'web-servers': ['Nginx']}  
  
  
Process finished with exit code 0  
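The mismatch is easy to reproduce in isolation. This sketch encodes a short Chinese string as GBK (the sample string is an assumption standing in for the real page content) and shows that decoding it as UTF-8 fails while decoding it as GBK recovers the text.

```python
# Bytes as a GBK-encoded site would send them (sample text, for illustration).
raw = '网易'.encode('gbk')

try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError as e:
    utf8_ok = False
    print('Error:', e)        # same kind of message as in the output above

# Decoding with the encoding the bytes were actually written in works.
text = raw.decode('gbk')
```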

So do different sites need different decodings? Here is one way to detect a site's encoding.
It requires a package called chardet, installed as follows:

C:\Users\Administrator>pip install chardet  
Collecting chardet  
  Downloading chardet-2.3.0-py2.py3-none-any.whl (180kB)  
    100% |████████████████████████████████| 184kB 616kB/s  
Installing collected packages: chardet  
Successfully installed chardet-2.3.0  
  
  
C:\Users\Administrator>  

Passing bytes to chardet's detect function returns a dict. In this version it holds two values: the detected encoding and a confidence score.

{'encoding': 'utf-8', 'confidence': 0.99}  

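Modify the corresponding code in builtwith as follows:

encode_type = chardet.detect(html)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
if encode_type['encoding'] == 'utf-8':
    html = html.decode('utf-8')
else:
    # anything that is not detected as UTF-8 is treated as GBK here
    html = html.decode('gbk')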

Remember to add import chardet at the top of the file.
With chardet choosing between the two encodings, builtwith now handles both of the sites above.
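The check above only distinguishes UTF-8 from GBK. A slightly more general variant, sketched below, tries whatever encoding the detector reports and then falls back to a fixed list. The function decode_html, its parameters, and the 0.5 confidence threshold are assumptions for illustration and are not part of builtwith or chardet; the detect argument is expected to be a callable shaped like chardet.detect.

```python
def decode_html(raw, detect=None, min_confidence=0.5, fallbacks=('utf-8', 'gbk')):
    """Decode page bytes to str using a detector's guess, then fallbacks.

    detect: optional callable such as chardet.detect, returning a dict
            with 'encoding' and 'confidence' keys.
    """
    candidates = []
    if detect is not None:
        guess = detect(raw)
        # Only trust the guess if it names an encoding with enough confidence.
        if guess.get('encoding') and guess.get('confidence', 0) >= min_confidence:
            candidates.append(guess['encoding'])
    candidates.extend(fallbacks)

    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue   # wrong or unknown encoding, try the next candidate

    # Last resort: keep going with undecodable bytes replaced.
    return raw.decode('utf-8', errors='replace')
```

Inside builtwith this would be called as html = decode_html(html, detect=chardet.detect). Passing the detector as an argument keeps the helper usable, and testable, even when chardet is not installed.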

 http://blog.csdn.net/fengzhizi76506/article/details/61617067

