urljoin() from urlparse: a crawler essential

This article is a repost. 2015-08-18 18:03. Tags: crawlers / Python / Scrapy

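First import the function and use help() to read its documentation. The session below uses the Python 2 urlparse module.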

>>> from urlparse import urljoin
>>> help(urljoin)
Help on function urljoin in module urlparse:

urljoin(base, url, allow_fragments=True)
    Join a base URL and a possibly relative URL to form an absolute
    interpretation of the latter.

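In other words, it combines a base address with a relative address to produce an absolute address. Stated that way, though, it is rather abstract.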
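The session above is Python 2. As a small sketch for readers on Python 3, where the same function is exposed from urllib.parse, the import can be written with a fallback. The base URL here is simply the one used in this article's examples.

```python
# Import urljoin on either Python 3 or Python 2.
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

base = "http://www.google.com/1/aaa.html"

# A bare file name replaces the last path segment of the base.
joined = urljoin(base, "bbbb.html")
print(joined)  # http://www.google.com/1/bbbb.html
```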

So let's look at a few examples and work out the rules from them.

>>> urljoin("http://www.google.com/1/aaa.html","bbbb.html")
'http://www.google.com/1/bbbb.html'
>>> urljoin("http://www.google.com/1/aaa.html","2/bbbb.html")
'http://www.google.com/1/2/bbbb.html'
>>> urljoin("http://www.google.com/1/aaa.html","/2/bbbb.html")
'http://www.google.com/2/bbbb.html'
>>> urljoin("http://www.google.com/1/aaa.html","http://www.google.com/3/ccc.html")
'http://www.google.com/3/ccc.html'
>>> urljoin("http://www.google.com/1/aaa.html","http://www.google.com/ccc.html")
'http://www.google.com/ccc.html'
>>> urljoin("http://www.google.com/1/aaa.html","javascript:void(0)")
'javascript:void(0)'
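A few further cases, which are not in the original list and are added here only as an illustration, follow the same resolution rules: a parent-directory reference, an empty reference, a fragment-only reference, and a scheme-relative reference. The example.com host is an assumption made up for the sketch, and the import shown is the Python 3 one.

```python
from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin

BASE = "http://www.google.com/1/aaa.html"

# (relative reference, expected absolute URL)
CASES = [
    ("../bbbb.html", "http://www.google.com/bbbb.html"),            # climb one directory
    ("", BASE),                                                     # empty reference: base itself
    ("#top", BASE + "#top"),                                        # fragment only: same page
    ("//example.com/ccc.html", "http://example.com/ccc.html"),      # inherits only the scheme
]

results = []
for ref, expected in CASES:
    result = urljoin(BASE, ref)
    results.append(result)
    print(repr(ref), "->", result)
    assert result == expected
```

The empty and fragment-only references both resolve to the page itself, which is why a crawler needs to handle links that point back to the current page.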

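The rules are not hard to spot, but that is not the end of the job. Special cases still need handling, such as a link that points to the page itself or a link that contains invalid characters.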

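# Fragment taken from inside the crawler's per-link loop (hence the `continue`).
url = urljoin("****", "****")

# str.find() returns the index of the first match, or -1 if there is none.
# Skip any link that contains a single quote.
if url.find("'") != -1:
    continue

# Keep only the part before the '#'.
url = url.split('#')[0]

# isindexed() is a function I defined myself; it reports whether the link
# is already in the database of saved links.
if url[0:4] == 'http' and not self.isindexed(url):

    # newpages = set(), an unordered collection with no duplicates
    newpages.add(url)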
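The fragment above depends on a surrounding loop and class. A self-contained sketch of the same filtering might look like the following. The function name collect_new_links, its parameters, and the use of a plain set in place of the author's database-backed isindexed() are all assumptions made for illustration. The check that skips links back to the current page is likewise an addition, based on the special case mentioned earlier.

```python
from urllib.parse import urljoin  # on Python 2: from urlparse import urljoin


def collect_new_links(page_url, hrefs, indexed):
    """Return absolute http(s) links found on page_url that are not yet indexed.

    `indexed` is a set of already-saved URLs; it stands in for the
    database lookup (isindexed) used in the original fragment.
    """
    newpages = set()
    for href in hrefs:
        url = urljoin(page_url, href)

        # Skip links containing a single quote.
        if "'" in url:
            continue

        # Keep only the part before the '#'.
        url = url.split('#')[0]

        # Skip links that resolve to the page itself.
        if url == page_url:
            continue

        # Keep only http/https links that have not been saved yet.
        if url.startswith('http') and url not in indexed:
            newpages.add(url)
    return newpages


page = "http://www.google.com/1/aaa.html"
hrefs = ["bbbb.html", "#top", "javascript:void(0)",
         "/2/it's.html", "/2/bbbb.html", "bbbb.html#sec"]
indexed = {"http://www.google.com/2/bbbb.html"}

found = collect_new_links(page, hrefs, indexed)
print(found)  # {'http://www.google.com/1/bbbb.html'}
```

Because newpages is a set, bbbb.html and bbbb.html#sec collapse into a single entry once the fragment is stripped.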
