Python tldextract module: accurately extracting domain names and suffixes

This article is a repost · 2018-01-15 17:13 · Tags: tldextract, domain names / Python learning

Python tldextract module - overview

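tldextract accurately separates the generic or country-code top-level domain (gTLD or ccTLD) from the registered domain and subdomains of a URL. For example, say you want just the 'google' part of http://www.google.com. The obvious approach is to split on '.' and take the last two elements as the domain and the suffix, but that only works for simple cases such as .com domains. Consider parsing http://forums.bbc.co.uk: the naive split gives you 'co' as the domain and 'uk' as the top-level domain, instead of 'bbc' and 'co.uk'. tldextract instead looks suffixes up in the Public Suffix List, which covers all currently valid gTLDs and ccTLDs. So, given a URL, it can tell the subdomain from the domain, and the domain from the country-code suffix.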
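To make the failure concrete, here is a minimal sketch of the naive approach described above, using only the standard library. The helper name `naive_split` is hypothetical and is not part of tldextract.

```python
from urllib.parse import urlsplit


def naive_split(url):
    """Naively treat the last two dot-separated labels as (domain, suffix)."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    if len(labels) < 2:
        # Single-label host (or no host at all): nothing to split off
        return host, ""
    return labels[-2], labels[-1]


print(naive_split("http://www.google.com"))    # ('google', 'com')
print(naive_split("http://forums.bbc.co.uk"))  # ('co', 'uk'), not ('bbc', 'co.uk')
```

The second call shows the problem: a multi-label suffix such as co.uk cannot be recognized without knowing which suffixes exist.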

    >>> import tldextract
    >>> tldextract.extract('http://forums.news.cnn.com/')
    ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')
    >>> tldextract.extract('http://forums.bbc.co.uk/') # United Kingdom
    ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk')
    >>> tldextract.extract('http://www.worldbank.org.kg/') # Kyrgyzstan
    ExtractResult(subdomain='www', domain='worldbank', suffix='org.kg')

ExtractResult is a namedtuple, so it is simple to access the parts you want.

    >>> ext = tldextract.extract('http://forums.bbc.co.uk')
    >>> (ext.subdomain, ext.domain, ext.suffix)
    ('forums', 'bbc', 'co.uk')
    >>> # rejoin subdomain and domain
    >>> '.'.join(ext[:2])
    'forums.bbc'
    >>> # a common alias
    >>> ext.registered_domain
    'bbc.co.uk'

The subdomain and suffix are optional. Not every URL-like input has a subdomain or a valid suffix.

    >>> tldextract.extract('google.com')
    ExtractResult(subdomain='', domain='google', suffix='com')
    >>> tldextract.extract('google.notavalidsuffix')
    ExtractResult(subdomain='google', domain='notavalidsuffix', suffix='')
    >>> tldextract.extract('http://127.0.0.1:8080/deployed/')
    ExtractResult(subdomain='', domain='127.0.0.1', suffix='')

If you want to rejoin the whole namedtuple, regardless of whether a subdomain or suffix was found:

    >>> ext = tldextract.extract('http://127.0.0.1:8080/deployed/')
    >>> # this has unwanted dots
    >>> '.'.join(ext)
    '.127.0.0.1.'
    >>> # join each part only if it's truthy
    >>> '.'.join(part for part in ext if part)
    '127.0.0.1'

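This module started out as an implementation of the accepted answer to a StackOverflow question about getting the "domain name" from a URL. However, the regular-expression solution proposed there does not handle many country codes, such as com.au, nor the exceptions to country codes, such as the registered domain parliament.uk. The Public Suffix List does, and so does this module.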
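The idea of matching against a suffix list, rather than a regex, can be sketched roughly as follows. This is an illustration only, not tldextract's actual implementation: `SAMPLE_SUFFIXES` is a tiny hand-picked set rather than the real Public Suffix List, `Parts` and `split_host` are hypothetical names, and the sketch ignores the list's wildcard and exception rules as well as IP addresses.

```python
from collections import namedtuple
from urllib.parse import urlsplit

Parts = namedtuple("Parts", ["subdomain", "domain", "suffix"])

# Hand-picked sample for illustration; the real list has thousands of entries
SAMPLE_SUFFIXES = {"com", "uk", "co.uk", "kg", "org.kg", "au", "com.au"}


def split_host(url, suffixes=SAMPLE_SUFFIXES):
    """Split a URL's host into (subdomain, domain, suffix) by longest suffix match."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    # Candidates run from the whole host down to the last label,
    # so the first hit is the longest matching suffix.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in suffixes:
            if i == 0:
                # The whole host is itself a suffix
                return Parts("", "", candidate)
            return Parts(".".join(labels[:i - 1]), labels[i - 1], candidate)
    # No known suffix: treat the last label as the domain
    return Parts(".".join(labels[:-1]), labels[-1], "")


print(split_host("http://forums.bbc.co.uk/"))
print(split_host("http://www.worldbank.org.kg/"))
print(split_host("http://google.notavalidsuffix/"))
```

Trying the longest candidate first is what lets co.uk win over uk, which a fixed "last two labels" rule cannot do.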

Installing tldextract

Latest release on PyPI:

pip install tldextract

Or the latest development version:

pip install -e 'git+https://github.com/john-kurkowski/tldextract.git#egg=tldextract'

Command-line usage, with URLs separated by spaces:

tldextract http://forums.bbc.co.uk
# forums bbc co.uk

A note about caching

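Beware: the first time the module runs, it updates its suffix list with a live HTTP request. The updated suffix set is then cached indefinitely in /path/to/tldextract/.tld_set. (Arguably, runtime bootstrapping like this should not be the default behavior, for example on production systems. But I want you to have the latest suffixes, especially when I have not kept this code up to date.) To avoid this fetch, or to control where the cache lives, use your own extract callable, either by setting the TLDEXTRACT_CACHE environment variable or by setting the cache_file path when initializing TLDExtract.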

    import tldextract

    # extract callable that falls back to the included TLD snapshot, no live HTTP fetching
    no_fetch_extract = tldextract.TLDExtract(suffix_list_urls=None)
    no_fetch_extract('http://www.google.com')

    # extract callable that reads/writes the updated TLD set to a different path
    custom_cache_extract = tldextract.TLDExtract(cache_file='/path/to/your/cache/file')
    custom_cache_extract('http://www.google.com')

    # extract callable that doesn't use caching
    no_cache_extract = tldextract.TLDExtract(cache_file=False)
    no_cache_extract('http://www.google.com')

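If you want to keep the suffix definitions up to date (they do not change often), delete the cache file occasionally, or run the update command: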

tldextract --update

or:

env TLDEXTRACT_CACHE="~/tldextract.cache" tldextract --update

It is also recommended to delete the cache file after upgrading this library.

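Advanced usage

Specifying your own URL or file for the suffix list data

You can supply your own input data in place of the default Mozilla Public Suffix List: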

    extract = tldextract.TLDExtract(
        suffix_list_urls=["http://foo.bar.baz"],
        # Recommended: Specify your own cache file, to minimize ambiguities about where
        # tldextract is getting its data, or cached data, from.
        cache_file='/path/to/your/cache/file')

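The snippet above fetches from the URL you specified the first time the suffix list needs to be downloaded (that is, when the cache_file does not exist). If you want to use input data from your local filesystem, just use the file:// protocol: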

    extract = tldextract.TLDExtract(
        # file:// followed by the absolute path, which itself starts with "/"
        suffix_list_urls=["file:///absolute/path/to/your/local/suffix/list/file"],
        cache_file='/path/to/your/cache/file')

Use an absolute path when specifying the suffix_list_urls keyword argument. os.path is your friend.
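One way to build such a URL from a possibly relative path is sketched below with the standard library. The function name `local_suffix_list_url` and the file name `suffix_list.dat` are hypothetical; the commented-out call only shows where the result would go.

```python
import os.path
from pathlib import Path


def local_suffix_list_url(path):
    """Return a file:// URL for `path`, made absolute against the current directory."""
    absolute = os.path.abspath(path)
    # Path.as_uri() only accepts absolute paths and percent-encodes where needed
    return Path(absolute).as_uri()


url = local_suffix_list_url("suffix_list.dat")
print(url)

# The result would then be passed along like this:
# extract = tldextract.TLDExtract(suffix_list_urls=[url],
#                                 cache_file='/path/to/your/cache/file')
```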

If I pass an invalid URL, I still get a result and no error. Why?

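To keep tldextract light in lines of code and overhead, and because there are plenty of URL validators out there, this library is very lenient about its input. If valid URLs matter to you, validate them before calling tldextract. This lenient stance lowers the learning curve of the library, at the cost of desensitizing users to the nuances of URLs; it is hard to say by how much. In the future I would consider an overhaul: for example, users could opt in to validation and receive either exceptions or error metadata on the results.

tldextract on GitHub: https://github.com/john-kurkowski/tldextract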
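A minimal sketch of validating before extracting, as the answer above suggests. It assumes that "valid" means an http or https URL with a non-empty host; real validation needs vary, and `looks_like_http_url` is a hypothetical helper, not part of tldextract.

```python
from urllib.parse import urlsplit


def looks_like_http_url(candidate):
    """Return True if `candidate` parses as an http(s) URL with a host."""
    try:
        parts = urlsplit(candidate)
        host = parts.hostname
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. unbalanced IPv6 brackets
        return False
    return parts.scheme in ("http", "https") and bool(host)


for candidate in ("http://forums.bbc.co.uk", "google.notavalidsuffix", "http://"):
    print(candidate, looks_like_http_url(candidate))
```

Only inputs that pass a check like this would then be handed to tldextract.extract.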
