Using Python to Access Data on the Web

Reposted article, 2017-02-16 11:56

Over the past couple of days I finished this course on Coursera:

Using Python to Access Web Data

https://www.coursera.org/learn/python-network-data/

I worked through the assignments along the way, and these are my study notes for future reference.

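1. Regular expressions: the later lessons barely used them, but they are still a handy skill to have.

Here are the main regular expression constructs listed in the course: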

Python Regular Expression Quick Guide
^        Matches the beginning of a line
$        Matches the end of the line
.        Matches any character
\s       Matches whitespace
\S       Matches any non-whitespace character
*        Repeats a character zero or more times
*?       Repeats a character zero or more times
         (non-greedy)
+        Repeats a character one or more times
+?       Repeats a character one or more times
         (non-greedy)
[aeiou]  Matches a single character in the listed set
[^XYZ]   Matches a single character not in the listed set
[a-z0-9] The set of characters can include a range
(        Indicates where string extraction is to start
)        Indicates where string extraction is to end
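
The difference between the greedy and non-greedy rows in the guide is easiest to see side by side. The following is a minimal sketch in Python 3 syntax (the listings in these notes are Python 2); the sample line is made up for illustration.

```python
import re

line = "From: alice@example.org: hello"

# Greedy: .+ grabs as much as it can, so the match runs to the LAST colon.
greedy = re.findall("^F.+:", line)

# Non-greedy: .+? stops as soon as the rest of the pattern can match,
# so the match ends at the FIRST colon.
lazy = re.findall("^F.+?:", line)

print(greedy)
print(lazy)
```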

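One thing I had not noticed before: with a pattern like From([0-9a-z]), the whole pattern still has to match, but what you get back is only the part inside the ().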

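Also, to match a literal dot rather than any character, the dot has to be escaped as \. (or written as [.]). A bare . still matches any character, even inside parentheses as in (.).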
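
Both points can be checked with a few calls to re.findall. This is a small sketch in Python 3 syntax; the address in the sample line is invented for illustration.

```python
import re

line = "From alice@example.org Sat Jan  5 09:14:16 2008"

# Without parentheses, findall returns the whole matched text.
whole = re.findall(r"^From \S+@\S+", line)

# With parentheses, the whole pattern must still match,
# but findall returns only the parenthesised part.
host = re.findall(r"^From \S+@(\S+)", line)

# A bare . matches any character; \. matches only a literal dot.
any_char = re.findall(r"a.c", "abc a.c")
literal_dot = re.findall(r"a\.c", "abc a.c")

print(whole, host)
print(any_char, literal_dot)
```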

Here is my assignment code:

import re

def sumText(name):
        # Add up every run of digits found in the file
        total = 0
        with open(name, 'r') as handle:
                for line in handle:
                        # findall returns an empty list when nothing matches,
                        # so no length check is needed
                        for num in re.findall('[0-9]+', line):
                                total += int(num)
        return total

filedir = raw_input("input fileName :")
sum1 = sumText(filedir)
print sum1

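2. Creating a socket connection with Python

The course introduced sockets, which are what you use to communicate with an application. Every network application has its own port number, so protocol + hostname + port is enough to locate the application and talk to it.

It then demonstrated fetching a page from an HTTP service using telnet.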
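
To make the "protocol + hostname + port" idea concrete, here is a hedged sketch in Python 3 syntax that pulls those three pieces out of a URL with the standard library. The helper name endpoint and the DEFAULT_PORTS table are my own illustration, not something from the course.

```python
from urllib.parse import urlsplit

# Well-known default ports for the two schemes handled in this sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}

def endpoint(url):
    """Return (scheme, host, port) for a URL, filling in the default port."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return parts.scheme, parts.hostname, port

first = endpoint("http://data.pr4e.org/intro-short.txt")
second = endpoint("http://localhost:8080/index.html")
print(first)
print(second)
```

The tuple returned for the first URL carries the same host and port that a socket connect call would need.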

telnet www.cnblogs.com 80
 GET http://www.cnblogs.com/webarn/p/6398989.html HTTP/1.0

This does not always succeed, and I do not think the cause is the slowness the course mentioned.

Here is a simpler way that I know of:

curl -XGET http://www.cnblogs.com/webarn/p/6398989.html

Or open the connection directly from Python with a socket, as in the following code:

import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
# HTTP request lines end with CRLF, and a blank line ends the request
mysock.send('GET http://data.pr4e.org/intro-short.txt HTTP/1.0\r\n\r\n')

while True:
    data = mysock.recv(512)
    # recv returns an empty string once the server closes the connection
    if len(data) < 1:
        break
    print data

mysock.close()
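
What the loop prints is the raw response: a status line, some headers, a blank line, and then the body. As a rough sketch in Python 3 syntax, the pieces can be separated as below. The canned response stands in for the bytes collected from recv(), its header values and body are invented, and split_response is a hypothetical helper name.

```python
# A canned response standing in for the bytes collected from recv();
# the header values and body here are invented for illustration.
raw = (b"HTTP/1.0 200 OK\r\n"
       b"Content-Type: text/plain\r\n"
       b"Content-Length: 12\r\n"
       b"\r\n"
       b"Hello world\n")

def split_response(raw_bytes):
    """Split a raw HTTP response into (status_line, headers dict, body bytes)."""
    # The first blank line (CRLF CRLF) separates the headers from the body.
    head, _, body = raw_bytes.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body

status, headers, body = split_response(raw)
print(status)
print(headers["content-type"])
print(body)
```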

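Alternatively, the browser's developer tools show all of this at a glance.

3. Understanding HTML and parsing it

Most web pages are written in HTML, the HyperText Markup Language, so the course pointed out that Python has a package for parsing HTML: BeautifulSoup.

The project's link is: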

https://www.crummy.com/software/BeautifulSoup/

See that page for usage details.

Then comes how to use it in code. Here is a small demo from my own assignment:

import urllib
from BeautifulSoup import *

url = raw_input('Enter - ')
html = urllib.urlopen(url).read()

soup = BeautifulSoup(html)

# Add up the number in the first <span> of every table row
total = 0
trs = soup('tr')
for tr in trs:
        if tr.span is not None:
                num = int(tr.span.contents[0])
                total += num
print total
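
If BeautifulSoup is not installed, a similar sum can be sketched with the standard library's html.parser. This is a swapped-in technique, not what the course uses. It is written in Python 3 syntax, the HTML sample is invented, and unlike the demo above it adds up every numeric span rather than only the first one in each row.

```python
from html.parser import HTMLParser

class SpanSummer(HTMLParser):
    """Add up the integer text found inside <span> elements."""

    def __init__(self):
        super().__init__()
        self.in_span = False
        self.total = 0

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False

    def handle_data(self, data):
        # Only count text that sits inside a span and is purely digits.
        text = data.strip()
        if self.in_span and text.isdigit():
            self.total += int(text)

html = """
<table>
  <tr><td>Ann</td><td><span class="comments">97</span></td></tr>
  <tr><td>Bob</td><td><span class="comments">3</span></td></tr>
  <tr><td>No count here</td></tr>
</table>
"""
parser = SpanSummer()
parser.feed(html)
print(parser.total)
```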

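4. Web services and XML

The course introduced XML, the eXtensible Markup Language. It is mainly used to transport and store data, and it is fairly readable. Many web service protocols are designed around XML.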

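One concept here is the schema. For example, in the past we would write XSD files to express constraints on an XML data structure, such as whether a field is optional or required, or what type a field has. A schema is a constraint, and it also works like a contract between the two sides.

There are many standards for schemas as well.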

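XML parsing uses a package built into Python:

xml.etree.ElementTree, which parses XML as a tree structure, so to get a field's value you work your way down from the root node.

The code is as follows: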

import urllib
import xml.etree.ElementTree as ET

url = raw_input("Enter location:")
print 'Retrieving', url
uh = urllib.urlopen(url)
data = uh.read()
print '\nRetrieved ', len(data), ' characters'
tree = ET.fromstring(data)

comments = tree.findall('.//comment')
# Total the <count> value of every <comment> element
total = 0
count = len(comments)
print 'Count:', count
for comment in comments:
        total += int(comment.find('count').text)
print 'Sum:', total
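
The "work down from the root" idea can be tried without a network connection. The sketch below is in Python 3 syntax and uses an invented XML string that only assumes the same comment/count element names as the code above. It compares an explicit path from the root with the './/' search.

```python
import xml.etree.ElementTree as ET

# Invented sample data; only the comment/count element names match the notes.
data = """
<commentinfo>
  <comments>
    <comment><name>Ann</name><count>97</count></comment>
    <comment><name>Bob</name><count>3</count></comment>
  </comments>
</commentinfo>
"""
root = ET.fromstring(data)

# An explicit path from the root: each step names a direct child.
by_path = root.findall("comments/comment")
# './/' searches at any depth below the root.
anywhere = root.findall(".//comment")

counts = [int(c.find("count").text) for c in anywhere]
print(len(by_path), len(anywhere), sum(counts))
```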

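5. JSON and APIs

This section touched on SOA, service-oriented architecture, which large systems tend to use. My impression is that each system has a middle layer for communication, and that the data protocols and formats used for it are unified, so the systems can talk to one another. There are further issues to consider, such as service discovery. But once an SOA architecture is in place, all the different systems can communicate.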

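APIs: the course used the Google Maps API and the Twitter API as examples. Many applications provide an API for others to call; an application programming interface is simply the interface for communicating with a system. The API format is fairly simple, and the calls are REST style. The RESTful style probably deserves an article of its own, including how REST and APIs relate to each other, but most of the APIs I have seen are just a URL plus parameters, like http://www.xxxx.com?para1=11&param2=11. As I understand it, that is much the same as the protocol + host + port described earlier, with parameters added.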
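
Building such a "URL plus parameters" address by hand is error-prone, because spaces and commas have to be escaped. Here is a minimal sketch in Python 3 syntax, where urlencode lives in urllib.parse (in the Python 2 used in these notes it is urllib.urlencode). The endpoint and parameter values are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base = "http://www.example.com/search?"   # hypothetical endpoint
params = {"address": "Ann Arbor, MI", "sensor": "false"}

# urlencode escapes each value and joins the pairs with a single &.
url = base + urlencode(params)
print(url)

# Decoding the query string again recovers the original values.
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```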

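JSON: JSON is a concise data interchange format. It has only one version and will never be revised, and it is much more lightweight than XML. It has only two data structures, the map (object) and the list (array). See json.org for the rest. (Chrome crashed three times while I was writing this paragraph, and so did I...) One more thing: loads is the function that parses a string, while load parses a file.

Here is the code: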
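
The loads/load distinction and the two data structures can be seen in a few lines. This is a sketch in Python 3 syntax with an invented JSON string; io.StringIO stands in for an open file so that nothing needs to exist on disk.

```python
import io
import json

text = '{"comments": [{"name": "Ann", "count": 97}, {"name": "Bob", "count": 3}]}'

# loads: parse from a string.
from_string = json.loads(text)

# load: parse from a file-like object (StringIO stands in for an open file).
from_file = json.load(io.StringIO(text))

# JSON objects become dicts and JSON arrays become lists.
kinds = (type(from_string).__name__, type(from_string["comments"]).__name__)
total = sum(item["count"] for item in from_string["comments"])
print(kinds, total)
```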

import json
import urllib

url = raw_input('Enter location:')
print 'Retrieving', url
uh = urllib.urlopen(url)
data = uh.read()
print 'Retrieved', len(data)
info = json.loads(data)
print 'Count:', len(info['comments'])
# Total the "count" field of every comment object
total = 0
for comment in info['comments']:
        total += int(comment['count'])
print 'Sum: ', total

Fetching from an API and then parsing the JSON:

import urllib
import json

serviceurl = 'http://python-data.dr-chuck.net/geojson?'

while True:
        address = raw_input('Enter location:')

        if len(address) < 1 :
                break

        url = serviceurl + urllib.urlencode({'sensor': 'false', 'address': address})
        print 'Retrieving ', url
        uh = urllib.urlopen(url)
        data = uh.read()
        print 'Retrieved',len(data),'characters'

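        # json.loads raises ValueError on malformed input
        try: js = json.loads(str(data))
        except ValueError: js = None
        # Check for None first: "'status' not in None" would raise TypeError
        if js is None or 'status' not in js or js['status'] != 'OK':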
                print '==== Failure To Retrieve ===='
                print data
                continue

        print json.dumps(js, indent=4)

        print 'place_id:', js['results'][0]['place_id']
