In this post we crawl some of the most-starred Python projects on GitHub to practice using BeautifulSoup and pymysql.
A Python crawler for GitHub
Crawler requirement: fetch high-quality Python-related projects from GitHub. What follows is a test case and only crawls a small amount of data.
1. A crawler with the basic functionality
This example covers batch inserts with pymysql, parsing HTML with BeautifulSoup, and fetching pages with GET requests via the requests library. For more details on pymysql, see the blog post: python框架---->pymysql的使用
import requests
import pymysql.cursors
from bs4 import BeautifulSoup


def get_effect_data(data):
    """Parse the search result HTML and extract one tuple per project."""
    results = list()
    soup = BeautifulSoup(data, 'html.parser')
    projects = soup.find_all('div', class_='repo-list-item')
    for project in projects:
        writer_project = project.find('a', attrs={'class': 'v-align-middle'})['href'].strip()
        project_language = project.find('div', attrs={'class': 'd-table-cell col-2 text-gray pt-2'}).get_text().strip()
        project_starts = project.find('a', attrs={'class': 'muted-link'}).get_text().strip()
        update_desc = project.find('p', attrs={'class': 'f6 text-gray mb-0 mt-2'}).get_text().strip()
        # The href looks like "/<writer>/<project>", so split it into its two parts.
        result = (writer_project.split('/')[1], writer_project.split('/')[2],
                  project_language, project_starts, update_desc)
        results.append(result)
    return results


def get_response_data(page):
    """Request one page of GitHub search results for "python", sorted by stars."""
    request_url = 'https://github.com/search'
    params = {'o': 'desc', 'q': 'python', 's': 'stars', 'type': 'Repositories', 'p': page}
    resp = requests.get(request_url, params=params)
    return resp.text


def insert_datas(data):
    """Batch-insert the crawled rows into the project_info table."""
    connection = pymysql.connect(host='localhost', user='root', password='root', db='test',
                                 charset='utf8mb4', cursorclass=pymysql.cursors.DictCursor)
    try:
        with connection.cursor() as cursor:
            sql = ('insert into project_info(project_writer, project_name, project_language, '
                   'project_starts, update_desc) VALUES (%s, %s, %s, %s, %s)')
            cursor.executemany(sql, data)
        connection.commit()
    finally:
        connection.close()


if __name__ == '__main__':
    total_page = 2  # total number of result pages to crawl
    datas = list()
    for page in range(total_page):
        res_data = get_response_data(page + 1)
        data = get_effect_data(res_data)
        datas += data
    insert_datas(datas)
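The script assumes a project_info table already exists in the test database; it does not create one. Below is a minimal sketch of a schema that matches the INSERT statement above. The column names are taken from that statement, but the column types and the id column are assumptions, so adjust them to your needs.

import pymysql

# Hypothetical schema matching the columns used by insert_datas().
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS project_info (
    id INT AUTO_INCREMENT PRIMARY KEY,
    project_writer VARCHAR(100),
    project_name VARCHAR(200),
    project_language VARCHAR(50),
    project_starts VARCHAR(20),
    update_desc VARCHAR(100)
) CHARACTER SET utf8mb4
"""

connection = pymysql.connect(host='localhost', user='root', password='root',
                             db='test', charset='utf8mb4')
try:
    with connection.cursor() as cursor:
        cursor.execute(CREATE_SQL)
    connection.commit()
finally:
    connection.close()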
After running the script, the following data appears in the database:
id | project_writer | project_name | project_language | project_starts | update_desc |
11 | tensorflow | tensorflow | C++ | 78.7k | Updated Nov 22, 2017 |
12 | robbyrussell | oh-my-zsh | Shell | 62.2k | Updated Nov 21, 2017 |
13 | vinta | awesome-python | Python | 41.4k | Updated Nov 20, 2017 |
14 | jakubroztocil | httpie | Python | 32.7k | Updated Nov 18, 2017 |
15 | nvbn | thefuck | Python | 32.2k | Updated Nov 17, 2017 |
16 | pallets | flask | Python | 31.1k | Updated Nov 15, 2017 |
17 | django | django | Python | 29.8k | Updated Nov 22, 2017 |
18 | requests | requests | Python | 28.7k | Updated Nov 21, 2017 |
19 | blueimp | jQuery-File-Upload | JavaScript | 27.9k | Updated Nov 20, 2017 |
20 | ansible | ansible | Python | 26.8k | Updated Nov 22, 2017 |
21 | justjavac | free-programming-books-zh_CN | JavaScript | 24.7k | Updated Nov 16, 2017 |
22 | scrapy | scrapy | Python | 24k | Updated Nov 22, 2017 |
23 | scikit-learn | scikit-learn | Python | 23.1k | Updated Nov 22, 2017 |
24 | fchollet | keras | Python | 22k | Updated Nov 21, 2017 |
25 | donnemartin | system-design-primer | Python | 21k | Updated Nov 20, 2017 |
26 | certbot | certbot | Python | 20.1k | Updated Nov 20, 2017 |
27 | aymericdamien | TensorFlow-Examples | Jupyter Notebook | 18.1k | Updated Nov 8, 2017 |
28 | tornadoweb | tornado | Python | 14.6k | Updated Nov 17, 2017 |
29 | python | cpython | Python | 14.4k | Updated Nov 22, 2017 |
30 | | | Python | 14.2k | Updated Oct 17, 2017 |
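If you prefer to check the results from Python instead of a database client, a small query sketch is shown below. It assumes the same connection settings used in insert_datas() above.

import pymysql.cursors

connection = pymysql.connect(host='localhost', user='root', password='root', db='test',
                             charset='utf8mb4', cursorclass=pymysql.cursors.DictCursor)
try:
    with connection.cursor() as cursor:
        # Order by id so the rows come back in insertion order.
        cursor.execute('SELECT * FROM project_info ORDER BY id')
        for row in cursor.fetchall():
            print(row['project_writer'], row['project_name'], row['project_starts'])
finally:
    connection.close()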