Web Scraping Series: Saving CSV Files

Reposted article, 2021-12-09 15:52. Tags: Python, web scraping

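The previous installment, "Web Scraping Series: Storing Media Files", explained how to download media files with a scraper and walked through the related code.

This installment explains how to save data to a CSV file.

Comma-separated values (CSV, sometimes called character-separated values because the separator need not be a comma) is a common file format for storing tabular data. Microsoft Excel and many other applications support CSV because it is so simple. Here is an example of a CSV file: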

code,parentcode,level,name,parentcodes,province,city,district,town,pinyin,jianpin,firstchar,tel,zip,lng,lat
110000,100000,1,北京,110000,北京,,,,Beijing,BJ,B,,,116.405285,39.904989
110100,110000,2,北京市,"110000,110100",北京,北京市,,,Beijing,BJS,B,010,100000,116.405285,39.904989
110101,110100,3,東城區,"110000,110100,110101",北京,北京市,東城區,,Dongcheng,DCQ,D,010,100000,116.418757,39.917544
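Note that the parentcodes column in the sample is wrapped in double quotes because its value contains commas. The following is a minimal sketch, using only the standard csv and io modules and two lines copied from the sample above, of how a CSV parser treats such a field differently from a plain split on commas:

```python
import csv
import io

# Header and one data row copied from the sample above.
sample = (
    "code,parentcode,level,name,parentcodes,province,city,district,town,"
    "pinyin,jianpin,firstchar,tel,zip,lng,lat\n"
    '110100,110000,2,北京市,"110000,110100",北京,北京市,,,Beijing,BJS,B,010,100000,'
    "116.405285,39.904989\n"
)

# csv.reader honours the quotes, so the commas inside them stay in one field.
header, record = list(csv.reader(io.StringIO(sample)))
parent_codes = record[header.index("parentcodes")]

# A naive split treats every comma as a separator and yields one field too many.
naive_fields = sample.splitlines()[1].split(",")

print(len(header), len(record), len(naive_fields))
print(parent_codes)
```

The header and the parsed record have the same number of fields, while the naive split produces an extra one, which is why a real CSV parser is preferable to str.split.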

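As in Python, whitespace matters in CSV: every row ends with a newline character, and columns are separated by commas (hence the name "comma-separated values"). CSV files can also use a tab or another character to separate the columns, but this is less common.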
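As a small illustration of a non-comma separator, the sketch below writes and reads tab-separated rows with the standard csv module by passing delimiter="\t". The rows are a shortened, illustrative subset of the sample above, and an in-memory buffer stands in for a real file:

```python
import csv
import io

# In-memory buffer used instead of a file; newline="" leaves line endings to csv.
buffer = io.StringIO(newline="")

writer = csv.writer(buffer, delimiter="\t")
writer.writerow(("code", "name", "parentcodes"))
writer.writerow(("110100", "北京市", "110000,110100"))
tsv_text = buffer.getvalue()

# Read the same text back, telling the reader which delimiter was used.
buffer.seek(0)
rows = list(csv.reader(buffer, delimiter="\t"))

print(repr(tsv_text))
print(rows)
```

Because the delimiter is now a tab, the comma inside the parentcodes value no longer needs quoting.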

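If you only want to download a CSV file from a web page to your computer, without modifying or parsing it, you can skip the rest of this article and simply download and save the file with the method described in the previous article.

Python's csv library makes it easy to modify CSV files, or even to create one from scratch: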

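import csv
import os
from os import path


class DataSaveToCSV(object):
    @staticmethod
    def save_data():
        # Create the "files" folder under the current working directory if needed.
        get_path = path.join(os.getcwd(), 'files')
        if not path.exists(get_path):
            os.makedirs(get_path)
        # path.join builds a path that works on Windows, macOS, and Linux.
        # newline='' lets the csv module control the line endings itself.
        csv_file = open(path.join(get_path, 'test.csv'), 'w+', newline='')
        try:
            writer = csv.writer(csv_file)
            # Header row, followed by ten rows of data.
            writer.writerow(('number', 'number plus 2', 'number times 2'))
            for i in range(10):
                writer.writerow((i, i + 2, i * 2))
        finally:
            # Always close the file, even if writing fails.
            csv_file.close()


if __name__ == '__main__':
    DataSaveToCSV().save_data()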

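If the files folder does not exist, the script creates it. If test.csv already exists, Python overwrites it with the new data. Passing newline='' keeps blank lines from appearing between the rows.

After the script finishes, test.csv contains: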

number,number plus 2,number times 2
0,2,0
1,3,2
2,4,4
3,5,6
4,6,8
5,7,10
6,8,12
7,9,14
8,10,16
9,11,18
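One way to check the result is to read the file back. The sketch below is an illustration rather than part of the original project: the helper names write_numbers and read_numbers are hypothetical, and a temporary directory is used so the example leaves nothing behind. It writes the same table and loads it with csv.DictReader:

```python
import csv
import tempfile
from pathlib import Path


def write_numbers(csv_path):
    """Write the same ten-row table as the script above (hypothetical helper)."""
    with open(csv_path, "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(("number", "number plus 2", "number times 2"))
        for i in range(10):
            writer.writerow((i, i + 2, i * 2))


def read_numbers(csv_path):
    """Return every data row as a dict keyed by the header row."""
    with open(csv_path, newline="") as csv_file:
        return list(csv.DictReader(csv_file))


with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "test.csv"
    write_numbers(csv_path)
    records = read_numbers(csv_path)

print(len(records))
print(records[3])
```

Note that the csv module returns every value as a string, so numeric columns need an explicit int() or float() conversion after reading.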

The next example scrapes the articles of a blog and stores them in a CSV file. The code is as follows:

import csv
import os
from os import path

from utils import connection_util
from config import logger_config


class DataSaveToCSV(object):
    def __init__(self):
        self._init_download_dir = 'downloaded'
        self._target_url = 'https://www.scrapingbee.com/blog/'
        self._baseUrl = 'https://www.scrapingbee.com'
        self._init_connection = connection_util.ProcessConnection()
        logging_name = 'write_csv'
        init_logging = logger_config.LoggingConfig()
        self._logging = init_logging.init_logging(logging_name)

    def scrape_data_to_csv(self):
        # Create the "files" folder under the current working directory if needed.
        get_path = path.join(os.getcwd(), 'files')
        if not path.exists(get_path):
            os.makedirs(get_path)
        # path.join builds a path that works on Windows, macOS, and Linux.
        with open(path.join(get_path, 'article.csv'), 'w+', newline='', encoding='utf-8') as csv_file:
            writer = csv.writer(csv_file)
            writer.writerow(('Title', 'Published', 'Summary'))
            # Connect to the target site and fetch the page content.
            get_content = self._init_connection.init_connection(self._target_url)
            if get_content:
                parent = get_content.findAll("section", {"class": "section-sm"})[0]
                get_row = parent.findAll("div", {"class": "col-lg-12 mb-5 mb-lg-0"})[0]
                get_child_item = get_row.findAll("div", {"class": "col-md-4 mb-4"})
                for item in get_child_item:
                    # Article title text
                    get_title = item.find("a", {"class": "h5 d-block mb-3 post-title"}).get_text()
                    # Publication date
                    get_release_date = item.find("div", {"class": "mb-3 mt-2"}).findAll("span")[1].get_text()
                    # Article description
                    get_description = item.find("p", {"class": "card-text post-description"}).get_text()
                    writer.writerow((get_title, get_release_date, get_description))
            else:
                self._logging.warning('No article content was retrieved; please check!')


if __name__ == '__main__':
    DataSaveToCSV().scrape_data_to_csv()
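The scraper writes each row as a tuple as soon as it is parsed. As an alternative sketch, the scraped values could first be collected as dictionaries and then written with csv.DictWriter, which ties each value to a named column. Everything below is illustrative: the articles list holds made-up records standing in for scraped data, the write_articles helper is hypothetical, and an in-memory buffer replaces article.csv:

```python
import csv
import io

# Made-up records standing in for scraped results (not real blog data).
articles = [
    {"title": "Example post A", "published": "2021-12-01", "summary": "First, with a comma"},
    {"title": "Example post B", "published": "2021-12-08", "summary": 'Has "quotes"'},
]


def write_articles(records, csv_file):
    """Write a header row and one row per record (hypothetical helper)."""
    writer = csv.DictWriter(csv_file, fieldnames=["title", "published", "summary"])
    writer.writeheader()
    writer.writerows(records)


buffer = io.StringIO(newline="")
write_articles(articles, buffer)
lines = buffer.getvalue().splitlines()

for line in lines:
    print(line)
```

The writer quotes the summaries that contain a comma or a double quote, so scraped text with punctuation does not break the column layout.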

Most of the code in DataSaveToCSV is reused from earlier articles in this series. The parts worth highlighting are:

    logging_name = 'write_csv'
    init_logging = logger_config.LoggingConfig()
    self._logging = init_logging.init_logging(logging_name)

This sets the logger name and instantiates the logger, which is used later to record log messages.

    with open(path.join(get_path, 'article.csv'), 'w+', newline='', encoding='utf-8') as csv_file:

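The with statement (a statement, not a function) sets up a runtime context for the block it wraps. It packages the usual try...except...finally pattern so that it can be reused conveniently; here it guarantees that csv_file is closed when the block exits, even if an exception is raised.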
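To make the equivalence concrete, here is a minimal sketch that writes one row with an explicit try...finally and another with a with statement, then checks that the file is closed in both cases, including when the body raises. The temporary directory, the sample rows, and the simulated ValueError are illustrative assumptions:

```python
import csv
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "article.csv")

    # Explicit form: open, work, and close in a finally clause.
    csv_file = open(csv_path, "w", newline="", encoding="utf-8")
    try:
        csv.writer(csv_file).writerow(("title", "date"))
    finally:
        csv_file.close()
    closed_after_finally = csv_file.closed

    # Equivalent form: the with statement closes the file on exit.
    with open(csv_path, "a", newline="", encoding="utf-8") as csv_file:
        csv.writer(csv_file).writerow(("Example post", "2021-12-09"))
    closed_after_with = csv_file.closed

    # The file is also closed when the body raises an exception.
    try:
        with open(csv_path, "a", newline="", encoding="utf-8") as csv_file:
            raise ValueError("simulated failure while scraping")
    except ValueError:
        closed_after_error = csv_file.closed

    # Both rows written before the failure are still on disk.
    with open(csv_path, newline="", encoding="utf-8") as csv_file:
        saved_rows = list(csv.reader(csv_file))

print(closed_after_finally, closed_after_with, closed_after_error)
print(saved_rows)
```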

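newline='' prevents blank lines from appearing between the rows of the CSV file. The csv writer emits its own \r\n line endings, and without newline='' Windows text mode would translate each \n a second time.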
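The effect can be seen without a Windows machine. The sketch below captures what csv.writer emits into an in-memory buffer and then simulates, with a plain string replacement, the newline translation that Windows text mode applies when newline='' is omitted. The simulation is an assumption made for illustration; it is not a call into the file layer itself:

```python
import csv
import io

# newline="" on the buffer means no translation, just like open(..., newline='').
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(("a", "b"))
written = buffer.getvalue()

# Simulated Windows text-mode translation: every "\n" becomes "\r\n".
translated = written.replace("\n", "\r\n")

print(repr(written))
print(repr(translated))
```

The translated text ends in \r\r\n, and many programs display the stray \r as an empty line after each row.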

The file encoding is also set to utf-8, which avoids garbled text when the file contains Chinese or other non-ASCII characters.
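A short sketch of why the encoding matters: the row below contains Chinese text, is written as UTF-8, and is then read back twice, once with the matching encoding and once with latin-1, which is chosen here only as an example of a mismatched encoding. The temporary directory is an assumption made to keep the example self-contained:

```python
import csv
import tempfile
from pathlib import Path

row = ("標題", "發布時間", "內容概要")

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "article.csv"

    # Write the row with an explicit UTF-8 encoding.
    with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
        csv.writer(csv_file).writerow(row)

    # Reading with the same encoding restores the original text.
    with open(csv_path, newline="", encoding="utf-8") as csv_file:
        decoded_ok = next(csv.reader(csv_file))

    # Reading with a mismatched encoding does not fail, but the text is garbled.
    with open(csv_path, newline="", encoding="latin-1") as csv_file:
        decoded_wrong = next(csv.reader(csv_file))

print(decoded_ok)
print(decoded_wrong == list(row))
```

Stating the encoding explicitly on both the writing and the reading side avoids depending on the platform default, which differs between systems.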

That covers saving scraped content to a CSV file. All the code for this example is hosted on GitHub.

github: https://github.com/sycct/Scrape_1_1.git

If you have any questions, feel free to open an issue on GitHub.
