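【Project Goal】
Scan a large number of company annual reports (PDF files) for a keyword: determine whether each file contains "增值稅留抵稅額:XXXX" (VAT credit carried forward: XXXX), and write the file's name together with the matched content to a table.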
【Implementation】
1. Import the Python libraries used to process the PDFs
    import pdfplumber
    import PyPDF2
    import re
    import os
    import csv
    import json
2. Define a function that returns the number of pages in a PDF file
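    def get_pages(filename):
        with open(filename, 'rb') as fb:
            # PdfFileReader/getNumPages were removed in PyPDF2 3.0;
            # PdfReader and len(reader.pages) are the current equivalents
            pages = len(PyPDF2.PdfReader(fb).pages)
        return pages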
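3. Because the 增值稅留抵稅額 item usually appears in the second half of a report, the search loop starts at page index 100 (the 101st page) and stops 10 pages before the end. A regular expression locates the keyword and captures the rest of that line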
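    def get_text(filename, pages):
        with pdfplumber.open(filename) as pdf:
            for i in range(100, pages - 10):
                # Guard against pages with no extractable text
                text = pdf.pages[i].extract_text() or ''
                find = re.findall('增值稅留抵稅額(.*)', text)
                if find:
                    # Split the rest of the line into its space-separated fields
                    return find[0].strip().split(" ")
        return None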
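The extraction step can be tried without a PDF. The sketch below applies the same pattern to a made-up page text; the helper name `parse_balances` and the sample figures are hypothetical, not taken from any real report. It also uses `split()` with no argument, an assumption about what is wanted here, so that runs of several spaces between the two amounts do not produce empty fields.

```python
import re

# Same keyword as in get_text; the group captures the rest of the line
PATTERN = re.compile(r'增值稅留抵稅額(.*)')

def parse_balances(page_text):
    """Return the whitespace-separated fields after the keyword, or None."""
    match = PATTERN.search(page_text or '')
    if match is None:
        return None
    # split() collapses repeated spaces, unlike split(" ")
    return match.group(1).strip().split()

# Hypothetical page text with uneven spacing between the columns
sample = "其他流動資產\n增值稅留抵稅額  1,234,567.89   987,654.32\n合計 2,000.00 1,000.00"
print(parse_balances(sample))
```

Because `.` does not match a newline, the capture stops at the end of the keyword's own line, which is why the following row of the table is not picked up.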
4. Save the table
    def save(company_name, report_date, end_balance, start_balance):
        # Append one row per report to the CSV file
        with open('annual_report.csv', 'a', newline="", encoding='utf-8') as f_csv:
            writer = csv.writer(f_csv)
            writer.writerow([company_name, report_date, end_balance, start_balance])
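Since the file is opened in append mode, the table never gets a header row. One possible variation, sketched here under the assumption that a header is wanted, writes it only when the file is new or empty. The function name `save_with_header`, the `HEADER` list, and the demo values are illustrative.

```python
import csv
import os
import tempfile

# Column names mirror the parameters of save(); chosen for illustration
HEADER = ['company_name', 'report_date', 'end_balance', 'start_balance']

def save_with_header(path, row):
    """Append one row, writing the header first if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as f_csv:
        writer = csv.writer(f_csv)
        if need_header:
            writer.writerow(HEADER)
        writer.writerow(row)

# Demo in a temporary directory so no real output file is touched
demo_path = os.path.join(tempfile.mkdtemp(), 'annual_report.csv')
save_with_header(demo_path, ['公司甲', '2019年', '1,234,567.89', '987,654.32'])
save_with_header(demo_path, ['公司乙', '2019年', '2,000.00', '1,000.00'])

with open(demo_path, newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))
print(rows)
```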
5. Run the code
    if __name__ == '__main__':
        # Keep only PDF files; removing '.idea', 'pdf6.py' and 'annual_report.csv'
        # by name raises ValueError whenever one of them is absent
        file_list = [f for f in os.listdir() if f.lower().endswith('.pdf')]
        file_list_copy = file_list[::]
        for file in file_list_copy:
            # Company name and report year are parsed from the file name
            name = re.findall(r'\d+(.*?):', file)[0]
            date = re.findall(r'(\d+年)年度報告', file)[0]
            pages_num = get_pages(file)
            # Extract once instead of parsing the same PDF twice
            result = get_text(file, pages_num)
            if result is not None:
                try:
                    end, start = result
                    save(name, date, end, start)
                    file_list.remove(file)
                except Exception as e:
                    print(e)
        # Record the files for which nothing was extracted
        with open('rest.txt', 'a', encoding='utf-8') as f:
            f.write(json.dumps(file_list, ensure_ascii=False))
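The two file-name patterns in the main block index `[0]` directly, so a file whose name does not follow the expected layout stops the run with an IndexError. A minimal sketch of a more forgiving parser follows; the helper `parse_filename` and the example file name are hypothetical, assuming names of the form stock code, company name, a full-width colon, then the year and 年度報告.

```python
import re

def parse_filename(filename):
    """Return (company_name, report_year) or None if the name does not match."""
    # Same patterns as the main block: name sits between the leading digits
    # and the full-width colon; the year precedes 年度報告
    name = re.search(r'\d+(.*?):', filename)
    date = re.search(r'(\d+年)年度報告', filename)
    if name is None or date is None:
        return None
    return name.group(1), date.group(1)

# Hypothetical file name in the assumed layout
print(parse_filename('600000浦發銀行:2019年年度報告.pdf'))
# A name that does not fit the layout is reported as None instead of raising
print(parse_filename('notes.pdf'))
```

In the main loop, a `None` result could be used to skip the file and leave it in the list written to rest.txt.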
