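A few days ago I helped a friend solve a technical problem: on Linux, read the contents of a Word document, match them with regular expressions, and assemble the results into SQL statements to load into a database.
After going through foreign-language references and Google, I ended up with the following steps: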
#wget http://www.winfield.demon.nl/linux/antiword-0.37.tar.gz
#tar zxvf antiword-0.37.tar.gz
#cd antiword-0.37
#make
#make install
antiword
cp /root/bin/*antiword /usr/local/bin/
mkdir /usr/share/antiword
cp -R /root/.antiword/* /usr/share/antiword/
chmod 777 /usr/local/bin/*antiword
chmod 755 /usr/share/antiword/*
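Once the installation is finished, if you want to use antiword from a web page, you need to run make global_install as root, so that the binary and its mapping files are available to users other than root (such as the web server account).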
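To confirm that the installation is usable by a non-root process, a small check along the following lines may help. This is only a sketch: the helper name `check_antiword_install` is made up for illustration, and its defaults simply mirror the paths used in the steps above (`antiword` on the PATH, mapping files under `/usr/share/antiword`).

```python
import os
import shutil


def check_antiword_install(binary="antiword", share_dir="/usr/share/antiword"):
    """Report whether the antiword binary and its mapping files are reachable.

    Hypothetical helper; the defaults mirror the paths used in this post.
    """
    # shutil.which returns None when the binary is not on the PATH
    binary_path = shutil.which(binary)

    # The mapping files (for example UTF-8.txt) must be readable by this user
    readable = os.path.isdir(share_dir) and os.access(share_dir, os.R_OK)
    mapping_files = []
    if readable:
        mapping_files = sorted(
            name for name in os.listdir(share_dir) if name.endswith(".txt")
        )

    return {
        "binary_path": binary_path,
        "share_dir_readable": readable,
        "mapping_files": mapping_files,
    }


if __name__ == "__main__":
    for key, value in check_antiword_install().items():
        print(key, "=", value)
```

Running it as the same account the web server uses shows what that account can actually see, which is the case that matters for the web page.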
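<?php
header("Content-type: text/html; charset=utf-8");
$filename = 'test.doc';
// Without a character mapping, the output encoding follows the locale default:
#$content = shell_exec('antiword '.$filename);
// -m selects the UTF-8 character mapping so the text matches the page charset;
// escapeshellarg() keeps unusual file names from breaking the shell command
$content = shell_exec('antiword -mUTF-8 '.escapeshellarg($filename));
echo '<pre>';
print_r($content);
echo '</pre>';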
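The same call can be made from Python, which is convenient if the regex and SQL steps are written there too. The following is a minimal sketch, not a definitive implementation: `build_antiword_command` and `doc_to_text` are names invented here, and `UTF-8.txt` is assumed to be the UTF-8 mapping file that ships with antiword (the PHP version passes it as `-mUTF-8`).

```python
import subprocess


def build_antiword_command(doc_path, mapping="UTF-8.txt"):
    """Build the argument list for antiword; -m selects the character mapping file."""
    # A list of arguments is passed directly to the program, so no shell
    # quoting is needed even when the file name contains spaces.
    return ["antiword", "-m", mapping, doc_path]


def doc_to_text(doc_path):
    """Run antiword on a .doc file and return its plain-text output as str."""
    result = subprocess.run(
        build_antiword_command(doc_path),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if antiword exits with an error
    )
    return result.stdout.decode("utf-8")
```

With antiword installed, `doc_to_text("test.doc")` would return the document text as a string that can be handed to the matching step.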
# coding=utf-8
# Usage: python <script_name> <docxFilePath>
# Install the extension library first: pip install python-docx
# Note: python-docx reads .docx files, not the older .doc format
import os
import sys

from docx import Document

# Get the name of the current script
script_name = os.path.basename(sys.argv[0])
usage = "\n Usage: python " + script_name + " <docxFilePath>"

if len(sys.argv) != 2:
    print("Warning:\n no docx file path was given" + usage)
    sys.exit(1)

docx_path = sys.argv[1]
if not os.path.exists(docx_path):
    print("Warning:\n docx file does not exist" + usage)
    sys.exit(1)

# Open the document
document = Document(docx_path)

# Read the text of every paragraph
paragraphs = [paragraph.text for paragraph in document.paragraphs]

# Print the result for inspection; the text can also be processed by other means
for text in paragraphs:
    print(text)

# Read the table contents and print them, one row per line with tab-separated cells
for table in document.tables:
    for row in table.rows:
        print("\t".join(cell.text for cell in row.cells))
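The opening of this post mentions matching the extracted text with regular expressions and loading the result into a database, but that step is not shown. Below is a rough sketch of how it might look. Everything specific in it is an assumption made for illustration: the `Name: ... Phone: ...` line format, the `contacts` table, the function names, and the use of an in-memory SQLite database as a stand-in for whatever database is really used. It also uses placeholders instead of concatenating values into the SQL string, which avoids quoting problems.

```python
import re
import sqlite3

# Hypothetical line format; the real pattern depends on the document's layout.
RECORD_PATTERN = re.compile(r"Name:\s*(?P<name>.+?)\s+Phone:\s*(?P<phone>\d+)")


def extract_records(text):
    """Return (name, phone) tuples for every line that matches RECORD_PATTERN."""
    records = []
    for line in text.splitlines():
        match = RECORD_PATTERN.search(line)
        if match:
            records.append((match.group("name"), match.group("phone")))
    return records


def store_records(connection, records):
    """Insert the records using placeholders, so values are never spliced into the SQL."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS contacts (name TEXT, phone TEXT)"
    )
    connection.executemany(
        "INSERT INTO contacts (name, phone) VALUES (?, ?)", records
    )
    connection.commit()


# Stand-in for the text returned by antiword or python-docx
sample_text = (
    "Contact list\n"
    "Name: Zhang San  Phone: 13800000000\n"
    "Name: Li Si Phone: 13900000000\n"
)

connection = sqlite3.connect(":memory:")
store_records(connection, extract_records(sample_text))
rows = connection.execute(
    "SELECT name, phone FROM contacts ORDER BY name"
).fetchall()
print(rows)
```

Lines that do not match the pattern, such as the heading in the sample, are simply skipped, so the pattern can be tightened or loosened without touching the database code.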