Extracting text from PDF files: PDF Parser and XPDF in practice

Reposted article, originally published 2015-12-20 17:06. Tags: xpdf, pdf, parser, PHP

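A web search turns up many packages for extracting text from PDF files, and PHP alone has several. These are the three libraries I have come across in practice: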

1. PDFLIB TET http://www.pdflib.com/en/download/tet/

2. PDF Parser http://www.pdfparser.org/

3. XPDF http://www.foolabs.com/xpdf/

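My first impression favored PDFLIB TET, because it also offers features such as image extraction. It is a commercial library, however, so all I could do was stare at its 200-plus pages of English documentation. I am still keen to learn, so pointers from anyone who has used it are welcome!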

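This article uses PDF Parser and XPDF to extract text from PDF files.

Test environment:

Alibaba Cloud + Ubuntu 12.04 + Apache 2 + PHP 5.3.10 + MySQL 5.6 (the project as a whole is built on the ThinkPHP framework; this feature is only one part of it)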

PDF Parser

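Preparation:

Download the project source, pdfparser-master.zip, from the official site listed above;

Unpack the archive and copy the Smalot folder under src (it holds the project's core source) into ThinkPHP/Library, the folder where the ThinkPHP framework keeps third-party libraries;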

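Rename the source files, for example page.php to page.class.php (the .class.php suffix is what ThinkPHP's class autoloader looks for by default; it is a framework convention rather than a PHP-wide one);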

Experiment:

Write a class that calls the library above. The code:

<?php
namespace Admin\Controller;
use Think\Controller;

class PdfParseController extends Controller {
  // Parse a PDF file and output its text
  public function parse(){
    // Read the file location from the request parameter
    $path = isset($_GET['path']) ? trim($_GET['path']) : '';
    if ($path === '') {
        echo 'Missing path parameter';
        return;
    }
    // Create the library's Parser object
    $parser = new \Smalot\PdfParser\Parser();
    // Parse the file at the given location; the result is a Document object
    $document = $parser->parseFile($path);
    // Get all pages
    $pages = $document->getPages();
    // Extract the text page by page
    foreach($pages as $page){
        echo($page->getText());
    }
  }
}
?>

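In this project the front end calls the parse() method of the class above. Because of network latency and similar issues, the call is made asynchronously with AJAX so that the UI stays responsive.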

// JS file: parse() is called when the button on the page is clicked
var xmlHttp = null;

function parse(){
    var path = document.getElementById("pdffile").value; // file location entered by the user

    // Request URL; encodeURIComponent keeps characters such as & and ? in the PDF address intact
    var url = "http://***.***.***.***/***/***/PdfParse/parse?path=" + encodeURIComponent(path);

    request(url, function(result){
        // Callback: show the extracted text
        document.getElementsByName("context")[0].value = result;
    });
}

function request(url, onsuccess){
    // Get an XMLHttpRequest object for the asynchronous request
    if (window.XMLHttpRequest) {
        xmlHttp = new XMLHttpRequest();
    } else if (window.ActiveXObject) {
        xmlHttp = new ActiveXObject("Microsoft.XMLHTTP");
    } else {
        alert("Browser does not support HTTP Request");
        return; // nothing to send the request with
    }

    xmlHttp.onreadystatechange = function(){
        if (xmlHttp.readyState == 4) {
            if (xmlHttp.status == 200) {
                // Request succeeded
                onsuccess(xmlHttp.responseText);
            }
        }
    };
    xmlHttp.open("GET", url, true);
    xmlHttp.send();
}
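
The path parameter in this request is itself a URL, so it is worth seeing what happens when such a value is appended to a query string without encoding. The sketch below is only an illustration: the host example.com, the sample PDF address, and the helper names buildRequestUrl and receivedPath are made up, and the URL class merely stands in for whatever query-string parsing the server does.

```javascript
// Hypothetical endpoint; the real host and path are masked in the article.
var ENDPOINT = "http://example.com/PdfParse/parse";

// Build the request URL, optionally encoding the PDF address first.
function buildRequestUrl(pdfUrl, encode) {
    return ENDPOINT + "?path=" + (encode ? encodeURIComponent(pdfUrl) : pdfUrl);
}

// Read back the path parameter the way a query-string parser would see it.
function receivedPath(requestUrl) {
    return new URL(requestUrl).searchParams.get("path");
}

// A made-up PDF address that carries its own query string.
var pdfUrl = "http://example.com/files/report.pdf?id=7&lang=zh";

var rawPath = receivedPath(buildRequestUrl(pdfUrl, false));
var encodedPath = receivedPath(buildRequestUrl(pdfUrl, true));

console.log(rawPath);      // everything from the & onward is lost
console.log(encodedPath);  // the full address survives
```

Without encoding, the `&lang=zh` part is read as a second parameter of the request rather than as part of the PDF address, which is why passing the value through encodeURIComponent is the safer habit.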

<!-- Page markup -->
<body>
  <table>
    <tr>
        <td>Parse document:</td>
        <td>
            <select id="pathtype" name="pathtype" style="width:60px;">
                <option value="url">URL</option>
            </select>
            <input type="text" id="pdffile" name="pdffile" style="width:500px">
        </td>
        <td colspan="10" >
            <input type="button" class="input_button" name="parse" value="Parse" onclick="parse()" />
        </td>
    </tr>
  </table>
  <!-- Output area filled in by the script; not shown in the original listing, assumed here -->
  <textarea name="context" rows="20" cols="100"></textarea>
</body>

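Test URL: http://www.cffex.com.cn/tzgg/jysgg/201512/W020151204630497494614.pdf

Pros: it can parse a PDF directly from a web address, with no separate download step;

Cons: part of the extracted text comes out garbled; image extraction is not supported.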

XPDF

Preparation:

Download xpdfbin-linux-3.04.tar.gz and xpdf-chinese-simplified.tar.gz from the official site listed above;

Install xpdf-3.04 into the target directory (/usr/local in this case):

    tar zxvf xpdfbin-linux-3.04.tar.gz -C /usr/local   # unpack into the install directory
    cd /usr/local/xpdfbin-linux-3.04                   # enter the unpacked folder
    cat INSTALL                                        # read the install notes
    cd bin32/                                          # use bin64/ on a 64-bit system
    cp ./* /usr/local/bin/
    cd ../doc/
    mkdir -p /usr/local/man/man1
    mkdir -p /usr/local/man/man5
    cp *.1 /usr/local/man/man1
    cp *.5 /usr/local/man/man5

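The extraction tools are now installed and can be run from the shell on English documents. Other languages need an additional language support package. The steps for Simplified Chinese are as follows (run from the doc/ directory entered above):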

    cp sample-xpdfrc /usr/local/etc/xpdfrc
    tar zxvf xpdf-chinese-simplified.tar.gz -C /usr/local
    cd /usr/local/xpdf-chinese-simplified
    mkdir -p /usr/local/share/xpdf/chinese-simplified
    cp -r Adobe-GB1.cidToUnicode ISO-2022-CN.unicodeMap EUC-CN.unicodeMap GBK.unicodeMap CMap /usr/local/share/xpdf/chinese-simplified/

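Copy the contents of the add-to-xpdfrc file in the unpacked xpdf-chinese-simplified folder into /usr/local/etc/xpdfrc. The paths listed in that file should point at /usr/local/share/xpdf/chinese-simplified, where the files were just copied; adjust them if they do not.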

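Running it from the shell (W020151204630497494614.pdf has already been downloaded into the current directory):

pdftotext W020151204630497494614.pdf   # Chinese encoding not selected; the output is garbled

pdftotext -layout -enc GBK W020151204630497494614.pdf   # no garbled text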

Experiment:

Write a class that runs the command above. The code:

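<?php
namespace Admin\Controller;
use Think\Controller;

class PdfParseController extends Controller {
  public function pdftotxt(){
    // Read the file location from the request parameter
    $path = isset($_GET['path']) ? $_GET['path'] : '';
    // Download the file
    $file_name = $this->download($path);
    if ($file_name === false) {
        echo 'Download failed';
        return;
    }
    // Extract the text; escapeshellarg() keeps the file name from being interpreted by the shell
    $content = shell_exec('/usr/local/bin/pdftotext -layout -enc GBK '.escapeshellarg($file_name).' -');
    // Convert the output from GBK to UTF-8
    $content = mb_convert_encoding($content, 'UTF-8', 'GBK');
    // Delete the downloaded file
    unlink($file_name);
    echo($content);
  }

  // Download the file into the current directory.
  // Returns the local file name, or false on failure.
  private function download($file_url){
    // Reject a missing or empty parameter
    if(!isset($file_url) || trim($file_url) == ''){
        return false;
    }

    // File name part of the URL, including the extension
    $file_name = basename($file_url);

    $content = file_get_contents($file_url);
    if ($content === false) {
        return false;
    }
    file_put_contents($file_name, $content);

    return $file_name;
  }
}
?>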

As before, the front end calls the pdftotxt() method of the class above with an asynchronous request.

var xmlHttp = null;

function pdftotxt(){
    var path = document.getElementById("pdffile").value; // file location entered by the user

    // Request URL; encodeURIComponent keeps characters such as & and ? in the PDF address intact
    var url = "http://***.***.***.***/***/***/PdfParse/pdftotxt?path=" + encodeURIComponent(path);

    request(url, function(result){
        // Callback: show the extracted text
        document.getElementsByName("context")[0].value = result;
    });
}

function request(url, onsuccess){
    // Get an XMLHttpRequest object for the asynchronous request
    if (window.XMLHttpRequest) {
        xmlHttp = new XMLHttpRequest();
    } else if (window.ActiveXObject) {
        xmlHttp = new ActiveXObject("Microsoft.XMLHTTP");
    } else {
        alert("Browser does not support HTTP Request");
        return; // nothing to send the request with
    }

    xmlHttp.onreadystatechange = function(){
        if (xmlHttp.readyState == 4) {
            if (xmlHttp.status == 200) {
                // Request succeeded
                onsuccess(xmlHttp.responseText);
            }
        }
    };
    xmlHttp.open("GET", url, true);
    xmlHttp.send();
}
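
Because the server downloads whatever address it is given and then hands the file to a shell command, it can help to reject obviously unsuitable input in the browser before the request is sent. The following is a minimal sketch under stated assumptions: looksLikePdfUrl is a made-up helper name, it relies on the URL constructor being available in the browser, and it is a convenience check only, not a replacement for validating the parameter on the server.

```javascript
// Return true when the input looks like an http(s) address of a .pdf file.
// Hypothetical helper for illustration; the server must still validate input.
function looksLikePdfUrl(input) {
    var parsed;
    try {
        parsed = new URL(input.trim());
    } catch (e) {
        return false; // not an absolute URL at all
    }
    var isHttp = parsed.protocol === "http:" || parsed.protocol === "https:";
    var isPdf = /\.pdf$/i.test(parsed.pathname);
    return isHttp && isPdf;
}

console.log(looksLikePdfUrl("http://www.cffex.com.cn/tzgg/jysgg/201512/W020151204630497494614.pdf"));
console.log(looksLikePdfUrl("file:///etc/passwd"));
console.log(looksLikePdfUrl("report.pdf"));
```

In pdftotxt(), one could call looksLikePdfUrl(path) before request(...) and show a message instead of sending the request when it returns false.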

<body>
  <table>
    <tr>
        <td>Parse document:</td>
        <td>
            <select id="pathtype" name="pathtype" style="width:60px;">
                <option value="url">URL</option>
            </select>
            <input type="text" id="pdffile" name="pdffile" style="width:500px">
        </td>
        <td colspan="10" >
            <input type="button" class="input_button" name="parse" value="Parse" onclick="parse()" />
        </td>
        <td colspan="10" >
            <input type="button" class="input_button" name="exchange" value="Convert" onclick="pdftotxt()" />
        </td>
    </tr>
  </table>
  <!-- Output area filled in by the script; not shown in the original listing, assumed here -->
  <textarea name="context" rows="20" cols="100"></textarea>
</body>

Test URL: http://www.cffex.com.cn/tzgg/jysgg/201512/W020151204630497494614.pdf

Pros: no garbled text.

If you spot any problems, corrections are welcome!

Here's to friendly discussion and spreading positive energy!
