Processing PDFs in Java with PDFBox


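At first I assumed that reading a PDF in Java would be as simple as reading a TXT file. Too young, too simple! All I got was garbled text.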

Game Starts

Reference documentation

  1) http://pdfbox.apache.org/cookbook/documentcreation.html

Required JARs

  1) pdfbox-app-1.8.6.jar http://pdfbox.apache.org/downloads.html#recent

What's Up

How does Lucene index a PDF? Does the PDF have to be converted to TXT first?

Lucene Integration

Document luceneDocument = LucenePDFDocument.getDocument( ... );

Always Be Coding

Create a blank PDF

This small sample shows how to create a new PDF document using PDFBox.

// Create a new empty document
PDDocument document = new PDDocument();

// Create a new blank page and add it to the document
PDPage blankPage = new PDPage();
document.addPage( blankPage );

// Save the newly created document
document.save("BlankPage.pdf");

// finally make sure that the document is properly
// closed.
document.close();

Hello World using a PDF base font

This small sample shows how to create a new document and print the text "Hello World" using one of the PDF base fonts.

// Create a document and add a page to it
PDDocument document = new PDDocument();
PDPage page = new PDPage();
document.addPage( page );

// Create a new font object selecting one of the PDF base fonts
PDFont font = PDType1Font.HELVETICA_BOLD;

// Start a new content stream which will "hold" the to be created content
PDPageContentStream contentStream = new PDPageContentStream(document, page);

// Define a text content stream using the selected font, moving the cursor and drawing the text "Hello World"
contentStream.beginText();
contentStream.setFont( font, 12 );
contentStream.moveTextPositionByAmount( 100, 700 ); // note the coordinates: (0,0) is the bottom-left corner of the page
contentStream.drawString( "Hello World" );
contentStream.endText();

// Make sure that the content stream is closed:
contentStream.close();

// Save the results and ensure that the document is properly closed:
document.save( "Hello World.pdf");
document.close();
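
Because the origin sits at the bottom-left corner, a y value of 700 places the text near the top of the page, not the bottom. If you prefer to think in "distance from the top edge", a small conversion helps. The sketch below is my own illustration and not part of PDFBox: the class and method names are hypothetical, and the page height of 792 points assumes a US Letter page (612 x 792 points).

```java
// Hypothetical helper, not a PDFBox class: converts a distance measured
// from the top edge of the page into a PDF y coordinate (origin at bottom-left).
class PdfCoordinates {
    // Assumed page height: US Letter, in PDF points (1 point = 1/72 inch)
    static final float LETTER_HEIGHT = 792f;

    static float yFromTop(float pageHeight, float distanceFromTop) {
        return pageHeight - distanceFromTop;
    }

    public static void main(String[] args) {
        // 92 points below the top edge of a Letter page
        System.out.println(yFromTop(LETTER_HEIGHT, 92f)); // prints 700.0
    }
}
```

With that assumption, the 700 used in the Hello World sample corresponds to a baseline 92 points below the top edge.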

Read PDF

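The code below is my own attempt, based on code I found online; the official site has no concrete example for this.
The whole process boils down to loading the Document (the PDF) and writing its text to a TXT file through an I/O stream.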
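
When the extracted text contains Chinese, the encoding of the output file matters too. java.io.FileWriter, which the listing in this section uses, encodes with the platform default charset (on Java versions before 18), so the same program can produce different bytes on different machines. A minimal sketch of opening the output with an explicit UTF-8 encoding follows; the class and method names are hypothetical, and only standard library classes are used.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper: a Writer that always encodes as UTF-8,
// independent of the platform default charset.
class Utf8TextOutput {
    static Writer openUtf8Writer(File target) throws IOException {
        return new OutputStreamWriter(new FileOutputStream(target), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        File out = File.createTempFile("pdfbox-demo", ".txt");
        out.deleteOnExit();
        String sample = "\u4e2d\u6587"; // two Chinese characters, written as escapes
        try (Writer writer = openUtf8Writer(out)) {
            // In the PDF program, this writer would be the one handed to writeText(...)
            writer.write(sample);
        }
        // Read the bytes back as UTF-8 and check that the text survived the round trip
        String roundTrip = new String(Files.readAllBytes(out.toPath()), StandardCharsets.UTF_8);
        System.out.println(roundTrip.equals(sample));
    }
}
```

Since PDFTextStripper.writeText accepts any java.io.Writer, a writer opened this way can be used wherever the FileWriter is used.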

package tools;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PDFHandler {
    public static void readPDF(String pdfFile) {
        String txtFile = null;
        PDDocument doc = null;
        FileWriter writer = null;
        URL url = null;
        try {
            url = new URL(pdfFile);
        } catch (MalformedURLException e) {
            // Not a valid URL, so treat the argument as a file-system path
            url = null;
        }

        try {
            if (url != null) { // URL handling
                String fileName = url.getFile();
                if (!fileName.endsWith(".pdf")) {
                    return; // check the suffix before loading, so no open document is leaked
                }
                // Name of the new file: last path segment with .txt instead of .pdf
                txtFile = new File(fileName.replace(".pdf", ".txt")).getName();
                doc = PDDocument.load(url); // load the document
            } else { // file-system handling
                if (!pdfFile.endsWith(".pdf")) {
                    return;
                }
                txtFile = pdfFile.replace(".pdf", ".txt");
                doc = PDDocument.load(pdfFile);
            }
            writer = new FileWriter(txtFile);
            PDFTextStripper textStripper = new PDFTextStripper(); // does the PDF-to-text extraction
            // Per the official docs, true sorts the text by its position on the page instead of
            // following content-stream order; false is the default and is cheaper
            textStripper.setSortByPosition(false);
            textStripper.setStartPage(1); // first page to extract; defaults to the first page
            textStripper.setEndPage(2); // last page to extract; defaults to the last page
            textStripper.writeText(doc, writer); // the key step: write the text to the TXT file
        } catch (IOException e) {
            // Also covers FileNotFoundException, which is a subclass of IOException
            e.printStackTrace();
        } finally {
            if (doc != null) {
                try {
                    doc.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (writer != null) {
                try {
                    writer.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    public static void main(String[] args) {
        readPDF("resource/正則表達式.pdf");
    }
}
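
One detail in the listing is worth a second look: String.replace swaps every occurrence of ".pdf", not just the extension, so a name such as a.pdf.backup.pdf would become a.txt.backup.txt. Below is a rough sketch of a suffix-only alternative. The class and method names are hypothetical, and accepting an uppercase ".PDF" is my own choice rather than something the listing does.

```java
// Hypothetical helper: swaps only a trailing ".pdf" (any letter case) for ".txt".
class TxtNames {
    // Returns the .txt name, or null when the input does not end in ".pdf"
    static String toTxtName(String pdfName) {
        int start = pdfName.length() - 4;
        // regionMatches returns false when start is negative, so short names are safe
        if (!pdfName.regionMatches(true, start, ".pdf", 0, 4)) {
            return null;
        }
        return pdfName.substring(0, start) + ".txt";
    }

    public static void main(String[] args) {
        System.out.println("a.pdf.backup.pdf".replace(".pdf", ".txt")); // prints a.txt.backup.txt
        System.out.println(toTxtName("a.pdf.backup.pdf"));              // prints a.pdf.backup.txt
    }
}
```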

 TO BE CONTINUED……

