Reading PDF text and converting PDF to HTML in Java


Addendum: the code below is Maven-based; the JAR files it depends on have also been exported separately.

Download: pdf jar

 

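Full source code (only two files)

Reading the plain text of a PDF in Java, using the PDFBox toolkit

Add the following dependencies to the Maven configuration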

    <dependency>
        <groupId>net.sf.cssbox</groupId>
        <artifactId>pdf2dom</artifactId>
        <version>1.7</version>
    </dependency>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>2.0.12</version>
    </dependency>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox-tools</artifactId>
        <version>2.0.12</version>
    </dependency>

Read the text directly with the utility class

Code example

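    import java.io.IOException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;
    import org.junit.Test;

    /* Read the text of a PDF */
    @Test
    public void readPdfTextTest() throws IOException {
        byte[] bytes = getBytes("D:\\code\\pdf\\HashMap.pdf");
        // Load the PDF document; try-with-resources closes it afterwards
        try (PDDocument document = PDDocument.load(bytes)) {
            readText(document);
        }
    }

    public void readText(PDDocument document) throws IOException {
        PDFTextStripper stripper = new PDFTextStripper();
        String text = stripper.getText(document);
        System.out.println(text);
    }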

Converting a PDF to HTML

(Result screenshot)

Code example

    // PDFDomTree comes from pdf2dom: import org.fit.pdfdom.PDFDomTree;

    /* Convert a PDF to HTML */
    @Test
    public void pdfToHtmlTest() {
        String outputPath = "D:\\code\\pdf\\HashMap.html";
        byte[] bytes = getBytes("D:\\code\\pdf\\HashMap.pdf");
        // Resources declared inside try (...) are closed automatically
        try (BufferedWriter out = new BufferedWriter(
                     new OutputStreamWriter(new FileOutputStream(new File(outputPath)), "UTF-8"));
             PDDocument document = PDDocument.load(bytes)) { // load the PDF document
            PDFDomTree pdfDomTree = new PDFDomTree();
            pdfDomTree.writeText(document, out);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /* Read a file into a byte array (returns null if the file cannot be read) */
    private byte[] getBytes(String filePath) {
        byte[] buffer = null;
        // try-with-resources closes both streams even when reading fails
        try (FileInputStream fis = new FileInputStream(new File(filePath));
             ByteArrayOutputStream bos = new ByteArrayOutputStream(1000)) {
            byte[] b = new byte[1000];
            int n;
            while ((n = fis.read(b)) != -1) {
                bos.write(b, 0, n);
            }
            buffer = bos.toByteArray();
        } catch (IOException e) { // also covers FileNotFoundException
            e.printStackTrace();
        }
        return buffer;
    }
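
As a side note, on Java 7 and later the hand-written read loop in getBytes can usually be replaced by java.nio.file.Files.readAllBytes. The following is only a minimal sketch, not the original author's code: the class name FileBytes and the method name readFile are made up for illustration. Unlike getBytes, it lets the IOException propagate instead of returning null.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the original post.
final class FileBytes {
    private FileBytes() {}

    /** Reads the whole file into memory; throws IOException if it cannot be read. */
    static byte[] readFile(String filePath) throws IOException {
        Path path = Paths.get(filePath);
        return Files.readAllBytes(path);
    }
}
```

Letting the exception propagate means the caller sees the real cause (for example a missing file) rather than a later NullPointerException when the null array is passed to PDDocument.load.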

A complete feature that uploads a PDF and converts it to HTML (no more hunting for third-party tools to convert PDFs, haha)

    @RequestMapping("ud")
    @Controller
    public class UpAndDownController {

        @RequestMapping("upload.do")
        @ResponseBody
        public Map<String, Object> upload(@RequestParam("file") MultipartFile file, HttpServletRequest request) {
            Map<String, Object> map = new HashMap<>();
            map.put("code", "200");
            try {
                // PdfConvertUtil is the author's own utility class; its source is not shown in this post
                PdfConvertUtil pdfConvertUtil = new PdfConvertUtil();
                String pdfName = file.getOriginalFilename();
                int lastIndex = pdfName.lastIndexOf(".pdf");
                String fileName = pdfName.substring(0, lastIndex);
                String htmlName = fileName + ".html";
                String realPath = ResourceUtils.getURL("classpath:").getPath() + "/templates/file";
                File f = new File(realPath);
                if (!f.exists()) {
                    f.mkdirs();
                }
                // File.separator instead of a hard-coded "\\" so the path also works outside Windows
                String htmlPath = realPath + File.separator + htmlName;
                pdfConvertUtil.pdftohtml(file.getBytes(), htmlPath);
            } catch (Exception e) {
                map.put("code", "500");
                e.printStackTrace();
            }
            return map;
        }
    }
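
The controller derives the output name with lastIndexOf(".pdf"), which throws for a name such as REPORT.PDF (the index is -1) and keeps whatever path prefix the client sends. A slightly more defensive variant could look like the sketch below. HtmlNames and toHtmlName are hypothetical names, and rejecting anything that does not end in .pdf is an assumption, not the original behaviour.

```java
// Hypothetical helper, not part of the original post.
final class HtmlNames {
    private static final String PDF_EXT = ".pdf";

    private HtmlNames() {}

    /** Maps an uploaded file name such as "HashMap.pdf" to "HashMap.html". */
    static String toHtmlName(String originalFilename) {
        if (originalFilename == null) {
            throw new IllegalArgumentException("missing file name");
        }
        // Some clients send a full path; keep only the last path segment.
        String name = originalFilename.replace('\\', '/');
        name = name.substring(name.lastIndexOf('/') + 1);

        int extStart = name.length() - PDF_EXT.length();
        // Case-insensitive check that the name ends in ".pdf" and has a non-empty base name.
        boolean isPdf = extStart > 0
                && name.regionMatches(true, extStart, PDF_EXT, 0, PDF_EXT.length());
        if (!isPdf) {
            throw new IllegalArgumentException("not a PDF file name: " + originalFilename);
        }
        return name.substring(0, extStart) + ".html";
    }
}
```

In the controller, the three lines that compute htmlName could then be replaced by a single call such as HtmlNames.toHtmlName(file.getOriginalFilename()).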

You can test the endpoint with Postman.

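In the Body tab, select form-data, add a key named file, set its type to File, and choose the PDF. Postman then sends the request as multipart/form-data with a matching boundary, which is what a MultipartFile parameter requires. The Content-Type header should therefore not be set by hand to application/x-www-form-urlencoded, since that content type cannot carry a file upload.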
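
To make it clearer what Postman sends when form-data is selected, here is a rough sketch that builds a multipart/form-data body by hand using only the JDK. The class MultipartSketch, its method names, and the boundary value passed in are illustrative assumptions; the part name file matches @RequestParam("file") in the controller above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original post.
final class MultipartSketch {
    private MultipartSketch() {}

    /** Value for the Content-Type request header; the boundary must match the body. */
    static String contentType(String boundary) {
        return "multipart/form-data; boundary=" + boundary;
    }

    /** Builds a request body containing a single file part named "file". */
    static byte[] body(String boundary, String filename, byte[] pdfBytes) {
        // Part headers, followed by an empty line, then the raw file bytes.
        String head = "--" + boundary + "\r\n"
                + "Content-Disposition: form-data; name=\"file\"; filename=\"" + filename + "\"\r\n"
                + "Content-Type: application/pdf\r\n"
                + "\r\n";
        // Closing delimiter: the boundary with two extra trailing dashes.
        String tail = "\r\n--" + boundary + "--\r\n";

        byte[] headBytes = head.getBytes(StandardCharsets.UTF_8);
        byte[] tailBytes = tail.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(headBytes, 0, headBytes.length);
        out.write(pdfBytes, 0, pdfBytes.length);
        out.write(tailBytes, 0, tailBytes.length);
        return out.toByteArray();
    }
}
```

On Java 11 or later these bytes could be sent with java.net.http.HttpClient, using HttpRequest.BodyPublishers.ofByteArray for the body and contentType(boundary) as the Content-Type header. The target URL depends on how the application is deployed.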

 

If you need to load a PDF directly in an HTML page without a browser plugin, the following references may help:

https://www.cnblogs.com/jacksoft/p/5302587.html

https://github.com/mozilla/pdf.js

 

