Parsing Office (Word, Excel, PowerPoint) and PDF files in Java: an implementation approach and notes from development

Reposted article, dated 2017-07-29 00:33


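Before anything else, let me share what I went through before writing this article. If I had to boil it all down to one word, it would be "pit"; in two words, "giant pit". The requirement did not start out the way it ended up, so let me tell the story from the beginning:

At first, the client and we agreed that Office and PDF files would be uploaded and parsed into HTML, and that the app would call its built-in server to "play" them directly as HTML.

One month went by, then two, then three...

When the requirement reached development, it turned out to be a pit: according to the requirements spec, the whole thing was to be built as a single feature. Technical difficulty aside, the estimated hours made it very hard to deliver what the spec asked for (too many defects!).

Then one week, another week, and yet another week...

After trying all kinds of approaches we got the feature into a usable state, and then at requirement sign-off the client said: "We never asked you to parse these documents. We only asked that they be uploaded as source files, and that tapping one in the app lets the user choose a third-party application to open it. That was our requirement from the very start."

/** Hearing that, I burst into tears (ಥ _ ಥ). If the business side had confirmed this at the beginning, we would not have wasted so much time and energy going the long way around... */

The requirement went full circle. Having lived through it, here is my summary of the endless pits in it: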

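A> The open-source community has plenty of demos, but they have many defects: for example, WordArt, images, formulas, color styles, video and audio inside Office files cannot be parsed.

B> What can be parsed often does not come out well: Word and PPT layouts get scrambled, custom cell formats in Excel all turn into plain numbers, and so on.

C> Open-source documentation is incomplete, so each document type needs its own parsing approach (docx4j for Word, POI for Excel), which makes for a huge amount of code.

D> Because the parsing results were poor, the revised plan required preprocessing the source files before upload: PDFs cut into images, PPTs converted to video or images. That turned the feature into a semi-automatic one ╥﹏╥...

E> A big problem with parsing Word via docx4j is speed: files over 5 MB, or documents with complex content, take a very long time to parse. On top of that, POI parsing of large Excel files (say, more than 1000 rows) easily runs out of memory and is hard to control.

F> The schedule was too short, only 15 days... Overtime on top of overtime (⊙︿⊙). Boss, give us a raise!!! ε=怒ε=怒ε=怒ε=怒ε=( o｀ω′)ノ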

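Rant over. Time to show the final result.

The four screenshots above show, from left to right, the parsed HTML for pdf, ppt, word and excel. Because of the development agreement, parts of images 1 and 2 are blurred, and since this is only a browser emulating a phone, the rendering looks rough. My apologies for that.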

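Here is my final approach:
A> Word documents come in two formats, doc (Word 2003) and docx (Word 2007). Since doc is on its way out, and to keep the one-step docx4j implementation, doc files are not supported.

B> As with Word, the legacy Excel format is not converted. The solution is based on a third-party demo, and the technology involved is Apache POI (my original note said poi.hssf; in POI, HSSF covers .xls and XSSF covers .xlsx, and the code further down goes through WorkbookFactory).

C> PowerPoint (ppt) has many embedded objects. To protect the user experience, my solution is to export the ppt directly to mp4 or to images (packed as a zip) for upload, then wrap the result in HTML with code.

D> For pdf there was likewise no good demo for converting to HTML, so, as with ppt, it is converted to images with software, zipped and uploaded, then wrapped in HTML with code.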
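
The plan above amounts to choosing a handler by upload type. A minimal sketch of that routing follows, under the assumption that the choice is made from the file extension; the names UploadKind and UploadRouter are hypothetical and do not come from the project.

```java
import java.util.Locale;

/** Hypothetical upload kinds, one per branch of the plan above. */
enum UploadKind {
    WORD_DOCX,   // parsed with docx4j
    EXCEL_XLSX,  // parsed with POI
    VIDEO_MP4,   // PowerPoint exported as a video
    IMAGE_ZIP,   // PowerPoint or PDF exported as numbered images
    UNSUPPORTED  // .doc, .xls and everything else
}

final class UploadRouter {
    private UploadRouter() {}

    /** Picks a handler kind from the file extension, ignoring case. */
    static UploadKind kindOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return UploadKind.UNSUPPORTED; // no extension at all
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "docx": return UploadKind.WORD_DOCX;
            case "xlsx": return UploadKind.EXCEL_XLSX;
            case "mp4":  return UploadKind.VIDEO_MP4;
            case "zip":  return UploadKind.IMAGE_ZIP;
            default:     return UploadKind.UNSUPPORTED;
        }
    }

    public static void main(String[] args) {
        System.out.println(kindOf("report.DOCX"));
        System.out.println(kindOf("legacy.doc"));
    }
}
```

Rejecting doc and xls up front keeps the unsupported legacy formats from reaching the parsers at all.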

First, the code for parsing Word:

(Snippet 1)

1     public static void Word2Html() throws FileNotFoundException, Docx4JException {
2         // The docx4j log level needs to be configured in log4j
3         WordprocessingMLPackage wmp = WordprocessingMLPackage.load(new File("C:\\Users\\funnyZpC\\Desktop\\Test\\word.docx"));
4         Docx4J.toHTML(wmp, "C:\\Users\\funnyZpC\\Desktop\\result\\wordIMG", "wordIMG", new FileOutputStream(new File("C:\\Users\\funnyZpC\\Desktop\\result\\word.html")));
5     }

(Snippet 2)

    public ProcessFileInfo processDOCX(File file, String uploadPath) throws Exception {
        String fileName = file.getName().substring(0, file.getName().lastIndexOf(".")); // file name without extension
        WordprocessingMLPackage wmp = WordprocessingMLPackage.load(file); // load the source file
        String basePath = String.format("%s%s%s", uploadPath, File.separator, fileName); // base directory
        FileUtils.forceMkdir(new File(basePath)); // create the folder
        String zipFilePath = String.format("%s%s%s.%s", uploadPath, File.separator, fileName, "ZIP"); // path of the final output file
        // Parse; try-with-resources closes the stream so index.html is complete before zipping
        try (FileOutputStream out = new FileOutputStream(new File(String.format("%s%s%s", basePath, File.separator, "index.html")))) {
            Docx4J.toHTML(wmp, String.format("%s%s%s", basePath, File.separator, fileName), fileName, out);
        }
        scormService.zip(basePath, zipFilePath); // zip the folder
        FileUtils.forceDelete(new File(basePath)); // delete the temporary folder
        file.delete(); // parsing finished, delete the original docx file
        return new ProcessFileInfo(true, new File(zipFilePath).getName(), zipFilePath); // info about the output file
    }

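Parsing a Word (docx) document takes as little as two lines of code (lines 3 and 4 of snippet 1). Snippet 2 above is the code from actual development; it is worth reading side by side with snippet 1. Because the project may be deployed on Linux, use File.separator instead of a hard-coded "/" or "\" path separator. The four parameters of the toHTML method also need explaining:

Docx4J.toHTML(the WordprocessingMLPackage instance loaded from the source docx file, the base directory that holds the parse output (HTML and images), the name of the image folder (under the base directory), the output stream for the main HTML);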
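
The File.separator advice above can also be followed by letting the JDK join path segments. Below is a minimal sketch using java.nio.file, assuming the same layout as snippet 2 (an output folder named after the upload, containing index.html); the class and method names are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

final class PathSketch {
    private PathSketch() {}

    /** Strips the last extension: "word.docx" -> "word". */
    static String baseName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        // A name without a dot is returned unchanged instead of throwing.
        return dot > 0 ? fileName.substring(0, dot) : fileName;
    }

    /** Output folder for one upload: uploadPath joined with the base name. */
    static Path outputDir(String uploadPath, String fileName) {
        // Paths.get inserts the platform separator between segments.
        return Paths.get(uploadPath, baseName(fileName));
    }

    public static void main(String[] args) {
        Path dir = outputDir("uploads", "word.docx");
        System.out.println(dir);
        System.out.println(dir.resolve("index.html"));
    }
}
```

One difference from the snippets above: substring(0, lastIndexOf(".")) throws when the name has no dot, while baseName falls back to the whole name.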

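The figure below shows the output directory:

Because docx4j logs a lot internally, running the default demo prints the following notice when the output file is written:

The gist of the message is: to hide it, set docx4j's debug level. The fix is to set the docx4j log level to ERROR in the project's log4j.properties.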
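
The original configuration was shown in an image that is not available here. A fragment along these lines should have the described effect, assuming log4j 1.x properties syntax and that docx4j logs under its org.docx4j package:

```properties
# Assumption: log4j 1.x syntax; only ERROR and above is printed for docx4j loggers
log4j.logger.org.docx4j=ERROR
```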

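If the project is managed with Maven, just add the docx4j dependency to pom.xml. If you configure docx4j and its dependencies by hand, make sure the dependency versions match your docx4j version (docx4j 3.3.5 is recommended; the parsing results are somewhat better), otherwise all sorts of problems show up. The figure below shows some notes from the Maven repository; if you configure dependencies by hand, be sure to click through and read them:

The code below is part of the Excel-to-HTML parsing (it is incomplete; email me if you need the rest):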

(Snippet 1)

    /**
     * @param file source file: c://xx//xx.xlsx
     * @param uploadPath base directory path
     * @return info about the generated ZIP file
     * @throws Exception
     */
    public ProcessFileInfo processXLSX(File file, String uploadPath) throws Exception {
        List<String> sheets = Excel2HtmlUtils.readExcelToHtml(file.getPath());
        FileUtils.forceMkdir(new File(uploadPath)); // create the folder
        String code = file.getName().substring(0, file.getName().lastIndexOf(".")); // file name without extension
        String basePath = String.format("%s%s%s", uploadPath, File.separator, code);
        FileUtils.forceMkdir(new File(basePath));
        File htmlFile = new File(String.format("%s%s%s", basePath, File.separator, "index.html"));
        // Build the HTML file; try-with-resources closes both writers
        try (Writer fw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(htmlFile.getPath()), "UTF-8"));
             PrintWriter bw = new PrintWriter(fw)) {
            // Add the header and the scalable style
            String head = "<!DOCTYPE html><html><head><meta charset=\"UTF-8\"></head><body style=\"transform: scale(0.7,0.7);-webkit-transform: scale(0.7,0.7);\">";
            StringBuilder body = new StringBuilder();
            for (String e : sheets) {
                body.append(e);
            }
            String foot = "</body></html>";
            bw.write(String.format("%s%s%s", head, body.toString(), foot));
        } catch (Exception e) {
            throw new Exception("Failed to write " + htmlFile.getPath(), e); // rethrow, keeping the cause
        }
        String htmlZipFile = String.format("%s%s%s.%s", uploadPath, File.separator, code, "ZIP");
        // Zip the folder
        scormService.zip(basePath, htmlZipFile);
        file.delete(); // delete the uploaded xlsx file
        FileUtils.forceDelete(new File(basePath));
        return new ProcessFileInfo(true, new File(htmlZipFile).getName(), htmlZipFile);
    }


(Snippet 2)

    /**
     * Entry method.
     *
     * @param filePath path of the file
     * @return a list of <table>...</table> strings
     */
    public static List<String> readExcelToHtml(String filePath) {
        List<String> htmlExcel = null;
        File sourcefile = new File(filePath);
        // try-with-resources closes the input stream when parsing ends
        try (InputStream is = new FileInputStream(sourcefile)) {
            Workbook wb = WorkbookFactory.create(is);
            htmlExcel = getExcelToHtml(wb);
        } catch (EncryptedDocumentException e) {
            e.printStackTrace();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (InvalidFormatException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return htmlExcel;
    }


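The above shows only the code that wraps the xlsx content and the entry method for parsing Excel. The whole parser lives in the utils package; the service only calls the method and passes arguments, as shown in the figure below:

The Excel parsing utility consists of four classes. Excel2HtmlUtils is the entry class; the other three work with it to handle Excel styles. Note that the utility must cap the number of records it processes, to avoid out-of-memory errors. Also, if the generated HTML is meant for mobile, it helps to give it a scaling style => transform: scale(0.7,0.7);-webkit-transform: scale(0.7,0.7);.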
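
The record cap mentioned above can be sketched without POI. The example below is an illustration, not the code of Excel2HtmlUtils: it assumes rows arrive as lists of cell strings, and the names RowLimitSketch and toHtmlTable are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class RowLimitSketch {
    private RowLimitSketch() {}

    /** Escapes the characters that would break HTML markup. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /**
     * Renders at most maxRows rows as an HTML table and stops reading
     * after that, so a huge sheet cannot grow the buffer without bound.
     */
    static String toHtmlTable(Iterable<List<String>> rows, int maxRows) {
        StringBuilder html = new StringBuilder("<table>");
        int written = 0;
        for (List<String> row : rows) {
            if (written == maxRows) {
                html.append("<tr><td>(truncated)</td></tr>"); // mark the cut
                break;
            }
            html.append("<tr>");
            for (String cell : row) {
                html.append("<td>").append(escape(cell)).append("</td>");
            }
            html.append("</tr>");
            written++;
        }
        return html.append("</table>").toString();
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("a", "1"),
                Arrays.asList("b", "2"),
                Arrays.asList("c", "3"));
        System.out.println(toHtmlTable(rows, 2));
    }
}
```

This bounds only the generated HTML. Whether the workbook itself fits in memory still depends on how it is loaded.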

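With Excel done, here is the code that turns a pdf (a ZIP of images) into HTML. The code is fairly simple, so I will not explain much; the implementation is as follows: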

        /**
         * Order the images by the number in their file names:
         * a> extract the digits from each file name into an int array (the sequence)
         * b> check that the number of distinct sequence values matches the number of files; if not, fail
         * c> sort the sequence array in ascending order
         * d> iterate over the sequence array, look up each file name (value) in the Map and write the HTML
         */
        String nm = null;
        int[] i = new int[imgNames.size()];
        Map<Integer, String> names = new HashMap<Integer, String>();
        Pattern p = Pattern.compile("[^0-9]");
        for (int j = 0; j < imgNames.size(); j++) {
            nm = imgNames.get(j).substring(0, imgNames.get(j).lastIndexOf(".")); // name without extension
            String idx = p.matcher(nm).replaceAll("").trim();
            i[j] = Integer.parseInt("".equals(idx) ? "0" : idx);
            names.put(i[j], imgNames.get(j));
        }
        if (names.keySet().size() != i.length) {
            // System.out.println("==== Please check your image numbering ===="); /* duplicate or missing numbers */
            return new ProcessFileInfo(false, null, null);
        }
        Arrays.sort(i); // sort the int array in ascending order

        // Wrap as HTML
        StringBuilder html = new StringBuilder();
        html.append("<!DOCTYPE html><html><head><meta charset='UTF-8'><title>PDF</title></head>");
        html.append("<body style=\"margin:0px 0px;padding:0px 0px;\">");
        for (int k : i) {
            // URLs always use "/", so File.separator is not used inside src
            html.append(String.format("%s%s%s%s%s", "<div style=\"width:100%;\"><img src=\"./", fileName, "/", names.get(k), "\"  style=\"width:100%;\" /></div>"));
        }
        html.append("</body></html>");
        File indexFile = new File(String.format("%s%s%s", basePath, File.separator, "index.html"));
        // Build the file (write the HTML into index.html as UTF-8); try-with-resources closes both writers
        try (Writer fw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(indexFile), "UTF-8"));
             PrintWriter bw = new PrintWriter(fw)) {
            bw.write(html.toString());
        } catch (Exception e) {
            throw new Exception(e.toString(), e); // rethrow, keeping the cause
        }
        String zipFilePath = String.format("%s%s%s.%s", uploadPath, File.separator, file.hashCode(), "ZIP");
        scormService.zip(basePath, zipFilePath);
        // Delete the files
        file.delete();
        FileUtils.forceDelete(new File(basePath));
        return new ProcessFileInfo(true, new File(zipFilePath).getName(), zipFilePath);
    }
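
The ordering steps a> to d> can be written more compactly with a sorted map. This is a sketch of the same idea rather than a drop-in replacement: the class and method names are hypothetical, and it assumes, as the code above does, that the digits in a file name form its page number.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

final class ImageOrderSketch {
    private ImageOrderSketch() {}

    /** Number formed by the digits before the extension; 0 if there are none. */
    static int pageNumber(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String stem = dot > 0 ? fileName.substring(0, dot) : fileName;
        String digits = stem.replaceAll("[^0-9]", "");
        return digits.isEmpty() ? 0 : Integer.parseInt(digits);
    }

    /**
     * Returns the names ordered by page number, or an empty Optional when
     * two names map to the same number (duplicate or missing numbering).
     */
    static Optional<List<String>> orderByPageNumber(List<String> imgNames) {
        Map<Integer, String> byNumber = new TreeMap<>(); // iterates in key order
        for (String name : imgNames) {
            if (byNumber.put(pageNumber(name), name) != null) {
                return Optional.empty(); // this number was already taken
            }
        }
        return Optional.of(new ArrayList<>(byNumber.values()));
    }

    public static void main(String[] args) {
        System.out.println(orderByPageNumber(
                Arrays.asList("page10.png", "page2.png", "page1.png")));
    }
}
```

Sorting by the parsed number rather than by name is what puts page2 before page10.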


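Similarly, since I save the ppt as mp4, only simple wrapping is needed after upload. Pay attention to the relative reference from the HTML to the video. The implementation is as follows: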

    /**
     * @param file path of the uploaded file c://xx.//xxx.mp4
     * @param uploadPath base directory in which the HTML is saved
     * @return info about the generated ZIP file
     * @throws Exception
     */
    public ProcessFileInfo processPPTX(File file, String uploadPath) throws Exception {
        String fileName = file.getName().substring(0, file.getName().lastIndexOf(".")); // file name without extension
        String suffix = file.getName().substring(file.getName().lastIndexOf(".") + 1, file.getName().length()).toLowerCase(); // video file extension
        String basePath = String.format("%s%s%s", uploadPath, File.separator, fileName);
        FileUtils.forceMkdir(new File(basePath));
        // Copy the video file into basePath
        String videoPath = String.format("%s%s%s", basePath, File.separator, file.getName());
        FileUtils.copyFile(file, new File(videoPath));
        StringBuilder html = new StringBuilder();
        html.append("<!DOCTYPE html><html><head><meta charset='utf-8'><title>powerpoint</title></head>");
        html.append("<body style=\"margin:0px 0px;\"><div style=\"width:100%;margin:auto 0% auto 0%;\">");
        html.append("<video controls=\"controls\"  width=\"100%\"  height=\"100%\" name=\"media\" >"); // no poster image
        // Video source: src is the copied file's exact name (relative to index.html), and the MIME type is video/*, not audio/*
        html.append(String.format("%s%s%s%s%s%s", "<source src=\"", file.getName(), "\" type=\"video/", suffix, "\" >", "</video></div>"));
        html.append("</body></html>"); // closing tags
        File indexFile = new File(String.format("%s%s%s", basePath, File.separator, "index.html"));
        // Build the file (write the HTML into index.html as UTF-8); try-with-resources closes both writers
        try (Writer fw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(indexFile), "UTF-8"));
             PrintWriter bw = new PrintWriter(fw)) {
            bw.write(html.toString());
        } catch (Exception e) {
            throw new Exception(e.toString(), e); // rethrow, keeping the cause
        }
        String zipFilePath = String.format("%s%s%s.%s", uploadPath, File.separator, fileName, "ZIP");
        scormService.zip(basePath, zipFilePath);
        // Delete the files
        file.delete();
        FileUtils.forceDelete(new File(basePath));
        return new ProcessFileInfo(true, new File(zipFilePath).getName(), zipFilePath);
    }
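
The point about the relative video reference can be isolated in a small, JDK-only sketch. It assumes the video sits in the same folder as index.html and that the file name has an extension; VideoPageSketch and its methods are hypothetical names.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;

final class VideoPageSketch {
    private VideoPageSketch() {}

    /** Builds a page whose source tag points at a file next to index.html. */
    static String buildVideoPage(String videoFileName) {
        // Assumes the name has an extension, e.g. "demo.mp4".
        int dot = videoFileName.lastIndexOf('.');
        String suffix = videoFileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return "<!DOCTYPE html><html><head><meta charset=\"utf-8\">"
                + "<title>powerpoint</title></head>"
                + "<body style=\"margin:0\">"
                + "<video controls width=\"100%\">"
                // Relative src: resolved against the folder holding index.html
                + "<source src=\"" + videoFileName + "\" type=\"video/" + suffix + "\">"
                + "</video></body></html>";
    }

    /** Writes index.html as UTF-8; the writer is closed even on failure. */
    static Path writeIndex(Path dir, String html) throws IOException {
        Files.createDirectories(dir);
        Path index = dir.resolve("index.html");
        try (Writer w = Files.newBufferedWriter(index, StandardCharsets.UTF_8)) {
            w.write(html);
        }
        return index;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("ppt-demo");
        Path index = writeIndex(dir, buildVideoPage("demo.mp4"));
        System.out.println(new String(Files.readAllBytes(index), StandardCharsets.UTF_8));
    }
}
```

Because src carries no directory part, the page keeps working after the folder is zipped and unpacked somewhere else.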


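Although the requirement was eventually changed to the simplest possible implementation, the approach worked out through all this nearly wasted effort is still worth sharing. If it helps even one developer, it was well worth it.

Please cite the source when reposting: http://www.cnblogs.com/funnyzpc/p/7225988.html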
