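Online File Preview: Converting doc and docx to pdf (Part 1)
1. Preface
Document conversion is a tough nut to crack, but it is also indispensable. The knowledge-base product we are building faces exactly this problem: document conversion, accurate full-text search, and the knowledge conversion rate are the basic ingredients of such a product. When I first looked into it I racked my brains over it. Build it in-house? Integrate a third party? Either way it is a major headache for small and medium-sized companies.
Searching online, I found many open-source examples built on poi, and at first I used poi to convert every document to html. These are the options I came across: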
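1) On GitHub I found a complete demo, https://github.com/litter-fish/transform, which covers basically every conversion you might want. Beginners can use it as a reference to get a basic version of the conversion working, but reaching a general-purpose level takes a lot of extra work on your part. This open-source code converts using poi and itext (for pdf).
2) https://gitee.com/kekingcn/file-online-preview is an open-source project hosted on OSChina's Gitee. It is based on jodconverter, which works by driving an installed office suite (OpenOffice or LibreOffice) to do the equivalent of "Save As" into the target format.
3) Commercial products, for example Yozo Office (永中office), office365, idocv, and Aspose.Words for Java (https://downloads.aspose.com/words/java).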
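2. Conversion approach
After trying many options, I also integrated Yozo's document conversion, and found that reaching acceptable preview quality requires a second rendering pass. Yozo has been doing document conversion for more than ten years, and we cannot compete with that. After thinking it over, I found a barely workable approach: convert both doc and docx to pdf for preview, in both cases building on poi.
2.1. Converting doc to pdf
1) Converting doc to XML (XSL-FO)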
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.converter.PicturesManager;
import org.apache.poi.hwpf.converter.WordToFoConverter;
import org.apache.poi.hwpf.usermodel.Picture;
import org.apache.poi.hwpf.usermodel.PictureType;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.w3c.dom.Document;

/**
 * doc to XML (XSL-FO). PathMaster and OUTFILEFO are not shown in this snippet.
 */
public String toXML(String filePath) {
    // Directory that receives the images extracted from the document
    final String nTempPath = PathMaster.getWebRootPath() + File.separator + "temp"
            + File.separator + "images" + File.separator;
    try (POIFSFileSystem nPOIFSFileSystem = new POIFSFileSystem(new File(filePath))) {
        HWPFDocument nHWPFDocument = new HWPFDocument(nPOIFSFileSystem);
        WordToFoConverter nWordToFoConverter = new WordToFoConverter(
                DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
        // Tell the converter which URI to emit for each embedded picture,
        // e.g. file:///F://20.vscode//iWorkP//temp//images//0.jpg
        nWordToFoConverter.setPicturesManager(new PicturesManager() {
            public String savePicture(byte[] content, PictureType pictureType,
                    String suggestedName, float widthInches, float heightInches) {
                return "file:///" + nTempPath + suggestedName;
            }
        });
        nWordToFoConverter.processDocument(nHWPFDocument);

        // Write the embedded pictures into the directory referenced above
        File nFile = new File(nTempPath);
        if (!nFile.exists()) {
            nFile.mkdirs();
        }
        for (Picture nPicture : nHWPFDocument.getPicturesTable().getAllPictures()) {
            try (OutputStream nPictureOut =
                    new FileOutputStream(nTempPath + nPicture.suggestFullFileName())) {
                nPicture.writeImageContent(nPictureOut);
            }
        }

        // Serialize the FO DOM tree to OUTFILEFO
        Document nFoDocument = nWordToFoConverter.getDocument();
        try (OutputStream nFoOutputStream = new FileOutputStream(OUTFILEFO)) {
            Transformer nTransformer = TransformerFactory.newInstance().newTransformer();
            nTransformer.setOutputProperty(OutputKeys.ENCODING, "GBK");
            nTransformer.setOutputProperty(OutputKeys.INDENT, "yes"); // valid values are "yes" and "no"
            nTransformer.setOutputProperty(OutputKeys.METHOD, "xml");
            nTransformer.transform(new DOMSource(nFoDocument), new StreamResult(nFoOutputStream));
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return "";
}
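The last step of the method above, writing a DOM tree out as XML with a chosen encoding and indentation, can be tried in isolation with nothing but the JDK. The following is a minimal sketch, not the author's code: the class name FoSerializeSketch and the tiny hand-built fo:root document are assumptions made for illustration, standing in for the tree that WordToFoConverter would return.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class, for illustration only.
final class FoSerializeSketch {
    static final String FO_NS = "http://www.w3.org/1999/XSL/Format";

    /** Builds a tiny stand-in for the FO tree a converter would produce. */
    static Document sampleDocument() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().newDocument();
        Element root = doc.createElementNS(FO_NS, "fo:root");
        Element block = doc.createElementNS(FO_NS, "fo:block");
        block.setTextContent("hello");
        root.appendChild(block);
        doc.appendChild(root);
        return doc;
    }

    /** Serializes the DOM tree with an identity transformer. */
    static String serialize(Document doc, String encoding) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serialize(sampleDocument(), "UTF-8"));
    }
}
```

When the target is a file, passing an OutputStream to StreamResult (as the method above does) lets the transformer apply the chosen encoding to the bytes it writes; with a Writer, as in this sketch, the encoding only shows up in the XML declaration.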
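2) Converting XML to pdf
Here I use fop to turn the XML into a pdf. This is a recent and pleasant discovery: it is what the poi site recommends, but I had never read it carefully. Many of the jars it ships are identical to the ones in Yozo's high-definition conversion package, so this now looks like the right path. Those interested can dig further. My source code is on GitHub at https://github.com/liuxufeijidian/file.convert.master/tree/master; the environment is already configured, so you only need to supply doc and docx files.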
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.stream.StreamSource;
import org.apache.fop.apps.Fop;
import org.apache.fop.apps.FopFactory;
import org.apache.fop.apps.MimeConstants;
import org.xml.sax.SAXException;

/*
 * XML (XSL-FO) to pdf. CONFIG, OUTFILEFO and OUTFILEPDF are not shown in this snippet.
 */
public void xmlToPDF() throws SAXException, TransformerException {
    try {
        // Step 1: Construct a FopFactory by specifying a reference to the configuration file
        // (reuse if you plan to render multiple documents!)
        FopFactory fopFactory = FopFactory.newInstance(new File(CONFIG));
        // Step 2: Set up output stream.
        // Note: Using BufferedOutputStream for performance reasons (helpful with FileOutputStreams).
        // try-with-resources closes the stream even if rendering fails part-way through.
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(OUTFILEPDF))) {
            // Step 3: Construct fop with desired output format
            Fop fop = fopFactory.newFop(MimeConstants.MIME_PDF, out);
            // Step 4: Setup JAXP using identity transformer
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer = factory.newTransformer(); // identity transformer
            // Step 5: Setup input and output for the transformation
            // The input is the FO file written by toXML
            Source src = new StreamSource(OUTFILEFO);
            // Resulting SAX events (the FO) must be piped through to FOP
            Result res = new SAXResult(fop.getDefaultHandler());
            // Step 6: Start the transformation and FOP processing
            transformer.transform(src, res);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
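Steps 4 to 6 rely on a general JAXP pattern: an identity transformer reads XML from a Source and replays it as SAX events into whatever ContentHandler sits behind a SAXResult. The sketch below shows that pattern with the JDK alone. The RecordingHandler class is a hypothetical stand-in for the handler that fop.getDefaultHandler() returns, and the class and method names are assumptions for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, for illustration only.
final class SaxPipeSketch {

    /** Records element names as they arrive; stands in for FOP's handler. */
    static final class RecordingHandler extends DefaultHandler {
        final List<String> elements = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            elements.add(qName);
        }
    }

    /** Pipes the XML text through an identity transformer into the handler. */
    static List<String> pipe(String xml) throws Exception {
        RecordingHandler handler = new RecordingHandler();
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.transform(new StreamSource(new StringReader(xml)), new SAXResult(handler));
        return handler.elements;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(pipe("<root><block>hi</block></root>"));
    }
}
```

Because the handler consumes events as they are produced, the FO file never has to be held in memory as a second tree, which is presumably why the FOP step is wired this way.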
3) Previewing the pdf
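We often convert Word straight to html, but that requires writing your own second-pass rendering code, which is fairly complex. I take a detour instead: doc to XML, then XML to pdf. The resulting pdf can be rendered with pdfjs, giving the same preview you would get by opening it in the browser. For details on previewing with pdfjs, see https://blog.csdn.net/liuxufeijidian/article/details/82260199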
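To make the detour concrete, here is one possible way to plan the intermediate and final files for a single doc input before running the two conversion steps. This is only a sketch under stated assumptions: the class ConversionPlanSketch, its fields, and the choice to name the outputs after the source file are all hypothetical and do not come from the project. It handles .doc only, since that is the path described here.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical planning helper, for illustration only.
final class ConversionPlanSketch {
    final Path source;   // the .doc file to convert
    final Path foFile;   // intermediate XML (XSL-FO) file
    final Path pdfFile;  // final pdf handed to the pdfjs viewer

    private ConversionPlanSketch(Path source, Path foFile, Path pdfFile) {
        this.source = source;
        this.foFile = foFile;
        this.pdfFile = pdfFile;
    }

    /** Derives the .fo and .pdf paths in workDir from the source file name. */
    static ConversionPlanSketch forSource(Path source, Path workDir) {
        String name = source.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String ext = dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
        if (!ext.equals("doc")) {
            throw new IllegalArgumentException("Only .doc is handled by this sketch: " + name);
        }
        String base = name.substring(0, dot);
        return new ConversionPlanSketch(
                source, workDir.resolve(base + ".fo"), workDir.resolve(base + ".pdf"));
    }

    public static void main(String[] args) {
        ConversionPlanSketch plan = forSource(Paths.get("in", "report.doc"), Paths.get("out"));
        System.out.println(plan.foFile + " -> " + plan.pdfFile);
    }
}
```

Per-file paths like these would take the place of shared constants such as OUTFILEFO and OUTFILEPDF, so that two conversions running at once do not overwrite each other's files.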
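Ending: everyone wants to see how it looks. Get the source code from GitHub at https://github.com/litter-fish/transform, supply doc and docx files, and you can run the conversion. I will keep working on optimizing and updating the document conversion.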
