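I recently built a feature that converts Word documents to HTML and extracts the headings to generate a navigation menu. Given how complex the full feature would be, I narrowed the requirement to extracting only paragraphs that use the "Heading 1" (標題1) style.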
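Documents with the .docx extension (Word 2007) use an XML-based file format. A .docx file is essentially a zip archive, and unzipping it exposes all of its contents. Probably for that reason, XHTMLConverter can produce the corresponding HTML document directly, with the class attribute of each heading element set to "X" + n (where n is the heading level).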
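To illustrate the "docx is just a zip archive" point, here is a minimal sketch that lists the entries of a zip container using only the JDK. The class name `DocxZipPeek` and the in-memory sample archive are assumptions made for illustration; with a real file you would pass a `FileInputStream` for the .docx instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, not part of POI or the original post.
public class DocxZipPeek {

    /** Returns the names of all entries in a zip-based container such as a .docx file. */
    public static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    /** Builds a tiny stand-in archive in memory so the sketch runs without a real document. */
    public static byte[] sampleArchive() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            // Entry names mirror two parts that a .docx package contains.
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write("<w:document/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("word/styles.xml"));
            zip.write("<w:styles/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        for (String name : listEntries(new ByteArrayInputStream(sampleArchive()))) {
            System.out.println(name);
        }
    }
}
```

In a real .docx, the paragraph text lives in word/document.xml and the style definitions (including the heading styles) in word/styles.xml.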
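.doc documents, however, are more troublesome. They are usually read with POI, and the most common way to convert them to HTML is POI's WordToHtmlConverter. This converter gives headings no special treatment: it handles each one as an ordinary styled paragraph (Paragraph), so headings end up mixed in with all the other paragraphs. There are two ways to deal with this: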
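Approach 1: override the processParagraph method and, at the check marked by the comment below, add a test for headings so they can be handled specially. However, the member variables of WordToHtmlConverter are all declared private, so an overriding subclass cannot access the fields the method relies on (such as htmlDocumentFacade and blocksProperies). For that reason I went with the other approach.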
    protected void processParagraph(HWPFDocumentCore hwpfDocument, Element parentElement,
            int currentTableLevel, Paragraph paragraph, String bulletText) {
        Element pElement = this.htmlDocumentFacade.createParagraph();
        parentElement.appendChild(pElement);
        StringBuilder style = new StringBuilder();
        WordToHtmlUtils.addParagraphProperties(paragraph, style);
        int charRuns = paragraph.numCharacterRuns();
        if(charRuns != 0) {
            CharacterRun characterRun = paragraph.getCharacterRun(0);
            String pFontName;
            int pFontSize;
            if(characterRun != null) {
                Triplet triplet = this.getCharacterRunTriplet(characterRun);
                pFontSize = characterRun.getFontSize() / 2;
                pFontName = triplet.fontName;
                WordToHtmlUtils.addFontFamily(pFontName, style);
                WordToHtmlUtils.addFontSize(pFontSize, style);
            } else {
                pFontSize = -1;
                pFontName = "";
            }
            this.blocksProperies.push(new WordToHtmlConverter.BlockProperies(pFontName, pFontSize));
            try {
                if(WordToHtmlUtils.isNotEmpty(bulletText)) {
                    if(bulletText.endsWith("\t")) {
                        float defaultTab = 720.0F;
                        float firstLinePosition = (float)(paragraph.getIndentFromLeft() + paragraph.getFirstLineIndent() + 20);
                        float nextStop = (float)(Math.ceil((double)(firstLinePosition / 720.0F)) * 720.0D);
                        float spanMinWidth = nextStop - firstLinePosition;
                        Element span = this.htmlDocumentFacade.getDocument().createElement("span");
                        this.htmlDocumentFacade.addStyleClass(span, "s",
                                "display: inline-block; text-indent: 0; min-width: " + spanMinWidth / 1440.0F + "in;");
                        pElement.appendChild(span);
                        Text textNode = this.htmlDocumentFacade.createText(
                                bulletText.substring(0, bulletText.length() - 1) + '\u200b' + ' ');
                        span.appendChild(textNode);
                    } else {
                        Text textNode = this.htmlDocumentFacade.createText(
                                bulletText.substring(0, bulletText.length() - 1));
                        pElement.appendChild(textNode);
                    }
                }
                this.processCharacters(hwpfDocument, currentTableLevel, paragraph, pElement);
            } finally {
                this.blocksProperies.pop();
            }
            // This is the place that needs to change: add the heading check here
            if(style.length() > 0) {
                this.htmlDocumentFacade.addStyleClass(pElement, "p", style.toString());
            }
            WordToHtmlUtils.compactSpans(pElement);
        }
    }
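Approach 2: plant placeholders in the Word document before conversion, then post-process the converted HTML document using idTitleMap. Each "Heading 1" paragraph has its text replaced with a UUID, and the map records which UUID stands for which heading text.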
    // Maps is from Guava; UUIDHelper and FIRST_LEVEL_TITLE_DESCRIPTION are the project's own helper and constant.
    private Map<String,String> setTitleElements(HWPFDocument wordObject) {
        // Get the style sheet
        StyleSheet styleSheet = wordObject.getStyleSheet();
        int styleTotal = styleSheet.numStyles();
        // Map from placeholder UUID to the original heading text
        Map<String,String> idTitleMap = Maps.newHashMap();
        Range range = wordObject.getRange();
        for (int i = 0; i < range.numParagraphs(); i++) {
            // Look up the style of this paragraph
            Paragraph paragraph = range.getParagraph(i);
            int styleIndex = paragraph.getStyleIndex();
            if (styleTotal > styleIndex) {
                StyleDescription styleDescription = styleSheet.getStyleDescription(styleIndex);
                String descriptionName = styleDescription.getName();
                if (descriptionName != null && descriptionName.contains(FIRST_LEVEL_TITLE_DESCRIPTION)) {
                    // Replace the heading text with a UUID placeholder and remember the original text
                    String uuid = UUIDHelper.getUuid();
                    String text = paragraph.text().replaceAll("[\r\n]", "");
                    paragraph.replaceText(uuid, false);
                    idTitleMap.put(uuid, text);
                }
            }
        }
        return idTitleMap;
    }
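The post does not show the post-processing step, so the following is only a rough sketch of what it might look like, using plain string operations. It assumes each placeholder UUID survives conversion as plain text somewhere in the HTML. The class `TitleNavSketch`, its method names, and the choice of an `<a id="...">` anchor are all hypothetical, not the author's implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical post-processing helper; not from the original post.
public class TitleNavSketch {

    /** Minimal HTML escaping for heading text taken from the Word document. */
    public static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }

    /** Builds a navigation list, ordered by where each placeholder appears in the HTML. */
    public static String buildNav(String html, Map<String, String> idTitleMap) {
        List<String> ids = new ArrayList<>(idTitleMap.keySet());
        // Ignore placeholders that did not survive the conversion.
        ids.removeIf(id -> !html.contains(id));
        ids.sort(Comparator.comparingInt(id -> html.indexOf(id)));
        StringBuilder nav = new StringBuilder("<ul>");
        for (String id : ids) {
            nav.append("<li><a href=\"#").append(id).append("\">")
               .append(escape(idTitleMap.get(id)))
               .append("</a></li>");
        }
        return nav.append("</ul>").toString();
    }

    /** Swaps each placeholder for an anchor that carries the original heading text. */
    public static String restoreTitles(String html, Map<String, String> idTitleMap) {
        String result = html;
        for (Map.Entry<String, String> entry : idTitleMap.entrySet()) {
            String anchor = "<a id=\"" + entry.getKey() + "\">" + escape(entry.getValue()) + "</a>";
            result = result.replace(entry.getKey(), anchor);
        }
        return result;
    }
}
```

Call `buildNav` on the converted HTML before `restoreTitles`, since the navigation order is derived from the placeholder positions. A DOM-based rewrite would be more robust than string replacement if the surrounding paragraph element itself needs to become a heading tag.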