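Reading complex tables from Word 2007 (.docx) files with POI
Recently at work I needed to read the tables in a Word (.docx) file and output them as HTML. After searching online, I went with POI.
For documents in the Word 2007 and later format, you need to import poi-ooxml-xxx.jar and its dependencies. (The figure in the original post showed them being pulled in with Maven.)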

For simple tables, the content of each table can be read like this:
XWPFDocument document = new XWPFDocument(new FileInputStream("word.docx"));
// Get all tables in the document
List<XWPFTable> tables = document.getTables();
for (XWPFTable table : tables) {
    // Get the rows of the table
    List<XWPFTableRow> rows = table.getRows();
    for (XWPFTableRow row : rows) {
        // Get every cell in the row
        List<XWPFTableCell> tableCells = row.getTableCells();
        for (XWPFTableCell cell : tableCells) {
            // Get the text content of the cell
            String text = cell.getText();
        }
    }
}
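Since the goal is HTML output, the text collected by the loop above could be rendered roughly as follows. This is only a minimal sketch: the class `SimpleTableHtml` and its methods are hypothetical names, and it assumes the cell texts have already been copied into a `List<List<String>>` (one inner list per row), so it does not depend on POI at all.

```java
import java.util.Arrays;
import java.util.List;

class SimpleTableHtml {
    /** Escapes the characters that would otherwise be read as HTML markup. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Renders rows of cell text as a plain HTML table, one td per cell. */
    static String toHtml(List<List<String>> rows) {
        StringBuilder html = new StringBuilder("<table>");
        for (List<String> row : rows) {
            html.append("<tr>");
            for (String cell : row) {
                html.append("<td>").append(escape(cell)).append("</td>");
            }
            html.append("</tr>");
        }
        return html.append("</table>").toString();
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("1 < 2", "c & d"));
        System.out.println(toHtml(rows));
    }
}
```

Every cell becomes exactly one `td` here, which is why this only suits tables without merged cells.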
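For complex tables (ones containing merged cells), however, this approach does not work properly.
So I kept searching and found the following code on Stack Overflow, which generates a table containing merged cells: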
import java.io.FileOutputStream;
import java.math.BigInteger;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTHMerge;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTTblWidth;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTTcPr;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTVMerge;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.STMerge;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.STTblWidth;

public class CreateWordTableMerge {

    static void mergeCellVertically(XWPFTable table, int col, int fromRow, int toRow) {
        for (int rowIndex = fromRow; rowIndex <= toRow; rowIndex++) {
            CTVMerge vmerge = CTVMerge.Factory.newInstance();
            if (rowIndex == fromRow) {
                // The first merged cell is set with RESTART merge value
                vmerge.setVal(STMerge.RESTART);
            } else {
                // Cells which join (merge) the first one are set with CONTINUE
                vmerge.setVal(STMerge.CONTINUE);
            }
            XWPFTableCell cell = table.getRow(rowIndex).getCell(col);
            // Try getting the TcPr instead of simply setting a new one every time
            CTTcPr tcPr = cell.getCTTc().getTcPr();
            if (tcPr != null) {
                tcPr.setVMerge(vmerge);
            } else {
                // Only set a new TcPr if there is not one already
                tcPr = CTTcPr.Factory.newInstance();
                tcPr.setVMerge(vmerge);
                cell.getCTTc().setTcPr(tcPr);
            }
        }
    }

    static void mergeCellHorizontally(XWPFTable table, int row, int fromCol, int toCol) {
        for (int colIndex = fromCol; colIndex <= toCol; colIndex++) {
            CTHMerge hmerge = CTHMerge.Factory.newInstance();
            if (colIndex == fromCol) {
                // The first merged cell is set with RESTART merge value
                hmerge.setVal(STMerge.RESTART);
            } else {
                // Cells which join (merge) the first one are set with CONTINUE
                hmerge.setVal(STMerge.CONTINUE);
            }
            XWPFTableCell cell = table.getRow(row).getCell(colIndex);
            // Try getting the TcPr instead of simply setting a new one every time
            CTTcPr tcPr = cell.getCTTc().getTcPr();
            if (tcPr != null) {
                tcPr.setHMerge(hmerge);
            } else {
                // Only set a new TcPr if there is not one already
                tcPr = CTTcPr.Factory.newInstance();
                tcPr.setHMerge(hmerge);
                cell.getCTTc().setTcPr(tcPr);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        XWPFDocument document = new XWPFDocument();
        XWPFParagraph paragraph = document.createParagraph();
        XWPFRun run = paragraph.createRun();
        run.setText("The table:");

        // Create a table with 3 rows and 5 columns and fill in every cell
        XWPFTable table = document.createTable(3, 5);
        for (int row = 0; row < 3; row++) {
            for (int col = 0; col < 5; col++) {
                table.getRow(row).getCell(col).setText("row " + row + ", col " + col);
            }
        }

        // Create and set column widths for all columns in all rows.
        // Most examples don't set the type of the CTTblWidth, but this
        // is necessary for working in all Office versions.
        for (int col = 0; col < 5; col++) {
            CTTblWidth tblWidth = CTTblWidth.Factory.newInstance();
            tblWidth.setW(BigInteger.valueOf(1000));
            tblWidth.setType(STTblWidth.DXA);
            for (int row = 0; row < 3; row++) {
                CTTcPr tcPr = table.getRow(row).getCell(col).getCTTc().getTcPr();
                if (tcPr != null) {
                    tcPr.setTcW(tblWidth);
                } else {
                    tcPr = CTTcPr.Factory.newInstance();
                    tcPr.setTcW(tblWidth);
                    table.getRow(row).getCell(col).getCTTc().setTcPr(tcPr);
                }
            }
        }

        // Use the merge methods
        mergeCellVertically(table, 0, 0, 1);
        mergeCellHorizontally(table, 1, 2, 3);
        mergeCellHorizontally(table, 2, 1, 4);

        paragraph = document.createParagraph();

        // try-with-resources closes the output stream even if write() fails
        try (FileOutputStream out = new FileOutputStream("create_table.docx")) {
            document.write(out);
        }
        System.out.println("create_table.docx written successfully");
    }
}
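Running it does work, but I was still in the dark: I had no idea what cTTc, tcPr, vMerge and the like actually were.
That changed when I learned about Office Open XML (OOXML). If you change the extension of a .docx file to .zip, you can open it with any archive tool. Inside there is a word folder, and the document.xml in that folder holds the body of the Word document.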
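The same inspection can be done in code without renaming the file, because a .docx is an ordinary ZIP archive. The following is a minimal sketch using only `java.util.zip`. The class name `DocxXmlReader` is hypothetical, and it assumes document.xml is UTF-8 encoded, which is what Word normally writes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class DocxXmlReader {
    /** Returns the text of word/document.xml from a .docx stream, or null if the entry is missing. */
    static String readDocumentXml(InputStream docx) {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    // Copy the bytes of this entry only; read() returns -1 at the end of the entry
                    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                    byte[] chunk = new byte[8192];
                    int n;
                    while ((n = zip.read(chunk)) != -1) {
                        buffer.write(chunk, 0, n);
                    }
                    // Assumption: the part is UTF-8 encoded
                    return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Paths.get(args.length > 0 ? args[0] : "word.docx");
        try (InputStream in = Files.newInputStream(path)) {
            System.out.println(readDocumentXml(in));
        }
    }
}
```

Printing the result for a small test document is a quick way to see which elements Word actually writes for a given table.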

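For a two-row, two-column Word table whose first-column cells are merged vertically (across rows), the corresponding XML is shown below. (The original post showed a screenshot of the table here.)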
<w:tbl>
  <w:tblPr>
    <w:tblStyle w:val="a3"/>
    <w:tblW w:w="0" w:type="auto"/>
    <w:tblLook w:val="04A0" w:firstRow="1" w:lastRow="0" w:firstColumn="1" w:lastColumn="0" w:noHBand="0" w:noVBand="1"/>
  </w:tblPr>
  <w:tblGrid>
    <w:gridCol w:w="2765"/>
    <w:gridCol w:w="2765"/>
  </w:tblGrid>
  <w:tr w:rsidR="00151AA4" w:rsidTr="000249EF">
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="2765" w:type="dxa"/>
        <w:vMerge w:val="restart"/>
      </w:tcPr>
      <w:p w:rsidR="00151AA4" w:rsidRDefault="00151AA4" w:rsidP="00915802">
        <w:r>
          <w:rPr>
            <w:rFonts w:hint="eastAsia"/>
          </w:rPr>
          <w:t>0,0</w:t>
        </w:r>
      </w:p>
    </w:tc>
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="2765" w:type="dxa"/>
      </w:tcPr>
      <w:p w:rsidR="00151AA4" w:rsidRDefault="00151AA4">
        <w:r>
          <w:rPr>
            <w:rFonts w:hint="eastAsia"/>
          </w:rPr>
          <w:t>0,1</w:t>
        </w:r>
      </w:p>
    </w:tc>
  </w:tr>
  <w:tr w:rsidR="00151AA4" w:rsidTr="000249EF">
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="2765" w:type="dxa"/>
        <w:vMerge/>
      </w:tcPr>
      <w:p w:rsidR="00151AA4" w:rsidRDefault="00151AA4"/>
    </w:tc>
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="2765" w:type="dxa"/>
      </w:tcPr>
      <w:p w:rsidR="00151AA4" w:rsidRDefault="00151AA4">
        <w:r>
          <w:rPr>
            <w:rFonts w:hint="eastAsia"/>
          </w:rPr>
          <w:t>1,1</w:t>
        </w:r>
        <w:bookmarkStart w:id="0" w:name="_GoBack"/>
        <w:bookmarkEnd w:id="0"/>
      </w:p>
    </w:tc>
  </w:tr>
</w:tbl>
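Having seen this, the tc, tcPr, vMerge and similar names from earlier should now make sense: they are the XML elements that POI's low-level classes wrap.
Here w:tr is one row of the table, w:tc is one cell, and w:tcPr holds the properties of a cell.
For details, see: http://www.datypic.com/sc/ooxml/e-w_tbl-1.html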
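To check these elements programmatically, the table XML can also be walked with the JDK's DOM parser. The sketch below lists the vMerge and gridSpan state of every w:tc. The class `TableXmlInspector` and its output format are hypothetical, and it assumes the input declares the `w` namespace prefix, as a complete document.xml does (the fragments quoted in this post omit that declaration). For simplicity it also visits the cells of nested tables.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TableXmlInspector {
    /** Namespace URI bound to the "w" prefix in WordprocessingML. */
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the first direct child element with the given local name in the w namespace, or null. */
    static Element child(Element parent, String localName) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && W.equals(n.getNamespaceURI())
                    && localName.equals(n.getLocalName())) {
                return (Element) n;
            }
        }
        return null;
    }

    /** Describes every w:tc in document order, e.g. "vMerge=restart gridSpan=1". */
    static List<String> describeCells(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            NodeList cells = doc.getElementsByTagNameNS(W, "tc");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < cells.getLength(); i++) {
                Element tcPr = child((Element) cells.item(i), "tcPr");
                Element vMerge = tcPr == null ? null : child(tcPr, "vMerge");
                Element gridSpan = tcPr == null ? null : child(tcPr, "gridSpan");
                String merge;
                if (vMerge == null) {
                    merge = "none";
                } else if ("restart".equals(vMerge.getAttributeNS(W, "val"))) {
                    merge = "restart";
                } else {
                    merge = "continue"; // a bare <w:vMerge/> continues the merge from the row above
                }
                String span = gridSpan == null ? "1" : gridSpan.getAttributeNS(W, "val");
                result.add("vMerge=" + merge + " gridSpan=" + span);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse table XML", e);
        }
    }
}
```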
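Next, here is the horizontal merge case (cells merged across columns), which you can also use to verify the above. (The original post showed a screenshot of a table whose first row is a single cell spanning both columns.)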

The corresponding XML:
<w:tbl>
  <w:tblPr>
    <w:tblStyle w:val="a3"/>
    <w:tblW w:w="0" w:type="auto"/>
    <w:tblLook w:val="04A0" w:firstRow="1" w:lastRow="0" w:firstColumn="1" w:lastColumn="0" w:noHBand="0" w:noVBand="1"/>
  </w:tblPr>
  <w:tblGrid>
    <w:gridCol w:w="2765"/>
    <w:gridCol w:w="2765"/>
  </w:tblGrid>
  <w:tr w:rsidR="006C0A9A" w:rsidTr="006C099A">
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="5530" w:type="dxa"/>
        <w:gridSpan w:val="2"/>
      </w:tcPr>
      <w:p w:rsidR="006C0A9A" w:rsidRDefault="006C0A9A">
        <w:r>
          <w:rPr>
            <w:rFonts w:hint="eastAsia"/>
          </w:rPr>
          <w:t>0,0</w:t>
        </w:r>
      </w:p>
    </w:tc>
  </w:tr>
  <w:tr w:rsidR="006C0A9A" w:rsidTr="000249EF">
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="2765" w:type="dxa"/>
      </w:tcPr>
      <w:p w:rsidR="006C0A9A" w:rsidRDefault="006C0A9A">
        <w:r>
          <w:rPr>
            <w:rFonts w:hint="eastAsia"/>
          </w:rPr>
          <w:t>1,0</w:t>
        </w:r>
      </w:p>
    </w:tc>
    <w:tc>
      <w:tcPr>
        <w:tcW w:w="2765" w:type="dxa"/>
      </w:tcPr>
      <w:p w:rsidR="006C0A9A" w:rsidRDefault="006C0A9A">
        <w:r>
          <w:rPr>
            <w:rFonts w:hint="eastAsia"/>
          </w:rPr>
          <w:t>1,1</w:t>
        </w:r>
      </w:p>
    </w:tc>
  </w:tr>
</w:tbl>
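From these observations, the cases can be summarized as follows (using the methods POI provides):
Vertical merge (cells merged across rows):
CTTcPr tcpr = tables.get(0).getRow(2).getCell(0).getCTTc().getTcPr(); // the cell's properties (w:tcPr); Word normally writes one for every cell, but getTcPr() can return null, so check before using it
For the first cell of a vertical merge: STMerge.RESTART.equals(tcpr.getVMerge().getVal()) (do not compare the string form with ==, which tests object identity in Java)
For the other cells of a vertical merge: tcpr.getVMerge() != null, and getVal() is null (Word writes a bare <w:vMerge/>) or STMerge.CONTINUE
For a cell that is not vertically merged: tcpr.getVMerge() == null
Horizontal merge (cells merged across columns):
CTTcPr tcpr = tables.get(0).getRow(2).getCell(0).getCTTc().getTcPr();
For a horizontally merged cell: tcpr.getGridSpan().getVal() returns the number of grid columns the cell spans
For other cells: tcpr.getGridSpan() == null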
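The rules summarized above map naturally onto HTML rowspan and colspan attributes. The following sketch shows one way to do that mapping. It is not the demo mentioned in this post: `MergedTableHtml`, `Cell` and `VMerge` are hypothetical names, and the code assumes each cell's text, vMerge state and gridSpan have already been read from POI (via getVMerge() and getGridSpan() as described above) into plain objects, so the logic itself needs only the JDK.

```java
import java.util.Arrays;
import java.util.List;

class MergedTableHtml {
    enum VMerge { NONE, RESTART, CONTINUE }

    /** Plain holder for what was read from one w:tc (hypothetical model, not a POI class). */
    static final class Cell {
        final String text;
        final VMerge vMerge;
        final int gridSpan; // 1 when the cell has no w:gridSpan

        Cell(String text, VMerge vMerge, int gridSpan) {
            this.text = text;
            this.vMerge = vMerge;
            this.gridSpan = gridSpan;
        }
    }

    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Returns the cell of a row that starts at the given grid column, or null if none does. */
    static Cell cellAtGridColumn(List<Cell> row, int gridCol) {
        int col = 0;
        for (Cell cell : row) {
            if (col == gridCol) {
                return cell;
            }
            col += cell.gridSpan;
            if (col > gridCol) {
                return null; // the wanted column lies inside a wider cell
            }
        }
        return null;
    }

    static String toHtml(List<List<Cell>> rows) {
        StringBuilder html = new StringBuilder("<table>");
        for (int r = 0; r < rows.size(); r++) {
            html.append("<tr>");
            int gridCol = 0; // grid column where the current cell starts
            for (Cell cell : rows.get(r)) {
                if (cell.vMerge != VMerge.CONTINUE) { // continued cells are covered by the rowspan above
                    int rowspan = 1;
                    if (cell.vMerge == VMerge.RESTART) {
                        // Count the consecutive rows below that continue the merge in this column
                        for (int below = r + 1; below < rows.size(); below++) {
                            Cell c = cellAtGridColumn(rows.get(below), gridCol);
                            if (c == null || c.vMerge != VMerge.CONTINUE) {
                                break;
                            }
                            rowspan++;
                        }
                    }
                    html.append("<td");
                    if (rowspan > 1) {
                        html.append(" rowspan=\"").append(rowspan).append('"');
                    }
                    if (cell.gridSpan > 1) {
                        html.append(" colspan=\"").append(cell.gridSpan).append('"');
                    }
                    html.append('>').append(escape(cell.text)).append("</td>");
                }
                gridCol += cell.gridSpan;
            }
            html.append("</tr>");
        }
        return html.append("</table>").toString();
    }

    public static void main(String[] args) {
        // The vertically merged 2x2 table from the first XML listing
        List<List<Cell>> rows = Arrays.asList(
                Arrays.asList(new Cell("0,0", VMerge.RESTART, 1), new Cell("0,1", VMerge.NONE, 1)),
                Arrays.asList(new Cell("", VMerge.CONTINUE, 1), new Cell("1,1", VMerge.NONE, 1)));
        System.out.println(toHtml(rows));
    }
}
```

Tracking the grid column rather than the cell index matters because a row containing a cell with gridSpan greater than 1 has fewer w:tc elements than the grid has columns.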
Here is a demo that reads table content and converts it to HTML, for your reference.

