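On some projects, developers need to extract data from Word documents and export it to a database. The biggest challenge is that the existing Word documents must be supported as they are.
There are thousands of Word documents that share the same format and contain multiple blocks of data. The format was never designed to be read by another system: there are no bookmarks, no merge fields, and no standard markers that identify the actual data. Fortunately, all input fields sit inside tables, but the tables themselves vary in layout. Some have a single row or cell, while others differ widely.
We can use Aspose.Words to create and manipulate Word documents.
First, create a matching table model in C# so that we can use it later when reading the document.
Below is a class named WordDocumentTable with three properties: TableID, RowID, and ColumnID. As mentioned earlier, the documents carry no table or row identifiers, so these properties simply describe a position within the Word document. Indexes are assumed to start at 0.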
// Describes a position in the Word document by zero-based indexes.
public class WordDocumentTable
{
    // Whole table.
    public WordDocumentTable(int PiTableID)
    {
        MiTableID = PiTableID;
    }

    // A cell in a single-row table. Note the argument order: column before row.
    public WordDocumentTable(int PiTableID, int PiColumnID)
    {
        MiTableID = PiTableID;
        MiColumnID = PiColumnID;
    }

    // A specific cell in a specific row.
    public WordDocumentTable(int PiTableID, int PiColumnID, int PiRowID)
    {
        MiTableID = PiTableID;
        MiColumnID = PiColumnID;
        MiRowID = PiRowID;
    }

    private int MiTableID = 0;
    public int TableID
    {
        get { return MiTableID; }
        set { MiTableID = value; }
    }

    private int MiRowID = 0;
    public int RowID
    {
        get { return MiRowID; }
        set { MiRowID = value; }
    }

    private int MiColumnID = 0;
    public int ColumnID
    {
        get { return MiColumnID; }
        set { MiColumnID = value; }
    }
}
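As a side note, the same model could be written more compactly. The sketch below assumes C# 6 or later and uses a hypothetical name, TableCellRef, which is not part of the original code; it only illustrates how the three constructor forms map to positions and that unspecified indexes default to 0.

```csharp
using System;

// Hypothetical compact variant of WordDocumentTable (illustrative name).
public class TableCellRef
{
    // Same argument order as the original constructors: table, column, row.
    public TableCellRef(int tableId, int columnId = 0, int rowId = 0)
    {
        TableID = tableId;
        ColumnID = columnId;
        RowID = rowId;
    }

    public int TableID { get; set; }
    public int RowID { get; set; }
    public int ColumnID { get; set; }

    public override string ToString() =>
        $"Table {TableID}, Row {RowID}, Column {ColumnID}";
}

public static class TableCellRefDemo
{
    public static void Main()
    {
        Console.WriteLine(new TableCellRef(0));       // Table 0, Row 0, Column 0
        Console.WriteLine(new TableCellRef(1, 1));    // Table 1, Row 0, Column 1
        Console.WriteLine(new TableCellRef(2, columnId: 1, rowId: 1)); // Table 2, Row 1, Column 1
    }
}
```

Using named arguments at the call site, as in the last line, makes the column-before-row order harder to get wrong.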
Now for the extraction step. Below is the set of table cells I want to read from the document.
private List<WordDocumentTable> WordDocumentTables
{
    get
    {
        List<WordDocumentTable> wordDocTable = new List<WordDocumentTable>();

        // Reads the data from the first table of the document.
        wordDocTable.Add(new WordDocumentTable(0));

        // Reads the data from the second table and its second column.
        // This table has only one row.
        wordDocTable.Add(new WordDocumentTable(1, 1));

        // Reads the data from the third table, second row and second cell.
        wordDocTable.Add(new WordDocumentTable(2, 1, 1));

        return wordDocTable;
    }
}
The following code extracts the data from the Aspose.Words document based on table, row, and cell.
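// Required namespaces:
// using System;
// using System.IO;
// using Aspose.Words;
// using Aspose.Words.Tables;

public void ExtractTableData(byte[] PobjData)
{
    using (MemoryStream LobjStream = new MemoryStream(PobjData))
    {
        Document LobjAsposeDocument = new Document(LobjStream);

        foreach (WordDocumentTable wordDocTable in WordDocumentTables)
        {
            // Locate the table by its zero-based position in the document.
            Aspose.Words.Tables.Table table = (Aspose.Words.Tables.Table)
                LobjAsposeDocument.GetChild(NodeType.Table, wordDocTable.TableID, true);

            // Default: the text of the whole table.
            string cellData = table.Range.Text;

            if (wordDocTable.ColumnID > 0)
            {
                // Note: ToTxt() comes from older Aspose.Words releases. If it is
                // unavailable in your version, ToString(SaveFormat.Text) serves
                // the same purpose.
                if (wordDocTable.RowID == 0)
                {
                    // No row given: index into all cells of the table.
                    NodeCollection LobjCells = table.GetChildNodes(NodeType.Cell, true);
                    cellData = LobjCells[wordDocTable.ColumnID].ToTxt();
                }
                else
                {
                    // Row given: pick the row first, then the cell within it.
                    NodeCollection LobjRows = table.GetChildNodes(NodeType.Row, true);
                    cellData = ((Row)(LobjRows[wordDocTable.RowID]))
                        .Cells[wordDocTable.ColumnID].ToTxt();
                }
            }

            // The original was missing the closing parenthesis of Console.WriteLine.
            Console.WriteLine(String.Format("Data in Table {0}, Row {1}, Column {2} : {3}",
                wordDocTable.TableID, wordDocTable.RowID, wordDocTable.ColumnID, cellData));
        }
    }
}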
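The branching in ExtractTableData can be easier to follow without the Aspose.Words types. The following is a minimal sketch of the same selection rule over plain jagged arrays; the class CellSelector, its Select method, and the sample data are all hypothetical stand-ins, and joining cells with a space for the whole-table case is a simplification of what table.Range.Text returns.

```csharp
using System;
using System.Linq;

public static class CellSelector
{
    // tables[t][r][c] stands in for the text of table t, row r, cell c.
    public static string Select(string[][][] tables, int tableId, int columnId, int rowId)
    {
        string[][] table = tables[tableId];

        // ColumnID of 0 means "take the whole table".
        if (columnId <= 0)
        {
            return string.Join(" ", table.SelectMany(row => row));
        }

        // RowID of 0: index into the flattened list of all cells in the table.
        if (rowId == 0)
        {
            return table.SelectMany(row => row).ElementAt(columnId);
        }

        // Otherwise pick the row first, then the cell within that row.
        return table[rowId][columnId];
    }
}

public static class CellSelectorDemo
{
    public static void Main()
    {
        // Made-up sample data: three tables of increasing size.
        string[][][] tables = new string[][][]
        {
            new string[][] { new string[] { "Title" } },
            new string[][] { new string[] { "Name", "Alice" } },
            new string[][]
            {
                new string[] { "Item", "Qty" },
                new string[] { "Pen", "3" }
            }
        };

        Console.WriteLine(CellSelector.Select(tables, 0, 0, 0)); // Title
        Console.WriteLine(CellSelector.Select(tables, 1, 1, 0)); // Alice
        Console.WriteLine(CellSelector.Select(tables, 2, 1, 1)); // 3
    }
}
```

One consequence of this rule is worth noting: because a ColumnID of 0 selects the whole table, the first cell of a row cannot be addressed on its own with this scheme.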