The Mysteries of Office Files: Parsing Word, PowerPoint, and Other Files on .NET Without Office (Part 3)

Reposted article. 2013-03-31 01:33. Tags: C#, Office, files, Word, ppt, .NET, reading, file parsing, parsing, PowerPoint, text

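【Digression】

I suddenly realized that parsing Office documents is much easier now than it was in 2010, because the documentation has been updated many times since then and keeps getting easier to read. For the first two articles I still relied on many of Microsoft's old documents (from 2010). For this one I downloaded all of them again and found each one far more readable and more systematically organized; Microsoft really does seem to be putting effort into this openness. Then again, most of these documents were only first published in 2010, so looking back, the timing was lucky.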

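【Series Index】

The Mysteries of Office Files: Parsing Word, PowerPoint, and Other Files on .NET Without Office (Part 1)
Reading DocumentSummaryInformation and SummaryInformation from Office binary documents
The Mysteries of Office Files: Parsing Word, PowerPoint, and Other Files on .NET Without Office (Part 2)
Extracting the text of Word binary documents (.doc), including body text, headers, footers, comments, and so on
The Mysteries of Office Files: Parsing Word, PowerPoint, and Other Files on .NET Without Office (Part 3)
A detailed look at the storage structure of Office binary documents, and extracting the text of PowerPoint binary documents (.ppt)
The Mysteries of Office Files: Parsing Word, PowerPoint, and Other Files on .NET Without Office (Final Part)
How to parse Office Open XML documents (.docx, .pptx), and common open-source libraries for parsing Office files

【Article Index】

Strange documents, and FAT and DIFAT
Strange DocumentSummary and Summary
Structure and parsing of the PowerPoint Document
Related links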

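【1. Strange Documents, and FAT and DIFAT】

When you first start parsing Office files, you usually begin with Word documents (.doc). A .doc file has few complications, so following the procedure works smoothly and nothing goes wrong. PowerPoint is where the problems begin. For example, if you parse the Directory as described in the first article, you find that many PowerPoint files have no DocumentSummaryInformation. That is not the worst of it: some appear to have no PowerPoint Document stream at all.

This problem is not limited to PowerPoint; you run into it with Excel as well. So what is going on? When reading the Directory, we assumed that the sectors holding it are laid out in ascending EntryID order. In reality the DirectoryEntry records are not necessarily arranged that way, and the sector holding some entries may even come before the one holding the Root Entry.

You may remember the FAT and DIFAT structures. We have been reading values such as their start positions and counts since the first article but never used them, so this article begins with a closer look at both.

Microsoft's documentation describes FAT and DIFAT entries as 4-byte values. What are they for? A Windows compound document is stored in units of sectors, but the logical order of the sectors is not necessarily their physical order in the file. We therefore need a structure that records the order of all sectors, and that structure is the FAT.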

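What does the FAT store? A FAT sector is itself an ordinary sector, except that its content is the IDs of other sectors: with 512-byte sectors, each FAT sector holds 128 SectorIDs. The FAT works as a lookup table indexed by SectorID, where the entry for a sector holds the ID of the sector that follows it, so together these entries describe the actual order of the sectors. Once we have all FAT sectors and their SectorIDs, we know the order of every sector. In practice we only need to store the SectorIDs of the FAT sectors themselves, and then look up the next SectorID in the FAT whenever we need it.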
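The lookup described above can be sketched on an in-memory table. This is only a minimal illustration, not the article's reader: `FatChainSketch`, `FollowChain`, and the sample table are hypothetical, and the whole FAT is assumed to be already loaded into one flat array.

```csharp
using System;
using System.Collections.Generic;

public static class FatChainSketch
{
    public const uint EndOfChain = 0xFFFFFFFE;

    // Returns the sector IDs of one chain in logical order.
    // fat[i] holds the ID of the sector that follows sector i.
    public static List<uint> FollowChain(uint[] fat, uint startSector)
    {
        var chain = new List<uint>();
        uint sector = startSector;

        while (sector != EndOfChain)
        {
            // An ID outside the table, or a chain longer than the table,
            // can only come from a corrupt file (for example a loop).
            if (sector >= fat.Length || chain.Count >= fat.Length)
            {
                throw new InvalidOperationException("Corrupt FAT chain.");
            }

            chain.Add(sector);
            sector = fat[sector];
        }

        return chain;
    }
}
```

With the hypothetical table `{ 2, EndOfChain, 1, FreeSector }`, the chain starting at sector 0 visits sectors 0, 2, 1: the logical order differs from the physical order, which is exactly the situation that breaks a reader assuming sequential sectors.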

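Remember reading the file Header in the first article? The end of the Header holds 109 SectorIDs pointing to FAT sectors. If all 109 FAT sectors are full, they cover 109 * 128 SectorIDs, that is 109 * 128 * 512 bytes besides the Header, so the whole file is at most 512 + 109 * 128 * 512 = 7143936 bytes = 6976.5 KB = 6.81 MB. What if the file is larger? That is what the DIFAT is for: it records the SectorIDs of the remaining FAT sectors, effectively extending the 109 entries in the Header. So from the Header and the DIFAT we can obtain the SectorIDs of all FAT sectors, and from those FAT sectors the order of every sector.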
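The size limit above is easy to re-derive in code. The sketch below just repeats the arithmetic for 512-byte sectors; the class and member names are made up for illustration.

```csharp
using System;

public static class FatCapacitySketch
{
    public const int HeaderSize = 512;
    public const int SectorSize = 512;
    public const int HeaderFatSlots = 109;

    // Largest file (in bytes) whose FAT sectors all fit in the header slots.
    public static long MaxBytesWithoutDifat()
    {
        long idsPerFatSector = SectorSize / 4;   // each SectorID is 4 bytes
        return HeaderSize + HeaderFatSlots * idsPerFatSector * SectorSize;
    }
}
```

`FatCapacitySketch.MaxBytesWithoutDifat()` returns 7143936, and dividing by 1024.0 gives 6976.5 KB, matching the figures in the text.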

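First we read the SectorIDs of the first 109 FAT sectors from the file Header:

protected List<UInt32> m_fatSectors;

// Reads up to 109 FAT sector IDs from the header. The reader must already be
// positioned at the start of the header's FAT sector array (offset 0x4C).
private void ReadFirst109FatSectors()
{
    for (Int32 i = 0; i < 109; i++)
    {
        UInt32 nextSector = this.m_reader.ReadUInt32();

        // Unused slots are marked FreeSector, so the first one ends the list.
        if (nextSector == CompoundBinaryFile.FreeSector)
        {
            break;
        }

        this.m_fatSectors.Add(nextSector);
    }
}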

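Note that the code does not check whether there are more than 109 FAT sectors. An unused FAT slot is marked FreeSector (0xFFFFFFFF), so reaching FreeSector means no more FAT sectors follow and the loop can stop. The common marker values are listed below.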

protected const UInt32 MaxRegSector = 0xFFFFFFFA;
protected const UInt32 DifSector = 0xFFFFFFFC;
protected const UInt32 FatSector = 0xFFFFFFFD;
protected const UInt32 EndOfChain = 0xFFFFFFFE;
protected const UInt32 FreeSector = 0xFFFFFFFF;

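If there are more than 109 FAT sectors, we also need to read the DIFAT to find the rest. With 512-byte sectors, each DIFAT sector stores only 127 FAT SectorIDs; its last 4 bytes hold the SectorID of the next DIFAT sector, so we can walk the whole chain and collect every FAT sector.

private void ReadLastFatSectors()
{
    UInt32 difSectorID = this.m_difStartSectorID;

    // The header stores EndOfChain here when the file has no DIFAT sectors,
    // in which case there is nothing to read.
    while (difSectorID != CompoundBinaryFile.EndOfChain && difSectorID != CompoundBinaryFile.FreeSector)
    {
        Int64 entryStart = this.GetSectorOffset(difSectorID);
        this.m_stream.Seek(entryStart, SeekOrigin.Begin);

        // The first 127 entries of a DIFAT sector are FAT sector IDs.
        for (Int32 i = 0; i < 127; i++)
        {
            UInt32 fatSectorID = this.m_reader.ReadUInt32();

            if (fatSectorID == CompoundBinaryFile.FreeSector)
            {
                return;
            }

            this.m_fatSectors.Add(fatSectorID);
        }

        // The last 4 bytes hold the ID of the next DIFAT sector.
        difSectorID = this.m_reader.ReadUInt32();
    }
}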

By now you can probably guess the next step. Earlier we took it for granted that the sector order is the storage order, which is why many DirectoryEntry records could not be read. So we should first obtain the true order of the sectors occupied by the Directory.

protected List<UInt32> m_dirSectors;

// Looks up the FAT entry for sectorID, which holds the ID of the next sector
// in its chain. 128 is the number of 4-byte entries in a 512-byte FAT sector.
protected UInt32 GetNextSectorID(UInt32 sectorID)
{
    UInt32 sectorInFile = this.m_fatSectors[(Int32)(sectorID / 128)];
    this.m_stream.Seek(this.GetSectorOffset(sectorInFile) + 4 * (sectorID % 128), SeekOrigin.Begin);

    return this.m_reader.ReadUInt32();
}

private void ReadDirectory()
{
    if (this.m_reader == null)
    {
        return;
    }

    // Follow the FAT chain to collect the directory sectors in logical order.
    this.m_dirSectors = new List<UInt32>();
    UInt32 sectorID = this.m_dirStartSectorID;

    while (sectorID != CompoundBinaryFile.EndOfChain)
    {
        this.m_dirSectors.Add(sectorID);
        sectorID = this.GetNextSectorID(sectorID);
    }

    // Entry 0 is the Root Entry; the rest of the tree hangs off its child.
    UInt32 leftSiblingEntryID, rightSiblingEntryID, childEntryID;
    this.m_dirRootEntry = GetDirectoryEntry(0, null, out leftSiblingEntryID, out rightSiblingEntryID, out childEntryID);
    this.ReadDirectoryEntry(this.m_dirRootEntry, childEntryID);
}

The method that computes the offset of each DirectoryEntry should then be changed to:

// Maps an entry ID to its file offset using the ordered list of directory sectors.
protected Int64 GetDirectoryEntryOffset(UInt32 entryID)
{
    UInt32 sectorID = this.m_dirSectors[(Int32)(entryID * CompoundBinaryFile.DirectoryEntrySize / this.m_sectorSize)];
    return this.GetSectorOffset(sectorID) + (entryID * CompoundBinaryFile.DirectoryEntrySize) % this.m_sectorSize;
}

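Now every DirectoryEntry can be retrieved. Note that this applies to more than the Directory itself: before reading the content of any DirectoryEntry, you should likewise first obtain the IDs and order of the sectors that entry occupies and then read along that chain. The approach is the same, so the code is not repeated here; see the program attached at the end of this article.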
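Since the article leaves this step to the attached program, here is one way it could look. This is a simplified sketch under stated assumptions, not the author's implementation: the whole file and the whole FAT are assumed to be in memory, the header is assumed to be 512 bytes, and `StreamChainSketch.ReadStream` and its parameters are hypothetical names.

```csharp
using System;
using System.IO;

public static class StreamChainSketch
{
    public const uint EndOfChain = 0xFFFFFFFE;
    public const int HeaderSize = 512;

    // Reads `length` bytes of one stream by visiting its sectors in chain order.
    // fat[i] holds the ID of the sector that follows sector i.
    public static byte[] ReadStream(byte[] file, uint[] fat, uint startSector, int length, int sectorSize)
    {
        var result = new MemoryStream();
        uint sector = startSector;
        int remaining = length;

        while (sector != EndOfChain && remaining > 0)
        {
            int offset = HeaderSize + (int)sector * sectorSize;
            int count = Math.Min(sectorSize, remaining); // the last sector may be partly used

            result.Write(file, offset, count);
            remaining -= count;
            sector = fat[sector];
        }

        return result.ToArray();
    }
}
```

The loop stops either at EndOfChain or once the recorded length has been read, so padding at the end of the last sector never leaks into the result.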

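【2. Strange DocumentSummary and Summary】

Once all DirectoryEntry records really can be retrieved, you may notice that the DocumentSummary and Summary of many files still cannot be read: after taking the SectorID and seeking to that position, the data read is nothing like what is expected. Interestingly, the DocumentSummary and Summary streams that fail this way are all shorter than 4096 bytes.

So where is the problem? Remember which structure we read in the first article but still have not used? That's right: the MiniFAT. As you may have guessed, the SectorID recorded in a DirectoryEntry is not necessarily a regular SectorID in the FAT; it may be a Mini-SectorID, which is why the data read differs from what is expected. Windows compound files have a rule that any stream smaller than 4096 bytes is stored in Mini-Sectors. The value 4096 is itself recorded in the Header (the 4-byte field at offset 0x38), although it is fixed at 4096.

Just as with the FAT, the ordering of Mini-Sectors is recorded in the Mini-FAT. Regular sectors start right after the Header, so where do Mini-Sectors start? According to the official documentation, the first sector occupied by the Mini-Sectors is the SectorID that the Root Entry points to, and their total length is the length recorded in the Root Entry. We can use the FAT from earlier to obtain the order of all sectors occupied by the Mini-Sectors.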

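protected List<UInt32> m_miniSectors;

// Collects, in logical order, the regular sectors that hold the Mini-Sectors.
// The chain starts at the SectorID recorded in the Root Entry.
private void ReadMiniSectors()
{
    this.m_miniSectors = new List<UInt32>();
    UInt32 sectorID = this.m_dirRootEntry.SectorID;

    // The Root Entry points to EndOfChain when the file has no Mini-Sectors.
    while (sectorID != CompoundBinaryFile.EndOfChain)
    {
        this.m_miniSectors.Add(sectorID);
        sectorID = this.GetNextSectorID(sectorID);
    }
}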

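Knowing the order of the sectors that hold the Mini-Sectors is not enough; we also need the order of the Mini-Sectors themselves. This works almost the same way as the FAT, so I will not go into detail.

protected List<UInt32> m_minifatSectors;

// Collects, in logical order, the sectors that hold the Mini-FAT.
private void ReadMiniFatSectors()
{
    this.m_minifatSectors = new List<UInt32>();
    UInt32 sectorID = this.m_miniFatStartSectorID;

    // The start ID is EndOfChain when the file has no Mini-FAT.
    while (sectorID != CompoundBinaryFile.EndOfChain)
    {
        this.m_minifatSectors.Add(sectorID);
        sectorID = this.GetNextSectorID(sectorID);
    }
}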

Then we write a new GetEntryOffset that handles both kinds of DirectoryEntry.

// Streams at or above the cutoff size (4096) start in a regular sector;
// smaller streams start in a Mini-Sector.
protected Int64 GetEntryOffset(DirectoryEntry entry)
{
    if (entry.Length >= this.m_miniCutoffSize)
    {
        return GetSectorOffset(entry.SectorID);
    }
    else
    {
        return GetMiniSectorOffset(entry.SectorID);
    }
}

// Regular sectors are numbered from the end of the header.
protected Int64 GetSectorOffset(UInt32 sectorID)
{
    return HeaderSize + this.m_sectorSize * sectorID;
}

// Mini-Sectors are packed inside the regular sectors listed in m_miniSectors:
// first find the regular sector, then the offset within it.
protected Int64 GetMiniSectorOffset(UInt32 miniSectorID)
{
    UInt32 sectorID = this.m_miniSectors[(Int32)((miniSectorID * this.m_miniSectorSize) / this.m_sectorSize)];
    UInt32 offset = (UInt32)((miniSectorID * this.m_miniSectorSize) % this.m_sectorSize);

    return HeaderSize + this.m_sectorSize * sectorID + offset;
}

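Try again now: can the DocumentSummary and Summary of every Office document be read?

【3. Structure and Parsing of the PowerPoint Document】

In Word files the WordDocument stream always starts at the first sector after the Header, whereas the PowerPoint Document stream may not. On the other hand, PowerPoint is simpler than Word in one respect: to read Word text you must first read the FIB in WordDocument and the data in the TableStream, while all PowerPoint slide data is stored in the PowerPoint Document stream.

In short, PowerPoint content is built from Records, which come in two kinds: Container Records and Atom Records. As the names suggest, the former are containers and the latter are the content inside them, so what the PowerPoint Document stores is a tree.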

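Every Record has the following structure:

Bytes 000H to 001H, a 2-byte UInt16, hold the Record version: the low 4 bits are recVer (0xF always means a Container) and the high 12 bits are recInstance.
Bytes 002H to 003H, a 2-byte UInt16, hold the Record type, recType.
Bytes 004H to 007H, a 4-byte UInt32, hold the length of the Record content, recLen.
The following recLen bytes are the Record content.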
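The layout above can be illustrated by decoding a header from a raw 8-byte array. This is a standalone sketch, separate from the Record class used in the article; `RecordHeaderSketch.Parse` is a hypothetical helper, and the bytes are read as little-endian by hand so the result does not depend on the machine's byte order.

```csharp
using System;

public static class RecordHeaderSketch
{
    // Splits the 8-byte Record header into its four fields.
    public static void Parse(byte[] header, out int recVer, out int recInstance,
                             out ushort recType, out uint recLen)
    {
        if (header == null || header.Length < 8)
        {
            throw new ArgumentException("A Record header is 8 bytes long.");
        }

        int verAndInstance = header[0] | (header[1] << 8);

        recVer = verAndInstance & 0xF;        // low 4 bits
        recInstance = verAndInstance >> 4;    // high 12 bits
        recType = (ushort)(header[2] | (header[3] << 8));
        recLen = (uint)header[4]
               | ((uint)header[5] << 8)
               | ((uint)header[6] << 16)
               | ((uint)header[7] << 24);
    }
}
```

For example, the bytes `0F 00 E8 03 10 00 00 00` decode to recVer 0xF (a Container), recInstance 0, recType 0x03E8, and recLen 16. Note that recInstance needs a shift by 4 bits after masking; masking alone leaves the value 16 times too large.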

Common recType values:

0x03E8 (1000): DocumentContainer.
0x0FF0 (4080): MasterListWithTextContainer, SlideListWithTextContainer, or NotesListWithTextContainer.
0x03F3 (1011): MasterPersistAtom, SlidePersistAtom, or NotesPersistAtom.
0x0F9F (3999): TextHeaderAtom.
0x03EA (1002): EndDocumentAtom.
0x03F8 (1016): MainMasterContainer.
0x040C (1036): DrawingContainer.
0x03EE (1006): SlideContainer.
0x0FD9 (4057): SlideHeadersFootersContainer or NotesHeadersFootersContainer.
0x03EF (1007): SlideAtom.
0x03F0 (1008): NotesContainer.
0x0FA0 (4000): TextCharsAtom.
0x0FA8 (4008): TextBytesAtom.
0x0FBA (4026): CString, an Atom used to store many kinds of text.

PowerPoint supports hundreds of Record types, so only those likely to be needed are listed here. For the full list, see section 2.13.24 of Microsoft's "[MS-PPT].pdf" (http://msdn.microsoft.com/en-us/library/dd945336).

To get a better understanding of Records and the PowerPoint Document, we create a Record class:

using System;
using System.Collections.Generic;

public enum RecordType : uint
{
    Unknown = 0,
    DocumentContainer = 0x03E8,
    ListWithTextContainer = 0x0FF0,
    PersistAtom = 0x03F3,
    TextHeaderAtom = 0x0F9F,
    EndDocumentAtom = 0x03EA,
    MainMasterContainer = 0x03F8,
    DrawingContainer = 0x040C,
    SlideContainer = 0x03EE,
    HeadersFootersContainer = 0x0FD9,
    SlideAtom = 0x03EF,
    NotesContainer = 0x03F0,
    TextCharsAtom = 0x0FA0,
    TextBytesAtom = 0x0FA8,
    CString = 0x0FBA
}

public class Record
{
    #region Fields
    private UInt16 m_recVer;
    private UInt16 m_recInstance;
    private RecordType m_recType;
    private UInt32 m_recLen;
    private Int64 m_offset;

    private Int32 m_deepth;
    private Record m_parent;
    private List<Record> m_children;
    #endregion

    #region Properties
    /// <summary>
    /// Gets the RecordVersion
    /// </summary>
    public UInt16 RecordVersion
    {
        get { return this.m_recVer; }
    }

    /// <summary>
    /// Gets the RecordInstance
    /// </summary>
    public UInt16 RecordInstance
    {
        get { return this.m_recInstance; }
    }

    /// <summary>
    /// Gets the Record type
    /// </summary>
    public RecordType RecordType
    {
        get { return this.m_recType; }
    }

    /// <summary>
    /// Gets the size of the Record content
    /// </summary>
    public UInt32 RecordLength
    {
        get { return this.m_recLen; }
    }

    /// <summary>
    /// Gets the offset of the Record content relative to the PowerPoint Document
    /// </summary>
    public Int64 Offset
    {
        get { return this.m_offset; }
    }

    /// <summary>
    /// Gets the depth of the Record in the tree
    /// </summary>
    public Int32 Deepth
    {
        get { return this.m_deepth; }
    }

    /// <summary>
    /// Gets the parent node of the Record
    /// </summary>
    public Record Parent
    {
        get { return this.m_parent; }
    }

    /// <summary>
    /// Gets the child nodes of the Record
    /// </summary>
    public List<Record> Children
    {
        get { return this.m_children; }
    }
    #endregion

    #region Constructor
    /// <summary>
    /// Initializes a new Record
    /// </summary>
    /// <param name="parent">Parent node</param>
    /// <param name="version">Combined RecordVersion and RecordInstance</param>
    /// <param name="type">Record type</param>
    /// <param name="length">Size of the Record content</param>
    /// <param name="offset">Offset of the Record content relative to the PowerPoint Document</param>
    public Record(Record parent, UInt16 version, UInt16 type, UInt32 length, Int64 offset)
    {
        this.m_recVer = (UInt16)(version & 0xF);      // low 4 bits
        this.m_recInstance = (UInt16)(version >> 4);  // high 12 bits
        this.m_recType = (RecordType)type;
        this.m_recLen = length;
        this.m_offset = offset;
        this.m_deepth = (parent == null ? 0 : parent.m_deepth + 1);
        this.m_parent = parent;

        // recVer 0xF marks a Container, which can hold child Records.
        if (m_recVer == 0xF)
        {
            this.m_children = new List<Record>();
        }
    }
    #endregion

    #region Methods
    public void AddChild(Record entry)
    {
        if (this.m_children == null)
        {
            this.m_children = new List<Record>();
        }

        this.m_children.Add(entry);
    }
    #endregion
}

Then we traverse all nodes to read the Record tree:

private StringBuilder m_recordTree;
private List<Record> m_records;

/// <summary>
/// Gets the Record tree of the PowerPoint file as text
/// </summary>
public String RecordTree
{
    get { return this.m_recordTree.ToString(); }
}

protected override void ReadContent()
{
    DirectoryEntry entry = this.m_dirRootEntry.GetChild("PowerPoint Document");

    if (entry == null)
    {
        return;
    }

    Int64 entryStart = this.GetEntryOffset(entry);
    this.m_stream.Seek(entryStart, SeekOrigin.Begin);

    this.m_recordTree = new StringBuilder();
    this.m_records = new List<Record>();
    Record record = null;

    // Read top-level Records until the data runs out or an empty Record appears.
    while (this.m_stream.Position < this.m_stream.Length)
    {
        record = this.ReadRecord(null);

        if (record == null || record.RecordType == 0)
        {
            break;
        }
    }
}

private Record ReadRecord(Record parent)
{
    Record record = GetRecord(parent);

    if (record == null)
    {
        return null;
    }
    else
    {
        // One line per Record, indented by its depth.
        this.m_recordTree.Append('-', record.Deepth * 2);
        this.m_recordTree.AppendFormat("[{0}]-[{1}]-[Len:{2}]", record.RecordType, record.Deepth, record.RecordLength);
        this.m_recordTree.AppendLine();
    }

    if (parent == null)
    {
        this.m_records.Add(record);
    }
    else
    {
        parent.AddChild(record);
    }

    if (record.RecordVersion == 0xF)
    {
        // Container: its content is a sequence of child Records.
        while (this.m_stream.Position < record.Offset + record.RecordLength)
        {
            // Stop if the data ends before the Container does.
            if (this.ReadRecord(record) == null)
            {
                break;
            }
        }
    }
    else
    {
        // Atom: skip its content.
        this.m_stream.Seek(record.RecordLength, SeekOrigin.Current);
    }

    return record;
}

private Record GetRecord(Record parent)
{
    // A Record header is 8 bytes; stop when fewer remain.
    if (this.m_stream.Position + 8 > this.m_stream.Length)
    {
        return null;
    }

    UInt16 version = this.m_reader.ReadUInt16();
    UInt16 type = this.m_reader.ReadUInt16();
    UInt32 length = this.m_reader.ReadUInt32();

    // The stream position now points at the Record content.
    return new Record(parent, version, type, length, this.m_stream.Position);
}

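The result is an indented tree with one line per Record, in the form [RecordType]-[depth]-[Len:length].

To read all the text in a PowerPoint file, we only need to read every TextCharsAtom, TextBytesAtom, and CString. Note that TextBytesAtom stores one byte per character (only characters below U+0100), while the other two store two-byte Unicode text. Having already read Word text in the previous article, this should be easy.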
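The article's code relies on a `StringHelper.GetString` helper whose body is not shown here. As an illustration of the two encodings only, and not the author's implementation, the decoding could be sketched as follows; `TextAtomSketch` and its method names are assumptions.

```csharp
using System;
using System.Text;

public static class TextAtomSketch
{
    // TextCharsAtom and CString: two bytes per character, UTF-16 little-endian.
    public static string DecodeTextChars(byte[] data)
    {
        return Encoding.Unicode.GetString(data);
    }

    // TextBytesAtom: one byte per character. Each byte is the low byte of a
    // UTF-16 code unit whose high byte is zero, so it maps directly to a char.
    public static string DecodeTextBytes(byte[] data)
    {
        var sb = new StringBuilder(data.Length);

        foreach (byte b in data)
        {
            sb.Append((char)b);
        }

        return sb.ToString();
    }
}
```

Casting each byte to `char` avoids depending on the machine's ANSI code page, which would misread bytes above 0x7F on systems with a different default encoding.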

All we need to do is replace the statement that skips an Atom's content, "this.m_stream.Seek(record.RecordLength, SeekOrigin.Current);", with the following code.

if (record.RecordType == RecordType.TextCharsAtom || record.RecordType == RecordType.CString) // two-byte Unicode text
{
    Byte[] data = this.m_reader.ReadBytes((Int32)record.RecordLength);
    this.m_allText.Append(StringHelper.GetString(true, data));
    this.m_allText.AppendLine();
}
else if (record.RecordType == RecordType.TextBytesAtom) // single-byte text (characters below U+0100)
{
    Byte[] data = this.m_reader.ReadBytes((Int32)record.RecordLength);
    this.m_allText.Append(StringHelper.GetString(false, data));
    this.m_allText.AppendLine();
}
else
{
    // Not a text Atom: skip its content.
    this.m_stream.Seek(record.RecordLength, SeekOrigin.Current);
}

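Reading this way, however, also pulls in text from master slides and other content.

So we can decide whether to read a piece of text by checking the type of its parent Record. Text is usually stored under ListWithTextContainer and HeadersFootersContainer, so we only need to check whether the text Record's parent is one of these two. One caveat: for .ppt files saved by PowerPoint 2013, checking only these two yields no text; we also need to accept a parent Record whose type is 0xF00D, although this RecordType is not described in the latest documentation at the time of writing.

Here is the complete code: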

protected override void ReadContent()
{
    DirectoryEntry entry = this.m_dirRootEntry.GetChild("PowerPoint Document");

    if (entry == null)
    {
        return;
    }

    Int64 entryStart = this.GetEntryOffset(entry);
    this.m_stream.Seek(entryStart, SeekOrigin.Begin);

    #region Test code
    this.m_recordTree = new StringBuilder();
    #endregion

    this.m_allText = new StringBuilder();
    this.m_records = new List<Record>();
    Record record = null;

    // Read top-level Records until the data runs out or an empty Record appears.
    while (this.m_stream.Position < this.m_stream.Length)
    {
        record = this.ReadRecord(null);

        if (record == null || record.RecordType == 0)
        {
            break;
        }
    }

    this.m_allText = new StringBuilder(StringHelper.ReplaceString(this.m_allText.ToString()));
}

private Record ReadRecord(Record parent)
{
    Record record = GetRecord(parent);

    if (record == null)
    {
        return null;
    }
    #region Test code
    else
    {
        // One line per Record, indented by its depth.
        this.m_recordTree.Append('-', record.Deepth * 2);
        this.m_recordTree.AppendFormat("[{0}]-[{1}]-[Len:{2}]", record.RecordType, record.Deepth, record.RecordLength);
        this.m_recordTree.AppendLine();
    }
    #endregion

    if (parent == null)
    {
        this.m_records.Add(record);
    }
    else
    {
        parent.AddChild(record);
    }

    if (record.RecordVersion == 0xF)
    {
        // Container: its content is a sequence of child Records.
        while (this.m_stream.Position < record.Offset + record.RecordLength)
        {
            // Stop if the data ends before the Container does.
            if (this.ReadRecord(record) == null)
            {
                break;
            }
        }
    }
    else
    {
        // Only keep text whose parent is one of the known text containers.
        if (record.Parent != null && (
            record.Parent.RecordType == RecordType.ListWithTextContainer ||
            record.Parent.RecordType == RecordType.HeadersFootersContainer ||
            (UInt32)record.Parent.RecordType == 0xF00D))
        {
            if (record.RecordType == RecordType.TextCharsAtom || record.RecordType == RecordType.CString) // two-byte Unicode text
            {
                Byte[] data = this.m_reader.ReadBytes((Int32)record.RecordLength);
                this.m_allText.Append(StringHelper.GetString(true, data));
                this.m_allText.AppendLine();
            }
            else if (record.RecordType == RecordType.TextBytesAtom) // single-byte text (characters below U+0100)
            {
                Byte[] data = this.m_reader.ReadBytes((Int32)record.RecordLength);
                this.m_allText.Append(StringHelper.GetString(false, data));
                this.m_allText.AppendLine();
            }
            else
            {
                this.m_stream.Seek(record.RecordLength, SeekOrigin.Current);
            }
        }
        else
        {
            this.m_stream.Seek(record.RecordLength, SeekOrigin.Current);
        }
    }

    return record;
}

private Record GetRecord(Record parent)
{
    // A Record header is 8 bytes; stop when fewer remain.
    if (this.m_stream.Position + 8 > this.m_stream.Length)
    {
        return null;
    }

    UInt16 version = this.m_reader.ReadUInt16();
    UInt16 type = this.m_reader.ReadUInt16();
    UInt32 length = this.m_reader.ReadUInt32();

    // The stream position now points at the Record content.
    return new Record(parent, version, type, length, this.m_stream.Position);
}

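Finally, the complete code for all three articles can be downloaded from: https://github.com/mayswind/SimpleOfficeReader

【4. Related Links】

1. Microsoft Open Specifications: http://www.microsoft.com/openspecifications/en/us/programs/osp/default.aspx
2. Reading the text of MS Word (.doc) files with PHP: https://imethan.com/post-2009-10-06-17-59.html
3. Office file formats: http://www.programmer-club.com.tw/ShowSameTitleN/general/2681.html
4. LAOLA file system: http://stuff.mit.edu/afs/athena/astaff/project/mimeutils/share/laola/guide.html

【Postscript】

I meant to keep the test code as short and lean as possible, but several classes had still grown quite large by this third article. This concludes text extraction from Office binary documents; the next article briefly covers text extraction from OOXML (the format used since Office 2007). Also, if you found this article useful, please give it a recommendation, and if it helped you, leave a comment: it costs you nothing and it encourages me. hiahiahia~~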
