Natural Language Processing: Corpora


    This article covers the common operations on a corpus.

    1. Loading your own corpus with nltk

>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root=r'D:/00001/2002/Annual_txt'
>>> reader=PlaintextCorpusReader(corpus_root, '.*')
>>> reader.fileids()
['2001 Business Highlights .txt', 'Back Cover.txt', 'Balance Sheet.txt',
 'Cheung Kong Infrastructure Holdings Limited.txt', 'Consolidated Balance Sheet.txt',
 'Consolidated Cash Flow Statement.txt', 'Consolidated Profit & Loss Account .txt',
 'Consolidated Statement of Recognised Gains and Losses.txt', 'Contents.txt',
 'Corporate Information.txt', 'Cover.txt', 'Development Projects.txt',
 "Directors' Biographical Information.txt",
 'Extracts from Hutchison Whampoa Limited Financial Statements.txt',
 'Financial Highlights.txt', 'Group Financial Summary.txt', 'Group Structure.txt',
 'Hongkong Electric Holdings Limited.txt', 'Hutchison Whampoa Limited.txt',
 'Management Discussion and Analysis.txt', 'Notes to Financial Statements.txt',
 'Notice of Annual General Meeting.txt', 'Overseas Properties.txt',
 'Rental Properties.txt', 'Report of the Auditors.txt',
 'Report of the Chairman and the Managing Director.txt', 'Report of the Directors.txt',
 'Schedule of Major Properties.txt']

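    Here the local folder 'D:/00001/2002/Annual_txt' is treated as a corpus, so the files inside it can be accessed through the reader. The second argument, '.*', is a regular expression that selects which files in the folder belong to the corpus; '.*' matches every file.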
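    The loading step can be tried without the original annual-report folder. The following is a minimal sketch, assuming a throwaway temporary directory and made-up file contents (the file names are borrowed from the listing above, while notes.bak and the text inside each file are hypothetical). It uses the narrower pattern r'.*\.txt' to show how the second argument filters files.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader

# Hypothetical throwaway corpus: two .txt files plus one file we do not want.
corpus_root = tempfile.mkdtemp()
samples = {
    "Balance Sheet.txt": "As at 31st December, 2001",
    "Cover.txt": "ANNUAL REPORT 2001",
    "notes.bak": "scratch file that should be ignored",
}
for name, text in samples.items():
    with open(os.path.join(corpus_root, name), "w", encoding="utf-8") as f:
        f.write(text)

# The second argument is a regular expression matched against each file name.
# r'.*\.txt' keeps only the .txt files, whereas '.*' would accept everything.
reader = PlaintextCorpusReader(corpus_root, r".*\.txt", encoding="utf-8")

file_ids = reader.fileids()
print(file_ids)

# words() splits on word and punctuation boundaries by default.
balance_words = list(reader.words("Balance Sheet.txt"))
print(balance_words)
```

    Stating the encoding explicitly is a choice made for this sketch; for files saved in another encoding, the matching codec name would go there instead.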

    

    2. Common corpus operations

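    1) fileids(): list the files in the corpus

    2) fileids([categories]): list the corpus files that belong to the given categories

    3) categories(): list the categories of the corpus

    4) categories([fileids]): list the categories that the given files belong to

    5) raw(): get the raw content of the whole corpus

    6) raw(fileids=[f1, f2, f3]): get the raw content of the given files

    7) words(): get the words of the whole corpus

    8) words(fileids=[f1, f2, f3]): get the words of the given files

    9) words(categories=[c1, c2]): get the words of the given categories

    10) sents(): get the sentences of the whole corpus

    11) sents(fileids=[f1, f2, f3]): get the sentences of the given files

    12) sents(categories=[c1, c2]): get the sentences of the given categories

    13) abspath(fileid): get the location of the given file on disk

    14) encoding(fileid): get the encoding of the given file (if known)

    15) open(fileid): open a stream for reading the given corpus file

    16) readme(): get the contents of the corpus README file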
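    Several of the operations listed above can be exercised on a tiny corpus. This is only a sketch under stated assumptions: the temporary directory and the contents of Cover.txt are hypothetical, and a LineTokenizer is passed as sent_tokenizer so that every line counts as one sentence. That choice is made here so that sents() does not depend on a separately downloaded sentence model; it is not what the default configuration does.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader
from nltk.tokenize import LineTokenizer

# Hypothetical one-file corpus in a throwaway directory.
corpus_root = tempfile.mkdtemp()
with open(os.path.join(corpus_root, "Cover.txt"), "w", encoding="utf-8") as f:
    f.write("ANNUAL REPORT 2001\nCHEUNG KONG (HOLDINGS) LIMITED\n")

# Assumption for this sketch: one line is one sentence.
reader = PlaintextCorpusReader(
    corpus_root,
    r".*\.txt",
    sent_tokenizer=LineTokenizer(),
    encoding="utf-8",
)

raw_text = reader.raw("Cover.txt")                        # whole file as one string
words = list(reader.words("Cover.txt"))                   # flat list of tokens
sentences = [list(s) for s in reader.sents("Cover.txt")]  # list of token lists
path = str(reader.abspath("Cover.txt"))                   # location on disk
encoding = reader.encoding("Cover.txt")                   # encoding given to the reader

# open() returns a readable stream over the file.
stream = reader.open("Cover.txt")
first_line = stream.readline()
stream.close()

print(words)
print(sentences)
print(path, encoding, repr(first_line))
```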

>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root=r'D:/00001/2002/Annual_txt'
>>> reader=PlaintextCorpusReader(corpus_root, '.*')
>>> reader.fileids()
['2001 Business Highlights .txt', 'Back Cover.txt', 'Balance Sheet.txt',
 'Cheung Kong Infrastructure Holdings Limited.txt', 'Consolidated Balance Sheet.txt',
 'Consolidated Cash Flow Statement.txt', 'Consolidated Profit & Loss Account .txt',
 'Consolidated Statement of Recognised Gains and Losses.txt', 'Contents.txt',
 'Corporate Information.txt', 'Cover.txt', 'Development Projects.txt',
 "Directors' Biographical Information.txt",
 'Extracts from Hutchison Whampoa Limited Financial Statements.txt',
 'Financial Highlights.txt', 'Group Financial Summary.txt', 'Group Structure.txt',
 'Hongkong Electric Holdings Limited.txt', 'Hutchison Whampoa Limited.txt',
 'Management Discussion and Analysis.txt', 'Notes to Financial Statements.txt',
 'Notice of Annual General Meeting.txt', 'Overseas Properties.txt',
 'Rental Properties.txt', 'Report of the Auditors.txt',
 'Report of the Chairman and the Managing Director.txt', 'Report of the Directors.txt',
 'Schedule of Major Properties.txt']
>>> reader.categories()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'PlaintextCorpusReader' object has no attribute 'categories'
>>> reader.raw()[:100]
u'CHEUNG KONG (HOLDINGS) LIMITED\r\n\r\nANNUAL REPORT 2001\r\n\r\n2001 Business Highlights\r\n\r\n4\r\n\r\nJanuary\r\n\r\n'
>>> reader.raw(fileids='Balance Sheet.txt')[:100]
u'CHEUNG KONG (HOLDINGS) LIMITED\r\n\r\nANNUAL REPORT 2001\r\n\r\nBalance Sheet\r\n\r\nAs at 31st December, 2001\r\n'
>>> reader.sents()[:5]
[[u'CHEUNG', u'KONG', u'(', u'HOLDINGS', u')', u'LIMITED'], [u'ANNUAL', u'REPORT', u'2001'], [u'2001', u'Business', u'Highlights'], [u'4'], [u'January']]
>>> reader.sents(fileids='Balance Sheet.txt')[:5]
[[u'CHEUNG', u'KONG', u'(', u'HOLDINGS', u')', u'LIMITED'], [u'ANNUAL', u'REPORT', u'2001'], [u'Balance', u'Sheet'], [u'As', u'at', u'31st', u'December', u',', u'2001'], [u'44']]
>>> reader.words(fileids='Balance Sheet.txt')[:5]
[u'CHEUNG', u'KONG', u'(', u'HOLDINGS', u')']
>>> reader.words()[:20]
[u'CHEUNG', u'KONG', u'(', u'HOLDINGS', u')', u'LIMITED', u'ANNUAL', u'REPORT', u'2001', u'2001', u'Business', u'Highlights', u'4', u'January', u'*', u'Successfully', u'raised', u'a', u'5', u'-']
>>> reader.abspath('Balance Sheet.txt')
FileSystemPathPointer('D:\\00001\\2002\\Annual_txt\\Balance Sheet.txt')
>>> reader.open('Balance Sheet.txt')
<nltk.data.SeekableUnicodeStreamReader object at 0x064822B0>

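Clearly, a custom corpus loaded through PlaintextCorpusReader carries no category information (the reader has no categories attribute, as the AttributeError shows), so the category-related operations cannot be used on it.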
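If categories are needed, nltk also ships CategorizedPlaintextCorpusReader, which takes the category of each file from extra information supplied at construction time. The sketch below assumes a hypothetical layout in which each subfolder name is the category (finance/ and governance/ are made up for illustration, as are the file contents) and derives the category from the path with cat_pattern.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Hypothetical layout: <category>/<file>.txt inside a throwaway directory.
corpus_root = tempfile.mkdtemp()
samples = {
    "finance/Balance Sheet.txt": "Balance Sheet",
    "finance/Cash Flow.txt": "Cash Flow Statement",
    "governance/Directors.txt": "Report of the Directors",
}
for rel_path, text in samples.items():
    full_path = os.path.join(corpus_root, *rel_path.split("/"))
    os.makedirs(os.path.dirname(full_path), exist_ok=True)
    with open(full_path, "w", encoding="utf-8") as f:
        f.write(text)

# cat_pattern is a regular expression applied to each file id;
# its first group (here, the folder name) becomes the category.
reader = CategorizedPlaintextCorpusReader(
    corpus_root,
    r".*\.txt",
    cat_pattern=r"(\w+)/.*",
    encoding="utf-8",
)

all_categories = reader.categories()
finance_files = reader.fileids(categories="finance")
directors_categories = reader.categories(fileids="governance/Directors.txt")
governance_words = list(reader.words(categories="governance"))

print(all_categories)
print(finance_files)
print(directors_categories)
print(governance_words)
```

With a reader built this way, the category-based forms of fileids(), categories() and words() from the list in section 2 become available.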

 

