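Preface
While studying how a neural-network model for a chatbot is trained, I found that the training data has to be prepared first. The training material must be processed (converted into tensors), which means first extracting the word stems and then deduplicating and sorting them: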
words = sorted(list(set(words)))
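As a small illustration of the line above, here is a minimal sketch run on a made-up list of stems (the sample data is an assumption, not taken from the chatbot project). Because sorted() accepts any iterable and always returns a new list, the intermediate list() call does not appear to be strictly necessary.

```python
# Hypothetical sample of stemmed words; real data would come from the training corpus.
words = ["hello", "thank", "bye", "hello", "help", "bye"]

# Original form: deduplicate with set(), convert to a list, then sort.
vocab = sorted(list(set(words)))

# Shorter equivalent: sorted() can consume the set directly.
vocab_short = sorted(set(words))

print(vocab)                 # ['bye', 'hello', 'help', 'thank']
print(vocab_short == vocab)  # True
```

Sorting after deduplication also makes the result deterministic, since the iteration order of a set is not guaranteed.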
Here is a summary of these three functions:
1.set()
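Syntax: set([iterable])
Parameter: iterable (optional), a sequence (string, tuple, etc.), a collection (list, set, dictionary, etc.), or an iterator object to be converted into a set
Return value: a set
Purpose: deduplication. A set is by nature an unordered collection with no duplicate elements, so converting something into a set is itself the deduplication step.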
# empty set
print(set())

# from string
print(set('google'))

# from tuple
print(set(('a', 'e', 'i', 'o', 'u')))

# from list
print(set(['g', 'o', 'o', 'g', 'l', 'e']))

# from range
print(set(range(5)))
Output (the order of elements in a set is arbitrary and may differ between runs):
set()
{'e', 'l', 'o', 'g'}
{'a', 'o', 'e', 'u', 'i'}
{'e', 'g', 'l', 'o'}
{0, 1, 2, 3, 4}
2.sorted()
Syntax: sorted(iterable, *, key=None, reverse=False) (in Python 3, key and reverse must be passed as keyword arguments)
Parameters:
iterable: an iterable, i.e. a sequence (string, tuple, list), a collection (set, dictionary, frozen set), or any iterator
reverse (optional): if True, the sorted list is reversed (sorted in descending order)
key (optional): a function that serves as a key for the sort comparison
Return value: a new sorted list
Example 1: sorting
# vowels list
pyList = ['e', 'a', 'u', 'o', 'i']
print(sorted(pyList))

# string
pyString = 'Python'
print(sorted(pyString))

# vowels tuple
pyTuple = ('e', 'a', 'u', 'o', 'i')
print(sorted(pyTuple))
Result:
['a', 'e', 'i', 'o', 'u']
['P', 'h', 'n', 'o', 't', 'y']
['a', 'e', 'i', 'o', 'u']
Example 2: reverse sorting
# set
pySet = {'e', 'a', 'u', 'o', 'i'}
print(sorted(pySet, reverse=True))

# dictionary (sorted() iterates over the keys)
pyDict = {'e': 1, 'a': 2, 'u': 3, 'o': 4, 'i': 5}
print(sorted(pyDict, reverse=True))

# frozen set
pyFSet = frozenset(('e', 'a', 'u', 'o', 'i'))
print(sorted(pyFSet, reverse=True))
Result:
['u', 'o', 'i', 'e', 'a']
['u', 'o', 'i', 'e', 'a']
['u', 'o', 'i', 'e', 'a']
Example 3: sorting with the key parameter
# take second element for sort
def takeSecond(elem):
    return elem[1]

# random list
random = [(2, 2), (3, 4), (4, 1), (1, 3)]

# sort list with key
sortedList = sorted(random, key=takeSecond)

# print list
print('Sorted list:', sortedList)
Result:
Sorted list: [(4, 1), (2, 2), (1, 3), (3, 4)]
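The key function does not have to be user-defined. As Example 1 shows with 'Python', the default ordering places uppercase letters before lowercase ones. The sketch below, using a made-up word list, passes the built-in str.lower as the key to get a case-insensitive order, which may be handy when sorting a vocabulary.

```python
# Hypothetical mixed-case word list, for illustration only.
mixed = ["banana", "Cherry", "apple"]

# Default ordering compares code points, so 'C' sorts before any lowercase letter.
default_order = sorted(mixed)
print(default_order)      # ['Cherry', 'apple', 'banana']

# key=str.lower compares the lowercased form but returns the original strings.
ignore_case = sorted(mixed, key=str.lower)
print(ignore_case)        # ['apple', 'banana', 'Cherry']
```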
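It is worth noting the difference between sort() and sorted():
sort is a method of list (list.sort()), whereas sorted can sort any iterable (sorted(iterable)).
list.sort() sorts the existing list in place and returns None, whereas the built-in function sorted() returns a new list and leaves the original unchanged.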
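The difference described above can be checked with a short sketch (the variable names are illustrative only):

```python
numbers = [3, 1, 2]

# sorted() returns a new list and leaves the original untouched.
new_list = sorted(numbers)
print(new_list)   # [1, 2, 3]
print(numbers)    # [3, 1, 2]

# list.sort() sorts in place and returns None.
result = numbers.sort()
print(result)     # None
print(numbers)    # [1, 2, 3]

# sorted() also works on non-list iterables such as tuples,
# which have no sort() method of their own.
from_tuple = sorted((3, 1, 2))
print(from_tuple)  # [1, 2, 3]
```

A common mistake that follows from this is writing numbers = numbers.sort(), which replaces the list with None.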
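While looking into these functions, I came across a fellow blogger's post about campus-recruitment interview questions. One of the questions is excerpted below:
Original link: http://www.cnblogs.com/klchang/p/4752441.html
Use Python to count how often each word occurs in an English article, return the 10 most frequent words together with their counts, and answer the following questions? (Punctuation may be ignored.)
The answer is as follows: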
import re

def findTopFreqWords(filename, num=1):
    'Find Top Frequent Words:'
    # read the whole file; the with-block closes it automatically
    with open(filename, 'r') as fp:
        text = fp.read()

    # split on runs of digits and non-word characters
    lst = re.split(r'[0-9\W]+', text)

    # create words set, no repeat
    words = set(lst)
    d = {}
    for word in words:
        d[word] = lst.count(word)
    # drop the empty string produced by the split, if present
    d.pop('', None)

    # sort by count, then by word, in descending order
    result = sorted(d.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
    return result[:num]

def test():
    topWords = findTopFreqWords('test.txt', 10)
    print(topWords)

if __name__ == '__main__':
    test()
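The answer above calls lst.count() once per distinct word, so it rescans the whole word list each time. As an alternative sketch (not the original author's method), the standard library's collections.Counter can tally all words in a single pass. The function name top_freq_words and the sample sentence are hypothetical, and the function takes the text directly instead of a filename to keep the example self-contained.

```python
import re
from collections import Counter

def top_freq_words(text, num=10):
    """Return the `num` most frequent words in `text` with their counts."""
    # Same tokenization as the answer above: split on digits and non-word characters,
    # discarding the empty strings that the split can produce.
    words = [w for w in re.split(r'[0-9\W]+', text) if w]
    counts = Counter(words)
    # Sort by count, then by word, both descending, mirroring the ordering used above.
    ranked = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
    return ranked[:num]

# Hypothetical sample text, for illustration only.
sample = "the cat and the dog and the bird"
top_two = top_freq_words(sample, 2)
print(top_two)  # [('the', 3), ('and', 2)]
```

Counter also offers most_common(n), but it orders equal counts by first occurrence rather than by word, so the explicit sorted() call is kept here to match the tie-breaking of the answer above.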
The content of the test.txt file used is as follows:
3.1 Accessing Text from the Web and from Disk
Electronic Books
A small sample of texts from Project Gutenberg appears in the NLTK corpus collection.
However, you may be interested in analyzing other texts from Project Gutenberg.
You can browse the catalog of 25,000 free online books at http://www.gutenberg.org/catalog/, and obtain a URL to an ASCII text file.
Although 90% of the texts in Project Gutenberg are in English, it includes material in over 50 other languages, including Catalan, Chinese, Dutch, Finnish, French, German, Italian,