Deduplicating and sorting in Python with set() and sorted()


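Preface

I was reading the training-data preparation step for a chatbot neural network model. The training material has to be processed (converted into tensors): the words are first reduced to their stems, and the stems are then deduplicated and sorted.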

words = sorted(list(set(words)))
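To make the effect of that line concrete, here is a minimal sketch. The list of stems is made up for illustration; in the real pipeline it would come from the stemming step.

```python
# Hypothetical stemmed tokens; the real list would come from the stemmer
words = ["hello", "bye", "hello", "thank", "bye", "help"]

# set() removes duplicates, sorted() returns a new sorted list
vocab = sorted(set(words))
print(vocab)  # ['bye', 'hello', 'help', 'thank']

# The intermediate list() in the original line is optional,
# because sorted() accepts any iterable, including a set
same = sorted(list(set(words)))
print(vocab == same)  # True
```

Both forms give the same result, so the list() call can be kept for readability or dropped.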

Here is a summary of the functions used in that line:

1.set()

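Syntax: set([iterable])

Parameter: an iterable (optional), i.e. a sequence (string, tuple, etc.), a collection (list, set, dictionary, etc.), or an iterator object to be converted into a set

Return value: a set

Purpose: deduplication. A set is by nature an unordered collection with no repeated elements, so converting something to a set removes its duplicates.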

# empty set
print(set())

# from string
print(set('google'))

# from tuple
print(set(('a', 'e', 'i', 'o', 'u')))

# from list
print(set(['g', 'o', 'o', 'g', 'l', 'e']))

# from range
print(set(range(5)))

Output (the order of elements inside a set is arbitrary and may differ between runs):

set()
{'o', 'l', 'e', 'g'}
{'a', 'o', 'e', 'u', 'i'}
{'e', 'g', 'l', 'o'}
{0, 1, 2, 3, 4}
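Because a set is unordered, deduplicating with set() loses the original order of the elements. If the first-seen order matters, one common alternative (not used in the chatbot code, shown here only as a sketch) is dict.fromkeys(), which relies on dictionaries keeping insertion order in Python 3.7 and later.

```python
letters = ['g', 'o', 'o', 'g', 'l', 'e']

# set() deduplicates but does not guarantee any particular order
unique_unordered = set(letters)

# dict keys are unique and keep insertion order (Python 3.7+),
# so this deduplicates while preserving first-seen order
unique_ordered = list(dict.fromkeys(letters))
print(unique_ordered)  # ['g', 'o', 'l', 'e']
```

When the result is passed to sorted() anyway, as in the line from the preface, the lost order does not matter and set() is enough.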

2.sorted()

Syntax: sorted(iterable[, key][, reverse]) (in Python 3, key and reverse must be passed as keyword arguments)

Parameters:

iterable: an iterable, i.e. a sequence (string, tuple, list), a collection (set, dictionary, frozen set), or any iterator

reverse (optional): if true, the sorted list is reversed (sorted in descending order)

key (optional): a function that serves as a key for the sort comparison

Return value: a sorted list

Example 1: sorting

# vowels list
pyList = ['e', 'a', 'u', 'o', 'i']
print(sorted(pyList))

# string 
pyString = 'Python'
print(sorted(pyString))

# vowels tuple
pyTuple = ('e', 'a', 'u', 'o', 'i')
print(sorted(pyTuple))

Output:

['a', 'e', 'i', 'o', 'u']
['P', 'h', 'n', 'o', 't', 'y']
['a', 'e', 'i', 'o', 'u']

Example 2: reverse sorting

# set
pySet = {'e', 'a', 'u', 'o', 'i'}
print(sorted(pySet, reverse=True))

# dictionary
pyDict = {'e': 1, 'a': 2, 'u': 3, 'o': 4, 'i': 5}
print(sorted(pyDict, reverse=True))

# frozen set
pyFSet = frozenset(('e', 'a', 'u', 'o', 'i'))
print(sorted(pyFSet, reverse=True))

Output:

['u', 'o', 'i', 'e', 'a']
['u', 'o', 'i', 'e', 'a']
['u', 'o', 'i', 'e', 'a']
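Note that passing a dictionary to sorted() sorts its keys, because iterating over a dict yields keys. The sketch below reuses pyDict from the example and shows how the values can drive the ordering instead. The names by_value and pairs are introduced here for illustration.

```python
pyDict = {'e': 1, 'a': 2, 'u': 3, 'o': 4, 'i': 5}

# Iterating a dict yields its keys, so sorted() orders the keys
print(sorted(pyDict))  # ['a', 'e', 'i', 'o', 'u']

# Order the keys by their associated values instead
by_value = sorted(pyDict, key=pyDict.get)
print(by_value)  # ['e', 'a', 'u', 'o', 'i']

# Sort (key, value) pairs by value, largest first
pairs = sorted(pyDict.items(), key=lambda kv: kv[1], reverse=True)
print(pairs)  # [('i', 5), ('o', 4), ('u', 3), ('a', 2), ('e', 1)]
```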

Example 3: sorting with the key parameter

# take second element for sort
def takeSecond(elem):
    return elem[1]

# random list
random = [(2, 2), (3, 4), (4, 1), (1, 3)]

# sort list with key
sortedList = sorted(random, key=takeSecond)

# print list
print('Sorted list:', sortedList)

Output:

Sorted list: [(4, 1), (2, 2), (1, 3), (3, 4)]

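It is worth pointing out the difference between sort() and sorted():

sort() is a method of list (list.sort()), whereas sorted() can sort any iterable (sorted(iterable)).
list.sort() sorts the existing list in place and returns None, whereas the built-in function sorted() returns a new list and leaves the original untouched.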
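The difference is easy to check with a short sketch. The variable names are chosen just for this illustration.

```python
nums = [3, 1, 2]

# sorted() returns a new list and leaves the original unchanged
new_list = sorted(nums)
print(new_list)  # [1, 2, 3]
print(nums)      # [3, 1, 2]

# list.sort() sorts in place and returns None
result = nums.sort()
print(result)  # None
print(nums)    # [1, 2, 3]

# sorted() works on any iterable; tuples have no sort() method
print(sorted((3, 1, 2)))           # [1, 2, 3]
print(hasattr((3, 1, 2), 'sort'))  # False
```

A common mistake that follows from this is writing nums = nums.sort(), which replaces the list with None.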

 

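While reading about these functions, I came across another blogger's post on campus recruitment interview questions. One of the questions is excerpted below:

Original link: http://www.cnblogs.com/klchang/p/4752441.html

Use Python to count how often each word occurs in an English article, return the 10 most frequent words together with their counts, and answer the following questions. (Punctuation can be ignored.)

The answer is as follows: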

import re

def findTopFreqWords(filename, num=1):
    'Find Top Frequent Words:'
    # read the whole file; the with-block closes it automatically
    with open(filename, 'r') as fp:
        text = fp.read()

    # split on runs of digits and non-word characters
    lst = re.split(r'[0-9\W]+', text)

    # create words set, no repeat
    words = set(lst)
    d = {}
    for word in words:
        d[word] = lst.count(word)
    # drop the empty string produced when the text starts or ends with a separator
    d.pop('', None)

    # sort by count, then by word, both in descending order
    # (the Python 2 original used d.iteritems() and a tuple-unpacking lambda)
    result = sorted(d.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)
    return result[:num]

def test():
    topWords = findTopFreqWords('test.txt', 10)
    print(topWords)

if __name__ == '__main__':
    test()

The contents of the test.txt used are as follows:

3.1   Accessing Text from the Web and from Disk

Electronic Books

A small sample of texts from Project Gutenberg appears in the NLTK corpus collection. 
However, you may be interested in analyzing other texts from Project Gutenberg. 
You can browse the catalog of 25,000 free online books at http://www.gutenberg.org/catalog/, and obtain a URL to an ASCII text file. 
Although 90% of the texts in Project Gutenberg are in English, it includes material in over 50 other languages, including Catalan, Chinese, Dutch, Finnish, French, German, Italian,
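The answer above calls lst.count() once per distinct word, which rescans the whole word list each time. As an alternative sketch (not the method from the linked post), the standard library's collections.Counter tallies all words in a single pass. The function name top_words and the sample string are made up for this illustration, and the function takes the text directly instead of a filename so that the example is self-contained.

```python
import re
from collections import Counter

def top_words(text, num=10):
    """Return the num most frequent words in text with their counts."""
    # Same tokenization idea as the answer above: split on digits and
    # non-word characters, then drop empty strings left at the edges
    tokens = [w for w in re.split(r'[0-9\W]+', text) if w]
    # Counter tallies each word in a single pass over the tokens
    return Counter(tokens).most_common(num)

sample = "the cat and the dog and the bird"
print(top_words(sample, 2))  # [('the', 3), ('and', 2)]
```

One difference to keep in mind: most_common() lists words with equal counts in the order they were first seen, whereas the sorted() call in the answer above breaks ties by reverse alphabetical order.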

 

  

 

