Word segmentation in Python with jieba, stripping punctuation with re


import re
import jieba

# Build the stopword lookup (stop_words.txt holds one word per line)
stopwords = set()
with open('stop_words.txt', 'r', encoding='utf-8', errors='ignore') as fstop:
    for eachWord in fstop:
        stopwords.add(eachWord.strip())

# Segment all.txt line by line and write the filtered words to allutf11.txt
with open('all.txt', 'r', encoding='utf-8', errors='ignore') as f1, \
     open('allutf11.txt', 'w', encoding='utf-8') as f2:
    for line in f1:
        line = line.strip()  # drop leading/trailing whitespace
        # Strip digits, ASCII symbols, and full-width Chinese punctuation
        line = re.sub(r"[0-9\s+\.\!\/_,$%^*()?;;:-【】+\"\']+|[+——!,;:。?、~@#¥%……&*()]+", " ", line)
        seg_list = jieba.cut(line, cut_all=False)  # jieba segmentation, accurate mode
        outStr = ""
        for word in seg_list:
            if word not in stopwords:
                outStr += word + " "
        f2.write(outStr)
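The substitution pattern above combines two character classes: the first covers digits, whitespace, and ASCII symbols; the second covers full-width Chinese punctuation. A minimal standalone sketch of what it does to one line (the sample sentence is illustrative, not from the original data):

```python
import re

# Same punctuation-stripping pattern as in the script above.
PUNCT = r"[0-9\s+\.\!\/_,$%^*()?;;:-【】+\"\']+|[+——!,;:。?、~@#¥%……&*()]+"

sample = "2024年,自然语言处理很有趣!真的……"
cleaned = re.sub(PUNCT, " ", sample)
print(cleaned)  # digits and punctuation runs collapse to single spaces
```

One caution: inside the first character class, the unescaped `:-【` forms a range from `:` (U+003A) to `【` (U+3010), which also covers the ASCII letters A–Z and a–z, so Latin abbreviations would be stripped too. Escape the hyphen (`\-`) or move it to the end of the class if Latin text should survive.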




