Python NLP Data Cleaning

Reposted article, 2019-06-13 21:40, Natural Language Processing (NLP)

1. Key points

"""
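Modules to install: bs4, nltk, gensim
nltk: for processing English text
    1. Install it
    2. Call nltk.download() to fetch the data packages you need

Processing English text:
    1. Strip HTML tags:  example = BeautifulSoup(df['review'][1000],'html.parser').get_text()
    2. Replace every non-letter character (punctuation, digits) with a space:  example_letter = re.sub(r'[^a-zA-Z]',' ',example)
    3. Lowercase and split into words/tokens:  words = example_letter.lower().split()
    4. Remove stopwords, for example: the  a  an  it's
                stopwords = {}.fromkeys([line.rstrip() for line in open('./stopwords.txt')])
                words_nostop = [w for w in words if w not in stopwords]
    5. Join the remaining words back into a sentence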

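Approaches to word vectors:
    1. One-hot encoding
        Drawback: the wasted storage is a secondary concern. More importantly, any two distinct one-hot vectors are orthogonal, so the encoding captures no relationship between words and gives a program nothing to measure similarity with.
    2. Methods based on singular value decomposition (SVD)
        Steps: a) Build a word-space matrix X from statistics over a large set of existing documents. There are two ways to do this.
                One is to count how often each word occurs in each document. With W words and M documents, X has dimensions W*M.
                The other is to count, for each word, how often every other word occurs in its surrounding context, which gives a W*W matrix X.
              b) Decompose X with SVD to obtain the singular values, then keep the k largest singular values and the corresponding k left singular vectors.
                The matrix formed by these k vectors has dimensions W*k, which gives a k-dimensional vector for each of the W words.
        Drawbacks:
            1. A very large, sparse word-space matrix X has to be maintained, and it changes frequently as new words appear.
            2. SVD is computationally expensive, and it has to be recomputed whenever a word or document is added or removed.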
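    3. Train a word2vec model: the parameters and the vectors of known words are learned iteratively from a large set of documents. When a new document arrives, the existing model does not have to be rebuilt; further training iterations update the parameters and word vectors.
            Example: 我愛python和java ("I love python and java"), assuming it is tokenized as 我 / 愛 / python / 和 / java, with python as the center word and a window of 2
            a) CBOW: input: the context words 我, 愛, 和, java; target: python
                   CBOW takes the vectors of the words inside the context window as input, sums (or averages) them, and scores the result against every word in the output vocabulary.
                   A softmax turns these scores into a probability distribution over the output vocabulary, and the cross-entropy against the one-hot encoding of the target word is the loss.
                   The gradients of the loss with respect to the input and output word vectors then give one gradient descent update of those vectors.

            b) Skip-Gram: input: python; targets: the context words 我, 愛, 和, java
                    Skip-Gram takes the vector of the center word as input and scores it against every word in the output vocabulary.
                    A softmax turns these scores into a probability distribution over the output vocabulary, and the cross-entropy is computed against the one-hot encoding of each context word in turn.
                    The sum of these terms is the loss, and its gradients with respect to the input and output word vectors
                    give one gradient descent update of those vectors.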
"""

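For the first two approaches in the notes (one-hot encoding and SVD), a small numeric sketch may help. It uses NumPy and a three-document toy corpus invented for illustration, builds the W*M word-document count matrix (the first of the two options described), and keeps k = 2 left singular vectors. The variable names are assumptions of this example, not part of any library.

```python
import numpy as np

# Toy corpus: three already-tokenized "documents" (made up for illustration).
docs = [
    ["i", "love", "python"],
    ["i", "love", "java"],
    ["python", "and", "java"],
]
vocab = sorted({w for doc in docs for w in doc})
index = {w: i for i, w in enumerate(vocab)}
W, M = len(vocab), len(docs)

# One-hot: each word is one row of the W*W identity matrix.
one_hot = np.eye(W)
# Distinct one-hot vectors are orthogonal, so their dot product is always 0.
print(one_hot[index["python"]] @ one_hot[index["java"]])  # 0.0

# Word-document count matrix X with shape W*M.
X = np.zeros((W, M))
for j, doc in enumerate(docs):
    for w in doc:
        X[index[w], j] += 1

# SVD: X = U @ diag(s) @ Vt, singular values s in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
word_vectors = U[:, :k]      # W*k matrix: one k-dimensional row per word
print(word_vectors.shape)    # (5, 2)
```

Adding a document or a word changes the shape of `X`, so the decomposition has to be run again, which is the second drawback listed in the notes.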
2. Cleaning Chinese text (with a stopword file)

import jieba
from bs4 import BeautifulSoup


def clean_chineses_text(text):
    """
    Clean Chinese text. stopwords_chineses.txt is stored in the author's cnblogs (博客園) files.
    :param text: raw text, possibly containing HTML
    :return: the remaining tokens joined by spaces
    """
    text = BeautifulSoup(text, 'html.parser').get_text()  # strip HTML tags
    tokens = jieba.lcut(text)  # segment the text into a list of words
    # Load the Chinese stopwords into a set (assumes the file is UTF-8 encoded)
    with open('./stopwords_chineses.txt', encoding='utf-8') as f:
        stop_set = {line.rstrip() for line in f}
    words = [w for w in tokens if w not in stop_set]  # drop stopwords
    return ' '.join(words)

3. Cleaning English text (with a stopword file)

import re
import nltk  # used by the NLTK variant in section 4
from bs4 import BeautifulSoup


def clean_english_text(text):
    """
    Clean English text. stopwords_english.txt is stored in the author's cnblogs (博客園) files.
    :param text: raw text, possibly containing HTML
    :return: the remaining words joined by spaces
    """
    text = BeautifulSoup(text, 'html.parser').get_text()  # strip HTML tags
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # keep only English letters
    words = text.lower().split()  # lowercase and split on whitespace
    # Load the English stopwords into a set (assumes the file is UTF-8 encoded)
    with open('./stopwords_english.txt', encoding='utf-8') as f:
        stop_set = {line.rstrip() for line in f}
    words = [w for w in words if w not in stop_set]  # drop stopwords
    return ' '.join(words)

if __name__ == '__main__':
    text = "ni hao ma ,hello ! my name is haha'. ,<br/> "
    a = clean_english_text(text)
    print(a)

    test1 = "你在干嘛啊，怎么不回復我消息!,對了“你媽在找你”。"
    b = clean_chineses_text(test1)
    print(b)
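One side effect of replacing every non-letter with a space is worth seeing on a small input: the apostrophe in a contraction is removed too, so a stopword entry such as it's can never match, and stray fragments like s and t are left behind. The sketch below compares that behaviour with a regular expression that keeps apostrophes inside words. The sample sentence and the three-entry `stop_set` are hypothetical.

```python
import re

text = "It's a test, isn't it?"

# Same approach as the cleaning function: non-letters become spaces.
letters_only = re.sub(r'[^a-zA-Z]', ' ', text).lower().split()
print(letters_only)
# ['it', 's', 'a', 'test', 'isn', 't', 'it']

# Alternative: keep apostrophes that sit between letters.
keep_apostrophes = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
print(keep_apostrophes)
# ["it's", 'a', 'test', "isn't", 'it']

stop_set = {"it's", "a", "it"}  # hypothetical stopword entries

print([w for w in letters_only if w not in stop_set])
# ['s', 'test', 'isn', 't']
print([w for w in keep_apostrophes if w not in stop_set])
# ['test', "isn't"]
```

Which tokenization is preferable depends on how the stopword file is written; if it lists contractions with apostrophes, the second pattern is the one that matches them.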

4. Cleaning text with NLTK's stopwords

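def clean_english_text_from_nltk(text):
    """
    Clean English text using NLTK's stopword list.
    :param text: raw text, possibly containing HTML
    :return: the remaining words joined by spaces
    """
    text = BeautifulSoup(text, 'html.parser').get_text()  # strip HTML tags
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # replace punctuation and other non-letters with spaces
    words = text.lower().split()  # lowercase and split on whitespace
    stop_set = set(nltk.corpus.stopwords.words('english'))  # NLTK's English stopwords, as a set for fast lookup
    word_list = [word for word in words if word not in stop_set]
    return ' '.join(word_list)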
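NLTK's stopword list is a separate data package, so the lookup raises `LookupError` until the corpus has been downloaded once with `nltk.download('stopwords')`. The sketch below is one possible way to load the list a single time and reuse it for every document. The function name `english_stopwords` and the tiny fallback set are inventions of this example; the fallback exists only so the snippet runs without NLTK and is not a real stopword list.

```python
def english_stopwords():
    """Return English stopwords as a set, using NLTK's list when it is available."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words('english'))
    except (ImportError, LookupError):
        # ImportError: nltk is not installed.
        # LookupError: the corpus is missing; run nltk.download('stopwords') once.
        # Hypothetical fallback so that this sketch still runs.
        return {"the", "a", "an", "is", "my"}


STOPWORDS = english_stopwords()  # built once, reused for every document

words = "hello my name is haha".split()
kept = [w for w in words if w not in STOPWORDS]
print(kept)
```

Building the set once at import time avoids re-reading the list on every call, which matters when the cleaning function is applied to many documents.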
