from sklearn.model_selection import train_test_split
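The stratify parameter: the split follows the class proportions of the labels passed to it (normally y), so that train and test each keep the same class proportions as the original dataset.
Example: A:B:C = 1:2:3
After the split, both train and test have A:B:C = 1:2:3
stratify=X treats the values in X as the class labels and preserves their proportions
stratify=y preserves the class proportions in y
In practice it is almost always stratify=y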
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html
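A minimal sketch of the stratified split described above. The toy data (60 samples with A:B:C = 1:2:3), test_size=0.5 and random_state=0 are assumptions chosen for illustration and are not from the original notes.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy labels in a 1:2:3 ratio (illustrative data, not from the notes).
y = ["A"] * 10 + ["B"] * 20 + ["C"] * 30
# One dummy feature per sample so that X and y have the same length.
X = [[i] for i in range(len(y))]

# stratify=y asks the splitter to keep the class proportions of y
# in both the train and the test part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

train_counts = Counter(y_train)
test_counts = Counter(y_test)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Comparing the two printed counts with the counts in y shows whether the ratio was preserved. Without stratify the split is purely random, so the class ratios in train and test can drift, especially for small or imbalanced datasets.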
TF-IDF (Term Frequency - Inverse Document Frequency)
TfidfVectorizer parameter reference:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.build_tokenizer
Detailed explanation:
https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction