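Data Mining: Theory and Algorithms (Introduction)


Tsinghua University graduate open course

Data mining is a data science and an interdisciplinary field: data mining = machine learning + artificial intelligence + pattern recognition + statistics

Data mining has a wide range of applications: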

  1. Business Intelligence
  2. Data Analytics
  3. Big Data
  4. Decision Support
  5. Customer Relationship Management

"Education is the kindling of a flame, not the filling of a vessel."--Socrates

DRIP : Data Rich, Information Poor

Learning Resources

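Classroom teaching alone is far from enough; students need to find books after class and study the material in depth.

Ways to keep up with the latest developments in a field:

  1. Follow international conferences
  2. Read the authoritative journals
  3. Follow the research directions of the leading experts in the field

SVM : In machine learning, a support vector machine (SVM) is a supervised learning model, typically used for pattern recognition, classification, and regression analysis.
libsvm : A Library for Support Vector Machines

In scientific research there is only first place, no second.

When searching for articles and papers, be sure to use Google and Google Scholar.

weka: data mining software with a GUI. It helps build an intuitive feel for data mining without diving into the algorithms right away.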

Neural network package: matlab, which converges quickly

KDnuggets : data, news, and job openings related to data mining.

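Learn the basic principles

Tell me and I forget, (if you only listen to the teacher, you soon forget)
Teach me and I remember, (once you understand the principles, you may remember a little longer)
Involve me and I learn. (only after doing it yourself do you master it and fix it in your mind)

"The value of college education is not the learning of many facts but the training of mind to think." -- Albert Einstein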

Data

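(Measured by level of abstraction) information > data

Applications of big data:

  1. User profiling
  2. Streaming data
  3. Predicting where crime will occur
  4. Tailoring drug dosage to each person's genes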
  5. Urban Planning

Two definitions of big data:

  1. “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” -- Gartner
  2. “Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.” -- McKinsey & Company

Synonym of data mining : knowledge discovery

Applications of data mining:

  1. Beer and diapers (NOT REAL)
  2. Moneyball : data analysis helps a team pick the players that suit it.
  3. Retail Data : Targeted Marketing
  4. Retail Data : Sentiment Analysis, i.e. mining the text of customer reviews to analyze the shopping experience

Is data mining really important?
“If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis. So my recommendation is to take lots of courses about how to manipulate and analyze data: databases, machine learning, econometrics, statistics, visualization, and so on.” -- An interview with Google Chief Economist Hal Varian from the New York Times

From Data To Intelligence

ETL : Extract, Transform, Load
Data Integration & Analysis

DM Techniques - Classification

Definition : “Classification is a procedure in which individual items are placed into groups based on quantitative information on one or more characteristics (referred to as variables) and based on a training set of previously labeled items.”

Process : Given a training set: {(x1, y1), …, (xn, yn)}, produce a classifier (function) that maps any unknown object xi to its class label yi.

Algorithms :

  • Decision Trees
  • K-Nearest Neighbours
  • Neural Networks
  • Support Vector Machines
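
As a minimal sketch of the process described above (training set in, classifier out), the snippet below implements k-nearest neighbours, one of the listed algorithms, in plain Python. The toy training points, the labels "A"/"B", and the function name `knn_predict` are made up for illustration and do not come from the course.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # train is a list of (features, label) pairs; features are tuples of numbers.
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    top_labels = [label for _, label in by_distance[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Toy training set {(x1, y1), ..., (xn, yn)} (hypothetical data)
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((4.0, 4.0), "B"), ((4.2, 3.9), "B"), ((3.8, 4.1), "B"),
]

print(knn_predict(train, (1.1, 1.0)))  # A
print(knn_predict(train, (4.1, 4.0)))  # B
```

Here the "classifier" is just the stored training set plus a distance rule, which is why KNN is often a convenient first algorithm to try.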

Applications :

  • Churn Prediction
  • Medical Diagnosis

Type : supervised learning

Essence : Classification Boundaries (decision surfaces, see the figure below) that partition the feature space

Confusion Matrix (see the figure below)
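
For a binary classifier, a confusion matrix counts the four combinations of true and predicted labels. The sketch below assumes labels encoded as 0/1; the example labels and the function name `confusion_counts` are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Hypothetical ground truth and predictions
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, fp, fn, tn)               # 3 1 1 3
print(accuracy, precision, recall)  # 0.75 0.75 0.75
```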

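Receiver Operating Characteristic (ROC curve, see the figure below)

threshold : the cutoff value at which a score is turned into a class decision

AUC (Area Under the ROC Curve)

A standard measure of how good a classification model is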
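
To make the link between threshold, ROC curve, and AUC concrete, the sketch below sweeps the threshold over every score, records the (false positive rate, true positive rate) points, and integrates them with the trapezoid rule. The labels, scores, and function names are assumptions made for this example.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by lowering the threshold over every distinct score."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted positive.
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the piecewise-linear curve through the points (trapezoid rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical labels (1 = positive) and classifier scores
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(y_true, scores)
print(curve[-1])             # (1.0, 1.0)
print(round(auc(curve), 4))  # 0.8889
```

An AUC of 8/9 here matches the pairwise reading of AUC: of the 9 (positive, negative) pairs, 8 have the positive example scored higher.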

DM Techniques - Clustering

Definition : “Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.”

Distance Metrics :

  • Euclidean Distance
  • Manhattan Distance
  • Mahalanobis Distance
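
The three metrics above can be sketched in a few lines of plain Python. As a simplifying assumption, `mahalanobis` takes the inverse covariance matrix as an argument (a nested list) rather than estimating it from data; the example vectors and variances are made up.

```python
import math

def euclidean(x, y):
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def mahalanobis(x, y, inv_cov):
    """sqrt(d^T * inv_cov * d), where d = x - y and inv_cov is the inverse covariance."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    q = sum(d[i] * inv_cov[i][j] * d[j] for i in range(n) for j in range(n))
    return math.sqrt(q)

x, y = (0.0, 0.0), (3.0, 4.0)
identity = [[1.0, 0.0], [0.0, 1.0]]
# Assumed variances 9 and 16 with no covariance, so the inverse is diagonal.
scaled = [[1 / 9, 0.0], [0.0, 1 / 16]]

print(euclidean(x, y))                       # 5.0
print(manhattan(x, y))                       # 7.0
print(mahalanobis(x, y, identity))           # 5.0
print(round(mahalanobis(x, y, scaled), 4))   # 1.4142
```

With the identity matrix, Mahalanobis distance reduces to Euclidean distance; with per-feature variances it rescales each axis before measuring.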

Algorithms :

  • K-Means
  • Sequential Leader
  • Affinity Propagation
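
As an illustration of the first algorithm in the list, here is a minimal K-Means sketch in plain Python. It assumes Euclidean distance and, to keep the demo deterministic, uses the first k points as initial centroids (real implementations usually initialize randomly or with k-means++). The sample points are hypothetical.

```python
import math

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]

def kmeans(points, k, iterations=20):
    centroids = list(points[:k])  # assumption: deterministic initialization
    for _ in range(iterations):
        labels = assign(points, centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # New centroid = coordinate-wise mean of the cluster members.
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, assign(points, centroids)

points = [(1.0, 1.0), (5.0, 5.0), (1.5, 2.0), (6.0, 5.5), (1.0, 0.5), (5.5, 6.0)]
centroids, labels = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```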

Applications :

  • Market Research
  • Image Segmentation
  • Social Network Analysis

Type : unsupervised learning

Hierarchical Clustering (see the figure below)

DM Techniques - Association Rule

As in the figure below, if you buy milk and bread, the system automatically recommends butter.
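
A rule such as {milk, bread} -> {butter} is usually judged by its support and confidence. The sketch below computes both over a handful of invented shopping baskets; the transactions and function names are assumptions for illustration only.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) = support(both) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical shopping baskets
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

print(support(transactions, {"milk", "bread"}))                           # 0.6
print(round(confidence(transactions, {"milk", "bread"}, {"butter"}), 4))  # 0.6667
```

In this toy data, two of the three baskets containing milk and bread also contain butter, which is what the confidence value expresses.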

DM Techniques - Regression (linear regression, see the figure below)
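
A minimal sketch of simple linear regression, assuming a single input variable and ordinary least squares: the slope is the covariance of x and y divided by the variance of x. The data points are invented so that they lie exactly on y = 2x + 1.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)        # 2.0 1.0
print(slope * 6 + intercept)   # prediction for x = 6: 13.0
```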

Seeing is Knowing

**KEY POINT of data mining : interpretability.**

Visualization software

Data Preprocessing

Real data are often surprisingly dirty.

  • A Major Challenge for Data Mining

Typical Issues

  • Missing Attribute Values
  • Different Coding/Naming Schemes
  • Infeasible Values
  • Inconsistent Data
  • Outliers

Data Quality

  • Accuracy
  • Completeness
  • Consistency
  • Interpretability
  • Credibility
  • Timeliness

GIGO : garbage in garbage out.

Data Cleaning

  • Fill in missing values.
  • Correct inconsistent data.
  • Identify outliers and noisy data.
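
Two of the cleaning steps above can be sketched with the standard library. The choices here are assumptions, not the course's prescription: missing values (encoded as `None`) are filled with the median of the observed values, and a value is flagged as an outlier when it lies more than `k` median absolute deviations from the median. The readings are made up.

```python
import statistics

def fill_missing(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]

def flag_outliers(values, k=5.0):
    """Values farther than k median absolute deviations (MAD) from the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return [v for v in values if abs(v - med) > k * mad]

# Hypothetical sensor readings with one gap and one suspicious spike
raw = [21, 23, None, 22, 24, 250, 23]

filled = fill_missing(raw)
print(filled)                 # [21, 23, 23.0, 22, 24, 250, 23]
print(flag_outliers(filled))  # [250]
```

The median is used instead of the mean because the spike of 250 would otherwise distort the fill value. Note that when the MAD is 0, every value different from the median is flagged, so the rule needs care on nearly constant data.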

Data Integration

  • Combine data from different sources.

Data Transformation

  • Normalization
  • Aggregation
  • Type Conversion
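
Normalization, the first transformation in the list, is commonly done in one of two ways: min-max scaling to [0, 1] or z-score standardization. The sketch below shows both under the assumption of a plain list of numbers; the sample values and function names are hypothetical.

```python
import statistics

def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

data = [10, 20, 30, 40, 50]
scaled = min_max(data)
standardized = z_score(data)

print(scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
print([round(z, 3) for z in standardized])
```

Min-max scaling preserves the shape of the distribution but is sensitive to outliers; z-scores are the usual choice when features have very different spreads.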

Data Reduction

  • Feature Selection
  • Sampling

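Issues related to data mining

  1. Privacy protection
  2. Cloud computing : elastic scaling (see the figure below) avoids wasting machine resources (Pay As You Go)
  3. Parallel computing : GPUs as compute cards, scientific computing, inexpensive supercomputing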

The Big Picture

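Data mining = data + models + high-performance computing platform

If interpretability of the results matters, choose decision trees; otherwise, choose neural networks.

Clustering : K-means; classification : KNN

Financial big data : quantitative trading, which overcomes flaws in the trader's temperament

Data mining does not create patterns; it can only discover them.

Negative correlation : as A increases, B decreases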
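
Correlation can be quantified with the Pearson coefficient, which ranges from -1 (perfect negative correlation) to +1. A minimal sketch, with two invented series in which B falls by a fixed amount each time A rises:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical series: B decreases as A increases
a = [1, 2, 3, 4, 5]
b = [10, 8, 6, 4, 2]

r = pearson(a, b)
print(round(r, 6))  # -1.0
```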

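Watch for possible "grouping" patterns, for correlations within the data, and for the influence of psychological factors

A classic problem in data mining : Survivorship Bias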

