方法1
統計邏輯:統計一個loginname的所有去重的通訊錄數C,統計這個Loginname對應的每個設備對應的通訊錄c1,c2,c3…cn;
X=(c1/c+c2/c+c3/c+….cn/c)/n=(c1+c2+…cn)/(c*n)
數據表現如下:
將3個以上設備計算出來的結果關聯上黑名單庫,在8092個loginname中,有3915個,
占48.4%。將是否黑名單當作Y值,X值為自變量進行回歸:
Y與X之間的相關性為0.1229,顯著性正相關。
回歸統計 |
|
|
|
|
|
|
|
|
Multiple R |
0.12294864 |
|
|
|
|
|
|
|
R Square |
0.01511637 |
|
|
|
|
|
|
|
Adjusted R Square |
0.01499461 |
|
|
|
|
|
|
|
標准誤差 |
0.49600959 |
|
|
|
|
|
|
|
觀測值 |
8091 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
方差分析 |
|
|
|
|
|
|
|
|
|
df |
SS |
MS |
F |
Significance F |
|
|
|
回歸分析 |
1 |
30.54481 |
30.54481 |
124.153 |
1.26E-28 |
|
|
|
殘差 |
8089 |
1990.1 |
0.246026 |
|
|
|
|
|
總計 |
8090 |
2020.645 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Coefficients |
標准誤差 |
t Stat |
P-value |
Lower 95% |
Upper 95% |
下限 95.0% |
上限 95.0% |
Intercept |
0.65722473 |
0.016506 |
39.81648 |
0 |
0.624868 |
0.689581 |
0.624868 |
0.689581 |
X Variable 1 |
-0.2788014 |
0.025022 |
-11.1424 |
1.26E-28 |
-0.32785 |
-0.22975 |
-0.32785 |
-0.22975 |
方法2
統計邏輯:計算出每2個設備之間通訊錄的相同率,如Xab,Xac,Xad,Xcd…XnXn-1,關聯上黑名單庫,在當時7383的大於等於3個設備的loginname中,按照設備個數的不同分組,然后進行回歸,
3個設備的loginname有6512個,其中黑名單3294個,占比50.6%,X與Y的關系程正弱相關,前2個之間的相同率(自變量)對Y的影響最大。
Multiple R |
0.119978 |
|
|
|
|
|
|
|
R Square |
0.014395 |
|
|
|
|
|
|
|
Adjusted R Square |
0.01394 |
|
|
|
|
|
|
|
標准誤差 |
0.496507 |
|
|
|
|
|
|
|
觀測值 |
6512 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
方差分析 |
|
|
|
|
|
|
|
|
|
df |
SS |
MS |
F |
Significance F |
|
|
|
回歸分析 |
3 |
23.43156 |
7.810518 |
31.68321 |
2.52E-20 |
|
|
|
殘差 |
6508 |
1604.347 |
0.246519 |
|
|
|
|
|
總計 |
6511 |
1627.778 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Coefficients |
標准誤差 |
t Stat |
P-value |
Lower 95% |
Upper 95% |
下限 95.0% |
上限 95.0% |
Intercept |
0.40805 |
0.011882 |
34.34126 |
1.1E-237 |
0.384757 |
0.431344 |
0.384757 |
0.431344 |
X Variable 1 |
0.115797 |
0.039147 |
2.958019 |
0.003107 |
0.039056 |
0.192537 |
0.039056 |
0.192537 |
X Variable 2 |
0.163985 |
0.039448 |
4.156987 |
3.27E-05 |
0.086654 |
0.241316 |
0.086654 |
0.241316 |
X Variable 3 |
0.087144 |
0.038996 |
2.234676 |
0.025472 |
0.010699 |
0.16359 |
0.010699 |
0.16359 |
4個設備的組,統計回歸相對高0.05個百分點的相關性,但是相關性依然較弱,第一個和第6個變量起到決定性影響,5個以上的設備存在高度正相關,但是因為數據量太少,
不認為具有統計學代表意義。
回歸統計 |
|
|
|
|
|
|
|
|
|
|
Multiple R |
0.171505 |
|
|
|
|
|
|
|
|
|
R Square |
0.029414 |
|
|
|
|
|
|
|
|
|
Adjusted R Square |
0.020577 |
|
|
|
|
|
|
|
|
|
標准誤差 |
0.493803 |
|
|
|
|
|
|
|
|
|
觀測值 |
666 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
方差分析 |
|
|
|
|
|
|
|
|
|
|
|
df |
SS |
MS |
|
F |
Significance F |
|
|
|
|
回歸分析 |
6 |
4.869808 |
0.811635 |
|
3.32853 |
0.003088 |
|
|
|
|
殘差 |
659 |
160.6918 |
0.243842 |
|
|
|
|
|
|
|
總計 |
665 |
165.5616 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Coefficients |
標准誤差 |
t Stat |
|
P-value |
Lower 95% |
Upper 95% |
|
下限 95.0% |
上限 95.0% |
Intercept |
0.355842 |
0.071268 |
4.993029 |
|
7.62E-07 |
0.215903 |
0.495781 |
|
0.215903 |
0.495781 |
X Variable 1 |
0.209455 |
0.131209 |
1.596339 |
|
0.110893 |
-0.04818 |
0.467094 |
|
-0.04818 |
0.467094 |
X Variable 2 |
0.089427 |
0.136561 |
0.654849 |
|
0.512793 |
-0.17872 |
0.357575 |
|
-0.17872 |
0.357575 |
X Variable 3 |
0.09917 |
0.127629 |
0.777021 |
|
0.437425 |
-0.15144 |
0.349778 |
|
-0.15144 |
0.349778 |
X Variable 4 |
-0.10775 |
0.144002 |
-0.74823 |
|
0.454591 |
-0.3905 |
0.175012 |
|
-0.3905 |
0.175012 |
X Variable 5 |
-0.14237 |
0.121381 |
-1.17293 |
|
0.241249 |
-0.38071 |
0.095969 |
|
-0.38071 |
0.095969 |
方法3
沒能獲取用戶更換設備的時間,分析業務邏輯,可以通過手機號登陸APP時候對應的版本號來間接代表更換設備的時間。成為自變量的X有:設備數,更換設備的次數,每2個設備更換的最大時間差,最小時間差,平均更換時間,更換設備時間差的標准差,累計更換時間,進行回歸得到如下結果:設備累計更換時間和更換時間差的標准差成顯著反比。
回歸統計 |
|
||||||||
Multiple R |
0.154438366 |
|
|||||||
R Square |
0.023851209 |
|
|||||||
Adjusted R Square |
0.023259245 |
|
|||||||
標准誤差 |
0.492116372 |
|
|||||||
觀測值 |
9901 |
|
|||||||
|
|||||||||
方差分析 |
|
||||||||
|
df |
SS |
MS |
|
F |
Significance F |
|||
回歸分析 |
6 |
58.54663 |
9.757771812 |
|
40.29164794 |
9.68003E-49 |
|||
殘差 |
9894 |
2396.114 |
0.242178524 |
|
|||||
總計 |
9900 |
2454.661 |
|
|
|
|
|||
|
|||||||||
|
Coefficients |
標准誤差 |
t Stat |
|
P-value |
Lower 95% |
Upper 95% |
下限 95.0% |
上限 95.0% |
Intercept |
0.448879765 |
0.032716 |
13.72029481 |
|
1.87579E-42 |
0.384748795 |
0.513010734 |
0.384749 |
0.513011 |
X Variable 1 |
0.054091353 |
0.010083 |
5.364858051 |
|
8.28418E-08 |
0.034327536 |
0.073855171 |
0.034328 |
0.073855 |
X Variable 2 |
-0.014132904 |
0.001316 |
-10.74041732 |
|
9.23172E-27 |
-0.016712261 |
-0.011553547 |
-0.01671 |
-0.01155 |
X Variable 3 |
0.000177389 |
3.03E-05 |
5.85593915 |
|
4.8943E-09 |
0.00011801 |
0.000236767 |
0.000118 |
0.000237 |
X Variable 4 |
1.01037E-08 |
1.25E-09 |
8.067239472 |
|
8.02511E-16 |
7.64868E-09 |
1.25587E-08 |
7.65E-09 |
1.26E-08 |
X Variable 5 |
-7.0701E-09 |
1.23E-09 |
-5.741173492 |
|
9.67925E-09 |
-9.48403E-09 |
-4.65616E-09 |
-9.5E-09 |
-4.7E-09 |
X Variable 6 |
-3.40952E-08 |
3.46E-09 |
-9.855884854 |
|
8.23224E-23 |
-4.08763E-08 |
-2.73141E-08 |
-4.1E-08 |
-2.7E-08 |
方法4
回歸統計 |
|
|
|
|
|
|
|
|
Multiple R |
0.211204 |
|
|
|
|
|
|
|
R Square |
0.044607 |
|
|
|
|
|
|
|
Adjusted R Square |
0.043425 |
|
|
|
|
|
|
|
標准誤差 |
0.488799 |
|
|
|
|
|
|
|
觀測值 |
8091 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
方差分析 |
|
|
|
|
|
|
|
|
|
df |
SS |
MS |
F |
Significance F |
|
|
|
回歸分析 |
10 |
90.13552 |
9.013552 |
37.72553 |
3.87E-73 |
|
|
|
殘差 |
8080 |
1930.51 |
0.238924 |
|
|
|
|
|
總計 |
8090 |
2020.645 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Coefficients |
標准誤差 |
t Stat |
P-value |
Lower 95% |
Upper 95% |
下限 95.0% |
上限 95.0% |
Intercept |
0.824792 |
0.042672 |
19.32859 |
2.08E-81 |
0.741144 |
0.90844 |
0.741144 |
0.90844039 |
X Variable 1 |
0.000604 |
0.01188 |
0.050819 |
0.959471 |
-0.02269 |
0.023893 |
-0.02269 |
0.02389257 |
X Variable 2 |
-0.01201 |
0.001585 |
-7.57658 |
3.94E-14 |
-0.01512 |
-0.0089 |
-0.01512 |
-0.0089046 |
X Variable 3 |
0.000106 |
3.51E-05 |
3.01029 |
0.002618 |
3.68E-05 |
0.000174 |
3.68E-05 |
0.0001743 |
X Variable 4 |
1.11E-08 |
2.04E-09 |
5.45342 |
5.09E-08 |
7.12E-09 |
1.51E-08 |
7.12E-09 |
1.5115E-08 |
X Variable 5 |
-7.5E-09 |
1.78E-09 |
-4.239 |
2.27E-05 |
-1.1E-08 |
-4E-09 |
-1.1E-08 |
-4.05E-09 |
X Variable 6 |
4.7E-10 |
2.6E-09 |
0.181014 |
0.856361 |
-4.6E-09 |
5.56E-09 |
-4.6E-09 |
5.5622E-09 |
X Variable 7 |
-2.2E-08 |
4.59E-09 |
-4.85498 |
1.23E-06 |
-3.1E-08 |
-1.3E-08 |
-3.1E-08 |
-1.329E-08 |
X Variable 8 |
2.45E-11 |
1.14E-11 |
2.149089 |
0.031657 |
2.15E-12 |
4.68E-11 |
2.15E-12 |
4.68E-11 |
X Variable 9 |
-0.00266 |
0.000444 |
-5.99087 |
2.18E-09 |
-0.00353 |
-0.00179 |
-0.00353 |
-0.0017881 |
X Variable 10 |
-0.24645 |
0.02494 |
-9.88193 |
6.72E-23 |
-0.29534 |
-0.19757 |
-0.29534 |
-0.1975655 |
單個X對Y都有一定影響,現在將設備更換時間的頻率、更換頻率標准偏差、每個設備的平均更換周期、設備相同率幾個維度組合成Xn變量回歸,結果顯示X對Y的總體復相關性已經有較大幅度增加,達到0.211204。
雖然這些變量對判定是否黑名單的決定性怡然較小,需要挖掘更多的用戶特征進行豐富。
dc |
dcs |
abrankcount |
maxtime |
mintime |
avgtime |
stddevt |
totaltime |
interval |
pc |
3 |
3 |
3 |
3237421 |
-4064 |
2155571.333 |
1527094 |
6466714 |
13 |
0.457187 |
3 |
3 |
3 |
20561466 |
9930757 |
13707644 |
4854801 |
41122932 |
80 |
0.57037 |
3 |
3 |
3 |
18509989 |
-19642609 |
-755080 |
15578021 |
2265240 |
76 |
0.343496 |
3 |
3 |
3 |
-14023 |
-1893173 |
-1262115.333 |
882553.1 |
3786346 |
8 |
0.335878 |
方法5
回歸統計 |
|
|
|
|
|
|
|
|
Multiple R |
0.506696 |
|
|
|
|
|
|
|
R Square |
0.256741 |
|
|
|
|
|
|
|
Adjusted R Square |
0.256666 |
|
|
|
|
|
|
|
標准誤差 |
0.429309 |
|
|
|
|
|
|
|
觀測值 |
9901 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
方差分析 |
|
|
|
|
|
|
|
|
|
df |
SS |
MS |
F |
Significance F |
|
|
|
回歸分析 |
1 |
630.2114 |
630.2114 |
3419.367 |
0 |
|
|
|
殘差 |
9899 |
1824.45 |
0.184306 |
|
|
|
|
|
總計 |
9900 |
2454.661 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Coefficients |
標准誤差 |
t Stat |
P-value |
Lower 95% |
Upper 95% |
下限 95.0% |
上限 95.0% |
Intercept |
0.760723 |
0.006787 |
112.093 |
0 |
0.74742 |
0.774026 |
0.74742 |
0.774026 |
X Variable 1 |
-0.19139 |
0.003273 |
-58.4754 |
0 |
-0.1978 |
-0.18497 |
-0.1978 |
-0.18497 |
求出每個手機更換的設備中是IOS的數量,回歸出來的結果顯示X單變量對Y的影響是顯著正相關,達到0.506696
方法6
將以上所有變量加入,形成的特征如下:
ifblack |
dc |
dcs |
abrankcount |
maxtime |
mintime |
avgtime |
stddevt |
totaltime |
interval |
pc |
deiosnum |
1 |
3 |
3 |
3 |
3237421 |
-4064 |
2155571.333 |
1527094 |
6466714 |
13 |
0.457 |
2 |
0 |
3 |
3 |
3 |
20561466 |
9930757 |
13707644 |
4854801 |
41122932 |
80 |
0.57 |
2 |
1 |
3 |
3 |
3 |
18509989 |
-19642609 |
-755080 |
15578021 |
2265240 |
76 |
0.343 |
0 |
0 |
3 |
3 |
3 |
-14023 |
-1893173 |
-1262115.333 |
882553.1 |
3786346 |
8 |
0.336 |
1 |
1 |
3 |
3 |
3 |
-91956 |
-724724 |
-483149.3333 |
279151.3 |
1449448 |
3 |
0.944 |
1 |
1 |
3 |
3 |
3 |
412948 |
-845102 |
-288102.6667 |
523600.1 |
864308 |
4 |
0.344 |
0 |
0 |
3 |
3 |
3 |
5092634 |
1290609 |
3395089.333 |
1578617 |
10185268 |
20 |
0.335 |
1 |
回歸分析相關復系數達到0.547115,較高的因果關系
回歸統計 |
|
|
|
|
|
|
|
|
Multiple R |
0.547115 |
|
|
|
|
|
|
|
R Square |
0.299335 |
|
|
|
|
|
|
|
Adjusted R Square |
0.298556 |
|
|
|
|
|
|
|
標准誤差 |
0.417037 |
|
|
|
|
|
|
|
觀測值 |
9901 |
|
|
|
|
|
|
|
方差分析 |
|
|
|
|
|
|
|
|
|
df |
SS |
MS |
F |
Significance F |
|
|
|
回歸分析 |
11 |
734.7663 |
66.79693 |
384.067 |
0 |
|
|
|
殘差 |
9889 |
1719.895 |
0.17392 |
|
|
|
|
|
總計 |
9900 |
2454.661 |
|
C |
|
|
|
|
|
Coefficients |
標准誤差 |
t Stat |
P-value |
Lower 95% |
Upper 95% |
下限 95.0% |
上限 95.0% |
Intercept |
0.173297 |
0.033632 |
5.152794 |
2.62E-07 |
0.107372 |
0.239222 |
0.107372 |
0.239222 |
X Variable 1 |
0.190591 |
0.010059 |
18.94818 |
1.12E-78 |
0.170874 |
0.210307 |
0.170874 |
0.210307 |
X Variable 2 |
-0.00304 |
0.001233 |
-2.46576 |
0.013689 |
-0.00546 |
-0.00062 |
-0.00546 |
-0.00062 |
X Variable 3 |
1.19E-05 |
2.81E-05 |
0.423152 |
0.672194 |
-4.3E-05 |
6.7E-05 |
-4.3E-05 |
6.7E-05 |
X Variable 4 |
5.53E-09 |
1.51E-09 |
3.651315 |
0.000262 |
2.56E-09 |
8.49E-09 |
2.56E-09 |
8.49E-09 |
X Variable 5 |
-6.4E-09 |
1.32E-09 |
-4.80573 |
1.56E-06 |
-8.9E-09 |
-3.8E-09 |
-8.9E-09 |
-3.8E-09 |
X Variable 6 |
1.14E-09 |
1.92E-09 |
0.592423 |
0.553581 |
-2.6E-09 |
4.91E-09 |
-2.6E-09 |
4.91E-09 |
X Variable 7 |
-2.4E-08 |
3.39E-09 |
-7.09928 |
1.34E-12 |
-3.1E-08 |
-1.7E-08 |
-3.1E-08 |
-1.7E-08 |
X Variable 8 |
6.86E-12 |
8.73E-12 |
0.786271 |
0.431727 |
-1E-11 |
2.4E-11 |
-1E-11 |
2.4E-11 |
X Variable 9 |
0.000974 |
0.000332 |
2.938598 |
0.003305 |
0.000324 |
0.001624 |
0.000324 |
0.001624 |
X Variable 10 |
0.070007 |
0.013677 |
5.118475 |
3.14E-07 |
0.043197 |
0.096818 |
0.043197 |
0.096818 |
X Variable 11 |
-0.21499 |
0.003476 |
-61.8428 |
0 |
-0.22181 |
-0.20818 |
-0.22181 |
-0.20818 |