A Brief Introduction to the Reality Mining Dataset and How to Use It

Reposted article. Originally published 2014-08-19 12:19. Category: Big Data.

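The dataset's official site is http://realitycommons.media.mit.edu/index.html (you may need a proxy to reach it). Below is a brief description of the dataset, quoted from the official site: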

The goal of this experiment was to explore the capabilities of the smart phones that enabled social scientists to investigate human interactions beyond the traditional survey based methodology or the traditional simulation base methodology. The subjects were 75 students or faculty in the MIT Media Laboratory, and 25 incoming students at the MIT Sloan business school adjacent to the Media Laboratory. Of the 75 Media Lab participants, 20 were incoming masters students and 5 were incoming MIT freshman, and the rest had remained in the Media Lab for at least a year.

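My aim here is to recommend friends to users while protecting their privacy as far as possible, unlike much of the literature, which relies on collecting users' private data (something I personally consider unworkable). This is strictly an experimental setting: in real life nobody would keep a wireless device switched on all the time just to receive some inconsequential recommendations. The idea is to use Bluetooth data to discover friendships between users, rank the candidates with a simple sort, recommend friends from that ranking, and compare the result against random recommendation to check whether the approach is feasible. The dataset was collected at MIT in 2004. I don't know whether a more recent comparable dataset exists; if you are interested, feel free to get in touch.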

Extracting the subset of the data that I need:

% Extract only the fields needed for the experiment.
% Load the raw data; load() returns a struct with the fields s and network.
data = load('realitymining.mat');
% data.s is a 1x106 struct array of subjects: data.s(1) .. data.s(106)
% (MATLAB indices start at 1).
% Fields can be copied one at a time, e.g. s(1).mac = data.s(1).mac

% Start from an empty struct array; fields are created on first assignment.
s = struct([]);
for n = 1:numel(data.s)
    % Copy the fields of interest.
    s(n).mac = data.s(n).mac;
    s(n).device_list_macs = data.s(n).device_list_macs;
    s(n).device_list_names = data.s(n).device_list_names;
    % The next three fields should have the same number of columns.
    s(n).device_date = data.s(n).device_date;
    s(n).device_names = data.s(n).device_names;
    s(n).device_macs = data.s(n).device_macs;
    s(n).neighborhood = data.s(n).neighborhood;
    s(n).my_office = data.s(n).my_office;
end

network = data.network;
save('slite', 's', 'network');
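The comment in the listing above says that device_date, device_names and device_macs should have the same number of columns. A minimal sketch of how that could be checked is shown here; the two-subject struct array is made-up toy data, and using numel as the length measure is an assumption about how the fields are laid out.

```matlab
% Toy stand-in for the extracted struct array (hypothetical values,
% not taken from realitymining.mat).
s = struct('device_date',  {{1, 2, 3}, {1, 2}}, ...
           'device_names', {{'a', 'b', 'c'}, {'a', 'b'}}, ...
           'device_macs',  {{11, 22, 33}, {11}});

% Flag every subject whose three scan-related fields differ in length.
mismatch = false(1, numel(s));
for n = 1:numel(s)
    lens = [numel(s(n).device_date), ...
            numel(s(n).device_names), ...
            numel(s(n).device_macs)];
    mismatch(n) = any(lens ~= lens(1));
end
disp(find(mismatch));   % indices of subjects with inconsistent fields
```

In this toy data only the second subject is flagged, because its device_macs field has one entry while the other two fields have two.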

Plotting a topology graph from the friendship relation, with each node labelled by its Bluetooth MAC:

        
function ND_netplot(network,s)
    % Draw the friendship matrix as a directed graph on a jittered grid.
    % NOTE: arrow() is not part of core MATLAB; a third-party arrow.m is
    % assumed to be on the path.
    A = network.friends;
    n = size(A,1);
    w = floor(sqrt(n));
    h = floor(n/w);
    x = zeros(1,n);
    y = zeros(1,n);
    index = 0;
    for i=1:h           % keep each random point inside its own 10x10 cell so the nodes spread out
        for j=1:w
            index = index +1;
            x(index) = 10*rand(1)+(j-1)*10;
            y(index) = 10*rand(1)+(i-1)*10;
        end
    end

    ed = n-h*w;         % nodes left over after filling the h-by-w grid
    for i=1:ed
       index = index +1;
       x(index) = 10*rand(1)+(i-1)*10;
       y(index) = 10*rand(1)+h*10;
    end
    plot(x,y,'ok');

    title('Network topology');
    for i=1:n
        for j=1:n
            if A(i,j) == 1
                c = num2str(A(i,j));                      % convert the weight in A to a string
                text((x(i)+x(j))/2,(y(i)+y(j))/2,c,'Fontsize',10);  % show the edge weight
                if i ~= j
                    arrow([x(j),y(j)],[x(i),y(i)]);       % arrow from node j to node i
                end
            end
        end
        if i < 94
            % Label the node with "subject index--MAC" rather than the row number.
            sub_index = network.sub_sort(i);
            mac = ['--',num2str(s(sub_index).mac)];
            text(x(i),y(i),[num2str(sub_index),mac],'Fontsize',9,'color','r');
            disp([num2str(sub_index),mac]);
        end
    end
end

The result is shown in the figure:

Nothing substantial has been done so far: I have only separated out the data I need and drawn the friendship relation as a directed graph.
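ND_netplot relies on an arrow function that does not ship with MATLAB. As an alternative, and not what the original code does, the same kind of directed graph can be sketched with the built-in digraph class (available in R2015b and later). The 4x4 matrix below is a made-up toy example, and pairing it with the first four labels from the list in this post is purely for illustration.

```matlab
% Toy friendship matrix (hypothetical values, not from the dataset).
% As in ND_netplot, A(i,j) == 1 is drawn as an arrow from node j to node i.
A = [0 1 0 0;
     1 0 0 1;
     1 0 0 0;
     0 0 0 0];
labels = {'3--61961024887', '4--61961024891', ...
          '5--61961024929', '6--61961024956'};

% digraph(M) creates an edge i -> j for every nonzero M(i,j), so the
% matrix is transposed to keep the j -> i direction used above.
G = digraph((A == 1)', 'omitselfloops');

figure;
plot(G, 'NodeLabel', labels, 'Layout', 'force');
title('Network topology');
```

The 'omitselfloops' option plays the role of the i ~= j test in ND_netplot, and the layout is computed by MATLAB instead of the random grid.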

Each user's hash_number and the corresponding Bluetooth MAC:

3--61961024887
4--61961024891
5--61961024929
6--61961024956
7--61961024927
8--61961025059
9--61961078506
10--61961024943
11--61961024868
12--61961078565
13--61961024950
14--61961078566
15--61961024968
16--61961024963
17--61960619991
19--61961024937
20--61961024824
21--61961024912
22--61961024938
23--61961025033
25--61960946218
26--61961078573
27--61961024853
28--61961078595
29--61960946349
30--61960946207
31--61961078619
32--61961024881
33--61961025202
35--61961024951
36--61961025054
37--61961024911
38--61961025073
40--61961078559
41--61965359991
42--61965359983
43--61965360050
44--61965359948
46--61964943979
48--61965019962
49--61964944350
50--61965359903
52--61964943984
53--61964979150
54--61964961925
55--61965020029
56--61965359944
57--61964979154
58--61964979130
60--61965019994
61--61965020015
62--0
63--61964944011
65--61964944067
66--61964943982
67--61965019987
68--61965020019
69--61964961927
70--61965359883
71--61964944054
72--61965019983
73--61965020021
74--61964943996
75--61964961871
76--61964944027
77--61965359909
78--61964944064
79--61965019992
80--14720303796
81--61965019959
82--61964979163
83--61964972168
84--61964944053
86--61964979139
87--61964944337
88--61965020009
89--61964944313
90--61964944046
91--61964944038
92--61964944018
93--61964944057
94--61964943986
95--61964944035
96--61964979158
97--61964944341
98--61965019996
99--61965359937
100--61965359920
101--413791240929
102--61961353423
103--413791240838
104--413787380563
106--

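I also plotted how many times friends were picked up by each other's Bluetooth scans. That code is rather messy, so I won't post it here.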

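The algorithm that ranks friends by how often they were scanned. This version ranks by encounter duration, where every scan in which a MAC appears counts once. The encounter-frequency version is similar: consecutive scans that see the same MAC are treated as a single encounter, which needs only a small modification: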

function getdurationbluetoothfriends(S,Network)
    % For every pair of subjects, count the Bluetooth scans in which one
    % subject's device saw the other's, then rank candidates by that count.
    disp('running script to get durations');
    [~,wS] = size(S);
    [~,wN] = size(Network.friends);
    limits = wN;  % 94
    durationbluetooth = zeros(wS,wS);  % rows and columns are subject indices
    for n = 1:limits-1  % rows 1..93
        % sub_sort maps row n to the subject index (1..106)
        current = Network.sub_sort(n);
        device_mac = S(current).device_macs;
        [~,t] = size(device_mac);   % cell array, one cell per scan
        for m = 1:t
            EveryScan = device_mac{m};  % one scan can contain several devices
            [hE,~] = size(EveryScan);
            for r = 1:hE
                % Match each scanned MAC against the subjects' own MACs.
                mac = EveryScan(r,1);
                sub_index = submacindex(mac,current);
                if sub_index > 0
                    % The two subjects "met" in this scan: add one to the count.
                    durationbluetooth(current,sub_index) = durationbluetooth(current,sub_index)+1;
                end
            end
        end
        disp(current);
    end
    % Rank the candidates: each row belongs to one subject and lists the
    % other subjects from most to fewest co-occurrences.
    sortduration = zeros(wS,wS);  % stores subject indices
    for i = 1:wS
        [~,order] = sort(durationbluetooth(i,:),'descend');
        sortduration(i,:) = order;
    end

    save('sortduration','sortduration');
    save('durationbluetooth','durationbluetooth');

    % Return the index of the subject whose MAC equals mac, skipping the
    % current subject; return -1 if the MAC belongs to no subject in the study.
    function index = submacindex(mac,currentIndex)
        for index = 1:wS
            if isempty(S(index).mac)
                continue;
            end
            if index~=currentIndex && mac==S(index).mac
                return;
            end
        end
        index = -1;
    end
end
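The post describes the encounter-frequency variant only in words: consecutive scans that contain the same MAC count as one encounter. A minimal sketch of that counting rule for a single target MAC is shown below. The scan log is made-up toy data, and the variable names are illustrative, not taken from the original program.

```matlab
% Toy scan log for one subject: each cell holds the MACs seen in one scan,
% one per row (hypothetical values).
scans  = {[11; 22], 11, [], [11; 33], 22};
target = 11;

durationCount  = 0;      % scans in which the target MAC appears
encounterCount = 0;      % runs of consecutive scans containing the target
seenInPrevious = false;
for m = 1:numel(scans)
    thisScan = scans{m};
    % && short-circuits, so an empty scan is never indexed.
    seenNow = ~isempty(thisScan) && any(thisScan(:,1) == target);
    if seenNow
        durationCount = durationCount + 1;
        if ~seenInPrevious
            encounterCount = encounterCount + 1;   % a new encounter starts
        end
    end
    seenInPrevious = seenNow;
end
fprintf('duration = %d, encounters = %d\n', durationCount, encounterCount);
```

Here the target appears in scans 1, 2 and 4, so the duration count is 3, while scans 1 and 2 merge into one encounter and the frequency count is 2.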

To make sure the data is valid, I wrote a simple verification program:

function checkdata(s)
    % Print every position at which MAC a appears in subject 3's scans,
    % together with a running count.
    a = 61961024886;
    num = 0;
    data = s(3).device_macs;
    for n = 1:numel(data)
        everyScan = data{n};  % one scan can contain several devices
        h1 = size(everyScan,1);
        for r = 1:h1
             mac = everyScan(r,1);
             if mac == a
                index = [num2str(n),'.',num2str(r),':'];  % "scan.row:"
                disp(index);
                num = num+1;
                disp(num2str(num));
             end
        end
    end
end

Check whether the count printed here matches the count obtained in the ranking step above; verifying a single pair is enough.
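The comparison described above can also be done without reading the printed output by eye. The sketch below counts one MAC in two independent ways and checks that the results agree. It uses made-up toy scans and assumes every non-empty scan has the same number of columns, so that the scans can be stacked with vertcat.

```matlab
% Toy scan log (hypothetical values); each cell lists the MACs of one scan.
scans = {[11; 22], 11, [], [11; 33], 22};
a = 11;   % the MAC to verify

% Way 1: explicit loops, in the style of checkdata.
loopCount = 0;
for n = 1:numel(scans)
    everyScan = scans{n};
    for r = 1:size(everyScan, 1)
        if everyScan(r, 1) == a
            loopCount = loopCount + 1;
        end
    end
end

% Way 2: stack all scans into one column and count the matches.
allMacs = vertcat(scans{:});
stackedCount = nnz(allMacs(:, 1) == a);

assert(loopCount == stackedCount, 'the two counts disagree');
disp(loopCount);
```

With the real data, stackedCount for one subject and one MAC would be compared against the corresponding entry of durationbluetooth.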

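Next, let's look at how well random recommendation performs.

The random recommendation algorithm, which takes the sortduration data produced by the ranking algorithm above together with the original network data: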

% Baseline for comparison with the ranking produced above: random recommendation.
% Plots the number of hits against the number of recommended friends;
% under normal conditions the relationship should be roughly linear.
function recommendfriends(Sortfreq,Network)
     [~,wN] = size(Network.friends);
     [~,wS] = size(Sortfreq);
     relations = zeros(1,wS);

     % The number of recommended friends runs from 1 to 106.
     for m = 1:wS  % pick m subjects at random and count the hits
         randomHit = 0;

         r = randperm(wS);         % random permutation of 1..106
         selectedMatrix = r(1:m);  % the m randomly recommended subjects (row vector)
         % n is a row of Network.friends, not a subject index.
         for n = 1:wN-1      % rows 1..93 correspond to subjects 3..106
             randomHit = randomHit + hits(n,selectedMatrix);
         end
         % Store the total number of hits for this value of m.
         relations(m) = randomHit;
     end
     save('relations','relations');

     % Smooth the data, then plot it.
     % smooth() requires the Curve Fitting Toolbox.
     smoothData = smooth(relations,5);
     plot(1:wS,smoothData(1:wS),'r*');

    % Count how many real friends of row n are among the selected subjects.
    function value = hits(n,selectedMatrix)
        value = 0;
        for i = 1:wN-1   % rows 1..93
            if Network.friends(n,i) >= 1
                realIndex = Network.sub_sort(i);   % subject index, 1..106
                if any(selectedMatrix == realIndex)
                    value = value+1;
                end
            end
        end
    end % nested function
end % outer function
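A note on the smooth(relations,5) call in the function above: smooth belongs to the Curve Fitting Toolbox. If that toolbox is not installed, a five-point moving average can be sketched with the built-in movmean (R2016a and later). This is a substitute, not what the original code uses, and the two functions treat the first and last few points differently, so the curves may not match exactly at the ends. The input vector below is made-up toy data.

```matlab
% Toy hit counts (hypothetical values standing in for the relations vector).
relations = [0 2 1 5 4 8 7];

% Centered 5-point moving average; near the ends movmean averages
% over whatever part of the window falls inside the vector.
smoothData = movmean(relations, 5);

disp(smoothData);
```

For the middle element the result is the mean of 2, 1, 5, 4 and 8, which is 4.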

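Because the data is sparse, I applied a simple smoothing step, which makes the curve look much better. The result is shown in the figure: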

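A brief explanation: to validate the recommendation algorithm, I only compare it against random recommendation. The metric is the number of hits. Friendships in the real data are fairly sparse: the friendship data contains 125 entries in total.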

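For each person, the more friends are recommended, the more of that person's real friends get hit. So without any algorithm at all, the number of hits grows linearly with the number of recommended friends.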
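The linear relationship can be made concrete with a short calculation. If m of the 106 subject indices are picked uniformly at random, each real friend is among them with probability m/106, so the expected total number of hits is 125 * m / 106. The sketch below evaluates that line. It takes the figures 125 and 106 from this post and assumes that every friendship entry is equally likely to be hit.

```matlab
% Figures quoted in the post.
totalFriendEntries = 125;   % friendship entries in the data
numSubjects = 106;          % subjects that can be recommended

% Expected hits under uniform random recommendation: each friend entry
% is selected with probability m / numSubjects.
m = 1:numSubjects;
expectedHits = totalFriendEntries * m / numSubjects;

fprintf('m = 1: %.2f expected hits, m = %d: %.0f expected hits\n', ...
        expectedHits(1), numSubjects, expectedHits(end));
```

When all 106 subjects are recommended, every friend entry is hit, so the line ends at 125.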

Recommendation comparison figure:

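I ran the experiment with both encounter duration and encounter frequency. The results are shown in the figure, and there is essentially no difference between the two.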

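The experiment is finally done, and it turned out as I originally expected: duration-based recommendation gives relatively good results at the start, and as the number of recommended people grows it gradually becomes equivalent to random recommendation.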
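The code for the ranked recommendation curve is not included in the post. A minimal sketch of how hits could be counted when the top m candidates of each row are recommended is given below. The two 4x4 matrices are made-up toy data, and the sketch skips the sub_sort index mapping that the real data requires.

```matlab
% Toy data (hypothetical values): ground-truth friendships and
% Bluetooth co-occurrence counts for 4 subjects.
friends = [0 1 0 0;
           1 0 1 0;
           0 1 0 0;
           0 0 0 0];
counts  = [0 9 1 0;
           7 0 5 2;
           0 6 0 3;
           1 0 2 0];

n = size(friends, 1);
% Each row of ranking lists the other subjects from highest to lowest count.
[~, ranking] = sort(counts, 2, 'descend');

hitsAtM = zeros(1, n);
for m = 1:n
    for i = 1:n
        topM = ranking(i, 1:m);                                % the m recommended subjects
        hitsAtM(m) = hitsAtM(m) + nnz(friends(i, topM) >= 1);  % real friends among them
    end
end
disp(hitsAtM);
```

In this toy example most friends are found with the very first recommendation and the curve flattens after that, which is the same shape the post describes for the real data.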

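I generated a lot of intermediate data for these experiments; email me if you need any of it. The code in this post is provided for reference. Please do not plagiarize it.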
