《乾》:元,亨,利,貞。
初九:潛龍,勿用。
九二:見龍在田,利見大人。
九三:君子終日乾乾,夕惕若厲,無咎。
九四:或躍在淵,無咎。
九五:飛龍在天,利見大人。
上九:亢龍,有悔。
用九:見群龍無首,吉。
1. Graph concepts
An interactive tutorial site: https://d3gt.com/unit.html
Graphs
- A graph G = (V,E) consists of a finite nonempty set V of vertices or nodes, and a set E ⊆ V × V of edges consisting of unordered pairs of vertices.
- In a weighted graph, every edge (vi, vj) ∈ E has an associated weight wij.
- (vi, vi): a loop. An undirected graph without loops is called a simple graph.
- (vi, vj): the two nodes are called neighbors and are adjacent to each other. In a directed graph, the directed edge is also called an arc, with vi as the tail and vj as the head.
- |V| = n, the number of nodes in G, is also called the order of the graph.
- |E| = m, the number of edges in G, is also called the size of the graph. For example, the graph in the figure has order = 6 and size = 5.
Subgraphs
- A subgraph G' of a graph G is a graph G' whose vertex set and edge set are subsets of those of G. If G' is a subgraph of G, then G is said to be a supergraph of G' (Harary 1994, p. 11).
- A (sub)graph is called complete (or a clique) if there exists an edge between all pairs of nodes.
Degree
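- The degree of a vertex v is the number of edges incident to it, denoted deg(v). The minimum degree of a graph is written δ(G) and the maximum degree ∆(G). To avoid confusion, the former is called small delta and the latter big delta. Keep track of which quantities are properties of the graph G and which are properties of a vertex v.
- The degree of a node vi ∈ V is also denoted d(vi) or just di.
- The degree sequence is the list of the degrees of all nodes. For the graph in the figure, degree sequence = (5,4,4,4,4,4,1).
- For directed graphs, the indegree of vi, written id(vi), is the number of edges that have vi as their head (incoming edges); the outdegree od(vi) is the number of outgoing edges from vi.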
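The definitions above can be sketched in a few lines of Python. This is only an illustration: the edge list is a hypothetical 5-node graph, not one of the figures in these notes, and `build_adjacency` is a helper name made up for the example.

```python
from collections import Counter, defaultdict

# Hypothetical undirected simple graph (not a figure from these notes).
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]

def build_adjacency(edge_list):
    """Map each node to the set of its neighbors (undirected graph)."""
    adj = defaultdict(set)
    for u, v in edge_list:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)

adj = build_adjacency(edges)
order = len(adj)                      # |V| = n
size = len(edges)                     # |E| = m
degree = {v: len(nbrs) for v, nbrs in adj.items()}
degree_sequence = sorted(degree.values(), reverse=True)
min_degree = min(degree.values())     # small delta, δ(G)
max_degree = max(degree.values())     # big delta, ∆(G)

# Directed case: each arc is written as (tail, head).
arcs = [(1, 2), (1, 3), (2, 3)]
indegree = Counter(head for _, head in arcs)    # id(v): arcs with v as head
outdegree = Counter(tail for tail, _ in arcs)   # od(v): arcs with v as tail
```

Here δ(G) and ∆(G) are computed over the whole graph, while `degree[v]`, `indegree[v]` and `outdegree[v]` are per-vertex quantities.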
Path and Distance
- A walk in a graph G between nodes x and y is an ordered sequence of vertices, starting at x and ending at y, in which every two consecutive vertices are joined by an edge.
- The length of the walk, t, is measured in terms of hops – the number of edges along the walk
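- A trail is a walk with no repeated edges.
- A path is a walk with no repeated vertices (except possibly the first and the last).
- A cycle is a closed trail of length ≥ 3 that starts and ends at the same vertex and repeats no other vertex.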
- The distance between two vertices in a graph is the number of edges in a shortest or minimal path.
Connectedness
- Two nodes are connected if there exists a path between them.
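Distance and connectedness can both be read off a breadth-first search. The sketch below assumes an unweighted graph stored as a dictionary of neighbor sets; the graph itself (including the isolated node 6) and the function names are made up for illustration.

```python
from collections import deque

# Hypothetical graph: a triangle 1-2-3, a tail 3-4-5, and an isolated node 6.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}, 6: set()}

def bfs_distances(adj, source):
    """Return {node: number of hops from source} for every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:          # first visit gives the shortest distance
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def are_connected(adj, x, y):
    """Two nodes are connected if some path joins them."""
    return y in bfs_distances(adj, x)

def is_connected_graph(adj):
    """A graph is connected if one node can reach all the others."""
    start = next(iter(adj))
    return len(bfs_distances(adj, start)) == len(adj)
```

Nodes that never appear in the returned dictionary are unreachable from the source, which is how the isolated node shows up here.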
An example:
- The degree sequence of the graph is (4,4,4,3,2,2,2,1), and therefore its degree frequency distribution is given as (N0,N1,N2,N3,N4) = (0,1,3,1,3).
- The degree distribution is given as (f(0),f(1),f(2),f(3),f(4)) = (0,0.125,0.375,0.125,0.375).
- For graph (b), the indegree of v7 is id(v7) = 2, whereas its outdegree is od(v7) = 0.
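The frequency and distribution vectors in this example can be reproduced directly from the degree sequence. A small sketch, where only the variable names are assumptions:

```python
from collections import Counter

degree_sequence = [4, 4, 4, 3, 2, 2, 2, 1]   # from the example above
n = len(degree_sequence)

counts = Counter(degree_sequence)
max_k = max(degree_sequence)

# N[k] = number of nodes with degree k, for k = 0..max_k
N = [counts.get(k, 0) for k in range(max_k + 1)]

# f[k] = fraction of nodes with degree k
f = [N_k / n for N_k in N]

print(N)  # [0, 1, 3, 1, 3]
print(f)  # [0.0, 0.125, 0.375, 0.125, 0.375]
```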
Adjacency matrix
- A graph G = (V,E), with |V| = n vertices, can be conveniently represented in the form of an n × n, symmetric binary adjacency matrix, A
- For a directed graph, the adjacency matrix is in general not symmetric.
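A minimal sketch of the adjacency matrix as a list of lists, assuming nodes are numbered 0 to n − 1. The edge list and function names are hypothetical.

```python
def adjacency_matrix(n, edge_list, directed=False):
    """Build an n x n binary matrix with A[i][j] = 1 when i and j are adjacent."""
    A = [[0] * n for _ in range(n)]
    for i, j in edge_list:
        A[i][j] = 1
        if not directed:
            A[j][i] = 1      # undirected edges are stored in both directions
    return A

def is_symmetric(A):
    n = len(A)
    return all(A[i][j] == A[j][i] for i in range(n) for j in range(n))

edges = [(0, 1), (0, 2), (1, 2), (2, 3)]     # hypothetical edge list
A_undirected = adjacency_matrix(4, edges)
A_directed = adjacency_matrix(4, edges, directed=True)

# In the undirected case, each row sum is the degree of that node.
degrees = [sum(row) for row in A_undirected]
```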
Graphs from data matrix
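- This is about turning data into the graph form we need: given a dataset of n points in d-dimensional space, build a weighted graph on the n points whose edge weights come from a similarity (or distance) measure between pairs of points, then map the resulting weight matrix to a binary adjacency matrix.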
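One possible way to do this is sketched below. The notes do not fix a similarity measure, so the Gaussian similarity, the `sigma` parameter, the threshold value, and the sample points are all assumptions made for illustration.

```python
from math import exp

def similarity_graph(points, sigma=1.0, threshold=0.5):
    """Return (W, A): a weighted similarity matrix and its thresholded binary version.

    Assumption: similarity is the Gaussian kernel exp(-||xi - xj||^2 / (2 sigma^2)),
    and an edge is kept when the similarity is at least `threshold`.
    """
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            w = exp(-sq_dist / (2 * sigma ** 2))
            W[i][j] = W[j][i] = w
            if w >= threshold:
                A[i][j] = A[j][i] = 1
    return W, A

# Hypothetical 2-dimensional dataset with two well-separated pairs of points.
points = [(0, 0), (0, 1), (5, 5), (5, 6)]
W, A = similarity_graph(points)
```

With these values the two nearby pairs are linked and the far-apart points are not, so the binary graph has two components.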
2. Topological attributes
Attributes that apply only to a single node or edge are local; attributes that apply to the whole graph are global.
Degree (local)
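Average degree: μd = (Σi di) / n, which equals 2m / n for an undirected graph.
Average path length (also called characteristic path length)
For a connected graph: μL = (Σi Σj>i d(vi, vj)) / (n(n − 1)/2) = (2 / (n(n − 1))) Σi Σj>i d(vi, vj)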
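As a rough sketch, the average degree and the average path length of a connected graph can be computed as follows. The 5-node graph is hypothetical, and running one breadth-first search per pair is chosen for brevity, not speed.

```python
from collections import deque
from itertools import combinations

# Hypothetical connected graph: triangle 1-2-3 with a tail 3-4-5.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}

def bfs_distances(adj, source):
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

n = len(adj)
average_degree = sum(len(nbrs) for nbrs in adj.values()) / n

# One distance per unordered pair of nodes (the graph is assumed connected).
pair_distances = [bfs_distances(adj, u)[v] for u, v in combinations(adj, 2)]
average_path_length = sum(pair_distances) / len(pair_distances)
```

For this graph the ten pairwise distances sum to 17, giving an average path length of 1.7 and an average degree of 2.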
Eccentricity (local)
- The eccentricity of a vertex v is the maximum distance from v to any other vertex, denoted e(v).
Radius and diameter
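- For a disconnected graph, these are evaluated over all the connected components, i.e. only pairs of nodes that are connected are considered.
- Radius, r(G): the minimum eccentricity over all vertices, r(G) = min e(vi).
- Diameter, d(G): the maximum eccentricity over all vertices, d(G) = max e(vi), i.e. the largest distance between any pair of vertices.
- The diameter is sensitive to outliers, so the effective diameter is also used: the minimum number of hops within which a large fraction of all connected pairs of nodes can reach each other.
- For example, in the figure, 94% of the pairs of nodes lie within 7 hops, so the effective diameter can be taken to be 7.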
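A sketch of these quantities for a small connected graph. The graph is hypothetical, and the `fraction` parameter of `effective_diameter` (0.9 by default) is an assumed threshold, since the notes do not fix one.

```python
from collections import deque
from math import ceil

# Hypothetical connected graph: triangle 1-2-3 with a tail 3-4-5.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}

def bfs_distances(adj, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

dist = {u: bfs_distances(adj, u) for u in adj}

# Eccentricity: largest distance from a node to any node it can reach.
eccentricity = {u: max(row.values()) for u, row in dist.items()}
radius = min(eccentricity.values())
diameter = max(eccentricity.values())

def effective_diameter(dist, fraction=0.9):
    """Smallest hop count that covers at least `fraction` of connected pairs."""
    lengths = sorted(d for u, row in dist.items() for v, d in row.items() if u < v)
    needed = ceil(fraction * len(lengths))
    return lengths[needed - 1]
```

Nodes whose eccentricity equals the radius (here nodes 3 and 4) sit at the center of the graph, while those at the diameter are on the periphery.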
Clustering coefficient
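- For the subgraph Gi induced by the neighbors of vi, with ni nodes and mi edges, the clustering coefficient of vi is defined as the number of edges actually present in Gi divided by the maximum possible number: C(vi) = mi / (ni(ni − 1)/2) = 2mi / (ni(ni − 1)).
- The clustering coefficient of a graph G is simply the average clustering coefficient over all the nodes.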
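The two definitions can be sketched as follows. The graph is a hypothetical 5-node example, and treating nodes with fewer than two neighbors as having coefficient 0 is an assumed convention.

```python
from itertools import combinations

# Hypothetical graph: triangle 1-2-3 with a tail 3-4-5.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}

def clustering_coefficient(adj, v):
    """Edges among the neighbors of v divided by the maximum possible number."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0                      # assumed convention for degree 0 or 1
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return links / (k * (k - 1) / 2)

def graph_clustering_coefficient(adj):
    """Average of the node clustering coefficients."""
    return sum(clustering_coefficient(adj, v) for v in adj) / len(adj)
```

In this graph node 3 has three neighbors with one edge among them, so its coefficient is 1/3, and the graph average is (1 + 1 + 1/3 + 0 + 0) / 5 = 7/15.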
Efficiency
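- The efficiency for a pair of nodes vi and vj is defined as 1/d(vi,vj). If the two nodes are not connected, d is infinite and the efficiency is 0; the smaller the distance between them, the more efficient the pair.
- Efficiency for a graph G is the average efficiency over all pairs of nodes, whether connected or not, given as: (2 / (n(n − 1))) Σi Σj>i 1/d(vi, vj)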
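A sketch of graph efficiency, plus local efficiency under the assumption that the local efficiency of a node is the efficiency of the subgraph induced by its neighbors. The graph and function names are hypothetical.

```python
from collections import deque
from itertools import combinations

# Hypothetical graph: triangle 1-2-3 with a tail 3-4-5.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}

def bfs_distances(adj, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def efficiency(adj):
    """Average of 1/d over all pairs; unreachable pairs contribute 0."""
    n = len(adj)
    if n < 2:
        return 0.0
    total = 0.0
    for u, v in combinations(adj, 2):
        d = bfs_distances(adj, u).get(v)     # None when v is unreachable
        if d is not None:
            total += 1.0 / d
    return 2 * total / (n * (n - 1))

def local_efficiency(adj, v):
    """Efficiency of the subgraph induced by the neighbors of v (assumed definition)."""
    nbrs = adj[v]
    subgraph = {u: adj[u] & nbrs for u in nbrs}
    return efficiency(subgraph)
```

For node 3 the neighbor subgraph has nodes 1, 2, 4 and the single edge 1-2, so only one of its three pairs is connected and the local efficiency is 1/3.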
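As an example, using the example graph shown earlier,
find the clustering coefficient of node v4 and of the whole graph, as well as the local efficiency of v4.
Recall that the clustering coefficient of a node depends on the subgraph formed by its neighbors: it is the actual number of edges in that subgraph divided by the maximum possible number of edges. The clustering coefficient of a graph is simply the average of the clustering coefficients of its nodes.
From the figure
we get C(v4) = 2/6 = 0.33 and C(G) = 1/8 × (1/2 + 1/3 + 1 + 1/3 + 1/3 + 0 + 0 + 0) = 0.3125.
The local efficiency of v4 uses the efficiency formula above, applied to the subgraph formed by the neighbors of v4: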
3. Centrality analysis
- Centrality measures have typically been used as indicators of power, influence, popularity and prestige.
3.1 Basic centralities
Degree Centrality
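- Simply count the edges incident to the node, i.e. its degree.
Eccentricity centrality
- The less eccentric a node is, the more central it is. Take the largest distance (length of a shortest path) from the node to any other node, then take its reciprocal.
- A node whose eccentricity equals the radius is a center node; a node whose eccentricity equals the diameter is a periphery node. (This measure suits facility-location problems such as choosing a site for a hospital.)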
Closeness Centrality
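- Closeness is the reciprocal of farness, where farness is the sum of the distances from a node to all other nodes.
- Uses the sum of all the distances to rank how central a node is.
- The node with the smallest total distance is called the median node.
- For comparison purposes, we can standardize the closeness by dividing by the maximum possible value, 1/(n − 1).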
- The more central a node is, the lower its total distance to all other nodes.
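A minimal sketch of closeness centrality on a hypothetical connected graph. The function names are illustrative, and the code assumes every node can reach every other node.

```python
from collections import deque

# Hypothetical connected graph: triangle 1-2-3 with a tail 3-4-5.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}

def bfs_distances(adj, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def farness(adj, v):
    """Sum of distances from v to all other nodes."""
    return sum(bfs_distances(adj, v).values())

def closeness(adj, v):
    return 1.0 / farness(adj, v)

def normalized_closeness(adj, v):
    """Closeness divided by its maximum possible value 1/(n - 1)."""
    return (len(adj) - 1) / farness(adj, v)

# The median node has the smallest total distance, hence the largest closeness.
median_node = min(adj, key=lambda v: farness(adj, v))
```

Here node 3 has total distance 1 + 1 + 1 + 2 = 5, the smallest in the graph, so it is the median node with normalized closeness 4/5.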
Betweenness centrality
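- Identifies brokers, bridges, and bottlenecks.
- Measures how many of the shortest paths between all pairs of vertices pass through a given vertex.
- First count the number of shortest paths between two nodes vj and vk, then count how many of those paths pass through the given vertex vi, and take the fraction; the betweenness of vi is the sum of these fractions over all pairs. (Note that the pairs j, k are chosen without including i.)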
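The counting procedure above can be sketched with breadth-first searches that record both distances and shortest-path counts. This is a straightforward, unoptimized illustration for unweighted, undirected graphs; the graphs and function names are hypothetical.

```python
from collections import deque
from itertools import combinations

def bfs_counts(adj, source):
    """Return (dist, sigma): hop distances and number of shortest paths from source."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:     # u lies on a shortest path to w
                sigma[w] += sigma[u]
    return dist, sigma

def betweenness(adj):
    info = {s: bfs_counts(adj, s) for s in adj}
    result = {v: 0.0 for v in adj}
    for j, k in combinations(sorted(adj), 2):
        dist_j, sigma_j = info[j]
        if k not in dist_j:
            continue                        # no path between j and k
        for i in adj:
            if i == j or i == k or i not in dist_j:
                continue
            dist_i, sigma_i = info[i]
            # i is on a shortest j-k path when the two legs add up exactly
            if k in dist_i and dist_j[i] + dist_i[k] == dist_j[k]:
                result[i] += sigma_j[i] * sigma_i[k] / sigma_j[k]
    return result

# Hypothetical graphs: a triangle with a tail, and a 4-cycle.
tail_graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
square = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
```

In the 4-cycle each opposite pair has two shortest paths, so every node carries half of one pair and gets betweenness 0.5.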
An example of computing betweenness (http://www2.unb.ca/~ddu/6634/Lecture_notes/Lecture_4_centrality_measure.pdf):
Next, the centralities of each node in the earlier example graph:
3.2 Web centralities
On the web, most of the networks considered are directed.
Prestige score (eigenvector centrality)
- As a centrality, prestige is supposed to be a measure of the importance or rank of a node/ the influence of a node in a network
- A high eigenvector score means that a node is connected to many nodes who themselves have high scores
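- In other words, it looks at who gives the most or who receives the most.
- In short, the node-to-node relationship (adjacency) matrix is reduced to its dominant eigenvector, whose entries are then compared. The procedure is:
For example, take 5 nodes, i.e. a 5 × 5 relationship matrix, and start with an initial prestige vector p0 = (1,1,1,1,1)T.
After each iteration, the vector is scaled by its maximum entry. The ratio of the vector p obtained in an iteration to the vector from the previous iteration gives λ, the eigenvalue.
After many iterations λ stabilizes at some value, as shown in the figure:
Normalizing the final vector to a unit vector gives the dominant eigenvector. Comparing its entries, the node with the larger value has the higher prestige.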
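The iteration described above is the power method. The sketch below assumes the update p ← Aᵀp (a node collects prestige from the nodes that point to it) and uses a hypothetical 3-node directed graph, not the 5-node example from the notes; it also assumes the graph is such that the iteration converges and never produces an all-zero vector.

```python
from math import sqrt

def prestige(A, iterations=100):
    """Power iteration on A^T. Returns (eigenvalue estimate, unit eigenvector)."""
    n = len(A)
    p = [1.0] * n                                  # initial prestige vector p0
    lam = 0.0
    for _ in range(iterations):
        # p_new[j] = sum of the prestige of all nodes i with an edge i -> j
        p_new = [sum(A[i][j] * p[i] for i in range(n)) for j in range(n)]
        lam = max(p_new)                           # p has maximum 1, so this estimates λ
        p = [x / lam for x in p_new]               # scale by the maximum entry
    norm = sqrt(sum(x * x for x in p))
    return lam, [x / norm for x in p]

# Hypothetical directed graph with edges 0->1, 0->2, 1->2, 2->0.
A = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
lam, p = prestige(A)
most_prestigious = max(range(len(p)), key=lambda i: p[i])
```

For this matrix λ settles near 1.3247, and node 2, which receives links from both other nodes, ends up with the largest entry.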
Random jumps
- In random surfing, even when two nodes are not linked, the surfer may still jump from one to the other.
Page rank
- a method for computing the prestige or centrality of nodes in the context of Web search.
- It uses the random surfing assumption, i.e. that people click on links at random.
- The PageRank of a Web page is defined to be the probability of a random web surfer landing at that page.
Normalized Prestige
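- Normalized prestige divides each node's contribution by its outdegree: a surfer at node u follows each of its outgoing links with probability 1/od(u). Random jumps are handled the same way: for the random surfer matrix, every node is treated as linked to all n nodes, so the outdegree of each node is od(u) = n.
- So far, the PageRank vector is essentially a normalized prestige vector.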
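For example, again using the adjacency list of the 5 nodes from the prestige example, first normalize it:
Then normalize for random jumps (the normalized random jump adjacency matrix):
Assuming the small jump probability is α = 0.1, the combined normalized adjacency matrix is M = (1 − α)N + αNr = 0.9N + 0.1Nr.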
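A sketch of this construction and the resulting PageRank iteration. The 3-node directed graph is hypothetical, the update p ← Mᵀp is assumed to match the prestige iteration, and every node is assumed to have at least one outgoing edge so that the row normalization is defined.

```python
def pagerank_matrix(A, alpha=0.1):
    """M = (1 - alpha) * N + alpha * Nr, with N row-normalized and Nr = 1/n everywhere."""
    n = len(A)
    M = []
    for row in A:
        od = sum(row)                 # outdegree, assumed > 0 for every node
        M.append([(1 - alpha) * a / od + alpha / n for a in row])
    return M

def pagerank(A, alpha=0.1, iterations=100):
    n = len(A)
    M = pagerank_matrix(A, alpha)
    p = [1.0 / n] * n                 # start from the uniform distribution
    for _ in range(iterations):
        # p_new[j] = probability of landing on j after one more step
        p = [sum(M[i][j] * p[i] for i in range(n)) for j in range(n)]
    return p

# Hypothetical directed graph with edges 0->1, 0->2, 1->2, 2->0.
A = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
M = pagerank_matrix(A, alpha=0.1)
p = pagerank(A, alpha=0.1)
```

Because every row of M sums to 1, the vector p stays a probability distribution throughout, which matches the reading of PageRank as the probability of a random surfer landing on each page.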
Hub and Authority Scores
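- This concept was introduced to solve the ranking problem in web search. It is sometimes also called the Hyperlink Induced Topic Search (HITS) method.
- Unlike PageRank, it introduces two flavors of importance: an authority, a page that contains information on the topic of interest, and a hub, a page that points to authorities with the desired information. (For example, a university ranking website is a hub, and the university you want to know about is an authority.)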
- The authority score of a page is analogous to PageRank or prestige, and it depends on how many “good” pages point to it. On the other hand, the hub score of a page is based on how many “good” pages it points to.
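- In the same way, each web page is assigned two scores: an authority score (a) and a hub score (h).
The computation resembles the earlier prestige example. Write down the original relationship matrix A and its transpose, and start with a vector a of all ones. Multiply A by a and scale by the maximum entry to get the h vector after the first iteration; then multiply the transpose by this h vector to get a, and scale again to get the a vector after the first iteration. Repeat until the scores stabilize; a and h converge to the dominant eigenvectors of ATA and AAT respectively.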
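The alternating updates can be sketched as follows, with h ← A·a and a ← Aᵀ·h, each scaled by its maximum entry. The 4-node graph is hypothetical, and the fixed iteration count is an arbitrary choice in place of a convergence check.

```python
def hits(A, iterations=50):
    """Return (authority, hub) score vectors, each scaled so its maximum is 1."""
    n = len(A)
    a = [1.0] * n                     # start with all-ones authority scores
    h = [1.0] * n
    for _ in range(iterations):
        # hub score: sum of the authority scores of the pages it points to
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        top = max(h)
        h = [x / top for x in h]
        # authority score: sum of the hub scores of the pages pointing to it
        a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]
        top = max(a)
        a = [x / top for x in a]
    return a, h

# Hypothetical directed graph: pages 0 and 1 link out, pages 2 and 3 are linked to.
# Edges: 0->2, 0->3, 1->2.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
authority, hub = hits(A)
```

Page 2 is pointed to by both hubs and becomes the top authority, while page 0, which points to both authorities, is the top hub.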
4. Graph models
Three common properties
Small-world Property
- average path length μL ∝ log n, where n is the number of nodes in the graph
Scale-free Property
- the empirical degree distribution f(k) exhibits scale-free behavior captured by a power-law relationship with k, f(k) ∝ k^(−γ)
Clustering Effect
Erdős–Rényi Random Graph Model
- generates a random graph such that any of the possible graphs with a fixed number of nodes and edges has equal probability of being chosen.
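One simple way to realize this description is to draw m distinct edges uniformly at random from all possible node pairs. The sketch assumes an undirected simple graph on nodes 0 to n − 1; the function name and the `seed` parameter are illustrative.

```python
import random
from itertools import combinations

def erdos_renyi_gnm(n, m, seed=None):
    """Sample a graph with n nodes and exactly m edges, uniformly at random.

    Every possible set of m edges is equally likely, so every graph with
    n nodes and m edges has the same probability of being chosen.
    """
    rng = random.Random(seed)
    possible_edges = list(combinations(range(n), 2))   # all n(n-1)/2 pairs
    return rng.sample(possible_edges, m)               # m distinct pairs

edges = erdos_renyi_gnm(10, 15, seed=0)
```

Listing every possible pair costs O(n²) memory, which is fine for a small illustration but not for large graphs.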
Watts–Strogatz Small-world Graph Model
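- The model starts from a regular network in which each node is connected to its nearest neighbors on a ring. Such a network will have a high clustering coefficient, but will not be small-world; a small amount of random rewiring or added edges is what shortens the average path length.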
Barabási–Albert Scale-free Model
Barabási's book on network science is also worth a look.
That is all for this part; the detailed theory and models are left for other self-review notes.
Some references
- Emirbayer/Goodwin (1994): Network Analysis, Culture, and the Problem of Agency. Identifies three social network paradigms: structural determinism, structural instrumentalism, and structural constructionism.
- Freeman (2004): The Development of Social Network Analysis: A Study in the Sociology of Science
- The SAGE Handbook of Social Network Analysis: a classic, hefty reference that cannot be left out.
- Link analysis: algorithms such as PageRank and hub/authority scores.