spark MLlib 概念 5: 余弦相似度(Cosine similarity)


概述:
余弦相似度 是對兩個向量相似度的描述,表現為兩個向量的夾角的余弦值。當方向相同時(調度為0),余弦值為1,標識強相關;當相互垂直時(在線性代數里,兩個維度垂直意味着他們相互獨立),余弦值為0,標識他們無關。
Cosine similarity  is a measure of similarity between two vectors of an  inner product space  that measures the  cosine  of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a Cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0,1].

定義
基礎知識。。

The cosine of two vectors can be derived by using the Euclidean dot product formula:

\mathbf{a}\cdot\mathbf{b}
=\left\|\mathbf{a}\right\|\left\|\mathbf{b}\right\|\cos\theta

Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitude as

\text{similarity} = \cos(\theta) = {A \cdot B \over \|A\| \|B\|} = \frac{ \sum\limits_{i=1}^{n}{A_i \times B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{(A_i)^2}} \times \sqrt{\sum\limits_{i=1}^{n}{(B_i)^2}} }

The resulting similarity ranges from −1 meaning exactly opposite, to 1 meaning exactly the same, with 0 usually indicating independence, and in-between values indicating intermediate similarity or dissimilarity.

與皮爾森相關系數的關系
If the attribute vectors are normalized by subtracting the vector means (e.g.,  A - \bar{A} ), the measure is called centered cosine similarity and is equivalent to the  Pearson Correlation Coefficient .










 




免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM