Cosine similarity

Cosine similarity is a measure of similarity between two vectors by measuring the cosine of the angle between them. The cosine of 0 is 1, and less than 1 for any other angle. The cosine of the angle between two vectors thus determines whether two vectors are pointing in roughly the same direction.

This is often used to compare documents in text mining. In addition, it is used to measure cohesion within clusters in the field of Data Mining^[1].

1 Math
- 1.1 Angular similarity
- 1.2 Confusion with "Tanimoto" coefficient
2 See also
3 External links
4 References

Math

The cosine of two vectors can be easily derived by using the Euclidean Dot Product formula:

$\mathbf{a}\cdot\mathbf{b} =\left\|\mathbf{a}\right\|\left\|\mathbf{b}\right\|\cos\theta$

Given two vectors of attributes, A and B, the cosine similarity, θ, is represented using a dot product and magnitude as

$\text{similarity} = \cos(\theta) = {A \cdot B \over \|A\| \|B\|} = \frac{ \sum_{i=1}^{n}{A_i \times B_i} }{ \sqrt{\sum_{i=1}^{n}{(A_i)^2}} \times \sqrt{\sum_{i=1}^{n}{(B_i)^2}} }$

The resulting similarity ranges from −1 meaning exactly opposite, to 1 meaning exactly the same, with 0 usually indicating independence, and in-between values indicating intermediate similarity or dissimilarity.

For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. The cosine similarity can be seen as a method of normalizing document length during comparison.

In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°.

Angular similarity

The term "cosine similarity" has also been used on occasion to express a different coefficient, although the most common use is as defined above. Using the same calculation of similarity, the normalised angle between the vectors can be used as a bounded similarity function within [0,1], calculated from the above definition of similarity by:

$1 - \left ( \frac{ \cos^{-1}( similarity )}{ \pi} \right )$

in a domain where vector coefficients may be positive or negative, or

$1 - \left ( \frac{ 2 \cdot \cos^{-1}( similarity ) }{ \pi } \right)$

in a domain where the vector coefficients are always positive.

Although the term "cosine similarity" has been used for this angular distance, the term is oddly used as the cosine of the angle is used only as a convenient mechanism for calculating the angle itself and is no part of the meaning. The advantage of the angular similarity coefficient is that, when used as a difference coefficeint (by subtracting it from 1) the resulting function is a proper distance metric, which is not the case for the first meaning. However for most uses this is not an important property. For any use where only the relative ordering of similarity or distance within a set of vectors is important, then which function is used is immaterial as the a resulting order will be unaffected by the choice.

Confusion with "Tanimoto" coefficient

On occasion, Cosine Similarity has been confused as a specialised form of a similarity coefficeint with a similar algebraic form:

$T(A,B) = {A \cdot B \over \|A\|^2 +\|B\|^2 - A \cdot B}$

In fact, this algebraic form was first defined by Tanimoto as a mechanism for calculating the Jaccard coefficient in the case where the sets being compared are represented as bit vectors. While the formula extends to vectors in general, it has quite different properties from Cosine Similarity and bears little relation other than its superficial appearance.

External links

References

^ P.-N. Tan, M. Steinbach & V. Kumar, "Introduction to Data Mining", , Addison-Wesley (2005), ISBN 0-321-32136-7, chapter 8; page 500

Categories:

Information retrieval

Wikimedia Foundation. 2010.

Игры ⚽ Нужна курсовая?

Look at other dictionaries:

Trigonometric functions — Cosine redirects here. For the similarity measure, see Cosine similarity. Trigonometry History Usage Functions Generalized Inverse functions … Wikipedia
Automatic summarization — is the creation of a shortened version of a text by a computer program. The product of this procedure still contains the most important points of the original text. The phenomenon of information overload has meant that access to coherent and… … Wikipedia
Коэффициент Охаи — Мера Охаи бинарная мера сходства, предложенная Акирой Охаи в 1957 году.[1] В дальнейшем была развита Дж. Баркманом.[2] Фамилия автора коэффициента в литературе переводилась как: Очиаи, Отиаи и т. п. Сам автор (Охаи) утверждает,… … Википедия
Jaccard index — The Jaccard index, also known as the Jaccard similarity coefficient (originally coined coefficient de communauté by Paul Jaccard), is a statistic used for comparing the similarity and diversity of sample sets.The Jaccard coefficient measures… … Wikipedia
Community structure — In the study of complex networks, a network is said to have community structure if the nodes of the network can be easily grouped into (potentially overlapping) sets of nodes such that each set of nodes is densely connected internally. In the… … Wikipedia
String metric — String metrics (also known as similarity metrics) are a class of textual based metrics resulting in a similarity or dissimilarity (distance) score between two pairs of text strings for approximate matching or comparison and in fuzzy string… … Wikipedia
SimMetrics — is an open source extensible library of similarity or distance metrics (also known as string metrics). The SimMetrics open source library includes the following metrics * Levenshtein distance, * Block distance or city block distance or L2… … Wikipedia
Tf–idf — The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus.… … Wikipedia
Latent semantic analysis — (LSA) is a technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA was … Wikipedia
Pythagorean theorem — See also: Pythagorean trigonometric identity The Pythagorean theorem: The sum of the areas of the two squares on the legs (a and b) equals the area of the square on the hypotenuse (c) … Wikipedia

Academic Dictionaries and Encyclopedias

Cosine similarity

Contents