Cluster labeling

Cluster labeling

Cluster labeling is closely related to the concept of text clustering. This process tries to select descriptive labels for the clusters obtained through a clustering algorithm such as Flat Clustering and Hierarchical Clustering. For example, a cluster of documents that talks about various internet protocols can be best labeled as "Internet Protocols". Typically, the labels are obtained by examining the contents of the documents in a cluster. A good label not only summarizes the central concept of a cluster but also uniquely differentiates it from other clusters in the collection.

Contents

Differential Cluster Labeling

Differential cluster labeling labels a cluster by comparing the terms in one cluster with the terms occurring in other clusters. The techniques used for feature selection in information retrieval, such as mutual information and chi-squared feature selection, can also be applied to differential cluster labeling. Terms having very low frequency are not the best in representing the whole cluster and can be omitted in labeling a cluster. By omitting those rare terms and using a differential test, one can achieve the best results with differential cluster labeling[1].

Mutual Information

In the fields of probability theory and information theory, mutual information measures the degree of dependence of two random variables. The mutual information of two variables X and Y is defined as:

I(X, Y) = \sum_{x\in X}{ \sum_{y\in Y} {p(x, y)log_2\left(\frac{p(x, y)}{p_1(x)p_2(y)}\right)}}

where p(x, y) is the joint probability distribution of the two variables, p1(x) is the probability distribution of X, and p2(y) is the probability distribution of Y.

In the case of cluster labeling, the variable X is associated with membership in a cluster, and the variable Y is associated with the presence of a term[2]. Both variables can have values of 0 or 1, so the equation can be rewritten as follows:

I(C, T) = \sum_{c\in {0, 1}}{ \sum_{t\in {0, 1}} {p(C = c, T = t)log_2\left(\frac{p(C = c, T = t)}{p(C = c)p(T = t)}\right)}}

In this case, p(C = 1) represents the probability that a randomly-selected document is a member of a particular cluster, and p(C = 0) represents the probability that it isn't. Similarly, p(T = 1) represents the probability that a randomly-selected document contains a given term, and p(T = 0) represents the probability that it doesn't. The joint probability distribution function p(C, T) represents the probability that two events occur simultaneously. For example, p(0, 0) is the probability that a document isn't a member of cluster c and doesn't contain term t; p(0, 1) is the probability that a document isn't a member of cluster c and does contain term t; and so on.

Example

The following example calculates the mutual information between the cluster "commerce" and the term "tariff":

Documents in "commerce" Documents not in "commerce"
Documents containing "tariff" 60 10,000
Documents not containing "tariff" 200 500,000

Total number of documents = (60 + 200 + 10,000 + 500,000) = 510,260

P (C = 1, T = 1) = 60/510,260

P (C = 1, T = 0) = 200/510,260

P (C = 0, T = 1) = 10,000/510,260

P (C = 0, T = 0) = 500,000/510,260

P (C = 1) = (# of documents in the cluster) / (total number of documents) = (60 + 200) / 510,260 = 260/510,260

P (C = 0) = (# of documents not in the cluster) / (total number of documents) = (10,000 + 500,000) / 510,260 = 510,000/510,260

P (T = 1) = (# of documents containing the term) / (total number of documents) = (60 + 10,000) / 510,260 = 10,060/510,260

P (T = 0) = (# of documents not containing the term) / (total number of documents) = (200 + 500,000) / 510,260 = 500,200/510,260

Plugging these probabilities into the above equation gives us the following MI value:

I(C, T) = \frac{60}{510,260} log_2\left(\frac{60/510,260}{260/510,260 * 10,060/510,260}\right) + \frac{200}{510,260} log_2\left(\frac{200/510,260}{260/510,260 * 500,200/510,260}\right) + \frac{10,000}{510,260} log_2\left(\frac{10,000/510,260}{510,000/510,260 * 10,060/510,260}\right) + \frac{500,000}{510,260} log_2\left(\frac{500,000/510,260}{510,000/510,260 * 500,200/510,260}\right) = \frac{60}{510,260} log_2\left(\frac{60*510,260}{260 * 10,060}\right) + \frac{200}{510,260} log_2\left(\frac{200*510,260}{260 * 500,200}\right) + \frac{10,000}{510,260} log_2\left(\frac{10,000*510,260}{510,000 * 10,060}\right) + \frac{500,000}{510,260} log_2\left(\frac{500,000*510,260}{510,000 * 500,200}\right)

= 0.000417322 - 0.000137100 - 0.000154725 + 0.000155158

= 0.000280655

Therefore, the mutual information between the cluster "commerce" and the term "tariff" is 0.000280655. We can create a label for the cluster "commerce" by calculating the mutual information of each term in the cluster, and selected the k terms with the highest MI value.

Chi-Squared Selection

The Pearson's chi-squared test can be used to calculate the probability that the occurrence of an event matches the initial expectations. In particular, it can be used to determine whether two events, A and B, are statistically independent. The value of the chi-squared statistic is:

X^2 = \sum_{a \in A}{\sum_{b \in B}{\frac{(O_{a,b} - E_{a, b})^2}{E_{a, b}}}}

where Oa,b is the observed frequency of a and b co-occurring, and Ea,b is the expected frequency of co-occurrence.

In the case of cluster labeling, the variable A is associated with membership in a cluster, and the variable B is associated with the presence of a term. Both variables can have values of 0 or 1, so the equation can be rewritten as follows:

X^2 = \sum_{a \in {0,1}}{\sum_{b \in {0,1}}{\frac{(O_{a,b} - E_{a, b})^2}{E_{a, b}}}}

For example, O1,0 is the observed number of documents that are in a particular cluster but don't contain a certain term, and E1,0 is the expected number of documents that are in a particular cluster but don't contain a certain term. Our initial assumption is that the two events are independent, so the expected probabilities of co-occurrence can be calculated by multiplying individual probabilities[3]:

E1,0 = N * P(C = 1) * P(T = 0)

where N is the total number of documents in the collection.

Example

Using the same data for the mutual information example, we can calculate the expected probabilities and plug them into the equation to calculate the chi-squared statistic:

Documents in "commerce" Documents not in "commerce"
Documents containing "tariff" 60 10,000
Documents not containing "tariff" 200 500,000

P (C = 1) = (# of documents in the cluster) / (total number of documents) = (60 + 200) / 510,260 = 260/510,260

P (C = 0) = (# of documents not in the cluster) / (total number of documents) = (10,000 + 500,000) / 510,260 = 510,000/510,260

P (T = 1) = (# of documents containing the term) / (total number of documents) = (60 + 10,000) / 510,260 = 10,060/510,260

P (T = 0) = (# of documents not containing the term) / (total number of documents) = (200 + 500,000) / 510,260 = 500,200/510,260

E_{0, 0} = 510,260 * \frac {510,000}{510,260} * \frac {500,200}{510,260}

= 499,945

E_{0,1}= 510,260 * \frac {510,000}{510,260} * \frac {10.060}{510,260}

= 10,055

E_{1,0}= 510,260 * \frac {260}{510,260} * \frac {500,200}{510,260}

= 254.9

E_{1,1}= 510,260 * \frac {260}{510,260} * \frac {10,060}{510,260}

= 5.13

X^2 = \frac{(500,000 - 499,945)^2}{499,945} + \frac{(10,000 - 10,055)^2}{10,055} + \frac{(260 - 254.9)^2}{254.9} + \frac{(60 - 5.13)^2}{5.13}

= 587

Since each variable can have two values, the number of degrees of freedom is (2 - 1)(2 - 1) = 1. The chi-squared distribution for one degree of freedom states that the probability of observing a value greater than 10.83 is less than 0.001. Therefore, we can reject the null hypothesis, which states that the two events are independent. Since the term "tariff" and the cluster "commerce" are dependent, we can assume that the term is a good label for the cluster.

Cluster-Internal Labeling

Cluster-internal labeling selects labels that only depend on the contents of the cluster of interest. No comparison is made with the other clusters. Cluster-internal labeling can use a variety of methods, such as finding terms that occur frequently in the centroid or finding the document that lies closest to the centroid.

Centroid Labels

A frequently-used model in the field of information retrieval is the vector space model, which represents documents as vectors. The entries in the vector correspond to terms in the vocabulary. Binary vectors have a value of 1 if the term is present within a particular document and 0 if it is absent. Many vectors make use of weights that reflect the importance of a term in a document, and/or the importance of the term in a document collection. For a particular cluster of documents, we can calculate the centroid by finding the arithmetic mean of all the document vectors. If an entry in the centroid vector has a high value, then the corresponding term occurs frequently within the cluster. These terms can be used as a label for the cluster. One downside to using centroid labeling is that it can pick up words like "place" and "word" that have a high frequency in written text, but have little relevance to the contents of the particular cluster.

Title Labels

An alternative to centroid labeling is title labeling. Here, we find the document within the cluster that has the smallest Euclidean distance to the centroid, and use its title as a label for the cluster. One advantage to using document titles is that they provide additional information that would not be present in a list of terms. However, they also have the potential to mislead the user, since one document might not be representative of the entire cluster.

External knowledge labels

Cluster labeling can be done indirectly using external knowledge such as pre-categorized knowledge such as the one of Wikipedia.[4] In such methods, a set of important cluster text features are first extracted from the cluster documents. These features then can be used to retrieve the (weighted) K-nearest categorized documents from which candidates for cluster labels can be extracted. The final step involves the ranking of such candidates. Suitable methods are such that are based on a voting or a fusion process which is determined using the set of categorized documents and the original cluster features.

External links

References

  1. ^ Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schutze. Introduction to Information Retrieval. Cambridge: Cambridge UP, 2008. Cluster Labeling. Stanford Natural Language Processing Group. Web. 25 Nov. 2009. <http://nlp.stanford.edu/IR-book/html/htmledition/cluster-labeling-1.html>.
  2. ^ Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schutze. Introduction to Information Retrieval. Cambridge: Cambridge UP, 2008. Mutual Information. Stanford Natural Language Processing Group. Web. 25 Nov. 2009. <http://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html>.
  3. ^ Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schutze. Introduction to Information Retrieval. Cambridge: Cambridge UP, 2008. Chi2 Feature Selection. Stanford Natural Language Processing Group. Web. 25 Nov. 2009. <http://nlp.stanford.edu/IR-book/html/htmledition/feature-selectionchi2-feature-selection-1.html>.
  4. ^ David Carmel, Haggai Roitman, Naama Zwerdling. Enhancing cluster labeling using wikipedia. SIGIR 2009: 139-146

Wikimedia Foundation. 2010.

Игры ⚽ Нужна курсовая?

Look at other dictionaries:

  • Cluster chemistry — In chemistry, a cluster is an ensemble of bound atoms intermediate in size between a molecule and a bulk solid. Clusters exist of diverse stoichiometries and nuclearities. For example, carbon and boron atoms form fullerene and borane clusters,… …   Wikipedia

  • Cluster-Analyse — Unter Clusteranalyse (der Begriff Ballungsanalyse wird selten verwendet) versteht man strukturentdeckende, multivariate Analyseverfahren zur Ermittlung von Gruppen (Clustern) von Objekten, deren Eigenschaften oder Eigenschaftsausprägungen… …   Deutsch Wikipedia

  • Gold cluster — Gold clusters in cluster chemistry are gold derived materials that can either be discrete molecules or larger colloidal particles. Both types are describes as nanoparticles, with diameters of less than one micrometer. Discrete gold clustersWell… …   Wikipedia

  • 2-satisfiability — In computer science, 2 satisfiability (abbreviated as 2 SAT or just 2SAT) is the problem of determining whether a collection of two valued (Boolean or binary) variables with constraints on pairs of variables can be assigned values satisfying all… …   Wikipedia

  • Polyhedral skeletal electron pair theory — In chemistry the polyhedral skeletal electron pair theory provides electron counting rules useful for predicting the structures of clusters such as borane and carborane clusters. The electron counting rules were originally formulated by Kenneth… …   Wikipedia

  • Chemical biology — is a scientific discipline spanning the fields of chemistry and biology that involves the application of chemical techniques and tools, often compounds produced through synthetic chemistry, to the study and manipulation of biological systems.… …   Wikipedia

  • Population groups in biomedicine — Biomedical researchers subdivide populations into groups with the goal of improving the prevention and treatment of diseases. Many studies have found that disease susceptibility and environmental responses vary among U.S. ethnicities, among New… …   Wikipedia

  • Clusteranalyse — Dieser Artikel wurde aufgrund von inhaltlichen Mängeln auf der Qualitätssicherungsseite der Redaktion Informatik eingetragen. Dies geschieht, um die Qualität der Artikel aus dem Themengebiet Informatik auf ein akzeptables Niveau zu bringen. Hilf… …   Deutsch Wikipedia

  • Ballungsanalyse — Unter Clusteranalyse (der Begriff Ballungsanalyse wird selten verwendet) versteht man strukturentdeckende, multivariate Analyseverfahren zur Ermittlung von Gruppen (Clustern) von Objekten, deren Eigenschaften oder Eigenschaftsausprägungen… …   Deutsch Wikipedia

  • Clustering — Unter Clusteranalyse (der Begriff Ballungsanalyse wird selten verwendet) versteht man strukturentdeckende, multivariate Analyseverfahren zur Ermittlung von Gruppen (Clustern) von Objekten, deren Eigenschaften oder Eigenschaftsausprägungen… …   Deutsch Wikipedia

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”