Category utility

﻿
Category utility

Category utility is a measure of "category goodness" defined in Harvtxt|Gluck|Corter|1985 and Harvtxt|Corter|Gluck|1992. It was intended to supersede more limited measures of category goodness such as "cue validity" (Harvnb|Reed|1972;Harvnb|Rosch|Mervis|1975) and "collocation index" Harv|Jones|1983. It provides a normative information-theoretic measure of the "predictive advantage" gained by the observer who possesses knowledge of the given category structure (i.e., the class labels of instances) over the observer who does "not" possess knowledge of the category structure. In this sense the motivation for the "category utility" measure is similar to the information gain metric used in decision tree learning. In certain presentations, it is also formally equivalent to the mutual information, as discussed below. An excellent review of "category utility" in its probabilistic incarnation, with applications to machine learning, is provided in Harvtxt|Witten|Frank|2005|pp=260–262.

Probability-theoretic definition of the Category Utility

The probability-theoretic definition of "category utility" given in Harvtxt|Fisher|1987 and Harvtxt|Witten|Frank|2005 is as follows:

:$CU\left(C,F\right) = frac\left\{1\right\}\left\{p\right\} sum_\left\{c_j in C\right\} p\left(c_j\right) left \left[sum_\left\{f_i in F\right\} sum_\left\{k=1\right\}^m p\left(f_\left\{ik\right\}|c_j\right)^2 - sum_\left\{f_i in F\right\} sum_\left\{k=1\right\}^m p\left(f_\left\{ik\right\}\right)^2 ight \right]$

where $F = \left\{f_i\right\}, i=1 ldots n$ is a size-$n$ set of $m$-ary features, and $C = \left\{c_j\right\} j=1 ldots p$ is a set of $p$ categories. The term $p\left(f_\left\{ik\right\}\right)$ designates the marginal probability that feature $f_i$ takes on value $k$, and the term $p\left(f_\left\{ik\right\}|c_j\right)$ designates the category-conditional probability that feature $f_i$ takes on value $k$ "given" that the object in question belongs to category $c_j$.

The motivation and development of this expression for "category utility", and the role of the multiplicand $extstyle frac\left\{1\right\}\left\{p\right\}$ as a crude overfitting control, is given in the above sources. Loosely Harv|Fisher|1987, the term $extstyle p\left(c_j\right) sum_\left\{f_i in F\right\} sum_\left\{k=1\right\}^m p\left(f_\left\{ik\right\}|c_j\right)^2$ is the expected number of attribute values that can be correctly guessed by an observer using a probability-matching strategy together with knowledge of the category labels, while $extstyle p\left(c_j\right) sum_\left\{f_i in F\right\} sum_\left\{k=1\right\}^m p\left(f_\left\{ik\right\}\right)^2$ is the expected number of attribute values that can be correctly guessed by an observer the same strategy but without any knowledge of the category labels. Their difference therefore reflects the relative advantage accruing to the observer by having knowledge of the category structure.

Information-theoretic definition of the Category Utility

The information-theoretic definition of "category utility" for a set of entities with size-$n$ binary feature set $F = \left\{f_i\right\}, i=1 ldots n$, and a binary category is given in Harvtxt|Gluck|Corter|1985 as follows:

:

where $p\left(c\right)$ is the prior probability of an entity belonging to the positive category $c$ (in the absence of any feature information), $p\left(f_i|c\right)$ is the conditional probability of an entity having feature $f_i$ given that the entity belongs to category $c$, is likewise the conditional probability of an entity having feature $f_i$ given that the entity belongs to category , and $p\left(f_i\right)$ is the prior probability of an entity possessing feature $f_i$ (in the absence of any category information).

The intuition behind the above expression is as follows: The term $p\left(c\right) extstyle sum_\left\{i=1\right\}^n p\left(f_i|c\right)log p\left(f_i|c\right)$ represents the cost (in bits) of optimally encoding (or transmitting) feature information when it known that the objects to be described belong to category $c$. Similarly, the term represents the cost (in bits) of optimally encoding (or transmitting) feature information when it known that the objects to be described belong to category . The sum of these two terms in the brackets is therefore the weighted average of these two costs. The final term, $extstyle sum_\left\{i=1\right\}^n p\left(f_i\right)log p\left(f_i\right)$, represents the cost (in bits) of optimally encoding (or transmitting) feature information when no category information is available. The value of the "category utility" will, in the above formulation, be negative (???).

Category Utility and Mutual Information

It is mentioned in Harvtxt|Gluck|Corter|1985 and Harvtxt|Corter|Gluck|1992 that the category utility is equivalent to the mutual information. Here we provide a simple demonstration of the nature of this equivalence. Let us assume a set of entities each having the same $n$ features, i.e., feature set $F = \left\{f_i\right\}, i=1 ldots n$, with each feature variable having cardinality $m$. That is, each feature has the capacity to adopt any of $m$ distinct values (which need "not" be ordered; all variables can be nominal); for the special case $m=2$ these features would be considered "binary", but more generally, for any $m$, the features are simply "m-ary". For our purposes, without loss of generality, we can replace feature set $F$ with a single aggregate variable $F_a$ that has cardinality $m^n$, and adopts a unique value $v_i, i=1 ldots m^n$ corresponding to each feature combination in the Cartesian product $otimes F$. (Ordinality does "not" matter, because the mutual information is not sensitive to ordinality.) In what follows, a term such as $p\left(F_a=v_i\right)$ or simply $p\left(v_i\right)$ refers to the probability with which $F_a$ adopts the particular value $v_i$. (Using the aggregate feature variable $F_a$ replaces multiple summations, and simplifies the presentation to follow.)

We assume also a single category variable $C$ which has cardinality $p$. This is equivalent to a classification system in which there are $p$ non-intersecting categories. In the special case of $p=2$ we have the two-category case discussed above. From the definition of mutual information for discrete variables, the mutual information $I\left(F_a;C\right)$ between the aggregate feature variable $F_a$ and the category variable $C$ is given by:

:$I\left(F_a;C\right) = sum_\left\{v_i in F_a\right\} sum_\left\{c_j in C\right\} p\left(v_i,c_j\right) log frac\left\{p\left(v_i,c_j\right)\right\}\left\{p\left(v_i\right),p\left(c_j\right)\right\}$

where $p\left(v_i\right)$ is the prior probability of feature variable $F_a$ adopting value $v_i$, $p\left(c_j\right)$ is the marginal probability of category variable $C$ adopting value $c_j$, and $p\left(v_i,c_j\right)$ is the joint probability of variables $F_a$ and $C$ simultaneously adopting those respective values. In terms of the conditional probabilities this can be re-written (or defined) as

:

If we will rewrite the original definition of the category utility from above, with , we have

:$CU\left(C,F\right) = sum_\left\{f_i in F\right\} sum_\left\{c_j in C\right\} p\left(f_i|c_j\right) p\left(c_j\right) log p\left(f_i|c_j\right) - sum_\left\{f_i in F\right\} p\left(f_i\right) log p\left(f_i\right)$

This equation clearly has the same form as the (blue) equation expressing the mutual information between the feature set and the category variable; the difference is that the sum $extstyle sum_\left\{f_i in F\right\}$ in the "category utility" equation runs over independent binary variables $F = \left\{f_i\right\}, i=1 ldots n$, whereas the sum $extstyle sum_\left\{v_i in F_a\right\}$ in the mutual information runs over "values" of the single $m^n$-ary variable $F_a$. The two measures are actually equivalent then "only" when the features $\left\{f_i\right\}$, are "independent" (and assuming that terms in the sum corresponding to are also added).

Insensitivity of category utility to ordinality

Like the mutual information, the "category utility" is not sensitive to any "ordering" in the feature or category variable values. That is, as far as the "category utility" is concerned, the category set {small,medium,large,jumbo} is not qualitatively different than the category set {desk,fish,tree,mop} since the formulation of the "category utility" does not account for any ordering of the class variable. Similarly, a feature variable adopting values {1,2,3,4,5} is not qualitatively different from a feature variable adopting values {fred,joe,bob,sue,elaine}. As far as the "category utility" or "mutual information" are concerned, "all" category and feature variables are "nominal variables." For this reason, "category utility" does not reflect any "gestalt" aspects of "category goodness" that might be based on such ordering effects. One possible adjustment for this insensitivity to ordinality is given by the weighting scheme described in the article for mutual information.

Category "goodness": Models and Philosophy

This section provides some background on the origins of, and need for, formal measures of "category goodness" such as the "category utility", and some of the history that lead to the development of this particular metric.

What makes a good category?

At least since the time of Aristotle there has been a tremendous fascination in philosophy with the nature of concepts and universals. What kind of "entity" is a concept such as "horse"? Such abstractions do not designate any particular individual in the world, and yet we can scarcely imagine being able to comprehend the world without their use. Does the concept "horse" therefore have an independent existence outside of the mind? If it does, then what is the locus of this independent existence? The question of locus was an important issue on which the classical schools of Plato and Aristotle famously differed. However, they remained in agreement that universals "did" indeed have a mind-independent existence. There was, therefore, always a "fact to the matter" about which concepts and universals exist in the world.

In the late Middle Ages (perhaps beginning with Occam, although Porphyry also makes a much earlier remark indicating a certain discomfort with the status quo), however, the certainty that existed on this issue began to erode, and it became acceptable among the so-called nominalists and empiricists to consider concepts and universals as strictly mental entities or conventions of language. On this view of concepts—that they are purely representational constructs—a new question then comes to the fore: "Why do we possess one set of concepts rather than another?" What makes one set of concepts "good" and another set of concepts "bad"? This is a question that modern philosophers, and subsequently machine learning theorists and cognitive scientists, have struggled with for many decades.

What purpose do concepts serve?

One approach to answering such questions is to investigate the "role" or "purpose" of concepts in cognition. Thus, we ask: "What are concepts good for in the first place?" The answer provided by Harvtxt|Mill|1843/1936|p=425 and many others is that classification (conception) is a precursor to "induction": By imposing a particular categorization on the universe, an organism gains the ability to deal with physically non-identical objects or situations in an identical fashion, thereby gaining substantial predictive leverage (Harvnb|Smith|Medin|1981;Harvnb|Harnad|2005). As J.S. Mill puts it Harv|Mill|1843/1936|pp=466–468,

From this base, Mill reaches the following conclusion, which foreshadows much subsequent thinking about category goodness, including the notion of "category utility":

One may compare this to the "category utility hypothesis" proposed by Harvtxt|Corter|Gluck|1992: "A category is useful to the extent that it can be expected to improve the ability of a person to accurately predict the features of instances of that category." Mill here seems to be suggesting that the best category structure is one in which object features (properties) are maximally informative about the object's class, and, simultaneously, the object class is maximally informative about the object's features. In other words, a useful classification scheme is one in which we can use category knowledge to accurately infer object properties, and we can use property knowledge to accurately infer object classes. One may also compare this idea to Aristotle's criterion of "counter-predication" for definitional predicates, as well as to the notion of concepts described in formal concept analysis.

Attempts at formalization

A variety of different measures have been suggested with an aim of formally capturing this notion of "category goodness," the best known of which is probably the "cue validity". Cue validity of a feature $f_i$ with respect to category $c_j$ is defined as the conditional probability of the category given the feature (Harvnb|Reed|1972;Harvnb|Rosch|Mervis|1975;Harvnb|Rosch|1978), $p\left(c_j|f_i\right)$, or as the deviation of the conditional probability from the category base rate (Harvnb|Edgell|1993;Harvnb|Kruschke|Johansen|1999), $p\left(c_j|f_i\right)-p\left(c_j\right)$. Clearly, these measures quantify only inference from feature to category (i.e., "cue validity"), but not from category to feature, i.e., the "category validity" $p\left(f_i|c_j\right)$. Also, while the cue validity was originally intended to account for the demonstrable appearance of "basic categories" in human cognition—categories of a particular level of generality that are evidently preferred by human learners—a number of major flaws in the cue validity quickly emerged in this regard (Harvnb|Jones|1983;Harvnb|Murphy|1982;Harvnb|Corter|Gluck|1992, and others).

One attempt to address both problems by simultaneously maximizing both feature validity and category validity was made by Harvtxt|Jones|1983 in defining the "collocation index" as the product $p\left(c_j|f_i\right) p\left(f_i|c_j\right)$, but this construction was fairly "ad hoc" (see Harvnb|Corter|Gluck|1992). The "category utility" was introduced as a more sophisticated refinement of the cue validity which attempts to more rigorously quantify the full inferential power of a class structure. As shown above, on a certain view the category utility is equivalent to the mutual information between the feature variable and the category variable. It has been suggested that categories having the greatest overall "category utility" are those which are not only those which are "best" in a normative sense, but are also those which human learners prefer to use, e.g., "basic" categories Harv|Corter|Gluck|1992. Other related measures of category goodness are "cohesion" (Harvnb|Hanson|Bauer|1989;Harvnb|Gennari|Langley|Fisher|1989) and "salience" Harv|Gennari|1989.

Applications

* Category utilility is used as the category evaluation measure in the popular conceptual clustering algorithm called COBWEB Harv|Fisher|1987.

References

*Harvard reference | Surname2=Gluck| Given2=Mark A. | Surname1=Corter| Given1=James E. | Authorlink= | Title=Explaining basic categories: Feature predictability and information | Journal=Psychological Bulletin | Volume=111 | Issue=2 | Year=1992 | Pages=291–303 | URL=

*Harvard reference | Surname=Edgell| Given=Stephen E.| Year= 1993| Chapter=Using configural and dimensional information | Editor= N. John Castellan | Title=Individual and Group Decision Making: Current Issues | Publisher=Lawrence Erlbaum| Place=Hillsdale, New Jersey| URL=| Pages=43–64

*

*Harvard reference | Surname=Gennari| Given=John H.| Year= 1989| Chapter=Focused concept formation | Editor= Alberto Maria Segre | Title=Proceedings of the Sixth International Workshop on Machine Learning | Edition= | Publisher=Morgan Kaufmann| Place=Ithaca, NY| URL=| Pages=379–382

*Harvard reference | Surname1=Gennari| Given1=John H. | Surname2=Langley| Given2=Pat |Surname3=Fisher| Given3=Doug |Authorlink= | Title=Models of incremental concept formation | Journal=Artificial Intelligence | Volume=40 | Issue=1-3 | Year=1989| Pages=11–61 | URL=

*

*

*Harvard reference | Surname=Harnad| Given=Stevan | Year= 2005| Chapter=To cognize is to categorize: Cognition is categorization | Editor= Henri Cohen & Claire Lefebvre | Title=Handbook of Categorization in Cognitive Science | Edition= | Publisher=Elsevier | Place=Amsterdam | URL=http://eprints.ecs.soton.ac.uk/11725/| Pages=19–43 |

*Harvard reference | Surname=Jones| Given=Gregory V. | Authorlink= | Title=Identifying basic categories | Journal=Psychological Bulletin | Volume=94 | Issue=3 | Year=1983 | Pages=423–428 | URL=

*Harvard reference | Surname1=Kruschke| Given1=John K. | Surname2=Johansen| Given2=Mark K. |Authorlink= | Title=A model of probabilistic category learning | Journal=Journal of Experimental Psychology: Learning, Memory, and Cognition | Volume=25 | Issue=5 | Year=1999| Pages=1083–1119 | URL=

*Harvard reference | Surname=Mill| Given=John Stuart | Title=A System of Logic, Ratiocinative and Inductive: Being a Connected View of the Principles of Evidence and the Methods of Scientific Investigation | Publisher=Longmans, Green and Co. | Place=London | Year=1843/1936| URL= | authorlink=John Stuart Mill.

*Harvard reference | Surname=Murphy| Given=Gregory L. | Authorlink= | Title=Cue validity and levels of categorization | Journal=Psychological Bulletin | Volume=91 | Issue=1 | Year=1982| Pages=174–177 | URL=

*Harvard reference | Surname=Reed| Given=Stephen K. | Authorlink= | Title=Pattern recognition and categorization | Journal=Cognitive Psychology | Volume=3 | Issue=3 | Year=1972 | Pages=382–407 | URL=

*Harvard reference | Surname=Rosch| Given=Eleanor| Year= 1978| Chapter=Principles of categorization | Editor= Eleanor Rosch & Barbara B. Lloyd | Title=Cognition and Categorization | Edition= | Publisher=Lawrence Erlbaum| Place=Hillsdale, New Jersey| URL=| Pages=27–48

*Harvard reference | Surname1=Rosch| Given1=Eleanor | Surname2=Mervis| Given2=Carolyn B.| Authorlink= | Title=Family Resemblances: Studies in the Internal Structure of Categories | Journal=Cognitive Psychology | Volume=7 | Issue=4 | Year=1975 | Pages=573–605 | URL=

*Harvard reference | Surname1=Smith| Given1=Edward E. | Surname2=Medin| Given2=Douglas L. |Title=Categories and Concepts | Publisher=Harvard University Press | Place=Cambridge, MA | Year=1981| URL=

*Harvard reference | Surname1=Witten| Given1=Ian H. | Surname2=Frank| Given2=Eibe |Title=Data Mining: Practical Machine Learning Tools and Techniques | Publisher=Morgan Kaufmann | Place=Amsterdam | Year=2005| URL=http://www.cs.waikato.ac.nz/~ml/weka/book.html

Concepts, Concept learning, Abstraction, Universals, Conceptual Clustering, Unsupervised learning

Wikimedia Foundation. 2010.

Look at other dictionaries:

• Utility Group — is the name of a breed group of dogs, used by kennel clubs to classify a defined collection of dog breeds. How Utility Group is defined varies among kennel clubs, and different kennel clubs may not include the same breeds in their Utility Group …   Wikipedia

• utility patent — The most common type of patent; issued for useful inventions that are new (novel) and that produce results that are not expected by those working in the field of invention (nonobvious). A utility patent lasts for 20 years from the patent… …   Law dictionary

• Utility — This article is about the economic concept. For other uses, see Utility (disambiguation). Part of a series on Utilitarianism …   Wikipedia

• Utility knife — Boxcutter and box cutter redirect here. For the electronic music artist, see Boxcutter (musician). For season four premiere of Breaking Bad, see Box Cutter (Breaking Bad). A utility knife is a knife used for general or utility purposes. The… …   Wikipedia

• category of aircraft classification — i. With respect to the certification, rating, privileges, and information of airmen, it means a broad classification of aircraft. Examples include airplane, rotorcraft, glider, and lighter than air aircraft. ii. With respect to the certification… …   Aviation dictionary

• utility-category aircraft — A classification of aircraft approved for limited aerobatics. The maneuvers include lazy eights, chandelles, steep turns with a bank angle exceeding 60°, and spins …   Aviation dictionary

• Transport category — is a category of airworthiness applicable to large civil airplanes and large civil helicopters. Any aircraft s airworthiness category is shown on its airworthiness certificate. The name transport category is used in the USA, Canada, Europe and… …   Wikipedia

• Sport utility vehicle — SUV redirects here. For other uses, see SUV (disambiguation). 1995 1998 Ford Explorer built on a light truck chassis A sport utility vehicle (SUV) is a generic marketing term for a vehicle similar to a station wagon, but built on a light truck… …   Wikipedia

• Marine Corps Combat Utility Uniform — MARPAT Utility Uniform The Marine Corps Combat Utility Uniform (MCCUU) is the current battledress uniform of the United States Marine Corps. It is also worn by Navy personnel (mostly corpsmen and chaplains) assigned to Marine Corps units (Fleet… …   Wikipedia

• public utility — n: a business organization (as an electric company) performing a public service and subject to special government regulation Merriam Webster’s Dictionary of Law. Merriam Webster. 1996. public utility …   Law dictionary