Information gain in decision trees

In information theory and machine learning, information gain is a synonym for Kullback–Leibler divergence.

In particular, the information gain about a random variable X obtained from an observation that a random variable A takes the value A = a is the Kullback–Leibler divergence D_KL(p(x|a) || p(x|I)) of the prior distribution p(x|I) for x from the posterior distribution p(x|a) for x given a.

The expected value of the information gain is the mutual information I(X; A) of X and A, i.e. the reduction in the entropy of X achieved by learning the state of the random variable A.
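Written out, this is the standard information-theoretic identity:

I(X; A) = \mathbb{E}_{p(a)}\big[ D_{KL}(p(x|a) \,\|\, p(x|I)) \big] = H(X) - H(X|A)

i.e. averaging the Kullback–Leibler divergence over the outcomes of A yields the mutual information.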

In machine learning this concept can be used to define a preferred sequence of attributes to investigate in order to most rapidly narrow down the state of X. Such a sequence (which at each stage depends on the outcomes of the attributes investigated so far) is called a decision tree. Usually an attribute with high information gain should be preferred over other attributes.

General definition

In general terms, the expected information gain is the change in information entropy from a prior state to a state that takes some information as given:

IG(Ex,a) = H(Ex) - H(Ex|a)
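
As a concrete illustration (an addition to the text, with hypothetical class labels), the entropy term H(Ex) depends only on the class-label frequencies of the example set, and a minimal Python helper might look like this:

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Shannon entropy (in bits) of a collection of class labels."""
        total = len(labels)
        return -sum((n / total) * log2(n / total)
                    for n in Counter(labels).values())

    # a toy example set with two equally frequent classes has H = 1 bit
    print(entropy(["yes", "yes", "no", "no"]))  # 1.0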

Formal definition

Let Attr be the set of all attributes and Ex the set of all training examples. For an example x ∈ Ex, value(x, a) denotes the value of x for attribute a ∈ Attr, and H denotes the entropy. The information gain for an attribute a ∈ Attr is defined as follows:

IG(Ex, a) = H(Ex) - \sum_{v \in values(a)} \frac{|\{x \in Ex \mid value(x, a) = v\}|}{|Ex|} \cdot H(\{x \in Ex \mid value(x, a) = v\})

The information gain equals the total entropy for an attribute if, for each of the attribute's values, a unique classification can be made for the result attribute. In this case, the relative entropies subtracted from the total entropy are all 0.
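
To make the formal definition concrete, here is a minimal Python sketch (an illustrative addition; the data and attribute names are hypothetical) that computes IG(Ex, a) for examples given as (attributes, label) pairs. It also reproduces the edge case just described: the attribute "colour" uniquely determines the label, so its gain equals the total entropy.

    from collections import Counter
    from math import log2

    def entropy(labels):
        total = len(labels)
        return -sum((n / total) * log2(n / total)
                    for n in Counter(labels).values())

    def information_gain(examples, attribute):
        """IG(Ex, a): label entropy minus the weighted entropy of the
        subsets induced by each value of `attribute`."""
        labels = [label for _, label in examples]
        gain = entropy(labels)
        for v in set(attrs[attribute] for attrs, _ in examples):
            subset = [label for attrs, label in examples
                      if attrs[attribute] == v]
            gain -= (len(subset) / len(examples)) * entropy(subset)
        return gain

    examples = [({"colour": "red",  "size": "big"},   "yes"),
                ({"colour": "red",  "size": "small"}, "yes"),
                ({"colour": "blue", "size": "big"},   "no"),
                ({"colour": "blue", "size": "small"}, "no")]
    print(information_gain(examples, "colour"))  # 1.0 (= total entropy)
    print(information_gain(examples, "size"))    # 0.0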

Drawbacks

Although information gain is usually a good measure for deciding the relevance of an attribute, it is not perfect. A notable problem occurs when information gain is applied to attributes that can take on a large number of distinct values. For example, suppose that we are building a decision tree for some data describing a business's customers. Information gain is often used to decide which of the attributes are the most relevant, so they can be tested near the root of the tree. One of the input attributes might be the customer's credit card number. This attribute has a high information gain, because it uniquely identifies each customer, but we do not want to include it in the decision tree: deciding how to treat a customer based on their credit card number is unlikely to generalize to customers we haven't seen before.

Information gain ratio is sometimes used instead. This biases the decision tree against considering attributes with a large number of distinct values.
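
For reference, the gain ratio used by Quinlan's C4.5 normalises the gain by the entropy of the split itself:

IG_{ratio}(Ex, a) = \frac{IG(Ex, a)}{-\sum_{v \in values(a)} \frac{|Ex_v|}{|Ex|} \log_2 \frac{|Ex_v|}{|Ex|}}

where Ex_v = \{x \in Ex \mid value(x, a) = v\}. The denominator (the split information) grows with the number of distinct values of a, which is what penalises attributes such as the credit card number above.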

Constructing a decision tree using information gain

A decision tree can be constructed top-down using the information gain in the following way (a minimal code sketch follows the list):

1. begin at the root node
2. determine the attribute with the highest information gain that has not already been used in an ancestor node
3. add a child node for each possible value of that attribute
4. attach each example to the child node whose value for that attribute matches the example's value
5. if all examples attached to a child node can be classified uniquely, add that classification to the node and mark it as a leaf node
6. if at least one unused attribute remains, go back to step 2; otherwise label the node with the majority classification of the examples attached to it
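
The following minimal Python sketch (an illustrative addition reusing the hypothetical entropy and information_gain helpers and the (attributes, label) example format from above) implements this top-down procedure:

    from collections import Counter
    from math import log2

    def entropy(labels):
        total = len(labels)
        return -sum((n / total) * log2(n / total)
                    for n in Counter(labels).values())

    def information_gain(examples, attribute):
        labels = [label for _, label in examples]
        gain = entropy(labels)
        for v in set(attrs[attribute] for attrs, _ in examples):
            subset = [label for attrs, label in examples
                      if attrs[attribute] == v]
            gain -= (len(subset) / len(examples)) * entropy(subset)
        return gain

    def build_tree(examples, attributes):
        """Return a class label (leaf) or an (attribute, {value: subtree})
        pair, splitting greedily on the highest information gain."""
        labels = [label for _, label in examples]
        # leaf: the examples are classified uniquely, or no attribute is
        # left, in which case the majority classification is used
        if len(set(labels)) == 1 or not attributes:
            return Counter(labels).most_common(1)[0][0]
        best = max(attributes, key=lambda a: information_gain(examples, a))
        rest = [a for a in attributes if a != best]
        children = {}
        for v in set(attrs[best] for attrs, _ in examples):
            subset = [(attrs, label) for attrs, label in examples
                      if attrs[best] == v]
            children[v] = build_tree(subset, rest)
        return (best, children)

    examples = [({"colour": "red",  "size": "big"},   "yes"),
                ({"colour": "red",  "size": "small"}, "yes"),
                ({"colour": "blue", "size": "big"},   "no"),
                ({"colour": "blue", "size": "small"}, "no")]
    print(build_tree(examples, ["colour", "size"]))
    # e.g. ('colour', {'red': 'yes', 'blue': 'no'})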


