Feature selection

Feature selection, also known as variable selection, feature reduction, attribute selection or variable subset selection, is the technique, commonly used in machine learning, of selecting a subset of relevant features for building robust learning models. When applied in the biology domain, the technique is also called discriminative gene selection, which detects influential genes based on DNA microarray experiments. By removing most of the irrelevant and redundant features from the data, feature selection helps improve the performance of learning models by:

* Alleviating the effect of the curse of dimensionality.
* Enhancing generalization capability.
* Speeding up the learning process.
* Improving model interpretability.

Feature selection also helps people acquire a better understanding of their data by showing which features are important and how they relate to each other.

Introduction

Simple feature selection algorithms are ad hoc, but there are also more methodical approaches. From a theoretical perspective, it can be shown that optimal feature selection for supervised learning problems requires an exhaustive search of all possible subsets of features of the chosen cardinality. If large numbers of features are available, this is impractical. For practical supervised learning algorithms, the search is for a satisfactory set of features instead of an optimal set.

Feature selection algorithms typically fall into two categories: feature ranking and subset selection. Feature ranking ranks the features by a metric and eliminates all features that do not achieve an adequate score. Subset selection searches the set of possible features for the optimal subset.

In statistics, the most popular form of feature selection is stepwise regression. It is a greedy algorithm that adds the best feature (or deletes the worst feature) at each round. The main control issue is deciding when to stop the algorithm. In machine learning, the stopping point is typically chosen by cross-validation; in statistics, some criterion is optimized (see the optimality criteria below). This leads to the inherent problem of nesting. More robust methods have been explored, such as branch and bound and piecewise linear networks.
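A minimal sketch of such a greedy forward search, with cross-validation as the stopping rule, might look like the following; the diabetes data set, the linear model and the 5-fold split are illustrative assumptions rather than part of any particular reference implementation:

```python
# A hedged sketch of greedy forward (stepwise) selection with a
# cross-validation stopping rule.  The diabetes data, the linear model
# and the 5-fold split are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
selected, remaining = [], list(range(X.shape[1]))
best_score = -np.inf

while remaining:
    # Score every candidate subset obtained by adding one more feature.
    scores = [(cross_val_score(LinearRegression(), X[:, selected + [j]], y, cv=5).mean(), j)
              for j in remaining]
    score, j = max(scores)
    if score <= best_score:       # stop: the best single addition no longer helps
        break
    best_score = score
    selected.append(j)
    remaining.remove(j)

print("selected features:", selected, "cross-validated R^2:", round(best_score, 3))
```

Deleting the worst feature at each round (backward elimination) follows the same pattern with the candidate loop removing, rather than adding, one feature.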

Subset Selection

Subset selection evaluates a subset of features as a group for suitability. Subset selection algorithms can be broken down into wrappers, filters and embedded methods. Wrappers use a search algorithm to search through the space of possible features and evaluate each subset by running a model on the subset. Wrappers can be computationally expensive and have a risk of overfitting to the model. Filters are similar to wrappers in the search approach, but instead of evaluating against a model, a simpler filter is evaluated. Embedded techniques are embedded in, and specific to, a model.

Many popular search approaches use greedy hill climbing, which iteratively evaluates a candidate subset of features, then modifies the subset and evaluates if the new subset is an improvement over the old. Evaluation of the subsets requires a scoring metric that grades a subset of features. Exhaustive search is generally impractical, so at some implementor (or operator) defined stopping point, the subset of features with the highest score discovered up to that point is selected as the satisfactory feature subset. The stopping criterion varies by algorithm; possible criteria include: a subset score exceeds a threshold, a program's maximum allowed run time has been surpassed, etc.

Search approaches include:

* Exhaustive
* Best first
* Simulated annealing
* Genetic algorithm
* Greedy forward selection
* Greedy backward elimination
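
The following is a rough, hedged illustration of the wrapper idea with greedy hill climbing: a random single-feature change is kept only if the cross-validated score of a small classifier improves. The wine data set, the logistic-regression scorer and the 100-iteration stopping budget are assumptions chosen for brevity:

```python
# A hedged sketch of wrapper-style greedy hill climbing: flip one feature in or
# out of a random starting subset and keep the change only if the cross-validated
# score of a small classifier improves.  The wine data, the logistic-regression
# scorer and the 100-iteration budget are assumptions chosen for brevity.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = load_wine(return_X_y=True)
n_features = X.shape[1]

def subset_score(mask):
    """Scoring metric: mean 5-fold accuracy of a classifier on the masked features."""
    if not mask.any():
        return -np.inf
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return cross_val_score(clf, X[:, mask], y, cv=5).mean()

mask = rng.random(n_features) < 0.5          # random starting subset
best = subset_score(mask)
for _ in range(100):                          # stopping criterion: iteration budget
    j = rng.integers(n_features)
    candidate = mask.copy()
    candidate[j] = ~candidate[j]              # flip one feature in or out
    s = subset_score(candidate)
    if s > best:                              # keep the modification only if it improves
        mask, best = candidate, s

print("selected:", np.flatnonzero(mask), "cross-validated accuracy:", round(best, 3))
```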

Two popular filter metrics for classification problems are correlation and mutual information, although neither is a true metric or 'distance measure' in the mathematical sense, since both fail to obey the triangle inequality and thus do not compute any actual 'distance'; they should rather be regarded as 'scores'. These scores are computed between a candidate feature (or set of features) and the desired output category.
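
A small sketch of how these two scores might be computed and used to rank features is shown below; the breast-cancer data set and the "keep top 5" cut-off are illustrative assumptions:

```python
# A hedged sketch of the two filter scores above, used to rank features against
# the class label.  The breast-cancer data and the "keep top 5" cut-off are
# illustrative assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# Absolute Pearson correlation of each feature with the 0/1 class label.
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
# Estimated mutual information between each feature and the class label.
mi = mutual_info_classif(X, y, random_state=0)

k = 5
print("top-5 by correlation:       ", np.argsort(corr)[::-1][:k])
print("top-5 by mutual information:", np.argsort(mi)[::-1][:k])
```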

Other available filter metrics include:

* Class separability
** Error probability
** Inter-class distance
** Probabilistic distance
** Entropy
* Consistency-based feature selection
* Correlation-based feature selection

Optimality criteria

There are a variety of optimality criteria that can be used for controlling feature selection. The oldest are Mallows' Cp statistic and the Akaike information criterion (AIC). These add variables if the t-statistic is bigger than <math>\sqrt{2}</math>.

Other criteria are the Bayesian information criterion (BIC), which uses <math>\sqrt{\log{n}}</math>; minimum description length (MDL), which asymptotically uses <math>\sqrt{\log{n}}</math>, though some argue this asymptote is not computed correctly; Bonferroni / RIC, which use <math>\sqrt{2\log{p}}</math>; and a variety of new criteria motivated by the false discovery rate (FDR), which use something close to <math>\sqrt{2\log{\frac{p}{q}}}</math>.

minimum-Redundancy-Maximum-Relevance

Selecting features that correlate strongest to the classification variable has been called the "maximum-relevance selection". Many heuristic algorithms can be used, such as the sequential forward, backward, or floating selections.

On the other hand, features can be selected to be different from each other, while they still have high correlation to the classification variable. This scheme, called "minimum-Redundancy-Maximum-Relevance" selection ([http://research.janelia.org/peng/proj/mRMR/index.htm mRMR]), has been found to be more powerful than maximum-relevance selection.

As a special case, statistical dependence between variables can be used instead of correlation. Mutual information can be used to quantify the dependency. In this case, it is shown that mRMR is an approximation to maximizing the dependency between the joint distribution of the selected features and the classification variable.
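
A rough sketch of the mRMR idea in its mutual-information form is given below: it greedily adds the feature whose relevance to the class variable, minus its average redundancy with the features already chosen, is largest. The discretisation, data set and subset size are assumptions, and the reference program linked above differs in its details:

```python
# A hedged sketch of the mRMR idea in its mutual-information form: greedily add
# the feature whose relevance to the class variable, minus its mean redundancy
# with the already-selected features, is largest.  The discretisation, data set
# and subset size are assumptions; the reference program linked above differs
# in its details.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

X, y = load_breast_cancer(return_X_y=True)

# Discretise continuous features so pairwise mutual information is easy to estimate.
Xd = np.array([np.digitize(col, np.histogram_bin_edges(col, bins=10)[1:-1])
               for col in X.T]).T

relevance = mutual_info_classif(Xd, y, discrete_features=True, random_state=0)
selected = [int(np.argmax(relevance))]        # start with the most relevant feature

k = 5
while len(selected) < k:
    best_j, best_val = None, -np.inf
    for j in range(X.shape[1]):
        if j in selected:
            continue
        redundancy = np.mean([mutual_info_score(Xd[:, j], Xd[:, s]) for s in selected])
        value = relevance[j] - redundancy     # relevance minus redundancy
        if value > best_val:
            best_j, best_val = j, value
    selected.append(best_j)

print("mRMR-selected features:", selected)
```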

Embedded methods incorporating Feature Selection

* Random forests (RF)
* Random multinomial logit (RMNL)
* Ridge regression
* Decision tree
* Auto-encoding networks with a bottleneck-layer
* Many other machine learning methods applying a pruning step.
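
As a minimal illustration of the embedded approach using the first item in the list above, the following sketch fits a random forest once and keeps the features with the largest impurity-based importances; the data set and the "top 5" threshold are illustrative assumptions:

```python
# A hedged sketch of an embedded approach using random forests: the model is fit
# once and its impurity-based importances decide which features to keep.  The
# data set and the "top 5" threshold are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

top = np.argsort(forest.feature_importances_)[::-1][:5]
print("top-5 features by importance:", top)
X_reduced = X[:, top]            # reduced design matrix for downstream models
```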

Software for Feature Selection

* RapidMiner, a freely available open-source software package for intelligent data analysis, knowledge discovery, data mining, machine learning and visualization, featuring numerous feature generation and feature selection operators.
* Weka, a Java software package including a collection of machine learning algorithms for data mining tasks.

See also

* Cluster analysis
* Dimensionality reduction
* Feature extraction
* Data mining

References

* [http://jmlr.csail.mit.edu/papers/special/feature03.html JMLR Special Issue on Variable and Feature Selection]
* Peng, H. C., Long, F., and Ding, C., "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 8, pp. 1226-1238, 2005. [http://research.janelia.org/peng/proj/mRMR/index.htm Program]
* [http://www.springer.com/west/home?SGWID=4-102-22-33327495-0&changeHeader=true&referer=www.wkap.nl&SHORTCUT=www.springer.com/prod/b/0-7923-8198-X Feature Selection for Knowledge Discovery and Data Mining] (Book)
* [http://www.kernel-machines.org/jmlr/volume3/guyon03a/guyon03a.pdf An Introduction to Variable and Feature Selection] (Survey)
* [http://ieeexplore.ieee.org/iel5/69/30435/01401889.pdf Toward integrating feature selection algorithms for classification and clustering] (Survey)
* [http://www.ijcai.org/papers07/Papers/IJCAI07-187.pdf Searching for Interacting Features]
* [http://www.icml2006.org/icml_documents/camera-ready/107_Feature_Subset_Selec.pdf Feature Subset Selection Bias for Classification Learning]
* M. Hall 1999, [http://www.cs.waikato.ac.nz/~mhall/thesis.pdf Correlation-based Feature Selection for Machine Learning]

External links

* [http://www.clopinet.com/isabelle/Projects/NIPS2003/ NIPS challenge 2003] (see also NIPS)
* [http://paul.luminos.nl/documents/show_document.php?d=198 Naive Bayes implementation with feature selection in Visual Basic] (includes executable and source code)
* [http://research.janelia.org/peng/proj/mRMR/index.htm minimum-Redundancy-Maximum-Relevance (mRMR) feature selection program]

