# Linear classifier

In machine learning, the goal of classification is to assign an item to a class (or group) based on its feature values. A linear classifier does this by making the classification decision based on the value of a linear combination of the features.

Definition

If the input feature vector to the classifier is a real vector $\vec x$, then the output score is

$y = f(\vec{w}\cdot\vec{x}) = f\left(\sum_j w_j x_j\right),$

where $\vec w$ is a real vector of weights and $f$ is a function that converts the dot product of the two vectors into the desired output. The weight vector $\vec w$ is learned from a set of labeled training samples. Often $f$ is a simple function that maps all values above a certain threshold to the first class and all other values to the second class. A more complex $f$ might give the probability that an item belongs to a certain class.
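
The two kinds of $f$ described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the weights, the feature vector, the threshold of 0, and the choice of the logistic function as the "more complex" $f$ are all assumptions made for the example.

```python
import math

def dot(w, x):
    """Return the linear combination sum_j w_j * x_j."""
    return sum(wj * xj for wj, xj in zip(w, x))

def threshold_classifier(w, x, threshold=0.0):
    """f as a step function: class 1 above the threshold, class 0 otherwise."""
    return 1 if dot(w, x) > threshold else 0

def logistic_classifier(w, x):
    """f as the logistic function: a score in (0, 1) usable as a probability."""
    return 1.0 / (1.0 + math.exp(-dot(w, x)))

w = [2.0, -1.0]   # hypothetical weights, as if already learned
x = [1.0, 0.5]    # one feature vector

score = dot(w, x)                    # 2.0 * 1.0 + (-1.0) * 0.5 = 1.5
label = threshold_classifier(w, x)   # the score is above 0, so class 1
prob = logistic_classifier(w, x)     # roughly 0.82
```

Both functions share the same linear score; only the final mapping $f$ differs.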

For a two-class classification problem, one can visualize the operation of a linear classifier as splitting a high-dimensional input space with a hyperplane: all points on one side of the hyperplane are classified as "yes", while the others are classified as "no".

A linear classifier is often used when classification speed matters, since it is often the fastest classifier, especially when $\vec x$ is sparse. However, decision trees can be faster. Linear classifiers also often work very well when the number of dimensions in $\vec x$ is large, as in document classification, where each element of $\vec x$ is typically the number of occurrences of a word in a document (see document-term matrix). In such cases, the classifier should be well-regularized.
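
The speed advantage for sparse input comes from the fact that only the nonzero features contribute to the dot product. The sketch below illustrates this for word counts; the per-word weights and the example document are made up, and storing both as dictionaries is just one possible sparse representation.

```python
from collections import Counter

def sparse_score(weights, counts):
    """Dot product that visits only the nonzero entries of the feature vector."""
    return sum(weights.get(word, 0.0) * n for word, n in counts.items())

# Hypothetical per-word weights, as if learned for a "sports" vs. "other" task.
weights = {"goal": 1.5, "match": 1.0, "election": -2.0}

document = "the match ended with a late goal and another goal"
counts = Counter(document.split())       # word -> count; absent words are implicit zeros
score = sparse_score(weights, counts)    # 1.0 * 1 + 1.5 * 2 = 4.0
```

The cost of scoring a document is proportional to the number of distinct words it contains, not to the size of the vocabulary.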

Generative models vs. discriminative models

There are two broad classes of methods for determining the parameters $\vec w$ of a linear classifier [T. Mitchell, "Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression", draft version, 2005. http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf] [A. Y. Ng and M. I. Jordan, "On Discriminative vs. Generative Classifiers: A comparison of logistic regression and Naive Bayes", in NIPS 14, 2002. http://www.cs.berkeley.edu/~jordan/papers/ng-jordan-nips01.ps]. The first class, generative models, works by modeling the conditional density functions $P(\vec x \mid \mathrm{class})$. Examples of such algorithms include:
* Linear Discriminant Analysis (or Fisher's linear discriminant) (LDA) --- assumes Gaussian conditional density models
* Naive Bayes classifier --- assumes independent binomial conditional density models.
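
As a worked illustration of how a conditional density model leads to a linear classifier, consider two classes $c_0$ and $c_1$ with Gaussian densities that share one covariance matrix, as LDA assumes. The symbols $\vec\mu_0$, $\vec\mu_1$, $\Sigma$ and $b$ are introduced here for the sketch and do not appear elsewhere in this article. Because the covariance is shared, the quadratic terms in $\vec x$ cancel and the log-odds become linear in $\vec x$:

$$
\begin{aligned}
\log\frac{P(\vec x \mid c_1)\,P(c_1)}{P(\vec x \mid c_0)\,P(c_0)} &= \vec w \cdot \vec x + b,\\
\vec w &= \Sigma^{-1}(\vec\mu_1 - \vec\mu_0),\\
b &= -\tfrac12\,\vec\mu_1^{\top}\Sigma^{-1}\vec\mu_1 + \tfrac12\,\vec\mu_0^{\top}\Sigma^{-1}\vec\mu_0 + \log\frac{P(c_1)}{P(c_0)}.
\end{aligned}
$$

Choosing class $c_1$ whenever the log-odds are positive is the same as thresholding $\vec w \cdot \vec x$ at $-b$, which matches the form $y = f(\vec w \cdot \vec x)$ with a threshold function $f$.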

The second set of approaches is called discriminative models, which attempt to maximize the quality of the output on a training set. Additional terms in the training cost function can easily perform regularization of the final model. Examples of discriminative training of linear classifiers include:
* Logistic regression --- maximum likelihood estimation of $\vec w$ assuming that the observed training set was generated by a binomial model that depends on the output of the classifier.
* Perceptron --- an algorithm that attempts to fix all errors encountered in the training set.
* Support vector machine --- an algorithm that maximizes the margin between the decision hyperplane and the examples in the training set.
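
Of the three discriminative methods listed above, the perceptron is the simplest to sketch. The version below is a minimal illustration under stated assumptions: labels are encoded as -1 and +1, a constant feature 1.0 is appended to each input so that the threshold is learned as an extra weight, and the toy data set (logical AND) is made up for the example.

```python
def train_perceptron(samples, epochs=20, lr=1.0):
    """Perceptron rule: on each misclassified sample, move w toward the correct side.

    samples: list of (x, y) pairs with y in {-1, +1}.
    """
    dim = len(samples[0][0]) + 1           # +1 for the appended constant feature
    w = [0.0] * dim
    for _ in range(epochs):
        errors = 0
        for x, y in samples:
            xb = list(x) + [1.0]
            activation = sum(wj * xj for wj, xj in zip(w, xb))
            if y * activation <= 0:        # on the wrong side of (or on) the hyperplane
                w = [wj + lr * y * xj for wj, xj in zip(w, xb)]
                errors += 1
        if errors == 0:                    # every training sample is now classified correctly
            break
    return w

def predict(w, x):
    """Return +1 or -1 depending on the side of the hyperplane x falls on."""
    xb = list(x) + [1.0]
    return 1 if sum(wj * xj for wj, xj in zip(w, xb)) > 0 else -1

# Toy linearly separable data: logical AND with labels in {-1, +1}.
data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w = train_perceptron(data)
predictions = [predict(w, x) for x, _ in data]
```

On linearly separable data such as this, the loop stops as soon as a full pass over the training set produces no errors. On data that is not separable it simply runs for the given number of epochs.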

Note: Despite its name, LDA does not belong to the class of discriminative models in this taxonomy. However, its name makes sense when we compare LDA to the other main linear dimensionality reduction algorithm, Principal Components Analysis (PCA). LDA is a supervised learning algorithm that utilizes the labels of the data, while PCA is an unsupervised learning algorithm that ignores the labels. To summarize, the name is a historical artifact (see [R.O. Duda, P.E. Hart, D.G. Stork, "Pattern Classification", Wiley, (2001). ISBN 0-471-05669-3], p. 117).

Discriminative training often yields higher accuracy than modeling the conditional density functions. However, handling missing data is often easier with conditional density models.

All of the linear classifier algorithms listed above can be converted into non-linear algorithms operating on a different input space $\varphi(\vec x)$, using the kernel trick.
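
As a sketch of the kernel trick, the perceptron can be written in a dual form in which inputs appear only inside dot products, and each dot product is then replaced by a kernel function. The code below is illustrative only: the polynomial kernel $(1 + \vec x \cdot \vec z)^2$, the function names, and the XOR data set are assumptions chosen for the example.

```python
def poly_kernel(x, z, degree=2):
    """k(x, z) = (1 + x.z)^degree, a dot product in an implicit feature space."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** degree

def kernel_score(alpha, xs, ys, kernel, x):
    """Dual-form score: a weighted sum of kernel values against the training samples."""
    return sum(a * yj * kernel(xj, x) for a, xj, yj in zip(alpha, xs, ys))

def train_kernel_perceptron(xs, ys, kernel, epochs=20):
    """Dual-form perceptron: alpha[i] counts the mistakes made on sample i."""
    alpha = [0] * len(xs)
    for _ in range(epochs):
        errors = 0
        for i, (x, y) in enumerate(zip(xs, ys)):
            if y * kernel_score(alpha, xs, ys, kernel, x) <= 0:
                alpha[i] += 1
                errors += 1
        if errors == 0:
            break
    return alpha

def kernel_predict(alpha, xs, ys, kernel, x):
    return 1 if kernel_score(alpha, xs, ys, kernel, x) > 0 else -1

# XOR: no hyperplane in the original two-dimensional space separates these labels.
xs = [(0, 0), (0, 1), (1, 0), (1, 1)]
ys = [-1, 1, 1, -1]
alpha = train_kernel_perceptron(xs, ys, poly_kernel)
predictions = [kernel_predict(alpha, xs, ys, poly_kernel, x) for x in xs]
```

The feature map $\varphi$ is never computed explicitly. The classifier is still linear in the implicit feature space, but its decision boundary in the original input space is non-linear.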

See also

* Statistical classification


Wikimedia Foundation. 2010.
