Stepwise regression

In statistics, stepwise regression refers to regression models in which the choice of predictive variables is carried out by an automatic procedure. [Hocking, R. R. (1976) "The Analysis and Selection of Variables in Linear Regression," Biometrics, 32.] [Draper, N. and Smith, H. (1981) Applied Regression Analysis, 2d Edition, New York: John Wiley & Sons.] [SAS Institute Inc. (1989) SAS/STAT User's Guide, Version 6, Fourth Edition, Volume 2, Cary, NC: SAS Institute Inc.] Usually this takes the form of a sequence of F-tests, but other techniques are possible, such as t-tests, adjusted R-squared, the Akaike information criterion, the Bayesian information criterion, Mallows' Cp, or the false discovery rate.

When planning an experiment, computer simulation, or scientific survey to collect data for such a model, one must keep in mind the number of parameters, P, to estimate and adjust the sample size accordingly. For K variables,

P = 1 (start) + K (Stage I) + (K² − K)/2 (Stage II) + 3K (Stage III) = 0.5K² + 3.5K + 1.

For K < 17, an efficient design of experiments exists for this type of model: a Box–Behnken design [Box–Behnken designs (http://www.itl.nist.gov/div898/handbook/pri/section3/pri3362.htm) from a handbook on engineering statistics (http://www.itl.nist.gov/div898/handbook/) at NIST] augmented with positive and negative axial points of length min(2, √(int(1.5 + K/4))), plus point(s) at the origin. There are more efficient designs, requiring fewer runs, even for K > 16.
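The stage-by-stage parameter count above can be sanity-checked against its closed form with a minimal sketch (`param_count` is an illustrative name, assuming integer K ≥ 1):

```python
def param_count(k):
    """Parameters P for K candidate variables, summed stage by stage:
    1 (start) + K (Stage I) + (K^2 - K)/2 (Stage II) + 3K (Stage III)."""
    return 1 + k + (k * k - k) // 2 + 3 * k

# The staged sum matches the closed form 0.5*K^2 + 3.5*K + 1.
assert all(param_count(k) == 0.5 * k * k + 3.5 * k + 1 for k in range(1, 30))
```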

The main approaches are:

a) Forward selection, which starts with no variables in the model, tests the addition of each candidate variable, adds the variable (if any) whose inclusion gives the most statistically significant improvement of the fit, and repeats until no remaining variable improves the model significantly.

b) Backward elimination, which starts with all candidate variables, tests the deletion of each variable, deletes the variable (if any) whose removal causes the least statistically significant deterioration of the fit, and repeats until no variable can be deleted without a statistically significant loss of fit.

c) Methods that are a combination of the above, testing at each stage for variables to be included or excluded.
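Forward selection (approach a) can be sketched with a partial F-test on the residual sum of squares. This is a minimal NumPy illustration, not a production routine; `forward_select`, `rss`, and the default threshold `f_to_enter=4.0` (roughly the 5% F critical value for moderate sample sizes) are assumptions of this sketch, not from any library:

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an OLS fit; X already includes the intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def forward_select(X, y, f_to_enter=4.0):
    """Add, one at a time, the candidate column with the largest partial
    F-statistic, stopping when no remaining column exceeds f_to_enter."""
    n, k = X.shape
    selected = []
    current = np.ones((n, 1))             # intercept-only model
    rss_cur = rss(current, y)
    while True:
        best_f, best_j = -np.inf, None
        for j in range(k):
            if j in selected:
                continue
            trial = np.hstack([current, X[:, [j]]])
            df = n - trial.shape[1]       # residual degrees of freedom
            rss_try = rss(trial, y)
            f = (rss_cur - rss_try) / (rss_try / df)
            if f > best_f:
                best_f, best_j = f, j
        if best_j is None or best_f < f_to_enter:
            break
        selected.append(best_j)
        current = np.hstack([current, X[:, [best_j]]])
        rss_cur = rss(current, y)
    return selected

# Synthetic example: only columns 0 and 2 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(scale=0.5, size=200)
sel = forward_select(X, y)
```

With a strong signal the two informative columns enter first; whether any noise column also enters depends on the threshold, which is exactly the multiple-comparisons issue discussed below.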

A widely used algorithm was proposed by Efroymson (1960). [Efroymson, M. A. (1960) "Multiple regression analysis." In Ralston, A. and Wilf, H. S., editors, Mathematical Methods for Digital Computers. Wiley.] It is an automatic procedure for statistical model selection in cases where there is a large number of potential explanatory variables and no underlying theory on which to base the model selection. The procedure is used primarily in regression analysis, though the basic approach is applicable in many forms of model selection. It is a variation on forward selection: at each stage in the process, after a new variable is added, a test is made to check whether some variables can be deleted without appreciably increasing the residual sum of squares (RSS). The procedure terminates when the measure of fit is (locally) maximized, or when the available improvement falls below some critical value.
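The forward-then-delete structure of the Efroymson variation can be sketched as follows. This is an illustrative NumPy sketch, not Efroymson's published code; the names (`stepwise`, `partial_f`) and thresholds are assumptions, and the usual condition F-to-remove < F-to-enter is applied to avoid an add/remove cycle:

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an OLS fit on design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def partial_f(rss_small, rss_big, df_big):
    """F-statistic for one extra parameter in the bigger model."""
    return (rss_small - rss_big) / (rss_big / df_big)

def stepwise(X, y, f_enter=4.0, f_remove=3.9):
    """Efroymson-style selection: each forward step is followed by a backward
    sweep; f_remove < f_enter keeps the procedure from cycling."""
    n, k = X.shape
    sel = []

    def design(cols):
        return np.hstack([np.ones((n, 1))] + [X[:, [j]] for j in cols])

    while True:
        rss_cur = rss(design(sel), y)
        # Forward step: best remaining candidate by partial F.
        cand = [(partial_f(rss_cur, rss(design(sel + [j]), y), n - len(sel) - 2), j)
                for j in range(k) if j not in sel]
        if not cand:
            break
        f_best, j_best = max(cand)
        if f_best < f_enter:
            break
        sel.append(j_best)
        # Backward sweep: drop any variable whose removal barely raises the RSS.
        changed = True
        while changed:
            changed = False
            rss_cur = rss(design(sel), y)
            for j in list(sel):
                rest = [c for c in sel if c != j]
                if partial_f(rss(design(rest), y), rss_cur, n - len(sel) - 1) < f_remove:
                    sel.remove(j)
                    changed = True
                    break
    return sel

# Synthetic example: only columns 0 and 2 carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(scale=0.5, size=200)
sel = stepwise(X, y)
```

The backward sweep is what distinguishes this from plain forward selection: a variable admitted early can later be displaced once better companions have entered.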

Stepwise regression procedures are used in data mining, but are controversial. Several points of criticism have been made.

1. A sequence of F-tests is often used to control the inclusion or exclusion of variables, but these are carried out on the same data and so there will be problems of multiple comparisons for which many correction criteria have been developed.

2. It is difficult to interpret the p-values associated with these tests, since each is conditional on the previous tests of inclusion and exclusion (see "dependent tests" in false discovery rate).

3. The tests themselves are biased, since they are based on the same data (Rencher and Pun, 1980; Copas, 1983). [Rencher, A. C. and Pun, F. C. (1980) "Inflation of R² in Best Subset Regression." Technometrics, 22, 49-54.] [Copas, J. B. (1983) "Regression, prediction and shrinkage." J. Roy. Statist. Soc. Series B, 45, 311-354.] Wilkinson and Dallal (1981) [Wilkinson, L. and Dallal, G. E. (1981) "Tests of significance in forward selection regression with an F-to-enter stopping rule." Technometrics, 23, 377-380.] computed percentage points of the multiple correlation coefficient by simulation and showed that a final regression obtained by forward selection, said by the F-procedure to be significant at 0.1%, was in fact only significant at 5%.

Critics regard the procedure as a paradigmatic example of data dredging, intense computation often being an inadequate substitute for subject-area expertise.

See also

*Backward regression
*Forward regression
*Logistic regression
*Occam's Razor
