Testing hypotheses suggested by the data

In statistics, hypotheses suggested by the data must be tested differently from hypotheses formed independently of the data.

How to do it wrong

For example, suppose fifty different researchers, unaware of each other's work, run clinical trials to test whether Vitamin X is efficacious in treating cancer. Forty-nine of them find no significant differences between measurements made on patients who have taken Vitamin X and those who have taken a placebo. The fiftieth study finds a large difference, but one of a size that would be expected in roughly one of every fifty studies even if Vitamin X had no effect at all, purely by chance (for instance through measurement error, regression to the mean, or simply because patients who were going to improve anyway happened to be assigned to the Vitamin X group rather than the control group). When all fifty studies are pooled, one would conclude that no effect of Vitamin X was found, because the positive result was no more frequent than chance predicts, i.e. it was not statistically significant. It would nonetheless be reasonable for the investigators running the fiftieth study to consider it likely that they have found an effect, at least until they learn of the other forty-nine studies.

Now suppose that the one anomalous study was conducted in Denmark. The data suggest the hypothesis that Vitamin X is more efficacious in Denmark than elsewhere. But Denmark was simply the one country in fifty where an extreme value of the test statistic happened to occur; one expects such an extreme case about once in fifty trials on average even when no effect is present. It would therefore be fallacious to cite these data as serious evidence for this particular hypothesis, which was itself suggested by the data.
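A quick simulation makes the arithmetic concrete. The sketch below is illustrative only; the sample sizes, effect-free normal data, and 2% significance threshold are assumptions chosen to mirror the one-in-fifty example, not figures from any real trial.

```python
# Illustrative sketch only: simulates fifty studies of a treatment with no
# real effect and counts how many appear "significant" purely by chance.
# All numbers (50 studies, 100 patients per arm, alpha = 0.02) are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_patients, alpha = 50, 100, 0.02

false_positives = 0
for _ in range(n_studies):
    treated = rng.normal(0.0, 1.0, n_patients)   # Vitamin X arm, no true effect
    control = rng.normal(0.0, 1.0, n_patients)   # placebo arm
    _, p_value = stats.ttest_ind(treated, control)
    if p_value < alpha:
        false_positives += 1

print(f"'Significant' studies out of {n_studies}: {false_positives}")
# With alpha = 0.02 one expects about one such study per fifty on average,
# even though the treatment does nothing.
```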

However, if another study is then done in Denmark and again finds a difference between the vitamin and the placebo, the first study strengthens the case provided by the second. Similarly, if a second series of fifty studies is run in the same fifty countries and Denmark stands out in that series as well, the two series together constitute important evidence, even though neither by itself is at all impressive.
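A rough back-of-the-envelope calculation shows why the replication carries weight even though the original finding did not. It assumes independent studies and the one-in-fifty per-study false-positive rate used above:

```latex
% Probabilities under the assumption of no real effect anywhere, with
% independent studies and a 1/50 chance of a false positive per study.
\[
  P(\text{Denmark ``significant'' again in a second, pre-specified study})
  = \tfrac{1}{50} = 0.02,
\]
\[
  P(\text{some one of 50 countries ``significant'' in both independent series})
  \approx 50 \cdot \bigl(\tfrac{1}{50}\bigr)^{2} = \tfrac{1}{50} = 0.02 .
\]
```

By contrast, the chance that at least one of fifty countries stands out somewhere in a single series is about 1 − (1 − 0.02)^50 ≈ 0.64, which is why the first result alone proves little.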

The general problem

Testing a hypothesis suggested by the data can very easily result in false positives (type I errors). If one looks long enough and in enough different places, data can eventually be found to support any hypothesis. Unfortunately, these positive data do not by themselves constitute evidence that the hypothesis is correct. The negative test data that were thrown out are just as important, because they indicate how common the positive results are compared to chance. (An "effect" that occurs no more often than chance would predict is not an effect at all, much as a method for winning at gambling is only useful if it improves the odds over what the casino already gives you.) Running an experiment, seeing a pattern in the data, proposing a hypothesis from that pattern, and then using the same experimental data as evidence for the new hypothesis is extremely suspect, because data from all other experiments have essentially been thrown out by choosing to look only at the experiments that suggested the new hypothesis in the first place. This is like turning over rocks in a forest, most of which have nothing under them, finally discovering a centipede under one particular rock, and concluding that all rocks must have centipedes under them.

A large set of tests as described above greatly inflates the probability of a type I error, because all but the data most favorable to the hypothesis are discarded. This is a risk not only in hypothesis testing but in all statistical inference, as it is often difficult to describe accurately the process that has been followed in searching for and discarding data. In other words, one wants to keep all data (whether they tend to support or refute the hypothesis) from "good tests", but it is sometimes difficult to determine what a "good test" is. The problem is particularly acute in statistical modelling, where many different models are rejected by trial and error before a result is published (see also overfitting). Likelihood and Bayesian approaches are no less at risk, owing to the difficulty of specifying the likelihood function without an exact description of the search-and-discard process. The error is particularly prevalent in data mining and machine learning. It also commonly occurs in academic publishing, where only reports of positive rather than negative results tend to be accepted, resulting in the effect known as publication bias.
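The inflation can be quantified. Under the simplifying assumption of m independent tests, each run at significance level α, the probability of at least one false positive (the family-wise error rate) is:

```latex
% Family-wise error rate for m independent tests at level alpha
% (independence is a simplifying assumption).
\[
  \mathrm{FWER} = 1 - (1 - \alpha)^{m},
  \qquad\text{e.g.}\quad
  1 - (1 - 0.05)^{20} \approx 0.64,
  \qquad
  1 - (1 - 0.05)^{100} \approx 0.994 .
\]
```

A researcher who screens a hundred candidate hypotheses at the conventional 5% level is therefore almost guaranteed to find at least one "significant" result even when none of the hypotheses is true.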

How to do it right

All strategies for sound testing of hypotheses suggested by the data involve subjecting the new hypothesis to a wider range of tests in an attempt to validate or refute it. These include the following (a minimal code sketch combining the first and third items appears after the list):
*Collecting confirmation samples
*Cross-validation
*Methods of compensation for multiple comparisons
*Simulation studies including adequate representation of the multiple-testing actually involved
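The sketch below is a minimal illustration, not a prescribed procedure: it applies a Bonferroni correction as one simple method of compensating for multiple comparisons on an exploratory batch of tests, then re-tests the surviving hypothesis on an independently collected confirmation sample. The data, sample sizes, effect sizes, and thresholds are all made up for the example.

```python
# Minimal sketch (assumed data and thresholds): Bonferroni compensation for
# multiple comparisons on an exploratory batch, followed by a test on an
# independent confirmation sample for any hypothesis that survives.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05

# --- Exploratory stage: fifty subgroup comparisons, only one with a real effect.
exploratory_pvalues = []
for group in range(50):
    effect = 0.8 if group == 7 else 0.0          # group 7 is the only true effect
    treated = rng.normal(effect, 1.0, 80)
    control = rng.normal(0.0, 1.0, 80)
    exploratory_pvalues.append(stats.ttest_ind(treated, control).pvalue)

# Bonferroni: divide the significance level by the number of tests performed.
bonferroni_alpha = alpha / len(exploratory_pvalues)
candidates = [g for g, p in enumerate(exploratory_pvalues) if p < bonferroni_alpha]
print("Subgroups surviving Bonferroni correction:", candidates)

# --- Confirmation stage: collect a fresh, independent sample for each survivor
# and test it at the ordinary level, since the hypothesis is now pre-specified.
for group in candidates:
    effect = 0.8 if group == 7 else 0.0
    treated = rng.normal(effect, 1.0, 80)
    control = rng.normal(0.0, 1.0, 80)
    p_confirm = stats.ttest_ind(treated, control).pvalue
    print(f"Confirmation test for subgroup {group}: p = {p_confirm:.4f}")
```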

Henry Scheffé's simultaneous test of all contrasts in multiple comparison problems is the best-known remedy in the case of analysis of variance. It is a method designed for testing hypotheses suggested by the data while avoiding the fallacy described above. See his "A Method for Judging All Contrasts in the Analysis of Variance", Biometrika, 40, pp. 87–104 (1953).
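For reference, a standard statement of Scheffé's criterion follows (the notation is assumed here, not quoted from the paper): in a one-way analysis of variance with k groups, N observations in total, group means ȳ_i, group sizes n_i, and error mean square MS_E,

```latex
% Standard form of Scheffe's simultaneous criterion for a one-way ANOVA
% (assumed notation: k groups, N total observations, group means \bar{y}_i,
% group sizes n_i, error mean square MS_E).
\[
  \text{declare the contrast } C = \sum_{i=1}^{k} c_i \mu_i \neq 0
  \quad\text{if}\quad
  \Bigl|\sum_{i=1}^{k} c_i \bar{y}_i\Bigr|
  > \sqrt{(k-1)\, F_{\alpha;\,k-1,\,N-k}}\;
    \sqrt{\mathrm{MS}_E \sum_{i=1}^{k} \frac{c_i^{2}}{n_i}},
  \qquad \sum_{i=1}^{k} c_i = 0 .
\]
```

Because the bound holds simultaneously for every contrast, the level-α guarantee remains valid even when the particular contrast is chosen after inspecting the data.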

See also

*Type I and type II errors
*Data-snooping bias
*Data analysis
*Exploratory data analysis
*Data dredging
*Predictive analytics

