Sunday, May 31, 2009

Data set selection

On Moon Landings, Michelle Malkin, P-Values, the Clintons, and the Magical Mystery Dealergate Conspiracy Theory: [...]The way this data is being used is almost the same. Singer ran six sets of regression analysis: one each for Obama, McCain, Clinton, Democratic and Republican donors, and another for those dealers who had made no political contributions at all. She was therefore testing six hypotheses. If these hypotheses were independent from one another (which, to be clear, in this case they aren't), the odds that at least one of the six would return a p-value of .125 or lower are better than 50:50! Not only are false positives possible -- they are practically inevitable, particularly if you test enough hypotheses and tolerate a low enough threshold for statistical significance. [...] (Via FiveThirtyEight.com: Electoral Projections Done Right.)
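The "better than 50:50" figure is easy to check. Here is a minimal sketch, assuming (as the quote does for the sake of argument) that the six tests are independent and that every null hypothesis is true, so each p-value falls at or below the threshold with probability equal to that threshold. The function name is mine, purely for illustration.

```python
def prob_at_least_one_false_positive(num_tests: int, threshold: float) -> float:
    """Chance that at least one of `num_tests` independent tests of a true
    null hypothesis yields a p-value at or below `threshold`."""
    # Each test stays above the threshold with probability (1 - threshold);
    # all of them do so with that probability raised to num_tests.
    return 1.0 - (1.0 - threshold) ** num_tests


p = prob_at_least_one_false_positive(num_tests=6, threshold=0.125)
print(f"{p:.3f}")  # 0.551
```

So under those assumptions, six tests at a .125 threshold give roughly a 55% chance of at least one spurious "hit". The real analyses were not independent, so the exact number would differ.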

I feel so much better that it's not just machine learning that practices the arcane crafts of post hoc hypothesis and data set selection.
