Explaining The Problem

First, a definition of the p-value. The p-value is the probability that the test statistic that was observed in a measurement or experiment, or something as ‘extreme,’ would have been observed if the null hypothesis were true of a population of interest. The test statistic is a vector of some set of ordered observations [IT’S A SINGLE NUMBER CALCULATED FROM THE SET OF OBSERVATIONS. JASON 2017-08-29 P.S. THE OBSERVATIONS DON’T HAVE TO BE ORDERED. ALL THE POTENTIAL TEST STATISTICS HAVE TO BE ORDERED, SO THAT YOU CAN SAY WHAT COUNTS AS MORE EXTREME. JASON 2017-08-30]. The t-score, for example, is a vector of the sample mean and the sample standard deviation. A test statistic is more ‘extreme’ than the one that occurred if it is ‘further away’ from what the population parameter would be if the null hypothesis were true. And the null hypothesis is just the hypothesis that the researcher wishes to test. If p is low enough (generally less than 0.05), the researcher ‘rejects’ the null.
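
To make the definition concrete, here is a minimal sketch of a two-sided p-value for a point null about a population mean. It rests on assumptions that are mine rather than the text's: a normal approximation stands in for the t-distribution, the function name `one_sample_z` is hypothetical, and the sample values are invented for illustration. Following the bracketed note above, the statistic is treated as a single number computed from the observations.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def one_sample_z(observations, null_mean):
    """Return (z, p) for the point null 'population mean == null_mean'.

    Assumption: a normal approximation is used in place of the
    t-distribution, which is only reasonable for larger samples.
    """
    n = len(observations)
    standard_error = stdev(observations) / sqrt(n)
    # The test statistic: one number summarising how far the sample
    # mean sits from the value the null hypothesis posits.
    z = (mean(observations) - null_mean) / standard_error
    # 'As extreme or more extreme' means at least this far from zero,
    # in either direction, if the null were true.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Invented illustrative data, not taken from any survey.
sample = [0.3, -0.1, 0.4, 0.2, 0.5, 0.1, 0.0, 0.6, 0.2, 0.3]
z, p = one_sample_z(sample, 0.0)
print(f"z = {z:.3f}, p = {p:.5f}")
```

The ordering that matters here is the ordering of possible values of z by their distance from zero, which is what lets one say that one outcome is more extreme than another.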

In practice, the null hypothesis almost always has two other properties (at least in the social sciences). First, it is the ‘hypothesis of no difference.’ It is, informally speaking, the hypothesis that a change in the independent variable(s) of interest is not ‘associated’ with any change in the dependent variable. Consider the research question, ‘what is the relationship between an Australian citizen’s total personal income and the likelihood that they voted for the ALP in the Lower House in the 2016 federal election?’ Suppose the researcher gathered survey data about Australian citizens in 2016 and wanted to calculate a p-value to test the hypothesis that there was no relationship between income and voting behaviour. Suppose the researcher also thought that the appropriate test statistic was the t-statistic for the bivariate logit regression coefficient. The null population parameter would be such that t was around zero.

The second characteristic of the null hypotheses that researchers actually test is that they are almost always ‘point’ null hypotheses. That is, the researcher is interested in the likelihood of observing something at least as extreme as the observed test statistic given that the population parameter is exactly equal to a particular value. Return to the research question about Australian voters. The average political scientist would almost never calculate a p-value for the null hypothesis that the population parameter is roughly zero. No, our political scientist only cares about the hypothesis that Australians who make $50 000 a year are exactly as likely (all-else-equal) to have voted Labor as Australians who make $100 000 a year. If, in the population, there is even one more ALP voter in the $50 000 group than there is in the $100 000 group, then the political scientist will reject the null if their sample is large enough. [DO THEY SOMETIMES TEST X > 0 AGAINST X < 0 THOUGH? THAT WOULDN’T BE A POINT HYPOTHESIS. JASON 2017-08-29]
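
The claim that a large enough sample will reject the point null for any non-zero difference can be sketched numerically. The block below is an illustration under stated assumptions: the vote shares of 35.1% and 35.0% are hypothetical, the function name `two_proportion_p` is mine, the observed shares are taken to equal the population shares exactly, and a pooled two-proportion z-test stands in for the logit t-statistic discussed above.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p(share_a, share_b, n_per_group):
    """Two-sided p-value for the point null 'share_a == share_b'.

    Assumption: equal group sizes and a pooled normal approximation.
    """
    pooled = (share_a + share_b) / 2
    standard_error = sqrt(2 * pooled * (1 - pooled) / n_per_group)
    z = (share_a - share_b) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical shares: a difference of one tenth of a percentage point.
for n in (1_000, 100_000, 10_000_000):
    print(n, two_proportion_p(0.351, 0.350, n))
```

The difference between the groups is held fixed throughout; only the sample size changes. The p-value nonetheless falls from far above 0.05 to far below it, which is the sense in which rejection of a point null depends on sample size and not on the size of the difference.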

It is curious that researchers are so preoccupied with the point null. After all, the natural interpretation of the null is that it describes a state of affairs in the actual population. The hypothesis that there were just as many ALP voters among the wealthy and the not-so-wealthy in the 2016 federal election is the hypothesis that this was true with respect to the 15 million or so Australians who did in fact vote. But our political scientist knows in advance that there is a difference in the population. In the population, t is almost certainly different from zero. And you should surely reject the hypothesis of no difference if you know it to be false. So it is pointless to calculate a p-value for a point null. Indeed, there does not seem to be any point in running any significance test for a point null, frequentist or otherwise. [ABSOLUTELY. IF WE NEED TO, WE CAN FIND PEOPLE SAYING THIS IN THE LITERATURE. JASON 2017-08-29]

Perhaps researchers should instead calculate p-values for ‘interval’ null hypotheses. Maybe they should. It does not look like the aforementioned political scientist should immediately reject the hypothesis that income bears almost no relationship to Labor voting. The interval null might, for example, posit that t is between -0.1 and 0.1. But researchers rarely test interval nulls. This is a problem, as researchers take it that a sufficiently small p-value for a point null constitutes evidence against the view that there is no causal relationship between some variables of interest. Medications have been prescribed, policies have been implemented and psychotherapies adopted because a researcher used a p-value to reject a point null. I am contemplating the possibility that doctors, policymakers, psychologists and other members of frequentist professions have been basing vital decisions on a useless statistical test. One rather hopes this is not the case.
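
One way an interval null might be tested is sketched below. This is an assumption on my part and not a procedure the text commits to: the p-value is taken at the boundary of the interval, where the probability of an extreme result is largest, and the estimate is treated as normally distributed with a known standard error. The function name `interval_null_p` and all the numbers are hypothetical.

```python
from statistics import NormalDist

def interval_null_p(estimate, standard_error, delta):
    """p-value for the interval null '|effect| <= delta'.

    Assumption: the estimate is normal with the given standard error.
    The probability of a result at least as far from zero as the one
    observed is largest when the true effect sits on the boundary of
    the interval, so that worst case is reported.
    """
    nd = NormalDist()
    c = abs(estimate)
    upper_tail = 1 - nd.cdf((c - delta) / standard_error)
    lower_tail = nd.cdf((-c - delta) / standard_error)
    return upper_tail + lower_tail

# Hypothetical estimate: a 0.001 difference measured very precisely.
estimate, standard_error = 0.001, 0.0002134

print(interval_null_p(estimate, standard_error, 0.0))   # point null
print(interval_null_p(estimate, standard_error, 0.01))  # interval null
```

With `delta` set to zero the function reduces to the ordinary two-sided p-value for a point null. With these invented numbers the point null is rejected at the 0.05 level while the interval null is not, since the estimate lies well inside the interval.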

Partisans of p may deny that the point null is really a hypothetical description of the actual population. The point null may instead describe the ‘long-run’ state of affairs for which there is no causal relationship between some variables of interest. If there were no relationship between incomes and voting behaviour, one would expect the t-score to tend towards zero over time as more and more trials were conducted in relevantly similar circumstances.

The way to model this formally is to understand the actual, or finite, population to be a random sample from a ‘superpopulation.’ The superpopulation contains all finite populations belonging to a particular reference class. In the voting behaviour case, that might include a hypothetical population of all countries that are relevantly similar to Australia in 2016 and that could have come into being. The idea is that the finite population parameter is a realisation of a random quantity belonging to the superpopulation. Saying that a fair six-sided die lands on five one-in-six times in the long-run is just like saying that the probability of landing a five is one-in-six in the superpopulation containing all ‘dice-rolling occasions’ that might have taken place. If the p-value for the superpopulation model is similar to the p-value for the finite population model, then the point p-value may not be pointless after all.
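
The superpopulation idea can be sketched as a simulation. Everything specific in the block is an assumption made for illustration: the voting probability of 0.35, the group size of 50 000, the seed, and the function name `draw_finite_population` are all hypothetical. In the superpopulation the point null is exactly true, because both income groups vote Labor with the same probability, yet each finite population drawn from it will typically show some small difference between the groups.

```python
import random

def draw_finite_population(n_per_group, vote_prob, rng):
    """Draw one finite population from a superpopulation in which both
    income groups vote Labor with the same probability (null is true).

    Returns the realised Labor share in each group.
    """
    low_income_votes = sum(rng.random() < vote_prob for _ in range(n_per_group))
    high_income_votes = sum(rng.random() < vote_prob for _ in range(n_per_group))
    return low_income_votes / n_per_group, high_income_votes / n_per_group

# Hypothetical settings; the seed only makes the run repeatable.
rng = random.Random(2016)
for _ in range(3):
    low_share, high_share = draw_finite_population(50_000, 0.35, rng)
    # The realised difference is the finite population parameter.
    print(round(low_share - high_share, 5))
```

On this reading, the quantity the point null concerns is the shared probability in the superpopulation, not the realised difference in any one finite population.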