Positive vs. negative studies: which are more likely to be right?

Positive studies are those with a p-value below 0.05; studies with a p-value of 0.05 or above are negative studies.

What is a p-value?

A p-value is defined along these lines: given that the null hypothesis is true (there is no difference between the experimental and control groups), what is the chance of getting a result as extreme as, or more extreme than, the one seen in the experiment? The cut-off of 0.05, or 5%, was chosen arbitrarily. If the chance of seeing a result as extreme as or more extreme than the observed one is less than 5%, the difference we see in the experiment is unlikely to be the result of chance alone. In other words, if the p-value is less than 0.05, we reject the null hypothesis, which in turn means there is a ‘statistically significant’ difference between the experimental and control groups. The p-value is explained this way for the sake of simplicity; technically and strictly speaking, however, this is not quite accurate. P-values will be discussed in more detail in a separate article.
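
To make that definition concrete, here is a minimal sketch in Python (the data values, group sizes, and the choice of a permutation test are my own assumptions for illustration, not part of the original discussion): it forces the null hypothesis to be true by repeatedly shuffling the group labels, then estimates the p-value as the proportion of shuffles producing a difference at least as extreme as the observed one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outcomes in a control and an experimental group.
control = np.array([5.1, 4.8, 5.6, 5.0, 4.9, 5.3, 5.2, 4.7])
experimental = np.array([5.9, 5.4, 6.1, 5.7, 5.5, 6.0, 5.8, 5.6])

observed_diff = experimental.mean() - control.mean()

# Force the null hypothesis to be true: pool the data and reshuffle
# the group labels many times, recording the difference each time.
pooled = np.concatenate([control, experimental])
n_exp = len(experimental)
n_shuffles = 10_000

count_as_extreme = 0
for _ in range(n_shuffles):
    rng.shuffle(pooled)
    diff = pooled[:n_exp].mean() - pooled[n_exp:].mean()
    if abs(diff) >= abs(observed_diff):  # "as extreme or more extreme"
        count_as_extreme += 1

p_value = count_as_extreme / n_shuffles
print(f"Observed difference: {observed_diff:.2f}")
print(f"Estimated p-value:   {p_value:.4f}")
```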

During experiment design, sample sizes are calculated. They need to be large enough to detect the difference of interest, but not so large that valuable resources are wasted. Part of the sample size calculation depends on the acceptable levels of false negatives and false positives. The widely accepted standard (at least among clinical trialists) is to set the significance level at 5% and the study power at 80%. In other words, studies that could produce 5% false positives and 20% false negatives are acceptable to the health science research community. It is possible to lower these numbers to reduce the chance of false findings, but the required sample size grows dramatically as a result. Essentially, it is a balancing act between two conflicting aims: we want studies sensitive enough to detect real, minute differences, but at the same time we do not want to waste time, money, and effort on experiments that are too large.
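
As a rough illustration of that balancing act, here is a sketch using the standard normal-approximation formula for comparing two means (the 0.5 effect size and the specific error rates are my own assumptions, not from the text above): tightening the false positive allowance from 5% to 1% and the false negative allowance from 20% to 5% more than doubles the required sample size.

```python
from scipy.stats import norm

def n_per_group(effect_size, alpha, power):
    """Approximate sample size per group for a two-sided, two-sample
    comparison of means (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # controls false positives
    z_beta = norm.ppf(power)           # controls false negatives
    return 2 * (z_alpha + z_beta) ** 2 / effect_size ** 2

d = 0.5  # assumed standardized effect size (difference / SD)

# Standard protocol: 5% significance level, 80% power.
print(round(n_per_group(d, alpha=0.05, power=0.80)))  # ~63 per group

# Tighter error rates: 1% significance level, 95% power.
print(round(n_per_group(d, alpha=0.01, power=0.95)))  # ~143 per group
```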

Say there are 1,000 hypotheses being tested, 100 of which are true and 900 of which are false. All tests follow the standard protocol: significance level at 5% and study power at 80%.

The tests have a false positive rate of 5%. That means they would report 45 false positives (5% of the 900 false hypotheses).

By the same token, the tests have a false negative rate of 20%. That means they would report 20 false negatives (20% of the 100 true hypotheses).

Not knowing which hypotheses are actually true or false, here is what we would observe:

We would see 125 positive studies (80 true positives + 45 false positives). Positive studies would have an accuracy rate, the chance of getting it right, of 64% (80/125).

We would also see 875 negative studies (855 true negatives + 20 false negatives). Negative studies would have an accuracy rate of 97.7% (855/875).
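
The arithmetic of this example can be written out as a short sketch (the function and its name, study_accuracy, are mine, not from any particular library):

```python
def study_accuracy(n_hypotheses, n_true, alpha, power):
    """Accuracy of positive and negative studies, given how many tested
    hypotheses are actually true, the significance level, and the power."""
    n_false = n_hypotheses - n_true

    true_positives = power * n_true          # real effects that are detected
    false_positives = alpha * n_false        # null effects that "pass" anyway
    false_negatives = (1 - power) * n_true   # real effects that are missed
    true_negatives = (1 - alpha) * n_false   # null effects correctly rejected

    positive_accuracy = true_positives / (true_positives + false_positives)
    negative_accuracy = true_negatives / (true_negatives + false_negatives)
    return positive_accuracy, negative_accuracy

# 1,000 hypotheses, 100 of them true, standard 5% / 80% protocol.
pos, neg = study_accuracy(1000, 100, alpha=0.05, power=0.80)
print(f"Positive studies: {pos:.1%}")   # 64.0%
print(f"Negative studies: {neg:.1%}")   # 97.7%
```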

That is the best-case scenario, when studies are designed and conducted to meet the accepted standard. In reality, that is rarely the case. The low accuracy rate of positive studies is potentially even more acute in dentistry, where small, underpowered studies are rampant. It is estimated that the average power of published studies is around 40%. By conducting studies with 40% power, researchers implicitly say that it is acceptable for the studies to have a 60% false negative rate. By publishing such studies, reviewers and editors are implicitly saying the same thing. Using the theoretical example above with studies of 40% power, the accuracy rate of positive studies drops to 47% (40/(40+45)) and that of negative studies to 93% (855/(855+60)).
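
Reusing the same hypothetical study_accuracy function from the sketch above with 40% power reproduces those numbers:

```python
pos, neg = study_accuracy(1000, 100, alpha=0.05, power=0.40)
print(f"Positive studies: {pos:.1%}")   # 47.1%
print(f"Negative studies: {neg:.1%}")   # 93.4%
```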

Taking all of this together, when we read dental literature that reports positive findings, the chance that the findings are actually true is only around 47%. Do you see ominous dark clouds gathering?

Sources:

Trouble at the lab. The Economist. 2013 Oct