Something Doesn’t Add Up
The author is a Professor of Statistics at the University of Bath. I like to read books on statistics written for the general public. Their authors make a concerted effort to explain complicated concepts in an easy-to-understand manner. Standard statistical textbooks, on the other hand, usually explain complicated concepts in an even more complex manner (lots of equations, full of alien characters). It’s not surprising that we, students of science, are not good at statistics. Our main sources of information, on which our clinical decisions are based, are scientific articles, most of which report clinical and laboratory experiments. The results of these experiments are extrapolated and used as a basis for explaining what would happen in the real world. The key link between experimental and real-world conditions is statistics. If the statistics in a paper are wrong, the conclusions built on them cannot be right. So to understand, appreciate and make full use of papers in our practice, we need to understand statistics. It is the weak link.
I find the book very useful and intellectually stimulating.
Measurement of inherently subjective matters, e.g. quality of life, happiness
Can you really condense the complexity and diversity of experiencing life into a single number? It might represent a general state of happiness, but surely much of the richness of life is lost in that impoverished number. Richard Layard, the author of Happiness, may disagree, but the key is to exploit the simplicity of numbers as far as this is useful, while staying aware of their limitations and realising that they usually tell you only part of the story.
City liveability index by The Economist
In 2016 Melbourne snatched first place from Vienna by 0.1 point. The Lord Mayor at the time made a big deal out of it: there was a media blitz, a big announcement, “we should all be proud”, and so on.
If, however, we look closely at the methodology of the city liveability league table, it is actually intended as a ranking of the attractiveness of cities to people who work for multinational companies and are relocated there. That means the factors that make up the ranking may have absolutely nothing to do with the residents who actually live there.
University ranking
One year, UK business schools all of a sudden beat their US counterparts. One of the criteria was graduates’ salaries, and no one looked at the exchange rate. The pound had strengthened against the US dollar that year, so UK graduates appeared to earn relatively more in dollar terms – the currency used in the ranking table.
Bibliometrics
The number of times an academic’s publications are cited by other researchers is often used to measure that elusive concept: the quality of their research. However, a few years ago a research paper reported that a highly regarded advanced forecasting method was not as accurate as much simpler methods. The paper was highly cited, not because of its quality, but because it contained errors: other researchers felt the need to explain that its conclusion was wrong when they reviewed the literature at the start of their own papers. The citation count was very high for the wrong reason – a high number of citations doesn’t mean high quality.
Perverse incentive
If the quality of academics’ research is measured by the number of papers they publish, they might be motivated to concentrate on quantity rather than quality – writing lots of mediocre papers rather than a few outstanding pieces of work.
GDP calculation
We plan holidays, download recipes, communicate with friends around the world and acquire knowledge online. Years ago these activities would have cost money; now, because they are free, they do not register in GDP. We are generally better off because we have access to these services, but GDP ignores them. A drop in GDP doesn’t mean a drop in quality of life.
On p-values
The null hypothesis in the example is that the two drugs are equally safe. We would then ask: how probable is it that at least eight times more people will die taking one of the drugs if the hypothesis is true? The probability in question is called a p-value.
Suppose that the probability turned out to be 40 per cent: would you have doubts about the hypothesis that the drugs were equally safe? I expect most people would judge that the difference in death rates could well be due to chance. But what if the probability of the trial turning out that way was only 2 per cent? We could then say: if the drugs are equally safe, it is very improbable that we would get eight times more people dying taking one of them. At this point we would start having serious doubts about the ‘equal safety’ hypothesis.
The key point here is that Fisher’s method does not tell us the probability that the claim (or hypothesis) is true. It only tells us the probability of getting the result that we did if the claim is true, which is a different matter altogether.
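To make Fisher’s logic concrete, here is a minimal simulation sketch of such a p-value. All figures below (1,000 patients per arm, a 0.4 per cent death rate) are illustrative assumptions, not numbers from the book: we ask how often chance alone would produce at least an eightfold difference in deaths if the two drugs really were equally safe.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative assumptions: 1,000 patients per arm and a 0.4% death
# rate for BOTH drugs, i.e. the null hypothesis of equal safety.
n_per_arm, death_rate, n_sims = 1000, 0.004, 100_000

deaths_a = rng.binomial(n_per_arm, death_rate, n_sims)
deaths_b = rng.binomial(n_per_arm, death_rate, n_sims)

# A result at least as extreme as the one observed: one arm with at
# least eight times the deaths of the other (zero-zero ties excluded).
extreme = ((deaths_a >= 8 * deaths_b) | (deaths_b >= 8 * deaths_a)) & (
    (deaths_a + deaths_b) > 0
)

print(f"p-value ~ {extreme.mean():.3f}")
```

Note that the simulation answers only “how surprising is this result if the drugs are equally safe?” – which is exactly the quantity, and only the quantity, that Fisher’s method gives us.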
P<0.05
It is vital to appreciate that the 5 per cent threshold is arbitrary; there is nothing scientific about it. When Fisher was later asked why he had recommended rejecting a hypothesis if the p-value was less than 5 per cent, he conceded that he had no rationale: he had chosen 5 per cent simply because it was ‘convenient’. But despite Fisher’s suggestion, you are surely entitled to conclude that a hypothesis is untenable if the p-value is 6 per cent, or even 10 per cent – it is a matter for your own judgement. This is especially true if ‘accepting’ an untrue hypothesis is dangerous or costly, and because the calculation of p-values is usually based on approximations and assumptions that are not strictly true. It’s surely better simply to state the exact p-value and let the reader judge the plausibility of the hypothesis on that basis.
Indeed, in later years Fisher himself recommended this practice, but by then bad habits had been established and the 5 per cent fixation was well ingrained into the practice of many disciplines.
Perpetual false glorification of p<0.05
The question remains: why do honest scientists, who are usually intelligent people with enquiring minds, go along unquestioningly with the ‘p less than 0.05’ rule? One reason is that scientists are often not trained statisticians and many don’t understand what a p-value is – they just assume that this strange number has to be below 0.05 to make a result worth reporting.
We really don’t understand p-values
Psychology students and their teachers were given a series of six statements that wrongly interpreted the results of significance tests. All of the students and 90 per cent of their teachers believed that at least one of the statements was true.
Polls: what’s wrong with them?
Margin-of-error calculations rest on the assumption of random sampling, so they cannot validly be performed for non-random samples. Yet pollsters commonly report margins of error even when their surveys are non-random.
The uncertainty also depends on the answer proportions and on the size of any subgroup being reported, yet many pollsters quote a single margin of error for the entire survey.
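For a genuinely random sample there is a standard formula for the 95 per cent margin of error of a proportion, roughly 1.96 × √(p(1−p)/n). A minimal sketch (the sample sizes are illustrative) shows why a single quoted figure can’t cover a whole survey:

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error, in percentage points, for a
    proportion p estimated from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n) * 100

# Same poll, different uncertainties: the answer proportion matters...
print(f"n=1000, p=0.50: +/- {margin_of_error(0.50, 1000):.1f} points")
print(f"n=1000, p=0.10: +/- {margin_of_error(0.10, 1000):.1f} points")
# ...and a subgroup of the sample is far noisier than the whole.
print(f"n=150,  p=0.50: +/- {margin_of_error(0.50, 150):.1f} points")
```

And none of this applies once the sample isn’t random – the formula has no licence there, however often it is quoted.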
“Candidate A’s lacklustre performance in last week’s debate has caused a 2 percentage point decline in his lead.” “People are warming to Candidate B’s promise to reduce income tax. It’s led to a 3 percentage point increase in her poll rating.” This is the ‘narrative fallacy’: we tend to invent stories to explain outcomes, when all these movements might be nothing more than the noise of random sampling.
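A quick simulation makes the point. Poll the same unchanged electorate ten times and the published numbers still wander by a few points; the true support level and poll size below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

true_support = 0.48   # candidate's true, unchanging support (assumed)
poll_size = 1000

# Ten successive polls of an electorate in which nothing has changed.
polls = rng.binomial(poll_size, true_support, size=10) / poll_size * 100
print(np.round(polls, 1))
# Point-to-point swings of 2-3 points appear with no story behind them.
```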
We lie…
Have you ever borrowed money from a personal loan company? One hundred per cent of respondents said no. Yet all of the interviewees had been selected because they were listed as clients of a local loan company.
How many times did you have sex last week? Who did you vote for? Even if you voted for Trump, you might find it difficult to admit that to others. People lie to avoid embarrassing answers.
When asked for our opinion, we are prompted by the phrasing of the question, by the immediate context (the ‘choice architecture’ described in Nudge) and by recent experience to bring to mind a few scraps of thought, which we then use to help us choose between the options.
Happiness measurement
Without a clear understanding of what happiness is, it’s difficult to argue that happiness measurement is anything other than a meretricious attempt to assign quantities, via unanswerable questions, to a nebulous concept, without any possibility of objective calibration. In other words, it’s spurious accuracy – micrometre readings of an ill-defined phenomenon reflected in error-bound data.
Subjective numbers can be trusted when people produce consistent values under the same conditions and when the number measures what it is supposed to measure – reliability and validity, respectively. Despite their deficiencies, subjective numbers can often be useful when we only need a ‘rough and ready’ idea of the truth; but, because of their coarseness, we should be suspicious of very exact numbers based on subjective responses (e.g. a difference in pain ratings quoted to two decimal places).
Data has no character and no message until it’s been interpreted by a human. And in the supposedly objective arena of science, that interpretation will inevitably be subjective.
Replication
A good rule of thumb is that, the more newsworthy a finding is, the less likely it is to be successfully replicated. Until scientific findings have been replicated several times, we should treat them as provisional. So, defer judgement on recent findings.
Conspiracy theories, cults
Our ability to reason didn’t evolve so that we could discover truths. Instead, as social animals, reasoning developed so we could justify our actions and decisions to others, thereby improving communication and cooperation. Being able to defend our position in arguments also served to increase our prestige and status within our social group. Hence there was an incentive to overweigh evidence and choose lines of argument that favoured our actions while simultaneously downplaying inconvenient facts.
We tend to agree with people in our social group and to seek out evidence that confirms our beliefs. The echo chamber amplifies group beliefs.
We react to stories, not numbers…
The donations from those who only saw the statistical information were less than half those received when just the description and image of disaster victims were provided. But, surprisingly, adding the statistical information to victims’ image and description also decreased donations by almost 40 per cent. It appeared that the statistics lessened people’s reliance on their emotions to guide their decision. Instead they fostered a more analytical mode of thinking and this caused the participants to donate less.
We are emotionally swayed by individual accounts of people who need help, yet resistant to statistics about the numbers of victims when a large disaster occurs. We have a perverse propensity to steer clear of statistics. This is self-inflicted harm, as it distorts our perception of the world.
Root canals and breast cancer
The connection is clear! It refers to a study of 300 women with breast cancer by Dr Robert Jones. He found that 93 per cent of them had had root canal surgery. Of course, this in itself provides no evidence of causality at all. It’s probable that around 93 per cent of the women were also meat eaters, that around 93 per cent of them had access to a mobile phone, or that around 93 per cent were less than 70 inches tall.
Disconfirming evidence has less chance of registering with us because we have a natural, inherent resistance to it. ‘Yes, that figures,’ we might say to ourselves. ‘Mrs A. had root canal surgery and she suffered from breast cancer.’ The many people who have breast cancer but have not undergone root canal surgery, or who had root canal surgery but did not suffer from the disease, go unrecognised.
Risk perceptions
Things we think we can control present more acceptable risks than potential threats judged to be outside our influence. Some people are happy to ski but studiously avoid food preservatives even though leisure time spent on the piste is estimated to be a thousand times more likely to cause injury or damage to one’s health.
People in disagreement
Research suggests that open and honest dialogue with people, appreciating the basis of their concerns, acknowledging any uncertainty, and allowing full disclosure of available information, is more likely to establish trust in the long term and allow people to make informed decisions.
Gut feelings vs Statistics
When numbers presented to you conflict with your intuition, your intuition may be right if you have plenty of experience in an environment that has regularities.
Number presentation
Most of us have problems absorbing very large numbers, but they can often be cut down to a manageable size with a bit of thought. For example, it’s been estimated that British women, aged 16 or over, spent £28.4 billion on clothes in 2017. This doesn’t mean much to me. But, given that there were roughly 27 million females in the UK who were aged 16 or over in 2017, that works out at about £1050 per woman.
Another way of helping people grasp large numbers, when something is happening repeatedly, is the ‘rate per time period’ method. For example, Greenpeace reported in April 2018 that the equivalent of one truckload of plastic enters the sea every minute of every day. And, at one point in 2018, it was estimated that Jeff Bezos, the founder of Amazon, was earning nearly $2700 a second.
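Both tricks are one-line arithmetic. A small sketch using the figures above (the per-year Bezos figure is back-computed from the $2700-a-second rate, not taken from the book):

```python
# Per-capita scaling: GBP 28.4 billion across roughly 27 million women.
clothes_total_gbp = 28.4e9
women_16_plus = 27e6
print(f"Per woman: GBP {clothes_total_gbp / women_16_plus:,.0f}")  # ~1,052

# Rate-per-time-period scaling: $2700 a second, re-expressed per year.
seconds_per_year = 365 * 24 * 3600
print(f"Per year: ${2700 * seconds_per_year / 1e9:.0f} billion")   # ~85
```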
The most useful bits from the book
Recommendations on dealing with numbers
What’s the motivation? Ask yourself: what argument, product or service is the provider of a statistic trying to sell me? Numbers have a happy habit of supporting the views and interests of the person or organization providing them.
What doesn’t the number tell us? Ask yourself what the number leaves out. Many numbers are only proxies – a cholesterol reading, say, standing in for the risk of a heart attack. Numbers tend to reflect the most easily quantifiable aspects of an issue – as we saw with GDP – and they can ignore qualitative factors completely.
If you are told that your lifestyle increases your risk of suffering from a deadly disease by 30 per cent, remember that without the starting figure the absolute risk you face may still be very small (see the first sketch after this list).
What simplifying assumptions underlie the number? It’s also worth asking in what ways the number simplifies the true picture and what assumptions it is based on. An average is a simplified summary of a diverse population and may not represent anyone at all.
Is the number based on an unrepresentative survey?
Is the number based on a small sample?
If a questionnaire was used to obtain the number, was it biased? People’s responses to a questionnaire can be manipulated by the way the questions are phrased.
Is the number based on subjective judgement? If it is, we should be aware that people can have problems in converting their feelings into numbers, that psychological biases may distort how they respond, and that these responses can often be inconsistent. It’s therefore unwise to treat subjective numbers as exact quantities. However, sometimes it’s possible to say: ‘Even if this subjective estimate changed by a considerable amount, we would still choose the same course of action, so it’s safe to base our decision on it.’
Are any comparisons that are presented valid? Not comparing like with like is an oft-rehearsed trick of those seeking to deceive us. Beware comparisons between two points in time that have been carefully selected to suggest an upward or downward trend where there is really no trend. Be especially wary of international comparisons where statistics have been collected using different definitions. Similarly, comparisons made in a single country over time can be invalid when the definition or scope of what is being measured has changed.
Total or per capita? If a country is spending 20 per cent more on healthcare than it was twenty years ago it might give us a warm feeling, but if the size of the population has dramatically increased over the same period then the expenditure per person might actually have gone down.
How reliable is the presentation of the number? Look carefully at the vertical scales of graphs. They can sometimes be distorted or non-existent, and scales that don’t start at zero can make small, insignificant changes look large (see the second sketch after this list).
Do the numbers make sense? It’s always worth asking: is there a plausible rationale to support the finding? If not, it’s best to keep an open mind until further evidence is available.
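Two sketches to close out the list above. First, relative versus absolute risk: a ‘30 per cent increase’ barely moves a small baseline. The 0.1 per cent baseline here is an illustrative assumption, not a figure from the book.

```python
# A "30 per cent higher risk" on a small baseline, in absolute terms.
baseline = 0.001                 # assumed 0.1% lifetime risk
increased = baseline * 1.30      # the headline "30% increase"
print(f"Baseline: {baseline:.2%}  Increased: {increased:.2%}  "
      f"Extra absolute risk: {increased - baseline:.2%}")
# Baseline: 0.10%  Increased: 0.13%  Extra absolute risk: 0.03%
```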
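Second, the graph-scale point: the same made-up data plotted twice, once with a truncated vertical axis and once with a zero-based one.

```python
import matplotlib.pyplot as plt

years = [2014, 2015, 2016, 2017, 2018]
sales = [50.0, 50.4, 50.9, 51.3, 51.8]   # hypothetical data

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(years, sales, marker="o")
ax1.set_ylim(49.8, 52.0)                 # truncated axis: looks dramatic
ax1.set_title("Axis starts at 49.8")
ax2.plot(years, sales, marker="o")
ax2.set_ylim(0, 60)                      # zero-based axis: looks flat
ax2.set_title("Axis starts at 0")
plt.tight_layout()
plt.show()
```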