Correlation and Causation in Statistics

One day at lunch I was eating a large bowl of ice cream, and a fellow faculty member said, “You had better be careful, there is a high statistical correlation between ice cream and drowning.” I must have given him a confused look, as he elaborated some more. “Days with the most sales of ice cream also see the most people drown.”
When I had finished my ice cream we discussed the fact that just because one variable is statistically associated with another, it doesn’t mean that one is the cause of the other. Sometimes there is a variable hiding in the background. In this case the day of the year is hiding in the data. More ice cream is sold on hot summer days than on snowy winter ones. More people swim in the summer, and hence more drown in the summer than in the winter.
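To see how a hidden variable can manufacture a correlation, here is a small simulation. It is only a sketch with made-up numbers: the temperature curve, the coefficients, the noise levels, and the names simulate_year and pearson are all assumptions chosen for illustration, not real sales or drowning data. Ice cream sales and drownings each depend on temperature and on nothing else, yet the two come out strongly correlated.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_year(seed=0):
    """Toy model (all numbers invented): temperature drives both series."""
    rng = random.Random(seed)
    temps, ice_cream, drownings = [], [], []
    for day in range(365):
        # Seasonal temperature curve plus day-to-day noise.
        temp = 15 + 12 * math.sin(2 * math.pi * (day - 100) / 365) + rng.gauss(0, 3)
        temps.append(temp)
        # Neither series refers to the other; each depends only on temperature.
        # These are stylized indices, so small negative values can occur.
        ice_cream.append(20 * temp + rng.gauss(0, 40))
        drownings.append(0.1 * temp + rng.gauss(0, 0.5))
    return temps, ice_cream, drownings


temps, ice_cream, drownings = simulate_year()
r = pearson(ice_cream, drownings)
print(f"correlation(ice cream, drownings) = {r:.2f}")
```

Nothing in simulate_year lets ice cream influence drowning, so whatever correlation is printed comes entirely from the shared dependence on temperature.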

Beware of Lurking Variables

The above anecdote is a prime example of what is known as a lurking variable. As its name suggests, a lurking variable can be elusive and difficult to detect. When we find that two numerical data sets are strongly correlated, we should always ask, “Could there be something else that is causing this relationship?”
The following are examples of strong correlation caused by a lurking variable:
  • The average number of computers per person in a country and that country’s average life expectancy.
  • The number of firefighters at a fire and the damage caused by the fire.
  • The height of an elementary school student and his or her reading level.

Detection of Lurking Variables

By their nature, lurking variables are difficult to detect. One strategy, if available, is to examine what happens to the data over time. This can reveal seasonal trends, as in the ice cream example, that get obscured when the data is lumped together. Another method is to look at outliers and try to determine what makes them different from the other data. Sometimes this provides a hint of what is happening behind the scenes. The best course of action is to be proactive: question assumptions and design experiments carefully.
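Once we suspect a particular lurking variable and have measurements of it, one standard check is a partial correlation: remove the part of each variable that the suspect explains, then correlate what is left. This technique goes a step beyond the strategies described above, and the sketch below applies it to invented data in which temperature drives both ice cream sales and drownings. All numbers and the names pearson, residuals, and make_data are assumptions for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def residuals(ys, xs):
    """What is left of ys after subtracting a least-squares line fit on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]


def make_data(seed=1):
    """Toy data (all numbers invented): temperature drives both series."""
    rng = random.Random(seed)
    temps, ice_cream, drownings = [], [], []
    for day in range(365):
        temp = 15 + 12 * math.sin(2 * math.pi * (day - 100) / 365) + rng.gauss(0, 3)
        temps.append(temp)
        ice_cream.append(20 * temp + rng.gauss(0, 40))
        drownings.append(0.1 * temp + rng.gauss(0, 0.5))
    return temps, ice_cream, drownings


temps, ice_cream, drownings = make_data()

raw_r = pearson(ice_cream, drownings)
# Partial correlation: correlate the parts that temperature does not explain.
partial_r = pearson(residuals(ice_cream, temps), residuals(drownings, temps))

print(f"raw correlation:            {raw_r:.2f}")
print(f"controlling for temperature: {partial_r:.2f}")
```

In this toy setup the raw correlation is strong, while the partial correlation should land near zero. With real data the check only helps if the lurking variable has been identified and measured, which is why careful study design still matters.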

Why Does It Matter?

In the opening scenario, suppose a well-meaning but statistically uninformed congressman proposed to outlaw all ice cream in order to prevent drowning. Such a bill would inconvenience large segments of the population, force several companies into bankruptcy, and eliminate thousands of jobs as the country’s ice cream industry closed down. Despite the best of intentions, this bill would not decrease the number of drowning deaths.
If that example seems a little too far-fetched, consider the following, which actually happened. In the early 1900s, doctors noticed that some infants were mysteriously dying in their sleep from perceived respiratory problems. This was called crib death, and is now known as sudden infant death syndrome (SIDS). One thing that stood out in autopsies performed on those who died from SIDS was an enlarged thymus, a gland located in the chest. From the correlation of enlarged thymus glands in SIDS babies, doctors presumed that an abnormally large thymus caused improper breathing and death.
The proposed solution was to shrink the thymus with high doses of radiation, or to remove the gland entirely. These procedures had a high mortality rate, and led to even more deaths. What is sad is that these operations never needed to be performed. Subsequent research has shown that these doctors were mistaken in their assumptions and that the thymus is not responsible for SIDS.

Correlation Does Not Imply Causation

The above should give us pause whenever statistical evidence is used to justify things such as medical regimens, legislation, and educational proposals. It is important that data be interpreted carefully, especially if results involving correlation are going to affect the lives of others.
When anyone states, “Studies show that A is a cause of B and some statistics back it up,” be ready to reply, “Correlation does not imply causation.” Always be on the lookout for what lurks beneath the data.
