Having received a good promotion with that corner office and a massive hike in your paycheck, you finally decide to buy your own house. And you get in touch with me, a real estate broker.
Being a smart broker, I figure out that you want to live in an affluent neighborhood. So, I tell you that the average household income of the neighborhood I'm showing you a flat in is ₹15 LPA. This convinces you of how premium the locality is, and you decide to buy the flat.
A year or so later, you see that a member of your society committee is petitioning to cap housekeeping salaries at a certain amount. After all, the average household income in this neighborhood is only ₹8 LPA. Perhaps you go along with the petition, because who doesn't want to pay lower wages, right? But you can't help being surprised to hear about that measly ₹8 LPA.
So, was I lying to you or is the society member lying now?
The answer is neither! Neither of us was lying.
The only trick I used was applying a different kind of average to come up with the ₹15 LPA figure.
Here's some Statistics 101 for you:
There are three common kinds of averages: Mean, Median, and Mode.
The mean is the sum of all samples (figures) in the set divided by the total number of samples. The mean of a set of 12 numbers S = (1, 3, 3, 5, 7, 6, 16, 27, 3, 4, 8, 9) is calculated by adding all the numbers and dividing the sum by 12.
The mean for set S is (1+3+3+5+7+6+16+27+3+4+8+9) / 12 = 92 / 12 ≈ 7.67
The median is the figure that appears halfway in the set if you arrange the figures in ascending order. It means that half of the numbers in the set lie at or above the median, and half lie at or below it.
Set S in ascending order: 1, 3, 3, 3, 4, 5, 6, 7, 8, 9, 16, 27
Since S has an even number of elements, there is no single middle element.
Hence the median of the set is the mean of the middle two elements, i.e., 5 and 6, which is 5.5.
The mode is the figure that appears the most frequently in the set.
The mode for set S is 3, as it appears most frequently in the set.
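If you'd rather let a computer do the arithmetic, here is a minimal sketch using Python's built-in statistics module on the same set S. Sorting S puts 5 and 6 in the middle, so the median comes out as 5.5. The variable names are just my own picks for illustration.

```python
from statistics import mean, median, mode

# The 12 numbers of set S from the example above
S = [1, 3, 3, 5, 7, 6, 16, 27, 3, 4, 8, 9]

s_mean = mean(S)      # sum of all values (92) divided by the count (12)
s_median = median(S)  # even count, so the mean of the two middle sorted values
s_mode = mode(S)      # the value that appears most often

print("Sorted:", sorted(S))
print("Mean:  ", round(s_mean, 2))
print("Median:", s_median)
print("Mode:  ", s_mode)
```

Three perfectly legitimate "averages", three different numbers, all from the same 12 figures.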
Now, coming back to our original scenario, if you want to check how premium a locality is, you want to know the average household income as defined by the mode or median, and not by the mean.
Why?
Because even three millionaires living in the area can boost the mean household income enormously, even if most of the people living in the locality are poor.
A median, in this case, would actually reveal the most information. If I told you that the median household income is ₹8 LPA, you would know that half the households in the locality earn ₹8 LPA or more, and half earn ₹8 LPA or less. Or if you wanted to get a sense of what most people in a locality earn, you would find the mode more useful, as it is the income figure that appears most frequently in the locality.
Consider this:
A Jeff Bezos or an Elon Musk coming to live in this locality would raise the mean net worth per resident by at least $10 mn, or maybe more. But is that number a good representation of the typical individual's net worth in the locality? No.
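To see how both the ₹15 LPA and the ₹8 LPA figures can be true at once, here is a rough sketch with a made-up locality of 100 households. The income figures are purely hypothetical, chosen by me so that the numbers line up with the story; they are not real data.

```python
from statistics import mean, median, mode

# Hypothetical household incomes in lakhs per annum (LPA).
# Assumption: 37 households earn 6, 60 earn 8, and 3 very rich ones earn 266.
incomes = [6] * 37 + [8] * 60 + [266] * 3

locality_mean = mean(incomes)      # pulled up by the three rich households
locality_median = median(incomes)  # the halfway household
locality_mode = mode(incomes)      # the most common income

print("Mean:  ", locality_mean, "LPA")
print("Median:", locality_median, "LPA")
print("Mode:  ", locality_mode, "LPA")
```

In this toy locality, 97 out of 100 households earn ₹8 LPA or less, yet a broker quoting the mean can honestly say ₹15 LPA.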
Where does the distinction between mean, median, and mode matter most?
For this, let's understand Nassim Taleb's concept of Mediocristan and Extremistan.
Mediocristan is any domain where there is not a lot of variance between the values the numbers can take.
The height of adults in a population is one such domain. You know that no matter what sample on earth you pick, the height of adults in that population will vary from about 4 feet to 8 feet. That's the maximum amount of variance you can expect.
If an 8-foot man entered a bar with 100 other people in it, it would hardly make a difference to the average height of people in the bar, because height can take only a very small range of values, with an upper bound. The distribution of height in a population is thin-tailed with little variance.
The same goes for the age of working people in a given population. At most, that number will range from about 18 to 90, and you can say with a relative degree of certainty that most working people will be between ages 20 and 60.
These are mediocristan domains. There is not a lot of variance.
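The bar example is easy to check for yourself. Here is a small sketch, assuming (purely for illustration) that all 100 people already in the bar are 67 inches tall, which is about 5 feet 7 inches.

```python
from statistics import mean

# Assumption: 100 people in the bar, each 67 inches tall (hypothetical figure)
heights = [67] * 100
mean_before = mean(heights)

# An 8-foot (96-inch) man walks in
heights.append(96)
mean_after = mean(heights)

shift = mean_after - mean_before
print("Mean height before:", mean_before, "inches")
print("Mean height after: ", round(mean_after, 2), "inches")
print("Shift:             ", round(shift, 2), "inches")
```

Even the tallest plausible human moves the mean by less than a third of an inch.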
This is not the case with, say, a contagious disease that multiplies exponentially due to network effects with no upper bound, or a person's wealth, which again has no upper bound. These are highly nonlinear, fat-tailed distributions (extremistan domains), where there is no upper bound on the range of values in the set. You can have a person with an annual income of $100 and a person with an annual income of $1 billion sitting in the same room.
If I were to tell you the average household income in a village, it would actually tell you nothing about the individual incomes of households in the village. Maybe a typical house in that village has an income near the average income; maybe it's nowhere near that, with one multi-millionaire household in the village skewing the numbers.
Mean values are almost meaningless in extremistan domains. It's like stating the average distance between any two stars in the universe. Because the range of values is uncapped and varies so widely, the mean in this case tells you next to nothing.
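Compare that with what happens in a mediocristan domain like height. Here is a sketch of the same "one person walks into a room" experiment, but with income. The $50,000 figure for the 100 people already in the room is an assumption of mine, picked only to make the arithmetic easy to follow.

```python
from statistics import mean, median

# Assumption: 100 people, each with a hypothetical annual income of $50,000
room = [50_000] * 100
mean_before = mean(room)
median_before = median(room)

# One person with an annual income of $1 billion walks in
room.append(1_000_000_000)
mean_after = mean(room)
median_after = median(room)

print("Mean before:  ", mean_before)
print("Mean after:   ", round(mean_after))
print("Median before:", median_before)
print("Median after: ", median_after)
```

The mean jumps from $50,000 to nearly $10 million, a figure that describes nobody in the room, while the median does not move at all.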
How can not knowing the difference between mean, median, and mode affect decision-making and interpretation of data?
A simple example would be of people's preferences when it comes to food choices, which are highly nonlinear and differ on an individual basis. Some of them are mostly binary — a clear yes or a clear no.
An example: Pineapple on Pizza. Some like it, others hate it.
But if the government designed pizzas, each would be 25% covered in pineapple, since that is what the average pizza customer wants. In reality, nobody wants this at all: they either want the pizza to be full of pineapple or they do not want it at all.
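The pizza problem can be put into numbers too. Here is a toy sketch, assuming 100 hypothetical customers of whom 25 want a pizza fully covered in pineapple (1.0) and 75 want none at all (0.0). The 25/75 split is my own assumption, chosen to match the 25% figure.

```python
from statistics import mean, mode

# Assumption: each value is the fraction of the pizza a customer wants covered
wanted_coverage = [1.0] * 25 + [0.0] * 75

average_coverage = mean(wanted_coverage)  # what the "average customer" wants
most_common = mode(wanted_coverage)       # what most customers actually want

# How many customers actually asked for the average pizza?
satisfied = sum(1 for w in wanted_coverage if w == average_coverage)

print("Mean coverage:      ", average_coverage)
print("Most common request:", most_common)
print("Customers who wanted the mean pizza:", satisfied)
```

The pizza designed for the mean customer is the one pizza that not a single customer ordered.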
Takeaways
1. Always look at the story behind the numbers.
You may read a news article and find out that the average household income in India is so and so. You should not read too much into that figure unless you also know how "household" has been defined, as well as what kind of average this is. You should also ask how this data was collected and from whom.
It's very easy to be fooled by graphs and numbers if you aren't careful.
2. Stop trying to solve for the mean.
Especially if you're working in Extremistan — where individual preferences and traits can vastly differ from the average (mean) preferences or traits of the group.
By serving the best solution to your customers on average, you might not be serving anyone.