Population and Sample Parameters

Ashish Agarwal
Published in Analytics Vidhya
3 min read · Sep 22, 2020


I feel most of us have struggled to grasp the concepts of mean, standard deviation and variance for a population versus a sample. What is a population, and why do we need a sample? And most importantly, why do we have an n-1 term in the denominator when calculating the variance of a sample, but n for a population?

First of all, I’ll put down some formulas for the terminology, so we won’t get carried away by the symbols used in defining the parameters.

Fig: formulas
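
For reference, the standard textbook definitions can be written out as follows. This is a sketch in the usual notation, assuming N is the population size and n is the sample size; the exact symbols in the figure may differ.

```latex
\begin{aligned}
\text{Population mean:} \quad & \mu = \frac{1}{N}\sum_{i=1}^{N} x_i \\
\text{Population variance:} \quad & \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2 \\
\text{Sample mean:} \quad & \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \\
\text{Sample variance:} \quad & s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2
\end{aligned}
```

The standard deviations 𝜎 and s are the square roots of the corresponding variances.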

Okay. Now, what is a population and what is a sample?

Most of the time, we want to estimate what our data or its distribution looks like. And clearly enough, we won’t have all the data, because collecting it might not be possible, or we might lack the money or the time.

For example, say we want to find the average height of humans. To do that exactly, we would have to measure every human on Earth. But is that viable?

So, in essence, the population is the ideal case: our whole data. Population refers to the complete set of data, i.e. there is no data that is not included in the population. We always want the population measures (𝜇, 𝜎), but since getting them is not possible most of the time, we estimate them. How?

Now, the sample comes to the rescue!

A sample is nothing but a small subset of the population that is picked at random. Because it is picked at random, a sample generally resembles the population. Also, we can choose a fairly small number of measurements to deal with.

For example, we measure about 10,000 people around the world and assume these measurements give us results similar to those we would get from the whole population.
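
To make this concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the population is simulated from a normal distribution with a mean of 170 cm and a standard deviation of 10 cm (an assumption, not real height data), and its size is kept small enough to run quickly.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical population: 200,000 simulated heights in cm (not real data).
population = [random.gauss(170, 10) for _ in range(200_000)]

# Pick 10,000 members of the population at random, without replacement.
sample = random.sample(population, 10_000)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two printed means should come out close to each other, even though the sample contains only a small fraction of the population.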

Okay, but I’m not convinced...

Fig: population and sample
Fig: Mean for sample and population

Consider the blue dots to be a population and the orange dots to be a random sample, with the green and red dots being their means respectively. We can see that the sample is drawn from the population. Also, the mean of the sample is fairly near the mean of the population (this may not always be the case).

For multiple samples, we have:

fig: Multiple samples
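
The same experiment can be sketched in code. This is only an illustration under assumed numbers: the population is again simulated (normal, mean 170, standard deviation 10), and the helper `sample_means` is a hypothetical name introduced here. It draws many samples of a given size and records the mean of each one.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population of simulated values (an assumption, not real data).
population = [random.gauss(170, 10) for _ in range(100_000)]
population_mean = statistics.fmean(population)

def sample_means(sample_size, num_samples=500):
    """Draw num_samples random samples and return the mean of each one."""
    return [
        statistics.fmean(random.sample(population, sample_size))
        for _ in range(num_samples)
    ]

# How widely the sample means scatter, for each sample size.
spread_by_size = {}
for sample_size in (10, 100, 1000):
    means = sample_means(sample_size)
    spread_by_size[sample_size] = statistics.pstdev(means)
    print(
        f"n={sample_size:>4}: average of sample means = {statistics.fmean(means):.2f}, "
        f"spread of sample means = {spread_by_size[sample_size]:.2f}"
    )

print(f"population mean = {population_mean:.2f}")
```

The larger the samples, the more tightly their means should cluster around the population mean.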

Hence, we can see a pattern here. Most of the time (if the samples are sufficiently large), sample means are near the population mean. Now, let’s delve a little into standard deviation and variance (they convey much the same thing but in different units, since the standard deviation is the square root of the variance).

Variance... what does it mean? It is simply how varied our data is, i.e. how far the data points spread around their mean. Now, for a population, we can see that the points range from the far left to the far right, so the population shows the full spread of the data.

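From another perspective, the population extremes (points at the very left or right) are much farther from the population mean (𝜇) than the sample data points typically are. Also, the sample mean sits in the middle of the sample measurements by construction, so deviations measured from it come out smaller than deviations measured from 𝜇, and that accounts for less variance.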

So, what does that mean? Is there a catch?

Now, the whole point of a sample is to estimate the population parameters. The sample mean estimates the population mean quite well. But the sample variance computed with n in the denominator tends to be smaller than the population variance (due to the comparatively lower value of the numerator in the formula), so we compensate by lowering the denominator as well. Hence, we have n-1 as the denominator when calculating the sample variance and the sample standard deviation (s).
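
A quick simulation can show the effect. This is a sketch under assumed numbers, not a proof: the population is simulated, the sample size of 5 is chosen deliberately small so the gap is easy to see, and the two estimators come from Python's standard `statistics` module (`pvariance` divides by n, `variance` divides by n-1).

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population of simulated values (an assumption, not real data).
population = [random.gauss(170, 10) for _ in range(100_000)]
population_variance = statistics.pvariance(population)  # divides by N

n = 5                 # small sample size, where the bias is easy to see
num_samples = 20_000  # number of samples to average over

divide_by_n = []
divide_by_n_minus_1 = []
for _ in range(num_samples):
    sample = random.sample(population, n)
    # Both use squared deviations from the sample mean; only the denominator differs.
    divide_by_n.append(statistics.pvariance(sample))         # denominator n
    divide_by_n_minus_1.append(statistics.variance(sample))  # denominator n - 1

avg_n = statistics.fmean(divide_by_n)
avg_n_minus_1 = statistics.fmean(divide_by_n_minus_1)

print(f"population variance:         {population_variance:.1f}")
print(f"average estimate with n:     {avg_n:.1f}")
print(f"average estimate with n - 1: {avg_n_minus_1:.1f}")
```

Averaged over many samples, the estimate with n in the denominator should land noticeably below the population variance, while the estimate with n-1 should land close to it.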

In essence, the n-1 in the sample variance is just a mathematical/statistical hack, known as Bessel's correction, to provide a more accurate estimate of the population variance.
