Distribution

Ashish Agarwal
3 min read · Sep 18, 2020

Okay, this post is about the intuition behind distributions, specifically statistical distributions. So, what is a statistical distribution? Why do we need one? Is it better than a histogram, or do the two share something in common?

For a basic understanding of histograms, refer to Histogram.

Fig: Histogram

Let’s say we have a simple histogram that describes people’s heights in a sample of data. From the figure above, we can say that if we pick one measurement at random, there is a good chance it will be between 140 cm and 180 cm.

Now, that’s helpful, but the range is too wide to tell us much. We want something more precise. Let’s increase the number of bins and see what we get.

Fig: Histogram with more bins

Hmm. That’s interesting. Now we can say that we have a good chance of picking a measurement around 160 cm. This provides more detail than the previous histogram. So a natural question arises: what if we keep increasing the number of bins?

Fig: Histogram with N bins

What? That was disappointing. This graph does not seem to provide any insight.

Well, the thing is, a histogram represents our data well only within a certain range of bin counts. With too few bins, the detail is lost; with too many, most bins hold one measurement or none at all, and the overall shape disappears. So, after some trial and error, we settle on a number of bins that works well.
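To make the effect of the bin count concrete, here is a minimal sketch using only the Python standard library. The data is synthetic: the mean of 160 cm, the spread of 15 cm, the sample size of 200 and the `histogram` helper are all assumptions made for illustration, not the data behind the figures.

```python
import random

def histogram(values, n_bins, lo, hi):
    """Count the values in each of n_bins equal-width bins covering [lo, hi)."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v < hi:
            # min() guards against float rounding producing index n_bins
            index = min(int((v - lo) / width), n_bins - 1)
            counts[index] += 1
    return counts

random.seed(0)  # fixed seed so the run is repeatable
# Synthetic heights in cm (assumed mean 160, assumed spread 15)
heights = [random.gauss(160, 15) for _ in range(200)]

for n_bins in (4, 16, 400):
    counts = histogram(heights, n_bins, 100, 220)
    empty = counts.count(0)
    print(f"{n_bins:>3} bins: tallest bin holds {max(counts)} people, "
          f"{empty} bins are empty")
```

With 400 bins and only 200 measurements, at least half of the bins must be empty, which is why the very fine histogram stops telling us anything about the shape.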

Fig: Histogram with a relatively better number of bins

Now the histogram tells us about the probability of picking someone with a given measurement. For example, we have a better chance of picking someone from the middle than from the ends. Also, the taller the bin, the higher the chance that someone chosen at random has a height in that bin.

Now, the interesting thing is that we can use a curve to approximate the histogram. But why would we want to do that?

Fig: Distribution

Interesting. The curve gives a smooth transition between measurements and provides the same insight as the histogram. On top of that, the distribution fills in the gaps in the histogram using the relative probability of neighbouring values. Clearly, there are people with heights between 220–230 cm or below 100 cm. But since our data sample didn’t contain those measurements, the histogram shows nothing there. The distribution curve, however, gives us a small but non-zero probability for those cases as well.
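Here is a rough sketch of that gap-filling idea. The heights are made up for illustration, and a normal (bell-shaped) curve is assumed as the approximating curve, fitted with `NormalDist.from_samples` from Python's standard `statistics` module. The 180–190 cm range is used as the gap because this toy sample happens to contain nobody that tall.

```python
from statistics import NormalDist

# Hypothetical sample of heights in cm (not the data behind the figures)
heights = [142, 148, 151, 155, 157, 158, 160, 161,
           163, 165, 166, 168, 171, 174, 179]

# Fit a normal curve to the sample (assumption: the data is roughly bell-shaped)
curve = NormalDist.from_samples(heights)

# A histogram bin covering 180-190 cm would be empty for this sample...
count_in_gap = sum(1 for h in heights if 180 <= h < 190)

# ...but the fitted curve still has a positive density there
density_in_gap = curve.pdf(185)

print(f"measurements in 180-190 cm: {count_in_gap}")
print(f"curve density at 185 cm: {density_in_gap:.5f}")
```

The histogram says "nothing here", while the curve says "unlikely, but possible", which is closer to what we believe about real heights.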

Further, let’s say we want to find the likelihood of heights between 120–130 cm, 120–135 cm, 120–140 cm, and so on. With histograms, each range needs bin edges that line up with it, so we would need several histograms with different bins. This is where the distribution curve comes in handy. It is not limited by the number of bins: we can determine the probability of any range with simple calculus, as the area under the curve over that range. Hence, we don’t need multiple bin settings for our use cases.
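As a small sketch of the area-under-the-curve idea, assume the curve is a normal distribution with a mean of 160 cm and a standard deviation of 15 cm (both numbers are assumptions, as is the helper name `prob_between`). The area between two heights is then the difference of the cumulative distribution at the two endpoints:

```python
from statistics import NormalDist

# Assumed curve: heights in cm, normal with mean 160 and standard deviation 15
curve = NormalDist(mu=160, sigma=15)

def prob_between(dist, low, high):
    """Area under the curve between low and high, via the cumulative distribution."""
    return dist.cdf(high) - dist.cdf(low)

# One curve answers all three range questions, no re-binning needed
for high in (130, 135, 140):
    p = prob_between(curve, 120, high)
    print(f"P(120 <= height <= {high}) = {p:.4f}")
```

The same function works for any pair of endpoints, which is exactly what a fixed set of bins cannot offer.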

Fig: Distribution with small sample-size

Finally, if we don’t have many measurements, a curve based on the mean and the standard deviation of the collected data will generally provide a good approximation of the distribution, as long as the data is roughly bell-shaped like the heights here.

So, now for the answer to the question at the top of the page:

Both histograms and curves are distributions. They show us how the probabilities of measurements are distributed. The tallest region of the histogram or curve marks where measurements are most likely.
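A minimal sketch of that last point, with six made-up measurements and the assumption that a normal curve is a reasonable shape for heights:

```python
from statistics import NormalDist, mean, stdev

# Hypothetical small sample of heights in cm
small_sample = [150, 158, 162, 163, 165, 171]

mu = mean(small_sample)
sigma = stdev(small_sample)  # sample standard deviation

# Build the approximating curve from just these two numbers
curve = NormalDist(mu, sigma)

print(f"mean = {mu:.1f} cm, standard deviation = {sigma:.1f} cm")
print(f"P(140 <= height <= 180) = {curve.cdf(180) - curve.cdf(140):.3f}")
```

Six points would make a very rough histogram, but two summary numbers are enough to draw a full curve. How trustworthy that curve is still depends on how well the sample represents the population.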
