There is an elegant alternative to calculating a cumbersome binomial distribution: the work of Gauss, De Moivre, and Laplace gives us an approach that makes life easier.
If we flip a fair coin, we know the probability of heads: p = 1/2. However, when running live experiments, the observed frequency doesn't sit exactly at 1/2 but deviates slightly. The Law of Large Numbers states that with a high enough n (the number of experiments), the observed frequency gradually converges to the true probability of 1/2.
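Here's a minimal Python sketch that simulates this (the flip count is arbitrary, chosen just to illustrate the convergence):

```python
import random

flips = 100_000
heads = sum(random.random() < 0.5 for _ in range(flips))

# With enough flips, the observed frequency settles very close to 0.5.
print(heads / flips)
```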
What if we want to understand the probability of obtaining a specific outcome? For example, if we flip a coin 10,000 times, how likely is it that we get exactly 5,040 heads? We would use the binomial probability to determine this, which is defined as follows:
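$$P(X = k) = \binom{n}{k}\,p^{k}\,(1-p)^{n-k}$$

where $n$ is the number of flips, $k$ is the number of heads, and $p$ is the probability of heads on a single flip.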
Approaching this problem by hand (or even with a computer) is cumbersome with an n as high as 10,000. Much of mathematics is distilling an existing formula into a simpler one (think of the Fourier series), and this problem is no exception.
Plotting the binomial distribution of 10,000 coin flips looks like the graph below. If we want a simpler equation, it needs to reproduce this output. The graph takes a familiar shape, the bell curve, which gives us a starting point.
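Here's a minimal sketch of how such a plot can be produced, assuming numpy, scipy, and matplotlib are available (the range around 5,000 heads is chosen to show the bulk of the distribution):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

n, p = 10_000, 0.5
k = np.arange(4800, 5201)  # outcomes near the expected 5,000 heads

# With n this large, the discrete probabilities trace out a smooth bell shape.
plt.bar(k, binom.pmf(k, n, p), width=1.0)
plt.xlabel("number of heads")
plt.ylabel("probability")
plt.title("Binomial distribution of 10,000 coin flips")
plt.show()
```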
Our binomial distribution uses discrete increments (whole numbers of heads only) while the bell curve is a continuous function. But when you chart the outputs as a histogram with a sufficiently high n, the bars trace out a smooth-looking curve, much like the way discrete sums give way to the continuous machinery of derivatives and integrals.
Now the goal becomes finding a continuous function that fits this shape: the normal distribution, a much easier function to work with.
Through trial and error, they landed on the following as the best candidate:
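$$f(x) = e^{-x^{2}}$$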
This closely mirrors the shape we are looking to achieve.
The issue is that a probability distribution must have a total area of 1, but this curve's total area does not equal 1. Essentially, we want to find a scaled version of the function that satisfies the following:
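$$\int_{-\infty}^{\infty} f(x)\,dx = 1$$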
This is a scaling exercise: to turn the curve into a probability distribution we need to know its total area, and an integral does exactly that, since you provide it boundaries and its sole purpose is to produce the area under the curve.
This integral can't be solved the usual way because you cannot take the antiderivative of this function. Said differently, there is no combination of polynomials, exponentials, logarithms, trig functions, or roots (the elementary functions) whose derivative equals $e^{-x^{2}}$. So the clever idea was to square the integral and turn this into a two-dimensional problem, known as the Gaussian integral trick.
We start by defining the integral:
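$$I = \int_{-\infty}^{\infty} e^{-x^{2}}\,dx$$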
Then we square the integral in a manner that produces a double integral.
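$$I^{2} = \left(\int_{-\infty}^{\infty} e^{-x^{2}}\,dx\right)\left(\int_{-\infty}^{\infty} e^{-y^{2}}\,dy\right) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-(x^{2}+y^{2})}\,dx\,dy$$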
Above you see that this squared integral sums up the volume under the surface in incremental changes across both the $x$ and $y$ axes. The $x^{2}+y^{2}$ term immediately tells us there's a circle at play, since $x^{2}+y^{2}=r^{2}$ describes a circle of radius $r$ centered at the origin.
The circle also signals the need for polar coordinates, which let us describe each point by a radius $r$ and an angle $\theta$ (in radians); therefore we need to re-express our $x$ and $y$ coordinates accordingly.
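$$x = r\cos\theta, \qquad y = r\sin\theta, \qquad x^{2}+y^{2}=r^{2}, \qquad dx\,dy = r\,dr\,d\theta$$

The extra factor of $r$ in the area element is what makes the trick work: a small polar patch has area $r\,dr\,d\theta$, not simply $dr\,d\theta$.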
We plug these values back into our double integral and get:
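$$I^{2} = \int_{0}^{2\pi}\int_{0}^{\infty} e^{-r^{2}}\,r\,dr\,d\theta$$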
We solve the inner integral first:
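$$\int_{0}^{\infty} r\,e^{-r^{2}}\,dr$$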
We'll use u-substitution to simplify this integration, converting $r^{2}$ to $u$:
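$$u = r^{2}, \qquad du = 2r\,dr \;\;\Rightarrow\;\; r\,dr = \frac{du}{2}$$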
This positions us to swap out $r\,dr$ and simplify the integration:
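$$\int_{0}^{\infty} r\,e^{-r^{2}}\,dr = \frac{1}{2}\int_{0}^{\infty} e^{-u}\,du$$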
The integral of $e^{-u}$ from $0$ to $\infty$ is $1$, so the inner integral equals $\tfrac{1}{2}$, and we can plug that value into the outer integral:
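$$I^{2} = \int_{0}^{2\pi} \frac{1}{2}\,d\theta = \pi \;\;\Rightarrow\;\; I = \int_{-\infty}^{\infty} e^{-x^{2}}\,dx = \sqrt{\pi}$$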
We'll re-express this back as a function. To represent it as a proper probability distribution whose total area equates to 1, we divide it by $\sqrt{\pi}$:
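$$f(x) = \frac{1}{\sqrt{\pi}}\,e^{-x^{2}}, \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1$$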
Now that we have the desired curve to fit our binomial distribution, and we've normalized it to 1 so it is a proper probability distribution, we need to tune the width of the curve. Picture a knob we adjust to achieve this, controlled by the variance $\sigma^{2}$. We generalize the overall width using the standard deviation $\sigma$, re-representing the $x$ in our function by dividing it by $\sigma$:
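$$e^{-(x/\sigma)^{2}}$$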
Stretching the curve by $\sigma$ also stretches the area underneath it, so we need to re-normalize once more to ensure we stay within the bounds of a total probability of 1. The stretched curve's area is:
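$$\int_{-\infty}^{\infty} e^{-(x/\sigma)^{2}}\,dx = \sigma\sqrt{\pi}$$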
We re-apply the standard deviation, this time to the denominator:
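$$f(x) = \frac{1}{\sigma\sqrt{\pi}}\,e^{-(x/\sigma)^{2}}$$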
Out of convenience for when we derive the formula further, we replace $\sigma$ with $\sigma\sqrt{2}$ inside the exponent, so that squaring transfers the 2 to the denominator (this choice makes the parameter $\sigma^{2}$ exactly the variance of the distribution). The normalizing constant picks up the same factor, giving:
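$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{x^{2}}{2\sigma^{2}}}$$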
Our function is normalized and centered at 0. Now we need to re-express it so we can shift the curve in any horizontal direction, which we do with the mean $\mu$: we translate $x$ into $x - \mu$.
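$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$$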
We're almost there. This is a probability distribution, so we need to set its variables to account for that original coin-toss experiment we wanted to run.
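For a binomial experiment with $n$ trials and success probability $p$ (the De Moivre-Laplace approximation), we set

$$\mu = np, \qquad \sigma = \sqrt{np(1-p)}$$

For our 10,000 fair coin flips this gives $\mu = 5{,}000$ and $\sigma = 50$.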
You no longer need to raise anything to the power of $n$ or evaluate enormous binomial coefficients, which drastically simplifies the calculation of binomial probabilities. It gets at the heart of the Central Limit Theorem and shows that if you average enough random (binary) events, the overall distribution looks normal.
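As a sanity check, here's a minimal Python sketch (standard library only; the helper function names are just illustrative) comparing the exact binomial probability of 5,040 heads in 10,000 flips against the normal approximation:

```python
import math

def binomial_pmf(k, n, p):
    # Exact binomial probability P(X = k), computed in log space via
    # lgamma so the enormous binomial coefficient never overflows.
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

def normal_pdf(x, mu, sigma):
    # Normal density: 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2*sigma^2))
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

n, p, k = 10_000, 0.5, 5_040
mu, sigma = n * p, math.sqrt(n * p * (1 - p))  # mu = 5000, sigma = 50

print("exact binomial:", binomial_pmf(k, n, p))
print("normal approx :", normal_pdf(k, mu, sigma))
```

Both values come out to roughly 0.0058, yet the normal approximation is a single closed-form expression with no combinatorics involved.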
What's most inspiring about this formula is the instinct mathematicians have to take a proven model of reality and simplify it further by finding its closest analog. Functions like these reduce the computational complexity of modeling random processes.