In the last article, we discussed the binomial distribution, where we are interested in the probability of ‘k’ successes in ‘n’ trials.
With the binomial distribution, we talked about tossing a coin ‘n’ times. With the geometric distribution, we do not fix the number of tosses in advance: we just keep tossing the coin, and we are interested in the number of tosses until we see the first heads. As discussed in the last article, the binomial random variable can take on values from 0 to ‘n’. The geometric random variable can take on any value from 1 upwards, with no upper limit: we may get heads on the very first toss, or we may still not have seen heads after 1000 tosses, or after any larger number of tosses.
We are interested in assigning a probability to every value that the random variable can take, which gives the probability distribution of this random variable, and we want the equation of this distribution to be in terms of a few parameters.
We can think of this as repeating a Bernoulli trial again and again, where we are interested in the number of trials it takes to get the first success.
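The idea of repeating a Bernoulli trial until the first success can be sketched as a small simulation. This is only an illustration: the function name `trials_until_first_success`, the fair-coin value p = 0.5, and the fixed seed are assumptions made for this example and do not come from the article.

```python
import random


def trials_until_first_success(p, rng):
    """Repeat a Bernoulli(p) trial and return the number of the trial
    on which the first success occurs (1, 2, 3, ...)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1] so that a success can occur")
    k = 1
    # rng.random() is uniform on [0, 1), so it is below p with probability p.
    while rng.random() >= p:
        k += 1  # this trial was a failure, so try again
    return k


rng = random.Random(0)  # fixed seed, only so that repeated runs match
samples = [trials_until_first_success(0.5, rng) for _ in range(10)]
print(samples)
```

Every value returned is at least 1, and the loop has no built-in upper limit, which mirrors the range of values the random variable can take.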
Let’s look at some examples where this distribution is useful:
Consider a hawker selling belts outside a subway station. Over the days and months that he sits there, a practically unlimited number of people walk past him, and he would like to know when he is going to encounter the first person who buys a belt.
He would want the probability to be very high for small values of ‘k’, so that he can expect to sell something within the first 3–4 people who pass by his shop.
Another example: say a salesman is handing out pamphlets to passersby, and he is interested in the probability that the kᵗʰ person will be the first one to actually read the pamphlet. This gives him an idea of how many pamphlets he should expect to hand out before someone reads one.
Another example would be:
So, this distribution is useful in any situation that involves waiting: we keep repeating a trial and want to know how long it will take to get the first success (we know a success will come eventually). This happens in many situations, especially in sales.
These are all independent trials. Every customer who passes by is independent of the others: he does not care about what the earlier customer did (purchased or not).
The trials are also identically distributed, which means every person passing by the shop has the same probability of buying the product.
Let’s take some examples and then derive the general formula for geometric distribution:
Say k = 5, which means the first four trials resulted in failure and the fifth trial is the one where we got the first success. What happens after the fifth trial does not matter, as the geometric distribution only looks at the number of trials up to and including the first success.
We rely on the property that all these trials are independent. The first failure occurs with a probability of (1-p), the second failure occurs with a probability of (1-p), the same holds for the 3rd and the 4th failures, and the success in the fifth trial occurs with a probability of ‘p’. Multiplying these together gives P(X = 5) = (1-p)(1-p)(1-p)(1-p)(p) = (1-p)⁴·p.
In general, we would have the following formula:
P(X = k) = (1-p)ᵏ⁻¹·p, for k = 1, 2, 3, …
This distribution is fully specified by just one parameter, ‘p’. Once we have the value of ‘p’, we can compute the probability for any value of ‘k’.
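As a minimal sketch of how a single parameter is enough, the probability of (k-1) failures followed by one success can be written as a short function. The name `geometric_pmf` and the value p = 0.2 are assumptions chosen for illustration.

```python
def geometric_pmf(k, p):
    """Probability that the first success occurs on trial k:
    (k - 1) failures, each with probability (1 - p), then one success."""
    if k < 1 or int(k) != k:
        raise ValueError("k must be a positive integer")
    if not 0 <= p <= 1:
        raise ValueError("p must be a probability between 0 and 1")
    return (1 - p) ** (k - 1) * p


p = 0.2  # illustrative value, not taken from a specific problem
for k in (1, 2, 5):
    print(f"P(X = {k}) = {geometric_pmf(k, p):.5f}")
```

For k = 5 this evaluates (1-p)⁴·p, the same product of four failures and one success worked out by hand.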
Let’s take the case when ‘p’ equals 0.2. In the image below, we plot the probabilities for the first 25 values of ‘k’, although the random variable in a geometric distribution can take on infinitely many values.
We leverage the ‘geom’ distribution object from the ‘scipy.stats’ module.
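The original plotting code is not shown here, so the following is only a sketch of how such a plot could be produced. It assumes NumPy and SciPy are installed; the helper `plot_pmf` is a hypothetical name, and it imports matplotlib (also an assumption) only when it is called.

```python
import numpy as np
from scipy.stats import geom

p = 0.2
k = np.arange(1, 26)   # the first 25 values: k = 1, 2, ..., 25
pmf = geom.pmf(k, p)   # scipy's geom counts trials, so its support starts at 1

# Print the first few probabilities as a quick check.
for ki, prob in zip(k[:5], pmf[:5]):
    print(f"k = {ki}: {prob:.4f}")


def plot_pmf(k, pmf, p):
    """Bar chart of the probabilities (hypothetical helper, needs matplotlib)."""
    import matplotlib.pyplot as plt

    plt.bar(k, pmf)
    plt.xlabel("k")
    plt.ylabel("P(X = k)")
    plt.title(f"Geometric distribution, p = {p}")
    plt.show()
```

Calling `plot_pmf(k, pmf, p)` draws one bar per value of ‘k’.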
And we have the function’s equation as the following:
P(X = k) = (1-p)ᵏ⁻¹·p
If ‘k’ equals 1, then the value on the right-hand side is just ‘p’, and we see that the bar corresponding to ‘k’ as 1 points to a probability value of 0.2.
Similarly, for ‘k’ as 2, the right-hand side of the equation is (1-p)(p), and putting in the value of ‘p’ we get (0.8)(0.2) = 0.16.
The probability then keeps decreasing as ‘k’ tends towards infinity.
As long as the probability of success is non-zero, the case where the first success occurs only after a very large number of trials has a very low probability. We are going to encounter a success at some point, and we will not have to go all the way up to infinity.
For example, we see that the probability value is very low for ‘k’ as 12. This corresponds to the first 11 trials all resulting in failure, followed by one success, which is unlikely to happen even with a low probability of success. That is why the probability value keeps decreasing as the value of ‘k’ increases.
In the binomial distribution, if the probability of success is very low, the tall bars (probability values) are towards the left, and the same is true of the geometric distribution.
For a high probability of success, the binomial distribution had all the tall bars towards the right. Let’s see the plot for a high probability of success for the geometric distribution:
For the geometric distribution, the tallest bars are still towards the left, and we can reason this out using the equation for the geometric distribution:
P(X = k) = (1-p)ᵏ⁻¹·p
For ‘k’ as 1, the value would be (1-p)⁰·(p), which is the same as ‘p’.
For every other value of ‘k’, the formula contains the term (1-p) raised to the power (k-1). Since (1-p) is less than 1, this term gets smaller as ‘k’, and with it the power (k-1), increases: a quantity less than 1 raised to higher and higher powers only gets smaller.
The geometric distribution always has this type of shape, where the tallest bars are at the left irrespective of the value of ‘p’.
And here is the plot for ‘p’ as 0.5. Once again the trend remains the same, with the tallest bars towards the left.
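The claim that the tallest bar is always the first one can be checked numerically. The sketch below is an illustration only: the helper names are made up for this example, and 0.9 stands in for "a high probability of success", since the exact value used in the plot is not stated.

```python
def pmf_values(p, n):
    """Probabilities P(X = k) for k = 1..n, using (1 - p)**(k - 1) * p."""
    return [(1 - p) ** (k - 1) * p for k in range(1, n + 1)]


def is_strictly_decreasing(values):
    """True if every value is smaller than the one before it."""
    return all(a > b for a, b in zip(values, values[1:]))


for p in (0.2, 0.5, 0.9):  # 0.9 is an assumed stand-in for a "high" p
    probs = pmf_values(p, 7)
    # Each bar is (1 - p) times the previous one, so the bars can only shrink.
    print(p, [round(x, 4) for x in probs], is_strictly_decreasing(probs))
```

A larger ‘p’ makes the first bar taller and the drop-off steeper, but it never moves the peak away from k = 1.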
Is geometric distribution a valid distribution?
To show that the geometric distribution is a valid distribution, the first thing we need to show is that the probability of any value that the random variable can take is greater than or equal to 0.
We can prove this by considering the function’s equation, P(X = k) = (1-p)ᵏ⁻¹·p.
The terms ‘p’ and (1-p) in the formula are both probabilities, so each of them lies between 0 and 1 and is never negative. That means (1-p) raised to any power is also greater than or equal to 0, and the product of two non-negative numbers is non-negative. So we can be sure that the probability of any value that the random variable can take is never negative.
The second property we need to show is that the sum of the probabilities of all the values that the random variable can take is always 1:
P(X = 1) + P(X = 2) + P(X = 3) + … = 1
Let’s expand this formula and write down all the terms:
p + (1-p)p + (1-p)²p + (1-p)³p + …
Now this is the same as the sum of an infinite geometric progression, a + ar + ar² + ar³ + …, with a = p and r = (1-p) (shown in red in the below image; the terms in yellow correspond to the terms in the geometric distribution as per the formula).
The sum of such a series is given as ‘a / (1-r)’, where the absolute value of ‘r’ is less than 1. Here r = (1-p), which is less than 1 as long as ‘p’ is greater than 0.
Using the same formula, we have the sum as 1:
p / (1 - (1-p)) = p / p = 1
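The same result can be seen numerically by adding up more and more terms. The sketch below is illustrative: `partial_sum` is a hypothetical helper and p = 0.2 is an assumed value. It also prints 1 - (1-p)ⁿ, which is what the standard formula for the first n terms of a geometric progression, a(1 - rⁿ)/(1 - r), simplifies to when a = p and r = (1-p).

```python
def partial_sum(p, n):
    """Sum of P(X = k) for k = 1..n."""
    return sum((1 - p) ** (k - 1) * p for k in range(1, n + 1))


p = 0.2  # illustrative value
for n in (1, 5, 10, 25, 50, 100):
    # The two columns should agree, and both approach 1 as n grows.
    print(n, round(partial_sum(p, n), 10), round(1 - (1 - p) ** n, 10))
```

The partial sums never exceed 1 and get arbitrarily close to it, which is what the infinite sum equalling 1 means.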
Let’s take an example:
This is a rare blood group we are talking about. Say the doctor/hospital has been given a list of volunteers for donating blood, and the administrator is going through the list to check whether each volunteer has the matching blood group.
Let’s try to solve it through the plot:
Now that we have the plot, we can read off the probability for ‘k’ as 7, which turns out to be ~0.05.
The second part is interesting: here we are looking for the chance that at least one of the first 10 volunteers in the list has the matching blood group. We can tackle this using the subtraction principle and look for the probability that none of the first 10 persons is a success, which means the first 10 persons are all failures (they do not have the matching blood type). We know the probability of a failure is (1-p), so this probability value would be:
(1-p)¹⁰
And since this is the probability of the complement of the event, we get the probability of the asked case as (1 - probability of the complement):
1 - (1-p)¹⁰
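Both parts of the example can be sketched in a few lines. The probability of a match is not stated in the text here, so `ASSUMED_P = 0.1` is purely an assumption for illustration; it is chosen only because it gives a value close to the ~0.05 read from the plot for ‘k’ as 7. The function names are hypothetical as well.

```python
ASSUMED_P = 0.1  # assumption: the actual match probability is not given in the text


def first_match_at(k, p):
    """Probability that the k-th volunteer is the first one with a match."""
    return (1 - p) ** (k - 1) * p


def at_least_one_match_within(n, p):
    """Probability that at least one of the first n volunteers matches,
    computed as 1 minus the probability that all n are failures."""
    return 1 - (1 - p) ** n


print(f"First match at the 7th volunteer: {first_match_at(7, ASSUMED_P):.4f}")
print(f"At least one match in the first 10: {at_least_one_match_within(10, ASSUMED_P):.4f}")
```

Changing `ASSUMED_P` to the probability given in the actual problem statement gives the answer for that problem.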