In the last article, we discussed the probability mass function and its properties. In this article, we discuss the discrete distributions also termed as distributions of discrete random variables or probability mass functions for discrete random variables.
Here is a quick recap of what is meant by random variable and the distribution of a random variable
A random variable maps all the possible outcomes in a sample space to a numeric quantity and an assignment of probabilities to all possible values that the random variable can take is termed as its distribution.
We also discussed that we can think of or we can represent the distribution as a table but sometimes this table can be very large so the question that we try to answer in this article is “Can we specify the PMF very compactly?”
If we think it in terms of function, then it would look like:
And this is a very elaborate representation of the function, for every value of ‘x’ we have listed down the corresponding probability.
To motivate this, let’s take another example:
So, this is an experiment involving tossing of a coin and say the probability of getting heads is ‘p’
And it could be the case that we keep getting Tails for a large number of times and then we get a Heads finally or it could be the case that we get the Heads in the very first flip of the coin or maybe in 2nd flip and so on.
So, the support of this random variable is an infinite set from 1 to infinity
And if we have to write the function as in the previous case, then this is what the function would look like:
And this is clearly not desirable. What is desirable is the compact way of writing the same function as represented in the below image:
This function equation depends on two things: first is the value of the ‘x’ and the second is the value of ‘p’ which corresponds to the probability of heads in a single trial
We will discuss the rationale behind this formula and other questions(below image) related to it at a later stage in the article.
The parameter, in this case, is ‘p’, ‘x’ is not a parameter it’s just input to the function, if we know the value of ‘p’, we can compute the probability value for all possible values ‘x’ can take and this is desirable.
The reason we want the entire distribution to be specified by some parameters is that it plays an important role in Machine Learning for example, given an image as the input, we want to label it
We are interested in the conditional probability that the label is cat given an image as the input
Now we could think of this as the following problem
Given an image as input(image is a high-dimensional stack of numbers), what the is the probability that the image is of a cat(probability value will be given by the function, it will be between 0 and 1 and if the output from the function is greater than 0.5, we call it as a cat).
Now this PMF can be as complex whose parameters we learn from the data and once we learn the values of these parameters, we can compute the probability value for any given input
In the previous example, we had one parameter ‘p’, this is also something we learned from the data for example if we have data for the previous 10000 coin tosses that we have done and for every coin toss we tell whether it was heads or tails and now we can use the frequency definition of probability and we compute the probability of heads which will the value of ‘p’ in this case.
In Machine learning, we have a large number of parameters that we learn from the data(say 100,000 images are given along with their labels and we use this data to learn the parameters of the function), and eventually, our goal is that given such a function, we should be able to pass in any input and it should give the corresponding probability value/output.
So, that’s why having such compact functions where the entire distribution can be specified by a few parameters is helpful.
Entire distribution means for any possible value of ‘x’, we can tell the value of probability that can be computed using a function and that function can be as complex as required(it could have any number of parameters and we can learn these parameters from the data). And that’s where the concept of PMF and representing the PMF in a compact manner comes into the picture.