- Qi Wang
Welcome back, everyone, to 2023! Bayes' Theorem is a simple probability statement, and many people use a model built on it as a baseline for evaluating the performance of their natural language processing models. In this post, I'll derive Bayes' Theorem and show how to use it in spam detection, author identification, and other interesting applications that stem from such an easy probability statement!
Let's start by defining some notation.
We define $P(A)$ as the probability of an event $A$ occurring. We define $P(A \cap B)$ as the probability of both event $A$ and event $B$ occurring. Lastly, we define $P(A \mid B)$ as the probability of event $A$ occurring given that event $B$ has occurred.
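Now we will derive the formula for Bayes' Theorem. To do so, we must figure out what $P(A \mid B)$ is. Let us imagine there is a jar of cube-shaped and spherical candies, and each candy is either red or blue. For this example, there are 40 cube-shaped candies and 30 spherical candies, and 15 of the cube-shaped candies are red. If I were to ask you for the probability of finding a red candy given that it's cube-shaped, you'd easily tell me $\frac{15}{40}$.
From this example, we get that $P(\text{red} \mid \text{cube}) = \frac{P(\text{red} \cap \text{cube})}{P(\text{cube})} = \frac{15/70}{40/70} = \frac{15}{40}$. Thus, $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$. Knowing that $P(B \mid A) = \frac{P(A \cap B)}{P(A)}$, so that $P(A \cap B) = P(B \mid A)\,P(A)$, we can rearrange the equations to get $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$. And... that is Bayes' Theorem.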
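As a quick sanity check, here is a small Python sketch of the candy jar. The 40 cubes, 30 spheres, and 15 red cubes come from the example; the number of red spheres is not given in the post, so the value of 10 below is purely an assumption made so that $P(\text{red})$ can be computed.

```python
from fractions import Fraction

# Counts from the candy jar example.
total = 40 + 30          # 40 cube-shaped + 30 spherical candies
cube = 40
red_and_cube = 15

# ASSUMPTION: the post does not say how many spheres are red.
# 10 is a hypothetical value chosen only for illustration.
red_and_sphere = 10
red = red_and_cube + red_and_sphere

p_cube = Fraction(cube, total)
p_red = Fraction(red, total)
p_red_and_cube = Fraction(red_and_cube, total)

# Conditional probabilities straight from the definition.
p_red_given_cube = p_red_and_cube / p_cube
p_cube_given_red = p_red_and_cube / p_red

# Bayes' Theorem: P(red | cube) = P(cube | red) * P(red) / P(cube)
via_bayes = p_cube_given_red * p_red / p_cube

print(p_red_given_cube)  # 3/8, i.e. 15/40
print(via_bayes)         # 3/8
```

Both routes give the same answer no matter which value is assumed for the red spheres, because $P(\text{red})$ cancels out.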
So What Does This Mean?
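Many times we want to know $P(A \mid B)$, but we are only given values that let us calculate $P(B \mid A)$. Take spam detection, for example. We know the probability $P(\text{word} \mid \text{spam})$ or $P(\text{word} \mid \text{not spam})$, and we want to find the probability $P(\text{spam} \mid \text{word})$, or vice versa.
Since we have labelled data telling us whether a piece of text is spam or not, we can easily find $P(\text{word} \mid \text{spam})$ by counting the frequencies of each word inside the spam; $P(\text{word} \mid \text{not spam})$ can be found by the same method. As for $P(\text{word})$ and $P(\text{spam})$ or $P(\text{not spam})$, these are just their respective proportions in the data set.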
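Here is a rough sketch of that counting in Python. The five labelled messages are made up for illustration, and the names `LABELLED`, `p_word_given`, `prior`, and `p_spam_given_word` are hypothetical helpers, not part of any library.

```python
from collections import Counter
from fractions import Fraction

# ASSUMPTION: a tiny made-up labelled data set, for illustration only.
LABELLED = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting at noon", "not_spam"),
    ("lunch money for the meeting", "not_spam"),
    ("see you at lunch", "not_spam"),
]

# Count how often each word appears within each class.
word_counts = {"spam": Counter(), "not_spam": Counter()}
doc_counts = Counter()
for text, label in LABELLED:
    word_counts[label].update(text.split())
    doc_counts[label] += 1


def p_word_given(word, label):
    """P(word | label): frequency of the word among all words in that class."""
    total_words = sum(word_counts[label].values())
    return Fraction(word_counts[label][word], total_words)


def prior(label):
    """P(label): proportion of messages with that label."""
    return Fraction(doc_counts[label], sum(doc_counts.values()))


def p_spam_given_word(word):
    """P(spam | word) via Bayes' Theorem."""
    spam_part = p_word_given(word, "spam") * prior("spam")
    other_part = p_word_given(word, "not_spam") * prior("not_spam")
    p_word = spam_part + other_part  # total probability of the word
    if p_word == 0:
        raise ValueError(f"{word!r} never appears in the data")
    return spam_part / p_word


print(p_spam_given_word("money"))  # 1/2
print(p_spam_given_word("free"))   # 1
```

In this toy data, "free" only ever shows up in spam, so its posterior is exactly 1. That kind of overconfidence on small data is one reason to smooth the counts.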
Using Bayes' Theorem
Now, I'll go over how this works using the example of text sentiment analysis. To start, I'll create a table with the frequencies of some words that appear in positive sentences and negative sentences.
Note that this is just a made-up table; you would build this table from the data in your projects.
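To find $P(\text{positive} \mid \text{word})$, we need $P(\text{word} \mid \text{positive})$, $P(\text{positive})$, and $P(\text{word})$, each of which comes from the counts in the table. Thus, $P(\text{positive} \mid \text{word}) = \frac{P(\text{word} \mid \text{positive})\,P(\text{positive})}{P(\text{word})}$. We will do the same to find $P(\text{negative} \mid \text{word})$. We see that the ratio $\frac{P(\text{positive} \mid \text{word})}{P(\text{negative} \mid \text{word})} > 1$, so we conclude that the word has an overall positive sentiment. We can do this for every word in a sentence and find the product of the per-word ratios. If this is greater than 1, then the sentence has positive sentiment; otherwise, it has negative sentiment.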
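To make the per-word ratio and the sentence-level product concrete, here is a small Python sketch. The frequency table below is invented for this example (it is not the table from the post), and `word_ratio` and `sentence_ratio` are hypothetical helper names.

```python
from fractions import Fraction

# ASSUMPTION: a made-up frequency table, word -> (positive count, negative count).
TABLE = {
    "happy": (6, 2),
    "sad": (1, 5),
    "because": (3, 3),
    "learning": (2, 2),
}

N_POS = sum(pos for pos, _ in TABLE.values())  # total positive count
N_NEG = sum(neg for _, neg in TABLE.values())  # total negative count
N_ALL = N_POS + N_NEG


def posterior(word, sentiment):
    """P(sentiment | word) = P(word | sentiment) * P(sentiment) / P(word)."""
    pos, neg = TABLE[word]
    count, class_total = (pos, N_POS) if sentiment == "positive" else (neg, N_NEG)
    p_word_given_class = Fraction(count, class_total)
    p_class = Fraction(class_total, N_ALL)
    p_word = Fraction(pos + neg, N_ALL)
    return p_word_given_class * p_class / p_word


def word_ratio(word):
    """Ratio above 1 suggests positive sentiment, below 1 negative."""
    return posterior(word, "positive") / posterior(word, "negative")


def sentence_ratio(words):
    """Product of the per-word ratios for every word in the sentence."""
    result = Fraction(1)
    for word in words:
        result *= word_ratio(word)
    return result


print(word_ratio("happy"))                               # 3
print(sentence_ratio(["happy", "because", "learning"]))  # 3
print(sentence_ratio(["sad", "because"]))                # 1/5
```

Neutral words like "because" have a ratio of 1 here, so they leave the product unchanged. A word with a zero count in either class would break this version.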
To account for imbalanced datasets, where you have more examples of one sentiment than of the other, we make a slight modification.
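The exact formula isn't shown here, so the following is an assumption: a common form of this modification in naive Bayes classifiers multiplies the product of per-word likelihood ratios by the ratio of the class priors, where $w_1, \dots, w_n$ are the words of the sentence.

```latex
\frac{P(\text{positive})}{P(\text{negative})}
\prod_{i=1}^{n} \frac{P(w_i \mid \text{positive})}{P(w_i \mid \text{negative})}
```

When the data set is perfectly balanced, the prior ratio is 1 and the score reduces to the plain product of likelihood ratios.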
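To avoid any division by zero, we use a technique called Laplacian smoothing, which follows the form $P(w \mid \text{positive}) = \frac{\text{freq}(w, \text{positive}) + 1}{N_{\text{positive}} + V}$, where $N_{\text{positive}}$ is the total Positive count and $V$ is the number of unique words. We add $V$ because we added 1 to the frequency for every unique word.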
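A minimal Python sketch of Laplacian smoothing, assuming a made-up frequency table where some words never appear in one of the classes. The helper name `smoothed_prob` is hypothetical.

```python
from fractions import Fraction

# ASSUMPTION: made-up counts, word -> (positive count, negative count).
# "happy" never appears in negative text and "sad" never in positive text.
TABLE = {
    "happy": (6, 0),
    "sad": (0, 5),
    "because": (3, 3),
}

V = len(TABLE)  # number of unique words
TOTALS = {
    "positive": sum(pos for pos, _ in TABLE.values()),
    "negative": sum(neg for _, neg in TABLE.values()),
}


def smoothed_prob(word, sentiment):
    """(freq + 1) / (N_class + V): every word gets one extra count."""
    pos, neg = TABLE[word]
    freq = pos if sentiment == "positive" else neg
    return Fraction(freq + 1, TOTALS[sentiment] + V)


print(smoothed_prob("sad", "positive"))    # 1/12 instead of 0
print(smoothed_prob("happy", "positive"))  # 7/12

# Because V was added to the denominator, each class still sums to 1.
print(sum(smoothed_prob(w, "positive") for w in TABLE))  # 1
```

With smoothing, no probability is ever zero, so the ratio between the positive and negative probabilities of a word is always defined.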
Another issue is that multiplying many small numbers together runs the risk of producing tiny values and underflowing. We can fix this by using a property of logs: $\log(ab) = \log a + \log b$, which turns the product into a sum.
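The underflow problem is easy to reproduce. In this sketch, the 400 probabilities of 0.001 are arbitrary numbers picked only to trigger the effect with Python floats.

```python
import math

# ASSUMPTION: 400 made-up probabilities of 0.001 each, chosen to force underflow.
probs = [1e-3] * 400

# Multiplying directly: the true value (1e-1200) is far below the smallest
# positive float, so the running product collapses to exactly 0.0.
product = 1.0
for p in probs:
    product *= p

# Summing logs instead: log(a * b) = log(a) + log(b), so nothing underflows.
log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # roughly -2763.1
```

The log of the product is still perfectly representable, which is why the comparison is done in log space.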
Now the bounds are $(-\infty, \infty)$, with a positive score indicating a text with positive sentiment and a negative score indicating a text with negative sentiment.
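Putting the pieces together, here is one way the log-space score could look in Python. This is a sketch under several assumptions: the word counts and document counts are made up, words missing from the table are simply skipped, and the score is taken to be the log of the prior ratio plus the sum of the logs of the smoothed per-word likelihood ratios.

```python
import math

# ASSUMPTION: made-up counts, word -> (positive count, negative count).
WORD_COUNTS = {
    "happy": (6, 0),
    "sad": (0, 5),
    "because": (3, 3),
}
# ASSUMPTION: hypothetical numbers of positive and negative training texts.
N_POS_DOCS = 4
N_NEG_DOCS = 4

N_POS = sum(pos for pos, _ in WORD_COUNTS.values())
N_NEG = sum(neg for _, neg in WORD_COUNTS.values())
V = len(WORD_COUNTS)


def log_ratio(word):
    """log of the smoothed ratio P(word | positive) / P(word | negative)."""
    pos, neg = WORD_COUNTS[word]
    p_pos = (pos + 1) / (N_POS + V)
    p_neg = (neg + 1) / (N_NEG + V)
    return math.log(p_pos / p_neg)


def sentiment_score(words):
    """Log prior ratio plus the sum of per-word log ratios.

    Above 0 suggests positive sentiment, below 0 negative.
    Words not in the table are skipped (an assumption of this sketch).
    """
    score = math.log(N_POS_DOCS / N_NEG_DOCS)
    for word in words:
        if word in WORD_COUNTS:
            score += log_ratio(word)
    return score


print(sentiment_score(["happy", "because"]) > 0)  # True
print(sentiment_score(["sad"]) < 0)               # True
```

The threshold moves from 1 to 0 because $\log 1 = 0$: a product greater than 1 becomes a sum greater than 0.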
After this example, you should be able to see that Bayes' Theorem can also be applied to author identification (what is the probability of this text's author being Shakespeare given this word?) and to spam detection (what is the probability of this email being spam given this word?). It is important to note that this is a purely probabilistic approach to analyzing text and does not take into account the semantics of a sentence; therefore, it will not work in certain scenarios. However, it is a great baseline against which to test the different natural language processing models that you create!
Hope you guys enjoyed this post, and I'll see y'all in the next one!