Interesting Applications of Bayes' Theorem

Author: Qi Wang


Introduction

Welcome back everyone to 2023! Bayes' Theorem is a simple probability statement that many people use as a baseline model for natural language processing tasks, to evaluate how well their more sophisticated models perform. In this post, I'll derive Bayes' Theorem and show how to use it in spam detection, author identification, and other interesting applications that stem from such an easy probability statement!

Probability Party!

Let's start by defining some notation.

We define $P(A)$ as the probability of an event $A$ occurring. We define $P(A \cap B)$ as the probability of both event $A$ and event $B$ occurring. Lastly, we define $P(A \mid B)$ as the probability of event $A$ occurring given that event $B$ has occurred.

Now we will derive the formula for Bayes' Theorem. To do so, we must figure out what $P(A \mid B)$ is. Let us imagine there is a jar with cube-shaped and spherical candies, and each candy can be either red or blue. For this example, there are 40 cube-shaped candies and 30 spherical candies, and 15 of the cube-shaped candies are red. If I were to ask you the probability of finding a red candy given that it's cube-shaped, you'd easily tell me $\frac{15}{40} = \frac{3}{8}$.

From this example, we get that $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$. Likewise, $P(B \mid A) = \frac{P(B \cap A)}{P(A)}$. Knowing that $P(A \cap B) = P(B \cap A)$, we can rearrange the equations to get $P(A \mid B) = P(B \mid A) \cdot \frac{P(A)}{P(B)}$. And... that is Bayes' Theorem.
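
To make this concrete, here is a tiny numerical check of the theorem using the candy jar. The 40 cubes, 30 spheres, and 15 red cubes come from the example above; the count of red spherical candies is a made-up number I added so that $P(red)$ is defined.

```python
# A quick numerical check of Bayes' Theorem using the candy jar. The 40
# cubes, 30 spheres, and 15 red cubes come from the example above; the 10
# red spherical candies are a made-up number so that P(red) is defined.
cubes, spheres = 40, 30
red_cubes, red_spheres = 15, 10          # red_spheres is an assumption

total = cubes + spheres
reds = red_cubes + red_spheres

p_cube = cubes / total                   # P(cube)
p_red = reds / total                     # P(red)
p_red_given_cube = red_cubes / cubes     # P(red | cube) = 15/40 = 3/8

# Bayes' Theorem: P(cube | red) = P(red | cube) * P(cube) / P(red)
p_cube_given_red = p_red_given_cube * p_cube / p_red

# Direct counting gives the same answer: red cubes out of all red candies.
assert abs(p_cube_given_red - red_cubes / reds) < 1e-12
print(p_red_given_cube, p_cube_given_red)   # 0.375 0.6
```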

So What Does This Mean?

Many times we want to know $P(B \mid A)$ but we are only given values that let us calculate $P(A \mid B)$. Take spam detection for example. We know the probability $P(word \mid spam)$ or $P(word \mid notspam)$, and we want to find the probability $P(spam \mid word)$ or $P(notspam \mid word)$.

Since we have labelled data telling us whether each piece of text is spam or not, we can easily find $P(word \mid spam)$ by counting the frequencies of each word inside the spam texts; $P(word \mid notspam)$ can be found the same way. As for $P(word)$ and $P(spam)$/$P(notspam)$, these are just their respective proportions in the data set.
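
Here is a minimal sketch of what that counting might look like; the four-message "dataset" is made up purely for illustration.

```python
from collections import Counter

# A minimal sketch of estimating these probabilities from labelled data.
# The four-message "dataset" below is made up purely for illustration.
labelled = [
    ("win money now",        "spam"),
    ("win a free prize now", "spam"),
    ("meeting at noon",      "notspam"),
    ("lunch at noon",        "notspam"),
]

word_counts = {"spam": Counter(), "notspam": Counter()}
total_words = {"spam": 0, "notspam": 0}
for text, label in labelled:
    words = text.split()
    word_counts[label].update(words)
    total_words[label] += len(words)

# P(word | spam) is just the word's relative frequency inside the spam texts.
p_now_given_spam = word_counts["spam"]["now"] / total_words["spam"]           # 2/8
p_now_given_notspam = word_counts["notspam"]["now"] / total_words["notspam"]  # 0/6

# P(spam) and P(notspam) are the class proportions in the data set.
p_spam = sum(1 for _, label in labelled if label == "spam") / len(labelled)   # 2/4
print(p_now_given_spam, p_now_given_notspam, p_spam)
```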

Using Bayes' Theorem

Now, I'll go over how this works using the example of text sentiment analysis. To start, I'll create a table with the frequencies of some words that appear in positive sentences and negative sentences.

| Word  | Positive | Negative |
|-------|----------|----------|
| I     | 3        | 3        |
| not   | 2        | 3        |
| sad   | 1        | 3        |
| happy | 3        | 1        |
| learn | 2        | 2        |

Note that this is just a made-up table; in your own projects, you would build it from your data.

To find $P(Positive \mid ``happy")$, we need $P(``happy" \mid Positive)$, which is $\frac{3}{11}$; $P(``happy")$, which is $\frac{4}{23}$; and $P(Positive)$, which is $\frac{11}{23}$. Thus, $P(Positive \mid ``happy") = \frac{3}{11} \cdot \frac{11}{23} \cdot \frac{23}{4} = \frac{3}{4}$. We do the same to find $P(Negative \mid ``happy") = \frac{1}{12} \cdot \frac{12}{23} \cdot \frac{23}{4} = \frac{1}{4}$. The ratio $\frac{P(Positive \mid ``happy")}{P(Negative \mid ``happy")} = 3 > 1$, so we conclude that the word has an overall positive sentiment. If we do this for every word in a sentence and compute $\prod_{i=1}^{m}\frac{P(Positive \mid word_i)}{P(Negative \mid word_i)}$, then a product greater than 1 means the sentence has positive sentiment; otherwise, it has negative sentiment.
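
Here is a short sketch of this ratio computation on the table above (the function name and the test sentences are my own, just for illustration). Notice that because $P(word)$ and the class proportions cancel out of the ratio, each word's factor reduces to its positive frequency over its negative frequency.

```python
# A sketch of the word-level ratio computation using the made-up table above.
pos_freq = {"I": 3, "not": 2, "sad": 1, "happy": 3, "learn": 2}
neg_freq = {"I": 3, "not": 3, "sad": 3, "happy": 1, "learn": 2}

def sentiment_ratio(sentence):
    # Product over words of P(Positive | word) / P(Negative | word).
    # P(word) and the class proportions cancel out of the ratio, so each
    # factor is simply the word's positive frequency over its negative one.
    ratio = 1.0
    for word in sentence.split():
        if word in pos_freq and neg_freq.get(word, 0) > 0:  # skip unseen words (no smoothing yet)
            ratio *= pos_freq[word] / neg_freq[word]
    return ratio

print(sentiment_ratio("I happy"))      # (3/3) * (3/1) = 3.0 -> positive
print(sentiment_ratio("I not happy"))  # (3/3) * (2/3) * (3/1) = 2.0 -> still scores positive
```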

Datasets are often imbalanced, with more sentences of one sentiment than the other, so we make a slight modification to account for this by multiplying in the ratio of the priors:

$$\frac{P(Positive)}{P(Negative)} \prod_{i=1}^{m}\frac{P(Positive \mid word_i)}{P(Negative \mid word_i)}$$

To avoid any divide-by-zero errors, we use a technique called Laplacian smoothing, which follows the form $P(word \mid Positive) = \frac{\texttt{freq}(word,\; Positive) + 1}{C_p + U_w}$, where $C_p$ is the total word count in the Positive class and $U_w$ is the number of unique words. We add $U_w$ to the denominator because we added 1 to the frequency of every unique word.
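
Here is what that smoothing looks like over the same made-up table (the helper name is mine):

```python
# A sketch of Laplacian smoothing over the same made-up table; the helper
# name p_word_given_positive is my own.
pos_freq = {"I": 3, "not": 2, "sad": 1, "happy": 3, "learn": 2}
neg_freq = {"I": 3, "not": 3, "sad": 3, "happy": 1, "learn": 2}

U_w = len(set(pos_freq) | set(neg_freq))   # 5 unique words
C_p = sum(pos_freq.values())               # 11 words in the Positive class

def p_word_given_positive(word):
    # (freq(word, Positive) + 1) / (C_p + U_w) -- never zero, even for
    # words that only appear in the other class or not at all.
    return (pos_freq.get(word, 0) + 1) / (C_p + U_w)

print(p_word_given_positive("happy"))   # (3 + 1) / (11 + 5) = 0.25
print(p_word_given_positive("great"))   # unseen word: 1 / 16 = 0.0625
```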

Another issue is that multiplying many small probabilities together risks underflowing to zero. We can fix this by using a property of logarithms: the log of a product is the sum of the logs.

$$\log\left(\frac{P(Positive)}{P(Negative)} \prod_{i=1}^{m}\frac{P(Positive \mid word_i)}{P(Negative \mid word_i)}\right)$$

Turns into:

$$\log\left(\frac{P(Positive)}{P(Negative)}\right)$$

added to

$$\sum_{i=1}^{m} \log\left(\frac{P(Positive \mid word_i)}{P(Negative \mid word_i)}\right)$$

Now the score ranges over $(-\infty, \infty)$, with a positive score indicating a text with positive sentiment and a negative score indicating a text with negative sentiment.
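
Putting the pieces together, here is a minimal log-space sketch built from the made-up table, with Laplacian smoothing and the prior term in front. Since the prior ratio is already factored out, the per-word terms are implemented as the smoothed likelihood ratios $P(word_i \mid Positive)/P(word_i \mid Negative)$; the function name and the test sentences are my own.

```python
import math

# A minimal log-space sketch of the full classifier, built from the made-up
# frequency table. Function names and test sentences are illustrative.
pos_freq = {"I": 3, "not": 2, "sad": 1, "happy": 3, "learn": 2}
neg_freq = {"I": 3, "not": 3, "sad": 3, "happy": 1, "learn": 2}
U_w = len(set(pos_freq) | set(neg_freq))               # number of unique words
C_p, C_n = sum(pos_freq.values()), sum(neg_freq.values())

def log_sentiment_score(sentence):
    # log(P(Positive)/P(Negative)) plus, for every word, the log of the
    # smoothed likelihood ratio P(word | Positive) / P(word | Negative).
    score = math.log(C_p / C_n)                        # prior ratio term
    for word in sentence.split():
        p_w_pos = (pos_freq.get(word, 0) + 1) / (C_p + U_w)   # smoothed
        p_w_neg = (neg_freq.get(word, 0) + 1) / (C_n + U_w)   # smoothed
        score += math.log(p_w_pos / p_w_neg)
    return score  # > 0 means positive sentiment, < 0 means negative

print(log_sentiment_score("I am so sad"))       # negative score
print(log_sentiment_score("happy to learn"))    # positive score
```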

Other Applications

After this example, you should be able to see that Bayes' Theorem can also be applied to author identification: what is the probability that this text's author is Shakespeare given this word, or what is the probability that this email is spam given this word. It is important to note that this is a purely probabilistic approach to analyzing text and does not take the semantics of a sentence into account; therefore, it will not work in certain scenarios. However, it is a great baseline model against which to test the different natural language processing models that you create!

Hope you guys enjoyed this post, and I'll see y'all in the next one!
