Term Frequency-Inverse Document Frequency

Term Frequency-Inverse Document Frequency (TF-IDF) is a measure of:

  • How many times a given term appears in a document - Term Frequency
  • How many of the other documents in the corpus contain the same term - Inverse Document Frequency

TF-IDF is used to quantify how relevant a given term is within a document relative to a corpus. TF-IDF is used in Information Retrieval (e.g. search engines) and Machine Learning.

Understanding the Math behind TF-IDF

Let's first understand the math used to determine TF, IDF and then TF-IDF.

Term Frequency

TF is a measure of how often a given term appears in a document, relative to the total number of terms in that document. The formula for TF is:

\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}

where:

  • f_{t,d} is the number of times a term t appears in a document d
  • \sum_{t' \in d} f_{t',d} is the total number of terms found in that document

A simpler way of writing this formula is:

TF = (number of times the term appears in the document) / (total number of terms in the document)
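
As a small illustration (my own sketch rather than part of the original post), the simpler form of TF can be written in Python; the term_frequency name and splitting on whitespace are assumptions:

```python
def term_frequency(term: str, document: list[str]) -> float:
    """Count of `term` in `document` divided by the total number of terms."""
    return document.count(term) / len(document)


# "the" appears once in the 4-term document "I am the one", so TF = 1/4
print(term_frequency("the", "I am the one".split()))  # 0.25
```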

Inverse Document Frequency

IDF is a measure of how common or rare a term is across the documents in the corpus: the fewer documents a term appears in, the higher its IDF. The formula for IDF is:

\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}

where:

  • N is the total number of documents in a corpus D
  • |\{d \in D : t \in d\}| is the number of documents in which the term t appears

A simpler way of writing this formula is:

IDF = log( (total number of documents in the corpus) / (number of documents containing the term) )
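
Again as a rough Python sketch (the function name is mine, and log base 10 is an assumption that matches the worked example below):

```python
import math


def inverse_document_frequency(term: str, corpus: list[list[str]]) -> float:
    """log10 of (total documents) / (documents containing `term`)."""
    containing = sum(1 for document in corpus if term in document)
    return math.log10(len(corpus) / containing)


corpus = [d.split() for d in ["I am the one", "Neo is the one", "He is the one"]]
print(inverse_document_frequency("the", corpus))  # 0.0    -> appears in all 3 documents
print(inverse_document_frequency("Neo", corpus))  # ~0.477 -> appears in 1 of 3 documents
```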

Term Frequency-Inverse Document Frequency

Using the TF and IDF quantities, we can now calculate the TF-IDF. The formula for TF-IDF is:

\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)

where:

  • TF is multiplied by IDF to get the TF-IDF score (see the sketch below)
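
Putting the two quantities together, a minimal sketch (same assumptions and hypothetical names as before) looks like this:

```python
import math


def tf(term: str, document: list[str]) -> float:
    return document.count(term) / len(document)


def idf(term: str, corpus: list[list[str]]) -> float:
    containing = sum(1 for document in corpus if term in document)
    return math.log10(len(corpus) / containing)


def tf_idf(term: str, document: list[str], corpus: list[list[str]]) -> float:
    # TF-IDF is simply the product of the two quantities.
    return tf(term, document) * idf(term, corpus)
```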

A Worked Example

Let's go through a worked example of how TF-IDF is calculated. In this example, there is a Corpus with 3 documents each containing some text.

  • Document 1 contains the text "I am the one"
  • Document 2 contains the text "Neo is the one"
  • Document 3 contains the text "He is the one"

Let's first list out how many times each term appears across the three documents.

Term    Frequency
I       1
am      1
the     3
one     3
Neo     1
is      2
He      1
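
These counts can be reproduced with a few lines of Python (my own sketch; terms are split on whitespace and kept case-sensitive, matching the tables below):

```python
from collections import Counter

corpus = ["I am the one", "Neo is the one", "He is the one"]

# Count how many times each term occurs across the three documents.
counts = Counter(term for document in corpus for term in document.split())
print(counts)
# Counter({'the': 3, 'one': 3, 'is': 2, 'I': 1, 'am': 1, 'Neo': 1, 'He': 1})
```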

Term Frequency

Now let's calculate the TF, i.e. the number of times a given term appears in a document divided by the total number of terms in that document (each document here has 4 terms).

Term    Document 1    Document 2    Document 3
I       1/4 = 0.25    0             0
am      1/4 = 0.25    0             0
the     1/4 = 0.25    1/4 = 0.25    1/4 = 0.25
one     1/4 = 0.25    1/4 = 0.25    1/4 = 0.25
Neo     0             1/4 = 0.25    0
is      0             1/4 = 0.25    1/4 = 0.25
He      0             0             1/4 = 0.25

Inverse Document Frequency

Now let's calculate the IDF, i.e. the log of the total number of documents divided by the number of documents in which each term appears (the value is shown only against the documents that contain the term).

Term    Document 1          Document 2          Document 3
I       log(3/1) ≈ 0.47     0                   0
am      log(3/1) ≈ 0.47     0                   0
the     log(3/3) = 0        log(3/3) = 0        log(3/3) = 0
one     log(3/3) = 0        log(3/3) = 0        log(3/3) = 0
Neo     0                   log(3/1) ≈ 0.47     0
is      0                   log(3/2) ≈ 0.17     log(3/2) ≈ 0.17
He      0                   0                   log(3/1) ≈ 0.47

Term Frequency-Inverse Document Frequency

Finally, let's calculate the TF-IDF, i.e. multiply the TF by the IDF to get the TF-IDF score for each term in each document.

Term    Document 1            Document 2            Document 3
I       0.25 × 0.47 ≈ 0.12    0                     0
am      0.25 × 0.47 ≈ 0.12    0                     0
the     0.25 × 0 = 0          0.25 × 0 = 0          0.25 × 0 = 0
one     0.25 × 0 = 0          0.25 × 0 = 0          0.25 × 0 = 0
Neo     0                     0.25 × 0.47 ≈ 0.12    0
is      0                     0.25 × 0.17 ≈ 0.04    0.25 × 0.17 ≈ 0.04
He      0                     0                     0.25 × 0.47 ≈ 0.12
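
The table above can be double-checked with a short end-to-end sketch (my own code, again assuming log base 10 and whitespace tokenisation; values are rounded to two decimal places):

```python
import math

documents = {
    "Document 1": "I am the one".split(),
    "Document 2": "Neo is the one".split(),
    "Document 3": "He is the one".split(),
}
corpus = list(documents.values())


def tf(term, document):
    return document.count(term) / len(document)


def idf(term, corpus):
    containing = sum(1 for document in corpus if term in document)
    return math.log10(len(corpus) / containing)


for term in ["I", "am", "the", "one", "Neo", "is", "He"]:
    row = {name: round(tf(term, document) * idf(term, corpus), 2)
           for name, document in documents.items()}
    print(term, row)
# e.g. I  {'Document 1': 0.12, 'Document 2': 0.0, 'Document 3': 0.0}
#      is {'Document 1': 0.0, 'Document 2': 0.04, 'Document 3': 0.04}
```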

The higher the TF-IDF score, the more relevant the term is to that document; as the term gets less relevant, the score approaches 0. In other words, if a term appears many times in a document, its importance score increases (TF). But if the same term also appears in many other documents, it is probably just a common term, so its importance score decreases (IDF). TF-IDF is therefore a balance between a term's frequency within a document and its rarity across the rest of the corpus.

The terms the and one appear in every document, so they are given a score of 0 and carry little relevance. The terms I, am, Neo, is and He appear in fewer documents, so they are given higher scores and are more relevant to the documents that contain them.
