Term Frequency-Inverse Document Frequency

Term Frequency-Inverse Document Frequency (TF-IDF) is a measure of:

  • How many times a given term appears in a document - Term Frequency
  • How many of the other documents in the corpus contain the same term - Inverse Document Frequency

TF-IDF is used to quantify how relevant a given term is within a document relative to a corpus. TF-IDF is used in Information Retrieval (e.g. search engines) and Machine Learning.

Understanding the Math behind TF-IDF

Let's first understand the math used to determine TF, IDF and then TF-IDF.

Term Frequency

TF is a measure of how often a given term appears in a document, relative to the total number of terms in that document. The formula for TF is:

\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}

where:

  • f_{t,d} is the number of times a term t appears in a document d
  • \sum_{t' \in d} f_{t',d} is the total number of terms found in that document

A simpler way of writing this formula is:

TF = (number of times the term appears in the document) / (total number of terms in the document)
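
As a small illustration (my own sketch rather than part of the original post), the simpler form of TF can be written in Python; the term_frequency name and splitting on whitespace are assumptions:

```python
def term_frequency(term: str, document: list[str]) -> float:
    """Count of `term` in `document` divided by the total number of terms."""
    return document.count(term) / len(document)


# "the" appears once in the 4-term document "I am the one", so TF = 1/4
print(term_frequency("the", "I am the one".split()))  # 0.25
```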

Inverse Document Frequency

IDF is a measure of how common or rare a term is across the documents in the corpus: the fewer documents a term appears in, the higher its IDF. The formula for IDF is:

\mathrm{idf}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|}

where:

  • N is the total number of documents in a corpus D
  • |\{d \in D : t \in d\}| is the number of documents in which the term t appears

A simpler way of writing this formula is:

IDF = log( (total number of documents in the corpus) / (number of documents containing the term) )
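
Again as a rough Python sketch (the function name is mine, and log base 10 is an assumption that matches the worked example below):

```python
import math


def inverse_document_frequency(term: str, corpus: list[list[str]]) -> float:
    """log10 of (total documents) / (documents containing `term`)."""
    containing = sum(1 for document in corpus if term in document)
    return math.log10(len(corpus) / containing)


corpus = [d.split() for d in ["I am the one", "Neo is the one", "He is the one"]]
print(inverse_document_frequency("the", corpus))  # 0.0    -> appears in all 3 documents
print(inverse_document_frequency("Neo", corpus))  # ~0.477 -> appears in 1 of 3 documents
```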

Term Frequency-Inverse Document Frequency

Using the TF and IDF quantities, we can now calculate the TF-IDF. The formula for TF-IDF is:

\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)

where:

  • TF is multiplied by IDF to get the TF-IDF score (see the sketch below)
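
Putting the two quantities together, a minimal sketch (same assumptions and hypothetical names as before) looks like this:

```python
import math


def tf(term: str, document: list[str]) -> float:
    return document.count(term) / len(document)


def idf(term: str, corpus: list[list[str]]) -> float:
    containing = sum(1 for document in corpus if term in document)
    return math.log10(len(corpus) / containing)


def tf_idf(term: str, document: list[str], corpus: list[list[str]]) -> float:
    # TF-IDF is simply the product of the two quantities.
    return tf(term, document) * idf(term, corpus)
```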

A Worked Example

Let's go through a worked example of how TF-IDF is calculated. In this example, there is a Corpus with 3 documents each containing some text.

  • Document 1 contains the text "I am the one"
  • Document 2 contains the text "Neo is the one"
  • Document 3 contains the text "He is the one"

Let's first list out how many times each term appears across the three documents.

Term    Frequency
I       1
am      1
the     3
one     3
Neo     1
is      2
He      1
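
These counts can be reproduced with a few lines of Python (my own sketch; terms are split on whitespace and kept case-sensitive, matching the tables below):

```python
from collections import Counter

corpus = ["I am the one", "Neo is the one", "He is the one"]

# Count how many times each term occurs across the three documents.
counts = Counter(term for document in corpus for term in document.split())
print(counts)
# Counter({'the': 3, 'one': 3, 'is': 2, 'I': 1, 'am': 1, 'Neo': 1, 'He': 1})
```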

Term Frequency

Now let's calculate the TF, i.e. the number of times a given term appears in a document divided by the total number of terms in that document (each document here has 4 terms).

Term    Document 1    Document 2    Document 3
I       1/4 = 0.25    0             0
am      1/4 = 0.25    0             0
the     1/4 = 0.25    1/4 = 0.25    1/4 = 0.25
one     1/4 = 0.25    1/4 = 0.25    1/4 = 0.25
Neo     0             1/4 = 0.25    0
is      0             1/4 = 0.25    1/4 = 0.25
He      0             0             1/4 = 0.25

Inverse Document Frequency

Now let's calculate the IDF, i.e. the log of the total number of documents divided by the number of documents in which each term appears (the value is shown only against the documents that contain the term).

Term    Document 1          Document 2          Document 3
I       log(3/1) ≈ 0.47     0                   0
am      log(3/1) ≈ 0.47     0                   0
the     log(3/3) = 0        log(3/3) = 0        log(3/3) = 0
one     log(3/3) = 0        log(3/3) = 0        log(3/3) = 0
Neo     0                   log(3/1) ≈ 0.47     0
is      0                   log(3/2) ≈ 0.17     log(3/2) ≈ 0.17
He      0                   0                   log(3/1) ≈ 0.47

Term Frequency-Inverse Document Frequency

Finally, let's calculate the TF-IDF, i.e. multiply the TF by the IDF to get the TF-IDF score for each term in each document.

Term    Document 1            Document 2            Document 3
I       0.25 × 0.47 ≈ 0.12    0                     0
am      0.25 × 0.47 ≈ 0.12    0                     0
the     0.25 × 0 = 0          0.25 × 0 = 0          0.25 × 0 = 0
one     0.25 × 0 = 0          0.25 × 0 = 0          0.25 × 0 = 0
Neo     0                     0.25 × 0.47 ≈ 0.12    0
is      0                     0.25 × 0.17 ≈ 0.04    0.25 × 0.17 ≈ 0.04
He      0                     0                     0.25 × 0.47 ≈ 0.12
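
The table above can be double-checked with a short end-to-end sketch (my own code, again assuming log base 10 and whitespace tokenisation; values are rounded to two decimal places):

```python
import math

documents = {
    "Document 1": "I am the one".split(),
    "Document 2": "Neo is the one".split(),
    "Document 3": "He is the one".split(),
}
corpus = list(documents.values())


def tf(term, document):
    return document.count(term) / len(document)


def idf(term, corpus):
    containing = sum(1 for document in corpus if term in document)
    return math.log10(len(corpus) / containing)


for term in ["I", "am", "the", "one", "Neo", "is", "He"]:
    row = {name: round(tf(term, document) * idf(term, corpus), 2)
           for name, document in documents.items()}
    print(term, row)
# e.g. I  {'Document 1': 0.12, 'Document 2': 0.0, 'Document 3': 0.0}
#      is {'Document 1': 0.0, 'Document 2': 0.04, 'Document 3': 0.04}
```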

The higher the TF-IDF score, the more relevant the term is to that document; as the term gets less relevant, the score approaches 0. In other words, if a term appears many times in a document, its importance score increases (TF). But if the same term also appears in many other documents, it is probably just a common term, so its importance score decreases (IDF). TF-IDF is therefore a balance between a term's frequency within a document and its rarity across the rest of the corpus.

The terms the and one appear in every document, so they are given a score of 0 and carry little relevance. The terms I, am, Neo, is and He appear in fewer documents, so they are given higher scores and are more relevant to the documents that contain them.
