What is TF-IDF?

TF-IDF (Term Frequency-Inverse Document Frequency) scores how relevant a document is to a term by multiplying how often the term appears in that document by how rare the term is across all documents. Common words get suppressed. Rare, focused terms get amplified.

How does it work?

The score for a single term in a single document is:

TF-IDF(term, doc) = TF(term, doc) × IDF(term)

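Term frequency (TF) measures how often the word appears in this document; in its simplest form it is the raw count. Inverse document frequency (IDF) measures how rare the word is across the entire corpus; a common definition is log(N / df), where N is the number of documents and df is the number of documents that contain the word.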
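As a rough illustration, the two factors can be sketched in a few lines of Python. This assumes raw counts for TF and the natural-log form ln(N / df) for IDF, which is one common variant among several; the function names and the four-document toy corpus are made up for the example.

```python
import math
from collections import Counter

def term_frequency(term, doc_tokens):
    """Raw count of `term` in one tokenized document."""
    return Counter(doc_tokens)[term]

def inverse_document_frequency(term, corpus):
    """ln(N / df): N documents in total, df of them contain `term` (assumed variant)."""
    n_docs = len(corpus)
    df = sum(1 for doc_tokens in corpus if term in doc_tokens)
    if df == 0:
        return 0.0  # assumption: a term absent from the corpus contributes nothing
    return math.log(n_docs / df)

def tf_idf(term, doc_tokens, corpus):
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus)

# Hypothetical toy corpus, tokenized by whitespace.
corpus = [
    "the pod runs the kubernetes deployment".split(),
    "the server logs the error".split(),
    "the team met on the weekend".split(),
    "the build failed on the server".split(),
]

# "the" is in every document, so its IDF is ln(4/4) = 0 and its score vanishes.
common_score = tf_idf("the", corpus[0], corpus)
# "kubernetes" is in one document out of four, so its IDF is ln(4).
rare_score = tf_idf("kubernetes", corpus[0], corpus)
```

Real implementations often smooth the IDF (for example by adding 1 inside or outside the logarithm) so that it never reaches exactly zero; the unsmoothed form is used here only because it is the easiest to read.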

For a multi-word query, you compute TF-IDF for each query term and sum them:

corpus: 825 source files
query: "kubernetes deployment"

Document A (a Kubernetes deployment guide):
  TF("kubernetes", A) = 8    × IDF("kubernetes") = 5.61  → 44.88
  TF("deployment", A) = 5    × IDF("deployment") = 3.20  → 16.00
  Total: 60.88

Document B (a general ops runbook):
  TF("kubernetes", B) = 1    × IDF("kubernetes") = 5.61  → 5.61
  TF("deployment", B) = 2    × IDF("deployment") = 3.20  → 6.40
  Total: 12.01

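Document A scores about five times higher (60.88 vs. 12.01). The IDF values are the same for both documents, so the gap comes entirely from term frequency: Document A uses the rare, relevant query terms far more often.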
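The arithmetic in this example can be reproduced with a short sketch. The term counts and IDF values are copied from the figures above; the `score` helper and the dictionary layout are illustrative choices, not part of any particular search library.

```python
# IDF values and term counts copied from the worked example.
idf = {"kubernetes": 5.61, "deployment": 3.20}
term_counts = {
    "A": {"kubernetes": 8, "deployment": 5},
    "B": {"kubernetes": 1, "deployment": 2},
}

def score(query, doc_counts, idf):
    """Sum TF x IDF over the query terms; a term missing from the document counts as 0."""
    return sum(doc_counts.get(term, 0) * idf.get(term, 0.0) for term in query.split())

query = "kubernetes deployment"
scores = {name: score(query, counts, idf) for name, counts in term_counts.items()}

# Rank documents from highest to lowest score.
ranked = sorted(scores, key=scores.get, reverse=True)
for name in ranked:
    print(f"Document {name}: {scores[name]:.2f}")
# Document A: 60.88
# Document B: 12.01
```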

Why TF-IDF works

The multiplication does something clever: it filters out noise from both ends.

  • A word that appears often in a document but also appears in every other document (like "the") gets a high TF but near-zero IDF. The product is tiny.
  • A rare word that appears only once in a document gets a high IDF but low TF. The product is moderate.
  • A rare word that appears many times in a document gets both high TF and high IDF. The product is large. That's the signal.
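The three cases can be checked numerically with a small sketch. It assumes IDF = ln(N / df) and reuses the 825-document corpus size from the earlier example; the term counts and document frequencies are hypothetical values picked to match each bullet.

```python
import math

N_DOCS = 825  # corpus size from the earlier example

def tf_idf(tf, df, n_docs=N_DOCS):
    """TF x ln(N / df), with tf as a raw count (assumed IDF variant)."""
    return tf * math.log(n_docs / df)

# Hypothetical counts: tf = occurrences in this document, df = documents containing the term.
cases = {
    "common word, many mentions": tf_idf(tf=40, df=825),  # ln(825/825) = 0, so the product is 0
    "rare word, one mention": tf_idf(tf=1, df=3),
    "rare word, many mentions": tf_idf(tf=8, df=3),
}

for label, value in cases.items():
    print(f"{label}: {value:.2f}")
```

Under these assumptions the common word scores zero no matter how often it appears, while the rare word's score grows in direct proportion to its count.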

How BM25 improves on TF-IDF

TF-IDF has two weaknesses that BM25 fixes:

  • No saturation: In raw TF-IDF, the score grows linearly with the count, so the 10th mention of a word contributes as much as the 1st. BM25 adds a saturation curve (controlled by a parameter usually called k1) so additional occurrences have diminishing returns.
  • No length normalization: A 5000-word document naturally contains more term occurrences than a 200-word document. Raw TF-IDF doesn't account for this, so long documents get an advantage regardless of how focused they are. BM25 introduces a document length normalization parameter (b) that adjusts scores based on how long the document is relative to the average.
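Both corrections show up in the usual textbook form of the BM25 per-term score, sketched below. The values k1 = 1.2 and b = 0.75 are commonly used defaults and are assumed here, as are the average document length of 1000 words and the sample counts. The IDF is passed in as a plain number, so the sketch does not commit to a particular IDF formula.

```python
def bm25_term_score(tf, idf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 score for one term in one document.

    k1 controls how quickly repeated mentions saturate;
    b controls how strongly document length is normalized (0 = off, 1 = full).
    """
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

avg_len = 1000  # hypothetical average document length, in words
idf = 5.61      # IDF("kubernetes") from the earlier example

# Saturation: the same average-length document with 1, 2, 10, and 100 mentions.
by_count = [bm25_term_score(tf, idf, avg_len, avg_len) for tf in (1, 2, 10, 100)]

# Length normalization: 5 mentions in a 200-word vs. a 5000-word document.
short_doc = bm25_term_score(5, idf, 200, avg_len)
long_doc = bm25_term_score(5, idf, 5000, avg_len)
```

With these settings the score keeps rising with the count but can never exceed idf × (k1 + 1), and five mentions in the 200-word document score higher than five mentions in the 5000-word one.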

TF-IDF was the standard for decades. BM25 is essentially TF-IDF with these two corrections, and it is now the default ranking function in widely used search engines such as Elasticsearch and Apache Solr, both built on Apache Lucene.