What is TF-IDF?
TF-IDF (Term Frequency-Inverse Document Frequency) scores how relevant a document is to a term by multiplying how often the term appears in the document by how rare that term is across all documents. Common words get suppressed. Rare, focused terms get amplified.
How does it work?
The score for a single term in a single document is:
TF-IDF(term, doc) = TF(term, doc) × IDF(term)
Term frequency measures how often the word appears in this document. IDF measures how rare the word is across the entire corpus.
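As a minimal sketch of the formula above, the per-term score could be computed as follows. It assumes raw counts for TF and the common ln(N / df) form of IDF; the text does not say which IDF variant it uses, and the toy corpus and function names are purely illustrative.

```python
import math

def term_frequency(term, doc_tokens):
    """Raw count of `term` in a tokenized document."""
    return doc_tokens.count(term)

def inverse_document_frequency(term, corpus):
    """ln(N / df): one common IDF variant, assumed here for illustration."""
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0  # term never appears, so it cannot contribute to any score
    return math.log(len(corpus) / df)

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF(term, doc) = TF(term, doc) x IDF(term)."""
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, corpus)

# Hypothetical toy corpus: each document is a list of lowercase tokens.
corpus = [
    "the kubernetes deployment uses a kubernetes manifest".split(),
    "the runbook covers the deployment checklist".split(),
    "the team updated the style guide".split(),
    "the release notes list the changes".split(),
]

score_the = tf_idf("the", corpus[0], corpus)         # in every document, so IDF is 0
score_k8s = tf_idf("kubernetes", corpus[0], corpus)  # in one document only
print(score_the, score_k8s)
```

In this toy corpus "the" occurs in every document, so its IDF and therefore its score is zero, while "kubernetes" occurs twice in the first document and nowhere else, so it scores 2 × ln(4).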
For a multi-word query, you compute TF-IDF for each query term and sum them:
corpus: 825 source files
query: "kubernetes deployment"
Document A (a Kubernetes deployment guide):
  TF("kubernetes", A) = 8, IDF("kubernetes") = 5.61 → 8 × 5.61 = 44.88
  TF("deployment", A) = 5, IDF("deployment") = 3.20 → 5 × 3.20 = 16.00
  Total: 60.88
Document B (a general ops runbook):
  TF("kubernetes", B) = 1, IDF("kubernetes") = 5.61 → 1 × 5.61 = 5.61
  TF("deployment", B) = 2, IDF("deployment") = 3.20 → 2 × 3.20 = 6.40
  Total: 12.01
Document A scores much higher because it focuses heavily on rare, relevant terms.
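The arithmetic in this example can be reproduced with a short sketch. The term counts and IDF values are copied from the example; the `score_query` helper and the dictionary layout are illustrative assumptions, not part of any particular library.

```python
# IDF values and term counts copied from the worked example above.
idf = {"kubernetes": 5.61, "deployment": 3.20}
term_counts = {
    "A": {"kubernetes": 8, "deployment": 5},
    "B": {"kubernetes": 1, "deployment": 2},
}

def score_query(query_terms, counts, idf):
    """Sum TF x IDF over the query terms; unknown terms contribute 0."""
    return sum(counts.get(term, 0) * idf.get(term, 0.0) for term in query_terms)

query = "kubernetes deployment".split()
scores = {
    doc: round(score_query(query, counts, idf), 2)
    for doc, counts in term_counts.items()
}
ranking = sorted(scores, key=scores.get, reverse=True)

print(scores)   # {'A': 60.88, 'B': 12.01}
print(ranking)  # ['A', 'B']
```

Rounding to two decimals only tidies floating-point noise; the ranking depends on the unrounded sums.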
Why TF-IDF works
The multiplication does something clever: it filters out noise from both ends.
- A word that appears often in a document but also appears in every other document (like "the") gets a high TF but near-zero IDF. The product is tiny.
- A rare word that appears only once in a document gets a high IDF but low TF. The product is moderate.
- A rare word that appears many times in a document gets both high TF and high IDF. The product is large. That's the signal.
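The three cases above can be put into numbers with a small sketch. The corpus size, document frequencies, and term counts below are hypothetical, and the ln(N / df) form of IDF is an assumption chosen for simplicity.

```python
import math

N = 1000  # hypothetical corpus size

def tf_idf(tf, df, n_docs=N):
    """Raw TF times ln(N / df); the IDF variant is an assumption."""
    return tf * math.log(n_docs / df)

cases = {
    "common word, frequent in doc": tf_idf(tf=40, df=1000),  # appears in every document
    "rare word, mentioned once": tf_idf(tf=1, df=5),
    "rare word, mentioned often": tf_idf(tf=8, df=5),
}

for label, score in cases.items():
    print(f"{label}: {score:.2f}")
```

With these made-up numbers the common word scores 0.00 despite 40 occurrences, the single rare mention scores about 5.30, and the repeated rare term scores about 42.39.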
How BM25 improves on TF-IDF
TF-IDF has two weaknesses that BM25 fixes:
- No saturation: In raw TF-IDF, the 10th mention of a word contributes as much as the 1st. BM25 adds a saturation curve so additional occurrences have diminishing returns.
- No length normalization: A 5000-word document naturally contains more term occurrences than a 200-word document. TF-IDF doesn't account for this. BM25 introduces a document length normalization parameter (b) that adjusts scores based on how long the document is relative to the average.
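Both corrections are visible in the standard Okapi BM25 term formula, sketched below. The saturation parameter `k1` is not named in the text above; it and the defaults `k1=1.2`, `b=0.75` are commonly used values assumed here, and the document lengths are hypothetical. For simplicity the sketch takes the IDF value as an input, whereas BM25 implementations typically compute their own IDF variant.

```python
def bm25_term_score(tf, idf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Contribution of one term to one document's BM25 score."""
    # b controls how strongly length relative to the average scales the score.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # The tf / (tf + constant) shape saturates: it can never exceed idf * (k1 + 1).
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

idf = 5.61  # IDF("kubernetes") from the earlier example

# Saturation: in an average-length document, each extra mention adds less.
saturation = [
    bm25_term_score(tf, idf, doc_len=1000, avg_doc_len=1000)
    for tf in (1, 2, 10)
]

# Length normalization: same term count, but the longer document scores lower.
short_doc = bm25_term_score(5, idf, doc_len=200, avg_doc_len=1000)
long_doc = bm25_term_score(5, idf, doc_len=5000, avg_doc_len=1000)

print(saturation)
print(short_doc, long_doc)
```

Setting `b=0` in this sketch turns length normalization off entirely, and larger `k1` values make the score saturate more slowly.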
TF-IDF was the standard for decades. BM25 is essentially TF-IDF with these two corrections, and it is now the default ranking function in widely used search engines such as Elasticsearch and Apache Solr.