What is Document Length Normalization?
Document length normalization adjusts relevance scores so longer documents don't unfairly dominate search results just because they contain more words. Without it, a 5000-word page that mentions "kubernetes" ten times would always outrank a focused 200-word page that mentions it three times.
Why is length a problem?
Longer documents naturally contain more words, which means more opportunities for any given term to appear. A raw term frequency count rewards length, not focus. Consider:
Document A (200 words): mentions "kubernetes" 3 times → TF = 3
Document B (5000 words): mentions "kubernetes" 10 times → TF = 10
By raw count, Document B wins. But Document A dedicates 1.5% of its words to "kubernetes" while Document B dedicates only 0.2%. Document A is clearly more focused on the topic.
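The density comparison above can be sketched in a few lines. The numbers come straight from the Document A / Document B example; `term_density` is a hypothetical helper name, not part of any library:

```python
def term_density(tf: int, doc_length: int) -> float:
    """Fraction of a document's words that are occurrences of the term."""
    return tf / doc_length

# Document A: 3 mentions in 200 words; Document B: 10 mentions in 5000 words.
doc_a = term_density(3, 200)    # 0.015 → 1.5%
doc_b = term_density(10, 5000)  # 0.002 → 0.2%

print(f"Document A density: {doc_a:.1%}")
print(f"Document B density: {doc_b:.1%}")
```

Raw term frequency favors B (10 > 3), but density favors A by more than 7×.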
How BM25 handles it
BM25 uses the b parameter to control how much document length affects scoring. The formula compares each document's length to the average document length in the corpus:
normalization factor = 1 - b + b × (doc_length / avg_doc_length)
- b = 1.0 (full normalization) — Long documents are heavily penalized. A document twice the average length has its term frequency contribution cut roughly in half.
- b = 0.0 (no normalization) — Document length is ignored entirely. Raw term counts determine relevance.
- b = 0.75 (the default) — A moderate penalty. Length matters, but doesn't dominate.
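The three settings above can be checked directly against the formula. This is a minimal sketch of just the normalization factor, not a full BM25 implementation:

```python
def length_norm(doc_length: float, avg_doc_length: float, b: float) -> float:
    """BM25 length-normalization factor: 1 - b + b * (dl / avgdl)."""
    return 1 - b + b * (doc_length / avg_doc_length)

# A document twice the average length (e.g., 200 words vs. a 100-word average):
print(length_norm(200, 100, b=1.0))   # 2.0  → TF contribution roughly halved
print(length_norm(200, 100, b=0.0))   # 1.0  → length ignored entirely
print(length_norm(200, 100, b=0.75))  # 1.75 → a moderate penalty
```

The factor multiplies the k1 term in BM25's denominator, so values above 1.0 penalize documents longer than average and values below 1.0 reward shorter ones.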
When to adjust b
The right value of b depends on your corpus:
- Homogeneous lengths (e.g., code files that are roughly the same size) — Lower b works. Length differences don't carry much signal.
- Mixed lengths (e.g., a mix of README files, blog posts, and API references) — Higher b prevents long documents from drowning out short, focused ones.
In practice, the default b = 0.75 works well across most domains. It's rarely worth tuning unless your ranking metrics show a specific length-related bias in results.
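To see how b changes the outcome of the earlier example, here is a sketch of BM25's term-frequency component (IDF omitted since there is only one query term). The corpus average length of 1000 words and k1 = 1.2 are assumptions for illustration, not values from the text:

```python
def bm25_tf(tf: int, doc_len: int, avg_len: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 term-frequency component with length normalization."""
    norm = 1 - b + b * (doc_len / avg_len)
    return tf * (k1 + 1) / (tf + k1 * norm)

# Document A: TF=3 in 200 words; Document B: TF=10 in 5000 words.
for b in (0.0, 0.75, 1.0):
    score_a = bm25_tf(3, 200, 1000, b=b)
    score_b = bm25_tf(10, 5000, 1000, b=b)
    print(f"b={b}: Doc A={score_a:.2f}, Doc B={score_b:.2f}")
```

With b = 0.0 the longer Document B wins on raw counts; at the default b = 0.75 the focused Document A pulls ahead, which matches the intuition from the density comparison.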
The broader principle
Length normalization is one of two key improvements that BM25 makes over TF-IDF. The other is term frequency saturation — the idea that the 10th mention of a word matters less than the 1st. Together, they turn a simple word-counting formula into a robust ranking algorithm.
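The saturation effect is easy to demonstrate in isolation. This sketch sets length normalization aside (document length equal to the average) and uses an assumed k1 = 1.2; the score approaches but never exceeds k1 + 1:

```python
def tf_saturation(tf: int, k1: float = 1.2) -> float:
    """BM25 TF saturation with length normalization neutralized (dl = avgdl)."""
    return tf * (k1 + 1) / (tf + k1)

# Marginal gain of each additional mention shrinks rapidly:
first_mention = tf_saturation(1) - tf_saturation(0)   # 1.00
tenth_mention = tf_saturation(10) - tf_saturation(9)  # ~0.02
print(f"1st mention adds {first_mention:.2f}, 10th adds {tenth_mention:.2f}")
```

The first mention moves the score by a full point; the tenth barely registers, which is exactly the "10th mention matters less than the 1st" behavior described above.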