What is document length normalization?

Document length normalization adjusts relevance scores so longer documents don't unfairly dominate search results just because they contain more words. Without it, a 5000-word page that mentions "kubernetes" ten times would always outrank a focused 200-word page that mentions it three times.

Why is length a problem?

Longer documents naturally contain more words, which means more opportunities for any given term to appear. A raw term frequency count rewards length, not focus. Consider:

Document A (200 words): mentions "kubernetes" 3 times → TF = 3
Document B (5000 words): mentions "kubernetes" 10 times → TF = 10

By raw count, Document B wins. But Document A dedicates 1.5% of its words to "kubernetes" while Document B dedicates only 0.2%. Document A is clearly more focused on the topic.
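The comparison above can be checked with a few lines of code. The lengths and term counts come straight from the example; the documents themselves are hypothetical.

```python
# Raw term frequency vs. relative frequency for the two example documents.
docs = {
    "A": {"length": 200, "tf": 3},     # focused 200-word page
    "B": {"length": 5000, "tf": 10},   # long 5000-word page
}

for name, d in docs.items():
    rel = d["tf"] / d["length"]  # fraction of the document devoted to the term
    print(f"Document {name}: raw TF = {d['tf']}, relative = {rel:.1%}")
# Document A: raw TF = 3, relative = 1.5%
# Document B: raw TF = 10, relative = 0.2%
```

Raw TF picks B; relative frequency picks A, which is exactly the mismatch length normalization is meant to correct.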

How BM25 handles it

BM25 uses the b parameter to control how much document length affects scoring. The formula compares each document's length to the average document length in the corpus:

normalization factor = 1 - b + b × (doc_length / avg_doc_length)
  • b = 1.0 (full normalization) — Long documents are heavily penalized. A document twice the average length has its term frequency contribution cut roughly in half.
  • b = 0.0 (no normalization) — Document length is ignored entirely. Raw term counts determine relevance.
  • b = 0.75 (the default) — A moderate penalty. Length matters, but doesn't dominate.
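A minimal sketch of the normalization factor, evaluated for a document twice the average length (the case described in the first bullet) under each of the three settings:

```python
def length_norm(doc_length: float, avg_doc_length: float, b: float) -> float:
    """BM25 length-normalization factor: 1 - b + b * (dl / avgdl)."""
    return 1 - b + b * (doc_length / avg_doc_length)

# A document twice the average length (2000 words vs. a 1000-word average):
for b in (1.0, 0.75, 0.0):
    print(f"b = {b}: factor = {length_norm(2000, 1000, b)}")
# b = 1.0: factor = 2.0   (full penalty: the factor doubles)
# b = 0.75: factor = 1.75 (moderate penalty)
# b = 0.0: factor = 1.0   (length ignored)
```

This factor sits in the denominator of the full BM25 term-frequency expression, so a larger factor shrinks a long document's score.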

When to adjust b

The right value of b depends on your corpus:

  • Homogeneous lengths (e.g., code files that are roughly the same size) — Lower b works. Length differences don't carry much signal.
  • Mixed lengths (e.g., a mix of README files, blog posts, and API references) — Higher b prevents long documents from drowning out short, focused ones.

In practice, the default b = 0.75 works well across most domains. It's rarely worth tuning unless your ranking metrics show a specific length-related bias in results.
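To see why the choice of b matters, here is a hedged sketch of the term-frequency component of BM25 applied to the two example documents. The average document length of 1000 words and k1 = 1.2 are assumed values for illustration, not from the text:

```python
def tf_component(tf: int, doc_length: int, avg_doc_length: float,
                 b: float = 0.75, k1: float = 1.2) -> float:
    """TF part of the BM25 score: saturated and length-normalized."""
    norm = 1 - b + b * (doc_length / avg_doc_length)
    return tf * (k1 + 1) / (tf + k1 * norm)

# Document A: 3 mentions in 200 words; Document B: 10 mentions in 5000 words.
for b in (0.0, 0.75):
    score_a = tf_component(3, 200, 1000, b=b)
    score_b = tf_component(10, 5000, 1000, b=b)
    winner = "A" if score_a > score_b else "B"
    print(f"b = {b}: A = {score_a:.2f}, B = {score_b:.2f}, winner = {winner}")
# b = 0.0: B wins on raw counts; b = 0.75: A wins on focus.
```

With b = 0 the long document wins on raw counts; with the default b = 0.75 the short, focused document comes out ahead.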

The broader principle

Length normalization is one of two key improvements that BM25 makes over TF-IDF. The other is term frequency saturation — the idea that the 10th mention of a word matters less than the 1st. Together, they turn a simple word-counting formula into a robust ranking algorithm.
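Term frequency saturation can be sketched in isolation (length normalization left out). The k1 = 1.2 default is a commonly used value, assumed here:

```python
def saturated_tf(tf: int, k1: float = 1.2) -> float:
    """Saturated TF term: grows with tf but levels off toward k1 + 1."""
    return tf * (k1 + 1) / (tf + k1)

for tf in (0, 1, 2, 3, 10):
    print(f"tf = {tf:2d}: contribution = {saturated_tf(tf):.3f}")

# The first mention contributes far more than the tenth:
first_gain = saturated_tf(1) - saturated_tf(0)
tenth_gain = saturated_tf(10) - saturated_tf(9)
print(f"gain from 1st mention: {first_gain:.3f}, from 10th: {tenth_gain:.3f}")
```

Each additional occurrence adds less than the one before, which is what keeps repetitive documents from running up the score.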