Reading the _explain tree: why did this product rank here?

14 June 2026 · By Antonio

Elasticsearch already knows exactly why a product ranked where it did, and it will tell you. I added an /explain endpoint and a studio tree view, then walked one real score all the way down to its BM25 leaves and found my tuning change sitting right where I put it.

The question I had was simple and a little embarrassing: why did this product end up in that spot? I had just spent a tuning session adding a phrase boost to my search ranking, the numbers said it helped, and yet when I looked at a real results page I still could not point at any single result and say with confidence why it landed where it did. I was guessing. The short answer to "why did it rank here?" is that Elasticsearch already knows, exactly, down to the decimal, and it will tell you if you ask. The interesting part is everything in between: the score is a little tree of math, and once you can read the tree, ranking stops being a black box.

This is the story of adding an /explain endpoint to my search service and a small visualizer in the studio, then walking one real example all the way down. The example is the query "task chair" against product 1864, which scored about 50.5. By the end you will know where every number in that score came from, and you will see my tuning change sitting right there in the tree where I put it.

What does _explain actually do?

Elasticsearch has an API called _explain. You hand it a query and one document id, and instead of running a search and ranking everything, it answers a much narrower question: if I scored this one document against this query, what score would it get, and how did I arrive at that number?

Think of it like asking a cashier for an itemized receipt instead of just the total. The total is the final _score. The receipt is the breakdown: this line came from the name field, that line came from the description, this term was rare so it counted for more, that field was long so it counted for less. Same total, but now you can see where every cent went.
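To make the request and the receipt a little more concrete, here is a minimal sketch of the raw REST call and the parts of the response this post relies on. The endpoint form GET /<index>/_explain/<id> is the documented one; the index name "products", the simple match query, and the helpers explainCall and totalScore are assumptions made up for illustration, not code from my service.

```typescript
// One node of the breakdown: a number, a label, and the nodes it was built from.
interface ExplainNode {
  value: number;
  description: string;
  details?: ExplainNode[];
}

// The parts of the _explain response used in this post.
interface ExplainResponse {
  matched: boolean;
  explanation?: ExplainNode;
}

// Hypothetical helper: describe the raw call, GET /<index>/_explain/<id>,
// with the query to score sent as the request body.
function explainCall(index: string, id: string, query: object) {
  return {
    method: "GET",
    path: `/${encodeURIComponent(index)}/_explain/${encodeURIComponent(id)}`,
    body: { query },
  };
}

// Hypothetical helper: the "total on the receipt" is the root node's value.
function totalScore(res: ExplainResponse): number {
  return res.matched && res.explanation ? res.explanation.value : 0;
}

// "products" and the match query are placeholders for illustration.
const call = explainCall("products", "1864", { match: { name: "task chair" } });
console.log(call.method, call.path, JSON.stringify(call.body));
```

The root node's value is the total; everything under details is the itemized part.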

In my search service the endpoint is tiny, and the important detail is in how it builds the query:

typescript

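// GET /explain?q=...&id=...&scope=... : break down one document's score.
app.get<{ Querystring: { q?: string; id?: string; scope?: string } }>(
  "/explain",
  async (req, reply) => {
    const query = req.query.q?.trim();
    const id = req.query.id?.trim();
    // Both the query text and the document id are required.
    if (!query || !id) {
      return reply.code(400).send({ error: "q and id are required" });
    }
    // scope=shop selects shopIndex; anything else uses the default index.
    const target = req.query.scope === "shop" ? shopIndex : index;
    // Same builder the /search endpoint uses, so this explains the production query.
    const body = buildSearchBody(query, 1, {});
    try {
      // _explain scores this single document instead of ranking the whole index.
      const res = await es.explain({ index: target, id, query: body.query });
      return { matched: res.matched, explanation: res.explanation };
    } catch {
      // Any failure from Elasticsearch is reported as a missing document.
      return reply.code(404).send({ error: "no such document in this index" });
    }
  },
);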

The line that matters is buildSearchBody(query, 1, {}). That is the exact same function the real /search endpoint calls to build its query. I am not explaining some simplified demo query. I am explaining the production ranking, the one shoppers actually hit, phrase boost and all. If I explained a different query than the one I serve, the receipt would be for a meal nobody ordered. So /explain reuses the real query builder and just asks Elasticsearch to break down the score for one document instead of ranking the whole index.
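The principle is easy to sketch in isolation: one builder, two callers. In the snippet below, buildQuery is a hypothetical stand-in for the real builder (which is richer), and the index name "products" and the page size are placeholders. The only point is that the search request and the explain request carry the identical query object.

```typescript
type Query = Record<string, unknown>;

// Hypothetical stand-in for the real query builder.
function buildQuery(q: string): Query {
  return { match: { name: { query: q } } };
}

// What the search path would send: rank the whole index.
function searchRequest(index: string, q: string) {
  return { index, size: 10, query: buildQuery(q) };
}

// What the explain path would send: the same query, pinned to one document id.
function explainRequest(index: string, id: string, q: string) {
  return { index, id, query: buildQuery(q) };
}

const searchReq = searchRequest("products", "task chair");
const explainReq = explainRequest("products", "1864", "task chair");

// Both requests serialize to the same query, so the receipt matches the meal.
const sameQuery = JSON.stringify(searchReq.query) === JSON.stringify(explainReq.query);
console.log(sameQuery);
```

If the two paths ever built their queries separately, a check like sameQuery is the first thing that would quietly stop holding.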

What is the production ranking it explains?

Before I read the tree, I need to know what query produced it. Here is the ranking the storefront ships, the thing buildSearchBody constructs by default:

typescript

export const DEFAULT_RANKING: RankingConfig = {
  fields: { name: 3, productClass: 2, description: 1 },
  fuzziness: "AUTO",
  fuzzyPrefixLength: 2,
  minimumShouldMatch: undefined,
  phraseBoost: 4,
  popularityWeight: 0,
};

Two things in there are mine, added during a tuning session, and both will show up in the explanation. The first is fuzzyPrefixLength: 2, which means the first two letters of a word have to match exactly before Elasticsearch will forgive a typo. (That one stops "light" from quietly matching "right".) The second, the star of this post, is phraseBoost: 4. When the words the shopper typed appear together, as a phrase, in the product name, that match gets a boost. In code, the phrase boost becomes a match_phrase clause with a little slop:

typescript

const should: estypes.QueryDslQueryContainer[] =
  ranking.phraseBoost > 0
    ? [{ match_phrase: { name: { query, slop: 2, boost: ranking.phraseBoost } } }]
    : [];

slop: 2 means the words can be up to two positions out of place and still count as a phrase. boost: 4 means that when they do line up, the clause contributes four times its raw weight. The rest of the ranking is a best_fields multi-match across name, product_class, and description, with the name field boosted highest. Keep those two pieces in mind: a phrase clause on the name, and a per-field best-of across three fields. Both are about to appear in the receipt.
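Here is one way those two pieces could fit together in a single query body. This is a sketch under assumptions, not the real buildSearchBody: the bool layout (multi-match in must, phrase clause in should), the RankingSketch type, and the literal field names are guesses for illustration. The multi_match options used (type, fields with ^ boosts, fuzziness, prefix_length) are standard query DSL.

```typescript
// Trimmed-down, hypothetical version of the ranking config.
interface RankingSketch {
  fields: Record<string, number>; // field name -> boost
  fuzziness: string;
  fuzzyPrefixLength: number;
  phraseBoost: number;
}

// Assumed layout: the multi-match decides whether a document matches at all,
// and the phrase clause only adds score on top when it also matches.
function buildQuerySketch(query: string, ranking: RankingSketch) {
  // Turn { name: 3 } into "name^3", the per-field boost syntax.
  const fields = Object.keys(ranking.fields).map(
    (name) => `${name}^${ranking.fields[name]}`,
  );
  const should =
    ranking.phraseBoost > 0
      ? [{ match_phrase: { name: { query, slop: 2, boost: ranking.phraseBoost } } }]
      : [];
  return {
    bool: {
      must: [
        {
          multi_match: {
            query,
            type: "best_fields",
            fields,
            fuzziness: ranking.fuzziness,
            prefix_length: ranking.fuzzyPrefixLength,
          },
        },
      ],
      should,
    },
  };
}

const sketchRanking: RankingSketch = {
  fields: { name: 3, product_class: 2, description: 1 },
  fuzziness: "AUTO",
  fuzzyPrefixLength: 2,
  phraseBoost: 4,
};
const sketchQuery = buildQuerySketch("task chair", sketchRanking);
console.log(JSON.stringify(sketchQuery, null, 2));
```

With this layout, setting phraseBoost to 0 removes the should clause entirely, which makes it easy to compare trees with and without the boost.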

Walking the tree for "task chair"

So I typed "task chair" into the studio visualizer, gave it product 1864, and hit Explain. The document matched, and the final score came back around 50.5. Then the tree unfolded.

At the very top, the root node says the score is a sum of its children. The biggest child, the one carrying most of that 50.5, was the phrase clause: something like weight(name:"task chair"~2 in ...). That ~2 is my slop: 2 showing up verbatim, and the size of its contribution is my boost: 4 doing exactly what I asked. This is the moment the post is really about. I made a change in a config file, claimed it helped, and here it is in the score breakdown, named, with a number attached. The phrase boost is not a theory anymore. It is the single largest line on the receipt for this product, because product 1864 happens to have "task chair" sitting right there in its name as a phrase.

Below the phrase clause sits the other big branch, the best_fields multi-match. Its description reads max of, and that word max is the whole personality of best_fields. The same query ran against three fields and produced three candidate scores: one from name, one from product_class, one from description. Instead of adding them up, best_fields takes the single best one and throws the rest away. For task chair against a chair, the name field wins that max, because the words are right there in the title, so the name score is what bubbles up into the total.

Why max and not sum? Because product names are short, self-contained phrases. If a product is genuinely a task chair, the name says so, and I want the best matching field to decide, not a product that mentions task once in the name and chair once in a long description racking up points across both. best_fields rewards the product that is clearly about the thing, not the product that scatters the words around.
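The difference between max and sum is easy to see with a few lines of arithmetic. The sketch below combines per-field scores the way a best_fields multi-match does: the best score, plus tie_breaker times the others, where tie_breaker defaults to 0. The function name and the scores are made up for illustration.

```typescript
// Combine per-field scores: best score plus tieBreaker times all other scores.
// tieBreaker = 0 is a pure max; tieBreaker = 1 behaves like a sum.
function combineBestFields(scores: number[], tieBreaker = 0): number {
  if (scores.length === 0) return 0;
  const best = Math.max(...scores);
  const total = scores.reduce((sum, s) => sum + s, 0);
  return best + tieBreaker * (total - best);
}

// Illustrative per-field scores in the order [name, product_class, description].
const focused = [6, 0, 0]; // both words in the name
const scattered = [3, 0, 3]; // one word in the name, one in the description

console.log(combineBestFields(focused), combineBestFields(scattered));
console.log(combineBestFields(focused, 1), combineBestFields(scattered, 1));
```

Under a pure max the focused product wins clearly; under a sum the two tie, which is exactly the scattering effect described above.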

What do the BM25 leaves mean?

Keep expanding and eventually you hit the bottom of the tree: the leaves, each one labelled with something like PerFieldSimilarity and BM25. BM25 is the formula Elasticsearch uses to score a single term against a single field, and it is built from three ideas. I had to look these up more than once before they stuck, so here they are in plain words.

Term frequency. How often does the term appear in this field? More is better, but with a catch: it saturates. The jump from zero mentions to one mention is huge. The jump from the ninth mention to the tenth is almost nothing. A product that says chair ten times is not ten times more about chairs than one that says it once, and BM25 has the curve baked in to reflect that. In the tree this shows up as a tf term with the raw frequency and the saturation math beside it.

Inverse document frequency. How rare is this term across the whole index? Rare terms carry more signal. In a furniture catalog, chair appears in thousands of products, so matching it tells you almost nothing. Task is rarer, so matching that actually narrows things down, and BM25 weights it higher. That is the idf line, and it is why a hit on the distinctive word in a query counts for more than a hit on the common one.

Field-length normalization. How long is the field this term sits in? A term in a short field counts for more than the same term buried in a long one. A three-word product name that is literally Ergonomic Task Chair is a stronger signal than the word chair appearing once inside a two-hundred-word description, because in the short field the term is clearly central, not incidental. In recent Elasticsearch versions this shows up inside the tf node as the field's length (dl) next to the average field length (avgdl): the further dl sits above avgdl, the more the score is pulled down.

Put those three together for one term in one field and you get one BM25 leaf. Stack the leaves up through the best_fields max and add the phrase boost on top, and you get the 50.5. Nothing in that number is magic. It is term frequency, inverse document frequency, and field length, combined the way the query told Elasticsearch to combine them.
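For readers who want the arithmetic, here is a minimal sketch of one BM25 leaf, following the formulas that recent Elasticsearch versions print in the explanation (boost * idf * tf). The defaults k1 = 1.2 and b = 0.75 are the standard ones; the catalog statistics in the example are invented. Two caveats: older versions fold an extra constant factor into the boost, and Lucene stores field lengths in a compressed form, so hand-computed numbers may not match a real tree to the last decimal.

```typescript
const K1 = 1.2; // how quickly repeated mentions stop adding score
const B = 0.75; // how strongly field length is taken into account

// Rarer terms (few docs containing the term, relative to all docs) score higher.
function idf(docsWithTerm: number, totalDocs: number): number {
  return Math.log(1 + (totalDocs - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
}

// Saturating term frequency, discounted when the field is longer than average.
function tf(freq: number, fieldLength: number, avgFieldLength: number): number {
  return freq / (freq + K1 * (1 - B + (B * fieldLength) / avgFieldLength));
}

// One leaf: a single term scored against a single field.
function bm25Leaf(
  boost: number,
  freq: number,
  fieldLength: number,
  avgFieldLength: number,
  docsWithTerm: number,
  totalDocs: number,
): number {
  return boost * idf(docsWithTerm, totalDocs) * tf(freq, fieldLength, avgFieldLength);
}

// Invented statistics: a 10,000-product catalog where "chair" is common
// and "task" is rare, scored against a three-word name field.
const chairLeaf = bm25Leaf(3, 1, 3, 5, 4000, 10000);
const taskLeaf = bm25Leaf(3, 1, 3, 5, 150, 10000);
console.log({ chairLeaf, taskLeaf });
```

Everything else being equal, the rare term's leaf comes out larger than the common term's, and that gap comes entirely from idf.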

The studio renderer just walks that structure recursively. Every node prints its own value and its description, then recurses into its children, with the deep BM25 leaves collapsed by default so the top of the tree stays readable:

tsx

function ExplainTree({ node, depth }: { node: ExplainNode; depth: number }) {
  const children = node.details ?? [];
  const open = depth < 2;
  return (
    <div style={{ marginLeft: depth === 0 ? 0 : 14 }} className="border-l border-line pl-3">
      <div className="flex gap-2 py-0.5 text-[13px]">
        <span className="tnum w-20 shrink-0 text-right font-semibold">{node.value.toFixed(4)}</span>
        <span className="text-ink-muted">{node.description}</span>
      </div>
      {children.length > 0 && (
        <details open={open}>
          <summary className="cursor-pointer py-0.5 pl-6 text-[12px] text-pine">
            {children.length} contributing factor{children.length === 1 ? "" : "s"}
          </summary>
          {children.map((child, i) => (
            <ExplainTree key={i} node={child} depth={depth + 1} />
          ))}
        </details>
      )}
    </div>
  );
}

The shape of Elasticsearch's explanation is already a tree (every node has a value, a description, and an optional details array of children), so the renderer is almost a direct mirror of it. The data does the explaining. The component just indents it.
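The same recursion works outside React, which is handy for logs or a terminal. Below is a minimal sketch of a plain-text version: toReceipt is a hypothetical helper, the maxDepth cutoff plays the role of the collapsed details, and the sample tree uses invented numbers and simplified labels.

```typescript
interface ExplainNode {
  value: number;
  description: string;
  details?: ExplainNode[];
}

// Hypothetical helper: one indented text line per node, depth first.
// Nodes deeper than maxDepth are left out, like collapsed details.
function toReceipt(node: ExplainNode, depth = 0, maxDepth = Infinity): string[] {
  const indent = new Array(depth + 1).join("  "); // two spaces per level
  const lines = [`${indent}${node.value.toFixed(4)}  ${node.description}`];
  if (depth >= maxDepth) return lines;
  for (const child of node.details ?? []) {
    for (const line of toReceipt(child, depth + 1, maxDepth)) {
      lines.push(line);
    }
  }
  return lines;
}

// Invented numbers; a real tree is deeper and the values differ.
const sampleTree: ExplainNode = {
  value: 5,
  description: "sum of:",
  details: [
    { value: 3, description: "phrase clause on name" },
    {
      value: 2,
      description: "max of:",
      details: [
        { value: 2, description: "name" },
        { value: 1, description: "description" },
      ],
    },
  ],
};

console.log(toReceipt(sampleTree).join("\n"));
```

Passing a small maxDepth gives the same effect as leaving the deep BM25 leaves collapsed.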

Things that surprised me

A few things did not match what I expected before I started reading these trees:

The boost is not the contribution. I set boost: 4 on the phrase clause and assumed I would see a 4 somewhere. The 4 is a multiplier folded into the weight calculation, so what you read in the tree is the result after BM25 and the boost combine, not the boost itself. You see the effect, not the knob.
best_fields literally says max of. I knew on paper that best_fields takes the maximum across fields, but seeing the word max in the node description, with the losing fields' scores sitting right there unused, made it concrete in a way the docs never did.
The common word in a two-word query barely matters. For task chair, almost all the discriminating power comes from task. Chair is everywhere in a furniture index, so its inverse document frequency is tiny. The query feels like it is about chairs, but the ranking is mostly about task.
The ~2 in the phrase node is my slop, verbatim. I did not expect my config values to show up character for character in Elasticsearch's own description of the clause. Seeing name:"task chair"~2 is how I confirmed, with zero doubt, that the production query and the explained query are the same query.
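That last observation can be turned into an automated check. The sketch below walks an explanation tree and pulls the field, phrase, and slop out of any description shaped like weight(name:"task chair"~2 in ...). The helper names, the regular expression, and the sample tree (including its numbers and the document number after "in") are assumptions for illustration; the exact description text may vary between Elasticsearch versions, so treat the pattern as a starting point.

```typescript
interface ExplainNode {
  value: number;
  description: string;
  details?: ExplainNode[];
}

interface PhraseInfo {
  field: string;
  phrase: string;
  slop: number;
}

// Parse descriptions shaped like: weight(name:"task chair"~2 in ...)
function parsePhraseNode(description: string): PhraseInfo | null {
  const m = /weight\((\w+):"([^"]+)"~(\d+)/.exec(description);
  return m ? { field: m[1], phrase: m[2], slop: Number(m[3]) } : null;
}

// Depth-first search for the first phrase clause anywhere in the tree.
function findPhrase(node: ExplainNode): PhraseInfo | null {
  const here = parsePhraseNode(node.description);
  if (here) return here;
  for (const child of node.details ?? []) {
    const found = findPhrase(child);
    if (found) return found;
  }
  return null;
}

// Invented tree; only the description format follows the example above.
const explained: ExplainNode = {
  value: 5,
  description: "sum of:",
  details: [
    { value: 3, description: 'weight(name:"task chair"~2 in 0) [PerFieldSimilarity], result of:' },
    { value: 2, description: "max of:" },
  ],
};

console.log(findPhrase(explained));
```

A test that asserts the slop found in the tree equals the slop in the ranking config would catch the explained query drifting away from the served one.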

When this is worth caring about

Here is what I would tell someone about to add an explain view to their own search: do it the moment you start tuning. Not before. While you are just getting search to return anything, the explain tree is noise. But the first time you change a ranking config and the metrics move, you will want to know why they moved, and "the nDCG went up" is not an answer you can act on. "The phrase boost is now the top contributor for head queries, and here is the tree" is.

The real payoff is verification. I added a phrase boost and claimed it made the typed words count for more when they appear together. I did not have to take my own word for it. I opened the tree for "task chair" against product 1864, and the phrase clause was sitting at the top of the receipt, carrying most of the score, with my slop value printed in its name. That is the difference between believing a tuning change worked and being able to point at the exact line where it did. _explain is what turns "I think the ranking does X" into "the ranking does X, and here is the math."