How do search filters and the relevance score work together?

13 June 2026, by Antonio

Search, Full Stack, Web App, Next.js

When you type a word and tick a category, those are two different jobs: one ranks results, the other just lets things in or keeps them out. Here is how Elasticsearch keeps them separate, and why mixing them up quietly wrecks your ranking.

Here's a question I had while building search for a furniture shop: when someone types sofa and then ticks the Sofas category checkbox, are those the same kind of instruction? They feel similar. Both narrow things down. But they're really not the same, and if you treat them the same your search results come back in a slightly wrong order that no error message will ever warn you about. The short answer is that the typed word is about relevance (how good a match is this?) and the ticked category is about membership (is this in the set or not?). Elasticsearch has two different machines for those two jobs, and the whole trick is putting each thing in the right machine. The interesting part is what goes wrong when you don't.

What does it mean to score a result?

When you search for sofa, Elasticsearch gives every matching product a number called the score (it uses an algorithm named BM25 to work it out). A product with sofa in its name gets a higher number than one that only says sofa deep in its description, and the higher number sorts to the top. That's exactly what you want: the score is the engine's opinion about how good each match is.

Now here's the catch. In Elasticsearch you build a query out of clauses, and the obvious thing to do with a category filter is to drop it into the same bucket as the typed word, a bucket called must. It reads like plain English: the typed word must match, and the category must match. The results come back correct, so it's tempting to ship it and move on. The problem is that everything inside must adds to the score. So a category match on beds hands the product some bonus points, and how many points depends on a quirk of BM25: the rarer that category word is in the index, the bigger the bonus (this is the inverse document frequency part of the formula). And because Elasticsearch computes that rarity shard by shard by default, the bonus isn't even guaranteed to be the same for every bed, so two beds you'd consider equally good answers can end up ranked apart purely because the category test gave one of them a few extra points. A yes-or-no question (am I looking at beds?) has leaked into the ranking and started reshuffling your results.
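To put a number on "how rare", here is a minimal sketch of the inverse document frequency (IDF) term in the form Lucene's BM25 similarity uses. It covers only the IDF part; the full score also has a term-frequency factor and any boosts. The catalogue size and category counts are made-up figures for illustration.

```typescript
// IDF term of BM25 as Lucene computes it.
// docCount: documents that have the field; docFreq: documents containing the term.
function bm25Idf(docCount: number, docFreq: number): number {
  return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
}

// Hypothetical catalogue of 10,000 products (illustrative numbers only).
const catalogueSize = 10000;
const commonCategoryIdf = bm25Idf(catalogueSize, 4000); // a big category
const rareCategoryIdf = bm25Idf(catalogueSize, 50); // a niche category

// The niche category earns the larger bonus when it is scored in must.
console.log({ commonCategoryIdf, rareCategoryIdf });
```

The bonus grows as the category gets rarer, so it shifts whenever the catalogue's category mix changes, even though the shopper's question stayed the same.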

What does it mean to filter instead?

Elasticsearch has a second bucket called filter, and a clause that lives there doesn't score anything. It just answers yes or no for each document: in the set, or out. Think of it like a bouncer at a door rather than a judge holding up a number. The bouncer doesn't rank you, they just check whether you're on the list.
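As a rough sketch, the two placements look like this side by side. The field names name and category, and the use of a term clause on a keyword-mapped category field, are assumptions for illustration; the local types only stand in for the client library's.

```typescript
// Minimal local types so the sketch needs no client library.
type Clause = Record<string, unknown>;
interface BoolQuery {
  bool: { must: Clause[]; filter?: Clause[] };
}

const typedWord: Clause = { match: { name: "oak" } };
const categoryIsBeds: Clause = { term: { category: "beds" } };

// Anti-pattern: the category sits in must, so it contributes to the score.
const scoredCategory: BoolQuery = {
  bool: { must: [typedWord, categoryIsBeds] },
};

// Filter context: the same clause only decides membership.
const filteredCategory: BoolQuery = {
  bool: { must: [typedWord], filter: [categoryIsBeds] },
};

console.log(JSON.stringify({ scoredCategory, filteredCategory }, null, 2));
```

Both shapes return the same set of documents. Only the second keeps the category out of the number that sorts them.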

There's a nice bonus to being a bouncer instead of a judge. Because the answer is just yes or no, Elasticsearch can cache it as a bitset, which is basically one bit per document saying match or no-match. category = beds is the same set of documents whether the shopper typed oak or headboard, so the engine can reuse that bitset across queries. Put the same condition in must and you've asked it to score the thing, so it can't reuse the cached answer and has to recompute every time. So getting this wrong costs you twice: messed-up ranking, and a cache you threw away.
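The reuse idea can be shown with a toy model. This is not how Lucene implements its cache (the real one works per segment and decides for itself which filters are worth caching); it only illustrates why a yes-or-no answer can be shared between requests while a score cannot. All names and products here are hypothetical.

```typescript
interface Product {
  name: string;
  category: string;
}

const products: Product[] = [
  { name: "oak sofa", category: "sofas" },
  { name: "oak corner sofa", category: "sofas" },
  { name: "oak bed with headboard", category: "beds" },
];

// One entry per document: 1 = in the set, 0 = out.
const bitsetCache = new Map<string, Uint8Array>();
let bitsetBuilds = 0;

function categoryBitset(category: string): Uint8Array {
  const cached = bitsetCache.get(category);
  if (cached) return cached;
  bitsetBuilds += 1; // counts how often the filter is actually evaluated
  const bits = Uint8Array.from(products, (p) => (p.category === category ? 1 : 0));
  bitsetCache.set(category, bits);
  return bits;
}

// The typed word varies per request; the category bitset does not.
function search(word: string, category: string): string[] {
  const bits = categoryBitset(category);
  return products
    .filter((p, i) => bits[i] === 1 && p.name.includes(word))
    .map((p) => p.name);
}

const first = search("oak", "beds");
const second = search("headboard", "beds");
console.log(first, second, bitsetBuilds);
```

Two different typed words hit the same category, and the membership answer is built once and reused for the second request.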

So what does the code actually look like?

The rule I follow is one question asked of every clause before it goes anywhere: is this about relevance, or about membership? The typed query is about relevance, so it goes in must and drives the score. The facet selections are about membership, so they go in filter and stay out of the score entirely. Here's the function that builds the clause:

typescript

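import type { estypes } from "@elastic/elasticsearch";

// SearchFilters and queryFilterClauses come from elsewhere in the project (not shown here).
function buildQueryClause(
  query: string,
  filters?: SearchFilters,
): estypes.QueryDslQueryContainer {
  // Relevance: the typed words, scored across three boosted fields.
  const multiMatch: estypes.QueryDslQueryContainer = {
    multi_match: {
      query,
      type: "best_fields",
      fields: ["name^3", "product_class^2", "description"],
      fuzziness: "AUTO",
    },
  };
  // Membership: facet selections become non-scoring clauses.
  const clauses = queryFilterClauses(filters);
  // No facets ticked: the scored query is the whole clause.
  if (clauses.length === 0) return multiMatch;
  // Otherwise scoring stays in must and membership goes in filter.
  return { bool: { must: [multiMatch], filter: clauses } };
}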

With no filters, the clause is just the multi_match. Those little ^3 and ^2 numbers are boosts: a match in name has its score multiplied by three and a match in product_class by two, while description is left as it is, because the name is where the typed word means the most. The moment a filter shows up, the clause becomes a bool with the multi_match alone in must and the filters tucked into filter. That one structural choice is the whole thing. The score still comes only from the typed words, and the filters narrow the set without ever adding a point.
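The helper queryFilterClauses isn't shown in this post, so here is one way it might look. Everything in this sketch is an assumption: the shape of SearchFilters, the field names category and price, and the choice of term and range clauses.

```typescript
type Clause = Record<string, unknown>;

// Hypothetical filter shape; the real SearchFilters is not shown in the post.
interface SearchFilters {
  category?: string;
  minPrice?: number;
  maxPrice?: number;
}

// Turns facet selections into yes-or-no clauses for the filter bucket.
function queryFilterClauses(filters?: SearchFilters): Clause[] {
  if (!filters) return [];
  const clauses: Clause[] = [];

  if (filters.category !== undefined) {
    // Exact match on an (assumed) keyword field.
    clauses.push({ term: { category: filters.category } });
  }

  if (filters.minPrice !== undefined || filters.maxPrice !== undefined) {
    // Only include the bounds the shopper actually set.
    const range: { gte?: number; lte?: number } = {};
    if (filters.minPrice !== undefined) range.gte = filters.minPrice;
    if (filters.maxPrice !== undefined) range.lte = filters.maxPrice;
    clauses.push({ range: { price: range } });
  }

  return clauses;
}

console.log(JSON.stringify(queryFilterClauses({ category: "beds", maxPrice: 500 })));
```

Returning an empty array when nothing is ticked is what lets the caller fall back to the bare multi_match.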

There's actually a third home I haven't mentioned, post_filter, which I use for multi-select facets like colour. If you tick white, you still want to see black sitting there with its count so you can add it. Elasticsearch applies post_filter after it has calculated the aggregations that produce those counts, so putting colour there narrows the products you see without erasing the other colour options. But that's a detail on top of the same idea: whether a filter lives in filter or in post_filter, it never touches the score.
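A minimal sketch of a search body that uses all three homes might look like the following. The colour field, the colours aggregation name, and the buildSearchBody helper are hypothetical; the local types stand in for the client library's.

```typescript
type Clause = Record<string, unknown>;

interface SearchBody {
  query: { bool: { must: Clause[]; filter: Clause[] } };
  aggs: Record<string, Clause>;
  post_filter?: Clause;
}

function buildSearchBody(
  query: string,
  category: string | undefined,
  colours: string[],
): SearchBody {
  const body: SearchBody = {
    query: {
      bool: {
        // Relevance: the typed words drive the score.
        must: [{ multi_match: { query, fields: ["name^3", "product_class^2", "description"] } }],
        // Membership: the category narrows both the hits and the facet counts.
        filter: category !== undefined ? [{ term: { category } }] : [],
      },
    },
    // Colour counts are computed over everything the query above matched.
    aggs: { colours: { terms: { field: "colour" } } },
  };
  if (colours.length > 0) {
    // Applied to the hits only, after the aggregations have been calculated.
    body.post_filter = { terms: { colour: colours } };
  }
  return body;
}

console.log(JSON.stringify(buildSearchBody("oak", "beds", ["white"]), null, 2));
```

With white ticked, the hits are white oak beds, but the colours aggregation still counts every colour of oak bed, so black stays visible as an option.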

How do you test for something you can't see?

This is the part that bugged me. The relevance-pollution bug is invisible. Nothing throws, every result belongs to the right category, and the order is just quietly wrong. So I wrote a test whose only job is to fail if someone ever moves the category back into must. It runs the same query twice, once without the filter and once with it, and asserts that a product's score is exactly identical both times:

typescript

it("does not change relevance scores (filter context, not query context)", async () => {
  // Same typed word, first without the category facet...
  const all = SearchResponseSchema.parse(
    (await app.inject({ url: "/search?q=oak&scope=shop" })).json(),
  );
  // ...then with it.
  const filtered = SearchResponseSchema.parse(
    (await app.inject({ url: "/search?q=oak&scope=shop&category=beds" })).json(),
  );

  // Pick the bed out of each response.
  const bedScoreUnfiltered = all.hits.find((h) => h.category === "beds")?.score;
  const bedScoreFiltered = filtered.hits.find((h) => h.category === "beds")?.score;

  // Guard against both lookups being undefined, then require exact equality.
  expect(bedScoreFiltered).toBeDefined();
  expect(bedScoreFiltered).toBe(bedScoreUnfiltered);
});

This runs against a real Elasticsearch in a container, not a mock, with hand-counted fixtures: two sofas and a bed, all sharing the word oak so the query matches all three. If the category were being scored, the bed's number would move when I add category=beds. The toBe is exact equality on purpose. The test can't pass if membership has leaked into the ranking, which is the one thing I want it to catch.
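With larger fixtures, the same check can be widened from one bed to every hit the two responses share. This is a sketch under assumptions: the id field on each hit and the scoreDrift helper are hypothetical, since the test above only shows category and score.

```typescript
interface Hit {
  id: string; // assumed unique product identifier
  category: string;
  score: number;
}

// Returns the ids of filtered hits whose score differs from the unfiltered run.
// A filtered hit that is missing from the unfiltered list is also reported.
function scoreDrift(unfiltered: Hit[], filtered: Hit[]): string[] {
  const baseline = new Map(unfiltered.map((h) => [h.id, h.score] as const));
  return filtered
    .filter((h) => baseline.get(h.id) !== h.score)
    .map((h) => h.id);
}

// Made-up scores, only to show the helper's behaviour.
const unfilteredHits: Hit[] = [
  { id: "sofa-1", category: "sofas", score: 1.25 },
  { id: "sofa-2", category: "sofas", score: 1.0 },
  { id: "bed-1", category: "beds", score: 0.75 },
];
const cleanRun: Hit[] = [{ id: "bed-1", category: "beds", score: 0.75 }];
const pollutedRun: Hit[] = [{ id: "bed-1", category: "beds", score: 1.5 }];

console.log(scoreDrift(unfilteredHits, cleanRun), scoreDrift(unfilteredHits, pollutedRun));
```

In a test, asserting that scoreDrift returns an empty array covers every filtered hit at once, and a failure names the products whose scores moved.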

Things that surprised me

Wrong results vs wrong order. I expected a bug like this to show up as missing or extra products. It doesn't. The right products come back, just in a subtly worse order, which is much harder to notice.
Rarity is what bites you. The bonus points a scored category adds aren't constant. They depend on how rare that category word is, so the size of the damage shifts as your catalogue changes.
The caching win was a free side effect. I moved filters out of the score for ranking reasons, then realised filter context also gets cached as a reusable bitset. Correct and faster from the same decision.
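Equality on a float felt scary, but it's the point. Asserting one score equals another exactly looks fragile. But both runs score the same document with the same typed word against the same index, so the number only moves if a filter clause has contributed to it. Here exact equality is the only assertion that proves the filter never touched the number.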

When is this worth caring about?

If you're building search with a single text box and nothing else, you can skip all of this. The split between scoring and filtering only starts to matter the moment you add facets: category checkboxes, price bands, colour swatches, anything that's really a yes-or-no membership question rather than a how-good-is-this question. The good news is that once the question clicks, every new facet has an obvious home. The day someone adds a brand facet or a rating filter, you already know where it goes, and a small exact-equality test makes sure it stays there through the next refactor. That's the bit I'd tell anyone starting faceted search on Elasticsearch: ask of every filter whether it's about relevance or membership, and let that one question decide where it lives.