Making a hardcoded query tunable without changing what it does

By Antonio
Coding, Web Development, Full Stack, Next.js

My Elasticsearch ranking was hardcoded in one function, so I could not measure a change without editing code. I turned it into a config object, with one rule: the default reproduces the old query byte for byte, proven by a deep-equality test. That safety is what unlocked tuning.

The question I kept circling was simple to say and annoying to act on: how do I tune my search ranking when the ranking lives baked inside a function? On NORDHEM, my Nordic home goods storefront, the Elasticsearch query was a single hardcoded thing. The field boosts were fixed. The fuzziness was fixed. If I wanted to ask whether a higher boost on the product name actually helps, I had no way to ask it cleanly. I would have to edit the function, run the whole thing, eyeball some results, and edit it back.

The short answer is that I turned the query into a config object. A plain value with the boosts and the fuzziness and a few other knobs on it, fed into a builder that produces the Elasticsearch query. The interesting part, the bit worth writing about, is the discipline I put around the change: the default config had to reproduce the old query exactly, proven by a test, so the refactor changed nothing on its own. That safety is the whole point. It is what let me experiment afterwards without wondering whether I had quietly broken search while I was just refactoring.

What was actually wrong?

Nothing was broken. That is the awkward thing about this kind of work. The search worked fine. It was a best_fields multi_match across three fields, with the product name boosted above the class, and the class boosted above the description. A search for "outdoor chair" returned outdoor chairs. Good enough to ship, and I had shipped it.

The problem was that I could not measure it. I had a relevance lab by this point: the WANDS furniture dataset comes with 480 judged queries and a pile of human relevance ratings, and I had wired up nDCG, MRR and recall so I could put a number on how good my ranking was. My baseline number was an nDCG@10 of 0.6532. Fine. But to improve it I needed to try variations, and every variation meant touching the query function. A boost of 3 on the name versus a boost of 4. Fuzziness on versus off. A phrase bonus when the typed words appear together. Each of those is a one-line edit, and a one-line edit you make twenty times is a great way to introduce a bug you do not notice.
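For readers who have not met the metric, here is a minimal sketch of nDCG@k. The post does not show the lab's scoring code, so this is an assumption: it uses linear gain with a log2 position discount, and the gain values 1, 0.5 and 0 are purely illustrative. The function names `dcgAtK` and `ndcgAtK` are hypothetical.

```typescript
// Sketch of nDCG@k with linear gain (an assumed variant, not the lab's code).
// `gains` is the judged relevance of each returned result, in ranked order.
function dcgAtK(gains: number[], k: number): number {
  return gains
    .slice(0, k)
    .reduce((sum, gain, i) => sum + gain / Math.log2(i + 2), 0);
}

// `allJudged` holds every judged gain for the query, used to build the
// ideal ordering that the actual ranking is compared against.
function ndcgAtK(gains: number[], allJudged: number[], k = 10): number {
  const ideal = [...allJudged].sort((a, b) => b - a);
  const idealDcg = dcgAtK(ideal, k);
  return idealDcg === 0 ? 0 : dcgAtK(gains, k) / idealDcg;
}

// A perfect ordering scores 1; pushing the best result down scores lower.
console.log(ndcgAtK([1, 0.5, 0], [1, 0.5, 0])); // 1
console.log(ndcgAtK([0, 0.5, 1], [1, 0.5, 0]) < 1); // true
```

Averaging this over all judged queries gives a single number per ranking, which is what makes two rankings comparable.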

What I wanted was for a ranking to be a value. Not code I edit, but data I pass in. If a ranking is a value, then I can hand it to the lab, score it against the judged queries, write the score down next to it, and compare two of them honestly. And the storefront can use one specific value as its default while the lab plays with others. That is the shape I was after.

What does it look like to make a query into a value?

Here is the config. Every knob that used to be hardcoded is now a field on an object.

typescript
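// Assumes the v8 @elastic/elasticsearch client, which exports the estypes namespace.
import type { estypes } from "@elastic/elasticsearch";

export interface RankingConfig {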
  /** Per-field BM25 boosts; a name hit outweighs a class hit outweighs a description hit. */
  fields: { name: number; productClass: number; description: number };
  /** Typo tolerance; undefined turns fuzziness off entirely. */
  fuzziness?: estypes.Fuzziness;
  /** Exact leading chars before any edit is allowed; >0 stops "light" matching "right". */
  fuzzyPrefixLength: number;
  /** How many query terms must match (e.g. "2<75%"); undefined keeps OR semantics. */
  minimumShouldMatch?: string;
  /** Boost for the whole query appearing as a phrase in the name; 0 disables. */
  phraseBoost: number;
  /** function_score weight on review_count (ln1p saturated); 0 disables popularity. */
  popularityWeight: number;
}

Read that as a list of questions I wanted to be able to ask. How much should a name match outweigh a description match? Should I tolerate typos, and if so, how many leading characters must be exact before I allow one? Should I demand that some fraction of the query terms match, or accept any of them? Should I give a bonus when the whole query shows up as a phrase in the name? Should popular products get a nudge? Each question is now a field, and each field has a value I can change without touching code.
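To make "a ranking is a value" concrete, here is a sketch of two such values. The numbers are read off the pinned test and the results later in the post; the names `LEGACY_RANKING` and `TUNED_RANKING` are hypothetical, the interface is a local copy with fuzziness typed loosely so the block stands alone, and the legacy fuzziness of "AUTO" is an assumption (the post only says it was fixed).

```typescript
// Local copy of the interface so the sketch needs no Elasticsearch types.
interface RankingConfig {
  fields: { name: number; productClass: number; description: number };
  fuzziness?: string | number;
  fuzzyPrefixLength: number;
  minimumShouldMatch?: string;
  phraseBoost: number;
  popularityWeight: number;
}

// The pre-tuning ranking: same boosts, every newer knob at its off value.
const LEGACY_RANKING: RankingConfig = {
  fields: { name: 3, productClass: 2, description: 1 },
  fuzziness: "AUTO", // assumed; the old value is not shown in the post
  fuzzyPrefixLength: 0,
  phraseBoost: 0,
  popularityWeight: 0,
};

// A candidate is just another value derived from it.
const TUNED_RANKING: RankingConfig = {
  ...LEGACY_RANKING,
  fuzzyPrefixLength: 2,
  phraseBoost: 4,
};

console.log(TUNED_RANKING.fields.name); // 3
```

Because both are plain objects, they can be stored, diffed, and scored side by side without either one being "the code".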

The builder takes the config and assembles the query from it. The important property of the builder is that it is boring. It does exactly what the fields tell it to, and nothing the fields do not. If the phrase boost is zero, there is no phrase clause. If fuzziness is undefined, there is no fuzziness. The query that comes out is the smallest query that satisfies the config.

typescript
function buildMultiMatch(query: string, r: RankingConfig): estypes.QueryDslQueryContainer {
  const mm: estypes.QueryDslMultiMatchQuery = {
    query,
    type: "best_fields",
    fields: fieldList(r.fields),
  };
  if (r.fuzziness !== undefined) mm.fuzziness = r.fuzziness;
  if (r.fuzzyPrefixLength > 0) mm.prefix_length = r.fuzzyPrefixLength;
  if (r.minimumShouldMatch) mm.minimum_should_match = r.minimumShouldMatch;
  return { multi_match: mm };
}

Those if checks matter more than they look. They are what let the default config produce the exact same bytes as the old hardcoded query. A knob set to its off value adds nothing to the query, so a config full of off-values is indistinguishable from the query that never had those knobs at all.
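The `fieldList` helper is not shown in the post. A minimal sketch of what it might look like, assuming it maps the config keys to index field names and drops the `^boost` suffix when the boost is the neutral value 1 (which is what the pinned field list `"description"` suggests). The second function, `prefixKnob`, is a hypothetical stand-in that isolates the guard pattern.

```typescript
type FieldBoosts = { name: number; productClass: number; description: number };

// Hypothetical helper: appends "^boost" only when the boost is not 1.
function fieldList(boosts: FieldBoosts): string[] {
  const pairs: Array<[string, number]> = [
    ["name", boosts.name],
    ["product_class", boosts.productClass],
    ["description", boosts.description],
  ];
  return pairs.map(([field, boost]) => (boost === 1 ? field : `${field}^${boost}`));
}

// The guard pattern in isolation: an off value omits the key entirely,
// so the serialized query is identical to one that never had the knob.
function prefixKnob(prefixLength: number): Record<string, unknown> {
  const mm: Record<string, unknown> = { query: "chair" };
  if (prefixLength > 0) mm.prefix_length = prefixLength;
  return mm;
}

console.log(fieldList({ name: 3, productClass: 2, description: 1 }));
// [ 'name^3', 'product_class^2', 'description' ]
console.log(JSON.stringify(prefixKnob(0))); // {"query":"chair"}
console.log(JSON.stringify(prefixKnob(2))); // {"query":"chair","prefix_length":2}
```

Without the guard, an off value of 0 would still serialize as `"prefix_length":0`, and the output would no longer match the original query.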

How do you prove a refactor changed nothing?

This is the part I care about. A refactor is a change that, by definition, is not supposed to change behaviour. Everyone agrees with that until they are three commits deep and the search results look slightly different and nobody can say whether that is the tuning working or the refactor leaking.

So I wrote a test that pins the output. I picked the config I wanted as the default, ran it through the builder, and asserted the result deep-equals the query I expected, field by field, byte for byte. Not close enough. Equal.

typescript
it("builds the graduated step-7 query: boosted multi_match + prefix_length + phrase boost", () => {
  expect(buildSearchBody("outdoor chair", 20)).toEqual({
    query: {
      bool: {
        must: [
          {
            multi_match: {
              query: "outdoor chair",
              type: "best_fields",
              fields: ["name^3", "product_class^2", "description"],
              fuzziness: "AUTO",
              prefix_length: 2,
            },
          },
        ],
        should: [{ match_phrase: { name: { query: "outdoor chair", slop: 2, boost: 4 } } }],
      },
    },
    // ...highlight, suggest, size elided
  });
});

When I first made the change, this test asserted the old query, the plain boosted multi_match with no prefix length and no phrase clause. Same boosts as before, name boosted 3, class 2, description 1. That was the proof. The default config went in, the builder ran, and the output matched the query that had been hardcoded the day before. The refactor was pure. Nothing about what users saw changed. I had just moved the same query behind a value.

(The snippet above shows the default as it stands now, after I tuned it. More on that in a second. The shape of the proof is the same either way: pick a config, pin its exact output, refuse to let it drift.)

There is a second test that proves the other direction: that a non-default config actually emits the extra Elasticsearch DSL it is supposed to. A config with a minimum-should-match and a popularity weight should produce a function_score wrapping a bool with a minimum_should_match on the multi_match. I pinned that too. Between the two tests I have the whole contract nailed down. The default does exactly what it always did. A tuned config does exactly what its knobs say. Neither can drift without a test going red.
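A sketch of the behaviour that second test pins, under stated assumptions: the post does not show how the phrase clause and the popularity wrapper are assembled, so `wrapQuery` and the exact function_score layout below are hypothetical. `field_value_factor` with the `ln1p` modifier and a per-function `weight` are real Elasticsearch DSL; their placement here is a guess at the shape.

```typescript
type Query = Record<string, unknown>;

// Hypothetical wrapper: adds the phrase clause and the popularity
// function_score only when their knobs are above zero.
function wrapQuery(
  inner: Query,
  text: string,
  knobs: { phraseBoost: number; popularityWeight: number },
): Query {
  const bool: Query = { must: [inner] };
  if (knobs.phraseBoost > 0) {
    bool.should = [{ match_phrase: { name: { query: text, slop: 2, boost: knobs.phraseBoost } } }];
  }
  const base: Query = { bool };
  if (knobs.popularityWeight <= 0) return base;
  return {
    function_score: {
      query: base,
      functions: [
        {
          // Assumed layout: log-saturated review count, scaled by the weight.
          field_value_factor: { field: "review_count", modifier: "ln1p", missing: 0 },
          weight: knobs.popularityWeight,
        },
      ],
      boost_mode: "sum",
    },
  };
}

const inner: Query = { multi_match: { query: "outdoor chair", minimum_should_match: "2<75%" } };
const off = wrapQuery(inner, "outdoor chair", { phraseBoost: 0, popularityWeight: 0 });
const tuned = wrapQuery(inner, "outdoor chair", { phraseBoost: 0, popularityWeight: 1.5 });

console.log(Object.keys(off)); // [ 'bool' ]
console.log(Object.keys(tuned)); // [ 'function_score' ]
```

Pinning both outputs with deep equality is what catches a knob that silently stops emitting its clause.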

What about input from sliders? Can you trust it?

No, and that is a thing I almost forgot. The whole reason a ranking is now a value is so I can feed it different values. In the Search Studio I built sliders for these knobs, so I can drag the name boost up and down and watch the nDCG move. But those sliders post JSON to my eval endpoint, and JSON from a browser is untrusted input. Somebody could send me a name boost of nine million, or a string where I expect a number, or a missing field. So there is a coercion step. Anything coming in from outside gets clamped into a sane range before it touches the query builder.

typescript
export function coerceRankingConfig(raw: unknown): RankingConfig {
  const o = (raw && typeof raw === "object" ? raw : {}) as Record<string, unknown>;
  const clamp = (v: unknown, def: number, min: number, max: number): number => {
    const n = Number(v);
    return Number.isFinite(n) ? Math.min(max, Math.max(min, n)) : def;
  };
  const f = (o.fields && typeof o.fields === "object" ? o.fields : {}) as Record<string, unknown>;
  return {
    fields: {
      name: clamp(f.name, DEFAULT_RANKING.fields.name, 0, 20),
      productClass: clamp(f.productClass, DEFAULT_RANKING.fields.productClass, 0, 20),
      description: clamp(f.description, DEFAULT_RANKING.fields.description, 0, 20),
    },
    // fuzziness and fuzzyPrefixLength are coerced the same way (elided here)
    phraseBoost: clamp(o.phraseBoost, DEFAULT_RANKING.phraseBoost, 0, 50),
    popularityWeight: clamp(o.popularityWeight, DEFAULT_RANKING.popularityWeight, 0, 10),
  };
}

A name boost of nine million becomes 20. A non-numeric string becomes the default, and so does a missing field. (Number() is lenient, though: a numeric string like "5" is accepted as 5, and null or an empty string coerces to 0 and is clamped rather than defaulted.) The nice thing about doing this with the same DEFAULT_RANKING value the storefront uses is that the worst case for a malformed request is just the default ranking, which is exactly the ranking everyone already gets. A bad request cannot produce a broken query. It can only produce the boring, known-good one.
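The edge cases of `Number()` are worth seeing side by side. The first function below is a copy of the clamp from the coercion step; `clampStrict` is a hypothetical stricter variant, not code from the storefront, for cases where only real JSON numbers should be accepted.

```typescript
// Copy of the clamp above: coerce with Number(), fall back to the default
// when the result is not finite, otherwise clamp into [min, max].
const clamp = (v: unknown, def: number, min: number, max: number): number => {
  const n = Number(v);
  return Number.isFinite(n) ? Math.min(max, Math.max(min, n)) : def;
};

console.log(clamp(9e6, 3, 0, 20));       // 20
console.log(clamp("abc", 3, 0, 20));     // 3
console.log(clamp(undefined, 3, 0, 20)); // 3
console.log(clamp("5", 3, 0, 20));       // 5 (numeric strings are accepted)
console.log(clamp(null, 3, 0, 20));      // 0 (Number(null) is 0)

// Hypothetical stricter variant: anything that is not a finite number
// falls back to the default.
const clampStrict = (v: unknown, def: number, min: number, max: number): number =>
  typeof v === "number" && Number.isFinite(v) ? Math.min(max, Math.max(min, v)) : def;

console.log(clampStrict("5", 3, 0, 20));  // 3
console.log(clampStrict(null, 3, 0, 20)); // 3
```

Either variant keeps the guarantee that matters: whatever comes in, a finite number inside the allowed range goes out.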

Things that surprised me

  • The off value matters as much as the knob. The reason the default config could reproduce the old query exactly is that every knob has a value that makes it disappear from the query. Fuzziness of undefined, phrase boost of zero, popularity of zero. If I had made the builder always emit a phrase clause and just set its boost low, the default config could not have matched the old query byte for byte, and I would have lost the clean refactor proof.
  • Once a ranking is a value, the workflow falls out for free. I did not plan a graduate-a-config-into-the-default workflow. It just became obvious. The lab tries a config, scores it, I look at the score, and if it wins I copy that value into DEFAULT_RANKING. The storefront picks it up. There is no special machinery, because a config is just a value and the default is just one of those values.
  • The tuning gains were small. I want to be honest here. After all this, the measured win was modest. Full-set nDCG@10 went from 0.6532 to about 0.6629, a gain of roughly 0.01, helped by a fuzziness prefix length of 2 (so "light" stops fuzzy-matching "right") and the phrase boost. Minimum-should-match actively hurt: it cratered recall, so I rejected it. Popularity did not help. The plumbing was a bigger deliverable than the gain.
  • Deep-equality tests feel excessive until they save you. Asserting an entire Elasticsearch query object byte for byte looks paranoid. But it is the only assertion that actually proves nothing changed. A looser test would have let the refactor leak, and I would never have known which commit did it.
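The graduation workflow from the list above can be sketched in a few lines. Everything here is hypothetical naming (`pickGraduate`, `Knobs`, `stubScore`); the stub metric simply returns the two nDCG@10 figures quoted in this post, where a real lab would run the judged queries.

```typescript
interface Scored<T> {
  config: T;
  score: number;
}

// Scores every candidate and returns the best one, but only if it beats
// the current default by more than `minGain`; otherwise the default stays.
function pickGraduate<T>(
  current: T,
  candidates: T[],
  score: (config: T) => number,
  minGain = 0,
): Scored<T> {
  let best: Scored<T> = { config: current, score: score(current) };
  const floor = best.score + minGain;
  for (const config of candidates) {
    const s = score(config);
    if (s > floor && s > best.score) best = { config, score: s };
  }
  return best;
}

type Knobs = { fuzzyPrefixLength: number; phraseBoost: number };
const current: Knobs = { fuzzyPrefixLength: 0, phraseBoost: 0 };
const tuned: Knobs = { fuzzyPrefixLength: 2, phraseBoost: 4 };

// Stub metric: stands in for a full evaluation run.
const stubScore = (c: Knobs): number => (c.phraseBoost > 0 ? 0.6629 : 0.6532);

const winner = pickGraduate(current, [tuned], stubScore);
console.log(winner.config === tuned); // true
```

Setting `minGain` above zero is one way to avoid graduating a config whose win is within noise.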

When is this worth doing?

I would reach for this the moment I want to measure a piece of behaviour rather than just run it. A search ranking is the clear case: there is a number I can compute, judged data to compute it against, and a real temptation to keep nudging the query. The day you want to compare version A against version B with a straight face, you want both versions to be values you can name, store, and score, not two states of the same edited function.

I would not bother for a query I am never going to tune. If the ranking is fine and there is no eval loop behind it, a config object is just ceremony. The payoff is not the object, it is the measurement the object unlocks, and if there is no measurement there is no payoff.

The one rule I would carry to any refactor like this: make the default reproduce the old behaviour exactly, and write the test that proves it before you change anything else. That test is cheap to write and it is the difference between "I refactored search and tuning is now possible" and "I refactored search and something feels off and I cannot tell why". The first one is a good day. The second one is a week.
