Lite mode: keeping a shop searchable when the search engine is asleep
My search engine runs on a PC in my apartment, so it is sometimes asleep. Here is how NORDHEM stays searchable when it is: a circuit breaker that gives up fast, a Postgres full-text fallback, and a status page that tells the truth.
My search engine runs on a computer in my apartment. Elasticsearch, in Docker, behind a free tunnel. That is a deliberate choice for NORDHEM, my Nordic home goods shop: it lets me run the full engine, semantic vectors, faceted aggregations and synonyms, at no hosting cost. It also means the engine is sometimes asleep, because the machine is off, the tunnel dropped, or I am away from home. The storefront itself lives on Vercel and is always up. So the interesting question was never how to make search fast. It was what the shop does when the thing behind its search box is unreachable.
The answer is lite mode. When the engine does not answer, my Next.js backend stops waiting on it and serves a Postgres full-text fallback, with a banner that says so and a status page that tells the truth. Two pieces make it work: a circuit breaker that decides when to give up, and a Postgres query good enough to keep the shop usable. The verdict up front: a degraded but honest search beats a spinner every time, and the whole thing is about a hundred lines.
Why a try/catch is not enough
The naive version is one try/catch. Call the search service. If it throws, show a message saying search is offline. I shipped that first. It has two problems.
The first is latency. A sleeping PC does not refuse the connection politely; it hangs. The socket sits open until something times out, and the default timeout is measured in tens of seconds. So a shopper who searches while the machine is down waits twenty seconds to be told search is broken. The page is hostage to a computer that is not even on.
The second problem is that try/catch has no memory. If the engine is down, the next thousand shoppers each pay the full timeout, independently, because nothing remembers that the previous calls all failed. The system keeps throwing requests at a wall and keeps making people wait.
Both problems have the same shape. I need something that notices the engine is down, stops calling it for a while, and serves something useful in the meantime. That is a circuit breaker plus a fallback.
The breaker
A circuit breaker has three states. Closed is normal: calls go through. After a handful of consecutive failures it trips open, which means stop calling, serve the fallback right away, do not wait. After a cooldown it goes half-open and lets a trial request through to see whether the engine came back. A success closes it again; a failure re-opens it and restarts the cooldown.

    canRequest(): boolean {
      if (this.state === "open") {
        if (this.now() - this.openedAt >= this.opts.cooldownMs) {
          this.state = "half-open";
          return true; // cooldown elapsed: let a trial request through
        }
        return false; // still cooling down: serve the fallback
      }
      return true; // closed or half-open
    }

    recordSuccess(): void {
      this.failures = 0;
      this.state = "closed";
    }

    recordFailure(): void {
      this.failures += 1;
      if (this.state === "half-open" || this.failures >= this.opts.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
    }

The clock is injected (the now option) instead of calling Date.now directly. That one decision lets the whole thing be covered by a pure unit test: set the failure threshold to three, record three failures, advance the fake clock past the cooldown, and assert that canRequest returns true and the state has flipped to half-open. No timers, no waiting, no flake. Notice what the breaker does not do: it has no timeout of its own. Deciding how long to wait for a single call is the caller's job, which keeps the state machine small.
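The fake-clock test described above can be sketched roughly as follows. The three methods are the ones shown; the surrounding class shape (the constructor, the BreakerOptions type, the public state field) and the 30-second cooldown are assumptions made for illustration, not the shop's actual code.

```typescript
// Sketch only: the class shape around the three methods is assumed.
type BreakerState = "closed" | "open" | "half-open";

interface BreakerOptions {
  failureThreshold: number;
  cooldownMs: number;
  now?: () => number; // injected clock; falls back to Date.now
}

class CircuitBreaker {
  state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private readonly opts: BreakerOptions;
  private readonly now: () => number;

  constructor(opts: BreakerOptions) {
    this.opts = opts;
    this.now = opts.now ?? (() => Date.now());
  }

  canRequest(): boolean {
    if (this.state === "open") {
      if (this.now() - this.openedAt >= this.opts.cooldownMs) {
        this.state = "half-open";
        return true;
      }
      return false;
    }
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.state === "half-open" || this.failures >= this.opts.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }
}

// The scenario from the paragraph above, driven by a fake clock.
let fakeNow = 0;
const breaker = new CircuitBreaker({
  failureThreshold: 3,
  cooldownMs: 30_000, // assumed value for illustration
  now: () => fakeNow,
});

breaker.recordFailure();
breaker.recordFailure();
breaker.recordFailure();
const stateAfterTrip = breaker.state; // third failure trips the breaker
const allowedWhileOpen = breaker.canRequest(); // cooldown has not elapsed yet

fakeNow += 30_000; // advance the fake clock past the cooldown
const allowedAfterCooldown = breaker.canRequest();
const stateAfterCooldown = breaker.state;

breaker.recordSuccess(); // the trial request succeeded
const stateAfterRecovery = breaker.state;

console.log({
  stateAfterTrip,
  allowedWhileOpen,
  allowedAfterCooldown,
  stateAfterCooldown,
  stateAfterRecovery,
});
```

Because time only moves when the test moves it, the same scenario gives the same result on every run, with no sleeps or timer mocks.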
The policy
The breaker only decides whether to try. The retrieval policy is a separate function, and it is the part worth getting right: try the full engine, and on any failure fall back to Postgres. While the breaker is open, skip the engine entirely and go straight to the fallback.

    export async function resolveSearch(
      deps: ResolveDeps,
    ): Promise<{ response: SearchResponse; lite: boolean }> {
      if (deps.breaker.canRequest()) {
        try {
          const response = await deps.full();
          deps.breaker.recordSuccess();
          return { response, lite: false };
        } catch {
          deps.breaker.recordFailure();
        }
      }
      return { response: await deps.fallback(), lite: true };
    }

Splitting the policy out from both the breaker and the real network call means I can test it with stubs. A full that resolves proves the success path closes the breaker. A full that rejects proves it records a failure and returns the fallback. An already-open breaker proves the engine is never called at all. The function returns the response plus a lite flag, so the page knows which mode it is rendering.
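Those three stub cases might look something like the sketch below. The SearchResponse shape, the BreakerLike interface, the stubBreaker helper and the hit strings are stand-ins invented for the example; only the body of resolveSearch follows the function shown above.

```typescript
// Stand-in types: the real SearchResponse and ResolveDeps are not shown in the post.
type SearchResponse = { hits: string[] };

interface BreakerLike {
  canRequest(): boolean;
  recordSuccess(): void;
  recordFailure(): void;
}

interface ResolveDeps {
  breaker: BreakerLike;
  full: () => Promise<SearchResponse>;
  fallback: () => Promise<SearchResponse>;
}

async function resolveSearch(
  deps: ResolveDeps,
): Promise<{ response: SearchResponse; lite: boolean }> {
  if (deps.breaker.canRequest()) {
    try {
      const response = await deps.full();
      deps.breaker.recordSuccess();
      return { response, lite: false };
    } catch {
      deps.breaker.recordFailure();
    }
  }
  return { response: await deps.fallback(), lite: true };
}

// Hypothetical helper: a breaker stub that records which methods were called.
function stubBreaker(allow: boolean) {
  const calls: string[] = [];
  const breaker: BreakerLike = {
    canRequest: () => allow,
    recordSuccess: () => { calls.push("success"); },
    recordFailure: () => { calls.push("failure"); },
  };
  return { breaker, calls };
}

async function runPolicyScenarios() {
  const fallback = async (): Promise<SearchResponse> => ({ hits: ["from-postgres"] });

  // 1. Engine answers: full result, success recorded.
  const healthy = stubBreaker(true);
  const success = await resolveSearch({
    breaker: healthy.breaker,
    full: async () => ({ hits: ["from-engine"] }),
    fallback,
  });

  // 2. Engine throws: failure recorded, fallback served.
  const failing = stubBreaker(true);
  const failure = await resolveSearch({
    breaker: failing.breaker,
    full: async () => { throw new Error("engine asleep"); },
    fallback,
  });

  // 3. Breaker already open: the engine must not be called at all.
  const tripped = stubBreaker(false);
  let engineCalls = 0;
  const skipped = await resolveSearch({
    breaker: tripped.breaker,
    full: async () => { engineCalls += 1; return { hits: ["from-engine"] }; },
    fallback,
  });

  return {
    success, successCalls: healthy.calls,
    failure, failureCalls: failing.calls,
    skipped, skippedCalls: tripped.calls, engineCalls,
  };
}

runPolicyScenarios().then((results) => console.log(results));
```

Since the policy only sees three methods and two async functions, none of these cases needs a network, a database, or a real breaker.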
The fallback that was almost good enough
Lite mode is Postgres full-text search over the shop catalog: name, description and category compiled into a tsvector, matched against the shopper's query. Neon hosts that Postgres and is always up, so the fallback is available exactly when Elasticsearch is not. I wrote it, the unit and integration tests passed, and I moved on.
Then I opened a real browser, searched for 'oak bed' with the engine switched off, and got nothing. Zero results. The catalog is full of beds.
The cause is one function. plainto_tsquery turns 'oak bed' into the query 'oak' AND 'bed', and no single product in the 800-item shop had both words in its text. Elasticsearch, with its field boosts and looser scoring, would have returned a page of beds. My fallback returned an empty state. A fallback that gives up on an ordinary two-word query is worse than no fallback, because it looks like the catalog itself is empty.
The fix is to favour recall. I rewrite the AND query into an OR query by replacing the ampersand in the plainto_tsquery output with a pipe. Because the function has already sanitized the user input, rewriting its text is injection-safe, and ts_rank still pushes the products that match more terms to the top.

    // The searchable document: name + description + category, English-analyzed.
    const tsv = sql`to_tsvector('english', coalesce(${productsRaw.name}, '') || ' '
      || coalesce(${productsRaw.description}, '') || ' '
      || coalesce(${shopProducts.category}, ''))`;

    // plainto_tsquery ANDs the terms, so "oak bed" finds nothing unless one product
    // has both. A fallback should favour recall, so rewrite the AND query to OR.
    // plainto_tsquery sanitizes the input, so rewriting its text stays injection-safe.
    const tsq = sql`replace(plainto_tsquery('english', ${query})::text, '&', '|')::tsquery`;
    const match = sql`${tsv} @@ ${tsq}`;
    const rank = sql<number>`ts_rank(${tsv}, ${tsq})`;

The query for 'oak bed' went from zero results to 176. That is the whole bug. It is also the kind of bug that only a real browser finds: every test was green, because the tests happened to ask for words that did exist together. The lesson stuck. A fallback is not a smaller version of the main engine; it is a different tool with a different priority, and that priority is: do not return nothing.
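The difference between the two query shapes is easy to see on a toy catalog. The sketch below is not Postgres: it uses plain whitespace tokens with no stemming, and it ranks by the number of matched terms as a crude stand-in for ts_rank. The four products and all function names are invented for the illustration.

```typescript
// Toy model of AND versus OR matching. Illustrative only, not the shop's query.
type Product = { name: string; text: string };

const catalog: Product[] = [
  { name: "Oak dining table", text: "solid oak dining table" },
  { name: "Pine bed frame", text: "pine bed frame with slats" },
  { name: "Linen bed sheet", text: "linen bed sheet set" },
  { name: "Wool throw", text: "grey wool throw" },
];

function tokenize(input: string): string[] {
  return input.toLowerCase().split(/\s+/).filter(Boolean);
}

// How many of the query terms appear in the product text.
function countMatches(product: Product, queryTerms: string[]): number {
  const words = new Set(tokenize(product.text));
  return queryTerms.filter((term) => words.has(term)).length;
}

// AND semantics: every term must be present (the plainto_tsquery default).
function searchAnd(query: string): string[] {
  const q = tokenize(query);
  return catalog
    .filter((p) => q.length > 0 && countMatches(p, q) === q.length)
    .map((p) => p.name);
}

// OR semantics: any term may match; products matching more terms sort first.
function searchOr(query: string): string[] {
  const q = tokenize(query);
  return catalog
    .map((p) => ({ name: p.name, score: countMatches(p, q) }))
    .filter((hit) => hit.score > 0)
    .sort((a, b) => b.score - a.score)
    .map((hit) => hit.name);
}

console.log(searchAnd("oak bed")); // no product contains both words
console.log(searchOr("oak bed")); // every oak or bed product
console.log(searchOr("pine bed")); // the two-term match ranks first
```

The OR version never does worse than the AND version on recall, and a product that matches every term still sorts to the top, which is the trade the fallback wants.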
Telling the truth
The last piece is honesty. The search contract has carried a mode field since the first version, either full or fallback, so the UI cannot accidentally pretend a degraded result is the real thing.

    const full = async (): Promise<SearchResponse> => {
      const res = await fetch(`${SEARCH_API_URL}/search?${queryString}`, {
        signal: AbortSignal.timeout(TIMEOUT_MS), // 800ms: a sleeping PC must not block the page
      });
      if (!res.ok) throw new Error(`search service responded ${res.status}`);
      return SearchResponseSchema.parse(await res.json());
    };
    const fallback = () => ftsSearchShop(db(), query, fts);
    return resolveSearch({ breaker: searchBreaker(), full, fallback });

The 800-millisecond timeout is the budget: if the engine has not answered in that long, it does not get to hold up the page. A timeout, a non-2xx status and a response that fails schema validation all throw, so the policy treats each of them as a failure. When the response comes back as fallback, the results page shows a short banner explaining that the engine is asleep and that filters and semantic results are paused. A separate status page reports the live mode and the breaker state, the way a real service would. Lite mode is allowed to be worse. It is not allowed to lie.
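One way to keep the banner and the status page from drifting apart is to derive both from the same small value. The sketch below is a guess at what that could look like: the SearchStatus shape, the describeSearchStatus name and the banner wording are all hypothetical, and only the full/fallback modes and the three breaker states come from the text above.

```typescript
// Hypothetical status shape shared by the results banner and the status page.
type BreakerState = "closed" | "open" | "half-open";
type SearchMode = "full" | "fallback";

interface SearchStatus {
  mode: SearchMode;
  breaker: BreakerState;
  banner: string | null; // null means: show no banner
}

function describeSearchStatus(lite: boolean, breakerState: BreakerState): SearchStatus {
  const mode: SearchMode = lite ? "fallback" : "full";
  return {
    mode,
    breaker: breakerState,
    // Banner wording is invented for the example.
    banner:
      mode === "fallback"
        ? "The search engine is asleep. Filters and semantic results are paused."
        : null,
  };
}

const degraded = describeSearchStatus(true, "open");
const healthy = describeSearchStatus(false, "closed");
console.log(degraded, healthy);
```

With a single source for the mode, the UI has no code path that renders fallback results without also saying so.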
How I knew it worked
Three layers. The breaker is a pure state machine tested with an injected clock, so every transition is deterministic. The policy is tested with stubbed engine and fallback functions, so the fall-through and the skip-while-open paths are both pinned. The Postgres query runs against a real disposable Postgres in the integration tests. And the bug that none of those caught, the AND versus OR one, I found by taking the engine down for real and watching the browser render lite results from Postgres.
A degraded but honest search beats a spinner, and it cost about a hundred lines: a small state machine, a policy function, one Postgres query, and a banner. If part of an app depends on something its author does not fully control, a self-hosted box, a flaky upstream, an API with a rate limit, the move is to put a breaker in front of it and decide what lite looks like before it is needed. The day the dependency goes down is the wrong day to discover the app has no second mode.


