
The four bugs my unit tests were never going to find

By Antonio
Tags: Next.js, Full Stack, Coding, Web Development

NORDHEM has three test layers: fast unit tests, integration tests against a real Postgres, and one Playwright flow through a real browser and two servers. Getting that flow green surfaced four bugs the lower layers structurally could not see. Each layer is best at a different job.

I had a question that sounds trivial and isn't: if my unit tests are green and my integration tests are green, what is left to break? In my webshop NORDHEM the answer turned out to be four separate bugs, and not one of them could show up in the tests I already had. The short version is that there are three layers of tests, each is genuinely best at a different job, and the top layer (a real browser driving the real app) sees a category of problem that the lower two, by design, cannot see. The interesting part is which problems, and why.

I'll walk through the three layers using my own code, then through the four bugs the top layer caught. At the end I'll say the thing I actually believe, which is not "end-to-end tests are better." It's that you want all three, and you want very few of the slow one.

What are the three layers, concretely?

Think of the three layers like inspecting a car. A unit test is checking that one bolt is torqued correctly. An integration test is checking that the engine runs on a stand. An end-to-end test is putting a driver in the seat and seeing if the car actually pulls out of the driveway. All three are real checks. They tell you completely different things, and a green engine on a stand tells you nothing about whether the wheels are bolted on.

Layer one: unit tests that pinpoint one function

A unit test runs one function in isolation and finishes in milliseconds. In NORDHEM these cover things like cart math and the rule for merging a guest cart into your account when you sign in. There's no database, no browser, no network. That's the whole point: when one fails, it points at one function, and usually one line. The trade-off is that a unit test cannot tell you whether anything is wired together. It can prove the cart-merge function is correct and tell you nothing about whether anything ever calls it.
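To make "one function in isolation" concrete, here is a minimal sketch of what a guest-cart merge rule could look like. The `CartLine` shape, the `mergeGuestCart` name, and the rule itself (sum the quantities when a product is in both carts) are assumptions for illustration, not NORDHEM's actual code.

```typescript
// Hypothetical cart line shape; the real NORDHEM type may differ.
type CartLine = { productId: number; quantity: number };

// Assumed merge rule: quantities for the same product are summed, and
// products that appear in only one cart are kept unchanged.
function mergeGuestCart(account: CartLine[], guest: CartLine[]): CartLine[] {
  const byProduct = new Map<number, number>();
  for (const line of [...account, ...guest]) {
    const current = byProduct.get(line.productId) ?? 0;
    byProduct.set(line.productId, current + line.quantity);
  }
  // A Map keeps insertion order, so account lines come out first.
  return Array.from(byProduct, ([productId, quantity]) => ({ productId, quantity }));
}

const merged = mergeGuestCart(
  [{ productId: 1, quantity: 1 }],
  [
    { productId: 1, quantity: 2 },
    { productId: 3, quantity: 1 },
  ],
);
console.log(merged);
```

A function like this needs nothing but its arguments, which is why a check on it runs in milliseconds and why a failure points straight at the rule.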

My test runner, Vitest, keeps these in their own project so a plain pnpm test never waits on anything heavy:

typescript
// vitest.config.ts — two projects, deliberately split
projects: [
  {
    test: {
      name: "unit",
      environment: "jsdom",
      include: ["test/**/*.test.{ts,tsx}"],
      exclude: ["test/integration/**"],
      setupFiles: ["test/setup.ts"],
    },
  },
  {
    test: {
      name: "integration",
      environment: "node",
      include: ["test/integration/**/*.test.ts"],
      testTimeout: 120_000,
      hookTimeout: 240_000,
    },
  },
],

Notice the timeouts on the second project: two minutes per test, four minutes for setup. That alone tells you the integration tests are a different animal.

Layer two: integration tests against a real database

An integration test in NORDHEM boots a real Postgres in a throwaway container (using Testcontainers), runs my actual database code against it, and asserts on the rows that come back. No mocks. This is where I prove things that are only true of a real database: that a transaction is all-or-nothing, that a constraint rejects bad data, that a rollback actually rolls back.

The clearest example is checkout. It is the one place where money and order history get frozen, so it's a single transaction: read the cart, snapshot each line's name and price into the order, write the order, clear the cart, all or nothing. The test that earns its keep is the rollback test. It forces the payment step to throw and then proves the database is untouched.

typescript
// test/integration/checkout-repo.test.ts
it("rolls everything back when payment authorization fails", async () => {
  const cartId = await seedCart();
  await expect(
    checkout(db, { userId: "user-1", cartId, address: ADDRESS }, {
      beforeCommit: async () => {
        throw new Error("payment declined");
      },
    }),
  ).rejects.toThrow("payment declined");

  // No order, and the cart is untouched.
  expect(await db.select().from(orders)).toHaveLength(0);
  expect(await getCartLines(db, cartId)).toEqual([
    { productId: 1, quantity: 1 },
    { productId: 3, quantity: 2 },
  ]);
});

This is a strong test. It runs against the same Postgres I run in production, and it proves a property I could never trust to a mock. But look at what it can't see: there's no browser here, and no user. It runs headless. It calls my checkout function directly, in the same process, with arguments I hand it. It never finds out whether a real form on a real page ever produces those arguments, or whether the button that's supposed to call this code is even wired up. The engine runs beautifully on the stand. Nobody has tried to drive the car.
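For readers who want to see the contract that test pins down, here is a toy in-memory model of it. Everything in it is an assumption made for illustration: the `Store` shape, the `checkoutSketch` name, the synchronous hook, and the copy-then-commit trick standing in for a real transaction. It shows only the shape of "all or nothing" with a `beforeCommit` hook. It proves nothing about Postgres, which is exactly why the real test runs against a real database.

```typescript
type Line = { productId: number; quantity: number };
type Product = { name: string; priceCents: number };
type OrderLine = Line & Product; // name and price frozen at checkout time
type Store = {
  products: Map<number, Product>;
  carts: Map<string, Line[]>;
  orders: { userId: string; lines: OrderLine[] }[];
};

// Synchronous for brevity; the function in the post is async.
function checkoutSketch(
  store: Store,
  input: { userId: string; cartId: string },
  hooks: { beforeCommit?: () => void } = {},
): void {
  // Work on copies; the real store is only touched at "commit".
  const carts = new Map(store.carts);
  const orders = [...store.orders];

  const lines = carts.get(input.cartId) ?? [];
  if (lines.length === 0) throw new Error("cart is empty");

  // Snapshot each line's name and price into the order.
  const snapshot: OrderLine[] = lines.map((line) => {
    const product = store.products.get(line.productId);
    if (!product) throw new Error(`unknown product ${line.productId}`);
    return { ...line, ...product };
  });
  orders.push({ userId: input.userId, lines: snapshot });
  carts.set(input.cartId, []);

  // Stand-in for payment authorization; a throw here abandons the copies.
  hooks.beforeCommit?.();

  // "Commit": swap the working copies in.
  store.carts = carts;
  store.orders = orders;
}
```

If `beforeCommit` throws, the function exits before the swap, so the caller sees no order and an unchanged cart, the same two facts the integration test asserts against the real tables.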

Layer three: a real browser driving the real app

The top layer is one Playwright test that drives an actual Chromium browser against the real stack: a production build of my Next.js app, talking over HTTP to my real Fastify search service, against real Elasticsearch and real Postgres. It signs up a user, searches for a sofa, opens a product, adds it to the cart, checks out, and then finds the order in the history page. A second test favorites a product and reloads to prove it stuck.

typescript
// e2e/golden-flow.spec.ts — abbreviated
await page.getByRole("button", { name: "Add to cart" }).click();
const drawer = page.getByRole("dialog", { name: "Shopping cart" });
await expect(drawer).toBeVisible();
await drawer.getByRole("link", { name: "Checkout" }).click();
await page.waitForURL((url) => url.pathname === "/checkout");

await page.getByLabel("Full name").fill("E2E Shopper");
await page.getByLabel("Address", { exact: true }).fill("Storgata 1");
await page.getByLabel("City").fill("Oslo");
await page.getByLabel("Postal code").fill("0155");
await page.getByRole("button", { name: "Place order" }).click();

// Confirmation page with a real order number.
await page.waitForURL(/\/orders\/NDH-\d{4}-\d{6}/);

Every line of that is a real user action against a real browser. The click has to reach a button, the button has to call a Server Action, the action has to call the same checkout code my integration test exercises, Postgres has to write the order, and the browser has to be redirected to a confirmation page with a real order number on it. The whole chain, end to end. This is the layer that sees the seams.

What can only the top layer see?

Three kinds of truth live only at the top. The first is the seams between components: does a click actually reach the Server Action, the database code, the database, and come back to update the little cart badge in the header? Every lower test owns one box; only this one tests the lines between boxes. The second is real-browser truth: does the page hydrate, does a form actually submit, does the session cookie the server sets get sent back by the browser on the next request? The third is cross-process reality: two servers, HTTP between them, IPv4 versus IPv6, a production build behaving differently from dev mode. None of that exists when you call a function directly in one Node process.

I know these three categories are real because getting this one flow to pass turned up exactly one bug from each, plus one more. Here they are.

The form that submitted before React woke up

A page in Next.js arrives as plain HTML first, then React attaches to it in the browser in a step called hydration. There's a window, usually short, where the HTML is on screen but React hasn't taken over the form yet. If a user submits during that window, the browser does what browsers did in 1996: a native form submission, a full GET request, ignoring all my React handlers. My e2e was fast enough to hit that window, so the form sometimes submitted as a plain GET before React was ready. A unit test can't see this because there's no hydration in jsdom. An integration test can't see it because there's no browser at all. You need a real browser racing a real React bundle.
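To see what that fallback produces, this sketch reproduces what a browser does with a form that has no explicit `method` and no JavaScript handler attached yet: it serializes the fields into the query string and navigates with GET. The `nativeGetSubmitUrl` name, the port, and the field names are hypothetical. Only the serialization behavior is standard.

```typescript
// Builds the URL a browser would navigate to for a native GET submission.
function nativeGetSubmitUrl(
  action: string,
  base: string,
  fields: Record<string, string>,
): string {
  const url = new URL(action, base);
  // application/x-www-form-urlencoded: spaces become "+", pairs join with "&".
  url.search = new URLSearchParams(fields).toString();
  return url.toString();
}

console.log(
  nativeGetSubmitUrl("/checkout", "http://127.0.0.1:3100", {
    fullName: "E2E Shopper",
    city: "Oslo",
  }),
);
```

The page reloads at that URL with the field values in the query string, and none of the React handlers ever run.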

localhost that resolved to the wrong address

My Fastify search service binds to IPv4 only. When the Next app asked for localhost, the machine resolved it to the IPv6 address ::1, nobody was listening there, and the request hung. The word localhost is not an address; it's a name that resolves to one, and which one you get depends on the machine. This is a cross-process bug by definition: it only exists because two servers are talking over the network. The whole point of unit and integration tests is to remove the network so they're fast and deterministic, which is exactly why they remove the thing that broke. The fix is in my Playwright config, pinned to the literal IPv4 address:

typescript
// playwright.config.ts
use: {
  // 127.0.0.1, not localhost: the Fastify search service binds IPv4
  // only, and Playwright would otherwise resolve localhost to IPv6
  // (::1) and hang.
  baseURL: `http://127.0.0.1:${WEB_PORT}`,
  trace: "on-first-retry",
},
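The same pinning can be applied to any URL that a config file or environment variable hands you. This is a small sketch, not part of NORDHEM: `pinLoopbackToIPv4` is a hypothetical helper, the port is made up, and it relies only on the standard `URL` class.

```typescript
// Rewrites a `localhost` host to the literal IPv4 loopback address so the
// result no longer depends on how the machine resolves the name.
function pinLoopbackToIPv4(raw: string): string {
  const url = new URL(raw);
  if (url.hostname === "localhost") {
    url.hostname = "127.0.0.1";
  }
  return url.toString();
}

console.log(pinLoopbackToIPv4("http://localhost:4100/search?q=sofa"));
console.log(pinLoopbackToIPv4("http://example.com/"));
```

Other hosts pass through untouched, so the helper is safe to apply to a URL that may or may not point at the local machine.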

Dev mode broke hydration on a custom port

My first instinct was to run the e2e against the dev server, the same one I use every day. The dev server keeps a websocket open for hot reload, so a code change refreshes the page instantly. On a non-standard port, that websocket handshake failed, and when it failed it took hydration down with it: the page loaded but React never attached, so every click did nothing. The fix was to stop testing the dev server and test a production build instead, which has no hot reload and behaves like the deployed app. This is the difference between next dev and next start, and you only feel it when a real browser tries to hydrate a real build:

typescript
// playwright.config.ts — the web server it boots
{
  // Production build, not `next dev`: dev-mode HMR misbehaves on a
  // custom port (the HMR websocket handshake fails and the client
  // never hydrates). next start has no HMR, hydrates normally, and
  // matches CI.
  command: "pnpm build && pnpm start",
  env: {
    PORT: String(WEB_PORT),
    SEARCH_API_URL: `http://127.0.0.1:${SEARCH_PORT}`,
  },
  url: `http://127.0.0.1:${WEB_PORT}`,
  reuseExistingServer: false,
  timeout: 300_000,
},

The optimistic button I had to wait for

The cart and favorite controls are optimistic: they flip instantly on click and reconcile with the server afterward. That means they only work once their client-side provider has mounted, which is after hydration. My test was clicking them too early and the click landed on nothing. The fix was to give the page a marker the test could wait on, an attribute the app sets once hydration finishes, so the test waits for the app to actually be alive before it pokes the optimistic buttons:

typescript
// e2e/golden-flow.spec.ts
import type { Page } from "@playwright/test";

// The cart/favorite controls are client-only (optimistic), so wait
// for the CartProvider to mount before clicking them.
async function waitForHydration(page: Page): Promise<void> {
  await page
    .locator("html[data-hydrated='true']")
    .waitFor({ state: "attached", timeout: 20_000 });
}
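The app side of that marker is not shown in this post. One way it could be set, sketched here under assumptions, is a tiny function that flips the attribute on the root element and is called once from a client-side effect (for example a `useEffect` in a top-level client component). The `markHydrated` name and the `RootLike` type are hypothetical. In the browser you would pass `document.documentElement`.

```typescript
// Minimal structural type so the sketch also runs outside a browser.
type RootLike = { setAttribute(name: string, value: string): void };

// Sets the marker the Playwright helper waits for: html[data-hydrated='true'].
// Effects never run during server rendering, so calling this from an effect
// means the attribute only appears once React is running in the browser.
function markHydrated(root: RootLike): void {
  root.setAttribute("data-hydrated", "true");
}

// Stand-in root element for demonstration.
const attrs = new Map<string, string>();
markHydrated({
  setAttribute: (name, value) => {
    attrs.set(name, value);
  },
});
console.log(attrs.get("data-hydrated"));
```

The test then has a single, explicit signal to wait on instead of guessing with a fixed delay.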

Four bugs. A form racing hydration, a name resolving to the wrong IP, a dev server breaking hydration on a custom port, and an optimistic control clicked before it existed. Every one lives in the gaps between components, in the browser, or across the wire. The lower layers don't miss these because they're weak. They miss them because they deliberately remove the browser and the network to be fast and deterministic, and the bugs live in exactly the parts that got removed.

Things that surprised me

A few things I didn't expect going in:

  • Three of the four bugs had nothing to do with my application logic. They were about hydration, networking, and build mode, the plumbing around the code, not the code. My business logic was fine. The wiring was not.
  • When an e2e test fails, it tells you a failure exists somewhere in a long journey, and that's the worst part of it. A unit test failure says "this function, this line." An e2e failure says "something between sign-up and order history is wrong, good luck." The slowest test to write is also the slowest to debug.
  • The honest way to think about it is membership versus speed. I have 29 unit tests, 16 integration tests, and 2 end-to-end tests. The shape is a pyramid on purpose: many fast tests at the bottom, a few slow ones at the top guarding only the critical journeys. Two e2e tests, not two hundred.
  • Half the work was just getting the test harness to boot the real stack reliably. Custom ports so it never collides with a running dev server, a plain process instead of a file watcher that never settles without a terminal, a 300-second timeout because a production build is slow. The test itself was the easy part.

So when is each layer worth it?

Here's what I'd tell someone about to write their first end-to-end test. Don't write it to replace your unit tests. Write it because there is a class of bug your unit and integration tests cannot see, and you should know what that class is before you spend the time. Unit tests are for logic: fast, precise, they name the broken line. Integration tests are for the database and other real dependencies you can't honestly fake: transactions, constraints, rollbacks. End-to-end tests are for the seams, the browser, and the wire: the things that only exist when the whole system is running as a system.

Keep the e2e count tiny and aimed at the journeys that earn money or sign people in. When one fails you'll pay for it in debugging time, so you want few of them and you want each to guard something that matters. But skip them entirely and you ship a checkout where the engine runs perfectly on the stand and the car won't leave the driveway. I found four of those in one flow. That's the whole argument for the top of the pyramid, and also the whole argument for keeping it small.
