Performance

How do you measure and quantify the impact of a performance fix?

Compare metric distributions before and after, not single numbers. Use RUM (the web-vitals library reporting to analytics) to capture p75/p95 for affected users, segmented by route, device, and network. Collect at least a week of post-deploy data to smooth daily and weekly cycles. Pair this with lab runs (Lighthouse CI on each PR) to catch regressions. Tie the result to business metrics (conversion, bounce, time-on-task) where possible, and use A/B tests for high-stakes changes.

Measuring a perf fix isn't "open Lighthouse, see if the score went up." It's a statistical comparison of distributions, ideally against a control.

Step 1: define the metric

Pick one or two primary metrics tied to what you fixed:

Image format change → LCP.
Bundle split → FCP, INP, TBT in the lab (TTI was dropped from the Lighthouse score in v10).
Layout shift fix → CLS.
Long task split → INP, long-task count.
Cache strategy change → TTFB, cache hit ratio.

Don't judge the fix on every metric. A perf change often leaves most of the dashboard flat while moving the specific metric you targeted.

Step 2: define the segment

Affected route(s).
Affected device class.
Affected geography.
Affected user state (first visit vs repeat).

A homepage perf fix might not move /checkout's metrics. A mobile image fix won't affect desktop. Compare apples to apples.
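A minimal sketch of "compare apples to apples": one segment definition, applied unchanged to both the baseline and the post-deploy data. The `RumSample` and `RumSegment` shapes and their field names are assumptions for illustration, not a vendor schema.

```typescript
// One RUM beacon. Field names are illustrative, not from a specific vendor.
interface RumSample {
  route: string;
  device: "mobile" | "desktop";
  country: string;
  firstVisit: boolean;
  lcpMs: number;
}

// The slice of traffic the fix is expected to affect.
// Omitted optional fields match every sample.
interface RumSegment {
  routePrefix: string;
  device?: "mobile" | "desktop";
  countries?: string[];
  firstVisit?: boolean;
}

function inSegment(sample: RumSample, segment: RumSegment): boolean {
  if (sample.route.indexOf(segment.routePrefix) !== 0) return false;
  if (segment.device !== undefined && sample.device !== segment.device) return false;
  if (segment.countries !== undefined && segment.countries.indexOf(sample.country) === -1) {
    return false;
  }
  if (segment.firstVisit !== undefined && sample.firstVisit !== segment.firstVisit) {
    return false;
  }
  return true;
}

// Use the SAME segment object for the baseline and the post-deploy window.
function selectSegment(samples: RumSample[], segment: RumSegment): RumSample[] {
  return samples.filter((s: RumSample) => inSegment(s, segment));
}
```

Keeping the segment as data (rather than ad hoc dashboard filters) makes it easy to record in the report exactly which traffic the numbers describe.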

Step 3: collect baseline

Before deploying the fix:

1+ week of RUM data → distribution at p50/p75/p95.
Lab runs (Lighthouse CI median of 5 runs on a fixed environment) for the route.
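As a sketch, the baseline boils down to a few percentiles over the segment's RUM values plus a median over repeated lab runs. The linear-interpolation percentile below is one common definition and an assumption here; a RUM vendor may use a slightly different one, so compute before and after with the same function.

```typescript
// Percentile with linear interpolation between the two nearest ranks.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile: no samples");
  if (p < 0 || p > 100) throw new Error("percentile: p must be in [0, 100]");
  const sorted = values.slice().sort((a: number, b: number) => a - b);
  const rank = (p / 100) * (sorted.length - 1);
  const lower = Math.floor(rank);
  const upper = Math.ceil(rank);
  return sorted[lower] + (sorted[upper] - sorted[lower]) * (rank - lower);
}

interface BaselineSummary {
  n: number;
  p50: number;
  p75: number;
  p95: number;
}

// Summarize one week (or more) of RUM values for the chosen segment.
function summarizeBaseline(rumValuesMs: number[]): BaselineSummary {
  return {
    n: rumValuesMs.length,
    p50: percentile(rumValuesMs, 50),
    p75: percentile(rumValuesMs, 75),
    p95: percentile(rumValuesMs, 95),
  };
}

// Lab number for a route: the median of repeated runs on a fixed environment.
function labMedian(runsMs: number[]): number {
  return percentile(runsMs, 50);
}
```

Store the summary together with the sample count; a p95 computed from a few hundred samples is much noisier than the p50 from the same data.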

Step 4: deploy + measure post

After deploying the fix:

Wait for at least a full business cycle (7 days) for RUM to smooth out weekday/weekend variance.
Compare p75 (and p95) before vs after on the same segment.
Look at the distribution shape, not just the median. Sometimes a fix improves the median but worsens the tail.
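The comparison above can be sketched as a per-percentile delta plus an explicit check for the "median improved, tail got worse" case. The function names are illustrative, and the sketch assumes a lower-is-better metric in milliseconds such as LCP.

```typescript
// Percentile with linear interpolation between the two nearest ranks.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile: no samples");
  const sorted = values.slice().sort((a: number, b: number) => a - b);
  const rank = (p / 100) * (sorted.length - 1);
  const lower = Math.floor(rank);
  const upper = Math.ceil(rank);
  return sorted[lower] + (sorted[upper] - sorted[lower]) * (rank - lower);
}

interface PercentileDelta {
  p: number;
  beforeMs: number;
  afterMs: number;
  deltaMs: number; // negative means faster after the fix
}

// Compare the same segment before and after at the same percentiles.
function compareAtPercentiles(
  beforeMs: number[],
  afterMs: number[],
  ps: number[] = [50, 75, 95]
): PercentileDelta[] {
  return ps.map((p: number) => {
    const b = percentile(beforeMs, p);
    const a = percentile(afterMs, p);
    return { p: p, beforeMs: b, afterMs: a, deltaMs: a - b };
  });
}

// True when the typical experience improved but the slow tail regressed.
function tailRegressed(beforeMs: number[], afterMs: number[]): boolean {
  const medianImproved = percentile(afterMs, 50) < percentile(beforeMs, 50);
  const tailWorse = percentile(afterMs, 95) > percentile(beforeMs, 95);
  return medianImproved && tailWorse;
}
```

A `tailRegressed` result of `true` is a prompt to re-segment (often by device class), not a verdict on its own.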

Step 5: confidence

A distribution comparison isn't a single number. Check:

Histograms before and after, on the same bins, with the same percentiles marked.
Statistical significance: at scale (tens of thousands of samples per window), a ~5% improvement in p75 LCP is usually distinguishable from noise; smaller deltas need more data or an A/B test.
Sample size: with fewer than 1,000 post-deploy RUM samples in the segment, wait longer.
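One way to put a number on confidence is a bootstrap interval for the change in p75. This is a swapped-in technique, not something the steps above prescribe: resample both windows with replacement many times and look at the spread of the resulting deltas. The seeded generator exists only to make the sketch reproducible; `Math.random` works as the default.

```typescript
// Percentile with linear interpolation between the two nearest ranks.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile: no samples");
  const sorted = values.slice().sort((a: number, b: number) => a - b);
  const rank = (p / 100) * (sorted.length - 1);
  const lower = Math.floor(rank);
  const upper = Math.ceil(rank);
  return sorted[lower] + (sorted[upper] - sorted[lower]) * (rank - lower);
}

// Small linear congruential generator, for reproducible runs only.
// The seed should be a non-negative integer.
function seededRandom(seed: number): () => number {
  let state = seed % 4294967296;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // always in [0, 1)
  };
}

// Draw values.length samples with replacement.
function resample(values: number[], rand: () => number): number[] {
  const out: number[] = [];
  for (let i = 0; i < values.length; i++) {
    out.push(values[Math.floor(rand() * values.length)]);
  }
  return out;
}

interface DeltaInterval {
  deltaMs: number; // observed p75(after) - p75(before)
  lowerMs: number; // 2.5th percentile of bootstrap deltas
  upperMs: number; // 97.5th percentile of bootstrap deltas
}

function bootstrapP75Delta(
  beforeMs: number[],
  afterMs: number[],
  iterations: number = 2000,
  rand: () => number = Math.random
): DeltaInterval {
  const deltas: number[] = [];
  for (let i = 0; i < iterations; i++) {
    const a = percentile(resample(afterMs, rand), 75);
    const b = percentile(resample(beforeMs, rand), 75);
    deltas.push(a - b);
  }
  return {
    deltaMs: percentile(afterMs, 75) - percentile(beforeMs, 75),
    lowerMs: percentile(deltas, 2.5),
    upperMs: percentile(deltas, 97.5),
  };
}
```

If the interval excludes zero, the shift is unlikely to be sampling noise. It says nothing about confounders such as traffic-mix changes between the two windows, which is what the A/B test in the next step addresses.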

Step 6: A/B test for high-stakes

For changes whose perf impact you can't predict (new bundle split, new image strategy), gate behind a flag and compare cohorts:

Cohort A (control): old version, 50% traffic.
Cohort B (variant): new version, 50% traffic.
Compare LCP p75 between A and B.

Because both cohorts run at the same time, this eliminates day-of-week, outage, and other external confounders. Most growth/experiment platforms support this.
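If you are not using an experiment platform, cohort assignment can be sketched as a deterministic hash of the user ID, so a user stays in the same cohort across page loads. This is an illustrative approach, not any platform's implementation; the experiment name used as a salt and the 10,000-bucket granularity are assumptions.

```typescript
// 32-bit multiply that stays exact in double-precision arithmetic
// (Math.imul does the same job in ES2015+ environments).
function mul32(a: number, b: number): number {
  const low = (a & 0xffff) * b;
  const high = (((a >>> 16) * b) & 0xffff) << 16;
  return (low + high) >>> 0;
}

// FNV-1a over UTF-16 code units (identical to the byte version for ASCII IDs).
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash = mul32((hash ^ input.charCodeAt(i)) >>> 0, 0x01000193);
  }
  return hash;
}

type Cohort = "control" | "variant";

// Salting with the experiment name keeps assignments independent
// between experiments that run on the same users.
function assignCohort(userId: string, experiment: string, variantShare: number = 0.5): Cohort {
  const bucket = fnv1a(experiment + ":" + userId) % 10000;
  return bucket < variantShare * 10000 ? "variant" : "control";
}
```

Record the assigned cohort on every RUM beacon so that LCP p75 can be computed per cohort from the same pipeline.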

Step 7: tie to business

The perf metric is the proxy. The business metric is the truth:

Conversion rate on the affected page.
Bounce rate.
Engagement (time on task, items added to cart, sessions per user).
Revenue per visitor.

If LCP improved by 800ms but conversion didn't budge, either the perf metric is the wrong proxy or load speed wasn't the bottleneck for users. Both are worth knowing.
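To decide whether conversion "budged", a rough sketch is a 95% confidence interval on the difference between two conversion rates, using the normal approximation. That approximation is an assumption that holds for large cohorts with more than a handful of conversions each; the `CohortOutcome` shape is illustrative.

```typescript
interface CohortOutcome {
  visitors: number;
  conversions: number;
}

interface LiftEstimate {
  controlRate: number;
  variantRate: number;
  diff: number; // variantRate - controlRate, in absolute terms
  lower: number; // 95% CI lower bound on diff
  upper: number; // 95% CI upper bound on diff
  significant: boolean; // true when the interval excludes zero
}

function conversionLift(control: CohortOutcome, variant: CohortOutcome): LiftEstimate {
  if (control.visitors <= 0 || variant.visitors <= 0) {
    throw new Error("conversionLift: both cohorts need visitors");
  }
  const p1 = control.conversions / control.visitors;
  const p2 = variant.conversions / variant.visitors;
  // Standard error of the difference between two independent proportions.
  const se = Math.sqrt(
    (p1 * (1 - p1)) / control.visitors + (p2 * (1 - p2)) / variant.visitors
  );
  const z = 1.96; // two-sided 95% interval
  const diff = p2 - p1;
  const lower = diff - z * se;
  const upper = diff + z * se;
  return {
    controlRate: p1,
    variantRate: p2,
    diff: diff,
    lower: lower,
    upper: upper,
    significant: lower > 0 || upper < 0,
  };
}
```

For example, 500 of 10,000 converting in control against 600 of 10,000 in the variant gives a one-point absolute lift whose interval excludes zero, while 500 against 510 does not.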

Tools

web-vitals library → your analytics endpoint, for RUM.
Lighthouse CI to catch regressions in the lab.
Vercel Speed Insights / Datadog RUM / Sentry Performance for dashboards.
PageSpeed Insights for public CrUX data.
A/B testing platform (in-house, Optimizely, GrowthBook) for cohort comparison.

Reporting

A useful perf report includes:

Change: replaced JPEG hero images with AVIF on /products/* (PR #1234).
Metric: LCP p75 (mobile, US/EU traffic).
Before: 2.8s.
After: 2.1s.
Delta: -700ms (25% improvement), statistically significant (n=12k post-deploy samples).
Business: +1.4% checkout conversion (95% CI [0.4%, 2.4%]) over 14-day post-deploy window.
Cost: 2 eng-days + ongoing image optimizer cost ~$50/mo.

This is the language stakeholders react to: clear change, clear metric, clear delta, business tie-in, cost.
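The delta line in such a report is simple arithmetic, sketched here so that the sign and the percentage base stay consistent across reports. The `PerfReportInput` shape is hypothetical, and the "improvement" label assumes a lower-is-better metric such as LCP.

```typescript
interface PerfReportInput {
  change: string;
  metric: string;
  beforeMs: number;
  afterMs: number;
  sampleCount: number;
}

// Percentage is relative to the BEFORE value, e.g. 2800 -> 2100 is 25%.
function formatDelta(beforeMs: number, afterMs: number): string {
  const deltaMs = Math.round(afterMs - beforeMs);
  if (deltaMs === 0) return "0ms (no change)";
  const pct = Math.round((Math.abs(deltaMs) / beforeMs) * 100);
  const sign = deltaMs > 0 ? "+" : "-";
  const label = deltaMs < 0 ? "improvement" : "regression";
  return sign + Math.abs(deltaMs) + "ms (" + pct + "% " + label + ")";
}

function reportLines(input: PerfReportInput): string[] {
  return [
    "Change: " + input.change,
    "Metric: " + input.metric,
    "Before: " + (input.beforeMs / 1000).toFixed(1) + "s",
    "After: " + (input.afterMs / 1000).toFixed(1) + "s",
    "Delta: " + formatDelta(input.beforeMs, input.afterMs) + ", n=" + input.sampleCount,
  ];
}
```

With the example's numbers, `formatDelta(2800, 2100)` yields `-700ms (25% improvement)`.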

Pitfalls

Comparing two days — daily variance can be huge; need at least a week.
Mixing cohorts — comparing all users before to all users after misses traffic-mix shifts.
Cherry-picking metrics — looking for the one metric that moved.
Lab-only validation — Lighthouse passed; real users still slow.
Ignoring the tail — p75 improves while p95 regresses for low-end devices.
No business measurement — perf wins that nobody can show ROI for get deprioritized.

Mental model

A perf fix is a hypothesis. Define the metric, the segment, and the expected delta upfront. Collect baseline. Deploy. Wait for statistical signal. Compare distributions. Tie to business if you can. Report with numbers, not vibes.

Follow-up questions

•Why p75 over the median?
•When do you A/B test a perf change?
•How do you tie LCP to business metrics?
•What's the difference between lab and field validation?

Common mistakes

•Comparing two days of data — daily variance washes out the signal.
•Using only Lighthouse — real users diverge.
•Comparing across all routes — homepage fix doesn't move checkout.
•No baseline — can't quantify the delta.
•Mixing cohorts before and after a traffic-mix change.
•Cherry-picking metrics — looking for whichever one moved.

Performance considerations

•Measurement itself has a cost. Don't ship redundant analytics; sample RUM at 1-10% at high traffic. The biggest mistake is shipping fixes you can't measure — they get re-broken next quarter because nobody knows their value.
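A minimal sketch of RUM sampling: decide once per page load whether this visit reports, and attach the sample rate to every beacon so totals can be scaled back up later. The `send` callback and payload shape are hypothetical stand-ins for your analytics client.

```typescript
interface RumPayload {
  metric: string;
  value: number;
  sampleRate: number;
}

interface RumSampler {
  sampled: boolean;
  rate: number;
  record(metricName: string, value: number): void;
}

function createRumSampler(
  rate: number,
  send: (payload: RumPayload) => void,
  rand: () => number = Math.random
): RumSampler {
  if (!(rate > 0 && rate <= 1)) throw new Error("rate must be in (0, 1]");
  // Decide once, so a page load reports either all of its metrics or none.
  const sampled = rand() < rate;
  return {
    sampled: sampled,
    rate: rate,
    record: (metricName: string, value: number): void => {
      if (sampled) send({ metric: metricName, value: value, sampleRate: rate });
    },
  };
}

// Scale a sampled count back to an estimate of the real total.
function estimateTotal(sampledCount: number, rate: number): number {
  return Math.round(sampledCount / rate);
}
```

Uniform sampling like this leaves percentiles unbiased; it only widens their confidence intervals, which matters most for small segments.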

Edge cases

•Seasonal traffic spikes change the perf profile — control for them.
•Caching transients post-deploy — first 24h is dirty, exclude.
•Different perf budgets for repeat vs first-visit cohorts.
•Compounding fixes in the same release — attribute carefully.
•Tiny user segments (small markets) — wider confidence intervals.
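The post-deploy exclusion can be sketched as a window split around the deploy timestamp: a fixed baseline window before it, and an equally long window after it that starts only once the warm-up period has passed. The 7-day and 24-hour defaults mirror the numbers in the text; the `TimedSample` shape is an assumption.

```typescript
interface TimedSample {
  timestampMs: number; // epoch milliseconds
  valueMs: number;
}

const HOUR_MS = 60 * 60 * 1000;
const DAY_MS = 24 * HOUR_MS;

interface DeployWindows {
  before: TimedSample[];
  after: TimedSample[];
}

function splitAroundDeploy(
  samples: TimedSample[],
  deployAtMs: number,
  windowDays: number = 7,
  excludeHours: number = 24
): DeployWindows {
  const beforeStart = deployAtMs - windowDays * DAY_MS;
  // Skip the cache warm-up period right after the deploy.
  const afterStart = deployAtMs + excludeHours * HOUR_MS;
  const afterEnd = afterStart + windowDays * DAY_MS;
  return {
    before: samples.filter(
      (s: TimedSample) => s.timestampMs >= beforeStart && s.timestampMs < deployAtMs
    ),
    after: samples.filter(
      (s: TimedSample) => s.timestampMs >= afterStart && s.timestampMs < afterEnd
    ),
  };
}
```

Equal window lengths on both sides keep the weekday/weekend mix comparable. For a gradual rollout, `deployAtMs` would be the time the rollout reached 100%.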

Real-world examples

•Pinterest's PWA rewrite was reported with before/after numbers against the old mobile site: faster time to interactive alongside higher time spent and ad revenue.
•The BBC reported losing an additional 10% of users for every extra second their site took to load.
•Vercel Speed Insights compares per-deploy metrics out of the box.

Senior engineer discussion

Seniors quantify perf changes in metric distributions and business outcomes, not vibes. They use A/B tests when the bet is risky, RUM windows when measuring rollouts, and tie everything to a primary metric. They also publish results so the next person can build on them.