Imagine looking for a flight on a travel website and waiting 10 seconds for the results to load. Feels like an eternity, right? Modern travel search platforms must return results almost instantly, even under heavy load. Yet, not long ago, our travel search engine’s API had a p95 latency hovering around 10 seconds, meaning 5% of user searches, often during peak traffic, took 10 seconds or more. The result: frustrated users, high bounce rates, and, worse, lost sales. In such cases, latency reduction is non-negotiable.
This article is a real-world case study of how we evolved our cloud infrastructure to slay the latency dragon. By leveraging Google Cloud Run for scalable compute and Redis for smart caching, we brought our search API’s p95 latency from ~10 seconds down to ~2 seconds. We will walk through the entire process: the performance bottlenecks, the optimizations we made, and the dramatic improvements they brought.
The Latency Bottleneck
The sky-high latency was a serious problem. Digging into it, we found multiple culprits dragging down our response times, and all of them shared a common trait: they made our search API do a lot of heavy lifting on each request. Here are the issues we had to fix before we could achieve an overall latency reduction:

- Multiple backend calls: For each user query, the service contacted several downstream services (airline fare providers, databases for ancillary info, etc.) sequentially. A single flight search sometimes called three different APIs one after another, each taking around 2 to 3 seconds, so the combined latency stacked up to nearly 10 seconds for some searches (see the sketch after this list).
- No caching layer: With no in-memory or Redis cache to quickly return recent results, every request started from scratch, even for identical searches repeated within minutes. Popular routes and static data (like airport details) were fetched from the database or third parties every single time.
- Cloud Run cold starts: Our service ran on Cloud Run (a serverless container platform). With default settings (0 minimum instances), when traffic was idle and a new request came in, Cloud Run had to spin up a container instance. These cold starts added a significant delay (often 1–2 seconds of overhead). This “startup tax” was deeply hurting our tail latency.
- Concurrency of one: Initially, we configured each Cloud Run container to handle only one request at a time (concurrency = 1). This simplified request handling, but it meant a burst of 10 concurrent searches would immediately spin up 10 separate instances. With cold starts and limited CPU per instance, our system struggled to handle spikes efficiently.
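To make the first bottleneck concrete, here is a minimal sketch of the original request shape. The provider calls are hypothetical stubs that simulate a ~2-second round trip each; because they are awaited one after another, their latencies add up.

```python
import asyncio
import time

# Hypothetical stand-ins for our downstream calls (fare providers, ancillary
# info, airport details); each simulates a ~2-second network round trip.
async def fetch_airline_fares(origin, destination, date):
    await asyncio.sleep(2)
    return [{"route": f"{origin}-{destination}", "date": date, "price": 420}]

async def fetch_ancillary_info(origin, destination):
    await asyncio.sleep(2)
    return {"baggage": "1 x 23kg"}

async def fetch_airport_details(codes):
    await asyncio.sleep(2)
    return {code: {"name": f"{code} airport"} for code in codes}

# The original flow: each call is awaited before the next one starts,
# so the total latency is roughly the sum of all three.
async def search_fares_sequential(origin, destination, date):
    fares = await fetch_airline_fares(origin, destination, date)
    ancillaries = await fetch_ancillary_info(origin, destination)
    airports = await fetch_airport_details([origin, destination])
    return {"fares": fares, "ancillaries": ancillaries, "airports": airports}

start = time.perf_counter()
asyncio.run(search_fares_sequential("NYC", "LON", "2025-07-01"))
print(f"sequential search took ~{time.perf_counter() - start:.1f}s")  # ~6s
```

Stack real provider latencies and a cold start on top of that shape, and 10-second tails stop being surprising.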
All these factors formed a perfect storm for slow p95 latency. Under the hood, our architecture simply wasn’t optimized for speed. Queries were doing redundant work, and our infrastructure was not tuned for a latency-sensitive workload. The good news? Each bottleneck was an opportunity for latency reduction.
What We Changed for Latency Reduction
We targeted latency reduction on two major fronts: caching to avoid repetitive work, and Cloud Run optimizations to minimize cold-start and processing overhead. Here is how the backend evolved:
Introduced a Redis Caching Layer
We deployed a Redis cache to short-circuit expensive operations on hot paths. The idea was pretty straightforward: store the results of frequent or recent queries, and serve those directly for subsequent requests. For example, when a user searched for flights from NYC to LON for certain dates, our API would fetch and compile the results once. It would then cache that “fare response” in Redis for a short period.
If another user (or the same user) made the same search shortly after, the backend could return the cached fare data in milliseconds, avoiding repeated calls to external APIs and database queries. By avoiding expensive upstream calls on cache hits, we dramatically reduced latency for hot queries.
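In essence, this is the classic cache-aside pattern. Here is a minimal sketch using the redis-py client; the key scheme, the five-minute TTL, and the fetch_and_compile_fares helper are illustrative rather than our exact production code.

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

FARE_TTL_SECONDS = 300  # keep fare responses for a few minutes at most


def search_fares(origin: str, destination: str, date: str) -> dict:
    """Cache-aside lookup: serve from Redis on a hit, do the expensive work once on a miss."""
    cache_key = f"fares:{origin}:{destination}:{date}"

    cached = r.get(cache_key)
    if cached is not None:
        return json.loads(cached)  # hot path: served from memory in milliseconds

    # Cache miss: call the fare providers and databases (the slow part),
    # then store the compiled response with a short TTL.
    result = fetch_and_compile_fares(origin, destination, date)
    r.setex(cache_key, FARE_TTL_SECONDS, json.dumps(result))
    return result


def fetch_and_compile_fares(origin: str, destination: str, date: str) -> dict:
    # Hypothetical stand-in for the multi-second upstream calls described earlier.
    return {"route": f"{origin}-{destination}", "date": date, "fares": []}
```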
We applied caching to other data as well, such as static or slow-changing reference data: airport codes, city metadata, and currency exchange rates now come from the cache. Rather than hitting our database for airport info on each request, the service retrieves it from Redis (populated at startup or on first use). This cut out a lot of minor lookups that were adding milliseconds here and there, which add up under load.
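For reference data, the same idea works with a much longer expiry. A minimal sketch, again with redis-py; the one-day TTL and the load_airport_from_db helper are assumptions for illustration.

```python
import json

import redis

r = redis.Redis(decode_responses=True)


def get_airport(code: str) -> dict:
    """Read-through cache for slow-changing reference data (airport metadata)."""
    key = f"airport:{code}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)

    airport = load_airport_from_db(code)       # hypothetical database lookup
    r.set(key, json.dumps(airport), ex=86400)  # refresh at most once a day
    return airport


def load_airport_from_db(code: str) -> dict:
    # Stand-in for the database query we no longer run on every request.
    return {"code": code, "name": f"{code} International", "timezone": "UTC"}
```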
Caching Musts
As a rule of thumb, we decided to “cache what’s hot.” Popular routes, recently fetched prices, and static reference data like airport info were all kept readily available in memory. To keep cached data fresh (important where prices change), we set sensible TTL (time-to-live) expirations and invalidation rules. For instance, fare search results were cached for a few minutes at most.
After that, they expire, so new searches get up-to-date prices. For highly volatile data, we could even proactively invalidate cache entries when we detected changes. As the Redis docs note, flight prices often update only “every few hours,” so a short TTL combined with event-based invalidation balances freshness against speed.
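Event-based invalidation can be as simple as deleting the affected keys when a price update arrives. Here is a sketch of that idea, reusing the fares:origin:destination:date key scheme from the earlier example; the on_fare_update trigger is a hypothetical event handler, not a specific library hook.

```python
import redis

r = redis.Redis(decode_responses=True)


def on_fare_update(origin: str, destination: str) -> None:
    """When a provider pushes new prices for a route, drop its cached fare
    responses immediately instead of waiting for the TTL to run out."""
    # SCAN (rather than KEYS) so we don't block Redis while matching keys.
    for key in r.scan_iter(match=f"fares:{origin}:{destination}:*"):
        r.delete(key)
```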
The outcome? On cache hits, the response time per query dropped from multiple seconds to a few hundred milliseconds or less, thanks to Redis serving data blazingly fast from memory. In fact, industry reports show that an in-memory “fare cache” can turn a multi-second flight query into a response in just tens of milliseconds. While our results weren’t quite that instant across the board, the caching layer delivered a huge boost and significant latency reduction, especially for repeat searches and popular queries.
Optimized Cloud Run Settings for Latency Reduction
Caching helped with repeated work, but we also needed to optimize performance for first-time queries and scale-ups. We therefore fine-tuned our Cloud Run service for low latency.
Always-one warm instance
We enabled minimum instances = 1 for the Cloud Run service. This guarantees that at least one container is up and ready to receive requests even during idle periods, so the first user request no longer incurs a cold-start penalty. Google’s engineers note that keeping a minimum instance can dramatically improve performance for latency-sensitive apps by eliminating the zero-to-one startup delay.
In our case, setting min instances to 1 (and even 2 or 3 during peak hours) meant users weren’t stuck waiting for containers to spin up. P95 latency dropped significantly from this one optimization alone.
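This setting can be changed from the console or the gcloud CLI, and also programmatically. Below is a minimal sketch using the google-cloud-run Python client (run_v2); the project, region, and service names are placeholders, and exact field names may differ slightly across client versions.

```python
from google.cloud import run_v2

client = run_v2.ServicesClient()
name = "projects/my-project/locations/europe-west1/services/flight-search-api"

# Fetch the current service definition, raise the scaling floor, and roll it out.
service = client.get_service(name=name)
service.template.scaling.min_instance_count = 1  # keep one instance always warm

operation = client.update_service(service=service)
operation.result()  # block until the new revision is serving
```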
Increased concurrency per container
We revisited our concurrency setting. After ensuring our code could handle parallel requests safely, we raised the Cloud Run concurrency from 1 to a higher number. We experimented with values like 5, 10, and eventually settled on 5 for our workload. This meant each container could handle up to 5 simultaneous searches before a new instance needed to start.
The result: fewer new instances spawned during traffic spikes, which in turn meant fewer cold starts and less overhead. Essentially, we let each container do a bit more work in parallel, up to the point where CPU usage was still healthy. We monitored CPU and memory closely; our goal was to use each instance efficiently without overloading it.
This tuning helped smooth out latency during bursts: if 10 requests came in at once, instead of 10 cold starts (with concurrency = 1), we’d handle them with 2 warm instances handling 5 each, keeping things snappy.
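The scaling math behind that claim is simple enough to sanity-check in a few lines (a back-of-the-envelope sketch, not a Cloud Run API call):

```python
import math


def instances_needed(concurrent_requests: int, concurrency_per_instance: int) -> int:
    """How many instances a burst requires before extra ones must be started."""
    return math.ceil(concurrent_requests / concurrency_per_instance)


# The burst described above: 10 simultaneous searches.
print(instances_needed(10, 1))  # 10 instances -> likely several cold starts
print(instances_needed(10, 5))  # 2 instances  -> usually covered by warm capacity
```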
Faster startup and processing
We also made app-level tweaks to start up quicker and run faster on Cloud Run. We enabled Cloud Run’s startup CPU boost feature, which gives new instances a burst of CPU during startup, switched to a slim base container image, and loaded only essential modules at startup.
Certain initialization steps (like loading large config files or warming certain caches) were also moved to the container startup phase instead of at request time. Thanks to min instances, this startup ran infrequently. In practice, by the time a request arrived, the instance was already bootstrapped (database connections open, config loaded, etc.), so it could start processing the query immediately.
We essentially paid the startup cost once and reused it across many requests, rather than paying a bit of that cost on each request.
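As an illustration of moving work to startup, here is a minimal sketch assuming a FastAPI service; the config path, Redis connection, and health endpoint are placeholders rather than our actual code.

```python
import json

import redis
from fastapi import FastAPI

app = FastAPI()
state = {}


@app.on_event("startup")
def warm_up() -> None:
    # Paid once per instance (and rarely, thanks to min instances), not per request.
    with open("config/search.json") as f:
        state["config"] = json.load(f)
    state["redis"] = redis.Redis(decode_responses=True)
    state["redis"].ping()  # open the connection before the first user request arrives


@app.get("/healthz")
def healthz() -> dict:
    return {"status": "ok"}
```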
With these optimizations in place, the results were immediately visible. We monitored our API’s performance before and after: p95 latency plummeted from roughly 10 seconds down to around 2 seconds, a 5x faster experience for our users. Even the average latency improved; for cache-hitting queries, it was often under 500 ms.
More importantly, the responses became consistent and reliable. Users no longer experienced the painful and much-dreaded 10-second waits. The system could handle traffic spikes gracefully: Cloud Run scaled out to additional instances when needed. With warm containers and higher concurrency, it did so without choking on cold starts.
Meanwhile, Redis caching absorbed repeated queries and reduced load on our downstream APIs and databases. This also indirectly improved latency by preventing those systems from becoming bottlenecked.
The net effect was a snappier, more scalable search API that kept up with our customers’ expectations of quick responses and a smooth experience.
Key Takeaways
From the entire set of optimizations we undertook for latency reduction, here are the key takeaways and important points for you to consider.
- Measure and target tail latency: Focusing on p95 latency (and above) is crucial, because it highlights the worst-case delays that real users feel. Reducing 95th percentile latency from 10s to 2s made our worst experiences five times faster, a huge win for user satisfaction. So, always monitor these high-percentile metrics, not just the average.
- Use caching to avoid redundant work: Introducing a Redis cache proved to be a game-changer for us. Caching frequently requested data dramatically cuts response times by serving results from memory. The combination of in-memory speed and thoughtful invalidation (using TTLs and updates) can offload expensive computations from your backend.
- Optimize serverless for speed: Cloud Run gave us easy scaling, but to make it truly low-latency we had to use its tuning knobs: keeping minimum instances warm to eliminate cold-start lag, and tuning concurrency and resources so instances are used efficiently without getting overwhelmed. A bit of upfront cost (an always-on instance) can be well worth the payoff in consistent performance.
- Parallelize and streamline where possible: We re-examined our request flow to remove needless serialization and delays. By parallelizing external calls (see the sketch after this list) and doing one-time setup during startup rather than for every request, we shaved seconds off the critical path. Every micro-optimization (non-blocking I/O, faster code, preloading data) adds up in a high-scale, distributed system.
- Continuous profiling and iteration: Lastly, this was an iterative journey. We used monitoring and profiling to find the biggest bottlenecks, addressed them one by one, and measured the impact. Performance tuning is seldom one-and-done; it’s about data-driven improvements and sometimes creative fixes to reach your latency goals.
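Here is the parallelized counterpart to the sequential sketch from earlier; the stub provider calls are again hypothetical, each simulating a ~2-second round trip.

```python
import asyncio
import time

# Hypothetical stubs for the three downstream calls, each ~2 seconds.
async def fetch_airline_fares(origin, destination, date):
    await asyncio.sleep(2)
    return [{"route": f"{origin}-{destination}", "price": 420}]

async def fetch_ancillary_info(origin, destination):
    await asyncio.sleep(2)
    return {"baggage": "1 x 23kg"}

async def fetch_airport_details(codes):
    await asyncio.sleep(2)
    return {code: {"name": f"{code} airport"} for code in codes}

# Issue the independent calls concurrently: total time is roughly the slowest
# call (~2s) instead of the sum of all three (~6s).
async def search_fares_parallel(origin, destination, date):
    fares, ancillaries, airports = await asyncio.gather(
        fetch_airline_fares(origin, destination, date),
        fetch_ancillary_info(origin, destination),
        fetch_airport_details([origin, destination]),
    )
    return {"fares": fares, "ancillaries": ancillaries, "airports": airports}

start = time.perf_counter()
asyncio.run(search_fares_parallel("NYC", "LON", "2025-07-01"))
print(f"parallel search took ~{time.perf_counter() - start:.1f}s")  # ~2s
```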
Conclusion
These latency reduction strategies may seem like a lot to take on at once, but working through them systematically is quite manageable in practice. The real payoff is that we turned our travel search API from a sluggish experience into one that feels instant. In a world where users expect answers “yesterday,” cutting p95 latency from 10s to 2s made all the difference in delivering a smooth travel search experience.