Optimizing Wellhub Autocomplete Service Latency: A Multi-Region Architecture

Key Takeaways

  • Wellhub adopted a multi-region architecture for its Go-based autocomplete service, utilizing Elasticsearch to predict user input and enhance search relevance with geo queries.
  • AWS Global Accelerator was leveraged for efficient traffic routing, using static IPs and TCP optimizations to ensure low-latency connections to the nearest service instance.
  • Data replication was handled through AWS S3 Cross-Region Replication, allowing backups to be restored in different regions, aligning with non-real-time update requirements.
  • Even after deploying the multi-region architecture, latency was further reduced by introducing a pre-fetch endpoint, which enhanced perceived performance and improved the user experience across regions.
  • Mobile network optimizations, like reducing polling and batching requests, lead to faster and more efficient service delivery on mobile devices.

Every company desires fast, highly available, and low-latency services. As engineers, we share this aspiration. However, as the well-known theorem goes, “there is no free lunch”. Achieving these goals requires significant investment and effort. In this article, I will share how Wellhub invested in a multi-region architecture to achieve a low-latency autocomplete service. I will discuss the strategies we adopted, the lessons we learned, and our suggestions based on these experiences.

Context

The autocomplete service, developed in Go, functions by receiving a small text input and predicting the user’s intended completion. This approach enhances search experiences by reducing zero results from misspelled words and creating a smoother user experience. For example, typing “Yo” might return suggestions like “Yoga” and “Yoga for Kids”. We also leverage Elasticsearch’s geo queries feature to improve relevance by tailoring suggestions based on the user’s location.
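
To make this concrete, here is a hedged sketch, with hypothetical index and field names rather than our actual mapping, of an Elasticsearch query that combines prefix matching with a geo decay so nearby activities rank higher; the production service may use a different query shape:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"os"
)

// searchSuggestions posts a prefix + geo-decay query to a suggestions index
// (index and field names are illustrative) and returns the raw response body.
func searchSuggestions(esURL, prefix string, lat, lon float64) ([]byte, error) {
	query := fmt.Sprintf(`{
	  "size": 10,
	  "query": {
	    "function_score": {
	      "query": { "match_bool_prefix": { "suggestion": %q } },
	      "functions": [
	        { "gauss": { "location": { "origin": "%f,%f", "scale": "50km" } } }
	      ]
	    }
	  }
	}`, prefix, lat, lon)

	resp, err := http.Post(esURL+"/suggestions/_search", "application/json", bytes.NewBufferString(query))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

func main() {
	// Example: suggestions for "yo" near Barcelona (coordinates are illustrative).
	body, err := searchSuggestions("http://localhost:9200", "yo", 41.3874, 2.1686)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(string(body))
}
```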

This service does more than just help users find what they’re looking for more quickly; it offloads significant traffic from our primary search infrastructure, enhancing overall system efficiency. Approximately 66% of autocompletions, considering unique searches, result in clicks that redirect to a dedicated screen, bypassing the search results. This streamlines and optimizes the search process.

Our infrastructure runs on Amazon Web Services (AWS), and all of our services are deployed on Kubernetes. We utilize managed solutions whenever possible, such as Amazon Managed Streaming for Apache Kafka, DynamoDB, and Amazon Aurora. However, our Elasticsearch cluster is still managed within our infrastructure (an alternative would be AWS OpenSearch). For our multi-region deployment, we created Landing Zones. Though not exclusive to this project, they provide better isolation, security, and data management.

Before implementing a multi-region architecture, we considered other solutions, particularly because the service’s processing time is around 15 ms, leaving little room for improvement there. One option we considered was geohashing for an optimized cache within the Content Delivery Network (CDN), in our case Amazon CloudFront. However, this approach proved ineffective due to the high variability of the data. Our analysis showed a cache hit rate of only 28% with a time-to-live (TTL) of 7 days and a geohash length of 5, which represents a grid cell of approximately 4,89 km by 4,89 km (3,03 mi by 3,03 mi).
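
For context, the sketch below, which assumes the open source github.com/mmcloughlin/geohash package and an illustrative key format, shows how such a CDN cache key would be derived; because every distinct prefix and cell pair is a separate cache entry, highly variable data keeps the hit rate low:

```go
package main

import (
	"fmt"

	"github.com/mmcloughlin/geohash"
)

// cacheKey builds a cache key from the typed prefix and the user's location,
// bucketed into a geohash cell of precision 5 (roughly 4.89 km x 4.89 km).
func cacheKey(prefix string, lat, lon float64) string {
	cell := geohash.EncodeWithPrecision(lat, lon, 5)
	return fmt.Sprintf("autocomplete:%s:%s", cell, prefix)
}

func main() {
	// Example: the prefix "yo" typed near São Paulo (coordinates are illustrative).
	fmt.Println(cacheKey("yo", -23.5505, -46.6333)) // prints autocomplete:<5-char cell>:yo
}
```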

Given the global distribution of users, we determined that moving the data closer to them was the most effective solution.

Designing a Multi-Region System for Fast Autocomplete

We developed an RFC proposing the construction of a multi-region architecture, which required collaboration with our Platform Team. Migrating the service itself was relatively straightforward: we replicated our Helm chart definitions, created users, applied policies in the new regions, and configured all keys in Vault. Still, as in nearly any migration, the main challenge was handling the data.

Initially, we considered using Elasticsearch Cross-Cluster Replication for its bidirectional, chained, and disaster recovery capabilities. However, these features were not necessary for our needs and would also require acquiring licenses. Our rationale was based on the frequency of index updates, the required freshness of data, and the fact that we did not need to keep all regions perfectly synchronized.

As a best practice, a daily cron job takes snapshots of our indices and stores them in AWS S3 for disaster recovery. The key insight was that AWS S3 offers a Cross-Region Replication feature: snapshots taken in our main region in Virginia (USA) are automatically copied to buckets in other regions, such as São Paulo (BRA) and Dublin (IE), where they can be restored into the local clusters.
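
As a rough sketch of that daily job, assuming the Elasticsearch snapshot API with the S3 repository plugin and illustrative repository, bucket, and index names, the cron job registers a repository backed by the replicated bucket and then creates a dated snapshot:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"strings"
	"time"
)

// put issues a JSON PUT request against the Elasticsearch REST API.
func put(url, body string) error {
	req, err := http.NewRequest(http.MethodPut, url, strings.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("request to %s failed: %s", url, resp.Status)
	}
	return nil
}

func main() {
	esURL := "http://localhost:9200"

	// One-time setup: point a snapshot repository at the S3 bucket that has
	// Cross-Region Replication enabled (bucket and repository names are illustrative).
	repo := `{"type": "s3", "settings": {"bucket": "autocomplete-snapshots"}}`
	if err := put(esURL+"/_snapshot/autocomplete_repo", repo); err != nil {
		log.Fatal(err)
	}

	// Daily step: snapshot the autocomplete indices; S3 replicates the resulting
	// objects to the buckets in the other regions, where they can be restored.
	name := "snapshot-" + time.Now().Format("2006.01.02")
	if err := put(esURL+"/_snapshot/autocomplete_repo/"+name, `{"indices": "autocomplete-*"}`); err != nil {
		log.Fatal(err)
	}
}
```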

Figure 1: Multi-region autocomplete service architecture

As mentioned in the Context section, our autocomplete data consists of our partners and their activities. While changes such as name updates, activity additions, and partner onboarding/offboarding do occur, they do not require near real-time propagation. Consider users in Barcelona searching for activities: they don’t need to know immediately that a partner in Los Angeles has added a new activity, unless they are traveling there, which is less common. Thus, we determined that a maximum replication lag of one day is acceptable for our needs.

We have two daily cron jobs. The first, in the main region, creates snapshots and stores them in an AWS S3 bucket. The second, running in each of the other regions, restores those snapshots through a small pipeline built with Curator, which gives us better error handling and control, particularly from the API’s perspective. You might wonder how the API knows about a new index if we have to create one during each restore operation. This is where Elasticsearch’s alias feature becomes invaluable: the API always queries an alias, and we control which underlying index the alias points to. The pipeline starts by restoring the snapshot to a temporary index. Then, in an atomic operation, it switches the alias from the main index to the temporary one, ensuring the most up-to-date data is served. Next, it reindexes the main index, switches the alias back, and deletes the temporary index. The whole process runs seamlessly, without any disruption to the service, and it includes proper error handling, retries, timeouts, and instrumentation for observability.
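
The atomic alias switch is what keeps the swap invisible to clients. Below is a minimal sketch, with illustrative index and alias names, using Elasticsearch’s _aliases endpoint, which applies the remove and add actions atomically:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"strings"
)

// switchAlias atomically repoints an alias from one index to another using the
// _aliases endpoint, so readers never observe a moment without a backing index.
func switchAlias(esURL, alias, from, to string) error {
	body := fmt.Sprintf(`{
	  "actions": [
	    { "remove": { "index": %q, "alias": %q } },
	    { "add":    { "index": %q, "alias": %q } }
	  ]
	}`, from, alias, to, alias)

	resp, err := http.Post(esURL+"/_aliases", "application/json", strings.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("alias switch failed: %s", resp.Status)
	}
	return nil
}

func main() {
	// Serve reads from the freshly restored temporary index while the main index is rebuilt.
	if err := switchAlias("http://localhost:9200", "autocomplete", "autocomplete-main", "autocomplete-restore-tmp"); err != nil {
		log.Fatal(err)
	}
}
```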

Figure 2: Data replication pipeline

Our data replication strategy is efficient and low-cost, and it does not require the clusters to know about each other or to implement a leader/follower topology. The processing itself is straightforward, since the only per-region difference is authenticating to a different cluster with credentials provided by Vault. But one might wonder: how do we serve this data? Does our app have all endpoints hard-coded? How do we decide which region to route a user to?

AWS Global Accelerator, with its static anycast IP addresses and fine-grained routing controls, routes traffic deterministically to our service instances. By riding the AWS global network, it significantly improves performance, reducing first-byte latency, minimizing jitter, and increasing throughput, offering a faster and more consistent experience than the public internet. It terminates TCP at the edge and uses traffic dials to direct traffic to the nearest region, providing fast failover across regions. This allows us to expose a single endpoint to our mobile application, eliminating the need to manage routing manually.

The adoption of AWS Global Accelerator provided the performance improvements we required, along with an unexpected bonus: fine-grained control over routing. This enhanced routing flexibility allowed us to perform seamless migrations, roll out features gradually by region, and control traffic distribution—adjusting the percentage of traffic directed to different regions as needed. A deep understanding of network fundamentals is crucial to building and maintaining a low-latency service.
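
As an illustration of that control, the sketch below, assuming the AWS SDK for Go v2 Global Accelerator client and a hypothetical endpoint group ARN, adjusts a region’s traffic dial so that only part of the traffic that would normally land there actually does:

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/globalaccelerator"
)

func main() {
	// The Global Accelerator control-plane API is served from us-west-2.
	cfg, err := config.LoadDefaultConfig(context.TODO(), config.WithRegion("us-west-2"))
	if err != nil {
		log.Fatal(err)
	}
	client := globalaccelerator.NewFromConfig(cfg)

	// Send only half of this endpoint group's usual traffic to its region; the rest
	// is routed to the next-closest healthy endpoint group (ARN is hypothetical).
	_, err = client.UpdateEndpointGroup(context.TODO(), &globalaccelerator.UpdateEndpointGroupInput{
		EndpointGroupArn:      aws.String("arn:aws:globalaccelerator::123456789012:accelerator/example/listener/example/endpoint-group/example"),
		TrafficDialPercentage: aws.Float32(50),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```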

The Building Blocks of High Performance

One of the aspects I want to emphasize in this article is the importance of paying attention to every component, being detail-oriented, and understanding the context to find optimal solutions. Achieving this requires thorough research. The following section is based on the classic book High Performance Browser Networking by Ilya Grigorik. It is essential reading for any engineer and manager concerned with delivering high-quality software. Let’s briefly review some of these aspects.

Transmission Control Protocol (TCP)

TCP is an incredible protocol that allows for fine-tuning of its inner workings to optimize data transfer. For example, increasing the TCP initial congestion window enables more data transfer in the first roundtrip and enhances window growth. Disabling the slow-start after idle can improve the performance of long-lived TCP connections. Window scaling allows for better throughput in high-latency connections, and TCP Fast Open can sometimes send data along with the SYN packet.
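
These are mostly kernel- and infrastructure-level knobs, but some of them surface in application code. Below is a minimal sketch, not taken from Wellhub’s code base and assuming a Linux host with golang.org/x/sys/unix available, of a Go HTTP client that keeps connections warm and opportunistically enables TCP Fast Open; settings such as the initial congestion window and slow-start after idle are configured on the host (for example via ip route and sysctl), not in Go:

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/http"
	"syscall"
	"time"

	"golang.org/x/sys/unix"
)

// newTunedClient returns an HTTP client whose connections are reused between
// requests (avoiding repeated handshakes) and which opportunistically enables
// client-side TCP Fast Open on Linux.
func newTunedClient() *http.Client {
	dialer := &net.Dialer{
		Timeout:   5 * time.Second,
		KeepAlive: 30 * time.Second, // keep connections alive between requests
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			err := c.Control(func(fd uintptr) {
				// Linux-only: allow data to ride along with the SYN when the
				// kernel holds a valid TFO cookie for this destination.
				sockErr = unix.SetsockoptInt(int(fd), unix.IPPROTO_TCP, unix.TCP_FASTOPEN_CONNECT, 1)
			})
			if err != nil {
				return err
			}
			return sockErr
		},
	}
	return &http.Client{
		Transport: &http.Transport{
			DialContext:         dialer.DialContext,
			MaxIdleConnsPerHost: 100,              // reuse warm connections instead of reopening them
			IdleConnTimeout:     90 * time.Second, // keep idle connections for reuse, skipping new TCP/TLS handshakes
		},
	}
}

func main() {
	resp, err := newTunedClient().Get("https://example.com/")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```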

AWS Global Accelerator incorporates several of these optimizations. It leverages a large receive-side window and TCP buffers, a large congestion window, and support for jumbo frames, enabling it to send six times more data in each packet. These features collectively enhance the performance and efficiency of data transfer across the network.

We usually focus on the backend, optimizing query execution, applying caches, or rewriting the entire service. Nonetheless, the other components also contribute to the equation.

Mobile networks

When working with mobile applications, it is important to remember that packets do not travel solely over fiber optic cables at the speed of light. Yes, a data packet can circle the Earth’s equatorial circumference of approximately 40,075 km (24,901 mi) in nearly 200 ms over fiber optic cables (133,7 ms in a vacuum). However, mobile networks introduce additional components that impact performance.
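
As a quick sanity check, both figures follow from t = d/v, with light in optical fiber traveling at roughly two-thirds of its vacuum speed:

$$ t_{\text{vacuum}} = \frac{40\,075\ \text{km}}{299\,792\ \text{km/s}} \approx 133.7\ \text{ms}, \qquad t_{\text{fiber}} \approx \frac{40\,075\ \text{km}}{\tfrac{2}{3}\cdot 299\,792\ \text{km/s}} \approx 200\ \text{ms} $$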

Carrier architecture

Carrier architecture is a component to consider when building a high-performance mobile application. These architectures can vary significantly from carrier to carrier and across technologies such as 4G and 5G. A typical LTE network’s high-level architecture includes a Radio Access Network (RAN) and a Core Network (CN).

The Radio Access Network mediates data packet traffic and provides user devices access to the reserved radio channel. It also informs the Core Network of the location of all users. This is necessary because a user can be associated with any radio tower, and handovers between towers must be managed as the user moves. An interesting note is that if you look at the RAN representation, which uses cells, you can understand why we call them cell phones.

Figure 3: A rudimentary LTE architecture

The Core Network is responsible for data routing, accounting, and policy management, effectively connecting the radio network to the public internet. It comprises several components, each vital to the network’s functionality:

  • The Packet Gateway (PGW): manages the connection to the public internet, including allocating IP addresses for devices. It also handles packet filtering, inspection, and protection against denial-of-service (DoS) attacks.
  • The Mobility Management Entity (MME): acts as a database of users, maintaining information about their location, account type, and billing status.
  • The Serving Gateway (SGW): when the PGW receives a packet, it relies on the SGW to determine the user’s location. It queries the MME and establishes a connection with the appropriate radio tower to send the data, ensuring seamless communication across the network.

This complex yet transparent process significantly influences your mobile application’s latency and performance.

Radio Resource Controller (RRC)

Another component is the Radio Resource Controller (RRC), which manages the connections between the device and the radio base station, affecting latency, throughput, and battery life. To manage data transfer over a wireless channel, the RRC must either allocate resources or signal the device about incoming packets. The RRC uses a state machine that defines the radio resources available in each state, influencing performance and energy consumption. Key states include RRC Idle and RRC Connected. These states are governed by timeouts. For example, in an HSPA+ connection, the state transitions from high power, with dedicated upstream and downstream network resources, to a lower power state after 10 seconds of inactivity. This lower power state consumes less energy but relies on a shared, limited, low-speed channel (less than 20 Kbps). If a data request occurs after this transition, it triggers a shift back to a high-power state. This negotiation process takes several milliseconds, increasing latency and draining battery life.

Fine-Tuning for Speed and Efficiency

The literature on optimizing mobile networks offers several significant points to evaluate. One important aspect is minimizing polling, as frequent polling is exceptionally expensive on mobile networks. Instead, it is recommended to use push delivery and notifications whenever possible. Additionally, outbound and inbound requests should be coalesced and aggregated to reduce the number of transmissions. Noncritical requests should be deferred until the radio is already active, avoiding the need to wake the radio for minor operations.

Eliminating unnecessary application keepalives is also crucial, as the carrier network manages the connection lifecycle independently of the device’s radio state. Anticipating network latency overhead is another important consideration. Establishing a new connection can take up to 600 ms in 4G and up to 3500 ms in 3G, considering the control plane, DNS lookup, TCP handshake, TLS handshake, and the HTTP request. One effective strategy we adopted was to send an “empty” request right before the user interacts with an input component, thereby preemptively setting up the connection and making subsequent requests faster.
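
To illustrate the pre-fetch idea, here is a minimal sketch, with hypothetical route names rather than our actual API, of a warm-up endpoint the mobile client can call as soon as the search input gains focus, so the control-plane, DNS, TCP, and TLS costs are paid before the first real autocomplete request:

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	mux := http.NewServeMux()

	// Warm-up endpoint (route name is illustrative): it does no work and returns
	// immediately, so the client pays the connection-setup cost ahead of time.
	mux.HandleFunc("/v1/autocomplete/prefetch", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNoContent)
	})

	// The real autocomplete endpoint then reuses the already-established connection.
	mux.HandleFunc("/v1/autocomplete", func(w http.ResponseWriter, r *http.Request) {
		// ... query Elasticsearch and return suggestions (omitted) ...
		w.Write([]byte(`{"suggestions": []}`))
	})

	log.Fatal(http.ListenAndServe(":8080", mux))
}
```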

Furthermore, it’s beneficial to batch data transfers by downloading as much data as possible quickly and then allowing the radio to return to idle. This approach maximizes throughput and battery life, as the mobile radio interface is optimized for this, offering significant performance and efficiency improvements for mobile applications.

Measuring the Impact of Our Optimizations

Before implementing the multi-region architecture, our users were experiencing a p90 latency of 600 ms to 700 ms, as highlighted in Figures 4 and 5, which resulted in a poor experience. This level of latency severely undermined the usefulness of our autocomplete service: waiting that long for a suggestion is counterproductive, as users might prefer typing the entire text themselves, even at the risk of making typos.

Afterward, the latency improved, but not as much as we had hoped. It decreased by 200 ms at p90, significantly better than before but still not as low as expected for a multi-region setup given the investment. Our goal was to achieve a latency of around 100 ms to be considered instantaneous.

We then decided to anticipate the latency overhead by implementing a pre-fetch endpoint, mentioned in the Fine-Tuning for Speed and Efficiency section. After this change, perceived latency dropped to 123,3 ms at p90 on iOS and 134,6 ms at p90 on Android. The pre-fetch endpoint absorbs the heavy lifting, with a p90 latency of 561 ms on Android and 471 ms on iOS, but users do not perceive this cost because the request is made preemptively, just before they start typing.

Figure 4: Perceived latency breakdown before and after the multi-region architecture for iOS

Figure 5: Perceived latency breakdown before and after the multi-region architecture for Android

Another significant change resulting from the project was a noticeable shift in network usage patterns. Among iOS users, Wi-Fi usage dropped from 63,47% to 49,33%, suggesting that more users are now accessing the app through their mobile networks. This shift is further supported by an increase in LTE usage, which rose from 24,58% to 29,30%.

Android users saw similar benefits, with Wi-Fi usage decreasing from 67,08% to 51,18%. Meanwhile, LTE usage grew from 30,31% to 44,46%. These changes suggest that network optimization has enabled users to rely more on mobile networks than before, enhancing their overall experience.

Figure 6: Latency perceived by users before and after the multi-region project

Conclusion

In a Socratic fashion, “understanding the question is half the answer”. Every piece of software, despite following several common patterns, is unique. This uniqueness stems from various factors such as business requirements, technology constraints, and the team building it. The key is the evaluation and the systematic analysis of tradeoffs.

Optimizing a service through a multi-region approach might not be the best choice for many use cases, and could be more complex and challenging to implement in others. In our case, this architecture delivered significant latency reductions, though not as large as we initially expected, and further research revealed that a simple technique, pre-fetching, could reduce perceived latency even further.

Future endeavors may include a deeper analysis using more robust metrics suited to systems of this kind, such as Minimum Keystroke Length (MKS) and Effort Saved (e-Saved). Both are user-assist metrics that measure how helpful the service is: e-Saved, for example, indicates the ratio of characters a user does not need to type to complete a query. Furthermore, improving observability by distinguishing metrics across regions could reveal different access patterns, potentially requiring region-specific optimizations.