
Scaling Confidently with the Load and Fault Team

Authors Vishal Kuo and George Tong are engineers working on the Load and Fault team at Robinhood.

The past two years have been an exciting time for Robinhood engineering as we’ve launched new products to millions of customers and scaled our operations to meet the enormous demand. During this rapid growth period, we spun up the Load and Fault team to increase product safety and reliability for our customers. In this blog post, we’ll look at how this team tackles daily load testing and the core principles that inform our decision-making process.

Preparing for Scale

Building for scale is a top priority for all our engineering teams. For our backend services, this means increasing confidence that they can handle a target number of queries per second (QPS) without sacrificing latency or success rates. To that end, the Load and Fault team built a read load testing framework founded on the following principles:

  • Safety First: We wanted our framework to have minimal production impact. Load testing often requires pushing services to their peak capacity and our framework would have to be mindful of not overloading services further than that.
  • High Fidelity: Our load tests should provide high signal to service owners; we aimed to run in production whenever possible and to use real production traffic instead of simulating it.
  • Easy to Automate: Running load tests as special events is undesirable: service owners make one-off adjustments for them, and the high variability of running in production makes trends hard to see from isolated runs. We wanted to build a framework that can run regularly (aspirationally, on every deployment) and without involvement from the Load and Fault team.

Load Testing Architecture

Our load testing architecture incorporates safety, fidelity, and simplicity as its core principles, and is composed of two major systems: request capture and request replay.

Request Capture

Figure 1: Our request capture pipeline

In order for our load tests to provide value to our backend services, they need to be able to accurately simulate realistic high load scenarios. We decided to go straight to the source and collect real customer traffic hitting our backend services to use during load tests. Doing so allowed us to accurately match the distribution of traffic to backend endpoints, ensuring that issues we uncover are real issues that will affect our customers.

Backend services at Robinhood sit behind an nginx reverse proxy that load-balances and routes traffic. The Load and Fault team took advantage of this existing infrastructure and added nginx logging rules that sample a percentage of traffic and log the user_uuid, URI, and timestamp of each sampled request (1). Filebeat (2) monitors the nginx logs and regularly pushes new log lines to Kafka (3); Logstash then pulls them from Kafka and pushes them to S3 (4).
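
To make the sampling idea concrete, here is a minimal Python sketch of percentage-based sampling. It is not our nginx configuration: the function names, the use of a request ID as the sampling key, and the JSON log format are all assumptions made for illustration. The sketch hashes a key into 10,000 buckets so that the same key always gets the same decision.

```python
import hashlib
import json


def should_sample(request_id: str, sample_percent: float) -> bool:
    """Return True for roughly sample_percent% of request IDs.

    Hypothetical helper: hashing makes the decision deterministic per key,
    which is one common way to sample; the real nginx rules may differ.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < sample_percent * 100


def to_log_line(user_uuid: str, uri: str, timestamp: str) -> str:
    """Serialize the three captured fields as one JSON log line (assumed format)."""
    return json.dumps(
        {"user_uuid": user_uuid, "uri": uri, "timestamp": timestamp},
        sort_keys=True,
    )


if __name__ == "__main__":
    # Placeholder values, not real traffic.
    if should_sample("request-123", sample_percent=100):
        print(to_log_line("user-abc", "/example/path", "2021-01-01T00:00:00Z"))
```

Hashing rather than drawing a random number keeps the decision reproducible, which makes a sampling rule easier to test.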

Once the raw data is in S3, our data pipeline (5) adjusts its format, filters for only GET requests, appends a read-only scoped authentication token to each request (6), and stores the results back in S3 (7). At this point, the data is ready to be used by our request replay system.
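
The transformation step can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the pipeline itself: it assumes each raw line is a JSON object with `method`, `uri`, `timestamp`, and `user_uuid` fields, that tokens are attached as a Bearer `Authorization` header, and that `mint_read_only_token` is a hypothetical callable supplied by the caller.

```python
import json
from typing import Callable, Dict, Iterable, Iterator


def build_replay_records(
    raw_lines: Iterable[str],
    mint_read_only_token: Callable[[str], str],  # hypothetical token source
) -> Iterator[Dict]:
    """Turn raw captured log lines into replayable, read-only GET requests."""
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Keep only GET requests so that replay never mutates customer data.
        if record.get("method", "").upper() != "GET":
            continue
        token = mint_read_only_token(record["user_uuid"])
        yield {
            "method": "GET",
            "uri": record["uri"],
            "timestamp": record["timestamp"],
            # Assumed header scheme; the real token format is not specified.
            "headers": {"Authorization": "Bearer " + token},
        }


if __name__ == "__main__":
    sample = [
        json.dumps({"method": "GET", "uri": "/a", "timestamp": "t1", "user_uuid": "u1"}),
        json.dumps({"method": "POST", "uri": "/b", "timestamp": "t2", "user_uuid": "u2"}),
    ]
    for replayable in build_replay_records(sample, lambda user: "token-for-" + user):
        print(json.dumps(replayable))
```

Writing the step as a generator lets it process the capture one line at a time, so a large capture never has to fit in memory.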

Request Replay

Our load tests are built on top of Kubernetes primitives to take advantage of its job execution and scaling capabilities. Tests are broken up into two distinct components: a pool of load generating pods (a Kubernetes deployment) and an event loop that manages these pods (a Kubernetes job). The following diagram depicts how these two interact.

Figure 2: Our request replay architecture

The event loop kicks off a test by creating the load generating pool and then continually monitors the target service’s health. As the test goes on, the loop gradually increases the QPS going to the service by increasing the number of pods in the pool. If, at any point, the loop detects that the target service is reporting as unhealthy, it immediately stops the test and removes the load generating pool.
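
The control flow described above can be sketched as a small ramp loop. This is a simplified illustration, not our event loop: `scale_pool`, `is_healthy`, and `teardown_pool` are hypothetical callables standing in for the Kubernetes and monitoring integrations, and the step size and interval are made-up defaults.

```python
import time
from typing import Callable


def run_load_test(
    scale_pool: Callable[[int], None],   # hypothetical: set the pod count of the pool
    is_healthy: Callable[[], bool],      # hypothetical: target service health check
    teardown_pool: Callable[[], None],   # hypothetical: remove the load generating pool
    start_pods: int = 1,
    step_pods: int = 1,
    max_pods: int = 10,
    step_interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Ramp load by adding pods; stop as soon as the target looks unhealthy.

    Returns the largest pod count at which the service stayed healthy.
    """
    peak_healthy_pods = 0
    pods = start_pods
    try:
        while pods <= max_pods:
            scale_pool(pods)
            sleep(step_interval_s)  # let the new load level settle
            if not is_healthy():
                break  # stop the test on the first unhealthy signal
            peak_healthy_pods = pods
            pods += step_pods
    finally:
        # Always remove the pool, even if a callable raised.
        teardown_pool()
    return peak_healthy_pods
```

Putting the teardown in a `finally` block means the pool is removed on every exit path, which mirrors the safety-first intent: a crash in the loop should never leave load running.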

The load generating pods themselves run k6, an open source load testing tool, which streams requests from the S3 bucket we populated in the request capture pipeline. The pods are intentionally designed with simplicity and performance in mind so that they avoid adversely impacting the Kubernetes cluster they run on.
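
The pods run k6 rather than custom code, so the following Python sketch only illustrates the general idea of streaming captured requests at a steady rate from a single generator. The `paced` helper and its parameters are hypothetical, and the per-pod rate is an assumed knob, not a documented setting.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def paced(
    requests: Iterable[T],
    qps: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield items from a (possibly streamed) source at roughly `qps` per second."""
    if qps <= 0:
        raise ValueError("qps must be positive")
    interval = 1.0 / qps
    next_send = clock()
    for request in requests:
        delay = next_send - clock()
        if delay > 0:
            sleep(delay)  # wait until this request's scheduled send time
        yield request
        next_send += interval  # schedule against the plan, not the actual send time


if __name__ == "__main__":
    for item in paced(["/a", "/b", "/c"], qps=2.0):
        print("send", item)
```

Because the source is consumed lazily, the generator never holds the full capture in memory, and scheduling against planned send times keeps the average rate stable even when individual sends are slow.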

Safety Mechanisms

Safety pervades every aspect of engineering at Robinhood, and our load test design is a perfect example of this principle in practice. Our request capture system stores only GET requests, so the request replay system sends only read-only traffic and never sends a request that affects customer data. During live load tests, our automated service health monitoring guards against service degradation by stopping the test when the target service reports as unhealthy. In addition, there are multiple clear safety levers: deep links in Slack notifications, UI test controls, and ultimately a big red smash-glass button to bring down the whole system. These safety mechanisms are depicted below.

Figure 3: Slack messaging containing a Stop Test deeplink
Figure 4: UI controls to stop an individual test
Figure 5: A big red button to stop all load tests

Load Test Wins

Our load testing framework has helped us significantly improve our services’ stability and reliability. Robinhood engineers now have the ability to:

  • Detect performance regressions
  • Replicate a high load failure, and then verify a resulting fix
  • Identify a service’s next scaling bottleneck before our customers do
  • Ensure that a new service rolling out is able to withstand Robinhood scale

All of these enhancements have made a measurable impact on the firm. In particular, we recorded a 75% drop in load-related incidents that affected customers between Q1 and Q2 of 2021, and we recently went through our own IPO with no major load-induced incident.

Looking Forward

The team has already made a tremendous impact in its short lifetime at Robinhood, and is looking forward to further improving our ability to detect issues long before our customers could experience them. In the next couple of quarters, we plan on tackling projects like load testing mutating (POST) requests, experimenting with fault testing, and expanding our tests to include gRPC communication. Customer safety is our highest priority while executing on these projects and we’ve been taking additional measures, such as building out a separate load test environment, to ensure isolation of these new test types.

The Load and Fault team is growing rapidly as well; we’ve already gone from 3 engineers to 8 within a year, and we’re looking to grow even more. If you’re a safety-first thinker who believes in Robinhood’s mission, we’re hiring!

All investments involve risk and loss of principal is possible.

Robinhood Financial LLC (member SIPC), is a registered broker dealer. Robinhood Securities, LLC (member SIPC), provides brokerage clearing services. Robinhood Crypto, LLC provides crypto currency trading. All are subsidiaries of Robinhood Markets, Inc. (‘Robinhood’).

© 2021 Robinhood Markets, Inc.
