How Brex drove reliability standards with Scorecards and reports
Fintech company Brex offers companies an alternative to traditional banking. Brex provides over 20,000 startups and enterprises with a spend management platform, as well as corporate credit cards, business accounts, and bill pay services. Brex cards are used in more than 100 countries, and the company has processed billions of dollars’ worth of transactions.
What was the problem?
"Prior to Cortex, we had a lot of best practices and standards scattered across our knowledge bases as documentation, but it was impossible to track what services were following the standards that we had defined. It was often the case that incidents would be caused by assumptions that these standards are being followed throughout our software, when in actuality this was not always true, especially for legacy services."
As a financial company, Brex depends on reliability. For a long time, the company invested heavily in its SRE team but had no insight into whether services actually followed best practices. Brex’s legacy services were in varying states of compliance, and tracking that compliance involved a lot of manual work and spreadsheets. As a consequence, team members often assumed that standards were being followed when a number of services were in fact out of compliance.
How was this solved before?
Prior to using Cortex, Brex invested a lot of time and resources in monitoring and observability. They conducted weekly tech ops meetings, where all of the leads gathered to review the status of applications and services and decide where to invest. The company has hundreds of services, so it’s easy to imagine just how much time and effort would be involved in manually tracking the performance of each one. For the most part, teams built and shared one-off spreadsheets to monitor the reliability of services.
How has Cortex helped Brex drive progress on reliability standards?
Cortex gave Brex the ability to codify best practices and define standards, and through Scorecards, they could automatically track and monitor how services were performing according to their standards.
Engineers at Brex found it easy to use Cortex from the start. Because so many of their integrations were available out of the box, they didn’t have to customize anything — they were able to just set up their Scorecards and let Cortex do the rest.
Cortex also offered Brex a clear way to drive organization-wide progress on reliability standards. Using Initiatives, Brex found a more robust way of tracking migrations to their clearly defined best practices. Recently, Brex used Cortex to automate a move toward a new, unified approach to managing secrets. By using Cortex to manage this large and important migration, the teams were able to significantly reduce the burden and toil required to bring every service up to the new standard.
What business value have you gained from using Cortex?
“The Scorecards have been really [helpful] to communicate with even our engineering directors and CTO about what’s happening across the org. Right now… there’s been more and more conversations on infra in operation review meetings, reliability meetings… And so I think a lot of this [data] will hopefully drive those conversations as well.”
— Vamsi Chitters, Engineering Manager
Brex saw value in the platform almost as soon as they began using it. The visibility that Scorecards gave developers inspired them to proactively fix issues and improve their services. The ladder system within Scorecards, which groups maturity standards into tiers, has prompted organic improvement in services as engineers become more motivated to move up levels.
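To make the ladder idea concrete, here is a minimal sketch of tiered maturity levels. It is not Cortex's rule syntax or Brex's actual Scorecard: the `Service` fields, the level names, and the rules are all hypothetical, and the sketch assumes a service reaches a level only if it passes every rule at that level and at all lower levels.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical service metadata; field names are illustrative only.
@dataclass
class Service:
    name: str
    has_oncall: bool
    has_runbook: bool
    uses_unified_secrets: bool


# Levels ordered from lowest to highest. Each level lists the rules
# (predicates on a Service) that must all pass. Names are made up.
LADDER = [
    ("Bronze", [lambda s: s.has_oncall]),
    ("Silver", [lambda s: s.has_runbook]),
    ("Gold", [lambda s: s.uses_unified_secrets]),
]


def maturity_level(service: Service) -> Optional[str]:
    """Return the highest level reached, or None if the lowest level fails."""
    reached = None
    for level, rules in LADDER:
        if not all(rule(service) for rule in rules):
            break  # a failed level blocks every level above it
        reached = level
    return reached


payments = Service("payments", has_oncall=True, has_runbook=True,
                   uses_unified_secrets=False)
print(payments.name, maturity_level(payments))
```

Because a failed level blocks the ones above it, the next missing rule is always the most useful thing for a team to fix, which is one way a ladder can encourage step-by-step improvement.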
Brex has been able to share reports at tech ops meetings, too, making it easier for the group to identify areas of risk and opportunities for investment. Insights linking maturity level with mean time to resolution, for example, offered a clear return on investment: the more they invest in improving services’ maturity, the better outcomes they’ve seen in response and resolution times.
These kinds of insights have made it easier to make the case for greater investment, too. With the data in Cortex, the SRE team could show leadership concrete evidence that investment in reliability practices has a real return for the company.
What’s your favorite feature of Cortex?
“...we are seeing some promising data that services meeting the basic level objectives, or even higher quality standards, are performing better when it comes to incidents — whether that’s mean time to detection or just having overall lower number of incidents.”
The ease of use and extensibility of Cortex were especially beneficial to Brex in the beginning, allowing them to get up and running quickly. As they rapidly expanded their initial set of Scorecard rules, the flexibility provided by Cortex let them represent their own definitions of reliability and service maturity.
This applies not only to Scorecards, but to reporting, too. Although the past six months have been dedicated to establishing and monitoring standards, the engineering team has begun tying standards to incident metrics. With Cortex, they can evaluate whether improving the overall quality of software has a meaningful impact on incidents. Recently, they’ve seen evidence that the services that meet or surpass quality standards do perform better in terms of incidents.
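As a rough illustration of tying standards to incident metrics, the sketch below groups incidents by the maturity level of the affected service and averages the time to resolve them. The function name, the input shape, and the sample numbers are all assumptions made for illustration; they are not Brex data and not a Cortex report format.

```python
from collections import defaultdict
from statistics import mean


def mean_resolution_by_level(incidents):
    """Average minutes-to-resolve per maturity level.

    `incidents` is an iterable of (maturity_level, minutes_to_resolve) pairs.
    """
    by_level = defaultdict(list)
    for level, minutes in incidents:
        by_level[level].append(minutes)
    return {level: mean(values) for level, values in by_level.items()}


# Made-up numbers purely for illustration.
sample = [("Bronze", 90), ("Bronze", 70), ("Gold", 30), ("Gold", 20)]
print(mean_resolution_by_level(sample))
```

A comparison like this only shows correlation. Services at higher levels may differ in other ways, such as traffic or team size, so it is best read as supporting evidence for further investment, not as proof of cause.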
What’s next in your reliability journey with Cortex?
Now that Brex has clearly defined standards and established maturity levels within their Scorecards, they have the ability to automatically track software quality — and they can keep those standards updated as their best practices evolve over time. The company plans to create Scorecards for other software and resources in their system, such as databases. The next step is going beyond microservices, and tracking and enforcing reliability for all software at Brex.
The company has plans to work with larger, more established enterprises in the near future, so there will be more of an emphasis on reliability as time goes on. As Brex evolves, they plan to use Cortex to make sure that they maintain the high quality of their engineering, even as they take on more significant initiatives and customer profiles.
Cortex is excited to support Brex as they take the next steps in their reliability journey. To see how Cortex can help your organization enforce reliability standards and improve the quality of your software, book a demo with us today!