
How LetsGetChecked doubled deployment frequency and slashed MTTR with Cortex

Introduction

LetsGetChecked is a global healthcare company that provides the tools to manage health from home through health testing, genetic sequencing, virtual consultations, and medication delivery. An end-to-end model that includes manufacturing, logistics, prescriptions, and clinician support puts enormous pressure on the software behind every click, call, and delivery. 

By 2021, scaling these systems to meet pandemic-driven demand was a tall order. But Javier de Vega Ruiz, Chief Software Engineer, was no stranger to the challenge. Coming from banking and internet gaming, where low-latency systems drove sub-second decisions, he understood the compounding effect that even the tiniest disruptions have at scale. “I came to LetsGetChecked because the technical challenge is familiar, but the stakes are higher,” he says. “Here, even a brief lapse in reliability could keep a patient from accessing a critical care plan on time, or delay a physician’s ability to share instructions for a new prescription.”

Javier’s team faces the daily challenge of maintaining reliability while innovating to meet new customer demands. Fortunately, they have the right tools for the job. This is the story of how LetsGetChecked slashed time to restore while doubling deployment frequency with microservices, containerization, and Cortex.

The move to microservices

Not long after hiring their 15th engineer, the team realized they needed to change how they worked. Javier describes the point at which he knew their monolithic development model would threaten their ability to maintain momentum: “We saw too many engineers fighting to merge code in the same few codebases, creating bottlenecks in our deployment pipeline.” He continues, “Our services had become too big and too slow, hurting our ability to keep up with the innovation our growing user base required.”

To meet customer demand, the team migrated to a microservice model. Existing services were split and new ones were created, enabling devs to work in parallel and recapture efficiencies of scale. But as services ballooned from 15 to 100 in under a year, the next challenge became obvious: how to preserve quality without losing newfound momentum.

Balancing quality and velocity in a heavily regulated industry 

Lengthy compliance requirements like HIPAA and HITRUST, along with the potentially life-threatening repercussions of bad code, have historically put a ceiling on how fast healthtech teams can innovate. Even in a microservice model, which promises a greater rate of innovation, it’s often the speed at which teams can safely meet standards of compliance and reliability that matters most. Here, the competitive edge lies in operational excellence.

To enforce standards of quality as they grew, LetsGetChecked turned to Cortex. Javier notes, “The fastest way to ensure both quality and compliance measures are followed is to bake the necessary steps into every initiative. Whether you’re migrating systems or building a service—everything should include a step-by-step plan for your devs. We chose Cortex because it gave our fast-growing team a clear path to operational excellence.”

Reducing MTTR with Cortex Catalogs and Developer Homepage

LetsGetChecked’s TechOps team is tasked with triaging reported incidents. After a ticket comes in, one of the first steps is to find out who’s responsible. Without a system for cataloging and routinely updating ownership information, the TechOps team was forced to “blast” Slack channels of 100+ engineers, which proved inefficient.

Now, with auto-synced service and resource catalogs in Cortex, the TechOps team can quickly look up who’s on call and how to contact them, while the Jira ticket is dropped into that developer’s priority queue in their personalized Developer Homepage. “One of the biggest improvements we’ve seen since implementing Cortex is in our Mean Time to Restore, which we were able to reduce by 67%. Being able to quickly find up-to-date service information, and have requests immediately prioritized, is a small operational change that has enormous impact.”

Driving operational excellence with Cortex Scorecards

Beyond easing access to information about existing software, LetsGetChecked is raising the bar for new software with scorecards for onboarding, service maturity, and deployment frequency. Javier explains why he believes this particular feature helps drive adoption: “There’s a gamification element in Cortex where teams can compare themselves to one another. Anyone can see straight away who’s improving, who’s falling behind, and who needs more support.”

Onboarding: LetsGetChecked has scorecards for new service and resource onboarding to ensure minimum standards from the start. Rules include requirements for documentation, dashboards, and deployment information; group checks to categorize service types; and integration-enabled checks for a Git repo, SonarQube, and a Slack channel.

Service Maturity: LetsGetChecked’s Service Maturity scorecard includes best practices like ensuring the service in question has on-call escalation in PagerDuty, fewer than 10 issues in Jira, and at least 75% SonarQube coverage. In the future, it will check whether the repository has transitioned to trunk-based development as a branching strategy, which the team favors because of its correlation with delivery metric improvements.
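To make the three Service Maturity rules concrete, here is a minimal Python sketch of how such checks could be evaluated against a snapshot of a service. This is not Cortex's rule syntax or API: the `ServiceSnapshot` class, its field names, and the `evaluate` and `score` helpers are all hypothetical, and the only values taken from the case study are the thresholds themselves.

```python
from dataclasses import dataclass


@dataclass
class ServiceSnapshot:
    """Hypothetical snapshot of a service's state; field names are assumptions."""
    name: str
    has_pagerduty_escalation: bool
    jira_issue_count: int
    sonarqube_coverage_pct: float


# Thresholds come from the scorecard described above; the rule
# representation itself is illustrative only.
RULES = {
    "on-call escalation in PagerDuty": lambda s: s.has_pagerduty_escalation,
    "fewer than 10 issues in Jira": lambda s: s.jira_issue_count < 10,
    "at least 75% SonarQube coverage": lambda s: s.sonarqube_coverage_pct >= 75.0,
}


def evaluate(service: ServiceSnapshot) -> dict[str, bool]:
    """Return a pass/fail result for each rule."""
    return {rule: bool(check(service)) for rule, check in RULES.items()}


def score(service: ServiceSnapshot) -> float:
    """Fraction of rules the service passes, between 0.0 and 1.0."""
    results = evaluate(service)
    return sum(results.values()) / len(results)


if __name__ == "__main__":
    # Made-up example data, not a real LetsGetChecked service.
    svc = ServiceSnapshot("example-service", True, 12, 80.0)
    for rule, passed in evaluate(svc).items():
        print(f"{'PASS' if passed else 'FAIL'}  {rule}")
    print(f"score: {score(svc):.2f}")
```

Expressing each rule as a named pass/fail check is what lets teams compare scores with one another and see exactly which practice a service is missing.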

Deployment Frequency: The Deployment Frequency scorecard sets goals for weekly deployment metrics and enables teams to benchmark progress against peers. Javier shares, “With Cortex, we set a year-long goal of hitting 25 deployments per week. Within months, we went from 17 per week to 32.”
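For readers who want to check the arithmetic behind those figures, the short Python sketch below works through it. The helper name `pct_change` is an assumption for illustration; the three numbers (17, 25, and 32 deployments per week) are the ones quoted above.

```python
def pct_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100


baseline = 17  # deployments per week before the goal was set
goal = 25      # year-long target
actual = 32    # deployments per week reached within months

print(f"increase over baseline: {pct_change(baseline, actual):.1f}%")  # 88.2%
print(f"margin over goal: {pct_change(goal, actual):.1f}%")            # 28.0%
print(f"goal met: {actual >= goal}")                                   # True
```

In other words, weekly deployments rose by roughly 88%, close to double, and exceeded the 25-per-week target by 28%.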

Accelerating migration to K8s with Cortex Initiatives & Scaffolding

While alignment to best practice can improve service and resource health, this is of course not the only way to think about reliability and performance. Proactive infrastructure decisions often have the biggest impact on these metrics, but ensuring timely migration isn’t always easy. LetsGetChecked uses Cortex to drive org-wide initiatives on strict timelines.

Javier provides an example of an exciting project the team is undertaking: moving services off EC2 and Windows to a containerized model on Kubernetes and Linux. With higher-density deployments and elastic provisioning now much easier to set up, the team’s monthly hosting costs are set to drop by 75%. But that’s not all they stand to gain, says Javier: “I’m super excited. The savings are huge. But so are the improvements to developer experience. Soon, devs will be able to run advanced testing scenarios before deploying against their own ‘local cloud’ on their laptop, shortening feedback cycles and increasing productivity.”

To accelerate these outcomes, LetsGetChecked leverages Cortex Initiatives, Scaffolding, and custom alerting. “Cortex makes action obvious for devs, helping them stay focused on priorities, and saving me 4-5 hours a week following up,” says Javier. He continues, “In the case of our Kubernetes migration, the ability to drive timely action cut our projected timeline from 24 months down to 16—enabling us to capture savings and productivity benefits 8 months earlier than expected.” 

Advice for those building a culture of continuous improvement 

LetsGetChecked has done a remarkable job of meeting demand without compromising on the quality or reliability of the life-saving systems they support. We asked Javier if he had any advice for those beginning their journey towards operational excellence using a platform like Cortex: 

“In terms of where to start, I would say gather metrics as soon as possible. The obvious ones are software delivery and SRE metrics, like deployment velocity and MTTR. These tend to be easiest to track and change, unlocking maximum adoption. You should also optimize your platform in accordance with those metrics. If velocity is a priority, use Cortex’s Scaffolder to bootstrap new services, and ensure you add Initiatives to set deadlines for certain audiences.”

He continues with a final note, “Overall, it’s important the team appreciate the real scope of their work. The finish line isn’t merging a PR [pull request], but ensuring software they own actually works in production. Collecting telemetry, setting up alerts, and owning the fix. Cortex brings all of the inputs for this work into one space so devs can really take ownership over what they build, without slowing them down.”

To learn more about how Cortex can help you build a culture of continuous improvement, and accelerate your journey towards engineering excellence, check out our website, or book some time with our team of experts.

Javier de Vega Ruiz - Chief Software Engineer