Best Practice
SRE

Real Production Readiness with Internal Developer Portals

How can teams reach alignment on ever-changing standards for production readiness? In this blog, we summarize our January 2024 interview with Alina Anderson from Outreach.io, during which we discuss how her team uses Cortex’s scorecarding to coordinate production readiness standards across stakeholders. This article offers actionable guidance for DevOps Engineers, SREs, and managers seeking to achieve more reliable services by streamlining production readiness processes while maintaining a culture of continuous improvement.

By Justin Reock - January 19, 2024

Introduction

In cultures of continuous improvement, the criteria by which teams define a release's fitness for production are flexible by definition. Engineering organizations strive to balance risk and velocity, aiming for high-quality releases on a cadence that doesn’t impede overall business throughput.

I recently interviewed Alina Anderson, Principal Technical Program Manager at Outreach.io, whose team uses Cortex’s scorecarding capability to define production readiness in a way that is easy to understand, quantifiable, and transparent, and that provides a base level of alignment for release stakeholders.

The full recording and transcript of that interview are available here. In our interview, Alina describes her experience successfully leading a team through an internal developer portal adoption initiative, and how they use their portal to align cross-functionally on readiness standards. In this article, I’ll summarize some of her thoughts and describe how teams can begin using some of these techniques immediately.

Challenges with Service Fragmentation

At Outreach, the move to microservices created novel challenges for the business around service organization and ownership. Alina reflected on that time in our interview and described the need for better centralization and alignment.

“There was manual tracking going on, there was a lack of visibility, and as an organization grows you’ve got lots of stakeholders who need to answer questions about ‘who owns what?’ for a variety of reasons.” 

Alina’s team chose the Cortex internal developer portal to solve these problems and, in doing so, unlocked several new capabilities that improve overall engineering efficiency. One of these is the ability to assess production readiness per release and, where necessary, against new or emerging sets of criteria.

“Cortex has allowed us to operate in a way that allows teams to have ownership. We’re more modular, we’re more distributed, versus having a more monolithic approach to production readiness.”

Alina’s team explored a number of approaches to avoid the chaos that can accompany the decomposition of application services, and gained a lot of knowledge of the solution landscape in the process. The benefits microservice decomposition brings to a business are widely understood, but, without a plan to centrally organize these services, the effect can be net-negative due to the overhead associated with increased system complexity.

“We’ve had several iterations of solutions which I think is expected and normal, this isn’t the type of problem where you pick one tool and it's solved, this is an ongoing space where you always have to be questioning what you’re doing, why you’re doing it, and the best ROI.”

Service Ownership Uplevels Production Readiness Assessment

In practice, teams cannot accurately assess production readiness without first establishing clear service ownership and accountability. It has to be easy to understand who is responsible for defining the metrics and indicators that describe production readiness, as well as which teams are accountable for meeting those expectations.

Once ownership has been established, teams need a way for the various stakeholders to reach what Alina calls a “minimum bar of alignment” for production readiness for the release. This is effectively the immediate level of compliance agreed upon across all stakeholders in the organization for a release to move to production. Using Cortex’s service catalog as a central system of record for the metrics that describe this level of compliance, and scorecards as a mechanism for dynamic, per-release production readiness checklists, has facilitated this alignment.

It’s difficult for organizations to cross-functionally enforce standards for production readiness without first setting this bar and then publishing the standards somewhere central, where stakeholders can use them to hold themselves accountable for release quality.

As Alina observes in our interview, “Key to [defining ‘doneness’] is having a ‘minimum bar of alignment.’ When we talk about an IDP’s consistent standards across the organization, it isn’t Joe’s opinion or Susie’s opinion of ‘done.’ There is one minimum bar of, ‘Hey, we’ve all agreed that in order to deploy to production, the release needs to meet X, Y and Z.’”
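To make the idea concrete, here is a minimal sketch in Python of what a published minimum bar might look like as data. It is purely illustrative: the rule names are invented and this is not Cortex’s scorecard syntax.

```python
from dataclasses import dataclass

@dataclass
class ReadinessRule:
    """One binary production readiness check. Names and checks here are invented for illustration."""
    name: str
    description: str

# A hypothetical "minimum bar of alignment": a short, shared list every stakeholder
# has agreed a release must satisfy before it can move to production.
MINIMUM_BAR = [
    ReadinessRule("has-owner", "The service has an owning team recorded in the catalog"),
    ReadinessRule("on-call-configured", "A paging rotation exists for the service"),
    ReadinessRule("security-scan-clean", "The latest security scan has no critical findings"),
    ReadinessRule("runbook-linked", "A runbook is linked from the service's catalog entry"),
]
```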

Simple, Distilled Criteria Are Best

Outreach enforces sophisticated criteria for production readiness, but when it comes to validation, the team has done a remarkable job of keeping standards compliance binary. Across concerns such as code quality, security, and even infrastructure cataloging, a release either meets the standard or it doesn’t.

Many experts have converged on this approach: with so many ways a release can fail, it’s far easier to reach alignment on each case when validation is simple to understand and leaves little room for subjective or emotional interpretation.

Alina describes this decision as one that has brought clarity and facilitated the minimum bar of alignment.

“It’s binary. Either you meet the standard or you don’t, and it's much easier to understand. It’s much easier to get alignment on. Is this a blocking requirement or not?”

She notes that because scorecards can denote different levels of doneness, in other words both established standards and aspirational metrics, this approach balances the team’s need for velocity against risk. Teams can approve releases based on a scorecard that represents the minimum bar of alignment while striving to achieve a release that exceeds those quality measures, allowing continuous improvement without interrupting the flow of value.
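One way to picture that balance is a two-tier rule set in which every rule still evaluates to a plain pass or fail, but only minimum-bar rules can block a release. The Python below is a hypothetical sketch under that assumption, not Cortex’s scorecard model:

```python
# rule name -> is this rule part of the minimum bar (blocking)?
RULES = {
    "security-scan-clean": True,
    "on-call-configured": True,
    "chaos-test-passed": False,   # aspirational: tracked, but does not gate the release yet
    "p99-latency-budget": False,  # aspirational
}

def evaluate(facts: dict[str, bool]) -> dict[str, bool]:
    """Every rule resolves to a plain pass/fail; no partial credit, no subjective scoring."""
    return {rule: facts.get(rule, False) for rule in RULES}

def release_allowed(results: dict[str, bool]) -> bool:
    """Only minimum-bar (blocking) rules can hold back a release."""
    return all(passed for rule, passed in results.items() if RULES[rule])

# Example: the release clears the minimum bar even though an aspirational rule fails.
results = evaluate({"security-scan-clean": True, "on-call-configured": True, "chaos-test-passed": False})
print(release_allowed(results))  # True
```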

Exception Requests are Inevitable

Despite these positive outcomes, the story doesn’t end with better organization and ownership alone. Alina is quick to point out that even with these standards discovered and published, exceptions begin to arise almost immediately from the unique situations that inevitably come up. Here, Cortex is essential for allowing Outreach to conduct risk assessments, make calculated decisions about when to allow exceptions to the criteria, and even change the standards themselves with agility.

Alina describes using Cortex to process a risk assessment: “A tool like Cortex that has all of the exception functionality built in makes it super easy to do that, because you can require having a process built around understanding that exceptions happen … For instance, instead of meeting this requirement the day of release, we’re giving a one-week exception, and the team will do this follow-on work, and the exception will expire in a week.” 

If Outreach determines after a risk assessment that certain production readiness criteria need to be bypassed, the outcome of that assessment is made known publicly, and Cortex’s exception functionality is used to enforce that exception smoothly.
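As a rough illustration, a time-boxed exception can be modeled as a small piece of data with an owner, a reason, and an expiry date. This is a hypothetical model, not Cortex’s exception feature:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class ReadinessException:
    """A time-boxed, publicly recorded exception to a single readiness rule (illustrative model)."""
    rule_name: str
    reason: str
    approved_by: str
    expires_on: date

    def is_active(self, today: date) -> bool:
        # Once the exception expires, the rule blocks releases again.
        return today <= self.expires_on

# The kind of one-week exception Alina describes: ship now, complete the follow-on work within a week.
exception = ReadinessException(
    rule_name="runbook-linked",
    reason="Runbook draft still in review; follow-on work scheduled",
    approved_by="release-review-board",
    expires_on=date.today() + timedelta(days=7),
)
```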

Scorecards Help Production Readiness Criteria Become ‘Living Documents’

There’s no point in creating a process that will always require unexpected exceptions, so Alina’s team has learned that it’s better to build dynamic capabilities into your production readiness strategy from the beginning. Teams should expect that criteria will change alongside shifting business priorities, so a mechanism that allows for and gives structure to that volatility is essential.

“I would describe [production readiness criteria] as a living document. I would describe it as dynamic. That the way you iterate on production readiness is not as a static checklist because it’s really, really hard to keep that up to date. It’s really, really hard to enforce the use of checklists. It’s super manual, and the requirements are always changing.”

Cortex’s scorecarding capability gives the team at Outreach a richer means for creating, testing, and publishing production readiness criteria cross-functionally. Where checklists provide a linear, incomplete, and inflexible view of what it takes for a service to be considered ready, scorecards can be updated across teams easily as conditions change with minimal impact to overall workflow. 

If a new criterion or threshold causes undue friction or doesn’t mitigate risk as intended, teams can simply reverse the change within the single system of record that is Cortex, and the workflow and criteria are instantly updated for all stakeholders.

Start with Ownership and Data

Alina recommends that teams who want to begin a successful production readiness initiative recognize that, in all likelihood, there is already an established checklist somewhere in the organization. Tribal knowledge of who is responsible for the various tasks and automation associated with a release, and of what those tasks and quality gates are, usually already exists; it just may not be published or easily accessible. Start by establishing clear plans for service ownership, and then use that data to bring alignment to stakeholders across releases.
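A practical first step is simply writing that tribal knowledge down as structured data that a catalog can serve as the system of record. The sketch below uses invented service names and fields purely for illustration:

```python
# A minimal sketch of turning tribal knowledge into ownership data that can live in a
# central catalog. Services, teams, and field names are hypothetical.
SERVICE_OWNERSHIP = {
    "payments-api": {
        "owning_team": "payments-platform",
        "on_call_rotation": "payments-oncall",
        "release_contact": "release-manager@example.com",
        "quality_gates": ["security-scan-clean", "load-test-passed", "runbook-linked"],
    },
    "notifications-worker": {
        "owning_team": "messaging",
        "on_call_rotation": "messaging-oncall",
        "release_contact": "messaging-lead@example.com",
        "quality_gates": ["security-scan-clean"],
    },
}
```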

We covered a lot of ground in an hour; be sure to check out the full interview here.

Want to learn about other use cases for internal developer portals? Check out our webinar on the Top Five Use Cases for Internal Developer Portals.

You can also take a tour of Cortex right now, or book a live demo.
