What's missing from your incident management workflow

The first fifteen minutes of an incident set the tone for the rest of the resolution process. But what makes the difference between a rapid response and a stressful scramble—clear ownership—hasn't always been easy to ascertain.

In this article, we’ll cover how Cortex, an internal developer portal, can be your team’s source of truth to accelerate the incident management process, and reduce MTTR.

Incident response without an internal developer portal

Let’s look at each step in the incident response process and identify where teams often struggle.

The initial incident assessment can typically feel like the most chaotic step. Emotions are running high, there’s a lot of uncertainty, and immediate next steps aren’t always clear. Gaining full view of the problem often comes down to a flurry of Slack and disjointed PagerDuty analysis to (hopefully accurately) answer questions like:

1
Which services are affected?
2
What is the severity and scope of the impact?
3
Who are the appropriate owners to involve for the fix?

Even after the initial assessment, on-call personnel need to begin gathering context. This process might start with scouring recent deployments, config changes, commits, and alerts across all of their different tools. But on-call engineers frequently have different systems for staying organized, ranging from meticulous spreadsheets to storing everything in their head.

For some teams, incident management ends after deploying the fix. What these teams miss is that documentation and communication pay dividends in the long run. Other teams that make a best effort to learn from past incidents still struggle to maintain that institutional memory, especially as team members turn over or the organization grows.

By the time another incident comes around, teams are essentially starting from scratch. They are stuck always reacting to incidents instead of being proactive. Developers have a poor experience and the organization as a whole suffers, too, from a long and highly variable MTTR.

Incident management with Cortex

Cortex is an internal developer portal that organizes all the information engineers need for a streamlined incident response process. Particularly useful for teams that are scaling up, Cortex aggregates knowledge about:

1
Ownership, including incident commanders and relevant stakeholders
2
The purpose of every software service
3
Service dependencies and ownership of these dependencies
4
Other assets/resources that could be affected by an incident
5
Runbooks and dashboards

While it takes some level of upfront effort to configure an internal developer portal, Cortex’s integrations with dozens of popular software tools eases the burden of manually mapping and modeling information. For alerting and incident management, you can automatically read data into Cortex from Datadog, FireHydrant, OpsGenie, PagerDuty, VictorOps, and XMatters.

That initial setup is well worth it when incidents inevitably happen. Let’s see why.

Assign and investigate

What used to be a frenzied fifteen minutes or more pinging anybody who might have context about affected services and resources becomes a quick look at your catalogs and dependency graphs for detail on scope and ownership. Quickly view on-call information and appropriate Slack channels to find exactly who you need when it matters most.

Whether the on-call engineer is a veteran or is new to the job, Cortex makes it clear how to respond by surfacing the relevant runbooks. Cortex also has an event timeline, which highlights everything that’s recently changed about a given service so engineers don’t have to search across disparate tools. Together, these features make it much easier to identify the source of the problem and design a solution.

Trigger and escalate

The engineering team may not immediately be aware of all incidents. Support teams play a large role in triaging customer feedback and can discern when there’s something wrong. Cortex bridges the gap between technical and non-technical functions, saving everyone the time and effort of messaging everyone on Slack.

For example, LetsGetChecked reduced MTTR by 67% just by giving the support team a better view of service ownership. The TechOps team no longer blast a general engineering Slack channel to find the appropriate team or owner. Engineers can confirm that a ticket qualifies as an incident and escalate accordingly, directly from Cortex.

Mitigating future incidents

Teams can use Cortex to craft a production readiness checklist via Scorecard to ensure all new services meet best practice from the start, and are continually updated as you onboard or off-board tools and frameworks. Additionally, new services and resources can be bootstrapped to include all core requirements for deployment, via Cortex Scaffolder.

Cortex is a fully featured internal developer portal

Teams that use Cortex eliminate unnecessary swirl from lack of ownership, while enabling devs to build better from the start.

But Cortex’s benefits go far beyond incident management. From Initiatives to drive urgency against specific projects, to Scaffolding which enables your team to quickly spin up new services and resources according to best practice, Cortex helps manage all parts of running and growing a complex software system. To learn more about using the Cortex internal developer portal in your organization, reach out for a demo today.