Handling incidents with PagerDuty & Cortex

PagerDuty is an incident response platform that helps teams detect and fix issues quickly through reliable notifications, automatic escalations, and on-call scheduling.

Integrating PagerDuty with Cortex strengthens your incident response and reporting. You’ll be able to:

  • See on-call information in the service catalog
  • Trigger incidents directly from Cortex 
  • Enforce adoption of on-call best practices for each service and team
  • Track dozens of key metrics like MTTR through Scorecards 

Check out our documentation for setting up PagerDuty in Cortex to get started.

On-Call Information

On the home page for a specific service, you can see at a glance who is currently on call and whether there are any active incidents. The “Current Oncall” tab shows more detail about the service’s on-call rotation, including a full breakdown of the escalation policy and the PagerDuty service details.

Triggering incidents from Cortex

From the on-call tab, you can trigger an incident for a specific service. Cortex creates the PagerDuty incident and escalates it to the service’s current on-call. You don’t need to keep track of which policy to page when a service goes down: just hit “Trigger Incident” and Cortex handles the rest!
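
For readers curious what a “trigger” looks like on the PagerDuty side, here is a minimal sketch of a trigger event body in the shape used by PagerDuty’s Events API v2. It only builds the payload and sends nothing. How Cortex creates the incident internally is not described in this post, so treat this as an illustration rather than Cortex’s implementation; the function name `build_trigger_event` and the example values are hypothetical.

```python
import json
from typing import Any, Dict

# Severity values accepted by PagerDuty's Events API v2.
ALLOWED_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(
    routing_key: str,
    summary: str,
    source: str,
    severity: str = "critical",
) -> Dict[str, Any]:
    """Build (but do not send) an Events API v2 trigger event body.

    Hypothetical helper for illustration only; this is not Cortex code.
    """
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(ALLOWED_SEVERITIES)}")
    return {
        "routing_key": routing_key,   # integration key of the PagerDuty service
        "event_action": "trigger",    # other actions: "acknowledge", "resolve"
        "payload": {
            "summary": summary,       # short human-readable description
            "source": source,         # what is reporting the problem
            "severity": severity,
        },
    }


if __name__ == "__main__":
    # Placeholder values; a real integration key comes from PagerDuty.
    event = build_trigger_event(
        routing_key="YOUR_INTEGRATION_KEY",
        summary="checkout-service is not responding",
        source="cortex-demo",
    )
    print(json.dumps(event, indent=2))
```

With the integration in place you never assemble this yourself: the “Trigger Incident” button handles the routing for you.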

During an incident, Cortex can help in several other ways:

  • Use the Slack Bot to get basic service information such as owners, current on-call, links, and timeline events. The command is “/cortex service <service-tag>”.
  • Open the events page to see the incident in context with recent deploys, ECS events, Git commits, K8s events, Snyk scans, and Sentry events. This makes it easier to pinpoint what happened right before the incident was triggered.
  • Check out the service graph to predict the downstream impact of the outage.

Scorecard Integration

You can also enforce best practices for development maturity in Scorecards through the PagerDuty rules that Cortex exposes. You can check:

  • Whether the service has on-call set up
  • Whether the on-call rotation for the service has the right number of escalation levels
  • The average number of seconds it took to acknowledge an incident over a specified time period
  • The average number of seconds it took to resolve an incident over a specified time period
  • The total number of business-hours interruptions over a specified time period
  • The total number of incidents over a specified time period
  • The total number of off-hours interruptions over a specified time period
  • The total number of sleep-hours interruptions over a specified time period
  • The total number of seconds incidents were snoozed over a specified time period
  • The percentage of uptime over a specified time period

These rules help ensure that teams have a healthy on-call system in place to respond to and mitigate incidents that affect your business.
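
To make the time-based rules more concrete, here is a rough sketch of how averages such as seconds-to-acknowledge and seconds-to-resolve could be derived from incident timestamps over a time window. The `Incident` record, its field names, and the definitions used (time measured from trigger, unacknowledged or unresolved incidents skipped) are assumptions made for illustration. The exact definitions Cortex uses are not given in this post.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Optional


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    triggered_at: datetime
    acknowledged_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None


def in_window(incidents: List[Incident], start: datetime, end: datetime) -> List[Incident]:
    """Keep incidents triggered in the half-open window [start, end)."""
    return [i for i in incidents if start <= i.triggered_at < end]


def avg_seconds_to_acknowledge(incidents: List[Incident]) -> Optional[float]:
    """Mean trigger-to-acknowledge time; None if nothing was acknowledged."""
    durations = [
        (i.acknowledged_at - i.triggered_at).total_seconds()
        for i in incidents
        if i.acknowledged_at is not None
    ]
    return mean(durations) if durations else None


def avg_seconds_to_resolve(incidents: List[Incident]) -> Optional[float]:
    """Mean trigger-to-resolve time; None if nothing was resolved."""
    durations = [
        (i.resolved_at - i.triggered_at).total_seconds()
        for i in incidents
        if i.resolved_at is not None
    ]
    return mean(durations) if durations else None


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0, 0)
    sample = [
        Incident(t0, t0 + timedelta(seconds=60), t0 + timedelta(seconds=600)),
        Incident(t0 + timedelta(hours=1),
                 t0 + timedelta(hours=1, seconds=120),
                 t0 + timedelta(hours=1, seconds=1200)),
        Incident(t0 + timedelta(hours=2)),  # still open, never acknowledged
    ]
    recent = in_window(sample, t0, t0 + timedelta(days=1))
    print("incidents:", len(recent))
    print("avg seconds to acknowledge:", avg_seconds_to_acknowledge(recent))
    print("avg seconds to resolve:", avg_seconds_to_resolve(recent))
```

In a Scorecard, a rule would then compare a number like this against a threshold your team agrees on, for example requiring that the average time to acknowledge stays under a few minutes.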

If you set up a Scorecard to measure on-call health and notice that many of your services are failing, you can create an Initiative and prioritize the rules to improve first. Cortex will send each service owner their action items as the target date approaches, which makes it easier to set attainable goals and accelerate progress towards best practices. Reducing the number of incidents can lead to a happier, more productive engineering team and better service for your customers.

Start using Cortex & PagerDuty today

With our integration, Cortex gives engineering teams more detail to help triage incidents, and Scorecards help improve the on-call health of your teams. Visit our documentation to integrate PagerDuty with Cortex. Don’t have a Cortex account yet? Set up a demo with our team to get started.