Reliability as code

Scorecards allow engineers to set reliability standards across teams and types of services by tracking the health of deploys, SLOs, on-call, vulnerabilities, package versions, and more.

Scorecards help teams win

Understand quality for existing services
Establish best practices and hold teams accountable
Track migrations and platform improvements

Direct integrations with third-party tools

All integrations have corresponding scorecard rules that automatically fetch data directly from their API, letting you enforce best practices without any manual work.

Leadership reporting

Autogenerated reports provide deep insights into which services are at risk and which have improved.

SLO support

Define SLOs for services through direct integrations with third-party tools.

Complex rule builder

Cortex’s CQL is a powerful DSL for defining rules that fit your use case, including rules that draw on multiple data sources, such as “has on-call only if tier >= 1” or “available k8s replicas > desired k8s replicas”.
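The first example rule can be read as a conditional requirement: on-call is mandatory only for services at tier 1 or above, and every other service passes automatically. A minimal Python sketch of that reading follows. It is not CQL, and the `Service` class and its `tier` and `has_oncall` fields are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Service:
    # Hypothetical fields; a real catalog would populate these from integrations.
    name: str
    tier: int
    has_oncall: bool


def oncall_required_for_tier(svc: Service, min_tier: int = 1) -> bool:
    """Pass if the service is below min_tier, or if it has on-call configured."""
    return svc.tier < min_tier or svc.has_oncall


services = [
    Service("batch-job", tier=0, has_oncall=False),  # below the tier cutoff: passes
    Service("checkout", tier=1, has_oncall=False),   # cutoff applies, no on-call: fails
    Service("search", tier=2, has_oncall=True),      # cutoff applies, has on-call: passes
]
failing = [s.name for s in services if not oncall_required_for_tier(s)]
```

Here `failing` collects the services that violate the rule, which is the kind of list a scorecard would surface to the owning team.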

Scorecard Examples

Operational Maturity

Are services meeting SLOs? Are the on-call metrics looking healthy? Are post-mortem tickets closed promptly? Are there too many customer-facing incidents?
Sample Scorecard Rules
oncall.analysis.meanSecondsToResolve < 3600
Make sure that issues are resolved in a reasonable amount of time. If they’re not, you can dig into the root cause.
10
oncall.analysis.offHourInterruptions < 3
If engineers are being paged off-hours, it will lead to alert fatigue and low morale. By catching services that cause high numbers of off-hours interruptions, you can improve developer happiness.
30
Jira: post-mortem tickets opened in the last 6 months that are still open
Developers constantly creating action items for services and not actually closing them is an organizational risk. Either the team is not prioritizing incident-related issues, or the team is not equipped with the right resources.
10
jira.issues("labels=customer and created > startOfMonth(-3)") < 2
A reliable service should not be a source of frequent customer-facing incidents.
20
jira.issues("labels=compliance") < 3
Make sure there are no outstanding compliance/legal issues affecting the service.
10
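To illustrate how rules like these could roll up into a single score, here is a rough Python sketch. It assumes the number listed with each rule is a point weight, which the text does not state outright, and the metric keys are hypothetical stand-ins for values an integration would supply. The post-mortem ticket rule is left out because the text gives no threshold for it.

```python
from typing import Callable, Dict, List, Tuple

Metrics = Dict[str, float]
Rule = Tuple[str, int, Callable[[Metrics], bool]]

# (label, points, check). Points are ASSUMED to be the weights shown beside each rule.
RULES: List[Rule] = [
    ("mean seconds to resolve < 3600", 10, lambda m: m["mean_seconds_to_resolve"] < 3600),
    ("off-hour interruptions < 3", 30, lambda m: m["off_hour_interruptions"] < 3),
    ("customer issues < 2", 20, lambda m: m["customer_issues"] < 2),
    ("compliance issues < 3", 10, lambda m: m["compliance_issues"] < 3),
]


def score(metrics: Metrics, rules: List[Rule] = RULES) -> Tuple[int, int]:
    """Return (points earned, points possible) for one service."""
    earned = sum(points for _, points, check in rules if check(metrics))
    possible = sum(points for _, points, _ in rules)
    return earned, possible


# Hypothetical metrics for a single service.
example = {
    "mean_seconds_to_resolve": 1800,
    "off_hour_interruptions": 5,  # fails the 30-point rule
    "customer_issues": 1,
    "compliance_issues": 0,
}
earned, possible = score(example)
print(f"{earned}/{possible}")  # prints "40/70"
```

Weighting the off-hours rule most heavily means a noisy pager costs a service more than any other single failure in this sketch.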

Operational Readiness

Are services ready to be deployed to production? Are there runbooks, dashboards, logs, on-call escalation policies, monitoring/alerting, and accountable owners?
Sample Scorecard Rules
owners.count > 2
Incident response requires crystal clear accountability, so make sure there are owners defined for each service.
10
oncall.escalations.count > 1
Check that there are at least 2 levels in the escalation policy, so that if the first on-call does not ack, there is a backup.
30
runbooks.count >= 1
Create a culture of preparation by requiring that runbooks are in place for the service.
10
links("logs").count > 1
When there is an incident, responders should be able to easily find the right logs (usually load balancer logs + application logs).
20
dashboards.count >= 1
Responders should have standard dashboards quickly accessible for every service to speed up triage.
10
custom("pre-prod-enabled") = true
Use an asynchronous process to check whether there is a live pre-prod environment for the service, and send a true/false flag to Cortex using the custom metadata API.
10
sonarqube.metric("vulnerabilities") < 3
Ensure that production services are not deployed with a high number of security vulnerabilities.
10

Development Maturity

Do services conform to basic development best practices, such as code coverage, checked-in lockfiles, READMEs, package versions, and ownership?
Sample Scorecard Rules
owners.count > 2
Catch organizational risk by detecting orphaned services.
10
git.fileExists("package-lock.json")
Developers should be checking in lockfiles to ensure repeatable builds.
30
sonarqube.metric("coverage") > 80.0
Set a threshold that’s achievable, so there’s an incentive to actually try. This also serves secondarily as a check that the service is hooked up to SonarQube and reporting frequently.
10
git.lastCommit.freshness < duration("P30D")
Counterintuitively, services that are committed to infrequently are at greater risk. People who are familiar with the service may leave the team, tribal knowledge accumulates, and the service may be running outdated versions of your platform tooling.
20
git.fileExists("*Test.java")
Use a wildcard search to make sure there are unit tests enabled.
10
git.numRequiredApprovals >= 1
Ensure that a rigorous PR process is in place for the repo, so that PRs must be approved by at least one user before merging.
10
git.fileContents(".circleci/config.yml").matches(".*npm test.*")
Enforce that a CI pipeline exists, and there is a testing step defined in the pipeline.
10
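Three of the rules above rest on mechanics worth spelling out: a duration comparison (`P30D` is the ISO 8601 notation for 30 days), a wildcard file search, and a regular-expression match on file contents. The Python sketch below models each one locally. The function names are hypothetical, matching the wildcard against the file name only is an assumption, and so is letting `.` span newlines in the regex; how CQL handles these cases is not stated in the text.

```python
import fnmatch
import re
from datetime import datetime, timedelta, timezone
from typing import Iterable


def is_fresh(last_commit: datetime, now: datetime,
             max_age: timedelta = timedelta(days=30)) -> bool:
    """Model of: git.lastCommit.freshness < duration("P30D")."""
    return now - last_commit < max_age


def any_file_matches(paths: Iterable[str], pattern: str = "*Test.java") -> bool:
    """Model of a wildcard file check, applied to each path's final component."""
    return any(fnmatch.fnmatchcase(p.rsplit("/", 1)[-1], pattern) for p in paths)


def contents_match(contents: str, pattern: str = r".*npm test.*") -> bool:
    """Whole-content regex match; DOTALL (an assumption) lets '.' cross line breaks."""
    return re.fullmatch(pattern, contents, flags=re.DOTALL) is not None


now = datetime(2024, 3, 31, tzinfo=timezone.utc)
recent = datetime(2024, 3, 15, tzinfo=timezone.utc)   # 16 days before `now`
stale = datetime(2024, 1, 1, tzinfo=timezone.utc)     # 90 days before `now`

repo_files = ["README.md", "src/test/java/UserServiceTest.java"]
ci_config = "jobs:\n  test:\n    steps:\n      - run: npm test\n"
```

Without the DOTALL flag, a pattern such as `.*npm test.*` would fail against a multi-line file under whole-string matching, so it is worth confirming the matching mode before relying on a rule like this.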

Migrations

Some scorecards serve more ad hoc purposes, such as migrations (between language versions, platforms, deployment strategies, etc.) or security audits (PCI DSS, SOC 2, etc.). These check that services are “on the right platform library version”, “on Kubernetes”, or “have the right CI file checked in”.
Sample Scorecard Rules
custom("ci-platform-version") > semver("1.1.3")
Having every CI pipeline send its current version to Cortex on each master build lets you catch services that are on outdated versions of tooling (CI, deploy scripts, etc.).
10
package("apache.commons.lang") > semver("1.2")
Cortex automatically parses dependency management files, so you can easily enforce library versions for platform migrations, security audits, and more.
10