Reliability as code

Scorecards allow engineers to set reliability standards across teams and types of services by tracking the health of deploys, SLOs, on-call, vulnerabilities, package versions, and more.

Scorecards help teams win

Understand quality for existing services
Establish best practices and hold teams accountable
Track migrations and platform improvements

Direct integrations with third-party tools

Every integration has corresponding scorecard rules that automatically fetch data directly from its API, letting you enforce best practices without any manual work.

Leadership reporting

Autogenerated reports show which services are at risk and which have improved.

SLO support

Define SLOs for services through direct integrations with third-party tools.

Complex rule builder

Cortex’s CQL is a powerful DSL that lets you define rules to fit your use case, including rules that refer to multiple data sources, such as “has on-call only if tier >= 1” or “available k8s replicas > desired k8s replicas”.
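
To make the conditional rule concrete, here is a minimal Python sketch of the logic behind “has on-call only if tier >= 1”. It is not CQL and does not reflect how Cortex evaluates rules. The `Service` class and its fields are hypothetical, and the rule is read here as "services at tier 1 or above must have on-call; other services pass automatically".

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical record combining catalog metadata and on-call data."""
    name: str
    tier: int
    has_oncall: bool


def oncall_required_when_tiered(svc: Service) -> bool:
    # Assumed reading: tier >= 1 implies on-call must be configured.
    # Services below tier 1 are exempt, so they always pass.
    return svc.tier < 1 or svc.has_oncall


services = [
    Service("checkout", tier=1, has_oncall=True),
    Service("billing", tier=2, has_oncall=False),
    Service("sandbox", tier=0, has_oncall=False),
]

# Names of services that violate the rule.
violations = [s.name for s in services if not oncall_required_when_tiered(s)]
```

Writing the condition as an implication keeps exempt services from being penalized for a requirement that does not apply to them.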

Scorecard Examples

Operational Maturity

Are services meeting SLOs? Are the on-call metrics looking healthy? Are post-mortem tickets closed promptly? Are there too many customer-facing incidents?
Sample Scorecard Rules
oncall.analysis.meanSecondsToResolve < 3600
Make sure that issues are resolved in a reasonable amount of time. If they’re not, you can dig into the root cause.
10
oncall.analysis.offHourInterruptions < 3
If engineers are paged during off hours, it leads to alert fatigue and low morale. By catching services that cause a high number of off-hours interruptions, you can improve developer happiness.
30
Jira: post-mortem tickets opened in the last 6 months that are still open
When developers keep creating action items for a service without closing them, that is an organizational risk. Either the team is not prioritizing incident-related issues, or the team is not equipped with the right resources.
10
jira.issues(“labels=customer and created > startOfMonth(-3)”) < 2
A reliable service should not be a source of frequent customer-facing incidents.
20
jira.issues(“labels=compliance”) < 3
Make sure there are no outstanding compliance/legal issues affecting the service.
10
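
The sample rules above could be rolled up into a single score along the following lines. This is a sketch under two loudly stated assumptions: the number listed after each rule is treated as a point weight, and a scorecard's score is the sum of the weights of passing rules. Neither is confirmed by the text. The metric names are hypothetical stand-ins for the values the CQL expressions would return, and the Jira post-mortem rule is left out because no threshold is given for it.

```python
from typing import Callable, Dict, List, Tuple

Metrics = Dict[str, float]
# (label, pass/fail check, assumed point weight)
Rule = Tuple[str, Callable[[Metrics], bool], int]

RULES: List[Rule] = [
    ("resolve within an hour", lambda m: m["mean_seconds_to_resolve"] < 3600, 10),
    ("few off-hours pages", lambda m: m["off_hour_interruptions"] < 3, 30),
    ("few customer-facing incidents", lambda m: m["customer_incidents"] < 2, 20),
    ("few compliance issues", lambda m: m["compliance_issues"] < 3, 10),
]


def score(metrics: Metrics) -> Tuple[int, int]:
    """Return (points earned, points possible) for one service."""
    earned = sum(weight for _, check, weight in RULES if check(metrics))
    total = sum(weight for _, _, weight in RULES)
    return earned, total


def failing(metrics: Metrics) -> List[str]:
    """Return the labels of the rules the service does not pass."""
    return [label for label, check, _ in RULES if not check(metrics)]


# Hypothetical metrics for one service.
sample = {
    "mean_seconds_to_resolve": 5400,
    "off_hour_interruptions": 1,
    "customer_incidents": 0,
    "compliance_issues": 1,
}
earned, total = score(sample)
```

Listing the failing rules alongside the score tells a team what to fix, not just how far behind it is.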

Operational Readiness

Are services ready to be deployed to production? Are there runbooks, dashboards, logs, on-call escalation policies, monitoring/alerting, and accountable owners?
Sample Scorecard Rules
owners.count > 2
Incident response requires crystal clear accountability, so make sure there are owners defined for each service.
10
oncall.escalations.count > 1
Check that there are at least 2 levels in the escalation policy, so that if the first on-call does not ack, there is a backup.
30
runbooks.count >= 1
Create a culture of preparation by requiring that runbooks are in place for the service.
10
links(“logs”).count > 1
When there is an incident, responders should be able to easily find the right logs (usually load balancer logs + application logs).
20
dashboards.count >= 1
Responders should have standard dashboards quickly accessible for every service to speed up triage.
10
custom(“pre-prod-enabled”) = true
Use an asynchronous process to check whether there is a live pre-prod environment for the service, and send a true/false flag to Cortex using the custom metadata API.
10
sonarqube.metric(“vulnerabilities”) < 3
Ensure that production services are not deployed with a high number of security vulnerabilities.
10
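
The asynchronous pre-prod check described above might look like the sketch below. It covers only the probe-and-serialize step. The payload field names are hypothetical and are not the documented request format of the Cortex custom metadata API. The actual HTTP call is omitted because the endpoint is not given here.

```python
import json
from typing import Callable


def build_pre_prod_flag(service_tag: str, probe: Callable[[], bool]) -> str:
    """Run a liveness probe and serialize the result as a JSON body."""
    try:
        is_live = bool(probe())
    except Exception:
        # A probe that errors out is reported as "not live" so the
        # scheduled job keeps running for the remaining services.
        is_live = False
    # Hypothetical payload shape, for illustration only.
    body = {"service": service_tag, "key": "pre-prod-enabled", "value": is_live}
    return json.dumps(body, sort_keys=True)


def unreachable() -> bool:
    """Stand-in probe that simulates a pre-prod environment being down."""
    raise ConnectionError("pre-prod host did not respond")


live_body = build_pre_prod_flag("checkout", lambda: True)
down_body = build_pre_prod_flag("billing", unreachable)
```

Passing the probe in as a callable keeps the environment check (an HTTP health check, a Kubernetes query, or anything else) separate from the reporting step.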

Development Maturity

Do services conform to basic development best practices, such as code coverage, checked-in lockfiles, READMEs, package versions, and ownership?
Sample Scorecard Rules
owners.count > 2
Catch organizational risk by detecting orphaned services.
10
git.fileExists(“package-lock.json”)
Developers should be checking in lockfiles to ensure repeatable builds.
30
sonarqube.metric(“coverage”) > 80.0
Set a threshold that’s achievable, so there’s an incentive to actually try. This also serves as a secondary check that the service is hooked up to SonarQube and reporting frequently.
10
git.lastCommit.freshness < duration(“P30D”)
Counterintuitively, services that rarely receive commits are at greater risk. People who are familiar with the service may leave the team, knowledge of the service stays tribal rather than documented, and from a technical standpoint, the service may be running outdated versions of your platform tooling.
20
git.fileExists(“*Test.java”)
Use a wildcard search to make sure there are unit tests enabled.
10
git.numRequiredApprovals >= 1
Ensure that a rigorous PR process is in place for the repo and that PRs must be approved by at least one user before merging.
10
git.fileContents(“.circleci/config.yml”).matches(“.*npm test.*”)
Enforce that a CI pipeline exists and that a testing step is defined in the pipeline.
10
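
For readers who want to see what the file and freshness checks above amount to, here is a rough Python approximation. It is an assumption-laden sketch, not Cortex's implementation: the function names are hypothetical, the wildcard is matched against file names only, and the regex is applied with `re.DOTALL` so that `.*` can span the newlines of a multi-line config file. Whether CQL's `matches` behaves the same way is not stated here.

```python
import fnmatch
import re
from datetime import datetime, timedelta, timezone
from typing import Iterable


def file_exists(paths: Iterable[str], pattern: str) -> bool:
    """True if any file name (ignoring its directory) matches the wildcard."""
    return any(fnmatch.fnmatchcase(p.rsplit("/", 1)[-1], pattern) for p in paths)


def contents_match(text: str, pattern: str) -> bool:
    """True if the whole file content matches the regex."""
    # DOTALL lets ".*" cross line breaks in a multi-line file.
    return re.fullmatch(pattern, text, flags=re.DOTALL) is not None


def is_fresh(last_commit: datetime, now: datetime, max_age_days: int = 30) -> bool:
    """True if the last commit is newer than max_age_days (P30D by default)."""
    return now - last_commit < timedelta(days=max_age_days)


# Hypothetical repository snapshot.
repo_files = ["src/test/java/OrderTest.java", "package-lock.json", "README.md"]
ci_config = "jobs:\n  test:\n    steps:\n      - run: npm test\n"
now = datetime(2024, 3, 1, tzinfo=timezone.utc)

has_tests = file_exists(repo_files, "*Test.java")
has_test_step = contents_match(ci_config, ".*npm test.*")
recently_touched = is_fresh(datetime(2024, 2, 15, tzinfo=timezone.utc), now)
```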

Migrations

Some scorecards serve more ad hoc purposes, such as migrations (between language versions, platforms, deployment strategies, etc.) or security audits (PCI DSS, SOC 2, etc.). These check that services are “on the right platform library version”, “on Kubernetes”, or “have the right CI file checked in”.
Sample Scorecard Rules
custom(“ci-platform-version”) > semver(“1.1.3”)
Having every CI pipeline send its current version to Cortex on each master build lets you catch services that are on outdated versions of tooling (like CI, deploy scripts, etc.).
10
package(“apache.commons.lang”) > semver(“1.2”)
Cortex automatically parses dependency management files, so you can easily enforce library versions for platform migrations, security audits, and more.
10
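
Version rules like the two above depend on comparing versions numerically rather than as text. The sketch below shows why, using a deliberately simplified comparison. The helper names are hypothetical, only dot-separated numeric versions are handled (no pre-release or build tags), and it is not a description of how Cortex's `semver` works.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Split a dotted version such as "1.1.3" into integer parts."""
    return tuple(int(part) for part in text.split("."))


def newer_than(current: str, minimum: str) -> bool:
    """True if current is strictly greater than minimum."""
    a, b = parse_version(current), parse_version(minimum)
    # Pad the shorter version with zeros so "1.2" compares equal to "1.2.0".
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a > b


# Plain string comparison gets this wrong because "1" sorts before "9".
string_says_newer = "1.10.0" > "1.9.0"
numeric_says_newer = newer_than("1.10.0", "1.9.0")
```

A service reporting version 1.10.0 would wrongly look outdated under string comparison, which is the main reason to compare parsed version parts instead.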