Taking incident management to the next level with an internal developer portal
There is no denying that incident management is one of the most crucial processes concerning the service and business aspects of software deployment. Not having a robust system in place to address and remedy unfortunate incidents can lead to user dissatisfaction, which can ultimately take a toll on your business metrics. A suboptimal management system can also have adverse impacts internally if it prioritizes efficiency and speed of recovery to the point of neglecting employee well-being.
However, that does not have to be the case. Incident management, when done right, provides relief to all stakeholders involved and contributes to improving the quality of your software. It may sound like getting it right has to be a long and arduous process of trial and error, but using an internal developer portal can get you there in no time.
In this article, we discuss a few of the reasons an internal developer portal is not only good for any software development team to have but is essential to executing a strong incident management strategy.
What is an internal developer portal?
Think of the internal developer portal as a repository that houses the ecosystem born out of the development of your software. It has complete information on assets ranging from entire microservices to the developers who manage and own them. Its primary purpose is to allow developers quick access to the resources they need without having to hunt them down.
Developer portals can also be external-facing when they cater to the needs of developers outside of your company. These usually display documentation for API usage. In this article, we focus on the merits of internal developer portals and what they bring to the table with regard to effective incident management.
Why is having an internal developer portal important for incident management?
The internal developer portal contributes to a smoother incident management strategy in more ways than one, as it has something to offer the developer at every step of the way. In this section, we highlight a few that demonstrate its impact on your on-call rotations and incident resolution practices.
Easy access to information saves time in the initial stages
When you think about the initial steps of an incident management process, you are primarily concerned with making sure the right people are involved. Following the occurrence of an incident, the steps usually look like this:
- Checking which services are affected
- Based on the services that are affected, identifying the severity and scope of the impact
- Identifying the people that need to be looped in to resolve the incident
- Using the appropriate runbooks or other kinds of metadata to triage and fix the issue
- Ensuring the fix works
- Communicating the details of the incident and its resolution to relevant stakeholders, as well as maintaining documentation for the future
As you look at these initial steps of incident management, you will notice that a lot of them will be living in your internal developer portal. Take, for instance, the following information:
- Incident commanders, i.e., the developers currently on-call for that service
- What the service does
- Its dependencies and the ownership of these dependencies
- Other assets/services that could potentially be affected by an incident observed in that service
- Stakeholders, i.e., owners of affected services that need to be looped into the incident management process
- Runbooks and dashboards
This is all data that you would normally find in the internal developer portal. However, if you do not have a single source of truth for all of this information, you run the risk of delaying the resolution process. When an incident occurs, you will spend the first 10-15 minutes solely collecting that information to create your initial setup. You are likely to go on Slack and message the incident channel to ask who is depending on that service and is affected by the incident, i.e., who should be looped into this. Or, you will need to go on to PagerDuty and look for the people who are on-call for those teams. So the first fifteen minutes are spent combining and correlating a vast amount of relevant information.
The internal developer portal frees up that time in your incident response process. There is no getting around the upfront investment it takes to get all the information in a single place. However, once you have it, that initial effort drops to a couple of minutes of quickly looking over and collecting the information and creating a channel to loop the relevant stakeholders in automatically. By automating select tasks in your on-call and incident management processes, portals speed up workflows and reduce the risk of human error. In a process as urgent as incident response, saving that time is a significant boost, and the internal developer portal makes it possible.
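To make the idea of a single source of truth concrete, here is a minimal sketch of assembling that initial incident setup from one catalog lookup. The `ServiceEntry` fields, the `build_incident_brief` helper, and the sample data are all hypothetical and chosen for illustration; they do not represent any particular portal's schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceEntry:
    """One catalog record. Field names are illustrative, not a real portal schema."""
    name: str
    description: str
    on_call: str
    owner_team: str
    runbooks: list[str] = field(default_factory=list)
    dashboards: list[str] = field(default_factory=list)


def build_incident_brief(catalog: dict[str, ServiceEntry], service: str) -> dict:
    """Collect what a responder needs in the first minutes of an incident."""
    entry = catalog[service]  # raises KeyError if the service is not cataloged
    return {
        "service": entry.name,
        "what_it_does": entry.description,
        "incident_commander": entry.on_call,
        "owner_team": entry.owner_team,
        "runbooks": list(entry.runbooks),
        "dashboards": list(entry.dashboards),
    }


# Hypothetical sample data standing in for a populated catalog.
catalog = {
    "checkout": ServiceEntry(
        name="checkout",
        description="Handles cart checkout and payment hand-off",
        on_call="alice",
        owner_team="payments",
        runbooks=["runbooks/checkout-latency.md"],
        dashboards=["dashboards/checkout-overview"],
    ),
}

brief = build_incident_brief(catalog, "checkout")
print(brief["incident_commander"], brief["runbooks"])
```

The point of the sketch is that one lookup replaces several separate searches across chat, paging, and documentation tools.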
Access also encourages consistency and speed
Once you have done the initial sweep and contacted the relevant stakeholders, the next step is to triage and mitigate the damage. The internal developer portal has a wealth of information about the service, including metadata such as changelogs, commits, and deployments. This does away with the need to bookmark information across tools.
Sometimes developers, if they are new or are onboarding a new tool, will set up bookmarks for all the different tools they are using for when they have to troubleshoot and fix an incident. However, not everyone does this. Other developers may be more disorganized in their workflows and not have anything saved to refer to later. So you might see a rise in the Mean Time To Resolve (MTTR) metric the week that they are on-call. How can you avoid such a situation and ensure that there is consistency between incident commanders?
Having an internal developer portal lets you bring all this information into a single pane of glass, where a developer can immediately look up whether there are any relevant runbooks they should be using to mitigate the issue. They can also get an event timeline and identify what has changed about their service. Were there recent deployments or configuration changes? Were there commits to the master branch that could have triggered something? Is there a new set of Sentry issues that fired in the last five minutes that could be an indicator? With all the data they need in a single place, developers can quickly look up dashboards or other information, such as runtime.
This ability to search for virtually anything they need during the incident management process is instrumental in generating consistency across multiple incident commanders. No matter who is on-call or how they are running the process, anybody can seek information from the internal developer portal. The portal ensures that a developer’s proclivity for being organized no longer plays a pivotal role in how efficient their incident response workflow is.
Additionally, with the process more standardized, teams can begin to iterate on it and improve it with time. At this point, you can ask, “What is missing? How could we have improved this process? Let's pull up the internal developer portal.” This is a step up from attempting to resolve incidents with little to no structure and rigor to the process. So the internal developer portal can be significantly helpful in terms of actively managing an incident and bringing consistency and speed to the incident resolution process. Long-term, this sets teams up for success. The developers can devote their energy to coming up with a solution and improving the quality of the software.
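The event timeline described earlier in this section can be pictured as a merge of change events from several tools, filtered to the period just before the incident. The sketch below assumes each event is a plain dictionary with a `kind`, a timestamp under `at`, and a `detail` string; the `recent_changes` function and the sample events are hypothetical.

```python
from datetime import datetime, timedelta


def recent_changes(events, incident_start, window=timedelta(hours=1)):
    """Return events in the window before the incident, newest first."""
    cutoff = incident_start - window
    in_window = [e for e in events if cutoff <= e["at"] <= incident_start]
    return sorted(in_window, key=lambda e: e["at"], reverse=True)


incident_start = datetime(2024, 1, 10, 14, 0)

# Hypothetical events gathered from deploy, source control, config, and error tools.
events = [
    {"kind": "deploy", "at": datetime(2024, 1, 10, 13, 45), "detail": "checkout v2 rollout"},
    {"kind": "commit", "at": datetime(2024, 1, 10, 9, 30), "detail": "update README"},
    {"kind": "config", "at": datetime(2024, 1, 10, 13, 20), "detail": "raise connection pool size"},
    {"kind": "error", "at": datetime(2024, 1, 10, 13, 57), "detail": "new TimeoutError group"},
]

timeline = recent_changes(events, incident_start)
for event in timeline:
    print(event["at"].strftime("%H:%M"), event["kind"], event["detail"])
```

Because every incident commander reads the same ordered list, the starting point for triage no longer depends on who has which tools bookmarked.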
Have the freedom to think long-term
The bottom line is that incident management is concerned with the quality of your software, customer satisfaction, and developer experience. These are big objectives that are not always expressed in the most tangible ways. Using an internal developer portal to support your incident management strategy simplifies and streamlines the process to such a great extent that the team can focus on fixing the issue, documenting the experience for future use, and elevating the experience for the end-user. When the information they need is easy to find and use, and there are standards in place, developers are also less likely to experience high levels of stress despite the urgency of the situation. An internal developer portal helps your team manage individual incidents better while at the same time strengthening your response systems in the long term.
The dependency graph prevents escalation
The dependency graph is another helpful feature of the internal developer portal that can be your single source of truth for identifying which services depend on one another. This knowledge is crucial as it can help you figure out which dependency owners need to be contacted and looped in. Ultimately, this prevents the impact of the incident from escalating and affecting a larger portion of your microservices architecture.
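As a rough illustration of how a dependency graph answers the question of who needs to be looped in, the sketch below walks the graph in reverse from the failing service to find everything that depends on it, directly or transitively. The `affected_services` function, the `depends_on` mapping, and the `owners` table are hypothetical stand-ins for data a portal would hold.

```python
from collections import deque


def affected_services(depends_on, failing):
    """Return every service that depends on `failing`, directly or transitively."""
    # Invert the edges: for each service, record which services depend on it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk over the inverted graph; `seen` also guards against cycles.
    seen = set()
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for svc in dependents.get(current, ()):
            if svc not in seen and svc != failing:
                seen.add(svc)
                queue.append(svc)
    return seen


# Hypothetical graph: each service maps to the services it calls.
depends_on = {
    "web": ["checkout", "search"],
    "checkout": ["payments"],
    "search": [],
    "payments": [],
}
owners = {"web": "frontend", "checkout": "commerce", "search": "discovery", "payments": "payments"}

impacted = affected_services(depends_on, "payments")
teams_to_notify = sorted({owners[s] for s in impacted})
print(sorted(impacted), teams_to_notify)
```

In this example an incident in `payments` reaches `checkout` and then `web`, so the owning teams of both are the ones to contact.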
The resource catalog improves the overall quality
At Cortex, we also provide teams with a resource catalog. As part of your internal developer portal, it tracks your infrastructure so that developers have a deeper understanding of its various assets, such as S3 buckets and databases. This is an important tool to have because dependencies are not limited to your microservices. Developers also need to stay informed about the role that resources might be playing in the incident. To understand the scope and severity of an incident, you need a resource catalog that lets you track the infrastructure tied to the particular service, or set of services, that might be down. For example, are there Kafka topics that the affected service is producing data onto that are then being consumed by other processes? Having a resource catalog in your internal developer portal can help you better triage an incident, better understand its scope and severity, and then better resolve it.
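The Kafka question above can be sketched as a lookup over resource records. This is only an illustration under assumed data: the `resources` mapping, its `producers` and `consumers` fields, and the `downstream_consumers` helper are hypothetical and not a description of how any resource catalog stores its data.

```python
def downstream_consumers(resources, service):
    """Map each resource the service writes to onto the services that read it."""
    return {
        name: sorted(record["consumers"])
        for name, record in resources.items()
        if service in record["producers"]
    }


# Hypothetical resource records keyed by an illustrative "type:name" identifier.
resources = {
    "kafka:orders": {
        "type": "kafka-topic",
        "producers": ["checkout"],
        "consumers": ["billing", "analytics"],
    },
    "s3:invoices": {
        "type": "s3-bucket",
        "producers": ["billing"],
        "consumers": ["reporting"],
    },
}

blast_radius = downstream_consumers(resources, "checkout")
print(blast_radius)
```

If `checkout` is down, the result shows that the `orders` topic may go quiet and that `billing` and `analytics` could be affected even though neither calls `checkout` directly.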
It supports the enforcement of compliance requirements pre-production
Having ownership and runbooks for your services makes incident management faster and easier. It is good practice to set standards and requirements for the incident management process, but that does not count for much if they are not regularly enforced across the team. How can you encourage enforcement in the pre-management stage so that when the time comes, developers can follow the process without wasting any time?
Incident management begins before a service goes into production. It must be preventative and not just an afterthought. For starters, make such basic information available before a microservice goes into production. That way, if something goes wrong, the team is not left spending the first fifteen minutes searching far and wide for it.
With that out of the way, consider defining production readiness checklists and service maturity standards right from the start. Production readiness can easily be automated by combining the power of Cortex’s service catalog and scorecard features. You can provide a list of criteria that you are going to enforce and announce that services cannot be pushed to production unless they meet those requirements. The internal developer portal can help developers find the requirements upfront, in the form of scorecards. Setting standards and best practices is not only beneficial for the sake of consistency, but it also greatly reduces the confusion developers have to confront in a situation that is already stress-inducing.
You cannot do incident management correctly, or treat your internal developer portal as an incident-management-grade tool, if you are not able to enforce compliance requirements around it. Using your internal developer portal to enforce production readiness standards is another strong reason to integrate one into your incident management strategy.
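A production readiness checklist of the kind described in this section can be thought of as a set of named rules evaluated against a service's metadata. The sketch below is a generic illustration, not Cortex's scorecard syntax or API; the `readiness_report` function, the rule names, and the service record are all hypothetical.

```python
def readiness_report(service, rules):
    """Evaluate each named rule and list the ones the service fails."""
    failures = [name for name, check in rules.items() if not check(service)]
    return {"ready": not failures, "failures": failures}


# Hypothetical rules: each maps a readable name to a check on the service record.
rules = {
    "has owner": lambda s: bool(s.get("owner")),
    "has on-call rotation": lambda s: bool(s.get("on_call")),
    "has runbook": lambda s: len(s.get("runbooks", [])) > 0,
}

service = {"name": "checkout", "owner": "commerce", "on_call": None, "runbooks": []}

report = readiness_report(service, rules)
print(report)
```

A check like this could run in a deployment pipeline so that a service missing an on-call rotation or a runbook is flagged before it reaches production rather than during its first incident.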
Scorecards are useful for tracking post-incident metrics
Finally, when you are nearing the end of the incident management process, you need to track metrics to be able to close out your post-mortem ticket. For example, how do you track that the MTTR is going down over time? You will also want to keep an eye on the incident response rate and ensure that you are effectively managing the consequences of the incidents. The internal developer portal, being the single source for all your information needs, is the perfect tool for the job. It gives you visibility into any metrics you are interested in tracking. Not only that, but you can also create scorecards for service maturity to close the loop in your developer portal.
In addition to being able to close the ticket, these metrics are useful for writing better runbooks that provide detailed instructions to solve specific issues. The teams also learn which parts of the incident management strategy can be improved and strengthen their documentation. Whether it is the escalation mechanisms or the incident discovery systems that could use more work, they can make changes for the next time an issue surfaces.
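To show what tracking MTTR over time might look like in practice, here is a small sketch that averages resolution time per month. It assumes each incident record carries a `detected` and a `resolved` timestamp; the `mttr_by_month` function and the sample incidents are hypothetical.

```python
from datetime import datetime
from statistics import mean


def mttr_by_month(incidents):
    """Average minutes from detection to resolution, grouped by month detected."""
    buckets = {}
    for incident in incidents:
        month = incident["detected"].strftime("%Y-%m")
        minutes = (incident["resolved"] - incident["detected"]).total_seconds() / 60
        buckets.setdefault(month, []).append(minutes)
    return {month: mean(values) for month, values in sorted(buckets.items())}


# Hypothetical incident history.
incidents = [
    {"detected": datetime(2024, 1, 5, 10, 0), "resolved": datetime(2024, 1, 5, 11, 30)},
    {"detected": datetime(2024, 1, 20, 8, 0), "resolved": datetime(2024, 1, 20, 8, 50)},
    {"detected": datetime(2024, 2, 3, 15, 0), "resolved": datetime(2024, 2, 3, 15, 40)},
]

trend = mttr_by_month(incidents)
print(trend)
```

Comparing the monthly values is one simple way to check whether changes to runbooks or escalation paths are actually shortening resolution time.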
How can you use Cortex for better incident management?
Your services, dependencies, runbooks: everything you need to effectively manage an incident is available through the internal developer portal. From start to finish, it can guide your developers through the entire process so that the only friction they must deal with comes from the issue itself. It starts with the use of production readiness scorecards, moves through the service and resource catalogs for information access, and ends with scorecards that track post-incident metrics.
Cortex’s offerings are well-suited to meet the needs of your internal developer portal. The service catalog is the backbone of any portal, as it provides a comprehensive overview of the entirety of your microservices architecture. This is the most important source of information, as it houses details about aspects of your services, such as ownership and on-call. The resource catalog complements it well by adding another layer of information and demystifying the inner workings of the infrastructure that your software relies on. Finally, the scorecards offer a simple yet highly efficient way of enforcing development standards so that your incident management strategy can come into effect across the board and help you maintain consistency and accountability.
Setting up an internal developer portal is not a small task, and it demands a certain amount of time commitment. However, the payoff is ultimately worth it when your developers feel confident in themselves as well as in the system anytime they are faced with an incident management request. To learn more about internal developer portals, why you should invest in one, and how Cortex can best help you, book a demo today.