How Clever standardized service quality across its 500+ services
Clever is a digital learning platform for K–12 schools: one friendly place for single sign-on, messaging, analytics, and more. The platform is used by 65% of U.S. K–12 schools, with over 22 million students and teachers using Clever to instantly access a world of digital learning.
What was the problem?
As Clever scaled over the last few years, the infrastructure team found it difficult to keep track of the metadata about existing services during critical initiatives like migrations. The team needed a high-level view of every service running at Clever that they could share broadly with the entire organization.
How was this solved before?
“When we would start a migration, it would involve writing scripts and then manually copying the output into a spreadsheet. We would spend a lot of time just formatting the data and making it consumable for different teams. Keeping the spreadsheet up to date was a huge challenge and a distraction from core engineering work. Overall it didn’t feel like a very efficient use of our time.”
Maintaining and updating a spreadsheet that tracks metadata about services is a painstaking process, especially since the data can be scattered across several different tools. It’s even more challenging to then build accurate reporting on top of that data and highlight the services or teams at risk of not completing the migration.
"I remember we were in the middle of a pod migration but we actually had no idea which percentage of pods were actually migrated and which ones were not. Gathering this data and then tracking it on a team base level was a very manual process. Every week we had to check in with teams to see if the data was updated, making it tedious in nature."
Spreadsheets are often six months out of date by the time you need them, which makes it harder to track whether the quality of your most important services is improving over time. All of these challenges led Clever to seek out a tool that could help them automate this process.
How has Cortex helped?
"Cortex has made it incredibly easy to track the status of migrations across different services and teams. We now have real data on which services are at risk and no longer need to manually check with teams, run scripts, or dig through several tools to find the right data. No one has to go in and update anything manually - it’s all automated and synced with Cortex.
Clever uses Scorecards to automatically track the progress of their migrations. They push in data from their build pipelines using the Cortex API and build rules based on that data. Scorecards are a powerful way to enforce best practices and track whether teams are actually making progress in different initiatives.
"If we ever want to check in on progress on any migration, it’s simple to pull up a report and check which teams are performing well. The reports page makes it trivial to visualize migration progress broken down by teams. We can see that Team X is 90% done but Team Y is only 50% done with the migration so infra knows where to direct their efforts. This was previously difficult with spreadsheets."
Cortex makes it quick and easy to visualize the status of a migration over time, helping Clever keep teams accountable to their quarterly goals.
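To make the per-team breakdown concrete, here is a minimal sketch of the underlying arithmetic. It is illustrative only: the record fields (`service`, `team`, `migrated`), the sample data, and the 75% threshold are hypothetical, and the code neither calls the Cortex API nor reflects how Scorecards compute scores internally.

```python
from collections import defaultdict


def migration_progress_by_team(services):
    """Return {team: percent of that team's services marked as migrated}."""
    totals = defaultdict(int)
    done = defaultdict(int)
    for svc in services:
        totals[svc["team"]] += 1
        if svc["migrated"]:
            done[svc["team"]] += 1
    return {team: round(100 * done[team] / totals[team], 1) for team in totals}


def teams_at_risk(progress, threshold=75.0):
    """Return the teams whose completion percentage is below the threshold."""
    return sorted(team for team, pct in progress.items() if pct < threshold)


# Hypothetical sample data; in practice this would come from build pipelines.
services = [
    {"service": "auth", "team": "team-x", "migrated": True},
    {"service": "roster", "team": "team-x", "migrated": True},
    {"service": "sync", "team": "team-x", "migrated": True},
    {"service": "billing", "team": "team-y", "migrated": True},
    {"service": "reports", "team": "team-y", "migrated": False},
]

progress = migration_progress_by_team(services)
print(progress)
print(teams_at_risk(progress))
```

A report like this is what the infrastructure team previously had to assemble by hand from script output and spreadsheets.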
Have any interesting insights come out of using Cortex?
"I didn’t know we had 500 applications across Clever! Cortex also helped us find services labeled with owners that had left Clever a while ago. We were able to update the ownership to the right owner or in some cases even delete repos that weren’t needed."
"It’s been great to see which teams have fully migrated and have completed their tasks! We can now go and congratulate those teams. These statistics were really tricky to obtain before Cortex."
With direct integrations into all of the third-party tooling at Clever, Cortex is able to surface insights about the service architecture that help improve engineering quality. Everything from orphaned services to stale repos can be surfaced using Scorecards, which makes them a powerful way to maintain high engineering standards.
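As a rough illustration of what an "orphaned service" check involves, the sketch below flags services whose listed owner is missing or no longer on the active roster. The field names (`service`, `owner`) and the sample data are hypothetical, and this is plain Python rather than a Scorecard rule or a Cortex API call.

```python
def find_orphaned_services(services, active_employees):
    """Return names of services whose owner is missing or no longer active."""
    active = set(active_employees)
    return sorted(
        svc["service"] for svc in services if svc.get("owner") not in active
    )


# Hypothetical sample data for illustration.
catalog = [
    {"service": "auth", "owner": "alice"},
    {"service": "legacy-export", "owner": "bob"},  # bob has left the company
    {"service": "cron-cleanup"},                   # no owner recorded
]
current_staff = ["alice", "carol"]

print(find_orphaned_services(catalog, current_staff))
```

Each flagged service then needs either a new owner or, if it is no longer needed, removal.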
Why did you pick Cortex over alternative solutions?
“Scorecards is what pulled us into using Cortex since it solves a core problem for us around migrations. We can see Scorecards helping in two categories, the first being migrations that we track progress from 0 to 100, and then the second being tracking whether teams are following best practices. The Cortex API is also really easy to use and was far easier to use than the alternative solutions we considered.”
“Seeing the wide range of third party integrations and the opportunity to explore those down the road made it really intriguing. We see integrations that are relevant to us, like SLO monitoring, which makes it more motivating to adopt.”
Compared to alternative solutions on the market, Cortex has more integrations, which makes both the Catalog and Scorecards very powerful. Scorecards are powered by CQL (Cortex Query Language), which allows teams to accurately track metrics about their services. Direct integrations with all the existing tools in their workflow made the decision clear to Clever and resulted in a better end-user experience for their engineers. Cortex’s extensive APIs for the Catalog and Scorecards make it easy to both pull and push data for scoring or reporting.
What’s your favorite feature of Cortex?
"Scorecards is my favorite feature of Cortex, mainly because migrations efforts have become so easy to report on and analyze. I’m able to filter across both services and teams which makes my life a lot easier. I love looking at numbers, and Cortex provides all of that data to me instantly."
"The reporting feature and being able to see a high level view of all the different services and teams is very powerful."
What’s next for Cortex?
"We’re expanding Cortex across the entire engineering team. We see engineers especially liking the graph tool, which helps them understand dependencies on an application level and resiliency. We are also embedding Cortex across the Clever engineering stack and in Slack as well for increased visibility. We also want to start setting SLOs in Datadog and track SLO adoption across services at Clever. This will come as part of a best practices Scorecard."
Cortex integrates with APM tools, including SLO support, which allows teams to track in Scorecards whether services are meeting their SLOs. Our Slack bot makes it easy for engineers to fetch service data using Cortex commands.
How has it been working with the Cortex team?
"It’s been great working with the team and I really like that the feedback Clever gives every week is implemented quickly. The velocity of feature improvements week over week makes it really easy to work with the team."
We’re excited to continue working with the Clever team. If you’d like to see a demo of Cortex and help your team with service complexity, sign up here!