Every engineering org is taking an AI readiness test right now

Cristina Buenahora

VP, Strategic Initiatives

April 9, 2026

Tamar Bercovici has been at Box for 15 years. She leads the core platform, the backend layer that storage, search, metadata, and AI capabilities all run on. When her systems go down, Box goes down.

On a recent episode of the Braintrust podcast, she said the debate around AI-generated code tends to focus on whether the models will write clean code or introduce bugs. Tamar's focus is somewhere else entirely: she's watching what happens to CI pipelines, deployment gates, ownership records, and observability when output volume suddenly multiplies.

"I think what's interesting about AI coding is that it's really pressure testing what you already technically should have had in place," Tamar says. "People like to say a lot about, oh, the AI can write bad code or make mistakes or whatever. Well, guess what? Humans also write bad code sometimes or make mistakes."

Tamar had a few thoughts on what that pressure actually reveals, and why the teams that built strong operational foundations before AI showed up are the ones absorbing it best. You can listen to her full conversation with Ganesh on the Braintrust podcast here; below are the CliffsNotes on the AI readiness test Tamar says every engineering org is in the middle of right now.

Your operational investments just became load-bearing

When Box's engineering team was 30 people, everyone knew what everyone else was working on. If something broke in production, someone on the team could usually connect it to a colleague's recent deploy. But that ambient context disappears at scale. By the time you have hundreds of engineers, you need actual systems to do what hallway conversations used to handle: observability, CI/CD gates, deployment practices, clear ownership records, the whole nine yards.

"AI, when done right, accelerates that team growth," Tamar explains. "All of a sudden it's like your one developer is maybe committing more code or committing more quickly, or if you take it fully into that, I'm now orchestrating over a team of agents, then maybe that one developer is actually kind of like three or four."

When one developer does the work of several, every downstream system takes on that multiplied load. The CI pipeline, the review process, the deployment gates, the monitoring. These aren't new problems. They're existing investments that suddenly have to perform under conditions they weren't sized for, and most teams haven't stopped to ask which one breaks first.

Operational rigor is the dividing line

Nathen Harvey, DORA Lead at Google, made the case on a recent Braintrust episode that AI is an amplifier. High-functioning teams get faster, but teams with weak operational foundations start to crack because the bottlenecks they've been working around can't be worked around anymore.

Tamar's team has been building operational rigor for years because, for them, the consequences of not having it are immediate and visible. So far, that foundation has held. "We have not seen AI topple that," she says.

She's also not treating that track record as settled. "I think it's something that we need to keep tabs on," she continues. "It's also just assessing, are we using AI successfully enough? What could we be doing more? What metrics should we even be looking at?"

Tamar doesn't claim that she or any other engineering leader has completely figured things out. "We're still all collectively iterating on this too," she says. The foundation is strong, but nobody's done learning how to build on it.

AI can strengthen the same systems it's stressing

Tamar's team is looking at AI for test generation because more comprehensive coverage counterbalances the faster pace of commits. They're exploring it for monitoring, specifically for catching incident signals that humans might miss across a growing volume of production data. And migration work, which Tamar says has basically been her entire career at Box, turns out to be a strong fit because the changes are repetitive and high-volume.

"Just like AI can put more pressure on certain parts, it can also help you relieve pressure in others," she says.

When people think about AI and rapid prototyping, they usually picture a designer spinning up five different interface options to test with users. But Tamar's team makes architectural decisions that only become clear once you actually try to build them. "Sometimes you only really get a full sense for what it's going to look like when you try to code it out," she says.

That learning curve surfaced quickly at Box, where one of Tamar's peers started pairing engineers who were getting value from AI with those who weren't. Some of the pairings had a junior engineer mentoring a staff engineer: seniority had nothing to do with who knew the tools best.

Nobody knows how to measure AI's impact on engineering yet

There's never been a good definition of developer productivity, and AI hasn't delivered one. "The perennial problem of engineering leadership is how do you measure the productivity of engineering," Tamar says. "I've been in so many networking events where people are trying to either propose something that's never a full capture or kind of commiserate on how difficult this problem already was."

AI agents have made it worse. "Now you throw in all these AI agents into the mix," she says, "and even assessing the impact of what you're seeing on anything beyond qualitative is difficult."

Her team is working on better metrics, but Tamar is honest about how fast any approach becomes outdated. Whatever worked six months ago doesn't work now. The same can probably be said of what worked six weeks ago, too.

"If we go back to what is engineering, you got to take this ambiguous, complicated problem and you need to break it down into its sub-components," Tamar continues. "If you can't do that well, no matter what AI technology you're using, if you give it a bad prompt, you're going to get a bad outcome. Just like if a product manager gives you a bad set of requirements or your architect doesn't explain what they want you to do, the engineer's not probably going to do a good job."

Clear standards for what a well-run service looks like were always important. When output volume multiplies and the tooling changes every few weeks, those standards become the only stable thing to measure against.

How Cortex helps engineering orgs assess their AI readiness

Cortex Scorecards exist to answer the questions Tamar keeps circling back to. You define what a well-run, AI-ready service looks like across your org, and the platform measures every service against that definition automatically.

With Cortex, teams can:

  • Measure AI readiness at scale with Scorecards that track every service against criteria like ownership clarity, test coverage, deployment practices, observability, and security posture

  • Surface gaps before they become incidents with real-time visibility into which services meet the bar and which don't

  • Drive remediation with Initiatives that attach owners, deadlines, and automatic progress tracking to every gap the Scorecard identifies

  • Stay ahead of a moving target by updating standards and tracking org-wide progress as AI adoption evolves, without rebuilding measurement from scratch

The operational rigor Tamar describes at Box is the baseline Scorecards are designed to measure and enforce. Getting that baseline visible and trackable is how engineering leaders move from instinct to evidence on the readiness question.
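To make that concrete, here's a minimal conceptual sketch of the Scorecard pattern in Python. It is not Cortex's actual API or schema; the service fields and criteria are hypothetical stand-ins for the kinds of checks described above (ownership, coverage, observability).

```python
# Conceptual sketch only -- not Cortex's actual API or schema. It
# illustrates the Scorecard pattern: define criteria once, evaluate
# every service against them, and surface the gaps. All field and
# criterion names here are hypothetical.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Service:
    name: str
    owner: str | None = None        # ownership clarity
    test_coverage: float = 0.0      # fraction of lines covered
    has_on_call: bool = False       # incident readiness
    has_dashboards: bool = False    # observability

# Each criterion is a named pass/fail check against a service.
CRITERIA: dict[str, Callable[[Service], bool]] = {
    "ownership clarity": lambda s: s.owner is not None,
    "test coverage >= 80%": lambda s: s.test_coverage >= 0.80,
    "on-call rotation": lambda s: s.has_on_call,
    "observability dashboards": lambda s: s.has_dashboards,
}

def gaps(service: Service) -> list[str]:
    """Return the criteria a service fails -- the remediation list."""
    return [name for name, check in CRITERIA.items() if not check(service)]

services = [
    Service("billing", owner="payments-team", test_coverage=0.91,
            has_on_call=True, has_dashboards=True),
    Service("legacy-export", test_coverage=0.42),
]

for svc in services:
    missing = gaps(svc)
    print(f"{svc.name}: " + ("meets the bar" if not missing
                             else "gaps: " + ", ".join(missing)))
```

The design point is the one the Scorecards make: once the bar is written down as explicit checks, measuring every service against it, and re-measuring as the bar moves, becomes automatic.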

Book a demo to see how Cortex helps engineering organizations assess their AI readiness, or try it for free.
