Chinmay Jain is Director of Product, Driving Behavior at Waymo. He began his career as a business analyst at McKinsey & Company before joining Accel Partners, where he identified, sourced, and evaluated investment opportunities in consumer internet companies. Chinmay then founded and ran his own company, String Wars, which provided automated online guitar lessons using sound recognition. Before joining Waymo, he served as a product manager at YouTube, where he worked on YouTube Movies & Shows and redesigned the platform’s creator-payment model.
In our conversation, Chinmay talks about the importance of having a proper evaluation system in place throughout an AI product’s ecosystem. He discusses when to focus on granular product metrics, e.g., for making improvements, versus when to focus on telling a more intuitive, big-picture story. Chinmay also shares how, in working on a cutting-edge product, there is no playbook and, as a result, his team has to rely on first-principles thinking.
Considering and testing for tail cases
Just a couple of years ago, hardly anyone was talking about evaluations, but now, they seem to be top of mind for every AI company. Can you start by explaining what evaluations are?
Absolutely, and as you alluded to, it’s not like we didn’t have eval before. Eval is essentially asking, “Do I understand the behavior of the software or product that’s going to market, and do I have confidence in that understanding?” We always did evaluation, but it’s in focus now more than ever because of machine learning.
Machine learning is not as deterministic as traditional software, where you know what to expect and can fully control the context in which the code runs. With machine learning systems like Waymo’s, you have to be very thoughtful and diligent, and you need very detailed eval to understand how the system will perform.
Eval, in some sense, has become the main driver of product direction. For example, I was at YouTube before this, and one of the things I worked on was increasing the conversion of someone buying or renting a movie from the platform. We would run multiple experiments, test them out in the wild, and launch whichever worked better. In that sense, evaluation was constant but in the form of A/B tests.
In today’s world, that’s not always possible. Especially in the case of Waymo and other products out there, the downside of making a mistake is so high that we want to evaluate the system before it goes out to market. That’s why machine learning can be so different from traditional software.
As you mentioned, evaluating a Waymo car is a lot different than evaluating a product that sells more movies. Can you talk a little bit about the evaluation-first mindset at Waymo?
Almost everything that I talk about regarding Waymo applies to other AI agent products out there. Waymo is, in some sense, one of the first, especially in the physical world. But this also applies to the digital world. Making a mistake when an agent is buying something online, for example, also has downsides.
We have been using machine learning from the get-go, and there are three points to highlight. One is that we have a safety-first mindset, and to have that in place, we need to have very good evaluation processes to be confident in whether our system is safe or not. Safety should not be proven first on public roads.
Second is the use of machine learning itself, in that the output is not deterministic. The real way to understand how a model will behave is through evaluation. You can’t just read the code or do some unit testing; you need a much more detailed evaluation, with various scenarios and environments built in, to understand how the system will perform.
Third, because Waymo is going out in a world with a lot of long-tail cases, there are infinite contexts we need to consider. We need to understand those edge cases before they happen on public roads.
In traditional software, ship decisions are simple: metrics move, and you launch. In autonomous driving, where edge cases are endless, how do you decide when to stop evaluating?
It’s important to understand that when launching, we are looking for how confident we are that this will be extremely safe and can handle edge cases. This is where the evaluation becomes extremely challenging. Creating these agents, like an autonomous driver, is not just about whether we can have the model or data in place. It’s also about whether we can build a great evaluation pipeline that can give us this confidence.
The evaluation pipeline has two main aspects to it. One, it is multi-layered, which means it has many dimensions with multiple metrics. When we’re looking at multiple dimensions, we have a higher probability of catching if something is going wrong than if we were to only look at one dimension. Safety, which is our core metric, is multi-layered — we look at things like collisions, close calls, and a lot more.
The second aspect is that we don’t do evaluation at only one step in the process. This differs slightly from traditional software, where you launch something and A/B test. With our software, we’re doing evaluation at every step, including tuning the model, bringing everything together, getting close to launch, and when it’s on the ground. Evaluation is part of the whole life cycle.
These two aspects come together to give us the confidence that the system is robust enough to handle the edge cases we would see out in the real world.
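To make the first aspect, multi-layered metrics, a bit more concrete, here is a minimal sketch in Python. The dimension names, rates, and thresholds are invented for illustration and are not Waymo’s actual metrics; the point is simply that a release candidate is judged across several dimensions at once rather than by a single number.

```python
from dataclasses import dataclass

@dataclass
class EvalDimension:
    """One layer of a multi-dimensional safety evaluation (illustrative only)."""
    name: str                        # e.g., "collisions", "close_calls"
    events_per_million_miles: float  # measured rate for this release candidate
    threshold: float                 # maximum acceptable rate for this dimension

    def passes(self) -> bool:
        return self.events_per_million_miles <= self.threshold

# Invented numbers, purely to show the shape of a multi-layered check.
dimensions = [
    EvalDimension("collisions", 0.8, 1.0),
    EvalDimension("close_calls", 4.2, 5.0),
    EvalDimension("hard_braking", 12.0, 10.0),
]

# Checking several dimensions at once makes it more likely that a regression
# is caught, compared with watching a single aggregate score.
for d in dimensions:
    print(f"{d.name:>14}: {d.events_per_million_miles} per 1M miles "
          f"({'PASS' if d.passes() else 'FAIL'})")

print("All dimensions pass:", all(d.passes() for d in dimensions))
```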
Looking at aggregated metrics
Evaluation is a core part of product management for you. How has perfecting the evaluation process improved building the product?
We want to perform evaluation in a way that lets us left-shift the process, i.e., find problems as early as possible. That means we need to focus a lot on the initial stages so that we catch problems early on. If we catch a problem early, we prevent it from occurring later, which has a high ROI. It also means that developers are directly getting that signal, so they’re learning a lot more as they build. In that sense, as we improve the evaluation process, we’re also improving the overall product that we are creating.
A specific example from Waymo has to do with the multi-layered metrics I highlighted earlier. Another example is not just looking at averages, but at the long-tail cases. What’s the 95th percentile routing time? We don’t just want to improve the average case; we have to improve the long-tail case, and, therefore, our metrics should capture that.
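As a small illustration of why the tail matters, here is a sketch comparing an average against a 95th percentile on invented routing times (the numbers are made up for the example, not real ride data):

```python
import statistics

# Invented routing times in seconds: most rides take ~30s, a handful are very slow.
routing_times = [30] * 46 + [75, 90, 105, 120]

mean = statistics.mean(routing_times)
p95 = statistics.quantiles(routing_times, n=100)[94]  # 95th percentile

# The average stays fairly close to a typical ride, but the 95th percentile
# surfaces the slow rides that actually shape how the product feels.
print(f"mean: {mean:.1f}s  p95: {p95:.1f}s")  # roughly 35s vs ~97s with these numbers
```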
The user experience doesn’t just improve on averages; it improves in the long-tail cases. We are able to measure that and see improvements. An important lesson I’ve learned is that the best engineering process is like a flywheel: what makes the product successful is whether the engineers building it can make a change and quickly receive feedback on whether that change is improving the product or not.
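One way that feedback flywheel can be wired up is a simple regression gate that compares a candidate build against the current baseline on every change. This is a hypothetical sketch, not Waymo’s pipeline; the metric names and values are invented.

```python
# A minimal regression gate: compare a candidate build's eval metrics against
# the current baseline and flag anything that got worse, so the engineer gets
# the signal immediately rather than late in the release process.
# Metric names and values are hypothetical.

TOLERANCE = 0.02  # allow 2% noise before calling something a regression

def regressed_metrics(candidate: dict[str, float], baseline: dict[str, float]) -> list[str]:
    """Return the metrics where the candidate is worse than baseline (higher = worse)."""
    return [
        metric
        for metric, base in baseline.items()
        if candidate.get(metric, float("inf")) > base * (1 + TOLERANCE)
    ]

baseline  = {"close_calls_per_1k_miles": 0.50, "p95_routing_seconds": 38.0}
candidate = {"close_calls_per_1k_miles": 0.48, "p95_routing_seconds": 41.0}

print("Regressed metrics:", regressed_metrics(candidate, baseline) or "none")
# -> Regressed metrics: ['p95_routing_seconds']
```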
Can you share any examples of moments when surface-level metrics look strong, but deeper evaluation exposed weaknesses?
There are countless situations like this. A very common example, especially in autonomous driving, is looking at an aggregated, surface-level metric like disengagement rate. This is how often a human has to take over the car because it’s not performing well. When you’re testing the software, there is a human sitting in the driver’s seat to take over if something goes wrong. How many times do they have to take over every 100 or 1,000 miles? Or how many miles can the car go before they have to take over? There’s a lot of focus on this number in the autonomous driving world because it’s very easy to understand. If the car is autonomous, then humans should never be taking over, and that number should, ideally, be infinite miles.
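The arithmetic behind the metric is straightforward; here is a quick sketch of the two common framings, using invented numbers:

```python
# Two framings of the same invented data: disengagements per 1,000 miles,
# and miles per disengagement. These are not real figures.
miles_driven = 250_000
disengagements = 20

print(f"{disengagements / miles_driven * 1_000:.2f} disengagements per 1,000 miles")  # 0.08
print(f"{miles_driven / disengagements:,.0f} miles per disengagement")                # 12,500
```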
In reality, no metric is perfect. On its own, that number doesn’t tell you whether the system is actually improving or getting safer. Maybe what actually happened was that we trained the drivers better, or they’ve driven the same car so much that they can predict its behavior and don’t need to take over as often. There are a lot of nuances there, so to evaluate whether a product is truly successful, aggregated metrics alone are not sufficient.
Focusing on granularity vs. the overall story
What’s the hardest part about building trust in evaluation results, both with external users, which is a big deal in the autonomous driving world, and internally across teams who may interpret performance differently?
The first, most basic thing to do is to create the right evaluation framework using proper metrics. How do you make sure the understanding and trust are the same for external users and across internal teams? Hopefully, you are able to build an evaluation that’s trustworthy and represents actual behavior, which is a very difficult engineering task.
When it comes to internal teams, engineers specifically want granular metrics. When they’re working on a small part of the whole product, they want to know how their part is doing. But at some point, executives and the external world don’t want to see that — they instead want overall, aggregated metrics that give them an intuitive sense of the product’s performance.
How do you bridge that gap? One way is, for every release that goes out, to have a comprehensive view of the software. Show multiple different metrics and how the software is performing, while also sharing the overall story. If these are the specific metrics, how do they come together in terms of actual, overall behavior? Some folks will want to understand the details, but others will combine that understanding and create an overall view.
For external stakeholders, it’s important to convert these numbers into something intuitive. Waymo does this a lot. We publish a lot of reports, including annual safety reports. Seeing a report that says, “Waymo is 80% safer than human drivers,” is very intuitive, and we need to show our progress at that high level. But we need to dive deeper to help our engineers make improvements. For that reason, we have to focus on granular metrics to make improvements, while also making sure they tie into the overall story and are intuitive to understand holistically.
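For a sense of how a headline number like that can be derived, here is the general relative-rate calculation with placeholder inputs; these are not Waymo’s published figures.

```python
# Comparing an incident rate against a human-driver benchmark over the same
# exposure. Both rates below are placeholders, not real published data.
human_incidents_per_million_miles = 5.0
av_incidents_per_million_miles = 1.0

reduction = 1 - av_incidents_per_million_miles / human_incidents_per_million_miles
print(f"{reduction:.0%} fewer incidents than the human benchmark")  # 80% fewer incidents
```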
How do you help product teams maintain a sense of progress when success often means stability instead of new launches?
What’s the best Waymo ride? It’s the most uneventful and boring one. We want a product that’s not very eventful. But there are multiple aspects to this. We already have a lot of services running in cities like San Francisco and LA, which are extremely difficult cities to drive in. At the same time, we are trying to scale on freeways, we want to drive in snow, and we want to expand to places like Tokyo and London, where we’ll have to drive on the left side of the road. So we are not done yet. There’s so much work for the team to do, which keeps us motivated.
Having said that, even in areas where we are performing really well, we don’t stop the work. We want to keep improving, even in places where we are already pretty good. That’s good for society as a whole, and it also creates a moat for Waymo — if we’re this good, then the bar for everyone else will be very high.
Going back to first-principles thinking
As PMs are beginning to manage more AI systems in high-stakes domains like healthcare and finance, what lessons from autonomous driving evaluation should they adopt early?
The top lesson is that you need to have a strong evaluation system in place to create any product that’s responsible and successful. Invest in evaluation upfront; it shouldn’t be an afterthought. To be honest, this wasn’t obvious to me when I joined Waymo. You can keep building and building, but you’ll only build the right thing if you invest in evaluations and have the right way to evaluate the product. For any machine-learning product, it starts with evaluation.
That’s the mindset change everyone needs to make. Depending on the industry you are in, there are varying risks you could take, but you need to know if you can simulate at scale. Can you really do testing? Can you test the rare critical cases before they cause real-world problems? I find that to be very important.
Next, you can’t use a single metric or even a few metrics. You need a complex set of metrics to give you an understanding of the product’s various dimensions. This also ties back to investment and being ready to have multiple types of evaluations for the product. Lastly, make sure the story is clear to stakeholders. You can have all these things, but you have to make sure that human intuition is not lost.
What advice do you have for PMs to upskill and make sure that evaluation is part of their skill set?
One way is to learn on the job. If you work on an ML product, there’s no choice — you have to do it. But if you think about it, the idea of eval is pretty simple. If you take any machine learning product and you ask someone, “How do you know this product is good for you?” they’ll already start thinking about evaluation.
The challenge is making sure that the eval is comprehensive and that you’re thinking through all the aspects. However, as product managers, we’ve always been thinking about edge cases, even in past products and before machine learning. Product thinking remains the same, but now we just have to express that with data sets, examples, and edge cases.
Lastly, given your background in traditional software companies like YouTube, what originally drew you to autonomous vehicles?
I joined Waymo in 2018, when people were losing faith that we would actually be able to deliver an autonomous driver. Even at that point, I joined the company for two reasons. The first was how big a change we could bring with autonomous driving. Ten years down the line, I wanted to be able to look back at my work, feel satisfied with what I’d done, and see a huge change from where we started. That’s always been key for me in my career.
Second, we are working at the cutting edge of machine learning. We have to use the most advanced technology out there to make this happen. Everything that we are doing at Waymo is at the forefront of technology, and for autonomous driving specifically, we’re doing things that no one else has ever done. That novelty was very enticing.
Overall, it’s been a lot of fun. Because what we’re working on is so novel, we often have to go back to first-principles thinking. There are no playbooks out there, so we have to create everything ourselves. Day to day, it’s a lot of work, but we are seeing amazing numbers, including on safety. That has been really satisfying for my team, and we’re happy to see it.