Newsletter #085 – ML in production

Written by info.odysseyx@gmail.com, July 29, 2024

This week, a data scientist on my team discovered a pretty serious bug in one of our production applications. While it's important to diagnose and fix every bug, the way this one was diagnosed was especially notable. The data scientist was analyzing historical outputs from the machine learning models powering the application when he noticed that, at one point, the average model prediction had suddenly shifted by several orders of magnitude :O. We were able to trace the issue back to a specific codebase update, and a few teammates helped fix the bug. There are many lessons to take from this experience.

First of all, this is a great example of what can go wrong in data-driven applications. The application is responsible for rendering decisions in real time based on the outputs of various machine learning models. Because we didn't have good guardrail metrics, the system kept rendering decisions even though the model outputs were wrong. I would classify this as a silent failure: a system error that does not explicitly raise an exception.

How do we prevent this from happening again? One way is to monitor the model predictions themselves. While monitoring ML is a deep topic, we would have noticed this problem immediately if we had monitored even a simple metric like the average model prediction.
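To make that concrete, here is a minimal sketch of the kind of guardrail check I have in mind, assuming you already log raw predictions somewhere you can query. The function name, the window choices, and the 10x threshold are illustrative assumptions on my part, not details from the incident above.

```python
import statistics

def average_prediction_shifted(recent_preds, baseline_preds, max_ratio=10.0):
    """Return True if the mean prediction moved by more than `max_ratio` times
    (in either direction) relative to a trusted baseline window."""
    recent_mean = statistics.fmean(recent_preds)
    baseline_mean = statistics.fmean(baseline_preds)

    # A zero baseline makes the ratio meaningless; flag any nonzero shift.
    if baseline_mean == 0:
        return recent_mean != 0

    ratio = abs(recent_mean / baseline_mean)
    return ratio > max_ratio or ratio < 1.0 / max_ratio


if __name__ == "__main__":
    # Toy data: the baseline window averages ~0.4, the recent window ~400,
    # i.e. the kind of orders-of-magnitude jump described above.
    baseline = [0.38, 0.41, 0.42, 0.39, 0.40]
    recent = [395.0, 410.0, 402.0]
    if average_prediction_shifted(recent, baseline):
        print("ALERT: average model prediction shifted by more than 10x; check the latest deploy.")
```

In practice a check like this would run on a schedule over windows pulled from your prediction logs and feed into whatever alerting you already have. The point is simply that a very small amount of code buys you a loud failure instead of a silent one.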
Other issues we discovered while diagnosing the bug related to application logging and software testing. Although these are considered "software topics," data scientists and ML engineers need to care about them. If you're implementing or operating an ML-powered application, you need to realize that you're in the software development business. That means all the things that matter to software engineers, like clean and modular code, extensive test suites, and detailed application logs, should matter to you too. You may have a highly specialized team of software engineers helping you build applications, or you may not. Either way, you need someone with that expertise in the room when you're designing and implementing your application.

But this leads to a tricky situation. What if no one on your team has any software experience? How do you know whether you are following best practices? Chances are, there are plenty of software engineers at your company who are willing to dig into your code and offer advice. Try to get some senior engineers involved, the kind other devs look up to. These people have seen all sorts of application problems and can quickly point out potential issues and ways to improve your codebase. Honestly, good software engineers are happy to point out bugs in code, so you shouldn't have to look too hard to find one or two of these people 😉

Here's what I've been reading/watching/listening to lately:

Solving the time travel problem in machine learning – When using ML models to predict future events, data scientists need access to snapshots of past feature data to prevent data leakage during training. One way to obtain these snapshots is to "log and wait" until enough feature values have accumulated for model training. Another approach is to backfill the data by efficiently computing historical feature values.

Bringing an AI product to market – The third post in O'Reilly's series on AI Product Management discusses how to bring an AI-powered product to market. The core responsibilities of an AI PM include identifying the right problem and agreeing on metrics, planning and managing the project, and executing the project roadmap by working on interface design, developing prototypes, and collaborating with engineering leaders. The post emphasizes the importance of experimentation when building AI products: "Lack of clarity in metrics is a technical debt worth paying off. Without clarity in metrics, it is impossible to conduct meaningful experiments."

Post-implementation AI product management – O'Reilly's series on AI Product Management concludes with a post describing the responsibilities of an AI PM after the product has been deployed. Unlike in traditional software engineering, the PM and the development team must remain intimately involved in operations, improving the model and pipeline and ensuring that the product performs as expected over time. This debugging process relies on logging and monitoring tools to detect and resolve issues that arise in a production environment. From my own experience managing products, I can confidently say that AI products cannot simply be handed off to operations teams that lack ML expertise.

A step-by-step process to resolve the root causes of most event analysis failures – Written by the former SVP of Growth and Business Intelligence at Gojek, this post describes a process for planning and implementing an event tracking system that enables data-driven decision making. This isn't a problem specific to machine learning, but it has serious implications for the kind of data you can later use to train ML models. If part of your job is to identify which parts of your company's products to instrument, this post is for you. Here's a quote: "Beyond all the tooling, there's one fundamental thing that will make or break any data initiative within an enterprise: how you think about what to track, how to track it, and how to manage it over time. If you get these things wrong, the best tooling in the world won't save you."

Scaling Airbnb's Experimentation Platform – Airbnb has seen exponential growth in recent years in the number of experiments used to optimize the look and feel of its website and native apps, its smart pricing and search ranking algorithms, and the targeting of its email campaigns. Along the way, the team evolved its Experimentation Reporting Framework (ERF) from a Ruby script into an application with a full UI, a domain language, and a suite of Airflow tasks. Here they describe how the system evolved and discuss specific features.

That's it for this week. If you have any ideas, I'd love to hear them in the comments below!