Newsletter #083 – ML in production

by info.odysseyx@gmail.com, July 29, 2024

The role of data scientist is one of the most sought-after jobs in technology today. However, it is also one of the most misunderstood, and that misunderstanding contributes to much of the dissatisfaction that many data scientists experience in the industry.

I believe that companies doing data science and machine learning can be divided into two groups.

The first group consists of companies where ML/DS is a core competency. These are companies that would not exist if machine learning did not exist. Popular examples include Google, Facebook, Amazon, and Stitch Fix. Several of these companies did not start out as ML-driven companies, but if you were to remove their ML competency today, they would be completely different. Companies where ML is a core competency do not need to be convinced that machine learning can add business value. They have several ML-driven products in production and many other ML-driven bets in development and testing.

The second group consists of companies where ML is not a core competency. The vast majority of companies fall into this category. At these companies, data scientists and ML engineers may be working on projects that augment existing products, (attempt to) automate internal operational processes, or otherwise pursue efficiencies. Many of these places have decided that ML is worth the investment (otherwise their data scientists wouldn’t have a job), but sometimes that investment is driven by FOMO (fear of missing out) rather than a desire to innovate. These companies are investing millions of dollars in their data science teams, so naturally they want to market themselves as ML-driven companies, even if their ML efforts are net negative.

While they have decided that ML is worth the investment, leaders of companies in group two must still be convinced that deploying machine learning is worth the risk.
This is a people problem, not a technology problem. To solve it, relationships must be cultivated with the right business stakeholders and decision makers. It requires trust. The business must trust that data science understands the domain and what is at stake (which often includes the reputation and jobs of the stakeholders). In turn, data scientists must have the support of the stakeholders. No technology solution will be perfect at first, so data scientists and business stakeholders must work together to solve the problems that come up.

This is difficult to solve unless the data science team already has some degree of influence. It’s a bit of a chicken-and-egg problem. But companies that solve this challenge can begin to generate business value through machine learning and data science. Those that perfect their solutions can even “cross the chasm” to become ML-first companies.

Here’s what I’ve been reading/watching/listening to lately:

Column names as contracts – While there are accepted strategies for creating contracts with users of software and user interfaces, similar strategies are less widespread for data tables. This article describes a controlled vocabulary for column names as a simple approach to creating a shared understanding of how each field in a dataset is intended to work. The post introduces the concept with an example and shows how controlled vocabularies can provide lightweight solutions for routine data validation, discoverability, and wrangling.

pointblank – Last week I referred to the Great Expectations library for “unit testing” data in Python. This week I discovered pointblank, a similar library for methodically validating your data (either in-memory as data frames or as database tables) in R.
It provides a collection of powerful validation functions, maintains information about tables that is updated when the tables are updated, generates automated data quality reports, supports multiple databases, and can be used in pipeline processes to periodically check data, trigger alerts, raise errors, or write information to logs when validations exceed specified error thresholds. I’m a firm believer that data validations should be used to catch errors in ML pipelines.

Data Science Project Flow for Startups – A data science consultant provides their perspective on how to structure and execute projects with teams of 1-4 data scientists. The process is divided into three parallel aspects: product, data science, and data engineering, and involves data science repeatedly checking in with product to ensure KPIs are being met. The process itself is divided into four phases: scoping, research, model development, and deployment.

Peer review of data science projects – A follow-up to the previous article proposing a structured process for peer review of data science projects. The post suggests two different peer review processes: one for the research phase and the other for the model development phase. I especially enjoyed the extensive list of questions the author proposes for the research phase and how they can be used to reduce the risk associated with the project. Very well written and insightful.

Good data analysis – My team at 2U has been doing a lot of data analysis lately, so I decided to revisit this excellent resource from Google data analysts that I shared a long time ago in newsletter #7. This paper summarizes the ideas and techniques that careful, methodical data analysts use on large, high-dimensional data sets. It is divided into three sections: technical (techniques for exploring data), process (how to approach data, what questions to ask), and mindset (how to collaborate with others and communicate insights).
One of the most interesting subsections describes the danger of mixture shifts, where the relative sizes of subpopulations differ between the groups being compared. Mixture shifts can lead to Simpson’s paradox, “where a trend occurs in several groups of data but disappears or reverses when these groups are combined.”

That’s it for this week. If you have any ideas, I’d love to hear them in the comments below!
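
P.S. To make the Simpson’s paradox point above concrete, here is a minimal Python sketch. It is not taken from the Google guide: the segments, the counts, and the helper names (`segment_rates`, `pooled_rate`) are invented purely for illustration. It shows a conversion rate improving inside every segment while the combined rate falls, simply because the mix of traffic shifted toward the lower-converting segment.

```python
# Hypothetical (conversions, visitors) per segment; all numbers are invented.
weeks = {
    "week_1": {"desktop": (200, 1000), "mobile": (10, 100)},
    "week_2": {"desktop": (22, 100), "mobile": (120, 1000)},
}


def segment_rates(segments):
    """Conversion rate within each segment."""
    return {name: conv / visitors for name, (conv, visitors) in segments.items()}


def pooled_rate(segments):
    """Conversion rate after combining all segments into one group."""
    total_conv = sum(conv for conv, _ in segments.values())
    total_visitors = sum(visitors for _, visitors in segments.values())
    return total_conv / total_visitors


before, after = weeks["week_1"], weeks["week_2"]
rates_before, rates_after = segment_rates(before), segment_rates(after)

# Every segment improves week over week...
for name in before:
    print(f"{name}: {rates_before[name]:.1%} -> {rates_after[name]:.1%}")

# ...but the pooled rate drops, because week 2 traffic is mostly mobile.
print(f"pooled: {pooled_rate(before):.1%} -> {pooled_rate(after):.1%}")
```

With these made-up numbers, desktop goes from 20.0% to 22.0% and mobile from 10.0% to 12.0%, yet the pooled rate goes from 19.1% to 12.9%. Comparing within segments, or reweighting to a fixed mix, is one way to avoid being misled by the pooled number.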