Model Ops: Getting to production with Machine Learning
Date posted: 20 April 2020
Reading time: 12 minutes
This blog looks at the challenge of taking machine learning (ML) models from a data science notebook running on a laptop to a robust, integrated part of a live service.
Figure 1 Disconnect between Data Science and Operations
Although these challenges are substantial, we can apply our experience of continuous integration, testing and deployment in software delivery, adapting these techniques to the task of making machine learning production ready.
Figure 2 Automating machine learning delivery
Automation is the key to building a process which can allow businesses to adapt quickly in response to change and to do so in a way that is reliable, repeatable and removes human error wherever possible.
Data science development is generally conducted using data science notebooks. These are ideal for iterative exploratory analysis, enabling data scientists to test multiple approaches when selecting a candidate model.
Significant problems can arise if model packaging and deployment occur in isolation, with no connection to the training data. Re-training then means returning to the notebook environment, requiring input from the data scientist once more.
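One common remedy, sketched below with purely illustrative names and a toy model, is to lift the training logic out of the notebook into a plain, versioned function, so that retraining becomes a scripted step rather than a manual notebook session:

```python
# A minimal sketch: training logic extracted from a notebook into a
# versioned module, so retraining is a function call rather than a
# manual notebook session. All names and the model are illustrative.

def train_model(records, learning_rate=0.1, epochs=3):
    """Fit a toy one-feature linear model y ~ w*x by gradient descent."""
    w = 0.0
    for _ in range(epochs):
        for x, y in records:
            grad = 2 * (w * x - y) * x  # d/dw of the squared error
            w -= learning_rate * grad
    return {"weight": w, "hyperparams": {"lr": learning_rate, "epochs": epochs}}

# Retraining on fresh data is now scriptable, with hyperparameters
# captured alongside the fitted weights:
model = train_model([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
```

Because the function is ordinary code, it can be versioned, tested and invoked by a pipeline without the data scientist in the loop.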
The traditional code versioning approach is adapted to include the version history of the training data. This is essential to provide the lineage, auditability and reproducibility that would be impossible with code versioning alone. Production code is separated from the concept of iterative notebooks.
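A lightweight way to tie a model build to its exact training data, assuming the data fits in a single file and using hypothetical function names, is to record a content hash of the data alongside the code revision:

```python
import hashlib
import json

def dataset_fingerprint(path):
    """Content hash of a training data file; identical bytes give an
    identical version, so a changed dataset is immediately visible."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def record_lineage(model_name, code_revision, data_path):
    """Bundle the identifiers needed to reproduce a training run."""
    return json.dumps({
        "model": model_name,
        "code_revision": code_revision,
        "data_sha256": dataset_fingerprint(data_path),
    }, sort_keys=True)
```

Tools such as DVC apply the same idea at scale, but even this simple record gives every model artefact a traceable pairing of code revision and data version.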
The Challenge
Productionising a machine learning model is partly an organisational and cultural challenge. The area is still extremely new and even experts in the field still have much to learn, particularly when taking the output of machine learning efforts to the point that they can provide value to end users. Data science teams are usually separated from software engineering teams, and the gap between the disciplines leads to significant challenges in bringing an ML model into production. This is usually due to gaps in understanding and inadequate processes.

Even basic machine learning models which may be simplistic in isolation still present complexities from the perspective of a traditional software engineering release pipeline - there are more vectors of change to consider than just code. What about the records on which the model was trained? The data's schema? If code is not the only way to trigger a new build, then how do we initiate a new release from changing data, and how is that new artefact validated?

Machine learning solutions do not function in isolation; they are highly dependent on a supply of high-quality data. The challenge of making that data available in a way which is performant and secure is non-trivial, but essential to get right if we want to make ML a reliable part of live services.

Model Ops
Much as DevOps sought to bridge the gap between development teams and operations, we require an integrated approach to provide the output of the exploratory data analysis that data science teams conduct with a clear path to live which satisfies the non-functional requirements of the overall service.
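The questions raised earlier, how changing data initiates a release and how the new artefact is validated, can be sketched as a simple pipeline gate. The names and thresholds here are illustrative, not a prescribed implementation:

```python
# A sketch of a data-triggered build with a promotion gate.
# Hash values, scores and the minimum-gain threshold are illustrative.

def should_retrain(last_data_hash, current_data_hash):
    """Data, not just code, can trigger a new build: a changed
    dataset fingerprint means the model is stale."""
    return last_data_hash != current_data_hash

def validate_candidate(candidate_score, production_score, min_gain=0.0):
    """Promote the new artefact only if it performs at least as well
    as the incumbent on a held-out evaluation set."""
    return candidate_score >= production_score + min_gain

# Illustrative pipeline decision:
if should_retrain("sha256:aaa", "sha256:bbb"):
    promoted = validate_candidate(candidate_score=0.91, production_score=0.88)
```

The gate makes validation automatic and repeatable: a retrained model that regresses on held-out data never reaches live, with no human judgement required in the common case.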
Quality Live Services
Once a model has been promoted to Live, we apply our experience of observability to understand how that model performs and to monitor for model drift or decay. As more data becomes available, model retraining and refinement should be possible. Our automated delivery approach allows us to make those refinements through regular, incremental updates which are tightly controlled. An approach founded on automation and observability provides a process which can ensure that the Non-Functional Requirements (NFRs) are well understood and built in to the delivery pipeline. Key NFRs include:
- Performance – does the model selection hinder performance, or can we better optimise processing and the associated cost?
- Security – models are often dependent on sensitive and valuable data, so we must apply enterprise rigour to the handling, storage and usage of that data
- Accountability – machine learning services which are explainable, auditable and transparent so they can meet regulatory and legal requirements
- Reliability – we leverage resilient public cloud services to ensure a reliable flow of data through end-to-end pipelines and automate testing at each stage
- Maintainability – software engineering techniques for versioning, lifecycle management and automated deployment reduce the risk and cost of change and ensure that deliverables are of a high quality
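The drift monitoring mentioned above can start as simply as comparing the live feature distribution against the training distribution. Here is a sketch of one common measure, the Population Stability Index (PSI), where a value above roughly 0.2 is conventionally read as significant drift; the bucketing scheme is a simplifying assumption:

```python
import math

def psi(expected, actual, buckets=10):
    """Population Stability Index between two numeric samples.
    Buckets are equal-width over the expected (training) sample's range;
    live values outside that range are clamped into the edge buckets."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / buckets or 1.0

    def proportions(sample):
        counts = [0] * buckets
        for v in sample:
            idx = min(int((v - lo) / width), buckets - 1)
            counts[max(idx, 0)] += 1
        # A small floor avoids log(0) for empty buckets
        return [max(c / len(sample), 1e-4) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

Computed on a schedule for each input feature, a rising PSI is an observable signal that the live data has moved away from what the model was trained on, and can feed the same automated pipeline that handles retraining.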