The following paragraph serves as a preface to a series of article guidelines covering critical aspects of the Machine Learning lifecycle. The representation of an ML project’s flow is not important by itself – it is simply a special kind of flowchart. However, the nature of the problems solved by ML makes the resulting projects difficult to follow without a visual overview. Taking this into account and putting the emphasis on the ML lifecycle, these article guidelines will not only summarize current trends and worldwide practices but also provide examples from our experience at Intertec.io.
The main intention of the ML lifecycle series is to educate individuals and potential clients on how to digest ML projects, as well as to teach project managers what to focus on to satisfy the client’s needs, thus moving toward the explainability aspects of Artificial Intelligence.
Problem framing, as the initial phase of the ML lifecycle, is often neglected by the team involved in the ML project, particularly by the tech staff, who usually expect it to be handled by stakeholders or managers. This probably stems from the wrong perception of this phase as a strictly business-oriented precondition to solving a particular problem with ML. To make matters worse, not handling this phase properly might negatively impact all of the subsequent phases, if they are reachable at all, and result in unnecessary costs in time and resources. Furthermore, even an unsuccessful outcome of problem framing (determining that no ML solution can be applied) saves the team time and effort, since otherwise it would find out the same thing the hard way – while working on the next phases of the lifecycle.
For that purpose, at Intertec.io we have established a specific set of steps that need to be taken in order to accurately determine the outcome of problem framing. We also have a semi-strictly defined flow in which the business understanding process always comes first and cost prediction is the last one before concluding the whole phase – unless the phase has already failed in one of the earlier processes.
Sometimes the order of the processes can differ. For example, in exceptional cases the data discovery process can follow the ML type determination, allowing a certain level of flexibility; this rarely happens in practice, but it is not an impossible scenario. The process diagram above summarizes the best practices of problem framing I have experienced so far. Next, I will focus on the key points for each of the problem-framing processes.
I intentionally use the phrase “visual overview” in the preface to emphasize the visual aspects when discussing, planning, or analyzing things in problem framing. This is especially important for the business understanding process where either one or both of the following two scenarios happen:
During the communication with the client’s representatives, noting down the key points is an essential part of gathering your thoughts afterward. It also serves as a basis for further brainstorming sessions. Besides aligning and adjusting the terminology, the most important part of those sessions is exchanging thoughts on how an individual understands and how the team should understand the problem.
One activity that doesn’t require much skill, which is probably why it is overlooked, is understanding through process flow diagramming. There aren’t many free icon sets specific to the ML lifecycle for the most popular diagramming tools such as Diagrams or Lucidchart, and I would not recommend spending much time looking for the most adequate one. Instead, select icons that are as self-explanatory as possible – or create your own reusable set – to keep the diagrammed actions simple. As an illustration, the diagrams in the previous section were created using Diagrams, and the following diagrams were created using Whimsical, suggested by Doug Hudgeon, one of the authors of the book Machine Learning for Business.
This understanding by diagramming helps the team define the problem with respect to its possible resolution using ML. A proper problem definition usually provides the outcome of the business understanding process. There are three possible outcomes for an addressed problem.
Sometimes there are problems that cannot be solved with ML using current technology and methods. The first hunch of whether something can be solved comes from the following diagram. If none of the steps considered manual can be performed by a machine, then solving the problem using ML is usually determined to be impossible.
Other candidates are problems that are not repetitive enough – triggered by some random event, or with an action frequency that is low or unpredictable. For example, a tech research activity uses a variety of methods depending on the technology, is not repetitive in nature, and can be sparked by an event that is hard to anticipate beforehand and usually rarely occurs.
Finally, some problems are not worthy of even being candidates for ML solutions. These cases often occur when the team is the initiator for a solution to a problem that a client probably possesses. Such an example is offering a conversational chatbot for a small company that uses chat services only internally.
Problems that can definitely be classified as ML solvable, regardless of the domain, usually fall under the following two categories of business processes:
automation – repetitive time-consuming processes or processes prone to human mistakes;
prediction – processes that benefit from predicting values that cannot be easily estimated otherwise.
Let’s go back to the diagram explaining the manual content creation process involving an editor. How do we specifically determine whether a problem is ML solvable? Simply by referring to the business understanding diagram: if at least one of the actions labeled as manual can be automated, then the team can proceed with the next processes of the ML lifecycle.
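The decision rule is simple enough to sketch in a few lines of Python. Note that the action records, their field names, and the `is_ml_candidate` function are hypothetical illustrations, not output of any real diagramming tool:

```python
# Hypothetical sketch: proceed with the ML lifecycle if at least one
# action labeled as manual in the business understanding diagram can
# be automated. The structure of the records is illustrative.

def is_ml_candidate(actions):
    """Return True if any manual action is automatable."""
    return any(a["manual"] and a["automatable"] for a in actions)

content_creation = [
    {"name": "collect sources",    "manual": False, "automatable": True},
    {"name": "write main content", "manual": True,  "automatable": True},
    {"name": "final approval",     "manual": True,  "automatable": False},
]

print(is_ml_candidate(content_creation))  # → True: "write main content" qualifies
```

If every manual step were marked as non-automatable, the function would return `False` and the lifecycle would stop here.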
Referring to the diagram describing content creation by an editor, this is a good candidate for automated enhancement, since at least one of the actions is manual and can be automated in different ways: by generating only the main content, by summarizing the main content, by combining generation and summarization, or even by fully automating the whole process without requiring an editor to act as a user of the present content creation system.
Since there can sometimes be multiple ways of solving a particular problem with ML, as in the previous example, the way to proceed should be negotiated with the client. It is important to put all of the options on the table and be as open as possible, because sometimes the client is not aware of all the possibilities. For example, maybe the client only wanted to solve the writing of the main content and didn’t take the summarization into account at all.
Some problems can be classified as ML solvable, but only under certain conditions. Those conditions reflect the probabilistic nature of ML: depending on the achieved accuracy, the client needs to decide whether to classify the problem as solvable or not. For example, in life-dependent cases, such as a model’s ability to recognize street signs for self-driving vehicles, achieving a 99% recognition rate may not be enough.
Since at Intertec.io we deal with non-life-dependent ML problems, clients sometimes accept an initial model accuracy as low as 60%, even for production purposes, as some of the models are used for suggestions. The client can often be very specific about the minimum expected accuracy of a solution, but the reality is that the team cannot always guarantee it because of the previously mentioned probabilistic aspect, especially at higher rates (above 90%). What can be negotiated is the period over which the model will be polished. A good example of this is “improve the model from 70% to 75% in the course of a month”.
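A negotiated target like this is straightforward to track against the model’s evaluation history. A minimal sketch, where the weekly numbers and the helper name are hypothetical:

```python
# Hypothetical helper for the negotiated agreement described above:
# did the model reach the target accuracy within the agreed period?

def target_reached(tracked_accuracies, target):
    """True once any tracked evaluation meets the negotiated accuracy."""
    return max(tracked_accuracies) >= target

weekly_accuracy = [0.70, 0.71, 0.73, 0.76]  # evaluations tracked over one month
print(target_reached(weekly_accuracy, 0.75))  # → True
```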
Additionally, one problem can be solved with ML depending on the client’s decision, which usually results from their perspective – provided that perspective matches the team’s view as well. For instance, take a look at the diagram below describing a newsletter (NL) creation problem. The team is confident that it can fully automate the NL creation with ML, arguing that an efficient way of writing a summary of the content can be implemented. However, the client is satisfied with having only a few templates for that purpose and does not value writing creativity. Therefore, the additional effort of using a summarization model will not add value in the eyes of the client, which results in partial NL generation (only of the content with the key information) instead of a complete, automatically generated NL without any manual intervention.
Sometimes the lack of adequate technology at the given moment is a key factor in determining the probability of the problem being ML solvable. If the problem of NL creation had been addressed years ago, before the transformer architecture and the availability of a large number of pre-trained models coupled with transfer learning concepts, solving it with traditional NLG methods would have required significantly more effort, and the outcome would have been uncertain.
When the outcome of the business understanding process is “probably ML solvable”, the best practice is to conduct further investigations. These take longer, which the client must be made aware of, and the client can often back down because the investigation is too costly in terms of time and money.
This process starts when the client is asked whether they possess qualitative and quantitative data related to the business problem. Following the practice of understanding by diagramming, the question of data availability comes as a direct consequence of the absence of an icon related to any kind of data storage. Even the availability of what may seem like useful data does not imply a guaranteed ML solution.
A negative outcome of this process depends on the client’s willingness to invest more when there is not enough data in the appropriate format. If the client is reluctant, the team should usually push for discovering appropriate data sources rather than give up immediately. The problem is that those additional actions of identifying relevant data sources can drastically push back the release deadlines and thus significantly affect the project’s estimation. The possible danger of understaffing should also be taken into account.
Based on our company and team experience, I will try to summarize data availability into four categories.
This category covers the worst possible scenario, where there are no sources from which the data can be obtained, or their locations are unknown. If the ML lifecycle is not terminated at this point, the only way to proceed is to create the data manually. This is one of the biggest risks of proceeding and as such should be carefully estimated.
If there is a hypothesis that potential data sources do exist, then another thing the team can do, with the client’s approval, is crawl for them. This can take a long time, and until its completion the other activities should be blocked.
Even though the name of this category seems similar to the previous one, here there is at least a chance to get the locations of adequate data sources. If those locations are successfully identified, then the data should be obtained using techniques such as scraping, scanning, converting, etc. This is also a risky step, because there are no strong guarantees about the results of the mentioned techniques, so it should be approached cautiously.
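To make the scraping step concrete, here is a minimal, standard-library-only sketch of one small piece of it: collecting candidate dataset links from an already-downloaded HTML page. The page content and link names are made up for illustration:

```python
# Illustrative sketch of one scraping sub-step: extract candidate data
# links from fetched HTML. Uses only the standard library; the page
# content here is a hypothetical example.

from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # keep the href of every anchor tag encountered
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

page = '<a href="/dataset1.csv">d1</a> <a href="/dataset2.csv">d2</a>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # → ['/dataset1.csv', '/dataset2.csv']
```

In a real project the HTML would come from an HTTP client and the collected links would still need validation – which is exactly why this step offers no strong guarantees.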
There are scenarios where the team identifies part of the data. However, if incomplete, it is useless for the project. Missing labels are one of the most common issues that prevent the data from being used for ML solutions, and in most cases manual labeling comes to the rescue.
For a better understanding, I will share a personal experience from trying to solve an employee classification problem based on resumes and additional content. The client had a unique way of defining the company’s job descriptions that could not be easily matched with any scraped data. Since the company didn’t possess sufficient quantities of its own data, manual or semi-automated labeling (using token matching criteria where applicable) was a preliminary step towards obtaining a significant amount of data for the next phase.
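The token-matching idea can be sketched as follows. The categories, keywords, threshold, and function name below are hypothetical illustrations, not the client’s actual taxonomy:

```python
# Minimal sketch of semi-automated labeling via token matching:
# assign a category when enough of its keyword tokens appear in the
# text; otherwise fall back to manual labeling (None). All names and
# thresholds here are illustrative.

def token_match_label(text, categories, min_hits=2):
    tokens = set(text.lower().split())
    best_label, best_hits = None, 0
    for label, keywords in categories.items():
        hits = len(tokens & keywords)
        if hits > best_hits:
            best_label, best_hits = label, hits
    return best_label if best_hits >= min_hits else None  # None → label manually

categories = {
    "backend":  {"python", "api", "database", "django"},
    "frontend": {"javascript", "react", "css", "html"},
}

resume = "Senior Python developer, REST API and database design"
print(token_match_label(resume, categories))  # → backend
```

Records that score below the threshold go to the manual queue, which is what made the approach semi-automated rather than fully automated.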
It is important to point out that sometimes the labels are not the only missing piece but other parts of the data as well, which often requires additional scraping or other manual interventions.
This is the best possible outcome of the data discovery process, with a clear impact (without many surprises) on the next processes. Keep in mind that I am only pointing out the availability of raw data; other techniques, such as data transformation steps, come later in the data preparation phase.
This process is a consequence of the successful outcome of the business understanding and data discovery, although in some cases the outcome can be sensed during the progress of both processes. Sensing the ML type is one thing, but correct determination is another, because it requires putting every candidate on the table and choosing the most promising one. It seems easy and straightforward, but this process is sometimes prone to mistakes.
Let’s assume that the team has all the data it needs to start working on the problem. At first, it seems clear what the output of the model is planned to be or what the client requires. However, sometimes there can be other output candidates that are missed by the client, which usually happens when the team perceives the problem and therefore the solution differently. Let’s take a look at the two examples below.
In the first one, I’ll focus just on the solution for the automatic selection of key information in the NL from the “Probably ML solvable” subsection. The business understanding showed that the key information is selected based on a manual feeling about the revenue it brings to the client. Furthermore, according to the data discovery process, a huge amount of historical data is available, containing the revenue for each key piece of information. Therefore, even though this seems like a simple binary classification problem with the selection as a label, it can also be solved by regression if the label represents the revenue itself. Even though the second approach has the potential to be more complicated, it is also more suitable as a solution, because manual selection usually cannot predict the revenue with great accuracy.
As for the other example, let’s take a look at the second diagram presented in the ML solvable subsection. Here we need to find a way to generate two separate content parts – the main and the summarized content. According to the discovered data, one team member argues that both content parts can be generated from a single set of features, while another thinks that two types of models should be used, each dealing with one content part exclusively. This time, instead of talking about classic ML types, we are talking about NLP or, more specifically, different NLG types. Separate frameworks, formats, and sizes should not be up for discussion during this process. Instead, we should elaborate on the most suitable candidates according to the model type. Therefore, there should be less talk about PyTorch vs. TensorFlow or GPT-2 vs. DistilGPT-2 vs. T5, and more discussion about text2text, seq2seq, and summarization.
Once again, the result of this process does not strictly need to be limited to one ML type or model; it can be a set of candidates. It must be noted, though, that in the case of more than one proposal, an extra period should be allocated beforehand as a buffer for recovering from an unsatisfying pick.
In this process, not just the infrastructure but the potential usage of every candidate piece of technology as part of the solution is planned. The emphasis is put on the infrastructure because it is prone to architectural flexibility and has the biggest impact on the next process of the problem-framing phase. The infrastructure can be determined from the answers received from the client, or decided by the team itself if the client is not strongly bound to a specific infrastructure that it has inherited or wants to introduce.
Each team should strive to get answers to an important set of questions during this process. It is recommended to start with questions that mostly depend on the client, such as those about a possibly existing infrastructure, and then finish with the ones that mainly depend on the team itself. An example set of questions can be the following (it is recommended to preserve their order, although the last one can usually be treated independently, and in extreme cases the team’s proposal will not be accepted):
Since there can be multiple candidates for infrastructure and hosting/serving, their planning can be more negotiable because it depends on the client’s view and on inherited infrastructure. For instance, for a time series forecasting ML type, a client might prefer to host the solution on AWS instead of GCP because they have already adopted that cloud platform. Following that, when the team proposes serving the models using SageMaker, the client may not be entirely satisfied with the number of manual interventions it requires. Finally, the team proposes a limited but more specialized alternative in Forecast, which is acceptable to the client.
In general, it is preferable for a team that the software has as much maneuverability as possible, especially when it comes to open source. When planning, there should be a mini-report on whether a particular piece supports another (does DVC support S3 as a destination for the versioned assets?) or overlaps with one (does MLflow store model artifacts in a similar fashion to DVC?). A useful concept for diagramming the relationships among different software pieces is proposed in the C4 model, for which I recommend using IcePanel as a complete tool or Diagrams’ C4 shape set.
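As a quick illustration of the first check (DVC does support S3 as a remote for versioned assets), running `dvc remote add -d storage s3://my-bucket/dvc-store` writes an entry like the following into `.dvc/config`. The bucket name here is a hypothetical placeholder:

```ini
; .dvc/config after: dvc remote add -d storage s3://my-bucket/dvc-store
[core]
    remote = storage
['remote "storage"']
    url = s3://my-bucket/dvc-store
```

A mini-report like the one described above would pair such a snippet with the equivalent MLflow artifact-store setting, making the overlap between the two tools explicit.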
Intertec.io’s experience shows that decoupling the datasets, models, serving, experimenting (training/fine-tuning, validating, testing), and tracking is indeed the best practice for an architectural solution. One example is described in the diagram above – the first impression might be that it is not entirely optimal, since DVC can also be used for model versioning, but because MLflow handles it in a way more suitable for the individuals involved, the team decides to go with both.
I am not going to describe how to train a model for predicting the costs of implementing an ML solution. Although the title of this process may seem misleading, I will warn you right away that the term “costs” doesn’t refer only to money but to every other aspect of spending resources, like time and staff. Costs are also considered in the preceding processes, but on a smaller scale.
In this process, the team reviews the whole project plan and, if necessary, makes adjustments or terminates the lifecycle. Both sides should be aware that the estimation should be approached cautiously because of the complexity of the ML lifecycle, which is prone to more iterations and creativity compared to other standard software development workflows. Obviously, I am not allowed to provide a complete and detailed example for this section. Instead, I will try to mention some key points together with smaller examples per lifecycle phase, while disregarding the obvious (and usually open-source/freeware) tech choices such as the programming languages and frameworks.
When preparing datasets for modeling, the team usually wants to export the best possible one. Even for a small number of data sources, it can be hard to repeat this step frequently without a pipeline, which often requires using managed services that are relatively expensive. Is it always necessary to construct data preparation pipelines? Not really: if the frequency of training is low and there is a low risk of experiencing data drift, there is no strong reason to replace manual ingestion with an automated pipeline.
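For such low-frequency cases, manual ingestion can be as simple as a small script run before each training round. A minimal sketch, with made-up column names and data:

```python
# Sketch of the cheap, manual-ingestion alternative discussed above:
# a small script parses an exported CSV into (text, label) pairs before
# an occasional training run. Column names and rows are hypothetical.

import csv
import io

raw = "text,label\nhello world,1\nspam offer,0\n"  # stand-in for an exported file

def load_dataset(csv_text):
    rows = csv.DictReader(io.StringIO(csv_text))
    return [(r["text"], int(r["label"])) for r in rows]

print(load_dataset(raw))  # → [('hello world', 1), ('spam offer', 0)]
```

Only when this step has to run frequently, or when data drift becomes a real risk, does replacing it with a managed pipeline start to pay for itself.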
Regarding the modeling phase, the set of model candidates chosen earlier in the ML type determination process can be filtered further by analyzing the potential costs each of them brings. This comparison covers the costs of three approaches (a third-party Lex service, our own trained DialoGPT, and our own fine-tuned DialoGPT) for solving a conversation problem across the key criteria, where the color indicates the relative level of cost (green being the lowest, red being the highest).
The costs for the price and purpose-fit criteria are generalized, meaning that in rare cases the estimations can be the complete opposite. The purpose fit can only truly be verified much later in the lifecycle, during the testing process, when the quality of the output together with the latency can be compared on the same dataset. The benefits of the transfer learning concept usually encourage the team to give the fine-tuned models priority over the rest, due to the lower effort needed in the modeling phase compared to training from scratch, as well as a better purpose fit than managed services such as Lex. A similar comparison table can be constructed for other NLP approaches, such as summarization. For other, non-NLP ML types, the criteria can be different.
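A color-coded table like this can also be expressed as rough numeric scores to make the overall ranking explicit. The scores below are hypothetical stand-ins for the colors (1 = lowest cost, 3 = highest), not real estimates:

```python
# Illustrative cost scores mirroring the colored comparison described
# above (1 = lowest relative cost, 3 = highest). The numbers are
# hypothetical placeholders, not real estimates.

candidates = {
    "Lex (managed service)":   {"price": 3, "effort": 1, "purpose_fit": 3},
    "DialoGPT from scratch":   {"price": 2, "effort": 3, "purpose_fit": 1},
    "DialoGPT fine-tuned":     {"price": 1, "effort": 2, "purpose_fit": 1},
}

def total_cost(scores):
    return sum(scores.values())

best = min(candidates, key=lambda name: total_cost(candidates[name]))
print(best)  # → DialoGPT fine-tuned
```

With these placeholder numbers, the fine-tuned option wins, which matches the transfer-learning reasoning above; with different weights per criterion, the ranking could of course change.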
The final phase of the ML lifecycle brings the costs of operating the model in production. This means estimating the serving/APIs and monitoring/observability. It is good to use a service that manages the deployment and can handle concurrency, such as SageMaker, but using it can often be expensive in terms of money, especially for models that require larger instances to operate. For use cases where the model’s output is not directly consumed by an end user, meaning latency is not a big issue, a cheaper option is a containerized API on a smaller instance.
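The containerized-API option can be as lightweight as a single HTTP endpoint wrapped around the model call. A minimal standard-library sketch, where `predict` is a hypothetical stand-in for real model inference:

```python
# Minimal sketch of the cheap serving option: a small HTTP endpoint
# around the model call, suitable for packaging in a container on a
# modest instance. `predict` is a placeholder for real inference.

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(payload):
    return {"score": 0.5}  # stand-in for the actual model call

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        result = predict(json.loads(self.rfile.read(length)))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(result).encode())

server = HTTPServer(("127.0.0.1", 0), PredictHandler)  # port 0 = any free port
# server.serve_forever()  # inside the container, this line runs the API
print(predict({"text": "example"}))  # → {'score': 0.5}
server.server_close()
```

Compared to a managed endpoint, this shifts the concurrency and monitoring burden onto the team, which is exactly the trade-off being priced in this process.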
There are additional costs to be considered that are more global than those previously mentioned. As MLOps is not yet an established standard in the ML industry, clients are usually not keen on understanding its benefits. But besides the advantage of handling the majority of the ML lifecycle processes for the team, exposing the experiment tracking to the client can satisfy the AI explainability needs. Therefore, the client has to be informed about the MLOps tooling, which should be included without any compromises.
The client’s cost reduction after the solution is deployed should not be measured by how many manual actions it will replace – not because of the tendency to avoid revolts among people “whose jobs can potentially be taken over by the machines”, but because of the expectation that those people will become the first domain evaluators of the models’ activities, which is especially true for ML types where accurate validation is not possible, such as natural language generation.
The final result of the cost prediction process comes in the form of a table rather than a diagram, containing several options for implementation and focusing on the costs for each key criterion. One of the key criteria is the team’s size and structure, which can vary from one option to another and correlates with most of the other criteria. This table should display the advantages of some options over the rest per criterion, and if there are no clear arguments for or against a pick, it should be shared with the client for the final decision.
I tried to “compress” the content of the previous sections as much as possible, meaning that there are other concepts to be mentioned and other examples to be shared. Let me try to give answers to some of the obvious questions and provide my final conclusion on the topic.
If possible, the whole team planned to be involved in the project should take part in the initial meetings. Besides the product manager and other staff in standard roles such as DevOps or QA, the crucial roles are a data analyst, a data engineer, and a data scientist – especially in the processes of data discovery, infrastructure planning, and ML type determination, respectively. When it comes to the seniority of the people involved, it is preferable but not a must, since sometimes the younger team members can think outside the box and contribute unexpectedly to the brainstorming.
I have mentioned the complexity of the ML lifecycle several times and will do so once again when it comes to ML design patterns. ML practitioners usually think that there are no ML design patterns for problem framing because there is simply no development or engineering involved. But there is some tech work to be done, such as research and analysis. This type of tech work is more than enough to classify similar experiences into design patterns whose characteristics can be put to good use during the phase. These design patterns are mainly bound to the ML type determination process, but not exclusively. The following two examples are inspired by the design patterns according to Google standards.
Let’s go back to the example with the solution for the automatic selection of key information. In the section about the ML type determination, it is explained how for that use case it is possible to present the solution as regression instead of binary classification. This “conversion” of the output representation is known as reframing – one of the most common problem-framing design patterns.
Another frequently followed design pattern in this phase is multilabel, sometimes encountered as multi-task learning, or as multiclass classification when it comes to the classification ML type. The previously shown example of Intertec.io’s experience with automated content creation is an untraditional multilabel design pattern, where the generated output of a text2text transformer model represents multiple labels – the main content and the summarized content – separated by delimiters.
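The delimiter trick is easy to sketch on the consumption side: a single generated string carries both “labels” and is split apart after generation. The delimiter token and the sample output below are hypothetical choices, not the actual ones used in the project:

```python
# Sketch of the delimiter-based multilabel idea: one text2text output
# carries both the main content and the summary, split apart after
# generation. The delimiter token and sample text are illustrative.

DELIM = "<|summary|>"

def split_generated(output):
    """Separate a single generated string into (main_content, summary)."""
    main, _, summary = output.partition(DELIM)
    return main.strip(), summary.strip()

generated = "Full article body goes here. <|summary|> Short summary here."
main_content, summary = split_generated(generated)
print(main_content)  # → Full article body goes here.
print(summary)       # → Short summary here.
```

On the training side, the same delimiter is inserted between the two target texts, so a single model learns to emit both parts in one pass.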
To conclude, the team should aim to monetize its efforts in this phase regardless of the outcome. Even with a negative outcome, a valid analysis coupled with a detailed report of the completed problem-framing processes should at least give the client a clearer picture of the efforts required to conduct highly professional framing.
Hopefully, having read about all the cases covered in this rather extensive article guideline, you will detect the inability to proceed with the ML lifecycle as early as possible. If not, the team should move on to the more challenging data preparation phase with a lower risk of failure.
At the end of the day, every problem brings its own specifics, and there will probably be some use cases that cannot fit the patterns you have just read about – which, on the positive side, can lead to new resolution strategies, new design patterns, and new problem-framing processes in the ML lifecycle.