Lifecycle of a bug
A few days back, I was working with crucial client data. My ML model was performing as expected, but the final results were way off. I checked the cross-validation scores to see if this was a case of overfitting, ran the model on a random sample to make sure the data distribution had not changed, went through all the model evaluation metrics, checked the feature correlations, and went through rounds of hyperparameter tuning and error analysis. I checked everything I knew, but I could not find the problem. Then I found the monster bug, quietly sleeping behind the mask of a wrapper function.
This bug must have been there since the beginning, but it never surfaced, and I never suspected it. It made me wonder:
How did it originate?
Why didn’t I catch it earlier?
What could I have done to avoid it?
What was the lifecycle of the bug?
What is the lifecycle of a bug?
After calming my anxiety with a glass of wine (just kidding, it was a bottle), I took some deep breaths and retraced my steps. I found answers to some of my questions.
Revisiting my thought process, I divided the analysis into two parts:
- What are the potential sources of bugs, especially in an ML project?
- How can we avoid bugs in ML projects?
Potential sources of bug introduction in ML projects
- Inherited codebase: “I don’t know how to reproduce our current model, the person who worked on it isn’t around anymore, but it works, and I don’t want to poke the bear.” I have heard this so many times, and I have been guilty of saying it many times myself. Sometimes we don’t have the domain knowledge, or we missed critical details during the knowledge transfer, or we don’t know the assumptions made in the code, or we are simply afraid to make changes. Such a codebase is a breeding ground for bugs. It is possible that you are not using the input data as intended, or that something as simple as not preprocessing the test data is quietly messing up your results (see the preprocessing sketch after this list).
- Lack of modularity: Most ML projects are a mix of experimentation and production-driven code. With experimentation come messy Jupyter/Colab notebooks. I love Jupyter notebooks, but they can make life a living hell once they get too long. A notebook that works just fine can go horribly wrong when the execution order is changed. The code grows into a monolith and becomes prone to bugs.
- Packages: With the plethora of machine learning packages available, the experiment cycle has become really quick. But we have to make assumptions about the APIs, version management and compatibility, license agreements, etc. These assumptions can be very dangerous because they can break your code silently, for example with just a package update (see the version-check sketch after this list).
- Non-deterministic results: While creating any machine learning model, we deal with two aspects. The first is the deterministic, interface-intensive part, where we know what our training data should look like, how labels are supposed to be encoded, how the model should be saved, what the prediction data format should be, and so on. These are the places where we can avoid bugs by writing tighter tests. The problem arises when the bug is introduced in the non-deterministic, algorithm-intensive part of your machine learning model, for example a model issue in a hierarchical classifier setting with multiple models in place, where one model works fine and another does not produce the desired results on specific data. Such data- and model-dependent bugs are very hard to catch; pinning down the randomness you do control at least makes them reproducible (see the seeding sketch after this list).
- Lack of type definitions: We tend to skip writing type definitions when prototyping, and as always, the prototype grows into something real and our typeless code creeps into production. This leads to null pointer exceptions, bad edge-case handling, and serialization/deserialization errors, just to name a few (see the typing sketch after this list).
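To make a few of these points concrete, here are some minimal sketches. First, for the inherited-codebase point: a hedged example, assuming a scikit-learn setup, of keeping preprocessing and the model behind one Pipeline so the test data can never silently skip (or re-fit) a transformation. The data and model are placeholders, not the original project's code.

```python
# A minimal sketch (not the original project's code): keeping preprocessing
# and the model in one Pipeline means the test set always goes through the
# same, already-fitted transformations as the training set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data standing in for the real client data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = Pipeline([
    ("scale", StandardScaler()),    # statistics are learned on training data only
    ("clf", LogisticRegression()),
])

model.fit(X_train, y_train)         # fit scaler and classifier together
print(model.score(X_test, y_test))  # test data is scaled with the training statistics
```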
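For the packages point: a small sketch that fails loudly at import time when installed versions drift from the ones your module was validated against, instead of letting a silent package update change your results. The package names and version pins here are illustrative assumptions.

```python
# A sketch of failing loudly when installed packages drift from the versions
# the module was validated against. Package names and pins are examples only.
from importlib.metadata import version

EXPECTED = {
    "scikit-learn": "1.3",   # assumed pin, for illustration
    "pandas": "2.1",         # assumed pin, for illustration
}

for package, expected_prefix in EXPECTED.items():
    installed = version(package)
    if not installed.startswith(expected_prefix):
        raise RuntimeError(
            f"{package} {installed} is installed, but this module was "
            f"validated against {package} {expected_prefix}.x"
        )
```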
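For the non-deterministic results point: a sketch of seeding the randomness you do control. It does not make data- or model-dependent bugs go away, but it makes a flaky result reproducible while you hunt for the real cause. Framework-specific seeds are only hinted at.

```python
# A sketch of pinning the controllable randomness so a flaky result becomes
# reproducible. This does not remove data-dependent bugs, it only exposes them.
import os
import random

import numpy as np


def set_seed(seed: int = 42) -> None:
    """Seed the usual sources of randomness in a typical ML script."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Framework-specific seeds (e.g. torch.manual_seed, tf.random.set_seed)
    # would go here if those libraries are part of the module.


set_seed(42)
```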
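And for the missing type definitions point: a sketch of lightweight type hints on a hypothetical prediction helper, so the "value may be missing" case is explicit in the signature instead of surfacing later as a null pointer error. A type checker such as mypy can then flag misuse before runtime.

```python
# A sketch of type hints on a hypothetical prediction helper: the Optional
# return type documents the "row could not be scored" case explicitly.
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Prediction:
    label: str
    score: float


def predict_one(features: Optional[Dict[str, float]]) -> Optional[Prediction]:
    """Return a Prediction, or None when the input row is unusable."""
    if not features:                 # covers both None and an empty dict
        return None
    # A real model call would go here; a constant stands in for illustration.
    return Prediction(label="positive", score=0.87)


print(predict_one({"age": 42.0}))
print(predict_one(None))             # handled explicitly instead of crashing
```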
I know a lot of you must be wondering: “Hey, these things seem so obvious!” and I can totally relate to that. But I still make these mistakes sometimes, mostly when I am in an utmost hurry. After working through a lot of bugs, though, I have found some better strategies for avoiding them, especially when your work consists of a good amount of both experimentation and production-ready code.
How can we avoid bugs in ML projects?
- Organize your code: No matter how much of a hurry you are in, always organize your code. Write smaller, reusable helper functions with proper documentation. If a script grows beyond 100 lines, consider breaking it up into multiple scripts. Keep all your utility and preprocessing code in one common place. And always try to add checks to older code and code you have inherited.
- Test-driven development to tackle both interface-related system tests and data-dependent tests: This is a really important step toward writing efficient machine learning modules. Unit tests (with nose or pytest) can help with interface-related system tests, while checking model quality on a simple, fixed sample of data can help catch bugs in the algorithm-intensive parts. Consider writing test cases that apply to different scales of data and that handle the edge cases of your complex module (see the pytest sketch after this list).
- Post-mortem analysis of bugs: A post-mortem analysis or retrospective of bugs is not only beneficial for understanding the characteristics of existing bugs; it is also an opportunity to learn more sophisticated coding patterns. It can provide insights for guiding the design of bug-detection tools, triaging bug reports, locating likely bug locations, suggesting possible fixes, gauging testing and debugging costs, and measuring software quality, and it helps to monitor and manage development processes.
- Deep dive into the packages used in your module: It’s crucial to know the details of the packages you depend on. Focus on the libraries you are using, their nuances, code compatibility, package limitations, version management, etc. Invest some time in creating your own mini ML utility library. Sometimes it’s handy to write your own accessors and sampling helpers built on primitive data structures, which gives you more code transparency (see the sampler sketch after this list).
- Write better logs and docs: I can’t emphasize this point enough. Having a research journal with all the experiment details, logs, and documentation not only helps in retrospectives but also gives structure to your code base. Managing a machine learning module is difficult enough on its own, and not having a layer of description makes it even more convoluted. Keeping records of all experiments helps you understand the overall nature of your data, gives you starting points for future experiments, and helps in formulating relevant hypotheses and debugging strategies (see the journal sketch after this list).
- Increase the transparency of your model: I am a firm believer in the Feynman technique of learning: if we can’t explain a concept in plain, simple words, then we probably haven’t understood it well ourselves. Try validating more of your model’s hypotheses. Document your justification for the assumptions you made. Write better examples of the expected data and the consequent results. Spend some time on better visualizations; tools like Matplotlib, Seaborn, Highcharts, and qgrid can help with the process (see the confusion-matrix sketch after this list).
- Invest some effort in using machine learning experiment management tools like SageMaker: There is an awesome article describing the overall cycle of a machine learning experiment and its management here.
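Again, a few hedged sketches to go with the list above. For the testing point, this is roughly what the two kinds of tests might look like with pytest; the synthetic data, the fixed seed, and the 0.9 quality threshold are illustrative choices, not an actual test suite.

```python
# A sketch of the two kinds of tests, written for pytest. The synthetic data,
# the seed, and the 0.9 threshold are illustrative, not the original tests.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def test_label_encoding_interface():
    # Interface-style test: labels must be the integer classes we expect.
    _, y = make_classification(n_samples=50, random_state=0)
    assert set(np.unique(y)) == {0, 1}


def test_model_quality_on_fixed_sample():
    # Data-dependent test: a small, frozen, well-separated sample should
    # always be easy; a silent regression in preprocessing will show up here.
    X, y = make_classification(n_samples=200, n_features=5,
                               class_sep=2.0, random_state=0)
    model = LogisticRegression().fit(X, y)
    assert model.score(X, y) > 0.9
```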
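For the mini utility library point, a tiny example of the kind of transparent helper I mean: a per-class sampler that uses only primitive data structures, so there is nothing hidden about what it returns. The function name and signature are hypothetical.

```python
# A sketch of a small, transparent in-house helper: a per-class sampler built
# only on primitive data structures. The name and signature are hypothetical.
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def sample_per_class(rows: Sequence[Tuple[dict, str]], k: int = 5,
                     seed: int = 0) -> Dict[str, List[dict]]:
    """Return up to k example feature dicts per label, for quick eyeballing."""
    by_label: Dict[str, List[dict]] = defaultdict(list)
    for features, label in rows:
        by_label[label].append(features)
    rng = random.Random(seed)
    return {label: rng.sample(items, min(k, len(items)))
            for label, items in by_label.items()}


rows = [({"length": 1.2}, "spam"), ({"length": 0.4}, "ham"), ({"length": 2.0}, "spam")]
print(sample_per_class(rows, k=2))
```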
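For the logs-and-docs point, a minimal journal sketch: one JSON line appended per experiment, so the research journal is machine-readable as well as human-readable. The file name and the recorded fields are assumptions for illustration.

```python
# A sketch of a machine-readable research journal: one JSON line per run.
# The file name and the recorded fields are assumptions for illustration.
import json
import time
from pathlib import Path


def log_experiment(params: dict, metrics: dict,
                   journal: str = "experiments.jsonl") -> None:
    """Append a timestamped record of one experiment to a JSON-lines file."""
    record = {
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        "params": params,
        "metrics": metrics,
    }
    with Path(journal).open("a") as fh:
        fh.write(json.dumps(record) + "\n")


log_experiment({"model": "logreg", "C": 1.0}, {"cv_accuracy": 0.91})
```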
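And for the transparency point, a quick visualization sketch using Matplotlib and Seaborn: a confusion matrix on synthetic data, which is often the fastest "explain it in plain words" picture of where a classifier actually goes wrong.

```python
# A sketch of a quick transparency check with Matplotlib/Seaborn: a confusion
# matrix on synthetic data shows at a glance which classes get confused.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
preds = LogisticRegression().fit(X_train, y_train).predict(X_test)

sns.heatmap(confusion_matrix(y_test, preds), annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.title("Which classes does the model confuse?")
plt.show()
```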
Lastly, I would like to conclude that writing better, more efficient machine learning modules is a constant effort: the more time we spend, the more we understand the nitty-gritty details of our data, our module’s performance, and its scalability. But keeping some starting points in mind helps you get a good head start.
I have attached some amazing articles below on this topic.
Some good links to read: