At Xpand, we take pride in our XP4DS workflow and like to surround ourselves with the best tools to make our work easier and results better. Among those technologies and tools, there is a special place reserved for MLFlow.
If you haven’t heard about MLFlow, turn off your phone and connect your modem because it’s time to catch up with the technological world!
MLFlow is an open-source platform that helps you manage your machine learning life cycle from the first model you train to that amazing model you will deploy to solve all your problems.
It addresses them through three main components:
- Tracking: Record and query experiments (code, data, config and results).
- Projects: Packaging format for reproducible runs on any platform.
- Models: General format for sending models to diverse deployment tools.
MLFlow is library-agnostic. You can use it with any machine learning library, and in any programming language, since all functions are accessible through a REST API and CLI. For convenience, the project also includes a Python API, R API, and Java API.
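To make the language-agnostic point concrete, here is a minimal sketch of the kind of HTTP request that any language could send to a tracking server. It only builds a request for the `runs/log-metric` REST endpoint and does not send it. The server address and the run ID are placeholders, not values from a real setup.

```python
import json
import time
import urllib.request

# Assumption: a tracking server is reachable at this address (placeholder).
TRACKING_SERVER = "http://localhost:5000"


def build_log_metric_request(run_id, key, value, step=0):
    """Build (but do not send) a POST request that logs one metric to a run."""
    payload = {
        "run_id": run_id,
        "key": key,
        "value": value,
        "timestamp": int(time.time() * 1000),  # milliseconds since the epoch
        "step": step,
    }
    return urllib.request.Request(
        url=f"{TRACKING_SERVER}/api/2.0/mlflow/runs/log-metric",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "example-run-id" is a placeholder; a real ID comes from creating a run first.
request = build_log_metric_request("example-run-id", "roc_auc", 0.91)
print(request.get_method(), request.full_url)
```

In day-to-day Python work you would normally use the Python API instead; the raw request is shown only because it is what the other language clients boil down to.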
1. Do you recall with precision the ROC AUC? (Metrics + Parameters Logging)
We’ve all been there. It’s your first iteration and you train a model that scores well. You keep iterating in the hope of finding a better set of hyperparameters, only to discover that your best model was an earlier one, and you can no longer remember which combination of hyperparameters produced it. With MLFlow, you don’t have this problem! With metric and parameter logging, you get information on all your models in one place: metrics, hyperparameters and even your own tags. In the MLFlow UI, you can compare all the trained models, sort them by any metric or tag and select the model of your choice.
2. It works on my machine ¯_(ツ)_/¯ (Model + Environment Logging)
Once again, as an incredible data scientist, you create an amazing model that solves the problem you need it to. Yet when you hand it over to your colleagues, it does not work. Perhaps a library needs to be updated, or some sorcery in the background misbehaves. With MLFlow this will no longer be a problem. Alongside metric logging, you can save your trained model, its conda environment and any other file that you deem important. This way, your colleagues can replicate your conda environment and run your trained model without issues.
3. Logging beyond experiments (Model Registry)
Just as you can save models per experiment, the Model Registry keeps every version of a registered model, including those that were once in production. Through the MLFlow UI you can access all the previous versions of the deployed model. More importantly, when you decide on the best model, you can register it so everyone in the team knows which model will move on to staging and production.
4. There is no “I” in MLFlow (Teamwork)
MLFlow takes teamwork to the next level by improving interoperability between teams. Everyone on your DS team can submit models, see each other’s, compare them with their own and even import them to keep working on them. Then, as a team, you can also push certain models for staging and deployment. These will have to be approved by the team responsible for those tasks. This way, the whole DS pipeline is visible in the MLFlow UI.
5. Model is ready for delivery (Deployment for Production)
You’re nearing the end of the project, you can see light at the end of the tunnel and all your hard work is paying off. All that’s left is to deploy the model and, you guessed it, MLFlow has you covered. With MLFlow Models you are ready to send your trained model for deployment on a wide range of platforms. Combined with its logging tools, this makes it well suited to monitoring the model’s performance over time so you can improve it if needed!
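One common deployment route is to serve a logged model over HTTP and send it rows to score. The sketch below only builds such a request and does not send it. It assumes a model already served locally (for example with `mlflow models serve`) on a hypothetical port 5000, and it uses the `dataframe_split` payload layout accepted by MLFlow 2.x scoring servers; older versions expect a different format. The feature names and values are made up.

```python
import json
import urllib.request

# Assumption: the model is served locally, for example with
#   mlflow models serve -m <model-uri> -p 5000
SCORING_URL = "http://127.0.0.1:5000/invocations"


def build_scoring_request(columns, rows):
    """Build (but do not send) a scoring request in the dataframe_split layout."""
    payload = {"dataframe_split": {"columns": columns, "data": rows}}
    return urllib.request.Request(
        url=SCORING_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical feature names and values for a single row to score.
request = build_scoring_request(
    ["sepal_length", "sepal_width", "petal_length", "petal_width"],
    [[5.1, 3.5, 1.4, 0.2]],
)
print(request.full_url)
# With the server running: urllib.request.urlopen(request).read()
```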
Conclusion
You will have realised by now that MLFlow sets out to solve, and succeeds in solving, many of the problems that a data scientist faces along the data science pipeline. From the moment you start training your model to the moment you deploy it into production, you can rely on MLFlow to track your progress and make the data science process much easier. Open source and ever-evolving, MLFlow is a must-have for the 21st-century data scientist.