Nuno Chicória

Data Scientist

Data Science Assessment: how to analyse a project’s viability

5 SECOND SUMMARY:
  • Data Science is a trendy field that applies new techniques, methods, and technologies to old problems to unlock data’s full potential and uncover hidden patterns.
  • To ensure the success of any data science project, you must carry out a viability analysis right at the start (Data Science assessment). Each company has its own requirements, timelines, ways of accessing data, specifications, etc. For this reason, you can’t start a project without understanding your organisation’s current status and the direction it wants to take.

The significance of data is widely acknowledged across sectors, and leveraging it can confer a substantial competitive advantage or even transform the fundamental nature of a business. Data science, a contemporary field, applies innovative techniques, methods, and technologies to long-standing challenges, unlocking data’s full potential and revealing latent patterns.

To ensure the success of Data Science projects, a thorough viability analysis must precede any other stage (our Data Science Assessment). Each company possesses unique requirements, timelines, methods for accessing data, and data specifications. Consequently, commencing a project necessitates a comprehensive understanding of the current state of the organization and its intended direction. This involves alignment with the existing state of our customer platforms, in addition to addressing needs, expectations, and requirements, with a primary focus on effective problem-solving.

Data Science Assessment

For this purpose, you must first evaluate (i) the level of maturity in data collection and processing, (ii) the volumes of data and scaling requirements and (iii) the technologies utilised within the organisation:

  • Maturity of data collection and processing – This involves assessing whether the organisation possesses well-established capabilities for collecting and storing data to gain the intended insights. Generally, there are three levels here: (i) the entry level, where data collection and processing enhancements are required to develop suitable data science tools; (ii) the intermediate level, where the client already has mechanisms for data collection, storage and processing ready to progress towards the data science domain; and (iii) the advanced level, where the organisation already incorporates some data science processes within its respective business areas.
  • Volume and scale needs – Here we ascertain the location of data storage and its scalability.
  • Technologies used in the organisation – Understanding the organisation’s technological infrastructure is crucial for working out the best solution for each situation in collaboration with our client. This involves capitalising on the advantages and mitigating the disadvantages of the technologies involved, depending on the circumstances.

Along with the technological component, the business component is a pivotal driver of success. It typically acts as the bridge between the current state (AS IS) and the envisioned future state (TO BE). Understanding the business context matters at several stages of the process. During the initial phase, the challenges and factors that directly or indirectly affect the outcomes – depending on the project type – may touch on legal considerations, process flows, sector-specific nuances, technical terminology, and more. In the data analysis phase, business context can help identify relationships between features or evaluate patterns. The correct alignment of technology and business is paramount, not only for the aforementioned reasons but also to streamline comprehension in subsequent phases.

To gain a deeper insight into the next phase of our journey, we must consider key attributes, including:

  • Quality and quantity of data – The data available plays a pivotal role in delineating not only the challenges that can be addressed but also the processes and efforts required for the necessary developments tailored to each client. Additionally, it prompts the question of whether it is feasible to collect, store and process additional data beyond the current scope.
  • How the model will be “fed” – It is vital to know the origin of data needed for training, along with the triggers and queries directed towards the model. For instance, determining whether queries originate from the front end, back end or both.
  • Context of model usage – The intended use for the model holds significant sway in project definition. Factors such as the frequency of requests, real-time versus batch queries, and the acceptability of a potential five-minute waiting time for results become crucial considerations.
  • Nature of the problem – The challenges to overcome may require the development of a model, the establishment of a monitoring and maintenance process or the extraction of relevant information using techniques like data mining, discovering relationships between variables or patterns in the data.

Each of these specifics, particularly the structure and respective content of the data, imparts a unique character to every challenge. Consequently, uncertainty emerges as a constant factor in every project, notably in the initial phase, when the intricacies of the data and the existing organisational reality are not yet fully understood.

Our Data Science process

Our team is well-prepared to address the most intricate challenges. We have developed a method to combat the inherent uncertainty of data science projects while ensuring progress within the defined scope and delivering substantial value to our client, underpinned by agile methodologies.

  1. We comprehensively analyse project viability, examining it from business and technical perspectives and meticulously defining success criteria.
  2. We construct and compare diverse models, identifying the one that aligns most closely with the established criteria.
  3. We use the insights gathered to plan and execute the deployment and monitoring of our solution in a production environment, recognising the constant monitoring and care that maintaining a data science solution requires.

Our unwavering commitment to continuous improvement has propelled Xpand IT to receive the esteemed “Microsoft Partner of the Year Award” for two years. This accomplishment solidifies our status as recognised experts in Microsoft tools.

Guide for monitoring machine learning models

5 SECOND SUMMARY:
  • This content is a continuation of the article: “Data Science Assessment: how to create machine learning models”.
  • Continuous model monitoring is essential for sustained success and optimal performance in machine learning: it means observing a model’s behaviour over time and tracking key metrics to ensure that predictions stay accurate and reliable.
  • Various open-source platforms simplify the machine learning lifecycle by providing tools for experiment tracking, model versioning through registries, and seamless deployment with integrated monitoring, empowering data scientists to navigate model management complexities for sustained success.

In the dynamic landscape of data science, building and deploying machine learning models is just the beginning. To ensure sustained success and optimal performance, continuous monitoring of these models is crucial. Model monitoring in the data science pipeline involves tracking, evaluating, and managing the performance of both experimental models and those deployed in production.

In this blog post, we’ll delve into the significance of model monitoring and explore how tools like MLflow can empower data scientists to keep a close eye on their experiments and deployed models.

Monitoring machine learning models

Model monitoring refers to the ongoing process of observing a machine learning model’s behaviour over time, both during the development phase and after deployment. It involves tracking various metrics to ensure that the model continues to deliver accurate and reliable predictions as data distributions evolve.

Key aspects of model monitoring

Performance Metrics

Monitoring the performance of your models involves tracking key metrics such as accuracy, precision, recall, F1 score, and more. These metrics provide insights into how well the model is generalizing to new data and whether any degradation in performance has occurred.
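As a minimal sketch, here is how these metrics can be computed with scikit-learn on a toy set of labels and predictions (all values are illustrative):

```python
# Toy example: computing standard classification metrics with scikit-learn.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 1, 0, 1, 0]  # ground-truth labels (illustrative)
y_pred = [0, 1, 0, 0, 1, 1]  # model predictions (illustrative)

print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1 score: ", f1_score(y_true, y_pred))
```

Computing these on every new batch of labelled data yields a time series in which degradation becomes visible early.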

Data Drift Detection

Data distributions in real-world scenarios are rarely static. Monitoring for data drift involves comparing the distribution of incoming data with the data the model was trained on. Monitoring tools allow you to set up automated processes to detect and alert when significant drift occurs.
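As an illustration, one simple check for drift in a single numeric feature is a two-sample Kolmogorov-Smirnov test; the data and the significance threshold below are assumptions made for the sketch:

```python
# Sketch: detect drift in one numeric feature with a two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training-time distribution
live_feature = rng.normal(loc=0.4, scale=1.0, size=1_000)   # incoming data, shifted mean

statistic, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:  # significance threshold: an illustrative choice
    print(f"Possible data drift (KS statistic={statistic:.3f}, p={p_value:.2e})")
```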

Model Drift Detection

Similar to data drift, model drift involves tracking changes in the model’s predictions over time. Monitoring tools enable you to log and compare model performance, helping you identify if the model’s effectiveness has degraded.
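A minimal way to operationalise this is to compare a metric computed on a recent window of predictions against the validation baseline; the baseline value and tolerance below are assumptions:

```python
# Sketch: flag model drift when a windowed F1 score falls below the baseline.
from sklearn.metrics import f1_score

BASELINE_F1 = 0.87  # F1 measured on the held-out validation set (assumed)
TOLERANCE = 0.05    # acceptable relative degradation (assumed)

def model_has_drifted(y_true_window, y_pred_window):
    """Return True if F1 on the recent window drops below the tolerated floor."""
    window_f1 = f1_score(y_true_window, y_pred_window)
    return window_f1 < BASELINE_F1 * (1 - TOLERANCE)
```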

How tooling facilitates model monitoring

Various open-source platforms simplify the machine learning lifecycle, and a key capability they share is the tracking and management of experiments. Here’s how these tools assist in keeping your models in check:

Experiment Tracking

These platforms allow you to log and organize experiments, making it easy to compare different runs and identify the most successful models. They record parameters, metrics, and artefacts, providing a comprehensive overview of your model development process.
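With MLflow, for example, a run with its parameters, metrics and artefacts can be logged in a few lines; the experiment name and values below are illustrative:

```python
# Sketch: logging one training run with MLflow.
import mlflow

mlflow.set_experiment("churn-model")  # illustrative experiment name

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("n_estimators", 200)  # hyperparameters
    mlflow.log_param("max_depth", 8)
    mlflow.log_metric("f1", 0.87)          # evaluation metrics
    # Any file can be attached to the run; assumes the plot exists on disk.
    mlflow.log_artifact("confusion_matrix.png")
```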

Model Registry

Model Registries act as central hubs for managing and versioning models. This ensures that every deployment is based on a specific version of the model, facilitating easy rollback in case issues arise.
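In MLflow, for instance, a model logged during a run can be registered under a versioned name; in this sketch the registry name is illustrative and <run_id> is a placeholder:

```python
# Sketch: registering a logged model in the MLflow Model Registry.
import mlflow

result = mlflow.register_model(
    model_uri="runs:/<run_id>/model",  # placeholder run ID of the logged model
    name="churn-classifier",           # illustrative registry name
)
print(f"Registered {result.name}, version {result.version}")
```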

Model Deployment and Monitoring

These platforms simplify the deployment process, making it seamless to transition from experimenting with models to deploying them in production. Additionally, they provide integrations with monitoring tools, allowing you to keep a close eye on the deployed model’s performance.
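As a sketch of the deployment side with MLflow, a specific registered version can be loaded for batch scoring; the model name, version and feature columns are assumptions:

```python
# Sketch: loading a registered model version and scoring a small batch.
import pandas as pd
import mlflow.pyfunc

model = mlflow.pyfunc.load_model("models:/churn-classifier/1")  # illustrative URI

new_data = pd.DataFrame(
    {"tenure": [3, 48], "monthly_spend": [29.9, 74.5]}  # toy feature values
)
print(model.predict(new_data))
```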

Final Thoughts

Model monitoring is an integral part of the data science pipeline that ensures the continued effectiveness of machine learning models. Tools such as MLflow emerge as powerful allies, offering features that streamline experiment tracking, model versioning, and deployment monitoring. By leveraging these tools, data scientists can confidently navigate the complexities of model management and monitoring, contributing to the sustained success of their machine learning endeavours.

Five everyday problems MLflow solves

At Xpand, we take pride in our XP4DS workflow and like to surround ourselves with the best tools to make our work easier and our results better. Among those technologies and tools, there is a special place reserved for MLflow.

If you haven’t heard about MLflow, turn off your phone and connect your modem, because it’s time to catch up with the technological world!

MLflow is an open-source platform that helps you manage your machine learning life cycle, from the first model you train to that amazing model you will deploy to solve all your problems.

It covers your problems under three main topics:

  • Tracking: Record and query experiments (code, data, config and results).
  • Projects: Packaging format for reproducible runs on any platform.
  • Models: General format for sending models to diverse deployment tools.

MLflow is library-agnostic. You can use it with any machine learning library and in any programming language, since all functions are accessible through a REST API and CLI. For convenience, the project also includes a Python API, an R API and a Java API.

1. Do you recall with precision the ROC AUC? (Metrics + Parameters Logging)

We’ve all been there. It’s your first iteration and you train a model with good accuracy values. You continue iterating in the hope of finding a better set of hyperparameters, only to discover that your best model was an earlier one whose combination of hyperparameters you can no longer remember. With MLflow, you don’t have this problem! With model logging, you get information on all your models in one place, from metrics to hyperparameters, and you can even add your own tags. In the MLflow UI, you can compare all the trained models, sort them by any metric or tag and select the model of your choice.
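The same comparison is available programmatically; a minimal sketch, assuming an experiment called "churn-model" in which a "roc_auc" metric and a "max_depth" parameter were logged:

```python
# Sketch: list every run in an experiment, sorted by ROC AUC.
import mlflow

runs = mlflow.search_runs(
    experiment_names=["churn-model"],   # assumed experiment name
    order_by=["metrics.roc_auc DESC"],  # best model first
)
best = runs.iloc[0]
print(best["run_id"], best["metrics.roc_auc"], best["params.max_depth"])
```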

2. It works on my machine ¯\_(ツ)_/¯ (Model + Environment Logging)

Once again, as an incredible data scientist, you create an amazing model that solves the problem at hand. Nevertheless, when you hand it over to your colleagues, it does not work: perhaps a library needs to be updated, or some sorcery in the background fails. With MLflow, this will no longer be a problem. Alongside the metrics logging, you can save your trained model, your conda environment and any other file that you deem important. This way, your colleagues can seamlessly replicate your conda environment and execute your trained model without issues.
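A minimal sketch of this, using the scikit-learn flavour on a toy model; MLflow captures the environment definition alongside the model:

```python
# Sketch: logging a trained model together with its environment.
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier().fit([[0, 1], [1, 0]], [0, 1])  # toy fit

with mlflow.start_run():
    # The model is stored with a conda/pip environment specification.
    mlflow.sklearn.log_model(model, artifact_path="model")

# A colleague can then reload the exact same model with one call:
# loaded = mlflow.sklearn.load_model("runs:/<run_id>/model")
```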

3. Logging beyond experiments (Model Registry)

Just as you can save models per experiment, every model that was once in production is also saved. Through the MLflow UI, you can access all the previous versions of the deployed model. More importantly, when you decide on the best model, you can register it so everyone on the team knows that’s the model that will move through staging into production.
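As a sketch, promoting a registered version through the stages looks like this; the model name and version are illustrative:

```python
# Sketch: moving a registered model version to the "Staging" stage.
from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name="churn-classifier",  # illustrative registry name
    version="3",              # illustrative version number
    stage="Staging",          # later "Production", or "Archived" on rollback
)
```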

4. There is no “I” in MLflow (Teamwork)

MLflow takes teamwork to the next level by improving interoperability between teams. Within your DS team, you can all submit and see each other’s models, compare them with yours and even import them so you can work on them too. Then, as a team, you can push certain models for staging and deployment; these have to be approved by the team responsible for those tasks. And thus, the whole DS pipeline is present in the MLflow UI.
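The usual setup behind this is a shared tracking server that the whole team points at; a minimal sketch with a placeholder URI:

```python
# Sketch: sending all runs to a shared MLflow tracking server.
import mlflow

mlflow.set_tracking_uri("http://mlflow.internal.example:5000")  # placeholder URI
mlflow.set_experiment("churn-model")  # the whole team logs to the same experiment
```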

5. Model is ready for delivery (Deployment for Production)

You’re nearing the end of the project, you can see light at the end of the tunnel and all your hard work is paying off. All that’s left is to deploy the model and, you guessed it, MLflow has you covered. With MLflow Models, you are ready to send your trained model for deployment on a vast array of platforms. This, combined with the logging tools, makes it perfect for constantly monitoring the model’s performance over time so you can improve it if needed!
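For instance, after serving a registered model locally with the `mlflow models serve` command, it can be queried over HTTP; a sketch assuming the default local port and the dataframe_split payload convention:

```python
# Sketch: querying a model served by `mlflow models serve` on port 5000.
import requests

payload = {
    "dataframe_split": {
        "columns": ["tenure", "monthly_spend"],  # assumed feature names
        "data": [[3, 29.9]],
    }
}
response = requests.post("http://127.0.0.1:5000/invocations", json=payload)
print(response.json())
```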

Conclusion

You will have realised by now that MLflow sets out to solve – and succeeds in solving – many of the problems a data scientist faces along the data science pipeline. From the moment you start training your first model to the model you deploy into production, you can always rely on MLflow to track your progress and make the data science process much easier. Open source and ever-evolving, MLflow is a must-have tool for the 21st-century data scientist.
