Data Science Assessment: how to create machine learning models

5-SECOND SUMMARY:
  • This article continues “Data Science Assessment: how to analyse a project’s viability”.
  • You have recognised the value that data science could bring to your company and have already analysed your data pipelines and business goals.
  • You have decided on a problem you want to solve and have access to the data you need. But how do you begin to implement a data science solution and create machine learning models? Find out in this article.

First, identify the machine learning problem type

There are three main types of machine learning: supervised, unsupervised and reinforcement learning.

1. Supervised learning

A supervised model requires a labelled dataset: the training data must contain the value you are trying to predict. The model is trained on these examples so that it learns to predict that value for new, unseen data. There are two main categories of supervised learning: classification and regression.

Classification attempts to assign an observation to one of a finite set of classes defined during training. Examples include email spam detection and churn prediction.

Regression is used to predict a numeric value, such as sales revenue or expected demand.

Although classification and regression are separate subcategories, many algorithms have equivalent versions for both. The metrics used to assess model quality, however, are quite distinct for each.
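
To make this concrete, here is a minimal sketch (assuming scikit-learn and synthetic data rather than a real business dataset) that trains one model of each kind and evaluates it with a task-appropriate metric: accuracy for classification, root-mean-squared error for regression.

```python
# Minimal supervised-learning sketch using scikit-learn on synthetic data.
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split

# Classification: predict a class label (e.g. spam vs. not spam).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# Regression: predict a numeric value (e.g. sales revenue).
X, y = make_regression(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = RandomForestRegressor(random_state=0).fit(X_train, y_train)
print("RMSE:", mean_squared_error(y_test, reg.predict(X_test)) ** 0.5)
```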

2. Unsupervised learning

Unsupervised learning is the counterpart of supervised learning, requiring no labels in the dataset. The resulting models are generally not as powerful as their supervised equivalents, but they avoid labelling, which can be slow, expensive or even impossible. Unsupervised methods discover patterns in the data that can be turned into useful insights. Examples include customer segmentation, anomaly detection and dimensionality reduction.
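
As an illustration (again a sketch, with scikit-learn and synthetic data), k-means clustering can group customers into segments without any labels:

```python
# Customer-segmentation sketch using k-means; no labels are needed.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic customer features: annual spend and visits per month.
rng = np.random.default_rng(0)
customers = np.column_stack([
    rng.normal(1000, 300, 500),  # annual spend
    rng.normal(8, 3, 500),       # visits per month
])

# Scale features so neither dominates the distance metric, then cluster.
X = StandardScaler().fit_transform(customers)
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(segments))  # number of customers in each segment
```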

3. Reinforcement learning

Reinforcement learning refers to the class of problems in which an agent interacts with an environment. Unlike the other types, the model is expected to act on the environment through the set of actions available to it and to adapt to the impact of those actions. Self-driving cars and chatbots are examples of such models.
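
The loop below is a minimal sketch of that agent–environment interaction, assuming the gymnasium package and its classic CartPole environment; a real reinforcement learning algorithm would replace the random action with a learned policy.

```python
# Minimal agent-environment interaction loop (assumes the gymnasium package).
import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset(seed=0)

total_reward = 0.0
for _ in range(200):
    # A trained policy would choose the action; here we act at random.
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:
        observation, info = env.reset()
env.close()
print("total reward collected:", total_reward)
```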

Batch vs online

How will the model be consumed? Are you looking for a model that processes a batch of data at a fixed interval (every day or month)? Or do you want a model that processes inputs on demand, returning predictions quickly at any point? These are called batch and online predictions, and some models and frameworks are better suited to one than the other.

For example, a tool like Spark allows large datasets to be processed in parallel and is often used for large volumes of data, but models trained with PySpark are not viable for online predictions (at least not with the Spark runtime), due to Spark’s long startup time. If you don’t require real-time predictions, batch predictions are the better option: processing data in large chunks often allows for more efficient computation, and the predictions can easily be stored and retrieved later. Online predictions can also be stored, but at a significantly higher computing cost, because each prediction is computed and written one at a time.
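
The two consumption patterns can be sketched as follows (hypothetical model object and file paths, with pandas assumed): a batch job scores a whole dataset at once and persists the results, while an online function scores a single observation per request.

```python
# Two hypothetical ways to consume the same trained model.
import pandas as pd

def score_batch(model, input_path: str, output_path: str) -> None:
    """Batch: score a whole dataset at a fixed interval and store results."""
    df = pd.read_parquet(input_path)
    df["prediction"] = model.predict(df)
    df.to_parquet(output_path)  # predictions persisted for later lookup

def score_online(model, features: dict) -> float:
    """Online: score a single observation on demand, e.g. behind an API."""
    row = pd.DataFrame([features])
    return float(model.predict(row)[0])
```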

What tools are used to create the model?

The tools used to create the model depend mainly on the volume of data. If the data comfortably fits in RAM, we can use data processing and modelling frameworks that keep it in memory; in Python, the Pandas library is the most common choice. If the data is too large to fit in memory, there are still options for working on a single machine, with Dask and Polars being examples of Python libraries that do so. If you need to scale training horizontally across a cluster, Spark is the way to go.
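
As a hedged illustration (hypothetical file and column names, assuming a recent Polars version), a lazy scan lets Polars push filters down to the reader and, with streaming execution, process the file in batches rather than loading it whole:

```python
# Lazy, out-of-core aggregation with Polars (hypothetical file and columns).
import polars as pl

result = (
    pl.scan_csv("transactions.csv")  # builds a lazy query; reads nothing yet
      .filter(pl.col("amount") > 0)
      .group_by("customer_id")
      .agg(pl.col("amount").sum().alias("total_spend"))
      .collect(streaming=True)       # execute in streaming batches
)
print(result.head())
```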

Access to data

To develop a machine learning model, you typically need to have access to a sufficient amount of data relevant to the problem you are trying to solve. In some cases, you may be able to find pre-existing datasets that are relevant to your problem or use an open-source machine learning model that you then fine-tune.
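
For instance, public repositories such as OpenML host labelled datasets that are easy to pull into Python; the sketch below uses scikit-learn’s built-in fetcher (the dataset name is chosen purely for illustration):

```python
# Fetching a pre-existing public dataset (sketch using scikit-learn + OpenML).
from sklearn.datasets import fetch_openml

# "titanic" is one of many labelled datasets hosted on OpenML.
titanic = fetch_openml("titanic", version=1, as_frame=True)
X, y = titanic.data, titanic.target
print(X.shape, y.value_counts())
```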

It’s important to remember that while existing data and technology are useful resources, they may not be sufficient for your specific needs; you may have to tailor your approach to the unique requirements of your problem and the data available to you.

Final Thoughts

Selecting the most suitable model for a problem involves a trade-off between simplicity, adaptability to the shape of the data, and performance. Business requirements, the data to be worked on, problem understanding, and how the model will be used and evolve over time all influence the selection process.

With all this in mind, our team is ready to help solve any complex challenge. We help organisations assess the feasibility of applying data science techniques to solve specific challenges in their industry. With only a few consulting sessions, we can identify the problem and explore the potential of the company’s data, reducing the risk associated with implementing a solution in this area.

Luís Vicente
