This post explains what machine learning is, gives a 360° view of the different types of ML systems, and introduces the main challenges that come with building ML solutions, which is definitely a good place to start :)
Machine Learning is a programming approach where a program improves its performance P on a task T through learning experience E (Tom Mitchell's definition, 1997). Arthur Samuel (1959) defines it as a field of study that gives a computer the ability to learn without being explicitly programmed. Indeed, an ML system is expected to improve its performance, otherwise it is probably time to debug it; it gets better at its task by learning from the data it is given, instead of having rules explicitly coded for it.
Machine Learning shines at solving complex problems for which we have no straightforward and cost-effective algorithmic solution, such as natural language processing, image recognition, or spam filtering. It can also help humans learn about the structure of data through data mining, and perform advanced classifications such as anomaly detection (e.g. fraud or money laundering).
Machine Learning systems can be classified along three main criteria, depending on whether they:
- Are trained with or without human supervision: supervised, unsupervised, semi-supervised, or reinforcement learning
- Can learn on the go: online versus batch learning
- Simply compare new data points to known ones, or detect patterns in the training data and build a predictive model (much more like scientists do): instance-based versus model-based learning
The two most common supervised tasks are regression (predicting values) and classification. In unsupervised learning, four common tasks are dimensionality reduction, clustering, association rule learning, and visualization.
Reinforcement Learning can be used to teach a robot how to walk in various unknown terrains. How does it work? It is based on a reward mechanism: each time the robot walks along the right path it gets a reward, and each time it falls off track it gets a penalty. In the end, it learns by itself the best strategy, called a policy, to get the most reward over time. The world champion of Go was beaten in 2017 by DeepMind's AlphaGo, a program trained using Reinforcement Learning. Note that learning was turned off during the match; AlphaGo was simply applying the policy it had learned through previous experience.
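To make the reward/policy idea concrete, here is a minimal tabular Q-learning sketch on a made-up one-dimensional "track": the agent is rewarded for reaching the goal and penalised for falling off. All states, rewards and constants are assumptions for illustration; this is not how AlphaGo is trained.

```python
# A minimal tabular Q-learning sketch on a hypothetical 1D "track" environment.
import random

N_STATES = 5          # positions 0..4, goal at position 4
ACTIONS = [-1, +1]    # step left or right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Return (next_state, reward): +1 for reaching the goal, -1 for falling off."""
    nxt = state + action
    if nxt < 0:
        return 0, -1.0            # fell off the track: penalty
    if nxt >= N_STATES - 1:
        return N_STATES - 1, 1.0  # reached the goal: reward
    return nxt, 0.0

for episode in range(500):
    state = 0
    while state != N_STATES - 1:
        # epsilon-greedy action choice: mostly exploit, sometimes explore
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, reward = step(state, action)
        best_next = max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = nxt

# The learned policy: the action with the highest Q-value in each state
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)}
print(policy)  # expected: always +1 (walk toward the goal)
```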
When it comes to segmenting your customers into multiple groups, you can use an unsupervised learning algorithm such as k-means, or HCA (hierarchical cluster analysis), which may also subdivide each group into smaller ones. This can help you target your marketing and advertising campaigns for each group. Note that supervised learning can also be used to segment your customers if you already know which groups you want to target.
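As an illustration, here is a minimal k-means sketch with scikit-learn; the customer features and the choice of three clusters are assumptions made up for the example.

```python
# A minimal customer-segmentation sketch with k-means.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [annual_spend, visits_per_month]
customers = np.array([
    [200,  1], [250,  2], [300,  1],    # occasional shoppers
    [1200, 8], [1100, 9], [1300, 7],    # regulars
    [5000, 3], [4800, 2], [5200, 4],    # big spenders
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(customers)
print(labels)                    # cluster assignment per customer
print(kmeans.cluster_centers_)   # the "profile" of each segment
```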
Spam detection is an interesting use case because it can benefit from both supervised and unsupervised algorithms. The supervised mode can initialise model training with a set of mails flagged as spam (i.e. a labelled dataset). But some spams will have characteristics that are not represented in the training dataset; these can be identified with an unsupervised algorithm, which helps refine the supervised training and eventually improves the overall model performance.
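A hedged sketch of how the two modes could be combined: a supervised classifier trained on a small labelled set of mails, plus an unsupervised anomaly detector that flags unusual mails for manual review. The mail texts and model choices below are illustrative assumptions, not a production recipe.

```python
# Supervised + unsupervised spam detection sketch (toy data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import IsolationForest

mails  = ["win a free prize now", "meeting at 10am tomorrow",
          "cheap pills online", "quarterly report attached"]
labels = [1, 0, 1, 0]          # 1 = spam, 0 = ham (labelled dataset)

vec = CountVectorizer()
X = vec.fit_transform(mails)

clf = MultinomialNB().fit(X, labels)                          # supervised part
novelty = IsolationForest(random_state=42).fit(X.toarray())   # unsupervised part

new_mail = vec.transform(["exclusive crypto offer, act now"])
print(clf.predict(new_mail))                 # spam / ham prediction
print(novelty.predict(new_mail.toarray()))   # -1 = looks unusual, worth reviewing and labelling
```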
Spam filters are usually trained offline, meaning before being deployed to perform the filtering task on the mail server. Once deployed, the algorithm classifies mails based on the examples it was trained with, without the ability to keep learning while doing its work. Conversely, online learning systems can learn incrementally by being fed data instances sequentially, either individually or in small groups called mini-batches. Each learning step is fast and cheap, so the system can learn on the fly as data arrives, improving its performance while performing its task. It is also a great option if you have limited computing resources, since it can handle huge datasets that cannot fit in one machine's main memory: this is called out-of-core learning, where the algorithm loads part of the data, runs a training step on it, and repeats the process until it has run on all the data. Online training is great for systems receiving data as a continuous flow, such as stock prices.
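As a sketch of online / out-of-core learning, some scikit-learn estimators expose a partial_fit method that updates the model one mini-batch at a time; the data stream below is simulated with random mini-batches, so the feature and target definitions are assumptions for illustration.

```python
# A minimal online-learning sketch using scikit-learn's partial_fit API.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(42)
model = SGDClassifier()                  # linear classifier trained by stochastic gradient descent

classes = np.array([0, 1])               # all classes must be declared for partial_fit
for step in range(100):                  # pretend each mini-batch arrives over time
    X_batch = rng.randn(32, 5)           # mini-batch of 32 instances, 5 features
    y_batch = (X_batch[:, 0] > 0).astype(int)   # toy target
    model.partial_fit(X_batch, y_batch, classes=classes)

print(model.predict(rng.randn(3, 5)))    # the model keeps learning as data flows in
```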
Instance-based learning algorithms learn the training data by heart and then rely on a similarity measure to make predictions. For instance, a spam filter could flag emails that are very similar to known spam emails, based on a measure of similarity such as the number of words they have in common. Model-based learning, on the other hand, generalizes from a set of examples by building a model, then uses that model to make predictions.
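A minimal instance-based sketch, assuming made-up features (number of suspicious words, number of links): k-nearest neighbours essentially stores the training instances and classifies new points by similarity to them.

```python
# Instance-based learning sketch with k-nearest neighbours.
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical mail features: [suspicious words, number of links]
X_train = [[8, 5], [7, 6], [1, 0], [0, 1]]
y_train = ["spam", "spam", "ham", "ham"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)        # "training" is mostly storing the instances

print(knn.predict([[6, 4]]))     # closest stored mails are spam -> 'spam'
```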
Model-based learning algorithms, such as logistic regression or neural networks, build a model whose parameters are tuned to fit the training dataset and its features. The performance of a model-based algorithm can be measured using a fitness function or a cost function. Generally, choosing a model that fits the provided data instances well and training it on a good-quality dataset helps achieve greater performance.
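A minimal model-based sketch with a linear model and made-up numbers: the algorithm fits the model parameters by minimising a cost function (squared error here), then the fitted model makes predictions on new cases.

```python
# Model-based learning sketch with linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])      # e.g. years of experience
y = np.array([30, 35, 41, 44, 50])           # e.g. salary in k$ (made up)

model = LinearRegression()
model.fit(X, y)                              # minimises a cost function (squared error)

print(model.coef_, model.intercept_)         # the learned model parameters
print(model.predict([[6]]))                  # prediction for an unseen case
```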
Still, there are common challenges on the road to achieving great performance with ML systems:
Overfitting/Underfitting: a model can perform really well on the training data but poorly on new data; we call this overfitting. On the other hand, underfitting happens when a model is too simple to learn the underlying structure of the data.
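A quick sketch of the idea on synthetic data: a degree-1 polynomial underfits a nonlinear dataset, while a degree-15 polynomial drives the training error close to zero, which by itself says nothing about how well it will generalize. The data and degrees are arbitrary choices for illustration.

```python
# Under- vs overfitting sketch: polynomial models of increasing flexibility.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 20)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.2, 20)   # small, noisy, nonlinear dataset

for degree in (1, 4, 15):                        # too simple, reasonable, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    train_error = np.mean((model.predict(X) - y) ** 2)
    # degree 15 fits the training data almost perfectly -- a classic overfitting symptom
    print(degree, round(train_error, 3))
```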
Poor-Quality Data: Obviously, if your training data is full of errors, outliers, and noise, it will be harder for the system to detect the underlying patterns, and it will be less likely to perform well. It is often well worth the effort to spend time cleaning up your training data; this is a recurring activity that data engineers frequently do.
Not Enough Data: Simple problems typically need thousands of examples, while complex problems such as image recognition may need millions, unless you can reuse parts of an existing model.
Nonrepresentative Training Data: It is crucial that your training data be representative of the new cases you want to generalize to. So be mindful of sampling noise (a small, nonrepresentative dataset) and sampling bias (a large dataset collected with a flawed sampling method).
Irrelevant Features: A system will only be capable of learning if the data contains enough relevant features. Coming up with a good set of features to train on, called feature engineering, is critical to the success of a Machine Learning project.
Assuming you train a model on a representative dataset, with enough quality data and relevant features, overfitting can still happen. This is where tuning a hyperparameter that controls the amount of regularization comes into play. A hyperparameter is a parameter of the learning algorithm, not of the model itself; it is set before learning and remains constant during the learning phase. The larger its value, the less your learning algorithm will overfit, but it may well start underfitting the training data. Hence the hyperparameter value should be carefully set.
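A minimal sketch of such a hyperparameter, using Ridge regression's alpha on synthetic data (the values are arbitrary): alpha is fixed before training, and larger values shrink the model's parameters, trading less overfitting for a higher risk of underfitting.

```python
# Regularization hyperparameter sketch with Ridge regression.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(30, 10)
y = X[:, 0] * 2.0 + rng.normal(0, 0.1, 30)       # only the first feature really matters

for alpha in (0.01, 1.0, 100.0):                 # the hyperparameter, fixed during training
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, np.round(model.coef_[:3], 2))   # larger alpha shrinks the coefficients
```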
So, how would you identify the right model and a good value for this nasty hyperparameter? By testing and validating :) again, again and again.. Testing remains a pillar of building quality models in machine learning, just as in traditional programming.. Recall TDD, an XP technique praised by the Software Craftsmanship communities but still cursed by most developers :) Alternatively, you can choose to put your model in production and monitor how well it performs. This works, but if your model is appalling, your customers will complain - not the best idea.
Let's see how to prevent this. To select the best model, you make an assumption about what the best model would be (a linear function, a polynomial function, a logarithmic one,..) and then split your data into two sets: the training set (e.g. 80%) and the test set (e.g. 20%). In a first phase, the different models are trained on the training set using different combinations of hyperparameters. In a second phase, the best-performing ones are evaluated on the test set to assess the error rate on new cases, called the generalization error.
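A minimal sketch of this 80/20 workflow on synthetic data, using scikit-learn's train_test_split and measuring the error on the held-out test set.

```python
# Train/test split sketch: estimate the generalization error on held-out data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(42)
X = rng.randn(200, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
test_error = mean_squared_error(y_test, model.predict(X_test))
print(test_error)    # estimate of the generalization error on new cases
```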
But sometimes this is not enough: you have optimised your model and hyperparameters to get the best generalization error on that particular test set (e.g. reaching 5%), so it is unlikely to perform as well on new data instances in production (e.g. witnessing 15%). A common solution is to have a second holdout set, called the validation set:
- Train multiple models with various hyperparameters using the training set (e.g. 80% of the total dataset)
- Select the model and hyperparameters that work best on the validation set (e.g. 10% of the total dataset)
- Run a final test against the test set (e.g. 10% of the total dataset) to get a better estimate of the generalization error

Be mindful to create the three datasets with a similar distribution. Cross-validation is an alternative technique that avoids spending too much data on validation sets: multiple models and hyperparameters are trained on different subsets of the training set and validated on the remaining parts. Once the model type and hyperparameters are selected, the final model is trained with these hyperparameters on the full training set, and its generalization error is measured on the test set.
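A hedged sketch of that workflow with scikit-learn: GridSearchCV compares candidate hyperparameters by cross-validation on the training set only, refits the best model on the full training set, and the test set is used once for the final generalization estimate. The data and the hyperparameter grid are made up for illustration.

```python
# Cross-validation + final test set sketch.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(300, 5)
y = X @ np.array([1.0, 0.5, 0.0, -1.5, 2.0]) + rng.normal(0, 0.3, 300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)                 # cross-validation on the training set only

print(search.best_params_)                   # selected hyperparameter
print(search.score(X_test, y_test))          # final generalization estimate (R^2) on the test set
```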
As you can see, building a working ML system is no easy task, and the more time you spend upfront identifying the right model, creating a good dataset, extracting the right features and finding the right hyperparameters, the better your system will perform. Just as with imperative programming, following a test-driven approach provides better efficiency in the long run and a greater return on invested effort.
In the next post, I will talk about chapter 2, discovering how to apply all this to an end-to-end machine learning project: looking at the big picture, getting the data, discovering and visualizing it to gain insights, preparing it for ML algorithms, selecting a model and training it, fine-tuning the model, and deploying/monitoring/maintaining the system.. stay tuned :)