
Perceptron Algorithm for Classification in Python


The Perceptron is a linear machine learning algorithm for binary classification tasks.

It may be considered one of the first and one of the simplest types of artificial neural networks. It is definitely not “deep” learning but is an important building block.

Like logistic regression, it can quickly learn a linear separation in feature space for two-class classification tasks, although unlike logistic regression, it learns using the stochastic gradient descent optimization algorithm and does not predict calibrated probabilities.

In this tutorial, you will discover the Perceptron classification machine learning algorithm.

After completing this tutorial, you will know:

  • The Perceptron Classifier is a linear algorithm that can be applied to binary classification tasks.
  • How to fit, evaluate, and make predictions with the Perceptron model with Scikit-Learn.
  • How to tune the hyperparameters of the Perceptron algorithm on a given dataset.

Let’s get started.

Perceptron Algorithm for Classification in Python
Photo by Belinda Novika, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Perceptron Algorithm
  2. Perceptron With Scikit-Learn
  3. Tune Perceptron Hyperparameters

Perceptron Algorithm

The Perceptron algorithm is a two-class (binary) classification machine learning algorithm.

It is a type of neural network model, perhaps the simplest type of neural network model.

It consists of a single node or neuron that takes a row of data as input and predicts a class label. This is achieved by calculating the weighted sum of the inputs plus a bias (treated as a weight on a constant input of 1). The weighted sum of the inputs of the model is called the activation.

  • Activation = Weights * Inputs + Bias

If the activation is above 0.0, the model will output 1.0; otherwise, it will output 0.0.

  • Predict 1: If Activation > 0.0
  • Predict 0: If Activation <= 0.0
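To make this concrete, a minimal sketch of the prediction rule in plain Python is shown below (the function and variable names are illustrative, not part of any library):

# minimal sketch of the Perceptron prediction rule (names are illustrative, not a library API)
def predict(row, weights, bias):
	# activation is the weighted sum of the inputs plus the bias
	activation = bias + sum(weight * x for weight, x in zip(weights, row))
	# step transfer function: predict 1 if the activation is above 0.0, otherwise 0
	return 1 if activation > 0.0 else 0

# example: two inputs, two weights, and a bias
print(predict([2.0, 1.0], [0.5, -1.0], 0.2))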

Given that the inputs are multiplied by model coefficients, like linear regression and logistic regression, it is good practice to normalize or standardize data prior to using the model.

The Perceptron is a linear classification algorithm. This means that it learns a decision boundary that separates two classes using a line (called a hyperplane) in the feature space. As such, it is appropriate for those problems where the classes can be separated well by a line or linear model, referred to as linearly separable.

The coefficients of the model are referred to as input weights and are trained using the stochastic gradient descent optimization algorithm.

Examples from the training dataset are shown to the model one at a time, the model makes a prediction, and the error is calculated. The weights of the model are then updated to reduce the error for that example. This is called the Perceptron update rule. This process is repeated for all examples in the training dataset, called an epoch. This process of updating the model using examples is then repeated for many epochs.

Model weights are updated with a small proportion of the error for each example, and the proportion is controlled by a hyperparameter called the learning rate, typically set to a small value. This is to ensure learning does not occur too quickly, which could result in a lower-skill model, referred to as premature convergence of the optimization (search) procedure for the model weights.

  • weights(t + 1) = weights(t) + learning_rate * (expected_i - predicted_i) * input_i
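As a minimal sketch of how this update rule is applied (assuming a step-function prediction as above; the names are illustrative, not drawn from scikit-learn), one training epoch might look like this:

# illustrative sketch of one training epoch using the Perceptron update rule
def train_epoch(X, y, weights, bias, learning_rate=0.01):
	for row, expected in zip(X, y):
		# predict with the current weights
		activation = bias + sum(w * x for w, x in zip(weights, row))
		predicted = 1 if activation > 0.0 else 0
		# apply the update rule: weights(t + 1) = weights(t) + learning_rate * (expected - predicted) * input
		error = expected - predicted
		bias = bias + learning_rate * error
		weights = [w + learning_rate * error * x for w, x in zip(weights, row)]
	return weights, bias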

Training is stopped when the error made by the model falls to a low level or no longer improves, or a maximum number of epochs is performed.

The initial values for the model weights are set to small random values. Additionally, the training dataset is shuffled prior to each training epoch. This is by design to accelerate and improve the model training process. Because of this, the learning algorithm is stochastic and may achieve different results each time it is run. As such, it is good practice to summarize the performance of the algorithm on a dataset using repeated evaluation and reporting the mean classification accuracy.

The learning rate and number of training epochs are hyperparameters of the algorithm that can be set using heuristics or hyperparameter tuning.

For more about the Perceptron algorithm, see the tutorial:

Now that we are familiar with the Perceptron algorithm, let’s explore how we can use the algorithm in Python.

Perceptron With Scikit-Learn

The Perceptron algorithm is available in the scikit-learn Python machine learning library via the Perceptron class.

The class allows you to configure the learning rate (eta0), which defaults to 1.0.

...
# define model
model = Perceptron(eta0=1.0)

The implementation also allows you to configure the total number of training epochs (max_iter), which defaults to 1,000.

...
# define model
model = Perceptron(max_iter=1000)

The scikit-learn implementation of the Perceptron algorithm also provides other configuration options that you may want to explore, such as early stopping and the use of a penalty loss.
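For example, early stopping and a penalty can be configured directly when constructing the class; the values below are illustrative rather than recommended settings:

...
# define model with early stopping and an L2 penalty (illustrative values)
model = Perceptron(penalty='l2', alpha=0.0001, early_stopping=True, validation_fraction=0.1, n_iter_no_change=5)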

We can demonstrate the Perceptron classifier with a worked example.

First, let’s define a synthetic classification dataset.

We will use the make_classification() function to create a dataset with 1,000 examples, each with 10 input variables.

The example creates and summarizes the dataset.

# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1)
# summarize the dataset
print(X.shape, y.shape)

Running the example creates the dataset and confirms the number of rows and columns of the dataset.

(1000, 10) (1000,)

We can fit and evaluate a Perceptron model using repeated stratified k-fold cross-validation via the RepeatedStratifiedKFold class. We will use 10 folds and three repeats in the test harness.

We will use the default configuration.

...
# create the model
model = Perceptron()

The complete example of evaluating the Perceptron model for the synthetic binary classification task is listed below.

# evaluate a perceptron model on the dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import Perceptron
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1)
# define model
model = Perceptron()
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the Perceptron algorithm on the synthetic dataset and reports the average accuracy across the three repeats of 10-fold cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm. Consider running the example a few times.

In this case, we can see that the model achieved a mean accuracy of about 84.7 percent.

Mean Accuracy: 0.847 (0.052)

We may decide to use the Perceptron classifier as our final model and make predictions on new data.

This can be achieved by fitting the model on all available data and calling the predict() function, passing in a new row of data.

We can demonstrate this with a complete example listed below.

# make a prediction with a perceptron model on the dataset
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1)
# define model
model = Perceptron()
# fit model
model.fit(X, y)
# define new data
row = [0.12777556,-3.64400522,-2.23268854,-1.82114386,1.75466361,0.1243966,1.03397657,2.35822076,1.01001752,0.56768485]
# make a prediction
yhat = model.predict([row])
# summarize prediction
print('Predicted Class: %d' % yhat[0])

Running the example fits the model and makes a class label prediction for a new row of data.

Predicted Class: 1

Next, we can look at configuring the model hyperparameters.

Tune Perceptron Hyperparameters

The hyperparameters for the Perceptron algorithm must be configured for your specific dataset.

Perhaps the most important hyperparameter is the learning rate.

A large learning rate can cause the model to learn fast, but perhaps at the cost of lower skill. A smaller learning rate can result in a better-performing model but may take a long time to train the model.

You can learn more about exploring learning rates in the tutorial:

It is common to test learning rates on a log scale between a small value such as 1e-4 (or smaller) and 1.0. We will test the following values in this case:

...
# define grid
grid = dict()
grid['eta0'] = [0.0001, 0.001, 0.01, 0.1, 1.0]

The example below demonstrates this using the GridSearchCV class with a grid of values we have defined.

# grid search learning rate for the perceptron
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import Perceptron
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1)
# define model
model = Perceptron()
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['eta0'] = [0.0001, 0.001, 0.01, 0.1, 1.0]
# define search
search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X, y)
# summarize
print('Mean Accuracy: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)
# summarize all
means = results.cv_results_['mean_test_score']
params = results.cv_results_['params']
for mean, param in zip(means, params):
    print(">%.3f with: %r" % (mean, param))

Running the example will evaluate each combination of configurations using repeated cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that a smaller learning rate than the default results in better performance, with learning rates of 0.0001 and 0.001 both achieving a classification accuracy of about 85.7 percent, compared to the default of 1.0, which achieved an accuracy of about 84.7 percent.

Mean Accuracy: 0.857
Config: {'eta0': 0.0001}
>0.857 with: {'eta0': 0.0001}
>0.857 with: {'eta0': 0.001}
>0.853 with: {'eta0': 0.01}
>0.847 with: {'eta0': 0.1}
>0.847 with: {'eta0': 1.0}

Another important hyperparameter is how many epochs are used to train the model.

This may depend on the training dataset and could vary greatly. Again, we will explore configuration values on a log scale between 1 and 1e+4.

...
# define grid
grid = dict()
grid['max_iter'] = [1, 10, 100, 1000, 10000]

We will use our well-performing learning rate of 0.0001 found in the previous search.

...
# define model
model = Perceptron(eta0=0.0001)

The complete example of grid searching the number of training epochs is listed below.

# grid search total epochs for the perceptron
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import Perceptron
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1)
# define model
model = Perceptron(eta0=0.0001)
# define model evaluation method
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['max_iter'] = [1, 10, 100, 1000, 10000]
# define search
search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X, y)
# summarize
print('Mean Accuracy: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)
# summarize all
means = results.cv_results_['mean_test_score']
params = results.cv_results_['params']
for mean, param in zip(means, params):
    print(">%.3f with: %r" % (mean, param))

Running the example will evaluate each combination of configurations using repeated cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that 10 to 10,000 epochs result in about the same classification accuracy. An interesting extension would be to explore configuring the learning rate and number of training epochs at the same time to see if better results can be achieved; a sketch of this follows the results below.

Mean Accuracy: 0.857
Config: {'max_iter': 10}
>0.850 with: {'max_iter': 1}
>0.857 with: {'max_iter': 10}
>0.857 with: {'max_iter': 100}
>0.857 with: {'max_iter': 1000}
>0.857 with: {'max_iter': 10000}
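As a sketch of that extension, the snippet below simply merges the two grids used in this tutorial and searches them jointly; the specific values remain illustrative.

...
# define a joint grid over learning rate and number of training epochs
grid = dict()
grid['eta0'] = [0.0001, 0.001, 0.01, 0.1, 1.0]
grid['max_iter'] = [10, 100, 1000]
# define and run the search as before
search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1)
results = search.fit(X, y)
print('Mean Accuracy: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)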


Summary

In this tutorial, you discovered the Perceptron classification machine learning algorithm.

Specifically, you learned:

  • The Perceptron Classifier is a linear algorithm that can be applied to binary classification tasks.
  • How to fit, evaluate, and make predictions with the Perceptron model with Scikit-Learn.
  • How to tune the hyperparameters of the Perceptron algorithm on a given dataset.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.



Dynamic Classifier Selection Ensembles in Python


Dynamic classifier selection is a type of ensemble learning algorithm for classification predictive modeling.

The technique involves fitting multiple machine learning models on the training dataset, then selecting the model that is expected to perform best when making a prediction, based on the specific details of the example to be predicted.

This can be achieved using a k-nearest neighbor model to locate examples in the training dataset that are closest to the new example to be predicted, evaluating all models in the pool on this neighborhood and using the model that performs the best on the neighborhood to make a prediction for the new example.

As such, dynamic classifier selection can often perform better than any single model in the pool and provides an alternative to averaging the predictions from multiple models, as is the case in other ensemble algorithms.

In this tutorial, you will discover how to develop dynamic classifier selection ensembles in Python.

After completing this tutorial, you will know:

  • Dynamic classifier selection algorithms choose one from among many models to make a prediction for each new example.
  • How to develop and evaluate dynamic classifier selection models for classification tasks using the scikit-learn API.
  • How to explore the effect of dynamic classifier selection model hyperparameters on classification accuracy.

Let’s get started.

How to Develop Dynamic Classifier Selection in Python
Photo by Jean and Fred, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Dynamic Classifier Selection
  2. Dynamic Classifier Selection With Scikit-Learn
    1. DCS With Overall Local Accuracy (OLA)
    2. DCS With Local Class Accuracy (LCA)
  3. Hyperparameter Tuning for DCS
    1. Explore k in k-Nearest Neighbor
    2. Explore Algorithms for Classifier Pool

Dynamic Classifier Selection

Multiple Classifier Systems refers to a field of machine learning algorithms that use multiple models to address classification predictive modeling problems.

This includes familiar techniques such as one-vs-rest, one-vs-one, and error-correcting output codes. It also includes more general techniques that select a model to use dynamically for each new example that requires a prediction.

Several approaches are currently used to construct an MCS […] One of the most promising MCS approaches is Dynamic Selection (DS), in which the base classifiers are selected on the fly, according to each new sample to be classified.

— Dynamic Classifier Selection: Recent Advances And Perspectives, 2018.

For more on these types of multiple classifier systems, see the tutorial:

These methods are generally known by the name: Dynamic Classifier Selection, or DCS for short.

  • Dynamic Classifier Selection: Algorithms that choose one from among many trained models to make a prediction based on the specific details of the input.

Given that multiple models are used in DCS, it is considered a type of ensemble learning technique.

Dynamic Classifier Selection algorithms generally involve partitioning the input feature space in some way and assigning specific models to be responsible for making predictions for each partition. There are a variety of different DCS algorithms and research efforts are mainly focused on how to evaluate and assign classifiers to specific regions of the input space.

After training multiple individual learners, DCS dynamically selects one learner for each test instance. […] DCS makes predictions by using one individual learner.

— Page 93, Ensemble Methods: Foundations and Algorithms, 2012.

An early and popular approach involves first fitting a small, diverse set of classification models on the training dataset. When a prediction is required, a k-nearest neighbor (kNN) algorithm is first used to find the k examples in the training dataset most similar to the new example. Each previously fit classifier in the pool is then evaluated on this neighborhood of k training examples, and the classifier that performs the best is selected to make a prediction for the new example.

This approach is referred to as “Dynamic Classifier Selection Local Accuracy” or DCS-LA for short and was described by Kevin Woods, et al. in their 1997 paper titled “Combination Of Multiple Classifiers Using Local Accuracy Estimates.”

The basic idea is to estimate each classifier’s accuracy in local region of feature space surrounding an unknown test sample, and then use the decision of the most locally accurate classifier.

— Combination Of Multiple Classifiers Using Local Accuracy Estimates, 1997.

The authors describe two approaches for selecting a single classifier model to make a prediction for a given input example; they are:

  • Local Accuracy, often referred to as LA or Overall Local Accuracy (OLA).
  • Class Accuracy, often referred to as CA or Local Class Accuracy (LCA).

Local Accuracy (OLA) involves evaluating the classification accuracy of each model on the neighborhood of k training examples. The model that performs the best in this neighborhood is then selected to make a prediction for the new example.

The OLA of each classifier is computed as the percentage of the correct recognition of the samples in the local region.

— Dynamic Selection Of Classifiers—a Comprehensive Review, 2014.

Class Accuracy (LCA) involves using each model to make a prediction for the new example and noting the class that was predicted. Then, the accuracy of each model on the neighborhood of k training examples is evaluated, and the model that has the best skill for the class it predicted for the new example is selected and its prediction returned.

The LCA is estimated for each base classifier as the percentage of correct classifications within the local region, but considering only those examples where the classifier has given the same class as the one it gives for the unknown pattern.

— Dynamic Selection Of Classifiers—a Comprehensive Review, 2014.

In both cases, if all fit models make the same prediction for a new input example, then the prediction is returned directly.
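To make the two selection rules concrete, the sketch below implements them for a single new example using scikit-learn's NearestNeighbors class. It is a simplified illustration of the idea (the function name and arguments are made up for this example, and it assumes the models are already fit and that X_train and y_train are NumPy arrays), not the library implementation used later in this tutorial.

# simplified sketch of OLA and LCA selection for a single new example
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import accuracy_score

def dcs_la_predict(row, models, X_train, y_train, k=7, use_lca=False):
	# find the k training examples closest to the new example
	knn = NearestNeighbors(n_neighbors=k).fit(X_train)
	ix = knn.kneighbors([row], return_distance=False)[0]
	X_local, y_local = X_train[ix], y_train[ix]
	best_model, best_score = None, -1.0
	for model in models:
		predicted = model.predict([row])[0]
		local_predictions = model.predict(X_local)
		if use_lca:
			# LCA: accuracy on neighbors the model assigns the same class it predicted for the new example
			mask = local_predictions == predicted
			score = accuracy_score(y_local[mask], local_predictions[mask]) if mask.any() else 0.0
		else:
			# OLA: overall accuracy of the model on the local neighborhood
			score = accuracy_score(y_local, local_predictions)
		if score > best_score:
			best_model, best_score = model, score
	# the most locally accurate model makes the final prediction
	return best_model.predict([row])[0]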

Now that we are familiar with DCS and the DCS-LA algorithm, let’s look at how we can use it on our own classification predictive modeling projects.

Dynamic Classifier Selection With Scikit-Learn

The Dynamic Ensemble Selection Library, or DESlib for short, is an open-source Python library that provides an implementation of many different dynamic classifier selection algorithms.

DESlib is an easy-to-use ensemble learning library focused on the implementation of the state-of-the-art techniques for dynamic classifier and ensemble selection.

First, we can install the DESlib library using the pip package manager.

sudo pip install deslib

Once installed, we can confirm that the library was installed correctly and is ready to be used by loading the library and printing the installed version.

# check deslib version
import deslib
print(deslib.__version__)

Running the script will print the version of the DESlib library you have installed.

Your version should be the same as or higher than the version listed below. If not, you must upgrade your version of the DESlib library.

0.3

DESlib provides an implementation of the DCS-LA algorithm, with the two classifier selection techniques available via the OLA and LCA classes respectively.

Each class can be used as a scikit-learn model directly, allowing the full suite of scikit-learn data preparation, modeling pipelines, and model evaluation techniques to be used directly.

Both classes use a k-nearest neighbor algorithm to select the neighborhood, with a default neighborhood size of k=7.

By default, a bootstrap aggregation (bagging) ensemble of decision trees is used as the pool of classifier models considered for each classification, although this can be changed by setting the “pool_classifiers” argument to a list of models.
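For example, both the neighborhood size and a custom pool can be set when constructing the class; the specific models and k value below are illustrative:

...
# configure the neighborhood size and a custom pool of classifiers (illustrative values)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
model = OLA(pool_classifiers=[LogisticRegression(), DecisionTreeClassifier(), GaussianNB()], k=7)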

We can use the make_classification() function to create a synthetic binary classification problem with 10,000 examples and 20 input features.

# synthetic binary classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# summarize the dataset
print(X.shape, y.shape)

Running the example creates the dataset and summarizes the shape of the input and output components.

(10000, 20) (10000,)

Now that we are familiar with the DESlib API, let’s look at how to use each DCS-LA algorithm.

DCS With Overall Local Accuracy (OLA)

We can evaluate a DCS-LA model using overall local accuracy on the synthetic dataset.

In this case, we will use default model hyperparameters, including bagged decision trees as the pool of classifier models and a k=7 for the selection of the local neighborhood when making a prediction.

We will evaluate the model using repeated stratified k-fold cross-validation with three repeats and 10 folds. We will report the mean and standard deviation of the accuracy of the model across all repeats and folds.

The complete example is listed below.

# evaluate dynamic classifier selection DCS-LA with overall local accuracy
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from deslib.dcs.ola import OLA
# define dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# define the model
model = OLA()
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean and standard deviation accuracy of the model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the DCS-LA with OLA and default hyperparameters achieves a classification accuracy of about 88.3 percent.

Mean Accuracy: 0.883 (0.012)

We can also use the DCS-LA model with OLA as a final model and make predictions for classification.

First, the model is fit on all available data, then the predict() function can be called to make predictions on new data.

The example below demonstrates this on our binary classification dataset.

# make a prediction with DCS-LA using overall local accuracy
from sklearn.datasets import make_classification
from deslib.dcs.ola import OLA
# define dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# define the model
model = OLA()
# fit the model on the whole dataset
model.fit(X, y)
# make a single prediction
row = [0.2929949,-4.21223056,-1.288332,-2.17849815,-0.64527665,2.58097719,0.28422388,-7.1827928,-1.91211104,2.73729512,0.81395695,3.96973717,-2.66939799,3.34692332,4.19791821,0.99990998,-0.30201875,-4.43170633,-2.82646737,0.44916808]
yhat = model.predict([row])
print('Predicted Class: %d' % yhat[0])

Running the example fits the DCS-LA with OLA model on the entire dataset; the model is then used to make a prediction on a new row of data, as we might when using the model in an application.

Predicted Class: 0

Now that we are familiar with using DCS-LA with OLA, let’s look at the LCA method.

DCS With Local Class Accuracy (LCA)

We can evaluate a DCS-LA model using local class accuracy on the synthetic dataset.

In this case, we will use default model hyperparameters, including bagged decision trees as the pool of classifier models and a k=7 for the selection of the local neighborhood when making a prediction.

We will evaluate the model using repeated stratified k-fold cross-validation with three repeats and 10 folds. We will report the mean and standard deviation of the accuracy of the model across all repeats and folds.

The complete example is listed below.

# evaluate dynamic classifier selection DCS-LA using local class accuracy
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from deslib.dcs.lca import LCA
# define dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# define the model
model = LCA()
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean and standard deviation accuracy of the model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the DCS-LA with LCA and default hyperparameters achieves a classification accuracy of about 92.2 percent.

Mean Accuracy: 0.922 (0.007)

We can also use the DCS-LA model with LCA as a final model and make predictions for classification.

First, the model is fit on all available data, then the predict() function can be called to make predictions on new data.

The example below demonstrates this on our binary classification dataset.

# make a prediction with DCS-LA using local class accuracy
from sklearn.datasets import make_classification
from deslib.dcs.lca import LCA
# define dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# define the model
model = LCA()
# fit the model on the whole dataset
model.fit(X, y)
# make a single prediction
row = [0.2929949,-4.21223056,-1.288332,-2.17849815,-0.64527665,2.58097719,0.28422388,-7.1827928,-1.91211104,2.73729512,0.81395695,3.96973717,-2.66939799,3.34692332,4.19791821,0.99990998,-0.30201875,-4.43170633,-2.82646737,0.44916808]
yhat = model.predict([row])
print('Predicted Class: %d' % yhat[0])

Running the example fits the DCS-LA with LCA model on the entire dataset; the model is then used to make a prediction on a new row of data, as we might when using the model in an application.

Predicted Class: 0

Now that we are familiar with using the scikit-learn API to evaluate and use DCS-LA models, let’s look at configuring the model.

Hyperparameter Tuning for DCS

In this section, we will take a closer look at some of the hyperparameters you should consider tuning for the DCS-LA model and their effect on model performance.

There are many hyperparameters we can look at for DCS-LA, although in this case, we will look at the value of k in the k-nearest neighbor model used in the local evaluation of the models, and how to use a custom pool of classifiers.

We will use the DCS-LA with OLA as the basis for these experiments, although the choice of the specific method is arbitrary.

Explore k in k-Nearest Neighbor

The configuration of the k-nearest neighbor algorithm is critical to the DCS-LA model as it defines the scope of the neighborhood in which each classifier is considered for selection.

The k value controls the size of the neighborhood and it is important to set it to a value that is appropriate for your dataset, specifically the density of samples in the feature space. A value too small will mean that relevant examples in the training set might be excluded from the neighborhood, whereas values too large may mean that the signal is being washed out by too many examples.

The example below explores the classification accuracy of the DCS-LA with OLA with k values from 2 to 21.

# explore k in knn for DCS-LA with overall local accuracy
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from deslib.dcs.ola import OLA
from matplotlib import pyplot

# get the dataset
def get_dataset():
	X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
	return X, y

# get a list of models to evaluate
def get_models():
	models = dict()
	for n in range(2,22):
		models[str(n)] = OLA(k=n)
	return models

# evaluate a given model using cross-validation
def evaluate_model(model):
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	return scores

# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
	scores = evaluate_model(model)
	results.append(scores)
	names.append(name)
	print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Running the example first reports the mean accuracy for each configured neighborhood size.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that accuracy increases with the neighborhood size, perhaps to k=13 or k=14, where it appears to level off.

>2 0.873 (0.009)
>3 0.874 (0.013)
>4 0.880 (0.009)
>5 0.881 (0.009)
>6 0.883 (0.010)
>7 0.883 (0.011)
>8 0.884 (0.012)
>9 0.883 (0.010)
>10 0.886 (0.012)
>11 0.886 (0.011)
>12 0.885 (0.010)
>13 0.888 (0.010)
>14 0.886 (0.009)
>15 0.889 (0.010)
>16 0.885 (0.012)
>17 0.888 (0.009)
>18 0.886 (0.010)
>19 0.889 (0.012)
>20 0.889 (0.011)
>21 0.886 (0.011)

A box and whisker plot is created for the distribution of accuracy scores for each configured neighborhood size.

We can see the general trend of model performance increasing with the k value before reaching a plateau.

Box and Whisker Plots of Accuracy Distributions for k Values in DCS-LA With OLA

Explore Algorithms for Classifier Pool

The choice of algorithms used in the pool for the DCS-LA is another important hyperparameter.

By default, bagged decision trees are used, as they have proven to be an effective approach on a range of classification tasks. Nevertheless, a custom pool of classifiers can be considered.

This requires first defining a list of classifier models to use and fitting each on the training dataset. Unfortunately, this means that the automatic k-fold cross-validation model evaluation methods in scikit-learn cannot be used in this case. Instead, we will use a train-test split so that we can fit the classifier pool manually on the training dataset.

The list of fit classifiers can then be specified to the OLA (or LCA) class via the “pool_classifiers” argument. In this case, we will use a pool that includes logistic regression, a decision tree, and a naive Bayes classifier.

The complete example of evaluating DCS-LA with OLA and a custom set of classifiers on the synthetic dataset is listed below.

# evaluate DCS-LA using OLA with a custom pool of algorithms
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from deslib.dcs.ola import OLA
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
# define classifiers to use in the pool
classifiers = [
	LogisticRegression(),
	DecisionTreeClassifier(),
	GaussianNB()]
# fit each classifier on the training set
for c in classifiers:
	c.fit(X_train, y_train)
# define the DCS-LA model
model = OLA(pool_classifiers=classifiers)
# fit the model
model.fit(X_train, y_train)
# make predictions on the test set
yhat = model.predict(X_test)
# evaluate predictions
score = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (score))

Running the example first reports the mean accuracy for the model with the custom pool of classifiers.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved an accuracy of about 91.3 percent.

Accuracy: 0.913

In order for the DCS model to be worth adopting, it must perform better than any single contributing model. Otherwise, we would simply use the contributing model that performs best instead.

We can check this by evaluating the performance of each contributing classifier on the test set.

...
# evaluate contributing models
for c in classifiers:
	yhat = c.predict(X_test)
	score = accuracy_score(y_test, yhat)
	print('>%s: %.3f' % (c.__class__.__name__, score))

The updated example of DCS-LA with a custom pool of classifiers that are also evaluated independently is listed below.

# evaluate DCS-LA using OLA with a custom pool of algorithms
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from deslib.dcs.ola import OLA
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
# define dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
# define classifiers to use in the pool
classifiers = [
	LogisticRegression(),
	DecisionTreeClassifier(),
	GaussianNB()]
# fit each classifier on the training set
for c in classifiers:
	c.fit(X_train, y_train)
# define the DCS-LA model
model = OLA(pool_classifiers=classifiers)
# fit the model
model.fit(X_train, y_train)
# make predictions on the test set
yhat = model.predict(X_test)
# evaluate predictions
score = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (score))
# evaluate contributing models
for c in classifiers:
	yhat = c.predict(X_test)
	score = accuracy_score(y_test, yhat)
	print('>%s: %.3f' % (c.__class__.__name__, score))

Running the example first reports the mean accuracy for the model with the custom pool of classifiers and the accuracy of each contributing model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that again, the DCS-LA achieves an accuracy of about 91.3 percent, which is better than any contributing model.

Accuracy: 0.913
>LogisticRegression: 0.878
>DecisionTreeClassifier: 0.884
>GaussianNB: 0.873


Summary

In this tutorial, you discovered how to develop dynamic classifier selection ensembles in Python.

Specifically, you learned:

  • Dynamic classifier selection algorithms choose one from among many models to make a prediction for each new example.
  • How to develop and evaluate dynamic classifier selection models for classification tasks using the scikit-learn API.
  • How to explore the effect of dynamic classifier selection model hyperparameters on classification accuracy.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.


Calculus Books for Machine Learning


Knowledge of calculus is not required to get results and solve problems in machine learning or deep learning.

However, knowing some calculus will help you in a number of ways, such as reading mathematical notation in books and papers, understanding the terms used to describe fitting models (like “gradient”), and understanding the learning dynamics of models fit via optimization, such as neural networks.

Calculus is a challenging topic as taught at a university level, but you don’t need to know all of calculus, just a handful of terms and methods related to numerical function optimization, central to fitting algorithms like neural networks. And the best way to get a handle on calculus is from books.

In this tutorial, you will discover books on calculus for machine learning.

After completing this tutorial, you will know:

  • Which books on machine learning provide a gentle introduction to the relevant calculus topics.
  • Books that you can use to learn the intuitions, history, and techniques of calculus.
  • Textbooks you can use for reference or deeper learning of calculus techniques and their proofs.

Let’s get started.

Calculus Books for Machine Learning
Photo by Roanish, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Calculus in Machine Learning Books
  2. Introductory Calculus Books
  3. Calculus Textbooks

Calculus in Machine Learning Books

Start with machine learning books that cover the basics of calculus.

This is perfect if you learned calculus in school (a long time ago) and need a refresher, or if you need a quick crash course in the terms and methods.

Many top machine learning and deep learning textbooks will cover the basics, and it is often enough for most cases, e.g. when you’re focused on getting results with machine learning algorithms.

Also, it helps if you already own a machine learning textbook that covers some calculus as you don’t need to get another book.

Two great textbooks that cover some calculus include:

  • Deep Learning, 2016.
  • Pattern Recognition and Machine Learning, 2006.

The coverage of calculus in the “Deep Learning” textbook is brief.

Calculus is introduced in the context of optimization, first in terms of linear regression and then more generally for multivariate optimization, as seen when fitting neural networks.

This includes topics such as:

  • Derivative
  • Partial derivative
  • Second derivative
  • Hessian matrix
  • Gradient
  • Gradient descent
  • Critical points
  • Stationary points
  • Local maximum
  • Global minimum
  • Saddle points
  • Jacobian matrix

And more.

Calculus Terms.
Taken from Page xiii, Deep Learning, 2016.
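As a small worked example of a few of these terms (assuming the SymPy library is installed), the snippet below computes the gradient and Hessian of a simple two-variable function:

# compute the gradient and Hessian of a simple function with SymPy (illustrative)
from sympy import symbols, hessian, Matrix
x, y = symbols('x y')
f = x**2 + 3*x*y + 2*y**2
# gradient: vector of first partial derivatives
gradient = Matrix([f.diff(x), f.diff(y)])
# Hessian: matrix of second partial derivatives
H = hessian(f, (x, y))
print(gradient)  # Matrix([[2*x + 3*y], [3*x + 4*y]])
print(H)         # Matrix([[2, 3], [3, 4]])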

The book “Pattern Recognition and Machine Learning” provides more in-depth coverage.

Specifically:

  • Chapter 10: Approximate Inference
  • Appendix D: Calculus of Variations

Appendix D introduces the topic of “calculus of variations” and Chapter 10 makes use of the technique. The topic is also covered in the deep learning book.

If a typical calculus problem involves finding a value of a variable that optimizes a function, then the calculus of variations is about finding a function that optimizes a functional (a function of functions). We can see that this is relevant in machine learning, and specifically neural networks, as the neural network model (circuit) learns arbitrary functions under a loss function.

A common problem in conventional calculus is to find a value of x that maximizes (or minimizes) a function y(x). Similarly, in the calculus of variations we seek a function y(x) that maximizes (or minimizes) a functional F [y]. That is, of all possible functions y(x), we wish to find the particular function for which the functional F [y] is a maximum (or minimum).

— Page 703, Pattern Recognition and Machine Learning, 2006.

The calculus of variations is not required to fit neural nets, but it provides a useful tool to better understand the problem we are solving when fitting a neural net and the types of learning dynamics we may see in practice.

Finally, we are starting to see books dedicated to the mathematical understanding that underlies machine learning.

One example is “Mathematics for Machine Learning.”

This book covers a lot of the calculus required for machine learning and provides the context showing where it fits in terms of the optimization (training/learning) of models.

Calculus and its Connection to Machine Learning.
Taken from Mathematics for Machine Learning, page 140.

The discussion of calculus is confined to Chapter 5: Vector Calculus, which covers the following topics:

  • Section 5.1 Differentiation of Univariate Functions
  • Section 5.2 Partial Differentiation and Gradients
  • Section 5.3 Gradients of Vector-Valued Functions
  • Section 5.4 Gradients of Matrices
  • Section 5.5 Useful Identities for Computing Gradients
  • Section 5.6 Backpropagation and Automatic Differentiation
  • Section 5.7 Higher-Order Derivatives
  • Section 5.8 Linearization and Multivariate Taylor Series

This book is an excellent starting point to fill out or refresh your knowledge of calculus for machine learning.

Introductory Calculus Books

Knowing the names of terms is one thing, but what if you want to know some of the methods more generally?

For that, I would recommend a solid beginner book, such as:

  • Calculus For Dummies, 2016.
  • The Hitchhiker’s Guide to Calculus, 2019.

These are not textbooks; instead, they assume little or no background (e.g. pre-calculus) and will step you through the intuitions and the techniques and their application to simple exercises.

Intuition is key! We’re not doing math degrees; we’re solving machine learning problems.

A textbook will teach you the method and the proof, but will rarely tell you what problem the method was designed to solve in the first place or give you a little history. I think background is critical.

I own both books. I like “Calculus For Dummies” and would recommend it, if you can get past the name and style.

There are many who say that calculus is one of the crowning achievements in all of intellectual history. As such, it’s worth the effort. Read this jargon-free book, get a handle on calculus, and join the happy few who can proudly say, “Calculus? Oh, sure, I know calculus. It’s no big deal.”

— Page 1, Calculus For Dummies, 2016.

The table of contents is as follows:

  • Introduction
  • Part I: An Overview of Calculus
    • Chapter 1: What Is Calculus?
    • Chapter 2: The Two Big Ideas of Calculus: Differentiation and Integration — plus Infinite Series
    • Chapter 3: Why Calculus Works
  • Part II: Warming Up with Calculus Prerequisites
    • Chapter 4: Pre-Algebra and Algebra Review
    • Chapter 5: Funky Functions and Their Groovy Graphs
    • Chapter 6: The Trig Tango
  • Part III: Limits
    • Chapter 7: Limits and Continuity
    • Chapter 8: Evaluating Limits
  • Part IV: Differentiation
    • Chapter 9: Differentiation Orientation
    • Chapter 10: Differentiation Rules — Yeah, Man, It Rules
    • Chapter 11: Differentiation and the Shape of Curves
    • Chapter 12: Your Problems Are Solved: Differentiation to the Rescue!
    • Chapter 13: More Differentiation Problems: Going Off on a Tangent
  • Part V: Integration and Infinite Series
    • Chapter 14: Intro to Integration and Approximating Area
    • Chapter 15: Integration: It’s Backwards Differentiation
    • Chapter 16: Integration Techniques for Experts
    • Chapter 17: Forget Dr. Phil: Use the Integral to Solve Problems
    • Chapter 18: Taming the Infinite with Improper Integrals
    • Chapter 19: Infinite Series
  • Part VI: The Part of Tens
    • Chapter 20: Ten Things to Remember
    • Chapter 21: Ten Things to Forget
    • Chapter 22: Ten Things You Can’t Get Away With

I find “The Hitchhiker’s Guide to Calculus” good, but terse. It’s straight to the point of each method.

The best thing about this book is that it is focused on making you do calculations. Learn by calculating. This is how I learn.

… Calculus requires understanding a whole new set of ideas, ones which are very interesting and quite beautiful, but admittedly also a bit hard to grasp. Yet, with all the new ideas that it entails, Calculus is a method of calculation, so in your Calculus course you are going to be doing calculations, reams of calculations, oodles of calculations, a seeming endless number of calculations!

— Pages 1-2, The Hitchhiker’s Guide to Calculus, 2019.

Another book I strongly recommend is a popular science book:

Infinite Powers

This book will get you excited about calculus.

It covers the history and will ground you in why the tools of calculus were invented and why they are so powerful.

Without calculus, we wouldn’t have cell phones, computers, or microwave ovens. We wouldn’t have radio. Or television. Or ultrasound for expectant mothers, or GPS for lost travelers. We wouldn’t have split the atom, unraveled the human genome, or put astronauts on the moon. We might not even have the Declaration of Independence.

— Page vii, Infinite Powers, 2020.

History is important. You need someone to lay out the hard problems that geometry and algebra could not solve, the hacks that were tried, and the new methods that were invented that work really well.

Calculus began as an outgrowth of geometry. Back around 250 bce in ancient Greece, it was a hot little mathematical startup devoted to the mystery of curves.

— Page 3, Infinite Powers, 2020.

It hammers home that calculus is not magic and spells, but tools for solving problems. And that it’s all learnable, if you want.

Calculus Textbooks

Maybe you want to go deeper.

You want to see and work through proofs for each method, work through practice exercises at an undergraduate level, go deep.

This is not required to be effective at machine learning, but sometimes we want to go all in. I understand. In that case, I recommend a textbook, such as a textbook used for undergraduate courses.

It can also be a good idea to have a textbook on hand to dip into the details of specific terms and methods on demand when working from higher-level or simpler material, e.g. when really digging into the Hessian matrix.

There are a huge number of textbooks on calculus and seemingly new editions every few years.

Nevertheless, some of the top textbooks used at the university level include the following:

  • Calculus, 3rd Edition, 2017. (Gilbert Strang)
  • Calculus, 8th edition, 2015. (James Stewart)
  • Calculus, 4th edition, 2008. (Michael Spivak)
  • Calculus, 11th edition, 2017. (Ron Larson, Bruce Edwards)
  • Thomas’ Calculus, 14th edition, 2017. (Joel Hass, Christopher Heil, Maurice Weir)

I like Stewart, but they’re all pretty much the same. All are hard work.

You will be expected to do a ton of worked examples. There’s no way around it.

Perhaps skim a few and pick one that feels like a good fit for your learning style.

Summary

In this tutorial, you discovered books on calculus for machine learning.

Have you read any of the books, or are you planning to get one?
Let me know in the comments below.

Do you know another great book on calculus?
Please let me know in the comments.


What Is Meta-Learning in Machine Learning?


Meta-learning in machine learning refers to learning algorithms that learn from other learning algorithms.

Most commonly, this means the use of machine learning algorithms that learn how to best combine the predictions from other machine learning algorithms in the field of ensemble learning.

Nevertheless, meta-learning might also refer to the manual process of model selection and algorithm tuning performed by a practitioner on a machine learning project, which modern automl algorithms seek to automate. It also refers to learning across multiple related predictive modeling tasks, called multi-task learning, where meta-learning algorithms learn how to learn.

In this tutorial, you will discover meta-learning in machine learning.

After completing this tutorial, you will know:

  • Meta-learning refers to machine learning algorithms that learn from the output of other machine learning algorithms.
  • Meta-learning algorithms typically refer to ensemble learning algorithms like stacking that learn how to combine the predictions from ensemble members.
  • Meta-learning also refers to algorithms that learn how to learn across a suite of related prediction tasks, referred to as multi-task learning.

Let’s get started.

What Is Meta-Learning in Machine Learning?
Photo by Ryan Hallock, some rights reserved.

Tutorial Overview

This tutorial is divided into five parts; they are:

  1. What Is Meta?
  2. What Is Meta-Learning?
  3. Meta-Algorithms, Meta-Classifiers, and Meta-Models
  4. Model Selection and Tuning as Meta-Learning
  5. Multi-Task Learning as Meta-Learning

What Is Meta?

Meta refers to a level above.

Meta typically means raising the level of abstraction one step and often refers to information about something else.

For example, you are probably familiar with “meta-data,” which is data about data.

Data about data is often called metadata …

— Page 512, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

You store data in a file and a common example of metadata is data about the data stored in the file, such as:

  • The name of the file.
  • The size of the file.
  • The date the file was created.
  • The date the file was last modified.
  • The type of file.
  • The path to the file.
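As a small concrete illustration using only the Python standard library (the file path here is hypothetical), most of these fields can be read directly from the filesystem:

# read common metadata about a file with the Python standard library
import os
from datetime import datetime

path = 'data.csv'  # hypothetical file path
info = os.stat(path)
print('Name:', os.path.basename(path))
print('Size (bytes):', info.st_size)
print('Last modified:', datetime.fromtimestamp(info.st_mtime))
print('Type (extension):', os.path.splitext(path)[1])
print('Path:', os.path.abspath(path))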

Now that we are familiar with the idea of “meta,” let’s consider the use of the term in machine learning, such as “meta-learning.”

What Is Meta-Learning?

Meta-learning refers to learning about learning.

Meta-learning in machine learning most commonly refers to machine learning algorithms that learn from the output of other machine learning algorithms.

In our machine learning project where we are trying to figure out (learn) what algorithm performs best on our data, we could think of a machine learning algorithm taking the place of ourselves, at least to some extent.

Machine learning algorithms learn from historical data. For example, supervised learning algorithms learn how to map examples of input patterns to examples of output patterns to address classification and regression predictive modeling problems.

Algorithms are trained on historical data directly to produce a model. The model can then be used later to predict output values, such as a number or a class label, for new examples of input.

  • Learning Algorithms: Learn from historical data and make predictions given new examples of data.

Meta-learning algorithms learn from the output of other machine learning algorithms that learn from data. This means that meta-learning requires the presence of other learning algorithms that have already been trained on data.

For example, supervised meta-learning algorithms learn how to map examples of output from other learning algorithms (such as predicted numbers or class labels) onto examples of target values for classification and regression problems.

Similarly, meta-learning algorithms make predictions by taking the output from existing machine learning algorithms as input and predicting a number or class label.

  • Meta-Learning Algorithms: Learn from the output of learning algorithms and make a prediction given predictions made by other models.

In this way, meta-learning occurs one level above machine learning.

If machine learning learns how to best use information in data to make predictions, then meta-learning or meta machine learning learns how to best use the predictions from machine learning algorithms to make predictions.
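A minimal sketch of this idea, assuming scikit-learn and illustrative model choices, is to build the meta-model's inputs from out-of-fold predictions of the base models:

# illustrative sketch: a meta-model learns from the predictions of other models
from numpy import column_stack
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
# define dataset and base models
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
base_models = [DecisionTreeClassifier(), GaussianNB()]
# out-of-fold predictions from each base model become the meta-model's inputs
meta_X = column_stack([cross_val_predict(model, X, y, cv=5) for model in base_models])
# the meta-model learns how to best combine the base model predictions
meta_model = LogisticRegression()
meta_model.fit(meta_X, y)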

Now that we are familiar with the idea of meta-learning, let’s look at some examples of meta-learning algorithms.

Meta-Algorithms, Meta-Classifiers, and Meta-Models

Meta-learning algorithms are often referred to simply as meta-algorithms or meta-learners.

  • Meta-Algorithm: Short-hand for a meta-learning machine learning algorithm.

Similarly, meta-learning algorithms for classification tasks may be referred to as meta-classifiers and meta-learning algorithms for regression tasks may be referred to as meta-regressors.

  • Meta-Classifier: Meta-learning algorithm for classification predictive modeling tasks.
  • Meta-Regressor: Meta-learning algorithm for regression predictive modeling tasks.

After a meta-learning algorithm is trained, it results in a meta-learning model, e.g. the specific rules, coefficients, or structure learned from data. The meta-learning model or meta-model can then be used to make predictions.

  • Meta-Model: Result of running a meta-learning algorithm.

The most widely known meta-learning algorithm is called stacked generalization, or stacking for short.

Stacking is probably the most-popular meta-learning technique.

— Page 82, Pattern Classification Using Ensemble Methods, 2010.

Stacking is a type of ensemble learning algorithm. Ensemble learning refers to machine learning algorithms that combine the predictions from two or more predictive models. Stacking uses another machine learning model, a meta-model, to learn how to best combine the predictions of the contributing ensemble members.

In order to induce a meta classifier, first the base classifiers are trained (stage one), and then the Meta classifier (second stage). In the prediction phase, base classifiers will output their classifications, and then the Meta-classifier(s) will make the final classification (as a function of the base classifiers).

— Page 82, Pattern Classification Using Ensemble Methods, 2010.

As such, the stacking ensemble algorithm is referred to as a type of meta-learning, or as a meta-learning algorithm.

  • Stacking: Ensemble machine learning algorithm that uses meta-learning to combine the predictions made by ensemble members.
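For example, scikit-learn provides stacking directly via the StackingClassifier class; the base models and meta-model below are illustrative choices:

# stacking in scikit-learn: a meta-model (final_estimator) combines base model predictions
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
# base models (ensemble members) and the meta-model that combines their predictions
estimators = [('tree', DecisionTreeClassifier()), ('nb', GaussianNB())]
model = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression(), cv=5)
model.fit(X, y)
print(model.predict(X[:1]))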

There are also lesser-known ensemble learning algorithms that use a meta-model to learn how to combine the predictions from other machine learning models. Most notably, the mixture of experts technique, which uses a gating model (the meta-model) to learn how to combine the predictions of expert models.

By using a meta-learner, this method tries to induce which classifiers are reliable and which are not.

— Page 82, Pattern Classification Using Ensemble Methods, 2010.

More generally, meta-models for supervised learning are almost always ensemble learning algorithms, and any ensemble learning algorithm that uses another model to combine the predictions from ensemble members may be referred to as a meta-learning algorithm.

Instead, stacking introduces the concept of a metalearner […] Stacking tries to learn which classifiers are the reliable ones, using another learning algorithm—the metalearner—to discover how best to combine the output of the base learners.

— Page 497, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

Model Selection and Tuning as Meta-Learning

Training a machine learning algorithm on a historical dataset is a search process.

The internal structure, rules, or coefficients that comprise the model are modified against some loss function.

This type of search process is referred to as optimization, as we are not simply seeking a solution, but a solution that maximizes a performance metric, like classification accuracy, or minimizes a loss score, like prediction error.

This idea of learning as optimization is not simply a useful metaphor; it is the literal computation performed at the heart of most machine learning algorithms, either analytically (least squares) or numerically (gradient descent), or some hybrid optimization procedure.

A level above training a model, meta-learning involves finding a data preparation procedure, learning algorithm, and learning algorithm hyperparameters (the full modeling pipeline) that result in the best score for a performance metric on the test harness.

This, too, is an optimization procedure that is typically performed by a human.

As such, we could think of ourselves as meta-learners on a machine learning project.

This is not the common meaning of the term, yet it is a valid usage. This would cover tasks such as model selection and algorithm hyperparameter tuning.
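
For example, tuning a single hyperparameter with a grid search is a small-scale version of this meta-level search. The snippet below is a minimal sketch using scikit-learn’s GridSearchCV; the dataset, model, and parameter grid are assumptions made for illustration.

# minimal sketch of hyperparameter tuning as a meta-level search (illustrative example)
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
# define a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
# grid of candidate hyperparameter values to search
grid = {'n_neighbors': [1, 3, 5, 7, 9]}
# score each configuration with cross-validation and keep the best
search = GridSearchCV(KNeighborsClassifier(), grid, scoring='accuracy', cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)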

Automating the procedure is generally referred to as automated machine learning, shortened to “automl.”

… the user simply provides data, and the AutoML system automatically determines the approach that performs best for this particular application. Thereby, AutoML makes state-of-the-art machine learning approaches accessible to domain scientists who are interested in applying machine learning but do not have the resources to learn about the technologies behind it in detail.

— Page ix, Automated Machine Learning: Methods, Systems, Challenges, 2019.

AutoML may not itself be referred to as meta-learning, but AutoML algorithms may harness meta-learning across learning tasks, referred to as learning to learn.

Meta-learning, or learning to learn, is the science of systematically observing how different machine learning approaches perform on a wide range of learning tasks, and then learning from this experience, or meta-data, to learn new tasks much faster than otherwise possible.

— Page 35, Automated Machine Learning: Methods, Systems, Challenges, 2019.

Multi-Task Learning as Meta-Learning

Learning to learn is a related field of study that is also colloquially referred to as meta-learning.

If learning involves an algorithm that improves with experience on a task, then learning to learn is an algorithm that is used across multiple tasks and improves with experience and with the number of tasks.

… an algorithm is said to learn to learn if its performance at each task improves with experience and with the number of tasks.

Learning to Learn: Introduction and Overview, 1998.

Rather than manually developing an algorithm for each task or selecting and tuning an existing algorithm for each task, learning to learn algorithms adjust themselves based on a collection of similar tasks.

Meta-learning provides an alternative paradigm where a machine learning model gains experience over multiple learning episodes – often covering a distribution of related tasks – and uses this experience to improve its future learning performance.

Meta-Learning in Neural Networks: A Survey, 2020.

This is referred to as the problem of multi-task learning.

Algorithms that are developed for multi-task learning problems learn how to learn and may be referred to as performing meta-learning.

The idea of using learning to learn or meta-learning to acquire knowledge or inductive biases has a long history.

Learning to learn by gradient descent by gradient descent, 2016.

  • Learning to Learn: Applying learning algorithms to multi-task learning problems in which they perform meta-learning across the tasks, e.g. learning about learning on the tasks.

This includes familiar techniques such as transfer learning that are common in deep learning algorithms for computer vision. This is where a deep neural network is trained on one computer vision task and is used as the starting point, perhaps with very little modification or additional training, for a related vision task.

Transfer learning works well when the features that are automatically extracted by the network from the input images are useful across multiple related tasks, such as the abstract features extracted from common objects in photographs.

This is typically understood in a supervised learning context, where the input is the same but the target may be of a different nature. For example, we may learn about one set of visual categories, such as cats and dogs, in the first setting, then learn about a different set of visual categories, such as ants and wasps, in the second setting.

— Page 536, Deep Learning, 2016.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Related Tutorials

Papers

Books

Articles

Summary

In this tutorial, you discovered meta-learning in machine learning.

Specifically, you learned:

  • Meta-learning refers to machine learning algorithms that learn from the output of other machine learning algorithms.
  • Meta-learning algorithms typically refer to ensemble learning algorithms like stacking that learn how to combine the predictions from ensemble members.
  • Meta-learning also refers to algorithms that learn how to learn across a suite of related prediction tasks, referred to as multi-task learning.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post What Is Meta-Learning in Machine Learning? appeared first on Machine Learning Mastery.

Ensemble Learning Algorithm Complexity and Occam’s Razor


Occam’s razor suggests that in machine learning, we should prefer simpler models with fewer coefficients over complex models like ensembles.

Taken at face value, the razor is a heuristic that suggests more complex hypotheses make more assumptions that, in turn, will make them too narrow and not generalize well. In machine learning, it suggests complex models like ensembles will overfit the training dataset and perform poorly on new data.

In practice, ensembles are almost universally the type of model chosen on projects where predictive skill is the most important consideration. Further, empirical results show a continued reduction in generalization error as the complexity of an ensemble learning model is incrementally increased. These findings are at odds with the Occam’s razor principle taken at face value.

In this tutorial, you will discover how to reconcile Occam’s Razor with ensemble machine learning.

After completing this tutorial, you will know:

  • Occam’s razor is a heuristic that suggests choosing simpler machine learning models as they are expected to generalize better.
  • The heuristic can be divided into two razors, one of which is true and remains a useful tool and the other that is false and should be abandoned.
  • Ensemble learning algorithms like boosting provide a specific case of how the second razor fails and added complexity can result in lower generalization error.

Let’s get started.

Ensemble Learning Algorithm Complexity and Occam's Razor

Ensemble Learning Algorithm Complexity and Occam’s Razor
Photo by dylan_odonnell, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Occam’s Razor for Model Selection
  2. Occam’s Two Razors for Machine Learning
  3. Occam’s Razor and Ensemble Learning

Occam’s Razor for Model Selection

Model selection is the process of choosing one from among possibly many candidate machine learning models for a predictive modeling project.

It is often straightforward to select a model based on its expected performance, e.g. choose the model with the highest accuracy or lowest prediction error.

Another important consideration is to choose simpler models over complex models.

Simpler models are typically defined as models that make fewer assumptions or have fewer elements, most commonly characterized as fewer coefficients (e.g. rules, layers, weights, etc.). The rationale for choosing simpler models is tied back to Occam’s Razor.

The idea is that the best scientific theory is the smallest one that explains all the facts.

— Page 197, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

Occam’s Razor is an approach to problem-solving and is commonly invoked to mean that if all else is equal, we should prefer the simpler solutions.

  • Occam’s Razor: If all else is equal, the simplest solution is correct.

It is named for William of Ockham and was proposed to counter ever more elaborate philosophy without equivalent increases in predictive power.

William of Occam’s famous razor states that “Nunquam ponenda est pluralitas sin necesitate,” which, approximately translated, means “Entities should not be multiplied beyond necessity”.

Occam’s Two Razors: The Sharp and the Blunt, 1998.

It is not a rule, more of a heuristic for problem-solving, and is commonly invoked in science to prefer simpler hypotheses that make fewer assumptions over more complex hypotheses that make more assumptions.

There is a long-standing tradition in science that, other things being equal, simple theories are preferable to complex ones. This is known as Occam’s Razor after the medieval philosopher William of Occam (or Ockham).

— Page 197, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

The problem with complex hypotheses with more assumptions is that they are likely too specific.

They may include details of specific cases that are at hand or easily available, and in turn, may not generalize to new cases. That is, the more assumptions a hypothesis has, the more narrow it is expected to be in its application. Conversely, fewer assumptions suggest a more general hypothesis with greater predictive power to more cases.

  • Simple Hypothesis: Fewer assumptions, and in turn, broad applicability.
  • Complex Hypothesis: More assumptions, and in turn, narrow applicability.

This has implications in machine learning, as we are specifically trying to generalize to new unseen cases from specific observations, referred to as inductive reasoning.

If Occam’s Razor suggests that more complex models don’t generalize well, then in applied machine learning, it suggests we should choose simpler models as they will have lower prediction errors on new data.

If this is true, then how can we justify using an ensemble machine learning algorithm?

By definition, ensemble machine learning algorithms are more complex than a single machine learning model, as they are composed of many individual machine learning models.

Occam’s razor suggests that the added complexity of ensemble learning algorithms means that they will not generalize as well as simpler models fit on the same dataset.

Yet ensemble machine learning algorithms are the dominant solution when predictive skill on new data is the most important concern, such as machine learning competitions. Ensembles have been studied at great length and have been shown not to overfit the training dataset in study after study.

It has been empirically observed that certain ensemble techniques often do not overfit the model, even when the ensemble contains thousands of classifiers.

— Page 40, Pattern Classification Using Ensemble Methods, 2010.

How can this inconsistency be reconciled?

Occam’s Two Razors for Machine Learning

The conflict between the expectation of simpler models generalizing better in theory and complex models like ensembles generalizing better in practice was mostly ignored as an inconvenient empirical finding for a long time.

In the late 1990s, the problem was specifically studied by Pedro Domingos and published in the award-winning 1998 paper titled “Occam’s Two Razors: The Sharp and the Blunt,” and follow-up 1999 journal article “The Role of Occam’s Razor in Knowledge Discovery.”

In the work, Domingos defines the problem as two specific, commonly asserted implications of Occam’s Razor in applied machine learning, which he refers to as “Occam’s Two Razors.” They are (taken from the paper):

  • First razor: Given two models with the same generalization error, the simpler one should be preferred because simplicity is desirable in itself.
  • Second razor: Given two models with the same training-set error, the simpler one should be preferred because it is likely to have lower generalization error.

Domingos then enumerates a vast number of examples for and against each razor from both theory and empirical studies in machine learning.

The first razor suggests if two models have the same expected performance on data not seen during training, we should prefer the simpler model. Domingos highlights that this razor holds and provides a good heuristic on machine learning projects.

The second razor suggests that if two models have the same performance on a training dataset, then the simpler model should be chosen because it is expected to generalize better when used to make predictions on new data.

This seems sensible on the surface.

It is the argument behind not adopting ensemble algorithms in a machine learning project because they are very complex compared to other models and expected to not generalize.

It turns out that this razor cannot be supported by the evidence from the machine learning literature.

All of this evidence points to the conclusion that not only is the second razor not true in general; it is also typically false in the types of domains KDD has been applied to.

Occam’s Two Razors: The Sharp and the Blunt, 1998.

Occam’s Razor and Ensemble Learning

The finding begins to sound intuitive once you mull it over for a while.

For example, in practice, we would not choose a machine learning model based on its performance on the training dataset alone. We intuitively, or perhaps after a lot of experience, tacitly expect the estimate of performance on a training set to be a poor estimate of performance on a holdout dataset.

We have this expectation because the model can overfit the training dataset.

Yet, less intuitively, overfitting the training dataset can lead to better performance on a holdout test set. This has been observed many times in practice in systematic studies.

A common situation involves plotting the performance of a model on the training dataset and a holdout test dataset each iteration of learning for the model, such as training epochs or iterations for models that support incremental learning.

If learning on the training dataset is set up to continue for a large number of training iterations and the curves are observed, it can often be seen that performance on the training dataset will fall to zero error. This is to be expected, as we might think that the model will overfit the training dataset given enough resources and time to train. Yet the performance on the test set will continue to improve, even while the performance on the training set remains fixed at zero error.

… occasionally, the generalization error would continue to improve long after the training error had reached zero.

— Page 40, Ensemble Methods in Data Mining, 2010.

This behavior can be observed with ensemble learning algorithms like boosting and bagging, where performance on the holdout dataset will continue to improve as additional model members are added to the ensemble.

One very surprising finding is that performing more boosting iterations can reduce the error on new data long after the classification error of the combined classifier on the training data has dropped to zero.

— Page 489, Data Mining: Practical Machine Learning Tools and Techniques, 2016.
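
This behavior can be checked with a small experiment. The snippet below is a rough sketch that tracks training and holdout accuracy after each boosting iteration using scikit-learn’s GradientBoostingClassifier and its staged_predict() function; the synthetic dataset and number of trees are assumptions made for illustration.

# sketch: train vs. holdout accuracy per boosting iteration (illustrative example)
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# define a synthetic dataset and split it into train and holdout sets
X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)
# fit a boosting ensemble with many members
model = GradientBoostingClassifier(n_estimators=500, random_state=1)
model.fit(X_train, y_train)
# evaluate the ensemble after each boosting iteration
for i, (yhat_train, yhat_test) in enumerate(zip(model.staged_predict(X_train), model.staged_predict(X_test))):
	if (i + 1) % 100 == 0:
		print('>%d train=%.3f test=%.3f' % (i+1, accuracy_score(y_train, yhat_train), accuracy_score(y_test, yhat_test)))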

That is, the model complexity is incrementally increased, which systematically decreases the error on unseen data, e.g. generalization error. The additional training cannot improve the performance on the training dataset; it has no possible improvement to make.

Performing more boosting iterations without reducing training error does not explain the training data any better, and it certainly adds complexity to the combined classifier.

— Page 490, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

This finding directly contradicts the second razor and supports Domingos’ argument about abandoning the second razor.

The first one is largely uncontroversial, while the second one, taken literally, is false.

Occam’s Two Razors: The Sharp and the Blunt, 1998.

This problem has been studied and can generally be explained by the ensemble algorithms learning to be more confident in their predictions on the training dataset, which carries over to the holdout data.

The contradiction can be resolved by considering the classifier’s confidence in its predictions.

— Page 490, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

The first razor remains an important heuristic in applied machine learning.

The key aspect of this razor is the predicate of “all else being equal.” That is, if two models are compared, they must be compared using their generalization error on a holdout dataset or estimated using k-fold cross-validation. If their performance is equal under these circumstances, then the razor can come into effect and we can choose the simpler solution.

This is not the only way to choose models.

We may choose a simpler model because it is easier to interpret, and this remains valid if model interpretability is a more important project requirement than predictive skill.

Ensemble learning algorithms are unambiguously a more complex type of model when the number of model parameters is considered the measure of complexity. As such, an open problem in machine learning involves alternate measures of complexity.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Related Tutorials

Papers

Books

Articles

Summary

In this tutorial, you discovered how to reconcile Occam’s Razor with ensemble machine learning.

Specifically, you learned:

  • Occam’s razor is a heuristic that suggests choosing simpler machine learning models as they are expected to generalize better.
  • The heuristic can be divided into two razors, one of which is true and remains a useful tool, and the other that is false and should be abandoned.
  • Ensemble learning algorithms like boosting provide a specific case of how the second razor fails and added complexity can result in lower generalization error.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Ensemble Learning Algorithm Complexity and Occam’s Razor appeared first on Machine Learning Mastery.

How to Choose an Optimization Algorithm


Optimization is the problem of finding a set of inputs to an objective function that results in a maximum or minimum function evaluation.

It is the challenging problem that underlies many machine learning algorithms, from fitting logistic regression models to training artificial neural networks.

There are perhaps hundreds of popular optimization algorithms, and perhaps tens of algorithms to choose from in popular scientific code libraries. This can make it challenging to know which algorithms to consider for a given optimization problem.

In this tutorial, you will discover a guided tour of different optimization algorithms.

After completing this tutorial, you will know:

  • Optimization algorithms may be grouped into those that use derivatives and those that do not.
  • Classical algorithms use the first and sometimes second derivative of the objective function.
  • Direct search and stochastic algorithms are designed for objective functions where function derivatives are unavailable.

Let’s get started.

How to Choose an Optimization Algorithm

How to Choose an Optimization Algorithm
Photo by Matthewjs007, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Optimization Algorithms
  2. Differentiable Objective Function
  3. Non-Differentiable Objective Function

Optimization Algorithms

Optimization refers to a procedure for finding the input parameters or arguments to a function that result in the minimum or maximum output of the function.

The most common type of optimization problem encountered in machine learning is continuous function optimization, where the input arguments to the function are real-valued numeric values, e.g. floating point values. The output from the function is also a real-valued evaluation of the input values.

We might refer to problems of this type as continuous function optimization, to distinguish from functions that take discrete variables and are referred to as combinatorial optimization problems.

There are many different types of optimization algorithms that can be used for continuous function optimization problems, and perhaps just as many ways to group and summarize them.

One approach to grouping optimization algorithms is based on the amount of information available about the target function that is being optimized that, in turn, can be used and harnessed by the optimization algorithm.

Generally, the more information that is available about the target function, the easier the function is to optimize if the information can effectively be used in the search.

Perhaps the major division in optimization algorithms is whether the objective function can be differentiated at a point or not. That is, whether the first derivative (gradient or slope) of the function can be calculated for a given candidate solution or not. This partitions algorithms into those that can make use of the calculated gradient information and those that do not.

  • Differentiable Target Function?
    • Algorithms that use derivative information.
    • Algorithms that do not use derivative information.

We will use this as the major division for grouping optimization algorithms in this tutorial and look at algorithms for differentiable and non-differentiable objective functions.

Note: this is not an exhaustive coverage of algorithms for continuous function optimization, although it does cover the major methods that you are likely to encounter as a regular practitioner.

Differentiable Objective Function

A differentiable function is a function where the derivative can be calculated for any given point in the input space.

The derivative of a function for a value is the rate or amount of change in the function at that point. It is often called the slope.

  • First-Order Derivative: Slope or rate of change of an objective function at a given point.

The derivative of the function with more than one input variable (e.g. multivariate inputs) is commonly referred to as the gradient.

  • Gradient: Derivative of a multivariate continuous objective function.

A derivative for a multivariate objective function is a vector, and each element in the vector is called a partial derivative, or the rate of change for a given variable at the point assuming all other variables are held constant.

  • Partial Derivative: Element of a derivative of a multivariate objective function.

We can calculate the derivative of the derivative of the objective function, that is, the rate of change of the rate of change in the objective function. This is called the second derivative.

  • Second-Order Derivative: Rate at which the derivative of the objective function changes.

For a function that takes multiple input variables, this is a matrix and is referred to as the Hessian matrix.

  • Hessian matrix: Second derivative of a function with two or more input variables.

Simple differentiable functions can be optimized analytically using calculus. Typically, the objective functions that we are interested in cannot be solved analytically.

Optimization is significantly easier if the gradient of the objective function can be calculated, and as such, there has been a lot more research into optimization algorithms that use the derivative than those that do not.

Some groups of algorithms that use gradient information include:

  • Bracketing Algorithms
  • Local Descent Algorithms
  • First-Order Algorithms
  • Second-Order Algorithms

Note: this taxonomy is inspired by the 2019 book “Algorithms for Optimization.”

Let’s take a closer look at each in turn.

Bracketing Algorithms

Bracketing optimization algorithms are intended for optimization problems with one input variable where the optima is known to exist within a specific range.

Bracketing algorithms are able to efficiently navigate the known range and locate the optima, although they assume only a single optima is present (referred to as unimodal objective functions).

Some bracketing algorithms can be used without derivative information if it is not available.

Examples of bracketing algorithms include:

  • Fibonacci Search
  • Golden Section Search
  • Bisection Method
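
As a hedged example, a bracketing search over a known range can be performed with the minimize_scalar() SciPy function using the golden section method; the simple quadratic objective below is an assumption made for illustration.

# sketch: bracketing search on a one-variable function (illustrative example)
from scipy.optimize import minimize_scalar

# simple unimodal objective function with a minimum at x=2
def objective(x):
	return (x - 2.0) ** 2.0

# locate the optimum within a known bracket using the golden section method
result = minimize_scalar(objective, bracket=(-5.0, 5.0), method='golden')
print('x=%.5f, f(x)=%.5f' % (result.x, result.fun))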

Local Descent Algorithms

Local descent optimization algorithms are intended for optimization problems with more than one input variable and a single global optima (e.g. unimodal objective function).

Perhaps the most common example of a local descent algorithm is the line search algorithm.

  • Line Search

There are many variations of the line search (e.g. the Brent-Dekker algorithm), but the procedure generally involves choosing a direction to move in the search space, then performing a bracketing type search in a line or hyperplane in the chosen direction.

This process is repeated until no further improvements can be made.

The limitation is that it is computationally expensive to optimize each directional move in the search space.

First-Order Algorithms

First-order optimization algorithms explicitly involve using the first derivative (gradient) to choose the direction to move in the search space.

The procedures involve first calculating the gradient of the function, then following the gradient in the opposite direction (e.g. downhill to the minimum for minimization problems) using a step size (also called the learning rate).

The step size is a hyperparameter that controls how far to move in the search space, unlike “local descent algorithms” that perform a full line search for each directional move.

A step size that is too small results in a search that takes a long time and can get stuck, whereas a step size that is too large will result in zig-zagging or bouncing around the search space, missing the optima completely.
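
To make the procedure concrete, the snippet below is a minimal sketch of gradient descent on a simple one-variable quadratic function; the function, starting point, step size, and number of iterations are assumptions made for illustration.

# sketch: basic gradient descent on f(x) = x^2 (illustrative example)

# gradient (first derivative) of the objective function f(x) = x**2
def gradient(x):
	return 2.0 * x

# starting point, step size (learning rate), and number of iterations
x, step_size, n_iter = 5.0, 0.1, 50
for i in range(n_iter):
	# move against the gradient by an amount controlled by the step size
	x = x - step_size * gradient(x)
print('Solution: x=%.5f, f(x)=%.5f' % (x, x ** 2.0))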

First-order algorithms are generally referred to as gradient descent, with more specific names referring to minor extensions to the procedure, e.g.:

  • Gradient Descent
  • Momentum
  • Adagrad
  • RMSProp
  • Adam

The gradient descent algorithm also provides the template for the popular stochastic version of the algorithm, named Stochastic Gradient Descent (SGD) that is used to train artificial neural networks (deep learning) models.

The important difference is that the gradient is approximated rather than calculated directly, using prediction error on training data, such as one sample (stochastic), all examples (batch), or a small subset of training data (mini-batch).

The extensions designed to accelerate the gradient descent algorithm (momentum, etc.) can be and are commonly used with SGD.

  • Stochastic Gradient Descent
  • Batch Gradient Descent
  • Mini-Batch Gradient Descent

Second-Order Algorithms

Second-order optimization algorithms explicitly involve using the second derivative (Hessian) to choose the direction to move in the search space.

These algorithms are only appropriate for those objective functions where the Hessian matrix can be calculated or approximated.

Examples of second-order optimization algorithms for univariate objective functions include:

  • Newton’s Method
  • Secant Method

Second-order methods for multivariate objective functions are referred to as Quasi-Newton Methods.

  • Quasi-Newton Method

There are many Quasi-Newton Methods, and they are typically named for the developers of the algorithm, such as:

  • Davidon-Fletcher-Powell
  • Broyden-Fletcher-Goldfarb-Shanno (BFGS)
  • Limited-memory BFGS (L-BFGS)
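
As a hedged example, many of these methods are available via the minimize() SciPy function. The snippet below applies the L-BFGS-B method to a simple two-variable quadratic function; the objective and starting point are assumptions made for illustration.

# sketch: quasi-Newton optimization with L-BFGS-B (illustrative example)
from numpy import asarray
from scipy.optimize import minimize

# simple multivariate objective with a minimum at [0, 0]
def objective(x):
	return x[0] ** 2.0 + x[1] ** 2.0

# starting point for the search
pt = asarray([1.0, -2.0])
# perform the search; the gradient is approximated numerically if not provided
result = minimize(objective, pt, method='L-BFGS-B')
print('x=%s, f(x)=%.5f' % (result.x, result.fun))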

Now that we are familiar with the so-called classical optimization algorithms, let’s look at algorithms used when the objective function is not differentiable.

Non-Differentiable Objective Function

Optimization algorithms that make use of the derivative of the objective function are fast and efficient.

Nevertheless, there are objective functions where the derivative cannot be calculated, typically because the function is complex for a variety of real-world reasons. Alternatively, the derivative can be calculated in some regions of the domain but not all, or it may not be a good guide.

Some difficulties on objective functions for the classical algorithms described in the previous section include:

  • No analytical description of the function (e.g. simulation).
  • Multiple global optima (e.g. multimodal).
  • Stochastic function evaluation (e.g. noisy).
  • Discontinuous objective function (e.g. regions with invalid solutions).

As such, there are optimization algorithms that do not expect first- or second-order derivatives to be available.

These algorithms are sometimes referred to as black-box optimization algorithms as they assume little or nothing (relative to the classical methods) about the objective function.

A grouping of these algorithms include:

  • Direct Algorithms
  • Stochastic Algorithms
  • Population Algorithms

Let’s take a closer look at each in turn.

Direct Algorithms

Direct optimization algorithms are for objective functions for which derivatives cannot be calculated.

The algorithms are deterministic procedures and often assume the objective function has a single global optima, e.g. unimodal.

Direct search methods are also typically referred to as a “pattern search” as they may navigate the search space using geometric shapes or decisions, e.g. patterns.

Gradient information is approximated directly (hence the name) from the result of the objective function comparing the relative difference between scores for points in the search space. These direct estimates are then used to choose a direction to move in the search space and triangulate the region of the optima.

Examples of direct search algorithms include:

  • Cyclic Coordinate Search
  • Powell’s Method
  • Hooke-Jeeves Method
  • Nelder-Mead Simplex Search
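
As a hedged example, the Nelder-Mead simplex search is available via the minimize() SciPy function; the objective function and starting point below are assumptions made for illustration.

# sketch: direct (pattern) search with Nelder-Mead (illustrative example)
from numpy import asarray
from scipy.optimize import minimize

# objective function that does not require a derivative
def objective(x):
	return abs(x[0]) + abs(x[1])

# starting point for the search
pt = asarray([1.5, -0.5])
# perform the search using only objective function evaluations
result = minimize(objective, pt, method='nelder-mead')
print('x=%s, f(x)=%.5f' % (result.x, result.fun))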

Stochastic Algorithms

Stochastic optimization algorithms are algorithms that make use of randomness in the search procedure for objective functions for which derivatives cannot be calculated.

Unlike the deterministic direct search methods, stochastic algorithms typically involve a lot more sampling of the objective function, but are able to handle problems with deceptive local optima.

Stochastic optimization algorithms include:

  • Simulated Annealing
  • Evolution Strategy
  • Cross-Entropy Method
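
As a hedged example, a variation of simulated annealing is available via the dual_annealing() SciPy function; the bounds and objective below are assumptions made for illustration.

# sketch: stochastic search with dual (simulated) annealing (illustrative example)
from scipy.optimize import dual_annealing

# simple objective function with a global minimum at [0, 0]
def objective(x):
	return x[0] ** 2.0 + x[1] ** 2.0

# bounds on each input variable
bounds = [(-5.0, 5.0), (-5.0, 5.0)]
# perform the stochastic search within the bounds
result = dual_annealing(objective, bounds, seed=1)
print('x=%s, f(x)=%.5f' % (result.x, result.fun))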

Population Algorithms

Population optimization algorithms are stochastic optimization algorithms that maintain a pool (a population) of candidate solutions that together are used to sample, explore, and hone in on an optima.

Algorithms of this type are intended for more challenging objective problems that may have noisy function evaluations and many global optima (multimodal), and where finding a good or good enough solution is challenging or infeasible using other methods.

The pool of candidate solutions adds robustness to the search, increasing the likelihood of overcoming local optima.

Examples of population optimization algorithms include:

  • Genetic Algorithm
  • Differential Evolution
  • Particle Swarm Optimization
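
As a hedged example, differential evolution is available via the differential_evolution() SciPy function; the bounds and objective below are assumptions made for illustration.

# sketch: population-based search with differential evolution (illustrative example)
from scipy.optimize import differential_evolution

# simple objective function with a global minimum at [0, 0]
def objective(x):
	return x[0] ** 2.0 + x[1] ** 2.0

# bounds on each input variable define the search space for the population
bounds = [(-5.0, 5.0), (-5.0, 5.0)]
# evolve a population of candidate solutions toward the optimum
result = differential_evolution(objective, bounds, seed=1)
print('x=%s, f(x)=%.5f' % (result.x, result.fun))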

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

Articles

Summary

In this tutorial, you discovered a guided tour of different optimization algorithms.

Specifically, you learned:

  • Optimization algorithms may be grouped into those that use derivatives and those that do not.
  • Classical algorithms use the first and sometimes second derivative of the objective function.
  • Direct search and stochastic algorithms are designed for objective functions where function derivatives are unavailable.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post How to Choose an Optimization Algorithm appeared first on Machine Learning Mastery.

Feature Selection with Stochastic Optimization Algorithms


Typically, a simpler and better-performing machine learning model can be developed by removing input features (columns) from the training dataset.

This is called feature selection and there are many different types of algorithms that can be used.

It is possible to frame the problem of feature selection as an optimization problem. In the case that there are few input features, all possible combinations of input features can be evaluated and the best subset found definitively. In the case of a vast number of input features, a stochastic optimization algorithm can be used to explore the search space and find an effective subset of features.

In this tutorial, you will discover how to use optimization algorithms for feature selection in machine learning.

After completing this tutorial, you will know:

  • The problem of feature selection can be broadly defined as an optimization problem.
  • How to enumerate all possible subsets of input features for a dataset.
  • How to apply stochastic optimization to select an optimal subset of input features.

Let’s get started.

How to Use Optimization for Feature Selection

How to Use Optimization for Feature Selection
Photo by Gregory “Slobirdr” Smith, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Optimization for Feature Selection
  2. Enumerate All Feature Subsets
  3. Optimize Feature Subsets

Optimization for Feature Selection

Feature selection is the process of reducing the number of input variables when developing a predictive model.

It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and, in some cases, to improve the performance of the model. There are many different types of feature selection algorithms, although they can broadly be grouped into two main types: wrapper and filter methods.

Wrapper feature selection methods create many models with different subsets of input features and select those features that result in the best performing model according to a performance metric. These methods are unconcerned with the variable types, although they can be computationally expensive. RFE is a good example of a wrapper feature selection method.

Filter feature selection methods use statistical techniques to evaluate the relationship between each input variable and the target variable, and these scores are used as the basis to choose (filter) those input variables that will be used in the model.

  • Wrapper Feature Selection: Search for well-performing subsets of features.
  • Filter Feature Selection: Select subsets of features based on their relationship with the target.

For more on choosing feature selection algorithms, see the tutorial:

A popular wrapper method is the Recursive Feature Elimination, or RFE, algorithm.

RFE works by searching for a subset of features by starting with all features in the training dataset and successively removing features until the desired number remains.

This is achieved by fitting the given machine learning algorithm used in the core of the model, ranking features by importance, discarding the least important features, and re-fitting the model. This process is repeated until a specified number of features remains.
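
A minimal sketch of RFE with scikit-learn is listed below; the synthetic dataset and the number of features to select are assumptions made for illustration.

# sketch: recursive feature elimination as a wrapper method (illustrative example)
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
# define a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# select the five best features by repeatedly fitting, ranking, and discarding
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
rfe.fit(X, y)
# report which columns were retained
print(rfe.support_)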

For more on RFE, see the tutorial:

The problem of wrapper feature selection can be framed as an optimization problem. That is, find a subset of input features that result in the best model performance.

RFE is one approach to solving this problem systematically, although it may be limited by a large number of features.

An alternative approach would be to use a stochastic optimization algorithm, such as a stochastic hill climbing algorithm, when the number of features is very large. When the number of features is relatively small, it may be possible to enumerate all possible subsets of features.

  • Few Input Variables: Enumerate all possible subsets of features.
  • Many Input Features: Stochastic optimization algorithm to find good subsets of features.

Now that we are familiar with the idea that feature selection may be explored as an optimization problem, let’s look at how we might enumerate all possible feature subsets.

Enumerate All Feature Subsets

When the number of input variables is relatively small and the model evaluation is relatively fast, then it may be possible to enumerate all possible subsets of input variables.

This means evaluating the performance of a model using a test harness given every possible unique group of input variables.

We will explore how to do this with a worked example.

First, let’s define a small binary classification dataset with few input features. We can use the make_classification() function to define a dataset with five input variables, two of which are informative, and 1,000 rows.

The example below defines the dataset and summarizes its shape.

# define a small classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=3, random_state=1)
# summarize the shape of the dataset
print(X.shape, y.shape)

Running the example creates the dataset and confirms that it has the desired shape.

(1000, 5) (1000,)

Next, we can establish a baseline in performance using a model evaluated on the entire dataset.

We will use a DecisionTreeClassifier as the model because its performance is quite sensitive to the choice of input variables.

We will evaluate the model using good practices, such as repeated stratified k-fold cross-validation with three repeats and 10 folds.

The complete example is listed below.

# evaluate a decision tree on the entire small dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=3, random_state=1)
# define model
model = DecisionTreeClassifier()
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the decision tree on the entire dataset and reports the mean and standard deviation classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved an accuracy of about 80.5 percent.

Mean Accuracy: 0.805 (0.030)

Next, we can try to improve model performance by using a subset of the input features.

First, we must choose a representation to enumerate.

In this case, we will enumerate a list of boolean values, with one value for each input feature: True if the feature is to be used and False if the feature is not to be used as input.

For example, with the five input features the sequence [True, True, True, True, True] would use all input features, and [True, False, False, False, False] would only use the first input feature as input.

We can enumerate all sequences of boolean values of length five using the product() function from the itertools module. We must specify the valid values [True, False] and the number of repeats in the sequence, which is equal to the number of input variables.

The function returns an iterable that we can enumerate directly for each sequence.

...
# determine the number of columns
n_cols = X.shape[1]
best_subset, best_score = None, 0.0
# enumerate all combinations of input features
for subset in product([True, False], repeat=n_cols):
	...

For a given sequence of boolean values, we can enumerate it and transform it into a sequence of column indexes for each True in the sequence.

...
# convert into column indexes
ix = [i for i, x in enumerate(subset) if x]

If the sequence has no column indexes (in the case of all False values), then we can skip that sequence.

# check for no columns (all False)
if len(ix) == 0:
	continue

We can then use the column indexes to choose the columns in the dataset.

...
# select columns
X_new = X[:, ix]

And this subset of the dataset can then be evaluated as we did before.

...
# define model
model = DecisionTreeClassifier()
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X_new, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize scores
result = mean(scores)

If the accuracy for the model is better than the best sequence found so far, we can store it.

...
# check if it is better than the best so far
if best_score is None or result >= best_score:
	# better result
	best_subset, best_score = ix, result

And that’s it.

Tying this together, the complete example of feature selection by enumerating all possible feature subsets is listed below.

# feature selection by enumerating all possible subsets of features
from itertools import product
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=3, random_state=1)
# determine the number of columns
n_cols = X.shape[1]
best_subset, best_score = None, 0.0
# enumerate all combinations of input features
for subset in product([True, False], repeat=n_cols):
	# convert into column indexes
	ix = [i for i, x in enumerate(subset) if x]
	# check for no columns (all False)
	if len(ix) == 0:
		continue
	# select columns
	X_new = X[:, ix]
	# define model
	model = DecisionTreeClassifier()
	# define evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# evaluate model
	scores = cross_val_score(model, X_new, y, scoring='accuracy', cv=cv, n_jobs=-1)
	# summarize scores
	result = mean(scores)
	# report progress
	print('>f(%s) = %f ' % (ix, result))
	# check if it is better than the best so far
	if best_score is None or result >= best_score:
		# better result
		best_subset, best_score = ix, result
# report best
print('Done!')
print('f(%s) = %f' % (best_subset, best_score))

Running the example reports the mean classification accuracy of the model for each subset of features considered. The best subset is then reported at the end of the run.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the best subset of features involved features at indexes [0, 3, 4] that resulted in a mean classification accuracy of about 83.0 percent, which is better than the result reported previously using all input features.

>f([0, 1, 2, 3, 4]) = 0.813667
>f([0, 1, 2, 3]) = 0.827667
>f([0, 1, 2, 4]) = 0.815333
>f([0, 1, 2]) = 0.824000
>f([0, 1, 3, 4]) = 0.821333
>f([0, 1, 3]) = 0.825667
>f([0, 1, 4]) = 0.807333
>f([0, 1]) = 0.817667
>f([0, 2, 3, 4]) = 0.830333
>f([0, 2, 3]) = 0.819000
>f([0, 2, 4]) = 0.828000
>f([0, 2]) = 0.818333
>f([0, 3, 4]) = 0.830333
>f([0, 3]) = 0.821333
>f([0, 4]) = 0.816000
>f([0]) = 0.639333
>f([1, 2, 3, 4]) = 0.823667
>f([1, 2, 3]) = 0.821667
>f([1, 2, 4]) = 0.823333
>f([1, 2]) = 0.818667
>f([1, 3, 4]) = 0.818000
>f([1, 3]) = 0.820667
>f([1, 4]) = 0.809000
>f([1]) = 0.797000
>f([2, 3, 4]) = 0.827667
>f([2, 3]) = 0.755000
>f([2, 4]) = 0.827000
>f([2]) = 0.516667
>f([3, 4]) = 0.824000
>f([3]) = 0.514333
>f([4]) = 0.777667
Done!
f([0, 3, 4]) = 0.830333

Now that we know how to enumerate all possible feature subsets, let’s look at how we might use a stochastic optimization algorithm to choose a subset of features.

Optimize Feature Subsets

We can apply a stochastic optimization algorithm to the search space of subsets of input features.

First, let’s define a larger problem that has many more features, making model evaluation too slow and the search space too large for enumerating all subsets.

We will define a classification problem with 10,000 rows and 500 input features, 10 of which are relevant and the remaining 490 are redundant.

# define a large classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)
# summarize the shape of the dataset
print(X.shape, y.shape)

Running the example creates the dataset and confirms that it has the desired shape.

(10000, 500) (10000,)

We can establish a baseline in performance by evaluating a model on the dataset with all input features.

Because the dataset is large and the model is slow to evaluate, we will modify the evaluation of the model to use 3-fold cross-validation, e.g. fewer folds and no repeats.

The complete example is listed below.

# evaluate a decision tree on the entire larger dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
# define dataset
X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)
# define model
model = DecisionTreeClassifier()
# define evaluation procedure
cv = StratifiedKFold(n_splits=3)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report result
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the decision tree on the entire dataset and reports the mean and standard deviation classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved an accuracy of about 91.3 percent.

This provides a baseline that we would expect to outperform using feature selection.

Mean Accuracy: 0.913 (0.001)

We will use a simple stochastic hill climbing algorithm as the optimization algorithm.

First, we must define the objective function. It will take the dataset and a subset of features to use as input and return an estimated model accuracy from 0 (worst) to 1 (best). It is a maximizing optimization problem.

This objective function is simply the decoding of the sequence and model evaluation step from the previous section.

The objective() function below implements this and returns both the score and the decoded subset of columns used for helpful reporting.

# objective function
def objective(X, y, subset):
	# convert into column indexes
	ix = [i for i, x in enumerate(subset) if x]
	# check for no columns (all False)
	if len(ix) == 0:
		return 0.0, ix
	# select columns
	X_new = X[:, ix]
	# define model
	model = DecisionTreeClassifier()
	# evaluate model
	scores = cross_val_score(model, X_new, y, scoring='accuracy', cv=3, n_jobs=-1)
	# summarize scores
	result = mean(scores)
	return result, ix

We also need a function that can take a step in the search space.

Given an existing solution, it must modify it and return a new solution in close proximity. In this case, we will achieve this by randomly flipping the inclusion/exclusion of columns in the sequence.

Each position in the sequence will be considered independently and will be flipped probabilistically where the probability of flipping is a hyperparameter.

The mutate() function below implements this given a candidate solution (sequence of booleans) and a mutation hyperparameter, creating and returning a modified solution (a step in the search space).

The larger the p_mutate value (in the range 0 to 1), the larger the step in the search space.

# mutation operator
def mutate(solution, p_mutate):
	# make a copy
	child = solution.copy()
	for i in range(len(child)):
		# check for a mutation
		if rand() < p_mutate:
			# flip the inclusion
			child[i] = not child[i]
	return child

We can now implement the hill climbing algorithm.

The initial solution is a randomly generated sequence, which is then evaluated.

...
# generate an initial point
solution = choice([True, False], size=X.shape[1])
# evaluate the initial point
solution_eval, ix = objective(X, y, solution)

We then loop for a fixed number of iterations, creating mutated versions of the current solution, evaluating them, and saving them if the score is better.

...
# run the hill climb
for i in range(n_iter):
	# take a step
	candidate = mutate(solution, p_mutate)
	# evaluate candidate point
	candidate_eval, ix = objective(X, y, candidate)
	# check if we should keep the new point
	if candidate_eval >= solution_eval:
		# store the new point
		solution, solution_eval = candidate, candidate_eval
	# report progress
	print('>%d f(%s) = %f' % (i+1, len(ix), solution_eval))

The hillclimbing() function below implements this, taking the dataset, objective function, and hyperparameters as arguments and returns the best subset of dataset columns and the estimated performance of the model.

# hill climbing local search algorithm
def hillclimbing(X, y, objective, n_iter, p_mutate):
	# generate an initial point
	solution = choice([True, False], size=X.shape[1])
	# evaluate the initial point
	solution_eval, ix = objective(X, y, solution)
	# run the hill climb
	for i in range(n_iter):
		# take a step
		candidate = mutate(solution, p_mutate)
		# evaluate candidate point
		candidate_eval, ix = objective(X, y, candidate)
		# check if we should keep the new point
		if candidate_eval >= solution_eval:
			# store the new point
			solution, solution_eval = candidate, candidate_eval
		# report progress
		print('>%d f(%s) = %f' % (i+1, len(ix), solution_eval))
	return solution, solution_eval

We can then call this function and pass in our synthetic dataset to perform optimization for feature selection.

In this case, we will run the algorithm for 100 iterations and make about ten flips to the sequence for a given mutation, which is quite conservative.

...
# define dataset
X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)
# define the total iterations
n_iter = 100
# probability of including/excluding a column
p_mut = 10.0 / 500.0
# perform the hill climbing search
subset, score = hillclimbing(X, y, objective, n_iter, p_mut)

At the end of the run, we will convert the boolean sequence into column indexes (so we could fit a final model if we wanted) and report the performance of the best subsequence.

...
# convert into column indexes
ix = [i for i, x in enumerate(subset) if x]
print('Done!')
print('Best: f(%d) = %f' % (len(ix), score))

Tying this all together, the complete example is listed below.

# stochastic optimization for feature selection
from numpy import mean
from numpy.random import rand
from numpy.random import choice
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# objective function
def objective(X, y, subset):
	# convert into column indexes
	ix = [i for i, x in enumerate(subset) if x]
	# check for no columns (all False)
	if len(ix) == 0:
		return 0.0, ix
	# select columns
	X_new = X[:, ix]
	# define model
	model = DecisionTreeClassifier()
	# evaluate model
	scores = cross_val_score(model, X_new, y, scoring='accuracy', cv=3, n_jobs=-1)
	# summarize scores
	result = mean(scores)
	return result, ix

# mutation operator
def mutate(solution, p_mutate):
	# make a copy
	child = solution.copy()
	for i in range(len(child)):
		# check for a mutation
		if rand() < p_mutate:
			# flip the inclusion
			child[i] = not child[i]
	return child

# hill climbing local search algorithm
def hillclimbing(X, y, objective, n_iter, p_mutate):
	# generate an initial point
	solution = choice([True, False], size=X.shape[1])
	# evaluate the initial point
	solution_eval, ix = objective(X, y, solution)
	# run the hill climb
	for i in range(n_iter):
		# take a step
		candidate = mutate(solution, p_mutate)
		# evaluate candidate point
		candidate_eval, ix = objective(X, y, candidate)
		# check if we should keep the new point
		if candidate_eval >= solution_eval:
			# store the new point
			solution, solution_eval = candidate, candidate_eval
		# report progress
		print('>%d f(%s) = %f' % (i+1, len(ix), solution_eval))
	return solution, solution_eval

# define dataset
X, y = make_classification(n_samples=10000, n_features=500, n_informative=10, n_redundant=490, random_state=1)
# define the total iterations
n_iter = 100
# probability of including/excluding a column
p_mut = 10.0 / 500.0
# perform the hill climbing search
subset, score = hillclimbing(X, y, objective, n_iter, p_mut)
# convert into column indexes
ix = [i for i, x in enumerate(subset) if x]
print('Done!')
print('Best: f(%d) = %f' % (len(ix), score))

Running the example reports the mean classification accuracy of the model for each subset of features considered. The best subset is then reported at the end of the run.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the best performance was achieved with a subset of 239 features and a classification accuracy of approximately 91.8 percent.

This is better than a model evaluated on all input features.

Although the result is better, we know we can do a lot better, perhaps with tuning of the hyperparameters of the optimization algorithm or perhaps by using an alternate optimization algorithm.

...
>80 f(240) = 0.918099
>81 f(236) = 0.918099
>82 f(238) = 0.918099
>83 f(236) = 0.918099
>84 f(239) = 0.918099
>85 f(240) = 0.918099
>86 f(239) = 0.918099
>87 f(245) = 0.918099
>88 f(241) = 0.918099
>89 f(239) = 0.918099
>90 f(239) = 0.918099
>91 f(241) = 0.918099
>92 f(243) = 0.918099
>93 f(245) = 0.918099
>94 f(239) = 0.918099
>95 f(245) = 0.918099
>96 f(244) = 0.918099
>97 f(242) = 0.918099
>98 f(238) = 0.918099
>99 f(248) = 0.918099
>100 f(238) = 0.918099
Done!
Best: f(239) = 0.918099
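
If we wanted to use the discovered subset, a final model could then be fit on only the selected columns. The snippet below is a rough sketch that continues from the variables defined in the complete example above.

...
# select only the columns found by the search
X_selected = X[:, ix]
# fit a final model on the reduced dataset
final_model = DecisionTreeClassifier()
final_model.fit(X_selected, y)
# new data must be reduced to the same columns before calling predict()
print(final_model.predict(X_selected[:5, :]))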

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

APIs

Summary

In this tutorial, you discovered how to use optimization algorithms for feature selection in machine learning.

Specifically, you learned:

  • The problem of feature selection can be broadly defined as an optimization problem.
  • How to enumerate all possible subsets of input features for a dataset.
  • How to apply stochastic optimization to select an optimal subset of input features.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Feature Selection with Stochastic Optimization Algorithms appeared first on Machine Learning Mastery.

Histogram-Based Gradient Boosting Ensembles in Python


Gradient boosting is an ensemble of decision trees algorithms.

It may be one of the most popular techniques for structured (tabular) classification and regression predictive modeling problems given that it performs so well across a wide range of datasets in practice.

A major problem of gradient boosting is that it is slow to train the model. This is particularly a problem when using the model on large datasets with tens of thousands of examples (rows).

Training the trees that are added to the ensemble can be dramatically accelerated by discretizing (binning) the continuous input variables to a few hundred unique values. Gradient boosting ensembles that implement this technique and tailor the training algorithm around input variables under this transform are referred to as histogram-based gradient boosting ensembles.

In this tutorial, you will discover how to develop histogram-based gradient boosting tree ensembles.

After completing this tutorial, you will know:

  • Histogram-based gradient boosting is a technique for training faster decision trees used in the gradient boosting ensemble.
  • How to use the experimental implementation of histogram-based gradient boosting in the scikit-learn library.
  • How to use histogram-based gradient boosting ensembles with the XGBoost and LightGBM third-party libraries.

Let’s get started.

How to Develop Histogram-Based Gradient Boosting Ensembles
Photo by YoTuT, some rights reserved.

Tutorial Overview

This tutorial is divided into four parts; they are:

  1. Histogram Gradient Boosting
  2. Histogram Gradient Boosting With Scikit-Learn
  3. Histogram Gradient Boosting With XGBoost
  4. Histogram Gradient Boosting With LightGBM

Histogram Gradient Boosting

Gradient boosting is an ensemble machine learning algorithm.

Boosting refers to a class of ensemble learning algorithms that add tree models to an ensemble sequentially. Each tree model added to the ensemble attempts to correct the prediction errors made by the tree models already present in the ensemble.

Gradient boosting is a generalization of boosting algorithms like AdaBoost to a statistical framework that treats the training process as an additive model and allows arbitrary loss functions to be used, greatly improving the capability of the technique. As such, gradient boosting ensembles are the go-to technique for most structured (e.g. tabular data) predictive modeling tasks.
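
To make the additive idea concrete, the sketch below fits small regression trees sequentially, each one on the residual errors of the ensemble so far, and shrinks each tree's contribution by a learning rate. It is a simplified illustration of boosting with a squared error loss on a synthetic dataset, not how any particular library implements the technique.

# sketch: additive boosting of regression trees on residual errors
from numpy import full, mean
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
# define an illustrative regression dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=1)
learning_rate = 0.1
# start from a constant prediction (the mean of the target)
pred = full(len(y), mean(y))
trees = list()
for i in range(100):
	# fit a small tree to the current residual errors
	tree = DecisionTreeRegressor(max_depth=3, random_state=1)
	tree.fit(X, y - pred)
	# add a shrunken version of the tree's predictions to the ensemble
	pred += learning_rate * tree.predict(X)
	trees.append(tree)
# report the training error of the additive ensemble
print('training MSE: %.3f' % mean((y - pred) ** 2))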

Although gradient boosting performs very well in practice, the models can be slow to train. This is because trees must be created and added sequentially, unlike other ensemble models like random forest where ensemble members can be trained in parallel, exploiting multiple CPU cores. As such, a lot of effort has been put into techniques that improve the efficiency of the gradient boosting training algorithm.

Two notable libraries that wrap up many modern efficiency techniques for training gradient boosting algorithms include the Extreme Gradient Boosting (XGBoost) and Light Gradient Boosting Machines (LightGBM).

One aspect of the training algorithm that can be accelerated is the construction of each decision tree, the speed of which is bounded by the number of examples (rows) and number of features (columns) in the training dataset. Large datasets, e.g. tens of thousands of examples or more, can result in very slow construction of trees, as split points for every value of every feature must be considered during tree construction.

If we can reduce #data or #feature, we will be able to substantially speed up the training of GBDT.

LightGBM: A Highly Efficient Gradient Boosting Decision Tree, 2017.

The construction of decision trees can be sped up significantly by reducing the number of values for continuous input features. This can be achieved by discretization or binning values into a fixed number of buckets. This can reduce the number of unique values for each feature from tens of thousands down to a few hundred.

This allows the decision tree to operate upon the ordinal bucket (an integer) instead of specific values in the training dataset. This coarse approximation of the input data often has little impact on model skill, if not improves the model skill, and dramatically accelerates the construction of the decision tree.

Additionally, efficient data structures can be used to represent the binning of the input data; for example, histograms can be used and the tree construction algorithm can be further tailored for the efficient use of histograms in the construction of each tree.
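
To make the binning idea concrete, the short sketch below discretizes a single continuous feature into a fixed number of equal-width buckets using NumPy. The bin count of 255 and the synthetic data are illustrative assumptions, not taken from any specific library implementation.

# sketch: discretize a continuous feature into a fixed number of bins
from numpy import linspace, digitize
from numpy.random import randn, seed
# generate a synthetic continuous feature with many unique values
seed(1)
feature = randn(10000)
# define equal-width bin edges across the observed range
edges = linspace(feature.min(), feature.max(), num=255)
# map each value to its ordinal bucket (an integer)
binned = digitize(feature, edges)
print('unique values before: %d' % len(set(feature)))
print('unique values after: %d' % len(set(binned)))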

These techniques were originally developed in the late 1990s for efficiently developing single decision trees on large datasets, but they can also be used in ensembles of decision trees, such as gradient boosting.

As such, it is common to refer to a gradient boosting algorithm supporting “histograms” in modern machine learning libraries as histogram-based gradient boosting.

Instead of finding the split points on the sorted feature values, histogram-based algorithm buckets continuous feature values into discrete bins and uses these bins to construct feature histograms during training. Since the histogram-based algorithm is more efficient in both memory consumption and training speed, we will develop our work on its basis.

LightGBM: A Highly Efficient Gradient Boosting Decision Tree, 2017.

Now that we are familiar with the idea of adding histograms to the construction of decision trees in gradient boosting, let’s review some common implementations we can use on our predictive modeling projects.

There are three main libraries that support the technique; they are Scikit-Learn, XGBoost, and LightGBM.

Let’s take a closer look at each in turn.

Note: We are not racing the algorithms; instead, we are just demonstrating how to configure each implementation to use the histogram method and hold all other unrelated hyperparameters constant at their default values.

Histogram Gradient Boosting With Scikit-Learn

The scikit-learn machine learning library provides an experimental implementation of gradient boosting that supports the histogram technique.

Specifically, this is provided in the HistGradientBoostingClassifier and HistGradientBoostingRegressor classes.

In order to use these classes, you must add an additional line to your project that indicates you are happy to use these experimental techniques and that their behavior may change with subsequent releases of the library.

...
# explicitly require this experimental feature
from sklearn.experimental import enable_hist_gradient_boosting

The scikit-learn documentation claims that these histogram-based implementations of gradient boosting are orders of magnitude faster than the default gradient boosting implementation provided by the library.

These histogram-based estimators can be orders of magnitude faster than GradientBoostingClassifier and GradientBoostingRegressor when the number of samples is larger than tens of thousands of samples.

Histogram-Based Gradient Boosting, Scikit-Learn User Guide.

The classes can be used just like any other scikit-learn model.

By default, the ensemble uses 255 bins for each continuous input feature, and this can be set via the “max_bins” argument. Setting this to smaller values, such as 50 or 100, may result in further efficiency improvements, although perhaps at the cost of some model skill.

The number of trees can be set via the “max_iter” argument and defaults to 100.

...
# define the model
model = HistGradientBoostingClassifier(max_bins=255, max_iter=100)

The example below shows how to evaluate a histogram gradient boosting algorithm on a synthetic classification dataset with 10,000 examples and 100 features.

The model is evaluated using repeated stratified k-fold cross-validation and the mean accuracy across all folds and repeats is reported.

# evaluate sklearn histogram gradient boosting algorithm for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
# define dataset
X, y = make_classification(n_samples=10000, n_features=100, n_informative=50, n_redundant=50, random_state=1)
# define the model
model = HistGradientBoostingClassifier(max_bins=255, max_iter=100)
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model and collect the scores
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates the model performance on the synthetic dataset and reports the mean and standard deviation classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the scikit-learn histogram gradient boosting algorithm achieves a mean accuracy of about 94.3 percent on the synthetic dataset.

Accuracy: 0.943 (0.007)

We can also explore the effect of the number of bins on model performance.

The example below evaluates the performance of the model with a different number of bins for each continuous input feature, using bin counts of 10, 50, 100, 150, 200, and 255.

The complete example is listed below.

# compare number of bins for sklearn histogram gradient boosting
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from matplotlib import pyplot

# get the dataset
def get_dataset():
	X, y = make_classification(n_samples=10000, n_features=100, n_informative=50, n_redundant=50, random_state=1)
	return X, y

# get a list of models to evaluate
def get_models():
	models = dict()
	for i in [10, 50, 100, 150, 200, 255]:
		models[str(i)] = HistGradientBoostingClassifier(max_bins=i, max_iter=100)
	return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
	# define the evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# evaluate the model and collect the scores
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	return scores

# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
	# evaluate the model and collect the scores
	scores = evaluate_model(model, X, y)
	# store the results
	results.append(scores)
	names.append(name)
	# report performance along the way
	print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Running the example evaluates each configuration, reporting the mean and standard deviation classification accuracy along the way and finally creating a plot of the distribution of scores.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that increasing the number of bins may decrease the mean accuracy of the model on this dataset.

We might expect that an increase in the number of bins may also require an increase in the number of trees (max_iter) to ensure that the additional split points can be effectively explored and harnessed by the model.

Importantly, fitting an ensemble where trees use 10 or 50 bins per variable is dramatically faster than 255 bins per input variable.

>10 0.945 (0.009)
>50 0.944 (0.007)
>100 0.944 (0.008)
>150 0.944 (0.008)
>200 0.944 (0.007)
>255 0.943 (0.007)

A figure is created comparing the distribution in accuracy scores for each configuration using box and whisker plots.

In this case, we can see that increasing the number of bins in the histogram appears to reduce the spread of the distribution, although it may lower the mean performance of the model.

Box and Whisker Plots of the Number of Bins for the Scikit-Learn Histogram Gradient Boosting Ensemble

Histogram Gradient Boosting With XGBoost

Extreme Gradient Boosting, or XGBoost for short, is a library that provides a highly optimized implementation of gradient boosting.

One of the techniques implemented in the library is the use of histograms for the continuous input variables.

The XGBoost library can be installed using your favorite Python package manager, such as Pip; for example:

sudo pip install xgboost

We can develop XGBoost models for use with the scikit-learn library via the XGBClassifier and XGBRegressor classes.

The training algorithm can be configured to use a histogram-style method by setting the “tree_method” argument to ‘approx‘ (more recent versions of XGBoost also provide a dedicated histogram method via ‘hist‘), and the number of bins can be set via the “max_bin” argument.

...
# define the model
model = XGBClassifier(tree_method='approx', max_bin=255, n_estimators=100)

The example below demonstrates evaluating an XGBoost model configured to use the histogram or approximate technique for constructing trees with 255 bins per continuous input feature and 100 trees in the model.

# evaluate xgboost histogram gradient boosting algorithm for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from xgboost import XGBClassifier
# define dataset
X, y = make_classification(n_samples=10000, n_features=100, n_informative=50, n_redundant=50, random_state=1)
# define the model
model = XGBClassifier(tree_method='approx', max_bin=255, n_estimators=100)
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model and collect the scores
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates the model performance on the synthetic dataset and reports the mean and standard deviation classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the XGBoost histogram gradient boosting algorithm achieves a mean accuracy of about 95.7 percent on the synthetic dataset.

Accuracy: 0.957 (0.007)

Histogram Gradient Boosting With LightGBM

Light Gradient Boosting Machine or LightGBM for short is another third-party library like XGBoost that provides a highly optimized implementation of gradient boosting.

LightGBM may have implemented the histogram technique before XGBoost, with XGBoost later adding the same technique, highlighting the ongoing “gradient boosting efficiency” competition between gradient boosting libraries.

The LightGBM library can be installed using your favorite Python package manager, such as Pip; for example:

sudo pip install lightgbm

We can develop LightGBM models for use with the scikit-learn library via the LGBMClassifier and LGBMRegressor classes.

The training algorithm uses histograms by default. The maximum bins per continuous input variable can be set via the “max_bin” argument.

...
# define the model
model = LGBMClassifier(max_bin=255, n_estimators=100)

The example below demonstrates evaluating a LightGBM model configured to use the histogram or approximate technique for constructing trees with 255 bins per continuous input feature and 100 trees in the model.

# evaluate lightgbm histogram gradient boosting algorithm for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from lightgbm import LGBMClassifier
# define dataset
X, y = make_classification(n_samples=10000, n_features=100, n_informative=50, n_redundant=50, random_state=1)
# define the model
model = LGBMClassifier(max_bin=255, n_estimators=100)
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model and collect the scores
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates the model performance on the synthetic dataset and reports the mean and standard deviation classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the LightGBM histogram gradient boosting algorithm achieves a mean accuracy of about 94.2 percent on the synthetic dataset.

Accuracy: 0.942 (0.006)

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

Papers

APIs

Summary

In this tutorial, you discovered how to develop histogram-based gradient boosting tree ensembles.

Specifically, you learned:

  • Histogram-based gradient boosting is a technique for training faster decision trees used in the gradient boosting ensemble.
  • How to use the experimental implementation of histogram-based gradient boosting in the scikit-learn library.
  • How to use histogram-based gradient boosting ensembles with the XGBoost and LightGBM third-party libraries.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Histogram-Based Gradient Boosting Ensembles in Python appeared first on Machine Learning Mastery.


Semi-Supervised Learning With Label Propagation

Semi-supervised learning refers to algorithms that attempt to make use of both labeled and unlabeled training data.

Semi-supervised learning algorithms are unlike supervised learning algorithms that are only able to learn from labeled training data.

A popular approach to semi-supervised learning is to create a graph that connects examples in the training dataset and propagate known labels through the edges of the graph to label unlabeled examples. An example of this approach to semi-supervised learning is the label propagation algorithm for classification predictive modeling.

In this tutorial, you will discover how to apply the label propagation algorithm to a semi-supervised learning classification dataset.

After completing this tutorial, you will know:

  • An intuition for how the label propagation semi-supervised learning algorithm works.
  • How to develop a semi-supervised classification dataset and establish a baseline in performance with a supervised learning algorithm.
  • How to develop and evaluate a label propagation algorithm and use the model output to train a supervised learning algorithm.

Let’s get started.

Semi-Supervised Learning With Label Propagation
Photo by TheBluesDude, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Label Propagation Algorithm
  2. Semi-Supervised Classification Dataset
  3. Label Propagation for Semi-Supervised Learning

Label Propagation Algorithm

Label Propagation is a semi-supervised learning algorithm.

The algorithm was proposed in the 2002 technical report by Xiaojin Zhu and Zoubin Ghahramani titled “Learning From Labeled And Unlabeled Data With Label Propagation.”

The intuition for the algorithm is that a graph is created that connects all examples (rows) in the dataset based on their distance, such as Euclidean distance. Nodes in the graph are then assigned soft labels or label distributions based on the labels or label distributions of examples connected nearby in the graph.

Many semi-supervised learning algorithms rely on the geometry of the data induced by both labeled and unlabeled examples to improve on supervised methods that use only the labeled data. This geometry can be naturally represented by an empirical graph g = (V,E) where nodes V = {1,…,n} represent the training data and edges E represent similarities between them

— Page 193, Semi-Supervised Learning, 2006.

Propagation refers to the iterative manner in which labels are assigned to nodes in the graph and propagated along the edges of the graph to connected nodes.

This procedure is sometimes called label propagation, as it “propagates” labels from the labeled vertices (which are fixed) gradually through the edges to all the unlabeled vertices.

— Page 48, Introduction to Semi-Supervised Learning, 2009.

The process is repeated for a fixed number of iterations to strengthen the labels assigned to unlabeled examples.

Starting with nodes 1, 2,…,l labeled with their known label (1 or −1) and nodes l + 1,…,n labeled with 0, each node starts to propagate its label to its neighbors, and the process is repeated until convergence.

— Page 194, Semi-Supervised Learning, 2006.
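
The minimal sketch below illustrates this iterative idea on a tiny graph with made-up weights, rather than any specific dataset or the exact scikit-learn implementation: label scores are repeatedly spread along the edges of a row-normalized weight matrix, while the known labels are clamped back to their original values after each step.

# sketch: iterative label propagation on a tiny, hand-crafted graph
from numpy import array, argmax
# symmetric weight matrix for a 4-node graph (values are illustrative)
W = array([[0.0, 0.9, 0.1, 0.0],
           [0.9, 0.0, 0.2, 0.1],
           [0.1, 0.2, 0.0, 0.8],
           [0.0, 0.1, 0.8, 0.0]])
# row-normalize so each node averages over its neighbors
T = W / W.sum(axis=1, keepdims=True)
# one-hot label scores: node 0 is class 0, node 3 is class 1, nodes 1 and 2 are unlabeled
Y = array([[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]])
labeled = [0, 3]
F = Y.copy()
for _ in range(50):
	# propagate label scores to neighboring nodes
	F = T.dot(F)
	# clamp the known labels so they are not overwritten
	F[labeled] = Y[labeled]
# assign each node the class with the highest score
print('estimated labels:', argmax(F, axis=1))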

Now that we are familiar with the Label Propagation algorithm, let’s look at how we might use it on a project. First, we must define a semi-supervised classification dataset.

Semi-Supervised Classification Dataset

In this section, we will define a dataset for semi-supervised learning and establish a baseline in performance on the dataset.

First, we can define a synthetic classification dataset using the make_classification() function.

We will define the dataset with two classes (binary classification) and two input variables and 1,000 examples.

...
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)

Next, we will split the dataset into train and test datasets with an equal 50-50 split (e.g. 500 rows in each).

...
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)

Finally, we will split the training dataset in half again into a portion that will have labels and a portion that we will pretend is unlabeled.

...
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)

Tying this together, the complete example of preparing the semi-supervised learning dataset is listed below.

# prepare semi-supervised learning dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# summarize training set size
print('Labeled Train Set:', X_train_lab.shape, y_train_lab.shape)
print('Unlabeled Train Set:', X_test_unlab.shape, y_test_unlab.shape)
# summarize test set size
print('Test Set:', X_test.shape, y_test.shape)

Running the example prepares the dataset and then summarizes the shape of each of the three portions.

The results confirm that we have a test dataset of 500 rows, a labeled training dataset of 250 rows, and 250 rows of unlabeled data.

Labeled Train Set: (250, 2) (250,)
Unlabeled Train Set: (250, 2) (250,)
Test Set: (500, 2) (500,)

A supervised learning algorithm will only have 250 rows from which to train a model.

A semi-supervised learning algorithm will have the 250 labeled rows as well as the 250 unlabeled rows that could be used in numerous ways to improve the labeled training dataset.

Next, we can establish a baseline in performance on the semi-supervised learning dataset using a supervised learning algorithm fit only on the labeled training data.

This is important because we would expect a semi-supervised learning algorithm to outperform a supervised learning algorithm fit on the labeled data alone. If this is not the case, then the semi-supervised learning algorithm does not have skill.

In this case, we will use a logistic regression algorithm fit on the labeled portion of the training dataset.

...
# define model
model = LogisticRegression()
# fit model on labeled dataset
model.fit(X_train_lab, y_train_lab)

The model can then be used to make predictions on the entire hold out test dataset and evaluated using classification accuracy.

...
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of evaluating a supervised learning algorithm on the semi-supervised learning dataset is listed below.

# baseline performance on the semi-supervised learning dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# define model
model = LogisticRegression()
# fit model on labeled dataset
model.fit(X_train_lab, y_train_lab)
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Running the algorithm fits the model on the labeled training dataset and evaluates it on the holdout dataset and prints the classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the algorithm achieved a classification accuracy of about 84.8 percent.

We would expect an effective semi-supervised learning algorithm to achieve better accuracy than this.

Accuracy: 84.800

Next, let’s explore how to apply the label propagation algorithm to the dataset.

Label Propagation for Semi-Supervised Learning

The Label Propagation algorithm is available in the scikit-learn Python machine learning library via the LabelPropagation class.

The model can be fit just like any other classification model by calling the fit() function and used to make predictions for new data via the predict() function.

...
# define model
model = LabelPropagation()
# fit model on training dataset
model.fit(..., ...)
# make predictions on hold out test set
yhat = model.predict(...)

Importantly, the training dataset provided to the fit() function must include labeled examples that are integer encoded (as per normal) and unlabeled examples marked with a label of -1.

The model will then determine a label for the unlabeled examples as part of fitting the model.

After the model is fit, the estimated labels for the labeled and unlabeled data in the training dataset are available via the “transduction_” attribute on the LabelPropagation class.

...
# get labels for entire training dataset data
tran_labels = model.transduction_

Now that we are familiar with how to use the Label Propagation algorithm in scikit-learn, let’s look at how we might apply it to our semi-supervised learning dataset.

First, we must prepare the training dataset.

We can concatenate the input data of the training dataset into a single array.

...
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))

We can then create a list of -1 values (marking the examples as unlabeled) for each row in the unlabeled portion of the training dataset.

...
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]

This list can then be concatenated with the labels from the labeled portion of the training dataset to correspond with the input array for the training dataset.

...
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))

We can now train the LabelPropagation model on the entire training dataset.

...
# define model
model = LabelPropagation()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)

Next, we can use the model to make predictions on the holdout dataset and evaluate the model using classification accuracy.

...
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of evaluating label propagation on the semi-supervised learning dataset is listed below.

# evaluate label propagation on the semi-supervised learning dataset
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelPropagation
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))
# define model
model = LabelPropagation()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Running the algorithm fits the model on the entire training dataset and evaluates it on the holdout dataset and prints the classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the label propagation model achieves a classification accuracy of about 85.6 percent, which is slightly higher than a logistic regression fit only on the labeled training dataset that achieved an accuracy of about 84.8 percent.

Accuracy: 85.600

So far, so good.

Another approach we can use with the semi-supervised model is to take the estimated labels for the training dataset and fit a supervised learning model.

Recall that we can retrieve the labels for the entire training dataset from the label propagation model as follows:

...
# get labels for entire training dataset data
tran_labels = model.transduction_

We can then use these labels along with all of the input data to train and evaluate a supervised learning algorithm, such as a logistic regression model.

The hope is that the supervised learning model fit on the entire training dataset would achieve even better performance than the semi-supervised learning model alone.

...
# define supervised learning model
model2 = LogisticRegression()
# fit supervised learning model on entire training dataset
model2.fit(X_train_mixed, tran_labels)
# make predictions on hold out test set
yhat = model2.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of using the estimated training set labels to train and evaluate a supervised learning model is listed below.

# evaluate logistic regression fit on label propagation for semi-supervised learning
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelPropagation
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))
# define model
model = LabelPropagation()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)
# get labels for entire training dataset data
tran_labels = model.transduction_
# define supervised learning model
model2 = LogisticRegression()
# fit supervised learning model on entire training dataset
model2.fit(X_train_mixed, tran_labels)
# make predictions on hold out test set
yhat = model2.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Running the algorithm fits the semi-supervised model on the entire training dataset, then fits a supervised learning model on the entire training dataset with inferred labels and evaluates it on the holdout dataset, printing the classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that this hierarchical approach of the semi-supervised model followed by supervised model achieves a classification accuracy of about 86.2 percent on the holdout dataset, even better than the semi-supervised learning used alone that achieved an accuracy of about 85.6 percent.

Accuracy: 86.200

Can you achieve better results by tuning the hyperparameters of the LabelPropagation model?
Let me know what you discover in the comments below.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

Papers

APIs

Articles

Summary

In this tutorial, you discovered how to apply the label propagation algorithm to a semi-supervised learning classification dataset.

Specifically, you learned:

  • An intuition for how the label propagation semi-supervised learning algorithm works.
  • How to develop a semi-supervised classification dataset and establish a baseline in performance with a supervised learning algorithm.
  • How to develop and evaluate a label propagation algorithm and use the model output to train a supervised learning algorithm.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Semi-Supervised Learning With Label Propagation appeared first on Machine Learning Mastery.

Multinomial Logistic Regression With Python

Multinomial logistic regression is an extension of logistic regression that adds native support for multi-class classification problems.

Logistic regression, by default, is limited to two-class classification problems. Some extensions like one-vs-rest can allow logistic regression to be used for multi-class classification problems, although they require that the classification problem first be transformed into multiple binary classification problems.

Instead, the multinomial logistic regression algorithm is an extension of the logistic regression model that involves changing the loss function to cross-entropy loss and the predicted probability distribution to a multinomial probability distribution, in order to natively support multi-class classification problems.

In this tutorial, you will discover how to develop multinomial logistic regression models in Python.

After completing this tutorial, you will know:

  • Multinomial logistic regression is an extension of logistic regression for multi-class classification.
  • How to develop and evaluate multinomial logistic regression and develop a final model for making predictions on new data.
  • How to tune the penalty hyperparameter for the multinomial logistic regression model.

Let’s get started.

Multinomial Logistic Regression With Python
Photo by Nicolas Rénac, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Multinomial Logistic Regression
  2. Evaluate Multinomial Logistic Regression Model
  3. Tune Penalty for Multinomial Logistic Regression

Multinomial Logistic Regression

Logistic regression is a classification algorithm.

It is intended for datasets that have numerical input variables and a categorical target variable that has two values or classes. Problems of this type are referred to as binary classification problems.

Logistic regression is designed for two-class problems, modeling the target using a binomial probability distribution function. The class labels are mapped to 1 for the positive class or outcome and 0 for the negative class or outcome. The fit model predicts the probability that an example belongs to class 1.
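
For reference, the tiny sketch below shows this binomial mapping: a weighted sum (activation) is passed through the logistic (sigmoid) function to give the probability of class 1. The weights, bias, and input row are made up purely for illustration.

# sketch: binomial logistic regression maps an activation to P(class=1)
from numpy import array, exp

def sigmoid(z):
	# logistic function squashes any real value into [0, 1]
	return 1.0 / (1.0 + exp(-z))

weights = array([0.4, -0.2, 0.1])   # made-up coefficients
bias = 0.05                          # made-up intercept
row = array([1.5, 2.0, -0.3])        # made-up input example
activation = weights.dot(row) + bias
print('P(class=1) = %.3f' % sigmoid(activation))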

By default, logistic regression cannot be used for classification tasks that have more than two class labels, so-called multi-class classification.

Instead, it requires modification to support multi-class classification problems.

One popular approach for adapting logistic regression to multi-class classification problems is to split the multi-class classification problem into multiple binary classification problems and fit a standard logistic regression model on each subproblem. Techniques of this type include one-vs-rest and one-vs-one wrapper models.
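
As a quick sketch of the wrapper approach, the scikit-learn OneVsRestClassifier class can wrap a binary logistic regression so that one model is fit per class; the synthetic dataset below is purely illustrative.

# sketch: one-vs-rest wrapper around binary logistic regression
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
# define an illustrative multi-class dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1)
# wrap a binary classifier so one model is fit per class label
model = OneVsRestClassifier(LogisticRegression())
model.fit(X, y)
# one underlying logistic regression model per class label
print('number of binary models: %d' % len(model.estimators_))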

An alternate approach involves changing the logistic regression model to support the prediction of multiple class labels directly. Specifically, to predict the probability that an input example belongs to each known class label.

The probability distribution that defines multi-class probabilities is called a multinomial probability distribution. A logistic regression model that is adapted to learn and predict a multinomial probability distribution is referred to as Multinomial Logistic Regression. Similarly, we might refer to default or standard logistic regression as Binomial Logistic Regression.

  • Binomial Logistic Regression: Standard logistic regression that predicts a binomial probability (i.e. for two classes) for each input example.
  • Multinomial Logistic Regression: Modified version of logistic regression that predicts a multinomial probability (i.e. more than two classes) for each input example.

If you are new to binomial and multinomial probability distributions, you may want to read the tutorial:

Changing logistic regression from binomial to multinomial probability requires a change to the loss function used to train the model (e.g. log loss to cross-entropy loss), and a change to the output from a single probability value to one probability for each class label.
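
To make the change in output concrete, the small sketch below (pure NumPy, not tied to any library's internals) converts a vector of per-class scores into a multinomial probability distribution using the softmax function, which is how one probability per class label can be produced; the scores are made up for illustration.

# sketch: turning per-class scores into a multinomial probability distribution
from numpy import array, exp

def softmax(scores):
	# subtract the max for numerical stability, then normalize to sum to 1
	e = exp(scores - scores.max())
	return e / e.sum()

# illustrative scores for three classes
scores = array([0.5, 2.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())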

Now that we are familiar with multinomial logistic regression, let’s look at how we might develop and evaluate multinomial logistic regression models in Python.

Evaluate Multinomial Logistic Regression Model

In this section, we will develop and evaluate a multinomial logistic regression model using the scikit-learn Python machine learning library.

First, we will define a synthetic multi-class classification dataset to use as the basis of the investigation. This is a generic dataset that you can easily replace with your own loaded dataset later.

The make_classification() function can be used to generate a dataset with a given number of rows, columns, and classes. In this case, we will generate a dataset with 1,000 rows, 10 input variables or columns, and 3 classes.

The example below generates the dataset and summarizes the shape of the arrays and the distribution of examples across the three classes.

# test classification dataset
from collections import Counter
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1)
# summarize the dataset
print(X.shape, y.shape)
print(Counter(y))

Running the example confirms that the dataset has 1,000 rows and 10 columns, as we expected, and that the rows are distributed approximately evenly across the three classes, with about 334 examples in each class.

(1000, 10) (1000,)
Counter({1: 334, 2: 334, 0: 332})

Logistic regression is supported in the scikit-learn library via the LogisticRegression class.

The LogisticRegression class can be configured for multinomial logistic regression by setting the “multi_class” argument to “multinomial” and the “solver” argument to a solver that supports multinomial logistic regression, such as “lbfgs“.

...
# define the multinomial logistic regression model
model = LogisticRegression(multi_class='multinomial', solver='lbfgs')

The multinomial logistic regression model will be fit using cross-entropy loss and will predict the integer-encoded class label for each input example.

Now that we are familiar with the multinomial logistic regression API, we can look at how we might evaluate a multinomial logistic regression model on our synthetic multi-class classification dataset.

It is a good practice to evaluate classification models using repeated stratified k-fold cross-validation. The stratification ensures that each cross-validation fold has approximately the same distribution of examples in each class as the whole training dataset.

We will use three repeats with 10 folds, which is a good default, and evaluate model performance using classification accuracy given that the classes are balanced.

The complete example of evaluating multinomial logistic regression for multi-class classification is listed below.

# evaluate multinomial logistic regression model
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1)
# define the multinomial logistic regression model
model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
# define the model evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model and collect the scores
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report the model performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean classification accuracy across all folds and repeats of the evaluation procedure.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the multinomial logistic regression model with default penalty achieved a mean classification accuracy of about 68.1 percent on our synthetic classification dataset.

Mean Accuracy: 0.681 (0.042)

We may decide to use the multinomial logistic regression model as our final model and make predictions on new data.

This can be achieved by first fitting the model on all available data, then calling the predict() function to make a prediction for new data.

The example below demonstrates how to make a prediction for new data using the multinomial logistic regression model.

# make a prediction with a multinomial logistic regression model
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1)
# define the multinomial logistic regression model
model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
# fit the model on the whole dataset
model.fit(X, y)
# define a single row of input data
row = [1.89149379, -0.39847585, 1.63856893, 0.01647165, 1.51892395, -3.52651223, 1.80998823, 0.58810926, -0.02542177, -0.52835426]
# predict the class label
yhat = model.predict([row])
# summarize the predicted class
print('Predicted Class: %d' % yhat[0])

Running the example first fits the model on all available data, then defines a row of data, which is provided to the model in order to make a prediction.

In this case, we can see that the model predicted the class “1” for the single row of data.

Predicted Class: 1

A benefit of multinomial logistic regression is that it can predict calibrated probabilities across all known class labels in the dataset.

This can be achieved by calling the predict_proba() function on the model.

The example below demonstrates how to predict a multinomial probability distribution for a new example using the multinomial logistic regression model.

# predict probabilities with a multinomial logistic regression model
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1)
# define the multinomial logistic regression model
model = LogisticRegression(multi_class='multinomial', solver='lbfgs')
# fit the model on the whole dataset
model.fit(X, y)
# define a single row of input data
row = [1.89149379, -0.39847585, 1.63856893, 0.01647165, 1.51892395, -3.52651223, 1.80998823, 0.58810926, -0.02542177, -0.52835426]
# predict a multinomial probability distribution
yhat = model.predict_proba([row])
# summarize the predicted probabilities
print('Predicted Probabilities: %s' % yhat[0])

Running the example first fits the model on all available data, then defines a row of data, which is provided to the model in order to predict class probabilities.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that class 1 (i.e. the array index is mapped to the class integer value) has the largest predicted probability of about 0.50.

Predicted Probabilities: [0.16470456 0.50297138 0.33232406]

Now that we are familiar with evaluating and using multinomial logistic regression models, let’s explore how we might tune the model hyperparameters.

Tune Penalty for Multinomial Logistic Regression

An important hyperparameter to tune for multinomial logistic regression is the penalty term.

This term imposes pressure on the model to seek smaller model weights. This is achieved by adding a weighted sum of the model coefficients to the loss function, encouraging the model to reduce the size of the weights along with the error while fitting the model.

A popular type of penalty is the L2 penalty that adds the (weighted) sum of the squared coefficients to the loss function. A weighting of the coefficients can be used that reduces the strength of the penalty from full penalty to a very slight penalty.
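
A rough sketch of what this penalized loss looks like is given below; the coefficient vector and the data-fit loss value are made up for illustration, and the 1/C weighting mirrors how scikit-learn scales the strength of the regularization term.

# sketch: L2-penalized loss = data loss + weighted sum of squared coefficients
from numpy import array
coefficients = array([0.5, -1.2, 2.0])   # made-up model weights
data_loss = 0.35                          # made-up cross-entropy on the training data
C = 1.0                                   # inverse regularization strength
penalized_loss = data_loss + (1.0 / C) * 0.5 * (coefficients ** 2).sum()
print('penalized loss: %.3f' % penalized_loss)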

By default, the LogisticRegression class uses the L2 penalty with a weighting of coefficients set to 1.0. The type of penalty can be set via the “penalty” argument with values of “l1“, “l2“, or “elasticnet” (i.e. both), although not all solvers support all penalty types. The weighting of the coefficients in the penalty can be set via the “C” argument.

...
# define the multinomial logistic regression model with a default penalty
LogisticRegression(multi_class='multinomial', solver='lbfgs', penalty='l2', C=1.0)

The “C” argument actually specifies the inverse of the regularization strength: the effective penalty is proportional to 1/C, so smaller C values mean a stronger penalty.

From the documentation:

C : float, default=1.0
Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.

This means that values close to 1.0 indicate very little penalty and values close to zero indicate a strong penalty; a C value of 1.0 applies only a light default penalty rather than no penalty at all.

  • C close to 1.0: Light penalty.
  • C close to 0.0: Strong penalty.

The penalty can be disabled by setting the “penalty” argument to the string “none“.

...
# define the multinomial logistic regression model without a penalty
LogisticRegression(multi_class='multinomial', solver='lbfgs', penalty='none')

Now that we are familiar with the penalty, let’s look at how we might explore the effect of different penalty values on the performance of the multinomial logistic regression model.

It is common to test penalty values on a log scale in order to quickly discover the scale of penalty that works well for a model. Once found, further tuning at that scale may be beneficial.

We will explore the L2 penalty with weighting values in the range from 0.0001 to 1.0 on a log scale, in addition to the no-penalty case (represented by 0.0 in the results).

The complete example of evaluating L2 penalty values for multinomial logistic regression is listed below.

# tune regularization for multinomial logistic regression
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot

# get the dataset
def get_dataset():
	X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1, n_classes=3)
	return X, y

# get a list of models to evaluate
def get_models():
	models = dict()
	for p in [0.0, 0.0001, 0.001, 0.01, 0.1, 1.0]:
		# create name for model
		key = '%.4f' % p
		# turn off penalty in some cases
		if p == 0.0:
			# no penalty in this case
			models[key] = LogisticRegression(multi_class='multinomial', solver='lbfgs', penalty='none')
		else:
			models[key] = LogisticRegression(multi_class='multinomial', solver='lbfgs', penalty='l2', C=p)
	return models

# evaluate a given model using cross-validation
def evaluate_model(model, X, y):
	# define the evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# evaluate the model
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	return scores

# define dataset
X, y = get_dataset()
# get the models to evaluate
models = get_models()
# evaluate the models and store results
results, names = list(), list()
for name, model in models.items():
	# evaluate the model and collect the scores
	scores = evaluate_model(model, X, y)
	# store the results
	results.append(scores)
	names.append(name)
	# summarize progress along the way
	print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Running the example reports the mean classification accuracy for each configuration along the way.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that a C value of 1.0 has the best score of about 77.7 percent, which is the same as using no penalty that achieves the same score.

>0.0000 0.777 (0.037)
>0.0001 0.683 (0.049)
>0.0010 0.762 (0.044)
>0.0100 0.775 (0.040)
>0.1000 0.774 (0.038)
>1.0000 0.777 (0.037)

A box and whisker plot is created for the accuracy scores for each configuration and all plots are shown side by side on a figure on the same scale for direct comparison.

In this case, we can see that the larger penalty we use on this dataset (i.e. the smaller the C value), the worse the performance of the model.

Box and Whisker Plots of L2 Penalty Configuration vs. Accuracy for Multinomial Logistic Regression

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Related Tutorials

APIs

Articles

Summary

In this tutorial, you discovered how to develop multinomial logistic regression models in Python.

Specifically, you learned:

  • Multinomial logistic regression is an extension of logistic regression for multi-class classification.
  • How to develop and evaluate multinomial logistic regression and develop a final model for making predictions on new data.
  • How to tune the penalty hyperparameter for the multinomial logistic regression model.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Multinomial Logistic Regression With Python appeared first on Machine Learning Mastery.

Semi-Supervised Learning With Label Spreading

Semi-supervised learning refers to algorithms that attempt to make use of both labeled and unlabeled training data.

Semi-supervised learning algorithms are unlike supervised learning algorithms that are only able to learn from labeled training data.

A popular approach to semi-supervised learning is to create a graph that connects examples in the training dataset and propagates known labels through the edges of the graph to label unlabeled examples. An example of this approach to semi-supervised learning is the label spreading algorithm for classification predictive modeling.

In this tutorial, you will discover how to apply the label spreading algorithm to a semi-supervised learning classification dataset.

After completing this tutorial, you will know:

  • An intuition for how the label spreading semi-supervised learning algorithm works.
  • How to develop a semi-supervised classification dataset and establish a baseline in performance with a supervised learning algorithm.
  • How to develop and evaluate a label spreading algorithm and use the model output to train a supervised learning algorithm.

Let’s get started.

Semi-Supervised Learning With Label Spreading
Photo by Jernej Furman, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Label Spreading Algorithm
  2. Semi-Supervised Classification Dataset
  3. Label Spreading for Semi-Supervised Learning

Label Spreading Algorithm

Label Spreading is a semi-supervised learning algorithm.

The algorithm was introduced by Dengyong Zhou, et al. in their 2003 paper titled “Learning With Local And Global Consistency.”

The intuition for the broader approach of semi-supervised learning is that nearby points in the input space should have the same label, and points in the same structure or manifold in the input space should have the same label.

The key to semi-supervised learning problems is the prior assumption of consistency, which means: (1) nearby points are likely to have the same label; and (2) points on the same structure (typically referred to as a cluster or a manifold) are likely to have the same label.

Learning With Local And Global Consistency, 2003.

Label spreading is inspired by a technique from experimental psychology called spreading activation networks.

This algorithm can be understood intuitively in terms of spreading activation networks from experimental psychology.

Learning With Local And Global Consistency, 2003.

Points in the dataset are connected in a graph based on their relative distances in the input space. The weight matrix of the graph is normalized symmetrically, much like spectral clustering. Information is passed through the graph, which is adapted to capture the structure in the input space.
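
The minimal sketch below shows this symmetric normalization, D^(-1/2) W D^(-1/2), computed over an RBF (Gaussian) affinity matrix for a handful of made-up points; it is a simplified illustration of the normalization step rather than the full label spreading algorithm.

# sketch: symmetric normalization of a graph weight matrix
from numpy import array, exp, diag, sqrt
# a handful of made-up 2d points
X = array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.1], [3.2, 2.9]])
# pairwise squared euclidean distances between all points
dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
# rbf (gaussian) affinity between all pairs of points
W = exp(-dist / (2 * 0.5 ** 2))
# degree matrix and symmetric normalization: D^(-1/2) W D^(-1/2)
D_inv_sqrt = diag(1.0 / sqrt(W.sum(axis=1)))
S = D_inv_sqrt.dot(W).dot(D_inv_sqrt)
print(S.round(3))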

The approach is very similar to the label propagation algorithm for semi-supervised learning.

Another similar label propagation algorithm was given by Zhou et al.: at each step a node i receives a contribution from its neighbors j (weighted by the normalized weight of the edge (i,j)), and an additional small contribution given by its initial value

— Page 196, Semi-Supervised Learning, 2006.
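
As a rough sketch of the iteration described in these sources (using their notation, not code from this tutorial), the normalized weight matrix S and the label matrix F are updated as follows, where W is the graph weight matrix, D is its diagonal degree matrix, Y holds the initial labels, and alpha is a value between 0 and 1 controlling how much of the initial labeling is retained:

  • S = D^(-1/2) * W * D^(-1/2)
  • F(t+1) = alpha * S * F(t) + (1 – alpha) * Y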

After convergence, each node is labeled with the class from which it received the most information.

Finally, the label of each unlabeled point is set to be the class of which it has received most information during the iteration process.

Learning With Local And Global Consistency, 2003.

Now that we are familiar with the label spreading algorithm, let’s look at how we might use it on a project. First, we must define a semi-supervised classification dataset.

Semi-Supervised Classification Dataset

In this section, we will define a dataset for semi-supervised learning and establish a baseline in performance on the dataset.

First, we can define a synthetic classification dataset using the make_classification() function.

We will define the dataset with two classes (binary classification), two input variables, and 1,000 examples.

...
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)

Next, we will split the dataset into train and test datasets with an equal 50-50 split (e.g. 500 rows in each).

...
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)

Finally, we will split the training dataset in half again into a portion that will have labels and a portion that we will pretend is unlabeled.

...
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)

Tying this together, the complete example of preparing the semi-supervised learning dataset is listed below.

# prepare semi-supervised learning dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# summarize training set size
print('Labeled Train Set:', X_train_lab.shape, y_train_lab.shape)
print('Unlabeled Train Set:', X_test_unlab.shape, y_test_unlab.shape)
# summarize test set size
print('Test Set:', X_test.shape, y_test.shape)

Running the example prepares the dataset and then summarizes the shape of each of the three portions.

The results confirm that we have a test dataset of 500 rows, a labeled training dataset of 250 rows, and 250 rows of unlabeled data.

Labeled Train Set: (250, 2) (250,)
Unlabeled Train Set: (250, 2) (250,)
Test Set: (500, 2) (500,)

A supervised learning algorithm will only have 250 rows from which to train a model.

A semi-supervised learning algorithm will have the 250 labeled rows as well as the 250 unlabeled rows that could be used in numerous ways to improve the labeled training dataset.

Next, we can establish a baseline in performance on the semi-supervised learning dataset using a supervised learning algorithm fit only on the labeled training data.

This is important because we would expect a semi-supervised learning algorithm to outperform a supervised learning algorithm fit on the labeled data alone. If this is not the case, then the semi-supervised learning algorithm does not have skill.

In this case, we will use a logistic regression algorithm fit on the labeled portion of the training dataset.

...
# define model
model = LogisticRegression()
# fit model on labeled dataset
model.fit(X_train_lab, y_train_lab)

The model can then be used to make predictions on the entire holdout test dataset and evaluated using classification accuracy.

...
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of evaluating a supervised learning algorithm on the semi-supervised learning dataset is listed below.

# baseline performance on the semi-supervised learning dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# define model
model = LogisticRegression()
# fit model on labeled dataset
model.fit(X_train_lab, y_train_lab)
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Running the example fits the model on the labeled training dataset, evaluates it on the holdout dataset, and prints the classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the algorithm achieved a classification accuracy of about 84.8 percent.

We would expect an effective semi-supervised learning algorithm to achieve a better accuracy than this.

Accuracy: 84.800

Next, let’s explore how to apply the label spreading algorithm to the dataset.

Label Spreading for Semi-Supervised Learning

The label spreading algorithm is available in the scikit-learn Python machine learning library via the LabelSpreading class.

The model can be fit just like any other classification model by calling the fit() function and used to make predictions for new data via the predict() function.

...
# define model
model = LabelSpreading()
# fit model on training dataset
model.fit(..., ...)
# make predictions on hold out test set
yhat = model.predict(...)

Importantly, the training dataset provided to the fit() function must include labeled examples that are ordinal encoded (as per normal) and unlabeled examples marked with a label of -1.

The model will then determine a label for the unlabeled examples as part of fitting the model.

After the model is fit, the estimated labels for the labeled and unlabeled data in the training dataset are available via the “transduction_” attribute on the LabelSpreading class.

...
# get labels for entire training dataset data
tran_labels = model.transduction_

Now that we are familiar with how to use the label spreading algorithm in scikit-learn, let’s look at how we might apply it to our semi-supervised learning dataset.

First, we must prepare the training dataset.

We can concatenate the input data of the training dataset into a single array.

...
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))

We can then create a list of -1 values (marking them as unlabeled), one for each row in the unlabeled portion of the training dataset.

...
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]

This list can then be concatenated with the labels from the labeled portion of the training dataset to correspond with the input array for the training dataset.

...
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))

We can now train the LabelSpreading model on the entire training dataset.

...
# define model
model = LabelSpreading()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)

Next, we can use the model to make predictions on the holdout dataset and evaluate the model using classification accuracy.

...
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of evaluating label spreading on the semi-supervised learning dataset is listed below.

# evaluate label spreading on the semi-supervised learning dataset
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelSpreading
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))
# define model
model = LabelSpreading()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)
# make predictions on hold out test set
yhat = model.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Running the example fits the model on the entire training dataset, evaluates it on the holdout dataset, and prints the classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the label spreading model achieves a classification accuracy of about 85.4 percent, which is slightly higher than a logistic regression fit only on the labeled training dataset that achieved an accuracy of about 84.8 percent.

Accuracy: 85.400

So far so good.

Another approach we can use with the semi-supervised model is to take the estimated labels for the training dataset and fit a supervised learning model.

Recall that we can retrieve the labels for the entire training dataset from the label spreading model as follows:

...
# get labels for entire training dataset data
tran_labels = model.transduction_

We can then use these labels, along with all of the input data, to train and evaluate a supervised learning algorithm, such as a logistic regression model.

The hope is that the supervised learning model fit on the entire training dataset would achieve even better performance than the semi-supervised learning model alone.

...
# define supervised learning model
model2 = LogisticRegression()
# fit supervised learning model on entire training dataset
model2.fit(X_train_mixed, tran_labels)
# make predictions on hold out test set
yhat = model2.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of using the estimated training set labels to train and evaluate a supervised learning model is listed below.

# evaluate logistic regression fit on label spreading for semi-supervised learning
from numpy import concatenate
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import LabelSpreading
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
# split train into labeled and unlabeled
X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)
# create the training dataset input
X_train_mixed = concatenate((X_train_lab, X_test_unlab))
# create "no label" for unlabeled data
nolabel = [-1 for _ in range(len(y_test_unlab))]
# recombine training dataset labels
y_train_mixed = concatenate((y_train_lab, nolabel))
# define model
model = LabelSpreading()
# fit model on training dataset
model.fit(X_train_mixed, y_train_mixed)
# get labels for entire training dataset data
tran_labels = model.transduction_
# define supervised learning model
model2 = LogisticRegression()
# fit supervised learning model on entire training dataset
model2.fit(X_train_mixed, tran_labels)
# make predictions on hold out test set
yhat = model2.predict(X_test)
# calculate score for test set
score = accuracy_score(y_test, yhat)
# summarize score
print('Accuracy: %.3f' % (score*100))

Running the example fits the semi-supervised model on the entire training dataset, then fits a supervised learning model on the entire training dataset with the inferred labels, evaluates it on the holdout dataset, and prints the classification accuracy.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that this hierarchical approach of the semi-supervised model followed by a supervised model achieves a classification accuracy of about 85.8 percent on the holdout dataset, slightly better than the semi-supervised learning algorithm used alone that achieved an accuracy of about 85.4 percent.

Accuracy: 85.800

Can you achieve better results by tuning the hyperparameters of the LabelSpreading model?
Let me know what you discover in the comments below.
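
As a starting point for that exploration, a minimal sketch of comparing a few LabelSpreading configurations is listed below; it reuses the X_train_mixed, y_train_mixed, X_test, and y_test variables from the complete example above, and the specific kernel and alpha values are illustrative assumptions rather than recommendations.

...
# compare a few LabelSpreading configurations (values are illustrative)
for kernel, alpha in [('rbf', 0.2), ('rbf', 0.5), ('knn', 0.2)]:
	# define and fit the model on the mixed labeled/unlabeled training data
	model = LabelSpreading(kernel=kernel, alpha=alpha)
	model.fit(X_train_mixed, y_train_mixed)
	# evaluate on the holdout test set
	yhat = model.predict(X_test)
	print('kernel=%s, alpha=%.1f, accuracy=%.3f' % (kernel, alpha, accuracy_score(y_test, yhat)))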

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

Papers

APIs

Articles

Summary

In this tutorial, you discovered how to apply the label spreading algorithm to a semi-supervised learning classification dataset.

Specifically, you learned:

  • An intuition for how the label spreading semi-supervised learning algorithm works.
  • How to develop a semi-supervised classification dataset and establish a baseline in performance with a supervised learning algorithm.
  • How to develop and evaluate a label spreading algorithm and use the model output to train a supervised learning algorithm.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Semi-Supervised Learning With Label Spreading appeared first on Machine Learning Mastery.

A Gentle Introduction to Machine Learning Modeling Pipelines


Applied machine learning is typically focused on finding a single model that performs well or best on a given dataset.

Effective use of the model will require appropriate preparation of the input data and hyperparameter tuning of the model.

Collectively, the linear sequence of steps required to prepare the data, tune the model, and transform the predictions is called the modeling pipeline. Modern machine learning libraries like the scikit-learn Python library allow this sequence of steps to be defined and used correctly (without data leakage) and consistently (during evaluation and prediction).

Nevertheless, working with modeling pipelines can be confusing to beginners as it requires a shift in perspective of the applied machine learning process.

In this tutorial, you will discover modeling pipelines for applied machine learning.

After completing this tutorial, you will know:

  • Applied machine learning is concerned with more than finding a good performing model; it also requires finding an appropriate sequence of data preparation steps and steps for the post-processing of predictions.
  • Collectively, the operations required to address a predictive modeling problem can be considered an atomic unit called a modeling pipeline.
  • Approaching applied machine learning through the lens of modeling pipelines requires a change in thinking from evaluating specific model configurations to sequences of transforms and algorithms.

Let’s get started.

A Gentle Introduction to Machine Learning Modeling Pipelines

A Gentle Introduction to Machine Learning Modeling Pipelines
Photo by Jay Huang, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Finding a Skillful Model Is Not Enough
  2. What Is a Modeling Pipeline?
  3. Implications of a Modeling Pipeline

Finding a Skillful Model Is Not Enough

Applied machine learning is the process of discovering the model that performs best for a given predictive modeling dataset.

In fact, it’s more than this.

In addition to discovering which model performs the best on your dataset, you must also discover:

  • Data transforms that best expose the unknown underlying structure of the problem to the learning algorithms.
  • Model hyperparameters that result in a good or best configuration of a chosen model.

There may also be additional considerations such as techniques that transform the predictions made by the model, like threshold moving or model calibration for predicted probabilities.

As such, it is common to think of applied machine learning as a large combinatorial search problem across data transforms, models, and model configurations.

This can be quite challenging in practice as it requires that the sequence of one or more data preparation schemes, the model, the model configuration, and any prediction transform schemes be evaluated consistently and correctly on a given test harness.

Although tricky, it may be manageable with a simple train-test split but becomes quite unmanageable when using k-fold cross-validation or even repeated k-fold cross-validation.

The solution is to use a modeling pipeline to keep everything straight.

What Is a Modeling Pipeline?

A pipeline is a linear sequence of data preparation options, modeling operations, and prediction transform operations.

It allows the sequence of steps to be specified, evaluated, and used as an atomic unit.

  • Pipeline: A linear sequence of data preparation and modeling steps that can be treated as an atomic unit.

To make the idea clear, let’s look at two simple examples:

The first example uses data normalization for the input variables and fits a logistic regression model:

  • [Input], [Normalization], [Logistic Regression], [Predictions]

The second example standardizes the input variables, applies RFE feature selection, and fits a support vector machine.

  • [Input], [Standardization], [RFE], [SVM], [Predictions]
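
As a rough sketch (not part of the tutorial's examples), these two pipelines might be expressed with the scikit-learn Pipeline class as follows, where the choice of scalers and the RFE configuration are illustrative assumptions:

# sketch: the two example pipelines expressed with the scikit-learn Pipeline class
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
# normalization followed by logistic regression
pipeline1 = Pipeline(steps=[('norm', MinMaxScaler()), ('model', LogisticRegression())])
# standardization, RFE feature selection, then a support vector machine
pipeline2 = Pipeline(steps=[('std', StandardScaler()), ('rfe', RFE(estimator=LogisticRegression(), n_features_to_select=5)), ('model', SVC())])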

You can imagine other examples of modeling pipelines.

As an atomic unit, the pipeline can be evaluated using a preferred resampling scheme such as a train-test split or k-fold cross-validation.

This is important for two main reasons:

  • Avoid data leakage.
  • Consistency and reproducibility.

A modeling pipeline avoids the most common type of data leakage where data preparation techniques, such as scaling input values, are applied to the entire dataset. This is data leakage because it shares knowledge of the test dataset (such as observations that contribute to a mean or maximum known value) with the training dataset, and in turn, may result in overly optimistic model performance.

Instead, data transforms must be prepared on the training dataset only, then applied to the training dataset, test dataset, validation dataset, and any other datasets that require the transform prior to being used with the model.

A modeling pipeline ensures that the sequence of data preparation operations performed is reproducible.

Without a modeling pipeline, the data preparation steps may be performed manually twice: once for evaluating the model and once for making predictions. Any changes to the sequence must be kept consistent in both cases, otherwise differences will impact the capability and skill of the model.

A pipeline ensures that the sequence of operations is defined once and is consistent when used for model evaluation or making predictions.

The Python scikit-learn machine learning library provides a machine learning modeling pipeline via the Pipeline class.
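
For example, a minimal sketch of evaluating a pipeline as a single atomic unit with k-fold cross-validation might look as follows (the synthetic dataset and the choice of scaler and model are illustrative assumptions):

# sketch: evaluate a modeling pipeline as an atomic unit with k-fold cross-validation
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
# define a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
# define the pipeline; the scaler is re-fit on each training fold, avoiding data leakage
pipeline = Pipeline(steps=[('norm', MinMaxScaler()), ('model', LogisticRegression())])
# evaluate the whole pipeline with 10-fold cross-validation
cv = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv)
print('Mean Accuracy: %.3f' % mean(scores))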

You can learn more about how to use this Pipeline API in this tutorial:

Implications of a Modeling Pipeline

The modeling pipeline is an important tool for machine learning practitioners.

Nevertheless, there are important implications that must be considered when using them.

The main confusion for beginners when using pipelines comes in understanding what the pipeline has learned or the specific configuration discovered by the pipeline.

For example, a pipeline may use a data transform that configures itself automatically, such as the RFECV technique for feature selection.

  • When evaluating a pipeline that uses an automatically-configured data transform, what configuration does it choose? or When fitting this pipeline as a final model for making predictions, what configuration did it choose?

The answer is, it doesn’t matter.

Another example is the use of hyperparameter tuning as the final step of the pipeline.

The grid search will be performed on the data provided by any prior transform steps in the pipeline and will then search for the best combination of hyperparameters for the model using that data, then fit a model with those hyperparameters on the data.
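
As a rough sketch (the hyperparameter grid and model are illustrative assumptions), such a pipeline might end in a grid search like this:

# sketch: a pipeline whose final step grid searches the model hyperparameters
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
# grid search over the regularization strength of the model
grid = GridSearchCV(LogisticRegression(), param_grid={'C': [0.01, 0.1, 1.0, 10.0]}, cv=3)
# standardize the data, then tune and fit the model on the transformed data
pipeline = Pipeline(steps=[('std', StandardScaler()), ('grid', grid)])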

  • When evaluating a pipeline that grid searches model hyperparameters, what configuration does it choose? or When fitting this pipeline as a final model for making predictions, what configuration did it choose?

The answer again is, it doesn’t matter.

The same answer applies when using a threshold-moving or probability calibration step at the end of the pipeline.

The reason is the same reason that we are not concerned about the specific internal structure or coefficients of the chosen model.

For example, when evaluating a logistic regression model, we don’t need to inspect the coefficients chosen on each k-fold cross-validation round in order to choose the model. Instead, we focus on its out-of-fold predictive skill.

Similarly, when using a logistic regression model as the final model for making predictions on new data, we do not need to inspect the coefficients chosen when fitting the model on the entire dataset before making predictions.

We can inspect and discover the coefficients used by the model as an exercise in analysis, but it does not impact the selection and use of the model.

This same answer generalizes when considering a modeling pipeline.

We are not concerned about which features may have been automatically selected by a data transform in the pipeline. We are also not concerned about which hyperparameters were chosen for the model when using a grid search as the final step in the modeling pipeline.

In all three cases: the single model, the pipeline with automatic feature selection, and the pipeline with a grid search, we are evaluating the “model” or “modeling pipeline” as an atomic unit.

The pipeline allows us as machine learning practitioners to move up one level of abstraction and be less concerned with the specific outcomes of the algorithms and more concerned with the capability of a sequence of procedures.

As such, we can focus on evaluating the capability of the algorithms on the dataset, not the product of the algorithms, i.e. the model. Once we have an estimate of the capability of the pipeline, we can apply it and be confident that we will get similar performance, on average.

It is a shift in thinking and may take some time to get used to.

It is also the philosophy behind modern AutoML (automatic machine learning) techniques that treat applied machine learning as a large combinatorial search problem.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered modeling pipelines for applied machine learning.

Specifically, you learned:

  • Applied machine learning is concerned with more than finding a good performing model; it also requires finding an appropriate sequence of data preparation steps and steps for the post-processing of predictions.
  • Collectively, the operations required to address a predictive modeling problem can be considered an atomic unit called a modeling pipeline.
  • Approaching applied machine learning through the lens of modeling pipelines requires a change in thinking from evaluating specific model configurations to sequences of transforms and algorithms.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Machine Learning Modeling Pipelines appeared first on Machine Learning Mastery.

Univariate Function Optimization in Python


How to Optimize a Function with One Variable?

Univariate function optimization involves finding the input to a function that results in the optimal output from an objective function.

This is a common procedure in machine learning when fitting a model with one parameter or tuning a model that has a single hyperparameter.

An efficient algorithm is required to solve optimization problems of this type that will find the best solution with the minimum number of evaluations of the objective function, given that each evaluation of the objective function could be computationally expensive, such as fitting and evaluating a model on a dataset.

This excludes expensive grid search and random search algorithms in favor of efficient algorithms like Brent’s method.

In this tutorial, you will discover how to perform univariate function optimization in Python.

After completing this tutorial, you will know:

  • Univariate function optimization involves finding an optimal input for an objective function that takes a single continuous argument.
  • How to perform univariate function optimization for an unconstrained convex function.
  • How to perform univariate function optimization for an unconstrained non-convex function.

Let’s get started.

Univariate Function Optimization in Python

Univariate Function Optimization in Python
Photo by Robert Haandrikman, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Univariate Function Optimization
  2. Convex Univariate Function Optimization
  3. Non-Convex Univariate Function Optimization

Univariate Function Optimization

We may need to find an optimal value of a function that takes a single parameter.

In machine learning, this may occur in many situations, such as:

  • Finding the coefficient of a model to fit to a training dataset.
  • Finding the value of a single hyperparameter that results in the best model performance.

This is called univariate function optimization.

We may be interested in the minimum outcome or maximum outcome of the function, although this can be simplified to minimization as a maximizing function can be made minimizing by adding a negative sign to all outcomes of the function.
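
For example, given some objective() function to be maximized (an assumed placeholder), a minimizing wrapper might look like this:

...
# convert a maximizing objective into a minimizing objective by negating it
def neg_objective(x):
	return -objective(x)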

There may or may not be limits on the inputs to the function, so-called unconstrained or constrained optimization, and we assume that small changes in input correspond to small changes in the output of the function, e.g. that it is smooth.

The function may or may not have a single optima, although we prefer that it does have a single optima and that the shape of the function looks like a large basin. If this is the case, we know we can sample the function at one point and find the path down to the minima of the function. Technically, this is referred to as a convex function for minimization (concave for maximization), and functions that don’t have this basin shape are referred to as non-convex.

  • Convex Target Function: There is a single optima and the shape of the target function leads to this optima.

Nevertheless, the target function is sufficiently complex that we don’t know the derivative, meaning we cannot just use calculus to analytically compute the minimum or maximum of the function where the gradient is zero. This is referred to as a function that is non-differentiable.

Although we might be able to sample the function with candidate values, we don’t know the input that will result in the best outcome. This may be because it is expensive to evaluate candidate solutions, for any of a number of reasons.

Therefore, we require an algorithm that efficiently samples input values to the function.

One approach to solving univariate function optimization problems is to use Brent’s method.

Brent’s method is an optimization algorithm that combines a bisecting algorithm (Dekker’s method) and inverse quadratic interpolation. It can be used for constrained and unconstrained univariate function optimization.

The Brent-Dekker method is an extension of the bisection method. It is a root-finding algorithm that combines elements of the secant method and inverse quadratic interpolation. It has reliable and fast convergence properties, and it is the univariate optimization algorithm of choice in many popular numerical optimization packages.

— Pages 49-51, Algorithms for Optimization, 2019.

Bisecting algorithms use a bracket (lower and upper) of input values and split up the input domain, bisecting it in order to locate where in the domain the optima is located, much like a binary search. Dekker’s method is one way this is achieved efficiently for a continuous domain.

Dekker’s method gets stuck on non-convex problems. Brent’s method modifies Dekker’s method to avoid getting stuck and also approximates the second derivative of the objective function (called the Secant Method) in an effort to accelerate the search.

As such, Brent’s method for univariate function optimization is generally preferred over most other univariate function optimization algorithms given its efficiency.

Brent’s method is available in Python via the minimize_scalar() SciPy function that takes the name of the function to be minimized. If your target function is constrained to a range, it can be specified via the “bounds” argument (together with the ‘bounded‘ method).

It returns an OptimizeResult object that is a dictionary containing the solution. Importantly, the ‘x‘ key summarizes the input for the optima, the ‘fun‘ key summarizes the function output for the optima, and the ‘nfev‘ summarizes the number of evaluations of the target function that were performed.

...
# minimize the function
result = minimize_scalar(objective, method='brent')
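
If the target function were constrained to a range, a bounded version might look as follows (the bounds shown are an illustrative assumption):

...
# minimize the function within fixed bounds
result = minimize_scalar(objective, bounds=(-10.0, 10.0), method='bounded')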

Now that we know how to perform univariate function optimization in Python, let’s look at some examples.

Convex Univariate Function Optimization

In this section, we will explore how to solve a convex univariate function optimization problem.

First, we can define a function that implements our function.

In this case, we will use a simple offset version of the x^2 function, i.e. a simple parabola (u-shape) function. It is a minimization objective function with an optima at x=-5.0.

# objective function
def objective(x):
	return (5.0 + x)**2.0

We can plot a coarse grid of this function with input values from -10 to 10 to get an idea of the shape of the target function.

The complete example is listed below.

# plot a convex target function
from numpy import arange
from matplotlib import pyplot

# objective function
def objective(x):
	return (5.0 + x)**2.0

# define range
r_min, r_max = -10.0, 10.0
# prepare inputs
inputs = arange(r_min, r_max, 0.1)
# compute targets
targets = [objective(x) for x in inputs]
# plot inputs vs target
pyplot.plot(inputs, targets, '--')
pyplot.show()

Running the example evaluates input values in our specified range using our target function and creates a plot of the function inputs to function outputs.

We can see the U-shape of the function and that the optima is at x=-5.0.

Line Plot of a Convex Objective Function

Line Plot of a Convex Objective Function

Note: in a real optimization problem, we would not be able to perform so many evaluations of the objective function so easily. This simple function is used for demonstration purposes so we can learn how to use the optimization algorithm.

Next, we can use the optimization algorithm to find the optima.

...
# minimize the function
result = minimize_scalar(objective, method='brent')

Once optimized, we can summarize the result, including the input and evaluation of the optima and the number of function evaluations required to locate the optima.

...
# summarize the result
opt_x, opt_y = result['x'], result['fun']
print('Optimal Input x: %.6f' % opt_x)
print('Optimal Output f(x): %.6f' % opt_y)
print('Total Evaluations n: %d' % result['nfev'])

Finally, we can plot the function again and mark the optima to confirm it was located in the place we expected for this function.

...
# define the range
r_min, r_max = -10.0, 10.0
# prepare inputs
inputs = arange(r_min, r_max, 0.1)
# compute targets
targets = [objective(x) for x in inputs]
# plot inputs vs target
pyplot.plot(inputs, targets, '--')
# plot the optima
pyplot.plot([opt_x], [opt_y], 's', color='r')
# show the plot
pyplot.show()

The complete example of optimizing an unconstrained convex univariate function is listed below.

# optimize convex objective function
from numpy import arange
from scipy.optimize import minimize_scalar
from matplotlib import pyplot

# objective function
def objective(x):
	return (5.0 + x)**2.0

# minimize the function
result = minimize_scalar(objective, method='brent')
# summarize the result
opt_x, opt_y = result['x'], result['fun']
print('Optimal Input x: %.6f' % opt_x)
print('Optimal Output f(x): %.6f' % opt_y)
print('Total Evaluations n: %d' % result['nfev'])
# define the range
r_min, r_max = -10.0, 10.0
# prepare inputs
inputs = arange(r_min, r_max, 0.1)
# compute targets
targets = [objective(x) for x in inputs]
# plot inputs vs target
pyplot.plot(inputs, targets, '--')
# plot the optima
pyplot.plot([opt_x], [opt_y], 's', color='r')
# show the plot
pyplot.show()

Running the example first solves the optimization problem and reports the result.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the optima was located after 10 evaluations of the objective function with an input of -5.0, achieving an objective function value of 0.0.

Optimal Input x: -5.000000
Optimal Output f(x): 0.000000
Total Evaluations n: 10

A plot of the function is created again and this time, the optima is marked as a red square.

 

Line Plot of a Convex Objective Function with Optima Marked

Line Plot of a Convex Objective Function with Optima Marked

Non-Convex Univariate Function Optimization

A non-convex function is one that does not resemble a basin, meaning that it may have more than one hill or valley.

This can make it more challenging to locate the global optima as the multiple hills and valleys can cause the search to get stuck and report a false or local optima instead.

We can define a non-convex univariate function as follows.

# objective function
def objective(x):
	return (x - 2.0) * x * (x + 2.0)**2.0

We can sample this function and create a line plot of input values to objective values.

The complete example is listed below.

# plot a non-convex univariate function
from numpy import arange
from matplotlib import pyplot

# objective function
def objective(x):
	return (x - 2.0) * x * (x + 2.0)**2.0

# define range
r_min, r_max = -3.0, 2.5
# prepare inputs
inputs = arange(r_min, r_max, 0.1)
# compute targets
targets = [objective(x) for x in inputs]
# plot inputs vs target
pyplot.plot(inputs, targets, '--')
pyplot.show()

Running the example evaluates input values in our specified range using our target function and creates a plot of the function inputs to function outputs.

We can see a function with one false optima around -2.0 and a global optima around 1.2.

Line Plot of a Non-Convex Objective Function

Line Plot of a Non-Convex Objective Function

Note: in a real optimization problem, we would not be able to perform so many evaluations of the objective function so easily. This simple function is used for demonstration purposes so we can learn how to use the optimization algorithm.

Next, we can use the optimization algorithm to find the optima.

As before, we can call the minimize_scalar() function to optimize the function, then summarize the result and plot the optima on a line plot.

The complete example of optimization of an unconstrained non-convex univariate function is listed below.

# optimize non-convex objective function
from numpy import arange
from scipy.optimize import minimize_scalar
from matplotlib import pyplot

# objective function
def objective(x):
	return (x - 2.0) * x * (x + 2.0)**2.0

# minimize the function
result = minimize_scalar(objective, method='brent')
# summarize the result
opt_x, opt_y = result['x'], result['fun']
print('Optimal Input x: %.6f' % opt_x)
print('Optimal Output f(x): %.6f' % opt_y)
print('Total Evaluations n: %d' % result['nfev'])
# define the range
r_min, r_max = -3.0, 2.5
# prepare inputs
inputs = arange(r_min, r_max, 0.1)
# compute targets
targets = [objective(x) for x in inputs]
# plot inputs vs target
pyplot.plot(inputs, targets, '--')
# plot the optima
pyplot.plot([opt_x], [opt_y], 's', color='r')
# show the plot
pyplot.show()

Running the example first solves the optimization problem and reports the result.

In this case, we can see that the optima was located after 15 evaluations of the objective function with an input of about 1.28, achieving an objective function value of about -9.91.

Optimal Input x: 1.280776
Optimal Output f(x): -9.914950
Total Evaluations n: 15

A plot of the function is created again, and this time, the optima is marked as a red square.

We can see that the optimization was not deceived by the false optima and successfully located the global optima.

Line Plot of a Non-Convex Objective Function with Optima Marked

Line Plot of a Non-Convex Objective Function with Optima Marked

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

APIs

Articles

Summary

In this tutorial, you discovered how to perform univariate function optimization in Python.

Specifically, you learned:

  • Univariate function optimization involves finding an optimal input for an objective function that takes a single continuous argument.
  • How to perform univariate function optimization for an unconstrained convex function.
  • How to perform univariate function optimization for an unconstrained non-convex function.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Univariate Function Optimization in Python appeared first on Machine Learning Mastery.

3 Books on Optimization for Machine Learning


Optimization is a field of mathematics concerned with finding a good or best solution among many candidates.

It is an important foundational topic required in machine learning as most machine learning algorithms are fit on historical data using an optimization algorithm. Additionally, broader problems, such as model selection and hyperparameter tuning, can also be framed as an optimization problem.

Although having some background in optimization is critical for machine learning practitioners, it can be a daunting topic given that it is often described using highly mathematical language.

In this post, you will discover top books on optimization that will be helpful to machine learning practitioners.

Let’s get started.

Books on Optimization for Machine Learning

Books on Optimization for Machine Learning
Photo by Patrick Alexander, some rights reserved.

Overview

The field of optimization is enormous as it touches many other fields of study.

As such, there are hundreds of books on the topic, and most are textbooks filled with math and proofs. This is fair enough given that it is a highly mathematical subject.

Nevertheless, there are books that provide a more approachable description of optimization algorithms.

Not all optimization algorithms are relevant to machine learning; instead, it is useful to focus on a small subset of algorithms.

Frankly, it is hard to group optimization algorithms as there are many concerns. Nevertheless, it is important to have some idea of the optimization that underlies simpler algorithms, such as linear regression and logistic regression (e.g. convex optimization, least squares, Newton methods, etc.), and neural networks (first-order methods, gradient descent, etc.).

These are foundational optimization algorithms covered in most optimization textbooks.

Not all optimization problems in machine learning are well behaved, such as optimization used in AutoML and hyperparameter tuning. Therefore, knowledge of stochastic optimization algorithms is required (simulated annealing, genetic algorithms, particle swarm, etc.). Although these are optimization algorithms, they are also a type of learning algorithm referred to as biologically inspired computation or computational intelligence.

Therefore, we will take a look at both books that cover classical optimization algorithms as well as books on alternate optimization algorithms.

In fact, the first book we will look at covers both types of algorithms, and much more.

Algorithms for Optimization

This book was written by Mykel Kochenderfer and Tim Wheeler and was published in 2019.

Algorithms for Optimization

Algorithms for Optimization

This book might be one of the very few textbooks that I’ve seen that broadly covers the field of optimization techniques relevant to modern machine learning.

This book provides a broad introduction to optimization with a focus on practical algorithms for the design of engineering systems. We cover a wide variety of optimization topics, introducing the underlying mathematical problem formulations and the algorithms for solving them. Figures, examples, and exercises are provided to convey the intuition behind the various approaches.

— Page xiiix, Algorithms for Optimization, 2019.

Importantly the algorithms range from univariate methods (bisection, line search, etc.) to first-order methods (gradient descent), second-order methods (Newton’s method), direct methods (pattern search), stochastic methods (simulated annealing), and population methods (genetic algorithms, particle swarm), and so much more.

It includes both technical descriptions of algorithms with references and worked examples of algorithms in Julia. It’s a shame the examples are not in Python as this would make the book near perfect in my eyes.

The complete table of contents for the book is listed below.

  • Chapter 01: Introduction
  • Chapter 02: Derivatives and Gradients
  • Chapter 03: Bracketing
  • Chapter 04: Local Descent
  • Chapter 05: First-Order Methods
  • Chapter 06: Second-Order Methods
  • Chapter 07: Direct Methods
  • Chapter 08: Stochastic Methods
  • Chapter 09: Population Methods
  • Chapter 10: Constraints
  • Chapter 11: Linear Constrained Optimization
  • Chapter 12: Multiobjective Optimization
  • Chapter 13: Sampling Plans
  • Chapter 14: Surrogate Models
  • Chapter 15: Probabilistic Surrogate Models
  • Chapter 16: Surrogate Optimization
  • Chapter 17: Optimization under Uncertainty
  • Chapter 18: Uncertainty Propagation
  • Chapter 19: Discrete Optimization
  • Chapter 20: Expression Optimization
  • Chapter 21: Multidisciplinary Optimization

I like this book a lot; it is full of valuable practical advice. I highly recommend it!

Learn More:

Numerical Optimization

This book was written by Jorge Nocedal and Stephen Wright and was published in 2006.

Numerical Optimization

Numerical Optimization

This book is focused on the math and theory of the optimization algorithms presented and does cover many of the foundational techniques used by common machine learning algorithms. It may be a little too heavy for the average practitioner.

The book is intended as a textbook for graduate students in mathematical subjects.

We intend that this book will be used in graduate-level courses in optimization, as offered in engineering, operations research, computer science, and mathematics departments.

— Page xviii, Numerical Optimization, 2006.

Even though it is highly mathematical, the descriptions of the algorithms are precise and may provide a useful alternative description to complement the other books listed.

The complete table of contents for the book is listed below.

  • Chapter 01: Introduction
  • Chapter 02: Fundamentals of Unconstrained Optimization
  • Chapter 03: Line Search Methods
  • Chapter 04: Trust-Region Methods
  • Chapter 05: Conjugate Gradient Methods
  • Chapter 06: Quasi-Newton Methods
  • Chapter 07: Large-Scale Unconstrained Optimization
  • Chapter 08: Calculating Derivatives
  • Chapter 09: Derivative-Free Optimization
  • Chapter 10: Least-Squares Problems
  • Chapter 11: Nonlinear Equations
  • Chapter 12: Theory of Constrained Optimization
  • Chapter 13: Linear Programming: The Simplex Method
  • Chapter 14: Linear Programming: Interior-Point Methods
  • Chapter 15: Fundamentals of Algorithms for Nonlinear Constrained Optimization
  • Chapter 16: Quadratic Programming
  • Chapter 17: Penalty and Augmented Lagrangian Methods
  • Chapter 18: Sequential Quadratic Programming
  • Chapter 19: Interior-Point Methods for Nonlinear Programming

It’s a solid textbook on optimization.

Learn More:

If you do prefer the theoretical approach to the subject, another widely used mathematical book on optimization is “Convex Optimization” written by Stephen Boyd and Lieven Vandenberghe and published in 2004.

Computational Intelligence: An Introduction

This book was written by Andries Engelbrecht and published in 2007.

Computational Intelligence: An Introduction

Computational Intelligence: An Introduction

This book provides an excellent overview of the field of nature-inspired optimization algorithms, also referred to as computational intelligence. This includes fields such as evolutionary computation and swarm intelligence.

This book is far less mathematical than the previous textbooks and is more focused on the metaphor of the inspired system and how to configure and use the specific algorithms with lots of pseudocode explanations.

While the material is introductory in nature, it does not shy away from details, and does present the mathematical foundations to the interested reader. The intention of the book is not to provide thorough attention to all computational intelligence paradigms and algorithms, but to give an overview of the most popular and frequently used models.

— Page xxix, Computational Intelligence: An Introduction, 2007.

Algorithms like genetic algorithms, genetic programming, evolutionary strategies, differential evolution, and particle swarm optimization are useful to know for machine learning model hyperparameter tuning and perhaps even model selection. They also form the core of many modern AutoML systems.

The complete table of contents for the book is listed below.

  • Part I Introduction
    • Chapter 01: Introduction to Computational Intelligence
  • Part II Artificial Neural Networks
    • Chapter 02: The Artificial Neuron
    • Chapter 03: Supervised Learning Neural Networks
    • Chapter 04: Unsupervised Learning Neural Networks
    • Chapter 05: Radial Basis Function Networks
    • Chapter 06: Reinforcement Learning
    • Chapter 07: Performance Issues (Supervised Learning)
  • Part III Evolutionary Computation
    • Chapter 08: Introduction to Evolutionary Computation
    • Chapter 09: Genetic Algorithms
    • Chapter 10: Genetic Programming
    • Chapter 11: Evolutionary Programming
    • Chapter 12: Evolution Strategies
    • Chapter 13: Differential Evolution
    • Chapter 14: Cultural Algorithms
    • Chapter 15: Coevolution
  • Part IV Computational Swarm Intelligence
    • Chapter 16: Particle Swarm Optimization
    • Chapter 17: Ant Algorithms
  • Part V Artificial Immune Systems
    • Chapter 18: Natural Immune System
    • Chapter 19: Artificial Immune Models
  • Part VI Fuzzy Systems
    • Chapter 20: Fuzzy Sets
    • Chapter 21: Fuzzy Logic and Reasoning

I’m a fan of this book and recommend it.

Learn More:

Summary

In this post, you discovered books on optimization algorithms that are helpful to know for applied machine learning.

Did I miss a good book on optimization?
Let me know in the comments below.

Have you read any of the books listed?
Let me know what you think of it in the comments.

The post 3 Books on Optimization for Machine Learning appeared first on Machine Learning Mastery.

Code Adam Gradient Descent Optimization From Scratch


Gradient descent is an optimization algorithm that follows the negative gradient of an objective function in order to locate the minimum of the function.

A limitation of gradient descent is that a single step size (learning rate) is used for all input variables. Extensions to gradient descent like AdaGrad and RMSProp update the algorithm to use a separate step size for each input variable but may result in a step size that rapidly decreases to very small values.

The Adaptive Moment Estimation algorithm, or Adam for short, is an extension to gradient descent and a natural successor to techniques like AdaGrad and RMSProp that automatically adapts a learning rate for each input variable for the objective function and further smooths the search process by using an exponentially decreasing moving average of the gradient to make updates to variables.

In this tutorial, you will discover how to develop gradient descent with Adam optimization algorithm from scratch.

After completing this tutorial, you will know:

  • Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
  • Gradient descent can be updated to use an automatically adaptive step size for each input variable using a decaying average of partial derivatives, called Adam.
  • How to implement the Adam optimization algorithm from scratch and apply it to an objective function and evaluate the results.

Let’s get started.

Gradient Descent Optimization With Adam From Scratch

Gradient Descent Optimization With Adam From Scratch
Photo by Don Graham, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Gradient Descent
  2. Adam Optimization Algorithm
  3. Gradient Descent With Adam
    1. Two-Dimensional Test Problem
    2. Gradient Descent Optimization With Adam
    3. Visualization of Adam

Gradient Descent

Gradient descent is an optimization algorithm.

It is technically referred to as a first-order optimization algorithm as it explicitly makes use of the first-order derivative of the target objective function.

  • First-order methods rely on gradient information to help direct the search for a minimum …

— Page 69, Algorithms for Optimization, 2019.

The first-order derivative, or simply the “derivative,” is the rate of change or slope of the target function at a specific point, e.g. for a specific input.

If the target function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is referred to generally as the gradient.

  • Gradient: First-order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the target function for a specific input.

Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function.

The gradient descent algorithm requires a target function that is being optimized and the derivative function for the objective function. The target function f() returns a score for a given set of inputs, and the derivative function f'() gives the derivative of the target function for a given set of inputs.

The gradient descent algorithm requires a starting point (x) in the problem, such as a randomly selected point in the input space.

The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimizing the target function.

A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the target function.

  • x(t) = x(t-1) – step_size * f'(x(t-1))

The steeper the objective function at a given point, the larger the magnitude of the gradient and, in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.

  • Step Size (alpha): Hyperparameter that controls how far to move in the search space against the gradient each iteration of the algorithm.

If the step size is too small, the movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.
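
As a rough sketch (assuming an objective function and a derivative() function for it are available), the basic gradient descent update for a single variable might look like this:

...
# sketch of the basic gradient descent update for a single variable
x = 1.0
step_size = 0.1
for t in range(100):
	# calculate the gradient at the current point
	gradient = derivative(x)
	# take a step against the gradient
	x = x - step_size * gradient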

Now that we are familiar with the gradient descent optimization algorithm, let’s take a look at the Adam algorithm.

Adam Optimization Algorithm

The Adaptive Moment Estimation algorithm, or Adam for short, is an extension to the gradient descent optimization algorithm.

The algorithm was described in the 2014 paper by Diederik Kingma and Jimmy Lei Ba titled “Adam: A Method for Stochastic Optimization.”

Adam is designed to accelerate the optimization process, e.g. decrease the number of function evaluations required to reach the optima, or to improve the capability of the optimization algorithm, e.g. result in a better final result.

This is achieved by calculating a step size for each input parameter that is being optimized. Importantly, each step size is automatically adapted throughout the search process based on the gradients (partial derivatives) encountered for each variable.

We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients; the name Adam is derived from adaptive moment estimation

Adam: A Method for Stochastic Optimization

This involves maintaining a first and second moment of the gradient, e.g. an exponentially decaying mean gradient (first moment) and variance (second moment) for each input variable.

The moving averages themselves are estimates of the 1st moment (the mean) and the 2nd raw moment (the uncentered variance) of the gradient.

Adam: A Method for Stochastic Optimization

Let’s step through each element of the algorithm.

First, we must maintain a first moment vector and a second moment vector (an exponentially weighted average of the squared gradient) for each parameter being optimized as part of the search, referred to as m and v (really the Greek letter nu) respectively. They are initialized to 0.0 at the start of the search.

  • m = 0
  • v = 0

The algorithm is executed iteratively over time t starting at t=1, and each iteration involves calculating a new set of parameter values x, e.g. going from x(t-1) to x(t).

It is perhaps easy to understand the algorithm if we focus on updating one parameter, which generalizes to updating all parameters via vector operations.

First, the gradient (partial derivatives) are calculated for the current time step.

  • g(t) = f'(x(t-1))

Next, the first moment is updated using the gradient and a hyperparameter beta1.

  • m(t) = beta1 * m(t-1) + (1 – beta1) * g(t)

Then the second moment is updated using the squared gradient and a hyperparameter beta2.

  • v(t) = beta2 * v(t-1) + (1 – beta2) * g(t)^2

The first and second moments are biased because they are initialized with zero values.

… these moving averages are initialized as (vectors of) 0’s, leading to moment estimates that are biased towards zero, especially during the initial timesteps, and especially when the decay rates are small (i.e. the betas are close to 1). The good news is that this initialization bias can be easily counteracted, resulting in bias-corrected estimates …

Adam: A Method for Stochastic Optimization

Next, the first and second moments are bias-corrected, starting with the first moment:

  • mhat(t) = m(t) / (1 – beta1(t))

And then the second moment:

  • vhat(t) = v(t) / (1 – beta2(t))

Note, beta1(t) and beta2(t) refer to the beta1 and beta2 hyperparameters that are decayed on a schedule over the iterations of the algorithm. A static decay schedule can be used, although the paper recommends the following:

  • beta1(t) = beta1^t
  • beta2(t) = beta2^t

Finally, we can calculate the value for the parameter for this iteration.

  • x(t) = x(t-1) – alpha * mhat(t) / (sqrt(vhat(t)) + eps)

Where alpha is the step size hyperparameter, eps is a small value (epsilon) such as 1e-8 that ensures we do not encounter a divide by zero error, and sqrt() is the square root function.

Note, a more efficient reordering of the update rule listed in the paper can be used:

  • alpha(t) = alpha * sqrt(1 – beta2(t)) / (1 – beta1(t))
  • x(t) = x(t-1) – alpha(t) * m(t) / (sqrt(v(t)) + eps)

To review, there are three hyperparameters for the algorithm; they are:

  • alpha: Initial step size (learning rate), a typical value is 0.001.
  • beta1: Decay factor for the first moment, a typical value is 0.9.
  • beta2: Decay factor for the second moment (squared gradient), a typical value is 0.999.

And that’s it.
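
To make these equations concrete, the minimal worked example below (an illustrative sketch, not part of the tutorial code) performs a single Adam update for one parameter of f(x) = x^2, starting at x = 1.0 and using the typical hyperparameter values listed above.

# worked example of a single Adam update on f(x) = x^2 (illustrative sketch)
from math import sqrt

alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
x = 1.0          # current parameter value x(t-1)
m, v = 0.0, 0.0  # first and second moments, initialized to zero
t = 1            # first iteration
g = 2.0 * x                                # g(t), the gradient of x^2
m = beta1 * m + (1.0 - beta1) * g          # m(t) = 0.2
v = beta2 * v + (1.0 - beta2) * g**2       # v(t) = 0.004
mhat = m / (1.0 - beta1**t)                # bias-corrected first moment = 2.0
vhat = v / (1.0 - beta2**t)                # bias-corrected second moment = 4.0
x = x - alpha * mhat / (sqrt(vhat) + eps)  # new value, approximately 0.999
print(x)

Because the moments start at zero, the bias correction rescales m and v on the first iteration so that the very first step has a sensible size of roughly alpha.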

For the full derivation of the Adam algorithm, I recommend reading the paper.

Next, let’s look at how we might implement the algorithm from scratch in Python.

Gradient Descent With Adam

In this section, we will explore how to implement the gradient descent optimization algorithm with Adam.

Two-Dimensional Test Problem

First, let’s define an optimization function.

We will use a simple two-dimensional function that squares the input of each dimension and define the range of valid inputs from -1.0 to 1.0.

The objective() function below implements this function.

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

We can create a three-dimensional plot of the objective function to get a feeling for the curvature of the response surface.

The complete example of plotting the objective function is listed below.

# 3d plot of the test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.gca(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()

Running the example creates a three-dimensional surface plot of the objective function.

We can see the familiar bowl shape with the global minima at f(0, 0) = 0.

Three-Dimensional Plot of the Test Objective Function

We can also create a two-dimensional plot of the function. This will be helpful later when we want to plot the progress of the search.

The example below creates a contour plot of the objective function.

# contour plot of the test function
from numpy import asarray
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# show the plot
pyplot.show()

Running the example creates a two-dimensional contour plot of the objective function.

We can see the bowl shape compressed to contours shown with a color gradient. We will use this plot to plot the specific points explored during the progress of the search.

Two-Dimensional Contour Plot of the Test Objective Function

Now that we have a test objective function, let’s look at how we might implement the Adam optimization algorithm.

Gradient Descent Optimization With Adam

We can apply the gradient descent with Adam to the test problem.

First, we need a function that calculates the derivative for this function.

  • f(x) = x^2
  • f'(x) = x * 2

The derivative of x^2 is x * 2 in each dimension. The derivative() function implements this below.

# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])

Next, we can implement gradient descent optimization.

First, we can select a random point in the bounds of the problem as a starting point for the search.

This assumes we have an array that defines the bounds of the search with one row for each dimension and the first column defines the minimum and the second column defines the maximum of the dimension.

...
# generate an initial point
x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
score = objective(x[0], x[1])

Next, we need to initialize the first and second moments to zero.

...
# initialize first and second moments
m = [0.0 for _ in range(bounds.shape[0])]
v = [0.0 for _ in range(bounds.shape[0])]

We then run a fixed number of iterations of the algorithm defined by the “n_iter” hyperparameter.

...
# run iterations of gradient descent
for t in range(n_iter):
	...

The first step is to calculate the gradient g(t) for the current solution using the derivative() function.

...
# calculate gradient g(t)
g = derivative(x[0], x[1])

Next, we need to perform the Adam update calculations. We will perform these calculations one variable at a time using an imperative programming style for readability.

In practice, I recommend using NumPy vector operations for efficiency.

...
# build a solution one variable at a time
for i in range(x.shape[0]):
	...

First, we need to calculate the first moment.

...
# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]

Then the second moment.

...
# v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2

Then the bias correction for the first and second moments.

...
# mhat(t) = m(t) / (1 - beta1(t))
mhat = m[i] / (1.0 - beta1**(t+1))
# vhat(t) = v(t) / (1 - beta2(t))
vhat = v[i] / (1.0 - beta2**(t+1))

Then finally the updated variable value.

...
# x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)

This is then repeated for each parameter that is being optimized.

At the end of the iteration we can evaluate the new parameter values and report the performance of the search.

...
# evaluate candidate point
score = objective(x[0], x[1])
# report progress
print('>%d f(%s) = %.5f' % (t, x, score))

We can tie all of this together into a function named adam() that takes the names of the objective and derivative functions as well as the algorithm hyperparameters, and returns the best solution found at the end of the search and its evaluation.

This complete function is listed below.

# gradient descent algorithm with adam
def adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8):
	# generate an initial point
	x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	score = objective(x[0], x[1])
	# initialize first and second moments
	m = [0.0 for _ in range(bounds.shape[0])]
	v = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent updates
	for t in range(n_iter):
		# calculate gradient g(t)
		g = derivative(x[0], x[1])
		# build a solution one variable at a time
		for i in range(x.shape[0]):
			# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
			m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
			# v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
			v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2
			# mhat(t) = m(t) / (1 - beta1(t))
			mhat = m[i] / (1.0 - beta1**(t+1))
			# vhat(t) = v(t) / (1 - beta2(t))
			vhat = v[i] / (1.0 - beta2**(t+1))
			# x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
			x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)
		# evaluate candidate point
		score = objective(x[0], x[1])
		# report progress
		print('>%d f(%s) = %.5f' % (t, x, score))
	return [x, score]

Note: we have intentionally used lists and imperative coding style instead of vectorized operations for readability. Feel free to adapt the implementation to a vectorized implementation with NumPy arrays for better performance.
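
For example, a vectorized version of the function might look like the sketch below, which replaces the per-variable Python loop with NumPy array operations (this is an illustrative rewrite under the same assumptions as the function above, not code from the tutorial).

# vectorized version of the adam() function using NumPy array operations (illustrative)
from numpy import zeros
from numpy import sqrt
from numpy.random import rand

def adam_vectorized(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8):
	# generate an initial point within the bounds
	x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# initialize first and second moment vectors
	m = zeros(bounds.shape[0])
	v = zeros(bounds.shape[0])
	# run the gradient descent updates
	for t in range(n_iter):
		# calculate the gradient for all variables at once
		g = derivative(x[0], x[1])
		# update both moments with array operations
		m = beta1 * m + (1.0 - beta1) * g
		v = beta2 * v + (1.0 - beta2) * g**2
		# bias-correct the moments
		mhat = m / (1.0 - beta1**(t+1))
		vhat = v / (1.0 - beta2**(t+1))
		# update all parameters in a single vectorized step
		x = x - alpha * mhat / (sqrt(vhat) + eps)
	return [x, objective(x[0], x[1])]

Note that NumPy’s sqrt() is used here so that the square root is applied elementwise to the array of second moments.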

We can then define our hyperparameters and call the adam() function to optimize our test objective function.

In this case, we will use 60 iterations of the algorithm with an initial step size of 0.02 and beta1 and beta2 values of 0.8 and 0.999 respectively. These hyperparameter values were found after a little trial and error.

...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 60
# step size
alpha = 0.02
# factor for average gradient
beta1 = 0.8
# factor for average squared gradient
beta2 = 0.999
# perform the gradient descent search with adam
best, score = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
print('Done!')
print('f(%s) = %f' % (best, score))

Tying all of this together, the complete example of gradient descent optimization with Adam is listed below.

# gradient descent optimization with adam for a two-dimensional test function
from math import sqrt
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adam
def adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8):
	# generate an initial point
	x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	score = objective(x[0], x[1])
	# initialize first and second moments
	m = [0.0 for _ in range(bounds.shape[0])]
	v = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent updates
	for t in range(n_iter):
		# calculate gradient g(t)
		g = derivative(x[0], x[1])
		# build a solution one variable at a time
		for i in range(x.shape[0]):
			# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
			m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
			# v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
			v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2
			# mhat(t) = m(t) / (1 - beta1(t))
			mhat = m[i] / (1.0 - beta1**(t+1))
			# vhat(t) = v(t) / (1 - beta2(t))
			vhat = v[i] / (1.0 - beta2**(t+1))
			# x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
			x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)
		# evaluate candidate point
		score = objective(x[0], x[1])
		# report progress
		print('>%d f(%s) = %.5f' % (t, x, score))
	return [x, score]

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 60
# step size
alpha = 0.02
# factor for average gradient
beta1 = 0.8
# factor for average squared gradient
beta2 = 0.999
# perform the gradient descent search with adam
best, score = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
print('Done!')
print('f(%s) = %f' % (best, score))

Running the example applies the Adam optimization algorithm to our test problem and reports the performance of the search for each iteration of the algorithm.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that a near-optimal solution was found after perhaps 53 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.

...
>50 f([-0.00056912 -0.00321961]) = 0.00001
>51 f([-0.00052452 -0.00286514]) = 0.00001
>52 f([-0.00043908 -0.00251304]) = 0.00001
>53 f([-0.0003283  -0.00217044]) = 0.00000
>54 f([-0.00020731 -0.00184302]) = 0.00000
>55 f([-8.95352320e-05 -1.53514076e-03]) = 0.00000
>56 f([ 1.43050285e-05 -1.25002847e-03]) = 0.00000
>57 f([ 9.67123406e-05 -9.89850279e-04]) = 0.00000
>58 f([ 0.00015359 -0.00075587]) = 0.00000
>59 f([ 0.00018407 -0.00054858]) = 0.00000
Done!
f([ 0.00018407 -0.00054858]) = 0.000000

Visualization of Adam

We can plot the progress of the Adam search on a contour plot of the domain.

This can provide an intuition for the progress of the search over the iterations of the algorithm.

We must update the adam() function to maintain a list of all solutions found during the search, then return this list at the end of the search.

The updated version of the function with these changes is listed below.

# gradient descent algorithm with adam
def adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8):
	solutions = list()
	# generate an initial point
	x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	score = objective(x[0], x[1])
	# initialize first and second moments
	m = [0.0 for _ in range(bounds.shape[0])]
	v = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent updates
	for t in range(n_iter):
		# calculate gradient g(t)
		g = derivative(x[0], x[1])
		# build a solution one variable at a time
		for i in range(bounds.shape[0]):
			# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
			m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
			# v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
			v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2
			# mhat(t) = m(t) / (1 - beta1(t))
			mhat = m[i] / (1.0 - beta1**(t+1))
			# vhat(t) = v(t) / (1 - beta2(t))
			vhat = v[i] / (1.0 - beta2**(t+1))
			# x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
			x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)
		# evaluate candidate point
		score = objective(x[0], x[1])
		# keep track of solutions
		solutions.append(x.copy())
		# report progress
		print('>%d f(%s) = %.5f' % (t, x, score))
	return solutions

We can then execute the search as before, and this time retrieve the list of solutions instead of the best final solution.

...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 60
# step size
alpha = 0.02
# factor for average gradient
beta1 = 0.8
# factor for average squared gradient
beta2 = 0.999
# perform the gradient descent search with adam
solutions = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)

We can then create a contour plot of the objective function, as before.

...
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')

Finally, we can plot each solution found during the search as a white dot connected by a line.

...
# plot the solutions as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')

Tying this all together, the complete example of performing the Adam optimization on the test problem and plotting the results on a contour plot is listed below.

# example of plotting the adam search on a contour plot of the test function
from math import sqrt
from numpy import asarray
from numpy import arange
from numpy.random import rand
from numpy.random import seed
from numpy import meshgrid
from matplotlib import pyplot
from mpl_toolkits.mplot3d import Axes3D

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adam
def adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2, eps=1e-8):
	solutions = list()
	# generate an initial point
	x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	score = objective(x[0], x[1])
	# initialize first and second moments
	m = [0.0 for _ in range(bounds.shape[0])]
	v = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent updates
	for t in range(n_iter):
		# calculate gradient g(t)
		g = derivative(x[0], x[1])
		# build a solution one variable at a time
		for i in range(bounds.shape[0]):
			# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
			m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
			# v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
			v[i] = beta2 * v[i] + (1.0 - beta2) * g[i]**2
			# mhat(t) = m(t) / (1 - beta1(t))
			mhat = m[i] / (1.0 - beta1**(t+1))
			# vhat(t) = v(t) / (1 - beta2(t))
			vhat = v[i] / (1.0 - beta2**(t+1))
			# x(t) = x(t-1) - alpha * mhat(t) / (sqrt(vhat(t)) + eps)
			x[i] = x[i] - alpha * mhat / (sqrt(vhat) + eps)
		# evaluate candidate point
		score = objective(x[0], x[1])
		# keep track of solutions
		solutions.append(x.copy())
		# report progress
		print('>%d f(%s) = %.5f' % (t, x, score))
	return solutions

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 60
# step size
alpha = 0.02
# factor for average gradient
beta1 = 0.8
# factor for average squared gradient
beta2 = 0.999
# perform the gradient descent search with adam
solutions = adam(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# plot the solutions as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
# show the plot
pyplot.show()

Running the example performs the search as before, except in this case, a contour plot of the objective function is created.

In this case, we can see that a white dot is shown for each solution found during the search, starting above the optima and progressively getting closer to the optima at the center of the plot.

Contour Plot of the Test Objective Function With Adam Search Results Shown

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Papers

  • Adam: A Method for Stochastic Optimization, 2014.

Summary

In this tutorial, you discovered how to develop gradient descent with Adam optimization algorithm from scratch.

Specifically, you learned:

  • Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
  • Gradient descent can be updated to use an automatically adaptive step size for each input variable using a decaying average of partial derivatives, called Adam.
  • How to implement the Adam optimization algorithm from scratch and apply it to an objective function and evaluate the results.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.



Visualization for Function Optimization in Python

Function optimization involves finding the input that results in the optimal value from an objective function.

Optimization algorithms navigate the search space of input variables in order to locate the optima, and both the shape of the objective function and behavior of the algorithm in the search space are opaque on real-world problems.

As such, it is common to study optimization algorithms using simple low-dimensional functions that can be easily visualized directly. Additionally, the samples in the input space of these simple functions made by an optimization algorithm can be visualized with their appropriate context.

Visualization of lower-dimensional functions and algorithm behavior on those functions can help to develop the intuitions that can carry over to more complex higher-dimensional function optimization problems later.

In this tutorial, you will discover how to create visualizations for function optimization in Python.

After completing this tutorial, you will know:

  • Visualization is an important tool when studying function optimization algorithms.
  • How to visualize one-dimensional functions and samples using line plots.
  • How to visualize two-dimensional functions and samples using contour and surface plots.

Let’s get started.

Visualization for Function Optimization in Python
Photo by Nathalie, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Visualization for Function Optimization
  2. Visualize 1D Function Optimization
    1. Test Function
    2. Sample Test Function
    3. Line Plot of Test Function
    4. Scatter Plot of Test Function
    5. Line Plot with Marked Optima
    6. Line Plot with Samples
  3. Visualize 2D Function Optimization
    1. Test Function
    2. Sample Test Function
    3. Contour Plot of Test Function
    4. Filled Contour Plot of Test Function
    5. Filled Contour Plot of Test Function with Samples
    6. Surface Plot of Test Function

Visualization for Function Optimization

Function optimization is a field of mathematics concerned with finding the inputs to a function that result in the optimal output for the function, typically a minimum or maximum value.

Optimization may be straightforward for simple differentiable functions where the solution can be calculated analytically. However, most functions of interest in applied machine learning may not be well behaved, and may be complex, nonlinear, multivariate, and non-differentiable.

As such, it is important to have an understanding of a wide range of different algorithms that can be used to address function optimization problems.

An important aspect of studying function optimization is understanding the objective function that is being optimized and understanding the behavior of an optimization algorithm over time.

Visualization plays an important role when getting started with function optimization.

We can select simple and well-understood test functions to study optimization algorithms. These simple functions can be plotted to understand the relationship between the input to the objective function and the output of the objective function, highlighting hills, valleys, and optima.

In addition, the samples selected from the search space by an optimization algorithm can also be plotted on top of plots of the objective function. These plots of algorithm behavior can provide insight and intuition into how specific optimization algorithms work and navigate a search space that can generalize to new problems in the future.

Typically, one-dimensional or two-dimensional functions are chosen to study optimization algorithms as they are easy to visualize using standard plots, like line plots and surface plots. We will explore both in this tutorial.

First, let’s explore how we might visualize a one-dimensional function optimization.

Visualize 1D Function Optimization

A one-dimensional function takes a single input variable and outputs the evaluation of that input variable.

Input variables are typically continuous, represented by a real-valued floating-point value. Often, the input domain is unconstrained, although for test problems we impose a domain of interest.

Test Function

In this case we will explore function visualization with a simple x^2 objective function:

  • f(x) = x^2

This has an optimal value with an input of x=0.0, which equals 0.0.

The example below implements this objective function and evaluates a single input.

# example of a 1d objective function

# objective function
def objective(x):
	return x**2.0

# evaluate inputs to the objective function
x = 4.0
result = objective(x)
print('f(%.3f) = %.3f' % (x, result))

Running the example evaluates the value 4.0 with the objective function, which equals 16.0.

f(4.000) = 16.000

Sample the Test Function

The first thing we might want to do with a new function is define an input range of interest and sample the domain of interest using a uniform grid.

This sample will provide the basis for generating a plot later.

In this case, we will define a domain of interest around the optima of x=0.0 from x=-5.0 to x=5.0 and sample a grid of values in this range with 0.1 increments, such as -5.0, -4.9, -4.8, etc.

...
# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
inputs = arange(r_min, r_max, 0.1)
# summarize some of the input domain
print(inputs[:5])

We can then evaluate each of the x values in our sample.

...
# compute targets
results = objective(inputs)
# summarize some of the results
print(results[:5])

Finally, we can check some of the inputs and their corresponding outputs.

...
# create a mapping of some inputs to some results
for i in range(5):
	print('f(%.3f) = %.3f' % (inputs[i], results[i]))

Tying this together, the complete example of sampling the input space and evaluating all points in the sample is listed below.

# sample 1d objective function
from numpy import arange

# objective function
def objective(x):
	return x**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
inputs = arange(r_min, r_max, 0.1)
# summarize some of the input domain
print(inputs[:5])
# compute targets
results = objective(inputs)
# summarize some of the results
print(results[:5])
# create a mapping of some inputs to some results
for i in range(5):
	print('f(%.3f) = %.3f' % (inputs[i], results[i]))

Running the example first generates a uniform sample of input points as we expected.

The input points are then evaluated using the objective function and finally, we can see a simple mapping of inputs to outputs of the objective function.

[-5.  -4.9 -4.8 -4.7 -4.6]
[25.   24.01 23.04 22.09 21.16]
f(-5.000) = 25.000
f(-4.900) = 24.010
f(-4.800) = 23.040
f(-4.700) = 22.090
f(-4.600) = 21.160

Now that we have some confidence in generating a sample of inputs and evaluating them with the objective function, we can look at generating plots of the function.

Line Plot of Test Function

We could sample the input space randomly, but the benefit of a uniform line or grid of points is that it can be used to generate a smooth plot.

It is smooth because the points in the input space are ordered from smallest to largest. This ordering is important as we expect (hope) that the output of the objective function has a similar smooth relationship between values, e.g. small changes in input result in locally consistent (smooth) changes in the output of the function.

In this case, we can use the samples to generate a line plot of the objective function with the input points (x) on the x-axis of the plot and the objective function output (results) on the y-axis of the plot.

...
# create a line plot of input vs result
pyplot.plot(inputs, results)
# show the plot
pyplot.show()

Tying this together, the complete example is listed below.

# line plot of input vs result for a 1d objective function
from numpy import arange
from matplotlib import pyplot

# objective function
def objective(x):
	return x**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
inputs = arange(r_min, r_max, 0.1)
# compute targets
results = objective(inputs)
# create a line plot of input vs result
pyplot.plot(inputs, results)
# show the plot
pyplot.show()

Running the example creates a line plot of the objective function.

We can see that the function has a large U-shape, called a parabola. This is a common shape when studying curves, e.g. in the study of calculus.

Line Plot of a One-Dimensional Function

Scatter Plot of Test Function

The line is a construct. It is not really the function, just a smooth summary of the function. Always keep this in mind.

Recall that we, in fact, generated a sample of points in the input space and corresponding evaluation of those points.

As such, it would be more accurate to create a scatter plot of points; for example:

# scatter plot of input vs result for a 1d objective function
from numpy import arange
from matplotlib import pyplot

# objective function
def objective(x):
	return x**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
inputs = arange(r_min, r_max, 0.1)
# compute targets
results = objective(inputs)
# create a scatter plot of input vs result
pyplot.scatter(inputs, results)
# show the plot
pyplot.show()

Running the example creates a scatter plot of the objective function.

We can see the familiar shape of the function, but we don’t gain anything from plotting the points directly.

The line and the smooth interpolation between the points it provides are more useful as we can draw other points on top of the line, such as the location of the optima or the points sampled by an optimization algorithm.

Scatter Plot of a One-Dimensional Function

Line Plot with Marked Optima

Next, let’s draw the line plot again and this time draw a point where the known optima of the function is located.

This can be helpful when studying an optimization algorithm as we might want to see how close an optimization algorithm can get to the optima.

First, we must define the input for the optima, then evaluate that point to give the x-axis and y-axis values for plotting.

...
# define the known function optima
optima_x = 0.0
optima_y = objective(optima_x)

We can then plot this point with any shape or color we like, in this case, a red square.

...
# draw the function optima as a red square
pyplot.plot([optima_x], [optima_y], 's', color='r')

Tying this together, the complete example of creating a line plot of the function with the optima highlighted by a point is listed below.

# line plot of input vs result for a 1d objective function and show optima
from numpy import arange
from matplotlib import pyplot

# objective function
def objective(x):
	return x**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
inputs = arange(r_min, r_max, 0.1)
# compute targets
results = objective(inputs)
# create a line plot of input vs result
pyplot.plot(inputs, results)
# define the known function optima
optima_x = 0.0
optima_y = objective(optima_x)
# draw the function optima as a red square
pyplot.plot([optima_x], [optima_y], 's', color='r')
# show the plot
pyplot.show()

Running the example creates the familiar line plot of the function, and this time, the optima of the function, e.g. the input that results in the minimum output of the function, is marked with a red square.

Line Plot of a One-Dimensional Function With Optima Marked by a Red Square

This is a very simple function and the red square for the optima is easy to see.

Sometimes the function might be more complex, with lots of hills and valleys, and we might want to make the optima more visible.

In this case, we can draw a vertical line across the whole plot.

...
# draw a vertical line at the optimal input
pyplot.axvline(x=optima_x, ls='--', color='red')

Tying this together, the complete example is listed below.

# line plot of input vs result for a 1d objective function and show optima as line
from numpy import arange
from matplotlib import pyplot

# objective function
def objective(x):
	return x**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
inputs = arange(r_min, r_max, 0.1)
# compute targets
results = objective(inputs)
# create a line plot of input vs result
pyplot.plot(inputs, results)
# define the known function optima
optima_x = 0.0
# draw a vertical line at the optimal input
pyplot.axvline(x=optima_x, ls='--', color='red')
# show the plot
pyplot.show()

Running the example creates the same plot and this time draws a red line clearly marking the point in the input space that marks the optima.

Line Plot of a One-Dimensional Function With Optima Marked by a Red Line

Line Plot with Samples

Finally, we might want to draw the samples of the input space selected by an optimization algorithm.

We will simulate these samples with random points drawn from the input domain.

...
# simulate a sample made by an optimization algorithm
seed(1)
sample = r_min + rand(10) * (r_max - r_min)
# evaluate the sample
sample_eval = objective(sample)

We can then plot this sample, in this case using small black circles.

...
# plot the sample as black circles
pyplot.plot(sample, sample_eval, 'o', color='black')

The complete example of creating a line plot of a function with the optima marked by a red line and an algorithm sample drawn with small black dots is listed below.

# line plot of domain for a 1d function with optima and algorithm sample
from numpy import arange
from numpy.random import seed
from numpy.random import rand
from matplotlib import pyplot

# objective function
def objective(x):
	return x**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
inputs = arange(r_min, r_max, 0.1)
# compute targets
results = objective(inputs)
# simulate a sample made by an optimization algorithm
seed(1)
sample = r_min + rand(10) * (r_max - r_min)
# evaluate the sample
sample_eval = objective(sample)
# create a line plot of input vs result
pyplot.plot(inputs, results)
# define the known function optima
optima_x = 0.0
# draw a vertical line at the optimal input
pyplot.axvline(x=optima_x, ls='--', color='red')
# plot the sample as black circles
pyplot.plot(sample, sample_eval, 'o', color='black')
# show the plot
pyplot.show()

Running the example creates the line plot of the domain and marks the optima with a red line as before.

This time, the sample from the domain selected by an algorithm (really a random sample of points) is drawn with black dots.

We can imagine that a real optimization algorithm will show points narrowing in on the domain as it searches down-hill from a starting point.

Line Plot of a One-Dimensional Function With Optima Marked by a Red Line and Samples Shown with Black Dots

Next, let’s look at how we might perform similar visualizations for the optimization of a two-dimensional function.

Visualize 2D Function Optimization

A two-dimensional function is a function that takes two input variables, e.g. x and y.

Test Function

We can use the same x^2 function and scale it up to be a two-dimensional function; for example:

  • f(x, y) = x^2 + y^2

This has an optimal value with an input of [x=0.0, y=0.0], which equals 0.0.

The example below implements this objective function and evaluates a single input.

# example of a 2d objective function

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# evaluate inputs to the objective function
x = 4.0
y = 4.0
result = objective(x, y)
print('f(%.3f, %.3f) = %.3f' % (x, y, result))

Running the example evaluates the point [x=4, y=4], which equals 32.

f(4.000, 4.000) = 32.000

Next, we need a way to sample the domain so that we can, in turn, sample the objective function.

Sample Test Function

A common way for sampling a two-dimensional function is to first generate a uniform sample along each variable, x and y, then use these two uniform samples to create a grid of samples, called a mesh grid.

This is not a two-dimensional array across the input space; instead, it is two two-dimensional arrays that, when used together, define a grid across the two input variables.

This is achieved by duplicating the entire x sample array for each y sample point and similarly duplicating the entire y sample array for each x sample point.

This can be achieved using the meshgrid() NumPy function; for example:

...
# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# summarize some of the input domain
print(x[:5, :5])

We can then evaluate each pair of points using our objective function.

...
# compute targets
results = objective(x, y)
# summarize some of the results
print(results[:5, :5])

Finally, we can review the mapping of some of the inputs to their corresponding output values.

...
# create a mapping of some inputs to some results
for i in range(5):
	print('f(%.3f, %.3f) = %.3f' % (x[i,0], y[i,0], results[i,0]))

The example below demonstrates how we can create a uniform sample grid across the two-dimensional input space and objective function.

# sample 2d objective function
from numpy import arange
from numpy import meshgrid

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# summarize some of the input domain
print(x[:5, :5])
# compute targets
results = objective(x, y)
# summarize some of the results
print(results[:5, :5])
# create a mapping of some inputs to some results
for i in range(5):
	print('f(%.3f, %.3f) = %.3f' % (x[i,0], y[i,0], results[i,0]))

Running the example first summarizes some points in the mesh grid, then the objective function evaluation for some points.

Finally, we enumerate coordinates in the two-dimensional input space and their corresponding function evaluation.

[[-5.  -4.9 -4.8 -4.7 -4.6]
 [-5.  -4.9 -4.8 -4.7 -4.6]
 [-5.  -4.9 -4.8 -4.7 -4.6]
 [-5.  -4.9 -4.8 -4.7 -4.6]
 [-5.  -4.9 -4.8 -4.7 -4.6]]
[[50.   49.01 48.04 47.09 46.16]
 [49.01 48.02 47.05 46.1  45.17]
 [48.04 47.05 46.08 45.13 44.2 ]
 [47.09 46.1  45.13 44.18 43.25]
 [46.16 45.17 44.2  43.25 42.32]]
f(-5.000, -5.000) = 50.000
f(-5.000, -4.900) = 49.010
f(-5.000, -4.800) = 48.040
f(-5.000, -4.700) = 47.090
f(-5.000, -4.600) = 46.160

Now that we are familiar with how to sample the input space and evaluate points, let’s look at how we might plot the function.

Contour Plot of Test Function

A popular plot for two-dimensional functions is a contour plot.

This plot creates a flat representation of the objective function outputs for each x and y coordinate where the color and contour lines indicate the relative value or height of the output of the objective function.

This is just like a contour map of a landscape where mountains can be distinguished from valleys.

This can be achieved using the contour() Matplotlib function that takes the mesh grid and the evaluation of the mesh grid as input directly.

We can then specify the number of levels to draw on the contour and the color scheme to use. In this case, we will use 50 levels and a popular “jet” color scheme where low-levels use a cold color scheme (blue) and high-levels use a hot color scheme (red).

...
# create a contour plot with 50 levels and jet color scheme
pyplot.contour(x, y, results, 50, alpha=1.0, cmap='jet')
# show the plot
pyplot.show()

Tying this together, the complete example of creating a contour plot of the two-dimensional objective function is listed below.

# contour plot for 2d objective function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a contour plot with 50 levels and jet color scheme
pyplot.contour(x, y, results, 50, alpha=1.0, cmap='jet')
# show the plot
pyplot.show()

Running the example creates the contour plot.

We can see that the more curved parts of the surface around the edges have more contours to show the detail, and the less curved parts of the surface in the middle have fewer contours.

We can see that the lowest part of the domain is the middle, as expected.

Contour Plot of a Two-Dimensional Objective Function

Filled Contour Plot of Test Function

It is also helpful to color the plot between the contours to show a more complete surface.

Again, the colors are just a simple linear interpolation, not the true function evaluation. This must be kept in mind on more complex functions where fine detail will not be shown.

We can fill the contour plot using the contourf() version of the function that takes the same arguments.

...
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')

We can also show the optima on the plot, in this case as a white star that will stand out against the blue background color of the lowest part of the plot.

...
# define the known function optima
optima_x = [0.0, 0.0]
# draw the function optima as a white star
pyplot.plot([optima_x[0]], [optima_x[1]], '*', color='white')

Tying this together, the complete example of a filled contour plot with the optima marked is listed below.

# filled contour plot for 2d objective function and show the optima
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# define the known function optima
optima_x = [0.0, 0.0]
# draw the function optima as a white star
pyplot.plot([optima_x[0]], [optima_x[1]], '*', color='white')
# show the plot
pyplot.show()

Running the example creates the filled contour plot that gives a better idea of the shape of the objective function.

The optima at [x=0, y=0] is then marked clearly with a white star.

Filled Contour Plot of a Two-Dimensional Objective Function With Optima Marked by a White Star

Filled Contour Plot of Test Function with Samples

We may want to show the progress of an optimization algorithm to get an idea of its behavior in the context of the shape of the objective function.

In this case, we can simulate the points chosen by an optimization algorithm with random coordinates in the input space.

...
# simulate a sample made by an optimization algorithm
seed(1)
sample_x = r_min + rand(10) * (r_max - r_min)
sample_y = r_min + rand(10) * (r_max - r_min)

These points can then be plotted directly as black circles and their context color can give an idea of their relative quality.

...
# plot the sample as black circles
pyplot.plot(sample_x, sample_y, 'o', color='black')

Tying this together, the complete example of a filled contour plot with the optima and input sample plotted is listed below.

# filled contour plot for 2d objective function and show the optima and sample
from numpy import arange
from numpy import meshgrid
from numpy.random import seed
from numpy.random import rand
from matplotlib import pyplot

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# simulate a sample made by an optimization algorithm
seed(1)
sample_x = r_min + rand(10) * (r_max - r_min)
sample_y = r_min + rand(10) * (r_max - r_min)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# define the known function optima
optima_x = [0.0, 0.0]
# draw the function optima as a white star
pyplot.plot([optima_x[0]], [optima_x[1]], '*', color='white')
# plot the sample as black circles
pyplot.plot(sample_x, sample_y, 'o', color='black')
# show the plot
pyplot.show()

Running the example, we can see the filled contour plot as before with the optima marked.

We can now see the sample drawn as black dots and their surrounding color and relative distance to the optima gives an idea of how close the algorithm (random points in this case) got to solving the problem.

Filled Contour Plot of a Two-Dimensional Objective Function With Optima and Input Sample Marked

Surface Plot of Test Function

Finally, we may want to create a three-dimensional plot of the objective function to get a fuller idea of the curvature of the function.

This can be achieved using the plot_surface() Matplotlib function, that, like the contour plot, takes the mesh grid and function evaluation directly.

...
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.gca(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')

The complete example of creating a surface plot is listed below.

# surface plot for 2d objective function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot
from mpl_toolkits.mplot3d import Axes3D

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.gca(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()

Running the example creates a three-dimensional surface plot of the objective function.

Surface Plot of a Two-Dimensional Objective Function

Additionally, the plot is interactive, meaning that you can use the mouse to drag the perspective on the surface around and view it from different angles.
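
If you want to capture a specific viewpoint rather than rotating the figure interactively, the camera angle can also be set programmatically; the minimal sketch below (assuming the figure and axis objects from the example above) uses the standard Matplotlib view_init() and savefig() calls, with an arbitrary elevation and azimuth chosen for illustration.

...
# set the elevation and azimuth of the camera, then save the current view
axis.view_init(elev=50, azim=-60)
pyplot.savefig('surface_plot_angle.png')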

Surface Plot From a Different Angle of a Two-Dimensional Objective Function


Summary

In this tutorial, you discovered how to create visualizations for function optimization in Python.

Specifically, you learned:

  • Visualization is an important tool when studying function optimization algorithms.
  • How to visualize one-dimensional functions and samples using line plots.
  • How to visualize two-dimensional functions and samples using contour and surface plots.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.


How to Choose an Activation Function for Deep Learning

Activation functions are a critical part of the design of a neural network.

The choice of activation function in the hidden layer will control how well the network model learns the training dataset. The choice of activation function in the output layer will define the type of predictions the model can make.

As such, a careful choice of activation function must be made for each deep learning neural network project.

In this tutorial, you will discover how to choose activation functions for neural network models.

After completing this tutorial, you will know:

  • Activation functions are a key part of neural network design.
  • The modern default activation function for hidden layers is the ReLU function.
  • The activation function for output layers depends on the type of prediction problem.

Let’s get started.

How to Choose an Activation Function for Deep Learning
Photo by Peter Dowley, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Activation Functions
  2. Activation for Hidden Layers
  3. Activation for Output Layers

Activation Functions

An activation function in a neural network defines how the weighted sum of the input is transformed into an output from a node or nodes in a layer of the network.

Sometimes the activation function is called a “transfer function.” If the output range of the activation function is limited, then it may be called a “squashing function.” Many activation functions are nonlinear and may be referred to as the “nonlinearity” in the layer or the network design.

The choice of activation function has a large impact on the capability and performance of the neural network, and different activation functions may be used in different parts of the model.

Technically, the activation function is used within or after the internal processing of each node in the network, although networks are designed to use the same activation function for all nodes in a layer.

A network may have three types of layers: input layers that take raw input from the domain, hidden layers that take input from another layer and pass output to another layer, and output layers that make a prediction.

All hidden layers typically use the same activation function. The output layer will typically use a different activation function from the hidden layers and is dependent upon the type of prediction required by the model.

Activation functions are also typically differentiable, meaning the first-order derivative can be calculated for a given input value. This is required given that neural networks are typically trained using the backpropagation of error algorithm that requires the derivative of prediction error in order to update the weights of the model.

There are many different types of activation functions used in neural networks, although perhaps only a small number of functions are used in practice for hidden and output layers.

Let’s take a look at the activation functions used for each type of layer in turn.

Activation for Hidden Layers

A hidden layer in a neural network is a layer that receives input from another layer (such as another hidden layer or an input layer) and provides output to another layer (such as another hidden layer or an output layer).

A hidden layer does not directly contact input data or produce outputs for a model, at least in general.

A neural network may have zero or more hidden layers.

Typically, a differentiable nonlinear activation function is used in the hidden layers of a neural network. This allows the model to learn more complex functions than a network trained using a linear activation function.

In order to get access to a much richer hypothesis space that would benefit from deep representations, you need a non-linearity, or activation function.

— Page 72, Deep Learning with Python, 2017.

There are perhaps three activation functions you may want to consider for use in hidden layers; they are:

  • Rectified Linear Activation (ReLU)
  • Logistic (Sigmoid)
  • Hyperbolic Tangent (Tanh)

This is not an exhaustive list of activation functions used for hidden layers, but they are the most commonly used.

Let’s take a closer look at each in turn.

ReLU Hidden Layer Activation Function

The rectified linear activation function, or ReLU activation function, is perhaps the most common function used for hidden layers.

It is common because it is both simple to implement and effective at overcoming the limitations of other previously popular activation functions, such as Sigmoid and Tanh. Specifically, it is less susceptible to vanishing gradients that prevent deep models from being trained, although it can suffer from other problems like saturated or “dead” units.

The ReLU function is calculated as follows:

  • max(0.0, x)

This means that if the input value (x) is negative, then a value of 0.0 is returned; otherwise, the input value is returned unchanged.

You can learn more about the details of the ReLU activation function in this tutorial:

We can get an intuition for the shape of this function with the worked example below.

# example plot for the relu activation function
from matplotlib import pyplot

# rectified linear function
def rectified(x):
	return max(0.0, x)

# define input data
inputs = [x for x in range(-10, 10)]
# calculate outputs
outputs = [rectified(x) for x in inputs]
# plot inputs vs outputs
pyplot.plot(inputs, outputs)
pyplot.show()

Running the example calculates the outputs for a range of values and creates a plot of inputs versus outputs.

We can see the familiar kink shape of the ReLU activation function.

Plot of Inputs vs. Outputs for the ReLU Activation Function.

When using the ReLU function for hidden layers, it is a good practice to use a “He Normal” or “He Uniform” weight initialization and scale input data to the range 0-1 (normalize) prior to training.
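
For example, input data can be scaled to the range 0-1 before training with a min-max scaling transform; the sketch below uses scikit-learn's MinMaxScaler as one way to do this (the small data array is made up purely for illustration).

# minimal sketch of normalizing input data to the range 0-1 prior to training
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler

# contrived input data for illustration
data = asarray([[1.0, 200.0], [2.0, 400.0], [3.0, 800.0]])
# scale each column to the range 0-1
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(data)
print(scaled)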

Sigmoid Hidden Layer Activation Function

The sigmoid activation function is also called the logistic function.

It is the same function used in the logistic regression classification algorithm.

The function takes any real value as input and outputs values in the range 0 to 1. The larger the input (more positive), the closer the output value will be to 1.0, whereas the smaller the input (more negative), the closer the output will be to 0.0.

The sigmoid activation function is calculated as follows:

  • 1.0 / (1.0 + e^-x)

Where e is a mathematical constant, which is the base of the natural logarithm.

We can get an intuition for the shape of this function with the worked example below.

# example plot for the sigmoid activation function
from math import exp
from matplotlib import pyplot

# sigmoid activation function
def sigmoid(x):
	return 1.0 / (1.0 + exp(-x))

# define input data
inputs = [x for x in range(-10, 10)]
# calculate outputs
outputs = [sigmoid(x) for x in inputs]
# plot inputs vs outputs
pyplot.plot(inputs, outputs)
pyplot.show()

Running the example calculates the outputs for a range of values and creates a plot of inputs versus outputs.

We can see the familiar S-shape of the sigmoid activation function.

Plot of Inputs vs. Outputs for the Sigmoid Activation Function.

When using the Sigmoid function for hidden layers, it is a good practice to use a “Xavier Normal” or “Xavier Uniform” weight initialization (also referred to as Glorot initialization, named for Xavier Glorot) and scale input data to the range 0-1 (e.g. the range of the activation function) prior to training.

Tanh Hidden Layer Activation Function

The hyperbolic tangent activation function is also referred to simply as the Tanh (also “tanh” and “TanH“) function.

It is very similar to the sigmoid activation function and even has the same S-shape.

The function takes any real value as input and outputs values in the range -1 to 1. The larger the input (more positive), the closer the output value will be to 1.0, whereas the smaller the input (more negative), the closer the output will be to -1.0.

The Tanh activation function is calculated as follows:

  • (e^x – e^-x) / (e^x + e^-x)

Where e is a mathematical constant that is the base of the natural logarithm.

We can get an intuition for the shape of this function with the worked example below.

# example plot for the tanh activation function
from math import exp
from matplotlib import pyplot

# tanh activation function
def tanh(x):
	return (exp(x) - exp(-x)) / (exp(x) + exp(-x))

# define input data
inputs = [x for x in range(-10, 10)]
# calculate outputs
outputs = [tanh(x) for x in inputs]
# plot inputs vs outputs
pyplot.plot(inputs, outputs)
pyplot.show()

Running the example calculates the outputs for a range of values and creates a plot of inputs versus outputs.

We can see the familiar S-shape of the Tanh activation function.

Plot of Inputs vs. Outputs for the Tanh Activation Function.

Plot of Inputs vs. Outputs for the Tanh Activation Function.

When using the TanH function for hidden layers, it is a good practice to use a “Xavier Normal” or “Xavier Uniform” weight initialization (also referred to as Glorot initialization, named for Xavier Glorot) and scale input data to the range -1 to 1 (e.g. the range of the activation function) prior to training.

How to Choose a Hidden Layer Activation Function

A neural network will almost always have the same activation function in all hidden layers.

It is most unusual to vary the activation function through a network model.

Traditionally, the sigmoid activation function was the default activation function in the 1990s. Then, from perhaps the mid-to-late 1990s through to the 2010s, the Tanh function was the default activation function for hidden layers.

… the hyperbolic tangent activation function typically performs better than the logistic sigmoid.

— Page 195, Deep Learning, 2016.

Both the sigmoid and Tanh functions can make the model more susceptible to problems during training, via the so-called vanishing gradients problem.

You can learn more about this problem in this tutorial:

The activation function used in hidden layers is typically chosen based on the type of neural network architecture.

Modern neural network models with common architectures, such as MLP and CNN, will make use of the ReLU activation function, or extensions.

In modern neural networks, the default recommendation is to use the rectified linear unit or ReLU …

— Page 174, Deep Learning, 2016.

Recurrent networks still commonly use Tanh or sigmoid activation functions, or even both. For example, the LSTM commonly uses the sigmoid activation for its gates (the recurrent connections) and the Tanh activation for the cell state and output, as sketched after the list below.

  • Multilayer Perceptron (MLP): ReLU activation function.
  • Convolutional Neural Network (CNN): ReLU activation function.
  • Recurrent Neural Network: Tanh and/or Sigmoid activation function.
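
For example, assuming the Keras API, the LSTM layer uses exactly this pairing by default, which can be made explicit as a sketch (the number of units is an illustrative assumption):

# hedged sketch: an LSTM layer with Tanh output activation and sigmoid recurrent (gate) activation
from tensorflow.keras.layers import LSTM

# these arguments match the Keras defaults for the LSTM layer
layer = LSTM(32, activation='tanh', recurrent_activation='sigmoid')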

If you’re unsure which activation function to use for your network, try a few and compare the results.

The figure below summarizes how to choose an activation function for the hidden layers of your neural network model.

How to Choose a Hidden Layer Activation Function

How to Choose a Hidden Layer Activation Function

Activation for Output Layers

The output layer is the layer in a neural network model that directly outputs a prediction.

All feed-forward neural network models have an output layer.

There are perhaps three activation functions you may want to consider for use in the output layer; they are:

  • Linear
  • Logistic (Sigmoid)
  • Softmax

This is not an exhaustive list of activation functions used for output layers, but they are the most commonly used.

Let’s take a closer look at each in turn.

Linear Output Activation Function

The linear activation function is also called “identity” (multiplied by 1.0) or “no activation.”

This is because the linear activation function does not change the weighted sum of the input in any way and instead returns the value directly.

We can get an intuition for the shape of this function with the worked example below.

# example plot for the linear activation function
from matplotlib import pyplot

# linear activation function
def linear(x):
	return x

# define input data
inputs = [x for x in range(-10, 10)]
# calculate outputs
outputs = [linear(x) for x in inputs]
# plot inputs vs outputs
pyplot.plot(inputs, outputs)
pyplot.show()

Running the example calculates the outputs for a range of values and creates a plot of inputs versus outputs.

We can see a diagonal line shape where inputs are plotted against identical outputs.

Plot of Inputs vs. Outputs for the Linear Activation Function

Plot of Inputs vs. Outputs for the Linear Activation Function

Target values used to train a model with a linear activation function in the output layer are typically scaled prior to modeling using normalization or standardization transforms.

Sigmoid Output Activation Function

The sigmoid or logistic activation function was described in the previous section.

Nevertheless, to add some symmetry, we can review the shape of this function with the worked example below.

# example plot for the sigmoid activation function
from math import exp
from matplotlib import pyplot

# sigmoid activation function
def sigmoid(x):
	return 1.0 / (1.0 + exp(-x))

# define input data
inputs = [x for x in range(-10, 10)]
# calculate outputs
outputs = [sigmoid(x) for x in inputs]
# plot inputs vs outputs
pyplot.plot(inputs, outputs)
pyplot.show()

Running the example calculates the outputs for a range of values and creates a plot of inputs versus outputs.

We can see the familiar S-shape of the sigmoid activation function.

Plot of Inputs vs. Outputs for the Sigmoid Activation Function.

Plot of Inputs vs. Outputs for the Sigmoid Activation Function.

Target labels used to train a model with a sigmoid activation function in the output layer will have the values 0 or 1.

Softmax Output Activation Function

The softmax function outputs a vector of values that sum to 1.0 that can be interpreted as probabilities of class membership.

It is related to the argmax function that outputs a 0 for all options and 1 for the chosen option. Softmax is a “softer” version of argmax that allows a probability-like output of a winner-take-all function.

As such, the input to the function is a vector of real values and the output is a vector of the same length with values that sum to 1.0 like probabilities.

The softmax function is calculated as follows:

  • e^x / sum(e^x)

Where x is a vector of outputs and e is a mathematical constant that is the base of the natural logarithm.

You can learn more about the details of the Softmax function in this tutorial:

We cannot plot the softmax function, but we can give an example of calculating it in Python.

# example of the softmax activation function
from numpy import exp

# softmax activation function
def softmax(x):
	return exp(x) / exp(x).sum()

# define input data
inputs = [1.0, 3.0, 2.0]
# calculate outputs
outputs = softmax(inputs)
# report the probabilities
print(outputs)
# report the sum of the probabilities
print(outputs.sum())

Running the example calculates the softmax output for the input vector.

We then confirm that the outputs of the softmax indeed sum to the value 1.0.

[0.09003057 0.66524096 0.24472847]
1.0

Target labels used to train a model with the softmax activation function in the output layer will be vectors with 1 for the target class and 0 for all other classes.
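
As a hedged sketch, integer class labels can be converted to such one hot vectors with the to_categorical() utility, assuming the Keras API (the labels below are illustrative):

# hedged sketch: one hot encode integer class labels for a softmax output layer
from tensorflow.keras.utils import to_categorical

# illustrative integer labels for a three-class problem
labels = [0, 2, 1, 2]
# each label becomes a vector with a 1 for the target class, e.g. 2 -> [0, 0, 1]
encoded = to_categorical(labels, num_classes=3)
print(encoded)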

How to Choose an Output Activation Function

You must choose the activation function for your output layer based on the type of prediction problem that you are solving.

Specifically, the type of variable that is being predicted.

For example, you may divide prediction problems into two main groups, predicting a categorical variable (classification) and predicting a numerical variable (regression).

If your problem is a regression problem, you should use a linear activation function.

  • Regression: One node, linear activation.

If your problem is a classification problem, then there are three main types of classification problems and each may use a different activation function.

Predicting a probability is not a regression problem; it is classification. In all cases of classification, your model will predict the probability of class membership (e.g. probability that an example belongs to each class) that you can convert to a crisp class label by rounding (for sigmoid) or argmax (for softmax).

If there are two mutually exclusive classes (binary classification), then your output layer will have one node and a sigmoid activation function should be used. If there are more than two mutually exclusive classes (multiclass classification), then your output layer will have one node per class and a softmax activation should be used. If there are two or more mutually inclusive classes (multilabel classification), then your output layer will have one node for each class and a sigmoid activation function is used.

  • Binary Classification: One node, sigmoid activation.
  • Multiclass Classification: One node per class, softmax activation.
  • Multilabel Classification: One node per class, sigmoid activation.
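
As a hedged sketch, assuming the Keras API, the output layer for each case might be configured as follows (the class counts are illustrative assumptions):

# hedged sketch: output layer configurations by problem type (class counts are assumptions)
from tensorflow.keras.layers import Dense

# regression: one node, linear activation
regression_output = Dense(1, activation='linear')
# binary classification: one node, sigmoid activation
binary_output = Dense(1, activation='sigmoid')
# multiclass classification: one node per class, softmax activation
multiclass_output = Dense(5, activation='softmax')
# multilabel classification: one node per class, sigmoid activation
multilabel_output = Dense(5, activation='sigmoid')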

The figure below summarizes how to choose an activation function for the output layer of your neural network model.

How to Choose an Output Layer Activation Function

How to Choose an Output Layer Activation Function

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

Books

Articles

Summary

In this tutorial, you discovered how to choose activation functions for neural network models.

Specifically, you learned:

  • Activation functions are a key part of neural network design.
  • The modern default activation function for hidden layers is the ReLU function.
  • The activation function for output layers depends on the type of prediction problem.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post How to Choose an Activation Function for Deep Learning appeared first on Machine Learning Mastery.

Regression Metrics for Machine Learning

Regression refers to predictive modeling problems that involve predicting a numeric value.

It is different from classification, which involves predicting a class label. Unlike classification, you cannot use classification accuracy to evaluate the predictions made by a regression model.

Instead, you must use error metrics specifically designed for evaluating predictions made on regression problems.

In this tutorial, you will discover how to calculate error metrics for regression predictive modeling projects.

After completing this tutorial, you will know:

  • Regression predictive modeling problems are those that involve predicting a numeric value.
  • Metrics for regression involve calculating an error score to summarize the predictive skill of a model.
  • How to calculate and report mean squared error, root mean squared error, and mean absolute error.

Let’s get started.

Regression Metrics for Machine Learning

Regression Metrics for Machine Learning
Photo by Gael Varoquaux, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Regression Predictive Modeling
  2. Evaluating Regression Models
  3. Metrics for Regression
    1. Mean Squared Error
    2. Root Mean Squared Error
    3. Mean Absolute Error

Regression Predictive Modeling

Predictive modeling is the problem of developing a model using historical data to make a prediction on new data where we do not have the answer.

Predictive modeling can be described as the mathematical problem of approximating a mapping function (f) from input variables (X) to output variables (y). This is called the problem of function approximation.

The job of the modeling algorithm is to find the best mapping function we can given the time and resources available.

For more on approximating functions in applied machine learning, see the post:

Regression predictive modeling is the task of approximating a mapping function (f) from input variables (X) to a continuous output variable (y).

Regression is different from classification, which involves predicting a category or class label.

For more on the difference between classification and regression, see the tutorial:

A continuous output variable is a real value, such as an integer or floating point value. These are often quantities, such as amounts and sizes.

For example, a house may be predicted to sell for a specific dollar value, perhaps in the range of $100,000 to $200,000.

  • A regression problem requires the prediction of a quantity.
  • A regression can have real-valued or discrete input variables.
  • A problem with multiple input variables is often called a multivariate regression problem.
  • A regression problem where input variables are ordered by time is called a time series forecasting problem.

Now that we are familiar with regression predictive modeling, let’s look at how we might evaluate a regression model.

Evaluating Regression Models

A common question by beginners to regression predictive modeling projects is:

How do I calculate accuracy for my regression model?

Accuracy (e.g. classification accuracy) is a measure for classification, not regression.

We cannot calculate accuracy for a regression model.

The skill or performance of a regression model must be reported as an error in those predictions.

This makes sense if you think about it. If you are predicting a numeric value like a height or a dollar amount, you don’t want to know whether the model predicted the value exactly (this might be intractably difficult in practice); instead, you want to know how close the predictions were to the expected values.

Error addresses exactly this and summarizes on average how close predictions were to their expected values.

There are three error metrics that are commonly used for evaluating and reporting the performance of a regression model; they are:

  • Mean Squared Error (MSE).
  • Root Mean Squared Error (RMSE).
  • Mean Absolute Error (MAE)

There are many other metrics for regression, although these are the most commonly used. You can see the full list of regression metrics supported by the scikit-learn Python machine learning library here:

In the next section, let’s take a closer look at each in turn.

Metrics for Regression

In this section, we will take a closer look at the popular metrics for regression models and how to calculate them for your predictive modeling project.

Mean Squared Error

Mean Squared Error, or MSE for short, is a popular error metric for regression problems.

It is also an important loss function for algorithms fit or optimized using the least squares framing of a regression problem. Here “least squares” refers to minimizing the mean squared error between predictions and expected values.

The MSE is calculated as the mean or average of the squared differences between predicted and expected target values in a dataset.

  • MSE = 1 / N * sum for i to N (y_i – yhat_i)^2

Where y_i is the i’th expected value in the dataset and yhat_i is the i’th predicted value. The difference between these two values is squared, which has the effect of removing the sign, resulting in a positive error value.

The squaring also has the effect of inflating or magnifying large errors. That is, the larger the difference between the predicted and expected values, the larger the resulting squared positive error. This has the effect of “punishing” models more for larger errors when MSE is used as a loss function. It also has the effect of “punishing” models by inflating the average error score when used as a metric.
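
As a minimal sketch of the formula above (the toy values below are illustrative assumptions):

# hedged sketch: mean squared error calculated directly from the formula
def mean_squared_error_scratch(expected, predicted):
	# average of the squared differences between expected and predicted values
	squared_errors = [(e - p)**2 for e, p in zip(expected, predicted)]
	return sum(squared_errors) / len(squared_errors)

# illustrative values
print(mean_squared_error_scratch([1.0, 1.0, 1.0], [1.0, 0.9, 0.8]))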

We can create a plot to get a feeling for how the change in prediction error impacts the squared error.

The example below gives a small contrived dataset of all 1.0 values and predictions that range from perfect (1.0) to wrong (0.0) by 0.1 increments. The squared error between each prediction and expected value is calculated and plotted to show the quadratic increase in squared error.

...
# calculate error
err = (expected[i] - predicted[i])**2

The complete example is listed below.

# example of increase in mean squared error
from matplotlib import pyplot
from sklearn.metrics import mean_squared_error
# real value
expected = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# predicted value
predicted = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
# calculate errors
errors = list()
for i in range(len(expected)):
	# calculate error
	err = (expected[i] - predicted[i])**2
	# store error
	errors.append(err)
	# report error
	print('>%.1f, %.1f = %.3f' % (expected[i], predicted[i], err))
# plot errors
pyplot.plot(errors)
pyplot.xticks(ticks=[i for i in range(len(errors))], labels=predicted)
pyplot.xlabel('Predicted Value')
pyplot.ylabel('Mean Squared Error')
pyplot.show()

Running the example first reports the expected value, predicted value, and squared error for each case.

We can see that the error rises quickly, faster than linear (a straight line).

>1.0, 1.0 = 0.000
>1.0, 0.9 = 0.010
>1.0, 0.8 = 0.040
>1.0, 0.7 = 0.090
>1.0, 0.6 = 0.160
>1.0, 0.5 = 0.250
>1.0, 0.4 = 0.360
>1.0, 0.3 = 0.490
>1.0, 0.2 = 0.640
>1.0, 0.1 = 0.810
>1.0, 0.0 = 1.000

A line plot is created showing the curved or super-linear increase in the squared error value as the difference between the expected and predicted value is increased.

The curve is not a straight line as we might naively assume for an error metric.

Line Plot of the Increase in Squared Error With Predictions

Line Plot of the Increase in Squared Error With Predictions

The individual error terms are averaged so that we can report the performance of a model with regard to how much error the model makes generally when making predictions, rather than specifically for a given example.

The units of the MSE are squared units.

For example, if your target value represents “dollars,” then the MSE will be “squared dollars.” This can be confusing for stakeholders; therefore, when reporting results, often the root mean squared error is used instead (discussed in the next section).

The mean squared error between your expected and predicted values can be calculated using the mean_squared_error() function from the scikit-learn library.

The function takes a one-dimensional array or list of expected values and predicted values and returns the mean squared error value.

...
# calculate errors
errors = mean_squared_error(expected, predicted)

The example below demonstrates calculating the mean squared error between a list of contrived expected and predicted values.

# example of calculate the mean squared error
from sklearn.metrics import mean_squared_error
# real value
expected = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# predicted value
predicted = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
# calculate errors
errors = mean_squared_error(expected, predicted)
# report error
print(errors)

Running the example calculates and prints the mean squared error.

0.35000000000000003

A perfect mean squared error value is 0.0, which means that all predictions matched the expected values exactly.

This is almost never the case, and if it happens, it suggests your predictive modeling problem is trivial.

A good MSE is relative to your specific dataset.

It is a good idea to first establish a baseline MSE for your dataset using a naive predictive model, such as predicting the mean target value from the training dataset. A model that achieves an MSE better than the MSE for the naive model has skill.
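
A hedged sketch of establishing such a baseline with scikit-learn’s DummyRegressor (the data below is an illustrative assumption):

# hedged sketch: establish a baseline MSE with a naive model that predicts the training mean
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# illustrative training and test data
X_train, y_train = [[1], [2], [3], [4]], [1.2, 1.9, 3.1, 4.2]
X_test, y_test = [[5], [6]], [5.1, 5.9]
# naive model that always predicts the mean of the training targets
naive = DummyRegressor(strategy='mean')
naive.fit(X_train, y_train)
# baseline MSE that a skillful model must beat
baseline_mse = mean_squared_error(y_test, naive.predict(X_test))
print(baseline_mse)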

Root Mean Squared Error

The Root Mean Squared Error, or RMSE, is an extension of the mean squared error.

Importantly, the square root of the error is calculated, which means that the units of the RMSE are the same as the original units of the target value that is being predicted.

For example, if your target variable has the units “dollars,” then the RMSE error score will also have the unit “dollars” and not “squared dollars” like the MSE.

As such, it may be common to use MSE loss to train a regression predictive model, and to use RMSE to evaluate and report its performance.

The RMSE can be calculated as follows:

  • RMSE = sqrt(1 / N * sum for i to N (y_i – yhat_i)^2)

Where y_i is the i’th expected value in the dataset, yhat_i is the i’th predicted value, and sqrt() is the square root function.

We can restate the RMSE in terms of the MSE as:

  • RMSE = sqrt(MSE)

Note that the RMSE cannot be calculated as the average of the square roots of the individual squared error values; the square root must be applied after the squared errors are averaged. This is a common error made by beginners and is an example of Jensen’s inequality.

You may recall that the square root is the inverse of the square operation. MSE uses the square operation to remove the sign of each error value and to punish large errors. The square root reverses this operation, although it ensures that the result remains positive.

The root mean squared error between your expected and predicted values can be calculated using the mean_squared_error() function from the scikit-learn library.

By default, the function calculates the MSE, but we can configure it to calculate the square root of the MSE by setting the “squared” argument to False.

The function takes a one-dimensional array or list of expected values and predicted values and, with this configuration, returns the root mean squared error value.

...
# calculate errors
errors = mean_squared_error(expected, predicted, squared=False)

The example below demonstrates calculating the root mean squared error between a list of contrived expected and predicted values.

# example of calculate the root mean squared error
from sklearn.metrics import mean_squared_error
# real value
expected = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# predicted value
predicted = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
# calculate errors
errors = mean_squared_error(expected, predicted, squared=False)
# report error
print(errors)

Running the example calculates and prints the root mean squared error.

0.5916079783099616
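
Note that the “squared” argument may not be available in every version of scikit-learn (it has been deprecated in more recent releases); an equivalent approach is to take the square root of the MSE directly:

...
# calculate the RMSE as the square root of the MSE
from math import sqrt
errors = sqrt(mean_squared_error(expected, predicted))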

A perfect RMSE value is 0.0, which means that all predictions matched the expected values exactly.

This is almost never the case, and if it happens, it suggests your predictive modeling problem is trivial.

A good RMSE is relative to your specific dataset.

It is a good idea to first establish a baseline RMSE for your dataset using a naive predictive model, such as predicting the mean target value from the training dataset. A model that achieves an RMSE better than the RMSE for the naive model has skill.

Mean Absolute Error

Mean Absolute Error, or MAE, is a popular metric because, like RMSE, the units of the error score match the units of the target value that is being predicted.

Unlike the RMSE, the changes in MAE are linear and therefore intuitive.

That is, MSE and RMSE punish larger errors more than smaller errors, inflating or magnifying the mean error score. This is due to the square of the error value. The MAE does not give more or less weight to different types of errors and instead the scores increase linearly with increases in error.

As its name suggests, the MAE score is calculated as the average of the absolute error values. Absolute or abs() is a mathematical function that simply makes a number positive. Therefore, the difference between an expected and predicted value may be positive or negative and is forced to be positive when calculating the MAE.

The MAE can be calculated as follows:

  • MAE = 1 / N * sum for i to N abs(y_i – yhat_i)

Where y_i is the i’th expected value in the dataset, yhat_i is the i’th predicted value and abs() is the absolute function.
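
As with the MSE, a minimal sketch of the formula (the toy values below are illustrative assumptions):

# hedged sketch: mean absolute error calculated directly from the formula
def mean_absolute_error_scratch(expected, predicted):
	# average of the absolute differences between expected and predicted values
	absolute_errors = [abs(e - p) for e, p in zip(expected, predicted)]
	return sum(absolute_errors) / len(absolute_errors)

# illustrative values
print(mean_absolute_error_scratch([1.0, 1.0, 1.0], [1.0, 0.9, 0.8]))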

We can create a plot to get a feeling for how the change in prediction error impacts the MAE.

The example below gives a small contrived dataset of all 1.0 values and predictions that range from perfect (1.0) to wrong (0.0) by 0.1 increments. The absolute error between each prediction and expected value is calculated and plotted to show the linear increase in error.

...
# calculate error
err = abs((expected[i] - predicted[i]))

The complete example is listed below.

# plot of the increase of mean absolute error with prediction error
from matplotlib import pyplot
# real value
expected = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# predicted value
predicted = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
# calculate errors
errors = list()
for i in range(len(expected)):
	# calculate error
	err = abs((expected[i] - predicted[i]))
	# store error
	errors.append(err)
	# report error
	print('>%.1f, %.1f = %.3f' % (expected[i], predicted[i], err))
# plot errors
pyplot.plot(errors)
pyplot.xticks(ticks=[i for i in range(len(errors))], labels=predicted)
pyplot.xlabel('Predicted Value')
pyplot.ylabel('Mean Absolute Error')
pyplot.show()

Running the example first reports the expected value, predicted value, and absolute error for each case.

We can see that the error rises linearly, which is intuitive and easy to understand.

>1.0, 1.0 = 0.000
>1.0, 0.9 = 0.100
>1.0, 0.8 = 0.200
>1.0, 0.7 = 0.300
>1.0, 0.6 = 0.400
>1.0, 0.5 = 0.500
>1.0, 0.4 = 0.600
>1.0, 0.3 = 0.700
>1.0, 0.2 = 0.800
>1.0, 0.1 = 0.900
>1.0, 0.0 = 1.000

A line plot is created showing the straight line or linear increase in the absolute error value as the difference between the expected and predicted value is increased.

Line Plot of the Increase in Absolute Error With Predictions

Line Plot of the Increase in Absolute Error With Predictions

The mean absolute error between your expected and predicted values can be calculated using the mean_absolute_error() function from the scikit-learn library.

The function takes a one-dimensional array or list of expected values and predicted values and returns the mean absolute error value.

...
# calculate errors
errors = mean_absolute_error(expected, predicted)

The example below demonstrates calculating the mean absolute error between a list of contrived expected and predicted values.

# example of calculate the mean absolute error
from sklearn.metrics import mean_absolute_error
# real value
expected = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
# predicted value
predicted = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
# calculate errors
errors = mean_absolute_error(expected, predicted)
# report error
print(errors)

Running the example calculates and prints the mean absolute error.

0.5

A perfect mean absolute error value is 0.0, which means that all predictions matched the expected values exactly.

This is almost never the case, and if it happens, it suggests your predictive modeling problem is trivial.

A good MAE is relative to your specific dataset.

It is a good idea to first establish a baseline MAE for your dataset using a naive predictive model, such as predicting the mean target value from the training dataset. A model that achieves an MAE better than the MAE for the naive model has skill.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

APIs

Articles

Summary

In this tutorial, you discovered how to calculate error for regression predictive modeling projects.

Specifically, you learned:

  • Regression predictive modeling problems are those that involve predicting a numeric value.
  • Metrics for regression involve calculating an error score to summarize the predictive skill of a model.
  • How to calculate and report mean squared error, root mean squared error, and mean absolute error.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Regression Metrics for Machine Learning appeared first on Machine Learning Mastery.

How to Get Started With Recommender Systems

Recommender systems may be the most common type of predictive model that the average person will encounter.

They provide the basis for recommendations on services such as Amazon, Spotify, and YouTube.

Recommender systems are a huge, daunting topic if you’re just getting started. There is a myriad of data preparation techniques, algorithms, and model evaluation methods.

Not all of the techniques will be relevant, and in fact, the state-of-the-art can be ignored for now as you will likely get very good results by focusing on the fundamentals, e.g. treat it as a straightforward classification or regression problem.

It is important to know the basics and have it all laid out for you in a systematic way. For this, I recommend skimming or reading the standard books and papers on the topic and looking at some of the popular libraries.

In this tutorial, you will discover resources you can use to get started with recommender systems.

After completing this tutorial, you will know:

  • The top review papers on recommender systems you can use to quickly understand the state of the field.
  • The top books on recommender systems from which you can learn the algorithms and techniques required when developing and evaluating recommender systems.
  • The top Python libraries and APIs that you can use to prototype and develop your own recommender systems.

Let’s get started.

How to Get Started With Recommender Systems

How to Get Started With Recommender Systems
Photo by Paul Toogood, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Papers on Recommender Systems
  2. Books on Recommender Systems
  3. Recommender Systems Libraries

Papers on Recommender Systems

Research papers on recommender systems can help you very quickly get up to speed on the state of the field.

Specifically, review papers use precise language to define what a recommender system is, describe the algorithms that can be used, identify standard datasets and metrics for comparing algorithms, and hint at state-of-the-art techniques.

By skimming or reading a handful of review papers on recommender systems, you can quickly develop a foundation from which to dive deeper and start developing your own systems.

The field does not change that quickly, and techniques from 10 or 20 years ago will give you solid results.

Review papers on recommender systems I recommended to establish a foundational understanding include:

Matrix Factorization Techniques for Recommender Systems

Matrix Factorization Techniques for Recommender Systems

Once you have questions about specific techniques, you can then find papers that focus on those techniques and dive deeper.

You can search for papers on specific techniques here:

Do you know of additional good review papers on recommender systems?
Let me know in the comments below.

Books on Recommender Systems

Books on recommender systems provide the space to lay out the field, take you on a tour of the techniques, and give you the detail you need to understand them, with more breadth and depth than a much shorter review paper.

Again, given that the field is quite mature, older books, such as those published a decade ago, should not be immediately dismissed.

Some top textbooks published by key researchers in the field include the following:

I own a hard copy of “Recommender Systems: An Introduction” and cannot recommend it highly enough.

This book offers an overview of approaches to developing state-of-the-art recommender systems. The authors present current algorithmic approaches for generating personalized buying proposals, such as collaborative and content-based filtering, as well as more interactive and knowledge-based approaches. They also discuss how to measure the effectiveness of recommender systems and illustrate the methods with practical case studies.

Recommender Systems: An Introduction, 2010.

The table of contents for this book is as follows:

  • Chapter 1: Introduction
  • Chapter 2: Collaborative recommendation
  • Chapter 3: Content-based recommendation
  • Chapter 4: Knowledge-based recommendation
  • Chapter 5: Hybrid recommendation approaches
  • Chapter 6: Explanations in recommender systems
  • Chapter 7: Evaluating recommender systems
  • Chapter 8: Case study: Personalized game recommendations on the mobile Internet
  • Chapter 9: Attacks on collaborative recommender systems
  • Chapter 10: Online consumer decision making
  • Chapter 11: Recommender systems and the next-generation web
  • Chapter 12: Recommendations in ubiquitous environments
  • Chapter 13: Summary and outlook

Recommender Systems: An Introduction

Recommender Systems: An Introduction

It can be good to get a handbook on the topic with chapters written by different academics summarizing or championing their preferred techniques and methods.

I recommend this handbook:

If you are looking for a more hands-on book, I recommend:

Have you read one of these books? Or do you know another great book on the topic?
Let me know in the comments below.

Recommender Systems Libraries

You probably don’t need to dive into the state of the art, at least not immediately.

As such, standard machine learning libraries are a great place to start.

For example, you can develop an effective recommender system using matrix factorization methods (SVD) or even a straightforward k-nearest neighbors model by items or by users.

As such, I recommend starting with some experiments with scikit-learn:
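
A hedged sketch of that kind of experiment, assuming a tiny made-up user-item ratings matrix, might use matrix factorization and item neighbors as follows:

# hedged sketch: prototype recommender building blocks with scikit-learn (ratings matrix is an assumption)
from numpy import array
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

# illustrative user-item ratings matrix (rows are users, columns are items)
ratings = array([
	[5, 3, 0, 1],
	[4, 0, 0, 1],
	[1, 1, 0, 5],
	[0, 0, 5, 4]])
# factorize the ratings matrix into a low-dimensional user representation
svd = TruncatedSVD(n_components=2)
user_factors = svd.fit_transform(ratings)
print(user_factors.shape)
# find the most similar items by comparing item column vectors
knn = NearestNeighbors(n_neighbors=2, metric='cosine')
knn.fit(ratings.T)
distances, neighbors = knn.kneighbors(ratings.T[:1])
print(neighbors)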

You can practice on standard recommender system datasets if your own data is not yet accessible or available, or you just want to get the hang of things first.

Popular standard datasets for recommender systems include:

If you are ready for state-of-the-art techniques, a great place to start is “papers with code” that lists both academic papers and links to the source code for the methods described in the paper:

There are a number of proprietary and open-source libraries and services for recommender systems.

I recommend sticking with open-source Python libraries in the beginning, such as:

Have you used any of these libraries to develop a recommender system?
Let me know in the comments below.

Summary

In this tutorial, you discovered resources you can use to get started with recommender systems.

Specifically, you learned:

  • The top review papers on recommender systems you can use to quickly understand the state of the field.
  • The top books on recommender systems from which you can learn the algorithms and techniques required when developing and evaluating recommender systems.
  • The top Python libraries and APIs that you can use to prototype and develop your own recommender systems.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post How to Get Started With Recommender Systems appeared first on Machine Learning Mastery.

How to Use Nelder-Mead Optimization in Python

The Nelder-Mead optimization algorithm is a widely used approach for non-differentiable objective functions.

As such, it is generally referred to as a pattern search algorithm and is used as a local or global search procedure on challenging nonlinear, and potentially noisy and multimodal, function optimization problems.

In this tutorial, you will discover the Nelder-Mead optimization algorithm.

After completing this tutorial, you will know:

  • The Nelder-Mead optimization algorithm is a type of pattern search that does not use function gradients.
  • How to apply the Nelder-Mead algorithm for function optimization in Python.
  • How to interpret the results of the Nelder-Mead algorithm on noisy and multimodal objective functions.

Let’s get started.

How to Use the Nelder-Mead Optimization in Python

How to Use the Nelder-Mead Optimization in Python
Photo by Don Graham, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Nelder-Mead Algorithm
  2. Nelder-Mead Example in Python
  3. Nelder-Mead on Challenging Functions
    1. Noisy Optimization Problem
    2. Multimodal Optimization Problem

Nelder-Mead Algorithm

Nelder-Mead is an optimization algorithm named after the developers of the technique, John Nelder and Roger Mead.

The algorithm was described in their 1965 paper titled “A Simplex Method For Function Minimization” and has become a standard and widely used technique for function optimization.

It is appropriate for one-dimensional or multidimensional functions with numerical inputs.

Nelder-Mead is a pattern search optimization algorithm, which means it does not require or use function gradient information and is appropriate for optimization problems where the gradient of the function is unknown or cannot be reasonably computed.

It is often used for multidimensional nonlinear function optimization problems, although it can get stuck in local optima.

Practical performance of the Nelder–Mead algorithm is often reasonable, though stagnation has been observed to occur at nonoptimal points. Restarting can be used when stagnation is detected.

— Page 239, Numerical Optimization, 2006.

A starting point must be provided to the algorithm, which may be the endpoint of another global optimization algorithm or a random point drawn from the domain.

Given that the algorithm may get stuck, it may benefit from multiple restarts with different starting points.

The Nelder-Mead simplex method uses a simplex to traverse the space in search of a minimum.

— Page 105, Algorithms for Optimization, 2019.

The algorithm works by using a shape structure (called a simplex) composed of n + 1 points (vertices), where n is the number of input dimensions to the function.

For example, on a two-dimensional problem that may be plotted as a surface, the shape structure would be composed of three points represented as a triangle.

The Nelder-Mead method uses a series of rules that dictate how the simplex is updated based on evaluations of the objective function at its vertices.

— Page 106, Algorithms for Optimization, 2019.

The points of the shape structure are evaluated and simple rules are used to decide how to move the points of the shape based on their relative evaluation. This includes operations such as “reflection,” “expansion,” “contraction,” and “shrinkage” of the simplex shape on the surface of the objective function.

In a single iteration of the Nelder–Mead algorithm, we seek to remove the vertex with the worst function value and replace it with another point with a better value. The new point is obtained by reflecting, expanding, or contracting the simplex along the line joining the worst vertex with the centroid of the remaining vertices. If we cannot find a better point in this manner, we retain only the vertex with the best function value, and we shrink the simplex by moving all other vertices toward this value.

— Page 238, Numerical Optimization, 2006.

The search stops when the points converge on an optimum, when a minimum difference between evaluations is observed, or when a maximum number of function evaluations are performed.

Now that we have a high-level idea of how the algorithm works, let’s look at how we might use it in practice.

Nelder-Mead Example in Python

The Nelder-Mead optimization algorithm can be used in Python via the minimize() function from the SciPy library.

This function requires that the “method” argument be set to “nelder-mead” to use the Nelder-Mead algorithm. It takes the objective function to be minimized and an initial point for the search.

...
# perform the search
result = minimize(objective, pt, method='nelder-mead')

The result is an OptimizeResult object that contains information about the result of the optimization accessible via keys.

For example, the “success” boolean indicates whether the search was completed successfully or not, the “message” provides a human-readable message about the success or failure of the search, and the “nfev” key indicates the number of function evaluations that were performed.

Importantly, the “x” key specifies the input values that indicate the optima found by the search, if successful.

...
# summarize the result
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
print('Solution: %s' % result['x'])

We can demonstrate the Nelder-Mead optimization algorithm on a well-behaved function to show that it can quickly and efficiently find the optima without using any derivative information from the function.

In this case, we will use the x^2 function in two dimensions, defined in the range -5.0 to 5.0 with the known optima at [0.0, 0.0].

We can define the objective() function below.

# objective function
def objective(x):
	return x[0]**2.0 + x[1]**2.0

We will use a random point in the defined domain as a starting point for the search.

...
# define range for input
r_min, r_max = -5.0, 5.0
# define the starting point as a random sample from the domain
pt = r_min + rand(2) * (r_max - r_min)

The search can then be performed. We use the default maximum number of function evaluations, set via the “maxiter” option, which defaults to N*200, where N is the number of input variables, which is two in this case, e.g. 400 evaluations.

...
# perform the search
result = minimize(objective, pt, method='nelder-mead')
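
If you want to change this cap, the “maxiter” option can be set explicitly via the “options” argument; a hedged sketch:

...
# perform the search with an explicit cap on iterations
result = minimize(objective, pt, method='nelder-mead', options={'maxiter': 100})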

After the search is finished, we will report the total function evaluations used to find the optima and the success message of the search, which we expect to be positive in this case.

...
# summarize the result
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])

Finally, we will retrieve the input values for the located optima, evaluate them using the objective function, and report both in a human-readable manner.

...
# evaluate solution
solution = result['x']
evaluation = objective(solution)
print('Solution: f(%s) = %.5f' % (solution, evaluation))

Tying this together, the complete example of using the Nelder-Mead optimization algorithm on a simple convex objective function is listed below.

# nelder-mead optimization of a convex function
from scipy.optimize import minimize
from numpy.random import rand

# objective function
def objective(x):
	return x[0]**2.0 + x[1]**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# define the starting point as a random sample from the domain
pt = r_min + rand(2) * (r_max - r_min)
# perform the search
result = minimize(objective, pt, method='nelder-mead')
# summarize the result
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
# evaluate solution
solution = result['x']
evaluation = objective(solution)
print('Solution: f(%s) = %.5f' % (solution, evaluation))

Running the example executes the optimization, then reports the results.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the search was successful, as we expected, and was completed after 88 function evaluations.

We can see that the optima was located with inputs very close to [0,0], which evaluates to the minimum objective value of 0.0.

Status: Optimization terminated successfully.
Total Evaluations: 88
Solution: f([ 2.25680716e-05 -3.87021351e-05]) = 0.00000

Now that we have seen how to use the Nelder-Mead optimization algorithm successfully, let’s look at some examples where it does not perform so well.

Nelder-Mead on Challenging Functions

The Nelder-Mead optimization algorithm works well for a range of challenging nonlinear and non-differentiable objective functions.

Nevertheless, it can get stuck on multimodal optimization problems and noisy problems.

To make this concrete, let’s look at an example of each.

Noisy Optimization Problem

A noisy objective function is a function that gives different answers each time the same input is evaluated.

We can make an objective function artificially noisy by adding small Gaussian random numbers to the inputs prior to their evaluation.

For example, we can define a one-dimensional version of the x^2 function and use the randn() function to add small Gaussian random numbers to the input with a mean of 0.0 and a standard deviation of 0.3.

# objective function
def objective(x):
	return (x + randn(len(x))*0.3)**2.0

The noise will make the function challenging to optimize for the algorithm and it will very likely not locate the optima at x=0.0.

The complete example of using Nelder-Mead to optimize the noisy objective function is listed below.

# nelder-mead optimization of noisy one-dimensional convex function
from scipy.optimize import minimize
from numpy.random import rand
from numpy.random import randn

# objective function
def objective(x):
	return (x + randn(len(x))*0.3)**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# define the starting point as a random sample from the domain
pt = r_min + rand(1) * (r_max - r_min)
# perform the search
result = minimize(objective, pt, method='nelder-mead')
# summarize the result
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
# evaluate solution
solution = result['x']
evaluation = objective(solution)
print('Solution: f(%s) = %.5f' % (solution, evaluation))

Running the example executes the optimization, then reports the results.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the algorithm does not converge and instead uses the maximum number of function evaluations, which is 200.

Status: Maximum number of function evaluations has been exceeded.
Total Evaluations: 200
Solution: f([-0.6918238]) = 0.79431

The algorithm may converge on some runs of the code but will arrive on a point away from the optima.

Multimodal Optimization Problem

Many nonlinear objective functions may have multiple optima, referred to as multimodal problems.

The problem may be structured such that it has multiple global optima that have an equivalent function evaluation, or a single global optima and multiple local optima where algorithms like the Nelder-Mead can get stuck in search of the local optima.

The Ackley function is an example of the latter. It is a two-dimensional objective function that has a global optima at [0,0] but has many local optima.

The example below implements the Ackley function and creates a three-dimensional plot showing the global optima and multiple local optima.

# ackley multimodal function
from numpy import arange
from numpy import exp
from numpy import sqrt
from numpy import cos
from numpy import e
from numpy import pi
from numpy import meshgrid
from matplotlib import pyplot
from mpl_toolkits.mplot3d import Axes3D

# objective function
def objective(x, y):
	return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.add_subplot(111, projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()

Running the example creates the surface plot of the Ackley function showing the vast number of local optima.

3D Surface Plot of the Ackley Multimodal Function

3D Surface Plot of the Ackley Multimodal Function

We would expect the Nelder-Mead algorithm to get stuck in one of the local optima while in search of the global optima.

Initially, when the simplex is large, the algorithm may jump over many local optima, but as it contracts, it will get stuck.

We can explore this with the example below that demonstrates the Nelder-Mead algorithm on the Ackley function.

# nelder-mead for multimodal function optimization
from scipy.optimize import minimize
from numpy.random import rand
from numpy import exp
from numpy import sqrt
from numpy import cos
from numpy import e
from numpy import pi

# objective function
def objective(v):
	x, y = v
	return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20

# define range for input
r_min, r_max = -5.0, 5.0
# define the starting point as a random sample from the domain
pt = r_min + rand(2) * (r_max - r_min)
# perform the search
result = minimize(objective, pt, method='nelder-mead')
# summarize the result
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
# evaluate solution
solution = result['x']
evaluation = objective(solution)
print('Solution: f(%s) = %.5f' % (solution, evaluation))

Running the example executes the optimization, then reports the results.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the search completed successfully but did not locate the global optima. It got stuck and found a local optima.

Each time we run the example, we will find a different local optima given the different random starting point for the search.

Status: Optimization terminated successfully.
Total Evaluations: 62
Solution: f([-4.9831427 -3.98656015]) = 11.90126
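
One practical mitigation, noted earlier, is to restart the search from multiple random starting points and keep the best result. A hedged sketch of this strategy on the Ackley function (the number of restarts is an illustrative assumption):

# hedged sketch: restart nelder-mead from multiple random points and keep the best result
from scipy.optimize import minimize
from numpy.random import rand
from numpy import exp
from numpy import sqrt
from numpy import cos
from numpy import e
from numpy import pi

# ackley objective function (as above)
def objective(v):
	x, y = v
	return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20

# define range for input
r_min, r_max = -5.0, 5.0
best = None
# restart the search several times from different random starting points
for _ in range(30):
	pt = r_min + rand(2) * (r_max - r_min)
	result = minimize(objective, pt, method='nelder-mead')
	# keep the restart with the lowest objective value found so far
	if best is None or result['fun'] < best['fun']:
		best = result
# report the best solution found across all restarts
print('Best: f(%s) = %.5f' % (best['x'], best['fun']))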

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Papers

Books

APIs

Articles

Summary

In this tutorial, you discovered the Nelder-Mead optimization algorithm.

Specifically, you learned:

  • The Nelder-Mead optimization algorithm is a type of pattern search that does not use function gradients.
  • How to apply the Nelder-Mead algorithm for function optimization in Python.
  • How to interpret the results of the Nelder-Mead algorithm on noisy and multimodal objective functions.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post How to Use Nelder-Mead Optimization in Python appeared first on Machine Learning Mastery.
