
How to Develop a Neural Net for Predicting Car Insurance Payout

Developing a neural network predictive model for a new dataset can be challenging.

One approach is to first inspect the dataset and develop ideas for what models might work, then explore the learning dynamics of simple models on the dataset, then finally develop and tune a model for the dataset with a robust test harness.

This process can be used to develop effective neural network models for classification and regression predictive modeling problems.

In this tutorial, you will discover how to develop a Multilayer Perceptron neural network model for the Swedish car insurance regression dataset.

After completing this tutorial, you will know:

  • How to load and summarize the Swedish car insurance dataset and use the results to suggest data preparations and model configurations to use.
  • How to explore the learning dynamics of simple MLP models and data transforms on the dataset.
  • How to develop robust estimates of model performance, tune model performance, and make predictions on new data.

Let’s get started.

How to Develop a Neural Net for Predicting Car Insurance Payout
Photo by Dimitry B., some rights reserved.

Tutorial Overview

This tutorial is divided into four parts; they are:

  1. Auto Insurance Regression Dataset
  2. First MLP and Learning Dynamics
  3. Evaluating and Tuning MLP Models
  4. Final Model and Make Predictions

Auto Insurance Regression Dataset

The first step is to define and explore the dataset.

We will be working with the “Auto Insurance” standard regression dataset.

The dataset describes Swedish car insurance. There is a single input variable, which is the number of claims, and the target variable is a total payment for the claims in thousands of Swedish krona. The goal is to predict the total payment given the number of claims.

You can learn more about the dataset here:

You can see the first few rows of the dataset below.

108,392.5
19,46.2
13,15.7
124,422.2
40,119.4
...

We can see that the values are numeric and may range from tens to hundreds. This suggests some type of scaling would be appropriate for the data when modeling with a neural network.

We can load the dataset as a pandas DataFrame directly from the URL; for example:

# load the dataset and summarize the shape
from pandas import read_csv
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
# load the dataset
df = read_csv(url, header=None)
# summarize shape
print(df.shape)

Running the example loads the dataset directly from the URL and reports the shape of the dataset.

In this case, we can confirm that the dataset has two variables (one input and one output) and that the dataset has 63 rows of data.

This is not many rows of data for a neural network and suggests that a small network, perhaps with regularization, would be appropriate.

It also suggests that using k-fold cross-validation would be a good idea, given that it will provide a more reliable estimate of model performance than a train/test split, and because a single model will fit in seconds instead of the hours or days required for the largest datasets.

(63, 2)

Next, we can learn more about the dataset by looking at summary statistics and a plot of the data.

# show summary statistics and plots of the dataset
from pandas import read_csv
from matplotlib import pyplot
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
# load the dataset
df = read_csv(url, header=None)
# show summary statistics
print(df.describe())
# plot histograms
df.hist()
pyplot.show()

Running the example first loads the data and then prints summary statistics for each variable.

We can see that the mean value for each variable is in the tens, with values ranging from 0 to the hundreds. This confirms that scaling the data is probably a good idea.

0           1
count   63.000000   63.000000
mean    22.904762   98.187302
std     23.351946   87.327553
min      0.000000    0.000000
25%      7.500000   38.850000
50%     14.000000   73.400000
75%     29.000000  140.000000
max    124.000000  422.200000

A histogram plot is then created for each variable.

We can see that each variable has a similar distribution. It looks like a skewed Gaussian distribution or an exponential distribution.

We may have some benefit in using a power transform on each variable in order to make the probability distribution less skewed, which will likely improve model performance.
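
As a quick check, a sketch like the following applies the (default Yeo-Johnson) power transform to both columns and re-plots the histograms; if the transform helps, the distributions should appear more Gaussian. This is an extra diagnostic step, not part of the original listing.

# check the effect of a power transform on the distributions (a quick sketch)
from pandas import read_csv
from pandas import DataFrame
from sklearn.preprocessing import PowerTransformer
from matplotlib import pyplot
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
# load the dataset
df = read_csv(url, header=None)
# apply the power transform (default yeo-johnson) to both columns
data = PowerTransformer().fit_transform(df.values)
# plot histograms of the transformed variables
DataFrame(data).hist()
pyplot.show()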

Histograms of the Auto Insurance Regression Dataset

Now that we are familiar with the dataset, let’s explore how we might develop a neural network model.

First MLP and Learning Dynamics

We will develop a Multilayer Perceptron (MLP) model for the dataset using TensorFlow.

We cannot know what model architecture or learning hyperparameters would be good or best for this dataset, so we must experiment and discover what works well.

Given that the dataset is small, a small batch size is probably a good idea, e.g. 8 or 16 rows. Using the Adam version of stochastic gradient descent is a good idea when getting started as it will automatically adapt the learning rate and works well on most datasets.

Before we evaluate models in earnest, it is a good idea to review the learning dynamics and tune the model architecture and learning configuration until we have stable learning dynamics, then look at getting the most out of the model.

We can do this by using a simple train/test split of the data and reviewing plots of the learning curves. This will help us see whether we are over-learning or under-learning; then we can adapt the configuration accordingly.

First, we can split the dataset into input and output variables, then into 67/33 train and test sets.

...
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

Next, we can define a minimal MLP model. In this case, we will use one hidden layer with 10 nodes and one output layer (chosen arbitrarily). We will use the ReLU activation function in the hidden layer and the “he_normal” weight initialization, as together they are a good practice.

The output layer of the model uses a linear activation (no activation), and we will minimize mean squared error (MSE) loss.

...
# determine the number of input features
n_features = X.shape[1]
# define model
model = Sequential()
model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(1))
# compile the model
model.compile(optimizer='adam', loss='mse')

We will fit the model for 100 training epochs (chosen arbitrarily) with a batch size of eight because it is a small dataset.

We are fitting the model on raw data, which we think might be a bad idea, but it is an important starting point.

...
# fit the model
history = model.fit(X_train, y_train, epochs=100, batch_size=8, verbose=0, validation_data=(X_test,y_test))

At the end of training, we will evaluate the model’s performance on the test dataset and report performance as the mean absolute error (MAE), which I typically prefer over MSE or RMSE.

...
# predict test set
yhat = model.predict(X_test)
# evaluate predictions
score = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % score)

Finally, we will plot learning curves of the MSE loss on the train and test sets during training.

...
# plot learning curves
pyplot.title('Learning Curves')
pyplot.xlabel('Epoch')
pyplot.ylabel('Mean Squared Error')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='val')
pyplot.legend()
pyplot.show()

Tying this all together, the complete example of evaluating our first MLP on the auto insurance dataset is listed below.

# fit a simple mlp model and review learning curves
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from matplotlib import pyplot
# load the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
df = read_csv(path, header=None)
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# determine the number of input features
n_features = X.shape[1]
# define model
model = Sequential()
model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(1))
# compile the model
model.compile(optimizer='adam', loss='mse')
# fit the model
history = model.fit(X_train, y_train, epochs=100, batch_size=8, verbose=0, validation_data=(X_test,y_test))
# predict test set
yhat = model.predict(X_test)
# evaluate predictions
score = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % score)
# plot learning curves
pyplot.title('Learning Curves')
pyplot.xlabel('Epoch')
pyplot.ylabel('Mean Squared Error')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='val')
pyplot.legend()
pyplot.show()

Running the example first fits the model on the training dataset, then reports the MAE on the test dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved an MAE of about 33.2, which is a good baseline level of performance that we might be able to improve upon.

MAE: 33.233

Line plots of the MSE on the train and test sets are then created.

We can see that the model has a good fit and converges nicely. The configuration of the model is a good starting point.

Learning Curves of Simple MLP on Auto Insurance Dataset

The learning dynamics are good so far, but the MAE from a single train/test split is a rough estimate and should not be relied upon.

We can probably increase the capacity of the model a little and expect similar learning dynamics. For example, we can add a second hidden layer with eight nodes (chosen arbitrarily) and double the number of training epochs to 200.

...
# define model
model = Sequential()
model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(8, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(1))
# compile the model
model.compile(optimizer='adam', loss='mse')
# fit the model
history = model.fit(X_train, y_train, epochs=200, batch_size=8, verbose=0, validation_data=(X_test,y_test))

The complete example is listed below.

# fit a deeper mlp model and review learning curves
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from matplotlib import pyplot
# load the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
df = read_csv(path, header=None)
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# determine the number of input features
n_features = X.shape[1]
# define model
model = Sequential()
model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(8, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(1))
# compile the model
model.compile(optimizer='adam', loss='mse')
# fit the model
history = model.fit(X_train, y_train, epochs=200, batch_size=8, verbose=0, validation_data=(X_test,y_test))
# predict test set
yhat = model.predict(X_test)
# evaluate predictions
score = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % score)
# plot learning curves
pyplot.title('Learning Curves')
pyplot.xlabel('Epoch')
pyplot.ylabel('Mean Squared Error')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='val')
pyplot.legend()
pyplot.show()

Running the example first fits the model on the training dataset, then reports the MAE on the test dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see a slight improvement in MAE to about 27.9, although the high variance of the train/test split means that this evaluation is not reliable.

MAE: 27.939

Learning curves of the MSE on the train and test sets are then plotted. We can see that, as expected, the model achieves a good fit and converges within a reasonable number of iterations.

Learning Curves of Deeper MLP on Auto Insurance Dataset

Finally, we can try transforming the data and see how this impacts the learning dynamics.

In this case, we will use a power transform to make the data distribution less skewed. This will also automatically standardize the variables so that they have a mean of zero and a standard deviation of one — a good practice when modeling with a neural network.

First, we must ensure that the target variable is a two-dimensional array.

...
# ensure that the target variable is a 2d array
y_train, y_test = y_train.reshape((len(y_train),1)), y_test.reshape((len(y_test),1))

Next, we can apply a PowerTransformer to the input and target variables.

This can be achieved by fitting the transform on the training data only, then applying it to both the train and test sets, which avoids data leakage. The process is performed separately for the input and output variables.

...
# power transform input data
pt1 = PowerTransformer()
pt1.fit(X_train)
X_train = pt1.transform(X_train)
X_test = pt1.transform(X_test)
# power transform output data
pt2 = PowerTransformer()
pt2.fit(y_train)
y_train = pt2.transform(y_train)
y_test = pt2.transform(y_test)

The data is then used to fit the model.

The transform on the target can then be inverted, both on the predictions made by the model and on the expected target values from the test set, so we can calculate the MAE in the original scale as before.

...
# inverse transforms on target variable
y_test = pt2.inverse_transform(y_test)
yhat = pt2.inverse_transform(yhat)

Tying this together, the complete example of fitting and evaluating an MLP with transformed data and creating learning curves of the model is listed below.

# fit a mlp model with data transforms and review learning curves
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import PowerTransformer
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from matplotlib import pyplot
# load the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
df = read_csv(path, header=None)
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# ensure that the target variable is a 2d array
y_train, y_test = y_train.reshape((len(y_train),1)), y_test.reshape((len(y_test),1))
# power transform input data
pt1 = PowerTransformer()
pt1.fit(X_train)
X_train = pt1.transform(X_train)
X_test = pt1.transform(X_test)
# power transform output data
pt2 = PowerTransformer()
pt2.fit(y_train)
y_train = pt2.transform(y_train)
y_test = pt2.transform(y_test)
# determine the number of input features
n_features = X.shape[1]
# define model
model = Sequential()
model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(8, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(1))
# compile the model
model.compile(optimizer='adam', loss='mse')
# fit the model
history = model.fit(X_train, y_train, epochs=200, batch_size=8, verbose=0, validation_data=(X_test,y_test))
# predict test set
yhat = model.predict(X_test)
# inverse transforms on target variable
y_test = pt2.inverse_transform(y_test)
yhat = pt2.inverse_transform(yhat)
# evaluate predictions
score = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % score)
# plot learning curves
pyplot.title('Learning Curves')
pyplot.xlabel('Epoch')
pyplot.ylabel('Mean Squared Error')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='val')
pyplot.legend()
pyplot.show()

Running the example first fits the model on the training dataset, then reports the MAE on the test dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the model achieves a reasonable MAE score, although worse than the performance reported previously. We will ignore model performance for now.

MAE: 34.320

Line plots of the learning curves are created showing that the model achieved a reasonable fit and had more than enough time to converge.

Learning Curves of Deeper MLP With Data Transforms on the Auto Insurance Dataset

Now that we have some idea of the learning dynamics for simple MLP models with and without data transforms, we can look at evaluating the performance of the models as well as tuning the configuration of the models.

Evaluating and Tuning MLP Models

The k-fold cross-validation procedure can provide a more reliable estimate of MLP performance, although it can be very slow.

This is because k models must be fit and evaluated. This is not a problem when the dataset is small, such as with the auto insurance dataset.

We can use the KFold class to create the splits and enumerate each fold manually, fit the model, evaluate it, and then report the mean of the evaluation scores at the end of the procedure.

# prepare cross validation
kfold = KFold(10)
# enumerate splits
scores = list()
for train_ix, test_ix in kfold.split(X, y):
	# fit and evaluate the model...
	...
...
# summarize all scores
print('Mean MAE: %.3f (%.3f)' % (mean(scores), std(scores)))

We can use this framework to develop a reliable estimate of MLP model performance with a range of different data preparations, model architectures, and learning configurations.

It is important that we first developed an understanding of the learning dynamics of the model on the dataset in the previous section before using k-fold cross-validation to estimate performance. If we started to tune the model directly, we might get good results, but if not, we might have no idea why, e.g. whether the model was overfitting or underfitting.

If we make large changes to the model again, it is a good idea to go back and confirm that the model is converging appropriately.

The complete example of this framework to evaluate the base MLP model from the previous section is listed below.

# k-fold cross-validation of base model for the auto insurance regression dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from matplotlib import pyplot
# load the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
df = read_csv(path, header=None)
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# prepare cross validation
kfold = KFold(10)
# enumerate splits
scores = list()
for train_ix, test_ix in kfold.split(X, y):
	# split data
	X_train, X_test, y_train, y_test = X[train_ix], X[test_ix], y[train_ix], y[test_ix]
	# determine the number of input features
	n_features = X.shape[1]
	# define model
	model = Sequential()
	model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
	model.add(Dense(1))
	# compile the model
	model.compile(optimizer='adam', loss='mse')
	# fit the model
	model.fit(X_train, y_train, epochs=100, batch_size=8, verbose=0)
	# predict test set
	yhat = model.predict(X_test)
	# evaluate predictions
	score = mean_absolute_error(y_test, yhat)
	print('>%.3f' % score)
	scores.append(score)
# summarize all scores
print('Mean MAE: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example reports the model performance for each iteration of the evaluation procedure and reports the mean and standard deviation of the MAE at the end of the run.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the MLP model achieved an MAE of about 38.913.

We will use this result as our baseline to see if we can achieve better performance.

>27.314
>69.577
>20.891
>14.810
>13.412
>69.540
>25.612
>49.508
>35.769
>62.696
Mean MAE: 38.913 (21.056)

First, let’s try evaluating a deeper model on the raw dataset to see if it performs any better than a baseline model.

...
# define model
model = Sequential()
model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(8, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(1))
# compile the model
model.compile(optimizer='adam', loss='mse')
# fit the model
model.fit(X_train, y_train, epochs=200, batch_size=8, verbose=0)

The complete example is listed below.

# k-fold cross-validation of deeper model for the auto insurance regression dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from matplotlib import pyplot
# load the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
df = read_csv(path, header=None)
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# prepare cross validation
kfold = KFold(10)
# enumerate splits
scores = list()
for train_ix, test_ix in kfold.split(X, y):
	# split data
	X_train, X_test, y_train, y_test = X[train_ix], X[test_ix], y[train_ix], y[test_ix]
	# determine the number of input features
	n_features = X.shape[1]
	# define model
	model = Sequential()
	model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
	model.add(Dense(8, activation='relu', kernel_initializer='he_normal'))
	model.add(Dense(1))
	# compile the model
	model.compile(optimizer='adam', loss='mse')
	# fit the model
	model.fit(X_train, y_train, epochs=200, batch_size=8, verbose=0)
	# predict test set
	yhat = model.predict(X_test)
	# evaluate predictions
	score = mean_absolute_error(y_test, yhat)
	print('>%.3f' % score)
	scores.append(score)
# summarize all scores
print('Mean MAE: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example reports the mean and standard deviation of the MAE at the end of the run.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the MLP model achieved an MAE of about 35.384, which is slightly better than the baseline model that achieved an MAE of about 38.913.

Mean MAE: 35.384 (14.951)

Next, let’s try using the same model with a power transform for the input and target variables as we did in the previous section.

The complete example is listed below.

# k-fold cross-validation of deeper model with data transforms
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import PowerTransformer
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from matplotlib import pyplot
# load the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
df = read_csv(path, header=None)
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# prepare cross validation
kfold = KFold(10)
# enumerate splits
scores = list()
for train_ix, test_ix in kfold.split(X, y):
	# split data
	X_train, X_test, y_train, y_test = X[train_ix], X[test_ix], y[train_ix], y[test_ix]
	# ensure target is a 2d array
	y_train, y_test = y_train.reshape((len(y_train),1)), y_test.reshape((len(y_test),1))
	# prepare input data
	pt1 = PowerTransformer()
	pt1.fit(X_train)
	X_train = pt1.transform(X_train)
	X_test = pt1.transform(X_test)
	# prepare target
	pt2 = PowerTransformer()
	pt2.fit(y_train)
	y_train = pt2.transform(y_train)
	y_test = pt2.transform(y_test)
	# determine the number of input features
	n_features = X.shape[1]
	# define model
	model = Sequential()
	model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
	model.add(Dense(8, activation='relu', kernel_initializer='he_normal'))
	model.add(Dense(1))
	# compile the model
	model.compile(optimizer='adam', loss='mse')
	# fit the model
	model.fit(X_train, y_train, epochs=200, batch_size=8, verbose=0)
	# predict test set
	yhat = model.predict(X_test)
	# inverse transforms
	y_test = pt2.inverse_transform(y_test)
	yhat = pt2.inverse_transform(yhat)
	# evaluate predictions
	score = mean_absolute_error(y_test, yhat)
	print('>%.3f' % score)
	scores.append(score)
# summarize all scores
print('Mean MAE: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example reports the mean and standard deviation of the MAE at the end of the run.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the MLP model achieved an MAE of about 37.371, which is better than the baseline model, but not better than the deeper model evaluated on the raw data.

Perhaps this transform is not as helpful as we initially thought.

Mean MAE: 37.371 (29.326)

An alternate transform is to normalize the input and target variables.

This means scaling the values of each variable to the range [0, 1]. We can achieve this using the MinMaxScaler; for example:

...
# prepare input data
pt1 = MinMaxScaler()
pt1.fit(X_train)
X_train = pt1.transform(X_train)
X_test = pt1.transform(X_test)
# prepare target
pt2 = MinMaxScaler()
pt2.fit(y_train)
y_train = pt2.transform(y_train)
y_test = pt2.transform(y_test)
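
The MinMaxScaler learns the minimum and maximum of each column from the training data, then scales each value as (x - min) / (max - min).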

Tying this together, the complete example of evaluating the deeper MLP with data normalization is listed below.

# k-fold cross-validation of deeper model with normalization transforms
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from matplotlib import pyplot
# load the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
df = read_csv(path, header=None)
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# prepare cross validation
kfold = KFold(10)
# enumerate splits
scores = list()
for train_ix, test_ix in kfold.split(X, y):
	# split data
	X_train, X_test, y_train, y_test = X[train_ix], X[test_ix], y[train_ix], y[test_ix]
	# ensure target is a 2d array
	y_train, y_test = y_train.reshape((len(y_train),1)), y_test.reshape((len(y_test),1))
	# prepare input data
	pt1 = MinMaxScaler()
	pt1.fit(X_train)
	X_train = pt1.transform(X_train)
	X_test = pt1.transform(X_test)
	# prepare target
	pt2 = MinMaxScaler()
	pt2.fit(y_train)
	y_train = pt2.transform(y_train)
	y_test = pt2.transform(y_test)
	# determine the number of input features
	n_features = X.shape[1]
	# define model
	model = Sequential()
	model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
	model.add(Dense(8, activation='relu', kernel_initializer='he_normal'))
	model.add(Dense(1))
	# compile the model
	model.compile(optimizer='adam', loss='mse')
	# fit the model
	model.fit(X_train, y_train, epochs=200, batch_size=8, verbose=0)
	# predict test set
	yhat = model.predict(X_test)
	# inverse transforms
	y_test = pt2.inverse_transform(y_test)
	yhat = pt2.inverse_transform(yhat)
	# evaluate predictions
	score = mean_absolute_error(y_test, yhat)
	print('>%.3f' % score)
	scores.append(score)
# summarize all scores
print('Mean MAE: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example reports the mean and standard deviation of the MAE at the end of the run.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the MLP model achieved an MAE of about 30.388, which is better than any other configuration we have tried so far.

Mean MAE: 30.388 (14.258)

We could continue to test alternate configurations of the model architecture (more or fewer nodes or layers), learning hyperparameters (e.g. more or fewer training epochs, or a different batch size), and data transforms.

I leave this as an exercise; let me know what you discover. Can you get better results?
Post your results in the comments below, I’d love to see what you get.

Next, let’s look at how we might fit a final model and use it to make predictions.

Final Model and Make Predictions

Once we choose a model configuration, we can train a final model on all available data and use it to make predictions on new data.

In this case, we will use the deeper model with data normalization as our final model.

This means that if we wanted to save the model to file, we would have to save the model itself (for making predictions), the transform for the input data (for new input data), and the transform for the target variable (for new predictions); a sketch of this is shown below.
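
A minimal sketch of saving these three artifacts might look like the following, assuming the model, pt1, and pt2 objects from the final-model code below; the file names are illustrative.

...
# save the model and both transforms for later use (file names are illustrative)
from pickle import dump
model.save('model.h5')
dump(pt1, open('input_transform.pkl', 'wb'))
dump(pt2, open('target_transform.pkl', 'wb'))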

We can prepare the data and fit the model as before, although on the entire dataset instead of a training subset of the dataset.

...
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# ensure target is a 2d array
y = y.reshape((len(y),1))
# prepare input data
pt1 = MinMaxScaler()
pt1.fit(X)
X = pt1.transform(X)
# prepare target
pt2 = MinMaxScaler()
pt2.fit(y)
y = pt2.transform(y)
# determine the number of input features
n_features = X.shape[1]
# define model
model = Sequential()
model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(8, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(1))
# compile the model
model.compile(optimizer='adam', loss='mse')

We can then use this model to make predictions on new data.

First, we can define a row of new data, which is just one variable for this dataset.

...
# define a row of new data
row = [13]

We can then transform the new data so that it is ready to be used as input to the model.

...
# transform the input data
X_new = pt1.transform([row])

We can then make a prediction.

...
# make prediction
yhat = model.predict(X_new)

Then invert the transform on the prediction so we can use or interpret the result in the correct scale.

...
# invert transform on prediction
yhat = pt2.inverse_transform(yhat)

And in this case, we will simply report the prediction.

...
# report prediction
print('f(%s) = %.3f' % (row, yhat[0]))

Tying this all together, the complete example of fitting a final model for the auto insurance dataset and using it to make a prediction on new data is listed below.

# fit a final model and make predictions on new data.
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
# load the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
df = read_csv(path, header=None)
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# ensure target is a 2d array
y = y.reshape((len(y),1))
# prepare input data
pt1 = MinMaxScaler()
pt1.fit(X)
X = pt1.transform(X)
# prepare target
pt2 = MinMaxScaler()
pt2.fit(y)
y = pt2.transform(y)
# determine the number of input features
n_features = X.shape[1]
# define model
model = Sequential()
model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(8, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(1))
# compile the model
model.compile(optimizer='adam', loss='mse')
# fit the model
model.fit(X, y, epochs=200, batch_size=8, verbose=0)
# define a row of new data
row = [13]
# transform the input data
X_new = pt1.transform([row])
# make prediction
yhat = model.predict(X_new)
# invert transform on prediction
yhat = pt2.inverse_transform(yhat)
# report prediction
print('f(%s) = %.3f' % (row, yhat[0]))

Running the example fits the model on the entire dataset and makes a prediction for a single row of new data.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that an input of 13 results in an output of about 62.6 (thousand Swedish krona).

f([13]) = 62.595


Summary

In this tutorial, you discovered how to develop a Multilayer Perceptron neural network model for the Swedish car insurance regression dataset.

Specifically, you learned:

  • How to load and summarize the Swedish car insurance dataset and use the results to suggest data preparations and model configurations to use.
  • How to explore the learning dynamics of simple MLP models and data transforms on the dataset.
  • How to develop robust estimates of model performance, tune model performance, and make predictions on new data.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post How to Develop a Neural Net for Predicting Car Insurance Payout appeared first on Machine Learning Mastery.


Local Optimization Versus Global Optimization

Optimization refers to finding the set of inputs to an objective function that results in the maximum or minimum output from the objective function.

It is common to describe optimization problems in terms of local vs. global optimization.

Similarly, it is also common to describe optimization algorithms or search algorithms in terms of local vs. global search.

In this tutorial, you will discover the practical differences between local and global optimization.

After completing this tutorial, you will know:

  • Local optimization involves finding the optimal solution for a specific region of the search space, or the global optima for problems with no local optima.
  • Global optimization involves finding the optimal solution on problems that contain local optima.
  • How and when to use local and global search algorithms and how to use both methods in concert.

Let’s get started.

Local Optimization Versus Global Optimization
Photo by Marco Verch, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Local Optimization
  2. Global Optimization
  3. Local vs. Global Optimization

Local Optimization

A local optimum is an extremum (minimum or maximum) of the objective function within a given region of the input space, e.g. a basin in a minimization problem.

… we seek a point that is only locally optimal, which means that it minimizes the objective function among feasible points that are near it …

— Page 9, Convex Optimization, 2004.

An objective function may have many local optima, or it may have a single local optimum, in which case the local optimum is also the global optimum.

  • Local Optimization: Locate the optimum of an objective function from a starting point believed to lie in the region containing the optimum (e.g. a basin).

Local optimization or local search refers to searching for a local optimum.

A local optimization algorithm, also called a local search algorithm, is an algorithm intended to locate a local optimum. It is suited to traversing a given region of the search space and getting close to (or finding exactly) the extremum of the function in that region.

… local optimization methods are widely used in applications where there is value in finding a good point, if not the very best.

— Page 9, Convex Optimization, 2004.

Local search algorithms typically operate on a single candidate solution and involve iteratively making small changes to it. Each change is evaluated, and if it leads to an improvement, it is taken as the new candidate solution.
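
For example, a minimal sketch of this kind of local search (stochastic hill climbing) on the one-variable function f(x) = x^2 might look like the following; the starting point, step size, and number of iterations are arbitrary choices.

# minimal stochastic hill climbing on f(x) = x**2 (all values arbitrary)
from numpy.random import randn

def objective(x):
	return x ** 2.0

# starting point and its evaluation
candidate = 5.0
candidate_eval = objective(candidate)
for i in range(100):
	# make a small random change to the candidate
	step = candidate + randn() * 0.1
	step_eval = objective(step)
	# keep the change only if it does not make the evaluation worse
	if step_eval <= candidate_eval:
		candidate, candidate_eval = step, step_eval
print('f(%.5f) = %.5f' % (candidate, candidate_eval))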

A local optimization algorithm will locate the global optimum:

  • If the local optimum is the global optimum, or
  • If the region being searched contains the global optimum.

These define the ideal use cases for using a local search algorithm.

There may be debate over what exactly constitutes a local search algorithm; nevertheless, three examples of local search algorithms using our definitions include:

  • Nelder-Mead Algorithm
  • BFGS Algorithm
  • Hill-Climbing Algorithm

Now that we are familiar with local optimization, let’s take a look at global optimization.

Global Optimization

A global optimum is an extremum (minimum or maximum) of the objective function for the entire input search space.

Global optimization, where the algorithm searches for the global optimum by employing mechanisms to search larger parts of the search space.

— Page 37, Computational Intelligence: An Introduction, 2007.

An objective function may have one or more global optima; if it has more than one, the problem is referred to as a multimodal optimization problem, and each optimum has a different input with the same objective function evaluation.

  • Global Optimization: Locate the optimum of an objective function that may contain local optima.

An objective function always has a global optimum (otherwise we would not be interested in optimizing it), although it may also have local optima whose objective function evaluations are not as good as that of the global optimum.

The global optimum may be the same as a local optimum, in which case it would be more appropriate to refer to the optimization problem as local optimization instead of global optimization.

The presence of local optima is a major component of what defines the difficulty of a global optimization problem, as it may be relatively easy to locate a local optimum and relatively difficult to locate the global optimum.

Global optimization or global search refers to searching for the global optimum.

A global optimization algorithm, also called a global search algorithm, is intended to locate a global optimum. It is suited to traversing the entire input search space and getting close to (or finding exactly) the extremum of the function.

Global optimization is used for problems with a small number of variables, where computing time is not critical, and the value of finding the true global solution is very high.

— Page 9, Convex Optimization, 2004.

Global search algorithms may manage a single candidate solution or a population of candidate solutions, from which new candidates are iteratively generated and evaluated; if a new candidate results in an improvement, it is taken as the new working state.

There may be debate over what exactly constitutes a global search algorithm; nevertheless, three examples of global search algorithms using our definitions include:

  • Genetic Algorithm
  • Simulated Annealing
  • Particle Swarm Optimization

Now that we are familiar with global and local optimization, let’s compare and contrast the two.

Local vs. Global Optimization

Local and global search optimization algorithms solve different problems or answer different questions.

A local optimization algorithm should be used when you know that you are in the region of the global optimum or that your objective function contains a single optimum, e.g. is unimodal.

A global optimization algorithm should be used when you know very little about the structure of the objective function response surface, or when you know that the function contains local optima.

Local optimization, where the algorithm may get stuck in a local optimum without finding a global optimum.

— Page 37, Computational Intelligence: An Introduction, 2007.

Applying a local search algorithm to a problem that requires a global search algorithm will deliver poor results as the local search will get caught (deceived) by local optima.

  • Local search: When you are in the region of the global optimum.
  • Global search: When you know that there are local optima.

Local search algorithms often give computational complexity guarantees related to locating the global optimum, as long as the assumptions made by the algorithm hold.

Global search algorithms often give very few, if any, guarantees about locating the global optimum. As such, global search is often used on problems that are sufficiently difficult that “good” or “good enough” solutions are preferred over no solutions at all. This might mean relatively good local optima instead of the true global optimum if locating the global optimum is intractable.

It is often appropriate to re-run or re-start the algorithm multiple times and record the optima found by each run to give some confidence that relatively good solutions have been located; a sketch of this restart strategy is shown after the list below.

  • Local search: For narrow problems where the global solution is required.
  • Global search: For broad problems where locating the global optimum might be intractable.
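
As an illustration of the restart idea above, a sketch of random restarts around a local search might look like the following; it assumes SciPy is available and uses the Nelder-Mead algorithm on a simple one-variable function.

# random restarts of a local search on f(x) = x**2 (bounds are arbitrary)
from numpy.random import uniform
from scipy.optimize import minimize

def objective(x):
	return x[0] ** 2.0

best = None
for i in range(10):
	# start each local search from a new random point
	x0 = uniform(-5.0, 5.0, 1)
	result = minimize(objective, x0, method='nelder-mead')
	# record the best optimum found across all restarts
	if best is None or result.fun < best.fun:
		best = result
print('x=%s, f(x)=%.6f' % (best.x, best.fun))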

We often know very little about the response surface for an objective function, e.g. whether a local or global search algorithm is most appropriate. Therefore, it may be desirable to establish a baseline in performance with a local search algorithm and then explore a global search algorithm to see if it can perform better. If it cannot, it may suggest that the problem is indeed unimodal or appropriate for a local search algorithm.

  • Best Practice: Establish a baseline with a local search then explore a global search on objective functions where little is known.

Local optimization is a simpler problem to solve than global optimization. As such, the vast majority of the research on mathematical optimization has been focused on local search techniques.

A large fraction of the research on general nonlinear programming has focused on methods for local optimization, which as a consequence are well developed.

— Page 9, Convex Optimization, 2004.

Global search algorithms are often coarse in their navigation of the search space.

Many population methods perform well in global search, being able to avoid local minima and finding the best regions of the design space. Unfortunately, these methods do not perform as well in local search in comparison to descent methods.

— Page 162, Algorithms for Optimization, 2019.

As such, they may locate the basin containing a good local optimum or the global optimum, but may not be able to locate the best solution within that basin.

Local and global optimization techniques can be combined to form hybrid training algorithms.

— Page 37, Computational Intelligence: An Introduction, 2007.

Therefore, it is good practice to apply a local search to the best candidate solutions found by a global search algorithm.

  • Best Practice: Apply a local search to the solutions found by a global search.


Summary

In this tutorial, you discovered the practical differences between local and global optimization.

Specifically, you learned:

  • Local optimization involves finding the optimal solution for a specific region of the search space, or the global optima for problems with no local optima.
  • Global optimization involves finding the optimal solution on problems that contain local optima.
  • How and when to use local and global search algorithms and how to use both methods in concert.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Local Optimization Versus Global Optimization appeared first on Machine Learning Mastery.

Difference Between Backpropagation and Stochastic Gradient Descent

Last Updated on February 1, 2021

There is a lot of confusion for beginners around what algorithm is used to train deep learning neural network models.

It is common to hear neural networks learn using the “back-propagation of error” algorithm or “stochastic gradient descent.” Sometimes, either of these algorithms is used as a shorthand for how a neural net is fit on a training dataset, although in many cases, there is a deep confusion as to what these algorithms are, how they are related, and how they might work together.

This tutorial is designed to make the roles of the stochastic gradient descent and back-propagation algorithms clear in training neural networks.

In this tutorial, you will discover the difference between stochastic gradient descent and the back-propagation algorithm.

After completing this tutorial, you will know:

  • Stochastic gradient descent is an optimization algorithm for minimizing the loss of a predictive model with regard to a training dataset.
  • Back-propagation is an automatic differentiation algorithm for calculating gradients for the weights in a neural network graph structure.
  • Stochastic gradient descent and the back-propagation of error algorithms together are used to train neural network models.

Let’s get started.

Difference Between Backpropagation and Stochastic Gradient Descent
Photo by Christian Collins, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  • Stochastic Gradient Descent
  • Backpropagation Algorithm
  • Stochastic Gradient Descent With Back-propagation

Stochastic Gradient Descent

Gradient Descent is an optimization algorithm that finds the set of input variables for a target function that results in the minimum value of the function.

As its name suggests, gradient descent involves calculating the gradient of the target function.

You may recall from calculus that the first-order derivative of a function calculates the slope of the function at a given point. Read left to right, a positive derivative suggests the target function is sloping uphill and a negative derivative suggests the target function is sloping downhill.

  • Derivative: Slope of a target function with respect to specific input values to the function.

If the target function takes multiple input variables, they can be taken together as a vector of variables. Working with vectors and matrices is referred to as linear algebra, and doing calculus with structures from linear algebra is called matrix calculus or vector calculus. In vector calculus, the vector of first-order derivatives (partial derivatives) is generally referred to as the gradient of the target function.

  • Gradient: Vector of partial derivatives of a target function with respect to input variables.

The gradient descent algorithm requires the calculation of the gradient of the target function with respect to the specific values of the input variables. The gradient points uphill, therefore the negative of the gradient of each input variable is followed downhill to produce new values for each variable that result in a lower evaluation of the target function.

A step size is used to scale the gradient and control how much to change each input variable with respect to the gradient.

  • Step Size: Learning rate or alpha, a hyperparameter used to control how much to change each input variable with respect to the gradient.

This process is repeated until the minimum of the target function is located, a maximum number of candidate solutions has been evaluated, or some other stop condition is met.
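
As a concrete illustration, a minimal gradient descent loop for the one-variable function f(x) = x^2, whose derivative is f'(x) = 2x, might look like the following; the starting point, step size, and number of iterations are arbitrary choices.

# minimal gradient descent on f(x) = x**2 with derivative f'(x) = 2*x
x = 3.0            # starting point (arbitrary)
step_size = 0.1    # learning rate (arbitrary)
for i in range(50):
	gradient = 2.0 * x               # gradient of the target function at x
	x = x - step_size * gradient     # move downhill against the gradient
print(x)  # approaches the minimum at x = 0.0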

Gradient descent can be adapted to minimize the loss function of a predictive model on a training dataset, such as a classification or regression model. This adaptation is called stochastic gradient descent.

  • Stochastic Gradient Descent: Extension of the gradient descent optimization algorithm for minimizing a loss function of a predictive model on a training dataset.

The target function is taken as the loss or error function on the dataset, such as mean squared error for regression or cross-entropy for classification. The parameters of the model are taken as the input variables for the target function.

  • Loss function: target function that is being minimized.
  • Model parameters: input parameters to the loss function that are being optimized.

The algorithm is referred to as “stochastic” because the gradients of the target function with respect to the input variables are noisy (e.g. a probabilistic approximation). This means that the evaluation of the gradient may have statistical noise that obscures the true underlying gradient signal, caused by the sparseness and noise in the training dataset.

The insight of stochastic gradient descent is that the gradient is an expectation. The expectation may be approximately estimated using a small set of samples.

— Page 151, Deep Learning, 2016.

Stochastic gradient descent can be used to train (optimize) many different model types, like linear regression and logistic regression, although often more efficient optimization algorithms have been discovered and should probably be used instead.

Stochastic gradient descent (SGD) and its variants are probably the most used optimization algorithms for machine learning in general and for deep learning in particular.

— Page 294, Deep Learning, 2016.

Stochastic gradient descent is the most efficient algorithm discovered for training artificial neural networks, where the weights are the model parameters and the target loss function is the prediction error averaged over one sample, a subset (batch) of samples, or the entire training dataset.

Nearly all of deep learning is powered by one very important algorithm: stochastic gradient descent or SGD.

— Page 151, Deep Learning, 2016.

There are many popular extensions to stochastic gradient descent designed to improve the optimization process (same or better loss in fewer iterations), such as Momentum, Root Mean Squared Propagation (RMSProp), and Adaptive Moment Estimation (Adam).
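
In Keras, for example, these variants can be selected when compiling a model; a minimal sketch is below (the model architecture and learning rate are arbitrary illustrations).

# selecting an SGD variant when compiling a Keras model (values are arbitrary)
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD, RMSprop, Adam
model = Sequential([Dense(10, activation='relu', input_shape=(2,)), Dense(1)])
opt = Adam(learning_rate=0.001)  # or SGD(momentum=0.9), or RMSprop()
model.compile(optimizer=opt, loss='mse')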

A challenge when using stochastic gradient descent to train a neural network is how to calculate the gradient for nodes in hidden layers in the network, e.g. nodes one or more steps away from the output layer of the model.

This requires a specific technique from calculus called the chain rule, and an efficient algorithm that implements the chain rule to calculate gradients for any parameter in the network. This algorithm is called back-propagation.

Back-Propagation Algorithm

Back-propagation, also called “backpropagation,” or simply “backprop,” is an algorithm for calculating the gradient of a loss function with respect to variables of a model.

  • Back-Propagation: Algorithm for calculating the gradient of a loss function with respect to variables of a model.

You may recall from calculus that the first-order derivative of a function for a specific value of an input variable is the rate of change of the function for that input. When we have multiple input variables for a function, they form a vector, and the vector of first-order derivatives (partial derivatives) is called the gradient (i.e. vector calculus).

  • Gradient: Vector of partial derivatives of a target function with respect to specific input values.

Back-propagation is used when training neural network models to calculate the gradient for each weight in the network model. The gradient is then used by an optimization algorithm to update the model weights.

The algorithm was developed explicitly for calculating the gradients of variables in graph structures, working backward from the output of the graph toward the input of the graph, propagating the error in the predicted output in order to calculate the gradient for each variable.

The back-propagation algorithm, often simply called backprop, allows the information from the cost to then flow backwards through the network, in order to compute the gradient.

— Page 204, Deep Learning, 2016.

The loss function represents the error of the model (and is also called the error function), the weights are the variables of the function, and the gradients of the error function with regard to the weights are therefore referred to as error gradients.

  • Error Function: Loss function that is minimized when training a neural network.
  • Weights: Parameters of the network taken as input values to the loss function.
  • Error Gradients: First-order derivatives of the loss function with regard to the parameters.

This gives the algorithm its name “back-propagation,” or sometimes “error back-propagation” or the “back-propagation of error.”

  • Back-Propagation of Error: Comment on how gradients are calculated recursively backward through the network graph starting at the output layer.

The algorithm involves the recursive application of the chain rule from calculus (different from the chain rule of probability), which computes the derivative of a function composed of sub-functions whose derivatives are known.

The chain rule of calculus […] is used to compute the derivatives of functions formed by composing other functions whose derivatives are known. Back-propagation is an algorithm that computes the chain rule, with a specific order of operations that is highly efficient.

— Page 205, Deep Learning, 2016.

  • Chain Rule: Calculus formula for calculating the derivatives of functions using related functions whose derivatives are known.
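
As a small illustration of the chain rule itself, consider h(x) = f(g(x)) with f(u) = u^2 and g(x) = 3x + 1, so h'(x) = f'(g(x)) * g'(x) = 2(3x + 1) * 3; the following sketch checks the analytic derivative against a finite-difference estimate.

# chain rule check for h(x) = (3*x + 1)**2, i.e. f(g(x)) with f(u) = u**2 and g(x) = 3*x + 1
def h(x):
	return (3.0 * x + 1.0) ** 2.0

x = 2.0
# analytic derivative via the chain rule: h'(x) = f'(g(x)) * g'(x)
analytic = 2.0 * (3.0 * x + 1.0) * 3.0
# finite-difference approximation of the derivative
eps = 1e-6
numeric = (h(x + eps) - h(x - eps)) / (2.0 * eps)
print(analytic, numeric)  # both approximately 42.0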

There are other algorithms for implementing the chain rule, but the back-propagation algorithm is an efficient algorithm for the specific graph structure of a neural network.

It is fair to call the back-propagation algorithm a type of automatic differentiation algorithm, and it belongs to a class of differentiation techniques called reverse accumulation.

The back-propagation algorithm described here is only one approach to automatic differentiation. It is a special case of a broader class of techniques called reverse mode accumulation.

— Page 222, Deep Learning, 2016.

Although back-propagation was developed to train neural network models, both the back-propagation algorithm specifically and the chain-rule formula that it implements efficiently can be used more generally to calculate derivatives of functions.

Furthermore, back-propagation is often misunderstood as being specific to multi-layer neural networks, but in principle it can compute derivatives of any function …

— Page 204, Deep Learning, 2016.

Stochastic Gradient Descent With Back-propagation

Stochastic Gradient Descent is an optimization algorithm that can be used to train neural network models.

The Stochastic Gradient Descent algorithm requires gradients to be calculated for each variable in the model so that new values for the variables can be calculated.

Back-propagation is an automatic differentiation algorithm that can be used to calculate the gradients for the parameters in neural networks.

Together, the back-propagation algorithm and Stochastic Gradient Descent algorithm can be used to train a neural network. We might call this “Stochastic Gradient Descent with Back-propagation.”

  • Stochastic Gradient Descent With Back-propagation: A more complete description of the general algorithm used to train a neural network, referencing the optimization algorithm and gradient calculation algorithm.

It is common for practitioners to say they train their model using back-propagation. Technically, this is incorrect. Even as a short-hand, this would be incorrect. Back-propagation is not an optimization algorithm and cannot be used to train a model.

The term back-propagation is often misunderstood as meaning the whole learning algorithm for multi-layer neural networks. Actually, back-propagation refers only to the method for computing the gradient, while another algorithm, such as stochastic gradient descent, is used to perform learning using this gradient.

— Page 204, Deep Learning, 2016.

It would be fair to say that a neural network is trained or learns using Stochastic Gradient Descent as a shorthand, as it is assumed that the back-propagation algorithm is used to calculate gradients as part of the optimization procedure.

That being said, a different algorithm can be used to optimize the parameters of a neural network, such as a genetic algorithm that does not require gradients. If the Stochastic Gradient Descent optimization algorithm is used, a different algorithm can be used to calculate the gradients for the loss function with respect to the model parameters, such as alternate algorithms that implement the chain rule.

Nevertheless, the “Stochastic Gradient Descent with Back-propagation” combination is widely used because it is the most efficient and effective general approach so far developed for fitting neural network models.
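
As a minimal sketch of how the two algorithms cooperate (all names and values here are illustrative, not a production implementation), consider one update step for a single linear unit yhat = w * x + b with a squared error loss. Back-propagation supplies the error gradients via the chain rule, and stochastic gradient descent uses them to update the parameters.

# minimal sketch: one stochastic gradient descent step using
# gradients computed by back-propagation (the chain rule)
# model: yhat = w * x + b, loss: (yhat - y)^2

# one training example (illustrative values)
x, y = 2.0, 1.0
# current parameters and learning rate (illustrative values)
w, b, learning_rate = 0.5, 0.0, 0.1

# forward pass: compute the prediction and the loss
yhat = w * x + b
loss = (yhat - y) ** 2.0

# backward pass (back-propagation): chain rule from loss to parameters
d_loss_d_yhat = 2.0 * (yhat - y)
d_loss_d_w = d_loss_d_yhat * x    # dyhat/dw = x
d_loss_d_b = d_loss_d_yhat * 1.0  # dyhat/db = 1

# update (stochastic gradient descent): step against the gradient
w = w - learning_rate * d_loss_d_w
b = b - learning_rate * d_loss_d_b
print(loss, w, b)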

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

  • Deep Learning, 2016.

Articles

Summary

In this tutorial, you discovered the difference between stochastic gradient descent and the back-propagation algorithm.

Specifically, you learned:

  • Stochastic gradient descent is an optimization algorithm for minimizing the loss of a predictive model with regard to a training dataset.
  • Back-propagation is an automatic differentiation algorithm for calculating gradients for the weights in a neural network graph structure.
  • Stochastic gradient descent and the back-propagation of error algorithms together are used to train neural network models.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Difference Between Backpropagation and Stochastic Gradient Descent appeared first on Machine Learning Mastery.

Weight Initialization for Deep Learning Neural Networks


Weight initialization is an important design choice when developing deep learning neural network models.

Historically, weight initialization involved using small random numbers, although over the last decade, more specific heuristics have been developed that use information, such as the type of activation function that is being used and the number of inputs to the node.

These more tailored heuristics can result in more effective training of neural network models using the stochastic gradient descent optimization algorithm.

In this tutorial, you will discover how to implement weight initialization techniques for deep learning neural networks.

After completing this tutorial, you will know:

  • Weight initialization is used to define the initial values for the parameters in neural network models prior to training the models on a dataset.
  • How to implement the xavier and normalized xavier weight initialization heuristics used for nodes that use the Sigmoid or Tanh activation functions.
  • How to implement the he weight initialization heuristic used for nodes that use the ReLU activation function.

Let’s get started.

Weight Initialization for Deep Learning Neural Networks

Weight Initialization for Deep Learning Neural Networks
Photo by Andres Alvarado, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Weight Initialization for Neural Networks
  2. Weight Initialization for Sigmoid and Tanh
    1. Xavier Weight Initialization
    2. Normalized Xavier Weight Initialization
  3. Weight Initialization for ReLU
    1. He Weight Initialization

Weight Initialization for Neural Networks

Weight initialization is an important consideration in the design of a neural network model.

The nodes in neural networks are composed of parameters referred to as weights used to calculate a weighted sum of the inputs.

Neural network models are fit using an optimization algorithm called stochastic gradient descent that incrementally changes the network weights to minimize a loss function, hopefully resulting in a set of weights for the model that is capable of making useful predictions.

This optimization algorithm requires a starting point in the space of possible weight values from which to begin the optimization process. Weight initialization is a procedure to set the weights of a neural network to small random values that define the starting point for the optimization (learning or training) of the neural network model.

… training deep models is a sufficiently difficult task that most algorithms are strongly affected by the choice of initialization. The initial point can determine whether the algorithm converges at all, with some initial points being so unstable that the algorithm encounters numerical difficulties and fails altogether.

— Page 301, Deep Learning, 2016.

Each time a neural network is initialized with a different set of weights, the optimization process has a different starting point, potentially resulting in a different final set of weights with different performance characteristics.

For more on the expectation of different results each time the same algorithm is trained on the same dataset, see the tutorial:

We cannot initialize all weights to the value 0.0, as the optimization algorithm requires some asymmetry in the error gradient to begin searching effectively.

For more on why we initialize neural networks with random weights, see the tutorial:

Historically, weight initialization followed simple heuristics, such as:

  • Small random values in the range [-0.3, 0.3]
  • Small random values in the range [0, 1]
  • Small random values in the range [-1, 1]
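
For example, the first heuristic can be sketched directly with NumPy by rescaling uniform random numbers into the range [-0.3, 0.3] (the layer size of 10 is illustrative):

# example of the small random weights heuristic in the range [-0.3, 0.3]
from numpy.random import rand
# generate 10 uniform random numbers in [0, 1]
numbers = rand(10)
# rescale to the range [-0.3, 0.3]
weights = -0.3 + numbers * 0.6
print(weights)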

These heuristics continue to work well in general.

We almost always initialize all the weights in the model to values drawn randomly from a Gaussian or uniform distribution. The choice of Gaussian or uniform distribution does not seem to matter very much, but has not been exhaustively studied. The scale of the initial distribution, however, does have a large effect on both the outcome of the optimization procedure and on the ability of the network to generalize.

— Page 302, Deep Learning, 2016.

Nevertheless, more tailored approaches have been developed over the last decade that have become the de facto standard, given they may result in a slightly more effective optimization (model training) process.

These modern weight initialization techniques are divided based on the type of activation function used in the nodes that are being initialized, such as “Sigmoid and Tanh” and “ReLU.”

Next, let’s take a closer look at these modern weight initialization heuristics for nodes with Sigmoid and Tanh activation functions.

Weight Initialization for Sigmoid and Tanh

The current standard approach for initialization of the weights of neural network layers and nodes that use the Sigmoid or Tanh activation function is called “glorot” or “xavier” initialization.

It is named for Xavier Glorot, currently a research scientist at Google DeepMind, and was described in the 2010 paper by Xavier Glorot and Yoshua Bengio titled “Understanding The Difficulty Of Training Deep Feedforward Neural Networks.”

There are two versions of this weight initialization method, which we will refer to as “xavier” and “normalized xavier.”

Glorot and Bengio proposed to adopt a properly scaled uniform distribution for initialization. This is called “Xavier” initialization […] Its derivation is based on the assumption that the activations are linear. This assumption is invalid for ReLU and PReLU.

— Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015.

Both approaches were derived assuming that the activation function is linear; nevertheless, they have become the standard for nonlinear activation functions like Sigmoid and Tanh, but not ReLU.

Let’s take a closer look at each in turn.

Xavier Weight Initialization

The xavier initialization method is calculated as a random number with a uniform probability distribution (U) in the range -(1/sqrt(n)) to 1/sqrt(n), where n is the number of inputs to the node.

  • weight = U [-(1/sqrt(n)), 1/sqrt(n)]

We can implement this directly in Python.

The example below assumes 10 inputs to a node, then calculates the lower and upper bounds of the range and calculates 1,000 initial weight values that could be used for the nodes in a layer or a network that uses the sigmoid or tanh activation function.

After calculating the weights, the lower and upper bounds are printed as are the min, max, mean, and standard deviation of the generated weights.

The complete example is listed below.

# example of the xavier weight initialization
from math import sqrt
from numpy.random import rand
# number of nodes in the previous layer
n = 10
# calculate the range for the weights
lower, upper = -(1.0 / sqrt(n)), (1.0 / sqrt(n))
# generate random numbers
numbers = rand(1000)
# scale to the desired range
scaled = lower + numbers * (upper - lower)
# summarize
print(lower, upper)
print(scaled.min(), scaled.max())
print(scaled.mean(), scaled.std())

Running the example generates the weights and prints the summary statistics.

We can see that the bounds of the weight values are about -0.316 and 0.316. These bounds would become wider with fewer inputs and narrower with more inputs.

We can see that the generated weights respect these bounds and that the mean weight value is close to zero with the standard deviation close to 0.17.

-0.31622776601683794 0.31622776601683794
-0.3157663248679193 0.3160839282916222
0.006806069733149146 0.17777128902976705

It can also help to see how the spread of the weights changes with the number of inputs.

For this, we can calculate the bounds on the weight initialization with different numbers of inputs from 1 to 100 and plot the result.

The complete example is listed below.

# plot of the bounds on xavier weight initialization for different numbers of inputs
from math import sqrt
from matplotlib import pyplot
# define the number of inputs from 1 to 100
values = [i for i in range(1, 101)]
# calculate the range for each number of inputs
results = [1.0 / sqrt(n) for n in values]
# create an error bar plot centered on 0 for each number of inputs
pyplot.errorbar(values, [0.0 for _ in values], yerr=results)
pyplot.show()

Running the example creates a plot that allows us to compare the range of weights with different numbers of input values.

We can see that with very few inputs, the range is large, such as between -1 and 1 or -0.7 and 0.7. We can then see that the range rapidly narrows to about -0.2 and 0.2 with 20 inputs, approaching -0.1 and 0.1 as the number of inputs grows, where it remains reasonably constant.

Plot of Range of Xavier Weight Initialization With Inputs From One to One Hundred

Plot of Range of Xavier Weight Initialization With Inputs From One to One Hundred

Normalized Xavier Weight Initialization

The normalized xavier initialization method is calculated as a random number with a uniform probability distribution (U) in the range -(sqrt(6)/sqrt(n + m)) to sqrt(6)/sqrt(n + m), where n is the number of inputs to the node (e.g. number of nodes in the previous layer) and m is the number of outputs from the layer (e.g. number of nodes in the current layer).

  • weight = U [-(sqrt(6)/sqrt(n + m)), sqrt(6)/sqrt(n + m)]

We can implement this directly in Python as we did in the previous section and summarize the statistical summary of 1,000 generated weights.

The complete example is listed below.

# example of the normalized xavier weight initialization
from math import sqrt
from numpy.random import rand
# number of nodes in the previous layer
n = 10
# number of nodes in the next layer
m = 20
# calculate the range for the weights
lower, upper = -(sqrt(6.0) / sqrt(n + m)), (sqrt(6.0) / sqrt(n + m))
# generate random numbers
numbers = rand(1000)
# scale to the desired range
scaled = lower + numbers * (upper - lower)
# summarize
print(lower, upper)
print(scaled.min(), scaled.max())
print(scaled.mean(), scaled.std())

Running the example generates the weights and prints the summary statistics.

We can see that the bounds of the weight values are about -0.447 and 0.447. These bounds would become wider with fewer inputs and outputs, and narrower with more.

We can see that the generated weights respect these bounds and that the mean weight value is close to zero with the standard deviation close to 0.26.

-0.44721359549995787 0.44721359549995787
-0.4447861894315135 0.4463641245392874
-0.01135636099916006 0.2581340352889168

It can also help to see how the spread of the weights changes with the number of inputs.

For this, we can calculate the bounds on the weight initialization with different numbers of inputs from 1 to 100 and a fixed number of 10 outputs and plot the result.

The complete example is listed below.

# plot of the bounds of normalized xavier weight initialization for different numbers of inputs
from math import sqrt
from matplotlib import pyplot
# define the number of inputs from 1 to 100
values = [i for i in range(1, 101)]
# define the number of outputs
m = 10
# calculate the range for each number of inputs
results = [sqrt(6.0) / sqrt(n + m) for n in values]
# create an error bar plot centered on 0 for each number of inputs
pyplot.errorbar(values, [0.0 for _ in values], yerr=results)
pyplot.show()

Running the example creates a plot that allows us to compare the range of weights with different numbers of input values.

We can see that the range starts wide at about -0.7 to 0.7 with few inputs and reduces to about -0.2 to 0.2 as the number of inputs increases.

Compared to the non-normalized version in the previous section, the range is initially smaller, although it transitions to the compact range at a similar rate.

Plot of Range of Normalized Xavier Weight Initialization With Inputs From One to One Hundred

Plot of Range of Normalized Xavier Weight Initialization With Inputs From One to One Hundred

Weight Initialization for ReLU

The “xavier” weight initialization was found to have problems when used to initialize networks that use the rectified linear (ReLU) activation function.

As such, a modified version of the approach was developed specifically for nodes and layers that use ReLU activation, popular in the hidden layers of most multilayer Perceptron and convolutional neural network models.

The current standard approach for initialization of the weights of neural network layers and nodes that use the rectified linear (ReLU) activation function is called “he” initialization.

It is named for Kaiming He, currently a research scientist at Facebook, and was described in the 2015 paper by Kaiming He, et al. titled “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.”

He Weight Initialization

The he initialization method is calculated as a random number with a Gaussian probability distribution (G) with a mean of 0.0 and a standard deviation of sqrt(2/n), where n is the number of inputs to the node.

  • weight = G (0.0, sqrt(2/n))

We can implement this directly in Python.

The example below assumes 10 inputs to a node, then calculates the standard deviation of the Gaussian distribution and calculates 1,000 initial weight values that could be used for the nodes in a layer or a network that uses the ReLU activation function.

After calculating the weights, the calculated standard deviation is printed as are the min, max, mean, and standard deviation of the generated weights.

The complete example is listed below.

# example of the he weight initialization
from math import sqrt
from numpy.random import randn
# number of nodes in the previous layer
n = 10
# calculate the range for the weights
std = sqrt(2.0 / n)
# generate random numbers
numbers = randn(1000)
# scale to the desired range
scaled = numbers * std
# summarize
print(std)
print(scaled.min(), scaled.max())
print(scaled.mean(), scaled.std())

Running the example generates the weights and prints the summary statistics.

We can see that the calculated standard deviation of the weights is about 0.447. This standard deviation would become larger with fewer inputs and smaller with more inputs.

We can see that the range of the weights is about -1.573 to 1.433, which is close to the theoretical range of about -1.788 and 1.788, or four standard deviations from the mean, capturing about 99.99% of observations in the Gaussian distribution. We can also see that the mean and standard deviation of the generated weights are close to the prescribed 0.0 and 0.447 respectively.

0.4472135954999579
-1.5736761136523203 1.433348584081719
-0.00023406487278826836 0.4522609460629265

It can also help to see how the spread of the weights changes with the number of inputs.

For this, we can calculate the bounds on the weight initialization with different numbers of inputs from 1 to 100 and plot the result.

The complete example is listed below.

# plot of the bounds on he weight initialization for different numbers of inputs
from math import sqrt
from matplotlib import pyplot
# define the number of inputs from 1 to 100
values = [i for i in range(1, 101)]
# calculate the range for each number of inputs
results = [sqrt(2.0 / n) for n in values]
# create an error bar plot centered on 0 for each number of inputs
pyplot.errorbar(values, [0.0 for _ in values], yerr=results)
pyplot.show()

Running the example creates a plot that allows us to compare the range of weights with different numbers of input values.

We can see that with very few inputs, the range is large, such as between -1.5 and 1.5 or -1.0 and 1.0. We can then see that the range rapidly narrows to about -0.3 and 0.3 with 20 inputs, continuing to shrink slowly as the number of inputs grows.

Plot of Range of He Weight Initialization With Inputs From One to One Hundred

Plot of Range of He Weight Initialization With Inputs From One to One Hundred
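
Note that in practice you rarely need to implement these heuristics manually; most deep learning libraries provide them as built-in initializers. For example, assuming the Keras deep learning library (TensorFlow 2.x) is installed, the glorot (xavier) and he heuristics can be requested by name when defining a layer; the layer sizes below are illustrative.

# example of specifying weight initialization heuristics in Keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# define a small model with tailored weight initializations
model = Sequential()
# he initialization for a hidden layer with relu activation
model.add(Dense(10, input_dim=8, activation='relu', kernel_initializer='he_normal'))
# xavier (glorot) initialization for a sigmoid output layer
model.add(Dense(1, activation='sigmoid', kernel_initializer='glorot_uniform'))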

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

Papers

  • Understanding the Difficulty of Training Deep Feedforward Neural Networks, 2010.
  • Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, 2015.

Books

  • Deep Learning, 2016.

Summary

In this tutorial, you discovered how to implement weight initialization techniques for deep learning neural networks.

Specifically, you learned:

  • Weight initialization is used to define the initial values for the parameters in neural network models prior to training the models on a dataset.
  • How to implement the xavier and normalized xavier weight initialization heuristics used for nodes that use the Sigmoid or Tanh activation functions.
  • How to implement the he weight initialization heuristic used for nodes that use the ReLU activation function.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Weight Initialization for Deep Learning Neural Networks appeared first on Machine Learning Mastery.

Gradient Descent With Momentum from Scratch


Gradient descent is an optimization algorithm that follows the negative gradient of an objective function in order to locate the minimum of the function.

A problem with gradient descent is that it can bounce around the search space on optimization problems that have large amounts of curvature or noisy gradients, and it can get stuck in flat spots in the search space that have no gradient.

Momentum is an extension to the gradient descent optimization algorithm that allows the search to build inertia in a direction in the search space and overcome the oscillations of noisy gradients and coast across flat spots of the search space.

In this tutorial, you will discover the gradient descent with momentum algorithm.

After completing this tutorial, you will know:

  • Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
  • Gradient descent can be accelerated by using momentum from past updates to the search position.
  • How to implement gradient descent optimization with momentum and develop an intuition for its behavior.

Let’s get started.

Gradient Descent With Momentum from Scratch

Gradient Descent With Momentum from Scratch
Photo by Chris Barnes, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Gradient Descent
  2. Momentum
  3. Gradient Descent With Momentum
    1. One-Dimensional Test Problem
    2. Gradient Descent Optimization
    3. Visualization of Gradient Descent Optimization
    4. Gradient Descent Optimization With Momentum
    5. Visualization of Gradient Descent Optimization With Momentum

Gradient Descent

Gradient descent is an optimization algorithm.

It is technically referred to as a first-order optimization algorithm as it explicitly makes use of the first-order derivative of the target objective function.

First-order methods rely on gradient information to help direct the search for a minimum …

— Page 69, Algorithms for Optimization, 2019.

The first-order derivative, or simply the “derivative,” is the rate of change or slope of the target function at a specific point, e.g. for a specific input.

If the target function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is referred to generally as the “gradient.”

  • Gradient: First-order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the target function for a specific input.

Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function.

The gradient descent algorithm requires a target function that is being optimized and the derivative function for the objective function. The target function f() returns a score for a given set of inputs, and the derivative function f'() gives the derivative of the target function for a given set of inputs.

The gradient descent algorithm requires a starting point (x) in the problem, such as a randomly selected point in the input space.

The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimizing the target function.

A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the target function.

  • x = x – step_size * f'(x)

The steeper the objective function at a given point, the larger the magnitude of the gradient and, in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.

  • Step Size (alpha): Hyperparameter that controls how far to move in the search space against the gradient each iteration of the algorithm, also called the learning rate.

If the step size is too small, the movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.

Now that we are familiar with the gradient descent optimization algorithm, let’s take a look at momentum.

Momentum

Momentum is an extension to the gradient descent optimization algorithm, often referred to as gradient descent with momentum.

It is designed to accelerate the optimization process, e.g. decrease the number of function evaluations required to reach the optima, or to improve the capability of the optimization algorithm, e.g. result in a better final result.

A problem with the gradient descent algorithm is that the progression of the search can bounce around the search space based on the gradient. For example, the search may progress downhill towards the minima, but during this progression, it may move in another direction, even uphill, depending on the gradient of specific points (sets of parameters) encountered during the search.

This can slow down the progress of the search, especially for those optimization problems where the broader trend or shape of the search space is more useful than specific gradients along the way.

One approach to this problem is to add history to the parameter update equation based on the gradient encountered in the previous updates.

This change is based on the metaphor of momentum from physics where acceleration in a direction can be accumulated from past updates.

The name momentum derives from a physical analogy, in which the negative gradient is a force moving a particle through parameter space, according to Newton’s laws of motion.

— Page 296, Deep Learning, 2016.

Momentum involves adding an additional hyperparameter that controls the amount of history (momentum) to include in the update equation, i.e. the step to a new point in the search space. The value for the hyperparameter is defined in the range 0.0 to 1.0 and often has a value close to 1.0, such as 0.8, 0.9, or 0.99. A momentum of 0.0 is the same as gradient descent without momentum.

First, let’s break the gradient descent update equation down into two parts: the calculation of the change to the position and the update of the old position to the new position.

The change in the parameters is calculated as the gradient for the point scaled by the step size.

  • change_x = step_size * f'(x)

The new position is calculated by simply subtracting the change from the current point.

  • x = x – change_x

Momentum involves maintaining the change in the position and using it in the subsequent calculation of the change in position.

If we think of updates over time, then the update at the current iteration or time (t) will add the change used at the previous time (t-1) weighted by the momentum hyperparameter, as follows:

  • change_x(t) = step_size * f'(x(t-1)) + momentum * change_x(t-1)

The update to the position is then performed as before.

  • x(t) = x(t-1) – change_x(t)

The change in the position accumulates magnitude and direction of changes over the iterations of the search, proportional to the size of the momentum hyperparameter.

For example, a large momentum (e.g. 0.9) will mean that the update is strongly influenced by the previous update, whereas a modest momentum (0.2) will mean very little influence.

The momentum algorithm accumulates an exponentially decaying moving average of past gradients and continues to move in their direction.

— Page 296, Deep Learning, 2016.

Momentum has the effect of dampening the change in the gradient and, in turn, the step size with each new point in the search space.

Momentum can increase speed when the cost surface is highly nonspherical because it damps the size of the steps along directions of high curvature thus yielding a larger effective learning rate along the directions of low curvature.

— Page 21, Neural Networks: Tricks of the Trade, 2012.

Momentum is most useful in optimization problems where the objective function has a large amount of curvature (e.g. changes a lot), meaning that the gradient may change a lot over relatively small regions of the search space.

The method of momentum is designed to accelerate learning, especially in the face of high curvature, small but consistent gradients, or noisy gradients.

— Page 296, Deep Learning, 2016.

It is also helpful when the gradient is estimated, such as from a simulation, and may be noisy, e.g. when the gradient has a high variance.

Finally, momentum is helpful when the search space is flat or nearly flat, e.g. zero gradient. The momentum allows the search to progress in the same direction as before the flat spot and helpfully cross the flat region.

Now that we are familiar with what momentum is, let’s look at a worked example.

Gradient Descent with Momentum

In this section, we will first implement the gradient descent optimization algorithm, then update it to use momentum and compare results.

One-Dimensional Test Problem

First, let’s define an optimization function.

We will use a simple one-dimensional function that squares the input and defines the range of valid inputs from -1.0 to 1.0.

The objective() function below implements this function.

# objective function
def objective(x):
	return x**2.0

We can then sample all inputs in the range and calculate the objective function value for each.

...
# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
inputs = arange(r_min, r_max+0.1, 0.1)
# compute targets
results = objective(inputs)

Finally, we can create a line plot of the inputs (x-axis) versus the objective function values (y-axis) to get an intuition for the shape of the objective function that we will be searching.

...
# create a line plot of input vs result
pyplot.plot(inputs, results)
# show the plot
pyplot.show()

The example below ties this together and provides an example of plotting the one-dimensional test function.

# plot of simple function
from numpy import arange
from matplotlib import pyplot

# objective function
def objective(x):
	return x**2.0

# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
inputs = arange(r_min, r_max+0.1, 0.1)
# compute targets
results = objective(inputs)
# create a line plot of input vs result
pyplot.plot(inputs, results)
# show the plot
pyplot.show()

Running the example creates a line plot of the inputs to the function (x-axis) and the calculated output of the function (y-axis).

We can see the familiar U-shape called a parabola.

Line Plot of Simple One Dimensional Function

Line Plot of Simple One Dimensional Function

Gradient Descent Optimization

Next, we can apply the gradient descent algorithm to the problem.

First, we need a function that calculates the derivative for the objective function.

The derivative of x^2 is x * 2, and the derivative() function below implements this.

# derivative of objective function
def derivative(x):
	return x * 2.0

We can define a function that implements the gradient descent optimization algorithm.

The procedure involves starting with a randomly selected point in the search space, then calculating the gradient, updating the position in the search space, evaluating the new position, and reporting the progress. This process is then repeated for a fixed number of iterations. The final point and its evaluation are then returned from the function.

The function gradient_descent() below implements this and takes the name of the objective and gradient functions as well as the bounds on the inputs to the objective function, number of iterations, and step size, then returns the solution and its evaluation at the end of the search.

# gradient descent algorithm
def gradient_descent(objective, derivative, bounds, n_iter, step_size):
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# run the gradient descent
	for i in range(n_iter):
		# calculate gradient
		gradient = derivative(solution)
		# take a step
		solution = solution - step_size * gradient
		# evaluate candidate point
		solution_eval = objective(solution)
		# report progress
		print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
	return [solution, solution_eval]

We can then define the bounds of the objective function, the step size, and the number of iterations for the algorithm.

We will use a step size of 0.1 and 30 iterations, both found after a little experimentation.

The seed for the pseudorandom number generator is fixed so that we always get the same sequence of random numbers, and in this case, it ensures that we get the same starting point for the search each time the code is run (e.g. something interesting far from the optima).

...
# seed the pseudo random number generator
seed(4)
# define range for input
bounds = asarray([[-1.0, 1.0]])
# define the total iterations
n_iter = 30
# define the step size
step_size = 0.1
# perform the gradient descent search
best, score = gradient_descent(objective, derivative, bounds, n_iter, step_size)

Tying this together, the complete example of applying gradient descent to our one-dimensional test function is listed below.

# example of gradient descent for a one-dimensional function
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x):
	return x**2.0

# derivative of objective function
def derivative(x):
	return x * 2.0

# gradient descent algorithm
def gradient_descent(objective, derivative, bounds, n_iter, step_size):
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# run the gradient descent
	for i in range(n_iter):
		# calculate gradient
		gradient = derivative(solution)
		# take a step
		solution = solution - step_size * gradient
		# evaluate candidate point
		solution_eval = objective(solution)
		# report progress
		print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
	return [solution, solution_eval]

# seed the pseudo random number generator
seed(4)
# define range for input
bounds = asarray([[-1.0, 1.0]])
# define the total iterations
n_iter = 30
# define the step size
step_size = 0.1
# perform the gradient descent search
best, score = gradient_descent(objective, derivative, bounds, n_iter, step_size)
print('Done!')
print('f(%s) = %f' % (best, score))

Running the example starts with a random point in the search space, then applies the gradient descent algorithm, reporting performance along the way.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the algorithm finds a good solution after about 27 iterations, with a function evaluation of about 0.0.

Note the optima for this function is at f(0.0) = 0.0.

We would expect that gradient descent with momentum will accelerate the optimization procedure and find a similarly evaluated solution in fewer iterations.

>0 f([0.74724774]) = 0.55838
>1 f([0.59779819]) = 0.35736
>2 f([0.47823856]) = 0.22871
>3 f([0.38259084]) = 0.14638
>4 f([0.30607268]) = 0.09368
>5 f([0.24485814]) = 0.05996
>6 f([0.19588651]) = 0.03837
>7 f([0.15670921]) = 0.02456
>8 f([0.12536737]) = 0.01572
>9 f([0.10029389]) = 0.01006
>10 f([0.08023512]) = 0.00644
>11 f([0.06418809]) = 0.00412
>12 f([0.05135047]) = 0.00264
>13 f([0.04108038]) = 0.00169
>14 f([0.0328643]) = 0.00108
>15 f([0.02629144]) = 0.00069
>16 f([0.02103315]) = 0.00044
>17 f([0.01682652]) = 0.00028
>18 f([0.01346122]) = 0.00018
>19 f([0.01076897]) = 0.00012
>20 f([0.00861518]) = 0.00007
>21 f([0.00689214]) = 0.00005
>22 f([0.00551372]) = 0.00003
>23 f([0.00441097]) = 0.00002
>24 f([0.00352878]) = 0.00001
>25 f([0.00282302]) = 0.00001
>26 f([0.00225842]) = 0.00001
>27 f([0.00180673]) = 0.00000
>28 f([0.00144539]) = 0.00000
>29 f([0.00115631]) = 0.00000
Done!
f([0.00115631]) = 0.000001

Visualization of Gradient Descent Optimization

Next, we can visualize the progress of the search on a plot of the target function.

First, we can update the gradient_descent() function to store all solutions and their score found during the optimization as lists and return them at the end of the search instead of the best solution found.

# gradient descent algorithm
def gradient_descent(objective, derivative, bounds, n_iter, step_size):
	# track all solutions
	solutions, scores = list(), list()
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# run the gradient descent
	for i in range(n_iter):
		# calculate gradient
		gradient = derivative(solution)
		# take a step
		solution = solution - step_size * gradient
		# evaluate candidate point
		solution_eval = objective(solution)
		# store solution
		solutions.append(solution)
		scores.append(solution_eval)
		# report progress
		print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
	return [solutions, scores]

The function can be called and we can get the lists of the solutions and the scores found during the search.

...
# perform the gradient descent search
solutions, scores = gradient_descent(objective, derivative, bounds, n_iter, step_size)

We can create a line plot of the objective function, as before.

...
# sample input range uniformly at 0.1 increments
inputs = arange(bounds[0,0], bounds[0,1]+0.1, 0.1)
# compute targets
results = objective(inputs)
# create a line plot of input vs result
pyplot.plot(inputs, results)

Finally, we can plot each solution found as a red dot and connect the dots with a line so we can see how the search moved downhill.

...
# plot the solutions found
pyplot.plot(solutions, scores, '.-', color='red')

Tying this all together, the complete example of plotting the result of the gradient descent search on the one-dimensional test function is listed below.

# example of plotting a gradient descent search on a one-dimensional function
from numpy import asarray
from numpy import arange
from numpy.random import rand
from numpy.random import seed
from matplotlib import pyplot

# objective function
def objective(x):
	return x**2.0

# derivative of objective function
def derivative(x):
	return x * 2.0

# gradient descent algorithm
def gradient_descent(objective, derivative, bounds, n_iter, step_size):
	# track all solutions
	solutions, scores = list(), list()
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# run the gradient descent
	for i in range(n_iter):
		# calculate gradient
		gradient = derivative(solution)
		# take a step
		solution = solution - step_size * gradient
		# evaluate candidate point
		solution_eval = objective(solution)
		# store solution
		solutions.append(solution)
		scores.append(solution_eval)
		# report progress
		print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
	return [solutions, scores]

# seed the pseudo random number generator
seed(4)
# define range for input
bounds = asarray([[-1.0, 1.0]])
# define the total iterations
n_iter = 30
# define the step size
step_size = 0.1
# perform the gradient descent search
solutions, scores = gradient_descent(objective, derivative, bounds, n_iter, step_size)
# sample input range uniformly at 0.1 increments
inputs = arange(bounds[0,0], bounds[0,1]+0.1, 0.1)
# compute targets
results = objective(inputs)
# create a line plot of input vs result
pyplot.plot(inputs, results)
# plot the solutions found
pyplot.plot(solutions, scores, '.-', color='red')
# show the plot
pyplot.show()

Running the example performs the gradient descent search on the objective function as before, except in this case, each point found during the search is plotted.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the search started more than halfway up the right part of the function and stepped downhill to the bottom of the basin.

We can see that in the parts of the objective function with the larger curve, the derivative (gradient) is larger, and in turn, larger steps are taken. Similarly, the gradient is smaller as we get closer to the optima, and in turn, smaller steps are taken.

This highlights that the step size is used as a scale factor on the magnitude of the gradient (curvature) of the objective function.

Plot of the Progress of Gradient Descent on a One Dimensional Objective Function

Plot of the Progress of Gradient Descent on a One Dimensional Objective Function

Gradient Descent Optimization With Momentum

Next, we can update the gradient descent optimization algorithm to use momentum.

This can be achieved by updating the gradient_descent() function to take a “momentum” argument that defines the amount of momentum used during the search.

The change made to the solution must be remembered from the previous iteration of the loop, with an initial value of 0.0.

...
# keep track of the change
change = 0.0

We can then break the update procedure down into first calculating the gradient, then calculating the change to the solution, calculating the position of the new solution, then saving the change for the next iteration.

...
# calculate gradient
gradient = derivative(solution)
# calculate update
new_change = step_size * gradient + momentum * change
# take a step
solution = solution - new_change
# save the change
change = new_change

The updated version of the gradient_descent() function with these changes is listed below.

# gradient descent algorithm
def gradient_descent(objective, derivative, bounds, n_iter, step_size, momentum):
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# keep track of the change
	change = 0.0
	# run the gradient descent
	for i in range(n_iter):
		# calculate gradient
		gradient = derivative(solution)
		# calculate update
		new_change = step_size * gradient + momentum * change
		# take a step
		solution = solution - new_change
		# save the change
		change = new_change
		# evaluate candidate point
		solution_eval = objective(solution)
		# report progress
		print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
	return [solution, solution_eval]

We can then choose a momentum value and pass it to the gradient_descent() function.

After a little trial and error, a momentum value of 0.3 was found to be effective on this problem, given the fixed step size of 0.1.

...
# define momentum
momentum = 0.3
# perform the gradient descent search with momentum
best, score = gradient_descent(objective, derivative, bounds, n_iter, step_size, momentum)

Tying this together, the complete example of gradient descent optimization with momentum is listed below.

# example of gradient descent with momentum for a one-dimensional function
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x):
	return x**2.0

# derivative of objective function
def derivative(x):
	return x * 2.0

# gradient descent algorithm
def gradient_descent(objective, derivative, bounds, n_iter, step_size, momentum):
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# keep track of the change
	change = 0.0
	# run the gradient descent
	for i in range(n_iter):
		# calculate gradient
		gradient = derivative(solution)
		# calculate update
		new_change = step_size * gradient + momentum * change
		# take a step
		solution = solution - new_change
		# save the change
		change = new_change
		# evaluate candidate point
		solution_eval = objective(solution)
		# report progress
		print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
	return [solution, solution_eval]

# seed the pseudo random number generator
seed(4)
# define range for input
bounds = asarray([[-1.0, 1.0]])
# define the total iterations
n_iter = 30
# define the step size
step_size = 0.1
# define momentum
momentum = 0.3
# perform the gradient descent search with momentum
best, score = gradient_descent(objective, derivative, bounds, n_iter, step_size, momentum)
print('Done!')
print('f(%s) = %f' % (best, score))

Running the example starts with a random point in the search space, then applies the gradient descent algorithm with momentum, reporting performance along the way.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the algorithm finds a good solution after about 13 iterations, with a function evaluation of about 0.0.

As expected, this is faster (fewer iterations) than gradient descent without momentum, using the same starting point and step size that took 27 iterations.

>0 f([0.74724774]) = 0.55838
>1 f([0.54175461]) = 0.29350
>2 f([0.37175575]) = 0.13820
>3 f([0.24640494]) = 0.06072
>4 f([0.15951871]) = 0.02545
>5 f([0.1015491]) = 0.01031
>6 f([0.0638484]) = 0.00408
>7 f([0.03976851]) = 0.00158
>8 f([0.02459084]) = 0.00060
>9 f([0.01511937]) = 0.00023
>10 f([0.00925406]) = 0.00009
>11 f([0.00564365]) = 0.00003
>12 f([0.0034318]) = 0.00001
>13 f([0.00208188]) = 0.00000
>14 f([0.00126053]) = 0.00000
>15 f([0.00076202]) = 0.00000
>16 f([0.00046006]) = 0.00000
>17 f([0.00027746]) = 0.00000
>18 f([0.00016719]) = 0.00000
>19 f([0.00010067]) = 0.00000
>20 f([6.05804744e-05]) = 0.00000
>21 f([3.64373635e-05]) = 0.00000
>22 f([2.19069576e-05]) = 0.00000
>23 f([1.31664443e-05]) = 0.00000
>24 f([7.91100141e-06]) = 0.00000
>25 f([4.75216828e-06]) = 0.00000
>26 f([2.85408468e-06]) = 0.00000
>27 f([1.71384267e-06]) = 0.00000
>28 f([1.02900153e-06]) = 0.00000
>29 f([6.17748881e-07]) = 0.00000
Done!
f([6.17748881e-07]) = 0.000000

Visualization of Gradient Descent Optimization With Momentum

Finally, we can visualize the progress of the gradient descent optimization algorithm with momentum.

The complete example is listed below.

# example of plotting gradient descent with momentum for a one-dimensional function
from numpy import asarray
from numpy import arange
from numpy.random import rand
from numpy.random import seed
from matplotlib import pyplot

# objective function
def objective(x):
	return x**2.0

# derivative of objective function
def derivative(x):
	return x * 2.0

# gradient descent algorithm
def gradient_descent(objective, derivative, bounds, n_iter, step_size, momentum):
	# track all solutions
	solutions, scores = list(), list()
	# generate an initial point
	solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# keep track of the change
	change = 0.0
	# run the gradient descent
	for i in range(n_iter):
		# calculate gradient
		gradient = derivative(solution)
		# calculate update
		new_change = step_size * gradient + momentum * change
		# take a step
		solution = solution - new_change
		# save the change
		change = new_change
		# evaluate candidate point
		solution_eval = objective(solution)
		# store solution
		solutions.append(solution)
		scores.append(solution_eval)
		# report progress
		print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
	return [solutions, scores]

# seed the pseudo random number generator
seed(4)
# define range for input
bounds = asarray([[-1.0, 1.0]])
# define the total iterations
n_iter = 30
# define the step size
step_size = 0.1
# define momentum
momentum = 0.3
# perform the gradient descent search with momentum
solutions, scores = gradient_descent(objective, derivative, bounds, n_iter, step_size, momentum)
# sample input range uniformly at 0.1 increments
inputs = arange(bounds[0,0], bounds[0,1]+0.1, 0.1)
# compute targets
results = objective(inputs)
# create a line plot of input vs result
pyplot.plot(inputs, results)
# plot the solutions found
pyplot.plot(solutions, scores, '.-', color='red')
# show the plot
pyplot.show()

Running the example performs the gradient descent search with momentum on the objective function as before, except in this case, each point found during the search is plotted.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, if we compare the plot to the plot created previously for the performance of gradient descent (without momentum), we can see that the search indeed reaches the optima in fewer steps, noted with fewer distinct red dots on the path to the bottom of the basin.

Plot of the Progress of Gradient Descent With Momentum on a One Dimensional Objective Function

Plot of the Progress of Gradient Descent With Momentum on a One Dimensional Objective Function

As an extension, try different values for momentum, such as 0.8, and review the resulting plot.
Let me know what you discover in the comments below.
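
Also note that in practice momentum is usually configured on a library's optimizer rather than implemented by hand. For example, assuming the Keras deep learning library (TensorFlow 2.x) is installed, a momentum of 0.9 can be specified as follows (the hyperparameter values are illustrative).

# example of configuring momentum on the Keras SGD optimizer
from tensorflow.keras.optimizers import SGD
# stochastic gradient descent with a step size of 0.01 and a momentum of 0.9
opt = SGD(learning_rate=0.01, momentum=0.9)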

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

  • Algorithms for Optimization, 2019.
  • Deep Learning, 2016.
  • Neural Networks: Tricks of the Trade, 2012.

APIs

Articles

Summary

In this tutorial, you discovered the gradient descent with momentum algorithm.

Specifically, you learned:

  • Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
  • Gradient descent can be accelerated by using momentum from past updates to the search position.
  • How to implement gradient descent optimization with momentum and develop an intuition for its behavior.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Gradient Descent With Momentum from Scratch appeared first on Machine Learning Mastery.

Function Optimization With SciPy


Optimization involves finding the inputs to an objective function that result in the minimum or maximum output of the function.

The open-source Python library for scientific computing called SciPy provides a suite of optimization algorithms. Many of the algorithms are used as a building block in other algorithms, most notably machine learning algorithms in the scikit-learn library.

These optimization algorithms can be used directly in a standalone manner to optimize a function. Most notably, algorithms for local search and algorithms for global search, the two main types of optimization you may encounter on a machine learning project.

In this tutorial, you will discover optimization algorithms provided by the SciPy library.

After completing this tutorial, you will know:

  • The SciPy library provides a suite of different optimization algorithms for different purposes.
  • The local search optimization algorithms available in SciPy.
  • The global search optimization algorithms available in SciPy.

Let’s get started.

Function Optimization With SciPy

Function Optimization With SciPy
Photo by Manoel Lemos, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Optimization With SciPy
  2. Local Search With SciPy
  3. Global Search With SciPy

Optimization With SciPy

The Python SciPy open-source library for scientific computing provides a suite of optimization techniques.

Many of the algorithms are used as building blocks for other algorithms within the SciPy library, as well as machine learning libraries such as scikit-learn.

Before we review specific techniques, let’s look at the types of algorithms provided by the library.

They are:

  • Scalar Optimization: Optimization of a convex single variable function.
  • Local Search: Optimization of a unimodal multiple variable function.
  • Global Search: Optimization of a multimodal multiple variable function.
  • Least Squares: Solve linear and non-linear least squares problems.
  • Curve Fitting: Fit a curve to a data sample.
  • Root Finding: Find the root (input that gives an output of zero) of a function.
  • Linear Programming: Linear optimization subject to constraints.

All algorithms assume the objective function that is being optimized is a minimization function. If your function is a maximization problem, it can be converted to minimization by negating the values returned from your objective function.
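
For example, the sketch below (the objective function is illustrative) maximizes a simple concave function by minimizing its negation with the minimize_scalar() function:

# example of maximizing a function by minimizing its negation
from scipy.optimize import minimize_scalar

# illustrative concave function with a maximum at x=2.0
def objective(x):
	return -(x - 2.0)**2.0 + 5.0

# negated version suitable for a minimization algorithm
def neg_objective(x):
	return -objective(x)

# minimize the negation to maximize the original function
result = minimize_scalar(neg_objective)
print(result.x, objective(result.x))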

In addition to the above list, the library also provides utility functions used by some of the algorithms, as well as the Rosenbrock test problem.

For a good overview of the capabilities of the SciPy library for optimization, see:

Now that we have a high-level idea of the types of optimization techniques supported by the library, let’s take a closer look at two groups of algorithms we are more likely to use in applied machine learning. They are local search and global search.

Local Search With SciPy

Local search, or local function optimization, refers to algorithms that seek the input to a function that results in the minimum or maximum output where the function or constrained region being searched is assumed to have a single optima, e.g. unimodal.

The function that is being optimized may or may not be convex, and may have one or more than one input variable.

A local search optimization may be applied directly to optimize a function if the function is believed or known to be unimodal; otherwise, the local search algorithm may be applied to fine-tune the result of a global search algorithm.

The SciPy library provides local search via the minimize() function.

The minimize() function takes as input the name of the objective function that is being minimized and the initial point from which to start the search and returns an OptimizeResult that summarizes the success or failure of the search and the details of the solution if found.

...
# minimize an objective function
result = minimize(objective, point)

Additional information about the objective function can be provided if known, such as the bounds on the input variables, a function for computing the first derivative of the function (gradient or Jacobian matrix), a function for computing the second derivative of the function (Hessian matrix), and any constraints on the inputs.
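
For example, if the first derivative of the objective is known, it can be passed via the “jac” argument so the search does not have to approximate the gradient numerically; a small sketch (the objective function and starting point are illustrative):

# example of providing the gradient of the objective via the 'jac' argument
from numpy import asarray
from scipy.optimize import minimize

# objective function
def objective(x):
	return x[0]**2.0 + x[1]**2.0

# gradient (first derivative) of the objective function
def gradient(x):
	return asarray([2.0 * x[0], 2.0 * x[1]])

# perform a bfgs search using the analytical gradient
result = minimize(objective, asarray([1.0, 1.0]), method='BFGS', jac=gradient)
print(result.x)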

Importantly, the function provides the “method” argument that allows the specific optimization used in the local search to be specified.

A suite of popular local search algorithms is available, such as:

  • Nelder-Mead (method='nelder-mead').
  • Powell (method='powell').
  • BFGS (method='BFGS').
  • L-BFGS-B (method='L-BFGS-B').

The example below demonstrates how to solve a two-dimensional convex function using the L-BFGS-B local search algorithm.

# l-bfgs-b algorithm local optimization of a convex function
from scipy.optimize import minimize
from numpy.random import rand

# objective function
def objective(x):
	return x[0]**2.0 + x[1]**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# define the starting point as a random sample from the domain
pt = r_min + rand(2) * (r_max - r_min)
# perform the l-bfgs-b algorithm search
result = minimize(objective, pt, method='L-BFGS-B')
# summarize the result
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
# evaluate solution
solution = result['x']
evaluation = objective(solution)
print('Solution: f(%s) = %.5f' % (solution, evaluation))

Running the example performs the optimization and reports the success or failure of the search, the number of function evaluations performed, and the input that resulted in the optima of the function.

Status : b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
Total Evaluations: 9
Solution: f([3.38059583e-07 3.70089258e-07]) = 0.00000

Now that we are familiar with using a local search algorithm with SciPy, let’s look at global search.

Global Search With SciPy

Global search or global function optimization refers to algorithms that seek the input to a function that results in the minimum or maximum output where the function or constrained region being searched is assumed to have multiple local optima, e.g. multimodal.

The function that is being optimized is typically nonlinear, nonconvex, and may have one or more than one input variable.

Global search algorithms are typically stochastic, meaning that they make use of randomness in the search process and may or may not manage a population of candidate solutions as part of the search.

The SciPy library provides a number of stochastic global optimization algorithms, each via different functions. They are:

  • Basin Hopping via the basinhopping() function.
  • Differential Evolution via the differential_evolution() function.
  • Simulated Annealing via the dual_annealing() function.

The library also provides the shgo() function for simplicial homology global optimization and the brute() function for grid search optimization.

Each algorithm returns an OptimizeResult object that summarizes the success or failure of the search and the details of the solution if found.

The example below demonstrates how to solve a two-dimensional multimodal function using simulated annealing.

# simulated annealing global optimization for a multimodal objective function
from scipy.optimize import dual_annealing

# objective function
def objective(v):
	x, y = v
	return (x**2 + y - 11)**2 + (x + y**2 -7)**2

# define range for input
r_min, r_max = -5.0, 5.0
# define the bounds on the search
bounds = [[r_min, r_max], [r_min, r_max]]
# perform the simulated annealing search
result = dual_annealing(objective, bounds)
# summarize the result
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
# evaluate solution
solution = result['x']
evaluation = objective(solution)
print('Solution: f(%s) = %.5f' % (solution, evaluation))

Running the example performs the optimization and reports the success or failure of the search, the number of function evaluations performed, and the input that resulted in the optima of the function.

Status : ['Maximum number of iteration reached']
Total Evaluations: 4028
Solution: f([-3.77931027 -3.283186 ]) = 0.00000

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

APIs

Articles

Summary

In this tutorial, you discovered optimization algorithms provided by the SciPy library.

Specifically, you learned:

  • The SciPy library provides a suite of different optimization algorithms for different purposes.
  • The local search optimization algorithms available in SciPy.
  • The global search optimization algorithms available in SciPy.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Function Optimization With SciPy appeared first on Machine Learning Mastery.

How to Use Optimization Algorithms to Manually Fit Regression Models

Regression models are typically fit on training data using an efficient, specialized optimization algorithm, although they can also be fit using general-purpose local search optimization algorithms.

Linear regression models are trained by least squares optimization, logistic regression models by an efficient local search, and these are the most efficient approaches to finding coefficients that minimize error for these models.

Nevertheless, it is possible to use alternate optimization algorithms to fit a regression model to a training dataset. This can be a useful exercise to learn more about how regression functions and the central nature of optimization in applied machine learning. It may also be required for regression with data that does not meet the requirements of a least squares optimization procedure.

In this tutorial, you will discover how to manually optimize the coefficients of regression models.

After completing this tutorial, you will know:

  • How to develop the inference models for regression from scratch.
  • How to optimize the coefficients of a linear regression model for predicting numeric values.
  • How to optimize the coefficients of a logistic regression model using stochastic hill climbing.

Let’s get started.

How to Use Optimization Algorithms to Manually Fit Regression Models

How to Use Optimization Algorithms to Manually Fit Regression Models
Photo by Christian Collins, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Optimize Regression Models
  2. Optimize a Linear Regression Model
  3. Optimize a Logistic Regression Model

Optimize Regression Models

Regression models, like linear regression and logistic regression, are well-understood algorithms from the field of statistics.

Both algorithms are linear, meaning the output of the model is a weighted sum of the inputs. Linear regression is designed for “regression” problems that require a number to be predicted, and logistic regression is designed for “classification” problems that require a class label to be predicted.

These regression models involve the use of an optimization algorithm to find a set of coefficients for each input to the model that minimizes the prediction error. Because the models are linear and well understood, efficient optimization algorithms can be used.

In the case of linear regression, the coefficients can be found by least squares optimization, which can be solved using linear algebra. In the case of logistic regression, a local search optimization algorithm is commonly used.
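As a brief illustration of the linear regression case, the least squares solution can be computed directly with linear algebra; the sketch below (assuming NumPy’s lstsq() function and a synthetic dataset, purely for illustration) shows the idea:

...
# sketch: solving linear regression coefficients directly via least squares
from numpy import hstack, ones
from numpy.linalg import lstsq
from sklearn.datasets import make_regression
# illustrative synthetic dataset
X, y = make_regression(n_samples=100, n_features=3, noise=0.1, random_state=1)
# append a column of ones so the last coefficient acts as the bias
Xb = hstack((X, ones((X.shape[0], 1))))
# solve for the coefficients that minimize squared error
coefficients, _, _, _ = lstsq(Xb, y, rcond=None)
print(coefficients)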

It is possible to use any arbitrary optimization algorithm to train linear and logistic regression models.

That is, we can define a regression model and use a given optimization algorithm to find a set of coefficients for the model that result in a minimum of prediction error or a maximum of classification accuracy.

Using alternate optimization algorithms is expected to be less efficient on average than using the recommended optimization. Nevertheless, it may be more efficient in some specific cases, such as if the input data does not meet the expectations of the model, like having a Gaussian distribution or being uncorrelated with the other inputs.

It can also be an interesting exercise to demonstrate the central nature of optimization in training machine learning algorithms, and specifically regression models.

Next, let’s explore how to train a linear regression model using stochastic hill climbing.

Optimize a Linear Regression Model

The linear regression model might be the simplest predictive model that learns from data.

The model has one coefficient for each input, and the predicted output is simply a weighted sum of the inputs using those coefficients, plus a bias term.
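For example, for a row with inputs x1 and x2, the prediction is calculated as follows (where the final coefficient is the bias, or y-intercept):

  • yhat = coefficient1 * x1 + coefficient2 * x2 + bias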

In this section, we will optimize the coefficients of a linear regression model.

First, let’s define a synthetic regression problem that we can use as the focus of optimizing the model.

We can use the make_regression() function to define a regression problem with 1,000 rows and 10 input variables.

The example below creates the dataset and summarizes the shape of the data.

# define a regression dataset
from sklearn.datasets import make_regression
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=2, noise=0.2, random_state=1)
# summarize the shape of the dataset
print(X.shape, y.shape)

Running the example prints the shape of the created dataset, confirming our expectations.

(1000, 10) (1000,)

Next, we need to define a linear regression model.

Before we optimize the model coefficients, we must develop the model and our confidence in how it works.

Let’s start by developing a function that calculates the activation of the model for a given input row of data from the dataset.

This function will take the row of data and the coefficients for the model and calculate the weighted sum of the input with the addition of an extra y-intercept (also called the offset or bias) coefficient. The predict_row() function below implements this.

We are using simple Python lists and an imperative programming style instead of NumPy arrays or list comprehensions intentionally to make the code more readable for Python beginners. Feel free to optimize it and post your code in the comments below.

# linear regression
def predict_row(row, coefficients):
	# add the bias, the last coefficient
	result = coefficients[-1]
	# add the weighted input
	for i in range(len(row)):
		result += coefficients[i] * row[i]
	return result

Next, we can call the predict_row() function for each row in a given dataset. The predict_dataset() function below implements this.

Again, we are intentionally using a simple imperative coding style for readability instead of list comprehensions.

# use model coefficients to generate predictions for a dataset of rows
def predict_dataset(X, coefficients):
	yhats = list()
	for row in X:
		# make a prediction
		yhat = predict_row(row, coefficients)
		# store the prediction
		yhats.append(yhat)
	return yhats

Finally, we can use the model to make predictions on our synthetic dataset to confirm it is all working correctly.

We can generate a random set of model coefficients using the rand() function.

Recall that we need one coefficient for each input (ten inputs in this dataset) plus an extra weight for the y-intercept coefficient.

...
# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=2, noise=0.2, random_state=1)
# determine the number of coefficients
n_coeff = X.shape[1] + 1
# generate random coefficients
coefficients = rand(n_coeff)

We can then use these coefficients with the dataset to make predictions.

...
# generate predictions for dataset
yhat = predict_dataset(X, coefficients)

We can evaluate the mean squared error of these predictions.

...
# calculate model prediction error
score = mean_squared_error(y, yhat)
print('MSE: %f' % score)

That’s it.

We can tie all of this together and demonstrate our linear regression model for regression predictive modeling. The complete example is listed below.

# linear regression model
from numpy.random import rand
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error

# linear regression
def predict_row(row, coefficients):
	# add the bias, the last coefficient
	result = coefficients[-1]
	# add the weighted input
	for i in range(len(row)):
		result += coefficients[i] * row[i]
	return result

# use model coefficients to generate predictions for a dataset of rows
def predict_dataset(X, coefficients):
	yhats = list()
	for row in X:
		# make a prediction
		yhat = predict_row(row, coefficients)
		# store the prediction
		yhats.append(yhat)
	return yhats

# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=2, noise=0.2, random_state=1)
# determine the number of coefficients
n_coeff = X.shape[1] + 1
# generate random coefficients
coefficients = rand(n_coeff)
# generate predictions for dataset
yhat = predict_dataset(X, coefficients)
# calculate model prediction error
score = mean_squared_error(y, yhat)
print('MSE: %f' % score)

Running the example generates a prediction for each example in the training dataset, then prints the mean squared error for the predictions.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We would expect a large error given a set of random weights, and that is what we see in this case, with an error value of about 7,307 units.

MSE: 7307.756740

We can now optimize the coefficients of the model to achieve low error on this dataset.

First, we need to split the dataset into train and test sets. It is important to hold back some data not used in optimizing the model so that we can prepare a reasonable estimate of the performance of the model when used to make predictions on new data.

We will use 67 percent of the data for training and the remaining 33 percent as a test set for evaluating the performance of the model.

...
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

Next, we can develop a stochastic hill climbing algorithm.

The optimization algorithm requires an objective function to optimize. It must take a set of coefficients and return a score that is to be minimized or maximized corresponding to a better model.

In this case, we will evaluate the mean squared error of the model with a given set of coefficients and return the error score, which must be minimized.

The objective() function below implements this, given the dataset and a set of coefficients, and returns the error of the model.

# objective function
def objective(X, y, coefficients):
	# generate predictions for dataset
	yhat = predict_dataset(X, coefficients)
	# calculate mean squared error
	score = mean_squared_error(y, yhat)
	return score

Next, we can define the stochastic hill climbing algorithm.

The algorithm will require an initial solution (e.g. random coefficients) and will iteratively keep making small changes to the solution and checking if it results in a better performing model. The amount of change made to the current solution is controlled by a step_size hyperparameter. This process will continue for a fixed number of iterations, also provided as a hyperparameter.

The hillclimbing() function below implements this, taking the dataset, objective function, initial solution, and hyperparameters as arguments and returns the best set of coefficients found and the estimated performance.

# hill climbing local search algorithm
def hillclimbing(X, y, objective, solution, n_iter, step_size):
	# evaluate the initial point
	solution_eval = objective(X, y, solution)
	# run the hill climb
	for i in range(n_iter):
		# take a step
		candidate = solution + randn(len(solution)) * step_size
		# evaluate candidate point
		candidate_eval = objective(X, y, candidate)
		# check if we should keep the new point
		if candidate_eval <= solution_eval:
			# store the new point
			solution, solution_eval = candidate, candidate_eval
			# report progress
			print('>%d %.5f' % (i, solution_eval))
	return [solution, solution_eval]

We can then call this function, passing in an initial set of coefficients as the initial solution and the training dataset as the dataset to optimize the model against.

...
# define the total iterations
n_iter = 2000
# define the maximum step size
step_size = 0.15
# determine the number of coefficients
n_coef = X.shape[1] + 1
# define the initial solution
solution = rand(n_coef)
# perform the hill climbing search
coefficients, score = hillclimbing(X_train, y_train, objective, solution, n_iter, step_size)
print('Done!')
print('Coefficients: %s' % coefficients)
print('Train MSE: %f' % (score))

Finally, we can evaluate the best model on the test dataset and report the performance.

...
# generate predictions for the test dataset
yhat = predict_dataset(X_test, coefficients)
# calculate mean squared error
score = mean_squared_error(y_test, yhat)
print('Test MSE: %f' % (score))

Tying this together, the complete example of optimizing the coefficients of a linear regression model on the synthetic regression dataset is listed below.

# optimize linear regression coefficients for regression dataset
from numpy.random import randn
from numpy.random import rand
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# linear regression
def predict_row(row, coefficients):
	# add the bias, the last coefficient
	result = coefficients[-1]
	# add the weighted input
	for i in range(len(row)):
		result += coefficients[i] * row[i]
	return result

# use model coefficients to generate predictions for a dataset of rows
def predict_dataset(X, coefficients):
	yhats = list()
	for row in X:
		# make a prediction
		yhat = predict_row(row, coefficients)
		# store the prediction
		yhats.append(yhat)
	return yhats

# objective function
def objective(X, y, coefficients):
	# generate predictions for dataset
	yhat = predict_dataset(X, coefficients)
	# calculate mean squared error
	score = mean_squared_error(y, yhat)
	return score

# hill climbing local search algorithm
def hillclimbing(X, y, objective, solution, n_iter, step_size):
	# evaluate the initial point
	solution_eval = objective(X, y, solution)
	# run the hill climb
	for i in range(n_iter):
		# take a step
		candidate = solution + randn(len(solution)) * step_size
		# evaluate candidate point
		candidate_eval = objective(X, y, candidate)
		# check if we should keep the new point
		if candidate_eval <= solution_eval:
			# store the new point
			solution, solution_eval = candidate, candidate_eval
			# report progress
			print('>%d %.5f' % (i, solution_eval))
	return [solution, solution_eval]

# define dataset
X, y = make_regression(n_samples=1000, n_features=10, n_informative=2, noise=0.2, random_state=1)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# define the total iterations
n_iter = 2000
# define the maximum step size
step_size = 0.15
# determine the number of coefficients
n_coef = X.shape[1] + 1
# define the initial solution
solution = rand(n_coef)
# perform the hill climbing search
coefficients, score = hillclimbing(X_train, y_train, objective, solution, n_iter, step_size)
print('Done!')
print('Coefficients: %s' % coefficients)
print('Train MSE: %f' % (score))
# generate predictions for the test dataset
yhat = predict_dataset(X_test, coefficients)
# calculate mean squared error
score = mean_squared_error(y_test, yhat)
print('Test MSE: %f' % (score))

Running the example will report the iteration number and mean squared error each time there is an improvement made to the model.

At the end of the search, the performance of the best set of coefficients on the training dataset is reported and the performance of the same model on the test dataset is calculated and reported.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the optimization algorithm found a set of coefficients that achieved an error of about 0.08 on both the train and test datasets.

The fact that the algorithm found a model with very similar performance on train and test datasets is a good sign, showing that the model did not over-fit (over-optimize) to the training dataset. This means the model generalizes well to new data.

...
>1546 0.35426
>1567 0.32863
>1572 0.32322
>1619 0.24890
>1665 0.24800
>1691 0.24162
>1715 0.15893
>1809 0.15337
>1892 0.14656
>1956 0.08042
Done!
Coefficients: [ 1.30559829e-02 -2.58299382e-04  3.33118191e+00  3.20418534e-02
  1.36497902e-01  8.65445367e+01  2.78356715e-02 -8.50901499e-02
  8.90078243e-02  6.15779867e-02 -3.85657793e-02]
Train MSE: 0.080415
Test MSE: 0.080779

Now that we are familiar with how to manually optimize the coefficients of a linear regression model, let’s look at how we can extend the example to optimize the coefficients of a logistic regression model for classification.

Optimize a Logistic Regression Model

A Logistic Regression model is an extension of linear regression for classification predictive modeling.

Logistic regression is for binary classification tasks, meaning datasets that have two class labels, class=0 and class=1.

The output first involves calculating the weighted sum of the inputs, then passing this weighted sum through a logistic function, also called a sigmoid function. The result is a binomial probability between 0 and 1 of the example belonging to class=1.

In this section, we will build on what we learned in the previous section to optimize the coefficients of regression models for classification. We will develop the model and test it with random coefficients, then use stochastic hill climbing to optimize the model coefficients.

First, let’s define a synthetic binary classification problem that we can use as the focus of optimizing the model.

We can use the make_classification() function to define a binary classification problem with 1,000 rows and five input variables.

The example below creates the dataset and summarizes the shape of the data.

# define a binary classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# summarize the shape of the dataset
print(X.shape, y.shape)

Running the example prints the shape of the created dataset, confirming our expectations.

(1000, 5) (1000,)

Next, we need to define a logistic regression model.

Let’s start by updating the predict_row() function to pass the weighted sum of the input and coefficients through a logistic function.

The logistic function is defined as:

  • logistic = 1.0 / (1.0 + exp(-result))

Where result is the weighted sum of the inputs and the coefficients, and exp() raises e (Euler’s number) to the power of the provided value, implemented via the math module’s exp() function. For example, a weighted sum of 0.0 gives a probability of 0.5, large positive sums approach 1.0, and large negative sums approach 0.0.

The updated predict_row() function is listed below.

# logistic regression
def predict_row(row, coefficients):
	# add the bias, the last coefficient
	result = coefficients[-1]
	# add the weighted input
	for i in range(len(row)):
		result += coefficients[i] * row[i]
	# logistic function
	logistic = 1.0 / (1.0 + exp(-result))
	return logistic

That’s about it in terms of changes from linear regression to logistic regression.

As with linear regression, we can test the model with a set of random model coefficients.

...
# determine the number of coefficients
n_coeff = X.shape[1] + 1
# generate random coefficients
coefficients = rand(n_coeff)
# generate predictions for dataset
yhat = predict_dataset(X, coefficients)

The predictions made by the model are probabilities for an example belonging to class=1.

We can round the prediction to be integer values 0 and 1 for the expected class labels.

...
# round predictions to labels
yhat = [round(y) for y in yhat]

We can evaluate the classification accuracy of these predictions.

...
# calculate accuracy
score = accuracy_score(y, yhat)
print('Accuracy: %f' % score)

That’s it.

We can tie all of this together and demonstrate our simple logistic regression model for binary classification. The complete example is listed below.

# logistic regression function for binary classification
from math import exp
from numpy.random import rand
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score

# logistic regression
def predict_row(row, coefficients):
	# add the bias, the last coefficient
	result = coefficients[-1]
	# add the weighted input
	for i in range(len(row)):
		result += coefficients[i] * row[i]
	# logistic function
	logistic = 1.0 / (1.0 + exp(-result))
	return logistic

# use model coefficients to generate predictions for a dataset of rows
def predict_dataset(X, coefficients):
	yhats = list()
	for row in X:
		# make a prediction
		yhat = predict_row(row, coefficients)
		# store the prediction
		yhats.append(yhat)
	return yhats

# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# determine the number of coefficients
n_coeff = X.shape[1] + 1
# generate random coefficients
coefficients = rand(n_coeff)
# generate predictions for dataset
yhat = predict_dataset(X, coefficients)
# round predictions to labels
yhat = [round(y) for y in yhat]
# calculate accuracy
score = accuracy_score(y, yhat)
print('Accuracy: %f' % score)

Running the example generates a prediction for each example in the training dataset then prints the classification accuracy for the predictions.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We would expect about 50 percent accuracy given a set of random weights and a dataset with an equal number of examples in each class, and that is approximately what we see in this case.

Accuracy: 0.540000

We can now optimize the coefficients of the model to achieve good accuracy on this dataset.

The stochastic hill climbing algorithm used for linear regression can be used again for logistic regression.

The important difference is an update to the objective() function to round the predictions and evaluate the model using classification accuracy instead of mean squared error.

# objective function
def objective(X, y, coefficients):
	# generate predictions for dataset
	yhat = predict_dataset(X, coefficients)
	# round predictions to labels
	yhat = [round(y) for y in yhat]
	# calculate accuracy
	score = accuracy_score(y, yhat)
	return score

The hillclimbing() function also must be updated to maximize the score of solutions instead of minimizing in the case of linear regression.

# hill climbing local search algorithm
def hillclimbing(X, y, objective, solution, n_iter, step_size):
	# evaluate the initial point
	solution_eval = objective(X, y, solution)
	# run the hill climb
	for i in range(n_iter):
		# take a step
		candidate = solution + randn(len(solution)) * step_size
		# evaluate candidate point
		candidate_eval = objective(X, y, candidate)
		# check if we should keep the new point
		if candidate_eval >= solution_eval:
			# store the new point
			solution, solution_eval = candidate, candidate_eval
			# report progress
			print('>%d %.5f' % (i, solution_eval))
	return [solution, solution_eval]

Finally, the coefficients found by the search can be evaluated using classification accuracy at the end of the run.

...
# generate predictions for the test dataset
yhat = predict_dataset(X_test, coefficients)
# round predictions to labels
yhat = [round(y) for y in yhat]
# calculate accuracy
score = accuracy_score(y_test, yhat)
print('Test Accuracy: %f' % (score))

Tying this all together, the complete example of using stochastic hill climbing to maximize classification accuracy of a logistic regression model is listed below.

# optimize logistic regression model with a stochastic hill climber
from math import exp
from numpy.random import randn
from numpy.random import rand
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# logistic regression
def predict_row(row, coefficients):
	# add the bias, the last coefficient
	result = coefficients[-1]
	# add the weighted input
	for i in range(len(row)):
		result += coefficients[i] * row[i]
	# logistic function
	logistic = 1.0 / (1.0 + exp(-result))
	return logistic

# use model coefficients to generate predictions for a dataset of rows
def predict_dataset(X, coefficients):
	yhats = list()
	for row in X:
		# make a prediction
		yhat = predict_row(row, coefficients)
		# store the prediction
		yhats.append(yhat)
	return yhats

# objective function
def objective(X, y, coefficients):
	# generate predictions for dataset
	yhat = predict_dataset(X, coefficients)
	# round predictions to labels
	yhat = [round(y) for y in yhat]
	# calculate accuracy
	score = accuracy_score(y, yhat)
	return score

# hill climbing local search algorithm
def hillclimbing(X, y, objective, solution, n_iter, step_size):
	# evaluate the initial point
	solution_eval = objective(X, y, solution)
	# run the hill climb
	for i in range(n_iter):
		# take a step
		candidate = solution + randn(len(solution)) * step_size
		# evaluate candidate point
		candidate_eval = objective(X, y, candidate)
		# check if we should keep the new point
		if candidate_eval >= solution_eval:
			# store the new point
			solution, solution_eval = candidate, candidate_eval
			# report progress
			print('>%d %.5f' % (i, solution_eval))
	return [solution, solution_eval]

# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=2, n_redundant=1, random_state=1)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# define the total iterations
n_iter = 2000
# define the maximum step size
step_size = 0.1
# determine the number of coefficients
n_coef = X.shape[1] + 1
# define the initial solution
solution = rand(n_coef)
# perform the hill climbing search
coefficients, score = hillclimbing(X_train, y_train, objective, solution, n_iter, step_size)
print('Done!')
print('Coefficients: %s' % coefficients)
print('Train Accuracy: %f' % (score))
# generate predictions for the test dataset
yhat = predict_dataset(X_test, coefficients)
# round predictions to labels
yhat = [round(y) for y in yhat]
# calculate accuracy
score = accuracy_score(y_test, yhat)
print('Test Accuracy: %f' % (score))

Running the example will report the iteration number and classification accuracy each time there is an improvement made to the model.

At the end of the search, the performance of the best set of coefficients on the training dataset is reported and the performance of the same model on the test dataset is calculated and reported.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the optimization algorithm found a set of weights that achieved about 87.3 percent accuracy on the training dataset and about 83.9 percent accuracy on the test dataset.

...
>200 0.85672
>225 0.85672
>230 0.85672
>245 0.86418
>281 0.86418
>285 0.86716
>294 0.86716
>306 0.86716
>316 0.86716
>317 0.86716
>320 0.86866
>348 0.86866
>362 0.87313
>784 0.87313
>1649 0.87313
Done!
Coefficients: [-0.04652756  0.23243427  2.58587637 -0.45528253 -0.4954355  -0.42658053]
Train Accuracy: 0.873134
Test Accuracy: 0.839394

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

APIs

Articles

Summary

In this tutorial, you discovered how to manually optimize the coefficients of regression models.

Specifically, you learned:

  • How to develop the inference models for regression from scratch.
  • How to optimize the coefficients of a linear regression model for predicting numeric values.
  • How to optimize the coefficients of a logistic regression model using stochastic hill climbing.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post How to Use Optimization Algorithms to Manually Fit Regression Models appeared first on Machine Learning Mastery.

How to Develop a Neural Net for Predicting Disturbances in the Ionosphere

It can be challenging to develop a neural network predictive model for a new dataset.

One approach is to first inspect the dataset and develop ideas for what models might work, then explore the learning dynamics of simple models on the dataset, then finally develop and tune a model for the dataset with a robust test harness.

This process can be used to develop effective neural network models for classification and regression predictive modeling problems.

In this tutorial, you will discover how to develop a Multilayer Perceptron neural network model for the ionosphere binary classification dataset.

After completing this tutorial, you will know:

  • How to load and summarize the ionosphere dataset and use the results to suggest data preparations and model configurations to use.
  • How to explore the learning dynamics of simple MLP models on the dataset.
  • How to develop robust estimates of model performance, tune model performance, and make predictions on new data.

Let’s get started.

How to Develop a Neural Net for Predicting Disturbances in the Ionosphere

How to Develop a Neural Net for Predicting Disturbances in the Ionosphere
Photo by Sergey Pesterev, some rights reserved.

Tutorial Overview

This tutorial is divided into four parts; they are:

  1. Ionosphere Binary Classification Dataset
  2. Neural Network Learning Dynamics
  3. Evaluating and Tuning MLP Models
  4. Final Model and Make Predictions

Ionosphere Binary Classification Dataset

The first step is to define and explore the dataset.

We will be working with the “Ionosphere” standard binary classification dataset.

This dataset involves predicting whether a structure is in the atmosphere or not given radar returns.

You can learn more about the dataset here:

You can see the first few rows of the dataset below.

1,0,0.99539,-0.05889,0.85243,0.02306,0.83398,-0.37708,1,0.03760,0.85243,-0.17755,0.59755,-0.44945,0.60536,-0.38223,0.84356,-0.38542,0.58212,-0.32192,0.56971,-0.29674,0.36946,-0.47357,0.56811,-0.51171,0.41078,-0.46168,0.21266,-0.34090,0.42267,-0.54487,0.18641,-0.45300,g
1,0,1,-0.18829,0.93035,-0.36156,-0.10868,-0.93597,1,-0.04549,0.50874,-0.67743,0.34432,-0.69707,-0.51685,-0.97515,0.05499,-0.62237,0.33109,-1,-0.13151,-0.45300,-0.18056,-0.35734,-0.20332,-0.26569,-0.20468,-0.18401,-0.19040,-0.11593,-0.16626,-0.06288,-0.13738,-0.02447,b
1,0,1,-0.03365,1,0.00485,1,-0.12062,0.88965,0.01198,0.73082,0.05346,0.85443,0.00827,0.54591,0.00299,0.83775,-0.13644,0.75535,-0.08540,0.70887,-0.27502,0.43385,-0.12062,0.57528,-0.40220,0.58984,-0.22145,0.43100,-0.17365,0.60436,-0.24180,0.56045,-0.38238,g
1,0,1,-0.45161,1,1,0.71216,-1,0,0,0,0,0,0,-1,0.14516,0.54094,-0.39330,-1,-0.54467,-0.69975,1,0,0,1,0.90695,0.51613,1,1,-0.20099,0.25682,1,-0.32382,1,b
1,0,1,-0.02401,0.94140,0.06531,0.92106,-0.23255,0.77152,-0.16399,0.52798,-0.20275,0.56409,-0.00712,0.34395,-0.27457,0.52940,-0.21780,0.45107,-0.17813,0.05982,-0.35575,0.02309,-0.52879,0.03286,-0.65158,0.13290,-0.53206,0.02431,-0.62197,-0.05707,-0.59573,-0.04608,-0.65697,g
...

We can see that the values are all numeric and perhaps in the range [-1, 1]. This suggests some type of scaling would probably not be needed.

We can also see that the label is a string (“g” and “b“), suggesting that the values will need to be encoded to 0 and 1 prior to fitting a model.

We can load the dataset as a pandas DataFrame directly from the URL; for example:

# load the ionosphere dataset and summarize the shape
from pandas import read_csv
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.csv'
# load the dataset
df = read_csv(url, header=None)
# summarize shape
print(df.shape)

Running the example loads the dataset directly from the URL and reports the shape of the dataset.

In this case, we can see that the dataset has 35 variables (34 input and one output) and that the dataset has 351 rows of data.

This is not many rows of data for a neural network and suggests that a small network, perhaps with regularization, would be appropriate.

It also suggests that using k-fold cross-validation would be a good idea given that it will give a more reliable estimate of model performance than a train/test split and because a single model will fit in seconds instead of hours or days with the largest datasets.

(351, 35)

Next, we can learn more about the dataset by looking at summary statistics and a plot of the data.

# show summary statistics and plots of the ionosphere dataset
from pandas import read_csv
from matplotlib import pyplot
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.csv'
# load the dataset
df = read_csv(url, header=None)
# show summary statistics
print(df.describe())
# plot histograms
df.hist()
pyplot.show()

Running the example first loads the data and then prints summary statistics for each variable.

We can see that the mean values for each variable are small fractions and that the values range from -1 to 1. This confirms that scaling the data is probably not required.

0      1           2   ...          31          32          33
count  351.000000  351.0  351.000000  ...  351.000000  351.000000  351.000000
mean     0.891738    0.0    0.641342  ...   -0.003794    0.349364    0.014480
std      0.311155    0.0    0.497708  ...    0.513574    0.522663    0.468337
min      0.000000    0.0   -1.000000  ...   -1.000000   -1.000000   -1.000000
25%      1.000000    0.0    0.472135  ...   -0.242595    0.000000   -0.165350
50%      1.000000    0.0    0.871110  ...    0.000000    0.409560    0.000000
75%      1.000000    0.0    1.000000  ...    0.200120    0.813765    0.171660
max      1.000000    0.0    1.000000  ...    1.000000    1.000000    1.000000

A histogram plot is then created for each variable.

We can see that many variables have a Gaussian or Gaussian-like distribution.

We may have some benefit in using a power transform on each variable in order to make the probability distribution less skewed, which will likely improve model performance; a sketch of this idea is shown after the plot below.

Histograms of the Ionosphere Classification Dataset

Histograms of the Ionosphere Classification Dataset
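As a quick sketch of what such a transform might look like (assuming scikit-learn’s PowerTransformer; the default yeo-johnson method supports the zero and negative values in this data):

# sketch: applying a power transform to the ionosphere input variables
from pandas import read_csv
from sklearn.preprocessing import PowerTransformer
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.csv'
# load the dataset and separate the inputs from the label column
df = read_csv(url, header=None)
X = df.values[:, :-1].astype('float32')
# skip constant columns (the second variable is all zeros)
X = X[:, X.std(axis=0) > 0]
# make each variable more Gaussian-like
X_trans = PowerTransformer(method='yeo-johnson').fit_transform(X)
print(X_trans.shape)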

Now that we are familiar with the dataset, let’s explore how we might develop a neural network model.

Neural Network Learning Dynamics

We will develop a Multilayer Perceptron (MLP) model for the dataset using TensorFlow.

We cannot know what model architecture or learning hyperparameters would be good or best for this dataset, so we must experiment and discover what works well.

Given that the dataset is small, a small batch size is probably a good idea, e.g. 16 or 32 rows. Using the Adam version of stochastic gradient descent is a good idea when getting started, as it automatically adapts the learning rate and works well on most datasets.

Before we evaluate models in earnest, it is a good idea to review the learning dynamics and tune the model architecture and learning configuration until we have stable learning dynamics, then look at getting the most out of the model.

We can do this by using a simple train/test split of the data and review plots of the learning curves. This will help us see if we are over-learning or under-learning; then we can adapt the configuration accordingly.

First, we must ensure all input variables are floating-point values and encode the target label as integer values 0 and 1.

...
# ensure all data are floating point values
X = X.astype('float32')
# encode strings to integer
y = LabelEncoder().fit_transform(y)

Next, we can split the dataset into input and output variables, then into 67/33 train and test sets.

...
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

We can define a minimal MLP model. In this case, we will use one hidden layer with 10 nodes and one output layer (chosen arbitrarily). We will use the ReLU activation function in the hidden layer and the “he_normal” weight initialization, as together, they are a good practice.

The output of the model is a sigmoid activation for binary classification and we will minimize binary cross-entropy loss.

...
# determine the number of input features
n_features = X.shape[1]
# define model
model = Sequential()
model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy')

We will fit the model for 200 training epochs (chosen arbitrarily) with a batch size of 32 because it is a small dataset.

We are fitting the model on the raw data, which we think might be a good idea, and it is an important starting point.

...
# fit the model
history = model.fit(X_train, y_train, epochs=200, batch_size=32, verbose=0, validation_data=(X_test,y_test))

At the end of training, we will evaluate the model’s performance on the test dataset and report performance as the classification accuracy.

...
# predict test set
yhat = model.predict_classes(X_test)
# evaluate predictions
score = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % score)

Finally, we will plot learning curves of the cross-entropy loss on the train and test sets during training.

...
# plot learning curves
pyplot.title('Learning Curves')
pyplot.xlabel('Epoch')
pyplot.ylabel('Cross Entropy')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='val')
pyplot.legend()
pyplot.show()

Tying this all together, the complete example of evaluating our first MLP on the ionosphere dataset is listed below.

# fit a simple mlp model on the ionosphere and review learning curves
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from matplotlib import pyplot
# load the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.csv'
df = read_csv(path, header=None)
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# ensure all data are floating point values
X = X.astype('float32')
# encode strings to integer
y = LabelEncoder().fit_transform(y)
# split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# determine the number of input features
n_features = X.shape[1]
# define model
model = Sequential()
model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy')
# fit the model
history = model.fit(X_train, y_train, epochs=200, batch_size=32, verbose=0, validation_data=(X_test,y_test))
# predict test set
yhat = model.predict_classes(X_test)
# evaluate predictions
score = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % score)
# plot learning curves
pyplot.title('Learning Curves')
pyplot.xlabel('Epoch')
pyplot.ylabel('Cross Entropy')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='val')
pyplot.legend()
pyplot.show()

Running the example first fits the model on the training dataset, then reports the classification accuracy on the test dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved an accuracy of about 88 percent, which is a good performance baseline that we might be able to improve upon.

Accuracy: 0.888

Line plots of the loss on the train and test sets are then created.

We can see that the model appears to converge but has overfit the training dataset.

Learning Curves of Simple MLP on Ionosphere Dataset

Learning Curves of Simple MLP on Ionosphere Dataset

Let’s try increasing the capacity of the model.

This will slow down learning for the same learning hyperparameters and may offer better accuracy.

We will add a second hidden layer with eight nodes, chosen arbitrarily.

...
# define model
model = Sequential()
model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(8, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(1, activation='sigmoid'))

The complete example is listed below.

# fit a deeper mlp model on the ionosphere and review learning curves
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from matplotlib import pyplot
# load the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.csv'
df = read_csv(path, header=None)
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# ensure all data are floating point values
X = X.astype('float32')
# encode strings to integer
y = LabelEncoder().fit_transform(y)
# split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# determine the number of input features
n_features = X.shape[1]
# define model
model = Sequential()
model.add(Dense(10, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(8, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy')
# fit the model
history = model.fit(X_train, y_train, epochs=200, batch_size=32, verbose=0, validation_data=(X_test,y_test))
# predict test set
yhat = model.predict_classes(X_test)
# evaluate predictions
score = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % score)
# plot learning curves
pyplot.title('Learning Curves')
pyplot.xlabel('Epoch')
pyplot.ylabel('Cross Entropy')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='val')
pyplot.legend()
pyplot.show()

Running the example first fits the model on the training dataset, then reports the accuracy on the test dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see a slight improvement in accuracy to about 93 percent, although the high variance of the train/test split means that this evaluation is not reliable.

Accuracy: 0.931

Learning curves for the loss on the train and test sets are then plotted. We can see that the model still appears to show an overfitting behavior.

Learning Curves of Deeper MLP on the Ionosphere Dataset

Learning Curves of Deeper MLP on the Ionosphere Dataset

Finally, we can try a wider network.

We will increase the number of nodes in the first hidden layer from 10 to 50, and in the second hidden layer from 8 to 10.

This will add more capacity to the model, slow down learning, and may further improve results.

...
# define model
model = Sequential()
model.add(Dense(50, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(10, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(1, activation='sigmoid'))

We will also reduce the number of training epochs from 200 to 100.

...
# fit the model
history = model.fit(X_train, y_train, epochs=100, batch_size=32, verbose=0, validation_data=(X_test,y_test))

The complete example is listed below.

# fit a wider mlp model on the ionosphere and review learning curves
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from matplotlib import pyplot
# load the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.csv'
df = read_csv(path, header=None)
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# ensure all data are floating point values
X = X.astype('float32')
# encode strings to integer
y = LabelEncoder().fit_transform(y)
# split into train and test datasets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
# determine the number of input features
n_features = X.shape[1]
# define model
model = Sequential()
model.add(Dense(50, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dense(10, activation='relu', kernel_initializer='he_normal'))
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy')
# fit the model
history = model.fit(X_train, y_train, epochs=100, batch_size=32, verbose=0, validation_data=(X_test,y_test))
# predict test set
yhat = model.predict_classes(X_test)
# evaluate predictions
score = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % score)
# plot learning curves
pyplot.title('Learning Curves')
pyplot.xlabel('Epoch')
pyplot.ylabel('Cross Entropy')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='val')
pyplot.legend()
pyplot.show()

Running the example first fits the model on the training dataset, then reports the accuracy on the test dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the model achieves a better accuracy score, with a value of about 94 percent. We will ignore model performance for now.

Accuracy: 0.940

Line plots of the learning curves are created showing that the model achieved a reasonable fit and had more than enough time to converge.

Learning Curves of Wider MLP on the Ionosphere Dataset

Learning Curves of Wider MLP on the Ionosphere Dataset

Now that we have some idea of the learning dynamics for simple MLP models on the dataset, we can look at evaluating the performance of the models as well as tuning the configuration of the models.

Evaluating and Tuning MLP Models

The k-fold cross-validation procedure can provide a more reliable estimate of MLP performance, although it can be very slow.

This is because k models must be fit and evaluated. This is not a problem when the dataset size is small, such as the ionosphere dataset.

We can use the StratifiedKFold class and enumerate each fold manually, fit the model, evaluate it, and then report the mean of the evaluation scores at the end of the procedure.

# prepare cross validation
kfold = StratifiedKFold(10)
# enumerate splits
scores = list()
for train_ix, test_ix in kfold.split(X, y):
	# fit and evaluate the model...
	...
...
# summarize all scores
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

We can use this framework to develop a reliable estimate of MLP model performance with a range of different data preparations, model architectures, and learning configurations.

It is important that we first develop an understanding of the learning dynamics of the model on the dataset, as in the previous section, before using k-fold cross-validation to estimate performance. If we started to tune the model directly, we might get good results, but if not, we might have no idea why, e.g. whether the model was overfitting or underfitting.

Again, if we make large changes to the model, it is a good idea to go back and confirm that the model is converging appropriately.

The complete example of this framework to evaluate the base MLP model from the previous section is listed below.

# k-fold cross-validation of base model for the ionosphere dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from matplotlib import pyplot
# load the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.csv'
df = read_csv(path, header=None)
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# ensure all data are floating point values
X = X.astype('float32')
# encode strings to integer
y = LabelEncoder().fit_transform(y)
# prepare cross validation
kfold = StratifiedKFold(10)
# enumerate splits
scores = list()
for train_ix, test_ix in kfold.split(X, y):
	# split data
	X_train, X_test, y_train, y_test = X[train_ix], X[test_ix], y[train_ix], y[test_ix]
	# determine the number of input features
	n_features = X.shape[1]
	# define model
	model = Sequential()
	model.add(Dense(50, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
	model.add(Dense(10, activation='relu', kernel_initializer='he_normal'))
	model.add(Dense(1, activation='sigmoid'))
	# compile the model
	model.compile(optimizer='adam', loss='binary_crossentropy')
	# fit the model
	model.fit(X_train, y_train, epochs=100, batch_size=32, verbose=0)
	# predict test set
	yhat = model.predict_classes(X_test)
	# evaluate predictions
	score = accuracy_score(y_test, yhat)
	print('>%.3f' % score)
	scores.append(score)
# summarize all scores
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example reports the model performance each iteration of the evaluation procedure and reports the mean and standard deviation of classification accuracy at the end of the run.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the MLP model achieved a mean accuracy of about 93.4 percent.

We will use this result as our baseline to see if we can achieve better performance.

>0.972
>0.886
>0.943
>0.886
>0.914
>0.943
>0.943
>1.000
>0.971
>0.886
Mean Accuracy: 0.934 (0.039)

Next, let’s try adding regularization to reduce overfitting of the model.

In this case, we can add dropout layers between the hidden layers of the network. For example:

...
# define model
model = Sequential()
model.add(Dense(50, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dropout(0.4))
model.add(Dense(10, activation='relu', kernel_initializer='he_normal'))
model.add(Dropout(0.4))
model.add(Dense(1, activation='sigmoid'))

The complete example of the MLP model with dropout is listed below.

# k-fold cross-validation of the MLP with dropout for the ionosphere dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from matplotlib import pyplot
# load the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.csv'
df = read_csv(path, header=None)
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# ensure all data are floating point values
X = X.astype('float32')
# encode strings to integer
y = LabelEncoder().fit_transform(y)
# prepare cross validation
kfold = StratifiedKFold(10)
# enumerate splits
scores = list()
for train_ix, test_ix in kfold.split(X, y):
	# split data
	X_train, X_test, y_train, y_test = X[train_ix], X[test_ix], y[train_ix], y[test_ix]
	# determine the number of input features
	n_features = X.shape[1]
	# define model
	model = Sequential()
	model.add(Dense(50, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
	model.add(Dropout(0.4))
	model.add(Dense(10, activation='relu', kernel_initializer='he_normal'))
	model.add(Dropout(0.4))
	model.add(Dense(1, activation='sigmoid'))
	# compile the model
	model.compile(optimizer='adam', loss='binary_crossentropy')
	# fit the model
	model.fit(X_train, y_train, epochs=100, batch_size=32, verbose=0)
	# predict test set
	yhat = model.predict_classes(X_test)
	# evaluate predictions
	score = accuracy_score(y_test, yhat)
	print('>%.3f' % score)
	scores.append(score)
# summarize all scores
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example reports the mean and standard deviation of the classification accuracy at the end of the run.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the MLP model with dropout achieves better results, with an accuracy of about 94.6 percent compared to 93.4 percent without dropout.

Mean Accuracy: 0.946 (0.043)

Finally, we will try reducing the batch size from 32 down to 8.

This will result in noisier gradient estimates and may also slow down the speed at which the model learns the problem.

...
# fit the model
model.fit(X_train, y_train, epochs=100, batch_size=8, verbose=0)

The complete example is listed below.

# k-fold cross-validation of the MLP with dropout for the ionosphere dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from matplotlib import pyplot
# load the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.csv'
df = read_csv(path, header=None)
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# ensure all data are floating point values
X = X.astype('float32')
# encode strings to integer
y = LabelEncoder().fit_transform(y)
# prepare cross validation
kfold = StratifiedKFold(10)
# enumerate splits
scores = list()
for train_ix, test_ix in kfold.split(X, y):
	# split data
	X_train, X_test, y_train, y_test = X[train_ix], X[test_ix], y[train_ix], y[test_ix]
	# determine the number of input features
	n_features = X.shape[1]
	# define model
	model = Sequential()
	model.add(Dense(50, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
	model.add(Dropout(0.4))
	model.add(Dense(10, activation='relu', kernel_initializer='he_normal'))
	model.add(Dropout(0.4))
	model.add(Dense(1, activation='sigmoid'))
	# compile the model
	model.compile(optimizer='adam', loss='binary_crossentropy')
	# fit the model
	model.fit(X_train, y_train, epochs=100, batch_size=8, verbose=0)
	# predict test set
	yhat = model.predict_classes(X_test)
	# evaluate predictions
	score = accuracy_score(y_test, yhat)
	print('>%.3f' % score)
	scores.append(score)
# summarize all scores
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example reports the mean and standard deviation of the classification accuracy at the end of the run.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the MLP model with dropout achieves slightly better results with an accuracy of about 94.9 percent.

Mean Accuracy: 0.949 (0.042)

We will use this configuration as our final model.

We could continue to test alternate configurations of the model architecture (more or fewer nodes or layers), learning hyperparameters (such as the batch size or number of training epochs), and data transforms.

I leave this as an exercise; let me know what you discover. Can you get better results?
Post your results in the comments below; I’d love to see what you get.
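If you want a starting point, the example below sketches a simple grid over a handful of batch sizes using the same cross-validation harness developed above. The list of batch sizes and the evaluate_model() helper are illustrative choices, not part of the tutorial procedure.

# sketch of a grid over batch sizes for the MLP with dropout on the ionosphere dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout

# evaluate a single configuration with 10-fold cross-validation
def evaluate_model(X, y, batch_size):
	scores = list()
	for train_ix, test_ix in StratifiedKFold(10).split(X, y):
		X_train, X_test, y_train, y_test = X[train_ix], X[test_ix], y[train_ix], y[test_ix]
		model = Sequential()
		model.add(Dense(50, activation='relu', kernel_initializer='he_normal', input_shape=(X.shape[1],)))
		model.add(Dropout(0.4))
		model.add(Dense(10, activation='relu', kernel_initializer='he_normal'))
		model.add(Dropout(0.4))
		model.add(Dense(1, activation='sigmoid'))
		model.compile(optimizer='adam', loss='binary_crossentropy')
		model.fit(X_train, y_train, epochs=100, batch_size=batch_size, verbose=0)
		# threshold the predicted probabilities at 0.5
		yhat = (model.predict(X_test) > 0.5).astype('int32')
		scores.append(accuracy_score(y_test, yhat))
	return scores

# load and prepare the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.csv'
df = read_csv(path, header=None)
X, y = df.values[:, :-1], df.values[:, -1]
X = X.astype('float32')
y = LabelEncoder().fit_transform(y)
# compare a few batch sizes under the same harness
for batch_size in [4, 8, 16, 32]:
	scores = evaluate_model(X, y, batch_size)
	print('batch_size=%d: %.3f (%.3f)' % (batch_size, mean(scores), std(scores)))

Running the sketch prints the mean and standard deviation of accuracy for each batch size, which can then be compared directly.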

Next, let’s look at how we might fit a final model and use it to make predictions.

Final Model and Make Predictions

Once we choose a model configuration, we can train a final model on all available data and use it to make predictions on new data.

In this case, we will use the model with dropout and a small batch size as our final model.

We can prepare the data and fit the model as before, although on the entire dataset instead of a training subset of the dataset.

...
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# ensure all data are floating point values
X = X.astype('float32')
# encode strings to integer
le = LabelEncoder()
y = le.fit_transform(y)
# determine the number of input features
n_features = X.shape[1]
# define model
model = Sequential()
model.add(Dense(50, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dropout(0.4))
model.add(Dense(10, activation='relu', kernel_initializer='he_normal'))
model.add(Dropout(0.4))
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy')
# fit the model on the entire dataset
model.fit(X, y, epochs=100, batch_size=8, verbose=0)

We can then use this model to make predictions on new data.

First, we can define a row of new data.

...
# define a row of new data
row = [1,0,0.99539,-0.05889,0.85243,0.02306,0.83398,-0.37708,1,0.03760,0.85243,-0.17755,0.59755,-0.44945,0.60536,-0.38223,0.84356,-0.38542,0.58212,-0.32192,0.56971,-0.29674,0.36946,-0.47357,0.56811,-0.51171,0.41078,-0.46168,0.21266,-0.34090,0.42267,-0.54487,0.18641,-0.45300]

Note: I took this row from the first row of the dataset and the expected label is a ‘g’.

We can then make a prediction.

...
# make prediction (predict the probability and threshold at 0.5)
yhat = (model.predict(asarray([row])) > 0.5).astype('int32')

Then we can invert the transform on the prediction so that we can use or interpret the result as the original class label.

...
# invert transform to get label for class
yhat = le.inverse_transform(yhat.flatten())

And in this case, we will simply report the prediction.

...
# report prediction
print('Predicted: %s' % (yhat[0]))

Tying this all together, the complete example of fitting a final model for the ionosphere dataset and using it to make a prediction on new data is listed below.

# fit a final model and make predictions on new data for the ionosphere dataset
from numpy import asarray
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
# load the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/ionosphere.csv'
df = read_csv(path, header=None)
# split into input and output columns
X, y = df.values[:, :-1], df.values[:, -1]
# ensure all data are floating point values
X = X.astype('float32')
# encode strings to integer
le = LabelEncoder()
y = le.fit_transform(y)
# determine the number of input features
n_features = X.shape[1]
# define model
model = Sequential()
model.add(Dense(50, activation='relu', kernel_initializer='he_normal', input_shape=(n_features,)))
model.add(Dropout(0.4))
model.add(Dense(10, activation='relu', kernel_initializer='he_normal'))
model.add(Dropout(0.4))
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy')
# fit the model
model.fit(X, y, epochs=100, batch_size=8, verbose=0)
# define a row of new data
row = [1,0,0.99539,-0.05889,0.85243,0.02306,0.83398,-0.37708,1,0.03760,0.85243,-0.17755,0.59755,-0.44945,0.60536,-0.38223,0.84356,-0.38542,0.58212,-0.32192,0.56971,-0.29674,0.36946,-0.47357,0.56811,-0.51171,0.41078,-0.46168,0.21266,-0.34090,0.42267,-0.54487,0.18641,-0.45300]
# make prediction (predict the probability and threshold at 0.5)
yhat = (model.predict(asarray([row])) > 0.5).astype('int32')
# invert transform to get label for class
yhat = le.inverse_transform(yhat.flatten())
# report prediction
print('Predicted: %s' % (yhat[0]))

Running the example fits the model on the entire dataset and makes a prediction for a single row of new data.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model predicted a “g” label for the input row.

Predicted: g

Summary

In this tutorial, you discovered how to develop a Multilayer Perceptron neural network model for the ionosphere binary classification dataset.

Specifically, you learned:

  • How to load and summarize the ionosphere dataset and use the results to suggest data preparations and model configurations to use.
  • How to explore the learning dynamics of simple MLP models on the dataset.
  • How to develop robust estimates of model performance, tune model performance and make predictions on new data.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.


A Gentle Introduction to Stochastic Optimization Algorithms

Stochastic optimization refers to the use of randomness in the objective function or in the optimization algorithm.

Challenging optimization algorithms, such as high-dimensional nonlinear objective problems, may contain multiple local optima in which deterministic optimization algorithms may get stuck.

Stochastic optimization algorithms provide an alternative approach that permits less optimal local decisions to be made within the search procedure that may increase the probability of the procedure locating the global optima of the objective function.

In this tutorial, you will discover a gentle introduction to stochastic optimization.

After completing this tutorial, you will know:

  • Stochastic optimization algorithms make use of randomness as part of the search procedure.
  • Examples of stochastic optimization algorithms like simulated annealing and genetic algorithms.
  • Practical considerations when using stochastic optimization algorithms such as repeated evaluations.

Let’s get started.

A Gentle Introduction to Stochastic Optimization Algorithms
Photo by John, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. What Is Stochastic Optimization?
  2. Stochastic Optimization Algorithms
  3. Practical Considerations for Stochastic Optimization

What Is Stochastic Optimization?

Optimization refers to algorithms that seek the inputs to a function that result in the minimum or maximum of an objective function.

Stochastic optimization or stochastic search refers to an optimization task that involves randomness in some way, such as either from the objective function or in the optimization algorithm.

Stochastic search and optimization pertains to problems where there is random noise in the measurements provided to the algorithm and/or there is injected (Monte Carlo) randomness in the algorithm itself.

— Page xiii, Introduction to Stochastic Search and Optimization, 2003.

Randomness in the objective function means that the evaluation of candidate solutions involves some uncertainty or noise and algorithms must be chosen that can make progress in the search in the presence of this noise.

Randomness in the algorithm is used as a strategy, e.g. stochastic or probabilistic decisions. It is used as an alternative to deterministic decisions in an effort to improve the likelihood of locating the global optima or a better local optima.

Standard stochastic optimization methods are brittle, sensitive to stepsize choice and other algorithmic parameters, and they exhibit instability outside of well-behaved families of objectives.

The Importance Of Better Models In Stochastic Optimization, 2019.

It is more common to refer to an algorithm that uses randomness than an objective function that contains noisy evaluations when discussing stochastic optimization. This is because randomness in the objective function can be addressed by using randomness in the optimization algorithm. As such, stochastic optimization may be referred to as “robust optimization.”

A deterministic algorithm may be misled (e.g. “deceived” or “confused“) by the noisy evaluation of candidate solutions or noisy function gradients, causing the algorithm to bounce around or get stuck (e.g. fail to converge).

Methods for stochastic optimization provide a means of coping with inherent system noise and coping with models or systems that are highly nonlinear, high dimensional, or otherwise inappropriate for classical deterministic methods of optimization.

Stochastic Optimization, 2011.

Using randomness in an optimization algorithm allows the search procedure to perform well on challenging optimization problems that may have a nonlinear response surface. This is achieved by the algorithm taking locally suboptimal steps or moves in the search space that allow it to escape local optima.

Randomness can help escape local optima and increase the chances of finding a global optimum.

— Page 8, Algorithms for Optimization, 2019.

The randomness used in a stochastic optimization algorithm does not have to be true randomness; pseudorandomness is sufficient. A pseudorandom number generator is almost universally used in stochastic optimization.

The use of randomness in a stochastic optimization algorithm does not mean that the algorithm is random. Instead, it means that some decisions made during the search procedure involve some portion of randomness. For example, the move from the current point to the next point in the search space may be made according to a probability distribution relative to the optimal move.

Now that we have an idea of what stochastic optimization is, let’s look at some examples of stochastic optimization algorithms.

Stochastic Optimization Algorithms

The use of randomness in the algorithms often means that the techniques are referred to as “heuristic search” as they use a rough rule-of-thumb procedure that may or may not work to find the optima instead of a precise procedure.

Many stochastic algorithms are inspired by a biological or natural process and may be referred to as “metaheuristics” as a higher-order procedure providing the conditions for a specific search of the objective function. They are also referred to as “black box” optimization algorithms.

Metaheuristics is a rather unfortunate term often used to describe a major subfield, indeed the primary subfield, of stochastic optimization.

— Page 7, Essentials of Metaheuristics, 2011.

There are many stochastic optimization algorithms.

Some examples of stochastic optimization algorithms include:

  • Iterated Local Search
  • Stochastic Hill Climbing
  • Stochastic Gradient Descent
  • Tabu Search
  • Greedy Randomized Adaptive Search Procedure

Some examples of stochastic optimization algorithms that are inspired by biological or physical processes include:

  • Simulated Annealing
  • Evolution Strategies
  • Genetic Algorithm
  • Differential Evolution
  • Particle Swarm Optimization
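
To make the role of randomness concrete, the example below sketches one algorithm from the first list, stochastic hill climbing, on a simple one-dimensional test function. The objective function, step size, and number of iterations are all illustrative assumptions.

# minimal sketch of stochastic hill climbing on a one-dimensional function
from numpy.random import rand
from numpy.random import randn
from numpy.random import seed

# illustrative objective function (minimization)
def objective(x):
	return x**2.0

seed(1)
# random starting point in the range [-5, 5]
solution = -5.0 + rand() * 10.0
solution_eval = objective(solution)
# take randomized steps, keeping only improvements
for i in range(100):
	candidate = solution + randn() * 0.1
	candidate_eval = objective(candidate)
	if candidate_eval <= solution_eval:
		solution, solution_eval = candidate, candidate_eval
print('f(%.5f) = %.5f' % (solution, solution_eval))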

Now that we are familiar with some examples of stochastic optimization algorithms, let’s look at some practical considerations when using them.

Practical Considerations for Stochastic Optimization

There are important considerations when using stochastic optimization algorithms.

The stochastic nature of the procedure means that any single run of an algorithm will be different, given a different source of randomness used by the algorithm and, in turn, different starting points for the search and decisions made during the search.

The pseudorandom number generator used as the source of randomness can be seeded to ensure the same sequence of random numbers is provided each run of the algorithm. This is good for small demonstrations and tutorials, although it is fragile as it is working against the inherent randomness of the algorithm.

Instead, a given algorithm can be executed many times to control for the randomness of the procedure.

This idea of multiple runs of the algorithm can be used in two key situations:

  • Comparing Algorithms
  • Evaluating Final Result

Algorithms may be compared based on the relative quality of the result found, the number of function evaluations performed, or some combination or derivation of these considerations. The result of any one run will depend upon the randomness used by the algorithm and alone cannot meaningfully represent the capability of the algorithm. Instead, a strategy of repeated evaluation should be used.

Any comparison between stochastic optimization algorithms will require the repeated evaluation of each algorithm with a different source of randomness and the summarization of the probability distribution of best results found, such as the mean and standard deviation of objective values. The mean result from each algorithm can then be compared.
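
For example, the snippet below sketches this bookkeeping, where run_algorithm_a() and run_algorithm_b() are hypothetical stand-ins that each perform one complete run of an optimizer and return the best objective value found.

# sketch of comparing two stochastic algorithms via repeated evaluation
from numpy import mean
from numpy import std
from numpy.random import rand

# hypothetical stand-ins: one complete run of each optimizer
def run_algorithm_a():
	return rand()

def run_algorithm_b():
	return 0.1 + rand()

# repeat each algorithm many times with a different source of randomness
n_repeats = 30
results_a = [run_algorithm_a() for _ in range(n_repeats)]
results_b = [run_algorithm_b() for _ in range(n_repeats)]
# summarize and compare the distributions of best results
print('Algorithm A: %.3f (%.3f)' % (mean(results_a), std(results_a)))
print('Algorithm B: %.3f (%.3f)' % (mean(results_b), std(results_b)))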

In cases where multiple local minima are likely to exist, it can be beneficial to incorporate random restarts after our termination conditions are met where we restart our local descent method from randomly selected initial points.

— Page 66, Algorithms for Optimization, 2019.

Similarly, any single run of a chosen optimization algorithm alone does not meaningfully represent the global optima of the objective function. Instead, a strategy of repeated evaluation should be used to develop a distribution of optimal solutions.

The maximum or minimum of the distribution can be taken as the final solution, and the distribution itself will provide a point of reference and confidence that the solution found is “relatively good” or “good enough” given the resources expended.

  • Multi-Restart: An approach for improving the likelihood of locating the global optima via the repeated application of a stochastic optimization algorithm to an optimization problem.

The repeated application of a stochastic optimization algorithm on an objective function is sometimes referred to as a multi-restart strategy and may be built into the optimization algorithm itself or prescribed more generally as a procedure around the chosen stochastic optimization algorithm.

Each time you do a random restart, the hill-climber then winds up in some (possibly new) local optimum.

— Page 26, Essentials of Metaheuristics, 2011.
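
A multi-restart strategy might be sketched as a simple wrapper that re-runs a local search from random initial points and keeps the single best result. The hill-climbing local_search() below is an illustrative stand-in for whatever stochastic algorithm is chosen.

# sketch of a multi-restart wrapper around a simple local search
from numpy.random import rand
from numpy.random import randn
from numpy.random import seed

# illustrative local search: hill climb from a random start, return the best found
def local_search(n_steps=50, step_size=0.1):
	x = -5.0 + rand() * 10.0
	fx = x**2.0
	for _ in range(n_steps):
		candidate = x + randn() * step_size
		candidate_eval = candidate**2.0
		if candidate_eval <= fx:
			x, fx = candidate, candidate_eval
	return x, fx

seed(1)
# repeated application of the search, keeping the overall best solution
best, best_eval = None, float('inf')
for restart in range(10):
	solution, solution_eval = local_search()
	if solution_eval < best_eval:
		best, best_eval = solution, solution_eval
print('Best: f(%.5f) = %.5f' % (best, best_eval))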

Summary

In this tutorial, you discovered a gentle introduction to stochastic optimization.

Specifically, you learned:

  • Stochastic optimization algorithms make use of randomness as part of the search procedure.
  • Examples of stochastic optimization algorithms like simulated annealing and genetic algorithms.
  • Practical considerations when using stochastic optimization algorithms such as repeated evaluations.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

No Free Lunch Theorem for Machine Learning

The No Free Lunch Theorem is often thrown around in the field of optimization and machine learning, often with little understanding of what it means or implies.

The theorem states that all optimization algorithms perform equally well when their performance is averaged across all possible problems.

It implies that there is no single best optimization algorithm. Because of the close relationship between optimization, search, and machine learning, it also implies that there is no single best machine learning algorithm for predictive modeling problems such as classification and regression.

In this tutorial, you will discover the no free lunch theorem for optimization and search.

After completing this tutorial, you will know:

  • The no free lunch theorem suggests the performance of all optimization algorithms are identical, under some specific constraints.
  • There is provably no single best optimization algorithm or machine learning algorithm.
  • The practical implications of the theorem may be limited given we are interested in a small subset of all possible objective functions.

Let’s get started.

No Free Lunch Theorem for Machine Learning
Photo by Francisco Anzola, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. What Is the No Free Lunch Theorem?
  2. Implications for Optimization
  3. Implications for Machine Learning

What Is the No Free Lunch Theorem?

The No Free Lunch Theorem, often abbreviated as NFL or NFLT, is a theoretical finding that suggests all optimization algorithms perform equally well when their performance is averaged over all possible objective functions.

The NFL stated that within certain constraints, over the space of all possible problems, every optimization technique will perform as well as every other one on average (including Random Search).

— Page 203, Essentials of Metaheuristics, 2011.

The theorem applies to optimization generally and to search problems, as optimization can be described or framed as a search problem.

The implication is that the performance of your favorite algorithm is identical to a completely naive algorithm, such as random search.

Roughly speaking we show that for both static and time dependent optimization problems the average performance of any pair of algorithms across all possible problems is exactly identical.

No Free Lunch Theorems For Optimization, 1997.

An easy way to think about this finding is to consider a large table like you might have in Excel.

Across the top of the table, each column represents a different optimization algorithm. Down the side of the table, each row represents a different objective function. Each cell of the table is the performance of the algorithm on the objective function, using whatever performance measure you like, as long as it is consistent across the whole table.

Depiction of the No Free Lunch Theorem as a Table of Algorithms and Problems

You can imagine that this table will be infinitely large. Nevertheless, in this table, we can calculate the average performance of any algorithm from all the values in its column and it will be identical to the average performance of any other algorithm column.

If one algorithm performs better than another algorithm on one class of problems, then it will perform worse on another class of problems

— Page 6, Algorithms for Optimization, 2019.

Now that we are familiar with the no free lunch theorem in general, let’s look at the specific implications for optimization.

Implications for Optimization

So-called black-box optimization algorithms are general optimization algorithms that can be applied to many different optimization problems and assume very little about the objective function.

Examples of black-box algorithms include the genetic algorithm, simulated annealing, and particle swarm optimization.

The no free lunch theorem was proposed in an environment of the late 1990s where claims of one black-box optimization algorithm being better than another optimization algorithm were being made routinely. The theorem pushes back on this, indicating that there is no single best optimization algorithm and that such a claim is provably impossible.

The theorem does state that no optimization algorithm is any better than any other optimization algorithm, on average.

… known as the “no free lunch” theorem, sets a limit on how good a learner can be. The limit is pretty low: no learner can be better than random guessing!

— Page 63, The Master Algorithm, 2018.

The catch is that the application of the algorithms assumes nothing about the problem. In fact, the algorithms are applied to objective functions with no prior knowledge, not even whether the objective function is being minimized or maximized. And this is a hard constraint of the theorem.

We often have “some” knowledge about the objective function being optimized. In fact, if in practice we truly knew nothing about the objective function, we could not choose an optimization algorithm.

As elaborated by the no free lunch theorems of Wolpert and Macready, there is no reason to prefer one algorithm over another unless we make assumptions about the probability distribution over the space of possible objective functions.

— Page 6, Algorithms for Optimization, 2019.

The beginner practitioner in the field of optimization is counseled to learn and use as much about the problem as possible in the optimization algorithm.

The more we know and harness in the algorithms about the problem, the better tailored the technique is to the problem and the more likely the algorithm is expected to perform well on the problem. The no free lunch theorem supports this advice.

We don’t care about all possible worlds, only the one we live in. If we know something about the world and incorporate it into our learner, it now has an advantage over random guessing.

— Page 63, The Master Algorithm, 2018.

Additionally, the performance is averaged over all possible objective functions and all possible optimization algorithms. Whereas in practice, we are interested in a small subset of objective functions that may have a specific structure or form and algorithms tailored to those functions.

… we cannot emphasize enough that no claims whatsoever are being made in this paper concerning how well various search algorithms work in practice. The focus of this paper is on what can be said a priori, without any assumptions and from mathematical principles alone, concerning the utility of a search algorithm.

No Free Lunch Theorems For Optimization, 1997.

These implications lead some practitioners to note the limited practical value of the theorem.

This is of considerable theoretical interest but, I think, of limited practical value, because the space of all possible problems likely includes many extremely unusual and pathological problems which are rarely if ever seen in practice.

— Page 203, Essentials of Metaheuristics, 2011.

Now that we have reviewed the implications of the no free lunch theorem for optimization, let’s review the implications for machine learning.

Implications for Machine Learning

Learning can be described or framed as an optimization problem, and most machine learning algorithms solve an optimization problem at their core.

The no free lunch theorem for optimization and search is applied to machine learning, specifically supervised learning, which underlies classification and regression predictive modeling tasks.

This means that all machine learning algorithms are equally effective across all possible prediction problems, e.g. random forest is as good as random predictions.

So all learning algorithms are the same in that: (1) by several definitions of “average”, all algorithms have the same average off-training-set misclassification risk, (2) therefore no learning algorithm can have lower risk than another one for all f …

The Supervised Learning No-Free-Lunch Theorems, 2002.

This also has implications for the way in which algorithms are evaluated or chosen, such as choosing a learning algorithm via a k-fold cross-validation test harness or not.

… an algorithm that uses cross validation to choose amongst a prefixed set of learning algorithms does no better on average than one that does not.

The Supervised Learning No-Free-Lunch Theorems, 2002.

It also has implications for common heuristics for what constitutes a “good” machine learning model, such as avoiding overfitting or choosing the simplest possible model that performs well.

Another set of examples is provided by all the heuristics that people have come up with for supervised learning: avoid overfitting, prefer simpler to more complex models, etc. [no free lunch] says that all such heuristics fail as often as they succeed.

The Supervised Learning No-Free-Lunch Theorems, 2002.

Given that there is no best single machine learning algorithm across all possible prediction problems, it motivates the need to continue to develop new learning algorithms and to better understand algorithms that have already been developed.

As a consequence of the no free lunch theorem, we need to develop many different types of models, to cover the wide variety of data that occurs in the real world. And for each model, there may be many different algorithms we can use to train the model, which make different speed-accuracy-complexity tradeoffs.

— Pages 24-25, Machine Learning: A Probabilistic Perspective, 2012.

It also supports the argument of testing a suite of different machine learning algorithms for a given predictive modeling problem.

The “No Free Lunch” Theorem argues that, without having substantive information about the modeling problem, there is no single model that will always do better than any other model. Because of this, a strong case can be made to try a wide variety of techniques, then determine which model to focus on.

— Pages 25-26, Applied Predictive Modeling, 2013.
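
As a sketch of what this looks like in practice, the example below evaluates a small suite of scikit-learn models on the same problem with the same cross-validation test harness. The specific models and the synthetic dataset are arbitrary illustrative choices.

# sketch of spot-checking a suite of algorithms on one problem
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
# define a synthetic binary classification problem
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
# evaluate each model with the same 10-fold cross-validation harness
models = [LogisticRegression(), DecisionTreeClassifier(), KNeighborsClassifier()]
for model in models:
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=10)
	print('%s: %.3f' % (type(model).__name__, mean(scores)))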

Nevertheless, as with optimization, the implications of the theorem are based on the choice of learning algorithms having zero knowledge of the problem that is being solved.

In practice, this is not the case, and a beginner machine learning practitioner is encouraged to review the available data in order to learn something about the problem that can be incorporated into the learning algorithm.

We may even want to take this one step further and say that learning is not possible without some prior knowledge and that data alone is not enough.

In the meantime, the practical consequence of the “no free lunch” theorem is that there’s no such thing as learning without knowledge. Data alone is not enough.

— Page 64, The Master Algorithm, 2018.

Summary

In this tutorial, you discovered the no free lunch theorem for optimization and search.

Specifically, you learned:

  • The no free lunch theorem suggests the performance of all optimization algorithms are identical, under some specific constraints.
  • There is provably no single best optimization algorithm or machine learning algorithm.
  • The practical implications of the theorem may be limited given we are interested in a small subset of all possible objective functions.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Simulated Annealing From Scratch in Python

Simulated Annealing is a stochastic global search optimization algorithm.

This means that it makes use of randomness as part of the search process. This makes the algorithm appropriate for nonlinear objective functions where other local search algorithms do not operate well.

Like the stochastic hill climbing local search algorithm, it modifies a single solution and searches the relatively local area of the search space until the local optima is located. Unlike the hill climbing algorithm, it may accept worse solutions as the current working solution.

The likelihood of accepting worse solutions starts high at the beginning of the search and decreases with the progress of the search, giving the algorithm the opportunity to first locate the region for the global optima, escaping local optima, then hill climb to the optima itself.

In this tutorial, you will discover the simulated annealing optimization algorithm for function optimization.

After completing this tutorial, you will know:

  • Simulated annealing is a stochastic global search algorithm for function optimization.
  • How to implement the simulated annealing algorithm from scratch in Python.
  • How to use the simulated annealing algorithm and inspect the results of the algorithm.

Let’s get started.

Simulated Annealing From Scratch in Python
Photo by Susanne Nilsson, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Simulated Annealing
  2. Implement Simulated Annealing
  3. Simulated Annealing Worked Example

Simulated Annealing

Simulated Annealing is a stochastic global search optimization algorithm.

The algorithm is inspired by annealing in metallurgy where metal is heated to a high temperature quickly, then cooled slowly, which increases its strength and makes it easier to work with.

The annealing process works by first exciting the atoms in the material at a high temperature, allowing the atoms to move around a lot, then decreasing their excitement slowly, allowing the atoms to fall into a new, more stable configuration.

When hot, the atoms in the material are more free to move around, and, through random motion, tend to settle into better positions. A slow cooling brings the material to an ordered, crystalline state.

— Page 128, Algorithms for Optimization, 2019.

The simulated annealing optimization algorithm can be thought of as a modified version of stochastic hill climbing.

Stochastic hill climbing maintains a single candidate solution and takes steps of a random but constrained size from the candidate in the search space. If the new point is better than the current point, then the current point is replaced with the new point. This process continues for a fixed number of iterations.

Simulated annealing executes the search in the same way. The main difference is that new points that are not as good as the current point (worse points) are accepted sometimes.

A worse point is accepted probabilistically where the likelihood of accepting a solution worse than the current solution is a function of the temperature of the search and how much worse the solution is than the current solution.

The algorithm varies from Hill-Climbing in its decision of when to replace S, the original candidate solution, with R, its newly tweaked child. Specifically: if R is better than S, we’ll always replace S with R as usual. But if R is worse than S, we may still replace S with R with a certain probability

— Page 23, Essentials of Metaheuristics, 2011.

The initial temperature for the search is provided as a hyperparameter and decreases with the progress of the search. A number of different schemes (annealing schedules) may be used to decrease the temperature during the search from the initial value to a very low value, although it is common to calculate temperature as a function of the iteration number.

A popular example for calculating temperature is the so-called “fast simulated annealing,” calculated as follows:

  • temperature = initial_temperature / (iteration_number + 1)

We add one to the iteration number in the case that iteration numbers start at zero, to avoid a divide by zero error.
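
As a small illustration, the snippet below compares the fast annealing schedule to an alternative exponential-decay schedule. The exponential schedule and its decay factor are illustrative assumptions and are not used in the algorithm developed in this tutorial.

# compare the fast annealing schedule to an exponential-decay schedule
initial_temp = 10.0
# illustrative decay factor for the exponential schedule
alpha = 0.95
for i in range(0, 100, 20):
	fast = initial_temp / (i + 1)
	exponential = initial_temp * (alpha ** i)
	print('iteration %d: fast=%.3f, exponential=%.3f' % (i, fast, exponential))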

The acceptance of worse solutions uses the temperature as well as the difference between the objective function evaluation of the worse solution and the current solution. A value between 0 and 1 is calculated using this information, indicating the likelihood of accepting the worse solution. A random number is then drawn; if it is less than this value, the worse solution is accepted.

It is this acceptance probability, known as the Metropolis criterion, that allows the algorithm to escape from local minima when the temperature is high.

— Page 128, Algorithms for Optimization, 2019.

This is called the metropolis acceptance criterion and for minimization is calculated as follows:

  • criterion = exp( -(objective(new) - objective(current)) / temperature)

Where exp() is e (the mathematical constant) raised to a power of the provided argument, and objective(new) and objective(current) are the objective function evaluations of the new (worse) and current candidate solutions.

The effect is that poor solutions have more chances of being accepted early in the search and are less likely to be accepted later in the search. For example, if a new solution is 0.5 worse than the current solution, the criterion is exp(-0.5/10.0), or about 0.95, at a temperature of 10.0, but only exp(-0.5/0.1), or about 0.007, at a temperature of 0.1. The intent is that the high temperature at the beginning of the search will help the search locate the basin for the global optima and the low temperature later in the search will help the algorithm hone in on the global optima.

The temperature starts high, allowing the process to freely move about the search space, with the hope that in this phase the process will find a good region with the best local minimum. The temperature is then slowly brought down, reducing the stochasticity and forcing the search to converge to a minimum

— Page 128, Algorithms for Optimization, 2019.

Now that we are familiar with the simulated annealing algorithm, let’s look at how to implement it from scratch.

Implement Simulated Annealing

In this section, we will explore how we might implement the simulated annealing optimization algorithm from scratch.

First, we must define our objective function and the bounds on each input variable to the objective function. The objective function is just a Python function we will name objective(). The bounds will be a 2D array with one dimension for each input variable that defines the minimum and maximum for the variable.

For example, a one-dimensional objective function and bounds would be defined as follows:

# objective function
def objective(x):
	return 0

# define range for input
bounds = asarray([[-5.0, 5.0]])

Next, we can generate our initial point as a random point within the bounds of the problem, then evaluate it using the objective function.

...
# generate an initial point
best = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
# evaluate the initial point
best_eval = objective(best)

We need to maintain the “current” solution that is the focus of the search and that may be replaced with better solutions.

...
# current working solution
curr, curr_eval = best, best_eval

Now we can loop over a predefined number of iterations of the algorithm defined as “n_iterations“, such as 100 or 1,000.

...
# run the algorithm
for i in range(n_iterations):
	...

The first step of the algorithm iteration is to generate a new candidate solution from the current working solution, e.g. take a step.

This requires a predefined “step_size” parameter, which is relative to the bounds of the search space. We will take a random step with a Gaussian distribution where the mean is our current point and the standard deviation is defined by the “step_size“. That means that about 99 percent of the steps taken will be within 3 * step_size of the current point.

...
# take a step
candidate = curr + randn(len(bounds)) * step_size

We don’t have to take steps in this way. You may wish to use a uniform distribution between 0 and the step size. For example:

...
# take a step
candidate = curr + rand(len(bounds)) * step_size

Next, we need to evaluate it.

...
# evaluate candidate point
candidate_eval = objective(candidate)

We then need to check if the evaluation of this new point is better than the current best point, and if it is, replace our current best point with this new point.

This is separate from the current working solution that is the focus of the search.

...
# check for new best solution
if candidate_eval < best_eval:
	# store new best point
	best, best_eval = candidate, candidate_eval
	# report progress
	print('>%d f(%s) = %.5f' % (i, best, best_eval))

Next, we need to prepare to replace the current working solution.

The first step is to calculate the difference between the objective function evaluation of the candidate solution and the current working solution.

...
# difference between candidate and current point evaluation
diff = candidate_eval - curr_eval

Next, we need to calculate the current temperature, using the fast annealing schedule, where “temp” is the initial temperature provided as an argument.

...
# calculate temperature for current epoch
t = temp / float(i + 1)

We can then calculate the likelihood of accepting a solution with worse performance than our current working solution.

...
# calculate metropolis acceptance criterion
metropolis = exp(-diff / t)

Finally, we can accept the new point as the current working solution if it has a better objective function evaluation (the difference is negative) or if the objective function is worse, but we probabilistically decide to accept it.

...
# check if we should keep the new point
if diff < 0 or rand() < metropolis:
	# store the new current point
	curr, curr_eval = candidate, candidate_eval

And that’s it.

We can implement this simulated annealing algorithm as a reusable function that takes the name of the objective function, the bounds of each input variable, the total iterations, step size, and initial temperature as arguments, and returns the best solution found and its evaluation.

# simulated annealing algorithm
def simulated_annealing(objective, bounds, n_iterations, step_size, temp):
	# generate an initial point
	best = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# evaluate the initial point
	best_eval = objective(best)
	# current working solution
	curr, curr_eval = best, best_eval
	# run the algorithm
	for i in range(n_iterations):
		# take a step
		candidate = curr + randn(len(bounds)) * step_size
		# evaluate candidate point
		candidate_eval = objective(candidate)
		# check for new best solution
		if candidate_eval < best_eval:
			# store new best point
			best, best_eval = candidate, candidate_eval
			# report progress
			print('>%d f(%s) = %.5f' % (i, best, best_eval))
		# difference between candidate and current point evaluation
		diff = candidate_eval - curr_eval
		# calculate temperature for current epoch
		t = temp / float(i + 1)
		# calculate metropolis acceptance criterion
		metropolis = exp(-diff / t)
		# check if we should keep the new point
		if diff < 0 or rand() < metropolis:
			# store the new current point
			curr, curr_eval = candidate, candidate_eval
	return [best, best_eval]

Now that we know how to implement the simulated annealing algorithm in Python, let’s look at how we might use it to optimize an objective function.

Simulated Annealing Worked Example

In this section, we will apply the simulated annealing optimization algorithm to an objective function.

First, let’s define our objective function.

We will use a simple one-dimensional x^2 objective function with the bounds [-5, 5].

The example below defines the function, then creates a line plot of the response surface of the function for a grid of input values, and marks the optima at f(0.0) = 0.0 with a red line.

# convex unimodal optimization function
from numpy import arange
from matplotlib import pyplot

# objective function
def objective(x):
	return x[0]**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
inputs = arange(r_min, r_max, 0.1)
# compute targets
results = [objective([x]) for x in inputs]
# create a line plot of input vs result
pyplot.plot(inputs, results)
# define optimal input value
x_optima = 0.0
# draw a vertical line at the optimal input
pyplot.axvline(x=x_optima, ls='--', color='red')
# show the plot
pyplot.show()

Running the example creates a line plot of the objective function and clearly marks the function optima.

Line Plot of Objective Function With Optima Marked With a Dashed Red Line

Before we apply the optimization algorithm to the problem, let’s take a moment to understand the acceptance criterion a little better.

First, the fast annealing schedule decreases the temperature rapidly as a function of the iteration number, in proportion to its inverse. We can make this clear by creating a plot of the temperature for each algorithm iteration.

We will use an initial temperature of 10 and 100 algorithm iterations, both arbitrarily chosen.

The complete example is listed below.

# explore temperature vs algorithm iteration for simulated annealing
from matplotlib import pyplot
# total iterations of algorithm
iterations = 100
# initial temperature
initial_temp = 10
# array of iterations from 0 to iterations - 1
iterations = [i for i in range(iterations)]
# temperatures for each iterations
temperatures = [initial_temp/float(i + 1) for i in iterations]
# plot iterations vs temperatures
pyplot.plot(iterations, temperatures)
pyplot.xlabel('Iteration')
pyplot.ylabel('Temperature')
pyplot.show()

Running the example calculates the temperature for each algorithm iteration and creates a plot of algorithm iteration (x-axis) vs. temperature (y-axis).

We can see that temperature drops rapidly and nonlinearly, such that after 20 iterations it is below 1 and stays low for the remainder of the search.

Line Plot of Temperature vs. Algorithm Iteration for Fast Annealing

Next, we can get a better idea of how the metropolis acceptance criterion changes over time with the temperature.

Recall that the criterion is a function of temperature, but is also a function of how different the objective evaluation of the new point is compared to the current working solution. As such, we will plot the criterion for a few different “differences in objective function value” to see the effect it has on acceptance probability.

The complete example is listed below.

# explore metropolis acceptance criterion for simulated annealing
from math import exp
from matplotlib import pyplot
# total iterations of algorithm
iterations = 100
# initial temperature
initial_temp = 10
# array of iterations from 0 to iterations - 1
iterations = [i for i in range(iterations)]
# temperatures for each iterations
temperatures = [initial_temp/float(i + 1) for i in iterations]
# metropolis acceptance criterion
differences = [0.01, 0.1, 1.0]
for d in differences:
	metropolis = [exp(-d/t) for t in temperatures]
	# plot iterations vs metropolis
	label = 'diff=%.2f' % d
	pyplot.plot(iterations, metropolis, label=label)
# finalize plot
pyplot.xlabel('Iteration')
pyplot.ylabel('Metropolis Criterion')
pyplot.legend()
pyplot.show()

Running the example calculates the metropolis acceptance criterion for each algorithm iteration using the temperature shown for each iteration (shown in the previous section).

The plot has three lines for three differences between the new worse solution and the current working solution.

We can see that the worse the solution is (the larger the difference), the less likely the model is to accept the worse solution regardless of the algorithm iteration, as we might expect. We can also see that in all cases, the likelihood of accepting worse solutions decreases with algorithm iteration.

Line Plot of Metropolis Acceptance Criterion vs. Algorithm Iteration for Simulated Annealing

Now that we are more familiar with the behavior of the temperature and metropolis acceptance criterion over time, let’s apply simulated annealing to our test problem.

First, we will seed the pseudorandom number generator.

This is not required in general, but in this case, I want to ensure we get the same results (same sequence of random numbers) each time we run the algorithm so we can plot the results later.

...
# seed the pseudorandom number generator
seed(1)

Next, we can define the configuration of the search.

In this case, we will search for 1,000 iterations of the algorithm and use a step size of 0.1. Given that we are using a Gaussian function for generating the step, this means that about 99 percent of all steps taken will be within a distance of (0.1 * 3) of a given point, e.g. three standard deviations.

We will also use an initial temperature of 10.0. The search procedure is more sensitive to the annealing schedule than to the initial temperature; as such, initial temperature values are almost arbitrary.

...
n_iterations = 1000
# define the maximum step size
step_size = 0.1
# initial temperature
temp = 10

Next, we can perform the search and report the results.

...
# perform the simulated annealing search
best, score = simulated_annealing(objective, bounds, n_iterations, step_size, temp)
print('Done!')
print('f(%s) = %f' % (best, score))

Tying this all together, the complete example is listed below.

# simulated annealing search of a one-dimensional objective function
from numpy import asarray
from numpy import exp
from numpy.random import randn
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x):
	return x[0]**2.0

# simulated annealing algorithm
def simulated_annealing(objective, bounds, n_iterations, step_size, temp):
	# generate an initial point
	best = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# evaluate the initial point
	best_eval = objective(best)
	# current working solution
	curr, curr_eval = best, best_eval
	# run the algorithm
	for i in range(n_iterations):
		# take a step
		candidate = curr + randn(len(bounds)) * step_size
		# evaluate candidate point
		candidate_eval = objective(candidate)
		# check for new best solution
		if candidate_eval < best_eval:
			# store new best point
			best, best_eval = candidate, candidate_eval
			# report progress
			print('>%d f(%s) = %.5f' % (i, best, best_eval))
		# difference between candidate and current point evaluation
		diff = candidate_eval - curr_eval
		# calculate temperature for current epoch
		t = temp / float(i + 1)
		# calculate metropolis acceptance criterion
		metropolis = exp(-diff / t)
		# check if we should keep the new point
		if diff < 0 or rand() < metropolis:
			# store the new current point
			curr, curr_eval = candidate, candidate_eval
	return [best, best_eval]

# seed the pseudorandom number generator
seed(1)
# define range for input
bounds = asarray([[-5.0, 5.0]])
# define the total iterations
n_iterations = 1000
# define the maximum step size
step_size = 0.1
# initial temperature
temp = 10
# perform the simulated annealing search
best, score = simulated_annealing(objective, bounds, n_iterations, step_size, temp)
print('Done!')
print('f(%s) = %f' % (best, score))

Running the example reports the progress of the search including the iteration number, the input to the function, and the response from the objective function each time an improvement was detected.

At the end of the search, the best solution is found and its evaluation is reported.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see about 20 improvements over the 1,000 iterations of the algorithm and a solution that is very close to the optimal input of 0.0 that evaluates to f(0.0) = 0.0.

>34 f([-0.78753544]) = 0.62021
>35 f([-0.76914239]) = 0.59158
>37 f([-0.68574854]) = 0.47025
>39 f([-0.64797564]) = 0.41987
>40 f([-0.58914623]) = 0.34709
>41 f([-0.55446029]) = 0.30743
>42 f([-0.41775702]) = 0.17452
>43 f([-0.35038542]) = 0.12277
>50 f([-0.15799045]) = 0.02496
>66 f([-0.11089772]) = 0.01230
>67 f([-0.09238208]) = 0.00853
>72 f([-0.09145261]) = 0.00836
>75 f([-0.05129162]) = 0.00263
>93 f([-0.02854417]) = 0.00081
>144 f([0.00864136]) = 0.00007
>149 f([0.00753953]) = 0.00006
>167 f([-0.00640394]) = 0.00004
>225 f([-0.00044965]) = 0.00000
>503 f([-0.00036261]) = 0.00000
>512 f([0.00013605]) = 0.00000
Done!
f([0.00013605]) = 0.000000

It can be interesting to review the progress of the search as a line plot that shows the change in the evaluation of the best solution each time there is an improvement.

We can update the simulated_annealing() function to keep track of the objective function evaluation each time there is an improvement and return this list of scores.

# simulated annealing algorithm
def simulated_annealing(objective, bounds, n_iterations, step_size, temp):
	# generate an initial point
	best = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# evaluate the initial point
	best_eval = objective(best)
	# current working solution
	curr, curr_eval = best, best_eval
	# keep track of the improvement scores
	scores = list()
	# run the algorithm
	for i in range(n_iterations):
		# take a step
		candidate = curr + randn(len(bounds)) * step_size
		# evaluate candidate point
		candidate_eval = objective(candidate)
		# check for new best solution
		if candidate_eval < best_eval:
			# store new best point
			best, best_eval = candidate, candidate_eval
			# keep track of scores
			scores.append(best_eval)
			# report progress
			print('>%d f(%s) = %.5f' % (i, best, best_eval))
		# difference between candidate and current point evaluation
		diff = candidate_eval - curr_eval
		# calculate temperature for current epoch
		t = temp / float(i + 1)
		# calculate metropolis acceptance criterion
		metropolis = exp(-diff / t)
		# check if we should keep the new point
		if diff < 0 or rand() < metropolis:
			# store the new current point
			curr, curr_eval = candidate, candidate_eval
	return [best, best_eval, scores]

We can then create a line plot of these scores to see the relative change in objective function for each improvement found during the search.

...
# line plot of best scores
pyplot.plot(scores, '.-')
pyplot.xlabel('Improvement Number')
pyplot.ylabel('Evaluation f(x)')
pyplot.show()

Tying this together, the complete example of performing the search and plotting the objective function scores of the improved solutions during the search is listed below.

# simulated annealing search of a one-dimensional objective function
from numpy import asarray
from numpy import exp
from numpy.random import randn
from numpy.random import rand
from numpy.random import seed
from matplotlib import pyplot

# objective function
def objective(x):
	return x[0]**2.0

# simulated annealing algorithm
def simulated_annealing(objective, bounds, n_iterations, step_size, temp):
	# generate an initial point
	best = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# evaluate the initial point
	best_eval = objective(best)
	# current working solution
	curr, curr_eval = best, best_eval
	scores = list()
	# run the algorithm
	for i in range(n_iterations):
		# take a step
		candidate = curr + randn(len(bounds)) * step_size
		# evaluate candidate point
		candidate_eval = objective(candidate)
		# check for new best solution
		if candidate_eval < best_eval:
			# store new best point
			best, best_eval = candidate, candidate_eval
			# keep track of scores
			scores.append(best_eval)
			# report progress
			print('>%d f(%s) = %.5f' % (i, best, best_eval))
		# difference between candidate and current point evaluation
		diff = candidate_eval - curr_eval
		# calculate temperature for current epoch
		t = temp / float(i + 1)
		# calculate metropolis acceptance criterion
		metropolis = exp(-diff / t)
		# check if we should keep the new point
		if diff < 0 or rand() < metropolis:
			# store the new current point
			curr, curr_eval = candidate, candidate_eval
	return [best, best_eval, scores]

# seed the pseudorandom number generator
seed(1)
# define range for input
bounds = asarray([[-5.0, 5.0]])
# define the total iterations
n_iterations = 1000
# define the maximum step size
step_size = 0.1
# initial temperature
temp = 10
# perform the simulated annealing search
best, score, scores = simulated_annealing(objective, bounds, n_iterations, step_size, temp)
print('Done!')
print('f(%s) = %f' % (best, score))
# line plot of best scores
pyplot.plot(scores, '.-')
pyplot.xlabel('Improvement Number')
pyplot.ylabel('Evaluation f(x)')
pyplot.show()

Running the example performs the search and reports the results as before.

A line plot is created showing the objective function evaluation for each improvement during the simulated annealing search. We can see about 20 changes to the objective function evaluation during the search, with large changes initially and very small to imperceptible changes towards the end as the algorithm converged on the optima.

Line Plot of Objective Function Evaluation for Each Improvement During the Simulated Annealing Search

Summary

In this tutorial, you discovered the simulated annealing optimization algorithm for function optimization.

Specifically, you learned:

  • Simulated annealing is a stochastic global search algorithm for function optimization.
  • How to implement the simulated annealing algorithm from scratch in Python.
  • How to use the simulated annealing algorithm and inspect the results of the algorithm.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Prediction Intervals for Deep Learning Neural Networks

Prediction intervals provide a measure of uncertainty for predictions on regression problems.

For example, a 95% prediction interval indicates that 95 out of 100 times, the true value will fall between the lower and upper values of the range. This is different from a simple point prediction that might represent the center of the uncertainty interval.

There are no standard techniques for calculating a prediction interval for deep learning neural networks on regression predictive modeling problems. Nevertheless, a quick and dirty prediction interval can be estimated using an ensemble of models that, in turn, provide a distribution of point predictions from which an interval can be calculated.
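
As a preview of the idea, a rough interval can be read directly off the spread of point predictions from an ensemble. The snippet below sketches this with made-up predictions for a single input row; the Gaussian-style plus/minus 1.96 standard deviation interval is one of several possible choices.

# sketch of a quick-and-dirty prediction interval from an ensemble
from numpy import asarray
from numpy import mean
from numpy import std
# hypothetical point predictions for one row from an ensemble of models
yhats = asarray([26.1, 24.8, 25.5, 27.0, 25.2, 26.4])
# interval as +/- 1.96 standard deviations around the mean prediction
interval = 1.96 * std(yhats)
lower, upper = mean(yhats) - interval, mean(yhats) + interval
print('point=%.2f, 95%% interval=[%.2f, %.2f]' % (mean(yhats), lower, upper))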

In this tutorial, you will discover how to calculate a prediction interval for deep learning neural networks.

After completing this tutorial, you will know:

  • Prediction intervals provide a measure of uncertainty on regression predictive modeling problems.
  • How to develop and evaluate a simple Multilayer Perceptron neural network on a standard regression problem.
  • How to calculate and report a prediction interval using an ensemble of neural network models.

Let’s get started.

Prediction Intervals for Deep Learning Neural Networks
Photo by eugene_o, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Prediction Interval
  2. Neural Network for Regression
  3. Neural Network Prediction Interval

Prediction Interval

Generally, predictive models for regression problems (i.e. predicting a numerical value) make a point prediction.

This means they predict a single value but do not give any indication of the uncertainty about the prediction.

By definition, a prediction is an estimate or an approximation and contains some uncertainty. The uncertainty comes from the errors in the model itself and noise in the input data. The model is an approximation of the relationship between the input variables and the output variables.

A prediction interval is a quantification of the uncertainty on a prediction.

It provides probabilistic upper and lower bounds on the estimate of the outcome variable.

A prediction interval for a single future observation is an interval that will, with a specified degree of confidence, contain a future randomly selected observation from a distribution.

— Page 27, Statistical Intervals: A Guide for Practitioners and Researchers, 2017.

Prediction intervals are most commonly used when making predictions or forecasts with a regression model, where a quantity is being predicted.

The prediction interval surrounds the prediction made by the model and hopefully covers the range of the true outcome.

Now that we are familiar with what a prediction interval is, we can consider how we might calculate an interval for a neural network. Let’s start by defining a regression problem and a neural network model to address it.

Neural Network for Regression

In this section, we will define a regression predictive modeling problem and a neural network model to address it.

First, let’s introduce a standard regression dataset. We will use the housing dataset.

The housing dataset is a standard machine learning dataset comprising 506 rows of data with 13 numerical input variables and a numerical target variable.

Using a test harness of repeated 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve an MAE on the same test harness of about 1.9. This provides the bounds of expected performance on this dataset.
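
As a rough check of the naive baseline, a sketch along the following lines (using scikit-learn's DummyRegressor, which is not used elsewhere in this tutorial) should reproduce an MAE of about 6.6 on the same test harness.

# estimate the naive baseline mae on the housing dataset
from numpy import absolute
from pandas import read_csv
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
values = dataframe.values
X, y = values[:, :-1], values[:, -1]
# naive model that always predicts the mean of the training target
model = DummyRegressor(strategy='mean')
# repeated 10-fold cross-validation (no stratification, as this is regression)
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
print('Naive MAE: %.3f (%.3f)' % (absolute(scores).mean(), scores.std()))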

The dataset involves predicting the house price given details of the house’s suburb in the American city of Boston.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads and loads the dataset as a Pandas DataFrame and summarizes the shape of the dataset and the first five rows of data.

# load and summarize the housing dataset
from pandas import read_csv
from matplotlib import pyplot
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# summarize shape
print(dataframe.shape)
# summarize first few lines
print(dataframe.head())

Running the example confirms the 506 rows of data and 13 input variables and a single numeric target variable (14 in total). We can also see that all input variables are numeric.

(506, 14)
        0     1     2   3      4      5   ...  8      9     10      11    12    13
0  0.00632  18.0  2.31   0  0.538  6.575  ...   1  296.0  15.3  396.90  4.98  24.0
1  0.02731   0.0  7.07   0  0.469  6.421  ...   2  242.0  17.8  396.90  9.14  21.6
2  0.02729   0.0  7.07   0  0.469  7.185  ...   2  242.0  17.8  392.83  4.03  34.7
3  0.03237   0.0  2.18   0  0.458  6.998  ...   3  222.0  18.7  394.63  2.94  33.4
4  0.06905   0.0  2.18   0  0.458  7.147  ...   3  222.0  18.7  396.90  5.33  36.2

[5 rows x 14 columns]

Next, we can prepare the dataset for modeling.

First, the dataset can be split into input and output columns, and then the rows can be split into train and test datasets.

In this case, we will use approximately 67% of the rows to train the model and the remaining 33% to estimate the performance of the model.

...
# split into input and output values
X, y = values[:,:-1], values[:,-1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.67)

You can learn more about the train-test split in this tutorial:

We will then scale all input columns (variables) to have the range 0-1, called data normalization, which is a good practice when working with neural network models.

...
# scale input data
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

You can learn more about normalizing input data with the MinMaxScaler in this tutorial:

The complete example of preparing the data for modeling is listed below.

# load and prepare the dataset for modeling
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
values = dataframe.values
# split into input and output values
X, y = values[:,:-1], values[:,-1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.67)
# scale input data
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# summarize
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Running the example loads the dataset as before, then splits the columns into input and output elements and the rows into train and test sets, and finally scales all input variables to the range [0,1].

The shape of the train and test sets is printed, showing we have 339 rows to train the model and 167 to evaluate it.

(339, 13) (167, 13) (339,) (167,)

Next, we can define, train and evaluate a Multilayer Perceptron (MLP) model on the dataset.

We will define a simple model with two hidden layers and an output layer that predicts a numeric value. We will use the ReLU activation function with “he” weight initialization, which is a good practice.

The number of nodes in each hidden layer was chosen after a little trial and error.

...
# define neural network model
features = X_train.shape[1]
model = Sequential()
model.add(Dense(20, kernel_initializer='he_normal', activation='relu', input_dim=features))
model.add(Dense(5, kernel_initializer='he_normal', activation='relu'))
model.add(Dense(1))

We will use the efficient Adam version of stochastic gradient descent with near-default learning rate and momentum values, and fit the model using the mean squared error (MSE) loss function, a standard for regression predictive modeling problems.

...
# compile the model and specify loss and optimizer
opt = Adam(learning_rate=0.01, beta_1=0.85, beta_2=0.999)
model.compile(optimizer=opt, loss='mse')

You can learn more about the Adam optimization algorithm in this tutorial:

The model will then be fit for 300 epochs with a batch size of 16 samples. This configuration was chosen after a little trial and error.

...
# fit the model on the training dataset
model.fit(X_train, y_train, verbose=2, epochs=300, batch_size=16)

You can learn more about batches and epochs in this tutorial:

Finally, the model can be used to make predictions on the test dataset and we can evaluate the predictions by comparing them to the expected values in the test set and calculate the mean absolute error (MAE), a useful measure of model performance.

...
# make predictions on the test set
yhat = model.predict(X_test, verbose=0)
# calculate the average error in the predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

Tying this together, the complete example is listed below.

# train and evaluate a multilayer perceptron neural network on the housing regression dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
values = dataframe.values
# split into input and output values
X, y = values[:, :-1], values[:,-1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.67, random_state=1)
# scale input data
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# define neural network model
features = X_train.shape[1]
model = Sequential()
model.add(Dense(20, kernel_initializer='he_normal', activation='relu', input_dim=features))
model.add(Dense(5, kernel_initializer='he_normal', activation='relu'))
model.add(Dense(1))
# compile the model and specify loss and optimizer
opt = Adam(learning_rate=0.01, beta_1=0.85, beta_2=0.999)
model.compile(optimizer=opt, loss='mse')
# fit the model on the training dataset
model.fit(X_train, y_train, verbose=2, epochs=300, batch_size=16)
# make predictions on the test set
yhat = model.predict(X_test, verbose=0)
# calculate the average error in the predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

Running the example loads and prepares the dataset, defines and fits the MLP model on the training dataset, and evaluates its performance on the test set.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieves a mean absolute error of approximately 2.3, which is better than a naive model and getting close to an optimal model.

No doubt we could achieve near-optimal performance with further tuning of the model, but this is good enough for our investigation of prediction intervals.

...
Epoch 296/300
22/22 - 0s - loss: 7.1741
Epoch 297/300
22/22 - 0s - loss: 6.8044
Epoch 298/300
22/22 - 0s - loss: 6.8623
Epoch 299/300
22/22 - 0s - loss: 7.7010
Epoch 300/300
22/22 - 0s - loss: 6.5374
MAE: 2.300

Next, let’s look at how we might calculate a prediction interval using our MLP model on the housing dataset.

Neural Network Prediction Interval

In this section, we will develop a prediction interval using the regression problem and model developed in the previous section.

Calculating prediction intervals for nonlinear regression algorithms like neural networks is challenging compared to linear methods like linear regression, where the prediction interval calculation is trivial. There is no standard technique.

There are many ways to calculate an effective prediction interval for neural network models. I recommend some of the papers listed in the “further reading” section to learn more.

In this tutorial, we will use a very simple approach that has plenty of room for extension. I refer to it as “quick and dirty” because it is fast and easy to calculate, but is limited.

It involves fitting multiple final models (e.g. 10 to 30). The distribution of the point predictions from ensemble members is then used to calculate both a point prediction and a prediction interval.

For example, a point prediction can be taken as the mean of the point predictions from ensemble members, and a 95% prediction interval can be taken as 1.96 standard deviations from the mean.

This is a simple Gaussian prediction interval, although alternatives could be used, such as the min and max of the point predictions. Alternatively, the bootstrap method could be used to train each ensemble member on a different bootstrap sample and the 2.5th and 97.5th percentiles of the point predictions can be used as prediction intervals.

For more on the bootstrap method, see the tutorial:

These extensions are left as exercises; we will stick with the simple Gaussian prediction interval.
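
If you do explore the percentile extension, a minimal sketch of the calculation (with made-up point predictions for illustration) might look like the following.

# sketch: empirical 95% interval from an ensemble's point predictions
from numpy import asarray
from numpy import percentile
# illustrative point predictions, one per ensemble member (made-up values)
yhat = asarray([28.1, 30.4, 29.7, 31.2, 30.0, 29.1, 32.3, 28.8, 30.9, 29.5])
# take the 2.5th and 97.5th percentiles as the interval bounds
lower, upper = percentile(yhat, 2.5), percentile(yhat, 97.5)
print('95%% empirical interval: [%.3f, %.3f]' % (lower, upper))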

Let’s assume that the training dataset, defined in the previous section, is the entire dataset and we are training a final model or models on this entire dataset. We can then make predictions with prediction intervals on the test set and evaluate how effective the interval might be in the future.

We can simplify the code by dividing the elements developed in the previous section into functions.

First, let’s define a function for loading and preparing a regression dataset defined by a URL.

# load and prepare the dataset
def load_dataset(url):
	dataframe = read_csv(url, header=None)
	values = dataframe.values
	# split into input and output values
	X, y = values[:, :-1], values[:,-1]
	# split into train and test sets
	X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.67, random_state=1)
	# scale input data
	scaler = MinMaxScaler()
	scaler.fit(X_train)
	X_train = scaler.transform(X_train)
	X_test = scaler.transform(X_test)
	return X_train, X_test, y_train, y_test

Next, we can define a function that will define and train an MLP model given the training dataset, then return the fit model ready for making predictions.

# define and fit the model
def fit_model(X_train, y_train):
	# define neural network model
	features = X_train.shape[1]
	model = Sequential()
	model.add(Dense(20, kernel_initializer='he_normal', activation='relu', input_dim=features))
	model.add(Dense(5, kernel_initializer='he_normal', activation='relu'))
	model.add(Dense(1))
	# compile the model and specify loss and optimizer
	opt = Adam(learning_rate=0.01, beta_1=0.85, beta_2=0.999)
	model.compile(optimizer=opt, loss='mse')
	# fit the model on the training dataset
	model.fit(X_train, y_train, verbose=0, epochs=300, batch_size=16)
	return model

We require multiple models to make point predictions that will define a distribution of point predictions from which we can estimate the interval.

As such, we will need to fit multiple models on the training dataset. Each model must be different so that it makes different predictions. This is achieved by the stochastic nature of training MLP models: the random initial weights and the use of the stochastic gradient descent optimization algorithm mean that each training run produces a different model.

The more models we use, the better the distribution of point predictions will estimate the capability of the model. I would recommend at least 10 models; there is perhaps not much benefit beyond 30 models.

The function below fits an ensemble of models and stores them in a list that is returned.

For interest, each fit model is also evaluated on the test set, and this performance is reported after the model is fit. We would expect that each model will have a slightly different estimated performance on the hold-out test set, and the reported scores will help us confirm this expectation.

# fit an ensemble of models
def fit_ensemble(n_members, X_train, X_test, y_train, y_test):
	ensemble = list()
	for i in range(n_members):
		# define and fit the model on the training set
		model = fit_model(X_train, y_train)
		# evaluate model on the test set
		yhat = model.predict(X_test, verbose=0)
		mae = mean_absolute_error(y_test, yhat)
		print('>%d, MAE: %.3f' % (i+1, mae))
		# store the model
		ensemble.append(model)
	return ensemble

Finally, we can use the trained ensemble of models to make point predictions, which can be summarized into a prediction interval.

The function below implements this. First, each model makes a point prediction on the input data, then the 95% prediction interval is calculated and the lower, mean, and upper values of the interval are returned.

The function is designed to take a single row as input, but it could easily be adapted for multiple rows, as sketched after the function below.

# make predictions with the ensemble and calculate a prediction interval
def predict_with_pi(ensemble, X):
	# make predictions
	yhat = [model.predict(X, verbose=0) for model in ensemble]
	yhat = asarray(yhat)
	# calculate 95% gaussian prediction interval
	interval = 1.96 * yhat.std()
	lower, upper = yhat.mean() - interval, yhat.mean() + interval
	return lower, yhat.mean(), upper
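
For example, a possible multi-row variant (a sketch of an extension, not part of the worked example that follows) could aggregate the ensemble predictions column-wise.

# sketch: prediction intervals for multiple rows at once (a possible extension)
from numpy import asarray

def predict_with_pi_rows(ensemble, X):
	# collect predictions with shape (n_members, n_rows)
	yhat = asarray([model.predict(X, verbose=0).flatten() for model in ensemble])
	# per-row mean and 95% gaussian interval across ensemble members
	mean = yhat.mean(axis=0)
	interval = 1.96 * yhat.std(axis=0)
	return mean - interval, mean, mean + interval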

Finally, we can call these functions.

First, the dataset is loaded and prepared, then the ensemble is defined and fit.

...
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
X_train, X_test, y_train, y_test = load_dataset(url)
# fit ensemble
n_members = 30
ensemble = fit_ensemble(n_members, X_train, X_test, y_train, y_test)

We can then use a single row of data from the test set and make a prediction with a prediction interval, the results of which are then reported.

We also report the true value, which we would expect to be covered by the prediction interval (perhaps close to 95% of the time; this is not entirely accurate, but it is a rough approximation).

...
# make predictions with prediction interval
newX = asarray([X_test[0, :]])
lower, mean, upper = predict_with_pi(ensemble, newX)
print('Point prediction: %.3f' % mean)
print('95%% prediction interval: [%.3f, %.3f]' % (lower, upper))
print('True value: %.3f' % y_test[0])

Tying this together, the complete example of making predictions with a prediction interval with a Multilayer Perceptron neural network is listed below.

# prediction interval for mlps on the housing regression dataset
from numpy import asarray
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

# load and prepare the dataset
def load_dataset(url):
	dataframe = read_csv(url, header=None)
	values = dataframe.values
	# split into input and output values
	X, y = values[:, :-1], values[:,-1]
	# split into train and test sets
	X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.67, random_state=1)
	# scale input data
	scaler = MinMaxScaler()
	scaler.fit(X_train)
	X_train = scaler.transform(X_train)
	X_test = scaler.transform(X_test)
	return X_train, X_test, y_train, y_test

# define and fit the model
def fit_model(X_train, y_train):
	# define neural network model
	features = X_train.shape[1]
	model = Sequential()
	model.add(Dense(20, kernel_initializer='he_normal', activation='relu', input_dim=features))
	model.add(Dense(5, kernel_initializer='he_normal', activation='relu'))
	model.add(Dense(1))
	# compile the model and specify loss and optimizer
	opt = Adam(learning_rate=0.01, beta_1=0.85, beta_2=0.999)
	model.compile(optimizer=opt, loss='mse')
	# fit the model on the training dataset
	model.fit(X_train, y_train, verbose=0, epochs=300, batch_size=16)
	return model

# fit an ensemble of models
def fit_ensemble(n_members, X_train, X_test, y_train, y_test):
	ensemble = list()
	for i in range(n_members):
		# define and fit the model on the training set
		model = fit_model(X_train, y_train)
		# evaluate model on the test set
		yhat = model.predict(X_test, verbose=0)
		mae = mean_absolute_error(y_test, yhat)
		print('>%d, MAE: %.3f' % (i+1, mae))
		# store the model
		ensemble.append(model)
	return ensemble

# make predictions with the ensemble and calculate a prediction interval
def predict_with_pi(ensemble, X):
	# make predictions
	yhat = [model.predict(X, verbose=0) for model in ensemble]
	yhat = asarray(yhat)
	# calculate 95% gaussian prediction interval
	interval = 1.96 * yhat.std()
	lower, upper = yhat.mean() - interval, yhat.mean() + interval
	return lower, yhat.mean(), upper

# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
X_train, X_test, y_train, y_test = load_dataset(url)
# fit ensemble
n_members = 30
ensemble = fit_ensemble(n_members, X_train, X_test, y_train, y_test)
# make predictions with prediction interval
newX = asarray([X_test[0, :]])
lower, mean, upper = predict_with_pi(ensemble, newX)
print('Point prediction: %.3f' % mean)
print('95%% prediction interval: [%.3f, %.3f]' % (lower, upper))
print('True value: %.3f' % y_test[0])

Running the example fits each ensemble member in turn and reports its estimated performance on the hold-out test set; finally, a single prediction with a prediction interval is made and reported.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that each model has a slightly different performance, confirming our expectation that the models are indeed different.

Finally, we can see that the ensemble made a point prediction of about 30.5 with a 95% prediction interval of [26.287, 34.822]. We can also see that the true value was 28.2 and that the interval does capture this value, which is great.

>1, MAE: 2.259
>2, MAE: 2.144
>3, MAE: 2.732
>4, MAE: 2.628
>5, MAE: 2.483
>6, MAE: 2.551
>7, MAE: 2.505
>8, MAE: 2.299
>9, MAE: 2.706
>10, MAE: 2.145
>11, MAE: 2.765
>12, MAE: 3.244
>13, MAE: 2.385
>14, MAE: 2.592
>15, MAE: 2.418
>16, MAE: 2.493
>17, MAE: 2.367
>18, MAE: 2.569
>19, MAE: 2.664
>20, MAE: 2.233
>21, MAE: 2.228
>22, MAE: 2.646
>23, MAE: 2.641
>24, MAE: 2.492
>25, MAE: 2.558
>26, MAE: 2.416
>27, MAE: 2.328
>28, MAE: 2.383
>29, MAE: 2.215
>30, MAE: 2.408
Point prediction: 30.555
95% prediction interval: [26.287, 34.822]
True value: 28.200

This is a quick and dirty technique for making predictions with a prediction interval for neural networks, as we discussed above.

There are easy extensions, such as applying the bootstrap method to the point predictions, that may be more reliable, as well as more advanced techniques described in some of the papers listed below that I recommend you explore.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

Papers

Articles

Summary

In this tutorial, you discovered how to calculate a prediction interval for deep learning neural networks.

Specifically, you learned:

  • Prediction intervals provide a measure of uncertainty on regression predictive modeling problems.
  • How to develop and evaluate a simple Multilayer Perceptron neural network on a standard regression problem.
  • How to calculate and report a prediction interval using an ensemble of neural network models.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Prediction Intervals for Deep Learning Neural Networks appeared first on Machine Learning Mastery.

Sensitivity Analysis of Dataset Size vs. Model Performance


Machine learning model performance often improves with dataset size for predictive modeling.

This depends on the specific datasets and on the choice of model, although it often means that using more data can result in better performance and that discoveries made using smaller datasets to estimate model performance often scale to using larger datasets.

The problem is that the relationship is unknown for a given dataset and model, and it may not exist for some datasets and models. Additionally, if such a relationship does exist, there may be a point or points of diminishing returns where adding more data may not improve model performance or where datasets are too small to effectively capture the capability of a model at a larger scale.

These issues can be addressed by performing a sensitivity analysis to quantify the relationship between dataset size and model performance. Once calculated, we can interpret the results of the analysis and make decisions about how much data is enough, and how small a dataset may be to effectively estimate performance on larger datasets.

In this tutorial, you will discover how to perform a sensitivity analysis of dataset size vs. model performance.

After completing this tutorial, you will know:

  • Selecting a dataset size for machine learning is a challenging open problem.
  • Sensitivity analysis provides an approach to quantifying the relationship between model performance and dataset size for a given model and prediction problem.
  • How to perform a sensitivity analysis of dataset size and interpret the results.

Let’s get started.

Sensitivity Analysis of Dataset Size vs. Model Performance

Sensitivity Analysis of Dataset Size vs. Model Performance
Photo by Graeme Churchard, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Dataset Size Sensitivity Analysis
  2. Synthetic Prediction Task and Baseline Model
  3. Sensitivity Analysis of Dataset Size

Dataset Size Sensitivity Analysis

The amount of training data required for a machine learning predictive model is an open question.

It depends on your choice of model, on the way you prepare the data, and on the specifics of the data itself.

For more on the challenge of selecting a training dataset size, see the tutorial:

One way to approach this problem is to perform a sensitivity analysis and discover how the performance of your model on your dataset varies with more or less data.

This might involve evaluating the same model with different sized datasets and looking for a relationship between dataset size and performance or a point of diminishing returns.

Typically, there is a strong relationship between training dataset size and model performance, especially for nonlinear models. The relationship often involves an improvement in performance to a point and a general reduction in the expected variance of the model as the dataset size is increased.

Knowing this relationship for your model and dataset can be helpful for a number of reasons, such as:

  • Evaluate more models.
  • Find a better model.
  • Decide to gather more data.

You can evaluate a large number of models and model configurations quickly on a smaller sample of the dataset with confidence that the performance will likely generalize in a specific way to a larger training dataset.

This may allow evaluating many more models and configurations than you may otherwise be able to given the time available, and in turn, perhaps discover a better overall performing model.

You may also be able to extrapolate and estimate the expected performance of the model on much larger datasets and estimate whether it is worth the effort or expense of gathering more training data.

Now that we are familiar with the idea of performing a sensitivity analysis of model performance to dataset size, let’s look at a worked example.

Synthetic Prediction Task and Baseline Model

Before we dive into a sensitivity analysis, let’s select a dataset and baseline model for the investigation.

We will use a synthetic binary (two-class) classification dataset in this tutorial. This is ideal as it allows us to scale the number of generated samples for the same problem as needed.

The make_classification() scikit-learn function can be used to create a synthetic classification dataset. In this case, we will use 20 input features (columns) and generate 1,000 samples (rows). The seed for the pseudo-random number generator is fixed to ensure the same base “problem” is used each time samples are generated.

The example below generates the synthetic classification dataset and summarizes the shape of the generated data.

# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# summarize the dataset
print(X.shape, y.shape)

Running the example generates the data and reports the size of the input and output components, confirming the expected shape.

(1000, 20) (1000,)

Next, we can evaluate a predictive model on this dataset.

We will use a decision tree (DecisionTreeClassifier) as the predictive model. It was chosen because it is a nonlinear algorithm and has a high variance, which means that we would expect performance to improve with increases in the size of the training dataset.

We will use a best practice of repeated stratified k-fold cross-validation to evaluate the model on the dataset, with 3 repeats and 10 folds.

The complete example of evaluating the decision tree model on the synthetic classification dataset is listed below.

# evaluate a decision tree model on the synthetic classification dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
# load dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# define model evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define model
model = DecisionTreeClassifier()
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Mean Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))

Running the example creates the dataset then estimates the performance of the model on the problem using the chosen test harness.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the mean classification accuracy is about 82.7%.

Mean Accuracy: 0.827 (0.042)

Next, let’s look at how we might perform a sensitivity analysis of dataset size on model performance.

Sensitivity Analysis of Dataset Size

The previous section showed how to evaluate a chosen model on the available dataset.

It raises questions, such as:

Will the model perform better on more data?

More generally, we may have sophisticated questions such as:

Does the estimated performance hold on smaller or larger samples from the problem domain?

These are hard questions to answer, but we can approach them by using a sensitivity analysis. Specifically, we can use a sensitivity analysis to learn:

How sensitive is model performance to dataset size?

Or more generally:

What is the relationship of dataset size to model performance?

There are many ways to perform a sensitivity analysis, but perhaps the simplest approach is to define a test harness to evaluate model performance and then evaluate the same model on the same problem with differently sized datasets.

This will allow the train and test portions of the dataset to increase with the size of the overall dataset.

To make the code easier to read, we will split it up into functions.

First, we can define a function that will prepare (or load) the dataset of a given size. The number of rows in the dataset is specified by an argument to the function.

If you are using this code as a template, this function can be changed to load your dataset from file and select a random sample of a given size, as sketched after the function below.

# load dataset
def load_dataset(n_samples):
	# define the dataset
	X, y = make_classification(n_samples=int(n_samples), n_features=20, n_informative=15, n_redundant=5, random_state=1)
	return X, y
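
For instance, a hedged sketch of such a replacement (the filename is a placeholder and the target is assumed to be in the last column) might look like the following.

# sketch: load a csv dataset and take a random sample of a given size
from pandas import read_csv

def load_dataset(n_samples):
	# 'your_dataset.csv' is a placeholder filename; target assumed in the last column
	dataframe = read_csv('your_dataset.csv', header=None)
	# select a random sample of rows without replacement
	sample = dataframe.sample(n=int(n_samples), random_state=1)
	values = sample.values
	X, y = values[:, :-1], values[:, -1]
	return X, y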

Next, we need a function to evaluate a model on a loaded dataset.

We will define a function that takes a dataset and returns a summary of the performance of the model evaluated using the test harness on the dataset.

This function is listed below, taking the input and output elements of a dataset and returning the mean and standard deviation of the decision tree model on the dataset.

# evaluate a model
def evaluate_model(X, y):
	# define model evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# define model
	model = DecisionTreeClassifier()
	# evaluate model
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	# return summary stats
	return [scores.mean(), scores.std()]

Next, we can define a range of different dataset sizes to evaluate.

The sizes should be chosen proportional to the amount of data you have available and the amount of running time you are willing to expend.

In this case, we will keep the sizes modest to limit running time, from 50 to one million rows on a rough log10 scale.

...
# define number of samples to consider
sizes = [50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000, 1000000]

Next, we can enumerate each dataset size, create the dataset, evaluate a model on the dataset, and store the results for later analysis.

...
# evaluate each number of samples
means, stds = list(), list()
for n_samples in sizes:
	# get a dataset
	X, y = load_dataset(n_samples)
	# evaluate a model on this dataset size
	mean, std = evaluate_model(X, y)
	# store
	means.append(mean)
	stds.append(std)

Next, we can summarize the relationship between the dataset size and model performance.

In this case, we will simply plot the result with error bars so we can spot any trends visually.

We will use the standard deviation as a measure of uncertainty on the estimated model performance. This can be achieved by multiplying the value by 2 to cover approximately 95% of the expected performance, assuming that performance follows a normal distribution (the code below also caps the bar at 1.0 to keep it on the accuracy scale).

This can be shown on the plot as an error bar around the mean expected performance for a dataset size.

...
# define error bar as 2 standard deviations from the mean or 95%
err = [min(1, s * 2) for s in stds]
# plot dataset size vs mean performance with error bars
pyplot.errorbar(sizes, means, yerr=err, fmt='-o')

To make the plot more readable, we can change the scale of the x-axis to log, given that our dataset sizes are on a rough log10 scale.

...
# change the scale of the x-axis to log
ax = pyplot.gca()
ax.set_xscale("log", nonpositive='clip')
# show the plot
pyplot.show()

And that’s it.

We would generally expect mean model performance to increase with dataset size. We would also expect the uncertainty in model performance to decrease with dataset size.

Tying this all together, the complete example of performing a sensitivity analysis of dataset size on model performance is listed below.

# sensitivity analysis of model performance to dataset size
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from matplotlib import pyplot

# load dataset
def load_dataset(n_samples):
	# define the dataset
	X, y = make_classification(n_samples=int(n_samples), n_features=20, n_informative=15, n_redundant=5, random_state=1)
	return X, y

# evaluate a model
def evaluate_model(X, y):
	# define model evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# define model
	model = DecisionTreeClassifier()
	# evaluate model
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	# return summary stats
	return [scores.mean(), scores.std()]

# define number of samples to consider
sizes = [50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000, 1000000]
# evaluate each number of samples
means, stds = list(), list()
for n_samples in sizes:
	# get a dataset
	X, y = load_dataset(n_samples)
	# evaluate a model on this dataset size
	mean, std = evaluate_model(X, y)
	# store
	means.append(mean)
	stds.append(std)
	# summarize performance
	print('>%d: %.3f (%.3f)' % (n_samples, mean, std))
# define error bar as 2 standard deviations from the mean or 95%
err = [min(1, s * 2) for s in stds]
# plot dataset size vs mean performance with error bars
pyplot.errorbar(sizes, means, yerr=err, fmt='-o')
# change the scale of the x-axis to log
ax = pyplot.gca()
ax.set_xscale("log", nonpositive='clip')
# show the plot
pyplot.show()

Running the example reports progress along the way, showing the dataset size versus the estimated model performance.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the expected trend of increasing mean model performance with dataset size and decreasing model variance measured using the standard deviation of classification accuracy.

We can see that there is perhaps a point of diminishing returns in estimating model performance at perhaps 10,000 or 50,000 rows.

Specifically, we do see an improvement in performance with more rows, but we can probably capture this relationship, with little variance, using 10K or 50K rows of data.

We can also see a drop-off in estimated performance with 1,000,000 rows of data, suggesting that we are probably maxing out the capability of the model above 100,000 rows and are instead measuring statistical noise in the estimate.

This might mean that we have reached an upper bound on expected performance and that more data beyond this point will likely not improve the specific model and configuration on the chosen test harness.

>50: 0.673 (0.141)
>100: 0.703 (0.135)
>500: 0.809 (0.055)
>1000: 0.826 (0.044)
>5000: 0.835 (0.016)
>10000: 0.866 (0.011)
>50000: 0.900 (0.005)
>100000: 0.912 (0.003)
>500000: 0.938 (0.001)
>1000000: 0.936 (0.001)

The plot makes the relationship between dataset size and estimated model performance much clearer.

The relationship is nearly linear with the log of the dataset size. The uncertainty shown by the error bars also decreases dramatically on the plot, from very large values with 50 or 100 samples, to modest values with 5,000 and 10,000 samples, to practically nothing beyond these sizes.

Given the modest spread with 5,000 and 10,000 samples and the practically log-linear relationship, we could probably get away with using 5K or 10K rows to approximate model performance.

Line Plot With Error Bars of Dataset Size vs. Model Performance

Line Plot With Error Bars of Dataset Size vs. Model Performance

We could use these findings as the basis for testing additional model configurations and even different model types.

The danger is that different models may perform very differently with more or less data and it may be wise to repeat the sensitivity analysis with a different chosen model to confirm the relationship holds. Alternately, it may be interesting to repeat the analysis with a suite of different model types.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

APIs

Articles

Summary

In this tutorial, you discovered how to perform a sensitivity analysis of dataset size vs. model performance.

Specifically, you learned:

  • Selecting a dataset size for machine learning is a challenging open problem.
  • Sensitivity analysis provides an approach to quantifying the relationship between model performance and dataset size for a given model and prediction problem.
  • How to perform a sensitivity analysis of dataset size and interpret the results.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Sensitivity Analysis of Dataset Size vs. Model Performance appeared first on Machine Learning Mastery.

Evolution Strategies From Scratch in Python


Evolution strategies is a stochastic global optimization algorithm.

It is an evolutionary algorithm related to others, such as the genetic algorithm, although it is designed specifically for continuous function optimization.

In this tutorial, you will discover how to implement the evolution strategies optimization algorithm.

After completing this tutorial, you will know:

  • Evolution Strategies is a stochastic global optimization algorithm inspired by the biological theory of evolution by natural selection.
  • There is a standard terminology for Evolution Strategies and two common versions of the algorithm referred to as (mu, lambda)-ES and (mu + lambda)-ES.
  • How to implement and apply the Evolution Strategies algorithm to continuous objective functions.

Let’s get started.

Evolution Strategies From Scratch in Python

Evolution Strategies From Scratch in Python
Photo by Alexis A. Bermúdez, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Evolution Strategies
  2. Develop a (mu, lambda)-ES
  3. Develop a (mu + lambda)-ES

Evolution Strategies

Evolution Strategies, sometimes referred to as Evolution Strategy (singular) or ES, is a stochastic global optimization algorithm.

The technique was developed in the 1960s to be implemented manually by engineers for minimal drag designs in a wind tunnel.

The family of algorithms known as Evolution Strategies (ES) were developed by Ingo Rechenberg and Hans-Paul Schwefel at the Technical University of Berlin in the mid 1960s.

— Page 31, Essentials of Metaheuristics, 2011.

Evolution Strategies is a type of evolutionary algorithm and is inspired by the biological theory of evolution by means of natural selection. Unlike other evolutionary algorithms, it does not use any form of crossover; instead, modification of candidate solutions is limited to mutation operators. In this way, Evolution Strategies may be thought of as a type of parallel stochastic hill climbing.

The algorithm involves a population of candidate solutions that initially are randomly generated. Each iteration of the algorithm involves first evaluating the population of solutions, then deleting all but a subset of the best solutions, referred to as truncation selection. The remaining solutions (the parents) each are used as the basis for generating a number of new candidate solutions (mutation) that replace or compete with the parents for a position in the population for consideration in the next iteration of the algorithm (generation).

There are a number of variations of this procedure and a standard terminology to summarize the algorithm. The size of the population is referred to as lambda and the number of parents selected each iteration is referred to as mu.

The number of children created from each parent is calculated as (lambda / mu) and parameters should be chosen so that the division has no remainder.

  • mu: The number of parents selected each iteration.
  • lambda: Size of the population.
  • lambda / mu: Number of children generated from each selected parent.

A bracket notation is used to describe the algorithm configuration, e.g. (mu, lambda)-ES. For example, if mu=5 and lambda=20, then it would be summarized as (5, 20)-ES. A comma (,) separating the mu and lambda parameters indicates that the children replace the parents directly each iteration of the algorithm.

  • (mu, lambda)-ES: A version of evolution strategies where children replace parents.

A plus (+) separation of the mu and lambda parameters indicates that the children and the parents together will define the population for the next iteration.

  • (mu + lambda)-ES: A version of evolution strategies where children and parents are added to the population.

A stochastic hill climbing algorithm can be implemented as an Evolution Strategy and would have the notation (1 + 1)-ES.

This is the simplest or canonical ES algorithm, and there are many extensions and variants described in the literature.
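
To give a flavor of this simplest case, the sketch below shows a toy (1 + 1)-ES minimizing a one-dimensional objective (f(x) = x^2, chosen purely for illustration, with no bounds checking).

# sketch: a toy (1 + 1)-ES minimizing a one-dimensional objective
from numpy.random import randn
from numpy.random import seed
seed(1)
# illustrative objective with its minimum at x=0
objective = lambda x: x**2.0
# a single parent, initialized arbitrarily
parent = 5.0
parent_eval = objective(parent)
step_size = 0.15
for i in range(1000):
	# create one child by gaussian mutation of the parent
	child = parent + randn() * step_size
	child_eval = objective(child)
	# plus selection: the parent and child compete and the better one survives
	if child_eval <= parent_eval:
		parent, parent_eval = child, child_eval
print('Best: f(%.5f) = %.5f' % (parent, parent_eval))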

Now that we are familiar with Evolution Strategies we can explore how to implement the algorithm.

Develop a (mu, lambda)-ES

In this section, we will develop a (mu, lambda)-ES, that is, a version of the algorithm where children replace parents.

First, let’s define a challenging optimization problem as the basis for implementing the algorithm.

The Ackley function is an example of a multimodal objective function that has a single global optimum and multiple local optima in which a local search might get stuck.

As such, a global optimization technique is required. It is a two-dimensional objective function that has its global optimum at [0,0], which evaluates to 0.0.

The example below implements the Ackley function and creates a three-dimensional surface plot showing the global optimum and the multiple local optima.

# ackley multimodal function
from numpy import arange
from numpy import exp
from numpy import sqrt
from numpy import cos
from numpy import e
from numpy import pi
from numpy import meshgrid
from matplotlib import pyplot
from mpl_toolkits.mplot3d import Axes3D

# objective function
def objective(x, y):
	return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()

Running the example creates the surface plot of the Ackley function showing the vast number of local optima.

3D Surface Plot of the Ackley Multimodal Function

3D Surface Plot of the Ackley Multimodal Function

We will be generating random candidate solutions as well as modified versions of existing candidate solutions. It is important that all candidate solutions are within the bounds of the search problem.

To achieve this, we will develop a function to check whether a candidate solution is within the bounds of the search and then discard it and generate another solution if it is not.

The in_bounds() function below will take a candidate solution (point) and the definition of the bounds of the search space (bounds) and return True if the solution is within the bounds of the search or False otherwise.

# check if a point is within the bounds of the search
def in_bounds(point, bounds):
	# enumerate all dimensions of the point
	for d in range(len(bounds)):
		# check if out of bounds for this dimension
		if point[d] < bounds[d, 0] or point[d] > bounds[d, 1]:
			return False
	return True

We can then use this function when generating the initial population of “lam” (i.e. lambda) random candidate solutions.

For example:

...
# initial population
population = list()
for _ in range(lam):
	candidate = None
	while candidate is None or not in_bounds(candidate, bounds):
		candidate = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	population.append(candidate)

Next, we can iterate over a fixed number of iterations of the algorithm. Each iteration first involves evaluating each candidate solution in the population.

We will calculate the scores and store them in a separate parallel list.

...
# evaluate fitness for the population
scores = [objective(c) for c in population]

Next, we need to select the “mu” parents with the best scores, which are the lowest scores in this case, as we are minimizing the objective function.

We will do this in two steps. First, we will rank the candidate solutions based on their scores in ascending order so that the solution with the lowest score has a rank of 0, the next has a rank of 1, and so on. We can use a double call of the argsort function to achieve this.

We will then use the ranks and select those candidates that have a rank below the value “mu.” This means that if mu is set to 5, only the five candidates with ranks 0 through 4 will be selected as parents.

...
# rank scores in ascending order
ranks = argsort(argsort(scores))
# select the indexes for the top mu ranked solutions
selected = [i for i,_ in enumerate(ranks) if ranks[i] < mu]
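
To see why the double call to argsort yields ranks, consider the small sketch below (the scores are made up for illustration).

# sketch: a double argsort converts scores into ranks
from numpy import argsort
scores = [0.7, 0.1, 0.5, 0.3]
# the first argsort gives the indexes that would sort the scores
# the second argsort converts those indexes into the rank of each score
ranks = argsort(argsort(scores))
print(ranks)  # prints [3 0 2 1]: the lowest score gets rank 0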

We can then create children for each selected parent.

First, we must calculate the total number of children to create per parent.

...
# calculate the number of children per parent
n_children = int(lam / mu)

We can then iterate over each parent and create modified versions of each.

We will create children using a similar technique used in stochastic hill climbing. Specifically, each variable will be sampled using a Gaussian distribution with the current value as the mean and the standard deviation provided as a “step size” hyperparameter.

...
# create children for parent
for _ in range(n_children):
	child = None
	while child is None or not in_bounds(child, bounds):
		child = population[i] + randn(len(bounds)) * step_size

We can also check if each selected parent is better than the best solution seen so far so that we can return the best solution at the end of the search.

...
# check if this parent is the best solution ever seen
if scores[i] < best_eval:
	best, best_eval = population[i], scores[i]
	print('%d, Best: f(%s) = %.5f' % (epoch, best, best_eval))

The created children can be added to a list and we can replace the population with the list of children at the end of the algorithm iteration.

...
# replace population with children
population = children

We can tie all of this together into a function named es_comma() that performs the comma version of the Evolution Strategy algorithm.

The function takes the name of the objective function, the bounds of the search space, the number of iterations, the step size, and the mu and lambda hyperparameters and returns the best solution found during the search and its evaluation.

# evolution strategy (mu, lambda) algorithm
def es_comma(objective, bounds, n_iter, step_size, mu, lam):
	best, best_eval = None, 1e+10
	# calculate the number of children per parent
	n_children = int(lam / mu)
	# initial population
	population = list()
	for _ in range(lam):
		candidate = None
		while candidate is None or not in_bounds(candidate, bounds):
			candidate = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
		population.append(candidate)
	# perform the search
	for epoch in range(n_iter):
		# evaluate fitness for the population
		scores = [objective(c) for c in population]
		# rank scores in ascending order
		ranks = argsort(argsort(scores))
		# select the indexes for the top mu ranked solutions
		selected = [i for i,_ in enumerate(ranks) if ranks[i] < mu]
		# create children from parents
		children = list()
		for i in selected:
			# check if this parent is the best solution ever seen
			if scores[i] < best_eval:
				best, best_eval = population[i], scores[i]
				print('%d, Best: f(%s) = %.5f' % (epoch, best, best_eval))
			# create children for parent
			for _ in range(n_children):
				child = None
				while child is None or not in_bounds(child, bounds):
					child = population[i] + randn(len(bounds)) * step_size
				children.append(child)
		# replace population with children
		population = children
	return [best, best_eval]

Next, we can apply this algorithm to our Ackley objective function.

We will run the algorithm for 5,000 iterations and use a step size of 0.15 in the search space. We will use a population size (lambda) of 100 and select 20 parents (mu) each iteration. These hyperparameters were chosen after a little trial and error.

At the end of the search, we will report the best candidate solution found during the search.

...
# seed the pseudorandom number generator
seed(1)
# define range for input
bounds = asarray([[-5.0, 5.0], [-5.0, 5.0]])
# define the total iterations
n_iter = 5000
# define the maximum step size
step_size = 0.15
# number of parents selected
mu = 20
# the number of children generated by parents
lam = 100
# perform the evolution strategy (mu, lambda) search
best, score = es_comma(objective, bounds, n_iter, step_size, mu, lam)
print('Done!')
print('f(%s) = %f' % (best, score))

Tying this together, the complete example of applying the comma version of the Evolution Strategies algorithm to the Ackley objective function is listed below.

# evolution strategy (mu, lambda) of the ackley objective function
from numpy import asarray
from numpy import exp
from numpy import sqrt
from numpy import cos
from numpy import e
from numpy import pi
from numpy import argsort
from numpy.random import randn
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(v):
	x, y = v
	return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20

# check if a point is within the bounds of the search
def in_bounds(point, bounds):
	# enumerate all dimensions of the point
	for d in range(len(bounds)):
		# check if out of bounds for this dimension
		if point[d] < bounds[d, 0] or point[d] > bounds[d, 1]:
			return False
	return True

# evolution strategy (mu, lambda) algorithm
def es_comma(objective, bounds, n_iter, step_size, mu, lam):
	best, best_eval = None, 1e+10
	# calculate the number of children per parent
	n_children = int(lam / mu)
	# initial population
	population = list()
	for _ in range(lam):
		candidate = None
		while candidate is None or not in_bounds(candidate, bounds):
			candidate = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
		population.append(candidate)
	# perform the search
	for epoch in range(n_iter):
		# evaluate fitness for the population
		scores = [objective(c) for c in population]
		# rank scores in ascending order
		ranks = argsort(argsort(scores))
		# select the indexes for the top mu ranked solutions
		selected = [i for i,_ in enumerate(ranks) if ranks[i] < mu]
		# create children from parents
		children = list()
		for i in selected:
			# check if this parent is the best solution ever seen
			if scores[i] < best_eval:
				best, best_eval = population[i], scores[i]
				print('%d, Best: f(%s) = %.5f' % (epoch, best, best_eval))
			# create children for parent
			for _ in range(n_children):
				child = None
				while child is None or not in_bounds(child, bounds):
					child = population[i] + randn(len(bounds)) * step_size
				children.append(child)
		# replace population with children
		population = children
	return [best, best_eval]


# seed the pseudorandom number generator
seed(1)
# define range for input
bounds = asarray([[-5.0, 5.0], [-5.0, 5.0]])
# define the total iterations
n_iter = 5000
# define the maximum step size
step_size = 0.15
# number of parents selected
mu = 20
# the number of children generated by parents
lam = 100
# perform the evolution strategy (mu, lambda) search
best, score = es_comma(objective, bounds, n_iter, step_size, mu, lam)
print('Done!')
print('f(%s) = %f' % (best, score))

Running the example reports the candidate solution and scores each time a better solution is found, then reports the best solution found at the end of the search.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that about 22 improvements to performance were seen during the search and the best solution is close to the optima.

No doubt, this solution can be provided as a starting point to a local search algorithm to be further refined, a common practice when using a global optimization algorithm like ES.

0, Best: f([-0.82977995 2.20324493]) = 6.91249
0, Best: f([-1.03232526 0.38816734]) = 4.49240
1, Best: f([-1.02971385 0.21986453]) = 3.68954
2, Best: f([-0.98361735 0.19391181]) = 3.40796
2, Best: f([-0.98189724 0.17665892]) = 3.29747
2, Best: f([-0.07254927 0.67931431]) = 3.29641
3, Best: f([-0.78716147 0.02066442]) = 2.98279
3, Best: f([-1.01026218 -0.03265665]) = 2.69516
3, Best: f([-0.08851828 0.26066485]) = 2.00325
4, Best: f([-0.23270782 0.04191618]) = 1.66518
4, Best: f([-0.01436704 0.03653578]) = 0.15161
7, Best: f([0.01247004 0.01582657]) = 0.06777
9, Best: f([0.00368129 0.00889718]) = 0.02970
25, Best: f([ 0.00666975 -0.0045051 ]) = 0.02449
33, Best: f([-0.00072633 -0.00169092]) = 0.00530
211, Best: f([2.05200123e-05 1.51343187e-03]) = 0.00434
315, Best: f([ 0.00113528 -0.00096415]) = 0.00427
418, Best: f([ 0.00113735 -0.00030554]) = 0.00337
491, Best: f([ 0.00048582 -0.00059587]) = 0.00219
704, Best: f([-6.91643854e-04 -4.51583644e-05]) = 0.00197
1504, Best: f([ 2.83063223e-05 -4.60893027e-04]) = 0.00131
3725, Best: f([ 0.00032757 -0.00023643]) = 0.00115
Done!
f([ 0.00032757 -0.00023643]) = 0.001147

Now that we are familiar with how to implement the comma version of evolution strategies, let’s look at how we might implement the plus version.

Develop a (mu + lambda)-ES

The plus version of the Evolution Strategies algorithm is very similar to the comma version.

The main difference is that both the children and the parents comprise the population at the end of each iteration, instead of just the children. This allows the parents to compete with the children for selection in the next iteration of the algorithm.

This can result in greedier behavior by the search algorithm and potentially premature convergence to local optima (suboptimal solutions). The benefit is that the algorithm is able to exploit good candidate solutions that were found and focus the search intently on that region, potentially finding further improvements.

We can implement the plus version of the algorithm by modifying the function to add parents to the population when creating the children.

...
# keep the parent
children.append(population[i])

The updated version of the function with this addition, renamed es_plus(), is listed below.

# evolution strategy (mu + lambda) algorithm
def es_plus(objective, bounds, n_iter, step_size, mu, lam):
	best, best_eval = None, 1e+10
	# calculate the number of children per parent
	n_children = int(lam / mu)
	# initial population
	population = list()
	for _ in range(lam):
		candidate = None
		while candidate is None or not in_bounds(candidate, bounds):
			candidate = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
		population.append(candidate)
	# perform the search
	for epoch in range(n_iter):
		# evaluate fitness for the population
		scores = [objective(c) for c in population]
		# rank scores in ascending order
		ranks = argsort(argsort(scores))
		# select the indexes for the top mu ranked solutions
		selected = [i for i,_ in enumerate(ranks) if ranks[i] < mu]
		# create children from parents
		children = list()
		for i in selected:
			# check if this parent is the best solution ever seen
			if scores[i] < best_eval:
				best, best_eval = population[i], scores[i]
				print('%d, Best: f(%s) = %.5f' % (epoch, best, best_eval))
			# keep the parent
			children.append(population[i])
			# create children for parent
			for _ in range(n_children):
				child = None
				while child is None or not in_bounds(child, bounds):
					child = population[i] + randn(len(bounds)) * step_size
				children.append(child)
		# replace population with children
		population = children
	return [best, best_eval]

We can apply this version of the algorithm to the Ackley objective function with the same hyperparameters used in the previous section.

The complete example is listed below.

# evolution strategy (mu + lambda) of the ackley objective function
from numpy import asarray
from numpy import exp
from numpy import sqrt
from numpy import cos
from numpy import e
from numpy import pi
from numpy import argsort
from numpy.random import randn
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(v):
	x, y = v
	return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20

# check if a point is within the bounds of the search
def in_bounds(point, bounds):
	# enumerate all dimensions of the point
	for d in range(len(bounds)):
		# check if out of bounds for this dimension
		if point[d] < bounds[d, 0] or point[d] > bounds[d, 1]:
			return False
	return True

# evolution strategy (mu + lambda) algorithm
def es_plus(objective, bounds, n_iter, step_size, mu, lam):
	best, best_eval = None, 1e+10
	# calculate the number of children per parent
	n_children = int(lam / mu)
	# initial population
	population = list()
	for _ in range(lam):
		candidate = None
		while candidate is None or not in_bounds(candidate, bounds):
			candidate = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
		population.append(candidate)
	# perform the search
	for epoch in range(n_iter):
		# evaluate fitness for the population
		scores = [objective(c) for c in population]
		# rank scores in ascending order
		ranks = argsort(argsort(scores))
		# select the indexes for the top mu ranked solutions
		selected = [i for i,_ in enumerate(ranks) if ranks[i] < mu]
		# create children from parents
		children = list()
		for i in selected:
			# check if this parent is the best solution ever seen
			if scores[i] < best_eval:
				best, best_eval = population[i], scores[i]
				print('%d, Best: f(%s) = %.5f' % (epoch, best, best_eval))
			# keep the parent
			children.append(population[i])
			# create children for parent
			for _ in range(n_children):
				child = None
				while child is None or not in_bounds(child, bounds):
					child = population[i] + randn(len(bounds)) * step_size
				children.append(child)
		# replace population with children
		population = children
	return [best, best_eval]

# seed the pseudorandom number generator
seed(1)
# define range for input
bounds = asarray([[-5.0, 5.0], [-5.0, 5.0]])
# define the total iterations
n_iter = 5000
# define the maximum step size
step_size = 0.15
# number of parents selected
mu = 20
# the number of children generated by parents
lam = 100
# perform the evolution strategy (mu + lambda) search
best, score = es_plus(objective, bounds, n_iter, step_size, mu, lam)
print('Done!')
print('f(%s) = %f' % (best, score))

Running the example reports the candidate solution and scores each time a better solution is found, then reports the best solution found at the end of the search.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that about 24 improvements to performance were seen during the search. We can also see that a better final solution was found with an evaluation of 0.000532, compared to 0.001147 found with the comma version on this objective function.

0, Best: f([-0.82977995 2.20324493]) = 6.91249
0, Best: f([-1.03232526 0.38816734]) = 4.49240
1, Best: f([-1.02971385 0.21986453]) = 3.68954
2, Best: f([-0.96315064 0.21176994]) = 3.48942
2, Best: f([-0.9524528 -0.19751564]) = 3.39266
2, Best: f([-1.02643442 0.14956346]) = 3.24784
2, Best: f([-0.90172166 0.15791013]) = 3.17090
2, Best: f([-0.15198636 0.42080645]) = 3.08431
3, Best: f([-0.76669476 0.03852254]) = 3.06365
3, Best: f([-0.98979547 -0.01479852]) = 2.62138
3, Best: f([-0.10194792 0.33439734]) = 2.52353
3, Best: f([0.12633886 0.27504489]) = 2.24344
4, Best: f([-0.01096566 0.22380389]) = 1.55476
4, Best: f([0.16241469 0.12513091]) = 1.44068
5, Best: f([-0.0047592 0.13164993]) = 0.77511
5, Best: f([ 0.07285478 -0.0019298 ]) = 0.34156
6, Best: f([-0.0323925 -0.06303525]) = 0.32951
6, Best: f([0.00901941 0.0031937 ]) = 0.02950
32, Best: f([ 0.00275795 -0.00201658]) = 0.00997
109, Best: f([-0.00204732 0.00059337]) = 0.00615
195, Best: f([-0.00101671 0.00112202]) = 0.00434
555, Best: f([ 0.00020392 -0.00044394]) = 0.00139
2804, Best: f([3.86555110e-04 6.42776651e-05]) = 0.00111
4357, Best: f([ 0.00013889 -0.0001261 ]) = 0.00053
Done!
f([ 0.00013889 -0.0001261 ]) = 0.000532

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Papers

Books

Articles

Summary

In this tutorial, you discovered how to implement the evolution strategies optimization algorithm.

Specifically, you learned:

  • Evolution Strategies is a stochastic global optimization algorithm inspired by the biological theory of evolution by natural selection.
  • There is a standard terminology for Evolution Strategies and two common versions of the algorithm referred to as (mu, lambda)-ES and (mu + lambda)-ES.
  • How to implement and apply the Evolution Strategies algorithm to continuous objective functions.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Evolution Strategies From Scratch in Python appeared first on Machine Learning Mastery.

Differential Evolution Global Optimization With Python

Differential Evolution is a global optimization algorithm.

It is a type of evolutionary algorithm and is related to other evolutionary algorithms such as the genetic algorithm.

Unlike the genetic algorithm, it was specifically designed to operate upon vectors of real-valued numbers instead of bitstrings. Also unlike the genetic algorithm, it uses vector operations like vector subtraction and addition to navigate the search space instead of genetics-inspired transforms.

In this tutorial, you will discover the Differential Evolution global optimization algorithm.

After completing this tutorial, you will know:

  • Differential Evolution optimization is a type of evolutionary algorithm that is designed to work with real-valued candidate solutions.
  • How to use the Differential Evolution optimization algorithm API in python.
  • Examples of using Differential Evolution to solve global optimization problems with multiple optima.

Let’s get started.

Differential Evolution Global Optimization With Python

Differential Evolution Global Optimization With Python
Photo by Gergely Csatari, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Differential Evolution
  2. Differential Evolution API
  3. Differential Evolution Worked Example

Differential Evolution

Differential Evolution, or DE for short, is a stochastic global search optimization algorithm.

It is a type of evolutionary algorithm and is related to other evolutionary algorithms such as the genetic algorithm. Unlike the genetic algorithm that represents candidate solutions using sequences of bits, Differential Evolution is designed to work with multi-dimensional real-valued candidate solutions for continuous objective functions.

Differential evolution (DE) is arguably one of the most powerful stochastic real-parameter optimization algorithms in current use.

— Differential Evolution: A Survey of the State-of-the-Art, 2011.

The algorithm does not make use of gradient information in the search, and as such, is well suited to non-differentiable nonlinear objective functions.

The algorithm works by maintaining a population of candidate solutions represented as real-valued vectors. New candidate solutions are created by making modified versions of existing solutions that then replace a large portion of the population each iteration of the algorithm.

New candidate solutions are created using a “strategy” that involves selecting a base solution to which a mutation is added, along with other candidate solutions from the population from which the amount and type of mutation, called a difference vector, is calculated. For example, a strategy may select the best candidate solution as the base and random solutions for the difference vector in the mutation.

DE generates new parameter vectors by adding the weighted difference between two population vectors to a third vector.

— Differential Evolution: A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces, 1997.

Base solutions are replaced by their children if the children have a better objective function evaluation.

Finally, after we have built up a new group of children, we compare each child with the parent which created it (each parent created a single child). If the child is better than the parent, it replaces the parent in the original population.

— Page 51, Essentials of Metaheuristics, 2011.

Mutation is calculated as the difference between pairs of candidate solutions, resulting in a difference vector that is then added to the base solution, weighted by a mutation factor hyperparameter set in the range [0,2].

Not all elements of a base solution are mutated. This is controlled via a recombination hyperparameter and is often set to a large value such as 80 percent, meaning most but not all variables in a base solution are replaced. The decision to keep or replace a value in a base solution is determined for each position separately by sampling a probability distribution such as a binomial or exponential.
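
To make these mutation and recombination steps concrete, the snippet below is a minimal illustrative sketch, not the SciPy implementation; the function names and the 0.5 and 0.8 defaults are assumptions for demonstration only.

# illustrative sketch of DE mutation and binomial crossover (not the SciPy
# implementation; names and defaults are assumptions for demonstration)
from numpy.random import rand
from numpy.random import randint

# add the weighted difference of two population vectors to the base vector
def de_mutate(base, a, b, f=0.5):
	return base + f * (a - b)

# keep or replace each position of the base independently with probability cr
def binomial_crossover(base, mutated, cr=0.8):
	trial = base.copy()
	# guarantee at least one position comes from the mutated vector
	forced = randint(len(base))
	for j in range(len(base)):
		if rand() < cr or j == forced:
			trial[j] = mutated[j]
	return trial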

A standard terminology is used to describe differential strategies with the form:

  • DE/x/y/z

Where DE stands for “differential evolution,” x defines the base solution that is to be mutated, such as “rand” for random or “best” for the best solution in the population. The y stands for the number of difference vectors added to the base solution, such as 1, and z defines the probability distribution for determining if each solution is kept or replaced in the population, such as “bin” for binomial or “exp” for exponential.

The general convention used above is DE/x/y/z, where DE stands for “differential evolution,” x represents a string denoting the base vector to be perturbed, y is the number of difference vectors considered for perturbation of x, and z stands for the type of crossover being used (exp: exponential; bin: binomial).

— Differential Evolution: A Survey of the State-of-the-Art, 2011.

The configurations DE/best/1/bin and DE/best/2/bin are popular as they perform well for many objective functions.

The experiments performed by Mezura-Montes et al. indicate that DE/best/1/bin (using always the best solution to find search directions and also binomial crossover) remained the most competitive scheme, regardless the characteristics of the problem to be solved, based on final accuracy and robustness of results.

— Differential Evolution: A Survey of the State-of-the-Art, 2011.

Now that we are familiar with the differential evolution algorithm, let’s look at how we might use the SciPy API implementation.

Differential Evolution API

The Differential Evolution global optimization algorithm is available in Python via the differential_evolution() SciPy function.

The function takes the name of the objective function and the bounds of each input variable as minimum arguments for the search.

...
# perform the differential evolution search
result = differential_evolution(objective, bounds)

There are a number of additional hyperparameters for the search that have default values, although you can configure them to customize the search.

A key hyperparameter is the “strategy” argument that controls the type of differential evolution search that is performed. By default, this is set to “best1bin” (DE/best/1/bin), which is a good configuration for most problems. It creates new candidate solutions by selecting random solutions from the population, subtracting one from the other, and adding a scaled version of the difference to the best candidate solution in the population.

  • new = best + (mutation * (rand1 – rand2))

The “popsize” argument controls the size or number of candidate solutions that are maintained in the population. It is a multiplier of the number of dimensions in candidate solutions and, by default, it is set to 15. That means for a 2D objective function, a population of (2 * 15) or 30 candidate solutions will be maintained.

The total number of iterations of the algorithm is controlled by the “maxiter” argument and defaults to 1,000.

The “mutation” argument controls the amount of mutation applied each iteration; it is the weight applied to the difference vectors and, by default, is set to 0.5. The amount of recombination is controlled via the “recombination” argument, which is set to 0.7 (70 percent of a given candidate solution) by default.

Finally, a local search is applied to the best candidate solution found at the end of the search. This is controlled via the “polish” argument, which by default is set to True.
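
As a hedged illustration, the hyperparameters described above can be configured explicitly; the values shown are simply the documented defaults, and objective and bounds are assumed to be defined as in the worked example later in the tutorial.

...
# configure the search explicitly with the documented default values
result = differential_evolution(objective, bounds, strategy='best1bin', popsize=15, maxiter=1000, mutation=0.5, recombination=0.7, polish=True)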

The result of the search is an OptimizeResult object where properties can be accessed like a dictionary. The success (or not) of the search can be accessed via the ‘success‘ or ‘message‘ key.

The total number of function evaluations can be accessed via ‘nfev‘ and the optimal input found for the search is accessible via the ‘x‘ key.

Now that we are familiar with the differential evolution API in Python, let’s look at some worked examples.

Differential Evolution Worked Example

In this section, we will look at an example of using the differential evolution algorithm on a challenging objective function.

The Ackley function is an example of an objective function that has a single global optimum and multiple local optima in which a local search might get stuck.

As such, a global optimization technique is required. It is a two-dimensional objective function that has a global optimum at [0,0], which evaluates to 0.0.

The example below implements the Ackley function and creates a three-dimensional surface plot showing the global optimum and the many local optima.

# ackley multimodal function
from numpy import arange
from numpy import exp
from numpy import sqrt
from numpy import cos
from numpy import e
from numpy import pi
from numpy import meshgrid
from matplotlib import pyplot
from mpl_toolkits.mplot3d import Axes3D

# objective function
def objective(x, y):
	return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()

Running the example creates the surface plot of the Ackley function showing the vast number of local optima.

3D Surface Plot of the Ackley Multimodal Function

3D Surface Plot of the Ackley Multimodal Function

We can apply the differential evolution algorithm to the Ackley objective function.

First, we can define the bounds of the search space as the limits of the function in each dimension.

...
# define the bounds on the search
bounds = [[r_min, r_max], [r_min, r_max]]

We can then apply the search by specifying the name of the objective function and the bounds of the search. In this case, we will use the default hyperparameters.

...
# perform the differential evolution search
result = differential_evolution(objective, bounds)

After the search is complete, it will report the status of the search and the total number of function evaluations performed, as well as the best solution found and its evaluation.

...
# summarize the result
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
# evaluate solution
solution = result['x']
evaluation = objective(solution)
print('Solution: f(%s) = %.5f' % (solution, evaluation))

Tying this together, the complete example of applying differential evolution to the Ackley objective function is listed below.

# differential evolution global optimization for the ackley multimodal objective function
from scipy.optimize import differential_evolution
from numpy.random import rand
from numpy import exp
from numpy import sqrt
from numpy import cos
from numpy import e
from numpy import pi

# objective function
def objective(v):
	x, y = v
	return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20

# define range for input
r_min, r_max = -5.0, 5.0
# define the bounds on the search
bounds = [[r_min, r_max], [r_min, r_max]]
# perform the differential evolution search
result = differential_evolution(objective, bounds)
# summarize the result
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
# evaluate solution
solution = result['x']
evaluation = objective(solution)
print('Solution: f(%s) = %.5f' % (solution, evaluation))

Running the example executes the optimization, then reports the results.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the algorithm located the optimum with inputs equal to zero and an objective function evaluation equal to zero.

We can see that a total of 3,063 function evaluations were performed.

Status: Optimization terminated successfully.
Total Evaluations: 3063
Solution: f([0. 0.]) = 0.00000

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Papers

Books

APIs

Articles

Summary

In this tutorial, you discovered the Differential Evolution global optimization algorithm.

Specifically, you learned:

  • Differential Evolution optimization is a type of evolutionary algorithm that is designed to work with real-valued candidate solutions.
  • How to use the Differential Evolution optimization algorithm API in python.
  • Examples of using Differential Evolution to solve global optimization problems with multiple optima.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Differential Evolution Global Optimization With Python appeared first on Machine Learning Mastery.


Simple Genetic Algorithm From Scratch in Python

The genetic algorithm is a stochastic global optimization algorithm.

It may be one of the most popular and widely known biologically inspired algorithms, along with artificial neural networks.

The algorithm is a type of evolutionary algorithm and performs an optimization procedure inspired by the biological theory of evolution by means of natural selection with a binary representation and simple operators based on genetic recombination and genetic mutations.

In this tutorial, you will discover the genetic algorithm optimization algorithm.

After completing this tutorial, you will know:

  • Genetic algorithm is a stochastic optimization algorithm inspired by evolution.
  • How to implement the genetic algorithm from scratch in Python.
  • How to apply the genetic algorithm to a continuous objective function.

Let’s get started.

Simple Genetic Algorithm From Scratch in Python

Simple Genetic Algorithm From Scratch in Python
Photo by Magharebia, some rights reserved.

Tutorial Overview

This tutorial is divided into four parts; they are:

  1. Genetic Algorithm
  2. Genetic Algorithm From Scratch
  3. Genetic Algorithm for OneMax
  4. Genetic Algorithm for Continuous Function Optimization

Genetic Algorithm

The Genetic Algorithm is a stochastic global search optimization algorithm.

It is inspired by the biological theory of evolution by means of natural selection. Specifically, it draws on the new synthesis, which combines an understanding of genetics with the theory of evolution.

Genetic algorithms (algorithm 9.4) borrow inspiration from biological evolution, where fitter individuals are more likely to pass on their genes to the next generation.

— Page 148, Algorithms for Optimization, 2019.

The algorithm uses analogs of a genetic representation (bitstrings), fitness (function evaluations), genetic recombination (crossover of bitstrings), and mutation (flipping bits).

The algorithm works by first creating a population of a fixed size of random bitstrings. The main loop of the algorithm is repeated for a fixed number of iterations or until no further improvement is seen in the best solution over a given number of iterations.

One iteration of the algorithm is like an evolutionary generation.

First, the population of bitstrings (candidate solutions) is evaluated using the objective function. The objective function evaluation for each candidate solution is taken as the fitness of the solution, which may be minimized or maximized.

Then, parents are selected based on their fitness. A given candidate solution may be used as a parent zero or more times. A simple and effective approach to selection involves drawing k candidates from the population randomly and selecting the member of the group with the best fitness. This is called tournament selection, where k is a hyperparameter set to a value such as 3. This simple approach simulates a more costly fitness-proportionate selection scheme.

In tournament selection, each parent is the fittest out of k randomly chosen chromosomes of the population

— Page 151, Algorithms for Optimization, 2019.

Parents are used as the basis for generating the next generation of candidate points, and one parent is required for each position in the population.

Parents are then taken in pairs and used to create two children. Recombination is performed using a crossover operator. This involves selecting a random split point on the bit string, then creating a child with the bits up to the split point from the first parent and from the split point to the end of the string from the second parent. This process is then inverted for the second child.

For example, the two parents:

  • parent1 = 00000
  • parent2 = 11111

May result in two crossover children:

  • child1 = 00011
  • child2 = 11100

This is called one-point crossover, and there are many other variations of the operator.

Crossover is applied probabilistically for each pair of parents, meaning that in some cases, copies of the parents are taken as the children instead of the recombination operator. Crossover is controlled by a hyperparameter set to a large value, such as 80 percent or 90 percent.

Crossover is the Genetic Algorithm’s distinguishing feature. It involves mixing and matching parts of two parents to form children. How you do that mixing and matching depends on the representation of the individuals.

— Page 36, Essentials of Metaheuristics, 2011.

Mutation involves flipping bits in the newly created children candidate solutions. Typically, the mutation rate is set to 1/L, where L is the length of the bitstring.

Each bit in a binary-valued chromosome typically has a small probability of being flipped. For a chromosome with m bits, this mutation rate is typically set to 1/m, yielding an average of one mutation per child chromosome.

— Page 155, Algorithms for Optimization, 2019.

For example, if a problem used a bitstring with 20 bits, then a good default mutation rate would be (1/20) = 0.05 or a probability of 5 percent.

This defines the simple genetic algorithm procedure. It is a large field of study, and there are many extensions to the algorithm.

Now that we are familiar with the simple genetic algorithm procedure, let’s look at how we might implement it from scratch.

Genetic Algorithm From Scratch

In this section, we will develop an implementation of the genetic algorithm.

The first step is to create a population of random bitstrings. We could use boolean values True and False, string values ‘0’ and ‘1’, or integer values 0 and 1. In this case, we will use integer values.

We can generate an array of integer values in a range using the randint() function, and we can specify the range as values starting at 0 and less than 2, e.g. 0 or 1. We will also represent a candidate solution as a list instead of a NumPy array to keep things simple.

An initial population of random bitstrings can be created as follows, where “n_pop” is a hyperparameter that controls the population size and “n_bits” is a hyperparameter that defines the number of bits in a single candidate solution:

...
# initial population of random bitstring
pop = [randint(0, 2, n_bits).tolist() for _ in range(n_pop)]

Next, we can enumerate over a fixed number of algorithm iterations, in this case, controlled by a hyperparameter named “n_iter“.

...
# enumerate generations
	for gen in range(n_iter):
		...

The first step in the algorithm iteration is to evaluate all candidate solutions.

We will use a function named objective() as a generic objective function and call it to get a fitness score, which we will minimize.

...
# evaluate all candidates in the population
scores = [objective(c) for c in pop]

We can then select parents that will be used to create children.

The tournament selection procedure can be implemented as a function that takes the population and returns one selected parent. The k value is fixed at 3 with a default argument, but you can experiment with different values if you like.

# tournament selection
def selection(pop, scores, k=3):
	# first random selection
	selection_ix = randint(len(pop))
	for ix in randint(0, len(pop), k-1):
		# check if better (e.g. perform a tournament)
		if scores[ix] < scores[selection_ix]:
			selection_ix = ix
	return pop[selection_ix]

We can then call this function one time for each position in the population to create a list of parents.

...
# select parents
selected = [selection(pop, scores) for _ in range(n_pop)]

We can then create the next generation.

This first requires a function to perform crossover. This function will take two parents and the crossover rate. The crossover rate is a hyperparameter that determines whether crossover is performed or not, and if not, the parents are copied into the next generation. It is a probability and typically has a large value close to 1.0.

The crossover() function below implements crossover using a draw of a random number in the range [0,1] to determine if crossover is performed, then selecting a valid split point if crossover is to be performed.

# crossover two parents to create two children
def crossover(p1, p2, r_cross):
	# children are copies of parents by default
	c1, c2 = p1.copy(), p2.copy()
	# check for recombination
	if rand() < r_cross:
		# select crossover point that is not on the end of the string
		pt = randint(1, len(p1)-2)
		# perform crossover
		c1 = p1[:pt] + p2[pt:]
		c2 = p2[:pt] + p1[pt:]
	return [c1, c2]

We also need a function to perform mutation.

This procedure simply flips bits with a low probability controlled by the “r_mut” hyperparameter.

# mutation operator
def mutation(bitstring, r_mut):
	for i in range(len(bitstring)):
		# check for a mutation
		if rand() < r_mut:
			# flip the bit
			bitstring[i] = 1 - bitstring[i]

We can then loop over the list of parents and create a list of children to be used as the next generation, calling the crossover and mutation functions as needed.

...
# create the next generation
children = list()
for i in range(0, n_pop, 2):
	# get selected parents in pairs
	p1, p2 = selected[i], selected[i+1]
	# crossover and mutation
	for c in crossover(p1, p2, r_cross):
		# mutation
		mutation(c, r_mut)
		# store for next generation
		children.append(c)

We can tie all of this together into a function named genetic_algorithm() that takes the name of the objective function and the hyperparameters of the search, and returns the best solution found during the search.

# genetic algorithm
def genetic_algorithm(objective, n_bits, n_iter, n_pop, r_cross, r_mut):
	# initial population of random bitstring
	pop = [randint(0, 2, n_bits).tolist() for _ in range(n_pop)]
	# keep track of best solution
	best, best_eval = 0, objective(pop[0])
	# enumerate generations
	for gen in range(n_iter):
		# evaluate all candidates in the population
		scores = [objective(c) for c in pop]
		# check for new best solution
		for i in range(n_pop):
			if scores[i] < best_eval:
				best, best_eval = pop[i], scores[i]
				print(">%d, new best f(%s) = %.3f" % (gen,  pop[i], scores[i]))
		# select parents
		selected = [selection(pop, scores) for _ in range(n_pop)]
		# create the next generation
		children = list()
		for i in range(0, n_pop, 2):
			# get selected parents in pairs
			p1, p2 = selected[i], selected[i+1]
			# crossover and mutation
			for c in crossover(p1, p2, r_cross):
				# mutation
				mutation(c, r_mut)
				# store for next generation
				children.append(c)
		# replace population
		pop = children
	return [best, best_eval]

Now that we have developed an implementation of the genetic algorithm, let’s explore how we might apply it to an objective function.

Genetic Algorithm for OneMax

In this section, we will apply the genetic algorithm to a binary string-based optimization problem.

The problem is called OneMax and evaluates a binary string based on the number of 1s in the string. For example, a bitstring with a length of 20 bits will have a score of 20 for a string of all 1s.

Given we have implemented the genetic algorithm to minimize the objective function, we can add a negative sign to this evaluation so that large positive values become large negative values.

The onemax() function below implements this and takes a bitstring of integer values as input and returns the negative sum of the values.

# objective function
def onemax(x):
	return -sum(x)

Next, we can configure the search.

The search will run for 100 iterations and we will use 20 bits in our candidate solutions, meaning the optimal fitness will be -20.0.

The population size will be 100, and we will use a crossover rate of 90 percent and a mutation rate of 5 percent. This configuration was chosen after a little trial and error.

...
# define the total iterations
n_iter = 100
# bits
n_bits = 20
# define the population size
n_pop = 100
# crossover rate
r_cross = 0.9
# mutation rate
r_mut = 1.0 / float(n_bits)

The search can then be called and the best result reported.

...
# perform the genetic algorithm search
best, score = genetic_algorithm(onemax, n_bits, n_iter, n_pop, r_cross, r_mut)
print('Done!')
print('f(%s) = %f' % (best, score))

Tying this together, the complete example of applying the genetic algorithm to the OneMax objective function is listed below.

# genetic algorithm search of the one max optimization problem
from numpy.random import randint
from numpy.random import rand

# objective function
def onemax(x):
	return -sum(x)

# tournament selection
def selection(pop, scores, k=3):
	# first random selection
	selection_ix = randint(len(pop))
	for ix in randint(0, len(pop), k-1):
		# check if better (e.g. perform a tournament)
		if scores[ix] < scores[selection_ix]:
			selection_ix = ix
	return pop[selection_ix]

# crossover two parents to create two children
def crossover(p1, p2, r_cross):
	# children are copies of parents by default
	c1, c2 = p1.copy(), p2.copy()
	# check for recombination
	if rand() < r_cross:
		# select crossover point that is not on the end of the string
		pt = randint(1, len(p1)-2)
		# perform crossover
		c1 = p1[:pt] + p2[pt:]
		c2 = p2[:pt] + p1[pt:]
	return [c1, c2]

# mutation operator
def mutation(bitstring, r_mut):
	for i in range(len(bitstring)):
		# check for a mutation
		if rand() < r_mut:
			# flip the bit
			bitstring[i] = 1 - bitstring[i]

# genetic algorithm
def genetic_algorithm(objective, n_bits, n_iter, n_pop, r_cross, r_mut):
	# initial population of random bitstring
	pop = [randint(0, 2, n_bits).tolist() for _ in range(n_pop)]
	# keep track of best solution
	best, best_eval = 0, objective(pop[0])
	# enumerate generations
	for gen in range(n_iter):
		# evaluate all candidates in the population
		scores = [objective(c) for c in pop]
		# check for new best solution
		for i in range(n_pop):
			if scores[i] < best_eval:
				best, best_eval = pop[i], scores[i]
				print(">%d, new best f(%s) = %.3f" % (gen,  pop[i], scores[i]))
		# select parents
		selected = [selection(pop, scores) for _ in range(n_pop)]
		# create the next generation
		children = list()
		for i in range(0, n_pop, 2):
			# get selected parents in pairs
			p1, p2 = selected[i], selected[i+1]
			# crossover and mutation
			for c in crossover(p1, p2, r_cross):
				# mutation
				mutation(c, r_mut)
				# store for next generation
				children.append(c)
		# replace population
		pop = children
	return [best, best_eval]

# define the total iterations
n_iter = 100
# bits
n_bits = 20
# define the population size
n_pop = 100
# crossover rate
r_cross = 0.9
# mutation rate
r_mut = 1.0 / float(n_bits)
# perform the genetic algorithm search
best, score = genetic_algorithm(onemax, n_bits, n_iter, n_pop, r_cross, r_mut)
print('Done!')
print('f(%s) = %f' % (best, score))

Running the example will report the best result as it is found along the way, then the final best solution at the end of the search, which we would expect to be the optimal solution.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the search found the optimal solution after about eight generations.

>0, new best f([1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1]) = -14.000
>0, new best f([1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0]) = -15.000
>1, new best f([1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1]) = -16.000
>2, new best f([0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]) = -17.000
>2, new best f([1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]) = -19.000
>8, new best f([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]) = -20.000
Done!
f([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]) = -20.000000

Genetic Algorithm for Continuous Function Optimization

Optimizing the OneMax function is not very interesting; we are more likely to want to optimize a continuous function.

For example, we can define the x^2 minimization function that takes two input variables and has an optimum at f(0, 0) = 0.0.

# objective function
def objective(x):
	return x[0]**2.0 + x[1]**2.0

We can minimize this function with a genetic algorithm.

First, we must define the bounds of each input variable.

...
# define range for input
bounds = [[-5.0, 5.0], [-5.0, 5.0]]

We will take the “n_bits” hyperparameter as a number of bits per input variable to the objective function and set it to 16 bits.

...
# bits per variable
n_bits = 16

This means our actual bit string will have (16 * 2) = 32 bits, given the two input variables.

We must update our mutation rate accordingly.

...
# mutation rate
r_mut = 1.0 / (float(n_bits) * len(bounds))

Next, we need to ensure that the initial population is created from random bitstrings that are large enough.

...
# initial population of random bitstring
pop = [randint(0, 2, n_bits*len(bounds)).tolist() for _ in range(n_pop)]

Finally, we need to decode the bitstrings to numbers prior to evaluating each with the objective function.

We can achieve this by first decoding each substring to an integer, then scaling the integer to the desired range. This will give a vector of values in the range that can then be provided to the objective function for evaluation.

The decode() function below implements this, taking the bounds of the function, the number of bits per variable, and a bitstring as input and returns a list of decoded real values.

# decode bitstring to numbers
def decode(bounds, n_bits, bitstring):
	decoded = list()
	largest = 2**n_bits
	for i in range(len(bounds)):
		# extract the substring
		start, end = i * n_bits, (i * n_bits)+n_bits
		substring = bitstring[start:end]
		# convert bitstring to a string of chars
		chars = ''.join([str(s) for s in substring])
		# convert string to integer
		integer = int(chars, 2)
		# scale integer to desired range
		value = bounds[i][0] + (integer/largest) * (bounds[i][1] - bounds[i][0])
		# store
		decoded.append(value)
	return decoded

We can then call this at the beginning of the algorithm loop to decode the population, then evaluate the decoded version of the population.

...
# decode population
decoded = [decode(bounds, n_bits, p) for p in pop]
# evaluate all candidates in the population
scores = [objective(d) for d in decoded]

Tying this together, the complete example of the genetic algorithm for continuous function optimization is listed below.

# genetic algorithm search for continuous function optimization
from numpy.random import randint
from numpy.random import rand

# objective function
def objective(x):
	return x[0]**2.0 + x[1]**2.0

# decode bitstring to numbers
def decode(bounds, n_bits, bitstring):
	decoded = list()
	largest = 2**n_bits
	for i in range(len(bounds)):
		# extract the substring
		start, end = i * n_bits, (i * n_bits)+n_bits
		substring = bitstring[start:end]
		# convert bitstring to a string of chars
		chars = ''.join([str(s) for s in substring])
		# convert string to integer
		integer = int(chars, 2)
		# scale integer to desired range
		value = bounds[i][0] + (integer/largest) * (bounds[i][1] - bounds[i][0])
		# store
		decoded.append(value)
	return decoded

# tournament selection
def selection(pop, scores, k=3):
	# first random selection
	selection_ix = randint(len(pop))
	for ix in randint(0, len(pop), k-1):
		# check if better (e.g. perform a tournament)
		if scores[ix] < scores[selection_ix]:
			selection_ix = ix
	return pop[selection_ix]

# crossover two parents to create two children
def crossover(p1, p2, r_cross):
	# children are copies of parents by default
	c1, c2 = p1.copy(), p2.copy()
	# check for recombination
	if rand() < r_cross:
		# select crossover point that is not on the end of the string
		pt = randint(1, len(p1)-2)
		# perform crossover
		c1 = p1[:pt] + p2[pt:]
		c2 = p2[:pt] + p1[pt:]
	return [c1, c2]

# mutation operator
def mutation(bitstring, r_mut):
	for i in range(len(bitstring)):
		# check for a mutation
		if rand() < r_mut:
			# flip the bit
			bitstring[i] = 1 - bitstring[i]

# genetic algorithm
def genetic_algorithm(objective, bounds, n_bits, n_iter, n_pop, r_cross, r_mut):
	# initial population of random bitstring
	pop = [randint(0, 2, n_bits*len(bounds)).tolist() for _ in range(n_pop)]
	# keep track of best solution
	best, best_eval = 0, objective(decode(bounds, n_bits, pop[0]))
	# enumerate generations
	for gen in range(n_iter):
		# decode population
		decoded = [decode(bounds, n_bits, p) for p in pop]
		# evaluate all candidates in the population
		scores = [objective(d) for d in decoded]
		# check for new best solution
		for i in range(n_pop):
			if scores[i] < best_eval:
				best, best_eval = pop[i], scores[i]
				print(">%d, new best f(%s) = %f" % (gen,  decoded[i], scores[i]))
		# select parents
		selected = [selection(pop, scores) for _ in range(n_pop)]
		# create the next generation
		children = list()
		for i in range(0, n_pop, 2):
			# get selected parents in pairs
			p1, p2 = selected[i], selected[i+1]
			# crossover and mutation
			for c in crossover(p1, p2, r_cross):
				# mutation
				mutation(c, r_mut)
				# store for next generation
				children.append(c)
		# replace population
		pop = children
	return [best, best_eval]

# define range for input
bounds = [[-5.0, 5.0], [-5.0, 5.0]]
# define the total iterations
n_iter = 100
# bits per variable
n_bits = 16
# define the population size
n_pop = 100
# crossover rate
r_cross = 0.9
# mutation rate
r_mut = 1.0 / (float(n_bits) * len(bounds))
# perform the genetic algorithm search
best, score = genetic_algorithm(objective, bounds, n_bits, n_iter, n_pop, r_cross, r_mut)
print('Done!')
decoded = decode(bounds, n_bits, best)
print('f(%s) = %f' % (decoded, score))

Running the example reports the best decoded results along the way and the best decoded solution at the end of the run.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the algorithm discovers an input very close to f(0.0, 0.0) = 0.0.

>0, new best f([-0.785064697265625, -0.807647705078125]) = 1.268621
>0, new best f([0.385894775390625, 0.342864990234375]) = 0.266471
>1, new best f([-0.342559814453125, -0.1068115234375]) = 0.128756
>2, new best f([-0.038909912109375, 0.30242919921875]) = 0.092977
>2, new best f([0.145721435546875, 0.1849365234375]) = 0.055436
>3, new best f([0.14404296875, -0.029754638671875]) = 0.021634
>5, new best f([0.066680908203125, 0.096435546875]) = 0.013746
>5, new best f([-0.036468505859375, -0.10711669921875]) = 0.012804
>6, new best f([-0.038909912109375, -0.099639892578125]) = 0.011442
>7, new best f([-0.033111572265625, 0.09674072265625]) = 0.010455
>7, new best f([-0.036468505859375, 0.05584716796875]) = 0.004449
>10, new best f([0.058746337890625, 0.008087158203125]) = 0.003517
>10, new best f([-0.031585693359375, 0.008087158203125]) = 0.001063
>12, new best f([0.022125244140625, 0.008087158203125]) = 0.000555
>13, new best f([0.022125244140625, 0.00701904296875]) = 0.000539
>13, new best f([-0.013885498046875, 0.008087158203125]) = 0.000258
>16, new best f([-0.011444091796875, 0.00518798828125]) = 0.000158
>17, new best f([-0.0115966796875, 0.00091552734375]) = 0.000135
>17, new best f([-0.004730224609375, 0.00335693359375]) = 0.000034
>20, new best f([-0.004425048828125, 0.00274658203125]) = 0.000027
>21, new best f([-0.002288818359375, 0.00091552734375]) = 0.000006
>22, new best f([-0.001983642578125, 0.00091552734375]) = 0.000005
>22, new best f([-0.001983642578125, 0.0006103515625]) = 0.000004
>24, new best f([-0.001373291015625, 0.001068115234375]) = 0.000003
>25, new best f([-0.001373291015625, 0.00091552734375]) = 0.000003
>26, new best f([-0.001373291015625, 0.0006103515625]) = 0.000002
>27, new best f([-0.001068115234375, 0.0006103515625]) = 0.000002
>29, new best f([-0.000152587890625, 0.00091552734375]) = 0.000001
>33, new best f([-0.0006103515625, 0.0]) = 0.000000
>34, new best f([-0.000152587890625, 0.00030517578125]) = 0.000000
>43, new best f([-0.00030517578125, 0.0]) = 0.000000
>60, new best f([-0.000152587890625, 0.000152587890625]) = 0.000000
>65, new best f([-0.000152587890625, 0.0]) = 0.000000
Done!
f([-0.000152587890625, 0.0]) = 0.000000

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

API

Articles

Summary

In this tutorial, you discovered the genetic algorithm optimization.

Specifically, you learned:

  • Genetic algorithm is a stochastic optimization algorithm inspired by evolution.
  • How to implement the genetic algorithm from scratch in Python.
  • How to apply the genetic algorithm to a continuous objective function.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Simple Genetic Algorithm From Scratch in Python appeared first on Machine Learning Mastery.

How to Update Neural Network Models With More Data

Deep learning neural network models used for predictive modeling may need to be updated.

This may be because the data has changed since the model was developed and deployed, or it may be the case that additional labeled data has been made available since the model was developed and it is expected that the additional data will improve the performance of the model.

It is important to experiment and evaluate with a range of different approaches when updating neural network models for new data, especially if model updating will be automated, such as on a periodic schedule.

There are many ways to update neural network models, although the two main approaches involve either using the existing model as a starting point and retraining it, or leaving the existing model unchanged and combining the predictions from the existing model with a new model.

In this tutorial, you will discover how to update deep learning neural network models in response to new data.

After completing this tutorial, you will know:

  • Neural network models may need to be updated when the underlying data changes or when new labeled data is made available.
  • How to update trained neural network models with just new data or combinations of old and new data.
  • How to create an ensemble of existing and new models trained on just new data or combinations of old and new data.

Let’s get started.

How to Update Neural Network Models With More Data

How to Update Neural Network Models With More Data
Photo by Judy Gallagher, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Updating Neural Network Models
  2. Retraining Update Strategies
    1. Update Model on New Data Only
    2. Update Model on Old and New Data
  3. Ensemble Update Strategies
    1. Ensemble Model With Model on New Data Only
    2. Ensemble Model With Model on Old and New Data

Updating Neural Network Models

Selecting and finalizing a deep learning neural network model for a predictive modeling project is just the beginning.

You can then start using the model to make predictions on new data.

One possible problem that you may encounter is that the nature of the prediction problem may change over time.

You may notice this when the effectiveness of predictions begins to decline over time. This may be because the assumptions made and captured in the model are changing or no longer hold.

Generally, this is referred to as the problem of “concept drift” where the underlying probability distributions of variables and relationships between variables change over time, which can negatively impact the model built from the data.

For more on concept drift, see the tutorial:

Concept drift may affect your model at different times and depends specifically on the prediction problem you are solving and the model chosen to address it.

It can be helpful to monitor the performance of a model over time and use a clear drop in model performance as a trigger to make a change to your model, such as re-training it on new data.

Alternately, you may know that data in your domain changes frequently enough that a change to the model is required periodically, such as weekly, monthly, or annually.

Finally, you may operate your model for a while and accumulate additional data with known outcomes that you wish to use to update your model, with the hopes of improving predictive performance.

Importantly, you have a lot of flexibility when it comes to responding to a change to the problem or the availability of new data.

For example, you can take the trained neural network model and update the model weights using the new data. Or you might want to leave the existing model untouched and combine its predictions with a new model fit on the newly available data.

These approaches might represent two general themes in updating neural network models in response to new data; they are:

  • Retrain Update Strategies.
  • Ensemble Update Strategies.

Let’s take a closer look at each in turn.

Retraining Update Strategies

A benefit of neural network models is that their weights can be updated at any time with continued training.

When responding to changes in the underlying data or the availability of new data, there are a few different strategies to choose from when updating a neural network model, such as:

  • Continue training the model on the new data only.
  • Continue training the model on the old and new data.

We might also imagine variations on the above strategies, such as using a sample of the new data or a sample of new and old data instead of all available data, as well as possible instance-based weightings on sampled data.
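
For example, a hedged sketch of the instance-weighting variation is given below; it assumes a model and an old/new data split prepared as in the worked examples that follow, and the 1.0 and 2.0 weights are illustrative assumptions (sample_weight is a standard Keras fit() argument).

...
# retrain on combined data, weighting new examples more heavily
# (the 1.0 and 2.0 weights are illustrative assumptions)
from numpy import full
from numpy import hstack
from numpy import vstack
X_both, y_both = vstack((X_old, X_new)), hstack((y_old, y_new))
weights = hstack((full(len(X_old), 1.0), full(len(X_new), 2.0)))
model.fit(X_both, y_both, sample_weight=weights, epochs=100, batch_size=32, verbose=0)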

We might also consider extensions of the model that freeze the layers of the existing model (e.g. so model weights cannot change during training), then add new layers with model weights that can change, grafting on extensions to the model to handle any change in the data. Perhaps this is a variation of both the retraining approach and the ensemble approach in the next section; a minimal sketch is given below.
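
The sketch below illustrates this freeze-and-extend idea, assuming a Keras Sequential model fit as in the examples that follow; the new layer sizes are illustrative assumptions, not a recommended configuration.

...
# a sketch of the freeze-and-extend variation (illustrative assumptions)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# freeze the existing feature layers so their weights cannot change
for layer in model.layers[:-1]:
	layer.trainable = False
# reuse the frozen layers and graft on new trainable layers
extended = Sequential(model.layers[:-1])
extended.add(Dense(10, activation='relu'))
extended.add(Dense(1, activation='sigmoid'))
extended.compile(optimizer='sgd', loss='binary_crossentropy')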

Nevertheless, these are the two main strategies to consider.

Let’s make these approaches concrete with a worked example.

Update Model on New Data Only

We can update the model on the new data only.

One extreme version of this approach is to not use any new data and simply re-train the model on the old data. This might be the same as “do nothing” in response to the new data. At the other extreme, a model could be fit on the new data only, discarding the old data and old model.

  • Ignore new data, do nothing.
  • Update existing model on new data.
  • Fit new model on new data, discard old model and data.

We will focus on the middle ground in this example, but it might be interesting to test all three approaches on your problem and see what works best.

First, we can define a synthetic binary classification dataset and split it into half, then use one portion as “old data” and another portion as “new data.”

...
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# record the number of input features in the data
n_features = X.shape[1]
# split into old and new data
X_old, X_new, y_old, y_new = train_test_split(X, y, test_size=0.50, random_state=1)

We can then define a Multilayer Perceptron model (MLP) and fit it on the old data only.

...
# define the model
model = Sequential()
model.add(Dense(20, kernel_initializer='he_normal', activation='relu', input_dim=n_features))
model.add(Dense(10, kernel_initializer='he_normal', activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# define the optimization algorithm
opt = SGD(learning_rate=0.01, momentum=0.9)
# compile the model
model.compile(optimizer=opt, loss='binary_crossentropy')
# fit the model on old data
model.fit(X_old, y_old, epochs=150, batch_size=32, verbose=0)

We can then imagine saving the model and using it for some time.

Time passes, and we wish to update it on new data that has become available.

This would involve using a much smaller learning rate than normal so that we do not wash away the weights learned on the old data.

Note: you will need to discover a learning rate that is appropriate for your model and dataset that achieves better performance than simply fitting a new model from scratch.

...
# update model on new data only with a smaller learning rate
opt = SGD(learning_rate=0.001, momentum=0.9)
# compile the model
model.compile(optimizer=opt, loss='binary_crossentropy')

We can then fit the model on the new data only with this smaller learning rate.

...
model.compile(optimizer=opt, loss='binary_crossentropy')
# fit the model on new data
model.fit(X_new, y_new, epochs=100, batch_size=32, verbose=0)

Tying this together, the complete example of updating a neural network model on new data only is listed below.

# update neural network with new data only
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# record the number of input features in the data
n_features = X.shape[1]
# split into old and new data
X_old, X_new, y_old, y_new = train_test_split(X, y, test_size=0.50, random_state=1)
# define the model
model = Sequential()
model.add(Dense(20, kernel_initializer='he_normal', activation='relu', input_dim=n_features))
model.add(Dense(10, kernel_initializer='he_normal', activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# define the optimization algorithm
opt = SGD(learning_rate=0.01, momentum=0.9)
# compile the model
model.compile(optimizer=opt, loss='binary_crossentropy')
# fit the model on old data
model.fit(X_old, y_old, epochs=150, batch_size=32, verbose=0)

# save model...

# load model...

# update model on new data only with a smaller learning rate
opt = SGD(learning_rate=0.001, momentum=0.9)
# compile the model
model.compile(optimizer=opt, loss='binary_crossentropy')
# fit the model on new data
model.fit(X_new, y_new, epochs=100, batch_size=32, verbose=0)

Next, let’s look at updating the model on new and old data.

Update Model on Old and New Data

We can update the model on a combination of both old and new data.

An extreme version of this approach is to discard the model and simply fit a new model on all available data, new and old. A less extreme version would be to use the existing model as a starting point and update it based on the combined dataset.

Again, it is a good idea to test both strategies and see what works well for your dataset.

We will focus on the less extreme update strategy in this case.

The synthetic dataset and model can be fit on the old dataset as before.

...
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# record the number of input features in the data
n_features = X.shape[1]
# split into old and new data
X_old, X_new, y_old, y_new = train_test_split(X, y, test_size=0.50, random_state=1)
# define the model
model = Sequential()
model.add(Dense(20, kernel_initializer='he_normal', activation='relu', input_dim=n_features))
model.add(Dense(10, kernel_initializer='he_normal', activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# define the optimization algorithm
opt = SGD(learning_rate=0.01, momentum=0.9)
# compile the model
model.compile(optimizer=opt, loss='binary_crossentropy')
# fit the model on old data
model.fit(X_old, y_old, epochs=150, batch_size=32, verbose=0)

New data becomes available, and we wish to update the model on a combination of both old and new data.

First, we must use a much smaller learning rate in an attempt to use the current weights as a starting point for the search.

Note: you will need to discover a learning rate that is appropriate for your model and dataset that achieves better performance than simply fitting a new model from scratch.

...
# update model with a smaller learning rate
opt = SGD(learning_rate=0.001, momentum=0.9)
# compile the model
model.compile(optimizer=opt, loss='binary_crossentropy')

We can then create a composite dataset composed of old and new data.

...
# create a composite dataset of old and new data
X_both, y_both = vstack((X_old, X_new)), hstack((y_old, y_new))

Finally, we can update the model on this composite dataset.

...
# fit the model on the combined old and new data
model.fit(X_both, y_both, epochs=100, batch_size=32, verbose=0)

Tying this together, the complete example of updating a neural network model on both old and new data is listed below.

# update neural network with both old and new data
from numpy import vstack
from numpy import hstack
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# record the number of input features in the data
n_features = X.shape[1]
# split into old and new data
X_old, X_new, y_old, y_new = train_test_split(X, y, test_size=0.50, random_state=1)
# define the model
model = Sequential()
model.add(Dense(20, kernel_initializer='he_normal', activation='relu', input_dim=n_features))
model.add(Dense(10, kernel_initializer='he_normal', activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# define the optimization algorithm
opt = SGD(learning_rate=0.01, momentum=0.9)
# compile the model
model.compile(optimizer=opt, loss='binary_crossentropy')
# fit the model on old data
model.fit(X_old, y_old, epochs=150, batch_size=32, verbose=0)

# save model...

# load model...

# update model with a smaller learning rate
opt = SGD(learning_rate=0.001, momentum=0.9)
# compile the model
model.compile(optimizer=opt, loss='binary_crossentropy')
# create a composite dataset of old and new data
X_both, y_both = vstack((X_old, X_new)), hstack((y_old, y_new))
# fit the model on the composite dataset
model.fit(X_both, y_both, epochs=100, batch_size=32, verbose=0)

Next, let’s look at how to use ensemble models to respond to new data.

Ensemble Update Strategies

An ensemble is a predictive model that is composed of multiple other models.

There are many different types of ensemble models, although perhaps the simplest approach is to average the predictions from multiple different models.

For more on ensemble algorithms for deep learning neural networks, see the tutorial:

We can use an ensemble model as a strategy when responding to changes in the underlying data or availability of new data.

Mirroring the approaches in the previous section, we might consider two approaches to ensemble learning algorithms as strategies for responding to new data; they are:

  • Ensemble of existing model and new model fit on new data only.
  • Ensemble of existing model and new model fit on old and new data.

Again, we might consider variations on these approaches, such as samples of old and new data, and more than one existing or additional models included in the ensemble.
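One such variation is to weight the contribution of each model to the average, e.g. trusting the newer model slightly more. A minimal sketch, assuming per-model prediction arrays yhat1 and yhat2 as computed in the worked examples below (the 0.4/0.6 weights are purely illustrative):

...
# hedged sketch: weighted average of per-model predictions, favoring the newer model
yhat = 0.4 * yhat1 + 0.6 * yhat2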

Nevertheless, these are the two main strategies to consider.

Let’s make these approaches concrete with a worked example.

Ensemble Model With Model on New Data Only

We can create an ensemble of the existing model and a new model fit on only the new data.

The expectation is that the ensemble predictions perform better or are more stable (lower variance) than using either the old model or the new model alone. This should be checked on your dataset before adopting the ensemble.

First, we can prepare the dataset and fit the old model, as we did in the previous sections.

...
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# record the number of input features in the data
n_features = X.shape[1]
# split into old and new data
X_old, X_new, y_old, y_new = train_test_split(X, y, test_size=0.50, random_state=1)
# define the old model
old_model = Sequential()
old_model.add(Dense(20, kernel_initializer='he_normal', activation='relu', input_dim=n_features))
old_model.add(Dense(10, kernel_initializer='he_normal', activation='relu'))
old_model.add(Dense(1, activation='sigmoid'))
# define the optimization algorithm
opt = SGD(learning_rate=0.01, momentum=0.9)
# compile the model
old_model.compile(optimizer=opt, loss='binary_crossentropy')
# fit the model on old data
old_model.fit(X_old, y_old, epochs=150, batch_size=32, verbose=0)

Some time passes and new data becomes available.

We can then fit a new model on the new data, naturally discovering a model and configuration that works well or best on the new dataset only.

In this case, we’ll simply use the same model architecture and configuration as the old model.

...
# define the new model
new_model = Sequential()
new_model.add(Dense(20, kernel_initializer='he_normal', activation='relu', input_dim=n_features))
new_model.add(Dense(10, kernel_initializer='he_normal', activation='relu'))
new_model.add(Dense(1, activation='sigmoid'))
# define the optimization algorithm
opt = SGD(learning_rate=0.01, momentum=0.9)
# compile the model
new_model.compile(optimizer=opt, loss='binary_crossentropy')

We can then fit this new model on the new data only.

...
# fit the new model on the new data only
new_model.fit(X_new, y_new, epochs=150, batch_size=32, verbose=0)

Now that we have the two models, we can make predictions with each model, and calculate the average of the predictions as the “ensemble prediction.”

...
# make predictions with both models
yhat1 = old_model.predict(X_new)
yhat2 = new_model.predict(X_new)
# combine predictions into single array
combined = hstack((yhat1, yhat2))
# calculate outcome as mean of predictions
yhat = mean(combined, axis=-1)

Tying this together, the complete example of updating using an ensemble of the existing model and a new model fit on new data only is listed below.

# ensemble old neural network with new model fit on new data only
from numpy import hstack
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# record the number of input features in the data
n_features = X.shape[1]
# split into old and new data
X_old, X_new, y_old, y_new = train_test_split(X, y, test_size=0.50, random_state=1)
# define the old model
old_model = Sequential()
old_model.add(Dense(20, kernel_initializer='he_normal', activation='relu', input_dim=n_features))
old_model.add(Dense(10, kernel_initializer='he_normal', activation='relu'))
old_model.add(Dense(1, activation='sigmoid'))
# define the optimization algorithm
opt = SGD(learning_rate=0.01, momentum=0.9)
# compile the model
old_model.compile(optimizer=opt, loss='binary_crossentropy')
# fit the model on old data
old_model.fit(X_old, y_old, epochs=150, batch_size=32, verbose=0)

# save model...

# load model...

# define the new model
new_model = Sequential()
new_model.add(Dense(20, kernel_initializer='he_normal', activation='relu', input_dim=n_features))
new_model.add(Dense(10, kernel_initializer='he_normal', activation='relu'))
new_model.add(Dense(1, activation='sigmoid'))
# define the optimization algorithm
opt = SGD(learning_rate=0.01, momentum=0.9)
# compile the model
new_model.compile(optimizer=opt, loss='binary_crossentropy')
# fit the new model on the new data only
new_model.fit(X_new, y_new, epochs=150, batch_size=32, verbose=0)

# make predictions with both models
yhat1 = old_model.predict(X_new)
yhat2 = new_model.predict(X_new)
# combine predictions into single array
combined = hstack((yhat1, yhat2))
# calculate outcome as mean of predictions
yhat = mean(combined, axis=-1)
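The averaged predictions are probabilities; crisp class labels can be recovered by thresholding, and the value of the ensemble should be checked against each model alone. The sketch below assumes a reserved held-out evaluation set (X_eval, y_eval), which is not part of the example above:

...
# hedged sketch: compare old, new, and ensemble on a held-out set (X_eval, y_eval assumed)
from sklearn.metrics import accuracy_score
probs1 = old_model.predict(X_eval)
probs2 = new_model.predict(X_eval)
# threshold the probabilities at 0.5 to recover crisp class labels
labels1 = (probs1 >= 0.5).astype('int32').flatten()
labels2 = (probs2 >= 0.5).astype('int32').flatten()
labels_ens = (mean(hstack((probs1, probs2)), axis=-1) >= 0.5).astype('int32')
print('Old model: %.3f' % accuracy_score(y_eval, labels1))
print('New model: %.3f' % accuracy_score(y_eval, labels2))
print('Ensemble: %.3f' % accuracy_score(y_eval, labels_ens))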

Ensemble Model With Model on Old and New Data

We can create an ensemble of the existing model and a new model fit on both the old and the new data.

The expectation is that the ensemble predictions perform better or are more stable (lower variance) than using either the old model or the new model alone. This should be checked on your dataset before adopting the ensemble.

First, we can prepare the dataset and fit the old model, as we did in the previous sections.

...
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# record the number of input features in the data
n_features = X.shape[1]
# split into old and new data
X_old, X_new, y_old, y_new = train_test_split(X, y, test_size=0.50, random_state=1)
# define the old model
old_model = Sequential()
old_model.add(Dense(20, kernel_initializer='he_normal', activation='relu', input_dim=n_features))
old_model.add(Dense(10, kernel_initializer='he_normal', activation='relu'))
old_model.add(Dense(1, activation='sigmoid'))
# define the optimization algorithm
opt = SGD(learning_rate=0.01, momentum=0.9)
# compile the model
old_model.compile(optimizer=opt, loss='binary_crossentropy')
# fit the model on old data
old_model.fit(X_old, y_old, epochs=150, batch_size=32, verbose=0)

Some time passes and new data becomes available.

We can then fit a new model on a composite of the old and new data, naturally discovering a model and configuration that works well or best on the combined dataset.

In this case, we’ll simply use the same model architecture and configuration as the old model.

...
# define the new model
new_model = Sequential()
new_model.add(Dense(20, kernel_initializer='he_normal', activation='relu', input_dim=n_features))
new_model.add(Dense(10, kernel_initializer='he_normal', activation='relu'))
new_model.add(Dense(1, activation='sigmoid'))
# define the optimization algorithm
opt = SGD(learning_rate=0.01, momentum=0.9)
# compile the model
new_model.compile(optimizer=opt, loss='binary_crossentropy')

We can create a composite dataset from the old and new data, then fit the new model on this dataset.

...
# create a composite dataset of old and new data
X_both, y_both = vstack((X_old, X_new)), hstack((y_old, y_new))
# fit the new model on the composite dataset
new_model.fit(X_both, y_both, epochs=150, batch_size=32, verbose=0)

Finally, we can use both models together to make ensemble predictions.

...
# make predictions with both models
yhat1 = old_model.predict(X_new)
yhat2 = new_model.predict(X_new)
# combine predictions into single array
combined = hstack((yhat1, yhat2))
# calculate outcome as mean of predictions
yhat = mean(combined, axis=-1)

Tying this together, the complete example of updating using an ensemble of the existing model and a new model fit on the old and new data is listed below.

# ensemble old neural network with new model fit on old and new data
from numpy import hstack
from numpy import vstack
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1)
# record the number of input features in the data
n_features = X.shape[1]
# split into old and new data
X_old, X_new, y_old, y_new = train_test_split(X, y, test_size=0.50, random_state=1)
# define the old model
old_model = Sequential()
old_model.add(Dense(20, kernel_initializer='he_normal', activation='relu', input_dim=n_features))
old_model.add(Dense(10, kernel_initializer='he_normal', activation='relu'))
old_model.add(Dense(1, activation='sigmoid'))
# define the optimization algorithm
opt = SGD(learning_rate=0.01, momentum=0.9)
# compile the model
old_model.compile(optimizer=opt, loss='binary_crossentropy')
# fit the model on old data
old_model.fit(X_old, y_old, epochs=150, batch_size=32, verbose=0)

# save model...

# load model...

# define the new model
new_model = Sequential()
new_model.add(Dense(20, kernel_initializer='he_normal', activation='relu', input_dim=n_features))
new_model.add(Dense(10, kernel_initializer='he_normal', activation='relu'))
new_model.add(Dense(1, activation='sigmoid'))
# define the optimization algorithm
opt = SGD(learning_rate=0.01, momentum=0.9)
# compile the model
new_model.compile(optimizer=opt, loss='binary_crossentropy')
# create a composite dataset of old and new data
X_both, y_both = vstack((X_old, X_new)), hstack((y_old, y_new))
# fit the new model on the composite dataset
new_model.fit(X_both, y_both, epochs=150, batch_size=32, verbose=0)

# make predictions with both models
yhat1 = old_model.predict(X_new)
yhat2 = new_model.predict(X_new)
# combine predictions into single array
combined = hstack((yhat1, yhat2))
# calculate outcome as mean of predictions
yhat = mean(combined, axis=-1)

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

Summary

In this tutorial, you discovered how to update deep learning neural network models in response to new data.

Specifically, you learned:

  • Neural network models may need to be updated when the underlying data changes or when new labeled data is made available.
  • How to update trained neural network models with just new data or combinations of old and new data.
  • How to create an ensemble of existing and new models trained on just new data or combinations of old and new data.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post How to Update Neural Network Models With More Data appeared first on Machine Learning Mastery.

Random Search and Grid Search for Function Optimization


Function optimization requires the selection of an algorithm to efficiently sample the search space and locate a good or best solution.

There are many algorithms to choose from, although it is important to establish a baseline for what types of solutions are feasible or possible for a problem. This can be achieved using a naive optimization algorithm, such as a random search or a grid search.

The results achieved by a naive optimization algorithm are computationally efficient to generate and provide a point of comparison for more sophisticated optimization algorithms. Sometimes, naive algorithms are found to achieve the best performance, particularly on those problems that are noisy or non-smooth and those problems where domain expertise typically biases the choice of optimization algorithm.

In this tutorial, you will discover naive algorithms for function optimization.

After completing this tutorial, you will know:

  • The role of naive algorithms in function optimization projects.
  • How to generate and evaluate a random search for function optimization.
  • How to generate and evaluate a grid search for function optimization.

Let’s get started.

Random Search and Grid Search for Function Optimization

Random Search and Grid Search for Function Optimization
Photo by Kosala Bandara, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Naive Function Optimization Algorithms
  2. Random Search for Function Optimization
  3. Grid Search for Function Optimization

Naive Function Optimization Algorithms

There are many different algorithms you can use for optimization, but how do you know whether the results you get are any good?

One approach to solving this problem is to establish a baseline in performance using a naive optimization algorithm.

A naive optimization algorithm is an algorithm that assumes nothing about the objective function that is being optimized.

It can be applied with very little effort and the best result achieved by the algorithm can be used as a point of reference to compare more sophisticated algorithms. If a more sophisticated algorithm cannot achieve a better result than a naive algorithm on average, then it does not have skill on your problem and should be abandoned.

There are two naive algorithms that can be used for function optimization; they are:

  • Random Search
  • Grid Search

These algorithms are referred to as “search” algorithms because, at base, optimization can be framed as a search problem, e.g. finding the inputs that minimize or maximize the output of the objective function.

There is another algorithm that can be used called “exhaustive search” that enumerates all possible inputs. This is rarely used in practice, as enumerating all possible inputs is not feasible, e.g. it would require too much time to run.

Nevertheless, if you find yourself working on an optimization problem for which all inputs can be enumerated and evaluated in reasonable time, this should be your default strategy.
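A minimal sketch of an exhaustive search over a small, discrete domain is shown below; the candidate values and the objective function are illustrative only:

# hedged sketch: exhaustive search over a small discrete domain
from itertools import product

# objective function (illustrative)
def objective(x, y):
	return x**2.0 + y**2.0

# enumerate and evaluate every combination of candidate values
candidates = [-2, -1, 0, 1, 2]
best, best_eval = None, float('inf')
for x, y in product(candidates, candidates):
	result = objective(x, y)
	if result < best_eval:
		best, best_eval = (x, y), result
# summarize best solution
print('Best: f%s = %.5f' % (best, best_eval))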

Let’s take a closer look at each in turn.

Random Search for Function Optimization

Random search is also referred to as random optimization or random sampling.

Random search involves generating and evaluating random inputs to the objective function. It’s effective because it does not assume anything about the structure of the objective function. This can be beneficial for problems where there is a lot of domain expertise that may influence or bias the optimization strategy, allowing non-intuitive solutions to be discovered.

… random sampling, which simply draws m random samples over the design space using a pseudorandom number generator. To generate a random sample x, we can sample each variable independently from a distribution.

— Page 236, Algorithms for Optimization, 2019.

Random search may also be the best strategy for highly complex problems with noisy or non-smooth (discontinuous) areas of the search space that can cause problems for algorithms that depend on reliable gradients.

We can generate a random sample from a domain using a pseudorandom number generator. Each variable requires a well-defined bound or range and a uniformly random value can be sampled from the range, then evaluated.

Generating random samples is computationally trivial and does not take up much memory; therefore, it may be efficient to generate a large sample of inputs, then evaluate them. Each sample is independent, so samples can be evaluated in parallel if needed to accelerate the process.
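For an objective function that is expensive to evaluate, the independent samples can be spread across processes; the sketch below uses the standard library and is for illustration only (the toy objective here is far too cheap to benefit):

# hedged sketch: evaluate random samples in parallel across processes
from concurrent.futures import ProcessPoolExecutor
from numpy.random import rand

# stand-in for an expensive objective function
def objective(x):
	return x**2.0

# guard required for process-based parallelism on some platforms
if __name__ == '__main__':
	# generate a random sample from the domain [-5, 5]
	r_min, r_max = -5.0, 5.0
	sample = r_min + rand(100) * (r_max - r_min)
	# evaluate the samples in parallel, preserving order
	with ProcessPoolExecutor() as executor:
		sample_eval = list(executor.map(objective, sample))
	print('Best: %.5f' % min(sample_eval))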

The example below gives an example of a simple one-dimensional minimization objective function and generates then evaluates a random sample of 100 inputs. The input with the best performance is then reported.

# example of random search for function optimization
from numpy.random import rand

# objective function
def objective(x):
	return x**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# generate a random sample from the domain
sample = r_min + rand(100) * (r_max - r_min)
# evaluate the sample
sample_eval = objective(sample)
# locate the best solution
best_ix = 0
for i in range(len(sample)):
	if sample_eval[i] < sample_eval[best_ix]:
		best_ix = i
# summarize best solution
print('Best: f(%.5f) = %.5f' % (sample[best_ix], sample_eval[best_ix]))

Running the example generates a random sample of input values, which are then evaluated. The best performing point is then identified and reported.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the result is very close to the optimal input of 0.0.

Best: f(-0.01762) = 0.00031
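As an aside, the manual bookkeeping loop in the example can be replaced with NumPy’s argmin() function, which returns the index of the smallest value:

...
# locate the best solution using argmin instead of a manual loop
from numpy import argmin
best_ix = argmin(sample_eval)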

We can update the example to plot the objective function and show the sample and best result. The complete example is listed below.

# example of random search for function optimization with plot
from numpy import arange
from numpy.random import rand
from matplotlib import pyplot

# objective function
def objective(x):
	return x**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# generate a random sample from the domain
sample = r_min + rand(100) * (r_max - r_min)
# evaluate the sample
sample_eval = objective(sample)
# locate the best solution
best_ix = 0
for i in range(len(sample)):
	if sample_eval[i] < sample_eval[best_ix]:
		best_ix = i
# summarize best solution
print('Best: f(%.5f) = %.5f' % (sample[best_ix], sample_eval[best_ix]))
# sample input range uniformly at 0.1 increments
inputs = arange(r_min, r_max, 0.1)
# compute targets
results = objective(inputs)
# create a line plot of input vs result
pyplot.plot(inputs, results)
# plot the sample
pyplot.scatter(sample, sample_eval)
# draw a vertical line at the best input
pyplot.axvline(x=sample[best_ix], ls='--', color='red')
# show the plot
pyplot.show()

Running the example again generates the random sample and reports the best result.

Best: f(0.01934) = 0.00037

A line plot is then created showing the shape of the objective function, the random sample, and a red line for the best result located from the sample.

Line Plot of One-Dimensional Objective Function With Random Sample

Line Plot of One-Dimensional Objective Function With Random Sample

Grid Search for Function Optimization

Grid search is also referred to as grid sampling or full factorial sampling.

Grid search involves generating uniform grid inputs for an objective function. In one dimension, this would be inputs evenly spaced along a line. In two dimensions, this would be a lattice of evenly spaced points across the surface, and so on for higher dimensions.

The full factorial sampling plan places a grid of evenly spaced points over the search space. This approach is easy to implement, does not rely on randomness, and covers the space, but it uses a large number of points.

— Page 235, Algorithms for Optimization, 2019.

Like random search, a grid search can be particularly effective on problems where domain expertise is typically used to influence the selection of specific optimization algorithms. The grid can help to quickly identify areas of a search space that may deserve more attention.

The grid of samples is typically uniform, although this does not have to be the case. For example, a log-10 scale could be used with a uniform spacing, allowing sampling to be performed across orders of magnitude.
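For example, a minimal sketch of a log-10 spaced grid for a variable spanning several orders of magnitude (the bounds and number of points are illustrative):

...
# hedged sketch: seven grid points spaced uniformly on a log-10 scale between 1e-4 and 1e+2
from numpy import logspace
values = logspace(-4, 2, num=7)
print(values)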

The downside is that the coarseness of the grid may step over whole regions of the search space where good solutions reside, a problem that gets worse as the number of inputs (dimensions of the search space) to the problem increases.

A grid of samples can be generated by choosing the uniform separation of points, then enumerating each variable in turn and incrementing each variable by the chosen separation.

The example below gives an example of a simple two-dimensional minimization objective function and generates then evaluates a grid sample with a spacing of 0.1 for both input variables. The input with the best performance is then reported.

# example of grid search for function optimization
from numpy import arange
from numpy.random import rand

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# generate a grid sample from the domain
sample = list()
step = 0.1
for x in arange(r_min, r_max+step, step):
	for y in arange(r_min, r_max+step, step):
		sample.append([x,y])
# evaluate the sample
sample_eval = [objective(x,y) for x,y in sample]
# locate the best solution
best_ix = 0
for i in range(len(sample)):
	if sample_eval[i] < sample_eval[best_ix]:
		best_ix = i
# summarize best solution
print('Best: f(%.5f,%.5f) = %.5f' % (sample[best_ix][0], sample[best_ix][1], sample_eval[best_ix]))

Running the example generates a grid of input values, which are then evaluated. The best performing point is then identified and reported.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the result finds the optima exactly.

Best: f(-0.00000,-0.00000) = 0.00000

We can update the example to plot the objective function and show the sample and best result. The complete example is listed below.

# example of grid search for function optimization with plot
from numpy import arange
from numpy import meshgrid
from numpy.random import rand
from matplotlib import pyplot

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# generate a grid sample from the domain
sample = list()
step = 0.5
for x in arange(r_min, r_max+step, step):
	for y in arange(r_min, r_max+step, step):
		sample.append([x,y])
# evaluate the sample
sample_eval = [objective(x,y) for x,y in sample]
# locate the best solution
best_ix = 0
for i in range(len(sample)):
	if sample_eval[i] < sample_eval[best_ix]:
		best_ix = i
# summarize best solution
print('Best: f(%.5f,%.5f) = %.5f' % (sample[best_ix][0], sample[best_ix][1], sample_eval[best_ix]))
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# plot the sample as black circles
pyplot.plot([x for x,_ in sample], [y for _,y in sample], '.', color='black')
# draw the best result as a white star
pyplot.plot(sample[best_ix][0], sample[best_ix][1], '*', color='white')
# show the plot
pyplot.show()

Running the example again generates the grid sample and reports the best result.

Best: f(0.00000,0.00000) = 0.00000

A contour plot is then created showing the shape of the objective function, the grid sample as black dots, and a white star for the best result located from the sample.

Note that some of the black dots at the edge of the domain appear to be off the plot; this is just an artifact of how we are choosing to draw the dots (e.g. not centered on the sample).

Contour Plot of Two-Dimensional Objective Function With Grid Sample

Contour Plot of Two-Dimensional Objective Function With Grid Sample

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

Articles

Summary

In this tutorial, you discovered naive algorithms for function optimization.

Specifically, you learned:

  • The role of naive algorithms in function optimization projects.
  • How to generate and evaluate a random search for function optimization.
  • How to generate and evaluate a grid search for function optimization.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Random Search and Grid Search for Function Optimization appeared first on Machine Learning Mastery.

Basin Hopping Optimization in Python


Basin hopping is a global optimization algorithm.

It was developed to solve problems in chemical physics, although it is an effective algorithm suited for nonlinear objective functions with multiple optima.

In this tutorial, you will discover the basin hopping global optimization algorithm.

After completing this tutorial, you will know:

  • Basin hopping optimization is a global optimization algorithm that uses random perturbations to jump between basins, and a local search algorithm to optimize each basin.
  • How to use the basin hopping optimization algorithm API in Python.
  • Examples of using basin hopping to solve global optimization problems with multiple optima.

Let’s get started.

Basin Hopping Optimization in Python

Basin Hopping Optimization in Python
Photo by Pedro Szekely, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Basin Hopping Optimization
  2. Basin Hopping API
  3. Basin Hopping Examples
    1. Multimodal Optimization With Local Optima
    2. Multimodal Optimization With Multiple Global Optima

Basin Hopping Optimization

Basin Hopping is a global optimization algorithm developed for use in the field of chemical physics.

Basin-Hopping (BH) or Monte-Carlo Minimization (MCM) is so far the most reliable algorithms in chemical physics to search for the lowest-energy structure of atomic clusters and macromolecular systems.

— Basin Hopping With Occasional Jumping, 2004.

Local optimization refers to optimization algorithms intended to locate an optimum for a unimodal objective function, or to operate in a region where an optimum is believed to be present, whereas global optimization algorithms are intended to locate the single global optimum among potentially multiple local (non-global) optima.

Basin Hopping was described by David Wales and Jonathan Doye in their 1997 paper titled “Global Optimization by Basin-Hopping and the Lowest Energy Structures of Lennard-Jones Clusters Containing up to 110 Atoms.”

The algorithm involves cycling through two steps: a perturbation of a good candidate solution and the application of a local search to the perturbed solution.

[Basin hopping] transforms the complex energy landscape into a collection of basins, and explores them by hopping, which is achieved by random Monte Carlo moves and acceptance/rejection using the Metropolis criterion.

— Basin Hopping With Occasional Jumping, 2004.

The perturbation allows the search algorithm to jump to new regions of the search space and potentially locate a new basin leading to a different optimum, hence the “basin hopping” in the technique’s name.

The local search allows the algorithm to traverse the new basin to the optima.

The new optimum may be kept as the basis for new random perturbations; otherwise, it is discarded. The decision to keep the new solution is controlled by a stochastic decision function with a “temperature” variable, much like simulated annealing.

Temperature is adjusted as a function of the number of iterations of the algorithm. This allows arbitrary solutions to be accepted early in the run when the temperature is high, and a stricter policy of only accepting better quality solutions later in the search when the temperature is low.
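To make the acceptance decision concrete, a minimal sketch of a Metropolis-style acceptance test with a temperature parameter is shown below; this is illustrative and may differ in detail from the exact rule used by any particular implementation:

# hedged sketch of a Metropolis-style acceptance decision
from numpy import exp
from numpy.random import rand

def accept(f_new, f_old, temperature=1.0):
	# always accept an improvement
	if f_new < f_old:
		return True
	# accept a worse solution with probability exp(-(f_new - f_old) / T)
	return rand() < exp(-(f_new - f_old) / temperature)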

In this way, the algorithm is much like an iterated local search with different (perturbed) starting points.

The algorithm runs for a specified number of iterations or function evaluations and can be run multiple times to increase confidence that the global optimum, or at least a relatively good solution, was located.

Now that we are familiar with the basin hopping algorithm from a high level, let’s look at the API for basin hopping in Python.

Basin Hopping API

Basin hopping is available in Python via the basinhopping() SciPy function.

The function takes the name of the objective function to be minimized and the initial starting point.

...
# perform the basin hopping search
result = basinhopping(objective, pt)

Another important hyperparameter is the number of iterations used to run the search, set via the “niter” argument, which defaults to 100.

This can be set to thousands of iterations or more.

...
# perform the basin hopping search
result = basinhopping(objective, pt, niter=10000)

The amount of perturbation applied to the candidate solution can be controlled via the “stepsize” argument, which defines the maximum amount of change applied in the context of the bounds of the problem domain. By default, this is set to 0.5 but should be set to something reasonable in the domain that might allow the search to find a new basin.

For example, if the reasonable bounds of a search space were -100 to 100, then perhaps a step size of 5.0 or 10.0 units would be appropriate (e.g. 2.5% or 5% of the domain).
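A sketch of deriving the step size from the domain bounds (the five percent figure is illustrative, not a recommendation):

...
# derive the step size as a fraction of the domain width (illustrative)
lower, upper = -100.0, 100.0
stepsize = 0.05 * (upper - lower)  # 10.0 units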

...
# perform the basin hopping search
result = basinhopping(objective, pt, stepsize=10.0)

By default, the local search algorithm used is the “L-BFGS-B” algorithm.

This can be changed by setting the “minimizer_kwargs” argument to a dictionary with a key of “method” and the value as the name of the local search algorithm to use, such as “nelder-mead.” Any of the local search algorithms provided by the SciPy library can be used.

...
# perform the basin hopping search
result = basinhopping(objective, pt, minimizer_kwargs={'method':'nelder-mead'})

The result of the search is an OptimizeResult object where properties can be accessed like a dictionary. The success (or not) of the search can be accessed via the ‘success‘ or ‘message‘ key.

The total number of function evaluations can be accessed via ‘nfev‘ and the optimal input found for the search is accessible via the ‘x‘ key.
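A minimal sketch of accessing these properties, assuming ‘result’ holds the return value of basinhopping():

...
# inspect a search result (assumes 'result' from basinhopping)
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
print('Best Input: %s' % result['x'])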

Now that we are familiar with the basin hopping API in Python, let’s look at some worked examples.

Basin Hopping Examples

In this section, we will look at some examples of using the basin hopping algorithm on multi-modal objective functions.

Multimodal objective functions are those that have multiple optima, such as a global optimum and many local optima, or multiple global optima with the same objective function output.

We will look at examples of basin hopping on both functions.

Multimodal Optimization With Local Optima

The Ackley function is an example of an objective function that has a single global optimum and multiple local optima in which a local search might get stuck.

As such, a global optimization technique is required. It is a two-dimensional objective function that has a global optimum at [0,0], which evaluates to 0.0.

The example below implements the Ackley function and creates a three-dimensional surface plot showing the global optimum and the multiple local optima.

# ackley multimodal function
from numpy import arange
from numpy import exp
from numpy import sqrt
from numpy import cos
from numpy import e
from numpy import pi
from numpy import meshgrid
from matplotlib import pyplot
from mpl_toolkits.mplot3d import Axes3D

# objective function
def objective(x, y):
	return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()

Running the example creates the surface plot of the Ackley function showing the vast number of local optima.

3D Surface Plot of the Ackley Multimodal Function

3D Surface Plot of the Ackley Multimodal Function

We can apply the basin hopping algorithm to the Ackley objective function.

In this case, we will start the search using a random point drawn from the input domain between -5 and 5.

...
# define the starting point as a random sample from the domain
pt = r_min + rand(2) * (r_max - r_min)

We will use a step size of 0.5, 200 iterations, and the default local search algorithm. This configuration was chosen after a little trial and error.

...
# perform the basin hopping search
result = basinhopping(objective, pt, stepsize=0.5, niter=200)

After the search is complete, it will report the status of the search and the total number of function evaluations performed, as well as the best result found with its evaluation.

...
# summarize the result
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
# evaluate solution
solution = result['x']
evaluation = objective(solution)
print('Solution: f(%s) = %.5f' % (solution, evaluation))

Tying this together, the complete example of applying basin hopping to the Ackley objective function is listed below.

# basin hopping global optimization for the ackley multimodal objective function
from scipy.optimize import basinhopping
from numpy.random import rand
from numpy import exp
from numpy import sqrt
from numpy import cos
from numpy import e
from numpy import pi

# objective function
def objective(v):
	x, y = v
	return -20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2))) - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y))) + e + 20

# define range for input
r_min, r_max = -5.0, 5.0
# define the starting point as a random sample from the domain
pt = r_min + rand(2) * (r_max - r_min)
# perform the basin hopping search
result = basinhopping(objective, pt, stepsize=0.5, niter=200)
# summarize the result
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
# evaluate solution
solution = result['x']
evaluation = objective(solution)
print('Solution: f(%s) = %.5f' % (solution, evaluation))

Running the example executes the optimization, then reports the results.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the algorithm located the optimum with inputs very close to zero and an objective function evaluation that is practically zero.

We can see that 200 iterations of the algorithm resulted in 86,020 function evaluations.

Status : ['requested number of basinhopping iterations completed successfully']
Total Evaluations: 86020
Solution: f([ 5.29778873e-10 -2.29022817e-10]) = 0.00000

Multimodal Optimization With Multiple Global Optima

The Himmelblau function is an example of an objective function that has multiple global optima.

Specifically, it has four optima, each with the same objective function evaluation. It is a two-dimensional objective function with global optima at [3.0, 2.0], [-2.805118, 3.131312], [-3.779310, -3.283186], and [3.584428, -1.848126].

This means each run of a global optimization algorithm may find a different global optimum.

The example below implements the Himmelblau function and creates a three-dimensional surface plot to give an intuition for the objective function.

# himmelblau multimodal test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot
from mpl_toolkits.mplot3d import Axes3D

# objective function
def objective(x, y):
	return (x**2 + y - 11)**2 + (x + y**2 -7)**2

# define range for input
r_min, r_max = -5.0, 5.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()

Running the example creates the surface plot of the Himmelblau function showing the four global optima as dark blue basins.

3D Surface Plot of the Himmelblau Multimodal Function

3D Surface Plot of the Himmelblau Multimodal Function

We can apply the basin hopping algorithm to the Himmelblau objective function.

As in the previous example, we will start the search using a random point drawn from the input domain between -5 and 5.

We will use a step size of 0.5, 200 iterations, and the default local search algorithm. At the end of the search, we will report the input for the best located optimum.

# basin hopping global optimization for the himmelblau multimodal objective function
from scipy.optimize import basinhopping
from numpy.random import rand

# objective function
def objective(v):
	x, y = v
	return (x**2 + y - 11)**2 + (x + y**2 -7)**2

# define range for input
r_min, r_max = -5.0, 5.0
# define the starting point as a random sample from the domain
pt = r_min + rand(2) * (r_max - r_min)
# perform the basin hopping search
result = basinhopping(objective, pt, stepsize=0.5, niter=200)
# summarize the result
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
# evaluate solution
solution = result['x']
evaluation = objective(solution)
print('Solution: f(%s) = %.5f' % (solution, evaluation))

Running the example executes the optimization, then reports the results.


In this case, we can see that the algorithm located an optimum at about [3.0, 2.0].

We can see that 200 iterations of the algorithm resulted in 7,660 function evaluations.

Status : ['requested number of basinhopping iterations completed successfully']
Total Evaluations: 7660
Solution: f([3. 1.99999999]) = 0.00000

If we run the search again, we may expect a different global optimum to be located.

For example, below, we can see an optimum located at about [-2.805118, 3.131312], different from the previous run.

Status : ['requested number of basinhopping iterations completed successfully']
Total Evaluations: 7636
Solution: f([-2.80511809 3.13131252]) = 0.00000

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Papers

Books

APIs

Articles

Summary

In this tutorial, you discovered the basin hopping global optimization algorithm.

Specifically, you learned:

  • Basin hopping optimization is a global optimization algorithm that uses random perturbations to jump between basins, and a local search algorithm to optimize each basin.
  • How to use the basin hopping optimization algorithm API in Python.
  • Examples of using basin hopping to solve global optimization problems with multiple optima.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Basin Hopping Optimization in Python appeared first on Machine Learning Mastery.

XGBoost for Regression


Extreme Gradient Boosting (XGBoost) is an open-source library that provides an efficient and effective implementation of the gradient boosting algorithm.

Shortly after its development and initial release, XGBoost became the go-to method and often the key component in winning solutions for a range of problems in machine learning competitions.

Regression predictive modeling problems involve predicting a numerical value such as a dollar amount or a height. XGBoost can be used directly for regression predictive modeling.

In this tutorial, you will discover how to develop and evaluate XGBoost regression models in Python.

After completing this tutorial, you will know:

  • XGBoost is an efficient implementation of gradient boosting that can be used for regression predictive modeling.
  • How to evaluate an XGBoost regression model using the best practice technique of repeated k-fold cross-validation.
  • How to fit a final model and use it to make a prediction on new data.

Let’s get started.

XGBoost for Regression

XGBoost for Regression
Photo by chas B, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Extreme Gradient Boosting
  2. XGBoost Regression API
  3. XGBoost Regression Example

Extreme Gradient Boosting

Gradient boosting refers to a class of ensemble machine learning algorithms that can be used for classification or regression predictive modeling problems.

Ensembles are constructed from decision tree models. Trees are added one at a time to the ensemble and fit to correct the prediction errors made by prior models. This is a type of ensemble machine learning model referred to as boosting.

Models are fit using any arbitrary differentiable loss function and gradient descent optimization algorithm. This gives the technique its name, “gradient boosting,” as the loss gradient is minimized as the model is fit, much like a neural network.

For more on gradient boosting, see the tutorial:

Extreme Gradient Boosting, or XGBoost for short, is an efficient open-source implementation of the gradient boosting algorithm. As such, XGBoost is an algorithm, an open-source project, and a Python library.

It was initially developed by Tianqi Chen and was described by Chen and Carlos Guestrin in their 2016 paper titled “XGBoost: A Scalable Tree Boosting System.”

It is designed to be both computationally efficient (e.g. fast to execute) and highly effective, perhaps more effective than other open-source implementations.

The two main reasons to use XGBoost are execution speed and model performance.

XGBoost dominates structured or tabular datasets on classification and regression predictive modeling problems. The evidence is that it is the go-to algorithm for competition winners on the Kaggle competitive data science platform.

Among the 29 challenge winning solutions published at Kaggle’s blog during 2015, 17 solutions used XGBoost. […] The success of the system was also witnessed in KDDCup 2015, where XGBoost was used by every winning team in the top-10.

— XGBoost: A Scalable Tree Boosting System, 2016.

Now that we are familiar with what XGBoost is and why it is important, let’s take a closer look at how we can use it in our regression predictive modeling projects.

XGBoost Regression API

XGBoost can be installed as a standalone library and an XGBoost model can be developed using the scikit-learn API.

The first step is to install the XGBoost library if it is not already installed. This can be achieved using the pip python package manager on most platforms; for example:

sudo pip install xgboost

You can then confirm that the XGBoost library was installed correctly and can be used by running the following script.

# check xgboost version
import xgboost
print(xgboost.__version__)

Running the script will print the version of the XGBoost library you have installed.

Your version should be the same as, or higher than, the version shown below. If not, you should upgrade your version of the XGBoost library.

1.1.1

It is possible that you may have problems with the latest version of the library. It is not your fault.

Sometimes, the most recent version of the library imposes additional requirements or may be less stable.

If you do have errors when trying to run the above script, I recommend downgrading to version 1.0.1 (or lower). This can be achieved by specifying the version to install to the pip command, as follows:

sudo pip install xgboost==1.0.1

If you require specific instructions for your development environment, see the tutorial:

The XGBoost library has its own custom API, although we will use it via the scikit-learn wrapper classes: XGBRegressor and XGBClassifier. This will allow us to use the full suite of tools from the scikit-learn machine learning library to prepare data and evaluate models.

An XGBoost regression model can be defined by creating an instance of the XGBRegressor class; for example:

...
# create an xgboost regression model
model = XGBRegressor()

You can specify hyperparameter values to the class constructor to configure the model.

Perhaps the most commonly configured hyperparameters are the following:

  • n_estimators: The number of trees in the ensemble, often increased until no further improvements are seen.
  • max_depth: The maximum depth of each tree, often values are between 1 and 10.
  • eta: The learning rate used to weight each model, often set to small values such as 0.3, 0.1, 0.01, or smaller.
  • subsample: The ratio of samples (rows) used in each tree, set to a value between 0 and 1, often 1.0 to use all samples.
  • colsample_bytree: The ratio of features (columns) used in each tree, set to a value between 0 and 1, often 1.0 to use all features.

For example:

...
# create an xgboost regression model
model = XGBRegressor(n_estimators=1000, max_depth=7, eta=0.1, subsample=0.7, colsample_bytree=0.8)

Good hyperparameter values can be found by trial and error for a given dataset, or systematic experimentation such as using a grid search across a range of values.
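A hedged sketch of such a systematic search using scikit-learn’s GridSearchCV on the housing dataset used in the next section; the grid values and the 3-fold setup are illustrative assumptions, not recommendations:

# hedged sketch: grid search xgboost hyperparameters on the housing dataset
from pandas import read_csv
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
data = read_csv(url, header=None).values
X, y = data[:, :-1], data[:, -1]
# define an illustrative grid of candidate values (learning_rate is the eta alias)
grid = {'n_estimators': [100, 500], 'max_depth': [3, 6], 'learning_rate': [0.3, 0.1]}
# search the grid with 3-fold cross-validation on negative MAE
search = GridSearchCV(XGBRegressor(), grid, scoring='neg_mean_absolute_error', cv=3, n_jobs=-1)
search.fit(X, y)
print('Best MAE: %.3f' % -search.best_score_)
print('Best config: %s' % search.best_params_)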

Randomness is used in the construction of the model. This means that each time the algorithm is run on the same data, it may produce a slightly different model.

When using machine learning algorithms that have a stochastic learning algorithm, it is good practice to evaluate them by averaging their performance across multiple runs or repeats of cross-validation. When fitting a final model, it may be desirable to either increase the number of trees until the variance of the model is reduced across repeated evaluations, or to fit multiple final models and average their predictions.
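A hedged sketch of the latter idea, fitting several final models that differ only by random seed and averaging their predictions; the subsampling is an assumption added to make the seeds matter, and the dataset is the housing data used in the next section:

# hedged sketch: average the predictions of several seeded final models
from numpy import asarray
from numpy import mean
from pandas import read_csv
from xgboost import XGBRegressor
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
data = read_csv(url, header=None).values
X, y = data[:, :-1], data[:, -1]
# fit several models that differ only by random seed
models = [XGBRegressor(subsample=0.7, random_state=i).fit(X, y) for i in range(5)]
# average the per-model predictions for a new row of data
row = asarray([X[0, :]])
yhat = mean([m.predict(row)[0] for m in models])
print('Averaged prediction: %.3f' % yhat)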

Let’s take a look at how to develop an XGBoost ensemble for regression.

XGBoost Regression Example

In this section, we will look at how we might develop an XGBoost model for a standard regression predictive modeling dataset.

First, let’s introduce a standard regression dataset.

We will use the housing dataset.

The housing dataset is a standard machine learning dataset comprising 506 rows of data with 13 numerical input variables and a numerical target variable.

Using a test harness of repeated 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 6.6. A top-performing model can achieve a MAE on this same test harness of about 1.9. This provides the bounds of expected performance on this dataset.
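As an aside, this naive baseline can be reproduced with scikit-learn’s DummyRegressor, which simply predicts the mean of the training target; a minimal sketch:

# hedged sketch: naive mean-predicting baseline on the housing dataset
from numpy import absolute
from pandas import read_csv
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
data = read_csv(url, header=None).values
X, y = data[:, :-1], data[:, -1]
# evaluate a mean-predicting baseline with repeated 10-fold cross-validation
model = DummyRegressor(strategy='mean')
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = absolute(cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1))
print('Baseline MAE: %.3f (%.3f)' % (scores.mean(), scores.std()))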

The dataset involves predicting the house price given details of the house’s suburb in the American city of Boston.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads and loads the dataset as a Pandas DataFrame and summarizes the shape of the dataset and the first five rows of data.

# load and summarize the housing dataset
from pandas import read_csv
from matplotlib import pyplot
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# summarize shape
print(dataframe.shape)
# summarize first few lines
print(dataframe.head())

Running the example confirms the 506 rows of data, 13 input variables, and a single numeric target variable (14 columns in total). We can also see that all input variables are numeric.

(506, 14)
        0     1     2   3      4      5   ...  8      9     10      11    12    13
0  0.00632  18.0  2.31   0  0.538  6.575  ...   1  296.0  15.3  396.90  4.98  24.0
1  0.02731   0.0  7.07   0  0.469  6.421  ...   2  242.0  17.8  396.90  9.14  21.6
2  0.02729   0.0  7.07   0  0.469  7.185  ...   2  242.0  17.8  392.83  4.03  34.7
3  0.03237   0.0  2.18   0  0.458  6.998  ...   3  222.0  18.7  394.63  2.94  33.4
4  0.06905   0.0  2.18   0  0.458  7.147  ...   3  222.0  18.7  396.90  5.33  36.2

[5 rows x 14 columns]

Next, let’s evaluate a regression XGBoost model with default hyperparameters on the problem.

First, we can split the loaded dataset into input and output columns for training and evaluating a predictive model.

...
# split data into input and output columns
X, y = data[:, :-1], data[:, -1]

Next, we can create an instance of the model with a default configuration.

...
# define model
model = XGBRegressor()

We will evaluate the model using the best practice of repeated k-fold cross-validation with 3 repeats and 10 folds.

This can be achieved by using the RepeatedKFold class to configure the evaluation procedure and calling the cross_val_score() to evaluate the model using the procedure and collect the scores.

Model performance will be evaluated using mean absolute error (MAE). Note that MAE is made negative in the scikit-learn library so that it can be maximized. As such, we can ignore the sign and assume all errors are positive.

...
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)

Once evaluated, we can report the estimated performance of the model when used to make predictions on new data for this problem.

In this case, because the scores were made negative, we can use the absolute() NumPy function to make the scores positive.

We then report a statistical summary of the performance using the mean and standard deviation of the distribution of scores, another good practice.

...
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (scores.mean(), scores.std()) )

Tying this together, the complete example of evaluating an XGBoost model on the housing regression predictive modeling problem is listed below.

# evaluate an xgboost regression model on the housing dataset
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from xgboost import XGBRegressor
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split data into input and output columns
X, y = data[:, :-1], data[:, -1]
# define model
model = XGBRegressor()
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (scores.mean(), scores.std()) )

Running the example evaluates the XGBoost Regression algorithm on the housing dataset and reports the average MAE across the three repeats of 10-fold cross-validation.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved a MAE of about 2.1.

This is a good score, better than the baseline of about 6.6, meaning the model has skill and is close to the best score of about 1.9.

Mean MAE: 2.109 (0.320)

We may decide to use the XGBoost Regression model as our final model and make predictions on new data.

This can be achieved by fitting the model on all available data and calling the predict() function, passing in a new row of data.

For example:

...
# make a prediction
yhat = model.predict(new_data)

We can demonstrate this with a complete example, listed below.

# fit a final xgboost model on the housing dataset and make a prediction
from numpy import asarray
from pandas import read_csv
from xgboost import XGBRegressor
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split dataset into input and output columns
X, y = data[:, :-1], data[:, -1]
# define model
model = XGBRegressor()
# fit model
model.fit(X, y)
# define new data
row = [0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98]
new_data = asarray([row])
# make a prediction
yhat = model.predict(new_data)
# summarize prediction
print('Predicted: %.3f' % yhat[0])

Running the example fits the model and makes a prediction for the new row of data.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model predicted a value of about 24.

Predicted: 24.019

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

Papers

APIs

Summary

In this tutorial, you discovered how to develop and evaluate XGBoost regression models in Python.

Specifically, you learned:

  • XGBoost is an efficient implementation of gradient boosting that can be used for regression predictive modeling.
  • How to evaluate an XGBoost regression model using the best practice technique of repeated k-fold cross-validation.
  • How to fit a final model and use it to make a prediction on new data.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post XGBoost for Regression appeared first on Machine Learning Mastery.
