
How to Perform Feature Selection for Regression Data


Feature selection is the process of identifying and selecting a subset of input variables that are most relevant to the target variable.

Perhaps the simplest case of feature selection is the case where there are numerical input variables and a numerical target for regression predictive modeling. This is because the strength of the relationship between each input variable and the target can be calculated, called correlation, and compared relative to each other.

In this tutorial, you will discover how to perform feature selection with numerical input data for regression predictive modeling.

After completing this tutorial, you will know:

  • How to evaluate the importance of numerical input data using the correlation and mutual information statistics.
  • How to perform feature selection for numerical input data when fitting and evaluating a regression model.
  • How to tune the number of features selected in a modeling pipeline using a grid search.

Let’s get started.

How to Perform Feature Selection for Regression Data
Photo by Dennis Jarvis, some rights reserved.

Tutorial Overview

This tutorial is divided into four parts; they are:

  1. Regression Dataset
  2. Numerical Feature Selection
    1. Correlation Feature Selection
    2. Mutual Information Feature Selection
  3. Modeling With Selected Features
    1. Model Built Using All Features
    2. Model Built Using Correlation Features
    3. Model Built Using Mutual Information Features
  4. Tune the Number of Selected Features

Regression Dataset

We will use a synthetic regression dataset as the basis of this tutorial.

Recall that a regression problem is a problem in which we want to predict a numerical value. In this case, we require a dataset that also has numerical input variables.

The make_regression() function from the scikit-learn library can be used to define a dataset. It provides control over the number of samples, number of input features, and, importantly, the number of relevant and redundant input features. This is critical as we specifically desire a dataset that we know has some redundant input features.

In this case, we will define a dataset with 1,000 samples, each with 100 input features where 10 are informative and the remaining 90 are redundant.

...
# generate regression dataset
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=1)

The hope is that feature selection techniques can identify some or all of those features that are relevant to the target, or, at the very least, identify and remove some of the redundant input features.

Once defined, we can split the data into training and test sets so we can fit and evaluate a learning model.

We will use the train_test_split() function from scikit-learn and use 67 percent of the data for training and 33 percent for testing.

...
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

Tying these elements together, the complete example of defining, splitting, and summarizing the raw regression dataset is listed below.

# load and summarize the dataset
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
# generate regression dataset
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize
print('Train', X_train.shape, y_train.shape)
print('Test', X_test.shape, y_test.shape)

Running the example reports the size of the input and output elements of the train and test sets.

We can see that we have 670 examples for training and 330 for testing.

Train (670, 100) (670,)
Test (330, 100) (330,)

Now that we have loaded and prepared the dataset, we can explore feature selection.

Numerical Feature Selection

There are two popular feature selection techniques that can be used for numerical input data and a numerical target variable.

They are:

  1. Correlation Statistics.
  2. Mutual Information Statistics.

Let’s take a closer look at each in turn.

Correlation Feature Selection

Correlation is a measure of how two variables change together. Perhaps the most common correlation measure is Pearson’s correlation, which assumes a Gaussian distribution for each variable and reports on their linear relationship.

For numeric predictors, the classic approach to quantifying each relationship with the outcome uses the sample correlation statistic.

— Page 464, Applied Predictive Modeling, 2013.

For more on linear or parametric correlation, see the tutorial:

Linear correlation scores are typically values between -1 and 1, with 0 representing no relationship. For feature selection, we are often interested in the magnitude of the score: the larger the value, the stronger the relationship, and the more likely the feature should be selected for modeling. As such, the signed linear correlation can be converted into a correlation statistic with only positive values.
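To make this concrete, the short sketch below (a minimal example on made-up data, not the tutorial’s dataset) calculates the signed Pearson’s correlation between each input and the target using the pearsonr() function from SciPy, then takes the absolute value as a positive importance score.

# sketch: converting signed correlations into positive scores (toy data)
from numpy.random import RandomState
from scipy.stats import pearsonr
# made-up inputs and a target that depends on the first two inputs
rng = RandomState(1)
X = rng.randn(100, 3)
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.1 * rng.randn(100)
for i in range(X.shape[1]):
	# signed correlation in [-1, 1]
	corr, _ = pearsonr(X[:, i], y)
	# use the magnitude of the correlation as the importance score
	print('Feature %d: corr=%.3f score=%.3f' % (i, corr, abs(corr)))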

The scikit-learn machine learning library provides an implementation of the correlation statistic in the f_regression() function. This function can be used in a feature selection strategy, such as selecting the top k most relevant features (those with the largest values) via the SelectKBest class.

For example, we can define the SelectKBest class to use the f_regression() function and select all features, then transform the train and test sets.

...
# configure to select all features
fs = SelectKBest(score_func=f_regression, k='all')
# learn relationship from training data
fs.fit(X_train, y_train)
# transform train input data
X_train_fs = fs.transform(X_train)
# transform test input data
X_test_fs = fs.transform(X_test)
return X_train_fs, X_test_fs, fs

We can then print the scores for each variable (largest is better) and plot the scores for each variable as a bar graph to get an idea of how many features we should select.

...
# what are scores for the features
for i in range(len(fs.scores_)):
	print('Feature %d: %f' % (i, fs.scores_[i]))
# plot the scores
pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_)
pyplot.show()

Tying this together with the data preparation for the dataset in the previous section, the complete example is listed below.

# example of correlation feature selection for numerical data
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from matplotlib import pyplot

# feature selection
def select_features(X_train, y_train, X_test):
	# configure to select all features
	fs = SelectKBest(score_func=f_regression, k='all')
	# learn relationship from training data
	fs.fit(X_train, y_train)
	# transform train input data
	X_train_fs = fs.transform(X_train)
	# transform test input data
	X_test_fs = fs.transform(X_test)
	return X_train_fs, X_test_fs, fs

# load the dataset
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test)
# what are scores for the features
for i in range(len(fs.scores_)):
	print('Feature %d: %f' % (i, fs.scores_[i]))
# plot the scores
pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_)
pyplot.show()

Running the example first prints the scores calculated for each input feature and the target variable.

Note that your specific results may vary. Try running the example a few times.

We will not list the scores for all 100 input variables, as it would take up too much space. Nevertheless, we can see that some variables have larger scores than others, e.g. less than 1 vs. 5, and others have much larger scores, such as Feature 9, which has a score of about 101.

Feature 0: 0.009419
Feature 1: 1.018881
Feature 2: 1.205187
Feature 3: 0.000138
Feature 4: 0.167511
Feature 5: 5.985083
Feature 6: 0.062405
Feature 7: 1.455257
Feature 8: 0.420384
Feature 9: 101.392225
...

A bar chart of the feature importance scores for each input feature is created.

The plot clearly shows that 8 to 10 features are much more important than the other features.

We could set k=10 when configuring the SelectKBest to select these top features.

Bar Chart of the Input Features (x) vs. Correlation Feature Importance (y)

Mutual Information Feature Selection

Mutual information from the field of information theory is the application of information gain (typically used in the construction of decision trees) to feature selection.

Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable.

You can learn more about mutual information in the following tutorial.

Mutual information is straightforward when considering the distribution of two discrete (categorical or ordinal) variables, such as categorical input and categorical output data. Nevertheless, it can be adapted for use with numerical input and output data.

For technical details on how this can be achieved, see the 2014 paper titled “Mutual Information between Discrete and Continuous Data Sets.”
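As a minimal illustration of the discrete case (using made-up labels rather than the tutorial’s data), the mutual_info_score() function from scikit-learn calculates the mutual information between two discrete variables; the mutual_info_regression() function used below extends the idea to numerical data via a nearest neighbor-based estimator.

# sketch: mutual information between two discrete variables (toy labels)
from sklearn.metrics import mutual_info_score
# two made-up discrete variables observed together
x = ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'b']
y = [0, 0, 1, 1, 0, 1, 0, 0]
# zero if independent; larger when knowing x reduces uncertainty about y
print('MI: %.3f' % mutual_info_score(x, y))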

The scikit-learn machine learning library provides an implementation of mutual information for feature selection with numeric input and output variables via the mutual_info_regression() function.

Like f_regression(), it can be used in the SelectKBest feature selection strategy (and other strategies).

...
# configure to select all features
fs = SelectKBest(score_func=mutual_info_regression, k='all')
# learn relationship from training data
fs.fit(X_train, y_train)
# transform train input data
X_train_fs = fs.transform(X_train)
# transform test input data
X_test_fs = fs.transform(X_test)
return X_train_fs, X_test_fs, fs

We can perform feature selection using mutual information on the dataset and print and plot the scores (larger is better) as we did in the previous section.

The complete example of using mutual information for numerical feature selection is listed below.

# example of mutual information feature selection for numerical input data
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression
from matplotlib import pyplot

# feature selection
def select_features(X_train, y_train, X_test):
	# configure to select all features
	fs = SelectKBest(score_func=mutual_info_regression, k='all')
	# learn relationship from training data
	fs.fit(X_train, y_train)
	# transform train input data
	X_train_fs = fs.transform(X_train)
	# transform test input data
	X_test_fs = fs.transform(X_test)
	return X_train_fs, X_test_fs, fs

# load the dataset
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test)
# what are scores for the features
for i in range(len(fs.scores_)):
	print('Feature %d: %f' % (i, fs.scores_[i]))
# plot the scores
pyplot.bar([i for i in range(len(fs.scores_))], fs.scores_)
pyplot.show()

Running the example first prints the scores calculated for each input feature and the target variable.

Note that your specific results may vary. Try running the example a few times.

Again, we will not list the scores for all 100 input variables. We can see many features have a score of 0.0, whereas this technique has identified many more features that may be relevant to the target.

Feature 0: 0.045484
Feature 1: 0.000000
Feature 2: 0.000000
Feature 3: 0.000000
Feature 4: 0.024816
Feature 5: 0.000000
Feature 6: 0.022659
Feature 7: 0.000000
Feature 8: 0.000000
Feature 9: 0.074320
...

A bar chart of the feature importance scores for each input feature is created.

Compared to the correlation feature selection method, we can clearly see many more features scored as relevant. This may be because of the statistical noise that we added to the dataset in its construction.

Bar Chart of the Input Features (x) vs. the Mutual Information Feature Importance (y)

Now that we know how to perform feature selection on numerical input data for a regression predictive modeling problem, we can try developing a model using the selected features and compare the results.

Modeling With Selected Features

There are many different techniques for scoring features and selecting features based on scores; how do you know which one to use?

A robust approach is to evaluate models using different feature selection methods (and numbers of features) and select the method that results in a model with the best performance.

In this section, we will evaluate a Linear Regression model with all features compared to a model built from features selected by correlation statistics and those features selected via mutual information.

Linear regression is a good model for testing feature selection methods as it can perform better if irrelevant features are removed from the model.

Model Built Using All Features

As a first step, we will evaluate a LinearRegression model using all the available features.

The model is fit on the training dataset and evaluated on the test dataset.

The complete example is listed below.

# evaluation of a model using all input features
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
# load the dataset
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

Running the example prints the mean absolute error (MAE) of the model on the test dataset.

Note that your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the model achieves an error of about 0.086.

We would prefer to use a subset of features that achieves an error that is as good or better than this.

MAE: 0.086

Model Built Using Correlation Features

We can use the correlation method to score the features and select the 10 most relevant ones.

The select_features() function below is updated to achieve this.

# feature selection
def select_features(X_train, y_train, X_test):
	# configure to select a subset of features
	fs = SelectKBest(score_func=f_regression, k=10)
	# learn relationship from training data
	fs.fit(X_train, y_train)
	# transform train input data
	X_train_fs = fs.transform(X_train)
	# transform test input data
	X_test_fs = fs.transform(X_test)
	return X_train_fs, X_test_fs, fs

The complete example of evaluating a linear regression model fit and evaluated on data using this feature selection method is listed below.

# evaluation of a model using 10 features chosen with correlation
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# feature selection
def select_features(X_train, y_train, X_test):
	# configure to select a subset of features
	fs = SelectKBest(score_func=f_regression, k=10)
	# learn relationship from training data
	fs.fit(X_train, y_train)
	# transform train input data
	X_train_fs = fs.transform(X_train)
	# transform test input data
	X_test_fs = fs.transform(X_test)
	return X_train_fs, X_test_fs, fs

# load the dataset
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test)
# fit the model
model = LinearRegression()
model.fit(X_train_fs, y_train)
# evaluate the model
yhat = model.predict(X_test_fs)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

Running the example reports the performance of the model on just 10 of the 100 input features selected using the correlation statistic.

Note that your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we see that the model achieved an error score of about 2.7, which is much larger than the baseline model that used all features and achieved an MAE of 0.086.

This suggests that although the method has a strong idea of what features to select, building a model from these features alone does not result in a more skillful model. This could be because features that are important to the target are being left out, meaning that the method is being deceived about what is important.

MAE: 2.740

Let’s go the other way and try to use the method to remove some redundant features rather than all redundant features.

We can do this by setting the number of selected features to a much larger value, in this case, 88, hoping it can find and discard 12 of the 90 redundant features.

The complete example is listed below.

# evaluation of a model using 88 features chosen with correlation
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# feature selection
def select_features(X_train, y_train, X_test):
	# configure to select a subset of features
	fs = SelectKBest(score_func=f_regression, k=88)
	# learn relationship from training data
	fs.fit(X_train, y_train)
	# transform train input data
	X_train_fs = fs.transform(X_train)
	# transform test input data
	X_test_fs = fs.transform(X_test)
	return X_train_fs, X_test_fs, fs

# load the dataset
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test)
# fit the model
model = LinearRegression()
model.fit(X_train_fs, y_train)
# evaluate the model
yhat = model.predict(X_test_fs)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

Running the example reports the performance of the model on 88 of the 100 input features selected using the correlation statistic.

Note that your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that removing some of the redundant features has resulted in a small lift in performance with an error of about 0.085 compared to the baseline that achieved an error of about 0.086.

MAE: 0.085

Model Built Using Mutual Information Features

We can repeat the experiment and select the top 88 features using a mutual information statistic.

The updated version of the select_features() function to achieve this is listed below.

# feature selection
def select_features(X_train, y_train, X_test):
	# configure to select a subset of features
	fs = SelectKBest(score_func=mutual_info_regression, k=88)
	# learn relationship from training data
	fs.fit(X_train, y_train)
	# transform train input data
	X_train_fs = fs.transform(X_train)
	# transform test input data
	X_test_fs = fs.transform(X_test)
	return X_train_fs, X_test_fs, fs

The complete example of using mutual information for feature selection to fit a linear regression model is listed below.

# evaluation of a model using 88 features chosen with mutual information
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# feature selection
def select_features(X_train, y_train, X_test):
	# configure to select a subset of features
	fs = SelectKBest(score_func=mutual_info_regression, k=88)
	# learn relationship from training data
	fs.fit(X_train, y_train)
	# transform train input data
	X_train_fs = fs.transform(X_train)
	# transform test input data
	X_test_fs = fs.transform(X_test)
	return X_train_fs, X_test_fs, fs

# load the dataset
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=1)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# feature selection
X_train_fs, X_test_fs, fs = select_features(X_train, y_train, X_test)
# fit the model
model = LinearRegression()
model.fit(X_train_fs, y_train)
# evaluate the model
yhat = model.predict(X_test_fs)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

Running the example fits the model on the 88 top selected features chosen using mutual information.

Note: your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see a further reduction in error as compared to the correlation statistic, in this case, achieving a MAE of about 0.084 compared to 0.085 in the previous section.

MAE: 0.084

Tune the Number of Selected Features

In the previous example, we selected 88 features, but how do we know that is a good or best number of features to select?

Instead of guessing, we can systematically test a range of different numbers of selected features and discover which results in the best performing model. This is called a grid search, where the k argument to the SelectKBest class can be tuned.

It is good practice to evaluate model configurations on regression tasks using repeated k-fold cross-validation. We will use three repeats of 10-fold cross-validation via the RepeatedKFold class.

...
# define the evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

We can define a Pipeline that correctly prepares the feature selection transform on the training set and applies it to the train set and test set for each fold of the cross-validation.

In this case, we will use the mutual information statistical method for selecting features.

...
# define the pipeline to evaluate
model = LinearRegression()
fs = SelectKBest(score_func=mutual_info_regression)
pipeline = Pipeline(steps=[('sel',fs), ('lr', model)])

We can then define the grid of values to evaluate as 80 to 100.

Note that the grid is a dictionary mapping parameter names to the values to search, and given that we are using a Pipeline, we can access the SelectKBest object via the name we gave it, ‘sel‘, followed by the parameter name ‘k‘, separated by two underscores: ‘sel__k‘.
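If you are ever unsure of the exact parameter name, one way to check (a small sketch assuming the pipeline defined above) is to print the parameter names the pipeline exposes.

...
# list the tunable parameter names exposed by the pipeline
print(sorted(pipeline.get_params().keys()))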

...
# define the grid
grid = dict()
grid['sel__k'] = [i for i in range(X.shape[1]-20, X.shape[1]+1)]

We can then define and run the search.

In this case, we will evaluate models using the negative mean absolute error (neg_mean_absolute_error). It is negative because scikit-learn requires scores to be maximized, so the MAE is negated, meaning scores range from negative infinity to 0 (best).

...
# define the grid search
search = GridSearchCV(pipeline, grid, scoring='neg_mean_absolute_error', n_jobs=-1, cv=cv)
# perform the search
results = search.fit(X, y)

Tying this together, the complete example is listed below.

# compare different numbers of features selected using mutual information
from sklearn.datasets import make_regression
from sklearn.model_selection import RepeatedKFold
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
# define dataset
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=1)
# define the evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define the pipeline to evaluate
model = LinearRegression()
fs = SelectKBest(score_func=mutual_info_regression)
pipeline = Pipeline(steps=[('sel',fs), ('lr', model)])
# define the grid
grid = dict()
grid['sel__k'] = [i for i in range(X.shape[1]-20, X.shape[1]+1)]
# define the grid search
search = GridSearchCV(pipeline, grid, scoring='neg_mean_absolute_error', n_jobs=-1, cv=cv)
# perform the search
results = search.fit(X, y)
# summarize best
print('Best MAE: %.3f' % results.best_score_)
print('Best Config: %s' % results.best_params_)
# summarize all
means = results.cv_results_['mean_test_score']
params = results.cv_results_['params']
for mean, param in zip(means, params):
    print(">%.3f with: %r" % (mean, param))

Running the example grid searches different numbers of selected features using mutual information statistics, where each modeling pipeline is evaluated using repeated cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm and evaluating procedure. Try running the example a few times.

In this case, we can see that the best number of selected features is 81, which achieves a MAE of about 0.082 (ignoring the sign).

Best MAE: -0.082
Best Config: {'sel__k': 81}
>-1.100 with: {'sel__k': 80}
>-0.082 with: {'sel__k': 81}
>-0.082 with: {'sel__k': 82}
>-0.082 with: {'sel__k': 83}
>-0.082 with: {'sel__k': 84}
>-0.082 with: {'sel__k': 85}
>-0.082 with: {'sel__k': 86}
>-0.082 with: {'sel__k': 87}
>-0.082 with: {'sel__k': 88}
>-0.083 with: {'sel__k': 89}
>-0.083 with: {'sel__k': 90}
>-0.083 with: {'sel__k': 91}
>-0.083 with: {'sel__k': 92}
>-0.083 with: {'sel__k': 93}
>-0.083 with: {'sel__k': 94}
>-0.083 with: {'sel__k': 95}
>-0.083 with: {'sel__k': 96}
>-0.083 with: {'sel__k': 97}
>-0.083 with: {'sel__k': 98}
>-0.083 with: {'sel__k': 99}
>-0.083 with: {'sel__k': 100}

We might want to see the relationship between the number of selected features and MAE. In this relationship, we may expect that more features result in better performance, to a point.

This relationship can be explored by manually evaluating each configuration of k for the SelectKBest from 81 to 100, gathering the sample of MAE scores, and plotting the results using box and whisker plots side by side. The spread and mean of these box plots would be expected to show any interesting relationship between the number of selected features and the MAE of the pipeline.

Note that we started the spread of k values at 81 instead of 80 because the distribution of MAE scores for k=80 is dramatically larger than all other values of k considered and it washed out the plot of the results on the graph.

The complete example of achieving this is listed below.

# compare different numbers of features selected using mutual information
from numpy import mean
from numpy import std
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from matplotlib import pyplot
# define dataset
X, y = make_regression(n_samples=1000, n_features=100, n_informative=10, noise=0.1, random_state=1)
# define number of features to evaluate
num_features = [i for i in range(X.shape[1]-19, X.shape[1]+1)]
# enumerate each number of features
results = list()
for k in num_features:
	# create pipeline
	model = LinearRegression()
	fs = SelectKBest(score_func=mutual_info_regression, k=k)
	pipeline = Pipeline(steps=[('sel',fs), ('lr', model)])
	# evaluate the model
	cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
	scores = cross_val_score(pipeline, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
	results.append(scores)
	# summarize the results
	print('>%d %.3f (%.3f)' % (k, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=num_features, showmeans=True)
pyplot.show()

Running the example first reports the mean and standard deviation MAE for each number of selected features.

Your specific results may vary given the stochastic nature of the learning algorithm and evaluating procedure. Try running the example a few times.

In this case, reporting the mean and standard deviation of MAE is not very interesting, other than that values of k in the 80s appear slightly better than those in the 90s.

>81 -0.082 (0.006)
>82 -0.082 (0.006)
>83 -0.082 (0.006)
>84 -0.082 (0.006)
>85 -0.082 (0.006)
>86 -0.082 (0.006)
>87 -0.082 (0.006)
>88 -0.082 (0.006)
>89 -0.083 (0.006)
>90 -0.083 (0.006)
>91 -0.083 (0.006)
>92 -0.083 (0.006)
>93 -0.083 (0.006)
>94 -0.083 (0.006)
>95 -0.083 (0.006)
>96 -0.083 (0.006)
>97 -0.083 (0.006)
>98 -0.083 (0.006)
>99 -0.083 (0.006)
>100 -0.083 (0.006)

Box and whisker plots are created side by side showing the trend of k vs. MAE where the green triangle represents the mean and orange line represents the median of the distribution.

Box and Whisker Plots of MAE for Each Number of Selected Features Using Mutual Information


Summary

In this tutorial, you discovered how to perform feature selection with numerical input data for regression predictive modeling.

Specifically, you learned:

  • How to evaluate the importance of numerical input data using the correlation and mutual information statistics.
  • How to perform feature selection for numerical input data when fitting and evaluating a regression model.
  • How to tune the number of features selected in a modeling pipeline using a grid search.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.



How to Use StandardScaler and MinMaxScaler Transforms in Python


Many machine learning algorithms perform better when numerical input variables are scaled to a standard range.

This includes algorithms that use a weighted sum of the input, like linear regression, and algorithms that use distance measures, like k-nearest neighbors.

The two most popular techniques for scaling numerical data prior to modeling are normalization and standardization. Normalization scales each input variable separately to the range 0-1, which is the range for floating-point values where we have the most precision. Standardization scales each input variable separately by subtracting the mean (called centering) and dividing by the standard deviation to shift the distribution to have a mean of zero and a standard deviation of one.

In this tutorial, you will discover how to use scaler transforms to standardize and normalize numerical input variables for classification and regression.

After completing this tutorial, you will know:

  • Data scaling is a recommended pre-processing step when working with many machine learning algorithms.
  • Data scaling can be achieved by normalizing or standardizing real-valued input and output variables.
  • How to apply standardization and normalization to improve the performance of predictive modeling algorithms.

Let’s get started.

How to Use StandardScaler and MinMaxScaler Transforms
Photo by Marco Verch, some rights reserved.

Tutorial Overview

This tutorial is divided into six parts; they are:

  1. The Scale of Your Data Matters
  2. Numerical Data Scaling Methods
    1. Data Normalization
    2. Data Standardization
  3. Sonar Dataset
  4. MinMaxScaler Transform
  5. StandardScaler Transform
  6. Common Questions

The Scale of Your Data Matters

Machine learning models learn a mapping from input variables to an output variable.

As such, the scale and distribution of the data drawn from the domain may be different for each variable.

Input variables may have different units (e.g. feet, kilometers, and hours) that, in turn, may mean the variables have different scales.

Differences in the scales across input variables may increase the difficulty of the problem being modeled. An example of this is that large input values (e.g. a spread of hundreds or thousands of units) can result in a model that learns large weight values. A model with large weight values is often unstable, meaning that it may suffer from poor performance during learning and sensitivity to input values resulting in higher generalization error.

One of the most common forms of pre-processing consists of a simple linear rescaling of the input variables.

— Page 298, Neural Networks for Pattern Recognition, 1995.

This difference in scale for input variables does not affect all machine learning algorithms.

For example, algorithms that fit a model that use a weighted sum of input variables are affected, such as linear regression, logistic regression, and artificial neural networks (deep learning).

For example, when the distance or dot products between predictors are used (such as K-nearest neighbors or support vector machines) or when the variables are required to be a common scale in order to apply a penalty, a standardization procedure is essential.

— Page 124, Feature Engineering and Selection, 2019.

Also, algorithms that use distance measures between examples or exemplars are affected, such as k-nearest neighbors and support vector machines. There are also algorithms that are unaffected by the scale of numerical input variables, most notably decision trees and ensembles of trees, like random forest.

Different attributes are measured on different scales, so if the Euclidean distance formula were used directly, the effect of some attributes might be completely dwarfed by others that had larger scales of measurement. Consequently, it is usual to normalize all attribute values …

— Page 145, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

It can also be a good idea to scale the target variable for regression predictive modeling problems to make the problem easier to learn, most notably in the case of neural network models. A target variable with a large spread of values, in turn, may result in large error gradient values causing weight values to change dramatically, making the learning process unstable.

Scaling input and output variables is a critical step in using neural network models.

In practice, it is nearly always advantageous to apply pre-processing transformations to the input data before it is presented to a network. Similarly, the outputs of the network are often post-processed to give the required output values.

— Page 296, Neural Networks for Pattern Recognition, 1995.
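For scaling the target variable specifically, one convenient option in scikit-learn (a sketch, not used elsewhere in this tutorial) is the TransformedTargetRegressor class, which scales the target when fitting and inverts the transform when predicting.

# sketch: scaling the target variable for regression
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.compose import TransformedTargetRegressor
# toy regression problem with a wide-ranging numerical target
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=1)
# standardize y during fit; predictions are returned on the original scale
model = TransformedTargetRegressor(regressor=LinearRegression(), transformer=StandardScaler())
model.fit(X, y)
print(model.predict(X[:3]))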

Numerical Data Scaling Methods

Both normalization and standardization can be achieved using the scikit-learn library.

Let’s take a closer look at each in turn.

Data Normalization

Normalization is a rescaling of the data from the original range so that all values fall within the new range of 0 to 1.

Normalization requires that you know or are able to accurately estimate the minimum and maximum observable values. You may be able to estimate these values from your available data.

Attributes are often normalized to lie in a fixed range — usually from zero to one—by dividing all values by the maximum value encountered or by subtracting the minimum value and dividing by the range between the maximum and minimum values.

— Page 61, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

A value is normalized as follows:

  • y = (x – min) / (max – min)

Where the minimum and maximum values pertain to the value x being normalized.

For example, for a dataset, we could guesstimate the min and max observable values as 30 and -10. We can then normalize any value, like 18.8, as follows:

  • y = (x – min) / (max – min)
  • y = (18.8 – (-10)) / (30 – (-10))
  • y = 28.8 / 40
  • y = 0.72

You can see that if an x value is provided that is outside the bounds of the minimum and maximum values, the resulting value will not be in the range of 0 and 1. You could check for these observations prior to making predictions and either remove them from the dataset or limit them to the pre-defined maximum or minimum values.
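The sketch below repeats the worked example in code, using the guesstimated bounds of -10 and 30 from above and clipping out-of-bounds values to those bounds prior to scaling.

# sketch: manual normalization with clipping of out-of-bounds values
from numpy import asarray
from numpy import clip
# assumed (guesstimated) minimum and maximum observable values
data_min, data_max = -10.0, 30.0
# made-up values, including two that fall outside the bounds
x = asarray([18.8, -15.0, 42.0])
# limit values to the pre-defined bounds before scaling
x = clip(x, data_min, data_max)
# y = (x - min) / (max - min)
y = (x - data_min) / (data_max - data_min)
print(y)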

You can normalize your dataset using the scikit-learn object MinMaxScaler.

Good practice usage with the MinMaxScaler and other scaling techniques is as follows:

  • Fit the scaler using available training data. For normalization, this means the training data will be used to estimate the minimum and maximum observable values. This is done by calling the fit() function.
  • Apply the scale to training data. This means you can use the normalized data to train your model. This is done by calling the transform() function.
  • Apply the scale to data going forward. This means you can prepare new data in the future on which you want to make predictions.

The default scale for the MinMaxScaler is to rescale variables into the range [0,1], although a preferred scale can be specified via the “feature_range” argument, which takes a tuple of the desired min and max for all variables.
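The sketch below (on a made-up train/test split of a single variable) follows the good practice listed above: the scaler is fit on the training data only, and the same scale is then applied to both sets. The feature_range shown is the default and could be changed, e.g. to (-1, 1).

# sketch: fit the scaler on training data, apply it going forward
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler
# made-up train and test data for a single variable
X_train = asarray([[4.0], [8.0], [50.0], [88.0], [100.0]])
X_test = asarray([[60.0], [110.0]])
# define the scaler; feature_range=(0, 1) is the default
scaler = MinMaxScaler(feature_range=(0, 1))
# estimate min and max from the training data only
scaler.fit(X_train)
# apply the same scale to the train set and to new data
print(scaler.transform(X_train))
# note: 110.0 exceeds the training maximum and maps above 1.0
print(scaler.transform(X_test))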

We can demonstrate the usage of this class by converting two variables to the range 0-to-1, the default range for normalization. The first variable has values between about 4 and 100, and the second has values between about 0.001 and 0.1.

The complete example is listed below.

# example of a normalization
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler
# define data
data = asarray([[100, 0.001],
				[8, 0.05],
				[50, 0.005],
				[88, 0.07],
				[4, 0.1]])
print(data)
# define min max scaler
scaler = MinMaxScaler()
# transform data
scaled = scaler.fit_transform(data)
print(scaled)

Running the example first reports the raw dataset, showing 2 columns with 5 rows. The values are in scientific notation, which can be hard to read if you’re not used to it.

Next, the scaler is defined, fit on the whole dataset and then used to create a transformed version of the dataset with each column normalized independently. We can see that the largest raw value for each column now has the value 1.0 and the smallest value for each column now has the value 0.0.

[[1.0e+02 1.0e-03]
 [8.0e+00 5.0e-02]
 [5.0e+01 5.0e-03]
 [8.8e+01 7.0e-02]
 [4.0e+00 1.0e-01]]
[[1.         0.        ]
 [0.04166667 0.49494949]
 [0.47916667 0.04040404]
 [0.875      0.6969697 ]
 [0.         1.        ]]

Now that we are familiar with normalization, let’s take a closer look at standardization.

Data Standardization

Standardizing a dataset involves rescaling the distribution of values so that the mean of observed values is 0 and the standard deviation is 1.

This can be thought of as subtracting the mean value or centering the data.

Like normalization, standardization can be useful, and even required in some machine learning algorithms when your data has input values with differing scales.

Standardization assumes that your observations fit a Gaussian distribution (bell curve) with a well-behaved mean and standard deviation. You can still standardize your data if this expectation is not met, but you may not get reliable results.

Another […] technique is to calculate the statistical mean and standard deviation of the attribute values, subtract the mean from each value, and divide the result by the standard deviation. This process is called standardizing a statistical variable and results in a set of values whose mean is zero and standard deviation is one.

— Page 61, Data Mining: Practical Machine Learning Tools and Techniques, 2016.

Standardization requires that you know or are able to accurately estimate the mean and standard deviation of observable values. You may be able to estimate these values from your training data, not the entire dataset.

Again, it is emphasized that the statistics required for the transformation (e.g., the mean) are estimated from the training set and are applied to all data sets (e.g., the test set or new samples).

— Page 124, Feature Engineering and Selection, 2019.

Subtracting the mean from the data is called centering, whereas dividing by the standard deviation is called scaling. As such, the method is sometimes called “center scaling.”

The most straightforward and common data transformation is to center scale the predictor variables. To center a predictor variable, the average predictor value is subtracted from all the values. As a result of centering, the predictor has a zero mean. Similarly, to scale the data, each value of the predictor variable is divided by its standard deviation. Scaling the data coerce the values to have a common standard deviation of one.

— Page 30, Applied Predictive Modeling, 2013.

A value is standardized as follows:

  • y = (x – mean) / standard_deviation

Where the mean is calculated as:

  • mean = sum(x) / count(x)

And the standard_deviation is calculated as:

  • standard_deviation = sqrt( sum( (x – mean)^2 ) / count(x))

For example, we can guesstimate a mean of 10.0 and a standard deviation of about 5.0. Using these values, we can standardize a value such as 20.7 as follows:

  • y = (x – mean) / standard_deviation
  • y = (20.7 – 10) / 5
  • y = (10.7) / 5
  • y = 2.14
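The sketch below repeats this arithmetic in code on a small made-up sample, calculating the mean and standard deviation directly from the formulas above.

# sketch: standardization by hand using the formulas above
from numpy import asarray
# made-up sample of values
x = asarray([20.7, 1.2, 9.5, 14.1, 4.5])
# mean = sum(x) / count(x)
mean = x.sum() / len(x)
# standard_deviation = sqrt(sum((x - mean)^2) / count(x))
std = (((x - mean) ** 2).sum() / len(x)) ** 0.5
# y = (x - mean) / standard_deviation
y = (x - mean) / std
print(y)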

The mean and standard deviation estimates of a dataset can be more robust to new data than the minimum and maximum.

You can standardize your dataset using the scikit-learn object StandardScaler.

We can demonstrate the usage of this class on the same two variables defined in the previous section. We will use the default configuration that will both center and scale the values in each column, e.g. full standardization.

The complete example is listed below.

# example of a standardization
from numpy import asarray
from sklearn.preprocessing import StandardScaler
# define data
data = asarray([[100, 0.001],
				[8, 0.05],
				[50, 0.005],
				[88, 0.07],
				[4, 0.1]])
print(data)
# define standard scaler
scaler = StandardScaler()
# transform data
scaled = scaler.fit_transform(data)
print(scaled)

Running the example first reports the raw dataset, showing 2 columns with 5 rows as before.

Next, the scaler is defined, fit on the whole dataset, and then used to create a transformed version of the dataset with each column standardized independently. We can see that the values in each column are centered around a mean of 0.0 and include both positive and negative values.

[[1.0e+02 1.0e-03]
 [8.0e+00 5.0e-02]
 [5.0e+01 5.0e-03]
 [8.8e+01 7.0e-02]
 [4.0e+00 1.0e-01]]
[[ 1.26398112 -1.16389967]
 [-1.06174414  0.12639634]
 [ 0.         -1.05856939]
 [ 0.96062565  0.65304778]
 [-1.16286263  1.44302493]]

Next, we can introduce a real dataset that provides the basis for applying normalization and standardization transforms as a part of modeling.

Sonar Dataset

The sonar dataset is a standard machine learning dataset for binary classification.

It involves 60 real-valued inputs and a two-class target variable. There are 208 examples in the dataset and the classes are reasonably balanced.

A baseline classification algorithm can achieve a classification accuracy of about 53.4 percent using repeated stratified 10-fold cross-validation. Top performance on this dataset is about 88 percent using repeated stratified 10-fold cross-validation.

The dataset describes sonar returns of rocks or simulated mines.

You can learn more about the dataset from here:

No need to download the dataset; we will download it automatically as part of our worked examples.

First, let’s load and summarize the dataset. The complete example is listed below.

# load and summarize the sonar dataset
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv"
dataset = read_csv(url, header=None)
# summarize the shape of the dataset
print(dataset.shape)
# summarize each variable
print(dataset.describe())
# histograms of the variables
dataset.hist()
pyplot.show()

Running the example first summarizes the shape of the loaded dataset.

This confirms the 60 input variables, one output variable, and 208 rows of data.

A statistical summary of the input variables is provided showing that values are numeric and range approximately from 0 to 1.

(208, 61)
               0           1           2   ...          57          58          59
count  208.000000  208.000000  208.000000  ...  208.000000  208.000000  208.000000
mean     0.029164    0.038437    0.043832  ...    0.007949    0.007941    0.006507
std      0.022991    0.032960    0.038428  ...    0.006470    0.006181    0.005031
min      0.001500    0.000600    0.001500  ...    0.000300    0.000100    0.000600
25%      0.013350    0.016450    0.018950  ...    0.003600    0.003675    0.003100
50%      0.022800    0.030800    0.034300  ...    0.005800    0.006400    0.005300
75%      0.035550    0.047950    0.057950  ...    0.010350    0.010325    0.008525
max      0.137100    0.233900    0.305900  ...    0.044000    0.036400    0.043900

[8 rows x 60 columns]

Finally, a histogram is created for each input variable.

If we ignore the clutter of the plots and focus on the histograms themselves, we can see that many variables have a skewed distribution.

The dataset provides a good candidate for using scaler transforms as the variables have differing minimum and maximum values, as well as different data distributions.

Histogram Plots of Input Variables for the Sonar Binary Classification Dataset

Next, let’s fit and evaluate a machine learning model on the raw dataset.

We will use a k-nearest neighbor algorithm with default hyperparameters and evaluate it using repeated stratified k-fold cross-validation. The complete example is listed below.

# evaluate knn on the raw sonar dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from matplotlib import pyplot
# load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv"
dataset = read_csv(url, header=None)
data = dataset.values
# separate into input and output columns
X, y = data[:, :-1], data[:, -1]
# ensure inputs are floats and output is an integer label
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# define and configure the model
model = KNeighborsClassifier()
# evaluate the model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
# report model performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates a KNN model on the raw sonar dataset.

We can see that the model achieved a mean classification accuracy of about 79.7 percent, showing that it has skill (better than 53.4 percent) and is in the ball-park of good performance (88 percent).

Accuracy: 0.797 (0.073)

Next, let’s explore a scaling transform of the dataset.

MinMaxScaler Transform

We can apply the MinMaxScaler to the Sonar dataset directly to normalize the input variables.

We will use the default configuration, which scales values to the range 0 to 1. First, a MinMaxScaler instance is defined with default hyperparameters. Once defined, we can call the fit_transform() function and pass it our dataset to create a transformed version of our dataset.

...
# perform a min-max scaler transform of the dataset
trans = MinMaxScaler()
data = trans.fit_transform(data)

Let’s try it on our sonar dataset.

The complete example of creating a MinMaxScaler transform of the sonar dataset and plotting histograms of the result is listed below.

# visualize a minmax scaler transform of the sonar dataset
from pandas import read_csv
from pandas import DataFrame
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot
# load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv"
dataset = read_csv(url, header=None)
# retrieve just the numeric input values
data = dataset.values[:, :-1]
# perform a min-max scaler transform of the dataset
trans = MinMaxScaler()
data = trans.fit_transform(data)
# convert the array back to a dataframe
dataset = DataFrame(data)
# summarize
print(dataset.describe())
# histograms of the variables
dataset.hist()
pyplot.show()

Running the example first reports a summary of each input variable.

We can see that the distributions have been adjusted and that the minimum and maximum values for each variable are now a crisp 0.0 and 1.0 respectively.

0           1           2   ...          57          58          59
count  208.000000  208.000000  208.000000  ...  208.000000  208.000000  208.000000
mean     0.204011    0.162180    0.139068  ...    0.175035    0.216015    0.136425
std      0.169550    0.141277    0.126242  ...    0.148051    0.170286    0.116190
min      0.000000    0.000000    0.000000  ...    0.000000    0.000000    0.000000
25%      0.087389    0.067938    0.057326  ...    0.075515    0.098485    0.057737
50%      0.157080    0.129447    0.107753  ...    0.125858    0.173554    0.108545
75%      0.251106    0.202958    0.185447  ...    0.229977    0.281680    0.183025
max      1.000000    1.000000    1.000000  ...    1.000000    1.000000    1.000000

[8 rows x 60 columns]

Histogram plots of the variables are created, although the distributions don’t look much different from their original distributions seen in the previous section.

Histogram Plots of MinMaxScaler Transformed Input Variables for the Sonar Dataset

Next, let’s evaluate the same KNN model as the previous section, but in this case, on a MinMaxScaler transform of the dataset.

The complete example is listed below.

# evaluate knn on the sonar dataset with minmax scaler transform
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from matplotlib import pyplot
# load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv"
dataset = read_csv(url, header=None)
data = dataset.values
# separate into input and output columns
X, y = data[:, :-1], data[:, -1]
# ensure inputs are floats and output is an integer label
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# define the pipeline
trans = MinMaxScaler()
model = KNeighborsClassifier()
pipeline = Pipeline(steps=[('t', trans), ('m', model)])
# evaluate the pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
# report pipeline performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example, we can see that the MinMaxScaler transform results in a lift in performance from 79.7 percent accuracy without the transform to about 81.3 percent with the transform.

Accuracy: 0.813 (0.085)

Next, let’s explore the effect of standardizing the input variables.

StandardScaler Transform

We can apply the StandardScaler to the Sonar dataset directly to standardize the input variables.

We will use the default configuration, which subtracts the mean to center values on 0.0 and divides by the standard deviation to give a standard deviation of 1.0. First, a StandardScaler instance is defined with default hyperparameters.

Once defined, we can call the fit_transform() function and pass it our dataset to create a transformed version of our dataset.

...
# perform a standard scaler transform of the dataset
trans = StandardScaler()
data = trans.fit_transform(data)

Let’s try it on our sonar dataset.

The complete example of creating a StandardScaler transform of the sonar dataset and plotting histograms of the results is listed below.

# visualize a standard scaler transform of the sonar dataset
from pandas import read_csv
from pandas import DataFrame
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from matplotlib import pyplot
# load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv"
dataset = read_csv(url, header=None)
# retrieve just the numeric input values
data = dataset.values[:, :-1]
# perform a standard scaler transform of the dataset
trans = StandardScaler()
data = trans.fit_transform(data)
# convert the array back to a dataframe
dataset = DataFrame(data)
# summarize
print(dataset.describe())
# histograms of the variables
dataset.hist()
pyplot.show()

Running the example first reports a summary of each input variable.

We can see that the distributions have been adjusted and that the mean is a very small number close to zero and the standard deviation is very close to 1.0 for each variable.

0             1   ...            58            59
count  2.080000e+02  2.080000e+02  ...  2.080000e+02  2.080000e+02
mean  -4.190024e-17  1.663333e-16  ...  1.283695e-16  3.149190e-17
std    1.002413e+00  1.002413e+00  ...  1.002413e+00  1.002413e+00
min   -1.206158e+00 -1.150725e+00  ... -1.271603e+00 -1.176985e+00
25%   -6.894939e-01 -6.686781e-01  ... -6.918580e-01 -6.788714e-01
50%   -2.774703e-01 -2.322506e-01  ... -2.499546e-01 -2.405314e-01
75%    2.784345e-01  2.893335e-01  ...  3.865486e-01  4.020352e-01
max    4.706053e+00  5.944643e+00  ...  4.615037e+00  7.450343e+00

[8 rows x 60 columns]

Histogram plots of the variables are created, although the distributions don’t look much different from their original distributions seen in the previous section other than their scale on the x-axis.

Histogram Plots of StandardScaler Transformed Input Variables for the Sonar Dataset

Next, let’s evaluate the same KNN model as the previous section, but in this case, on a StandardScaler transform of the dataset.

The complete example is listed below.

# evaluate knn on the sonar dataset with standard scaler transform
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from matplotlib import pyplot
# load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv"
dataset = read_csv(url, header=None)
data = dataset.values
# separate into input and output columns
X, y = data[:, :-1], data[:, -1]
# ensure inputs are floats and output is an integer label
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# define the pipeline
trans = StandardScaler()
model = KNeighborsClassifier()
pipeline = Pipeline(steps=[('t', trans), ('m', model)])
# evaluate the pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
# report pipeline performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example, we can see that the StandardScaler transform results in a lift in performance from 79.7 percent accuracy without the transform to about 81.0 percent with the transform, although slightly lower than the result using the MinMaxScaler.

Accuracy: 0.810 (0.080)

Common Questions

This section lists some common questions and answers when scaling numerical data.

Q. Should I Normalize or Standardize?

Whether input variables require scaling depends on the specifics of your problem and of each variable.

You may have a sequence of quantities as inputs, such as prices or temperatures.

If the distribution of the quantity is normal, then it should be standardized; otherwise, the data should be normalized. This applies whether the range of quantity values is large (10s, 100s, etc.) or small (0.01, 0.0001).

If the quantity values are small (near 0-1) and the distribution is limited (e.g. standard deviation near 1), then perhaps you can get away with no scaling of the data.

These manipulations are generally used to improve the numerical stability of some calculations. Some models […] benefit from the predictors being on a common scale.

— Pages 30-31, Applied Predictive Modeling, 2013.

Predictive modeling problems can be complex, and it may not be clear how to best scale input data.

If in doubt, normalize the input sequence. If you have the resources, explore modeling with the raw data, standardized data, and normalized data and see if there is a beneficial difference in the performance of the resulting model.

If the input variables are combined linearly, as in an MLP [Multilayer Perceptron], then it is rarely strictly necessary to standardize the inputs, at least in theory. […] However, there are a variety of practical reasons why standardizing the inputs can make training faster and reduce the chances of getting stuck in local optima.

— Should I normalize/standardize/rescale the data?, Neural Nets FAQ
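
If you have the resources, this comparison is straightforward to run. Below is a minimal sketch that reuses the sonar dataset and KNN model from earlier and evaluates the raw, normalized, and standardized data with the same cross-validation procedure; your exact scores will vary with your data and model.

# compare raw, normalized, and standardized input data on the sonar dataset
from numpy import mean
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# load dataset and separate into input and output columns
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv"
data = read_csv(url, header=None).values
X, y = data[:, :-1].astype('float32'), LabelEncoder().fit_transform(data[:, -1].astype('str'))
# evaluate the same model with each data preparation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
for name, scaler in [('raw', None), ('normalized', MinMaxScaler()), ('standardized', StandardScaler())]:
    steps = [('m', KNeighborsClassifier())] if scaler is None else [('t', scaler), ('m', KNeighborsClassifier())]
    scores = cross_val_score(Pipeline(steps=steps), X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    print('%s: %.3f' % (name, mean(scores)))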

Q. Should I Standardize then Normalize?

Standardization can give values that are both positive and negative centered around zero.

It may be desirable to normalize data after it has been standardized.

This might be a good idea if you have a mixture of standardized and normalized variables and wish all input variables to have the same minimum and maximum values as input for a given algorithm, such as an algorithm that calculates distance measures.
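
As a minimal sketch of this idea, the two transforms can be chained in a Pipeline so the standardized output is rescaled into the range [0, 1]; the data here is contrived for illustration.

# standardize then normalize input variables
from numpy import asarray
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
# contrived data with two columns on very different scales
data = asarray([[1, 10000], [2, 20000], [3, 30000], [4, 40000], [5, 50000]], dtype='float64')
# standardize first, then rescale the standardized values to [0, 1]
pipeline = Pipeline(steps=[('std', StandardScaler()), ('norm', MinMaxScaler())])
print(pipeline.fit_transform(data))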

Q. But Which is Best?

This is unknowable.

Evaluate models on data prepared with each transform and use the transform or combination of transforms that results in the best performance for your dataset and model.

Q. How Do I Handle Out-of-Bounds Values?

You may normalize your data by calculating the minimum and maximum on the training data.

Later, you may have new data with values smaller or larger than the minimum or maximum respectively.

One simple approach to handling this may be to check for such out-of-bound values and change their values to the known minimum or maximum prior to scaling. Alternately, you may want to estimate the minimum and maximum values used in the normalization manually based on domain knowledge.
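
A minimal sketch of the clipping approach is listed below on contrived data; it relies on the data_min_ and data_max_ attributes learned by a MinMaxScaler fit on the training data.

# clip out-of-bounds values to the training minimum and maximum before scaling
from numpy import asarray
from numpy import clip
from sklearn.preprocessing import MinMaxScaler
# fit the scaler on the training data only
train = asarray([[1.0], [2.0], [3.0], [4.0], [5.0]])
scaler = MinMaxScaler()
scaler.fit(train)
# new data contains values outside the training range
new = asarray([[0.5], [2.5], [7.0]])
# clip to the known minimum and maximum prior to scaling
new = clip(new, scaler.data_min_, scaler.data_max_)
print(scaler.transform(new))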


Summary

In this tutorial, you discovered how to use scaler transforms to standardize and normalize numerical input variables for classification and regression.

Specifically, you learned:

  • Data scaling is a recommended pre-processing step when working with many machine learning algorithms.
  • Data scaling can be achieved by normalizing or standardizing real-valued input and output variables.
  • How to apply standardization and normalization to improve the performance of predictive modeling algorithms.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post How to Use StandardScaler and MinMaxScaler Transforms in Python appeared first on Machine Learning Mastery.

Ordinal and One-Hot Encodings for Categorical Data

Machine learning models require all input and output variables to be numeric.

This means that if your data contains categorical data, you must encode it to numbers before you can fit and evaluate a model.

The two most popular techniques are an Ordinal Encoding and a One-Hot Encoding.

In this tutorial, you will discover how to use encoding schemes for categorical machine learning data.

After completing this tutorial, you will know:

  • Encoding is a required pre-processing step when working with categorical data for machine learning algorithms.
  • How to use ordinal encoding for categorical variables that have a natural rank ordering.
  • How to use one-hot encoding for categorical variables that do not have a natural rank ordering.

Let’s get started.

Ordinal and One-Hot Encoding Transforms for Machine Learning
Photo by Felipe Valduga, some rights reserved.

Tutorial Overview

This tutorial is divided into six parts; they are:

  1. Nominal and Ordinal Variables
  2. Encoding Categorical Data
    1. Ordinal Encoding
    2. One-Hot Encoding
    3. Dummy Variable Encoding
  3. Breast Cancer Dataset
  4. OrdinalEncoder Transform
  5. OneHotEncoder Transform
  6. Common Questions

Nominal and Ordinal Variables

Numerical data, as its name suggests, involves features that are only composed of numbers, such as integers or floating-point values.

Categorical data are variables that contain label values rather than numeric values.

The number of possible values is often limited to a fixed set.

Categorical variables are often called nominal.

Some examples include:

  • A “pet” variable with the values: “dog” and “cat“.
  • A “color” variable with the values: “red“, “green“, and “blue“.
  • A “place” variable with the values: “first“, “second“, and “third“.

Each value represents a different category.

Some categories may have a natural relationship to each other, such as a natural ordering.

The “place” variable above does have a natural ordering of values. This type of categorical variable is called an ordinal variable because the values can be ordered or ranked.

A numerical variable can be converted to an ordinal variable by dividing the range of the numerical variable into bins and assigning values to each bin. For example, a numerical variable between 1 and 10 can be divided into an ordinal variable with 5 labels with an ordinal relationship: 1-2, 3-4, 5-6, 7-8, 9-10. This is called discretization.
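
As a minimal sketch of discretization (the bin count and strategy here are illustrative), the KBinsDiscretizer class in scikit-learn can map a numerical variable into ordinal integer labels.

# discretize a numerical variable into ordinal bins
from numpy import asarray
from sklearn.preprocessing import KBinsDiscretizer
# contrived numerical values between 1 and 10
data = asarray([[1.0], [2.2], [3.9], [5.5], [7.1], [9.8]])
# map the range into 5 equal-width bins labeled 0 to 4
trans = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
print(trans.fit_transform(data))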

  • Nominal Variable (Categorical). Variable comprises a finite set of discrete values with no relationship between values.
  • Ordinal Variable. Variable comprises a finite set of discrete values with a ranked ordering between values.

Some algorithms can work with categorical data directly.

For example, a decision tree can be learned directly from categorical data with no data transform required (this depends on the specific implementation).

Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric.

In general, this is mostly a constraint of the efficient implementation of machine learning algorithms rather than hard limitations on the algorithms themselves.

Some implementations of machine learning algorithms require all data to be numerical. For example, scikit-learn has this requirement.

This means that categorical data must be converted to a numerical form. If the categorical variable is an output variable, you may also want to convert predictions by the model back into a categorical form in order to present them or use them in some application.

Encoding Categorical Data

There are three common approaches for converting ordinal and categorical variables to numerical values. They are:

  • Ordinal Encoding
  • One-Hot Encoding
  • Dummy Variable Encoding

Let’s take a closer look at each in turn.

Ordinal Encoding

In ordinal encoding, each unique category value is assigned an integer value.

For example, “red” is 1, “green” is 2, and “blue” is 3.

This is called an ordinal encoding or an integer encoding and is easily reversible. Often, integer values starting at zero are used.

For some variables, an ordinal encoding may be enough. The integer values have a natural ordered relationship between each other and machine learning algorithms may be able to understand and harness this relationship.

It is a natural encoding for ordinal variables. For categorical variables, it imposes an ordinal relationship where no such relationship may exist. This can cause problems and a one-hot encoding may be used instead.

This ordinal encoding transform is available in the scikit-learn Python machine learning library via the OrdinalEncoder class.

By default, it will assign integers to labels in the order that is observed in the data. If a specific order is desired, it can be specified via the “categories” argument as a list with the rank order of all expected labels.

We can demonstrate the usage of this class by converting the color categories “red”, “green” and “blue” into integers. First, the categories are sorted, then numbers are applied. For strings, this means the labels are sorted alphabetically so that blue=0, green=1 and red=2.

The complete example is listed below.

# example of an ordinal encoding
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define ordinal encoding
encoder = OrdinalEncoder()
# transform data
result = encoder.fit_transform(data)
print(result)

Running the example first reports the 3 rows of label data, then the ordinal encoding.

We can see that the numbers are assigned to the labels as we expected.

[['red']
 ['green']
 ['blue']]
[[2.]
 [1.]
 [0.]]

This OrdinalEncoder class is intended for input variables that are organized into rows and columns, e.g. a matrix.

If a categorical target variable needs to be encoded for a classification predictive modeling problem, then the LabelEncoder class can be used. It does the same thing as the OrdinalEncoder, although it expects a one-dimensional input for the single target variable.
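
A minimal sketch of encoding a target variable is listed below, using labels from the breast cancer dataset used later in this tutorial; note that inverse_transform() can map integer predictions back to the original labels.

# encode a one-dimensional target variable with LabelEncoder
from sklearn.preprocessing import LabelEncoder
# example class labels
y = ['no-recurrence-events', 'recurrence-events', 'no-recurrence-events']
encoder = LabelEncoder()
print(encoder.fit_transform(y))
# map encoded predictions back to the original labels
print(encoder.inverse_transform([1, 0]))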

One-Hot Encoding

For categorical variables where no ordinal relationship exists, the integer encoding may not be enough, at best, or misleading to the model at worst.

Forcing an ordinal relationship via an ordinal encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).

In this case, a one-hot encoding can be applied to the ordinal representation. This is where the integer encoded variable is removed and one new binary variable is added for each unique integer value in the variable.

Each bit represents a possible category. If the variable cannot belong to multiple categories at once, then only one bit in the group can be “on.” This is called one-hot encoding …

— Page 78, Feature Engineering for Machine Learning, 2018.

In the “color” variable example, there are three categories, and, therefore, three binary variables are needed. A “1” value is placed in the binary variable for the color and “0” values for the other colors.

This one-hot encoding transform is available in the scikit-learn Python machine learning library via the OneHotEncoder class.

We can demonstrate the usage of the OneHotEncoder on the color categories. First, the categories are sorted, in this case alphabetically because they are strings, then binary variables are created for each category in turn. This means blue will be represented as [1, 0, 0] with a “1” for the first binary variable, then green, then finally red.

The complete example is listed below.

# example of a one hot encoding
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define one hot encoding
encoder = OneHotEncoder(sparse=False)
# transform data
onehot = encoder.fit_transform(data)
print(onehot)

Running the example first lists the three rows of label data, then the one hot encoding matching our expectation of 3 binary variables in the order “blue”, “green” and “red”.

[['red']
 ['green']
 ['blue']]
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]

If you know all of the labels to be expected in the data, they can be specified via the “categories” argument as a list.

If you do not specify the list of labels, the encoder is fit on the training dataset, which likely contains at least one example of every expected label for each categorical variable. If new data contains categories not seen in the training dataset, the “handle_unknown” argument can be set to “ignore” to avoid raising an error, in which case an unknown category is encoded as all-zero values.
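
A minimal sketch of this behavior is below; the unseen “yellow” label is contrived for illustration and encodes as an all-zero vector.

# ignore categories not seen during training
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
# fit the encoder on the known categories only
train = asarray([['red'], ['green'], ['blue']])
encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
encoder.fit(train)
# 'yellow' was not seen during fit, so it is encoded as all zeros
print(encoder.transform(asarray([['yellow'], ['green']])))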

Dummy Variable Encoding

The one-hot encoding creates one binary variable for each category.

The problem is that this representation includes redundancy. For example, if we know that [1, 0, 0] represents “blue” and [0, 1, 0] represents “green” we don’t need another binary variable to represent “red“, instead we could use 0 values for both “blue” and “green” alone, e.g. [0, 0].

This is called a dummy variable encoding, and always represents C categories with C-1 binary variables.

When there are C possible values of the predictor and only C – 1 dummy variables are used, the matrix inverse can be computed and the contrast method is said to be a full rank parameterization

— Page 95, Feature Engineering and Selection, 2019.

In addition to being slightly less redundant, a dummy variable representation is required for some models.

For example, in the case of a linear regression model (and other regression models that have a bias term), a one hot encoding will cause the matrix of input data to become singular, meaning it cannot be inverted and the linear regression coefficients cannot be calculated using linear algebra. For these types of models, a dummy variable encoding must be used instead.

If the model includes an intercept and contains dummy variables […], then the […] columns would add up (row-wise) to the intercept and this linear combination would prevent the matrix inverse from being computed (as it is singular).

— Page 95, Feature Engineering and Selection, 2019.

We rarely encounter this problem in practice when evaluating machine learning algorithms, unless we are using linear regression of course.

… there are occasions when a complete set of dummy variables is useful. For example, the splits in a tree-based model are more interpretable when the dummy variables encode all the information for that predictor. We recommend using the full set of dummy variables when working with tree-based models.

— Page 56, Applied Predictive Modeling, 2013.

We can use the OneHotEncoder class to implement a dummy encoding as well as a one hot encoding.

The “drop” argument can be set to indicate which category will become the one that is assigned all zero values, called the “baseline”. We can set this to “first” so that the first category is used. When the labels are sorted alphabetically, the “blue” label will be first and will become the baseline.

There will always be one fewer dummy variable than the number of levels. The level with no dummy variable […] is known as the baseline.

— Page 86, An Introduction to Statistical Learning with Applications in R, 2014.

We can demonstrate this with our color categories. The complete example is listed below.

# example of a dummy variable encoding
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
# define data
data = asarray([['red'], ['green'], ['blue']])
print(data)
# define one hot encoding
encoder = OneHotEncoder(drop='first', sparse=False)
# transform data
onehot = encoder.fit_transform(data)
print(onehot)

Running the example first lists the three rows for the categorical variable, then the dummy variable encoding, showing that green is “encoded” as [1, 0], “red” is encoded as [0, 1] and “blue” is encoded as [0, 0] as we specified.

[['red']
 ['green']
 ['blue']]
[[0. 1.]
 [1. 0.]
 [0. 0.]]

Now that we are familiar with the three approaches for encoding categorical variables, let’s look at a dataset that has categorical variables.

Breast Cancer Dataset

As the basis of this tutorial, we will use the “Breast Cancer” dataset that has been widely studied in machine learning since the 1980s.

The dataset classifies breast cancer patient data as either a recurrence or no recurrence of cancer. There are 286 examples and nine input variables. It is a binary classification problem.

A reasonable classification accuracy score on this dataset is between 68 percent and 73 percent. We will aim for this region, but note that the models in this tutorial are not optimized: they are designed to demonstrate encoding schemes.

No need to download the dataset as we will access it directly from the code examples.

Looking at the data, we can see that all nine input variables are categorical.

Specifically, all variables are quoted strings. Some variables show an obvious ordinal relationship for ranges of values (like age ranges), and some do not.

'40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
'50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
'50-59','ge40','35-39','0-2','no','2','left','left_low','no','recurrence-events'
'40-49','premeno','35-39','0-2','yes','3','right','left_low','yes','no-recurrence-events'
'40-49','premeno','30-34','3-5','yes','2','left','right_up','no','recurrence-events'
...

Note that this dataset has missing values marked with a “nan” value.

We will leave these values as-is in this tutorial and use the encoding schemes to encode “nan” as just another value. This is one possible and quite reasonable approach to handling missing values for categorical variables.
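
A minimal contrived sketch of this approach: casting the array to strings turns NaN into the label 'nan', which the encoder then treats as just another category.

# treat missing values as another category by casting to string
from numpy import asarray
from numpy import nan
from sklearn.preprocessing import OrdinalEncoder
# contrived column with a missing value
data = asarray([['yes'], ['no'], [nan]], dtype=object)
# casting to string turns NaN into the label 'nan'
data = data.astype(str)
print(OrdinalEncoder().fit_transform(data))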

We can load this dataset into memory using the Pandas library.

...
# load the dataset
dataset = read_csv(url, header=None)
# retrieve the array of data
data = dataset.values

Once loaded, we can split the columns into input (X) and output (y) for modeling.

...
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)

Tying this together, the complete example of loading and summarizing the raw categorical dataset is listed below.

# load and summarize the dataset
from pandas import read_csv
# define the location of the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv"
# load the dataset
dataset = read_csv(url, header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# summarize
print('Input', X.shape)
print('Output', y.shape)

Running the example reports the size of the input and output elements of the dataset.

We can see that we have 286 examples and nine input variables.

Input (286, 9)
Output (286,)

Now that we are familiar with the dataset, let’s look at how we can encode it for modeling.

OrdinalEncoder Transform

An ordinal encoding involves mapping each unique label to an integer value.

This type of encoding is really only appropriate if there is a known relationship between the categories. This relationship does exist for some of the variables in our dataset, and ideally, this should be harnessed when preparing the data.

In this case, we will ignore any possible existing ordinal relationship and assume all variables are categorical. It can still be helpful to use an ordinal encoding, at least as a point of reference with other encoding schemes.

We can use the OrdinalEncoder from scikit-learn to encode each variable to integers. This is a flexible class and does allow the order of the categories to be specified as arguments if any such order is known.

Note: I will leave it as an exercise for you to update the example below to try specifying the order for those variables that have a natural ordering and see if it has an impact on model performance.

Once defined, we can call the fit_transform() function and pass it our dataset to create an ordinal encoded version of our dataset.

...
# ordinal encode input variables
ordinal = OrdinalEncoder()
X = ordinal.fit_transform(X)

We can also prepare the target in the same manner.

...
# ordinal encode target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

Let’s try it on our breast cancer dataset.

The complete example of creating an ordinal encoding transform of the breast cancer dataset and summarizing the result is listed below.

# ordinal encode the breast cancer dataset
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
# define the location of the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv"
# load the dataset
dataset = read_csv(url, header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# ordinal encode input variables
ordinal_encoder = OrdinalEncoder()
X = ordinal_encoder.fit_transform(X)
# ordinal encode target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
# summarize the transformed data
print('Input', X.shape)
print(X[:5, :])
print('Output', y.shape)
print(y[:5])

Running the example transforms the dataset and reports the shape of the resulting dataset.

We would expect the number of rows, and in this case, the number of columns, to be unchanged, except all string values are now integer values.

As expected, in this case, we can see that the number of variables is unchanged, but all values are now ordinal encoded integers.

Input (286, 9)
[[2. 2. 2. 0. 1. 2. 1. 2. 0.]
 [3. 0. 2. 0. 0. 0. 1. 0. 0.]
 [3. 0. 6. 0. 0. 1. 0. 1. 0.]
 [2. 2. 6. 0. 1. 2. 1. 1. 1.]
 [2. 2. 5. 4. 1. 1. 0. 4. 0.]]
Output (286,)
[1 0 1 0 1]

Next, let’s evaluate machine learning on this dataset with this encoding.

The best practice when encoding variables is to fit the encoding on the training dataset, then apply it to the train and test datasets.

We will first split the dataset, then prepare the encoding on the training set, and apply it to the test set.

...
# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

We can then fit the OrdinalEncoder on the training dataset and use it to transform the train and test datasets.

...
# ordinal encode input variables
ordinal_encoder = OrdinalEncoder()
ordinal_encoder.fit(X_train)
X_train = ordinal_encoder.transform(X_train)
X_test = ordinal_encoder.transform(X_test)

The same approach can be used to prepare the target variable. We can then fit a logistic regression algorithm on the training dataset and evaluate it on the test dataset.

The complete example is listed below.

# evaluate logistic regression on the breast cancer dataset with an ordinal encoding
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import accuracy_score
# define the location of the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv"
# load the dataset
dataset = read_csv(url, header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# ordinal encode input variables
ordinal_encoder = OrdinalEncoder()
ordinal_encoder.fit(X_train)
X_train = ordinal_encoder.transform(X_train)
X_test = ordinal_encoder.transform(X_test)
# ordinal encode target variable
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
# define the model
model = LogisticRegression()
# fit on the training set
model.fit(X_train, y_train)
# predict on test set
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy*100))

Running the example prepares the dataset in the correct manner, then evaluates a model fit on the transformed data.

Your specific results may differ given the stochastic nature of the algorithm and evaluation procedure.

In this case, the model achieved a classification accuracy of about 75.79 percent, which is a reasonable score.

Accuracy: 75.79

Next, let’s take a closer look at the one-hot encoding.

OneHotEncoder Transform

A one-hot encoding is appropriate for categorical data where no relationship exists between categories.

The scikit-learn library provides the OneHotEncoder class to automatically one hot encode one or more variables.

By default the OneHotEncoder will output data with a sparse representation, which is efficient given that most values are 0 in the encoded representation. We will disable this feature by setting the “sparse” argument to False so that we can review the effect of the encoding.

Once defined, we can call the fit_transform() function and pass it our dataset to create a one-hot encoded version of our dataset.

...
# one hot encode input variables
onehot_encoder = OneHotEncoder(sparse=False)
X = onehot_encoder.fit_transform(X)

As before, we must label encode the target variable.

The complete example of creating a one-hot encoding transform of the breast cancer dataset and summarizing the result is listed below.

# one-hot encode the breast cancer dataset
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# define the location of the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv"
# load the dataset
dataset = read_csv(url, header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# one hot encode input variables
onehot_encoder = OneHotEncoder(sparse=False)
X = onehot_encoder.fit_transform(X)
# ordinal encode target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
# summarize the transformed data
print('Input', X.shape)
print(X[:5, :])

Running the example transforms the dataset and reports the shape of the resulting dataset.

We would expect the number of rows to remain the same, but the number of columns to dramatically increase.

As expected, in this case, we can see that the number of variables has leaped up from 9 to 43 and all values are now binary values 0 or 1.

Input (286, 43)
[[0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1.]
 [0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0.]]

Next, let’s evaluate machine learning on this dataset with this encoding as we did in the previous section.

The encoding is fit on the training set then applied to both train and test sets as before.

...
# one-hot encode input variables
onehot_encoder = OneHotEncoder()
onehot_encoder.fit(X_train)
X_train = onehot_encoder.transform(X_train)
X_test = onehot_encoder.transform(X_test)

Tying this together, the complete example is listed below.

# evaluate logistic regression on the breast cancer dataset with a one-hot encoding
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score
# define the location of the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv"
# load the dataset
dataset = read_csv(url, header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# one-hot encode input variables
onehot_encoder = OneHotEncoder()
onehot_encoder.fit(X_train)
X_train = onehot_encoder.transform(X_train)
X_test = onehot_encoder.transform(X_test)
# ordinal encode target variable
label_encoder = LabelEncoder()
label_encoder.fit(y_train)
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
# define the model
model = LogisticRegression()
# fit on the training set
model.fit(X_train, y_train)
# predict on test set
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.2f' % (accuracy*100))

Running the example prepares the dataset in the correct manner, then evaluates a model fit on the transformed data.

Your specific results may differ given the stochastic nature of the algorithm and evaluation procedure.

In this case, the model achieved a classification accuracy of about 70.53 percent, which is slightly worse than the ordinal encoding in the previous section.

Accuracy: 70.53

Common Questions

This section lists some common questions and answers when encoding categorical data.

Q. What if I have a mixture of numeric and categorical data?

Or, what if I have a mixture of categorical and ordinal data?

You will need to prepare or encode each variable (column) in your dataset separately, then concatenate all of the prepared variables back together into a single array for fitting or evaluating the model.

Alternately, you can use the ColumnTransformer to conditionally apply different data transforms to different input variables.
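
A minimal sketch of the ColumnTransformer approach is below; the column indices and data are contrived for illustration.

# apply different transforms to categorical and numeric columns
from numpy import asarray
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
# contrived rows: one categorical column and one numeric column
X = asarray([['red', 10.0], ['green', 20.0], ['blue', 15.0]], dtype=object)
# one-hot encode column 0 and scale column 1
transformer = ColumnTransformer(transformers=[('cat', OneHotEncoder(sparse=False), [0]), ('num', MinMaxScaler(), [1])])
print(transformer.fit_transform(X))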

Q. What if I have hundreds of categories?

Or, what if I concatenate many one-hot encoded vectors to create a many-thousand-element input vector?

You can use a one-hot encoding on variables with thousands, even tens of thousands, of categories. Having large vectors as input may sound intimidating, but models can generally handle them.
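
For example (a minimal contrived sketch), the default sparse output of the OneHotEncoder stores only the non-zero entries, so even a 10,000-category variable remains cheap to encode.

# one-hot encode a high-cardinality variable with a sparse result
from numpy import arange
from sklearn.preprocessing import OneHotEncoder
# a contrived variable with 10,000 unique categories
data = arange(10000).astype(str).reshape((-1, 1))
# sparse output is the default
encoder = OneHotEncoder()
onehot = encoder.fit_transform(data)
# a 10,000 x 10,000 matrix with only one non-zero value per row
print(onehot.shape, onehot.nnz)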

Q. What encoding technique is the best?

This is unknowable.

Test each technique (and more) on your dataset with your chosen model and discover what works best for your case.


Summary

In this tutorial, you discovered how to use encoding schemes for categorical machine learning data.

Specifically, you learned:

  • Encoding is a required pre-processing step when working with categorical data for machine learning algorithms.
  • How to use ordinal encoding for categorical variables that have a natural rank ordering.
  • How to use one-hot encoding for categorical variables that do not have a natural rank ordering.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Ordinal and One-Hot Encodings for Categorical Data appeared first on Machine Learning Mastery.

Why Data Preparation Is So Important in Machine Learning

On a predictive modeling project, machine learning algorithms learn a mapping from input variables to a target variable.

The most common form of predictive modeling project involves so-called structured data or tabular data. This is data as it looks in a spreadsheet or a matrix, with rows of examples and columns of features for each example.

We cannot fit and evaluate machine learning algorithms on raw data; instead, we must transform the data to meet the requirements of individual machine learning algorithms. More than that, we must choose a representation for the data that best exposes the unknown underlying structure of the prediction problem to the learning algorithms in order to get the best performance given our available resources on a predictive modeling project.

Given that we have standard implementations of highly parameterized machine learning algorithms in open source libraries, fitting models has become routine. As such, the most challenging part of each predictive modeling project is how to prepare the one thing that is unique to the project: the data used for modeling.

In this tutorial, you will discover the importance of data preparation for each machine learning project.

After completing this tutorial, you will know:

  • Structured data in machine learning consists of rows and columns in one large table.
  • Data preparation is a required step in each machine learning project.
  • The routineness of machine learning algorithms means the majority of effort on each project is spent on data preparation.

Let’s get started.

Why Data Preparation Is So Important in Machine Learning
Photo by lwtt93, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. What Is Data in Machine Learning
  2. Raw Data Must Be Prepared
    1. Machine Learning Algorithms Expect Numbers
    2. Machine Learning Algorithms Have Requirements
    3. Model Performance Depends on Data
  3. Predictive Modeling Is Mostly Data Preparation

What Is Data in Machine Learning

Predictive modeling projects involve learning from data.

Data refers to examples or cases from the domain that characterize the problem you want to solve. In supervised learning, data is composed of examples where each example has an input element that will be provided to a model and an output or target element that the model is expected to predict.

What we call data are observations of real-world phenomena. […] Each piece of data provides a small window into a limited aspect of reality.

— Page 1, Feature Engineering for Machine Learning, 2018.

Classification is an example of a supervised learning problem where the target is a label, and regression is an example of a supervised learning problem where the target is a number.

The input data may have many forms, such as an image, time series, text, video, and so on. The most common type of input data is typically referred to as tabular data or structured data. This is data as you might see it in a spreadsheet, in a database, or in a comma separated variable (CSV) file. This is the type of data that we will focus on.

Think of a large table of data. In linear algebra, we refer to this table of data as a matrix. The table is composed of rows and columns. A row represents one example from the problem domain, and may be referred to as an “example“, an “instance“, or a “case“. A column represents the properties observed about the example and may be referred to as a “variable“, a “feature“, or an “attribute“.

  • Row. A single example from the domain, often called an instance or example in machine learning.
  • Column. A single property recorded for each example, often called a variable or feature in machine learning.

For example, the columns used for input to the model are referred to as input variables, and the column that contains the target to be predicted is referred to as the output variable. The rows used to train a model are referred to as the training dataset and the rows used to evaluate the model are referred to as the test dataset.

  • Input Variables: Columns in the dataset provided to a model in order to make a prediction.
  • Output Variable. Column in the dataset to be predicted by a model.

When you collect your data, you may have to transform it so it forms one large table.

For example, if you have your data in a relational database, it is common to represent entities in separate tables in what is referred to as a “normal form” so that redundancy is minimized. In order to create one large table with one row per “subject” or “entity” that you want to model, you may need to reverse this process and introduce redundancy in the data in a process referred to as denormalization.

If your data is in a spreadsheet or database, it is standard practice to extract and save the data in CSV format. This is a standard representation that is portable, well understood, and ready for the predictive modeling process with no external dependencies.

Now that we are familiar with structured data, let’s look at why we need to prepare the data before we can use it in a model.

Raw Data Must Be Prepared

Data collected from your domain is referred to as raw data and is collected in the context of a problem you want to solve.

This means you must first define what you want to predict, then gather the data that you think will help you best make the predictions. This data collection exercise often requires a domain expert and may require many iterations of collecting more data, both in terms of new rows of data once they become available and new columns once identified as likely relevant to making a prediction.

  • Raw data: Data in the form provided from the domain.

In almost all cases, raw data will need to be changed before you can use it as the basis for modeling with machine learning.

A feature is a numeric representation of an aspect of raw data. Features sit between data and models in the machine learning pipeline. Feature engineering is the act of extracting features from raw data and transforming them into formats that are suitable for the machine learning model.

— Page vii, Feature Engineering for Machine Learning, 2018.

The cases with no data preparation are so rare or so trivial that it is practically a rule to prepare raw data in every machine learning project.

There are three main reasons why you must prepare raw data in a machine learning project.

Let’s take a look at each in turn.

1. Machine Learning Algorithms Expect Numbers

Even though your data is represented in one large table of rows and columns, the variables in the table may have different data types.

Some variables may be numeric, such as integers, floating-point values, ranks, rates, percentages, and so on. Other variables may be names, categories, or labels represented with characters or words, and some may be binary, represented with 0 and 1 or True and False.

The problem is, machine learning algorithms at their core operate on numeric data. They take numbers as input and predict a number as output. All data is seen as vectors and matrices, using the terminology from linear algebra.

As such, raw data must be changed prior to training, evaluating, and using machine learning models.

Sometimes the changes to the data can be managed internally by the machine learning algorithm; most commonly, this must be handled by the machine learning practitioner prior to modeling in what is commonly referred to as “data preparation” or “data pre-processing“.

2. Machine Learning Algorithms Have Requirements

Even if your raw data contains only numbers, some data preparation is likely required.

There are many different machine learning algorithms to choose from for a given predictive modeling project. We cannot know which algorithm will be appropriate, let alone the most appropriate for our task. Therefore, it is a good practice to evaluate a suite of different candidate algorithms systematically and discover what works well or best on our data.

The problem is, each algorithm has specific requirements or expectations with regard to the data.

… data preparation can make or break a model’s predictive ability. Different models have different sensitivities to the type of predictors in the model; how the predictors enter the model is also important.

— Page 27, Applied Predictive Modeling, 2013.

For example, some algorithms assume each input variable, and perhaps the target variable, to have a specific probability distribution. This is often the case for linear machine learning models that expect each numeric input variable to have a Gaussian probability distribution.

This means that if you have input variables that are not Gaussian or nearly Gaussian, you might need to change them so that they are Gaussian or more Gaussian. Alternatively, it may encourage you to reconfigure the algorithm to have a different expectation on the data.
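
For example (a minimal sketch on contrived skewed data), a power transform such as the Yeo-Johnson method in scikit-learn's PowerTransformer can make a variable more Gaussian.

# make a skewed variable more Gaussian with a power transform
from numpy.random import default_rng
from sklearn.preprocessing import PowerTransformer
# contrived right-skewed data
rng = default_rng(1)
data = rng.exponential(scale=2.0, size=(1000, 1))
# the yeo-johnson method also handles zero and negative values
trans = PowerTransformer(method='yeo-johnson')
transformed = trans.fit_transform(data)
# the result is approximately zero mean and unit variance
print(transformed.mean(), transformed.std())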

Some algorithms are known to perform worse if there are input variables that are irrelevant or redundant to the target variable. There are also algorithms that are negatively impacted if two or more input variables are highly correlated. In these cases, irrelevant or highly correlated variables may need to be identified and removed, or alternate algorithms may need to be used.
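
One simple way to identify such variables (a minimal sketch; the 0.95 threshold and contrived data are illustrative) is to compute pairwise correlations and drop one variable from each highly correlated pair.

# drop one variable from each pair of highly correlated input variables
from numpy import hstack
from numpy import ones
from numpy import triu
from numpy.random import default_rng
from pandas import DataFrame
# contrived data: five random columns plus a near-duplicate of the first
rng = default_rng(1)
X = rng.standard_normal((100, 5))
X = hstack((X, X[:, :1] + rng.normal(scale=0.01, size=(100, 1))))
df = DataFrame(X)
# absolute correlations, keeping the upper triangle only
corr = df.corr().abs()
upper = corr.where(triu(ones(corr.shape), k=1).astype(bool))
# drop any column correlated above the threshold with an earlier column
to_drop = [column for column in upper.columns if (upper[column] > 0.95).any()]
print('Dropping columns:', to_drop)
df = df.drop(columns=to_drop)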

There are also algorithms that have very few requirements about the probability distribution of input variables or the presence of redundancies, but in turn, may require many more examples (rows) in order to learn how to make good predictions.

The need for data pre-processing is determined by the type of model being used. Some procedures, such as tree-based models, are notably insensitive to the characteristics of the predictor data. Others, like linear regression, are not.

— Page 27, Applied Predictive Modeling, 2013.

As such, there is an interplay between the data and the choice of algorithms. Primarily, the algorithms impose expectations on the data, and adherence to these expectations requires the data to be appropriately prepared. Conversely, the form of the data may help choose algorithms to evaluate that are more likely to be effective.

3. Model Performance Depends on Data

Even if you prepare your data to meet the expectations of each model, you may not get the best performance.

Often, the performance of machine learning algorithms that have strong expectations degrades gracefully to the degree that the expectation is violated.

Further, it is common for an algorithm to perform well or better than other methods, even when its expectations have been ignored or completely violated. It is a common enough situation that this must be factored into the preparation and evaluation of machine learning algorithms.

The idea that there are different ways to represent predictors in a model, and that some of these representations are better than others, leads to the idea of feature engineering — the process of creating representations of data that increase the effectiveness of a model.

— Page 3, Feature Engineering and Selection, 2019.

The performance of a machine learning algorithm is only as good as the data used to train it. This is often summarized as “garbage in, garbage out“. Garbage is harsh, but it could mean a “weak representation” of the problem that insufficiently captures the dynamics required to learn how to map examples of inputs to outputs.

Let’s take for granted that we have “sufficient” data to capture the relationship between input and output variables. It’s a slippery and domain-specific principle, and in practice, we have the data that we have, and our job is to do the best we can with that data.

A dataset may be a “weak representation” of the problem we are trying to solve for many reasons, although there are two main classes of reason. It may be because complex nonlinear relationships are compressed in the raw data that can be unpacked using data preparation techniques. It may also be because the data is not perfect, ranging from mild random fluctuations in the observations, referred to as statistical noise, to errors that result in out-of-range values and conflicting data.

  • Complex Data: Raw data contains compressed complex nonlinear relationships that may need to be exposed.
  • Messy Data: Raw data contains statistical noise, errors, missing values, and conflicting examples.

We can think about getting the most out of our predictive modeling project in two ways: focus on the model and focus on the data.

We could minimally prepare the raw data and begin modeling. This puts full onus on the model to tease out the relationships in the data and learn the mapping function from inputs to outputs as best it can. This may be a reasonable path through a project and may require a large dataset and a flexible and powerful machine learning algorithm with few expectations, such as random forest or gradient boosting.

Alternately, we could push the onus back onto the data and the data preparation process. This requires that each row of data maximally or best expresses the information content of the data for modeling. Just like denormalization of data in a relational database to rows and columns, data preparation can denormalize the complex structure inherent in each single observation. This is also a reasonable path. It may require more knowledge of the data than is available, but it allows good or even best modeling performance to be achieved almost irrespective of the machine learning algorithm used.

Often a balance between these approaches is pursued on any given project. That is both exploring powerful and flexible machine learning algorithms and using data preparation to best expose the structure of the data to the learning algorithms.

This is all to say, data preprocessing is a path to better data, and in turn, better model performance.

Predictive Modeling Is Mostly Data Preparation

Modeling data with machine learning algorithms has become routine.

The vast majority of the common, popular, and widely used machine learning algorithms are decades old. Linear regression is more than 100 years old.

That is to say, most algorithms are well understood and well parameterized and there are standard definitions and implementations available in open source software, like the scikit-learn machine learning library in Python.

Although the algorithms are well understood operationally, most don’t have satisfying theories about why they work or how to map algorithms to problems. This is why each predictive modeling project is empirical rather than theoretical, requiring a process of systematic experimentation of algorithms on data.

Given that machine learning algorithms are routine for the most part, the one thing that changes from project to project is the specific data used in the modeling.

Data quality is one of the most important problems in data management, since dirty data often leads to inaccurate data analytics results and incorrect business decisions.

— Page xiii, Data Cleaning, 2019.

If you have collected data for a classification or regression predictive modeling problem, it may be the first time ever, in all of history, that the problem has been modeled. You are breaking new ground. That is not to say that the class of problems has not been tackled before; it probably has and you can learn from what was found if results were published. But it is today that your specific collection of observations makes your predictive modeling problem unique.

As such, the majority of your project will be spent on the data. Gathering data, verifying data, cleaning data, visualizing data, transforming data, and so on.

… it has been stated that up to 80% of data analysis is spent on the process of cleaning and preparing data. However, being a prerequisite to the rest of the data analysis workflow (visualization, modeling, reporting), it’s essential that you become fluent and efficient in data wrangling techniques.

— Page v, Data Wrangling with R, 2016.

Your job is to discover how to best expose the learning algorithms to the unknown underlying structure of your prediction problem. The path to get there is through data preparation.

In order for you to be an effective machine learning practitioner, you must know:

  • The different types of data preparation to consider on a project.
  • The top few algorithms for each class of data preparation technique.
  • When to use and how to configure top data preparation techniques.

This is often hard-earned knowledge, as there are few resources dedicated to the topic. Instead, you often must scour literature on statistics and application papers to get an idea of what’s available and how to use it.

Practitioners agree that the vast majority of time in building a machine learning pipeline is spent on feature engineering and data cleaning. Yet, despite its importance, the topic is rarely discussed on its own.

— Page vii, Feature Engineering for Machine Learning, 2018.


Summary

In this tutorial, you discovered the importance of data preparation for each machine learning project.

Specifically, you learned:

  • Structured data in machine learning consists of rows and columns in one large table.
  • Data preparation is a required step in each machine learning project.
  • The routineness of machine learning algorithms means the majority of effort on each project is spent on data preparation.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Why Data Preparation Is So Important in Machine Learning appeared first on Machine Learning Mastery.

What Is Data Preparation in a Machine Learning Project

Data preparation may be one of the most difficult steps in any machine learning project.

The reason is that each dataset is different and highly specific to the project. Nevertheless, there are enough commonalities across predictive modeling projects that we can define a loose sequence of steps and subtasks that you are likely to perform.

This process provides a context in which we can consider the data preparation required for the project, informed both by the definition of the project performed before data preparation and the evaluation of machine learning algorithms performed after.

In this tutorial, you will discover how to consider data preparation as a step in a broader predictive modeling machine learning project.

After completing this tutorial, you will know:

  • Each predictive modeling project with machine learning is different, but there are common steps performed on each project.
  • Data preparation involves best exposing the unknown underlying structure of the problem to learning algorithms.
  • The steps before and after data preparation in a project can inform what data preparation methods to apply, or at least explore.

Let’s get started.

What Is Data Preparation in a Machine Learning Project
Photo by dashll, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Applied Machine Learning Process
  2. What Is Data Preparation
  3. How to Choose Data Preparation Techniques

Applied Machine Learning Process

Each machine learning project is different because the specific data at the core of the project is different.

You may be the first person (ever!) to work on the specific predictive modeling problem. That does not mean that others have not worked on similar prediction tasks or perhaps even the same high-level task, but you are the first to use the specific data that you have collected (unless you are using a standard dataset for practice).

… the right features can only be defined in the context of both the model and the data; since data and models are so diverse, it’s difficult to generalize the practice of feature engineering across projects.

— Page vii, Feature Engineering for Machine Learning, 2018.

This makes each machine learning project unique. No one can tell you what the best results are or might be, or what algorithms to use to achieve them. You must establish a baseline in performance as a point of reference to compare all of your models and you must discover what algorithm works best for your specific dataset.

You are not alone, and the vast literature on applied machine learning that has come before can inform you as to techniques to use to robustly evaluate your model and algorithms to evaluate.

Even though your project is unique, the steps on the path to a good or even the best result are generally the same from project to project. This is sometimes referred to as the “applied machine learning process“, “data science process“, or the older name “knowledge discovery in databases” (KDD).

The process of applied machine learning consists of a sequence of steps. The steps are the same, but the names of the steps and tasks performed may differ from description to description.

Further, the steps are written sequentially, but we will jump back and forth between the steps for any given project.

I like to define the process using the four high-level steps:

  • Step 1: Define Problem.
  • Step 2: Prepare Data.
  • Step 3: Evaluate Models.
  • Step 4: Finalize Model.

Let’s take a closer look at each of these steps.

Step 1: Define Problem

This step is concerned with learning enough about the project to select the framing or framings of the prediction task. For example, is it classification or regression, or some other higher-order problem type?

It involves collecting the data that is believed to be useful in making a prediction and clearly defining the form that the prediction will take. It may also involve talking to project stakeholders and other people with deep expertise in the domain.

This step also involves taking a close look at the data, as well as perhaps exploring the data using summary statistics and data visualization.

Step 2: Prepare Data

This step is concerned with transforming the raw data that was collected into a form that can be used in modeling.

Data pre-processing techniques generally refer to the addition, deletion, or transformation of training set data.

— Page 27, Applied Predictive Modeling, 2013.

We will take a closer look at this step in the next section.

Step 3: Evaluate Models

This step is concerned with evaluating machine learning models on your dataset.

It requires that you design a robust test harness used to evaluate your models so that the results you get can be trusted and used to select among the models that you have evaluated.

This involves tasks such as selecting a performance metric for evaluating the skill of a model, establishing a baseline or floor in performance to which all model evaluations can be compared, and a resampling technique for splitting the data into training and test sets to simulate how the final model will be used.

For quick and dirty estimates of model performance, or for a very large dataset, a single train-test split of the data may be performed. It is more common to use k-fold cross-validation as the data resampling technique, often with repeats of the process to improve the robustness of the result.
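
As a minimal sketch of this procedure on a contrived dataset, repeated k-fold cross-validation can be implemented with the RepeatedKFold class in scikit-learn.

# evaluate a model with repeated k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
# contrived classification dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
# 10 folds, repeated 3 times with different random splits
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))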

This step also involves tasks for getting the most out of well-performing models such as hyperparameter tuning and ensembles of models.

Step 4: Finalize Model

This step is concerned with selecting and using a final model.

Once a suite of models has been evaluated, you must choose a model that represents the “solution” to the project. This is called model selection and may involve further evaluation of candidate models on a hold out validation dataset, or selection via other project-specific criteria such as model complexity.

It may also involve summarizing the performance of the model in a standard way for project stakeholders, which is an important step.

Finally, there will likely be tasks related to the productization of the model, such as integrating it into a software project or production system and designing a monitoring and maintenance schedule for the model.

Now that we are familiar with the process of applied machine learning and where data preparation fits into that process, let’s take a closer look at the types of tasks that may be performed.

What Is Data Preparation

On a predictive modeling project, such as classification or regression, raw data typically cannot be used directly.

This is because of reasons such as:

  • Machine learning algorithms require data to be numbers.
  • Some machine learning algorithms impose requirements on the data.
  • Statistical noise and errors in the data may need to be corrected.
  • Complex nonlinear relationships may be teased out of the data.

As such, the raw data must be pre-processed prior to being used to fit and evaluate a machine learning model. This step in a predictive modeling project is referred to as “data preparation”, although it goes by many other names, such as “data wrangling”, “data cleaning”, “data pre-processing”, and “feature engineering”. Some of these names may better fit as sub-tasks for the broader data preparation process.

We can define data preparation as the transformation of raw data into a form that is more suitable for modeling.

Data wrangling, which is also commonly referred to as data munging, transformation, manipulation, janitor work, etc., can be a painstakingly laborious process.

— Page v, Data Wrangling with R, 2016.

This is highly specific to your data, to the goals of your project, and to the algorithms that will be used to model your data. We will talk more about these relationships in the next section.

Nevertheless, there are common or standard tasks that you may use or explore during the data preparation step in a machine learning project.

These tasks include:

  • Data Cleaning: Identifying and correcting mistakes or errors in the data.
  • Feature Selection: Identifying those input variables that are most relevant to the task.
  • Data Transforms: Changing the scale or distribution of variables.
  • Feature Engineering: Deriving new variables from available data.
  • Dimensionality Reduction: Creating compact projections of the data.

Each of these tasks is a whole field of study with specialized algorithms.

Data preparation is not performed blindly.

In some cases, variables must be encoded or transformed before we can apply a machine learning algorithm, such as converting strings to numbers. In other cases, it is less clear; for example, scaling a variable may or may not be useful to an algorithm.

The broader philosophy of data preparation is to discover how to best expose the underlying structure of the problem to the learning algorithms. This is the guiding light.

We don’t know the underlying structure of the problem; if we did, we wouldn’t need a learning algorithm to discover it and learn how to make skillful predictions. Therefore, exposing the unknown underlying structure of the problem is a process of discovery, along with discovering the well- or best-performing learning algorithms for the project.

However, we often do not know the best re-representation of the predictors to improve model performance. Instead, the re-working of predictors is more of an art, requiring the right tools and experience to find better predictor representations. Moreover, we may need to search many alternative predictor representations to improve model performance.

— Page xii, Feature Engineering and Selection, 2019.

It can be more complicated than it appears at first glance. For example, different input variables may require different data preparation methods. Further, different variables or subsets of input variables may require different sequences of data preparation methods.

It can feel overwhelming, given the large number of methods, each of which may have their own configuration and requirements. Nevertheless, the machine learning process steps before and after data preparation can help to inform what techniques to consider.

How to Choose Data Preparation Techniques

How do we know what data preparation techniques to use in our data?

As with many questions of statistics, the answer to “which feature engineering methods are the best?” is that it depends. Specifically, it depends on the model being used and the true relationship with the outcome.

— Page 28, Applied Predictive Modeling, 2013.

On the surface, this is a challenging question, but if we look at the data preparation step in the context of the whole project, it becomes more straightforward. The steps in a predictive modeling project before and after the data preparation step inform the data preparation that may be required.

The step before data preparation involves defining the problem.

As part of defining the problem, this may involve many sub-tasks, such as:

  • Gather data from the problem domain.
  • Discuss the project with subject matter experts.
  • Select those variables to be used as inputs and outputs for a predictive model.
  • Review the data that has been collected.
  • Summarize the collected data using statistical methods.
  • Visualize the collected data using plots and charts.

Information known about the data can be used in selecting and configuring data preparation methods.

For example, plots of the data may help identify whether a variable has outlier values. This can help in data cleaning operations. It may also provide insight into the probability distribution that underlies the data. This may help in determining whether data transforms that change a variable’s probability distribution would be appropriate.

Statistical methods, such as descriptive statistics, can be used to determine whether scaling operations might be required. Statistical hypothesis tests can be used to determine whether a variable matches a given probability distribution.

Pairwise plots and statistics can be used to determine whether variables are related, and if so, how much, providing insight into whether one or more variables are redundant or irrelevant to the target variable.
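For example, summary statistics and pairwise correlations can be reviewed with pandas. The sketch below assumes the collected data sits in a CSV file; the file name 'your_dataset.csv' is a placeholder for your own data.

# sketch: review summary statistics and pairwise correlations with pandas
from pandas import read_csv
# load the dataset (the file name is a placeholder)
df = read_csv('your_dataset.csv')
# descriptive statistics (mean, standard deviation, min, max, quartiles)
print(df.describe())
# pairwise Pearson correlation between the (numeric) variables
print(df.corr())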

As such, there may be a lot of interplay between the definition of the problem and the preparation of the data.

There may also be interplay between the data preparation step and the evaluation of models.

Model evaluation may involve sub-tasks such as:

  • Select a performance metric for evaluating model predictive skill.
  • Select a model evaluation procedure.
  • Select algorithms to evaluate.
  • Tune algorithm hyperparameters.
  • Combine predictive models into ensembles.

Information known about the choice of algorithms and the discovery of well-performing algorithms can also inform the selection and configuration of data preparation methods.

For example, the choice of algorithms may impose requirements and expectations on the type and form of input variables in the data. This might require variables to have a specific probability distribution, the removal of correlated input variables, and/or the removal of variables that are not strongly related to the target variable.

The choice of performance metric may also require careful preparation of the target variable in order to meet the expectations, such as scoring regression models based on prediction error using a specific unit of measure, requiring the inversion of any scaling transforms applied to that variable for modeling.

These examples, and more, highlight that although data preparation is an important step in a predictive modeling project, it does not stand alone. Instead, it is strongly influenced by the tasks performed both before and after data preparation. This highlights the highly iterative nature of any predictive modeling project.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

  • Applied Predictive Modeling, 2013.
  • Data Wrangling with R, 2016.
  • Feature Engineering and Selection, 2019.

Summary

In this tutorial, you discovered how to consider data preparation as a step in a broader predictive modeling machine learning project.

Specifically, you learned:

  • Each predictive modeling project with machine learning is different, but there are common steps performed on each project.
  • Data preparation involves best exposing the unknown underlying structure of the problem to learning algorithms.
  • The steps before and after data preparation in a project can inform what data preparation methods to apply, or at least explore.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post What Is Data Preparation in a Machine Learning Project appeared first on Machine Learning Mastery.

Tour of Data Preparation Techniques for Machine Learning


Predictive modeling machine learning projects, such as classification and regression, always involve some form of data preparation.

The specific data preparation required for a dataset depends on the specifics of the data, such as the variable types, as well as the algorithms that will be used to model them that may impose expectations or requirements on the data.

Nevertheless, there is a collection of standard data preparation algorithms that can be applied to structured data (e.g. data that forms a large table like in a spreadsheet). These data preparation algorithms can be organized or grouped by type into a framework that can be helpful when comparing and selecting techniques for a specific project.

In this tutorial, you will discover the common data preparation tasks performed in a predictive modeling machine learning task.

After completing this tutorial, you will know:

  • Techniques such as data cleaning can identify and fix errors in data like missing values.
  • Data transforms can change the scale, type, and probability distribution of variables in the dataset.
  • Techniques such as feature selection and dimensionality reduction can reduce the number of input variables.

Let’s get started.

Tour of Data Preparation Techniques for Machine Learning
Photo by Nicolas Raymond, some rights reserved.

Tutorial Overview

This tutorial is divided into six parts; they are:

  1. Common Data Preparation Tasks
  2. Data Cleaning
  3. Feature Selection
  4. Data Transforms
  5. Feature Engineering
  6. Dimensionality Reduction

Common Data Preparation Tasks

We can define data preparation as the transformation of raw data into a form that is more suitable for modeling.

Nevertheless, there are steps in a predictive modeling project before and after the data preparation step that are important and inform the data preparation that is to be performed.

The process of applied machine learning consists of a sequence of steps.

We may jump back and forth between the steps for any given project, but all projects have the same general steps; they are:

  • Step 1: Define Problem.
  • Step 2: Prepare Data.
  • Step 3: Evaluate Models.
  • Step 4: Finalize Model.

We are concerned with the data preparation step (step 2), and there are common or standard tasks that you may use or explore during the data preparation step in a machine learning project.

The types of data preparation performed depend on your data, as you might expect.

Nevertheless, as you work through multiple predictive modeling projects, you see and require the same types of data preparation tasks again and again.

These tasks include:

  • Data Cleaning: Identifying and correcting mistakes or errors in the data.
  • Feature Selection: Identifying those input variables that are most relevant to the task.
  • Data Transforms: Changing the scale or distribution of variables.
  • Feature Engineering: Deriving new variables from available data.
  • Dimensionality Reduction: Creating compact projections of the data.

This provides a rough framework that we can use to think about and navigate different data preparation algorithms we may consider on a given project with structured or tabular data.

Let’s take a closer look at each in turn.

Data Cleaning

Data cleaning involves fixing systematic problems or errors in “messy” data.

The most useful data cleaning involves deep domain expertise and could involve identifying and addressing specific observations that may be incorrect.

There are many reasons data may have incorrect values, such as being mistyped, corrupted, duplicated, and so on. Domain expertise may allow obviously erroneous observations to be identified as they are different from what is expected, such as a person’s height of 200 feet.

Once messy, noisy, corrupt, or erroneous observations are identified, they can be addressed. This might involve removing a row or a column. Alternately, it might involve replacing observations with new values.

Nevertheless, there are general data cleaning operations that can be performed, such as:

  • Using statistics to define normal data and identify outliers.
  • Identifying columns that have the same value or no variance and removing them.
  • Identifying duplicate rows of data and removing them.
  • Marking empty values as missing.
  • Imputing missing values using statistics or a learned model.

Data cleaning is an operation that is typically performed first, prior to other data preparation operations.
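As a minimal sketch, two of these general operations can be performed with pandas as follows; the file name 'your_dataset.csv' is a placeholder for your own data.

# sketch: two general data cleaning operations with pandas
from pandas import read_csv
# load the dataset (the file name is a placeholder)
df = read_csv('your_dataset.csv')
# identify and remove duplicate rows
print('Duplicate rows: %d' % df.duplicated().sum())
df = df.drop_duplicates()
# identify and remove columns that hold a single value (no variance)
single_valued = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=single_valued)
print('Removed columns: %s' % single_valued)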

Overview of Data Cleaning

For more on data cleaning see the tutorial:

Feature Selection

Feature selection refers to techniques for selecting a subset of input features that are most relevant to the target variable that is being predicted.

This is important as irrelevant and redundant input variables can distract or mislead learning algorithms, possibly resulting in lower predictive performance. Additionally, it is desirable to develop models using only the data that is required to make a prediction, e.g. to favor the simplest possible well-performing model.

Feature selection techniques are generally grouped into those that use the target variable (supervised) and those that do not (unsupervised). Additionally, the supervised techniques can be further divided into models that automatically select features as part of fitting the model (intrinsic), those that explicitly choose features that result in the best-performing model (wrapper), and those that score each input feature and allow a subset to be selected (filter).

Overview of Feature Selection Techniques

Statistical methods are popular for scoring input features, such as correlation. The features can then be ranked by their scores and a subset with the largest scores used as input to a model. The choice of statistical measure depends on the data types of the input variables; a review of the different statistical measures that can be used helps guide this choice.
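As a sketch of the filter approach, the example below scores each feature against the target with the ANOVA F-statistic and keeps the five with the largest scores; the synthetic dataset and the choice of k=5 are illustrative assumptions, not recommendations.

# sketch: filter-based feature selection by scoring each input feature
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
# define a synthetic dataset with 20 inputs, 5 of which are informative
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=1)
# score features with the ANOVA F-statistic and keep the 5 best
fs = SelectKBest(score_func=f_classif, k=5)
X_selected = fs.fit_transform(X, y)
print(X.shape, X_selected.shape)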

For an overview of how to select statistical feature selection methods based on data type, see the tutorial:

Additionally, there are different common feature selection use cases we may encounter in a predictive modeling project.

When a mixture of input variable data types is present, different filter methods can be used. Alternately, a wrapper method such as the popular RFE method, which is agnostic to the input variable type, can be used.
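As a sketch of the wrapper approach, RFE might be used as follows; the decision tree used as the internal model and the choice of five features are illustrative assumptions.

# sketch: wrapper-based feature selection with RFE
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
# define a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=1)
# recursively eliminate features until 5 remain
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
X_selected = rfe.fit_transform(X, y)
print(X_selected.shape)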

The broader field of scoring the relative importance of input features is referred to as feature importance, and many model-based techniques exist whose outputs can be used to aid in interpreting the model, interpreting the dataset, or selecting features for modeling.

For more on feature importance, see the tutorial:

Data Transforms

Data transforms are used to change the type or distribution of data variables.

This is a large umbrella of different techniques and they may be just as easily applied to input and output variables.

Recall that data may have one of a few types, such as numeric or categorical, with subtypes for each, such as integer and real-valued for numeric, and nominal, ordinal, and boolean for categorical.

  • Numeric Data Type: Number values.
    • Integer: Integers with no fractional part.
    • Real: Floating point values.
  • Categorical Data Type: Label values.
    • Ordinal: Labels with a rank ordering.
    • Nominal: Labels with no rank ordering.
    • Boolean: Values True and False.

The figure below provides an overview of this same breakdown of high-level data types.

Overview of Data Variable Types

We may wish to convert a numeric variable to an ordinal variable in a process called discretization. Alternatively, we may encode a categorical variable as integer or boolean variables, as is required on most classification tasks. All three transforms are sketched after the list below.

  • Discretization Transform: Encode a numeric variable as an ordinal variable.
  • Ordinal Transform: Encode a categorical variable into an integer variable.
  • One-Hot Transform: Encode a categorical variable into binary variables.
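The sketch below demonstrates all three transforms; the tiny input arrays are made up for illustration.

# sketch of the discretization, ordinal, and one-hot transforms
from numpy import asarray
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
# discretize a numeric variable into 3 ordinal bins
numeric = asarray([[1.2], [3.4], [0.1], [5.6]])
print(KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform').fit_transform(numeric))
# encode a categorical variable as integers
labels = asarray([['red'], ['green'], ['blue'], ['green']])
print(OrdinalEncoder().fit_transform(labels))
# encode the same variable as binary (one-hot) variables
print(OneHotEncoder().fit_transform(labels).toarray())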

For real-valued numeric variables, the way they are represented in a computer means there is dramatically more resolution in the range 0-1 than in the broader range of the data type. As such, it may be desirable to scale variables to this range, called normalization. If the data has a Gaussian probability distribution, it may be more useful to shift the data to a standard Gaussian with a mean of zero and a standard deviation of one. Both transforms are sketched after the list below.

  • Normalization Transform: Scale a variable to the range 0 and 1.
  • Standardization Transform: Scale a variable to a standard Gaussian.
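A minimal sketch of both scaling transforms on a single illustrative variable:

# sketch of the normalization and standardization transforms
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
# a single numeric variable with made-up values
data = asarray([[100.0], [250.0], [75.0], [300.0]])
# normalization: rescale to the range 0-1
print(MinMaxScaler().fit_transform(data))
# standardization: rescale to zero mean and unit standard deviation
print(StandardScaler().fit_transform(data))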

The probability distribution for numerical variables can be changed.

For example, if the distribution is nearly Gaussian but is skewed or shifted, it can be made more Gaussian using a power transform. Alternatively, quantile transforms can be used to force a probability distribution, such as a uniform or Gaussian, onto a variable with an unusual natural distribution. Both transforms are sketched after the list below.

  • Power Transform: Change the distribution of a variable to be more Gaussian.
  • Quantile Transform: Impose a probability distribution such as uniform or Gaussian.
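The sketch below applies both transforms to a synthetic skewed variable created by squaring Gaussian values; the data and the choice of the Yeo-Johnson method are illustrative assumptions.

# sketch of the power and quantile transforms on a skewed variable
from numpy.random import randn
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import QuantileTransformer
# a skewed variable created by squaring Gaussian values
data = randn(1000, 1) ** 2
# power transform: make the distribution more Gaussian
print(PowerTransformer(method='yeo-johnson').fit_transform(data)[:3])
# quantile transform: impose a Gaussian distribution directly
print(QuantileTransformer(output_distribution='normal').fit_transform(data)[:3])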

An important consideration with data transforms is that the operations are generally performed separately for each variable. As such, we may want to perform different operations on different variable types.

Overview of Data Transform Techniques

We may also want to use the transform on new data in the future. This can be achieved by saving the transform objects to file along with the final model trained on all available data.
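For example, the sketch below fits a scaler and a model on a synthetic dataset and saves both with pickle; the file names 'scaler.pkl' and 'model.pkl' are illustrative assumptions.

# sketch: save a fitted transform and final model for later use on new data
from pickle import dump
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
# define a dataset, then fit the transform and the model on all available data
X, y = make_classification(n_samples=100, random_state=1)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
model = LogisticRegression()
model.fit(X_scaled, y)
# save both objects so new data can be transformed and predicted consistently
dump(scaler, open('scaler.pkl', 'wb'))
dump(model, open('model.pkl', 'wb'))

Loading both objects later (e.g. with pickle's load() function) allows new data to pass through exactly the same transform before a prediction is made.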

Feature Engineering

Feature engineering refers to the process of creating new input variables from the available data.

Engineering new features is highly specific to your data and data types. As such, it often requires the collaboration of a subject matter expert to help identify new features that could be constructed from the data.

This specialization makes feature engineering a challenging topic to reduce to general methods.

Nevertheless, there are some techniques that can be reused, such as:

  • Adding a boolean flag variable for some state.
  • Adding a group or global summary statistic, such as a mean.
  • Adding new variables for each component of a compound variable, such as a date-time.

A popular approach drawn from statistics is to create copies of numerical input variables that have been changed with a simple mathematical operation, such as raising them to a power or multiplying them with other input variables, referred to as polynomial features; a sketch follows the list below.

  • Polynomial Transform: Create copies of numerical input variables that are raised to a power.
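A minimal sketch of the transform on a tiny illustrative array:

# sketch of the polynomial features transform
from numpy import asarray
from sklearn.preprocessing import PolynomialFeatures
# two input variables with made-up values
data = asarray([[2, 3], [4, 5]])
# a degree-2 transform adds a bias column, squared terms, and the interaction term
trans = PolynomialFeatures(degree=2)
print(trans.fit_transform(data))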

The theme of feature engineering is to add broader context to a single observation or decompose a complex variable, both in an effort to provide a more straightforward perspective on the input data.

I like to think of feature engineering as a type of data transform, although it would be just as reasonable to think of data transforms as a type of feature engineering.

Dimensionality Reduction

The number of input features for a dataset may be considered the dimensionality of the data.

For example, two input variables together can define a two-dimensional area where each row of data defines a point in that space. This idea can then be scaled to any number of input variables to create large multi-dimensional hyper-volumes.

The problem is, the more dimensions this space has (e.g. the more input variables), the more likely it is that the dataset represents a very sparse and likely unrepresentative sampling of that space. This is referred to as the curse of dimensionality.

This motivates feature selection, although an alternative to feature selection is to create a projection of the data into a lower-dimensional space that still preserves the most important properties of the original data.

This is referred to generally as dimensionality reduction and provides an alternative to feature selection. Unlike feature selection, the variables in the projected data are not directly related to the original input variables, making the projection difficult to interpret.

The most common approach to dimensionality reduction is to use a matrix factorization technique:

  • Principal Component Analysis (PCA)
  • Singular Value Decomposition (SVD)

The main impact of these techniques is that they remove linear dependencies between input variables, e.g. correlated variables.
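As a sketch, PCA might be applied as follows; the synthetic dataset and the choice of five components are illustrative assumptions.

# sketch: dimensionality reduction with PCA
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
# define a synthetic dataset with 20 inputs, many of them redundant
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, n_redundant=15, random_state=1)
# project the 20 input variables onto 5 principal components
X_reduced = PCA(n_components=5).fit_transform(X)
print(X.shape, X_reduced.shape)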

Other approaches exist that discover a lower-dimensional representation of the data. We might refer to these as model-based methods, such as linear discriminant analysis (LDA) and perhaps autoencoders.

  • Linear Discriminant Analysis (LDA)

Sometimes manifold learning algorithms can also be used, such as Kohonen self-organizing maps and t-SNE.

Overview of Dimensionality Reduction Techniques

For more on dimensionality reduction, see the tutorial:


Summary

In this tutorial, you discovered the common data preparation tasks performed in a predictive modeling machine learning task.

Specifically, you learned:

  • Techniques such as data cleaning can identify and fix errors in data like missing values.
  • Data transforms can change the scale, type, and probability distribution of variables in the dataset.
  • Techniques such as feature selection and dimensionality reduction can reduce the number of input variables.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Tour of Data Preparation Techniques for Machine Learning appeared first on Machine Learning Mastery.

How to Avoid Data Leakage When Performing Data Preparation


Data preparation is the process of transforming raw data into a form that is appropriate for modeling.

A naive approach to preparing data applies the transform on the entire dataset before evaluating the performance of the model. This results in a problem referred to as data leakage, where knowledge of the hold-out test set leaks into the dataset used to train the model. This can result in an incorrect estimate of model performance when making predictions on new data.

A careful application of data preparation techniques is required in order to avoid data leakage, and this varies depending on the model evaluation scheme used, such as train-test splits or k-fold cross-validation.

In this tutorial, you will discover how to avoid data leakage during data preparation when evaluating machine learning models.

After completing this tutorial, you will know:

  • Naive application of data preparation methods to the whole dataset results in data leakage that causes incorrect estimates of model performance.
  • Data preparation must be fit on the training set only in order to avoid data leakage.
  • How to implement data preparation without data leakage for train-test splits and k-fold cross-validation in Python.

Let’s get started.

How to Avoid Data Leakage When Performing Data Preparation
Photo by kuhnmi, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Problem With Naive Data Preparation
  2. Data Preparation With Train and Test Sets
    1. Train-Test Evaluation With Naive Data Preparation
    2. Train-Test Evaluation With Correct Data Preparation
  3. Data Preparation With k-fold Cross-Validation
    1. Cross-Validation Evaluation With Naive Data Preparation
    2. Cross-Validation Evaluation With Correct Data Preparation

Problem With Naive Data Preparation

The manner in which data preparation techniques are applied to data matters.

A common approach is to first apply one or more transforms to the entire dataset. Then the dataset is split into train and test sets or k-fold cross-validation is used to fit and evaluate a machine learning model.

  • 1. Prepare Dataset
  • 2. Split Data
  • 3. Evaluate Models

Although this is a common approach, it is dangerously incorrect in most cases.

The problem with applying data preparation techniques before splitting data for model evaluation is that it can lead to data leakage and, in turn, will likely result in an incorrect estimate of a model’s performance on the problem.

Data leakage refers to a problem where information about the holdout dataset, such as a test or validation dataset, is made available to the model in the training dataset. This leakage is often small and subtle but can have a marked effect on performance.

… leakage means that information is revealed to the model that gives it an unrealistic advantage to make better predictions. This could happen when test data is leaked into the training set, or when data from the future is leaked to the past. Any time that a model is given information that it shouldn’t have access to when it is making predictions in real time in production, there is leakage.

— Page 93, Feature Engineering for Machine Learning, 2018.

We get data leakage by applying data preparation techniques to the entire dataset.

This is not a direct type of data leakage, where we would train the model on the test dataset. Instead, it is an indirect type of data leakage, where some knowledge about the test dataset, captured in summary statistics, is available to the model during training. This can make it a harder type of data leakage to spot, especially for beginners.

One other aspect of resampling is related to the concept of information leakage which is where the test set data are used (directly or indirectly) during the training process. This can lead to overly optimistic results that do not replicate on future data points and can occur in subtle ways.

— Page 55, Feature Engineering and Selection, 2019.

For example, consider the case where we want to normalize the data, that is, scale input variables to the range 0-1.

When we normalize the input variables, this requires that we first calculate the minimum and maximum values for each variable before using these values to scale the variables. The dataset is then split into train and test datasets, but the examples in the training dataset know something about the data in the test dataset; they have been scaled by the global minimum and maximum values, so they know more about the global distribution of the variable than they should.

We get the same type of leakage with almost all data preparation techniques; for example, standardization estimates the mean and standard deviation values from the domain in order to scale the variables; even models that impute missing values using a model or summary statistics will draw on the full dataset to fill in values in the training dataset.

The solution is straightforward.

Data preparation must be fit on the training dataset only. That is, any coefficients or models prepared for the data preparation process must only use rows of data in the training dataset.

Once fit, the data preparation algorithms or models can then be applied to the training dataset, and to the test dataset.

  • 1. Split Data.
  • 2. Fit Data Preparation on Training Dataset.
  • 3. Apply Data Preparation to Train and Test Datasets.
  • 4. Evaluate Models.

More generally, the entire modeling pipeline must be prepared only on the training dataset to avoid data leakage. This might include data transforms, but also other techniques such as feature selection, dimensionality reduction, feature engineering, and more. This means so-called “model evaluation” should really be called “modeling pipeline evaluation”.

In order for any resampling scheme to produce performance estimates that generalize to new data, it must contain all of the steps in the modeling process that could significantly affect the model’s effectiveness.

— Pages 54-55, Feature Engineering and Selection, 2019.

Now that we are familiar with how to apply data preparation to avoid data leakage, let’s look at some worked examples.

Data Preparation With Train and Test Sets

In this section, we will evaluate a logistic regression model using train and test sets on a synthetic binary classification dataset where the input variables have been normalized.

First, let’s define our synthetic dataset.

We will use the make_classification() function to create the dataset with 1,000 rows of data and 20 numerical input features. The example below creates the dataset and summarizes the shape of the input and output variable arrays.

# test classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# summarize the dataset
print(X.shape, y.shape)

Running the example creates the dataset and confirms that the input part of the dataset has 1,000 rows and 20 columns for the 20 input variables and that the output variable has 1,000 examples to match the 1,000 rows of input data, one value per row.

(1000, 20) (1000,)

Next, we can evaluate our model on the scaled dataset, starting with the naive or incorrect approach.

Train-Test Evaluation With Naive Data Preparation

The naive approach involves first applying the data preparation method, then splitting the data before finally evaluating the model.

We can normalize the input variables using the MinMaxScaler class, which is first defined with the default configuration scaling the data to the range 0-1, then the fit_transform() function is called to fit the transform on the dataset and apply it to the dataset in a single step. The result is a normalized version of the input variables, where each column in the array is separately normalized (e.g. has its own minimum and maximum calculated).

...
# normalize the dataset
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

Next, we can split our dataset into train and test sets using the train_test_split() function. We will use 67 percent for the training set and 33 percent for the test set.

...
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

We can then define our logistic regression algorithm via the LogisticRegression class, with default configuration, and fit it on the training dataset.

...
# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)

The fit model can then make a prediction for the input data for the test set, and we can compare the predictions to the expected values and calculate a classification accuracy score.

...
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))

Tying this together, the complete example is listed below.

# naive approach to normalizing the data before splitting the data and evaluating the model
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# normalize the dataset
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))

Running the example normalizes the data, splits the data into train and test sets, then fits and evaluates the model.

Your specific results may vary given the stochastic nature of the learning algorithm and evaluation procedure.

In this case, we can see that the estimate for the model is about 84.848 percent.

Accuracy: 84.848

Given we know that there was data leakage, we know that this estimate of model accuracy is wrong.

Next, let’s explore how we might correctly prepare the data to avoid data leakage.

Train-Test Evaluation With Correct Data Preparation

The correct approach to performing data preparation with a train-test split evaluation is to fit the data preparation on the training set, then apply the transform to the train and test sets.

This requires that we first split the data into train and test sets.

...
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

We can then define the MinMaxScaler and call the fit() function on the training set, then apply the transform() function on the train and test sets to create a normalized version of each dataset.

...
# define the scaler
scaler = MinMaxScaler()
# fit on the training dataset
scaler.fit(X_train)
# scale the training dataset
X_train = scaler.transform(X_train)
# scale the test dataset
X_test = scaler.transform(X_test)

This avoids data leakage, as the minimum and maximum values for each input variable are calculated using only the training dataset (X_train) instead of the entire dataset (X).

The model can then be evaluated as before.

Tying this together, the complete example is listed below.

# correct approach for normalizing the data after the data is split before the model is evaluated
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# define the scaler
scaler = MinMaxScaler()
# fit on the training dataset
scaler.fit(X_train)
# scale the training dataset
X_train = scaler.transform(X_train)
# scale the test dataset
X_test = scaler.transform(X_test)
# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % (accuracy*100))

Running the example splits the data into train and test sets, normalizes the data correctly, then fits and evaluates the model.

Your specific results may vary given the stochastic nature of the learning algorithm and evaluation procedure.

In this case, we can see that the estimate for the model is about 85.455 percent, compared to the estimate of 84.848 percent achieved with data leakage in the previous section.

We expect data leakage to result in an incorrect estimate of model performance. We would expect this to be an optimistic estimate with data leakage, e.g. better performance, although in this case, we can see that data leakage resulted in slightly worse performance. This might be because of the difficulty of the prediction task.

Accuracy: 85.455

Data Preparation With k-fold Cross-Validation

In this section, we will evaluate a logistic regression model using k-fold cross-validation on a synthetic binary classification dataset where the input variables have been normalized.

You may recall that k-fold cross-validation involves splitting a dataset into k non-overlapping groups of rows. The model is then trained on all but one group to form a training dataset and then evaluated on the held-out fold. This process is repeated so that each fold is given a chance to be used as the holdout test set. Finally, the average performance across all evaluations is reported.

The k-fold cross-validation procedure generally gives a more reliable estimate of model performance than a train-test split, although it is more computationally expensive given the repeated fitting and evaluation of models.

Let’s first look at naive data preparation with k-fold cross-validation.

Cross-Validation Evaluation With Naive Data Preparation

Naive data preparation with cross-validation involves applying the data transforms first, then using the cross-validation procedure.

We will use the synthetic dataset prepared in the previous section and normalize the data directly.

...
# normalize the dataset
scaler = MinMaxScaler()
X = scaler.fit_transform(X)

The k-fold cross-validation procedure must first be defined. We will use repeated stratified 10-fold cross-validation, which is a best practice for classification. Repeated means that the whole cross-validation procedure is repeated multiple times, three in this case. Stratified means that each group of rows will have the same relative composition of examples from each class as the whole dataset. We will use k=10, or 10-fold cross-validation.

This can be achieved using the RepeatedStratifiedKFold class, which can be configured for three repeats and 10 folds, and then using the cross_val_score() function to perform the procedure, passing in the defined model, cross-validation object, and metric to calculate, in this case, accuracy.

...
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model using cross-validation
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

We can then report the average accuracy across all of the repeats and folds.

Tying this all together, the complete example of evaluating a model with cross-validation using data preparation with data leakage is listed below.

# naive data preparation for model evaluation with k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# normalize the dataset
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
# define the model
model = LogisticRegression()
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model using cross-validation
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))

Running the example normalizes the data first, then evaluates the model using repeated stratified cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm and evaluation procedure.

In this case, we can see that the model achieved an estimated accuracy of about 85.300 percent, which we know is incorrect given the data leakage allowed via the data preparation procedure.

Accuracy: 85.300 (3.607)

Next, let’s look at how we can evaluate the model with cross-validation and avoid data leakage.

Cross-Validation Evaluation With Correct Data Preparation

Data preparation without data leakage when using cross-validation is slightly more challenging.

It requires that the data preparation method is prepared on the training set and applied to the train and test sets within the cross-validation procedure, e.g. for each group (fold) of rows.

We can achieve this by defining a modeling pipeline that specifies a sequence of data preparation steps to perform, ending in the model to fit and evaluate.

To provide a solid methodology, we should constrain ourselves to developing the list of preprocessing techniques, estimate them only in the presence of the training data points, and then apply the techniques to future data (including the test set).

— Page 55, Feature Engineering and Selection, 2019.

The evaluation procedure changes from simply and incorrectly evaluating just the model to correctly evaluating the entire pipeline of data preparation and model together as a single atomic unit.

This can be achieved using the Pipeline class.

This class takes a list of steps that define the pipeline. Each step in the list is a tuple with two elements. The first element is the name of the step (a string) and the second is the configured object of the step, such as a transform or a model. The model is only supported as the final step, although we can have as many transforms as we like in the sequence.

...
# define the pipeline
steps = list()
steps.append(('scaler', MinMaxScaler()))
steps.append(('model', LogisticRegression()))
pipeline = Pipeline(steps=steps)

We can then pass the configured object to the cross_val_score() function for evaluation.

...
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model using cross-validation
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

Tying this together, the complete example of correctly performing data preparation without data leakage when using cross-validation is listed below.

# correct data preparation for model evaluation with k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# define the pipeline
steps = list()
steps.append(('scaler', MinMaxScaler()))
steps.append(('model', LogisticRegression()))
pipeline = Pipeline(steps=steps)
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model using cross-validation
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))

Running the example normalizes the data correctly within the cross-validation folds of the evaluation procedure to avoid data leakage.

Your specific results may vary given the stochastic nature of the learning algorithm and evaluation procedure.

In this case, we can see that the model has an estimated accuracy of about 85.433 percent, compared to the approach with data leakage that achieved an accuracy of about 85.300 percent.

As with the train-test example in the previous section, removing data leakage has resulted in a slight improvement in performance when our intuition might suggest a drop given that data leakage often results in an optimistic estimate of model performance. Nevertheless, the examples clearly demonstrate that data leakage does impact the estimate of model performance and how to correct data leakage by correctly performing data preparation after the data is split.

Accuracy: 85.433 (3.471)

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

  • Feature Engineering for Machine Learning, 2018.
  • Feature Engineering and Selection, 2019.

Summary

In this tutorial, you discovered how to avoid data leakage during data preparation when evaluating machine learning models.

Specifically, you learned:

  • Naive application of data preparation methods to the whole dataset results in data leakage that causes incorrect estimates of model performance.
  • Data preparation must be fit on the training set only in order to avoid data leakage.
  • How to implement data preparation without data leakage for train-test splits and k-fold cross-validation in Python.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post How to Avoid Data Leakage When Performing Data Preparation appeared first on Machine Learning Mastery.

kNN Imputation for Missing Values in Machine Learning


Datasets may have missing values, and this can cause problems for many machine learning algorithms.

As such, it is good practice to identify and replace missing values for each column in your input data prior to modeling your prediction task. This is called missing data imputation, or imputing for short.

A popular approach to missing data imputation is to use a model to predict the missing values. This requires a model to be created for each input variable that has missing values. Although any one among a range of different models can be used to predict the missing values, the k-nearest neighbor (KNN) algorithm has proven to be generally effective, often referred to as “nearest neighbor imputation.”

In this tutorial, you will discover how to use nearest neighbor imputation strategies for missing data in machine learning.

After completing this tutorial, you will know:

  • Missing values must be marked with NaN values and can be replaced with nearest neighbor estimated values.
  • How to load a CSV file with missing values and mark the missing values with NaN values and report the number and percentage of missing values for each column.
  • How to impute missing values with nearest neighbor models as a data preparation method when evaluating models and when fitting a final model to make predictions on new data.

Let’s get started.

kNN Imputation for Missing Values in Machine Learning
Photo by portengaround, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. k-Nearest Neighbor Imputation
  2. Horse Colic Dataset
  3. Nearest Neighbor Imputation With KNNImputer
    1. KNNImputer Data Transform
    2. KNNImputer and Model Evaluation
    3. KNNImputer and Different Number of Neighbors
    4. KNNImputer Transform When Making a Prediction

k-Nearest Neighbor Imputation

A dataset may have missing values.

These are rows of data where one or more of the values or columns in that row are not present. The values may be missing completely, or they may be marked with a special character or value, such as a question mark (“?”).

Values could be missing for many reasons, often specific to the problem domain, and might include reasons such as corrupt measurements or unavailability.

Most machine learning algorithms require numeric input values, and a value to be present for each row and column in a dataset. As such, missing values can cause problems for machine learning algorithms.

It is common to identify missing values in a dataset and replace them with a numeric value. This is called data imputing, or missing data imputation.

… missing data can be imputed. In this case, we can use information in the training set predictors to, in essence, estimate the values of other predictors.

— Page 42, Applied Predictive Modeling, 2013.

An effective approach to data imputing is to use a model to predict the missing values. A model is created for each feature that has missing values, taking as input the values of perhaps all other input features.

One popular technique for imputation is a K-nearest neighbor model. A new sample is imputed by finding the samples in the training set “closest” to it and averages these nearby points to fill in the value.

— Page 42, Applied Predictive Modeling, 2013.

If input variables are numeric, then regression models can be used for prediction, and this case is quite common. A range of different models can be used, although a simple k-nearest neighbor (KNN) model has proven to be effective in experiments. The use of a KNN model to predict or fill missing values is referred to as “Nearest Neighbor Imputation” or “KNN imputation.”

We show that KNNimpute appears to provide a more robust and sensitive method for missing value estimation […] and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros).

— Missing value estimation methods for DNA microarrays, 2001.

Configuration of KNN imputation often involves selecting the distance measure (e.g. Euclidean) and the number of contributing neighbors for each prediction, the k hyperparameter of the KNN algorithm.

Now that we are familiar with nearest neighbor methods for missing value imputation, let’s take a look at a dataset with missing values.

Horse Colic Dataset

The horse colic dataset describes medical characteristics of horses with colic and whether they lived or died.

There are 300 rows and 27 input variables with one output variable. It is a binary classification prediction task that involves predicting 1 if the horse lived and 2 if the horse died.

A naive model can achieve a classification accuracy of about 67 percent, and a top-performing model can achieve an accuracy of about 85.2 percent using three repeats of 10-fold cross-validation. This defines the range of expected modeling performance on the dataset.

The dataset has many missing values for many of the columns where each missing value is marked with a question mark character (“?”).

The listing below provides an example of rows from the dataset with marked missing values.

2,1,530101,38.50,66,28,3,3,?,2,5,4,4,?,?,?,3,5,45.00,8.40,?,?,2,2,11300,00000,00000,2
1,1,534817,39.2,88,20,?,?,4,1,3,4,2,?,?,?,4,2,50,85,2,2,3,2,02208,00000,00000,2
2,1,530334,38.30,40,24,1,1,3,1,3,3,1,?,?,?,1,1,33.00,6.70,?,?,1,2,00000,00000,00000,1
1,9,5290409,39.10,164,84,4,1,6,2,2,4,4,1,2,5.00,3,?,48.00,7.20,3,5.30,2,1,02208,00000,00000,1
...

You can learn more about the dataset here:

No need to download the dataset as we will download it automatically in the worked examples.

Marking missing values with a NaN (not a number) value in a loaded dataset using Python is a best practice.

We can load the dataset using the read_csv() Pandas function and specify the “na_values” to load values of ‘?’ as missing, marked with a NaN value.

...
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')

Once loaded, we can review the loaded data to confirm that “?” values are marked as NaN.

...
# summarize the first few rows
print(dataframe.head())

We can then enumerate each column and report the number of rows with missing values for the column.

...
# summarize the number of rows with missing values for each column
for i in range(dataframe.shape[1]):
	# count number of rows with missing values
	n_miss = dataframe[i].isnull().sum()
	perc = n_miss / dataframe.shape[0] * 100
	print('> %d, Missing: %d (%.1f%%)' % (i, n_miss, perc))

Tying this together, the complete example of loading and summarizing the dataset is listed below.

# summarize the horse colic dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
# summarize the first few rows
print(dataframe.head())
# summarize the number of rows with missing values for each column
for i in range(dataframe.shape[1]):
	# count number of rows with missing values
	n_miss = dataframe[i].isnull().sum()
	perc = n_miss / dataframe.shape[0] * 100
	print('> %d, Missing: %d (%.1f%%)' % (i, n_miss, perc))

Running the example first loads the dataset and summarizes the first five rows.

We can see that the missing values that were marked with a “?” character have been replaced with NaN values.

0   1        2     3      4     5    6   ...   21   22  23     24  25  26  27
0  2.0   1   530101  38.5   66.0  28.0  3.0  ...  NaN  2.0   2  11300   0   0   2
1  1.0   1   534817  39.2   88.0  20.0  NaN  ...  2.0  3.0   2   2208   0   0   2
2  2.0   1   530334  38.3   40.0  24.0  1.0  ...  NaN  1.0   2      0   0   0   1
3  1.0   9  5290409  39.1  164.0  84.0  4.0  ...  5.3  2.0   1   2208   0   0   1
4  2.0   1   530255  37.3  104.0  35.0  NaN  ...  NaN  2.0   2   4300   0   0   2

[5 rows x 28 columns]

Next, we can see a list of all columns in the dataset and the number and percentage of missing values.

We can see that some columns (e.g. column indexes 1 and 2) have no missing values and other columns (e.g. column indexes 15 and 21) have many or even a majority of missing values.

> 0, Missing: 1 (0.3%)
> 1, Missing: 0 (0.0%)
> 2, Missing: 0 (0.0%)
> 3, Missing: 60 (20.0%)
> 4, Missing: 24 (8.0%)
> 5, Missing: 58 (19.3%)
> 6, Missing: 56 (18.7%)
> 7, Missing: 69 (23.0%)
> 8, Missing: 47 (15.7%)
> 9, Missing: 32 (10.7%)
> 10, Missing: 55 (18.3%)
> 11, Missing: 44 (14.7%)
> 12, Missing: 56 (18.7%)
> 13, Missing: 104 (34.7%)
> 14, Missing: 106 (35.3%)
> 15, Missing: 247 (82.3%)
> 16, Missing: 102 (34.0%)
> 17, Missing: 118 (39.3%)
> 18, Missing: 29 (9.7%)
> 19, Missing: 33 (11.0%)
> 20, Missing: 165 (55.0%)
> 21, Missing: 198 (66.0%)
> 22, Missing: 1 (0.3%)
> 23, Missing: 0 (0.0%)
> 24, Missing: 0 (0.0%)
> 25, Missing: 0 (0.0%)
> 26, Missing: 0 (0.0%)
> 27, Missing: 0 (0.0%)

Now that we are familiar with the horse colic dataset that has missing values, let’s look at how we can use nearest neighbor imputation.

Nearest Neighbor Imputation with KNNImputer

The scikit-learn machine learning library provides the KNNImputer class that supports nearest neighbor imputation.

In this section, we will explore how to effectively use the KNNImputer class.

KNNImputer Data Transform

KNNImputer is a data transform that is first configured based on the method used to estimate the missing values.

The default distance measure is a Euclidean distance measure that is NaN aware, e.g. will not include NaN values when calculating the distance between members of the training dataset. This is set via the “metric” argument.

The number of neighbors is set to five by default and can be configured by the “n_neighbors” argument.

Finally, the contributions of the neighbors can be weighted proportionally to the distance between instances (rows), although uniform weighting is used by default, controlled via the “weights” argument.

...
# define imputer
imputer = KNNImputer(n_neighbors=5, weights='uniform', metric='nan_euclidean')

Then, the imputer is fit on a dataset.

...
# fit on the dataset
imputer.fit(X)

Then, the fit imputer is applied to a dataset to create a copy of the dataset with all missing values for each column replaced with an estimated value.

...
# transform the dataset
Xtrans = imputer.transform(X)

We can demonstrate its usage on the horse colic dataset and confirm it works by summarizing the total number of missing values in the dataset before and after the transform.

The complete example is listed below.

# knn imputation transform for the horse colic dataset
from numpy import isnan
from pandas import read_csv
from sklearn.impute import KNNImputer
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# print total missing
print('Missing: %d' % sum(isnan(X).flatten()))
# define imputer
imputer = KNNImputer()
# fit on the dataset
imputer.fit(X)
# transform the dataset
Xtrans = imputer.transform(X)
# print total missing
print('Missing: %d' % sum(isnan(Xtrans).flatten()))

Running the example first loads the dataset and reports the total number of missing values in the dataset as 1,605.

The transform is configured, fit, and performed, and the resulting new dataset has no missing values, confirming it was performed as we expected.

Each missing value was replaced with a value estimated by the model.

Missing: 1605
Missing: 0

KNNImputer and Model Evaluation

It is a good practice to evaluate machine learning models on a dataset using k-fold cross-validation.

To correctly apply nearest neighbor missing data imputation and avoid data leakage, it is required that the imputation models calculated for each column are fit on the training dataset only, then applied to the train and test sets for each fold in the dataset.

This can be achieved by creating a modeling pipeline where the first step is the nearest neighbor imputation, then the second step is the model. This can be achieved using the Pipeline class.

For example, the Pipeline below uses a KNNImputer with the default strategy, followed by a random forest model.

...
# define modeling pipeline
model = RandomForestClassifier()
imputer = KNNImputer()
pipeline = Pipeline(steps=[('i', imputer), ('m', model)])

We can evaluate the imputed dataset and random forest modeling pipeline for the horse colic dataset with repeated 10-fold cross-validation.

The complete example is listed below.

# evaluate knn imputation and random forest for the horse colic dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# define modeling pipeline
model = RandomForestClassifier()
imputer = KNNImputer()
pipeline = Pipeline(steps=[('i', imputer), ('m', model)])
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example correctly applies data imputation to each fold of the cross-validation procedure.

The pipeline is evaluated using three repeats of 10-fold cross-validation and reports the mean classification accuracy on the dataset as about 77.7 percent, which is a reasonable score.

Mean Accuracy: 0.777 (0.072)

How do we know that using a default number of neighbors of five is good or best for this dataset?

The answer is that we don’t.

KNNImputer and Different Number of Neighbors

The key hyperparameter for the KNN algorithm is k, which controls the number of nearest neighbors that are used to contribute to a prediction.

It is good practice to test a suite of different values for k.

The example below evaluates model pipelines and compares a range of values for k from 1 to 21.

# compare knn imputation strategies for the horse colic dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from matplotlib import pyplot
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# evaluate each strategy on the dataset
results = list()
strategies = [str(i) for i in [1,3,5,7,9,15,18,21]]
for s in strategies:
	# create the modeling pipeline
	pipeline = Pipeline(steps=[('i', KNNImputer(n_neighbors=int(s))), ('m', RandomForestClassifier())])
	# evaluate the model
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	# store results
	results.append(scores)
	print('>%s %.3f (%.3f)' % (s, mean(scores), std(scores)))
# plot model performance for comparison
pyplot.boxplot(results, labels=strategies, showmeans=True)
pyplot.xticks(rotation=45)
pyplot.show()

Running the example evaluates each k value on the horse colic dataset using repeated cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm; consider running the example a few times.

The mean classification accuracy is reported for the pipeline with each k value used for imputation.

In this case, we can see that larger k values result in a better-performing model, with k=21 resulting in the best performance of about 81.9 percent accuracy.

>1 0.691 (0.074)
>3 0.741 (0.075)
>5 0.780 (0.056)
>7 0.781 (0.065)
>9 0.798 (0.074)
>15 0.790 (0.083)
>18 0.807 (0.076)
>21 0.819 (0.060)

At the end of the run, a box and whisker plot is created for each set of results, allowing the distribution of results to be compared.

The plots clearly show the rising trend in model performance as k is increased for the imputation.

Box and Whisker Plot of Imputation Number of Neighbors for the Horse Colic Dataset

KNNImputer Transform When Making a Prediction

We may wish to create a final modeling pipeline with the nearest neighbor imputation and random forest algorithm, then make a prediction for new data.

This can be achieved by defining the pipeline and fitting it on all available data, then calling the predict() function, passing new data in as an argument.

Importantly, the row of new data must mark any missing values using the NaN value.

...
# define new data
row = [2,1,530101,38.50,66,28,3,3,nan,2,5,4,4,nan,nan,nan,3,5,45.00,8.40,nan,nan,2,2,11300,00000,00000]

The complete example is listed below.

# knn imputation strategy and prediction for the horse colic dataset
from numpy import nan
from pandas import read_csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# create the modeling pipeline
pipeline = Pipeline(steps=[('i', KNNImputer(n_neighbors=21)), ('m', RandomForestClassifier())])
# fit the model
pipeline.fit(X, y)
# define new data
row = [2,1,530101,38.50,66,28,3,3,nan,2,5,4,4,nan,nan,nan,3,5,45.00,8.40,nan,nan,2,2,11300,00000,00000]
# make a prediction
yhat = pipeline.predict([row])
# summarize prediction
print('Predicted Class: %d' % yhat[0])

Running the example fits the modeling pipeline on all available data.

A new row of data is defined with missing values marked with NaNs and a classification prediction is made.

Predicted Class: 2

Further Reading

This section provides more resources on the topic if you are looking to go deeper.


Summary

In this tutorial, you discovered how to use nearest neighbor imputation strategies for missing data in machine learning.

Specifically, you learned:

  • Missing values must be marked with NaN values and can be replaced with nearest neighbor estimated values.
  • How to load a CSV file with missing values and mark the missing values with NaN values and report the number and percentage of missing values for each column.
  • How to impute missing values with nearest neighbor models as a data preparation method when evaluating models and when fitting a final model to make predictions on new data.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post kNN Imputation for Missing Values in Machine Learning appeared first on Machine Learning Mastery.


Feature Engineering and Selection (Book Review)


Data preparation is the process of transforming raw data into a form that is more appropriate for learning algorithms.

In some cases, data preparation is a required step in order to provide the data to an algorithm in its required input format. In other cases, the most appropriate representation of the input data is not known and must be explored in a trial-and-error manner in order to discover what works best for a given model and dataset.

Max Kuhn and Kjell Johnson have written a new book focused on this important topic of data preparation and how to get the most out of your data on a predictive modeling project with machine learning algorithms. The title of the book is “Feature Engineering and Selection: A Practical Approach for Predictive Models” and it was released in 2019.

In this post, you will discover my review and breakdown of the book “Feature Engineering and Selection” on the topic of data preparation for machine learning.

Let’s dive in!

Feature Engineering and Selection (Book Review)

Overview

This tutorial is divided into three parts; they are:

  1. Feature Engineering and Selection
  2. Breakdown of the Book
  3. Final Thoughts on the Book

Feature Engineering and Selection

“Feature Engineering and Selection: A Practical Approach for Predictive Models” is a book written by Max Kuhn and Kjell Johnson and published in 2019.

Kuhn and Johnson are the authors of one of my favorite books on practical machine learning titled “Applied Predictive Modeling,” published in 2013. And Kuhn is also the author of the popular caret R package for machine learning. As such, any book they publish, I will immediately buy and devour.

This new book is focused on the problem of data preparation for machine learning.

The authors highlight that although fitting and evaluating models is routine, achieving good performance for a predictive modeling problem is highly dependent upon how the data is prepared.

Despite our attempts to follow these good practices, we are sometimes frustrated to find that the best models have less-than-anticipated, less-than-useful predictive performance. This lack of performance may be due to […] relevant predictors that were collected are represented in a way that models have trouble achieving good performance.

— Page xi, “Feature Engineering and Selection,” 2019.

They refer to the process of preparing data for modeling as “feature engineering.”

This is a slightly different definition than I am used to. I would call it “data preparation” or “data preprocessing” and hold “feature engineering” apart as a subtask focused on systematic steps for creating new input variables from existing data.

Nevertheless, I see where they are coming from, as all data preparation could fit that definition.

Adjusting and reworking the predictors to enable models to better uncover predictor-response relationships has been termed feature engineering.

— Page xi, “Feature Engineering and Selection,” 2019.

They motivate the book by pointing out that we cannot know the most appropriate data representation to use in order to achieve the best predictive modeling performance.

That is, we may need to systematically test a suite of representations in order to discover what works best. This matches the empirical approach that I recommend in general, which is rarely discussed elsewhere, so it is comforting to see it in a textbook.

… we often do not know the best re-representation of the predictors to improve model performance. […] we may need to search many alternative predictor representations to improve model performance.

— Page xii, “Feature Engineering and Selection,” 2019.

Given the importance of data preparation in order to achieve good performance on a dataset, the book is focused on highlighting specific data preparation techniques and how to use them.

The goals of Feature Engineering and Selection are to provide tools for re-representing predictors, to place these tools in the context of a good predictive modeling framework, and to convey our experience of utilizing these tools in practice.

— Page xii, “Feature Engineering and Selection,” 2019.

Like their previous book, all worked examples are in R, and in this case, the source code is available from the book’s GitHub project.

Also, unlike the previous book, the complete contents of the book are also available for free online:

Next, let’s take a closer look at the topics covered by the book.

Breakdown of the Book

The book is divided into 12 chapters.

They are:

  • Chapter 1. Introduction
  • Chapter 2. Illustrative Example: Predicting Risk of Ischemic Stroke
  • Chapter 3. A Review of the Predictive Modeling Process
  • Chapter 4. Exploratory Visualizations
  • Chapter 5. Encoding Categorical Predictors
  • Chapter 6. Engineering Numeric Predictors
  • Chapter 7. Detecting Interaction Effects
  • Chapter 8. Handling Missing Data
  • Chapter 9. Working with Profile Data
  • Chapter 10. Feature Selection Overview
  • Chapter 11. Greedy Search Methods
  • Chapter 12. Global Search Methods

Let’s take a closer look at each chapter.

Chapter 1. Introduction

The introductory chapter provides a good overview of the challenge of predictive modeling.

It starts by highlighting the important distinction between descriptive and predictive models.

… the prediction of a particular value (such as arrival time) reflects an estimation problem where our goal is not necessarily to understand if a trend or fact is genuine but is focused on having the most accurate determination of that value. The uncertainty in the prediction is another important quantity, especially to gauge the trustworthiness of the value generated by the model.

— Page 1, “Feature Engineering and Selection,” 2019.

Importantly, the chapter emphasizes the need for data preparation in order to get the most out of a predictive model on a project.

The idea that there are different ways to represent predictors in a model, and that some of these representations are better than others, leads to the idea of feature engineering—the process of creating representations of data that increase the effectiveness of a model.

— Page 3, “Feature Engineering and Selection,” 2019.

As an introduction, a number of foundational topics are covered that you should probably already be familiar with, including:

  • Overfitting
  • Supervised and Unsupervised Procedures
  • No Free Lunch
  • The Model Versus the Modeling Process
  • Model Bias and Variance
  • Experience-Driven Modeling and Empirically Driven Modeling
  • Big Data

I really like that they point out the iterative nature of the predictive modeling process; it is not a single pass through the data, as is often presented elsewhere.

When modeling data, there is almost never a single model fit or feature set that will immediately solve the problem. The process is more likely to be a campaign of trial and error to achieve the best results.

— Page 16, “Feature Engineering and Selection,” 2019.

I also really like that they hammer home just how much the chosen representation of the input data impacts the performance of a model, and that we cannot guess how well a given representation will allow a model to perform.

The effect of feature sets can be much larger than the effect of different models. The interplay between models and features is complex and somewhat unpredictable.

— Page 16, “Feature Engineering and Selection,” 2019.

Chapter 2. Illustrative Example: Predicting Risk of Ischemic Stroke

As the name suggests, this chapter aims to make the process of predictive modeling concrete with a worked example.

As a primer to feature engineering, an abbreviated example is presented with a modeling process […] For the purpose of illustration, this example will focus on exploration, analysis fit, and feature engineering, through the lens of a single model (logistic regression).

— Page 21, “Feature Engineering and Selection,” 2019.

Chapter 3. A Review of the Predictive Modeling Process

This chapter reviews the process of predictive modeling with a focus on how and where data preparation fits into the process.

It covers concerns such as:

  • Measuring Performance
  • Data Splitting
  • Resampling
  • Tuning Parameters and Overfitting
  • Model Optimizing and Tuning
  • Comparing Models Using the Training Set
  • Feature Engineering Without Overfitting

The important takeaway from this chapter is that the application of data preparation in the process is critical, as a misapplication can result in data leakage and overfitting.

In order for any resampling scheme to produce performance estimates that generalize to new data, it must contain all of the steps in the modeling process that could significantly affect the model’s effectiveness.

— Pages 54-55, “Feature Engineering and Selection,” 2019.

The solution is to fit data preparation on the training dataset only, then apply the fit transforms on the test set and other datasets as needed. This is a best practice in predictive modeling when using train/test splits and k-fold cross-validation.

To provide a solid methodology, we should constrain ourselves to developing the list of preprocessing techniques, estimate them only in the presence of the training data points, and then apply the techniques to future data (including the test set).

— Page 55, “Feature Engineering and Selection,” 2019.
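
The worked examples in the book are in R, but the same best practice is easy to demonstrate with scikit-learn. The sketch below is my own minimal illustration (not code from the book): a data preparation object is fit on the training set only, then the fit transform is applied to both the train and test sets.

# sketch: fit data preparation on train data only to avoid data leakage
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
# define a small dataset and split it into train and test sets
X, y = make_classification(n_samples=100, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# estimate the scaling coefficients on the training set only
scaler = MinMaxScaler()
scaler.fit(X_train)
# apply the fit transform to the train set and to future data (the test set)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)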

Chapter 4. Exploratory Visualizations

This chapter focuses on an important step to perform prior to data preparation, namely take a close look at the data.

The authors suggest using data visualization techniques to first understand the target variable that is being predicted, then to focus on the input variables. This information can then be used to inform the types of data preparation methods to explore.

One of the first steps of the exploratory data process when the ultimate purpose is to predict a response, is to create visualizations that help elucidate knowledge of the response and then to uncover relationships between the predictors and the response.

— Page 65, “Feature Engineering and Selection,” 2019.

Chapter 5. Encoding Categorical Predictors

This chapter focuses on alternate representations for categorical variables that summarize qualitative information.

Categorical or nominal predictors are those that contain qualitative data

— Page 93, “Feature Engineering and Selection,” 2019.

Categorical variables may have a rank-order relationship (ordinal) or have no such relationship (nominal).

Simple categorical variables can also be classified as ordered or unordered. […] Ordered and unordered factors might require different approaches for including the embedded information in a model.

— Page 93, “Feature Engineering and Selection,” 2019.

This includes techniques such as dummy variables, hashing, and embeddings.
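
Again, the book’s examples are in R; as a rough point of reference, the sketch below is my own scikit-learn illustration of the two most common encodings applied to a toy nominal variable (hashing and embeddings need more machinery and are not shown).

# sketch: ordinal and one-hot encodings for a toy categorical variable
from numpy import asarray
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
# a single nominal input variable with three labels
data = asarray([['red'], ['green'], ['blue'], ['green']])
# ordinal encoding: each label is mapped to a unique integer
print(OrdinalEncoder().fit_transform(data))
# one-hot encoding: one new binary variable per label
print(OneHotEncoder(sparse=False).fit_transform(data))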

Chapter 6. Engineering Numeric Predictors

This chapter focuses on alternate representations for numerical variables, that is, quantitative data.

The objective of this chapter is to develop tools for converting these types of predictors into a form that a model can better utilize.

— Page 121, “Feature Engineering and Selection,” 2019.

There are many well-understood problems that we may observe with numerical variables such as:

  • Variables have different scales.
  • Skewed probability distributions.
  • Outliers or extreme values.
  • Bimodal distribution.
  • Complex interrelationships.
  • Redundant information.

Interestingly, the authors present a suite of techniques organized by the effect each method has on the input variables. That is, whether the method operates on one input variable or many and produces a single result or many, e.g.:

  • One-to-One
  • One-to-Many
  • Many-to-Many

This includes a host of methods, such as data scaling, power transforms, and projection methods.
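
To make the taxonomy concrete, the sketch below is my own scikit-learn illustration (not code from the book) of a one-to-one method and a many-to-many method applied to the same data.

# sketch: a one-to-one transform and a many-to-many transform
from sklearn.datasets import make_regression
from sklearn.preprocessing import PowerTransformer
from sklearn.decomposition import PCA
# define a small dataset with five numerical input variables
X, y = make_regression(n_samples=100, n_features=5, random_state=1)
# one-to-one: make each variable more Gaussian with a power transform
X_power = PowerTransformer(method='yeo-johnson').fit_transform(X)
# many-to-many: project all variables onto three new components
X_pca = PCA(n_components=3).fit_transform(X)
print(X_power.shape, X_pca.shape)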

Chapter 7. Detecting Interaction Effects

This chapter focuses on a topic that is often overlooked, which is the study of how variables interact in a dataset.

Technically, an interaction exists when two or more variables together have a greater or lesser effect than the sum of their effects when considered in isolation.

For many problems, additional variation in the response can be explained by the effect of two or more predictors working in conjunction with each other. […] More formally, two or more predictors are said to interact if their combined effect is different (less or greater) than what we would expect if we were to add the impact of each of their effects when considered alone.

— Page 157, “Feature Engineering and Selection,” 2019.

This topic is often overlooked in the context of data preparation, as it is often believed that the learning algorithms used in predictive modeling will learn any relevant interrelationships between the variables that assist in predicting the target variable.
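
One simple way to expose interactions to a model explicitly is to construct the pairwise products of the input variables as new features. The sketch below is my own scikit-learn illustration of this idea, not an example from the book.

# sketch: add pairwise interaction terms as new input features
from sklearn.datasets import make_regression
from sklearn.preprocessing import PolynomialFeatures
# define a small dataset with three numerical input variables
X, y = make_regression(n_samples=100, n_features=3, random_state=1)
# interaction_only=True adds products of pairs of inputs, not squared terms
trans = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_interact = trans.fit_transform(X)
# three original columns plus three pairwise products
print(X.shape, X_interact.shape)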

Chapter 8. Handling Missing Data

This chapter focuses on the problem of missing observations in the available data.

It is an important topic because most data has missing or corrupt values, or it will if the dataset is scaled up.

Missing data are not rare in real data sets.

— Page 157, “Feature Engineering and Selection,” 2019.

After reviewing causes for missing data and data visualizations that help to understand the scope of missing values in a dataset, the chapter works through three main solutions:

  • Delete data with missing values.
  • Encode missing values so models can learn about them.
  • Impute missing values from available data.
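
The sketch below is my own minimal Pandas/scikit-learn illustration of the three options on a tiny dataset with a single missing value; the book works through each option in R and in far more depth.

# sketch: three ways to handle a missing value
from numpy import nan
from pandas import DataFrame
from sklearn.impute import SimpleImputer
# a tiny dataset with one missing value in column 'a'
df = DataFrame({'a': [1.0, 2.0, nan, 4.0], 'b': [5.0, 6.0, 7.0, 8.0]})
# 1. delete rows that contain missing values
print(df.dropna())
# 2. encode missingness as an extra binary indicator variable
print(df.assign(a_missing=df['a'].isna().astype(int)))
# 3. impute missing values from the available data (column mean)
print(SimpleImputer(strategy='mean').fit_transform(df.values))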

Chapter 9. Working With Profile Data

This chapter provides a case study of data preparation methods for profile data.

The chapter might be poorly named, but it has to do with data that has dependencies at different scales, e.g. how to do useful data preparation with data at the day/week/month level on a given dataset (hierarchical structures).

Since the goal is to make daily predictions, the profile of within-day weather measurements should be somehow summarized at the day level in a manner that preserves the potential predictive information. For this example, daily features could include the mean or median of the numeric data and perhaps the range of values within a day.

— Page 205, “Feature Engineering and Selection,” 2019.

I’m afraid I found it entirely uninteresting. But I’m sure it would be the most interesting chapter to anyone currently working with this type of data.

Chapter 10. Feature Selection Overview

This chapter motivates the need for feature selection, namely selecting the subset of input variables that are most relevant to the target variable being predicted.

… some may not be relevant to the outcome. […] there is a genuine need to appropriately select predictors for modeling.

— Page 227, “Feature Engineering and Selection,” 2019.

In addition to sometimes lifting model performance, selecting fewer input variables can make the model more interpretable, although interpretability often comes at the cost of model skill. This is a common trade-off seen in predictive modeling.

… there is often a trade-off between predictive performance and interpretability, and it is generally not possible to maximize both at the same time.

— Page 227, “Feature Engineering and Selection,” 2019.

A framework of three methods is used to organize feature selection methods, including:

  • Intrinsic/Implicit Feature Selection.
  • Filter Feature Selection.
  • Wrapper Feature Selection.

Feature selection methodologies fall into three general classes: intrinsic (or implicit) methods, filter methods, and wrapper methods.

— Page 228, “Feature Engineering and Selection,” 2019.
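
The book’s examples are in R, but all three classes have counterparts in scikit-learn. As my own illustration of the intrinsic (implicit) class, the sketch below fits an L1-penalized linear model, which performs selection as a side effect of training by driving the coefficients of irrelevant features to zero.

# sketch: intrinsic feature selection with an L1-penalized model
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
# define a dataset where only three of ten features are informative
X, y = make_regression(n_samples=100, n_features=10, n_informative=3, noise=0.1, random_state=1)
# the L1 penalty drives coefficients of irrelevant features to zero
model = Lasso(alpha=1.0)
model.fit(X, y)
print(model.coef_)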

The remaining two chapters also focus on feature selection.

Chapter 11. Greedy Search Methods

This chapter focuses on methods that evaluate features one at a time and then select subsets of features that score well.

This includes methods that calculate the strength of a statistical relationship between each input and the target, and methods that iteratively delete features from the dataset, evaluating a model at each step.

A simple approach to identifying potentially predictively important features is to evaluate each feature individually. […] Simple filters are ideal for finding individual predictors. However, this approach does not take into account the impact of multiple features together.

— Page 255, “Feature Engineering and Selection,” 2019.
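
The filter style of greedy selection maps directly onto scikit-learn’s SelectKBest class. The sketch below is my own illustration, not code from the book: each feature is scored individually with the ANOVA F-statistic and the top five are kept.

# sketch: a simple filter method that scores each feature individually
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
# define a dataset where five of ten features are informative
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, random_state=1)
# keep the five features with the strongest ANOVA F-statistic
fs = SelectKBest(score_func=f_classif, k=5)
X_selected = fs.fit_transform(X, y)
print(X.shape, X_selected.shape)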

Chapter 12. Global Search Methods

This chapter focuses on global search algorithms that test different subsets of features based on the performance of the models fit on those features.

Global search methods can be an effective tool for investigating the predictor space and identifying subsets of predictors that are optimally related to the response. […] Although the global search approaches are usually effective at finding good feature sets, they are computationally taxing.

— Page 281, “Feature Engineering and Selection,” 2019.

This includes well-known global stochastic search algorithms such as simulated annealing and genetic algorithms.
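
Simulated annealing and genetic algorithms need more machinery than fits in a snippet, but the flavor of a global (wrapper) search can be conveyed with a naive random search over feature subsets. The sketch below is my own illustration, not code from the book.

# sketch: a naive global search over random feature subsets
from numpy import mean
from numpy.random import default_rng
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# define a dataset where four of ten features are informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=1)
rng = default_rng(1)
best_score, best_subset = 0.0, None
for _ in range(20):
	# draw a random subset of five column indices and score a model on it
	subset = rng.choice(X.shape[1], size=5, replace=False)
	score = mean(cross_val_score(LogisticRegression(), X[:, subset], y, cv=3))
	if score > best_score:
		best_score, best_subset = score, sorted(subset)
print(best_score, best_subset)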

Final Thoughts on the Book

I think this is the much needed missing textbook on data preparation.

I also think that if you are a serious machine learning practitioner, you need a copy.

If you are familiar with both R and Python for machine learning, the book highlights just how far libraries like Python/scikit-learn have to go to catch up to the R/caret ecosystem.

When it comes to data preparation, I don’t think worked examples are as useful as they are when demonstrating algorithms. Perhaps it is just me and my preference. Given how different each dataset is in terms of the number, type, and composition of features, demonstrating data preparation on standard datasets is not a helpful teaching aid.

What I would prefer is a more systematic coverage of the problems we may see in raw data when it comes to modeling and how each data preparation method addresses it. E.g. I’d love a long catalog of methods, how they work, and when to use them rather than prose about each method.

Anyway, that is just me pushing hard on how to make the book better or an alternate vision of the material. It’s a must-have, no doubt.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this post, you discovered my review and breakdown of the book Feature Engineering and Selection on the topic of data preparation for machine learning.

Have you read the book?
Let me know what you think of it in the comments below.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Feature Engineering and Selection (Book Review) appeared first on Machine Learning Mastery.

Data Preparation for Machine Learning (7-Day Mini-Course)


Data Preparation for Machine Learning Crash Course.
Get on top of data preparation with Python in 7 days.

Data preparation involves transforming raw data into a form that is more appropriate for modeling.

Preparing data may be the most important part of a predictive modeling project and the most time-consuming, although it seems to be the least discussed. Instead, the focus is on machine learning algorithms, whose usage and parameterization have become quite routine.

Practical data preparation requires knowledge of data cleaning, feature selection, data transforms, dimensionality reduction, and more.

In this crash course, you will discover how you can get started and confidently prepare data for a predictive modeling project with Python in seven days.

This is a big and important post. You might want to bookmark it.

Discover data cleaning, feature selection, data transforms, dimensionality reduction and much more in my new book, with 30 step-by-step tutorials and full Python source code.

Let’s get started.

  • Updated Jun/2020: Changed the target for the horse colic dataset.

Data Preparation for Machine Learning (7-Day Mini-Course)
Photo by Christian Collins, some rights reserved.

Who Is This Crash-Course For?

Before we get started, let’s make sure you are in the right place.

This course is for developers who may know some applied machine learning. Maybe you know how to work through a predictive modeling problem end to end, or at least most of the main steps, with popular tools.

The lessons in this course do assume a few things about you, such as:

  • You know your way around basic Python for programming.
  • You may know some basic NumPy for array manipulation.
  • You may know some basic scikit-learn for modeling.

You do NOT need to be:

  • A math wiz!
  • A machine learning expert!

This crash course will take you from a developer who knows a little machine learning to a developer who can effectively and competently prepare data for a predictive modeling project.

Note: This crash course assumes you have a working Python 3 SciPy environment with at least NumPy installed. If you need help with your environment, you can follow the step-by-step tutorial here:

Crash-Course Overview

This crash course is broken down into seven lessons.

You could complete one lesson per day (recommended) or complete all of the lessons in one day (hardcore). It really depends on the time you have available and your level of enthusiasm.

Below is a list of the seven lessons that will get you started and productive with data preparation in Python:

  • Lesson 01: Importance of Data Preparation
  • Lesson 02: Fill Missing Values With Imputation
  • Lesson 03: Select Features With RFE
  • Lesson 04: Scale Data With Normalization
  • Lesson 05: Transform Categories With One-Hot Encoding
  • Lesson 06: Transform Numbers to Categories With kBins
  • Lesson 07: Dimensionality Reduction with PCA

Each lesson could take you 60 seconds or up to 30 minutes. Take your time and complete the lessons at your own pace. Ask questions and even post results in the comments below.

The lessons might expect you to go off and find out how to do things. I will give you hints, but part of the point of each lesson is to force you to learn where to go to look for help with the algorithms and the best-of-breed tools in Python. (Hint: I have all of the answers on this blog; use the search box.)

Post your results in the comments; I’ll cheer you on!

Hang in there; don’t give up.


Lesson 01: Importance of Data Preparation

In this lesson, you will discover the importance of data preparation in predictive modeling with machine learning.

Predictive modeling projects involve learning from data.

Data refers to examples or cases from the domain that characterize the problem you want to solve.

On a predictive modeling project, such as classification or regression, raw data typically cannot be used directly.

There are four main reasons why this is the case:

  • Data Types: Machine learning algorithms require data to be numbers.
  • Data Requirements: Some machine learning algorithms impose requirements on the data.
  • Data Errors: Statistical noise and errors in the data may need to be corrected.
  • Data Complexity: Complex nonlinear relationships may be teased out of the data.

The raw data must be pre-processed prior to being used to fit and evaluate a machine learning model. This step in a predictive modeling project is referred to as “data preparation.”

There are common or standard tasks that you may use or explore during the data preparation step in a machine learning project.

These tasks include:

  • Data Cleaning: Identifying and correcting mistakes or errors in the data.
  • Feature Selection: Identifying those input variables that are most relevant to the task.
  • Data Transforms: Changing the scale or distribution of variables.
  • Feature Engineering: Deriving new variables from available data.
  • Dimensionality Reduction: Creating compact projections of the data.

Each of these tasks is a whole field of study with specialized algorithms.

Your Task

For this lesson, you must list three data preparation algorithms that you know of or may have used before and give a one-line summary of each one’s purpose.

One example of a data preparation algorithm is data normalization, which scales numerical variables to the range between zero and one.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to fix data that has missing values, called data imputation.

Lesson 02: Fill Missing Values With Imputation

In this lesson, you will discover how to identify and fill missing values in data.

Real-world data often has missing values.

Data can have missing values for a number of reasons, such as observations that were not recorded and data corruption. Handling missing data is important as many machine learning algorithms do not support data with missing values.

Filling missing values with data is called data imputation and a popular approach for data imputation is to calculate a statistical value for each column (such as a mean) and replace all missing values for that column with the statistic.

The horse colic dataset describes medical characteristics of horses with colic and whether they lived or died. It has missing values marked with a question mark ‘?’. We can load the dataset with the read_csv() function and ensure that question mark values are marked as NaN.

Once loaded, we can use the SimpleImputer class to transform all missing values marked with a NaN value with the mean of the column.

The complete example is listed below.

# statistical imputation transform for the horse colic dataset
from numpy import isnan
from pandas import read_csv
from sklearn.impute import SimpleImputer
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
# split into input and output elements
data = dataframe.values
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
# print total missing
print('Missing: %d' % sum(isnan(X).flatten()))
# define imputer
imputer = SimpleImputer(strategy='mean')
# fit on the dataset
imputer.fit(X)
# transform the dataset
Xtrans = imputer.transform(X)
# print total missing
print('Missing: %d' % sum(isnan(Xtrans).flatten()))

Your Task

For this lesson, you must run the example and review the number of missing values in the dataset before and after the data imputation transform.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to select the most important features in a dataset.

Lesson 03: Select Features With RFE

In this lesson, you will discover how to select the most important features in a dataset.

Feature selection is the process of reducing the number of input variables when developing a predictive model.

It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and, in some cases, to improve the performance of the model.

Recursive Feature Elimination, or RFE for short, is a popular feature selection algorithm.

RFE is popular because it is easy to configure and use and because it is effective at selecting those features (columns) in a training dataset that are more or most relevant in predicting the target variable.

The scikit-learn Python machine learning library provides an implementation of RFE for machine learning. RFE is a transform. To use it, first, the class is configured with the chosen algorithm specified via the “estimator” argument and the number of features to select via the “n_features_to_select” argument.

The example below defines a synthetic classification dataset with five redundant input features. RFE is then used to select five features using the decision tree algorithm.

# report which features were selected by RFE
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)
# define RFE
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)
# fit RFE
rfe.fit(X, y)
# summarize all features
for i in range(X.shape[1]):
	print('Column: %d, Selected=%s, Rank: %d' % (i, rfe.support_[i], rfe.ranking_[i]))

Your Task

For this lesson, you must run the example and review which features were selected and the relative ranking that each input feature was assigned.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to scale numerical data.

Lesson 04: Scale Data With Normalization

In this lesson, you will discover how to scale numerical data for machine learning.

Many machine learning algorithms perform better when numerical input variables are scaled to a standard range.

This includes algorithms that use a weighted sum of the input, like linear regression, and algorithms that use distance measures, like k-nearest neighbors.

One of the most popular techniques for scaling numerical data prior to modeling is normalization. Normalization scales each input variable separately to the range 0-1, e.g. a value is rescaled as y = (x - min) / (max - min). It requires that you know or are able to accurately estimate the minimum and maximum observable values for each variable. You may be able to estimate these values from your available data.

You can normalize your dataset using the scikit-learn object MinMaxScaler.

The example below defines a synthetic classification dataset, then uses the MinMaxScaler to normalize the input variables.

# example of normalizing input data
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=5, n_redundant=0, random_state=1)
# summarize data before the transform
print(X[:3, :])
# define the scaler
trans = MinMaxScaler()
# transform the data
X_norm = trans.fit_transform(X)
# summarize data after the transform
print(X_norm[:3, :])

Your Task

For this lesson, you must run the example and report the scale of the input variables both prior to and then after the normalization transform.

For bonus points, calculate the minimum and maximum of each variable before and after the transform to confirm it was applied as expected.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to transform categorical variables to numbers.

Lesson 05: Transform Categories With One-Hot Encoding

In this lesson, you will discover how to encode categorical input variables as numbers.

Machine learning models require all input and output variables to be numeric. This means that if your data contains categorical data, you must encode it to numbers before you can fit and evaluate a model.

One of the most popular techniques for transforming categorical variables into numbers is the one-hot encoding.

Categorical data are variables that contain label values rather than numeric values.

Each label for a categorical variable can be mapped to a unique integer, called an ordinal encoding. Then, a one-hot encoding can be applied to the ordinal representation. This is where one new binary variable is added to the dataset for each unique integer value in the variable, and the original categorical variable is removed from the dataset.

For example, imagine we have a “color” variable with three categories (‘red‘, ‘green‘, and ‘blue‘). In this case, three binary variables are needed. A “1” value is placed in the binary variable for the color and “0” values for the other colors.

For example:

red,	green,	blue
1,		0,		0
0,		1,		0
0,		0,		1

This one-hot encoding transform is available in the scikit-learn Python machine learning library via the OneHotEncoder class.

The breast cancer dataset contains only categorical input variables.

The example below loads the dataset and one hot encodes each of the categorical input variables.

# one-hot encode the breast cancer dataset
from pandas import read_csv
from sklearn.preprocessing import OneHotEncoder
# define the location of the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv"
# load the dataset
dataset = read_csv(url, header=None)
# retrieve the array of data
data = dataset.values
# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)
# summarize the raw data
print(X[:3, :])
# define the one hot encoding transform
encoder = OneHotEncoder(sparse=False)
# fit and apply the transform to the input data
X_oe = encoder.fit_transform(X)
# summarize the transformed data
print(X_oe[:3, :])

Your Task

For this lesson, you must run the example and report on the raw data before the transform, and the impact on the data after the one-hot encoding was applied.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to transform numerical variables into categories.

Lesson 06: Transform Numbers to Categories With kBins

In this lesson, you will discover how to transform numerical variables into categorical variables.

Some machine learning algorithms may prefer or require categorical or ordinal input variables, such as some decision tree and rule-based algorithms.

Further, many machine learning algorithms prefer or perform better when numerical input variables with non-standard distributions are transformed to have a new distribution or an entirely new data type. Non-standard distributions could be caused by outliers in the data, multi-modal distributions, highly exponential distributions, and more.

One approach is to transform the numerical variable into a discrete variable where each numerical value is assigned a label and the labels have an ordered (ordinal) relationship.

This is called a discretization transform, and it can improve the performance of some machine learning models by making the probability distribution of numerical input variables discrete.

The discretization transform is available in the scikit-learn Python machine learning library via the KBinsDiscretizer class.

It allows you to specify the number of discrete bins to create (n_bins), whether the result of the transform will be an ordinal or one-hot encoding (encode), and the distribution used to divide up the values of the variable (strategy), such as ‘uniform.’

The example below creates a synthetic classification dataset with five numerical input variables, then encodes each into 10 discrete bins with an ordinal encoding.

# discretize numeric input variables
from sklearn.datasets import make_classification
from sklearn.preprocessing import KBinsDiscretizer
# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=5, n_redundant=0, random_state=1)
# summarize data before the transform
print(X[:3, :])
# define the transform
trans = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')
# transform the data
X_discrete = trans.fit_transform(X)
# summarize data after the transform
print(X_discrete[:3, :])

Your Task

For this lesson, you must run the example and report on the raw data before the transform, and then the effect the transform had on the data.

For bonus points, explore alternate configurations of the transform, such as different strategies and number of bins.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to reduce the dimensionality of input data.

Lesson 07: Dimensionality Reduction With PCA

In this lesson, you will discover how to use dimensionality reduction to reduce the number of input variables in a dataset.

The number of input variables or features for a dataset is referred to as its dimensionality.

Dimensionality reduction refers to techniques that reduce the number of input variables in a dataset.

More input features often make a predictive modeling task more challenging to model, more generally referred to as the curse of dimensionality.

Although dimensionality reduction techniques are often used in high-dimensional statistics for data visualization, they can also be used in applied machine learning to simplify a classification or regression dataset in order to better fit a predictive model.

Perhaps the most popular technique for dimensionality reduction in machine learning is Principal Component Analysis, or PCA for short. This is a technique that comes from the field of linear algebra and can be used as a data preparation technique to create a projection of a dataset prior to fitting a model.

The resulting dataset, the projection, can then be used as input to train a machine learning model.

The scikit-learn library provides the PCA class that can be fit on a dataset and used to transform a training dataset and any additional datasets in the future.

The example below creates a synthetic binary classification dataset with 10 input variables then uses PCA to reduce the dimensionality of the dataset to the three most important components.

# example of pca for dimensionality reduction
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, n_redundant=7, random_state=1)
# summarize data before the transform
print(X[:3, :])
# define the transform
trans = PCA(n_components=3)
# transform the data
X_dim = trans.fit_transform(X)
# summarize data after the transform
print(X_dim[:3, :])

Your Task

For this lesson, you must run the example and report on the structure and form of the raw dataset and the dataset after the transform was applied.

For bonus points, explore transforms with different numbers of selected components.

Post your answer in the comments below. I would love to see what you come up with.

This was the final lesson in the mini-course.

The End!
(Look How Far You Have Come)

You made it. Well done!

Take a moment and look back at how far you have come.

You discovered:

  • The importance of data preparation in a predictive modeling machine learning project.
  • How to mark missing data and impute the missing values using statistical imputation.
  • How to remove redundant input variables using recursive feature elimination.
  • How to transform input variables with differing scales to a standard range called normalization.
  • How to transform categorical input variables to be numbers called one-hot encoding.
  • How to transform numerical variables into discrete categories called discretization.
  • How to use PCA to create a projection of a dataset into a lower number of dimensions.

Summary

How did you do with the mini-course?
Did you enjoy this crash course?

Do you have any questions? Were there any sticking points?
Let me know. Leave a comment below.

The post Data Preparation for Machine Learning (7-Day Mini-Course) appeared first on Machine Learning Mastery.

8 Top Books on Data Cleaning and Feature Engineering


Data preparation is the transformation of raw data into a form that is more appropriate for modeling.

It is a challenging topic to discuss as the data differs in form, type, and structure from project to project.

Nevertheless, there are common data preparation tasks across projects. It is a huge field of study and goes by many names, such as “data cleaning,” “data wrangling,” “data preprocessing,” “feature engineering,” and more. Some of these are distinct data preparation tasks, and some of the terms are used to describe the entire data preparation process.

Even though it is a challenging topic to discuss, there are a number of books on the topic.

In this post, you will discover the top books on data cleaning, data preparation, feature engineering, and related topics.

Let’s get started.


Overview

The focus here is on data preparation for tabular data, e.g. data in the form of a table with rows and columns, as it looks in an Excel spreadsheet.

Data preparation is an important topic for all data types, although specialty methods are required for each, such as image data in computer vision, text data in natural language processing, and sequence data in time series forecasting.

Data preparation is often a chapter in a machine learning textbook, although there are books dedicated to the topic. We will focus on these books.

I have gathered all the books I can find on the topic of data preparation, selected what I think are the best or better books, and organized them into three groups; they are:

  1. Data Cleaning
  2. Data Wrangling
  3. Feature Engineering

I will try to give the flavor of each book, including the goal, the table of contents, and where to learn more about it.


Data Cleaning

Data cleaning refers to identifying and fixing errors in the data prior to modeling, including, but not limited to, outliers, missing values, and much more.

The top books on data cleaning include:

  • “Bad Data Handbook” by Q. Ethan McCallum (editor), 2012.
  • “Best Practices in Data Cleaning” by Jason Osborne, 2012.
  • “Data Cleaning” by Ihab Ilyas and Xu Chu, 2019.

Let’s take a closer look at each in turn.

“Bad Data Handbook”

The book “Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work” was edited by Q. Ethan McCallum and was published in 2012.

Bad data is described not only as corrupt data but any data that impairs the modeling process.

It’s tough to nail down a precise definition of “Bad Data.” Some people consider it a purely hands-on, technical phenomenon: missing values, malformed records, and cranky file formats. Sure, that’s part of the picture, but Bad Data is so much more. […] Bad Data is data that gets in the way.

— Page 1, “Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work,” 2012.

It is a collection of essays by 19 machine learning practitioners and is full of useful nuggets on data preparation and management.

Bad Data Handbook

The complete table of contents for the book is listed below.

  • Chapter 01: Setting the Pace: What Is Bad Data?
  • Chapter 02: Is It Just Me, or Does This Data Smell Funny?
  • Chapter 03: Data Intended for Human Consumption, Not Machine Consumption
  • Chapter 04: Bad Data Lurking in Plain Text
  • Chapter 05: (Re)Organizing the Web’s Data
  • Chapter 06: Detecting Liars and the Confused in Contradictory Online Reviews
  • Chapter 07: Will the Bad Data Please Stand Up?
  • Chapter 08: Blood, Sweat, and Urine
  • Chapter 09: When Data and Reality Don’t Match
  • Chapter 10: Subtle Sources of Bias and Error
  • Chapter 11: Don’t Let the Perfect Be the Enemy of the Good: Is Bad Data Really Bad?
  • Chapter 12: When Databases Attack: A Guide for When to Stick to Files
  • Chapter 13: Crouching Table, Hidden Network
  • Chapter 14: Myths of Cloud Computing
  • Chapter 15: The Dark Side of Data Science
  • Chapter 16: How to Feed and Care for Your Machine-Learning Expert
  • Chapter 17: Data Traceability
  • Chapter 18: Social Media: Erasable Ink?
  • Chapter 19: Data Quality Analysis Demystified: Knowing When Your Data Is Good Enough

I like this book a lot; it is full of valuable practical advice. I highly recommend it!


“Best Practices in Data Cleaning”

The book “Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data” was written by Jason Osborne and was published in 2012.

This is a more general textbook on data preparation for computational-based social sciences rather than machine learning specifically. Nevertheless, it contains a ton of useful advice.

My goal in writing this book is to collect, in one place, a systematic overview of what I consider to be best practices in data cleaning—things I can demonstrate as making a difference in your data analyses. I seek to change the status quo, the current state of affairs in quantitative research in the social sciences (and beyond).

— Page 2, “Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data,” 2012.

Best Practices in Data Cleaning

The complete table of contents for the book is listed below.

  • Chapter 01: Why Data Cleaning Is Important: Debunking the Myth of Robustness
  • Chapter 02: Power and Planning for Data Collection: Debunking the Myth of Adequate Power
  • Chapter 03: Being True to the Target Population: Debunking the Myth of Representativeness
  • Chapter 04: Using Large Data Sets With Probability Sampling Frameworks: Debunking the Myth of Equality
  • Chapter 05: Screening Your Data for Potential Problems: Debunking the Myth of Perfect Data
  • Chapter 06: Dealing With Missing or Incomplete Data: Debunking the Myth of Emptiness
  • Chapter 07: Extreme and Influential Data Points: Debunking the Myth of Equality
  • Chapter 08: Improving the Normality of Variables Through Box-Cox Transformation: Debunking the Myth of Distributional Irrelevance
  • Chapter 09: Does Reliability Matter? Debunking the Myth of Perfect Measurement
  • Chapter 10: Random Responding, Motivated Misresponding, and Response Sets: Debunking the Myth of the Motivated Participant
  • Chapter 11: Why Dichotomizing Continuous Variables Is Rarely a Good Practice: Debunking the Myth of Categorization
  • Chapter 12: The Special Challenge of Cleaning Repeated Measures Data: Lots of Pits in Which to Fall
  • Chapter 13: Now That the Myths Are Debunked…: Visions of Rational Quantitative Methodology for the 21st Century

I think this is a great reference guide for general data preparation techniques, perhaps better coverage than most “machine learning” focused books given the stronger statistical focus.


“Data Cleaning”

The book “Data Cleaning” was written by Ihab Ilyas and Xu Chu, and published in 2019.

As the name suggests, the book is focused on data cleaning techniques that fix errors in raw data prior to modeling.

Data cleaning is used to refer to all kinds of tasks and activities to detect and repair errors in the data. Rather than focus on a particular data cleaning task, in this book, we give an overview of the end-to-end data cleaning process, describing various error detection and repair methods, and attempt to anchor these proposals with multiple taxonomies and views.

— Page ixx, “Data Cleaning,” 2019.

Data Cleaning

The complete table of contents for the book is listed below.

  • Chapter 01: Introduction
  • Chapter 02: Outlier Detection
  • Chapter 03: Data Deduplication
  • Chapter 04: Data Transformation
  • Chapter 05: Data Quality Rule Definition and Discovery
  • Chapter 06: Rule-Based Data Cleaning
  • Chapter 07: Machine Learning and Probabilistic Data Cleaning
  • Chapter 08: Conclusion and Future Thoughts

It is more of a textbook than a practical book and is a good fit for academics and researchers looking for both a review of methods and references to the original research papers.


Data Wrangling

Data wrangling is a more general or colloquial term for data preparation that might include some data cleaning and feature engineering.

The top books on data wrangling include:

  • “Data Wrangling with Python” by Jacqueline Kazil and Katharine Jarmul, 2016.
  • “Principles of Data Wrangling” by Tye Rattenbury, et al., 2017.
  • “Data Wrangling with R” by Bradley Boehmke, 2016.

Let’s take a closer look at each in turn.

“Data Wrangling with Python”

The book “Data Wrangling with Python: Tips and Tools to Make Your Life Easier” was written by Jacqueline Kazil and Katharine Jarmul and was published in 2016.

This book focuses on the tools and methods that help you get raw data into a form ready for modeling.

Data wrangling is about taking a messy or unrefined source of data and turning it into something useful.

— Page xii, “Data Wrangling with Python: Tips and Tools to Make Your Life Easier,” 2016.

This is a beginner’s book for those taking their first steps into Python for data preparation and modeling, e.g. current Excel users.

This book is for folks who want to explore data wrangling beyond desktop tools. If you are great at Excel and want to take your data analysis to the next level, this book will help!

— Page xii, “Data Wrangling with Python: Tips and Tools to Make Your Life Easier,” 2016.

Data Wrangling with Python

The complete table of contents for the book is listed below.

  • Chapter 01: Introduction to Python
  • Chapter 02: Python Basics
  • Chapter 03: Data Meant to Be Read by Machines
  • Chapter 04: Working with Excel Files
  • Chapter 05: PDFs and Problem Solving in Python
  • Chapter 06: Acquiring and Storing Data
  • Chapter 07: Data Cleanup: Investigation, Matching, and Formatting
  • Chapter 08: Data Cleanup: Standardizing and Scripting
  • Chapter 09: Data Exploration and Analysis
  • Chapter 10: Presenting Your Data
  • Chapter 11: Web Scraping: Acquiring and Storing Data from the Web
  • Chapter 12: Advanced Web Scraping: Screen Scrapers and Spiders
  • Chapter 13: APIs
  • Chapter 14: Automation and Scaling
  • Chapter 15: Conclusion

This is the book to get if you are just starting out with Python for data loading and organization.


“Principles of Data Wrangling”

The book “Principles of Data Wrangling: Practical Techniques for Data Preparation” was written by Tye Rattenbury, et al. and was published in 2017.

Data wrangling is used to describe all of the tasks related to getting data ready for modeling.

The phrase data wrangling, born in the modern context of agile analytics, is meant to describe the lion’s share of the time people spend working with data.

— Page ix, “Principles of Data Wrangling: Practical Techniques for Data Preparation,” 2017.

Principles of Data Wrangling

The complete table of contents for the book is listed below.

  • Chapter 01: Introduction
  • Chapter 02: A Data Workflow Framework
  • Chapter 03: The Dynamics of Data Wrangling
  • Chapter 04: Profiling
  • Chapter 05: Transformation: Structuring
  • Chapter 06: Transformation: Enriching
  • Chapter 07: Using Transformation to Clean Data
  • Chapter 08: Roles and Responsibilities
  • Chapter 09: Data Wrangling Tools

It’s a good book, but very high level. Perhaps it is better suited to the manager than the practitioner. For example, I don’t think I saw a single line of code.

Learn More:

“Data Wrangling with R”

The book “Data Wrangling with R” was written by Bradley Boehmke and was published in 2016.

As its name suggests, this book is focused on data preparation with R.

In this book, I will help you learn the essentials of preprocessing data leveraging the R programming language to easily and quickly turn noisy data into usable pieces of information.

— Page v, “Data Wrangling with R,” 2016.

This is a practical book. It has lots of small, focused chapters with code examples on specific problems you will encounter during data preparation. It’s a welcome change compared to many of the other high-level books in this round-up.

Data Wrangling with R

The complete table of contents for the book is listed below.

  • Chapter 01: The Role of Data Wrangling
  • Chapter 02: Introduction to R
  • Chapter 03: The Basics
  • Chapter 04: Dealing with Numbers
  • Chapter 05: Dealing with Character Strings
  • Chapter 06: Dealing with Regular Expressions
  • Chapter 07: Dealing with Factors
  • Chapter 08: Dealing with Dates
  • Chapter 09: Data Structure Basics
  • Chapter 10: Managing Vectors
  • Chapter 11: Managing Lists
  • Chapter 12: Managing Matrices
  • Chapter 13: Managing Data Frames
  • Chapter 14: Dealing with Missing Values
  • Chapter 15: Importing Data
  • Chapter 16: Scraping Data
  • Chapter 17: Exporting Data
  • Chapter 18: Functions
  • Chapter 19: Loop Control Statements
  • Chapter 20: Simplify Your Code with %>%
  • Chapter 21: Reshaping Your Data with tidyr
  • Chapter 22: Transforming Your Data with dplyr

I’m a fan of this book, and if you are using R, you need a copy. A downside is that there is a little too much coverage of the R basics in this book. I would rather this material be left out and the reader directed to an introductory R book, raising the requirements on the reader slightly.

Learn More:

Feature Engineering

Feature engineering refers to creating new input variables from raw data, although it also refers to data preparation more generally.

Top books on feature engineering include:

Let’s take a closer look at each in turn.

“Feature Engineering and Selection”

The book “Feature Engineering and Selection: A Practical Approach for Predictive Models” was written by Max Kuhn and Kjell Johnson and was published in 2019.

This book describes the general process of preparing raw data for modeling as feature engineering.

Adjusting and reworking the predictors to enable models to better uncover predictor-response relationships has been termed feature engineering.

— Page xi, “Feature Engineering and Selection: A Practical Approach for Predictive Models,” 2019.

The examples in the book are demonstrated using R, which is important, as the author Max Kuhn is also the creator of the popular caret package.

An important perspective taken in the book is that data preparation is not just about meeting the expectations of modeling algorithms; it is required to best expose the underlying structure of the problem, requiring iterative trial and error. This is the same perspective that I take in general and it’s refreshing to see in a modern book.

… we often do not know the best re-representation of the predictors to improve model performance. Instead, the re-working of predictors is more of an art, requiring the right tools and experience to find better predictor representations. Moreover, we may need to search many alternative predictor representations to improve model performance.

— Page xii, “Feature Engineering and Selection: A Practical Approach for Predictive Models,” 2019.

Feature Engineering and Selection

The complete table of contents for the book is listed below.

  • Chapter 1. Introduction
  • Chapter 2. Illustrative Example: Predicting Risk of Ischemic Stroke
  • Chapter 3. A Review of the Predictive Modeling Process
  • Chapter 4. Exploratory Visualizations
  • Chapter 5. Encoding Categorical Predictors
  • Chapter 6. Engineering Numeric Predictors
  • Chapter 7. Detecting Interaction Effects
  • Chapter 8. Handling Missing Data
  • Chapter 9. Working with Profile Data
  • Chapter 10. Feature Selection Overview
  • Chapter 11. Greedy Search Methods
  • Chapter 12. Global Search Methods

I think this is a must-own book, even if R is not your primary language. The breadth of the methods discussed is worth the sticker price alone.

Learn More:

“Feature Engineering for Machine Learning”

The book “Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists” was written by Alice Zheng and Amanda Casari and was published in 2018.

I think this book has the most direct definitions up front of all of the books I looked at, describing a feature as a numerical input to a model and feature engineering as the process of extracting useful numerical features from raw data. Very crisp!

A feature is a numeric representation of an aspect of raw data. Features sit between data and models in the machine learning pipeline. Feature engineering is the act of extracting features from raw data and transforming them into formats that are suitable for the machine learning model.

— Page vii, “Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists,” 2018.

The examples are in Python and focus on using NumPy and Pandas, and there are lots of worked examples, which are great. I think this is a good sister book or Python equivalent to the above “Data Wrangling with R” or “Feature Engineering and Selection,” although perhaps with less coverage.

Feature Engineering for Machine Learning

The complete table of contents for the book is listed below.

  • Chapter 1: Machine Learning Pipeline
  • Chapter 2: Fancy Tricks with Simple Numbers
  • Chapter 3: Text Data: Flattening, Filtering, and Chunking
  • Chapter 4: The Effects of Feature Scaling: From Bag-of-Words to Tf-Idf
  • Chapter 5: Categorical Variables: Counting Eggs in the Age of Robotic Chickens
  • Chapter 6: Dimensionality Reduction: Squashing the Data Pancake with PCA
  • Chapter 7: Nonlinear Featurization via K-Means Model Stacking
  • Chapter 8: Automating the Featurizer: Image Feature Extraction and Deep Learning
  • Chapter 9: Back to the Future: Building an Academic Paper Recommender
  • Appendix A: Linear Modeling and Linear Algebra Basics

I like the book.

I guess I would prefer to drop the math and direct the reader to a textbook. I would also prefer the examples to focus on the machine learning modeling pipeline rather than standalone transforms. But I’m being picky and pushing hard for directly useful code on a given project.

Learn More:

Recommendations

You have to pick the book that is right for you, based on your needs, e.g. code or textbook, Python or R.

I own all of these books, but the two I recommend are:

The reason is I like practical books and I like the R and Python perspectives when I’m figuring out what to try.

A close follow-up would be:

The first is super practical; the second is full of super helpful (yet super specific) advice.

For textbooks, needed for their references by most researchers, I’d probably recommend:

Summary

In this post, you discovered the top books on data cleaning, data preparation, feature engineering and related topics.

Did I miss a good book on data preparation?
Let me know in the comments below.

Have you read any of the books listed?
Let me know what you think of them in the comments.

The post 8 Top Books on Data Cleaning and Feature Engineering appeared first on Machine Learning Mastery.

How to Choose Data Preparation Methods for Machine Learning


Data preparation is an important part of a predictive modeling project.

Correct application of data preparation will transform raw data into a representation that allows learning algorithms to get the most out of the data and make skillful predictions. The problem is that choosing a transform, or sequence of transforms, that results in a useful representation is very challenging. So much so that it may be considered more of an art than a science.

In this tutorial, you will discover strategies that you can use to select data preparation techniques for your predictive modeling datasets.

After completing this tutorial, you will know:

  • Data preparation techniques can be chosen based on detailed knowledge of the dataset and algorithm and this is the most common approach.
  • Data preparation techniques can be grid searched as just another hyperparameter in the modeling pipeline.
  • Data transforms can be applied to a training dataset in parallel to create many extracted features on which feature selection can be applied and a model trained.


Let’s get started.

How to Choose Data Preparation Methods for Machine Learning
Photo by StockPhotosforFree, some rights reserved.

Tutorial Overview

This tutorial is divided into four parts; they are:

  1. Strategies for Choosing Data Preparation Techniques
  2. Approach 1: Manually Specify Data Preparation
  3. Approach 2: Grid Search Data Preparation Methods
  4. Approach 3: Apply Data Preparation Methods in Parallel

Strategies for Choosing Data Preparation Techniques

The performance of a machine learning model is only as good as the data used to train it.

This puts a heavy burden on the data and the techniques used to prepare it for modeling.

Data preparation refers to the techniques used to transform raw data into a form that best meets the expectations or requirements of a machine learning algorithm.

It is a challenge because we cannot know beforehand which representation of the raw data will result in good or best performance of a predictive model.

However, we often do not know the best re-representation of the predictors to improve model performance. Instead, the re-working of predictors is more of an art, requiring the right tools and experience to find better predictor representations. Moreover, we may need to search many alternative predictor representations to improve model performance.

— Page xii, Feature Engineering and Selection, 2019.

Instead, we must use controlled experiments to systematically evaluate data transforms on a model in order to discover what works well or best.

As such, on a predictive modeling project, there are three main strategies we may decide to use in order to select a data preparation technique or sequences of techniques for a dataset; they are:

  1. Manually specify the data preparation to use for a given algorithm based on the deep knowledge of the data and the chosen algorithm.
  2. Test a suite of different data transforms and sequences of transforms and discover what works well or best on the dataset for one model or a range of models.
  3. Apply a suite of data transforms on the data in parallel to create a large number of engineered features that can be reduced using feature selection and used to train models.

Let’s take a closer look at each of these approaches in turn.


Approach 1: Manually Specify Data Preparation

This approach involves studying the data and the requirements of specific algorithms and selecting data transforms that change your data to best meet the requirements.

Many practitioners see this as the only possible approach to selecting data preparation techniques as it is often the only approach taught or described in textbooks.

This approach might involve first selecting an algorithm and preparing data specifically for it, or testing a suite of algorithms and ensuring the data preparation methods are tailored to each algorithm.

This approach requires having detailed knowledge about your data. This can be achieved by reviewing summary statistics for each variable, plots of data distributions, and possibly even statistical tests to see if the data matches a known distribution.
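For example, a quick review of a dataset along these lines might look like the sketch below. This is a minimal sketch, not part of the original tutorial; it assumes the house price dataset URL used elsewhere in this document and that pandas and matplotlib are available.

# a minimal sketch: review summary statistics and distributions before choosing transforms
from pandas import read_csv
from matplotlib import pyplot
# load a dataset (the house price dataset used later in this document)
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# summary statistics for each variable
print(df.describe())
# histograms to review the distribution of each variable
df.hist()
pyplot.show()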

This approach also requires detailed knowledge of the algorithms you will be using. This can be achieved by reviewing textbooks that describe the algorithms.

From a high level, the data requirements of most algorithms are well known.

For example, the following algorithms will probably be sensitive to the scale and distribution of your numerical input variables, as well as the presence of irrelevant and redundant variables:

  • Linear Regression (and extensions)
  • Logistic Regression
  • Linear Discriminant Analysis
  • Gaussian Naive Bayes
  • Neural Networks
  • Support Vector Machines
  • k-Nearest Neighbors

The following algorithms will probably not be sensitive to the scale and distribution of your numerical input variables and are reasonably insensitive to irrelevant and redundant variables:

  • Decision Tree
  • AdaBoost
  • Bagged Decision Trees
  • Random Forest
  • Gradient Boosting

The benefit of this approach is that it gives you some confidence that your data has been tailored to the expectations and requirements of specific algorithms. This may result in good or even great performance.

The downside is that it can be a slow process requiring a lot of analysis, expertise, and, potentially, research. It also may result in a false sense of confidence that good or best results have already been achieved and that no or little further improvement is possible.

Approach 2: Grid Search Data Preparation Methods

This approach acknowledges that algorithms may have expectations and requirements, and does ensure that transforms of the dataset are created to meet those requirements, although it does not assume that meeting them will result in the best performance.

It leaves the door open to non-obvious and unintuitive solutions.

This might be a data transform that “should not work” or “should not be appropriate for the algorithm” yet results in good or great performance. Alternatively, it may be the absence of a data transform for an input variable that is deemed “absolutely required” yet results in good or great performance.

This can be achieved by designing a grid search of data preparation techniques and/or sequences of data preparation techniques in pipelines. This may involve evaluating each on a single chosen machine learning algorithm, or on a suite of machine learning algorithms.

The result will be a large number of outcomes that will clearly indicate those data transforms, transform sequences, and/or transforms coupled with models that result in good or best performance on the dataset.

These could be used directly, although more likely would provide the basis for further investigation by tuning data transforms and model hyperparameters to get the most out of the methods, and ablative studies to confirm all elements of a modeling pipeline contribute to the skillful predictions.

I generally use this approach myself and advocate it to beginners or practitioners looking to achieve good results on a project quickly.

The benefit of this approach is that it always results in suggestions of modeling pipelines that give good relative results. Most importantly, it can unearth the non-obvious and unintuitive solutions to practitioners without the need for deep expertise.

The downside is the need for some programming aptitude to implement the grid search and the added computational cost of evaluating many different data preparation techniques and pipelines.
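To make the idea concrete, the sketch below treats the choice of scaling transform as just another hyperparameter of a scikit-learn Pipeline and grid searches over it. This is a minimal sketch only; the synthetic dataset and the three candidate transforms are assumptions chosen for demonstration.

# a minimal sketch: grid search data preparation transforms as a pipeline hyperparameter
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
# synthetic dataset for demonstration
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
# pipeline with a placeholder data preparation step
pipeline = Pipeline(steps=[('prep', StandardScaler()), ('model', LogisticRegression())])
# swap whole transforms in and out, just like any other hyperparameter
grid = {'prep': [MinMaxScaler(), StandardScaler(), RobustScaler()]}
search = GridSearchCV(pipeline, grid, scoring='accuracy', cv=10, n_jobs=-1)
result = search.fit(X, y)
print('Best: %.3f using %s' % (result.best_score_, result.best_params_))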

Approach 3: Apply Data Preparation Methods in Parallel

Like the previous approach, this approach assumes that algorithms have expectations and requirements, and it also allows good solutions that violate those expectations to be found, although it goes one step further.

This approach also acknowledges that a model fit on multiple perspectives on the same data may be beneficial over a model that is fit on a single perspective of the data.

This is achieved by performing multiple data transforms on the raw dataset in parallel, then gathering the results from all transforms together into one large dataset with hundreds or even thousands of input features (i.e. the FeatureUnion class in scikit-learn can be used to achieve this). It allows for good input features found from different transforms to be used in parallel.
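A minimal sketch of this idea is below, using a small random dataset purely for demonstration; the FeatureUnion concatenates the columns produced by each transform into one wider dataset.

# a minimal sketch: apply transforms in parallel and union the results
from numpy.random import rand
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# random data for demonstration
X = rand(100, 10)
# each transform contributes its own columns to the unioned dataset
fu = FeatureUnion([('mms', MinMaxScaler()), ('ss', StandardScaler()), ('pca', PCA(n_components=3))])
X_union = fu.fit_transform(X)
# 10 + 10 + 3 = 23 columns
print(X_union.shape)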

The number of input features may explode dramatically for each transform that is used. Therefore, it is good to combine this approach with a feature selection method to select a subset of the features that is most relevant to the target variable. Again, this may involve the application of one, two, or more different feature selection techniques to provide a larger than normal subset of useful features.

Alternatively, a dimensionality reduction technique (e.g. PCA) can be used on the generated features, or an algorithm that performs automatic feature selection (e.g. random forest) can be trained on the generated features directly.

I like to think of it as an explicit feature engineering approach where we generate all the features we can possibly think of from the raw data, unpacking distributions and relationships in the data. We then select a subset of the most relevant features and fit a model. Because we are explicitly using data transforms to unpack the complexity of the problem into parallel features, it may allow the use of a much simpler predictive model, such as a linear model with a strong penalty to help it ignore less useful features.

A variation on this approach would be to fit a different model on each transform of the raw dataset and use an ensemble model to combine the predictions from each of the models.
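A minimal sketch of this variation is below, assuming a synthetic regression dataset; each pipeline prepares the data differently and a voting ensemble averages the predictions of the fitted models.

# a minimal sketch: one model per transform of the data, combined with an ensemble
from sklearn.datasets import make_regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import VotingRegressor
# synthetic dataset for demonstration
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=1)
# fit the same model on two differently prepared views of the data
p1 = Pipeline(steps=[('prep', MinMaxScaler()), ('m', LinearRegression())])
p2 = Pipeline(steps=[('prep', StandardScaler()), ('m', LinearRegression())])
# average the predictions from each pipeline
ensemble = VotingRegressor(estimators=[('mms', p1), ('ss', p2)])
ensemble.fit(X, y)
print(ensemble.predict(X[:3, :]))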

A benefit of this general approach is that it allows a model to harness multiple different perspectives or views on the same raw data, a feature that the other two approaches discussed above lack. This may allow extra predictive skill to be squeezed from the dataset.

A downside of this approach is the increased computational cost and the need for careful choice of the feature selection technique and/or model used to interpret such a large number of input features.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

Summary

In this tutorial, you discovered strategies that you can use to select data preparation techniques for your predictive modeling dataset.

Specifically, you learned:

  • Data preparation techniques can be chosen based on detailed knowledge of the dataset and algorithm and this is the most common approach.
  • Data preparation techniques can be grid searched as just another hyperparameter in the modeling pipeline.
  • Data transforms can be applied to a training dataset in parallel to create many extracted features on which feature selection can be applied and a model trained.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post How to Choose Data Preparation Methods for Machine Learning appeared first on Machine Learning Mastery.

How to Use Feature Extraction on Tabular Data for Machine Learning


Machine learning predictive modeling performance is only as good as your data, and your data is only as good as the way you prepare it for modeling.

The most common approach to data preparation is to study a dataset and review the expectations of a machine learning algorithm, then carefully choose the most appropriate data preparation techniques to transform the raw data to best meet the expectations of the algorithm. This is slow, expensive, and requires a vast amount of expertise.

An alternative approach to data preparation is to apply a suite of common and commonly useful data preparation techniques to the raw data in parallel and combine the results of all of the transforms together into a single large dataset from which a model can be fit and evaluated.

This is an alternative philosophy for data preparation that treats data transforms as an approach to extract salient features from raw data to expose the structure of the problem to the learning algorithms. It requires learning algorithms that are able to weight input features, using those input features that are most relevant to the target that is being predicted.

This approach requires less expertise, is less computationally expensive than a full grid search of data preparation methods, and can aid in the discovery of unintuitive data preparation solutions that achieve good or best performance for a given predictive modeling problem.

In this tutorial, you will discover how to use feature extraction for data preparation with tabular data.

After completing this tutorial, you will know:

  • Feature extraction provides an alternate approach to data preparation for tabular data, where all data transforms are applied in parallel to raw input data and combined together to create one large dataset.
  • How to use the feature extraction method for data preparation to improve model performance over a baseline for a standard classification dataset.
  • How to add feature selection to the feature extraction modeling pipeline to give a further lift in modeling performance on a standard dataset.


Let’s get started.

How to Use Feature Extraction on Tabular Data for Data Preparation
Photo by Nicolas Valdes, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Feature Extraction Technique for Data Preparation
  2. Dataset and Performance Baseline
    1. Wine Classification Dataset
    2. Baseline Model Performance
  3. Feature Extraction Approach to Data Preparation

Feature Extraction Technique for Data Preparation

Data preparation can be challenging.

The approach that is most often prescribed and followed is to analyze the dataset, review the requirements of the algorithms, and transform the raw data to best meet the expectations of the algorithms.

This can be effective, but is also slow and can require deep expertise both with data analysis and machine learning algorithms.

An alternative approach is to treat the preparation of input variables as a hyperparameter of the modeling pipeline and to tune it along with the choice of algorithm and algorithm configuration.

This too can be an effective approach exposing unintuitive solutions and requiring very little expertise, although it can be computationally expensive.

An approach that seeks a middle ground between these two approaches to data preparation is to treat the transformation of input data as a feature engineering or feature extraction procedure. This involves applying a suite of common or commonly useful data preparation techniques to the raw data, then aggregating all features together to create one large dataset, and then fitting and evaluating a model on this data.

The philosophy of the approach treats each data preparation technique as a transform that extracts salient features from raw data to be presented to the learning algorithm. Ideally, such transforms untangle complex relationships and compound input variables, in turn allowing the use of simpler modeling algorithms, such as linear machine learning techniques.

For lack of a better name, we will refer to this as the “Feature Engineering Method” or the “Feature Extraction Method” for configuring data preparation for a predictive modeling project.

It allows data analysis and algorithm expertise to be used in the selection of data preparation methods and allows unintuitive solutions to be found but at a much lower computational cost.

The explosion in the number of input features can also be explicitly addressed through the use of feature selection techniques that attempt to rank order the importance or value of the vast number of extracted features and select only a small subset of those most relevant to predicting the target variable.

We can explore this approach to data preparation with a worked example.

Before we dive into a worked example, let’s first select a standard dataset and develop a baseline in performance.


Dataset and Performance Baseline

In this section, we will first select a standard machine learning dataset and establish a baseline in performance on this dataset. This will provide the context for exploring the feature extraction method of data preparation in the next section.

Wine Classification Dataset

We will use the wine classification dataset.

This dataset has 13 input variables that describe the chemical composition of samples of wine and requires that the wine be classified as one of three types.

You can learn more about the dataset here:

No need to download the dataset as we will download it automatically as part of our worked examples.

Open the dataset and review the raw data. The first few rows of data are listed below.

We can see that it is a multi-class classification predictive modeling problem with numerical input variables, each of which has different scales.

14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065,1
13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050,1
13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185,1
14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480,1
13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735,1
...

The example below loads the dataset and splits it into the input and output columns, then summarizes the data arrays.

# example of loading and summarizing the wine dataset
from pandas import read_csv
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
# load the dataset as a data frame
df = read_csv(url, header=None)
# retrieve the numpy array
data = df.values
# split the columns into input and output variables
X, y = data[:, :-1], data[:, -1]
# summarize the shape of the loaded data
print(X.shape, y.shape)

Running the example, we can see that the dataset was loaded correctly and that there are 178 rows of data with 13 input variables and a single target variable.

(178, 13) (178,)

Next, let’s evaluate a model on this dataset and establish a baseline in performance.

Baseline Model Performance

We can establish a baseline in performance on the wine classification task by evaluating a model on the raw input data.

In this case, we will evaluate a logistic regression model.

First, we can perform minimal data preparation by ensuring the input variables are numeric and that the target variable is label encoded, as expected by the scikit-learn library.

...
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))

Next, we can define our predictive model.

...
# define the model
model = LogisticRegression(solver='liblinear')

We will evaluate the model using the gold standard of repeated stratified k-fold cross-validation with 10 folds and three repeats.

Model performance will be evaluated using classification accuracy.

...
model = LogisticRegression(solver='liblinear')
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)

At the end of the run, we will report the mean and standard deviation of the accuracy scores collected across all repeats and evaluation folds.

...
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Tying this together, the complete example of evaluating a logistic regression model on the raw wine classification dataset is listed below.

# baseline model performance on the wine dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
df = read_csv(url, header=None)
data = df.values
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# define the model
model = LogisticRegression(solver='liblinear')
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the model performance and reports the mean and standard deviation classification accuracy.

Your results may vary given the stochastic nature of the learning algorithm, the evaluation procedure, and differences in precision across machines. Try running the example a few times.

In this case, we can see that the logistic regression model fit on the raw input data achieved a mean classification accuracy of about 95.3 percent, providing a baseline in performance.

Accuracy: 0.953 (0.048)

Next, let’s explore whether we can improve the performance using the feature extraction based approach to data preparation.

Feature Extraction Approach to Data Preparation

In this section, we can explore whether we can improve performance using the feature extraction approach to data preparation.

The first step is to select a suite of common and commonly useful data preparation techniques.

In this case, given that the input variables are numeric, we will use a range of transforms to change the scale of the input variables, such as MinMaxScaler, StandardScaler, and RobustScaler, as well as transforms for changing the distribution of the input variables, such as QuantileTransformer and KBinsDiscretizer. Finally, we will also use transforms that remove linear dependencies between the input variables, such as PCA and TruncatedSVD.

The FeatureUnion class can be used to define a list of transforms to perform, the results of which will be aggregated together, i.e. unioned. This will create a new dataset that has a vast number of columns.

An estimate of the number of columns would be 13 input variables times five transforms, or 65, plus the 14 columns output from the PCA and SVD dimensionality reduction methods, to give a total of about 79 features.

...
# transforms for the feature union
transforms = list()
transforms.append(('mms', MinMaxScaler()))
transforms.append(('ss', StandardScaler()))
transforms.append(('rs', RobustScaler()))
transforms.append(('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal')))
transforms.append(('kbd', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')))
transforms.append(('pca', PCA(n_components=7)))
transforms.append(('svd', TruncatedSVD(n_components=7)))
# create the feature union
fu = FeatureUnion(transforms)

We can then create a modeling Pipeline with the FeatureUnion as the first step and the logistic regression model as the final step.

...
# define the model
model = LogisticRegression(solver='liblinear')
# define the pipeline
steps = list()
steps.append(('fu', fu))
steps.append(('m', model))
pipeline = Pipeline(steps=steps)

The pipeline can then be evaluated using repeated stratified k-fold cross-validation as before.

Tying this together, the complete example is listed below.

# data preparation as feature engineering for wine dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
df = read_csv(url, header=None)
data = df.values
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# transforms for the feature union
transforms = list()
transforms.append(('mms', MinMaxScaler()))
transforms.append(('ss', StandardScaler()))
transforms.append(('rs', RobustScaler()))
transforms.append(('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal')))
transforms.append(('kbd', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')))
transforms.append(('pca', PCA(n_components=7)))
transforms.append(('svd', TruncatedSVD(n_components=7)))
# create the feature union
fu = FeatureUnion(transforms)
# define the model
model = LogisticRegression(solver='liblinear')
# define the pipeline
steps = list()
steps.append(('fu', fu))
steps.append(('m', model))
pipeline = Pipeline(steps=steps)
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the model performance and reports the mean and standard deviation classification accuracy.

Your results may vary given the stochastic nature of the learning algorithm, the evaluation procedure, and differences in precision across machines. Try running the example a few times.

In this case, we can see a lift in performance over the baseline performance, achieving a mean classification accuracy of about 96.8 percent as compared to 95.3 percent in the previous section.

Accuracy: 0.968 (0.037)

Try adding more data preparation methods to the FeatureUnion to see if you can improve the performance.

Can you get better results?
Let me know what you discover in the comments below.

We can also use feature selection to reduce the approximately 80 extracted features down to a subset of those that are most relevant to the model. In addition to reducing the complexity of the model, it can also result in a lift in performance by removing irrelevant and redundant input features.

In this case, we will use the Recursive Feature Elimination, or RFE, technique for feature selection and configure it to select the 15 most relevant features.

...
# define the feature selection
rfe = RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=15)

We can then add the RFE feature selection to the modeling pipeline after the FeatureUnion and before the LogisticRegression algorithm.

...
# define the pipeline
steps = list()
steps.append(('fu', fu))
steps.append(('rfe', rfe))
steps.append(('m', model))
pipeline = Pipeline(steps=steps)

Tying this together, the complete example of the feature extraction data preparation method with feature selection is listed below.

# data preparation as feature engineering with feature selection for wine dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
df = read_csv(url, header=None)
data = df.values
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# transforms for the feature union
transforms = list()
transforms.append(('mms', MinMaxScaler()))
transforms.append(('ss', StandardScaler()))
transforms.append(('rs', RobustScaler()))
transforms.append(('qt', QuantileTransformer(n_quantiles=100, output_distribution='normal')))
transforms.append(('kbd', KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')))
transforms.append(('pca', PCA(n_components=7)))
transforms.append(('svd', TruncatedSVD(n_components=7)))
# create the feature union
fu = FeatureUnion(transforms)
# define the feature selection
rfe = RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=15)
# define the model
model = LogisticRegression(solver='liblinear')
# define the pipeline
steps = list()
steps.append(('fu', fu))
steps.append(('rfe', rfe))
steps.append(('m', model))
pipeline = Pipeline(steps=steps)
# define the cross-validation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the model performance and reports the mean and standard deviation classification accuracy.

Your results may vary given the stochastic nature of the learning algorithm, the evaluation procedure, and differences in precision across machines. Try running the example a few times.

Again, we can see a further lift in performance from 96.8 percent with all extracted features to about 98.9 percent with feature selection used prior to modeling.

Accuracy: 0.989 (0.022)

Can you achieve better performance with a different feature selection technique or with more or fewer selected features?
Let me know what you discover in the comments below.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Related Tutorials

Books

APIs

Summary

In this tutorial, you discovered how to use feature extraction for data preparation with tabular data.

Specifically, you learned:

  • Feature extraction provides an alternate approach to data preparation for tabular data, where all data transforms are applied in parallel to raw input data and combined together to create one large dataset.
  • How to use the feature extraction method for data preparation to improve model performance over a baseline for a standard classification dataset.
  • How to add feature selection to the feature extraction modeling pipeline to give a further lift in modeling performance on a standard dataset.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post How to Use Feature Extraction on Tabular Data for Machine Learning appeared first on Machine Learning Mastery.

4 Automatic Outlier Detection Algorithms in Python


The presence of outliers in a classification or regression dataset can result in a poor fit and lower predictive modeling performance.

Identifying and removing outliers is challenging with simple statistical methods for most machine learning datasets given the large number of input variables. Instead, automatic outlier detection methods can be used in the modeling pipeline and compared, just like other data preparation transforms that may be applied to the dataset.

In this tutorial, you will discover how to use automatic outlier detection and removal to improve machine learning predictive modeling performance.

After completing this tutorial, you will know:

  • Automatic outlier detection models provide an alternative to statistical techniques with a larger number of input variables with complex and unknown inter-relationships.
  • How to correctly apply automatic outlier detection and removal to the training dataset only to avoid data leakage.
  • How to evaluate and compare predictive modeling pipelines with outliers removed from the training dataset.


Let’s get started.

Model-Based Outlier Detection and Removal in Python
Photo by Zoltán Vörös, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Outlier Detection and Removal
  2. Dataset and Performance Baseline
    1. House Price Regression Dataset
    2. Baseline Model Performance
  3. Automatic Outlier Detection
    1. Isolation Forest
    2. Minimum Covariance Determinant
    3. Local Outlier Factor
    4. One-Class SVM

Outlier Detection and Removal

Outliers are observations in a dataset that don’t fit in some way.

Perhaps the most common or familiar type of outlier is the observations that are far from the rest of the observations or the center of mass of observations.

This is easy to understand when we have one or two variables and we can visualize the data as a histogram or scatter plot, although it becomes very challenging when we have many input variables defining a high-dimensional input feature space.

In this case, simple statistical methods for identifying outliers can break down, such as methods that use standard deviations or the interquartile range.
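For reference, a minimal sketch of one such simple statistical method on a single Gaussian variable is below; the random data and the three-standard-deviation cut-off are assumptions for demonstration only.

# a minimal sketch: a standard-deviation-based outlier test on one variable
from numpy import mean
from numpy import std
from numpy.random import seed
from numpy.random import randn
# a single Gaussian variable for demonstration
seed(1)
data = 5 * randn(1000) + 50
# flag values more than three standard deviations from the mean
data_mean, data_std = mean(data), std(data)
cut_off = data_std * 3
outliers = [x for x in data if abs(x - data_mean) > cut_off]
print('Identified outliers: %d' % len(outliers))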

It can be important to identify and remove outliers from data when training machine learning algorithms for predictive modeling.

Outliers can skew statistical measures and data distributions, providing a misleading representation of the underlying data and relationships. Removing outliers from training data prior to modeling can result in a better fit of the data and, in turn, more skillful predictions.

Thankfully, there are a variety of automatic model-based methods for identifying outliers in input data. Importantly, each method approaches the definition of an outlier in slightly different ways, providing alternate approaches to preparing a training dataset that can be evaluated and compared, just like any other data preparation step in a modeling pipeline.

Before we dive into automatic outlier detection methods, let’s first select a standard machine learning dataset that we can use as the basis for our investigation.


Dataset and Performance Baseline

In this section, we will first select a standard machine learning dataset and establish a baseline in performance on this dataset.

This will provide the context for exploring the outlier identification and removal method of data preparation in the next section.

House Price Regression Dataset

We will use the house price regression dataset.

This dataset has 13 input variables that describe the properties of the house and suburb and requires the prediction of the median value of houses in the suburb in thousands of dollars.

You can learn more about the dataset here:

No need to download the dataset as we will download it automatically as part of our worked examples.

Open the dataset and review the raw data. The first few rows of data are listed below.

We can see that it is a regression predictive modeling problem with numerical input variables, each of which has different scales.

0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98,24.00
0.02731,0.00,7.070,0,0.4690,6.4210,78.90,4.9671,2,242.0,17.80,396.90,9.14,21.60
0.02729,0.00,7.070,0,0.4690,7.1850,61.10,4.9671,2,242.0,17.80,392.83,4.03,34.70
0.03237,0.00,2.180,0,0.4580,6.9980,45.80,6.0622,3,222.0,18.70,394.63,2.94,33.40
0.06905,0.00,2.180,0,0.4580,7.1470,54.20,6.0622,3,222.0,18.70,396.90,5.33,36.20
...

The dataset has many numerical input variables that have unknown and complex relationships. We don’t know that outliers exist in this dataset, although we may guess that some outliers may be present.

The example below loads the dataset and splits it into the input and output columns, splits it into train and test datasets, then summarizes the shapes of the data arrays.

# load and summarize the dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# summarize the shape of the dataset
print(X.shape, y.shape)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the train and test sets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Running the example, we can see that the dataset was loaded correctly and that there are 506 rows of data with 13 input variables and a single target variable.

The dataset is split into train and test sets with 339 rows used for model training and 167 for model evaluation.

(506, 13) (506,)
(339, 13) (167, 13) (339,) (167,)

Next, let’s evaluate a model on this dataset and establish a baseline in performance.

Baseline Model Performance

It is a regression predictive modeling problem, meaning that we will be predicting a numeric value. All input variables are also numeric.

In this case, we will fit a linear regression algorithm and evaluate model performance by training the model on the training dataset, making a prediction on the test data, and evaluating the predictions using the mean absolute error (MAE).

The complete example of evaluating a linear regression model on the dataset is listed below.

# evaluate model on the raw dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

Running the example fits and evaluates the model, then reports the MAE.

Your specific results may differ given the stochastic nature of the learning algorithm, the evaluation procedure, and/or differences in precision across systems. Try running the example a few times.

In this case, we can see that the model achieved a MAE of about 3.417. This provides a baseline in performance to which we can compare different outlier identification and removal procedures.

MAE: 3.417

Next, we can try removing outliers from the training dataset.

Automatic Outlier Detection

The scikit-learn library provides a number of built-in automatic methods for identifying outliers in data.

In this section, we will review four methods and compare their performance on the house price dataset.

Each method will be defined, then fit on the training dataset. The fit model will then predict which examples in the training dataset are outliers and which are not (so-called inliers). The outliers will then be removed from the training dataset, then the model will be fit on the remaining examples and evaluated on the entire test dataset.

It would be invalid to fit the outlier detection method on the entire dataset, as this would result in data leakage. That is, the model would have access to data (or information about the data) in the test set not used to train the model. This may result in an optimistic estimate of model performance.

We could attempt to detect outliers on “new data” such as the test set prior to making a prediction, but then what do we do if outliers are detected?

One approach might be to return a “None” indicating that the model is unable to make a prediction on those outlier cases. This might be an interesting extension to explore that may be appropriate for your project.
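A hypothetical sketch of that extension is below; it assumes X_train, y_train, and X_test are defined as in the examples that follow, and returning NaN for flagged rows is just one possible convention, not part of this tutorial.

# a hypothetical sketch: decline to predict on test rows flagged as outliers
from numpy import full
from numpy import nan
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
# assumes X_train, y_train, X_test are defined as in the examples below
iso = IsolationForest(contamination=0.1)
iso.fit(X_train)
model = LinearRegression()
model.fit(X_train, y_train)
# flag test rows that look like outliers relative to the training data
flags = iso.predict(X_test)
yhat = full(len(X_test), nan)
mask = flags != -1
yhat[mask] = model.predict(X_test[mask, :])
# NaN entries indicate that no prediction was made for outlier cases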

Isolation Forest

Isolation Forest, or iForest for short, is a tree-based anomaly detection algorithm.

It is based on modeling the normal data in such a way as to isolate anomalies that are both few in number and different in the feature space.

… our proposed method takes advantage of two anomalies’ quantitative properties: i) they are the minority consisting of fewer instances and ii) they have attribute-values that are very different from those of normal instances.

Isolation Forest, 2008.

The scikit-learn library provides an implementation of Isolation Forest in the IsolationForest class.

Perhaps the most important hyperparameter in the model is the “contamination” argument, which is used to help estimate the number of outliers in the dataset. This is a value between 0.0 and 0.5 and by default is set to 0.1.

...
# identify outliers in the training dataset
iso = IsolationForest(contamination=0.1)
yhat = iso.fit_predict(X_train)

Once identified, we can remove the outliers from the training dataset.

...
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]

Tying this together, the complete example of evaluating the linear model on the housing dataset with outliers identified and removed with isolation forest is listed below.

# evaluate model performance with outliers removed using isolation forest
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import IsolationForest
from sklearn.metrics import mean_absolute_error
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the training dataset
print(X_train.shape, y_train.shape)
# identify outliers in the training dataset
iso = IsolationForest(contamination=0.1)
yhat = iso.fit_predict(X_train)
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

Running the example fits and evaluates the model, then reports the MAE.

Your specific results may differ given the stochastic nature of the learning algorithm, the evaluation procedure, and/or differences in precision across systems. Try running the example a few times.

In this case, we can see that the model identified and removed 34 outliers and achieved a MAE of about 3.189, an improvement over the baseline that achieved a score of about 3.417.

(339, 13) (339,)
(305, 13) (305,)
MAE: 3.189

Minimum Covariance Determinant

If the input variables have a Gaussian distribution, then simple statistical methods can be used to detect outliers.

For example, if the dataset has two input variables and both are Gaussian, then the feature space forms a multi-dimensional Gaussian and knowledge of this distribution can be used to identify values far from the distribution.

This approach can be generalized by defining an ellipsoid that covers the normal data; data that falls outside this shape is considered an outlier. An efficient implementation of this technique for multivariate data is known as the Minimum Covariance Determinant, or MCD for short.

The Minimum Covariance Determinant (MCD) method is a highly robust estimator of multivariate location and scatter, for which a fast algorithm is available. […] It also serves as a convenient and efficient tool for outlier detection.

Minimum Covariance Determinant and Extensions, 2017.

The scikit-learn library provides access to this method via the EllipticEnvelope class.

It provides the “contamination” argument that defines the expected ratio of outliers to be observed in practice. In this case, we will set it to a value of 0.01, found with a little trial and error.

...
# identify outliers in the training dataset
ee = EllipticEnvelope(contamination=0.01)
yhat = ee.fit_predict(X_train)

Once identified, the outliers can be removed from the training dataset as we did in the prior example.

Tying this together, the complete example of identifying and removing outliers from the housing dataset using the elliptical envelope (minimum covariance determinant) method is listed below.

# evaluate model performance with outliers removed using elliptical envelope
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.covariance import EllipticEnvelope
from sklearn.metrics import mean_absolute_error
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the training dataset
print(X_train.shape, y_train.shape)
# identify outliers in the training dataset
ee = EllipticEnvelope(contamination=0.01)
yhat = ee.fit_predict(X_train)
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

Running the example fits and evaluates the model, then reports the MAE.

Your specific results may differ given the stochastic nature of the learning algorithm, the evaluation procedure, and/or differences in precision across systems. Try running the example a few times.

In this case, we can see that the elliptical envelope method identified and removed only 4 outliers, resulting in a drop in MAE from 3.417 with the baseline to 3.388.

(339, 13) (339,)
(335, 13) (335,)
MAE: 3.388

Local Outlier Factor

A simple approach to identifying outliers is to locate those examples that are far from the other examples in the feature space.

This can work well for feature spaces with low dimensionality (few features), although it can become less reliable as the number of features is increased, referred to as the curse of dimensionality.
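
Before turning to LOF itself, the sketch below illustrates the underlying idea: score each example by its average distance to its k nearest neighbors, so that the most isolated examples receive the largest scores. The choice of k=5 and the synthetic dataset are assumptions for illustration only.

# sketch: a naive distance-based outlier score using average k-nearest-neighbor distance
# (k=5 and the synthetic dataset are illustrative assumptions)
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
# generate a simple two-dimensional dataset
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=1)
# find the 6 nearest neighbors of each example (the closest is the point itself)
nn = NearestNeighbors(n_neighbors=6).fit(X)
dist, _ = nn.kneighbors(X)
# average distance to the 5 true neighbors, skipping the self-distance in column 0
score = dist[:, 1:].mean(axis=1)
# the examples with the largest scores are the most isolated
print('most isolated examples:', score.argsort()[-5:])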

The local outlier factor, or LOF for short, is a technique that attempts to harness the idea of nearest neighbors for outlier detection. Each example is assigned a score reflecting how isolated it is, based on the size of its local neighborhood; the examples with the largest scores are more likely to be outliers.

We introduce a local outlier factor (LOF) for each object in the dataset, indicating its degree of outlier-ness.

LOF: Identifying Density-based Local Outliers, 2000.

The scikit-learn library provides an implementation of this approach in the LocalOutlierFactor class.

The model provides the “contamination” argument that specifies the expected percentage of outliers in the dataset; it defaults to 0.1.

...
# identify outliers in the training dataset
lof = LocalOutlierFactor()
yhat = lof.fit_predict(X_train)
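
If you want to inspect the degree of outlier-ness rather than the hard -1/1 labels, the fitted object exposes a negative_outlier_factor_ attribute, where more negative values indicate more outlying examples. A minimal sketch is below; the synthetic dataset is an assumption for illustration only.

# sketch: inspect per-example LOF scores (more negative means more outlying)
# (the synthetic dataset is an illustrative assumption)
from sklearn.datasets import make_blobs
from sklearn.neighbors import LocalOutlierFactor
# generate a simple dataset
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=1)
# fit the model and label the training data
lof = LocalOutlierFactor()
yhat = lof.fit_predict(X)
# negative_outlier_factor_ holds the score for each training example
scores = lof.negative_outlier_factor_
# indices of the five most outlying examples (smallest, i.e. most negative, scores)
print('most outlying examples:', scores.argsort()[:5])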

Tying this together, the complete example of identifying and removing outliers from the housing dataset using the local outlier factor method is listed below.

# evaluate model performance with outliers removed using local outlier factor
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import mean_absolute_error
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the training dataset
print(X_train.shape, y_train.shape)
# identify outliers in the training dataset
lof = LocalOutlierFactor()
yhat = lof.fit_predict(X_train)
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

Running the example fits and evaluates the model, then reports the MAE.

Your specific results may differ given the stochastic nature of the learning algorithm, the evaluation procedure, and/or differences in precision across systems. Try running the example a few times.

In this case, we can see that the local outlier factor method identified and removed 34 outliers, the same number as isolation forest, resulting in a drop in MAE from 3.417 with the baseline to 3.356. Better, but not as good as isolation forest, suggesting a different set of outliers were identified and removed.

(339, 13) (339,)
(305, 13) (305,)
MAE: 3.356

One-Class SVM

The support vector machine, or SVM, algorithm, developed initially for binary classification, can be used for one-class classification.

When modeling one class, the algorithm captures the density of the majority class and classifies examples on the extremes of the density function as outliers. This modification of SVM is referred to as One-Class SVM.

… an algorithm that computes a binary function that is supposed to capture regions in input space where the probability density lives (its support), that is, a function such that most of the data will live in the region where the function is nonzero.

Estimating the Support of a High-Dimensional Distribution, 2001.

Although SVM and One-Class SVM are both classification algorithms, One-Class SVM can be used to discover outliers in input data for both regression and classification datasets.

The scikit-learn library provides an implementation of one-class SVM in the OneClassSVM class.

The class provides the “nu” argument that specifies the approximate ratio of outliers in the dataset, which defaults to 0.1. In this case, we will set it to 0.01, found with a little trial and error.

...
# identify outliers in the training dataset
svm = OneClassSVM(nu=0.01)
yhat = svm.fit_predict(X_train)

Tying this together, the complete example of identifying and removing outliers from the housing dataset using the one class SVM method is listed below.

# evaluate model performance with outliers removed using one class SVM
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import OneClassSVM
from sklearn.metrics import mean_absolute_error
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the training dataset
print(X_train.shape, y_train.shape)
# identify outliers in the training dataset
svm = OneClassSVM(nu=0.01)
yhat = svm.fit_predict(X_train)
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

Running the example fits and evaluates the model, then reports the MAE.

Your specific results may differ given the stochastic nature of the learning algorithm, the evaluation procedure, and/or differences in precision across systems. Try running the example a few times.

In this case, we can see that only three outliers were identified and removed and the model achieved a MAE of about 3.431, which is not better than the baseline model that achieved 3.417. Perhaps better performance can be achieved with more tuning.

(339, 13) (339,)
(336, 13) (336,)
MAE: 3.431

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Related Tutorials

Papers

APIs

Summary

In this tutorial, you discovered how to use automatic outlier detection and removal to improve machine learning predictive modeling performance.

Specifically, you learned:

  • Automatic outlier detection models provide an alternative to statistical techniques with a larger number of input variables with complex and unknown inter-relationships.
  • How to correctly apply automatic outlier detection and removal to the training dataset only to avoid data leakage.
  • How to evaluate and compare predictive modeling pipelines with outliers removed from the training dataset.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post 4 Automatic Outlier Detection Algorithms in Python appeared first on Machine Learning Mastery.

6 Dimensionality Reduction Algorithms With Python


Dimensionality reduction is an unsupervised learning technique.

Nevertheless, it can be used as a data transform pre-processing step for machine learning algorithms on classification and regression predictive modeling datasets with supervised learning algorithms.

There are many dimensionality reduction algorithms to choose from and no single best algorithm for all cases. Instead, it is a good idea to explore a range of dimensionality reduction algorithms and different configurations for each algorithm.

In this tutorial, you will discover how to fit and evaluate top dimensionality reduction algorithms in Python.

After completing this tutorial, you will know:

  • Dimensionality reduction seeks a lower-dimensional representation of numerical input data that preserves the salient relationships in the data.
  • There are many different dimensionality reduction algorithms and no single best method for all datasets.
  • How to implement, fit, and evaluate top dimensionality reduction in Python with the scikit-learn machine learning library.

Discover data cleaning, feature selection, data transforms, dimensionality reduction and much more in my new book, with 30 step-by-step tutorials and full Python source code.

Let’s get started.

Dimensionality Reduction Algorithms With Python

Dimensionality Reduction Algorithms With Python
Photo by Bernard Spragg. NZ, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Dimensionality Reduction
  2. Dimensionality Reduction Algorithms
  3. Examples of Dimensionality Reduction
    1. Scikit-Learn Library Installation
    2. Classification Dataset
    3. Principal Component Analysis
    4. Singular Value Decomposition
    5. Linear Discriminant Analysis
    6. Isomap Embedding
    7. Locally Linear Embedding
    8. Modified Locally Linear Embedding

Dimensionality Reduction

Dimensionality reduction refers to techniques for reducing the number of input variables in training data.

When dealing with high dimensional data, it is often useful to reduce the dimensionality by projecting the data to a lower dimensional subspace which captures the “essence” of the data. This is called dimensionality reduction.

— Page 11, Machine Learning: A Probabilistic Perspective, 2012.

High dimensionality might mean hundreds, thousands, or even millions of input variables.

Fewer input dimensions often means correspondingly fewer parameters or a simpler structure in the machine learning model, referred to as degrees of freedom. A model with too many degrees of freedom is likely to overfit the training dataset and may not perform well on new data.

It is desirable to have simple models that generalize well, and in turn, input data with few input variables. This is particularly true for linear models where the number of inputs and the degrees of freedom of the model are often closely related.

Dimensionality reduction is a data preparation technique performed on data prior to modeling. It might be performed after data cleaning and data scaling and before training a predictive model.

… dimensionality reduction yields a more compact, more easily interpretable representation of the target concept, focusing the user’s attention on the most relevant variables.

— Page 289, Data Mining: Practical Machine Learning Tools and Techniques, 4th edition, 2016.

As such, any dimensionality reduction performed on training data must also be performed on new data, such as a test dataset, validation dataset, and data when making a prediction with the final model.
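
A minimal sketch of this discipline is shown below: the transform is fit on the training set only, and the same learned projection is then applied to both splits. PCA and the synthetic dataset are assumptions for illustration; the Pipeline class used later in this tutorial handles this bookkeeping automatically.

# sketch: fit a dimensionality reduction transform on training data only,
# then apply it to both splits (PCA and the dataset are illustrative assumptions)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
# define a dataset and split it into train and test sets
X, y = make_classification(n_samples=1000, n_features=20, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# fit the transform on the training set only
pca = PCA(n_components=10).fit(X_train)
# apply the same learned projection to both splits
X_train_t = pca.transform(X_train)
X_test_t = pca.transform(X_test)
print(X_train_t.shape, X_test_t.shape)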

Want to Get Started With Data Preparation?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Download Your FREE Mini-Course

Dimensionality Reduction Algorithms

There are many algorithms that can be used for dimensionality reduction.

Two main classes of methods are those drawn from linear algebra and those drawn from manifold learning.

Linear Algebra Methods

Matrix factorization methods drawn from the field of linear algebra can be used for dimensionality reduction.

For more on matrix factorization, see the tutorial:

Some of the more popular methods include:

  • Principal Components Analysis
  • Singular Value Decomposition
  • Non-Negative Matrix Factorization

Manifold Learning Methods

Manifold learning methods seek a lower-dimensional projection of high dimensional input that captures the salient properties of the input data.

Some of the more popular methods include:

  • Isomap Embedding
  • Locally Linear Embedding
  • Multidimensional Scaling
  • Spectral Embedding
  • t-distributed Stochastic Neighbor Embedding

Each algorithm offers a different approach to the challenge of discovering natural relationships in data at lower dimensions.

There is no best dimensionality reduction algorithm, and no easy way to find the best algorithm for your data without using controlled experiments.

In this tutorial, we will review how to use a subset of these popular dimensionality reduction algorithms from the scikit-learn library.

The examples will provide the basis for you to copy-paste the examples and test the methods on your own data.

We will not dive into the theory behind how the algorithms work or compare them directly. For a good starting point on this topic, see:

Let’s dive in.

Examples of Dimensionality Reduction

In this section, we will review how to use popular dimensionality reduction algorithms in scikit-learn.

This includes an example of using the dimensionality reduction technique as a data transform in a modeling pipeline and evaluating a model fit on the data.

The examples are designed for you to copy-paste into your own project and apply the methods to your own data. There are some algorithms available in the scikit-learn library that are not demonstrated because they cannot be used as a data transform directly given the nature of the algorithm.

As such, we will use a synthetic classification dataset in each example.

Scikit-Learn Library Installation

First, let’s install the library.

Don’t skip this step as you will need to ensure you have the latest version installed.

You can install the scikit-learn library using the pip Python installer, as follows:

sudo pip install scikit-learn

For additional installation instructions specific to your platform, see:

Next, let’s confirm that the library is installed and you are using a modern version.

Run the following script to print the library version number.

# check scikit-learn version
import sklearn
print(sklearn.__version__)

Running the example, you should see the following version number or higher.

0.23.0

Classification Dataset

We will use the make_classification() function to create a test binary classification dataset.

The dataset will have 1,000 examples with 20 input features, 10 of which are informative and 10 of which are redundant. This provides an opportunity for each technique to identify and remove redundant input features.

The fixed random seed for the pseudorandom number generator ensures we generate the same synthetic dataset each time the code runs.

An example of creating and summarizing the synthetic classification dataset is listed below.

# synthetic classification dataset
from sklearn.datasets import make_classification
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# summarize the dataset
print(X.shape, y.shape)

Running the example creates the dataset and reports the number of rows and columns matching our expectations.

(1000, 20) (1000,)

It is a binary classification task and we will evaluate a LogisticRegression model after each dimensionality reduction transform.

The model will be evaluated using the gold standard of repeated stratified 10-fold cross-validation. The mean and standard deviation classification accuracy across all folds and repeats will be reported.

The example below evaluates the model on the raw dataset as a point of comparison.

# evaluate logistic regression model on raw data
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the model
model = LogisticRegression()
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates the logistic regression on the raw dataset with all 20 columns, achieving a classification accuracy of about 82.4 percent.

A successful dimensionality reduction transform on this data should result in a model that has better accuracy than this baseline, although this may not be possible with all techniques.

Note: we are not trying to “solve” this dataset, just provide working examples that you can use as a starting point.

Accuracy: 0.824 (0.034)

Next, we can start looking at examples of dimensionality reduction algorithms applied to this dataset.

I have made some minimal attempts to tune each method to the dataset. Each dimensionality reduction method will be configured to reduce the 20 input columns to 10 where possible.

We will use a Pipeline to combine the data transform and model into an atomic unit that can be evaluated using the cross-validation procedure; for example:

...
# define the pipeline
steps = [('pca', PCA(n_components=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)

Let’s get started.

Can you get a better result for one of the algorithms?
Let me know in the comments below.

Principal Component Analysis

Principal Component Analysis, or PCA, might be the most popular technique for dimensionality reduction with dense data (few zero values).

For more on how PCA works, see the tutorial:

The scikit-learn library provides the PCA class implementation of Principal Component Analysis that can be used as a dimensionality reduction data transform. The “n_components” argument can be set to configure the number of desired dimensions in the output of the transform.

The complete example of evaluating a model with PCA dimensionality reduction is listed below.

# evaluate pca with logistic regression algorithm for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the pipeline
steps = [('pca', PCA(n_components=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates the modeling pipeline with dimensionality reduction and a logistic regression predictive model.

In this case, we don’t see any lift in model performance in using the PCA transform.

Accuracy: 0.824 (0.034)

Singular Value Decomposition

Singular Value Decomposition, or SVD, is one of the most popular techniques for dimensionality reduction for sparse data (data with many zero values).

For more on how SVD works, see the tutorial:

The scikit-learn library provides the TruncatedSVD class implementation of Singular Value Decomposition that can be used as a dimensionality reduction data transform. The “n_components” argument can be set to configure the number of desired dimensions in the output of the transform.

The complete example of evaluating a model with SVD dimensionality reduction is listed below.

# evaluate svd with logistic regression algorithm for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the pipeline
steps = [('svd', TruncatedSVD(n_components=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates the modeling pipeline with dimensionality reduction and a logistic regression predictive model.

In this case, we don’t see any lift in model performance in using the SVD transform.

Accuracy: 0.824 (0.034)

Linear Discriminant Analysis

Linear Discriminant Analysis, or LDA, is a multi-class classification algorithm that can be used for dimensionality reduction.

The number of dimensions for the projection is limited to between 1 and C-1, where C is the number of classes. In this case, our dataset is a binary classification problem (two classes), limiting the number of dimensions to 1.

For more on LDA for dimensionality reduction, see the tutorial:

The scikit-learn library provides the LinearDiscriminantAnalysis class implementation of Linear Discriminant Analysis that can be used as a dimensionality reduction data transform. The “n_components” argument can be set to configure the number of desired dimensions in the output of the transform.

The complete example of evaluating a model with LDA dimensionality reduction is listed below.

# evaluate lda with logistic regression algorithm for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the pipeline
steps = [('lda', LinearDiscriminantAnalysis(n_components=1)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates the modeling pipeline with dimensionality reduction and a logistic regression predictive model.

In this case, we can see a slight lift in performance as compared to the baseline fit on the raw data.

Accuracy: 0.825 (0.034)

Isomap Embedding

Isomap Embedding, or Isomap, creates an embedding of the dataset and attempts to preserve the relationships in the dataset.

The scikit-learn library provides the Isomap class implementation of Isomap Embedding that can be used as a dimensionality reduction data transform. The “n_components” argument can be set to configure the number of desired dimensions in the output of the transform.

The complete example of evaluating a model with Isomap dimensionality reduction is listed below.

# evaluate isomap with logistic regression algorithm for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.manifold import Isomap
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the pipeline
steps = [('iso', Isomap(n_components=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates the modeling pipeline with dimensionality reduction and a logistic regression predictive model.

In this case, we can see a lift in performance with the Isomap data transform as compared to the baseline fit on the raw data.

Accuracy: 0.888 (0.029)

Locally Linear Embedding

Locally Linear Embedding, or LLE, creates an embedding of the dataset and attempts to preserve the relationships between neighborhoods in the dataset.

The scikit-learn library provides the LocallyLinearEmbedding class implementation of Locally Linear Embedding that can be used as a dimensionality reduction data transform. The “n_components” argument can be set to configure the number of desired dimensions in the output of the transform.

The complete example of evaluating a model with LLE dimensionality reduction is listed below.

# evaluate lle and logistic regression for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the pipeline
steps = [('lle', LocallyLinearEmbedding(n_components=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates the modeling pipeline with dimensionality reduction and a logistic regression predictive model.

In this case, we can see a lift in performance with the LLE data transform as compared to the baseline fit on the raw data.

Accuracy: 0.886 (0.028)

Modified Locally Linear Embedding

Modified Locally Linear Embedding, or Modified LLE, is an extension of Locally Linear Embedding that creates multiple weighting vectors for each neighborhood.

The scikit-learn library provides the LocallyLinearEmbedding class implementation of Modified Locally Linear Embedding that can be used as a dimensionality reduction data transform. The “method” argument must be set to ‘modified’, and the “n_components” argument can be set to configure the number of desired dimensions in the output of the transform, which must be less than the “n_neighbors” argument.

The complete example of evaluating a model with Modified LLE dimensionality reduction is listed below.

# evaluate modified lle and logistic regression for classification
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.linear_model import LogisticRegression
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=10, random_state=7)
# define the pipeline
steps = [('lle', LocallyLinearEmbedding(n_components=5, method='modified', n_neighbors=10)), ('m', LogisticRegression())]
model = Pipeline(steps=steps)
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example evaluates the modeling pipeline with dimensionality reduction and a logistic regression predictive model.

In this case, we can see a lift in performance with the modified LLE data transform as compared to the baseline fit on the raw data.

Accuracy: 0.846 (0.036)

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

APIs

Summary

In this tutorial, you discovered how to fit and evaluate top dimensionality reduction algorithms in Python.

Specifically, you learned:

  • Dimensionality reduction seeks a lower-dimensional representation of numerical input data that preserves the salient relationships in the data.
  • There are many different dimensionality reduction algorithms and no single best method for all datasets.
  • How to implement, fit, and evaluate top dimensionality reduction in Python with the scikit-learn machine learning library.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post 6 Dimensionality Reduction Algorithms With Python appeared first on Machine Learning Mastery.


Framework for Data Preparation Techniques in Machine Learning


There are a vast number of different types of data preparation techniques that could be used on a predictive modeling project.

In some cases, the distribution of the data or the requirements of a machine learning model may suggest the data preparation needed, although this is rarely the case given the complexity and high dimensionality of the data, the ever-increasing parade of new machine learning algorithms, and the very human limitations of the practitioner.

Instead, data preparation can be treated as another hyperparameter to tune as part of the modeling pipeline. This raises the question of how to know what data preparation methods to consider in the search, which can feel overwhelming to experts and beginners alike.

The solution is to think about the vast field of data preparation in a structured way and systematically evaluate data preparation techniques based on their effect on the raw data.

In this tutorial, you will discover a framework that provides a structured approach to both thinking about and grouping data preparation techniques for predictive modeling with structured data.

After completing this tutorial, you will know:

  • The challenge and overwhelm of framing data preparation as yet another hyperparameter to tune in the machine learning modeling pipeline.
  • A framework that defines five groups of data preparation techniques to consider.
  • Examples of data preparation techniques that belong to each group that can be evaluated on your predictive modeling project.

Discover data cleaning, feature selection, data transforms, dimensionality reduction and much more in my new book, with 30 step-by-step tutorials and full Python source code.

Let’s get started.

Framework for Data Preparation Techniques in Machine Learning

Framework for Data Preparation Techniques in Machine Learning
Photo by Phil Dolby, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Challenge of Data Preparation
  2. Framework for Data Preparation
  3. Data Preparation Techniques

Challenge of Data Preparation

Data preparation refers to transforming raw data into a form that is better suited to predictive modeling.

This may be required because the data itself contains mistakes or errors. It may also be because the chosen algorithms have expectations regarding the type and distribution of the data.

To make the task of data preparation even more challenging, it is also common that the data preparation required to get the best performance from a predictive model may not be obvious and may bend or violate the expectations of the model that is being used.

As such, it is common to treat the choice and configuration of data preparation applied to the raw data as yet another hyperparameter of the modeling pipeline to be tuned.

This framing of data preparation is very effective in practice, as it allows you to use automatic search techniques like grid search and random search to discover unintuitive data preparation steps that result in skillful predictive models.
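
As a minimal sketch of this framing (the synthetic dataset and the candidate transforms are assumptions for illustration), the scaling step of a modeling pipeline can itself be treated as a hyperparameter and searched with GridSearchCV:

# sketch: treat the data preparation step as a hyperparameter to tune
# (the dataset and the candidate transforms are illustrative assumptions)
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
# define a dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=7)
# pipeline with a placeholder scaling step named 's'
pipeline = Pipeline([('s', MinMaxScaler()), ('m', LogisticRegression(solver='liblinear'))])
# search over alternative transforms for the 's' step, including no transform at all
grid = {'s': [MinMaxScaler(), StandardScaler(), 'passthrough']}
search = GridSearchCV(pipeline, grid, scoring='accuracy', cv=10)
search.fit(X, y)
print(search.best_params_, search.best_score_)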

This framing of data preparation can also feel overwhelming to beginners given the large number and variety of data preparation techniques.

The solution to this overwhelm is to think about data preparation techniques in a systematic way.

Want to Get Started With Data Preparation?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Download Your FREE Mini-Course

Framework for Data Preparation

Effective data preparation requires that the data preparation techniques available are organized and considered in a structured and systematic way.

This allows you to ensure that appropriate techniques are explored for your dataset and that potentially effective techniques are not skipped or ignored.

This can be achieved using a framework to organize data preparation techniques that consider their effect on the raw dataset.

For example, structured machine learning data, such as data we might store in a CSV file for classification and regression, consists of rows, columns, and values. We might consider data preparation techniques that operate at each of these levels.

  • Data Preparation for Rows
  • Data Preparation for Columns
  • Data Preparation for Values

Data preparation for rows may be techniques that add or remove rows of data from the dataset. Similarly, data preparation for columns may be techniques that add or remove columns (features or variables) from the dataset. Whereas data preparation for values may be techniques that change the values in the dataset, often for a given column.

There is one more type of data preparation that does not neatly fit into this structure, and that is dimensionality reduction techniques. These techniques change the columns and the values at the same time, e.g. projecting the data into a lower-dimensional space.

  • Data Preparation for Columns + Values

This raises the question of techniques that might apply to rows and values at the same time. This might include data preparation that consolidates rows of data in some way.

  • Data Preparation for Rows + Values

We can summarize this framework and some high-level groups of data preparation methods in the following image.

Machine Learning Data Preparation Framework

Machine Learning Data Preparation Framework

Now that we have a framework for thinking about data preparation based on their effect on the data, let’s look at examples of techniques that fit into each group.

Data Preparation Techniques

This section explores the five high-level groups of data preparation techniques defined in the previous section and suggests specific techniques that may fall within each group.

Did I miss one of your preferred or favorite data preparation techniques?
Let me know in the comments below.

Data Preparation for Rows

This group is for data preparation techniques that add or remove rows of data.

In machine learning, rows are often referred to as samples, examples, or instances.

These techniques are often used to augment a limited training dataset or to remove errors or ambiguity from the dataset.

The main class of techniques that come to mind are data preparation techniques that are often used for imbalanced classification.

This includes techniques such as SMOTE that create synthetic rows of training data for under-represented classes and random undersampling that remove examples for over-represented classes.

For more on SMOTE data sampling, see the tutorial:
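
A minimal sketch of SMOTE is shown below. It assumes the imbalanced-learn library is installed (pip install imbalanced-learn) and uses a synthetic imbalanced dataset for illustration.

# sketch: oversample the minority class with SMOTE
# (assumes the imbalanced-learn library is installed; the dataset is illustrative)
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
# define an imbalanced binary classification dataset
X, y = make_classification(n_samples=1000, weights=[0.95], flip_y=0, random_state=1)
print('before:', Counter(y))
# create synthetic rows of training data for the minority class
X_res, y_res = SMOTE(random_state=1).fit_resample(X, y)
print('after:', Counter(y_res))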

It also includes more advanced combined over- and undersampling techniques that attempt to identify and remove ambiguous examples along the decision boundary of a classification problem and remove them or change their class label.

For more on these types of data preparation, see the tutorial:

This class of data preparation techniques also includes algorithms for identifying and removing outliers from the data. These are rows of data that may be far from the center of probability mass in the dataset and, in turn, may be unrepresentative of the data from the domain.

For more on outlier detection and removal methods, see the tutorial:

Data Preparation for Columns

This group is for data preparation techniques that add or remove columns of data.

In machine learning, columns are often referred to as variables or features.

These techniques are often required to either reduce the complexity (dimensionality) of a prediction problem or to unpack compound input variables or complex interactions between features.

The main class of techniques that come to mind are feature selection techniques.

This includes techniques that use statistics to score the relevance of input variables to the target variable based on the data type of each.

For more on these types of data preparation techniques, see the tutorial:

This also includes feature selection techniques that systematically test the impact of different combinations of input variables on the predictive skill of a machine learning model.

For more on these types of methods, see the tutorial:

Related are techniques that use a model to score the importance of input features based on their use by a predictive model, referred to as feature importance methods. These methods are often used for data interpretation, although they can also be used for feature selection.

For more on these types of methods, see the tutorial:

This group of methods also brings to mind techniques for creating or deriving new columns of data, new features. These are often referred to as feature engineering, although sometimes the whole field of data preparation is referred to as feature engineering.

For example, new features that represent values raised to exponents or multiplicative combinations of features can be created and added to the dataset as new columns.

For more on these types of data preparation techniques, see the tutorial:
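
As a small sketch of this kind of feature engineering (degree=2 and the tiny dataset are assumptions for illustration), the PolynomialFeatures transform derives such columns automatically:

# sketch: derive polynomial and interaction features as new columns
# (degree=2 and the tiny dataset are illustrative assumptions)
from numpy import asarray
from sklearn.preprocessing import PolynomialFeatures
# a dataset with two input columns
X = asarray([[2, 3], [4, 5]])
# add squares and pairwise products of the inputs as new columns
trans = PolynomialFeatures(degree=2, include_bias=False)
print(trans.fit_transform(X))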

This might also include data transforms that change a variable type, such as creating dummy variables for a categorical variable, often referred to as a one-hot encoding.

For more on these types of data preparation techniques, see the tutorial:
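
A minimal sketch of a one-hot encoding is below; the toy color variable is an assumption for illustration only.

# sketch: one-hot encode a categorical variable into dummy columns
# (the toy 'color' values are an illustrative assumption)
from numpy import asarray
from sklearn.preprocessing import OneHotEncoder
# a single categorical input column
X = asarray([['red'], ['green'], ['blue'], ['green']])
# fit the encoding and convert the sparse result to a dense array for printing
encoder = OneHotEncoder()
print(encoder.fit_transform(X).toarray())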

Data Preparation for Values

This group is for data preparation techniques that change the raw values in the data.

These techniques are often required to meet the expectations or requirements of specific machine learning algorithms.

The main class of techniques that come to mind is data transforms that change the scale or distribution of input variables.

For example, data transforms such as standardization and normalization change the scale of numeric input variables. Data transforms like ordinal encoding change the type of categorical input variables.

There are also many data transforms for changing the distribution of input variables.

For example, discretization or binning change the distribution of numerical input variables into categorical variables with an ordinal ranking.

For more on this type of data transform, see the tutorial:

The power transform can be used to change the distribution of data to remove a skew and make the distribution more normal (Gaussian).

For more on this method, see the tutorial:
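
A minimal sketch of the power transform is below; the exponentially distributed synthetic variable is an assumption for illustration only.

# sketch: apply a power transform to make a skewed variable more Gaussian
# (the exponentially distributed synthetic data are an illustrative assumption)
from numpy.random import RandomState
from sklearn.preprocessing import PowerTransformer
# a skewed (exponential) input variable as a single column
X = RandomState(1).exponential(size=(1000, 1))
# the default yeo-johnson method also handles zero and negative values
trans = PowerTransformer(method='yeo-johnson')
X_new = trans.fit_transform(X)
print('mean before: %.3f, mean after: %.3f' % (X.mean(), X_new.mean()))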

The quantile transform is a flexible type of data preparation technique that can map a numerical input variable to a different type of distribution, such as a uniform or normal (Gaussian) distribution.

You can learn more about this data preparation technique here:

Another type of data preparation technique that belongs to this group are methods that systematically change values in the dataset.

This includes techniques that identify and replace missing values, often referred to as missing value imputation. This can be achieved using statistical methods or more advanced model-based methods.

For more on these methods, see the tutorial:
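
A minimal sketch of statistical imputation is below; the toy array containing a missing value is an assumption for illustration only.

# sketch: replace missing values with a column statistic
# (the toy array containing a NaN is an illustrative assumption)
from numpy import nan
from numpy import asarray
from sklearn.impute import SimpleImputer
# a column with one missing value
X = asarray([[1.0], [2.0], [nan], [4.0]])
# replace missing entries with the column mean
imputer = SimpleImputer(strategy='mean')
print(imputer.fit_transform(X))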

All of the methods discussed could also be considered feature engineering methods (e.g. fitting into the previously discussed group of data preparation methods) if the results of the transforms are appended to the raw data as new columns.

Data Preparation for Columns + Values

This group is for data preparation techniques that change both the number of columns and the values in the data.

The main class of techniques that this brings to mind are dimensionality reduction techniques that specifically reduce the number of columns and the scale and distribution of numerical input variables.

This includes matrix factorization methods used in linear algebra as well as manifold learning algorithms used in high-dimensional statistics.

For more information on these techniques, see the tutorial:

Although these techniques are designed to create projections of rows in a lower-dimensional space, perhaps this also leaves the door open to techniques that do the inverse. That is, use all or a subset of the input variables to create a projection into a higher-dimensional space, perhaps disentangling complex non-linear relationships.

Perhaps polynomial transforms where the results replace the raw dataset would fit into this class of data preparation methods.

Do you know of other methods that fit into this group?
Let me know in the comments below.

Data Preparation for Rows + Values

This group is for data preparation techniques that change both the number of rows and the values in the data.

I have not explicitly considered data transforms of this type before, but it falls out of the framework as defined.

A group of methods that come to mind are clustering algorithms where all or subsets of rows of data in the dataset are replaced with data samples at the cluster centers, referred to as cluster centroids.
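
A minimal sketch of this idea is below; clustering each class separately and keeping 10 centroids per class are assumptions for illustration only.

# sketch: condense a dataset by replacing each class's rows with cluster centroids
# (10 centroids per class and the synthetic dataset are illustrative assumptions)
from numpy import vstack
from numpy import hstack
from numpy import full
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
# define a dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=5, n_redundant=0, random_state=1)
# cluster the rows of each class and keep only the cluster centers
new_X, new_y = list(), list()
for label in [0, 1]:
	km = KMeans(n_clusters=10, n_init=10, random_state=1).fit(X[y == label])
	new_X.append(km.cluster_centers_)
	new_y.append(full(10, label))
X_small, y_small = vstack(new_X), hstack(new_y)
print(X_small.shape, y_small.shape)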

Related might be replacing rows with exemplars (aggregates of rows) taken from specific machine learning algorithms, such as support vectors from a support vector machine, or the codebook vectors taken from a learning vector quantization.

If these aggregate rows are simply added to the dataset rather than replacing rows, then they would naturally fit into the “Data Preparation for Rows” group described above.

Do you know of other methods that fit into this group?
Let me know in the comments below.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

Summary

In this tutorial, you discovered a framework for systematically grouping data preparation techniques based on their effect on raw data.

Specifically, you learned:

  • The challenge and overwhelm of framing data preparation as yet another hyperparameter to tune in the machine learning modeling pipeline.
  • A framework that defines five groups of data preparation techniques to consider.
  • Examples of data preparation techniques that belong to each group that can be evaluated on your predictive modeling project.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post Framework for Data Preparation Techniques in Machine Learning appeared first on Machine Learning Mastery.

How to Grid Search Data Preparation Techniques


Machine learning predictive modeling performance is only as good as your data, and your data is only as good as the way you prepare it for modeling.

The most common approach to data preparation is to study a dataset and review the expectations of a machine learning algorithm, then carefully choose the most appropriate data preparation techniques to transform the raw data to best meet the expectations of the algorithm. This is slow, expensive, and requires a vast amount of expertise.

An alternative approach to data preparation is to grid search a suite of common and commonly useful data preparation techniques applied to the raw data. This is an alternative philosophy for data preparation that treats data transforms as another hyperparameter of the modeling pipeline to be searched and tuned.

This approach requires less expertise than the traditional manual approach to data preparation, although it is computationally costly. The benefit is that it can aid in the discovery of non-intuitive data preparation solutions that achieve good or best performance for a given predictive modeling problem.

In this tutorial, you will discover how to use the grid search approach for data preparation with tabular data.

After completing this tutorial, you will know:

  • Grid search provides an alternative approach to data preparation for tabular data, where transforms are tried as hyperparameters of the modeling pipeline.
  • How to use the grid search method for data preparation to improve model performance over a baseline for a standard classification dataset.
  • How to grid search sequences of data preparation methods to further improve model performance.

Discover data cleaning, feature selection, data transforms, dimensionality reduction and much more in my new book, with 30 step-by-step tutorials and full Python source code.

Let’s get started.

How to Grid Search Data Preparation Techniques

How to Grid Search Data Preparation Techniques
Photo by Wall Boat, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Grid Search Technique for Data Preparation
  2. Dataset and Performance Baseline
    1. Wine Classification Dataset
    2. Baseline Model Performance
  3. Grid Search Approach to Data Preparation

Grid Search Technique for Data Preparation

Data preparation can be challenging.

The approach that is most often prescribed and followed is to analyze the dataset, review the requirements of the algorithms, and transform the raw data to best meet the expectations of the algorithms.

This can be effective but is also slow and can require deep expertise with data analysis and machine learning algorithms.

An alternative approach is to treat the preparation of input variables as a hyperparameter of the modeling pipeline and to tune it along with the choice of algorithm and algorithm configurations.

This might be a data transform that “should not work” or “should not be appropriate for the algorithm” yet results in good or great performance. Alternatively, it may be the absence of a data transform for an input variable that is deemed “absolutely required” yet results in good or great performance.

This can be achieved by designing a grid search of data preparation techniques and/or sequences of data preparation techniques in pipelines. This may involve evaluating each on a single chosen machine learning algorithm, or on a suite of machine learning algorithms.

The benefit of this approach is that it always results in suggestions of modeling pipelines that give good relative results. Most importantly, it can unearth the non-obvious and unintuitive solutions to practitioners without the need for deep expertise.

We can explore this approach to data preparation with a worked example.

Before we dive into a worked example, let’s first select a standard dataset and develop a baseline in performance.

Want to Get Started With Data Preparation?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Download Your FREE Mini-Course

Dataset and Performance Baseline

In this section, we will first select a standard machine learning dataset and establish a baseline in performance on this dataset. This will provide the context for exploring the grid search method of data preparation in the next section.

Wine Classification Dataset

We will use the wine classification dataset.

This dataset has 13 input variables that describe the chemical composition of samples of wine and requires that the wine be classified as one of three types.

You can learn more about the dataset here:

No need to download the dataset as we will download it automatically as part of our worked examples.

Open the dataset and review the raw data. The first few rows of data are listed below.

We can see that it is a multi-class classification predictive modeling problem with numerical input variables, each of which has a different scale.

14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065,1
13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050,1
13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185,1
14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480,1
13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735,1
...

The example below loads the dataset and splits it into the input and output columns, then summarizes the data arrays.

# example of loading and summarizing the wine dataset
from pandas import read_csv
# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
# load the dataset as a data frame
df = read_csv(url, header=None)
# retrieve the numpy array
data = df.values
# split the columns into input and output variables
X, y = data[:, :-1], data[:, -1]
# summarize the shape of the loaded data
print(X.shape, y.shape)

Running the example, we can see that the dataset was loaded correctly and that there are 178 rows of data with 13 input variables and a single target variable.

(178, 13) (178,)

Next, let’s evaluate a model on this dataset and establish a baseline in performance.

Baseline Model Performance

We can establish a baseline in performance on the wine classification task by evaluating a model on the raw input data.

In this case, we will evaluate a logistic regression model.

First, we can define a function to load the dataset and perform some minimal data preparation to ensure the inputs are numeric and the target is label encoded.

# prepare the dataset
def load_dataset():
	# load the dataset
	url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
	df = read_csv(url, header=None)
	data = df.values
	X, y = data[:, :-1], data[:, -1]
	# minimally prepare dataset
	X = X.astype('float')
	y = LabelEncoder().fit_transform(y.astype('str'))
	return X, y

We will evaluate the model using the gold standard of repeated stratified k-fold cross-validation with 10 folds and three repeats.

# evaluate a model
def evaluate_model(X, y, model):
	# define the cross-validation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# evaluate model
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	return scores

We can then call the function to load the dataset, define our model, then evaluate it, reporting the mean and standard deviation accuracy.

...
# get the dataset
X, y = load_dataset()
# define the model
model = LogisticRegression(solver='liblinear')
# evaluate the model
scores = evaluate_model(X, y, model)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Tying this together, the complete example of evaluating a logistic regression model on the raw wine classification dataset is listed below.

# baseline model performance on the wine dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# prepare the dataset
def load_dataset():
	# load the dataset
	url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
	df = read_csv(url, header=None)
	data = df.values
	X, y = data[:, :-1], data[:, -1]
	# minimally prepare dataset
	X = X.astype('float')
	y = LabelEncoder().fit_transform(y.astype('str'))
	return X, y

# evaluate a model
def evaluate_model(X, y, model):
	# define the cross-validation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# evaluate model
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	return scores

# get the dataset
X, y = load_dataset()
# define the model
model = LogisticRegression(solver='liblinear')
# evaluate the model
scores = evaluate_model(X, y, model)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the model performance and reports the mean and standard deviation classification accuracy.

Your results may vary given the stochastic nature of the learning algorithm, the evaluation procedure, and differences in precision across machines. Try running the example a few times.

In this case, we can see that the logistic regression model fit on the raw input data achieved an average classification accuracy of about 95.3 percent, providing a baseline in performance.

Accuracy: 0.953 (0.048)

Next, let’s explore whether we can improve the performance using the grid-search-based approach to data preparation.

Grid Search Approach to Data Preparation

In this section, we can explore whether we can improve performance using the grid search approach to data preparation.

The first step is to define a series of modeling pipelines to evaluate, where each pipeline defines one (or more) data preparation techniques and ends with a model that takes the transformed data as input.

We will define a function to create these pipelines as a list of tuples, where each tuple defines the short name for the pipeline and the pipeline itself. We will evaluate a range of different data scaling methods (e.g. MinMaxScaler and StandardScaler), distribution transforms (QuantileTransformer and KBinsDiscretizer), as well as dimensionality reduction transforms (PCA and SVD).

# get modeling pipelines to evaluate
def get_pipelines(model):
	pipelines = list()
	# normalize
	p = Pipeline([('s',MinMaxScaler()), ('m',model)])
	pipelines.append(('norm', p))
	# standardize
	p = Pipeline([('s',StandardScaler()), ('m',model)])
	pipelines.append(('std', p))
	# quantile
	p = Pipeline([('s',QuantileTransformer(n_quantiles=100, output_distribution='normal')), ('m',model)])
	pipelines.append(('quan', p))
	# discretize
	p = Pipeline([('s',KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')), ('m',model)])
	pipelines.append(('kbins', p))
	# pca
	p = Pipeline([('s',PCA(n_components=7)), ('m',model)])
	pipelines.append(('pca', p))
	# svd
	p = Pipeline([('s',TruncatedSVD(n_components=7)), ('m',model)])
	pipelines.append(('svd', p))
	return pipelines

We can then call this function to get the list of transforms, then enumerate each, evaluating it and reporting the performance along the way.

...
# get the modeling pipelines
pipelines = get_pipelines(model)
# evaluate each pipeline
results, names = list(), list()
for name, pipeline in pipelines:
	# evaluate
	scores = evaluate_model(X, y, pipeline)
	# summarize
	print('>%s: %.3f (%.3f)' % (name, mean(scores), std(scores)))
	# store
	results.append(scores)
	names.append(name)

At the end of the run, we can create a box and whisker plot for each set of scores and compare the distributions of results visually.

...
# plot the result
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Tying this together, the complete example of grid searching data preparation techniques on the wine classification dataset is listed below.

# compare data preparation methods for the wine classification dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
from matplotlib import pyplot

# prepare the dataset
def load_dataset():
	# load the dataset
	url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
	df = read_csv(url, header=None)
	data = df.values
	X, y = data[:, :-1], data[:, -1]
	# minimally prepare dataset
	X = X.astype('float')
	y = LabelEncoder().fit_transform(y.astype('str'))
	return X, y

# evaluate a model
def evaluate_model(X, y, model):
	# define the cross-validation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# evaluate model
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	return scores

# get modeling pipelines to evaluate
def get_pipelines(model):
	pipelines = list()
	# normalize
	p = Pipeline([('s',MinMaxScaler()), ('m',model)])
	pipelines.append(('norm', p))
	# standardize
	p = Pipeline([('s',StandardScaler()), ('m',model)])
	pipelines.append(('std', p))
	# quantile
	p = Pipeline([('s',QuantileTransformer(n_quantiles=100, output_distribution='normal')), ('m',model)])
	pipelines.append(('quan', p))
	# discretize
	p = Pipeline([('s',KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')), ('m',model)])
	pipelines.append(('kbins', p))
	# pca
	p = Pipeline([('s',PCA(n_components=7)), ('m',model)])
	pipelines.append(('pca', p))
	# svd
	p = Pipeline([('s',TruncatedSVD(n_components=7)), ('m',model)])
	pipelines.append(('svd', p))
	return pipelines

# get the dataset
X, y = load_dataset()
# define the model
model = LogisticRegression(solver='liblinear')
# get the modeling pipelines
pipelines = get_pipelines(model)
# evaluate each pipeline
results, names = list(), list()
for name, pipeline in pipelines:
	# evaluate
	scores = evaluate_model(X, y, pipeline)
	# summarize
	print('>%s: %.3f (%.3f)' % (name, mean(scores), std(scores)))
	# store
	results.append(scores)
	names.append(name)
# plot the result
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Running the example evaluates the performance of each pipeline and reports the mean and standard deviation classification accuracy.

Your results may vary given the stochastic nature of the learning algorithm, the evaluation procedure, and differences in precision across machines. Try running the example a few times.

In this case, we can see that standardizing the input variables and using a quantile transform both achieve the best result with a classification accuracy of about 98.7 percent, an improvement over the baseline with no data preparation that achieved a classification accuracy of 95.3 percent.

You can add your own modeling pipelines to the get_pipelines() function and compare their results.

Can you get better results?
Let me know in the comments below.

>norm: 0.976 (0.031)
>std: 0.987 (0.023)
>quan: 0.987 (0.023)
>kbins: 0.968 (0.045)
>pca: 0.963 (0.039)
>svd: 0.953 (0.048)

A figure is created showing box and whisker plots that summarize the distribution of classification accuracy scores for each data preparation technique. We can see that the distributions of scores for the standardization and quantile transforms are compact and very similar, each with an outlier. We can see that the spread of scores for the other transforms is larger and skews downward.

The results suggest that standardizing the dataset is probably an important data preparation step, and that related transforms, such as the quantile transform, and perhaps even the power transform, may offer benefits when combined with standardization by making one or more input variables more Gaussian.

Box and Whisker Plot of Classification Accuracy for Different Data Transforms on the Wine Classification Dataset

We can also explore sequences of transforms to see if they can offer a lift in performance.

For example, we might want to apply RFE feature selection after the standardization transform to see if the same or better results can be achieved with fewer input variables (e.g. less complexity).

We might also want to see if a power transform preceded by a data scaling transform can achieve good performance on the dataset, as we believe it could given the success of the quantile transform.

The updated get_pipelines() function with sequences of transforms is provided below.

# get modeling pipelines to evaluate
def get_pipelines(model):
	pipelines = list()
	# standardize
	p = Pipeline([('s',StandardScaler()), ('r', RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=10)), ('m',model)])
	pipelines.append(('std', p))
	# scale and power
	p = Pipeline([('s',MinMaxScaler((1,2))), ('p', PowerTransformer()), ('m',model)])
	pipelines.append(('power', p))
	return pipelines

Tying this together, the complete example is listed below.

# compare sequences of data preparation methods for the wine classification dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_selection import RFE
from matplotlib import pyplot

# prepare the dataset
def load_dataset():
	# load the dataset
	url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/wine.csv'
	df = read_csv(url, header=None)
	data = df.values
	X, y = data[:, :-1], data[:, -1]
	# minimally prepare dataset
	X = X.astype('float')
	y = LabelEncoder().fit_transform(y.astype('str'))
	return X, y

# evaluate a model
def evaluate_model(X, y, model):
	# define the cross-validation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# evaluate model
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	return scores

# get modeling pipelines to evaluate
def get_pipelines(model):
	pipelines = list()
	# standardize
	p = Pipeline([('s',StandardScaler()), ('r', RFE(estimator=LogisticRegression(solver='liblinear'), n_features_to_select=10)), ('m',model)])
	pipelines.append(('std', p))
	# scale and power
	p = Pipeline([('s',MinMaxScaler((1,2))), ('p', PowerTransformer()), ('m',model)])
	pipelines.append(('power', p))
	return pipelines

# get the dataset
X, y = load_dataset()
# define the model
model = LogisticRegression(solver='liblinear')
# get the modeling pipelines
pipelines = get_pipelines(model)
# evaluate each pipeline
results, names = list(), list()
for name, pipeline in pipelines:
	# evaluate
	scores = evaluate_model(X, y, pipeline)
	# summarize
	print('>%s: %.3f (%.3f)' % (name, mean(scores), std(scores)))
	# store
	results.append(scores)
	names.append(name)
# plot the result
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Running the example evaluates the performance of each pipeline and reports the mean and standard deviation classification accuracy.

Your results may vary given the stochastic nature of the learning algorithm, the evaluation procedure, and differences in precision across machines. Try running the example a few times.

In this case, we can see that the standardization with feature selection offers an additional lift in accuracy from 98.7 percent to 98.9 percent, although the data scaling and power transform do not offer any additional benefit over the quantile transform.

>std: 0.989 (0.022)
>power: 0.987 (0.023)

A figure is created showing box and whisker plots that summarize the distribution of classification accuracy scores for each data preparation technique.

We can see that the distribution of results for both pipelines of transforms is compact, with very little spread other than an outlier.

Box and Whisker Plot of Classification Accuracy for Different Sequences of Data Transforms on the Wine Classification Dataset

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

APIs

Summary

In this tutorial, you discovered how to use a grid search approach for data preparation with tabular data.

Specifically, you learned:

  • Grid search provides an alternative approach to data preparation for tabular data, where transforms are tried as hyperparameters of the modeling pipeline.
  • How to use the grid search method for data preparation to improve model performance over a baseline for a standard classification dataset.
  • How to grid search sequences of data preparation methods to further improve model performance.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.


How to Create Custom Data Transforms for Scikit-Learn


The scikit-learn Python library for machine learning offers a suite of data transforms for changing the scale and distribution of input data, as well as removing input features (columns).

There are many simple data cleaning operations, such as removing outliers and removing columns with few observations, that are often performed manually on the data, requiring custom code.

The scikit-learn library provides a way to wrap these custom data transforms in a standard way so they can be used just like any other transform, either on data directly or as a part of a modeling pipeline.

In this tutorial, you will discover how to define and use custom data transforms for scikit-learn.

After completing this tutorial, you will know:

  • That custom data transforms can be created for scikit-learn using the FunctionTransformer class.
  • How to develop and apply a custom transform to remove columns with few unique values.
  • How to develop and apply a custom transform that replaces outliers for each column.


Let’s get started.

How to Create Custom Data Transforms for Scikit-Learn

How to Create Custom Data Transforms for Scikit-Learn
Photo by Berit Watkin, some rights reserved.

Tutorial Overview

This tutorial is divided into four parts; they are:

  1. Custom Data Transforms in Scikit-Learn
  2. Oil Spill Dataset
  3. Custom Transform to Remove Columns
  4. Custom Transform to Replace Outliers

Custom Data Transforms in Scikit-Learn

Data preparation refers to changing the raw data in some way that makes it more appropriate for predictive modeling with machine learning algorithms.

The scikit-learn Python machine learning library offers many different data preparation techniques directly, such as techniques for scaling numerical input variables and changing the probability distribution of variables.

These transforms can be fit and then applied on a dataset or used as part of a predictive modeling pipeline, allowing a sequence of transforms to be applied correctly without data leakage when evaluating model performance with data sampling techniques, such as k-fold cross-validation.

Although the data preparation techniques available in scikit-learn are extensive, there may be additional data preparation steps that are required.

Typically, these additional steps are performed manually prior to modeling and require writing custom code. The risk is that these data preparation steps may be performed inconsistently.

The solution is to create a custom data transform in scikit-learn using the FunctionTransformer class.

This class allows you to specify a function that is called to transform the data. You can define the function and perform any valid change, such as changing values or removing columns of data (not removing rows).

The class can then be used just like any other data transform in scikit-learn, e.g. to transform data directly, or used in a modeling pipeline.
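
To make the idea concrete, below is a minimal sketch (separate from this tutorial's worked examples) that wraps a trivial, hypothetical add_one() function and applies it to contrived data.

# minimal sketch: wrap a custom function with the FunctionTransformer class
from numpy import asarray
from sklearn.preprocessing import FunctionTransformer

# hypothetical transform function that adds one to every value
def add_one(X):
	return X + 1

# define the transformer and apply it to contrived data
trans = FunctionTransformer(add_one)
data = asarray([[1, 2], [3, 4]])
print(trans.fit_transform(data))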

The catch is that the transform is stateless, meaning that no state can be kept.

This means that the transform cannot be used to calculate statistics on the training dataset that are then used to transform the train and test datasets.
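
If statistics must be learned from the training data, one alternative (not used in this tutorial) is a small custom transformer class. The sketch below shows a hypothetical MeanCenterer that learns column means in fit() and applies them in transform().

# minimal sketch of a stateful custom transform, an alternative to FunctionTransformer
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
	# learn the column means from the training data
	def fit(self, X, y=None):
		self.means_ = X.mean(axis=0)
		return self

	# subtract the learned means from any dataset
	def transform(self, X):
		return X - self.means_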

In addition to custom scaling operations, this can be helpful for standard data cleaning operations, such as identifying and removing columns with few unique values and identifying and replacing outliers.

We will explore both of these cases, but first, let’s define a dataset that we can use as the basis for exploration.


Oil Spill Dataset

The so-called “oil spill” dataset is a standard machine learning dataset.

The task involves predicting whether a patch contains an oil spill or not, e.g. from the illegal or accidental dumping of oil in the ocean, given a vector that describes the contents of a patch of a satellite image.

There are 937 cases. Each case is composed of 48 numerical computer vision derived features, a patch number, and a class label.

The normal case is no oil spill assigned the class label of 0, whereas an oil spill is indicated by a class label of 1. There are 896 cases for no oil spill and 41 cases of an oil spill.

You can access the entire dataset here:

Review the contents of the file.

The first few lines of the file should look as follows:

1,2558,1506.09,456.63,90,6395000,40.88,7.89,29780,0.19,214.7,0.21,0.26,0.49,0.1,0.4,99.59,32.19,1.84,0.16,0.2,87.65,0,0.47,132.78,-0.01,3.78,0.22,3.2,-3.71,-0.18,2.19,0,2.19,310,16110,0,138.68,89,69,2850,1000,763.16,135.46,3.73,0,33243.19,65.74,7.95,1
2,22325,79.11,841.03,180,55812500,51.11,1.21,61900,0.02,901.7,0.02,0.03,0.11,0.01,0.11,6058.23,4061.15,2.3,0.02,0.02,87.65,0,0.58,132.78,-0.01,3.78,0.84,7.09,-2.21,0,0,0,0,704,40140,0,68.65,89,69,5750,11500,9593.48,1648.8,0.6,0,51572.04,65.73,6.26,0
3,115,1449.85,608.43,88,287500,40.42,7.34,3340,0.18,86.1,0.21,0.32,0.5,0.17,0.34,71.2,16.73,1.82,0.19,0.29,87.65,0,0.46,132.78,-0.01,3.78,0.7,4.79,-3.36,-0.23,1.95,0,1.95,29,1530,0.01,38.8,89,69,1400,250,150,45.13,9.33,1,31692.84,65.81,7.84,1
4,1201,1562.53,295.65,66,3002500,42.4,7.97,18030,0.19,166.5,0.21,0.26,0.48,0.1,0.38,120.22,33.47,1.91,0.16,0.21,87.65,0,0.48,132.78,-0.01,3.78,0.84,6.78,-3.54,-0.33,2.2,0,2.2,183,10080,0,108.27,89,69,6041.52,761.58,453.21,144.97,13.33,1,37696.21,65.67,8.07,1
5,312,950.27,440.86,37,780000,41.43,7.03,3350,0.17,232.8,0.15,0.19,0.35,0.09,0.26,289.19,48.68,1.86,0.13,0.16,87.65,0,0.47,132.78,-0.01,3.78,0.02,2.28,-3.44,-0.44,2.19,0,2.19,45,2340,0,14.39,89,69,1320.04,710.63,512.54,109.16,2.58,0,29038.17,65.66,7.35,0
...

We can see that the first column contains integers for the patch number. We can also see that the computer vision derived features are real-valued with differing scales, such as thousands in the second column and fractions in other columns.

This dataset contains columns with very few unique values and columns with outliers that provide a good basis for data cleaning.

The example below downloads the dataset, loads it as a NumPy array, and summarizes the number of rows and columns.

# load the oil dataset
from pandas import read_csv
# define the location of the dataset
path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
df = read_csv(path, header=None)
# split data into inputs and outputs
data = df.values
X = data[:, :-1]
y = data[:, -1]
print(X.shape, y.shape)

Running the example loads the dataset and confirms the expected number of rows and columns.

(937, 49) (937,)

Now that we have a dataset that we can use as the basis for data transforms, let’s look at how we can define some custom data cleaning transforms using the FunctionTransformer class.

Custom Transform to Remove Columns

Columns that have few unique values are probably not contributing anything useful to predicting the target value.

This is not absolutely true, but it is true enough that you should test the performance of your model fit on a dataset with columns of this type removed.

This is a type of data cleaning, and there is a data transform provided in scikit-learn called the VarianceThreshold that attempts to address this using the variance of each column.
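
For reference, a minimal sketch of the built-in approach is shown below; with its default threshold of 0.0, the VarianceThreshold transform removes columns that contain a single unique value (zero variance).

# minimal sketch: remove zero-variance columns with VarianceThreshold
from numpy import asarray
from sklearn.feature_selection import VarianceThreshold

# contrived data where the middle column has a single unique value
data = asarray([[1, 7, 2], [2, 7, 4], [3, 7, 6]])
print(VarianceThreshold().fit_transform(data))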

Another approach is to remove columns that have fewer than a specified number of unique values, such as 1.

We can develop a function that applies this transform and use the minimum number of unique values as a configurable default argument. We will also add some debugging to confirm it is working as we expect.

First, the number of unique values for each column can be calculated. Then, columns with a number of unique values equal to or fewer than the minimum can be identified. Finally, those identified columns can be removed from the dataset.

The cust_transform() function below implements this.

# remove columns with few unique values
def cust_transform(X, min_values=1, verbose=True):
	# get number of unique values for each column
	counts = [len(unique(X[:, i])) for i in range(X.shape[1])]
	if verbose:
		print('Unique Values: %s' % counts)
	# select columns to delete
	to_del = [i for i,v in enumerate(counts) if v <= min_values]
	if verbose:
		print('Deleting: %s' % to_del)
	if len(to_del) == 0:
		return X
	# select all but the columns that are being removed
	ix = [i for i in range(X.shape[1]) if i not in to_del]
	result = X[:, ix]
	return result

We can then use this function in the FunctionTransformer.

A limitation of this transform is that it selects columns to delete based on the provided data. This means that if the train and test datasets differ greatly, then it is possible for different columns to be removed from each, making model evaluation challenging (unstable). As such, it is best to keep the minimum number of unique values small, such as 1.

We can use this transform on the oil spill dataset. The complete example is listed below.

# custom data transform for removing columns with few unique values
from numpy import unique
from pandas import read_csv
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import LabelEncoder

# load a dataset
def load_dataset(path):
	# load the dataset
	df = read_csv(path, header=None)
	data = df.values
	# split data into inputs and outputs
	X, y = data[:, :-1], data[:, -1]
	# minimally prepare dataset
	X = X.astype('float')
	y = LabelEncoder().fit_transform(y.astype('str'))
	return X, y

# remove columns with few unique values
def cust_transform(X, min_values=1, verbose=True):
	# get number of unique values for each column
	counts = [len(unique(X[:, i])) for i in range(X.shape[1])]
	if verbose:
		print('Unique Values: %s' % counts)
	# select columns to delete
	to_del = [i for i,v in enumerate(counts) if v <= min_values]
	if verbose:
		print('Deleting: %s' % to_del)
	if len(to_del) == 0:
		return X
	# select all but the columns that are being removed
	ix = [i for i in range(X.shape[1]) if i not in to_del]
	result = X[:, ix]
	return result

# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
X, y = load_dataset(url)
print(X.shape, y.shape)
# define the transformer
trans = FunctionTransformer(cust_transform)
# apply the transform
X = trans.fit_transform(X)
# summarize new shape
print(X.shape)

Running the example first reports the number of rows and columns in the raw dataset.

Next, a list is printed that shows the number of unique values observed for each column in the dataset. We can see that many columns have very few unique values.

The columns with one (or fewer) unique values are then identified and reported. In this case, column index 22. This column is removed from the dataset.

Finally, the shape of the transformed dataset is reported, showing 48 instead of 49 columns, confirming that the column with a single unique value was deleted.

(937, 49) (937,)
Unique Values: [238, 297, 927, 933, 179, 375, 820, 618, 561, 57, 577, 59, 73, 107, 53, 91, 893, 810, 170, 53, 68, 9, 1, 92, 9, 8, 9, 308, 447, 392, 107, 42, 4, 45, 141, 110, 3, 758, 9, 9, 388, 220, 644, 649, 499, 2, 937, 169, 286]
Deleting: [22]
(937, 48)

There are many extensions you could explore to this transform, such as:

  • Ensure that it is only applied to numerical input variables.
  • Experiment with a different minimum number of unique values.
  • Use a percentage rather than an absolute number of unique values (a sketch of this variant is shown below).
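
For example, the last extension might look like the sketch below; the min_percent argument is hypothetical and expresses the minimum number of unique values as a percentage of the number of rows.

# remove columns where the percentage of unique values is at or below a threshold
from numpy import unique

def cust_transform(X, min_percent=1.0, verbose=True):
	# percentage of unique values for each column
	counts = [100.0 * len(unique(X[:, i])) / X.shape[0] for i in range(X.shape[1])]
	if verbose:
		print('Unique Value Percentages: %s' % counts)
	# select columns to delete
	to_del = [i for i, v in enumerate(counts) if v <= min_percent]
	if len(to_del) == 0:
		return X
	# select all but the columns that are being removed
	ix = [i for i in range(X.shape[1]) if i not in to_del]
	return X[:, ix]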

If you explore any of these extensions, let me know in the comments below.

Next, let’s look at a transform that replaces values in the dataset.

Custom Transform to Replace Outliers

Outliers are observations that are different or unlike the other observations.

If we consider one variable at a time, an outlier would be a value that is far from the center of mass (the rest of the values), meaning it is rare or has a low probability of being observed.

There are standard ways for identifying outliers for common probability distributions. For Gaussian data, we can identify outliers as observations that are three or more standard deviations from the mean.

This may or may not be a desirable way to identify outliers for data that has many input variables, yet can be effective in some cases.

We can identify outliers in this way and replace their value with a correction, such as the mean.

Each column is considered one at a time and mean and standard deviation statistics are calculated. Using these statistics, upper and lower bounds of “normal” values are defined, then all values that fall outside these bounds can be identified. If one or more outliers are identified, their values are then replaced with the mean value that was already calculated.

The cust_transform() function below implements this as a function applied to the dataset, where we parameterize the number of standard deviations from the mean and whether or not debug information will be displayed.

# replace outliers
def cust_transform(X, n_stdev=3, verbose=True):
	# copy the array
	result = X.copy()
	# enumerate each column
	for i in range(result.shape[1]):
		# retrieve values for column
		col = X[:, i]
		# calculate statistics
		mu, sigma = mean(col), std(col)
		# define bounds
		lower, upper = mu-(sigma*n_stdev), mu+(sigma*n_stdev)
		# select indexes that are out of bounds
		ix = where(logical_or(col < lower, col > upper))[0]
		if verbose and len(ix) > 0:
			print('>col=%d, outliers=%d' % (i, len(ix)))
		# replace values
		result[ix, i] = mu
	return result

We can then use this function in the FunctionTransformer.

The method of outlier detection assumes a Gaussian probability distribution and applies to each variable independently, both of which are strong assumptions.

An additional limitation of this implementation is that the mean and standard deviation statistics are calculated on the provided dataset, meaning that the definition of an outlier and its replacement value are both relative to the dataset. This means that different definitions of outliers and different replacement values could be used if the transform is used on the train and test sets.

We can use this transform on the oil spill dataset. The complete example is listed below.

# custom data transform for replacing outliers
from numpy import mean
from numpy import std
from numpy import where
from numpy import logical_or
from pandas import read_csv
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import LabelEncoder

# load a dataset
def load_dataset(path):
	# load the dataset
	df = read_csv(path, header=None)
	data = df.values
	# split data into inputs and outputs
	X, y = data[:, :-1], data[:, -1]
	# minimally prepare dataset
	X = X.astype('float')
	y = LabelEncoder().fit_transform(y.astype('str'))
	return X, y

# replace outliers
def cust_transform(X, n_stdev=3, verbose=True):
	# copy the array
	result = X.copy()
	# enumerate each column
	for i in range(result.shape[1]):
		# retrieve values for column
		col = X[:, i]
		# calculate statistics
		mu, sigma = mean(col), std(col)
		# define bounds
		lower, upper = mu-(sigma*n_stdev), mu+(sigma*n_stdev)
		# select indexes that are out of bounds
		ix = where(logical_or(col < lower, col > upper))[0]
		if verbose and len(ix) > 0:
			print('>col=%d, outliers=%d' % (i, len(ix)))
		# replace values
		result[ix, i] = mu
	return result

# define the location of the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/oil-spill.csv'
# load the dataset
X, y = load_dataset(url)
print(X.shape, y.shape)
# define the transformer
trans = FunctionTransformer(cust_transform)
# apply the transform
X = trans.fit_transform(X)
# summarize new shape
print(X.shape)

Running the example first reports the shape of the dataset prior to any change.

Next, the number of outliers for each column is calculated and only those columns with one or more outliers are reported in the output. We can see that a total of 32 columns in the dataset have one or more outliers.

The outliers are then replaced, and the shape of the resulting dataset is reported, confirming no change in the number of rows or columns.

(937, 49) (937,)
>col=0, outliers=10
>col=1, outliers=8
>col=3, outliers=8
>col=5, outliers=7
>col=6, outliers=1
>col=7, outliers=12
>col=8, outliers=15
>col=9, outliers=14
>col=10, outliers=19
>col=11, outliers=17
>col=12, outliers=22
>col=13, outliers=2
>col=14, outliers=16
>col=15, outliers=8
>col=16, outliers=8
>col=17, outliers=6
>col=19, outliers=12
>col=20, outliers=20
>col=27, outliers=14
>col=28, outliers=18
>col=29, outliers=2
>col=30, outliers=13
>col=32, outliers=3
>col=34, outliers=14
>col=35, outliers=15
>col=37, outliers=13
>col=40, outliers=18
>col=41, outliers=13
>col=42, outliers=12
>col=43, outliers=12
>col=44, outliers=19
>col=46, outliers=21
(937, 49)

There are many extensions you could explore to this transform, such as:

  • Ensure that it is only applied to numerical input variables.
  • Experiment with a different number of standard deviations from the mean, such as 2 or 4.
  • Use a different definition of outlier, such as the IQR or a model (an IQR-based sketch is shown below).
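
As an illustration of the last extension, an IQR-based variant might look like the sketch below; the k=1.5 multiplier is the common rule of thumb, and replacing outliers with the median rather than the mean is an assumption of this sketch.

# replace outliers using the interquartile range (IQR) rule
from numpy import percentile
from numpy import median
from numpy import where
from numpy import logical_or

def cust_transform(X, k=1.5, verbose=True):
	# copy the array
	result = X.copy()
	# enumerate each column
	for i in range(result.shape[1]):
		col = X[:, i]
		# calculate the IQR and the bounds for the column
		q25, q75 = percentile(col, 25), percentile(col, 75)
		cut_off = k * (q75 - q25)
		lower, upper = q25 - cut_off, q75 + cut_off
		# identify out-of-bounds values and replace them with the median
		ix = where(logical_or(col < lower, col > upper))[0]
		if verbose and len(ix) > 0:
			print('>col=%d, outliers=%d' % (i, len(ix)))
		result[ix, i] = median(col)
	return result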

If you explore any of these extensions, let me know in the comments below.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Related Tutorials

APIs

Summary

In this tutorial, you discovered how to define and use custom data transforms for scikit-learn.

Specifically, you learned:

  • That custom data transforms can be created for scikit-learn using the FunctionTransformer class.
  • How to develop and apply a custom transform to remove columns with few unique values.
  • How to develop and apply a custom transform that replaces outliers for each column.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.


Add Binary Flags for Missing Values for Machine Learning


Missing values can cause problems when modeling classification and regression prediction problems with machine learning algorithms.

A common approach is to replace missing values with a calculated statistic, such as the mean of the column. This allows the dataset to be modeled as per normal but gives no indication to the model that the row originally contained missing values.

One approach to address this issue is to include additional binary flag input features that indicate whether a row or a column contained a missing value that was imputed. This additional information may or may not be helpful to the model in predicting the target value.

In this tutorial, you will discover how to add binary flags for missing values for modeling.

After completing this tutorial, you will know:

  • How to load and evaluate models with statistical imputation on a classification dataset with missing values.
  • How to add a flag that indicates if a row has one or more missing values and evaluate models with this new feature.
  • How to add a flag for each input variable that has missing values and evaluate models with these new features.


Let’s get started.

Add Binary Flags for Missing Values for Machine Learning

Add Binary Flags for Missing Values for Machine Learning
Photo by keith o connell, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Imputing the Horse Colic Dataset
  2. Model With a Binary Flag for Missing Values
  3. Model With Indicators of All Missing Values

Imputing the Horse Colic Dataset

The horse colic dataset describes medical characteristics of horses with colic and whether they lived or died.

There are 300 rows and 26 input variables with one output variable. It is a binary classification prediction task that involves predicting 1 if the horse lived and 2 if the horse died.

There are many fields we could select to predict in this dataset. In this case, we will predict whether the problem was surgical or not (column index 23), making it a binary classification problem.

The dataset has numerous missing values for many of the columns where each missing value is marked with a question mark character (“?”).

Below provides an example of rows from the dataset with marked missing values.

2,1,530101,38.50,66,28,3,3,?,2,5,4,4,?,?,?,3,5,45.00,8.40,?,?,2,2,11300,00000,00000,2
1,1,534817,39.2,88,20,?,?,4,1,3,4,2,?,?,?,4,2,50,85,2,2,3,2,02208,00000,00000,2
2,1,530334,38.30,40,24,1,1,3,1,3,3,1,?,?,?,1,1,33.00,6.70,?,?,1,2,00000,00000,00000,1
1,9,5290409,39.10,164,84,4,1,6,2,2,4,4,1,2,5.00,3,?,48.00,7.20,3,5.30,2,1,02208,00000,00000,1
...

You can learn more about the dataset here:

No need to download the dataset as we will download it automatically in the worked examples.

Marking missing values with a NaN (not a number) value in a loaded dataset using Python is a best practice.

We can load the dataset using the read_csv() Pandas function and specify the “na_values” to load values of ‘?’ as missing, marked with a NaN value.

The example below downloads the dataset, marks “?” values as NaN (missing) and summarizes the shape of the dataset.

# summarize the horse colic dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
data = dataframe.values
# split into input and output elements
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
print(X.shape, y.shape)

Running the example downloads the dataset and reports the number of rows and columns, matching our expectations.

(300, 27) (300,)

Next, we can evaluate a model on this dataset.

We can use the SimpleImputer class to perform statistical imputation and replace the missing values with the mean of each column. We can then fit a random forest model on the dataset.

For more on how to use the SimpleImputer class, see the tutorial:

To achieve this, we will define a pipeline that first performs imputation, then fits the model and evaluates this modeling pipeline using repeated stratified k-fold cross-validation with three repeats and 10 folds.

The complete example is listed below.

# evaluate mean imputation and random forest for the horse colic dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
# split into input and output elements
data = dataframe.values
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
# define modeling pipeline
model = RandomForestClassifier()
imputer = SimpleImputer()
pipeline = Pipeline(steps=[('i', imputer), ('m', model)])
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the random forest with mean statistical imputation on the horse colic dataset.

Your specific results may vary given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines. Try running the example a few times.

In this case, the pipeline achieved an estimated classification accuracy of about 86.2 percent.

Mean Accuracy: 0.862 (0.056)

Next, let’s see if we can improve the performance of the model by providing more information about missing values.


Model With a Binary Flag for Missing Values

In the previous section, we replaced missing values with a calculated statistic.

The model is unaware that missing values were replaced.

It is possible that knowledge of whether a row contains a missing value or not will be useful to the model when making a prediction.

One approach to exposing the model to this knowledge is by providing an additional column that is a binary flag indicating whether the row had a missing value or not.

  • 0: Row does not contain a missing value.
  • 1: Row contains a missing value (which was/will be imputed).

This can be achieved directly on the loaded dataset. First, we can sum the values for each row to create a new column where if the row contains at least one NaN, then the sum will be a NaN.

We can then mark all values in the new column as 1 if they contain a NaN, or 0 otherwise.

Finally, we can add this column to the loaded dataset.

Tying this together, the complete example of adding a binary flag to indicate one or more missing values in each row is listed below.

# add a binary flag that indicates if a row contains a missing value
from numpy import isnan
from numpy import hstack
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
# split into input and output elements
data = dataframe.values
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
print(X.shape)
# sum each row where rows with a nan will sum to nan
a = X.sum(axis=1)
# mark all nan as 1
a[isnan(a)] = 1
# mark all non-nan as 0
a[~isnan(a)] = 0
a = a.reshape((len(a), 1))
# add to the dataset as another column
X = hstack((X, a))
print(X.shape)

Running the example first downloads the dataset and reports the number of rows and columns, as expected.

Then the new binary variable indicating whether a row contains a missing value is created and added to the end of the input variables. The shape of the input data is then reported, confirming the addition of the feature, from 27 to 28 columns.

(300, 27)
(300, 28)

We can then evaluate the model as we did in the previous section with the additional binary flag and see if it impacts model performance.

The complete example is listed below.

# evaluate model performance with a binary flag for missing values and imputed missing
from numpy import isnan
from numpy import hstack
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
# split into input and output elements
data = dataframe.values
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
# sum each row where rows with a nan will sum to nan
a = X.sum(axis=1)
# mark all nan as 1
a[isnan(a)] = 1
# mark all non-nan as 0
a[~isnan(a)] = 0
a = a.reshape((len(a), 1))
# add to the dataset as another column
X = hstack((X, a))
# define modeling pipeline
model = RandomForestClassifier()
imputer = SimpleImputer()
pipeline = Pipeline(steps=[('i', imputer), ('m', model)])
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example reports the mean and standard deviation classification accuracy on the horse colic dataset with the additional feature and imputation.

Your specific results may vary given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines. Try running the example a few times.

In this case, we see a modest lift in performance from 86.2 percent to 86.3 percent. The difference is small and may not be statistically significant.

Mean Accuracy: 0.863 (0.055)

Most rows in this dataset have a missing value, and this approach might be more beneficial on datasets with fewer missing values.
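
We can check this claim directly; the short sketch below (separate from the worked examples) counts the rows that contain at least one NaN.

# count the rows that contain at least one missing value
from numpy import isnan
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
# split into input and output elements
data = dataframe.values
ix = [i for i in range(data.shape[1]) if i != 23]
X = data[:, ix].astype('float')
print('Rows with missing values: %d of %d' % (isnan(X).any(axis=1).sum(), X.shape[0]))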

Next, let’s see if we can provide even more information about the missing values to the model.

Model With Indicators of All Missing Values

In the previous section, we added one additional column to indicate whether a row contains a missing value or not.

One step further is to indicate whether each input value was missing and imputed or not. This effectively adds one additional column for each input variable that contains missing values and may offer benefit to the model.

This can be achieved by setting the “add_indicator” argument to True when defining the SimpleImputer instance.

...
# impute and mark missing values
X = SimpleImputer(add_indicator=True).fit_transform(X)

We can demonstrate this with a worked example.

The example below loads the horse colic dataset as before, then imputes the missing values on the entire dataset and adds indicator variables for each input variable that has missing values.

# impute and add indicators for columns with missing values
from pandas import read_csv
from sklearn.impute import SimpleImputer
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
data = dataframe.values
# split into input and output elements
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
print(X.shape)
# impute and mark missing values
X = SimpleImputer(strategy='mean', add_indicator=True).fit_transform(X)
print(X.shape)

Running the example first downloads and summarizes the shape of the dataset as expected, then applies the imputation and adds the binary (1 and 0 values) columns indicating whether each row contains a missing value for a given input variable.

We can see that the number of input variables has increased from 27 to 48, indicating the addition of 21 binary input variables, and in turn, that 21 of the 27 input variables must contain at least one missing value.

(300, 27)
(300, 48)

Next, we can evaluate the model with this additional information.

The complete example below demonstrates this.

# evaluate imputation with added indicators features on the horse colic dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')
# split into input and output elements
data = dataframe.values
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]
# define modeling pipeline
model = RandomForestClassifier()
imputer = SimpleImputer(add_indicator=True)
pipeline = Pipeline(steps=[('i', imputer), ('m', model)])
# define model evaluation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example reports the mean and standard deviation classification accuracy on the horse colic dataset with the additional indicators features and imputation.

Your specific results may vary given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines. Try running the example a few times.

In this case, we see a nice lift in performance from 86.3 percent in the previous section to 86.7 percent.

This may provide strong evidence that adding one flag per column that was imputed is a better strategy on this dataset and the chosen model.

Mean Accuracy: 0.867 (0.055)

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Related Tutorials

Summary

In this tutorial, you discovered how to add binary flags for missing values for modeling.

Specifically, you learned:

  • How to load and evaluate models with statistical imputation on a classification dataset with missing values.
  • How to add a flag that indicates if a row has one or more missing values and evaluate models with this new feature.
  • How to add a flag for each input variable that has missing values and evaluate models with these new features.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.


How to Selectively Scale Numerical Input Variables for Machine Learning


Many machine learning models perform better when input variables are carefully transformed or scaled prior to modeling.

It is convenient, and therefore common, to apply the same data transforms, such as standardization and normalization, equally to all input variables. This can achieve good results on many problems. Nevertheless, better results may be achieved by carefully selecting which data transform to apply to each input variable prior to modeling.

In this tutorial, you will discover how to apply selective scaling of numerical input variables.

After completing this tutorial, you will know:

  • How to load and calculate a baseline predictive performance for the diabetes classification dataset.
  • How to evaluate modeling pipelines with data transforms applied blindly to all numerical input variables.
  • How to evaluate modeling pipelines with selective normalization and standardization applied to subsets of input variables.


Let’s get started.

How to Selectively Scale Numerical Input Variables for Machine Learning

How to Selectively Scale Numerical Input Variables for Machine Learning
Photo by Marco Verch, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Diabetes Numerical Dataset
  2. Non-Selective Scaling of Numerical Inputs
    1. Normalize All Input Variables
    2. Standardize All Input Variables
  3. Selective Scaling of Numerical Inputs
    1. Normalize Only Non-Gaussian Input Variables
    2. Standardize Only Gaussian-Like Input Variables
    3. Selectively Normalize and Standardize Input Variables

Diabetes Numerical Dataset

As the basis of this tutorial, we will use the so-called “diabetes” dataset that has been widely studied as a machine learning dataset since the 1990s.

The dataset classifies patients’ data as either an onset of diabetes within five years or not. There are 768 examples and eight input variables. It is a binary classification problem.

You can learn more about the dataset here:

No need to download the dataset; we will download it automatically as part of the worked examples that follow.

Looking at the data, we can see that all nine variables, including the target, are numerical.

6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
...

We can load this dataset into memory using the Pandas library.

The example below downloads and summarizes the diabetes dataset.

# load and summarize the diabetes dataset
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
dataset = read_csv(url, header=None)
# summarize the shape of the dataset
print(dataset.shape)
# histograms of the variables
dataset.hist()
pyplot.show()

Running the example first downloads the dataset and loads it as a DataFrame.

The shape of the dataset is printed, confirming the number of rows and the nine variables: eight input and one target.

(768, 9)

Finally, a plot is created showing a histogram for each variable in the dataset.

This is useful as we can see that some variables have a Gaussian or Gaussian-like distribution (1, 2, 5) and others have an exponential-like distribution (0, 3, 4, 6, 7). This may suggest the need for different numerical data transforms for the different types of input variables.

Histogram of Each Variable in the Diabetes Classification Dataset

Now that we are a little familiar with the dataset, let’s try fitting and evaluating a model on the raw dataset.

We will use a logistic regression model, as it is a robust and effective linear model for binary classification tasks. We will evaluate the model using repeated stratified k-fold cross-validation, a best practice, with 10 folds and three repeats.

The complete example is listed below.

# evaluate a logistic regression model on the raw diabetes dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# separate into input and output elements
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# define the model
model = LogisticRegression(solver='liblinear')
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model
m_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize the result
print('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example evaluates the model and reports the mean and standard deviation accuracy for fitting a logistic regression model on the raw dataset.

Your specific results may differ given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines and library versions. Try running the example a few times.

In this case, we can see that the model achieved an accuracy of about 76.8 percent.

Accuracy: 0.768 (0.040)

Now that we have established a baseline in performance on the dataset, let’s see if we can improve the performance using data scaling.


Non-Selective Scaling of Numerical Inputs

Many algorithms prefer or require that input variables are scaled to a consistent range prior to fitting a model.

This includes the logistic regression model, which assumes input variables have a Gaussian probability distribution. Standardizing the input variables may also produce a more numerically stable model. Nevertheless, even when these expectations are violated, logistic regression can perform well, or even best, on a given dataset, as may be the case for the diabetes dataset.

Two common techniques for scaling numerical input variables are normalization and standardization.

Normalization scales each input variable to the range 0-1 and can be implemented using the MinMaxScaler class in scikit-learn. Standardization scales each input variable to have a mean of 0.0 and a standard deviation of 1.0 and can be implemented using the StandardScaler class in scikit-learn.
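
As a quick illustration of the difference, the sketch below (a contrived example, separate from the diabetes dataset) applies both scalers to a single column of values.

# contrast normalization and standardization on a contrived column
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
# define a single column of values
data = asarray([[1.0], [2.0], [3.0], [4.0], [5.0]])
# normalization scales values to the range 0-1
print(MinMaxScaler().fit_transform(data).flatten())
# standardization scales values to mean 0.0 and standard deviation 1.0
print(StandardScaler().fit_transform(data).flatten())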

To learn more about normalization, standardization, and how to use these methods in scikit-learn, see the tutorial:

A naive approach to data scaling applies a single transform to all input variables, regardless of their scale or probability distribution. And this is often effective.

Let’s try normalizing and standardizing all input variables directly and compare the performance to the baseline logistic regression model fit on the raw data.

Normalize All Input Variables

We can update the baseline code example to use a modeling pipeline where the first step is to apply a scaler and the final step is to fit the model.

This ensures that the scaling operation is fit or prepared on the training set only and then applied to the train and test sets during the cross-validation process, avoiding data leakage. Data leakage can result in an optimistically biased estimate of model performance.

This can be achieved using the Pipeline class where each step in the pipeline is defined as a tuple with a name and the instance of the transform or model to use.

...
# define the modeling pipeline
scaler = MinMaxScaler()
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('s',scaler),('m',model)])

Tying this together, the complete example of evaluating a logistic regression on diabetes dataset with all input variables normalized is listed below.

# evaluate a logistic regression model on the normalized diabetes dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# separate into input and output elements
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# define the modeling pipeline
model = LogisticRegression(solver='liblinear')
scaler = MinMaxScaler()
pipeline = Pipeline([('s',scaler),('m',model)])
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize the result
print('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example evaluates the modeling pipeline and reports the mean and standard deviation accuracy for fitting a logistic regression model on the normalized dataset.

Your specific results may differ given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines and library versions. Try running the example a few times.

In this case, we can see that the normalization of the input variables has resulted in a drop in the mean classification accuracy from 76.8 percent with a model fit on the raw data to about 76.4 percent for the pipeline with normalization.

Accuracy: 0.764 (0.045)

Next, let’s try standardizing all input variables.

Standardize All Input Variables

We can update the modeling pipeline to use standardization instead of normalization for all input variables prior to fitting and evaluating the logistic regression model.

This might be an appropriate transform for those input variables with a Gaussian-like distribution, but perhaps not the other variables.

...
# define the modeling pipeline
scaler = StandardScaler()
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('s',scaler),('m',model)])

Tying this together, the complete example of evaluating a logistic regression model on diabetes dataset with all input variables standardized is listed below.

# evaluate a logistic regression model on the standardized diabetes dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# separate into input and output elements
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# define the modeling pipeline
scaler = StandardScaler()
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('s',scaler),('m',model)])
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize the result
print('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example evaluates the modeling pipeline and reports the mean and standard deviation accuracy for fitting a logistic regression model on the standardized dataset.

Your specific results may differ given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines and library versions. Try running the example a few times.

In this case, we can see that standardizing all numerical input variables has resulted in a lift in mean classification accuracy from 76.8 percent with a model evaluated on the raw dataset to about 77.2 percent for a model evaluated on the dataset with standardized input variables.

Accuracy: 0.772 (0.043)

So far, we have learned that normalizing all variables does not help performance, but standardizing all input variables does help performance.

Next, let’s explore if selectively applying scaling to the input variables can offer further improvement.

Selective Scaling of Numerical Inputs

Data transforms can be applied selectively to input variables using the ColumnTransformer class in scikit-learn.

It allows you to specify the transform (or pipeline of transforms) to apply and the column indexes to apply them to. This can then be used as part of a modeling pipeline and evaluated using cross-validation.

You can learn more about how to use the ColumnTransformer in the tutorial:

We can explore using the ColumnTransformer to selectively apply normalization and standardization to the numerical input variables of the diabetes dataset in order to see if we can achieve further performance improvements.

Normalize Only Non-Gaussian Input Variables

First, let’s try normalizing just those input variables that do not have a Gaussian-like probability distribution and leave the rest of the input variables alone in the raw state.

We can define two groups of input variables using the column indexes, one for the variables with a Gaussian-like distribution, and one for the input variables with the exponential-like distribution.

...
# define column indexes for the variables with "normal" and "exponential" distributions
norm_ix = [1, 2, 5]
exp_ix = [0, 3, 4, 6, 7]

We can then selectively normalize the “exp_ix” group and let the other input variables pass through without any data preparation.

...
# define the selective transforms
t = [('e', MinMaxScaler(), exp_ix)]
selective = ColumnTransformer(transformers=t, remainder='passthrough')

The selective transform can then be used as part of our modeling pipeline.

...
# define the modeling pipeline
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('s',selective),('m',model)])
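Once defined, the pipeline behaves like any other scikit-learn estimator; as a hypothetical usage sketch, it can be fit and used to make predictions directly:

...
# hypothetical usage sketch: fit the pipeline and predict for a few rows
pipeline.fit(X, y)
yhat = pipeline.predict(X[:5, :])
print(yhat)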

Tying this together, the complete example of evaluating a logistic regression model on data with selective normalization of some input variables is listed below.

# evaluate a logistic regression model on the diabetes dataset with selective normalization
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# separate into input and output elements
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# define column indexes for the variables with "normal" and "exponential" distributions
norm_ix = [1, 2, 5]
exp_ix = [0, 3, 4, 6, 7]
# define the selective transforms
t = [('e', MinMaxScaler(), exp_ix)]
selective = ColumnTransformer(transformers=t, remainder='passthrough')
# define the modeling pipeline
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('s',selective),('m',model)])
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize the result
print('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example evaluates the modeling pipeline and reports the mean and standard deviation accuracy.

Your specific results may differ given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines and library versions. Try running the example a few times.

In this case, we can see slightly better performance, increasing mean accuracy from 76.8 percent for the baseline model fit on the raw dataset to about 76.9 percent with selective normalization of some input variables.

The results are not as good as standardizing all input variables though.

Accuracy: 0.769 (0.043)

Standardize Only Gaussian-Like Input Variables

We can repeat the experiment from the previous section, although in this case we selectively standardize those input variables that have a Gaussian-like distribution and leave the remaining input variables untouched.

...
# define the selective transforms
t = [('n', StandardScaler(), norm_ix)]
selective = ColumnTransformer(transformers=t, remainder='passthrough')

Tying this together, the complete example of evaluating a logistic regression model on data with selective standardizing of some input variables is listed below.

# evaluate a logistic regression model on the diabetes dataset with selective standardization
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# separate into input and output elements
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# define column indexes for the variables with "normal" and "exponential" distributions
norm_ix = [1, 2, 5]
exp_ix = [0, 3, 4, 6, 7]
# define the selective transforms
t = [('n', StandardScaler(), norm_ix)]
selective = ColumnTransformer(transformers=t, remainder='passthrough')
# define the modeling pipeline
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('s',selective),('m',model)])
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize the result
print('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example evaluates the modeling pipeline and reports the mean and standard deviation accuracy.

Your specific results may differ given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines and library versions. Try running the example a few times.

In this case, we can see that we achieved a lift in performance over both the baseline model fit on the raw dataset (76.8 percent) and the standardization of all input variables (77.2 percent). With selective standardization, we achieved a mean accuracy of about 77.3 percent, a modest but measurable improvement.

Accuracy: 0.773 (0.041)

Selectively Normalize and Standardize Input Variables

The results so far raise the question as to whether we can get a further lift by combining the use of selective normalization and standardization on the dataset at the same time.

This can be achieved by defining both transforms with their respective column indexes for the ColumnTransformer class; as the two groups together cover all of the input variables, there are no remaining columns to pass through.

...
# define the selective transforms
t = [('e', MinMaxScaler(), exp_ix), ('n', StandardScaler(), norm_ix)]
selective = ColumnTransformer(transformers=t)
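One detail worth noting: the ColumnTransformer concatenates the outputs of its transformers in the order they are listed, so the transformed array holds the "exp_ix" columns first, followed by the "norm_ix" columns. This is harmless for logistic regression but matters if a later step depends on column positions. A quick sketch to confirm the output shape:

...
# the transformed array orders columns as: exp_ix columns first, then norm_ix columns
X_trans = selective.fit_transform(X)
print(X_trans.shape)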

Tying this together, the complete example of evaluating a logistic regression model on data with selective normalization and standardization of the input variables is listed below.

# evaluate a logistic regression model on the diabetes dataset with selective scaling
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# separate into input and output elements
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float')
y = LabelEncoder().fit_transform(y.astype('str'))
# define column indexes for the variables with "normal" and "exponential" distributions
norm_ix = [1, 2, 5]
exp_ix = [0, 3, 4, 6, 7]
# define the selective transforms
t = [('e', MinMaxScaler(), exp_ix), ('n', StandardScaler(), norm_ix)]
selective = ColumnTransformer(transformers=t)
# define the modeling pipeline
model = LogisticRegression(solver='liblinear')
pipeline = Pipeline([('s',selective),('m',model)])
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model
m_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# summarize the result
print('Accuracy: %.3f (%.3f)' % (mean(m_scores), std(m_scores)))

Running the example evaluates the modeling pipeline and reports the mean and standard deviation accuracy.

Your specific results may differ given the stochastic nature of the learning algorithm, the stochastic nature of the evaluation procedure, and differences in precision across machines and library versions. Try running the example a few times.

In this case, interestingly, we can see that we achieved the same performance as standardizing all input variables, about 77.2 percent.

Further, the results suggest that the chosen model performs better when the non-Gaussian-like variables are left in their raw state rather than standardized or normalized: selective standardization alone achieved about 77.3 percent, slightly ahead of both alternatives at 77.2 percent.

I would not have guessed at this finding, which highlights the importance of careful experimentation.

Accuracy: 0.772 (0.040)

Can you do better?

Try other transforms or combinations of transforms and see if you can achieve better results.
Share your findings in the comments below.
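As a starting point for your own experiments, here is a hedged sketch that compares a few alternative scalers applied to all input variables; RobustScaler and PowerTransformer are suggested candidates and are not used elsewhere in this tutorial:

# starting-point sketch: compare a few alternative scalers on the diabetes dataset
# (RobustScaler and PowerTransformer are suggestions, not part of the tutorial above)
from numpy import mean
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import PowerTransformer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
# load and minimally prepare the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv'
data = read_csv(url, header=None).values
X, y = data[:, :-1].astype('float'), LabelEncoder().fit_transform(data[:, -1].astype('str'))
# evaluate each scaler with the same model and evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
for name, scaler in [('minmax', MinMaxScaler()), ('standard', StandardScaler()), ('robust', RobustScaler()), ('power', PowerTransformer())]:
	pipeline = Pipeline([('s', scaler), ('m', LogisticRegression(solver='liblinear'))])
	scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	print('%s: %.3f' % (name, mean(scores)))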

Summary

In this tutorial, you discovered how to apply selective scaling of numerical input variables.

Specifically, you learned:

  • How to load and calculate a baseline predictive performance for the diabetes classification dataset.
  • How to evaluate modeling pipelines with data transforms applied blindly to all numerical input variables.
  • How to evaluate modeling pipelines with selective normalization and standardization applied to subsets of input variables.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

The post How to Selectively Scale Numerical Input Variables for Machine Learning appeared first on Machine Learning Mastery.
