We’ll be training and tuning a random forest to predict wine quality from physicochemical traits like acidity, residual sugar, and alcohol concentration.
First, import all the necessary packages: NumPy and pandas for data exploration, seaborn and matplotlib for visualization, and scikit-learn for the machine learning algorithms.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
import joblib  # sklearn.externals.joblib is deprecated; use the standalone joblib package
For this project, we are using the red wine quality data available from the machine-learning dataset mirror at http://mlr.cs.umass.edu/. It's a CSV file that uses semicolons as separators.
dataset_url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url, sep=';')
data.head()
A snapshot of the data: 1599 rows and 12 columns. The columns are: quality (the target), fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. All of the features and the target are numerical.
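It is worth confirming that snapshot programmatically before modeling; a quick sanity check:
# Confirm dimensions and per-column summary statistics
print(data.shape)  # expected: (1599, 12)
data.describe()
Next, a pair plot gives a first look at the pairwise relationships and the shape of each distribution: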
%matplotlib inline
sns.pairplot(data, kind="reg");
Some of the features, such as fixed acidity and density, are strongly correlated with each other, and alcohol correlates well with quality. Features like alcohol, total sulfur dioxide, and chlorides are not normally distributed and sit on very different scales, so we will standardize the dataset before applying machine learning.
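That skewness claim is easy to check numerically; pandas' skew() gives a rough measure (values far from 0 indicate a skewed distribution), though it is not a formal normality test:
# Rough skewness of each column; values near 0 mean roughly symmetric
data.skew().sort_values(ascending=False)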
corr = data.corr()
sns.heatmap(corr)
The correlation heat map further emphasizes the relationships between the features. Fixed acidity is positively correlated with citric acid and density, and negatively correlated with pH. Free sulfur dioxide and total sulfur dioxide show a strong positive correlation. Alcohol is negatively correlated with density. Quality is positively correlated with sulphates and alcohol, and negatively correlated with chlorides and volatile acidity.
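To read the same relationships off numerically rather than from the colors, we can rank each feature's correlation with the target:
# Features ranked by their linear correlation with quality
corr['quality'].sort_values(ascending=False)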
Now let us build our model. The first step is to split the data into features and target: quality is the target, which we assign to y, and the rest of the columns form the feature matrix X.
y = data.quality
X = data.drop('quality', axis=1)
Next, we will split our data into training and test sets. The training set will be used for fitting the model and the test set for validation. We set aside 20% of the data as a test set for evaluating our model, and set an arbitrary random state (seed) so that we can reproduce our results. We also stratify on the target variable to ensure the training and test sets have similar quality distributions.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=123,
stratify=y)
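We can verify that the stratification worked by comparing the distribution of quality across the two splits; the proportions should be nearly identical:
# Proportion of each quality level in the training and test sets
print(y_train.value_counts(normalize=True).sort_index())
print(y_test.value_counts(normalize=True).sort_index())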
Standardization is the process of subtracting the mean from each feature and then dividing by the feature's standard deviation. As seen above, our features are not on similar scales, which might skew our model, so we use StandardScaler() from the preprocessing module to transform the training set. We will apply the same scalers to the test data later on.
scaler_X = preprocessing.StandardScaler().fit(X_train)
# StandardScaler expects 2-D input, so reshape the 1-D target before fitting
scaler_y = preprocessing.StandardScaler().fit(y_train.values.reshape(-1, 1))
X_train_scaled = scaler_X.transform(X_train)
y_train_scaled = scaler_y.transform(y_train.values.reshape(-1, 1)).ravel()
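As a quick sanity check, the scaled training features should now have column means near 0 and standard deviations near 1:
# Means should be ~0 and standard deviations ~1 after standardization
print(X_train_scaled.mean(axis=0).round(2))
print(X_train_scaled.std(axis=0).round(2))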
We will next create a regressor using the random forest algorithm. Random forests are an ensemble learning method for classification and regression that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
clf = RandomForestRegressor()
Next, let us fit the training set using the regressor we created in the last step. Note that fit_transform belongs to transformers such as the scaler; a supervised model is fitted with fit:
clf.fit(X_train_scaled, y_train_scaled)
Now we scale the test data using the same scalers we fitted on the training data, and make predictions:
X_test_scaled = scaler_X.transform(X_test)
y_test_scaled = scaler_y.transform(y_test.values.reshape(-1, 1)).ravel()
pred = clf.predict(X_test_scaled)
Now we will score our model using the R² score:
r2_score(y_test_scaled, pred)
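Since we also imported mean_squared_error, we can report it alongside R² (note it is expressed in standardized target units here):
# MSE on the standardized target scale; lower is better
mean_squared_error(y_test_scaled, pred)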
An R² of roughly 0.392 is not that impressive. We need to tune the model to boost performance.
Now it's time to consider the hyperparameters that we'll want to tune for our model. Hyperparameters express "higher-level" structural information about the model, and they are typically set before training the model.
hyperparameters = {'randomforestregressor__max_features': [1.0, 'sqrt', 'log2'],  # 1.0 replaces the deprecated 'auto'
                   'randomforestregressor__max_depth': [None, 5, 3, 1]}
We can put the standardization and the model fitting into a single step using a pipeline:
pipeline = make_pipeline(preprocessing.StandardScaler(),
RandomForestRegressor(n_estimators=100))
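The randomforestregressor__ prefix in the hyperparameter names above comes from the pipeline's naming convention: each parameter is addressed as <step name>__<parameter name>. You can list every tunable name to confirm:
# Every parameter a grid search can reach, addressed as <stepname>__<paramname>
sorted(pipeline.get_params().keys())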
Cross-validation is a process for reliably estimating the performance of a model-building method by training and evaluating the model multiple times on different subsets of the data. The typical approach is k-fold cross-validation: split the data into k equal parts, or "folds"; train the model on k-1 folds; and evaluate it on the remaining hold-out fold (e.g. the 10th fold), rotating through all k folds. The best practice when performing cross-validation is to include your data preprocessing steps inside the cross-validation loop, which prevents accidentally contaminating the training folds with information from the hold-out fold.
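As an aside, plain k-fold cross-validation (without a hyperparameter grid) can be run directly with scikit-learn's cross_val_score; a minimal sketch using the pipeline above, so that scaling happens inside each fold:
from sklearn.model_selection import cross_val_score
# 10-fold CV: fit on 9 folds, score (R² by default for regressors) on the held-out fold
scores = cross_val_score(pipeline, X_train, y_train, cv=10)
print(scores.mean(), scores.std())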
clf = GridSearchCV(pipeline, hyperparameters, cv=10)
Next, we fit the model using the grid-search estimator we just created. Note that we pass the unscaled training data: the pipeline's StandardScaler is fitted inside each cross-validation fold.
clf.fit(X_train, y_train)
clf.best_params_
GridSearchCV essentially performs cross-validation across the entire "grid" (all possible combinations) of hyperparameters. It takes in your model (in this case, our model pipeline), the hyperparameters you want to tune, and the number of folds to create.
The optimum hyperparameters are max_depth: None and max_features: 'sqrt'.
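Because refit=True by default, GridSearchCV has already refitted the pipeline on the whole training set with these best parameters, and the best cross-validated score is available directly:
# Mean cross-validated R² of the best hyperparameter combination
clf.best_score_
We can now predict on the untouched test set with the refitted model and score it: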
y_pred = clf.predict(X_test)
r2_score(y_test, y_pred)
Next, let us try AdaBoost to see whether it can beat the random forest. We create a new pipeline, pipeline_2, that applies an AdaBoost regressor after the scaler:
from sklearn.ensemble import AdaBoostRegressor
pipeline_2 = make_pipeline(preprocessing.StandardScaler(), AdaBoostRegressor())
hyperparameters_2 = { 'adaboostregressor__n_estimators' : [50, 100, 25],
'adaboostregressor__learning_rate': [1, 0.5, 0.8]}
Then we tune the AdaBoost hyperparameters with the same grid-search procedure:
clf_2 = GridSearchCV(pipeline_2, hyperparameters_2, cv=10)
Next we fit the training data using the new estimator:
clf_2.fit(X_train, y_train)
clf_2.best_params_
We then make predictions for the test set and compute the R² score:
y_pred_2 = clf_2.predict(X_test)
r2_score(y_test, y_pred_2)
But AdaBoost could not improve on the random forest's score.
As a final baseline, let us try a plain linear regression in the same pipeline setup:
from sklearn.linear_model import LinearRegression
pipeline_3 = make_pipeline(preprocessing.StandardScaler(), LinearRegression())
pipeline_3.get_params()
Linear regression has few hyperparameters worth tuning; we grid over fit_intercept (the old normalize option has been removed from scikit-learn's LinearRegression, and our pipeline already standardizes the features):
hyperparameters_3 = {'linearregression__fit_intercept': [True, False]}
clf_3 = GridSearchCV(pipeline_3, hyperparameters_3, cv=10)
clf_3.fit(X_train, y_train)
clf_3.best_params_
y_pred_3 = clf_3.predict(X_test)
r2_score(y_test, y_pred_3)
It is better than AdaBoost but not better than the random forest. Hence, the random forest gives us the best R² score for our problem.
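Finally, since we imported joblib at the top, we can persist the winning random forest model so it can be reloaded later without retraining (the filename here is arbitrary):
# Save the fitted grid-search pipeline to disk...
joblib.dump(clf, 'rf_regressor.pkl')
# ...and load it back whenever predictions are needed
loaded_clf = joblib.load('rf_regressor.pkl')
loaded_clf.predict(X_test)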