Machine Learning - Random Forest

We’ll be training and tuning a random forest for wine quality based on traits like acidity, residual sugar, and alcohol concentration.

Import libraries and modules.

Import all the necessary packages: NumPy and pandas for data exploration, seaborn and Matplotlib for plotting, and scikit-learn (sklearn) for the machine learning algorithms.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
import joblib  # sklearn.externals.joblib is no longer available in newer scikit-learn releases

Load Dataset

For this project, we are using the wine quality data available from the http://mlr.cs.umass.edu/ database. It's a CSV file that uses semicolons as separators.

In [2]:
dataset_url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url, sep=';')

Data Exploration

In [3]:
data.head()
Out[3]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5

Snapshot of the data: 1599 rows and 12 columns. The column names are: quality (target), fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol. All of the features and the target are numerical.
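
The figures quoted above can be checked directly on the loaded frame. The sketch below is only an illustration: it uses a small stand-in frame (the name `sample` is an assumption, built from three columns of the first rows shown in Out[3]) so that it runs without a network connection. On the real data the same calls would be made on `data`.

```python
import pandas as pd

# Stand-in for `data`: three columns from the first rows shown in Out[3].
sample = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 7.8, 11.2, 7.4],
    "alcohol": [9.4, 9.8, 9.8, 9.8, 9.4],
    "quality": [5, 5, 5, 6, 5],
})

n_rows, n_cols = sample.shape                  # (rows, columns)
dtypes = sample.dtypes                         # one dtype per column
n_missing = int(sample.isnull().sum().sum())   # total count of missing cells
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in dtypes)

print(n_rows, n_cols, n_missing, all_numeric)
```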

Correlation matrix

In [4]:
%matplotlib inline
sns.pairplot(data, kind="reg");

Some features, such as fixed acidity and density, are strongly correlated with each other. Alcohol and quality are also positively correlated. Features such as alcohol, total sulfur dioxide, and chlorides are not normally distributed, and the features are on different scales, so we will standardize our dataset before applying machine learning.

In [5]:
corr = data.corr()
sns.heatmap(corr)
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b2a9160>

The correlation heat map further emphasizes the relationships between the features. Fixed acidity is positively correlated with citric acid and density and negatively correlated with pH. Free sulfur dioxide and total sulfur dioxide show a strong positive correlation. Alcohol is negatively correlated with density. Quality is positively correlated with sulphates and alcohol, and negatively correlated with chlorides and volatile acidity.
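
Reading the target's relationships off a heat map is easier with a sorted list. The sketch below shows one possible way to rank features by their Pearson correlation with the target. The helper `rank_by_target_correlation` and the synthetic frame `demo` are assumptions made for illustration, not part of the original analysis; on the wine data the call would be `rank_by_target_correlation(data, 'quality')`.

```python
import numpy as np
import pandas as pd

def rank_by_target_correlation(frame, target):
    """Return each feature's Pearson correlation with `target`, strongest first."""
    corr_with_target = frame.corr()[target].drop(target)
    order = corr_with_target.abs().sort_values(ascending=False).index
    return corr_with_target.loc[order]

# Synthetic stand-in data (not the wine data): y rises with a and falls with b.
rng = np.random.RandomState(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
noise = rng.normal(size=200)
demo = pd.DataFrame({"a": a, "b": b, "noise": noise,
                     "y": 2.0 * a - 1.0 * b + 0.1 * rng.normal(size=200)})

ranking = rank_by_target_correlation(demo, "y")
print(ranking)
```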

Target - Features Split

Now let us build our model. The first step is to split the data into features and target. Quality is the target, which we assign to 'y', and the rest of the columns form the features 'X'.

In [6]:
y = data.quality
X = data.drop('quality', axis=1)

Train - Test Split

Next, we will split our data into train and test sets. The train set will be used for fitting the model and the test set for evaluating it. We'll set aside 20% of the data as the test set. We also set an arbitrary "random state" (seed) so that we can reproduce our results, and we stratify on the target variable so that the training and test sets have similar quality distributions.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=123, 
                                                    stratify=y)
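
To see what stratification does, the following sketch runs the same kind of split on a small, deliberately imbalanced set of labels. The arrays `X_demo` and `y_demo` are synthetic placeholders, not the wine data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (not the wine data): 80% class 5, 20% class 6.
y_demo = np.array([5] * 80 + [6] * 20)
X_demo = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo)

# Share of class 6 in each subset; stratification keeps both close to 0.2.
share_train = np.mean(y_tr == 6)
share_test = np.mean(y_te == 6)
print(share_train, share_test)
```

Without `stratify`, a rare quality grade could by chance be over- or under-represented in the test set.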

Standardization

Standardization is the process of subtracting the mean from each feature and then dividing by that feature's standard deviation. As seen before, our features are not on a similar scale, which might skew our model. So we use StandardScaler() from the preprocessing module, fitted on the training set, to transform our training data. We will apply the same fitted scaler to our test data later on.
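
As a quick check of the definition, the sketch below standardizes one small column by hand and compares the result with StandardScaler. The values are the alcohol readings from the first rows shown in Out[3]; the variable names are chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Alcohol values from the first rows shown in Out[3], as a single-feature column.
x = np.array([9.4, 9.8, 9.8, 9.8, 9.4]).reshape(-1, 1)

# Manual standardization: subtract the mean, then divide by the population
# standard deviation (ddof=0), which is the convention StandardScaler uses.
manual = (x - x.mean(axis=0)) / x.std(axis=0)

scaled = StandardScaler().fit_transform(x)
print(np.allclose(manual, scaled))
```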

In [8]:
# Fit the scalers on the training data only. StandardScaler expects 2D input,
# so the 1D target is reshaped to a single column first.
scaler_X = preprocessing.StandardScaler().fit(X_train)
scaler_y = preprocessing.StandardScaler().fit(y_train.values.reshape(-1, 1))
X_train_scaled = scaler_X.transform(X_train)
# ravel() flattens the scaled target back to the 1D shape that estimators expect.
y_train_scaled = scaler_y.transform(y_train.values.reshape(-1, 1)).ravel()

Random Forest regressor

We will next create a regressor using the random forest algorithm. Random forests are an ensemble learning method for classification and regression that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Quality is treated here as a numeric target, so we use RandomForestRegressor.
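
The "mean prediction of the individual trees" can be verified directly. The sketch below fits a small forest on synthetic data (generated with make_regression, not the wine data) and compares the forest's prediction with the average over its fitted trees, which scikit-learn exposes as `estimators_`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (not the wine data).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0,
                                 random_state=0)

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_demo, y_demo)

# Predict with every fitted tree, then average across trees.
per_tree = np.array([tree.predict(X_demo) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

print(np.allclose(averaged, forest.predict(X_demo)))
```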

In [9]:
clf = RandomForestRegressor()

Fit the model

Next, let us fit the regressor we created in the last step on the scaled training data. Estimators such as RandomForestRegressor are trained with fit(); fit_transform() belongs to transformers such as StandardScaler, and its use on estimators was deprecated and later removed from scikit-learn.

In [10]:
clf.fit(X_train_scaled, y_train_scaled)

Make Predictions

Now we scale the test data using the same scalers we fitted on the train data and make predictions.

In [11]:
X_test_scaled = scaler_X.transform(X_test)
# Reshape the 1D target to a single column for the scaler, then flatten it again.
y_test_scaled = scaler_y.transform(y_test.values.reshape(-1, 1)).ravel()
pred = clf.predict(X_test_scaled)

Scoring the model

Now we will score our model using the R² score.

In [12]:
r2_score(y_test_scaled, pred)
Out[12]:
0.4465524547490769

An R² of about 0.45 is not that impressive. We need to tune the model to boost performance.
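
For reference, R² is one minus the ratio of the residual sum of squares to the total sum of squares, so 1.0 is a perfect fit and 0.0 is no better than always predicting the mean. The sketch below computes it by hand on made-up numbers (the arrays are illustrative, not model output) and checks the result against r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true and predicted values, for illustration only.
y_true = np.array([5.0, 6.0, 5.0, 7.0, 4.0])
y_hat = np.array([5.2, 5.6, 5.1, 6.3, 4.8])

ss_res = np.sum((y_true - y_hat) ** 2)           # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```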

Tuning the model

Now it's time to consider the hyperparameters that we'll want to tune for our model. Hyperparameters express "higher-level" structural information about the model, and they are typically set before training the model.

In [13]:
hyperparameters = { 'randomforestregressor__max_features' : [1.0, 'sqrt', 'log2'],  # 1.0 = all features (formerly 'auto')
                  'randomforestregressor__max_depth': [None, 5, 3, 1]}

Pipeline

We can combine the standardization and model fitting into a single step using a pipeline.

In [14]:
pipeline = make_pipeline(preprocessing.StandardScaler(), 
                         RandomForestRegressor(n_estimators=100))
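
The long keys in the hyperparameter dictionary follow from how make_pipeline names its steps. The sketch below rebuilds an equivalent pipeline under the assumed name `demo_pipeline` and lists the step names and the parameter keys they give rise to.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = make_pipeline(StandardScaler(),
                              RandomForestRegressor(n_estimators=100))

# make_pipeline names each step after its lowercased class name.
step_names = [name for name, _ in demo_pipeline.steps]

# Tunable parameters are addressed as "<step name>__<parameter>".
forest_params = sorted(key for key in demo_pipeline.get_params()
                       if key.startswith("randomforestregressor__"))

print(step_names)
print(forest_params)
```

A grid key that does not match one of these names makes the grid search fail when it is fitted, so listing them first is a cheap sanity check.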

Cross Validation and model tuning using gridsearch

Cross-validation is a process for reliably estimating the performance of a model-building method by training and evaluating the model multiple times with that same method. The most common variant is k-fold cross-validation: split the data into k equal parts, or "folds", train the model on k-1 folds, evaluate it on the remaining "hold-out" fold, and repeat so that each fold serves as the hold-out once (with k=10, the 10th fold is held out in the final round). The best practice when performing CV is to include your data preprocessing steps inside the cross-validation loop. This prevents accidentally tainting your training folds with information from the hold-out fold.
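
The fold mechanics can be sketched with KFold on a handful of placeholder samples. The array `X_demo` and the choice of 5 folds are assumptions made to keep the output small.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)   # 10 placeholder samples

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
held_out = []
for train_idx, test_idx in kfold.split(X_demo):
    # Each round trains on 4 folds (8 samples) and holds out 1 fold (2 samples).
    fold_sizes.append((len(train_idx), len(test_idx)))
    held_out.extend(test_idx.tolist())

# Across the 5 rounds, every sample is held out exactly once.
print(fold_sizes)
print(sorted(held_out))
```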

In [15]:
clf = GridSearchCV(pipeline, hyperparameters, cv=10)

Next, we fit the grid search we just created on the training data.

In [16]:
clf.fit(X_train, y_train)
clf.best_params_
Out[16]:
{'randomforestregressor__max_depth': None,
 'randomforestregressor__max_features': 'sqrt'}

GridSearchCV essentially performs cross-validation across the entire "grid" (all possible combinations) of hyperparameters. It takes in your model (in this case, a model pipeline), the hyperparameters you want to tune, and the number of folds to create.
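
To get a feel for the size of the search, the sketch below enumerates a grid of the same shape with ParameterGrid and counts the model fits that 10-fold cross-validation implies. The names `grid`, `combinations`, and `n_fits` are chosen for illustration.

```python
from sklearn.model_selection import ParameterGrid

grid = {
    # 1.0 means "use all features" (spelled 'auto' in older releases).
    "randomforestregressor__max_features": [1.0, "sqrt", "log2"],
    "randomforestregressor__max_depth": [None, 5, 3, 1],
}

combinations = list(ParameterGrid(grid))
n_folds = 10
n_fits = len(combinations) * n_folds   # one fit per combination per fold

print(len(combinations), n_fits)
```

With the default `refit=True`, GridSearchCV then fits the best combination once more on the whole training set, which is why the fitted search object can be used directly for prediction.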

The optimum hyperparameters are max_depth: None and max_features: 'sqrt'.

Make Predictions using new model

In [17]:
y_pred = clf.predict(X_test)

New Score

In [18]:
r2_score(y_test, y_pred)
Out[18]:
0.46846612991101166

The R² rose from about 0.45 (Out[12]) to about 0.47 (Out[18]), a modest improvement.

Trying other Algorithms

Adaboost

In [19]:
from sklearn.ensemble import AdaBoostRegressor
In [20]:
pipeline_2 = make_pipeline(preprocessing.StandardScaler(), AdaBoostRegressor())

We created a new pipeline, pipeline_2, that applies the AdaBoost regressor.

In [21]:
hyperparameters_2 = { 'adaboostregressor__n_estimators' : [50, 100, 25],
                  'adaboostregressor__learning_rate': [1, 0.5, 0.8]}

These are the hyperparameters we will tune for AdaBoost: the number of estimators and the learning rate.

In [22]:
clf_2 = GridSearchCV(pipeline_2, hyperparameters_2, cv=10)

Fit the training data using the new grid search.

In [23]:
clf_2.fit(X_train, y_train)
clf_2.best_params_
Out[23]:
{'adaboostregressor__learning_rate': 0.8,
 'adaboostregressor__n_estimators': 50}
In [24]:
y_pred_2 = clf_2.predict(X_test)

Made predictions for the test set

In [25]:
r2_score(y_test, y_pred_2)
Out[25]:
0.28807455287310391

But this did not improve our score: an R² of about 0.29 is lower than the random forest's.

Linear Regression

In [26]:
from sklearn.linear_model import LinearRegression
In [27]:
pipeline_3 = make_pipeline(preprocessing.StandardScaler(), LinearRegression())
In [28]:
pipeline_3.get_params()
Out[28]:
{'linearregression': LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False),
 'linearregression__copy_X': True,
 'linearregression__fit_intercept': True,
 'linearregression__n_jobs': 1,
 'linearregression__normalize': False,
 'standardscaler': StandardScaler(copy=True, with_mean=True, with_std=True),
 'standardscaler__copy': True,
 'standardscaler__with_mean': True,
 'standardscaler__with_std': True,
 'steps': [('standardscaler',
   StandardScaler(copy=True, with_mean=True, with_std=True)),
  ('linearregression',
   LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False))]}
In [29]:
hyperparameters_3 = {'linearregression__normalize': [True, False] }
In [30]:
clf_3 = GridSearchCV(pipeline_3, hyperparameters_3, cv=10)
In [31]:
clf_3.fit(X_train, y_train)
clf_3.best_params_
Out[31]:
{'linearregression__normalize': True}
In [32]:
y_pred_3 = clf_3.predict(X_test)
In [33]:
r2_score(y_test, y_pred_3)
Out[33]:
0.30260002699604061

Linear regression does better than AdaBoost but not better than random forest. Hence random forest gives us the best R² score on our problem.
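
A comparison like this can also be written as one loop, which keeps the preprocessing and scoring identical across models. The sketch below is one possible arrangement, run on synthetic data from make_regression so that it is self-contained. The dictionary `candidates` and the 5-fold setting are assumptions, and the ranking it prints for synthetic data says nothing about the ranking on the wine data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the wine data).
X_demo, y_demo = make_regression(n_samples=150, n_features=6, noise=15.0,
                                 random_state=0)

candidates = {
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "adaboost": AdaBoostRegressor(random_state=0),
    "linear_regression": LinearRegression(),
}

# Mean cross-validated R^2 per model; the scaler is refit inside each fold.
cv_r2 = {}
for name, model in candidates.items():
    pipe = make_pipeline(StandardScaler(), model)
    cv_r2[name] = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2").mean()

for name, score in sorted(cv_r2.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```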
