We’ll be training and tuning a random forest to predict wine quality from physicochemical traits like acidity, residual sugar, and alcohol concentration.
First, import all the necessary packages: NumPy and pandas for data exploration, seaborn and matplotlib for visualization, and scikit-learn for the machine learning algorithms.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
import joblib  # sklearn.externals.joblib is deprecated; use the standalone joblib package
For this project, we are using the red wine quality data available from the machine-learning dataset mirror at http://mlr.cs.umass.edu/. It's a CSV file that uses semicolons as separators.
dataset_url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url, sep=';')
data.head()
A snapshot of the data: 1599 rows and 12 columns. The columns are: quality (the target), fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. All of the features and the target are numerical.
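It is worth confirming that snapshot programmatically before modeling; a quick sanity check:
# Confirm dimensions and per-column summary statistics
print(data.shape)  # expected: (1599, 12)
data.describe()
Next, a pair plot gives a first look at the pairwise relationships and the shape of each distribution: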
%matplotlib inline
sns.pairplot(data, kind="reg");
Some of the features, such as fixed acidity and density, are strongly correlated with each other, and alcohol correlates well with quality. Features like alcohol, total sulfur dioxide, and chlorides are not normally distributed and sit on very different scales, so we will standardize the dataset before applying machine learning.
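That skewness claim is easy to check numerically; pandas' skew() gives a rough measure (values far from 0 indicate a skewed distribution), though it is not a formal normality test:
# Rough skewness of each column; values near 0 mean roughly symmetric
data.skew().sort_values(ascending=False)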
corr = data.corr()
sns.heatmap(corr)
The correlation heat map further emphasizes the relationships between the features. Fixed acidity is positively correlated with citric acid and density, and negatively correlated with pH. Free sulfur dioxide and total sulfur dioxide show a strong positive correlation. Alcohol is negatively correlated with density. Quality is positively correlated with sulphates and alcohol, and negatively correlated with chlorides and volatile acidity.
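To read the same relationships off numerically rather than from the colors, we can rank each feature's correlation with the target:
# Features ranked by their linear correlation with quality
corr['quality'].sort_values(ascending=False)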
Now let us build our model. The first step is to split the data into features and target: quality is the target, which we assign to y, and the rest of the columns form the feature matrix X.
y = data.quality
X = data.drop('quality', axis=1)
Next, we will split our data into training and test sets. The training set will be used for fitting the model and the test set for validation. We set aside 20% of the data as a test set for evaluating our model, and set an arbitrary random state (seed) so that we can reproduce our results. We also stratify on the target variable to ensure the training and test sets have similar quality distributions.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=123,
stratify=y)
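We can verify that the stratification worked by comparing the distribution of quality across the two splits; the proportions should be nearly identical:
# Proportion of each quality level in the training and test sets
print(y_train.value_counts(normalize=True).sort_index())
print(y_test.value_counts(normalize=True).sort_index())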
Standardization is the process of subtracting the mean from each feature and then dividing by the feature's standard deviation. As seen above, our features are not on similar scales, which might skew our model, so we use StandardScaler() from the preprocessing module to transform the training set. We will apply the same scalers to the test data later on.
scaler_X = preprocessing.StandardScaler().fit(X_train)
# StandardScaler expects 2-D input, so reshape the 1-D target before fitting
scaler_y = preprocessing.StandardScaler().fit(y_train.values.reshape(-1, 1))
X_train_scaled = scaler_X.transform(X_train)
y_train_scaled = scaler_y.transform(y_train.values.reshape(-1, 1)).ravel()
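As a quick sanity check, the scaled training features should now have column means near 0 and standard deviations near 1:
# Means should be ~0 and standard deviations ~1 after standardization
print(X_train_scaled.mean(axis=0).round(2))
print(X_train_scaled.std(axis=0).round(2))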
We will next create a regressor using the random forest algorithm. Random forests are an ensemble learning method for classification and regression that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
clf = RandomForestRegressor()
Next, let us fit the training set using the regressor we created in the last step. Note that fit_transform belongs to transformers such as the scaler; a supervised model is fitted with fit:
clf.fit(X_train_scaled, y_train_scaled)
Now we scale the test data using the same scalers we fitted on the training data, and make predictions:
X_test_scaled = scaler_X.transform(X_test)
y_test_scaled = scaler_y.transform(y_test.values.reshape(-1, 1)).ravel()
pred = clf.predict(X_test_scaled)
Now we will score our model using the R² score:
r2_score(y_test_scaled, pred)
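Since we also imported mean_squared_error, we can report it alongside R² (note it is expressed in standardized target units here):
# MSE on the standardized target scale; lower is better
mean_squared_error(y_test_scaled, pred)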
An R² of roughly 0.392 is not that impressive. We need to tune the model to boost performance.
Now it's time to consider the hyperparameters that we'll want to tune for our model. Hyperparameters express "higher-level" structural information about the model, and they are typically set before training the model.
hyperparameters = {'randomforestregressor__max_features': [1.0, 'sqrt', 'log2'],  # 1.0 replaces the deprecated 'auto'
                   'randomforestregressor__max_depth': [None, 5, 3, 1]}
We can put the standardization and the model fitting into a single step using a pipeline:
pipeline = make_pipeline(preprocessing.StandardScaler(),
RandomForestRegressor(n_estimators=100))
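The randomforestregressor__ prefix in the hyperparameter names above comes from the pipeline's naming convention: each parameter is addressed as <step name>__<parameter name>. You can list every tunable name to confirm:
# Every parameter a grid search can reach, addressed as <stepname>__<paramname>
sorted(pipeline.get_params().keys())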
Cross-validation is a process for reliably estimating the performance of a model-building method by training and evaluating the model multiple times on different subsets of the data. The typical approach is k-fold cross-validation: split the data into k equal parts, or "folds"; train the model on k-1 folds; and evaluate it on the remaining hold-out fold (e.g. the 10th fold), rotating through all k folds. The best practice when performing cross-validation is to include your data preprocessing steps inside the cross-validation loop, which prevents accidentally contaminating the training folds with information from the hold-out fold.
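As an aside, plain k-fold cross-validation (without a hyperparameter grid) can be run directly with scikit-learn's cross_val_score; a minimal sketch using the pipeline above, so that scaling happens inside each fold:
from sklearn.model_selection import cross_val_score
# 10-fold CV: fit on 9 folds, score (R² by default for regressors) on the held-out fold
scores = cross_val_score(pipeline, X_train, y_train, cv=10)
print(scores.mean(), scores.std())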
clf = GridSearchCV(pipeline, hyperparameters, cv=10)
Next, we fit the model using the grid-search estimator we just created. Note that we pass the unscaled training data: the pipeline's StandardScaler is fitted inside each cross-validation fold.
clf.fit(X_train, y_train)
clf.best_params_
GridSearchCV essentially performs cross-validation across the entire "grid" (all possible combinations) of hyperparameters. It takes in your model (in this case, our model pipeline), the hyperparameters you want to tune, and the number of folds to create.
The optimum hyperparameters are max_depth: None and max_features: 'sqrt'.
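Because refit=True by default, GridSearchCV has already refitted the pipeline on the whole training set with these best parameters, and the best cross-validated score is available directly:
# Mean cross-validated R² of the best hyperparameter combination
clf.best_score_
We can now predict on the untouched test set with the refitted model and score it: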
y_pred = clf.predict(X_test)
r2_score(y_test, y_pred)
Next, let us try AdaBoost to see whether it can beat the random forest. We create a new pipeline, pipeline_2, that applies an AdaBoost regressor after the scaler:
from sklearn.ensemble import AdaBoostRegressor
pipeline_2 = make_pipeline(preprocessing.StandardScaler(), AdaBoostRegressor())
hyperparameters_2 = { 'adaboostregressor__n_estimators' : [50, 100, 25],
'adaboostregressor__learning_rate': [1, 0.5, 0.8]}
Then we tune the AdaBoost hyperparameters with the same grid-search procedure:
clf_2 = GridSearchCV(pipeline_2, hyperparameters_2, cv=10)
Next we fit the training data using the new estimator:
clf_2.fit(X_train, y_train)
clf_2.best_params_
We then make predictions for the test set and compute the R² score:
y_pred_2 = clf_2.predict(X_test)
r2_score(y_test, y_pred_2)
But AdaBoost could not improve on the random forest's score.
As a final baseline, let us try a plain linear regression in the same pipeline setup:
from sklearn.linear_model import LinearRegression
pipeline_3 = make_pipeline(preprocessing.StandardScaler(), LinearRegression())
pipeline_3.get_params()
Linear regression has few hyperparameters worth tuning; we grid over fit_intercept (the old normalize option has been removed from scikit-learn's LinearRegression, and our pipeline already standardizes the features):
hyperparameters_3 = {'linearregression__fit_intercept': [True, False]}
clf_3 = GridSearchCV(pipeline_3, hyperparameters_3, cv=10)
clf_3.fit(X_train, y_train)
clf_3.best_params_
y_pred_3 = clf_3.predict(X_test)
r2_score(y_test, y_pred_3)
It is better than AdaBoost but not better than the random forest. Hence, the random forest gives us the best R² score for our problem.
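Finally, since we imported joblib at the top, we can persist the winning random forest model so it can be reloaded later without retraining (the filename here is arbitrary):
# Save the fitted grid-search pipeline to disk...
joblib.dump(clf, 'rf_regressor.pkl')
# ...and load it back whenever predictions are needed
loaded_clf = joblib.load('rf_regressor.pkl')
loaded_clf.predict(X_test)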