Titanic dataset analysis using Pandas and Numpy¶
import pandas as pd
import numpy as np
from scipy import stats, integrate
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
%matplotlib inline
Problem Statement¶
What is the dependent variable and what are the factors in this data? Who had more chances of survival, what are the factors?¶
The data exploration section investigates the dependent variable 'Survived' and how factors such as being female, being a child, travelling in a certain class, or having a sibling/spouse or parent/child aboard affect the survival rate. We will also formulate a hypothesis and test it.
data = pd.read_csv('Titanic.csv')
data.head()
Data Wrangling¶
data.info()
The Titanic data has 891 rows. Every column is fully populated except Age, Cabin and Embarked, which have missing values.
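As a quick cross-check of what data.info() reports, the missing entries can be counted per column. The sketch below uses a small made-up frame (the name `toy` and its values are assumptions for illustration, not the real Titanic rows).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame; values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Number of missing entries per column, largest first.
missing_counts = toy.isnull().sum().sort_values(ascending=False)

# Keep only the columns that actually have gaps.
columns_with_gaps = missing_counts[missing_counts > 0]
print(columns_with_gaps)
```

On the real data the same two lines would be applied to `data` instead of `toy`.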
print(data['Cabin'].describe())
print(data['Embarked'].describe())
print(data['Age'].describe())
fig1 = data['Cabin'].value_counts().plot(kind ='bar', figsize= (15,3))
plt.title('Frequency/Counts by Cabin')
fig1.set(ylabel = 'Frequency', xlabel = 'Cabin')
Cabin has 147 unique values across 204 non-null rows, and the most frequent value occurs only 4 times. It is difficult to draw conclusions from this column, and since it is populated for only about 23% of rows (204 of 891), I will drop it from further analysis. PassengerId carries no useful information either, so I will drop that column as well.
del data['Cabin']
del data['PassengerId']
Let us also drop the rows with missing values for Age and Embarked now
data.dropna(subset = ['Embarked', 'Age'], inplace = True)
data.info()
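To see what dropna with a subset does, here is a minimal sketch on a made-up frame (`toy` and its values are assumptions). Only gaps in the listed columns cause a row to be dropped; gaps elsewhere are ignored.

```python
import numpy as np
import pandas as pd

# Made-up rows: one is missing Age, one is missing Embarked, three are missing Cabin.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
    "Cabin": [None, "C85", None, None],
})

rows_before = len(toy)

# Drop a row only if Age or Embarked is missing; missing Cabin values do not count.
cleaned = toy.dropna(subset=["Embarked", "Age"])
rows_after = len(cleaned)

print(rows_before, rows_after)
```

Comparing the row count before and after is a cheap way to confirm how much data the cleaning step removed.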
Pclass should not be numeric, so let us update it to upper, middle and lower class. For that, we need to look at its relationship with Fare
fig1 = sns.barplot(x="Pclass", y="Fare", data=data);
plt.title('Pclass by Mean Fare')
fig1.set(ylabel = 'Average Fare')
Mean Fare of Pclass 1 was 88 dollars, Pclass 2 was 21.47 dollars and Pclass 3 was 13.22 dollars, so let us update the values of Pclass to 'Upper' for Class 1, 'Middle' for Class 2 and 'Lower' for Class 3
# Map the numeric class codes to descriptive labels in a single step
data['Pclass'] = data['Pclass'].map({1: 'Upper', 2: 'Middle', 3: 'Lower'})
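One optional refinement, not used in this notebook, is to store the class labels as an ordered categorical so that tables and plots list the classes in a fixed Upper, Middle, Lower order rather than alphabetically or by first appearance. A sketch on made-up values (`toy` and `class_names` are assumptions):

```python
import pandas as pd

# Made-up class codes for illustration.
toy = pd.DataFrame({"Pclass": [3, 1, 2, 3, 1]})

class_names = {1: "Upper", 2: "Middle", 3: "Lower"}

# Ordered categorical: the category list fixes the display order.
toy["Pclass"] = pd.Categorical(
    toy["Pclass"].map(class_names),
    categories=["Upper", "Middle", "Lower"],
    ordered=True,
)

print(toy["Pclass"].value_counts(sort=False))
```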
Data Exploration¶
# Distribution of numeric variables
fig, (ax1, ax2, ax3, ax4) = plt.subplots(ncols=4, figsize = (18,3))
data['Age'].plot(kind ='hist', bins = 25, ax=ax1)
ax1.set(xlabel='Age')
data['Fare'].plot(kind = 'hist', bins= 25, ax=ax2)
ax2.set(xlabel='Fare')
data['Parch'].plot(kind = 'hist', ax=ax3)
ax3.set(xlabel='Parch')
data['SibSp'].plot(kind = 'hist', ax=ax4)
ax4.set(xlabel='SibSp')
plt.suptitle("Distribution of Numerical columns in the data", size=12)
# Distribution of categorical variables
fig, (ax1, ax2, ax3, ax4) = plt.subplots(ncols=4, figsize = (18,4))
data['Sex'].value_counts().plot(kind ='bar', ax=ax1)
ax1.set_title('Sex')
ax1.set(ylabel='Frequency')
data['Survived'].value_counts().plot(kind = 'bar', ax=ax2)
ax2.set_title('Survived')
ax2.set_xticklabels(['Perished', 'Survived'])
ax2.set(ylabel='Frequency')
data['Pclass'].value_counts().plot(kind = 'bar', ax=ax3)
ax3.set_title('Pclass')
ax3.set(ylabel='Frequency')
data['Embarked'].value_counts().plot(kind = 'bar', ax=ax4)
ax4.set_title('Embarked')
ax4.set(ylabel='Frequency')
plt.suptitle("Distribution of Categorical columns in the data", size=12)
The above plots show the distributions of the numerical and categorical columns in our data. Age ranges from 0 to 80 years with mean and mode around 25-30 years. Fare ranges from 0 to over 500 dollars. Parch and SibSp both have their mode at 0, meaning most people did not travel with any parent/child or sibling/spouse. After dropping rows with missing values there were 453 males and 259 females onboard; 424 passengers perished and 288 survived. Most of the passengers were in the Lower Pclass and embarked at port S.
Understanding the dependencies of dependent and independent variables¶
Since more than 50% of the passengers in the given data perished, we will investigate the factors that survival depends on. We would like to answer questions such as: did females have a better chance of surviving, how do age and fare affect survival, does having a parent/child or sibling/spouse aboard influence survival, and how does Pclass affect survival? The dependent variable is 'Survived', which is 0 for passengers who perished and 1 for passengers who survived. The independent variables are Sex, Pclass, Embarked, Age, Fare etc.
There could be other factors, such as the location of cabins or the location/state (asleep or awake) of passengers at the time of the accident, for which we had limited or no data and which have therefore been omitted from the analysis. We also omitted rows with missing values for 'Age' and 'Embarked', which may skew the statistics somewhat.
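One way to gauge how much dropping rows might skew the results is to compare the survival rate of rows that have an Age with the rate of rows that do not. This is only a sketch on made-up values (`toy` is an assumption); on the real data it would have to be run before the dropna step.

```python
import numpy as np
import pandas as pd

# Made-up rows: two passengers have no recorded Age.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0, 4.0],
    "Survived": [0, 0, 1, 0, 1, 1],
})

has_age = toy["Age"].notnull()

# Mean of a 0/1 column is the survival rate of that group.
rate_kept = toy.loc[has_age, "Survived"].mean()
rate_dropped = toy.loc[~has_age, "Survived"].mean()

print(rate_kept, rate_dropped)
```

If the two rates differ noticeably, the rows being dropped are not a random sample, and the remaining data overstates or understates survival accordingly.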
# Function to create grouped summary statistics by a factor
def grouped_by_factors(df, factor):
    mean_by_factor = df.groupby(factor).describe()
    return mean_by_factor
Some understanding of mean/max/std/count would be helpful for our analysis so I created a function to display statistics using groupby function. We will also be creating plots to help visualize the data.
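describe() returns every statistic for every numeric column, which can be wide. When only a few statistics are needed, a narrower variant can be sketched with agg. The helper name `summarize_by_factor` and the toy values are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 4.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 16.70, 51.86],
})

def summarize_by_factor(df, factor, columns):
    """Hypothetical helper: count, mean and std of the chosen columns per group."""
    return df.groupby(factor)[columns].agg(["count", "mean", "std"])

summary = summarize_by_factor(toy, "Survived", ["Age", "Fare"])
print(summary)
```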
Understanding Dependent variable 'Survived' by numerical columns¶
'Survived' by Age and Fare¶
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize =(12,3))
fig1 = sns.regplot(x="Age", y="Survived", data=data, ax = ax1)
fig2 = sns.regplot(x="Fare", y="Survived", data=data, ax = ax2)
plt.suptitle("Perished vs. Survived by Age and Fare", size=12)
fig1.set(ylabel='Survival Rate'), fig2.set(ylabel='Survival Rate')
'Survived' by SibSp and Parch¶
g = sns.PairGrid(data, y_vars=["Survived"], x_vars=["SibSp", "Parch"], height=4)
g.map(sns.barplot, color=".4")
g.set(ylabel='Survival Rate')
plt.suptitle("Perished vs. Survived by SibSp and Parch", size=12)
grouped_by_factors(data,'Survived')
The data shows that 424 passengers did not survive and 288 did.
The average age of survivors was 28.2 years (std = 14.8), compared with 30.62 years (std = 14.17) for those who did not survive. On average, survivors paid a higher fare (mean = 51.6 dollars) than those who did not (mean = 22.9 dollars).
From the bar charts, the survival rate for passengers travelling with one or two siblings/spouses, or with one to three parents/children, was higher than for those travelling with none. Survival is not linear in SibSp or Parch, which could be due to the small number of passengers at the higher values.
From the regression plots, survival is positively correlated with Fare and negatively correlated with Age, which means younger passengers and those who paid more had higher chances of surviving.
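The remark about lack of data for higher SibSp/Parch values can be checked by showing the group size next to each survival rate. A minimal sketch on made-up values (`toy` is an assumption):

```python
import pandas as pd

# Made-up rows: only one passenger has SibSp = 3.
toy = pd.DataFrame({
    "SibSp": [0, 0, 0, 1, 1, 3],
    "Survived": [0, 1, 0, 1, 1, 0],
})

# Survival rate and number of passengers for each SibSp value.
rates = toy.groupby("SibSp")["Survived"].agg(rate="mean", n="size")
print(rates)
```

A rate computed from a handful of passengers is unreliable, so groups with a small n should be read with caution.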
Understanding Dependent variable 'Survived' by Categorical columns¶
'Survived' by Pclass¶
grouped_by_factors(data,'Pclass')
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize =(12,3))
fig1 = sns.countplot(x="Pclass", data=data, hue='Survived', palette="Greens_d", ax=ax1);
fig1.set(ylabel = "Frequency")
fig1.legend(["Perished", "Survived"])
fig2 = sns.barplot(x="Pclass", y="Survived", data=data, ax=ax2);
plt.suptitle("Survival rate by Pclass", size=12)
fig2.set(ylabel='Survival Rate')
Mean Fare was 88 dollars for the Upper class, 21.47 dollars for the Middle class and 13.22 dollars for the Lower class. The survival rate was highest in the upper class (mean survival = 0.65), followed by the middle class (mean survival = 0.48) and then the lower class (mean survival = 0.24). Most of the passengers who did not survive belonged to the lower class. Survival rate falls steadily from upper to lower class. There could be several reasons for that: people in the upper classes could have boarded lifeboats before the lower classes. It also fits well with the correlation to Fare in the previous plot.
'Survived' by Embarked¶
# Draw the two count plots side by side instead of on top of each other
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 3))
sns.countplot(x="Embarked", data=data, hue='Survived', palette="Greens_d", ax=ax1)
ax1.set(ylabel="Frequency", title="Value counts of survivors by 'Embarked'")
ax1.legend(["Perished", "Survived"])
# Keep the automatic legend here so each Pclass label matches its bar color
sns.countplot(x="Embarked", data=data, hue='Pclass', palette="Reds_d", ax=ax2)
ax2.set(ylabel="Frequency", title="Value counts of passengers by 'Embarked' and Pclass")
Most passengers embarked at 'S', followed by 'C' and 'Q', and most of them were in the lower Pclass. Embarked does not show much relationship to survival rate.
'Survived' by Sex¶
data.groupby('Sex').describe()
fig1 = sns.countplot(x="Sex", data=data, hue='Survived', palette="Greens_d");
plt.suptitle("Valuecounts of Female vs Male survivors", size=12)
label = ["Perished", "Survived"]
plt.legend(label, loc='upper center')
fig1.set(ylabel = "Frequency")
The mean age of females who boarded the ship was 27-28 years and of males 30-31 years. There were 259 females and 453 males, and females survived at a much higher rate (mean survival = 0.75) than males (mean survival = 0.20).
For the purpose of this analysis, I will pick Sex, Pclass and Age as the major factors and investigate them further, because they show correlation with survival rate. Survival also correlated with Fare, but since fare is largely represented by Pclass, I picked Pclass over Fare. Other factors also affect survival, but I will focus on these three for this exercise.
Understanding Pclass and Sex as a factor¶
fig1=sns.barplot(x="Pclass", y="Survived", hue="Sex", data=data);
plt.suptitle("Survival Rate of Female vs Male survivors by Pclass", size=12)
fig1.set(ylabel = "Survival Rate")
In the full dataset of 891 rows (before rows with missing values were dropped) there were 314 females and 577 males; after dropping them, 259 females and 453 males remain. Mean survival for females (mean = 0.74, std = 0.44) is higher than for males (mean = 0.19, std = 0.39), and this holds across all Pclasses. Survival falls steadily with class. Females had a high probability of survival in both the Upper and Middle classes. Among males, only the upper class had a relatively high survival rate, and even that was lower than the rate for lower-class females.
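The survival rates behind the bar plot can also be read off as a table with one row per class and one column per sex. This is a sketch on made-up rows (`toy` is an assumption), not the real figures.

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    "Pclass": ["Upper", "Upper", "Lower", "Lower", "Lower", "Middle", "Middle", "Upper"],
    "Sex": ["female", "male", "female", "male", "male", "female", "male", "male"],
    "Survived": [1, 0, 1, 0, 0, 1, 0, 1],
})

# Mean of the 0/1 Survived column per (Pclass, Sex) cell is the survival rate.
rate_table = toy.pivot_table(values="Survived", index="Pclass", columns="Sex", aggfunc="mean")
print(rate_table)
```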
Understanding Age as a factor¶
print(grouped_by_factors(data,'Age').head())
print(grouped_by_factors(data,'Age').tail())
surv_age = data[data['Survived'] == 1]
g = surv_age['Age'].plot(kind='hist', figsize=[12,6], alpha=.8, color='b', label='Survived')
notsurv_age = data[data['Survived'] == 0]
notsurv_age['Age'].plot(kind='hist', figsize=[12,6], alpha=.4, color='g', label='Perished')
# Labels are attached to each histogram so the legend entries cannot be swapped
plt.legend()
g.set(xlabel='Age')
plt.suptitle("Distribution of Age for Perished and Survived", size=12)
Age of the passengers ranged from 0 to 80 years. The green bars are for passengers who did not survive and the blue bars are for those who survived. Both distributions are roughly normal, with a similar shape and a mode around 20 years. Below is the correlation of Age with survival; it shows a slight negative correlation, with a Pearson's r value of -0.082.
Correlation of 'Survived' with Age¶
sns.set(style="darkgrid", color_codes=True)
g = (sns.jointplot(x="Age", y="Survived", data=data, kind="reg", color="g", height=7)).set_axis_labels("Age", "Survival Rate")
plt.subplots_adjust(top=0.95)
plt.suptitle("Correlation of Age with Survival Rate", size=12)
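A Pearson's r like the one quoted for Age vs. Survived can be computed directly with scipy.stats.pearsonr. The sketch below uses made-up arrays (`age` and `survived` are assumptions) and also recomputes r from its definition, covariance divided by the product of the standard deviations.

```python
import numpy as np
from scipy import stats

# Made-up ages and 0/1 survival flags for illustration.
age = np.array([4.0, 22.0, 26.0, 35.0, 38.0, 54.0, 61.0, 2.0])
survived = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# Pearson correlation coefficient and its two-sided p-value.
r, p_value = stats.pearsonr(age, survived)

# Same coefficient from the definition: cov(x, y) / (std_x * std_y).
r_manual = np.cov(age, survived, ddof=1)[0, 1] / (age.std(ddof=1) * survived.std(ddof=1))

print(r, r_manual, p_value)
```

On the real data the inputs would be data['Age'] and data['Survived'].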
Correlation of 'Survived' with Age and Sex¶
g = sns.lmplot(x="Age", y="Survived", col="Sex", hue="Sex", data=data, y_jitter=.02, logistic=True, height=5)
plt.subplots_adjust(top=0.9)
plt.suptitle("Correlation of Age with Survival Rate", size=12)
g.set(ylabel = "Survival Rate")
Survival probability was higher for younger men and for older women. The side-by-side comparison of males and females by age below further supports that.
g = sns.catplot(x="Survived", y="Age", hue="Sex", data=data, height=6, kind="bar", palette="muted")
g.despine(left=True)
plt.subplots_adjust(top=0.95)
plt.suptitle("Survived vs Age for males and females", size=12)
g.set_xticklabels(['Perished', 'Survived'])
g.set(xlabel = "Survival status", ylabel = "Mean Age")
Hypothesis testing¶
My hypothesis is that children (aged 15 years or younger) had a greater chance of survival than adult females (older than 15 years).
The null hypothesis is that mean survival does not differ between children and adult females, and the alternative is that it does.
H0: µchild = µfemale
HA: µchild ≠ µfemale, tested two-sided at significance level α = 0.05, where µchild and µfemale are the population mean survival rates of the two groups. H0 is rejected if the p-value is below α.
#Children under 15yrs of age
data_children = data[data['Age'] <= 15]
#Females of age greater than 15 years
data_female = data[(data['Sex'] == 'female') & (data['Age'] > 15)]
scipy.stats.ttest_ind(data_children['Survived'], data_female['Survived'], axis=0, equal_var=False, nan_policy='propagate')
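Since the p-value is below α = 0.05, we reject H0: the difference in mean survival between children and adult females is significant. The negative t-statistic shows that the mean survival of adult females is higher than that of children, so the hypothesis that children fared better is not supported¶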
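With equal_var=False, ttest_ind runs Welch's t-test, which does not assume equal variances in the two groups. The sketch below, on made-up 0/1 samples (`children` and `adult_females` are assumptions), recomputes the statistic by hand to show where the sign comes from: it is the first group's mean minus the second group's mean, divided by the standard error.

```python
import numpy as np
from scipy import stats

# Made-up survival flags for illustration (1 = survived).
children = np.array([1, 1, 0, 1, 0, 1, 0, 0])
adult_females = np.array([1, 1, 1, 0, 1, 1, 1, 1, 0, 1])

# Welch's t-test: equal_var=False drops the equal-variance assumption.
t_stat, p_value = stats.ttest_ind(children, adult_females, equal_var=False)

# Manual Welch statistic: difference in means over the unpooled standard error.
mean_diff = children.mean() - adult_females.mean()
std_err = np.sqrt(
    children.var(ddof=1) / len(children)
    + adult_females.var(ddof=1) / len(adult_females)
)
t_manual = mean_diff / std_err

print(t_stat, t_manual, p_value)
```

Because the first argument is the children sample, a negative statistic means the children's mean survival is lower than the adult females' mean.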
Conclusions¶
In conclusion, with the given dataset, the most important factors are Sex and Pclass. Women had the highest probability of survival in general. Survival rate is positively correlated with Fare and negatively correlated with Age, which means younger people and those who paid more had higher chances of surviving. For females, survival was positively correlated with age, and for males it was negatively correlated. The survival rate was highest in the Upper Pclass, followed by the Middle and Lower classes. Most of the passengers in the lower class perished. Passengers travelling with a parent/child or sibling/spouse had a higher chance of survival than those travelling with none.
The analysis has the following limitations:
Rows with missing values for 'Age' and 'Embarked' were omitted.
No conclusions were drawn from the 'Name' column.
'Cabin' and 'PassengerId' were dropped during the data wrangling phase.
The data set is limited to 891 rows and does not cover everyone aboard.