- Sat 01 July 2017
- posts
- Michael Campbell
- #data-science, #data-analysis, #UCI
Analyzing the UCI heart disease dataset
The UCI repository contains three datasets on heart disease. Each dataset contains information about several patients suspected of having heart disease, such as whether or not the patient is a smoker, the patient's resting heart rate, age, sex, and so on. The patients were all tested for heart disease, and the results of those tests are given as numbers ranging from 0 (no heart disease) to 4 (severe heart disease). The goal of this notebook is to use machine learning and statistical techniques to predict both the presence and severity of heart disease from the features given. In addition, I will analyze which features are most important in predicting the presence and severity of heart disease.
The three datasets I will be using come from Hungary, Long Beach, and Cleveland. Each of these hospitals recorded patient data, which was published with personal information removed.
Cleaning the Data
The datasets are slightly messy and will first need to be cleaned. For example, the data isn't in standard csv format: each patient record spans several lines, with records separated by the word 'name'. I will first process the data to bring it into csv format, and then import it into a pandas DataFrame. The names and descriptions of the features, found on the UCI repository, are stored in the string feature_names.
%matplotlib inline
import xgboost as xgb
import numpy as np
import io
import pandas as pd
import matplotlib.pyplot as plt
import urllib.request
import re
# Feature names and descriptions, as found on the UCI database.
feature_names = '''1 id: patient identification number
2 ccf: social security number (I replaced this with a dummy value of 0)
3 age: age in years
4 sex: sex (1 = male; 0 = female)
5 painloc: chest pain location (1 = substernal; 0 = otherwise)
6 painexer (1 = provoked by exertion; 0 = otherwise)
7 relrest (1 = relieved after rest; 0 = otherwise)
8 pncaden (sum of 5, 6, and 7)
9 cp: chest pain type
10 trestbps: resting blood pressure (in mm Hg on admission to the hospital)
11 htn
12 chol: serum cholestoral in mg/dl
13 smoke: I believe this is 1 = yes; 0 = no (is or is not a smoker)
14 cigs (cigarettes per day)
15 years (number of years as a smoker)
16 fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
17 dm (1 = history of diabetes; 0 = no such history)
18 famhist: family history of coronary artery disease (1 = yes; 0 = no)
19 restecg: resting electrocardiographic results
20 ekgmo (month of exercise ECG reading)
21 ekgday (day of exercise ECG reading)
22 ekgyr (year of exercise ECG reading)
23 dig (digitalis used furing exercise ECG: 1 = yes; 0 = no)
24 prop (Beta blocker used during exercise ECG: 1 = yes; 0 = no)
25 nitr (nitrates used during exercise ECG: 1 = yes; 0 = no)
26 pro (calcium channel blocker used during exercise ECG: 1 = yes; 0 = no)
27 diuretic (diuretic used used during exercise ECG: 1 = yes; 0 = no)
28 proto: exercise protocol
29 thaldur: duration of exercise test in minutes
30 thaltime: time when ST measure depression was noted
31 met: mets achieved
32 thalach: maximum heart rate achieved
33 thalrest: resting heart rate
34 tpeakbps: peak exercise blood pressure (first of 2 parts)
35 tpeakbpd: peak exercise blood pressure (second of 2 parts)
36 dummy
37 trestbpd: resting blood pressure
38 exang: exercise induced angina (1 = yes; 0 = no)
39 xhypo: (1 = yes; 0 = no)
40 oldpeak = ST depression induced by exercise relative to rest
41 slope: the slope of the peak exercise ST segment
42 rldv5: height at rest
43 rldv5e: height at peak exercise
44 ca: number of major vessels (0-3) colored by flourosopy
45 restckm: irrelevant
46 exerckm: irrelevant
47 restef: rest raidonuclid (sp?) ejection fraction
48 restwm: rest wall (sp?) motion abnormality
49 exeref: exercise radinalid (sp?) ejection fraction
50 exerwm: exercise wall (sp?) motion
51 thal: 3 = normal; 6 = fixed defect; 7 = reversable defect
52 thalsev: not used
53 thalpul: not used
54 earlobe: not used
55 cmo: month of cardiac cath (sp?) (perhaps "call")
56 cday: day of cardiac cath (sp?)
57 cyr: year of cardiac cath (sp?)
58 num: diagnosis of heart disease
59 lmt
60 ladprox
61 laddist
62 diag
63 cxmain
64 ramus
65 om1
66 om2
67 rcaprox
68 rcadist
69 lvx1: not used
70 lvx2: not used
71 lvx3: not used
72 lvx4: not used
73 lvf: not used
74 cathef: not used
75 junk: not used '''
name_cols = []
for line in feature_names.split('\n'):
    name_cols.append(line.strip().split()[1].strip(': '))
file1 = io.TextIOWrapper(urllib.request.urlopen(
    'http://mlr.cs.umass.edu/ml/machine-learning-databases/heart-disease/cleveland.data'), encoding='ISO-8859-1')
file2 = io.TextIOWrapper(urllib.request.urlopen(
    'http://mlr.cs.umass.edu/ml/machine-learning-databases/heart-disease/hungarian.data'), encoding='ISO-8859-1')
file3 = io.TextIOWrapper(urllib.request.urlopen(
    'http://mlr.cs.umass.edu/ml/machine-learning-databases/heart-disease/long-beach-va.data'), encoding='ISO-8859-1')
outer_list = []
inner_list = []
for file_ in (file1, file2, file3):
    for line in file_:
        quantities = re.split('[^0-9.name-]+', line.strip())
        for x in quantities:
            if x != 'name':
                inner_list.append(x)
            else:
                outer_list.append(inner_list)
                inner_list = []
Removing corrupted data
Each row should contain 75 elements, one per feature. However, several of the rows were not written correctly and contain too many elements. These rows will be deleted, and the data will then be loaded into a pandas dataframe.
for i, row in enumerate(outer_list):
    if len(row) != 75:
        print(len(row), i)
print(inner_list)
removable_rows = sorted([i for i, row in enumerate(outer_list) if len(row) != 75], reverse=True)
for i in removable_rows:
    del outer_list[i]
df = pd.DataFrame(outer_list,columns=name_cols)
df.head()
Flagging NaN values
Missing values in the data are represented as -9. These will need to be flagged as NaN in order to get good results from any machine learning algorithm.
for column in df.columns:
    try:
        df[column] = df[column].astype(int)
    except ValueError:
        df[column] = df[column].astype(float)
df = df.applymap(lambda x: np.NaN if x == -9 else x)
df.head()
Removing non-predictive features
Before I start analyzing the data, I will drop columns which aren't going to be predictive. Several features, such as the day of the exercise reading or the ID of the patient, are unlikely to be relevant in predicting heart disease. The exercise protocol might be predictive, but since the protocol likely varies by hospital, and since the hospitals had different rates for each category of heart disease, it could end up indicating which hospital the patient went to rather than the likelihood of heart disease. The description of the columns on the UCI website also indicates that several of the columns should not be used.
In addition, the information in columns 59 and onwards simply describes which vessels damage was detected in. Since I am only trying to predict the presence of heart disease, and not the specific vessels which are damaged, I will discard these columns. The columns to be dropped are:
- ID: id of patient
- ccf: social security number (I replaced this with a dummy value of 0)
- ekgmo (month of exercise ECG reading)
- ekgday (day of exercise ECG reading)
- ekgyr (year of exercise ECG reading)
- proto: exercise protocol
- dummy
- restckm: irrelevant
- exerckm: irrelevant
- thalsev: not used
- thalpul: not used
- earlobe: not used
- cmo: month of cardiac cath (sp?) (perhaps "call")
- cday: day of cardiac cath (sp?)
- cyr: year of cardiac cath (sp?)
df = df[df.columns[:58]]
for feat in ['id','ccf','ekgday','ekgmo','ekgyr','proto','dummy','restckm','exerckm',
             'thalsev','thalpul','earlobe','cmo','cday','cyr']:
    del df[feat]
To get a better sense of the remaining data, I will print out how many distinct values occur in each of the first ten columns.
for column in df.columns[:10]:
    print(column, ' ', len(np.unique(df[column].dropna())))
Some columns, such as pncaden, contain fewer than 2 distinct values. These columns are not predictive and hence should be dropped.
for column in df.columns:
    if len(np.unique(df[column].dropna())) < 2:
        del df[column]
print(df.shape)
There are also several columns which are mostly NaN entries. I will drop any column that is mostly NaN, since I want to make predictions based on features that all or most of the patients share.
for feat in df.columns:
    # if the column is mostly empty NaN values, drop it
    if df[feat].dropna().size < df[feat].size / 2:
        del df[feat]
df.head()
Encoding categorical features
Most of the remaining columns are either binary categorical features with two values, or continuous features such as age or cigs. However, the column 'cp' takes four possible values, and will need to be one hot encoded.
for column in df.columns:
    if len(np.unique(df[column].dropna().values)) < 10:
        print('{}: {}'.format(column, np.unique(df[column].dropna().values)))
We can also see that the column 'prop' appears to have some corrupted rows, which will need to be removed from the dataframe. I will also one hot encode the categorical features 'cp' (the type of chest pain) and 'restecg' (the resting electrocardiographic results).
df = df[(df['prop']!=22.0)]
df = pd.get_dummies(df, columns=['cp','restecg'])
df.head()
Exploratory data analysis
The dataset still has a large number of features, which need to be analyzed for predictive power. I will begin by splitting the data into training and test sets.
from sklearn.model_selection import train_test_split
X = df[df.columns.difference(['num'])]
y = df['num']
X_train,X_test,y_train,y_test = train_test_split(X,y)
Using the ANOVA F-test and mutual information for feature selection
Although there are some features which are slightly predictive on their own, the data contains more features than necessary, and not all of them are useful. To narrow down the number of features, I will use the sklearn class SelectKBest. By default, this class scores each feature with its ANOVA F-value, the ratio of the between-class variance to the within-class variance. The higher the F-value, the more the variable differs between the classes, and the more likely it is to be relevant. However, the F-value only looks at differences in class means, so it can miss relationships which are still meaningful. Another approach is to select the features with the highest mutual information with the target. I will try both methods to find which one yields the best results.
To deal with missing values (NaN entries) in the data, I will impute each feature with its mean.
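As a quick sanity check before building the full pipelines, the two scoring functions can also be computed directly on the imputed training data to see how they rank the features individually. This is a minimal sketch, not part of the main analysis; it assumes the X_train/y_train split above and the same (older) sklearn Imputer API used throughout this notebook.
# Rank features individually by ANOVA F-value and by mutual information
# (rough sketch only; assumes X_train/y_train and the older sklearn Imputer API).
from sklearn.feature_selection import f_classif, mutual_info_classif
from sklearn.preprocessing import Imputer

X_imp = Imputer(missing_values='NaN', strategy='mean', axis=0).fit_transform(X_train)
f_scores, _ = f_classif(X_imp, y_train)
mi_scores = mutual_info_classif(X_imp, y_train)

ranking = pd.DataFrame({'f_score': f_scores, 'mutual_info': mi_scores},
                       index=X_train.columns)
print(ranking.sort_values('f_score', ascending=False).head(10))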
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import Imputer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import mutual_info_classif
results_f = list()
std_f = list()
results_mi = list()
std_mi = list()
for i in range(1,37,2):
    pipe = Pipeline([('Imputer',Imputer(missing_values='NaN',strategy='mean',axis=0)),
                     ('select_feat',SelectKBest(k=i)),
                     ('clf',LogisticRegression(C=100))])
    cv_scores = cross_val_score(pipe,X_train,y_train,cv=10,n_jobs=-1)
    results_f.append(cv_scores.mean())
    std_f.append(cv_scores.std())
    pipe = Pipeline([('Imputer',Imputer(missing_values='NaN',strategy='mean',axis=0)),
                     ('select_feat',SelectKBest(k=i,score_func=mutual_info_classif)),
                     ('clf',LogisticRegression(C=100))])
    cv_scores = cross_val_score(pipe,X_train,y_train,cv=10,n_jobs=-1)
    results_mi.append(cv_scores.mean())
    std_mi.append(cv_scores.std())
results_mi,std_mi,results_f,std_f = np.array(results_mi),np.array(std_mi),np.array(results_f),np.array(std_f)
plt.plot(range(1,37,2),results_f,c='b',label='f ratio')
plt.fill_between(range(1,37,2),results_f-std_f,results_f+std_f,alpha = .1,color='b')
plt.plot(range(1,37,2),results_mi,c='r',label = 'mutual info')
plt.fill_between(range(1,37,2),results_mi-std_mi,results_mi+std_mi,alpha=.1,color='r')
plt.legend(loc='upper left')
plt.xlabel('no of features')
plt.ylabel('accuracy')
plt.title('cross validated accuracy with varying no. of features')
plt.show()
The accuracy is about the same whether the features are scored with the F-value or with mutual information, and it stops increasing soon after roughly 5 features are included. As a comparison, I will repeat the same sweep over the number of features using a random forest classifier.
from sklearn.ensemble import RandomForestClassifier
results = list()
std = list()
for i in range(1,37,2):
    pipe = Pipeline([('Imputer',Imputer(missing_values='NaN',strategy='mean',axis=0)),
                     ('select_feat',SelectKBest(k=i)),
                     ('clf',RandomForestClassifier())])
    cv_scores = cross_val_score(pipe,X_train,y_train,cv=10,n_jobs=-1)
    results.append(cv_scores.mean())
    std.append(cv_scores.std())
results, std = np.array(results),np.array(std)
plt.plot(range(1,37,2),results,c='b',label='random forest')
plt.fill_between(range(1,37,2),results-std,results+std,alpha=.2,color='b')
plt.legend(loc='upper left')
plt.xlabel('no of features')
plt.ylabel('accuracy')
plt.title('cross validated accuracy with random forest')
plt.show()
Choosing a classifier
There are several types of classifiers available in sklearn. I have already tried logistic regression and random forests, but I have not yet searched for their optimal parameters. I will use a grid search to do so, testing three popular families of models for categorical data: logistic regression, random forests, and support vector machines with both linear and rbf kernels.
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
pipe = Pipeline([('Imputer',Imputer(missing_values='NaN',strategy='mean',axis=0)),
                 ('select_feat',SelectKBest(k=10)),
                 ('clf',RandomForestClassifier())])
params = {'select_feat__k':range(6,14,2),'clf__min_samples_leaf':[1,5,10,15,20,25,50]}
grid_search = GridSearchCV(pipe,param_grid=params,cv=5,n_jobs=-1)
grid_search.fit(X_train,y_train)
rf_est = grid_search.best_estimator_
print('Random forest Classifier')
print(grid_search.best_params_)
print('accuracy: {:.3f}'.format(grid_search.best_score_))
pipe = Pipeline([('Imputer',Imputer(missing_values='NaN',strategy='mean',axis=0)),
                 ('scl',StandardScaler()),
                 ('select_feat',SelectKBest(k=10)),
                 ('clf',LogisticRegression())])
params = {'select_feat__k':range(10,20,2),'clf__C':[10**x for x in range(-3,5)]}
grid_search = GridSearchCV(pipe,param_grid=params,cv=5,n_jobs=-1)
grid_search.fit(X_train,y_train)
lr_est = grid_search.best_estimator_
print('Logistic Regression')
print(grid_search.best_params_)
print('accuracy: {:.3f}'.format(grid_search.best_score_))
pipe = Pipeline([('Imputer',Imputer(missing_values='NaN',strategy='mean',axis=0)),
                 ('scl',StandardScaler()),
                 ('select_feat',SelectKBest(k=10)),
                 ('clf',LinearSVC(class_weight='balanced',tol=.1))])
params = {'select_feat__k':range(10,20,2),'clf__C':[10**x for x in range(-3,5)]}
grid_search = GridSearchCV(pipe,param_grid=params,cv=5,n_jobs=-1)
grid_search.fit(X_train,y_train)
lin_svm_est = grid_search.best_estimator_
print('Linear SVM')
print(grid_search.best_params_)
print('accuracy: {:.3f}'.format(grid_search.best_score_))
pipe = Pipeline([('Imputer',Imputer(missing_values='NaN',strategy='mean',axis=0)),
                 ('scl',StandardScaler()),
                 ('select_feat',SelectKBest(k=10)),
                 ('clf',SVC())])
params = {'select_feat__k':range(10,20,2),
          'clf__kernel':['rbf'],
          'clf__C':[2**x for x in range(-5,16,2)],
          'clf__gamma':[2**x for x in range(-15,4,2)],
          'clf__class_weight':['balanced']}
grid_search = GridSearchCV(pipe,param_grid=params,cv=5,n_jobs=-1)
grid_search.fit(X_train,y_train)
params = {'select_feat__k':[grid_search.best_params_['select_feat__k']],
          'clf__kernel':['rbf'],
          'clf__C':[grid_search.best_params_['clf__C']*(2**x) for x in np.arange(-1.5,1.75,.25)],
          'clf__gamma':[grid_search.best_params_['clf__gamma']*(2**x) for x in np.arange(-1.5,1.75,.25)],
          'clf__class_weight':['balanced']}
grid_search = GridSearchCV(pipe,param_grid=params,cv=5,n_jobs=-1)
grid_search.fit(X_train,y_train)
svm_est = grid_search.best_estimator_
print('Kernel Support Vector Machine')
print(grid_search.best_params_)
print('accuracy: {:.3f}'.format(grid_search.best_score_))
Another potentially useful classifier is XGBoost, a gradient boosting classifier which has been used to win several Kaggle competitions. I will tune it on this dataset as well.
from xgboost import XGBClassifier
pipe = Pipeline([('Imputer',Imputer(missing_values='NaN',strategy='mean',axis=0)),
                 ('select_feat',SelectKBest(k=10)),
                 ('clf',XGBClassifier(max_depth=4))])
params = {'select_feat__k':range(6,12,2),
          'clf__max_depth':range(3,8,1),
          'clf__subsample':np.arange(.5,1,.1),
          'clf__colsample_bytree':np.arange(.5,1,.1),
          'clf__reg_lambda':[10**x for x in range(-3,3)]}
grid_search = GridSearchCV(pipe,param_grid=params,cv=5,n_jobs=-1)
grid_search.fit(X_train,y_train)
print(grid_search.best_params_)
print(grid_search.best_score_)
xgb_est = grid_search.best_estimator_
The xgboost classifier does slightly better than the random forest and logistic regression, but the results are all close to each other.
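To make the comparison more direct, the tuned pipelines can be scored side by side with cross validation. This is a minimal sketch, assuming the fitted estimators rf_est, lr_est, lin_svm_est, svm_est and xgb_est returned by the grid searches above.
# Cross-validated comparison of the tuned pipelines (sketch only; assumes the
# best estimators returned by the grid searches above).
from sklearn.model_selection import cross_val_score
for name, est in [('random forest', rf_est), ('logistic regression', lr_est),
                  ('linear SVM', lin_svm_est), ('kernel SVM', svm_est),
                  ('xgboost', xgb_est)]:
    scores = cross_val_score(est, X_train, y_train, cv=5, n_jobs=-1)
    print('{:20s} {:.3f} +/- {:.3f}'.format(name, scores.mean(), scores.std()))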
Results
Applying the model to the test set, I get an accuracy of 56.7%. The xgboost classifier is only marginally more accurate than logistic regression at predicting the presence and severity of heart disease.
print("Testing accuracy for XGB classifier: {:.3f}".format(xgb_est.score(X_test,y_test)))
Several groups analyzing this dataset have used a subsample of 14 features. My model ended up selecting only 6 features, all drawn from that same subset (note that cp_2 and cp_4 are one hot encodings of values of the feature cp), as shown in the table below.
| Used in other analysis | Selected by mutual_info |
|---|---|
| age | relrest |
| sex | cp_2 |
| cp | cp_4 |
| trestbps | painexer |
| chol | exang |
| fbs | oldpeak |
| restecg | |
| thalach | |
| exang | |
| oldpeak | |
| slope | |
| ca | |
| thal | |
| num | num |
print(pd.Series(df.columns.difference(['num']))[xgb_est.named_steps['select_feat'].get_support()])
print(xgb_est.named_steps['clf'].feature_importances_)
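For readability, the selected feature names can be paired directly with their importance scores. A small convenience sketch, relying on the fitted xgb_est pipeline above:
# Pair the selected feature names with the xgboost importance scores
# (convenience sketch; relies on the fitted xgb_est pipeline above).
selected = X.columns[xgb_est.named_steps['select_feat'].get_support()]
importances = pd.Series(xgb_est.named_steps['clf'].feature_importances_,
                        index=selected).sort_values(ascending=False)
print(importances)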
Conclusions
In predicting the presence and severity of heart disease, I achieved 57.5% accuracy on the training set and 56.7% accuracy on the test set, indicating that the model was not overfitting the data. The most important features in predicting the presence of heart disease, together with the importance scores calculated by the xgboost classifier, were:
| feature | importance | explanation |
|---|---|---|
| oldpeak | .497 | the ST depression induced by exercise compared to rest |
| exang | .154 | whether there was exercise induced angina |
| painexer | .114 | whether or not the pain was induced by exercise |
| relrest | .108 | whether or not the pain was relieved by rest |
| cp_4 | .083 | presence of one type of chest pain |
| cp_2 | .045 | presence of another type of chest pain |