Random Forest Approach to Predicting Whether a Person Is Prone to Alcohol or Substance Use!

Hello and Welcome!

So in this blog post we’ll learn how to build a random forest classifier using scikit-learn in Python, in under 80 lines of code.

NOTE: If you want to skip the explanation of the snippets, the fully working code is available at the end of the blog along with the dataset.

The tools used are:

Spyder IDE

Anaconda Distribution

First of all we need to import a few dependencies that we’re going to work with:

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
# train_test_split now lives in sklearn.model_selection
# (the old sklearn.cross_validation module was removed)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
# Feature Importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier

Now we need to tell our code where the dataset is located. This is done as follows:

os.chdir(r"C:\TREES")  # raw string so the backslash isn't treated as an escape

Now we need to read the CSV file and drop rows with missing values:

AH_data = pd.read_csv("health.csv")
data_clean = AH_data.dropna()

Let’s describe the dataset now to see what we’re dealing with
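With pandas this is a one-liner (the same call appears in the full code below; in Spyder the result of the last expression is echoed, elsewhere wrap it in print()):

data_clean.dtypes
data_clean.describe()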

           BIO_SEX     HISPANIC        WHITE        BLACK    NAMERICAN  \
count  4575.000000  4575.000000  4575.000000  4575.000000  4575.000000
mean      1.521093     0.111038     0.683279     0.236066     0.036284
std       0.499609     0.314214     0.465249     0.424709     0.187017
min       1.000000     0.000000     0.000000     0.000000     0.000000
25%       1.000000     0.000000     0.000000     0.000000     0.000000
50%       2.000000     0.000000     1.000000     0.000000     0.000000
75%       2.000000     0.000000     1.000000     0.000000     0.000000
max       2.000000     1.000000     1.000000     1.000000     1.000000

             ASIAN          age        TREG1      ALCEVR1    ALCPROBS1  \
count  4575.000000  4575.000000  4575.000000  4575.000000  4575.000000
mean      0.040437    16.493052     0.176393     0.527432     0.369180
std       0.197004     1.552174     0.381196     0.499302     0.894947
min       0.000000    12.676712     0.000000     0.000000     0.000000
25%       0.000000    15.254795     0.000000     0.000000     0.000000
50%       0.000000    16.509589     0.000000     1.000000     0.000000
75%       0.000000    17.679452     0.000000     1.000000     0.000000
max       1.000000    21.512329     1.000000     1.000000     6.000000

       ...      ESTEEM1        VIOL1      PASSIST     DEVIANT1  \
count  ...  4575.000000  4575.000000  4575.000000  4575.000000
mean   ...    40.952131     1.618579     0.102514     2.645027
std    ...     5.381439     2.593230     0.303356     3.520554
min    ...    18.000000     0.000000     0.000000     0.000000
25%    ...    38.000000     0.000000     0.000000     0.000000
50%    ...    40.000000     0.000000     0.000000     1.000000
75%    ...    45.000000     2.000000     0.000000     4.000000
max    ...    50.000000    19.000000     1.000000    27.000000

          SCHCONN1         GPA1       EXPEL1     FAMCONCT      PARACTV  \
count  4575.000000  4575.000000  4575.000000  4575.000000  4575.000000
mean     28.360656     2.815647     0.040219    22.570557     6.290710
std       5.156385     0.770167     0.196493     2.614754     3.360219
min       6.000000     1.000000     0.000000     6.300000     0.000000
25%      25.000000     2.250000     0.000000    21.700000     4.000000
50%      29.000000     2.750000     0.000000    23.700000     6.000000
75%      32.000000     3.500000     0.000000    24.300000     9.000000
max      38.000000     4.000000     1.000000    25.000000    18.000000

           PARPRES
count  4575.000000
mean     13.398033
std       2.085837
min       3.000000
25%      12.000000
50%      14.000000
75%      15.000000
max      15.000000

This is what we’re dealing with!

Now we set our predictors (our features), set the target variable, and divide the dataset into training and testing data (a 60:40 split).
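Here is that step, taken from the complete code at the end of the post:

predictors = data_clean[['BIO_SEX','HISPANIC','WHITE','BLACK','NAMERICAN','ASIAN','age',
'ALCEVR1','ALCPROBS1','marever1','cocever1','inhever1','cigavail','DEP1','ESTEEM1','VIOL1',
'PASSIST','DEVIANT1','SCHCONN1','GPA1','EXPEL1','FAMCONCT','PARACTV','PARPRES']]

targets = data_clean.TREG1

# test_size=.4 gives the 60:40 train/test ratio
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)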

Then we’re going to import RandomForestClassifier from sklearn.ensemble:

from sklearn.ensemble import RandomForestClassifier

Now finally let's build the forest

classifier=RandomForestClassifier(n_estimators=25)
classifier=classifier.fit(pred_train,tar_train)

Here 25 is the number of trees that the forest will contain.
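One caveat: the forest is randomized, so the numbers below will differ slightly every time you run the script. If you want reproducible results you could fix the seed; a hypothetical variant, not in the original code:

# Hypothetical variant: random_state pins down the randomness
classifier=RandomForestClassifier(n_estimators=25, random_state=42)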

Let’s print out the confusion matrix and the accuracy.
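The code that produces them, taken from the complete listing at the end:

predictions=classifier.predict(pred_test)

print(sklearn.metrics.confusion_matrix(tar_test,predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))

And the output: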

[[1428 92]
 [ 183 127]]
0.849726775956

85% accuracy, NOT BAD!! 😀

In the confusion matrix the diagonal holds the correctly classified cases (1428 true negatives and 127 true positives), while 92 is the number of false positives and 183 the number of false negatives.
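If you’d rather have the four counts by name, scikit-learn’s binary confusion matrix unravels in (tn, fp, fn, tp) order; a quick sketch, my addition and not in the original post:

# rows are true labels, columns are predicted labels, so ravel()
# yields tn, fp, fn, tp for a binary problem
tn, fp, fn, tp = sklearn.metrics.confusion_matrix(tar_test, predictions).ravel()
print(tn, fp, fn, tp)  # for the run above: 1428 92 183 127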

Time to display the importance of each attribute

# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train,tar_train)
# display the relative importance of each attribute
print(model.feature_importances_)

[ 0.02305763 0.01585706 0.01981318 0.01882172 0.00819704 0.00474358
 0.06132761 0.0596015 0.0433457 0.11824281 0.02111982 0.01668584
 0.03413818 0.05850854 0.05522334 0.044511 0.01733067 0.06630337
 0.05980527 0.07483757 0.01314004 0.06078445 0.05635996 0.0482441 ]

We see that whether the person has ever used marijuana (marever1) has the most importance, and whether the person is Asian or not has the least (LoL 🙂 )
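Reading those numbers off a bare array is error-prone, so here’s a small sketch (my addition, not in the original post) that pairs each score with its column name, using the predictors DataFrame defined in the full code below:

# Not in the original post: sort features by importance so we don't
# have to count array indices by hand
for name, score in sorted(zip(predictors.columns, model.feature_importances_),
                          key=lambda pair: pair[1], reverse=True):
    print(name, round(score, 4))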

Hold on, we aren’t done yet.

Do we actually need 25 trees in our forest?

trees = range(1, 26)
accuracy = np.zeros(25)

for idx in range(len(trees)):
    # trees[idx] runs from 1 to 25, so the plotted x-axis matches n_estimators
    classifier = RandomForestClassifier(n_estimators=trees[idx])
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)

plt.cla()
plt.plot(trees, accuracy)


The above code computes the accuracy for forests of 1 up to 25 trees, stores each result in an array, and then plots it with matplotlib’s plot function (no Graphviz this time, thank God!)
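If you run the script outside Spyder’s interactive console you may also want axis labels and an explicit show(); a small optional addition, not in the original code:

# Optional: label the axes and force the plot window to appear
plt.xlabel("Number of trees")
plt.ylabel("Test-set accuracy")
plt.show()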

Here’s the plot:
[Plot: test-set accuracy vs. number of trees]

We see that when there was only one tree (just like a single decision tree) the accuracy was close to 83%, and even with 25 trees it only increased to about 84% (the exact numbers vary from run to run, since the forest is random).

CONCLUSION:

  1. Random forests generalize well on this data.
  2. The individual trees aren’t interpreted; only the forest as a whole is, which can be a disadvantage, since a single tree may give nearly the same result as 100 trees.

The complete code:

# -*- coding: utf-8 -*-
"""
Created on Wed Nov 15 21:09:27 2017

@author: Aditya
"""

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer versions
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
 # Feature Importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier

os.chdir(r"C:\TREES")  # raw string so the backslash isn't treated as an escape

#Load the dataset

AH_data = pd.read_csv("health.csv")
data_clean = AH_data.dropna()

data_clean.dtypes
data_clean.describe()

#Split into training and testing sets

predictors = data_clean[['BIO_SEX','HISPANIC','WHITE','BLACK','NAMERICAN','ASIAN','age',
'ALCEVR1','ALCPROBS1','marever1','cocever1','inhever1','cigavail','DEP1','ESTEEM1','VIOL1',
'PASSIST','DEVIANT1','SCHCONN1','GPA1','EXPEL1','FAMCONCT','PARACTV','PARPRES']]

targets = data_clean.TREG1

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape

#Build model on training data
from sklearn.ensemble import RandomForestClassifier

classifier=RandomForestClassifier(n_estimators=25)
classifier=classifier.fit(pred_train,tar_train)

predictions=classifier.predict(pred_test)

print(sklearn.metrics.confusion_matrix(tar_test,predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))




# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train,tar_train)
# display the relative importance of each attribute
print(model.feature_importances_)




"""
Running a different number of trees and see the effect
 of that on the accuracy of the prediction
"""

trees = range(1, 26)
accuracy = np.zeros(25)

for idx in range(len(trees)):
    # trees[idx] runs from 1 to 25, so the plotted x-axis matches n_estimators
    classifier = RandomForestClassifier(n_estimators=trees[idx])
    classifier = classifier.fit(pred_train, tar_train)
    predictions = classifier.predict(pred_test)
    accuracy[idx] = sklearn.metrics.accuracy_score(tar_test, predictions)

plt.cla()
plt.plot(trees, accuracy)


And the dataset used: [CLICK HERE]

I’ve blogged about the same analysis with decision trees here: Decision Tree Approach To Predict the Chances of A Person being Addicted to Substances or Alcohol

Feel Free to Comment or Contact me
