So what the heck is a blockchain anyway?

“A glorified, over-hyped linked list?”

Let’s break it down!

Blockchain was originally developed by a group of researchers to timestamp digital documents so that they could not be altered. Much like some classic machine learning algorithms, the idea lay dormant after the 1990s, until 2009, when mystery man Satoshi Nakamoto adopted it for Bitcoin’s architecture. The concept of blockchain took off with the rise of cryptocurrencies such as Bitcoin (read: https://bitcoin.org/bitcoin.pdf), and the rest is history, or should I say, “the rest is the future”.

Note: Bitcoin IS NOT a blockchain, and vice versa. Bitcoin is a peer-to-peer electronic value-transfer system that uses blockchain as its underlying concept.

Once data is put into a blockchain, it is almost impossible to change.

This is how a blockchain looks:

[Diagram: a chain of blocks, each storing data, its own hash, and the hash of the previous block]

Image Source : https://www.pluralsight.com/guides/software-engineering-best-practices/blockchain-architecture

Each block contains data, the hash of the current block, and the hash of the previous block. The data stored in a block depends on the application; in the diagram it is transactions, because the example uses Bitcoin.

Also see : https://www.firstpost.com/tech/news-analysis/andhra-pradesh-to-become-first-state-to-deploy-blockchain-technology-across-the-administration-4125897.html

A hash is a unique identifier for a block and all of its contents. Once a block is created, its hash is computed; changing anything inside the block also changes the hash of that block, which tells us that something has changed. By storing the hash of the previous block we chain the blocks together, which makes the blockchain very secure.

In the diagram, block 3 points to block 2 and block 2 points to block 1 by storing the hash of the previous block. The first block is a bit special: it is the starting block and no block points to it (DS-and-algorithms people can map this to the head of a linked list or the root of a tree). This block is called the genesis block, and it is almost always hard-coded into the software of the applications that use the blockchain.

If you tamper with block 2, all the following blocks become invalid, because each following block still holds the hash its predecessor had before the tampering occurred.
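This chaining of hashes can be sketched in a few lines of Python. Below is a toy illustration using the standard hashlib library; the block structure and the “A pays B” data are made up for the example, not how a real blockchain stores blocks:

```python
import hashlib

def block_hash(data, prev_hash):
    # A block's hash covers its data AND the previous block's hash
    return hashlib.sha256((data + prev_hash).encode()).hexdigest()

# The genesis block has no predecessor, so we use a fixed placeholder
h0 = block_hash("genesis", "0" * 64)
h1 = block_hash("tx: A pays B", h0)   # block 2 stores h0
h2 = block_hash("tx: B pays C", h1)   # block 3 stores h1

# Tamper with block 2's data: its hash changes, so the h1 that block 3
# stored no longer matches and the chain is detectably broken
tampered_h1 = block_hash("tx: A pays B (x1000)", h0)
print(tampered_h1 != h1)  # True
```

Any node holding the original h1 can spot the tampering by simply recomputing block 2's hash and comparing.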

But you may ask, “Computers these days are very fast, aren’t they?”

Yes they are, and in principle a fast computer could recompute the hashes of a tampered block and all the blocks after it, making the chain look legit again.

To prevent this, blockchain uses a proof-of-work algorithm that slows down the creation of blocks. In Bitcoin, it takes approximately 10 minutes to add a block. This makes tampering difficult: redoing the proof of work for every following block would take an enormous amount of time, which is practically impossible given how huge the number of blocks is.
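A toy version of proof of work can also be sketched in Python. Here the “work” is finding a nonce that makes the block’s SHA-256 hash start with a given number of zero hex digits; real Bitcoin mining uses the same idea with a far higher difficulty target:

```python
import hashlib

def proof_of_work(block_data, difficulty=4):
    # Try nonces until the hash has `difficulty` leading zero hex digits
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce, digest
        nonce += 1

nonce, digest = proof_of_work("tx: A pays B")
print(digest.startswith("0000"))  # True; found only after many hash attempts
```

Raising the difficulty by one hex digit multiplies the expected work by 16, which is exactly how block creation is slowed down.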

To add to the security, blockchains are distributed: instead of a central authority they use peer-to-peer networking, so whoever joins the party gets a full copy of the blockchain.

Now when someone creates a new block, it is sent to everyone in the network, and each node verifies that it has not been tampered with; if everything checks out, each node adds the block to its own copy of the chain. The nodes thus reach consensus about which blocks are valid and which are not. So to tamper successfully you would need to tamper with all the blocks after the target block, redo the proof of work for each of them, and take control of more than 50% of the peer-to-peer network; only then would the tampered block be accepted, which is practically impossible even with the world’s fastest supercomputers.

Fascinating, isn’t it?


Perform A Cluster Analysis to Predict the GPA in under 150 lines of Python

Hey there!

Cluster analysis is an unsupervised machine learning method that partitions the observations in a data set into a smaller set of clusters where each observation belongs to only one cluster. The goal of cluster analysis is to group, or cluster, observations into subsets based on their similarity of responses on multiple variables. Clustering variables should be primarily quantitative variables, but binary variables may also be included.

NOTE: If you want to skip the explanation, scroll down to the end for the complete code.

So let’s do that in Python!

As always we’ll start off with importing the dependencies:

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older scikit-learn versions
from sklearn import preprocessing
from sklearn.cluster import KMeans

Now let’s load the dataset:

os.chdir("C:/trees")
data = pd.read_csv("health.csv")

Now we’ll clean our dataset and pick our predictors by taking a subset of the columns. We also standardize the variables so that the solution is not driven by variables measured on larger scales. Finally, we split the data into training and testing sets using the train_test_split function.

#upper-case all DataFrame column names
data.columns = map(str.upper, data.columns)

# Data Management

data_clean = data.dropna()

# subset clustering variables
cluster=data_clean[['ALCEVR1','MAREVER1','ALCPROBS1','DEVIANT1','VIOL1',
'DEP1','ESTEEM1','SCHCONN1','PARACTV', 'PARPRES','FAMCONCT']]
cluster.describe()

# standardize clustering variables to have mean=0 and sd=1
clustervar=cluster.copy()
clustervar['ALCEVR1']=preprocessing.scale(clustervar['ALCEVR1'].astype('float64'))
clustervar['ALCPROBS1']=preprocessing.scale(clustervar['ALCPROBS1'].astype('float64'))
clustervar['MAREVER1']=preprocessing.scale(clustervar['MAREVER1'].astype('float64'))
clustervar['DEP1']=preprocessing.scale(clustervar['DEP1'].astype('float64'))
clustervar['ESTEEM1']=preprocessing.scale(clustervar['ESTEEM1'].astype('float64'))
clustervar['VIOL1']=preprocessing.scale(clustervar['VIOL1'].astype('float64'))
clustervar['DEVIANT1']=preprocessing.scale(clustervar['DEVIANT1'].astype('float64'))
clustervar['FAMCONCT']=preprocessing.scale(clustervar['FAMCONCT'].astype('float64'))
clustervar['SCHCONN1']=preprocessing.scale(clustervar['SCHCONN1'].astype('float64'))
clustervar['PARACTV']=preprocessing.scale(clustervar['PARACTV'].astype('float64'))
clustervar['PARPRES']=preprocessing.scale(clustervar['PARPRES'].astype('float64'))

# split data into train and test sets
clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=123)

Now we’ll fit solutions with 1 to 9 clusters, because we’re not sure which number will work best, and use the cdist function that scipy provides to find the average distance of the observations from their cluster centroids.

# k-means cluster analysis for 1-9 clusters 
from scipy.spatial.distance import cdist
clusters=range(1,10)
meandist=[]

for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign = model.predict(clus_train)
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
                    / clus_train.shape[0])

For each k, meandist holds the sum over observations of the distance to the nearest centroid (Euclidean distance is used here; axis=1 takes the minimum across centroids), divided by the number of observations in the training set.
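To see what that computes, here is the same cdist calculation on a tiny made-up 2-D dataset (four points forming two obvious clusters; toy data, not the health dataset):

```python
import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centers = np.array([[0.0, 0.5], [10.0, 10.5]])   # one centroid per cluster

d = cdist(X, centers, 'euclidean')       # shape (4, 2): every point vs every centroid
mean_dist = np.min(d, axis=1).sum() / X.shape[0]  # nearest-centroid distance, averaged
print(mean_dist)  # 0.5
```

Each point sits 0.5 away from its own centroid, so the average nearest-centroid distance is 0.5; a good k makes this number small without over-fragmenting the data.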

We then plot the curve and look for an elbow, which generally indicates a good number of clusters to choose.

"""
Plot average distance from observations from the cluster centroid
to use the Elbow Method to identify number of clusters to choose
"""

plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')

[Elbow plot: average distance vs. number of clusters]

As you can see, the choice is subjective, and there are two elbows here: one at 2 and another at 3.

As an example, we’ll rerun the cluster analysis with 3 clusters to see how well that works:

# Interpret 3 cluster solution
model3=KMeans(n_clusters=3)
model3.fit(clus_train)
clusassign=model3.predict(clus_train)

Since there are 11 dimensions, plotting the clusters directly isn’t easy, so we use a technique called principal component analysis (PCA), which creates a smaller number of variables as linear combinations of the originals. The majority of the variance is accounted for by the first few canonical variables, so plotting just those gives a reasonable picture of the 11 clustering variables.

from sklearn.decomposition import PCA
pca_2 = PCA(2) # return the first 2 canonical variables
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model3.labels_,)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()

The output is shown below:

[Scatterplot of canonical variables for the 3-cluster solution]

We see that the blue and yellow clusters overlap and are densely packed, which isn’t good news. This suggests that the 2-cluster solution may be better.

First we’ll look at the pattern of means on the clustering variables for each cluster to see whether the clusters are meaningful. To do this, we have to link the cluster assignments back to their corresponding observations in the clus_train dataset that holds the clustering variables. The first thing we need is a unique identifier variable for each row of clus_train, and we can get one from the index that pandas creates automatically for every data frame. We reset the index with clus_train.reset_index; in the parentheses, level=0 tells Python to only remove the given level from the index, and inplace=True tells Python to add the new column to the existing clus_train dataset. We then have a new variable labeled index that we can use as a unique identifier. In the next step, we create an object called cluslist and use the list function to pull the new index variable out of clus_train and convert it to a list. This will be combined with the cluster assignments so that we can merge the two datasets by each observation’s unique identifier. We do the same for the cluster assignments, which are currently contained in the model3.labels_ attribute from the cluster analysis: we call this list labels, and use the list function again to create a list of the cluster assignment for each observation.

"""

BEGIN multiple steps to merge cluster assignment with clustering variables to examine
cluster variable means by cluster
"""
# create a unique identifier variable from the index for the 
# cluster training data to merge with the cluster assignment variable
clus_train.reset_index(level=0, inplace=True)
# create a list that has the new index variable
cluslist=list(clus_train['index'])
# create a list of cluster assignments
labels=list(model3.labels_)
# combine index variable list with cluster assignment list into a dictionary
newlist=dict(zip(cluslist, labels))
newlist
# convert newlist dictionary to a dataframe
newclus=DataFrame.from_dict(newlist, orient='index')
newclus
# rename the cluster assignment column
newclus.columns = ['cluster']

# now do the same for the cluster assignment variable
# create a unique identifier variable from the index for the 
# cluster assignment dataframe 
# to merge with cluster training data
newclus.reset_index(level=0, inplace=True)
# merge the cluster assignment dataframe with the cluster training variable dataframe
# by the index variable
merged_train=pd.merge(clus_train, newclus, on='index')
merged_train.head(n=100)
# cluster frequencies
merged_train.cluster.value_counts()

"""
END multiple steps to merge cluster assignment with clustering variables to examine
cluster variable means by cluster
"""

Now finally we’ll print the output:

# FINALLY calculate clustering variable means by cluster
clustergrp = merged_train.groupby('cluster').mean()
print ("Clustering variable means by cluster")
print(clustergrp)
Clustering variable means by cluster
 index ALCEVR1 MAREVER1 ALCPROBS1 DEVIANT1 VIOL1 \
cluster 
0 3321.939617 0.663367 1.092524 0.890922 1.099921 0.784946 
1 3239.829577 -1.056455 -0.474543 -0.412562 -0.451110 -0.264092 
2 3323.044424 0.946562 -0.059040 -0.056945 -0.122713 -0.168283

DEP1 ESTEEM1 SCHCONN1 PARACTV PARPRES FAMCONCT 
cluster 
0 0.844458 -0.652370 -0.928611 -0.413087 -0.476838 -0.956729 
1 -0.292352 0.206912 0.341590 0.091216 0.156604 0.298288 
2 -0.196129 0.190528 0.136841 0.156060 0.103881 0.233021

The means on the clustering variables showed that, compared to the other clusters, adolescents in the first cluster, cluster 0, had a relatively high likelihood of having used alcohol, but otherwise tended to fall somewhere in between the other two clusters on the other variables. The second cluster, cluster 1, clearly includes the most troubled adolescents: adolescents in this cluster had a high likelihood of having used alcohol and marijuana, more alcohol problems, and more engagement in deviant and violent behaviors compared to the other two clusters. They also had higher levels of depression, lower self-esteem, and the lowest levels of school connectedness, parental presence, parental involvement in activities, and family connectedness. The third cluster, cluster 2, appears to include the least troubled adolescents: compared to adolescents in the other clusters, they were least likely to have used marijuana, and had the fewest alcohol problems and deviant and violent behaviors. They also had greater school and family connectedness.

Now let’s see how the clusters differ on GPA.

# validate clusters in training data by examining cluster differences in GPA using ANOVA
# first have to merge GPA with clustering variables and cluster assignment data 
gpa_data=data_clean['GPA1']
# split GPA data into train and test sets
gpa_train, gpa_test = train_test_split(gpa_data, test_size=.3, random_state=123)
gpa_train1=pd.DataFrame(gpa_train)
gpa_train1.reset_index(level=0, inplace=True)
merged_train_all=pd.merge(gpa_train1, merged_train, on='index')
sub1 = merged_train_all[['GPA1', 'cluster']].dropna()

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

gpamod = smf.ols(formula='GPA1 ~ C(cluster)', data=sub1).fit()
print (gpamod.summary())

print ('means for GPA by cluster')
m1= sub1.groupby('cluster').mean()
print (m1)

print ('standard deviations for GPA by cluster')
m2= sub1.groupby('cluster').std()
print (m2)

mc1 = multi.MultiComparison(sub1['GPA1'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())

OUTPUT:

                            OLS Regression Results
==============================================================================
Dep. Variable:                   GPA1   R-squared:                       0.078
Model:                            OLS   Adj. R-squared:                  0.078
Method:                 Least Squares   F-statistic:                     136.0
Date:                Mon, 20 Nov 2017   Prob (F-statistic):           2.10e-57
Time:                        14:28:57   Log-Likelihood:                -3596.8
No. Observations:                3202   AIC:                             7200.
Df Residuals:                    3199   BIC:                             7218.
Df Model:                           2
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept           2.9951      0.020    151.563      0.000       2.956       3.034
C(cluster)[T.1]    -0.5712      0.035    -16.468      0.000      -0.639      -0.503
C(cluster)[T.2]    -0.1614      0.030     -5.397      0.000      -0.220      -0.103
==============================================================================
Omnibus:                      152.383   Durbin-Watson:                   2.017
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               92.763
Skew:                          -0.280   Prob(JB):                     7.19e-21
Kurtosis:                       2.382   Cond. No.                         3.41
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

means for GPA by cluster
             GPA1
cluster
0        2.995067
1        2.423876
2        2.833712

standard deviations for GPA by cluster
             GPA1
cluster
0        0.738169
1        0.782335
2        0.728128

Multiple Comparison of Means - Tukey HSD, FWER=0.05
=============================================
group1 group2 meandiff  lower    upper   reject
---------------------------------------------
  0      1    -0.5712  -0.6525  -0.4899   True
  0      2    -0.1614  -0.2315  -0.0913   True
  1      2     0.4098   0.3248   0.4949   True
---------------------------------------------

Here we use analysis of variance to test whether there are significant differences between clusters on the quantitative GPA variable. To do this, we import the statsmodels.formula.api and statsmodels.stats.multicomp libraries. We use the ols function to run the analysis of variance; the formula specifies the model, with GPA as the response variable and cluster as the explanatory variable, and the capital C tells Python that the cluster assignment variable is categorical. We also print the mean and standard deviation of GPA for each cluster using the groupby function. Then, because our categorical cluster variable has three categories, we request a Tukey test to evaluate post hoc comparisons between the clusters, using the MultiComparison function from the statsmodels.stats.multicomp library, which we imported as multi. The results are shown above.

RESULTS:
The analysis of variance summary table indicates that the clusters differed significantly on GPA. When we examine the means, we find that, not surprisingly, adolescents in cluster 1, the most troubled group, had the lowest GPA. The Tukey test shows that all the clusters differed significantly in mean GPA, although the difference between cluster 0 and cluster 2 was smaller.

THE COMPLETE CODE:

# -*- coding: utf-8 -*-
"""
Created on Mon Nov 20 11:33:06 2017

@author: Aditya
"""

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older scikit-learn versions
from sklearn import preprocessing
from sklearn.cluster import KMeans

"""
Data Management
"""
os.chdir("C:/trees")
data = pd.read_csv("health.csv")

#upper-case all DataFrame column names
data.columns = map(str.upper, data.columns)

# Data Management

data_clean = data.dropna()

# subset clustering variables
cluster=data_clean[['ALCEVR1','MAREVER1','ALCPROBS1','DEVIANT1','VIOL1',
'DEP1','ESTEEM1','SCHCONN1','PARACTV', 'PARPRES','FAMCONCT']]
cluster.describe()

# standardize clustering variables to have mean=0 and sd=1
clustervar=cluster.copy()
clustervar['ALCEVR1']=preprocessing.scale(clustervar['ALCEVR1'].astype('float64'))
clustervar['ALCPROBS1']=preprocessing.scale(clustervar['ALCPROBS1'].astype('float64'))
clustervar['MAREVER1']=preprocessing.scale(clustervar['MAREVER1'].astype('float64'))
clustervar['DEP1']=preprocessing.scale(clustervar['DEP1'].astype('float64'))
clustervar['ESTEEM1']=preprocessing.scale(clustervar['ESTEEM1'].astype('float64'))
clustervar['VIOL1']=preprocessing.scale(clustervar['VIOL1'].astype('float64'))
clustervar['DEVIANT1']=preprocessing.scale(clustervar['DEVIANT1'].astype('float64'))
clustervar['FAMCONCT']=preprocessing.scale(clustervar['FAMCONCT'].astype('float64'))
clustervar['SCHCONN1']=preprocessing.scale(clustervar['SCHCONN1'].astype('float64'))
clustervar['PARACTV']=preprocessing.scale(clustervar['PARACTV'].astype('float64'))
clustervar['PARPRES']=preprocessing.scale(clustervar['PARPRES'].astype('float64'))

# split data into train and test sets
clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=123)

# k-means cluster analysis for 1-9 clusters 
from scipy.spatial.distance import cdist
clusters=range(1,10)
meandist=[]

for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(clus_train)
    clusassign = model.predict(clus_train)
    meandist.append(sum(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'), axis=1))
                    / clus_train.shape[0])

"""
Plot average distance from observations from the cluster centroid
to use the Elbow Method to identify number of clusters to choose
"""

plt.plot(clusters, meandist)
plt.xlabel('Number of clusters')
plt.ylabel('Average distance')
plt.title('Selecting k with the Elbow Method')

# Interpret 3 cluster solution
model3=KMeans(n_clusters=3)
model3.fit(clus_train)
clusassign=model3.predict(clus_train)
# plot clusters

from sklearn.decomposition import PCA
pca_2 = PCA(2)
plot_columns = pca_2.fit_transform(clus_train)
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=model3.labels_,)
plt.xlabel('Canonical variable 1')
plt.ylabel('Canonical variable 2')
plt.title('Scatterplot of Canonical Variables for 3 Clusters')
plt.show()

"""
BEGIN multiple steps to merge cluster assignment with clustering variables to examine
cluster variable means by cluster
"""
# create a unique identifier variable from the index for the 
# cluster training data to merge with the cluster assignment variable
clus_train.reset_index(level=0, inplace=True)
# create a list that has the new index variable
cluslist=list(clus_train['index'])
# create a list of cluster assignments
labels=list(model3.labels_)
# combine index variable list with cluster assignment list into a dictionary
newlist=dict(zip(cluslist, labels))
newlist
# convert newlist dictionary to a dataframe
newclus=DataFrame.from_dict(newlist, orient='index')
newclus
# rename the cluster assignment column
newclus.columns = ['cluster']

# now do the same for the cluster assignment variable
# create a unique identifier variable from the index for the 
# cluster assignment dataframe 
# to merge with cluster training data
newclus.reset_index(level=0, inplace=True)
# merge the cluster assignment dataframe with the cluster training variable dataframe
# by the index variable
merged_train=pd.merge(clus_train, newclus, on='index')
merged_train.head(n=100)
# cluster frequencies
merged_train.cluster.value_counts()

"""
END multiple steps to merge cluster assignment with clustering variables to examine
cluster variable means by cluster
"""

# FINALLY calculate clustering variable means by cluster
clustergrp = merged_train.groupby('cluster').mean()
print ("Clustering variable means by cluster")
print(clustergrp)




# validate clusters in training data by examining cluster differences in GPA using ANOVA
# first have to merge GPA with clustering variables and cluster assignment data 
gpa_data=data_clean['GPA1']
# split GPA data into train and test sets
gpa_train, gpa_test = train_test_split(gpa_data, test_size=.3, random_state=123)
gpa_train1=pd.DataFrame(gpa_train)
gpa_train1.reset_index(level=0, inplace=True)
merged_train_all=pd.merge(gpa_train1, merged_train, on='index')
sub1 = merged_train_all[['GPA1', 'cluster']].dropna()

import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

gpamod = smf.ols(formula='GPA1 ~ C(cluster)', data=sub1).fit()
print (gpamod.summary())

print ('means for GPA by cluster')
m1= sub1.groupby('cluster').mean()
print (m1)

print ('standard deviations for GPA by cluster')
m2= sub1.groupby('cluster').std()
print (m2)

mc1 = multi.MultiComparison(sub1['GPA1'], sub1['cluster'])
res1 = mc1.tukeyhsd()
print(res1.summary())




 

LASSO Regression to predict the “School Connectedness” of a student

Lasso regression analysis is a shrinkage and variable selection method for linear regression models. The goal of lasso regression is to obtain the subset of predictors that minimizes prediction error for a quantitative response variable. The lasso does this by imposing a constraint on the model parameters that causes the regression coefficients for some variables to shrink toward zero. Variables with a regression coefficient equal to zero after the shrinkage process are excluded from the model, while variables with non-zero regression coefficients are the ones most strongly associated with the response variable. Explanatory variables can be quantitative, categorical, or both.

NOTE: If you want the full code and dataset and to skip the explanation, scroll down to the end.

Let’s do that in Python Now!

First of all, THE DEPENDENCIES!

#from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older scikit-learn versions
from sklearn.linear_model import LassoLarsCV

Now let’s load the dataset.

#Load the dataset
os.chdir("C:/trees")
data = pd.read_csv("health.csv")

We’ll drop the missing values and create a gender variable called MALE, which is 1 for males and 0 for females.

data_clean = data.dropna()
recode1 = {1:1, 2:0}
data_clean['MALE']= data_clean['BIO_SEX'].map(recode1)

Now we set the predictors and the target variables.

#select predictor variables and target variable as separate data sets 
predvar= data_clean[['MALE','HISPANIC','WHITE','BLACK','NAMERICAN','ASIAN',
'AGE','ALCEVR1','ALCPROBS1','MAREVER1','COCEVER1','INHEVER1','CIGAVAIL','DEP1',
'ESTEEM1','VIOL1','PASSIST','DEVIANT1','GPA1','EXPEL1','FAMCONCT','PARACTV',
'PARPRES']]

target = data_clean.SCHCONN1

Now, since lasso regression applies a penalty to the coefficients in an effort to shrink the not-so-important variables to zero, we need all the variables to be on the same scale.

The code below standardizes all the predictors to have mean=0 and standard deviation=1.

# standardize predictors to have mean=0 and sd=1
predictors=predvar.copy()
from sklearn import preprocessing
predictors['MALE']=preprocessing.scale(predictors['MALE'].astype('float64'))
predictors['HISPANIC']=preprocessing.scale(predictors['HISPANIC'].astype('float64'))
predictors['WHITE']=preprocessing.scale(predictors['WHITE'].astype('float64'))
predictors['BLACK']=preprocessing.scale(predictors['BLACK'].astype('float64'))
predictors['NAMERICAN']=preprocessing.scale(predictors['NAMERICAN'].astype('float64'))
predictors['ASIAN']=preprocessing.scale(predictors['ASIAN'].astype('float64'))
predictors['AGE']=preprocessing.scale(predictors['AGE'].astype('float64'))
predictors['ALCEVR1']=preprocessing.scale(predictors['ALCEVR1'].astype('float64'))
predictors['ALCPROBS1']=preprocessing.scale(predictors['ALCPROBS1'].astype('float64'))
predictors['MAREVER1']=preprocessing.scale(predictors['MAREVER1'].astype('float64'))
predictors['COCEVER1']=preprocessing.scale(predictors['COCEVER1'].astype('float64'))
predictors['INHEVER1']=preprocessing.scale(predictors['INHEVER1'].astype('float64'))
predictors['CIGAVAIL']=preprocessing.scale(predictors['CIGAVAIL'].astype('float64'))
predictors['DEP1']=preprocessing.scale(predictors['DEP1'].astype('float64'))
predictors['ESTEEM1']=preprocessing.scale(predictors['ESTEEM1'].astype('float64'))
predictors['VIOL1']=preprocessing.scale(predictors['VIOL1'].astype('float64'))
predictors['PASSIST']=preprocessing.scale(predictors['PASSIST'].astype('float64'))
predictors['DEVIANT1']=preprocessing.scale(predictors['DEVIANT1'].astype('float64'))
predictors['GPA1']=preprocessing.scale(predictors['GPA1'].astype('float64'))
predictors['EXPEL1']=preprocessing.scale(predictors['EXPEL1'].astype('float64'))
predictors['FAMCONCT']=preprocessing.scale(predictors['FAMCONCT'].astype('float64'))
predictors['PARACTV']=preprocessing.scale(predictors['PARACTV'].astype('float64'))
predictors['PARPRES']=preprocessing.scale(predictors['PARPRES'].astype('float64'))

Now, scikit-learn offers a few models for running the lasso, and we’re going to use the LAR model, which stands for least angle regression.

The model starts off with no predictors and adds one predictor at each step: it first adds the predictor most correlated with the target variable, and keeps repeating this process until all the variables have been considered. Along the way the parameter estimates are shrunk, and variables whose coefficients reach zero are removed from the model.

# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, 
 test_size=.3, random_state=123)

# specify the lasso regression model
model=LassoLarsCV(cv=10, precompute=False).fit(pred_train,tar_train)

The above code splits the data into training and testing sets (70:30) and then fits the LAR model with cv=10, which means we are using 10-fold cross-validation.

Read more about k-fold here.
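For reference, here is what k-fold splitting does, sketched on ten toy rows with KFold from sklearn.model_selection (the modern home of the old sklearn.cross_validation module):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy observations with 2 features each
kf = KFold(n_splits=5)

for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # Each fold holds out a different fifth of the data for validation
    print(fold, len(train_idx), len(test_idx))  # each line: fold, 8 train rows, 2 test rows
```

LassoLarsCV with cv=10 does this with ten folds internally, fitting the lasso path on each training portion and scoring it on the held-out portion to pick alpha.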

Then we create a dictionary mapping each variable to its model coefficient, which indicates how important that variable is for the prediction.

# print variable names and regression coefficients
print(dict(zip(predictors.columns, model.coef_)))

The output is:

{'MALE': -0.21508693783564758, 'HISPANIC': 0.20300474500010574, 'WHITE': 0.0, 'BLACK': -0.69364786863867023, 'NAMERICAN': -0.10784573616426067, 'ASIAN': 0.18869030694622255, 'AGE': 0.21734102065275895, 'ALCEVR1': -0.32499649609663611, 'ALCPROBS1': 0.0, 'MAREVER1': -0.15980377488873126, 'COCEVER1': -0.20000921703104557, 'INHEVER1': 0.0, 'CIGAVAIL': -0.1098387918895065, 'DEP1': -0.85417844475686255, 'ESTEEM1': 1.0974098143740438, 'VIOL1': -0.63926717022798629, 'PASSIST': 0.0, 'DEVIANT1': -0.41808246027792678, 'GPA1': 0.66557641766976372, 'EXPEL1': -0.073828998548618921, 'FAMCONCT': 0.5152729478743262, 'PARACTV': 0.29991192982912851, 'PARPRES': 0.0}

As you can see, out of the 23 predictors only 18 were kept; the rest were shrunk to zero.
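Counting the surviving predictors from a coefficient dictionary like the one above is a one-liner; shown here on a small made-up dictionary (hypothetical values, not the real model output):

```python
# Toy coefficient dict standing in for dict(zip(predictors.columns, model.coef_))
coefs = {'WHITE': 0.0, 'ESTEEM1': 1.097, 'GPA1': 0.666, 'PASSIST': 0.0, 'DEP1': -0.854}

# Predictors the lasso kept vs. those it shrank to exactly zero
kept = [name for name, c in coefs.items() if c != 0.0]
dropped = [name for name, c in coefs.items() if c == 0.0]
print(len(kept), len(dropped))  # 3 2
```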

Variables like self-esteem and GPA were positively associated with school connectedness, while variables like Black ethnicity and depression were negatively associated.

Now let’s plot the coefficients, which gives a clearer picture of our predictors and their importance.

# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
 label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')

The output:

[Plot: regression coefficient progression along the lasso paths]

The purple line indicates self-esteem (which plays a positive role).

The red line at the bottom indicates depression (which plays a negative role).

We’ll also plot the mean squared error for each of the k folds used in the cross-validation.

# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=-1), 'k',
 label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
 label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')

Output:

[Plot: mean squared error on each cross-validation fold]

Now we’ll print the mean squared error for the training and the testing datasets.

# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print ('training data MSE')
print(train_error)
print ('test data MSE')
print(test_error)

Output:

training data MSE
18.1485726641
test data MSE
17.2925174272

We can see that the test MSE (17.29) was approximately equal to the training MSE (18.15), i.e. the prediction accuracy was stable across datasets, although the errors themselves are fairly large.

Now we’re going to print the R-squared values, which tell us how much of the variance in the response the model explains, and hint at whether it is overfitted.

 # R-square from training and test data
rsquared_train=model.score(pred_train,tar_train)
rsquared_test=model.score(pred_test,tar_test)
print ('training data R-square')
print(rsquared_train)
print ('test data R-square')
print(rsquared_test)

Output:

training data R-square
0.333611136927
test data R-square
0.31001113416

As you can see, the training and test R-squared values are close (0.33 vs 0.31), so the model does not suffer from high variance, which is good.

Ideally we want a model with both low variance and low bias; that’s where the trade-off comes in!

Conclusion:

From a pool of 23 categorical and quantitative predictor variables, those that best predicted the quantitative response variable measuring school connectedness in adolescents were selected. Categorical predictors included gender and a series of 5 binary categorical variables for race and ethnicity (Hispanic, White, Black, Native American and Asian), chosen to improve interpretability of the selected model with fewer predictors. Binary substance use variables were measured with individual questions about whether the adolescent had ever used alcohol, marijuana, cocaine or inhalants. Additional categorical variables included the availability of cigarettes in the home, whether or not either parent was on public assistance, and any experience with being expelled from school. Quantitative predictor variables included age, alcohol problems, and a measure of deviance that included such behaviors as vandalism, other property damage, lying, stealing, running away, driving without permission, selling drugs, and skipping school. Scales measuring violence, depression, self-esteem, parental presence, parental activities, family connectedness and grade point average were also included. All predictor variables were standardized to have a mean of zero and a standard deviation of one.

Data were randomly split into a training set that included 70% of the observations (N=3201) and a test set that included 30% of the observations (N=1701). The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.

18 out of 23 variables were selected, and these 18 variables accounted for 33.4% of the variance in the school connectedness response variable.
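With lasso, “selected” simply means a non-zero coefficient, so the count above comes straight from the fitted model. As a self-contained sketch (the real dataset isn’t loaded here, so this uses synthetic data with a known number of informative predictors):

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV

# Synthetic stand-in for the real dataset: 10 informative + 13 noise predictors
rng = np.random.RandomState(123)
X = rng.randn(500, 23)
true_coef = np.zeros(23)
true_coef[:10] = rng.uniform(1, 3, 10)   # only 10 variables actually matter
y = X @ true_coef + rng.randn(500)

model = LassoLarsCV(cv=10).fit(X, y)

# Lasso zeroes out coefficients, so "selected" = non-zero coefficient
n_selected = int(np.sum(model.coef_ != 0))
print('variables selected:', n_selected, 'out of', X.shape[1])
```

On the real data, the same `np.sum(model.coef_ != 0)` on the fitted `model` gives the 18 reported above.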

THE DATASET USED:

CLICK HERE

THE COMPLETE CODE:

# -*- coding: utf-8 -*-
"""
Created on Sat Nov 18 10:16:09 2017

@author: Aditya
"""

#from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions
from sklearn.linear_model import LassoLarsCV
 
#Load the dataset
os.chdir("C:/trees")
data = pd.read_csv("health.csv")

#upper-case all DataFrame column names
data.columns = map(str.upper, data.columns)

# Data Management
data_clean = data.dropna().copy()
recode1 = {1: 1, 2: 0}
data_clean['MALE'] = data_clean['BIO_SEX'].map(recode1)

#select predictor variables and target variable as separate data sets 
predvar= data_clean[['MALE','HISPANIC','WHITE','BLACK','NAMERICAN','ASIAN',
'AGE','ALCEVR1','ALCPROBS1','MAREVER1','COCEVER1','INHEVER1','CIGAVAIL','DEP1',
'ESTEEM1','VIOL1','PASSIST','DEVIANT1','GPA1','EXPEL1','FAMCONCT','PARACTV',
'PARPRES']]

target = data_clean.SCHCONN1
 
# standardize predictors to have mean=0 and sd=1
from sklearn import preprocessing
predictors = predvar.copy()
# loop over every predictor column so none is missed
for col in predictors.columns:
    predictors[col] = preprocessing.scale(predictors[col].astype('float64'))

# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, 
 test_size=.3, random_state=123)

# specify the lasso regression model
model=LassoLarsCV(cv=10, precompute=False).fit(pred_train,tar_train)

# print variable names and regression coefficients
print(dict(zip(predictors.columns, model.coef_)))

# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
 label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')

# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')  # cv_mse_path_ in older sklearn
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1), 'k',
 label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
 label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')


# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print ('training data MSE')
print(train_error)
print ('test data MSE')
print(test_error)

# R-square from training and test data
rsquared_train=model.score(pred_train,tar_train)
rsquared_test=model.score(pred_test,tar_test)
print ('training data R-square')
print(rsquared_train)
print ('test data R-square')
print(rsquared_test)

Random Forest Approach to Predict the Chances of a Person Being Prone to Alcohol or Substance Use!

Hello and Welcome!

So in this blog post we’ll learn how to build a random forest classifier using scikit-learn, in under 80 lines of Python code.

NOTE: If you want to skip the explanation of snippets, the fully working code is available at the end of the blog along with the dataset.

The tools used are

Spyder IDE

Anaconda Distribution

First of all we need to import a few dependencies that we’re going to work with

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
 # Feature Importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier

Now we need to tell our code to look in the directory where the dataset is located. This is done as follows:

os.chdir(r"C:\TREES")

Now we need to read the CSV file and drop rows with missing (NA) values

AH_data = pd.read_csv("health.csv")
data_clean = AH_data.dropna()

Let’s describe the dataset now, with data_clean.describe(), to see what we’re dealing with

BIO_SEX HISPANIC WHITE BLACK NAMERICAN \
count 4575.000000 4575.000000 4575.000000 4575.000000 4575.000000 
mean 1.521093 0.111038 0.683279 0.236066 0.036284 
std 0.499609 0.314214 0.465249 0.424709 0.187017 
min 1.000000 0.000000 0.000000 0.000000 0.000000 
25% 1.000000 0.000000 0.000000 0.000000 0.000000 
50% 2.000000 0.000000 1.000000 0.000000 0.000000 
75% 2.000000 0.000000 1.000000 0.000000 0.000000 
max 2.000000 1.000000 1.000000 1.000000 1.000000

ASIAN age TREG1 ALCEVR1 ALCPROBS1 \
count 4575.000000 4575.000000 4575.000000 4575.000000 4575.000000 
mean 0.040437 16.493052 0.176393 0.527432 0.369180 
std 0.197004 1.552174 0.381196 0.499302 0.894947 
min 0.000000 12.676712 0.000000 0.000000 0.000000 
25% 0.000000 15.254795 0.000000 0.000000 0.000000 
50% 0.000000 16.509589 0.000000 1.000000 0.000000 
75% 0.000000 17.679452 0.000000 1.000000 0.000000 
max 1.000000 21.512329 1.000000 1.000000 6.000000

ESTEEM1 VIOL1 PASSIST DEVIANT1 \
count ... 4575.000000 4575.000000 4575.000000 4575.000000 
mean ... 40.952131 1.618579 0.102514 2.645027 
std ... 5.381439 2.593230 0.303356 3.520554 
min ... 18.000000 0.000000 0.000000 0.000000 
25% ... 38.000000 0.000000 0.000000 0.000000 
50% ... 40.000000 0.000000 0.000000 1.000000 
75% ... 45.000000 2.000000 0.000000 4.000000 
max ... 50.000000 19.000000 1.000000 27.000000

SCHCONN1 GPA1 EXPEL1 FAMCONCT PARACTV \
count 4575.000000 4575.000000 4575.000000 4575.000000 4575.000000 
mean 28.360656 2.815647 0.040219 22.570557 6.290710 
std 5.156385 0.770167 0.196493 2.614754 3.360219 
min 6.000000 1.000000 0.000000 6.300000 0.000000 
25% 25.000000 2.250000 0.000000 21.700000 4.000000 
50% 29.000000 2.750000 0.000000 23.700000 6.000000 
75% 32.000000 3.500000 0.000000 24.300000 9.000000 
max 38.000000 4.000000 1.000000 25.000000 18.000000

PARPRES 
count 4575.000000 
mean 13.398033 
std 2.085837 
min 3.000000 
25% 12.000000 
50% 14.000000 
75% 15.000000 
max 15.000000

This is what we’re dealing with!

Now we set our predictors (features), set the target variable, and divide the dataset into training and testing data (60:40 ratio).
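The code for this step is in the full listing at the end; a minimal sketch of the mechanics (using a tiny synthetic frame in place of the real `data_clean`, just to keep it runnable) looks like:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions

# Tiny synthetic stand-in for data_clean, only to show the mechanics
rng = np.random.RandomState(0)
data_clean = pd.DataFrame({
    'BIO_SEX': rng.randint(1, 3, 100),
    'ALCEVR1': rng.randint(0, 2, 100),
    'TREG1':   rng.randint(0, 2, 100),   # target: regular smoker yes/no
})

predictors = data_clean[['BIO_SEX', 'ALCEVR1']]
targets = data_clean.TREG1

# test_size=.4 gives the 60:40 train/test ratio used in this post
pred_train, pred_test, tar_train, tar_test = train_test_split(
    predictors, targets, test_size=.4, random_state=0)

print(pred_train.shape, pred_test.shape)  # (60, 2) (40, 2)
```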

Then we’re going to import RandomForestClassifier from sklearn.ensemble

from sklearn.ensemble import RandomForestClassifier

Now finally let's build the forest

classifier=RandomForestClassifier(n_estimators=25)
classifier=classifier.fit(pred_train,tar_train)

Here 25 is the number of trees that the forest will contain.

Let’s print out the confusion matrix and the accuracy

[[1428 92]
 [ 183 127]]
0.849726775956

85% Accuracy. NOT BAD!! 😀

In the confusion matrix the diagonal gives the numbers of true negatives (1428) and true positives (127), while 92 and 183 are the false positives and false negatives respectively.
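The accuracy printed above is just the diagonal sum over the total; recomputing it (plus precision and recall) by hand from the matrix is a quick sanity check. The numbers below are the ones from this run:

```python
import numpy as np

# Confusion matrix from the run above: rows = actual, columns = predicted
cm = np.array([[1428,  92],
               [ 183, 127]])

tn, fp = cm[0]
fn, tp = cm[1]

accuracy  = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall    = tp / (tp + fn)

print(f'accuracy  = {accuracy:.4f}')   # 0.8497, matching the output above
print(f'precision = {precision:.4f}')
print(f'recall    = {recall:.4f}')
```

Note that while overall accuracy is ~85%, recall on the positive class is much lower, because the classes are imbalanced.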

Time to display the importance of each attribute

# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train,tar_train)
# display the relative importance of each attribute
print(model.feature_importances_)

[ 0.02305763 0.01585706 0.01981318 0.01882172 0.00819704 0.00474358
 0.06132761 0.0596015 0.0433457 0.11824281 0.02111982 0.01668584
 0.03413818 0.05850854 0.05522334 0.044511 0.01733067 0.06630337
 0.05980527 0.07483757 0.01314004 0.06078445 0.05635996 0.0482441 ]

We see that whether the person has ever used marijuana has the most importance, and whether the person is Asian or not has the least (LoL 🙂 )
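Reading that off the raw array is error-prone, so it helps to pair each score with its column name and sort. The sketch below hardcodes the importances printed above and assumes they are in the same order as the predictors list from the full code at the end:

```python
# Predictor columns, in the order assumed from the predictors list below
names = ['BIO_SEX','HISPANIC','WHITE','BLACK','NAMERICAN','ASIAN','age',
         'ALCEVR1','ALCPROBS1','marever1','cocever1','inhever1','cigavail',
         'DEP1','ESTEEM1','VIOL1','PASSIST','DEVIANT1','SCHCONN1','GPA1',
         'EXPEL1','FAMCONCT','PARACTV','PARPRES']
importances = [0.02305763, 0.01585706, 0.01981318, 0.01882172, 0.00819704,
               0.00474358, 0.06132761, 0.0596015, 0.0433457, 0.11824281,
               0.02111982, 0.01668584, 0.03413818, 0.05850854, 0.05522334,
               0.044511, 0.01733067, 0.06630337, 0.05980527, 0.07483757,
               0.01314004, 0.06078445, 0.05635996, 0.0482441]

# Sort features from most to least important
ranked = sorted(zip(names, importances), key=lambda t: t[1], reverse=True)
for name, score in ranked[:3]:
    print(f'{name:10s} {score:.4f}')
# marever1 (ever used marijuana) comes out on top; ASIAN is last
```

In practice you would use `model.feature_importances_` directly in place of the hardcoded list.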

Hold on, we aren’t done yet.

Do we actually need 25 trees in our forest?

trees=range(25)
accuracy=np.zeros(25)

for idx in range(len(trees)):
 classifier=RandomForestClassifier(n_estimators=idx + 1)
 classifier=classifier.fit(pred_train,tar_train)
 predictions=classifier.predict(pred_test)
 accuracy[idx]=sklearn.metrics.accuracy_score(tar_test, predictions)
 
plt.cla()
plt.plot(trees, accuracy)

 

The above code trains forests with 1 up to 25 trees, stores the accuracy of each run in an array, and then plots it with matplotlib’s plot function (no Graphviz this time, thank God!)

Here’s the plot:
[Plot: test accuracy vs. number of trees]

We see that with only one tree (just like a single decision tree) the accuracy was close to 83%, and even with 25 trees it only increased to about 85%.

CONCLUSION:

  1. Random forests generalize well on this data.
  2. Individual trees in a forest cannot be interpreted on their own; only the forest as a whole is, which can be a disadvantage, since a handful of trees may give the same result as 100 trees.

The complete code:

# -*- coding: utf-8 -*-
"""
Created on Wed Nov 15 21:09:27 2017

@author: Aditya
"""

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
 # Feature Importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier

os.chdir(r"C:\TREES")

#Load the dataset

AH_data = pd.read_csv("health.csv")
data_clean = AH_data.dropna()

data_clean.dtypes
data_clean.describe()

#Split into training and testing sets

predictors = data_clean[['BIO_SEX','HISPANIC','WHITE','BLACK','NAMERICAN','ASIAN','age',
'ALCEVR1','ALCPROBS1','marever1','cocever1','inhever1','cigavail','DEP1','ESTEEM1','VIOL1',
'PASSIST','DEVIANT1','SCHCONN1','GPA1','EXPEL1','FAMCONCT','PARACTV','PARPRES']]

targets = data_clean.TREG1

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

pred_train.shape
pred_test.shape
tar_train.shape
tar_test.shape

#Build model on training data
from sklearn.ensemble import RandomForestClassifier

classifier=RandomForestClassifier(n_estimators=25)
classifier=classifier.fit(pred_train,tar_train)

predictions=classifier.predict(pred_test)

print(sklearn.metrics.confusion_matrix(tar_test,predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))




# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(pred_train,tar_train)
# display the relative importance of each attribute
print(model.feature_importances_)




"""
Running a different number of trees and see the effect
 of that on the accuracy of the prediction
"""

trees=range(25)
accuracy=np.zeros(25)

for idx in range(len(trees)):
 classifier=RandomForestClassifier(n_estimators=idx + 1)
 classifier=classifier.fit(pred_train,tar_train)
 predictions=classifier.predict(pred_test)
 accuracy[idx]=sklearn.metrics.accuracy_score(tar_test, predictions)
 
plt.cla()
plt.plot(trees, accuracy)

 

And the dataset used : [CLICK HERE]

I’ve blogged about the same analysis with decision trees here: Decision Tree Approach To Predict the Chances of A Person being Addicted to Substances or Alcohol

Feel Free to Comment or Contact me

 

Decision Tree Approach To Predict the Chances of A Person being Addicted to Substances or Alcohol

Hey There!

In this article I will explain in detail how I built a decision tree to predict the chances of a person being addicted to substances or alcohol.

Ok before I start let me mention the tools I used.

Python 3.x through Anaconda
 Spyder IDE

NOTE: If you want the fully working code with the dataset, just scroll to the end

So let’s get right into the code!

Let’s first import the dependencies that we’ll need, shall we?

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics
os.chdir(r"C:\TREES")

This is the directory where my dataset is actually located, so we’re telling our code to look for it there.

AH_data = pd.read_csv("health.csv")

data_clean = AH_data.dropna()

data_clean.dtypes
data_clean.describe()

What the above snippet does is load the dataset, drop rows with missing values (decision trees can’t work with those), and then “describe” our data.

BIO_SEX HISPANIC WHITE BLACK NAMERICAN \
count 4575.000000 4575.000000 4575.000000 4575.000000 4575.000000 
mean 1.521093 0.111038 0.683279 0.236066 0.036284 
std 0.499609 0.314214 0.465249 0.424709 0.187017 
min 1.000000 0.000000 0.000000 0.000000 0.000000 
25% 1.000000 0.000000 0.000000 0.000000 0.000000 
50% 2.000000 0.000000 1.000000 0.000000 0.000000 
75% 2.000000 0.000000 1.000000 0.000000 0.000000 
max 2.000000 1.000000 1.000000 1.000000 1.000000

ASIAN age TREG1 ALCEVR1 ALCPROBS1 \
count 4575.000000 4575.000000 4575.000000 4575.000000 4575.000000 
mean 0.040437 16.493052 0.176393 0.527432 0.369180 
std 0.197004 1.552174 0.381196 0.499302 0.894947 
min 0.000000 12.676712 0.000000 0.000000 0.000000 
25% 0.000000 15.254795 0.000000 0.000000 0.000000 
50% 0.000000 16.509589 0.000000 1.000000 0.000000 
75% 0.000000 17.679452 0.000000 1.000000 0.000000 
max 1.000000 21.512329 1.000000 1.000000 6.000000

ESTEEM1 VIOL1 PASSIST DEVIANT1 \
count ... 4575.000000 4575.000000 4575.000000 4575.000000 
mean ... 40.952131 1.618579 0.102514 2.645027 
std ... 5.381439 2.593230 0.303356 3.520554 
min ... 18.000000 0.000000 0.000000 0.000000 
25% ... 38.000000 0.000000 0.000000 0.000000 
50% ... 40.000000 0.000000 0.000000 1.000000 
75% ... 45.000000 2.000000 0.000000 4.000000 
max ... 50.000000 19.000000 1.000000 27.000000

SCHCONN1 GPA1 EXPEL1 FAMCONCT PARACTV \
count 4575.000000 4575.000000 4575.000000 4575.000000 4575.000000 
mean 28.360656 2.815647 0.040219 22.570557 6.290710 
std 5.156385 0.770167 0.196493 2.614754 3.360219 
min 6.000000 1.000000 0.000000 6.300000 0.000000 
25% 25.000000 2.250000 0.000000 21.700000 4.000000 
50% 29.000000 2.750000 0.000000 23.700000 6.000000 
75% 32.000000 3.500000 0.000000 24.300000 9.000000 
max 38.000000 4.000000 1.000000 25.000000 18.000000

PARPRES 
count 4575.000000 
mean 13.398033 
std 2.085837 
min 3.000000 
25% 12.000000 
50% 14.000000 
75% 15.000000 
max 15.000000

[8 rows x 25 columns]

This is the description of the data that we’re going to work with (I’ll share the dataset at the end of the blog)

Now we have to label our data and divide it into training and testing data, so let’s do that

predictors = data_clean[['BIO_SEX','HISPANIC','WHITE','BLACK','NAMERICAN','ASIAN',
'age','ALCEVR1','ALCPROBS1','marever1','cocever1','inhever1','cigavail','DEP1',
'ESTEEM1','VIOL1','PASSIST','DEVIANT1','SCHCONN1','GPA1','EXPEL1','FAMCONCT',
'PARACTV','PARPRES']]

targets = data_clean.TREG1

pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, targets, test_size=.4)

And now let’s print out the shapes of our data:

print(pred_train.shape)
print(pred_test.shape)
print(tar_train.shape)
print(tar_test.shape)

This gives an output:

(2745, 24)
(1830, 24)
(2745,)
(1830,)

And finally we’ll use sklearn to build the decision tree and print out the confusion matrix along with the accuracy

classifier=DecisionTreeClassifier()
classifier=classifier.fit(pred_train,tar_train)

predictions=classifier.predict(pred_test)

print(sklearn.metrics.confusion_matrix(tar_test,predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))

This gives an output:

[[1323 182]
 [ 189 136]]
0.797267759563

79% Accuracy! Not Bad 😉

And, not just for the sake of it, we’ll display our actual tree

#Displaying the decision tree
from sklearn import tree
with open("classifier.txt", "w") as f:
    tree.export_graphviz(classifier, out_file=f)

I didn’t use pydotplus because it was creating problems. The above code generates a text file called classifier.txt, which, when uploaded to http://webgraphviz.com/, gives a huge tree.

[Image: a part of the exported decision tree]

It’s huge and doesn’t wrap on the page
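If you want a tree that actually fits on the page, one standard knob is `max_depth`, which caps how many levels the tree can grow. A self-contained sketch (the post’s dataset isn’t loaded here, so this uses sklearn’s `make_classification` as a stand-in, with the same `export_graphviz` flow):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Synthetic stand-in data, just to show the mechanics
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# max_depth=3 caps the tree at 3 levels so the Graphviz render stays readable
small_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

with open("small_classifier.txt", "w") as f:
    export_graphviz(small_tree, out_file=f)

print('tree depth:', small_tree.get_depth())
```

A shallower tree is usually a bit less accurate on the training data, but far easier to read, and often generalizes better.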


**************************************************************************************

As promised the complete code:

# -*- coding: utf-8 -*-
"""
Spyder Editor

This is a temporary script file.
"""

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
import sklearn.metrics

os.chdir(r"C:\TREES")

#Load the dataset

AH_data = pd.read_csv("health.csv")

data_clean = AH_data.dropna()

data_clean.dtypes
print(data_clean.describe())




predictors = data_clean[['BIO_SEX','HISPANIC','WHITE','BLACK','NAMERICAN','ASIAN',
'age','ALCEVR1','ALCPROBS1','marever1','cocever1','inhever1','cigavail','DEP1',
'ESTEEM1','VIOL1','PASSIST','DEVIANT1','SCHCONN1','GPA1','EXPEL1','FAMCONCT','PARACTV',
'PARPRES']]

targets = data_clean.TREG1

pred_train, pred_test, tar_train, tar_test=train_test_split(predictors, targets, test_size=.4)
print(pred_train.shape)
print(pred_test.shape)
print(tar_train.shape)
print(tar_test.shape)




classifier=DecisionTreeClassifier()
classifier=classifier.fit(pred_train,tar_train)

predictions=classifier.predict(pred_test)

print(sklearn.metrics.confusion_matrix(tar_test,predictions))
print(sklearn.metrics.accuracy_score(tar_test, predictions))




#Displaying the decision tree
from sklearn import tree
with open("classifier.txt", "w") as f:
    tree.export_graphviz(classifier, out_file=f)

Here’s the link to download the dataset:

Dataset Used

Let’s Dig In

So What is Machine Learning?

Arthur Lee Samuel was an American pioneer in the field of computer gaming and artificial intelligence. He coined the term “machine learning” in 1959.


Arthur Samuel

So the definition given by Samuel was

Machine learning is the field of study that gives computers the ability to learn
without being explicitly programmed.

That’s one of the easiest definitions to understand, and it was stated back in 1959.

Then in 1998, Tom M. Mitchell, an American computer scientist and E. Fredkin University Professor at Carnegie Mellon University, stated that


Tom M. Mitchell

"A computer program is said to learn from experience E with respect to 
some class of tasks T and performance measure P if its performance 
at tasks in T, as measured by P, improves with experience E."

Example: Consider the task of playing the game “Go”

E = the experience of playing many games of Go

T = the task of playing Go.

P = the probability that the program will win the next game.

Now that you’ve understood what machine learning is, let’s go right into it!!!

_________________________________________________________________________________________

On a broad basis, machine learning is mainly considered to be of 2 kinds (there are more, which we will look at later). They are:

1) Supervised Learning

In supervised learning, we are given a data set and already know what our
 correct output should look like, having the idea that there is a 
relationship between the input and the output.

In simple words, we supervise the computer by teaching it “what is what”.

Again, there are two types of problems that we are going to encounter.

1) Regression

Regression is basically predicting a continuous-valued output, like 23.56 or 45.89.

E.g.:

1) Predicting the age of a person based on their image. (Age here takes continuous values.)

2) Predicting the value of a stock based on its previous values.


2)Classification

Classification involves the output belonging to one of a set of classes, i.e. we predict discrete-valued outputs.

E.g:

1) Classifying whether a person is older or younger than 40 years. (Here there are 2 classes: one with people older than 40 and the other with people younger than 40.)

2) Classifying whether a tumor is benign or malignant.
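To make the distinction concrete, here is a minimal sketch of a classifier for the over/under-40 example. The ages and labels are hypothetical, made up purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: age -> class (1 = older than 40, 0 = younger)
ages   = [[23], [35], [41], [52], [19], [63], [38], [45]]
labels = [0, 0, 1, 1, 0, 1, 0, 1]

clf = DecisionTreeClassifier().fit(ages, labels)
print(clf.predict([[30], [50]]))  # → [0 1]
```

A regression model would instead output a continuous number (e.g. a predicted age), not one of two class labels.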

____________________________________________________________________________________________

2)Unsupervised Learning

In this, “label-less” data is just thrown at you, and you are required to find patterns or clusters in that data.

You don’t Supervise your system as to what you’re giving it.

Speaking more formally, I’d like to mention the classic definition by Andrew Ng, a professor at Stanford University, which states:

Unsupervised learning allows us to approach problems with little or no idea
what our results should look like. We can derive structure from data where 
we don't necessarily know the effect of the variables.

We can derive this structure by clustering the data based on relationships 
among the variables in the data.

With unsupervised learning there is no feedback based on the prediction 
results.

Alrighty, here are some examples of where you can use it:

1) Clustering: Take a collection of 1,000,000 different genes, and find a way to automatically group these genes into groups that are somehow similar or related by different variables, such as lifespan, location, roles, and so on.

2) Non-clustering: The “Cocktail Party Algorithm” allows you to find structure in a chaotic environment (i.e. identifying individual voices and music from a mesh of sounds at a cocktail party).

3) Organizing computer clusters

4) Social network analysis

5) Market segmentation
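The clustering case is easy to sketch with scikit-learn’s KMeans. The 2-D points below are made up (two well-separated blobs), just to show that no labels are ever given to the algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabelled 2-D points forming two obvious blobs (fabricated for illustration)
rng = np.random.RandomState(0)
points = np.vstack([rng.randn(50, 2) + [0, 0],
                    rng.randn(50, 2) + [10, 10]])

# No labels are supplied -- KMeans finds the two groups on its own
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(np.bincount(kmeans.labels_))  # two clusters of 50 points each
```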


Great, you made it all the way here!

Congrats, you now know what machine learning is and its types. Go share this with your friends and flaunt your skills!
Hope to see you in the next blog, where we jump right into the math of machine learning.

Welcome!

Hey There!

So you’ve finally found your way here!

That’s Great to have you here.

I’ll try my best to explain whatever I learn here, and the concepts will range from machine learning to algorithms to other computer science concepts.

Let’s Have Fun Then!

See you in my future blogs!