Assignment

Final Report 

Approach

For the final model, we selected an ensemble of 5-layer artificial neural networks trained on an upsampled training dataset.

  1. Class Imbalance
  • Upsampling of the minority class by resampling with replacement
  2. Feature Selection
  • Removal of zero-variance variables from the training dataset
  • Addition of new variables using matrix decomposition methods such as SVD, PCA, GRP and SRP
  3. Performance Evaluation
  • Splitting of the train dataset into train and validation datasets to evaluate model performance on unseen data
  4. Neural Network Ensemble
  • Making predictions using an ensemble of trained neural networks

Approach Details 

  1. Class Imbalance

The dataset had a severe class imbalance, with only 72 observations from the minority class out of 800 in total. To handle the imbalance, different approaches to upsampling the dataset were tried:

  • Resampling with replacement
  • ADASYN
  • SMOTE

Of the three approaches, resampling with replacement gave the best score on the validation dataset, so it was chosen for the final model. The minority class was upsampled 10 times to make the number of observations from the two classes roughly equal.
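As a minimal sketch of this upsampling step (using a tiny toy frame in place of the real data, and an illustrative `upsamplingFactor`), duplicating the minority rows and reshuffling looks like:

```python
import pandas as pd

# Toy frame standing in for the training data: 8 majority (class 0)
# rows and 1 minority (class 1) row.
df = pd.DataFrame({"x": range(9), "Class": [0] * 8 + [1]})

minority = df[df["Class"] == 1]
upsamplingFactor = 8  # append the minority rows this many times

for _ in range(upsamplingFactor):
    df = pd.concat([df, minority], axis=0)

# Shuffle so the duplicated rows are spread through the dataset
df = df.sample(frac=1, random_state=0).reset_index(drop=True)
print(df["Class"].value_counts())
```

After this, the two classes are nearly balanced (9 minority rows vs 8 majority rows in the toy example).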

  2. Feature Selection

Given the sparse nature of the dataset, matrix decomposition methods were used to compress the raw variables into fewer variables with greater predictive power. For this purpose, four algorithms from the scikit-learn package were used:

  • Truncated SVD
  • Principal Component Analysis
  • Gaussian Random Projection
  • Sparse Random Projection

For each of these decompositions, 50 features were selected for use in the training dataset.
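As a reduced illustration of one of these decompositions (Truncated SVD on a random sparse binary matrix standing in for the one-hot data; the actual run uses 50 components per method):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Random sparse binary matrix standing in for the one-hot training data
X = (rng.random((200, 1000)) < 0.01).astype(float)

tsvd = TruncatedSVD(n_components=5, random_state=100)  # 50 in the actual model
X_svd = tsvd.fit_transform(X)
print(X_svd.shape)  # (200, 5)
```

The other three methods (PCA, GRP, SRP) are applied the same way, each contributing its own block of derived columns.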

  3. Performance Evaluation

To test the performance of the model on unseen data, the training dataset was divided into a training and a validation dataset. A split ratio of 0.75 was selected to ensure enough data for training the model.
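A sketch of the split with placeholder data (same `test_size` as the 0.75/0.25 ratio described above; the array contents are illustrative only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)   # placeholder features
y = np.array([0] * 90 + [1] * 10)    # placeholder labels

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=100)
print(len(X_tr), len(X_val))  # 75 25
```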

  4. Neural Network Ensemble

Since a neural network's performance can vary greatly from one training epoch to the next, we take predictions from the network at each epoch. To verify the performance of the model at that particular epoch, its predictions on the validation dataset are taken and the Matthews correlation coefficient is calculated. If the coefficient is greater than a minimum threshold (selected as 0.85 by trial and error), the model's predictions on the test dataset are added to the final ensemble.
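This gate-and-vote logic can be sketched on toy label arrays (the values are illustrative; the solution code uses `sklearn.metrics.matthews_corrcoef` in the same way):

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

threshold = 0.85  # minimum validation MCC for an epoch's predictions to enter the ensemble
val_true = [0, 0, 1, 1, 1, 0]
val_pred = [0, 0, 1, 1, 0, 0]  # one false negative

mcc = matthews_corrcoef(val_true, val_pred)
keep = mcc > threshold  # this epoch's test predictions are kept only if they clear the gate

# Majority vote across the kept per-epoch test predictions (3 models, 3 test rows)
kept_preds = np.array([[1, 0, 1],
                       [1, 1, 0],
                       [0, 0, 1]])
votes = kept_preds.sum(axis=0)                          # column-wise vote counts
final = (votes > kept_preds.shape[0] // 2).astype(int)  # majority label per row
```

A row of the test set receives label 1 only when strictly more than half of the kept models predict 1 for it.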

At present, since the performance of the model on the train and validation datasets is good, we limit the number of training epochs to 5. The batch size is set to 256, given the size of the training dataset and the time taken to train each epoch.

For the model architecture, 4 hidden layers of sizes 512, 256, 128 and 32 were chosen, keeping in mind the number of variables in the training dataset.

AdaDelta was chosen as the optimizer, as it outperformed the other options tried in terms of both training time and performance.

Solution 

main.py 

import pandas as pd
import numpy as np
import math

def evalModel(model, train, trainLabel, test, testLabel):
    '''
    For a particular model, calculates performance metrics for train and test data
    '''
    trainpred = model.predict(train)
    testpred = model.predict(test)
    trainpredlabel = NNpred(trainpred)
    testpredlabel = NNpred(testpred)
    trainlabel = NNpred(trainLabel)
    testlabel = NNpred(testLabel)
    trainMetric = calcMetric(trainpredlabel, trainlabel)
    testMetric = calcMetric(testpredlabel, testlabel)
    print("Train accuracy : %f specificity : %f sensitivity : %f F1 : %f" % (trainMetric[4], trainMetric[1], trainMetric[2], trainMetric[3]))
    print("Test accuracy : %f specificity : %f sensitivity : %f F1 : %f" % (testMetric[4], testMetric[1], testMetric[2], testMetric[3]))

def NNpred(pred):
    '''
    Converts the 2-column output of the keras NN model into a single-column
    variable. This allows us to calculate performance metrics more easily.
    '''
    predlabel = [0 if x[0] > x[1] else 1 for x in pred]
    return predlabel

def removeColFn(train, test, varLimit):
    '''
    Removes variables whose standard deviation in the train dataset is below
    varLimit, since near-constant variables carry little information and
    cannot be strong predictors.
    '''
    trainStd = train.std()
    removeCols = list(trainStd.loc[trainStd < varLimit].index)
    trainNew = train.drop(removeCols, axis=1)
    testNew = test.drop(removeCols, axis=1)
    return [trainNew, testNew]

def calcMetric(pred, label):
    '''
    Utility function which calculates all the performance metrics
    '''
    confusionMatrix = [[0, 0], [0, 0]]
    totalSize = len(pred)
    true_positive = float(np.sum([1 if (x[0] == 1 and x[1] == 1) else 0 for x in zip(pred, label)]))
    true_negative = float(np.sum([1 if (x[0] == 0 and x[1] == 0) else 0 for x in zip(pred, label)]))
    false_positive = float(np.sum([1 if (x[0] == 1 and x[1] == 0) else 0 for x in zip(pred, label)]))
    false_negative = float(np.sum([1 if (x[0] == 0 and x[1] == 1) else 0 for x in zip(pred, label)]))
    if (true_negative + false_positive) != 0:
        specificity = true_negative / (true_negative + false_positive)
    else:
        specificity = 0
    if (true_positive + false_negative) != 0:
        sensitivity = true_positive / (true_positive + false_negative)
    else:
        sensitivity = 0
    if (true_positive + false_positive) != 0:
        precision = true_positive / (true_positive + false_positive)
    else:
        precision = 0
    accuracy = (true_positive + true_negative) / float(totalSize)
    confusionMatrix[0][0] = true_positive / float(totalSize)
    confusionMatrix[0][1] = false_negative / float(totalSize)
    confusionMatrix[1][0] = false_positive / float(totalSize)
    confusionMatrix[1][1] = true_negative / float(totalSize)
    # Guard against zero denominators for MCC and F1 (e.g. all-one-class predictions)
    denom = math.sqrt((true_positive + false_positive) * (true_positive + false_negative) * (true_negative + false_positive) * (true_negative + false_negative))
    mathewC = ((true_positive * true_negative) - (false_positive * false_negative)) / denom if denom != 0 else 0
    f1 = 2 * (precision * sensitivity) / (precision + sensitivity) if (precision + sensitivity) != 0 else 0
    return [confusionMatrix, specificity, sensitivity, f1, accuracy]

def ensembleModel(predList, testLabel=None, onlyPredict=False):
    '''
    Ensembles test-data predictions from multiple models. During training we run
    the model multiple times and keep the predictions of the models that pass the
    performance threshold. Here we put those stored predictions to a vote: the
    final prediction for each observation is the majority label across the
    individual model predictions.
    '''
    testPredLabels = []
    for i in range(len(predList)):
        testpred = predList[i]
        testpredlabel = NNpred(testpred)
        testPredLabels.append(testpredlabel)
    testPredLabels = np.array(testPredLabels)
    sums = []
    for i in range(testPredLabels.shape[1]):
        sums.append(np.sum(testPredLabels[:, i]))
    threshold = len(predList) // 2
    finalPred = [1 if x > threshold else 0 for x in sums]
    if onlyPredict == False:
        testlabel = NNpred(testLabel)
        metric = calcMetric(finalPred, testlabel)
        print("Test accuracy : %f specificity : %f sensitivity : %f F1 : %f" % (metric[4], metric[1], metric[2], metric[3]))
    return finalPred

def readTrainDataset(filename):
    '''
    Reads the sparse train file (label, tab, space-separated variable indices)
    into a one-hot DataFrame of 100000 columns plus a label list.
    '''
    f = open(filename, 'r')
    labels = []
    data = []
    for i in f:
        lab, dat = i.split("\t")
        val = [int(x) for x in dat[:-2].split(" ")]
        data.append(val)
        labels.append(int(lab))
    f.close()
    finalData = np.zeros((len(labels), 100000))
    for i in range(len(labels)):
        for var in data[i]:
            finalData[i, (var - 1)] = 1
    cols = []
    for i in range(100000):
        cols.append('var_' + str(i + 1))
    finalData = pd.DataFrame(finalData, columns=cols)
    return [finalData, labels]

def readTestDataset(filename):
    '''
    Reads the sparse test file (space-separated variable indices, no label)
    into a one-hot DataFrame of 100000 columns.
    '''
    f = open(filename, 'r')
    data = []
    for i in f:
        dat = i
        val = [int(x) for x in dat[:-2].split(" ")]
        data.append(val)
    f.close()
    finalData = np.zeros((len(data), 100000))
    for i in range(len(data)):
        for var in data[i]:
            finalData[i, (var - 1)] = 1
    cols = []
    for i in range(100000):
        cols.append('var_' + str(i + 1))
    finalData = pd.DataFrame(finalData, columns=cols)
    return finalData

print("Reading the train and test data")
trainData, trainLabel = readTrainDataset('train.dat')
testData = readTestDataset('test.dat')

print("Removing columns in train dataset with zero variance")
trainStd = trainData.std()
removeCols = list(trainStd.loc[trainStd == 0.0].index)
trainData = trainData.drop(removeCols, axis=1)
testData = testData.drop(removeCols, axis=1)

print("Upsampling the train dataset to correct the class imbalance")
trainData['Class'] = trainLabel
upsamplingFactor = 10
class1 = trainData[trainData['Class'] == 1]
for i in range(upsamplingFactor):
    trainData = pd.concat([trainData, class1], axis=0)
trainData = trainData.sample(frac=1).reset_index(drop=True)
trainLabel = trainData.loc[:, 'Class']
trainData.drop('Class', axis=1, inplace=True)

from sklearn.random_projection import GaussianRandomProjection
from sklearn.random_projection import SparseRandomProjection
from sklearn.decomposition import PCA, FastICA
from sklearn.decomposition import TruncatedSVD

print("Computing new features based on matrix factorization")
compNo = 50  # Number of features derived from each of these methods

# tSVD
tsvd = TruncatedSVD(n_components=compNo, random_state=100)
tsvd_results_train = tsvd.fit_transform(trainData)
tsvd_results_test = tsvd.transform(testData)

# PCA
pca = PCA(n_components=compNo, random_state=100)
pca2_results_train = pca.fit_transform(trainData)
pca2_results_test = pca.transform(testData)

# GRP
grp = GaussianRandomProjection(n_components=compNo, eps=0.1, random_state=420)
grp_results_train = grp.fit_transform(trainData)
grp_results_test = grp.transform(testData)

# SRP
srp = SparseRandomProjection(n_components=compNo, dense_output=True, random_state=420)
srp_results_train = srp.fit_transform(trainData)
srp_results_test = srp.transform(testData)

# Appending decomposed components
rawCols = list(trainData.columns)  # List of all the raw variables in the dataset
# This block appends all the features calculated above to the dataset
for i in range(1, compNo + 1):
    trainData['pca_' + str(i)] = pca2_results_train[:, i - 1]
    testData['pca_' + str(i)] = pca2_results_test[:, i - 1]
    # trainData['ica_' + str(i)] = ica2_results_train[:, i - 1]   # Uncomment to include ICA results
    # testData['ica_' + str(i)] = ica2_results_test[:, i - 1]
    trainData['tsvd_' + str(i)] = tsvd_results_train[:, i - 1]
    testData['tsvd_' + str(i)] = tsvd_results_test[:, i - 1]
    trainData['grp_' + str(i)] = grp_results_train[:, i - 1]
    testData['grp_' + str(i)] = grp_results_test[:, i - 1]
    trainData['srp_' + str(i)] = srp_results_train[:, i - 1]
    testData['srp_' + str(i)] = srp_results_test[:, i - 1]

import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization
from keras.utils import np_utils
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated

trainData, valData, trainLabel, valLabel = train_test_split(trainData, trainLabel, test_size=0.25, random_state=100)
trainData2 = np.array(trainData)
valData2 = np.array(valData)
testData2 = np.array(testData)
trainLabel2 = np.array(trainLabel)
valLabel2 = np.array(valLabel)
trainLabel2 = np_utils.to_categorical(trainLabel2)
valLabel2 = np_utils.to_categorical(valLabel2)

print("Training the Neural Network")
batch_size = 256
nrounds = 1
nepochs = 5
num_classes = 2
bestModels = []
minPerformThreshold = 0.27  # Matthews coefficient above which an epoch's predictions are kept for ensembling
# Increasing this threshold can improve the quality of the overall prediction, but leaves
# fewer models to vote in the ensemble.
# A general rule of thumb is to have the largest layers at the beginning and reduce the
# size gradually. The sizes below seem appropriate given the number of variables we are
# using, and they are powers of 2, which is computationally convenient.
# Dropout is a method for preventing overfitting, so that the model does not simply mimic
# the training data: the layer randomly zeroes a portion of its inputs before passing
# them to the next layer.

for rnd in range(nrounds):
    print("ROUND : %d" % rnd)
    model = Sequential()
    model.add(Dense(512, activation="relu", input_dim=trainData2.shape[1]))
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    model.add(Dense(256, activation="relu"))
    model.add(Dropout(0.3))
    model.add(Dense(128, activation="relu"))
    model.add(Dropout(0.2))
    model.add(Dense(32, activation="relu"))
    model.add(Dropout(0.2))
    model.add(Dense(2, activation="softmax"))
    model.compile(optimizer=keras.optimizers.Adadelta(),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    for epoch in range(nepochs):
        model.fit(trainData2, trainLabel2, epochs=1, batch_size=batch_size,
                  validation_data=(valData2, valLabel2), verbose=0)
        testDatapred = model.predict(testData2)
        pred = model.predict(valData2)
        predlabel = NNpred(pred)
        vallabel = NNpred(valLabel2)
        mathewC = matthews_corrcoef(vallabel, predlabel)
        print(mathewC)
        if mathewC > minPerformThreshold:
            evalModel(model, trainData2, trainLabel2, valData2, valLabel2)
            bestModels.append(testDatapred)

print("Making the final prediction on the test data")
finalPred = ensembleModel(bestModels, onlyPredict=True)
f = open('testPred.dat', 'w')
for i in finalPred:
    f.write(str(i) + "\n")
f.close()