## Classifying Arrhythmia Patterns

Practical Session, 17/07/2014

Datasets

The internet has plenty of freely available datasets for experimenting with machine learning.

Arrhythmia Classification

Today we'll work on a dataset available through the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Arrhythmia

We briefly presented the problem and its features in the tutorial slides.

The accuracy reported in the paper is:

• 62%-68% using a novel VFI5 algorithm (detailed in the paper)
• 53% using the classical Naive Bayes classifier.

The purpose of this practical session is to:

• Preprocess the data: replace missing values and normalize
• Classify arrhythmia patterns using Naive Bayes
• Compute performance using n-fold cross-validation
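The three steps above can be sketched end to end on synthetic data, so the snippet shows the API calls without giving away results on the real dataset. This is a minimal sketch, assuming a recent scikit-learn where the imputer is `SimpleImputer` and `KFold` lives in `sklearn.model_selection`; the array sizes, the random seed, and the way the toy data is generated are all made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import normalize

# Synthetic stand-in for the real data (assumption): 200 samples, 10 features, 3 classes.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=200)
X = rng.normal(size=(200, 10)) + y[:, None]  # shift features by class label
X[rng.random(X.shape) < 0.05] = np.nan       # mark roughly 5% of the values as missing

# 1. Replace missing values with the per-column mean.
X = SimpleImputer(strategy="mean").fit_transform(X)

# 2. Scale each sample (row) to unit l1 norm.
X = normalize(X, norm="l1")

# 3. 10-fold cross-validation with Gaussian Naive Bayes.
errors = []
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in kfold.split(X):
    clf = GaussianNB().fit(X[train_idx], y[train_idx])
    predictions = clf.predict(X[test_idx])
    errors.append(np.mean(predictions != y[test_idx]))  # error rate of this fold

mean_error = float(np.mean(errors))
print(f"mean error rate over {len(errors)} folds: {mean_error:.3f}")
```

For brevity the imputer is fitted on the whole matrix before splitting, which lets statistics from the test folds leak into training. Fitting it inside each fold (for example with a scikit-learn `Pipeline`) avoids that.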

Also check out the RandomForestClassifier.
What performance do you get when using it?
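As a starting point for that comparison, here is a hedged sketch of cross-validating a random forest with `cross_val_score`, again on synthetic data. The toy data, `n_estimators=100`, and the seeds are illustrative assumptions, not tuned values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real data (assumption): 200 samples, 10 features, 3 classes.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=200)
X = rng.normal(size=(200, 10)) + y[:, None]  # shift features by class label

forest = RandomForestClassifier(n_estimators=100, random_state=0)

# With an integer cv and a classifier, scikit-learn uses stratified folds
# and returns one accuracy score per fold.
scores = cross_val_score(forest, X, y, cv=10)
error_rate = 1.0 - scores.mean()
print(f"random forest mean error rate: {error_rate:.3f}")
```

Swapping `forest` for `GaussianNB()` in the same call gives a like-for-like comparison between the two classifiers.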

Code Skeleton

Start from this code skeleton and solve the TODOs.

```python
import csv
import numpy as np

# The data has missing values noted by '?' (ignore this part)
# In this part of the code we just replace them with np.nan
rows = []

# Read from the CSV file
with open("arrhythmia.data", newline="") as csvfile:
    for row in csv.reader(csvfile):
        if not row:
            continue  # skip blank lines
        rows.append([np.nan if item == '?' else float(item) for item in row])

# Stack the rows into a 2-D float array of shape (n_samples, 280)
X = np.array(rows, dtype=np.float32)

# X is your data matrix: rows are samples and the last column holds the labels

# TODO: use scikit-learn's Imputer class (SimpleImputer in scikit-learn 0.20+) to replace missing fields with mean values

# TODO: extract labels from X into a new variable Y

# TODO: normalize using the l1 norm (or another one; check out what works best!)

# TODO: Use scikit-learn KFold class to create 10 folds

# TODO: run the GaussianNB classifier to fit and predict for each fold

# TODO: print the averaged error rate across all the runs

# TODO: Bonus, use RandomForestClassifier and check out the error rate!
```