Classifying Arrhythmia Patterns

Practical Session, 17/07/2014

Datasets

The internet has many useful datasets for experimenting with machine learning.

UCI http://archive.ics.uci.edu/ml/datasets.html
Kaggle http://www.kaggle.com/
KDD Cup http://www.sigkdd.org/kddcup/index.php

Arrhythmia Classification

Today we'll work on a dataset available through UCI:
https://archive.ics.uci.edu/ml/datasets/Arrhythmia

We briefly presented the problem and its features in the tutorial slides.

According to the paper, the reported accuracy is:

  • 62%-68% using the novel VFI5 algorithm (detailed in the paper)
  • 53% using the classical Naive Bayes classifier.

The purpose of this practical session is to:

  • Preprocess the data to normalize it and replace missing values
  • Classify arrhythmia patterns using Naive Bayes
  • Compute performance using n-fold cross-validation
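
As a rough illustration of the preprocessing goal, the sketch below fills missing values with column means and then scales each row to unit l1 norm, using plain NumPy. The toy array and all variable names are made up for the example; scikit-learn offers ready-made classes for both steps, so treat this only as a picture of what they compute.

```python
import numpy as np

# Toy data (hypothetical): 3 samples, 2 features, one missing value.
X_toy = np.array([[1.0, 2.0],
                  [np.nan, 6.0],
                  [3.0, 2.0]])

# Mean imputation: replace each NaN with the mean of its column,
# computed over the non-missing entries only.
col_means = np.nanmean(X_toy, axis=0)
missing = np.isnan(X_toy)
X_filled = np.where(missing, col_means, X_toy)

# l1 normalization: divide each row by the sum of its absolute values.
row_norms = np.abs(X_filled).sum(axis=1, keepdims=True)
row_norms[row_norms == 0] = 1.0  # leave all-zero rows unchanged
X_norm = X_filled / row_norms

print(X_filled)
print(X_norm)
```

After this step every row of X_norm has absolute values summing to 1, which is what "normalize using l1 norm" means per sample.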

Also check out the RandomForestClassifier.
What's the performance you get when using it?
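
One possible way to compare the two classifiers is sketched below. It assumes a recent scikit-learn release (where cross_val_score lives in sklearn.model_selection) and uses synthetic data as a stand-in, so the printed numbers say nothing about the arrhythmia dataset; the parameter values are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (hypothetical), not the arrhythmia dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=20,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

models = {
    "GaussianNB": GaussianNB(),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

mean_accuracy = {}
for name, model in models.items():
    # cv=10 runs 10-fold cross-validation (stratified for classifiers)
    # and returns one accuracy score per fold.
    scores = cross_val_score(model, X_demo, y_demo, cv=10)
    mean_accuracy[name] = scores.mean()
    print("%s: mean accuracy %.3f" % (name, mean_accuracy[name]))
```

To try it on the real data, replace X_demo and y_demo with the preprocessed feature matrix and the label vector.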

Code Skeleton

Start from this code skeleton and solve the TODOs:

import csv
import numpy as np
 
# Read the CSV file into a list of rows.
# The data has missing values noted by '?' (ignore this part);
# here we just replace them with np.nan.
rows = []
with open("arrhythmia.data") as csvf:
    for row in csv.reader(csvf):
        if row:  # skip blank lines
            rows.append([np.nan if item == '?' else float(item) for item in row])
X = np.array(rows)
 
# X is your data matrix: rows are samples and the last column holds the labels
 
# TODO: use the scikit-learn Imputer class (SimpleImputer in scikit-learn 0.20 and later) to replace missing fields with column means
 
# TODO: extract labels from X into a new variable Y
 
# TODO: normalize using the l1 norm (or another one; check out what works best!)
 
# TODO: Use scikit-learn KFold class to create 10 folds
 
# TODO: run the GaussianNB classifier to fit and predict for each fold
 
# TODO: print the averaged error rate across all the runs
 
# TODO: Bonus: use RandomForestClassifier and check out the error rate!
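
The cross-validation TODOs follow a common pattern, sketched below on synthetic data rather than on the arrhythmia file. It assumes a recent scikit-learn release, where KFold is imported from sklearn.model_selection and takes n_splits; older releases exposed it from sklearn.cross_validation with a different signature. The data and variable names are made up for the example.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (hypothetical): two Gaussian blobs, 100 samples each.
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(100, 5)),
                    rng.normal(2.0, 1.0, size=(100, 5))])
y_demo = np.array([0] * 100 + [1] * 100)

# 10 folds; shuffle because the samples above are ordered by class.
kf = KFold(n_splits=10, shuffle=True, random_state=0)

error_rates = []
for train_idx, test_idx in kf.split(X_demo):
    clf = GaussianNB()
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    predicted = clf.predict(X_demo[test_idx])
    # Error rate = fraction of misclassified test samples in this fold.
    error_rates.append(np.mean(predicted != y_demo[test_idx]))

mean_error = float(np.mean(error_rates))
print("mean error rate over %d folds: %.3f" % (len(error_rates), mean_error))
```

A fresh classifier is created inside the loop so that no fitted state carries over from one fold to the next.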
sesiuni/ml/arrhythmia-day2.txt · Last modified: 2014/07/18 09:19 by victor