Practical Session, 17/07/2014

Classifying Arrhythmia Patterns

Datasets

Many public datasets are available online for experimenting with machine learning:

UCI http://archive.ics.uci.edu/ml/datasets.html
Kaggle http://www.kaggle.com/
KDD Cup http://www.sigkdd.org/kddcup/index.php

Arrhythmia Classification

Today we'll work on a dataset available through UCI:
https://archive.ics.uci.edu/ml/datasets/Arrhythmia

We briefly presented the problem and its features in the tutorial slides.

According to the paper, the reported accuracy is:

  • 62%-68% using the novel VFI5 algorithm (detailed in the paper)
  • 53% using the classical Naive Bayes classifier.

The purpose of this tutorial is to experiment with:

  • Preprocessing data to normalize it and replace missing values
  • Classifying arrhythmia patterns using Naive Bayes
  • Computing performance using n-fold cross-validation
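
To make the preprocessing step concrete, here is a minimal NumPy-only sketch of mean imputation followed by l1 normalization. The 3x3 array is made-up toy data, not the arrhythmia set, and the variable names are only illustrative. scikit-learn provides the same operations ready-made; this just shows what they compute.

```python
import numpy as np

# Toy data (an assumption, not the real dataset): 3 samples, 3 features,
# with two missing values marked as NaN
X = np.array([[1.0, np.nan, 3.0],
              [2.0, 4.0, np.nan],
              [3.0, 8.0, 5.0]])

# Mean imputation: replace each NaN with the mean of its column,
# computed over the values that are present
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)

# l1 normalization: scale each sample (row) so its absolute values sum to 1
row_norms = np.abs(X_imputed).sum(axis=1, keepdims=True)
row_norms[row_norms == 0] = 1.0  # leave all-zero rows unchanged
X_normalized = X_imputed / row_norms

print(X_imputed)
print(X_normalized)
```

Whether per-sample l1 scaling actually helps the classifier is something to check experimentally; other norms or per-feature scaling may work better.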

Also check out scikit-learn's RandomForestClassifier.
What performance do you get when using it?
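
One possible way to run such a comparison is sketched below. It uses synthetic data from make_classification as a stand-in (an assumption; swap in the arrhythmia features and labels once they are preprocessed), and the sklearn.model_selection import path assumes scikit-learn 0.18 or newer. The parameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption: NOT the arrhythmia dataset)
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

models = {
    "GaussianNB": GaussianNB(),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

error_rates = {}
for name, model in models.items():
    # cross_val_score returns one accuracy value per fold (10 folds here)
    accuracies = cross_val_score(model, X, y, cv=10)
    error_rates[name] = 1.0 - accuracies.mean()
    print(f"{name}: mean error rate = {error_rates[name]:.3f}")
```

The numbers on synthetic data say nothing about the real dataset; only the comparison procedure carries over.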

Code Skeleton

Start from this code skeleton and solve the TODOs:

import csv
import numpy as np
 
# Read the CSV file. Missing values are marked with '?' in the data;
# here we just replace them with np.nan (you can ignore this part).
rows = []
with open("arrhythmia.data") as csvf:
    for row in csv.reader(csvf):
        if not row:
            continue  # skip blank lines
        rows.append([np.nan if item == '?' else float(item) for item in row])
# Build the array once instead of appending row by row
X = np.array(rows)
 
# X is your data matrix: rows are samples and the last column holds the labels

# TODO: use the scikit-learn Imputer class (SimpleImputer in scikit-learn >= 0.20)
#       to replace missing fields with mean values

# TODO: extract the labels from X into a new variable Y

# TODO: normalize using the l1 norm (or another one; check out what works best!)

# TODO: use the scikit-learn KFold class to create 10 folds

# TODO: run the GaussianNB classifier to fit and predict for each fold

# TODO: print the error rate averaged across all the runs

# TODO: bonus, use RandomForestClassifier and check out the error rate!
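
For the cross-validation TODOs, the fold loop could look roughly like the sketch below. It runs on synthetic data from make_classification (an assumption, so that it works without the data file) and uses the sklearn.model_selection import path, which assumes scikit-learn 0.18 or newer. Treat it as one possible shape for the loop, not as the reference solution.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption: NOT the arrhythmia dataset)
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           random_state=0)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
fold_errors = []
for train_idx, test_idx in kf.split(X):
    # Fit on the training part of the fold, predict on the held-out part
    clf = GaussianNB()
    clf.fit(X[train_idx], y[train_idx])
    predictions = clf.predict(X[test_idx])
    # Error rate = fraction of misclassified test samples
    fold_errors.append(np.mean(predictions != y[test_idx]))

mean_error = float(np.mean(fold_errors))
print(f"Average error rate over {len(fold_errors)} folds: {mean_error:.3f}")
```

Shuffling before splitting is a choice made here because data files are often ordered; drop shuffle=True if you want contiguous folds.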
sesiuni/ml/arrhythmia-day2.1405603740.txt.gz · Last modified: 2014/07/17 16:29 by victor