Classifying Arrhythmia Patterns
Practical Session, 17/07/2014
Datasets
Plenty of public datasets are available online for experimenting with machine learning. Some good places to look:
UCI http://archive.ics.uci.edu/ml/datasets.html
Kaggle http://www.kaggle.com/
KDD Cup http://www.sigkdd.org/kddcup/index.php
Arrhythmia Classification
Today we'll work on a dataset available through UCI:
https://archive.ics.uci.edu/ml/datasets/Arrhythmia
We briefly presented the problem and its features in the tutorial slides.
If you look through the paper, the accuracy obtained is:
The purpose of this tutorial is to experiment with:
Check out the RandomForestClassifier.
What's the performance you get when using it?
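As a rough starting point for that comparison, the sketch below cross-validates a RandomForestClassifier over 10 folds. It runs on synthetic stand-in data from make_classification rather than on the arrhythmia file, so the number it prints says nothing about the real dataset. The values n_estimators=100 and random_state=0 are arbitrary assumptions, not settings taken from the tutorial.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): swap in the imputed arrhythmia
# features and labels once you have them.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# n_estimators=100 and random_state=0 are arbitrary illustrative choices.
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# cv=10 returns one accuracy score per fold.
rf_scores = cross_val_score(forest, X_demo, y_demo, cv=10)
rf_error_rate = 1.0 - rf_scores.mean()
print("RandomForest mean error rate: %.3f" % rf_error_rate)
```

Since cross_val_score reports accuracy for classifiers by default, the error rate is one minus the mean score.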
Code Skeleton
Start from this code skeleton and solve the TODOs:
import csv
import numpy as np

# Read from the CSV file
csvf = open("arrhythmia.data")
csvfile = csv.reader(csvf)

# The data has missing values noted by '?' (ignore this part)
# In this part of the code we just replace them with np.nan
X = np.array(np.zeros(shape=(1, 280)))
for row in csvfile:
    for idx, item in enumerate(row):
        if item == '?':
            row[idx] = np.nan
        row[idx] = np.float32(row[idx])
    row = np.array(row)
    row = row.reshape(1, X.shape[1])
    X = np.append(X, row, axis=0)
# Drop the placeholder row of zeros used to initialize X
X = np.delete(X, 0, axis=0)

# X is your data matrix, with rows as samples and the last column holding the labels

# TODO: use the scikit-learn Imputer class (SimpleImputer in scikit-learn 0.20 and later) to fill missing fields with mean values
# TODO: extract labels from X into a new variable Y
# TODO: normalize using the l1 norm (or another one, check out what works best!)
# TODO: use the scikit-learn KFold class to create 10 folds
# TODO: run the GaussianNB classifier to fit and predict for each fold
# TODO: print the averaged error rate across all the runs
# TODO: Bonus, use RandomForestClassifier and check out the error rate!
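One possible way to approach the TODOs is sketched below. It is not the reference solution: it runs on a small synthetic array (an assumption, so that it works without arrhythmia.data), and it uses SimpleImputer, which takes the place of the Imputer class that was removed in scikit-learn 0.22. The seeds, the array sizes, and shuffle=True are illustrative choices.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import normalize

# Synthetic stand-in for the array built by the skeleton (assumption):
# 100 samples, 5 feature columns, labels in the last column.
rng = np.random.RandomState(0)
labels = rng.randint(0, 2, size=100)
features = rng.normal(size=(100, 5)) + labels[:, None]
features[rng.rand(100, 5) < 0.05] = np.nan  # simulate the '?' entries
X = np.hstack([features, labels[:, None].astype(float)])

# Extract labels from the last column, keep the rest as features.
Y = X[:, -1].astype(int)
X = X[:, :-1]

# Replace each missing value with the mean of its column.
X = SimpleImputer(strategy="mean").fit_transform(X)

# Scale every sample (row) to unit l1 norm.
X = normalize(X, norm="l1")

# 10-fold cross-validation with GaussianNB.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
error_rates = []
for train_idx, test_idx in kf.split(X):
    clf = GaussianNB().fit(X[train_idx], Y[train_idx])
    predictions = clf.predict(X[test_idx])
    error_rates.append(np.mean(predictions != Y[test_idx]))

mean_error = float(np.mean(error_rates))
print("GaussianNB mean error rate: %.3f" % mean_error)
```

Imputing before splitting follows the order of the TODOs. A stricter variant would fit the imputer on the training part of each fold only, so that no statistics from the test fold leak into training.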