# Workshops

Practical Session, 17/07/2014

**Datasets**

Many public datasets are freely available online for experimenting with machine learning. Good starting points are:

UCI http://archive.ics.uci.edu/ml/datasets.html

Kaggle http://www.kaggle.com/

KDD Cup http://www.sigkdd.org/kddcup/index.php

**Arrhythmia Classification**

Today we'll work on a dataset available through UCI:

https://archive.ics.uci.edu/ml/datasets/Arrhythmia

We briefly presented the problem and its features in the tutorial slides.

The paper that accompanies the dataset reports the following accuracies:

- 62%-68% using the novel VFI5 algorithm (detailed in the paper)
- 53% using the classical Naive Bayes classifier

The purpose of this tutorial is to experiment with:

- Preprocessing the data: normalizing it and replacing missing values
- Classifying arrhythmia patterns using Naive Bayes
- Measuring performance using n-fold cross-validation
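The preprocessing step listed above can be sketched on a tiny toy array. This is only an illustration under assumptions: the values are made up, the names `data`, `features`, `labels`, `imputed` and `normalized` are hypothetical, and it uses `SimpleImputer` from recent scikit-learn releases (older releases exposed a similar class named `Imputer`).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import normalize

# Toy data (made-up values): 3 samples, 2 features, label in the last column.
data = np.array([
    [1.0, 3.0, 1.0],
    [np.nan, 1.0, 2.0],
    [3.0, 5.0, 1.0],
])

# Split the features from the labels stored in the last column.
features = data[:, :-1]
labels = data[:, -1].astype(int)

# Replace each missing value with the mean of its column.
imputed = SimpleImputer(strategy="mean").fit_transform(features)

# Scale each sample (row) so that its absolute values sum to 1 (L1 norm).
normalized = normalize(imputed, norm="l1")

print(imputed)
print(normalized)
```

Note that `normalize` works per sample (row). If per-feature scaling turns out to work better on the real data, that is a different transformation and worth trying separately.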

Also check out scikit-learn's RandomForestClassifier.

What performance do you get when using it?
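One possible way to compare the two classifiers is to average the misclassification rate over 10 folds. The sketch below runs on synthetic stand-in data from `make_classification`, not on the arrhythmia set, so its numbers say nothing about the real task. The helper `mean_error_rate` and the chosen parameters (`n_estimators=100`, shuffled folds, fixed seeds) are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (NOT the arrhythmia dataset).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)


def mean_error_rate(model, X, y, n_splits=10):
    """Average misclassification rate of `model` over `n_splits` folds."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    errors = []
    for train_idx, test_idx in kfold.split(X):
        # Fit on the training folds, then predict the held-out fold.
        model.fit(X[train_idx], y[train_idx])
        predictions = model.predict(X[test_idx])
        errors.append(np.mean(predictions != y[test_idx]))
    return float(np.mean(errors))


nb_error = mean_error_rate(GaussianNB(), X_demo, y_demo)
rf_error = mean_error_rate(
    RandomForestClassifier(n_estimators=100, random_state=0), X_demo, y_demo
)
print("GaussianNB error rate:   %.3f" % nb_error)
print("RandomForest error rate: %.3f" % rf_error)
```

The error rate is simply 1 minus the accuracy, so the result can be compared directly with the accuracies reported in the paper once the same procedure is run on the real data.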

**Code Skeleton**

Start from this code skeleton and solve the TODOs:

```python
import csv

import numpy as np

# Read the CSV file.
# The data has missing values marked by '?' (you can ignore this part):
# in this part of the code we just replace them with np.nan.
rows = []
with open("arrhythmia.data") as csvf:
    for row in csv.reader(csvf):
        if not row:
            continue  # skip blank lines
        rows.append([np.nan if item == '?' else float(item) for item in row])
X = np.array(rows, dtype=np.float64)

# X is your data matrix: rows are samples, and the last of the 280 columns
# holds the labels.

# TODO: use the scikit-learn Imputer class (SimpleImputer in recent releases)
#       to replace missing fields with mean values
# TODO: extract the labels from X into a new variable Y
# TODO: normalize using the l1 norm (or another one, check out what works best!)
# TODO: use the scikit-learn KFold class to create 10 folds
# TODO: run the GaussianNB classifier to fit and predict for each fold
# TODO: print the error rate averaged across all the runs
# TODO: bonus, use RandomForestClassifier and check out the error rate!
```
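Before solving the TODOs, it can help to check how many missing values the loading step produced. The following is a minimal sketch on a made-up toy array. The helper `summarize_missing` is hypothetical and not part of the skeleton; on the real data you would pass `X` instead of `toy`.

```python
import numpy as np


def summarize_missing(data):
    """Return (rows_with_nan, cols_with_nan) for a 2-D float array."""
    mask = np.isnan(data)
    rows_with_nan = int(mask.any(axis=1).sum())
    cols_with_nan = int(mask.any(axis=0).sum())
    return rows_with_nan, cols_with_nan


# Toy array (made-up values) with two missing entries.
toy = np.array([
    [1.0, np.nan, 0.5],
    [2.0, 4.0, np.nan],
    [3.0, 5.0, 1.5],
])

rows_missing, cols_missing = summarize_missing(toy)
print("rows with missing values:", rows_missing)
print("columns with missing values:", cols_missing)
```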