Data Science with Python, Scala and Apache Spark


This 8-day hands-on workshop introduces Apache Spark, the open-source cluster computing framework whose in-memory processing makes analytics applications up to 100 times faster than technologies in wide deployment today. Developed in the AMPLab at UC Berkeley, Spark can help reduce data interaction complexity, increase processing speed and enhance data-intensive, near-real-time applications with deep intelligence. Highly versatile in many environments, and with a strong foundation in functional programming, Spark is known for its ease of use in creating algorithms that harness insight from complex data. Spark became a top-level Apache project in 2014 and continues to expand today.

When and Where?

When: 3-25 September 2016, every Saturday and Sunday (the workshop dates are 3, 4, 10, 11, 17, 18, 24 and 25 September). Each session lasts 4 hours and will be held within the 12:00-18:00 interval; the exact schedule will be decided at the end of August.

Further details will be communicated privately to the selected participants once registration is complete and the list of participants is finalized.


Introduction to Data Analysis with Spark

  • What is Apache Spark?
  • Introduction to Core Spark Concepts
  • Working in the PySpark shell
  • Working with PySpark in an IPython notebook
  • Building Spark/Scala Applications with sbt
  • Standalone Applications
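
As a taste of the sbt material above, a minimal `build.sbt` for a Spark/Scala application might look like the following. The project name and version numbers are illustrative assumptions (typical for mid-2016), and `"provided"` keeps Spark itself out of the packaged jar because the cluster supplies it at runtime:

```scala
name := "spark-workshop-app"

version := "0.1.0"

scalaVersion := "2.10.6"

// Spark is supplied by the cluster at runtime, so mark it "provided"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2" % "provided"
```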

Programming with RDDs

  • RDD Basics
  • Creating RDDs
  • RDD Operations
  • Passing Functions to Spark
  • Common Transformations and Actions
  • Caching RDDs
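
The key distinction between lazy transformations and eager actions can be previewed in plain Python, with no Spark installation required; here generators play the role of RDDs (the sample data is invented):

```python
# A plain-Python sketch of RDD semantics: transformations (map, filter)
# are lazy and only build a recipe; actions (reduce, collect) force evaluation.
from functools import reduce

data = range(1, 6)                          # stand-in for sc.parallelize([1, ..., 5])

squares = (x * x for x in data)             # "transformation": nothing computed yet
evens = (x for x in squares if x % 2 == 0)  # chained transformation, still lazy

total = reduce(lambda a, b: a + b, evens)   # "action": the whole pipeline runs here
print(total)                                # 4 + 16 = 20
```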

Working with Key-Value Pairs

  • Motivation
  • Creating Pair RDDs
  • Transformations on Pair RDDs
  • Actions Available on Pair RDDs
  • Data Partitioning and Key Performance Considerations
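
The workhorse of this section, `reduceByKey`, can be sketched in plain Python: values sharing a key are merged with a user-supplied function (the helper name and sample pairs below are invented for illustration):

```python
# Plain-Python sketch of reduceByKey: merge all values per key with a function.
pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

def reduce_by_key(func, kv_pairs):
    acc = {}
    for k, v in kv_pairs:
        # first value for a key is kept as-is; later values are merged in
        acc[k] = func(acc[k], v) if k in acc else v
    return sorted(acc.items())

print(reduce_by_key(lambda a, b: a + b, pairs))  # [('a', 4), ('b', 6)]
```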

Running on a Cluster

  • Configuring a Spark Cluster
  • Deploying Applications with spark-submit
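
For reference, a typical `spark-submit` invocation looks like the following; the master URL, memory setting and file name are placeholders, not fixed workshop values:

```shell
# Submit a PySpark application to a standalone cluster;
# spark://... is the URL of the standalone cluster manager.
spark-submit \
  --master spark://master-host:7077 \
  --executor-memory 2G \
  my_app.py
```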

Structured Data with Spark SQL

  • The DataFrame API
  • Inner Joins and Left Outer Joins in the RDD API versus in Spark SQL
  • Datasets (compile-time type-safe DataFrames)
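
The semantics of the two join flavors compared above can be sketched in plain Python (the sample tables are invented): an inner join drops rows without a match, while a left outer join keeps every left-hand row and fills the gap with None (Spark SQL's NULL).

```python
# Plain-Python sketch of inner vs. left outer join on a key.
users = [(1, "ana"), (2, "bob"), (3, "cai")]
orders = {1: "book", 3: "lamp"}  # user_id -> item; user 2 has no order

# inner join: only users that have a matching order survive
inner = [(uid, name, orders[uid]) for uid, name in users if uid in orders]

# left outer join: every user survives; missing matches become None
left_outer = [(uid, name, orders.get(uid)) for uid, name in users]

print(inner)       # [(1, 'ana', 'book'), (3, 'cai', 'lamp')]
print(left_outer)  # [(1, 'ana', 'book'), (2, 'bob', None), (3, 'cai', 'lamp')]
```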

Building Interactive Data Analytics Apps With Flask and Spark

  • A Simple Example - Parameterized CrossFilter Histograms

Spark Streaming

  • A Simple Example - Stream of Integers / Rolling Sum
  • Streaming Data via TCP socket (netcat)
  • Streaming Data via Kafka topic (Apache Kafka)
  • Storing Streaming Analytics Results in a NoSQL Datastore (Apache Cassandra)
  • Structured Streaming / Infinite DataFrames
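
The rolling-sum example from the first bullet can be previewed in plain Python, where `itertools.accumulate` plays the role that a stateful streaming operator plays across micro-batches (the input stream is invented):

```python
# Plain-Python sketch of a rolling (running) sum over a stream of integers.
from itertools import accumulate

stream = [3, 1, 4, 1, 5]            # stand-in for integers arriving over time
running = list(accumulate(stream))  # state carried forward after each element
print(running)                      # [3, 4, 8, 9, 14]
```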

Advanced Spark Programming

  • Working on a Per-Partition Basis

Machine Learning with MLlib

  • Overview and Terminology
  • Machine Learning Basics
  • TF-IDF
  • Preparing the Data for Analysis / Stemming and Stopword Elimination
  • Linear Regression / The Longley Dataset
  • Logistic Regression / Filtering Spam
  • Decision Trees
  • Random Forests
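
The TF-IDF weighting covered above fits in a few lines of plain Python; the toy corpus below is invented, and this uses the textbook tf * ln(N/df) formulation (MLlib's own variant may differ in smoothing details):

```python
# Plain-Python sketch of TF-IDF: term frequency in a document, scaled down
# by how common the term is across the whole corpus.
import math

docs = [["spark", "is", "fast"],
        ["spark", "is", "fun"],
        ["python", "is", "fun"]]

def tf_idf(term, doc, corpus):
    tf = doc.count(term) / len(doc)                 # frequency within the document
    df = sum(1 for d in corpus if term in d)        # documents containing the term
    idf = math.log(len(corpus) / df)                # rarer terms get a higher weight
    return tf * idf

score = tf_idf("spark", docs[0], docs)
print(round(score, 3))  # 0.135 -- "spark" appears in 2 of 3 documents
```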

Parallel graph processing with GraphX

  • A Simple Example - PageRank
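
Before parallelizing it with GraphX, the PageRank iteration itself can be sketched on one machine in plain Python; the tiny three-page graph and the 0.85 damping factor below are illustrative choices:

```python
# Plain-Python sketch of PageRank with damping factor 0.85:
# each page splits its rank among its outlinks, then ranks are recombined.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {page: 1.0 for page in links}

for _ in range(50):
    contribs = {page: 0.0 for page in links}
    for page, outlinks in links.items():
        share = ranks[page] / len(outlinks)   # rank is split among outlinks
        for dest in outlinks:
            contribs[dest] += share
    ranks = {page: 0.15 + 0.85 * contribs[page] for page in links}

# "c" ends up highest: it is linked to by both "a" and "b"
print({p: round(r, 2) for p, r in sorted(ranks.items())})
```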


Case Studies

  • The Complete Works of Shakespeare. Computing Word Counts
  • Detecting the 12-01-2001 Anomaly in the CrossFilter Data Set
  • Geographical Data - Analysis of City Initials per Country
  • Applying PageRank on a Subset of Wikipedia
  • Twitter Stream / Sentiment Analysis for Hashtags
  • The Brown Corpus (NLTK). Stylistic Classification with Cosine Similarity
  • Sensor Data. Detecting Tachycardia and Bradycardia in an ECG Stream
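
The word-count pattern used on the Shakespeare corpus above can be previewed in plain Python (the two sample lines are invented): flatMap the lines into words, then count per word, which is exactly what map to (word, 1) followed by reduceByKey does in Spark.

```python
# Plain-Python sketch of the classic Spark word count.
from collections import Counter

lines = ["to be or not to be", "to thine own self be true"]
words = (w for line in lines for w in line.split())  # flatMap: lines -> words
counts = Counter(words)                              # map + reduceByKey in one step

print(counts.most_common(2))  # [('to', 3), ('be', 3)]
```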


This workshop requires a solid background in functional programming. Knowledge of Python is a nice-to-have, but not mandatory.

Registration form

You can register for the workshop using the online registration form.

Deadline: September 2nd, 23:59.

If you have any questions, please ask them here.


Dan Șerban



sesiuni/data_science.txt · Last modified: 2016/08/17 11:10 by fbratiloveanu