Differences

This shows you the differences between two versions of the page.

sesiuni:data_science [2016/05/30 15:25]
fbratiloveanu created
sesiuni:data_science [2016/08/17 11:10] (current)
fbratiloveanu [When and Where?]
Line 1: Line 1:
-= Introduction to Big Data Concepts with Apache Spark =
+= Data Science with Python, Scala and Apache Spark =
  
 == Introduction ==
 
-This 7-day hands-on workshop introduces Apache Spark, the open-source cluster computing framework with in-memory processing that makes analytics applications up to 100 times faster compared to technologies in wide deployment today. Developed in the AMPLab at UC Berkeley, Spark can help reduce data interaction complexity, increase processing speed and enhance data-intensive, near-real-time applications with deep intelligence.
+This 8-day hands-on workshop introduces Apache Spark, the open-source cluster computing framework with in-memory processing that makes analytics applications up to 100 times faster compared to technologies in wide deployment today. Developed in the AMPLab at UC Berkeley, Spark can help reduce data interaction complexity, increase processing speed and enhance data-intensive, near-real-time applications with deep intelligence.
 Highly versatile in many environments, and with a strong foundation in functional programming, Spark is known for its ease of use in creating algorithms that harness insight from complex data. Spark was elevated to a top-level Apache Project in 2014 and continues to expand today.
 
 == When and Where? ==
  
-* **Days**: 25 June 2016 - 31 August 2016 (every Saturday and Sunday)
+**Between**: 3-25 September 2016 (every Saturday and Sunday; the workshop dates are 3, 4, 10, 11, 17, 18, 24 and 25). Each workshop day lasts 4 hours and will be held within the 12:00-18:00 interval. The exact schedule will be decided at the end of August.
 
-An email will be sent to announce the room and the hours after the participant has been accepted.
+Private communications will be sent to the selected participants to announce further details (after registration is complete and the list of participants is finalized).
 
 == Topics ==
Line 20: Line 20:
 * Working in the PySpark shell
 * Working with PySpark in an iPython notebook
+* Building Spark/Scala Applications with sbt
 * Standalone Applications
  
Line 48: Line 49:
 * The DataFrame API
 * Inner Joins and Left Outer Joins in the RDD API versus in Spark SQL
+* Datasets (compile-time type-safe DataFrames)
 
 Building Interactive Data Analytics Apps With Flask and Spark
Line 55: Line 57:
 Spark Streaming
 
-* A Simple Example - Stream of Integers / Moving Average
+* A Simple Example - Stream of Integers / Rolling Sum
+* Streaming Data via TCP socket (netcat)
+* Streaming Data via Kafka topic (Apache Kafka)
+* Storing Streaming Analytics Results in a NoSQL Datastore (Apache Cassandra)
+* Structured Streaming / Infinite DataFrames
 
 Advanced Spark Programming
Line 64: Line 70:
  
 * Overview and Terminology
-* Machine Learning Basics. What is a Feature
-* The LabeledPoint Data Type
+* Machine Learning Basics
 * TF-IDF
 * Preparing The Data For Analysis / Stemming, Stopword Elimination
-* LogisticRegressionWithSGD / Filtering Spam
+* Linear Regression / The Longley Dataset
+* Logistic Regression / Filtering Spam
+* Decision Trees
+* Random Forests
+
+Parallel graph processing with GraphX
+
+* A Simple Example - PageRank
 
 Exercises
Line 79: Line 91:
 * The Brown Corpus (NLTK). Stylistic Classification with Cosine Similarity
 * Sensor Data. Detecting Tachycardia and Bradycardia in an ECG Stream
-== Registration is now closed ==
-
-If you have any questions, please ask them [[https://github.com/dserban/datascience2016summer/issues/1|here]].
 
 == Prerequisites ==
Line 90: Line 99:
  
 You can register for the workshop using the [[https://docs.google.com/forms/d/1ocS-KDKF99HWILR5LaEV8KE-38xa0kltZoFu8sKWjI8/viewform|online registration form]].
+
+Deadline: September 2nd, 23:59.
+
+If you have any questions, please ask them [[https://github.com/dserban/datascience2016summer/issues/1|here]].
 
 == Instructor ==
Line 98: Line 111:
  
 ==== _____________ ====
- 
-== Participants == 
- 
  
sesiuni/data_science.1464611127.txt.gz · Last modified: 2016/05/30 15:25 by fbratiloveanu