Differences

This shows you the differences between two versions of the page.

sesiuni:data_science [2016/05/30 16:08]
dserban [Introduction]
sesiuni:data_science [2016/08/17 11:10] (current)
fbratiloveanu [When and Where?]
Line 8: Line 8:
 == When and Where? ==

-* **Days**: 25 June 2016 - 31 August 2016 (every Saturday and Sunday)
+**Between**: 3-25 September 2016 (every Saturday and Sunday; the workshop dates are 3, 4, 10, 11, 17, 18, 24, 25). The workshop lasts 4 hours and will be held within the 12:00-18:00 interval. The schedule will be decided at the end of August.

-An email will be sent to announce the room and the hours after the participant will be accepted.
+Private communications will be sent to the selected participants to announce further details (after registration is complete and the list of participants is finalized).

 == Topics ==
Line 20: Line 20:
 * Working in the PySpark shell
 * Working with PySpark in an iPython notebook
+* Building Spark/Scala Applications with sbt
 * Standalone Applications
  
Line 48: Line 49:
 * The DataFrame API
 * Inner Joins and Left Outer Joins in the RDD API versus in Spark SQL
+* Datasets (compile-time type-safe DataFrames)

 Building Interactive Data Analytics Apps With Flask and Spark
Line 55: Line 57:
 Spark Streaming

-* A Simple Example - Stream of Integers / Moving Average
+* A Simple Example - Stream of Integers / Rolling Sum
+* Streaming Data via TCP socket (netcat)
+* Streaming Data via Kafka topic (Apache Kafka)
+* Storing Streaming Analytics Results in a NoSQL Datastore (Apache Cassandra)
+* Structured Streaming / Infinite DataFrames

 Advanced Spark Programming
Line 64: Line 70:
  
 * Overview and Terminology
-* Machine Learning Basics. What is a Feature
-* The LabeledPoint Data Type
+* Machine Learning Basics
 * TF-IDF
 * Preparing The Data For Analysis / Stemming, Stopword Elimination
-* LogisticRegressionWithSGD / Filtering Spam
+* Linear Regression / The Longley Dataset
+* Logistic Regression / Filtering Spam
+* Decision Trees
+* Random Forests
+
+Parallel graph processing with GraphX
+
+* A Simple Example - PageRank

 Exercises
Line 79: Line 91:
 * The Brown Corpus (NLTK). Stylistic Classification with Cosine Similarity
 * Sensor Data. Detecting Tachycardia and Bradycardia in an ECG Stream
-== Registration is now closed ==
-
-If you have any questions, please ask them [[https://github.com/dserban/datascience2016summer/issues/1|here]].

 == Prerequisites ==
Line 90: Line 99:
  
 You can register for the workshop using the [[https://docs.google.com/forms/d/1ocS-KDKF99HWILR5LaEV8KE-38xa0kltZoFu8sKWjI8/viewform|online registration form]].
+
+Deadline: September 2nd, 23:59.
+
+If you have any questions, please ask them [[https://github.com/dserban/datascience2016summer/issues/1|here]].

 == Instructor ==
Line 98: Line 111:
  
 ==== _____________ ====
-
-== Participants ==
-
  
sesiuni/data_science.1464613725.txt.gz · Last modified: 2016/05/30 16:08 by dserban