Differences

This shows you the differences between two versions of the page.

sesiuni:data_science [2016/05/30 16:08]
dserban [Introduction]
sesiuni:data_science [2016/07/18 17:40]
dserban [Topics]
Line 8: Line 8:
 == When and Where? ==
 
-* **Days**: 25 June 2016 - 31 August 2016 (every Saturday and Sunday)
+* **Between**: 3-25 September 2016 (every Saturday and Sunday, and the actual workshop dates are 3, 4, 10, 11, 17, 18, 24, 25)
 
-An email will be sent to announce the room and the hours after the participant will be accepted.
+Private communications will be sent to the selected participants to announce further details (after registration is complete and the list of participants is finalized).
 
 == Topics ==
Line 20: Line 20:
 * Working in the PySpark shell
 * Working with PySpark in an iPython notebook
+* Building Spark/Scala Applications with sbt
 * Standalone Applications
  
Line 48: Line 49:
 * The DataFrame API
 * Inner Joins and Left Outer Joins in the RDD API versus in Spark SQL
+* Datasets (compile-time type-safe DataFrames)
 
 Building Interactive Data Analytics Apps With Flask and Spark
Line 55: Line 57:
 Spark Streaming
 
-* A Simple Example - Stream of Integers / Moving Average
+* A Simple Example - Stream of Integers / Rolling Sum
+* Streaming Data via TCP socket (netcat)
+* Streaming Data via Kafka topic (Apache Kafka)
+* Storing Streaming Analytics Results in a NoSQL Datastore (Apache Cassandra)
+* Structured Streaming / Infinite DataFrames
 
 Advanced Spark Programming
Line 64: Line 70:
  
 * Overview and Terminology
-* Machine Learning Basics. What is a Feature
+* Machine Learning Basics
-* The LabeledPoint Data Type
 * TF-IDF
 * Preparing The Data For Analysis / Stemming, Stopword Elimination
-* LogisticRegressionWithSGD / Filtering Spam
+* Linear Regression / The Longley Dataset
+* Logistic Regression / Filtering Spam
+* Decision Trees
+* Random Forests
+
+Parallel graph processing with GraphX
+
+* A Simple Example - PageRank
 
 Exercises
Line 79: Line 91:
 * The Brown Corpus (NLTK). Stylistic Classification with Cosine Similarity
 * Sensor Data. Detecting Tachycardia and Bradycardia in an ECG Stream
-== Registration is now closed ==
-
-If you have any questions, please ask them [[https://github.com/dserban/datascience2016summer/issues/1|here]].
 
 == Prerequisites ==
Line 90: Line 99:
  
 You can register for the workshop using the [[https://docs.google.com/forms/d/1ocS-KDKF99HWILR5LaEV8KE-38xa0kltZoFu8sKWjI8/viewform|online registration form]].
+
+Deadline: September 2nd, 23:59.
+
+If you have any questions, please ask them [[https://github.com/dserban/datascience2016summer/issues/1|here]].
 
 == Instructor ==
Line 98: Line 111:
  
 ==== _____________ ====
-
-== Participants ==
-
  
sesiuni/data_science.txt · Last modified: 2016/08/17 11:10 by fbratiloveanu