Big Data with Spark (3 days)

Course Objectives

Three days of training on Spark, exploring the opportunities this framework offers, especially when used in conjunction with Cassandra or for Machine Learning. Hands-on labs make up more than 50% of the training.

Course Content

Introduction to Apache Spark

  • Purpose of the framework and use cases
  • History of the framework
  • Relationship to Hadoop and comparison
  • Spark’s modules
  • Integration into the ecosystem
  • Introduction to MapReduce
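
To make the MapReduce model concrete, here is a minimal word-count sketch in Scala (the app name and input path are illustrative): the map phase emits (word, 1) pairs and the reduce phase sums the counts per word.

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("wordcount").setMaster("local[*]"))
        sc.textFile("input.txt")          // read the input line by line
          .flatMap(_.split("\\s+"))       // map: split each line into words
          .map(word => (word, 1))         // map: emit a (word, 1) pair
          .reduceByKey(_ + _)             // reduce: sum the counts per word
          .collect()
          .foreach(println)
        sc.stop()
      }
    }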

Scala basics

  • Why use Scala to write Spark applications
  • Declaring variables, methods and classes
  • Lambda expressions
  • Pattern matching
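
For reference, a short sketch of these constructs (all names are illustrative):

    // Variables: val is immutable, var is mutable
    val greeting: String = "Hello"
    var counter = 0

    // A method and a simple class
    def square(x: Int): Int = x * x
    class Point(val x: Int, val y: Int)

    // Lambda expressions passed to a higher-order function
    val doubled = List(1, 2, 3).map(n => n * 2)

    // Pattern matching on value and type
    def describe(value: Any): String = value match {
      case 0         => "zero"
      case n: Int    => s"the integer $n"
      case s: String => s"the string '$s'"
      case _         => "something else"
    }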

Spark’s API

  • Resilient Distributed Datasets (RDD)
  • Creating RDDs: supported sources
  • Transformations on RDDs: supported operations
  • Final actions on RDDs
  • Partitioning: default values and tuning
  • Persisting RDDs in memory or on disk
  • Accumulators and broadcast variables
  • Hands-on:
    • Putting the RDD API into practice
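
A minimal sketch of this lifecycle, assuming a Spark 2.x shell where sc is already defined (data and names are illustrative): transformations stay lazy until a final action runs, and persist keeps an intermediate result in memory for reuse.

    import org.apache.spark.storage.StorageLevel

    // Create an RDD from a local collection, split into 4 partitions
    // (files, HDFS, and other sources are also supported)
    val numbers = sc.parallelize(1 to 1000, numSlices = 4)

    // Transformations are lazy: nothing is computed yet
    val evenSquares = numbers.filter(_ % 2 == 0).map(n => n * n)

    // Persist in memory before reusing the RDD in several actions
    evenSquares.persist(StorageLevel.MEMORY_ONLY)

    // Final actions trigger the actual computation
    println(evenSquares.count())
    println(evenSquares.take(5).mkString(", "))

    // Accumulator: a counter the workers can only add to
    val processed = sc.longAccumulator("processed")
    numbers.foreach(_ => processed.add(1))
    println(processed.value)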

Spark SQL & DataFrames

  • DataFrames - How they work
  • Creating DataFrames from RDDs or using a reader
  • Querying DataFrames with SQL
  • Reusing DataFrames
  • Hands-on:
    • Putting the DataFrames API into practice
    • Data exploration using Spark SQL
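
A minimal sketch, assuming Spark 2.x and an illustrative people.json file: a reader builds the DataFrame, a temporary view makes it queryable in SQL, and cache() lets later queries reuse it without re-reading the source.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("df-demo").master("local[*]").getOrCreate()

    // Create a DataFrame with a reader (JSON here; CSV, Parquet, JDBC... also work)
    val people = spark.read.json("people.json")

    // Register a temporary view and query it with SQL
    people.createOrReplaceTempView("people")
    val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")

    // Cache the result so it can be reused without recomputation
    adults.cache()
    adults.show()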

Spark in a cluster

  • Topology and terminology
  • Cluster managers: YARN, Mesos, Standalone
  • Data Locality principles and best practices
  • Best practices for setting up distributed storage and distributed processing
  • Hardware recommendations
  • Hands-on:
    • Setup of a Spark cluster
    • Experiments with HDFS
    • Resilience
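
The choice of cluster manager shows up in the application mostly through the master URL, as in this sketch (URLs and paths are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cluster-demo")
      // "local[*]" while developing; in a cluster, e.g.
      // "yarn", "mesos://host:5050" or "spark://host:7077" (standalone)
      .setMaster("spark://master-host:7077")
    val sc = new SparkContext(conf)

    // Reading from HDFS lets Spark schedule tasks close to the data blocks
    val lines = sc.textFile("hdfs:///data/input.txt")
    println(lines.count())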

Spark & Cassandra

  • Reading full Cassandra tables
  • Querying Cassandra with CQL
  • Object-record mapping
  • Writing RDD and DataFrames to Cassandra
  • Spark-Cassandra connector and Data Locality
  • Spark in DSE (DataStax Enterprise)
  • Hands-on:
    • Loading data into Cassandra from files using Spark
    • Reading data from Cassandra and denormalizing it
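
A minimal sketch with the spark-cassandra-connector, assuming a SparkContext sc whose configuration sets spark.cassandra.connection.host (keyspace, table, and column names are illustrative):

    import com.datastax.spark.connector._

    // Read a full table; each row comes back as a CassandraRow
    val users = sc.cassandraTable("shop", "users")
    println(users.count())

    // Push a CQL predicate down to Cassandra instead of filtering in Spark
    val recent = sc.cassandraTable("shop", "orders").where("year = ?", 2016)

    // Object-record mapping: rows materialized directly as a case class
    case class User(id: Int, name: String)
    val typedUsers = sc.cassandraTable[User]("shop", "users")

    // Write an RDD back to another table
    typedUsers.map(u => (u.id, u.name.toUpperCase))
      .saveToCassandra("shop", "users_upper", SomeColumns("id", "name"))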

Spark ML and MLlib

  • Introduction to Machine Learning
  • Types of ML algorithms
  • Introduction to the typical ML workflow: data cleansing, feature engineering...
  • Algorithms in Spark ML and MLlib
  • Using the Spark ML and MLlib APIs
  • Hands-on:
    • Feature engineering
    • Classification using Random Forests
    • Cross-validation
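
A minimal Spark ML sketch in the spirit of this lab, assuming Spark 2.x, a shell where spark is already defined, and an illustrative CSV with numeric feature columns f1..f3 and a 0/1 label column: a VectorAssembler handles the feature step and a RandomForestClassifier does the classification.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
    import org.apache.spark.ml.feature.VectorAssembler

    val data = spark.read
      .option("header", "true").option("inferSchema", "true")
      .csv("train.csv")

    // Feature engineering: assemble raw columns into one feature vector
    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2", "f3"))
      .setOutputCol("features")

    val rf = new RandomForestClassifier()
      .setLabelCol("label").setFeaturesCol("features").setNumTrees(50)

    // Train on 80% of the data, evaluate on the remaining 20%
    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
    val model = new Pipeline().setStages(Array(assembler, rf)).fit(train)

    val accuracy = new MulticlassClassificationEvaluator()
      .setLabelCol("label").setMetricName("accuracy")
      .evaluate(model.transform(test))
    println(s"accuracy = $accuracy")

For cross-validation, the same pipeline would be wrapped in a CrossValidator together with a ParamGridBuilder over the hyperparameters.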

Spark Streaming

  • Spark Streaming principles
  • Introduction to DStreams
  • Description of the API
  • Sliding window operations
  • Delivery guarantees
  • Comparison with Storm and Storm Trident
  • Hands-on:
    • Stream processing of tweets
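
A minimal DStream sketch in that spirit, using a socket source for simplicity (the lab uses the Twitter source; host, port, and durations are illustrative): hashtags are counted over a one-minute window that slides every ten seconds.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // At least 2 local threads: one for the receiver, one for processing
    val conf = new SparkConf().setAppName("streaming-demo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))  // 5-second micro-batches

    val lines = ssc.socketTextStream("localhost", 9999)

    // Sliding window: count hashtags over the last 60s, recomputed every 10s
    val tagCounts = lines
      .flatMap(_.split("\\s+")).filter(_.startsWith("#"))
      .map(tag => (tag, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))

    tagCounts.print()
    ssc.start()
    ssc.awaitTermination()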

Note: Hands-on labs will be done in Scala