SPARK BY EXAMPLES | LEARN SPARK TUTORIAL WITH EXAMPLES

In this Apache Spark Tutorial, you will learn Spark with Scala code examples. Every
example explained here is available in the Spark Examples GitHub project for
reference. All examples provided in this tutorial are basic, simple, and easy to
practice for beginners who are enthusiastic to learn Spark, and all of them were
tested in our development environment.

Note: If you can't find the Spark example you are looking for on this tutorial
page, I recommend using the Search option from the menu bar to find it.

The examples explained in this Spark tutorial use Scala, and the same concepts are
also covered in the PySpark Tutorial (Spark with Python) examples. Python also
supports Pandas, which has its own DataFrame, but it is not distributed.

[Embedded video: PySpark Tutorial For Beginners (Spark with Python)]


WHAT IS APACHE SPARK?

Apache Spark is an open-source analytical processing engine for large-scale,
distributed data processing and machine learning applications. Spark was originally
developed at the University of California, Berkeley, and was later donated to the
Apache Software Foundation. In February 2014, Spark became a Top-Level Apache
Project; thousands of engineers have contributed to it, making Spark one of the
most active open-source projects at Apache.

Apache Spark is a framework that supports Scala, Python, R, and Java. Below are the
different implementations of Spark.

 * Spark – Default interface for Scala and Java
 * PySpark – Python interface for Spark
 * SparklyR – R interface for Spark.


APACHE SPARK FEATURES

 * In-memory computation
 * Distributed processing using parallelize
 * Can be used with many cluster managers (Spark Standalone, YARN, Mesos, etc.)
 * Fault-tolerant
 * Immutable
 * Lazy evaluation
 * Cache & persistence
 * Built-in optimization when using DataFrames
 * Supports ANSI SQL


APACHE SPARK ADVANTAGES

 * Spark is a general-purpose, in-memory, fault-tolerant, distributed processing engine that allows you to process data efficiently.
 * Applications running on Spark can be up to 100x faster than traditional systems.
 * You will get great benefits from using Spark for data ingestion pipelines.
 * Using Spark, we can process data from Hadoop HDFS, AWS S3, Databricks DBFS, Azure Blob Storage, and many other file systems.
 * Spark is also used to process real-time data using Spark Streaming and Kafka.
 * Using Spark Streaming, you can also stream files from the file system and stream data from a socket.
 * Spark natively has machine learning and graph libraries.
 * It provides connectors to store data in NoSQL databases like MongoDB.


APACHE SPARK ARCHITECTURE

Apache Spark works in a master-slave architecture where the master is called the
“Driver” and the slaves are called “Workers”. When you run a Spark application,
the Spark Driver creates a context that is the entry point to your application;
all operations (transformations and actions) are executed on worker nodes, and
the resources are managed by the Cluster Manager.

Source: https://spark.apache.org/


CLUSTER MANAGER TYPES

As of writing this Apache Spark Tutorial, Spark supports the following cluster managers:

 * Standalone – a simple cluster manager included with Spark that makes it easy
   to set up a cluster.
 * Apache Mesos – a cluster manager that can also run Hadoop MapReduce and Spark applications.
 * Hadoop YARN – the resource manager in Hadoop 2. This is the most commonly used cluster manager.
 * Kubernetes – an open-source system for automating deployment, scaling, and
   management of containerized applications.

local – not really a cluster manager, but still worth mentioning because we use
“local” as the master() value in order to run Spark on your laptop/computer.
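
To make the master() setting concrete, here is a minimal sketch of creating a SparkSession in local mode; the cluster URLs in the comments are only illustrative.


import org.apache.spark.sql.SparkSession

// master() decides where the application runs:
//   "local[*]"           -> run locally using all available cores
//   "yarn"               -> submit to a Hadoop YARN cluster
//   "spark://host:7077"  -> submit to a Spark Standalone cluster
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("SparkByExamples.com")
  .getOrCreate()
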


SPARK INSTALLATION

In order to run the Apache Spark examples mentioned in this tutorial, you need to
have Spark and its required tools installed on your computer. Since most
developers use Windows for development, I will explain how to install Spark on
Windows in this tutorial. You can also install Spark on a Linux server if needed.

Download Apache Spark from the Spark Download page and select the link under
“Download Spark (point 3)”. If you want to use a different version of Spark &
Hadoop, select it from the drop-downs; the link at point 3 changes to the
selected version and provides you with an updated download link.

After downloading, untar the binary using 7zip and copy the extracted folder
spark-3.0.0-bin-hadoop2.7 to c:\apps.

Now set the following environment variables.


SPARK_HOME  = C:\apps\spark-3.0.0-bin-hadoop2.7
HADOOP_HOME = C:\apps\spark-3.0.0-bin-hadoop2.7
PATH=%PATH%;C:\apps\spark-3.0.0-bin-hadoop2.7\bin



SETUP WINUTILS.EXE

Download the winutils.exe file from winutils and copy it to the %SPARK_HOME%\bin
folder. winutils is different for each Hadoop version, so download the right
version from https://github.com/steveloughran/winutils


SPARK-SHELL

The Spark binary comes with an interactive spark-shell. In order to start a shell,
go to your SPARK_HOME/bin directory and type “spark-shell”. This command loads
Spark and displays what version of Spark you are using.

spark-shell

By default, spark-shell provides the spark (SparkSession) and sc (SparkContext)
objects to use. Let’s see some examples.

spark-shell create RDD
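
For instance, the commands behind such an example might look like the following minimal sketch; spark and sc are pre-created by the shell, and the data is made up.


// typed at the scala> prompt inside spark-shell
val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5))
println(rdd.count())   // prints 5
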

spark-shell also creates a Spark context Web UI, which by default can be accessed
at http://localhost:4040.


SPARK-SUBMIT

The spark-submit command is a utility for running or submitting a Spark or PySpark
application (or job) to a cluster by specifying options and configurations. The
application you are submitting can be written in Scala, Java, or Python (PySpark).
You can use this utility in order to do the following.

 1. Submit a Spark application to different cluster managers such as Yarn,
    Kubernetes, Mesos, and Standalone.
 2. Submit a Spark application in client or cluster deployment mode.


./bin/spark-submit \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  --driver-memory <value>g \
  --executor-memory <value>g \
  --executor-cores <number of cores> \
  --jars <comma separated dependencies> \
  --class <main-class> \
  <application-jar> \
  [application-arguments]




SPARK WEB UI

Apache Spark provides a suite of Web UIs
(Jobs, Stages, Tasks, Storage, Environment, Executors, and SQL) to monitor the
status of your Spark application, the resource consumption of the Spark cluster,
and the Spark configurations. On the Spark Web UI, you can see how the operations
are executed.

Spark Web UI


SPARK HISTORY SERVER

The Spark History Server keeps a log of all completed Spark applications you
submit via spark-submit or spark-shell. Before you start it, you first need to
set the below config in spark-defaults.conf.


spark.eventLog.enabled true
spark.history.fs.logDirectory file:///c:/logs/path



Now, start the Spark History Server on Linux or Mac by running:


$SPARK_HOME/sbin/start-history-server.sh



If you are running Spark on Windows, you can start the History Server with the
below command.


$SPARK_HOME/bin/spark-class.cmd org.apache.spark.deploy.history.HistoryServer



By default, the History Server listens on port 18080, and you can access it from
a browser at http://localhost:18080/

Spark History Server

By clicking on each App ID, you will get the details of the application in the
Spark Web UI.

The History Server is very helpful when you are doing Spark performance tuning to
improve Spark jobs, since you can cross-check a previous application run against
the current run.


SPARK MODULES

 * Spark Core
 * Spark SQL
 * Spark Streaming
 * Spark MLlib
 * Spark GraphX

Spark Modules


SPARK CORE

In this section of the Apache Spark Tutorial, you will learn different concepts
of the Spark Core library with examples in Scala code. Spark Core is the main
base library of Spark; it provides the abstractions for distributed task
dispatching, scheduling, basic I/O functionality, and more.

Before getting your hands dirty with Spark programming, set up your development
environment to run the Spark examples using IntelliJ IDEA.


SPARKSESSION

SparkSession, introduced in version 2.0, is the entry point to underlying Spark
functionality for programmatically working with Spark RDDs, DataFrames, and
Datasets. Its object spark is available by default in spark-shell.

Creating a SparkSession instance is the first statement you write in a program
that uses RDDs, DataFrames, or Datasets. A SparkSession is created
using the SparkSession.builder() builder pattern.


import org.apache.spark.sql.SparkSession
val spark:SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExamples.com")
      .getOrCreate()   




SPARK CONTEXT

SparkContext has been available since Spark 1.x (JavaSparkContext for Java) and
was the entry point to Spark and PySpark before SparkSession was introduced in
2.0. Creating a SparkContext was the first step in a program that used RDDs and
connected to a Spark cluster. Its object sc is available by default in spark-shell.

Since Spark 2.x, when you create a SparkSession, a SparkContext object is created
by default, and it can be accessed using spark.sparkContext.

Note that you can create just one SparkContext per JVM, but you can create many
SparkSession objects.
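
As a minimal sketch, the SparkContext can be pulled out of the SparkSession created earlier:


// the SparkContext is created for you when the SparkSession is built
val sc = spark.sparkContext
println(sc.appName)   // application name set via appName()
println(sc.master)    // master URL, e.g. local[1]
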


RDD SPARK TUTORIAL

An RDD (Resilient Distributed Dataset) is a fundamental data structure of Spark
and the primary data abstraction in Apache Spark and Spark Core. RDDs are
fault-tolerant, immutable distributed collections of objects, which means that
once you create an RDD, you cannot change it. Each dataset in an RDD is divided
into logical partitions, which can be computed on different nodes of the cluster.

This Apache Spark RDD Tutorial will help you start understanding and using
Apache Spark RDDs (Resilient Distributed Datasets) with Scala code examples. All
RDD examples provided in this tutorial were also tested in our development
environment and are available in the Spark Scala examples project on GitHub for
quick reference.

In this section of the Apache Spark tutorial, I will introduce the RDD and
explain how to create RDDs and how to use their transformation and action
operations. Here is the full article on Spark RDD in case you want to learn more
and strengthen your fundamentals.


RDD CREATION

RDDs are created primarily in two different ways: first, by parallelizing an
existing collection, and second, by referencing a dataset in an external storage
system (HDFS, S3, and many more).

SPARKCONTEXT.PARALLELIZE()

sparkContext.parallelize() is used to parallelize an existing collection in your
driver program. This is the basic method to create an RDD.


//Create RDD from parallelize    
val dataSeq = Seq(("Java", 20000), ("Python", 100000), ("Scala", 3000))   
val rdd=spark.sparkContext.parallelize(dataSeq)



SPARKCONTEXT.TEXTFILE()

Using the textFile() method, we can read a text (.txt) file into an RDD from many
sources such as HDFS, S3, Azure, the local file system, etc.


//Create RDD from external Data source
val rdd2 = spark.sparkContext.textFile("/path/textFile.txt")




RDD OPERATIONS

On Spark RDD, you can perform two kinds of operations.

RDD TRANSFORMATIONS

Spark RDD transformations are lazy operations, meaning they don’t execute until
you call an action on the RDD. Since RDDs are immutable, when you run a
transformation (for example map()), it returns a new RDD instead of updating the
current one.

Some transformations on RDDs
are flatMap(), map(), reduceByKey(), filter(), and sortByKey(); all of these
return a new RDD instead of updating the current one.

RDD ACTIONS

An RDD action operation returns values from an RDD to the driver node. In other
words, any RDD function that returns something other than RDD[T] is considered an
action. Actions trigger the computation and return the results to the driver
program.

Some actions on RDDs are count(), collect(), first(), max(), reduce(), and more.
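
As an illustration, here is a minimal word-count-style sketch that chains a few transformations and finishes with an action; it reuses the spark object created earlier, and the input data is made up.


// transformations are lazy; nothing executes until collect() is called
val wordsRdd = spark.sparkContext.parallelize(Seq("java", "python", "scala", "java", "scala"))
val countsRdd = wordsRdd
  .map(word => (word, 1))    // transformation: pair each word with a count of 1
  .reduceByKey(_ + _)        // transformation: sum the counts per word
val result = countsRdd.collect()   // action: returns the results to the driver
result.foreach(println)
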


RDD EXAMPLES

 * How to read multiple text files into RDD
 * Read CSV file into RDD
 * Ways to create an RDD
 * Create an empty RDD
 * RDD Pair Functions
 * Generate DataFrame from RDD


DATAFRAME SPARK TUTORIAL WITH BASIC EXAMPLES

The DataFrame definition is very well explained by Databricks, so I do not want to
define it again and confuse you. Below is the definition I took from Databricks.

> DataFrame is a distributed collection of data organized into named columns. It
> is conceptually equivalent to a table in a relational database or a data frame
> in R/Python, but with richer optimizations under the hood. DataFrames can be
> constructed from a wide array of sources such as structured data files, tables
> in Hive, external databases, or existing RDDs.
> 
> – Databricks


DATAFRAME CREATION

The simplest way to create a DataFrame is from a Seq collection. A DataFrame can
also be created from an RDD or by reading files from several sources.

USING CREATEDATAFRAME()

By using the createDataFrame() function of the SparkSession, you can create a
DataFrame.


val data = Seq(("James", "", "Smith", "1991-04-01", "M", 3000),
  ("Michael", "Rose", "", "2000-05-19", "M", 4000),
  ("Robert", "", "Williams", "1978-09-05", "M", 4000),
  ("Maria", "Anne", "Jones", "1967-12-01", "F", 4000),
  ("Jen", "Mary", "Brown", "1980-02-17", "F", -1)
)

val columns = Seq("firstname", "middlename", "lastname", "dob", "gender", "salary")
val df = spark.createDataFrame(data).toDF(columns: _*)



Since DataFrames are in a structured format that contains column names and types,
we can get the schema of the DataFrame using df.printSchema().

df.show() displays the first 20 rows of the DataFrame.


+---------+----------+--------+----------+------+------+
|firstname|middlename|lastname|dob       |gender|salary|
+---------+----------+--------+----------+------+------+
|James    |          |Smith   |1991-04-01|M     |3000  |
|Michael  |Rose      |        |2000-05-19|M     |4000  |
|Robert   |          |Williams|1978-09-05|M     |4000  |
|Maria    |Anne      |Jones   |1967-12-01|F     |4000  |
|Jen      |Mary      |Brown   |1980-02-17|F     |-1    |
+---------+----------+--------+----------+------+------+



In this Apache Spark SQL DataFrame Tutorial, I have explained several of the most
used operations/functions on DataFrame & Dataset with working Scala examples.
This is a work-in-progress section, and more articles and samples are coming.

 * Different ways to create a DataFrame
 * How to create an empty DataFrame
 * How to create an empty DataSet
 * Spark DataFrame – Rename nested column
 * How to add or update a column on DataFrame
 * How to drop a column on DataFrame
 * Spark when otherwise usage
 * How to add literal constant to DataFrame
 * Spark Data Types explained
 * How to change column data type
 * How to Pivot and Unpivot a DataFrame
 * Create a DataFrame using StructType & StructField schema
 * How to select the first row of each group
 * How to sort DataFrame
 * How to union DataFrame
 * How to drop Rows with null values from DataFrame
 * How to split single to multiple columns
 * How to concatenate multiple columns
 * How to replace null values in DataFrame
 * How to remove duplicate rows on DataFrame
 * How to remove distinct on multiple selected columns
 * Spark map() vs mapPartitions()


SPARK DATAFRAME ADVANCED CONCEPTS

 * Spark Partitioning, Repartitioning and Coalesce
 * How does Spark shuffle work?
 * Spark Cache and Persistence
 * Spark Persistence Storage Levels
 * Spark Broadcast shared variable
 * Spark Accumulator shared variable
 * Spark UDF


SPARK ARRAY AND MAP OPERATIONS

 * How to create an Array (ArrayType) column on DataFrame
 * How to create a Map (MapType) column on DataFrame
 * How to convert an Array to columns
 * How to create an Array of struct column
 * How to explode an Array and map columns
 * How to explode an Array of structs
 * How to explode an Array of map columns to rows
 * How to create a DataFrame with nested Array
 * How to explode nested Arrays to rows
 * How to flatten nested Array to single Array
 * Spark – Convert array of String to a String column


SPARK AGGREGATE

 * How to group rows in DataFrame
 * How to get Count distinct on DataFrame
 * How to add row number to DataFrame
 * How to select the first row of each group


SPARK SQL JOINS

 * Spark SQL Join
 * How to join multiple DataFrames
 * How to inner join two tables/DataFrame
 * How to do self join
 * How to join tables on multiple columns


SPARK PERFORMANCE

 * Spark Performance Improvement


OTHER HELPFUL TOPICS ON DATAFRAME

 * How to stop DEBUG & INFO log messages
 * Print DataFrame full column contents
 * Unstructured vs semi-structured vs structured files


SPARK SQL SCHEMA & STRUCTTYPE

 * How to convert case class to a schema
 * Spark Schema explained with examples
 * How to create array of struct column
 * Spark StructType & StructField
 * How to flatten nested column


SPARK SQL FUNCTIONS

Spark SQL provides several built-in functions. When possible, try to leverage the
standard library, as built-in functions provide a bit more compile-time safety,
handle nulls, and perform better compared to UDFs. If your application is
performance-critical, try to avoid custom UDFs at all costs, as their performance
is not guaranteed. See the short sketch after the list below for a comparison point.

In this section, we will see several Spark SQL function tutorials with Scala
examples.

 * Spark Date and Time Functions
 * Spark String Functions
 * Spark Array Functions
 * Spark Map Functions
 * Spark Aggregate Functions
 * Spark Window Functions
 * Spark Sort Functions
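
For example, a minimal sketch of using the built-in upper() function on the df DataFrame created earlier, rather than writing a custom UDF for the same task:


import org.apache.spark.sql.functions._

// built-in functions operate on Column objects and are optimized by Catalyst
df.select(col("firstname"), upper(col("firstname")).alias("firstname_upper")).show()
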




SPARK DATA SOURCE WITH EXAMPLES

Spark SQL supports operating on a variety of data sources through the DataFrame
interface. This section of the tutorial describes reading and writing data using
the Spark Data Sources with Scala examples. Using the Data Source API, we can load
data from or save data to RDBMS databases, Avro, Parquet, XML, etc.
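
As an illustration, here is a hedged sketch of reading a CSV file into a DataFrame and writing it back out as Parquet; the file paths are placeholders.


// read a CSV file with a header row, letting Spark infer the schema
val csvDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/input.csv")

// write the same data out in Parquet format
csvDF.write.mode("overwrite").parquet("/path/output_parquet")
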


TEXT

 * Spark process Text file
 * How to process JSON from a Text file


CSV

 * How to process CSV file
 * How to convert Parquet file to CSV file
 * How to process JSON from a CSV file
 * How to Convert Avro file to CSV file
 * How to convert CSV file to Avro, Parquet & JSON


JSON

 * JSON Example (Read & Write)
 * How to Read JSON from multi-line
 * How to read JSON file with custom schema
 * How to process JSON from a CSV file
 * How to process JSON from a Text file
 * How to convert Parquet file to JSON file
 * How to convert Avro file to JSON file
 * How to convert JSON to Avro, Parquet, CSV file


PARQUET

 * Parquet Example (Read and Write)
 * How to convert Parquet file to CSV file
 * How to convert Parquet file to Avro file
 * How to convert Avro file to Parquet file


AVRO

 * Avro Example (Read and Write)
 * Spark 2.3 – Apache Avro Example
 * How to Convert Avro file to CSV file
 * How to convert Parquet file to Avro file
 * How to convert Avro file to JSON file
 * How to convert Avro file to Parquet file


XML

 * Processing Nested XML structured files
 * How to validate XML with XSD


SQL SPARK TUTORIAL

Spark SQL is one of the most used Spark modules; it is used for processing
structured, columnar data. Once you have created a DataFrame, you can
interact with the data by using SQL syntax. In other words, Spark SQL brings
native raw SQL queries to Spark, meaning you can run traditional ANSI SQL on
Spark DataFrames. In the later sections of this Apache Spark tutorial, you will
learn in detail how to use SQL select, where, group by, join, union, etc.

In order to use SQL, we first need to create a temporary view on the DataFrame
using the createOrReplaceTempView() function. Once created, this view can be
accessed throughout the SparkSession, and it is dropped when your
SparkContext terminates.

An SQL query is executed against the view using the sql() method of the
SparkSession; this method returns a new DataFrame.


df.createOrReplaceTempView("PERSON_DATA")
val df2 = spark.sql("SELECT * from PERSON_DATA")
df2.printSchema()
df2.show()



Let’s see another example using group by.


val groupDF = spark.sql("SELECT gender, count(*) from PERSON_DATA group by gender")
groupDF.show()



This yields the below output


+------+--------+
|gender|count(1)|
+------+--------+
|     F|       2|
|     M|       3|
+------+--------+



Similarly, you can run any traditional SQL queries on DataFrames using Spark
SQL.


SPARK HDFS & S3 TUTORIAL

 * Processing files from Hadoop HDFS (TEXT, CSV, Parquet, Avro, JSON)
 * Processing TEXT files from Amazon S3 bucket
 * Processing JSON files from Amazon S3 bucket
 * Processing CSV files from Amazon S3 bucket
 * Processing Parquet files from Amazon S3 bucket
 * Processing Avro files from Amazon S3 bucket


SPARK STREAMING TUTORIAL & EXAMPLES

Spark Streaming is a scalable, high-throughput, fault-tolerant stream
processing system that supports both batch and streaming workloads. It is used
to process real-time data from sources like a file system folder, a TCP
socket, S3, Kafka, Flume, Twitter, and Amazon Kinesis, to name a few. The
processed data can be pushed to databases, Kafka, live dashboards, etc. A minimal
socket-source sketch is shown after the list below.

source: https://spark.apache.org/
 * Spark Streaming – OutputModes Append vs Complete vs Update
 * Spark Streaming – Read JSON Files From Directory with Scala Example
 * Spark Streaming – Read data From TCP Socket with Scala Example
 * Spark Streaming – Consuming & Producing Kafka messages in JSON format
 * Spark Streaming – Consuming & Producing Kafka messages in Avro format
 * Using from_avro and to_avro functions
 * Reading Avro data from Kafka topic using from_avro() and to_avro()
 * Spark Batch Processing using Kafka Data Source
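
As mentioned above, here is a minimal Structured Streaming sketch that reads lines from a TCP socket and prints them to the console; the host and port are just examples.


// read a stream of lines from a socket source
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// print every received line to the console as it arrives
val query = lines.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()
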


SPARK WITH KAFKA TUTORIALS

 * Spark Streaming – Consuming & Producing Kafka messages in JSON format
 * Spark Streaming – Consuming & Producing Kafka messages in Avro format
 * Using from_avro and to_avro functions
 * Reading Avro data from Kafka topic using from_avro() and to_avro()
 * Spark Batch Processing using Kafka Data Source
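
For context, here is a hedged sketch of subscribing to a Kafka topic as a streaming source; it assumes the spark-sql-kafka-0-10 package is on the classpath and that a broker is reachable at localhost:9092 (both are assumptions, not part of the original tutorial).


// Kafka source: key and value arrive as binary columns
val kafkaDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topic1")
  .load()

// cast key and value to readable strings
val messages = kafkaDF.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
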


SPARK – HBASE TUTORIALS & EXAMPLES

In this section of the Spark Tutorial, you will learn about several Apache HBase
Spark connectors and how to read an HBase table into a Spark DataFrame and write
a DataFrame to an HBase table.

 * Spark HBase Connectors explained
 * Writing Spark DataFrame to HBase table using shc-core Hortonworks library
 * Creating Spark DataFrame from Hbase table using shc-core Hortonworks library


SPARK – HIVE TUTORIALS

In this section, you will learn what Apache Hive is, with several examples of
connecting to Hive, creating Hive tables, and reading them into a DataFrame. A
minimal Hive-enabled SparkSession sketch follows the list below.

 * Start HiveServer2 and connect to hive beeline
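
As referenced above, here is a minimal sketch of building a Hive-enabled SparkSession and querying a Hive table; it assumes a Hive metastore is configured, and the table name employee is made up.


import org.apache.spark.sql.SparkSession

// enableHiveSupport() connects the session to the configured Hive metastore
val sparkWithHive = SparkSession.builder()
  .appName("SparkByExamples.com")
  .enableHiveSupport()
  .getOrCreate()

sparkWithHive.sql("SHOW TABLES").show()
sparkWithHive.sql("SELECT * FROM employee").show()
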


SPARK GRAPHX AND GRAPHFRAMES

GraphFrames were introduced to support graphs on DataFrames. The older GraphX
library runs on RDDs and loses all DataFrame capabilities.

REFERENCES:

 * Apache Spark Introduction
 * Learn Apache Spark from a wiki page
 * Current & Previous releases
 * About Spark from Databricks
 * Spark Github source code



SPARK TUTORIAL

 * Spark – Installation on Mac
 * Spark – Installation on Windows
 * Spark – Installation on Linux | Ubuntu
 * Spark – Cluster Setup with Hadoop Yarn
 * Spark – Web/Application UI
 * Spark – Setup with Scala and IntelliJ
 * Spark – How to Run Examples From this Site on IntelliJ IDEA
 * Spark – SparkSession
 * Spark – SparkContext


SPARK RDD TUTORIAL

 * Spark RDD – Parallelize
 * Spark RDD – Read text file
 * Spark RDD – Read CSV
 * Spark RDD – Create RDD
 * Spark RDD – Actions
 * Spark RDD – Pair Functions
 * Spark RDD – Repartition and Coalesce
 * Spark RDD – Shuffle Partitions
 * Spark RDD – Cache vs Persist
 * Spark RDD – Persistence Storage Levels
 * Spark RDD – Broadcast Variables
 * Spark RDD – Accumulator Variables
 * Spark RDD – Convert RDD to DataFrame

SPARK SQL TUTORIAL

 * DataFrame – createDataFrame()
 * DataFrame – where() & filter()
 * DataFrame – withColumn()
 * DataFrame – withColumnRenamed()
 * DataFrame – drop()
 * DataFrame – distinct()
 * DataFrame – groupBy()
 * DataFrame – join()
 * DataFrame – map() vs mapPartitions()
 * DataFrame – foreach() vs foreachPartition()
 * DataFrame – pivot()
 * DataFrame – union()
 * DataFrame – collect()
 * DataFrame – cache() & persist()
 * DataFrame – udf()
 * Spark SQL StructType & StructField

SPARK SQL FUNCTIONS

 * Spark SQL String Functions
 * Spark SQL Date and Timestamp Functions
 * Spark SQL Array Functions
 * Spark SQL Map Functions
 * Spark SQL Sort Functions
 * Spark SQL Aggregate Functions
 * Spark SQL Window Functions
 * Spark SQL JSON Functions

SPARK DATA SOURCE API

 * Spark – Read & Write CSV file
 * Spark – Read and Write JSON file
 * Spark – Read & Write Parquet file
 * Spark – Read & Write XML file
 * Spark – Read & Write Avro files
 * Spark – Read & Write Avro files (Spark version 2.3.x or earlier)
 * Spark – Read & Write HBase using “hbase-spark” Connector
 * Spark – Read & Write from HBase using Hortonworks
 * Spark – Read & Write ORC file
 * Spark – Read Binary File

SPARK STREAMING & KAFKA

 * Spark Streaming – OutputModes
 * Spark Streaming – Reading Files From Directory
 * Spark Streaming – Reading Data From TCP Socket
 * Spark Streaming – Processing Kafka Messages in JSON Format
 * Spark Streaming – Processing Kafka messages in AVRO Format
 * Spark SQL Batch – Consume & Produce Kafka Message


