
Supergloo
 * Streaming
 * Spark
   * Spark Tutorials With Scala
   * PySpark
   * PySpark SQL
 * Kafka Tutorials and Examples
   * Kafka Tutorials and Examples
   * Kafka Connect
   * Kafka Streams
 * Books
   * Savings Bundle of Software Developer Classic Summaries
   * Clean Code Summary
   * Mythical Man Month Summary
   * Learning Spark Summary
   * Pragmatic Programmer Summary
   * Spark Tutorials with Scala Book
   * Data Science from Scratch Summary
 * Courses
   * Debezium in Production Course
   * Kafka Connect in Production Course
   * MirrorMaker in Production Course
   * Scala for Spark Course
   * Spark with Scala Course
   * PySpark Course
 * About



SPARK BROADCAST VARIABLES WHEN AND WHY



Apache Spark broadcast variables are available to all nodes in the cluster. They are used to cache a value in memory on all nodes, so it can be efficiently accessed by tasks running on those nodes. For example, broadcast variables are useful when large values need to be used in each Spark task. By using […]
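
As a quick sketch (not the post’s own code), here is how a broadcast variable might look in PySpark; the lookup dictionary and values are invented for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-sketch").getOrCreate()
sc = spark.sparkContext

# A small lookup table we want cached on every node (hypothetical data).
country_names = {"US": "United States", "DE": "Germany"}
bc = sc.broadcast(country_names)

codes = sc.parallelize(["US", "DE", "US"])
# Each task reads the broadcast value instead of shipping the dict per task.
print(codes.map(lambda c: bc.value.get(c, "unknown")).collect())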




PYTHON KAFKA IN TWO MINUTES. MAYBE LESS.



Although Apache Kafka is written in Java, there are Python Kafka clients available for use with it.  In this tutorial, let’s go through examples of Kafka with Python Producer and Consumer clients.  Let’s consider this a “Getting Started” tutorial.  After completing this, you will be ready to proceed to more complex examples.  But we need […]
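
For a taste of what such a client looks like, here is a minimal sketch using the kafka-python library (one of several Python client options); the broker address and topic name are placeholders:

from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("test-topic", b"hello kafka")
producer.flush()

consumer = KafkaConsumer(
    "test-topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.value)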




OPEN SOURCE CHANGE DATA CAPTURE IN 2023



Let’s consider three open source change data capture (CDC) options ready for
production in the year 2023. Before we begin, let’s confirm we all see the CDC
trend.  To me, it seems everywhere you look these days is all about change data
capture. From my perspective that wasn’t the case for many years. Do you […]




KAFKA AND DEAD LETTER QUEUES SUPPORT? YES AND NO



In this post, let’s answer the question of Kafka and dead letter queue support. But first, let’s start with an overview. A dead letter queue (DLQ) is a queue, or a topic in Kafka, used to hold messages which cannot be processed successfully. The origin of DLQs is traditional messaging systems, which were popular before
[…]
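
The pattern itself is easy to sketch. Assuming a kafka-python consumer, hypothetical topic names, and an invented process() function, a failing message is routed to a separate DLQ topic instead of blocking the stream:

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("orders", bootstrap_servers="localhost:9092")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

def process(value):
    """Hypothetical business logic; raises on bad input."""
    if not value:
        raise ValueError("empty message")

for message in consumer:
    try:
        process(message.value)
    except Exception:
        # Route the unprocessable message to a dead letter topic.
        producer.send("orders.dlq", message.value)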




DEEP DIVE INTO PYSPARK SQL FUNCTIONS



PySpark SQL functions are available for use in the SQL context of a PySpark
application. These functions allow us to perform various data manipulation and
analysis tasks such as filtering and aggregating data, performing inner and
outer joins, and conducting basic data transformations in PySpark. PySpark
functions and PySpark SQL functions are not the same […]
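
As a minimal illustration with invented data (not the post’s example), built-in functions from pyspark.sql.functions handle both transformation and aggregation:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])

# upper() transforms a column; avg() aggregates one.
df.select(F.upper(df.name).alias("name_upper")).show()
df.agg(F.avg("age").alias("avg_age")).show()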




PYSPARK DATAFRAMES BY EXAMPLE



A PySpark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external […]
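
A tiny sketch with invented data shows the idea; the same read API also loads files, Hive tables, and external databases:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Named columns, like a relational table or an R data frame.
df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])
df.printSchema()
df.show()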




WHY KAFKA CONNECT AND WHY NOT?



Apache Kafka Connect is a development framework for data integration between
Apache Kafka and other systems. It facilitates moving data between Kafka and
other systems, such as databases, message brokers, and file systems. A connector
which moves data INTO Kafka is called a “Source”, while a connector which moves
data OUT OF Kafka is called […]




PYSPARK WITHCOLUMN BY EXAMPLE



The PySpark withColumn function is used to add a new column to a PySpark
DataFrame or to replace the values in an existing column. To execute the PySpark
withColumn function you must supply two arguments. The first argument is the
name of the new or existing column. The second argument is the desired value to
[…]
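
For example, here is a minimal sketch with invented columns; the first argument names the column, the second supplies the value expression:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])

df = df.withColumn("age_plus_one", col("age") + 1)   # add a new column
df = df.withColumn("age", col("age") * 2)            # replace an existing one
df = df.withColumn("source", lit("example"))         # constant-valued column
df.show()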




PYSPARK UDF BY EXAMPLE



A PySpark UDF, or PySpark User Defined Function, is a powerful and flexible tool in PySpark. UDFs allow users to define their own custom functions and then use them in PySpark operations.  PySpark UDFs can provide a level of flexibility, customization, and control not possible with built-in PySpark SQL API functions.  They can allow developers […]
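
A minimal sketch (invented function and data) of defining and applying a UDF with an explicit return type:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Wrap a plain Python function as a UDF, declaring the return type.
shout = udf(lambda s: s.upper() + "!", StringType())
df.select(shout(df.name).alias("shouted")).show()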




KAFKA AUTHENTICATION TUTORIAL WITH EXAMPLES



Kafka provides multiple authentication options.  In this tutorial, we will describe and show the authentication options and then configure and run a demo example of Kafka authentication. There are two primary goals of this tutorial: to describe the available options and to run a working example. There are a few key subjects which must be considered when building a multi-tenant cluster, but it all starts with […]
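
As one hedged illustration of what a client-side authentication configuration can look like, here is SASL/PLAIN with the kafka-python library; the broker address and credentials are placeholders, and your cluster’s mechanism may differ:

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker.example.com:9093",
    security_protocol="SASL_SSL",        # TLS transport plus SASL auth
    sasl_mechanism="PLAIN",
    sasl_plain_username="demo-user",
    sasl_plain_password="demo-password",
)
producer.send("secured-topic", b"authenticated message")
producer.flush()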




PYSPARK FILTER BY EXAMPLE



In PySpark, the DataFrame filter function filters rows based on conditions over specified columns.  For example, with a DataFrame containing website click data, we may wish to keep only the rows matching a certain platform value contained in a certain column.  This would allow us to determine the most popular browser type used in website requests. Solutions like this may […]
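
A minimal sketch with invented click data; filter keeps only the rows satisfying the condition:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
clicks = spark.createDataFrame(
    [("chrome", "/home"), ("firefox", "/docs"), ("chrome", "/docs")],
    ["platform", "page"],
)

# Keep rows where the platform column equals "chrome".
clicks.filter(col("platform") == "chrome").show()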




HOW TO PYSPARK GROUPBY THROUGH EXAMPLES



In PySpark, the DataFrame groupBy function groups data together based on specified columns, so aggregations can be run on the collected groups.  For example, with a DataFrame containing website click data, we may wish to group together all the browser type values contained in a certain column, and then determine an overall count by each browser […]
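
For instance, a minimal sketch with invented click data, grouping by browser and counting each group:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
clicks = spark.createDataFrame(
    [("chrome",), ("firefox",), ("chrome",)],
    ["browser"],
)

clicks.groupBy("browser").count().show()  # one count row per browser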




KAFKA CONNECT REST API ESSENTIALS



The Kafka Connect REST API endpoints are used both for administration of Kafka Connectors (Sinks and Sources) and for the Kafka Connect service itself.  In this tutorial, we will explore the Kafka Connect REST API with examples.  Before we dive into specific examples, we need to set the context with an overview of Kafka Connect […]
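
As a small sketch of the idea, the REST API can be exercised from Python with the requests library; the worker address is the default port, and the connector name “my-sink” is hypothetical:

import requests

base = "http://localhost:8083"  # default Kafka Connect REST port

print(requests.get(f"{base}/connectors").json())         # list connectors
print(requests.get(f"{base}/connector-plugins").json())  # installed plugins

# Status of a hypothetical connector named "my-sink".
print(requests.get(f"{base}/connectors/my-sink/status").json())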




KAFKA CONSUMER GROUPS WITH KAFKA-CONSUMER-GROUPS.SH



How do Kafka administrators perform administrative and diagnostic actions on Kafka Consumer Groups?  This post explores a Kafka consumer group operations admin tool called kafka-consumer-groups.sh, a popular command-line tool included in Apache Kafka distributions.  There are other examples of both open source and 3rd party tools not included with Apache Kafka which can also be […]




PYSPARK JOINS WITH SQL



Use PySpark joins with SQL to compare, and possibly combine, data from two or more data sources based on matching field values.  This is simply called “joins” in many cases, and usually the data sources are tables from a database or flat file sources, but more often than not, the data sources are becoming Kafka topics.  Regardless […]
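
A minimal sketch with invented customer and order data, registering temp views and joining them in SQL:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
customers = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
orders = spark.createDataFrame([(1, 9.99), (1, 5.00)], ["customer_id", "total"])

customers.createOrReplaceTempView("customers")
orders.createOrReplaceTempView("orders")

spark.sql("""
    SELECT c.name, SUM(o.total) AS total_spent
    FROM customers c
    JOIN orders o ON c.id = o.customer_id
    GROUP BY c.name
""").show()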




PYSPARK JOIN EXAMPLES WITH DATAFRAME JOIN FUNCTION



PySpark joins are used to combine data from two or more DataFrames based on a
common field between them.  There are many different types of joins.  The
specific join type used is usually based on the business use case as well as
most optimal for performance.  Joins can be an expensive operation in
distributed systems […]
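
A minimal sketch with invented data; the third argument selects the join type, defaulting to an inner join:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
customers = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
orders = spark.createDataFrame([(1, 9.99), (3, 5.00)], ["customer_id", "total"])

# Inner join keeps matching rows only; "left" keeps every customer.
customers.join(orders, customers.id == orders.customer_id).show()
customers.join(orders, customers.id == orders.customer_id, "left").show()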




WHAT YOU NEED TO KNOW ABOUT DEBEZIUM



If you’re looking for an application for change data capture which includes
speed, durability, significant history in production deployments across a
variety of use cases, then Debezium may be for you. This open-source platform
provides streaming from a wide range of both relational and NoSQL based
databases to Kafka or Kinesis.  There are many advantages […]




STREAMING DATA ENGINEER USE CASES



As streaming data engineers, we face many data integration challenges such as “How do we integrate this SaaS with this internal database?”, “Will a particular integration be real-time or batch?”, “How does the system we design recover from possible failures?” and “If anyone has ever addressed a situation similar to mine before, how did […]




SCHEMA REGISTRY IN DATA STREAMING – OPTIONS AND CHOICES



A schema registry in data streaming use cases such as micro-service integration, streaming ETL, event-driven architectures, log ingest stream processing, etc., is not a requirement, but there are numerous reasons for implementing one.  The reasons for schema registries in data streaming architectures are plentiful and have been covered extensively already.  I’ve included some of […]




HOW TO GENERATE KAFKA STREAMING JOIN TEST DATA BY EXAMPLE



Why “Joinable” Streaming Test Data for Kafka? When creating streaming join
applications in KStreams, ksqldb, Spark, Flink, etc. with source data in Kafka,
it would be convenient to generate fake data with cross-topic relationships;
i.e. a customer topic and an order topic with a value attribute of customer.id. 
In this example, we might want to […]




SPARK STRUCTURED STREAMING WITH KAFKA EXAMPLE – PART 1



In this post, let’s explore an example of updating an existing Spark Streaming application to the newer Spark Structured Streaming.  We will start simple and then move to more advanced Kafka Spark Structured Streaming examples. My original Kafka Spark Streaming post is three years old now.  On the Spark side, the data abstractions have evolved […]
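
The core of such an update looks roughly like this PySpark sketch; it assumes the spark-sql-kafka package is on the classpath, and the broker and topic names are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "weather")
    .load()
)

# Kafka values arrive as bytes; cast to string and write to the console.
query = (
    stream.selectExpr("CAST(value AS STRING)")
    .writeStream.format("console")
    .start()
)
query.awaitTermination()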




RUNNING KAFKA CONNECT – STANDALONE VS DISTRIBUTED MODE EXAMPLES



One of the many benefits of running Kafka Connect is the ability to run single or multiple workers in tandem.  Running multiple workers provides a way for horizontal scale-out, which leads to increased capacity and/or automated resiliency.  For resiliency, this means answering the question, “what happens if a particular worker goes offline for any […]




GLOBALKTABLE VS KTABLE IN KAFKA STREAMS



Kafka Streams presents two options for materialized views in the forms of GlobalKTable and KTable.  We will describe the meaning of “materialized views” in a moment, but for now, let’s just agree there are pros and cons to GlobalKTable vs KTable. Need to learn more about Kafka Streams in Java? Here’s a pretty good option […]




WHAT AND WHY EVENT LOGS?



Before we begin diving into event logs, let’s start with a quote from one of my
software heroes. “The idea of structuring data as a stream of events is nothing
new, and it is used in many different fields. Even though the underlying
principles are often similar, the terminology is frequently inconsistent across
different fields, […]




AZURE KAFKA CONNECT EXAMPLE – BLOB STORAGE



In this Azure Kafka tutorial, let’s describe and demonstrate how to integrate
Kafka with Azure’s Blob Storage with existing Kafka Connect connectors.  Let’s
get a little wacky and cover writing to Azure Blob Storage from Kafka as well as
reading from Azure Blob Storage to Kafka.  In this case, “wacky” is a good
thing, I […]




STREAM PROCESSING



We choose Stream Processing as a way to process data more quickly than
traditional approaches.  But, how do we do Stream Processing? Is Stream
Processing different than Event Stream Processing?  Why do we need it?  What are
a few examples of event streaming patterns?  How do we implement it? Let’s get
into these questions. As […]




KAFKA CERTIFICATION TIPS FOR DEVELOPERS



If you are considering Kafka certification, this page describes what I did to pass the Confluent Certified Developer for Apache Kafka certification exam.  You may see it shortened to “CCDAK”. Good luck and hopefully this page is helpful for you! There are many reasons why you may wish […]




GCP KAFKA CONNECT GOOGLE CLOUD STORAGE EXAMPLES



In this GCP Kafka tutorial, I will describe and show how to integrate Kafka
Connect with GCP’s Google Cloud Storage (GCS).  We will cover writing to GCS
from Kafka as well as reading from GCS to Kafka.  Descriptions and examples will
be provided for both Confluent and Apache distributions of Kafka. I’ll document
the steps […]




KAFKA TEST DATA GENERATION EXAMPLES



After you start working with Kafka, you will soon find yourself asking the question, “how can I generate test data into my Kafka cluster?”  Well, I’m here to show you the many options for generating test data in Kafka.  In this post and demonstration video, we’ll cover a few of the ways you can generate […]




KAFKA CONNECT S3 EXAMPLES



In this Kafka Connect S3 tutorial, let’s demo multiple Kafka S3 integration examples.  We’ll cover writing to S3 from one topic and also from multiple Kafka source topics. Also, an example of an S3 Kafka source connector reading files from S3 and writing to Kafka will be shown. Examples will be provided for both […]




KAFKA STREAMS – TRANSFORMATIONS EXAMPLES



Kafka Streams Transformations provide the ability to perform actions on Kafka Streams such as filtering and updating values in the stream.  Kafka Streams transformations contain operations such as `filter`, `map`, `flatMap`, etc. and have similarities to functional combinators found in languages such as Scala.  And, if you are coming from Spark, you will also notice […]




STREAM PROCESSOR WINDOWS



When moving to stream processing architecture or building stream processors, you
will soon face two choices.  Will you process streams on an individual, per
event basis?  Or, will you collect and buffer multiple events/messages first,
and then apply a function or join results to this collection of events? Examples
of single event processing might be […]




KAFKA PRODUCER IN SCALA



Kafka Producers are one of the options to publish data events (messages) to Kafka topics.  Kafka Producers are custom coded in a variety of languages through the use of Kafka client libraries.  The Kafka Producer API allows messages to be sent to Kafka topics asynchronously, so they are built for speed, but Kafka Producers also have the ability […]




KAFKA CONSUMER IN SCALA



In this Kafka Consumer tutorial, we’re going to demonstrate how to develop and
run an example of Kafka Consumer in Scala, so you can gain the confidence to
develop and deploy your own Kafka Consumer applications.  At the end of this
Kafka Consumer tutorial, you’ll have both the source code and screencast of how
to […]




KAFKA CONSUMER GROUPS BY EXAMPLE



Kafka Consumer Groups are the way to horizontally scale out event consumption
from Kafka topics… with failover resiliency.  “With failover resiliency” you
say!?  That sounds interesting.  Well, hold on, let’s leave out the resiliency
part for now and just focus on scaling out.  We’ll come back to resiliency
later. When designing for horizontal scale-out, let’s […]




KAFKA STREAMS JOINS EXAMPLES



Performing Kafka Streams joins presents interesting design options when implementing streaming processor architecture patterns. There are numerous applicable scenarios, but let’s consider an application that might need to access multiple database tables or REST APIs in order to enrich a topic’s event record with context information. For example, perhaps we could augment records in a topic with sensor […]




KAFKA STREAMS TESTING WITH SCALA PART 1



After experimenting with Kafka Streams with Scala, I started to wonder how one
goes about Kafka Streams testing in Java or Scala.  How does one create and run
automated tests for Kafka Streams applications?  How does it compare to Spark
Streaming testing? In this tutorial, I’ll describe what I’ve learned so far. 
Also, if you […]




KAFKA STREAMS TUTORIAL WITH SCALA FOR BEGINNERS EXAMPLE



If you’re new to Kafka Streams, here’s a Kafka Streams with Scala tutorial which may help jumpstart your efforts.  My plan is to keep updating the sample project, so let me know if you would like to see anything in particular with Kafka Streams with Scala.  In this example, the intention is to 1) provide an SBT project you […]




APACHE KAFKA ARCHITECTURE – DELIVERY GUARANTEES



Apache Kafka offers message delivery guarantees between producers and consumers.  For more background on Kafka mechanics such as producers and consumers, please see the Kafka Tutorial page.  Kafka delivery guarantees can be divided into three groups which include “at most once”, “at least once” and “exactly once”. “At most once” can lead to […]
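
As one hedged, producer-side sketch with the kafka-python library: requiring acknowledgment from all in-sync replicas and allowing retries moves a producer toward “at least once” delivery. The broker and topic here are placeholders:

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",  # wait for all in-sync replicas to acknowledge
    retries=5,   # resend on transient failures (may duplicate: at least once)
)
producer.send("events", b"delivered at least once")
producer.flush()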




HOW TO DEBUG SCALA SPARK IN INTELLIJ



Have you struggled to configure debugging in IntelliJ for your Spark programs?  Yeah, me too.  Debugging with Scala code was easy, but when I moved to Spark things didn’t work as expected.  So, in this tutorial, let’s cover debugging Scala-based Spark programs in IntelliJ.  We’ll go through a few examples and utilize the occasional help […]




CHANGE DATA CAPTURE – WHAT IS IT? HOW DOES IT WORK?



Change Data Capture is a mechanism to capture the changes in databases so they
may be processed someplace other than the database or application(s) which made
the change.  This article will explain what change data capture (CDC) is, how it
works, and why it’s important for businesses. Why?  Why would we want to capture
changes […]




KAFKA CONNECT MYSQL EXAMPLES



In this Kafka Connect MySQL tutorial, we’ll cover reading from MySQL to Kafka and reading from Kafka and writing to MySQL.  Let’s run this in your environment. Now, it’s just an example and we’re not going to debate operations concerns such as running in standalone or distributed mode.  The focus will be keeping it simple and getting it working.  We […]




SPARK KINESIS EXAMPLE – MOVING BEYOND WORD COUNT



If you are looking for Spark with Kinesis example, you are in the right place. 
This Spark Streaming with Kinesis tutorial intends to help you become better at
integrating the two. In this tutorial, we’ll examine some custom Spark Kinesis
code and also show a screencast of running it.  In addition, we’re going to
cover […]




SPARK PERFORMANCE MONITORING TOOLS – A LIST OF OPTIONS



Which Spark performance monitoring tools are available to monitor the
performance of your Spark cluster?  In this tutorial, we’ll find out.  But,
before we address this question, I assume you already know Spark includes
monitoring through the Spark UI?  And, in addition, you know Spark includes
support for monitoring and performance debugging through the Spark […]




SPARK FAIR SCHEDULER EXAMPLE



Scheduling in Spark can be a confusing topic.  When someone says “scheduling” in
Spark, do they mean scheduling applications running on the same cluster?  Or, do
they mean the internal scheduling of Spark tasks within the Spark application?
 So, before we cover an example of utilizing the Spark FAIR Scheduler, let’s
make sure we’re on […]




SPARK PERFORMANCE MONITORING WITH HISTORY SERVER



In this Apache Spark tutorial, we will explore the performance monitoring
benefits when using the Spark History server.  This Spark tutorial will review a
simple Spark application without the History server and then revisit the same
Spark app with the History server.  We will explore all the necessary steps to
configure Spark History server for […]




APACHE SPARK THRIFT SERVER LOAD TESTING EXAMPLE



Wondering how to perform stress tests with Apache Spark Thrift Server?  This tutorial will describe one way to do it. What is Apache Spark Thrift Server?  Apache Spark Thrift Server is based on Apache HiveServer2, which was created to allow JDBC/ODBC clients to execute SQL queries using a Spark cluster.  From my […]




SPARK THRIFT SERVER WITH CASSANDRA EXAMPLE



With the Spark Thrift Server, you can do more than you might have thought
possible.  For example, want to use `joins` with Cassandra?  Or, help people
familiar with SQL leverage your Spark infrastructure without having to learn
Scala or Python?  They can use their existing SQL based tools they already know
such as Tableau or […]




SPARK STREAMING WITH KAFKA EXAMPLE



Spark Streaming with Kafka is becoming so common in data pipelines these days,
it’s difficult to find one without the other.   This tutorial will present an
example of streaming Kafka from Spark.  In this example, we’ll be feeding
weather data into Kafka and then processing this data from Spark Streaming in
Scala.  As the data […]




SPARK SUBMIT COMMAND LINE ARGUMENTS



The primary reason why we want to use Spark submit command line arguments is to
avoid hard-coding values into our code. As we know, hard-coding should be
avoided because it makes our application more rigid and less flexible. For
example, let’s assume we want to run our Spark job in both test and production
environments. […]
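
In Python, the idea can be sketched like this: read an environment name from the arguments placed after the script path, e.g. spark-submit my_job.py production. The job name and paths are invented for illustration:

import sys
from pyspark.sql import SparkSession

# First argument after the script path selects the environment.
env = sys.argv[1] if len(sys.argv) > 1 else "test"

spark = SparkSession.builder.appName(f"my-job-{env}").getOrCreate()
input_path = "s3a://prod-bucket/data" if env == "production" else "data/sample"
print(f"Running in {env}, reading from {input_path}")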




SPARK PERFORMANCE MONITORING WITH METRICS, GRAPHITE AND GRAFANA



Spark is distributed with the Metrics Java library which can greatly enhance
your abilities to diagnose issues with your Spark jobs.  In this tutorial, we’ll
cover how to configure Metrics to report to a Graphite backend and view the
results with Grafana for Spark Performance Monitoring purposes. Spark
Performance Monitoring Background If you already know […]




SPARK BROADCAST AND ACCUMULATOR EXAMPLES



On this site, we’ve learned about distributing processing tasks across a Spark
cluster.  But, let’s go a bit deeper in a couple of approaches you may need when
designing distributed tasks.  I’d like to start with a question.  What do we do
when we need each Spark worker task to coordinate certain variables and values
with […]




INTELLIJ SCALA AND APACHE SPARK



IntelliJ Scala and Spark Setup Overview In this tutorial, we’re going to review
one way to setup IntelliJ for Scala and Spark development.  The IntelliJ Scala
combination is the best, free setup for Scala and Spark development.  And I have
nothing against ScalaIDE (Eclipse for Scala) or using editors such as Sublime.
 I switched from […]




SPARK STREAMING TESTING WITH SCALA EXAMPLE



Spark Streaming Testing How do you create and automate tests of Spark Streaming
applications?  In this tutorial, we’ll show an example of one way in Scala.
 This post is heavy on code examples and has the added bonus of using a code
coverage plugin. Are the tests in this tutorial examples unit tests?  Or, are
[…]




KAFKA VS AMAZON KINESIS – HOW DO THEY COMPARE?



Apache Kafka vs. Amazon Kinesis The question of Kafka vs Kinesis often comes
up.  Let’s start with Kinesis. *** Updated Spring 2020 *** Since this original
post, AWS has released MSK.  I think this tells us everything we need to know
about Kafka vs Kinesis.  Also, since the original post, Kinesis has been
separated into […]




APACHE SPARK WITH CASSANDRA AND GAME OF THRONES



Apache Spark with Cassandra is a powerful combination in data processing
pipelines.  In this tutorial, we will build a Scala application with Spark and 
Cassandra with battle data from Game of Thrones.  Now, we’re not going to make
any show predictions!   But, we will show the most aggressive kings as well as
kings which […]




SPARK RDD – A TWO MINUTE GUIDE FOR BEGINNERS



What is a Spark RDD? Spark RDD is short for Apache Spark Resilient Distributed Dataset.  RDDs are a foundational component of the Apache Spark large-scale data processing framework. Spark RDDs are immutable, fault-tolerant, and possibly distributed collections of data elements.  RDDs may […]




SPARK MACHINE LEARNING EXAMPLE WITH SCALA



In this Apache Spark Machine Learning example, Spark MLlib is introduced and
Scala source code analyzed.  This post and accompanying screencast videos
demonstrate a custom Spark MLlib Spark driver application.  Then, the Spark
MLLib Scala source code is examined.  Many topics are shown and explained, but
first, let’s describe a few machine learning concepts. Machine […]




APACHE SPARK ADVANCED CLUSTER DEPLOY TROUBLESHOOTING



In this Apache Spark cluster troubleshooting tutorial, we’ll review a few
options when your Scala Spark code does not deploy as anticipated.  For example,
does your Spark driver program rely on a 3rd party jar only compatible with
Scala 2.11, but your Spark Cluster is based on Scala 2.10?  Maybe your code
relies on a […]




SPARK SCALA WITH 3RD PARTY JARS DEPLOY TO A CLUSTER



Overview In this Apache Spark cluster deploy tutorial, we’ll cover how to deploy
Spark driver programs to a Spark cluster when the driver program utilizes
third-party jars.  In this case, we’re going to use code examples from previous
Spark SQL and Spark Streaming tutorials. At the end of this tutorial, there is a
screencast of […]




SPARK MACHINE LEARNING – CHAPTER 11 MACHINE LEARNING WITH MLLIB



Spark Machine Learning is contained within Spark MLlib.  Spark MLlib is Spark’s library of machine learning (ML) functions designed to run in parallel on clusters.  MLlib contains a variety of learning algorithms. The topic of machine learning itself could fill many books, so instead, this chapter explains ML in Apache Spark. This post is an excerpt […]




SPARK STREAMING EXAMPLE – HOW TO STREAM FROM SLACK



Let’s write a Spark Streaming example which streams from Slack in Scala.  This
tutorial will show how to write, configure and execute the code, first.  Then,
the source code will be examined in detail.  If you don’t have a Slack team,
 you can set one up for free.   We’ll cover that too.  Sound fun?  […]




SPARK STREAMING WITH SCALA



Let’s start Apache Spark Streaming with Scala with small steps to build up our
skills and confidence.  These small steps will create the forward momentum
needed when learning new skills.  The quickest way to gain confidence and
momentum in learning new software development skills is executing code that
performs without error.  Right?  I mean, right!?  […]




HOW TO DEPLOY PYTHON PROGRAMS TO A SPARK CLUSTER



After you have a Spark cluster running, how do you deploy Python programs to it?  It’s not as straightforward as you might think or hope, so let’s explore further in this PySpark tutorial. PySpark Application Deploy Overview Let’s deploy a couple of example PySpark programs to our cluster. Let’s start with […]




PYSPARK SQL MYSQL PYTHON EXAMPLE WITH JDBC



Let’s cover how to use Spark SQL with Python and a mySQL database input data
source.  Shall we?  Yes, yes we shall. Consider this tutorial an introductory
step when learning how to use Spark SQL with a relational database and Python. 
If you are brand new, check out the Spark with Python Tutorial. PySpark SQL […]
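
The core read looks roughly like this modern-API sketch; it assumes the MySQL JDBC driver jar is on the classpath, and the connection details are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/demo")
    .option("dbtable", "users")
    .option("user", "demo")
    .option("password", "demo")
    .load()
)
df.show()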




PYSPARK SQL JSON EXAMPLES IN PYTHON



This short PySpark SQL tutorial shows analysis of World Cup player data using PySpark SQL with a JSON file input data source from a Python perspective. PySpark SQL with JSON Overview We are going to load a JSON input source into Spark SQL’s SQLContext.  This Spark SQL JSON with Python tutorial has two parts.  The first
[…]
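
In outline (with a placeholder file path, and using the newer SparkSession API that superseded SQLContext), loading and querying JSON looks like:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

players = spark.read.json("world_cup_players.json")  # placeholder path
players.createOrReplaceTempView("players")
spark.sql("SELECT * FROM players LIMIT 5").show()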




PYSPARK READING CSV WITH SQL EXAMPLES



In this pyspark reading csv tutorial, we will use Spark SQL with a CSV input
data source using the Python API.  We will continue to use the Uber CSV source
file as used in the Getting Started with Spark and Python tutorial presented
earlier. Also, this Spark SQL CSV tutorial assumes you are familiar with using
[…]
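
A sketch of the core steps, with the CSV path as a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

trips = spark.read.csv("uber.csv", header=True, inferSchema=True)
trips.createOrReplaceTempView("trips")
spark.sql("SELECT COUNT(*) AS trip_count FROM trips").show()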




PYSPARK QUICK START



In this post, let’s cover Apache Spark with Python fundamentals to get you
started and feeling comfortable about using PySpark. The intention is for
readers to understand basic PySpark concepts through examples.  Later posts will
deeper dive into Apache Spark fundamentals and example use cases. Spark
computations can be called via Scala, Python or Java.  There […]




APACHE SPARK WITH AMAZON S3 EXAMPLES



This post will show ways and options for accessing files stored on Amazon S3
from Apache Spark.  Examples of text file interaction on Amazon S3 will be shown
from both Scala and Python using the spark-shell from Scala or ipython notebook
for Python. To begin, you should know there are multiple ways to access S3 […]
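
One common access path is the s3a:// scheme, sketched below; it assumes the hadoop-aws package and AWS credentials are configured, and the bucket name is a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

lines = spark.sparkContext.textFile("s3a://my-bucket/logs/*.txt")
print(lines.count())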




CONNECT IPYTHON NOTEBOOK TO APACHE SPARK CLUSTER



This post will cover how to connect an ipython notebook to two kinds of Spark Clusters: a Spark Cluster running in Standalone mode and a Spark Cluster running on Amazon EC2. Requirements You need to have an Apache Spark Cluster running to complete this tutorial.  See the Background section of this post […]




HOW TO: APACHE SPARK CLUSTER ON AMAZON EC2 TUTORIAL



How to set up and run an Apache Spark Cluster on EC2?  This tutorial will walk
you through each step to get an Apache Spark cluster up and running on EC2. The
cluster consists of one master and one worker node. It includes each step I took
regardless if it failed or succeeded.  While your […]




PYSPARK ACTION EXAMPLES



PySpark action functions produce a computed value back to the Spark driver program.  This is different from PySpark transformation functions, which produce RDDs, DataFrames or DataSets as results.  For example, an action function such as count will produce a result back to the Spark driver, while a transformation function such as map will not.  These may seem […]
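
A small sketch of the distinction, with invented data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])

print(rdd.count())    # action: returns 4 to the driver
print(rdd.collect())  # action: returns [1, 2, 3, 4] to the driver

doubled = rdd.map(lambda x: x * 2)  # transformation: nothing runs yet
print(doubled.collect())            # [2, 4, 6, 8] once an action fires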




PYSPARK TRANSFORMATIONS IN PYTHON EXAMPLES



If you’ve read the previous PySpark tutorials on this site, you know that Spark
Transformation functions produce a DataFrame, DataSet or Resilient Distributed
Dataset (RDD).  Resilient distributed datasets are Spark’s main programming
abstraction and RDDs are automatically parallelized across the cluster.  As
Spark matured, this abstraction changed from RDDs to DataFrame to DataSets, but
the […]
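
For example, here is a minimal sketch with invented data; each transformation returns a new RDD lazily, and nothing executes until an action such as collect fires:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize(["a b", "c d"])

words = rdd.flatMap(lambda line: line.split(" "))  # lazy
upper = words.map(lambda w: w.upper())             # still lazy
print(upper.collect())                             # ['A', 'B', 'C', 'D']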




APACHE SPARK AND IPYTHON NOTEBOOK – THE EASY WAY



Using ipython notebook with Apache Spark couldn’t be easier.  This post will cover how to use ipython notebook (jupyter) with Spark and why it is the best choice when using Python with Spark. Requirements This post assumes you have downloaded and extracted Apache Spark and you are running on a Mac or *nix.  If you are […]




SPARK SQL MYSQL EXAMPLE WITH JDBC



In this tutorial, we will cover using Spark SQL with a mySQL database. Overview
Let’s show examples of using Spark SQL mySQL.  We’re going to use mySQL with
Spark in this tutorial, but you can apply the concepts presented here to any
relational database which has a JDBC driver. By the way, If you are […]




SPARK SQL JSON EXAMPLES



This tutorial covers using Spark SQL with a JSON file input data source in
Scala.  If you are interested in using Python instead, check out Spark SQL JSON
in Python tutorial page. Spark SQL JSON Overview We will show examples of JSON
as input source to Spark SQL’s SQLContext.  This Spark SQL tutorial with JSON
[…]




SPARK SQL CSV EXAMPLES IN SCALA



In this Spark SQL tutorial, we will use Spark SQL with a CSV input data source.
 We will continue to use the baby names CSV source file as used in the previous
What is Spark tutorial.  This tutorial presumes the reader is familiar with
using SQL with relational databases and would like to know how […]




APACHE SPARK CLUSTER PART 2: DEPLOY SCALA PROGRAM



How do you deploy a Scala program to a Spark Cluster?  In this tutorial, we’ll
cover how to build, deploy and run a Scala driver program to a Spark Cluster.
 The focus will be on a simple example in order to gain confidence and set the
foundation for more advanced examples in the future.   To […]




APACHE SPARK CLUSTER PART 1: RUN STANDALONE



Running an Apache Spark Cluster on your local machine is a natural and early
step towards Apache Spark proficiency.  As I imagine you are already aware, you
can use a YARN-based Spark Cluster running in Cloudera, Hortonworks or MapR. 
There are numerous options for running a Spark Cluster in Amazon, Google or
Azure as well.  […]




APACHE SPARK EXAMPLES OF ACTIONS IN SCALA



Spark Action Examples in Scala When using Spark API “action” functions, a result
is produced back to the Spark Driver.  Computing this result will trigger any of
the RDDs, DataFrames or DataSets needed in order to produce the result.  Recall
Spark Transformations such as map, flatMap, and other transformations are used
to create RDDs, DataFrames […]




APACHE SPARK TRANSFORMATIONS IN SCALA EXAMPLES



Spark Transformations in Scala Examples Spark Transformations produce a new
Resilient Distributed Dataset (RDD) or DataFrame or DataSet depending on your
version of Spark.  Resilient distributed datasets are Spark’s main and original
programming abstraction for working with data distributed across multiple nodes
in your cluster.  RDDs are automatically parallelized across the cluster. In the
Scala […]




WHAT IS APACHE SPARK?



Becoming productive with Apache Spark requires an understanding of a few fundamental elements.  In this post, let’s explore the fundamentals or the building blocks of Apache Spark.  Let’s use descriptions and real-world examples in the exploration. The intention is for you to understand basic Spark concepts.  It assumes you are familiar with installing software […]




©2001-2023 Supergloo
 * Privacy Policy
 * Terms of Use
 * Credits and Disclosures
 * Contact

Kafka Connect | Kafka Streams | Kafka Tutorials and Examples | PySpark | PySpark SQL | Spark ML | Spark Monitoring | Spark Scala | Spark SQL Tutorials and Examples | Spark Streaming | Spark Tutorials | Streaming

