SCALING WORLDWIDE PARCEL LOGISTICS
WITH SINGLESTORE AND VECTORIZED

Carl Sverre

Today, let's talk about how SingleStore and Redpanda can work together to solve
the operational complexity of global logistics while handling 100x the number of
packages delivered annually. SingleStore is a scale-out relational database
built for data-intensive workloads. Redpanda is a Kafka API-compatible streaming
platform for mission-critical workloads, created by the team at Vectorized. In
this blog post we present a reference architecture using these two systems to
scale worldwide parcel shipping to never-before-seen volumes.
Conveniently, logistics simulation has always been on my bucket list of things
to build, so when the opportunity arose to create something with Redpanda and
SingleStore it was an easy choice to make. Before we get into the reference
architecture, let's talk about some of the challenges logistics companies face
every day.
On October 12th, 2020, the Pitney Bowes Parcel Shipping Index reported that 103
billion packages were delivered in 2019 alone. That's an average of over 282
million packages delivered each day, or roughly 3,200 packages per second, and
we aren't slowing down! The same report forecasts those numbers will increase
by 2-3x by 2026. These numbers get even more extreme when you consider peak
rather than average volume. For example, Amazon just announced that over 250
million items were purchased during the 48 hours of Prime Day 2021. That's a
rate of over 5 million packages shipped per hour (nearly 1,400 packages per
second), which is close to half of the global average. It's clear from these
statistics that we need logistics systems that can scale to unprecedented peak
activity in order to meet the growing demand for e-commerce worldwide.
Scale isn't the only challenge in architecting a system to handle global
logistics; complexity is another problem. As an example, let's consider one
possible package life cycle.
Each of these parcel lifecycle steps involves a lot of hidden complexity.
Consider moving packages between departure and arrival scans. During this
transportation phase, a package needs to be routed, packed, shipped, unpacked,
and handled countless times. At any point, the system needs to be able to
dynamically surface reasonable ETAs and location history to customers, while
ensuring that no data is lost and the package eventually arrives at its
destination.
Due to the complex system demands of global logistics, within each company you
will find a sprawling infrastructure composed of innumerable databases, event
streams, data lakes, processing clusters, monitoring systems, and more.
Collectively these systems process billions of transactions per day with the
combined goal of ensuring that each package has a safe and on-time journey.
Within this architecture, you will find the crown jewel - the package tracking
system. While I don't work for a logistics company, based on my research and
experience working with large-scale data systems, most of the package tracking
systems out there use an event-driven and service-oriented architecture to
ensure that they can scale to meet the required demand. In general, an
event-driven architecture is focused on keeping track of events, while a
service-oriented architecture focuses on manipulating and querying data at
rest. It's common to see large organizations use both of these paradigms at the
same time.
Setting out to build a simulation of global package logistics, I decided to
focus on long-haul delivery: moving packages long distances rather than
worrying about the last-mile handoff. Here is a diagram outlining the
architecture I ended up with:
Within my architecture, packages are simulated by a scale-out simulation
cluster. Each simulator thread generates thousands of packages and emits events
into Redpanda topics as those packages are shipped from origin to destination.
In order to create semi-believable package histories, the simulation code
includes a rudimentary routing algorithm as well as a virtual clock. Let's look
at an example package history using SingleStore Studio:
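In query form, that history view boils down to something like the following
sketch against the package_transitions table described in the next section (the
packageid value here is a made-up placeholder):

-- Fetch the full transition history for one package, oldest event first.
SELECT seq, kind, locationid, next_locationid, recorded
FROM package_transitions
WHERE packageid = '00000000-0000-0000-0000-000000000000'  -- hypothetical id
ORDER BY seq;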
Once the events arrive in Redpanda topics, SingleStore immediately loads them
using Pipelines in batches of up to 100 million records. Because SingleStore
commits the offset metadata along with the data itself in a single distributed
transaction, we can ensure that each record is loaded exactly once. This,
combined with Redpanda's built-in durability, drastically reduces the
complexity of ensuring that no data is lost in the process.
It's worth highlighting the data model used to efficiently process and store
the data in SingleStore. We use two columnstore tables to store the raw package
and transition records. These tables provide on-disk compression without
sacrificing performance, which is perfect for these mostly historical datasets.
But in order to handle packages in transit, we need a storage engine optimized
for a high volume of inserts, updates, and deletes. This is why we use a
rowstore table to store the state of all undelivered packages. Finally, we use
a reference table to store the mostly static locations dataset, which contains
information about each hub and point location in our global logistics network.
A reference table is replicated on every node in the cluster so that all
queries can access location data without having to broadcast it during query
execution. Here is a visualization of the data model to clarify the
relationships between each of our tables and the incoming data Pipelines.
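For the curious, here is a rough DDL sketch of that schema. The column layout
is taken from the stored procedure below, but the keys, shard clause, and
location columns are illustrative assumptions; the exact DDL lives in the
project repository linked at the end of this post:

-- Append-mostly history: a columnstore table compresses well on disk.
-- (A similar columnstore table stores the raw package records.)
CREATE TABLE package_transitions (
    packageid CHAR(36) NOT NULL,
    seq INT NOT NULL,
    locationid BIGINT NOT NULL,
    next_locationid BIGINT,
    recorded DATETIME NOT NULL,
    kind TEXT NOT NULL,
    KEY (packageid, seq) USING CLUSTERED COLUMNSTORE,
    SHARD KEY (packageid)
);

-- Hot, update-heavy state: a rowstore table handles the high volume of
-- inserts, updates, and deletes on undelivered packages.
CREATE ROWSTORE TABLE package_states (
    packageid CHAR(36) NOT NULL,
    seq INT NOT NULL,
    locationid BIGINT NOT NULL,
    next_locationid BIGINT,
    recorded DATETIME NOT NULL,
    kind TEXT NOT NULL,
    PRIMARY KEY (packageid)
);

-- Mostly static lookup data, replicated to every node in the cluster.
CREATE REFERENCE TABLE locations (
    locationid BIGINT NOT NULL,
    kind TEXT NOT NULL,   -- e.g. hub or point
    lon DOUBLE NOT NULL,  -- illustrative coordinate columns
    lat DOUBLE NOT NULL,
    PRIMARY KEY (locationid)
);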
One interesting feature of our schema is the process_transitions() stored
procedure. This procedure is responsible for keeping the package_transitions and
package_states tables in sync with the latest transitions loaded from Redpanda
topics. Because no engineering blog post is finished without a code sample, here
is the procedure in its entirety:

CREATE OR REPLACE PROCEDURE process_transitions(batch QUERY(
    packageid CHAR(36) NOT NULL,
    seq INT NOT NULL,
    locationid BIGINT NOT NULL,
    next_locationid BIGINT,
    recorded DATETIME NOT NULL,
    kind TEXT NOT NULL
))
AS
BEGIN
    -- Operation 1: append the raw batch to the full package history.
    REPLACE INTO package_transitions (packageid, seq, locationid, next_locationid, recorded, kind)
    SELECT * FROM batch;

    -- Operation 2: upsert each package's latest state, keeping whichever
    -- record has the highest sequence number.
    INSERT INTO package_states (packageid, seq, locationid, next_locationid, recorded, kind)
    SELECT
        packageid,
        seq,
        locationid,
        next_locationid,
        recorded,
        statekind AS kind
    FROM (
        SELECT *, CASE
            WHEN kind = "arrival_scan" THEN "at_rest"
            WHEN kind = "departure_scan" THEN "in_flight"
        END AS statekind
        FROM batch
    ) batch
    WHERE batch.kind != "delivered"
    ON DUPLICATE KEY UPDATE
        seq = IF(VALUES(seq) > package_states.seq, VALUES(seq), package_states.seq),
        locationid = IF(VALUES(seq) > package_states.seq, VALUES(locationid), package_states.locationid),
        next_locationid = IF(VALUES(seq) > package_states.seq, VALUES(next_locationid), package_states.next_locationid),
        recorded = IF(VALUES(seq) > package_states.seq, VALUES(recorded), package_states.recorded),
        kind = IF(VALUES(seq) > package_states.seq, VALUES(kind), package_states.kind);

    -- Operation 3: remove delivered packages so the table only tracks
    -- packages still in transit.
    DELETE package_states
    FROM package_states JOIN batch
    WHERE
        package_states.packageid = batch.packageid
        AND batch.kind = "delivered";
END


As you can see from the code, the procedure performs three operations. The
first copies the incoming batch into the package_transitions table, giving us
access to the full history of every package. The second upserts the batch into
the package_states table, taking the sequence number into consideration.
Merging incoming records based on sequence numbers ensures correct state even
if we process transitions slightly out of order within a single batch. The
third deletes any delivered packages from the package_states table, which
ensures that the table only tracks packages currently in transit and keeps our
memory usage stable over time. The stored procedure is executed transactionally
by SingleStore Pipelines as batches of data become available in Redpanda, in
real time.
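Wiring a Redpanda topic into that procedure takes just a couple of statements.
The broker address, topic name, and CSV wire format below are illustrative
assumptions (the actual project uses its own serialization; see the repo), but
the shape is the same:

-- Redpanda speaks the Kafka API, so LOAD DATA KAFKA works unmodified.
-- Each batch read from the topic is handed to process_transitions.
CREATE PIPELINE transitions
AS LOAD DATA KAFKA 'redpanda-1:9092/transitions'  -- hypothetical broker/topic
INTO PROCEDURE process_transitions
FIELDS TERMINATED BY ','
(packageid, seq, locationid, next_locationid, recorded, kind);

START PIPELINE transitions;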
With our simulator cluster humming along and our data model in place, we come
to the last piece of the puzzle - data visualization. For this demo I used Grafana
and created two main dashboards. The first dashboard showcases the full
performance of the simulation. We focus on three main statistics: package
transitions per second, Redpanda's ingress/egress, and the ingest rate into
SingleStore. In the screenshot below you can see that, with a single simulator
running on my desktop machine, I am able to deliver up to 20k packages per
second (which is roughly 6x the volume of global package delivery in 2020).
The second dashboard showcases the actual data. Since the simulator runs much
faster than real time (see the simulator clock in the top right), the per-second
rate of deliveries on this dashboard is much lower. Think of it as spreading out
roughly twenty thousand deliveries per second over 100 virtual hours in order to
ensure that the data makes some amount of sense (i.e. packages don't fly around
the world in 3 seconds).
In addition to surfacing some metrics about the actual data, this dashboard
also exposes two maps showcasing real-time inventory and current package
destinations. These maps are computed on the fly from SingleStore and can be
refreshed in real time as packages travel from origin to destination.
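Under the hood, a panel like the real-time inventory map boils down to a simple
aggregation. A minimal sketch, assuming the illustrative lon/lat columns from
the schema sketch above:

-- Count the packages currently at rest at each location, with coordinates
-- for plotting on a map.
SELECT l.locationid, l.lon, l.lat, COUNT(*) AS packages_at_rest
FROM package_states s
JOIN locations l ON l.locationid = s.locationid
WHERE s.kind = 'at_rest'
GROUP BY l.locationid, l.lon, l.lat;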
Now that we have gone over how the simulation works, let's revisit the claim
that we can handle 100x the volume of global package delivery per year. That
means we need to handle at least 300k deliveries per second. In order to
accomplish this, I deployed the reference architecture to Google Cloud,
resulting in this beautiful image of raw computational performance:
As you can see from this screenshot, the simulation is processing data at
roughly 1.5 GB/s, ultimately leading to almost 400k deliveries per second. This
is roughly 125x the global package delivery volume in 2019, easily exceeding
the goal I originally set for myself. The cluster running this simulation
included 4 Redpanda brokers (16 vCPUs & 64 GB RAM each) and 16 SingleStore leaf
nodes (32 vCPUs & 128 GB RAM each).
In closing, I want to thank the team at Vectorized for helping me squeeze every
ounce of performance out of this reference architecture, and the entire
engineering team at SingleStore for building such a capable database. Peace!
Any technical blog post is only as good as the code, so I encourage you to check
it out here: https://github.com/singlestore-labs/singlestore-logistics-sim. You
can spin it up on your computer using Docker or you can deploy it directly to
Google Cloud using Terraform by following the instructions in the readme.
The whiteboard images in this post were all created in Excalidraw - one of my
favorite companies ever!
To try SingleStore for yourself you can:
 * Download our current version for free 
 * Preview the 7.5 Database Release
 * Learn about SingleStore’s new 7.5 Managed Service Release or
 * Preview the 7.5 Managed Service Release

To try Redpanda you can download it here.
Published June 24th, 2021. Last modified June 29, 2021 at 10:00.
Tags: Columnstore, Data Strategy, Database, Engineering

--------------------------------------------------------------------------------

Carl Sverre
Whether I am building distributed systems or creating real-time visualizations,
my passion is solving hard problems and writing elegant code. I am a Senior
Director working at a database startup in San Francisco called SingleStore.
