SCALING WORLDWIDE PARCEL LOGISTICS WITH SINGLESTORE AND VECTORIZED

Carl Sverre

Today, let's talk about how SingleStore and Redpanda can work together to solve the operational complexity of global logistics while handling 100x the number of packages delivered annually. SingleStore is a scale-out relational database built for data-intensive workloads. Redpanda is a Kafka API-compatible streaming platform for mission-critical workloads created by the team at Vectorized. In this blog post we present a reference architecture using these two systems to scale worldwide parcel shipping to never-before-seen volumes.

Conveniently, logistics simulation has always been on my bucket list of things to build, so when the opportunity arose to create something with Redpanda and SingleStore it was an easy choice to make. Before we get into the reference architecture, let's talk about some of the challenges logistics companies face every day.

On October 12th, 2020, the Pitney Bowes Parcel Shipping Index reported that 103 billion packages were delivered in 2019 alone. That's an average of over 282 million packages delivered each day, or roughly 3,200 packages per second, and we aren't slowing down! The same report forecasts those numbers will increase by 2-3x by 2026.

These numbers get even more extreme when you consider peak throughput rather than average volume. For example, Amazon just announced that over 250 million items were purchased during the 48 hours of Prime Day 2021. That's an average of more than 5 million packages per hour (roughly 1,400 packages per second), which is nearly half of the global average rate from a single retailer. It's clear from these statistics that we need logistics systems which can scale to handle unprecedented peak activity in order to keep up with the growing demand for e-commerce worldwide.
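As a quick sanity check on those rates, here is the back-of-envelope arithmetic as a runnable snippet (my own illustration; the underlying figures come from the sources cited above):

-- Convert the annual and Prime Day figures into per-day / per-second rates.
SELECT
    103e9 / 365            AS packages_per_day,      -- ~282 million per day
    103e9 / (365 * 86400)  AS packages_per_second,   -- ~3,264 per second
    250e6 / (48 * 3600)    AS prime_day_per_second;  -- ~1,447 per second over 48 hours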
Scale isn't the only challenge in architecting a system for global logistics; complexity is another problem. As an example, let's consider one possible package life cycle, from the initial registration through departure and arrival scans to final delivery. Each of these lifecycle steps involves a lot of hidden complexity. Consider moving packages between departure and arrival scans. During this transportation phase, a package needs to be routed, packed, shipped, unpacked, and handled countless times. At any point the system needs to be able to dynamically surface reasonable ETAs and location history to customers, while ensuring that no data is lost and the package eventually arrives at its destination.

Due to the complex system demands of global logistics, within each company you will find a sprawling infrastructure composed of innumerable databases, event streams, data lakes, processing clusters, monitoring systems, and more. Collectively these systems process billions of transactions per day with the combined goal of ensuring that each package has a safe and on-time journey. Within this architecture you will find the crown jewel - the package tracking system.

While I don't work for a logistics company, based on my research and experience working with large-scale data systems, most of the package tracking systems out there use an event-driven and service-oriented architecture to ensure that they can scale to meet the required demand. In general, an event-driven architecture focuses on keeping track of events, while a service-oriented architecture focuses on manipulating and querying data at rest. It's common to see large organizations use both of these paradigms at the same time.

Setting out to build a simulation of global package logistics, I decided to focus on long-haul delivery - basically, moving packages long distances rather than worrying about the last-mile handoff. Here is a diagram outlining the architecture I ended up with:

Within my architecture, packages are simulated by a scale-out simulation cluster. Each simulator thread generates thousands of packages and emits events into Redpanda topics as those packages are shipped from origin to destination. In order to create semi-believable package histories, the simulation code includes a rudimentary routing algorithm as well as a virtual clock. Let's look at an example package history using SingleStore Studio:

Once the events arrive in Redpanda topics, SingleStore immediately loads them using Pipelines in batches of up to 100 million records. Because SingleStore commits the offset metadata along with the data itself in a single distributed transaction, we can ensure that each record is loaded exactly once. This, combined with Redpanda's built-in durability, drastically reduces the complexity of ensuring that no data is lost in the process.

It's worth highlighting the data model used to efficiently process and store the data in SingleStore. We use two columnstore tables to store the raw package and transition records. These tables provide on-disk compression without sacrificing performance, which is perfect for these mostly historical datasets. But in order to handle packages in transit, we need a storage engine optimized for a high volume of inserts, updates, and deletes. This is why we use a rowstore table to store the state of all undelivered packages. Finally, we use a reference table to store the mostly static locations dataset, which contains information about each of the hub and point locations in our global logistics network. A reference table is replicated on every node in the cluster so that all queries can access location data without having to broadcast it during query execution.
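The post doesn't spell out the table definitions, so here is a minimal sketch of what these four tables could look like. The names follow the process_transitions() procedure shown below; the extra columns on packages and locations are illustrative assumptions, and the authoritative schema lives in the GitHub repository linked at the end of the post.

-- Historical, compressed, append-mostly data: columnstore tables.
CREATE TABLE packages (
    packageid              CHAR(36) NOT NULL,  -- package UUID
    received               DATETIME NOT NULL,  -- assumed registration time
    origin_locationid      BIGINT NOT NULL,    -- assumed
    destination_locationid BIGINT NOT NULL,    -- assumed
    KEY (packageid) USING CLUSTERED COLUMNSTORE
);

CREATE TABLE package_transitions (
    packageid       CHAR(36) NOT NULL,
    seq             INT NOT NULL,
    locationid      BIGINT NOT NULL,
    next_locationid BIGINT,
    recorded        DATETIME NOT NULL,
    kind            TEXT NOT NULL,             -- arrival_scan, departure_scan, delivered
    KEY (packageid, seq) USING CLUSTERED COLUMNSTORE
);

-- Hot, update-heavy state of undelivered packages: a rowstore table with one
-- row per package, so the procedure's upserts hit the primary key.
CREATE ROWSTORE TABLE package_states (
    packageid       CHAR(36) NOT NULL,
    seq             INT NOT NULL,
    locationid      BIGINT NOT NULL,
    next_locationid BIGINT,
    recorded        DATETIME NOT NULL,
    kind            TEXT NOT NULL,             -- at_rest or in_flight
    PRIMARY KEY (packageid)
);

-- Mostly static locations dataset, replicated to every node.
CREATE REFERENCE TABLE locations (
    locationid BIGINT NOT NULL,
    kind       TEXT NOT NULL,                  -- hub or point (assumed)
    city       TEXT,                           -- assumed
    latitude   DOUBLE,                         -- assumed
    longitude  DOUBLE,                         -- assumed
    PRIMARY KEY (locationid)
);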
Here is a visualization of the data model to clarify the relationships between each of our tables and the incoming data Pipelines. One interesting feature of our schema is the process_transitions() stored procedure. This procedure is responsible for keeping the package_transitions and package_states tables in sync with the latest transitions loaded from Redpanda topics. Because no engineering blog post is finished without a code sample, here is the procedure in its entirety:

CREATE OR REPLACE PROCEDURE process_transitions(batch QUERY(
    packageid CHAR(36) NOT NULL,
    seq INT NOT NULL,
    locationid BIGINT NOT NULL,
    next_locationid BIGINT,
    recorded DATETIME NOT NULL,
    kind TEXT NOT NULL
))
AS
BEGIN
    REPLACE INTO package_transitions
        (packageid, seq, locationid, next_locationid, recorded, kind)
    SELECT * FROM batch;

    INSERT INTO package_states
        (packageid, seq, locationid, next_locationid, recorded, kind)
    SELECT packageid, seq, locationid, next_locationid, recorded, statekind AS kind
    FROM (
        SELECT *,
            CASE
                WHEN kind = "arrival_scan" THEN "at_rest"
                WHEN kind = "departure_scan" THEN "in_flight"
            END AS statekind
        FROM batch
    ) batch
    WHERE batch.kind != "delivered"
    ON DUPLICATE KEY UPDATE
        seq             = IF(VALUES(seq) > package_states.seq, VALUES(seq), package_states.seq),
        locationid      = IF(VALUES(seq) > package_states.seq, VALUES(locationid), package_states.locationid),
        next_locationid = IF(VALUES(seq) > package_states.seq, VALUES(next_locationid), package_states.next_locationid),
        recorded        = IF(VALUES(seq) > package_states.seq, VALUES(recorded), package_states.recorded),
        kind            = IF(VALUES(seq) > package_states.seq, VALUES(kind), package_states.kind);

    DELETE package_states
    FROM package_states JOIN batch
    WHERE package_states.packageid = batch.packageid
      AND batch.kind = "delivered";
END

As you can see from the code, the procedure performs three operations. The first copies the incoming batch into the package_transitions table, giving us access to the full history of every package. The second upserts the batch into the package_states table, taking the sequence number into consideration; merging incoming records based on sequence numbers ensures correct state even if we process transitions slightly out of order within a single batch. The third deletes any delivered packages from the package_states table, ensuring that it only holds packages currently in transit, which keeps our memory usage stable over time. The stored procedure is executed transactionally by SingleStore Pipelines as batches of data become available in Redpanda, in real time.
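To make that last step concrete, here is a sketch of how a Pipeline could be wired into the procedure. The broker address, topic name, and JSON field mapping are assumptions for illustration (the actual project's definitions are in the repository linked at the end); the CREATE PIPELINE ... INTO PROCEDURE pattern is the mechanism described above.

-- Hypothetical pipeline: consume the 'transitions' topic from a Redpanda
-- broker and hand each batch to process_transitions() in one transaction.
CREATE PIPELINE transitions
AS LOAD DATA KAFKA 'redpanda-broker:9092/transitions'
INTO PROCEDURE process_transitions
FORMAT JSON (
    packageid       <- packageid,
    seq             <- seq,
    locationid      <- locationid,
    next_locationid <- next_locationid,
    recorded        <- recorded,
    kind            <- kind
);

-- Pipelines are created in the stopped state; start consuming in the background.
START PIPELINE transitions;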
With our simulator cluster humming along and our data model in place, we come to the last piece of the puzzle - data visualization. For this demo I used Grafana and created two main dashboards.

The first dashboard showcases the full performance of the simulation. We focus on three main statistics: package transitions per second, Redpanda's ingress/egress, and the ingest rate into SingleStore. In the screenshot below you can see that with a single simulator running on my desktop machine I am able to deliver up to 20k packages per second, which is roughly 6x the volume of global package delivery in 2020.

The second dashboard showcases the actual data. Since the simulator runs much faster than real time (see the simulator clock in the top right), the per-second rate of deliveries on this dashboard is much lower. Think of it as spreading roughly twenty thousand deliveries per second over 100 virtual hours to ensure that the data makes some amount of sense (i.e. packages don't fly around the world in 3 seconds). In addition to surfacing metrics about the actual data, this dashboard exposes two maps showcasing real-time inventory and current package destinations. These maps are computed on the fly by SingleStore and can be refreshed in real time as packages travel from origin to destination.
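To give a flavor of the queries behind those maps, the destinations map could be driven by something like the following (a sketch against the hypothetical schema above, not the demo's exact query):

-- Count undelivered packages per destination, with coordinates for plotting.
SELECT
    l.city,
    l.latitude,
    l.longitude,
    COUNT(*) AS packages_en_route
FROM package_states ps
JOIN packages p  ON p.packageid  = ps.packageid
JOIN locations l ON l.locationid = p.destination_locationid
GROUP BY l.locationid, l.city, l.latitude, l.longitude
ORDER BY packages_en_route DESC;

Because package_states only ever contains in-transit packages and locations is replicated to every node, this aggregation stays small and fast no matter how much history accumulates in the columnstore tables.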
Now that we have gone over how the simulation works, let's revisit the claim that we can handle 100x the volume of global package delivery per year. That means we need to handle at least 320k deliveries per second. To accomplish this I deployed the reference architecture to Google Cloud, resulting in this beautiful image of raw computational performance:

As you can see from this screenshot, the simulation is processing data at roughly 1.5 GB/s, ultimately delivering almost 400k packages per second. This is roughly 125x the global package delivery volume in 2019, easily exceeding the goal I originally set for myself. The cluster running this simulation included 4 Redpanda brokers (16 vCPUs & 64 GB RAM each) and 16 SingleStore leaf nodes (32 vCPUs & 128 GB RAM each).

In closing, I want to thank the team at Vectorized for helping me squeeze every ounce of performance out of this reference architecture, and the entire engineering team at SingleStore for building such a capable database. Peace!

Any technical blog post is only as good as its code, so I encourage you to check it out here: https://github.com/singlestore-labs/singlestore-logistics-sim. You can spin it up on your computer using Docker, or deploy it directly to Google Cloud using Terraform by following the instructions in the readme. The whiteboard images in this post were all created in Excalidraw - one of my favorite companies ever!

To try SingleStore for yourself you can:

* Download our current version for free
* Preview the 7.5 Database Release
* Learn about SingleStore's new 7.5 Managed Service Release, or
* Preview the 7.5 Managed Service Release

To try Redpanda, you can download it here.

Published June 24th, 2021. Last modified June 29, 2021.
Tags: Columnstore, Data Strategy, Database, Engineering

Carl Sverre
Whether I am building distributed systems or creating real-time visualizations, my passion is solving hard problems and writing elegant code. I am a Senior Director at a database startup in San Francisco called SingleStore.