DATA HUB VS DATA LAKE VS DATA VIRTUALIZATION

 * Overview
 * Comparison Table
 * What is a Data Lake?
 * What is Data Virtualization?
 * What is a Data Hub?
 * What Are the Best Use Cases for a Data Hub?
 * When is a Data Lake a Better Fit?
 * When is Data Virtualization the Best Option?
 * How They Work Together
 * Learn More


OVERVIEW

All large organizations have massive amounts of data, and it is usually spread across many disparate systems. This was rarely a conscious choice; it is the accumulation of pragmatic tradeoffs. Data silos are a form of technical debt, and they are on the rise with the adoption of Software as a Service (SaaS) applications and other cloud offerings, increasing friction between the business and IT. Integrating those data silos is notoriously difficult, and there are clear challenges when trying to use a traditional data warehouse approach. For that reason, IT organizations have sought modern approaches to get the job done (at the urgent request of the business).

This comparison covers three modern approaches to data integration: data lakes, data virtualization (also called data federation), and data hubs. All three approaches simplify self-service consumption of data across heterogeneous sources without disrupting existing applications. However, each approach involves trade-offs, and the approaches are not mutually exclusive; many organizations continue to use their data lake alongside a data hub-centered architecture.


COMPARISON TABLE

Data Ingestion
 * MarkLogic Data Hub: Load raw data as is; data is physically migrated and persisted in a database
 * Data Lake: Load raw data as is; data is physically migrated and stored in HDFS or an object store
 * Data Virtualization: Virtual views of the data; no data is physically moved

Data Model
 * MarkLogic Data Hub: Multi-model; native JSON, XML, and RDF storage
 * Data Lake: HDFS is a file system that supports multiple data models
 * Data Virtualization: Often the same as the underlying federated systems, but can also create new composite views or semantic layers

Search & Query
 * MarkLogic Data Hub: Built-in, full-text search; complete indexing (words, structure, etc.); relational views over unstructured data
 * Data Lake: Depends. Various add-on tools provide data access (Hive, HBase, Impala, Presto, Drill, etc.), but their query capabilities are generally limited and complex to manage
 * Data Virtualization: Queries are optimized and passed to the underlying systems, and depend on the indexes defined in those systems

Operational Capabilities
 * MarkLogic Data Hub: ACID transactions at scale; real-time data processing; REST, JDBC, ODBC, etc.
 * Data Lake: No ACID transactions, cannot power transactional apps; other tools are used to operationalize the data
 * Data Virtualization: No ACID transactions, cannot power transactional apps; can provide an access layer for data consumption via JDBC, ODBC, REST, etc.

Curation (Harmonization, Enrichment, Mastering)
 * MarkLogic Data Hub: High-performance data pipelines at scale; support for third-party tools (MuleSoft, Apache NiFi); easy-to-use Data Hub UI; Smart Mastering; agile data curation
 * Data Lake: Depends. Some tools support “ELT” on Hadoop, but most use cases involve an ETL tool before or after moving data to the lake
 * Data Virtualization: Some support for data curation when the data is returned or processed, but usually relies on data pipeline or ETL tools

Security
 * MarkLogic Data Hub: Granular security controls; RBAC at the document/element level; redaction on export; advanced encryption
 * Data Lake: Weak data security and governance (or at least hard to operationalize, requiring additional tools such as Apache Atlas or Cloudera Navigator to fill gaps)
 * Data Virtualization: Security controls are required for both the virtual database and the underlying databases; both layers must be secured

Scalability
 * MarkLogic Data Hub: Petabyte scalability; higher cost due to indexing overhead for some implementations, though MarkLogic Data Hub Service provides predictable, low-cost auto-scaling
 * Data Lake: Petabyte scalability; ideal for low-cost storage
 * Data Virtualization: Only performs as well as the slowest federate, and is impacted by system load or issues in any federate

Performance
 * MarkLogic Data Hub: High-performance transactions and analytics; dedicated hardware, separate from source systems, for independent scaling
 * Data Lake: High-performance analytics; performance depends on the infrastructure the system runs on
 * Data Virtualization: High-performance analytics; performance depends on the virtual database's infrastructure, the underlying systems' infrastructure, and every network connection between them

Deployment
 * MarkLogic Data Hub: Self-managed deployment in any environment, or fully managed, serverless deployment with MarkLogic Data Hub Service
 * Data Lake: Self-managed deployment in any environment
 * Data Virtualization: Very fast to deploy, since no data is migrated; may only require configuring a VM


WHAT IS A DATA LAKE?

A data lake is a central repository that makes it possible to store data at any scale and of any structure. Data lakes became popular with the rise of Hadoop, whose distributed file system (HDFS) made it easy to move raw data into one central repository where it could be stored at low cost. In a data lake, the data may not be curated (enriched, mastered, harmonized) or searchable, and analyzing or operationalizing it usually requires other tools from the Hadoop ecosystem in a multi-step process. But data lakes have the advantage of requiring little work up front when loading data.

Data lake use cases include serving as an analytics sandbox, training machine
learning models, feeding data prep pipelines, or just offering low-cost data
storage.
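The “load as is” idea can be sketched in a few lines: raw records land in a partitioned object layout untouched, and any schema is applied later at read time. This is a toy illustration using the local filesystem in place of HDFS or S3; the directory layout and field names are invented, not any vendor's API.

```python
import json
import tempfile
from datetime import date
from pathlib import Path

def land_raw(lake_root: Path, source: str, records: list) -> Path:
    """Write raw records as-is into a date-partitioned lake path (schema-on-read)."""
    part = lake_root / source / f"dt={date.today().isoformat()}"
    part.mkdir(parents=True, exist_ok=True)
    out = part / "part-0000.jsonl"
    with out.open("w") as f:
        for rec in records:  # no validation, no harmonization: raw, as-is
            f.write(json.dumps(rec) + "\n")
    return out

lake = Path(tempfile.mkdtemp())
# A ragged schema is fine on write; the reader deals with it later.
path = land_raw(lake, "crm", [{"id": 1, "name": "Ada"}, {"id": 2}])
rows = [json.loads(line) for line in path.read_text().splitlines()]
print(len(rows))  # 2
```

The low up-front cost is visible here: nothing about the records is checked at load time, which is exactly why curation has to happen somewhere downstream.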

A few years ago, the Hadoop market had three main players: Cloudera, Hortonworks, and MapR. Today, only Cloudera remains, following its merger with Hortonworks and MapR's fire sale.

For many organizations, object stores like Amazon S3 have become de facto data
lakes, and support the move to the cloud from an on-premises Hadoop landscape.

Besides the Hadoop core, there are many other related tools in the Apache ecosystem. For example, Spark and Kafka are two popular tools for processing streaming data and doing analytics in an event-streaming architecture (they are commercialized by Databricks and Confluent, respectively).

A detailed review of those tools is out of scope for this comparison. In general, though, they are complementary to a data hub approach for most use cases: they manage streaming data but still need a database. Kafka, for example, has no data model, no indexes, and no way to query data. As a rule of thumb, an event-based architecture and analytics platform is more trusted and more operational with a data hub underneath than without one.
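The “streams still need a database” point can be made with a toy: the events list below stands in for a Kafka topic, and SQLite stands in for the database that makes the stream queryable. None of this is Kafka's or MarkLogic's API; it only illustrates materializing an event stream into an indexed, queryable state.

```python
import sqlite3

# Stand-in for a Kafka topic: an ordered stream of change events.
events = [
    {"op": "upsert", "id": "c1", "status": "active"},
    {"op": "upsert", "id": "c2", "status": "active"},
    {"op": "upsert", "id": "c1", "status": "churned"},  # later event supersedes earlier
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id TEXT PRIMARY KEY, status TEXT)")

# The stream itself has no data model or index; materialize it into one.
for ev in events:
    db.execute(
        "INSERT INTO customers (id, status) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET status = excluded.status",
        (ev["id"], ev["status"]),
    )

status = db.execute("SELECT status FROM customers WHERE id = 'c1'").fetchone()[0]
print(status)  # churned
```

The query at the end is the part the stream alone cannot answer efficiently: current state by key requires a store with a data model and an index, which is the role the data hub plays.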


WHAT IS DATA VIRTUALIZATION?

Data virtualization involves creating virtual views of data stored in existing
databases. The physical data doesn’t move but you can still get an integrated
view of the data in the new virtual data layer. This is often called data
federation (or virtual database), and the underlying databases are the
federates.

For example, you may have a few Oracle and SAP databases running and a
department needs access to the data from those systems. Rather than physically
moving the data via ETL and persisting it in another database, architects can
virtually (and quickly) retrieve and integrate the data for that particular team
or use case.

With data virtualization, queries hit the underlying databases. Newer virtualization technologies are increasingly sophisticated at query execution planning and optimization: they may use cached in-memory data or integrated massively parallel processing (MPP), with the results joined and mapped into a composite view. Many newer data virtualization technologies can also write data, not just read it. Newer solutions also show advances in data governance, masking data for different roles and use cases and using LDAP for authentication.
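A minimal sketch of the federation idea, using SQLite's ATTACH: two independent databases stay where they are, and one composite query joins them at read time with no data copied. Real virtualization products push optimized sub-queries to each federate; the file names and schemas here are made up for illustration.

```python
import os
import sqlite3
import tempfile

tmp = tempfile.mkdtemp()

# Two independent "source systems", each keeping its own data.
hr = sqlite3.connect(os.path.join(tmp, "hr.db"))
hr.execute("CREATE TABLE employees (id INTEGER, name TEXT)")
hr.executemany("INSERT INTO employees VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
hr.commit()
hr.close()

pay = sqlite3.connect(os.path.join(tmp, "payroll.db"))
pay.execute("CREATE TABLE salaries (emp_id INTEGER, salary INTEGER)")
pay.executemany("INSERT INTO salaries VALUES (?, ?)", [(1, 120000), (2, 130000)])
pay.commit()
pay.close()

# The "virtual database": nothing is migrated; both sources are attached
# and the composite query resolves against them live.
vdb = sqlite3.connect(":memory:")
vdb.execute("ATTACH DATABASE ? AS hr", (os.path.join(tmp, "hr.db"),))
vdb.execute("ATTACH DATABASE ? AS pay", (os.path.join(tmp, "payroll.db"),))
rows = vdb.execute(
    "SELECT e.name, s.salary "
    "FROM hr.employees AS e JOIN pay.salaries AS s ON e.id = s.emp_id "
    "ORDER BY e.name"
).fetchall()
print(rows)  # [('Ada', 120000), ('Grace', 130000)]
```

Note what the sketch also demonstrates about the drawbacks: the join can only use whatever indexes the source databases happen to have, and the virtual layer persists nothing of its own.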

One of the major benefits of data virtualization is faster time to value. Because the data is not physically moved, virtual databases require less work and expense before you can start querying, and they are less disruptive to your existing infrastructure.

Another major benefit is that data virtualization gives users the ability to run
ad hoc SQL queries on both unstructured and structured data sources — a primary
use case for data virtualization.


SO, WITH ALL THESE BENEFITS, WHAT ARE THE DRAWBACKS OF DATA VIRTUALIZATION?

 * Virtual databases do not index the data, nor do they have separate storage for indexes. They rely on the underlying source systems having indexes, which are often inadequate
 * Virtual databases map each request into a different request per source system and execute it on all of them. This can create performance problems across the network, and the system is always constrained by network capacity
 * Virtual databases have no place to “curate” the data, improve data quality, or track data lineage and history. They do minimal data harmonization, and only when data is returned or processed. There is no persisted canonical form of the data to serve as a single source of truth and be securely shared with downstream consumers
 * Virtual databases usually have limited (or at least more complex to implement) security controls. For example, a virtual database may only secure data at the table level, not per record
 * A virtual database's data volume is always limited to the volume of data in the underlying source systems
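The fan-out drawback above has a simple consequence: even when sub-queries run in parallel, the composite answer cannot return before the slowest federate does. This toy simulates three source systems with sleeps (the names and latencies are invented) to show the elapsed time being gated by the slowest one.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def query_source(name: str, latency_s: float):
    """Simulated federate: responds after its characteristic latency."""
    time.sleep(latency_s)
    return name, [f"{name}-row"]

# Hypothetical federates and their response times.
sources = {"fast_db": 0.05, "ok_db": 0.10, "slow_mainframe": 0.30}

start = time.perf_counter()
with ThreadPoolExecutor() as pool:  # fan the virtual query out to every federate
    results = list(pool.map(lambda kv: query_source(*kv), sources.items()))
elapsed = time.perf_counter() - start

# Total time is roughly max(latencies), not min or mean: slow_mainframe sets the floor.
print(f"{len(results)} sources answered in {elapsed:.2f}s")
```

A data hub avoids this by persisting the integrated data on its own hardware, so query latency does not depend on whichever source system is slowest that day.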

Examples of companies offering stand-alone data virtualization solutions include SAS, TIBCO, Denodo, and Cambridge Semantics. Other vendors, such as Oracle, Microsoft, SAP, and Informatica, embed data virtualization as a feature of their flagship products.


WHAT IS A DATA HUB?

Data hubs are data stores that act as an integration point in a hub-and-spoke
architecture. They physically move and integrate multi-structured data and store
it in an underlying database.


HERE ARE SOME OF THE KEY ADVANTAGES OF A DATA HUB

 * Data hubs are powered by an underlying multi-model database (which data lakes and virtual databases lack), enabling them to serve as a system of truth with the required enterprise capabilities: data confidentiality (access control), data availability (HA/DR), and data integrity (distributed transactions)
 * Data hubs have the tools to curate data (enriching, mastering, harmonizing) and support progressive harmonization, the result of which is persisted in the database
 * Data hubs support operational and transactional applications, something data lakes are not designed for. And while virtual databases can support transactions, their throughput is throttled by the performance of the underlying database systems
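Curation with the source preserved is commonly done with an envelope pattern: the raw record is kept verbatim while harmonized, canonical fields are added beside it, so harmonization can proceed progressively without losing anything. This dict sketch only mimics the idea; it is not the MarkLogic Data Hub envelope format itself, and the field names are invented.

```python
from datetime import datetime, timezone

def envelope(raw: dict, source: str) -> dict:
    """Wrap a raw record: keep it verbatim, add canonical fields alongside it."""
    return {
        "headers": {
            "source": source,  # provenance: where this record came from
            "ingested": datetime.now(timezone.utc).isoformat(),
        },
        "instance": {  # harmonized, canonical view; extend progressively over time
            "name": f"{raw.get('first_nm', '').strip()} {raw.get('last_nm', '').strip()}".strip(),
            "email": raw.get("EMAIL", "").lower() or None,
        },
        "attachments": raw,  # the untouched source record, preserved as-is
    }

doc = envelope(
    {"first_nm": " Ada ", "last_nm": "Lovelace", "EMAIL": "ADA@Example.com"},
    source="crm",
)
print(doc["instance"])  # {'name': 'Ada Lovelace', 'email': 'ada@example.com'}
```

Because the raw record survives inside the envelope, a later, smarter harmonization step can always re-derive the canonical view; that is what makes the harmonization “progressive”.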

With these advantages, a data hub can act as a strong complement to data lakes
and data virtualization by providing a governed, transactional data layer. We
discuss this more in depth below.


WHAT ARE THE BEST USE CASES FOR A DATA HUB?

Here are some of the signs that indicate a data hub is a good choice for your
architecture:

 * When you want to integrate multi-model data — Data hubs are good at
   integrating multi-structured, changing data. They are ideal if you want to
   track where your data comes from and impose a single, easy-to-govern security
   data model. They also provide built-in curation capabilities to enrich,
   harmonize, and master data (including deduplication)
 * When the business needs a data service fast — Data hubs provide agility both
   in terms of getting data in and also getting value fast. They are much more
   than just analytical sandboxes. A data hub full of well-curated data can
   begin delivering business value with data services in weeks
 * When you need real-time, operational views — Data hubs are operational and transactional, providing real-time views and acting as a single source of truth. This makes them a good choice when your analytics team needs real-time, operational analytics, not a historical snapshot
 * When you need a stable platform and trusted point of integration — Data hubs
   are backed by a database. They operate independently of other systems and
   thus are not tied to network or infrastructure constraints of other systems.
   And, they persist data, provide HA/DR, transactional consistency, enterprise
   security, and all the other capabilities required to act as a stable platform
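Mastering and deduplication (mentioned in the first bullet above) boil down to matching records on normalized keys and merging the survivors. This is a toy match-merge on normalized email, far simpler than a full mastering rules engine like Smart Mastering; the records and fields are invented.

```python
def normalize(email: str) -> str:
    """Normalize the match key so trivially different values collide."""
    return email.strip().lower()

def master(records: list) -> list:
    """Toy match-merge: group on normalized email, keep the first non-empty values."""
    merged = {}
    for rec in records:
        key = normalize(rec["email"])
        survivor = merged.setdefault(key, {"email": key})
        for field, value in rec.items():
            if field != "email" and value:  # survivorship rule: first non-empty wins
                survivor.setdefault(field, value)
    return list(merged.values())

golden = master([
    {"email": "Ada@Example.com", "phone": ""},
    {"email": "ada@example.com ", "phone": "555-0100"},   # duplicate of the first
    {"email": "grace@example.com", "phone": "555-0199"},
])
print(len(golden))  # 2: the two Ada records collapsed into one golden record
```

Real mastering adds probabilistic matching, multiple match keys, and reversible merges, but the shape of the problem — normalize, match, apply survivorship rules — is the same.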

Our customers typically use the MarkLogic Data Hub Platform for use cases such
as building a unified view, operational analytics, content monetization,
research and development, industrial IoT, regulatory compliance, ERP
integration, and mainframe migrations.


WHEN IS A DATA LAKE A BETTER FIT?

Data lakes are best for streaming data, and they serve as good repositories when organizations need a low-cost option for storing massive amounts of data, structured or unstructured. Most data lakes are backed by HDFS and connect easily into the broader Hadoop ecosystem, making them a good choice for large development teams that want to use open-source tools and need a low-cost analytics sandbox. Many organizations rely on their data lake as a “data science workbench” to drive machine learning projects, where data scientists need to store training data and feed tools like Jupyter and Spark.


WHEN IS DATA VIRTUALIZATION THE BEST OPTION?

Data virtualization is the best option for certain analytics use cases that do not require the robustness of a full data hub. Virtual databases can be deployed quickly, and because the physical data is never moved, they require little infrastructure provisioning at the start of a project. Another common use of data virtualization is letting data teams run ad hoc SQL queries on top of non-relational data sources.
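The ad-hoc-SQL-over-documents use case can be mimicked by projecting document fields into flat relational columns; a real virtualization layer does this mapping dynamically rather than with a hand-written load step. The documents and column names below are invented for illustration.

```python
import sqlite3

# Non-relational source: nested JSON-style documents.
docs = [
    {"id": "o1", "customer": {"region": "EU"}, "total": 250},
    {"id": "o2", "customer": {"region": "US"}, "total": 400},
    {"id": "o3", "customer": {"region": "EU"}, "total": 100},
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT, region TEXT, total INTEGER)")

# Project nested document fields into flat columns so plain SQL works on them.
db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(d["id"], d["customer"]["region"], d["total"]) for d in docs],
)

eu_total = db.execute(
    "SELECT SUM(total) FROM orders WHERE region = 'EU'"
).fetchone()[0]
print(eu_total)  # 350
```

The value for analysts is in the last query: once the mapping exists, familiar SQL aggregations run against data that never lived in a relational system.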


HOW CAN A DATA HUB, A DATA LAKE, AND DATA VIRTUALIZATION WORK TOGETHER?

Data hubs and data virtualization are two different approaches to data integration and may compete for the same use case. We find that customers who use a data hub usually do not need to implement data virtualization as well, since the data hub delivers almost all of the same benefits. For instance, many MarkLogic customers have built metadata (or content) repositories on MarkLogic Data Hub to virtualize their critical data assets.

That said, it is possible to treat a MarkLogic Data Hub as a data source to be
federated, just like any other data source. For example, MarkLogic Data Hub can
be used to integrate data from multiple sources and can be accessed as a
federated data source using tools like Spark for training and scoring machine
learning models.

Data lakes are very complementary to data hubs. Many of our customers have used the MarkLogic Connector for Hadoop to move data from Hadoop into MarkLogic Data Hub, or from MarkLogic Data Hub to Hadoop. The data hub sits on top of the data lake, exposing the high-quality data that is curated, secure, de-duplicated, indexed, and queryable. Additionally, to manage extremely large data volumes, MarkLogic Data Hub provides automated data tiering to securely store and access data from a data lake.

Most commonly, customers either have an existing data lake and are in the
process of migrating off of it, or they are choosing to off-load low-usage data
into Hadoop to get the benefits of low-cost storage or support machine learning
projects.
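The off-loading pattern can be sketched as moving low-usage records out of the operational store into cheap file storage. This illustrates the idea only; it is not the MarkLogic data tiering mechanism, and the `reads` counter and file layout are invented.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Operational store ("hub") with a per-document usage counter.
hub = sqlite3.connect(":memory:")
hub.execute("CREATE TABLE docs (id TEXT PRIMARY KEY, body TEXT, reads INTEGER)")
hub.executemany(
    "INSERT INTO docs VALUES (?, ?, ?)",
    [("a", "hot doc", 900), ("b", "cold doc", 2), ("c", "colder doc", 0)],
)

lake = Path(tempfile.mkdtemp())  # stands in for low-cost lake storage

def tier_out(max_reads: int) -> int:
    """Move low-usage rows to cheap file storage and drop them from the hot store."""
    cold = hub.execute(
        "SELECT id, body FROM docs WHERE reads <= ?", (max_reads,)
    ).fetchall()
    for doc_id, body in cold:
        (lake / f"{doc_id}.json").write_text(json.dumps({"id": doc_id, "body": body}))
        hub.execute("DELETE FROM docs WHERE id = ?", (doc_id,))
    return len(cold)

moved = tier_out(max_reads=10)
hot_left = hub.execute("SELECT COUNT(*) FROM docs").fetchone()[0]
print(moved, hot_left)  # 2 1
```

The design point is that the hot store keeps only the data worth paying index and transaction overhead for, while everything else remains retrievable from the cheaper tier.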


LEARN MORE

When considering the next step in planning your architecture, here is a summary of the options:

 * Build a new data hub, with MarkLogic Data Hub Service, for your next big data integration project instead of using a data lake or data virtualization (or trying to build a custom-made “data hub” out of bolted-together components)
 * Build a data hub on top of a data lake, using MarkLogic Data Hub Service as the integration point for curating and governing data, and the data lake for batch processing and data science
 * Consolidate as much data as possible into one or more data hubs, and expose that data via data virtualization

We have many customers who chose to supplement or replace their data lake or
data virtualization with a MarkLogic Data Hub. Some examples you can explore
include Northern Trust, AFRL, and Chevron.




© 2021 MarkLogic Corporation. MarkLogic and the MarkLogic logo are trademarks of
MarkLogic Corporation. All other trademarks are the property of their respective
owners.

 
