www.infoworld.com Open in urlscan Pro
151.101.2.165  Public Scan

URL: https://www.infoworld.com/article/3640540/how-to-choose-a-cloud-data-warehouse.html
Submission: On August 09 via api from GB — Scanned from GB

Form analysis 1 forms found in the DOM

<form class="gsc-search-box gsc-search-box-tools" accept-charset="utf-8">
  <table cellspacing="0" cellpadding="0" role="presentation" class="gsc-search-box">
    <tbody>
      <tr>
        <td class="gsc-input">
          <div class="gsc-input-box" id="gsc-iw-id1">
            <table cellspacing="0" cellpadding="0" role="presentation" id="gs_id50" class="gstl_50 gsc-input" style="width: 100%; padding: 0px;">
              <tbody>
                <tr>
                  <td id="gs_tti50" class="gsib_a"><input autocomplete="off" type="text" size="10" class="gsc-input" name="search" title="search" aria-label="search" id="gsc-i-id1" dir="ltr" spellcheck="false" placeholder="Start Searching"
                      style="width: 100%; padding: 0px; border: none; margin: 0px; height: auto; outline: none;"></td>
                  <td class="gsib_b">
                    <div class="gsst_b" id="gs_st50" dir="ltr"><a class="gsst_a" href="javascript:void(0)" title="Clear search box" role="button" style="display: none;"><span class="gscb_a" id="gs_cb50" aria-hidden="true">×</span></a></div>
                  </td>
                </tr>
              </tbody>
            </table>
          </div>
        </td>
        <td class="gsc-search-button"><button class="gsc-search-button gsc-search-button-v2"><svg width="13" height="13" viewBox="0 0 13 13">
              <title>search</title>
              <path
                d="m4.8495 7.8226c0.82666 0 1.5262-0.29146 2.0985-0.87438 0.57232-0.58292 0.86378-1.2877 0.87438-2.1144 0.010599-0.82666-0.28086-1.5262-0.87438-2.0985-0.59352-0.57232-1.293-0.86378-2.0985-0.87438-0.8055-0.010599-1.5103 0.28086-2.1144 0.87438-0.60414 0.59352-0.8956 1.293-0.87438 2.0985 0.021197 0.8055 0.31266 1.5103 0.87438 2.1144 0.56172 0.60414 1.2665 0.8956 2.1144 0.87438zm4.4695 0.2115 3.681 3.6819-1.259 1.284-3.6817-3.7 0.0019784-0.69479-0.090043-0.098846c-0.87973 0.76087-1.92 1.1413-3.1207 1.1413-1.3553 0-2.5025-0.46363-3.4417-1.3909s-1.4088-2.0686-1.4088-3.4239c0-1.3553 0.4696-2.4966 1.4088-3.4239 0.9392-0.92727 2.0864-1.3969 3.4417-1.4088 1.3553-0.011889 2.4906 0.45771 3.406 1.4088 0.9154 0.95107 1.379 2.0924 1.3909 3.4239 0 1.2126-0.38043 2.2588-1.1413 3.1385l0.098834 0.090049z">
              </path>
            </svg></button></td>
        <td class="gsc-clear-button">
          <div class="gsc-clear-button" title="clear results">&nbsp;</div>
        </td>
      </tr>
    </tbody>
  </table>
</form>

Text Content

Close Ad


infoworld
UNITED KINGDOM
 * United States
 * United Kingdom

 * App Dev
 * Cloud
 * Gen AI
 * Machine Learning
 * Analytics
 * White Papers/Webcasts

×

search
 

Analytics
Careers
Databases
Cloud Computing
 * Amazon Web Services
 * Kubernetes
 * Microsoft Azure

Generative AI
Machine Learning
Open Source
Software Development
 * Agile Development
 * CI/CD
 * Devops
 * Java
 * JavaScript
 * Microsoft .Net

Newsletters
In-Depth
 * Features
 * How-To
 * News
 * Reviews

Video
 * Do More with R
 * Smart Python
 * IDG TECH(talk) Channel

White Papers/Webcasts
From Our Partners
   
   
 * The Latest Content from Our Sponsors

More from the Foundry Network
The voice of IT leadership
Analytics Careers CIO Role Digital Transformation Leadership Project Management
Security at the speed of business
Application Security Cloud Security Identity Management Information Security
Network Security Risk Management Security Software
Making technology work for business
Blockchain Collaboration Mobile Office Software Security Systems Management
Windows
From the data center to the edge
Data Center Internet of Things Linux Networking SD-WAN Servers Storage Wi-Fi
 * About Us |
 * Contact |
 * Republication Permissions |
 * Privacy Policy |
 * Cookie Policy |
 * European Privacy Settings |
 * Member Preferences |
 * Advertising |
 * Foundry Careers |
 * Ad Choices |
 * E-commerce Links |
 * California: Do Not Sell My Personal Info |

 * Follow Us
 * 
 * 
 * 


×

Close
 * Home
 * Data Management
 * Databases




HOW TO CHOOSE A CLOUD DATA WAREHOUSE


MODERN DATA WAREHOUSES CAN QUERY STRUCTURED DATA AND SEMI-STRUCTURED DATA
SIMULTANEOUSLY, AND EVEN COMBINE HISTORICAL DATA AND STREAMING LIVE DATA FOR
ANALYSIS.

 * 
 * 
 * 
 * 
 * 
 * 
 * 

By Martin Heller

Contributing Editor, InfoWorld | 13 December 2021 11:00 GMT


Thinkstock



Enterprise data warehouses, or EDWs, are unified databases for all historical
data across an enterprise, optimized for analytics. These days, organizations
implementing data warehouses often consider creating the data warehouse in the
cloud rather than on premises. Many also consider using data lakes that support
queries instead of traditional data warehouses. A third question is whether you
want to combine historical data with streaming live data.

A data warehouse is an analytic, usually relational, database created from two
or more data sources, typically to store historical data, which may have a scale
of petabytes. Data warehouses often have significant compute and memory
resources for running complicated queries and generating reports, and are often
the data sources for business intelligence (BI) systems and machine learning.

The write throughput requirements of transactional operational databases limit
the number and kind of indexes you can create (more indexes mean more writes and
updates per record added, and more possible contention). This in turn slows down
analytic queries against the operation database. Once you have exported your
data into a data warehouse, you can index everything you care about in the data
warehouse for good analytic query performance, without affecting the write
performance of the separate OLTP (online transaction processing) database.



Data marts contain data oriented toward a specific business line. Data marts may
be dependent on the data warehouse, independent of the data warehouse (i.e.,
drawn from an operational database or external source), or a hybrid of the two.



Data lakes, which store files of data in its native format, are essentially
“schema on read,” meaning that any application that reads data from the lake
will need to impose its own types and relationships on the data. Traditional
data warehouses, on the other hand, are “schema on write,” meaning that data
types, indexes, and relationships are imposed on the data as it is stored in the
data warehouse.

Modern data warehouses can often handle structured data and semi-structured data
and query them simultaneously. In addition, modern data warehouses can often
query historical data and streamed recent data simultaneously.




CLOUD DATA WAREHOUSES VS. ON-PREM DATA WAREHOUSES

A data warehouse can be implemented on-premises, in the cloud, or as a hybrid.
Historically, data warehouses were always on-prem, but the capital cost and lack
of scalability of on-prem servers in data centers were sometimes issues. On-prem
EDW installations grew when vendors started offering data warehouse appliances.
Now, however, the trend is to move all or part of your data warehouse to the
cloud to take advantage of the inherent scalability of cloud data warehouses,
and the ease of connecting to other cloud services.



The downside of putting petabytes of data in the cloud is the operational cost,
both for cloud data storage and for cloud data warehouse compute and memory
resources. You might think that the time to upload petabytes of data to the
cloud would be a huge barrier, but the hyperscale cloud vendors now offer
high-capacity, disk-based data transfer services.


SPEED AND SCALABILITY REQUIREMENTS

Data warehouses are designed so that analytical queries can run fast. For old
on-prem data warehouses, reports with multiple queries based on historical data
were typically run overnight. For modern cloud data warehouses, the performance
requirements are stiffer, as analysts expect to run queries based on historical
plus streaming data interactively, and then dig deeper with more queries.

Cloud data warehouses are usually designed to scale CPU capacity as needed, so
that interactive queries against petabytes of data can return answers in
minutes. Some cloud data warehouses can increase the CPU resources while a query
is running without restarting the query, and reduce them again when the data
warehouse is idle. Aggressive up-scaling and down-scaling can be a good strategy
to get high performance when needed for low overall cost.


COLUMNAR VERSUS ROW STORAGE

Row-oriented databases organize data by record, and typically attempt to store
one database row in one block of storage, so that the whole row can be retrieved
with a single read operation. Row-oriented databases are efficient for both
reading and writing rows. Most transactional databases are row-oriented, and use
b-tree indexes.

Column-oriented databases organize data by field, and attempt to store all the
data associated with a field together. Columnar databases are efficient for
reading and computing on columns. Most data warehouses store data in columns,
compress their data heavily, and use LSM-tree indexes. The original paper
describing C-Store, a read-optimized column-oriented database, was published in
2005. The C-Store paper laid the groundwork for most modern columnar store data
warehouses, including Amazon Redshift, Google BigQuery, and Snowflake.



Some databases combine row and columnar storage. They use row storage for OLTP,
and columnar storage for analytic queries. A few databases can query data in
columnar storage and row storage together, which speeds up queries where not all
fields can fit into columnar storage.


IN-MEMORY STORAGE AND LAYERED STORAGE

What’s faster than a compressed columnar store on disk? A compressed columnar
store in memory. What can handle more data than a columnar store in memory? A
layered storage system that backs memory with PMEM, such as Intel Optane, which
is faster than flash and cheaper than DRAM. Additional layers would be flash and
spinning disks. The hard part of a scheme like this is implementing the
multi-level caching without slowing down retrievals or allowing unnecessary
cache flushing in the faster layers.


ETL VERSUS ELT

ETL (extract, transform and load) tools pull the data, perform any desired
mappings and transformations, and load the data into the data storage layer. ELT
tools store the data first and transform later. When you use ELT tools, it is
common to also use a data lake.


CLUSTERED AND DISTRIBUTED CLOUD DATA WAREHOUSES

Since data warehouses are read-mostly databases, it is easier to cluster them
than to cluster OLTP databases. It is also easier to distribute data warehouses
geographically without incurring high write latency. Once your data warehouse
has a clustered architecture, it is easy to add nodes to the cluster to increase
processing capacity and return results faster.




CLOUD UI FOR ADMIN AND QUERIES

Just about every cloud data warehouse has its own user interface for
administration and queries. Some are more usable than others. Administration is
simpler than query building. Adding a node (or setting a maximum number of nodes
for autoscaling) can be as easy as pressing one button. Some cloud data
warehouses offer a graphical query builder, which is useful for SQL novices.
Many cloud data warehouses offer a history pane for past queries and their
answers.


KEY CLOUD DATA WAREHOUSES

The 13 products listed below alphabetically either are cloud data warehouses, or
provide the functionality of data warehouses while building on a different base
architecture, such as data lakes. You could argue that Ahana, Delta Lake, and
Qubole are built on data lakes rather than starting as data warehouses, but you
could also argue that they provide much the same functionality as unquestioned
data warehouses such as AWS Redshift, Azure Synapse, and Google BigQuery. As all
these products add heterogenous federated query engines, the functional
distinction between data lakes and data warehouses tends to blur.


AHANA CLOUD FOR PRESTO

Ahana Cloud for Presto turns a data lake on Amazon S3 into what is effectively a
data warehouse, without moving any data. SQL queries run quickly even when
joining multiple heterogeneous data sources.

Presto is an open source, distributed SQL query engine for running interactive
analytic queries against data sources of all sizes. Presto allows querying data
where it lives, including Hive, Cassandra, relational databases, and proprietary
data stores. A single Presto query can combine data from multiple sources.
Facebook uses Presto for interactive queries against several internal data
stores, including their 300 PB data warehouse.

Ahana Cloud for Presto runs on Amazon, has a fairly simple user interface, and
has end-to-end cluster lifecycle management. It runs in Kubernetes and is highly
scalable. It has a built-in catalog and easy integration with data sources,
catalogs, and dashboarding tools. The default Ahana query interface is Apache
Superset. You can also use Jupyter or Zeppelin notebooks, especially if you are
doing machine learning.
Ahana claims to have 3X the performance of other Presto services, including
Amazon Elastic MapReduce and Amazon Athena.


AMAZON REDSHIFT

Using Amazon Redshift you can query and combine exabytes of structured and
semi-structured data across your data warehouse, operational database, and data
lake using standard SQL. Redshift lets you easily save the results of your
queries back to your S3 data lake using open formats, such as Apache Parquet, so
that you can do additional analytics from other analytics services such as
Amazon EMR, Amazon Athena, and Amazon SageMaker.


AZURE SYNAPSE ANALYTICS

Azure Synapse Analytics is an analytics service that brings together data
integration, data warehousing, and big data analytics. It allows you to ingest,
explore, prepare, manage, and serve data for immediate BI and machine learning
needs, and query data using either serverless or dedicated resources at scale.
Azure Synapse can run queries using Spark or SQL engines. It has deep
integration with Azure Machine Learning, Azure Cognitive Services, and Power BI.


DATABRICKS DELTA LAKE

Developed by Databricks, Delta Lake is an open source project that enables
building a “lakehouse” architecture on top of existing storage systems such as
Amazon S3, Microsoft Azure Data Lake Storage, Google Cloud Storage, and HDFS. It
adds ACID transactions, metadata handling, data versioning, schema enforcement,
and schema evolution to data lakes. Databricks Lakehouse Platform uses Delta
Lake, Spark, and MLflow in a cloud service that runs on AWS, Microsoft Azure,
and Google Cloud to combine the data management and performance typically found
in data warehouses with the low-cost, flexible object stores offered by data
lakes.


GOOGLE BIGQUERY

Google BigQuery is a serverless, petabyte-scale, cloud data warehouse with an
internal BI engine, internal machine learning accessible via SQL extensions, and
integrations across all Google Cloud services including Vertex AI and
TensorFlow. BigQuery Omni extends BigQuery to analyze data across clouds, using
Anthos. Data QnA provides a natural language front end to BigQuery. Connected
Sheets allow users to analyze billions of rows of live BigQuery data in Google
Sheets. BigQuery can process federated queries including external data sources
in object storage (Google Cloud Storage) for Parquet and ORC (Optimized Row
Columnar) file formats, transactional databases (Google Cloud Bigtable, Google
Cloud SQL), or spreadsheets in Google Drive.


ORACLE AUTONOMOUS DATA WAREHOUSE

Oracle Autonomous Data Warehouse is a cloud data warehouse service that
automates provisioning, configuring, securing, tuning, scaling, and backing up
of the data warehouse. It includes tools for self-service data loading, data
transformations, business models, automatic insights, and built-in converged
database capabilities that enable simpler queries across multiple data types and
machine learning analysis. It’s available in both the Oracle public cloud and
customers’ data centers with Oracle Cloud@Customer.


QUBOLE

Qubole is a simple, open, and secure data lake platform for machine learning,
streaming, and ad hoc analytics. It is available on the AWS, Azure, Google, and
Oracle clouds. Qubole helps you to ingest datasets from a data lake, build
schemas with Hive, query the data with Hive, Presto, Quantum, or Spark, and
continue to your data engineering and data science. You can work with Qubole
data in Zeppelin or Jupyter notebooks and Airflow workflows.


ROCKSET

Rockset is an operational analytics database. It occupies a niche between
transactional databases and data warehouses. Rockset can analyze gigabytes to
terabytes of recent, real-time, and streaming data, and has the indexes to make
most queries run in milliseconds. Rockset builds a converged index on structured
and semi-structured data from OLTP databases, streams, and data lakes in real
time, and exposes a RESTful SQL interface.


SNOWFLAKE

Snowflake is a dynamically scalable enterprise data warehouse designed for the
cloud. It runs on AWS, Azure, and Google Cloud. Snowflake features storage,
compute, and global services layers that are physically separated but logically
integrated. Data workloads scale independently from one another, making
Snowflake an appropriate platform for data warehousing, data lakes, data
engineering, data science, modern data sharing, and developing data
applications.


TERADATA VANTAGE

Teradata Vantage is a connected multi-cloud data platform for enterprise
analytics that unifies data lakes, data warehouses, analytics, and new data
sources and types. Vantage runs on public clouds (such as AWS, Azure, and Google
Cloud), hybrid multi-cloud environments, on-premises with Teradata IntelliFlex,
or on commodity hardware with VMware.


VERTICA

Related:
 * Database
 * Data Architecture
 * Analytics
 * Cloud Computing

1 2 Page 1 Next
Page 1 of 2
InfoWorld Technology of the Year Awards 2023. Now open for entries!


InfoWorld Follow us
 * 
 * 
 * 
   


 * About Us
 * Contact
 * Republication Permissions
 * Privacy Policy
 * Cookie Policy
 * European Privacy Settings
 * Member Preferences
 * Advertising
 * Foundry Careers
 * Ad Choices
 * E-commerce Links
 * California: Do Not Sell My Personal Info

Copyright © 2023 IDG Communications, Inc.

Explore the Foundry Network descend
 * CIO
 * Computerworld
 * CSO Online
 * InfoWorld
 * Network World













INFOWORLD WANTS TO SHOW YOU NOTIFICATIONS

--------------------------------------------------------------------------------

YOU CAN TURN OFF NOTIFICATIONS AT ANY TIME FROM YOUR BROWSER

Accept Do not accept

POWERED BY SUBSCRIBERS