blog.cloudera.com
Open in
urlscan Pro
35.185.25.53
Public Scan
URL:
https://blog.cloudera.com/open-data-lakehouse-powered-by-iceberg-for-all-your-data-warehouse-needs/
Submission: On July 12 via api from US — Scanned from DE
Submission: On July 12 via api from US — Scanned from DE
Form analysis
3 forms found in the DOMGET https://blog.cloudera.com
<form class="search standard" method="get" action="https://blog.cloudera.com" role="search">
<div class="search-wrap">
<input class="search-input" type="search" name="s" value="" placeholder="Search all blogs">
<button type="submit" class="btn alt">Search</button>
</div>
</form>
GET https://blog.cloudera.com
<form class="search standard" method="get" action="https://blog.cloudera.com" role="search">
<div class="search-wrap">
<input class="search-input" type="search" name="s" value="" placeholder="Search all blogs">
<button type="submit" class="btn alt">Search</button>
</div>
</form>
POST https://blog.cloudera.com/wp-comments-post.php?wpe-comment-post=clouderablog
<form action="https://blog.cloudera.com/wp-comments-post.php?wpe-comment-post=clouderablog" method="post" id="commentform" class="comment-form"><noscript>JavaScript is required to submit comments. Please enable JavaScript before
proceeding.</noscript>
<p class="comment-form-author"><input id="author" required="required" placeholder="Name *" name="author" type="text" value=""></p>
<p class="comment-form-email"><input id="email" required="required" placeholder="Email *" name="email" type="text" value=""></p>
<p class="comment-form-cookies-consent"><label for="wp-comment-cookies-consent"><input id="wp-comment-cookies-consent" name="wp-comment-cookies-consent" type="checkbox" value="yes">Save my name, and email in this browser for the next time I
comment.</label></p>
<p class="comment-form-comment"><textarea id="comment" name="comment" cols="45" rows="8" maxlength="65525" required="required" placeholder="Please leave a comment here…"></textarea></p>
<p class="form-submit"><input name="btnSubmit" type="submit" id="btnSubmit" class="btn primary" value="Post Comment"> <input type="hidden" name="comment_post_ID" value="145475" id="comment_post_ID">
<input type="hidden" name="comment_parent" id="comment_parent" value="0">
<input type="hidden" name="9KOZ1TPL41pr" value="c10983b402">
</p>
<div class="_NnGfEvRb2Z2A"><input type="text" name="comment_NnGfEvRb2Z2A" tabindex="-1" autocomplete="NnGfEvRb2Z2A"></div>
<p></p>
</form>
Text Content
* Business * Technical * Culture * Categories Search * Business * Technical * Culture * Categories Previous Search Next blog post Technical | OPEN DATA LAKEHOUSE POWERED BY ICEBERG FOR ALL YOUR DATA WAREHOUSE NEEDS SHARE Apache Iceberg Data Warehouse Data Ingestion by Zoltán Borók-Nagy, Ayush Saxena, Tamas Mate, and Simhadri Govindappa Posted in Technical | April 03, 2023 8 min read Since we announced the general availability of Apache Iceberg in Cloudera Data Platform (CDP), we are excited to see customers testing their analytic workloads on Iceberg. We are also receiving several requests to share more details on how key data services in CDP, such as Cloudera Data Warehousing (CDW), Cloudera Data Engineering (CDE), Cloudera Machine Learning (CML), Cloudera Data Flow (CDF) and Cloudera Stream Processing (CSP) integrate with the Apache Iceberg table format and the easiest way to get started. In this blog, we will share with you in detail how Cloudera integrates core compute engines including Apache Hive and Apache Impala in Cloudera Data Warehouse with Iceberg. We will publish follow up blogs for other data services. ICEBERG BASICS Iceberg is an open table format designed for large analytic workloads. As described in Iceberg Introduction it supports schema evolution, hidden partitioning, partition layout evolution and time travel. Every table change creates an Iceberg snapshot, this helps to resolve concurrency issues and allows readers to scan a stable table state every time. The Apache Iceberg project also develops an implementation of the specification in the form of a Java library. This library is integrated by execution engines such as Impala, Hive and Spark. The new feature this blog post is aiming to discuss about Iceberg V2 format (version 2), as the Iceberg table specification explains, the V1 format aimed to support large analytic data tables, while V2 aimed to add row level deletes and updates. In a bit more detail, Iceberg V1 added support for creating, updating, deleting and inserting data into tables. The table metadata is stored next to the data files under a metadata directory, which allows multiple engines to use the same table simultaneously. ICEBERG V2 With Iceberg V2 it is possible to do row-level modifications without rewriting the data files. The idea is to store information about the deleted records in so-called delete files. We chose to use position delete files which provide the best performance for queries. These files store the file paths and positions of the deleted records. During queries the query engines scan both the data files and delete files belonging to the same snapshot and merge them together (i.e. eliminating the deleted rows from the output). Updating row values is achievable by doing a DELETE plus an INSERT operation in a single transaction. Compacting the tables merges the changes/deletes with the actual data files to improve performance of reads. To compact the tables use CDE Spark. By default, Hive and Impala still create Iceberg V1 tables. To create a V2 table, users need to set table property ‘format-version’ to ‘2’. Existing Iceberg V1 tables can be upgraded to V2 tables by simply setting table property ‘format-version’ to ‘2’. Hive and Impala are compatible with both Iceberg format versions, i.e. users can still use their old V1 tables; V2 tables simply have more features. USE CASES Complying with specific aspects of regulations such as GDPR (General Data Protection Regulation) and CCPA (California Consumer Privacy Act) means that databases need to be able to delete personal data upon customer requests. With delete files we can easily mark the records belonging to specific people. Then regular compaction jobs can physically erase the deleted records. Another trivial use case is when existing records need to be modified to correct wrong data or update outdated values. HOW TO UPDATE AND DELETE Currently only Hive can do row level modifications. Impala can read the updated tables and it can also INSERT data into Iceberg V2 tables. To remove all data belonging to a single customer: DELETE FROM ice_tbl WHERE user_id = 1234; To update a column value in a specific record: UPDATE ice_tbl SET col_v = col_v + 1 WHERE id = 4321; Use the MERGE INTO statement to update an Iceberg table based on a staging table: MERGE INTO customer USING (SELECT * FROM new_customer_stage) sub ON sub.id = customer.id WHEN MATCHED THEN UPDATE SET name = sub.name, state = sub.new_state WHEN NOT MATCHED THEN INSERT VALUES (sub.id, sub.name, sub.state); WHEN NOT TO USE ICEBERG Iceberg tables feature atomic DELETE and UPDATE operations, making them similar to traditional RDBMS systems. However, it’s important to note that they are not suitable for OLTP workloads as they are not designed to handle high frequency transactions. Instead, Iceberg is intended for managing large, infrequently changing datasets. If one is looking for a solution that can handle very large datasets and frequent updates, we recommend using Apache Kudu. CDW BASICS Cloudera Data Warehouse (CDW) Data Service is a Kubernetes-based application for creating highly performant, independent, self-service data warehouses in the cloud that can be scaled dynamically and upgraded independently. CDW supports streamlined application development with open standards, open file and table formats, and standard APIs. CDW leverages Apache Iceberg, Apache Impala, and Apache Hive to provide broad coverage, enabling the best-optimized set of capabilities for each workload. CDW separates the compute (Virtual Warehouses) and metadata (DB catalogs) by running them in independent Kubernetes pods. Compute in the form of Hive LLAP or Impala Virtual Warehouses can be provisioned on-demand, auto-scaled based on query load, and de-provisioned when idle thus reducing cloud costs and providing consistent quick results with high concurrency, HA, and query isolation. Thus simplifying data exploration, ETL and deriving analytical insights on any enterprise data across the Data Lake. CDW also simplifies administration by making multi-tenancy secure and manageable. It allows us to independently upgrade the Virtual Warehouses and Database Catalogs. Through tenant isolation, CDW can process workloads that do not interfere with each other, so everyone meets report timelines while controlling cloud costs. HOW TO USE In the following sections we are going to provide a few examples of how to create Iceberg V2 tables and how to interact with them. We’ll see how one can insert data, change the schema or the partition layout, how to remove/update rows, do time-travel and snapshot management. HIVE: CREATING A ICEBERG V2 TABLE A Hive Iceberg V2 table can be created by specifying the format-version as 2 in the table properties. Ex. * CREATE Iceberg V2 table CREATE EXTERNAL TABLE TBL_ICEBERG_PART(ID INT, NAME STRING) PARTITIONED BY (DEPT STRING) STORED BY ICEBERG STORED AS PARQUET TBLPROPERTIES ('FORMAT-VERSION'='2'); * CREATE TABLE AS SELECT (CTAS) CREATE EXTERNAL TABLE CTAS_ICEBERG_SOURCE STORED BY ICEBERG AS SELECT * FROM TBL_ICEBERG_PART; * CREATE TABLE LIKE (creates an empty table based on another table) CREATE EXTERNAL TABLE ICEBERG_CTLT_TARGET LIKE ICEBERG_CTLT_SOURCE STORED BY ICEBERG; INGESTING DATA Data into an Iceberg V2 table can be inserted similarly like normal Hive tables Ex: * INSERT INTO INSERT INTO TABLE TBL_ICEBERG_PART VALUES (1,'ONE','MATH'), (2, 'ONE','PHYSICS'), (3,'ONE','CHEMISTRY'), (4,'TWO','MATH'), (5, 'TWO','PHYSICS'), (6,'TWO','CHEMISTRY'); * INSERT OVERWRITE INSERT OVERWRITE TABLE CTLT_ICEBERG_SOURCE SELECT * FROM TBL_ICEBERG_PART; * MERGE MERGE INTO TBL_ICEBERG_PART USING TBL_ICEBERG_PART_2 ON TBL_ICEBERG_PART.ID = TBL_ICEBERG_PART_2.ID WHEN NOT MATCHED THEN INSERT VALUES (TBL_ICEBERG_PART_2.ID, TBL_ICEBERG_PART_2.NAME, TBL_ICEBERG_PART_2.DEPT); DELETE & UPDATES: V2 tables allow row level deletes and updates similarly like the Hive-ACID tables. Ex: * Deleting a Row DELETE FROM TBL_ICEBERG_PART WHERE DEPT = 'MATH'; * Updating a Row value UPDATE TBL_ICEBERG_PART SET DEPT='BIOLOGY' WHERE DEPT = 'PHYSICS' OR ID = 6; QUERYING ICEBERG TABLES: * SELECT statements Hive supports both vectorized and non vectorized reads for Iceberg V2 tables, Vectorization can be enabled normally using the following configs: 1. set hive.llap.io.memory.mode=cache; 2. set hive.llap.io.enabled=true; 3. set hive.vectorized.execution.enabled=true SELECT COUNT(*) FROM TBL_ICEBERG_PART; * Time-Travel statements Hive allows us to query table data for specific snapshot versions. SELECT * FROM TBL_ICEBERG_PART FOR SYSTEM_VERSION AS OF 7521248990126549311; SNAPSHOT MANAGEMENT Hive allows several operations regarding snapshot management, like: * Expire Snapshot: ALTER TABLE TBL_ICEBERG_PART EXECUTE EXPIRE_SNAPSHOTS('2021-12-09 05:39:18.689000000'); * Set Current Snapshot: ALTER TABLE TBL_ICEBERG_PART EXECUTE SET_CURRENT_SNAPSHOT (7521248990126549311); * RollBack Snapshot: ALTER TABLE TBL_ICEBERG_PART EXECUTE ROLLBACK(3088747670581784990); ALTER ICEBERG TABLES * Schema evolution: ALTER TABLE … ADD COLUMNS (...); (Add a column) ALTER TABLE … REPLACE COLUMNS (...);(Drop column by using REPLACE COLUMN to remove the old column) ALTER TABLE … CHANGE COLUMN … AFTER …; (Reorder columns) * Partition evolution: ALTER TABLE TBL_ICEBERG_PART SET PARTITION SPEC (NAME); MATERIALIZED VIEWS * Creating Materialized Views: CREATE MATERIALIZED VIEW MAT_ICEBERG AS SELECT ID, NAME FROM TBL_ICEBERG_PART ; * Rebuild Materialized Views: ALTER MATERIALIZED VIEW MAT_ICEBERG REBUILD; * Querying Materialized Views: SELECT * FROM MAT_ICEBERG; IMPALA Apache Impala is an open source, distributed, massively parallel SQL query engine with its backend executors written in C++, and its frontend (analyzer, planner) written in java. Impala uses the Iceberg Java library to get information about Iceberg tables during query analysis and planning. On the other hand, for query execution the high performing C++ executors are in charge. This means queries on Iceberg tables are lightning fast. Impala supports the following statements on Iceberg tables. CREATING ICEBERG TABLES * CREATE Iceberg V2 table: CREATE TABLE ice_t(id INT, name STRING, dept STRING) PARTITIONED BY SPEC (bucket(19, id), dept) STORED BY ICEBERG TBLPROPERTIES ('format-version'='2'); * CREATE TABLE AS SELECT (CTAS): CREATE TABLE ice_ctas PARTITIONED BY SPEC (truncate(1000, id)) STORED BY ICEBERG TBLPROPERTIES ('format-version'='2') AS SELECT id, int_col, string_col FROM source_table; * CREATE TABLE LIKE: (creates an empty table based on another table) CREATE TABLE new_ice_tbl LIKE orig_ice_tbl; QUERYING ICEBERG TABLES Impala supports reading V2 tables with position deletes. * SELECT statements: Impala supports all kinds of queries on Iceberg tables that it supports for any other tables. E.g. joins, aggregations, analytical queries etc. are all supported. SELECT * FROM ice_t; SELECT count(*) FROM ice_t i LEFT OUTER JOIN other_t b ON (i.id = other_t.fid) WHERE i.col = 42; * Time-travel statements: It’s possible to query earlier snapshots of a table (until they are expired). SELECT * FROM ice_t FOR SYSTEM_TIME AS OF '2022-01-04 10:00:00'; SELECT * FROM ice_t FOR SYSTEM_TIME AS OF now() - interval 5 days; SELECT * FROM ice_t FOR SYSTEM_VERSION AS OF 123456; We can use DESCRIBE HISTORY statement to see what are the earlier snapshots of a table: DESCRIBE HISTORY ice_t FROM '2022-01-04 10:00:00'; DESCRIBE HISTORY ice_t FROM now() - interval 5 days; DESCRIBE HISTORY ice_t BETWEEN '2022-01-04 10:00:00' AND '2022-01-05 10:00:00'; INSERT DATA INTO ICEBERG TABLES INSERT statements work for both V1 and V2 tables. * INSERT INTO: INSERT INTO ice_t VALUES (1, 2); INSERT INTO ice_t SELECT col_a, col_b FROM other_t; * INSERT OVERWRITE: INSERT OVERWRITE ice_t VALUES (1, 2); INSERT OVERWRITE ice_t SELECT col_a, col_b FROM other_t; LOAD DATA INTO ICEBERG TABLES * LOAD DATA INPATH: LOAD DATA INPATH '/tmp/some_db/parquet_files/' INTO TABLE iceberg_tbl; ALTER ICEBERG TABLES * Schema evolution: ALTER TABLE ... RENAME TO ... (renames the table) ALTER TABLE ... CHANGE COLUMN ... (change name and type of a column) ALTER TABLE ... ADD COLUMNS ... (adds columns to the end of the table) ALTER TABLE ... DROP COLUMN ... * Partition evolution: ALTER TABLE ice_p SET PARTITION SPEC (VOID(i), VOID(d), TRUNCATE(3, s), HOUR(t), i); SNAPSHOT MANAGEMENT * Expire old snapshots: ALTER TABLE ice_tbl EXECUTE expire_snapshots('2022-01-04 10:00:00'); ALTER TABLE ice_tbl EXECUTE expire_snapshots(now() - interval 5 days); DELETE and UPDATE statements for Impala are coming in later releases. As mentioned above, Impala is using its own C++ implementation to deal with Iceberg tables. This gives significant performance advantages compared to other engines. FUTURE WORK Our support for Iceberg v2 is advanced and reliable, and we continue our push for innovation. We are rapidly developing improvements, so you can expect to find new features related to Iceberg in each CDW release. Please let us know your feedback in the comments section below. SUMMARY Iceberg is an emerging, extremely interesting table format. It is under rapid development with new features coming every month. Cloudera Data Warehouse added support for the most recent format version of Iceberg in its latest release. Users can run Hive and Impala virtual warehouses and interact with their Iceberg tables via SQL statements. These engines are also evolving quickly and we deliver new features and optimizations in every release. Stay tuned, you can expect more blog posts from us about upcoming features and technical deep dives. TO LEARN MORE: * Replay our webinar Unifying Your Data: AI and Analytics on One Lakehouse, where we discuss the benefits of Iceberg and open data lakehouse. * Read why the future of data lakehouses is open. * Replay our meetup Apache Iceberg: Looking Below the Waterline. Try Cloudera Data Warehouse (CDW) by signing up for a 60 day trial, or test drive CDP. If you are interested in chatting about Apache Iceberg in CDP, let your account team know or contact us directly. As always, please provide your feedback in the comments section below. Zoltán Borók-Nagy Senior Staff Engineer More by this author Ayush Saxena Staff Software Engineer More by this author Tamas Mate Senior Software Engineer More by this author Simhadri Govindappa Software Engineer 2 More by this author EDITOR'S CHOICE Business Generative AI for the Enterprise Business Why I Prefer Cloudera CDP LEAVE A COMMENT CANCEL REPLY Your email address will not be published. Links are not permitted in comments. JavaScript is required to submit comments. Please enable JavaScript before proceeding. Save my name, and email in this browser for the next time I comment. * About * Products * Solutions * Services & Support Contact Us US: +1 888 789 1488 Outside the US: +1 650 362 0488 © 2023 Cloudera, Inc. All rights reserved. | Terms & Conditions | Privacy Policy and Data Policy | Cookie-Präferenzen Apache Hadoop and associated open source project names are trademarks of the Apache Software Foundation. For a complete list of trademarks, click here.