Open, Interoperable Storage with Iceberg Tables Now Generally Available

Posted on July 19, 2024 by cloudmatrix.website

Thousands of customers have worked with Snowflake to cost-effectively build a secure data foundation as they look to solve a growing variety of business problems with more data. Increasingly, customers are looking to expand that foundation to a broader set of data across their enterprise. Snowflake is now making it even easier for customers to bring the platform’s usability, performance, governance and many workloads to more data with Iceberg tables (now generally available), unlocking full storage interoperability.

Customers including Booking.com, Komodo Health and others are already using Iceberg tables to implement open, flexible architectural patterns, such as data lakehouses, data lakes and data meshes, to further simplify the development of pipelines, models and more. With Iceberg tables, organizations can work with their data on their terms, gaining increased flexibility and support over their open data to drive value.

“Apache Iceberg’s large and diverse ecosystem of contributors and products made it a clear choice for us to provide an open and common data layer across our internal and external ecosystem,” said Thomas Davey, Chief Data Officer of Booking.com. “With Iceberg, we can broaden our use cases for Snowflake as our open data lakehouse for machine learning, AI, business intelligence and geospatial analysis, even for data stored externally.”

Let’s dive into some of the Iceberg tables functionality, use cases, and what’s ahead for Snowflake’s support for Iceberg.

Why use Iceberg Tables?

Apache Iceberg was designed from the start to be engine- and vendor-agnostic in order to enable interoperability. As a project of the Apache Software Foundation, Apache Iceberg embraces open communications and consensus decision making, placing collective interests ahead of the interests of any single entity, which is critical for long-term vendor-agnosticism. This is why Iceberg adoption is accelerating, and why Snowflake, along with many other technology vendors and open source projects, is supporting Iceberg ahead of other table formats.

Iceberg tables are a table type in Snowflake, based on the open source Apache Iceberg table format. Iceberg tables provide compute engine interoperability over a single copy of data. The Snowflake Iceberg table implementation provides capabilities to interact directly with Iceberg and Parquet data in data lakes, as well as contribute to and manage Iceberg in an open lakehouse architecture.

A few reasons why Snowflake customers are adopting Iceberg tables:

1. End-to-End Open Lakehouse Implementation: With Iceberg tables managed by Snowflake in bronze, silver and gold zones, you can leverage the breadth of Snowflake’s platform with security, performance, governance and sharing with a single copy of data. Data is stored in open formats and interoperable across external compute engines.

2. Augmenting Existing Data Lakes: Customers with existing data lakes want to tap into the power of the Snowflake platform. You can utilize Snowflake-managed Iceberg tables to be a full participant in your data lake and take advantage of features like automated table maintenance, Automatic Clustering, transformation with Snowpark and much more.

3. Zero Ingest with Zero Silos: Iceberg data already managed in a data lake can be accessed directly by Snowflake via an Iceberg catalog integration. You can quickly and easily access Iceberg data in Snowflake without the additional latency that comes with ingesting or copying data.

4. Optimized Performance: With Iceberg tables, the exceptional price-for-performance of Snowflake’s elastic compute engine extends to data stored externally in open formats.

5. Table Catalog Conversion: As Iceberg data lakes grow, managing them can be complex. With Snowflake’s simplified approach to maintaining Iceberg tables, you can convert your Iceberg tables’ catalog from an external catalog to Snowflake without rewriting the data and have Snowflake handle table maintenance.
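To make the first and last items above more concrete, here is a minimal sketch in Python that only builds the SQL text for a Snowflake-managed Iceberg table and for a catalog conversion. It does not connect to Snowflake. The statement shapes follow our reading of Snowflake's `CREATE ICEBERG TABLE` and `ALTER ICEBERG TABLE ... CONVERT TO MANAGED` syntax and should be checked against the current documentation. The helper names, the table name, the external volume `my_ext_vol` and the base location are all hypothetical.

```python
# Hypothetical helpers: they only assemble SQL strings and never run them.
# Values are interpolated without escaping, so pass trusted literals only.

def create_managed_iceberg_table_sql(table, columns, external_volume, base_location):
    """Build a CREATE ICEBERG TABLE statement that uses Snowflake as the catalog."""
    column_list = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE ICEBERG TABLE {table} ({column_list})\n"
        f"  CATALOG = 'SNOWFLAKE'\n"
        f"  EXTERNAL_VOLUME = '{external_volume}'\n"
        f"  BASE_LOCATION = '{base_location}';"
    )


def convert_to_managed_sql(table, base_location):
    """Build the statement that moves an externally cataloged table to Snowflake."""
    return (
        f"ALTER ICEBERG TABLE {table} CONVERT TO MANAGED\n"
        f"  BASE_LOCATION = '{base_location}';"
    )


if __name__ == "__main__":
    # Assumed example values, for illustration only.
    print(create_managed_iceberg_table_sql(
        "orders", [("id", "INT"), ("name", "STRING")], "my_ext_vol", "orders/"))
    print(convert_to_managed_sql("orders", "orders/"))
```

The generated text could be pasted into a worksheet or passed to whichever Snowflake client you already use. Since the conversion is described as not rewriting data, the second statement changes which catalog owns the table, not the underlying Parquet files.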

What’s new in General Availability?

If you’re an existing Snowflake Iceberg table user, general availability includes a variety of enhancements.

1. Security and governance: Iceberg tables can now inherit USAGE privilege for dependent objects, which helps streamline security for Iceberg tables. There is added flexibility on external volume and table creation, with the ability to specify an External ID. Horizon’s governance features, like Row Access Policies and Dynamic Data Masking, work out of the box on Iceberg tables.

2. Data sharing and collaboration: Leverage Iceberg data from anywhere with cross-cloud/cross-region support for externally managed Iceberg tables. You can also collaborate on Iceberg data using Snowflake’s secure sharing for Iceberg tables.

3. Flexible and robust data-handling: Take advantage of the replace invalid characters functionality for cleansing and scrubbing activities in a bronze zone of an open lakehouse. Observability gets a boost with new views into operational aspects of your Iceberg tables.

4. Metadata and evolution support: We’ve added structured-type schema evolution for flexibility as source systems or business reporting needs change. Get better Iceberg ecosystem interoperability with Primary Key information added to Iceberg table metadata.

5. Even better performance: While we’ve continued to enhance the core Snowflake engine, we also added Automatic Clustering support to further optimize performance.

6. Your data, your way: Match your compression and encoding settings to your particular storage and interoperability requirements. Take advantage of added support for unmaterialized partition values in Parquet files, unlocking the ability to use Iceberg over Hive-style partitioning.
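To illustrate the Hive-style partitioning mentioned above: in that layout, partition values typically live in `key=value` directory names and are not stored as columns inside each Parquet file, which is how we read "unmaterialized" here. The following Python sketch recovers those values from a file path. The function name and the example path are hypothetical, and this is not how Snowflake implements the feature.

```python
def partition_values_from_path(file_path):
    """Return {column: value} parsed from key=value directory segments.

    Values are returned as strings; a real reader would cast them using
    the table schema and URL-decode any escaped characters.
    """
    values = {}
    # Drop the last segment (the file name) and inspect each directory.
    for segment in file_path.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


if __name__ == "__main__":
    # Hypothetical object-store path, for illustration only.
    path = "s3://bucket/sales/year=2024/month=07/part-0000.parquet"
    print(partition_values_from_path(path))  # {'year': '2024', 'month': '07'}
```

An engine that understands this layout can expose `year` and `month` as table columns even though neither appears inside the data file.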

Support many architectures and workloads on open, interoperable storage

Businesses and use cases, even in the same industry, can be very different and inevitably change over time. Data infrastructure should serve the current set of business needs and be able to scale and evolve with change. With Snowflake and Iceberg tables, customers have the ability to adapt to these changes and deploy their choice of data architecture, all while maintaining leading security, performance and simplicity.

Apache Iceberg was initially developed to solve reliability and performance challenges with Apache Hive data lakes. By introducing Iceberg to your data lake or open lakehouse architecture, you can benefit from better performance due to more efficient query-pruning. Iceberg also allows you to perform atomic transactions on your data lake.
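As a rough illustration of the query pruning mentioned above, Iceberg records per-file lower and upper bounds for columns in its metadata, so an engine can skip files whose value range cannot match a filter. The Python sketch below shows the idea on a single date column. The `DataFile` class, its field names and the file paths are hypothetical simplifications, not Iceberg's actual metadata structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Simplified stand-in for one data file's entry in table metadata."""
    path: str
    min_order_date: str  # ISO-8601 dates sort correctly as plain strings
    max_order_date: str


def prune_files(files, lower, upper):
    """Keep only files whose [min, max] range overlaps [lower, upper]."""
    return [
        f for f in files
        if f.max_order_date >= lower and f.min_order_date <= upper
    ]


if __name__ == "__main__":
    files = [
        DataFile("jan.parquet", "2024-01-01", "2024-01-31"),
        DataFile("feb.parquet", "2024-02-01", "2024-02-29"),
        DataFile("mar.parquet", "2024-03-01", "2024-03-31"),
    ]
    # A filter on mid-February only needs to read one of the three files.
    survivors = prune_files(files, "2024-02-10", "2024-02-20")
    print([f.path for f in survivors])  # ['feb.parquet']
```

Because the decision uses metadata alone, the skipped files are never opened, which is where the performance benefit comes from.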

Snowflake’s platform can power a variety of workloads all on top of Iceberg: data engineering, artificial intelligence (AI), machine learning (ML), business intelligence (BI) and more. For example, data scientists can use Python to access raw data from a bronze layer to perform feature engineering, while you can integrate your BI tool of choice to support high-concurrency workloads on Iceberg tables in a gold layer. Iceberg gives you the flexibility to query from other engines if needed, and you get end-to-end visibility and governance for all workloads running in Snowflake.

What’s ahead for Snowflake’s support for Iceberg

We continue to listen to our customers to help them power an even broader spectrum of use cases on open, interoperable storage with the Snowflake platform.

Polaris Catalog integration: An open source Apache Iceberg catalog, based on an open REST API implementation, will continue to dissolve data silos.
Deeper OneLake integration: Our recently announced expanded partnership with Fabric OneLake will use Iceberg to provide bidirectional access.
Easier batch and streaming pipelines for Iceberg: Dynamic Tables are a hugely popular capability in the Snowflake platform. Supporting Iceberg as a storage format for Dynamic Tables will simplify data processing for data lakes and lakehouses.
Streamlined catalog integration: Automatically refreshed Iceberg tables simplify and streamline Snowflake’s integration with externally managed Iceberg tables.
Flexible sources: Friction-free solutions are key for you to get started with your Snowflake Iceberg table experience. The “direct” offerings for both Parquet and Delta Lake (currently in private preview) offer the ability to access data in place, without having to load the data into Snowflake.

Getting Started

Sign up for this lab on July 16th to get hands-on learning for using some of the latest Snowflake features on top of Iceberg tables including Snowflake Notebooks and Cortex AI. In the meantime, you can get hands-on with Iceberg today! Follow along with this quickstart guide for a step-by-step walkthrough, or check out this demo showing how Cortex AI can be used on Iceberg tables.