Navigating Iceberg with Snowflake

As Apache Iceberg continues to gain traction as the go-to open table format for lakehouse architectures, Snowflake has responded by offering multiple ways to integrate with Iceberg, each with its own trade-offs. In this post, we’ll quickly identify the key differences between Snowflake-managed Iceberg tables, externally managed Iceberg tables, and where the emerging Snowflake Open Catalog fits. And if you’re looking to keep everything inside Snowflake while maintaining an open, engine-agnostic architecture, I’ll provide a practical guide to syncing your Iceberg tables from the Snowflake Open Catalog into Snowflake for seamless interoperability.

Snowflake Managed Iceberg Tables

Snowflake-managed Iceberg tables are fully managed by Snowflake. The data is stored in your own cloud storage via an External Volume, so it is not included as part of the Snowflake storage bill.
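For context, here is a minimal sketch of what creating one looks like; the database, columns, and the external volume name (my_external_volume) are placeholders for whatever you have configured.

CREATE ICEBERG TABLE my_db.my_schema.orders (
  order_id NUMBER,
  order_ts TIMESTAMP_NTZ
)
  CATALOG = 'SNOWFLAKE'                   -- Snowflake manages the Iceberg metadata
  EXTERNAL_VOLUME = 'my_external_volume'  -- placeholder external volume
  BASE_LOCATION = 'orders/';              -- path under the volume's base location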

Summary

  • Snowflake manages Iceberg metadata, no catalog needed
  • Snowflake reads and writes to Iceberg tables
  • Able to leverage Snowflake Clustering
  • Covered by Snowflake Platform Support
  • Always up to date
  • Data can only be accessed through Snowflake


Benefits

  • Dead simple
  • Easy transition into a lakehouse architecture
  • Reduces storage costs
  • Seamless integration with existing Snowflake environment
  • Data stored within your company’s infrastructure

Cons

  • Not vendor neutral
  • Data inaccessible to other systems outside of Snowflake
  • Lacks many features of Snowflake native tables

Snowflake External Iceberg Tables

Snowflake can integrate with external Iceberg catalogs and connect to the Iceberg tables they manage, such as AWS Glue, Nessie, or Snowflake Open Catalog.

Summary

  • External catalog manages Iceberg metadata
  • Snowflake can read Iceberg tables, but it cannot write to them
  • Interoperability: can be read by other engines like Trino, Spark, etc.
  • Requires DDL statements in Snowflake before the tables are visible
  • Requires refreshing metadata to stay in sync (see the sketch after this list)
  • Supports Iceberg on top of Delta files
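To make the DDL and refresh requirements concrete, here is a minimal sketch of registering an externally managed Iceberg table in Snowflake; the integration, volume, namespace, and table names are placeholders.

-- Register an Iceberg table that an external catalog manages (names are placeholders)
CREATE ICEBERG TABLE my_db.my_schema.orders
  CATALOG = 'my_catalog_integration'
  EXTERNAL_VOLUME = 'my_external_volume'
  CATALOG_NAMESPACE = 'example'
  CATALOG_TABLE_NAME = 'orders';

-- Pull the latest metadata from the external catalog
ALTER ICEBERG TABLE my_db.my_schema.orders REFRESH;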

Snowflake Open Catalog Iceberg Tables

After Snowflake open sourced Apache Polaris, a REST catalog for Iceberg, they released Snowflake Open Catalog, a managed-service version that includes additional enterprise security features. As of today, Snowflake Open Catalog is treated much like any other external catalog integration, with the exception of one feature: CATALOG_SYNC.
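As a rough sketch of what that feature looks like, CATALOG_SYNC can be set on a Snowflake-managed Iceberg table so that Snowflake publishes it to an Open Catalog catalog integration; the volume and integration names below are placeholders, so check the current syntax against your account.

-- Hedged sketch: sync a Snowflake-managed Iceberg table out to Snowflake Open Catalog
CREATE ICEBERG TABLE my_db.my_schema.orders (
  order_id NUMBER
)
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'my_external_volume'           -- placeholder external volume
  BASE_LOCATION = 'orders/'
  CATALOG_SYNC = 'open_catalog_integration_name';  -- placeholder Open Catalog integration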

Which Route to Take?

If Snowflake is the only engine that will interact with the data, Snowflake-managed Iceberg tables make the most sense of the Iceberg options. They are best suited for large tables, especially append-heavy ones, and there are plenty of drawbacks compared to native tables. But if you need them, they are seamless, easy to stand up and manage, with the benefit of moving your storage to your own cloud provider.

However, if you want interoperability or vendor neutrality, you’ll want to leverage an external catalog. In my opinion, if you are driving toward a lakehouse architecture, this is the way to go, since interoperability is probably the biggest benefit in today’s ever-changing data landscape. With that said, Snowflake Open Catalog is an extremely easy option to get started with, but it is worth noting that it is still very different from using Snowflake-managed Iceberg tables.

Integrating Snowflake Open Catalog with Snowflake

One of the key differences between internal and external Iceberg tables is state management. Snowflake-managed Iceberg tables always reflect the latest state, and since you execute DDL within Snowflake to create and update them, your Snowflake catalog is always in sync as well. When leveraging the Snowflake Open Catalog, there are additional steps to make sure your Snowflake database is in sync with the catalog; this includes the latest table metadata as well as which tables exist in the catalog. You still need to execute DDL for Snowflake to know about an Iceberg table in your catalog.

In other words, Snowflake Open Catalog does not keep Snowflake up to date on what exists. To account for that, you can follow the guide below to sync an Iceberg namespace with your Snowflake database.

If you want to skip ahead and go straight into the code, you can review the notebook here.

Pre-requisites

Snowflake Open Catalog

Set up an Open Catalog account and create some Iceberg tables. To follow along, create a catalog named arctic_tundra.
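If you need sample tables, here is a rough sketch using Spark SQL; it assumes you already have a Spark session configured against the Open Catalog REST endpoint with the catalog mapped as arctic_tundra, and the namespace, table, and values are made up for illustration.

-- Run from an engine (e.g. Spark) connected to the Open Catalog; names are illustrative
CREATE NAMESPACE IF NOT EXISTS arctic_tundra.example;

CREATE TABLE IF NOT EXISTS arctic_tundra.example.penguins (
  id   BIGINT,
  name STRING
) USING iceberg;

INSERT INTO arctic_tundra.example.penguins VALUES (1, 'Emperor'), (2, 'Adelie');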

Create Open Catalog Integration

Placeholders

  • <org> - Your Snowflake Organization
  • <account> - Your Snowflake Open Catalog Account
  • OAuth Credentials (Doc)
      • <client_id> - OAuth Client Id
      • <client_secret> - OAuth Client Secret
  • Optional:
      • OAUTH_ALLOWED_SCOPES (doc)

SQL

USE ROLE ACCOUNTADMIN;

CREATE OR REPLACE CATALOG INTEGRATION CATALOG_INT_MY_CATALOG
  CATALOG_SOURCE = POLARIS
  TABLE_FORMAT = ICEBERG
  REST_CONFIG = (
    CATALOG_URI = 'https://<org>-<account>.snowflakecomputing.com/polaris/api/catalog'
    CATALOG_API_TYPE = PUBLIC
    CATALOG_NAME = 'arctic_tundra'
  )
  REST_AUTHENTICATION = (
    TYPE = OAUTH
    OAUTH_CLIENT_ID = '<client_id>'
    OAUTH_CLIENT_SECRET = '<client_secret>'
    OAUTH_ALLOWED_SCOPES = ('PRINCIPAL_ROLE:ALL') -- Or more restrictive scopes
  )
  ENABLED = TRUE;

GRANT USAGE ON INTEGRATION CATALOG_INT_MY_CATALOG TO ROLE SYSADMIN;
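If it is available in your account, SYSTEM$VERIFY_CATALOG_INTEGRATION gives a quick connectivity check before you go any further; this is only a sanity check and not required for the rest of the guide.

-- Sanity check that Snowflake can reach the Open Catalog REST endpoint
SELECT SYSTEM$VERIFY_CATALOG_INTEGRATION('CATALOG_INT_MY_CATALOG');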

External Volume

In this example we are going to use S3. You will need the details from your Snowflake Open Catalog setup and AWS IAM configuration.

Placeholders:

  • <catalog_default_base_location> - Can be found under Catalog Details > Storage Details
  • <account_id> - AWS Account Id
  • <role_name> - AWS Role Name
  • <external_id> - External Id set in the AWS IAM trust policy

SQL

USE ROLE ACCOUNTADMIN;

CREATE OR REPLACE EXTERNAL VOLUME EXT_VOL_ARCTIC_TUNDRA
  STORAGE_LOCATIONS =
    (
      (
        NAME = 'snowflake-open-catalog-volume'
        STORAGE_PROVIDER = 'S3'
        STORAGE_BASE_URL = '<catalog_default_base_location>'
        STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<account_id>:role/<role_name>'
        STORAGE_AWS_EXTERNAL_ID = '<external_id>'
      )
    )
  ALLOW_WRITES = TRUE;

USE ROLE SECURITYADMIN;

GRANT USAGE ON EXTERNAL VOLUME EXT_VOL_ARCTIC_TUNDRA TO ROLE SYSADMIN;
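If you are still finishing the AWS IAM side, describing the volume returns the IAM user ARN and external ID that Snowflake will present, which you can plug into the role's trust policy; this is purely a verification step.

-- Returns the Snowflake IAM user and external ID for the AWS trust policy
DESC EXTERNAL VOLUME EXT_VOL_ARCTIC_TUNDRA;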

Database

This is the database where your Iceberg tables will live, configured to use your catalog integration.

Placeholders
  • <namespace> - For simplicity’s sake, we name the schema the same as the namespace in the catalog, but it can be whatever you choose.

SQL

USE ROLE SYSADMIN;

CREATE DATABASE IF NOT EXISTS ARCTIC_TUNDRA;

ALTER DATABASE ARCTIC_TUNDRA
  SET CATALOG = 'CATALOG_INT_MY_CATALOG';

CREATE SCHEMA ARCTIC_TUNDRA.EXAMPLE;
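To confirm the database picked up the integration, you can check its CATALOG parameter; again, this is just a sanity check.

-- Verify the database-level catalog parameter points at the integration
SHOW PARAMETERS LIKE 'CATALOG' IN DATABASE ARCTIC_TUNDRA;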

Create Stage

The stage’s directory table will be used as input to a stored procedure in a later step. Note that the stage references a storage integration (INT_S3_ARCTIC_TUNDRA) that is assumed to already exist and point at the same bucket.

Placeholders:

  • <catalog_default_base_location> - Can be found under Catalog Details > Storage Details

SQL

USE ROLE SYSADMIN;

CREATE OR REPLACE STAGE ARCTIC_TUNDRA.EXAMPLE.STG_ARCTIC_TUNDRA
  URL = '<catalog_default_base_location>'
  STORAGE_INTEGRATION = INT_S3_ARCTIC_TUNDRA
  DIRECTORY = (
    ENABLE = true
    AUTO_REFRESH = true
  );
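Once the stage exists, a quick query against its directory table shows the relative paths the stored procedure will parse; refreshing first ensures the listing is current.

-- Preview the files the sync procedure will iterate over
ALTER STAGE ARCTIC_TUNDRA.EXAMPLE.STG_ARCTIC_TUNDRA REFRESH;
SELECT RELATIVE_PATH FROM DIRECTORY(@ARCTIC_TUNDRA.EXAMPLE.STG_ARCTIC_TUNDRA) LIMIT 10;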

Stored Procedure

Next we'll create a stored procedure that iterates through the Iceberg tables within the catalog and creates the corresponding Iceberg tables within Snowflake.

Parameters
  • stage_name - Stage we created to list the Iceberg table files
  • schema_name - Iceberg namespace
  • table_exclusion_pattern - Wildcard pattern of tables to exclude

The catalog is hardcoded (CATALOG_INT_MY_CATALOG), but this could be parameterized as well.

SQL

USE ROLE SYSADMIN;

CREATE OR REPLACE PROCEDURE ARCTIC_TUNDRA.EXAMPLE.SP_SYNC_CATALOG_ICEBERG_TABLES(
    stage_name VARCHAR,
    schema_name VARCHAR,
    table_exclusion_pattern VARCHAR
)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.12'
PACKAGES = ('snowflake-snowpark-python')
HANDLER = 'main'
AS
$$
def main(session, stage_name, schema_name, table_exclusion_pattern):

    result_logs = []

    # Build CREATE ICEBERG TABLE statements from the stage's directory listing.
    # The namespace and table name are derived from each file's relative path.
    query = f"""
    WITH iceberg_tables AS (
        SELECT DISTINCT
            CURRENT_DATABASE() AS database_name,
            SPLIT_PART(RELATIVE_PATH, '/', 2) AS schema_name,
            SPLIT_PART(RELATIVE_PATH, '/', 3) AS table_name,
            'CATALOG = ''CATALOG_INT_MY_CATALOG''\\n' ||
            'EXTERNAL_VOLUME = ''EXT_VOL_ARCTIC_TUNDRA''\\n' ||
            'CATALOG_NAMESPACE = ''' || SPLIT_PART(RELATIVE_PATH, '/', 2) || '''\\n' ||
            'CATALOG_TABLE_NAME = ''' || SPLIT_PART(RELATIVE_PATH, '/', 3) || '''' AS statement_params
        FROM DIRECTORY(@{stage_name})
    )
    SELECT
        'CREATE ICEBERG TABLE IF NOT EXISTS ' ||
        UPPER(database_name || '.' || schema_name || '.' || table_name) || '\\n' ||
        statement_params AS stmt
    FROM iceberg_tables
    WHERE schema_name ILIKE '{schema_name}' AND table_name NOT ILIKE '{table_exclusion_pattern}';
    """

    # Execute each generated DDL statement and record success or failure
    rows = session.sql(query).collect()
    for row in rows:
        stmt = row['STMT']
        try:
            session.sql(stmt).collect()
            result_logs.append(f"Executed: {stmt}")
        except Exception as e:
            result_logs.append(f"Failed: {stmt} -- {str(e)}")

    return "\n".join(result_logs)
$$;

Sync

Execute the stored procedure.

SQL

USE ROLE SYSADMIN;

CALL ARCTIC_TUNDRA.EXAMPLE.SP_SYNC_CATALOG_ICEBERG_TABLES('ARCTIC_TUNDRA.EXAMPLE.STG_ARCTIC_TUNDRA', 'EXAMPLE', 'sys%');
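After the call completes, a quick check confirms which Iceberg tables are now registered in Snowflake.

-- Confirm the synced tables exist in the Snowflake schema
SHOW ICEBERG TABLES IN SCHEMA ARCTIC_TUNDRA.EXAMPLE;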

Wrap Up

This is a quick and easy way to sync your Snowflake Open Catalog to your Snowflake environment, which could be extended further to replicate the functionality of CATALOG_SYNC in the reverse direction.
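One natural extension, sketched below with a placeholder warehouse (MY_WH), is to run the procedure on a schedule with a task so newly created catalog tables appear in Snowflake without manual intervention.

-- Sketch: re-run the sync hourly (MY_WH is a placeholder warehouse)
CREATE OR REPLACE TASK ARCTIC_TUNDRA.EXAMPLE.TASK_SYNC_OPEN_CATALOG
  WAREHOUSE = MY_WH
  SCHEDULE = '60 MINUTE'
AS
  CALL ARCTIC_TUNDRA.EXAMPLE.SP_SYNC_CATALOG_ICEBERG_TABLES(
    'ARCTIC_TUNDRA.EXAMPLE.STG_ARCTIC_TUNDRA', 'EXAMPLE', 'sys%');

ALTER TASK ARCTIC_TUNDRA.EXAMPLE.TASK_SYNC_OPEN_CATALOG RESUME;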

If you want to quickly apply this to your own environment, you can access the notebook here.

Want to explore how Apache Iceberg could fit into your data architecture?

Reach out to DI Squared for a free 1:1 to lay the right groundwork. Get started today.