Navigating Iceberg with Snowflake

As Apache Iceberg continues to gain traction as the go-to open table format for lakehouse architectures, Snowflake has responded by offering multiple ways to integrate with Iceberg, each with its own trade-offs. In this post, we’ll quickly identify the key differences between Snowflake-managed Iceberg tables, externally managed Iceberg tables, and where the emerging Snowflake Open Catalog fits. And if you’re looking to keep everything inside Snowflake while maintaining an open, engine-agnostic architecture, I’ll provide a practical guide to syncing your Iceberg tables from the Snowflake Open Catalog into Snowflake for seamless interoperability.

Snowflake Managed Iceberg Tables

Snowflake-managed Iceberg tables are fully managed by Snowflake. The data is stored in your own cloud storage via an External Volume, so it is not included as part of the Snowflake storage bill.
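For context, here is a minimal sketch of what creating one looks like; the database, columns, and the external volume name (my_external_volume) are placeholders for whatever you have configured.

CREATE ICEBERG TABLE my_db.my_schema.orders (
  order_id NUMBER,
  order_ts TIMESTAMP_NTZ
)
  CATALOG = 'SNOWFLAKE'                   -- Snowflake manages the Iceberg metadata
  EXTERNAL_VOLUME = 'my_external_volume'  -- placeholder external volume
  BASE_LOCATION = 'orders/';              -- path under the volume's base location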

Summary

  • Snowflake manages Iceberg metadata, no catalog needed
  • Snowflake reads and writes to Iceberg tables
  • Able to leverage Snowflake Clustering
  • Covered by Snowflake Platform Support
  • Always up to date
  • Data can only be accessed through Snowflake


Benefits

  • Dead simple
  • Easy transition into a lakehouse architecture
  • Reduces storage costs
  • Seamless integration with existing Snowflake environment
  • Data stored within your company’s infrastructure

Cons

  • Not vendor neutral
  • Data inaccessible to other systems outside of Snowflake
  • Lacks many features of Snowflake native tables

Snowflake External Iceberg Tables

Snowflake can integrate with external Iceberg catalogs and connect to the Iceberg tables they manage, such as AWS Glue, Nessie, or Snowflake Open Catalog.

Summary

  • External catalog manages Iceberg metadata
  • Snowflake can read Iceberg tables, but it cannot write to them
  • Interoperability: can be read by other engines like Trino, Spark, etc.
  • Requires DDL statements in Snowflake before the tables are visible
  • Requires refreshing metadata to stay in sync (see the sketch after this list)
  • Supports Iceberg on top of Delta files
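To make the DDL and refresh requirements concrete, here is a minimal sketch of registering an externally managed Iceberg table in Snowflake; the integration, volume, namespace, and table names are placeholders.

-- Register an Iceberg table that an external catalog manages (names are placeholders)
CREATE ICEBERG TABLE my_db.my_schema.orders
  CATALOG = 'my_catalog_integration'
  EXTERNAL_VOLUME = 'my_external_volume'
  CATALOG_NAMESPACE = 'example'
  CATALOG_TABLE_NAME = 'orders';

-- Pull the latest metadata from the external catalog
ALTER ICEBERG TABLE my_db.my_schema.orders REFRESH;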

Snowflake Open Catalog Iceberg Tables

After Snowflake open sourced Apache Polaris, a REST catalog for Iceberg, they released Snowflake Open Catalog, a managed-service version that includes additional enterprise security features. As of today, Snowflake Open Catalog is treated much like any other external catalog integration, with the exception of one feature: CATALOG_SYNC.
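As a rough sketch of what that feature looks like, CATALOG_SYNC can be set on a Snowflake-managed Iceberg table so that Snowflake publishes it to an Open Catalog catalog integration; the volume and integration names below are placeholders, so check the current syntax against your account.

-- Hedged sketch: sync a Snowflake-managed Iceberg table out to Snowflake Open Catalog
CREATE ICEBERG TABLE my_db.my_schema.orders (
  order_id NUMBER
)
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'my_external_volume'           -- placeholder external volume
  BASE_LOCATION = 'orders/'
  CATALOG_SYNC = 'open_catalog_integration_name';  -- placeholder Open Catalog integration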

Which Route to Take?

If Snowflake is the only engine that will interact with the data, Snowflake-managed Iceberg tables make the most sense of the Iceberg options. They are best suited for large tables, especially append-heavy ones, and there are plenty of drawbacks compared to native tables. But if you need them, they are seamless, easy to stand up and manage, with the benefit of moving your storage to your own cloud provider.

However, if you want interoperability or vendor neutrality, you’ll want to leverage an external catalog. In my opinion, if you are driving toward a lakehouse architecture, this is the way to go, since interoperability is probably the biggest benefit in today’s ever-changing data landscape. With that said, Snowflake Open Catalog is an extremely easy option to get started with, but it is worth noting that it is still very different from using Snowflake-managed Iceberg tables.

Integrating Snowflake Open Catalog with Snowflake

One of the key differences between internal and external Iceberg tables is state management. Snowflake-managed Iceberg tables always reflect the latest state, and since you execute DDL within Snowflake to create and update them, your Snowflake catalog is always in sync as well. When leveraging the Snowflake Open Catalog, there are additional steps to make sure your Snowflake database is in sync with the catalog; this includes the latest table metadata as well as which tables exist in the catalog. You still need to execute DDL for Snowflake to know about an Iceberg table in your catalog.

In other words, Snowflake Open Catalog does not keep Snowflake up to date on what exists. To account for that, you can follow the guide below to sync an Iceberg namespace with your Snowflake database.

If you want to skip ahead and go straight into the code, you can review the notebook here.

Pre-requisites

Snowflake Open Catalog

Set up an Open Catalog account and create some Iceberg tables. To follow along, create a catalog named arctic_tundra.
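If you need sample tables, here is a rough sketch using Spark SQL; it assumes you already have a Spark session configured against the Open Catalog REST endpoint with the catalog mapped as arctic_tundra, and the namespace, table, and values are made up for illustration.

-- Run from an engine (e.g. Spark) connected to the Open Catalog; names are illustrative
CREATE NAMESPACE IF NOT EXISTS arctic_tundra.example;

CREATE TABLE IF NOT EXISTS arctic_tundra.example.penguins (
  id   BIGINT,
  name STRING
) USING iceberg;

INSERT INTO arctic_tundra.example.penguins VALUES (1, 'Emperor'), (2, 'Adelie');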

Create Open Catalog Integration

Placeholders

  • <org> - Your Snowflake Organization
  • <account> - Your Snowflake Open Catalog Account
  • OAuth Credentials (Doc)
      • <client_id> - OAuth Client Id
      • <client_secret> - OAuth Client Secret
  • Optional:
      • OAUTH_ALLOWED_SCOPES (doc)

SQL

USE ROLE ACCOUNTADMIN;

CREATE OR REPLACE CATALOG INTEGRATION CATALOG_INT_MY_CATALOG
  CATALOG_SOURCE = POLARIS
  TABLE_FORMAT = ICEBERG
  REST_CONFIG = (
    CATALOG_URI = 'https://<org>-<account>.snowflakecomputing.com/polaris/api/catalog'
    CATALOG_API_TYPE = PUBLIC
    CATALOG_NAME = 'arctic_tundra'
  )
  REST_AUTHENTICATION = (
    TYPE = OAUTH
    OAUTH_CLIENT_ID = '<client_id>'
    OAUTH_CLIENT_SECRET = '<client_secret>'
    OAUTH_ALLOWED_SCOPES = ('PRINCIPAL_ROLE:ALL') -- Or more restrictive scopes
  )
  ENABLED = TRUE;

GRANT USAGE ON INTEGRATION CATALOG_INT_MY_CATALOG TO ROLE SYSADMIN;
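If it is available in your account, SYSTEM$VERIFY_CATALOG_INTEGRATION gives a quick connectivity check before you go any further; this is only a sanity check and not required for the rest of the guide.

-- Sanity check that Snowflake can reach the Open Catalog REST endpoint
SELECT SYSTEM$VERIFY_CATALOG_INTEGRATION('CATALOG_INT_MY_CATALOG');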

External Volume

In this example we are going to use S3. You will need the details from your Snowflake Open Catalog setup and AWS IAM configuration.

Placeholders:

  • <catalog_default_base_location> - Can be found under Catalog Details > Storage Details
  • <account_id> - AWS Account Id
  • <role_name> - AWS Role Name
  • <external_id> - External Id set in the AWS IAM trust policy

SQL

USE ROLE ACCOUNTADMIN;

CREATE OR REPLACE EXTERNAL VOLUME EXT_VOL_ARCTIC_TUNDRA
  STORAGE_LOCATIONS =
    (
      (
        NAME = 'snowflake-open-catalog-volume'
        STORAGE_PROVIDER = 'S3'
        STORAGE_BASE_URL = '<catalog_default_base_location>'
        STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<account_id>:role/<role_name>'
        STORAGE_AWS_EXTERNAL_ID = '<external_id>'
      )
    )
  ALLOW_WRITES = TRUE;

USE ROLE SECURITYADMIN;

GRANT USAGE ON EXTERNAL VOLUME EXT_VOL_ARCTIC_TUNDRA TO ROLE SYSADMIN;
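If you are still finishing the AWS IAM side, describing the volume returns the IAM user ARN and external ID that Snowflake will present, which you can plug into the role's trust policy; this is purely a verification step.

-- Returns the Snowflake IAM user and external ID for the AWS trust policy
DESC EXTERNAL VOLUME EXT_VOL_ARCTIC_TUNDRA;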

Database

This is the database where your Iceberg tables will live, configured to use your catalog integration.

Placeholders
  • <namespace> - For simplicity’s sake, we name the schema the same as the namespace in the catalog, but it can be whatever you choose.

SQL

USE ROLE SYSADMIN;

CREATE DATABASE IF NOT EXISTS ARCTIC_TUNDRA;

ALTER DATABASE ARCTIC_TUNDRA
  SET CATALOG = 'CATALOG_INT_MY_CATALOG';

CREATE SCHEMA ARCTIC_TUNDRA.EXAMPLE;
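To confirm the database picked up the integration, you can check its CATALOG parameter; again, this is just a sanity check.

-- Verify the database-level catalog parameter points at the integration
SHOW PARAMETERS LIKE 'CATALOG' IN DATABASE ARCTIC_TUNDRA;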

Create Stage

The stage’s directory table will be used as input to a stored procedure in a later step. Note that the stage references a storage integration (INT_S3_ARCTIC_TUNDRA) that is assumed to already exist and point at the same bucket.

Placeholders:

  • <catalog_default_base_location> - Can be found under Catalog Details > Storage Details

SQL

USE ROLE SYSADMIN;

CREATE OR REPLACE STAGE ARCTIC_TUNDRA.EXAMPLE.STG_ARCTIC_TUNDRA
  URL = '<catalog_default_base_location>'
  STORAGE_INTEGRATION = INT_S3_ARCTIC_TUNDRA
  DIRECTORY = (
    ENABLE = true
    AUTO_REFRESH = true
  );
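Once the stage exists, a quick query against its directory table shows the relative paths the stored procedure will parse; refreshing first ensures the listing is current.

-- Preview the files the sync procedure will iterate over
ALTER STAGE ARCTIC_TUNDRA.EXAMPLE.STG_ARCTIC_TUNDRA REFRESH;
SELECT RELATIVE_PATH FROM DIRECTORY(@ARCTIC_TUNDRA.EXAMPLE.STG_ARCTIC_TUNDRA) LIMIT 10;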

Stored Procedure

Next we'll create a stored procedure that iterates through the Iceberg tables within the catalog and creates the corresponding Iceberg tables within Snowflake.

Parameters
  • stage_name - Stage we created to list the Iceberg table files
  • schema_name - Iceberg namespace
  • table_exclusion_pattern - Wildcard pattern of tables to exclude

The catalog is hardcoded (CATALOG_INT_MY_CATALOG), but this could be parameterized as well.

SQL

USE ROLE SYSADMIN;

CREATE OR REPLACE PROCEDURE ARCTIC_TUNDRA.EXAMPLE.SP_SYNC_CATALOG_ICEBERG_TABLES(
    stage_name VARCHAR,
    schema_name VARCHAR,
    table_exclusion_pattern VARCHAR
)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.12'
PACKAGES = ('snowflake-snowpark-python')
HANDLER = 'main'
AS
$$
def main(session, stage_name, schema_name, table_exclusion_pattern):

    result_logs = []

    # Build CREATE ICEBERG TABLE statements from the stage's directory listing.
    # The namespace and table name are derived from each file's relative path.
    query = f"""
    WITH iceberg_tables AS (
        SELECT DISTINCT
            CURRENT_DATABASE() AS database_name,
            SPLIT_PART(RELATIVE_PATH, '/', 2) AS schema_name,
            SPLIT_PART(RELATIVE_PATH, '/', 3) AS table_name,
            'CATALOG = ''CATALOG_INT_MY_CATALOG''\\n' ||
            'EXTERNAL_VOLUME = ''EXT_VOL_ARCTIC_TUNDRA''\\n' ||
            'CATALOG_NAMESPACE = ''' || SPLIT_PART(RELATIVE_PATH, '/', 2) || '''\\n' ||
            'CATALOG_TABLE_NAME = ''' || SPLIT_PART(RELATIVE_PATH, '/', 3) || '''' AS statement_params
        FROM DIRECTORY(@{stage_name})
    )
    SELECT
        'CREATE ICEBERG TABLE IF NOT EXISTS ' ||
        UPPER(database_name || '.' || schema_name || '.' || table_name) || '\\n' ||
        statement_params AS stmt
    FROM iceberg_tables
    WHERE schema_name ILIKE '{schema_name}' AND table_name NOT ILIKE '{table_exclusion_pattern}';
    """

    # Execute each generated DDL statement and record success or failure
    rows = session.sql(query).collect()
    for row in rows:
        stmt = row['STMT']
        try:
            session.sql(stmt).collect()
            result_logs.append(f"Executed: {stmt}")
        except Exception as e:
            result_logs.append(f"Failed: {stmt} -- {str(e)}")

    return "\n".join(result_logs)
$$;

Sync

Execute the stored procedure.

SQL

USE ROLE SYSADMIN;

CALL ARCTIC_TUNDRA.EXAMPLE.SP_SYNC_CATALOG_ICEBERG_TABLES('ARCTIC_TUNDRA.EXAMPLE.STG_ARCTIC_TUNDRA', 'EXAMPLE', 'sys%');
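After the call completes, a quick check confirms which Iceberg tables are now registered in Snowflake.

-- Confirm the synced tables exist in the Snowflake schema
SHOW ICEBERG TABLES IN SCHEMA ARCTIC_TUNDRA.EXAMPLE;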

Wrap Up

This is a quick and easy way to sync your Snowflake Open Catalog to your Snowflake environment, which could be extended further to replicate the functionality of CATALOG_SYNC in the reverse direction.
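One natural extension, sketched below with a placeholder warehouse (MY_WH), is to run the procedure on a schedule with a task so newly created catalog tables appear in Snowflake without manual intervention.

-- Sketch: re-run the sync hourly (MY_WH is a placeholder warehouse)
CREATE OR REPLACE TASK ARCTIC_TUNDRA.EXAMPLE.TASK_SYNC_OPEN_CATALOG
  WAREHOUSE = MY_WH
  SCHEDULE = '60 MINUTE'
AS
  CALL ARCTIC_TUNDRA.EXAMPLE.SP_SYNC_CATALOG_ICEBERG_TABLES(
    'ARCTIC_TUNDRA.EXAMPLE.STG_ARCTIC_TUNDRA', 'EXAMPLE', 'sys%');

ALTER TASK ARCTIC_TUNDRA.EXAMPLE.TASK_SYNC_OPEN_CATALOG RESUME;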

If you want to quickly apply this to your own environment, you can access the notebook here.

Want to explore how Apache Iceberg could fit into your data architecture?

Reach out to DI Squared for a free 1:1 to lay the right groundwork. Get started today.