How Federated Queries and Starburst Galaxy Save the Day: Review and Tutorial

The Challenge of Vendor Lock-In

Imagine you're trying to host a game night with your friends. You get the board games ready, whip up some snacks, and just as you’re about to kick off, you realize the cards are in one room, the dice are in another, and the game board got misplaced on a completely different floor. Oh, and to make it all worse, there’s no elevator or stairs -- you have to build them both. Suddenly, what should’ve been a fun evening is a nightmare beyond belief.

‍

Well, this is what it feels like when you’re trying to query data scattered across multiple silos. Your transaction history is in Postgres, customer feedback exists in parquet files on S3, and crucial product details are in Snowflake. Just like the pieces of your board game, disparate data makes finding answers absurdly and uncomfortably complex.

‍

This is where federated queries shine. Instead of painstakingly moving all your data into one place, federated querying allows you to join data from multiple sources as if they were living in a single database. It’s the perfect trifecta of outcomes: Time, seamlessness, and money in the bag.

‍

And here’s the best part. Tools like Starburst make this process even smoother. By uniting your scattered storage solutions under one cohesive querying framework, Starburst reduces the time-and-resource drain of building bridges between platforms.

‍

This article will explore how Starburst makes federated querying as simple as uploading your missing game cards, dice, and board through a metaphorical "magic portal."

‍

The Scenario: How Federated Queries Work

Now, instead of game night, let’s imagine you work in a bank.

‍

Over decades, it’s accumulated terabytes of data distributed across various systems for loans, accounts, and regulatory reporting. Its data warehouses house intricate procedures and transformation pipelines honed over years to meet strict compliance standards. Migrating all of that into a single modern platform like BigQuery (BQ) isn’t just daunting; it’s likely a multi-year endeavor fraught with risk. You can’t just pause those critical pipelines or move everything in one go; that would disrupt the operations and regulatory requirements that these systems uphold.

‍

What do you do?

‍

The Query that Reached Across the Data Galaxy

Federated queries handle complex data architectures, and this is especially true when we’re talking about working with vast, fragmented data sources spanning multiple systems. These are not simple, one-database queries; they reach across entire ecosystems of data warehouses to unify insights without actually physically migrating or duplicating data.

‍

At the heart of federated queries is their ability to query “in place.” Rather than duplicating or moving your data, which often involves cumbersome ETL workloads. Federated queries allow you to analyze your datasets where they already live. This approach tangibly reduces infrastructure complexity.

‍

This is where federated queries put on their game face

They allow organizations to query spread-out datasets in real-time, pulling precisely what is needed without shifting entire systems. If an analyst requests a report blending data from historical compliance records and a new customer database, federated queries make it possible for BI tools to interface with multiple data stores under a unified pane of glass. Not only does this avoid the costly and time-consuming bulk transfer of data, but it also ensures seamless operations during migrations or while handling hybrid architectures.

‍

But federated querying isn’t about solving every data access problem ever

Using federated queries is ideal for large-scale, highly structured data within enterprise warehouses, not random business unit data or loosely organized data silos. Federated queries thrive where data is already curated, but in different places. Where it does not shine is connecting raw, unrefined data across your data landscape. A very tangible reason being performance bottlenecks. It can also lead teams into development anti-patterns, storing business logic and data cleansing into the analytics/consumption layer.

‍

How Starburst Galaxy amplifies federated query capabilities

And that’s where Starburst Galaxy proves its mettle. Galaxy takes the promise of federated queries and magnifies it. It eliminates the need to endlessly preprocess or reshuffle data by weaving a logical mesh across all sources. Galaxy is the High Performance Teammate you’ve always dreamed of to manage the messy, colossal realities of enterprise environments.

‍

In addition to the benefits of query federation, Starburst Galaxy wraps numerous features that enable developers to easily manage and scale, while giving citizen developers and business users a great user experience of finding and consuming data. Some of its key features include:

‍

Enterprise Security and Governance

Performance Enhancements

Cloud (SaaS) Offering

Data Products & Cataloging

Operational Features

Support

‍

A Practical Example of Federated Queries with Starburst Galaxy

Federated queries feel magical, but there’s nothing abstract about them. So let’s look at how you’d execute a concrete bank example.

‍

Imagine a compliance team wants to consolidate loan data stored in a legacy SQL Server and customer demographic data stored in Salesforce for a regulatory report. Instead of enduring the painful and error-prone process of migrating this data, Starburst Galaxy allows them to query both systems directly. Here’s how you'd do it:

‍

Query

```

WITH

combined_data AS (

SELECT

loans.customer_id,

customers.region,

SUM(loans.principal_balance) AS total_balance,

AVG(loans.interestrate) AS avg_interest_rate,

COUNT(loans.loan_id) AS avg_loan_count

FROM

sqlserver.loan_data.loans AS loans

JOIN salesforce.customer_data.contacts AS customers

ON loans.customer_id = customers.customer_id

GROUP BY

loans.customer_id,

customers.region

)

SELECT

region,

COUNT(customer_id) AS customer_count,

SUM(total_balance) AS total_balance,

AVG(avg_interest_rate) AS avg_rate,

AVG(avg_loan_count) AS avg_loan_count

FROM

combined_data

GROUP BY

region;

‍

```

‍

And voilà! Starburst Galaxy addresses the struggle with production-grade precision to meet enterprise-grade needs.

‍

Behind the scenes:

‍

Putting It All Together with Starburst Galaxy

Here’s the vital takeaway.

‍

With a single query, you join data from Postgres, Snowflake, S3 parquet files, or even that infamous Excel file someone conveniently emailed five minutes ago. You don’t just fetch data; you make it seamlessly act like it’s all in one database. It takes elbow grease to get started, and you have to kick your infrastructure into shape first, too but, in the end, it pays off.

‍

In our example above, Galaxy isn’t just a tool. It’s a paradigm shift. It can eliminate additional data pipelines, latency, and further development by enabling users to surface data through data products.

‍

If you are interested in taking it for a spin, you can try it out for free here. And if you need help getting things setup or deciding if Starburst Galaxy is the right solution for your data stack we are here to help!

‍

DI Squared is here to help

If you’re deciding between products, and want a helping hand figuring out if Starburst is the best option for your use cases, go ahead and put some time on our calendar. Our job (and favorite thing to do) is to advise you and help you take the plunge.

‍