May 16, 2025
Access Amazon Redshift Managed Storage tables through Apache Spark on AWS Glue and Amazon EMR using Amazon SageMaker Lakehouse

Data environments in data-driven organizations are evolving to meet the growing demands for analytics, including business intelligence (BI) dashboarding, ad hoc querying, data science, machine learning (ML), and generative AI. These organizations increasingly demand lakehouse solutions that combine the best of data warehouses and data lakes to simplify data management and provide easy access to all data from their preferred engines.

Amazon SageMaker Lakehouse unifies all your data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and artificial intelligence and machine learning (AI/ML) applications on a single copy of data. SageMaker Lakehouse gives you the flexibility to access and query your data in place with all Apache Iceberg compatible tools and engines. It secures your data in the lakehouse by defining fine-grained permissions, which are consistently applied across all analytics and ML tools and engines. You can bring data from operational databases and applications into your lakehouse in near real time through zero-ETL integrations, and access and query data in place with federated query capabilities across third-party data sources through Amazon Athena.

With SageMaker Lakehouse, you can access tables stored in Amazon Redshift managed storage (RMS) through Iceberg APIs, using the Iceberg REST catalog backed by AWS Glue Data Catalog. This expands your data integration workload across data lakes and data warehouses, enabling seamless access to diverse data sources.

Amazon SageMaker Unified Studio, Amazon EMR 7.5.0 and higher, and AWS Glue 5.0 natively support SageMaker Lakehouse. This post describes how to integrate data on RMS tables through Apache Spark using SageMaker Unified Studio, Amazon EMR 7.5.0 and higher, and AWS Glue 5.0.

How to access RMS tables through Apache Spark on AWS Glue and Amazon EMR

With SageMaker Lakehouse, RMS tables are accessible through the Apache Iceberg REST catalog. Open source engines such as Apache Spark are compatible with Apache Iceberg, and they can interact with RMS tables by configuring this Iceberg REST catalog. You can learn more in Connecting to the Data Catalog using AWS Glue Iceberg REST extension endpoint.

Note that the Iceberg REST extension endpoint is used when you access RMS tables. This endpoint is accessible through the Apache Iceberg AWS Glue Data Catalog extension library, which comes preinstalled on AWS Glue 5.0 and Amazon EMR 7.5.0 or higher. The extension library enables access to RMS tables using the Amazon Redshift connector for Apache Spark.

To access RMS backed catalog databases from Spark, each RMS database requires its own Spark session catalog configuration. Here are the required Spark configurations:

Spark config key                                 Value
spark.sql.catalog.{catalog_name}                 org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.{catalog_name}.type            glue
spark.sql.catalog.{catalog_name}.glue.id         {account_id}:{rms_catalog_name}/{database_name}
spark.sql.catalog.{catalog_name}.client.region   {aws_region}
spark.sql.extensions                             org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

Configuration parameters:

  • {catalog_name}: Your chosen name for referencing the RMS catalog database in your application code
  • {rms_catalog_name}: The RMS catalog name as shown in the AWS Lake Formation catalogs section
  • {database_name}: The RMS database name
  • {aws_region}: The AWS Region where the RMS catalog is located

For a deeper understanding of how the Amazon Redshift hierarchy (databases, schemas, and tables) is mapped to the AWS Glue multilevel catalogs, you can refer to the Bringing Amazon Redshift data into the AWS Glue Data Catalog documentation.
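
As a quick illustration of that mapping, the following sketch lines up the Redshift hierarchy with the Spark catalog configuration, using the catalog, database, and table names created later in this post (the account ID is a placeholder):

# Illustrative mapping only; the names match this post’s walkthrough and
# 123456789012 is a placeholder AWS account ID.
#
# Amazon Redshift:  database "dev" -> schema "salesdb" -> table "store_sales"
# AWS Glue:         catalog "rms-catalog-demo/dev" -> database "salesdb" -> table "store_sales"
# Apache Spark:     spark.sql.catalog.rmscatalog.glue.id = "123456789012:rms-catalog-demo/dev"
#                   and the table is referenced as rmscatalog.salesdb.store_sales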

In the following section, we demonstrate how to access RMS tables through Apache Spark using SageMaker Unified Studio JupyterLab notebooks with the AWS Glue 5.0 runtime and Amazon EMR Serverless.

Although you can bring existing Amazon Redshift tables into the AWS Glue Data Catalog by creating a Lakehouse Redshift catalog from an existing Redshift namespace and providing access to a SageMaker Unified Studio project, in the following example you’ll create a managed Amazon Redshift Lakehouse catalog directly from SageMaker Unified Studio and work with that.

Prerequisites

To follow these instructions, you must have the following prerequisites:

Create a SageMaker Unified Studio project

Complete the following steps to create a SageMaker Unified Studio project:

  1. Sign in to SageMaker Unified Studio.
  2. Choose Select a project on the top menu and choose Create project.
  3. For Project name, enter demo.
  4. For Project profile, choose All capabilities.
  5. Choose Continue.

  6. Leave the default values and choose Continue.
  7. Review the configurations and choose Create project.

Wait for the project to be created. Project creation can take about 5 minutes. When the project status changes to Active, select the project name to access the project’s home page.

  8. Make note of the Project role ARN because you’ll need it for later steps.

You’ve successfully created the project and noted the project role ARN. The next step is to configure a Lakehouse catalog for your RMS.

Configure a Lakehouse catalog for your RMS

Complete the following steps to configure a Lakehouse catalog for your RMS:

  1. In the navigation pane, choose Data.
  2. Choose the + (plus) sign.
  3. Select Create Lakehouse catalog to create a new catalog and choose Next.

  4. For Lakehouse catalog name, enter rms-catalog-demo.
  5. Choose Add catalog.

  6. Wait for the catalog to be created.

  7. In SageMaker Unified Studio, choose Data in the left navigation pane, then select the three vertical dots next to Redshift (Lakehouse) and choose Refresh to make sure the Amazon Redshift compute is active.

Create a new table in the RMS Lakehouse catalog:

  1. In SageMaker Unified Studio, on the top menu, under Build, choose Query Editor.
  2. On the top right, choose Select data source.
  3. For CONNECTIONS, choose Redshift (Lakehouse).
  4. For DATABASES, choose dev@rms-catalog-demo.
  5. For SCHEMAS, choose public.
  6. Choose Choose.

  7. In the query cell, enter and execute the following query to create a new schema:
create schema "dev@rms-catalog-demo".salesdb;

  8. In a new cell, enter and execute the following query to create a new table:
create table salesdb.store_sales (ss_sold_timestamp timestamp, ss_item text, ss_sales_price float);

  9. In a new cell, enter and execute the following query to populate the table with sample data:
insert into salesdb.store_sales values ('2024-12-01T09:00:00Z', 'Product 1', 100.0),
('2024-12-01T11:00:00Z', 'Product 2', 500.0),
('2024-12-01T15:00:00Z', 'Product 3', 20.0),
('2024-12-01T17:00:00Z', 'Product 4', 1000.0),
('2024-12-01T18:00:00Z', 'Product 5', 30.0),
('2024-12-02T10:00:00Z', 'Product 6', 5000.0),
('2024-12-02T16:00:00Z', 'Product 7', 5.0);

  10. In a new cell, enter and run the following query to verify the table contents:
select * from salesdb.store_sales;

(Optional) Create an Amazon EMR Serverless application

IMPORTANT: This section is only required if you also plan to test using Amazon EMR Serverless. If you intend to use AWS Glue exclusively, you can skip this section entirely.

  1. Navigate to the project page. In the left navigation pane, select Compute, then select the Data processing tab and choose Add compute.

  2. Choose Create new compute resources, then choose Next.

  3. Select EMR Serverless.

  4. Specify emr_serverless_application as the Compute name, select Compatibility as the Permission mode, and choose Add compute.

  5. Monitor the deployment progress. Wait for the Amazon EMR Serverless application to complete its deployment. This process can take a minute.
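
If you prefer to script this step instead of using the SageMaker Unified Studio console, the following is a minimal sketch using boto3. Note that it creates a standalone EMR Serverless application that is not attached to your Unified Studio project, so treat it as an illustration of the release label and application type rather than an exact equivalent of the console flow:

import boto3

# A minimal sketch: create an EMR Serverless Spark application on a release
# that supports SageMaker Lakehouse (7.5.0 or higher). Unlike the console flow,
# this application is not associated with the SageMaker Unified Studio project.
emr_serverless = boto3.client("emr-serverless", region_name="us-east-2")

response = emr_serverless.create_application(
    name="emr_serverless_application",  # the compute name used in this walkthrough
    releaseLabel="emr-7.5.0",
    type="SPARK",
)
print(response["applicationId"])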

Access Amazon Redshift Managed Storage tables through Apache Spark

In this section, we demonstrate how to query tables stored in RMS using a SageMaker Unified Studio notebook.

  1. In the navigation pane, choose Data.
  2. Under Lakehouse, select the down arrow next to rms-catalog-demo.
  3. Under dev, select the down arrow next to salesdb, choose store_sales, and choose the three dots.

SageMaker Lakehouse offers multiple analysis options: Query with Athena, Query with Redshift, and Open in Jupyter Lab notebook.

  4. Choose Open in Jupyter Lab notebook.
  5. On the Launcher tab, choose Python 3 (ipykernel).

In SageMaker Unified Studio JupyterLab, you can specify different compute types for each notebook cell. Although this example demonstrates using AWS Glue compute (project.spark.compatibility), the same code can be executed using Amazon EMR Serverless by selecting the appropriate compute in the cell settings. Use the following connection type and compute values when running PySpark code or Spark SQL code with different engines:

  • AWS Glue: for PySpark code, use connection type PySpark with compute project.spark.compatibility; for Spark SQL, use connection type SQL with compute project.spark.compatibility
  • Amazon EMR Serverless: for PySpark code, use connection type PySpark with compute emr-s.emr_serverless_application; for Spark SQL, use connection type SQL with compute emr-s.emr_serverless_application

  6. In the notebook cell’s top left corner, set Connection Type to PySpark and select spark.compatibility (AWS Glue 5.0) as Compute.
  7. Execute the following code to initialize the SparkSession and configure rmscatalog as the session catalog for accessing the dev database under the rms-catalog-demo RMS catalog:
from pyspark.sql import SparkSession

catalog_name = "rmscatalog"
# Replace 123456789012 with your AWS account ID
rms_catalog_id = "123456789012:rms-catalog-demo/dev"

# Replace with your AWS Region
aws_region = "us-east-2"

spark = SparkSession.builder.appName('rms_demo') \
    .config(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') \
    .config(f'spark.sql.catalog.{catalog_name}.type', 'glue') \
    .config(f'spark.sql.catalog.{catalog_name}.glue.id', rms_catalog_id) \
    .config(f'spark.sql.catalog.{catalog_name}.client.region', aws_region) \
    .config('spark.sql.extensions','org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .getOrCreate()

  8. Create a new cell and switch the connection type from PySpark to SQL to execute Spark SQL commands directly.
  9. Enter the following SQL statement to view all tables under salesdb (RMS schema) within rmscatalog:
SHOW TABLES IN rmscatalog.salesdb

  10. In a new SQL cell, enter the following DESCRIBE EXTENDED statement to view detailed information about the store_sales table in the salesdb schema:
DESCRIBE EXTENDED rmscatalog.salesdb.store_sales

In the output, you’ll observe that the Provider is set to iceberg. This indicates that the table is recognized as an Iceberg table, despite being stored in Amazon Redshift managed storage.

  11. In a new SQL cell, enter the following SELECT statement to view the contents of the table:
SELECT * FROM rmscatalog.salesdb.store_sales
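
Beyond SQL cells, you can also work with the table through the Spark DataFrame API in a PySpark cell. The following is a minimal sketch, assuming the SparkSession (spark) and catalog_name from the earlier initialization cell:

# A minimal sketch: read the RMS table as a DataFrame and compute daily sales totals.
# Assumes spark and catalog_name from the initialization cell above.
from pyspark.sql import functions as F

df = spark.table(f"{catalog_name}.salesdb.store_sales")

daily_sales = (
    df.groupBy(F.to_date("ss_sold_timestamp").alias("sale_date"))
      .agg(F.sum("ss_sales_price").alias("total_sales"))
      .orderBy("sale_date")
)
daily_sales.show()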

Throughout this example, we demonstrated how to create a table in Amazon Redshift Serverless and seamlessly query it as an Iceberg table using Apache Spark within a SageMaker Unified Studio notebook.

Clean up

To avoid incurring future charges, clean up all created resources:

  1. Delete the SageMaker Unified Studio project. This step automatically deletes the Amazon EMR compute (for example, the Amazon EMR Serverless application) that was provisioned from the project:
    1. Inside SageMaker Unified Studio, navigate to the demo project’s Project overview section.
    2. Choose Actions, then select Delete project.
    3. Type confirm and choose Delete project.
  2. Delete the created Lakehouse catalog:
    1. Navigate to the Catalogs section of the AWS Lake Formation console.
    2. Select the rms-catalog-demo catalog, choose Actions, then select Delete.
    3. In the confirmation window, type rms-catalog-demo, and then choose Drop.

Conclusion

In this post, we demonstrated how to use Apache Spark to interact with Amazon Redshift Managed Storage tables through Amazon SageMaker Lakehouse using the Iceberg REST catalog. This integration provides a unified view of your data across Amazon S3 data lakes and Amazon Redshift data warehouses, so you can build powerful analytics and AI/ML applications while maintaining a single copy of your data.

For additional workloads and implementations, visit Simplify data access for your enterprise using Amazon SageMaker Lakehouse.


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect with Amazon Web Services (AWS) Analytics services. He’s responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Stefano Sandonà is a Senior Big Data Specialist Solution Architect at Amazon Web Services (AWS). Passionate about data, distributed systems, and security, he helps customers worldwide architect high-performance, efficient, and secure data solutions.

Derek Liu is a Senior Solutions Architect based out of Vancouver, BC. He enjoys helping customers solve big data challenges through Amazon Web Services (AWS) analytic services.

Raj Ramasubbu is a Senior Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with Amazon Web Services (AWS). He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life science, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.

Angel Conde Manjon is a Sr. EMEA Data & AI PSA, based in Madrid. He has previously worked on research related to data analytics and AI in diverse European research projects. In his current role, Angel helps partners develop businesses centered on data and AI.


Appendix: Sample script for Lake Formation FGAC enabled Spark cluster

If you want to access RMS tables from a Lake Formation fine-grained access control (FGAC) enabled Spark cluster on AWS Glue or Amazon EMR, refer to the following code example:

from pyspark.sql import SparkSession

# Placeholder values; replace the account ID, catalog name, and Region with your own
catalog_name = "rmscatalog"
rms_catalog_name = "123456789012:rms-catalog-demo/dev"  # {account_id}:{rms_catalog_name}/{database_name}
account_id = "123456789012"
region = "us-east-2"

spark = SparkSession.builder.appName('rms_demo') \
    .config('spark.sql.defaultCatalog', catalog_name) \
    .config(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') \
    .config(f'spark.sql.catalog.{catalog_name}.type', 'glue') \
    .config(f'spark.sql.catalog.{catalog_name}.glue.id', rms_catalog_name) \
    .config(f'spark.sql.catalog.{catalog_name}.client.region', region) \
    .config(f'spark.sql.catalog.{catalog_name}.glue.account-id', account_id) \
    .config(f'spark.sql.catalog.{catalog_name}.glue.catalog-arn', f'arn:aws:glue:{region}:{rms_catalog_name}') \
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .getOrCreate()
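
Because this configuration sets spark.sql.defaultCatalog to the RMS catalog, you can reference tables without the catalog prefix, for example spark.sql("SELECT * FROM salesdb.store_sales").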
