In an era where data drives innovation and decision-making, organizations are increasingly focused on not only accumulating data but on maintaining its quality and reliability. High-quality data is essential for building trust in analytics, enhancing the performance of machine learning (ML) models, and supporting strategic business initiatives.
By using AWS Glue Data Quality, you can measure and monitor the quality of your data. It analyzes your data, recommends data quality rules, evaluates data quality, and provides you with a score that quantifies the quality of your data, so you can make confident business decisions. With this launch, AWS Glue Data Quality is now integrated with the lakehouse architecture of Amazon SageMaker, Apache Iceberg on general purpose Amazon Simple Storage Service (Amazon S3) buckets, and Amazon S3 Tables. This integration brings together serverless data integration, quality management, and advanced ML capabilities in a unified environment.
This post explores how you can use AWS Glue Data Quality to maintain data quality of S3 Tables and Apache Iceberg tables on general purpose S3 buckets. We’ll discuss strategies for verifying the quality of published data and how these integrated technologies can be used to implement effective data quality workflows.
Solution overview
In this launch, we’re supporting the lakehouse architecture of Amazon SageMaker, Apache Iceberg on general purpose S3 buckets, and Amazon S3 Tables. As example use cases, we demonstrate data quality on an Apache Iceberg table stored in a general purpose S3 bucket as well as on Amazon S3 Tables. The steps will cover the following:
- Create an Apache Iceberg table on a general purpose Amazon S3 bucket and an Amazon S3 table in a table bucket using two AWS Glue extract, transform, and load (ETL) jobs
- Grant appropriate AWS Lake Formation permissions on each table
- Run data quality recommendations at rest on the Apache Iceberg table on general purpose S3 bucket
- Run the data quality rules and visualize the results in Amazon SageMaker Unified Studio
- Run data quality recommendations at rest on the S3 table
- Run the data quality rules and visualize the results in SageMaker Unified Studio
The following diagram is the solution architecture.
Prerequisites
To implement the instructions, you must have the following prerequisites:
Create S3 tables and Apache Iceberg tables on a general purpose S3 bucket
First, complete the following steps to upload data and scripts:
- Upload the attached AWS Glue job scripts to your designated script bucket in S3
- To download the New York City Taxi – Yellow Trip Data dataset for January 2025 (Parquet file), navigate to NYC TLC Trip Record Data, expand 2025, and choose Yellow Taxi Trip records under the January section. A file called `yellow_tripdata_2025-01.parquet` will be downloaded to your computer.
- On the Amazon S3 console, open an input bucket of your choice and create a folder called `nyc_yellow_trip_data`. The stack will create a `GlueJobRole` with permissions to this bucket.
- Upload the `yellow_tripdata_2025-01.parquet` file to the folder.
- Download the CloudFormation stack file. Navigate to the CloudFormation console. Choose Create stack. Choose Upload a template file and select the CloudFormation template you downloaded. Choose Next.
- Enter a unique name for Stack name.
- Configure the stack parameters. Default values are provided in the following table:
| Parameter | Default value | Description |
| --- | --- | --- |
| `ScriptBucketName` | N/A – user-supplied | Name of the Amazon S3 general purpose bucket containing the AWS Glue job scripts |
| `DatabaseName` | `iceberg_dq_demo` | Name of the AWS Glue database to be created for the Apache Iceberg table on a general purpose S3 bucket |
| `GlueIcebergJobName` | `create_iceberg_table_on_s3` | Name of the AWS Glue job that creates the Apache Iceberg table on a general purpose S3 bucket |
| `GlueS3TableJobName` | `create_s3_table_on_s3_bucket` | Name of the AWS Glue job that creates the Amazon S3 table |
| `S3TableBucketName` | `dataquality-demo-bucket` | Name of the Amazon S3 table bucket to be created |
| `S3TableNamespaceName` | `s3_table_dq_demo` | Name of the Amazon S3 table bucket namespace to be created |
| `S3TableTableName` | `ny_taxi` | Name of the Amazon S3 table to be created by the AWS Glue job |
| `IcebergTableName` | `ny_taxi` | Name of the Apache Iceberg table on a general purpose S3 bucket to be created by the AWS Glue job |
| `IcebergScriptPath` | `scripts/create_iceberg_table_on_s3.py` | Amazon S3 path to the AWS Glue script for the Apache Iceberg table creation job. Verify the file name matches the corresponding `GlueIcebergJobName` |
| `S3TableScriptPath` | `scripts/create_s3_table_on_s3_bucket.py` | Amazon S3 path to the AWS Glue script for the Amazon S3 table creation job. Verify the file name matches the corresponding `GlueS3TableJobName` |
| `InputS3Bucket` | N/A – user-supplied | Name of the Amazon S3 bucket to which the NY Taxi data was uploaded |
| `InputS3Path` | `nyc_yellow_trip_data` | Amazon S3 path to which the NY Taxi data was uploaded |
| `OutputBucketName` | N/A – user-supplied | Name of the Amazon S3 general purpose bucket created for the AWS Glue job's Apache Iceberg table data |
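If you prefer to script the deployment instead of using the console, the parameters in the table can be passed to CloudFormation's `create_stack` API (via boto3 or the AWS CLI). The following is a minimal sketch that only assembles the parameter list; the three bucket names are placeholders you must replace with your own.

```python
import json

# Stack parameters from the table above. The bucket names are
# placeholders -- replace them with your own buckets.
stack_parameters = {
    "ScriptBucketName": "my-script-bucket",        # placeholder
    "DatabaseName": "iceberg_dq_demo",
    "GlueIcebergJobName": "create_iceberg_table_on_s3",
    "GlueS3TableJobName": "create_s3_table_on_s3_bucket",
    "S3TableBucketName": "dataquality-demo-bucket",
    "S3TableNamespaceName": "s3_table_dq_demo",
    "S3TableTableName": "ny_taxi",
    "IcebergTableName": "ny_taxi",
    "IcebergScriptPath": "scripts/create_iceberg_table_on_s3.py",
    "S3TableScriptPath": "scripts/create_s3_table_on_s3_bucket.py",
    "InputS3Bucket": "my-input-bucket",            # placeholder
    "InputS3Path": "nyc_yellow_trip_data",
    "OutputBucketName": "my-output-bucket",        # placeholder
}

def to_cfn_parameters(params):
    """Convert a plain dict into the Parameters list shape that
    CloudFormation's create_stack API expects."""
    return [
        {"ParameterKey": k, "ParameterValue": v} for k, v in params.items()
    ]

print(json.dumps(to_cfn_parameters(stack_parameters), indent=2))
```

You would pass this list as the `Parameters` argument of `boto3.client("cloudformation").create_stack(...)`, alongside the stack name and the downloaded template.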
Complete the following steps to configure AWS Identity and Access Management (IAM) and Lake Formation permissions:
- If you haven’t previously worked with S3 Tables and analytics services, navigate to Amazon S3.
- Choose Table buckets.
- Choose Enable integration to enable analytics service integrations with your S3 table buckets.
- Navigate to the Resources tab for your AWS CloudFormation stack. Note the IAM role with the logical ID `GlueJobRole` and the database name with the logical ID `GlueDatabase`. Additionally, note the name of the S3 table bucket with the logical ID `S3TableBucket` as well as the namespace name with the logical ID `S3TableBucketNamespace`. The S3 table bucket name is the portion of the Amazon Resource Name (ARN) that follows `arn:aws:s3tables:{region}:{account-id}:bucket/`. The namespace name is the portion of the namespace ARN that follows `arn:aws:s3tables:{region}:{account-id}:bucket/{S3 table bucket name}|`.
- Navigate to the Lake Formation console as a Lake Formation data lake administrator.
- Navigate to the Databases tab and select your `GlueDatabase`. The selected default catalog should match your AWS account ID.
- Select the Actions dropdown menu and under Permissions, choose Grant.
- Grant your `GlueJobRole` from step 4 the necessary permissions. Under Database permissions, select Create table and Describe, as shown in the following screenshot.
- Navigate back to the Databases tab in Lake Formation and select the catalog that matches the value of `S3TableBucket` you noted in step 4, in the format `{AWS account ID}:s3tablescatalog/{S3TableBucketName}`.
- Select your namespace name. From the Actions dropdown menu, under Permissions, choose Grant.
- Grant your `GlueJobRole` from step 4 the necessary permissions. Under Database permissions, select Create table and Describe, as shown in the following screenshot.
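The console grants above correspond to Lake Formation `GrantPermissions` API calls. As a rough sketch, the code below only assembles the request payloads (the account ID, role ARN, and names are placeholders); each payload would be passed to `boto3.client("lakeformation").grant_permissions(**payload)`.

```python
def database_grant(principal_arn, catalog_id, database_name):
    """Build a Lake Formation GrantPermissions request that gives a
    principal CREATE_TABLE and DESCRIBE on a database. An S3 table
    namespace is also modeled as a database, just in the table bucket
    catalog rather than the default account catalog."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "Database": {
                "CatalogId": catalog_id,
                "Name": database_name,
            }
        },
        "Permissions": ["CREATE_TABLE", "DESCRIBE"],
    }

# Placeholder values -- substitute your own account ID and role ARN.
account_id = "111122223333"
role_arn = "arn:aws:iam::111122223333:role/GlueJobRole"

# Grant on the Glue database in the default account catalog.
glue_db_grant = database_grant(role_arn, account_id, "iceberg_dq_demo")

# Grant on the S3 table namespace in the table bucket catalog.
s3_ns_grant = database_grant(
    role_arn,
    f"{account_id}:s3tablescatalog/dataquality-demo-bucket",
    "s3_table_dq_demo",
)
```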
To run the jobs created in the CloudFormation stack to create the sample tables, and to configure Lake Formation permissions for the `DataQualityRole`, complete the following steps:
- In the Resources tab of your CloudFormation stack, note the AWS Glue job names for the logical resource IDs `GlueS3TableJob` and `GlueIcebergJob`.
- Navigate to the AWS Glue console and select ETL jobs. Select your `GlueIcebergJob` from the previous step and choose Run job. Select your `GlueS3TableJob` and choose Run job.
- To verify the successful creation of your Apache Iceberg table on the general purpose S3 bucket, navigate to Lake Formation with your Lake Formation data lake administrator permissions. Under Databases, select your `GlueDatabase`. The selected default catalog should match your AWS account ID.
- On the dropdown menu, choose View and then Tables. You should see a new tab with the table name you specified for `IcebergTableName`. You have verified the table creation.
- Select this table and grant your `DataQualityRole` the necessary Lake Formation permissions by choosing the Grant link in the Actions tab. Choose Select and Describe from Table permissions for the new Apache Iceberg table.
- To verify the S3 table in the S3 table bucket, navigate to Databases in the Lake Formation console with your Lake Formation data lake administrator permissions. Make sure the selected catalog is your S3 table bucket catalog: `{AWS account ID}:s3tablescatalog/{S3TableBucketName}`.
- Select your S3 table namespace and choose the dropdown menu View.
- Choose Tables and you should see a new tab with the table name you specified for `S3TableTableName`. You have verified the table creation.
- Choose the link for the table and under Actions, choose Grant. Grant your `DataQualityRole` the necessary Lake Formation permissions. Choose Select and Describe from Table permissions for the S3 table.
- In the Lake Formation console with your Lake Formation data lake administrator permissions, on the Administration tab, choose Data lake locations.
- Choose Register location. Input your `OutputBucketName` as the Amazon S3 path. Input the `LakeFormationRole` from the stack resources as the IAM role. Under Permission mode, choose Lake Formation.
- On the Lake Formation console under Application integration settings, select Allow external engines to access data in Amazon S3 locations with full table access, as shown in the following screenshot.
Generate recommendations for Apache Iceberg table on general purpose S3 bucket managed by Lake Formation
In this section, we show how to generate data quality rules using the data quality rule recommendations feature of AWS Glue Data Quality for your Apache Iceberg table on a general purpose S3 bucket. Follow these steps:
- Navigate to the AWS Glue console. Under Data Catalog, choose Databases. Choose the `GlueDatabase`.
- Under Tables, select your `IcebergTableName`. On the Data quality tab, choose Run history.
- Under Recommendation runs, choose Recommend rules.
- Use the `DataQualityRole` to generate data quality rule recommendations, leaving the other settings as default. The results are shown in the following screenshot.
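The same recommendation run can be started programmatically with Glue's `StartDataQualityRuleRecommendationRun` API. The following is a minimal sketch of the request payload only (the role ARN is a placeholder); you would pass it as `glue.start_data_quality_rule_recommendation_run(**request)`.

```python
def recommendation_run_request(database_name, table_name, role_arn, catalog_id=None):
    """Request payload for Glue's StartDataQualityRuleRecommendationRun,
    which profiles the table and proposes DQDL rules."""
    glue_table = {"DatabaseName": database_name, "TableName": table_name}
    if catalog_id:
        # Needed when the table lives outside the default account
        # catalog, e.g. in an S3 table bucket catalog.
        glue_table["CatalogId"] = catalog_id
    return {
        "DataSource": {"GlueTable": glue_table},
        "Role": role_arn,
    }

request = recommendation_run_request(
    "iceberg_dq_demo",
    "ny_taxi",
    "arn:aws:iam::111122223333:role/DataQualityRole",  # placeholder ARN
)
```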
Run data quality rules for Apache Iceberg table on general purpose S3 bucket managed by Lake Formation
In this section, we show how to create a data quality ruleset with the recommended rules. After creating the ruleset, we run the data quality rules. Follow these steps:
- Copy the resulting rules from your recommendation run by selecting the dq-run ID and choosing Copy.
- Navigate back to the table under the Data quality tab and choose Create data quality rules. Paste the ruleset from step 1 here. Choose Save ruleset, as shown in the following screenshot.
- After saving your ruleset, navigate back to the Data quality tab for your Apache Iceberg table on the general purpose S3 bucket. Select the ruleset you created. To run the data quality evaluation on the ruleset using your data quality role, choose Run, as shown in the following screenshot.
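The ruleset creation and evaluation steps above can also be scripted with Glue's `CreateDataQualityRuleset` and `StartDataQualityRulesetEvaluationRun` APIs. In this sketch the two DQDL rules stand in for whatever your recommendation run actually produced, and the role ARN and ruleset name are placeholders.

```python
# Example DQDL ruleset -- substitute the rules copied from your
# recommendation run.
ruleset_dqdl = 'Rules = [ RowCount > 0, IsComplete "vendorid" ]'

def create_ruleset_request(name, dqdl, database_name, table_name):
    """Payload for glue.create_data_quality_ruleset: stores the DQDL
    rules and pins them to a target table."""
    return {
        "Name": name,
        "Ruleset": dqdl,
        "TargetTable": {"DatabaseName": database_name, "TableName": table_name},
    }

def evaluation_run_request(database_name, table_name, role_arn, ruleset_names):
    """Payload for glue.start_data_quality_ruleset_evaluation_run, which
    evaluates the named rulesets against the table and produces a score."""
    return {
        "DataSource": {
            "GlueTable": {"DatabaseName": database_name, "TableName": table_name}
        },
        "Role": role_arn,
        "RulesetNames": ruleset_names,
    }

create_req = create_ruleset_request(
    "iceberg_ny_taxi_ruleset", ruleset_dqdl, "iceberg_dq_demo", "ny_taxi"
)
eval_req = evaluation_run_request(
    "iceberg_dq_demo",
    "ny_taxi",
    "arn:aws:iam::111122223333:role/DataQualityRole",  # placeholder ARN
    ["iceberg_ny_taxi_ruleset"],
)
```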
Generate recommendations for the S3 table on the S3 table bucket
In this section, we show how to use the AWS Command Line Interface (AWS CLI) to generate recommendations for your S3 table on the S3 table bucket. This will also create a data quality ruleset for the S3 table. Follow these steps:
- Fill in your S3 table namespace name, S3 table name, catalog ID, and data quality role ARN in the following JSON file and save it locally: