This project demonstrates an end-to-end ETL (Extract, Transform, Load) pipeline built with AWS Glue, S3, Step Functions, SNS, and Athena. It automates a workflow that extracts raw data from an S3 bucket, transforms it with a Glue PySpark job, loads the transformed data back into S3, catalogs it with Glue Crawlers, and makes it queryable with Athena. An SNS notification is sent when the workflow completes.
- ETL Workflow with AWS Glue: Perform data extraction, transformation, and loading with PySpark scripts.
- Separate Databases: Maintain separate Glue databases for raw and transformed data for better organization.
- Automated Cataloging: Use Glue Crawlers to update the Glue Data Catalog dynamically.
- Query with Athena: Query the transformed data using standard SQL in Amazon Athena.
- Step Functions Orchestration: Orchestrate the entire workflow seamlessly using AWS Step Functions.
- Notifications via SNS: Get email notifications upon the successful completion of the ETL process.
- Data Extraction: Fetch raw data from an S3 bucket.
- Data Transformation: Process and transform the data using a Glue PySpark script (a minimal sketch follows this list).
- Data Loading: Load the transformed data back into another S3 bucket.
- Cataloging: Use Glue Crawlers to catalog the raw and transformed data into separate Glue databases.
- Querying: Query the transformed data via Athena for analytics.
- Notifications: Send an SNS email notification when the process completes.
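A minimal sketch of the transformation script, assuming CSV input in `raw-data-bucket` and Parquet output to `transformed-data-bucket` (the bucket names and the drop-null transform are placeholders for your own logic):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import DropNullFields
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read raw CSV files from the raw bucket.
raw = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://raw-data-bucket/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Transform: drop fields that are entirely null (stand-in for real business logic).
transformed = DropNullFields.apply(frame=raw)

# Load: write the result to the transformed bucket as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="s3",
    connection_options={"path": "s3://transformed-data-bucket/"},
    format="parquet",
)

job.commit()
```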
- AWS Glue: For ETL processing and cataloging.
- Amazon S3: For data storage.
- AWS Step Functions: For orchestrating the ETL workflow.
- Amazon SNS: For sending email notifications.
- Amazon Athena: For querying the cataloged data.
- PySpark: For data transformation.
- An AWS account with access to Glue, S3, Step Functions, SNS, and Athena.
- Python knowledge for understanding the Glue PySpark scripts.
- S3 buckets:
  - One for raw data.
  - One for transformed data.
- Glue databases:
  - One for raw data.
  - One for transformed data.
- SNS topic created and subscribed with a valid email for notifications.
- Create two S3 buckets:
  - `raw-data-bucket`: to store raw data.
  - `transformed-data-bucket`: to store transformed data.
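A sketch of creating the buckets with boto3 (the bucket names above are placeholders and must be globally unique; the region is an assumption):

```python
import boto3

REGION = "us-east-2"  # assumed region; adjust to your account
s3 = boto3.client("s3", region_name=REGION)

for bucket in ("raw-data-bucket", "transformed-data-bucket"):
    s3.create_bucket(
        Bucket=bucket,
        # Omit CreateBucketConfiguration entirely when the region is us-east-1.
        CreateBucketConfiguration={"LocationConstraint": REGION},
    )
```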
- Create two Glue databases:
  - `raw_data_db`: for cataloging raw data.
  - `transformed_data_db`: for cataloging transformed data.
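A sketch of creating both databases with boto3:

```python
import boto3

glue = boto3.client("glue")
for name in ("raw_data_db", "transformed_data_db"):
    glue.create_database(DatabaseInput={"Name": name})
```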
- Create a Glue job with a PySpark script for data transformation.
- Create an IAM role for the job with Glue permissions and read/write access to both S3 buckets.
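A sketch of registering the job with boto3, assuming the IAM role from the previous step and a placeholder script location (upload the PySpark script there first):

```python
import boto3

glue = boto3.client("glue")
glue.create_job(
    Name="etl-transform-job",  # hypothetical job name
    Role="arn:aws:iam::123456789012:role/GlueETLRole",  # placeholder role ARN
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://your-scripts-bucket/transform.py",  # placeholder path
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
)
```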
- Set up two Glue Crawlers:
  - Crawler 1: to catalog raw data into `raw_data_db`.
  - Crawler 2: to catalog transformed data into `transformed_data_db`.
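A sketch of creating both crawlers with boto3; the crawler names and role ARN are assumptions:

```python
import boto3

glue = boto3.client("glue")
crawlers = [
    ("raw-data-crawler", "raw_data_db", "s3://raw-data-bucket/"),
    ("transformed-data-crawler", "transformed_data_db", "s3://transformed-data-bucket/"),
]
for name, database, path in crawlers:
    glue.create_crawler(
        Name=name,
        Role="arn:aws:iam::123456789012:role/GlueETLRole",  # placeholder role ARN
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": path}]},
    )
```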
- Design the Step Functions workflow to automate the process (a sketch of the state machine follows these steps):
- Step 1: Trigger Glue ETL job.
- Step 2: Start the Glue Crawlers.
- Step 3: Notify completion with SNS.
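A sketch of the state machine created with boto3, reusing the hypothetical job, crawler, and topic names from this guide. The `.sync` suffix makes Step Functions wait for the Glue job to finish; `StartCrawler` has no `.sync` integration, so it is started without waiting here:

```python
import json

import boto3

definition = {
    "Comment": "Glue ETL job -> crawler -> SNS notification",
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",  # waits for the job
            "Parameters": {"JobName": "etl-transform-job"},
            "Next": "StartCrawler",
        },
        "StartCrawler": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
            "Parameters": {"Name": "transformed-data-crawler"},
            "Next": "Notify",
        },
        "Notify": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-2:123456789012:etl-notifications",  # placeholder
                "Message": "ETL workflow completed successfully.",
            },
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="etl-workflow",  # hypothetical name
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsETLRole",  # placeholder
)
```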
- Create an SNS topic and subscribe your email address to receive notifications.
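A sketch with boto3; the topic name and email address are placeholders:

```python
import boto3

sns = boto3.client("sns")
topic_arn = sns.create_topic(Name="etl-notifications")["TopicArn"]
# The subscription must be confirmed from the email inbox before messages arrive.
sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="you@example.com")
```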
- Configure Athena to query the data using the Glue Data Catalog.
- Set the default database to `transformed_data_db`.
- Place the raw data in the designated S3 bucket (`raw-data-bucket`).
- Trigger the Step Functions workflow manually or via an event (see the trigger sketch after this list).
- Monitor the workflow in the AWS Management Console.
- Once the workflow completes, check your email for the SNS notification.
- Use Athena to run queries on the transformed data.
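A sketch of triggering the workflow manually with boto3; the state machine ARN is a placeholder (copy the real one from the Step Functions console):

```python
import boto3

sfn = boto3.client("stepfunctions")
sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-2:123456789012:stateMachine:etl-workflow"
)
```

And a sketch of running an Athena query over the transformed data, assuming the crawler created a table named `transformed_data` (the actual table name depends on your S3 prefix):

```python
import time

import boto3

athena = boto3.client("athena")
execution_id = athena.start_query_execution(
    QueryString="SELECT * FROM transformed_data LIMIT 10",
    QueryExecutionContext={"Database": "transformed_data_db"},
    ResultConfiguration={"OutputLocation": "s3://transformed-data-bucket/athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=execution_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=execution_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```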