Monitoring ETL Pipelines
Quick Links
- Step Functions (dev): Sign in to the AWS Console Step Functions dashboard in the dev AWS account.
- Use the search bar to find state machines named data-etl-flow-{source-name}-stepFn.
- Click on your pipeline to view recent executions, statuses, and details.
- CloudWatch Logs: From your Step Functions execution detail page, scroll down to the "Log output" or "History" section and click the linked log group or log stream. This takes you directly to the CloudWatch Logs for that specific execution for troubleshooting and deeper inspection.
- Pulumi Stack:
https://app.pulumi.com/cartesianio/data-lake-infra/dev
Monitoring Locations
Step Functions (Pipeline Execution)
State Machine: data-etl-flow-{source-name}-stepFn
Check:
- Execution status (Running/Succeeded/Failed)
- Execution history with timestamps
- Input/output payloads
- Error details for failed executions
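The checks above can also be run from the CLI. A minimal sketch, assuming placeholder values for the region, account ID, and source name (replace them with real ones for the dev account):

```shell
# Compose the state machine ARN for a pipeline and list its recent executions.
# Region, account ID, and source name below are placeholders (assumptions).

sfn_arn() {
  local region="$1" account="$2" source="$3"
  echo "arn:aws:states:${region}:${account}:stateMachine:data-etl-flow-${source}-stepFn"
}

ARN=$(sfn_arn eu-west-1 123456789012 my-source)
echo "$ARN"

# Uncomment with dev-account credentials configured:
# aws stepfunctions list-executions --state-machine-arn "$ARN" \
#   --max-results 10 \
#   --query 'executions[].{name:name,status:status,started:startDate}'
```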
CloudWatch Logs
Log Groups:
- /aws/ecs/data-collector-{source-name} - ECS task logs
- /aws/emr-serverless/... - EMR job logs
- data-etl-flow-{source-name}-logs - Step Function execution logs
Common Log Streams:
- ecs/{task-id} - ECS container logs
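For a quick look without the console, the log groups above can be tailed with the AWS CLI (v2). A sketch, assuming "my-source" is a placeholder source name:

```shell
# Build the ECS collector log group name for a source and show how to tail it.
# "my-source" is a placeholder (assumption) - substitute your real source name.

log_group() {
  echo "/aws/ecs/data-collector-$1"
}

GROUP=$(log_group my-source)
echo "$GROUP"

# Uncomment with dev-account credentials configured:
# aws logs tail "$GROUP" --since 1h --follow
# Or search recent streams for errors:
# aws logs filter-log-events --log-group-name "$GROUP" --filter-pattern "ERROR"
```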
EMR Serverless Spark Job Logs
- After launching a Spark job via Step Functions or directly in EMR Studio, navigate to the AWS EMR Serverless console.
- Select the relevant application and find your job run in the Job Runs list.
- Click into the Job Run. Under "Logs," you'll find direct links to CloudWatch log streams for Driver and Executors.
- Logs are accessible from the "View logs" link within the EMR Serverless job run details.
- The Driver log contains Spark driver output, including job orchestration details and errors.
- Executor logs are also available for deeper debugging.
Spark UI (for EMR Serverless Jobs)
- Each EMR Serverless Spark job exposes a Spark History Server UI for visual inspection of stages, jobs, SQL, and resource usage.
- In your EMR Serverless Job Run details page (as above), look for the "Monitoring" or "Spark UI" link/button. Click this to open the Spark UI in a new tab.
- The Spark UI link remains active for a limited time (typically several hours after job completion).
- If the link is unavailable, you may need to re-run or troubleshoot job permissions/networking.
- Within the Spark UI, inspect Executors, Stages, and SQL tabs to diagnose performance issues, stage failures, or application bottlenecks.
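The time-limited Spark UI link can also be fetched from the CLI via `get-dashboard-for-job-run`. A sketch, where the application and job run IDs are placeholders you copy from the console or the Step Functions execution output:

```shell
# Fetch the Spark UI / dashboard URL for an EMR Serverless job run.
# APP_ID and RUN_ID are placeholders (assumptions).

APP_ID="00example-app-id"
RUN_ID="00example-run-id"
echo "app=${APP_ID} run=${RUN_ID}"

# Uncomment with dev-account credentials configured:
# aws emr-serverless get-dashboard-for-job-run \
#   --application-id "$APP_ID" --job-run-id "$RUN_ID" \
#   --query 'url' --output text
```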
ECS Tasks
Check:
- lastStatus: RUNNING or STOPPED
- stoppedReason: error message for stopped tasks
Manually Triggering a Flow via Lambda (Simulating EventBridge Trigger)
To manually simulate the pipeline trigger (as EventBridge would), you can use the built-in test functionality of the trigger Lambda.
Steps to Trigger Manually
- Locate the Lambda Function
  - In the AWS Console, navigate to Lambda.
  - Search for the function named: data-etl-flow-{source-name}-trigger
- Use the Predefined Test Event
  - Select the Lambda function to open its details page.
  - Go to the "Test" tab.
  - There should already be a test event configured that mirrors the expected EventBridge payload.
  - If not, create a new test event based on the input schema for EventBridge triggers; you can reference a recent EventBridge sample event from CloudWatch Logs if needed.
  - Click the "Test" button to trigger the pipeline, then observe its execution in the Step Functions view and monitor each step in the relevant AWS service (e.g., ECS, EMR Serverless) as it progresses.
Notes
- Prefer re-using the test event that is (or should be) pre-configured for the Lambda, so that the simulation exactly matches the automated trigger.
- There is no need to craft a payload manually unless you are customizing for edge cases or debugging with special inputs.
This approach is ideal for quickly verifying that the end-to-end pipeline reacts correctly to event triggers in a controlled and reproducible way.
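The same simulation can be scripted with `aws lambda invoke`. A sketch, where the function name and the EventBridge-shaped payload are assumptions - match the payload to the pre-configured test event for your pipeline:

```shell
# Invoke the trigger Lambda from the CLI instead of the console "Test" tab.
# FN and the payload shape are placeholders (assumptions).

FN="data-etl-flow-my-source-trigger"

PAYLOAD=$(cat <<'EOF'
{"version":"0","source":"aws.events","detail-type":"Scheduled Event","detail":{}}
EOF
)
echo "$PAYLOAD"

# Uncomment with dev-account credentials configured:
# aws lambda invoke --function-name "$FN" \
#   --cli-binary-format raw-in-base64-out \
#   --payload "$PAYLOAD" /dev/stdout
```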
EMR Serverless Job Status
Check via Step Functions:
- RunEMRBronze / RunEMRSilver steps - job status in execution output
- CloudWatch Logs for detailed errors
Common Statuses:
SUBMITTED → RUNNING → SUCCESS / FAILED
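The status sequence above can drive a simple polling loop. A sketch, treating SUCCESS, FAILED, and CANCELLED as terminal states (the application and run IDs in the commented command are placeholders):

```shell
# Helper: decide whether an EMR Serverless job run state is terminal.
is_terminal() {
  case "$1" in
    SUCCESS|FAILED|CANCELLED) return 0 ;;
    *) return 1 ;;
  esac
}

is_terminal RUNNING || echo "RUNNING: keep polling"
is_terminal FAILED && echo "FAILED: check CloudWatch Logs"

# Uncomment with dev-account credentials to poll a real run:
# while state=$(aws emr-serverless get-job-run \
#     --application-id "$APP_ID" --job-run-id "$RUN_ID" \
#     --query 'jobRun.state' --output text) && ! is_terminal "$state"; do
#   echo "state=$state"; sleep 30
# done
```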
Processing Stages
Bronze → Silver → Gold
- Bronze: Raw data ingestion
  - Check S3 bucket: bronze-dl-{id}
  - Table: bronze.{table_name}
- Silver: Processed data
  - Check S3 bucket: silver-dl-{id}
  - Table: silver.{table_name}
- Gold: Aggregated data
  - Check S3 bucket: gold-dl-{id}
  - Table: gold.{table_name}
Common Issues & Solutions
Failed ECS Tasks
Symptoms: Step Function execution stuck at RunECS
Check:
- CloudWatch Logs for container errors
- Task definition: aws ecs describe-task-definition --task-definition data-collector-{name}
- Network/security group issues
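When an execution sticks at RunECS, the stopped task's lastStatus and stoppedReason usually explain why. A sketch of the lookup, where the cluster and task family names are assumptions - match them to your deployment:

```shell
# Look up lastStatus / stoppedReason for the collector's most recent stopped task.
# CLUSTER and FAMILY are placeholders (assumptions).

CLUSTER="data-etl-cluster"
FAMILY="data-collector-my-source"
echo "cluster=${CLUSTER} family=${FAMILY}"

# Uncomment with dev-account credentials configured:
# TASK_ARN=$(aws ecs list-tasks --cluster "$CLUSTER" --family "$FAMILY" \
#   --desired-status STOPPED --query 'taskArns[0]' --output text)
# aws ecs describe-tasks --cluster "$CLUSTER" --tasks "$TASK_ARN" \
#   --query 'tasks[0].{lastStatus:lastStatus,stoppedReason:stoppedReason}'
```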
EMR Job Failures
Symptoms: Step Function execution fails at RunEMRBronze or RunEMRSilver
Check:
- EMR Serverless application status
- S3 source data availability
Data Flow Issues
Symptoms: Bronze succeeds but Silver/Gold fails
Check:
- S3 bucket contents: aws s3 ls s3://{bucket}/{path}/
- Athena table queries: SELECT COUNT(*) FROM bronze.{table}
- Date path state files: s3://{bucket}/state/last-run-bronze.json
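The checks above can be chained into one quick triage pass. A sketch for the Bronze layer, where the bucket ID, table name, and Athena workgroup are placeholders:

```shell
# Triage Bronze data flow: list objects, print the state file, count rows.
# BUCKET, the table name, and the Athena workgroup are placeholders (assumptions).

BUCKET="bronze-dl-abc123"
STATE_KEY="state/last-run-bronze.json"
echo "s3://${BUCKET}/${STATE_KEY}"

# Uncomment with dev-account credentials configured:
# aws s3 ls "s3://${BUCKET}/" --recursive | tail -20
# aws s3 cp "s3://${BUCKET}/${STATE_KEY}" -    # print the last-run state file
# aws athena start-query-execution \
#   --query-string 'SELECT COUNT(*) FROM bronze.my_table' \
#   --work-group primary
```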
References
- Pipeline code: packages/data-lake/data-lake-infra/src/running-flow/dataEtlFlow.ts
- Bronze infra: packages/data-lake/data-lake-infra/src/bronze/bronzeInfra.ts
- Silver infra: packages/data-lake/data-lake-infra/src/silver/silverInfra.ts
- Pulumi outputs: pulumi stack output --stack dev