Monitoring ETL Pipelines

  • Step Functions (dev): Sign in to the AWS Console Step Functions dashboard in the dev AWS account.
    • Use the search bar to find state machines named data-etl-flow-{source-name}-stepFn.
    • Click on your pipeline to view recent executions, statuses, and details.
  • CloudWatch Logs: From a Step Functions execution detail page, scroll down to the "Log output" or "History" section and click the linked log group or log stream. This opens the CloudWatch Logs for that specific pipeline execution, for troubleshooting and deeper inspection.
  • Pulumi Stack: https://app.pulumi.com/cartesianio/data-lake-infra/dev

Monitoring Locations

Step Functions (Pipeline Execution)

State Machine: data-etl-flow-{source-name}-stepFn

Check:

  • Execution status (Running/Succeeded/Failed)
  • Execution history with timestamps
  • Input/output payloads
  • Error details for failed executions
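
The same checks can be scripted from the AWS CLI. A minimal sketch, assuming a hypothetical source name `my-source` and credentials pointed at the dev account:

```shell
# Hypothetical source name; substitute your pipeline's source.
SOURCE="my-source"
SM_NAME="data-etl-flow-${SOURCE}-stepFn"
echo "$SM_NAME"

# Resolve the state machine ARN by name, then list the most recent executions.
SM_ARN=$(aws stepfunctions list-state-machines \
  --query "stateMachines[?name=='${SM_NAME}'].stateMachineArn" --output text)
aws stepfunctions list-executions \
  --state-machine-arn "$SM_ARN" --max-items 5 \
  --query "executions[].{name:name,status:status,start:startDate}"

# Input/output payloads and error details for a single execution:
# aws stepfunctions describe-execution --execution-arn <execution-arn>
```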

CloudWatch Logs

Log Groups:

  • /aws/ecs/data-collector-{source-name} - ECS task logs
  • /aws/emr-serverless/... - EMR job logs
  • data-etl-flow-{source-name}-logs - Step Function execution logs
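
Recent events in any of these groups can be tailed from the CLI without opening the console. A sketch, again assuming a hypothetical source name `my-source`:

```shell
# Hypothetical source name; substitute your pipeline's source.
SOURCE="my-source"
LOG_GROUP="/aws/ecs/data-collector-${SOURCE}"
echo "$LOG_GROUP"

# Print the last 30 minutes of ECS task logs for this source:
aws logs tail "$LOG_GROUP" --since 30m

# The same works for the Step Function log group:
# aws logs tail "data-etl-flow-${SOURCE}-logs" --since 1h
```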

Common Log Streams:

  • ecs/{task-id} - ECS container logs

  • EMR Serverless Spark Job Logs

    • After launching a Spark job via Step Functions or directly from EMR Studio, navigate to the EMR Serverless console.
    • Select the relevant application and find your job run in the Job Runs list.
    • Click into the Job Run. Under "Logs," you'll find direct links to CloudWatch log streams for Driver and Executors.
      • Logs are accessible from the "View logs" link within the EMR Serverless job run details.
      • The Driver log contains Spark driver output, including job orchestration details and errors.
      • Executor logs are also available for deeper debugging.
  • Spark UI (for EMR Serverless Jobs)

    • Each EMR Serverless Spark job exposes a Spark History Server UI for visual inspection of stages, jobs, SQL, and resource usage.
    • In your EMR Serverless Job Run details page (as above), look for the "Monitoring" or "Spark UI" link/button. Click this to open the Spark UI in a new tab.
      • The Spark UI link remains active for a limited time (typically several hours after job completion).
      • If the link is unavailable, you may need to re-run or troubleshoot job permissions/networking.
    • Within the Spark UI, inspect Executors, Stages, and SQL tabs to diagnose performance issues, stage failures, or application bottlenecks.
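
The dashboard link (including the Spark History Server for completed runs) can also be fetched from the CLI. A sketch with placeholder IDs; the real application and job run IDs come from the Step Functions execution output or the EMR Serverless console:

```shell
# Placeholder IDs; take the real values from the Step Functions
# execution output or the EMR Serverless job run details page.
APP_ID="00f0example"
JOB_RUN_ID="00f0examplerun"
echo "${APP_ID}/${JOB_RUN_ID}"

# Returns a pre-signed URL for the job run's monitoring dashboard:
aws emr-serverless get-dashboard-url \
  --application-id "$APP_ID" --job-run-id "$JOB_RUN_ID" \
  --query url --output text
```

Like the console link, the returned URL is time-limited.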

ECS Task Status

Check:

  • lastStatus: RUNNING or STOPPED
  • stoppedReason: error message for stopped tasks
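
Both fields live on the ECS task object and can be read via `aws ecs describe-tasks`. A sketch with hypothetical cluster and task identifiers; take the real task ARN from the Step Functions RunECS step output:

```shell
# Hypothetical cluster name and task ARN; substitute real values
# from the Step Functions RunECS step output.
CLUSTER="data-etl-cluster"
TASK_ARN="arn:aws:ecs:us-east-1:123456789012:task/${CLUSTER}/abc123"
echo "$CLUSTER"

# lastStatus and stoppedReason are fields on the task object:
aws ecs describe-tasks --cluster "$CLUSTER" --tasks "$TASK_ARN" \
  --query "tasks[].{lastStatus:lastStatus,stoppedReason:stoppedReason}"
```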

Manually Triggering a Flow via Lambda (Simulating EventBridge Trigger)

To manually simulate the pipeline trigger (as EventBridge would), you can use the built-in test functionality of the trigger Lambda.

Steps to Trigger Manually

  1. Locate the Lambda Function

    • In the AWS Console, navigate to Lambda.
    • Search for the function named:
      data-etl-flow-{source-name}-trigger
  2. Use the Predefined Test Event

    • Select the Lambda function to open its details page.
    • Go to the "Test" tab.
    • There should already be a test event configured that mirrors the expected EventBridge payload.
    • If not, create a new test event based on the input schema for EventBridge triggers. You can reference a recent EventBridge sample event from CloudWatch Logs if needed.
    • Click on the "Test" button to trigger the pipeline. You can observe its execution in the Step Functions view, and monitor each step in the relevant AWS service (e.g., ECS, EMR Serverless) as it progresses.
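
The same trigger can be fired from the CLI with `aws lambda invoke`. A sketch, assuming a hypothetical source name `my-source` and a minimal placeholder payload; as noted below, the real payload should mirror the pre-configured EventBridge test event:

```shell
# Hypothetical source name; substitute your pipeline's source.
SOURCE="my-source"
FN_NAME="data-etl-flow-${SOURCE}-trigger"
echo "$FN_NAME"

# Minimal placeholder payload; replace with a captured EventBridge event.
printf '{"source":"aws.events","detail":{}}' > /tmp/event.json

aws lambda invoke \
  --function-name "$FN_NAME" \
  --payload file:///tmp/event.json \
  --cli-binary-format raw-in-base64-out \
  /tmp/response.json
cat /tmp/response.json
```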

Notes

  • The recommended approach is to reuse the test event that is (or should be) pre-configured for the Lambda, so that the simulation exactly matches the automated trigger.
  • There is no need to craft a payload manually unless you are customizing for edge cases or debugging with special inputs.

This approach is ideal for quickly verifying that the end-to-end pipeline reacts correctly to event triggers in a controlled and reproducible way.

EMR Serverless Job Status

Check via Step Functions:

  • RunEMRBronze / RunEMRSilver steps
  • Job status in execution output
  • CloudWatch Logs for detailed errors

Common Statuses:

  • SUBMITTED → RUNNING → SUCCESS / FAILED
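
Job run states can also be listed directly from the CLI. A sketch with a placeholder application ID; use `list-applications` first to find the real one:

```shell
# Find the EMR Serverless application for this pipeline:
aws emr-serverless list-applications \
  --query "applications[].{id:id,name:name,state:state}"

# Placeholder application ID; substitute the one found above.
APP_ID="00f0example"
echo "$APP_ID"

aws emr-serverless list-job-runs --application-id "$APP_ID" --max-items 5 \
  --query "jobRuns[].{id:id,state:state,created:createdAt}"
```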

Processing Stages

Bronze → Silver → Gold

  1. Bronze: Raw data ingestion

    • Check S3 bucket: bronze-dl-{id}
    • Table: bronze.{table_name}
  2. Silver: Processed data

    • Check S3 bucket: silver-dl-{id}
    • Table: silver.{table_name}
  3. Gold: Aggregated data

    • Check S3 bucket: gold-dl-{id}
    • Table: gold.{table_name}
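
A quick way to confirm each stage landed data is to list all three layer buckets in one pass. A sketch, assuming a hypothetical deployment id `abc123`; the real bucket names come from `pulumi stack output --stack dev`:

```shell
# Hypothetical deployment id; take the real value from Pulumi stack outputs.
ID="abc123"

for LAYER in bronze silver gold; do
  echo "s3://${LAYER}-dl-${ID}/"
  aws s3 ls "s3://${LAYER}-dl-${ID}/" --recursive --summarize | tail -n 2
done

# Row counts can then be sanity-checked per layer in Athena, e.g.:
#   SELECT COUNT(*) FROM bronze.{table_name};
```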

Common Issues & Solutions

Failed ECS Tasks

Symptoms: Step Function execution stuck at RunECS

Check:

  • CloudWatch Logs for container errors
  • Task definition: aws ecs describe-task-definition --task-definition data-collector-{name}
  • Network/security group issues

EMR Job Failures

Symptoms: Step Function execution fails at RunEMRBronze or RunEMRSilver

Check:

  • EMR Serverless application status
  • S3 source data availability
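
The application state check can be done from the CLI. A sketch with a placeholder application ID; the application must be in a state that accepts job runs (STARTED, or CREATED/STOPPED if auto-start is enabled):

```shell
# Placeholder application ID; find the real one via list-applications.
APP_ID="00f0example"
echo "$APP_ID"

aws emr-serverless get-application --application-id "$APP_ID" \
  --query "application.{name:name,state:state}"
```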

Data Flow Issues

Symptoms: Bronze succeeds but Silver/Gold fails

Check:

  • S3 bucket contents: aws s3 ls s3://{bucket}/{path}/
  • Athena table queries: SELECT COUNT(*) FROM bronze.{table}
  • Date path state files: s3://{bucket}/state/last-run-bronze.json
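
The state file can be printed straight to the terminal without downloading it. A sketch with a hypothetical bucket name; take the real one from the Pulumi stack outputs:

```shell
# Hypothetical bucket name; take the real value from Pulumi stack outputs.
BUCKET="bronze-dl-abc123"
echo "$BUCKET"

# Stream the last-run state file (drives date-path selection) to stdout:
aws s3 cp "s3://${BUCKET}/state/last-run-bronze.json" - | head
```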

References

  • Pipeline code: packages/data-lake/data-lake-infra/src/running-flow/dataEtlFlow.ts
  • Bronze infra: packages/data-lake/data-lake-infra/src/bronze/bronzeInfra.ts
  • Silver infra: packages/data-lake/data-lake-infra/src/silver/silverInfra.ts
  • Pulumi outputs: pulumi stack output --stack dev