Measuring Scheduling Latency in Dagster on GKE

Last updated: November 4, 2025

Problem Description

  Users need to measure scheduling latency in Dagster deployments on GKE, specifically tracking the time between when a Kubernetes job/pod could start and when it actually starts. This includes understanding how to access and query Dagster event logs for performance analysis.

Symptoms

  • Need to track worker pod pending time due to insufficient Kubernetes resources

  • Requirement to perform statistical analysis across multiple jobs and deployments

  • Uncertainty about where Dagster event logs are stored and how to access them programmatically

Root Cause

  Dagster provides detailed event logging for job execution phases, but users need to understand which specific events to monitor and how to access the logs programmatically for analysis.

Solution

  Use Dagster's built-in event logging system to track specific execution phases and calculate scheduling latency.

Step-by-Step Resolution

  1. Identify the relevant Dagster event types for measuring scheduling latency:

    1. STEP_WORKER_STARTING: When the subprocess/pod is being launched

    2. STEP_WORKER_STARTED: When the worker process has started

    3. STEP_START: When the actual op execution begins

  2. Calculate scheduling latency by subtracting event timestamps (see the latency-calculation sketch after this list). The gap between STEP_WORKER_STARTING and STEP_WORKER_STARTED captures how long the worker pod sat pending (for example, waiting on Kubernetes resources), while the gap between STEP_WORKER_STARTED and STEP_START tracks how long the op took to begin executing once the worker was ready.

  3. Access the logs programmatically through Dagster's GraphQL API (see the GraphQL sketch after this list):

    1. Use the runsOrError query to enumerate job runs

    2. Use the capturedLogs endpoint to retrieve raw stdout/stderr output; the structured events listed above, with their timestamps, come back from the logsForRun query
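
  The following sketch computes both intervals per step with Dagster's Python API, assuming it runs somewhere with access to the instance's storage (DAGSTER_HOME configured); the run ID at the bottom is a placeholder:

    from dagster import DagsterEventType, DagsterInstance

    # Event phases whose timestamps bracket the scheduling intervals
    PHASES = {
        DagsterEventType.STEP_WORKER_STARTING,
        DagsterEventType.STEP_WORKER_STARTED,
        DagsterEventType.STEP_START,
    }

    def step_latencies(instance, run_id):
        """Return {step_key: {"pod_pending_s": ..., "worker_to_exec_s": ...}}."""
        # Collect the timestamp of each phase event, keyed by step
        stamps = {}
        for entry in instance.all_logs(run_id):
            event = entry.dagster_event
            if event is not None and event.event_type in PHASES:
                stamps.setdefault(entry.step_key, {})[event.event_type] = entry.timestamp
        results = {}
        for step_key, ts in stamps.items():
            starting = ts.get(DagsterEventType.STEP_WORKER_STARTING)
            started = ts.get(DagsterEventType.STEP_WORKER_STARTED)
            start = ts.get(DagsterEventType.STEP_START)
            results[step_key] = {
                # Pending time: pod could start vs. actually started
                "pod_pending_s": started - starting if starting and started else None,
                # Startup overhead: worker ready vs. op execution begins
                "worker_to_exec_s": start - started if started and start else None,
            }
        return results

    instance = DagsterInstance.get()  # reads dagster.yaml from DAGSTER_HOME
    print(step_latencies(instance, "<run-id>"))  # placeholder run ID

  Timestamps on event log entries are floats of epoch seconds, so the differences above are directly in seconds, which makes aggregating across runs for statistical analysis straightforward.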
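
  For remote analysis, a minimal GraphQL sketch follows; the endpoint URL is an assumption (adjust for your deployment), and the field selections should be verified against your Dagster version's schema. It uses runsOrError to list runs and logsForRun to pull structured events:

    import requests

    GRAPHQL_URL = "http://localhost:3000/graphql"  # assumed endpoint; adjust for your deployment

    RUNS_QUERY = """
    query RecentRuns($limit: Int!) {
      runsOrError(limit: $limit) {
        ... on Runs { results { runId status } }
      }
    }
    """

    EVENTS_QUERY = """
    query RunEvents($runId: ID!) {
      logsForRun(runId: $runId) {
        ... on EventConnection {
          events {
            ... on MessageEvent { eventType timestamp stepKey }
          }
        }
      }
    }
    """

    def graphql(query, variables):
        resp = requests.post(GRAPHQL_URL, json={"query": query, "variables": variables})
        resp.raise_for_status()
        return resp.json()["data"]

    WATCHED = {"STEP_WORKER_STARTING", "STEP_WORKER_STARTED", "STEP_START"}

    # Pagination via afterCursor/hasMore is omitted for brevity
    for run in graphql(RUNS_QUERY, {"limit": 25})["runsOrError"]["results"]:
        events = graphql(EVENTS_QUERY, {"runId": run["runId"]})["logsForRun"]["events"]
        for ev in events:
            if ev.get("eventType") in WATCHED:
                # timestamp is a string of epoch milliseconds in this schema
                print(run["runId"], ev.get("stepKey"), ev["eventType"], ev["timestamp"])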

Alternative Solutions

  Configure a custom compute log manager (Dagster's ComputeLogManager interface) if you need logs stored in a specific location other than the default S3 bucket used by Dagster+.
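
  For self-managed deployments, this is set in the instance's dagster.yaml; as a sketch, on GKE a GCS-backed manager may be more natural than S3. The bucket and prefix below are placeholder values, and the dagster-gcp package must be installed (Dagster+ configurability varies by deployment type):

    compute_logs:
      module: dagster_gcp.gcs.compute_log_manager
      class: GCSComputeLogManager
      config:
        bucket: "my-dagster-compute-logs"   # placeholder bucket name
        prefix: "dagster-logs"              # placeholder object prefix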

Prevention

  Set up automated monitoring of these event logs to proactively identify scheduling bottlenecks and resource constraints in your Kubernetes cluster.
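
  As a sketch, a periodic check (a cron job or a Dagster schedule) can reuse the step_latencies helper from the resolution steps above and flag steps whose pending time exceeds a threshold; the threshold and run window here are arbitrary examples:

    from dagster import DagsterInstance

    PENDING_THRESHOLD_S = 60  # example threshold; tune to your cluster's normal scheduling time

    def check_recent_runs(instance, limit=50):
        # step_latencies is the helper sketched under Step-by-Step Resolution
        for run in instance.get_runs(limit=limit):
            for step_key, lat in step_latencies(instance, run.run_id).items():
                pending = lat["pod_pending_s"]
                if pending is not None and pending > PENDING_THRESHOLD_S:
                    # Swap print for your alerting channel (Slack, PagerDuty, ...)
                    print(f"ALERT {run.run_id}/{step_key}: pod pending {pending:.1f}s")

    check_recent_runs(DagsterInstance.get())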
