Measuring Scheduling Latency in Dagster on GKE
Last updated: November 4, 2025
Problem Description
Users need to measure scheduling latency in Dagster deployments on GKE, specifically tracking the time between when a Kubernetes job/pod could start and when it actually starts. This includes understanding how to access and query Dagster event logs for performance analysis.
Symptoms
Need to track worker pod pending time due to insufficient Kubernetes resources
Requirement to perform statistical analysis across multiple jobs and deployments
Uncertainty about where Dagster event logs are stored and how to access them programmatically
Root Cause
Dagster provides detailed event logging for job execution phases, but users need to understand which specific events to monitor and how to access the logs programmatically for analysis.
Solution
Use Dagster's built-in event logging system to track specific execution phases and calculate scheduling latency.
Step-by-Step Resolution
Identify the relevant Dagster event types for measuring scheduling latency:
STEP_WORKER_STARTING: When the subprocess/pod is being launched
STEP_WORKER_STARTED: When the worker process has started
STEP_START: When the actual op execution begins
Calculate scheduling latency by measuring the time difference between STEP_WORKER_STARTED and STEP_START events. This tracks how long it took a step to begin executing after its worker started/was ready (see the first sketch after these steps).
Access logs programmatically using Dagster's GraphQL API endpoints (see the second sketch after these steps):
Use the runsOrError endpoint to query job runs
Use the capturedLogs endpoint to retrieve log data
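If you have direct access to the Dagster instance (for example, a self-hosted deployment where DAGSTER_HOME is configured), the Python API offers a direct route to the same event data. The following is a minimal sketch, not a prescribed implementation; the run ID is a hypothetical placeholder.

from dagster import DagsterInstance, DagsterEventType

RUN_ID = "replace-with-a-real-run-id"  # hypothetical placeholder

instance = DagsterInstance.get()

# Fetch the two event types that bracket scheduling latency
worker_started = instance.all_logs(RUN_ID, of_type=DagsterEventType.STEP_WORKER_STARTED)
step_start = instance.all_logs(RUN_ID, of_type=DagsterEventType.STEP_START)

# Index STEP_WORKER_STARTED timestamps by step key, then diff against STEP_START
started_at = {e.dagster_event.step_key: e.timestamp for e in worker_started}
for e in step_start:
    key = e.dagster_event.step_key
    if key in started_at:
        print(f"{key}: scheduling latency {e.timestamp - started_at[key]:.2f}s")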
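For remote or Dagster+ deployments, the same run data is reachable over the GraphQL API exposed by the Dagster webserver. A sketch of querying recent runs over HTTP; the endpoint URL is a placeholder, and field names may vary slightly across Dagster versions.

import requests

GRAPHQL_URL = "http://localhost:3000/graphql"  # placeholder; use your webserver's address

QUERY = """
query RecentRuns {
  runsOrError(limit: 20) {
    ... on Runs {
      results {
        runId
        status
        startTime
        endTime
      }
    }
  }
}
"""

resp = requests.post(GRAPHQL_URL, json={"query": QUERY})
resp.raise_for_status()
for run in resp.json()["data"]["runsOrError"]["results"]:
    print(run["runId"], run["status"], run["startTime"], run["endTime"])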
Alternative Solutions (if applicable)
Configure a custom compute log manager (ComputeLogManager) if you need logs stored somewhere other than the default location (for Dagster+, an S3 bucket managed by Dagster); a configuration sketch follows.
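For a self-hosted deployment on GKE, one option is routing compute logs to a GCS bucket with GCSComputeLogManager from the dagster-gcp package. A dagster.yaml sketch; the bucket name and prefix are hypothetical.

compute_logs:
  module: dagster_gcp.gcs.compute_log_manager
  class: GCSComputeLogManager
  config:
    bucket: "my-dagster-compute-logs"  # hypothetical bucket
    prefix: "compute-logs"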
Prevention
Set up automated monitoring of these event logs to proactively identify scheduling bottlenecks and resource constraints in your Kubernetes cluster.
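As a starting point, the per-run latency calculation above can be extended into a periodic batch job that aggregates latency across recent runs. A sketch assuming the same instance access as before; the 50-run window and the helper function name are illustrative, not part of Dagster's API.

import statistics
from dagster import DagsterInstance, DagsterEventType

def step_latencies(instance, run_id):
    # Per-step seconds between STEP_WORKER_STARTED and STEP_START
    started = {
        e.dagster_event.step_key: e.timestamp
        for e in instance.all_logs(run_id, of_type=DagsterEventType.STEP_WORKER_STARTED)
    }
    return [
        e.timestamp - started[e.dagster_event.step_key]
        for e in instance.all_logs(run_id, of_type=DagsterEventType.STEP_START)
        if e.dagster_event.step_key in started
    ]

instance = DagsterInstance.get()
samples = []
for run in instance.get_runs(limit=50):  # arbitrary window of recent runs
    samples.extend(step_latencies(instance, run.run_id))

if samples:
    print(f"steps sampled: {len(samples)}")
    print(f"median latency: {statistics.median(samples):.2f}s")
    print(f"mean latency: {statistics.mean(samples):.2f}s")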