Run Status Sensor Timeouts with get_job_snapshot Calls
Last updated: October 29, 2025
Run status sensors that make calls to context.instance.get_job_snapshot() and context.instance.get_run_stats() may experience timeouts when processing multiple runs, causing the sensor to fall behind and fail to process recent runs.
Problem Description
Run status sensors time out when processing multiple runs that each require fetching job snapshots and run statistics. The sensor falls behind and may still be working through runs from several hours ago, indicating it cannot keep up with the volume of successful runs.
Symptoms
- Sensor timeouts after 150 seconds (or the configured timeout duration)
- Thread dumps showing calls stuck on `get_job_snapshot` or `get_run_stats`
- Sensor processing runs from hours ago instead of recent runs
- Stack traces showing threads idle in SSL/HTTP operations
- Individual `get_job_snapshot` calls taking 20+ seconds
- Individual `get_run_stats` calls taking 7+ seconds
Root Cause
By default, run status sensors process 5 runs per tick, and the timeout applies to the total runtime across all runs in that tick. When each run requires expensive API calls to fetch job snapshots and run statistics, the cumulative time can exceed the configured timeout, causing the sensor to fail and fall behind. At the latencies above, 5 runs × (20 s + 7 s) ≈ 135 s of API time per tick, which already approaches a 150-second timeout before any other sensor work is counted.
Solution
Reset the sensor cursor to start processing from the most recent runs:
```python
# Set the sensor cursor to an empty string to start from the end
cursor = ""
```

Step-by-Step Resolution
1. Reduce runs per tick: Set the environment variable to process fewer runs per sensor tick:

   ```
   DAGSTER_RUN_STATUS_SENSOR_RUN_LIMIT=1
   ```

2. Increase timeout: Adjust the Dagster Cloud timeout in your Helm chart:

   ```yaml
   dagsterCloud:
     timeout: 120  # Increase from default 60 seconds
   ```

3. Reset sensor cursor: If the sensor is significantly behind, reset it to start from recent runs by setting the cursor to an empty string in the Dagster UI.
4. Add logging: Include timing logs in your sensor to monitor performance:
   ```python
   import time

   from dagster import DagsterRunStatus, RunStatusSensorContext, run_status_sensor

   # run_status=SUCCESS matches the successful-run volume described above;
   # adjust it to the status your sensor actually watches.
   @run_status_sensor(run_status=DagsterRunStatus.SUCCESS)
   def my_sensor(context: RunStatusSensorContext):
       context.log.info(f"Processing run {context.dagster_run.run_id}")

       start_time = time.time()
       job_snapshot = context.instance.get_job_snapshot(context.dagster_run.job_snapshot_id)
       context.log.info(f"get_job_snapshot took {time.time() - start_time:.2f}s")

       start_time = time.time()
       run_stats = context.instance.get_run_stats(context.dagster_run.run_id)
       context.log.info(f"get_run_stats took {time.time() - start_time:.2f}s")
   ```
Alternative Solutions
Consider making the API calls concurrently using asyncio or threading to reduce total processing time per run. You might also evaluate whether all the data from job snapshots is necessary, or if a lighter-weight alternative exists for your use case.
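A minimal sketch of the threading approach, assuming the two instance reads are independent (the sensor name `my_concurrent_sensor` is illustrative; verify thread safety of concurrent instance reads against your Dagster version before relying on this):

```python
import time
from concurrent.futures import ThreadPoolExecutor

from dagster import DagsterRunStatus, RunStatusSensorContext, run_status_sensor

@run_status_sensor(run_status=DagsterRunStatus.SUCCESS)
def my_concurrent_sensor(context: RunStatusSensorContext):
    run = context.dagster_run
    start_time = time.time()

    # Issue both reads at once so the tick pays roughly max(latency) instead of the sum.
    with ThreadPoolExecutor(max_workers=2) as pool:
        snapshot_future = pool.submit(context.instance.get_job_snapshot, run.job_snapshot_id)
        stats_future = pool.submit(context.instance.get_run_stats, run.run_id)
        job_snapshot = snapshot_future.result()
        run_stats = stats_future.result()

    context.log.info(f"Fetched snapshot and stats in {time.time() - start_time:.2f}s")
```

With the 20 s and 7 s latencies reported above, this brings per-run fetch time down to roughly the slower of the two calls rather than their sum.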
Prevention
Monitor sensor performance regularly and adjust the runs per tick limit based on your API call latency. Consider the trade-off between processing speed and falling behind on run processing when configuring these values.
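One way to watch for regressions without reading raw logs is a small timing wrapper that warns when a call exceeds a latency budget. This is a sketch; `timed_call` and `warn_threshold_s` are illustrative names, not Dagster APIs:

```python
import time
from typing import Any, Callable

def timed_call(context, label: str, fn: Callable[..., Any], *args,
               warn_threshold_s: float = 5.0) -> Any:
    """Run fn(*args), log its duration, and warn when it exceeds the threshold."""
    start = time.time()
    result = fn(*args)
    elapsed = time.time() - start
    if elapsed > warn_threshold_s:
        context.log.warning(f"{label} took {elapsed:.2f}s (threshold {warn_threshold_s}s)")
    else:
        context.log.info(f"{label} took {elapsed:.2f}s")
    return result

# Usage inside a run status sensor body:
# run_stats = timed_call(context, "get_run_stats",
#                        context.instance.get_run_stats, context.dagster_run.run_id)
```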