Run Status Sensor Timeouts with get_job_snapshot Calls

Last updated: October 29, 2025

Run status sensors that call context.instance.get_job_snapshot() and context.instance.get_run_stats() may time out when processing multiple runs in a single tick, causing the sensor to fall behind and stop processing recent runs promptly.

Problem Description

Run status sensors time out when a single tick must fetch job snapshots and run statistics for multiple runs. The sensor then falls behind: instead of handling recent runs, it is still processing runs from several hours ago, a sign that it cannot keep up with the volume of successful runs.

Symptoms

  • Sensor timeouts after 150 seconds (or the configured timeout duration)

  • Thread dumps showing calls stuck on get_job_snapshot or get_run_stats

  • Sensor processing runs from hours ago instead of recent runs

  • Stack traces showing threads idle in SSL/HTTP operations

  • Individual get_job_snapshot calls taking 20+ seconds

  • Individual get_run_stats calls taking 7+ seconds

Root Cause

By default, run status sensors process 5 runs per tick, and the timeout applies to the total runtime across all runs in that tick. When each run requires expensive API calls to fetch job snapshots and run statistics, the cumulative time can exceed the configured timeout, causing the sensor to fail and fall behind.
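
As a rough back-of-the-envelope check, the per-call latencies listed in the symptoms above (treated here as assumed representative values, not Dagster defaults) show how quickly a single tick can approach the timeout:

# Illustrative arithmetic only; the latencies are assumptions taken from the
# symptoms above, not fixed constants.
runs_per_tick = 5            # default number of runs processed per tick
snapshot_latency_s = 20      # observed get_job_snapshot latency per run
run_stats_latency_s = 7      # observed get_run_stats latency per run

tick_time_s = runs_per_tick * (snapshot_latency_s + run_stats_latency_s)
print(tick_time_s)  # 135 seconds, already close to a 150-second timeout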

Solution

If the sensor has fallen far behind, the quickest recovery is to reset its cursor so that it resumes from the most recent runs rather than working through the backlog:

# Set the sensor cursor to an empty string so the sensor starts from the most recent runs
cursor = ""

Step-by-Step Resolution

  1. Reduce runs per tick: Set the environment variable to process fewer runs per sensor tick:

    DAGSTER_RUN_STATUS_SENSOR_RUN_LIMIT=1
  2. Increase timeout: Adjust the Dagster Cloud timeout in your Helm chart:

    dagsterCloud:
      timeout: 120  # Increase from default 60 seconds
  3. Reset sensor cursor: If the sensor is significantly behind, reset it to start from recent runs by setting the cursor to an empty string in the Dagster UI

  4. Add logging: Include timing logs in your sensor to monitor performance:

    import time

    from dagster import RunStatusSensorContext, run_status_sensor

    @run_status_sensor(...)
    def my_sensor(context: RunStatusSensorContext):
        run = context.dagster_run
        context.log.info(f"Processing run {run.run_id}")

        start_time = time.time()
        job_snapshot = context.instance.get_job_snapshot(run.job_snapshot_id)
        context.log.info(f"get_job_snapshot took {time.time() - start_time:.2f}s")

        start_time = time.time()
        run_stats = context.instance.get_run_stats(run.run_id)
        context.log.info(f"get_run_stats took {time.time() - start_time:.2f}s")

Alternative Solutions

Consider making the API calls concurrently using asyncio or threading to reduce total processing time per run. You might also evaluate whether all the data from job snapshots is necessary, or if a lighter-weight alternative exists for your use case.
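
A minimal sketch of the concurrent approach is below, using a thread pool so the per-run cost is roughly the slower of the two calls rather than their sum. The sensor name and the SUCCESS trigger status are placeholders for illustration, and the sketch assumes your instance's storage client tolerates two concurrent calls:

from concurrent.futures import ThreadPoolExecutor

from dagster import DagsterRunStatus, RunStatusSensorContext, run_status_sensor

@run_status_sensor(run_status=DagsterRunStatus.SUCCESS)
def my_concurrent_sensor(context: RunStatusSensorContext):
    run = context.dagster_run

    # Submit both expensive calls at once and wait for the results together.
    with ThreadPoolExecutor(max_workers=2) as executor:
        snapshot_future = executor.submit(
            context.instance.get_job_snapshot, run.job_snapshot_id
        )
        stats_future = executor.submit(context.instance.get_run_stats, run.run_id)

        job_snapshot = snapshot_future.result()
        run_stats = stats_future.result()

    context.log.info(f"Fetched snapshot and stats for run {run.run_id}")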

Prevention

Monitor sensor performance regularly and adjust the runs-per-tick limit based on your observed API call latency. When configuring these values, weigh the trade-off: processing fewer runs per tick keeps each tick within the timeout, but the sensor can fall behind if runs complete faster than it processes them.
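
One lightweight way to monitor this, sketched below, is to time the work done for each run inside the sensor and warn when it exceeds its share of the tick budget. The budget, the runs-per-tick value, the trigger status, and the sensor name are assumptions to adapt to your deployment:

import time

from dagster import DagsterRunStatus, RunStatusSensorContext, run_status_sensor

TICK_TIME_BUDGET_S = 150  # assumed sensor timeout; match your configured value
RUNS_PER_TICK = 5         # assumed number of runs processed per tick

@run_status_sensor(run_status=DagsterRunStatus.SUCCESS)
def monitored_sensor(context: RunStatusSensorContext):
    start = time.time()
    run = context.dagster_run

    job_snapshot = context.instance.get_job_snapshot(run.job_snapshot_id)
    run_stats = context.instance.get_run_stats(run.run_id)

    elapsed = time.time() - start
    if elapsed > TICK_TIME_BUDGET_S / RUNS_PER_TICK:
        # One run has used more than its share of the tick; the sensor is at
        # risk of timing out once several runs are processed in the same tick.
        context.log.warning(
            f"Processing run {run.run_id} took {elapsed:.1f}s, "
            f"exceeding the per-run budget of {TICK_TIME_BUDGET_S / RUNS_PER_TICK:.0f}s"
        )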

Related Documentation