
Runners Platform Best-Effort Tier-1 On-Call Rotation Setup


Summary

  • Group Name: Runners Platform
  • Group Manager / DRI: Kam Kyrala (@kkyrala)
  • Slack Channel: #g_runner_saas_infrastructure
  • Rotation Type: Best-Effort (24x5)

Background

The Runners Platform team is establishing a best-effort on-call rotation to provide support for runner platform issues. This is a best-effort policy, which means:

  • No expectation to always be at your desk
  • Flexible work hours are maintained - your start time does not need to shift to align with on-call
  • Pages are sent when you're working; if you miss a page, it escalates to others on the team
  • Ultimate escalation is to the EOC, who is already on-call
  • 24x5 coverage only (Monday-Friday, no weekends)
  • Team members can take breaks/walks without arranging coverage

Runners Platform On-call Members

Runners Platform On-call Schedule

All times are in UTC. Coverage is Monday through Friday only (24x5).
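As a minimal sketch of the coverage rule, the check below treats "24x5 in UTC" as any moment that falls on Monday through Friday once converted to UTC. That reading of the window boundaries is an assumption (the issue does not say exactly where the week starts and ends), and `is_covered` is a hypothetical helper, not part of incident.io.

```python
from datetime import datetime, timedelta, timezone


def is_covered(moment: datetime) -> bool:
    """Return True if `moment` falls on Monday-Friday when expressed in UTC.

    Assumption: the 24x5 window runs from Monday 00:00 UTC up to, but not
    including, Saturday 00:00 UTC.
    """
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    utc_moment = moment.astimezone(timezone.utc)
    # datetime.weekday(): Monday is 0, Sunday is 6.
    return utc_moment.weekday() < 5


# Monday 10 November 2025, 00:00 UTC is inside the window.
monday = datetime(2025, 11, 10, 0, 0, tzinfo=timezone.utc)

# Friday 23:30 at UTC-5 is already Saturday 04:30 in UTC, so it is outside.
late_friday = datetime(2025, 11, 14, 23, 30, tzinfo=timezone(timedelta(hours=-5)))
```

Because the check converts to UTC first, a team member's local Friday evening can already be outside the window, which is worth keeping in mind when setting expectations per region.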

Schedule Structure:

  • Option C: We will page everyone who is available in round-robin fashion until someone responds. If no one responds, we will escalate to EOC.
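The Option C behaviour can be sketched roughly as follows. This is an illustration only: it assumes a single pass over the available members (the issue does not say whether paging cycles more than once), and `page_round_robin`, `handle_incident`, and the `page` and `escalate_to_eoc` callbacks are hypothetical names rather than incident.io APIs.

```python
from typing import Callable, Iterable, Optional


def page_round_robin(
    available_members: Iterable[str],
    page: Callable[[str], bool],
) -> Optional[str]:
    """Page each available member in turn and return the first who responds.

    `page` is a hypothetical callback that pages one member and returns True
    if that member acknowledged. Returns None when nobody acknowledges.
    """
    for member in available_members:
        if page(member):
            return member
    return None


def handle_incident(
    available_members: Iterable[str],
    page: Callable[[str], bool],
    escalate_to_eoc: Callable[[], None],
) -> str:
    """Return the responder's name, or "EOC" after escalating."""
    responder = page_round_robin(available_members, page)
    if responder is None:
        # Nobody on the team responded, so hand over to the EOC.
        escalate_to_eoc()
        return "EOC"
    return responder
```

In practice the paging order and timeouts would be configured in the scheduling tool rather than in code; the sketch only makes the "page until someone responds, otherwise EOC" rule explicit.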

Proposed Regional Coverage (to be finalized based on schedule structure)

This is a best effort rotation, so there will be no formalized "schedule" for regions.

Escalation Path

flowchart TD
    incident[Incident Escalated to Runners Platform]
    
    level1[Level 1: Page Runners Platform Team<br/>Best-effort rotation]
    
    ack_level1{Acknowledged<br/>within 15min?}
    
    eoc[Escalate to EOC<br/>Standard EOC on-call handles]
    
    resolved[Incident Handled]
    
    incident --> level1
    level1 --> ack_level1
    ack_level1 -->|Yes| resolved
    ack_level1 -->|No| eoc
    eoc --> resolved
    
    classDef level1 fill:#fff3e0
    classDef decision fill:#f3e5f5
    classDef success fill:#e8f5e8
    classDef escalation fill:#ffebee
    
    class level1 level1
    class ack_level1 decision
    class resolved success
    class eoc escalation

The default escalation path can be changed. Time intervals can be adjusted, and notification options are not fixed. If unsure, the defaults should be a reasonable starting point.
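To make the adjustable parts concrete, here is a small sketch that models the escalation path as data, with the 15-minute acknowledgment window as a tunable value. The names `EscalationLevel`, `DEFAULT_PATH`, and `current_target` are hypothetical and do not reflect how incident.io stores its configuration. The sketch also assumes that a page still unacknowledged at exactly 15 minutes moves to the next level.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class EscalationLevel:
    target: str
    # Minutes to wait for an acknowledgment; None marks the final level.
    ack_timeout_minutes: Optional[int]


# Defaults taken from the flowchart: team first, EOC after 15 minutes.
DEFAULT_PATH = (
    EscalationLevel("Runners Platform", 15),
    EscalationLevel("EOC", None),
)


def current_target(path: Sequence[EscalationLevel], minutes_unacknowledged: int) -> str:
    """Return who is being paged after the page has gone unacknowledged
    for `minutes_unacknowledged` minutes."""
    if not path:
        raise ValueError("escalation path must have at least one level")
    deadline = 0
    for level in path:
        if level.ack_timeout_minutes is None:
            return level.target
        deadline += level.ack_timeout_minutes
        if minutes_unacknowledged < deadline:
            return level.target
    # Every level timed out; stay on the last one.
    return path[-1].target
```

Changing the interval is then a one-line edit to the path, for example replacing 15 with 30 for a more relaxed best-effort window.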

DRI Checklist

  • Go through the Rotation Leader LevelUp channel for detailed instructions on how to onboard your team (optional)
  • Finalize On-call team members
    • Finalize on-call schedule structure (Option C above)
    • Fill the schedule section above with finalized team member assignments based on timezones and chosen schedule structure
    • If any of the members are part of the Incident Manager on-call rotation, please create an issue like this example to have them removed (where possible) from the IM rotation.
      • Zoe Braddock (@zbraddock) - Remove from database on-call rotation - pending timeline
      • Ermia Qasemi (@ermiaqasemi) - Remove from dedicated EOC rotation - last shift Nov 10th 2025
      • Anton Starovoytov (@astarovoytov) - Remove from .com EOC rotation - will remove in new year
      • Rehab Hassanein (@rehab) - Remove from .com EOC rotation - will remove in new year
      • Addie Yeung (@ayeung) - Remove from .com EOC rotation - will remove in new year
    • If any of the team members are part of the Dev on-call rotation, add their emails to the Excluded Team Member Emails tab of the eligibility spreadsheet, with the name of this rotation as the reason, to exclude them from the Dev rotation.
    • Ensure backend engineers who have never been on-call receive appropriate training and understand the expectations for a best-effort rotation

Note: For this best-effort rotation, @kkyrala (rotation leader/DRI) will be available for escalation as needed, but not as a formal Level 2 in the escalation chain, since unacknowledged pages escalate directly to EOC.

  • On-call license setup and access

    • Use the Slack command /request to raise a Lumos request for On Call Scheduler access for yourself, which you need in order to set up the rotation on incident.io
    • Verify that each team member has Full access in the "on-call seat" column on the incident.io users page. If anyone does not, ask the Networking & Incident Management team to provide it by pinging a member of that team on the issue. DO NOT USE THE ACCESS REQUEST TEMPLATE process for this. This does not grant a permission; it grants a full access license (for billing purposes) so that the user can use the on-call features.
  • Set up schedules and escalation path

  • Once the schedule section above in this issue is filled in, create a schedule for your team in incident.io. To do so, you can duplicate the SAMPLE tier2 - TEAMNAME schedule, edit it to fit your requirements, and add the members accordingly. For the schedule name, use the format runners platform - best effort or similar. This is your on-call schedule.

  • Set up escalation path to EOC

    • Navigate to Escalation paths in the incident.io UI and create an escalation path for Runners Platform
    • For the escalation path name, use the format runners platform
    • Configure: Level 1 pages the Runners Platform schedule; if there is no acknowledgment within the chosen timeout (e.g., 15 minutes), escalate to the EOC on-call
    • Work with incident.io to explore the best configuration for a best-effort/casual on-call system, one that minimizes notification noise and accommodates flexible work schedules

Note: In the Escalation Path on incident.io, Notify means paging the people on the schedule, whereas Notify on Slack Channel only posts a notification in Slack.

  • Prepare team for On-call

    • Tell rotation members to ignore notifications about upcoming on-call shifts, with a message like the one below:
    Hi, you'll be getting a notification about upcoming on-call shifts. Do not worry, you will not be paged yet. We will only activate the rotation on date X. Any shifts scheduled before that are just for us to test the setup and prepare for the go-live.
    
    IMPORTANT: This is a BEST-EFFORT rotation. You are NOT required to be at your desk at specific times. 
    Maintain your normal flexible work hours. If you miss a page while on a break or away from your desk, 
    it will escalate to others on the team and ultimately to EOC. No coverage needed for breaks/walks.
    • Instruct your team members to set their notification preferences in the incident.io UI; these control how they are informed when they are paged
    • Provide guidance on notification settings that work well with flexible schedules and best-effort expectations
    • Installing the incident.io app on each member's mobile device is recommended but not mandatory (and is less critical for a best-effort rotation)
    • Share related handbook links with the rotation members
    • Review the On-Call Readiness dashboard
    • Instruct the team members to finish appropriate on-call training (e.g., the Tier-2 LevelUp course or an adapted version)
    • Create Runners Platform-specific on-call documentation covering:
      • Scope of runner platform on-call responsibilities
      • Common issues and escalation points
      • Best-effort expectations and guidelines
  • Go live!

    • Set go-live date (target: after holiday season, when .com EOCs transition off that rotation)
    • On the due date, update the On-call teams catalog with the name (runners platform) and escalation path of your team. Each row in the catalog populates the drop-down menu that EOCs use to select which team to page.
    • Announce in #eoc-general that your team is ready to be paged. Give a high-level description of the areas this group covers (runner platform infrastructure, runner managers, job execution issues, etc.), clarify that this is a best-effort 24x5 rotation, and reference this handbook link (or create Runners Platform-specific documentation)
    • Monitor the rotation for the first 2 weeks and gather feedback on notification configuration and escalation timings
    • Iterate on settings based on team feedback to optimize for the best-effort rotation model

Congratulations, you are now ready to be on-call!


Notes:

  • This issue establishes a best-effort, 24x5 rotation, which differs from traditional on-call expectations
  • Team members maintain flexible work hours and are not required to be available at specific times
  • The goal is to provide runner platform expertise without creating an undue burden on the team
  • EOC remains the ultimate escalation point for all incidents

workflow-infraTriage

Edited by Kam Kyrala