Runners Platform Best-Effort Tier-1 On-Call Rotation Setup
Summary
- Group Name: Runners Platform
- Group Manager / DRI: Kam Kyrala (@kkyrala)
- Slack Channel: #g_runner_saas_infrastructure
- Rotation Type: Best-Effort (24x5)
Background
The Runners Platform team is establishing a best-effort on-call rotation to provide support for runner platform issues. This is a best-effort policy, which means:
- No expectation to always be at your desk
- Flexible work hours are maintained - your start time does not need to shift to align with on-call
- Pages are sent when you're working; if you miss a page, it escalates to others on the team
- Unacknowledged pages ultimately escalate to the EOC, who is already on-call
- 24x5 coverage only (Monday-Friday, no weekends)
- Team members can take breaks/walks without arranging coverage
Runners Platform On-call Members
- Anton Starovoytov (@astarovoytov)
- Rehab Hassanein (@rehab)
- Addie Yeung (@ayeung)
- Zoe Braddock (@zbraddock)
- Ermia Qasemi (@ermiaqasemi)
- Davis Bickford (@dbickford)
- Joe Shaw (@joe-shaw)
- Igor Wiedler (@igorwwwwwwwwwwwwwwwwwwww)
- Tomasz Maczukin (@tmaczukin)
Runners Platform On-call Schedule
All times are in UTC. Coverage is Monday through Friday only (24x5).
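As a rough illustration of the 24x5 rule, the sketch below checks whether a given moment falls inside the coverage window. It assumes the window runs from Monday 00:00 UTC through the end of Friday UTC, which this issue does not state explicitly; the function name `in_coverage_window` is hypothetical and is not part of incident.io.

```python
from datetime import datetime, timezone


def in_coverage_window(moment: datetime) -> bool:
    """Return True if `moment` falls on Monday-Friday when expressed in UTC.

    Assumption: coverage spans Monday 00:00 UTC to the end of Friday UTC.
    """
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    utc_moment = moment.astimezone(timezone.utc)
    # datetime.weekday(): Monday is 0 and Sunday is 6
    return utc_moment.weekday() < 5


if __name__ == "__main__":
    monday = datetime(2025, 11, 10, 12, 0, tzinfo=timezone.utc)
    saturday = datetime(2025, 11, 8, 12, 0, tzinfo=timezone.utc)
    print(in_coverage_window(monday), in_coverage_window(saturday))
```

Outside this window a page would not go to the team at all; under the model described here, the EOC remains the responder.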
Schedule Structure:
- Option C: We will page everyone who is available in round-robin fashion until someone responds. If no one responds, we will escalate to EOC.
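The Option C behavior can be sketched as a small simulation. This is only an illustration of the paging order, not how incident.io implements it: `page_round_robin`, the `acknowledges` callback, and the `start` offset are all hypothetical names introduced for this example.

```python
from typing import Callable, Sequence

EOC = "EOC"  # hypothetical label for the final escalation target


def page_round_robin(
    available: Sequence[str],
    acknowledges: Callable[[str], bool],
    start: int = 0,
) -> str:
    """Page available members in turn, beginning at index `start`.

    Returns the first member who acknowledges, or EOC if nobody does.
    `acknowledges` stands in for "page sent and acknowledged in time".
    """
    members = list(available)
    for offset in range(len(members)):
        # Wrap around so that every available member is paged exactly once
        member = members[(start + offset) % len(members)]
        if acknowledges(member):
            return member
    return EOC


if __name__ == "__main__":
    team = ["member-a", "member-b", "member-c"]
    print(page_round_robin(team, lambda m: m == "member-c", start=1))
    print(page_round_robin(team, lambda m: False))
```

Rotating `start` between incidents is one way to avoid always paging the same person first; whether that is desirable for this team is an open design choice.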
Proposed Regional Coverage (to be finalized based on schedule structure)
This is a best-effort rotation, so there will be no formalized "schedule" for regions.
Escalation Path
flowchart TD
incident[Incident Escalated to Runners Platform]
level1[Level 1: Page Runners Platform Team<br/>Best-effort rotation]
ack_level1{Acknowledged<br/>within 15min?}
eoc[Escalate to EOC<br/>Standard EOC on-call handles]
resolved[Incident Handled]
incident --> level1
level1 --> ack_level1
ack_level1 -->|Yes| resolved
ack_level1 -->|No| eoc
eoc --> resolved
classDef level1 fill:#fff3e0
classDef decision fill:#f3e5f5
classDef success fill:#e8f5e8
classDef escalation fill:#ffebee
class level1 level1
class ack_level1 decision
class resolved success
class eoc escalation
The default escalation path can be changed. Time intervals can be adjusted, and notification options are not fixed. If unsure, the defaults should be a reasonable starting point.
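To make the adjustable intervals concrete, the sketch below models the flowchart as data and computes who should currently hold an unacknowledged page. It is a minimal model under stated assumptions: `EscalationLevel`, `DEFAULT_PATH`, and `current_target` are hypothetical names, the 15-minute value is taken from the flowchart, and treating exactly 15 minutes as "not acknowledged in time" is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EscalationLevel:
    target: str
    ack_timeout_minutes: int  # how long to wait for an ack before moving on


# Mirrors the flowchart: Level 1 pages the team, then the EOC takes over.
DEFAULT_PATH = (
    EscalationLevel("Runners Platform (best-effort)", 15),
    EscalationLevel("EOC", 0),  # final level; its timeout is not used
)


def current_target(path: Sequence[EscalationLevel], minutes_unacknowledged: int) -> str:
    """Return the target holding the page after the given unacknowledged time."""
    if not path:
        raise ValueError("escalation path must have at least one level")
    elapsed = 0
    for level in path[:-1]:
        elapsed += level.ack_timeout_minutes
        if minutes_unacknowledged < elapsed:
            return level.target
    # Every earlier timeout has expired, so the final level holds the page
    return path[-1].target


if __name__ == "__main__":
    for minutes in (0, 14, 15):
        print(minutes, current_target(DEFAULT_PATH, minutes))
```

Changing the timeout, or inserting an extra level before the EOC, only requires editing the data; the lookup logic stays the same.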
DRI Checklist
- Go through the Rotation Leader LevelUp channel for detailed instructions on how to onboard your team (optional)
- Finalize on-call team members
- Finalize on-call schedule structure (Option A, B, or C above)
- Fill the schedule section above with finalized team member assignments based on timezones and chosen schedule structure
- If any of the members are part of the Incident Manager on-call rotation, please create an issue like this example here to have them removed (where possible) from the IM rotation.
  - Zoe Braddock (@zbraddock) - Remove from database on-call rotation - pending timeline
  - Ermia Qasemi (@ermiaqasemi) - Remove from dedicated EOC rotation - last shift Nov 10th 2025
  - Anton Starovoytov (@astarovoytov) - Remove from .com EOC rotation - will remove in new year
  - Rehab Hassanein (@rehab) - Remove from .com EOC rotation - will remove in new year
  - Addie Yeung (@ayeung) - Remove from .com EOC rotation - will remove in new year
- If any of the team members are part of the Dev on-call rotation, please add their emails to the "Excluded Team Member Emails" tab in the eligibility spreadsheet, with the name of the rotation under "reason", to exclude them from the rotation.
- Ensure backend engineers who have never been on-call receive appropriate training and understand the expectations for a best-effort rotation
- Note: For this best-effort rotation, @kkyrala (rotation leader/DRI) will be available for escalation as needed, though not in a formal Level 2 escalation chain, since we escalate directly to EOC
- On-call license setup and access
  - Use the Slack command "/request" to raise a request in Lumos for yourself to get "On Call Scheduler" access, so that you can set the rotation on incident.io
  - Ensure each team member has "Full access" in the "on-call seat" column on the incident.io users page (verify here). If not, ask the Networking & Incident Management team to provide it for any team members who need it by pinging a member of that team on the issue. DO NOT USE THE ACCESS REQUEST TEMPLATE process for this. This does not grant a permission; it grants a full-access license (for billing purposes) so that the user can use the on-call features.
- Set up schedules and escalation path
  - Once the schedule section above in this issue is filled, create a Schedule for your team using incident.io. To do so, you can duplicate the "SAMPLE tier2 - TEAMNAME" schedule, edit it as per your requirements, and add the members accordingly. For the schedule name, use the format "runners platform - best effort" or similar. This is your on-call schedule.
  - Set up escalation path to EOC
    - Navigate to "Escalation paths" in the incident.io UI and create an escalation path for Runners Platform
    - For the escalation path name, use the format "runners platform"
    - Configure: Level 1 pages the Runners Platform schedule; if there is no acknowledgment within an appropriate timeout (e.g., 15 minutes), escalate to the EOC on-call
    - Work with incident.io to explore the best configuration for a best-effort/casual on-call system, to minimize notification noise and accommodate flexible work schedules
  - Note: In the Escalation Path on incident.io, "Notify" represents paging the folks on the schedule; "Notify on Slack Channel" will simply notify them on Slack
- Prepare team for on-call
  - Inform rotation members to ignore notifications about upcoming on-call shifts, with a message like the one below:
    Hi, you'll be getting a notification about upcoming on-call shifts. Do not worry, you will not be paged yet. We will only activate the rotation on date X. Any shifts scheduled before that are just for us to test the setup and prepare for the go-live. IMPORTANT: This is a BEST-EFFORT rotation. You are NOT required to be at your desk at specific times. Maintain your normal flexible work hours. If you miss a page while on a break or away from your desk, it will escalate to others on the team and ultimately to EOC. No coverage needed for breaks/walks.
  - Instruct your team members to set their notification preferences in the incident.io UI; this represents how they wish to be informed when they are paged
  - Provide guidance on notification settings that work well with flexible schedules and best-effort expectations
  - While it's not mandatory, it is recommended to have the incident.io app installed on each member's mobile device (though this is less critical for a best-effort rotation)
  - Share related handbook links with the rotation members
    - Incident.io onboarding runbook
    - Incident Management Handbook
    - Document best-effort rotation expectations specific to Runners Platform team
  - Review the On-Call Readiness dashboard
  - Instruct the team members to finish appropriate on-call training (e.g., Tier-2 levelup course or adapted version)
  - Create Runners Platform-specific on-call documentation covering:
    - Scope of runner platform on-call responsibilities
    - Common issues and escalation points
    - Best-effort expectations and guidelines
- Go live!
  - Set go-live date (target: after holiday season, when .com EOCs transition off that rotation)
  - On the due date, update the On-call teams catalog with the name ("runners platform") and escalation path of your team. Each row in the catalog helps populate the drop-down menu that EOCs use to select which team to page.
  - Announce in #eoc-general that your team is ready to be paged: give a high-level description of this group's covered areas (runner platform infrastructure, runner managers, job execution issues, etc.), clarify that this is a best-effort 24x5 rotation, and reference this handbook link (or create Runners Platform-specific documentation)
  - Monitor the rotation for the first 2 weeks and gather feedback on notification configuration and escalation timings
  - Iterate on settings based on team feedback to optimize for the best-effort rotation model
Congratulations, you are now ready to be on-call!
Notes:
- This issue establishes a best-effort 24x5 rotation, which differs from traditional on-call expectations
- Team members maintain flexible work hours and are not required to be available at specific times
- Goal is to provide runner platform expertise without creating an undue burden on the team
- EOC remains the ultimate escalation point for all incidents