Scheduling control flow - retry to schedule application

**Implement retry with delay in the scheduling control flow for the case where a suitable cluster for the given application is not available yet.**

Proposed workflow

  1. Create app
  2. Scheduler doesn't match any cluster for the given app (because no cluster exists yet)
  3. Scheduler raises a RESOURCE_NOT_FOUND error (with an error code under 100) and keeps the given app in the PENDING state.
from enum import Enum


class ReasonCode(Enum):
    INTERNAL_ERROR = 1  # Default error

    INVALID_RESOURCE = 10  # Invalid values in the Manifest
    CLUSTER_NOT_REACHABLE = 11  # Connectivity issue with the Kubernetes deployment
    RESOURCE_NOT_FOUND = 12  # Scheduler did not find a suitable resource (yet)
    NO_SUITABLE_RESOURCE = 50  # Scheduler issue
    # Codes of 100 and above will cause the controller to delete the resource directly

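# scheduler error handler (code snippet)
async def error_handler(self, app, error=None):
    # `reason` is derived from `error`; that step is omitted from this snippet
    if reason.code.value >= 100:
        app.status.state = ApplicationState.DELETING
    elif reason.code == ReasonCode.RESOURCE_NOT_FOUND:
        # No suitable cluster yet: keep the app PENDING so it is retried
        app.status.state = ApplicationState.PENDING
    else:
        app.status.state = ApplicationState.FAILED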
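
The branching in the handler above can be tried out in isolation. The following is a minimal sketch, not Krake's actual code: `state_for` is a hypothetical helper, `ApplicationState` is reduced to the three states named in this issue, and `EXAMPLE_DELETE` is an invented code that exists only to exercise the `>= 100` branch.

```python
from enum import Enum, auto


class ReasonCode(Enum):
    INTERNAL_ERROR = 1
    RESOURCE_NOT_FOUND = 12
    # Hypothetical code, present only to exercise the ">= 100" branch
    EXAMPLE_DELETE = 100


class ApplicationState(Enum):
    # Reduced to the three states mentioned in this issue
    PENDING = auto()
    FAILED = auto()
    DELETING = auto()


def state_for(code: ReasonCode) -> ApplicationState:
    """Return the state the error handler would set for a reason code."""
    if code.value >= 100:
        return ApplicationState.DELETING
    if code is ReasonCode.RESOURCE_NOT_FOUND:
        # No suitable cluster yet: stay PENDING so the app is retried
        return ApplicationState.PENDING
    return ApplicationState.FAILED
```

Keeping the mapping in a pure function like this would make the new PENDING branch easy to unit-test without a running scheduler.
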
  4. Scheduler loop tries to process the app again (with the PENDING state and the RESOURCE_NOT_FOUND reason), but there is still no suitable cluster for the given app
  • Only a limited number of retries is permitted
app.status.scheduler_retries -= 1
  • ! Scheduler should store the number of remaining retries in status.scheduler_retries to prevent overloading the queue
    - If the number of retries is exhausted, the scheduler raises UnsuitableDeploymentError("No cluster available") and sets the FAILED status on the given app
kind: Application
api: kubernetes
metadata:
  changed:       # timestamp when spec was changed

status:
  
  scheduled:     # timestamp when app was bound to cluster
  scheduled_to:  # cluster where the app was scheduled to
  running_on:    # cluster where the app is currently running
  scheduler_retries: int = 5  # number of remaining retries; limits load on the scheduler queue
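
One possible reading of the retry budget is sketched below. It assumes the counter starts at 5, is decremented on every failed scheduling attempt, and that the app fails on the first attempt made after the counter has reached zero. `Status` and `consume_retry` are hypothetical names, the state is a plain string for brevity, and `UnsuitableDeploymentError` is a local stand-in for the exception named above.

```python
from dataclasses import dataclass


class UnsuitableDeploymentError(Exception):
    """Local stand-in for the exception named in the proposal."""


@dataclass
class Status:
    state: str = "PENDING"
    scheduler_retries: int = 5  # default taken from the proposal


def consume_retry(status: Status) -> None:
    """Use up one retry, or fail the app once the budget is exhausted."""
    if status.scheduler_retries <= 0:
        status.state = "FAILED"
        raise UnsuitableDeploymentError("No cluster available")
    status.scheduler_retries -= 1
```

Under these assumptions five unsuccessful attempts leave the app in PENDING with `scheduler_retries == 0`, and the sixth raises and marks it FAILED. Whether the limit should be off by one in either direction is a detail the proposal leaves open.
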
  5. Create a suitable cluster
  6. Scheduler loop tries to process the app again; this time there is a suitable cluster for it
  7. Scheduler sets app.status.scheduled_to for the given app, removes the RESOURCE_NOT_FOUND reason from the app, and sets the proper state for the app
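
The final step of the workflow, where a cluster has become available, could look roughly like the sketch below. `AppStatus` and `bind_to_cluster` are hypothetical names, and the field names follow the status fields listed in this issue. Because the proposal does not say which state is "proper" after binding, the caller passes it in as `next_state`.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class AppStatus:
    state: str = "PENDING"
    reason: Optional[str] = "RESOURCE_NOT_FOUND"
    scheduled: Optional[datetime] = None   # when the app was bound
    scheduled_to: Optional[str] = None     # cluster the app was bound to


def bind_to_cluster(status: AppStatus, cluster: str, next_state: str) -> None:
    """Record the binding and clear the pending RESOURCE_NOT_FOUND reason."""
    status.scheduled_to = cluster
    status.scheduled = datetime.now(timezone.utc)
    status.reason = None
    status.state = next_state  # chosen by the caller; not fixed by the proposal
```
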

! Consider the scenario where the application has cluster constraints and the app is updated

! Apply delayed queuing to the scheduler work queue (this is already supported in Krake)
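
To illustrate what delayed queuing means here, the sketch below re-adds an app key to a queue only after a delay, so that a pending app does not spin through the scheduler loop. It uses a plain `asyncio.Queue` as a stand-in and does not reproduce the API of Krake's own work queue. `requeue_later`, `demo`, and the key `"my-app"` are hypothetical.

```python
import asyncio


async def requeue_later(queue: asyncio.Queue, key: str, delay: float) -> None:
    """Put `key` back on the queue after `delay` seconds."""
    await asyncio.sleep(delay)
    await queue.put(key)


async def demo() -> str:
    queue: asyncio.Queue = asyncio.Queue()
    # Schedule the retry in the background so the worker is not blocked
    task = asyncio.create_task(requeue_later(queue, "my-app", delay=0.05))
    # The key is not visible to consumers until the delay has elapsed
    was_empty = queue.empty()
    key = await queue.get()
    await task
    assert was_empty
    return key
```

Running `asyncio.run(demo())` returns `"my-app"` after roughly 50 ms. In the scheduler the delay would be chosen so that the retry budget in status.scheduler_retries covers a sensible waiting period.
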

Edited by Matej Feder