
Add support for preloading models #822

Merged 1 commit into main on Nov 22, 2024
Conversation

@alexnorell (Contributor) commented Nov 20, 2024

Description

This PR introduces support for preloading models at startup and includes Kubernetes health check and readiness endpoints.

Key changes:

  • Added PRELOAD_MODELS environment variable to enable asynchronous preloading of specified models at server startup.

  • Implemented a /readiness endpoint for Kubernetes readiness probes to indicate when the server is ready to handle requests.

  • Added a /healthz endpoint for Kubernetes liveness probes to ensure the server is alive.

  • Updated http_api.py to handle model initialization with asynchronous tasks and readiness state tracking.

  • Updated CPU and GPU builds:

    • Allow building and pushing to Docker Hub with custom tags
    • Move the GPU build over to Depot
    • Created an internal action to determine the list of tags to build for CPU and GPU builds
      • This can be rolled out to all other Docker builds in the future.
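The preload-at-startup behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual `http_api.py` code; names such as `PreloadState`, `preload_all`, and `parse_preload_models` are hypothetical.

```python
import asyncio
import os


class PreloadState:
    """Tracks whether a load has been attempted for every configured model."""

    def __init__(self, model_ids):
        self.model_ids = list(model_ids)
        # Nothing to preload means the server is ready immediately.
        self.ready = not self.model_ids

    async def preload_all(self, load_model):
        """Best-effort preloading: a failing model must not block startup."""
        for model_id in self.model_ids:
            try:
                await load_model(model_id)
            except Exception:
                # Invalid or temporarily unavailable models are skipped.
                pass
        self.ready = True


def parse_preload_models(env=None):
    """PRELOAD_MODELS holds a comma-separated list of model IDs."""
    env = os.environ if env is None else env
    raw = env.get("PRELOAD_MODELS", "")
    return [m.strip() for m in raw.split(",") if m.strip()]
```

At startup the server would schedule `preload_all` as a background task and flip the readiness state only after every model has at least been attempted, matching the "don't block startup on invalid model IDs" behavior tested below.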

Dependencies:

  • No new external dependencies added.

Type of change


  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How has this change been tested? Please provide a test case or example of how you tested the change.

Locally, by setting the environment variables.

For example:

  • Tested the model preloading functionality with a mock PRELOAD_MODELS to simulate loading multiple models.
  • Verified the Kubernetes readiness and liveness endpoints using curl and simulated Kubernetes probes.
  • Also set invalid model IDs and verified that they do not block startup.

Any specific deployment considerations

  • Ensure PRELOAD_MODELS is configured with a comma-separated list of model IDs if preloading is required.
  • A valid API key must be stored in API_KEY.
  • Update Kubernetes deployment manifests to use the new /readiness and /healthz probe endpoints.
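The manifest update might look like the following sketch. Only the /healthz and /readiness paths come from this PR; the port and all timing values are illustrative assumptions to be adjusted per deployment.

```yaml
livenessProbe:
  httpGet:
    path: /healthz     # server is alive and answering requests
    port: 9001         # assumed container port; adjust to your deployment
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /readiness   # all preloads have at least been attempted
    port: 9001
  periodSeconds: 5
  # Give model preloading time to finish before marking the pod unready.
  failureThreshold: 12
```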

Docs

  • Docs updated? What were the changes:
    • Added information about the new environment variable PRELOAD_MODELS.
    • Documented the readiness and health check endpoints for Kubernetes.

@PawelPeczek-Roboflow (Collaborator) left a comment

This seems to be a great extension. I do have one comment: the error in health seems to be a non-recoverable one. Once any of the models cannot be loaded (either temporarily or due to an actual problem), the service will never get better:

  • I am not sure whether k8s would terminate it by default after some time, or whether it would fall into a loop of reboots.
  • I am also not 100% sure about the desired outcome in such a scenario - what is the context of this PR?

@alexnorell (Contributor, Author) commented

This seems to be a great extension. I do have one comment: the error in health seems to be a non-recoverable one. Once any of the models cannot be loaded (either temporarily or due to an actual problem), the service will never get better:

  • I am not sure whether k8s would terminate it by default after some time, or whether it would fall into a loop of reboots.
  • I am also not 100% sure about the desired outcome in such a scenario - what is the context of this PR?

I've addressed some of the edge cases with this change. The idea comes from the desire to preload commonly used models at startup rather than lazy-loading them on first request.

For the K8s side:

  • /healthz should return a positive status as soon as FastAPI is able to serve requests. We can extend this endpoint in the future with additional health states, but it is helpful to have at least something that lets us know the service is responding.
  • /readiness should return an error state until a load has at least been attempted for every configured model. Once that has happened, K8s will start routing traffic to the pod. The intention with this change is to do best-effort initialization, but not prevent the service from running if it cannot initialize.
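The split between the two probes can be illustrated with a stdlib stand-in (the real implementation uses FastAPI; the `READY` flag here plays the role of the readiness state tracked in `http_api.py`, and all names are hypothetical):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stands in for the readiness state tracked during model preloading:
# flipped to True once a load has been attempted for every model.
READY = {"value": False}


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: positive as soon as the server can answer at all.
            self._reply(200, {"status": "ok"})
        elif self.path == "/readiness":
            # Readiness: 503 until every preload has been attempted.
            if READY["value"]:
                self._reply(200, {"status": "ready"})
            else:
                self._reply(503, {"status": "loading"})
        else:
            self._reply(404, {"detail": "not found"})

    def _reply(self, code, body):
        payload = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep probe polling quiet
```

The key design point is that /healthz never depends on the preload outcome, while /readiness reports 503 only until loads have been attempted, so a permanently broken model degrades readiness temporarily but never liveness.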

@PawelPeczek-Roboflow (Collaborator) commented

ok, looks good - lmk if that is ready to be shipped

- Update GitHub Actions for deploying CPU and GPU containers
@alexnorell alexnorell marked this pull request as ready for review November 22, 2024 12:00
@alexnorell (Contributor, Author) commented

ok, looks good - lmk if that is ready to be shipped

Should be ready for review now

@alexnorell alexnorell merged commit a9bd1bf into main Nov 22, 2024
71 checks passed
@alexnorell alexnorell deleted the feature/default_model_load branch November 22, 2024 12:13