AI for Earth Grantees have the option of hosting their completed APIs on the official AI for Earth API platform. These APIs are subject to the AI for Earth acceptance criteria. Each of the following sections identifies a requirement that must be met before the API is migrated to the AI for Earth hosting platform.
- API Design
- Testing
- Documentation
- Kubernetes
- Reliability
- DevOps
- Telemetry
- Publishing to the AI for Earth API Platform
Validation of API input must be performed prior to any processing. This ensures fail-fast behavior, decreases unnecessary resource utilization, and provides immediate information to the caller.
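As a minimal sketch of validating before processing, the following checks a request payload and raises immediately on bad input. The `DetectionRequest` fields (`image_url`, `confidence`) and the `ValidationError` type are hypothetical names chosen for illustration; they are not part of any AI for Earth API.

```python
from dataclasses import dataclass


class ValidationError(ValueError):
    """Raised when a request payload fails validation."""


@dataclass(frozen=True)
class DetectionRequest:
    # Hypothetical fields, used only for illustration.
    image_url: str
    confidence: float


def validate_request(payload) -> DetectionRequest:
    """Check the payload before any expensive work and fail fast on errors."""
    if not isinstance(payload, dict):
        raise ValidationError("payload must be a JSON object")

    image_url = payload.get("image_url")
    if not isinstance(image_url, str) or not image_url.startswith("https://"):
        raise ValidationError("image_url must be an https URL")

    confidence = payload.get("confidence", 0.5)
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValidationError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValidationError("confidence must be between 0 and 1")

    return DetectionRequest(image_url=image_url, confidence=float(confidence))
```

A request handler would call `validate_request` first and translate `ValidationError` into a 4xx response, so that no model or storage resources are touched for a bad request.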
For long-running/async APIs, a task ID must be returned to the caller immediately. Use threading or parallel processing so that the request handler does not block while the work runs.
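The pattern of returning a task ID immediately can be sketched as follows. `InMemoryTaskStore` and `submit` are hypothetical names, and the in-memory store is only a stand-in used to keep the example self-contained; it is not the platform's task manager, and a deployed API would need to keep task status outside the process.

```python
import threading
import uuid


class InMemoryTaskStore:
    """Hypothetical stand-in for a task manager; for illustration only."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tasks = {}

    def create(self) -> str:
        task_id = str(uuid.uuid4())
        with self._lock:
            self._tasks[task_id] = {"status": "created", "result": None}
        return task_id

    def update(self, task_id, status, result=None):
        with self._lock:
            self._tasks[task_id] = {"status": status, "result": result}

    def get(self, task_id):
        with self._lock:
            return dict(self._tasks[task_id])


def submit(store, work, *args):
    """Start `work` on a background thread and return the task id right away."""
    task_id = store.create()

    def runner():
        store.update(task_id, "running")
        try:
            store.update(task_id, "completed", work(*args))
        except Exception as exc:
            # Report the failure through the task status instead of crashing.
            store.update(task_id, "failed", str(exc))

    thread = threading.Thread(target=runner, daemon=True)
    thread.start()
    return task_id, thread
```

The caller receives `task_id` before `work` finishes and polls a status endpoint backed by `store.get(task_id)`. The thread is returned here only so that the example can be joined in a test.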
For synchronous APIs, a result must be returned in fewer than 5 seconds. If not, the API shall be converted to an async API and must utilize the task manager.
All APIs must maintain a stateless design among requests. If state is required between requests, a distributed state system must be implemented. This state system must be atomic and lock-free in nature and must be able to handle loads associated with the API.
APIs are to be tested in the following categories and, when applicable, results submitted with the request for acceptance.
- Test functionality against requirements and API design specification
- Test typical use cases
- Test edge cases, including any possible out-of-bounds input
- Test very large inputs and very small inputs
- Test empty input
- Test unsupported REST methods
- Test bad input
- Test permission issues for SAS Blobs, etc.
- Test for timeouts when handling large inputs
- Gather metrics relating to running on GPU vs CPU
- Ensure that all failure points occur early in execution (fail-fast)
- Tune based on performance testing
- Gather metrics for incremental loads
- Identify maximum load for a single instance
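One way to gather metrics for incremental loads is to run the same call at increasing concurrency levels and record the latency at each level. The sketch below is an assumption about how such a measurement might be structured; `measure_incremental_load` is a hypothetical helper, and `call` stands for whatever function issues a single request to the API under test.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def measure_incremental_load(call, levels, requests_per_level=20):
    """Run `call` at each concurrency level and record mean latency in seconds."""
    results = {}

    def timed_call(_):
        start = time.perf_counter()
        call()
        return time.perf_counter() - start

    for workers in levels:
        # Each level uses its own pool so that concurrency is capped at `workers`.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            latencies = list(pool.map(timed_call, range(requests_per_level)))
        results[workers] = mean(latencies)
    return results
```

The level at which mean latency starts to climb sharply, or at which requests begin to fail, gives a first estimate of the maximum load for a single instance.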
Several areas of the API require documentation. The documentation must be versioned with the API.
- Document the purpose, use cases, and end-to-end scenarios
- Document relationship between input and output
- Document typical usage and edge case usage
- The API must be documented with Swagger, including input, output, schemas, etc.
- Swagger documentation must include sample inputs.
- If the API requires input other than JSON, example input (files, etc.) must be provided.
- Example input must cover all computation possibilities.
Step-by-step usage instructions must be provided, which shall include:
- How to generate input data
- How to utilize output
- How to interpret statuses
- How to interpret error conditions
- How to fix errors
Document all integration points with external sources:
- How to generate input with external tools, software, etc.
- How to utilize output with external tools, software, etc.
- How to utilize other APIs or software to create end-to-end capabilities
We host APIs in a custom Kubernetes cluster. To ensure API availability and dynamic scaling, information needs to be provided for the following criteria.
- Typical CPU usage for a single request
- Maximum CPU usage for a single request
- GPU execution time
- CPU execution time
- Typical memory usage for a single request
- Maximum memory usage for a single request
- The API must contain an endpoint that returns a health check. This health check must be added to the Dockerfile, such as:
HEALTHCHECK --interval=1m --timeout=3s --start-period=20s \
CMD curl -f http://localhost/ || exit 1
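For reference, an endpoint that the `HEALTHCHECK` above could probe might look like the following sketch, written with Python's standard library only. The response body, the `HealthHandler` and `make_server` names, and the choice to serve the check at `/` are assumptions made to match the `curl` command; a real API would typically expose this through its own web framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/":
            body = json.dumps({"status": "healthy"}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Silence per-request logging; health probes arrive at every interval.
        pass


def make_server(port=80):
    """Build a server that answers the health probe on the given port."""
    return HTTPServer(("0.0.0.0", port), HealthHandler)
```

Because `curl -f` exits with a non-zero code on HTTP errors, any non-2xx response from this endpoint marks the container as unhealthy.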
Based on the load of a single request (typical and largest), estimate the resource thresholds that indicate that the service must be scaled up and down.
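As a rough illustration of that estimate, the sketch below computes how many worst-case requests fit into one instance's memory while reserving some headroom. The function name, its parameters, and the 20% default headroom are assumptions for the example, not platform values.

```python
import math


def estimate_capacity(instance_memory_mb, baseline_mb, max_request_mb, headroom=0.2):
    """Return how many worst-case requests fit in one instance with headroom reserved.

    baseline_mb is the memory the service uses when idle (for example, a loaded model).
    """
    usable = instance_memory_mb * (1.0 - headroom) - baseline_mb
    if usable <= 0 or max_request_mb <= 0:
        return 0
    return math.floor(usable / max_request_mb)
```

For example, an 8192 MB instance with a 2048 MB idle footprint, 512 MB per worst-case request, and 25% headroom gives `estimate_capacity(8192, 2048, 512, 0.25) == 8`, so sustained concurrency near 8 requests per instance would be a candidate scale-up threshold.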
A best effort shall be made to prevent an instance from continuous recycling.
Stagger the loading of large objects into memory such that an instance can quickly respond to requests upon startup.
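One possible reading of this requirement is to load large objects on a background thread so that the service can answer health checks and lightweight requests while loading is still in progress. The `LazyResource` class below is a hypothetical sketch of that idea; `loader` stands for whatever function loads the model or dataset.

```python
import threading


class LazyResource:
    """Load an expensive object in the background so startup is not blocked."""

    def __init__(self, loader):
        self._loader = loader
        self._ready = threading.Event()
        self._value = None
        self._error = None
        self._thread = threading.Thread(target=self._load, daemon=True)

    def start(self):
        self._thread.start()

    def _load(self):
        try:
            self._value = self._loader()
        except Exception as exc:
            self._error = exc
        finally:
            # Signal completion whether loading succeeded or failed.
            self._ready.set()

    @property
    def ready(self):
        return self._ready.is_set() and self._error is None

    def get(self, timeout=None):
        """Return the loaded object, waiting up to `timeout` seconds for it."""
        if not self._ready.wait(timeout):
            raise TimeoutError("resource is still loading")
        if self._error is not None:
            raise self._error
        return self._value
```

Several such resources can be started one after another at startup, and request handlers can check `ready` to return a "still loading" status instead of blocking.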
Since AI for Earth will be assuming initial DevOps, detailed instructions shall be provided for the following categories.
Document all known possible failure cases and how to resolve them. Write this as a playbook in which each case is identified and step-by-step directions, with code, are provided.
Provide several datasets that can be used for acceptance testing, load testing, stress testing, and functional testing.
Any custom setup instructions, along with required architectural components, must be clearly documented.
Provide at least three contacts for additional DevOps support. For each contact, include a phone number, email address, Slack/Teams channel, etc.
Clearly identify any restrictions for API usage. This includes denoting any sensitive issues.
The following questions shall be answered:
- What restrictions exist for API usage?
- Who makes decisions on key approvals? Provide contact information.
AI for Earth collects telemetry from our back-end system in order to provide a reliable service. No PII (personally identifiable information) is collected by AI for Earth.
The API shall include an endpoint that reports the state of the service.
Performance metrics, along with trace logging, must be included. The trace logging shall include execution time. To help identify performance requirements, input size and similar request attributes should be included in the trace log.
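A trace log that records execution time and input size could be produced with a small decorator, as in the sketch below. The logger name `api.trace`, the log field names, and the `count_bytes` function are hypothetical. Only the size of the input is logged, never its content, which keeps PII out of the trace.

```python
import functools
import logging
import time

logger = logging.getLogger("api.trace")


def traced(func):
    """Log execution time and input size for each call to `func`."""

    @functools.wraps(func)
    def wrapper(payload: bytes):
        start = time.perf_counter()
        status = "ok"
        try:
            return func(payload)
        except Exception:
            status = "error"
            raise
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            logger.info(
                "op=%s status=%s input_bytes=%d elapsed_ms=%.2f",
                func.__name__, status, len(payload), elapsed_ms,
            )

    return wrapper


@traced
def count_bytes(payload: bytes) -> int:
    # Placeholder for the real processing step.
    return len(payload)
```

Attaching a handler to the `api.trace` logger is then enough to forward these records to whichever telemetry sink the service uses.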
Telemetry is collected by AI for Earth for our back-end system. Additional logging can be sent to the API owner's Application Insights instance. Provide this information so that logs can be distributed.
Absolutely no PII (personally identifiable information) shall be collected by the API.
Identify alert conditions based on the collected telemetry. Include the response to such alerts in the DevOps playbook.
An AI for Earth engineer must perform the publishing to the platform, but before they can do so, your image must be made available to the AI for Earth team. Please follow these steps to complete this process.
- Create an Azure Container Registry in your Azure subscription.
- Tag and push your image to the repository. The images must be versioned. Internally, we use the following naming pattern:
<ACR_name>.azurecr.io/<grantee_moniker>/<image_version>-<api_name>:<build_number>
- Contact the AI for Earth Engineering Team to obtain the AI for Earth publisher Azure Active Directory application name. This application will be used to deploy your API image to the AI for Earth API Platform.
- Grant 'AcrPull' role access to the Azure AD application obtained in the previous step.
- Ensure that your API's documentation is up-to-date and the API has been fully tested and verified.
- Notify the AI for Earth Engineering Team of your intention to deploy to production.
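The image naming pattern from the steps above can be assembled programmatically, for example in a build script. The sketch below is illustrative only: `build_image_reference` is a hypothetical helper, and its segment check is a simplified approximation of container repository naming rules (lowercase letters and digits separated by single `.`, `_`, or `-`), not the registry's full validation.

```python
import re

# Simplified check: lowercase alphanumerics separated by single '.', '_' or '-'.
_SEGMENT = re.compile(r"^[a-z0-9]+(?:[._-][a-z0-9]+)*$")


def build_image_reference(acr_name, grantee_moniker, image_version, api_name, build_number):
    """Assemble <ACR_name>.azurecr.io/<grantee_moniker>/<image_version>-<api_name>:<build_number>."""
    repository_leaf = f"{image_version}-{api_name}"
    for segment in (grantee_moniker, repository_leaf):
        if not _SEGMENT.match(segment):
            raise ValueError(f"invalid repository segment: {segment!r}")
    return f"{acr_name}.azurecr.io/{grantee_moniker}/{repository_leaf}:{build_number}"
```

With placeholder values, `build_image_reference("myacr", "contoso", "1.0", "landcover", 42)` returns `myacr.azurecr.io/contoso/1.0-landcover:42`, which can then be passed to the tag and push commands.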