
🧠 Boltzmann is an open-source distributed task orchestrator

NeutrinoCorp/boltzmann

Boltzmann

Boltzmann is a lightweight distributed task orchestrator.

Based on the Scheduler Agent Supervisor cloud pattern, Boltzmann is a master-less service that schedules batches of tasks in a parallel, distributed way.

Depending on its configuration, a Boltzmann node may be stateless or stateful, since task state can be stored in either an embedded or an external database (e.g. Redis).

Worker pools (i.e. Boltzmann nodes) remain correct even in a distributed environment through the use of leases (i.e. distributed mutex locks) and a small leader-election consensus algorithm.

Leases are implemented using either the Redlock algorithm or a data structure built into the storage engine (e.g. etcd leases).

Architecture

High-Level Architecture Diagram

Task Scheduler

The Scheduler arranges for the steps that make up the task to be executed and orchestrates their operation. These steps can be combined into a pipeline or workflow. The Scheduler is responsible for ensuring that the steps in this workflow are performed in the right order.

As each step is performed, the Scheduler records the state of the workflow, such as "step not yet started," "step running," or "step completed." The state information should also include an upper limit of the time allowed for the step to finish, called the complete-by time.

If a step requires access to a remote service or resource, the Scheduler invokes the appropriate Agent, passing it the details of the work to be performed. The Scheduler typically communicates with an Agent using asynchronous request/response messaging.

Agent

The Agent contains logic that encapsulates a call to a remote service, or access to a remote resource referenced by a step in a task. Each Agent typically wraps calls to a single service or resource, implementing the appropriate error handling and retry logic (subject to a timeout constraint, described later).

Supervisor

The Supervisor monitors the status of the steps in the task being performed by the Scheduler. It runs periodically (the frequency will be system-specific), and examines the status of steps maintained by the Scheduler. If it detects any that have timed out or failed, it arranges for the appropriate Agent to recover the step or execute the appropriate remedial action (this might involve modifying the status of a step).

Note that the recovery or remedial actions are implemented by the Scheduler and Agents. The Supervisor should simply request that these actions be performed.

Usage

Currently, Boltzmann can be used in two ways, which are not mutually exclusive:

  • An HTTP REST API (HTTP/1.1).
  • A gRPC streaming API (HTTP/2, multiplexed).
