Skip to content

ElasticDL Introduction

QI JUN edited this page Nov 11, 2019 · 4 revisions

ElasticDL Introduction

ElasticDL is a Kubernetes-native deep learning framework that supports fault-tolerance and elastic scheduling.

It has four main features:

  • Fault-tolerance and elastic scheduling
  • Providing a unified data pipeline for both training and serving
  • Integrating with TensorFlow 2.0 and PyTorch easily
  • Integrating with SQLFlow in native

Fault-tolerance and elastic scheduling

In large-scale training, it is probable that a training job fails due to hardware failure or getting preempted by the job scheduling mechanism. ElasticDL support fault-tolerance to make sure the job could go on other than crashing.

The feature of fault-tolerance makes ElasticDL works with the priority-based preemption of Kubernetes to achieve elastic scheduling. When Kubernetes kills some processes of a job to free resources for new-coming jobs with higher priority, the current job doesn't fail but continues with fewer resources.

Elastic scheduling could significantly improve the overall utilization of a cluster. Suppose that a cluster has N GPUs, and a job is using one of them. Without elastic scheduling, a new job claiming N GPUs would have to wait for the first job to complete before starting. This pending time could be hours, days, or even weeks. During this very long time, the utilization of the cluster is 1/N. With elastic scheduling, the new job could start running immediately with N-1 GPUs, and Kubernetes might increase its GPU consumption by 1 after the first job completes. In this case, the overall utilization is 100%.

We provide a distributed communication package, which implements two mainstream communication strategies: parameter server and Allreduce. Both of these two strategies support fault-tolerance and elastic scheduling. For more details, please refer to the design doc of parameter server and Allreduce

Unified data pipeline

TensorFlow Transform is a library for preprocessing data with TensorFlow.

Integrating with deep learning frameworks

We are trying to provide a pluggable distributed communication package, which could be integrated with deep learning frameworks easily. We do not touch the runtime of a deep learning framework. The following are examples, which show how to integrate ElasticDL with TensorFlow 2.0 and PyTorch with different communication strategies.

In TensorFlow

  • Parameter Server
def train(model, dataset, optimizer):
    for x, y in dataset:
        with tf.GradientTape() as tape:
            prediction = model(x)
            loss = loss_fn(prediction, y)
        gradients = tape.gradient(loss, model.trainable_variables)
        for grad in reversed(gradients):
            push_to_ps(grad)
  • Allreduce
def train(model, dataset, optimizer):
    for x, y in dataset:
        with tf.GradientTape() as tape:
            prediction = model(x)
            loss = loss_fn(prediction, y)
        gradients = tape.gradient(loss, model.trainable_variables)
        for grad in reversed(gradients):
            allreduce_average(grad)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))

In PyTorch

  • Parameter Server
def train(model, device, train_loader, optimizer):
    model.train()
    for data, target in train_loader:
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        for param in model.parameters():
            push_to_ps(param.grad.data)
  • Allreduce
def train(model, device, train_loader, optimizer):
    model.train()
    for data, target in train_loader:
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        for param in model.parameters():
            allreduce_average(param.grad.data)
        optimizer.step()

Integrating with SQLFLow

SQLFlow is a bridge that connects a SQL engine, e.g. MySQL, Hive or MaxCompute, with TensorFlow, XGBoost and other machine learning toolkits. SQLFlow extends the SQL syntax to enable model training, prediction and model explanation.

SQLFlow translates a SQL program, perhaps with extended SQL syntax for AI, into a workflow. The workflow describes how to launch an ElasticDL job.