Skip to content

Latest commit

 

History

History
223 lines (169 loc) · 7.41 KB

elasticdl_estimator.md

File metadata and controls

223 lines (169 loc) · 7.41 KB

Train TensorFlow Estimator Models using ElasticDL on Personal Computer

This document shows how to run an ElasticDL job to train a tf.estimator model using iris dataset on Minikube.

Prerequisites

  1. Install Minikube, preferably >= v1.11.0, following the installation guide. Minikube runs a single-node Kubernetes cluster in a virtual machine on your personal computer.

  2. Install Docker CE, preferably >= 18.x, following the guide for building Docker images containing user-defined models and the ElasticDL framework.

  3. Install Python, preferably >= 3.6, because the ElasticDL command-line tool is in Python.

Models

Among all machine learning toolkits that ElasticDL can work with, TensorFlow is the most tested and used. In this tutorial, we use a model from the model zoo directory. This model is defined using TensorFlow estimator API.

Datasets

We use the iris dataset in this tutorial.

mkdir ./data
wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data -O ./data/iris.data

The Kubernetes Cluster

The following command starts a Kubernetes cluster locally using Minikube. It uses VirtualBox, a hypervisor coming with macOS, to create the virtual machine cluster.

minikube start --vm-driver=virtualbox \
  --cpus 2 --memory 6144 --disk-size=50gb 
eval $(minikube docker-env)

The command minikube docker-env returns a set of Bash environment variable to configure your local environment to re-use the Docker daemon inside the Minikube instance.

The following command is necessary to enable RBAC of Kubernetes.

kubectl apply -f \
  https://raw.githubusercontent.com/sql-machine-learning/elasticdl/develop/elasticdl/manifests/elasticdl-rbac.yaml

If you happen to live in a region where raw.githubusercontent.com is banned, you might want to Git clone the above repository to get the YAML file.

Install ElasticDL Client Tool

The following command installs the command line tool elasticdl, which talks to the Kubernetes cluster and operates ElasticDL jobs.

pip install elasticdl_client

Build the Docker Image with Model Definition

Kubernetes runs Docker containers, so we need to put user-defined models, the ElasticDL api package and all dependencies into a Docker image.

In this tutorial, we use a predefined model in the ElasticDL repository. To retrieve the source code, please run the following command.

git clone https://github.com/sql-machine-learning/elasticdl

The estimator model definition is in directory elasticdl/model_zoo/iris.

We build the image based on tensorflow:1.13.2 and the dockerfile is

FROM tensorflow/tensorflow:1.13.2-py3 as base

RUN pip install elasticdl_api

COPY ./model_zoo model_zoo

Then, we use docker to build the image

docker build -t elasticdl:iris_estimator -f ${iris_dockerfile} .

Submit the Training Job

The following command submits a training job:

elasticdl train \
  --image_name=elasticdl:1.0.0 \
  --worker_image=elasticdl:iris_estimator \
  --ps_image=elasticdl:iris_estimator \
  --job_command="python -m model_zoo.iris.dnn_estimator" \
  --master_resource_request="cpu=0.2,memory=1024Mi" \
  --master_resource_limit="cpu=1,memory=2048Mi" \
  --num_ps=1 \
  --ps_resource_request="cpu=0.2,memory=1024Mi" \
  --ps_resource_limit="cpu=1,memory=2048Mi" \
  --num_workers=1 \
  --worker_resource_request="cpu=0.3,memory=1024Mi" \
  --worker_resource_limit="cpu=1,memory=2048Mi" \
  --chief_resource_request="cpu=0.3,memory=1024Mi" \
  --chief_resource_limit="cpu=1,memory=2048Mi" \
  --num_evaluator=1 \
  --evaluator_resource_request="cpu=0.3,memory=1024Mi" \
  --evaluator_resource_limit="cpu=1,memory=2048Mi" \
  --job_name=test-iris-estimator \
  --image_pull_policy=Never \
  --distribution_strategy=ParameterServerStrategy \
  --need_tf_config=true \
  --volume="host_path={iris_data_dir},mount_path=/data" \

--image_name is the image to launch the ElasticDL master which has nothing to do with the estimator model. The ElasticDL master is responsible for launching pod and assigning data shards to workers with elasticity and fault-tolerance.

{iris_data_dir} is the absolute path of the ./data with iris.data. Here, the option --volume="host_path={iris_data_dir},mount_path=/data" bind mount it into the containers/pods.

The option --num_workers=1 tells the master to start a worker pod. The option --num_ps=1 tells the master to start a ps pod. The option --num_evaluator tells the master to start an evaluator pod.

And the master will start a chief worker for a TensorFlow estiamtor model by default.

Check Job Status

After the job submission, we can run the command kubectl get pods to list related containers.

NAME                                     READY   STATUS    RESTARTS   AGE
elasticdl-test-iris-estimator-master     1/1     Running   0          9s
test-iris-estimator-edljob-chief-0       1/1     Running   0          6s
test-iris-estimator-edljob-evaluator-0   0/1     Pending   0          6s
test-iris-estimator-edljob-ps-0          1/1     Running   0          7s
test-iris-estimator-edljob-worker-0      1/1     Running   0          6s

Train an Estimator Model Using ElasticDL with Your Dataset

You only need to modify your input_fn with ElasticDL DataShardService. The DataShardService will split the sample indices into ranges and assign those ranges to workers. The worker only need to read samples by indices in those ranges.

  1. Create a DataShardService.
from elasticai_api.common.data_shard_service import build_data_shard_service

training_data_shard_svc = build_data_shard_service(
        batch_size=batch_size,
        num_epochs=100,
        dataset_size=len(rows),
        num_minibatches_per_shard=1,
        dataset_name="iris_training_data",
    )
  • batch_size: Batch size of each step.
  • num_epochs: The number of epochs.
  • dataset_size: The total number of samples in the dataset.
  • num_minibatches_per_shard: The number of batches in each shard. The number of samples in each shard is batch_size * num_minibatches_per_shard
  • dataset_name: The name of dataset.
  1. Create a generator by reading samples with shards.

The shard.start and shard.end is the start index and end index of samples in those shard. You can read samples by the two indices like:

def train_generator(shard_service):
    while True:
        # Read samples by the range of the shard from
        # the data shard serice.
        shard = shard_service.fetch_shard()
        if not shard:
            break
        for i in range(shard.start, shard.end):
            label = CATEGORY_CODE[rows[i][-1]]
            yield rows[i][0:-1], [label]
  1. Create a session hook to report shard
from elasticai_api.tensorflow.hooks import ElasticDataShardReportHook

hooks = [
    ElasticDataShardReportHook(training_data_shard_svc),
]
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, hooks=hooks)

After 3 steps, you can train your estimator models using ElasticDL in data parallel mode.