diff --git a/.gitmodules b/.gitmodules new file mode 100644 index 0000000..8642c67 --- /dev/null +++ b/.gitmodules @@ -0,0 +1,6 @@ +[submodule "examples/tensorflow/models"] + path = examples/tensorflow/models + url = https://github.com/tensorflow/models/ +[submodule "examples/mxnet"] + path = examples/mxnet + url = https://github.com/dmlc/mxnet diff --git a/LICENSE.txt b/LICENSE.txt new file mode 100644 index 0000000..2ca938e --- /dev/null +++ b/LICENSE.txt @@ -0,0 +1,29 @@ +Amazon Software License +1. Definitions +"Licensor" means any person or entity that distributes its Work. + +"Software" means the original work of authorship made available under this License. + +"Work" means the Software and any additions to or derivative works of the Software that are made available under this License. + +The terms "reproduce," "reproduction," "derivative works," and "distribution" have the meaning as provided under U.S. copyright law; provided, however, that for the purposes of this License, derivative works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work. + +Works, including the Software, are "made available" under this License by including in or with the Work either (a) a copyright notice referencing the applicability of this License to the Work, or (b) a copy of this License. +2. License Grants +2.1 Copyright Grant. Subject to the terms and conditions of this License, each Licensor grants to you a perpetual, worldwide, non-exclusive, royalty-free, copyright license to reproduce, prepare derivative works of, publicly display, publicly perform, sublicense and distribute its Work and any resulting derivative works in any form. +2.2 Patent Grant. Subject to the terms and conditions of this License, each Licensor grants to you a perpetual, worldwide, non-exclusive, royalty-free patent license to make, have made, use, sell, offer for sale, import, and otherwise transfer its Work, in whole or in part. 
The foregoing license applies only to the patent claims licensable by Licensor that would be infringed by Licensor’s Work (or portion thereof) individually and excluding any combinations with any other materials or technology. +3. Limitations +3.1 Redistribution. You may reproduce or distribute the Work only if (a) you do so under this License, (b) you include a complete copy of this License with your distribution, and (c) you retain without modification any copyright, patent, trademark, or attribution notices that are present in the Work. +3.2 Derivative Works. You may specify that additional or different terms apply to the use, reproduction, and distribution of your derivative works of the Work ("Your Terms") only if (a) Your Terms provide that the use limitation in Section 3.3 applies to your derivative works, and (b) you identify the specific derivative works that are subject to Your Terms. Notwithstanding Your Terms, this License (including the redistribution requirements in Section 3.1) will continue to apply to the Work itself. +3.3 Use Limitation. The Work and any derivative works thereof only may be used or intended for use with the web services, computing platforms or applications provided by Amazon.com, Inc. or its affiliates, including Amazon Web Services, Inc. +3.4 Patent Claims. If you bring or threaten to bring a patent claim against any Licensor (including any claim, cross-claim or counterclaim in a lawsuit) to enforce any patents that you allege are infringed by any Work, then your rights under this License from such Licensor (including the grants in Sections 2.1 and 2.2) will terminate immediately. +3.5 Trademarks. This License does not grant any rights to use any Licensor’s or its affiliates’ names, logos, or trademarks, except as necessary to reproduce the notices described in this License. +3.6 Termination. 
If you violate any term of this License, then your rights under this License (including the grants in Sections 2.1 and 2.2) will terminate immediately. +4. Disclaimer of Warranty. +THE WORK IS PROVIDED "AS IS" WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WARRANTIES OR CONDITIONS OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE OR NON-INFRINGEMENT. YOU BEAR THE RISK OF UNDERTAKING ANY ACTIVITIES UNDER THIS LICENSE. SOME STATES’ CONSUMER LAWS DO NOT ALLOW EXCLUSION OF AN IMPLIED WARRANTY, SO THIS DISCLAIMER MAY NOT APPLY TO YOU. +5. Limitation of Liability. +EXCEPT AS PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER IN TORT (INCLUDING NEGLIGENCE), CONTRACT, OR OTHERWISE SHALL ANY LICENSOR BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF OR RELATED TO THIS LICENSE, THE USE OR INABILITY TO USE THE WORK (INCLUDING BUT NOT LIMITED TO LOSS OF GOODWILL, BUSINESS INTERRUPTION, LOST PROFITS OR DATA, COMPUTER FAILURE OR MALFUNCTION, OR ANY OTHER COMMERCIAL DAMAGES OR LOSSES), EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. +Effective Date – April 18, 2008 © 2008 Amazon.com, Inc. or its affiliates. All rights reserved. + +Note: Other license terms may apply to certain, identified software files contained within or distributed with the accompanying software if such terms are included in the directory containing the accompanying software. Such other license terms will then apply in lieu of the terms of the software license above. + diff --git a/NOTICE.txt b/NOTICE.txt new file mode 100644 index 0000000..fe2f4f5 --- /dev/null +++ b/NOTICE.txt @@ -0,0 +1,2 @@ +Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. + diff --git a/README.md b/README.md index 11e1bd2..08e7408 100644 --- a/README.md +++ b/README.md @@ -1,2 +1,162 @@ -# aws-deplearning-cfn -CFN cluster for DeepLearning AMIs. 
+# **Distributed Deep Learning Using MXNet and TensorFlow** + +[AWS CloudFormation](https://aws.amazon.com/cloudformation), which creates and configures Amazon Web Services resources with a template, simplifies the process of setting up a distributed deep learning cluster. The AWS CloudFormation Deep Learning template uses the [Amazon Deep Learning AMI](https://aws.amazon.com/marketplace/pp/B01M0AXXQB) (which provides MXNet, TensorFlow, Caffe, Theano, Torch, and CNTK frameworks) to launch a cluster of [EC2](https://aws.amazon.com/ec2) instances and other AWS resources needed to perform distributed deep learning. With this template, we continue with our mission to make [distributed deep learning easy](https://aws.amazon.com/blogs/compute/distributed-deep-learning-made-easy/). AWS CloudFormation creates all resources in the customer account. + +## What's New? +We've updated the AWS CloudFormation Deep Learning template to add some exciting new features and capabilities. + +* We've enhanced the AWS CloudFormation Deep Learning template with automation that continues stack creation even if the provisioned number of worker instances falls short of the desired count. In the previous version of the template, if one of the worker instances failed to be provisioned (for example, because it hit an account limit), AWS CloudFormation rolled back the stack and required you to adjust your desired count and restart the stack creation process. The new template includes a function that automatically adjusts the count down and proceeds with setting up the rest of the cluster (stack). + +* We now support creating a cluster of CPU Amazon EC2 instance types. + +* We've also added [Amazon Elastic File System (Amazon EFS)](https://aws.amazon.com/efs/) support for the cluster created with the template. + * Amazon EFS is automatically mounted on all worker instances during startup. + * Amazon EFS allows sharing of code, data, and results across worker instances. 
+ * Using Amazon EFS doesn't degrade performance for densely packed files (for example, .rec files containing image data). + +* We now support creating a cluster of instances running the Ubuntu operating system. See the [Ubuntu Deep Learning AMI](https://aws.amazon.com/marketplace/pp/B06VSPXKDX). + +## EC2 Cluster Architecture +The following architecture diagram shows the EC2 cluster infrastructure. +![](images/Slide0.png) + +## Resources Created by the Deep Learning Template +The Amazon Deep Learning template creates a stack that contains the following resources: + +* A VPC in the customer account. +* The requested number or available number of worker instances in an [Auto Scaling](https://aws.amazon.com/autoscaling) group within the VPC. These worker instances are launched in a private subnet. +* A master instance in a separate Auto Scaling group that acts as a proxy to enable connectivity to the cluster with SSH. AWS CloudFormation places this instance within the VPC and connects it to both the public and private subnets. This instance has both public IP addresses and DNS. +* An Amazon EFS file storage system configured in General Purpose performance mode. +* A mount target to mount Amazon EFS on the instances. +* A security group that allows external SSH access to the master instance. +* A security group that allows the master and worker instances to mount and access Amazon EFS through NFS port 2049. +* Two security groups that open ports on the private subnet for communication between the master and workers. +* An [AWS Identity and Access Management (IAM)](https://aws.amazon.com/iam) role that allows instances to poll [Amazon Simple Queue Service (Amazon SQS)](https://aws.amazon.com/sqs/) and access and query Auto Scaling groups and the private IP addresses of the EC2 instances. +* A NAT gateway used by the instances within the VPC to talk to the outside world. +* Two Amazon SQS queues to configure the metadata at startup on the master and the workers. 
+* An [AWS Lambda](https://aws.amazon.com/lambda/) function that monitors the Auto Scaling group's launch activities and modifies the desired capacity of the Auto Scaling group based on availability. +* An [Amazon Simple Notification Service (Amazon SNS)](https://aws.amazon.com/sns/) topic to trigger the Lambda function on Auto Scaling events. +* AWS CloudFormation WaitCondition and WaitConditionHandle resources, with a stack creation timeout of 55 minutes to complete metadata setup. + +## How the Deep Learning Template Works +The startup script enables SSH forwarding on all hosts. Enabling SSH agent forwarding is essential because frameworks such as MXNet use SSH for communication between master and worker instances during distributed training. + +The startup script on the master polls the master SQS queue for messages confirming that Auto Scaling setup is complete. The Lambda function sends two messages: one when the master Auto Scaling group is successfully set up, and a second when either the requested capacity is satisfied or instances fail to launch on the worker Auto Scaling group. When instance launch fails on the worker Auto Scaling group, the Lambda function modifies the desired capacity to the number of instances that have been successfully launched. + +Upon receiving messages on the Amazon SQS master queue, the setup script on the master configures all of the necessary worker metadata (IP addresses of the workers, GPU count, and so on) and broadcasts the metadata on the worker SQS queue. The startup script on each worker instance polls the worker SQS queue and, upon receiving this message, configures the metadata on that worker. 
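To make the broadcast step concrete, here is a minimal sketch of how a worker might turn the broadcast message into host entries. The `event`, `master-ip`, and `worker-ips` field names and the `deeplearning-master` / `deeplearning-workerN` host names are taken from the bootstrap script in this change; the function names and the IP addresses are hypothetical and exist only for illustration. This is not the bootstrap script itself, which also polls SQS and writes the files on disk.

```python
import json


def parse_worker_setup(msg_body):
    """Return (master_ip, worker_ips) for a worker-setup message, else None.

    Hypothetical helper: it only mirrors the message fields the bootstrap
    script reads (event, master-ip, worker-ips).
    """
    try:
        content = json.loads(msg_body)
    except ValueError:
        # Body was not valid JSON, so it cannot be a worker-setup message.
        return None
    if not isinstance(content, dict) or content.get("event") != "worker-setup":
        # Ignore every other kind of message.
        return None
    if "master-ip" not in content or "worker-ips" not in content:
        return None
    return content["master-ip"], content["worker-ips"]


def hosts_entries(master_ip, worker_ips):
    """Build /etc/hosts style lines, numbering the workers from 1."""
    lines = ["{} deeplearning-master".format(master_ip)]
    for index, worker_ip in enumerate(worker_ips, start=1):
        lines.append("{} deeplearning-worker{}".format(worker_ip, index))
    return lines


if __name__ == "__main__":
    # Example message body with made-up private IP addresses.
    body = json.dumps({
        "event": "worker-setup",
        "master-ip": "10.0.0.10",
        "worker-ips": ["10.0.1.11", "10.0.1.12"],
    })
    parsed = parse_worker_setup(body)
    if parsed is not None:
        for line in hosts_entries(*parsed):
            print(line)
```

Keeping the parsing separate from the polling makes it easy to check the message handling without an SQS queue.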
+ +The following environment variables are set up on all the instances: + +* **$DEEPLEARNING_WORKERS_PATH**: The file path that contains the list of workers + +* **$DEEPLEARNING_WORKERS_COUNT**: The total number of workers + +* **$DEEPLEARNING_WORKER_GPU_COUNT**: The number of GPUs on the instance + +* **$EFS_MOUNT**: The directory where Amazon EFS is mounted + +## Setting Up a Deep Learning Stack +To set up a deep learning AWS CloudFormation stack, follow [Using the AWS CloudFormation Deep Learning Template](cfn-template/StackSetup.md). + +## Running Distributed Training +To demonstrate how to run distributed training using [MXNet](http://mxnet.io/) and [TensorFlow](https://www.tensorflow.org/) frameworks, we use the standard [CIFAR-10 model](https://www.cs.toronto.edu/~kriz/cifar.html). CIFAR-10 is a sufficiently complex network that benefits from a distributed setup and that can be quickly trained on such a setup. + +[Log in to the master instance](cfn-template/StackSetup.md#logintomaster). Follow **Step 3** in [Using the AWS CloudFormation Deep Learning Template](cfn-template/StackSetup.md). + +Clone the [awslabs/deeplearning-cfn](https://github.com/awslabs/deeplearning-cfn) repo that contains the examples onto the EFS mount: + +**Note:** This could take a few minutes. + + git clone https://github.com/nswamy/deeplearning-cfn $EFS_MOUNT/deeplearning-cfn && \ + cd $EFS_MOUNT/deeplearning-cfn && \ + # + #fetches dmlc/mxnet and tensorflow/models repos as submodules + git submodule update --init $EFS_MOUNT/deeplearning-cfn/examples/tensorflow/models && \ + git submodule update --init $EFS_MOUNT/deeplearning-cfn/examples/mxnet && \ + cd $EFS_MOUNT/deeplearning-cfn/examples/mxnet/ && \ + git submodule update --init $EFS_MOUNT/deeplearning-cfn/examples/mxnet/dmlc-core + +### Running Distributed Training on MXNet + +The following example shows how to run CIFAR-10 with data parallelism on MXNet. Note the use of the DEEPLEARNING_* environment variables. 
+ + #terminate all running Python processes across workers + while read -u 10 host; do ssh -o "StrictHostKeyChecking no" $host "pkill -f python" ; \ + done 10<$DEEPLEARNING_WORKERS_PATH + + #navigate to the MXNet image-classification example directory \ + cd $EFS_MOUNT/deeplearning-cfn/examples/mxnet/example/image-classification/ + + #run the CIFAR10 distributed training example \ + ../../tools/launch.py -n $DEEPLEARNING_WORKERS_COUNT -H $DEEPLEARNING_WORKERS_PATH \ + python train_cifar10.py --gpus $(seq -s , 0 1 $(($DEEPLEARNING_WORKER_GPU_COUNT - 1))) \ + --network resnet --num-layers 50 --kv-store dist_device_sync + +We were able to run the training for 100 epochs in 25 minutes on 2 p2.8xlarge EC2 instances and achieve a training accuracy of 92%. + +These steps summarize how to get started. For more information about running distributed training on MXNet, see [Run MXNet on Multiple Devices](http://mxnet.readthedocs.io/en/latest/how_to/multi_devices.html). + +### Running Distributed Training on TensorFlow +The new template introduces [Amazon Elastic File System](https://aws.amazon.com/efs/), which makes it easy to share data among workers and to store the checkpoints and logs of all the TensorFlow processes in one place. You can now monitor all the logs on the master instance. + +For the TensorFlow distributed training example, we use the CIFAR-10 model provided by [TensorFlow](https://www.tensorflow.org/tutorials/deep_cnn#cifar-10_model) and the distributed training sample code discussed in [Distributed TensorFlow](https://www.tensorflow.org/versions/master/how_tos/distributed/). + +**Note:** This distributed training example is not tuned to achieve the greatest accuracy. It merely +shows how the deep learning AWS CloudFormation stack simplifies running distributed TensorFlow training. 
+ +Download the CIFAR-10 dataset from [Alex Krizhevsky's page](http://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz) +and extract the tar.gz file onto the EFS mount so you don't have to copy or download the dataset on all of the workers. + + mkdir $EFS_MOUNT/cifar10_data && \ + wget http://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz --directory-prefix=$EFS_MOUNT/cifar10_data \ + && tar -xzvf $EFS_MOUNT/cifar10_data/cifar-10-binary.tar.gz -C $EFS_MOUNT/cifar10_data + +We have included a script that generates the commands to run the workers and parameter servers on the worker instances. The script takes training_script as an argument; you can also pass any arguments that your distributed training script needs. + + cd $EFS_MOUNT/deeplearning-cfn/examples/tensorflow && \ + # generates commands to run workers and parameter-servers on all the workers \ + python generate_trainer.py --workers_file_path $DEEPLEARNING_WORKERS_PATH \ + --worker_count $DEEPLEARNING_WORKERS_COUNT \ + --worker_gpu_count $DEEPLEARNING_WORKER_GPU_COUNT \ + --trainer_script_dir $EFS_MOUNT/deeplearning-cfn/examples/tensorflow \ + --training_script $EFS_MOUNT/deeplearning-cfn/examples/tensorflow/cifar10_multi_machine_train.py \ + --batch_size 128 --data_dir=$EFS_MOUNT/cifar10_data \ + --train_dir=$EFS_MOUNT/deeplearning-cfn/examples/tensorflow/train \ + --log_dir $EFS_MOUNT/deeplearning-cfn/examples/tensorflow/logs \ + --max_steps 200000 + +Stop all of the Python processes that might be running on the workers: + + #terminate all running Python processes across workers \ + while read -u 10 host; do ssh -o "StrictHostKeyChecking no" $host "pkill -f python" ; \ + done 10<$DEEPLEARNING_WORKERS_PATH + +Run the distributed training across all of the workers: + + trainer_script_dir=$EFS_MOUNT/deeplearning-cfn/examples/tensorflow && while read -u 10 host; \ + do ssh -o "StrictHostKeyChecking no" $host "bash $trainer_script_dir/$host.sh" ; \ + done 10<$DEEPLEARNING_WORKERS_PATH + +Because the logs 
of all of the workers and the parameter servers are stored on Amazon EFS, you can now monitor them on the master: + + tail -f $EFS_MOUNT/deeplearning-cfn/examples/tensorflow/logs/* + +We were able to train this model in an hour on 2 p2.8xlarge EC2 instances running 2 parameter server processes and 16 worker processes for 200000 steps and reduce the loss to 0.82 averaged across the 16 workers. + +Running the evaluation script on the trained model achieves an accuracy of 77%: + + python $EFS_MOUNT/deeplearning-cfn/examples/tensorflow/models/tutorials/image/cifar10/cifar10_eval.py \ + --data_dir=$EFS_MOUNT/cifar10_data/ \ + --eval_dir=$EFS_MOUNT/deeplearning-cfn/examples/tensorflow/eval \ + --checkpoint_dir=$EFS_MOUNT/deeplearning-cfn/examples/tensorflow/train + +You can visualize the training on [TensorBoard](https://www.tensorflow.org/get_started/summaries_and_tensorboard) by running the following command on the master node. TensorBoard starts running on the private IP address of the master instance and port 6006. Make a note of this IP address because you will use it in the next command. + + tensorboard --logdir $EFS_MOUNT/deeplearning-cfn/examples/tensorflow/train + +Now use SSH port forwarding to see TensorBoard on your local computer. Run a command similar to the following on the local computer. (This uses SSH agent forwarding for credentials.) + + #In this example, 192.0.2.0 is the private IP address of the master instance and 203.0.113.0 is its public IP address; ec2-user is the default user on Amazon Linux + ssh -l ec2-user -L 6006:192.0.2.0:6006 203.0.113.0 + +To see TensorBoard, open [http://localhost:6006](http://localhost:6006) in a browser. For more information, see [TensorBoard: Visualizing Learning](https://www.tensorflow.org/get_started/summaries_and_tensorboard). 
diff --git a/cfn-bootstrap/LICENSE.txt b/cfn-bootstrap/LICENSE.txt new file mode 100644 index 0000000..2ca938e --- /dev/null +++ b/cfn-bootstrap/LICENSE.txt @@ -0,0 +1,29 @@ +Amazon Software License +1. Definitions +"Licensor" means any person or entity that distributes its Work. + +"Software" means the original work of authorship made available under this License. + +"Work" means the Software and any additions to or derivative works of the Software that are made available under this License. + +The terms "reproduce," "reproduction," "derivative works," and "distribution" have the meaning as provided under U.S. copyright law; provided, however, that for the purposes of this License, derivative works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work. + +Works, including the Software, are "made available" under this License by including in or with the Work either (a) a copyright notice referencing the applicability of this License to the Work, or (b) a copy of this License. +2. License Grants +2.1 Copyright Grant. Subject to the terms and conditions of this License, each Licensor grants to you a perpetual, worldwide, non-exclusive, royalty-free, copyright license to reproduce, prepare derivative works of, publicly display, publicly perform, sublicense and distribute its Work and any resulting derivative works in any form. +2.2 Patent Grant. Subject to the terms and conditions of this License, each Licensor grants to you a perpetual, worldwide, non-exclusive, royalty-free patent license to make, have made, use, sell, offer for sale, import, and otherwise transfer its Work, in whole or in part. The foregoing license applies only to the patent claims licensable by Licensor that would be infringed by Licensor’s Work (or portion thereof) individually and excluding any combinations with any other materials or technology. +3. Limitations +3.1 Redistribution. 
You may reproduce or distribute the Work only if (a) you do so under this License, (b) you include a complete copy of this License with your distribution, and (c) you retain without modification any copyright, patent, trademark, or attribution notices that are present in the Work. +3.2 Derivative Works. You may specify that additional or different terms apply to the use, reproduction, and distribution of your derivative works of the Work ("Your Terms") only if (a) Your Terms provide that the use limitation in Section 3.3 applies to your derivative works, and (b) you identify the specific derivative works that are subject to Your Terms. Notwithstanding Your Terms, this License (including the redistribution requirements in Section 3.1) will continue to apply to the Work itself. +3.3 Use Limitation. The Work and any derivative works thereof only may be used or intended for use with the web services, computing platforms or applications provided by Amazon.com, Inc. or its affiliates, including Amazon Web Services, Inc. +3.4 Patent Claims. If you bring or threaten to bring a patent claim against any Licensor (including any claim, cross-claim or counterclaim in a lawsuit) to enforce any patents that you allege are infringed by any Work, then your rights under this License from such Licensor (including the grants in Sections 2.1 and 2.2) will terminate immediately. +3.5 Trademarks. This License does not grant any rights to use any Licensor’s or its affiliates’ names, logos, or trademarks, except as necessary to reproduce the notices described in this License. +3.6 Termination. If you violate any term of this License, then your rights under this License (including the grants in Sections 2.1 and 2.2) will terminate immediately. +4. Disclaimer of Warranty. +THE WORK IS PROVIDED "AS IS" WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WARRANTIES OR CONDITIONS OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE OR NON-INFRINGEMENT. 
YOU BEAR THE RISK OF UNDERTAKING ANY ACTIVITIES UNDER THIS LICENSE. SOME STATES’ CONSUMER LAWS DO NOT ALLOW EXCLUSION OF AN IMPLIED WARRANTY, SO THIS DISCLAIMER MAY NOT APPLY TO YOU. +5. Limitation of Liability. +EXCEPT AS PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER IN TORT (INCLUDING NEGLIGENCE), CONTRACT, OR OTHERWISE SHALL ANY LICENSOR BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF OR RELATED TO THIS LICENSE, THE USE OR INABILITY TO USE THE WORK (INCLUDING BUT NOT LIMITED TO LOSS OF GOODWILL, BUSINESS INTERRUPTION, LOST PROFITS OR DATA, COMPUTER FAILURE OR MALFUNCTION, OR ANY OTHER COMMERCIAL DAMAGES OR LOSSES), EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. +Effective Date – April 18, 2008 © 2008 Amazon.com, Inc. or its affiliates. All rights reserved. + +Note: Other license terms may apply to certain, identified software files contained within or distributed with the accompanying software if such terms are included in the directory containing the accompanying software. Such other license terms will then apply in lieu of the terms of the software license above. + diff --git a/cfn-bootstrap/NOTICE.txt b/cfn-bootstrap/NOTICE.txt new file mode 100644 index 0000000..fe2f4f5 --- /dev/null +++ b/cfn-bootstrap/NOTICE.txt @@ -0,0 +1,2 @@ +Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. + diff --git a/cfn-bootstrap/dl_cfn_setup.py b/cfn-bootstrap/dl_cfn_setup.py new file mode 100644 index 0000000..cf93064 --- /dev/null +++ b/cfn-bootstrap/dl_cfn_setup.py @@ -0,0 +1,436 @@ +#!/usr/bin/python + +# Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. +# +# Licensed under the Amazon Software License (the "License"). +# You may not use this file except in compliance with the License. 
+# A copy of the License is located at +# +# http://aws.amazon.com/asl/ +# +# or in the "license" file accompanying this file. This file is distributed +# on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either +# express or implied. See the License for the specific language governing +# permissions and limitations under the License. + +import os +import boto +import boto.utils +from sets import Set +import logging +import json +import subprocess +import time +import sys +import datetime +import pwd +import grp +import os +import boto.ec2 +import boto.ec2.autoscale +import boto.sqs +import boto.cloudformation + +HOST_FILE = '/etc/hosts' +WORKER_FILE = '/opt/deeplearning/workers' +SLEEP_INTERVAL_IN_SECS = 30 +SQS_RECEIVE_INTERVAL_IN_SECS = 20 +AWS_DL_NODE_TYPE = None +AWS_DL_MASTER_QUEUE = None +AWS_DL_WORKER_QUEUE = None +AWS_DL_SETUP_TIMEOUT = None +AWS_DL_MASTERLAUNCH_TIMEOUT = None +AWS_DL_STACK_ID = None +AWS_DL_WAIT_HANDLE = None +AWS_REGION = None +AWS_DL_ROLE_NAME = None +AWS_DL_DEFAULT_USER = None +EFS_MOUNT = None + +AWS_GPU_INSTANCE_TYPES = [ "g2.2xlarge", "g2.8xlarge", "p2.xlarge", "p2.8xlarge", "p2.16xlarge" ] + +''' +Setup Logger and LogLevel +''' +def setup_logging(log_loc='/var/log'): + + log_file = '{}/dl_cfn_setup.log'.format(log_loc) + LOGGER = logging.getLogger('dl-cfn-setup') + LOGGER.setLevel(logging.INFO) + formatter = logging.Formatter('%(asctime)s %(levelname)s: %(filename)s:%(lineno)d %(message)s') + file_handler = logging.FileHandler(log_file) + file_handler.setFormatter(formatter) + console_handler = logging.StreamHandler() + console_handler.setFormatter(formatter) + + LOGGER.addHandler(file_handler) + LOGGER.addHandler(console_handler) + + return LOGGER + +def ping_host(hostname): + res = os.system("ping -c 1 -w 10 " + hostname) + return res == 0 + +def get_gpu_count(): + LOGGER.info('setup_gpu_count') + + instance_type = boto.utils.get_instance_metadata()['instance-type'] + if instance_type not in 
AWS_GPU_INSTANCE_TYPES: + LOGGER.info('Not a GPU Instance, number of GPUs: {}'.format(0)) + return 0 + try: + output = subprocess.check_output(['nvidia-smi', '-L']) + gpu_count = output.count('\n') + LOGGER.info("number of GPUs:{}".format(gpu_count)) + return gpu_count + except subprocess.CalledProcessError as e: + LOGGER.exception("Error executing nvidia-smi: {}".format(e)) + return 0 + +def setup_env_variables(master_instance_ip, worker_instance_ips, default_user, efs_mount): + LOGGER.info("setup_env_variables") + + with open(HOST_FILE, 'a') as hosts, open(WORKER_FILE, 'w+') as w: + hosts.write("{} deeplearning-master\n".format(master_instance_ip)) + worker_index=1 + for worker_ip in worker_instance_ips: + hosts.write("{} deeplearning-worker{}\n".format(worker_ip, worker_index) ) + w.write("deeplearning-worker{}\n".format(worker_index)) + worker_index += 1 + + gpu_count = get_gpu_count() + with open("/etc/profile.d/deeplearning.sh", "a") as f: + num_workers = sum(1 for line in open(WORKER_FILE, "r")) + f.write("export DEEPLEARNING_WORKERS_COUNT={}\n".format(num_workers)) + f.write("export DEEPLEARNING_WORKERS_PATH={}\n".format(WORKER_FILE)) + f.write("export DEEPLEARNING_WORKER_GPU_COUNT={}\n".format(gpu_count)) + f.write("export EFS_MOUNT={}\n".format(efs_mount)) + + #change ownership to ec2-user + uid = pwd.getpwnam(default_user).pw_uid + gid = grp.getgrnam(default_user).gr_gid + os.chown(WORKER_FILE, uid, gid) + + return + +''' +wait for asg setup success message from the lambda function +message will be of the format +{"min": 1, "desired": 1, "max": 1, "launched": 1, "status": "success", "asg": "cfn-test-WorkerAutoScalingGroup-1HPKVL6PJEVQS", "event": "asg-setup"} +''' +def wait_until_asg_success(master_queue_name, region, timeout): + LOGGER.info('wait_until_asg_success on queue_name:{}, timeout:{}'.format(master_queue_name, timeout)) + sqs_con = boto.sqs.connect_to_region(region_name=region) + sqs_queue = sqs_con.get_queue(queue_name = master_queue_name) + 
asg_success_message = {} + + start_time = time.time() + next_execution_ts = start_time + + while True: + LOGGER.info('checking autoscaling group success message at {}'.format(datetime.datetime.now())) + + recvd_messages = sqs_con.receive_message(queue=sqs_queue,number_messages=10, visibility_timeout=60) + LOGGER.info('number of messages received: {}'.format(len(recvd_messages))) + for msg in recvd_messages: + msg_body = msg.get_body() + LOGGER.info('received message with body:{}'.format(msg_body)) + try: + content = json.loads(msg_body) + if content is not None and content['event'] == 'asg-setup' and content['status'] == 'success': + # http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/standard-queues.html#standard-queues-at-least-once-delivery + # ignore duplicate message + if content['asg'] not in asg_success_message: + LOGGER.info('autoscaling_group: {} succeeded at {}'.format(content['asg'], datetime.datetime.now())) + asg_success_message[content['asg']] = content + else: + LOGGER.info('received duplicate sqs message for {} at {}'.format(content['asg'], datetime.datetime.now())) + sqs_con.delete_message(queue=sqs_queue, message=msg) + except (TypeError, KeyError) as e: + LOGGER.exception(e) + LOGGER.error(msg) + continue + + if len(asg_success_message) == 2: + LOGGER.info('status of all autoscaling_groups received') + break + + next_execution_ts = next_execution_ts + SLEEP_INTERVAL_IN_SECS + if (next_execution_ts > (start_time + timeout)): + LOGGER.info('timeout while checking asg status after {} seconds'.format(timeout)) + break + + LOGGER.info('not received all autoscaling group success at {}, WAITING :{}'.format(datetime.datetime.now(), SLEEP_INTERVAL_IN_SECS)) + time.sleep(next_execution_ts - time.time()) + + return asg_success_message + +def wait_for_worker_setup_message(worker_queue_name, timeout, region): + LOGGER.info('wait_for_worker_setup_message, worker_queue_name:{}, timeout:{}'.format(worker_queue_name, timeout)) + sqs_con = 
boto.sqs.connect_to_region(region_name=region) + sqs_queue = sqs_con.get_queue(queue_name = worker_queue_name) + + start_time = time.time() + next_execution_ts = start_time + + while True: + LOGGER.info('checking for worker_setup message at {}'.format(datetime.datetime.now())) + #visibility_timeout is set to 0, so that other workers can simultaneously act on this message + recvd_messages = sqs_con.receive_message(queue=sqs_queue,number_messages=10, visibility_timeout=0) + LOGGER.info('number of messages received: {}'.format(len(recvd_messages))) + for msg in recvd_messages: + msg_body = msg.get_body() + LOGGER.info('received message with body:{}'.format(msg_body)) + try: + content = json.loads(msg_body) + if content is not None and content['event'] == 'worker-setup': + LOGGER.info('received worker-setup success message: {}'.format(content)) + # do not delete the message, other workers need to consume this. + return content['master-ip'], content['worker-ips'] + else: + #don't act on other messages + continue + except (TypeError, KeyError) as e: + LOGGER.error(e) + LOGGER.error(msg) + continue + + next_execution_ts = next_execution_ts + SLEEP_INTERVAL_IN_SECS + if (next_execution_ts > (start_time + timeout)): + LOGGER.info('did not receive worker-setup success even after {} seconds'.format(timeout)) + return None + + LOGGER.info('worker setup is not complete at {}'.format(datetime.datetime.now())) + time.sleep(next_execution_ts - time.time()) + + return None + +def wait_until_instances_active(autoscaling_groups, timeout, region): + LOGGER.info('wait_until_instances_active, asgs:{}, timeout:{}'.format(autoscaling_groups, timeout)) + + autoscale_con = boto.ec2.autoscale.connect_to_region(region_name=region) + ec2_con = boto.ec2.connect_to_region(region_name=region) + start_time = time.time() + next_execution_ts = start_time + master_instance_ids = [] + worker_instance_ids = [] + master_instances = {} + worker_instances = {} + try: + # 
http://boto.cloudhackers.com/en/latest/ref/autoscale.html#boto.ec2.autoscale.group.AutoScalingGroup + # does not specify how to get the next token for pagination, + # since there are only 2 groups in our case, we will assume they will be returned in one call + groups = autoscale_con.get_all_groups(names=autoscaling_groups) + + for asg in groups: + instance_ids=[] + for instance in asg.instances: + if instance.health_status == 'Healthy': + instance_ids.append(instance.instance_id) + + if 'master' in asg.name.lower(): + master_instance_ids.extend(instance_ids) + else: + worker_instance_ids.extend(instance_ids) + LOGGER.info('from autoscale, found instances:{} for asg:{}'.format(instance_ids, asg.name)) + + LOGGER.info('worker_asg_instane_ids:{}, master_ids:{}'.format(worker_instance_ids, master_instance_ids)) + next_token = None + pending_instance_ids = master_instance_ids + worker_instance_ids + + while(True): + LOGGER.info('getting ec2 instance info:{}'.format(pending_instance_ids)) + reservations = ec2_con.get_all_reservations(instance_ids = pending_instance_ids, next_token = next_token) + next_token = reservations.next_token + + for r in reservations: + for i in r.instances: + if i.state.lower() == 'running': + if i.id in master_instance_ids: + LOGGER.info('master instance in running state, id:{}, ip:{}'.format(i.id, i.private_ip_address)) + master_instances[i.id] = i.private_ip_address + elif i.id in worker_instance_ids: + LOGGER.info('worker instance in running state, id:{}, ip:{}'.format(i.id, i.private_ip_address)) + worker_instances[i.id] = i.private_ip_address + LOGGER.info('worker:{}'.format(worker_instances)) + pending_instance_ids.remove(i.id) + elif i.state.lower() == 'pending': + LOGGER.info('instance is still in pending state, instance id:{}'.format(i.id)) + continue + + next_execution_ts = next_execution_ts + SLEEP_INTERVAL_IN_SECS + if (len(pending_instance_ids) == 0): + LOGGER.info('received info of all instances, master: {}, worker: 
{}'.format(master_instances, worker_instances)) + break + elif (next_token is not None): + LOGGER.info('next_token is not None, will continue fetching more instances') + continue + # time out only once the next poll would fall past the deadline + elif (next_execution_ts > start_time + timeout): + LOGGER.error('Reached timeout, pending_instance_ids:{}, next_token:{}'.format(pending_instance_ids, next_token)) + break + else: + LOGGER.info('not all instance info is available, pending: {}, waiting for {} seconds'.format(pending_instance_ids, SLEEP_INTERVAL_IN_SECS)) + time.sleep(next_execution_ts - time.time()) + + LOGGER.info('master: {}, worker: {}'.format(master_instances, worker_instances)) + return master_instances, worker_instances + except Exception as e: + LOGGER.exception(e) + return ({},{}) +''' +This method sends a success signal to the wait handle url. +It is assumed that the cfn-signal tool is available on the instance. +''' +def send_cfn_success_signal(stack_id, wait_handle_url, aws_region): + try: + instance_id = boto.utils.get_instance_metadata()['instance-id'] + command_args = ['/opt/aws/bin/cfn-signal', '--region', aws_region, '--stack', \ + stack_id, '--success', 'true', '--id', instance_id, wait_handle_url] + LOGGER.info('cfn-signal command: {}'.format(' '.join(map(str, command_args)))) + output = subprocess.check_output(command_args) + LOGGER.info(output) + except subprocess.CalledProcessError as e: + LOGGER.exception('FAILED to send cfn-signal') + sys.exit(1) + return + +''' +waits for a message on SQS for asg setup complete and instances are active. 
+fetches private ip addresses of the instances and sets up metadata +''' +def setup_worker_metadata(setup_timeout, master_queue_name, stack_id, region): + LOGGER.info('setup_worker_metadata') + + start_time = time.time() + asg_setup_messages = wait_until_asg_success(master_queue_name, region, setup_timeout) + # compare by value, not identity + if len(asg_setup_messages) != 2: + LOGGER.error('did not receive asg success message for all autoscaling_groups, received only: {}'.format(asg_setup_messages)) + sys.exit(1) + + master_asg_message = None + worker_asg_message = None + for key, value in asg_setup_messages.iteritems(): + LOGGER.info('asg success message:{}'.format(value)) + if 'master' in key.lower(): + master_asg_message = value + else: + worker_asg_message = value + + timeout = setup_timeout - (time.time() - start_time) + start_time = time.time() + + (master_instances, worker_instances) = wait_until_instances_active([master_asg_message['asg'], worker_asg_message['asg']], timeout, region) + LOGGER.info('from wait_until_instances_active, master: {}, worker:{}'.format(master_instances, worker_instances)) + if (len(master_instances) != 1): + LOGGER.error('expected single master, instead got instance ips:{}'.format(master_instances)) + sys.exit(1) + master_instance_ip = master_instances.values()[0] + worker_instance_ips = [master_instance_ip] + + if len(worker_instances) == 0: + LOGGER.info('no worker is launched, using only master instance as worker') + else: + worker_instance_ips.extend(worker_instances.values()) + + if (len(worker_instances) != worker_asg_message['launched']): + LOGGER.error('expected {} number of instances to be running, instead got instance_ids: {}, ips: {}' \ + .format(worker_asg_message['launched'], worker_instances.keys(), worker_instances.values()) ) + + worker_instance_ips = sorted(worker_instance_ips) + + return master_instance_ip, worker_instance_ips + +def send_worker_setup_msg(worker_queue_name, master_instance_ip, worker_instance_ips, region): + 
LOGGER.info('send_worker_setup_msg:{}'.format(send_worker_setup_msg)) + + sqs_con = boto.sqs.connect_to_region(region_name=region) + sqs_queue = sqs_con.get_queue(queue_name = worker_queue_name) + + worker_setup_message={'event' : 'worker-setup'} + worker_setup_message['master-ip'] = master_instance_ip + worker_setup_message['worker-ips'] = worker_instance_ips + + LOGGER.info('sending worker-setup message:{}'.format(json.dumps(worker_setup_message))) + sqs_con.send_message(queue=sqs_queue, message_content=json.dumps(worker_setup_message)) + +def check_instance_role_availability(role_name, timeout): + LOGGER.info('check_instance_role_availability, role_name:{}, timeout: {}'.format(role_name, timeout)) + + start_time = time.time() + next_execution_ts = start_time + while True: + LOGGER.info('checking presence of instance role: {}, @ :{}'.format(role_name, datetime.datetime.now())) + + try: + metadata = boto.utils.get_instance_metadata(version='latest',timeout=30, num_retries=5) + instance_role = metadata['iam']['security-credentials'][role_name] + # we don't want to log the credentials + del instance_role['AccessKeyId'] + del instance_role['SecretAccessKey'] + del instance_role['Token'] + LOGGER.info('SUCCESS getting instance role {}'.format(instance_role)) + return True + except KeyError as e: + LOGGER.info('FAILED to get instance role: {} @ {}'.format(role_name, datetime.datetime.now())) + pass + next_execution_ts = next_execution_ts + SLEEP_INTERVAL_IN_SECS + if (next_execution_ts > (start_time + timeout)): + LOGGER.info('TIMEOUT while checking instance role after {} seconds'.format(timeout)) + break + + LOGGER.info('WAITING :{} to get instance_role:{} @ {}'.format(SLEEP_INTERVAL_IN_SECS, role_name, datetime.datetime.now())) + time.sleep(next_execution_ts - time.time()) + return False + +LOGGER = setup_logging() +def main(): + LOGGER.info("main") + + try: + AWS_DL_NODE_TYPE = os.environ["AWS_DL_NODE_TYPE"] + AWS_DL_MASTER_QUEUE = os.environ['AWS_DL_MASTER_QUEUE'] 
+ AWS_DL_WORKER_QUEUE = os.environ['AWS_DL_WORKER_QUEUE'] + AWS_DL_WAITCONDITION_TIMEOUT = float(os.environ['AWS_DL_WAITCONDITION_TIMEOUT']) + AWS_DL_MASTERLAUNCH_TIMEOUT = float(os.environ['AWS_DL_MASTERLAUNCH_TIMEOUT']) + AWS_DL_STACK_ID = os.environ['AWS_DL_STACK_ID'] + AWS_DL_WAIT_HANDLE = os.environ['AWS_DL_WAIT_HANDLE'] + AWS_DL_ROLE_NAME = os.environ['AWS_DL_ROLE_NAME'] + AWS_DL_DEFAULT_USER = os.environ['AWS_DL_DEFAULT_USER'] + AWS_REGION = os.environ['AWS_REGION'] + EFS_MOUNT = os.environ['EFS_MOUNT'] + + LOGGER.info('AWS_DL_NODE_TYPE:{}\n AWS_DL_MASTER_QUEUE:{}\n AWS_DL_WORKER_QUEUE:{}\n AWS_DL_WAITCONDITION_TIMEOUT:{}\n, AWS_DL_MASTERLAUNCH_TIMEOUT:{}\n AWS_DL_STACK_ID:{}\n \ + AWS_DL_WAIT_HANDLE:{}\n AWS_DL_ROLE_NAME:{}\n AWS_REGION:{}, AWS_DL_DEFAULT_USER:{}, EFS_MOUNT:{}\n'.format(AWS_DL_NODE_TYPE, AWS_DL_MASTER_QUEUE, AWS_DL_WORKER_QUEUE, \ + AWS_DL_WAITCONDITION_TIMEOUT, AWS_DL_MASTERLAUNCH_TIMEOUT, AWS_DL_STACK_ID, AWS_DL_WAIT_HANDLE, AWS_DL_ROLE_NAME, AWS_REGION, AWS_DL_DEFAULT_USER, EFS_MOUNT) + ) + + # we want to make sure we finish before the timeout expires + setup_timeout = AWS_DL_WAITCONDITION_TIMEOUT - AWS_DL_MASTERLAUNCH_TIMEOUT + start_time = time.time() + check_instance_role_availability(AWS_DL_ROLE_NAME, setup_timeout) + setup_timeout = setup_timeout - (time.time() - start_time) + + # get master ips + if (AWS_DL_NODE_TYPE.lower() == 'master'): + master_instance_ip, worker_instance_ips = setup_worker_metadata(setup_timeout, AWS_DL_MASTER_QUEUE, AWS_DL_STACK_ID, AWS_REGION) + setup_env_variables(master_instance_ip, worker_instance_ips, AWS_DL_DEFAULT_USER, EFS_MOUNT) + send_worker_setup_msg(AWS_DL_WORKER_QUEUE, master_instance_ip, worker_instance_ips, AWS_REGION) + send_cfn_success_signal(AWS_DL_STACK_ID, AWS_DL_WAIT_HANDLE, AWS_REGION) + + elif (AWS_DL_NODE_TYPE.lower() == 'worker'): + master_instance_ip, worker_instance_ips = wait_for_worker_setup_message(AWS_DL_WORKER_QUEUE, setup_timeout, AWS_REGION) + if master_instance_ip is None or 
worker_instance_ips is None: + LOGGER.error('FAILED worker metadata setup : master_ip:{}, worker_ips:{}'.format(master_instance_ip, worker_instance_ips)) + sys.exit(1) + setup_env_variables(master_instance_ip, worker_instance_ips, AWS_DL_DEFAULT_USER, EFS_MOUNT) + else: + LOGGER.error('unknown node type: {}'.format(AWS_DL_NODE_TYPE)) + sys.exit(1) + + except Exception as e: + LOGGER.exception(e) + sys.exit(1) + +if __name__ =='__main__': + main() \ No newline at end of file diff --git a/cfn-lambda_function/LICENSE.txt b/cfn-lambda_function/LICENSE.txt new file mode 100644 index 0000000..2ca938e --- /dev/null +++ b/cfn-lambda_function/LICENSE.txt @@ -0,0 +1,29 @@ +Amazon Software License +1. Definitions +"Licensor" means any person or entity that distributes its Work. + +"Software" means the original work of authorship made available under this License. + +"Work" means the Software and any additions to or derivative works of the Software that are made available under this License. + +The terms "reproduce," "reproduction," "derivative works," and "distribution" have the meaning as provided under U.S. copyright law; provided, however, that for the purposes of this License, derivative works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work. + +Works, including the Software, are "made available" under this License by including in or with the Work either (a) a copyright notice referencing the applicability of this License to the Work, or (b) a copy of this License. +2. License Grants +2.1 Copyright Grant. Subject to the terms and conditions of this License, each Licensor grants to you a perpetual, worldwide, non-exclusive, royalty-free, copyright license to reproduce, prepare derivative works of, publicly display, publicly perform, sublicense and distribute its Work and any resulting derivative works in any form. +2.2 Patent Grant. 
Subject to the terms and conditions of this License, each Licensor grants to you a perpetual, worldwide, non-exclusive, royalty-free patent license to make, have made, use, sell, offer for sale, import, and otherwise transfer its Work, in whole or in part. The foregoing license applies only to the patent claims licensable by Licensor that would be infringed by Licensor’s Work (or portion thereof) individually and excluding any combinations with any other materials or technology. +3. Limitations +3.1 Redistribution. You may reproduce or distribute the Work only if (a) you do so under this License, (b) you include a complete copy of this License with your distribution, and (c) you retain without modification any copyright, patent, trademark, or attribution notices that are present in the Work. +3.2 Derivative Works. You may specify that additional or different terms apply to the use, reproduction, and distribution of your derivative works of the Work ("Your Terms") only if (a) Your Terms provide that the use limitation in Section 3.3 applies to your derivative works, and (b) you identify the specific derivative works that are subject to Your Terms. Notwithstanding Your Terms, this License (including the redistribution requirements in Section 3.1) will continue to apply to the Work itself. +3.3 Use Limitation. The Work and any derivative works thereof only may be used or intended for use with the web services, computing platforms or applications provided by Amazon.com, Inc. or its affiliates, including Amazon Web Services, Inc. +3.4 Patent Claims. If you bring or threaten to bring a patent claim against any Licensor (including any claim, cross-claim or counterclaim in a lawsuit) to enforce any patents that you allege are infringed by any Work, then your rights under this License from such Licensor (including the grants in Sections 2.1 and 2.2) will terminate immediately. +3.5 Trademarks. 
This License does not grant any rights to use any Licensor’s or its affiliates’ names, logos, or trademarks, except as necessary to reproduce the notices described in this License. +3.6 Termination. If you violate any term of this License, then your rights under this License (including the grants in Sections 2.1 and 2.2) will terminate immediately. +4. Disclaimer of Warranty. +THE WORK IS PROVIDED "AS IS" WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WARRANTIES OR CONDITIONS OF M ERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE OR NON-INFRINGEMENT. YOU BEAR THE RISK OF UNDERTAKING ANY ACTIVITIES UNDER THIS LICENSE. SOME STATES’ CONSUMER LAWS DO NOT ALLOW EXCLUSION OF AN IMPLIED WARRANTY, SO THIS DISCLAIMER MAY NOT APPLY TO YOU. +5. Limitation of Liability. +EXCEPT AS PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER IN TORT (INCLUDING NEGLIGENCE), CONTRACT, OR OTHERWISE SHALL ANY LICENSOR BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF OR RELATED TO THIS LICENSE, THE USE OR INABILITY TO USE THE WORK (INCLUDING BUT NOT LIMITED TO LOSS OF GOODWILL, BUSINESS INTERRUPTION, LOST PROFITS OR DATA, COMPUTER FAILURE OR MALFUNCTION, OR ANY OTHER COMM ERCIAL DAMAGES OR LOSSES), EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. +Effective Date – April 18, 2008 © 2008 Amazon.com, Inc. or its affiliates. All rights reserved. + +Note: Other license terms may apply to certain, identified software files contained within or distributed with the accompanying software if such terms are included in the directory containing the accompanying software. Such other license terms will then apply in lieu of the terms of the software license above. 
+ diff --git a/cfn-lambda_function/NOTICE.txt b/cfn-lambda_function/NOTICE.txt new file mode 100644 index 0000000..fe2f4f5 --- /dev/null +++ b/cfn-lambda_function/NOTICE.txt @@ -0,0 +1,2 @@ +Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. + diff --git a/cfn-lambda_function/dl_cfn_setup_lambda.zip b/cfn-lambda_function/dl_cfn_setup_lambda.zip new file mode 100644 index 0000000..14b22da Binary files /dev/null and b/cfn-lambda_function/dl_cfn_setup_lambda.zip differ diff --git a/cfn-lambda_function/lambda_function.py b/cfn-lambda_function/lambda_function.py new file mode 100755 index 0000000..82b4533 --- /dev/null +++ b/cfn-lambda_function/lambda_function.py @@ -0,0 +1,199 @@ +# Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. +# +# Licensed under the Amazon Software License (the "License"). +# You may not use this file except in compliance with the License. +# A copy of the License is located at +# +# http://aws.amazon.com/asl/ +# +# or in the "license" file accompanying this file. This file is distributed +# on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either +# express or implied. See the License for the specific language governing +# permissions and limitations under the License. 
+ +from __future__ import print_function + +import json +import os +import boto3 +import collections + +print('Loading function') +ASGInstanceCount = collections.namedtuple('ASGInstanceCount', ['min', 'desired', 'max', 'launched']) + +def lambda_handler(event, context): + # print("Received event: " + json.dumps(event, indent=2)) + message = json.loads(event['Records'][0]['Sns']['Message']) + # print("From SNS: " + event['Records'][0]['Sns']['Message']) + # print('AWS_STACK_ID: ' + os.environ['AWS_STACK_ID']) + if message['Event']: + print('EVENT: ', message['Event']) + return eval(get_handler(message['Event']))(message) + else: + return do_nothing(message) + + return message + +def get_handler(Event): + return { + 'autoscaling:EC2_INSTANCE_LAUNCH': 'on_instance_launch', + 'autoscaling:EC2_INSTANCE_LAUNCH_ERROR': 'on_instance_launch_error', + 'autoscaling:EC2_INSTANCE_TERMINATE': 'on_instance_terminate', + 'autoscaling:EC2_INSTANCE_TERMINATE_ERROR': 'on_instance_terminate_error', + 'autoscaling:TEST_NOTIFICATION' : 'do_nothing' + }[Event] + +def do_nothing(message): + print('do_nothing') + print("Unknown Event. 
Received message: " + json.dumps(message, indent=2)) + return + +def send_asg_success(status, asg, asg_instance_counts): + sqs_url = os.environ['AWS_DL_MASTER_SQS_URL'] + print("sqs_url: ", sqs_url) + sqs_con = boto3.client('sqs') + msg_dict = asg_instance_counts._asdict() + msg_dict['status'] = status.lower() + msg_dict['asg'] = asg + msg_dict['event'] = 'asg-setup' + + print('sending message to sqs:', json.dumps(msg_dict)) + sqs_con.send_message(QueueUrl=sqs_url, MessageBody=json.dumps(msg_dict)) + return + +''' + get various instance counts associated with the asg +''' +def get_instance_count(autoscaling_group_name): + print('get_instance_count') + + autoscale_con = boto3.client('autoscaling') + + asg = autoscale_con.describe_auto_scaling_groups(AutoScalingGroupNames=[autoscaling_group_name])['AutoScalingGroups'][0] + num_instances_healthy = 0 + +# TODO: check if pagination needs to be handled for asg.instances + for each_instance in asg['Instances']: + ''' + we will only consider instances that are inService or are in Pending state + http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-lifecycle.html + since this lambda function is expected to run during stack creation, we'll ignore the + case where the User could go and Stop the Instance and can move the instance state to 'Pending' + ''' + if each_instance['LifecycleState'] == 'InService' and each_instance['HealthStatus'] == 'Healthy': + num_instances_healthy += 1 + elif each_instance['LifecycleState'] == 'Pending' and each_instance['HealthStatus'] == 'Healthy': + num_instances_healthy += 1 + else: + continue + + asg_instance_counts = ASGInstanceCount(min=asg['MinSize'], max=asg['MaxSize'], desired=asg['DesiredCapacity'], launched=num_instances_healthy) + print(asg_instance_counts._asdict()) + return asg_instance_counts + +def on_instance_launch(message): + print('on_instance_launch') + + autoscaling_group_name = message['AutoScalingGroupName'] + availability_zone = message['Details']['Availability 
Zone'] + start_time = message['StartTime'] + instance_id = message['EC2InstanceId'] + request_id = message['RequestId'] + + if autoscaling_group_name and 'WorkerAutoScalingGroup' in autoscaling_group_name: + autoscaling_group = 'WorkerAutoScalingGroup' + elif autoscaling_group_name and 'MasterAutoScalingGroup' in autoscaling_group_name: + autoscaling_group = 'MasterAutoScalingGroup' + else: + print('Unknown AutoScaling group,message :',message) + return + + print('AutoScalingGroupName: ', autoscaling_group_name, ', EC2InstanceId: ', instance_id, \ + ', Availability Zone: ', availability_zone, ', Instance StartTime: ', start_time, ', RequestId: ',request_id) + + logical_resource_id = None + asg_instance_counts = get_instance_count(autoscaling_group_name) + + if asg_instance_counts.launched == asg_instance_counts.desired: + print('Launched desired number of instances:', asg_instance_counts.launched) + send_asg_success('SUCCESS', autoscaling_group_name, asg_instance_counts) + + # compare strings by value; 'is' only tests object identity + if autoscaling_group == 'MasterAutoScalingGroup': + cfn_con = boto3.client('cloudformation') + print('Sending cfn-signal SUCCESS to:', autoscaling_group_name, 'with instance Id: ', instance_id) + try: + cfn_con.signal_resource(StackName=os.environ['AWS_DL_STACK_ID'], LogicalResourceId=autoscaling_group, \ + UniqueId=instance_id,Status='SUCCESS') + except Exception as e: + print('exception sending cfn-signal: ', e.message) + else: + autoscale_con = boto3.client('autoscaling') + print('Suspending ReplaceUnhealthy processes for the asg: ', autoscaling_group_name) + autoscale_con.suspend_processes(AutoScalingGroupName=autoscaling_group_name, ScalingProcesses=['ReplaceUnhealthy']) + + return + +''' +suspend autoscaling policy +change desired capacity +send success message to sqs + +''' +def on_instance_launch_error(message): + print('on_instance_launch_error') + + autoscaling_group_name = message['AutoScalingGroupName'] + availability_zone = message['Details']['Availability Zone'] + start_time = 
message['StartTime'] + instance_id = message['EC2InstanceId'] + request_id = message['RequestId'] + + print('AutoScalingGroupName: ', autoscaling_group_name, ', EC2InstanceId: ', instance_id, \ + ', Availability Zone: ', availability_zone, ', Instance StartTime: ', start_time, ', RequestId: ',request_id) + print('StatusCode: ', message['StatusCode'], 'StatusMessage: ', message['StatusMessage']) + + autoscale_con = boto3.client('autoscaling') + asg_instance_counts = get_instance_count(autoscaling_group_name) + + ''' + change desired capacity and suspend processes only if we have at least the min_size requested + ''' + if asg_instance_counts.launched >= asg_instance_counts.min: + print('setting desired capacity of asg: ', autoscaling_group_name, ' to number of Healthy instances: ', asg_instance_counts.launched) + autoscale_con.set_desired_capacity(AutoScalingGroupName=autoscaling_group_name, DesiredCapacity=asg_instance_counts.launched) + print('Suspending ReplaceUnhealthy processes for the asg: ', autoscaling_group_name) + autoscale_con.suspend_processes(AutoScalingGroupName=autoscaling_group_name, ScalingProcesses=['ReplaceUnhealthy']) + print('sending worker asg setup message complete to sqs') + send_asg_success('SUCCESS', autoscaling_group_name, asg_instance_counts) + + return + +''' +''' +def on_instance_terminate(message): + print('on_instance_terminate') + + autoscaling_group_name = message['AutoScalingGroupName'] + availability_zone = message['Details']['Availability Zone'] + start_time = message['StartTime'] + instance_id = message['EC2InstanceId'] + request_id = message['RequestId'] + + print('AutoScalingGroupName: ', autoscaling_group_name, ', EC2InstanceId: ', instance_id, \ + ', Availability Zone: ', availability_zone, ', Instance StartTime: ', start_time, ', RequestId: ',request_id) + + return + +# lambda_handler calls every handler with the SNS message, so it must be accepted here too +def on_instance_terminate_error(message): + print('on_instance_terminate_error') + + autoscaling_group_name = message['AutoScalingGroupName'] + availability_zone = 
message['Details']['Availability Zone'] + start_time = message['StartTime'] + instance_id = message['EC2InstanceId'] + request_id = message['RequestId'] + + print('AutoScalingGroupName: ', autoscaling_group_name, ', EC2InstanceId: ', instance_id, \ + ', Availability Zone: ', availability_zone, ', Instance StartTime: ', start_time, ', RequestId: ',request_id) + + return diff --git a/cfn-lambda_function/template.yaml b/cfn-lambda_function/template.yaml new file mode 100755 index 0000000..5dc2401 --- /dev/null +++ b/cfn-lambda_function/template.yaml @@ -0,0 +1,21 @@ +AWSTemplateFormatVersion: '2010-09-09' +Transform: 'AWS::Serverless-2016-10-31' +Description: ASG Launch/Terminate/Errors trigger this function to send cfn-signal to cloudformation and etup complete message on sqs when asg is setup. +Resources: + snsmessagepython: + Type: 'AWS::Serverless::Function' + Properties: + Handler: lambda_function.lambda_handler + Runtime: python2.7 + CodeUri: . + Description: ASG Launch/Terminate/Errors trigger this function to send cfn-signal to cloudformation and etup complete message on sqs when asg is setup. + MemorySize: 128 + Timeout: 3 + Events: + SNS1: + Type: SNS + Properties: + Topic: + Ref: SNSTopic1 + SNSTopic1: + Type: 'AWS::SNS::Topic' diff --git a/cfn-template/LICENSE.txt b/cfn-template/LICENSE.txt new file mode 100644 index 0000000..2ca938e --- /dev/null +++ b/cfn-template/LICENSE.txt @@ -0,0 +1,29 @@ +Amazon Software License +1. Definitions +"Licensor" means any person or entity that distributes its Work. + +"Software" means the original work of authorship made available under this License. + +"Work" means the Software and any additions to or derivative works of the Software that are made available under this License. + +The terms "reproduce," "reproduction," "derivative works," and "distribution" have the meaning as provided under U.S. 
copyright law; provided, however, that for the purposes of this License, derivative works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work. + +Works, including the Software, are "made available" under this License by including in or with the Work either (a) a copyright notice referencing the applicability of this License to the Work, or (b) a copy of this License. +2. License Grants +2.1 Copyright Grant. Subject to the terms and conditions of this License, each Licensor grants to you a perpetual, worldwide, non-exclusive, royalty-free, copyright license to reproduce, prepare derivative works of, publicly display, publicly perform, sublicense and distribute its Work and any resulting derivative works in any form. +2.2 Patent Grant. Subject to the terms and conditions of this License, each Licensor grants to you a perpetual, worldwide, non-exclusive, royalty-free patent license to make, have made, use, sell, offer for sale, import, and otherwise transfer its Work, in whole or in part. The foregoing license applies only to the patent claims licensable by Licensor that would be infringed by Licensor’s Work (or portion thereof) individually and excluding any combinations with any other materials or technology. +3. Limitations +3.1 Redistribution. You may reproduce or distribute the Work only if (a) you do so under this License, (b) you include a complete copy of this License with your distribution, and (c) you retain without modification any copyright, patent, trademark, or attribution notices that are present in the Work. +3.2 Derivative Works. You may specify that additional or different terms apply to the use, reproduction, and distribution of your derivative works of the Work ("Your Terms") only if (a) Your Terms provide that the use limitation in Section 3.3 applies to your derivative works, and (b) you identify the specific derivative works that are subject to Your Terms. 
Notwithstanding Your Terms, this License (including the redistribution requirements in Section 3.1) will continue to apply to the Work itself. +3.3 Use Limitation. The Work and any derivative works thereof only may be used or intended for use with the web services, computing platforms or applications provided by Amazon.com, Inc. or its affiliates, including Amazon Web Services, Inc. +3.4 Patent Claims. If you bring or threaten to bring a patent claim against any Licensor (including any claim, cross-claim or counterclaim in a lawsuit) to enforce any patents that you allege are infringed by any Work, then your rights under this License from such Licensor (including the grants in Sections 2.1 and 2.2) will terminate immediately. +3.5 Trademarks. This License does not grant any rights to use any Licensor’s or its affiliates’ names, logos, or trademarks, except as necessary to reproduce the notices described in this License. +3.6 Termination. If you violate any term of this License, then your rights under this License (including the grants in Sections 2.1 and 2.2) will terminate immediately. +4. Disclaimer of Warranty. +THE WORK IS PROVIDED "AS IS" WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WARRANTIES OR CONDITIONS OF M ERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE OR NON-INFRINGEMENT. YOU BEAR THE RISK OF UNDERTAKING ANY ACTIVITIES UNDER THIS LICENSE. SOME STATES’ CONSUMER LAWS DO NOT ALLOW EXCLUSION OF AN IMPLIED WARRANTY, SO THIS DISCLAIMER MAY NOT APPLY TO YOU. +5. Limitation of Liability. 
+EXCEPT AS PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER IN TORT (INCLUDING NEGLIGENCE), CONTRACT, OR OTHERWISE SHALL ANY LICENSOR BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF OR RELATED TO THIS LICENSE, THE USE OR INABILITY TO USE THE WORK (INCLUDING BUT NOT LIMITED TO LOSS OF GOODWILL, BUSINESS INTERRUPTION, LOST PROFITS OR DATA, COMPUTER FAILURE OR MALFUNCTION, OR ANY OTHER COMMERCIAL DAMAGES OR LOSSES), EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. +Effective Date – April 18, 2008 © 2008 Amazon.com, Inc. or its affiliates. All rights reserved. + +Note: Other license terms may apply to certain, identified software files contained within or distributed with the accompanying software if such terms are included in the directory containing the accompanying software. Such other license terms will then apply in lieu of the terms of the software license above. + diff --git a/cfn-template/NOTICE.txt b/cfn-template/NOTICE.txt new file mode 100644 index 0000000..fe2f4f5 --- /dev/null +++ b/cfn-template/NOTICE.txt @@ -0,0 +1,2 @@ +Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. + diff --git a/cfn-template/StackSetup.md b/cfn-template/StackSetup.md new file mode 100644 index 0000000..c51b5a1 --- /dev/null +++ b/cfn-template/StackSetup.md @@ -0,0 +1,120 @@ +# Using the AWS CloudFormation Deep Learning Template + +Setting up and using the template requires three steps: + +1. Launching an AWS CloudFormation stack +2. Finding the number of worker instances that successfully launched +3. Logging in to the master instance + +## Step 1: Launch an AWS CloudFormation Stack + +**Note** + +If you need to scale the number of instances beyond the [default limit](https://aws.amazon.com/ec2/faqs/#How_many_instances_can_I_run_in_Amazon_EC2), file a [support request](https://aws.amazon.com/contact-us/ec2-request). 
+ +**To launch the stack**: + +1. Download the Deep Learning template from the [awslabs/deeplearning-cfn GitHub repo](https://github.com/awslabs/deeplearning-cfn/blob/master/cfn-template/deeplearning.template) + +2. Open the [AWS CloudFormation console](https://console.aws.amazon.com/cloudformation), and then choose **Create New Stack**. +![](../images/Slide1.png) + +3. To upload the template, choose **Choose File**, and then choose **Next**. +![](../images/Slide2.png) + +4. For **Stack name**, type a descriptive name. + +5. For **EFSFileSystemId**, you can either enter an existing Amazon EFS File System Id or leave it blank to create a new Amazon EFS file system. + +6. For **EFSMountPoint**, type the path where you want to mount the Amazon EFS file system on the instances. + +7. Choose an **ImageType**, Amazon Linux or Ubuntu. + +8. Choose an **InstanceType**, such as [P2.16xlarge](https://aws.amazon.com/ec2/instance-types/p2/). + +9. For **KeyName**, choose an EC2 key pair. + +10. For **SSHLocation**, choose a valid CIDR IP address range to allow SSH access to the master instance and stack. + +11. For **Worker Count**, type a value. The stack provisions the worker count that you specify plus one more, with the additional instance acting as the master. The master also participates in training, evaluation, or both. Choose **Next**. +![](../images/Slide3.png) + +12. (Optional) In the **Tags** section, type values for **Key** and **Value**. This allows you to assign metadata to your resources. +13. (Optional) In the **Permissions** section, choose the AWS Identity and Access Management (IAM) role that AWS CloudFormation uses to create the stack. Choose **Next**. + +14. In the **Capabilities** section, select the check box to agree to allow AWS CloudFormation to create an IAM role. The IAM role is required for setting up a stack. + + +15. Review your settings, and choose **Create.** +![](../images/Slide5.png) + +16. To see the status of your stack, choose **Events**. 
If stack creation fails (for example, because of an access issue or an unsupported number of workers), check the event log to see the reason for the failure. +For information about troubleshooting stack creation, see [Troubleshooting AWS CloudFormation](http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/troubleshooting.html). +![](../images/Slide6.png) + +## Step 2: Find the number of worker instances that successfully launched + +As noted earlier, you might not always be able to launch the requested number of worker instances because of high demand on the type of instance that you chose or because of your account limits. If the stack is unable to launch even a single instance within the timeout period, AWS CloudFormation rolls the stack back. + +**To find the number of workers that the template was able to launch:** + +1. In the AWS CloudFormation console, choose the stack, and then choose the **Resources** tab to see the details for the worker Auto Scaling group. Choose the Auto Scaling group ID for **WorkerAutoScalingGroup**. +![](../images/Slide7.png) + +2. On the Amazon EC2 console, in the **Auto Scaling Groups** section, you can see the **Desired** capacity of the **WorkerAutoScalingGroup**. This is the number of worker instances that the stack was able to launch. +![](../images/Slide8.png) + +## Step 3: Log in to the master instance + +SSH agent forwarding lets you connect securely, through the master instance, to the instances in the VPC's private subnet. To set up and use SSH agent forwarding, see [Securely Connect to Linux Instances Running in a Private Amazon VPC](https://aws.amazon.com/blogs/security/securely-connect-to-linux-instances-running-in-a-private-amazon-vpc/). + +**To log in to the master instance:** + +1. Find the public DNS/IP address of the master. In the AWS CloudFormation console, navigate to the AWS CloudFormation stack. To see the Auto Scaling group in which the master instance is launched, choose the **Resources** tab.
Choose the Auto Scaling group **Physical ID** for the **MasterAutoScalingGroup**. +![](../images/Slide9.png) + + + a. On the **Auto Scaling** page on the [Amazon EC2 console](https://console.aws.amazon.com/ec2), find the Instance ID of the master instance by choosing the **Instances** tab, and then choosing the Instance ID. +![](../images/Slide10.png) + + b. When you choose the **Instance ID**, EC2 displays details about the master instance, including the public DNS/IP address that you need to log in. Make a note of the address because you will need it in the next step. +![](../images/Slide11.png) + +2. Enable SSH agent forwarding. This enables communication with all of the instances in the private subnet. Using the DNS/IP address that you recorded in the first step, modify the SSH configuration to include these lines: + + Host IP/DNS-from-above + ForwardAgent yes + +3. Log in to the master instance. If you have not already done so, follow the steps in [Securely Connect to Linux Instances Running in a Private Amazon VPC](https://aws.amazon.com/blogs/security/securely-connect-to-linux-instances-running-in-a-private-amazon-vpc/) to enable forwarding your credentials when you use SSH to connect to the master instance. If you skip this step, distributed training that requires SSH communication will fail. + + On macOS, type: + + ssh -A USER-ID@IP/DNS-from-above + # USER-ID is ec2-user for Amazon Linux or ubuntu for Ubuntu + + On the Windows platform, type: + + ssh USER-ID@IP/DNS-from-above + # USER-ID is ec2-user for Amazon Linux or ubuntu for Ubuntu + +For distributed training examples, see [Deep Learning Using MXNet and TensorFlow](../README.md). + +# FAQ + +### Q. How do I change the IP addresses that are allowed to connect to the master instance with SSH? +The AWS CloudFormation stack output contains the security group that controls the inbound IP addresses for SSH access to the master instance. Use this security group to change your inbound IP addresses. + +### Q. When an instance is replaced, are the IP addresses of the instances updated? +No.
You must update IP addresses manually. + +### Q. Does the master instance participate in training and validation? +Yes, the master instance acts both as a proxy and as a distributed training and validation instance. + +### Q. Why are the instances in an Auto Scaling group? +An [Auto Scaling](https://aws.amazon.com/autoscaling/) group maintains the number of desired instances by launching a new instance if an instance fails. There are two Auto Scaling groups: one for the master and one for the workers in the private subnet. Only the master instance has a public endpoint for accessing the hosts in the stack. If the master instance becomes unavailable, you can terminate it, and the associated Auto Scaling group automatically launches a new master instance with a new public endpoint. + +### Q. When a new worker instance is added or an existing instance is replaced, does AWS CloudFormation update the IP addresses on the master instance? +No, this template cannot automatically update the IP address of the new or replacement instance. + +### Q. Why does stack creation fail when I use an existing Amazon EFS file system that is attached to a mount target? +You can use an Amazon EFS file system in only one VPC at a time. If your Amazon EFS file system is attached to a different VPC, delete the association by following the instructions in [Creating or Deleting Mount Targets in a VPC](http://docs.aws.amazon.com/efs/latest/ug/manage-fs-access-create-delete-mount-targets.html).
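
The stack parameters described in Step 1 can also be assembled and checked in code before you create the stack. The following Python sketch is illustrative only: `build_stack_parameters` and `total_instances` are hypothetical helper names that are not part of this repository, and `my-key-pair` is a placeholder. The parameter names, defaults, and allowed image types are taken from `deeplearning.template`. The extra `ipaddress` check is an assumption that is stricter than the template's `AllowedPattern`, which accepts octets above 255.

```python
import ipaddress
import re

# Mirrors the AllowedPattern of the SSHLocation parameter in deeplearning.template.
SSH_LOCATION_PATTERN = re.compile(
    r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})/(\d{1,2})"
)
# Equivalent to the non-empty branch of the EFSFileSystemId AllowedPattern.
EFS_ID_PATTERN = re.compile(r"fs-[0-9a-f]{8}")
IMAGE_TYPES = ("AmazonLinux", "Ubuntu")


def total_instances(worker_count):
    """The stack launches the requested workers plus one master."""
    return int(worker_count) + 1


def build_stack_parameters(key_name, ssh_location, worker_count=1,
                           instance_type="p2.xlarge", image_type="AmazonLinux",
                           efs_file_system_id="", efs_mount_point="myEFSvolume"):
    """Hypothetical helper: validate the values, then return them as a list
    of ParameterKey/ParameterValue dictionaries."""
    if not SSH_LOCATION_PATTERN.fullmatch(ssh_location):
        raise ValueError("SSHLocation must have the form x.x.x.x/x")
    # Assumption: stricter than the template. Raises ValueError for
    # octets above 255 or a prefix length above 32.
    ipaddress.ip_network(ssh_location, strict=False)
    if int(worker_count) < 1:
        raise ValueError("WorkerCount must be at least 1")
    if image_type not in IMAGE_TYPES:
        raise ValueError("ImageType must be AmazonLinux or Ubuntu")
    if efs_file_system_id and not EFS_ID_PATTERN.fullmatch(efs_file_system_id):
        raise ValueError("EFSFileSystemId must look like fs-xxxxxxxx or be empty")
    if not efs_mount_point:
        raise ValueError("EFSMountPoint must not be empty")

    values = {
        "KeyName": key_name,
        "SSHLocation": ssh_location,
        "WorkerCount": worker_count,
        "InstanceType": instance_type,
        "ImageType": image_type,
        "EFSFileSystemId": efs_file_system_id,  # empty: create a new file system
        "EFSMountPoint": efs_mount_point,
    }
    return [{"ParameterKey": key, "ParameterValue": str(value)}
            for key, value in values.items()]


if __name__ == "__main__":
    for param in build_stack_parameters("my-key-pair", "203.0.113.0/24",
                                        worker_count=2):
        print(param["ParameterKey"], "=", param["ParameterValue"])
    print("instances launched:", total_instances(2))
```

The returned list has the shape that the `Parameters` argument of the boto3 CloudFormation client's `create_stack` call accepts. Because the template creates IAM roles, a programmatic launch would also need to acknowledge the IAM capability, just as step 14 does in the console.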
diff --git a/cfn-template/deeplearning.template b/cfn-template/deeplearning.template new file mode 100644 index 0000000..0449147 --- /dev/null +++ b/cfn-template/deeplearning.template @@ -0,0 +1,885 @@ +{ + "AWSTemplateFormatVersion" : "2010-09-09", + "Description" : "Launches a Deep Learning Cluster with one Master and a variable number of Workers.", + "Parameters" : { + "KeyName" : { + "Description" : "Name of an existing Amazon EC2 KeyPair to enable SSH access to the instances", + "Type" : "AWS::EC2::KeyPair::KeyName" + }, + "WorkerCount" : { + "Description" : "The number of worker instances (launches +1 instance for the Master).", + "Type" : "Number", + "MinValue" : "1", + "Default" : "1" + }, + "InstanceType" : { + "Description" : "The EC2 instance type for workers. For GPUs, choose g2.xx or p2.xx", + "Type" : "String", + "Default" : "p2.xlarge", + "AllowedValues" : [ + "p2.16xlarge", + "p2.8xlarge", + "p2.xlarge", + "g2.8xlarge", + "g2.2xlarge", + "t2.small", + "t2.medium", + "t2.large", + "t2.xlarge", + "t2.2xlarge", + "m4.large", + "m4.xlarge", + "m4.2xlarge", + "m4.4xlarge", + "m4.10xlarge", + "m4.16xlarge", + "m3.medium", + "m3.large", + "m3.xlarge", + "m3.2xlarge", + "c4.large", + "c4.xlarge", + "c4.2xlarge", + "c4.4xlarge", + "c4.8xlarge", + "c3.large", + "c3.xlarge", + "c3.2xlarge", + "c3.4xlarge", + "c3.8xlarge", + "x1.16xlarge", + "x1.32xlarge", + "r4.large", + "r4.xlarge", + "r4.2xlarge", + "r4.4xlarge", + "r4.8xlarge", + "r4.16xlarge", + "r3.large", + "r3.xlarge", + "r3.2xlarge", + "r3.4xlarge", + "r3.8xlarge", + "i2.xlarge", + "i2.2xlarge", + "i2.4xlarge", + "i2.8xlarge", + "d2.xlarge", + "d2.2xlarge", + "d2.4xlarge", + "d2.8xlarge", + "f1.2xlarge", + "f1.16xlarge" + ], + "ConstraintDescription" : "Must be a valid CPU optimized or GPU EC2 instance type."
+ }, + "ImageType" : { + "Description" : "Linux Flavor(Amazon Linux or Ubuntu)", + "Type" : "String", + "Default" : "AmazonLinux", + "AllowedValues" : [ "AmazonLinux", "Ubuntu" ], + "ConstraintDescription" : "Amazon Supported Image Type" + }, + "SSHLocation": { + "Description": "Restrict SSH access to a valid CIDR range, this should be a valid CIDR IP address range that you want to allow access to your Master and Stack.", + "Type": "String", + "MinLength": "9", + "MaxLength": "18", + "AllowedPattern": "(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})/(\\d{1,2})", + "ConstraintDescription": "Must be a valid CIDR range of the form x.x.x.x/x" + }, + "EFSFileSystemId" :{ + "Description": "Existing Amazon EFS File System Id or leave it blank to create a new EFS File System.", + "Type": "String", + "AllowedPattern": "(^fs-[0-9a-f]{8,8})$|()$", + "Default": "", + "ConstraintDescription" : "Should be a Valid EFS File System Id" + }, + "EFSMountPoint" : { + "Description" : "The Linux mount point for the EFS volume", + "Type": "String", + "MinLength": "1", + "Default": "myEFSvolume" + } + }, + "Conditions" : { + "CreateNewFileSystem" : { "Fn::Equals" : [{"Ref": "EFSFileSystemId"}, ""] } + }, + "Mappings" : { + "AmazonLinux" : { + "us-east-1" : { "AMI" : "ami-e7c96af1" }, + "us-west-2" : { "AMI" : "ami-dfb13ebf" }, + "eu-west-1" : { "AMI" : "ami-6e5d6808" } + }, + "Ubuntu" : { + "us-east-1" : { "AMI" : "ami-9548e783" }, + "us-west-2" : { "AMI" : "ami-e9038d89" }, + "eu-west-1" : { "AMI" : "ami-627d4a04" } + }, + "SubnetConfig" : { + "VPC" : { "CIDR" : "10.0.0.0/16" }, + "Public" : { "CIDR" : "10.0.0.0/24" }, + "Private" : { "CIDR" : "10.0.1.0/24" } + }, + "S3" : { + "us-east-1" : { "URL" : "https://s3.amazonaws.com/" }, + "us-west-2" : { "URL" : "https://s3-us-west-2.amazonaws.com/" }, + "eu-west-1" : { "URL" : "https://s3-eu-west-1.amazonaws.com/" } + }, + "Other" : { + "S3SourceBucket" : { "BucketNameSuffix" : "-aws-dl-cfn" }, + "Setup" : { "Filename" : "dl_cfn_setup.py" 
}, + "LambdaFunction" : { "FileName": "dl_cfn_setup_lambda.zip" }, + "TimeoutValues" : { "WaitConditionTimeout" : "3300", "MasterLaunchTimeout" : "600"}, + "DefaultUser" : {"AmazonLinux": "ec2-user", "Ubuntu": "ubuntu"} + } + }, + "Resources" : { + "ResourceMetadataLambdaFunction": { + "Type": "AWS::Lambda::Function", + "DependsOn" : ["MasterQueue"], + "Properties": { + "Handler": "lambda_function.lambda_handler", + "Role": { "Fn::GetAtt" : ["LambdaExecutionRole", "Arn"] }, + "Code": { + "S3Bucket": {"Fn::Join" : ["", [{ "Ref" : "AWS::Region" }, { "Fn::FindInMap" : [ "Other", "S3SourceBucket", "BucketNameSuffix" ]} ] ]}, + "S3Key": { "Fn::FindInMap" : [ "Other", "LambdaFunction", "FileName" ]} + }, + "MemorySize" : "256", + "Timeout": "60", + "Runtime": "python2.7", + "Environment" : { + "Variables": { "AWS_DL_STACK_ID" : { "Ref" : "AWS::StackName" }, + "AWS_DL_MASTER_SQS_URL" : {"Ref" : "MasterQueue"} + } + } + } + }, + "LambdaExecutionRole": { + "Type": "AWS::IAM::Role", + "DependsOn" : ["MasterQueue"], + "Properties": { + "ManagedPolicyArns": ["arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"], + "AssumeRolePolicyDocument": { + "Version": "2012-10-17", + "Statement": [{ + "Effect": "Allow", + "Principal": {"Service": ["lambda.amazonaws.com"]}, + "Action": ["sts:AssumeRole"] + }] + }, + "Path": "/", + "Policies": [ + { "PolicyName": "AWSDeepLearningLambdaExecutionRole", + "PolicyDocument": { + "Version": "2012-10-17", + "Statement": [{ + "Effect": "Allow", + "Action" : [ "autoscaling:DescribeAutoScalingGroups", "autoscaling:SetDesiredCapacity", + "autoscaling:SuspendProcesses", + "cloudformation:DescribeStackResource", "cloudformation:SignalResource" + ], + "Resource": "*" + }] + } + }, + { + "PolicyName": "AllowLambdaSQSSend", + "PolicyDocument": { + "Version": "2012-10-17", + "Statement": [{ + "Effect": "Allow", + "Action" : [ + "sqs:SendMessage" + ], + "Resource" : { "Fn::GetAtt" : [ "MasterQueue", "Arn" ] } + }] + } + } + ] + } + }, +
"PermissionForSNSToInvokeLambda": { + "Type": "AWS::Lambda::Permission", + "Properties": { + "FunctionName": { + "Fn::GetAtt": ["ResourceMetadataLambdaFunction", "Arn"] + }, + "Action": "lambda:InvokeFunction", + "Principal": "sns.amazonaws.com", + "SourceArn": { + "Ref" : "ResourceMetadataSNSTopic" + } + } + }, + "InstanceRole" : { + "Type" : "AWS::IAM::Role", + "Properties" : { + "RoleName" : {"Fn::Join" : ["", [{ "Ref" : "AWS::StackName" }, "-InstanceRole" ] ] }, + "AssumeRolePolicyDocument" : { + "Statement" : [ { + "Effect" : "Allow", + "Principal" : { + "Service" : [ "ec2.amazonaws.com" ] + }, + "Action" : [ "sts:AssumeRole" ] + } ] + }, + "Path" : "/", + "Policies" : [ { + "PolicyName" : "instance", + "PolicyDocument" : { + "Statement" : [ { + "Effect" : "Allow", + "Action" : [ "autoscaling:DescribeAutoScalingGroups", "autoscaling:DescribeAutoScalingInstances", "ec2:DescribeInstances", "cloudformation:DescribeStackResource"], + "Resource" : "*" + } ] + } + }, + { + "PolicyName" : "allow-sqs-receive-send-delete-master", + "PolicyDocument" : { + "Statement" : [ { + "Effect" : "Allow", + "Action" : [ "sqs:DeleteMessage", "sqs:ReceiveMessage", "sqs:SendMessage", "sqs:GetQueueUrl"], + "Resource" : { "Fn::GetAtt" : [ "MasterQueue", "Arn" ] } + } ] + } + }, + { + "PolicyName" : "allow-sqs-receive-send-delete-worker", + "PolicyDocument" : { + "Statement" : [ { + "Effect" : "Allow", + "Action" : [ "sqs:DeleteMessage", "sqs:ReceiveMessage", "sqs:SendMessage", "sqs:GetQueueUrl"], + "Resource" : { "Fn::GetAtt" : [ "WorkerQueue", "Arn" ] } + } ] + } + }, + { + "PolicyName" : "allow-to-send-signal-to-WaitConditionHandle", + "PolicyDocument" : { + "Statement" : [ { + "Effect" : "Allow", + "Action" : [ "s3:*"], + "Resource" : {"Fn::Join" : ["", ["arn:aws:s3:::", "cloudformation-waitcondition-", { "Ref" : "AWS::Region" }, "/*" ] ] } + } ] + } + } + ] + } + }, + "InstanceProfile" : { + "Type" : "AWS::IAM::InstanceProfile", + "DependsOn" : "InstanceRole", + "Properties" : { + 
"Path" : "/", + "Roles" : [ { + "Ref" : "InstanceRole" + } ] + } + }, + "AdminSSHSecurityGroup" : { + "Type" : "AWS::EC2::SecurityGroup", + "Properties" : { + "GroupDescription" : "Security group that controls SSH access to the Master instance.", + "VpcId" : { "Ref" : "Vpc" }, + "Tags" : [ + { "Key" : "Name", "Value" : {"Fn::Join" : ["", [{ "Ref" : "AWS::StackName" }, "_SSH" ] ] } } + ], + "SecurityGroupIngress" : [ + { "IpProtocol" : "tcp", "FromPort" : "22", "ToPort" : "22", "CidrIp" : { "Ref" : "SSHLocation" } } + ], + "SecurityGroupEgress" : [ + ] + } + }, + "MasterSecurityGroup" : { + "Type" : "AWS::EC2::SecurityGroup", + "Properties" : { + "GroupDescription" : "Enable Port access to and from the Master on the Private Interface.", + "VpcId" : { "Ref" : "Vpc" }, + "Tags" : [ + { "Key" : "Name", "Value" : {"Fn::Join" : ["", [{ "Ref" : "AWS::StackName" }, "_Master" ] ] } } + ], + "SecurityGroupIngress" : [ + ], + "SecurityGroupEgress" : [ + ] + } + }, + "MasterSecurityIngress1" : { + "Type" : "AWS::EC2::SecurityGroupIngress", + "DependsOn" : ["MasterSecurityGroup"], + "Properties" : { + "GroupId" : { "Fn::GetAtt": [ "MasterSecurityGroup", "GroupId" ] }, + "IpProtocol" : "tcp", + "FromPort" : "0", + "ToPort" : "65535", + "SourceSecurityGroupId" : { "Fn::GetAtt": [ "MasterSecurityGroup", "GroupId" ] } + } + }, + "MasterSecurityIngress2" : { + "Type" : "AWS::EC2::SecurityGroupIngress", + "DependsOn" : ["MasterSecurityGroup", "WorkerSecurityGroup"], + "Properties" : { + "GroupId" : { "Fn::GetAtt": [ "MasterSecurityGroup", "GroupId" ] }, + "IpProtocol" : "icmp", + "FromPort" : "-1", + "ToPort" : "-1", + "SourceSecurityGroupId" : { "Fn::GetAtt": [ "MasterSecurityGroup", "GroupId" ] } + } + }, + "MasterSecurityIngress3" : { + "Type" : "AWS::EC2::SecurityGroupIngress", + "DependsOn" : ["MasterSecurityGroup", "WorkerSecurityGroup"], + "Properties" : { + "GroupId" : { "Fn::GetAtt": [ "MasterSecurityGroup", "GroupId" ] }, + "IpProtocol" : "tcp", + "FromPort" : "0", + 
"ToPort" : "65535", + "SourceSecurityGroupId" : { "Fn::GetAtt": [ "WorkerSecurityGroup", "GroupId" ] } + } + }, + "MasterSecurityIngress4" : { + "Type" : "AWS::EC2::SecurityGroupIngress", + "DependsOn" : ["MasterSecurityGroup", "WorkerSecurityGroup"], + "Properties" : { + "GroupId" : { "Fn::GetAtt": [ "MasterSecurityGroup", "GroupId" ] }, + "IpProtocol" : "icmp", + "FromPort" : "-1", + "ToPort" : "-1", + "SourceSecurityGroupId" : { "Fn::GetAtt": [ "WorkerSecurityGroup", "GroupId" ] } + } + }, + "WorkerSecurityGroup" : { + "Type" : "AWS::EC2::SecurityGroup", + "DependsOn" : ["MasterSecurityGroup"], + "Properties" : { + "GroupDescription" : "Enable Port access to and from the Worker on the Private Interface", + "VpcId" : { "Ref" : "Vpc" }, + "Tags" : [ + { "Key" : "Name", "Value" : {"Fn::Join" : ["", [{ "Ref" : "AWS::StackName" }, "_Worker"] ]} } + ], + "SecurityGroupIngress" : [ + { "IpProtocol" : "tcp", "FromPort" : "0", "ToPort" : "65535", "SourceSecurityGroupId" : { "Ref" : "MasterSecurityGroup" } }, + { "IpProtocol" : "icmp", "FromPort" : "-1", "ToPort" : "-1", "SourceSecurityGroupId" : { "Ref" : "MasterSecurityGroup" } } + ], + "SecurityGroupEgress" : [ + ] + } + }, + "WorkerSecurityIngress3" : { + "Type" : "AWS::EC2::SecurityGroupIngress", + "DependsOn" : ["WorkerSecurityGroup"], + "Properties" : { + "GroupId" : { "Fn::GetAtt": [ "WorkerSecurityGroup", "GroupId" ] }, + "IpProtocol" : "tcp", + "FromPort" : "0", + "ToPort" : "65535", + "SourceSecurityGroupId" : { "Fn::GetAtt": [ "WorkerSecurityGroup", "GroupId" ] } + } + }, + "WorkerSecurityIngress4" : { + "Type" : "AWS::EC2::SecurityGroupIngress", + "DependsOn" : ["WorkerSecurityGroup"], + "Properties" : { + "GroupId" : { "Fn::GetAtt": [ "WorkerSecurityGroup", "GroupId" ] }, + "IpProtocol" : "icmp", + "FromPort" : "-1", + "ToPort" : "-1", + "SourceSecurityGroupId" : { "Fn::GetAtt": [ "WorkerSecurityGroup", "GroupId" ] } + } + }, + "MountTargetSecurityGroup" : { + "Type" : "AWS::EC2::SecurityGroup", + 
"DependsOn" : ["MasterSecurityGroup", "WorkerSecurityGroup"], + "Properties" : { + "GroupDescription": "Security group for mount target", + "VpcId" : { "Ref" : "Vpc" }, + "Tags" : [ + { "Key" : "Name", "Value" : {"Fn::Join" : ["", [{ "Ref" : "AWS::StackName" }, "_Master" ] ] } } + ], + "SecurityGroupIngress" : [ + { "IpProtocol" : "tcp", "FromPort" : "2049", "ToPort" : "2049", "SourceSecurityGroupId" : { "Ref" : "MasterSecurityGroup" } }, + { "IpProtocol" : "tcp", "FromPort" : "2049", "ToPort" : "2049", "SourceSecurityGroupId" : { "Ref" : "WorkerSecurityGroup" } } + ], + "SecurityGroupEgress" : [ + ] + } + }, + "FileSystem": { + "Type": "AWS::EFS::FileSystem", + "Condition": "CreateNewFileSystem", + "DeletionPolicy": "Retain", + "Properties": { + "PerformanceMode": "generalPurpose", + "FileSystemTags": [ + { + "Key": "Name", + "Value": { "Ref" : "AWS::StackName" } + } + ] + } + }, + "MountTarget": { + "Type": "AWS::EFS::MountTarget", + "Properties": { + "FileSystemId": { "Fn::If" : [ "CreateNewFileSystem", {"Ref" : "FileSystem"}, {"Ref" : "EFSFileSystemId"} ] }, + "SubnetId" : { "Ref" : "PrivateSubnet" }, + "SecurityGroups": [ { "Ref": "MountTargetSecurityGroup" } ] + } + }, + "WorkerLaunchConfig" : { + "Type" : "AWS::AutoScaling::LaunchConfiguration", + "Properties" : { + "ImageId" : { + "Fn::FindInMap" : [ {"Ref" : "ImageType" }, { "Ref" : "AWS::Region" }, "AMI" ] + }, + "InstanceType" : { + "Ref" : "InstanceType" + }, + "IamInstanceProfile" : { + "Ref" : "InstanceProfile" + }, + "SecurityGroups" : [ + {"Ref" : "WorkerSecurityGroup"} + ], + "UserData" : { + "Fn::Base64" : { + "Fn::Join" : [ "", + [ + "#!/bin/bash -xe", + "\n", + + "# setup ssh-forwarding. ", + "sed -i \"s/^#\\(\\s\\+\\)ForwardAgent\\(\\s\\+\\)no/\\ \\1ForwardAgent\\2yes/g\" /etc/ssh/ssh_config", + "\n", + + "mkdir -p /opt/deeplearning", + "\n", + + "# run cfn-init. 
\n", + "export CFN_PATH=\\/opt\\/aws\\/bin", + "\n", + "$CFN_PATH\\/cfn-init -v --region ", { "Ref" : "AWS::Region" }, + " --configsets Setup ", + " -s ", + { "Ref" : "AWS::StackId" }, + " -r WorkerLaunchConfig ", + "\n", + "" + ] + ] + } + }, + "KeyName" : { + "Ref" : "KeyName" + } + }, + "Metadata" : { + "AWS::CloudFormation::Init" : { + "configSets" : {"Setup" : ["efs-config", "download-setup", "deeplearning-config" ] }, + "efs-config" : { + "commands" : { + "00_install_nfs" : { + "command" : {"Fn::Join" : [ "", [ "if [ \"AmazonLinux\" = \"", { "Ref" : "ImageType" }, "\" ];", "then yum -y -q install nfs-utils; else apt-get -qq -y install nfs-common ; fi" ]]} + }, + "01_createdir" : { + "command" : {"Fn::Join" : [ "", [ "mkdir -p /", { "Ref" : "EFSMountPoint" }]]} + }, + "02_mount" : { + "command" : {"Fn::Join" : [ "", [ "sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 ", {"Fn::If" : [ "CreateNewFileSystem", {"Ref" : "FileSystem"}, {"Ref" : "EFSFileSystemId"} ]}, ".efs.", { "Ref" : "AWS::Region" }, ".amazonaws.com:/ /", {"Ref": "EFSMountPoint"} ]]} + }, + "03_permissions" : { + "command" : {"Fn::Join" : [ "", [ "chown ",{ "Fn::FindInMap" : [ "Other", "DefaultUser", {"Ref" : "ImageType" } ]}, ":", { "Fn::FindInMap" : [ "Other", "DefaultUser", {"Ref" : "ImageType" } ]}, " /", { "Ref" : "EFSMountPoint" }]]} + } + } + }, + "download-setup" :{ + "files" : { + "/opt/deeplearning/dl_cfn_setup.py": + { "source" : { "Fn::Join" : [ "", [ {"Fn::FindInMap" : [ "S3", { "Ref" : "AWS::Region" }, "URL" ]}, {"Fn::Join" : ["", [{ "Ref" : "AWS::Region" }, { "Fn::FindInMap" : [ "Other", "S3SourceBucket", "BucketNameSuffix" ]} ] ]}, "/", { "Fn::FindInMap" : [ "Other", "Setup", "Filename" ]} ] ] } } + } + }, + "deeplearning-config" : { + "commands" : { + "01_setup" : { + "command" : "python /opt/deeplearning/dl_cfn_setup.py | tee -a /var/log/cloud-init-output.log", + "cwd" : "/opt/deeplearning", + "env" : { "AWS_DL_NODE_TYPE" : "Worker", + 
"AWS_DL_MASTER_QUEUE": { "Fn::GetAtt" : [ "MasterQueue", "QueueName" ] }, + "AWS_DL_WORKER_QUEUE": { "Fn::GetAtt" : [ "WorkerQueue", "QueueName" ] }, + "AWS_DL_WAITCONDITION_TIMEOUT" : { "Fn::FindInMap" : [ "Other", "TimeoutValues", "WaitConditionTimeout" ]}, + "AWS_DL_MASTERLAUNCH_TIMEOUT" : { "Fn::FindInMap" : [ "Other", "TimeoutValues", "MasterLaunchTimeout" ]}, + "AWS_DL_STACK_ID" : { "Ref" : "AWS::StackId" }, + "AWS_DL_WAIT_HANDLE" : { "Ref" : "myWaitHandle" }, + "AWS_DL_ROLE_NAME" : {"Fn::Join" : ["", [{ "Ref" : "AWS::StackName" }, "-InstanceRole" ] ] }, + "AWS_DL_DEFAULT_USER" : { "Fn::FindInMap" : [ "Other", "DefaultUser", {"Ref" : "ImageType" } ]}, + "AWS_REGION" : { "Ref" : "AWS::Region" }, + "EFS_MOUNT" : {"Fn::Join" : ["", ["/", { "Ref" : "EFSMountPoint" } ] ] } + } + } + } + } + } + } + }, + "MasterLaunchConfig" : { + "Type" : "AWS::AutoScaling::LaunchConfiguration", + "Properties" : { + "AssociatePublicIpAddress" : "true", + "ImageId" : { + "Fn::FindInMap" : [ {"Ref" : "ImageType" }, { "Ref" : "AWS::Region" }, "AMI" ] + }, + "InstanceType" : { + "Ref" : "InstanceType" + }, + "IamInstanceProfile" : { + "Ref" : "InstanceProfile" + }, + "SecurityGroups" : [ + { "Ref" : "MasterSecurityGroup" }, + { "Ref" : "AdminSSHSecurityGroup" } + ], + "UserData" : { + "Fn::Base64" : { + "Fn::Join" : [ "", + [ + "#!/bin/bash -xe", + "\n", + "# setup ssh-forwarding. \n", + "sed -i \"s/^#\\(\\s\\+\\)ForwardAgent\\(\\s\\+\\)no/\\ \\1ForwardAgent\\2yes/g\" /etc/ssh/ssh_config", + "\n", + + "mkdir -p /opt/deeplearning", + "\n", + + "# run cfn-init. 
\n", + "export CFN_PATH=\\/opt\\/aws\\/bin", + "\n", + "$CFN_PATH\\/cfn-init -v --region ", { "Ref" : "AWS::Region" }, + " --configsets Setup ", + " -s ", + { "Ref" : "AWS::StackId" }, + " -r MasterLaunchConfig ", + "\n", + "" + ] + ] + } + }, + "KeyName" : { + "Ref" : "KeyName" + } + }, + "Metadata" : { + "AWS::CloudFormation::Init" : { + "configSets" : {"Setup" : ["efs-config", "download-setup", "deeplearning-config" ] }, + "efs-config" : { + "commands" : { + "00_install_nfs" : { + "command" : {"Fn::Join" : [ "", [ "if [ \"AmazonLinux\" = \"", { "Ref" : "ImageType" }, "\" ];", "then yum -y -q install nfs-utils; else apt-get -qq -y install nfs-common ; fi" ]]} + }, + "01_createdir" : { + "command" : {"Fn::Join" : [ "", [ "mkdir -p /", { "Ref" : "EFSMountPoint" }]]} + }, + "02_mount" : { + "command" : {"Fn::Join" : [ "", [ "sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 ", { "Fn::If" : [ "CreateNewFileSystem", {"Ref" : "FileSystem"}, {"Ref" : "EFSFileSystemId"} ] }, ".efs.", { "Ref" : "AWS::Region" }, ".amazonaws.com:/ /", {"Ref": "EFSMountPoint"} ]]} + }, + "03_permissions" : { + "command" : {"Fn::Join" : [ "", [ "chown ",{ "Fn::FindInMap" : [ "Other", "DefaultUser", {"Ref" : "ImageType" } ]}, ":", { "Fn::FindInMap" : [ "Other", "DefaultUser", {"Ref" : "ImageType" } ]}, " /", { "Ref" : "EFSMountPoint" }]]} + } + } + }, + "download-setup" :{ + "files" : { + "/opt/deeplearning/dl_cfn_setup.py": + { "source" : { "Fn::Join" : [ "", [ {"Fn::FindInMap" : [ "S3", { "Ref" : "AWS::Region" }, "URL" ]}, {"Fn::Join" : ["", [{ "Ref" : "AWS::Region" }, { "Fn::FindInMap" : [ "Other", "S3SourceBucket", "BucketNameSuffix" ]} ] ]}, "/", { "Fn::FindInMap" : [ "Other", "Setup", "Filename" ]} ] ] } } + } + }, + "deeplearning-config" : { + "commands" : { + "01_setup" : { + "command" : "python /opt/deeplearning/dl_cfn_setup.py | tee -a /var/log/cloud-init-output.log", + "cwd" : "/opt/deeplearning", + "env" : { "AWS_DL_NODE_TYPE" : "Master", + 
"AWS_DL_MASTER_QUEUE": { "Fn::GetAtt" : [ "MasterQueue", "QueueName" ] }, + "AWS_DL_WORKER_QUEUE": { "Fn::GetAtt" : [ "WorkerQueue", "QueueName" ] }, + "AWS_DL_WAITCONDITION_TIMEOUT" : { "Fn::FindInMap" : [ "Other", "TimeoutValues", "WaitConditionTimeout" ]}, + "AWS_DL_MASTERLAUNCH_TIMEOUT" : { "Fn::FindInMap" : [ "Other", "TimeoutValues", "MasterLaunchTimeout" ]}, + "AWS_DL_STACK_ID" : { "Ref" : "AWS::StackId" }, + "AWS_DL_WAIT_HANDLE" : { "Ref" : "myWaitHandle" }, + "AWS_DL_ROLE_NAME" : {"Fn::Join" : ["", [{ "Ref" : "AWS::StackName" }, "-InstanceRole" ] ] }, + "AWS_DL_DEFAULT_USER" : { "Fn::FindInMap" : [ "Other", "DefaultUser", {"Ref" : "ImageType" } ]}, + "AWS_REGION" : { "Ref" : "AWS::Region" }, + "EFS_MOUNT" : {"Fn::Join" : ["", ["/", { "Ref" : "EFSMountPoint" } ] ] } + } + } + } + } + } + } + }, + "MasterAutoScalingGroup" : { + "Type" : "AWS::AutoScaling::AutoScalingGroup", + "DependsOn" : ["MasterLaunchConfig", "MountTarget", "MasterQueue", "WorkerQueue"], + "CreationPolicy" : { + "ResourceSignal" : { + "Timeout": {"Fn::Join" : ["", ["PT", { "Fn::FindInMap" : [ "Other", "TimeoutValues", "MasterLaunchTimeout" ]}, "S" ] ] }, + "Count" : "1" + } + }, + "Properties" : { + "DesiredCapacity" : "1", + "MinSize" : "1", + "MaxSize" : "1", + "LaunchConfigurationName" : { "Ref" : "MasterLaunchConfig"}, + "VPCZoneIdentifier" : [{ "Ref" : "PublicSubnet"}], + "NotificationConfiguration" : { + "TopicARN" : { + "Ref" : "ResourceMetadataSNSTopic" + }, + "NotificationTypes" : [ "autoscaling:EC2_INSTANCE_LAUNCH", + "autoscaling:EC2_INSTANCE_LAUNCH_ERROR", + "autoscaling:EC2_INSTANCE_TERMINATE_ERROR", + "autoscaling:EC2_INSTANCE_TERMINATE" + ] + }, + "Tags" : [ { + "Key" : "Name", + "Value" : { + "Fn::Join" : ["", [{ "Ref" : "AWS::StackName" }, "-Master" ] ] + }, + "PropagateAtLaunch" : true + }, + { + "Key" : "NodeType", + "Value" : "Master", + "PropagateAtLaunch" : true + } + ] + } + }, + "WorkerAutoScalingGroup" : { + "Type" : "AWS::AutoScaling::AutoScalingGroup", + 
"DependsOn" : ["WorkerLaunchConfig", "MountTarget", "MasterQueue", "WorkerQueue", "MasterAutoScalingGroup"], + "Properties" : { + "MinSize" : "0", + "MaxSize" : { "Ref" : "WorkerCount" }, + "DesiredCapacity" : { "Ref" : "WorkerCount" }, + "LaunchConfigurationName" : { + "Ref" : "WorkerLaunchConfig" + }, + "VPCZoneIdentifier" : [ { "Ref" : "PrivateSubnet" } ], + "NotificationConfiguration" : { + "TopicARN" : { + "Ref" : "ResourceMetadataSNSTopic" + }, + "NotificationTypes" : [ + "autoscaling:EC2_INSTANCE_LAUNCH", + "autoscaling:EC2_INSTANCE_LAUNCH_ERROR", + "autoscaling:EC2_INSTANCE_TERMINATE_ERROR", + "autoscaling:EC2_INSTANCE_TERMINATE" + ] + }, + "Tags" : [ { + "Key" : "Name", + "Value" : { + "Fn::Join" : ["", [{ "Ref" : "AWS::StackName" }, "-Worker" ] ] + }, + "PropagateAtLaunch" : true + }, + { + "Key" : "NodeType", + "Value" : "Worker", + "PropagateAtLaunch" : true + } + ] + } + }, + "MasterQueue" : { + "Type" : "AWS::SQS::Queue", + "Properties" : { + "QueueName" : {"Fn::Join" : ["", [{ "Ref" : "AWS::StackName" }, "-aws-dl-cfn-master" ] ] } + } + }, + "WorkerQueue" : { + "Type" : "AWS::SQS::Queue", + "Properties" : { + "QueueName" : {"Fn::Join" : ["", [{ "Ref" : "AWS::StackName" }, "-aws-dl-cfn-worker" ] ] } + } + }, + "ResourceMetadataSNSTopic" : { + "Type" : "AWS::SNS::Topic", + "Properties" : { + "DisplayName" : {"Fn::Join" : ["", [{ "Ref" : "AWS::StackName" }, "-aws-dl-cfn" ] ] }, + "Subscription" : [ + { + "Endpoint" : { + "Fn::GetAtt" : [ "ResourceMetadataLambdaFunction", "Arn" ] + }, "Protocol" : "lambda" + } + ], + "TopicName" : {"Fn::Join" : ["", [{ "Ref" : "AWS::StackName" }, "-aws-dl-cfn" ] ] } + } + }, + "myWaitHandle" : { + "Type" : "AWS::CloudFormation::WaitConditionHandle", + "Properties" : { + } + }, + "myWaitCondition" : { + "Type" : "AWS::CloudFormation::WaitCondition", + "Properties" : { + "Handle" : { "Ref" : "myWaitHandle" }, + "Timeout" : { "Fn::FindInMap" : [ "Other", "TimeoutValues", "WaitConditionTimeout" ]} + } + }, + "NATGatewayEIP" 
: { + "Type" : "AWS::EC2::EIP", + "Properties" : {"Domain" : "vpc"} + }, + "Vpc" : { + "Type" : "AWS::EC2::VPC", + "Properties" : { + "CidrBlock" : { "Fn::FindInMap" : [ "SubnetConfig", "VPC", "CIDR" ]}, + "EnableDnsSupport" : "true", + "EnableDnsHostnames" : "true", + "Tags" : [ + { "Key" : "Name", "Value" : { "Ref" : "AWS::StackName" } } + ] + } + }, + "InternetGateway" : { + "Type" : "AWS::EC2::InternetGateway", + "Properties" : { + "Tags" : [ + { "Key" : "Network", "Value" : "Public" }, + { "Key" : "Name", "Value" : { "Ref" : "AWS::StackName" } } + ] + } + }, + "GatewayToInternet" : { + "Type" : "AWS::EC2::VPCGatewayAttachment", + "Properties" : { + "VpcId" : { "Ref" : "Vpc" }, + "InternetGatewayId" : { "Ref" : "InternetGateway" } + } + }, + "PublicSubnet" : { + "Type" : "AWS::EC2::Subnet", + "DependsOn" : ["PrivateSubnet"], + "Properties" : { + "VpcId" : {"Ref" : "Vpc"}, + "AvailabilityZone" : { "Fn::GetAtt" : [ "PrivateSubnet", "AvailabilityZone" ] } , + "CidrBlock": { "Fn::FindInMap" : [ "SubnetConfig", "Public", "CIDR" ]}, + "Tags" : [ + { "Key" : "Network", "Value" : "Public" }, + { "Key" : "Name", "Value" : { "Ref" : "AWS::StackName" } } + ] + } + }, + "PrivateSubnet" : { + "Type" : "AWS::EC2::Subnet", + "Properties" : { + "VpcId" : { "Ref" : "Vpc" }, + "CidrBlock" : { "Fn::FindInMap" : [ "SubnetConfig", "Private", "CIDR" ]}, + "Tags" : [ + { "Key" : "Network", "Value" : "Private" }, + { "Key" : "Name", "Value" : { "Ref" : "AWS::StackName" }} + ] + } + }, + "NATGateway" : { + "Type" : "AWS::EC2::NatGateway", + "DependsOn" : "GatewayToInternet", + "Properties" : { + "AllocationId" : { + "Fn::GetAtt" : [ + "NATGatewayEIP", + "AllocationId" + ] + }, + "SubnetId" : { + "Ref" : "PublicSubnet" + } + } + }, + "PublicRouteTable" : { + "Type" : "AWS::EC2::RouteTable", + "DependsOn": "GatewayToInternet", + "Properties" : { + "VpcId" : { "Ref" : "Vpc" }, + "Tags" : [ + { "Key" : "Network", "Value" : "Public" }, + { "Key" : "Name", "Value" : { "Ref" : 
"AWS::StackName" } } + ] + } + }, + "PublicRoute" : { + "Type" : "AWS::EC2::Route", + "Properties" : { + "RouteTableId" : { "Ref" : "PublicRouteTable" }, + "DestinationCidrBlock" : "0.0.0.0/0", + "GatewayId" : { "Ref" : "InternetGateway" } + } + }, + "PublicSubnetRouteAssociation" : { + "Type" : "AWS::EC2::SubnetRouteTableAssociation", + "Properties" : { + "SubnetId" : { "Ref" : "PublicSubnet" }, + "RouteTableId" : { "Ref" : "PublicRouteTable" } + } + }, + "PrivateRouteTable" : { + "Type" : "AWS::EC2::RouteTable", + "Properties" : { + "VpcId" : { "Ref" : "Vpc" }, + "Tags" : [ + { "Key" : "Network", "Value" : "Private" }, + { "Key" : "Name", "Value" : { "Ref" : "AWS::StackName" }} + ] + } + }, + "PrivateRoute" : { + "Type" : "AWS::EC2::Route", + "Properties" : { + "RouteTableId" : { "Ref" : "PrivateRouteTable" }, + "DestinationCidrBlock" : "0.0.0.0/0", + "NatGatewayId" : { "Ref" : "NATGateway" } + } + }, + "PrivateSubnetRouteAssociation" : { + "Type" : "AWS::EC2::SubnetRouteTableAssociation", + "Properties" : { + "SubnetId" : { "Ref" : "PrivateSubnet" }, + "RouteTableId" : { "Ref" : "PrivateRouteTable" } + } + } + }, + "Outputs" : { + "AdminSSHSecurityGroup" : { + "Description" : "Security Group that restricts Inbound IPs to SSH into the Master", + "Value" : { + "Ref" : "AdminSSHSecurityGroup" + } + }, + "MasterAutoScalingGroup" : { + "Description" : "Autoscaling Group that contains the Master Instance", + "Value" : { + "Ref" : "MasterAutoScalingGroup" + } + }, + "WorkerAutoScalingGroup" : { + "Description" : "Autoscaling Group that contains the Workers", + "Value" : { + "Ref" : "WorkerAutoScalingGroup" + } + }, + "MountTargetID" : { + "Description" : "EFS Mount target ID", + "Value" : { "Ref" : "MountTarget" } + } + } +} \ No newline at end of file diff --git a/examples/mxnet b/examples/mxnet new file mode 160000 index 0000000..3ba709d --- /dev/null +++ b/examples/mxnet @@ -0,0 +1 @@ +Subproject commit 3ba709d2c6059c1a070729bd70153c8554f9921e diff --git 
a/examples/tensorflow/LICENSE b/examples/tensorflow/LICENSE new file mode 100644 index 0000000..43fcf7b --- /dev/null +++ b/examples/tensorflow/LICENSE @@ -0,0 +1,203 @@ +Copyright 2016 The TensorFlow Authors. All rights reserved. + + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). 
+ + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. 
Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative 
Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. 
Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. 
+ + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright 2016, The Authors. + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
diff --git a/examples/tensorflow/cifar10_multi_machine_train.py b/examples/tensorflow/cifar10_multi_machine_train.py new file mode 100644 index 0000000..7b7d044 --- /dev/null +++ b/examples/tensorflow/cifar10_multi_machine_train.py @@ -0,0 +1,118 @@ +from __future__ import absolute_import +from __future__ import division +from __future__ import print_function + +from datetime import datetime +import sys +import os.path +sys.path.insert(0, os.path.dirname(__file__) + '/models/tutorials/image/cifar10/') + +import cifar10 +import re +import time +import argparse +import numpy as np +from six.moves import xrange # pylint: disable=redefined-builtin +import tensorflow as tf + + +FLAGS = tf.app.flags.FLAGS + +tf.app.flags.DEFINE_integer('max_steps', 1000000, + """Number of batches to run.""") +tf.app.flags.DEFINE_string("ps_hosts", "", + "Comma-separated list of hostname:port pairs") +tf.app.flags.DEFINE_string("worker_hosts", "", + "Comma-separated list of hostname:port pairs") + +tf.app.flags.DEFINE_string("job_name", "", "One of 'ps', 'worker'") + +tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job") + +tf.app.flags.DEFINE_string('train_dir', '/tmp/cifar10_train', + """Directory where to write event logs """ + """and checkpoint.""") + +def main(_): + + class _LoggerHook(tf.train.SessionRunHook): + """Logs loss and runtime.""" + + def begin(self): + self._step = -1 + + def before_run(self, run_context): + self._step += 1 + self._start_time = time.time() + return tf.train.SessionRunArgs(loss) # Asks for loss value. 
+ + def after_run(self, run_context, run_values): + duration = time.time() - self._start_time + loss_value = run_values.results + if self._step % 10 == 0: + num_examples_per_step = FLAGS.batch_size + examples_per_sec = num_examples_per_step / duration + sec_per_batch = float(duration) + + format_str = ('%s: step %d, loss = %.2f (%.1f examples/sec; %.3f ' + 'sec/batch)') + print (format_str % (datetime.now(), self._step, loss_value, + examples_per_sec, sec_per_batch)) + ps_hosts = FLAGS.ps_hosts.split(",") + worker_hosts = FLAGS.worker_hosts.split(",") + + # Create a cluster from the parameter server and worker hosts. + cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts}) + + # Create and start a server for the local task. + server = tf.train.Server(cluster, + job_name=FLAGS.job_name, + task_index=FLAGS.task_index) + + if FLAGS.job_name == "ps": + server.join() + elif FLAGS.job_name == "worker": + + # Assigns ops to the local worker by default. + with tf.device(tf.train.replica_device_setter( + worker_device="/job:worker/task:%d" % FLAGS.task_index, + cluster=cluster)): + + global_step = tf.contrib.framework.get_or_create_global_step() + + # Get images and labels for CIFAR-10. + images, labels = cifar10.distorted_inputs() + + # Build inference Graph. + logits = cifar10.inference(images) + + # Build the portion of the Graph calculating the losses. Note that we will + # assemble the total_loss using a custom function below. + loss = cifar10.loss(logits, labels) + + # Build a Graph that trains the model with one batch of examples and + # updates the model parameters. + train_op = cifar10.train(loss,global_step) + + # The StopAtStepHook handles stopping after running given steps. + hooks=[tf.train.StopAtStepHook(last_step=FLAGS.max_steps), _LoggerHook()] + + # The MonitoredTrainingSession takes care of session initialization, + # restoring from a checkpoint, saving to a checkpoint, and closing when done + # or an error occurs. 
+ with tf.train.MonitoredTrainingSession(master=server.target, + is_chief=(FLAGS.task_index == 0), + checkpoint_dir=FLAGS.train_dir, + save_checkpoint_secs=60, + hooks=hooks) as mon_sess: + while not mon_sess.should_stop(): + # Run a training step asynchronously. + # See `tf.train.SyncReplicasOptimizer` for additional details on how to + # perform *synchronous* training. + # mon_sess.run handles AbortedError in case of preempted PS. + mon_sess.run(train_op) + +if __name__ == "__main__": + if not tf.gfile.Exists(FLAGS.train_dir): + tf.gfile.MakeDirs(FLAGS.train_dir) + tf.app.run() diff --git a/examples/tensorflow/generate_trainer.py b/examples/tensorflow/generate_trainer.py new file mode 100644 index 0000000..33b3c92 --- /dev/null +++ b/examples/tensorflow/generate_trainer.py @@ -0,0 +1,86 @@ +import sys, getopt, os, argparse + +#parse arguments +def parse_args(): + parser = argparse.ArgumentParser(description='Run Benchmark on various imagenet networks using train_imagenet.py') + parser.add_argument('--trainer_script_dir', type=str, help='location where distributed trainer scripts should be stored, use a shared file system like efs',required=True) + parser.add_argument('--log_dir', type=str, default="/tmp/", help='location where the logs should be stored',required=False) + parser.add_argument('--workers_file_path', type=str, help='worker file path', required=True) + parser.add_argument('--worker_count', type=int, help='number of workers', required=True) + parser.add_argument('--worker_gpu_count', type=int, help='number of gpus on each worker to use', required=True) + parser.add_argument('--training_script', nargs='+', help = 'training script and its arguments, e.g.: --training_script cifar10_train.py --batch_size 8 --data_dir /myEFSVolume/data') + args, unknown = parser.parse_known_args() + args.training_script += unknown + args.training_script = ' '.join(args.training_script) + return args + +# generates a list of workers where the training will be run.
+# one worker per GPU +def get_worker_list(nodes, gpu_per_node): + lst = [] + for node in nodes: + for index in range(gpu_per_node): + port = str(2230 + (index%gpu_per_node)) + lst.append( node + ":" + port ) + return ','.join(lst) + +# generates a list of parameter servers +# one parameter server per node +def get_ps_list(nodes): + return ','.join( [n + ":2222" for n in nodes] ) + +#creates list of commands that has to be run on each node +def get_script(training_script, workers_list, ps_list, index, gpu_per_node, log_dir): + + script = 'source /etc/profile' + script += "\n\n" + + script += "CUDA_VISIBLE_DEVICES='' python " + training_script + " " \ + + "--ps_hosts=" + ps_list + " " \ + + "--worker_hosts=" + workers_list + " " \ + + "--job_name=ps " \ + + "--task_index=" + str(index) \ + + " > " + log_dir + "/ps" + str(index) \ + + " 2>&1" \ + + " &" + + script += "\n\n" + + for i in range(gpu_per_node): + script += "CUDA_VISIBLE_DEVICES='" + str(i) + "' " \ + + "python " + training_script + " " \ + + "--ps_hosts=" + ps_list + " " \ + + "--worker_hosts=" + workers_list + " " \ + + "--job_name=worker " \ + + "--task_index=" + str(index*gpu_per_node + i) \ + + " > "+ log_dir + "/worker" + str(index*gpu_per_node + i) \ + + " 2>&1" \ + + " &" + + script += "\n\n" + + return script + +def gen_scripts(training_script, nodes_file, trainer_script_dir, num_nodes, gpu_per_node, log_dir): + + with open(nodes_file, 'r') as f: + nodes = f.read().splitlines() + + workers_list = get_worker_list(nodes, gpu_per_node) + ps_list = get_ps_list(nodes) + + for index, host in enumerate(nodes): + script = get_script(training_script, workers_list, ps_list, index, gpu_per_node, log_dir) + file_name = trainer_script_dir + "/" + host + ".sh" + with open(file_name, "w") as sh_file: + sh_file.write(script) + +def main(): + args = parse_args() + if not os.path.exists(args.log_dir): + os.makedirs(args.log_dir) + gen_scripts(args.training_script, args.workers_file_path, args.trainer_script_dir, + 
args.worker_count, args.worker_gpu_count, args.log_dir) + +if __name__ == "__main__": + main() diff --git a/examples/tensorflow/models b/examples/tensorflow/models new file mode 160000 index 0000000..3be9ece --- /dev/null +++ b/examples/tensorflow/models @@ -0,0 +1 @@ +Subproject commit 3be9ece9574d7bac07704e43705741d9af1de1e6 diff --git a/images/Slide0.png b/images/Slide0.png new file mode 100644 index 0000000..7561771 Binary files /dev/null and b/images/Slide0.png differ diff --git a/images/Slide1.png b/images/Slide1.png new file mode 100644 index 0000000..798a7b1 Binary files /dev/null and b/images/Slide1.png differ diff --git a/images/Slide10.png b/images/Slide10.png new file mode 100644 index 0000000..3c6c3b1 Binary files /dev/null and b/images/Slide10.png differ diff --git a/images/Slide11.png b/images/Slide11.png new file mode 100644 index 0000000..aa141c5 Binary files /dev/null and b/images/Slide11.png differ diff --git a/images/Slide2.png b/images/Slide2.png new file mode 100644 index 0000000..46f582d Binary files /dev/null and b/images/Slide2.png differ diff --git a/images/Slide3.png b/images/Slide3.png new file mode 100644 index 0000000..e20e54b Binary files /dev/null and b/images/Slide3.png differ diff --git a/images/Slide4.png b/images/Slide4.png new file mode 100644 index 0000000..634ab53 Binary files /dev/null and b/images/Slide4.png differ diff --git a/images/Slide5.png b/images/Slide5.png new file mode 100644 index 0000000..c6b666b Binary files /dev/null and b/images/Slide5.png differ diff --git a/images/Slide6.png b/images/Slide6.png new file mode 100644 index 0000000..8c02f20 Binary files /dev/null and b/images/Slide6.png differ diff --git a/images/Slide7.png b/images/Slide7.png new file mode 100644 index 0000000..866aaf5 Binary files /dev/null and b/images/Slide7.png differ diff --git a/images/Slide8.png b/images/Slide8.png new file mode 100644 index 0000000..312ec1e Binary files /dev/null and b/images/Slide8.png differ diff --git 
a/images/Slide9.png b/images/Slide9.png new file mode 100644 index 0000000..a0b0c81 Binary files /dev/null and b/images/Slide9.png differ
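Note on examples/tensorflow/generate_trainer.py: the generated per-host scripts only work if each worker's --task_index points at that worker's own entry in --worker_hosts. The following is a minimal sketch of that mapping, restating get_worker_list and get_ps_list from the patch in condensed form. The helper name worker_task_index and the IP addresses are hypothetical, introduced here only for illustration; the port numbers (2230 and up for workers, 2222 for parameter servers) are taken from the patch.

```python
def get_worker_list(nodes, gpu_per_node):
    """One worker endpoint per GPU: ports 2230, 2231, ... on every node."""
    return ",".join(
        "%s:%d" % (node, 2230 + gpu)
        for node in nodes
        for gpu in range(gpu_per_node)
    )


def get_ps_list(nodes):
    """One parameter server per node, always on port 2222."""
    return ",".join(node + ":2222" for node in nodes)


def worker_task_index(node_index, gpu, gpu_per_node):
    """Hypothetical helper: the task index get_script assigns to a worker."""
    return node_index * gpu_per_node + gpu


# Hypothetical private IPs, for illustration only.
nodes = ["10.0.1.10", "10.0.1.11"]
gpu_per_node = 2

workers = get_worker_list(nodes, gpu_per_node).split(",")
ps_hosts = get_ps_list(nodes).split(",")

# Each worker's --task_index should select its own host:port entry.
for node_index, node in enumerate(nodes):
    for gpu in range(gpu_per_node):
        task = worker_task_index(node_index, gpu, gpu_per_node)
        assert workers[task] == "%s:%d" % (node, 2230 + gpu)

print(workers)
print(ps_hosts)
```

Because the worker list is built node by node and GPU by GPU, the index node_index * gpu_per_node + gpu lines up with list position as long as every node exposes the same number of GPUs. A cluster with mixed GPU counts would need a different indexing scheme.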