Energy Draw Tracker

Track and monitor the energy draw of model-training and model-inference experiments on GPUs and CPUs.

Summary

The carbon footprint caused by the energy GPUs and CPUs consume during model training and model inference can be reduced if that consumption is properly tracked and measures are taken to reduce it. This tool monitors and logs GPU/CPU usage during model training and model inference.
Technical Overview

The Energy Draw Tool provides the following features:

- Easy-to-use plugin for all your experiments, requiring very few lines of code.
- Can be used as a callback in your TensorFlow/PyTorch/Keras experiments.
- Usage monitoring with support for multiple devices:
  - GPU
  - CPU: Intel and Apple silicon chips
- Tracks the combined energy draw of experiments run across distributed machines.
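To make the plugin idea concrete, here is a minimal sketch of what such a tracker could look like: sample power periodically while an experiment runs and integrate it into energy. The `EnergyTracker` name, its methods, and the constant 50 W reader are all illustrative assumptions, not the tool's actual API.

```python
import time

class EnergyTracker:
    """Hypothetical sketch of the plugin interface: sample power while an
    experiment runs and integrate it into total energy (joules)."""

    def __init__(self, read_power_watts):
        # read_power_watts: callable returning the device's current draw in W.
        self.read_power_watts = read_power_watts
        self.energy_joules = 0.0
        self._last = None

    def start(self):
        self._last = time.monotonic()

    def sample(self):
        # Energy accumulated = power x elapsed time since the previous sample.
        now = time.monotonic()
        self.energy_joules += self.read_power_watts() * (now - self._last)
        self._last = now

    def stop(self):
        self.sample()
        return self.energy_joules

# Usage with a stubbed, constant 50 W power reader:
tracker = EnergyTracker(read_power_watts=lambda: 50.0)
tracker.start()
time.sleep(0.1)          # stands in for a training or inference step
joules = tracker.stop()  # roughly 50 W x 0.1 s = ~5 J
```

A framework callback (e.g. a Keras `Callback`) could call `sample()` at epoch boundaries and `stop()` at the end of training.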
Rationale

The proposed design is chosen over other designs for a number of reasons:

- Some existing designs do not support the new Apple M-series Mac processors.
- Existing designs cannot monitor an experiment that runs across different machines; the proposed design reports a combined output from multiple machines.
- It is easier to use as a callback in your experiments.
- It can unify multiple solutions, so that more categories of devices can be supported.
Alternatives

One of the best alternative approaches to this design is CodeCarbon, but the following issues arise when running with CodeCarbon:

- CodeCarbon does not support the Apple M-series Mac processors.
- Running experiments on distributed machines requires the CodeCarbon API to be called on every single one of them.
Drawbacks

Implementation would require testing on multiple machines, so the cost of testing would be higher.
Useful References
What similar work have we already successfully completed?
Is this something that has already been built by others? No.
Is there useful academic literature or are there other articles related to this topic? (Provide links.)
Have we built a relevant prototype previously? No.
Do we have a rough mock for the UI/UX? No.
Do we have a schematic for the system? No.
Unresolved Questions
Parts of the System Affected
Future possibilities
Infrastructure

- Detect machine details
- Run energy tracking
- Log the output
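The machine-detection step could look like the following best-effort sketch, which uses only standard-library checks; the returned keys and the heuristic of treating `nvidia-smi` on `PATH` as evidence of an NVIDIA GPU are assumptions for illustration.

```python
import platform
import shutil

def detect_machine():
    """Best-effort sketch of the 'detect machine details' step
    (assumed behaviour, not the tool's actual implementation)."""
    return {
        "os": platform.system(),        # e.g. 'Linux', 'Darwin'
        "arch": platform.machine(),     # e.g. 'x86_64', 'arm64'
        "processor": platform.processor(),
        # Assumption: an NVIDIA GPU is reachable when nvidia-smi is on PATH.
        "nvidia_gpu": shutil.which("nvidia-smi") is not None,
        # Apple silicon identifies itself as Darwin on arm64.
        "apple_silicon": platform.system() == "Darwin"
                         and platform.machine() == "arm64",
    }

details = detect_machine()
```

The result can then be used to pick the right energy backend (e.g. NVML/nvidia-smi for NVIDIA GPUs, RAPL for Intel CPUs).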
Testing

The testing procedure can be done in the following steps:

- A TensorFlow example for model training
- A TensorFlow example for model inference
- A Talos example for hyperparameter tuning
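All three test steps could share one small harness that runs an experiment callable under the tracker and reports wall time and integrated energy. The harness below is a sketch under assumptions: the function name is hypothetical, and the constant 45 W reader stands in for a real power source.

```python
import time

def run_under_tracker(experiment, read_power_watts=lambda: 45.0):
    """Hypothetical test harness: run one experiment (training, inference,
    or tuning) and report elapsed time and integrated energy."""
    start = time.monotonic()
    experiment()                      # e.g. model.fit(...), model.predict(...)
    elapsed = time.monotonic() - start
    return {"seconds": elapsed, "joules": read_power_watts() * elapsed}

# Each test step plugs in its own experiment callable; a short sleep
# stands in for the TensorFlow/Talos example here:
report = run_under_tracker(lambda: time.sleep(0.05))
```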
Documentation

Describe the level of documentation that fulfilling this request involves. Consider both end-user documentation and developer documentation.

- End-user documentation
- Developer documentation
Version History

Version 0.0.1

Recordings

Work Phases

Non-Coding

Implementation
- API
  - GPU tracking, using the nvidia-smi command's features. References:
    - power draw callback
    - GpuStat
  - CPU tracking. References:
    - PyRAPL
    - EnergyUsage
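The two API backends referenced above could be sketched as follows. The `nvidia-smi` query flags (`--query-gpu=power.draw --format=csv,noheader,nounits`) and the Linux RAPL sysfs counter (`/sys/class/powercap/intel-rapl:0/energy_uj`, the same counter PyRAPL reads) are real interfaces; the function names and the graceful `None` fallbacks are assumptions for illustration.

```python
import subprocess
from pathlib import Path

def gpu_power_watts():
    """Current GPU draw in watts via nvidia-smi, or None when no NVIDIA
    GPU/driver is available."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=power.draw",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        return float(out.strip().splitlines()[0])
    except (OSError, subprocess.CalledProcessError, ValueError, IndexError):
        return None

RAPL_ENERGY = Path("/sys/class/powercap/intel-rapl:0/energy_uj")

def cpu_energy_joules():
    """Cumulative CPU package energy from the Intel RAPL counter
    (Linux only; may need elevated permissions), or None elsewhere."""
    try:
        return int(RAPL_ENERGY.read_text()) / 1e6  # microjoules -> joules
    except OSError:
        return None
```

On machines without the corresponding hardware, both helpers return `None`, so the tracker can skip unavailable backends instead of crashing.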
- Docker
- Distributed Run
- Logging
- Visualisation
- Documentation
Write end-user documentation as well as developer documentation.

- End-user documentation
- Developer documentation
Testing

All the testing can use the Bitcoin price prediction example.

- For model training:
- For model inference:
- For hyperparameter tuning (using Talos):
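The Distributed Run and Logging phases above could combine per-machine logs along these lines: each machine appends one JSON line per measurement, and the combined experiment total is their sum. The `{"host": ..., "joules": ...}` record format is an assumption for illustration, not the tool's actual schema.

```python
import json

def combine_logs(log_lines_per_machine):
    """Sum per-machine energy logs (lists of JSON lines) into a combined
    total and a per-host breakdown."""
    total = 0.0
    per_host = {}
    for lines in log_lines_per_machine:
        for line in lines:
            rec = json.loads(line)
            per_host[rec["host"]] = per_host.get(rec["host"], 0.0) + rec["joules"]
            total += rec["joules"]
    return total, per_host

# Two machines' logs for one distributed experiment:
machine_a = ['{"host": "a", "joules": 120.0}', '{"host": "a", "joules": 30.0}']
machine_b = ['{"host": "b", "joules": 200.0}']
total, per_host = combine_logs([machine_a, machine_b])  # total -> 350.0
```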