[RFC00101] Energy Draw Tracker #1

Open
54 tasks
abhijithneilabraham opened this issue Jul 7, 2022 · 0 comments

Energy Draw Tracker

Track and monitor the energy draw of model training and model inference experiments on GPUs and CPUs.

Summary

The carbon footprint caused by the energy consumption of GPUs and CPUs during model training and model inference can be reduced if it is properly tracked and measures are taken to reduce it. This tool will monitor and log GPU/CPU usage during model training and model inference.

Technical Overview

The Energy Draw Tool provides the following features:

  • Easy-to-use plugin that adds tracking to an experiment with very few lines of code.
  • Can be used as a callback in TensorFlow/PyTorch/Keras experiments.
  • Usage monitoring with support for multiple devices:
    • GPU
    • CPU:
      • Intel/Mac chips
  • Track the combined energy draw for experiments run on distributed machines.
  • Track:
    • Model training
    • Model inference
    • Hyperparameter tuning
  • Save emission details to a database or CSV files.
  • Use visualisation to view the emission statistics.
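As an illustration of the "very few lines of code" goal, here is a minimal sketch of what a tracker might look like. Nothing in it is part of the proposed API: `EnergyTracker`, `integrate_wh`, and the `read_power_watts` hook are hypothetical names, and the power reading is injected so the same class could sit on top of a GPU or CPU backend. Energy is obtained by integrating power samples over time with the trapezoidal rule.

```python
import time
from typing import Callable, List, Tuple


def integrate_wh(samples: List[Tuple[float, float]]) -> float:
    """Integrate (timestamp_seconds, power_watts) samples into watt-hours
    using the trapezoidal rule."""
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += 0.5 * (p0 + p1) * (t1 - t0)
    return joules / 3600.0  # 1 Wh = 3600 J


class EnergyTracker:
    """Hypothetical tracker: records one power sample on entry, one on exit,
    and one each time sample() is called in between."""

    def __init__(self, read_power_watts: Callable[[], float],
                 clock: Callable[[], float] = time.monotonic):
        self._read = read_power_watts  # assumed backend hook, e.g. GPU or CPU
        self._clock = clock
        self.samples: List[Tuple[float, float]] = []

    def sample(self) -> None:
        self.samples.append((self._clock(), self._read()))

    def __enter__(self) -> "EnergyTracker":
        self.samples.clear()
        self.sample()
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.sample()
        return False  # never swallow exceptions from the experiment

    @property
    def energy_wh(self) -> float:
        return integrate_wh(self.samples)
```

In a framework callback, `sample()` could be called from a per-epoch or per-batch hook, with the training loop wrapped in `with EnergyTracker(...) as tracker:`. How often to sample is a design choice: more samples give a better energy estimate at the cost of more overhead.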

Alternatives

Rationale

The proposed design was chosen over other designs for several reasons:

  • Some existing designs do not support the new Apple M-series Mac processors.
  • Existing designs do not monitor a single experiment that runs across several machines. The proposed design will report a combined output from multiple machines.
  • It is easier to use as a callback in experiments.
  • It can unify multiple solutions, so that more categories of devices can be supported.

One of the best alternatives to this design is CodeCarbon, but the following issues arise when using it:

  • CodeCarbon does not support Apple M-series Mac processors.
  • Running experiments on distributed machines requires the CodeCarbon API to be called on every one of them.

Drawbacks

  • Implementation requires testing on multiple machines, which raises the cost of testing.

Useful References

  • What similar work have we already successfully completed?

  • Is this something that has already been built by others? : No

  • Are there useful academic literature or other articles related with this topic? (provide links)

  • Have we built a relevant prototype previously? : No

  • Do we have a rough mock for the UI/UX? : No

  • Do we have a schematic for the system? : No

Unresolved Questions

  • What is there that is unresolved (and will be resolved as part of fulfilling this request)?
    • The impact of machine learning experiments on climate change is currently unmeasured. Measuring it will be addressed as part of fulfilling this request.
  • Are there other requests with same or similar problems to solve? : No

Parts of the System Affected

  • Which parts of the current system are affected by this request? : None
  • What other open requests are closely related with this request? : None
  • Does this request depend on fulfillment of any other request? : None
  • Does any other request depend on the fulfillment of this request? : None

Future possibilities

  • The API could be extended to adopt reduction strategies for energy consumption.
  • The system could be used widely alongside many other ML tools to help track energy.
  • It could provide energy reduction strategies aligned with other ML tools and their implementation methodology.

Infrastructure

  • Detect machine details:

    • Check whether a CPU or a GPU is used.
    • Check the processor (Intel/Mac).
    • Check whether distributed machines are being used.
  • Run energy tracking

    • Build separate APIs that together cover all supported machines and processors.
    • If distributed machines are being used, automatically add the tracker callback to the scripts run on them.
  • Logging the output

    • The output could be logged and viewed as
      • CSV files
      • Databases
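The "detect machine details" step could be sketched with the standard library alone. The names `MachineInfo`, `classify_cpu`, and `detect_machine` are hypothetical. The sketch assumes that an `nvidia-smi` binary on the `PATH` indicates an Nvidia GPU. It only distinguishes Apple silicon from x86, because the architecture string alone cannot tell Intel from other x86 vendors. Detecting distributed runs is left out, since it depends on the launcher in use.

```python
import platform
import shutil
from dataclasses import dataclass


@dataclass
class MachineInfo:
    system: str          # e.g. "Linux", "Darwin", "Windows"
    machine: str         # e.g. "x86_64", "arm64"
    cpu_family: str      # "apple-silicon", "x86", or "unknown"
    has_nvidia_gpu: bool


def classify_cpu(system: str, machine: str) -> str:
    """Map platform strings to a coarse CPU family.

    "x86" covers Intel and other x86 vendors; telling them apart would need
    a vendor lookup that this sketch does not attempt.
    """
    machine = machine.lower()
    if system == "Darwin" and machine == "arm64":
        return "apple-silicon"
    if machine in ("x86_64", "amd64", "i386", "i686"):
        return "x86"
    return "unknown"


def detect_machine() -> MachineInfo:
    system = platform.system()
    machine = platform.machine()
    return MachineInfo(
        system=system,
        machine=machine,
        cpu_family=classify_cpu(system, machine),
        # Assumption: nvidia-smi being installed implies a usable Nvidia GPU.
        has_nvidia_gpu=shutil.which("nvidia-smi") is not None,
    )
```

A tracker could use the result to pick a backend, for example the GPU reader when `has_nvidia_gpu` is true and a CPU reader otherwise.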

Testing

The testing procedure can be done in the following steps:

  • A TensorFlow example for model training:

    • Run the API for tracking with GPU, and log the monitored output to a CSV file.
    • CPU tracking API:
      • Run the API for tracking with Intel-based processors, and log the monitored output to a CSV file.
      • Run the API for tracking with M-series Mac processors, and log the monitored output to a CSV file.
  • A TensorFlow example for model inference:

    • Run the API for tracking with GPU, and log the monitored output to a CSV file.
    • CPU tracking API:
      • Run the API for tracking with Intel-based processors, and log the monitored output to a CSV file.
      • Run the API for tracking with M-series Mac processors, and log the monitored output to a CSV file.
  • A Talos example for hyperparameter tuning:

    • Run the API for tracking with GPU, and log the monitored output to a CSV file.
    • CPU tracking API:
      • Run the API for tracking with Intel-based processors, and log the monitored output to a CSV file.
      • Run the API for tracking with M-series Mac processors, and log the monitored output to a CSV file.

Documentation

Describe the level of documentation fulfilling this request involves. Consider both end-user documentation and developer documentation.

  • End User Documentation:

    • Introduction
      • Mission
      • Summary
      • Frequently Asked Questions
    • Getting Started
      • Installation
      • Quickstart
      • Examples
    • Logging
      • Output logging
      • Visualisation
  • Developer Documentation

    • Tracker
      • API reference for tracking with GPU
      • API reference for tracking with CPU:
        • API reference for tracking with Intel-based processors
        • API reference for tracking with M-series Mac processors
    • Logging
      • API reference for logging to CSV files
      • API reference for logging to databases
      • API reference for visualisation tools

Version History.

Version 0.0.1

Recordings.

Work Phases.

Non-Coding.

  • Planning
  • Documentation
  • Prototype Release
  • Testing

Implementation.

API

  • Build an API for tracking with Nvidia GPU devices, using the features of the nvidia-smi command.
    • Build callbacks.
    • Build an API plugin to use with Python.

References:
  • power draw callback
  • GpuStat
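One way to read GPU power through `nvidia-smi` is its CSV query mode. The sketch below is an assumption about how the GPU backend might work, not the issue's chosen design, and `parse_power_draw` and `read_gpu_power_watts` are hypothetical names. It calls `nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits`, which prints one power reading in watts per GPU, and keeps the parsing separate from the subprocess call so it can be tested without a GPU.

```python
import math
import subprocess
from typing import List

# Query the instantaneous power draw (in watts) of every visible GPU.
QUERY = ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"]


def parse_power_draw(output: str) -> List[float]:
    """Parse one reading per line; non-numeric values become NaN.

    Some GPUs do not report power, in which case nvidia-smi prints a
    placeholder instead of a number.
    """
    readings = []
    for line in output.splitlines():
        value = line.strip()
        if not value:
            continue
        try:
            readings.append(float(value))
        except ValueError:
            readings.append(math.nan)
    return readings


def read_gpu_power_watts() -> List[float]:
    """Run nvidia-smi once and return per-GPU power draw in watts."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_power_draw(result.stdout)
```

Polling a subprocess is simple but has noticeable overhead per sample, so a production version might prefer NVML bindings. That choice is outside what this issue specifies.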

  • Build an API for tracking with CPU (Intel/Mac).
    • Build callbacks.
    • Build an API plugin to use with Python.

References:
  • PyRAPL
  • EnergyUsage
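For Intel CPUs on Linux, tools such as PyRAPL build on the RAPL energy counters that the kernel exposes under `/sys/class/powercap`. The following is a rough sketch of that idea rather than PyRAPL's actual API, and the function names are hypothetical. It assumes the package domain lives at `intel-rapl:0`, which may differ per machine, and that the process is allowed to read the counter, which on recent kernels often requires elevated privileges. Apple M-series chips do not expose RAPL and would need a different backend.

```python
from pathlib import Path

# Assumed location of the first CPU package domain; varies by machine.
RAPL_DOMAIN = Path("/sys/class/powercap/intel-rapl:0")


def read_counter_uj(domain: Path = RAPL_DOMAIN) -> int:
    """Read the cumulative energy counter in microjoules."""
    return int((domain / "energy_uj").read_text().strip())


def read_max_range_uj(domain: Path = RAPL_DOMAIN) -> int:
    """Read the value at which the counter wraps back to zero."""
    return int((domain / "max_energy_range_uj").read_text().strip())


def delta_uj(start: int, end: int, max_range: int) -> int:
    """Energy used between two counter reads, allowing for one wraparound."""
    if end >= start:
        return end - start
    return (max_range - start) + end


def uj_to_wh(microjoules: float) -> float:
    """Convert microjoules to watt-hours (1 Wh = 3600 J)."""
    return microjoules / 1e6 / 3600.0
```

A tracker would read the counter before and after the experiment and pass both values to `delta_uj`. For long runs the counter can wrap more than once, so it would need to be read periodically.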

Docker

  • Write a Dockerfile and upload the image to Docker Hub.

Distributed Run

  • Track scripts running on distributed machines.
    • Add support for energy tracking for hyperparameter tuning using Jako
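The combined figure for a distributed experiment is the sum of what each machine reports. The sketch below shows only that aggregation step. The record shape `(experiment_id, host, energy_wh)` and the name `combine_energy` are assumptions made for illustration, and how workers deliver their records (shared database, files, or messages) is left open.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record shape: (experiment_id, host, energy_wh)
Record = Tuple[str, str, float]


def combine_energy(records: Iterable[Record]) -> Dict[str, float]:
    """Sum per-host energy readings into one total per experiment."""
    totals: Dict[str, float] = defaultdict(float)
    for experiment_id, _host, energy_wh in records:
        totals[experiment_id] += energy_wh
    return dict(totals)
```

For example, two workers reporting 1.5 Wh and 2.5 Wh for the same experiment would yield a combined total of 4.0 Wh for it.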

Logging

  • Write an API to log energy output to a CSV file. Columns include timestamp, start time, end time, energy in Wh (watt-hours), device type, and processor type.
  • Write an API to log output to a Postgres database. Columns include timestamp, start time, end time, energy in Wh (watt-hours), device type, and processor type.
  • Add a Hasura API to manage the Postgres database.
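The CSV logger could be a thin wrapper around the standard `csv` module. In this sketch the column identifiers are assumptions derived from the column list above, with energy stored in watt-hours, and `append_record` is a hypothetical helper. It writes to any text stream, so the same function works for a file opened in append mode or for an in-memory buffer.

```python
import csv
import io
from typing import Dict, TextIO

# Assumed column identifiers, following the columns named in this section.
FIELDS = [
    "timestamp",
    "start_time",
    "end_time",
    "energy_wh",
    "device_type",
    "processor_type",
]


def append_record(stream: TextIO, record: Dict[str, object],
                  write_header: bool = False) -> None:
    """Write one tracking record as a CSV row, optionally preceded by a header."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    if write_header:
        writer.writeheader()
    writer.writerow(record)


# Usage: log one record to an in-memory buffer.
buffer = io.StringIO()
append_record(
    buffer,
    {
        "timestamp": "2022-07-07T12:00:00",
        "start_time": "2022-07-07T11:00:00",
        "end_time": "2022-07-07T12:00:00",
        "energy_wh": 12.5,
        "device_type": "GPU",
        "processor_type": "Nvidia",
    },
    write_header=True,
)
```

When writing to a real file, opening it with `newline=""` is what the `csv` module documentation recommends to avoid extra blank lines on some platforms.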

Visualisation

  • Write APIs for visualisation using Plotly/Dash and/or Metabase, based on the logged output from the CSV files or the database.

Documentation.

Write End User documentation, as well as Developer documentation.

  • End User Documentation:

    • Introduction
      • Mission
      • Summary
      • Frequently Asked Questions
    • Getting Started
      • Installation
      • Quickstart
      • Examples
    • Logging
      • Output logging
      • Visualisation
  • Developer Documentation

    • Tracker
      • API reference for tracking with GPU
      • API reference for tracking with CPU:
        • API reference for tracking with Intel-based processors
        • API reference for tracking with M-series Mac processors
    • Logging
      • API reference for logging to CSV files
      • API reference for logging to databases
      • API reference for visualisation tools

Testing

All testing can use the Bitcoin price prediction example.

  • For model training:

    • Run the API for tracking with GPU, and log the monitored output to a CSV file.
    • CPU tracking:
      • Run the API for tracking with Intel-based processors, and log the monitored output to a CSV file.
      • Run the API for tracking with M-series Mac processors, and log the monitored output to a CSV file.
  • For model inference:

    • Run the API for tracking with GPU, and log the monitored output to a CSV file.
    • CPU tracking:
      • Run the API for tracking with Intel-based processors, and log the monitored output to a CSV file.
      • Run the API for tracking with M-series Mac processors, and log the monitored output to a CSV file.
  • For hyperparameter tuning (using Talos):

    • Run the API for tracking with GPU, and log the monitored output to a CSV file.
    • CPU tracking:
      • Run the API for tracking with Intel-based processors, and log the monitored output to a CSV file.
      • Run the API for tracking with M-series Mac processors, and log the monitored output to a CSV file.