From d7a5500a91ca1c453cfdf406d24e1403c31686fa Mon Sep 17 00:00:00 2001
From: Ke Sang
Date: Tue, 21 May 2024 21:33:41 -0700
Subject: [PATCH] add torchbench for Distributed Shampoo Optimizer v2 (#2616)

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/2616

- No optimizer has been integrated into TorchBench so far. Distributed Shampoo is quite complicated and has a direct dependency on PyTorch, so adding it to TorchBench guardrails it against PyTorch 2.0 changes.
- This diff realizes that feature, specifically enabling Distributed Shampoo on TorchBench in eager mode. A follow-up diff will add the pt2 compile feature.
- Current design of the integration:
  -- Pick the Ads DHEN CMF 5x model, since CMF is a major MC model.
  -- Benchmark the optimizer stage alone rather than running a full e2e benchmark. The optimizer step itself is relatively lighter than fwd and bwd, so in an e2e run the optimizer-step results would be shadowed by the other stages (fwd, bwd), making the benchmark insensitive.
  -- Build on top of the original ads_dhen_5x pipeline, skip the fwd and bwd stages, and set up the Shampoo config inside the model __init__ stage.
  -- Distributed Shampoo includes a matrix root inverse computation. In production its frequency is controlled by precondition_frequency, so its share of the overall computation is negligible. For TorchBench we also skip it by advancing the iteration count to bypass the first root inverse computation, i.e. inside the _prepare_before_optimizer func.
  -- Eventually TorchBench does the following (see the sketch after the diff):
  1. Initialize the ads_dhen_cmf 5x model on a local GPU, preload the data, and run fwd and bwd.
  2. Change some Shampoo state variables (iteration step for preconditioning, etc.) to get the optimizer ready.
  3. Benchmark the optimizer with the TorchBench pipeline and return the results.

05/16:
- Update the diff given the Shampoo v2 impl.

Reviewed By: xuzhao9

Differential Revision: D51192560

fbshipit-source-id: 247dceec1587a837aa9ca128252c47e9e0cf42b7
---
 fbgemm_gpu/fbgemm_gpu/split_embedding_configs.py | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fbgemm_gpu/fbgemm_gpu/split_embedding_configs.py b/fbgemm_gpu/fbgemm_gpu/split_embedding_configs.py
index 5ceeef279..491d5e90d 100644
--- a/fbgemm_gpu/fbgemm_gpu/split_embedding_configs.py
+++ b/fbgemm_gpu/fbgemm_gpu/split_embedding_configs.py
@@ -30,6 +30,7 @@ class EmbOptimType(enum.Enum):
     PARTIAL_ROWWISE_LAMB = "partial_row_wise_lamb"
     ROWWISE_ADAGRAD = "row_wise_adagrad"
     SHAMPOO = "shampoo"  # not currently supported for sparse embedding tables
+    SHAMPOO_V2 = "shampoo_v2"  # not currently supported for sparse embedding tables
     MADGRAD = "madgrad"
     EXACT_ROWWISE_WEIGHTED_ADAGRAD = "exact_row_wise_weighted_adagrad"  # deprecated
     NONE = "none"
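
Sketch of the optimizer-step-only benchmarking flow described in the summary. This is a minimal illustration, not the actual TorchBench integration: it uses torch.optim.Adagrad as a stand-in for Distributed Shampoo and a toy linear model in place of the ads_dhen_cmf 5x pipeline, and the helper names prepare_before_optimizer and benchmark_optimizer_step are hypothetical, mirroring the _prepare_before_optimizer step described above.

import time

import torch


def prepare_before_optimizer(
    model: torch.nn.Module, optimizer: torch.optim.Optimizer
) -> None:
    # Run one fwd + bwd pass so gradients exist before the timed region.
    # The real integration would additionally advance Shampoo's iteration
    # count here so the first matrix root inverse computation is skipped.
    inputs = torch.randn(64, 128)
    loss = model(inputs).sum()
    loss.backward()


def benchmark_optimizer_step(
    optimizer: torch.optim.Optimizer, num_iters: int = 100
) -> float:
    # Time only optimizer.step(); fwd and bwd are deliberately excluded so
    # their cost does not shadow the optimizer stage.
    start = time.perf_counter()
    for _ in range(num_iters):
        optimizer.step()
    return (time.perf_counter() - start) / num_iters


if __name__ == "__main__":
    model = torch.nn.Linear(128, 16)
    # Stand-in optimizer; the actual benchmark constructs Distributed Shampoo.
    optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)
    prepare_before_optimizer(model, optimizer)
    avg_s = benchmark_optimizer_step(optimizer)
    print(f"avg optimizer.step() latency: {avg_s:.6f} s")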