Summary
The FBGEMM and FBGEMM_GPU projects have been open source (OSS) for a while, but the CI infrastructure around the OSS projects has been under-maintained and unreliable, making it difficult for external users to use and contribute back to the projects. Over the past 3 months, we have made significant changes and fixes to the infrastructure around the build, test, and release processes (also known as workflows) for the OSS side of FBGEMM and FBGEMM_GPU. Our OSS workflow improvement efforts focused on six key areas: build environment reproducibility, build consistency and completeness, build speed, new build targets support, build tooling, and documentation + community engagement. This has provided FBGEMM and FBGEMM_GPU developers, both internal and external, with massive productivity gains and significantly reduced the turnaround time for feedback when pushing PRs out on GitHub (from 3.5 hours down to 30 minutes of wall clock time).
Introduction
The FBGEMM OSS repository houses two projects: FBGEMM and FBGEMM_GPU. FBGEMM is a standalone, high-performance, low-precision math kernel library designed from the ground up for quantized inference on current-generation CPUs. FBGEMM_GPU is a library that extends PyTorch with specialized and performant GPU primitives for deep learning recommendation models (DLRMs). FBGEMM_GPU depends in part on FBGEMM and supports close to 100 operators that generally outperform their PyTorch counterparts for DLRMs.
The FBGEMM project has a rather straightforward workflow, since there is only one build target for each platform, and there is no workflow to publish the builds to a repository. The FBGEMM CI runs builds for the Linux, Mac OS X, and Windows platforms, though tests are run only on Linux and Windows, because Mac OS X does not support AVX2.
The FBGEMM_GPU project has a much more complicated set of workflows stemming from 3 aspects of the project:
There are three sets of workflows: the CI jobs that build and run tests on PRs, the nightly release jobs that build the top of tree against PyTorch nightly and publish the artifacts to PyPI, and the version release jobs that do the same but for PyTorch version releases. Both release jobs have three parts: building the artifact, installing and testing the artifact, and publishing the artifact.
The FBGEMM_GPU package comes in 3 variants: the CUDA variant, ROCm variant, and CPU-only variant. For the CI workflows, all 3 variants are built and tested, while for the nightly and version release workflows, only the CUDA and CPU-only variants are supported at the moment.
FBGEMM_GPU has to support multiple versions of Python, CUDA, and ROCm. The Python organization actively supports the latest 4 versions of Python. The latest version of PyTorch supports 2 minor versions of CUDA and 2 major versions of ROCm. In addition, support for older Linux distributions is required, as we have users known to be running on CentOS 7 and Ubuntu 20.04.
Combining these aspects, the CI maintenance and support table for FBGEMM_GPU looks as follows:
| Variant | Variant Version | PyTorch 1.13 | PyTorch 2.0 | PyTorch Nightly |
|---------|-----------------|--------------|-------------|-----------------|
| CPU     | *               | Yes[4]       | Yes[1]      | Yes[1]          |
| CUDA    | 11.7.1          | Yes[4]       | Yes[1]      | Yes[1]          |
| CUDA    | 11.8            | Yes[4]       | Yes[1]      | Yes[1]          |
| CUDA    | 12              | No[2]        | No[2]       | No[2]           |
| ROCm    | 5.3             | Yes[4]       | Yes[4]      | Yes[3]          |
| ROCm    | 5.4.2           | No           | Yes[4]      | Yes[3]          |
[1] Support for Python 3.8, 3.9, 3.10, and 3.11. Includes nightly and version release workflows.
[2] CUDA 12 support is unstable at the moment.
[3] Support for Python 3.8, 3.9, and 3.10 only. No nightly or version release workflows.
[4] Not actively maintained; previously actively tested prior to the PyTorch 2.0 release.
OSS Build Improvements
The aforementioned challenges with maintaining OSS FBGEMM_GPU required a major revamp of the existing workflows, which we discuss in detail here.
Build Environment Reproducibility
Due to the diversity of workflows for OSS FBGEMM_GPU, different workflows run on different types of compute environments, each of which requires its own unique setup. For example, the FBGEMM_GPU CUDA test jobs require GPU instances provided by the PyTorch test infrastructure to run successfully, while builds of the CUDA variant can be run on vanilla GitHub Actions instances. Because the workflows implicitly depend on the tools available in the environment they run in, and these dependencies are not fully cataloged, there have been many instances where the same workflow step (e.g. an FBGEMM_GPU build invocation) runs successfully in one environment but fails in another. Furthermore, the compute environments are not directly accessible, which makes debugging these issues very difficult.
To systematically address this issue, we have implemented a “two-level containerization” scheme for all of the FBGEMM_GPU workflows. In the first level, the workflows are containerized into a Docker environment. Specifically, all the build workflows and most of the test workflows are containerized under the amazonlinux:2023 image, while artifact installation tests are carried out in both amazonlinux:2023 and ubuntu:20.04 images for wider test coverage (see here and here for workflows). For the ROCm builds, we use the rocm/dev-ubuntu-20.04 images provided by AMD.
In the second level, all workflows are run within a Conda environment inside the container. The purpose of this is to standardize and pin the Python and C++ toolchain required for FBGEMM_GPU builds, which is difficult to do with the system package manager since it generally provides only a single version of each tool. The Conda environment also allows us to install specific versions of CUDA and PyTorch, which is exactly what we need to make the build process reproducible.
The primary advantage of the two-level containerization scheme is that it allows us to reproduce the exact workflow environment that is present on the GitHub workers on a local machine, thus giving us a fast feedback loop for investigating and debugging build issues. However, not all FBGEMM_GPU build failures can be solved by containerization. In particular, the CUDA test workflows require that the instances have proper NVIDIA drivers installed, and GPU instances that we use are not prepared with drivers installed by default. For these use cases, we modified the jobs to install the NVIDIA drivers properly prior to executing the workflows, using scripts maintained by the PyTorch test infrastructure team (#1684).
Build Consistency and Completeness
With a reproducible build environment in place, we proceeded to address the various FBGEMM_GPU artifact build and installation issues that developers and users have consistently run into. Some of the issues we have encountered include:
Test Stability: A number of Python tests specify their hardware requirements incorrectly (CPU-only vs GPU), causing tests to fail. To address this, we have fixed the resource requirement annotations on these tests ([T145005253] Make Tests More Stable #1606).
Linux Distributions Compatibility: Older Linux distributions such as CentOS 7 and Ubuntu 20.04 are packaged with older versions of GLIBC and GLIBCXX, which are not forward compatible. To make FBGEMM_GPU work with these older systems, we have fixed the builds to use an older version of gcc (version 10) ([T148256315] Add support for building FBGEMM_GPU against Python 3.11 in OSS #1646).
PyTorch Variants: The FBGEMM_GPU CUDA variant is built against PyTorch nightly, which was installed through Conda. Because PyTorch-CUDA nightly releases can be inconsistent, Conda would sometimes fall back to installing PyTorch-CPU, which breaks the setup for an FBGEMM_GPU CUDA build. To work around this, we have updated our workflows to install PyTorch nightly through pip ([T145005253] Containerize the remaining FBGEMM_GPU CI jobs #1658).
Broken Dependencies: Build dependencies such as the PyTorch CUDA nightly releases are known to be occasionally broken, even when they can be fetched and installed, which results in very cryptic build errors further down the pipeline. To work around this, we have added basic checks when installing dependencies to ensure that certain header files are present, certain Python imports work, etc. ([T145005253] Fix the build scripts infrastructure #1589); a minimal sketch of such checks appears after this list.
Artifact Loading: Python import errors appear when either the FBGEMM_GPU artifact contains undefined symbols, or when its dependencies are not properly installed. The merge_pooled_embeddings operators, for example, were missing from the artifact for a long time. Setting up the build environment properly fixes this problem, along with adequate documentation on the dependencies required for installation ([T146787123] Re-enable compilation of merge_pooled_embeddings operator in OSS #1621).
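To make these fixes concrete, below is a minimal, hypothetical sketch of the kind of post-install sanity checks described above. It is not the actual FBGEMM_GPU tooling; it simply verifies that the installed PyTorch is the intended CUDA variant, that fbgemm_gpu imports cleanly, and that a representative operator (assumed here to be registered under the torch.ops.fbgemm namespace) is present.

```python
# Hypothetical post-install sanity checks, in the spirit of the fixes above.
# The actual checks in the FBGEMM_GPU build scripts differ in detail.
import importlib
import sys


def check_pytorch_variant(expect_cuda: bool) -> None:
    # A CPU-only PyTorch fallback silently breaks a CUDA build of FBGEMM_GPU,
    # so verify that torch.version.cuda is populated when CUDA is expected.
    import torch
    if expect_cuda and torch.version.cuda is None:
        sys.exit("Installed PyTorch is the CPU variant; expected a CUDA build")


def check_fbgemm_gpu_loads() -> None:
    # Importing fbgemm_gpu loads the compiled extension; undefined symbols or
    # missing dependencies typically surface here as an ImportError / OSError.
    try:
        importlib.import_module("fbgemm_gpu")
    except Exception as e:
        sys.exit(f"fbgemm_gpu failed to load: {e}")


def check_operator_registered(op_name: str = "merge_pooled_embeddings") -> None:
    # Spot-check that a representative operator made it into the artifact.
    import torch
    try:
        getattr(torch.ops.fbgemm, op_name)
    except Exception as e:
        sys.exit(f"Operator fbgemm::{op_name} is not registered: {e}")


if __name__ == "__main__":
    check_pytorch_variant(expect_cuda=True)
    check_fbgemm_gpu_loads()
    check_operator_registered()
    print("All post-install checks passed")
```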
Since addressing these issues, the FBGEMM_GPU nightly and version release workflows have been operating very reliably.
Build Speed
At the beginning of the project, the average wall clock time for an FBGEMM + FBGEMM_GPU CI run was over 3 hours, and more often than not, runs would exceed the 6-hour limit allotted to GitHub workflows. Consequently, lowering the wall clock time for a CI run became a primary goal of the OSS build improvements effort.
We observed the following issues that altogether slowed down the builds:
The FBGEMM workflows built and tested the static and shared-linking versions of the libraries in sequence, even though the build artifacts were independent.
Both FBGEMM and FBGEMM_GPU workflows involve compiling many static and generated source files, which can be done in parallel, but parallelization was constrained by the underpowered compute instances (2-core machines).
A certain test in Jagged Tensor Ops runs without issues under Buck and OSS builds, but somehow hangs when run in the GitHub Actions environment.
For the FBGEMM_GPU ROCm jobs, one of the two instances provided by AMD contained an expired GitHub access token, effectively rendering the instance unusable.
The FBGEMM_GPU workflows would fail very late in the test process if the build failed.
Running CI workflows were not being canceled when new commits were pushed to a PR, causing unnecessary queueing for compute resources.
These issues were eventually resolved by rewriting the entire set of GitHub workflow definitions to be more parallelized, use higher-end compute instances (12-core and larger machines), cancel in-progress runs when a PR is updated, and fail early (#1589, #1646). As of today, barring the wait times for acquiring GPU compute instances, the overall CI run for every PR completes in under 30 minutes.
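As a rough illustration of the parallelism point above, the sketch below (not the actual FBGEMM_GPU build scripts) shows how a build step might scale its number of compile jobs with the host machine; the per-job memory figure is an assumption rather than a measured requirement.

```python
# Illustrative sketch only: choose a compile-parallelism level from the host's
# core count, optionally capped by available memory, since CUDA compilation
# jobs can be memory-hungry.  The 4 GB-per-job figure is an assumption.
import os


def max_parallel_jobs(mem_per_job_gb: float = 4.0) -> int:
    cpu_jobs = os.cpu_count() or 1
    try:
        # /proc/meminfo reports MemAvailable in kB on Linux.
        with open("/proc/meminfo") as f:
            mem_kb = next(
                int(line.split()[1]) for line in f if line.startswith("MemAvailable")
            )
        mem_jobs = max(1, int(mem_kb / (mem_per_job_gb * 1024 * 1024)))
    except (OSError, StopIteration, ValueError, IndexError):
        mem_jobs = cpu_jobs
    return max(1, min(cpu_jobs, mem_jobs))


if __name__ == "__main__":
    # The result would typically be forwarded to the native build,
    # e.g. as a -j / --parallel flag.
    print(max_parallel_jobs())
```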
New Build Targets Support
As part of the OSS build improvements, we have upgraded and increased our build support for more software stack targets:
C++: As of now, both FBGEMM and FBGEMM_GPU fully support building against C++17 ([T148369035] Add C++17 Support to FBGEMM and FBGEMM_GPU OSS builds #1652). As PyTorch officially adopts C++20 support, we will do the same.
Python: For FBGEMM_GPU, we have removed support for building against Python 3.7, as it has reached EOL. In addition to 3.8, 3.9, and 3.10, we have now added Python 3.11 support ([T148256315] Add support for building FBGEMM_GPU against Python 3.11 in OSS #1646).
CUDA: For FBGEMM_GPU, building against CUDA 11.8 is now supported in addition to 11.7, which is in line with what PyTorch supports ([T145005253] Improvements to OSS builds and the Release Process #1627).
ROCm: The OSS support for building the ROCm variant of FBGEMM_GPU had previously been flaky. Now, the CI infrastructure supports building FBGEMM_GPU against both ROCm 5.3 and 5.4, in line with what PyTorch currently supports ([T149188477] Fix the ROCm Build and Test Jobs #1668).
Build Tooling
The build steps for FBGEMM_GPU OSS are very complicated, and as we worked on the build improvements, we came to identify and understand many of the issues and gotchas that have appeared over time. To help us run FBGEMM_GPU OSS workflows at scale, we have built an infrastructure of build scripts dedicated to setting up and running FBGEMM_GPU builds and tests. These scripts go a long way toward ensuring the reliability and correctness of the build and test process, with the following features (a minimal sketch of the retry and pre-/post-step check patterns appears after this list):
Retries (for long-running commands such as CUDA installation)
Pre-step checks (e.g. check that the compiler is in the PATH)
Post-step checks (e.g. check that the installed PyTorch corresponds to the intended build variant)
Filepath, binary and Python library import checks
Setting the correct flags and environment variables for builds
Symbols checking
Logging
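The sketch below illustrates, in Python rather than in the scripts' own form, the retry wrapper and pre-/post-step check patterns listed above; the specific command, compiler name, and artifact path checks are illustrative assumptions, not the real implementation.

```python
# Illustrative sketch only: the actual FBGEMM_GPU build scripts differ in
# detail.  This shows the retry plus pre-/post-step check pattern.
import shutil
import subprocess
import sys
import time


def run_with_retries(cmd, attempts=3, delay_s=30.0):
    # Long-running steps such as the CUDA installation occasionally fail
    # transiently, so retry them a few times before giving up.
    for attempt in range(1, attempts + 1):
        if subprocess.run(cmd).returncode == 0:
            return
        print(f"[attempt {attempt}/{attempts}] failed: {' '.join(cmd)}")
        time.sleep(delay_s)
    sys.exit(f"Command failed after {attempts} attempts: {' '.join(cmd)}")


def pre_step_check_compiler(compiler="c++"):
    # Pre-step check: the C++ compiler must be discoverable in the PATH
    # before a build is attempted.
    if shutil.which(compiler) is None:
        sys.exit(f"{compiler} not found in PATH")


def post_step_check_artifact(path):
    # Post-step check: the expected build artifact must exist and be non-empty.
    import os
    if not os.path.isfile(path) or os.path.getsize(path) == 0:
        sys.exit(f"Expected build artifact is missing or empty: {path}")
```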
We have also updated the build scripts to augment the version release process by automatically versioning FBGEMM_GPU releases based on git tags. Under this scheme, maintainers can quickly make FBGEMM_GPU releases simply by tagging a commit in the repository tree and selecting that tag when kicking off the version release workflow.
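As an illustration of this tag-based versioning scheme, here is a hypothetical sketch of deriving a version string from git tags; the actual FBGEMM_GPU release tooling may compute and format the version differently.

```python
# Hypothetical sketch of tag-based versioning: derive the package version
# from the most recent git tag.
import subprocess


def version_from_git_tag(default: str = "0.0.0") -> str:
    try:
        # e.g. "v1.2.3" on a tagged commit, or "v1.2.3-12-gabc1234" past it
        described = subprocess.check_output(
            ["git", "describe", "--tags"], text=True
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        return default
    return described.lstrip("v")


if __name__ == "__main__":
    # The resulting string could then be passed to the packaging step,
    # e.g. as the version argument to setuptools.setup().
    print(version_from_git_tag())
```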
Documentation + Community Engagement
In combination with the work on the build scripts infrastructure, we have also encoded our exhaustive knowledge of the FBGEMM_GPU OSS build process into extensive documentation, which can be found linked in the overhauled project README (#1639, #1695, #1697).
Finally, we have also set up both a Discussions Page on GitHub, as well as an #fbgemm channel on PyTorch Slack to better facilitate OSS community engagement and external user feedback solicitation around the two projects.
Future Steps
The OSS workflow improvements have increased the development velocity of the FBGEMM and FBGEMM_GPU projects by providing developers with fast and reliable workflows that emit trustworthy (true negative) signals. For example, we were recently able to quickly identify code changes that broke the FBGEMM_GPU ROCm variant build against ROCm 5.3. More importantly, these improvements bring value to the overall OSS health of the projects by providing the support infrastructure needed to encourage external developers to contribute.
The workflow infrastructure improvements are just the beginning; we are currently investigating multiple areas of improvement, including:
Reduction of build times by splitting the Jinja template materializations. Preliminary work has shown that FBGEMM_GPU CUDA variant build times can be cut down by another 2-3 minutes.
Reduction of build artifact sizes. This is an issue for OSS package distribution because our artifact uploads surpass the file size limits imposed by PyPI.
Customizability of builds. Build and distribute FBGEMM_GPU OSS with a default set of configurations, but offer users the ability to build the different flavors of FBGEMM_GPU with a custom desired set of optimizers.
Adoption of a more scalable build system such as Bazel to reduce build times and offer a nicer interface for building FBGEMM_GPU variants and flavors.
CUDA 12 support for FBGEMM_GPU, to align with PyTorch CUDA support
Support for the NVIDIA H100
Support for building FBGEMM_GPU on ARM CPU
Adding documentation
We plan to discuss progress in these areas in future Notes.
Acknowledgments
This long list of improvements would not have been possible without the help and contributions from our collaborators, whom we would like to deeply thank.
We also appreciate support from AMD for our improved ROCm testing story.