ModernBERT bug fixes #35404

warner-benjamin · 2024-12-23T20:20:44Z

This PR fixes a few issues with the ModernBERT implementation and some typos in the docs.

First, on the Flash Attention 2 path, ModernBERT always returned repadded logits without gradients (see #35386). This PR returns padded logits with gradients if no labels are passed to ModernBertForMaskedLM or if repad_logits_with_grad is set to True. By default, we don't keep the gradient when repadding after the internal loss function has been used, which saves memory.
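As a rough illustration of the rule described above, here is a minimal pure-Python sketch. The helper names (keep_grad_when_repadding, repad) are hypothetical and not part of transformers; the real code operates on torch tensors and decides whether to repad inside torch.no_grad(), which this sketch only models as a boolean.

```python
def keep_grad_when_repadding(labels, repad_logits_with_grad=False):
    """Return True if the repadded logits should keep their gradient.

    Hypothetical helper mirroring the rule in the PR description:
    keep gradients when no labels were passed, or when the user
    explicitly set repad_logits_with_grad.
    """
    return labels is None or repad_logits_with_grad


def repad(unpadded, indices, batch, seqlen, pad_value=0.0):
    """Scatter unpadded per-token values back into a (batch, seqlen) grid.

    `indices` holds the flat position of each unpadded token in the
    padded batch; every other position is filled with `pad_value`.
    """
    flat = [pad_value] * (batch * seqlen)
    for value, idx in zip(unpadded, indices):
        flat[idx] = value
    return [flat[b * seqlen:(b + 1) * seqlen] for b in range(batch)]


# Inference-style call: no labels, so gradients would be kept.
print(keep_grad_when_repadding(labels=None))
# Training with the internal loss: gradients dropped unless requested.
print(keep_grad_when_repadding(labels=[1, 2], repad_logits_with_grad=False))
# Three real tokens in a 2 x 2 batch; flat position 2 is padding.
print(repad([1.0, 2.0, 3.0], indices=[0, 1, 3], batch=2, seqlen=2))
```

The memory saving comes from the second case: once the loss has been computed on the unpadded logits, the repadded copy is only returned for inspection, so tracking its gradient is unnecessary unless the caller asks for it.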

Second, this PR stops torch.compile from being enabled automatically when the model is on CPU (#35388).

Third, I added some details to the model docstrings.

Fourth, the documentation now capitalizes BERT in ModernBERT, following the other BERT models.

cc @ArthurZucker @tomaarsen

tomaarsen · 2024-12-23T20:49:34Z

docs/source/en/_toctree.yml

@@ -503,7 +503,7 @@
      - local: model_doc/mobilebert
        title: MobileBERT
      - local: model_doc/modernbert
-        title: ModernBert
+        title: ModernBERT


Good call - I got carried away with the Python class naming

tomaarsen · 2024-12-23T20:53:12Z

src/transformers/models/modernbert/modular_modernbert.py

@@ -892,6 +900,13 @@ def _maybe_set_compile(self):
                )
            self.config.reference_compile = False

+        if self.device.type == "cpu":
+            logger.warning_once(


I think we should wrap this in if self.config.reference_compile: just like above so users don't get warned when they never manually specified anything related to reference_compile.

Unless you'd rather always give this warning so people aren't confused why there's a small precision-esque difference between GPU setups?

warner-benjamin · 2024-12-23

That was my intention; I forgot the if statement. Fixed in ab11657.
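The behavior agreed on in this thread can be sketched as follows. This is a simplified stand-in, not the actual _maybe_set_compile method: the function name, the warning text, and the warn callback are assumptions for illustration, and the other checks in the real method are omitted.

```python
def resolve_reference_compile(reference_compile, device_type, warn):
    """Decide the effective reference_compile value for a device.

    reference_compile: True, False, or None (user specified nothing).
    device_type: a string such as "cpu" or "cuda".
    warn: callable used to emit a warning message (hypothetical hook).
    """
    if device_type == "cpu":
        # Only warn when the user explicitly asked for compilation,
        # so users who never touched the setting see no message.
        if reference_compile:
            warn("reference_compile is not supported on CPU; disabling it.")
        return False
    # Other devices: leave the setting for later checks to resolve.
    return reference_compile


messages = []
print(resolve_reference_compile(True, "cpu", messages.append))   # warns
print(resolve_reference_compile(None, "cpu", messages.append))   # silent
print(resolve_reference_compile(True, "cuda", messages.append))  # unchanged
print(len(messages))
```

Either way compilation ends up disabled on CPU; the nested check only controls whether the user is told about it.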

tomaarsen · 2024-12-23T20:53:57Z

src/transformers/models/modernbert/modular_modernbert.py

        if config._attn_implementation_internal is None:
            config._attn_implementation_internal = "flash_attention_2"
            try:
                return cls._check_and_enable_flash_attn_2(
                    config,
-                    torch_dtype=torch_dtype,
+                    torch_dtype=torch.float16,


I think this is a cleaner solution to avoid the unnecessary FP32 warning than I figured was possible, nice.

HuggingFaceDocBuilderDev · 2024-12-23T21:29:37Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

warner-benjamin added 2 commits December 23, 2024 14:01

bug fixes

747b3e5

organize imports

46065d4

warner-benjamin mentioned this pull request Dec 23, 2024

modernbert logits do not have gradient #35386


tomaarsen reviewed Dec 23, 2024


wrap cpu warning in reference_compile

ab11657
