Add Prompt Depth Anything Model #35401

Open · haotongl wants to merge 14 commits into main

Conversation

haotongl

What does this PR do?

This PR adds the Prompt Depth Anything Model. Prompt Depth Anything builds upon Depth Anything V2 and incorporates metric prompt depth to enable accurate and high-resolution metric depth estimation.

The implementation leverages Modular Transformers. The main file can be found here.
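For reviewers, the intended end-to-end usage looks roughly like the sketch below. This is a minimal sketch, not the final API: the PromptDepthAnythingForDepthEstimation class name, the AutoImageProcessor mapping, the depth-anything/prompt-depth-anything-vits-hf checkpoint name, and passing prompt_depth directly to the model forward are assumptions based on this PR, and the image path is a placeholder.

```python
# Minimal usage sketch (assumed API; class, checkpoint, and argument names may differ in the merged version).
import numpy as np
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, PromptDepthAnythingForDepthEstimation

image = Image.open("path/to/rgb_frame.jpg")  # placeholder RGB image

# Low-resolution LiDAR (ARKit) depth used as the metric prompt; the PNG stores millimeters.
prompt_depth_url = "https://github.com/DepthAnything/PromptDA/blob/main/assets/example_images/arkit_depth.png?raw=true"
prompt_depth = Image.open(requests.get(prompt_depth_url, stream=True).raw)
prompt_depth = torch.tensor((np.asarray(prompt_depth) / 1000.0).astype(np.float32))[None, None]  # (1, 1, H, W), meters

processor = AutoImageProcessor.from_pretrained("depth-anything/prompt-depth-anything-vits-hf")
model = PromptDepthAnythingForDepthEstimation.from_pretrained("depth-anything/prompt-depth-anything-vits-hf")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, prompt_depth=prompt_depth)

predicted_depth = outputs.predicted_depth  # metric depth in meters, one map per input image
```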

Before submitting

haotongl (Author) commented Dec 24, 2024

@NielsRogge @qubvel @pcuenca Could you help review this PR when you have some time? Thanks so much in advance! Let me know if you have any questions or suggestions. 😊

qubvel (Member) commented Dec 24, 2024

Hi @haotongl! Thanks for working on integrating the model into transformers 🤗 I'm on holiday until Jan 3rd, and I'll do a review after that if it's still necessary.


*Prompts play a critical role in unleashing the power of language and vision foundation models for specific tasks. For the first time, we introduce prompting into depth foundation models, creating a new paradigm for metric depth estimation termed Prompt Depth Anything. Specifically, we use a low-cost LiDAR as the prompt to guide the Depth Anything model for accurate metric depth output, achieving up to 4K resolution. Our approach centers on a concise prompt fusion design that integrates the LiDAR at multiple scales within the depth decoder. To address training challenges posed by limited datasets containing both LiDAR depth and precise GT depth, we propose a scalable data pipeline that includes synthetic data LiDAR simulation and real data pseudo GT depth generation. Our approach sets new state-of-the-arts on the ARKitScenes and ScanNet++ datasets and benefits downstream applications, including 3D reconstruction and generalized robotic grasping.*
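As a rough illustration of the multi-scale prompt fusion idea described in the abstract, the sketch below resizes the LiDAR prompt to each decoder scale and injects it through a small convolutional head. This is illustrative only, not the implementation in this PR; all class and variable names here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptFusionBlock(nn.Module):
    """Illustrative only: inject a low-res metric depth prompt into decoder features at one scale."""

    def __init__(self, channels: int):
        super().__init__()
        # Small convolutional head that maps the 1-channel depth prompt to the feature width.
        self.proj = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, features: torch.Tensor, prompt_depth: torch.Tensor) -> torch.Tensor:
        # Resize the LiDAR prompt to the current decoder resolution and add it as a residual.
        prompt = F.interpolate(prompt_depth, size=features.shape[-2:], mode="bilinear", align_corners=False)
        return features + self.proj(prompt)

# One fusion block per decoder scale (hypothetical channel widths).
fusion_blocks = nn.ModuleList(PromptFusionBlock(c) for c in (256, 128, 64, 32))
```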

<img src="https://promptda.github.io/assets/teaser.jpg"/>
Contributor

Feel free to open a PR on this repo, specifically this folder: https://huggingface.co/datasets/huggingface/documentation-images/tree/main/transformers/model_doc to add a prompt_depth_anything_architecture.jpg picture

haotongl (Author)

Hi, thanks for your kind help! I have uploaded the image and opened a PR.
https://huggingface.co/datasets/huggingface/documentation-images/discussions/408

Comment on lines 59 to 62
>>> prompt_depth_url = "https://github.com/DepthAnything/PromptDA/blob/main/assets/example_images/arkit_depth.png?raw=true"
>>> prompt_depth = Image.open(requests.get(prompt_depth_url, stream=True).raw)
>>> prompt_depth = torch.tensor((np.asarray(prompt_depth) / 1000.0).astype(np.float32))
>>> prompt_depth = prompt_depth.unsqueeze(0).unsqueeze(0)
Contributor

Usage-wise, it might be a good idea to create a PromptDepthAnythingProcessor to help handle processing the input image (via the image processor) and the optional prompt_depth input.

haotongl (Author) commented Dec 25, 2024

I have added a PromptDepthAnythingImageProcessor! Thank you for your suggestions!
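With the image processor in place, preparing inputs might look roughly like the sketch below. The prompt_depth keyword and the checkpoint name are assumptions about this PR's API rather than confirmed behavior, and the file paths are placeholders.

```python
# Sketch only: the prompt_depth keyword and the checkpoint name are assumptions, not necessarily the final API.
import numpy as np
from PIL import Image
from transformers import PromptDepthAnythingImageProcessor

image = Image.open("path/to/rgb_frame.jpg")        # placeholder RGB frame
prompt_depth = np.load("path/to/lidar_depth.npy")  # placeholder low-res metric depth, in meters

processor = PromptDepthAnythingImageProcessor.from_pretrained("depth-anything/prompt-depth-anything-vits-hf")
inputs = processor(images=image, prompt_depth=prompt_depth, return_tensors="pt")
# `inputs` now contains pixel_values plus a prompt depth tensor, ready to be passed to the model.
```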
