Add Prompt Depth Anything Model #35401

Open · haotongl wants to merge 14 commits into main

Conversation

haotongl

What does this PR do?

This PR adds the Prompt Depth Anything Model. Prompt Depth Anything builds upon Depth Anything V2 and incorporates metric prompt depth to enable accurate and high-resolution metric depth estimation.

The implementation leverages Modular Transformers. The main file can be found here.
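For reviewers, the intended end-to-end usage looks roughly like the sketch below. This is a minimal sketch, not the final API: the PromptDepthAnythingForDepthEstimation class name, the AutoImageProcessor mapping, the depth-anything/prompt-depth-anything-vits-hf checkpoint name, and passing prompt_depth directly to the model forward are assumptions based on this PR, and the image path is a placeholder.

```python
# Minimal usage sketch (assumed API; class, checkpoint, and argument names may differ in the merged version).
import numpy as np
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, PromptDepthAnythingForDepthEstimation

image = Image.open("path/to/rgb_frame.jpg")  # placeholder RGB image

# Low-resolution LiDAR (ARKit) depth used as the metric prompt; the PNG stores millimeters.
prompt_depth_url = "https://github.com/DepthAnything/PromptDA/blob/main/assets/example_images/arkit_depth.png?raw=true"
prompt_depth = Image.open(requests.get(prompt_depth_url, stream=True).raw)
prompt_depth = torch.tensor((np.asarray(prompt_depth) / 1000.0).astype(np.float32))[None, None]  # (1, 1, H, W), meters

processor = AutoImageProcessor.from_pretrained("depth-anything/prompt-depth-anything-vits-hf")
model = PromptDepthAnythingForDepthEstimation.from_pretrained("depth-anything/prompt-depth-anything-vits-hf")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, prompt_depth=prompt_depth)

predicted_depth = outputs.predicted_depth  # metric depth in meters, one map per input image
```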

Before submitting

haotongl (Author) commented Dec 24, 2024

@NielsRogge @qubvel @pcuenca Could you help review this PR when you have some time? Thanks so much in advance! Let me know if you have any questions or suggestions. 😊

qubvel (Member) commented Dec 24, 2024

Hi @haotongl! Thanks for working on integrating the model into transformers 🤗 I'm on holiday until Jan 3rd, and I'll do a review after that if it's still necessary.


*Prompts play a critical role in unleashing the power of language and vision foundation models for specific tasks. For the first time, we introduce prompting into depth foundation models, creating a new paradigm for metric depth estimation termed Prompt Depth Anything. Specifically, we use a low-cost LiDAR as the prompt to guide the Depth Anything model for accurate metric depth output, achieving up to 4K resolution. Our approach centers on a concise prompt fusion design that integrates the LiDAR at multiple scales within the depth decoder. To address training challenges posed by limited datasets containing both LiDAR depth and precise GT depth, we propose a scalable data pipeline that includes synthetic data LiDAR simulation and real data pseudo GT depth generation. Our approach sets new state-of-the-arts on the ARKitScenes and ScanNet++ datasets and benefits downstream applications, including 3D reconstruction and generalized robotic grasping.*
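As a rough illustration of the multi-scale prompt fusion idea described in the abstract, the sketch below resizes the LiDAR prompt to each decoder scale and injects it through a small convolutional head. This is illustrative only, not the implementation in this PR; all class and variable names here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptFusionBlock(nn.Module):
    """Illustrative only: inject a low-res metric depth prompt into decoder features at one scale."""

    def __init__(self, channels: int):
        super().__init__()
        # Small convolutional head that maps the 1-channel depth prompt to the feature width.
        self.proj = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, features: torch.Tensor, prompt_depth: torch.Tensor) -> torch.Tensor:
        # Resize the LiDAR prompt to the current decoder resolution and add it as a residual.
        prompt = F.interpolate(prompt_depth, size=features.shape[-2:], mode="bilinear", align_corners=False)
        return features + self.proj(prompt)

# One fusion block per decoder scale (hypothetical channel widths).
fusion_blocks = nn.ModuleList(PromptFusionBlock(c) for c in (256, 128, 64, 32))
```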

<img src="https://promptda.github.io/assets/teaser.jpg"/>
Contributor

Feel free to open a PR on this repo, specifically this folder: https://huggingface.co/datasets/huggingface/documentation-images/tree/main/transformers/model_doc to add a prompt_depth_anything_architecture.jpg picture

haotongl (Author)

Hi, thanks for your kind help! I have uploaded the image and opened a PR.
https://huggingface.co/datasets/huggingface/documentation-images/discussions/408

Comment on lines 59 to 62
>>> prompt_depth_url = "https://github.com/DepthAnything/PromptDA/blob/main/assets/example_images/arkit_depth.png?raw=true"
>>> prompt_depth = Image.open(requests.get(prompt_depth_url, stream=True).raw)
>>> prompt_depth = torch.tensor((np.asarray(prompt_depth) / 1000.0).astype(np.float32))
>>> prompt_depth = prompt_depth.unsqueeze(0).unsqueeze(0)
Contributor

Usage-wise, it might be a good idea to create a PromptDepthAnythingProcessor to help handle processing the input image (via the image processor) and the optional prompt_depth input.

haotongl (Author) commented Dec 25, 2024

I have added a PromptDepthAnythingImageProcessor! Thank you for your suggestions!
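With the image processor in place, preparing inputs might look roughly like the sketch below. The prompt_depth keyword and the checkpoint name are assumptions about this PR's API rather than confirmed behavior, and the file paths are placeholders.

```python
# Sketch only: the prompt_depth keyword and the checkpoint name are assumptions, not necessarily the final API.
import numpy as np
from PIL import Image
from transformers import PromptDepthAnythingImageProcessor

image = Image.open("path/to/rgb_frame.jpg")        # placeholder RGB frame
prompt_depth = np.load("path/to/lidar_depth.npy")  # placeholder low-res metric depth, in meters

processor = PromptDepthAnythingImageProcessor.from_pretrained("depth-anything/prompt-depth-anything-vits-hf")
inputs = processor(images=image, prompt_depth=prompt_depth, return_tensors="pt")
# `inputs` now contains pixel_values plus a prompt depth tensor, ready to be passed to the model.
```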
