petermg/Depth-Anything-V2_16bitPNG_Touchly1

This fork outputs 16-bit PNG depth map files, which are better suited for use in mesh deformation than the original 8-bit output. It has been updated to fully support the Touchly1 format. Originally I did this with a different fork, until I found out that it traded quality for speed improvements. This fork is focused on quality above speed. However, you can always lower the input size ("--input-size") to speed things up.

This fork has been modified to output 16-bit precision depth map PNG files instead of 8-bit. This allows for better 3D model creation without a "stair-stepping" look to the model.
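
As a rough illustration of what 16-bit output means in practice, here is a minimal sketch (not this fork's exact code; file names are placeholders) of normalizing a raw depth map to the full 16-bit range and writing it with OpenCV:

import cv2
import numpy as np

# Minimal sketch (not this fork's exact code): scale a raw float depth map
# to the full 16-bit range and write it as a 16-bit PNG with OpenCV.
depth = np.load('depth.npy')  # placeholder: HxW float32 depth from the model
depth_norm = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
depth_u16 = (depth_norm * 65535.0).astype(np.uint16)  # 65536 levels vs 256 for 8-bit
cv2.imwrite('depth_16bit.png', depth_u16)             # OpenCV writes uint16 PNGs natively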

This fork also outputs video files in the "Touchly1" format, which has the depth map on the bottom and the original video on the top, vertically stacking the output. Just append "_Touchly1" to the name of the video file to play it back in the Touchly player in 6DoF 3D!
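For reference, the Touchly1 layout itself is simply a vertical stack; the sketch below shows the idea for a single frame (placeholder file names, not the fork's actual per-frame pipeline):

import cv2
import numpy as np

# Sketch of the Touchly1 frame layout: original frame on top, depth map below.
frame = cv2.imread('frame.png')                                       # placeholder original frame
depth = cv2.imread('depth.png', cv2.IMREAD_GRAYSCALE)                 # placeholder depth map
depth_bgr = cv2.cvtColor(depth, cv2.COLOR_GRAY2BGR)                   # match channel count
depth_bgr = cv2.resize(depth_bgr, (frame.shape[1], frame.shape[0]))   # match resolution
stacked = np.vstack([frame, depth_bgr])                               # 2*H x W x 3
cv2.imwrite('frame_Touchly1.png', stacked)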

Please use the "--help" flag with "run_video.py" to see all the options, many of which I have not yet documented here.

You can process video files by running "run_video.py"

You can download the Touchly Volumetric media player for the standalone Quest (Free) here: https://www.meta.com/experiences/5564815066942737/

You can download the Touchly Volumetric media player for PCVR on Steam ($6.99) here: https://store.steampowered.com/app/2480680/Touchly_Volumetric_VR_Video_Player/

The main Touchly website is here: https://touchly.app/

You can also download the official Touchly renderer / encoder app here: https://touchly.app/renderer/ (there is a free version and a Pro version for $10). The Touchly Renderer, in both its free and Pro versions, can create volumetric videos from 2D videos as well as from 3D SBS and 3D VR180 videos.

To process images, run "python run_video.py --images", since image processing has been added to the "run_video.py" script. I should rename this script...

Depth Anything V2

Lihe Yang1 · Bingyi Kang2† · Zilong Huang2
Zhen Zhao · Xiaogang Xu · Jiashi Feng2 · Hengshuang Zhao1*

1HKU   2TikTok
†project lead *corresponding author

Paper PDF Project Page Benchmark

This work presents Depth Anything V2. It significantly outperforms V1 in fine-grained details and robustness. Compared with SD-based models, it enjoys faster inference speed, fewer parameters, and higher depth accuracy.


News

  • 2024-07-06: Depth Anything V2 is supported in Transformers. See the instructions for convenient usage.
  • 2024-06-25: Depth Anything is integrated into Apple Core ML Models. See the instructions (V1, V2) for usage.
  • 2024-06-22: We release smaller metric depth models based on Depth-Anything-V2-Small and Base.
  • 2024-06-20: Our repository and project page were flagged by GitHub and removed from public view for 6 days. Sorry for the inconvenience.
  • 2024-06-14: Paper, project page, code, models, demo, and benchmark are all released.

Pre-trained Models

We provide four models of varying scales for robust relative depth estimation:

Model                     Params   Checkpoint
Depth-Anything-V2-Small   24.8M    Download
Depth-Anything-V2-Base    97.5M    Download
Depth-Anything-V2-Large   335.3M   Download
Depth-Anything-V2-Giant   1.3B     Coming soon

Usage

Preparation

git clone https://github.com/DepthAnything/Depth-Anything-V2
cd Depth-Anything-V2
pip install -r requirements.txt

Download the checkpoints listed here and put them under the checkpoints directory.

Use our models

import cv2
import torch

from depth_anything_v2.dpt import DepthAnythingV2

DEVICE = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'

model_configs = {
    'vits': {'encoder': 'vits', 'features': 64, 'out_channels': [48, 96, 192, 384]},
    'vitb': {'encoder': 'vitb', 'features': 128, 'out_channels': [96, 192, 384, 768]},
    'vitl': {'encoder': 'vitl', 'features': 256, 'out_channels': [256, 512, 1024, 1024]},
    'vitg': {'encoder': 'vitg', 'features': 384, 'out_channels': [1536, 1536, 1536, 1536]}
}

encoder = 'vitl' # or 'vits', 'vitb', 'vitg'

model = DepthAnythingV2(**model_configs[encoder])
model.load_state_dict(torch.load(f'checkpoints/depth_anything_v2_{encoder}.pth', map_location='cpu'))
model = model.to(DEVICE).eval()

raw_img = cv2.imread('your/image/path')
depth = model.infer_image(raw_img) # HxW raw depth map in numpy

If you do not want to clone this repository, you can also load our models through Transformers. Below is a simple code snippet. Please refer to the official page for more details.

  • Note 1: Make sure you can connect to Hugging Face and have installed the latest Transformers.
  • Note 2: Due to the upsampling difference between OpenCV (which we use) and Pillow (which HF uses), predictions may differ slightly, so we recommend loading our models in the way introduced above.

from transformers import pipeline
from PIL import Image

pipe = pipeline(task="depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf")
image = Image.open('your/image/path')
depth = pipe(image)["depth"]
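If you need a NumPy array rather than the PIL image the pipeline returns, a conversion along these lines should work (a sketch based on the standard depth-estimation pipeline output):

import numpy as np

result = pipe(image)
depth_vis = np.array(result["depth"])   # HxW uint8 depth visualization as a NumPy array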

Running script on images

python run.py \
  --encoder <vits | vitb | vitl | vitg> \
  --img-path <path> --outdir <outdir> \
  [--input-size <size>] [--pred-only] [--color]

Options:

  • --img-path: You can either 1) point it to an image directory storing all images of interest, 2) point it to a single image, or 3) point it to a text file storing all image paths.
  • --input-size (optional): By default, we use input size 518 for model inference. You can increase the size for even more fine-grained results.
  • --pred-only (optional): Only save the predicted depth map, without the raw image.
  • --color (optional): Save the colored depth map instead of the grayscale one.

For example:

python run.py --encoder vitl --img-path assets/examples --outdir depth_vis

Running script on videos

python run_video.py \
  --encoder <vits | vitb | vitl | vitg> \
  [--video-path <path>] [--outdir <outdir>] \
  [--custom-height --height <size>] [--pred-only] [--color] \
  [--codec <fourcc codec>] [--extension <extension>]

Options:

  • --video-path: Path to the input video file(s). Default is inputvideo.
  • --outdir: Path for the output video files. Default is outputvideo.
  • --custom-height --height <size>: Resize the input video to this height before inference. Default is 518.
  • --pred-only (optional): Only produce a depth map video, NOT a Touchly1 formatted video.
  • --color (optional): Produce a colored depth map. DO NOT USE THIS OPTION if you want to create a Touchly1 formatted video.
  • --codec <fourcc codec>: FourCC codec for the output video. Default is HFYU.
  • --extension <extension>: Video file container extension. Default is mkv.

By default, the encoder is vitl. Using "--custom-height" sets the new height to 518 by default; adding "--height" lets you specify whatever height you want the input video resized to. The width is automatically adjusted to maintain the aspect ratio. This is generally used to LOWER the input size of the video when you hit OUT OF MEMORY errors during processing. Some command line examples for processing videos to the Touchly1 format:

python run_video.py --custom-height --height 256

The above command creates a video with the original video on top and the depth map video on the bottom, with the input height reduced to 256. Since no directories are specified, it looks for the input files in the default location of 'assets/inputvideo' and saves the output to the default location of 'outputvideo'. Another example:

python run_video.py

The above example processes the input video(s) from the default folder of 'inputvideo', does not resize the video for processing (the depth map is created at the original resolution), and saves the output to the default folder of 'outputvideo'. This creates a vertically stacked output with the original video on top and the depth map on the bottom, which is the Touchly1 volumetric video format.

python run_video.py --custom-height --height 256 --extension avi --codec mjpg

The above command resizes the input video to a height of 256 with the width adjusted to maintain the original aspect ratio, saves the output as an AVI file, and encodes it using the MJPG codec.
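
For reference, the width that goes with a given --height is presumably derived from the source aspect ratio along these lines (a sketch, not the script's exact code):

# Sketch: derive the resized width for --height 256 while keeping the aspect ratio.
src_w, src_h = 1920, 1080            # example source resolution
target_h = 256
target_w = round(src_w * target_h / src_h)
target_w -= target_w % 2             # keep dimensions even for video encoders
print(target_w, target_h)            # 454 256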

You can also determine what codecs are supported for what containers / extensions by using the following command:

python run_video.py --showcodecs --extension avi

The above command will show you which codecs are available for the avi format/extension/container. If no format/extension/container is specified via the "--extension" option, it defaults to showing the codecs available for mkv.
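
Under the hood, codec/container compatibility in OpenCV can be probed by attempting to open a cv2.VideoWriter with a given FourCC, roughly as below (a sketch; the fork's --showcodecs implementation may differ):

import os
import cv2

def codec_works(fourcc_str, extension, size=(64, 64)):
    """Return True if OpenCV can open a VideoWriter for this FourCC/container pair."""
    path = f'_codec_test.{extension}'
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*fourcc_str), 30, size)
    ok = writer.isOpened()
    writer.release()
    if os.path.exists(path):
        os.remove(path)
    return ok

for codec in ['HFYU', 'MJPG', 'XVID', 'mp4v']:
    print(codec, codec_works(codec, 'avi'))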

Our larger model has better temporal consistency on videos.

Gradio demo

To use our gradio demo locally:

python app.py

You can also try our online demo.

Note: Compared to V1, we have made a minor modification to the DINOv2-DPT architecture (originating from this issue). In V1, we unintentionally used features from the last four layers of DINOv2 for decoding. In V2, we use intermediate features instead. Although this modification did not improve details or accuracy, we decided to follow this common practice.

Fine-tuned to Metric Depth Estimation

Please refer to metric depth estimation.

DA-2K Evaluation Benchmark

Please refer to DA-2K benchmark.

Community Support

We sincerely appreciate all the community support for our Depth Anything series. Thank you a lot!

Acknowledgement

We are sincerely grateful to the awesome Hugging Face team (@Pedro Cuenca, @Niels Rogge, @Merve Noyan, @Amy Roberts, et al.) for their huge efforts in supporting our models in Transformers and Apple Core ML.

We also thank the DINOv2 team for contributing such impressive models to our community.

LICENSE

Depth-Anything-V2-Small model is under the Apache-2.0 license. Depth-Anything-V2-Base/Large/Giant models are under the CC-BY-NC-4.0 license.

Citation

If you find this project useful, please consider citing:

@article{depth_anything_v2,
  title={Depth Anything V2},
  author={Yang, Lihe and Kang, Bingyi and Huang, Zilong and Zhao, Zhen and Xu, Xiaogang and Feng, Jiashi and Zhao, Hengshuang},
  journal={arXiv:2406.09414},
  year={2024}
}

@inproceedings{depth_anything_v1,
  title={Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data}, 
  author={Yang, Lihe and Kang, Bingyi and Huang, Zilong and Xu, Xiaogang and Feng, Jiashi and Zhao, Hengshuang},
  booktitle={CVPR},
  year={2024}
}

