Theia: Distilling Diverse Vision Foundation Models for Robot Learning

Jinghuan Shang1,2, Karl Schmeckpeper1, Brandon B. May1, Maria Vittoria Minniti1, Tarik Kelestemur1, David Watkins1, Laura Herlant1

1The AI Institute 2Stony Brook University

CoRL 2024

Project Page, Paper, Models, Demo

Quick Start: Use Pre-trained Theia Models

Through Hugging Face transformers:

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("theaiinstitute/theia-base-patch16-224-cdiv", trust_remote_code=True)
# Dummy input: a batch of one 224x224 RGB image, channels last, dtype uint8.
fake_input = torch.zeros((1, 224, 224, 3), dtype=torch.uint8)

# Theia (intermediate) feature, mainly used for robot learning.
# To use a different feature reduction method, pass the `feature_reduction_method` argument to AutoModel.from_pretrained().
theia_feature = model.forward_feature(fake_input)

# predicted_features is a dict[str, torch.Tensor]; each key-value pair is a target (teacher) model name and the feature predicted for it.
# These are the features Theia predicts in order to match the teacher model features.
predicted_features = model(fake_input)

The theia-<size>-patch16-224-cdiv models are the ones used for the main evaluations in the paper.

Installation

Make sure you have Python >= 3.10. Create a Python virtual environment of your choice or use the provided Dockerfile, then install the package in editable mode:

pip install -e .

Data Preparation

Datasets

The datasets should be organized in webdataset format.
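As a rough illustration of the general WebDataset convention (a shard is a plain tar file, and files that share a basename form one sample), here is a minimal sketch using only the Python standard library. The sample keys and field extensions below are hypothetical placeholders, not the names Theia expects; the preprocessing scripts in this repository are the supported way to build the actual shards.

```python
import io
import os
import tarfile
import tempfile


def write_shard(shard_path, samples):
    """Write (key, {extension: bytes}) samples into one tar shard.

    Files that share the same basename (the key) belong to one sample.
    """
    with tarfile.open(shard_path, "w") as tar:
        for key, fields in samples:
            for ext, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))


def read_shard(shard_path):
    """Group the tar members back into samples by their shared basename."""
    samples = {}
    with tarfile.open(shard_path, "r") as tar:
        for member in tar.getmembers():
            # Everything before the first dot is the key, the rest is the field.
            key, _, ext = member.name.partition(".")
            samples.setdefault(key, {})[ext] = tar.extractfile(member).read()
    return samples


if __name__ == "__main__":
    # Hypothetical keys and fields, for illustration only.
    shard = os.path.join(tempfile.mkdtemp(), "demo-000000.tar")
    write_shard(shard, [("sample_0000", {"jpg": b"<image bytes>", "cls": b"7"})])
    print(read_shard(shard).keys())
```

Grouping by basename is what lets a loader stream a shard sequentially and still reassemble each sample, so all fields of a sample should be stored next to each other in the tar file.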

  1. Prepare images from ImageNet

First download and prepare ImageNet.

cd src/theia/scripts/preprocessing/image_datasets
python organize_imagenet_webdataset.py --dataset <dataset_name> --imagenet-raw-path <path_to_raw_images> --output-path <root_dir_to_hold_datasets>

To use any other image dataset, place all of its images in one folder (subfolders also work) and modify how the image paths are collected in organize_imagenet_webdataset.py (variable image_paths).

  2. (Optional) Prepare frames from video datasets

cd src/theia/scripts/preprocessing/video_datasets
python subsampling_videos.py --dataset <dataset_name> --dataset-path <path_to_raw_videos> --output-path <root_dir_to_hold_datasets> [--subsampling-rate] [--samples-per-shard]

Feature Extraction

cd src/theia/scripts/preprocessing
python feature_extraction.py --dataset <dataset_name> --output-path <root_dir_to_hold_datasets> --model <model_name> --split <train or val (or test)> [--num-gpus]

You can also refer to the integrated script src/theia/scripts/preprocessing/iv_feature_extraction.py that launches feature extraction for multiple models at the same time.

Training requires the mean and variance of each teacher model's features in order to normalize them. You can compute these statistics with src/theia/scripts/preprocessing/calc_feature_mean.py or use the ones we provide in feature_stats.
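To make the role of these statistics concrete, the sketch below standardizes feature vectors with a precomputed mean and variance. It is written in plain Python for readability and is not the repository's implementation: the per-dimension statistics, the population variance, and the `eps` term are all assumptions, and real training code would operate on torch tensors.

```python
import math


def feature_stats(features):
    """Per-dimension mean and (population) variance of equal-length vectors."""
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n for d in range(dim)]
    return mean, var


def normalize(feature, mean, var, eps=1e-6):
    """Standardize one feature vector: (x - mean) / sqrt(var + eps).

    eps is an assumed safeguard against division by zero.
    """
    return [(x - m) / math.sqrt(v + eps) for x, m, v in zip(feature, mean, var)]


if __name__ == "__main__":
    # Toy "teacher features" with two dimensions on very different scales.
    teacher_features = [[1.0, 10.0], [3.0, 30.0]]
    mean, var = feature_stats(teacher_features)
    print([normalize(f, mean, var) for f in teacher_features])
```

Normalizing each teacher separately puts their features on a comparable scale, so no single teacher dominates the distillation loss simply because its features have larger magnitudes.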

Expected Dataset Format

More details about dataset format are available at dataset_format. Please use this to verify or troubleshoot your data.

Training

cd src/theia/scripts

# Train Theia-tiny using the training configuration train_rvfm_imagenet
# with teacher models CLIP, DINOv2, and ViT
torchrun --nproc_per_node=8 --nnodes 1 --rdzv_backend c10d --rdzv_endpoint localhost:11111 train_rvfm.py --config-name=train_rvfm_imagenet logging.notes=imagenet_cdiv training/target_models=cdiv dataset.dataset_ratio=1.0 model.backbone.backbone=facebook/deit-tiny-patch16-224 logging.save_ckpt_interval=50000 dataset.dataset_root=<root_dir_to_hold_datasets>

To change output paths and wandb logging configs, override or modify src/theia/configs/logging/default.yaml.

To use different teacher models, override training/target_models=<teacher model config>. Available configs are under src/theia/configs/training/target_models.

To use a different dataset, override dataset=<dataset config>. Available configs are under src/theia/configs/dataset.
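When you launch many runs with different overrides, it can help to assemble the command programmatically. The helper below is a hypothetical convenience wrapper, not part of the repository: it reuses the launcher flags and override keys from the training command above and turns a dict of overrides into an argument list.

```python
import shlex


def build_train_command(overrides, nproc_per_node=8, config_name="train_rvfm_imagenet"):
    """Return the torchrun argument list for train_rvfm.py.

    `overrides` maps config keys to values, e.g. "training/target_models" to "cdiv".
    """
    launcher = [
        "torchrun",
        f"--nproc_per_node={nproc_per_node}",
        "--nnodes", "1",
        "--rdzv_backend", "c10d",
        "--rdzv_endpoint", "localhost:11111",
        "train_rvfm.py",
        f"--config-name={config_name}",
    ]
    return launcher + [f"{key}={value}" for key, value in overrides.items()]


if __name__ == "__main__":
    command = build_train_command(
        {
            "training/target_models": "cdiv",
            "model.backbone.backbone": "facebook/deit-tiny-patch16-224",
            # Placeholder: replace with your own dataset root.
            "dataset.dataset_root": "<root_dir_to_hold_datasets>",
        }
    )
    # Print a shell-quoted command line; run it from src/theia/scripts.
    print(shlex.join(command))
```

Returning a list rather than a string means the result can be passed directly to `subprocess.run` without worrying about shell quoting.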

Decode Theia-representation to VFM outputs

You can decode Theia-predicted VFM representations into the outputs of the corresponding VFMs. For DINOv2 we apply PCA visualization; for SAM we use its decoder to generate segmentation masks (still following SAM's prompting pipeline); and for Depth-Anything we use the decoder head to predict depth. Below are example outputs. The Theia model must have been trained on those teachers during distillation. Among the models available online, those with cddsv in their names were trained on all teachers.

Try out our online demo or notebook example, or decode outputs from a local checkpoint with:

cd src/theia/scripts/decoding
python decoding_example.py --backbone <backbone_name> --checkpoint-path <path to theia model checkpoint> --feature-stat-dir <where feature mean and std are placed> --media-to-vis-path <path to the video or image to decode>

References

Webdataset, transformers, safetensors, DINOv2, CLIP, ViT, SAM, RADIO, DepthAnything

Citation

If you use Theia in your research, please use the following BibTeX entry:

@inproceedings{
    shang2024theia,
    title={Theia: Distilling Diverse Vision Foundation Models for Robot Learning},
    author={Jinghuan Shang and Karl Schmeckpeper and Brandon B. May and Maria Vittoria Minniti and Tarik Kelestemur and David Watkins and Laura Herlant},
    booktitle={8th Annual Conference on Robot Learning},
    year={2024},
    url={https://openreview.net/forum?id=ylZHvlwUcI}
}