```
├── .gitignore
├── CODE_OF_CONDUCT.md (700 tokens)
├── CONTRIBUTING.md (300 tokens)
├── LICENSE (omitted)
├── README.md (700 tokens)
├── eval.py (2.5k tokens)
├── eval.sh (100 tokens)
├── examples/
    ├── ibims1/
        ├── ibims1_val.jsonl (64.7k tokens)
        ├── rgb/
            ├── Thumbs.db
            ├── corridor_01.png
            ├── corridor_02.png
            ├── corridor_03.png
            ├── corridor_04.png
            ├── corridor_05.png
            ├── corridor_06.png
            ├── corridor_07.png
            ├── corridor_08.png
            ├── corridor_09.png
            ├── corridor_10.png
            ├── factory_01.png
            ├── factory_02.png
            ├── factory_03.png
            ├── factory_04.png
            ├── factory_05.png
            ├── factory_06.png
            ├── factory_07.png
            ├── factory_08.png
            ├── kitchen_01.png
            ├── kitchen_02.png
            ├── kitchen_03.png
            ├── kitchen_04.png
            ├── kitchen_05.png
            ├── kitchen_06.png
            ├── kitchen_07.png
            ├── kitchen_08.png
            ├── lab_01.png
            ├── lab_02.png
            ├── lab_03.png
            ├── lab_04.png
            ├── lab_05.png
            ├── lab_06.png
            ├── lab_07.png
            ├── lab_08.png
            ├── lab_09.png
            ├── lab_10.png
            ├── lab_11.png
            ├── lectureroom_01.png
            ├── lectureroom_02.png
            ├── lectureroom_03.png
            ├── lectureroom_04.png
            ├── lectureroom_05.png
            ├── lectureroom_06.png
            ├── lectureroom_07.png
            ├── lectureroom_08.png
            ├── lectureroom_09.png
            ├── lectureroom_10.png
            ├── livingroom_01.png
            ├── livingroom_02.png
            ├── livingroom_03.png
            ├── livingroom_04.png
            ├── livingroom_05.png
            ├── livingroom_06.png
            ├── livingroom_07.png
            ├── livingroom_08.png
            ├── livingroom_09.png
            ├── livingroom_10.png
            ├── livingroom_11.png
            ├── livingroom_12.png
            ├── livingroom_13.png
            ├── livingroom_14.png
            ├── livingroom_15.png
            ├── meetingroom_01.png
            ├── meetingroom_02.png
            ├── meetingroom_03.png
            ├── meetingroom_04.png
            ├── meetingroom_05.png
            ├── meetingroom_06.png
            ├── meetingroom_07.png
            ├── meetingroom_08.png
            ├── office_01.png
            ├── office_02.png
            ├── office_03.png
            ├── office_04.png
            ├── office_05.png
            ├── office_06.png
            ├── office_07.png
            ├── office_08.png
            ├── restaurant_01.png
            ├── restaurant_02.png
            ├── restaurant_03.png
            ├── restaurant_04.png
            ├── restaurant_05.png
            ├── restaurant_06.png
            ├── restaurant_07.png
            ├── restaurant_08.png
            ├── restaurant_09.png
            ├── restaurant_10.png
            ├── restaurant_11.png
            ├── restaurant_12.png
            ├── restroom_01.png
            ├── restroom_02.png
            ├── storageroom_01.png
            ├── storageroom_02.png
            ├── storageroom_03.png
            ├── storageroom_04.png
            ├── storageroom_05.png
            ├── storageroom_06.png
            ├── storageroom_07.png
            ├── storageroom_08.png
├── media/
    ├── cv_model.png
    ├── main_result.png
    ├── multiTask.jpg
    ├── point_cloud.png
    ├── teaser.png
├── prepare_data.sh (1000 tokens)
├── requirements.txt
├── train.py (5.2k tokens)
├── train.sh (400 tokens)
├── utils/
    ├── callbacks.py (500 tokens)
    ├── curate_NYU.py (800 tokens)
    ├── curate_argoverse.py (2.2k tokens)
    ├── curate_ddad.py (1000 tokens)
    ├── curate_eth3d.py (1500 tokens)
    ├── curate_matterport3d.py (1400 tokens)
    ├── curate_nuscenes_eval.py (1700 tokens)
    ├── curate_nuscenes_train.py (1800 tokens)
    ├── curate_scannet.py (2.1k tokens)
    ├── curate_sunRGBD.py (800 tokens)
    ├── curate_taskonomy (1000 tokens)
    ├── curate_waymo.py (2.2k tokens)
    ├── datasets.py (5.9k tokens)
    ├── evaluation.py (900 tokens)
    ├── hub.py (1000 tokens)
    ├── metrics.py (400 tokens)
```
## /.gitignore
```gitignore path="/.gitignore"
*/__pycache__
```
## /CODE_OF_CONDUCT.md
# Code of Conduct
## Our Pledge
In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to make participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.
## Our Standards
Examples of behavior that contributes to creating a positive environment
include:
* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.
## Scope
This Code of Conduct applies within all project spaces, and it also applies when
an individual is representing the project or its community in public spaces.
Examples of representing a project or community include using an official
project e-mail address, posting via an official social media account, or acting
as an appointed representative at an online or offline event. Representation of
a project may be further defined and clarified by project maintainers.
This Code of Conduct also applies outside the project spaces when there is a
reasonable belief that an individual's behavior may have a negative impact on
the project or its community.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at <opensource-conduct@meta.com>. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.
Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
[homepage]: https://www.contributor-covenant.org
For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq
## /CONTRIBUTING.md
# Contributing to DepthLM
We want to make contributing to this project as easy and transparent as
possible.
## Our Development Process
... (in particular how this is synced with internal changes to the project)
## Pull Requests
We actively welcome your pull requests.
1. Fork the repo and create your branch from `main`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. If you haven't already, complete the Contributor License Agreement ("CLA").
## Contributor License Agreement ("CLA")
In order to accept your pull request, we need you to submit a CLA. You only need
to do this once to work on any of Meta's open source projects.
Complete your CLA here: <https://code.facebook.com/cla>
## Issues
We use GitHub issues to track public bugs. Please ensure your description is
clear and has sufficient instructions to be able to reproduce the issue.
Meta has a [bounty program](https://bugbounty.meta.com/) for the safe
disclosure of security bugs. In those cases, please go through the process
outlined on that page and do not file a public issue.
## Coding Style
* 2 spaces for indentation rather than tabs
* 80 character line length
* ...
## License
By contributing to DepthLM, you agree that your contributions will be licensed
under the LICENSE file in the root directory of this source tree.
## /README.md
# [ICLR2026 Oral (top 1.2%)] DepthLM
Official implementation of "[DepthLM: Metric Depth from Vision Language Models](https://arxiv.org/abs/2509.25413)".
We show for the first time that **VLMs can achieve accuracy comparable to pure vision models on metric depth estimation**, with standard text-based SFT and no architecture change, i.e., no dense prediction head or regression/regularization loss is needed. This simplicity allows DepthLM to train a unified VLM that handles various complex 3D understanding tasks, such as speed or time estimation and metric-scale camera pose estimation, which require different architectures or hand-crafted pipelines in pure vision models.
<div align=center>
<img width=100% src="./media/teaser.png"/>
</div>
<div align=center>
<img width=100% src="./media/multiTask.jpg"/>
</div>
## Citation
If you find our code useful for your research, please consider citing:
@article{cai2025depthlm,
title={DepthLM: Metric Depth from Vision Language Models},
author={Cai, Zhipeng and Yeh, Ching-Feng and Hu, Xu and Liu, Zhuang and Meyer, Gregory and Lei, Xinjie and Zhao, Changsheng and Li, Shang-Wen and Chandra, Vikas and Shi, Yangyang},
journal={arXiv preprint arXiv:2509.25413},
year={2025},
}
## Contact
Zhipeng Cai, Meta Inc, homepage: https://zhipengcai.github.io/, email: czptc2h at gmail dot com.
## Prerequisites
1. Run ```conda create -n DepthLM python=3.12```
2. Run ```pip install -r requirements.txt``` (the code is tested with transformers version 4.51.1)
| Model | Link |
|:----:|:-------------------------------------------------------------------------------------------------:|
| DepthLM (Pixtral 12B) | [Download 🤗](https://huggingface.co/facebook/DepthLM) |
| DepthLM (3B) | (Coming soon!) |
| DepthLM (7B) | (Coming soon!) |
## Data Preparation
- We curate each training/eval dataset into
  - A folder containing the images
  - A jsonl file containing the corresponding camera intrinsics and 3D labels
- We provide example data from the iBims1 dataset at examples/ibims1 so the code can be run quickly without any data preparation. Other images/datasets can use the same code once the data preparation steps are finished.
- For legal reasons, we cannot directly release the curated data. However, we provide the data curation code to enable reproduction.
- Check out each block in [prepare_data.sh](https://github.com/facebookresearch/DepthLM_Official/blob/main/prepare_data.sh) for the detailed data preparation steps for each dataset.
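
As a rough illustration of the folder-plus-jsonl layout described above, the sketch below reads a jsonl file line by line and resolves each record's image path against the image folder. The helper name `load_records` and the `image` field are hypothetical placeholders; the actual schema is defined by the curation scripts and `utils/datasets.py`.

```python
import json
from pathlib import Path


def load_records(jsonl_path, image_folder):
    """Read one JSON object per line and attach a resolved image path.

    The "image" key is a hypothetical field name used for illustration only.
    """
    records = []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            record = json.loads(line)
            record["image_path"] = str(Path(image_folder) / record["image"])
            records.append(record)
    return records
```

Keeping one JSON object per line lets a loader stream large metadata files without parsing them as a single document.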
## Eval
- run ```bash eval.sh <path_to_your_model>```
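
The evaluation script reports a final `delta_1` score. As a minimal sketch, assuming the standard definition of this accuracy (the fraction of predictions whose ratio to the ground truth, in either direction, is below 1.25), it can be computed as follows. The function name `delta1` and its numeric inputs are assumptions made for illustration; the repository's own `delta1_metric` in `utils/metrics.py` operates on model text outputs and may differ in detail.

```python
def delta1(predictions, ground_truths, threshold=1.25):
    """Fraction of pairs whose ratio max(p/g, g/p) is below the threshold.

    Pairs with a non-positive prediction or ground truth count as misses.
    """
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have equal length")
    if not predictions:
        return 0.0
    hits = 0
    for p, g in zip(predictions, ground_truths):
        if p > 0 and g > 0 and max(p / g, g / p) < threshold:
            hits += 1
    return hits / len(predictions)


# Example depths in meters: the first two ratios are below 1.25, the third is not.
print(delta1([2.0, 3.1, 5.0], [2.2, 3.0, 3.0]))
```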
## Training
- Download the base model you want to train from [here](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct/tree/main). Our code currently supports Qwen2.5-VL and Pixtral; please see our paper for the corresponding hyper-parameters.
- run ```bash train.sh <path_to_your_model> <output_path>```
## Results
### Comparison with VLMs
<div align=center>
<img width=100% src="./media/main_result.png"/>
</div>
### Comparison with pure vision models
<div align=center>
<img width=80% src="./media/cv_model.png"/>
</div>
### Point cloud visualization
<div align=center>
<img width=100% src="./media/point_cloud.png"/>
</div>
## Related project
Our follow-up project [VLM³](https://github.com/facebookresearch/VLM3) has been released! It extends the findings of DepthLM to diverse 3D vision tasks!
## License
DepthLM is FAIR CC-BY-NC licensed, as found in the LICENSE file.
## /eval.py
```py path="/eval.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import argparse
import logging
from typing import Any

import torch
from tqdm import tqdm
from transformers import (
    AutoProcessor,
    LlavaForConditionalGeneration,
    Qwen2_5_VLForConditionalGeneration,
)

from utils.datasets import dataset_eval, dataset_inference
from utils.metrics import *
def convert_example_pixtral(example, image_before_text=None):
    messages = []
    problem = example.get("problem")
    if "images" in example:
        images = example.get("images")
        if image_before_text is not None and image_before_text:
            messages.append(
                {
                    "role": "user",
                    "content": [{"type": "image", "image": img} for img in images]
                    + [{"type": "text", "content": problem}],
                }
            )
        else:
            messages.append(
                {
                    "role": "user",
                    "content": [{"type": "text", "content": problem}]
                    + [{"type": "image", "image": img} for img in images],
                }
            )
    else:
        image = example.get("image")
        if image_before_text is not None and image_before_text:
            messages.append(
                {
                    "role": "user",
                    "content": [
                        {"type": "image", "image": image},
                        {"type": "text", "content": problem},
                    ],
                }
            )
        else:
            messages.append(
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "content": problem},
                        {"type": "image", "image": image},
                    ],
                }
            )
    example["messages"] = messages
    return example
def convert_example(example, image_before_text=None):
    messages = []
    if "system" in example:
        messages.append(
            {
                "role": "system",
                "content": [{"type": "text", "text": example["system"]}],
            }
        )
    else:
        SYSTEM_PROMPT = (
            "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant "
            "first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
            "process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
            "<think> reasoning process here </think><answer> answer here </answer>"
        )
        messages.append(
            {
                "role": "system",
                "content": [{"type": "text", "text": SYSTEM_PROMPT}],
            }
        )
    problem = example.get("problem")
    if "images" in example:
        images = example.get("images")
        if image_before_text is not None and image_before_text:
            messages.append(
                {
                    "role": "user",
                    "content": [{"type": "image", "image": img} for img in images]
                    + [{"type": "text", "text": problem}],
                }
            )
        else:
            messages.append(
                {
                    "role": "user",
                    "content": [{"type": "text", "text": problem}]
                    + [{"type": "image", "image": img} for img in images],
                }
            )
    else:
        image = example.get("image")
        if image_before_text is not None and image_before_text:
            messages.append(
                {
                    "role": "user",
                    "content": [
                        {"type": "image", "image": image},
                        {"type": "text", "text": problem},
                    ],
                }
            )
        else:
            messages.append(
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": problem},
                        {"type": "image", "image": image},
                    ],
                }
            )
    example["messages"] = messages
    return example
def main(args):
    model_path = args.model_path
    img_folder = args.image_folder
    json_path = args.json_path
    processor = AutoProcessor.from_pretrained(model_path)
    if "pixtral" in model_path.lower():
        print("loading DepthLM with pixtral (12B) architecture")
        model = LlavaForConditionalGeneration.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            attn_implementation={
                "text_config": "flash_attention_2",
                "vision_config": "eager",
            },
            device_map="auto",
        )
        model.eval()
    else:
        print("loading DepthLM with qwen2.5-vl architecture")
        model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            attn_implementation="flash_attention_2",
            device_map="auto",
        )
    if args.run_deterministic_inference:
        dataset = dataset_inference(
            json_path,
            img_folder,
            normalized_focal_length=750.0,  # change to the corresponding value for other models
        )
    else:
        dataset = dataset_eval(
            json_path,
            img_folder,
            normalized_focal_length=750.0,  # change to the corresponding value for other models
        )
    print(f"{dataset.__class__.__name__} size = {len(dataset)}")
    metric_funcs = [delta1_metric]
    metrics = []
    all_outputs = []  # List to store all answers
    all_solutions = []  # List to store all solutions
    samples_to_eval = min(args.samples_to_eval, len(dataset))
    step = 1
    sampled_indices: list[int] = list(range(0, samples_to_eval, step))
    print(f"Evaluating {len(sampled_indices)} samples")
    with torch.no_grad():
        for i in tqdm(range(0, len(sampled_indices), args.bsz)):
            batch_indices: list[int] = sampled_indices[i : i + args.bsz]
            batch_messages: list[dict[str, Any]] = []
            for j in batch_indices:
                message = dataset[j]
                if message is not None:
                    batch_messages.append(message)
            if len(batch_messages) == 0:
                continue
            if "pixtral" in model_path.lower():
                chat = [
                    convert_example_pixtral(msg, True)["messages"]
                    for msg in batch_messages
                ]
                inputs = processor.apply_chat_template(
                    chat,
                    add_generation_prompt=True,
                    tokenize=True,
                    return_dict=True,
                    padding=True,
                    padding_side="left",
                    return_tensors="pt",
                ).to("cuda", dtype=torch.bfloat16)
                generated_ids = model.generate(
                    **inputs,
                    max_new_tokens=args.max_new_tokens,
                    do_sample=False,
                    top_p=None,
                    top_k=None,
                )
                # Keep only the newly generated tokens (drop the prompt prefix)
                generated_ids_trimmed = [
                    out_ids[len(in_ids) :]
                    for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
                ]
                batch_output_text = processor.batch_decode(
                    generated_ids_trimmed,
                    skip_special_tokens=True,
                    clean_up_tokenization_spaces=False,
                )
            else:
                # code for qwen based models
                if args.apply_system_prompt:
                    text = [
                        processor.apply_chat_template(
                            convert_example(msg, True)["messages"],
                            tokenize=False,
                            add_generation_prompt=True,
                        )
                        for msg in batch_messages
                    ]
                else:
                    batch_messages_text: list[str] = [
                        msg["prompt"] for msg in batch_messages
                    ]
                    text: list[str] = [
                        processor.apply_chat_template(
                            msg, tokenize=False, add_generation_prompt=True
                        )
                        for msg in batch_messages_text
                    ]
                image_inputs = [
                    x["images"] if "images" in x else x["image"] for x in batch_messages
                ]
                if i == 0:
                    print(
                        "text = ",
                        text[0],
                        "apply_system_prompt = ",
                        args.apply_system_prompt,
                    )
                inputs = processor(
                    text=text,
                    images=image_inputs,
                    padding=True,
                    padding_side="left",
                    return_tensors="pt",
                )
                inputs = inputs.to("cuda")
                # Inference: Generation of the output
                # TODO maybe enable sampling here later
                generated_ids = model.generate(
                    **inputs,
                    use_cache=True,
                    max_new_tokens=args.max_new_tokens,
                    do_sample=False,
                    top_p=None,  # Unset top_p to avoid the warning
                    top_k=None,
                )
                # Keep only the newly generated tokens (drop the prompt prefix)
                generated_ids_trimmed = [
                    out_ids[len(in_ids) :]
                    for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
                ]
                batch_output_text = processor.batch_decode(
                    generated_ids_trimmed,
                    skip_special_tokens=True,
                    clean_up_tokenization_spaces=False,
                )
            print(f"model input = {batch_messages}")
            print(f"model output = {batch_output_text}")
            solution_list = [example["solution"] for example in batch_messages]
            for k, metric_func in enumerate(metric_funcs):
                # Initialize on the first non-empty batch; batch 0 may have been skipped above
                if len(metrics) <= k:
                    metrics.append(
                        metric_func(
                            batch_output_text,
                            solution_list.copy(),
                        )
                    )
                else:
                    metrics[k] += metric_func(
                        batch_output_text,
                        solution_list.copy(),
                    )
            all_outputs.extend(batch_output_text)
            all_solutions.extend(solution_list.copy())
    for i in range(len(metric_funcs)):
        print("final delta_1 = ", sum(metrics[i]) / len(metrics[i]))
if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
    parser = argparse.ArgumentParser(description="DepthLM parameters.")
    parser.add_argument(
        "--model_path", type=str, required=True, help="Path to the model."
    )
    parser.add_argument(
        "--image_folder",
        type=str,
        default="./examples/ibims1/",
        help="folder that contains the image",
    )
    parser.add_argument(
        "--json_path",
        type=str,
        default="./examples/ibims1/ibims1_val.jsonl",
        help="path to the meta data",
    )
    parser.add_argument(
        "--max_new_tokens",
        type=int,
        default=4096,
        help="maximum number of tokens to generate",
    )
    parser.add_argument("--bsz", type=int, default=1, help="Batch size for processing.")
    parser.add_argument(
        "--apply_system_prompt",
        action="store_true",
        help="For Qwen only, whether to apply system prompt or not.",
    )
    parser.add_argument(
        "--run_deterministic_inference",
        action="store_true",
        help="When True, will call the dataset_inference class to run deterministic inference.",
    )
    parser.add_argument(
        "--samples_to_eval",
        type=int,
        default=128,
        help="maximum number of samples to evaluate",
    )
    args = parser.parse_args()
    main(args)
```
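
The evaluation loop above slices the sample indices into batches of `--bsz` and, after generation, strips the prompt tokens from each output sequence by slicing at the input length. The sketch below isolates those two steps, with plain Python lists standing in for token-id tensors. The helper names `make_batches` and `trim_prompts` are hypothetical and do not appear in the repository.

```python
def make_batches(indices, batch_size):
    """Split a list of indices into consecutive chunks of at most batch_size."""
    return [indices[i : i + batch_size] for i in range(0, len(indices), batch_size)]


def trim_prompts(input_ids, generated_ids):
    """Drop the prompt prefix from each generated sequence.

    Assumes, as in eval.py, that each generated sequence begins with its
    (left-padded) prompt tokens followed by the newly generated tokens.
    """
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


batches = make_batches(list(range(7)), 3)
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]

prompts = [[0, 0, 11, 12], [21, 22, 23, 24]]  # 0 stands in for a left-padding token
outputs = [[0, 0, 11, 12, 91, 92], [21, 22, 23, 24, 93, 94]]
print(trim_prompts(prompts, outputs))  # [[91, 92], [93, 94]]
```

With left padding every prompt in a batch has the same length, so one slice per row removes the whole prompt, padding included.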
## /eval.sh
```sh path="/eval.sh"
#!/bin/bash
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
model_path=$1
python eval.py --model_path "$model_path" --image_folder "./examples/ibims1/" --json_path "./examples/ibims1/ibims1_val.jsonl" --bsz 3 --samples_to_eval 128
```
## /examples/ibims1/rgb/Thumbs.db
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/Thumbs.db
## /examples/ibims1/rgb/corridor_01.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/corridor_01.png
## /examples/ibims1/rgb/corridor_02.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/corridor_02.png
## /examples/ibims1/rgb/corridor_03.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/corridor_03.png
## /examples/ibims1/rgb/corridor_04.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/corridor_04.png
## /examples/ibims1/rgb/corridor_05.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/corridor_05.png
## /examples/ibims1/rgb/corridor_06.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/corridor_06.png
## /examples/ibims1/rgb/corridor_07.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/corridor_07.png
## /examples/ibims1/rgb/corridor_08.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/corridor_08.png
## /examples/ibims1/rgb/corridor_09.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/corridor_09.png
## /examples/ibims1/rgb/corridor_10.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/corridor_10.png
## /examples/ibims1/rgb/factory_01.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/factory_01.png
## /examples/ibims1/rgb/factory_02.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/factory_02.png
## /examples/ibims1/rgb/factory_03.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/factory_03.png
## /examples/ibims1/rgb/factory_04.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/factory_04.png
## /examples/ibims1/rgb/factory_05.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/factory_05.png
## /examples/ibims1/rgb/factory_06.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/factory_06.png
## /examples/ibims1/rgb/factory_07.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/factory_07.png
## /examples/ibims1/rgb/factory_08.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/factory_08.png
## /examples/ibims1/rgb/kitchen_01.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/kitchen_01.png
## /examples/ibims1/rgb/kitchen_02.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/kitchen_02.png
## /examples/ibims1/rgb/kitchen_03.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/kitchen_03.png
## /examples/ibims1/rgb/kitchen_04.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/kitchen_04.png
## /examples/ibims1/rgb/kitchen_05.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/kitchen_05.png
## /examples/ibims1/rgb/kitchen_06.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/kitchen_06.png
## /examples/ibims1/rgb/kitchen_07.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/kitchen_07.png
## /examples/ibims1/rgb/kitchen_08.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/kitchen_08.png
## /examples/ibims1/rgb/lab_01.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lab_01.png
## /examples/ibims1/rgb/lab_02.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lab_02.png
## /examples/ibims1/rgb/lab_03.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lab_03.png
## /examples/ibims1/rgb/lab_04.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lab_04.png
## /examples/ibims1/rgb/lab_05.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lab_05.png
## /examples/ibims1/rgb/lab_06.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lab_06.png
## /examples/ibims1/rgb/lab_07.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lab_07.png
## /examples/ibims1/rgb/lab_08.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lab_08.png
## /examples/ibims1/rgb/lab_09.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lab_09.png
## /examples/ibims1/rgb/lab_10.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lab_10.png
## /examples/ibims1/rgb/lab_11.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lab_11.png
## /examples/ibims1/rgb/lectureroom_01.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lectureroom_01.png
## /examples/ibims1/rgb/lectureroom_02.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lectureroom_02.png
## /examples/ibims1/rgb/lectureroom_03.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lectureroom_03.png
## /examples/ibims1/rgb/lectureroom_04.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lectureroom_04.png
## /examples/ibims1/rgb/lectureroom_05.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lectureroom_05.png
## /examples/ibims1/rgb/lectureroom_06.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lectureroom_06.png
## /examples/ibims1/rgb/lectureroom_07.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lectureroom_07.png
## /examples/ibims1/rgb/lectureroom_08.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lectureroom_08.png
## /examples/ibims1/rgb/lectureroom_09.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lectureroom_09.png
## /examples/ibims1/rgb/lectureroom_10.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lectureroom_10.png
## /examples/ibims1/rgb/livingroom_01.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_01.png
## /examples/ibims1/rgb/livingroom_02.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_02.png
## /examples/ibims1/rgb/livingroom_03.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_03.png
## /examples/ibims1/rgb/livingroom_04.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_04.png
## /examples/ibims1/rgb/livingroom_05.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_05.png
## /examples/ibims1/rgb/livingroom_06.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_06.png
## /examples/ibims1/rgb/livingroom_07.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_07.png
## /examples/ibims1/rgb/livingroom_08.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_08.png
## /examples/ibims1/rgb/livingroom_09.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_09.png
## /examples/ibims1/rgb/livingroom_10.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_10.png
## /examples/ibims1/rgb/livingroom_11.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_11.png
## /examples/ibims1/rgb/livingroom_12.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_12.png
## /examples/ibims1/rgb/livingroom_13.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_13.png
## /examples/ibims1/rgb/livingroom_14.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_14.png
## /examples/ibims1/rgb/livingroom_15.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_15.png
## /examples/ibims1/rgb/meetingroom_01.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/meetingroom_01.png
## /examples/ibims1/rgb/meetingroom_02.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/meetingroom_02.png
## /examples/ibims1/rgb/meetingroom_03.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/meetingroom_03.png
## /examples/ibims1/rgb/meetingroom_04.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/meetingroom_04.png
## /examples/ibims1/rgb/meetingroom_05.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/meetingroom_05.png
## /examples/ibims1/rgb/meetingroom_06.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/meetingroom_06.png
## /examples/ibims1/rgb/meetingroom_07.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/meetingroom_07.png
## /examples/ibims1/rgb/meetingroom_08.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/meetingroom_08.png
## /examples/ibims1/rgb/office_01.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/office_01.png
## /examples/ibims1/rgb/office_02.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/office_02.png
## /examples/ibims1/rgb/office_03.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/office_03.png
## /examples/ibims1/rgb/office_04.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/office_04.png
## /examples/ibims1/rgb/office_05.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/office_05.png
## /examples/ibims1/rgb/office_06.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/office_06.png
## /examples/ibims1/rgb/office_07.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/office_07.png
## /examples/ibims1/rgb/office_08.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/office_08.png
## /examples/ibims1/rgb/restaurant_01.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restaurant_01.png
## /examples/ibims1/rgb/restaurant_02.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restaurant_02.png
## /examples/ibims1/rgb/restaurant_03.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restaurant_03.png
## /examples/ibims1/rgb/restaurant_04.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restaurant_04.png
## /examples/ibims1/rgb/restaurant_05.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restaurant_05.png
## /examples/ibims1/rgb/restaurant_06.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restaurant_06.png
## /examples/ibims1/rgb/restaurant_07.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restaurant_07.png
## /examples/ibims1/rgb/restaurant_08.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restaurant_08.png
## /examples/ibims1/rgb/restaurant_09.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restaurant_09.png
## /examples/ibims1/rgb/restaurant_10.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restaurant_10.png
## /examples/ibims1/rgb/restaurant_11.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restaurant_11.png
## /examples/ibims1/rgb/restaurant_12.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restaurant_12.png
## /examples/ibims1/rgb/restroom_01.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restroom_01.png
## /examples/ibims1/rgb/restroom_02.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restroom_02.png
## /examples/ibims1/rgb/storageroom_01.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/storageroom_01.png
## /examples/ibims1/rgb/storageroom_02.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/storageroom_02.png
## /examples/ibims1/rgb/storageroom_03.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/storageroom_03.png
## /examples/ibims1/rgb/storageroom_04.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/storageroom_04.png
## /examples/ibims1/rgb/storageroom_05.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/storageroom_05.png
## /examples/ibims1/rgb/storageroom_06.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/storageroom_06.png
## /examples/ibims1/rgb/storageroom_07.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/storageroom_07.png
## /examples/ibims1/rgb/storageroom_08.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/storageroom_08.png
## /media/cv_model.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/media/cv_model.png
## /media/main_result.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/media/main_result.png
## /media/multiTask.jpg
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/media/multiTask.jpg
## /media/point_cloud.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/media/point_cloud.png
## /media/teaser.png
Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/media/teaser.png
## /prepare_data.sh
```sh path="/prepare_data.sh"
#!/bin/bash
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
# Though we provide separate code for each dataset, the main operation remains the same:
# 1. download the dataset following the official instructions
# 2. convert images + camera intrinsics + depth maps into QA pairs. This step generates a jsonl file containing all the metadata and a folder containing the corresponding images, similar to https://github.com/facebookresearch/DepthLM/tree/main/examples/ibims1.
# Argoverse
## 1. install av2 library and download + unzip the dataset following the official instructions in https://argoverse.github.io/user-guide/getting_started.html#downloading-the-data
## 2. (optional) move 10-20 scenes from val to train folder to enlarge the training dataset size
## 3. curate the data
python utils/curate_argoverse.py \
"/path/to/argoverse/train_or_val/" \
"/path/to/output_image_folder" \
"/path/to/output_jsonl/argoverse2_train_or_val.jsonl"
# Waymo
## 1. download and unzip waymo open dataset from https://console.cloud.google.com/storage/browser/waymo_open_dataset_v_2_0_1
## 2. curate the data
python utils/curate_waymo.py \
--dataset_dir /path/to/waymo/training/ \
--out_json_path /path/to/output_jsonl/waymo_train.jsonl \
--out_image_dir /path/to/output_image_folder
# NuScenes
## 1. download and unzip the dataset following https://www.nuscenes.org/nuscenes, we use the "Mini" subset for evaluation and other scenes in "All" for training
## 2. install nuscenes devkit at https://github.com/nutonomy/nuscenes-devkit
pip install nuscenes-devkit
## 3. curate training data
python utils/curate_nuscenes_train.py \
--dataroot /path/to/nuscenes_all \
--dataroot_mini /path/to/nuscenes_mini \
--out_json_path /path/to/output_jsonl/nuscenes_train.jsonl \
--out_image_dir /path/to/output_image_folder
## 4. curate eval data
python utils/curate_nuscenes_eval.py \
--dataroot /path/to/nuscenes_mini \
--out_json_path /path/to/output_jsonl/nuscenes_eval.jsonl \
--out_image_dir /path/to/output_image_folder
# ScanNet++
# our dataloader will automatically separate train and eval samples, so no need to separate them
## 1. download scannet++ dataset from https://kaldir.vc.in.tum.de/scannetpp/
## 2. clone and install the scannet++ github repo at https://github.com/scannetpp/scannetpp
## 3. in /scannet_github_code_root/iphone/configs/prepare_iphone_data.yml, change "data_root" to the folder containing your downloaded data
## 4. move data curation code to the scannet github local repo (we need modules in the scannet code to read the data)
mv utils/curate_scannet.py /scannet_github_code_root/iphone/prepare_depth_json.py
## 5. go to the scannet github local repo and run the data curation code
cd /scannet_github_code_root
python -m iphone.prepare_depth_json iphone/configs/prepare_iphone_data.yml
# Taskonomy
## 1. download the fullplus version of the dataset following https://github.com/StanfordVL/taskonomy/tree/master/data
## 2. curate data
python utils/curate_taskonomy.py \
--dataroot /path/to/taskonomy \
--out_json_path /path/to/output_jsonl/taskonomy.jsonl \
--out_image_dir /path/to/output_image_folder
# HM3d
## 1. download the hm3d dataset using https://docs.omnidata.vision/starter_dataset_download.html (set the components to hm3d)
## 2. curate data (coming soon)
# Matterport3D
## 1. download the dataset at https://niessner.github.io/Matterport/
## 2. curate data
python utils/curate_matterport3d.py \
--dataroot /path/to/matterport \
--out_json_path /path/to/output_jsonl/matterport.jsonl \
--out_image_dir /path/to/output_image_folder
# DDAD
## 1. download the dataset and install the dgp library following the "How to Use" section in https://github.com/TRI-ML/DDAD
## 2. curate data
python utils/curate_ddad.py \
--ddad_trainval_json_path /path/to/ddad/ddad_train_val/ddad.json \
--out_json_path /path/to/output_jsonl/ddad.jsonl \
--out_image_dir /path/to/output_image_folder \
--path_to_dgp_lib /path/to/dgp/lib/folder
# ETH3D
## 1. download images and depth maps from https://www.eth3d.net/datasets
## 2. curate data
python utils/curate_eth3d.py \
--image_dir /path/to/eth3d/multi_view_training_dslr_jpg \
--depth_map_dir /path/to/eth3d/depth_map \
--out_json_path /path/to/output_jsonl/eth3d.jsonl \
--out_image_dir /path/to/output_image_folder
# sunRGBD & NYUv2
## 1. download data and unzip
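dataroot=/path/to/sunRGBD
mkdir -p "$dataroot"
# download and extract without changing directory, so the utils/ paths below still resolve
wget -P "$dataroot" http://cvgl.stanford.edu/data2/sun_rgbd.tgz
tar -xvzf "$dataroot/sun_rgbd.tgz" -C "$dataroot"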
## 2. curate data for sunRGBD (without NYUv2)
python utils/curate_sunRGBD.py \
--dataroot /path/to/SUNRGBD/root \
--out_json_path /path/to/output_jsonl/sunRGBD.jsonl \
--out_image_dir /path/to/output_image_folder
## 3. curate data for NYUv2
python utils/curate_NYU.py \
--dataroot /path/to/SUNRGBD/root \
--out_json_path /path/to/output_jsonl/NYUv2.jsonl \
--out_image_dir /path/to/output_image_folder
```
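Each curation command above writes a jsonl file plus an image folder. A minimal sketch for sanity-checking such a file is shown below. It assumes only that every non-empty line is a JSON object and makes no claim about which keys the curation scripts emit. The helper names `read_jsonl` and `summarize_keys` are hypothetical and not part of this repository.

```python
import json
from collections import Counter
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with Path(path).open("r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as e:
                # Report the offending line so a truncated write is easy to spot
                raise ValueError(f"{path}:{line_no}: invalid JSON ({e})") from e


def summarize_keys(records):
    """Return (record count, {top-level key: number of records containing it})."""
    counts = Counter()
    total = 0
    for rec in records:
        total += 1
        counts.update(rec.keys())
    return total, dict(counts)


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        total, keys = summarize_keys(read_jsonl(sys.argv[1]))
        print(f"{total} records")
        for key, n in sorted(keys.items()):
            print(f"  {key}: present in {n} records")
```

Running it against a generated file (for example the path passed as `--out_json_path`) lists which fields are present and whether any records lack them, which can be checked before starting a training run.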
## /requirements.txt
torch
torchvision
datasets
numpy
pandas
peft
pillow
qwen-vl-utils
huggingface-hub
einops
flash-attn
math_verify
opencv-python
tensorboard
transformers
trl==0.15.2
accelerate==1.6.0
## /train.py
```py path="/train.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import datetime
import logging
import os
import sys
import uuid
from dataclasses import dataclass, field
from typing import Optional
import datasets
import torch
import transformers
import trl
from qwen_vl_utils import process_vision_info
# from torch import distributed as dist
from torch.utils.tensorboard import SummaryWriter
from transformers import (
AutoModelForCausalLM,
AutoProcessor,
BatchFeature,
LlavaForConditionalGeneration,
MllamaForConditionalGeneration,
Qwen2_5_VLForConditionalGeneration,
Qwen2VLForConditionalGeneration,
set_seed,
TrainerCallback,
)
from transformers.integrations import TensorBoardCallback
from transformers.trainer_utils import get_last_checkpoint
from trl import (
get_kbit_device_map,
get_peft_config,
get_quantization_config,
ScriptArguments,
SFTTrainer,
TrlParser,
)
from utils.datasets import dataset_train
from utils.callbacks import get_callbacks
logger = logging.getLogger(__name__)
@dataclass
class ModelConfig(trl.ModelConfig):
output_model_local_path: str = field(
default="test-output",
metadata={"help": "Output model local path, do not set manually"},
)
output_model_filename: Optional[str] = field(
default="test-output", metadata={"help": "Output model relative manifold path"}
)
@dataclass
class SFTConfig(trl.SFTConfig):
"""
args for callbacks, benchmarks etc
"""
benchmarks: list[str] = field(
default_factory=lambda: [],
metadata={"help": "The benchmarks to run after training."},
)
callbacks: list[str] = field(
default_factory=lambda: [],
metadata={"help": "The callbacks to run during training."},
)
system_prompt: Optional[str] = field(
default=None,
metadata={"help": "The optional system prompt to use for benchmarking."},
)
hub_model_revision: Optional[str] = field(
default="main",
metadata={"help": "The Hub model branch to push the model to."},
)
overwrite_hub_revision: bool = field(
default=False, metadata={"help": "Whether to overwrite the Hub revision."}
)
push_to_hub_revision: bool = field(
default=False, metadata={"help": "Whether to push to a Hub revision/branch."}
)
@dataclass
# pyre-fixme[11]: Annotation `ScriptArguments` is not defined as a type.
class SFTScriptArguments(ScriptArguments):
    """
    Script arguments for the SFT training script.
    Extends `trl.ScriptArguments` with the dataset class, the image folder,
    image pixel limits, and the optional dataset options described in each
    field's help string.
    """
    dataset_class: str = field(
        default="LazySupervisedDataset_ArgoverseDepth_GRPO",
        metadata={"help": "dataset class name in utils.datasets"},
    )
max_pixels: Optional[int] = field(
default=12845056,
metadata={"help": "Maximum number of pixels for the image"},
)
min_pixels: Optional[int] = field(
default=3136,
metadata={"help": "Minimum number of pixels for the image"},
)
image_folder: Optional[str] = field(
default=None,
metadata={"help": "image folder on manifold"},
)
augment: Optional[float] = field(
default=None,
metadata={"help": "augmentation ratio"},
)
normalized_focal_length: Optional[float] = field(
default=None,
metadata={"help": "normalized focal length"},
)
sample_weights: Optional[str] = field(
default=None,
metadata={"help": "weights for sampling"},
)
pad: Optional[bool] = field(
default=None,
metadata={
"help": "whether to pad image to have same width and height in 2 image strategy"
},
)
height_max: Optional[float] = field(
default=None,
metadata={"help": "max height"},
)
height_min: Optional[float] = field(
default=None,
metadata={"help": "min height"},
)
width_min: Optional[float] = field(
default=None,
metadata={"help": "min width"},
)
width_max: Optional[float] = field(
default=None,
metadata={"help": "max width"},
)
ratio_min: Optional[float] = field(
default=None,
metadata={"help": "min ratio"},
)
ratio_max: Optional[float] = field(
default=None,
metadata={"help": "max ratio"},
)
processor = None
def configure_pixtral_vision_tower(model, compute_dtype, device):
vision_tower = model.vision_tower
vision_tower.to(dtype=compute_dtype, device=device)
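def convert_example(example):
    """
    Convert a dataset example into chat-style "messages".
    Input keys: "problem", "solution", "image" (or "images" for several images),
    plus optional "system" and "thinking". If "system" is missing, a default
    <think>/<answer> system prompt is used. The result is stored in
    example["messages"], e.g.:
    [
        {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
        {"role": "user", "content": [
            {"type": "image", "image": "/path/to/image.png"},
            {"type": "text", "text": "How many objects are included in this image?"}]},
        {"role": "assistant", "content": "<think>\nI can see 10 objects\n</think>\n\n<answer>\n10\n</answer>"}
    ]
    """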
messages = []
if "system" in example:
messages.append(
{
"role": "system",
"content": [{"type": "text", "text": example["system"]}],
}
)
else:
SYSTEM_PROMPT = (
"A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant "
"first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
"process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
"<think> reasoning process here </think><answer> answer here </answer>"
)
messages.append(
{
"role": "system",
"content": [{"type": "text", "text": SYSTEM_PROMPT}],
}
)
thinking = example.get("thinking", "") # no thinking case included
problem = example.get("problem")
solution = example.get("solution")
if "images" in example:
images = example.get("images")
messages.append(
{
"role": "user",
"content": [{"type": "image", "image": img} for img in images]
+ [{"type": "text", "text": problem}],
}
)
else:
image = example.get("image")
messages.append(
{
"role": "user",
"content": [
{"type": "image", "image": image},
{"type": "text", "text": problem},
],
}
)
messages.append(
{
"role": "assistant",
"content": f"{thinking}\n\n{solution}",
}
)
example["messages"] = messages
return example
def convert_example_phi4(example):
    """
    Convert a dataset example into chat-style "messages" (Phi-4 variant).
    Currently identical to convert_example: it reads "problem", "solution",
    "image" or "images", and the optional "system" and "thinking" keys, and
    stores the result in example["messages"].
    """
messages = []
if "system" in example:
messages.append(
{
"role": "system",
"content": [{"type": "text", "text": example["system"]}],
}
)
else:
SYSTEM_PROMPT = (
"A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant "
"first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
"process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
"<think> reasoning process here </think><answer> answer here </answer>"
)
messages.append(
{
"role": "system",
"content": [{"type": "text", "text": SYSTEM_PROMPT}],
}
)
thinking = example.get("thinking", "") # no thinking case included
problem = example.get("problem")
solution = example.get("solution")
if "images" in example:
images = example.get("images")
messages.append(
{
"role": "user",
"content": [{"type": "image", "image": img} for img in images]
+ [{"type": "text", "text": problem}],
}
)
else:
image = example.get("image")
messages.append(
{
"role": "user",
"content": [
{"type": "image", "image": image},
{"type": "text", "text": problem},
],
}
)
messages.append(
{
"role": "assistant",
"content": f"{thinking}\n\n{solution}",
}
)
example["messages"] = messages
return example
def pad_sequence(sequences, padding_side="right", padding_value=0):
"""
Pad a list of sequences to the same length.
sequences: list of tensors in [seq_len, *] shape
"""
assert padding_side in ["right", "left"]
max_size = sequences[0].size()
trailing_dims = max_size[1:]
max_len = max(len(seq) for seq in sequences)
batch_size = len(sequences)
output = sequences[0].new_full((batch_size, max_len) + trailing_dims, padding_value)
for i, seq in enumerate(sequences):
length = seq.size(0)
if padding_side == "right":
output.data[i, :length] = seq
else:
output.data[i, -length:] = seq
return output
def cat_with_pad(tensors, dim, padding_value=0):
"""
cat along dim, while pad to max for all other dims
"""
ndim = tensors[0].dim()
assert all(
t.dim() == ndim for t in tensors[1:]
), "All tensors must have the same number of dimensions"
out_size = [max(t.shape[i] for t in tensors) for i in range(ndim)]
out_size[dim] = sum(t.shape[dim] for t in tensors)
output = tensors[0].new_full(out_size, padding_value)
index = 0
for t in tensors:
# Create a slice list where every dimension except dim is full slice
slices = [slice(0, t.shape[d]) for d in range(ndim)]
# Update only the concat dimension slice
slices[dim] = slice(index, index + t.shape[dim])
        output[tuple(slices)] = t  # index with a tuple of slices, not a list
index += t.shape[dim]
return output
def pmc_vqa_collate_fn(batch):
input_ids_list = []
labels_list = []
input_image_embeds_list = []
image_attention_mask_list = []
image_sizes_list = []
for inputs in batch:
input_ids_list.append(inputs["input_ids"][0])
labels_list.append(inputs["labels"][0])
input_image_embeds_list.append(inputs["input_image_embeds"])
image_attention_mask_list.append(inputs["image_attention_mask"])
image_sizes_list.append(inputs["image_sizes"])
input_ids = pad_sequence(input_ids_list, padding_side="right", padding_value=0)
labels = pad_sequence(labels_list, padding_side="right", padding_value=0)
attention_mask = (input_ids != 0).long()
input_image_embeds = cat_with_pad(input_image_embeds_list, dim=0)
image_attention_mask = cat_with_pad(image_attention_mask_list, dim=0)
image_sizes = torch.cat(image_sizes_list)
return BatchFeature(
{
"input_ids": input_ids,
"labels": labels,
"attention_mask": attention_mask,
"input_image_embeds": input_image_embeds,
"image_attention_mask": image_attention_mask,
"image_sizes": image_sizes,
"input_mode": 1, # vision mode
}
)
def collate_fn_phi4(examples):
_IGNORE_INDEX = -100
_MAX_TRAINING_LENGTH = 8192
batch = []
for example in examples:
image = example["image"]
question = example.get("problem")
user_message = {
"role": "user",
"content": "<|image_1|>" + question,
}
prompt = processor.tokenizer.apply_chat_template(
[user_message], tokenize=False, add_generation_prompt=True
)
answer = f'{example.get("thinking", "")}\n\n{example.get("solution")}<|end|><|endoftext|>'
inputs = processor(prompt, images=[image], return_tensors="pt")
answer_ids = processor.tokenizer(answer, return_tensors="pt").input_ids
input_ids = torch.cat([inputs.input_ids, answer_ids], dim=1)
labels = torch.full_like(input_ids, _IGNORE_INDEX)
labels[:, -answer_ids.shape[1] :] = answer_ids
if input_ids.size(1) > _MAX_TRAINING_LENGTH:
input_ids = input_ids[:, :_MAX_TRAINING_LENGTH]
labels = labels[:, :_MAX_TRAINING_LENGTH]
if torch.all(labels == _IGNORE_INDEX).item():
# workaround to make sure loss compute won't fail
labels[:, -1] = processor.tokenizer.eos_token_id
batch.append(
{
"input_ids": input_ids,
"labels": labels,
"input_image_embeds": inputs.input_image_embeds,
"image_attention_mask": inputs.image_attention_mask,
"image_sizes": inputs.image_sizes,
}
)
return pmc_vqa_collate_fn(batch)
def find_subsequence(sequence, subsequence):
"""
Helper function to find the starting index of a subsequence within a sequence.
"""
seq_len = len(sequence)
sub_len = len(subsequence)
for i in range(seq_len - sub_len + 1):
if torch.equal(sequence[i : i + sub_len], subsequence):
return i
return None
def get_image_token_count(image, dummy_text="describe this image"):
"""
Compute the number of tokens generated for an image using the model's vision tower.
Returns 0 if token computation fails.
"""
try:
inputs = processor(images=image, text=dummy_text, return_tensors="pt").to(
"cuda"
)
with torch.no_grad():
output = model.vision_tower(pixel_values=inputs["pixel_values"])
token_count = output.last_hidden_state.shape[1]
if token_count == 0:
raise ValueError("Image token count is zero.")
return token_count
except Exception as e:
print(f"[ERROR] Failed to compute image tokens: {e}")
return 0 # Return zero to flag as invalid
def collate_fn_pixtral(examples):
texts = [
processor.apply_chat_template(
convert_example(example)["messages"],
tokenize=False,
add_generation_prompt=True,
)
for example in examples
]
image_inputs = []
for example in examples:
imgs, vids = process_vision_info(example["messages"])
image_inputs.append(imgs)
batch = processor(
text=texts,
images=image_inputs,
return_tensors="pt",
padding=True,
)
labels = batch["input_ids"].clone()
labels[labels == processor.tokenizer.pad_token_id] = -100
image_token_id = processor.tokenizer.convert_tokens_to_ids(processor.image_token)
labels[labels == image_token_id] = -100
batch["labels"] = labels
return batch
def collate_fn(examples):
texts = [
processor.apply_chat_template(
convert_example(example)["messages"],
tokenize=False,
add_generation_prompt=True,
)
for example in examples
]
image_inputs = []
for example in examples:
imgs, vids = process_vision_info(example["messages"])
image_inputs.append(imgs)
batch = processor(
text=texts,
images=image_inputs,
return_tensors="pt",
padding=True,
)
labels = batch["input_ids"].clone()
labels[labels == processor.tokenizer.pad_token_id] = -100
image_token_id = processor.tokenizer.convert_tokens_to_ids(processor.image_token)
labels[labels == image_token_id] = -100
batch["labels"] = labels
return batch
def main(script_args, training_args, model_args):
set_seed(training_args.seed)
###############
# Setup logging
###############
logging.basicConfig(
format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
handlers=[logging.StreamHandler(sys.stdout)],
)
log_level = training_args.get_process_log_level()
logger.setLevel(log_level)
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()
training_args.gradient_checkpointing_kwargs = {"use_reentrant": False}
# Log on each process a small summary
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
+ f" distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
)
logger.info(f"Model parameters {model_args}")
logger.info(f"Script parameters {script_args}")
    logger.info(f"Training parameters {training_args}")
print("script_args.image_folder = ", script_args.image_folder)
training_args.output_dir = model_args.output_model_local_path
# Check for last checkpoint
last_checkpoint = None
if os.path.isdir(training_args.output_dir):
last_checkpoint = get_last_checkpoint(training_args.output_dir)
if last_checkpoint is not None and training_args.resume_from_checkpoint is None:
logger.info(f"Checkpoint detected, resuming training at {last_checkpoint=}.")
################
# Load datasets
################
dataset_kwargs = {}
if script_args.normalized_focal_length is not None:
dataset_kwargs["normalized_focal_length"] = script_args.normalized_focal_length
    if script_args.sample_weights is not None:
        # the ";"-separated weight string is passed through unchanged
        dataset_kwargs["sample_weights"] = script_args.sample_weights
if script_args.height_max is not None:
dataset_kwargs["height_max"] = script_args.height_max
if script_args.height_min is not None:
dataset_kwargs["height_min"] = script_args.height_min
if script_args.width_min is not None:
dataset_kwargs["width_min"] = script_args.width_min
if script_args.width_max is not None:
dataset_kwargs["width_max"] = script_args.width_max
if script_args.ratio_min is not None:
dataset_kwargs["ratio_min"] = script_args.ratio_min
if script_args.ratio_max is not None:
dataset_kwargs["ratio_max"] = script_args.ratio_max
dataset = dataset_train(script_args.dataset_name, script_args.image_folder, **dataset_kwargs)
print("[dataset] dataset_size = ", len(dataset))
################
# Load tokenizer
################
global processor
if "vl" in model_args.model_name_or_path.lower():
processor = AutoProcessor.from_pretrained(
model_args.model_name_or_path,
trust_remote_code=model_args.trust_remote_code,
)
logger.info("Using AutoProcessor for vision-language model.")
if hasattr(processor, "pad_token") and processor.pad_token is None:
processor.pad_token = processor.eos_token
elif (
hasattr(processor.tokenizer, "pad_token")
and processor.tokenizer.pad_token is None
):
processor.tokenizer.pad_token = processor.tokenizer.eos_token
elif "pixtral-12b" in model_args.model_name_or_path.lower():
processor = AutoProcessor.from_pretrained(
model_args.model_name_or_path,
)
if hasattr(processor, "pad_token") and processor.pad_token is None:
processor.pad_token = processor.eos_token
elif (
hasattr(processor.tokenizer, "pad_token")
and processor.tokenizer.pad_token is None
):
processor.tokenizer.pad_token = processor.tokenizer.eos_token
processor.image_processor.do_resize = False
processor.image_processor.do_rescale = False
else:
processor = AutoProcessor.from_pretrained(
model_args.model_name_or_path,
trust_remote_code=True,
use_fast=True,
)
logger.info("Using AutoProcessor.")
# ###################
# # Model init kwargs
# ###################
logger.info("*** Initializing model kwargs ***")
torch_dtype = (
model_args.torch_dtype
if model_args.torch_dtype in ["auto", None]
else getattr(torch, model_args.torch_dtype)
)
quantization_config = get_quantization_config(model_args)
if "pixtral-12b" in model_args.model_name_or_path.lower():
# seems like use_cache is not supported in the model class
model_kwargs = dict(
revision=model_args.model_revision,
trust_remote_code=model_args.trust_remote_code,
attn_implementation={
"text_config": "flash_attention_2",
"vision_config": "eager",
},
torch_dtype=torch_dtype,
device_map=(
get_kbit_device_map() if quantization_config is not None else None
),
quantization_config=quantization_config,
)
else:
# training_args.model_init_kwargs = model_kwargs
model_kwargs = dict(
revision=model_args.model_revision,
trust_remote_code=model_args.trust_remote_code,
attn_implementation=model_args.attn_implementation,
torch_dtype=torch_dtype,
            use_cache=not training_args.gradient_checkpointing,
device_map=(
get_kbit_device_map() if quantization_config is not None else None
),
quantization_config=quantization_config,
)
if "Qwen2.5-VL" in model_args.model_name_or_path:
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
model_args.model_name_or_path, **model_kwargs
)
elif "pixtral-12b" in model_args.model_name_or_path.lower():
model = LlavaForConditionalGeneration.from_pretrained(
model_args.model_name_or_path, **model_kwargs
)
if training_args.gradient_checkpointing:
model.enable_input_require_grads()
# This is a workaround for a bug in the current implementation of gradient checkpointing
training_args.gradient_checkpointing_kwargs = {"use_reentrant": True}
else:
model = AutoModelForCausalLM.from_pretrained(
model_args.model_name_or_path, **model_kwargs
)
############################
# Initialize the SFT Trainer
############################
callbacks = get_callbacks(training_args, model_args)
# Write TensorBoard event files under <output_dir>/tensorboard_logs
callbacks.append(
TensorBoardCallback(
SummaryWriter(
log_dir=os.path.join(
training_args.output_dir,
"tensorboard_logs",
),
comment="",
purge_step=None,
max_queue=10,
flush_secs=120,
filename_suffix=str(uuid.uuid4()),
)
)
)
training_args.dataset_kwargs = {
"skip_prepare_dataset": True,
}
training_args.remove_unused_columns = False
if "pixtral" in model_args.model_name_or_path.lower():
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
processing_class=processor.tokenizer,
data_collator=collate_fn_pixtral,
peft_config=get_peft_config(model_args),
callbacks=callbacks,
)
else:
trainer = SFTTrainer(
model=model,
args=training_args,
train_dataset=dataset,
processing_class=processor.tokenizer,
data_collator=collate_fn,
peft_config=get_peft_config(model_args),
callbacks=callbacks,
)
# ###############
# # Training loop
# ###############
logger.info("*** Train ***")
checkpoint = None
if training_args.resume_from_checkpoint is not None:
checkpoint = training_args.resume_from_checkpoint
elif last_checkpoint is not None:
checkpoint = last_checkpoint
train_result = trainer.train(resume_from_checkpoint=checkpoint)
metrics = train_result.metrics
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()
# ##################################
# # Save model and create model card
# ##################################
logger.info("*** Save model ***")
trainer.save_model(training_args.output_dir)
processor.save_pretrained(training_args.output_dir)
logger.info(f"Model saved to {training_args.output_dir}")
if __name__ == "__main__":
parser = TrlParser((SFTScriptArguments, SFTConfig, ModelConfig))
script_args, training_args, model_args = parser.parse_args_and_config()
output_model_basename = os.path.basename(model_args.output_model_filename)
model_args.output_model_local_path = os.path.join(
training_args.output_dir,
"models",
"DepthLM",
)
os.makedirs(model_args.output_model_local_path, exist_ok=True)
main(script_args, training_args, model_args)
```
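The training loop in `train.py` gives an explicit `resume_from_checkpoint` priority over an auto-detected `last_checkpoint`. A minimal standalone sketch of that precedence follows; the helper name `resolve_checkpoint` is hypothetical and does not exist in the repository.

```python
from typing import Optional


def resolve_checkpoint(
    resume_from_checkpoint: Optional[str], last_checkpoint: Optional[str]
) -> Optional[str]:
    """Pick the checkpoint to resume from; None means start from scratch."""
    # An explicitly requested checkpoint always wins.
    if resume_from_checkpoint is not None:
        return resume_from_checkpoint
    # Otherwise fall back to whatever was detected in the output directory.
    return last_checkpoint
```

The returned value is what would be passed as `trainer.train(resume_from_checkpoint=...)`.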
## /train.sh
```sh path="/train.sh"
#!/bin/bash
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
model_path=$1
output_path=$2
# 1. Use ';' to separate multiple entries in image_folder, dataset_name and sample_weights. This example concatenates two identical datasets.
# 2. This is a minimal example meant only to verify that training runs; for real experiments, use the optimal hyper-parameters from the paper. On devices that cannot fit the paper's batch size, reduce the batch size and scale the learning rate with it following the square-root rule.
# 3. Adjust max_steps to control the number of training samples.
# 4. Set per_device_train_batch_size to 10 or more on H100 cards for Qwen2.5-VL 3B and 7B.
# 5. To train the Pixtral model, set per_device_train_batch_size to 3 and change the FSDP wrap setting to --fsdp_transformer_layer_cls_to_wrap "MistralDecoderLayer,PixtralAttentionLayer".
torchrun --nproc_per_node=2 --master_port=12433 train.py \
--model_name_or_path "$model_path" \
--image_folder "./examples/ibims1/;./examples/ibims1/" \
--dataset_name "./examples/ibims1/ibims1_val.jsonl;./examples/ibims1/ibims1_val.jsonl" \
--sample_weights "1;1" \
--max_seq_length 4096 \
--learning_rate 1e-5 \
--lr_scheduler_type cosine \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 1 \
--warmup_ratio 0.1 \
--max_grad_norm 0.1 \
--logging_steps 1 \
--report_to tensorboard \
--gradient_checkpointing true \
--attn_implementation "flash_attention_2" \
--max_steps 10 \
--log_level info \
--logging_strategy steps \
--output_dir "$output_path" \
--save_steps 3000 \
--save_strategy steps \
--eval_strategy no \
--torch_dtype bfloat16 \
--seed 42 \
--normalized_focal_length 1000.0 \
--height_min 700 \
--height_max 1200 \
--width_min 1000 \
--width_max 1400 \
--dataset_class dataset_train \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap "Qwen2_5_VLDecoderLayer"
```
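The comments in `train.sh` mention scaling the learning rate together with the batch size by the square-root rule. The sketch below shows one way to compute that; the function names are hypothetical, and the reference batch size used in the example is a placeholder, not a value taken from the paper.

```python
import math


def effective_batch_size(per_device: int, num_processes: int, grad_accum_steps: int) -> int:
    """Global batch size seen by the optimizer at each update step."""
    return per_device * num_processes * grad_accum_steps


def scale_learning_rate(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Square-root rule: the learning rate scales with sqrt(new_batch / base_batch)."""
    if base_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * math.sqrt(new_batch_size / base_batch_size)


if __name__ == "__main__":
    # Settings from the command above: 2 samples per device, 2 processes, 1 accumulation step.
    batch = effective_batch_size(per_device=2, num_processes=2, grad_accum_steps=1)
    # base_batch_size=64 is an illustrative placeholder only.
    print(batch, scale_learning_rate(base_lr=1e-5, base_batch_size=64, new_batch_size=batch))
```

Under this rule, quartering the batch size halves the learning rate.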
## /utils/callbacks.py
```py path="/utils/callbacks.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import subprocess
from typing import List
from transformers import TrainerCallback
from transformers.trainer_callback import TrainerControl, TrainerState
from transformers.training_args import TrainingArguments
from .evaluation import run_benchmark_jobs
from .hub import push_to_hub_revision
def is_slurm_available() -> bool:
# returns true if a slurm queueing system is available
try:
subprocess.run(
["sinfo"], check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE
)
return True
except (FileNotFoundError, subprocess.CalledProcessError):
return False
class DummyConfig:
def __init__(self, **kwargs):
for k, v in kwargs.items():
setattr(self, k, v)
class PushToHubRevisionCallback(TrainerCallback):
def __init__(self, model_config) -> None:
self.model_config = model_config
def on_save(
self,
args: TrainingArguments,
state: TrainerState,
control: TrainerControl,
**kwargs,
):
if state.is_world_process_zero:
global_step = state.global_step
# WARNING: if you use dataclasses.replace(args, ...) the accelerator dist state will be broken, so I do this workaround
# Also if you instantiate a new SFTConfig, the accelerator dist state will be broken
dummy_config = DummyConfig(
hub_model_id=args.hub_model_id,
hub_model_revision=f"{args.hub_model_revision}-step-{global_step:09d}",
output_dir=f"{args.output_dir}/checkpoint-{global_step}",
system_prompt=args.system_prompt,
)
# TODO: I think this could be made async
push_to_hub_revision(
dummy_config, extra_ignore_patterns=["*.pt"]
) # don't push the optimizer states
if is_slurm_available():
dummy_config.benchmarks = args.benchmarks
run_benchmark_jobs(dummy_config, self.model_config)
CALLBACKS = {
"push_to_hub_revision": PushToHubRevisionCallback,
}
def get_callbacks(train_config, model_config) -> List[TrainerCallback]:
callbacks = []
for callback_name in train_config.callbacks:
if callback_name not in CALLBACKS:
raise ValueError(f"Callback {callback_name} not found in CALLBACKS.")
callbacks.append(CALLBACKS[callback_name](model_config))
return callbacks
```
## /utils/curate_NYU.py
```py path="/utils/curate_NYU.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import os, shutil, torch
from glob import glob
import numpy as np
# from fiftyone import ViewField as F
from PIL import Image
import argparse
# Set up argument parser
parser = argparse.ArgumentParser(description="Process some files.")
parser.add_argument("--dataroot", type=str, default="/home/czptc2h/datasets/SUNRGBD", help="image dir")
parser.add_argument(
"--out_json_path", type=str, help="output jsonl path"
)
parser.add_argument(
"--out_image_dir", type=str, help="output image folder"
)
args = parser.parse_args()
dataroot = args.dataroot
scene_dirs = glob(os.path.join(dataroot, "SUNRGBD/k*/*/*"))
print("Scene Dirs:", scene_dirs)
out_json_path = args.out_json_path
out_image_path = args.out_image_dir
import shutil
if os.path.exists(out_image_path):
shutil.rmtree(out_image_path)
os.makedirs(out_image_path)
points_per_image = 100
import json, os
count = 0
with open(out_json_path, "w") as jsonl_file:
for scene_dir in scene_dirs:
data_dict = {}
## Get image file path from scene directory
image_path = glob(f"{scene_dir}/image/*")[0]
if "NYU" not in image_path:
continue
sub_dir = image_path.replace(f"{dataroot}/SUNRGBD/", "")
## Copy the image to the out_image_path directory
os.makedirs(os.path.dirname(out_image_path + "/" + sub_dir), exist_ok=True)
shutil.copy(image_path, out_image_path + "/" + sub_dir)
## Get depth map file path from scene directory
depth_path = glob(f"{scene_dir}/depth/*")[0]
print("Image Path:", image_path, "; Depth Path:", depth_path)
intrinsic_path = f"{scene_dir}/intrinsics.txt"
with open(intrinsic_path, "r") as file:
intrinsic_data = file.read().strip().split()
intrinsic_matrix = np.array(intrinsic_data, dtype=np.float32).reshape(
(3, 3)
)
print("Intrinsic Matrix:\n", intrinsic_matrix)
# Read the image from image_path into a PIL image
pil_image = Image.open(image_path)
data_dict["image"] = sub_dir
data_dict["intrinsics"] = [
float(intrinsic_matrix[0, 0]),
float(intrinsic_matrix[1, 1]),
float(intrinsic_matrix[0, 2]),
float(intrinsic_matrix[1, 2]),
] + [pil_image.size[0], pil_image.size[1]]
depth_gt = Image.open(depth_path)
depth_gt = np.asarray(depth_gt, dtype=np.float32)
depth_gt = depth_gt / 10000.0
# Randomly sample up to points_per_image pixels with 0.005 < depth < 25
valid_pixels = np.argwhere((depth_gt > 0.005) & (depth_gt < 25))
sampled_indices = np.random.choice(
len(valid_pixels), size=min(points_per_image, len(valid_pixels)), replace=False
)
sampled_pixels = valid_pixels[sampled_indices]
data_dict["pixel_coords"] = sampled_pixels[:, [1, 0]].tolist()
fx, fy, cx, cy = (
intrinsic_matrix[0, 0],
intrinsic_matrix[1, 1],
intrinsic_matrix[0, 2],
intrinsic_matrix[1, 2],
)
z = depth_gt[sampled_pixels[:, 0], sampled_pixels[:, 1]]
x = (sampled_pixels[:, 1] - cx) * z / fx
y = (sampled_pixels[:, 0] - cy) * z / fy
euclidean_distances = np.sqrt(x**2 + y**2 + z**2)
data_dict["depth"] = euclidean_distances.tolist()
print("PIL Image Size:", pil_image.size)
print("Depth GT Size:", depth_gt.shape)
print("Data Dictionary:", data_dict)
json.dump(data_dict, jsonl_file)
jsonl_file.write("\n")
count += 1
print(f"processed {count} images")
```
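`curate_NYU.py` stores Euclidean distance from the camera center rather than z-depth: each sampled pixel is back-projected through the pinhole intrinsics and the norm of the resulting 3D point is written to `depth`. A dependency-free sketch of that conversion is shown below; the function name `z_depth_to_distance` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def z_depth_to_distance(
    pixels_xy: Sequence[Tuple[float, float]],
    z_depths: Sequence[float],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[float]:
    """Convert per-pixel z-depth to Euclidean distance under a pinhole model."""
    distances = []
    for (u, v), z in zip(pixels_xy, z_depths):
        # Back-project pixel (u, v) at depth z into camera coordinates.
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        distances.append(math.sqrt(x * x + y * y + z * z))
    return distances
```

At the principal point the distance equals the z-depth; it grows toward the image borders.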
## /utils/curate_argoverse.py
```py path="/utils/curate_argoverse.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import json
import logging
import os
import sys
from pathlib import Path
from typing import Final
import av2.rendering.color as color_utils
import av2.rendering.rasterize as raster_rendering_utils
import av2.rendering.video as video_utils
import av2.utils.io as io_utils
import av2.utils.raster as raster_utils
import click
import cv2
import numpy as np
from av2.datasets.sensor.av2_sensor_dataloader import AV2SensorDataLoader
from av2.datasets.sensor.constants import RingCameras
from av2.map.map_api import ArgoverseStaticMap
from av2.rendering.color import GREEN_HEX, RED_HEX
from av2.utils.typing import NDArrayByte, NDArrayFloat, NDArrayInt
from numpy import random
from PIL import Image
logger = logging.getLogger(__name__)
NUM_RANGE_BINS: Final[int] = 50
RING_CAMERA_FPS: Final[int] = 20
def get_immediate_subfolders(folder_path: str) -> list:
"""Return a list of immediate subfolders in the given folder path."""
return [f.name for f in Path(folder_path).iterdir() if f.is_dir()]
if __name__ == "__main__":
if len(sys.argv) != 4:
print(
"Usage: python script.py <root_folder> <out_image_folder> <jsonl_output_path>"
)
sys.exit(1)
root_folder = sys.argv[1]
out_image_folder = sys.argv[2]
jsonl_output_path = sys.argv[3]
frame_sample_interval = 1
points_per_frame = 100 # by default we curate 100 labeled pixels per image which is more than enough for depth estimation, you can change this number to have more curated pixels
cameras_used = [
"ring_front_left",
"ring_front_right",
"ring_rear_left",
"ring_rear_right",
"ring_side_left",
"ring_side_right",
"ring_front_center",
"stereo_front_left",
"stereo_front_right",
]
folders = get_immediate_subfolders(root_folder)
print(f"there are {len(folders)} folders in total, the first one is {folders[0]}")
loader = AV2SensorDataLoader(
data_dir=Path(root_folder), labels_dir=Path(root_folder)
)
count = 0
count_rows = 0
with open(jsonl_output_path, "w") as f:
skip_log_id = "d37be0e2-8223-3eeb-a0e2-c4b75d5ff87b" # errors during my downloading for this log, comment if you dont have issues
skip = False
for log_id in folders[:2]:  # NOTE: only the first two logs are processed; iterate over `folders` to curate every log
if skip and log_id != skip_log_id:
continue
skip = False
print("log_id", log_id)
# get the image file path
for _, cam_name in enumerate(list(RingCameras)):
if cam_name not in cameras_used:
print("skip ", cam_name, " camera")
continue
cam_im_fpaths = loader.get_ordered_log_cam_fpaths(log_id, cam_name)
# Sample every frame_sample_interval elements into a subset path list
sampled_cam_im_fpaths = cam_im_fpaths[::frame_sample_interval]
print("cam_im_fpaths = ", cam_im_fpaths)
for i, im_fpath in enumerate(sampled_cam_im_fpaths):
try:
data_dict = {}
data_dict["image"] = str(im_fpath).replace(root_folder, "")
# get the object labels
cam_timestamp_ns = int(im_fpath.stem)
city_SE3_ego = loader.get_city_SE3_ego(log_id, cam_timestamp_ns)
if city_SE3_ego is None:
logger.warning("missing LiDAR pose")
continue
# load feather file path, e.g. '315978406032859416.feather"
lidar_fpath = loader.get_closest_lidar_fpath(
log_id, cam_timestamp_ns
)
if lidar_fpath is None:
logger.info(
"No LiDAR sweep found within the synchronization interval for %s, so skipping...",
cam_name,
)
continue
lidar_timestamp_ns = int(lidar_fpath.stem)
lidar_points_ego = io_utils.read_lidar_sweep(
lidar_fpath, attrib_spec="xyz"
)
(
uv,
points_cam,
is_valid_points,
) = loader.project_ego_to_img_motion_compensated(
points_lidar_time=lidar_points_ego,
cam_name=cam_name,
cam_timestamp_ns=cam_timestamp_ns,
lidar_timestamp_ns=lidar_timestamp_ns,
log_id=log_id,
)
if is_valid_points is None or uv is None or points_cam is None:
continue
if is_valid_points.sum() == 0:
continue
uv_int: NDArrayInt = np.round(uv[is_valid_points]).astype(
np.int32
) # image coordinates in pixels
points_cam = points_cam[
is_valid_points
] # 3d points in camera coordinates
# read the object bounding boxes and labels
cuboids = loader.get_labels_at_lidar_timestamp(
log_id, lidar_timestamp_ns
)
# convert to camera reference frame
# project cuboids to camera reference frame
pinhole_camera = loader.get_log_pinhole_camera(
log_id=log_id, cam_name=cam_name
)
city_SE3_ego_cam_t = loader.get_city_SE3_ego(
log_id=log_id, timestamp_ns=cam_timestamp_ns
)
# get transformation to bring point in egovehicle frame to city frame,
# at the time when the LiDAR sweep was recorded.
city_SE3_ego_lidar_t = loader.get_city_SE3_ego(
log_id=log_id, timestamp_ns=lidar_timestamp_ns
)
intrinsics = [
pinhole_camera.intrinsics.fx_px,
pinhole_camera.intrinsics.fy_px,
pinhole_camera.intrinsics.cx_px,
pinhole_camera.intrinsics.cy_px,
pinhole_camera.intrinsics.width_px,
pinhole_camera.intrinsics.height_px,
]
# point clouds
# Ensure the number of points to sample does not exceed available points
num_points_to_sample = min(points_per_frame, len(uv_int))
# Randomly choose point indices without replacement
sampled_indices = np.random.choice(
len(uv_int), num_points_to_sample, replace=False
)
# Subset the uv_int and points_cam arrays
uv_int = uv_int[sampled_indices].tolist()
points_cam = points_cam[sampled_indices].tolist()
data_dict["intrinsics"] = intrinsics
data_dict["pixel_coords"] = uv_int
# Read the image file path as a PIL image
undistorted_pil_image = Image.open(im_fpath)
# Downscale the image when the focal length fx exceeds 1000 pixels
if intrinsics[0] > 1000:
# Calculate the scaling factor to make fx equal to 1000
scale_factor = 1000 / intrinsics[0]
# Rescale the undistorted_pil_image
new_width = int(undistorted_pil_image.width * scale_factor)
new_height = int(
undistorted_pil_image.height * scale_factor
)
undistorted_pil_image = undistorted_pil_image.resize(
(new_width, new_height), Image.LANCZOS
)
# Rescale the pixel coordinates
data_dict["pixel_coords"] = [
(int(x * scale_factor), int(y * scale_factor))
for x, y in data_dict["pixel_coords"]
]
data_dict["intrinsics"] = [
1000.0,
1000.0,
data_dict["intrinsics"][2] * scale_factor,
data_dict["intrinsics"][3] * scale_factor,
undistorted_pil_image.width,
undistorted_pil_image.height,
]
# Construct the full path for the output image
output_image_path = os.path.join(
out_image_folder, data_dict["image"].lstrip("/")
)
# Create the directory if it doesn't exist
os.makedirs(os.path.dirname(output_image_path), exist_ok=True)
# Save the image (the format is inferred from the output file extension)
undistorted_pil_image.save(output_image_path)
data_dict["depth"] = []
for point_id in range(len(uv_int)):
data_dict["depth"].append(
(
points_cam[point_id][0] ** 2
+ points_cam[point_id][1] ** 2
+ points_cam[point_id][2] ** 2
)
** 0.5
)
f.write(f"{json.dumps(data_dict)}\n")
count_rows += 1
# exit()
count += 1
if count % 1000 == 0:
print("data_dict", data_dict)
print(
"processed ", count, " frames and ", count_rows, "rows"
)
except Exception as e:
print("error ", e)
break
```
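`curate_argoverse.py` downsizes any image whose focal length exceeds 1000 pixels and rescales the pixel coordinates and intrinsics to match. The sketch below isolates that bookkeeping without touching image data. The function name is hypothetical, and unlike the script (which writes 1000.0 for both focal lengths) it scales `fy` by the same factor as `fx`, which gives the same result only when `fx == fy`.

```python
from typing import List, Sequence, Tuple


def rescale_to_focal_length(
    intrinsics: Sequence[float],
    pixel_coords: Sequence[Tuple[int, int]],
    target_fx: float = 1000.0,
) -> Tuple[List[float], List[Tuple[int, int]], float]:
    """Return (new_intrinsics, new_pixel_coords, scale) after normalizing fx."""
    fx, fy, cx, cy, width, height = intrinsics
    if fx <= target_fx:
        # Never upscale: leave everything unchanged.
        return list(intrinsics), [tuple(p) for p in pixel_coords], 1.0
    scale = target_fx / fx
    new_intrinsics = [
        fx * scale,
        fy * scale,
        cx * scale,
        cy * scale,
        int(width * scale),
        int(height * scale),
    ]
    # Pixel coordinates shrink by the same factor as the image.
    new_coords = [(int(x * scale), int(y * scale)) for x, y in pixel_coords]
    return new_intrinsics, new_coords, scale
```

Metric distances are unaffected by resizing, so the `depth` values need no adjustment.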
## /utils/curate_ddad.py
```py path="/utils/curate_ddad.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import sys
import argparse
# Set up argument parser
parser = argparse.ArgumentParser(description="Process some files.")
parser.add_argument(
"--out_json_path", type=str, help="output jsonl path"
)
parser.add_argument(
"--out_image_dir", type=str, help="output image folder"
)
parser.add_argument(
"--ddad_trainval_json_path", type=str, help="path to the ddad train val json path, i.e., ddad/ddad_train_val/ddad.json"
)
parser.add_argument(
"--path_to_dgp_lib", type=str, help="dgp path"
)
args = parser.parse_args()
sys.path.insert(
0,
args.path_to_dgp_lib,
)  # add the dgp package path so that its modules can be imported
import cv2
import numpy as np
import PIL
from dgp.datasets.synchronized_dataset import SynchronizedSceneDataset
from dgp.proto.ontology_pb2 import Ontology
from dgp.utils.protobuf import open_pbobject
from dgp.utils.visualization_utils import visualize_semantic_segmentation_2d
# from IPython import display
from matplotlib.cm import get_cmap
plasma_color_map = get_cmap("plasma")
out_json_path = args.out_json_path
output_image_path = args.out_image_dir
points_per_image = 100
import os
# Remove the folder of output_image_path if it exists
if os.path.exists(output_image_path):
import shutil
shutil.rmtree(output_image_path)
# Ensure the output directory exists
os.makedirs(output_image_path, exist_ok=True)
# Define high level variables
DDAD_TRAIN_VAL_JSON_PATH = args.ddad_trainval_json_path
DATUMS = ["lidar"] + ["CAMERA_%02d" % idx for idx in [1, 5, 6, 7, 8, 9]]
# Load the val set
ddad_val = SynchronizedSceneDataset(
DDAD_TRAIN_VAL_JSON_PATH,
split="val",
datum_names=DATUMS,
generate_depth_from_datum="lidar",
)
print("Loaded DDAD val split containing {} samples".format(len(ddad_val)))
import json
# Open the out_json_path as a jsonl file for writing
with open(out_json_path, "w") as jsonl_file:
count = 0
# Iterate through the dataset.
for sample in ddad_val:
# Each sample contains a list of the requested datums.
print("sample =", sample, "/", len(sample))
for i in range(len(sample[0])):
datum = sample[0][i]
if "CAMERA" in datum["datum_name"]:
data_dict = {}
image_fname = f"{count}.jpg"
data_dict["image"] = "val_images/" + image_fname
print(datum["datum_name"], i)
# point_cloud = lidar["point_cloud"] # Nx3 numpy.ndarray
image_01 = datum["rgb"] # PIL.Image
depth_01 = datum["depth"] # (H,W) numpy.ndarray, generated from 'lidar'
data_dict["intrinsics"] = [
float(datum["intrinsics"][0, 0]),
float(datum["intrinsics"][1, 1]),
float(datum["intrinsics"][0, 2]),
float(datum["intrinsics"][1, 2]),
image_01.size[0], # Image width
image_01.size[1], # Image height
]
# print("image_01 = ", image_01, "; depth_01 = ", depth_01)
# Find non-zero elements in depth_01
non_zero_indices = np.nonzero(depth_01)
random_indices = np.random.choice(
len(non_zero_indices[0]), size=points_per_image, replace=False
)
non_zero_indices = (
non_zero_indices[0][random_indices],
non_zero_indices[1][random_indices],
)
non_zero_values = depth_01[non_zero_indices]
data_dict["pixel_coords"] = []
data_dict["depth"] = []
# Collect pixel coordinates as [x, y] together with their depth values
for coord, value in zip(zip(*non_zero_indices), non_zero_values):
data_dict["pixel_coords"].append([int(coord[1]), int(coord[0])])
data_dict["depth"].append(float(value))
# print(f"Pixel coordinates: {coord}, Depth value: {value}")
# # Calculate and print the minimum and maximum values in the non-zero depth values
# min_depth = np.min(non_zero_values)
# max_depth = np.max(non_zero_values)
# print(
# f"Minimum depth value: {min_depth}, Maximum depth value: {max_depth}"
# )
print("data_dict = ", data_dict)
json.dump(data_dict, jsonl_file)
jsonl_file.write("\n")
# Save image_01 to the specified path
image_save_path = os.path.join(output_image_path, image_fname)
image_01.save(image_save_path)
count += 1
print(f"processed {count} images")
# breakpoint()
# break
```
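Each line that `curate_ddad.py` writes is a JSON object with the keys `image`, `intrinsics` (`[fx, fy, cx, cy, width, height]`), `pixel_coords` (`[x, y]` pairs) and `depth`. A quick sanity check over a curated file can catch mismatched lengths or out-of-bounds pixels before training. The sketch below assumes exactly that layout, as inferred from the script; the function names are hypothetical.

```python
import json
from typing import Iterable, List


def validate_record(record: dict) -> List[str]:
    """Return a list of problems found in one curated record (empty if none)."""
    problems = [
        f"missing key: {key}"
        for key in ("image", "intrinsics", "pixel_coords", "depth")
        if key not in record
    ]
    if problems:
        return problems
    if len(record["intrinsics"]) != 6:
        return ["intrinsics must be [fx, fy, cx, cy, width, height]"]
    width, height = record["intrinsics"][4], record["intrinsics"][5]
    if len(record["pixel_coords"]) != len(record["depth"]):
        problems.append("pixel_coords and depth differ in length")
    for x, y in record["pixel_coords"]:
        if not (0 <= x < width and 0 <= y < height):
            problems.append(f"pixel ({x}, {y}) outside {width}x{height} image")
            break
    if any(d <= 0 for d in record["depth"]):
        problems.append("non-positive depth value")
    return problems


def validate_jsonl_lines(lines: Iterable[str]) -> List[str]:
    """Validate every non-empty JSONL line and prefix problems with line numbers."""
    report = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        for problem in validate_record(json.loads(line)):
            report.append(f"line {line_number}: {problem}")
    return report
```

To check a file, pass an open file handle, for example `validate_jsonl_lines(open(path))`.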
## /utils/curate_eth3d.py
```py path="/utils/curate_eth3d.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import os
import numpy as np
import argparse
# Set up argument parser
parser = argparse.ArgumentParser(description="Process some files.")
parser.add_argument("--image_dir", type=str, default="/home/czptc2h/datasets/ETH3D/multi_view_training_dslr_jpg", help="image dir")
parser.add_argument("--depth_map_dir", type=str, default="/home/czptc2h/datasets/ETH3D/depth", help="depth map dir")
parser.add_argument(
"--out_json_path", type=str, help="output jsonl path"
)
parser.add_argument(
"--out_image_dir", type=str, help="output image folder"
)
args = parser.parse_args()
def get_image_paths(directory):
image_paths = []
for root, dirs, files in os.walk(directory):
for file in files:
if file.lower().endswith(
(".png", ".jpg", ".jpeg", ".bmp", ".gif", ".tiff")
):
image_paths.append(os.path.join(root, file))
return image_paths
image_directory = args.image_dir
all_image_paths = get_image_paths(image_directory)
all_image_paths.sort(key=lambda x: os.path.basename(x))
print(all_image_paths[:10])
depth_directory = args.depth_map_dir
depth_image_paths = get_image_paths(depth_directory)
depth_image_paths.sort(key=lambda x: os.path.basename(x))
print(depth_image_paths[:10])
from PIL import Image
points_per_image = 100
out_json_path = args.out_json_path
out_image_path = args.out_image_dir
import shutil
if os.path.exists(out_image_path):
shutil.rmtree(out_image_path)
os.makedirs(out_image_path)
import cv2
def undistort_fisheye(image, depth_image, camera_params):
fx, fy, cx, cy = map(float, camera_params[4:8])
k1, k2, p1, p2, k3, k4, sz1, sy1 = map(float, camera_params[8:])
width, height = image.size
k = np.array([k1, k2, k3, k4])
p = np.array([p1, p2])
sz = np.array([sz1, sy1])
# Convert PIL image to numpy array
image_np = np.array(image)
# Camera matrix
K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]])
# Distortion coefficients
D = np.array([k1, k2, k3, k4])
# Undistort image using OpenCV
h, w = image_np.shape[:2]
map1, map2 = cv2.fisheye.initUndistortRectifyMap(
K, D, np.eye(3), K, (w, h), cv2.CV_16SC2
)
undistorted_image_np = cv2.remap(
image_np,
map1,
map2,
interpolation=cv2.INTER_LINEAR,
borderMode=cv2.BORDER_CONSTANT,
)
# Convert back to PIL image
undistorted_image = Image.fromarray(undistorted_image_np)
# Undistort the depth map with the same remap tables as the image;
# nearest-neighbour interpolation avoids blending depths across object boundaries
depth_image_np = np.array(depth_image)
undistorted_depth_image_np = cv2.remap(
depth_image_np,
map1,
map2,
interpolation=cv2.INTER_NEAREST,
borderMode=cv2.BORDER_CONSTANT,
)
undistorted_depth_image = Image.fromarray(undistorted_depth_image_np)
new_intrinsics = [fx, fy, cx, cy, width, height]
return undistorted_image, undistorted_depth_image, new_intrinsics
import json
count = 0
with open(out_json_path, "w") as jsonl_file:
for image_path, depth_path in zip(all_image_paths, depth_image_paths):
image = Image.open(image_path)
# image.save(os.path.join(out_image_path, "after_first_read.jpg"))
with open(depth_path, "rb") as f:
width, height = image.size
depth_data = np.fromfile(f, dtype=np.float32, count=width * height)
depth_image = depth_data.reshape((height, width))
print(f"Loaded Image: {image_path}, Loaded Depth Map: {depth_path}")
image_folder = os.path.dirname(os.path.dirname(os.path.dirname(image_path)))
dslr_calibration_folder = os.path.join(image_folder, "dslr_calibration_jpg")
corresponding_camera_file = os.path.join(dslr_calibration_folder, "cameras.txt")
if os.path.exists(corresponding_camera_file):
print(
f"Found corresponding camera file: {corresponding_camera_file} for image: {image_path}"
)
with open(corresponding_camera_file, "r") as camera_file:
for line in camera_file:
if not line.startswith("#"):
camera_params = line.strip().split(" ")
break
print(f"Camera Parameters: {camera_params}")
# Undistort image and depth_image
fx, fy, cx, cy = map(float, camera_params[4:8])
if camera_params[1] == "THIN_PRISM_FISHEYE":
k1, k2, p1, p2, k3, k4, sz1, sy1 = map(float, camera_params[8:])
# Call the function
image, depth_image, new_intrinsics = undistort_fisheye(
image, depth_image, camera_params
)
else:
print("Camera model not supported")
continue
# Downscale the image so that its height is at most 1280 pixels (never upscale)
scale_factor = min(1280.0 / height, 1)
new_width = int(width * scale_factor)
new_height = int(height * scale_factor)
image = image.resize((new_width, new_height))
depth_image = np.array(
depth_image
) # dont rescale depth image, rescale the pixel_coordinates
# Calculate new intrinsics
fx *= scale_factor
fy *= scale_factor
cx *= scale_factor
cy *= scale_factor
new_intrinsics = [fx, fy, cx, cy, new_width, new_height]
data_dict = {}
data_dict["image"] = image_path.replace(
image_directory+"/", ""
).replace(".png", ".jpg")
data_dict["intrinsics"] = [
float(new_intrinsics[0]),
float(new_intrinsics[1]),
float(new_intrinsics[2]),
float(new_intrinsics[3]),
int(new_intrinsics[4]),
int(new_intrinsics[5]),
]
data_dict["pixel_coords"] = []
data_dict["depth"] = []
valid_indices = np.argwhere(
(depth_image > 1e-4)
& (depth_image < 1e6)
& (np.arange(depth_image.shape[0])[:, None] > 10)
& (np.arange(depth_image.shape[0])[:, None] < depth_image.shape[0] - 10)
& (np.arange(depth_image.shape[1])[None, :] > 10)
& (np.arange(depth_image.shape[1])[None, :] < depth_image.shape[1] - 10)
)
sampled_indices = valid_indices[
np.random.choice(valid_indices.shape[0], points_per_image, replace=False)
]
for y, x in sampled_indices:
x_ori = x
y_ori = y
x = int(x * scale_factor)
y = int(y * scale_factor)
data_dict["pixel_coords"].append([int(x), int(y)])
fx, fy, cx, cy, width, height = data_dict["intrinsics"]
x_normalized = (x - cx) / fx
y_normalized = (y - cy) / fy
z = float(depth_image[y_ori, x_ori])
euclidean_distance = np.sqrt(x_normalized**2 + y_normalized**2 + 1) * z
data_dict["depth"].append(float(euclidean_distance))
# Save the resized image into out_image_path
resized_image_path = os.path.join(out_image_path, data_dict["image"])
print("resized_image_path", resized_image_path)
os.makedirs(os.path.dirname(resized_image_path), exist_ok=True)
image.save(resized_image_path)
print("Data Dictionary:", data_dict)
json.dump(data_dict, jsonl_file)
jsonl_file.write("\n")
count += 1
print(f"processed {count} images")
```
## /utils/curate_matterport3d.py
```py path="/utils/curate_matterport3d.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import json, random
import os
import sys
import numpy as np
import torch
from PIL import Image
import argparse
# Set up argument parser
parser = argparse.ArgumentParser(description="Process some files.")
parser.add_argument("--dataroot", type=str, default="/home/czptc2h/datasets/matterport", help="data root")
parser.add_argument(
"--out_json_path", type=str, help="output jsonl path"
)
parser.add_argument(
"--out_image_dir", type=str, help="output image folder"
)
args = parser.parse_args()
root = args.dataroot
def get_all_image_paths(root):
image_paths = []
for subdir, _, files in os.walk(root):
if "undistorted_color_images" in subdir:
for file in files:
if file.endswith((".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".gif")):
image_paths.append(os.path.join(subdir, file))
return image_paths
def get_all_file_paths(root, folder_name, file_extensions=(".png",)):
image_paths = []
for subdir, _, files in os.walk(root):
if folder_name in subdir:
for file in files:
if file.endswith(file_extensions):
image_paths.append(os.path.join(subdir, file))
return image_paths
all_image_paths = sorted(get_all_image_paths(root))
all_depth_paths = sorted(get_all_file_paths(root, "undistorted_depth_images", (".png")))
all_calib_paths = sorted(
get_all_file_paths(root, "undistorted_camera_parameters", (".conf"))
)
calib_map_dict = {}
for calib_path in all_calib_paths:
folder_name = os.path.relpath(calib_path, root).split(os.sep)[0]
calib_map_dict[folder_name] = calib_path
points_per_image = 100
out_json_path = args.out_json_path
# Create the directory for out_json_path if it doesn't exist
os.makedirs(os.path.dirname(out_json_path), exist_ok=True)
out_image_path = args.out_image_dir
import shutil
if os.path.exists(out_image_path):
shutil.rmtree(out_image_path)
os.makedirs(out_image_path)
count = 0
with open(out_json_path, "w") as jsonl_file:
for image_path, depth_path in zip(all_image_paths, all_depth_paths):
folder_name = os.path.relpath(image_path, root).split(os.sep)[0]
calib_path = calib_map_dict[folder_name]
# Extract the base filename from the image_path
base_filename = os.path.basename(image_path)
# Initialize variables to store the intrinsics matrix
intrinsics_matrix = None
# Read the calibration file
with open(calib_path, "r") as calib_file:
for line in calib_file:
# Check if the line contains the base filename
if base_filename in line:
# Read the previous line for intrinsics_matrix
calib_file.seek(0) # Reset file pointer to the beginning
lines = calib_file.readlines()
for i, l in enumerate(lines):
if base_filename in l:
# The intrinsics_matrix is expected to be in the lines before the scan line
for j in range(i - 1, -1, -1):
intrinsics_matrix_line = lines[j]
if "intrinsics_matrix" in intrinsics_matrix_line:
# Extract the values after 'intrinsics_matrix'
intrinsics_matrix = list(
map(float, intrinsics_matrix_line.split()[1:])
)
break
break
if not intrinsics_matrix:
print(f"Intrinsics Matrix not found for {base_filename}")
continue
data_dict = {}
data_dict["image_path"] = folder_name + "/" + base_filename
# Read the image at image_path as a PIL image
pil_image = Image.open(image_path)
fx = intrinsics_matrix[0]
fy = intrinsics_matrix[4]
cx = intrinsics_matrix[2]
cy = intrinsics_matrix[5]
data_dict["intrinsics"] = [
fx,
fy,
cx,
cy,
pil_image.width,
pil_image.height,
]
# Read the depth image at depth_path as a PIL image
depth_pil_image = Image.open(depth_path)
# Convert depth image to numpy array for easier manipulation
depth_array = np.array(depth_pil_image)
# Get the coordinates where depth is not 0
non_zero_coords = np.argwhere(depth_array > 0)
# Randomly sample 2 * points_per_image coordinates
sample_size = min(len(non_zero_coords), 2 * points_per_image)
if sample_size < 50:
continue
if len(non_zero_coords) < 2 * points_per_image:
print(
f"Population size: {len(non_zero_coords)} is smaller than required sample size: {2 * points_per_image}"
)
sampled_coords = random.sample(list(non_zero_coords), sample_size)
# Extract intrinsics from data_dict
fx, fy, cx, cy, width, height = data_dict["intrinsics"]
# Initialize a list to store the Euclidean distance of each sampled point
euclidean_distances = []
# Iterate over the sampled coordinates
for coord in sampled_coords:
y, x = coord
# Get the depth value at the sampled coordinate
depth = depth_array[y, x] / 4000.0
x_real = (x - cx) * depth / fx
y_real = (y - cy) * depth / fy
z_real = depth
# Calculate the Euclidean distance
euclidean_distances.append(
float(np.sqrt(x_real**2 + y_real**2 + z_real**2))
)
# Filter and collect the first points_per_image elements that satisfy the conditions
filtered_coords = []
filtered_distances = []
for coord, distance in zip(sampled_coords, euclidean_distances):
if 0.05 <= distance <= 50:
filtered_coords.append([int(coord[1]), int(coord[0])]) # [x, y] format
filtered_distances.append(distance)
if len(filtered_coords) == points_per_image:
break
# Set the data_dict values
data_dict["pixel_coords"] = filtered_coords
data_dict["depth"] = filtered_distances
# Save the pil_image to the relative path of data_dict["image_path"] under out_image_path
relative_image_path = os.path.join(out_image_path, data_dict["image_path"])
os.makedirs(os.path.dirname(relative_image_path), exist_ok=True)
pil_image.save(relative_image_path)
# Write data_dict into jsonl_file
jsonl_file.write(json.dumps(data_dict) + "\n")
count += 1
if count % 1000 == 0:
print(f"Iteration: {count}")
print(f"Data Dictionary: {data_dict}")
```
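The distance loop in the script above back-projects each sampled pixel with the pinhole model and takes the norm of the resulting 3D point. A minimal standalone sketch of that step follows; the function name `pixel_to_distance` and the example intrinsics are illustrative assumptions and do not appear in the repository.

```python
import math


def pixel_to_distance(u, v, z, fx, fy, cx, cy):
    """Return the Euclidean distance from the camera centre to the 3D point
    seen at pixel (u, v) with z-depth z (result is in the same unit as z)."""
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return math.sqrt(x * x + y * y + z * z)


# Hypothetical intrinsics, chosen only to make the arithmetic easy to follow.
fx = fy = 100.0
cx, cy = 320.0, 240.0

# At the principal point the ray is the optical axis, so distance equals depth.
on_axis = pixel_to_distance(320, 240, 2.0, fx, fy, cx, cy)

# 300 px right of and 400 px below the principal point at z = 1 gives the
# 3D point (3, 4, 1), so the distance is sqrt(26).
off_axis = pixel_to_distance(620, 640, 1.0, fx, fy, cx, cy)
```

Away from the optical axis the Euclidean distance is always larger than the z-depth, which is why the stored `depth` values differ from the raw depth-map readings.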
## /utils/curate_nuscenes_eval.py
```py path="/utils/curate_nuscenes_eval.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import json, os
import cv2
import numpy as np
from nuscenes.nuscenes import NuScenes
from nuscenes.utils.data_classes import LidarPointCloud
from PIL import Image, ImageDraw
from pyquaternion import Quaternion
import argparse
# Set up argument parser
parser = argparse.ArgumentParser(description="Process some files.")
parser.add_argument("--dataroot_mini", type=str, default="/home/czptc2h/datasets/nuscenes", help="data root mini")
parser.add_argument(
"--out_json_path", type=str, help="output jsonl path"
)
parser.add_argument(
"--out_image_dir", type=str, help="output image folder"
)
args = parser.parse_args()
# Initialize the NuScenes dataset
dataroot = args.dataroot_mini
version = "v1.0-trainval" # NOTE: loads the trainval split, despite the --dataroot_mini argument name
nusc = NuScenes(version=version, dataroot=dataroot, verbose=True)
def map_pointcloud_to_image(pointcloud, camera_token):
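"""
Map a LiDAR point cloud to the image plane.
Args:
pointcloud: sample_data record (dict) of the LiDAR sweep
camera_token: Token of the camera sample data
Returns:
points_img: (N, 2) points in image coordinates
depths: (N,) depth values along the camera z-axis
im: Camera image as loaded by cv2.imread (BGR)
"""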
cam = nusc.get("sample_data", camera_token)
cam_path = os.path.join(nusc.dataroot, cam["filename"])
im = cv2.imread(cam_path)
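# Get sensor calibration data (nuScenes calibrated_sensor poses are relative to the ego vehicle, so "world" below means the ego frame)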
lidar_to_world = nusc.get(
"calibrated_sensor", pointcloud["calibrated_sensor_token"]
)
lidar_rotation = Quaternion(lidar_to_world["rotation"])
lidar_translation = np.array(lidar_to_world["translation"])
cam_to_world = nusc.get("calibrated_sensor", cam["calibrated_sensor_token"])
cam_intrinsic = np.array(cam_to_world["camera_intrinsic"])
cam_rotation = Quaternion(cam_to_world["rotation"])
cam_translation = np.array(cam_to_world["translation"])
# Transform points from lidar to world coordinate
pc = LidarPointCloud.from_file(os.path.join(nusc.dataroot, pointcloud["filename"]))
points = pc.points[:3, :]
points = np.vstack((points, np.ones(points.shape[1])))
# Transformation matrix from lidar to world coordinate
lidar_to_world_matrix = np.eye(4)
lidar_to_world_matrix[:3, :3] = lidar_rotation.rotation_matrix
lidar_to_world_matrix[:3, 3] = lidar_translation
# Transformation matrix from world to camera coordinate
world_to_cam_matrix = np.eye(4)
world_to_cam_matrix[:3, :3] = cam_rotation.rotation_matrix.T
world_to_cam_matrix[:3, 3] = -np.dot(
cam_rotation.rotation_matrix.T, cam_translation
)
# Transform points to camera coordinate
points_cam = np.dot(world_to_cam_matrix, np.dot(lidar_to_world_matrix, points))
# Only keep points in front of the camera
mask = points_cam[2, :] > 0
points_cam = points_cam[:, mask]
# Project to image plane
points_img = np.dot(cam_intrinsic, points_cam[:3, :])
points_img = points_img / points_img[2, :]
points_img = points_img[:2, :]
# Get depths
depths = points_cam[2, :].copy()
return points_img.T, depths, im
def create_depth_map(points_img, depths, image_shape):
"""
Create a depth map from projected points.
Args:
points_img: Points in image coordinates
depths: Depth values
image_shape: Shape of the image (height, width)
Returns:
depth_map: Depth map as a 2D numpy array
"""
depth_map = np.zeros((image_shape[0], image_shape[1]))
# Keep only points that fall within the image
mask = np.logical_and.reduce(
[
points_img[:, 0] >= 0,
points_img[:, 0] < image_shape[1],
points_img[:, 1] >= 0,
points_img[:, 1] < image_shape[0],
]
)
points_img = points_img[mask]
depths = depths[mask]
# Convert to integers for indexing
points_int = np.floor(points_img).astype(np.int32)
# Populate depth map
for i in range(points_int.shape[0]):
x, y = points_int[i, 0], points_int[i, 1]
if depth_map[y, x] == 0 or depths[i] < depth_map[y, x]:
depth_map[y, x] = depths[i]
return depth_map
CAMERA_NAMES = [
"CAM_FRONT",
"CAM_FRONT_RIGHT",
"CAM_BACK_RIGHT",
"CAM_BACK",
"CAM_BACK_LEFT",
"CAM_FRONT_LEFT",
]
def process_sample(sample_idx, output_folder, camera_name):
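"""
Process a single sample from the nuScenes dataset.
Args:
sample_idx: Index of the sample
output_folder: Folder to save the image
camera_name: Camera channel to use, e.g. "CAM_FRONT"
Returns:
data_dict: Image filename, intrinsics, sampled pixel coordinates and depths
"""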
data_dict = {}
# Get sample
sample = nusc.sample[sample_idx]
# Get camera sample data
camera_token = sample["data"][camera_name]
# camera_data = nusc.get("sample_data", camera_token)
# Get LiDAR sample data
lidar_token = sample["data"]["LIDAR_TOP"]
lidar_data = nusc.get("sample_data", lidar_token)
# Map pointcloud to image and create depth map
points_img, depths, image = map_pointcloud_to_image(lidar_data, camera_token)
depth_map = create_depth_map(points_img, depths, (image.shape[0], image.shape[1]))
# Read out camera intrinsic information
cam_intrinsic = nusc.get(
"calibrated_sensor",
nusc.get("sample_data", camera_token)["calibrated_sensor_token"],
)["camera_intrinsic"]
print("Camera intrinsic matrix for", camera_name, ":", cam_intrinsic)
# Print indices of the depth map that are not 0
non_zero_indices = np.argwhere(depth_map != 0)
print("Non-zero depth map indices:", non_zero_indices)
# Print the size of the depth map and image
print("Depth map size:", depth_map.shape)
print("Image size:", image.shape)
# Save image and depth map to output folder
img_filename = f"{sample_idx:06d}_{camera_name}_image.jpg"
cv2.imwrite(os.path.join(output_folder, img_filename), image)
data_dict["image"] = img_filename
data_dict["intrinsics"] = [
cam_intrinsic[0][0],
cam_intrinsic[1][1],
cam_intrinsic[0][2],
cam_intrinsic[1][2],
image.shape[1], # Image width
image.shape[0], # Image height
]
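# Randomly sample up to 100 pixels with non-zero depth map values
# (np.random.choice raises ValueError if asked for more than are available)
non_zero_indices = np.argwhere(depth_map != 0)
num_points = min(100, non_zero_indices.shape[0])
sampled_indices = non_zero_indices[
np.random.choice(non_zero_indices.shape[0], num_points, replace=False)
]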
# Store their pixel coordinates and depth values into lists
data_dict["pixel_coords"] = [[int(x), int(y)] for y, x in sampled_indices]
data_dict["depth"] = [depth_map[y, x] for y, x in sampled_indices]
return data_dict
def process_multiple_samples(
num_samples=5, output_folder="output", json_path="test.json", is_val=False
):
"""
Process multiple samples from the dataset.
Args:
num_samples: Number of samples to process
output_folder: Folder to save the images and depth maps
"""
# Check if the output folder exists and delete it if it does
if os.path.exists(output_folder):
import shutil
shutil.rmtree(output_folder)
if not os.path.exists(output_folder):
os.makedirs(output_folder)
with open(json_path, "w") as f:
if num_samples == -1:
line_count = 0
sample_range = (
range(int(len(nusc.sample) * 0.95))
if not is_val
else range(int(len(nusc.sample) * 0.95), len(nusc.sample))
)
print("sample_range = ", sample_range)
for i in sample_range:
print(f"Processing sample {i}")
for camera_name in CAMERA_NAMES:
entry = process_sample(i, output_folder, camera_name)
# Save meta_data_json to a JSON Lines file
json.dump(entry, f)
f.write("\n")
line_count += 1
print(f"Total lines processed: {line_count}")
else:
for i in np.random.choice(
len(nusc.sample), min(num_samples, len(nusc.sample)), replace=False
):
print(f"Processing sample {i}")
camera_name = np.random.choice(CAMERA_NAMES)
entry = process_sample(i, output_folder, camera_name)
# Save meta_data_json to a JSON Lines file
json.dump(entry, f)
f.write("\n")
# Process the validation range (last 5% of samples) for all cameras
process_multiple_samples(
num_samples=-1,
output_folder=args.out_image_dir,
json_path=args.out_json_path,
is_val=True,
)
```
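`create_depth_map` fills the sparse depth map with a Python loop that keeps the nearest point per pixel. The same rule can be sketched in vectorized form with NumPy's unbuffered `np.minimum.at`. The helper name `scatter_min_depth` and the toy inputs are assumptions for illustration, and the sketch assumes all depths are strictly positive (the script already discards points behind the camera).

```python
import numpy as np


def scatter_min_depth(points_int, depths, image_shape):
    """Build a sparse depth map that keeps the nearest depth per pixel.

    points_int: (N, 2) integer pixel coordinates as (x, y), already inside the image
    depths: (N,) strictly positive depth values
    image_shape: (height, width)
    """
    depth_map = np.full(image_shape, np.inf, dtype=np.float64)
    # np.minimum.at is unbuffered, so repeated pixel indices are all applied.
    np.minimum.at(depth_map, (points_int[:, 1], points_int[:, 0]), depths)
    # Pixels that received no point are marked with 0, as in the loop version.
    depth_map[np.isinf(depth_map)] = 0.0
    return depth_map


# Two points land on pixel (x=1, y=0); the nearer one (3.0) should win.
pts = np.array([[1, 0], [1, 0], [2, 1]])
d = np.array([5.0, 3.0, 7.0])
dm = scatter_min_depth(pts, d, (2, 3))
```

For a full LiDAR sweep this avoids one Python-level iteration per point, at the cost of a temporary array filled with `inf`.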
## /utils/curate_nuscenes_train.py
```py path="/utils/curate_nuscenes_train.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import json, os
import cv2
import numpy as np
from nuscenes.nuscenes import NuScenes
from nuscenes.utils.data_classes import LidarPointCloud
from PIL import Image, ImageDraw
from pyquaternion import Quaternion
import argparse
# Set up argument parser
parser = argparse.ArgumentParser(description="Process some files.")
parser.add_argument("--dataroot", type=str, default="/home/czptc2h/datasets/nuscenes_full", help="data root")
parser.add_argument("--dataroot_mini", type=str, default="/home/czptc2h/datasets/nuscenes", help="data root mini")
parser.add_argument(
"--out_json_path", type=str, help="output jsonl path"
)
parser.add_argument(
"--out_image_dir", type=str, help="output image folder"
)
args = parser.parse_args()
# Initialize the NuScenes dataset
dataroot = args.dataroot
version = "v1.0-trainval" # Full trainval split
nusc = NuScenes(version=version, dataroot=dataroot, verbose=True)
dataroot_mini = args.dataroot_mini
version_mini = "v1.0-mini" # Using mini version
nusc_mini = NuScenes(version=version_mini, dataroot=dataroot_mini, verbose=True)
# Collect the scene tokens of nusc_mini so that those scenes can be skipped below
nusc_mini_scene_tokens = sorted([scene["token"] for scene in nusc_mini.scene])
def map_pointcloud_to_image(pointcloud, camera_token):
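"""
Map a LiDAR point cloud to the image plane.
Args:
pointcloud: sample_data record (dict) of the LiDAR sweep
camera_token: Token of the camera sample data
Returns:
points_img: (N, 2) points in image coordinates
depths: (N,) depth values along the camera z-axis
im: Camera image as loaded by cv2.imread (BGR)
"""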
cam = nusc.get("sample_data", camera_token)
cam_path = os.path.join(nusc.dataroot, cam["filename"])
im = cv2.imread(cam_path)
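# Get sensor calibration data (nuScenes calibrated_sensor poses are relative to the ego vehicle, so "world" below means the ego frame)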
lidar_to_world = nusc.get(
"calibrated_sensor", pointcloud["calibrated_sensor_token"]
)
lidar_rotation = Quaternion(lidar_to_world["rotation"])
lidar_translation = np.array(lidar_to_world["translation"])
cam_to_world = nusc.get("calibrated_sensor", cam["calibrated_sensor_token"])
cam_intrinsic = np.array(cam_to_world["camera_intrinsic"])
cam_rotation = Quaternion(cam_to_world["rotation"])
cam_translation = np.array(cam_to_world["translation"])
# Transform points from lidar to world coordinate
pc = LidarPointCloud.from_file(os.path.join(nusc.dataroot, pointcloud["filename"]))
points = pc.points[:3, :]
points = np.vstack((points, np.ones(points.shape[1])))
# Transformation matrix from lidar to world coordinate
lidar_to_world_matrix = np.eye(4)
lidar_to_world_matrix[:3, :3] = lidar_rotation.rotation_matrix
lidar_to_world_matrix[:3, 3] = lidar_translation
# Transformation matrix from world to camera coordinate
world_to_cam_matrix = np.eye(4)
world_to_cam_matrix[:3, :3] = cam_rotation.rotation_matrix.T
world_to_cam_matrix[:3, 3] = -np.dot(
cam_rotation.rotation_matrix.T, cam_translation
)
# Transform points to camera coordinate
points_cam = np.dot(world_to_cam_matrix, np.dot(lidar_to_world_matrix, points))
# Only keep points in front of the camera
mask = points_cam[2, :] > 0
points_cam = points_cam[:, mask]
# Project to image plane
points_img = np.dot(cam_intrinsic, points_cam[:3, :])
points_img = points_img / points_img[2, :]
points_img = points_img[:2, :]
# Get depths
depths = points_cam[2, :].copy()
return points_img.T, depths, im
def create_depth_map(points_img, depths, image_shape):
"""
Create a depth map from projected points.
Args:
points_img: Points in image coordinates
depths: Depth values
image_shape: Shape of the image (height, width)
Returns:
depth_map: Depth map as a 2D numpy array
"""
depth_map = np.zeros((image_shape[0], image_shape[1]))
# Keep only points that fall within the image
mask = np.logical_and.reduce(
[
points_img[:, 0] >= 0,
points_img[:, 0] < image_shape[1],
points_img[:, 1] >= 0,
points_img[:, 1] < image_shape[0],
]
)
points_img = points_img[mask]
depths = depths[mask]
# Convert to integers for indexing
points_int = np.floor(points_img).astype(np.int32)
# Populate depth map
for i in range(points_int.shape[0]):
x, y = points_int[i, 0], points_int[i, 1]
if depth_map[y, x] == 0 or depths[i] < depth_map[y, x]:
depth_map[y, x] = depths[i]
return depth_map
CAMERA_NAMES = [
"CAM_FRONT",
"CAM_FRONT_RIGHT",
"CAM_BACK_RIGHT",
"CAM_BACK",
"CAM_BACK_LEFT",
"CAM_FRONT_LEFT",
]
def process_sample(sample_idx, output_folder, camera_name):
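"""
Process a single sample from the nuScenes dataset.
Args:
sample_idx: Index of the sample
output_folder: Folder to save the image
camera_name: Camera channel to use, e.g. "CAM_FRONT"
Returns:
data_dict: Image filename, intrinsics, sampled pixel coordinates and depths
"""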
data_dict = {}
# Get sample
sample = nusc.sample[sample_idx]
# Get camera sample data
camera_token = sample["data"][camera_name]
# camera_data = nusc.get("sample_data", camera_token)
# Get LiDAR sample data
lidar_token = sample["data"]["LIDAR_TOP"]
lidar_data = nusc.get("sample_data", lidar_token)
# Map pointcloud to image and create depth map
points_img, depths, image = map_pointcloud_to_image(lidar_data, camera_token)
depth_map = create_depth_map(points_img, depths, (image.shape[0], image.shape[1]))
# Read out camera intrinsic information
cam_intrinsic = nusc.get(
"calibrated_sensor",
nusc.get("sample_data", camera_token)["calibrated_sensor_token"],
)["camera_intrinsic"]
non_zero_indices = np.argwhere(depth_map != 0)
# Save image and depth map to output folder
img_filename = f"{sample_idx:06d}_{camera_name}_image.jpg"
# depth_filename = f"{sample_idx:06d}_depth.png"
cv2.imwrite(os.path.join(output_folder, img_filename), image)
data_dict["image"] = img_filename
data_dict["intrinsics"] = [
cam_intrinsic[0][0],
cam_intrinsic[1][1],
cam_intrinsic[0][2],
cam_intrinsic[1][2],
image.shape[1], # Image width
image.shape[0], # Image height
]
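# Randomly sample up to 100 pixels with non-zero depth map values
# (np.random.choice raises ValueError if asked for more than are available)
non_zero_indices = np.argwhere(depth_map != 0)
num_points = min(100, non_zero_indices.shape[0])
sampled_indices = non_zero_indices[
np.random.choice(non_zero_indices.shape[0], num_points, replace=False)
]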
# Store their pixel coordinates and depth values into lists
data_dict["pixel_coords"] = [[int(x), int(y)] for y, x in sampled_indices]
data_dict["depth"] = [depth_map[y, x] for y, x in sampled_indices]
return data_dict
def process_multiple_samples(
num_samples=5, output_folder="output", json_path="test.json"
):
"""
Process multiple samples from the dataset.
Args:
num_samples: Number of samples to process
output_folder: Folder to save the images and depth maps
"""
# Check if the output folder exists and delete it if it does
if os.path.exists(output_folder):
import shutil
shutil.rmtree(output_folder)
if not os.path.exists(output_folder):
os.makedirs(output_folder)
with open(json_path, "w") as f:
if num_samples == -1:
line_count = 0
sample_range = range(len(nusc.sample))
print("sample_range = ", sample_range)
for i in sample_range:
# breakpoint()
if nusc.sample[i]["scene_token"] in nusc_mini_scene_tokens:
print("Skipping sample ", i, " as it is in nusc_mini")
continue
if i % 1000 == 0:
print(f"Processing sample {i}")
for camera_name in CAMERA_NAMES:
entry = process_sample(i, output_folder, camera_name)
# Save meta_data_json to a JSON Lines file
json.dump(entry, f)
f.write("\n")
line_count += 1
print(f"Total lines processed: {line_count}")
else:
for i in np.random.choice(
len(nusc.sample), min(num_samples, len(nusc.sample)), replace=False
):
print(f"Processing sample {i}")
camera_name = np.random.choice(CAMERA_NAMES)
entry = process_sample(i, output_folder, camera_name)
# Save meta_data_json to a JSON Lines file
json.dump(entry, f)
f.write("\n")
# Process all samples (except those in v1.0-mini scenes) for all cameras
process_multiple_samples(
num_samples=-1,
output_folder=args.out_image_dir,
json_path=args.out_json_path,
)
```
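`map_pointcloud_to_image` inverts the camera pose by hand: the rotation block is transposed and the translation becomes `-R^T t`. A small sketch of that inversion for a generic 4x4 rigid transform is given below, using NumPy only. `invert_rigid` and the example pose are hypothetical and serve only to check the identity.

```python
import numpy as np


def invert_rigid(transform):
    """Invert a 4x4 rigid transform [R | t; 0 0 0 1] without a general matrix inverse."""
    rotation = transform[:3, :3]
    translation = transform[:3, 3]
    inverse = np.eye(4)
    inverse[:3, :3] = rotation.T
    inverse[:3, 3] = -rotation.T @ translation
    return inverse


# Hypothetical pose: 90 degree rotation about the z-axis plus a translation.
pose = np.eye(4)
pose[:3, :3] = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
pose[:3, 3] = [1.0, 2.0, 3.0]

pose_inv = invert_rigid(pose)
```

The closed form is only valid when the upper-left block is a true rotation matrix, which holds for matrices obtained from unit quaternions.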
## /utils/curate_scannet.py
```py path="/utils/curate_scannet.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""
Curate ScanNet++ iPhone data.
For each scene: extract RGB frames, masks and depth maps,
sample pixels inside the mask on every sample_interval-th frame,
and write their coordinates and distances to a JSONL file.
"""
import argparse
import json
import os
import shutil
import subprocess
import sys
import zlib
from pathlib import Path
import imageio as iio
import lz4.block
import numpy as np
import yaml
from common.scene_release import ScannetppScene_Release
from common.utils.utils import load_json, load_yaml_munch, read_txt_list, run_command
from munch import Munch
from tqdm import tqdm
def extract_rgb(scene, w=512, h=384):
scene.iphone_rgb_dir.mkdir(parents=True, exist_ok=True)
cmd = f"ffmpeg -i {scene.iphone_video_path} -vf scale={w}:{h} -start_number 0 -q:v 1 {scene.iphone_rgb_dir}/frame_%06d.jpg"
return run_command(cmd, verbose=True, exit_on_error=False)
def extract_masks(scene, w=512, h=384):
scene.iphone_video_mask_dir.mkdir(parents=True, exist_ok=True)
cmd = f"ffmpeg -i {str(scene.iphone_video_mask_path)} -pix_fmt gray -vf scale={w}:{h} -start_number 0 {scene.iphone_video_mask_dir}/frame_%06d.png"
return run_command(cmd, verbose=True, exit_on_error=False)
def extract_depth(scene):
# global compression with zlib
height, width = 192, 256
sample_rate = 1
scene.iphone_depth_dir.mkdir(parents=True, exist_ok=True)
try:
with open(scene.iphone_depth_path, "rb") as infile:
data = infile.read()
data = zlib.decompress(data, wbits=-zlib.MAX_WBITS)
depth = np.frombuffer(data, dtype=np.float32).reshape(-1, height, width)
for frame_id in tqdm(
range(0, depth.shape[0], sample_rate), desc="decode_depth"
):
iio.imwrite(
f"{scene.iphone_depth_dir}/frame_{frame_id:06}.png",
(depth[frame_id] * 1000).astype(np.uint16), # write one frame, not the whole stack
)
# per frame compression with lz4/zlib
except Exception:
frame_id = 0
with open(scene.iphone_depth_path, "rb") as infile:
while True:
size = infile.read(4) # 32-bit integer
if len(size) == 0:
break
size = int.from_bytes(size, byteorder="little")
if frame_id % sample_rate != 0:
infile.seek(size, 1)
frame_id += 1
continue
# read the compressed bytes of this frame
data = infile.read(size)
try:
# try using lz4
data = lz4.block.decompress(
data, uncompressed_size=height * width * 2
) # UInt16 = 2bytes
depth = np.frombuffer(data, dtype=np.uint16).reshape(height, width)
except Exception:
# try using zlib
data = zlib.decompress(data, wbits=-zlib.MAX_WBITS)
depth = np.frombuffer(data, dtype=np.float32).reshape(height, width)
depth = (depth * 1000).astype(np.uint16)
# 6 digit frame id = 277 minute video at 60 fps
iio.imwrite(f"{scene.iphone_depth_dir}/frame_{frame_id:06}.png", depth)
frame_id += 1
def main(args):
cfg = load_yaml_munch(args.config_file)
# get the scenes to process, specify any one
if cfg.get("scene_list_file"):
scene_ids = read_txt_list(cfg.scene_list_file)
elif cfg.get("scene_ids"):
scene_ids = cfg.scene_ids
elif cfg.get("splits"):
scene_ids = []
# Read only the immediate level subfolders of cfg.data_root as scene_ids
scene_ids = [
f
for f in os.listdir(cfg.data_root + "data/")
if os.path.isdir(os.path.join(cfg.data_root + "data/", f))
]
print("Scene IDs:", scene_ids)
print("Number of scenes:", len(scene_ids))
output_dir = "/home/czptc2h/datasets/scannet_pp/out_images"
output_dir_json = (
"/home/czptc2h/datasets/scannet_pp/scannet_depth_instructions.jsonl"
)
# Create the output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)
sample_interval = 10
points_per_frame = 100
image_width = 1280 # resize first to save memory
image_height = 960
# get the options to process
# go through each scene
# Open a new jsonl file at output_dir_json
with open(output_dir_json, "w") as jsonl_file:
for scene_id in tqdm(scene_ids, desc="scene"):
try:
scene = ScannetppScene_Release(
scene_id, data_root=Path(cfg.data_root) / "data"
)
print(
"cfg.data_root = ",
cfg.data_root,
"scene_id = ",
scene_id,
"scene = ",
scene,
)
# # extract data for the current scene
out_rgb = extract_rgb(scene, image_width, image_height)
if out_rgb.returncode != 0:
print("error during rgb extraction, go to the next scene")
continue
out_mask = extract_masks(scene, image_width, image_height)
if out_mask.returncode != 0:
print("error during mask extraction, go to the next scene")
continue
extract_depth(scene)
# convert data into json
# remove all files in the folders
# Iteratively read all png files under scene.iphone_video_mask_dir
for i, (image_file, mask_file, depth_file) in enumerate(
zip(
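# os.listdir order is arbitrary; sort so frame_XXXXXX files line up across folders
sorted(os.listdir(scene.iphone_rgb_dir)),
sorted(os.listdir(scene.iphone_video_mask_dir)),
sorted(os.listdir(scene.iphone_depth_dir)),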
)
):
if i % sample_interval != 0:
continue
data_dict = {}
data_dict["image"] = (
str(scene.iphone_rgb_dir / image_file)
.replace(cfg.data_root, "")
.replace("data/", "")
)
# Move the image file to the specified output directory
destination_path = Path(output_dir) / data_dict["image"]
destination_path.parent.mkdir(
parents=True, exist_ok=True
) # Ensure the directory exists
shutil.move(
str(scene.iphone_rgb_dir / image_file), destination_path
)
data_dict["intrinsics"] = [
1427.4375 * (image_width / 1920),
1427.4375 * (image_height / 1440),
959.5 * (image_width / 1920),
719.5 * (image_height / 1440),
image_width,
image_height,
] # rescaled intrinsics
if mask_file.endswith(".png"):
mask_path = scene.iphone_video_mask_dir / mask_file
mask_image = iio.imread(mask_path)
# Randomly sample points_per_frame pixels where mask_image is not 0
non_zero_indices = np.argwhere(mask_image != 0)
random_indices = np.random.choice(
non_zero_indices.shape[0],
points_per_frame,
replace=False,
)
sampled_indices = (
non_zero_indices
* np.array([192 / image_height, 256 / image_width])
).astype(int)[random_indices]
sampled_indices_ori = (non_zero_indices).astype(int)[
random_indices
]
data_dict["pixel_coords"] = [
[int(x), int(y)] for y, x in sampled_indices_ori
]
# convert to euclidean distance
if depth_file.endswith(".png"):
depth_path = scene.iphone_depth_dir / depth_file
depth_image = iio.imread(depth_path)
fx, fy, cx, cy, _, _ = data_dict["intrinsics"]
pixel_coords = np.array(data_dict["pixel_coords"])
z = (
depth_image[sampled_indices[:, 0], sampled_indices[:, 1]]
/ 1000.0
)
# Back-project with the pinhole model: x and y scale with the z-depth
x = (pixel_coords[:, 0] - cx) * z / fx
y = (pixel_coords[:, 1] - cy) * z / fy
data_dict["depth"] = np.sqrt(x**2 + y**2 + z**2).tolist()
# Print samples to verify computation
sample_indices = np.random.choice(
len(z), min(5, len(z)), replace=False
)
for idx in sample_indices:
print(
f"Sample {idx}: z = {z[idx]}, pixel_coords = {data_dict['pixel_coords'][idx]}, depth = {data_dict['depth'][idx]}"
)
json.dump(data_dict, jsonl_file)
jsonl_file.write("\n")
# Remove the folder scene.iphone_rgb_dir
shutil.rmtree(scene.iphone_rgb_dir)
shutil.rmtree(scene.iphone_video_mask_dir)
shutil.rmtree(scene.iphone_depth_dir)
except Exception as e:
print(e)
continue
if __name__ == "__main__":
p = argparse.ArgumentParser()
p.add_argument("config_file", help="Path to config file")
args = p.parse_args()
main(args)
```
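The script above rescales the iPhone intrinsics by the ratio between the resized frame and the native 1920x1440 resolution. That arithmetic can be sketched as a small helper; the name `scale_intrinsics` is hypothetical, and the sketch follows the script's convention of scaling the principal point by the same factor as the focal length.

```python
def scale_intrinsics(fx, fy, cx, cy, src_size, dst_size):
    """Rescale pinhole intrinsics from src_size to dst_size, both given as (width, height)."""
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    return [fx * sx, fy * sy, cx * sx, cy * sy, dst_size[0], dst_size[1]]


# Constants taken from the script above: native 1920x1440, resized to 1280x960.
intrinsics = scale_intrinsics(
    1427.4375, 1427.4375, 959.5, 719.5, (1920, 1440), (1280, 960)
)
```

The result has the same `[fx, fy, cx, cy, width, height]` layout that the script stores under `data_dict["intrinsics"]`.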
## /utils/curate_sunRGBD.py
```py path="/utils/curate_sunRGBD.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import json
import os
import shutil
from glob import glob
import numpy as np
from PIL import Image
import argparse
# Set up argument parser
parser = argparse.ArgumentParser(description="Process some files.")
parser.add_argument("--dataroot", type=str, default="/home/czptc2h/datasets/SUNRGBD", help="image dir")
parser.add_argument(
"--out_json_path", type=str, help="output jsonl path"
)
parser.add_argument(
"--out_image_dir", type=str, help="output image folder"
)
args = parser.parse_args()
dataroot = args.dataroot
## Collect the scene directories three levels below <dataroot>/SUNRGBD
scene_dirs = glob(os.path.join(dataroot, "SUNRGBD/*/*/*"))
print("Scene Dirs:", scene_dirs)
out_json_path = args.out_json_path
out_image_path = args.out_image_dir
if os.path.exists(out_image_path):
shutil.rmtree(out_image_path)
os.makedirs(out_image_path)
points_per_image = 100
import json, os
count = 0
with open(out_json_path, "w") as jsonl_file:
for scene_dir in scene_dirs:
data_dict = {}
## Get image file path from scene directory
print("Scene Dir:", scene_dir)
try:
image_path = glob(f"{scene_dir}/image/*")[0]
except:
image_path = glob(f"{scene_dir}/*/*/image/*")[0]
if "NYU" in image_path:
continue
img = Image.open(image_path)
sub_dir = image_path.replace(f"{dataroot}/SUNRGBD/", "")
## Copy the image to the out_image_path directory
os.makedirs(os.path.dirname(out_image_path + "/" + sub_dir), exist_ok=True)
shutil.copy(image_path, out_image_path + "/" + sub_dir)
## Get depth map file path from scene directory
try:
depth_path = glob(f"{scene_dir}/depth_bfx/*")[0]
except:
depth_path = glob(f"{scene_dir}/*/*/depth_bfx/*")[0]
print("Image Path:", image_path, "; Depth Path:", depth_path)
# Replace the last 2 file/folder names in the path of depth_path with "intrinsics.txt"
intrinsic_path = os.path.join(os.path.dirname(os.path.dirname(depth_path)), "intrinsics.txt")
with open(intrinsic_path, "r") as file:
intrinsic_data = file.read().strip().split()
intrinsic_matrix = np.array(intrinsic_data, dtype=np.float32).reshape(
(3, 3)
)
print("Intrinsic Matrix:\n", intrinsic_matrix)
# Read the image from image_path into a PIL image
pil_image = Image.open(image_path)
data_dict["image"] = sub_dir
data_dict["intrinsics"] = [
float(intrinsic_matrix[0, 0]),
float(intrinsic_matrix[1, 1]),
float(intrinsic_matrix[0, 2]),
float(intrinsic_matrix[1, 2]),
] + [pil_image.size[0], pil_image.size[1]]
depth_gt = Image.open(depth_path)
depth_gt = np.asarray(depth_gt, dtype=np.float32)
depth_gt = depth_gt / 10000.0
# Randomly sample points_per_image pixels in depth_gt with value > 0.005 and < 25
valid_pixels = np.argwhere((depth_gt > 0.005) & (depth_gt < 25))
if len(valid_pixels) < points_per_image:
# Skip images without enough valid depth pixels
continue
sampled_indices = np.random.choice(
len(valid_pixels), size=points_per_image, replace=False
)
sampled_pixels = valid_pixels[sampled_indices]
data_dict["pixel_coords"] = sampled_pixels[:, [1, 0]].tolist()
fx, fy, cx, cy = (
intrinsic_matrix[0, 0],
intrinsic_matrix[1, 1],
intrinsic_matrix[0, 2],
intrinsic_matrix[1, 2],
)
z = depth_gt[sampled_pixels[:, 0], sampled_pixels[:, 1]]
x = (sampled_pixels[:, 1] - cx) * z / fx
y = (sampled_pixels[:, 0] - cy) * z / fy
euclidean_distances = np.sqrt(x**2 + y**2 + z**2)
data_dict["depth"] = euclidean_distances.tolist()
print("PIL Image Size:", pil_image.size)
print("Depth GT Size:", depth_gt.shape)
print("Data Dictionary:", data_dict)
json.dump(data_dict, jsonl_file)
jsonl_file.write("\n")
count += 1
print(f"processed {count} images")
```
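Several of these curation scripts draw a fixed number of pixels with `np.random.choice(..., replace=False)`, which raises `ValueError` when fewer valid pixels exist than requested. A guarded variant might look like the sketch below. The helper `sample_valid_pixels`, its default thresholds (copied from the SUN RGB-D script) and the toy depth map are assumptions for illustration.

```python
import numpy as np


def sample_valid_pixels(depth, n_points, lo=0.005, hi=25.0, rng=None):
    """Return n_points pixels as [x, y] whose depth lies in (lo, hi),
    or None when the image has fewer than n_points valid pixels."""
    rng = np.random.default_rng() if rng is None else rng
    valid = np.argwhere((depth > lo) & (depth < hi))  # rows are (y, x)
    if len(valid) < n_points:
        return None
    picked = valid[rng.choice(len(valid), size=n_points, replace=False)]
    return picked[:, [1, 0]].tolist()  # reorder to (x, y)


# Toy depth map with exactly two valid pixels.
depth = np.zeros((4, 4), dtype=np.float32)
depth[1, 2] = 1.5
depth[3, 0] = 2.0

too_few = sample_valid_pixels(depth, 3, rng=np.random.default_rng(0))
both = sample_valid_pixels(depth, 2, rng=np.random.default_rng(0))
```

Returning `None` lets the caller skip the image explicitly instead of relying on an exception handler further up.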
## /utils/curate_taskonomy
``` path="/utils/curate_taskonomy"
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import os
import shutil
import json
import argparse
import numpy as np
from glob import glob
from PIL import Image
def main():
parser = argparse.ArgumentParser(description="Process Taskonomy dataset.")
parser.add_argument("--dataroot", type=str, required=True, help="Path to Taskonomy fullplus root directory")
parser.add_argument("--out_json_path", type=str, required=True, help="Output JSONL path")
parser.add_argument("--out_image_dir", type=str, required=True, help="Output RGB image folder")
parser.add_argument("--points_per_image", type=int, default=3, help="Number of depth points to sample")
args = parser.parse_args()
dataroot = args.dataroot
out_json_path = args.out_json_path
out_image_path = args.out_image_dir
points_per_image = args.points_per_image
print(f"Scanning Taskonomy RGB files in {dataroot}...")
# Taskonomy standard naming is often something like <building_name>/rgb/<building_name>_..._rgb.png
rgb_files = glob(os.path.join(dataroot, "**", "*_rgb.png"), recursive=True)
if not rgb_files:
# Fall back to compressed WebP images
rgb_files = glob(os.path.join(dataroot, "**", "*_rgb.webp"), recursive=True)
print(f"Found {len(rgb_files)} RGB files.")
if os.path.exists(out_image_path):
shutil.rmtree(out_image_path)
os.makedirs(out_image_path)
count = 0
with open(out_json_path, "w") as jsonl_file:
for image_path in rgb_files:
# Deduce depth path by replacing 'rgb' string identifier with 'depth_zbuffer'
depth_path = image_path.replace("rgb", "depth_zbuffer")
if not os.path.exists(depth_path):
# Try finding depth_euclidean if zbuffer is missing
depth_path = image_path.replace("rgb", "depth_euclidean")
is_euclidean = True
if not os.path.exists(depth_path):
continue
else:
is_euclidean = False
# Create output subdirectories
sub_dir = os.path.relpath(image_path, dataroot)
dest_image_path = os.path.join(out_image_path, sub_dir)
os.makedirs(os.path.dirname(dest_image_path), exist_ok=True)
shutil.copy(image_path, dest_image_path)
data_dict = {}
data_dict["image"] = sub_dir
pil_image = Image.open(image_path)
W, H = pil_image.size
# Assumes a 90 degree FOV camera: f = size / (2 * tan(FOV / 2)) = size / 2.
# fx and fy coincide for square images (W == H).
fx = W / 2.0
fy = H / 2.0
cx = W / 2.0
cy = H / 2.0
data_dict["intrinsics"] = [fx, fy, cx, cy, W, H]
# Load 16-bit Depth
depth_img = Image.open(depth_path)
depth_arr = np.asarray(depth_img, dtype=np.float32)
# Taskonomy 16-bit depth is scaled; the conversion to meters used here is pixel_value / 512.0
depth_arr = depth_arr / 512.0
# Filter valid depth pixels (0.01m to 120m)
valid_pixels = np.argwhere((depth_arr > 0.01) & (depth_arr < 120.0))
if len(valid_pixels) < points_per_image:
# Skip images that don't have enough valid depth pixels
continue
sampled_indices = np.random.choice(
len(valid_pixels), size=points_per_image, replace=False
)
sampled_pixels = valid_pixels[sampled_indices]
# [y, x] -> [x, y] to align with [u, v]
data_dict["pixel_coords"] = sampled_pixels[:, [1, 0]].tolist()
# Retrieve Z depth
z = depth_arr[sampled_pixels[:, 0], sampled_pixels[:, 1]]
if is_euclidean:
# If the dataset specifically provides depth_euclidean, no trigonometry needed
euclidean_distances = z
else:
# Calculate Euclidean distances from Z-buffer
x = (sampled_pixels[:, 1] - cx) * z / fx
y = (sampled_pixels[:, 0] - cy) * z / fy
euclidean_distances = np.sqrt(x**2 + y**2 + z**2)
data_dict["depth"] = euclidean_distances.tolist()
json.dump(data_dict, jsonl_file)
jsonl_file.write("\n")
count += 1
if count % 1000 == 0:
print(f"Processed {count} valid images...")
print(f"Taskonomy curation complete! {count} total files written to {out_json_path}")
if __name__ == "__main__":
main()
```
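The Taskonomy script derives its focal length from an assumed 90 degree field of view, where `size / (2 * tan(FOV / 2))` reduces to `size / 2`. The general relation can be sketched as follows; `focal_from_fov` is a hypothetical helper, and the 512 px image size is only an example value.

```python
import math


def focal_from_fov(size_px, fov_deg):
    """Focal length in pixels for an image extent of size_px pixels
    and a field of view of fov_deg degrees along that extent."""
    return size_px / (2.0 * math.tan(math.radians(fov_deg) / 2.0))


f_90 = focal_from_fov(512, 90.0)  # the 90 degree case used by the script
f_60 = focal_from_fov(512, 60.0)  # a narrower FOV gives a longer focal length
```

If per-image camera metadata is available, passing the actual field of view here would replace the fixed 90 degree assumption.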
## /utils/curate_waymo.py
```py path="/utils/curate_waymo.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import io, warnings
from typing import Optional
# Suppress FutureWarnings raised by PyArrow, which is used under the hood.
warnings.simplefilter(action="ignore", category=FutureWarning)
import argparse
import random
import dask.dataframe as dd
import numpy as np
import tensorflow as tf
from PIL import Image
from waymo_open_dataset import v2
from waymo_open_dataset.utils import range_image_utils
from waymo_open_dataset.v2.perception.utils import lidar_utils
# Set up argument parser
parser = argparse.ArgumentParser(description="Process some files.")
parser.add_argument("--dataset_dir", type=str, default="/home/czptc2h/datasets/waymo/training/", help="waymo")
parser.add_argument(
"--out_json_path", type=str, required=True, help="output jsonl path"
)
parser.add_argument(
"--out_image_dir", type=str, required=True, help="output image folder"
)
args = parser.parse_args()
# Path to the directory with all components
dataset_dir = args.dataset_dir
# List all parquet files in the "camera_image" directory and extract their names without extensions
camera_image_dir = f"{dataset_dir}/camera_image"
import os
parquet_files = [
os.path.join(camera_image_dir, file)
for file in os.listdir(camera_image_dir)
if file.endswith(".parquet")
]
filenames = [
os.path.splitext(file)[0]
for file in os.listdir(camera_image_dir)
if file.endswith(".parquet")
]
# change these to process a subset of the data
start_file = 0
end_file = len(filenames)
print(f"Found {len(filenames)} files in {camera_image_dir}")
# Print some samples of filenames
sample_size = min(
5, len(filenames)
) # Print up to 5 samples or less if fewer files exist
print("Sample filenames:", filenames[:sample_size])
def read(tag: str, context_name: str) -> dd.DataFrame:
"""Creates a Dask DataFrame for the component specified by its tag."""
paths = tf.io.gfile.glob(f"{dataset_dir}/{tag}/{context_name}.parquet")
return dd.read_parquet(paths)
out_json_path = args.out_json_path
# Create the directory for out_json_path if it doesn't exist
os.makedirs(os.path.dirname(out_json_path) or ".", exist_ok=True)
out_image_path = args.out_image_dir
import shutil
if os.path.exists(out_image_path):
shutil.rmtree(out_image_path)
os.makedirs(out_image_path)
points_per_image = 100
import json
import cv2
def undistort_image(pil_image, intrinsic, pixel_coordinates):
# Convert PIL image to OpenCV image
cv_image = np.array(pil_image)
# Define the camera intrinsic parameters
fx = intrinsic.f_u
fy = intrinsic.f_v
cx = intrinsic.c_u
cy = intrinsic.c_v
k1 = intrinsic.k1
k2 = intrinsic.k2
p1 = intrinsic.p1
p2 = intrinsic.p2
k3 = intrinsic.k3
# Create a camera intrinsic matrix
K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]])
# Create a distortion coefficients vector
dist_coeffs = np.array([k1, k2, p1, p2, k3])
# Get the image dimensions
h, w = cv_image.shape[:2]
# Create a new camera intrinsic matrix with the distortion removed
new_K, _ = cv2.getOptimalNewCameraMatrix(K, dist_coeffs, (w, h), 1, (w, h))
new_K[0, 0] = fx # Set fx
new_K[1, 1] = fy # Set fy
new_K[0, 2] = w / 2 # Set cx to be at the center
new_K[1, 2] = h / 2 # Set cy to be at the center
# Undistort the image
map_x, map_y = cv2.initUndistortRectifyMap(K, dist_coeffs, None, new_K, (w, h), 5)
undistorted_image = cv2.remap(cv_image, map_x, map_y, cv2.INTER_LINEAR)
# Convert the undistorted image back to a PIL image
undistorted_pil_image = Image.fromarray(undistorted_image)
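# Map the labeled pixels from the distorted image into the undistorted image.
# The remap tables above go from undistorted to distorted pixels, so the
# forward mapping is computed with cv2.undistortPoints instead of a table lookup.
points = np.array(pixel_coordinates, dtype=np.float32).reshape(-1, 1, 2)
undistorted_points = (
cv2.undistortPoints(points, K, dist_coeffs, P=new_K).reshape(-1, 2)
if len(points) > 0
else np.empty((0, 2), dtype=np.float32)
)
undistorted_pixel_coordinates = [
(int(u), int(v)) for u, v in undistorted_points
]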
return undistorted_pil_image, new_K, undistorted_pixel_coordinates
count = 0
with open(out_json_path, "w") as jsonl_file:
for filename in filenames[start_file:end_file]:
# Load every component table needed for this segment, then join them
lidar = read("lidar", filename)
lidar_calib = read("lidar_calibration", filename)
camera_calib = read("camera_calibration", filename)
lidar_pose = read("lidar_pose", filename)
vehicle_pose = read("vehicle_pose", filename)
cam_img = read("camera_image", filename)
lidar_camera_projection = read("lidar_camera_projection", filename)
df = v2.merge(lidar_calib, lidar)
df = v2.merge(df, lidar_camera_projection)
df = v2.merge(df, lidar_pose)
df = v2.merge(df, vehicle_pose)
df = v2.merge(df, camera_calib)
df = v2.merge(df, cam_img)
for _, row in df.iterrows():
# Create all component objects
lidar = v2.LiDARComponent.from_dict(row)
lidar_calib = v2.LiDARCalibrationComponent.from_dict(row)
camera_calib = v2.CameraCalibrationComponent.from_dict(row)
lidar_pose = v2.LiDARPoseComponent.from_dict(row)
vehicle_pose = v2.VehiclePoseComponent.from_dict(row)
camera_image = v2.CameraImageComponent.from_dict(row)
lidar_cam_proj = v2.LiDARCameraProjectionComponent.from_dict(row)
range_image_cartesian = lidar_utils.convert_range_image_to_cartesian(
range_image=lidar.range_image_return1,
calibration=lidar_calib,
pixel_pose=lidar_pose.range_image_return1,
frame_pose=vehicle_pose,
)
extrinsic = np.reshape(camera_calib.extrinsic.transform, [1, 4, 4]).astype(
np.float32
)
camera_image_size = (camera_calib.height, camera_calib.width)
ric_shape = range_image_cartesian.shape
ric = np.reshape(
range_image_cartesian, [1, ric_shape[0], ric_shape[1], ric_shape[2]]
)
cp = lidar_cam_proj.range_image_return1
cp_tensor = tf.reshape(tf.convert_to_tensor(value=cp.values), cp.shape)
cp_shape = cp_tensor.shape
cp_tensor = np.reshape(
cp_tensor, [1, cp_shape[0], cp_shape[1], cp_shape[2]]
)
depth_image = range_image_utils.build_camera_depth_image(
ric,
extrinsic,
cp_tensor,
list(camera_image_size),
camera_image.key.camera_name,
)
# Convert depth_image to a numpy array
depth_image_np = depth_image.numpy().squeeze(axis=0)
# Find pixels with a valid (non-zero) depth in the depth image
non_zero_indices = np.nonzero(depth_image_np)
# Extract the pixel coordinates and their corresponding depth values
pixel_coordinates = list(zip(non_zero_indices[0], non_zero_indices[1]))
depth_values = depth_image_np[non_zero_indices]
data_dict = {}
data_dict["image"] = (
f"{camera_image.key.segment_context_name}/{camera_image.key.frame_timestamp_micros}_{camera_image.key.camera_name}.jpg"
)
sample_size = min(2 * points_per_image, len(pixel_coordinates))
sample_indices = random.sample(range(len(pixel_coordinates)), sample_size)
data_dict["pixel_coords"] = [
list(reversed(pixel_coordinates[i])) for i in sample_indices
]
data_dict["depth"] = [float(depth_values[i]) for i in sample_indices]
image_filename = os.path.join(
out_image_path,
data_dict["image"],
)
pil_image = Image.open(io.BytesIO(camera_image.image))
undistorted_pil_image, new_K, data_dict["pixel_coords"] = undistort_image(
pil_image, camera_calib.intrinsic, data_dict["pixel_coords"]
)
# Check if the fx value in new_K is greater than 1000
if new_K[0, 0] > 1000:
# Calculate the scaling factor to make fx equal to 1000
scale_factor = 1000 / new_K[0, 0]
# Rescale the undistorted_pil_image
new_width = int(undistorted_pil_image.width * scale_factor)
new_height = int(undistorted_pil_image.height * scale_factor)
undistorted_pil_image = undistorted_pil_image.resize(
(new_width, new_height), Image.LANCZOS  # Image.ANTIALIAS was removed in Pillow 10
)
# Rescale the new_K matrix
new_K[0, 0] *= scale_factor
new_K[1, 1] *= scale_factor
new_K[0, 2] *= scale_factor
new_K[1, 2] *= scale_factor
# Rescale the pixel coordinates
data_dict["pixel_coords"] = [
(int(x * scale_factor), int(y * scale_factor))
for x, y in data_dict["pixel_coords"]
]
data_dict["intrinsics"] = [
new_K[0, 0],
new_K[1, 1],
new_K[0, 2],
new_K[1, 2],
undistorted_pil_image.width,
undistorted_pil_image.height,
]
# Filter pixel coordinates and corresponding depth values
valid_pixel_coords = []
valid_depths = []
for coord, depth in zip(data_dict["pixel_coords"], data_dict["depth"]):
x, y = coord
if (
0 <= x < undistorted_pil_image.width
and 0 <= y < undistorted_pil_image.height
):
valid_pixel_coords.append(coord)
valid_depths.append(depth)
if len(valid_pixel_coords) == points_per_image:
break
data_dict["pixel_coords"] = valid_pixel_coords
data_dict["depth"] = valid_depths
os.makedirs(os.path.dirname(image_filename), exist_ok=True)
undistorted_pil_image.save(image_filename)
json.dump(data_dict, jsonl_file)
jsonl_file.write("\n")
if count % 100 == 0:
print(f"data_dict[{count}] = ", data_dict)
count += 1
```
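The Waymo script above caps the focal length at 1000 pixels by resizing the image and scaling the intrinsics and the labeled pixel coordinates by the same factor. A minimal sketch of that bookkeeping follows; the function name `rescale_to_focal_length` and the example numbers are assumptions made for illustration, and the image resize itself is left out.

```python
def rescale_to_focal_length(
    fx, fy, cx, cy, width, height, pixel_coords, target_fx=1000.0
):
    """Scale intrinsics, image size and [x, y] labels so that fx <= target_fx.

    Hypothetical helper: only the arithmetic is shown, the caller is
    expected to resize the image to the returned (width, height).
    """
    if fx <= target_fx:
        # Nothing to do, the focal length is already small enough.
        return [fx, fy, cx, cy, width, height], list(pixel_coords)
    s = target_fx / fx
    intrinsics = [fx * s, fy * s, cx * s, cy * s, int(width * s), int(height * s)]
    # Labels live in pixel units, so they shrink by the same factor.
    coords = [(int(x * s), int(y * s)) for x, y in pixel_coords]
    return intrinsics, coords


# Made-up camera: fx = 2000 px on a 1920x1280 image.
intrinsics, coords = rescale_to_focal_length(
    2000.0, 2000.0, 960.0, 640.0, 1920, 1280, [(100, 200), (1919, 1279)]
)
```

Because every quantity is multiplied by the same scale, a labeled point keeps its position relative to the principal point, which is what lets the depth labels stay valid after the resize.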
## /utils/datasets.py
```py path="/utils/datasets.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import bisect
import json
import logging
import os
import random
from io import StringIO
from typing import Any
import cv2
import numpy as np
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset
logger: logging.Logger = logging.getLogger()
logger.setLevel(logging.INFO)
# Unified prompt usable for both SFT and GRPO. Our method is not sensitive to the prompt, so you can adjust it freely.
def generate_prompt_depth_sft(
depth,
is_eval=False,
):
problem = "Given this image, how far is the point pointed by the red arrow from the camera? Output the thinking process in <think> </think> and final answer (the meter number only, without the unit) in <answer> </answer> tags."
thinking = (
f"<think> The point is around {depth:.2f} meters away from the camera. </think>"
)
if is_eval:
solution = f"<answer> {depth} </answer>"
else:
solution = f"<answer> {depth:.2f} </answer>"
return problem, thinking, solution
# ####################### handle camera ambiguities ################################
def undistort_image(intrinsics: list, image: Image):
# Check if fx and fy are not the same
if abs(intrinsics[0] - intrinsics[1]) > 1e-3:
# Convert PIL image to numpy array
image_np = np.array(image)
# Create camera matrix from intrinsics
camera_matrix = np.array(
[
[intrinsics[0], 0, intrinsics[2]],
[0, intrinsics[1], intrinsics[3]],
[0, 0, 1],
]
)
# Assume no distortion coefficients
dist_coeffs = np.zeros((4, 1))
# Get optimal new camera matrix
new_camera_matrix, _ = cv2.getOptimalNewCameraMatrix(
camera_matrix,
dist_coeffs,
(image_np.shape[1], image_np.shape[0]),
1,
(image_np.shape[1], image_np.shape[0]),
)
# Undistort the image
undistorted_image_np = cv2.undistort(
image_np, camera_matrix, dist_coeffs, None, new_camera_matrix
)
# Extract [fx, fy, cx, cy] from the new camera matrix
new_intrinsics = [
float(new_camera_matrix[0, 0]),
float(new_camera_matrix[1, 1]),
float(new_camera_matrix[0, 2]),
float(new_camera_matrix[1, 2]),
]
# Convert back to PIL image
return Image.fromarray(undistorted_image_np), new_intrinsics
else:
return image, intrinsics
def normalizing_focal_length(
normalized_focal_length: float, intrinsics: list, image: Image
):
# Calculate the scaling factor for the focal length normalization
scale_factor = normalized_focal_length / intrinsics[0]
# Resize the image according to the scaling factor
new_width = int(image.width * scale_factor)
new_height = int(image.height * scale_factor)
# Update the intrinsics with the normalized focal length
intrinsics = [
intrinsics[0] * scale_factor,
intrinsics[1] * scale_factor,
intrinsics[2] * scale_factor,
intrinsics[3] * scale_factor,
new_width,
new_height,
]
return image.resize((new_width, new_height)), intrinsics
def is_within_range(coord, crop_range):
x, y = coord
left, top, right, bottom = crop_range
return left <= x < right and top <= y < bottom
def adjust_index(
index,
pixel_coords,
):
# Check if the current index is valid
if pixel_coords[index] != [-1, -1]:
return index
# Search for the closest valid index
left = index - 1
right = index + 1
n = len(pixel_coords)
while left >= 0 or right < n:
if left >= 0 and pixel_coords[left] != [-1, -1]:
return left
if right < n and pixel_coords[right] != [-1, -1]:
return right
left -= 1
right += 1
# If no valid index is found, return -1
return -1
class dataset_eval(Dataset):
def __init__(
self,
data_path: str,
image_folder: str,
points_per_image=None,
normalized_focal_length=1000.0, # set to the intrinsics after original resize if needed
) -> None:
super(dataset_eval, self).__init__()
self.normalized_focal_length = normalized_focal_length
print("reading data from ", data_path, "image_folder = ", image_folder)
if ".jsonl" in data_path:
with open(data_path, "r") as f:
json_content = f.read()
self.list_data_dict = pd.read_json(
StringIO(json_content), lines=True
).to_dict(orient="records")
else:
self.list_data_dict = json.load(open(data_path, "r"))
self.data_path = data_path
self.image_folder = image_folder
self.length = self._get_length()
if "scannet" in data_path:
self.list_data_dict = self.list_data_dict[
int(len(self.list_data_dict) * 0.98) :
] # keep the last 2% for evaluation
random.seed(42)
random.shuffle(self.list_data_dict)
self.random_indices = []
random.seed(42) # Set a fixed seed for replicability
while len(self.random_indices) < self.__len__():
i = random.sample(range(len(self.list_data_dict[0]["pixel_coords"])), 1)
self.random_indices.append((i))
def _get_length(self) -> int:
return len(self.list_data_dict)
def __len__(self) -> int:
return len(self.list_data_dict) * len(self.list_data_dict[0]["pixel_coords"])
def extract_image_and_meta(self, index):
index_ori = index
index %= len(self.list_data_dict)
random_index = self.random_indices[index_ori][0]
# read image
data_dict = {}
data_dict["image"] = Image.open(
os.path.join(
self.image_folder, self.list_data_dict[index]["image"].lstrip("/")
)
)
intrinsics = self.list_data_dict[index]["intrinsics"][:4]
if intrinsics[0] == 0.0: # handle intrinsic errors
intrinsics[0] = intrinsics[1]
if intrinsics[1] == 0.0:
intrinsics[1] = intrinsics[0]
data_dict["image"], intrinsics_new = undistort_image(
intrinsics, data_dict["image"]
)
data_dict["image"], intrinsics_new = normalizing_focal_length(
self.normalized_focal_length, intrinsics_new, data_dict["image"]
)
pixel_coords = [
[
int(
(coord[0] - intrinsics[2]) * (intrinsics_new[0] / intrinsics[0])
+ intrinsics_new[2]
),
int(
(coord[1] - intrinsics[3]) * (intrinsics_new[1] / intrinsics[1])
+ intrinsics_new[3]
),
]
for coord in self.list_data_dict[index]["pixel_coords"]
]
pixel_coord = pixel_coords[
random_index
].copy() # pixel coords starts from top-left corner
depth = self.list_data_dict[index]["depth"][random_index]
# randomly decide the task and run the prompt generation functions
# Adjustable cross size
cross_size = 5 # You can modify this value to change the cross size
cross_thickness = 1 # You can modify this value to change the cross thickness
# Calculate the scaling factor
scale_x = 1
scale_y = 1
# Scale the pixel coordinates
scaled_pixel_x = int(pixel_coord[0] * scale_x)
scaled_pixel_y = int(pixel_coord[1] * scale_y)
center_x = round(intrinsics_new[2])
center_y = round(intrinsics_new[3])
# Compute the number of pixels from the center to scaled_pixel_x and scaled_pixel_y
pixels_from_center_x = abs(scaled_pixel_x - center_x)
pixels_from_center_y = abs(scaled_pixel_y - center_y)
# Check if the adjustable cross can be drawn
if (
cross_size <= scaled_pixel_x < data_dict["image"].width - cross_size
and cross_size <= scaled_pixel_y < data_dict["image"].height - cross_size
):
# Draw a --> like arrow
for dx in range(1, cross_size + 1):
data_dict["image"].putpixel(
(scaled_pixel_x - dx, scaled_pixel_y), (255, 0, 0)
) # Horizontal line
# Draw the arrowhead
for dy in range(1, cross_size // 2 + 1):
data_dict["image"].putpixel(
(
scaled_pixel_x - dy - 1,
scaled_pixel_y + dy,
),
(255, 0, 0),
)
data_dict["image"].putpixel(
(
scaled_pixel_x - dy - 1,
scaled_pixel_y - dy,
),
(255, 0, 0),
)
else:
# Skip this sample and get the next one
return self.extract_image_and_meta((index_ori + 1) % self.__len__())
return (
data_dict["image"],
depth,
pixel_coord,
intrinsics_new,
)
def __getitem__(self, index):
data_dict = {}
(
data_dict["image"],
depth,
pixel_coord,
intrinsics,
) = self.extract_image_and_meta(index)
# generate prompt
data_dict["problem"], data_dict["thinking"], data_dict["solution"] = (
generate_prompt_depth_sft(
depth,
is_eval=True,
)
)
data_dict["pixel_coord"] = pixel_coord
data_dict["intrinsics"] = intrinsics
data_dict["system"] = "You are a helpful assistant."
data_dict["prompt"] = [
{
"content": [
{"image": data_dict["image"], "type": "image"},
{"text": data_dict["problem"], "type": "text"},
],
"role": "user",
}
]
return data_dict
class dataset_train(Dataset):
def __init__(
self,
data_path: str,
image_folder: str,
height_max=1200,
height_min=700,
width_max=1400,
width_min=1000,
normalized_focal_length=1000,
sample_weights=None, # support weighted sampling
ratio_min=1.0, # taskonomy dataset has intrinsic noise, we randomly rescale the aspect ratio of the images to handle that
ratio_max=1.3,
) -> None:
super().__init__()
print("reading data from ", data_path, "image_folder = ", image_folder)
data_paths = data_path.split(";")
image_folders = image_folder.split(";")
self.list_data_dict = []
for dp in data_paths:
if ".jsonl" in dp:
print("reading jsonl from ", dp)
try:
with open(dp, "r") as f:
json_content = f.read()
self.list_data_dict.append(
pd.read_json(StringIO(json_content), lines=True).to_dict(
orient="records"
)
)
except Exception as e:
print(e)
self.list_data_dict.append(json.load(open(dp, "r")))
else:
self.list_data_dict.append(json.load(open(dp, "r")))
if "scannet" in dp:
self.list_data_dict[-1] = self.list_data_dict[-1][
: int(len(self.list_data_dict[-1]) * 0.98)
]
self.data_path = data_paths
self.image_folder = image_folders
self.length = self._get_length()
print(
"reading finished, dataset size is ",
self.__len__(),
", data_path = ",
self.data_path,
", image_folder = ",
self.image_folder,
)
self.random_indices = []
self.normalized_focal_length = normalized_focal_length
self.width_range = [width_min, width_max]
self.height_range = [height_min, height_max]
self.sample_weights = (
[int(x) for x in sample_weights.split(";")]
if sample_weights
else [1] * len(self.list_data_dict)
)
self.ratio_min = ratio_min
self.ratio_max = ratio_max
def _get_length(self) -> int:
length = 0
for data_dict in self.list_data_dict:
length += len(data_dict)
return length
def __len__(self, ori_length=False) -> int:
if ori_length:
length = 0
for data_dict in self.list_data_dict:
length += len(data_dict)
return length
else:
length = 0
for data_dict in self.list_data_dict:
length += (
len(data_dict) * 100
) # 100 labeled points per image in our data curation pipeline; change this number if yours differs
return length
def getitem_Taskonomy(
self, index, id_dataset
): # taskonomy dataset has intrinsic noise, we randomly rescale the aspect ratio of the images to handle that
index = index % len(self.list_data_dict[id_dataset])
# read image
data_dict = {}
data_dict["image"] = Image.open(
os.path.join(
self.image_folder[id_dataset],
self.list_data_dict[id_dataset][index]["image"].lstrip("/"),
)
)
intrinsics = self.list_data_dict[id_dataset][index]["intrinsics"][:4]
data_dict["image"], intrinsics_new = undistort_image(
intrinsics, data_dict["image"]
)
if self.normalized_focal_length > 0:
data_dict["image"], intrinsics_new = normalizing_focal_length(
self.normalized_focal_length, intrinsics_new, data_dict["image"]
)
# Pick a random aspect ratio in [ratio_min, ratio_max] and derive the new height from it
new_height = int(
data_dict["image"].width / random.uniform(self.ratio_min, self.ratio_max)
)
# Resize the image
data_dict["image"] = data_dict["image"].resize(
(data_dict["image"].width, new_height)
)
# Adjust the intrinsics to account for the new image height
intrinsics_new[1] *= new_height / intrinsics_new[5] # Scale fy
intrinsics_new[3] *= new_height / intrinsics_new[5] # Scale cy
intrinsics_new[5] = new_height # Update height
pixel_coords = [
[
int(
(coord[0] - intrinsics[2]) * (intrinsics_new[0] / intrinsics[0])
+ intrinsics_new[2]
),
int(
(coord[1] - intrinsics[3]) * (intrinsics_new[1] / intrinsics[1])
+ intrinsics_new[3]
),
]
for coord in self.list_data_dict[id_dataset][index]["pixel_coords"]
]
if len(self.list_data_dict[id_dataset][index]["pixel_coords"]) - 1 > 0:
random_index = random.randint(
0, len(self.list_data_dict[id_dataset][index]["pixel_coords"]) - 1
)
else:
print("no pixel in ", index, ": ", self.list_data_dict[id_dataset][index])
return self.__getitem__((index + 1) % self.__len__())
pixel_coord = pixel_coords[random_index]
depth = self.list_data_dict[id_dataset][index]["depth"][random_index]
# Adjustable cross size
cross_size = 5 # You can modify this value to change the cross size
# Calculate the scaling factor
scale_x = 1
scale_y = 1
# Scale the pixel coordinates
scaled_pixel_x = int(pixel_coord[0] * scale_x)
scaled_pixel_y = int(pixel_coord[1] * scale_y)
# Check if the adjustable cross can be drawn
if (
cross_size <= scaled_pixel_x < data_dict["image"].width - cross_size
and cross_size <= scaled_pixel_y < data_dict["image"].height - cross_size
):
# Draw a --> like arrow
for dx in range(1, cross_size + 1):
data_dict["image"].putpixel(
(scaled_pixel_x - dx, scaled_pixel_y), (255, 0, 0)
) # Horizontal line
# Draw the arrowhead
for dy in range(1, cross_size // 2 + 1):
data_dict["image"].putpixel(
(
scaled_pixel_x - dy - 1,
scaled_pixel_y + dy,
),
(255, 0, 0),
)
data_dict["image"].putpixel(
(
scaled_pixel_x - dy - 1,
scaled_pixel_y - dy,
),
(255, 0, 0),
)
else:
# Skip this sample and get the next one
return self.__getitem__((index + 1) % self.__len__())
# generate prompt
data_dict["problem"], data_dict["thinking"], data_dict["solution"] = (
generate_prompt_depth_sft(depth)
)
data_dict["system"] = "You are a helpful assistant."
return data_dict
def getitem_noTaskonomy(self, index, id_dataset):
index_ori = index
index = index % len(self.list_data_dict[id_dataset])
intrinsics = self.list_data_dict[id_dataset][index]["intrinsics"][:4]
# read image
data_dict = {}
img = Image.open(
os.path.join(
self.image_folder[id_dataset],
self.list_data_dict[id_dataset][index]["image"].lstrip("/"),
)
)
img, intrinsics_new = undistort_image(intrinsics, img)
if self.normalized_focal_length > 0:
img, intrinsics_new = normalizing_focal_length(
self.normalized_focal_length, intrinsics_new, img
)
data_dict["image"] = img
pixel_coords = [
[
int(
(coord[0] - intrinsics[2]) * (intrinsics_new[0] / intrinsics[0])
+ intrinsics_new[2]
),
int(
(coord[1] - intrinsics[3]) * (intrinsics_new[1] / intrinsics[1])
+ intrinsics_new[3]
),
]
for coord in self.list_data_dict[id_dataset][index]["pixel_coords"]
]
# Random center crop
width, height = data_dict["image"].size
crop_height = int(
min(height, random.uniform(self.height_range[0], self.height_range[1]))
)
crop_width = int(
min(width, random.uniform(self.width_range[0], self.width_range[1]))
)
center_x = round(intrinsics_new[2])
center_y = round(intrinsics_new[3])
# Ensure the crop is within the specified bounds
left = max(0, (width - crop_width) // 2)
top = max(
0,
(height - crop_height) // 2,
)
right = min(width, left + crop_width)
bottom = min(height, top + crop_height)
data_dict["image"] = data_dict["image"].crop((left, top, right, bottom))
# Adjust intrinsics_new to account for cropping
intrinsics_new[2] -= left # Adjust cx
intrinsics_new[3] -= top # Adjust cy
intrinsics_new[4] = data_dict["image"].width # Update width
intrinsics_new[5] = data_dict["image"].height # Update height
pixel_coords = [
(
[coord[0] - left, coord[1] - top]
if is_within_range(coord, (left, top, right, bottom))
else [-1, -1]
)
for coord in pixel_coords
]
if len(self.list_data_dict[id_dataset][index]["pixel_coords"]) - 1 > 0:
random_index = random.randint(
0, len(self.list_data_dict[id_dataset][index]["pixel_coords"]) - 1
)
else:
print("no pixel in ", index, ": ", self.list_data_dict[id_dataset][index])
return self.__getitem__((index + 1) % self.__len__(True))
random_index = adjust_index(random_index, pixel_coords)
if random_index == -1:
# Skip this sample and get the next one
return self.__getitem__((index_ori + 1) % self.__len__(True))
pixel_coord = pixel_coords[random_index].copy()
depth = self.list_data_dict[id_dataset][index]["depth"][random_index]
# Adjustable cross size
cross_size = 5 # You can modify this value to change the cross size
# Scale the pixel coordinates
scaled_pixel_x = int(pixel_coord[0])
scaled_pixel_y = int(pixel_coord[1])
# Check if the adjustable cross can be drawn
if (
cross_size <= scaled_pixel_x < data_dict["image"].width - cross_size
and cross_size <= scaled_pixel_y < data_dict["image"].height - cross_size
):
# Draw a --> like arrow
for dx in range(1, cross_size + 1):
data_dict["image"].putpixel(
(scaled_pixel_x - dx, scaled_pixel_y), (255, 0, 0)
) # Horizontal line
# Draw the arrowhead
for dy in range(1, cross_size // 2 + 1):
data_dict["image"].putpixel(
(
scaled_pixel_x - dy - 1,
scaled_pixel_y + dy,
),
(255, 0, 0),
)
data_dict["image"].putpixel(
(
scaled_pixel_x - dy - 1,
scaled_pixel_y - dy,
),
(255, 0, 0),
)
else:
return self.__getitem__((index + 1) % self.__len__(True))
data_dict["problem"], data_dict["thinking"], data_dict["solution"] = (
generate_prompt_depth_sft(depth)
)
data_dict["system"] = "You are a helpful assistant."
return data_dict
def __getitem__(self, index):
id_dataset = random.choices(
range(len(self.list_data_dict)), weights=self.sample_weights, k=1
)[0]
if "taskonomy" in self.image_folder[id_dataset]:
return self.getitem_Taskonomy(index, id_dataset)
else:
return self.getitem_noTaskonomy(index, id_dataset)
class dataset_inference(Dataset):
"""Dataset for deterministic inference. Each image-pixel pair is processed exactly once."""
def __init__(
self,
data_path: str,
image_folder: str,
normalized_focal_length=750.0, # set to the intrinsics after original resize if needed
) -> None:
super(dataset_inference, self).__init__()
self.normalized_focal_length = normalized_focal_length
logger.info(f"reading data from {data_path=}, {image_folder=}")
if ".jsonl" in data_path:
with open(data_path, "r") as f:
json_content = f.read()
self.list_data_dict = pd.read_json(
StringIO(json_content), lines=True
).to_dict(orient="records")
else:
self.list_data_dict = json.load(open(data_path, "r"))
self.data_path = data_path
self.image_folder = image_folder
if "scannet" in data_path:
self.list_data_dict = self.list_data_dict[
int(len(self.list_data_dict) * 0.98) :
] # keep the last 2% for evaluation
random.seed(42)
random.shuffle(self.list_data_dict)
# Count pixels only after the list above is final, so the index mapping stays aligned with it.
# Number of points per image, e.g., [1, 2, 3, 4, 5]
self.num_pixels: list[int] = [
len(data_dict["pixel_coords"]) for data_dict in self.list_data_dict
]
# Cumulative sum of number of points, e.g., [1, 3, 6, 10, 15]
self.num_pixels_cumsum = np.cumsum(self.num_pixels)
logger.info(
f"{dataset_inference.__name__} has {len(self.num_pixels_cumsum)}"
f" images, {self.num_pixels_cumsum[-1]} pixels"
)
def __len__(self) -> int:
"""
Each image-pixel pair is a sample, so the dataset length equals
total number of pixels.
"""
return int(self.num_pixels_cumsum[-1])
def extract_image_and_meta(self, index: int) -> dict[str, Any]:
"""
Index mapping example:
self.num_pixels = [1, 2, 3, 4, 5]
self.num_pixels_cumsum = [1, 3, 6, 10, 15]
index = 0 -> index + 1 = 1 -> image_index = 0, pixel_index = 0
index = 1 -> index + 1 = 2 -> image_index = 1, pixel_index = 0
index = 2 -> index + 1 = 3 -> image_index = 1, pixel_index = 1
index = 3 -> index + 1 = 4 -> image_index = 2, pixel_index = 0
"""
image_index: int = bisect.bisect_left(self.num_pixels_cumsum, index + 1)
pixel_index: int = (
index - int(self.num_pixels_cumsum[image_index - 1])
if image_index > 0
else index
)
logger.debug(f"Loading sample {index=}: {image_index=}, {pixel_index=}")
assert pixel_index >= 0 and pixel_index < self.num_pixels[image_index]
data_dict: dict[str, Any] = {}
# Step 1: Load image
data_dict["image"] = Image.open(
os.path.join(
self.image_folder, self.list_data_dict[image_index]["image"].lstrip("/")
)
)
# Step 2: Load intrinsics and rescale image and intrinsics to target focal length
intrinsics = self.list_data_dict[image_index]["intrinsics"][:4]
if intrinsics[0] == 0.0: # handle intrinsic errors
intrinsics[0] = intrinsics[1]
if intrinsics[1] == 0.0:
intrinsics[1] = intrinsics[0]
data_dict["image"], intrinsics_new = undistort_image(
intrinsics, data_dict["image"]
)
data_dict["image"], intrinsics_new = normalizing_focal_length(
self.normalized_focal_length, intrinsics_new, data_dict["image"]
)
# Step 3: Load pixel coordinates and rescale it
pixel_coord = self.list_data_dict[image_index]["pixel_coords"][pixel_index]
scaled_pixel_x = int(
(pixel_coord[0] - intrinsics[2]) * (intrinsics_new[0] / intrinsics[0])
+ intrinsics_new[2]
)
scaled_pixel_y = int(
(pixel_coord[1] - intrinsics[3]) * (intrinsics_new[1] / intrinsics[1])
+ intrinsics_new[3]
)
pixel_coord: tuple[int, int] = (scaled_pixel_x, scaled_pixel_y)
# Step 4: Load depth
depth: float = self.list_data_dict[image_index]["depth"][pixel_index]
# Step 5: Draw marker on the image
cross_size = 5 # Adjustable cross size
# Check if the adjustable cross can be drawn
if (
cross_size <= scaled_pixel_x < data_dict["image"].width - cross_size
and cross_size <= scaled_pixel_y < data_dict["image"].height - cross_size
):
# Draw a --> like arrow
for dx in range(1, cross_size + 1):
data_dict["image"].putpixel(
(scaled_pixel_x - dx, scaled_pixel_y), (255, 0, 0)
) # Horizontal line
# Draw the arrowhead
for dy in range(1, cross_size // 2 + 1):
data_dict["image"].putpixel(
(
scaled_pixel_x - dy - 1,
scaled_pixel_y + dy,
),
(255, 0, 0),
)
data_dict["image"].putpixel(
(
scaled_pixel_x - dy - 1,
scaled_pixel_y - dy,
),
(255, 0, 0),
)
else:
logger.error(
"Marker cannot be drawn because the pixel is too close to the border. Skipped."
)
return None
data_dict["pixel_coord"] = pixel_coord
data_dict["intrinsics"] = intrinsics_new
data_dict["depth"] = depth
return data_dict
def __getitem__(self, index) -> dict[str, Any]:
if index < 0 or index >= self.__len__():
raise ValueError(
f"Index out of range: {index}. Dataset size = {self.__len__()}"
)
data_dict: dict[str, Any] = self.extract_image_and_meta(index)
# generate prompt
data_dict["problem"], data_dict["thinking"], data_dict["solution"] = (
generate_prompt_depth_sft(
data_dict["depth"],
is_eval=True,
)
)
data_dict["system"] = "You are a helpful assistant."
data_dict["prompt"] = [
{
"content": [
{"image": data_dict["image"], "type": "image"},
{"text": data_dict["problem"], "type": "text"},
],
"role": "user",
}
]
return data_dict
```
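`dataset_inference` flattens all (image, pixel) pairs into one index space and recovers the pair with a binary search over the cumulative pixel counts. The sketch below reproduces that mapping in isolation so it can be checked by hand; the function name `locate` is a hypothetical stand-in, and a plain list from `itertools.accumulate` replaces the NumPy cumulative sum.

```python
import bisect
from itertools import accumulate


def locate(index, num_pixels_cumsum):
    """Map a flat sample index to (image_index, pixel_index).

    Hypothetical helper mirroring the lookup in extract_image_and_meta.
    """
    # First image whose cumulative count reaches index + 1 owns this sample.
    image_index = bisect.bisect_left(num_pixels_cumsum, index + 1)
    # Subtract the samples that belong to all earlier images.
    pixel_index = (
        index - num_pixels_cumsum[image_index - 1] if image_index > 0 else index
    )
    return image_index, pixel_index


num_pixels = [1, 2, 3, 4, 5]
cumsum = list(accumulate(num_pixels))  # [1, 3, 6, 10, 15]
mapping = [locate(i, cumsum) for i in range(cumsum[-1])]
```

With these counts, flat index 0 maps to image 0, pixel 0, and flat index 3 maps to image 2, pixel 0. The mapping is only valid if the cumulative counts are computed from the same list, in the same order, that is later indexed.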
## /utils/evaluation.py
```py path="/utils/evaluation.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import subprocess
from typing import Dict, TYPE_CHECKING, Union
from .hub import get_gpu_count_for_vllm, get_param_count_from_repo_id
if TYPE_CHECKING:
from trl import GRPOConfig, ModelConfig, SFTConfig
import os
# We need a special environment setup to launch vLLM from within Slurm training jobs.
# - Reference code: https://github.com/huggingface/brrr/blob/c55ba3505686d690de24c7ace6487a5c1426c0fd/brrr/lighteval/one_job_runner.py#L105
# - Slack thread: https://huggingface.slack.com/archives/C043JTYE1MJ/p1726566494958269
user_home_directory = os.path.expanduser("~")
VLLM_SLURM_PREFIX = [
"env",
"-i",
"bash",
"-c",
f"for f in /etc/profile.d/*.sh; do source $f; done; export HOME={user_home_directory}; sbatch ",
]
def register_lighteval_task(
configs: Dict[str, str],
eval_suite: str,
task_name: str,
task_list: str,
num_fewshot: int = 0,
):
"""Registers a LightEval task configuration.
- Core tasks can be added from this table: https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/tasks_table.jsonl
- Custom tasks that require their own metrics / scripts, should be stored in scripts/evaluation/extended_lighteval_tasks
Args:
configs (Dict[str, str]): The dictionary to store the task configuration.
eval_suite (str): The evaluation suite, e.g. "custom", "extended" or "lighteval".
task_name (str): The name of the task.
task_list (str): Comma-separated task names; each one is expanded to "{eval_suite}|{task}|{num_fewshot}|0".
num_fewshot (int, optional): The number of few-shot examples. Defaults to 0.
"""
# Format task list in lighteval format
task_list = ",".join(
f"{eval_suite}|{task}|{num_fewshot}|0" for task in task_list.split(",")
)
configs[task_name] = task_list


LIGHTEVAL_TASKS = {}

register_lighteval_task(LIGHTEVAL_TASKS, "custom", "math_500", "math_500", 0)
register_lighteval_task(LIGHTEVAL_TASKS, "custom", "aime24", "aime24", 0)


def get_lighteval_tasks():
    return list(LIGHTEVAL_TASKS.keys())


SUPPORTED_BENCHMARKS = get_lighteval_tasks()


def run_lighteval_job(
    benchmark: str,
    training_args: Union["SFTConfig", "GRPOConfig"],
    model_args: "ModelConfig",
) -> None:
    task_list = LIGHTEVAL_TASKS[benchmark]
    model_name = training_args.hub_model_id
    model_revision = training_args.hub_model_revision
    # Pick a GPU count that vLLM accepts for this model's attention heads
    num_gpus = get_gpu_count_for_vllm(model_name, model_revision)
    # Large models (>= 30B params) are sharded across the GPUs with tensor parallelism to avoid OOM
    tensor_parallel = get_param_count_from_repo_id(model_name) >= 30_000_000_000

    cmd = VLLM_SLURM_PREFIX.copy()
    cmd_args = [
        f"--gres=gpu:{num_gpus}",
        f"--job-name=or1_{benchmark}_{model_name.split('/')[-1]}_{model_revision}",
        "slurm/eval_callback.slurm",
        benchmark,
        f'"{task_list}"',
        model_name,
        model_revision,
        f"{tensor_parallel}",
        f"{model_args.trust_remote_code}",
    ]
    if training_args.system_prompt is not None:
        cmd_args.append(f"--system_prompt={training_args.system_prompt}")
    # The last element of the prefix is the `bash -c` script, so the sbatch arguments are appended to it
    cmd[-1] += " " + " ".join(cmd_args)
    subprocess.run(cmd, check=True)


def run_benchmark_jobs(
    training_args: Union["SFTConfig", "GRPOConfig"], model_args: "ModelConfig"
) -> None:
    benchmarks = training_args.benchmarks
    if len(benchmarks) == 1 and benchmarks[0] == "all":
        benchmarks = get_lighteval_tasks()
        # Evaluate on all supported benchmarks. Later we may want to include a `chat` option
        # that just evaluates on `ifeval` and `mt_bench` etc.

    for benchmark in benchmarks:
        print(f"Launching benchmark `{benchmark}`")
        if benchmark in get_lighteval_tasks():
            run_lighteval_job(benchmark, training_args, model_args)
        else:
            raise ValueError(f"Unknown benchmark {benchmark}")
```
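
A minimal sketch of the task-string format that `register_lighteval_task` builds, written as a standalone helper so it runs without the rest of the module. The name `format_task_list` is hypothetical and not part of the repository; only the `{eval_suite}|{task}|{num_fewshot}|0` layout is taken from the code above.

```python
def format_task_list(eval_suite: str, task_list: str, num_fewshot: int = 0) -> str:
    """Hypothetical helper mirroring the join inside register_lighteval_task."""
    return ",".join(
        f"{eval_suite}|{task}|{num_fewshot}|0" for task in task_list.split(",")
    )


# Build a small registry the same way LIGHTEVAL_TASKS is filled.
configs = {}
configs["math_500"] = format_task_list("custom", "math_500")
configs["combined"] = format_task_list("custom", "math_500,aime24")

print(configs["math_500"])
print(configs["combined"])
```

Each comma-separated task name becomes its own `suite|task|fewshot|0` entry, so one registry key can map to several tasks.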
## /utils/hub.py
```py path="/utils/hub.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import logging
import re

from huggingface_hub import (
    create_branch,
    create_repo,
    get_safetensors_metadata,
    list_repo_commits,
    list_repo_files,
    list_repo_refs,
    repo_exists,
    upload_folder,
)
from transformers import AutoConfig
from trl import GRPOConfig, SFTConfig

logger = logging.getLogger(__name__)


def push_to_hub_revision(
    training_args: SFTConfig | GRPOConfig, extra_ignore_patterns=None
) -> bool:
    """Pushes the model to a branch on a Hub repo."""
    # Create a repo if it doesn't exist yet
    repo_url = create_repo(
        repo_id=training_args.hub_model_id, private=True, exist_ok=True
    )
    # Get initial commit to branch from
    initial_commit = list_repo_commits(training_args.hub_model_id)[-1]
    # Now create the branch we'll be pushing to
    create_branch(
        repo_id=training_args.hub_model_id,
        branch=training_args.hub_model_revision,
        revision=initial_commit.commit_id,
        exist_ok=True,
    )
    logger.info(f"Created target repo at {repo_url}")
    logger.info(f"Pushing to the Hub revision {training_args.hub_model_revision}...")
    ignore_patterns = ["checkpoint-*", "*.pth"]
    # A None default avoids sharing one mutable list between calls
    ignore_patterns.extend(extra_ignore_patterns or [])
    upload_folder(
        repo_id=training_args.hub_model_id,
        folder_path=training_args.output_dir,
        revision=training_args.hub_model_revision,
        commit_message=f"Add {training_args.hub_model_revision} checkpoint",
        ignore_patterns=ignore_patterns,
    )
    logger.info(
        f"Pushed to {repo_url} revision {training_args.hub_model_revision} successfully!"
    )
    return True


def check_hub_revision_exists(training_args: SFTConfig | GRPOConfig):
    """Checks if a given Hub revision exists."""
    if repo_exists(training_args.hub_model_id):
        if training_args.push_to_hub_revision is True:
            # First check if the revision exists
            revisions = [
                rev.name for rev in list_repo_refs(training_args.hub_model_id).branches
            ]
            # If the revision exists, we next check it has a README file
            if training_args.hub_model_revision in revisions:
                repo_files = list_repo_files(
                    repo_id=training_args.hub_model_id,
                    revision=training_args.hub_model_revision,
                )
                if (
                    "README.md" in repo_files
                    and training_args.overwrite_hub_revision is False
                ):
                    raise ValueError(
                        f"Revision {training_args.hub_model_revision} already exists. "
                        "Use --overwrite_hub_revision to overwrite it."
                    )


def get_param_count_from_repo_id(repo_id: str) -> int:
    """Function to get model param counts from safetensors metadata or find patterns like 42m, 1.5b, 0.5m or products like 8x7b in a repo ID."""
    try:
        metadata = get_safetensors_metadata(repo_id)
        return list(metadata.parameter_count.values())[0]
    except Exception:
        # Pattern to match products (like 8x7b) and single values (like 42m)
        pattern = r"((\d+(\.\d+)?)(x(\d+(\.\d+)?))?)([bm])"
        matches = re.findall(pattern, repo_id.lower())
        param_counts = []
        for full_match, number1, _, _, number2, _, unit in matches:
            if number2:  # If there's a second number, it's a product
                number = float(number1) * float(number2)
            else:  # Otherwise, it's a single value
                number = float(number1)
            if unit == "b":
                number *= 1_000_000_000  # Convert to billion
            elif unit == "m":
                number *= 1_000_000  # Convert to million
            param_counts.append(number)
        if len(param_counts) > 0:
            # Return the largest number
            return int(max(param_counts))
        else:
            # Return -1 if no match found
            return -1


def get_gpu_count_for_vllm(
    model_name: str, revision: str = "main", num_gpus: int = 8
) -> int:
    """vLLM enforces a constraint that the number of attention heads must be divisible by the number of GPUs and 64 must be divisible by the number of GPUs.

    This function calculates the number of GPUs to use for decoding based on the number of attention heads in the model.
    """
    config = AutoConfig.from_pretrained(
        model_name, revision=revision, trust_remote_code=True
    )
    # Get number of attention heads
    num_heads = config.num_attention_heads
    # Reduce num_gpus so that num_heads is divisible by num_gpus and 64 is divisible by num_gpus
    while num_heads % num_gpus != 0 or 64 % num_gpus != 0:
        logger.info(
            f"Reducing num_gpus from {num_gpus} to {num_gpus - 1} to make num_heads divisible by num_gpus"
        )
        num_gpus -= 1
    return num_gpus
```
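
The two sizing helpers above can be tried without Hub access. The sketch below isolates the GPU-count loop from `get_gpu_count_for_vllm` and the regex fallback from `get_param_count_from_repo_id`. The names `pick_gpu_count` and `param_count_from_name` and the repo IDs are hypothetical; the safetensors metadata lookup and the `AutoConfig` call are deliberately left out, so this covers only the offline logic.

```python
import re


def pick_gpu_count(num_heads: int, num_gpus: int = 8) -> int:
    """Largest count <= num_gpus that divides both num_heads and 64."""
    # Always terminates: 1 divides every integer.
    while num_heads % num_gpus != 0 or 64 % num_gpus != 0:
        num_gpus -= 1
    return num_gpus


def param_count_from_name(repo_id: str) -> int:
    """Regex fallback only: read sizes such as 1.5b, 42m or 8x7b from a repo ID."""
    pattern = r"((\d+(\.\d+)?)(x(\d+(\.\d+)?))?)([bm])"
    counts = []
    for _, number1, _, _, number2, _, unit in re.findall(pattern, repo_id.lower()):
        # A second number means a product, e.g. 8x7b -> 56b.
        number = float(number1) * (float(number2) if number2 else 1.0)
        number *= 1_000_000_000 if unit == "b" else 1_000_000
        counts.append(number)
    return int(max(counts)) if counts else -1


# Hypothetical head counts and repo IDs, chosen only to exercise each branch.
print(pick_gpu_count(32), pick_gpu_count(28), pick_gpu_count(14))
print(param_count_from_name("org/model-8x7b"))
print(param_count_from_name("org/model-1.5b-instruct"))
print(param_count_from_name("org/model"))
```

With 28 heads the loop skips 7, because 7 divides 28 but not 64, and settles on 4. A repo ID with no size pattern yields -1, which `run_lighteval_job` treats as below the 30B tensor-parallel threshold.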
## /utils/metrics.py
```py path="/utils/metrics.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
import re

from math_verify import (  # @manual=fbsource//third-party/pypi/math-verify:math-verify
    parse,
)


def delta1_metric(contents, solution, **kwargs):
    """Reward function that scores each completion with a ratio threshold.

    A completion gets 1.0 if max(answer / ground_truth, ground_truth / answer) < 1.25,
    and 0.0 otherwise, including when the completion cannot be parsed as a number.
    """
    rewards = []
    for content, sol in zip(contents, solution):
        reward = -1.0  # Sentinel: not scored yet
        # First attempt: convert the parsed completion and the raw solution directly to floats
        try:
            answer = float(parse(content))
            reward = float(max(answer / float(sol), float(sol) / answer) < 1.25)
        except Exception:
            pass  # Continue to the next method if this fails
        # If the first attempt failed, extract the ground truth and use the first parsed element
        if reward == -1.0:
            # Extract answer from solution if it has think/answer tags
            sol_match = re.search(r"<answer>(.*?)</answer>", sol)
            ground_truth = float(
                sol_match.group(1).strip() if sol_match else sol.strip()
            )
            try:
                student_answer = float(parse(content)[0])
                reward = (
                    1.0
                    if max(student_answer / ground_truth, ground_truth / student_answer)
                    < 1.25
                    else 0.0
                )
            except Exception as e:
                print("error: ", e, "during solution parsing, content = ", content)
                reward = 0.0
        rewards.append(reward)
    return rewards


METRIC_CLASSES = {
    "delta1_metric": delta1_metric,
}
```
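
The threshold test inside `delta1_metric` can be sketched on plain floats, without `math_verify`. The helper names `within_delta1` and `delta1_accuracy` are hypothetical, and the guard against non-positive values is an added assumption: the original instead relies on its `except` branch when a division fails.

```python
def within_delta1(prediction: float, ground_truth: float, threshold: float = 1.25) -> bool:
    """True when max(pred / gt, gt / pred) is below the threshold."""
    # Assumption: non-positive values count as a miss instead of raising.
    if prediction <= 0 or ground_truth <= 0:
        return False
    ratio = max(prediction / ground_truth, ground_truth / prediction)
    return ratio < threshold


def delta1_accuracy(predictions, ground_truths) -> float:
    """Fraction of pairs that pass the threshold test."""
    hits = [within_delta1(p, g) for p, g in zip(predictions, ground_truths)]
    return sum(hits) / len(hits)


# Made-up values for illustration: the ratios are 1.1, 2.0 and about 1.11.
accuracy = delta1_accuracy([2.0, 3.0, 10.0], [2.2, 6.0, 9.0])
print(accuracy)
```

Taking the maximum of both ratios makes the check symmetric, so over- and under-estimates by the same factor are penalised equally.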