facebookresearch/DepthLM_Official/main
```
├── .gitignore
├── CODE_OF_CONDUCT.md (700 tokens)
├── CONTRIBUTING.md (300 tokens)
├── LICENSE (omitted)
├── README.md (700 tokens)
├── eval.py (2.5k tokens)
├── eval.sh (100 tokens)
├── examples/
   ├── ibims1/
      ├── ibims1_val.jsonl (64.7k tokens)
      ├── rgb/
         ├── Thumbs.db
         ├── corridor_01.png
         ├── corridor_02.png
         ├── corridor_03.png
         ├── corridor_04.png
         ├── corridor_05.png
         ├── corridor_06.png
         ├── corridor_07.png
         ├── corridor_08.png
         ├── corridor_09.png
         ├── corridor_10.png
         ├── factory_01.png
         ├── factory_02.png
         ├── factory_03.png
         ├── factory_04.png
         ├── factory_05.png
         ├── factory_06.png
         ├── factory_07.png
         ├── factory_08.png
         ├── kitchen_01.png
         ├── kitchen_02.png
         ├── kitchen_03.png
         ├── kitchen_04.png
         ├── kitchen_05.png
         ├── kitchen_06.png
         ├── kitchen_07.png
         ├── kitchen_08.png
         ├── lab_01.png
         ├── lab_02.png
         ├── lab_03.png
         ├── lab_04.png
         ├── lab_05.png
         ├── lab_06.png
         ├── lab_07.png
         ├── lab_08.png
         ├── lab_09.png
         ├── lab_10.png
         ├── lab_11.png
         ├── lectureroom_01.png
         ├── lectureroom_02.png
         ├── lectureroom_03.png
         ├── lectureroom_04.png
         ├── lectureroom_05.png
         ├── lectureroom_06.png
         ├── lectureroom_07.png
         ├── lectureroom_08.png
         ├── lectureroom_09.png
         ├── lectureroom_10.png
         ├── livingroom_01.png
         ├── livingroom_02.png
         ├── livingroom_03.png
         ├── livingroom_04.png
         ├── livingroom_05.png
         ├── livingroom_06.png
         ├── livingroom_07.png
         ├── livingroom_08.png
         ├── livingroom_09.png
         ├── livingroom_10.png
         ├── livingroom_11.png
         ├── livingroom_12.png
         ├── livingroom_13.png
         ├── livingroom_14.png
         ├── livingroom_15.png
         ├── meetingroom_01.png
         ├── meetingroom_02.png
         ├── meetingroom_03.png
         ├── meetingroom_04.png
         ├── meetingroom_05.png
         ├── meetingroom_06.png
         ├── meetingroom_07.png
         ├── meetingroom_08.png
         ├── office_01.png
         ├── office_02.png
         ├── office_03.png
         ├── office_04.png
         ├── office_05.png
         ├── office_06.png
         ├── office_07.png
         ├── office_08.png
         ├── restaurant_01.png
         ├── restaurant_02.png
         ├── restaurant_03.png
         ├── restaurant_04.png
         ├── restaurant_05.png
         ├── restaurant_06.png
         ├── restaurant_07.png
         ├── restaurant_08.png
         ├── restaurant_09.png
         ├── restaurant_10.png
         ├── restaurant_11.png
         ├── restaurant_12.png
         ├── restroom_01.png
         ├── restroom_02.png
         ├── storageroom_01.png
         ├── storageroom_02.png
         ├── storageroom_03.png
         ├── storageroom_04.png
         ├── storageroom_05.png
         ├── storageroom_06.png
         ├── storageroom_07.png
         ├── storageroom_08.png
├── media/
   ├── cv_model.png
   ├── main_result.png
   ├── multiTask.jpg
   ├── point_cloud.png
   ├── teaser.png
├── prepare_data.sh (1000 tokens)
├── requirements.txt
├── train.py (5.2k tokens)
├── train.sh (400 tokens)
├── utils/
   ├── callbacks.py (500 tokens)
   ├── curate_NYU.py (800 tokens)
   ├── curate_argoverse.py (2.2k tokens)
   ├── curate_ddad.py (1000 tokens)
   ├── curate_eth3d.py (1500 tokens)
   ├── curate_matterport3d.py (1400 tokens)
   ├── curate_nuscenes_eval.py (1700 tokens)
   ├── curate_nuscenes_train.py (1800 tokens)
   ├── curate_scannet.py (2.1k tokens)
   ├── curate_sunRGBD.py (800 tokens)
   ├── curate_taskonomy (1000 tokens)
   ├── curate_waymo.py (2.2k tokens)
   ├── datasets.py (5.9k tokens)
   ├── evaluation.py (900 tokens)
   ├── hub.py (1000 tokens)
   ├── metrics.py (400 tokens)
```


## /.gitignore

```gitignore path="/.gitignore" 
*/__pycache__

```

## /CODE_OF_CONDUCT.md

# Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to make participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies within all project spaces, and it also applies when
an individual is representing the project or its community in public spaces.
Examples of representing a project or community include using an official
project e-mail address, posting via an official social media account, or acting
as an appointed representative at an online or offline event. Representation of
a project may be further defined and clarified by project maintainers.

This Code of Conduct also applies outside the project spaces when there is a
reasonable belief that an individual's behavior may have a negative impact on
the project or its community.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at <opensource-conduct@meta.com>. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq


## /CONTRIBUTING.md

# Contributing to DepthLM
We want to make contributing to this project as easy and transparent as
possible.

## Our Development Process
... (in particular how this is synced with internal changes to the project)

## Pull Requests
We actively welcome your pull requests.

1. Fork the repo and create your branch from `main`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. If you haven't already, complete the Contributor License Agreement ("CLA").

## Contributor License Agreement ("CLA")
In order to accept your pull request, we need you to submit a CLA. You only need
to do this once to work on any of Meta's open source projects.

Complete your CLA here: <https://code.facebook.com/cla>

## Issues
We use GitHub issues to track public bugs. Please ensure your description is
clear and has sufficient instructions to be able to reproduce the issue.

Meta has a [bounty program](https://bugbounty.meta.com/) for the safe
disclosure of security bugs. In those cases, please go through the process
outlined on that page and do not file a public issue.

## Coding Style
* 2 spaces for indentation rather than tabs
* 80 character line length
* ...

## License
By contributing to DepthLM, you agree that your contributions will be licensed
under the LICENSE file in the root directory of this source tree.


## /README.md

# [ICLR2026 Oral (top 1.2%)] DepthLM
Official implementation of "[DepthLM: Metric Depth from Vision Language Models](https://arxiv.org/abs/2509.25413)".


We show for the first time that **VLMs can achieve accuracy comparable to pure vision models on metric depth estimation**, with standard text-based SFT and no architecture change, i.e., no dense prediction head or regression/regularization loss is needed. This simplicity allows DepthLM to train a unified VLM that handles various complex 3D understanding tasks, such as speed or time estimation and metric-scale camera pose estimation, which require different architectures or hand-crafted pipelines in pure vision models.

<div align=center>
<img width=100% src="./media/teaser.png"/>
</div>

<div align=center>
<img width=100% src="./media/multiTask.jpg"/>
</div>

## Citation

    If you find our code useful for your research, please consider citing:

    @article{cai2025depthlm,
        title={DepthLM: Metric Depth from Vision Language Models},
        author={Cai, Zhipeng and Yeh, Ching-Feng and Hu, Xu and Liu, Zhuang and Meyer, Gregory and Lei, Xinjie and Zhao, Changsheng and Li, Shang-Wen and Chandra, Vikas and Shi, Yangyang},
        journal={arXiv preprint arXiv:2509.25413},
        year={2025},
    }

## Contact
Zhipeng Cai, Meta Inc, homepage: https://zhipengcai.github.io/, email: czptc2h at gmail dot com.

## Prerequisites
1. Run ```conda create -n DepthLM python=3.12``` and activate the environment with ```conda activate DepthLM```
2. Run ```pip install -r requirements.txt``` (the code is tested with transformers 4.51.1)

| Model      |                                               Link                                                |
|:----:|:-------------------------------------------------------------------------------------------------:|
| DepthLM (Pixtral 12B)  |   [Download 🤗](https://huggingface.co/facebook/DepthLM) |
| DepthLM (3B)  |   (Coming soon!) |
| DepthLM (7B)  |   (Coming soon!) |

## Data Preparation
- We curate each training/eval dataset into
    - A folder containing the images
    - A jsonl file containing the corresponding camera intrinsics and 3D labels
- We provide example data from the iBims1 dataset at examples/ibims1 so the code can be run quickly without any data preparation. Other images/datasets can use the same code once the data preparation steps are finished.
- For legal reasons, we cannot directly release the curated data. However, we provide the data curation code to enable reproduction.
- Check out each block in [prepare_data.sh](https://github.com/facebookresearch/DepthLM_Official/blob/main/prepare_data.sh) for the detailed data preparation steps for each dataset.
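
The layout described above (an image folder plus a jsonl metadata file) can be inspected with a few lines of Python. This is only an illustrative sketch: `load_jsonl` is a hypothetical helper, not part of this repository, and it makes no assumption about field names. It parses one JSON object per line and reports which keys the first record contains.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Parse a JSON Lines file into a list (one JSON value per non-blank line)."""
    records = []
    with Path(path).open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Path of the example metadata file shipped with the repository.
    jsonl_path = Path("examples/ibims1/ibims1_val.jsonl")
    if jsonl_path.exists():
        records = load_jsonl(jsonl_path)
        print(f"{len(records)} records")
        if records and isinstance(records[0], dict):
            print("keys of first record:", sorted(records[0].keys()))
```

Looking at the keys of a real record first is a cheap way to check that a newly curated dataset matches the example data before running evaluation on it.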

## Eval
- Run ```bash eval.sh <path_to_your_model>```
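
The evaluation script prints a `delta_1` score at the end of the run. The repository's own `delta1_metric` lives in `utils/metrics.py`; the sketch below is not that function but the standard definition of δ1 accuracy in depth estimation: the fraction of predictions for which max(pred/gt, gt/pred) is below 1.25. The name `delta1_accuracy` and the choice to count non-positive predictions as misses are assumptions made for illustration.

```python
def delta1_accuracy(preds, gts, threshold=1.25):
    """Fraction of depth predictions with max(pred/gt, gt/pred) < threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        raise ValueError("need at least one prediction")
    hits = 0
    for pred, gt in zip(preds, gts):
        if gt <= 0:
            raise ValueError("ground-truth depth must be positive")
        if pred <= 0:
            continue  # assumption: non-positive predictions count as misses
        if max(pred / gt, gt / pred) < threshold:
            hits += 1
    return hits / len(preds)


# Depths in metres: the first two predictions are within the threshold, the third is not.
score = delta1_accuracy([2.0, 4.5, 1.0], [2.2, 4.0, 3.0])
print(f"delta_1 = {score:.3f}")
```

Because the ratio is symmetric, over- and under-estimating by the same factor are penalized equally.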


## Training
- Download the base model you want to train from [here](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct/tree/main). Our code currently supports Qwen2.5-VL and Pixtral; please see our paper for the corresponding hyper-parameters.
- Run ```bash train.sh <path_to_your_model> <output_path>```

## Results

### Comparison with VLMs

<div align=center>
<img width=100% src="./media/main_result.png"/>
</div>

### Comparison with pure vision models

<div align=center>
<img width=80% src="./media/cv_model.png"/>
</div>

### Point cloud visualization

<div align=center>
<img width=100% src="./media/point_cloud.png"/>
</div>

## Related project

Our follow-up project [VLM³](https://github.com/facebookresearch/VLM3) has been released! It extends the findings of DepthLM to diverse 3D vision tasks!

## License
DepthLM is FAIR CC-BY-NC licensed, as found in the LICENSE file.


## /eval.py

```py path="/eval.py" 
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.

# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

import argparse
import logging
from typing import Any  # used in the batch_messages annotation in main()

import torch
from tqdm import tqdm
from transformers import (
    AutoProcessor,
    LlavaForConditionalGeneration,
    Qwen2_5_VLForConditionalGeneration,
)
from utils.datasets import dataset_eval, dataset_inference
from utils.metrics import *


def convert_example_pixtral(example, image_before_text=None):
    messages = []
    problem = example.get("problem")
    if "images" in example:
        images = example.get("images")

        if image_before_text is not None and image_before_text:
            messages.append(
                {
                    "role": "user",
                    "content": [{"type": "image", "image": img} for img in images]
                    + [{"type": "text", "content": problem}],
                }
            )
        else:
            messages.append(
                {
                    "role": "user",
                    "content": [{"type": "text", "content": problem}]
                    + [{"type": "image", "image": img} for img in images],
                }
            )
    else:
        image = example.get("image")
        if image_before_text is not None and image_before_text:
            messages.append(
                {
                    "role": "user",
                    "content": [
                        {"type": "image", "image": image},
                        {"type": "text", "content": problem},
                    ],
                }
            )
        else:
            messages.append(
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "content": problem},
                        {"type": "image", "image": image},
                    ],
                }
            )
    example["messages"] = messages
    return example


def convert_example(example, image_before_text=None):
    messages = []
    if "system" in example:
        messages.append(
            {
                "role": "system",
                "content": [{"type": "text", "text": example["system"]}],
            }
        )
    else:
        SYSTEM_PROMPT = (
            "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant "
            "first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
            "process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
            "<think> reasoning process here </think><answer> answer here </answer>"
        )
        messages.append(
            {
                "role": "system",
                "content": [{"type": "text", "text": SYSTEM_PROMPT}],
            }
        )
    problem = example.get("problem")
    if "images" in example:
        images = example.get("images")

        if image_before_text is not None and image_before_text:
            messages.append(
                {
                    "role": "user",
                    "content": [{"type": "image", "image": img} for img in images]
                    + [{"type": "text", "text": problem}],
                }
            )
        else:
            messages.append(
                {
                    "role": "user",
                    "content": [{"type": "text", "text": problem}]
                    + [{"type": "image", "image": img} for img in images],
                }
            )
    else:
        image = example.get("image")
        if image_before_text is not None and image_before_text:
            messages.append(
                {
                    "role": "user",
                    "content": [
                        {"type": "image", "image": image},
                        {"type": "text", "text": problem},
                    ],
                }
            )
        else:
            messages.append(
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": problem},
                        {"type": "image", "image": image},
                    ],
                }
            )
    example["messages"] = messages
    return example


def main(args):
    model_path = args.model_path
    img_folder = args.image_folder
    json_path = args.json_path
    processor = AutoProcessor.from_pretrained(model_path)

    # The architecture is chosen from the model path: any path containing
    # "pixtral" (case-insensitive) is loaded as a Llava/Pixtral model,
    # everything else as Qwen2.5-VL.
    if "pixtral" in model_path.lower():
        print("loading DepthLM with pixtral (12B) architecture")
        model = LlavaForConditionalGeneration.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            attn_implementation={
                "text_config": "flash_attention_2",
                "vision_config": "eager",
            },
            device_map="auto",
        )
    else:
        print("loading DepthLM with qwen2.5-vl architecture")
        model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            attn_implementation="flash_attention_2",
            device_map="auto",
        )
    # Inference only: put the model in eval mode for both architectures.
    model.eval()

    if args.run_deterministic_inference:
        dataset = dataset_inference(
            json_path,
            img_folder,
            normalized_focal_length=750.0,  # change to the corresponding value for other models
        )
    else:
        dataset = dataset_eval(
            json_path,
            img_folder,
            normalized_focal_length=750.0,  # change to the corresponding value for other models
        )

    print(f"{dataset.__class__.__name__} size = {len(dataset)}")

    metric_funcs = [delta1_metric]
    metrics = []
    all_outputs = []  # List to store all answers
    all_solutions = []  # List to store all solutions

    samples_to_eval = min(args.samples_to_eval, len(dataset))
    step = 1
    sampled_indices: list[int] = list(range(0, samples_to_eval, step))
    print(f"Evaluating {len(sampled_indices)} samples")

    with torch.no_grad():

        for i in tqdm(range(0, len(sampled_indices), args.bsz)):
            batch_indices: list[int] = sampled_indices[i : i + args.bsz]
            batch_messages: list[dict[str, Any]] = []
            for j in batch_indices:
                message = dataset[j]
                if message is not None:
                    batch_messages.append(message)
            if len(batch_messages) == 0:
                continue

            if "pixtral" in model_path.lower():
                chat = [
                    convert_example_pixtral(msg, True)["messages"]
                    for msg in batch_messages
                ]

                inputs = processor.apply_chat_template(
                    chat,
                    add_generation_prompt=True,
                    tokenize=True,
                    return_dict=True,
                    padding=True,
                    padding_side="left",
                    return_tensors="pt",
                ).to("cuda", dtype=torch.bfloat16)

                generated_ids = model.generate(
                    **inputs,
                    max_new_tokens=args.max_new_tokens,
                    do_sample=False,
                    top_p=None,
                    top_k=None,
                )

                generated_ids_trimmed = [
                    out_ids[len(in_ids) :]
                    for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
                ]
                batch_output_text = processor.batch_decode(
                    generated_ids_trimmed,
                    skip_special_tokens=True,
                    clean_up_tokenization_spaces=False,
                )
            else:
                # code for qwen based models
                if args.apply_system_prompt:
                    text = [
                        processor.apply_chat_template(
                            convert_example(msg, True)["messages"],
                            tokenize=False,
                            add_generation_prompt=True,
                        )
                        for msg in batch_messages
                    ]
                else:
                    batch_messages_text: list[str] = [
                        msg["prompt"] for msg in batch_messages
                    ]
                    text: list[str] = [
                        processor.apply_chat_template(
                            msg, tokenize=False, add_generation_prompt=True
                        )
                        for msg in batch_messages_text
                    ]

                image_inputs = [
                    x["images"] if "images" in x else x["image"] for x in batch_messages
                ]

                if i == 0:
                    print(
                        "text = ",
                        text[0],
                        "apply_system_prompt = ",
                        args.apply_system_prompt,
                    )

                inputs = processor(
                    text=text,
                    images=image_inputs,
                    padding=True,
                    padding_side="left",
                    return_tensors="pt",
                )
                inputs = inputs.to("cuda")

                # Inference: Generation of the output
                # TODO maybe enable sampling here later
                generated_ids = model.generate(
                    **inputs,
                    use_cache=True,
                    max_new_tokens=args.max_new_tokens,
                    do_sample=False,
                    top_p=None,  # Unset top_p to avoid the warning
                    top_k=None,
                )

                generated_ids_trimmed = [
                    out_ids[len(in_ids) :]
                    for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
                ]
                batch_output_text = processor.batch_decode(
                    generated_ids_trimmed,
                    skip_special_tokens=True,
                    clean_up_tokenization_spaces=False,
                )

            print(f"model input = {batch_messages}")
            print(f"model output = {batch_output_text}")

            solution_list = [example["solution"] for example in batch_messages]
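            for k, metric_func in enumerate(metric_funcs):
                batch_scores = metric_func(
                    batch_output_text,
                    solution_list.copy(),
                )
                if k >= len(metrics):
                    # First evaluated batch for this metric. Checking `i == 0`
                    # instead would raise an IndexError below whenever the
                    # first batch was skipped by the `continue` above.
                    metrics.append(batch_scores)
                else:
                    metrics[k] += batch_scores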

            all_outputs.extend(batch_output_text)
            all_solutions.extend(solution_list.copy())

    for i in range(len(metric_funcs)):
        print("final delta_1 = ", sum(metrics[i]) / len(metrics[i]))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

    parser = argparse.ArgumentParser(description="DepthLM parameters.")
    parser.add_argument(
        "--model_path", type=str, required=True, help="Path to the model."
    )
    parser.add_argument(
        "--image_folder",
        type=str,
        default="./examples/ibims1/",
        help="folder that contains the image",
    )
    parser.add_argument(
        "--json_path",
        type=str,
        default="./examples/ibims1/ibims1_val.jsonl",
        help="path to the meta data",
    )
    parser.add_argument(
        "--max_new_tokens",
        type=int,
        default=4096,
        help="maximum number of tokens to generate",
    )
    parser.add_argument("--bsz", type=int, default=1, help="Batch size for processing.")
    parser.add_argument(
        "--apply_system_prompt",
        action="store_true",
        help="For Qwen only, whether to apply system prompt or not.",
    )
    parser.add_argument(
        "--run_deterministic_inference",
        action="store_true",
        help="When True, will call the dataset_inference class to run deterministic inference.",
    )
    parser.add_argument(
        "--samples_to_eval",
        type=int,
        default=128,
        help="maximum number of samples to evaluate",
    )
    args = parser.parse_args()

    main(args)

```
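
Both branches of `eval.py` pad prompts on the left and then slice off the first `len(in_ids)` tokens of each generated sequence, so that only newly generated tokens are decoded. The toy example below reproduces that slicing with plain Python lists. The token ids, the `PAD` value, and the helper names are made up for illustration, and no model is involved.

```python
PAD = 0  # made-up padding id for this illustration


def left_pad(sequences, pad_id=PAD):
    """Left-pad token-id lists to a common length."""
    width = max(len(seq) for seq in sequences)
    return [[pad_id] * (width - len(seq)) + seq for seq in sequences]


def trim_prompts(input_ids, generated_ids):
    """Drop the (padded) prompt from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


prompts = left_pad([[11, 12, 13], [21]])
# Pretend the model appended two new tokens to every padded prompt.
generated = [prompts[0] + [101, 102], prompts[1] + [201, 202]]
print(trim_prompts(prompts, generated))  # prints [[101, 102], [201, 202]]
```

With left padding, every prompt in the batch has the same padded length and ends at the same position, so a single slice offset per row is enough to isolate the new tokens.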

## /eval.sh

```sh path="/eval.sh" 
#!/bin/bash

# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.

# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

model_path=$1
python eval.py --model_path "$model_path" --image_folder "./examples/ibims1/" --json_path "./examples/ibims1/ibims1_val.jsonl" --bsz 3 --samples_to_eval 128

```
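
`eval.sh` hard-codes the example dataset. For other runs, the same command line can be assembled programmatically from the flags that `eval.py` defines. The helper below is a hypothetical convenience, not part of the repository. It only builds the argument list and leaves execution (for example via `subprocess.run`) to the caller; the model path in the usage line is a placeholder.

```python
import shlex


def build_eval_command(
    model_path,
    image_folder="./examples/ibims1/",
    json_path="./examples/ibims1/ibims1_val.jsonl",
    bsz=3,
    samples_to_eval=128,
    apply_system_prompt=False,
    run_deterministic_inference=False,
):
    """Return the argv list for eval.py using the flags defined in its parser."""
    cmd = [
        "python", "eval.py",
        "--model_path", str(model_path),
        "--image_folder", str(image_folder),
        "--json_path", str(json_path),
        "--bsz", str(bsz),
        "--samples_to_eval", str(samples_to_eval),
    ]
    if apply_system_prompt:  # Qwen-based models only, per the flag's help text
        cmd.append("--apply_system_prompt")
    if run_deterministic_inference:
        cmd.append("--run_deterministic_inference")
    return cmd


# Placeholder path containing a space, to show that quoting is handled.
cmd = build_eval_command("/models/DepthLM pixtral")
print(shlex.join(cmd))  # shlex.join quotes arguments that need it
```

Note that `eval.py` selects the Pixtral code path whenever the model path contains "pixtral" (case-insensitive), so the directory name matters when pointing the script at a local checkpoint.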

## /examples/ibims1/rgb/Thumbs.db

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/Thumbs.db

## /examples/ibims1/rgb/corridor_01.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/corridor_01.png

## /examples/ibims1/rgb/corridor_02.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/corridor_02.png

## /examples/ibims1/rgb/corridor_03.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/corridor_03.png

## /examples/ibims1/rgb/corridor_04.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/corridor_04.png

## /examples/ibims1/rgb/corridor_05.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/corridor_05.png

## /examples/ibims1/rgb/corridor_06.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/corridor_06.png

## /examples/ibims1/rgb/corridor_07.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/corridor_07.png

## /examples/ibims1/rgb/corridor_08.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/corridor_08.png

## /examples/ibims1/rgb/corridor_09.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/corridor_09.png

## /examples/ibims1/rgb/corridor_10.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/corridor_10.png

## /examples/ibims1/rgb/factory_01.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/factory_01.png

## /examples/ibims1/rgb/factory_02.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/factory_02.png

## /examples/ibims1/rgb/factory_03.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/factory_03.png

## /examples/ibims1/rgb/factory_04.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/factory_04.png

## /examples/ibims1/rgb/factory_05.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/factory_05.png

## /examples/ibims1/rgb/factory_06.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/factory_06.png

## /examples/ibims1/rgb/factory_07.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/factory_07.png

## /examples/ibims1/rgb/factory_08.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/factory_08.png

## /examples/ibims1/rgb/kitchen_01.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/kitchen_01.png

## /examples/ibims1/rgb/kitchen_02.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/kitchen_02.png

## /examples/ibims1/rgb/kitchen_03.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/kitchen_03.png

## /examples/ibims1/rgb/kitchen_04.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/kitchen_04.png

## /examples/ibims1/rgb/kitchen_05.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/kitchen_05.png

## /examples/ibims1/rgb/kitchen_06.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/kitchen_06.png

## /examples/ibims1/rgb/kitchen_07.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/kitchen_07.png

## /examples/ibims1/rgb/kitchen_08.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/kitchen_08.png

## /examples/ibims1/rgb/lab_01.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lab_01.png

## /examples/ibims1/rgb/lab_02.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lab_02.png

## /examples/ibims1/rgb/lab_03.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lab_03.png

## /examples/ibims1/rgb/lab_04.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lab_04.png

## /examples/ibims1/rgb/lab_05.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lab_05.png

## /examples/ibims1/rgb/lab_06.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lab_06.png

## /examples/ibims1/rgb/lab_07.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lab_07.png

## /examples/ibims1/rgb/lab_08.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lab_08.png

## /examples/ibims1/rgb/lab_09.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lab_09.png

## /examples/ibims1/rgb/lab_10.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lab_10.png

## /examples/ibims1/rgb/lab_11.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lab_11.png

## /examples/ibims1/rgb/lectureroom_01.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lectureroom_01.png

## /examples/ibims1/rgb/lectureroom_02.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lectureroom_02.png

## /examples/ibims1/rgb/lectureroom_03.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lectureroom_03.png

## /examples/ibims1/rgb/lectureroom_04.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lectureroom_04.png

## /examples/ibims1/rgb/lectureroom_05.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lectureroom_05.png

## /examples/ibims1/rgb/lectureroom_06.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lectureroom_06.png

## /examples/ibims1/rgb/lectureroom_07.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lectureroom_07.png

## /examples/ibims1/rgb/lectureroom_08.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lectureroom_08.png

## /examples/ibims1/rgb/lectureroom_09.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lectureroom_09.png

## /examples/ibims1/rgb/lectureroom_10.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/lectureroom_10.png

## /examples/ibims1/rgb/livingroom_01.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_01.png

## /examples/ibims1/rgb/livingroom_02.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_02.png

## /examples/ibims1/rgb/livingroom_03.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_03.png

## /examples/ibims1/rgb/livingroom_04.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_04.png

## /examples/ibims1/rgb/livingroom_05.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_05.png

## /examples/ibims1/rgb/livingroom_06.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_06.png

## /examples/ibims1/rgb/livingroom_07.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_07.png

## /examples/ibims1/rgb/livingroom_08.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_08.png

## /examples/ibims1/rgb/livingroom_09.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_09.png

## /examples/ibims1/rgb/livingroom_10.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_10.png

## /examples/ibims1/rgb/livingroom_11.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_11.png

## /examples/ibims1/rgb/livingroom_12.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_12.png

## /examples/ibims1/rgb/livingroom_13.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_13.png

## /examples/ibims1/rgb/livingroom_14.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_14.png

## /examples/ibims1/rgb/livingroom_15.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/livingroom_15.png

## /examples/ibims1/rgb/meetingroom_01.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/meetingroom_01.png

## /examples/ibims1/rgb/meetingroom_02.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/meetingroom_02.png

## /examples/ibims1/rgb/meetingroom_03.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/meetingroom_03.png

## /examples/ibims1/rgb/meetingroom_04.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/meetingroom_04.png

## /examples/ibims1/rgb/meetingroom_05.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/meetingroom_05.png

## /examples/ibims1/rgb/meetingroom_06.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/meetingroom_06.png

## /examples/ibims1/rgb/meetingroom_07.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/meetingroom_07.png

## /examples/ibims1/rgb/meetingroom_08.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/meetingroom_08.png

## /examples/ibims1/rgb/office_01.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/office_01.png

## /examples/ibims1/rgb/office_02.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/office_02.png

## /examples/ibims1/rgb/office_03.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/office_03.png

## /examples/ibims1/rgb/office_04.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/office_04.png

## /examples/ibims1/rgb/office_05.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/office_05.png

## /examples/ibims1/rgb/office_06.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/office_06.png

## /examples/ibims1/rgb/office_07.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/office_07.png

## /examples/ibims1/rgb/office_08.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/office_08.png

## /examples/ibims1/rgb/restaurant_01.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restaurant_01.png

## /examples/ibims1/rgb/restaurant_02.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restaurant_02.png

## /examples/ibims1/rgb/restaurant_03.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restaurant_03.png

## /examples/ibims1/rgb/restaurant_04.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restaurant_04.png

## /examples/ibims1/rgb/restaurant_05.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restaurant_05.png

## /examples/ibims1/rgb/restaurant_06.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restaurant_06.png

## /examples/ibims1/rgb/restaurant_07.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restaurant_07.png

## /examples/ibims1/rgb/restaurant_08.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restaurant_08.png

## /examples/ibims1/rgb/restaurant_09.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restaurant_09.png

## /examples/ibims1/rgb/restaurant_10.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restaurant_10.png

## /examples/ibims1/rgb/restaurant_11.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restaurant_11.png

## /examples/ibims1/rgb/restaurant_12.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restaurant_12.png

## /examples/ibims1/rgb/restroom_01.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restroom_01.png

## /examples/ibims1/rgb/restroom_02.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/restroom_02.png

## /examples/ibims1/rgb/storageroom_01.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/storageroom_01.png

## /examples/ibims1/rgb/storageroom_02.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/storageroom_02.png

## /examples/ibims1/rgb/storageroom_03.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/storageroom_03.png

## /examples/ibims1/rgb/storageroom_04.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/storageroom_04.png

## /examples/ibims1/rgb/storageroom_05.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/storageroom_05.png

## /examples/ibims1/rgb/storageroom_06.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/storageroom_06.png

## /examples/ibims1/rgb/storageroom_07.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/storageroom_07.png

## /examples/ibims1/rgb/storageroom_08.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/examples/ibims1/rgb/storageroom_08.png

## /media/cv_model.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/media/cv_model.png

## /media/main_result.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/media/main_result.png

## /media/multiTask.jpg

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/media/multiTask.jpg

## /media/point_cloud.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/media/point_cloud.png

## /media/teaser.png

Binary file available at https://raw.githubusercontent.com/facebookresearch/DepthLM_Official/refs/heads/main/media/teaser.png

## /prepare_data.sh

```sh path="/prepare_data.sh" 
#!/bin/bash

# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.

# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

# Although we provide separate code for each dataset, the main steps are the same:
# 1. download the dataset following the official instructions
# 2. convert images + camera intrinsics + depth maps into QA pairs. This step generates a jsonl file containing all the metadata and a folder containing the corresponding images, similar to https://github.com/facebookresearch/DepthLM/tree/main/examples/ibims1.

# Argoverse
## 1. install av2 library and download + unzip the dataset following the official instructions in https://argoverse.github.io/user-guide/getting_started.html#downloading-the-data
## 2. (optional) move 10-20 scenes from val to train folder to enlarge the training dataset size
## 3. curate the data
python utils/curate_argoverse.py \
"/path/to/argoverse/train_or_val/" \
"/path/to/output_image_folder" \
"/path/to/output_jsonl/argoverse2_train_or_val.jsonl"



# Waymo
## 1. download and unzip waymo open dataset from https://console.cloud.google.com/storage/browser/waymo_open_dataset_v_2_0_1
## 2. curate the data
python utils/curate_waymo.py \
--dataset_dir /path/to/waymo/training/ \
--out_json_path /path/to/output_jsonl/waymo_train.jsonl \
--out_image_dir /path/to/output_image_folder

# NuScenes
## 1. download and unzip the dataset following https://www.nuscenes.org/nuscenes (we use the "Mini" subset for evaluation and the other scenes in "All" for training)
## 2. install nuscenes devkit at https://github.com/nutonomy/nuscenes-devkit
pip install nuscenes-devkit
## 3. curate training data
python utils/curate_nuscenes_train.py \
--dataroot /path/to/nuscenes_all \
--dataroot_mini /path/to/nuscenes_mini \
--out_json_path /path/to/output_jsonl/nuscenes_train.jsonl \
--out_image_dir /path/to/output_image_folder
## 4. curate eval data
python utils/curate_nuscenes_eval.py \
--dataroot /path/to/nuscenes_mini \
--out_json_path /path/to/output_jsonl/nuscenes_eval.jsonl \
--out_image_dir /path/to/output_image_folder

# ScanNet++
# our dataloader will automatically separate train and eval samples, so no need to separate them
## 1. download scannet++ dataset from https://kaldir.vc.in.tum.de/scannetpp/
## 2. clone and install the scannet++ github repo at https://github.com/scannetpp/scannetpp
## 3. in /scannet_github_code_root/iphone/configs/prepare_iphone_data.yml, change "data_root" to the folder containing your downloaded data
## 4. move data curation code to the scannet github local repo (we need modules in the scannet code to read the data)
mv utils/curate_scannet.py /scannet_github_code_root/iphone/prepare_depth_json.py
## 5. go to the scannet github local repo and run the data curation code (in a subshell, so later commands still run from this repo's root)
(cd /scannet_github_code_root && python -m iphone.prepare_depth_json iphone/configs/prepare_iphone_data.yml)

# Taskonomy
## 1. download the fullplus version of the dataset following https://github.com/StanfordVL/taskonomy/tree/master/data
## 2. curate data
python utils/curate_taskonomy.py \
--dataroot /path/to/taskonomy \
--out_json_path /path/to/output_jsonl/taskonomy.jsonl \
--out_image_dir /path/to/output_image_folder

# HM3d
## 1. download the hm3d dataset using https://docs.omnidata.vision/starter_dataset_download.html (set the components to hm3d)
## 2. curate data (coming soon)

# Matterport3D
## 1. download the dataset at https://niessner.github.io/Matterport/
## 2. curate data
python utils/curate_matterport3d.py \
--dataroot /path/to/matterport \
--out_json_path /path/to/output_jsonl/matterport.jsonl \
--out_image_dir /path/to/output_image_folder

# DDAD
## 1. download the dataset and install the dgp library following the "How to Use" section in https://github.com/TRI-ML/DDAD
## 2. curate data
python utils/curate_ddad.py \
--ddad_trainval_json_path /path/to/ddad/ddad_train_val/ddad.json \
--out_json_path /path/to/output_jsonl/ddad.jsonl \
--out_image_dir /path/to/output_image_folder  \
--path_to_dgp_lib /path/to/dgp/lib/folder

# ETH3D
## 1. download images and depth maps from https://www.eth3d.net/datasets
## 2. curate data
python utils/curate_eth3d.py \
--image_dir /path/to/eth3d/multi_view_training_dslr_jpg \
--depth_map_dir /path/to/eth3d/depth_map \
--out_json_path /path/to/output_jsonl/eth3d.jsonl \
--out_image_dir /path/to/output_image_folder

# sunRGBD & NYUv2
## 1. download data and unzip
dataroot=/path/to/sunRGBD
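mkdir -p "$dataroot"
# download and extract into $dataroot without changing directory (the curate scripts below use paths relative to this repo)
wget -P "$dataroot" http://cvgl.stanford.edu/data2/sun_rgbd.tgz
tar -xvzf "$dataroot/sun_rgbd.tgz" -C "$dataroot"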
## 2. curate data for sunRGBD (without NYUv2)
python utils/curate_sunRGBD.py \
--dataroot /path/to/SUNRGBD/root \
--out_json_path /path/to/output_jsonl/sunRGBD.jsonl \
--out_image_dir /path/to/output_image_folder
## 3. curate data for NYUv2
python utils/curate_NYU.py \
--dataroot /path/to/SUNRGBD/root \
--out_json_path /path/to/output_jsonl/NYUv2.jsonl \
--out_image_dir /path/to/output_image_folder

```
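
The comments in `prepare_data.sh` describe step 2 (turning images and depth maps into QA pairs stored as JSONL) only in words. The sketch below illustrates the general idea for a single depth map. It is not the code in `utils/curate_*.py`: the helper names `make_depth_qa` and `write_jsonl`, the question wording, and the record layout are assumptions made for illustration. The keys `image`, `problem`, and `solution` are borrowed from what `convert_example` in `train.py` reads, and the real schema may differ or contain more fields (for example camera intrinsics).

```python
import json
import random
from pathlib import Path


def make_depth_qa(image_name, depth_map, rng):
    """Build one hypothetical QA record from a depth map.

    depth_map is a list of rows holding depth in meters; 0 marks an invalid pixel.
    Returns None when the map has no valid pixel.
    """
    valid = [
        (x, y)
        for y, row in enumerate(depth_map)
        for x, depth in enumerate(row)
        if depth > 0
    ]
    if not valid:
        return None
    x, y = rng.choice(valid)
    return {
        "image": image_name,
        # Question wording is an assumption, not the prompt used by the authors.
        "problem": f"How far is the point at pixel ({x}, {y}) from the camera, in meters?",
        "solution": f"{depth_map[y][x]:.2f}",
    }


def write_jsonl(records, path):
    """Write one JSON object per line, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```

Passing a seeded `random.Random` instance as `rng` keeps the sampled pixels reproducible between runs.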

## /requirements.txt

torch
torchvision
datasets
numpy
pandas
peft
pillow
qwen-vl-utils
huggingface-hub
einops
flash-attn
math_verify
opencv-python
tensorboard
transformers
trl==0.15.2
accelerate==1.6.0
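
Only `trl` and `accelerate` are pinned to exact versions in the list above. A minimal sketch for checking that an environment matches those two pins, using the standard-library `importlib.metadata`. The helper name `check_pins` and its injectable `get_version` argument are illustrative and not part of the repository.

```python
from importlib.metadata import PackageNotFoundError, version

# The two exact pins from requirements.txt.
PINNED = {"trl": "0.15.2", "accelerate": "1.6.0"}


def check_pins(pins, get_version=version):
    """Return {package: (expected, found)} for every pin that is not satisfied.

    found is None when the package is not installed.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means both pinned packages are installed at the expected versions.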


## /train.py

```py path="/train.py" 
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.

# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

import datetime
import logging
import os
import sys
import uuid
from dataclasses import dataclass, field
from typing import Optional

import datasets
import torch
import transformers
import trl

from qwen_vl_utils import process_vision_info


# from torch import distributed as dist

from torch.utils.tensorboard import SummaryWriter
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    BatchFeature,
    LlavaForConditionalGeneration,
    MllamaForConditionalGeneration,
    Qwen2_5_VLForConditionalGeneration,
    Qwen2VLForConditionalGeneration,
    set_seed,
    TrainerCallback,
)
from transformers.integrations import TensorBoardCallback
from transformers.trainer_utils import get_last_checkpoint

from trl import (
    get_kbit_device_map,
    get_peft_config,
    get_quantization_config,
    ScriptArguments,
    SFTTrainer,
    TrlParser,
)

from utils.datasets import dataset_train
from utils.callbacks import get_callbacks
logger = logging.getLogger(__name__)


@dataclass
class ModelConfig(trl.ModelConfig):
    output_model_local_path: str = field(
        default="test-output",
        metadata={"help": "Output model local path, do not set manually"},
    )
    output_model_filename: Optional[str] = field(
        default="test-output", metadata={"help": "Output model relative manifold path"}
    )


@dataclass
class SFTConfig(trl.SFTConfig):
    """
    args for callbacks, benchmarks etc
    """

    benchmarks: list[str] = field(
        default_factory=lambda: [],
        metadata={"help": "The benchmarks to run after training."},
    )
    callbacks: list[str] = field(
        default_factory=lambda: [],
        metadata={"help": "The callbacks to run during training."},
    )
    system_prompt: Optional[str] = field(
        default=None,
        metadata={"help": "The optional system prompt to use for benchmarking."},
    )
    hub_model_revision: Optional[str] = field(
        default="main",
        metadata={"help": "The Hub model branch to push the model to."},
    )
    overwrite_hub_revision: bool = field(
        default=False, metadata={"help": "Whether to overwrite the Hub revision."}
    )
    push_to_hub_revision: bool = field(
        default=False, metadata={"help": "Whether to push to a Hub revision/branch."}
    )


@dataclass
# pyre-fixme[11]: Annotation `ScriptArguments` is not defined as a type.
class SFTScriptArguments(ScriptArguments):
    """
    Script arguments for the SFT training script.

    Covers dataset selection (dataset class, image folder, sample weights) and
    image-size, padding, and augmentation options.
    """

    dataset_class: str = field(
        default="LazySupervisedDataset_ArgoverseDepth_GRPO",
        metadata={"help": "dataset class name in utils.datasets"},
    )
    max_pixels: Optional[int] = field(
        default=12845056,
        metadata={"help": "Maximum number of pixels for the image"},
    )
    min_pixels: Optional[int] = field(
        default=3136,
        metadata={"help": "Minimum number of pixels for the image"},
    )

    image_folder: Optional[str] = field(
        default=None,
        metadata={"help": "image folder on manifold"},
    )
    augment: Optional[float] = field(
        default=None,
        metadata={"help": "augmentation ratio"},
    )
    normalized_focal_length: Optional[float] = field(
        default=None,
        metadata={"help": "normalized focal length"},
    )
    sample_weights: Optional[str] = field(
        default=None,
        metadata={"help": "weights for sampling"},
    )
    pad: Optional[bool] = field(
        default=None,
        metadata={
            "help": "whether to pad image to have same width and height in 2 image strategy"
        },
    )
    height_max: Optional[float] = field(
        default=None,
        metadata={"help": "max height"},
    )
    height_min: Optional[float] = field(
        default=None,
        metadata={"help": "min height"},
    )
    width_min: Optional[float] = field(
        default=None,
        metadata={"help": "min width"},
    )
    width_max: Optional[float] = field(
        default=None,
        metadata={"help": "max width"},
    )
    ratio_min: Optional[float] = field(
        default=None,
        metadata={"help": "min ratio"},
    )
    ratio_max: Optional[float] = field(
        default=None,
        metadata={"help": "max ratio"},
    )


processor = None


def configure_pixtral_vision_tower(model, compute_dtype, device):
    vision_tower = model.vision_tower
    vision_tower.to(dtype=compute_dtype, device=device)


def convert_example(example):
    """
    Convert a dataset example into chat-format "messages".

    Keys read from `example`:
      - "system" (optional): system prompt; a default <think>/<answer> prompt is used if absent
      - "problem": the user question text
      - "images" or "image": a list of images, or a single image
      - "thinking" (optional) and "solution": joined to form the assistant reply

    The result is stored in example["messages"] and the example is returned.
    """
    messages = []
    if "system" in example:
        messages.append(
            {
                "role": "system",
                "content": [{"type": "text", "text": example["system"]}],
            }
        )
    else:
        SYSTEM_PROMPT = (
            "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant "
            "first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
            "process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
            "<think> reasoning process here </think><answer> answer here </answer>"
        )
        messages.append(
            {
                "role": "system",
                "content": [{"type": "text", "text": SYSTEM_PROMPT}],
            }
        )

    thinking = example.get("thinking", "")  # empty when the example has no reasoning trace
    problem = example.get("problem")
    solution = example.get("solution")
    if "images" in example:
        images = example.get("images")
        messages.append(
            {
                "role": "user",
                "content": [{"type": "image", "image": img} for img in images]
                + [{"type": "text", "text": problem}],
            }
        )
    else:
        image = example.get("image")
        messages.append(
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image},
                    {"type": "text", "text": problem},
                ],
            }
        )
    messages.append(
        {
            "role": "assistant",
            "content": f"{thinking}\n\n{solution}",
        }
    )

    example["messages"] = messages
    return example


def convert_example_phi4(example):
    """
    Convert a dataset example into chat-format "messages" (same logic as convert_example).

    Keys read from `example`:
      - "system" (optional): system prompt; a default <think>/<answer> prompt is used if absent
      - "problem": the user question text
      - "images" or "image": a list of images, or a single image
      - "thinking" (optional) and "solution": joined to form the assistant reply

    The result is stored in example["messages"] and the example is returned.
    """
    messages = []
    if "system" in example:
        messages.append(
            {
                "role": "system",
                "content": [{"type": "text", "text": example["system"]}],
            }
        )
    else:
        SYSTEM_PROMPT = (
            "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant "
            "first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
            "process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
            "<think> reasoning process here </think><answer> answer here </answer>"
        )
        messages.append(
            {
                "role": "system",
                "content": [{"type": "text", "text": SYSTEM_PROMPT}],
            }
        )

    thinking = example.get("thinking", "")  # empty when the example has no reasoning trace
    problem = example.get("problem")
    solution = example.get("solution")
    if "images" in example:
        images = example.get("images")
        messages.append(
            {
                "role": "user",
                "content": [{"type": "image", "image": img} for img in images]
                + [{"type": "text", "text": problem}],
            }
        )
    else:
        image = example.get("image")
        messages.append(
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image},
                    {"type": "text", "text": problem},
                ],
            }
        )
    messages.append(
        {
            "role": "assistant",
            "content": f"{thinking}\n\n{solution}",
        }
    )

    example["messages"] = messages
    return example


def pad_sequence(sequences, padding_side="right", padding_value=0):
    """
    Pad a list of sequences to the same length.
    sequences: list of tensors in [seq_len, *] shape
    """
    assert padding_side in ["right", "left"]
    max_size = sequences[0].size()
    trailing_dims = max_size[1:]
    max_len = max(len(seq) for seq in sequences)
    batch_size = len(sequences)
    output = sequences[0].new_full((batch_size, max_len) + trailing_dims, padding_value)
    for i, seq in enumerate(sequences):
        length = seq.size(0)
        if padding_side == "right":
            output.data[i, :length] = seq
        else:
            output.data[i, -length:] = seq
    return output


def cat_with_pad(tensors, dim, padding_value=0):
    """
    cat along dim, while pad to max for all other dims
    """
    ndim = tensors[0].dim()
    assert all(
        t.dim() == ndim for t in tensors[1:]
    ), "All tensors must have the same number of dimensions"

    out_size = [max(t.shape[i] for t in tensors) for i in range(ndim)]
    out_size[dim] = sum(t.shape[dim] for t in tensors)
    output = tensors[0].new_full(out_size, padding_value)

    index = 0
    for t in tensors:
        # Create a slice list where every dimension except dim is full slice
        slices = [slice(0, t.shape[d]) for d in range(ndim)]
        # Update only the concat dimension slice
        slices[dim] = slice(index, index + t.shape[dim])

        # Index with a tuple; indexing with a list of slices relies on legacy behavior
        output[tuple(slices)] = t
        index += t.shape[dim]

    return output


def pmc_vqa_collate_fn(batch):
    input_ids_list = []
    labels_list = []
    input_image_embeds_list = []
    image_attention_mask_list = []
    image_sizes_list = []
    for inputs in batch:
        input_ids_list.append(inputs["input_ids"][0])
        labels_list.append(inputs["labels"][0])
        input_image_embeds_list.append(inputs["input_image_embeds"])
        image_attention_mask_list.append(inputs["image_attention_mask"])
        image_sizes_list.append(inputs["image_sizes"])

    input_ids = pad_sequence(input_ids_list, padding_side="right", padding_value=0)
    labels = pad_sequence(labels_list, padding_side="right", padding_value=0)
    attention_mask = (input_ids != 0).long()
    input_image_embeds = cat_with_pad(input_image_embeds_list, dim=0)
    image_attention_mask = cat_with_pad(image_attention_mask_list, dim=0)
    image_sizes = torch.cat(image_sizes_list)

    # breakpoint()
    return BatchFeature(
        {
            "input_ids": input_ids,
            "labels": labels,
            "attention_mask": attention_mask,
            "input_image_embeds": input_image_embeds,
            "image_attention_mask": image_attention_mask,
            "image_sizes": image_sizes,
            "input_mode": 1,  # vision mode
        }
    )


def collate_fn_phi4(examples):
    _IGNORE_INDEX = -100
    _MAX_TRAINING_LENGTH = 8192
    batch = []
    for example in examples:
        image = example["image"]
        question = example.get("problem")
        user_message = {
            "role": "user",
            "content": "<|image_1|>" + question,
        }
        prompt = processor.tokenizer.apply_chat_template(
            [user_message], tokenize=False, add_generation_prompt=True
        )
        answer = f'{example.get("thinking", "")}\n\n{example.get("solution")}<|end|><|endoftext|>'
        inputs = processor(prompt, images=[image], return_tensors="pt")

        answer_ids = processor.tokenizer(answer, return_tensors="pt").input_ids

        input_ids = torch.cat([inputs.input_ids, answer_ids], dim=1)
        labels = torch.full_like(input_ids, _IGNORE_INDEX)
        labels[:, -answer_ids.shape[1] :] = answer_ids

        # breakpoint()
        if input_ids.size(1) > _MAX_TRAINING_LENGTH:
            input_ids = input_ids[:, :_MAX_TRAINING_LENGTH]
            labels = labels[:, :_MAX_TRAINING_LENGTH]
            if torch.all(labels == _IGNORE_INDEX).item():
                # workaround to make sure loss compute won't fail
                labels[:, -1] = processor.tokenizer.eos_token_id
        batch.append(
            {
                "input_ids": input_ids,
                "labels": labels,
                "input_image_embeds": inputs.input_image_embeds,
                "image_attention_mask": inputs.image_attention_mask,
                "image_sizes": inputs.image_sizes,
            }
        )

    return pmc_vqa_collate_fn(batch)


def find_subsequence(sequence, subsequence):
    """
    Helper function to find the starting index of a subsequence within a sequence.
    """
    seq_len = len(sequence)
    sub_len = len(subsequence)
    for i in range(seq_len - sub_len + 1):
        if torch.equal(sequence[i : i + sub_len], subsequence):
            return i
    return None


def get_image_token_count(image, dummy_text="describe this image"):
    """
    Compute the number of tokens generated for an image using the model's vision tower.
    Returns 0 if token computation fails.
    """
    try:
        inputs = processor(images=image, text=dummy_text, return_tensors="pt").to(
            "cuda"
        )
        with torch.no_grad():
            output = model.vision_tower(pixel_values=inputs["pixel_values"])
        token_count = output.last_hidden_state.shape[1]
        if token_count == 0:
            raise ValueError("Image token count is zero.")
        return token_count
    except Exception as e:
        print(f"[ERROR] Failed to compute image tokens: {e}")
        return 0  # Return zero to flag as invalid


def collate_fn_pixtral(examples):
    texts = [
        processor.apply_chat_template(
            convert_example(example)["messages"],
            tokenize=False,
            add_generation_prompt=True,
        )
        for example in examples
    ]
    image_inputs = []
    for example in examples:
        imgs, vids = process_vision_info(example["messages"])
        image_inputs.append(imgs)
    batch = processor(
        text=texts,
        images=image_inputs,
        return_tensors="pt",
        padding=True,
    )

    # print("texts = ", texts[0])
    # breakpoint()
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    image_token_id = processor.tokenizer.convert_tokens_to_ids(processor.image_token)
    labels[labels == image_token_id] = -100
    batch["labels"] = labels
    return batch


def collate_fn(examples):
    # breakpoint()
    texts = [
        processor.apply_chat_template(
            convert_example(example)["messages"],
            tokenize=False,
            add_generation_prompt=True,
        )
        for example in examples
    ]
    image_inputs = []
    for example in examples:
        imgs, vids = process_vision_info(example["messages"])
        image_inputs.append(imgs)
    batch = processor(
        text=texts,
        images=image_inputs,
        return_tensors="pt",
        padding=True,
    )

    # print("texts = ", texts[0])
    # breakpoint()
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    image_token_id = processor.tokenizer.convert_tokens_to_ids(processor.image_token)
    labels[labels == image_token_id] = -100
    batch["labels"] = labels
    # breakpoint()
    return batch


def main(script_args, training_args, model_args):
    set_seed(training_args.seed)

    ###############
    # Setup logging
    ###############
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
        handlers=[logging.StreamHandler(sys.stdout)],
    )
    log_level = training_args.get_process_log_level()
    logger.setLevel(log_level)
    datasets.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.set_verbosity(log_level)
    transformers.utils.logging.enable_default_handler()
    transformers.utils.logging.enable_explicit_format()
    training_args.gradient_checkpointing_kwargs = {"use_reentrant": False}

    # Log on each process a small summary
    logger.warning(
        f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
        + f" distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
    )
    logger.info(f"Model parameters {model_args}")
    logger.info(f"Script parameters {script_args}")
    logger.info(f"Training parameters {training_args}")


    print("script_args.image_folder = ", script_args.image_folder)
    training_args.output_dir = model_args.output_model_local_path

    # Check for last checkpoint
    last_checkpoint = None
    if os.path.isdir(training_args.output_dir):
        last_checkpoint = get_last_checkpoint(training_args.output_dir)
    if last_checkpoint is not None and training_args.resume_from_checkpoint is None:
        logger.info(f"Checkpoint detected, resuming training at {last_checkpoint=}.")

    ################
    # Load datasets
    ################

    dataset_kwargs = {}

    if script_args.normalized_focal_length is not None:
        dataset_kwargs["normalized_focal_length"] = script_args.normalized_focal_length
    if script_args.sample_weights is not None:
        # sample_weights is already a ';'-separated string; pass it through unchanged
        dataset_kwargs["sample_weights"] = script_args.sample_weights
    if script_args.height_max is not None:
        dataset_kwargs["height_max"] = script_args.height_max
    if script_args.height_min is not None:
        dataset_kwargs["height_min"] = script_args.height_min
    if script_args.width_min is not None:
        dataset_kwargs["width_min"] = script_args.width_min
    if script_args.width_max is not None:
        dataset_kwargs["width_max"] = script_args.width_max
    if script_args.ratio_min is not None:
        dataset_kwargs["ratio_min"] = script_args.ratio_min
    if script_args.ratio_max is not None:
        dataset_kwargs["ratio_max"] = script_args.ratio_max

    dataset = dataset_train(script_args.dataset_name, script_args.image_folder, **dataset_kwargs)


    print("[dataset] dataset_size = ", len(dataset))

    ################
    # Load tokenizer
    ################
    global processor
    if "vl" in model_args.model_name_or_path.lower():
        processor = AutoProcessor.from_pretrained(
            model_args.model_name_or_path,
            trust_remote_code=model_args.trust_remote_code,
        )
        logger.info("Using AutoProcessor for vision-language model.")
        if hasattr(processor, "pad_token") and processor.pad_token is None:
            processor.pad_token = processor.eos_token
        elif (
            hasattr(processor.tokenizer, "pad_token")
            and processor.tokenizer.pad_token is None
        ):
            processor.tokenizer.pad_token = processor.tokenizer.eos_token
    elif "pixtral-12b" in model_args.model_name_or_path.lower():
        processor = AutoProcessor.from_pretrained(
            model_args.model_name_or_path,
        )

        if hasattr(processor, "pad_token") and processor.pad_token is None:
            processor.pad_token = processor.eos_token
        elif (
            hasattr(processor.tokenizer, "pad_token")
            and processor.tokenizer.pad_token is None
        ):
            processor.tokenizer.pad_token = processor.tokenizer.eos_token

        processor.image_processor.do_resize = False
        processor.image_processor.do_rescale = False
        # breakpoint()

    else:
        processor = AutoProcessor.from_pretrained(
            model_args.model_name_or_path,
            trust_remote_code=True,
            use_fast=True,
        )
        logger.info("Using AutoProcessor.")

    # ###################
    # # Model init kwargs
    # ###################
    logger.info("*** Initializing model kwargs ***")
    torch_dtype = (
        model_args.torch_dtype
        if model_args.torch_dtype in ["auto", None]
        else getattr(torch, model_args.torch_dtype)
    )
    quantization_config = get_quantization_config(model_args)

    if "pixtral-12b" in model_args.model_name_or_path.lower():
        # seems like use_cache is not supported in the model class
        model_kwargs = dict(
            revision=model_args.model_revision,
            trust_remote_code=model_args.trust_remote_code,
            attn_implementation={
                "text_config": "flash_attention_2",
                "vision_config": "eager",
            },
            torch_dtype=torch_dtype,
            device_map=(
                get_kbit_device_map() if quantization_config is not None else None
            ),
            quantization_config=quantization_config,
        )
    else:
        # training_args.model_init_kwargs = model_kwargs
        model_kwargs = dict(
            revision=model_args.model_revision,
            trust_remote_code=model_args.trust_remote_code,
            attn_implementation=model_args.attn_implementation,
            torch_dtype=torch_dtype,
            use_cache=False if training_args.gradient_checkpointing else True,
            device_map=(
                get_kbit_device_map() if quantization_config is not None else None
            ),
            quantization_config=quantization_config,
        )

    if "Qwen2.5-VL" in model_args.model_name_or_path:
        model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
            model_args.model_name_or_path, **model_kwargs
        )
    elif "pixtral-12b" in model_args.model_name_or_path.lower():
        model = LlavaForConditionalGeneration.from_pretrained(
            model_args.model_name_or_path, **model_kwargs
        )
        if training_args.gradient_checkpointing:
            model.enable_input_require_grads()
            # This is a workaround for a bug in the current implementation of gradient checkpointing
            training_args.gradient_checkpointing_kwargs = {"use_reentrant": True}
    else:
        model = AutoModelForCausalLM.from_pretrained(
            model_args.model_name_or_path, **model_kwargs
        )

    ############################
    # Initialize the SFT Trainer
    ############################

    callbacks = get_callbacks(training_args, model_args)
    # Write TensorBoard logs to <output_dir>/tensorboard_logs
    callbacks.append(
        TensorBoardCallback(
            SummaryWriter(
                log_dir=os.path.join(
                    training_args.output_dir,
                    "tensorboard_logs",
                ),
                comment="",
                purge_step=None,
                max_queue=10,
                flush_secs=120,
                filename_suffix=str(uuid.uuid4()),
            )
        )
    )

    training_args.dataset_kwargs = {
        "skip_prepare_dataset": True,
    }
    training_args.remove_unused_columns = False

    if "pixtral" in model_args.model_name_or_path.lower():
        trainer = SFTTrainer(
            model=model,
            args=training_args,
            train_dataset=dataset,
            processing_class=processor.tokenizer,
            data_collator=collate_fn_pixtral,
            peft_config=get_peft_config(model_args),
            callbacks=callbacks,
        )
    else:
        trainer = SFTTrainer(
            model=model,
            args=training_args,
            train_dataset=dataset,
            processing_class=processor.tokenizer,
            data_collator=collate_fn,
            peft_config=get_peft_config(model_args),
            callbacks=callbacks,
        )

    # ###############
    # # Training loop
    # ###############
    logger.info("*** Train ***")
    checkpoint = None
    if training_args.resume_from_checkpoint is not None:
        checkpoint = training_args.resume_from_checkpoint
    elif last_checkpoint is not None:
        checkpoint = last_checkpoint
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
    metrics = train_result.metrics
    trainer.log_metrics("train", metrics)
    trainer.save_metrics("train", metrics)
    trainer.save_state()

    # ##################################
    # # Save model and create model card
    # ##################################
    logger.info("*** Save model ***")
    trainer.save_model(training_args.output_dir)
    processor.save_pretrained(training_args.output_dir)
    logger.info(f"Model saved to {training_args.output_dir}")


if __name__ == "__main__":
    parser = TrlParser((SFTScriptArguments, SFTConfig, ModelConfig))
    script_args, training_args, model_args = parser.parse_args_and_config()
    model_args.output_model_local_path = os.path.join(
        training_args.output_dir,
        "models",
        "DepthLM",
    )
    os.makedirs(model_args.output_model_local_path, exist_ok=True)

    main(script_args, training_args, model_args)

```
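
The prompt-masking step in `collate_fn_phi4` can be illustrated without tensors. The following is a minimal pure-Python sketch of the same idea, not the repository's implementation: prompt positions get the ignore index so that only answer tokens contribute to the loss, and if truncation removes every answer token, the last label is set to the EOS id so the loss is still defined. The helper name `build_sft_labels` and the token ids are hypothetical.

```python
def build_sft_labels(prompt_ids, answer_ids, max_len, eos_token_id, ignore_index=-100):
    """Concatenate prompt and answer ids and mask the prompt in the labels.

    Hypothetical helper mirroring the logic of collate_fn_phi4 on plain lists.
    """
    input_ids = list(prompt_ids) + list(answer_ids)
    # Only the answer tokens are supervised; the prompt is ignored by the loss.
    labels = [ignore_index] * len(prompt_ids) + list(answer_ids)

    if len(input_ids) > max_len:
        input_ids = input_ids[:max_len]
        labels = labels[:max_len]
        if all(label == ignore_index for label in labels):
            # Every supervised token was truncated away; keep one valid label
            # so that the loss computation does not fail.
            labels[-1] = eos_token_id
    return input_ids, labels


# Made-up token ids: a 3-token prompt followed by a 2-token answer.
ids, labels = build_sft_labels([11, 12, 13], [21, 22], max_len=8, eos_token_id=2)
print(ids, labels)
```

Masking with `-100` matches the default `ignore_index` of PyTorch's cross-entropy loss, which is why the collators in this file use that value.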

## /train.sh

```sh path="/train.sh" 
#!/bin/bash

# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.

# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

model_path=$1
output_path=$2

# 1. Use ';' to separate multiple entries in image_folder, dataset_name and sample_weights; this example concatenates two identical datasets.
# 2. This is only a basic example to make things run; please follow the optimal hyper-parameters from the paper. On devices that cannot fit the paper's batch size, scale the learning rate together with the batch size following the square-root rule so that you can train with a smaller batch size.
# 3. Adjust max_steps to control the number of training samples.
# 4. Set per_device_train_batch_size to 10+ for H100 cards with Qwen2.5-VL 3B and 7B.
# 5. To train the Pixtral model, set per_device_train_batch_size to 3 and change the FSDP layer class to --fsdp_transformer_layer_cls_to_wrap "MistralDecoderLayer,PixtralAttentionLayer".

torchrun --nproc_per_node=2 --master_port=12433 train.py \
--model_name_or_path $model_path \
--image_folder "./examples/ibims1/;./examples/ibims1/" \
--dataset_name "./examples/ibims1/ibims1_val.jsonl;./examples/ibims1/ibims1_val.jsonl" \
--sample_weights "1;1" \
--max_seq_length 4096 \
--learning_rate 1e-5 \
--lr_scheduler_type cosine \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 1 \
--warmup_ratio 0.1 \
--max_grad_norm 0.1 \
--logging_steps 1 \
--report_to tensorboard \
--gradient_checkpointing true \
--attn_implementation "flash_attention_2" \
--max_steps 10 \
--log_level info \
--logging_strategy steps \
--output_dir $output_path \
--save_steps 3000 \
--save_strategy steps \
--eval_strategy no \
--torch_dtype bfloat16 \
--seed 42 \
--normalized_focal_length 1000.0 \
--height_min 700 \
--height_max 1200 \
--width_min 1000 \
--width_max 1400 \
--dataset_class dataset_train \
--fsdp "full_shard auto_wrap" \
--fsdp_transformer_layer_cls_to_wrap "Qwen2_5_VLDecoderLayer"

```
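
Note 2 in the script comments refers to the square-root rule for scaling the learning rate with the batch size. A minimal sketch of one common reading of that rule is shown below: the learning rate is multiplied by the square root of the ratio between the new and the reference effective batch size. The function names are hypothetical, and the reference learning rate and batch size are placeholders rather than the paper's values.

```python
import math


def effective_batch_size(num_processes, per_device_batch_size, grad_accum_steps):
    """Number of samples consumed per optimizer step across all devices."""
    return num_processes * per_device_batch_size * grad_accum_steps


def scale_lr_sqrt(ref_lr, ref_batch_size, new_batch_size):
    """Square-root rule: lr_new = lr_ref * sqrt(new_batch / ref_batch)."""
    return ref_lr * math.sqrt(new_batch_size / ref_batch_size)


# The launcher above uses 2 processes, per-device batch size 2, no accumulation.
new_bs = effective_batch_size(num_processes=2, per_device_batch_size=2, grad_accum_steps=1)

# Placeholder reference setting (NOT taken from the paper).
ref_lr, ref_bs = 1e-5, 64
print(new_bs, scale_lr_sqrt(ref_lr, ref_bs, new_bs))
```

Gradient accumulation is an alternative when memory is the only constraint: raising `--gradient_accumulation_steps` restores the reference effective batch size without changing the learning rate.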

## /utils/callbacks.py

```py path="/utils/callbacks.py" 
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.

# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

import subprocess
from typing import List

from transformers import TrainerCallback
from transformers.trainer_callback import TrainerControl, TrainerState
from transformers.training_args import TrainingArguments

from .evaluation import run_benchmark_jobs
from .hub import push_to_hub_revision


def is_slurm_available() -> bool:
    # Returns True if a Slurm queueing system is available.
    try:
        subprocess.run(
            ["sinfo"], check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE
        )
        return True
    except (FileNotFoundError, subprocess.CalledProcessError):
        # sinfo is either not installed or exited with a non-zero status
        return False


class DummyConfig:
    def __init__(self, **kwargs):
        for k, v in kwargs.items():
            setattr(self, k, v)


class PushToHubRevisionCallback(TrainerCallback):
    def __init__(self, model_config) -> None:
        self.model_config = model_config

    def on_save(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        if state.is_world_process_zero:
            global_step = state.global_step

            # WARNING: if you use dataclasses.replace(args, ...) the accelerator dist state will be broken, so I do this workaround
            # Also if you instantiate a new SFTConfig, the accelerator dist state will be broken
            dummy_config = DummyConfig(
                hub_model_id=args.hub_model_id,
                hub_model_revision=f"{args.hub_model_revision}-step-{global_step:09d}",
                output_dir=f"{args.output_dir}/checkpoint-{global_step}",
                system_prompt=args.system_prompt,
            )

            # TODO: I think this could be made async
            push_to_hub_revision(
                dummy_config, extra_ignore_patterns=["*.pt"]
            )  # don't push the optimizer states

            if is_slurm_available():
                dummy_config.benchmarks = args.benchmarks
                run_benchmark_jobs(dummy_config, self.model_config)


CALLBACKS = {
    "push_to_hub_revision": PushToHubRevisionCallback,
}


def get_callbacks(train_config, model_config) -> List[TrainerCallback]:
    callbacks = []
    for callback_name in train_config.callbacks:
        if callback_name not in CALLBACKS:
            raise ValueError(f"Callback {callback_name} not found in CALLBACKS.")
        callbacks.append(CALLBACKS[callback_name](model_config))

    return callbacks

```

## /utils/curate_NYU.py

```py path="/utils/curate_NYU.py" 
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.

# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.


import argparse
import json
import os
import shutil
from glob import glob

import numpy as np

# from fiftyone import ViewField as F
from PIL import Image

# Set up argument parser
parser = argparse.ArgumentParser(description="Process some files.")
parser.add_argument("--dataroot", type=str, default="/home/czptc2h/datasets/SUNRGBD", help="image dir")
parser.add_argument(
    "--out_json_path", type=str, help="output jsonl path"
)
parser.add_argument(
    "--out_image_dir", type=str, help="output image folder"
)
args = parser.parse_args()

# dataroot is used below for globbing scenes and for building relative image paths
dataroot = args.dataroot
scene_dirs = glob(os.path.join(dataroot, "SUNRGBD/k*/*/*"))
print("Scene Dirs:", scene_dirs)

out_json_path = args.out_json_path
out_image_path = args.out_image_dir

if os.path.exists(out_image_path):
    shutil.rmtree(out_image_path)
os.makedirs(out_image_path)


points_per_image = 100

count = 0
with open(out_json_path, "w") as jsonl_file:
    for scene_dir in scene_dirs:
        data_dict = {}
        ## Get image file path from scene directory
        image_path = glob(f"{scene_dir}/image/*")[0]
        if "NYU" not in image_path:
            continue

        sub_dir = image_path.replace(f"{dataroot}/SUNRGBD/", "")
        ## Copy the image to the out_image_path directory
        os.makedirs(os.path.dirname(out_image_path + "/" + sub_dir), exist_ok=True)
        shutil.copy(image_path, out_image_path + "/" + sub_dir)

        ## Get depth map file path from scene directory
        depth_path = glob(f"{scene_dir}/depth/*")[0]

        print("Image Path:", image_path, "; Depth Path:", depth_path)

        intrinsic_path = f"{scene_dir}/intrinsics.txt"
        with open(intrinsic_path, "r") as file:
            intrinsic_data = file.read().strip().split()
            intrinsic_matrix = np.array(intrinsic_data, dtype=np.float32).reshape(
                (3, 3)
            )
        print("Intrinsic Matrix:\n", intrinsic_matrix)

        # Read the image from image_path into a PIL image
        pil_image = Image.open(image_path)

        data_dict["image"] = sub_dir
        data_dict["intrinsics"] = [
            float(intrinsic_matrix[0, 0]),
            float(intrinsic_matrix[1, 1]),
            float(intrinsic_matrix[0, 2]),
            float(intrinsic_matrix[1, 2]),
        ] + [pil_image.size[0], pil_image.size[1]]

        depth_gt = Image.open(depth_path)
        depth_gt = np.asarray(depth_gt, dtype=np.float32)
        depth_gt = depth_gt / 10000.0

        # Randomly sample up to points_per_image pixels in depth_gt with value > 0.005 and < 25
        valid_pixels = np.argwhere((depth_gt > 0.005) & (depth_gt < 25))
        sampled_indices = np.random.choice(
            len(valid_pixels),
            size=min(points_per_image, len(valid_pixels)),
            replace=False,
        )
        sampled_pixels = valid_pixels[sampled_indices]

        data_dict["pixel_coords"] = sampled_pixels[:, [1, 0]].tolist()
        fx, fy, cx, cy = (
            intrinsic_matrix[0, 0],
            intrinsic_matrix[1, 1],
            intrinsic_matrix[0, 2],
            intrinsic_matrix[1, 2],
        )
        z = depth_gt[sampled_pixels[:, 0], sampled_pixels[:, 1]]
        x = (sampled_pixels[:, 1] - cx) * z / fx
        y = (sampled_pixels[:, 0] - cy) * z / fy
        euclidean_distances = np.sqrt(x**2 + y**2 + z**2)
        data_dict["depth"] = euclidean_distances.tolist()

        print("PIL Image Size:", pil_image.size)
        print("Depth GT Size:", depth_gt.shape)

        print("Data Dictionary:", data_dict)

        json.dump(data_dict, jsonl_file)
        jsonl_file.write("\n")
        count += 1
        print(f"processed {count} images")

```
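
The script above stores, for each sampled pixel, the Euclidean distance from the camera centre rather than the raw z-depth. The conversion is a standard pinhole back-projection, and the sketch below reproduces it for a single pixel in pure Python. The function name `pixel_to_distance` and the intrinsics in the example are hypothetical values chosen for illustration.

```python
import math


def pixel_to_distance(u, v, z, fx, fy, cx, cy):
    """Back-project pixel (u, v) with z-depth z and return its distance to the camera.

    u is the column (x) coordinate and v is the row (y) coordinate, matching the
    [x, y] order written to "pixel_coords" in the script above.
    """
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return math.sqrt(x * x + y * y + z * z)


# Hypothetical intrinsics for illustration only.
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0

# At the principal point the distance equals the z-depth.
print(pixel_to_distance(320, 240, 2.0, fx, fy, cx, cy))

# Away from the principal point the distance is larger than the z-depth.
print(pixel_to_distance(820, 240, 2.0, fx, fy, cx, cy))
```

The gap between the two quantities grows towards the image borders, so labels built from z-depth and labels built from Euclidean distance are not interchangeable.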

## /utils/curate_argoverse.py

```py path="/utils/curate_argoverse.py" 
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.

# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.


import json
import logging
import os
import sys
from pathlib import Path
from typing import Final

import av2.rendering.color as color_utils
import av2.rendering.rasterize as raster_rendering_utils
import av2.rendering.video as video_utils
import av2.utils.io as io_utils
import av2.utils.raster as raster_utils

import click
import cv2
import numpy as np
from av2.datasets.sensor.av2_sensor_dataloader import AV2SensorDataLoader
from av2.datasets.sensor.constants import RingCameras
from av2.map.map_api import ArgoverseStaticMap
from av2.rendering.color import GREEN_HEX, RED_HEX
from av2.utils.typing import NDArrayByte, NDArrayFloat, NDArrayInt
from numpy import random
from PIL import Image

logger = logging.getLogger(__name__)


NUM_RANGE_BINS: Final[int] = 50
RING_CAMERA_FPS: Final[int] = 20


def get_immediate_subfolders(folder_path: str) -> list:
    """Return a list of immediate subfolders in the given folder path."""
    return [f.name for f in Path(folder_path).iterdir() if f.is_dir()]


if __name__ == "__main__":
    if len(sys.argv) != 4:
        print(
            "Usage: python script.py <root_folder> <out_image_folder> <jsonl_output_path>"
        )
        sys.exit(1)

    root_folder = sys.argv[1]
    out_image_folder = sys.argv[2]
    jsonl_output_path = sys.argv[3]

    frame_sample_interval = 1
    points_per_frame = 100 # by default we curate 100 labeled pixels per image which is more than enough for depth estimation, you can change this number to have more curated pixels
    cameras_used = [
        "ring_front_left",
        "ring_front_right",
        "ring_rear_left",
        "ring_rear_right",
        "ring_side_left",
        "ring_side_right",
        "ring_front_center",
        "stereo_front_left",
        "stereo_front_right",
    ]
    folders = get_immediate_subfolders(root_folder)
    print(f"there are {len(folders)} folders in total, the first one is {folders[0]}")

    loader = AV2SensorDataLoader(
        data_dir=Path(root_folder), labels_dir=Path(root_folder)
    )

    count = 0
    count_rows = 0
    with open(jsonl_output_path, "w") as f:
        skip_log_id = "d37be0e2-8223-3eeb-a0e2-c4b75d5ff87b"  # errors during my downloading for this log, comment if you dont have issues
        skip = False
        for log_id in folders[:2]:
            if skip and log_id != skip_log_id:
                continue
            skip = False
            print("log_id", log_id)
            # get the image file path
            for _, cam_name in enumerate(list(RingCameras)):
                if cam_name not in cameras_used:
                    print("skip ", cam_name, " camera")
                    continue
                cam_im_fpaths = loader.get_ordered_log_cam_fpaths(log_id, cam_name)

                # Sample every frame_sample_interval elements into a subset path list
                sampled_cam_im_fpaths = cam_im_fpaths[::frame_sample_interval]
                print("cam_im_fpaths = ", cam_im_fpaths)
                for i, im_fpath in enumerate(sampled_cam_im_fpaths):
                    try:
                        data_dict = {}
                        data_dict["image"] = str(im_fpath).replace(root_folder, "")
                        # get the object labels

                        cam_timestamp_ns = int(im_fpath.stem)
                        city_SE3_ego = loader.get_city_SE3_ego(log_id, cam_timestamp_ns)
                        if city_SE3_ego is None:
                            # no exception is active here, so log a plain warning
                            logger.warning("missing LiDAR pose")
                            continue

                        # load feather file path, e.g. "315978406032859416.feather"
                        lidar_fpath = loader.get_closest_lidar_fpath(
                            log_id, cam_timestamp_ns
                        )
                        if lidar_fpath is None:
                            logger.info(
                                "No LiDAR sweep found within the synchronization interval for %s, so skipping...",
                                cam_name,
                            )
                            continue

                        lidar_timestamp_ns = int(lidar_fpath.stem)

                        lidar_points_ego = io_utils.read_lidar_sweep(
                            lidar_fpath, attrib_spec="xyz"
                        )

                        (
                            uv,
                            points_cam,
                            is_valid_points,
                        ) = loader.project_ego_to_img_motion_compensated(
                            points_lidar_time=lidar_points_ego,
                            cam_name=cam_name,
                            cam_timestamp_ns=cam_timestamp_ns,
                            lidar_timestamp_ns=lidar_timestamp_ns,
                            log_id=log_id,
                        )

                        if is_valid_points is None or uv is None or points_cam is None:
                            continue

                        if is_valid_points.sum() == 0:
                            continue

                        uv_int: NDArrayInt = np.round(uv[is_valid_points]).astype(
                            np.int32
                        )  # image coordinates in pixels
                        points_cam = points_cam[
                            is_valid_points
                        ]  # 3d points in camera coordinates

                        # read the object bounding boxes and labels
                        cuboids = loader.get_labels_at_lidar_timestamp(
                            log_id, lidar_timestamp_ns
                        )

                        # convert to camera reference frame
                        # project cuboids to camera reference frame
                        pinhole_camera = loader.get_log_pinhole_camera(
                            log_id=log_id, cam_name=cam_name
                        )

                        city_SE3_ego_cam_t = loader.get_city_SE3_ego(
                            log_id=log_id, timestamp_ns=cam_timestamp_ns
                        )

                        # get transformation to bring point in egovehicle frame to city frame,
                        # at the time when the LiDAR sweep was recorded.
                        city_SE3_ego_lidar_t = loader.get_city_SE3_ego(
                            log_id=log_id, timestamp_ns=lidar_timestamp_ns
                        )

                        intrinsics = [
                            pinhole_camera.intrinsics.fx_px,
                            pinhole_camera.intrinsics.fy_px,
                            pinhole_camera.intrinsics.cx_px,
                            pinhole_camera.intrinsics.cy_px,
                            pinhole_camera.intrinsics.width_px,
                            pinhole_camera.intrinsics.height_px,
                        ]

                        # point clouds

                        # Ensure the number of points to sample does not exceed available points
                        num_points_to_sample = min(points_per_frame, len(uv_int))

                        # Randomly sample point indices without replacement
                        sampled_indices = np.random.choice(
                            len(uv_int), num_points_to_sample, replace=False
                        )

                        # Subset the uv_int and points_cam arrays
                        uv_int = uv_int[sampled_indices].tolist()
                        points_cam = points_cam[sampled_indices].tolist()

                        data_dict["intrinsics"] = intrinsics
                        data_dict["pixel_coords"] = uv_int
                        # Read the image file path as a PIL image
                        undistorted_pil_image = Image.open(im_fpath)

                        # Check if the focal length fx is greater than 1000
                        if intrinsics[0] > 1000:
                            # Calculate the scaling factor to make fx equal to 1000
                            scale_factor = 1000 / intrinsics[0]

                            # Rescale the undistorted_pil_image
                            new_width = int(undistorted_pil_image.width * scale_factor)
                            new_height = int(
                                undistorted_pil_image.height * scale_factor
                            )
                            undistorted_pil_image = undistorted_pil_image.resize(
                                (new_width, new_height), Image.LANCZOS
                            )

                            # Rescale the pixel coordinates
                            data_dict["pixel_coords"] = [
                                (int(x * scale_factor), int(y * scale_factor))
                                for x, y in data_dict["pixel_coords"]
                            ]

                            data_dict["intrinsics"] = [
                                1000.0,
                                1000.0,
                                data_dict["intrinsics"][2] * scale_factor,
                                data_dict["intrinsics"][3] * scale_factor,
                                undistorted_pil_image.width,
                                undistorted_pil_image.height,
                            ]

                        # Construct the full path for the output image
                        output_image_path = os.path.join(
                            out_image_folder, data_dict["image"].lstrip("/")
                        )

                        # Create the directory if it doesn't exist
                        os.makedirs(os.path.dirname(output_image_path), exist_ok=True)

                        # Save the (possibly rescaled) image; the format follows the output file extension
                        undistorted_pil_image.save(output_image_path)

                        data_dict["depth"] = []

                        for point_id in range(len(uv_int)):
                            data_dict["depth"].append(
                                (
                                    points_cam[point_id][0] ** 2
                                    + points_cam[point_id][1] ** 2
                                    + points_cam[point_id][2] ** 2
                                )
                                ** 0.5
                            )

                        f.write(f"{json.dumps(data_dict)}\n")
                        count_rows += 1
                        # exit()
                        count += 1
                        if count % 1000 == 0:
                            print("data_dict", data_dict)
                            print(
                                "processed ", count, " frames and ", count_rows, "rows"
                            )
                    except Exception as e:
                        print("error ", e)
                        break

```
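
The resize step at the top of the excerpt above rescales the pixel coordinates and the principal point together with the image. Below is a minimal sketch of that bookkeeping, assuming a single isotropic scale factor. The helper name `rescale_annotations` is hypothetical and not part of the repository. The script writes `fx` and `fy` as `1000.0` directly, which would be consistent with a scale factor of `1000 / fx`, but the definition of `scale_factor` lies outside this excerpt, so that reading is an assumption.

```python
def rescale_annotations(pixel_coords, intrinsics, scale):
    """Rescale pixel coordinates and pinhole intrinsics by one scale factor.

    intrinsics is [fx, fy, cx, cy, width, height], the layout used by the
    curation scripts. Hypothetical helper, for illustration only.
    """
    fx, fy, cx, cy, width, height = intrinsics
    # Truncate to integer pixels, as the script does
    new_coords = [(int(x * scale), int(y * scale)) for x, y in pixel_coords]
    new_intrinsics = [
        fx * scale,
        fy * scale,
        cx * scale,
        cy * scale,
        int(width * scale),
        int(height * scale),
    ]
    return new_coords, new_intrinsics


# Toy values: halving a 1920x1080 image with fx = fy = 2000
coords, intr = rescale_annotations(
    [(100, 200)], [2000.0, 2000.0, 960.0, 540.0, 1920, 1080], 0.5
)
```

Scaling the focal lengths and the principal point by the same factor as the image keeps the projection of every 3D point on the same image content after resizing.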

## /utils/curate_ddad.py

```py path="/utils/curate_ddad.py" 
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.

# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

import sys

import argparse
# Set up argument parser
parser = argparse.ArgumentParser(description="Process some files.")
parser.add_argument(
    "--out_json_path", type=str, help="output jsonl path"
)
parser.add_argument(
    "--out_image_dir", type=str, help="output image folder"
)
parser.add_argument(
    "--ddad_trainval_json_path", type=str, help="path to the ddad train val json path, i.e., ddad/ddad_train_val/ddad.json"
)
parser.add_argument(
    "--path_to_dgp_lib", type=str, help="dgp path"
)
args = parser.parse_args()

sys.path.insert(
    0,
    args.path_to_dgp_lib,
)  # add the dgp package path so the dgp imports below resolve

import cv2
import numpy as np
import PIL

from dgp.datasets.synchronized_dataset import SynchronizedSceneDataset
from dgp.proto.ontology_pb2 import Ontology
from dgp.utils.protobuf import open_pbobject
from dgp.utils.visualization_utils import visualize_semantic_segmentation_2d

# from IPython import display
from matplotlib.cm import get_cmap


plasma_color_map = get_cmap("plasma")


out_json_path = args.out_json_path
output_image_path = args.out_image_dir
points_per_image = 100

import os

# Remove the folder of output_image_path if it exists
if os.path.exists(output_image_path):
    import shutil

    shutil.rmtree(output_image_path)

# Ensure the output directory exists
os.makedirs(output_image_path, exist_ok=True)

# Define high level variables
DDAD_TRAIN_VAL_JSON_PATH = args.ddad_trainval_json_path
DATUMS = ["lidar"] + ["CAMERA_%02d" % idx for idx in [1, 5, 6, 7, 8, 9]]

# Load the val set
ddad_val = SynchronizedSceneDataset(
    DDAD_TRAIN_VAL_JSON_PATH,
    split="val",
    datum_names=DATUMS,
    generate_depth_from_datum="lidar",
)
print("Loaded DDAD val split containing {} samples".format(len(ddad_val)))

import json

# Open the out_json_path as a jsonl file for writing
with open(out_json_path, "w") as jsonl_file:
    count = 0
    # Iterate through the dataset.
    for sample in ddad_val:
        # Each sample contains a list of the requested datums.
        print("sample =", sample, "/", len(sample))

        for i in range(len(sample[0])):
            datum = sample[0][i]
            if "CAMERA" in datum["datum_name"]:
                data_dict = {}
                image_fname = f"{count}.jpg"
                data_dict["image"] = "val_images/" + image_fname

                print(datum["datum_name"], i)
                # point_cloud = lidar["point_cloud"]  # Nx3 numpy.ndarray
                image_01 = datum["rgb"]  # PIL.Image
                depth_01 = datum["depth"]  # (H,W) numpy.ndarray, generated from 'lidar'

                data_dict["intrinsics"] = [
                    float(datum["intrinsics"][0, 0]),
                    float(datum["intrinsics"][1, 1]),
                    float(datum["intrinsics"][0, 2]),
                    float(datum["intrinsics"][1, 2]),
                    image_01.size[0],  # Image width
                    image_01.size[1],  # Image height
                ]

                # print("image_01 = ", image_01, "; depth_01 = ", depth_01)
                # Find non-zero elements in depth_01
                non_zero_indices = np.nonzero(depth_01)
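                # Sample points_per_image valid pixels without replacement
                # (np.random.choice raises ValueError if fewer are available)
                random_indices = np.random.choice(
                    len(non_zero_indices[0]), size=points_per_image, replace=False
                )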
                non_zero_indices = (
                    non_zero_indices[0][random_indices],
                    non_zero_indices[1][random_indices],
                )
                non_zero_values = depth_01[non_zero_indices]

                data_dict["pixel_coords"] = []
                data_dict["depth"] = []
                # Collect pixel coordinates as [x, y] and their corresponding depth values
                for coord, value in zip(zip(*non_zero_indices), non_zero_values):
                    data_dict["pixel_coords"].append([int(coord[1]), int(coord[0])])
                    data_dict["depth"].append(float(value))
                    # print(f"Pixel coordinates: {coord}, Depth value: {value}")

                # # Calculate and print the minimum and maximum values in the non-zero depth values
                # min_depth = np.min(non_zero_values)
                # max_depth = np.max(non_zero_values)
                # print(
                #     f"Minimum depth value: {min_depth}, Maximum depth value: {max_depth}"
                # )
                print("data_dict = ", data_dict)
                json.dump(data_dict, jsonl_file)
                jsonl_file.write("\n")

                # Save image_01 to the specified path
                image_save_path = os.path.join(output_image_path, image_fname)
                image_01.save(image_save_path)
                count += 1
                print(f"processed {count} images")

                # breakpoint()
                # break

```
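
The DDAD script samples a fixed number of pixels with valid (non-zero) depth per image, which fails when an image has fewer valid pixels than requested. The following is a minimal sketch of a more forgiving variant, using only the standard library so it runs without NumPy. The function name `sample_valid_pixels` and the capping behaviour are assumptions for illustration, not the repository's method.

```python
import random


def sample_valid_pixels(depth_map, num_points, seed=None):
    """Sample up to num_points pixels whose depth is non-zero.

    depth_map is a list of rows (row-major, depth_map[y][x]).
    Returns a record with [x, y] pixel coordinates and their depths.
    Hypothetical helper, for illustration only.
    """
    rng = random.Random(seed)
    valid = [
        (x, y)
        for y, row in enumerate(depth_map)
        for x, d in enumerate(row)
        if d > 0
    ]
    # Cap the sample size instead of raising when too few pixels are valid
    k = min(num_points, len(valid))
    picked = rng.sample(valid, k)
    return {
        "pixel_coords": [[x, y] for x, y in picked],
        "depth": [float(depth_map[y][x]) for x, y in picked],
    }


# Toy 2x3 depth map with three valid pixels
toy_depth = [[0.0, 2.5, 0.0], [4.0, 0.0, 7.5]]
record = sample_valid_pixels(toy_depth, num_points=100, seed=0)
```

Whether to cap the sample or skip the image entirely is a design choice; the Matterport3D script further down skips images with fewer than 50 valid pixels.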

## /utils/curate_eth3d.py

```py path="/utils/curate_eth3d.py" 
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.

# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

import os

import numpy as np

import argparse
# Set up argument parser
parser = argparse.ArgumentParser(description="Process some files.")
parser.add_argument("--image_dir", type=str, default="/home/czptc2h/datasets/ETH3D/multi_view_training_dslr_jpg", help="image dir")
parser.add_argument("--depth_map_dir", type=str, default="/home/czptc2h/datasets/ETH3D/depth", help="depth map dir")
parser.add_argument(
    "--out_json_path", type=str, help="output jsonl path"
)
parser.add_argument(
    "--out_image_dir", type=str, help="output image folder"
)
args = parser.parse_args()


def get_image_paths(directory):
    image_paths = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.lower().endswith(
                (".png", ".jpg", ".jpeg", ".bmp", ".gif", ".tiff")
            ):
                image_paths.append(os.path.join(root, file))
    return image_paths


image_directory = args.image_dir
all_image_paths = get_image_paths(image_directory)
all_image_paths.sort(key=lambda x: os.path.basename(x))
print(all_image_paths[:10])

depth_directory = args.depth_map_dir
depth_image_paths = get_image_paths(depth_directory)
depth_image_paths.sort(key=lambda x: os.path.basename(x))
print(depth_image_paths[:10])


from PIL import Image

points_per_image = 100
out_json_path = args.out_json_path
out_image_path = args.out_image_dir
import shutil

if os.path.exists(out_image_path):
    shutil.rmtree(out_image_path)

os.makedirs(out_image_path)

import cv2


def undistort_fisheye(image, depth_image, camera_params):
    """Undistort an image and its depth map with OpenCV's fisheye model.

    Only k1..k4 are passed to OpenCV; the tangential (p1, p2) and thin-prism
    (sz1, sy1) terms in camera_params are parsed but not used.
    """
    fx, fy, cx, cy = map(float, camera_params[4:8])
    k1, k2, p1, p2, k3, k4, sz1, sy1 = map(float, camera_params[8:])

    width, height = image.size

    # Convert PIL image to numpy array
    image_np = np.array(image)

    # Camera matrix
    K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]])

    # Distortion coefficients
    D = np.array([k1, k2, k3, k4])

    # Undistort image using OpenCV
    h, w = image_np.shape[:2]
    map1, map2 = cv2.fisheye.initUndistortRectifyMap(
        K, D, np.eye(3), K, (w, h), cv2.CV_16SC2
    )
    undistorted_image_np = cv2.remap(
        image_np,
        map1,
        map2,
        interpolation=cv2.INTER_LINEAR,
        borderMode=cv2.BORDER_CONSTANT,
    )

    # Convert back to PIL image
    undistorted_image = Image.fromarray(undistorted_image_np)

    # Undistort the depth map with the same mapping. Nearest-neighbour
    # interpolation avoids blending depth values across object boundaries.
    depth_image_np = np.array(depth_image)
    undistorted_depth_image_np = cv2.remap(
        depth_image_np,
        map1,
        map2,
        interpolation=cv2.INTER_NEAREST,
        borderMode=cv2.BORDER_CONSTANT,
    )
    undistorted_depth_image = Image.fromarray(undistorted_depth_image_np)

    new_intrinsics = [fx, fy, cx, cy, width, height]
    return undistorted_image, undistorted_depth_image, new_intrinsics


import json

count = 0
with open(out_json_path, "w") as jsonl_file:
    for image_path, depth_path in zip(all_image_paths, depth_image_paths):
        image = Image.open(image_path)

        # image.save(os.path.join(out_image_path, "after_first_read.jpg"))
        with open(depth_path, "rb") as f:
            width, height = image.size
            depth_data = np.fromfile(f, dtype=np.float32, count=width * height)
            depth_image = depth_data.reshape((height, width))
        print(f"Loaded Image: {image_path}, Loaded Depth Map: {depth_path}")

        image_folder = os.path.dirname(os.path.dirname(os.path.dirname(image_path)))
        dslr_calibration_folder = os.path.join(image_folder, "dslr_calibration_jpg")
        corresponding_camera_file = os.path.join(dslr_calibration_folder, "cameras.txt")
        if os.path.exists(corresponding_camera_file):
            print(
                f"Found corresponding camera file: {corresponding_camera_file} for image: {image_path}"
            )

        with open(corresponding_camera_file, "r") as camera_file:
            for line in camera_file:
                if not line.startswith("#"):
                    camera_params = line.strip().split(" ")
                    break
        print(f"Camera Parameters: {camera_params}")

        # Undistort image and depth_image
        fx, fy, cx, cy = map(float, camera_params[4:8])
        if camera_params[1] == "THIN_PRISM_FISHEYE":
            k1, k2, p1, p2, k3, k4, sz1, sy1 = map(float, camera_params[8:])

            # Call the function
            image, depth_image, new_intrinsics = undistort_fisheye(
                image, depth_image, camera_params
            )

        else:
            print("Camera model not supported")
            continue

        # Downscale the image so that its height is at most 1280 pixels (never upscale)
        scale_factor = min(1280.0 / height, 1)
        new_width = int(width * scale_factor)
        new_height = int(height * scale_factor)
        image = image.resize((new_width, new_height))
        depth_image = np.array(
            depth_image
        )  # keep the depth map at full resolution; only the pixel coordinates are rescaled

        # Calculate new intrinsics
        fx *= scale_factor
        fy *= scale_factor
        cx *= scale_factor
        cy *= scale_factor
        new_intrinsics = [fx, fy, cx, cy, new_width, new_height]

        data_dict = {}
        data_dict["image"] = image_path.replace(
            image_directory+"/", ""
        ).replace(".png", ".jpg")
        data_dict["intrinsics"] = [
            float(new_intrinsics[0]),
            float(new_intrinsics[1]),
            float(new_intrinsics[2]),
            float(new_intrinsics[3]),
            int(new_intrinsics[4]),
            int(new_intrinsics[5]),
        ]

        data_dict["pixel_coords"] = []
        data_dict["depth"] = []

        valid_indices = np.argwhere(
            (depth_image > 1e-4)
            & (depth_image < 1e6)
            & (np.arange(depth_image.shape[0])[:, None] > 10)
            & (np.arange(depth_image.shape[0])[:, None] < depth_image.shape[0] - 10)
            & (np.arange(depth_image.shape[1])[None, :] > 10)
            & (np.arange(depth_image.shape[1])[None, :] < depth_image.shape[1] - 10)
        )

        sampled_indices = valid_indices[
            np.random.choice(valid_indices.shape[0], points_per_image, replace=False)
        ]

        for y, x in sampled_indices:
            x_ori = x
            y_ori = y
            x = int(x * scale_factor)
            y = int(y * scale_factor)
            data_dict["pixel_coords"].append([int(x), int(y)])
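            fx, fy, cx, cy, width, height = data_dict["intrinsics"]
            # Convert z-depth to Euclidean distance: scale by the length of the
            # viewing ray through the (rescaled) pixel.
            x_normalized = (x - cx) / fx
            y_normalized = (y - cy) / fy
            z = float(depth_image[y_ori, x_ori])
            euclidean_distance = np.sqrt(x_normalized**2 + y_normalized**2 + 1) * z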
            data_dict["depth"].append(float(euclidean_distance))

        # Save the resized image into out_image_path

        resized_image_path = os.path.join(out_image_path, data_dict["image"])
        print("resized_image_path", resized_image_path)

        os.makedirs(os.path.dirname(resized_image_path), exist_ok=True)
        image.save(resized_image_path)

        print("Data Dictionary:", data_dict)

        json.dump(data_dict, jsonl_file)
        jsonl_file.write("\n")
        count += 1
        print(f"processed {count} images")

```
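
The ETH3D script stores Euclidean distance rather than z-depth. For a pinhole camera, a pixel `(x, y)` with z-depth `z` back-projects to the point `z * ((x - cx) / fx, (y - cy) / fy, 1)`, so its distance from the camera centre is `z` times the length of that ray. The following is a minimal sketch of the conversion; the function name `z_depth_to_distance` is hypothetical and the numbers are toy values.

```python
import math


def z_depth_to_distance(x, y, z, fx, fy, cx, cy):
    """Convert z-depth at pixel (x, y) to Euclidean distance from the camera centre."""
    x_n = (x - cx) / fx  # normalized image coordinates
    y_n = (y - cy) / fy
    return z * math.sqrt(x_n**2 + y_n**2 + 1.0)


# At the principal point the ray is the optical axis, so distance equals z-depth
d_center = z_depth_to_distance(50, 50, 2.0, fx=100.0, fy=100.0, cx=50.0, cy=50.0)

# Off-centre: normalized coordinates (2, 2) give a ray of length sqrt(4 + 4 + 1) = 3
d_corner = z_depth_to_distance(250, 250, 2.0, fx=100.0, fy=100.0, cx=50.0, cy=50.0)
```

The gap between the two quantities grows towards the image borders, which is why the conversion matters more for wide field-of-view cameras.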

## /utils/curate_matterport3d.py

```py path="/utils/curate_matterport3d.py" 
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.

# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.


import json, random
import os
import sys

import numpy as np
import torch
from PIL import Image

import argparse
# Set up argument parser
parser = argparse.ArgumentParser(description="Process some files.")
parser.add_argument("--dataroot", type=str, default="/home/czptc2h/datasets/matterport", help="data root")
parser.add_argument(
    "--out_json_path", type=str, help="output jsonl path"
)
parser.add_argument(
    "--out_image_dir", type=str, help="output image folder"
)
args = parser.parse_args()

root = args.dataroot

def get_all_image_paths(root):
    image_paths = []
    for subdir, _, files in os.walk(root):
        if "undistorted_color_images" in subdir:
            for file in files:
                if file.endswith((".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".gif")):
                    image_paths.append(os.path.join(subdir, file))
    return image_paths


def get_all_file_paths(root, folder_name, file_extensions=(".png",)):
    image_paths = []
    for subdir, _, files in os.walk(root):
        if folder_name in subdir:
            for file in files:
                if file.endswith(file_extensions):
                    image_paths.append(os.path.join(subdir, file))
    return image_paths


all_image_paths = sorted(get_all_image_paths(root))
all_depth_paths = sorted(get_all_file_paths(root, "undistorted_depth_images", (".png")))
all_calib_paths = sorted(
    get_all_file_paths(root, "undistorted_camera_parameters", (".conf"))
)

calib_map_dict = {}
for calib_path in all_calib_paths:
    folder_name = os.path.relpath(calib_path, root).split(os.sep)[0]
    calib_map_dict[folder_name] = calib_path

points_per_image = 100
out_json_path = args.out_json_path
# Create the directory for out_json_path if it doesn't exist
os.makedirs(os.path.dirname(out_json_path), exist_ok=True)

out_image_path = args.out_image_dir
import shutil

if os.path.exists(out_image_path):
    shutil.rmtree(out_image_path)
os.makedirs(out_image_path)

count = 0
with open(out_json_path, "w") as jsonl_file:
    for image_path, depth_path in zip(all_image_paths, all_depth_paths):
        folder_name = os.path.relpath(image_path, root).split(os.sep)[0]
        calib_path = calib_map_dict[folder_name]
        # Extract the base filename from the image_path
        base_filename = os.path.basename(image_path)

        # Initialize variables to store the intrinsics matrix
        intrinsics_matrix = None

        # Read the calibration file once, then look up the intrinsics_matrix line
        # that precedes the scan line mentioning this image
        with open(calib_path, "r") as calib_file:
            lines = calib_file.readlines()
        for i, l in enumerate(lines):
            if base_filename in l:
                # Walk backwards to the nearest preceding intrinsics_matrix line
                for j in range(i - 1, -1, -1):
                    intrinsics_matrix_line = lines[j]
                    if "intrinsics_matrix" in intrinsics_matrix_line:
                        # Extract the values after 'intrinsics_matrix'
                        intrinsics_matrix = list(
                            map(float, intrinsics_matrix_line.split()[1:])
                        )
                        break

        if not intrinsics_matrix:
            print(f"Intrinsics Matrix not found for {base_filename}")
            continue

        data_dict = {}
        data_dict["image_path"] = folder_name + "/" + base_filename

        # Read the image at image_path as a PIL image
        pil_image = Image.open(image_path)
        fx = intrinsics_matrix[0]
        fy = intrinsics_matrix[4]
        cx = intrinsics_matrix[2]
        cy = intrinsics_matrix[5]

        data_dict["intrinsics"] = [
            fx,
            fy,
            cx,
            cy,
            pil_image.width,
            pil_image.height,
        ]
        # Read the depth image at depth_path as a PIL image
        depth_pil_image = Image.open(depth_path)

        # Convert depth image to numpy array for easier manipulation
        depth_array = np.array(depth_pil_image)

        # Get the coordinates where depth is not 0
        non_zero_coords = np.argwhere(depth_array > 0)

        # Randomly sample 2 * points_per_image coordinates
        sample_size = min(len(non_zero_coords), 2 * points_per_image)
        if sample_size < 50:
            continue
        if len(non_zero_coords) < 2 * points_per_image:
            print(
                f"Population size: {len(non_zero_coords)} is smaller than required sample size: {2 * points_per_image}"
            )

        sampled_coords = random.sample(list(non_zero_coords), sample_size)

        # Extract intrinsics from data_dict
        fx, fy, cx, cy, width, height = data_dict["intrinsics"]

        # Initialize a list to store the Euclidean distance of each sampled point
        euclidean_distances = []

        # Iterate over the sampled coordinates
        for coord in sampled_coords:
            y, x = coord
            # Get the depth value at the sampled coordinate; Matterport3D stores
            # depth in units of 0.25 mm, so dividing by 4000 yields meters
            depth = depth_array[y, x] / 4000.0

            x_real = (x - cx) * depth / fx
            y_real = (y - cy) * depth / fy
            z_real = depth
            # Calculate the Euclidean distance
            euclidean_distances.append(
                float(np.sqrt(x_real**2 + y_real**2 + z_real**2))
            )

        # Filter and collect the first points_per_image elements that satisfy the conditions
        filtered_coords = []
        filtered_distances = []

        for coord, distance in zip(sampled_coords, euclidean_distances):
            if 0.05 <= distance <= 50:
                filtered_coords.append([int(coord[1]), int(coord[0])])  # [x, y] format
                filtered_distances.append(distance)
                if len(filtered_coords) == points_per_image:
                    break

        # Set the data_dict values
        data_dict["pixel_coords"] = filtered_coords
        data_dict["depth"] = filtered_distances

        # Save the pil_image to the relative path of data_dict["image_path"] under out_image_path
        relative_image_path = os.path.join(out_image_path, data_dict["image_path"])
        os.makedirs(os.path.dirname(relative_image_path), exist_ok=True)
        pil_image.save(relative_image_path)

        # Write data_dict into jsonl_file
        jsonl_file.write(json.dumps(data_dict) + "\n")

        count += 1
        if count % 1000 == 0:
            print(f"Iteration: {count}")
            print(f"Data Dictionary: {data_dict}")

```
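
The calibration lookup in the Matterport3D script finds, for each image, the nearest preceding `intrinsics_matrix` line in the `.conf` file. The following is a minimal single-pass sketch of the same idea. The function name `find_intrinsics` is hypothetical, and the toy lines only mimic the layout the script expects (an `intrinsics_matrix` line followed by scan lines); the file names and values are made up. Unlike the script, this sketch returns on the first matching line.

```python
def find_intrinsics(lines, image_name):
    """Return (fx, fy, cx, cy) for image_name, or None if it is not listed.

    Tracks the most recent intrinsics_matrix line while scanning, so no
    backward search is needed. The nine values are a row-major 3x3 matrix.
    """
    current = None
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "intrinsics_matrix":
            current = [float(t) for t in tokens[1:]]
        elif image_name in line and current is not None:
            # Row-major K: fx at index 0, cx at 2, fy at 4, cy at 5
            return current[0], current[4], current[2], current[5]
    return None


# Toy calibration content with made-up names and values
toy_conf = [
    "intrinsics_matrix 1000 0 640 0 1001 512 0 0 1",
    "scan a_d0_0.png a_i0_0.jpg 1 0 0 0",
    "intrinsics_matrix 900 0 630 0 901 500 0 0 1",
    "scan a_d1_0.png a_i1_0.jpg 1 0 0 0",
]
found = find_intrinsics(toy_conf, "a_i1_0.jpg")
missing = find_intrinsics(toy_conf, "not_listed.jpg")
```

Parsing the file once per scene and caching the result per image name would avoid re-reading the calibration file for every image.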

## /utils/curate_nuscenes_eval.py

```py path="/utils/curate_nuscenes_eval.py" 
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.

# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

import json, os

import cv2
import numpy as np

from nuscenes.nuscenes import NuScenes
from nuscenes.utils.data_classes import LidarPointCloud
from PIL import Image, ImageDraw
from pyquaternion import Quaternion

import argparse
# Set up argument parser
parser = argparse.ArgumentParser(description="Process some files.")
parser.add_argument("--dataroot_mini", type=str, default="/home/czptc2h/datasets/nuscenes", help="data root mini")
parser.add_argument(
    "--out_json_path", type=str, help="output jsonl path"
)
parser.add_argument(
    "--out_image_dir", type=str, help="output image folder"
)
args = parser.parse_args()

# Initialize the NuScenes dataset
dataroot = args.dataroot_mini
version = "v1.0-trainval"  # full trainval split, despite the --dataroot_mini flag name
nusc = NuScenes(version=version, dataroot=dataroot, verbose=True)


def map_pointcloud_to_image(pointcloud, camera_token):
    """
    Map pointcloud to the image plane.

    Args:
        pointcloud: sample_data record (dict) of the LiDAR sweep
        camera_token: Token of the camera sample data

    Returns:
        points_img: (N, 2) points in image coordinates
        depths: (N,) z-depth values in the camera frame
        im: the camera image as a BGR array loaded with cv2.imread
    """
    cam = nusc.get("sample_data", camera_token)
    cam_path = os.path.join(nusc.dataroot, cam["filename"])
    im = cv2.imread(cam_path)

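    # Get sensor calibration data. calibrated_sensor poses are relative to the
    # ego vehicle frame, which the variable names below call "world"; ego-pose
    # differences between the lidar and camera timestamps are not compensated.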
    lidar_to_world = nusc.get(
        "calibrated_sensor", pointcloud["calibrated_sensor_token"]
    )
    lidar_rotation = Quaternion(lidar_to_world["rotation"])
    lidar_translation = np.array(lidar_to_world["translation"])

    cam_to_world = nusc.get("calibrated_sensor", cam["calibrated_sensor_token"])
    cam_intrinsic = np.array(cam_to_world["camera_intrinsic"])
    cam_rotation = Quaternion(cam_to_world["rotation"])
    cam_translation = np.array(cam_to_world["translation"])

    # Transform points from lidar to world coordinate
    pc = LidarPointCloud.from_file(os.path.join(nusc.dataroot, pointcloud["filename"]))
    points = pc.points[:3, :]
    points = np.vstack((points, np.ones(points.shape[1])))

    # Transformation matrix from lidar to world coordinate
    lidar_to_world_matrix = np.eye(4)
    lidar_to_world_matrix[:3, :3] = lidar_rotation.rotation_matrix
    lidar_to_world_matrix[:3, 3] = lidar_translation

    # Transformation matrix from world to camera coordinate
    world_to_cam_matrix = np.eye(4)
    world_to_cam_matrix[:3, :3] = cam_rotation.rotation_matrix.T
    world_to_cam_matrix[:3, 3] = -np.dot(
        cam_rotation.rotation_matrix.T, cam_translation
    )

    # Transform points to camera coordinate
    points_cam = np.dot(world_to_cam_matrix, np.dot(lidar_to_world_matrix, points))

    # Only keep points in front of the camera
    mask = points_cam[2, :] > 0
    points_cam = points_cam[:, mask]

    # Project to image plane
    points_img = np.dot(cam_intrinsic, points_cam[:3, :])
    points_img = points_img / points_img[2, :]
    points_img = points_img[:2, :]

    # Get depths
    depths = points_cam[2, :].copy()

    return points_img.T, depths, im


def create_depth_map(points_img, depths, image_shape):
    """
    Create a depth map from projected points.

    Args:
        points_img: Points in image coordinates
        depths: Depth values
        image_shape: Shape of the image (height, width)

    Returns:
        depth_map: Depth map as a 2D numpy array
    """
    depth_map = np.zeros((image_shape[0], image_shape[1]))

    # Keep only points that fall within the image
    mask = np.logical_and.reduce(
        [
            points_img[:, 0] >= 0,
            points_img[:, 0] < image_shape[1],
            points_img[:, 1] >= 0,
            points_img[:, 1] < image_shape[0],
        ]
    )

    points_img = points_img[mask]
    depths = depths[mask]

    # Convert to integers for indexing
    points_int = np.floor(points_img).astype(np.int32)

    # Populate depth map
    for i in range(points_int.shape[0]):
        x, y = points_int[i, 0], points_int[i, 1]
        if depth_map[y, x] == 0 or depths[i] < depth_map[y, x]:
            depth_map[y, x] = depths[i]

    return depth_map


CAMERA_NAMES = [
    "CAM_FRONT",
    "CAM_FRONT_RIGHT",
    "CAM_BACK_RIGHT",
    "CAM_BACK",
    "CAM_BACK_LEFT",
    "CAM_FRONT_LEFT",
]


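def process_sample(sample_idx, output_folder, camera_name):
    """
    Process a single sample from the nuScenes dataset.

    Args:
        sample_idx: Index of the sample
        output_folder: Folder to save the image
        camera_name: Camera channel to process, e.g. "CAM_FRONT"

    Returns:
        data_dict: image file name, intrinsics, sampled pixel coordinates and depths
    """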
    data_dict = {}
    # Get sample
    sample = nusc.sample[sample_idx]

    # Get camera sample data
    camera_token = sample["data"][camera_name]
    # camera_data = nusc.get("sample_data", camera_token)

    # Get LiDAR sample data
    lidar_token = sample["data"]["LIDAR_TOP"]
    lidar_data = nusc.get("sample_data", lidar_token)

    # Map pointcloud to image and create depth map
    points_img, depths, image = map_pointcloud_to_image(lidar_data, camera_token)
    depth_map = create_depth_map(points_img, depths, (image.shape[0], image.shape[1]))
    # Read out camera intrinsic information
    cam_intrinsic = nusc.get(
        "calibrated_sensor",
        nusc.get("sample_data", camera_token)["calibrated_sensor_token"],
    )["camera_intrinsic"]
    print("Camera intrinsic matrix for", camera_name, ":", cam_intrinsic)
    # Print indices of the depth map that are not 0
    non_zero_indices = np.argwhere(depth_map != 0)
    print("Non-zero depth map indices:", non_zero_indices)
    # Print the size of the depth map and image
    print("Depth map size:", depth_map.shape)
    print("Image size:", image.shape)

    # Save the image to the output folder (the depth map itself is not written to disk)
    img_filename = f"{sample_idx:06d}_{camera_name}_image.jpg"
    cv2.imwrite(os.path.join(output_folder, img_filename), image)
    data_dict["image"] = img_filename
    data_dict["intrinsics"] = [
        cam_intrinsic[0][0],
        cam_intrinsic[1][1],
        cam_intrinsic[0][2],
        cam_intrinsic[1][2],
        image.shape[1],  # Image width
        image.shape[0],  # Image height
    ]
    # Randomly sample 100 pixels with non-zero depth map values
    non_zero_indices = np.argwhere(depth_map != 0)
    sampled_indices = non_zero_indices[
        np.random.choice(non_zero_indices.shape[0], 100, replace=False)
    ]

    # Store their pixel coordinates and depth values into lists
    data_dict["pixel_coords"] = [[int(x), int(y)] for y, x in sampled_indices]
    data_dict["depth"] = [depth_map[y, x] for y, x in sampled_indices]

    return data_dict


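def process_multiple_samples(
    num_samples=5, output_folder="output", json_path="test.json", is_val=False
):
    """
    Process multiple samples from the dataset.

    Args:
        num_samples: Number of random samples to process, one random camera each;
            -1 processes every camera of every sample in the chosen split
        output_folder: Folder to save the images (deleted and recreated on each run)
        json_path: Output JSON Lines path
        is_val: With num_samples=-1, use the last 5% of samples instead of the first 95%
    """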

    # Check if the output folder exists and delete it if it does
    if os.path.exists(output_folder):
        import shutil

        shutil.rmtree(output_folder)

    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    with open(json_path, "w") as f:
        if num_samples == -1:
            line_count = 0
            sample_range = (
                range(int(len(nusc.sample) * 0.95))
                if not is_val
                else range(int(len(nusc.sample) * 0.95), len(nusc.sample))
            )
            print("sample_range = ", sample_range)
            for i in sample_range:
                print(f"Processing sample {i}")
                for camera_name in CAMERA_NAMES:
                    entry = process_sample(i, output_folder, camera_name)
                    # Save meta_data_json to a JSON Lines file
                    json.dump(entry, f)
                    f.write("\n")
                    line_count += 1
            print(f"Total lines processed: {line_count}")
        else:
            for i in np.random.choice(
                len(nusc.sample), min(num_samples, len(nusc.sample)), replace=False
            ):
                print(f"Processing sample {i}")
                camera_name = np.random.choice(CAMERA_NAMES)
                entry = process_sample(i, output_folder, camera_name)
                # Save meta_data_json to a JSON Lines file
                json.dump(entry, f)
                f.write("\n")


# Process every camera of the held-out last 5% of samples and write them to the output folder
process_multiple_samples(
    num_samples=-1,
    output_folder=args.out_image_dir,
    json_path=args.out_json_path,
    is_val=True,
)

```
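
The nuScenes script projects camera-frame LiDAR points with the pinhole model and keeps the nearest depth when several points land on the same pixel. The following is a minimal sketch of that z-buffer step in plain Python. The function name `project_to_depth_map` and the sparse dictionary output are assumptions for illustration; the script itself fills a dense NumPy array.

```python
import math


def project_to_depth_map(points_cam, intrinsics, width, height):
    """Project camera-frame points and keep the nearest z-depth per pixel.

    points_cam is an iterable of (X, Y, Z) in the camera frame;
    intrinsics is (fx, fy, cx, cy). Returns {(u, v): depth}.
    """
    fx, fy, cx, cy = intrinsics
    nearest = {}
    for X, Y, Z in points_cam:
        if Z <= 0:
            continue  # behind the camera
        u = int(math.floor(fx * X / Z + cx))
        v = int(math.floor(fy * Y / Z + cy))
        if 0 <= u < width and 0 <= v < height:
            key = (u, v)
            # z-buffer: a closer point overrides a farther one on the same pixel
            if key not in nearest or Z < nearest[key]:
                nearest[key] = Z
    return nearest


# Toy points: two on the optical axis, one behind the camera, one outside the image
toy_points = [(0.0, 0.0, 10.0), (0.0, 0.0, 5.0), (0.0, 0.0, -1.0), (100.0, 0.0, 1.0)]
sparse_depth = project_to_depth_map(toy_points, (100.0, 100.0, 50.0, 50.0), 100, 100)
```

A sparse mapping is convenient when only a handful of pixels are sampled afterwards; a dense array, as in the script, is simpler when the whole map is needed.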

## /utils/curate_nuscenes_train.py

```py path="/utils/curate_nuscenes_train.py" 
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.

# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

import json, os

import cv2
import numpy as np

from nuscenes.nuscenes import NuScenes
from nuscenes.utils.data_classes import LidarPointCloud
from PIL import Image, ImageDraw
from pyquaternion import Quaternion

import argparse
# Set up argument parser
parser = argparse.ArgumentParser(description="Process some files.")
parser.add_argument("--dataroot", type=str, default="/home/czptc2h/datasets/nuscenes_full", help="data root")
parser.add_argument("--dataroot_mini", type=str, default="/home/czptc2h/datasets/nuscenes", help="data root mini")
parser.add_argument(
    "--out_json_path", type=str, help="output jsonl path"
)
parser.add_argument(
    "--out_image_dir", type=str, help="output image folder"
)
args = parser.parse_args()

# Initialize the NuScenes dataset
dataroot = args.dataroot
version = "v1.0-trainval"  # Using the full trainval version
nusc = NuScenes(version=version, dataroot=dataroot, verbose=True)

dataroot_mini = args.dataroot_mini
version_mini = "v1.0-mini"  # Using mini version
nusc_mini = NuScenes(version=version_mini, dataroot=dataroot_mini, verbose=True)
# Extract the sorted list of scene tokens in nusc_mini
nusc_mini_scene_tokens = sorted([scene["token"] for scene in nusc_mini.scene])

def map_pointcloud_to_image(pointcloud, camera_token):
    """
    Map pointcloud to the image plane.

    Args:
        pointcloud: sample_data record (dict) of the LiDAR sweep
        camera_token: Token of the camera sample data

    Returns:
        points_img: (N, 2) points in image coordinates
        depths: (N,) z-depth values in the camera frame
        im: the camera image as a BGR array loaded with cv2.imread
    """
    cam = nusc.get("sample_data", camera_token)
    cam_path = os.path.join(nusc.dataroot, cam["filename"])
    im = cv2.imread(cam_path)

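    # Get sensor calibration data. calibrated_sensor poses are relative to the
    # ego vehicle frame, which the variable names below call "world"; ego-pose
    # differences between the lidar and camera timestamps are not compensated.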
    lidar_to_world = nusc.get(
        "calibrated_sensor", pointcloud["calibrated_sensor_token"]
    )
    lidar_rotation = Quaternion(lidar_to_world["rotation"])
    lidar_translation = np.array(lidar_to_world["translation"])

    cam_to_world = nusc.get("calibrated_sensor", cam["calibrated_sensor_token"])
    cam_intrinsic = np.array(cam_to_world["camera_intrinsic"])
    cam_rotation = Quaternion(cam_to_world["rotation"])
    cam_translation = np.array(cam_to_world["translation"])

    # Transform points from lidar to world coordinate
    pc = LidarPointCloud.from_file(os.path.join(nusc.dataroot, pointcloud["filename"]))
    points = pc.points[:3, :]
    points = np.vstack((points, np.ones(points.shape[1])))

    # Transformation matrix from lidar to world coordinate
    lidar_to_world_matrix = np.eye(4)
    lidar_to_world_matrix[:3, :3] = lidar_rotation.rotation_matrix
    lidar_to_world_matrix[:3, 3] = lidar_translation

    # Transformation matrix from world to camera coordinate
    world_to_cam_matrix = np.eye(4)
    world_to_cam_matrix[:3, :3] = cam_rotation.rotation_matrix.T
    world_to_cam_matrix[:3, 3] = -np.dot(
        cam_rotation.rotation_matrix.T, cam_translation
    )

    # Transform points to camera coordinate
    points_cam = np.dot(world_to_cam_matrix, np.dot(lidar_to_world_matrix, points))

    # Only keep points in front of the camera
    mask = points_cam[2, :] > 0
    points_cam = points_cam[:, mask]

    # Project to image plane
    points_img = np.dot(cam_intrinsic, points_cam[:3, :])
    points_img = points_img / points_img[2, :]
    points_img = points_img[:2, :]

    # Get depths
    depths = points_cam[2, :].copy()

    return points_img.T, depths, im


def create_depth_map(points_img, depths, image_shape):
    """
    Create a depth map from projected points.

    Args:
        points_img: Points in image coordinates
        depths: Depth values
        image_shape: Shape of the image (height, width)

    Returns:
        depth_map: Depth map as a 2D numpy array
    """
    depth_map = np.zeros((image_shape[0], image_shape[1]))

    # Keep only points that fall within the image
    mask = np.logical_and.reduce(
        [
            points_img[:, 0] >= 0,
            points_img[:, 0] < image_shape[1],
            points_img[:, 1] >= 0,
            points_img[:, 1] < image_shape[0],
        ]
    )

    points_img = points_img[mask]
    depths = depths[mask]

    # Convert to integers for indexing
    points_int = np.floor(points_img).astype(np.int32)

    # Populate depth map
    for i in range(points_int.shape[0]):
        x, y = points_int[i, 0], points_int[i, 1]
        if depth_map[y, x] == 0 or depths[i] < depth_map[y, x]:
            depth_map[y, x] = depths[i]

    return depth_map


CAMERA_NAMES = [
    "CAM_FRONT",
    "CAM_FRONT_RIGHT",
    "CAM_BACK_RIGHT",
    "CAM_BACK",
    "CAM_BACK_LEFT",
    "CAM_FRONT_LEFT",
]


def process_sample(sample_idx, output_folder, camera_name):
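    """
    Process a single sample from the nuScenes dataset.

    Args:
        sample_idx: Index of the sample
        output_folder: Folder to save the image
        camera_name: Camera channel to use, e.g. "CAM_FRONT"

    Returns:
        data_dict: Dict with the image filename, intrinsics, sampled pixel
            coordinates, and their depths
    """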
    data_dict = {}
    # Get sample
    sample = nusc.sample[sample_idx]

    # Get camera sample data
    camera_token = sample["data"][camera_name]
    # camera_data = nusc.get("sample_data", camera_token)

    # Get LiDAR sample data
    lidar_token = sample["data"]["LIDAR_TOP"]
    lidar_data = nusc.get("sample_data", lidar_token)

    # Map pointcloud to image and create depth map
    points_img, depths, image = map_pointcloud_to_image(lidar_data, camera_token)
    depth_map = create_depth_map(points_img, depths, (image.shape[0], image.shape[1]))
    # Read out camera intrinsic information
    cam_intrinsic = nusc.get(
        "calibrated_sensor",
        nusc.get("sample_data", camera_token)["calibrated_sensor_token"],
    )["camera_intrinsic"]

    # Save image and depth map to output folder
    img_filename = f"{sample_idx:06d}_{camera_name}_image.jpg"
    # depth_filename = f"{sample_idx:06d}_depth.png"
    cv2.imwrite(os.path.join(output_folder, img_filename), image)
    data_dict["image"] = img_filename
    data_dict["intrinsics"] = [
        cam_intrinsic[0][0],
        cam_intrinsic[1][1],
        cam_intrinsic[0][2],
        cam_intrinsic[1][2],
        image.shape[1],  # Image width
        image.shape[0],  # Image height
    ]
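    # Randomly sample up to 100 pixels with non-zero depth map values
    # (fewer if the frame has fewer than 100 valid pixels)
    non_zero_indices = np.argwhere(depth_map != 0)
    num_points = min(100, non_zero_indices.shape[0])
    sampled_indices = non_zero_indices[
        np.random.choice(non_zero_indices.shape[0], num_points, replace=False)
    ]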

    # Store their pixel coordinates and depth values into lists
    data_dict["pixel_coords"] = [[int(x), int(y)] for y, x in sampled_indices]
    data_dict["depth"] = [depth_map[y, x] for y, x in sampled_indices]

    return data_dict


def process_multiple_samples(
    num_samples=5, output_folder="output", json_path="test.json"
):
    """
    Process multiple samples from the dataset and write one JSON object per line.

    Args:
        num_samples: Number of random samples to process; -1 processes every
            sample (all six cameras) except those whose scene is in v1.0-mini
        output_folder: Folder to save the images (deleted and recreated)
        json_path: Path of the output JSON Lines file
    """

    # Start from an empty output folder
    if os.path.exists(output_folder):
        import shutil

        shutil.rmtree(output_folder)
    os.makedirs(output_folder)

    with open(json_path, "w") as f:
        if num_samples == -1:
            line_count = 0
            sample_range = range(len(nusc.sample))
            print("sample_range = ", sample_range)
            for i in sample_range:
                # breakpoint()
                if nusc.sample[i]["scene_token"] in nusc_mini_scene_tokens:
                    print("Skipping sample ", i, " as it is in nusc_mini")
                    continue
                if i % 1000 == 0:
                    print(f"Processing sample {i}")
                for camera_name in CAMERA_NAMES:
                    entry = process_sample(i, output_folder, camera_name)
                    # Save meta_data_json to a JSON Lines file
                    json.dump(entry, f)
                    f.write("\n")
                    line_count += 1
            print(f"Total lines processed: {line_count}")
        else:
            for i in np.random.choice(
                len(nusc.sample), min(num_samples, len(nusc.sample)), replace=False
            ):
                print(f"Processing sample {i}")
                camera_name = np.random.choice(CAMERA_NAMES)
                entry = process_sample(i, output_folder, camera_name)
                # Save meta_data_json to a JSON Lines file
                json.dump(entry, f)
                f.write("\n")


# Process all samples (num_samples=-1) and save the images to args.out_image_dir
process_multiple_samples(
    num_samples=-1,
    output_folder=args.out_image_dir,
    json_path=args.out_json_path,
)

```
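
The `create_depth_map` function above keeps the nearest LiDAR return per pixel using a Python loop. A minimal vectorized sketch of the same z-buffer rule might look like the following; `create_depth_map_vectorized` is a hypothetical name that does not appear in the repository, and the sketch assumes only NumPy.

```python
import numpy as np


def create_depth_map_vectorized(points_img, depths, image_shape):
    """Rasterize projected points, keeping the smallest depth per pixel.

    points_img: (N, 2) array of (x, y) pixel coordinates
    depths: (N,) array of positive depths
    image_shape: (height, width)
    """
    h, w = image_shape
    # Start from +inf so that np.minimum.at can reduce to the nearest depth
    depth_map = np.full((h, w), np.inf, dtype=np.float64)

    cols = np.floor(points_img[:, 0]).astype(np.int64)
    rows = np.floor(points_img[:, 1]).astype(np.int64)
    inside = (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)

    # Unbuffered in-place minimum: handles several points landing on one pixel
    np.minimum.at(depth_map, (rows[inside], cols[inside]), depths[inside])

    # Pixels that received no point are marked with 0, as in the loop version
    depth_map[np.isinf(depth_map)] = 0.0
    return depth_map
```

The use of `np.minimum.at` rather than plain fancy-index assignment is deliberate: with repeated indices, ordinary assignment keeps an arbitrary value instead of the minimum.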

## /utils/curate_scannet.py

```py path="/utils/curate_scannet.py" 
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.

# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

"""
Curate ScanNet++ iPhone data for depth estimation.

For each scene, extract RGB frames, video masks, and depth frames from the
iPhone capture, sample valid pixels per frame, and write one JSON record per
sampled frame (image path, intrinsics, pixel coordinates, depths).
"""

import argparse
import json
import os
import shutil
import subprocess
import sys
import zlib
from pathlib import Path

import imageio as iio
import lz4.block
import numpy as np
import yaml
from common.scene_release import ScannetppScene_Release
from common.utils.utils import load_json, load_yaml_munch, read_txt_list, run_command
from munch import Munch
from tqdm import tqdm


def extract_rgb(scene, w=512, h=384):
    scene.iphone_rgb_dir.mkdir(parents=True, exist_ok=True)
    cmd = f"ffmpeg -i {scene.iphone_video_path} -vf scale={w}:{h} -start_number 0 -q:v 1 {scene.iphone_rgb_dir}/frame_%06d.jpg"
    return run_command(cmd, verbose=True, exit_on_error=False)


def extract_masks(scene, w=512, h=384):
    scene.iphone_video_mask_dir.mkdir(parents=True, exist_ok=True)
    cmd = f"ffmpeg -i {str(scene.iphone_video_mask_path)} -pix_fmt gray -vf scale={w}:{h} -start_number 0 {scene.iphone_video_mask_dir}/frame_%06d.png"
    return run_command(cmd, verbose=True, exit_on_error=False)


def extract_depth(scene):
    # global compression with zlib
    height, width = 192, 256
    sample_rate = 1
    scene.iphone_depth_dir.mkdir(parents=True, exist_ok=True)

    try:
        with open(scene.iphone_depth_path, "rb") as infile:
            data = infile.read()
            data = zlib.decompress(data, wbits=-zlib.MAX_WBITS)
            depth = np.frombuffer(data, dtype=np.float32).reshape(-1, height, width)

        for frame_id in tqdm(
            range(0, depth.shape[0], sample_rate), desc="decode_depth"
        ):
            # Write one frame at a time, converting meters to millimeters
            iio.imwrite(
                f"{scene.iphone_depth_dir}/frame_{frame_id:06}.png",
                (depth[frame_id] * 1000).astype(np.uint16),
            )
    # per frame compression with lz4/zlib
    except Exception:
        frame_id = 0
        with open(scene.iphone_depth_path, "rb") as infile:
            while True:
                size = infile.read(4)  # 32-bit integer
                if len(size) == 0:
                    break
                size = int.from_bytes(size, byteorder="little")
                if frame_id % sample_rate != 0:
                    infile.seek(size, 1)
                    frame_id += 1
                    continue

                # read this frame's compressed payload
                data = infile.read(size)
                try:
                    # try using lz4
                    data = lz4.block.decompress(
                        data, uncompressed_size=height * width * 2
                    )  # UInt16 = 2bytes
                    depth = np.frombuffer(data, dtype=np.uint16).reshape(height, width)
                except Exception:
                    # try using zlib
                    data = zlib.decompress(data, wbits=-zlib.MAX_WBITS)
                    depth = np.frombuffer(data, dtype=np.float32).reshape(height, width)
                    depth = (depth * 1000).astype(np.uint16)

                # 6 digit frame id = 277 minute video at 60 fps
                iio.imwrite(f"{scene.iphone_depth_dir}/frame_{frame_id:06}.png", depth)
                frame_id += 1


def main(args):
    cfg = load_yaml_munch(args.config_file)

    # Scene IDs are the immediate subfolders of <data_root>/data.
    # Note: any scene_list_file / scene_ids / splits entries in the config
    # are ignored; every scene folder found on disk is processed.
    scenes_root = os.path.join(cfg.data_root, "data")
    scene_ids = [
        f
        for f in os.listdir(scenes_root)
        if os.path.isdir(os.path.join(scenes_root, f))
    ]

    print("Scene IDs:", scene_ids)
    print("Number of scenes:", len(scene_ids))

    output_dir = "/home/czptc2h/datasets/scannet_pp/out_images"
    output_dir_json = (
        "/home/czptc2h/datasets/scannet_pp/scannet_depth_instructions.jsonl"
    )

    # Create the output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    sample_interval = 10
    points_per_frame = 100
    image_width = 1280  # resize first to save memory
    image_height = 960

    # get the options to process
    # go through each scene
    # Open a new jsonl file at output_dir_json
    with open(output_dir_json, "w") as jsonl_file:
        for scene_id in tqdm(scene_ids, desc="scene"):
            try:
                scene = ScannetppScene_Release(
                    scene_id, data_root=Path(cfg.data_root) / "data"
                )

                print(
                    "cfg.data_root = ",
                    cfg.data_root,
                    "scene_id = ",
                    scene_id,
                    "scene = ",
                    scene,
                )

                # # extract data for the current scene
                out_rgb = extract_rgb(scene, image_width, image_height)
                if out_rgb.returncode != 0:
                    print("error during rgb extraction, go to the next scene")
                    continue
                out_mask = extract_masks(scene, image_width, image_height)
                if out_mask.returncode != 0:
                    print("error during mask extraction, go to the next scene")
                    continue
                extract_depth(scene)

                # convert data into json
                # Iterate over the extracted RGB, mask, and depth frames in lockstep.
                # os.listdir returns entries in arbitrary order, so sort each
                # listing to keep the three frame sequences aligned.
                for i, (image_file, mask_file, depth_file) in enumerate(
                    zip(
                        sorted(os.listdir(scene.iphone_rgb_dir)),
                        sorted(os.listdir(scene.iphone_video_mask_dir)),
                        sorted(os.listdir(scene.iphone_depth_dir)),
                    )
                ):
                    if i % sample_interval != 0:
                        continue
                    data_dict = {}

                    data_dict["image"] = (
                        str(scene.iphone_rgb_dir / image_file)
                        .replace(cfg.data_root, "")
                        .replace("data/", "")
                    )
                    # Move the image file to the specified output directory
                    destination_path = Path(output_dir) / data_dict["image"]
                    destination_path.parent.mkdir(
                        parents=True, exist_ok=True
                    )  # Ensure the directory exists
                    shutil.move(
                        str(scene.iphone_rgb_dir / image_file), destination_path
                    )

                    data_dict["intrinsics"] = [
                        1427.4375 * (image_width / 1920),
                        1427.4375 * (image_height / 1440),
                        959.5 * (image_width / 1920),
                        719.5 * (image_height / 1440),
                        image_width,
                        image_height,
                    ]  # rescaled intrinsics

                    if mask_file.endswith(".png"):
                        mask_path = scene.iphone_video_mask_dir / mask_file
                        mask_image = iio.imread(mask_path)
                        # Randomly sample points_per_frame pixels where mask_image is not 0
                        non_zero_indices = np.argwhere(mask_image != 0)
                        random_indices = np.random.choice(
                            non_zero_indices.shape[0],
                            points_per_frame,
                            replace=False,
                        )
                        sampled_indices = (
                            non_zero_indices
                            * np.array([192 / image_height, 256 / image_width])
                        ).astype(int)[random_indices]

                        sampled_indices_ori = (non_zero_indices).astype(int)[
                            random_indices
                        ]

                        data_dict["pixel_coords"] = [
                            [int(x), int(y)] for y, x in sampled_indices_ori
                        ]

                    # convert z-depth to Euclidean distance from the camera center
                    if depth_file.endswith(".png"):
                        depth_path = scene.iphone_depth_dir / depth_file
                        depth_image = iio.imread(depth_path)
                        fx, fy, cx, cy, _, _ = data_dict["intrinsics"]
                        pixel_coords = np.array(data_dict["pixel_coords"])
                        # depth PNGs store millimeters; convert to meters
                        z = (
                            depth_image[sampled_indices[:, 0], sampled_indices[:, 1]]
                            / 1000.0
                        )
                        # back-project to 3D: X = (u - cx) * z / fx, Y = (v - cy) * z / fy
                        x = (pixel_coords[:, 0] - cx) * z / fx
                        y = (pixel_coords[:, 1] - cy) * z / fy

                        data_dict["depth"] = np.sqrt(x**2 + y**2 + z**2).tolist()
                        # Print samples to verify computation
                        sample_indices = np.random.choice(
                            len(z), min(5, len(z)), replace=False
                        )
                        for idx in sample_indices:
                            print(
                                f"Sample {idx}: z = {z[idx]}, pixel_coords = {data_dict['pixel_coords'][idx]}, depth = {data_dict['depth'][idx]}"
                            )


                    json.dump(data_dict, jsonl_file)
                    jsonl_file.write("\n")

                # Remove the folder scene.iphone_rgb_dir
                shutil.rmtree(scene.iphone_rgb_dir)
                shutil.rmtree(scene.iphone_video_mask_dir)
                shutil.rmtree(scene.iphone_depth_dir)

            except Exception as e:
                print(e)
                continue


if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("config_file", help="Path to config file")
    args = p.parse_args()

    main(args)

```
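
The curation scripts store the Euclidean distance from the camera center for each sampled pixel rather than the raw z-depth. A minimal sketch of that conversion under a pinhole model is given below; `z_depth_to_euclidean` is a hypothetical helper, not a function from the repository, and it assumes pixel coordinates are given as `(u, v)` = `(x, y)`.

```python
import numpy as np


def z_depth_to_euclidean(pixel_coords, z, fx, fy, cx, cy):
    """Convert z-depth at (u, v) pixels to distance from the camera center."""
    pixel_coords = np.asarray(pixel_coords, dtype=np.float64)
    z = np.asarray(z, dtype=np.float64)
    # Normalized ray direction components for each pixel
    rx = (pixel_coords[:, 0] - cx) / fx
    ry = (pixel_coords[:, 1] - cy) / fy
    # The 3D point is (rx * z, ry * z, z), so its norm is z * sqrt(rx^2 + ry^2 + 1)
    return z * np.sqrt(rx**2 + ry**2 + 1.0)
```

At the principal point the ray is the optical axis, so the distance equals `z`; away from it the distance is always larger than `z`.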

## /utils/curate_sunRGBD.py

```py path="/utils/curate_sunRGBD.py" 
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.

# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

import argparse
import json
import os
import shutil
from glob import glob

import numpy as np
from PIL import Image

# Set up argument parser
parser = argparse.ArgumentParser(description="Process some files.")
parser.add_argument("--dataroot", type=str, default="/home/czptc2h/datasets/SUNRGBD", help="image dir")
parser.add_argument(
    "--out_json_path", type=str, help="output jsonl path"
)
parser.add_argument(
    "--out_image_dir", type=str, help="output image folder"
)
args = parser.parse_args()


## Collect every scene directory under <dataroot>/SUNRGBD
dataroot = args.dataroot
scene_dirs = glob(os.path.join(dataroot, "SUNRGBD/*/*/*"))
print("Scene Dirs:", scene_dirs)

out_json_path = args.out_json_path
out_image_path = args.out_image_dir

if os.path.exists(out_image_path):
    shutil.rmtree(out_image_path)
os.makedirs(out_image_path)

points_per_image = 100
import json, os

count = 0
with open(out_json_path, "w") as jsonl_file:
    for scene_dir in scene_dirs:
        data_dict = {}
        ## Get image file path from scene directory
        print("Scene Dir:", scene_dir)
        try:
            image_path = glob(f"{scene_dir}/image/*")[0]
        except IndexError:
            # Some scenes nest the image folder two levels deeper
            image_path = glob(f"{scene_dir}/*/*/image/*")[0]
        if "NYU" in image_path:
            continue

        img = Image.open(image_path)
        sub_dir = image_path.replace(f"{dataroot}/SUNRGBD/", "")
        ## Copy the image to the out_image_path directory
        os.makedirs(os.path.dirname(out_image_path + "/" + sub_dir), exist_ok=True)
        shutil.copy(image_path, out_image_path + "/" + sub_dir)

        ## Get depth map file path from scene directory
        try:
            depth_path = glob(f"{scene_dir}/depth_bfx/*")[0]
        except IndexError:
            # Same nested layout as for the image folder
            depth_path = glob(f"{scene_dir}/*/*/depth_bfx/*")[0]

        print("Image Path:", image_path, "; Depth Path:", depth_path)

        # Replace the last 2 file/folder names in the path of depth_path with "intrinsics.txt"
        intrinsic_path = os.path.join(os.path.dirname(os.path.dirname(depth_path)), "intrinsics.txt")

        with open(intrinsic_path, "r") as file:
            intrinsic_data = file.read().strip().split()
            intrinsic_matrix = np.array(intrinsic_data, dtype=np.float32).reshape(
                (3, 3)
            )
        print("Intrinsic Matrix:\n", intrinsic_matrix)

        # Read the image from image_path into a PIL image
        pil_image = Image.open(image_path)

        data_dict["image"] = sub_dir
        data_dict["intrinsics"] = [
            float(intrinsic_matrix[0, 0]),
            float(intrinsic_matrix[1, 1]),
            float(intrinsic_matrix[0, 2]),
            float(intrinsic_matrix[1, 2]),
        ] + [pil_image.size[0], pil_image.size[1]]

        depth_gt = Image.open(depth_path)
        depth_gt = np.asarray(depth_gt, dtype=np.float32)
        depth_gt = depth_gt / 10000.0

        # Randomly sample 100 pixels in depth_gt with value > 0.005 and < 25
        valid_pixels = np.argwhere((depth_gt > 0.005) & (depth_gt < 25))
        sampled_indices = np.random.choice(
            len(valid_pixels), size=points_per_image, replace=False
        )
        sampled_pixels = valid_pixels[sampled_indices]

        data_dict["pixel_coords"] = sampled_pixels[:, [1, 0]].tolist()
        fx, fy, cx, cy = (
            intrinsic_matrix[0, 0],
            intrinsic_matrix[1, 1],
            intrinsic_matrix[0, 2],
            intrinsic_matrix[1, 2],
        )
        z = depth_gt[sampled_pixels[:, 0], sampled_pixels[:, 1]]
        x = (sampled_pixels[:, 1] - cx) * z / fx
        y = (sampled_pixels[:, 0] - cy) * z / fy
        euclidean_distances = np.sqrt(x**2 + y**2 + z**2)
        data_dict["depth"] = euclidean_distances.tolist()

        print("PIL Image Size:", pil_image.size)
        print("Depth GT Size:", depth_gt.shape)

        print("Data Dictionary:", data_dict)

        json.dump(data_dict, jsonl_file)
        jsonl_file.write("\n")
        count += 1
        print(f"processed {count} images")

```
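
Each curation script writes one JSON object per line with the keys `image`, `intrinsics` (`[fx, fy, cx, cy, width, height]`), `pixel_coords` (`[x, y]` pairs), and `depth`. A minimal sketch of a reader that checks these fields is shown below; `load_depth_records` is a hypothetical helper, and the example record uses made-up values.

```python
import io
import json


def load_depth_records(lines):
    """Parse JSON Lines records and check the fields the curation scripts write."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        # intrinsics layout: [fx, fy, cx, cy, width, height]
        fx, fy, cx, cy, width, height = rec["intrinsics"]
        if len(rec["pixel_coords"]) != len(rec["depth"]):
            raise ValueError(f"line {line_no}: pixel_coords and depth differ in length")
        for x, y in rec["pixel_coords"]:
            if not (0 <= x < width and 0 <= y < height):
                raise ValueError(f"line {line_no}: pixel ({x}, {y}) outside the image")
        records.append(rec)
    return records


# Tiny in-memory example with made-up values
example = io.StringIO(
    '{"image": "a.jpg", "intrinsics": [500.0, 500.0, 320.0, 240.0, 640, 480], '
    '"pixel_coords": [[10, 20]], "depth": [2.5]}\n'
)
records = load_depth_records(example)
```

The same function accepts an open file handle, since iterating a text file yields its lines.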

## /utils/curate_taskonomy

``` path="/utils/curate_taskonomy" 
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.

# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

import os
import shutil
import json
import argparse
import numpy as np
from glob import glob
from PIL import Image

def main():
    parser = argparse.ArgumentParser(description="Process Taskonomy dataset.")
    parser.add_argument("--dataroot", type=str, required=True, help="Path to Taskonomy fullplus root directory")
    parser.add_argument("--out_json_path", type=str, required=True, help="Output JSONL path")
    parser.add_argument("--out_image_dir", type=str, required=True, help="Output RGB image folder")
    parser.add_argument("--points_per_image", type=int, default=3, help="Number of depth points to sample")
    args = parser.parse_args()

    dataroot = args.dataroot
    out_json_path = args.out_json_path
    out_image_path = args.out_image_dir
    points_per_image = args.points_per_image

    print(f"Scanning Taskonomy RGB files in {dataroot}...")
    # Taskonomy standard naming is often something like <building_name>/rgb/<building_name>_..._rgb.png
    rgb_files = glob(os.path.join(dataroot, "**", "*_rgb.png"), recursive=True)
    if not rgb_files:
        # Fall back to compressed WebP files if no PNGs were found
        rgb_files = glob(os.path.join(dataroot, "**", "*_rgb.webp"), recursive=True)
    
    print(f"Found {len(rgb_files)} RGB files.")

    if os.path.exists(out_image_path):
        shutil.rmtree(out_image_path)
    os.makedirs(out_image_path)

    count = 0
    with open(out_json_path, "w") as jsonl_file:
        for image_path in rgb_files:
            # Deduce depth path by replacing 'rgb' string identifier with 'depth_zbuffer'
            depth_path = image_path.replace("rgb", "depth_zbuffer")
            
            if not os.path.exists(depth_path):
                # Try finding depth_euclidean if zbuffer is missing
                depth_path = image_path.replace("rgb", "depth_euclidean")
                is_euclidean = True
                if not os.path.exists(depth_path):
                    continue
            else:
                is_euclidean = False

            # Create output subdirectories
            sub_dir = os.path.relpath(image_path, dataroot)
            dest_image_path = os.path.join(out_image_path, sub_dir)
            os.makedirs(os.path.dirname(dest_image_path), exist_ok=True)
            shutil.copy(image_path, dest_image_path)

            data_dict = {}
            data_dict["image"] = sub_dir

            pil_image = Image.open(image_path)
            W, H = pil_image.size

            # Taskonomy uses a 90 degree FOV camera
            # fx = fy = W / (2 * tan(FOV / 2)) -> W / 2
            fx = W / 2.0
            fy = H / 2.0
            cx = W / 2.0
            cy = H / 2.0

            data_dict["intrinsics"] = [fx, fy, cx, cy, W, H]

            # Load 16-bit Depth
            depth_img = Image.open(depth_path)
            depth_arr = np.asarray(depth_img, dtype=np.float32)

            # Taskonomy 16-bit depth is scaled; the standard conversion is pixel_value / 512.0 for meters
            depth_arr = depth_arr / 512.0

            # Filter valid depth pixels (0.01m to 120m)
            valid_pixels = np.argwhere((depth_arr > 0.01) & (depth_arr < 120.0))
            
            if len(valid_pixels) < points_per_image:
                # Skip images that don't have enough valid depth pixels
                continue
                
            sampled_indices = np.random.choice(
                len(valid_pixels), size=points_per_image, replace=False
            )
            sampled_pixels = valid_pixels[sampled_indices]

            # [y, x] -> [x, y] to align with [u, v]
            data_dict["pixel_coords"] = sampled_pixels[:, [1, 0]].tolist()

            # Look up the stored depth value (z-buffer or Euclidean, depending on the file found)
            z = depth_arr[sampled_pixels[:, 0], sampled_pixels[:, 1]]

            if is_euclidean:
                # If the dataset specifically provides depth_euclidean, no trigonometry needed
                euclidean_distances = z
            else:
                # Calculate Euclidean distances from Z-buffer
                x = (sampled_pixels[:, 1] - cx) * z / fx
                y = (sampled_pixels[:, 0] - cy) * z / fy
                euclidean_distances = np.sqrt(x**2 + y**2 + z**2)

            data_dict["depth"] = euclidean_distances.tolist()

            json.dump(data_dict, jsonl_file)
            jsonl_file.write("\n")
            count += 1
            
            if count % 1000 == 0:
                print(f"Processed {count} valid images...")

    print(f"Taskonomy curation complete! {count} total files written to {out_json_path}")

if __name__ == "__main__":
    main()

```
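
The script above derives the focal length from an assumed field of view via `fx = W / (2 * tan(FOV / 2))`. A minimal sketch of that relation for an arbitrary field of view follows; `focal_from_fov` is a hypothetical helper, not part of the repository, and it assumes an ideal pinhole camera with the principal point at the image center.

```python
import math


def focal_from_fov(size_px, fov_deg):
    """Focal length in pixels for a pinhole camera spanning fov_deg across size_px."""
    return size_px / (2.0 * math.tan(math.radians(fov_deg) / 2.0))
```

For a 90 degree field of view this reduces to `size_px / 2`, which matches the `W / 2` used in the script; a narrower field of view gives a longer focal length.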

## /utils/curate_waymo.py

```py path="/utils/curate_waymo.py" 
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.

# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

import io, warnings
from typing import Optional

# Disable annoying warnings from PyArrow, which is used under the hood.
warnings.simplefilter(action="ignore", category=FutureWarning)


import argparse
import random

import dask.dataframe as dd
import numpy as np
import tensorflow as tf
from PIL import Image
from waymo_open_dataset import v2
from waymo_open_dataset.utils import range_image_utils
from waymo_open_dataset.v2.perception.utils import lidar_utils

# Set up argument parser
parser = argparse.ArgumentParser(description="Process some files.")
parser.add_argument("--dataset_dir", type=str, default="/home/czptc2h/datasets/waymo/training/", help="waymo")
parser.add_argument(
    "--out_json_path", type=str, help="output jsonl path"
)
parser.add_argument(
    "--out_image_dir", type=str, help="output image folder"
)
args = parser.parse_args()

# Path to the directory with all components
dataset_dir = args.dataset_dir
# List all parquet files in the "camera_image" directory and extract their names without extensions
camera_image_dir = f"{dataset_dir}/camera_image"
import os

parquet_files = [
    os.path.join(camera_image_dir, file)
    for file in os.listdir(camera_image_dir)
    if file.endswith(".parquet")
]
filenames = [
    os.path.splitext(file)[0]
    for file in os.listdir(camera_image_dir)
    if file.endswith(".parquet")
]

# change these to process a subset of the data
start_file = 0
end_file = len(filenames)

print(f"Found {len(filenames)} files in {camera_image_dir}")
# Print some samples of filenames
sample_size = min(
    5, len(filenames)
)  # Print up to 5 samples or less if fewer files exist
print("Sample filenames:", filenames[:sample_size])


def read(tag: str, context_name: str) -> dd.DataFrame:
    """Creates a Dask DataFrame for the component specified by its tag."""
    paths = tf.io.gfile.glob(f"{dataset_dir}/{tag}/{context_name}.parquet")
    return dd.read_parquet(paths)


out_json_path = args.out_json_path
# Create the directory for out_json_path if it doesn't exist
os.makedirs(os.path.dirname(out_json_path), exist_ok=True)

out_image_path = args.out_image_dir

import shutil

if os.path.exists(out_image_path):
    shutil.rmtree(out_image_path)
os.makedirs(out_image_path)

points_per_image = 100
import json, os

import cv2



def undistort_image(pil_image, intrinsic, pixel_coordinates):
    # Convert PIL image to OpenCV image
    cv_image = np.array(pil_image)
    # Define the camera intrinsic parameters
    fx = intrinsic.f_u
    fy = intrinsic.f_v
    cx = intrinsic.c_u
    cy = intrinsic.c_v
    k1 = intrinsic.k1
    k2 = intrinsic.k2
    p1 = intrinsic.p1
    p2 = intrinsic.p2
    k3 = intrinsic.k3
    # Create a camera intrinsic matrix
    K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]])
    # Create a distortion coefficients vector
    dist_coeffs = np.array([k1, k2, p1, p2, k3])
    # Get the image dimensions
    h, w = cv_image.shape[:2]
    # Build the camera matrix for the undistorted image: keep the original
    # focal lengths and move the principal point to the image center
    new_K, _ = cv2.getOptimalNewCameraMatrix(K, dist_coeffs, (w, h), 1, (w, h))
    new_K[0, 0] = fx  # Set fx
    new_K[1, 1] = fy  # Set fy
    new_K[0, 2] = w / 2  # Set cx to be at the center
    new_K[1, 2] = h / 2  # Set cy to be at the center
    # Undistort the image
    map_x, map_y = cv2.initUndistortRectifyMap(K, dist_coeffs, None, new_K, (w, h), 5)
    undistorted_image = cv2.remap(cv_image, map_x, map_y, cv2.INTER_LINEAR)
    # Convert the undistorted image back to a PIL image
    undistorted_pil_image = Image.fromarray(undistorted_image)

    # Convert pixel coordinates to undistorted coordinates
    undistorted_pixel_coordinates = []
    for x, y in pixel_coordinates:
        undistorted_x = int(map_x[y, x])
        undistorted_y = int(map_y[y, x])
        undistorted_pixel_coordinates.append((undistorted_x, undistorted_y))

    return undistorted_pil_image, new_K, undistorted_pixel_coordinates


count = 0
with open(out_json_path, "w") as jsonl_file:
    for filename in filenames[start_file:end_file]:
        # Process each filename as needed
        # Example: Write filename to the JSONL file
        # print("Processing filename:", filename)
        lidar = read("lidar", filename)
        lidar_calib = read("lidar_calibration", filename)
        camera_calib = read("camera_calibration", filename)
        lidar_pose = read("lidar_pose", filename)
        vehicle_pose = read("vehicle_pose", filename)
        cam_img = read("camera_image", filename)
        lidar_camera_projection = read("lidar_camera_projection", filename)
        df = v2.merge(lidar_calib, lidar)
        df = v2.merge(df, lidar_camera_projection)
        df = v2.merge(df, lidar_pose)
        df = v2.merge(df, vehicle_pose)
        df = v2.merge(df, camera_calib)
        df = v2.merge(df, cam_img)

        for _, row in df.iterrows():
            # print(row)

            # Create all component objects
            lidar = v2.LiDARComponent.from_dict(row)
            lidar_calib = v2.LiDARCalibrationComponent.from_dict(row)
            camera_calib = v2.CameraCalibrationComponent.from_dict(row)
            lidar_pose = v2.LiDARPoseComponent.from_dict(row)
            vehicle_pose = v2.VehiclePoseComponent.from_dict(row)
            camera_image = v2.CameraImageComponent.from_dict(row)
            lidar_cam_proj = v2.LiDARCameraProjectionComponent.from_dict(row)

            range_image_cartesian = lidar_utils.convert_range_image_to_cartesian(
                range_image=lidar.range_image_return1,
                calibration=lidar_calib,
                pixel_pose=lidar_pose.range_image_return1,
                frame_pose=vehicle_pose,
            )
            extrinsic = np.reshape(camera_calib.extrinsic.transform, [1, 4, 4]).astype(
                np.float32
            )
            camera_image_size = (camera_calib.height, camera_calib.width)
            ric_shape = range_image_cartesian.shape
            ric = np.reshape(
                range_image_cartesian, [1, ric_shape[0], ric_shape[1], ric_shape[2]]
            )

            cp = lidar_cam_proj.range_image_return1
            cp_tensor = tf.reshape(tf.convert_to_tensor(value=cp.values), cp.shape)
            cp_shape = cp_tensor.shape
            cp_tensor = np.reshape(
                cp_tensor, [1, cp_shape[0], cp_shape[1], cp_shape[2]]
            )

            depth_image = range_image_utils.build_camera_depth_image(
                ric,
                extrinsic,
                cp_tensor,
                list(camera_image_size),
                camera_image.key.camera_name,
            )

            # Convert depth_image to a numpy array
            depth_image_np = depth_image.numpy().squeeze(axis=0)

            # Find non-zero elements in the depth_images
            non_zero_indices = np.nonzero(depth_image_np)

            # Extract the pixel coordinates as (row, col) pairs and their depth values;
            # they are flipped to (x, y) when stored in data_dict below
            pixel_coordinates = list(zip(non_zero_indices[0], non_zero_indices[1]))
            depth_values = depth_image_np[non_zero_indices]

            data_dict = {}
            data_dict["image"] = (
                f"{camera_image.key.segment_context_name}/{camera_image.key.frame_timestamp_micros}_{camera_image.key.camera_name}.jpg"
            )

            sample_size = min(2 * points_per_image, len(pixel_coordinates))
            sample_indices = random.sample(range(len(pixel_coordinates)), sample_size)

            data_dict["pixel_coords"] = [
                list(reversed(pixel_coordinates[i])) for i in sample_indices
            ]

            data_dict["depth"] = [float(depth_values[i]) for i in sample_indices]

            image_filename = os.path.join(
                out_image_path,
                data_dict["image"],
            )
            pil_image = Image.open(io.BytesIO(camera_image.image))

            undistorted_pil_image, new_K, data_dict["pixel_coords"] = undistort_image(
                pil_image, camera_calib.intrinsic, data_dict["pixel_coords"]
            )

            # Check if the fx value in new_K is greater than 1000
            if new_K[0, 0] > 1000:
                # Calculate the scaling factor to make fx equal to 1000
                scale_factor = 1000 / new_K[0, 0]

                # Rescale the undistorted_pil_image
                new_width = int(undistorted_pil_image.width * scale_factor)
                new_height = int(undistorted_pil_image.height * scale_factor)
                # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter
                undistorted_pil_image = undistorted_pil_image.resize(
                    (new_width, new_height), Image.LANCZOS
                )

                # Rescale the new_K matrix
                new_K[0, 0] *= scale_factor
                new_K[1, 1] *= scale_factor
                new_K[0, 2] *= scale_factor
                new_K[1, 2] *= scale_factor

                # Rescale the pixel coordinates
                data_dict["pixel_coords"] = [
                    (int(x * scale_factor), int(y * scale_factor))
                    for x, y in data_dict["pixel_coords"]
                ]

            data_dict["intrinsics"] = [
                new_K[0, 0],
                new_K[1, 1],
                new_K[0, 2],
                new_K[1, 2],
                undistorted_pil_image.width,
                undistorted_pil_image.height,
            ]

            # Filter pixel coordinates and corresponding depth values
            valid_pixel_coords = []
            valid_depths = []
            for coord, depth in zip(data_dict["pixel_coords"], data_dict["depth"]):
                x, y = coord
                if (
                    0 <= x < undistorted_pil_image.width
                    and 0 <= y < undistorted_pil_image.height
                ):
                    valid_pixel_coords.append(coord)
                    valid_depths.append(depth)
                if len(valid_pixel_coords) == points_per_image:
                    break

            data_dict["pixel_coords"] = valid_pixel_coords
            data_dict["depth"] = valid_depths

            os.makedirs(os.path.dirname(image_filename), exist_ok=True)
            undistorted_pil_image.save(image_filename)

            json.dump(data_dict, jsonl_file)
            jsonl_file.write("\n")

            if count % 100 == 0:
                print(f"data_dict[{count}] = ", data_dict)

            count += 1

```
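
The rescaling step near the end of the script above caps `fx` at 1000 by shrinking the image, the intrinsics, and the labeled pixel coordinates with one shared factor. A minimal sketch of that arithmetic follows. The helper name `rescale_to_focal` and the use of plain tuples instead of a PIL image are assumptions made for illustration; they are not part of the repository.

```python
def rescale_to_focal(intrinsics, size, coords, target_fx=1000.0):
    """Scale intrinsics, image size, and pixel coords so that fx <= target_fx.

    intrinsics: (fx, fy, cx, cy); size: (width, height); coords: [(x, y), ...].
    Hypothetical helper that mirrors the arithmetic of the script, not its API.
    """
    fx, fy, cx, cy = intrinsics
    width, height = size
    if fx <= target_fx:
        # Nothing to do: the focal length is already within the cap.
        return list(intrinsics), (width, height), list(coords)

    s = target_fx / fx  # one factor shared by every quantity
    new_intrinsics = [fx * s, fy * s, cx * s, cy * s]
    new_size = (int(width * s), int(height * s))
    new_coords = [(int(x * s), int(y * s)) for x, y in coords]
    return new_intrinsics, new_size, new_coords


if __name__ == "__main__":
    print(rescale_to_focal((2000.0, 2000.0, 960.0, 640.0), (1920, 1280), [(100, 200)]))
```

Using a single factor keeps the labeled points aligned with the resized image, since the principal point and the pixel grid shrink together.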

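The script above also draws `2 * points_per_image` candidate points and then keeps only the first `points_per_image` that still fall inside the image. The sketch below illustrates that oversample-then-filter idea in isolation. It is a simplification: the function name `sample_in_bounds` and the `seed` parameter are hypothetical, and the undistortion that happens between sampling and filtering in the script is left out.

```python
import random


def sample_in_bounds(coords, depths, width, height, k=100, seed=None):
    """Oversample 2*k candidates, then keep at most k that lie inside the image.

    coords: [(x, y), ...]; depths: matching list of depth values.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    n = min(2 * k, len(coords))
    picked = rng.sample(range(len(coords)), n)

    kept_coords, kept_depths = [], []
    for i in picked:
        x, y = coords[i]
        if 0 <= x < width and 0 <= y < height:
            kept_coords.append((x, y))
            kept_depths.append(depths[i])
        if len(kept_coords) == k:
            break  # enough valid points collected
    return kept_coords, kept_depths


if __name__ == "__main__":
    pts = [(i, i) for i in range(-5, 15)]
    ds = [float(i) for i in range(-5, 15)]
    print(sample_in_bounds(pts, ds, width=10, height=10, k=4, seed=0))
```

Oversampling by a factor of two leaves headroom for points that are pushed outside the frame, at the cost of sometimes returning fewer than `k` points.
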
## /utils/datasets.py
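
`dataset_inference` in the file below treats every (image, pixel) pair as one sample and maps a flat index back to that pair through a cumulative sum and `bisect`. A small sketch of the mapping, written as a hypothetical standalone function `locate` that uses `itertools.accumulate` instead of NumPy so it runs without extra dependencies:

```python
import bisect
from itertools import accumulate


def locate(index, num_pixels):
    """Map a flat sample index to (image_index, pixel_index).

    num_pixels[i] is the number of labeled points in image i.
    Hypothetical helper mirroring the lookup in dataset_inference.
    """
    cumsum = list(accumulate(num_pixels))
    if not 0 <= index < cumsum[-1]:
        raise ValueError(f"Index out of range: {index}")
    # First image whose cumulative count reaches index + 1
    image_index = bisect.bisect_left(cumsum, index + 1)
    # Offset of the sample within that image
    offset = cumsum[image_index - 1] if image_index > 0 else 0
    return image_index, index - offset


if __name__ == "__main__":
    counts = [1, 2, 3, 4, 5]
    for i in range(4):
        print(i, locate(i, counts))
```

With `counts = [1, 2, 3, 4, 5]`, indices 0 through 3 map to `(0, 0)`, `(1, 0)`, `(1, 1)`, and `(2, 0)`.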

```py path="/utils/datasets.py" 
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.

# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

import bisect
import json
import logging
import os
import random
from io import StringIO
from typing import Any

import cv2
import numpy as np

import pandas as pd

from PIL import Image
from torch.utils.data import Dataset

logger: logging.Logger = logging.getLogger()
logger.setLevel(logging.INFO)


# Unified prompt usable for both SFT and GRPO. Our method is not sensitive to the
# exact wording, so the prompt can be adjusted freely.
def generate_prompt_depth_sft(
    depth,
    is_eval=False,
):
    problem = "Given this image, how far is the point pointed by the red arrow from the camera? Output the thinking process in <think> </think> and final answer (the meter number only, without the unit) in <answer> </answer> tags."

    thinking = (
        f"<think> The point is around {depth:.2f} meters away from the camera. </think>"
    )
    if is_eval:
        solution = f"<answer> {depth} </answer>"
    else:
        solution = f"<answer> {depth:.2f} </answer>"
    return problem, thinking, solution


# ####################### handle camera ambiguities ################################
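def undistort_image(intrinsics: list, image: Image.Image):
    """Resample the image when fx and fy differ, assuming zero lens distortion.

    Returns the (possibly resampled) image and its [fx, fy, cx, cy] intrinsics.
    """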
    # Check if fx and fy are not the same
    if abs(intrinsics[0] - intrinsics[1]) > 1e-3:
        # Convert PIL image to numpy array
        image_np = np.array(image)

        # Create camera matrix from intrinsics
        camera_matrix = np.array(
            [
                [intrinsics[0], 0, intrinsics[2]],
                [0, intrinsics[1], intrinsics[3]],
                [0, 0, 1],
            ]
        )

        # Assume no distortion coefficients
        dist_coeffs = np.zeros((4, 1))

        # Get optimal new camera matrix
        new_camera_matrix, _ = cv2.getOptimalNewCameraMatrix(
            camera_matrix,
            dist_coeffs,
            (image_np.shape[1], image_np.shape[0]),
            1,
            (image_np.shape[1], image_np.shape[0]),
        )

        # Undistort the image
        undistorted_image_np = cv2.undistort(
            image_np, camera_matrix, dist_coeffs, None, new_camera_matrix
        )

        # Extract [fx, fy, cx, cy] from the new camera matrix
        new_intrinsics = [
            float(new_camera_matrix[0, 0]),
            float(new_camera_matrix[1, 1]),
            float(new_camera_matrix[0, 2]),
            float(new_camera_matrix[1, 2]),
        ]

        # Convert back to PIL image
        return Image.fromarray(undistorted_image_np), new_intrinsics
    else:
        return image, intrinsics


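def normalizing_focal_length(
    normalized_focal_length: float, intrinsics: list, image: Image.Image
):
    """Resize the image so that fx equals normalized_focal_length.

    Returns the resized image and [fx, fy, cx, cy, width, height].
    """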
    # Calculate the scaling factor for the focal length normalization
    scale_factor = normalized_focal_length / intrinsics[0]
    # Resize the image according to the scaling factor
    new_width = int(image.width * scale_factor)
    new_height = int(image.height * scale_factor)

    # Update the intrinsics with the normalized focal length
    intrinsics = [
        intrinsics[0] * scale_factor,
        intrinsics[1] * scale_factor,
        intrinsics[2] * scale_factor,
        intrinsics[3] * scale_factor,
        new_width,
        new_height,
    ]

    return image.resize((new_width, new_height)), intrinsics


def is_within_range(coord, crop_range):
    x, y = coord
    left, top, right, bottom = crop_range
    return left <= x < right and top <= y < bottom


def adjust_index(
    index,
    pixel_coords,
):
    # Check if the current index is valid
    if pixel_coords[index] != [-1, -1]:
        return index

    # Search for the closest valid index
    left = index - 1
    right = index + 1
    n = len(pixel_coords)

    while left >= 0 or right < n:
        if left >= 0 and pixel_coords[left] != [-1, -1]:
            return left
        if right < n and pixel_coords[right] != [-1, -1]:
            return right
        left -= 1
        right += 1

    # If no valid index is found, return -1
    return -1


class dataset_eval(Dataset):
    def __init__(
        self,
        data_path: str,
        image_folder: str,
        points_per_image=None,
        normalized_focal_length=1000.0,  # set to the intrinsics after original resize if needed
    ) -> None:
        super(dataset_eval, self).__init__()
        self.normalized_focal_length = normalized_focal_length

        print("reading data from ", data_path, "image_folder = ", image_folder)
        if ".jsonl" in data_path:
            with open(data_path, "r") as f:
                json_content = f.read()
            self.list_data_dict = pd.read_json(
                StringIO(json_content), lines=True
            ).to_dict(orient="records")
        else:
            with open(data_path, "r") as f:
                self.list_data_dict = json.load(f)

        self.data_path = data_path
        self.image_folder = image_folder
        self.length = self._get_length()

        if "scannet" in data_path:
            self.list_data_dict = self.list_data_dict[
                int(len(self.list_data_dict) * 0.98) :
            ]  # keep the last 2% for evaluation

            random.seed(42)
            random.shuffle(self.list_data_dict)

        self.random_indices = []

        random.seed(42)  # Set a fixed seed for replicability
        while len(self.random_indices) < self.__len__():
            i = random.sample(range(len(self.list_data_dict[0]["pixel_coords"])), 1)
            self.random_indices.append((i))

    def _get_length(self) -> int:
        return len(self.list_data_dict)

    def __len__(self) -> int:
        return len(self.list_data_dict) * len(self.list_data_dict[0]["pixel_coords"])

    def extract_image_and_meta(self, index):
        index_ori = index

        index %= len(self.list_data_dict)
        random_index = self.random_indices[index_ori][0]

        # read image
        data_dict = {}

        data_dict["image"] = Image.open(
            os.path.join(
                self.image_folder, self.list_data_dict[index]["image"].lstrip("/")
            )
        )

        intrinsics = self.list_data_dict[index]["intrinsics"][:4]
        if intrinsics[0] == 0.0:  # handle intrinsic errors
            intrinsics[0] = intrinsics[1]
        if intrinsics[1] == 0.0:
            intrinsics[1] = intrinsics[0]

        data_dict["image"], intrinsics_new = undistort_image(
            intrinsics, data_dict["image"]
        )

        data_dict["image"], intrinsics_new = normalizing_focal_length(
            self.normalized_focal_length, intrinsics_new, data_dict["image"]
        )

        pixel_coords = [
            [
                int(
                    (coord[0] - intrinsics[2]) * (intrinsics_new[0] / intrinsics[0])
                    + intrinsics_new[2]
                ),
                int(
                    (coord[1] - intrinsics[3]) * (intrinsics_new[1] / intrinsics[1])
                    + intrinsics_new[3]
                ),
            ]
            for coord in self.list_data_dict[index]["pixel_coords"]
        ]

        pixel_coord = pixel_coords[
            random_index
        ].copy()  # pixel coords starts from top-left corner
        depth = self.list_data_dict[index]["depth"][random_index]

        # Marker settings: the queried point is marked with a small red arrow
        cross_size = 5  # Arrow length in pixels; also the required margin to the image border
        cross_thickness = 1  # Currently unused

        # Calculate the scaling factor
        scale_x = 1
        scale_y = 1

        # Scale the pixel coordinates
        scaled_pixel_x = int(pixel_coord[0] * scale_x)
        scaled_pixel_y = int(pixel_coord[1] * scale_y)
        center_x = round(intrinsics_new[2])
        center_y = round(intrinsics_new[3])

        # Compute the number of pixels from the center to scaled_pixel_x and scaled_pixel_y
        pixels_from_center_x = abs(scaled_pixel_x - center_x)
        pixels_from_center_y = abs(scaled_pixel_y - center_y)

        # Check if the adjustable cross can be drawn
        if (
            cross_size <= scaled_pixel_x < data_dict["image"].width - cross_size
            and cross_size <= scaled_pixel_y < data_dict["image"].height - cross_size
        ):
            # Draw a --> like arrow
            for dx in range(1, cross_size + 1):
                data_dict["image"].putpixel(
                    (scaled_pixel_x - dx, scaled_pixel_y), (255, 0, 0)
                )  # Horizontal line
            # Draw the arrowhead
            for dy in range(1, cross_size // 2 + 1):
                data_dict["image"].putpixel(
                    (
                        scaled_pixel_x - dy - 1,
                        scaled_pixel_y + dy,
                    ),
                    (255, 0, 0),
                )
                data_dict["image"].putpixel(
                    (
                        scaled_pixel_x - dy - 1,
                        scaled_pixel_y - dy,
                    ),
                    (255, 0, 0),
                )
        else:
            # Skip this sample and get the next one
            return self.extract_image_and_meta((index_ori + 1) % self.__len__())

        return (
            data_dict["image"],
            depth,
            pixel_coord,
            intrinsics_new,
        )

    def __getitem__(self, index):
        data_dict = {}
        (
            data_dict["image"],
            depth,
            pixel_coord,
            intrinsics,
        ) = self.extract_image_and_meta(index)

        # generate prompt
        data_dict["problem"], data_dict["thinking"], data_dict["solution"] = (
            generate_prompt_depth_sft(
                depth,
                is_eval=True,
            )
        )

        data_dict["pixel_coord"] = pixel_coord
        data_dict["intrinsics"] = intrinsics

        data_dict["system"] = "You are a helpful assistant."

        data_dict["prompt"] = [
            {
                "content": [
                    {"image": data_dict["image"], "type": "image"},
                    {"text": data_dict["problem"], "type": "text"},
                ],
                "role": "user",
            }
        ]
        return data_dict


class dataset_train(Dataset):
    def __init__(
        self,
        data_path: str,
        image_folder: str,
        height_max=1200,
        height_min=700,
        width_max=1400,
        width_min=1000,
        normalized_focal_length=1000,
        sample_weights=None,  # support weighted sampling
        ratio_min=1.0,  # taskonomy dataset has intrinsic noise, we randomly rescale the aspect ratio of the images to handle that
        ratio_max=1.3,
    ) -> None:
        super().__init__()
        print("reading data from ", data_path, "image_folder = ", image_folder)
        data_paths = data_path.split(";")
        image_folders = image_folder.split(";")

        self.list_data_dict = []

        for dp in data_paths:
            if ".jsonl" in dp:
                print("reading jsonl from ", dp)
                try:
                    with open(dp, "r") as f:
                        json_content = f.read()
                    self.list_data_dict.append(
                        pd.read_json(StringIO(json_content), lines=True).to_dict(
                            orient="records"
                        )
                    )
                except Exception as e:
                    # Fall back to plain JSON if the file cannot be parsed as JSON Lines
                    print(e)
                    with open(dp, "r") as f:
                        self.list_data_dict.append(json.load(f))
            else:
                with open(dp, "r") as f:
                    self.list_data_dict.append(json.load(f))

            if "scannet" in dp:
                self.list_data_dict[-1] = self.list_data_dict[-1][
                    : int(len(self.list_data_dict[-1]) * 0.98)
                ]

        self.data_path = data_paths
        self.image_folder = image_folders
        self.length = self._get_length()
        print(
            "reading finished, dataset size is ",
            self.__len__(),
            ", data_path = ",
            self.data_path,
            ", image_folder = ",
            self.image_folder,
        )
        self.random_indices = []
        self.normalized_focal_length = normalized_focal_length
        self.width_range = [width_min, width_max]
        self.height_range = [height_min, height_max]
        self.sample_weights = (
            [int(x) for x in sample_weights.split(";")]
            if sample_weights
            else [1] * len(self.list_data_dict)
        )
        self.ratio_min = ratio_min
        self.ratio_max = ratio_max

    def _get_length(self) -> int:
        length = 0
        for data_dict in self.list_data_dict:
            length += len(data_dict)
        return length

    def __len__(self, ori_length=False) -> int:
        if ori_length:
            length = 0
            for data_dict in self.list_data_dict:
                length += len(data_dict)
            return length
        else:
            length = 0
            for data_dict in self.list_data_dict:
                length += (
                    len(data_dict) * 100
                )  # 100 labeled points per image in our data curation pipeline; change this number accordingly
            return length

    def getitem_Taskonomy(
        self, index, id_dataset
    ):  # taskonomy dataset has intrinsic noise, we randomly rescale the aspect ratio of the images to handle that
        index = index % len(self.list_data_dict[id_dataset])

        # read image
        data_dict = {}

        data_dict["image"] = Image.open(
            os.path.join(
                self.image_folder[id_dataset],
                self.list_data_dict[id_dataset][index]["image"].lstrip("/"),
            )
        )
        intrinsics = self.list_data_dict[id_dataset][index]["intrinsics"][:4]

        data_dict["image"], intrinsics_new = undistort_image(
            intrinsics, data_dict["image"]
        )

        if self.normalized_focal_length > 0:
            data_dict["image"], intrinsics_new = normalizing_focal_length(
                self.normalized_focal_length, intrinsics_new, data_dict["image"]
            )
        # Pick a random aspect ratio (width / height) in [ratio_min, ratio_max]
        # and compute the matching new height
        new_height = int(
            data_dict["image"].width / random.uniform(self.ratio_min, self.ratio_max)
        )

        # Resize the image
        data_dict["image"] = data_dict["image"].resize(
            (data_dict["image"].width, new_height)
        )

        # Adjust the intrinsics to account for the new image height
        intrinsics_new[1] *= new_height / intrinsics_new[5]  # Scale fy
        intrinsics_new[3] *= new_height / intrinsics_new[5]  # Scale cy
        intrinsics_new[5] = new_height  # Update height

        pixel_coords = [
            [
                int(
                    (coord[0] - intrinsics[2]) * (intrinsics_new[0] / intrinsics[0])
                    + intrinsics_new[2]
                ),
                int(
                    (coord[1] - intrinsics[3]) * (intrinsics_new[1] / intrinsics[1])
                    + intrinsics_new[3]
                ),
            ]
            for coord in self.list_data_dict[id_dataset][index]["pixel_coords"]
        ]

        if len(self.list_data_dict[id_dataset][index]["pixel_coords"]) - 1 > 0:
            random_index = random.randint(
                0, len(self.list_data_dict[id_dataset][index]["pixel_coords"]) - 1
            )
        else:
            print("no pixel in ", index, ": ", self.list_data_dict[id_dataset][index])
            return self.__getitem__((index + 1) % self.__len__())

        pixel_coord = pixel_coords[random_index]
        depth = self.list_data_dict[id_dataset][index]["depth"][random_index]

        # Adjustable cross size
        cross_size = 5  # You can modify this value to change the cross size
        # Calculate the scaling factor
        scale_x = 1
        scale_y = 1

        # Scale the pixel coordinates
        scaled_pixel_x = int(pixel_coord[0] * scale_x)
        scaled_pixel_y = int(pixel_coord[1] * scale_y)

        # Check if the adjustable cross can be drawn
        if (
            cross_size <= scaled_pixel_x < data_dict["image"].width - cross_size
            and cross_size <= scaled_pixel_y < data_dict["image"].height - cross_size
        ):
            # Draw a --> like arrow
            for dx in range(1, cross_size + 1):
                data_dict["image"].putpixel(
                    (scaled_pixel_x - dx, scaled_pixel_y), (255, 0, 0)
                )  # Horizontal line
            # Draw the arrowhead
            for dy in range(1, cross_size // 2 + 1):
                data_dict["image"].putpixel(
                    (
                        scaled_pixel_x - dy - 1,
                        scaled_pixel_y + dy,
                    ),
                    (255, 0, 0),
                )
                data_dict["image"].putpixel(
                    (
                        scaled_pixel_x - dy - 1,
                        scaled_pixel_y - dy,
                    ),
                    (255, 0, 0),
                )
        else:
            # Skip this sample and get the next one
            return self.__getitem__((index + 1) % self.__len__())

        # generate prompt
        data_dict["problem"], data_dict["thinking"], data_dict["solution"] = (
            generate_prompt_depth_sft(depth)
        )

        data_dict["system"] = "You are a helpful assistant."
        return data_dict

    def getitem_noTaskonomy(self, index, id_dataset):
        index_ori = index
        index = index % len(self.list_data_dict[id_dataset])
        intrinsics = self.list_data_dict[id_dataset][index]["intrinsics"][:4]

        # read image
        data_dict = {}

        img = Image.open(
            os.path.join(
                self.image_folder[id_dataset],
                self.list_data_dict[id_dataset][index]["image"].lstrip("/"),
            )
        )

        img, intrinsics_new = undistort_image(intrinsics, img)

        if self.normalized_focal_length > 0:
            img, intrinsics_new = normalizing_focal_length(
                self.normalized_focal_length, intrinsics_new, img
            )

        data_dict["image"] = img

        pixel_coords = [
            [
                int(
                    (coord[0] - intrinsics[2]) * (intrinsics_new[0] / intrinsics[0])
                    + intrinsics_new[2]
                ),
                int(
                    (coord[1] - intrinsics[3]) * (intrinsics_new[1] / intrinsics[1])
                    + intrinsics_new[3]
                ),
            ]
            for coord in self.list_data_dict[id_dataset][index]["pixel_coords"]
        ]

        # Random center crop
        width, height = data_dict["image"].size

        crop_height = int(
            min(height, random.uniform(self.height_range[0], self.height_range[1]))
        )
        crop_width = int(
            min(width, random.uniform(self.width_range[0], self.width_range[1]))
        )

        center_x = round(intrinsics_new[2])
        center_y = round(intrinsics_new[3])

        # Ensure the crop is within the specified bounds
        left = max(0, (width - crop_width) // 2)
        top = max(
            0,
            (height - crop_height) // 2,
        )
        right = min(width, left + crop_width)
        bottom = min(height, top + crop_height)

        data_dict["image"] = data_dict["image"].crop((left, top, right, bottom))

        # Adjust intrinsics_new to account for cropping
        intrinsics_new[2] -= left  # Adjust cx
        intrinsics_new[3] -= top  # Adjust cy
        intrinsics_new[4] = data_dict["image"].width  # Update width
        intrinsics_new[5] = data_dict["image"].height  # Update height

        pixel_coords = [
            (
                [coord[0] - left, coord[1] - top]
                if is_within_range(coord, (left, top, right, bottom))
                else [-1, -1]
            )
            for coord in pixel_coords
        ]

        if len(self.list_data_dict[id_dataset][index]["pixel_coords"]) - 1 > 0:
            random_index = random.randint(
                0, len(self.list_data_dict[id_dataset][index]["pixel_coords"]) - 1
            )
        else:
            print("no pixel in ", index, ": ", self.list_data_dict[id_dataset][index])
            return self.__getitem__((index + 1) % self.__len__(True))

        random_index = adjust_index(random_index, pixel_coords)
        if random_index == -1:
            # Skip this sample and get the next one
            return self.__getitem__((index_ori + 1) % self.__len__(True))

        pixel_coord = pixel_coords[random_index].copy()
        depth = self.list_data_dict[id_dataset][index]["depth"][random_index]

        # Adjustable cross size
        cross_size = 5  # You can modify this value to change the cross size

        # Scale the pixel coordinates
        scaled_pixel_x = int(pixel_coord[0])
        scaled_pixel_y = int(pixel_coord[1])

        # Check if the adjustable cross can be drawn
        if (
            cross_size <= scaled_pixel_x < data_dict["image"].width - cross_size
            and cross_size <= scaled_pixel_y < data_dict["image"].height - cross_size
        ):
            # Draw a --> like arrow
            for dx in range(1, cross_size + 1):
                data_dict["image"].putpixel(
                    (scaled_pixel_x - dx, scaled_pixel_y), (255, 0, 0)
                )  # Horizontal line
            # Draw the arrowhead
            for dy in range(1, cross_size // 2 + 1):
                data_dict["image"].putpixel(
                    (
                        scaled_pixel_x - dy - 1,
                        scaled_pixel_y + dy,
                    ),
                    (255, 0, 0),
                )
                data_dict["image"].putpixel(
                    (
                        scaled_pixel_x - dy - 1,
                        scaled_pixel_y - dy,
                    ),
                    (255, 0, 0),
                )
        else:
            return self.__getitem__((index + 1) % self.__len__(True))

        data_dict["problem"], data_dict["thinking"], data_dict["solution"] = (
            generate_prompt_depth_sft(depth)
        )

        data_dict["system"] = "You are a helpful assistant."
        return data_dict

    def __getitem__(self, index):
        id_dataset = random.choices(
            range(len(self.list_data_dict)), weights=self.sample_weights, k=1
        )[0]
        if "taskonomy" in self.image_folder[id_dataset]:
            return self.getitem_Taskonomy(index, id_dataset)
        else:
            return self.getitem_noTaskonomy(index, id_dataset)


class dataset_inference(Dataset):
    """Dataset for deterministic inference. Each image-pixel pair is processed exactly once."""

    def __init__(
        self,
        data_path: str,
        image_folder: str,
        normalized_focal_length=750.0,  # set to the intrinsics after original resize if needed
    ) -> None:
        super(dataset_inference, self).__init__()
        self.normalized_focal_length = normalized_focal_length

        logger.info(f"reading data from {data_path=}, {image_folder=}")
        if ".jsonl" in data_path:
            with open(data_path, "r") as f:
                json_content = f.read()
            self.list_data_dict = pd.read_json(
                StringIO(json_content), lines=True
            ).to_dict(orient="records")
        else:
            with open(data_path, "r") as f:
                self.list_data_dict = json.load(f)

        self.data_path = data_path
        self.image_folder = image_folder

        # Filter before counting pixels so the index mapping matches the kept records
        if "scannet" in data_path:
            self.list_data_dict = self.list_data_dict[
                int(len(self.list_data_dict) * 0.98) :
            ]  # keep the last 2% for evaluation

            random.seed(42)
            random.shuffle(self.list_data_dict)

        # Number of points per image, e.g., [1, 2, 3, 4, 5]
        self.num_pixels: list[int] = [
            len(data_dict["pixel_coords"]) for data_dict in self.list_data_dict
        ]
        # Cumulative sum of number of points, e.g., [1, 3, 6, 10, 15]
        self.num_pixels_cumsum = np.cumsum(self.num_pixels)
        logger.info(
            f"{dataset_inference.__name__} has {len(self.num_pixels_cumsum)}"
            f" images, {self.num_pixels_cumsum[-1]} pixels"
        )

    def __len__(self) -> int:
        """
        Each image-pixel pair is a sample, so the dataset length equals
        total number of pixels.
        """
        return int(self.num_pixels_cumsum[-1])

    def extract_image_and_meta(self, index: int) -> dict[str, Any]:
        """
        Index mapping example:
        self.num_pixels =        [1, 2, 3, 4, 5]
        self.num_pixels_cumsum = [1, 3, 6, 10, 15]
        index = 0 -> index + 1 = 1 -> image_index = 0, pixel_index = 0
        index = 1 -> index + 1 = 2 -> image_index = 1, pixel_index = 0
        index = 2 -> index + 1 = 3 -> image_index = 1, pixel_index = 1
        index = 3 -> index + 1 = 4 -> image_index = 2, pixel_index = 0
        """
        image_index: int = bisect.bisect_left(self.num_pixels_cumsum, index + 1)
        pixel_index: int = (
            index - int(self.num_pixels_cumsum[image_index - 1])
            if image_index > 0
            else index
        )
        logger.debug(f"Loading sample {index=}: {image_index=}, {pixel_index=}")
        assert pixel_index >= 0 and pixel_index < self.num_pixels[image_index]

        data_dict: dict[str, Any] = {}

        # Step 1: Load image
        data_dict["image"] = Image.open(
            os.path.join(
                self.image_folder, self.list_data_dict[image_index]["image"].lstrip("/")
            )
        )

        # Step 2: Load intrinsics and rescale image and intrinsics to target focal length
        intrinsics = self.list_data_dict[image_index]["intrinsics"][:4]
        if intrinsics[0] == 0.0:  # handle intrinsic errors
            intrinsics[0] = intrinsics[1]
        if intrinsics[1] == 0.0:
            intrinsics[1] = intrinsics[0]

        data_dict["image"], intrinsics_new = undistort_image(
            intrinsics, data_dict["image"]
        )

        data_dict["image"], intrinsics_new = normalizing_focal_length(
            self.normalized_focal_length, intrinsics_new, data_dict["image"]
        )

        # Step 3: Load pixel coordinates and rescale it
        pixel_coord = self.list_data_dict[image_index]["pixel_coords"][pixel_index]
        scaled_pixel_x = int(
            (pixel_coord[0] - intrinsics[2]) * (intrinsics_new[0] / intrinsics[0])
            + intrinsics_new[2]
        )
        scaled_pixel_y = int(
            (pixel_coord[1] - intrinsics[3]) * (intrinsics_new[1] / intrinsics[1])
            + intrinsics_new[3]
        )
        pixel_coord: tuple[int, int] = (scaled_pixel_x, scaled_pixel_y)

        # Step 4: Load depth
        depth: float = self.list_data_dict[image_index]["depth"][pixel_index]

        # Step 5: Draw marker on the image
        cross_size = 5  # Adjustable cross size

        # Check if the adjustable cross can be drawn
        if (
            cross_size <= scaled_pixel_x < data_dict["image"].width - cross_size
            and cross_size <= scaled_pixel_y < data_dict["image"].height - cross_size
        ):
            # Draw an arrow-shaped marker (-->) whose tip points at the pixel
            for dx in range(1, cross_size + 1):
                data_dict["image"].putpixel(
                    (scaled_pixel_x - dx, scaled_pixel_y), (255, 0, 0)
                )  # Horizontal line
            # Draw the arrowhead
            for dy in range(1, cross_size // 2 + 1):
                data_dict["image"].putpixel(
                    (
                        scaled_pixel_x - dy - 1,
                        scaled_pixel_y + dy,
                    ),
                    (255, 0, 0),
                )
                data_dict["image"].putpixel(
                    (
                        scaled_pixel_x - dy - 1,
                        scaled_pixel_y - dy,
                    ),
                    (255, 0, 0),
                )
        else:
            logger.error(
                "Marker cannot be drawn because pixel is too close to the border. Skipped."
            )
            return None

        data_dict["pixel_coord"] = pixel_coord
        data_dict["intrinsics"] = intrinsics_new
        data_dict["depth"] = depth
        return data_dict

    def __getitem__(self, index) -> dict[str, Any]:
        if index < 0 or index >= self.__len__():
            raise ValueError(
                f"Index out of range: {index}. Dataset size = {self.__len__()}"
            )

        data_dict: dict[str, Any] = self.extract_image_and_meta(index)

        # generate prompt
        data_dict["problem"], data_dict["thinking"], data_dict["solution"] = (
            generate_prompt_depth_sft(
                data_dict["depth"],
                is_eval=True,
            )
        )

        data_dict["system"] = "You are a helpful assistant."

        data_dict["prompt"] = [
            {
                "content": [
                    {"image": data_dict["image"], "type": "image"},
                    {"text": data_dict["problem"], "type": "text"},
                ],
                "role": "user",
            }
        ]
        return data_dict

```
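
The flat-index lookup at the top of `extract_image_and_meta` can be tried in isolation. The following is a minimal standalone sketch, assuming per-image pixel counts of `[1, 2, 3]` (made-up values chosen to agree with the docstring examples). `locate` is a hypothetical helper name and is not part of the repository.

```python
import bisect
from itertools import accumulate


def locate(index: int, num_pixels_cumsum: list[int]) -> tuple[int, int]:
    """Map a flat sample index to (image_index, pixel_index).

    Hypothetical helper mirroring the bisect-based lookup in the dataset class.
    """
    # First image whose cumulative pixel count covers position index + 1
    image_index = bisect.bisect_left(num_pixels_cumsum, index + 1)
    # Offset of the sample inside that image
    pixel_index = (
        index - num_pixels_cumsum[image_index - 1] if image_index > 0 else index
    )
    return image_index, pixel_index


# Assumed example data: three images with 1, 2 and 3 annotated pixels
num_pixels = [1, 2, 3]
num_pixels_cumsum = list(accumulate(num_pixels))  # [1, 3, 6]

for flat_index in range(sum(num_pixels)):
    print(flat_index, locate(flat_index, num_pixels_cumsum))
```

Using `index + 1` with `bisect_left` sends an index that sits exactly on a cumulative boundary to the next image, which is what the docstring examples describe.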

## /utils/evaluation.py

```py path="/utils/evaluation.py" 
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.

# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

import os
import subprocess
from typing import Dict, TYPE_CHECKING, Union

from .hub import get_gpu_count_for_vllm, get_param_count_from_repo_id


if TYPE_CHECKING:
    from trl import GRPOConfig, ModelConfig, SFTConfig


# We need a special environment setup to launch vLLM from within Slurm training jobs.
# - Reference code: https://github.com/huggingface/brrr/blob/c55ba3505686d690de24c7ace6487a5c1426c0fd/brrr/lighteval/one_job_runner.py#L105
# - Slack thread: https://huggingface.slack.com/archives/C043JTYE1MJ/p1726566494958269
user_home_directory = os.path.expanduser("~")
VLLM_SLURM_PREFIX = [
    "env",
    "-i",
    "bash",
    "-c",
    f"for f in /etc/profile.d/*.sh; do source $f; done; export HOME={user_home_directory}; sbatch ",
]


def register_lighteval_task(
    configs: Dict[str, str],
    eval_suite: str,
    task_name: str,
    task_list: str,
    num_fewshot: int = 0,
):
    """Registers a LightEval task configuration.

    - Core tasks can be added from this table: https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/tasks_table.jsonl
    - Custom tasks that require their own metrics / scripts, should be stored in scripts/evaluation/extended_lighteval_tasks

    Args:
        configs (Dict[str, str]): The dictionary to store the task configuration.
        eval_suite (str): The evaluation suite, e.g. "extended", "lighteval" or "custom".
        task_name (str): The key under which the formatted task list is stored in `configs`.
        task_list (str): A comma-separated list of task names. Each name is expanded to "{eval_suite}|{task}|{num_fewshot}|0".
        num_fewshot (int, optional): The number of few-shot examples. Defaults to 0.
    """
    # Format task list in lighteval format
    task_list = ",".join(
        f"{eval_suite}|{task}|{num_fewshot}|0" for task in task_list.split(",")
    )
    configs[task_name] = task_list


LIGHTEVAL_TASKS = {}

register_lighteval_task(LIGHTEVAL_TASKS, "custom", "math_500", "math_500", 0)
register_lighteval_task(LIGHTEVAL_TASKS, "custom", "aime24", "aime24", 0)


def get_lighteval_tasks():
    return list(LIGHTEVAL_TASKS.keys())


SUPPORTED_BENCHMARKS = get_lighteval_tasks()


def run_lighteval_job(
    benchmark: str,
    training_args: Union["SFTConfig", "GRPOConfig"],
    model_args: "ModelConfig",
) -> None:
    task_list = LIGHTEVAL_TASKS[benchmark]
    model_name = training_args.hub_model_id
    model_revision = training_args.hub_model_revision
    # Models with >= 30B params are sharded across the GPUs (tensor parallelism) to avoid OOM
    num_gpus = get_gpu_count_for_vllm(model_name, model_revision)
    tensor_parallel = get_param_count_from_repo_id(model_name) >= 30_000_000_000

    cmd = VLLM_SLURM_PREFIX.copy()
    cmd_args = [
        f"--gres=gpu:{num_gpus}",
        f"--job-name=or1_{benchmark}_{model_name.split('/')[-1]}_{model_revision}",
        "slurm/eval_callback.slurm",
        benchmark,
        f'"{task_list}"',
        model_name,
        model_revision,
        f"{tensor_parallel}",
        f"{model_args.trust_remote_code}",
    ]
    if training_args.system_prompt is not None:
        cmd_args.append(f"--system_prompt={training_args.system_prompt}")
    cmd[-1] += " " + " ".join(cmd_args)
    subprocess.run(cmd, check=True)


def run_benchmark_jobs(
    training_args: Union["SFTConfig", "GRPOConfig"], model_args: "ModelConfig"
) -> None:
    benchmarks = training_args.benchmarks
    if len(benchmarks) == 1 and benchmarks[0] == "all":
        benchmarks = get_lighteval_tasks()
        # Evaluate on all supported benchmarks. Later we may want to include a `chat` option
        # that just evaluates on `ifeval` and `mt_bench` etc.

    for benchmark in benchmarks:
        print(f"Launching benchmark `{benchmark}`")
        if benchmark in get_lighteval_tasks():
            run_lighteval_job(benchmark, training_args, model_args)
        else:
            raise ValueError(f"Unknown benchmark {benchmark}")

```
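
The task-string expansion done by `register_lighteval_task` can be seen without any LightEval or Slurm setup. The following is a small sketch of the formatting step only; `format_task_list` is a hypothetical name introduced here for illustration.

```python
def format_task_list(eval_suite: str, task_list: str, num_fewshot: int = 0) -> str:
    """Expand comma-separated task names into "suite|task|fewshot|0" entries.

    Hypothetical helper that repeats the join used in register_lighteval_task.
    """
    return ",".join(
        f"{eval_suite}|{task}|{num_fewshot}|0" for task in task_list.split(",")
    )


# Registering two tasks under one key, as the module does for single tasks
configs: dict[str, str] = {}
configs["math_suite"] = format_task_list("custom", "math_500,aime24")
print(configs["math_suite"])
```

Note that task names are not stripped of whitespace, so `"math_500, aime24"` would produce an entry with a leading space.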

## /utils/hub.py

```py path="/utils/hub.py" 
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.

# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

import logging
import re

from huggingface_hub import (
    create_branch,
    create_repo,
    get_safetensors_metadata,
    list_repo_commits,
    list_repo_files,
    list_repo_refs,
    repo_exists,
    upload_folder,
)

from transformers import AutoConfig
from trl import GRPOConfig, SFTConfig


logger = logging.getLogger(__name__)


def push_to_hub_revision(
    training_args: SFTConfig | GRPOConfig, extra_ignore_patterns=()
) -> bool:
    """Pushes the model to a branch on a Hub repo."""

    # Create a repo if it doesn't exist yet
    repo_url = create_repo(
        repo_id=training_args.hub_model_id, private=True, exist_ok=True
    )
    # Get initial commit to branch from
    initial_commit = list_repo_commits(training_args.hub_model_id)[-1]
    # Now create the branch we'll be pushing to
    create_branch(
        repo_id=training_args.hub_model_id,
        branch=training_args.hub_model_revision,
        revision=initial_commit.commit_id,
        exist_ok=True,
    )
    logger.info(f"Created target repo at {repo_url}")
    logger.info(f"Pushing to the Hub revision {training_args.hub_model_revision}...")
    ignore_patterns = ["checkpoint-*", "*.pth"]
    ignore_patterns.extend(extra_ignore_patterns)
    upload_folder(
        repo_id=training_args.hub_model_id,
        folder_path=training_args.output_dir,
        revision=training_args.hub_model_revision,
        commit_message=f"Add {training_args.hub_model_revision} checkpoint",
        ignore_patterns=ignore_patterns,
    )
    logger.info(
        f"Pushed to {repo_url} revision {training_args.hub_model_revision} successfully!"
    )

    return True


def check_hub_revision_exists(training_args: SFTConfig | GRPOConfig):
    """Checks if a given Hub revision exists."""
    if repo_exists(training_args.hub_model_id):
        if training_args.push_to_hub_revision is True:
            # First check if the revision exists
            revisions = [
                rev.name for rev in list_repo_refs(training_args.hub_model_id).branches
            ]
            # If the revision exists, we next check it has a README file
            if training_args.hub_model_revision in revisions:
                repo_files = list_repo_files(
                    repo_id=training_args.hub_model_id,
                    revision=training_args.hub_model_revision,
                )
                if (
                    "README.md" in repo_files
                    and training_args.overwrite_hub_revision is False
                ):
                    raise ValueError(
                        f"Revision {training_args.hub_model_revision} already exists. "
                        "Use --overwrite_hub_revision to overwrite it."
                    )


def get_param_count_from_repo_id(repo_id: str) -> int:
    """Function to get model param counts from safetensors metadata or find patterns like 42m, 1.5b, 0.5m or products like 8x7b in a repo ID."""
    try:
        metadata = get_safetensors_metadata(repo_id)
        return list(metadata.parameter_count.values())[0]
    except Exception:
        # Pattern to match products (like 8x7b) and single values (like 42m)
        pattern = r"((\d+(\.\d+)?)(x(\d+(\.\d+)?))?)([bm])"
        matches = re.findall(pattern, repo_id.lower())

        param_counts = []
        for full_match, number1, _, _, number2, _, unit in matches:
            if number2:  # If there's a second number, it's a product
                number = float(number1) * float(number2)
            else:  # Otherwise, it's a single value
                number = float(number1)

            if unit == "b":
                number *= 1_000_000_000  # Billions -> absolute parameter count
            elif unit == "m":
                number *= 1_000_000  # Millions -> absolute parameter count

            param_counts.append(number)

        if len(param_counts) > 0:
            # Return the largest number
            return int(max(param_counts))
        else:
            # Return -1 if no match found
            return -1


def get_gpu_count_for_vllm(
    model_name: str, revision: str = "main", num_gpus: int = 8
) -> int:
    """vLLM enforces a constraint that the number of attention heads must be divisible by the number of GPUs and 64 must be divisible by the number of GPUs.
    This function calculates the number of GPUs to use for decoding based on the number of attention heads in the model.
    """
    config = AutoConfig.from_pretrained(
        model_name, revision=revision, trust_remote_code=True
    )
    # Get number of attention heads
    num_heads = config.num_attention_heads
    # Reduce num_gpus so that num_heads is divisible by num_gpus and 64 is divisible by num_gpus
    while num_heads % num_gpus != 0 or 64 % num_gpus != 0:
        logger.info(
            f"Reducing num_gpus from {num_gpus} to {num_gpus - 1} to make num_heads divisible by num_gpus"
        )
        num_gpus -= 1
    return num_gpus

```
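
Two pieces of `/utils/hub.py` are pure logic and can be exercised offline: the repo-name fallback in `get_param_count_from_repo_id` and the GPU-count reduction in `get_gpu_count_for_vllm`. The sketch below repeats both without any Hub or `transformers` calls. The function names `param_count_from_name` and `largest_valid_gpu_count` are hypothetical, and the repo IDs are used only as example strings.

```python
import re

# Same pattern as the fallback branch: products (8x7b) and single values (42m)
PATTERN = r"((\d+(\.\d+)?)(x(\d+(\.\d+)?))?)([bm])"


def param_count_from_name(repo_id: str) -> int:
    """Estimate a parameter count from a repo name, or return -1 if none is found."""
    counts = []
    for _full, number1, _, _, number2, _, unit in re.findall(PATTERN, repo_id.lower()):
        number = float(number1) * float(number2) if number2 else float(number1)
        number *= 1_000_000_000 if unit == "b" else 1_000_000
        counts.append(number)
    return int(max(counts)) if counts else -1


def largest_valid_gpu_count(num_heads: int, num_gpus: int = 8) -> int:
    """Decrease num_gpus until it divides both num_heads and 64."""
    while num_heads % num_gpus != 0 or 64 % num_gpus != 0:
        num_gpus -= 1
    return num_gpus


print(param_count_from_name("mistralai/Mixtral-8x7B-v0.1"))
print(param_count_from_name("Qwen/Qwen2.5-1.5B-Instruct"))
print(largest_valid_gpu_count(28))
```

The loop always terminates because 1 divides every integer. In this sketch a name such as `8x7B` is read as the product 8 × 7 billion, matching the fallback branch above.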

## /utils/metrics.py

```py path="/utils/metrics.py" 
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.

# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.

import re

from math_verify import (  # @manual=fbsource//third-party/pypi/math-verify:math-verify
    parse,
)


def delta1_metric(contents, solution, **kwargs):
    """Delta-1 accuracy: scores 1.0 when the predicted depth is within a factor of 1.25 of the ground truth (max(pred / gt, gt / pred) < 1.25), else 0.0."""
    rewards = []
    for content, sol in zip(contents, solution):
        reward = -1.0  # Sentinel: not scored yet
        # First attempt: convert the parsed completion directly to a float
        try:
            answer = float(parse(content))
            reward = float(max(answer / float(sol), float(sol) / answer) < 1.25)
        except Exception:
            pass  # Fall through to the tag-aware path below

        # Fallback: use the first parsed value and a ground truth taken from the solution string
        if reward == -1.0:
            # Extract the ground truth from <answer> tags if the solution has them
            sol_match = re.search(r"<answer>(.*?)</answer>", sol)
            ground_truth = float(
                sol_match.group(1).strip() if sol_match else sol.strip()
            )
            try:
                student_answer = float(parse(content)[0])
                reward = (
                    1.0
                    if max(student_answer / ground_truth, ground_truth / student_answer)
                    < 1.25
                    else 0.0
                )
            except Exception as e:
                print("error: ", e, "during solution parsing, content = ", content)
                reward = 0.0

        rewards.append(reward)

    return rewards


METRIC_CLASSES = {
    "delta1_metric": delta1_metric,
}

```
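
The threshold test inside `delta1_metric` can be written on plain floats. The following is a minimal sketch without `math_verify` parsing; `delta1` and `delta1_accuracy` are hypothetical names, and treating non-positive values as misses is an assumption made here rather than behavior taken from the source.

```python
def delta1(pred: float, gt: float, threshold: float = 1.25) -> bool:
    """Return True when pred is within a factor of `threshold` of gt."""
    # Assumption: non-positive depths count as a miss instead of raising
    if pred <= 0 or gt <= 0:
        return False
    return max(pred / gt, gt / pred) < threshold


def delta1_accuracy(preds: list[float], gts: list[float]) -> float:
    """Fraction of predictions that pass the delta-1 test."""
    hits = [delta1(p, g) for p, g in zip(preds, gts)]
    return sum(hits) / len(hits)


# Assumed example depths in meters
print(delta1(0.9, 1.0))
print(delta1(2.0, 1.0))
print(delta1_accuracy([1.0, 2.0, 3.0], [1.1, 3.0, 3.0]))
```

The comparison is strict, so a prediction off by exactly a factor of 1.25 does not count as correct.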

