``` ├── .github/ ├── workflows/ ├── lint.yaml ├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── MODEL_CARD.md ├── README.md ├── conda-extras.yaml ├── conda.yaml ├── dinov2/ ├── __init__.py ├── configs/ ├── __init__.py ├── eval/ ├── vitb14_pretrain.yaml ├── vitb14_reg4_pretrain.yaml ├── vitg14_pretrain.yaml ├── vitg14_reg4_pretrain.yaml ├── vitl14_pretrain.yaml ├── vitl14_reg4_pretrain.yaml ├── vits14_pretrain.yaml ├── vits14_reg4_pretrain.yaml ├── ssl_default_config.yaml ├── train/ ├── vitg14.yaml ├── vitl14.yaml ├── vitl16_short.yaml ├── data/ ├── __init__.py ├── adapters.py ├── augmentations.py ├── collate.py ├── datasets/ ├── __init__.py ├── decoders.py ├── extended.py ├── image_net.py ├── image_net_22k.py ├── loaders.py ├── masking.py ├── samplers.py ├── transforms.py ├── distributed/ ├── __init__.py ├── eval/ ├── __init__.py ├── depth/ ├── __init__.py ├── models/ ├── __init__.py ├── backbones/ ├── __init__.py ├── vision_transformer.py ├── builder.py ├── decode_heads/ ├── __init__.py ├── decode_head.py ├── dpt_head.py ├── linear_head.py ├── depther/ ├── __init__.py ├── base.py ├── encoder_decoder.py ├── losses/ ├── __init__.py ├── gradientloss.py ├── sigloss.py ├── ops/ ├── __init__.py ├── wrappers.py ├── knn.py ├── linear.py ├── log_regression.py ├── metrics.py ├── segmentation/ ├── __init__.py ├── hooks/ ├── __init__.py ├── optimizer.py ├── models/ ├── __init__.py ├── backbones/ ├── __init__.py ├── vision_transformer.py ├── decode_heads/ ├── __init__.py ├── linear_head.py ├── utils/ ├── __init__.py ├── colormaps.py ├── segmentation_m2f/ ├── __init__.py ├── core/ ├── __init__.py ├── anchor/ ├── __init__.py ├── builder.py ├── point_generator.py ``` ## /.github/workflows/lint.yaml ```yaml path="/.github/workflows/lint.yaml" name: Lint on: push: branches: - main pull_request: branches: - main jobs: run-linters: name: Run linters runs-on: ubuntu-20.04 steps: - name: Checkout repository uses: actions/checkout@v3 - name: Set up Python uses: 
actions/setup-python@v4 with: python-version: 3.9 cache: 'pip' cache-dependency-path: '**/requirements*.txt' - name: Install Python (development) dependencies run: | pip install -r requirements-dev.txt - name: Run flake8 run: | flake8 - name: Run black if: always() run: | black --check dinov2 - name: Run pylint if: always() run: | pylint --exit-zero dinov2 ``` ## /.gitignore ```gitignore path="/.gitignore" build/ dist/ *.egg-info/ **/__pycache__/ **/.ipynb_checkpoints **/.ipynb_checkpoints/** *.swp .vscode/ ``` ## /CODE_OF_CONDUCT.md # Code of Conduct ## Our Pledge In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to make participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation. 
## Our Standards Examples of behavior that contributes to creating a positive environment include: * Using welcoming and inclusive language * Being respectful of differing viewpoints and experiences * Gracefully accepting constructive criticism * Focusing on what is best for the community * Showing empathy towards other community members Examples of unacceptable behavior by participants include: * The use of sexualized language or imagery and unwelcome sexual attention or advances * Trolling, insulting/derogatory comments, and personal or political attacks * Public or private harassment * Publishing others' private information, such as a physical or electronic address, without explicit permission * Other conduct which could reasonably be considered inappropriate in a professional setting ## Our Responsibilities Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior. Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful. ## Scope This Code of Conduct applies within all project spaces, and it also applies when an individual is representing the project or its community in public spaces. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event. Representation of a project may be further defined and clarified by project maintainers. 
This Code of Conduct also applies outside the project spaces when there is a reasonable belief that an individual's behavior may have a negative impact on the project or its community. ## Enforcement Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at . All complaints will be reviewed and investigated and will result in a response that is deemed necessary and appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately. Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership. ## Attribution This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4, available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html [homepage]: https://www.contributor-covenant.org For answers to common questions about this code of conduct, see https://www.contributor-covenant.org/faq ## /CONTRIBUTING.md # Contributing to DINOv2 We want to make contributing to this project as easy and transparent as possible. ## Pull Requests We actively welcome your pull requests. 1. Fork the repo and create your branch from `main`. 2. If you've added code that should be tested, add tests. 3. If you've changed APIs, update the documentation. 4. Ensure the test suite passes. 5. Make sure your code lints. 6. If you haven't already, complete the Contributor License Agreement ("CLA"). ## Contributor License Agreement ("CLA") In order to accept your pull request, we need you to submit a CLA. You only need to do this once to work on any of Meta's open source projects. Complete your CLA here: ## Issues We use GitHub issues to track public bugs. 
Please ensure your description is clear and has sufficient instructions to be able to reproduce the issue. Meta has a [bounty program](https://www.facebook.com/whitehat/) for the safe disclosure of security bugs. In those cases, please go through the process outlined on that page and do not file a public issue. ## License By contributing to DINOv2, you agree that your contributions will be licensed under the LICENSE file in the root directory of this source tree. ## /LICENSE ``` path="/LICENSE" Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. 
"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. 
Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. 
You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. 
Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. 
In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives. 
Copyright [yyyy] [name of copyright owner] Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. ``` ## /MODEL_CARD.md # Model Card for DINOv2-S/B/L/g These are Vision Transformer models trained following the method described in the papers: "DINOv2: Learning Robust Visual Features without Supervision" and "Vision Transformers Need Registers". We provide 8 models: - 1 ViT-g trained from scratch with 3 ViT-S/B/L models distilled from the ViT-g, without registers. - 1 ViT-g trained from scratch with 3 ViT-S/B/L models distilled from the ViT-g, with registers. ## Model Details The model takes an image as input and returns a class token and patch tokens, and optionally 4 register tokens. The embedding dimension is: - 384 for ViT-S. - 768 for ViT-B. - 1024 for ViT-L. - 1536 for ViT-g. The models follow a Transformer architecture, with a patch size of 14. In the case of registers, we add 4 register tokens, learned during training, to the input sequence after the patch embedding. For a 224x224 image, this results in 1 class token + 256 patch tokens, and optionally 4 register tokens. The models can accept larger images provided the image shapes are multiples of the patch size (14). If this condition is not verified, the model will crop to the closest smaller multiple of the patch size. 
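The token arithmetic described above can be checked with a few lines of Python. This is only a sketch under the assumptions stated in this model card (patch size 14, cropping to the closest smaller multiple of the patch size, 4 optional register tokens); `cropped_size` and `token_count` are hypothetical helper names, not functions from the DINOv2 code base.

```python
PATCH_SIZE = 14  # patch size stated in the model card


def cropped_size(height, width, patch_size=PATCH_SIZE):
    """Largest (height, width) not exceeding the input that is a multiple of the patch size."""
    return (height // patch_size) * patch_size, (width // patch_size) * patch_size


def token_count(height, width, num_registers=0, patch_size=PATCH_SIZE):
    """Number of tokens: 1 class token + patch tokens + optional register tokens."""
    h, w = cropped_size(height, width, patch_size)
    num_patches = (h // patch_size) * (w // patch_size)
    return 1 + num_patches + num_registers


print(token_count(224, 224))                   # 1 class token + 256 patch tokens
print(token_count(224, 224, num_registers=4))  # same, plus 4 register tokens
print(cropped_size(230, 500))                  # a shape that is not a multiple of 14
```

For a 224x224 input the helper reproduces the 1 + 256 (+ 4) tokens quoted above.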
### Model Description

- **Developed by:** Meta AI
- **Model type:** Vision Transformer
- **License:** Apache License 2.0
- **Repository:** https://github.com/facebookresearch/dinov2
- **Paper:** https://arxiv.org/abs/2304.07193
- **Demo:** https://dinov2.metademolab.com/

## Uses

The models are vision backbones providing multi-purpose features for downstream tasks.

### Direct Use

The models can be used without fine-tuning, with downstream classifiers as simple as linear layers, to obtain competitive results:

- on depth estimation, semantic segmentation, using linear layers.
- on image classification, using k-NN classifiers on the class token.
- on image classification, with logistic regression classifiers applied on the class token.
- on image classification, with a linear layer applied on the class token and the average of the patch tokens.
- on image retrieval using nearest neighbors.

### Downstream Use

It is technically possible to perform fine-tuning on the models, for small gains (we measured +2% on ImageNet-1k classification). We recommend keeping this as a very last step and only when necessary, as the features already provide good performance out-of-the-box.

## Bias, Risks, and Limitations

Despite improvements thanks to the training method not using annotations, we still observe significant biases in our models toward rich households from Western countries.

### Recommendations

We expect fine-tuning will increase the biases in the features produced by the model as they will be tuned to the fine-tuning labels.

## How to Get Started with the Model

Use the code below to get started with the model.
```python
import torch

# DINOv2
dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
dinov2_vitb14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
dinov2_vitl14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
dinov2_vitg14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14')

# DINOv2 with registers
dinov2_vits14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_reg')
dinov2_vitb14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14_reg')
dinov2_vitl14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14_reg')
dinov2_vitg14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_reg')
```

## Training Details

### Training Data

- **Training data:** LVD-142M (see paper)
- **Training regime:** fp16 using PyTorch-FSDP mixed-precision.

### Training Procedure

- **Training objective:**
  - DINO self-distillation loss with multi-crop
  - iBOT masked-image modeling loss
  - KoLeo regularization on [CLS] tokens
- **Architectures:**
  - ViT-S (21M params): Patch size 14, embedding dimension 384, 6 heads, MLP FFN
  - ViT-B (86M params): Patch size 14, embedding dimension 768, 12 heads, MLP FFN
  - ViT-L (0.3B params): Patch size 14, embedding dimension 1024, 16 heads, MLP FFN
  - ViT-g (1.1B params): Patch size 14, embedding dimension 1536, 24 heads, SwiGLU FFN
- **Distillation:**
  - Distillation follows the standard DINOv2 pretraining procedure, except the teacher is a pretrained ViT-g, frozen.

## Evaluation

We refer users to the associated papers for the evaluation protocols.

| model | with registers | ImageNet-1k classif. (acc), k-NN | ImageNet-1k classif. (acc), linear | ImageNet-1k classif. V2 (acc), linear | NYU-Depth v2 depth (RMSE), linear 4 layers | SUN-RGBD depth (RMSE), NYU-D transfer | ADE20k segm. (mAP), multiscale | iNaturalist 2018 classif. (acc), linear | Oxford-H retrieval (mAP), nearest neighbor |
|---|---|---|---|---|---|---|---|---|---|
| ViT-S/14 | :x: | 79.0% | 81.1% | 70.8% | 0.417 | 0.431 | 47.2 | 69.5% | 43.2 |
| ViT-S/14 | :white_check_mark: | 79.1% | 80.9% | 71.0% | N/A | N/A | N/A | 67.6% | 39.5 |
| ViT-B/14 | :x: | 82.1% | 84.5% | 74.9% | 0.362 | 0.400 | 51.3 | 76.3% | 49.5 |
| ViT-B/14 | :white_check_mark: | 82.0% | 84.6% | 75.6% | N/A | N/A | N/A | 73.8% | 51.0 |
| ViT-L/14 | :x: | 83.5% | 86.3% | 77.6% | 0.333 | 0.396 | 53.1 | 79.8% | 54.0 |
| ViT-L/14 | :white_check_mark: | 83.8% | 86.7% | 78.5% | N/A | N/A | N/A | 80.9% | 55.7 |
| ViT-g/14 | :x: | 83.5% | 86.5% | 78.4% | 0.298 | 0.362 | 53.0 | 81.6% | 52.3 |
| ViT-g/14 | :white_check_mark: | 83.7% | 87.1% | 78.8% | N/A | N/A | N/A | 81.5% | 58.2 |

## Environmental Impact

- **Hardware Type:** Nvidia A100
- **Hours used:** 22,000 for ViT-g, 4,500 for ViT-S distillation, 5,300 for ViT-B distillation, 8,000 for ViT-L distillation
- **Cloud Provider:** Private infra
- **Compute Region:** USA
- **Carbon Emitted:** 7t CO2eq

#### Hardware

Nvidia A100 GPUs

#### Software

PyTorch 2.0, xFormers 0.0.18

**BibTeX**

```
@misc{oquab2023dinov2,
  title={DINOv2: Learning Robust Visual Features without Supervision},
  author={Oquab, Maxime and Darcet, Timothée and Moutakanni, Theo and Vo, Huy and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Howes, Russell and Huang, Po-Yao and Xu, Hu and Sharma, Vasu and Li, Shang-Wen and Galuba, Wojciech and Rabbat, Mike and Assran, Mido and Ballas, Nicolas and Synnaeve, Gabriel and Misra, Ishan and Jegou, Herve and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr},
  journal={arXiv:2304.07193},
  year={2023}
}

@misc{darcet2023vitneedreg,
  title={Vision Transformers Need Registers},
  author={Darcet, Timothée and Oquab, Maxime and Mairal, Julien and Bojanowski, Piotr},
  journal={arXiv:2309.16588},
  year={2023}
}
```

## /README.md

:new: [2023-10-26] *Added DINOv2 backbones with registers, following [Vision Transformers Need Registers](https://arxiv.org/abs/2309.16588).*

# DINOv2: Learning Robust Visual Features without Supervision

**[Meta AI Research, FAIR](https://ai.facebook.com/research/)**

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Patrick Labatut, Armand Joulin, Piotr Bojanowski

[[`Paper #1`](https://arxiv.org/abs/2304.07193)] [[`Paper #2`](https://arxiv.org/abs/2309.16588)] [[`Blog`](https://ai.facebook.com/blog/dino-v2-computer-vision-self-supervised-learning/)] [[`Demo`](https://dinov2.metademolab.com)] [[`BibTeX`](#citing-dinov2)]

PyTorch implementation and pretrained models for DINOv2. For details, see the papers: **[DINOv2: Learning Robust Visual Features without Supervision](https://arxiv.org/abs/2304.07193)** and **[Vision Transformers Need Registers](https://arxiv.org/abs/2309.16588)**.

DINOv2 models produce high-performance visual features that can be directly employed with classifiers as simple as linear layers on a variety of computer vision tasks; these visual features are robust and perform well across domains without any requirement for fine-tuning. The models were pretrained on a dataset of 142 M images without using any labels or annotations.

https://github.com/facebookresearch/dinov2/assets/60359573/f168823e-7922-415a-b429-578badf5c356

Visualization of the first three principal components of the patch features of all frames, mapped to RGB values.
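
The kind of visualization described in the caption can be sketched as follows. This is a minimal illustration, not the code used to produce the video: it assumes NumPy is available, works on a single image rather than on all frames, uses random numbers as a stand-in for real patch tokens, and `pca_to_rgb` is a hypothetical helper name.

```python
import numpy as np


def pca_to_rgb(patch_features):
    """Project (num_patches, dim) features onto their first three principal
    components and rescale each component to [0, 1] so it can be shown as RGB."""
    x = np.asarray(patch_features, dtype=np.float64)
    x = x - x.mean(axis=0, keepdims=True)  # center the features
    # Rows of vt are the principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    projected = x @ vt[:3].T  # shape (num_patches, 3)
    lo = projected.min(axis=0, keepdims=True)
    hi = projected.max(axis=0, keepdims=True)
    return (projected - lo) / np.maximum(hi - lo, 1e-12)


# Stand-in for the 256 patch tokens of a 224x224 image from a ViT-S/14 (dimension 384).
rng = np.random.default_rng(0)
features = rng.normal(size=(256, 384))
rgb = pca_to_rgb(features).reshape(16, 16, 3)  # one RGB value per 14x14 patch
print(rgb.shape)
```

With real features, the `(16, 16, 3)` array can be upsampled to the image resolution for display.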
## Pretrained models

| model | # of params | with registers | ImageNet k-NN | ImageNet linear | download |
|---|---|---|---|---|---|
| ViT-S/14 distilled | 21 M | :x: | 79.0% | 81.1% | backbone only |
| ViT-S/14 distilled | 21 M | :white_check_mark: | 79.1% | 80.9% | backbone only |
| ViT-B/14 distilled | 86 M | :x: | 82.1% | 84.5% | backbone only |
| ViT-B/14 distilled | 86 M | :white_check_mark: | 82.0% | 84.6% | backbone only |
| ViT-L/14 distilled | 300 M | :x: | 83.5% | 86.3% | backbone only |
| ViT-L/14 distilled | 300 M | :white_check_mark: | 83.8% | 86.7% | backbone only |
| ViT-g/14 | 1,100 M | :x: | 83.5% | 86.5% | backbone only |
| ViT-g/14 | 1,100 M | :white_check_mark: | 83.7% | 87.1% | backbone only |

### Pretrained backbones (via PyTorch Hub)

Please follow the instructions [here](https://pytorch.org/get-started/locally/) to install PyTorch (the only required dependency for loading the model). Installing PyTorch with CUDA support is strongly recommended.

A corresponding [model card](MODEL_CARD.md) is included in the repository.

```python
import torch

# DINOv2
dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
dinov2_vitb14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
dinov2_vitl14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
dinov2_vitg14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14')

# DINOv2 with registers
dinov2_vits14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_reg')
dinov2_vitb14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14_reg')
dinov2_vitl14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14_reg')
dinov2_vitg14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_reg')
```

### Pretrained heads - Image classification

| backbone | with registers | download (ImageNet) |
|---|---|---|
| ViT-S/14 distilled | :x: | linear head (1 layer, 4 layers) |
| ViT-S/14 distilled | :white_check_mark: | linear head (1 layer, 4 layers) |
| ViT-B/14 distilled | :x: | linear head (1 layer, 4 layers) |
| ViT-B/14 distilled | :white_check_mark: | linear head (1 layer, 4 layers) |
| ViT-L/14 distilled | :x: | linear head (1 layer, 4 layers) |
| ViT-L/14 distilled | :white_check_mark: | linear head (1 layer, 4 layers) |
| ViT-g/14 | :x: | linear head (1 layer, 4 layers) |
| ViT-g/14 | :white_check_mark: | linear head (1 layer, 4 layers) |

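
The PyTorch Hub entry points listed in this README appear to follow a regular naming pattern: `dinov2_vit`, the architecture letter, the patch size `14`, then optional `_reg` and `_lc` suffixes. The sketch below only builds those strings; `hub_entry_name` is a hypothetical helper, not part of the repository, and it does not call `torch.hub.load` itself.

```python
from itertools import product

ARCHS = ("s", "b", "l", "g")  # ViT-S, ViT-B, ViT-L, ViT-g


def hub_entry_name(arch, registers=False, linear_classifier=False):
    """Build an entry-point name following the pattern seen in this README."""
    if arch not in ARCHS:
        raise ValueError(f"unknown architecture: {arch!r}")
    name = f"dinov2_vit{arch}14"
    if registers:
        name += "_reg"  # backbone trained with register tokens
    if linear_classifier:
        name += "_lc"   # backbone plus linear classification head
    return name


# Enumerate every combination: 4 architectures x registers x classifier head.
names = [
    hub_entry_name(arch, registers, linear_classifier)
    for linear_classifier, registers, arch in product((False, True), (False, True), ARCHS)
]
print(names)
```

A name produced this way would then be passed as the second argument of `torch.hub.load('facebookresearch/dinov2', ...)`.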
The (full) classifier models can be loaded via PyTorch Hub:

```python
import torch

# DINOv2
dinov2_vits14_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_lc')
dinov2_vitb14_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14_lc')
dinov2_vitl14_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14_lc')
dinov2_vitg14_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_lc')

# DINOv2 with registers
dinov2_vits14_reg_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_reg_lc')
dinov2_vitb14_reg_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14_reg_lc')
dinov2_vitl14_reg_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14_reg_lc')
dinov2_vitg14_reg_lc = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_reg_lc')
```

### Pretrained heads - Depth estimation

| backbone | download head (NYUd) | download head (KITTI) |
|---|---|---|
| ViT-S/14 distilled | linear (1 layer, 4 layers), DPT | linear (1 layer, 4 layers), DPT |
| ViT-B/14 distilled | linear (1 layer, 4 layers), DPT | linear (1 layer, 4 layers), DPT |
| ViT-L/14 distilled | linear (1 layer, 4 layers), DPT | linear (1 layer, 4 layers), DPT |
| ViT-g/14 | linear (1 layer, 4 layers), DPT | linear (1 layer, 4 layers), DPT |

### Pretrained heads - Semantic segmentation

| backbone | download model (ADE20K) | download head (ADE20K) | download head (VOC2012) |
|---|---|---|---|
| ViT-S/14 distilled | | linear, multi-scale | linear, multi-scale |
| ViT-B/14 distilled | | linear, multi-scale | linear, multi-scale |
| ViT-L/14 distilled | | linear, multi-scale | linear, multi-scale |
| ViT-g/14 | Mask2Former | linear, multi-scale | linear, multi-scale |

## Installation

The training and evaluation code requires PyTorch 2.0 and [xFormers](https://github.com/facebookresearch/xformers) 0.0.18 as well as a number of other 3rd party packages. Note that the code has only been tested with the specified versions and also expects a Linux environment. To set up all the required dependencies for training and evaluation, please follow the instructions below:

*[conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html)* **(Recommended)** - Clone the repository and then create and activate a `dinov2` conda environment using the provided environment definition:

```shell
conda env create -f conda.yaml
conda activate dinov2
```

*[pip](https://pip.pypa.io/en/stable/getting-started/)* - Clone the repository and then use the provided `requirements.txt` to install the dependencies:

```shell
pip install -r requirements.txt
```

For dense tasks (depth estimation and semantic segmentation), there are additional dependencies (specific versions of `mmcv` and `mmsegmentation`) which are captured in the `extras` dependency specifications:

*[conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html)* **(Recommended)**:

```shell
conda env create -f conda-extras.yaml
conda activate dinov2-extras
```

*[pip](https://pip.pypa.io/en/stable/getting-started/)*:

```shell
pip install -r requirements.txt -r requirements-extras.txt
```

## Data preparation

### ImageNet-1k

The root directory of the dataset should hold the following contents:

- `/test/ILSVRC2012_test_00000001.JPEG`
- `/test/[..]`
- `/test/ILSVRC2012_test_00100000.JPEG`
- `/train/n01440764/n01440764_10026.JPEG`
- `/train/[...]`
- `/train/n15075141/n15075141_9993.JPEG`
- `/val/n01440764/ILSVRC2012_val_00000293.JPEG`
- `/val/[...]`
- `/val/n15075141/ILSVRC2012_val_00049174.JPEG`
- `/labels.txt`

The provided dataset implementation expects a few additional metadata files to be present under the extra directory:

- `/class-ids-TRAIN.npy`
- `/class-ids-VAL.npy`
- `/class-names-TRAIN.npy`
- `/class-names-VAL.npy`
- `/entries-TEST.npy`
- `/entries-TRAIN.npy`
- `/entries-VAL.npy`

These metadata files can be generated (once) with the following lines of Python code:

```python
from dinov2.data.datasets import ImageNet

for split in ImageNet.Split:
    dataset = ImageNet(split=split, root="", extra="")
    dataset.dump_extra()
```

Note that the root and extra directories do not have to be distinct directories.

### ImageNet-22k

Please adapt the [dataset class](dinov2/data/datasets/image_net_22k.py) to match your local setup.

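
Before launching a job, it can help to check that the ImageNet-1k layout listed above is in place. The following is a minimal sketch using only the standard library; `missing_paths` is a hypothetical helper (not part of the `dinov2` package), it only checks the top-level entries and metadata file names given in this README, and the demo runs against a temporary directory rather than a real dataset.

```python
import tempfile
from pathlib import Path

# Top-level entries expected under the root directory (from the list above).
ROOT_ENTRIES = ("test", "train", "val", "labels.txt")
# Metadata files expected under the extra directory (from the list above).
EXTRA_FILES = (
    "class-ids-TRAIN.npy",
    "class-ids-VAL.npy",
    "class-names-TRAIN.npy",
    "class-names-VAL.npy",
    "entries-TEST.npy",
    "entries-TRAIN.npy",
    "entries-VAL.npy",
)


def missing_paths(root, extra):
    """Return the expected paths that do not exist under root and extra."""
    root, extra = Path(root), Path(extra)
    expected = [root / name for name in ROOT_ENTRIES] + [extra / name for name in EXTRA_FILES]
    return [str(path) for path in expected if not path.exists()]


# Demo on a throwaway directory; root and extra may be the same directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    missing_before = missing_paths(root, root)
    for name in ("test", "train", "val"):
        (root / name).mkdir()
    (root / "labels.txt").touch()
    for name in EXTRA_FILES:
        (root / name).touch()
    missing_after = missing_paths(root, root)

print(len(missing_before), len(missing_after))
```

An empty result means every listed entry was found; it says nothing about whether the files inside `train/`, `val/` and `test/` are complete.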
:warning: To execute the commands provided in the next sections for training and evaluation, the `dinov2` package should be included in the Python module search path, i.e. simply prefix the command to run with `PYTHONPATH=.`. ## Training ### Fast setup: training DINOv2 ViT-L/16 on ImageNet-1k Run DINOv2 training on 4 A100-80GB nodes (32 GPUs) in a SLURM cluster environment with submitit: ```shell python dinov2/run/train/train.py \ --nodes 4 \ --config-file dinov2/configs/train/vitl16_short.yaml \ --output-dir \ train.dataset_path=ImageNet:split=TRAIN:root=:extra= ``` Training time is approximately 1 day and the resulting checkpoint should reach 81.6% on k-NN eval and 82.9% on linear eval. The training code saves the weights of the teacher in the `eval` folder every 12500 iterations for evaluation. ### Long setup: training DINOv2 ViT-L/14 on ImageNet-22k Run DINOv2 training on 12 A100-80GB nodes (96 GPUs) in a SLURM cluster environment with submitit: ```shell python dinov2/run/train/train.py \ --nodes 12 \ --config-file dinov2/configs/train/vitl14.yaml \ --output-dir \ train.dataset_path=ImageNet22k:root=:extra= ``` Training time is approximately 3.3 days and the resulting checkpoint should reach 82.0% on k-NN eval and 84.5% on linear eval. The training code saves the weights of the teacher in the `eval` folder every 12500 iterations for evaluation. ## Evaluation The training code regularly saves the teacher weights. 
In order to evaluate the model, run the following evaluation on a single node:

### k-NN classification on ImageNet-1k

```shell
python dinov2/run/eval/knn.py \
    --config-file /config.yaml \
    --pretrained-weights /eval/training_24999/teacher_checkpoint.pth \
    --output-dir /eval/training_24999/knn \
    --train-dataset ImageNet:split=TRAIN:root=:extra= \
    --val-dataset ImageNet:split=VAL:root=:extra=
```

### Logistic regression classification on ImageNet-1k

```shell
python dinov2/run/eval/log_regression.py \
    --config-file /config.yaml \
    --pretrained-weights /eval/training_24999/teacher_checkpoint.pth \
    --output-dir /eval/training_24999/logreg \
    --train-dataset ImageNet:split=TRAIN:root=:extra= \
    --val-dataset ImageNet:split=VAL:root=:extra=
```

### Linear classification with data augmentation on ImageNet-1k

```shell
python dinov2/run/eval/linear.py \
    --config-file /config.yaml \
    --pretrained-weights /eval/training_24999/teacher_checkpoint.pth \
    --output-dir /eval/training_24999/linear \
    --train-dataset ImageNet:split=TRAIN:root=:extra= \
    --val-dataset ImageNet:split=VAL:root=:extra=
```

We release the weights from evaluating the different models:

| model | with registers | ImageNet top-1 | linear evaluation |
|---|---|---|---|
| ViT-S/14 distilled | :x: | 81.1% | linear head weights |
| ViT-S/14 distilled | :white_check_mark: | 80.8% | linear head weights |
| ViT-B/14 distilled | :x: | 84.5% | linear head weights |
| ViT-B/14 distilled | :white_check_mark: | 84.4% | linear head weights |
| ViT-L/14 distilled | :x: | 86.3% | linear head weights |
| ViT-L/14 distilled | :white_check_mark: | 86.5% | linear head weights |
| ViT-g/14 | :x: | 86.5% | linear head weights |
| ViT-g/14 | :white_check_mark: | 87.0% | linear head weights |

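
The `--train-dataset` and `--val-dataset` arguments in the evaluation commands take a colon-separated dataset string: a dataset name followed by `key=value` options such as `split`, `root` and `extra`. The following is a minimal sketch of how such a string could be split into its parts. The helper name `parse_dataset_spec` and the example paths are hypothetical and used only for illustration; this is not the repository's own parsing code, which may validate keys and values differently.

```python
from typing import Dict, Tuple


def parse_dataset_spec(spec: str) -> Tuple[str, Dict[str, str]]:
    """Split 'Name:key=value:key=value' into (name, options).

    Hypothetical helper for illustration only; it assumes that option
    values contain no ':' characters.
    """
    name, *tokens = spec.split(":")
    if not name:
        raise ValueError(f"missing dataset name in {spec!r}")
    options: Dict[str, str] = {}
    for token in tokens:
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {token!r}")
        options[key] = value
    return name, options


# Example paths are placeholders, not paths from the repository.
name, options = parse_dataset_spec(
    "ImageNet:split=VAL:root=/data/imagenet:extra=/data/imagenet-extra"
)
print(name, options)
```

Because the separator is `:`, a sketch like this cannot handle values that contain a colon; a real implementation would need a different convention for such values.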
The performance of the provided pretrained model weights can be evaluated as follows on ImageNet-1k:

```shell
python dinov2/run/eval/linear.py \
    --config-file dinov2/configs/eval/vitg14_pretrain.yaml \
    --pretrained-weights https://dl.fbaipublicfiles.com/dinov2/dinov2_vitg14/dinov2_vitg14_pretrain.pth \
    --train-dataset ImageNet:split=TRAIN:root=:extra= \
    --val-dataset ImageNet:split=VAL:root=:extra=
```

## Notebooks

A few notebooks are provided to help the community leverage the models and code:

- Depth estimation - How to load and use the depth heads in combination with a matching backbone via mmcv
- Semantic segmentation - How to load and use the segmentation heads in combination with a matching backbone via mmcv, and also how to load and use the Mask2Former-based segmentation model trained on ADE20K

## License DINOv2 code and model weights are released under the Apache License 2.0. See [LICENSE](LICENSE) for additional details. ## Contributing See [contributing](CONTRIBUTING.md) and the [code of conduct](CODE_OF_CONDUCT.md). ## Citing DINOv2 If you find this repository useful, please consider giving a star :star: and citation :t-rex:: ``` @misc{oquab2023dinov2, title={DINOv2: Learning Robust Visual Features without Supervision}, author={Oquab, Maxime and Darcet, Timothée and Moutakanni, Theo and Vo, Huy V. and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and Howes, Russell and Huang, Po-Yao and Xu, Hu and Sharma, Vasu and Li, Shang-Wen and Galuba, Wojciech and Rabbat, Mike and Assran, Mido and Ballas, Nicolas and Synnaeve, Gabriel and Misra, Ishan and Jegou, Herve and Mairal, Julien and Labatut, Patrick and Joulin, Armand and Bojanowski, Piotr}, journal={arXiv:2304.07193}, year={2023} } ``` ``` @misc{darcet2023vitneedreg, title={Vision Transformers Need Registers}, author={Darcet, Timothée and Oquab, Maxime and Mairal, Julien and Bojanowski, Piotr}, journal={arXiv:2309.16588}, year={2023} } ``` ## /conda-extras.yaml ```yaml path="/conda-extras.yaml" name: dinov2-extras channels: - defaults - pytorch - nvidia - xformers - conda-forge dependencies: - python=3.9 - pytorch::pytorch=2.0.0 - pytorch::pytorch-cuda=11.7.0 - pytorch::torchvision=0.15.0 - omegaconf - torchmetrics=0.10.3 - fvcore - iopath - xformers::xformers=0.0.18 - pip - pip: - git+https://github.com/facebookincubator/submitit - --extra-index-url https://pypi.nvidia.com - cuml-cu11 - mmcv-full==1.5.0 - mmsegmentation==0.27.0 ``` ## /conda.yaml ```yaml path="/conda.yaml" name: dinov2 channels: - defaults - pytorch - nvidia - xformers - conda-forge dependencies: - python=3.9 - pytorch::pytorch=2.0.0 - pytorch::pytorch-cuda=11.7.0 - pytorch::torchvision=0.15.0 - omegaconf - torchmetrics=0.10.3 - fvcore - iopath - 
xformers::xformers=0.0.18 - pip - pip: - git+https://github.com/facebookincubator/submitit - --extra-index-url https://pypi.nvidia.com - cuml-cu11 ``` ## /dinov2/__init__.py ```py path="/dinov2/__init__.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. __version__ = "0.0.1" ``` ## /dinov2/configs/__init__.py ```py path="/dinov2/configs/__init__.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. import pathlib from omegaconf import OmegaConf def load_config(config_name: str): config_filename = config_name + ".yaml" return OmegaConf.load(pathlib.Path(__file__).parent.resolve() / config_filename) dinov2_default_config = load_config("ssl_default_config") def load_and_merge_config(config_name: str): default_config = OmegaConf.create(dinov2_default_config) loaded_config = load_config(config_name) return OmegaConf.merge(default_config, loaded_config) ``` ## /dinov2/configs/eval/vitb14_pretrain.yaml ```yaml path="/dinov2/configs/eval/vitb14_pretrain.yaml" student: arch: vit_base patch_size: 14 crops: global_crops_size: 518 # this is to set up the position embeddings properly local_crops_size: 98 ``` ## /dinov2/configs/eval/vitb14_reg4_pretrain.yaml ```yaml path="/dinov2/configs/eval/vitb14_reg4_pretrain.yaml" student: arch: vit_base patch_size: 14 num_register_tokens: 4 interpolate_antialias: true interpolate_offset: 0.0 crops: global_crops_size: 518 # this is to set up the position embeddings properly local_crops_size: 98 ``` ## /dinov2/configs/eval/vitg14_pretrain.yaml ```yaml path="/dinov2/configs/eval/vitg14_pretrain.yaml" student: arch: vit_giant2 patch_size: 14 ffn_layer: swiglufused crops: global_crops_size: 518 # this is to set up the position embeddings properly 
local_crops_size: 98 ``` ## /dinov2/configs/eval/vitg14_reg4_pretrain.yaml ```yaml path="/dinov2/configs/eval/vitg14_reg4_pretrain.yaml" student: arch: vit_giant2 patch_size: 14 ffn_layer: swiglufused num_register_tokens: 4 interpolate_antialias: true interpolate_offset: 0.0 crops: global_crops_size: 518 # this is to set up the position embeddings properly local_crops_size: 98 ``` ## /dinov2/configs/eval/vitl14_pretrain.yaml ```yaml path="/dinov2/configs/eval/vitl14_pretrain.yaml" student: arch: vit_large patch_size: 14 crops: global_crops_size: 518 # this is to set up the position embeddings properly local_crops_size: 98 ``` ## /dinov2/configs/eval/vitl14_reg4_pretrain.yaml ```yaml path="/dinov2/configs/eval/vitl14_reg4_pretrain.yaml" student: arch: vit_large patch_size: 14 num_register_tokens: 4 interpolate_antialias: true interpolate_offset: 0.0 crops: global_crops_size: 518 # this is to set up the position embeddings properly local_crops_size: 98 ``` ## /dinov2/configs/eval/vits14_pretrain.yaml ```yaml path="/dinov2/configs/eval/vits14_pretrain.yaml" student: arch: vit_small patch_size: 14 crops: global_crops_size: 518 # this is to set up the position embeddings properly local_crops_size: 98 ``` ## /dinov2/configs/eval/vits14_reg4_pretrain.yaml ```yaml path="/dinov2/configs/eval/vits14_reg4_pretrain.yaml" student: arch: vit_small patch_size: 14 num_register_tokens: 4 interpolate_antialias: true interpolate_offset: 0.0 crops: global_crops_size: 518 # this is to set up the position embeddings properly local_crops_size: 98 ``` ## /dinov2/configs/ssl_default_config.yaml ```yaml path="/dinov2/configs/ssl_default_config.yaml" MODEL: WEIGHTS: '' compute_precision: grad_scaler: true teacher: backbone: sharding_strategy: SHARD_GRAD_OP mixed_precision: param_dtype: fp16 reduce_dtype: fp16 buffer_dtype: fp32 dino_head: sharding_strategy: SHARD_GRAD_OP mixed_precision: param_dtype: fp16 reduce_dtype: fp16 buffer_dtype: fp32 ibot_head: sharding_strategy: SHARD_GRAD_OP 
mixed_precision: param_dtype: fp16 reduce_dtype: fp16 buffer_dtype: fp32 student: backbone: sharding_strategy: SHARD_GRAD_OP mixed_precision: param_dtype: fp16 reduce_dtype: fp16 buffer_dtype: fp32 dino_head: sharding_strategy: SHARD_GRAD_OP mixed_precision: param_dtype: fp16 reduce_dtype: fp32 buffer_dtype: fp32 ibot_head: sharding_strategy: SHARD_GRAD_OP mixed_precision: param_dtype: fp16 reduce_dtype: fp32 buffer_dtype: fp32 dino: loss_weight: 1.0 head_n_prototypes: 65536 head_bottleneck_dim: 256 head_nlayers: 3 head_hidden_dim: 2048 koleo_loss_weight: 0.1 ibot: loss_weight: 1.0 mask_sample_probability: 0.5 mask_ratio_min_max: - 0.1 - 0.5 separate_head: false head_n_prototypes: 65536 head_bottleneck_dim: 256 head_nlayers: 3 head_hidden_dim: 2048 train: batch_size_per_gpu: 64 dataset_path: ImageNet:split=TRAIN output_dir: . saveckp_freq: 20 seed: 0 num_workers: 10 OFFICIAL_EPOCH_LENGTH: 1250 cache_dataset: true centering: "centering" # or "sinkhorn_knopp" student: arch: vit_large patch_size: 16 drop_path_rate: 0.3 layerscale: 1.0e-05 drop_path_uniform: true pretrained_weights: '' ffn_layer: "mlp" block_chunks: 0 qkv_bias: true proj_bias: true ffn_bias: true num_register_tokens: 0 interpolate_antialias: false interpolate_offset: 0.1 teacher: momentum_teacher: 0.992 final_momentum_teacher: 1 warmup_teacher_temp: 0.04 teacher_temp: 0.07 warmup_teacher_temp_epochs: 30 optim: epochs: 100 weight_decay: 0.04 weight_decay_end: 0.4 base_lr: 0.004 # learning rate for a batch size of 1024 lr: 0. 
# will be set after applying scaling rule warmup_epochs: 10 min_lr: 1.0e-06 clip_grad: 3.0 freeze_last_layer_epochs: 1 scaling_rule: sqrt_wrt_1024 patch_embed_lr_mult: 0.2 layerwise_decay: 0.9 adamw_beta1: 0.9 adamw_beta2: 0.999 crops: global_crops_scale: - 0.32 - 1.0 local_crops_number: 8 local_crops_scale: - 0.05 - 0.32 global_crops_size: 224 local_crops_size: 96 evaluation: eval_period_iterations: 12500 ``` ## /dinov2/configs/train/vitg14.yaml ```yaml path="/dinov2/configs/train/vitg14.yaml" dino: head_n_prototypes: 131072 head_bottleneck_dim: 384 ibot: separate_head: true head_n_prototypes: 131072 train: batch_size_per_gpu: 12 dataset_path: ImageNet22k centering: sinkhorn_knopp student: arch: vit_giant2 patch_size: 14 drop_path_rate: 0.4 ffn_layer: swiglufused block_chunks: 4 teacher: momentum_teacher: 0.994 optim: epochs: 500 weight_decay_end: 0.2 base_lr: 2.0e-04 # learning rate for a batch size of 1024 warmup_epochs: 80 layerwise_decay: 1.0 crops: local_crops_size: 98 ``` ## /dinov2/configs/train/vitl14.yaml ```yaml path="/dinov2/configs/train/vitl14.yaml" dino: head_n_prototypes: 131072 head_bottleneck_dim: 384 ibot: separate_head: true head_n_prototypes: 131072 train: batch_size_per_gpu: 32 dataset_path: ImageNet22k centering: sinkhorn_knopp student: arch: vit_large patch_size: 14 drop_path_rate: 0.4 ffn_layer: swiglufused block_chunks: 4 teacher: momentum_teacher: 0.994 optim: epochs: 500 weight_decay_end: 0.2 base_lr: 2.0e-04 # learning rate for a batch size of 1024 warmup_epochs: 80 layerwise_decay: 1.0 crops: local_crops_size: 98 ``` ## /dinov2/configs/train/vitl16_short.yaml ```yaml path="/dinov2/configs/train/vitl16_short.yaml" # this corresponds to the default config train: dataset_path: ImageNet:split=TRAIN batch_size_per_gpu: 64 student: block_chunks: 4 ``` ## /dinov2/data/__init__.py ```py path="/dinov2/data/__init__.py" # Copyright (c) Meta Platforms, Inc. and affiliates. 
# # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. from .adapters import DatasetWithEnumeratedTargets from .loaders import make_data_loader, make_dataset, SamplerType from .collate import collate_data_and_cast from .masking import MaskingGenerator from .augmentations import DataAugmentationDINO ``` ## /dinov2/data/adapters.py ```py path="/dinov2/data/adapters.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. from typing import Any, Tuple from torch.utils.data import Dataset class DatasetWithEnumeratedTargets(Dataset): def __init__(self, dataset): self._dataset = dataset def get_image_data(self, index: int) -> bytes: return self._dataset.get_image_data(index) def get_target(self, index: int) -> Tuple[Any, int]: target = self._dataset.get_target(index) return (index, target) def __getitem__(self, index: int) -> Tuple[Any, Tuple[Any, int]]: image, target = self._dataset[index] target = index if target is None else target return image, (index, target) def __len__(self) -> int: return len(self._dataset) ``` ## /dinov2/data/augmentations.py ```py path="/dinov2/data/augmentations.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. 
import logging from torchvision import transforms from .transforms import ( GaussianBlur, make_normalize_transform, ) logger = logging.getLogger("dinov2") class DataAugmentationDINO(object): def __init__( self, global_crops_scale, local_crops_scale, local_crops_number, global_crops_size=224, local_crops_size=96, ): self.global_crops_scale = global_crops_scale self.local_crops_scale = local_crops_scale self.local_crops_number = local_crops_number self.global_crops_size = global_crops_size self.local_crops_size = local_crops_size logger.info("###################################") logger.info("Using data augmentation parameters:") logger.info(f"global_crops_scale: {global_crops_scale}") logger.info(f"local_crops_scale: {local_crops_scale}") logger.info(f"local_crops_number: {local_crops_number}") logger.info(f"global_crops_size: {global_crops_size}") logger.info(f"local_crops_size: {local_crops_size}") logger.info("###################################") # random resized crop and flip self.geometric_augmentation_global = transforms.Compose( [ transforms.RandomResizedCrop( global_crops_size, scale=global_crops_scale, interpolation=transforms.InterpolationMode.BICUBIC ), transforms.RandomHorizontalFlip(p=0.5), ] ) self.geometric_augmentation_local = transforms.Compose( [ transforms.RandomResizedCrop( local_crops_size, scale=local_crops_scale, interpolation=transforms.InterpolationMode.BICUBIC ), transforms.RandomHorizontalFlip(p=0.5), ] ) # color distorsions / blurring color_jittering = transforms.Compose( [ transforms.RandomApply( [transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.2, hue=0.1)], p=0.8, ), transforms.RandomGrayscale(p=0.2), ] ) global_transfo1_extra = GaussianBlur(p=1.0) global_transfo2_extra = transforms.Compose( [ GaussianBlur(p=0.1), transforms.RandomSolarize(threshold=128, p=0.2), ] ) local_transfo_extra = GaussianBlur(p=0.5) # normalization self.normalize = transforms.Compose( [ transforms.ToTensor(), make_normalize_transform(), ] ) 
self.global_transfo1 = transforms.Compose([color_jittering, global_transfo1_extra, self.normalize]) self.global_transfo2 = transforms.Compose([color_jittering, global_transfo2_extra, self.normalize]) self.local_transfo = transforms.Compose([color_jittering, local_transfo_extra, self.normalize]) def __call__(self, image): output = {} # global crops: im1_base = self.geometric_augmentation_global(image) global_crop_1 = self.global_transfo1(im1_base) im2_base = self.geometric_augmentation_global(image) global_crop_2 = self.global_transfo2(im2_base) output["global_crops"] = [global_crop_1, global_crop_2] # global crops for teacher: output["global_crops_teacher"] = [global_crop_1, global_crop_2] # local crops: local_crops = [ self.local_transfo(self.geometric_augmentation_local(image)) for _ in range(self.local_crops_number) ] output["local_crops"] = local_crops output["offsets"] = () return output ``` ## /dinov2/data/collate.py ```py path="/dinov2/data/collate.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. 
import torch import random def collate_data_and_cast(samples_list, mask_ratio_tuple, mask_probability, dtype, n_tokens=None, mask_generator=None): # dtype = torch.half # TODO: Remove n_global_crops = len(samples_list[0][0]["global_crops"]) n_local_crops = len(samples_list[0][0]["local_crops"]) collated_global_crops = torch.stack([s[0]["global_crops"][i] for i in range(n_global_crops) for s in samples_list]) collated_local_crops = torch.stack([s[0]["local_crops"][i] for i in range(n_local_crops) for s in samples_list]) B = len(collated_global_crops) N = n_tokens n_samples_masked = int(B * mask_probability) probs = torch.linspace(*mask_ratio_tuple, n_samples_masked + 1) upperbound = 0 masks_list = [] for i in range(0, n_samples_masked): prob_min = probs[i] prob_max = probs[i + 1] masks_list.append(torch.BoolTensor(mask_generator(int(N * random.uniform(prob_min, prob_max))))) upperbound += int(N * prob_max) for i in range(n_samples_masked, B): masks_list.append(torch.BoolTensor(mask_generator(0))) random.shuffle(masks_list) collated_masks = torch.stack(masks_list).flatten(1) mask_indices_list = collated_masks.flatten().nonzero().flatten() masks_weight = (1 / collated_masks.sum(-1).clamp(min=1.0)).unsqueeze(-1).expand_as(collated_masks)[collated_masks] return { "collated_global_crops": collated_global_crops.to(dtype), "collated_local_crops": collated_local_crops.to(dtype), "collated_masks": collated_masks, "mask_indices_list": mask_indices_list, "masks_weight": masks_weight, "upperbound": upperbound, "n_masked_patches": torch.full((1,), fill_value=mask_indices_list.shape[0], dtype=torch.long), } ``` ## /dinov2/data/datasets/__init__.py ```py path="/dinov2/data/datasets/__init__.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. 
from .image_net import ImageNet from .image_net_22k import ImageNet22k ``` ## /dinov2/data/datasets/decoders.py ```py path="/dinov2/data/datasets/decoders.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. from io import BytesIO from typing import Any from PIL import Image class Decoder: def decode(self) -> Any: raise NotImplementedError class ImageDataDecoder(Decoder): def __init__(self, image_data: bytes) -> None: self._image_data = image_data def decode(self) -> Image: f = BytesIO(self._image_data) return Image.open(f).convert(mode="RGB") class TargetDecoder(Decoder): def __init__(self, target: Any): self._target = target def decode(self) -> Any: return self._target ``` ## /dinov2/data/datasets/extended.py ```py path="/dinov2/data/datasets/extended.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. 
from typing import Any, Tuple from torchvision.datasets import VisionDataset from .decoders import TargetDecoder, ImageDataDecoder class ExtendedVisionDataset(VisionDataset): def __init__(self, *args, **kwargs) -> None: super().__init__(*args, **kwargs) # type: ignore def get_image_data(self, index: int) -> bytes: raise NotImplementedError def get_target(self, index: int) -> Any: raise NotImplementedError def __getitem__(self, index: int) -> Tuple[Any, Any]: try: image_data = self.get_image_data(index) image = ImageDataDecoder(image_data).decode() except Exception as e: raise RuntimeError(f"can not read image for sample {index}") from e target = self.get_target(index) target = TargetDecoder(target).decode() if self.transforms is not None: image, target = self.transforms(image, target) return image, target def __len__(self) -> int: raise NotImplementedError ``` ## /dinov2/data/datasets/image_net.py ```py path="/dinov2/data/datasets/image_net.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. 
import csv from enum import Enum import logging import os from typing import Callable, List, Optional, Tuple, Union import numpy as np from .extended import ExtendedVisionDataset logger = logging.getLogger("dinov2") _Target = int class _Split(Enum): TRAIN = "train" VAL = "val" TEST = "test" # NOTE: torchvision does not support the test split @property def length(self) -> int: split_lengths = { _Split.TRAIN: 1_281_167, _Split.VAL: 50_000, _Split.TEST: 100_000, } return split_lengths[self] def get_dirname(self, class_id: Optional[str] = None) -> str: return self.value if class_id is None else os.path.join(self.value, class_id) def get_image_relpath(self, actual_index: int, class_id: Optional[str] = None) -> str: dirname = self.get_dirname(class_id) if self == _Split.TRAIN: basename = f"{class_id}_{actual_index}" else: # self in (_Split.VAL, _Split.TEST): basename = f"ILSVRC2012_{self.value}_{actual_index:08d}" return os.path.join(dirname, basename + ".JPEG") def parse_image_relpath(self, image_relpath: str) -> Tuple[str, int]: assert self != _Split.TEST dirname, filename = os.path.split(image_relpath) class_id = os.path.split(dirname)[-1] basename, _ = os.path.splitext(filename) actual_index = int(basename.split("_")[-1]) return class_id, actual_index class ImageNet(ExtendedVisionDataset): Target = Union[_Target] Split = Union[_Split] def __init__( self, *, split: "ImageNet.Split", root: str, extra: str, transforms: Optional[Callable] = None, transform: Optional[Callable] = None, target_transform: Optional[Callable] = None, ) -> None: super().__init__(root, transforms, transform, target_transform) self._extra_root = extra self._split = split self._entries = None self._class_ids = None self._class_names = None @property def split(self) -> "ImageNet.Split": return self._split def _get_extra_full_path(self, extra_path: str) -> str: return os.path.join(self._extra_root, extra_path) def _load_extra(self, extra_path: str) -> np.ndarray: extra_full_path = 
self._get_extra_full_path(extra_path) return np.load(extra_full_path, mmap_mode="r") def _save_extra(self, extra_array: np.ndarray, extra_path: str) -> None: extra_full_path = self._get_extra_full_path(extra_path) os.makedirs(self._extra_root, exist_ok=True) np.save(extra_full_path, extra_array) @property def _entries_path(self) -> str: return f"entries-{self._split.value.upper()}.npy" @property def _class_ids_path(self) -> str: return f"class-ids-{self._split.value.upper()}.npy" @property def _class_names_path(self) -> str: return f"class-names-{self._split.value.upper()}.npy" def _get_entries(self) -> np.ndarray: if self._entries is None: self._entries = self._load_extra(self._entries_path) assert self._entries is not None return self._entries def _get_class_ids(self) -> np.ndarray: if self._split == _Split.TEST: assert False, "Class IDs are not available in TEST split" if self._class_ids is None: self._class_ids = self._load_extra(self._class_ids_path) assert self._class_ids is not None return self._class_ids def _get_class_names(self) -> np.ndarray: if self._split == _Split.TEST: assert False, "Class names are not available in TEST split" if self._class_names is None: self._class_names = self._load_extra(self._class_names_path) assert self._class_names is not None return self._class_names def find_class_id(self, class_index: int) -> str: class_ids = self._get_class_ids() return str(class_ids[class_index]) def find_class_name(self, class_index: int) -> str: class_names = self._get_class_names() return str(class_names[class_index]) def get_image_data(self, index: int) -> bytes: entries = self._get_entries() actual_index = entries[index]["actual_index"] class_id = self.get_class_id(index) image_relpath = self.split.get_image_relpath(actual_index, class_id) image_full_path = os.path.join(self.root, image_relpath) with open(image_full_path, mode="rb") as f: image_data = f.read() return image_data def get_target(self, index: int) -> Optional[Target]: entries = 
self._get_entries() class_index = entries[index]["class_index"] return None if self.split == _Split.TEST else int(class_index) def get_targets(self) -> Optional[np.ndarray]: entries = self._get_entries() return None if self.split == _Split.TEST else entries["class_index"] def get_class_id(self, index: int) -> Optional[str]: entries = self._get_entries() class_id = entries[index]["class_id"] return None if self.split == _Split.TEST else str(class_id) def get_class_name(self, index: int) -> Optional[str]: entries = self._get_entries() class_name = entries[index]["class_name"] return None if self.split == _Split.TEST else str(class_name) def __len__(self) -> int: entries = self._get_entries() assert len(entries) == self.split.length return len(entries) def _load_labels(self, labels_path: str) -> List[Tuple[str, str]]: labels_full_path = os.path.join(self.root, labels_path) labels = [] try: with open(labels_full_path, "r") as f: reader = csv.reader(f) for row in reader: class_id, class_name = row labels.append((class_id, class_name)) except OSError as e: raise RuntimeError(f'can not read labels file "{labels_full_path}"') from e return labels def _dump_entries(self) -> None: split = self.split if split == ImageNet.Split.TEST: dataset = None sample_count = split.length max_class_id_length, max_class_name_length = 0, 0 else: labels_path = "labels.txt" logger.info(f'loading labels from "{labels_path}"') labels = self._load_labels(labels_path) # NOTE: Using torchvision ImageFolder for consistency from torchvision.datasets import ImageFolder dataset_root = os.path.join(self.root, split.get_dirname()) dataset = ImageFolder(dataset_root) sample_count = len(dataset) max_class_id_length, max_class_name_length = -1, -1 for sample in dataset.samples: _, class_index = sample class_id, class_name = labels[class_index] max_class_id_length = max(len(class_id), max_class_id_length) max_class_name_length = max(len(class_name), max_class_name_length) dtype = np.dtype( [ ("actual_index", 
" old_percent: logger.info(f"creating entries: {percent}%") old_percent = percent actual_index = index + 1 class_index = np.uint32(-1) class_id, class_name = "", "" entries_array[index] = (actual_index, class_index, class_id, class_name) else: class_names = {class_id: class_name for class_id, class_name in labels} assert dataset old_percent = -1 for index in range(sample_count): percent = 100 * (index + 1) // sample_count if percent > old_percent: logger.info(f"creating entries: {percent}%") old_percent = percent image_full_path, class_index = dataset.samples[index] image_relpath = os.path.relpath(image_full_path, self.root) class_id, actual_index = split.parse_image_relpath(image_relpath) class_name = class_names[class_id] entries_array[index] = (actual_index, class_index, class_id, class_name) logger.info(f'saving entries to "{self._entries_path}"') self._save_extra(entries_array, self._entries_path) def _dump_class_ids_and_names(self) -> None: split = self.split if split == ImageNet.Split.TEST: return entries_array = self._load_extra(self._entries_path) max_class_id_length, max_class_name_length, max_class_index = -1, -1, -1 for entry in entries_array: class_index, class_id, class_name = ( entry["class_index"], entry["class_id"], entry["class_name"], ) max_class_index = max(int(class_index), max_class_index) max_class_id_length = max(len(str(class_id)), max_class_id_length) max_class_name_length = max(len(str(class_name)), max_class_name_length) class_count = max_class_index + 1 class_ids_array = np.empty(class_count, dtype=f"U{max_class_id_length}") class_names_array = np.empty(class_count, dtype=f"U{max_class_name_length}") for entry in entries_array: class_index, class_id, class_name = ( entry["class_index"], entry["class_id"], entry["class_name"], ) class_ids_array[class_index] = class_id class_names_array[class_index] = class_name logger.info(f'saving class IDs to "{self._class_ids_path}"') self._save_extra(class_ids_array, self._class_ids_path) 
logger.info(f'saving class names to "{self._class_names_path}"') self._save_extra(class_names_array, self._class_names_path) def dump_extra(self) -> None: self._dump_entries() self._dump_class_ids_and_names() ``` ## /dinov2/data/datasets/image_net_22k.py ```py path="/dinov2/data/datasets/image_net_22k.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. from dataclasses import dataclass from enum import Enum from functools import lru_cache from gzip import GzipFile from io import BytesIO from mmap import ACCESS_READ, mmap import os from typing import Any, Callable, List, Optional, Set, Tuple import warnings import numpy as np from .extended import ExtendedVisionDataset _Labels = int _DEFAULT_MMAP_CACHE_SIZE = 16 # Warning: This can exhaust file descriptors @dataclass class _ClassEntry: block_offset: int maybe_filename: Optional[str] = None @dataclass class _Entry: class_index: int # noqa: E701 start_offset: int end_offset: int filename: str class _Split(Enum): TRAIN = "train" VAL = "val" @property def length(self) -> int: return { _Split.TRAIN: 11_797_647, _Split.VAL: 561_050, }[self] def entries_path(self): return f"imagenet21kp_{self.value}.txt" def _get_tarball_path(class_id: str) -> str: return f"{class_id}.tar" def _make_mmap_tarball(tarballs_root: str, mmap_cache_size: int): @lru_cache(maxsize=mmap_cache_size) def _mmap_tarball(class_id: str) -> mmap: tarball_path = _get_tarball_path(class_id) tarball_full_path = os.path.join(tarballs_root, tarball_path) with open(tarball_full_path) as f: return mmap(fileno=f.fileno(), length=0, access=ACCESS_READ) return _mmap_tarball class ImageNet22k(ExtendedVisionDataset): _GZIPPED_INDICES: Set[int] = { 841_545, 1_304_131, 2_437_921, 2_672_079, 2_795_676, 2_969_786, 6_902_965, 6_903_550, 6_903_628, 7_432_557, 7_432_589, 7_813_809, 8_329_633, 10_296_990, 10_417_652, 10_492_265, 
10_598_078, 10_782_398, 10_902_612, 11_203_736, 11_342_890, 11_397_596, 11_589_762, 11_705_103, 12_936_875, 13_289_782, } Labels = _Labels def __init__( self, *, root: str, extra: str, transforms: Optional[Callable] = None, transform: Optional[Callable] = None, target_transform: Optional[Callable] = None, mmap_cache_size: int = _DEFAULT_MMAP_CACHE_SIZE, ) -> None: super().__init__(root, transforms, transform, target_transform) self._extra_root = extra entries_path = self._get_entries_path(root) self._entries = self._load_extra(entries_path) class_ids_path = self._get_class_ids_path(root) self._class_ids = self._load_extra(class_ids_path) self._gzipped_indices = ImageNet22k._GZIPPED_INDICES self._mmap_tarball = _make_mmap_tarball(self._tarballs_root, mmap_cache_size) def _get_entries_path(self, root: Optional[str] = None) -> str: return "entries.npy" def _get_class_ids_path(self, root: Optional[str] = None) -> str: return "class-ids.npy" def _find_class_ids(self, path: str) -> List[str]: class_ids = [] with os.scandir(path) as entries: for entry in entries: root, ext = os.path.splitext(entry.name) if ext != ".tar": continue class_ids.append(root) return sorted(class_ids) def _load_entries_class_ids(self, root: Optional[str] = None) -> Tuple[List[_Entry], List[str]]: root = self.get_root(root) entries: List[_Entry] = [] class_ids = self._find_class_ids(root) for class_index, class_id in enumerate(class_ids): path = os.path.join(root, "blocks", f"{class_id}.log") class_entries = [] try: with open(path) as f: for line in f: line = line.rstrip() block, filename = line.split(":") block_offset = int(block[6:]) filename = filename[1:] maybe_filename = None if filename != "** Block of NULs **": maybe_filename = filename _, ext = os.path.splitext(filename) # assert ext == ".JPEG" class_entry = _ClassEntry(block_offset, maybe_filename) class_entries.append(class_entry) except OSError as e: raise RuntimeError(f'can not read blocks file "{path}"') from e assert 
class_entries[-1].maybe_filename is None for class_entry1, class_entry2 in zip(class_entries, class_entries[1:]): assert class_entry1.block_offset <= class_entry2.block_offset start_offset = 512 * class_entry1.block_offset end_offset = 512 * class_entry2.block_offset assert class_entry1.maybe_filename is not None filename = class_entry1.maybe_filename entry = _Entry(class_index, start_offset, end_offset, filename) # Skip invalid image files (PIL throws UnidentifiedImageError) if filename == "n06470073_47249.JPEG": continue entries.append(entry) return entries, class_ids def _load_extra(self, extra_path: str) -> np.ndarray: extra_root = self._extra_root extra_full_path = os.path.join(extra_root, extra_path) return np.load(extra_full_path, mmap_mode="r") def _save_extra(self, extra_array: np.ndarray, extra_path: str) -> None: extra_root = self._extra_root extra_full_path = os.path.join(extra_root, extra_path) os.makedirs(extra_root, exist_ok=True) np.save(extra_full_path, extra_array) @property def _tarballs_root(self) -> str: return self.root def find_class_id(self, class_index: int) -> str: return str(self._class_ids[class_index]) def get_image_data(self, index: int) -> bytes: entry = self._entries[index] class_id = entry["class_id"] class_mmap = self._mmap_tarball(class_id) start_offset, end_offset = entry["start_offset"], entry["end_offset"] try: mapped_data = class_mmap[start_offset:end_offset] data = mapped_data[512:] # Skip entry header block if len(data) >= 2 and tuple(data[:2]) == (0x1F, 0x8B): assert index in self._gzipped_indices, f"unexpected gzip header for sample {index}" with GzipFile(fileobj=BytesIO(data)) as g: data = g.read() except Exception as e: raise RuntimeError(f"can not retrieve image data for sample {index} " f'from "{class_id}" tarball') from e return data def get_target(self, index: int) -> Any: return int(self._entries[index]["class_index"]) def get_targets(self) -> np.ndarray: return self._entries["class_index"] def get_class_id(self, 
index: int) -> str: return str(self._entries[index]["class_id"]) def get_class_ids(self) -> np.ndarray: return self._entries["class_id"] def __getitem__(self, index: int) -> Tuple[Any, Any]: with warnings.catch_warnings(): warnings.simplefilter("ignore") return super().__getitem__(index) def __len__(self) -> int: return len(self._entries) def _dump_entries(self, *args, **kwargs) -> None: entries, class_ids = self._load_entries_class_ids(*args, **kwargs) max_class_id_length, max_filename_length, max_class_index = -1, -1, -1 for entry in entries: class_id = class_ids[entry.class_index] max_class_index = max(entry.class_index, max_class_index) max_class_id_length = max(len(class_id), max_class_id_length) max_filename_length = max(len(entry.filename), max_filename_length) dtype = np.dtype( [ ("class_index", "<u4"), ("class_id", f"U{max_class_id_length}"), ("start_offset", "<u4"), ("end_offset", "<u4"), ("filename", f"U{max_filename_length}"), ] ) sample_count = len(entries) entries_array = np.empty(sample_count, dtype=dtype) for i, entry in enumerate(entries): class_index = entry.class_index class_id = class_ids[class_index] start_offset = entry.start_offset end_offset = entry.end_offset filename = entry.filename entries_array[i] = ( class_index, class_id, start_offset, end_offset, filename, ) entries_path = self._get_entries_path(*args, **kwargs) self._save_extra(entries_array, entries_path) def _dump_class_ids(self, *args, **kwargs) -> None: entries_path = self._get_entries_path(*args, **kwargs) entries_array = self._load_extra(entries_path) max_class_id_length, max_class_index = -1, -1 for entry in entries_array: class_index, class_id = entry["class_index"], entry["class_id"] max_class_index = max(int(class_index), max_class_index) max_class_id_length = max(len(str(class_id)), max_class_id_length) class_ids_array = np.empty(max_class_index + 1, dtype=f"U{max_class_id_length}") for entry in entries_array: class_index, class_id = entry["class_index"], entry["class_id"] class_ids_array[class_index] = class_id class_ids_path = self._get_class_ids_path(*args, **kwargs) self._save_extra(class_ids_array, class_ids_path) def _dump_extra(self, *args, **kwargs) -> None: self._dump_entries(*args, **kwargs) self._dump_class_ids(*args, **kwargs) def dump_extra(self, root: Optional[str] = None) -> None: return self._dump_extra(root) ``` ## /dinov2/data/loaders.py ```py path="/dinov2/data/loaders.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree.
import logging from enum import Enum from typing import Any, Callable, List, Optional, TypeVar import torch from torch.utils.data import Sampler from .datasets import ImageNet, ImageNet22k from .samplers import EpochSampler, InfiniteSampler, ShardedInfiniteSampler logger = logging.getLogger("dinov2") class SamplerType(Enum): DISTRIBUTED = 0 EPOCH = 1 INFINITE = 2 SHARDED_INFINITE = 3 SHARDED_INFINITE_NEW = 4 def _make_bool_str(b: bool) -> str: return "yes" if b else "no" def _make_sample_transform(image_transform: Optional[Callable] = None, target_transform: Optional[Callable] = None): def transform(sample): image, target = sample if image_transform is not None: image = image_transform(image) if target_transform is not None: target = target_transform(target) return image, target return transform def _parse_dataset_str(dataset_str: str): tokens = dataset_str.split(":") name = tokens[0] kwargs = {} for token in tokens[1:]: key, value = token.split("=") assert key in ("root", "extra", "split") kwargs[key] = value if name == "ImageNet": class_ = ImageNet if "split" in kwargs: kwargs["split"] = ImageNet.Split[kwargs["split"]] elif name == "ImageNet22k": class_ = ImageNet22k else: raise ValueError(f'Unsupported dataset "{name}"') return class_, kwargs def make_dataset( *, dataset_str: str, transform: Optional[Callable] = None, target_transform: Optional[Callable] = None, ): """ Creates a dataset with the specified parameters. Args: dataset_str: A dataset string description (e.g. ImageNet:split=TRAIN). transform: A transform to apply to images. target_transform: A transform to apply to targets. Returns: The created dataset. """ logger.info(f'using dataset: "{dataset_str}"') class_, kwargs = _parse_dataset_str(dataset_str) dataset = class_(transform=transform, target_transform=target_transform, **kwargs) logger.info(f"# of dataset samples: {len(dataset):,d}") # Aggregated datasets do not expose (yet) these attributes, so add them. 
if not hasattr(dataset, "transform"): setattr(dataset, "transform", transform) if not hasattr(dataset, "target_transform"): setattr(dataset, "target_transform", target_transform) return dataset def _make_sampler( *, dataset, type: Optional[SamplerType] = None, shuffle: bool = False, seed: int = 0, size: int = -1, advance: int = 0, ) -> Optional[Sampler]: sample_count = len(dataset) if type == SamplerType.INFINITE: logger.info("sampler: infinite") if size > 0: raise ValueError("sampler size > 0 is invalid") return InfiniteSampler( sample_count=sample_count, shuffle=shuffle, seed=seed, advance=advance, ) elif type in (SamplerType.SHARDED_INFINITE, SamplerType.SHARDED_INFINITE_NEW): logger.info("sampler: sharded infinite") if size > 0: raise ValueError("sampler size > 0 is invalid") # TODO: Remove support for old shuffling use_new_shuffle_tensor_slice = type == SamplerType.SHARDED_INFINITE_NEW return ShardedInfiniteSampler( sample_count=sample_count, shuffle=shuffle, seed=seed, advance=advance, use_new_shuffle_tensor_slice=use_new_shuffle_tensor_slice, ) elif type == SamplerType.EPOCH: logger.info("sampler: epoch") if advance > 0: raise NotImplementedError("sampler advance > 0 is not supported") size = size if size > 0 else sample_count logger.info(f"# of samples / epoch: {size:,d}") return EpochSampler( size=size, sample_count=sample_count, shuffle=shuffle, seed=seed, ) elif type == SamplerType.DISTRIBUTED: logger.info("sampler: distributed") if size > 0: raise ValueError("sampler size > 0 is invalid") if advance > 0: raise ValueError("sampler advance > 0 is invalid") return torch.utils.data.DistributedSampler( dataset=dataset, shuffle=shuffle, seed=seed, drop_last=False, ) logger.info("sampler: none") return None T = TypeVar("T") def make_data_loader( *, dataset, batch_size: int, num_workers: int, shuffle: bool = True, seed: int = 0, sampler_type: Optional[SamplerType] = SamplerType.INFINITE, sampler_size: int = -1, sampler_advance: int = 0, drop_last: bool = True, 
persistent_workers: bool = False, collate_fn: Optional[Callable[[List[T]], Any]] = None, ): """ Creates a data loader with the specified parameters. Args: dataset: A dataset (third party, LaViDa or WebDataset). batch_size: The size of batches to generate. num_workers: The number of workers to use. shuffle: Whether to shuffle samples. seed: The random seed to use. sampler_type: Which sampler to use: EPOCH, INFINITE, SHARDED_INFINITE, SHARDED_INFINITE_NEW, DISTRIBUTED or None. sampler_size: The number of images per epoch (when applicable) or -1 for the entire dataset. sampler_advance: How many samples to skip (when applicable). drop_last: Whether the last non-full batch of data should be dropped. persistent_workers: maintain the workers Dataset instances alive after a dataset has been consumed once. collate_fn: Function that performs batch collation """ sampler = _make_sampler( dataset=dataset, type=sampler_type, shuffle=shuffle, seed=seed, size=sampler_size, advance=sampler_advance, ) logger.info("using PyTorch data loader") data_loader = torch.utils.data.DataLoader( dataset, sampler=sampler, batch_size=batch_size, num_workers=num_workers, pin_memory=True, drop_last=drop_last, persistent_workers=persistent_workers, collate_fn=collate_fn, ) try: logger.info(f"# of batches: {len(data_loader):,d}") except TypeError: # data loader has no length logger.info("infinite data loader") return data_loader ``` ## /dinov2/data/masking.py ```py path="/dinov2/data/masking.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. 
import random import math import numpy as np class MaskingGenerator: def __init__( self, input_size, num_masking_patches=None, min_num_patches=4, max_num_patches=None, min_aspect=0.3, max_aspect=None, ): if not isinstance(input_size, tuple): input_size = (input_size,) * 2 self.height, self.width = input_size self.num_patches = self.height * self.width self.num_masking_patches = num_masking_patches self.min_num_patches = min_num_patches self.max_num_patches = num_masking_patches if max_num_patches is None else max_num_patches max_aspect = max_aspect or 1 / min_aspect self.log_aspect_ratio = (math.log(min_aspect), math.log(max_aspect)) def __repr__(self): repr_str = "Generator(%d, %d -> [%d ~ %d], max = %d, %.3f ~ %.3f)" % ( self.height, self.width, self.min_num_patches, self.max_num_patches, self.num_masking_patches, self.log_aspect_ratio[0], self.log_aspect_ratio[1], ) return repr_str def get_shape(self): return self.height, self.width def _mask(self, mask, max_mask_patches): delta = 0 for _ in range(10): target_area = random.uniform(self.min_num_patches, max_mask_patches) aspect_ratio = math.exp(random.uniform(*self.log_aspect_ratio)) h = int(round(math.sqrt(target_area * aspect_ratio))) w = int(round(math.sqrt(target_area / aspect_ratio))) if w < self.width and h < self.height: top = random.randint(0, self.height - h) left = random.randint(0, self.width - w) num_masked = mask[top : top + h, left : left + w].sum() # Overlap if 0 < h * w - num_masked <= max_mask_patches: for i in range(top, top + h): for j in range(left, left + w): if mask[i, j] == 0: mask[i, j] = 1 delta += 1 if delta > 0: break return delta def __call__(self, num_masking_patches=0): mask = np.zeros(shape=self.get_shape(), dtype=bool) mask_count = 0 while mask_count < num_masking_patches: max_mask_patches = num_masking_patches - mask_count max_mask_patches = min(max_mask_patches, self.max_num_patches) delta = self._mask(mask, max_mask_patches) if delta == 0: break else: mask_count += delta return 
mask ``` ## /dinov2/data/samplers.py ```py path="/dinov2/data/samplers.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. import itertools from typing import Any, Optional import warnings import numpy as np import torch from torch.utils.data.sampler import Sampler import dinov2.distributed as distributed class EpochSampler(Sampler): def __init__( self, *, size: int, sample_count: int, shuffle: bool = False, seed: int = 0, start: Optional[int] = None, step: Optional[int] = None, ): self._size = size self._sample_count = sample_count self._shuffle = shuffle self._seed = seed self._start = distributed.get_global_rank() if start is None else start self._step = distributed.get_global_size() if step is None else step self._epoch = 0 def __iter__(self): count = (self._size + self._sample_count - 1) // self._sample_count tiled_indices = np.tile(np.arange(self._sample_count), count) if self._shuffle: seed = self._seed * self._epoch if self._seed != 0 else self._epoch rng = np.random.default_rng(seed) iterable = rng.choice(tiled_indices, self._size, replace=False) else: iterable = tiled_indices[: self._size] yield from itertools.islice(iterable, self._start, None, self._step) def __len__(self): return (self._size - self._start + self._step - 1) // self._step def set_epoch(self, epoch): self._epoch = epoch def _get_numpy_dtype(size: int) -> Any: return np.int32 if size <= 2**31 else np.int64 def _get_torch_dtype(size: int) -> Any: return torch.int32 if size <= 2**31 else torch.int64 def _generate_randperm_indices(*, size: int, generator: torch.Generator): """Generate the indices of a random permutation.""" dtype = _get_torch_dtype(size) # This is actually matching PyTorch's CPU implementation, see: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorFactories.cpp#L900-L921 perm = torch.arange(size, 
dtype=dtype) for i in range(size): j = torch.randint(i, size, size=(1,), generator=generator).item() # Always swap even if no-op value = perm[j].item() perm[j] = perm[i].item() perm[i] = value yield value class InfiniteSampler(Sampler): def __init__( self, *, sample_count: int, shuffle: bool = False, seed: int = 0, start: Optional[int] = None, step: Optional[int] = None, advance: int = 0, ): self._sample_count = sample_count self._seed = seed self._shuffle = shuffle self._start = distributed.get_global_rank() if start is None else start self._step = distributed.get_global_size() if step is None else step self._advance = advance def __iter__(self): if self._shuffle: iterator = self._shuffled_iterator() else: iterator = self._iterator() yield from itertools.islice(iterator, self._advance, None) def _iterator(self): assert not self._shuffle while True: iterable = range(self._sample_count) yield from itertools.islice(iterable, self._start, None, self._step) def _shuffled_iterator(self): assert self._shuffle # Instantiate a generator here (rather than in the ctor) to keep the class # picklable (requirement of mp.spawn) generator = torch.Generator().manual_seed(self._seed) while True: iterable = _generate_randperm_indices(size=self._sample_count, generator=generator) yield from itertools.islice(iterable, self._start, None, self._step) # The following function is somewhat equivalent to _new_shuffle_tensor_slice below, # but avoids a full in-place random permutation generation. 
def _shuffle_tensor_slice( *, tensor: torch.Tensor, start: int = 0, step: int = 1, generator: torch.Generator ) -> np.ndarray: stop = len(tensor) count = stop // step drop_count = stop - step * count if drop_count: warnings.warn(f"# of dropped samples: {drop_count}") dtype = _get_numpy_dtype(stop) result = np.empty(count, dtype=dtype) for i in range(count): j = torch.randint(0, i + 1, size=(1,), generator=generator).item() if i > 0 else 0 result[i] = result[j] result[j] = tensor[start + i * step].item() return result def _new_shuffle_tensor_slice( *, tensor: torch.Tensor, start: int = 0, step: int = 1, generator: torch.Generator ) -> np.ndarray: stop = len(tensor) count = stop // step dtype = torch.int64 # Needed for using randperm result as indices drop_count = stop - step * count if drop_count: warnings.warn(f"# of dropped samples: {drop_count}") indices = torch.randperm(count, dtype=dtype, generator=generator) return tensor[start::step][indices].numpy() def _make_seed(seed: int, start: int, iter_count: int) -> int: # NOTE: Tried a few variants (including iter_count << 32), this one worked best.
return seed + start + (iter_count << 24) class ShardedInfiniteSampler(Sampler): def __init__( self, *, sample_count: int, shuffle: bool = False, seed: int = 0, start: Optional[int] = None, step: Optional[int] = None, advance: int = 0, use_new_shuffle_tensor_slice: bool = False, ): self._sample_count = sample_count self._seed = seed self._shuffle = shuffle self._start = distributed.get_global_rank() if start is None else start self._step = distributed.get_global_size() if step is None else step self._advance = advance self._iter_count = 0 self._shuffle_tensor_slice_fn = ( _new_shuffle_tensor_slice if use_new_shuffle_tensor_slice else _shuffle_tensor_slice ) def __iter__(self): iter_count = self._advance // self._sample_count if iter_count > 0: self._advance -= iter_count * self._sample_count self._iter_count += iter_count if self._shuffle: iterator = self._shuffled_iterator() else: iterator = self._iterator() yield from itertools.islice(iterator, self._advance, None) def _iterator(self): assert not self._shuffle while True: iterable = range(self._sample_count) yield from itertools.islice(iterable, self._start, None, self._step) def _shuffled_iterator(self): assert self._shuffle # Instantiate a generator here (rather than in the ctor) to keep the class # picklable (requirement of mp.spawn) generator = torch.Generator() # Always shuffle everything first generator.manual_seed(self._seed) dtype = _get_torch_dtype(self._sample_count) perm = torch.randperm(self._sample_count, dtype=dtype, generator=generator) while True: # Re-seed on each iteration to allow skipping whole permutations seed = _make_seed(self._seed, self._start, self._iter_count) generator.manual_seed(seed) iterable = self._shuffle_tensor_slice_fn( tensor=perm, start=self._start, step=self._step, generator=generator ) yield from iterable self._iter_count += 1 ``` ## /dinov2/data/transforms.py ```py path="/dinov2/data/transforms.py" # Copyright (c) Meta Platforms, Inc. and affiliates.
# # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. from typing import Sequence import torch from torchvision import transforms class GaussianBlur(transforms.RandomApply): """ Apply Gaussian Blur to the PIL image. """ def __init__(self, *, p: float = 0.5, radius_min: float = 0.1, radius_max: float = 2.0): # NOTE: torchvision is applying 1 - probability to return the original image keep_p = 1 - p transform = transforms.GaussianBlur(kernel_size=9, sigma=(radius_min, radius_max)) super().__init__(transforms=[transform], p=keep_p) class MaybeToTensor(transforms.ToTensor): """ Convert a ``PIL Image`` or ``numpy.ndarray`` to tensor, or keep as is if already a tensor. """ def __call__(self, pic): """ Args: pic (PIL Image, numpy.ndarray or torch.tensor): Image to be converted to tensor. Returns: Tensor: Converted image. """ if isinstance(pic, torch.Tensor): return pic return super().__call__(pic) # Use timm's names IMAGENET_DEFAULT_MEAN = (0.485, 0.456, 0.406) IMAGENET_DEFAULT_STD = (0.229, 0.224, 0.225) def make_normalize_transform( mean: Sequence[float] = IMAGENET_DEFAULT_MEAN, std: Sequence[float] = IMAGENET_DEFAULT_STD, ) -> transforms.Normalize: return transforms.Normalize(mean=mean, std=std) # This roughly matches torchvision's preset for classification training: # https://github.com/pytorch/vision/blob/main/references/classification/presets.py#L6-L44 def make_classification_train_transform( *, crop_size: int = 224, interpolation=transforms.InterpolationMode.BICUBIC, hflip_prob: float = 0.5, mean: Sequence[float] = IMAGENET_DEFAULT_MEAN, std: Sequence[float] = IMAGENET_DEFAULT_STD, ): transforms_list = [transforms.RandomResizedCrop(crop_size, interpolation=interpolation)] if hflip_prob > 0.0: transforms_list.append(transforms.RandomHorizontalFlip(hflip_prob)) transforms_list.extend( [ MaybeToTensor(), make_normalize_transform(mean=mean, std=std), ] ) return 
transforms.Compose(transforms_list) # This matches (roughly) torchvision's preset for classification evaluation: # https://github.com/pytorch/vision/blob/main/references/classification/presets.py#L47-L69 def make_classification_eval_transform( *, resize_size: int = 256, interpolation=transforms.InterpolationMode.BICUBIC, crop_size: int = 224, mean: Sequence[float] = IMAGENET_DEFAULT_MEAN, std: Sequence[float] = IMAGENET_DEFAULT_STD, ) -> transforms.Compose: transforms_list = [ transforms.Resize(resize_size, interpolation=interpolation), transforms.CenterCrop(crop_size), MaybeToTensor(), make_normalize_transform(mean=mean, std=std), ] return transforms.Compose(transforms_list) ``` ## /dinov2/distributed/__init__.py ```py path="/dinov2/distributed/__init__.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. import os import random import re import socket from typing import Dict, List import torch import torch.distributed as dist _LOCAL_RANK = -1 _LOCAL_WORLD_SIZE = -1 def is_enabled() -> bool: """ Returns: True if distributed training is enabled """ return dist.is_available() and dist.is_initialized() def get_global_size() -> int: """ Returns: The number of processes in the process group """ return dist.get_world_size() if is_enabled() else 1 def get_global_rank() -> int: """ Returns: The rank of the current process within the global process group. """ return dist.get_rank() if is_enabled() else 0 def get_local_rank() -> int: """ Returns: The rank of the current process within the local (per-machine) process group. """ if not is_enabled(): return 0 assert 0 <= _LOCAL_RANK < _LOCAL_WORLD_SIZE return _LOCAL_RANK def get_local_size() -> int: """ Returns: The size of the per-machine process group, i.e. the number of processes per machine. 
""" if not is_enabled(): return 1 assert 0 <= _LOCAL_RANK < _LOCAL_WORLD_SIZE return _LOCAL_WORLD_SIZE def is_main_process() -> bool: """ Returns: True if the current process is the main one. """ return get_global_rank() == 0 def _restrict_print_to_main_process() -> None: """ This function disables printing when not in the main process """ import builtins as __builtin__ builtin_print = __builtin__.print def print(*args, **kwargs): force = kwargs.pop("force", False) if is_main_process() or force: builtin_print(*args, **kwargs) __builtin__.print = print def _get_master_port(seed: int = 0) -> int: MIN_MASTER_PORT, MAX_MASTER_PORT = (20_000, 60_000) master_port_str = os.environ.get("MASTER_PORT") if master_port_str is None: rng = random.Random(seed) return rng.randint(MIN_MASTER_PORT, MAX_MASTER_PORT) return int(master_port_str) def _get_available_port() -> int: with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s: # A "" host address means INADDR_ANY i.e. binding to all interfaces. # Note this is not compatible with IPv6. 
s.bind(("", 0)) port = s.getsockname()[1] return port _TORCH_DISTRIBUTED_ENV_VARS = ( "MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE", "LOCAL_RANK", "LOCAL_WORLD_SIZE", ) def _collect_env_vars() -> Dict[str, str]: return {env_var: os.environ[env_var] for env_var in _TORCH_DISTRIBUTED_ENV_VARS if env_var in os.environ} def _is_slurm_job_process() -> bool: return "SLURM_JOB_ID" in os.environ def _parse_slurm_node_list(s: str) -> List[str]: nodes = [] # Extract "hostname", "hostname[1-2,3,4-5]," substrings p = re.compile(r"(([^\[]+)(?:\[([^\]]+)\])?),?") for m in p.finditer(s): prefix, suffixes = s[m.start(2) : m.end(2)], s[m.start(3) : m.end(3)] for suffix in suffixes.split(","): span = suffix.split("-") if len(span) == 1: nodes.append(prefix + suffix) else: width = len(span[0]) start, end = int(span[0]), int(span[1]) + 1 nodes.extend([prefix + f"{i:0{width}}" for i in range(start, end)]) return nodes def _check_env_variable(key: str, new_value: str): # Only check for difference with preset environment variables if key in os.environ and os.environ[key] != new_value: raise RuntimeError(f"Cannot export environment variables as {key} is already set") class _TorchDistributedEnvironment: def __init__(self): self.master_addr = "127.0.0.1" self.master_port = 0 self.rank = -1 self.world_size = -1 self.local_rank = -1 self.local_world_size = -1 if _is_slurm_job_process(): return self._set_from_slurm_env() env_vars = _collect_env_vars() if not env_vars: # Environment is not set pass elif len(env_vars) == len(_TORCH_DISTRIBUTED_ENV_VARS): # Environment is fully set return self._set_from_preset_env() else: # Environment is partially set collected_env_vars = ", ".join(env_vars.keys()) raise RuntimeError(f"Partially set environment: {collected_env_vars}") if torch.cuda.device_count() > 0: return self._set_from_local() raise RuntimeError("Can't initialize PyTorch distributed environment") # Slurm job created with sbatch, submitit, etc... 
def _set_from_slurm_env(self): # logger.info("Initialization from Slurm environment") job_id = int(os.environ["SLURM_JOB_ID"]) node_count = int(os.environ["SLURM_JOB_NUM_NODES"]) nodes = _parse_slurm_node_list(os.environ["SLURM_JOB_NODELIST"]) assert len(nodes) == node_count self.master_addr = nodes[0] self.master_port = _get_master_port(seed=job_id) self.rank = int(os.environ["SLURM_PROCID"]) self.world_size = int(os.environ["SLURM_NTASKS"]) assert self.rank < self.world_size self.local_rank = int(os.environ["SLURM_LOCALID"]) self.local_world_size = self.world_size // node_count assert self.local_rank < self.local_world_size # Single node job with preset environment (i.e. torchrun) def _set_from_preset_env(self): # logger.info("Initialization from preset environment") self.master_addr = os.environ["MASTER_ADDR"] self.master_port = os.environ["MASTER_PORT"] self.rank = int(os.environ["RANK"]) self.world_size = int(os.environ["WORLD_SIZE"]) assert self.rank < self.world_size self.local_rank = int(os.environ["LOCAL_RANK"]) self.local_world_size = int(os.environ["LOCAL_WORLD_SIZE"]) assert self.local_rank < self.local_world_size # Single node and GPU job (i.e. local script run) def _set_from_local(self): # logger.info("Initialization from local") self.master_addr = "127.0.0.1" self.master_port = _get_available_port() self.rank = 0 self.world_size = 1 self.local_rank = 0 self.local_world_size = 1 def export(self, *, overwrite: bool) -> "_TorchDistributedEnvironment": # See the "Environment variable initialization" section from # https://pytorch.org/docs/stable/distributed.html for the complete list of # environment variables required for the env:// initialization method. 
env_vars = { "MASTER_ADDR": self.master_addr, "MASTER_PORT": str(self.master_port), "RANK": str(self.rank), "WORLD_SIZE": str(self.world_size), "LOCAL_RANK": str(self.local_rank), "LOCAL_WORLD_SIZE": str(self.local_world_size), } if not overwrite: for k, v in env_vars.items(): _check_env_variable(k, v) os.environ.update(env_vars) return self def enable(*, set_cuda_current_device: bool = True, overwrite: bool = False, allow_nccl_timeout: bool = False): """Enable distributed mode. Args: set_cuda_current_device: If True, call torch.cuda.set_device() to set the current PyTorch CUDA device to the one matching the local rank. overwrite: If True, overwrite already set environment variables. Otherwise fail. allow_nccl_timeout: If True, set NCCL_ASYNC_ERROR_HANDLING=1 so that the torch distributed timeout can be used with the NCCL backend. """ global _LOCAL_RANK, _LOCAL_WORLD_SIZE if _LOCAL_RANK >= 0 or _LOCAL_WORLD_SIZE >= 0: raise RuntimeError("Distributed mode has already been enabled") torch_env = _TorchDistributedEnvironment() torch_env.export(overwrite=overwrite) if set_cuda_current_device: torch.cuda.set_device(torch_env.local_rank) if allow_nccl_timeout: # This allows using the torch distributed timeout with the NCCL backend key, value = "NCCL_ASYNC_ERROR_HANDLING", "1" if not overwrite: _check_env_variable(key, value) os.environ[key] = value dist.init_process_group(backend="nccl") dist.barrier() # Finalize setup _LOCAL_RANK = torch_env.local_rank _LOCAL_WORLD_SIZE = torch_env.local_world_size _restrict_print_to_main_process() ``` ## /dinov2/eval/__init__.py ```py path="/dinov2/eval/__init__.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. ``` ## /dinov2/eval/depth/__init__.py ```py path="/dinov2/eval/depth/__init__.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree.
``` ## /dinov2/eval/depth/models/__init__.py ```py path="/dinov2/eval/depth/models/__init__.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. from .backbones import * # noqa: F403 from .builder import BACKBONES, DEPTHER, HEADS, LOSSES, build_backbone, build_depther, build_head, build_loss from .decode_heads import * # noqa: F403 from .depther import * # noqa: F403 from .losses import * # noqa: F403 ``` ## /dinov2/eval/depth/models/backbones/__init__.py ```py path="/dinov2/eval/depth/models/backbones/__init__.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. from .vision_transformer import DinoVisionTransformer ``` ## /dinov2/eval/depth/models/backbones/vision_transformer.py ```py path="/dinov2/eval/depth/models/backbones/vision_transformer.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. from mmcv.runner import BaseModule from ..builder import BACKBONES @BACKBONES.register_module() class DinoVisionTransformer(BaseModule): """Vision Transformer.""" def __init__(self, *args, **kwargs): super().__init__() ``` ## /dinov2/eval/depth/models/builder.py ```py path="/dinov2/eval/depth/models/builder.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. 
import warnings from mmcv.cnn import MODELS as MMCV_MODELS from mmcv.cnn.bricks.registry import ATTENTION as MMCV_ATTENTION from mmcv.utils import Registry MODELS = Registry("models", parent=MMCV_MODELS) ATTENTION = Registry("attention", parent=MMCV_ATTENTION) BACKBONES = MODELS NECKS = MODELS HEADS = MODELS LOSSES = MODELS DEPTHER = MODELS def build_backbone(cfg): """Build backbone.""" return BACKBONES.build(cfg) def build_neck(cfg): """Build neck.""" return NECKS.build(cfg) def build_head(cfg): """Build head.""" return HEADS.build(cfg) def build_loss(cfg): """Build loss.""" return LOSSES.build(cfg) def build_depther(cfg, train_cfg=None, test_cfg=None): """Build depther.""" if train_cfg is not None or test_cfg is not None: warnings.warn("train_cfg and test_cfg is deprecated, " "please specify them in model", UserWarning) assert cfg.get("train_cfg") is None or train_cfg is None, "train_cfg specified in both outer field and model field " assert cfg.get("test_cfg") is None or test_cfg is None, "test_cfg specified in both outer field and model field " return DEPTHER.build(cfg, default_args=dict(train_cfg=train_cfg, test_cfg=test_cfg)) ``` ## /dinov2/eval/depth/models/decode_heads/__init__.py ```py path="/dinov2/eval/depth/models/decode_heads/__init__.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. from .dpt_head import DPTHead from .linear_head import BNHead ``` ## /dinov2/eval/depth/models/decode_heads/decode_head.py ```py path="/dinov2/eval/depth/models/decode_heads/decode_head.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. 
import copy from abc import ABCMeta, abstractmethod import mmcv import numpy as np import torch import torch.nn as nn from mmcv.runner import BaseModule, auto_fp16, force_fp32 from ...ops import resize from ..builder import build_loss class DepthBaseDecodeHead(BaseModule, metaclass=ABCMeta): """Base class for depth decode heads. Args: in_channels (List): Input channels. channels (int): Channels after modules, before conv_depth. Default: 96. conv_cfg (dict|None): Config of conv layers. Default: None. act_cfg (dict): Config of activation layers. Default: dict(type='ReLU') loss_decode (dict|list[dict]): Config(s) of decode loss. Default: dict(type='SigLoss', valid_mask=True, loss_weight=10). sampler (dict|None): The config of depth map sampler. Default: None. align_corners (bool): align_corners argument of F.interpolate. Default: False. min_depth (float): Min depth in dataset setting. Default: 1e-3. max_depth (float|None): Max depth in dataset setting. Default: None. norm_cfg (dict|None): Config of norm layers. Default: None. classify (bool): Whether to predict depth in a classification-regression manner. Default: False. n_bins (int): The number of bins used in the classification step. Default: 256. bins_strategy (str): The discretization strategy used in the classification step, 'UD' or 'SID'. Default: 'UD'. norm_strategy (str): The normalization strategy applied to the per-bin probability distribution, one of 'linear', 'softmax', 'sigmoid'. Default: 'linear' scale_up (bool): Whether to predict depth in a scale-up manner. Default: False.
""" def __init__( self, in_channels, channels=96, conv_cfg=None, act_cfg=dict(type="ReLU"), loss_decode=dict(type="SigLoss", valid_mask=True, loss_weight=10), sampler=None, align_corners=False, min_depth=1e-3, max_depth=None, norm_cfg=None, classify=False, n_bins=256, bins_strategy="UD", norm_strategy="linear", scale_up=False, ): super(DepthBaseDecodeHead, self).__init__() self.in_channels = in_channels self.channels = channels self.conv_cfg = conv_cfg self.act_cfg = act_cfg if isinstance(loss_decode, dict): self.loss_decode = build_loss(loss_decode) elif isinstance(loss_decode, (list, tuple)): self.loss_decode = nn.ModuleList() for loss in loss_decode: self.loss_decode.append(build_loss(loss)) self.align_corners = align_corners self.min_depth = min_depth self.max_depth = max_depth self.norm_cfg = norm_cfg self.classify = classify self.n_bins = n_bins self.scale_up = scale_up if self.classify: assert bins_strategy in ["UD", "SID"], "Support bins_strategy: UD, SID" assert norm_strategy in ["linear", "softmax", "sigmoid"], "Support norm_strategy: linear, softmax, sigmoid" self.bins_strategy = bins_strategy self.norm_strategy = norm_strategy self.softmax = nn.Softmax(dim=1) self.conv_depth = nn.Conv2d(channels, n_bins, kernel_size=3, padding=1, stride=1) else: self.conv_depth = nn.Conv2d(channels, 1, kernel_size=3, padding=1, stride=1) self.fp16_enabled = False self.relu = nn.ReLU() self.sigmoid = nn.Sigmoid() def extra_repr(self): """Extra repr.""" s = f"align_corners={self.align_corners}" return s @auto_fp16() @abstractmethod def forward(self, inputs, img_metas): """Placeholder of forward function.""" pass def forward_train(self, img, inputs, img_metas, depth_gt, train_cfg): """Forward function for training. Args: inputs (list[Tensor]): List of multi-level img features. img_metas (list[dict]): List of image info dict where each dict has: 'img_shape', 'scale_factor', 'flip', and may also contain 'filename', 'ori_shape', 'pad_shape', and 'img_norm_cfg'. 
For details on the values of these keys see `depth/datasets/pipelines/formatting.py:Collect`. depth_gt (Tensor): GT depth train_cfg (dict): The training config. Returns: dict[str, Tensor]: a dictionary of loss components """ depth_pred = self.forward(inputs, img_metas) losses = self.losses(depth_pred, depth_gt) log_imgs = self.log_images(img[0], depth_pred[0], depth_gt[0], img_metas[0]) losses.update(**log_imgs) return losses def forward_test(self, inputs, img_metas, test_cfg): """Forward function for testing. Args: inputs (list[Tensor]): List of multi-level img features. img_metas (list[dict]): List of image info dict where each dict has: 'img_shape', 'scale_factor', 'flip', and may also contain 'filename', 'ori_shape', 'pad_shape', and 'img_norm_cfg'. For details on the values of these keys see `depth/datasets/pipelines/formatting.py:Collect`. test_cfg (dict): The testing config. Returns: Tensor: Output depth map. """ return self.forward(inputs, img_metas) def depth_pred(self, feat): """Prediction each pixel.""" if self.classify: logit = self.conv_depth(feat) if self.bins_strategy == "UD": bins = torch.linspace(self.min_depth, self.max_depth, self.n_bins, device=feat.device) elif self.bins_strategy == "SID": bins = torch.logspace(self.min_depth, self.max_depth, self.n_bins, device=feat.device) # following Adabins, default linear if self.norm_strategy == "linear": logit = torch.relu(logit) eps = 0.1 logit = logit + eps logit = logit / logit.sum(dim=1, keepdim=True) elif self.norm_strategy == "softmax": logit = torch.softmax(logit, dim=1) elif self.norm_strategy == "sigmoid": logit = torch.sigmoid(logit) logit = logit / logit.sum(dim=1, keepdim=True) output = torch.einsum("ikmn,k->imn", [logit, bins]).unsqueeze(dim=1) else: if self.scale_up: output = self.sigmoid(self.conv_depth(feat)) * self.max_depth else: output = self.relu(self.conv_depth(feat)) + self.min_depth return output @force_fp32(apply_to=("depth_pred",)) def losses(self, depth_pred, depth_gt): 
"""Compute depth loss.""" loss = dict() depth_pred = resize( input=depth_pred, size=depth_gt.shape[2:], mode="bilinear", align_corners=self.align_corners, warning=False ) if not isinstance(self.loss_decode, nn.ModuleList): losses_decode = [self.loss_decode] else: losses_decode = self.loss_decode for loss_decode in losses_decode: if loss_decode.loss_name not in loss: loss[loss_decode.loss_name] = loss_decode(depth_pred, depth_gt) else: loss[loss_decode.loss_name] += loss_decode(depth_pred, depth_gt) return loss def log_images(self, img_path, depth_pred, depth_gt, img_meta): show_img = copy.deepcopy(img_path.detach().cpu().permute(1, 2, 0)) show_img = show_img.numpy().astype(np.float32) show_img = mmcv.imdenormalize( show_img, img_meta["img_norm_cfg"]["mean"], img_meta["img_norm_cfg"]["std"], img_meta["img_norm_cfg"]["to_rgb"], ) show_img = np.clip(show_img, 0, 255) show_img = show_img.astype(np.uint8) show_img = show_img[:, :, ::-1] show_img = show_img.transpose(0, 2, 1) show_img = show_img.transpose(1, 0, 2) depth_pred = depth_pred / torch.max(depth_pred) depth_gt = depth_gt / torch.max(depth_gt) depth_pred_color = copy.deepcopy(depth_pred.detach().cpu()) depth_gt_color = copy.deepcopy(depth_gt.detach().cpu()) return {"img_rgb": show_img, "img_depth_pred": depth_pred_color, "img_depth_gt": depth_gt_color} ``` ## /dinov2/eval/depth/models/decode_heads/dpt_head.py ```py path="/dinov2/eval/depth/models/decode_heads/dpt_head.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. 
import math import torch import torch.nn as nn from mmcv.cnn import ConvModule, Linear, build_activation_layer from mmcv.runner import BaseModule from ...ops import resize from ..builder import HEADS from .decode_head import DepthBaseDecodeHead class Interpolate(nn.Module): def __init__(self, scale_factor, mode, align_corners=False): super(Interpolate, self).__init__() self.interp = nn.functional.interpolate self.scale_factor = scale_factor self.mode = mode self.align_corners = align_corners def forward(self, x): x = self.interp(x, scale_factor=self.scale_factor, mode=self.mode, align_corners=self.align_corners) return x class HeadDepth(nn.Module): def __init__(self, features): super(HeadDepth, self).__init__() self.head = nn.Sequential( nn.Conv2d(features, features // 2, kernel_size=3, stride=1, padding=1), Interpolate(scale_factor=2, mode="bilinear", align_corners=True), nn.Conv2d(features // 2, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(), nn.Conv2d(32, 1, kernel_size=1, stride=1, padding=0), ) def forward(self, x): x = self.head(x) return x class ReassembleBlocks(BaseModule): """ViTPostProcessBlock, process cls_token in ViT backbone output and rearrange the feature vector to feature map. Args: in_channels (int): ViT feature channels. Default: 768. out_channels (List): output channels of each stage. Default: [96, 192, 384, 768]. readout_type (str): Type of readout operation. Default: 'ignore'. patch_size (int): The patch size. Default: 16. init_cfg (dict, optional): Initialization config dict. Default: None. 
""" def __init__( self, in_channels=768, out_channels=[96, 192, 384, 768], readout_type="ignore", patch_size=16, init_cfg=None ): super(ReassembleBlocks, self).__init__(init_cfg) assert readout_type in ["ignore", "add", "project"] self.readout_type = readout_type self.patch_size = patch_size self.projects = nn.ModuleList( [ ConvModule( in_channels=in_channels, out_channels=out_channel, kernel_size=1, act_cfg=None, ) for out_channel in out_channels ] ) self.resize_layers = nn.ModuleList( [ nn.ConvTranspose2d( in_channels=out_channels[0], out_channels=out_channels[0], kernel_size=4, stride=4, padding=0 ), nn.ConvTranspose2d( in_channels=out_channels[1], out_channels=out_channels[1], kernel_size=2, stride=2, padding=0 ), nn.Identity(), nn.Conv2d( in_channels=out_channels[3], out_channels=out_channels[3], kernel_size=3, stride=2, padding=1 ), ] ) if self.readout_type == "project": self.readout_projects = nn.ModuleList() for _ in range(len(self.projects)): self.readout_projects.append( nn.Sequential(Linear(2 * in_channels, in_channels), build_activation_layer(dict(type="GELU"))) ) def forward(self, inputs): assert isinstance(inputs, list) out = [] for i, x in enumerate(inputs): assert len(x) == 2 x, cls_token = x[0], x[1] feature_shape = x.shape if self.readout_type == "project": x = x.flatten(2).permute((0, 2, 1)) readout = cls_token.unsqueeze(1).expand_as(x) x = self.readout_projects[i](torch.cat((x, readout), -1)) x = x.permute(0, 2, 1).reshape(feature_shape) elif self.readout_type == "add": x = x.flatten(2) + cls_token.unsqueeze(-1) x = x.reshape(feature_shape) else: pass x = self.projects[i](x) x = self.resize_layers[i](x) out.append(x) return out class PreActResidualConvUnit(BaseModule): """ResidualConvUnit, pre-activate residual unit. Args: in_channels (int): number of channels in the input feature map. act_cfg (dict): dictionary to construct and config activation layer. norm_cfg (dict): dictionary to construct and config norm layer. 
stride (int): stride of the first block. Default: 1 dilation (int): dilation rate for conv layers. Default: 1. init_cfg (dict, optional): Initialization config dict. Default: None. """ def __init__(self, in_channels, act_cfg, norm_cfg, stride=1, dilation=1, init_cfg=None): super(PreActResidualConvUnit, self).__init__(init_cfg) self.conv1 = ConvModule( in_channels, in_channels, 3, stride=stride, padding=dilation, dilation=dilation, norm_cfg=norm_cfg, act_cfg=act_cfg, bias=False, order=("act", "conv", "norm"), ) self.conv2 = ConvModule( in_channels, in_channels, 3, padding=1, norm_cfg=norm_cfg, act_cfg=act_cfg, bias=False, order=("act", "conv", "norm"), ) def forward(self, inputs): inputs_ = inputs.clone() x = self.conv1(inputs) x = self.conv2(x) return x + inputs_ class FeatureFusionBlock(BaseModule): """FeatureFusionBlock, merge feature map from different stages. Args: in_channels (int): Input channels. act_cfg (dict): The activation config for ResidualConvUnit. norm_cfg (dict): Config dict for normalization layer. expand (bool): Whether to expand the channels in the post-process block. Default: False. align_corners (bool): align_corners setting for bilinear upsample. Default: True. init_cfg (dict, optional): Initialization config dict. Default: None.
""" def __init__(self, in_channels, act_cfg, norm_cfg, expand=False, align_corners=True, init_cfg=None): super(FeatureFusionBlock, self).__init__(init_cfg) self.in_channels = in_channels self.expand = expand self.align_corners = align_corners self.out_channels = in_channels if self.expand: self.out_channels = in_channels // 2 self.project = ConvModule(self.in_channels, self.out_channels, kernel_size=1, act_cfg=None, bias=True) self.res_conv_unit1 = PreActResidualConvUnit(in_channels=self.in_channels, act_cfg=act_cfg, norm_cfg=norm_cfg) self.res_conv_unit2 = PreActResidualConvUnit(in_channels=self.in_channels, act_cfg=act_cfg, norm_cfg=norm_cfg) def forward(self, *inputs): x = inputs[0] if len(inputs) == 2: if x.shape != inputs[1].shape: res = resize(inputs[1], size=(x.shape[2], x.shape[3]), mode="bilinear", align_corners=False) else: res = inputs[1] x = x + self.res_conv_unit1(res) x = self.res_conv_unit2(x) x = resize(x, scale_factor=2, mode="bilinear", align_corners=self.align_corners) x = self.project(x) return x @HEADS.register_module() class DPTHead(DepthBaseDecodeHead): """Vision Transformers for Dense Prediction. This head is an implementation of DPT. Args: embed_dims (int): The embed dimension of the ViT backbone. Default: 768. post_process_channels (List): Out channels of post process conv layers. Default: [96, 192, 384, 768]. readout_type (str): Type of readout operation. Default: 'ignore'. patch_size (int): The patch size. Default: 16. expand_channels (bool): Whether to expand the channels in the post-process block. Default: False.
""" def __init__( self, embed_dims=768, post_process_channels=[96, 192, 384, 768], readout_type="ignore", patch_size=16, expand_channels=False, **kwargs ): super(DPTHead, self).__init__(**kwargs) self.in_channels = self.in_channels self.expand_channels = expand_channels self.reassemble_blocks = ReassembleBlocks(embed_dims, post_process_channels, readout_type, patch_size) self.post_process_channels = [ channel * math.pow(2, i) if expand_channels else channel for i, channel in enumerate(post_process_channels) ] self.convs = nn.ModuleList() for channel in self.post_process_channels: self.convs.append(ConvModule(channel, self.channels, kernel_size=3, padding=1, act_cfg=None, bias=False)) self.fusion_blocks = nn.ModuleList() for _ in range(len(self.convs)): self.fusion_blocks.append(FeatureFusionBlock(self.channels, self.act_cfg, self.norm_cfg)) self.fusion_blocks[0].res_conv_unit1 = None self.project = ConvModule(self.channels, self.channels, kernel_size=3, padding=1, norm_cfg=self.norm_cfg) self.num_fusion_blocks = len(self.fusion_blocks) self.num_reassemble_blocks = len(self.reassemble_blocks.resize_layers) self.num_post_process_channels = len(self.post_process_channels) assert self.num_fusion_blocks == self.num_reassemble_blocks assert self.num_reassemble_blocks == self.num_post_process_channels self.conv_depth = HeadDepth(self.channels) def forward(self, inputs, img_metas): assert len(inputs) == self.num_reassemble_blocks x = [inp for inp in inputs] x = self.reassemble_blocks(x) x = [self.convs[i](feature) for i, feature in enumerate(x)] out = self.fusion_blocks[0](x[-1]) for i in range(1, len(self.fusion_blocks)): out = self.fusion_blocks[i](out, x[-(i + 1)]) out = self.project(out) out = self.depth_pred(out) return out ``` ## /dinov2/eval/depth/models/decode_heads/linear_head.py ```py path="/dinov2/eval/depth/models/decode_heads/linear_head.py" # Copyright (c) Meta Platforms, Inc. and affiliates. 
# # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. import torch import torch.nn as nn from ...ops import resize from ..builder import HEADS from .decode_head import DepthBaseDecodeHead @HEADS.register_module() class BNHead(DepthBaseDecodeHead): """Just a batchnorm.""" def __init__(self, input_transform="resize_concat", in_index=(0, 1, 2, 3), upsample=1, **kwargs): super().__init__(**kwargs) self.input_transform = input_transform self.in_index = in_index self.upsample = upsample # self.bn = nn.SyncBatchNorm(self.in_channels) if self.classify: self.conv_depth = nn.Conv2d(self.channels, self.n_bins, kernel_size=1, padding=0, stride=1) else: self.conv_depth = nn.Conv2d(self.channels, 1, kernel_size=1, padding=0, stride=1) def _transform_inputs(self, inputs): """Transform inputs for decoder. Args: inputs (list[Tensor]): List of multi-level img features. Returns: Tensor: The transformed inputs """ if "concat" in self.input_transform: inputs = [inputs[i] for i in self.in_index] if "resize" in self.input_transform: inputs = [ resize( input=x, size=[s * self.upsample for s in inputs[0].shape[2:]], mode="bilinear", align_corners=self.align_corners, ) for x in inputs ] inputs = torch.cat(inputs, dim=1) elif self.input_transform == "multiple_select": inputs = [inputs[i] for i in self.in_index] else: inputs = inputs[self.in_index] return inputs def _forward_feature(self, inputs, img_metas=None, **kwargs): """Forward function for feature maps before classifying each pixel with ``self.cls_seg`` fc. Args: inputs (list[Tensor]): List of multi-level img features. Returns: feats (Tensor): A tensor of shape (batch_size, self.channels, H, W) which is feature map for last layer of decoder head. 
""" # accept lists (for cls token) inputs = list(inputs) for i, x in enumerate(inputs): if len(x) == 2: x, cls_token = x[0], x[1] if len(x.shape) == 2: x = x[:, :, None, None] cls_token = cls_token[:, :, None, None].expand_as(x) inputs[i] = torch.cat((x, cls_token), 1) else: x = x[0] if len(x.shape) == 2: x = x[:, :, None, None] inputs[i] = x x = self._transform_inputs(inputs) # feats = self.bn(x) return x def forward(self, inputs, img_metas=None, **kwargs): """Forward function.""" output = self._forward_feature(inputs, img_metas=img_metas, **kwargs) output = self.depth_pred(output) return output ``` ## /dinov2/eval/depth/models/depther/__init__.py ```py path="/dinov2/eval/depth/models/depther/__init__.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. from .base import BaseDepther from .encoder_decoder import DepthEncoderDecoder ``` ## /dinov2/eval/depth/models/depther/base.py ```py path="/dinov2/eval/depth/models/depther/base.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. 
from abc import ABCMeta, abstractmethod from collections import OrderedDict import torch import torch.distributed as dist from mmcv.runner import BaseModule, auto_fp16 class BaseDepther(BaseModule, metaclass=ABCMeta): """Base class for depther.""" def __init__(self, init_cfg=None): super(BaseDepther, self).__init__(init_cfg) self.fp16_enabled = False @property def with_neck(self): """bool: whether the depther has neck""" return hasattr(self, "neck") and self.neck is not None @property def with_auxiliary_head(self): """bool: whether the depther has auxiliary head""" return hasattr(self, "auxiliary_head") and self.auxiliary_head is not None @property def with_decode_head(self): """bool: whether the depther has decode head""" return hasattr(self, "decode_head") and self.decode_head is not None @abstractmethod def extract_feat(self, imgs): """Placeholder for extracting features from images.""" pass @abstractmethod def encode_decode(self, img, img_metas): """Placeholder for encoding images with the backbone and decoding them into a depth map of the same size as input.""" pass @abstractmethod def forward_train(self, imgs, img_metas, **kwargs): """Placeholder for the forward function for training.""" pass @abstractmethod def simple_test(self, img, img_meta, **kwargs): """Placeholder for single image test.""" pass @abstractmethod def aug_test(self, imgs, img_metas, **kwargs): """Placeholder for augmentation test.""" pass def forward_test(self, imgs, img_metas, **kwargs): """ Args: imgs (List[Tensor]): the outer list indicates test-time augmentations and inner Tensor should have a shape NxCxHxW, which contains all images in the batch. img_metas (List[List[dict]]): the outer list indicates test-time augs (multiscale, flip, etc.) and the inner list indicates images in a batch.
""" for var, name in [(imgs, "imgs"), (img_metas, "img_metas")]: if not isinstance(var, list): raise TypeError(f"{name} must be a list, but got " f"{type(var)}") num_augs = len(imgs) if num_augs != len(img_metas): raise ValueError(f"num of augmentations ({len(imgs)}) != " f"num of image meta ({len(img_metas)})") # all images in the same aug batch must share the same ori_shape, img_shape and # pad_shape for img_meta in img_metas: ori_shapes = [_["ori_shape"] for _ in img_meta] assert all(shape == ori_shapes[0] for shape in ori_shapes) img_shapes = [_["img_shape"] for _ in img_meta] assert all(shape == img_shapes[0] for shape in img_shapes) pad_shapes = [_["pad_shape"] for _ in img_meta] assert all(shape == pad_shapes[0] for shape in pad_shapes) if num_augs == 1: return self.simple_test(imgs[0], img_metas[0], **kwargs) else: return self.aug_test(imgs, img_metas, **kwargs) @auto_fp16(apply_to=("img",)) def forward(self, img, img_metas, return_loss=True, **kwargs): """Calls either :func:`forward_train` or :func:`forward_test` depending on whether ``return_loss`` is ``True``. Note this setting will change the expected inputs. When ``return_loss=True``, img and img_metas are single-nested (i.e. Tensor and List[dict]), and when ``return_loss=False``, img and img_metas should be double-nested (i.e. List[Tensor], List[List[dict]]), with the outer list indicating test time augmentations. """ if return_loss: return self.forward_train(img, img_metas, **kwargs) else: return self.forward_test(img, img_metas, **kwargs) def train_step(self, data_batch, optimizer, **kwargs): """The iteration step during training. This method defines an iteration step during training, except for the back propagation and optimizer updating, which are done in an optimizer hook. Note that in some complicated cases or models, the whole process including back propagation and optimizer updating is also defined in this method, such as GAN. Args: data_batch (dict): The output of dataloader.
optimizer (:obj:`torch.optim.Optimizer` | dict): The optimizer of runner is passed to ``train_step()``. This argument is unused and reserved. Returns: dict: It should contain at least 3 keys: ``loss``, ``log_vars``, ``num_samples``. ``loss`` is a tensor for back propagation, which can be a weighted sum of multiple losses. ``log_vars`` contains all the variables to be sent to the logger. ``num_samples`` indicates the batch size (when the model is DDP, it means the batch size on each GPU), which is used for averaging the logs. """ losses = self(**data_batch) # split losses and images real_losses = {} log_imgs = {} for k, v in losses.items(): if "img" in k: log_imgs[k] = v else: real_losses[k] = v loss, log_vars = self._parse_losses(real_losses) outputs = dict(loss=loss, log_vars=log_vars, num_samples=len(data_batch["img_metas"]), log_imgs=log_imgs) return outputs def val_step(self, data_batch, **kwargs): """The iteration step during validation. This method shares the same signature as :func:`train_step`, but used during val epochs. Note that the evaluation after training epochs is not implemented with this method, but an evaluation hook. """ output = self(**data_batch, **kwargs) return output @staticmethod def _parse_losses(losses): """Parse the raw outputs (losses) of the network. Args: losses (dict): Raw output of the network, which usually contain losses and other necessary information. Returns: tuple[Tensor, dict]: (loss, log_vars), loss is the loss tensor which may be a weighted sum of all losses, log_vars contains all the variables to be sent to the logger. 
""" log_vars = OrderedDict() for loss_name, loss_value in losses.items(): if isinstance(loss_value, torch.Tensor): log_vars[loss_name] = loss_value.mean() elif isinstance(loss_value, list): log_vars[loss_name] = sum(_loss.mean() for _loss in loss_value) else: raise TypeError(f"{loss_name} is not a tensor or list of tensors") loss = sum(_value for _key, _value in log_vars.items() if "loss" in _key) log_vars["loss"] = loss for loss_name, loss_value in log_vars.items(): # reduce loss when distributed training if dist.is_available() and dist.is_initialized(): loss_value = loss_value.data.clone() dist.all_reduce(loss_value.div_(dist.get_world_size())) log_vars[loss_name] = loss_value.item() return loss, log_vars ``` ## /dinov2/eval/depth/models/depther/encoder_decoder.py ```py path="/dinov2/eval/depth/models/depther/encoder_decoder.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. import torch import torch.nn.functional as F from ...models import builder from ...models.builder import DEPTHER from ...ops import resize from .base import BaseDepther def add_prefix(inputs, prefix): """Add prefix for dict. Args: inputs (dict): The input dict with str keys. prefix (str): The prefix to add. Returns: dict: The dict with keys updated with ``prefix``. """ outputs = dict() for name, value in inputs.items(): outputs[f"{prefix}.{name}"] = value return outputs @DEPTHER.register_module() class DepthEncoderDecoder(BaseDepther): """Encoder Decoder depther. EncoderDecoder typically consists of backbone, (neck) and decode_head. 
""" def __init__(self, backbone, decode_head, neck=None, train_cfg=None, test_cfg=None, pretrained=None, init_cfg=None): super(DepthEncoderDecoder, self).__init__(init_cfg) if pretrained is not None: assert backbone.get("pretrained") is None, "both backbone and depther set pretrained weight" backbone.pretrained = pretrained self.backbone = builder.build_backbone(backbone) self._init_decode_head(decode_head) if neck is not None: self.neck = builder.build_neck(neck) self.train_cfg = train_cfg self.test_cfg = test_cfg assert self.with_decode_head def _init_decode_head(self, decode_head): """Initialize ``decode_head``""" self.decode_head = builder.build_head(decode_head) self.align_corners = self.decode_head.align_corners def extract_feat(self, img): """Extract features from images.""" x = self.backbone(img) if self.with_neck: x = self.neck(x) return x def encode_decode(self, img, img_metas, rescale=True, size=None): """Encode images with backbone and decode into a depth estimation map of the same size as input.""" x = self.extract_feat(img) out = self._decode_head_forward_test(x, img_metas) # crop the pred depth to the certain range. 
out = torch.clamp(out, min=self.decode_head.min_depth, max=self.decode_head.max_depth) if rescale: if size is None: if img_metas is not None: size = img_metas[0]["ori_shape"][:2] else: size = img.shape[2:] out = resize(input=out, size=size, mode="bilinear", align_corners=self.align_corners) return out def _decode_head_forward_train(self, img, x, img_metas, depth_gt, **kwargs): """Run forward function and calculate loss for decode head in training.""" losses = dict() loss_decode = self.decode_head.forward_train(img, x, img_metas, depth_gt, self.train_cfg, **kwargs) losses.update(add_prefix(loss_decode, "decode")) return losses def _decode_head_forward_test(self, x, img_metas): """Run forward function for decode head in inference (no loss is computed).""" depth_pred = self.decode_head.forward_test(x, img_metas, self.test_cfg) return depth_pred def forward_dummy(self, img): """Dummy forward function.""" depth = self.encode_decode(img, None) return depth def forward_train(self, img, img_metas, depth_gt, **kwargs): """Forward function for training. Args: img (Tensor): Input images. img_metas (list[dict]): List of image info dict where each dict has: 'img_shape', 'scale_factor', 'flip', and may also contain 'filename', 'ori_shape', 'pad_shape', and 'img_norm_cfg'. For details on the values of these keys see `depth/datasets/pipelines/formatting.py:Collect`. depth_gt (Tensor): Ground-truth depth map used to compute the decode-head losses.
Returns: dict[str, Tensor]: a dictionary of loss components """ x = self.extract_feat(img) losses = dict() # the last of x saves the info from neck loss_decode = self._decode_head_forward_train(img, x, img_metas, depth_gt, **kwargs) losses.update(loss_decode) return losses def whole_inference(self, img, img_meta, rescale, size=None): """Inference with full image.""" depth_pred = self.encode_decode(img, img_meta, rescale, size=size) return depth_pred def slide_inference(self, img, img_meta, rescale): """Inference by sliding-window with overlap. If h_crop > h_img or w_crop > w_img, the small patch will be used to decode without padding. """ h_stride, w_stride = self.test_cfg.stride h_crop, w_crop = self.test_cfg.crop_size batch_size, _, h_img, w_img = img.size() h_grids = max(h_img - h_crop + h_stride - 1, 0) // h_stride + 1 w_grids = max(w_img - w_crop + w_stride - 1, 0) // w_stride + 1 preds = img.new_zeros((batch_size, 1, h_img, w_img)) count_mat = img.new_zeros((batch_size, 1, h_img, w_img)) for h_idx in range(h_grids): for w_idx in range(w_grids): y1 = h_idx * h_stride x1 = w_idx * w_stride y2 = min(y1 + h_crop, h_img) x2 = min(x1 + w_crop, w_img) y1 = max(y2 - h_crop, 0) x1 = max(x2 - w_crop, 0) crop_img = img[:, :, y1:y2, x1:x2] depth_pred = self.encode_decode(crop_img, img_meta, rescale) preds += F.pad(depth_pred, (int(x1), int(preds.shape[3] - x2), int(y1), int(preds.shape[2] - y2))) count_mat[:, :, y1:y2, x1:x2] += 1 assert (count_mat == 0).sum() == 0 if torch.onnx.is_in_onnx_export(): # cast count_mat to constant while exporting to ONNX count_mat = torch.from_numpy(count_mat.cpu().detach().numpy()).to(device=img.device) preds = preds / count_mat return preds def inference(self, img, img_meta, rescale, size=None): """Inference with slide/whole style. Args: img (Tensor): The input image of shape (N, 3, H, W). 
img_meta (dict): Image info dict where each dict has: 'img_shape', 'scale_factor', 'flip', and may also contain 'filename', 'ori_shape', 'pad_shape', and 'img_norm_cfg'. For details on the values of these keys see `depth/datasets/pipelines/formatting.py:Collect`. rescale (bool): Whether rescale back to original shape. Returns: Tensor: The output depth map. """ assert self.test_cfg.mode in ["slide", "whole"] ori_shape = img_meta[0]["ori_shape"] assert all(_["ori_shape"] == ori_shape for _ in img_meta) if self.test_cfg.mode == "slide": depth_pred = self.slide_inference(img, img_meta, rescale) else: depth_pred = self.whole_inference(img, img_meta, rescale, size=size) output = depth_pred flip = img_meta[0]["flip"] if flip: flip_direction = img_meta[0]["flip_direction"] assert flip_direction in ["horizontal", "vertical"] if flip_direction == "horizontal": output = output.flip(dims=(3,)) elif flip_direction == "vertical": output = output.flip(dims=(2,)) return output def simple_test(self, img, img_meta, rescale=True): """Simple test with single image.""" depth_pred = self.inference(img, img_meta, rescale) if torch.onnx.is_in_onnx_export(): # our inference backend only support 4D output depth_pred = depth_pred.unsqueeze(0) return depth_pred depth_pred = depth_pred.cpu().numpy() # unravel batch dim depth_pred = list(depth_pred) return depth_pred def aug_test(self, imgs, img_metas, rescale=True): """Test with augmentations. Only rescale=True is supported. 
""" # aug_test rescale all imgs back to ori_shape for now assert rescale # to save memory, we get augmented depth logit inplace depth_pred = self.inference(imgs[0], img_metas[0], rescale) for i in range(1, len(imgs)): cur_depth_pred = self.inference(imgs[i], img_metas[i], rescale, size=depth_pred.shape[-2:]) depth_pred += cur_depth_pred depth_pred /= len(imgs) depth_pred = depth_pred.cpu().numpy() # unravel batch dim depth_pred = list(depth_pred) return depth_pred ``` ## /dinov2/eval/depth/models/losses/__init__.py ```py path="/dinov2/eval/depth/models/losses/__init__.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. from .gradientloss import GradientLoss from .sigloss import SigLoss ``` ## /dinov2/eval/depth/models/losses/gradientloss.py ```py path="/dinov2/eval/depth/models/losses/gradientloss.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. import torch import torch.nn as nn from ...models.builder import LOSSES @LOSSES.register_module() class GradientLoss(nn.Module): """GradientLoss. Adapted from https://www.cs.cornell.edu/projects/megadepth/ Args: valid_mask (bool): Whether filter invalid gt (gt > 0). Default: True. loss_weight (float): Weight of the loss. Default: 1.0. max_depth (int): When filtering invalid gt, set a max threshold. Default: None. 
""" def __init__(self, valid_mask=True, loss_weight=1.0, max_depth=None, loss_name="loss_grad"): super(GradientLoss, self).__init__() self.valid_mask = valid_mask self.loss_weight = loss_weight self.max_depth = max_depth self.loss_name = loss_name self.eps = 0.001 # avoid grad explode def gradientloss(self, input, target): input_downscaled = [input] + [input[:: 2 * i, :: 2 * i] for i in range(1, 4)] target_downscaled = [target] + [target[:: 2 * i, :: 2 * i] for i in range(1, 4)] gradient_loss = 0 for input, target in zip(input_downscaled, target_downscaled): if self.valid_mask: mask = target > 0 if self.max_depth is not None: mask = torch.logical_and(target > 0, target <= self.max_depth) N = torch.sum(mask) else: mask = torch.ones_like(target) N = input.numel() input_log = torch.log(input + self.eps) target_log = torch.log(target + self.eps) log_d_diff = input_log - target_log log_d_diff = torch.mul(log_d_diff, mask) v_gradient = torch.abs(log_d_diff[0:-2, :] - log_d_diff[2:, :]) v_mask = torch.mul(mask[0:-2, :], mask[2:, :]) v_gradient = torch.mul(v_gradient, v_mask) h_gradient = torch.abs(log_d_diff[:, 0:-2] - log_d_diff[:, 2:]) h_mask = torch.mul(mask[:, 0:-2], mask[:, 2:]) h_gradient = torch.mul(h_gradient, h_mask) gradient_loss += (torch.sum(h_gradient) + torch.sum(v_gradient)) / N return gradient_loss def forward(self, depth_pred, depth_gt): """Forward function.""" gradient_loss = self.loss_weight * self.gradientloss(depth_pred, depth_gt) return gradient_loss ``` ## /dinov2/eval/depth/models/losses/sigloss.py ```py path="/dinov2/eval/depth/models/losses/sigloss.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. import torch import torch.nn as nn from ...models.builder import LOSSES @LOSSES.register_module() class SigLoss(nn.Module): """SigLoss. This follows `AdaBins `_. 
Args: valid_mask (bool): Whether to filter invalid gt (gt > 0). Default: True. loss_weight (float): Weight of the loss. Default: 1.0. max_depth (int): When filtering invalid gt, set a max threshold. Default: None. warm_up (bool): A simple warm-up stage to help convergence. Default: False. warm_iter (int): The number of warm-up iterations. Default: 100. """ def __init__( self, valid_mask=True, loss_weight=1.0, max_depth=None, warm_up=False, warm_iter=100, loss_name="sigloss" ): super(SigLoss, self).__init__() self.valid_mask = valid_mask self.loss_weight = loss_weight self.max_depth = max_depth self.loss_name = loss_name self.eps = 0.001 # avoid grad explode # HACK: a hack implementation for warmup sigloss self.warm_up = warm_up self.warm_iter = warm_iter self.warm_up_counter = 0 def sigloss(self, input, target): if self.valid_mask: valid_mask = target > 0 if self.max_depth is not None: valid_mask = torch.logical_and(target > 0, target <= self.max_depth) input = input[valid_mask] target = target[valid_mask] if self.warm_up: if self.warm_up_counter < self.warm_iter: g = torch.log(input + self.eps) - torch.log(target + self.eps) g = 0.15 * torch.pow(torch.mean(g), 2) self.warm_up_counter += 1 return torch.sqrt(g) g = torch.log(input + self.eps) - torch.log(target + self.eps) Dg = torch.var(g) + 0.15 * torch.pow(torch.mean(g), 2) return torch.sqrt(Dg) def forward(self, depth_pred, depth_gt): """Forward function.""" loss_depth = self.loss_weight * self.sigloss(depth_pred, depth_gt) return loss_depth ``` ## /dinov2/eval/depth/ops/__init__.py ```py path="/dinov2/eval/depth/ops/__init__.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. from .wrappers import resize ``` ## /dinov2/eval/depth/ops/wrappers.py ```py path="/dinov2/eval/depth/ops/wrappers.py" # Copyright (c) Meta Platforms, Inc. and affiliates.
# # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. import warnings import torch.nn.functional as F def resize(input, size=None, scale_factor=None, mode="nearest", align_corners=None, warning=False): if warning: if size is not None and align_corners: input_h, input_w = tuple(int(x) for x in input.shape[2:]) output_h, output_w = tuple(int(x) for x in size) if output_h > input_h or output_w > input_w: if ( (output_h > 1 and output_w > 1 and input_h > 1 and input_w > 1) and (output_h - 1) % (input_h - 1) and (output_w - 1) % (input_w - 1) ): warnings.warn( f"When align_corners={align_corners}, " "the output would be more aligned if " f"input size {(input_h, input_w)} is `x+1` and " f"out size {(output_h, output_w)} is `nx+1`" ) return F.interpolate(input, size, scale_factor, mode, align_corners) ``` ## /dinov2/eval/knn.py ```py path="/dinov2/eval/knn.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree.
import argparse from functools import partial import json import logging import os import sys from typing import List, Optional import torch from torch.nn.functional import one_hot, softmax import dinov2.distributed as distributed from dinov2.data import SamplerType, make_data_loader, make_dataset from dinov2.data.transforms import make_classification_eval_transform from dinov2.eval.metrics import AccuracyAveraging, build_topk_accuracy_metric from dinov2.eval.setup import get_args_parser as get_setup_args_parser from dinov2.eval.setup import setup_and_build_model from dinov2.eval.utils import ModelWithNormalize, evaluate, extract_features logger = logging.getLogger("dinov2") def get_args_parser( description: Optional[str] = None, parents: Optional[List[argparse.ArgumentParser]] = None, add_help: bool = True, ): parents = parents or [] setup_args_parser = get_setup_args_parser(parents=parents, add_help=False) parents = [setup_args_parser] parser = argparse.ArgumentParser( description=description, parents=parents, add_help=add_help, ) parser.add_argument( "--train-dataset", dest="train_dataset_str", type=str, help="Training dataset", ) parser.add_argument( "--val-dataset", dest="val_dataset_str", type=str, help="Validation dataset", ) parser.add_argument( "--nb_knn", nargs="+", type=int, help="Number of NN to use. 20 usually works best.", ) parser.add_argument( "--temperature", type=float, help="Temperature used in the voting coefficient", ) parser.add_argument( "--gather-on-cpu", action="store_true", help="Whether to gather the train features on cpu, slower " "but useful to avoid OOM for large datasets (e.g.
ImageNet22k).", ) parser.add_argument( "--batch-size", type=int, help="Batch size.", ) parser.add_argument( "--n-per-class-list", nargs="+", type=int, help="Number to take per class", ) parser.add_argument( "--n-tries", type=int, help="Number of tries", ) parser.set_defaults( train_dataset_str="ImageNet:split=TRAIN", val_dataset_str="ImageNet:split=VAL", nb_knn=[10, 20, 100, 200], temperature=0.07, batch_size=256, n_per_class_list=[-1], n_tries=1, ) return parser class KnnModule(torch.nn.Module): """ Gets knn of test features from all processes on a chunk of the train features Each rank gets a chunk of the train features as well as a chunk of the test features. In `compute_neighbors`, for each rank one after the other, its chunk of test features is sent to all devices, partial knns are computed with each chunk of train features then collated back on the original device. """ def __init__(self, train_features, train_labels, nb_knn, T, device, num_classes=1000): super().__init__() self.global_rank = distributed.get_global_rank() self.global_size = distributed.get_global_size() self.device = device self.train_features_rank_T = train_features.chunk(self.global_size)[self.global_rank].T.to(self.device) self.candidates = train_labels.chunk(self.global_size)[self.global_rank].view(1, -1).to(self.device) self.nb_knn = nb_knn self.max_k = max(self.nb_knn) self.T = T self.num_classes = num_classes def _get_knn_sims_and_labels(self, similarity, train_labels): topk_sims, indices = similarity.topk(self.max_k, largest=True, sorted=True) neighbors_labels = torch.gather(train_labels, 1, indices) return topk_sims, neighbors_labels def _similarity_for_rank(self, features_rank, source_rank): # Send the features from `source_rank` to all ranks broadcast_shape = torch.tensor(features_rank.shape).to(self.device) torch.distributed.broadcast(broadcast_shape, source_rank) broadcasted = features_rank if self.global_rank != source_rank: broadcasted = torch.zeros(*broadcast_shape, 
dtype=features_rank.dtype, device=self.device) torch.distributed.broadcast(broadcasted, source_rank) # Compute the neighbors for `source_rank` among `train_features_rank_T` similarity_rank = torch.mm(broadcasted, self.train_features_rank_T) candidate_labels = self.candidates.expand(len(similarity_rank), -1) return self._get_knn_sims_and_labels(similarity_rank, candidate_labels) def _gather_all_knn_for_rank(self, topk_sims, neighbors_labels, target_rank): # Gather all neighbors for `target_rank` topk_sims_rank = retrieved_rank = None if self.global_rank == target_rank: topk_sims_rank = [torch.zeros_like(topk_sims) for _ in range(self.global_size)] retrieved_rank = [torch.zeros_like(neighbors_labels) for _ in range(self.global_size)] torch.distributed.gather(topk_sims, topk_sims_rank, dst=target_rank) torch.distributed.gather(neighbors_labels, retrieved_rank, dst=target_rank) if self.global_rank == target_rank: # Perform a second top-k on the k * global_size retrieved neighbors topk_sims_rank = torch.cat(topk_sims_rank, dim=1) retrieved_rank = torch.cat(retrieved_rank, dim=1) results = self._get_knn_sims_and_labels(topk_sims_rank, retrieved_rank) return results return None def compute_neighbors(self, features_rank): for rank in range(self.global_size): topk_sims, neighbors_labels = self._similarity_for_rank(features_rank, rank) results = self._gather_all_knn_for_rank(topk_sims, neighbors_labels, rank) if results is not None: topk_sims_rank, neighbors_labels_rank = results return topk_sims_rank, neighbors_labels_rank def forward(self, features_rank): """ Compute the results on all values of `self.nb_knn` neighbors from the full `self.max_k` """ assert all(k <= self.max_k for k in self.nb_knn) topk_sims, neighbors_labels = self.compute_neighbors(features_rank) batch_size = neighbors_labels.shape[0] topk_sims_transform = softmax(topk_sims / self.T, 1) matmul = torch.mul( one_hot(neighbors_labels, num_classes=self.num_classes), topk_sims_transform.view(batch_size, -1, 
1), ) probas_for_k = {k: torch.sum(matmul[:, :k, :], 1) for k in self.nb_knn} return probas_for_k class DictKeysModule(torch.nn.Module): def __init__(self, keys): super().__init__() self.keys = keys def forward(self, features_dict, targets): for k in self.keys: features_dict = features_dict[k] return {"preds": features_dict, "target": targets} def create_module_dict(*, module, n_per_class_list, n_tries, nb_knn, train_features, train_labels): modules = {} mapping = create_class_indices_mapping(train_labels) for npc in n_per_class_list: if npc < 0: # Only one try needed when using the full data full_module = module( train_features=train_features, train_labels=train_labels, nb_knn=nb_knn, ) modules["full"] = ModuleDictWithForward({"1": full_module}) continue all_tries = {} for t in range(n_tries): final_indices = filter_train(mapping, npc, seed=t) k_list = list(set(nb_knn + [npc])) k_list = sorted([el for el in k_list if el <= npc]) all_tries[str(t)] = module( train_features=train_features[final_indices], train_labels=train_labels[final_indices], nb_knn=k_list, ) modules[f"{npc} per class"] = ModuleDictWithForward(all_tries) return ModuleDictWithForward(modules) def filter_train(mapping, n_per_class, seed): torch.manual_seed(seed) final_indices = [] for k in mapping.keys(): index = torch.randperm(len(mapping[k]))[:n_per_class] final_indices.append(mapping[k][index]) return torch.cat(final_indices).squeeze() def create_class_indices_mapping(labels): unique_labels, inverse = torch.unique(labels, return_inverse=True) mapping = {unique_labels[i]: (inverse == i).nonzero() for i in range(len(unique_labels))} return mapping class ModuleDictWithForward(torch.nn.ModuleDict): def forward(self, *args, **kwargs): return {k: module(*args, **kwargs) for k, module in self._modules.items()} def eval_knn( model, train_dataset, val_dataset, accuracy_averaging, nb_knn, temperature, batch_size, num_workers, gather_on_cpu, n_per_class_list=[-1], n_tries=1, ): model = 
ModelWithNormalize(model) logger.info("Extracting features for train set...") train_features, train_labels = extract_features( model, train_dataset, batch_size, num_workers, gather_on_cpu=gather_on_cpu ) logger.info(f"Train features created, shape {train_features.shape}.") val_dataloader = make_data_loader( dataset=val_dataset, batch_size=batch_size, num_workers=num_workers, sampler_type=SamplerType.DISTRIBUTED, drop_last=False, shuffle=False, persistent_workers=True, ) num_classes = train_labels.max() + 1 metric_collection = build_topk_accuracy_metric(accuracy_averaging, num_classes=num_classes) device = torch.cuda.current_device() partial_module = partial(KnnModule, T=temperature, device=device, num_classes=num_classes) knn_module_dict = create_module_dict( module=partial_module, n_per_class_list=n_per_class_list, n_tries=n_tries, nb_knn=nb_knn, train_features=train_features, train_labels=train_labels, ) postprocessors, metrics = {}, {} for n_per_class, knn_module in knn_module_dict.items(): for t, knn_try in knn_module.items(): postprocessors = { **postprocessors, **{(n_per_class, t, k): DictKeysModule([n_per_class, t, k]) for k in knn_try.nb_knn}, } metrics = {**metrics, **{(n_per_class, t, k): metric_collection.clone() for k in knn_try.nb_knn}} model_with_knn = torch.nn.Sequential(model, knn_module_dict) # ============ evaluation ... ============ logger.info("Start the k-NN classification.") _, results_dict = evaluate(model_with_knn, val_dataloader, postprocessors, metrics, device) # Averaging the results over the n tries for each value of n_per_class for n_per_class, knn_module in knn_module_dict.items(): first_try = list(knn_module.keys())[0] k_list = knn_module[first_try].nb_knn for k in k_list: keys = results_dict[(n_per_class, first_try, k)].keys() # keys are e.g. 
`top-1` and `top-5` results_dict[(n_per_class, k)] = { key: torch.mean(torch.stack([results_dict[(n_per_class, t, k)][key] for t in knn_module.keys()])) for key in keys } for t in knn_module.keys(): del results_dict[(n_per_class, t, k)] return results_dict def eval_knn_with_model( model, output_dir, train_dataset_str="ImageNet:split=TRAIN", val_dataset_str="ImageNet:split=VAL", nb_knn=(10, 20, 100, 200), temperature=0.07, autocast_dtype=torch.float, accuracy_averaging=AccuracyAveraging.MEAN_ACCURACY, transform=None, gather_on_cpu=False, batch_size=256, num_workers=5, n_per_class_list=[-1], n_tries=1, ): transform = transform or make_classification_eval_transform() train_dataset = make_dataset( dataset_str=train_dataset_str, transform=transform, ) val_dataset = make_dataset( dataset_str=val_dataset_str, transform=transform, ) with torch.cuda.amp.autocast(dtype=autocast_dtype): results_dict_knn = eval_knn( model=model, train_dataset=train_dataset, val_dataset=val_dataset, accuracy_averaging=accuracy_averaging, nb_knn=nb_knn, temperature=temperature, batch_size=batch_size, num_workers=num_workers, gather_on_cpu=gather_on_cpu, n_per_class_list=n_per_class_list, n_tries=n_tries, ) results_dict = {} if distributed.is_main_process(): for knn_ in results_dict_knn.keys(): top1 = results_dict_knn[knn_]["top-1"].item() * 100.0 top5 = results_dict_knn[knn_]["top-5"].item() * 100.0 results_dict[f"{knn_} Top 1"] = top1 results_dict[f"{knn_} Top 5"] = top5 logger.info(f"{knn_} classifier result: Top1: {top1:.2f} Top5: {top5:.2f}") metrics_file_path = os.path.join(output_dir, "results_eval_knn.json") with open(metrics_file_path, "a") as f: for k, v in results_dict.items(): f.write(json.dumps({k: v}) + "\n") if distributed.is_enabled(): torch.distributed.barrier() return results_dict def main(args): model, autocast_dtype = setup_and_build_model(args) eval_knn_with_model( model=model, output_dir=args.output_dir, train_dataset_str=args.train_dataset_str, 
val_dataset_str=args.val_dataset_str, nb_knn=args.nb_knn, temperature=args.temperature, autocast_dtype=autocast_dtype, accuracy_averaging=AccuracyAveraging.MEAN_ACCURACY, transform=None, gather_on_cpu=args.gather_on_cpu, batch_size=args.batch_size, num_workers=5, n_per_class_list=args.n_per_class_list, n_tries=args.n_tries, ) return 0 if __name__ == "__main__": description = "DINOv2 k-NN evaluation" args_parser = get_args_parser(description=description) args = args_parser.parse_args() sys.exit(main(args)) ``` ## /dinov2/eval/linear.py ```py path="/dinov2/eval/linear.py" # Copyright (c) Meta Platforms, Inc. and affiliates. # # This source code is licensed under the Apache License, Version 2.0 # found in the LICENSE file in the root directory of this source tree. import argparse from functools import partial import json import logging import os import sys from typing import List, Optional import numpy as np import torch import torch.nn as nn from torch.nn.parallel import DistributedDataParallel from fvcore.common.checkpoint import Checkpointer, PeriodicCheckpointer from dinov2.data import SamplerType, make_data_loader, make_dataset from dinov2.data.transforms import make_classification_eval_transform, make_classification_train_transform import dinov2.distributed as distributed from dinov2.eval.metrics import MetricType, build_metric from dinov2.eval.setup import get_args_parser as get_setup_args_parser from dinov2.eval.setup import setup_and_build_model from dinov2.eval.utils import ModelWithIntermediateLayers, evaluate from dinov2.logging import MetricLogger logger = logging.getLogger("dinov2") def get_args_parser( description: Optional[str] = None, parents: Optional[List[argparse.ArgumentParser]] = None, add_help: bool = True, ): parents = parents or [] setup_args_parser = get_setup_args_parser(parents=parents, add_help=False) parents = [setup_args_parser] parser = argparse.ArgumentParser( description=description, parents=parents, add_help=add_help, ) 
parser.add_argument( "--train-dataset", dest="train_dataset_str", type=str, help="Training dataset", ) parser.add_argument( "--val-dataset", dest="val_dataset_str", type=str, help="Validation dataset", ) parser.add_argument( "--test-datasets", dest="test_dataset_strs", type=str, nargs="+", help="Test datasets, none to reuse the validation dataset", ) parser.add_argument( "--epochs", type=int, help="Number of training epochs", ) parser.add_argument( "--batch-size", type=int, help="Batch Size (per GPU)", ) parser.add_argument( "--num-workers", type=int, help="Number of workers", ) parser.add_argument( "--epoch-length", type=int, help="Length of an epoch in number of iterations", ) parser.add_argument( "--save-checkpoint-frequency", type=int, help="Number of epochs between two named checkpoint saves.", ) parser.add_argument( "--eval-period-iterations", type=int, help="Number of iterations between two evaluations.", ) parser.add_argument( "--learning-rates", nargs="+", type=float, help="Learning rates to grid search.", ) parser.add_argument( "--no-resume", action="store_true", help="Whether to not resume from existing checkpoints", ) parser.add_argument( "--val-metric-type", type=MetricType, choices=list(MetricType), help="Validation metric", ) parser.add_argument( "--test-metric-types", type=MetricType, choices=list(MetricType), nargs="+", help="Evaluation metric", ) parser.add_argument( "--classifier-fpath", type=str, help="Path to a file containing pretrained linear classifiers", ) parser.add_argument( "--val-class-mapping-fpath", type=str, help="Path to a file containing a mapping to adjust classifier outputs", ) parser.add_argument( "--test-class-mapping-fpaths", nargs="+", type=str, help="Path to a file containing a mapping to adjust classifier outputs", ) parser.set_defaults( train_dataset_str="ImageNet:split=TRAIN", val_dataset_str="ImageNet:split=VAL", test_dataset_strs=None, epochs=10, batch_size=128, num_workers=8, epoch_length=1250,
save_checkpoint_frequency=20, eval_period_iterations=1250, learning_rates=[1e-5, 2e-5, 5e-5, 1e-4, 2e-4, 5e-4, 1e-3, 2e-3, 5e-3, 1e-2, 2e-2, 5e-2, 0.1], val_metric_type=MetricType.MEAN_ACCURACY, test_metric_types=None, classifier_fpath=None, val_class_mapping_fpath=None, test_class_mapping_fpaths=[None], ) return parser def has_ddp_wrapper(m: nn.Module) -> bool: return isinstance(m, DistributedDataParallel) def remove_ddp_wrapper(m: nn.Module) -> nn.Module: return m.module if has_ddp_wrapper(m) else m def _pad_and_collate(batch): maxlen = max(len(targets) for image, targets in batch) padded_batch = [ (image, np.pad(targets, (0, maxlen - len(targets)), constant_values=-1)) for image, targets in batch ] return torch.utils.data.default_collate(padded_batch) def create_linear_input(x_tokens_list, use_n_blocks, use_avgpool): intermediate_output = x_tokens_list[-use_n_blocks:] output = torch.cat([class_token for _, class_token in intermediate_output], dim=-1) if use_avgpool: output = torch.cat( ( output, torch.mean(intermediate_output[-1][0], dim=1), # patch tokens ), dim=-1, ) output = output.reshape(output.shape[0], -1) return output.float() class LinearClassifier(nn.Module): """Linear layer to train on top of frozen features""" def __init__(self, out_dim, use_n_blocks, use_avgpool, num_classes=1000): super().__init__() self.out_dim = out_dim self.use_n_blocks = use_n_blocks self.use_avgpool = use_avgpool self.num_classes = num_classes self.linear = nn.Linear(out_dim, num_classes) self.linear.weight.data.normal_(mean=0.0, std=0.01) self.linear.bias.data.zero_() def forward(self, x_tokens_list): output = create_linear_input(x_tokens_list, self.use_n_blocks, self.use_avgpool) return self.linear(output) class AllClassifiers(nn.Module): def __init__(self, classifiers_dict): super().__init__() self.classifiers_dict = nn.ModuleDict() self.classifiers_dict.update(classifiers_dict) def forward(self, inputs): return {k: v.forward(inputs) for k, v in 
self.classifiers_dict.items()} def __len__(self): return len(self.classifiers_dict) class LinearPostprocessor(nn.Module): def __init__(self, linear_classifier, class_mapping=None): super().__init__() self.linear_classifier = linear_classifier self.register_buffer("class_mapping", None if class_mapping is None else torch.LongTensor(class_mapping)) def forward(self, samples, targets): preds = self.linear_classifier(samples) return { "preds": preds[:, self.class_mapping] if self.class_mapping is not None else preds, "target": targets, } def scale_lr(learning_rates, batch_size): return learning_rates * (batch_size * distributed.get_global_size()) / 256.0 def setup_linear_classifiers(sample_output, n_last_blocks_list, learning_rates, batch_size, num_classes=1000): linear_classifiers_dict = nn.ModuleDict() optim_param_groups = [] for n in n_last_blocks_list: for avgpool in [False, True]: for _lr in learning_rates: lr = scale_lr(_lr, batch_size) out_dim = create_linear_input(sample_output, use_n_blocks=n, use_avgpool=avgpool).shape[1] linear_classifier = LinearClassifier( out_dim, use_n_blocks=n, use_avgpool=avgpool, num_classes=num_classes ) linear_classifier = linear_classifier.cuda() linear_classifiers_dict[ f"classifier_{n}_blocks_avgpool_{avgpool}_lr_{lr:.5f}".replace(".", "_") ] = linear_classifier optim_param_groups.append({"params": linear_classifier.parameters(), "lr": lr}) linear_classifiers = AllClassifiers(linear_classifiers_dict) if distributed.is_enabled(): linear_classifiers = nn.parallel.DistributedDataParallel(linear_classifiers) return linear_classifiers, optim_param_groups @torch.no_grad() def evaluate_linear_classifiers( feature_model, linear_classifiers, data_loader, metric_type, metrics_file_path, training_num_classes, iteration, prefixstring="", class_mapping=None, best_classifier_on_val=None, ): logger.info("running validation !") num_classes = len(class_mapping) if class_mapping is not None else training_num_classes metric = 
build_metric(metric_type, num_classes=num_classes) postprocessors = {k: LinearPostprocessor(v, class_mapping) for k, v in linear_classifiers.classifiers_dict.items()} metrics = {k: metric.clone() for k in linear_classifiers.classifiers_dict} _, results_dict_temp = evaluate( feature_model, data_loader, postprocessors, metrics, torch.cuda.current_device(), ) logger.info("") results_dict = {} max_accuracy = 0 best_classifier = "" for i, (classifier_string, metric) in enumerate(results_dict_temp.items()): logger.info(f"{prefixstring} -- Classifier: {classifier_string} * {metric}") if ( best_classifier_on_val is None and metric["top-1"].item() > max_accuracy ) or classifier_string == best_classifier_on_val: max_accuracy = metric["top-1"].item() best_classifier = classifier_string results_dict["best_classifier"] = {"name": best_classifier, "accuracy": max_accuracy} logger.info(f"best classifier: {results_dict['best_classifier']}") if distributed.is_main_process(): with open(metrics_file_path, "a") as f: f.write(f"iter: {iteration}\n") for k, v in results_dict.items(): f.write(json.dumps({k: v}) + "\n") f.write("\n") return results_dict def eval_linear( *, feature_model, linear_classifiers, train_data_loader, val_data_loader, metrics_file_path, optimizer, scheduler, output_dir, max_iter, checkpoint_period, # In number of iter, creates a new file every period running_checkpoint_period, # Period to update main checkpoint file eval_period, metric_type, training_num_classes, resume=True, classifier_fpath=None, val_class_mapping=None, ): checkpointer = Checkpointer(linear_classifiers, output_dir, optimizer=optimizer, scheduler=scheduler) start_iter = checkpointer.resume_or_load(classifier_fpath or "", resume=resume).get("iteration", -1) + 1 periodic_checkpointer = PeriodicCheckpointer(checkpointer, checkpoint_period, max_iter=max_iter) iteration = start_iter logger.info("Starting training from iteration {}".format(start_iter)) metric_logger = MetricLogger(delimiter=" ") header 
= "Training" for data, labels in metric_logger.log_every( train_data_loader, 10, header, max_iter, start_iter, ): data = data.cuda(non_blocking=True) labels = labels.cuda(non_blocking=True) features = feature_model(data) outputs = linear_classifiers(features) losses = {f"loss_{k}": nn.CrossEntropyLoss()(v, labels) for k, v in outputs.items()} loss = sum(losses.values()) # compute the gradients optimizer.zero_grad() loss.backward() # step optimizer.step() scheduler.step() # log if iteration % 10 == 0: torch.cuda.synchronize() metric_logger.update(loss=loss.item()) metric_logger.update(lr=optimizer.param_groups[0]["lr"]) print("lr", optimizer.param_groups[0]["lr"]) if iteration - start_iter > 5: if iteration % running_checkpoint_period == 0: torch.cuda.synchronize() if distributed.is_main_process(): logger.info("Checkpointing running_checkpoint") periodic_checkpointer.save("running_checkpoint_linear_eval", iteration=iteration) torch.cuda.synchronize() periodic_checkpointer.step(iteration) if eval_period > 0 and (iteration + 1) % eval_period == 0 and iteration != max_iter - 1: _ = evaluate_linear_classifiers( feature_model=feature_model, linear_classifiers=remove_ddp_wrapper(linear_classifiers), data_loader=val_data_loader, metrics_file_path=metrics_file_path, prefixstring=f"ITER: {iteration}", metric_type=metric_type, training_num_classes=training_num_classes, iteration=iteration, class_mapping=val_class_mapping, ) torch.cuda.synchronize() iteration = iteration + 1 val_results_dict = evaluate_linear_classifiers( feature_model=feature_model, linear_classifiers=remove_ddp_wrapper(linear_classifiers), data_loader=val_data_loader, metrics_file_path=metrics_file_path, metric_type=metric_type, training_num_classes=training_num_classes, iteration=iteration, class_mapping=val_class_mapping, ) return val_results_dict, feature_model, linear_classifiers, iteration def make_eval_data_loader(test_dataset_str, batch_size, num_workers, metric_type): test_dataset = make_dataset( 
        dataset_str=test_dataset_str,
        transform=make_classification_eval_transform(),
    )
    test_data_loader = make_data_loader(
        dataset=test_dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        sampler_type=SamplerType.DISTRIBUTED,
        drop_last=False,
        shuffle=False,
        persistent_workers=False,
        collate_fn=_pad_and_collate if metric_type == MetricType.IMAGENET_REAL_ACCURACY else None,
    )
    return test_data_loader


def test_on_datasets(
    feature_model,
    linear_classifiers,
    test_dataset_strs,
    batch_size,
    num_workers,
    test_metric_types,
    metrics_file_path,
    training_num_classes,
    iteration,
    best_classifier_on_val,
    prefixstring="",
    test_class_mappings=[None],
):
    results_dict = {}
    for test_dataset_str, class_mapping, metric_type in zip(test_dataset_strs, test_class_mappings, test_metric_types):
        logger.info(f"Testing on {test_dataset_str}")
        test_data_loader = make_eval_data_loader(test_dataset_str, batch_size, num_workers, metric_type)
        dataset_results_dict = evaluate_linear_classifiers(
            feature_model,
            remove_ddp_wrapper(linear_classifiers),
            test_data_loader,
            metric_type,
            metrics_file_path,
            training_num_classes,
            iteration,
            prefixstring="",
            class_mapping=class_mapping,
            best_classifier_on_val=best_classifier_on_val,
        )
        results_dict[f"{test_dataset_str}_accuracy"] = 100.0 * dataset_results_dict["best_classifier"]["accuracy"]
    return results_dict


def run_eval_linear(
    model,
    output_dir,
    train_dataset_str,
    val_dataset_str,
    batch_size,
    epochs,
    epoch_length,
    num_workers,
    save_checkpoint_frequency,
    eval_period_iterations,
    learning_rates,
    autocast_dtype,
    test_dataset_strs=None,
    resume=True,
    classifier_fpath=None,
    val_class_mapping_fpath=None,
    test_class_mapping_fpaths=[None],
    val_metric_type=MetricType.MEAN_ACCURACY,
    test_metric_types=None,
):
    seed = 0

    if test_dataset_strs is None:
        test_dataset_strs = [val_dataset_str]
    if test_metric_types is None:
        test_metric_types = [val_metric_type] * len(test_dataset_strs)
    else:
        assert len(test_metric_types) == len(test_dataset_strs)
    assert len(test_dataset_strs) == len(test_class_mapping_fpaths)

    train_transform = make_classification_train_transform()
    train_dataset = make_dataset(
        dataset_str=train_dataset_str,
        transform=train_transform,
    )
    training_num_classes = len(torch.unique(torch.Tensor(train_dataset.get_targets().astype(int))))
    sampler_type = SamplerType.SHARDED_INFINITE
    # sampler_type = SamplerType.INFINITE

    n_last_blocks_list = [1, 4]
    n_last_blocks = max(n_last_blocks_list)
    autocast_ctx = partial(torch.cuda.amp.autocast, enabled=True, dtype=autocast_dtype)
    feature_model = ModelWithIntermediateLayers(model, n_last_blocks, autocast_ctx)
    sample_output = feature_model(train_dataset[0][0].unsqueeze(0).cuda())

    linear_classifiers, optim_param_groups = setup_linear_classifiers(
        sample_output,
        n_last_blocks_list,
        learning_rates,
        batch_size,
        training_num_classes,
    )

    optimizer = torch.optim.SGD(optim_param_groups, momentum=0.9, weight_decay=0)
    max_iter = epochs * epoch_length
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, max_iter, eta_min=0)
    checkpointer = Checkpointer(linear_classifiers, output_dir, optimizer=optimizer, scheduler=scheduler)
    start_iter = checkpointer.resume_or_load(classifier_fpath or "", resume=resume).get("iteration", -1) + 1
    train_data_loader = make_data_loader(
        dataset=train_dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        shuffle=True,
        seed=seed,
        sampler_type=sampler_type,
        sampler_advance=start_iter,
        drop_last=True,
        persistent_workers=True,
    )
    val_data_loader = make_eval_data_loader(val_dataset_str, batch_size, num_workers, val_metric_type)

    checkpoint_period = save_checkpoint_frequency * epoch_length

    if val_class_mapping_fpath is not None:
        logger.info(f"Using class mapping from {val_class_mapping_fpath}")
        val_class_mapping = np.load(val_class_mapping_fpath)
    else:
        val_class_mapping = None

    test_class_mappings = []
    for class_mapping_fpath in test_class_mapping_fpaths:
        if class_mapping_fpath is not None and class_mapping_fpath != "None":
            logger.info(f"Using class mapping from {class_mapping_fpath}")
            class_mapping = np.load(class_mapping_fpath)
        else:
            class_mapping = None
        test_class_mappings.append(class_mapping)

    metrics_file_path = os.path.join(output_dir, "results_eval_linear.json")
    val_results_dict, feature_model, linear_classifiers, iteration = eval_linear(
        feature_model=feature_model,
        linear_classifiers=linear_classifiers,
        train_data_loader=train_data_loader,
        val_data_loader=val_data_loader,
        metrics_file_path=metrics_file_path,
        optimizer=optimizer,
        scheduler=scheduler,
        output_dir=output_dir,
        max_iter=max_iter,
        checkpoint_period=checkpoint_period,
        running_checkpoint_period=epoch_length,
        eval_period=eval_period_iterations,
        metric_type=val_metric_type,
        training_num_classes=training_num_classes,
        resume=resume,
        val_class_mapping=val_class_mapping,
        classifier_fpath=classifier_fpath,
    )
    results_dict = {}
    if len(test_dataset_strs) > 1 or test_dataset_strs[0] != val_dataset_str:
        results_dict = test_on_datasets(
            feature_model,
            linear_classifiers,
            test_dataset_strs,
            batch_size,
            0,  # num_workers,
            test_metric_types,
            metrics_file_path,
            training_num_classes,
            iteration,
            val_results_dict["best_classifier"]["name"],
            prefixstring="",
            test_class_mappings=test_class_mappings,
        )
    results_dict["best_classifier"] = val_results_dict["best_classifier"]["name"]
    results_dict[f"{val_dataset_str}_accuracy"] = 100.0 * val_results_dict["best_classifier"]["accuracy"]
    logger.info("Test Results Dict " + str(results_dict))

    return results_dict


def main(args):
    model, autocast_dtype = setup_and_build_model(args)
    run_eval_linear(
        model=model,
        output_dir=args.output_dir,
        train_dataset_str=args.train_dataset_str,
        val_dataset_str=args.val_dataset_str,
        test_dataset_strs=args.test_dataset_strs,
        batch_size=args.batch_size,
        epochs=args.epochs,
        epoch_length=args.epoch_length,
        num_workers=args.num_workers,
        save_checkpoint_frequency=args.save_checkpoint_frequency,
        eval_period_iterations=args.eval_period_iterations,
        learning_rates=args.learning_rates,
        autocast_dtype=autocast_dtype,
        resume=not args.no_resume,
        classifier_fpath=args.classifier_fpath,
        val_metric_type=args.val_metric_type,
        test_metric_types=args.test_metric_types,
        val_class_mapping_fpath=args.val_class_mapping_fpath,
        test_class_mapping_fpaths=args.test_class_mapping_fpaths,
    )
    return 0


if __name__ == "__main__":
    description = "DINOv2 linear evaluation"
    args_parser = get_args_parser(description=description)
    args = args_parser.parse_args()
    sys.exit(main(args))
```

## /dinov2/eval/log_regression.py

```py path="/dinov2/eval/log_regression.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the Apache License, Version 2.0
# found in the LICENSE file in the root directory of this source tree.

import argparse
import gc
import logging
import sys
import time
from typing import List, Optional

from cuml.linear_model import LogisticRegression
import torch
import torch.backends.cudnn as cudnn
import torch.distributed
from torch import nn
from torch.utils.data import TensorDataset
from torchmetrics import MetricTracker

from dinov2.data import make_dataset
from dinov2.data.transforms import make_classification_eval_transform
from dinov2.distributed import get_global_rank, get_global_size
from dinov2.eval.metrics import MetricType, build_metric
from dinov2.eval.setup import get_args_parser as get_setup_args_parser
from dinov2.eval.setup import setup_and_build_model
from dinov2.eval.utils import evaluate, extract_features
from dinov2.utils.dtype import as_torch_dtype


logger = logging.getLogger("dinov2")

DEFAULT_MAX_ITER = 1_000
C_POWER_RANGE = torch.linspace(-6, 5, 45)
_CPU_DEVICE = torch.device("cpu")


def get_args_parser(
    description: Optional[str] = None,
    parents: Optional[List[argparse.ArgumentParser]] = None,
    add_help: bool = True,
):
    parents = parents or []
    setup_args_parser = get_setup_args_parser(parents=parents, add_help=False)
    parents = [setup_args_parser]
    parser = argparse.ArgumentParser(
        description=description,
        parents=parents,
        add_help=add_help,
    )
    parser.add_argument(
        "--train-dataset",
        dest="train_dataset_str",
        type=str,
        help="Training dataset",
    )
    parser.add_argument(
        "--val-dataset",
        dest="val_dataset_str",
        type=str,
        help="Validation dataset",
    )
    parser.add_argument(
        "--finetune-dataset-str",
        dest="finetune_dataset_str",
        type=str,
        help="Fine-tuning dataset",
    )
    parser.add_argument(
        "--finetune-on-val",
        action="store_true",
        help="If there is no finetune dataset, whether to choose the "
        "hyperparameters on the val set instead of 10%% of the train dataset",
    )
    parser.add_argument(
        "--metric-type",
        type=MetricType,
        choices=list(MetricType),
        help="Metric type",
    )
    parser.add_argument(
        "--train-features-device",
        type=str,
        help="Device to gather train features (cpu, cuda, cuda:0, etc.), default: %(default)s",
    )
    parser.add_argument(
        "--train-dtype",
        type=str,
        help="Data type to convert the train features to (default: %(default)s)",
    )
    parser.add_argument(
        "--max-train-iters",
        type=int,
        help="Maximum number of train iterations (default: %(default)s)",
    )
    parser.set_defaults(
        train_dataset_str="ImageNet:split=TRAIN",
        val_dataset_str="ImageNet:split=VAL",
        finetune_dataset_str=None,
        metric_type=MetricType.MEAN_ACCURACY,
        train_features_device="cpu",
        train_dtype="float64",
        max_train_iters=DEFAULT_MAX_ITER,
        finetune_on_val=False,
    )
    return parser


class LogRegModule(nn.Module):
    def __init__(
        self,
        C,
        max_iter=DEFAULT_MAX_ITER,
        dtype=torch.float64,
        device=_CPU_DEVICE,
    ):
        super().__init__()
        self.dtype = dtype
        self.device = device
        self.estimator = LogisticRegression(
            penalty="l2",
            C=C,
            max_iter=max_iter,
            output_type="numpy",
            tol=1e-12,
            linesearch_max_iter=50,
        )

    def forward(self, samples, targets):
        samples_device = samples.device
        samples = samples.to(dtype=self.dtype, device=self.device)
        if self.device == _CPU_DEVICE:
            samples = samples.numpy()
        probas = self.estimator.predict_proba(samples)
        return {"preds": torch.from_numpy(probas).to(samples_device), "target": targets}

    def fit(self, train_features, train_labels):
        train_features = train_features.to(dtype=self.dtype, device=self.device)
        train_labels = train_labels.to(dtype=self.dtype, device=self.device)
        if self.device == _CPU_DEVICE:
            # both cuML and sklearn only work with numpy arrays on CPU
            train_features = train_features.numpy()
            train_labels = train_labels.numpy()
        self.estimator.fit(train_features, train_labels)


def evaluate_model(*, logreg_model, logreg_metric, test_data_loader, device):
    postprocessors = {"metrics": logreg_model}
    metrics = {"metrics": logreg_metric}
    return evaluate(nn.Identity(), test_data_loader, postprocessors, metrics, device)


def train_for_C(*, C, max_iter, train_features, train_labels, dtype=torch.float64, device=_CPU_DEVICE):
    logreg_model = LogRegModule(C, max_iter=max_iter, dtype=dtype, device=device)
    logreg_model.fit(train_features, train_labels)
    return logreg_model


def train_and_evaluate(
    *,
    C,
    max_iter,
    train_features,
    train_labels,
    logreg_metric,
    test_data_loader,
    train_dtype=torch.float64,
    train_features_device,
    eval_device,
):
    logreg_model = train_for_C(
        C=C,
        max_iter=max_iter,
        train_features=train_features,
        train_labels=train_labels,
        dtype=train_dtype,
        device=train_features_device,
    )
    return evaluate_model(
        logreg_model=logreg_model,
        logreg_metric=logreg_metric,
        test_data_loader=test_data_loader,
        device=eval_device,
    )


def sweep_C_values(
    *,
    train_features,
    train_labels,
    test_data_loader,
    metric_type,
    num_classes,
    train_dtype=torch.float64,
    train_features_device=_CPU_DEVICE,
    max_train_iters=DEFAULT_MAX_ITER,
):
    if metric_type == MetricType.PER_CLASS_ACCURACY:
        # If we want to output per-class accuracy, we select the hyperparameters with mean per class
        metric_type = MetricType.MEAN_PER_CLASS_ACCURACY
    logreg_metric = build_metric(metric_type, num_classes=num_classes)
    metric_tracker = MetricTracker(logreg_metric, maximize=True)
    ALL_C = 10**C_POWER_RANGE
    logreg_models = {}

    train_features = train_features.to(dtype=train_dtype, device=train_features_device)
    train_labels = train_labels.to(device=train_features_device)

    for i in range(get_global_rank(), len(ALL_C), get_global_size()):
        C = ALL_C[i].item()
        logger.info(
            f"Training for C = {C:.5f}, dtype={train_dtype}, "
            f"features: {train_features.shape}, {train_features.dtype}, "
            f"labels: {train_labels.shape}, {train_labels.dtype}"
        )
        logreg_models[C] = train_for_C(
            C=C,
            max_iter=max_train_iters,
            train_features=train_features,
            train_labels=train_labels,
            dtype=train_dtype,
            device=train_features_device,
        )

    gather_list = [None for _ in range(get_global_size())]
    torch.distributed.all_gather_object(gather_list, logreg_models)

    logreg_models_gathered = {}
    for logreg_dict in gather_list:
        logreg_models_gathered.update(logreg_dict)

    for i in range(len(ALL_C)):
        metric_tracker.increment()
        C = ALL_C[i].item()
        evals = evaluate_model(
            logreg_model=logreg_models_gathered[C],
            logreg_metric=metric_tracker,
            test_data_loader=test_data_loader,
            device=torch.cuda.current_device(),
        )
        logger.info(f"Trained for C = {C:.5f}, accuracies = {evals}")

        best_stats, which_epoch = metric_tracker.best_metric(return_step=True)
        best_stats_100 = {k: 100.0 * v for k, v in best_stats.items()}
        if which_epoch["top-1"] == i:
            best_C = C
    logger.info(f"Sweep best {best_stats_100}, best C = {best_C:.6f}")

    return best_stats, best_C


def eval_log_regression(
    *,
    model,
    train_dataset,
    val_dataset,
    finetune_dataset,
    metric_type,
    batch_size,
    num_workers,
    finetune_on_val=False,
    train_dtype=torch.float64,
    train_features_device=_CPU_DEVICE,
    max_train_iters=DEFAULT_MAX_ITER,
):
    """
    Implements the "standard" process for log regression evaluation:
    The value of C is chosen by training on train_dataset and evaluating on
    finetune_dataset. Then, the final model is trained on a concatenation of
    train_dataset and finetune_dataset, and is evaluated on val_dataset.
    If there is no finetune_dataset, the value of C is the one that yields
    the best results on a random 10% subset of the train dataset
    """

    start = time.time()

    train_features, train_labels = extract_features(
        model, train_dataset, batch_size, num_workers, gather_on_cpu=(train_features_device == _CPU_DEVICE)
    )
    val_features, val_labels = extract_features(
        model, val_dataset, batch_size, num_workers, gather_on_cpu=(train_features_device == _CPU_DEVICE)
    )
    val_data_loader = torch.utils.data.DataLoader(
        TensorDataset(val_features, val_labels),
        batch_size=batch_size,
        drop_last=False,
        num_workers=0,
        persistent_workers=False,
    )

    if finetune_dataset is None and finetune_on_val:
        logger.info("Choosing hyperparameters on the val dataset")
        finetune_features, finetune_labels = val_features, val_labels
    elif finetune_dataset is None and not finetune_on_val:
        logger.info("Choosing hyperparameters on 10% of the train dataset")
        torch.manual_seed(0)
        indices = torch.randperm(len(train_features), device=train_features.device)
        finetune_index = indices[: len(train_features) // 10]
        train_index = indices[len(train_features) // 10 :]
        finetune_features, finetune_labels = train_features[finetune_index], train_labels[finetune_index]
        train_features, train_labels = train_features[train_index], train_labels[train_index]
    else:
        logger.info("Choosing hyperparameters on the finetune dataset")
        finetune_features, finetune_labels = extract_features(
            model, finetune_dataset, batch_size, num_workers, gather_on_cpu=(train_features_device == _CPU_DEVICE)
        )
    # release the model - free GPU memory
    del model
    gc.collect()
    torch.cuda.empty_cache()
    finetune_data_loader = torch.utils.data.DataLoader(
        TensorDataset(finetune_features, finetune_labels),
        batch_size=batch_size,
        drop_last=False,
    )

    if len(train_labels.shape) > 1:
        num_classes = train_labels.shape[1]
    else:
        num_classes = train_labels.max() + 1

    logger.info("Using cuML for logistic regression")

    best_stats, best_C = sweep_C_values(
        train_features=train_features,
        train_labels=train_labels,
        test_data_loader=finetune_data_loader,
        metric_type=metric_type,
        num_classes=num_classes,
        train_dtype=train_dtype,
        train_features_device=train_features_device,
        max_train_iters=max_train_iters,
    )

    if not finetune_on_val:
        logger.info("Best parameter found, concatenating features")
        train_features = torch.cat((train_features, finetune_features))
        train_labels = torch.cat((train_labels, finetune_labels))

    logger.info("Training final model")
    logreg_metric = build_metric(metric_type, num_classes=num_classes)
    evals = train_and_evaluate(
        C=best_C,
        max_iter=max_train_iters,
        train_features=train_features,
        train_labels=train_labels,
        logreg_metric=logreg_metric.clone(),
        test_data_loader=val_data_loader,
        eval_device=torch.cuda.current_device(),
        train_dtype=train_dtype,
        train_features_device=train_features_device,
    )

    best_stats = evals[1]["metrics"]

    best_stats["best_C"] = best_C

    logger.info(f"Log regression evaluation done in {int(time.time() - start)}s")
    return best_stats


def eval_log_regression_with_model(
    model,
    train_dataset_str="ImageNet:split=TRAIN",
    val_dataset_str="ImageNet:split=VAL",
    finetune_dataset_str=None,
    autocast_dtype=torch.float,
    finetune_on_val=False,
    metric_type=MetricType.MEAN_ACCURACY,
    train_dtype=torch.float64,
    train_features_device=_CPU_DEVICE,
    max_train_iters=DEFAULT_MAX_ITER,
):
    cudnn.benchmark = True

    transform = make_classification_eval_transform(resize_size=224)
    target_transform = None

    train_dataset = make_dataset(dataset_str=train_dataset_str, transform=transform, target_transform=target_transform)
    val_dataset = make_dataset(dataset_str=val_dataset_str, transform=transform, target_transform=target_transform)
    if finetune_dataset_str is not None:
        finetune_dataset = make_dataset(
            dataset_str=finetune_dataset_str, transform=transform, target_transform=target_transform
        )
    else:
        finetune_dataset = None

    with torch.cuda.amp.autocast(dtype=autocast_dtype):
        results_dict_logreg = eval_log_regression(
            model=model,
            train_dataset=train_dataset,
            val_dataset=val_dataset,
            finetune_dataset=finetune_dataset,
            metric_type=metric_type,
            batch_size=256,
            num_workers=0,  # 5,
            finetune_on_val=finetune_on_val,
            train_dtype=train_dtype,
            train_features_device=train_features_device,
            max_train_iters=max_train_iters,
        )

    results_dict = {
        "top-1": results_dict_logreg["top-1"].cpu().numpy() * 100.0,
        "top-5": results_dict_logreg.get("top-5", torch.tensor(0.0)).cpu().numpy() * 100.0,
        "best_C": results_dict_logreg["best_C"],
    }
    logger.info(
        "\n".join(
            [
                "Training of the supervised logistic regression on frozen features completed.\n"
                "Top-1 test accuracy: {acc:.1f}".format(acc=results_dict["top-1"]),
                "Top-5 test accuracy: {acc:.1f}".format(acc=results_dict["top-5"]),
                "obtained for C = {c:.6f}".format(c=results_dict["best_C"]),
            ]
        )
    )

    torch.distributed.barrier()
    return results_dict


def main(args):
    model, autocast_dtype = setup_and_build_model(args)
    eval_log_regression_with_model(
        model=model,
        train_dataset_str=args.train_dataset_str,
        val_dataset_str=args.val_dataset_str,
        finetune_dataset_str=args.finetune_dataset_str,
        autocast_dtype=autocast_dtype,
        finetune_on_val=args.finetune_on_val,
        metric_type=args.metric_type,
        train_dtype=as_torch_dtype(args.train_dtype),
        train_features_device=torch.device(args.train_features_device),
        max_train_iters=args.max_train_iters,
    )
    return 0


if __name__ == "__main__":
    description = "DINOv2 logistic regression evaluation"
    args_parser = get_args_parser(description=description)
    args = args_parser.parse_args()
    sys.exit(main(args))
```

## /dinov2/eval/metrics.py

```py path="/dinov2/eval/metrics.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the Apache License, Version 2.0
# found in the LICENSE file in the root directory of this source tree.

from enum import Enum
import logging
from typing import Any, Dict, Optional

import torch
from torch import Tensor
from torchmetrics import Metric, MetricCollection
from torchmetrics.classification import MulticlassAccuracy
from torchmetrics.utilities.data import dim_zero_cat, select_topk


logger = logging.getLogger("dinov2")


class MetricType(Enum):
    MEAN_ACCURACY = "mean_accuracy"
    MEAN_PER_CLASS_ACCURACY = "mean_per_class_accuracy"
    PER_CLASS_ACCURACY = "per_class_accuracy"
    IMAGENET_REAL_ACCURACY = "imagenet_real_accuracy"

    @property
    def accuracy_averaging(self):
        return getattr(AccuracyAveraging, self.name, None)

    def __str__(self):
        return self.value


class AccuracyAveraging(Enum):
    MEAN_ACCURACY = "micro"
    MEAN_PER_CLASS_ACCURACY = "macro"
    PER_CLASS_ACCURACY = "none"

    def __str__(self):
        return self.value


def build_metric(metric_type: MetricType, *, num_classes: int, ks: Optional[tuple] = None):
    if metric_type.accuracy_averaging is not None:
        return build_topk_accuracy_metric(
            average_type=metric_type.accuracy_averaging,
            num_classes=num_classes,
            ks=(1, 5) if ks is None else ks,
        )
    elif metric_type == MetricType.IMAGENET_REAL_ACCURACY:
        return build_topk_imagenet_real_accuracy_metric(
            num_classes=num_classes,
            ks=(1, 5) if ks is None else ks,
        )

    raise ValueError(f"Unknown metric type {metric_type}")


def build_topk_accuracy_metric(average_type: AccuracyAveraging, num_classes: int, ks: tuple = (1, 5)):
    metrics: Dict[str, Metric] = {
        f"top-{k}": MulticlassAccuracy(top_k=k, num_classes=int(num_classes), average=average_type.value) for k in ks
    }
    return MetricCollection(metrics)


def build_topk_imagenet_real_accuracy_metric(num_classes: int, ks: tuple = (1, 5)):
    metrics: Dict[str, Metric] = {f"top-{k}": ImageNetReaLAccuracy(top_k=k, num_classes=int(num_classes)) for k in ks}
    return MetricCollection(metrics)


class ImageNetReaLAccuracy(Metric):
    is_differentiable: bool = False
    higher_is_better: Optional[bool] = None
    full_state_update: bool = False

    def __init__(
        self,
        num_classes: int,
        top_k: int = 1,
        **kwargs: Any,
    ) -> None:
        super().__init__(**kwargs)
        self.num_classes = num_classes
        self.top_k = top_k
        self.add_state("tp", [], dist_reduce_fx="cat")

    def update(self, preds: Tensor, target: Tensor) -> None:  # type: ignore
        # preds [B, D]
        # target [B, A]
        # preds_oh [B, D] with 0 and 1
        # select top K highest probabilities, use one hot representation
        preds_oh = select_topk(preds, self.top_k)
        # target_oh [B, D + 1] with 0 and 1
        target_oh = torch.zeros((preds_oh.shape[0], preds_oh.shape[1] + 1), device=target.device, dtype=torch.int32)
        target = target.long()
        # for undefined targets (-1) use a fake value `num_classes`
        target[target == -1] = self.num_classes
        # fill targets, use one hot representation
        target_oh.scatter_(1, target, 1)
        # target_oh [B, D] (remove the fake target at index `num_classes`)
        target_oh = target_oh[:, :-1]
        # tp [B] with 0 and 1
        tp = (preds_oh * target_oh == 1).sum(dim=1)
        # at least one match between prediction and target
        tp.clip_(max=1)
        # ignore instances where no targets are defined
        mask = target_oh.sum(dim=1) > 0
        tp = tp[mask]
        self.tp.append(tp)  # type: ignore

    def compute(self) -> Tensor:
        tp = dim_zero_cat(self.tp)  # type: ignore
        return tp.float().mean()
```

## /dinov2/eval/segmentation/__init__.py

```py path="/dinov2/eval/segmentation/__init__.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the Apache License, Version 2.0
# found in the LICENSE file in the root directory of this source tree.
```

## /dinov2/eval/segmentation/hooks/__init__.py

```py path="/dinov2/eval/segmentation/hooks/__init__.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the Apache License, Version 2.0
# found in the LICENSE file in the root directory of this source tree.

from .optimizer import DistOptimizerHook
```

## /dinov2/eval/segmentation/hooks/optimizer.py

```py path="/dinov2/eval/segmentation/hooks/optimizer.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the Apache License, Version 2.0
# found in the LICENSE file in the root directory of this source tree.

try:
    import apex
except ImportError:
    print("apex is not installed")

from mmcv.runner import OptimizerHook, HOOKS


@HOOKS.register_module()
class DistOptimizerHook(OptimizerHook):
    """Optimizer hook for distributed training."""

    def __init__(self, update_interval=1, grad_clip=None, coalesce=True, bucket_size_mb=-1, use_fp16=False):
        self.grad_clip = grad_clip
        self.coalesce = coalesce
        self.bucket_size_mb = bucket_size_mb
        self.update_interval = update_interval
        self.use_fp16 = use_fp16

    def before_run(self, runner):
        runner.optimizer.zero_grad()

    def after_train_iter(self, runner):
        runner.outputs["loss"] /= self.update_interval
        if self.use_fp16:
            # runner.outputs['loss'].backward()
            with apex.amp.scale_loss(runner.outputs["loss"], runner.optimizer) as scaled_loss:
                scaled_loss.backward()
        else:
            runner.outputs["loss"].backward()
        if self.every_n_iters(runner, self.update_interval):
            if self.grad_clip is not None:
                self.clip_grads(runner.model.parameters())
            runner.optimizer.step()
            runner.optimizer.zero_grad()
```

## /dinov2/eval/segmentation/models/__init__.py

```py path="/dinov2/eval/segmentation/models/__init__.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the Apache License, Version 2.0
# found in the LICENSE file in the root directory of this source tree.

from .backbones import *  # noqa: F403
from .decode_heads import *  # noqa: F403
```

## /dinov2/eval/segmentation/models/backbones/__init__.py

```py path="/dinov2/eval/segmentation/models/backbones/__init__.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the Apache License, Version 2.0
# found in the LICENSE file in the root directory of this source tree.

from .vision_transformer import DinoVisionTransformer
```

## /dinov2/eval/segmentation/models/backbones/vision_transformer.py

```py path="/dinov2/eval/segmentation/models/backbones/vision_transformer.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the Apache License, Version 2.0
# found in the LICENSE file in the root directory of this source tree.

from mmcv.runner import BaseModule
from mmseg.models.builder import BACKBONES


@BACKBONES.register_module()
class DinoVisionTransformer(BaseModule):
    """Vision Transformer."""

    def __init__(
        self,
        *args,
        **kwargs,
    ):
        super().__init__()
```

## /dinov2/eval/segmentation/models/decode_heads/__init__.py

```py path="/dinov2/eval/segmentation/models/decode_heads/__init__.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the Apache License, Version 2.0
# found in the LICENSE file in the root directory of this source tree.

from .linear_head import BNHead
```

## /dinov2/eval/segmentation/models/decode_heads/linear_head.py

```py path="/dinov2/eval/segmentation/models/decode_heads/linear_head.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the Apache License, Version 2.0
# found in the LICENSE file in the root directory of this source tree.

import torch
import torch.nn as nn

from mmseg.models.builder import HEADS
from mmseg.models.decode_heads.decode_head import BaseDecodeHead
from mmseg.ops import resize


@HEADS.register_module()
class BNHead(BaseDecodeHead):
    """Just a batchnorm."""

    def __init__(self, resize_factors=None, **kwargs):
        super().__init__(**kwargs)
        assert self.in_channels == self.channels
        self.bn = nn.SyncBatchNorm(self.in_channels)
        self.resize_factors = resize_factors

    def _forward_feature(self, inputs):
        """Forward function for feature maps before classifying each pixel with
        ``self.cls_seg`` fc.

        Args:
            inputs (list[Tensor]): List of multi-level img features.

        Returns:
            feats (Tensor): A tensor of shape (batch_size, self.channels,
                H, W) which is feature map for last layer of decoder head.
        """
        # print("inputs", [i.shape for i in inputs])
        x = self._transform_inputs(inputs)
        # print("x", x.shape)
        feats = self.bn(x)
        # print("feats", feats.shape)
        return feats

    def _transform_inputs(self, inputs):
        """Transform inputs for decoder.
        Args:
            inputs (list[Tensor]): List of multi-level img features.
        Returns:
            Tensor: The transformed inputs
        """

        if self.input_transform == "resize_concat":
            # accept lists (for cls token)
            input_list = []
            for x in inputs:
                if isinstance(x, list):
                    input_list.extend(x)
                else:
                    input_list.append(x)
            inputs = input_list
            # an image descriptor can be a local descriptor with resolution 1x1
            for i, x in enumerate(inputs):
                if len(x.shape) == 2:
                    inputs[i] = x[:, :, None, None]
            # select indices
            inputs = [inputs[i] for i in self.in_index]
            # Resizing shenanigans
            # print("before", *(x.shape for x in inputs))
            if self.resize_factors is not None:
                assert len(self.resize_factors) == len(inputs), (len(self.resize_factors), len(inputs))
                inputs = [
                    resize(input=x, scale_factor=f, mode="bilinear" if f >= 1 else "area")
                    for x, f in zip(inputs, self.resize_factors)
                ]
                # print("after", *(x.shape for x in inputs))
            upsampled_inputs = [
                resize(input=x, size=inputs[0].shape[2:], mode="bilinear", align_corners=self.align_corners)
                for x in inputs
            ]
            inputs = torch.cat(upsampled_inputs, dim=1)
        elif self.input_transform == "multiple_select":
            inputs = [inputs[i] for i in self.in_index]
        else:
            inputs = inputs[self.in_index]

        return inputs

    def forward(self, inputs):
        """Forward function."""
        output = self._forward_feature(inputs)
        output = self.cls_seg(output)
        return output
```

## /dinov2/eval/segmentation/utils/__init__.py

```py path="/dinov2/eval/segmentation/utils/__init__.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the Apache License, Version 2.0
# found in the LICENSE file in the root directory of this source tree.
```

## /dinov2/eval/segmentation/utils/colormaps.py

```py path="/dinov2/eval/segmentation/utils/colormaps.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the Apache License, Version 2.0
# found in the LICENSE file in the root directory of this source tree.

ADE20K_COLORMAP = [
    (0, 0, 0),
    (120, 120, 120),
    (180, 120, 120),
    (6, 230, 230),
    (80, 50, 50),
    (4, 200, 3),
    (120, 120, 80),
    (140, 140, 140),
    (204, 5, 255),
    (230, 230, 230),
    (4, 250, 7),
    (224, 5, 255),
    (235, 255, 7),
    (150, 5, 61),
    (120, 120, 70),
    (8, 255, 51),
    (255, 6, 82),
    (143, 255, 140),
    (204, 255, 4),
    (255, 51, 7),
    (204, 70, 3),
    (0, 102, 200),
    (61, 230, 250),
    (255, 6, 51),
    (11, 102, 255),
    (255, 7, 71),
    (255, 9, 224),
    (9, 7, 230),
    (220, 220, 220),
    (255, 9, 92),
    (112, 9, 255),
    (8, 255, 214),
    (7, 255, 224),
    (255, 184, 6),
    (10, 255, 71),
    (255, 41, 10),
    (7, 255, 255),
    (224, 255, 8),
    (102, 8, 255),
    (255, 61, 6),
    (255, 194, 7),
    (255, 122, 8),
    (0, 255, 20),
    (255, 8, 41),
    (255, 5, 153),
    (6, 51, 255),
    (235, 12, 255),
    (160, 150, 20),
    (0, 163, 255),
    (140, 140, 140),
    (250, 10, 15),
    (20, 255, 0),
    (31, 255, 0),
    (255, 31, 0),
    (255, 224, 0),
    (153, 255, 0),
    (0, 0, 255),
    (255, 71, 0),
    (0, 235, 255),
    (0, 173, 255),
    (31, 0, 255),
    (11, 200, 200),
    (255, 82, 0),
    (0, 255, 245),
    (0, 61, 255),
    (0, 255, 112),
    (0, 255, 133),
    (255, 0, 0),
    (255, 163, 0),
    (255, 102, 0),
    (194, 255, 0),
    (0, 143, 255),
    (51, 255, 0),
    (0, 82, 255),
    (0, 255, 41),
    (0, 255, 173),
    (10, 0, 255),
    (173, 255, 0),
    (0, 255, 153),
    (255, 92, 0),
    (255, 0, 255),
    (255, 0, 245),
    (255, 0, 102),
    (255, 173, 0),
    (255, 0, 20),
    (255, 184, 184),
    (0, 31, 255),
    (0, 255, 61),
    (0, 71, 255),
    (255, 0, 204),
    (0, 255, 194),
    (0, 255, 82),
    (0, 10, 255),
    (0, 112, 255),
    (51, 0, 255),
    (0, 194, 255),
    (0, 122, 255),
    (0, 255, 163),
    (255, 153, 0),
    (0, 255, 10),
    (255, 112, 0),
    (143, 255, 0),
    (82, 0, 255),
    (163, 255, 0),
    (255, 235, 0),
    (8, 184, 170),
    (133, 0, 255),
    (0, 255, 92),
    (184, 0, 255),
    (255, 0, 31),
    (0, 184, 255),
    (0, 214, 255),
    (255, 0, 112),
    (92, 255, 0),
    (0, 224, 255),
    (112, 224, 255),
    (70, 184, 160),
    (163, 0, 255),
    (153, 0, 255),
    (71, 255, 0),
    (255, 0, 163),
    (255, 204, 0),
    (255, 0, 143),
    (0, 255, 235),
    (133, 255, 0),
    (255, 0, 235),
    (245, 0, 255),
    (255, 0, 122),
    (255, 245, 0),
    (10, 190, 212),
    (214, 255, 0),
    (0, 204, 255),
    (20, 0, 255),
    (255, 255, 0),
    (0, 153, 255),
    (0, 41, 255),
    (0, 255, 204),
    (41, 0, 255),
    (41, 255, 0),
    (173, 0, 255),
    (0, 245, 255),
    (71, 0, 255),
    (122, 0, 255),
    (0, 255, 184),
    (0, 92, 255),
    (184, 255, 0),
    (0, 133, 255),
    (255, 214, 0),
    (25, 194, 194),
    (102, 255, 0),
    (92, 0, 255),
]

ADE20K_CLASS_NAMES = [
    "",
    "wall",
    "building;edifice",
    "sky",
    "floor;flooring",
    "tree",
    "ceiling",
    "road;route",
    "bed",
    "windowpane;window",
    "grass",
    "cabinet",
    "sidewalk;pavement",
    "person;individual;someone;somebody;mortal;soul",
    "earth;ground",
    "door;double;door",
    "table",
    "mountain;mount",
    "plant;flora;plant;life",
    "curtain;drape;drapery;mantle;pall",
    "chair",
    "car;auto;automobile;machine;motorcar",
    "water",
    "painting;picture",
    "sofa;couch;lounge",
    "shelf",
    "house",
    "sea",
    "mirror",
    "rug;carpet;carpeting",
    "field",
    "armchair",
    "seat",
    "fence;fencing",
    "desk",
    "rock;stone",
    "wardrobe;closet;press",
    "lamp",
    "bathtub;bathing;tub;bath;tub",
    "railing;rail",
    "cushion",
    "base;pedestal;stand",
    "box",
    "column;pillar",
    "signboard;sign",
    "chest;of;drawers;chest;bureau;dresser",
    "counter",
    "sand",
    "sink",
    "skyscraper",
    "fireplace;hearth;open;fireplace",
    "refrigerator;icebox",
    "grandstand;covered;stand",
    "path",
    "stairs;steps",
    "runway",
    "case;display;case;showcase;vitrine",
    "pool;table;billiard;table;snooker;table",
    "pillow",
    "screen;door;screen",
    "stairway;staircase",
    "river",
    "bridge;span",
    "bookcase",
    "blind;screen",
    "coffee;table;cocktail;table",
    "toilet;can;commode;crapper;pot;potty;stool;throne",
    "flower",
    "book",
    "hill",
    "bench",
    "countertop",
    "stove;kitchen;stove;range;kitchen;range;cooking;stove",
    "palm;palm;tree",
    "kitchen;island",
    "computer;computing;machine;computing;device;data;processor;electronic;computer;information;processing;system",
    "swivel;chair",
    "boat",
    "bar",
    "arcade;machine",
    "hovel;hut;hutch;shack;shanty",
    "bus;autobus;coach;charabanc;double-decker;jitney;motorbus;motorcoach;omnibus;passenger;vehicle",
    "towel",
    "light;light;source",
    "truck;motortruck",
    "tower",
    "chandelier;pendant;pendent",
    "awning;sunshade;sunblind",
    "streetlight;street;lamp",
    "booth;cubicle;stall;kiosk",
    "television;television;receiver;television;set;tv;tv;set;idiot;box;boob;tube;telly;goggle;box",
    "airplane;aeroplane;plane",
    "dirt;track",
    "apparel;wearing;apparel;dress;clothes",
    "pole",
    "land;ground;soil",
    "bannister;banister;balustrade;balusters;handrail",
    "escalator;moving;staircase;moving;stairway",
    "ottoman;pouf;pouffe;puff;hassock",
    "bottle",
    "buffet;counter;sideboard",
    "poster;posting;placard;notice;bill;card",
    "stage",
    "van",
    "ship",
    "fountain",
    "conveyer;belt;conveyor;belt;conveyer;conveyor;transporter",
    "canopy",
    "washer;automatic;washer;washing;machine",
    "plaything;toy",
    "swimming;pool;swimming;bath;natatorium",
    "stool",
    "barrel;cask",
    "basket;handbasket",
    "waterfall;falls",
    "tent;collapsible;shelter",
    "bag",
    "minibike;motorbike",
    "cradle",
    "oven",
    "ball",
    "food;solid;food",
    "step;stair",
    "tank;storage;tank",
    "trade;name;brand;name;brand;marque",
    "microwave;microwave;oven",
    "pot;flowerpot",
    "animal;animate;being;beast;brute;creature;fauna",
    "bicycle;bike;wheel;cycle",
    "lake",
    "dishwasher;dish;washer;dishwashing;machine",
    "screen;silver;screen;projection;screen",
    "blanket;cover",
    "sculpture",
    "hood;exhaust;hood",
    "sconce",
    "vase",
    "traffic;light;traffic;signal;stoplight",
    "tray",
    "ashcan;trash;can;garbage;can;wastebin;ash;bin;ash-bin;ashbin;dustbin;trash;barrel;trash;bin",
    "fan",
    "pier;wharf;wharfage;dock",
    "crt;screen",
    "plate",
    "monitor;monitoring;device",
    "bulletin;board;notice;board",
    "shower",
    "radiator",
    "glass;drinking;glass",
    "clock",
    "flag",
]

VOC2012_COLORMAP = [
    (0, 0, 0),
    (128, 0, 0),
    (0, 128, 0),
    (128, 128, 0),
    (0, 0, 128),
    (128, 0, 128),
    (0, 128, 128),
    (128, 128, 128),
    (64, 0, 0),
    (192, 0, 0),
    (64, 128, 0),
    (192, 128, 0),
    (64, 0, 128),
    (192, 0, 128),
    (64, 128, 128),
    (192, 128, 128),
    (0, 64, 0),
    (128, 64, 0),
    (0, 192, 0),
    (128, 192, 0),
    (0, 64, 128),
]

VOC2012_CLASS_NAMES = [
    "",
    "aeroplane",
    "bicycle",
    "bird",
    "boat",
    "bottle",
    "bus",
    "car",
    "cat",
    "chair",
    "cow",
    "diningtable",
    "dog",
    "horse",
    "motorbike",
    "person",
    "pottedplant",
    "sheep",
    "sofa",
    "train",
    "tvmonitor",
]
```

## /dinov2/eval/segmentation_m2f/__init__.py

```py path="/dinov2/eval/segmentation_m2f/__init__.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the Apache License, Version 2.0
# found in the LICENSE file in the root directory of this source tree.

from .core import *  # noqa: F403
from .models import *  # noqa: F403
from .ops import *  # noqa: F403
```

## /dinov2/eval/segmentation_m2f/core/__init__.py

```py path="/dinov2/eval/segmentation_m2f/core/__init__.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the Apache License, Version 2.0
# found in the LICENSE file in the root directory of this source tree.

from mmseg.core.evaluation import *  # noqa: F403
from mmseg.core.seg import *  # noqa: F403

from .anchor import *  # noqa: F403
from .box import *  # noqa: F403
from .utils import *  # noqa: F403
```

## /dinov2/eval/segmentation_m2f/core/anchor/__init__.py

```py path="/dinov2/eval/segmentation_m2f/core/anchor/__init__.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the Apache License, Version 2.0
# found in the LICENSE file in the root directory of this source tree.

from .point_generator import MlvlPointGenerator  # noqa: F403
```

## /dinov2/eval/segmentation_m2f/core/anchor/builder.py

```py path="/dinov2/eval/segmentation_m2f/core/anchor/builder.py"
# Copyright (c) Meta Platforms, Inc. and affiliates.
#
# This source code is licensed under the Apache License, Version 2.0
# found in the LICENSE file in the root directory of this source tree.

import warnings

from mmcv.utils import Registry, build_from_cfg

PRIOR_GENERATORS = Registry("Generator for anchors and points")

ANCHOR_GENERATORS = PRIOR_GENERATORS


def build_prior_generator(cfg, default_args=None):
    return build_from_cfg(cfg, PRIOR_GENERATORS, default_args)


def build_anchor_generator(cfg, default_args=None):
    warnings.warn("``build_anchor_generator`` would be deprecated soon, please use " "``build_prior_generator`` ")
    return build_prior_generator(cfg, default_args=default_args)
```
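
The `builder.py` module above relies on mmcv's registry pattern: a class is registered under its name, then instantiated from a config dict whose `type` key selects it. The sketch below imitates that pattern in plain Python so it runs without mmcv. `MiniRegistry`, `build_from_cfg_sketch`, and `ToyPointGenerator` are hypothetical stand-ins written for illustration; they are not mmcv's implementation and only approximate its behaviour.

```python
import warnings


class MiniRegistry:
    """Hypothetical stand-in for a registry: maps a class name to the class."""

    def __init__(self, name):
        self.name = name
        self._modules = {}

    def register_module(self):
        # Used as a decorator: stores the class under its own __name__.
        def decorator(cls):
            self._modules[cls.__name__] = cls
            return cls

        return decorator

    def get(self, key):
        return self._modules[key]


def build_from_cfg_sketch(cfg, registry, default_args=None):
    """Instantiate the class named by cfg["type"]; other keys become kwargs."""
    kwargs = dict(default_args or {})
    kwargs.update(cfg)  # assumption: values in cfg take priority over defaults
    cls = registry.get(kwargs.pop("type"))
    return cls(**kwargs)


PRIOR_GENERATORS = MiniRegistry("Generator for anchors and points")


@PRIOR_GENERATORS.register_module()
class ToyPointGenerator:
    """Illustrative class only; not the MlvlPointGenerator from the repository."""

    def __init__(self, strides, offset=0.5):
        self.strides = strides
        self.offset = offset


def build_prior_generator(cfg, default_args=None):
    return build_from_cfg_sketch(cfg, PRIOR_GENERATORS, default_args)


def build_anchor_generator(cfg, default_args=None):
    # Legacy alias: warn, then delegate to the new builder.
    warnings.warn("build_anchor_generator is a legacy alias; use build_prior_generator")
    return build_prior_generator(cfg, default_args=default_args)


generator = build_prior_generator({"type": "ToyPointGenerator", "strides": [8, 16, 32]})
```

Keeping the old function as a thin wrapper that warns and delegates lets existing configs keep working while callers migrate to the new name.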