```
├── .dockerignore
├── .github/
│   ├── CONTRIBUTING.md
│   ├── ISSUE_TEMPLATE/
│   │   ├── bug_report.yml
│   │   ├── documentation.yml
│   │   └── feature_request.yml
│   ├── actions/
│   │   └── setup-venv/
│   │       └── action.yml
│   ├── dependabot.yml
│   ├── pull_request_template.md
│   └── workflows/
│       ├── main.yml
│       └── pr_checks.yml
├── .gitignore
├── .readthedocs.yaml
├── CHANGELOG.md
├── LICENSE
├── Makefile
├── README.md
├── RELEASE_PROCESS.md
├── docs/
│   ├── .gitignore
│   ├── Makefile
│   ├── make.bat
│   └── source/
│       ├── CHANGELOG.md
│       ├── CONTRIBUTING.md
│       ├── _static/
│       │   ├── css/
│       │   │   └── custom.css
│       │   └── favicon.ico
│       ├── conf.py
│       ├── index.md
│       ├── installation.md
│       └── overview.md
├── gantry-requirements.txt
├── olmocr/
│   ├── __init__.py
│   ├── bench/
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── benchmark.py
│   │   ├── checker/
│   │   │   └── check_old_scans_math.py
│   │   ├── convert.py
│   │   └── katex/
│   │       ├── __init__.py
│   │       ├── auto-render.min.js
│   │       ├── katex.min.css
│   │       └── katex.min.js
```
## /.dockerignore
```dockerignore path="/.dockerignore"
.git
.github
.mypy_cache
.pytest_cache
.venv
__pycache__
*.egg-info
```
## /.github/CONTRIBUTING.md
# Contributing
Thanks for considering contributing! Please read this document to learn the various ways you can contribute to this project and how to go about doing it.
## Bug reports and feature requests
### Did you find a bug?
First, do [a quick search](https://github.com/allenai/olmocr/issues) to see whether your issue has already been reported.
If your issue has already been reported, please comment on the existing issue.
Otherwise, open [a new GitHub issue](https://github.com/allenai/olmocr/issues). Be sure to include a clear title
and description. The description should include as much relevant information as possible. The description should
explain how to reproduce the erroneous behavior as well as the behavior you expect to see. Ideally you would include a
code sample or an executable test case demonstrating the expected behavior.
### Do you have a suggestion for an enhancement or new feature?
We use GitHub issues to track feature requests. Before you create a feature request:
* Make sure you have a clear idea of the enhancement you would like. If you have a vague idea, consider discussing
it first on a GitHub issue.
* Check the documentation to make sure your feature does not already exist.
* Do [a quick search](https://github.com/allenai/olmocr/issues) to see whether your feature has already been suggested.
When creating your request, please:
* Provide a clear title and description.
* Explain why the enhancement would be useful. It may be helpful to highlight the feature in other libraries.
* Include code examples to demonstrate how the enhancement would be used.
## Making a pull request
When you're ready to contribute code to address an open issue, please follow these guidelines to help us be able to review your pull request (PR) quickly.
1. **Initial setup** (only do this once)
If you haven't already done so, please [fork](https://help.github.com/en/enterprise/2.13/user/articles/fork-a-repo) this repository on GitHub.

Then clone your fork locally with

```bash
git clone https://github.com/USERNAME/olmocr.git
```

or

```bash
git clone git@github.com:USERNAME/olmocr.git
```

At this point the local clone of your fork only knows that it came from *your* repo, github.com/USERNAME/olmocr.git, but doesn't know anything about the *main* repo, [https://github.com/allenai/olmocr.git](https://github.com/allenai/olmocr). You can see this by running

```bash
git remote -v
```

which will output something like this:

```
origin https://github.com/USERNAME/olmocr.git (fetch)
origin https://github.com/USERNAME/olmocr.git (push)
```
This means that your local clone can only track changes from your fork, but not from the main repo, and so you won't be able to keep your fork up-to-date with the main repo over time. Therefore you'll need to add another "remote" to your clone that points to [https://github.com/allenai/olmocr.git](https://github.com/allenai/olmocr). To do this, run the following:

```bash
git remote add upstream https://github.com/allenai/olmocr.git
```

Now if you do `git remote -v` again, you'll see

```
origin https://github.com/USERNAME/olmocr.git (fetch)
origin https://github.com/USERNAME/olmocr.git (push)
upstream https://github.com/allenai/olmocr.git (fetch)
upstream https://github.com/allenai/olmocr.git (push)
```
Finally, you'll need to create a Python 3 virtual environment suitable for working on this project. There are a number of tools out there that make working with virtual environments easier.
The most direct way is with the [`venv` module](https://docs.python.org/3.7/library/venv.html) in the standard library, but if you're new to Python or you don't already have a recent Python 3 version installed on your machine,
we recommend [Miniconda](https://docs.conda.io/en/latest/miniconda.html).
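For illustration, the stdlib `venv` route mentioned above might look like this (a minimal sketch; `.venv` is just a conventional directory name):

```shell
# Create a virtual environment in ./.venv using the stdlib venv module,
# then activate it and confirm which interpreter is in use.
python3 -m venv .venv
. .venv/bin/activate
python -c "import sys; print(sys.prefix)"
```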
On Mac, for example, you can install Miniconda with [Homebrew](https://brew.sh/):

```bash
brew install miniconda
```

Then you can create and activate a new Python environment by running:

```bash
conda create -n olmocr python=3.9
conda activate olmocr
```

Once your virtual environment is activated, you can install your local clone in "editable mode" with

```bash
pip install -U pip setuptools wheel
pip install -e .[dev]
```

The "editable mode" comes from the `-e` argument to `pip`, and essentially just creates a symbolic link from the site-packages directory of your virtual environment to the source code in your local clone. That way any changes you make will be immediately reflected in your virtual environment.
2. **Ensure your fork is up-to-date**

Once you've added an "upstream" remote pointing to [https://github.com/allenai/olmocr.git](https://github.com/allenai/olmocr), keeping your fork up-to-date is easy:

```bash
git checkout main  # if not already on main
git pull --rebase upstream main
git push
```
3. **Create a new branch to work on your fix or enhancement**

Committing directly to the main branch of your fork is not recommended. It will be easier to keep your fork clean if you work on a separate branch for each contribution you intend to make.

You can create a new branch with

```bash
# replace BRANCH with whatever name you want to give it
git checkout -b BRANCH
git push -u origin BRANCH
```
4. **Test your changes**

Our continuous integration (CI) testing runs [a number of checks](https://github.com/allenai/olmocr/actions) for each pull request on [GitHub Actions](https://github.com/features/actions). You can run most of these tests locally, which is something you should do *before* opening a PR to help speed up the review process and make it easier for us.

First, you should run [`isort`](https://github.com/PyCQA/isort) and [`black`](https://github.com/psf/black) to make sure your code is formatted consistently.
Many IDEs support code formatters as plugins, so you may be able to set up isort and black to run automatically every time you save.
For example, [`black.vim`](https://github.com/psf/black/tree/master/plugin) will give you this functionality in Vim. But both `isort` and `black` are also easy to run directly from the command line.
Just run this from the root of your clone:

```bash
isort .
black .
```
Our CI also uses [`ruff`](https://github.com/astral-sh/ruff) to lint the code base and [`mypy`](http://mypy-lang.org/) for type-checking. You should run both of these next with

```bash
ruff check .
```

and

```bash
mypy .
```

We also strive to maintain high test coverage, so most contributions should include additions to [the unit tests](https://github.com/allenai/olmocr/tree/main/tests). These tests are run with [`pytest`](https://docs.pytest.org/en/latest/), which you can use to locally run any test modules that you've added or changed.

For example, if you've fixed a bug in `olmocr/a/b.py`, you can run the tests specific to that module with

```bash
pytest -v tests/a/b_test.py
```
If your contribution involves additions to any public part of the API, we require that you write docstrings
for each function, method, class, or module that you add.
See the [Writing docstrings](#writing-docstrings) section below for details on the syntax.
You should test to make sure the API documentation can build without errors by running

```bash
make docs
```

If the build fails, it's most likely due to small formatting issues. If the error message isn't clear, feel free to comment on this in your pull request.

And finally, please update the [CHANGELOG](https://github.com/allenai/olmocr/blob/main/CHANGELOG.md) with notes on your contribution in the "Unreleased" section at the top.

After all of the above checks have passed, you can now open [a new GitHub pull request](https://github.com/allenai/olmocr/pulls).
Make sure you have a clear description of the problem and the solution, and include a link to relevant issues.
We look forward to reviewing your PR!
### Writing docstrings
We use [Sphinx](https://www.sphinx-doc.org/en/master/index.html) to build our API docs, which automatically parses all docstrings
of public classes and methods using the [autodoc](https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html) extension.
Please refer to autodoc's documentation to learn about the docstring syntax.
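As a sketch of the kind of docstring autodoc can parse, using reStructuredText field lists (the function below is hypothetical, not part of olmocr's API):

```python
def count_tokens(text: str) -> int:
    """Count whitespace-separated tokens in ``text``.

    :param text: The string to split into tokens.
    :returns: The number of tokens found.
    """
    return len(text.split())
```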
## /.github/ISSUE_TEMPLATE/bug_report.yml
```yml path="/.github/ISSUE_TEMPLATE/bug_report.yml"
name: 🐛 Bug Report
description: Create a report to help us reproduce and fix the bug
labels: 'bug'
body:
  - type: markdown
    attributes:
      value: >
        #### Before submitting a bug, please make sure the issue hasn't been already addressed by searching through [the existing and past issues](https://github.com/allenai/olmocr/issues?q=is%3Aissue+sort%3Acreated-desc+).
  - type: textarea
    attributes:
      label: 🐛 Describe the bug
      description: |
        Please provide a clear and concise description of what the bug is.

        If relevant, add a minimal example so that we can reproduce the error by running the code. It is very important for the snippet to be as succinct (minimal) as possible, so please take time to trim down any irrelevant code to help us debug efficiently. We are going to copy-paste your code and we expect to get the same result as you did: avoid any external data, and include the relevant imports, etc. For example:

        \`\`\`python
        # All necessary imports at the beginning
        import olmocr

        # A succinct reproducing example trimmed down to the essential parts:
        assert False is True, "Oh no!"
        \`\`\`

        If the code is too long (hopefully, it isn't), feel free to put it in a public gist and link it in the issue: https://gist.github.com.

        Please also paste or describe the results you observe instead of the expected results. If you observe an error, please paste the error message including the **full** traceback of the exception. It may be relevant to wrap error messages in triple-backtick code blocks.
      placeholder: |
        A clear and concise description of what the bug is.
    validations:
      required: true
  - type: textarea
    attributes:
      label: Versions
      description: |
        Please run the following and paste the output below.
        \`\`\`sh
        python --version && pip freeze
        \`\`\`
    validations:
      required: true
  - type: markdown
    attributes:
      value: >
        Thanks for contributing 🎉!
```
## /.github/ISSUE_TEMPLATE/documentation.yml
```yml path="/.github/ISSUE_TEMPLATE/documentation.yml"
name: 📚 Documentation
description: Report an issue related to https://olmocr.readthedocs.io/latest
labels: 'documentation'
body:
  - type: textarea
    attributes:
      label: 📚 The doc issue
      description: >
        A clear and concise description of what content in https://olmocr.readthedocs.io/latest is an issue.
    validations:
      required: true
  - type: textarea
    attributes:
      label: Suggest a potential alternative/fix
      description: >
        Tell us how we could improve the documentation in this regard.
  - type: markdown
    attributes:
      value: >
        Thanks for contributing 🎉!
```
## /.github/ISSUE_TEMPLATE/feature_request.yml
```yml path="/.github/ISSUE_TEMPLATE/feature_request.yml"
name: 🚀 Feature request
description: Submit a proposal/request for a new feature
labels: 'feature request'
body:
  - type: textarea
    attributes:
      label: 🚀 The feature, motivation and pitch
      description: >
        A clear and concise description of the feature proposal. Please outline the motivation for the proposal. Is your feature request related to a specific problem? e.g., *"I'm working on X and would like Y to be possible"*. If this is related to another GitHub issue, please link here too.
    validations:
      required: true
  - type: textarea
    attributes:
      label: Alternatives
      description: >
        A description of any alternative solutions or features you've considered, if any.
  - type: textarea
    attributes:
      label: Additional context
      description: >
        Add any other context or screenshots about the feature request.
  - type: markdown
    attributes:
      value: >
        Thanks for contributing 🎉!
```
## /.github/actions/setup-venv/action.yml
```yml path="/.github/actions/setup-venv/action.yml"
name: Python virtualenv
description: Set up a Python virtual environment with caching
inputs:
  python-version:
    description: The Python version to use
    required: true
  cache-prefix:
    description: Update this to invalidate the cache
    required: true
    default: v0
runs:
  using: composite
  steps:
    - name: Setup Python
      uses: actions/setup-python@v4
      with:
        python-version: ${{ inputs.python-version }}
    - shell: bash
      run: |
        # Install prerequisites.
        pip install --upgrade pip setuptools wheel virtualenv
    - shell: bash
      run: |
        # Get the exact Python version to use in the cache key.
        echo "PYTHON_VERSION=$(python --version)" >> $GITHUB_ENV
    - uses: actions/cache@v3
      id: virtualenv-cache
      with:
        path: .venv
        key: ${{ inputs.cache-prefix }}-${{ runner.os }}-${{ env.PYTHON_VERSION }}-${{ hashFiles('pyproject.toml') }}
    - if: steps.virtualenv-cache.outputs.cache-hit != 'true'
      shell: bash
      run: |
        # Set up virtual environment without cache hit.
        test -d .venv || virtualenv -p $(which python) --copies --reset-app-data .venv
        . .venv/bin/activate
        pip install -e .[dev]
        pip install -e .[bench]
    - if: steps.virtualenv-cache.outputs.cache-hit == 'true'
      shell: bash
      run: |
        # Set up virtual environment from cache hit.
        . .venv/bin/activate
        pip install --no-deps -e .[dev]
        pip install --no-deps -e .[bench]
    - shell: bash
      run: |
        # Show environment info.
        . .venv/bin/activate
        echo "✓ Installed $(python --version) virtual environment to $(which python)"
        echo "Packages:"
        pip freeze
```
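A workflow job could consume this composite action roughly as follows (a hypothetical sketch; the job and step names are illustrative, not taken from this repo):

```yml
jobs:
  example:
    runs-on: ubuntu-latest
    steps:
      # The composite action lives at .github/actions/setup-venv,
      # so it can be referenced by local path after checkout.
      - uses: actions/checkout@v3
      - name: Setup Python environment
        uses: ./.github/actions/setup-venv
        with:
          python-version: "3.11"
          cache-prefix: v0
```

Bumping the `cache-prefix` input invalidates the cached `.venv` directory.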
## /.github/dependabot.yml
```yml path="/.github/dependabot.yml"
version: 2
updates:
  - package-ecosystem: "pip"
    directory: "/"
    schedule:
      interval: "daily"
    open-pull-requests-limit: 10
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "daily"
```
## /.github/pull_request_template.md
Fixes #
Changes proposed in this pull request:
-
## Before submitting
- [ ] I've read and followed all steps in the [Making a pull request](https://github.com/allenai/olmocr/blob/main/.github/CONTRIBUTING.md#making-a-pull-request)
section of the `CONTRIBUTING` docs.
- [ ] I've updated or added any relevant docstrings following the syntax described in the
[Writing docstrings](https://github.com/allenai/olmocr/blob/main/.github/CONTRIBUTING.md#writing-docstrings) section of the `CONTRIBUTING` docs.
- [ ] If this PR fixes a bug, I've added a test that will fail without my fix.
- [ ] If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.
## /.github/workflows/main.yml
```yml path="/.github/workflows/main.yml"
name: Main
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

on:
  pull_request:
    branches:
      - main
  push:
    branches:
      - main
    tags:
      - "v*.*.*"

env:
  # Change this to invalidate existing cache.
  CACHE_PREFIX: v0
  PYTHONPATH: ./

jobs:
  checks:
    name: Python ${{ matrix.python }} - ${{ matrix.task.name }}
    runs-on: [ubuntu-latest]
    timeout-minutes: 15
    strategy:
      fail-fast: false
      matrix:
        python: ["3.11"]
        task:
          - name: Test
            run: |
              playwright install chromium
              pytest -v --color=yes -m "not nonci" tests/
              pytest -v --color=yes -m "not nonci" olmocr/bench/katex/render.py
        include:
          - python: "3.11"
            task:
              name: Lint
              run: ruff check .
          # Removing mypy for now, as it isn't handling async things correctly and crashing
          # - python: "3.11"
          #   task:
          #     name: Type check
          #     run: mypy .
          - python: "3.11"
            task:
              name: Build
              run: |
                python -m build
          - python: "3.11"
            task:
              name: Style
              run: |
                isort --check .
                black --check .
          - python: "3.11"
            task:
              name: Docs
              run: cd docs && make html
    steps:
      - uses: actions/checkout@v3
      - name: Install system dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y --no-install-recommends poppler-utils
      - name: Setup Python environment
        uses: ./.github/actions/setup-venv
        with:
          python-version: ${{ matrix.python }}
          cache-prefix: ${{ env.CACHE_PREFIX }}
      - name: Restore mypy cache
        if: matrix.task.name == 'Type check'
        uses: actions/cache@v3
        with:
          path: .mypy_cache
          key: mypy-${{ env.CACHE_PREFIX }}-${{ runner.os }}-${{ matrix.python }}-${{ hashFiles('*requirements.txt') }}-${{ github.ref }}-${{ github.sha }}
          restore-keys: |
            mypy-${{ env.CACHE_PREFIX }}-${{ runner.os }}-${{ matrix.python }}-${{ hashFiles('*requirements.txt') }}-${{ github.ref }}
            mypy-${{ env.CACHE_PREFIX }}-${{ runner.os }}-${{ matrix.python }}-${{ hashFiles('*requirements.txt') }}
      - name: ${{ matrix.task.name }}
        run: |
          . .venv/bin/activate
          ${{ matrix.task.run }}
      - name: Upload package distribution files
        if: matrix.task.name == 'Build'
        uses: actions/upload-artifact@v4
        with:
          name: package
          path: dist
      - name: Clean up
        if: always()
        run: |
          . .venv/bin/activate
          pip uninstall -y olmocr

  gpu_checks:
    name: GPU CI
    runs-on: ubuntu-latest
    timeout-minutes: 15
    needs: [checks]
    env:
      BEAKER_TOKEN: ${{ secrets.BEAKER_TOKEN }}
      BEAKER_IMAGE: jakep/olmocr-gpu-ci
      BEAKER_BUDGET: ai2/oe-data
      BEAKER_WORKSPACE: ai2/olmocr
    steps:
      - name: Determine current commit SHA (pull request)
        if: github.event_name == 'pull_request'
        run: |
          echo "COMMIT_SHA=${{ github.event.pull_request.head.sha }}" >> $GITHUB_ENV
      - name: Determine current commit SHA (push)
        if: github.event_name != 'pull_request'
        run: |
          echo "COMMIT_SHA=$GITHUB_SHA" >> $GITHUB_ENV
      - name: GPU Tests
        uses: allenai/beaker-run-action@v1.2
        if: env.BEAKER_TOKEN != ''
        with:
          spec: |
            version: v2
            description: GPU Tests
            budget: ${{ env.BEAKER_BUDGET }}
            tasks:
              - name: tests
                image:
                  beaker: ${{ env.BEAKER_IMAGE }}
                context:
                  priority: normal
                  preemptible: true
                resources:
                  gpuCount: 1
                constraints:
                  cluster:
                    - ai2/jupiter-cirrascale-2
                    - ai2/neptune-cirrascale
                    - ai2/saturn-cirrascale
                    - ai2/ceres-cirrascale
                envVars:
                  - name: GIT_REVISION
                    value: ${{ env.COMMIT_SHA }}
                entrypoint: ["/bin/bash"]
                command: ["./gpu-ci-script.sh"]
                result:
                  path: /unused
          token: ${{ env.BEAKER_TOKEN }}
          workspace: ${{ env.BEAKER_WORKSPACE }}

  release:
    name: Release
    runs-on: ubuntu-latest
    needs: [checks, gpu_checks]
    if: startsWith(github.ref, 'refs/tags/')
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.11"
      - name: Install requirements
        run: |
          pip install --upgrade pip setuptools wheel build
          pip install -e .[dev]
          pip install -e .[bench]
      - name: Prepare environment
        run: |
          echo "RELEASE_VERSION=${GITHUB_REF#refs/tags/v}" >> $GITHUB_ENV
          echo "TAG=${GITHUB_REF#refs/tags/}" >> $GITHUB_ENV
      - name: Download package distribution files
        uses: actions/download-artifact@v4
        with:
          name: package
          path: dist
      - name: Generate release notes
        run: |
          python scripts/release_notes.py > ${{ github.workspace }}-RELEASE_NOTES.md
      - name: Publish package to PyPI
        run: |
          twine upload -u '${{ secrets.PYPI_USERNAME }}' -p '${{ secrets.PYPI_PASSWORD }}' dist/*
      - name: Publish GitHub release
        uses: softprops/action-gh-release@v1
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          body_path: ${{ github.workspace }}-RELEASE_NOTES.md
          prerelease: ${{ contains(env.TAG, 'rc') }}
          files: |
            dist/*
```
## /.github/workflows/pr_checks.yml
```yml path="/.github/workflows/pr_checks.yml"
name: PR Checks
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

on:
  pull_request:
    branches:
      - main
    paths:
      - 'olmocr/**'

jobs:
  changelog:
    name: CHANGELOG
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0
      - name: Check that CHANGELOG has been updated
        run: |
          # If this step fails, this means you haven't updated the CHANGELOG.md
          # file with notes on your contribution.
          git diff --name-only $(git merge-base origin/main HEAD) | grep '^CHANGELOG.md$' && echo "Thanks for helping keep our CHANGELOG up-to-date!"
```
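The check's pass/fail behavior above comes from `grep`'s exit status: it exits 0 only when the pattern matches, so the step fails unless `CHANGELOG.md` appears among the changed files. A minimal sketch of the same logic, with a hypothetical file list standing in for the `git diff` output:

```shell
# grep -q exits 0 when some changed file is exactly CHANGELOG.md,
# mirroring how the CI step decides pass or fail.
changed_files="olmocr/pipeline.py
CHANGELOG.md"
if echo "$changed_files" | grep -q '^CHANGELOG.md$'; then
  echo "CHANGELOG updated"
else
  echo "CHANGELOG missing"
fi
```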
## /.gitignore
```gitignore path="/.gitignore"
# ml stuff
wandb/
*histogram.png
*.json
dolma_previews/*
s2_previews/*
gnarly_previews/*
s2orc_previews/*
s2orc_previews_3200/*
sample200_vllm/*
sample200_sglang/*
pdelfin_testset/*
localworkspace/*
math_data/*
math_data_big/*
gpt4otestset/*
gpt4otestset_output/*
pdfs/*
olmOCR-bench/*
table_data*/
/synth*/
dolma_samples/*
/*.html
scoreelo.csv
debug.log
birrpipeline-debug.log
beakerpipeline-debug.log
olmocr-pipeline-debug.log
# build artifacts
.eggs/
.mypy_cache
*.egg-info/
build/
dist/
pip-wheel-metadata/
# dev tools
.envrc
.python-version
.idea
.venv/
.vscode/
/*.iml
pyrightconfig.json
# jupyter notebooks
.ipynb_checkpoints
# miscellaneous
.cache/
doc/_build/
*.swp
.DS_Store
# python
*.pyc
*.pyo
__pycache__
# testing and continuous integration
.coverage
.pytest_cache/
.benchmarks
# documentation build artifacts
docs/build
site/
```
## /.readthedocs.yaml
```yaml path="/.readthedocs.yaml"
version: 2
sphinx:
  configuration: docs/source/conf.py
  fail_on_warning: true

python:
  version: "3.8"
  install:
    - method: pip
      path: .
      extra_requirements:
        - dev
```
## /CHANGELOG.md
# Changelog
All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## Unreleased
## [v0.1.60](https://github.com/allenai/olmocr/releases/tag/v0.1.60) - 2025-03-17
## [v0.1.58](https://github.com/allenai/olmocr/releases/tag/v0.1.58) - 2025-02-15
## [v0.1.53](https://github.com/allenai/olmocr/releases/tag/v0.1.53) - 2025-02-14
- Fixed git checks
- Added gemini and claude runners and a viewer.
## /LICENSE
``` path="/LICENSE"
Apache License
Version 2.0, January 2004
https://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "{}"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright {yyyy} {name of copyright owner}
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```
## /Makefile
``` path="/Makefile"
.PHONY : docs
docs :
	rm -rf docs/build/
	sphinx-autobuild -b html --watch olmocr/ docs/source/ docs/build/

.PHONY : run-checks
run-checks :
	isort --check .
	black --check .
	ruff check .
	mypy .
	CUDA_VISIBLE_DEVICES='' pytest -v --color=yes --doctest-modules tests/ olmocr/

.PHONY : build
build :
	rm -rf *.egg-info/
	python -m build
```
## /README.md
# olmOCR
A toolkit for training language models to work with PDF documents in the wild.
Try the online demo: [https://olmocr.allenai.org/](https://olmocr.allenai.org/)
What is included:
- A prompting strategy to get really good natural text parsing using GPT-4o - [buildsilver.py](https://github.com/allenai/olmocr/blob/main/olmocr/data/buildsilver.py)
- A side-by-side eval toolkit for comparing different pipeline versions - [runeval.py](https://github.com/allenai/olmocr/blob/main/olmocr/eval/runeval.py)
- Basic filtering by language and SEO spam removal - [filter.py](https://github.com/allenai/olmocr/blob/main/olmocr/filter/filter.py)
- Finetuning code for Qwen2-VL and Molmo-O - [train.py](https://github.com/allenai/olmocr/blob/main/olmocr/train/train.py)
- Processing millions of PDFs through a finetuned model using Sglang - [pipeline.py](https://github.com/allenai/olmocr/blob/main/olmocr/pipeline.py)
- Viewing [Dolma docs](https://github.com/allenai/dolma) created from PDFs - [dolmaviewer.py](https://github.com/allenai/olmocr/blob/main/olmocr/viewer/dolmaviewer.py)
### Installation
Requirements:
- Recent NVIDIA GPU (tested on RTX 4090, L40S, A100, H100) with at least 20 GB of GPU RAM
- 30GB of free disk space
You will need to install poppler-utils and additional fonts for rendering PDF images.
Install dependencies (Ubuntu/Debian)
```bash
sudo apt-get update
sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
```
Set up a conda environment and install olmocr
```bash
conda create -n olmocr python=3.11
conda activate olmocr
git clone https://github.com/allenai/olmocr.git
cd olmocr
# For CPU-only operations, ex. running benchmarks
pip install -e .
# For actually converting the files with your own GPU
pip install -e .[gpu] --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
```
### Local Usage Example
For quick testing, try the [web demo](https://olmocr.allen.ai/). To run locally, a GPU is required, as inference is powered by [sglang](https://github.com/sgl-project/sglang) under the hood.
Convert a Single PDF:
```bash
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
```
Convert an Image file:
```bash
python -m olmocr.pipeline ./localworkspace --pdfs random_page.png
```
Convert Multiple PDFs:
```bash
python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf
```
Results will be stored as JSON in `./localworkspace`.
#### Viewing Results
Extracted text is stored as [Dolma](https://github.com/allenai/dolma)-style JSONL inside of the `./localworkspace/results` directory.
```bash
cat localworkspace/results/output_*.jsonl
```
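Each line of these files is one document in the [Dolma](https://github.com/allenai/dolma) format. As an illustrative sketch (not part of the olmOCR toolkit), you can pull out the extracted text with the standard library; the `id` and `text` field names follow the Dolma convention, so check your actual output files for the exact schema:

```python
import json

def iter_dolma_texts(jsonl_path):
    """Yield (id, text) for each document in a Dolma-style JSONL file."""
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            doc = json.loads(line)
            # Dolma documents carry the extracted page text in the "text" field
            yield doc.get("id"), doc.get("text", "")
```

For example, `for doc_id, text in iter_dolma_texts("localworkspace/results/output_0.jsonl"): print(doc_id, text[:80])` prints a preview of each converted document.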
View results side-by-side with the original PDFs (uses `dolmaviewer` command):
```bash
python -m olmocr.viewer.dolmaviewer localworkspace/results/output_*.jsonl
```
Now open `./dolma_previews/tests_gnarly_pdfs_horribleocr_pdf.html` in your favorite browser.

### Multi-node / Cluster Usage
If you want to convert millions of PDFs using multiple nodes running in parallel, olmOCR supports
reading your PDFs from AWS S3 and coordinating work using an AWS S3 output bucket.
For example, you can start this command on your first worker node, and it will set up
a simple work queue in your AWS bucket and start converting PDFs.
```bash
python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace --pdfs s3://my_s3_bucket/jakep/gnarly_pdfs/*.pdf
```
Now on any subsequent nodes, just run this and they will start grabbing items from the same workspace queue.
```bash
python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace
```
If you are at Ai2 and want to linearize millions of PDFs efficiently using [beaker](https://www.beaker.org), just add the `--beaker`
flag. This will prepare the workspace on your local machine, and then launch N GPU workers in the cluster to start
converting PDFs.
For example:
```bash
python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace --pdfs s3://my_s3_bucket/jakep/gnarly_pdfs/*.pdf --beaker --beaker_gpus 4
```
### Full documentation for the pipeline
```bash
python -m olmocr.pipeline --help
usage: pipeline.py [-h] [--pdfs PDFS] [--workspace_profile WORKSPACE_PROFILE] [--pdf_profile PDF_PROFILE] [--pages_per_group PAGES_PER_GROUP]
[--max_page_retries MAX_PAGE_RETRIES] [--max_page_error_rate MAX_PAGE_ERROR_RATE] [--workers WORKERS] [--apply_filter] [--stats] [--model MODEL]
[--model_max_context MODEL_MAX_CONTEXT] [--model_chat_template MODEL_CHAT_TEMPLATE] [--target_longest_image_dim TARGET_LONGEST_IMAGE_DIM]
[--target_anchor_text_len TARGET_ANCHOR_TEXT_LEN] [--beaker] [--beaker_workspace BEAKER_WORKSPACE] [--beaker_cluster BEAKER_CLUSTER]
[--beaker_gpus BEAKER_GPUS] [--beaker_priority BEAKER_PRIORITY]
workspace
Manager for running millions of PDFs through a batch inference pipeline
positional arguments:
workspace The filesystem path where work will be stored, can be a local folder, or an s3 path if coordinating work with many workers, s3://bucket/prefix/
options:
-h, --help show this help message and exit
--pdfs PDFS Path to add pdfs stored in s3 to the workspace, can be a glob path s3://bucket/prefix/*.pdf or path to file containing list of pdf paths
--workspace_profile WORKSPACE_PROFILE
S3 configuration profile for accessing the workspace
--pdf_profile PDF_PROFILE
S3 configuration profile for accessing the raw pdf documents
--pages_per_group PAGES_PER_GROUP
Aiming for this many pdf pages per work item group
--max_page_retries MAX_PAGE_RETRIES
Max number of times we will retry rendering a page
--max_page_error_rate MAX_PAGE_ERROR_RATE
Rate of allowable failed pages in a document, 1/250 by default
--workers WORKERS Number of workers to run at a time
--apply_filter Apply basic filtering to English pdfs which are not forms, and not likely seo spam
--stats Instead of running any job, reports some statistics about the current workspace
--model MODEL List of paths where you can find the model to convert this pdf. You can specify several different paths here, and the script will try to use the
one which is fastest to access
--model_max_context MODEL_MAX_CONTEXT
Maximum context length that the model was fine tuned under
--model_chat_template MODEL_CHAT_TEMPLATE
Chat template to pass to sglang server
--target_longest_image_dim TARGET_LONGEST_IMAGE_DIM
Dimension on longest side to use for rendering the pdf pages
--target_anchor_text_len TARGET_ANCHOR_TEXT_LEN
Maximum amount of anchor text to use (characters)
--beaker Submit this job to beaker instead of running locally
--beaker_workspace BEAKER_WORKSPACE
Beaker workspace to submit to
--beaker_cluster BEAKER_CLUSTER
Beaker clusters you want to run on
--beaker_gpus BEAKER_GPUS
Number of gpu replicas to run
--beaker_priority BEAKER_PRIORITY
Beaker priority level for the job
```
## Team
**olmOCR** is developed and maintained by the AllenNLP team, backed by [the Allen Institute for Artificial Intelligence (AI2)](https://allenai.org/).
AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.
To learn more about who specifically contributed to this codebase, see [our contributors](https://github.com/allenai/olmocr/graphs/contributors) page.
## License
**olmOCR** is licensed under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0).
A full copy of the license can be found [on GitHub](https://github.com/allenai/olmocr/blob/main/LICENSE).
## Citing
```bibtex
@misc{olmocr,
title={{olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models}},
author={Jake Poznanski and Jon Borchardt and Jason Dunkelberger and Regan Huff and Daniel Lin and Aman Rangapur and Christopher Wilhelm and Kyle Lo and Luca Soldaini},
year={2025},
eprint={2502.18443},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.18443},
}
```
## /RELEASE_PROCESS.md
# GitHub Release Process
## Steps
1. Update the version in `olmocr/version.py`.
2. Run the release script:
```bash
./scripts/release.sh
```
This will commit the changes to the CHANGELOG and `version.py` files and then create a new tag in git
which will trigger a workflow on GitHub Actions that handles the rest.
## Fixing a failed release
If the GitHub Actions release workflow fails with an error that needs to be fixed, you'll have to delete both the tag and the corresponding release from GitHub. After you've pushed a fix, delete the tag from your local clone with
```bash
git tag -l | xargs git tag -d && git fetch -t
```
Then repeat the steps above.
## /docs/.gitignore
```gitignore path="/docs/.gitignore"
build
```
## /docs/Makefile
``` path="/docs/Makefile"
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?= -W
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
```
## /docs/make.bat
```bat path="/docs/make.bat"
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build
if "%1" == "" goto help
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.https://www.sphinx-doc.org/
exit /b 1
)
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd
```
## /docs/source/CHANGELOG.md
../../CHANGELOG.md
## /docs/source/CONTRIBUTING.md
../../.github/CONTRIBUTING.md
## /docs/source/_static/css/custom.css
```css path="/docs/source/_static/css/custom.css"
```
## /docs/source/_static/favicon.ico
Binary file available at https://raw.githubusercontent.com/allenai/olmocr/refs/heads/main/docs/source/_static/favicon.ico
## /docs/source/conf.py
```py path="/docs/source/conf.py"
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
import logging
import os
import sys
from datetime import datetime
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
sys.path.insert(0, os.path.abspath("../../"))
from olmocr import VERSION, VERSION_SHORT # noqa: E402
# -- Project information -----------------------------------------------------
project = "olmocr"
copyright = f"{datetime.today().year}, Allen Institute for Artificial Intelligence"
author = "Allen Institute for Artificial Intelligence"
version = VERSION_SHORT
release = VERSION
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.napoleon",
    "myst_parser",
    "sphinx.ext.intersphinx",
    "sphinx.ext.viewcode",
    "sphinx.ext.doctest",
    "sphinx_copybutton",
    "sphinx_autodoc_typehints",
]
# Tell myst-parser to assign header anchors for h1-h3.
myst_heading_anchors = 3
suppress_warnings = ["myst.header"]
# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ["_build"]
source_suffix = [".rst", ".md"]
intersphinx_mapping = {
    "python": ("https://docs.python.org/3", None),
    # Uncomment these if you use them in your codebase:
    # "torch": ("https://pytorch.org/docs/stable", None),
    # "datasets": ("https://huggingface.co/docs/datasets/master/en", None),
    # "transformers": ("https://huggingface.co/docs/transformers/master/en", None),
}
# By default, sort documented members by type within classes and modules.
autodoc_member_order = "groupwise"
# Include default values when documenting parameter types.
typehints_defaults = "comma"
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = "furo"
html_title = f"olmocr v{VERSION}"
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ["_static"]
html_css_files = ["css/custom.css"]
html_favicon = "_static/favicon.ico"
html_theme_options = {
    "footer_icons": [
        {
            "name": "GitHub",
            "url": "https://github.com/allenai/olmocr",
            "html": """
            """,  # noqa: E501
            "class": "",
        },
    ],
}
# -- Hack to get rid of stupid warnings from sphinx_autodoc_typehints --------
class ShutupSphinxAutodocTypehintsFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        if "Cannot resolve forward reference" in record.msg:
            return False
        return True


logging.getLogger("sphinx.sphinx_autodoc_typehints").addFilter(ShutupSphinxAutodocTypehintsFilter())
```
## /docs/source/index.md
# **olmocr**
```{toctree}
:maxdepth: 2
:hidden:
:caption: Getting started
installation
overview
```
```{toctree}
:hidden:
:caption: Development
CHANGELOG
CONTRIBUTING
License
GitHub Repository
```
## Indices and tables
```{eval-rst}
* :ref:`genindex`
* :ref:`modindex`
```
## /docs/source/installation.md
Installation
============
**olmocr** supports Python >= 3.8.
## Installing with `pip`
**olmocr** is available [on PyPI](https://pypi.org/project/olmocr/). Just run
```bash
pip install olmocr
```
## Installing from source
To install **olmocr** from source, first clone [the repository](https://github.com/allenai/olmocr):
```bash
git clone https://github.com/allenai/olmocr.git
cd olmocr
```
Then run
```bash
pip install -e .
```
## /docs/source/overview.md
Overview
========
## /gantry-requirements.txt
``` path="/gantry-requirements.txt"
torchvision
cached-path
smart_open
pypdf
pypdfium2
lingua-language-detector
Pillow
ruff
mypy>=1.0,<1.5
black>=23.0,<24.0
isort>=5.12,<5.13
pytest
pytest-sphinx
pytest-cov
twine>=1.11.0
build
setuptools
wheel
Sphinx>=4.3.0,<7.1.0
furo==2023.7.26
myst-parser>=1.0,<2.1
sphinx-copybutton==0.5.2
sphinx-autobuild==2021.3.14
sphinx-autodoc-typehints==1.23.3
packaging
necessary
accelerate>=0.34.2
datasets==3.0.0
peft
wandb
omegaconf
s3fs
transformers>=4.45.1
bitsandbytes
ftfy
```
## /olmocr/__init__.py
```py path="/olmocr/__init__.py"
from .version import VERSION, VERSION_SHORT
```
## /olmocr/bench/README.md
# olmOCR-Bench
We developed olmOCR-Bench to automatically and effectively evaluate document-level OCR across a variety of tools.
olmOCR-Bench works by testing various "facts" about document pages at the PDF-level.
Our intention is that each "fact" is very simple, unambiguous, and machine-checkable. For example, once your document has been OCRed, we may check that a particular sentence appears somewhere on the page.
We stay away from soft metrics like edit distance, because they can assign lower scores to parses of a document that differ from the reference yet are in fact still correct. For example, on a document containing multiple distinct articles, you want the text of each article grouped together, but the relative order of the two articles may not be critical. Conversely, some documents contain critical details, such as x and y being switched in an equation, that can make all the difference in understanding but would register as just a single-character edit under an edit-distance metric.
olmOCR-bench operates directly on single-page PDFs. We make this choice because PDFs preserve some digital metadata and information which may be helpful to some OCR systems. Almost any other format can be converted to a PDF, but not the reverse, so we try to preserve these original documents where possible.
## Benchmark Principles
As we created olmOCR-bench, we also kept a few general rules in mind:
- We expect your OCR system to output a plain-text Unicode document in a reading order that would be considered natural.
- Documents from the benchmark should fit on a standard A4 piece of paper and still be readable to a human.
- Markdown syntax is allowed, but ignored. Ex. if we are looking for the word "enlightenment" to appear on a page, and your system outputs "\*\*enlightenment\*\*" in Markdown bold, that still counts.
- olmOCR-bench is not position sensitive, ex. we check that a sentence or math equation appears anywhere on a page. The exception to this is header/footer tests, where we want to find simple page numbers appearing in the first or last few characters of a page.
- Tables can be in either Markdown syntax, or as an HTML `<table>`.
- Math equations must render with [KaTeX](https://katex.org/) and be delimited with $, $$, \\(, or \\[.
- Math equations are not position sensitive either, so if we are checking for $ 3x^2 $ to appear on a page, then outputting $ \int_a^b{ 3x ^ 2dx} $ counts.
- We normalize all Unicode to NFC before running the benchmark, so whether your OCR model outputs é or e + ◌́, your benchmark score is unaffected.
- We normalize all the different variants of hyphens to the ASCII -, all variants of double quotes to the ASCII ", and all variants of single quotes/apostrophes to the ASCII '. You should score the same on the benchmark whether you output - or —.
- All facts checked about documents are either pass/fail. We want it to be very clear if your OCR system fails a test, and if so, what output would make it pass.
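As an illustration of the normalization principles above (a sketch, not the benchmark's actual implementation), both NFC normalization and punctuation folding can be done with Python's standard library:

```python
import unicodedata

# Map common typographic punctuation variants to their ASCII equivalents
PUNCT_FOLD = str.maketrans({
    "\u2010": "-", "\u2011": "-", "\u2012": "-", "\u2013": "-", "\u2014": "-",  # hyphens and dashes
    "\u201c": '"', "\u201d": '"',                                              # double quotes
    "\u2018": "'", "\u2019": "'",                                              # single quotes/apostrophes
})

def normalize(text: str) -> str:
    """Normalize to Unicode NFC, then fold punctuation variants to ASCII."""
    return unicodedata.normalize("NFC", text).translate(PUNCT_FOLD)
```

For example, `normalize("e\u0301")` (e plus combining acute) and `normalize("é")` produce the same string, and an em-dash comes out as the ASCII hyphen.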
## olmOCR-Bench Fact classes
- Text presence
  - This task makes sure that a given small piece of text (ex. 1-3 sentence level) is present within a parsed document. Soft/fuzzy matching is allowed, as well as specifying if the text must be in the first N or last N characters of the document. Case sensitive by default.
- Text absence
  - This task makes sure that a given piece of text does NOT appear in the OCR'ed version of a document. We generally want our OCR systems to filter out content like headers/footers/page numbers from documents. The same fuzzy matching as in Text Presence tests is allowed.
- Natural Reading Order
  - This task ensures that blocks of text which are present have a defined order relative to one another. For example, on a document that contains multiple news articles on one page, you'd want to see that the first sentence of the first article appears after the heading of that article. But, you may be okay with swapping the order of those two articles.
- Table Accuracy
  - Both Markdown and HTML based tables are supported. These tests check that a cell with a given text exists somewhere in the table, and that its neighbors have certain properties. Ex. a cell exists on this page with text "4.5%" and above that is a cell with the text "2.4%".
- Math Formula Accuracy
  - We render a given LaTeX style equation using KaTeX in a headless browser, and then see if it exists anywhere in the final OCRed document. Matching is performed on a relative symbol level, ex. in "\f\relax{x} = \int_{-\infty}^\infty x^2dx" we check that a ∫ appears to the left of a x, x appears to the left of dx, etc...
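To make the pass/fail idea concrete, here is a simplified sketch of a text-presence check with fuzzy matching and an optional first-N/last-N restriction. The real implementation lives in the benchmark's test classes and differs in detail; the 0.9 threshold and the window scan here are illustrative choices:

```python
from difflib import SequenceMatcher
from typing import Optional

def text_present(document: str, target: str, threshold: float = 0.9,
                 first_n: Optional[int] = None, last_n: Optional[int] = None) -> bool:
    """Pass/fail check: does `target` appear (possibly fuzzily) in the allowed region?"""
    region = document
    if first_n is not None:
        region = region[:first_n]
    if last_n is not None:
        region = region[-last_n:]
    if target in region:  # fast path: exact substring match
        return True
    # Slide a target-sized window over the region and keep the best fuzzy ratio
    n = len(target)
    best = 0.0
    for start in range(max(1, len(region) - n + 1)):
        best = max(best, SequenceMatcher(None, target, region[start:start + n]).ratio())
    return best >= threshold
```

An absence test is then just `not text_present(...)`, typically with `last_n` set so that, say, a stray page number is only flagged if it shows up at the very end of the page.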
## Downloading and running the benchmark
Currently the full benchmark data is located here, but it's private until we are done reviewing and checking all of the tests:
https://huggingface.co/datasets/allenai/olmOCR-bench
To run a benchmark, first install the bench requirements
```bash
conda create -n olmocr python=3.11
conda activate olmocr
git clone https://github.com/allenai/olmocr.git
cd olmocr
pip install -e .[bench]
# Now clone the benchmark data
git clone https://huggingface.co/datasets/allenai/olmOCR-bench
```
Convert your documents
```bash
# convert using a single OCR-engine, see the olmocr/bench/runners directory for options
python -m olmocr.bench.convert olmocr_pipeline --dir ./olmOCR-bench/bench_data
# or use convert_all.sh to run OCR with many common frameworks all at once, API keys will be required
./olmocr/bench/scripts/convert_all.sh
```
Now run the benchmark
```bash
python -m olmocr.bench.benchmark --dir ./olmOCR-bench/bench_data
```
## Previewing the benchmark questions
We have an internal data annotation tool that can be used to review the questions in the benchmark, and make edits.
```bash
python -m olmocr.bench.review_app --port 5000 --debug ./olmOCR-bench/bench_data/multi_column.jsonl --force
```
## How were the tests made
Several categories of tests have been made so far:
1. arxiv_math -> We downloaded recent math papers from arxiv, filtered to those which had a single tex source file and a rendered pdf, using https://github.com/allenai/olmocr/blob/main/olmocr/bench/miners/download_math.py. Then we matched up the text on a pdf page to the location in the tex source most likely to match it, using a dynamic programming matching algorithm in https://github.com/allenai/olmocr/blob/main/olmocr/bench/miners/mine_math.py. From there, LaTeX equations from the matching page were parsed out, and we checked that they rendered in KaTeX before adding them as test cases. We did a final quick scan over the data manually to remove any cases where the LaTeX parsing may have failed egregiously.
2. headers_footers -> We sampled documents from our internal crawled PDF repository. (The same from which olmOCR-mix was derived, though the likelihood of duplicates is low, as there are 200M+ pdfs in this set.) Then we used [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO) to identify regions of the pages which were marked as headers/footers using the abandon category. We then got the text of those header/footer regions by extracting them and prompting Gemini, and we added them as test cases which should be absent. Manual review was then performed to remove mistakenly filtered text, and to set conditions such as limiting the search area to the first N or last N characters. Ex. if a page number "5" appears at the bottom of a page, you want to test that your OCR system does not output a "5" in the last 20 characters of the page, but "5" could appear earlier if it is in the actual body text.
3. table_tests -> We sampled documents from our internal crawled PDF repository, and found those which had tables using gemini-flash-2.0. https://github.com/allenai/olmocr/blob/main/olmocr/bench/miners/mine_tables_gemini.py On pages that had tables, we then further asked gemini-flash-2.0 to tell us the relationships between randomly chosen cells. Those tests were then manually checked.
4. multi_column -> We sampled documents from our internal crawled PDF repository manually, to find documents which had multi-column layouts and multiple articles on one page. Then, we used claude-sonnet-3.7 to render those pages to html, and from that html, we extracted text segments which were before/after one another. Then we manually reviewed each entry.
5. old_scans -> We sampled documents from the Library of Congress which contained handwriting or typewritten text. Then we prioritized creating rules that check for reading order. (TODO)
6. old_scans_math -> We found old math textbooks in the public domain from the Internet Archive. We then extracted random pages from them, OCRed them, filtered down to pages which contained equations, and picked several random equations from each page to use as test cases. We then manually checked each test case to verify that it accurately captured what was on the page.
7. long_tiny_text -> We found documents from the Internet Archive which contained a large amount of dense small print on a single page. Ex. pages from a dictionary, or pages of references from an academic paper. We then generated test cases using an LLM, and verified them manually.
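The page-to-source matching in step 1 relies on a dynamic programming alignment. As a toy illustration of that idea (this is not the actual mine_math.py algorithm), `difflib` can locate the region of a tex source that a page's OCRed text overlaps most:

```python
from difflib import SequenceMatcher

def locate_page_in_source(tex_source: str, page_text: str):
    """Toy sketch: find the longest run of page text that also appears in the tex source.

    Returns (offset_in_source, length) of the longest shared block.
    """
    sm = SequenceMatcher(None, tex_source, page_text, autojunk=False)
    m = sm.find_longest_match(0, len(tex_source), 0, len(page_text))
    return m.a, m.size
```

A real miner would align many such blocks per page (hence dynamic programming), but even this single longest-match gives a usable anchor for deciding which part of the source a page came from.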
## TODO List for release
- [ ] Check all tests for duplicates
- [ ] Write a script to verify that all baseline tests that actually have weird unicodes have exemptions
- [X] Review math equations in old_scans_math.jsonl using chat gpt script
- [X] Add test category of long_texts which are still ~1 standard printed page, but with dense/small text
- [ ] Review multicolumn_tests, make sure they are correct, clean, and don't have order tests between regions
- [X] Remove [] and other special symbols from old_scans
- [ ] Full review of old_scans, somehow, chatgpt or prolific
- [ ] Adjust scoring to weight each test category equally in final score distribution
- [ ] Double check marker inline math outputs
- [ ] Run against final set of comparison tools, and check list of all-pass and all-fail tests
## /olmocr/bench/__init__.py
```py path="/olmocr/bench/__init__.py"
```
## /olmocr/bench/benchmark.py
```py path="/olmocr/bench/benchmark.py"
#!/usr/bin/env python3
"""
This script runs olmocr bench.
It will take as an argument a folder, and scan it for .jsonl files which contain the various rules and properties that we will check.
It will then validate the JSON files to make sure they are all valid.
Then, each other folder in there (besides /pdfs) represents a pipeline tool that we will evaluate.
We will validate that each one of those contains at least one .md file (or repeated generations, e.g. _pg{page}_repeat{repeat}.md)
corresponding to its parse for every .pdf in the /pdfs folder.
Then, we will read each one, and check if they pass against all the rules.
If a rule fails on some of the repeats, a short explanation is printed.
The final score is averaged over the repeated generations.
Statistical analysis, including bootstrap confidence intervals, is provided for the results.
Pairwise permutation tests are conducted between specific candidate pairs.
"""
import argparse
import glob
import os
import random
import re
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import combinations
from typing import Dict, List, Tuple
from pypdf import PdfReader
from tqdm import tqdm
from .report import generate_html_report
from .tests import BaselineTest, BasePDFTest, load_tests
from .utils import calculate_bootstrap_ci, perform_permutation_test
def evaluate_candidate(
candidate_folder: str, all_tests: List[BasePDFTest], pdf_basenames: List[str], force: bool = False
) -> Tuple[float, int, List[str], List[str], Dict[str, List[float]], List[float], Dict[str, Dict[int, List[Tuple[BasePDFTest, bool, str]]]]]:
"""
For the candidate folder (pipeline tool output), validate that it contains at least one .md file
(i.e. repeated generations like _pg{page}_repeat{repeat}.md) for every PDF in the pdf folder.
Then, run each rule against all corresponding .md files concurrently and average the results.
Returns a tuple:
(overall_score, total_tests, candidate_errors, test_failures, test_type_breakdown, all_test_scores, test_results)
- overall_score: Average fraction of tests passed (averaged over repeats and tests).
- total_tests: Total number of tests evaluated.
- candidate_errors: List of candidate errors (e.g. missing files).
- test_failures: List of failure messages for tests not passing on all repeats.
- test_type_breakdown: Dictionary mapping test type to list of average pass ratios for tests of that type.
- all_test_scores: List of all individual test scores (used for bootstrapping).
- test_results: Dictionary mapping PDF name to dictionary mapping page number to list of (test, passed, explanation) tuples.
"""
candidate_errors = []
test_failures = []
test_type_breakdown = {} # key: test type, value: list of average pass ratios
all_test_scores = [] # Store all individual test scores for bootstrapping
test_results = {} # Store detailed test results for reporting
candidate_name = os.path.basename(candidate_folder)
# Map each PDF to its corresponding MD repeats (e.g., doc1_pg1_repeat1.md, doc1_pg2_repeat2.md, etc.)
pdf_to_md_files = {}
for pdf_name in pdf_basenames:
md_base = os.path.splitext(pdf_name)[0]
md_regex = re.compile(rf"^{re.escape(md_base)}_pg\d+_repeat\d+\.md$")
all_files = list(glob.glob(os.path.join(candidate_folder, "**/*.md"), recursive=True))
md_files = [f for f in all_files if md_regex.match(os.path.relpath(f, candidate_folder))]
if not md_files and not force:
candidate_errors.append(
f"Candidate '{candidate_name}' is missing MD repeats for {pdf_name} " f"(expected files matching {md_base}_pg{{page}}_repeat*.md)."
)
else:
pdf_to_md_files[pdf_name] = md_files
if candidate_errors:
return (0.0, len(all_tests), candidate_errors, test_failures, test_type_breakdown, all_test_scores, test_results)
# Define an inner function to evaluate a single test
def process_test(test: BasePDFTest) -> Tuple[float, str, str, List[str], Tuple[bool, str]]:
local_errors = []
test_failure = None
pdf_name = test.pdf
# Initialize the test_results structure if needed
if pdf_name not in test_results:
test_results[pdf_name] = {}
if test.page not in test_results[pdf_name]:
test_results[pdf_name][test.page] = []
md_base = os.path.splitext(pdf_name)[0]
md_files = pdf_to_md_files.get(pdf_name, [])
# Filter MD files for the specific page corresponding to the test
page_md_files = [f for f in md_files if re.search(rf"_pg{test.page}_", os.path.basename(f))]
if not page_md_files:
local_errors.append(
f"Candidate '{candidate_name}' is missing MD repeats for {pdf_name} page {test.page} "
f"(expected files matching {md_base}_pg{test.page}_repeat*.md)."
)
test_results[pdf_name][test.page].append((test, False, "Missing MD files"))
return (0.0, None, test.type, local_errors, (False, "Missing MD files"))
repeat_passes = 0
num_repeats = 0
explanations = []
for md_path in page_md_files:
num_repeats += 1
try:
with open(md_path, "r", encoding="utf-8") as f:
md_content = f.read()
except Exception as e:
local_errors.append(f"Error reading {md_path}: {e}")
continue
try:
passed, explanation = test.run(md_content)
if passed:
repeat_passes += 1
else:
explanations.append(explanation)
except Exception as e:
local_errors.append(f"Error running test {test.id} on {md_path}: {e}")
explanations.append(str(e))
test_avg = repeat_passes / num_repeats if num_repeats > 0 else 0.0
final_passed = test_avg > 0.5 # Consider test passed if majority of repeats pass
final_explanation = explanations[0] if explanations else "All repeats passed"
# Store the test result for reporting
test_results[pdf_name][test.page].append((test, final_passed, final_explanation))
if test_avg < 1.0:
test_failure = (
f"Test {test.id} on {md_base} page {test.page} average pass ratio: {test_avg:.3f} "
f"({repeat_passes}/{num_repeats} repeats passed). Ex: {explanations[0] if explanations else 'No explanation'}"
)
return (test_avg, test_failure, test.type, local_errors, (final_passed, final_explanation))
total_test_score = 0.0
# Use a thread pool to evaluate each test concurrently.
with ThreadPoolExecutor(max_workers=min(os.cpu_count() or 1, 64)) as executor:
futures = [executor.submit(process_test, test) for test in all_tests]
# tqdm progress bar for this candidate's tests
for future in tqdm(as_completed(futures), total=len(futures), desc=f"Evaluating tests for {candidate_name}", unit="test"):
test_avg, test_failure, test_type, errors, _ = future.result()
all_test_scores.append(test_avg)
total_test_score += test_avg
if test_failure:
test_failures.append(test_failure)
if test_type not in test_type_breakdown:
test_type_breakdown[test_type] = []
test_type_breakdown[test_type].append(test_avg)
if errors:
candidate_errors.extend(errors)
overall_score = total_test_score / len(all_tests) if all_tests else 0.0
return (overall_score, len(all_tests), candidate_errors, test_failures, test_type_breakdown, all_test_scores, test_results)
def main():
parser = argparse.ArgumentParser(description="Run OLMOCR Bench.")
parser.add_argument(
"--dir",
default=os.path.join(os.path.dirname(__file__), "sample_data"),
help="Path to the folder containing .jsonl files, /pdfs folder, and pipeline tool subfolders.",
)
parser.add_argument(
"--force",
action="store_true",
help="Run benchmark even if some files are missing",
)
parser.add_argument("--candidate", type=str, default=None, help="Run test only for a single candidate")
parser.add_argument("--skip_baseline", action="store_true", help="Skip running baseline tests (e.g., tests that check that basic content is present on each page)")
parser.add_argument(
"--bootstrap_samples",
type=int,
default=1000,
help="Number of bootstrap samples for confidence interval calculation (default: 1000).",
)
parser.add_argument(
"--confidence_level",
type=float,
default=0.95,
help="Confidence level for interval calculation (default: 0.95 for 95%% CI).",
)
parser.add_argument(
"--permutation_tests",
nargs="?",
const="default",
help=(
"Run permutation testing. If provided without candidate names, run default tests. "
"If provided with a comma-separated list of candidate names (e.g. --permutation_tests asdf,qwe,ert), "
"run permutation tests on all pairs of the specified candidates."
),
)
# New arguments
parser.add_argument("--sample", type=int, default=None, help="Randomly sample N tests to run instead of all tests.")
parser.add_argument("--test_report", type=str, default=None, help="Generate an HTML report of test results. Provide a filename (e.g., results.html).")
args = parser.parse_args()
input_folder = args.dir if os.path.isdir(args.dir) else os.path.dirname(args.dir)
n_bootstrap = args.bootstrap_samples
ci_level = args.confidence_level
pdf_folder = os.path.join(input_folder, "pdfs")
if not os.path.exists(pdf_folder):
print("Error: /pdfs folder must exist in your data directory.", file=sys.stderr)
sys.exit(1)
all_pdf_files = list(glob.glob(os.path.join(pdf_folder, "**/*.pdf"), recursive=True))
if not all_pdf_files:
print(f"Error: No PDF files found in {pdf_folder}", file=sys.stderr)
sys.exit(1)
pdf_basenames = [os.path.relpath(p, pdf_folder) for p in all_pdf_files]
if os.path.isfile(args.dir):
jsonl_files = [args.dir]
else:
jsonl_files = glob.glob(os.path.join(input_folder, "*.jsonl"))
if not jsonl_files:
print(f"Error: No .jsonl files found in {input_folder}.", file=sys.stderr)
sys.exit(1)
all_tests = []
test_to_jsonl = {} # Map test IDs to their source jsonl files
for jsonl_path in jsonl_files:
jsonl_basename = os.path.basename(jsonl_path)
tests = load_tests(jsonl_path)
for test in tests:
test_to_jsonl[test.id] = jsonl_basename
all_tests.extend(tests)
if not all_tests:
print("No valid tests found. Exiting.", file=sys.stderr)
sys.exit(1)
for pdf in pdf_basenames:
if not any(t.type == "baseline" for t in all_tests if t.pdf == pdf):
all_tests.append(BaselineTest(id=f"{pdf}_baseline", pdf=pdf, page=1, type="baseline"))
test_to_jsonl[all_tests[-1].id] = "baseline"
for pdf in pdf_basenames:
pdf_doc = PdfReader(os.path.join(pdf_folder, pdf))
for page in range(1, len(pdf_doc.pages) + 1):
if not any(test for test in all_tests if test.pdf == pdf and test.page == page) and not args.force:
print(f"No dataset entry found for pdf {pdf} page {page} (rerun with --force to ignore)")
sys.exit(1)
if args.skip_baseline:
all_tests = [test for test in all_tests if test.type != "baseline"]
# Sample tests if requested
if args.sample is not None and args.sample > 0:
if args.sample >= len(all_tests):
print(f"Sample size {args.sample} is greater than or equal to the total number of tests ({len(all_tests)}). Using all tests.")
else:
print(f"Randomly sampling {args.sample} tests out of {len(all_tests)} total tests.")
all_tests = random.sample(all_tests, args.sample)
candidate_folders = []
for entry in os.listdir(input_folder):
full_path = os.path.join(input_folder, entry)
if args.candidate is not None:
if entry == args.candidate:
candidate_folders.append(full_path)
else:
if os.path.isdir(full_path) and entry != "pdfs":
candidate_folders.append(full_path)
if not candidate_folders:
print("Error: No candidate pipeline folders found (subdirectories besides 'pdfs').", file=sys.stderr)
sys.exit(1)
candidate_folders.sort()
summary = []
test_results_by_candidate = {}
print("\nRunning tests for each candidate:")
# Process candidates sequentially so that each candidate's progress bar is distinct.
for candidate in candidate_folders:
candidate_name = os.path.basename(candidate)
print(f"\nEvaluating candidate: {candidate_name}")
overall_score, total_tests, candidate_errors, test_failures, test_type_breakdown, all_test_scores, test_results = evaluate_candidate(
candidate, all_tests, pdf_basenames, args.force
)
# Always store test results for displaying jsonl file groupings
test_results_by_candidate[candidate_name] = test_results
if all_test_scores:
ci = calculate_bootstrap_ci(all_test_scores, n_bootstrap=n_bootstrap, ci_level=ci_level)
else:
ci = (0.0, 0.0)
summary.append((candidate_name, overall_score, total_tests, candidate_errors, test_failures, test_type_breakdown, ci, all_test_scores))
print(f"\nCandidate: {candidate_name}")
if candidate_errors:
for err in candidate_errors:
print(f" [ERROR] {err}")
else:
if test_failures:
for fail in test_failures:
print(f" [FAIL] {fail}")
print(f" Average Score: {overall_score * 100:.1f}% ({ci_level * 100:.0f}% CI: [{ci[0] * 100:.1f}%, {ci[1] * 100:.1f}%]) over {total_tests} tests.")
print("\n" + "=" * 60)
print(f"Final Summary with {ci_level * 100:.0f}% Confidence Intervals:")
for candidate_name, overall_score, total_tests, candidate_errors, _, test_type_breakdown, ci, _ in summary:
if candidate_errors:
status = "FAILED (errors)"
ciw_str = ""
else:
status = f"{overall_score * 100:0.1f}%"
half_width = ((ci[1] - ci[0]) / 2) * 100
ciw_str = f"± {half_width:0.1f}%"
print(f"{candidate_name:20s} : Average Score: {status} {ciw_str}")
# Sort the test types alphabetically
for ttype in sorted(test_type_breakdown.keys()):
scores = test_type_breakdown[ttype]
avg = sum(scores) / len(scores) * 100 if scores else 0.0
print(f" {ttype:8s}: {avg:0.1f}% average pass rate over {len(scores)} tests")
# Group results by jsonl file
jsonl_results = {}
for test in all_tests:
# Get the jsonl file this test came from
jsonl_file = test_to_jsonl.get(test.id, "unknown")
if jsonl_file not in jsonl_results:
jsonl_results[jsonl_file] = {"total": 0, "passed": 0}
jsonl_results[jsonl_file]["total"] += 1
# Get the test result for this candidate if it exists
test_result = None
if not candidate_errors and hasattr(test, "pdf") and hasattr(test, "page"):
pdf_name = test.pdf
page = test.page
if pdf_name in test_results_by_candidate.get(candidate_name, {}) and page in test_results_by_candidate[candidate_name].get(pdf_name, {}):
for t, passed, _ in test_results_by_candidate[candidate_name][pdf_name][page]:
if t.id == test.id:
test_result = passed
break
if test_result:
jsonl_results[jsonl_file]["passed"] += 1
print("\n Results by JSONL file:")
for jsonl_file, results in sorted(jsonl_results.items()):
if results["total"] > 0:
pass_rate = (results["passed"] / results["total"]) * 100
print(f" {jsonl_file:30s}: {pass_rate:0.1f}% ({results['passed']}/{results['total']} tests)")
print("")
if args.permutation_tests is not None:
print("\n" + "=" * 60)
print("Pairwise Permutation Tests:")
valid_candidates = [c for c in summary if not c[3]]
if args.permutation_tests == "default":
olmocr_candidates = sorted([c for c in valid_candidates if "olmocr" in c[0].lower()], key=lambda x: x[1], reverse=True)
non_olmocr_candidates = sorted([c for c in valid_candidates if "olmocr" not in c[0].lower()], key=lambda x: x[1], reverse=True)
top_olmocr = olmocr_candidates[0] if olmocr_candidates else None
top_non_olmocr = non_olmocr_candidates[0] if non_olmocr_candidates else None
top_two_olmocr = olmocr_candidates[:2]
if top_olmocr and top_non_olmocr:
olmocr_name, olmocr_score = top_olmocr[0], top_olmocr[1]
non_olmocr_name, non_olmocr_score = top_non_olmocr[0], top_non_olmocr[1]
diff, p_value = perform_permutation_test(top_olmocr[7], top_non_olmocr[7])
print("\nComparison 1: Top olmocr vs Top non-olmocr candidate")
print(f" {olmocr_name} ({olmocr_score*100:.1f}%) vs {non_olmocr_name} ({non_olmocr_score*100:.1f}%)")
print(f" Difference: {diff*100:.2f}% (positive means {olmocr_name} is better)")
print(f" p-value: {p_value:.4f}")
if p_value < 0.05:
print(" Result: Statistically significant difference (p < 0.05)")
else:
print(" Result: No statistically significant difference (p ≥ 0.05)")
else:
print("\nCannot perform olmocr vs non-olmocr comparison: Missing candidates")
if len(top_two_olmocr) >= 2:
diff, p_value = perform_permutation_test(top_two_olmocr[0][7], top_two_olmocr[1][7])
print("\nComparison 2: Top two olmocr candidates")
print(f" {top_two_olmocr[0][0]} ({top_two_olmocr[0][1]*100:.1f}%) vs {top_two_olmocr[1][0]} ({top_two_olmocr[1][1]*100:.1f}%)")
print(f" Difference: {diff*100:.2f}% (positive means {top_two_olmocr[0][0]} is better)")
print(f" p-value: {p_value:.4f}")
if p_value < 0.05:
print(" Result: Statistically significant difference (p < 0.05)")
else:
print(" Result: No statistically significant difference (p ≥ 0.05)")
else:
print("\nCannot perform top two olmocr comparison: Not enough olmocr candidates")
else:
candidate_names = [name.strip() for name in args.permutation_tests.split(",")]
selected_candidates = [c for c in valid_candidates if c[0] in candidate_names]
if len(selected_candidates) < 2:
print("\nNot enough valid candidates among the selected for permutation tests.")
else:
for cand1, cand2 in combinations(selected_candidates, 2):
diff, p_value = perform_permutation_test(cand1[7], cand2[7])
print(f"\nComparison: {cand1[0]} vs {cand2[0]}")
print(f" {cand1[0]} ({cand1[1]*100:.1f}%) vs {cand2[0]} ({cand2[1]*100:.1f}%)")
print(f" Difference: {diff*100:.2f}% (positive means {cand1[0]} is better)")
print(f" p-value: {p_value:.4f}")
if p_value < 0.05:
print(" Result: Statistically significant difference (p < 0.05)")
else:
print(" Result: No statistically significant difference (p ≥ 0.05)")
print("=" * 60)
# Generate HTML report if requested
if args.test_report:
generate_html_report(test_results_by_candidate, pdf_folder, args.test_report)
if __name__ == "__main__":
main()
```
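For context on the scoring used in `benchmark.py` above: each test is run against every repeated Markdown conversion, the per-test score is the fraction of repeats that pass (with a strict majority counting as "passed"), and the summary reports a bootstrap confidence interval over those per-test scores. The actual `calculate_bootstrap_ci` lives earlier in the file and is not shown in this excerpt, so the following is only a minimal sketch of the idea (a plain percentile bootstrap), not the project's implementation:

```python
import random
from typing import List, Sequence, Tuple


def repeat_score(passes: Sequence[bool]) -> Tuple[float, bool]:
    # Average pass rate over a test's repeats; the test counts as passed
    # only if a strict majority of repeats passed (mirrors test_avg > 0.5).
    if not passes:
        return 0.0, False
    avg = sum(passes) / len(passes)
    return avg, avg > 0.5


def bootstrap_ci(scores: List[float], n_bootstrap: int = 1000, ci_level: float = 0.95, seed: int = 0) -> Tuple[float, float]:
    # Percentile bootstrap over per-test scores: resample with replacement,
    # take the mean of each resample, and read the CI off the sorted means.
    rng = random.Random(seed)
    means = sorted(sum(rng.choices(scores, k=len(scores))) / len(scores) for _ in range(n_bootstrap))
    lo = means[int((1 - ci_level) / 2 * n_bootstrap)]
    hi = means[min(int((1 - (1 - ci_level) / 2) * n_bootstrap), n_bootstrap - 1)]
    return lo, hi
```

With three repeats of which two pass, `repeat_score` yields an average of 2/3 and the test counts as passed; an exact 50/50 split does not pass, since the threshold is strict.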
## /olmocr/bench/checker/check_old_scans_math.py
```py path="/olmocr/bench/checker/check_old_scans_math.py"
import argparse
import json
import os
from typing import Any, Dict
from openai import OpenAI
from olmocr.data.renderpdf import render_pdf_to_base64png
def verify_latex_match(
pdf_path: str,
page_num: int,
latex_expression: str,
model: str = "gpt-4o-2024-08-06",
temperature: float = 0.1,
target_longest_image_dim: int = 2048,
) -> Dict[str, Any]:
"""
Verify if a LaTeX math expression matches what appears in a PDF page.
Args:
pdf_path (str): Path to the PDF file
page_num (int): Page number to check (1-indexed)
latex_expression (str): LaTeX expression to verify
model (str): OpenAI model to use
temperature (float): Temperature for API call
target_longest_image_dim (int): Target dimension for the image
Returns:
Dict with verification result
"""
image_base64 = render_pdf_to_base64png(pdf_path, page_num=page_num, target_longest_image_dim=target_longest_image_dim)
if not os.getenv("OPENAI_API_KEY"):
raise SystemExit("You must specify an OPENAI_API_KEY environment variable")
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
prompt = f"""
This is a mathematical expression verification task.
I'm showing you a page from a PDF document containing mathematical expressions.
Please verify if the following LaTeX expression:
{latex_expression}
appears correctly in the document.
Respond with a JSON object containing:
1. "status": "correct" or "incorrect"
2. "confidence": a value between 0 and 1 representing your confidence in the answer
3. "explanation": a brief explanation of why you believe the expression is correct or incorrect
Focus specifically on checking if this exact mathematical expression appears in the document.
"""
response = client.chat.completions.create(
model=model,
messages=[
{
"role": "user",
"content": [
{"type": "text", "text": prompt},
{"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
],
}
],
# Note: temperature and max_tokens are left unset here; some models reject these parameters.
# temperature=temperature,
response_format={"type": "json_object"},
# max_tokens=1000,
)
raw_response = response.choices[0].message.content
result = json.loads(raw_response)
return {
"pdf": pdf_path,
"math": latex_expression,
"status": result.get("status", "unknown"),
"confidence": result.get("confidence", 0),
"explanation": result.get("explanation", "No explanation provided"),
}
def process_jsonl_file(input_jsonl_path: str, output_jsonl_path: str, model: str = "o4-mini-2025-04-16", temperature: float = 0.1) -> None:
"""
Process a JSONL file containing math expressions to verify.
Args:
input_jsonl_path (str): Path to input JSONL file
output_jsonl_path (str): Path to output JSONL file
model (str): OpenAI model to use
temperature (float): Temperature for API call
"""
processed_count = 0
with open(output_jsonl_path, "w") as out_file:
with open(input_jsonl_path, "r") as in_file:
for line_num, line in enumerate(in_file, 1):
try:
entry = json.loads(line.strip())
pdf_path = entry.get("pdf")
page_num = entry.get("page", 1)
math_expr = entry.get("math")
if not all([pdf_path, math_expr]):
print(f"Line {line_num}: Skipping entry due to missing required fields")
continue
print(f"Line {line_num}: Processing: {pdf_path}, page {page_num}")
try:
result = verify_latex_match(pdf_path=pdf_path, page_num=page_num, latex_expression=math_expr, model=model, temperature=temperature)
out_file.write(json.dumps(result) + "\n")
processed_count += 1
except Exception as e:
print(f"Line {line_num}: Error processing {pdf_path}: {str(e)}")
error_result = {"pdf": pdf_path, "math": math_expr, "status": "error", "explanation": str(e)}
out_file.write(json.dumps(error_result) + "\n")
processed_count += 1
except json.JSONDecodeError:
print(f"Line {line_num}: Invalid JSON, skipping")
print(f"Processed {processed_count} entries. Results saved to {output_jsonl_path}")
def main():
parser = argparse.ArgumentParser(description="Verify LaTeX math expressions in PDFs")
parser.add_argument("input_jsonl", help="Path to input JSONL file")
parser.add_argument("output_jsonl", help="Path to output JSONL file")
parser.add_argument("--model", default="o4-mini-2025-04-16", help="OpenAI model to use")
parser.add_argument("--temperature", type=float, default=0.1, help="Temperature for API call")
args = parser.parse_args()
process_jsonl_file(input_jsonl_path=args.input_jsonl, output_jsonl_path=args.output_jsonl, model=args.model, temperature=args.temperature)
if __name__ == "__main__":
main()
```
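The checker above expects one JSON object per input line, with required `pdf` and `math` fields and an optional `page` defaulting to 1. Here is a minimal sketch of that input handling, isolated from the OpenAI call so the line format can be tried locally (`parse_entry` is a hypothetical helper for illustration, not part of the module):

```python
import json


def parse_entry(line: str):
    # Mirrors the input handling in process_jsonl_file: "pdf" and "math" are
    # required, "page" defaults to 1; entries missing fields are skipped (None).
    entry = json.loads(line)
    pdf_path = entry.get("pdf")
    page_num = entry.get("page", 1)
    math_expr = entry.get("math")
    if not all([pdf_path, math_expr]):
        return None  # this entry would be skipped
    return pdf_path, page_num, math_expr


# Example input lines (paths are illustrative)
good = json.dumps({"pdf": "docs/scan.pdf", "page": 3, "math": r"\frac{a}{b}"})
bad = json.dumps({"pdf": "docs/scan.pdf"})  # no "math" field -> skipped
```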
## /olmocr/bench/convert.py
```py path="/olmocr/bench/convert.py"
import argparse
import asyncio
import base64
import glob
import importlib
import os
import tempfile
from functools import partial
from pypdf import PdfReader
from tqdm import tqdm
from olmocr.data.renderpdf import render_pdf_to_base64png
from olmocr.image_utils import convert_image_to_pdf_bytes
def parse_method_arg(method_arg):
"""
Parse a method configuration string of the form:
method_name[:key=value[:key2=value2...]]
Returns:
(method_name, kwargs_dict, folder_name)
"""
parts = method_arg.split(":")
name = parts[0]
kwargs = {}
folder_name = name # Default folder name is the method name
for extra in parts[1:]:
if "=" in extra:
key, value = extra.split("=", 1)
if key == "name":
folder_name = value
continue
try:
converted = int(value)
except ValueError:
try:
converted = float(value)
except ValueError:
converted = value
kwargs[key] = converted
else:
raise ValueError(f"Extra argument '{extra}' is not in key=value format")
return name, kwargs, folder_name
# Wrapper to run synchronous functions in the event loop
async def run_sync_in_executor(func, *args, **kwargs):
"""Run a synchronous function in the default executor"""
loop = asyncio.get_running_loop()
return await loop.run_in_executor(None, partial(func, *args, **kwargs))
async def process_pdf(pdf_path, page_num, method, kwargs, output_path, is_async):
"""Process a single PDF and save the result to output_path"""
try:
if is_async:
# Run async function directly
markdown = await method(pdf_path, page_num=page_num, **kwargs)
else:
# Run synchronous function in the executor
markdown = await run_sync_in_executor(method, pdf_path, page_num=page_num, **kwargs)
if markdown is None:
print(f"Warning, did not get output for {os.path.basename(output_path)}")
# Write blank to this file, so that it's marked as an error and not just skipped in evals
with open(output_path, "w") as out_f:
out_f.write("")
return False
# Write the markdown to the output file
with open(output_path, "w") as out_f:
out_f.write(markdown)
return True
except Exception as ex:
print(f"Exception {str(ex)} occurred while processing {os.path.basename(output_path)}")
# Write blank to this file, so that it's marked as an error and not just skipped in evals
with open(output_path, "w") as out_f:
out_f.write("")
return False
async def process_pdfs(config, pdf_directory, data_directory, repeats, remove_text, force, max_parallel=None):
"""
Process PDFs using asyncio for both sync and async methods,
limiting the number of concurrent tasks to max_parallel.
"""
for candidate in config.keys():
print(f"Starting conversion using {candidate} with kwargs: {config[candidate]['kwargs']}")
folder_name = config[candidate]["folder_name"]
candidate_output_dir = os.path.join(data_directory, folder_name)
os.makedirs(candidate_output_dir, exist_ok=True)
method = config[candidate]["method"]
kwargs = config[candidate]["kwargs"]
is_async = asyncio.iscoroutinefunction(method)
# Use recursive glob to support nested PDFs
all_pdfs = glob.glob(os.path.join(pdf_directory, "**/*.pdf"), recursive=True)
all_pdfs.sort()
# Prepare all tasks
tasks = []
task_descriptions = {}
for pdf_path in all_pdfs:
pdf = PdfReader(pdf_path)
num_pages = len(pdf.pages)
base_name = os.path.splitext(os.path.basename(pdf_path))[0]  # splitext avoids mangling names that contain ".pdf" mid-string
# Determine the PDF's relative folder path (e.g. "arxiv_data") relative to pdf_directory
relative_pdf_path = os.path.relpath(pdf_path, pdf_directory)
pdf_relative_dir = os.path.dirname(relative_pdf_path)
if remove_text:
print(f"Converting {pdf_path} into images to remove text-content...")
# Generate image files from each page
temp_image_files = []
try:
for page_num in range(1, num_pages + 1):
# Get base64 PNG data for the current page
base64_png = render_pdf_to_base64png(pdf_path, page_num, target_longest_image_dim=2048)
# Decode base64 and save to temporary file
temp_img = tempfile.NamedTemporaryFile("wb", suffix=".png", delete=False)
temp_img.write(base64.b64decode(base64_png))
temp_img.close()
temp_image_files.append(temp_img.name)
# Convert all images to a single PDF using our enhanced function
pdf_bytes = convert_image_to_pdf_bytes(temp_image_files)
# Write the PDF bytes to a temporary file
temp_pdf = tempfile.NamedTemporaryFile("wb", suffix=".pdf", delete=False)
temp_pdf.write(pdf_bytes)
temp_pdf.close()
# Update pdf_path to the new file
pdf_path = temp_pdf.name
finally:
# Clean up temporary image files
for temp_file in temp_image_files:
try:
os.remove(temp_file)
except Exception as e:
print(f"Warning: Failed to remove temporary file {temp_file}: {e}")
for repeat in range(1, repeats + 1):
for page_num in range(1, num_pages + 1):
output_filename = f"{base_name}_pg{page_num}_repeat{repeat}.md"
# Preserve the relative folder structure in the output directory
candidate_pdf_dir = os.path.join(candidate_output_dir, pdf_relative_dir)
os.makedirs(candidate_pdf_dir, exist_ok=True)
output_path = os.path.join(candidate_pdf_dir, output_filename)
if os.path.exists(output_path) and not force:
print(f"Skipping {base_name}_pg{page_num}_repeat{repeat} for {candidate}, file already exists")
print("Rerun with --force flag to force regeneration")
continue
task = process_pdf(pdf_path, page_num, method, kwargs, output_path, is_async)
tasks.append(task)
task_descriptions[id(task)] = f"{base_name}_pg{page_num}_repeat{repeat} ({candidate})"
# Process tasks with semaphore to limit concurrency
semaphore = asyncio.Semaphore(max_parallel or 1) # Default to 1 if not specified
async def process_with_semaphore(task):
async with semaphore:
return await task
# Wrap each task with the semaphore
limited_tasks = [process_with_semaphore(task) for task in tasks]
# Process tasks with progress bar
if limited_tasks:
completed = 0
with tqdm(total=len(limited_tasks), desc=f"Processing {candidate}") as pbar:
for task in asyncio.as_completed(limited_tasks):
try:
result = await task
if result:
completed += 1
except Exception as e:
print(f"Task failed: {e}")
finally:
pbar.update(1)
print(f"Completed {completed} out of {len(limited_tasks)} tasks for {candidate}")
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Run PDF conversion using specified OCR methods and extra parameters.")
parser.add_argument(
"methods",
nargs="+",
help="Methods to run in the format method[:key=value ...]. "
"Example: gotocr mineru:temperature=2 marker:u=3. "
"Use 'name=folder_name' to specify a custom output folder name.",
)
parser.add_argument("--repeats", type=int, default=1, help="Number of times to repeat the conversion for each PDF.")
parser.add_argument(
"--dir",
type=str,
default=os.path.join(os.path.dirname(__file__), "sample_data"),
help="Path to the data folder in which to save outputs, pdfs should be in /pdfs folder within it.",
)
parser.add_argument("--force", action="store_true", default=False, help="Force regeneration of output files, even if they already exist")
parser.add_argument("--parallel", type=int, default=1, help="Maximum number of concurrent tasks")
parser.add_argument(
"--remove_text",
action="store_true",
help="Render each PDF to page images first, so that any embedded text content is removed before processing. Note that this disables document anchoring for olmocr.",
)
args = parser.parse_args()
# Mapping of method names to a tuple: (module path, function name)
available_methods = {
"olmocr_pipeline": ("olmocr.bench.runners.run_olmocr_pipeline", "run_olmocr_pipeline"),
"gotocr": ("olmocr.bench.runners.run_gotocr", "run_gotocr"),
"marker": ("olmocr.bench.runners.run_marker", "run_marker"),
"mineru": ("olmocr.bench.runners.run_mineru", "run_mineru"),
"chatgpt": ("olmocr.bench.runners.run_chatgpt", "run_chatgpt"),
"gemini": ("olmocr.bench.runners.run_gemini", "run_gemini"),
"mistral": ("olmocr.bench.runners.run_mistral", "run_mistral"),
"docling": ("olmocr.bench.runners.run_docling", "run_docling"),
"rolmocr": ("olmocr.bench.runners.run_rolmocr", "run_rolmocr"),
"transformers": ("olmocr.bench.runners.run_transformers", "run_transformers"),
"server": ("olmocr.bench.runners.run_server", "run_server"),
}
# Build config by importing only requested methods.
config = {}
for method_arg in args.methods:
method_name, extra_kwargs, folder_name = parse_method_arg(method_arg)
if method_name not in available_methods:
parser.error(f"Unknown method: {method_name}. " f"Available methods: {', '.join(available_methods.keys())}")
module_path, function_name = available_methods[method_name]
# Dynamically import the module and get the function.
module = importlib.import_module(module_path)
function = getattr(module, function_name)
config[method_name] = {"method": function, "kwargs": extra_kwargs, "folder_name": folder_name}
data_directory = args.dir
pdf_directory = os.path.join(data_directory, "pdfs")
# Run the async process function with the parallel argument
asyncio.run(process_pdfs(config, pdf_directory, data_directory, args.repeats, args.remove_text, args.force, args.parallel))
```
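The `method[:key=value...]` syntax accepted on the command line is handled entirely by `parse_method_arg`. Restated here as a standalone snippet (a lightly restructured copy of the function above) to make the value-coercion rules concrete:

```python
def parse_method_arg(method_arg):
    # Parse method_name[:key=value[:key2=value2...]].
    # "name=..." overrides the output folder; other values are coerced
    # to int, then float, then left as a string.
    parts = method_arg.split(":")
    name = parts[0]
    kwargs = {}
    folder_name = name  # default folder name is the method name
    for extra in parts[1:]:
        if "=" not in extra:
            raise ValueError(f"Extra argument '{extra}' is not in key=value format")
        key, value = extra.split("=", 1)
        if key == "name":
            folder_name = value
            continue
        try:
            converted = int(value)
        except ValueError:
            try:
                converted = float(value)
            except ValueError:
                converted = value
        kwargs[key] = converted
    return name, kwargs, folder_name
```

For example, `parse_method_arg("mineru:temperature=2:name=custom")` yields `("mineru", {"temperature": 2}, "custom")`: integers are tried first, then floats, then the raw string is kept.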
## /olmocr/bench/katex/__init__.py
```py path="/olmocr/bench/katex/__init__.py"
from .render import compare_rendered_equations, render_equation
```
## /olmocr/bench/katex/auto-render.min.js
```js path="/olmocr/bench/katex/auto-render.min.js"
!function(e,t){"object"==typeof exports&&"object"==typeof module?module.exports=t(require("katex")):"function"==typeof define&&define.amd?define(["katex"],t):"object"==typeof exports?exports.renderMathInElement=t(require("katex")):e.renderMathInElement=t(e.katex)}("undefined"!=typeof self?self:this,(function(e){return function(){"use strict";var t={757:function(t){t.exports=e}},n={};function r(e){var o=n[e];if(void 0!==o)return o.exports;var i=n[e]={exports:{}};return t[e](i,i.exports,r),i.exports}r.n=function(e){var t=e&&e.__esModule?function(){return e.default}:function(){return e};return r.d(t,{a:t}),t},r.d=function(e,t){for(var n in t)r.o(t,n)&&!r.o(e,n)&&Object.defineProperty(e,n,{enumerable:!0,get:t[n]})},r.o=function(e,t){return Object.prototype.hasOwnProperty.call(e,t)};var o={};r.d(o,{default:function(){return p}});var i=r(757),a=r.n(i);const l=function(e,t,n){let r=n,o=0;const i=e.length;for(;re.left.replace(/[-/\\^$*+?.()|[\]{}]/g,"\\$&"))).join("|")+")");for(;n=e.search(o),-1!==n;){n>0&&(r.push({type:"text",data:e.slice(0,n)}),e=e.slice(n));const o=t.findIndex((t=>e.startsWith(t.left)));if(n=l(t[o].right,e,t[o].left.length),-1===n)break;const i=e.slice(0,n+t[o].right.length),a=s.test(i)?i:e.slice(t[o].left.length,n);r.push({type:"math",data:a,rawData:i,display:t[o].display}),e=e.slice(n+t[o].right.length)}return""!==e&&r.push({type:"text",data:e}),r};const c=function(e,t){const n=d(e,t.delimiters);if(1===n.length&&"text"===n[0].type)return null;const r=document.createDocumentFragment();for(let e=0;e-1===e.indexOf(" "+t+" ")))&&f(r,t)}}};var p=function(e,t){if(!e)throw new Error("No element provided to render");const n={};for(const e in 
t)t.hasOwnProperty(e)&&(n[e]=t[e]);n.delimiters=n.delimiters||[{left:"$$",right:"$$",display:!0},{left:"\\(",right:"\\)",display:!1},{left:"\\begin{equation}",right:"\\end{equation}",display:!0},{left:"\\begin{align}",right:"\\end{align}",display:!0},{left:"\\begin{alignat}",right:"\\end{alignat}",display:!0},{left:"\\begin{gather}",right:"\\end{gather}",display:!0},{left:"\\begin{CD}",right:"\\end{CD}",display:!0},{left:"\\[",right:"\\]",display:!0}],n.ignoredTags=n.ignoredTags||["script","noscript","style","textarea","pre","code","option"],n.ignoredClasses=n.ignoredClasses||[],n.errorCallback=n.errorCallback||console.error,n.macros=n.macros||{},f(e,n)};return o=o.default}()}));
```
## /olmocr/bench/katex/katex.min.css
```css path="/olmocr/bench/katex/katex.min.css"
@font-face{font-family:KaTeX_AMS;font-style:normal;font-weight:400;src:url(fonts/KaTeX_AMS-Regular.woff2) format("woff2"),url(fonts/KaTeX_AMS-Regular.woff) format("woff"),url(fonts/KaTeX_AMS-Regular.ttf) format("truetype")}@font-face{font-family:KaTeX_Caligraphic;font-style:normal;font-weight:700;src:url(fonts/KaTeX_Caligraphic-Bold.woff2) format("woff2"),url(fonts/KaTeX_Caligraphic-Bold.woff) format("woff"),url(fonts/KaTeX_Caligraphic-Bold.ttf) format("truetype")}@font-face{font-family:KaTeX_Caligraphic;font-style:normal;font-weight:400;src:url(fonts/KaTeX_Caligraphic-Regular.woff2) format("woff2"),url(fonts/KaTeX_Caligraphic-Regular.woff) format("woff"),url(fonts/KaTeX_Caligraphic-Regular.ttf) format("truetype")}@font-face{font-family:KaTeX_Fraktur;font-style:normal;font-weight:700;src:url(fonts/KaTeX_Fraktur-Bold.woff2) format("woff2"),url(fonts/KaTeX_Fraktur-Bold.woff) format("woff"),url(fonts/KaTeX_Fraktur-Bold.ttf) format("truetype")}@font-face{font-family:KaTeX_Fraktur;font-style:normal;font-weight:400;src:url(fonts/KaTeX_Fraktur-Regular.woff2) format("woff2"),url(fonts/KaTeX_Fraktur-Regular.woff) format("woff"),url(fonts/KaTeX_Fraktur-Regular.ttf) format("truetype")}@font-face{font-family:KaTeX_Main;font-style:normal;font-weight:700;src:url(fonts/KaTeX_Main-Bold.woff2) format("woff2"),url(fonts/KaTeX_Main-Bold.woff) format("woff"),url(fonts/KaTeX_Main-Bold.ttf) format("truetype")}@font-face{font-family:KaTeX_Main;font-style:italic;font-weight:700;src:url(fonts/KaTeX_Main-BoldItalic.woff2) format("woff2"),url(fonts/KaTeX_Main-BoldItalic.woff) format("woff"),url(fonts/KaTeX_Main-BoldItalic.ttf) format("truetype")}@font-face{font-family:KaTeX_Main;font-style:italic;font-weight:400;src:url(fonts/KaTeX_Main-Italic.woff2) format("woff2"),url(fonts/KaTeX_Main-Italic.woff) format("woff"),url(fonts/KaTeX_Main-Italic.ttf) format("truetype")}@font-face{font-family:KaTeX_Main;font-style:normal;font-weight:400;src:url(fonts/KaTeX_Main-Regular.woff2) 
format("woff2"),url(fonts/KaTeX_Main-Regular.woff) format("woff"),url(fonts/KaTeX_Main-Regular.ttf) format("truetype")}@font-face{font-family:KaTeX_Math;font-style:italic;font-weight:700;src:url(fonts/KaTeX_Math-BoldItalic.woff2) format("woff2"),url(fonts/KaTeX_Math-BoldItalic.woff) format("woff"),url(fonts/KaTeX_Math-BoldItalic.ttf) format("truetype")}@font-face{font-family:KaTeX_Math;font-style:italic;font-weight:400;src:url(fonts/KaTeX_Math-Italic.woff2) format("woff2"),url(fonts/KaTeX_Math-Italic.woff) format("woff"),url(fonts/KaTeX_Math-Italic.ttf) format("truetype")}@font-face{font-family:"KaTeX_SansSerif";font-style:normal;font-weight:700;src:url(fonts/KaTeX_SansSerif-Bold.woff2) format("woff2"),url(fonts/KaTeX_SansSerif-Bold.woff) format("woff"),url(fonts/KaTeX_SansSerif-Bold.ttf) format("truetype")}@font-face{font-family:"KaTeX_SansSerif";font-style:italic;font-weight:400;src:url(fonts/KaTeX_SansSerif-Italic.woff2) format("woff2"),url(fonts/KaTeX_SansSerif-Italic.woff) format("woff"),url(fonts/KaTeX_SansSerif-Italic.ttf) format("truetype")}@font-face{font-family:"KaTeX_SansSerif";font-style:normal;font-weight:400;src:url(fonts/KaTeX_SansSerif-Regular.woff2) format("woff2"),url(fonts/KaTeX_SansSerif-Regular.woff) format("woff"),url(fonts/KaTeX_SansSerif-Regular.ttf) format("truetype")}@font-face{font-family:KaTeX_Script;font-style:normal;font-weight:400;src:url(fonts/KaTeX_Script-Regular.woff2) format("woff2"),url(fonts/KaTeX_Script-Regular.woff) format("woff"),url(fonts/KaTeX_Script-Regular.ttf) format("truetype")}@font-face{font-family:KaTeX_Size1;font-style:normal;font-weight:400;src:url(fonts/KaTeX_Size1-Regular.woff2) format("woff2"),url(fonts/KaTeX_Size1-Regular.woff) format("woff"),url(fonts/KaTeX_Size1-Regular.ttf) format("truetype")}@font-face{font-family:KaTeX_Size2;font-style:normal;font-weight:400;src:url(fonts/KaTeX_Size2-Regular.woff2) format("woff2"),url(fonts/KaTeX_Size2-Regular.woff) format("woff"),url(fonts/KaTeX_Size2-Regular.ttf) 
format("truetype")}@font-face{font-family:KaTeX_Size3;font-style:normal;font-weight:400;src:url(fonts/KaTeX_Size3-Regular.woff2) format("woff2"),url(fonts/KaTeX_Size3-Regular.woff) format("woff"),url(fonts/KaTeX_Size3-Regular.ttf) format("truetype")}@font-face{font-family:KaTeX_Size4;font-style:normal;font-weight:400;src:url(fonts/KaTeX_Size4-Regular.woff2) format("woff2"),url(fonts/KaTeX_Size4-Regular.woff) format("woff"),url(fonts/KaTeX_Size4-Regular.ttf) format("truetype")}@font-face{font-family:KaTeX_Typewriter;font-style:normal;font-weight:400;src:url(fonts/KaTeX_Typewriter-Regular.woff2) format("woff2"),url(fonts/KaTeX_Typewriter-Regular.woff) format("woff"),url(fonts/KaTeX_Typewriter-Regular.ttf) format("truetype")}.katex{font:normal 1.21em KaTeX_Main,Times New Roman,serif;line-height:1.2;text-indent:0;text-rendering:auto}.katex *{-ms-high-contrast-adjust:none!important;border-color:currentColor}.katex .katex-version:after{content:"0.16.21"}.katex .katex-mathml{clip:rect(1px,1px,1px,1px);border:0;height:1px;overflow:hidden;padding:0;position:absolute;width:1px}.katex .katex-html>.newline{display:block}.katex .base{position:relative;white-space:nowrap;width:-webkit-min-content;width:-moz-min-content;width:min-content}.katex .base,.katex .strut{display:inline-block}.katex .textbf{font-weight:700}.katex .textit{font-style:italic}.katex .textrm{font-family:KaTeX_Main}.katex .textsf{font-family:KaTeX_SansSerif}.katex .texttt{font-family:KaTeX_Typewriter}.katex .mathnormal{font-family:KaTeX_Math;font-style:italic}.katex .mathit{font-family:KaTeX_Main;font-style:italic}.katex .mathrm{font-style:normal}.katex .mathbf{font-family:KaTeX_Main;font-weight:700}.katex .boldsymbol{font-family:KaTeX_Math;font-style:italic;font-weight:700}.katex .amsrm,.katex .mathbb,.katex .textbb{font-family:KaTeX_AMS}.katex .mathcal{font-family:KaTeX_Caligraphic}.katex .mathfrak,.katex .textfrak{font-family:KaTeX_Fraktur}.katex .mathboldfrak,.katex 
.textboldfrak{font-family:KaTeX_Fraktur;font-weight:700}.katex .mathtt{font-family:KaTeX_Typewriter}.katex .mathscr,.katex .textscr{font-family:KaTeX_Script}.katex .mathsf,.katex .textsf{font-family:KaTeX_SansSerif}.katex .mathboldsf,.katex .textboldsf{font-family:KaTeX_SansSerif;font-weight:700}.katex .mathitsf,.katex .mathsfit,.katex .textitsf{font-family:KaTeX_SansSerif;font-style:italic}.katex .mainrm{font-family:KaTeX_Main;font-style:normal}.katex .vlist-t{border-collapse:collapse;display:inline-table;table-layout:fixed}.katex .vlist-r{display:table-row}.katex .vlist{display:table-cell;position:relative;vertical-align:bottom}.katex .vlist>span{display:block;height:0;position:relative}.katex .vlist>span>span{display:inline-block}.katex .vlist>span>.pstrut{overflow:hidden;width:0}.katex .vlist-t2{margin-right:-2px}.katex .vlist-s{display:table-cell;font-size:1px;min-width:2px;vertical-align:bottom;width:2px}.katex .vbox{align-items:baseline;display:inline-flex;flex-direction:column}.katex .hbox{width:100%}.katex .hbox,.katex .thinbox{display:inline-flex;flex-direction:row}.katex .thinbox{max-width:0;width:0}.katex .msupsub{text-align:left}.katex .mfrac>span>span{text-align:center}.katex .mfrac .frac-line{border-bottom-style:solid;display:inline-block;width:100%}.katex .hdashline,.katex .hline,.katex .mfrac .frac-line,.katex .overline .overline-line,.katex .rule,.katex .underline .underline-line{min-height:1px}.katex .mspace{display:inline-block}.katex .clap,.katex .llap,.katex .rlap{position:relative;width:0}.katex .clap>.inner,.katex .llap>.inner,.katex .rlap>.inner{position:absolute}.katex .clap>.fix,.katex .llap>.fix,.katex .rlap>.fix{display:inline-block}.katex .llap>.inner{right:0}.katex .clap>.inner,.katex .rlap>.inner{left:0}.katex .clap>.inner>span{margin-left:-50%;margin-right:50%}.katex .rule{border:0 solid;display:inline-block;position:relative}.katex .hline,.katex .overline .overline-line,.katex .underline 
.underline-line{border-bottom-style:solid;display:inline-block;width:100%}.katex .hdashline{border-bottom-style:dashed;display:inline-block;width:100%}.katex .sqrt>.root{margin-left:.2777777778em;margin-right:-.5555555556em}.katex .fontsize-ensurer.reset-size1.size1,.katex .sizing.reset-size1.size1{font-size:1em}.katex .fontsize-ensurer.reset-size1.size2,.katex .sizing.reset-size1.size2{font-size:1.2em}.katex .fontsize-ensurer.reset-size1.size3,.katex .sizing.reset-size1.size3{font-size:1.4em}.katex .fontsize-ensurer.reset-size1.size4,.katex .sizing.reset-size1.size4{font-size:1.6em}.katex .fontsize-ensurer.reset-size1.size5,.katex .sizing.reset-size1.size5{font-size:1.8em}.katex .fontsize-ensurer.reset-size1.size6,.katex .sizing.reset-size1.size6{font-size:2em}.katex .fontsize-ensurer.reset-size1.size7,.katex .sizing.reset-size1.size7{font-size:2.4em}.katex .fontsize-ensurer.reset-size1.size8,.katex .sizing.reset-size1.size8{font-size:2.88em}.katex .fontsize-ensurer.reset-size1.size9,.katex .sizing.reset-size1.size9{font-size:3.456em}.katex .fontsize-ensurer.reset-size1.size10,.katex .sizing.reset-size1.size10{font-size:4.148em}.katex .fontsize-ensurer.reset-size1.size11,.katex .sizing.reset-size1.size11{font-size:4.976em}.katex .fontsize-ensurer.reset-size2.size1,.katex .sizing.reset-size2.size1{font-size:.8333333333em}.katex .fontsize-ensurer.reset-size2.size2,.katex .sizing.reset-size2.size2{font-size:1em}.katex .fontsize-ensurer.reset-size2.size3,.katex .sizing.reset-size2.size3{font-size:1.1666666667em}.katex .fontsize-ensurer.reset-size2.size4,.katex .sizing.reset-size2.size4{font-size:1.3333333333em}.katex .fontsize-ensurer.reset-size2.size5,.katex .sizing.reset-size2.size5{font-size:1.5em}.katex .fontsize-ensurer.reset-size2.size6,.katex .sizing.reset-size2.size6{font-size:1.6666666667em}.katex .fontsize-ensurer.reset-size2.size7,.katex .sizing.reset-size2.size7{font-size:2em}.katex .fontsize-ensurer.reset-size2.size8,.katex 
.sizing.reset-size2.size8{font-size:2.4em}.katex .fontsize-ensurer.reset-size2.size9,.katex .sizing.reset-size2.size9{font-size:2.88em}.katex .fontsize-ensurer.reset-size2.size10,.katex .sizing.reset-size2.size10{font-size:3.4566666667em}.katex .fontsize-ensurer.reset-size2.size11,.katex .sizing.reset-size2.size11{font-size:4.1466666667em}.katex .fontsize-ensurer.reset-size3.size1,.katex .sizing.reset-size3.size1{font-size:.7142857143em}.katex .fontsize-ensurer.reset-size3.size2,.katex .sizing.reset-size3.size2{font-size:.8571428571em}.katex .fontsize-ensurer.reset-size3.size3,.katex .sizing.reset-size3.size3{font-size:1em}.katex .fontsize-ensurer.reset-size3.size4,.katex .sizing.reset-size3.size4{font-size:1.1428571429em}.katex .fontsize-ensurer.reset-size3.size5,.katex .sizing.reset-size3.size5{font-size:1.2857142857em}.katex .fontsize-ensurer.reset-size3.size6,.katex .sizing.reset-size3.size6{font-size:1.4285714286em}.katex .fontsize-ensurer.reset-size3.size7,.katex .sizing.reset-size3.size7{font-size:1.7142857143em}.katex .fontsize-ensurer.reset-size3.size8,.katex .sizing.reset-size3.size8{font-size:2.0571428571em}.katex .fontsize-ensurer.reset-size3.size9,.katex .sizing.reset-size3.size9{font-size:2.4685714286em}.katex .fontsize-ensurer.reset-size3.size10,.katex .sizing.reset-size3.size10{font-size:2.9628571429em}.katex .fontsize-ensurer.reset-size3.size11,.katex .sizing.reset-size3.size11{font-size:3.5542857143em}.katex .fontsize-ensurer.reset-size4.size1,.katex .sizing.reset-size4.size1{font-size:.625em}.katex .fontsize-ensurer.reset-size4.size2,.katex .sizing.reset-size4.size2{font-size:.75em}.katex .fontsize-ensurer.reset-size4.size3,.katex .sizing.reset-size4.size3{font-size:.875em}.katex .fontsize-ensurer.reset-size4.size4,.katex .sizing.reset-size4.size4{font-size:1em}.katex .fontsize-ensurer.reset-size4.size5,.katex .sizing.reset-size4.size5{font-size:1.125em}.katex .fontsize-ensurer.reset-size4.size6,.katex 
.sizing.reset-size4.size6{font-size:1.25em}.katex .fontsize-ensurer.reset-size4.size7,.katex .sizing.reset-size4.size7{font-size:1.5em}.katex .fontsize-ensurer.reset-size4.size8,.katex .sizing.reset-size4.size8{font-size:1.8em}.katex .fontsize-ensurer.reset-size4.size9,.katex .sizing.reset-size4.size9{font-size:2.16em}.katex .fontsize-ensurer.reset-size4.size10,.katex .sizing.reset-size4.size10{font-size:2.5925em}.katex .fontsize-ensurer.reset-size4.size11,.katex .sizing.reset-size4.size11{font-size:3.11em}.katex .fontsize-ensurer.reset-size5.size1,.katex .sizing.reset-size5.size1{font-size:.5555555556em}.katex .fontsize-ensurer.reset-size5.size2,.katex .sizing.reset-size5.size2{font-size:.6666666667em}.katex .fontsize-ensurer.reset-size5.size3,.katex .sizing.reset-size5.size3{font-size:.7777777778em}.katex .fontsize-ensurer.reset-size5.size4,.katex .sizing.reset-size5.size4{font-size:.8888888889em}.katex .fontsize-ensurer.reset-size5.size5,.katex .sizing.reset-size5.size5{font-size:1em}.katex .fontsize-ensurer.reset-size5.size6,.katex .sizing.reset-size5.size6{font-size:1.1111111111em}.katex .fontsize-ensurer.reset-size5.size7,.katex .sizing.reset-size5.size7{font-size:1.3333333333em}.katex .fontsize-ensurer.reset-size5.size8,.katex .sizing.reset-size5.size8{font-size:1.6em}.katex .fontsize-ensurer.reset-size5.size9,.katex .sizing.reset-size5.size9{font-size:1.92em}.katex .fontsize-ensurer.reset-size5.size10,.katex .sizing.reset-size5.size10{font-size:2.3044444444em}.katex .fontsize-ensurer.reset-size5.size11,.katex .sizing.reset-size5.size11{font-size:2.7644444444em}.katex .fontsize-ensurer.reset-size6.size1,.katex .sizing.reset-size6.size1{font-size:.5em}.katex .fontsize-ensurer.reset-size6.size2,.katex .sizing.reset-size6.size2{font-size:.6em}.katex .fontsize-ensurer.reset-size6.size3,.katex .sizing.reset-size6.size3{font-size:.7em}.katex .fontsize-ensurer.reset-size6.size4,.katex .sizing.reset-size6.size4{font-size:.8em}.katex 
.fontsize-ensurer.reset-size6.size5,.katex .sizing.reset-size6.size5{font-size:.9em}.katex .fontsize-ensurer.reset-size6.size6,.katex .sizing.reset-size6.size6{font-size:1em}.katex .fontsize-ensurer.reset-size6.size7,.katex .sizing.reset-size6.size7{font-size:1.2em}.katex .fontsize-ensurer.reset-size6.size8,.katex .sizing.reset-size6.size8{font-size:1.44em}.katex .fontsize-ensurer.reset-size6.size9,.katex .sizing.reset-size6.size9{font-size:1.728em}.katex .fontsize-ensurer.reset-size6.size10,.katex .sizing.reset-size6.size10{font-size:2.074em}.katex .fontsize-ensurer.reset-size6.size11,.katex .sizing.reset-size6.size11{font-size:2.488em}.katex .fontsize-ensurer.reset-size7.size1,.katex .sizing.reset-size7.size1{font-size:.4166666667em}.katex .fontsize-ensurer.reset-size7.size2,.katex .sizing.reset-size7.size2{font-size:.5em}.katex .fontsize-ensurer.reset-size7.size3,.katex .sizing.reset-size7.size3{font-size:.5833333333em}.katex .fontsize-ensurer.reset-size7.size4,.katex .sizing.reset-size7.size4{font-size:.6666666667em}.katex .fontsize-ensurer.reset-size7.size5,.katex .sizing.reset-size7.size5{font-size:.75em}.katex .fontsize-ensurer.reset-size7.size6,.katex .sizing.reset-size7.size6{font-size:.8333333333em}.katex .fontsize-ensurer.reset-size7.size7,.katex .sizing.reset-size7.size7{font-size:1em}.katex .fontsize-ensurer.reset-size7.size8,.katex .sizing.reset-size7.size8{font-size:1.2em}.katex .fontsize-ensurer.reset-size7.size9,.katex .sizing.reset-size7.size9{font-size:1.44em}.katex .fontsize-ensurer.reset-size7.size10,.katex .sizing.reset-size7.size10{font-size:1.7283333333em}.katex .fontsize-ensurer.reset-size7.size11,.katex .sizing.reset-size7.size11{font-size:2.0733333333em}.katex .fontsize-ensurer.reset-size8.size1,.katex .sizing.reset-size8.size1{font-size:.3472222222em}.katex .fontsize-ensurer.reset-size8.size2,.katex .sizing.reset-size8.size2{font-size:.4166666667em}.katex .fontsize-ensurer.reset-size8.size3,.katex 
.sizing.reset-size8.size3{font-size:.4861111111em}.katex .fontsize-ensurer.reset-size8.size4,.katex .sizing.reset-size8.size4{font-size:.5555555556em}.katex .fontsize-ensurer.reset-size8.size5,.katex .sizing.reset-size8.size5{font-size:.625em}.katex .fontsize-ensurer.reset-size8.size6,.katex .sizing.reset-size8.size6{font-size:.6944444444em}.katex .fontsize-ensurer.reset-size8.size7,.katex .sizing.reset-size8.size7{font-size:.8333333333em}.katex .fontsize-ensurer.reset-size8.size8,.katex .sizing.reset-size8.size8{font-size:1em}.katex .fontsize-ensurer.reset-size8.size9,.katex .sizing.reset-size8.size9{font-size:1.2em}.katex .fontsize-ensurer.reset-size8.size10,.katex .sizing.reset-size8.size10{font-size:1.4402777778em}.katex .fontsize-ensurer.reset-size8.size11,.katex .sizing.reset-size8.size11{font-size:1.7277777778em}.katex .fontsize-ensurer.reset-size9.size1,.katex .sizing.reset-size9.size1{font-size:.2893518519em}.katex .fontsize-ensurer.reset-size9.size2,.katex .sizing.reset-size9.size2{font-size:.3472222222em}.katex .fontsize-ensurer.reset-size9.size3,.katex .sizing.reset-size9.size3{font-size:.4050925926em}.katex .fontsize-ensurer.reset-size9.size4,.katex .sizing.reset-size9.size4{font-size:.462962963em}.katex .fontsize-ensurer.reset-size9.size5,.katex .sizing.reset-size9.size5{font-size:.5208333333em}.katex .fontsize-ensurer.reset-size9.size6,.katex .sizing.reset-size9.size6{font-size:.5787037037em}.katex .fontsize-ensurer.reset-size9.size7,.katex .sizing.reset-size9.size7{font-size:.6944444444em}.katex .fontsize-ensurer.reset-size9.size8,.katex .sizing.reset-size9.size8{font-size:.8333333333em}.katex .fontsize-ensurer.reset-size9.size9,.katex .sizing.reset-size9.size9{font-size:1em}.katex .fontsize-ensurer.reset-size9.size10,.katex .sizing.reset-size9.size10{font-size:1.2002314815em}.katex .fontsize-ensurer.reset-size9.size11,.katex .sizing.reset-size9.size11{font-size:1.4398148148em}.katex .fontsize-ensurer.reset-size10.size1,.katex 
.sizing.reset-size10.size1{font-size:.2410800386em}.katex .fontsize-ensurer.reset-size10.size2,.katex .sizing.reset-size10.size2{font-size:.2892960463em}.katex .fontsize-ensurer.reset-size10.size3,.katex .sizing.reset-size10.size3{font-size:.337512054em}.katex .fontsize-ensurer.reset-size10.size4,.katex .sizing.reset-size10.size4{font-size:.3857280617em}.katex .fontsize-ensurer.reset-size10.size5,.katex .sizing.reset-size10.size5{font-size:.4339440694em}.katex .fontsize-ensurer.reset-size10.size6,.katex .sizing.reset-size10.size6{font-size:.4821600771em}.katex .fontsize-ensurer.reset-size10.size7,.katex .sizing.reset-size10.size7{font-size:.5785920926em}.katex .fontsize-ensurer.reset-size10.size8,.katex .sizing.reset-size10.size8{font-size:.6943105111em}.katex .fontsize-ensurer.reset-size10.size9,.katex .sizing.reset-size10.size9{font-size:.8331726133em}.katex .fontsize-ensurer.reset-size10.size10,.katex .sizing.reset-size10.size10{font-size:1em}.katex .fontsize-ensurer.reset-size10.size11,.katex .sizing.reset-size10.size11{font-size:1.1996142719em}.katex .fontsize-ensurer.reset-size11.size1,.katex .sizing.reset-size11.size1{font-size:.2009646302em}.katex .fontsize-ensurer.reset-size11.size2,.katex .sizing.reset-size11.size2{font-size:.2411575563em}.katex .fontsize-ensurer.reset-size11.size3,.katex .sizing.reset-size11.size3{font-size:.2813504823em}.katex .fontsize-ensurer.reset-size11.size4,.katex .sizing.reset-size11.size4{font-size:.3215434084em}.katex .fontsize-ensurer.reset-size11.size5,.katex .sizing.reset-size11.size5{font-size:.3617363344em}.katex .fontsize-ensurer.reset-size11.size6,.katex .sizing.reset-size11.size6{font-size:.4019292605em}.katex .fontsize-ensurer.reset-size11.size7,.katex .sizing.reset-size11.size7{font-size:.4823151125em}.katex .fontsize-ensurer.reset-size11.size8,.katex .sizing.reset-size11.size8{font-size:.578778135em}.katex .fontsize-ensurer.reset-size11.size9,.katex .sizing.reset-size11.size9{font-size:.6945337621em}.katex 
.fontsize-ensurer.reset-size11.size10,.katex .sizing.reset-size11.size10{font-size:.8336012862em}.katex .fontsize-ensurer.reset-size11.size11,.katex .sizing.reset-size11.size11{font-size:1em}.katex .delimsizing.size1{font-family:KaTeX_Size1}.katex .delimsizing.size2{font-family:KaTeX_Size2}.katex .delimsizing.size3{font-family:KaTeX_Size3}.katex .delimsizing.size4{font-family:KaTeX_Size4}.katex .delimsizing.mult .delim-size1>span{font-family:KaTeX_Size1}.katex .delimsizing.mult .delim-size4>span{font-family:KaTeX_Size4}.katex .nulldelimiter{display:inline-block;width:.12em}.katex .delimcenter,.katex .op-symbol{position:relative}.katex .op-symbol.small-op{font-family:KaTeX_Size1}.katex .op-symbol.large-op{font-family:KaTeX_Size2}.katex .accent>.vlist-t,.katex .op-limits>.vlist-t{text-align:center}.katex .accent .accent-body{position:relative}.katex .accent .accent-body:not(.accent-full){width:0}.katex .overlay{display:block}.katex .mtable .vertical-separator{display:inline-block;min-width:1px}.katex .mtable .arraycolsep{display:inline-block}.katex .mtable .col-align-c>.vlist-t{text-align:center}.katex .mtable .col-align-l>.vlist-t{text-align:left}.katex .mtable .col-align-r>.vlist-t{text-align:right}.katex .svg-align{text-align:left}.katex svg{fill:currentColor;stroke:currentColor;fill-rule:nonzero;fill-opacity:1;stroke-width:1;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1;display:block;height:inherit;position:absolute;width:100%}.katex svg path{stroke:none}.katex img{border-style:none;max-height:none;max-width:none;min-height:0;min-width:0}.katex .stretchy{display:block;overflow:hidden;position:relative;width:100%}.katex .stretchy:after,.katex .stretchy:before{content:""}.katex .hide-tail{overflow:hidden;position:relative;width:100%}.katex .halfarrow-left{left:0;overflow:hidden;position:absolute;width:50.2%}.katex .halfarrow-right{overflow:hidden;position:absolute;right:0;width:50.2%}.katex 
.brace-left{left:0;overflow:hidden;position:absolute;width:25.1%}.katex .brace-center{left:25%;overflow:hidden;position:absolute;width:50%}.katex .brace-right{overflow:hidden;position:absolute;right:0;width:25.1%}.katex .x-arrow-pad{padding:0 .5em}.katex .cd-arrow-pad{padding:0 .55556em 0 .27778em}.katex .mover,.katex .munder,.katex .x-arrow{text-align:center}.katex .boxpad{padding:0 .3em}.katex .fbox,.katex .fcolorbox{border:.04em solid;box-sizing:border-box}.katex .cancel-pad{padding:0 .2em}.katex .cancel-lap{margin-left:-.2em;margin-right:-.2em}.katex .sout{border-bottom-style:solid;border-bottom-width:.08em}.katex .angl{border-right:.049em solid;border-top:.049em solid;box-sizing:border-box;margin-right:.03889em}.katex .anglpad{padding:0 .03889em}.katex .eqn-num:before{content:"(" counter(katexEqnNo) ")";counter-increment:katexEqnNo}.katex .mml-eqn-num:before{content:"(" counter(mmlEqnNo) ")";counter-increment:mmlEqnNo}.katex .mtr-glue{width:50%}.katex .cd-vert-arrow{display:inline-block;position:relative}.katex .cd-label-left{display:inline-block;position:absolute;right:calc(50% + .3em);text-align:left}.katex .cd-label-right{display:inline-block;left:calc(50% + .3em);position:absolute;text-align:right}.katex-display{display:block;margin:1em 0;text-align:center}.katex-display>.katex{display:block;text-align:center;white-space:nowrap}.katex-display>.katex>.katex-html{display:block;position:relative}.katex-display>.katex>.katex-html>.tag{position:absolute;right:0}.katex-display.leqno>.katex>.katex-html>.tag{left:0;right:auto}.katex-display.fleqn>.katex{padding-left:2em;text-align:left}body{counter-reset:katexEqnNo mmlEqnNo}
```