```
├── .cursorignore
├── .github/
│   ├── ISSUE_TEMPLATE/
│   │   ├── bug_report.yaml
│   │   └── feature_request.yaml
│   ├── PULL_REQUEST_TEMPLATE/
│   │   └── pr_form.yml
│   ├── dependabot.yml
│   ├── labels.yml
│   ├── release-drafter.yml
│   └── workflows/
│       ├── codeql.yml
│       ├── docs.yml
│       ├── labeler.yml
│       ├── lint.yml
│       ├── pr-lint.yml
│       ├── publish-to-pypi.yml
│       └── test.yml
├── .gitignore
├── .pre-commit-config.yaml
├── LICENSE
├── README.md
└── babeldoc/
    ├── __init__.py
    ├── assets/
    │   ├── assets.py
    │   └── embedding_assets_metadata.py
    ├── asynchronize/
    │   └── __init__.py
    ├── const.py
    ├── converter.py
    └── document_il/
        ├── __init__.py
        ├── babeldoc_exception/
        │   └── BabelDOCException.py
        ├── backend/
        │   ├── __init__.py
        │   └── pdf_creater.py
        ├── frontend/
        │   ├── __init__.py
        │   └── il_creater.py
        ├── il_version_1.py
        ├── il_version_1.rnc
        ├── il_version_1.rng
        └── il_version_1.xsd
```

## /.cursorignore

```cursorignore path="/.cursorignore"
# Project notes and templates
xnotes/
```

## /.github/ISSUE_TEMPLATE/bug_report.yaml

```yaml path="/.github/ISSUE_TEMPLATE/bug_report.yaml"
name: "🐞 Bug Report"
description: Create a report to help us improve
labels: ['bug']
body:
  - type: checkboxes
    id: checks
    attributes:
      label: Before you submit
      options:
        - label: I have searched existing issues
          required: true
        - label: I spent at least 5 minutes investigating and preparing this report
          required: true
        - label: I confirmed this is not caused by a network issue
          required: true
  - type: markdown
    attributes:
      value: |
        Thank you for using **BabelDOC** and helping us improve it! 🙏
  - type: textarea
    id: environment
    attributes:
      label: Environment
      description: Provide your system details (required)
      value: |
        - OS:
        - Python:
        - BabelDOC:
      render: markdown
    validations:
      required: true
  - type: textarea
    id: describe
    attributes:
      label: Describe the bug
      description: A clear and concise description of what the bug is.
    validations:
      required: true
  - type: textarea
    id: reproduce
    attributes:
      label: Steps to Reproduce
      description: Help us reproduce the issue
      value: |
        1. Go to '...'
        2. Click on '...'
        3. See error
    validations:
      required: false
  - type: textarea
    id: expected
    attributes:
      label: Expected Behavior
      description: What did you expect to happen?
    validations:
      required: false
  - type: textarea
    id: logs
    attributes:
      label: Relevant Log Output or Screenshots
      description: Copy and paste any logs or attach screenshots. This will be formatted automatically.
      render: text
    validations:
      required: false
  - type: textarea
    id: pdf
    attributes:
      label: Original PDF File
      description: Upload the input PDF if applicable.
    validations:
      required: false
  - type: textarea
    id: others
    attributes:
      label: Additional Context
      description: Anything else we should know?
    validations:
      required: false
```

## /.github/ISSUE_TEMPLATE/feature_request.yaml

```yaml path="/.github/ISSUE_TEMPLATE/feature_request.yaml"
name: "✨ Feature Request"
description: Suggest a new idea or improvement for BabelDOC
labels: ['enhancement']
body:
  - type: markdown
    attributes:
      value: |
        Thank you for helping improve **BabelDOC**!
        Please fill out the form below to suggest a feature.
  - type: textarea
    id: describe
    attributes:
      label: Is your feature request related to a problem?
      description: If applicable, describe what problem this feature would solve.
      placeholder: Ex. I'm always frustrated when ...
    validations:
      required: false
  - type: textarea
    id: solution
    attributes:
      label: Describe the solution you'd like
      description: What would you like to see happen?
    validations:
      required: true
  - type: textarea
    id: alternatives
    attributes:
      label: Describe alternatives you've considered
      description: Have you thought of other ways to solve this?
    validations:
      required: false
  - type: textarea
    id: additional
    attributes:
      label: Additional context
      description: Any other context, examples, or screenshots?
    validations:
      required: false
```

## /.github/PULL_REQUEST_TEMPLATE/pr_form.yml

```yml path="/.github/PULL_REQUEST_TEMPLATE/pr_form.yml"
name: Pull Request
description: Submit a pull request to contribute to BabelDOC
title: "[PR] "
labels:
  - needs triage
body:
  - type: markdown
    attributes:
      value: |
        ## 👋 Thanks for contributing to **BabelDOC**!
        Please fill out this form to help us review your pull request effectively.
  - type: input
    id: issue
    attributes:
      label: Related Issue(s)
      description: If this pull request closes or is related to one or more issues, list them here (e.g., #37)
      placeholder: "#37"
    validations:
      required: false
  - type: textarea
    id: summary
    attributes:
      label: Description
      description: Describe the purpose of this pull request and what was changed.
      placeholder: |
        - What does this PR introduce or fix?
        - What is the motivation behind it?
    validations:
      required: true
  - type: dropdown
    id: pr_type
    attributes:
      label: PR Type
      description: What kind of change is this?
      multiple: true
      options:
        - enhancement
        - bug
        - documentation
        - refactor
        - test
        - chore
    validations:
      required: true
  - type: checkboxes
    id: checklist
    attributes:
      label: Contributor Checklist
      options:
        - label: I’ve read the **CONTRIBUTING.md** guide
          required: true
        - label: My changes follow the project’s code style and guidelines
          required: true
        - label: I’ve linked the related issue(s) in the description above
        - label: I’ve updated relevant documentation (if applicable)
        - label: I’ve added necessary tests (if applicable)
        - label: All new and existing tests passed locally
        - label: I understand that due to limited maintainer resources, only small pull requests are accepted. Suggestions with proof-of-concept patches are appreciated, and my patch may be rewritten if necessary.
  - type: textarea
    id: testing
    attributes:
      label: Testing Instructions
      description: Provide step-by-step instructions on how to test your changes
      placeholder: |
        1. Run `...`
        2. Visit `...`
        3. Click `...`
        4. Verify `...`
    validations:
      required: false
  - type: textarea
    id: screenshots
    attributes:
      label: Screenshots (if applicable)
      description: If UI changes were made, please attach before/after screenshots.
    validations:
      required: false
  - type: textarea
    id: notes
    attributes:
      label: Additional Notes
      description: Anything else the reviewer should know?
    validations:
      required: false
```

## /.github/dependabot.yml

```yml path="/.github/dependabot.yml"
version: 2
updates:
  - package-ecosystem: github-actions
    directory: "/"
    schedule:
      interval: weekly
  # - package-ecosystem: pip
  #   directory: "/.github/workflows"
  #   schedule:
  #     interval: weekly
  # - package-ecosystem: pip
  #   directory: "/docs"
  #   schedule:
  #     interval: weekly
  - package-ecosystem: pip
    directory: "/"
    schedule:
      interval: weekly
    versioning-strategy: lockfile-only
    allow:
      - dependency-type: "all"
```

## /.github/labels.yml

```yml path="/.github/labels.yml"
---
# Labels names are important as they are used by Release Drafter to decide
# regarding where to record them in changelog or if to skip them.
#
# The repository labels will be automatically configured using this file and
# the GitHub Action https://github.com/marketplace/actions/github-labeler.
- name: breaking
  description: Breaking Changes
  color: "bfd4f2"
- name: bug
  description: Something isn't working
  color: "d73a4a"
- name: build
  description: Build System and Dependencies
  color: "bfdadc"
- name: ci
  description: Continuous Integration
  color: "4a97d6"
- name: dependencies
  description: Pull requests that update a dependency file
  color: "0366d6"
- name: documentation
  description: Improvements or additions to documentation
  color: "0075ca"
- name: duplicate
  description: This issue or pull request already exists
  color: "cfd3d7"
- name: enhancement
  description: New feature or request
  color: "a2eeef"
- name: github_actions
  description: Pull requests that update Github_actions code
  color: "000000"
- name: good first issue
  description: Good for newcomers
  color: "7057ff"
- name: help wanted
  description: Extra attention is needed
  color: "008672"
- name: invalid
  description: This doesn't seem right
  color: "e4e669"
- name: performance
  description: Performance
  color: "016175"
- name: python
  description: Pull requests that update Python code
  color: "2b67c6"
- name: question
  description: Further information is requested
  color: "d876e3"
- name: refactoring
  description: Refactoring
  color: "ef67c4"
- name: removal
  description: Removals and Deprecations
  color: "9ae7ea"
- name: style
  description: Style
  color: "c120e5"
- name: testing
  description: Testing
  color: "b1fc6f"
- name: wontfix
  description: This will not be worked on
  color: "ffffff"
```

## /.github/release-drafter.yml

```yml path="/.github/release-drafter.yml"
name-template: 'v$RESOLVED_VERSION'
tag-template: 'v$RESOLVED_VERSION'
categories:
  - title: '🚀 Features'
    labels:
      - 'feature'
      - 'enhancement'
  - title: '🐛 Bug Fixes'
    labels:
      - 'fix'
      - 'bugfix'
      - 'bug'
  - title: '🧰 Maintenance'
    labels:
      - 'chore'
      - 'maintenance'
      - 'refactor'
  - title: '📝 Documentation'
    labels:
      - 'docs'
      - 'documentation'
change-template: '- $TITLE @$AUTHOR (#$NUMBER)'
change-title-escapes: '\<*_&' # You can add # and @ to disable mentions
version-resolver:
  major:
    labels:
      - 'major'
  minor:
    labels:
      - 'minor'
  patch:
    labels:
      - 'patch'
  default: patch
template: |
  ## Changes

  $CHANGES

  ## Contributors

  $CONTRIBUTORS
```

## /.github/workflows/codeql.yml

```yml path="/.github/workflows/codeql.yml"
# For most projects, this workflow file will not need changing; you simply need
# to commit it to your repository.
#
# You may wish to alter this file to override the set of languages analyzed,
# or to provide custom queries or build logic.
#
# ******** NOTE ********
# We have attempted to detect the languages in your repository. Please check
# the `language` matrix defined below to confirm you have the correct set of
# supported CodeQL languages.
#
name: "CodeQL Advanced"

on:
  push:
  pull_request:
    branches: [ "main" ]
  schedule:
    - cron: '36 14 * * 1'

jobs:
  analyze:
    name: Analyze (${{ matrix.language }})
    # Runner size impacts CodeQL analysis time. To learn more, please see:
    #   - https://gh.io/recommended-hardware-resources-for-running-codeql
    #   - https://gh.io/supported-runners-and-hardware-resources
    #   - https://gh.io/using-larger-runners (GitHub.com only)
    # Consider using larger runners or machines with greater resources for possible analysis time improvements.
    runs-on: ${{ (matrix.language == 'swift' && 'macos-latest') || 'ubuntu-latest' }}
    permissions:
      # required for all workflows
      security-events: write
      # required to fetch internal or private CodeQL packs
      packages: read
      # only required for workflows in private repositories
      actions: read
      contents: read
    strategy:
      fail-fast: false
      matrix:
        include:
          - language: python
            build-mode: none
          - language: actions
        # CodeQL supports the following values keywords for 'language': 'c-cpp', 'csharp', 'go', 'java-kotlin', 'javascript-typescript', 'python', 'ruby', 'swift'
        # Use `c-cpp` to analyze code written in C, C++ or both
        # Use 'java-kotlin' to analyze code written in Java, Kotlin or both
        # Use 'javascript-typescript' to analyze code written in JavaScript, TypeScript or both
        # To learn more about changing the languages that are analyzed or customizing the build mode for your analysis,
        # see https://docs.github.com/en/code-security/code-scanning/creating-an-advanced-setup-for-code-scanning/customizing-your-advanced-setup-for-code-scanning.
        # If you are analyzing a compiled language, you can modify the 'build-mode' for that language to customize how
        # your codebase is analyzed, see https://docs.github.com/en/code-security/code-scanning/creating-an-advanced-setup-for-code-scanning/codeql-code-scanning-for-compiled-languages
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      # Initializes the CodeQL tools for scanning.
      - name: Initialize CodeQL
        uses: github/codeql-action/init@v3
        with:
          languages: ${{ matrix.language }}
          build-mode: ${{ matrix.build-mode }}
          # If you wish to specify custom queries, you can do so here or in a config file.
          # By default, queries listed here will override any specified in a config file.
          # Prefix the list here with "+" to use these queries and those in the config file.
          # For more details on CodeQL's query packs, refer to: https://docs.github.com/en/code-security/code-scanning/automatically-scanning-your-code-for-vulnerabilities-and-errors/configuring-code-scanning#using-queries-in-ql-packs
          # queries: security-extended,security-and-quality

      # If the analyze step fails for one of the languages you are analyzing with
      # "We were unable to automatically build your code", modify the matrix above
      # to set the build mode to "manual" for that language. Then modify this step
      # to build your code.
      # ℹ️ Command-line programs to run using the OS shell.
      # 📚 See https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepsrun
      - if: matrix.build-mode == 'manual'
        shell: bash
        run: |
          echo 'If you are using a "manual" build mode for one or more of the' \
            'languages you are analyzing, replace this with the commands to build' \
            'your code, for example:'
          echo '  make bootstrap'
          echo '  make release'
          exit 1

      - name: Perform CodeQL Analysis
        uses: github/codeql-action/analyze@v3
        with:
          category: "/language:${{matrix.language}}"
```

## /.github/workflows/docs.yml

```yml path="/.github/workflows/docs.yml"
name: docs
on:
  push:
    branches:
      - main
permissions:
  contents: write
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Configure Git Credentials
        run: |
          git config user.name github-actions[bot]
          git config user.email 41898282+github-actions[bot]@users.noreply.github.com
      - name: Setup uv with Python 3.12
        uses: astral-sh/setup-uv@d4b2f3b6ecc6e67c4457f6d3e41ec42d3d0fcb86 # v5.4.2
        with:
          python-version: "3.12"
          enable-cache: true
          cache-dependency-glob: "uv.lock"
      - run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV
      - uses: actions/cache@v4
        with:
          key: mkdocs-material-${{ env.cache_id }}
          path: .cache
          restore-keys: |
            mkdocs-material-
      - run: uv sync
      - run: uv run mkdocs gh-deploy --force
```

## /.github/workflows/labeler.yml

```yml path="/.github/workflows/labeler.yml"
name: Labeler
on:
  push:
    branches:
      - 'main'
    paths:
      - '.github/labels.yml'
      - '.github/workflows/labels.yml'
  pull_request:
    paths:
      - '.github/labels.yml'
      - '.github/workflows/labels.yml'
permissions:
  contents: read
  issues: write
  pull-requests: write
jobs:
  labeler:
    runs-on: ubuntu-latest
    steps:
      - name: Check out the repository
        uses: actions/checkout@v4
      - name: Run Labeler
        uses: crazy-max/ghaction-github-labeler@24d110aa46a59976b8a7f35518cb7f14f434c916 # v5.3.0
        with:
          skip-delete: true
          dry-run: ${{ github.event_name == 'pull_request' }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
          yaml-file: .github/labels.yml
          exclude: |
            help*
            *issue
```

## /.github/workflows/lint.yml

```yml path="/.github/workflows/lint.yml"
name: Lint Code
permissions:
  contents: read
  pull-requests: write
on: [push]
jobs:
  lint:
    strategy:
      fail-fast: false
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Ruff
        uses: astral-sh/ruff-action@v3
      - name: AutoCorrect
        uses: huacnlee/autocorrect-action@main
```

## /.github/workflows/pr-lint.yml

```yml path="/.github/workflows/pr-lint.yml"
name: Lint Code and Review Dog Report
on: [pull_request]
permissions:
  contents: read
  pull-requests: write
jobs:
  ruff:
    name: runner / ruff
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install ruff
        run: pip install ruff
      - name: Install reviewdog
        uses: reviewdog/action-setup@e04ffabe3898a0af8d0fb1af00c188831c4b5893 # v1.3.2
        with:
          reviewdog_version: latest
      - name: Run ruff with reviewdog
        env:
          REVIEWDOG_GITHUB_API_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          ruff check . --output-format=rdjson | reviewdog -f=rdjson -reporter=github-pr-review -fail-on-error
  autocorrect:
    name: runner / autocorrect
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: AutoCorrect
        uses: huacnlee/autocorrect-action@bf91ab3904c2908dd8e71312a8a83ed1eb632997 # v2.13.3
      - name: Report ReviewDog
        if: failure()
        uses: huacnlee/autocorrect-action@bf91ab3904c2908dd8e71312a8a83ed1eb632997 # v2.13.3
        env:
          REVIEWDOG_GITHUB_API_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          reviewdog: true
```

## /.github/workflows/publish-to-pypi.yml

```yml path="/.github/workflows/publish-to-pypi.yml"
name: Release
on:
  push:
    branches:
      - main
      - master
permissions:
  id-token: write
  contents: write
  pull-requests: write
jobs:
  check-repository:
    name: Check if running in main repository
    runs-on: ubuntu-latest
    outputs:
      is_main_repo: ${{ github.repository == 'funstory-ai/BabelDOC' }}
    steps:
      - run: echo "Running repository check"

  build:
    name: Build distribution 📦
    needs: check-repository
    if: needs.check-repository.outputs.is_main_repo == 'true'
    runs-on: ubuntu-latest
    outputs:
      is_release: ${{ steps.check-version.outputs.tag }}
    steps:
      - uses: actions/checkout@v4
        with:
          persist-credentials: true
          fetch-depth: 2
          token: ${{ secrets.GITHUB_TOKEN }}
      - name: Setup uv with Python 3.12
        uses: astral-sh/setup-uv@d4b2f3b6ecc6e67c4457f6d3e41ec42d3d0fcb86 # v5.4.2
        with:
          python-version: "3.12"
          enable-cache: true
          cache-dependency-glob: "uv.lock"
      - name: Check if there is a parent commit
        id: check-parent-commit
        run: |
          echo "sha=$(git rev-parse --verify --quiet HEAD^)" >> $GITHUB_OUTPUT
      - name: Detect and tag new version
        id: check-version
        if: steps.check-parent-commit.outputs.sha
        uses: salsify/action-detect-and-tag-new-version@b1778166f13188a9d478e2d1198f993011ba9864 # v2.0.3
        with:
          version-command: |
            cat pyproject.toml | grep "version = " | head -n 1 | awk -F'"' '{print $2}'
      - name: Install Dependencies
        run: |
          uv sync
      - name: Bump version for developmental release
steps.check-version.outputs.tag" run: | version=$(bumpver update --patch --tag=final --dry 2>&1 | grep "New Version" | awk '{print $NF}') && bumpver update --set-version $version.dev$(date +%s) - name: Build package run: "uv build" - name: Store the distribution packages uses: actions/upload-artifact@v4.6.2 with: name: python-package-distributions path: dist/ publish-to-pypi: name: Publish Python 🐍 distribution 📦 to PyPI if: needs.build.outputs.is_release != '' needs: - check-repository - build runs-on: ubuntu-latest environment: name: pypi url: https://pypi.org/p/BabelDOC permissions: id-token: write steps: - name: Download all the dists uses: actions/download-artifact@95815c38cf2ff2164869cbab79da8d1f422bc89e # v4.2.1 with: name: python-package-distributions path: dist/ - name: Publish distribution 📦 to PyPI uses: pypa/gh-action-pypi-publish@76f52bc884231f62b9a034ebfe128415bbaabdfc # v1.12.4 publish-to-testpypi: name: Publish Python 🐍 distribution 📦 to TestPyPI if: needs.build.outputs.is_release == '' needs: - check-repository - build runs-on: ubuntu-latest environment: name: testpypi url: https://test.pypi.org/p/BabelDOC permissions: id-token: write steps: - name: Download all the dists uses: actions/download-artifact@95815c38cf2ff2164869cbab79da8d1f422bc89e # v4.2.1 with: name: python-package-distributions path: dist/ - name: Publish distribution 📦 to TestPyPI uses: pypa/gh-action-pypi-publish@76f52bc884231f62b9a034ebfe128415bbaabdfc # v1.12.4 with: repository-url: https://test.pypi.org/legacy/ post-release: name: Post Release Tasks needs: - check-repository - build - publish-to-pypi - publish-to-testpypi if: | always() && needs.check-repository.outputs.is_main_repo == 'true' && (needs.publish-to-pypi.result == 'success' || needs.publish-to-testpypi.result == 'success') runs-on: ubuntu-latest permissions: contents: write pull-requests: write steps: - uses: actions/checkout@v4 with: persist-credentials: true fetch-depth: 2 token: ${{ secrets.GITHUB_TOKEN }} - name: Publish the release notes uses: release-drafter/release-drafter@b1476f6e6eb133afa41ed8589daba6dc69b4d3f5 # v6.1.0 with: publish: ${{ needs.build.outputs.is_release != '' }} tag: ${{ needs.build.outputs.is_release }} env: GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} ``` ## /.github/workflows/test.yml ```yml path="/.github/workflows/test.yml" name: Run Tests 🧪 on: push: pull_request: branches: ["main"] permissions: contents: read pull-requests: read jobs: test: name: Run Python Tests runs-on: ubuntu-latest strategy: matrix: python-version: ["3.10", "3.11", "3.12"] steps: - uses: actions/checkout@v4 with: persist-credentials: false - name: Cached Assets id: cache-assets uses: actions/cache@v4.2.0 with: path: ~/.cache/babeldoc key: babeldoc-assets-${{ hashFiles('babeldoc/assets/embedding_assets_metadata.py') }} - name: Setup uv with Python ${{ matrix.python-version }} uses: astral-sh/setup-uv@d4b2f3b6ecc6e67c4457f6d3e41ec42d3d0fcb86 # v5.4.2 with: python-version: ${{ matrix.python-version }} enable-cache: true cache-dependency-glob: "uv.lock" - name: Warm up cache run: | uv run babeldoc --warmup - name: Run tests env: OPENAI_API_KEY: ${{ secrets.OPENAIAPIKEY }} OPENAI_BASE_URL: ${{ secrets.OPENAIAPIURL }} OPENAI_MODEL: ${{ secrets.OPENAIMODEL }} run: | uv run babeldoc --help uv run babeldoc --openai --files examples/ci/test.pdf --openai-api-key ${{ env.OPENAI_API_KEY }} --openai-base-url ${{ env.OPENAI_BASE_URL }} --openai-model ${{ env.OPENAI_MODEL }} - name: Generate offline assets package run: | uv run babeldoc 
          uv run babeldoc --generate-offline-assets /tmp/offline_assets
      - name: Restore offline assets package
        run: |
          rm -rf ~/.cache/babeldoc
          uv run babeldoc --restore-offline-assets /tmp/offline_assets
      - name: Clean up
        run: |
          rm -rf /tmp/offline_assets
          rm -rf ~/.cache/babeldoc/cache.v1.db
          rm -rf ~/.cache/babeldoc/working
```

## /.gitignore

```gitignore path="/.gitignore"
# Logs
web/logs
web/*.log
web/npm-debug.log*
web/yarn-debug.log*
web/yarn-error.log*
web/pnpm-debug.log*
web/lerna-debug.log*

web/node_modules
web/dist
web/dist-ssr
web/*.local

memray*
**/*.so
*.pdf
*.docx
*.json
**/*.pyc
.venv
.idea
*.egg-info
.DS_Store
.vscode
__pycache__
.ruff_cache
yadt.toml
examples/
/make_gif.py
/dist
.cache
.cursor/rules/_*.mdc
/.cursor
/xnotes
/docs/workflow-rules.md
```

## /.pre-commit-config.yaml

```yaml path="/.pre-commit-config.yaml"
files: '^.*\.py$'
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    # Ruff version.
    rev: v0.9.5
    hooks:
      # Run the linter.
      - id: ruff
        args: [ "--fix", "--ignore=E203,E261,E501,E741,F841" ]
      # Run the formatter.
      - id: ruff-format
```
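
For contributors, these hooks lint and format staged Python files with Ruff. A minimal sketch of enabling them locally, assuming the standard `pre-commit` CLI:

```bash
# Assumes the standard pre-commit CLI (https://pre-commit.com).
uv tool install pre-commit   # or: pip install pre-commit

# Register the hooks from .pre-commit-config.yaml with git
pre-commit install

# Optionally run every hook against the whole repository once
pre-commit run --all-files
```
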
## /LICENSE

``` path="/LICENSE"
GNU AFFERO GENERAL PUBLIC LICENSE
Version 3, 19 November 2007

Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.

Preamble

The GNU Affero General Public License is a free, copyleft license for software and other kinds of works, specifically designed to ensure cooperation with the community in the case of network server software.

The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, our General Public Licenses are intended to guarantee your freedom to share and change all versions of a program--to make sure it remains free software for all its users.

When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things.

Developers that use our General Public Licenses protect your rights with two steps: (1) assert copyright on the software, and (2) offer you this License which gives you legal permission to copy, distribute and/or modify the software.

A secondary benefit of defending all users' freedom is that improvements made in alternate versions of the program, if they receive widespread use, become available for other developers to incorporate. Many developers of free software are heartened and encouraged by the resulting cooperation. However, in the case of software used on network servers, this result may fail to come about. The GNU General Public License permits making a modified version and letting the public access it on a server without ever releasing its source code to the public.

The GNU Affero General Public License is designed specifically to ensure that, in such cases, the modified source code becomes available to the community. It requires the operator of a network server to provide the source code of the modified version running there to the users of that server. Therefore, public use of a modified version, on a publicly accessible server, gives the public access to the source code of the modified version.

An older license, called the Affero General Public License and published by Affero, was designed to accomplish similar goals. This is a different license, not a version of the Affero GPL, but Affero has released a new version of the Affero GPL which permits relicensing under this license.

The precise terms and conditions for copying, distribution and modification follow.

TERMS AND CONDITIONS

0. Definitions.

"This License" refers to version 3 of the GNU Affero General Public License.

"Copyright" also means copyright-like laws that apply to other kinds of works, such as semiconductor masks.

"The Program" refers to any copyrightable work licensed under this License. Each licensee is addressed as "you". "Licensees" and "recipients" may be individuals or organizations.

To "modify" a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a "modified version" of the earlier work or a work "based on" the earlier work.

A "covered work" means either the unmodified Program or a work based on the Program.

To "propagate" a work means to do anything with it that, without permission, would make you directly or secondarily liable for infringement under applicable copyright law, except executing it on a computer or modifying a private copy. Propagation includes copying, distribution (with or without modification), making available to the public, and in some countries other activities as well.

To "convey" a work means any kind of propagation that enables other parties to make or receive copies. Mere interaction with a user through a computer network, with no transfer of a copy, is not conveying.

An interactive user interface displays "Appropriate Legal Notices" to the extent that it includes a convenient and prominently visible feature that (1) displays an appropriate copyright notice, and (2) tells the user that there is no warranty for the work (except to the extent that warranties are provided), that licensees may convey the work under this License, and how to view a copy of this License. If the interface presents a list of user commands or options, such as a menu, a prominent item in the list meets this criterion.

1. Source Code.

The "source code" for a work means the preferred form of the work for making modifications to it. "Object code" means any non-source form of a work.

A "Standard Interface" means an interface that either is an official standard defined by a recognized standards body, or, in the case of interfaces specified for a particular programming language, one that is widely used among developers working in that language.

The "System Libraries" of an executable work include anything, other than the work as a whole, that (a) is included in the normal form of packaging a Major Component, but which is not part of that Major Component, and (b) serves only to enable use of the work with that Major Component, or to implement a Standard Interface for which an implementation is available to the public in source code form. A "Major Component", in this context, means a major essential component (kernel, window system, and so on) of the specific operating system (if any) on which the executable work runs, or a compiler used to produce the work, or an object code interpreter used to run it.

The "Corresponding Source" for a work in object code form means all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities. However, it does not include the work's System Libraries, or general-purpose tools or generally available free programs which are used unmodified in performing those activities but which are not part of the work. For example, Corresponding Source includes interface definition files associated with source files for the work, and the source code for shared libraries and dynamically linked subprograms that the work is specifically designed to require, such as by intimate data communication or control flow between those subprograms and other parts of the work.

The Corresponding Source need not include anything that users can regenerate automatically from other parts of the Corresponding Source.

The Corresponding Source for a work in source code form is that same work.

2. Basic Permissions.

All rights granted under this License are granted for the term of copyright on the Program, and are irrevocable provided the stated conditions are met. This License explicitly affirms your unlimited permission to run the unmodified Program. The output from running a covered work is covered by this License only if the output, given its content, constitutes a covered work. This License acknowledges your rights of fair use or other equivalent, as provided by copyright law.

You may make, run and propagate covered works that you do not convey, without conditions so long as your license otherwise remains in force. You may convey covered works to others for the sole purpose of having them make modifications exclusively for you, or provide you with facilities for running those works, provided that you comply with the terms of this License in conveying all material for which you do not control copyright. Those thus making or running the covered works for you must do so exclusively on your behalf, under your direction and control, on terms that prohibit them from making any copies of your copyrighted material outside their relationship with you.

Conveying under any other circumstances is permitted solely under the conditions stated below. Sublicensing is not allowed; section 10 makes it unnecessary.

3. Protecting Users' Legal Rights From Anti-Circumvention Law.

No covered work shall be deemed part of an effective technological measure under any applicable law fulfilling obligations under article 11 of the WIPO copyright treaty adopted on 20 December 1996, or similar laws prohibiting or restricting circumvention of such measures.

When you convey a covered work, you waive any legal power to forbid circumvention of technological measures to the extent such circumvention is effected by exercising rights under this License with respect to the covered work, and you disclaim any intention to limit operation or modification of the work as a means of enforcing, against the work's users, your or third parties' legal rights to forbid circumvention of technological measures.

4. Conveying Verbatim Copies.

You may convey verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice; keep intact all notices stating that this License and any non-permissive terms added in accord with section 7 apply to the code; keep intact all notices of the absence of any warranty; and give all recipients a copy of this License along with the Program.

You may charge any price or no price for each copy that you convey, and you may offer support or warranty protection for a fee.

5. Conveying Modified Source Versions.

You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions:

a) The work must carry prominent notices stating that you modified it, and giving a relevant date.

b) The work must carry prominent notices stating that it is released under this License and any conditions added under section 7. This requirement modifies the requirement in section 4 to "keep intact all notices".

c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it.

d) If the work has interactive user interfaces, each must display Appropriate Legal Notices; however, if the Program has interactive interfaces that do not display Appropriate Legal Notices, your work need not make them do so.

A compilation of a covered work with other separate and independent works, which are not by their nature extensions of the covered work, and which are not combined with it such as to form a larger program, in or on a volume of a storage or distribution medium, is called an "aggregate" if the compilation and its resulting copyright are not used to limit the access or legal rights of the compilation's users beyond what the individual works permit. Inclusion of a covered work in an aggregate does not cause this License to apply to the other parts of the aggregate.

6. Conveying Non-Source Forms.

You may convey a covered work in object code form under the terms of sections 4 and 5, provided that you also convey the machine-readable Corresponding Source under the terms of this License, in one of these ways:

a) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by the Corresponding Source fixed on a durable physical medium customarily used for software interchange.

b) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by a written offer, valid for at least three years and valid for as long as you offer spare parts or customer support for that product model, to give anyone who possesses the object code either (1) a copy of the Corresponding Source for all the software in the product that is covered by this License, on a durable physical medium customarily used for software interchange, for a price no more than your reasonable cost of physically performing this conveying of source, or (2) access to copy the Corresponding Source from a network server at no charge.

c) Convey individual copies of the object code with a copy of the written offer to provide the Corresponding Source. This alternative is allowed only occasionally and noncommercially, and only if you received the object code with such an offer, in accord with subsection 6b.

d) Convey the object code by offering access from a designated place (gratis or for a charge), and offer equivalent access to the Corresponding Source in the same way through the same place at no further charge. You need not require recipients to copy the Corresponding Source along with the object code. If the place to copy the object code is a network server, the Corresponding Source may be on a different server (operated by you or a third party) that supports equivalent copying facilities, provided you maintain clear directions next to the object code saying where to find the Corresponding Source. Regardless of what server hosts the Corresponding Source, you remain obligated to ensure that it is available for as long as needed to satisfy these requirements.

e) Convey the object code using peer-to-peer transmission, provided you inform other peers where the object code and Corresponding Source of the work are being offered to the general public at no charge under subsection 6d.

A separable portion of the object code, whose source code is excluded from the Corresponding Source as a System Library, need not be included in conveying the object code work.

A "User Product" is either (1) a "consumer product", which means any tangible personal property which is normally used for personal, family, or household purposes, or (2) anything designed or sold for incorporation into a dwelling. In determining whether a product is a consumer product, doubtful cases shall be resolved in favor of coverage. For a particular product received by a particular user, "normally used" refers to a typical or common use of that class of product, regardless of the status of the particular user or of the way in which the particular user actually uses, or expects or is expected to use, the product. A product is a consumer product regardless of whether the product has substantial commercial, industrial or non-consumer uses, unless such uses represent the only significant mode of use of the product.

"Installation Information" for a User Product means any methods, procedures, authorization keys, or other information required to install and execute modified versions of a covered work in that User Product from a modified version of its Corresponding Source. The information must suffice to ensure that the continued functioning of the modified object code is in no case prevented or interfered with solely because modification has been made.

If you convey an object code work under this section in, or with, or specifically for use in, a User Product, and the conveying occurs as part of a transaction in which the right of possession and use of the User Product is transferred to the recipient in perpetuity or for a fixed term (regardless of how the transaction is characterized), the Corresponding Source conveyed under this section must be accompanied by the Installation Information. But this requirement does not apply if neither you nor any third party retains the ability to install modified object code on the User Product (for example, the work has been installed in ROM).

The requirement to provide Installation Information does not include a requirement to continue to provide support service, warranty, or updates for a work that has been modified or installed by the recipient, or for the User Product in which it has been modified or installed. Access to a network may be denied when the modification itself materially and adversely affects the operation of the network or violates the rules and protocols for communication across the network.

Corresponding Source conveyed, and Installation Information provided, in accord with this section must be in a format that is publicly documented (and with an implementation available to the public in source code form), and must require no special password or key for unpacking, reading or copying.

7. Additional Terms.

"Additional permissions" are terms that supplement the terms of this License by making exceptions from one or more of its conditions. Additional permissions that are applicable to the entire Program shall be treated as though they were included in this License, to the extent that they are valid under applicable law. If additional permissions apply only to part of the Program, that part may be used separately under those permissions, but the entire Program remains governed by this License without regard to the additional permissions.

When you convey a copy of a covered work, you may at your option remove any additional permissions from that copy, or from any part of it. (Additional permissions may be written to require their own removal in certain cases when you modify the work.) You may place additional permissions on material, added by you to a covered work, for which you have or can give appropriate copyright permission.

Notwithstanding any other provision of this License, for material you add to a covered work, you may (if authorized by the copyright holders of that material) supplement the terms of this License with terms:

a) Disclaiming warranty or limiting liability differently from the terms of sections 15 and 16 of this License; or

b) Requiring preservation of specified reasonable legal notices or author attributions in that material or in the Appropriate Legal Notices displayed by works containing it; or

c) Prohibiting misrepresentation of the origin of that material, or requiring that modified versions of such material be marked in reasonable ways as different from the original version; or

d) Limiting the use for publicity purposes of names of licensors or authors of the material; or

e) Declining to grant rights under trademark law for use of some trade names, trademarks, or service marks; or

f) Requiring indemnification of licensors and authors of that material by anyone who conveys the material (or modified versions of it) with contractual assumptions of liability to the recipient, for any liability that these contractual assumptions directly impose on those licensors and authors.

All other non-permissive additional terms are considered "further restrictions" within the meaning of section 10. If the Program as you received it, or any part of it, contains a notice stating that it is governed by this License along with a term that is a further restriction, you may remove that term. If a license document contains a further restriction but permits relicensing or conveying under this License, you may add to a covered work material governed by the terms of that license document, provided that the further restriction does not survive such relicensing or conveying.

If you add terms to a covered work in accord with this section, you must place, in the relevant source files, a statement of the additional terms that apply to those files, or a notice indicating where to find the applicable terms.

Additional terms, permissive or non-permissive, may be stated in the form of a separately written license, or stated as exceptions; the above requirements apply either way.

8. Termination.

You may not propagate or modify a covered work except as expressly provided under this License. Any attempt otherwise to propagate or modify it is void, and will automatically terminate your rights under this License (including any patent licenses granted under the third paragraph of section 11).

However, if you cease all violation of this License, then your license from a particular copyright holder is reinstated (a) provisionally, unless and until the copyright holder explicitly and finally terminates your license, and (b) permanently, if the copyright holder fails to notify you of the violation by some reasonable means prior to 60 days after the cessation.

Moreover, your license from a particular copyright holder is reinstated permanently if the copyright holder notifies you of the violation by some reasonable means, this is the first time you have received notice of violation of this License (for any work) from that copyright holder, and you cure the violation prior to 30 days after your receipt of the notice.

Termination of your rights under this section does not terminate the licenses of parties who have received copies or rights from you under this License. If your rights have been terminated and not permanently reinstated, you do not qualify to receive new licenses for the same material under section 10.

9. Acceptance Not Required for Having Copies.

You are not required to accept this License in order to receive or run a copy of the Program. Ancillary propagation of a covered work occurring solely as a consequence of using peer-to-peer transmission to receive a copy likewise does not require acceptance. However, nothing other than this License grants you permission to propagate or modify any covered work. These actions infringe copyright if you do not accept this License. Therefore, by modifying or propagating a covered work, you indicate your acceptance of this License to do so.

10. Automatic Licensing of Downstream Recipients.

Each time you convey a covered work, the recipient automatically receives a license from the original licensors, to run, modify and propagate that work, subject to this License. You are not responsible for enforcing compliance by third parties with this License.

An "entity transaction" is a transaction transferring control of an organization, or substantially all assets of one, or subdividing an organization, or merging organizations. If propagation of a covered work results from an entity transaction, each party to that transaction who receives a copy of the work also receives whatever licenses to the work the party's predecessor in interest had or could give under the previous paragraph, plus a right to possession of the Corresponding Source of the work from the predecessor in interest, if the predecessor has it or can get it with reasonable efforts.

You may not impose any further restrictions on the exercise of the rights granted or affirmed under this License. For example, you may not impose a license fee, royalty, or other charge for exercise of rights granted under this License, and you may not initiate litigation (including a cross-claim or counterclaim in a lawsuit) alleging that any patent claim is infringed by making, using, selling, offering for sale, or importing the Program or any portion of it.

11. Patents.

A "contributor" is a copyright holder who authorizes use under this License of the Program or a work on which the Program is based. The work thus licensed is called the contributor's "contributor version".

A contributor's "essential patent claims" are all patent claims owned or controlled by the contributor, whether already acquired or hereafter acquired, that would be infringed by some manner, permitted by this License, of making, using, or selling its contributor version, but do not include claims that would be infringed only as a consequence of further modification of the contributor version. For purposes of this definition, "control" includes the right to grant patent sublicenses in a manner consistent with the requirements of this License.

Each contributor grants you a non-exclusive, worldwide, royalty-free patent license under the contributor's essential patent claims, to make, use, sell, offer for sale, import and otherwise run, modify and propagate the contents of its contributor version.

In the following three paragraphs, a "patent license" is any express agreement or commitment, however denominated, not to enforce a patent (such as an express permission to practice a patent or covenant not to sue for patent infringement). To "grant" such a patent license to a party means to make such an agreement or commitment not to enforce a patent against the party.

If you convey a covered work, knowingly relying on a patent license, and the Corresponding Source of the work is not available for anyone to copy, free of charge and under the terms of this License, through a publicly available network server or other readily accessible means, then you must either (1) cause the Corresponding Source to be so available, or (2) arrange to deprive yourself of the benefit of the patent license for this particular work, or (3) arrange, in a manner consistent with the requirements of this License, to extend the patent license to downstream recipients. "Knowingly relying" means you have actual knowledge that, but for the patent license, your conveying the covered work in a country, or your recipient's use of the covered work in a country, would infringe one or more identifiable patents in that country that you have reason to believe are valid.

If, pursuant to or in connection with a single transaction or arrangement, you convey, or propagate by procuring conveyance of, a covered work, and grant a patent license to some of the parties receiving the covered work authorizing them to use, propagate, modify or convey a specific copy of the covered work, then the patent license you grant is automatically extended to all recipients of the covered work and works based on it.

A patent license is "discriminatory" if it does not include within the scope of its coverage, prohibits the exercise of, or is conditioned on the non-exercise of one or more of the rights that are specifically granted under this License.

You may not convey a covered work if you are a party to an arrangement with a third party that is in the business of distributing software, under which you make payment to the third party based on the extent of your activity of conveying the work, and under which the third party grants, to any of the parties who would receive the covered work from you, a discriminatory patent license (a) in connection with copies of the covered work conveyed by you (or copies made from those copies), or (b) primarily for and in connection with specific products or compilations that contain the covered work, unless you entered into that arrangement, or that patent license was granted, prior to 28 March 2007.

Nothing in this License shall be construed as excluding or limiting any implied license or other defenses to infringement that may otherwise be available to you under applicable patent law.

12. No Surrender of Others' Freedom.

If conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot convey a covered work so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not convey it at all. For example, if you agree to terms that obligate you to collect a royalty for further conveying from those to whom you convey the Program, the only way you could satisfy both those terms and this License would be to refrain entirely from conveying the Program.

13. Remote Network Interaction; Use with the GNU General Public License.

Notwithstanding any other provision of this License, if you modify the Program, your modified version must prominently offer all users interacting with it remotely through a computer network (if your version supports such interaction) an opportunity to receive the Corresponding Source of your version by providing access to the Corresponding Source from a network server at no charge, through some standard or customary means of facilitating copying of software. This Corresponding Source shall include the Corresponding Source for any work covered by version 3 of the GNU General Public License that is incorporated pursuant to the following paragraph.

Notwithstanding any other provision of this License, you have permission to link or combine any covered work with a work licensed under version 3 of the GNU General Public License into a single combined work, and to convey the resulting work. The terms of this License will continue to apply to the part which is the covered work, but the work with which it is combined will remain governed by version 3 of the GNU General Public License.

14. Revised Versions of this License.

The Free Software Foundation may publish revised and/or new versions of the GNU Affero General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns.

Each version is given a distinguishing version number. If the Program specifies that a certain numbered version of the GNU Affero General Public License "or any later version" applies to it, you have the option of following the terms and conditions either of that numbered version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of the GNU Affero General Public License, you may choose any version ever published by the Free Software Foundation.

If the Program specifies that a proxy can decide which future versions of the GNU Affero General Public License can be used, that proxy's public statement of acceptance of a version permanently authorizes you to choose that version for the Program.

Later license versions may give you additional or different permissions. However, no additional obligations are imposed on any author or copyright holder as a result of your choosing to follow a later version.

15. Disclaimer of Warranty.

THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

16. Limitation of Liability.

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

17. Interpretation of Sections 15 and 16.

If the disclaimer of warranty and limitation of liability provided above cannot be given local legal effect according to their terms, reviewing courts shall apply local law that most closely approximates an absolute waiver of all civil liability in connection with the Program, unless a warranty or assumption of liability accompanies a copy of the Program in return for a fee.

END OF TERMS AND CONDITIONS

How to Apply These Terms to Your New Programs

If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms.

To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively state the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found.

BabelDOC is a library for an ultimate document translation solution.
Copyright (C) 2024

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.

Also add information on how to contact you by electronic and paper mail.

If your software can interact with users remotely through a computer network, you should also make sure that it provides a way for users to get its source. For example, if your program is a web application, its interface could display a "Source" link that leads users to an archive of the code. There are many ways you could offer source, and different solutions will be better for different programs; see section 13 for the specific requirements.

You should also get your employer (if you work as a programmer) or school, if any, to sign a "copyright disclaimer" for the program, if necessary. For more information on this, and how to apply and follow the GNU AGPL, see <https://www.gnu.org/licenses/>.
```

## /README.md

PDF scientific paper translation and bilingual comparison library.

- **Online Service**: The beta version is live at [Immersive Translate - BabelDOC](https://app.immersivetranslate.com/babel-doc/), with 1000 free pages per month.
- **Self-deployment**: [PDFMathTranslate](https://github.com/Byaidu/PDFMathTranslate) 1.9.3+ has experimental support for BabelDOC and can be self-deployed with a WebUI and more translation services.
- Provides a simple [command line interface](#getting-started).
- Provides a [Python API](#python-api).
- Mainly designed to be embedded into other programs, but can also be used directly for simple translation tasks.

## Preview

## We are hiring

See details: [EN](https://github.com/funstory-ai/jobs) | [ZH](https://github.com/funstory-ai/jobs/blob/main/README_ZH.md)

## Getting Started

### Install from PyPI

We recommend using the Tool feature of [uv](https://github.com/astral-sh/uv) to install BabelDOC.

1. First, refer to [uv installation](https://github.com/astral-sh/uv#installation) to install uv and set up the `PATH` environment variable as prompted.

2. Use the following command to install BabelDOC:

```bash
uv tool install --python 3.12 BabelDOC

babeldoc --help
```

3. Use the `babeldoc` command. For example:

```bash
babeldoc --bing --files example.pdf

# multiple files
babeldoc --bing --files example1.pdf --files example2.pdf
```

### Install from Source

We still recommend using [uv](https://github.com/astral-sh/uv) to manage virtual environments.

1. First, refer to [uv installation](https://github.com/astral-sh/uv#installation) to install uv and set up the `PATH` environment variable as prompted.

2. Use the following commands to clone the project and install its dependencies:

```bash
# clone the project
git clone https://github.com/funstory-ai/BabelDOC

# enter the project directory
cd BabelDOC

# install dependencies and run babeldoc
uv run babeldoc --help
```

3. Use the `uv run babeldoc` command. For example:

```bash
uv run babeldoc --files example.pdf --openai --openai-model "gpt-4o-mini" --openai-base-url "https://api.openai.com/v1" --openai-api-key "your-api-key-here"

# multiple files
uv run babeldoc --files example.pdf --files example2.pdf --openai --openai-model "gpt-4o-mini" --openai-base-url "https://api.openai.com/v1" --openai-api-key "your-api-key-here"
```

> [!TIP]
> Absolute paths are recommended.

## Advanced Options

> [!NOTE]
> This CLI is mainly for debugging purposes. Although end users can use it to translate files, we do not provide any technical support for that use case.
>
> End users should directly use the **Online Service**: the beta version is live at [Immersive Translate - BabelDOC](https://app.immersivetranslate.com/babel-doc/), with 1000 free pages per month.
>
> End users who need self-deployment should use [PDFMathTranslate](https://github.com/Byaidu/PDFMathTranslate).
>
> If an option is not listed below, it is a debugging option for maintainers. Please do not use these options.

### Language Options

- `--lang-in`, `-li`: Source language code (default: en)
- `--lang-out`, `-lo`: Target language code (default: zh)

> [!TIP]
> Currently, this project mainly focuses on English-to-Chinese translation; other scenarios have not been tested yet.
>
> (2025.3.1 update): Basic English target language support has been added, primarily to minimize line breaks within words (`[0-9A-Za-z]+`).
>
> [HELP WANTED: Collecting word regular expressions for more languages](https://github.com/funstory-ai/BabelDOC/issues/129)
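
As a quick illustration, the language flags combine with the translator options described below; the following is a minimal sketch in which the file name and API key are placeholders, and the en/zh pair simply makes the defaults explicit:

```bash
# A hedged sketch: example.pdf and the API key are placeholders.
# en -> zh matches the defaults; the flags just make the pair explicit.
babeldoc --files example.pdf \
  --lang-in en --lang-out zh \
  --openai --openai-model "gpt-4o-mini" --openai-api-key "your-api-key-here"
```
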
### PDF Processing Options

- `--files`: One or more file paths to input PDF documents.
- `--pages`, `-p`: Specify pages to translate (e.g., "1,2,1-,-3,3-5"). If not set, translate all pages.
- `--split-short-lines`: Force split short lines into different paragraphs (may cause poor typesetting & bugs).
- `--short-line-split-factor`: Split threshold factor (default: 0.8). The actual threshold is the median length of all lines on the current page multiplied by this factor (e.g., with a median line length of 100 pt, the default threshold is 80 pt).
- `--skip-clean`: Skip the PDF cleaning step.
- `--dual-translate-first`: Put translated pages first in dual PDF mode (default: original pages first).
- `--disable-rich-text-translate`: Disable rich text translation (may help improve compatibility with some PDFs).
- `--enhance-compatibility`: Enable all compatibility enhancement options (equivalent to `--skip-clean --dual-translate-first --disable-rich-text-translate`).
- `--use-alternating-pages-dual`: Use alternating pages mode for dual PDF. When enabled, original and translated pages are arranged in alternate order. When disabled (default), original and translated pages are shown side by side on the same page.
- `--watermark-output-mode`: Control watermark output mode: 'watermarked' (default) adds a watermark to the translated PDF, 'no_watermark' doesn't add a watermark, 'both' outputs both versions.
- `--max-pages-per-part`: Maximum number of pages per part for split translation. If not set, no splitting is performed.
- `--no-watermark`: [DEPRECATED] Use `--watermark-output-mode=no_watermark` instead.
- `--translate-table-text`: Translate table text (experimental, default: False).
- `--skip-scanned-detection`: Skip scanned document detection (default: False). When using split translation, only the first part performs detection if not skipped.
- `--ocr-workaround`: Use OCR workaround (default: False). When enabled, the tool will use OCR to detect text and fill the background for scanned PDFs.

> [!TIP]
>
> - Both `--skip-clean` and `--dual-translate-first` may help improve compatibility with some PDF readers.
> - `--disable-rich-text-translate` can also help with compatibility by simplifying translation input.
> - However, using `--skip-clean` will result in larger file sizes.
> - If you encounter any compatibility issues, try using `--enhance-compatibility` first.
> - Use `--max-pages-per-part` for large documents to split them into smaller parts for translation and automatically merge them back.
> - Use `--skip-scanned-detection` to speed up processing when you know your document is not a scanned PDF.
> - Use `--ocr-workaround` to fill the background for scanned PDFs. (Current assumption: the background is pure white and the text is pure black; this option also automatically enables `--skip-scanned-detection`.)

### Translation Service Options

- `--qps`: QPS (Queries Per Second) limit for the translation service (default: 4)
- `--ignore-cache`: Ignore the translation cache and force retranslation
- `--no-dual`: Do not output bilingual PDF files
- `--no-mono`: Do not output monolingual PDF files
- `--min-text-length`: Minimum text length to translate (default: 5)
- `--openai`: Use OpenAI for translation (default: False)

> [!TIP]
>
> 1. Currently, only OpenAI-compatible LLMs are supported. For more translator support, please use [PDFMathTranslate](https://github.com/Byaidu/PDFMathTranslate).
> 2. It is recommended to use models with strong OpenAI compatibility, such as `glm-4-flash`, `deepseek-chat`, etc.
> 3. The tool has not been optimized for traditional translation engines like Bing/Google; LLMs are recommended.
> 4. You can use [litellm](https://github.com/BerriAI/litellm) to access multiple models.

### OpenAI Specific Options

- `--openai-model`: OpenAI model to use (default: gpt-4o-mini)
- `--openai-base-url`: Base URL for the OpenAI API
- `--openai-api-key`: API key for the OpenAI service

> [!TIP]
>
> 1. This tool supports any OpenAI-compatible API endpoint. Just set the correct base URL and API key (e.g. `https://xxx.custom.xxx/v1`).
> 2. For local models like Ollama, you can use any value as the API key (e.g. `--openai-api-key a`).
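As a concrete sketch of tip 2: a local Ollama server conventionally exposes an OpenAI-compatible endpoint at `http://localhost:11434/v1`, and the model name below is only a placeholder for whatever model you have pulled locally:

```bash
babeldoc --files example.pdf \
  --openai \
  --openai-model "qwen2.5:7b" \
  --openai-base-url "http://localhost:11434/v1" \
  --openai-api-key "a"
```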
### Output Control

- `--output`, `-o`: Output directory for translated files. If not set, the current working directory is used.
- `--debug`, `-d`: Enable debug logging level and export detailed intermediate results in `~/.cache/yadt/working`.
- `--report-interval`: Progress report interval in seconds (default: 0.1).

### Offline Assets Management

- `--generate-offline-assets`: Generate an offline assets package in the specified directory. This creates a zip file containing all required models and fonts.
- `--restore-offline-assets`: Restore an offline assets package from the specified file. This extracts models and fonts from a previously generated package.

> [!TIP]
>
> 1. Offline assets packages are useful for environments without internet access or to speed up installation on multiple machines.
> 2. Generate a package once with `babeldoc --generate-offline-assets /path/to/output/dir` and then distribute it.
> 3. Restore the package on target machines with `babeldoc --restore-offline-assets /path/to/offline_assets_*.zip`.
> 4. The offline assets package name cannot be modified because the file list hash is encoded in the name.
> 5. If you provide a directory path to `--restore-offline-assets`, the tool will automatically look for the correct offline assets package file in that directory.
> 6. The package contains all necessary fonts and models required for document processing, ensuring consistent results across different environments.
> 7. The integrity of all assets is verified using SHA3-256 hashes during both packaging and restoration.
> 8. If you're deploying in an air-gapped environment, make sure to generate the package on a machine with internet access first.

### Configuration File

- `--config`, `-c`: Configuration file path. Uses the TOML format.

Example configuration:

```toml
[babeldoc]
# Basic settings
debug = true
lang-in = "en-US"
lang-out = "zh-CN"
qps = 10
output = "/path/to/output/dir"

# PDF processing options
split-short-lines = false
short-line-split-factor = 0.8
skip-clean = false
dual-translate-first = false
disable-rich-text-translate = false
use-alternating-pages-dual = false
watermark-output-mode = "watermarked"  # Choices: "watermarked", "no_watermark", "both"
max-pages-per-part = 50  # Automatically split the document for translation and merge it back.
# no-watermark = false  # DEPRECATED: Use watermark-output-mode instead
skip-scanned-detection = false  # Skip scanned document detection for faster processing

# Translation service
openai = true
openai-model = "gpt-4o-mini"
openai-base-url = "https://api.openai.com/v1"
openai-api-key = "your-api-key-here"

# Output control
no-dual = false
no-mono = false
min-text-length = 5
report-interval = 0.5

# Offline assets management
# Uncomment one of these options as needed:
# generate-offline-assets = "/path/to/output/dir"
# restore-offline-assets = "/path/to/offline_assets_package.zip"
```

## Python API

> [!TIP]
>
> 1. Before pdf2zh 2.0 is released, you can temporarily use BabelDOC's Python API. After pdf2zh 2.0 is released, please directly use pdf2zh's Python API instead.
> 2. This project's Python API does not guarantee any compatibility. However, the Python API from pdf2zh will guarantee a certain level of compatibility.

You can refer to the example in [main.py](https://github.com/funstory-ai/yadt/blob/main/babeldoc/main.py) to use BabelDOC's Python API.
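A minimal sketch of what such usage can look like. The constructor arguments and the `OpenAITranslator` import path below are assumptions based on this version's main.py, not a stable contract; check main.py for the authoritative, current signatures:

```python
import babeldoc.high_level
from babeldoc.translation_config import TranslationConfig
from babeldoc.translator.translator import OpenAITranslator  # assumed import path

# Must be called once before using the API (see the notes below).
babeldoc.high_level.init()

# Hypothetical parameter names, mirroring the CLI options above.
translator = OpenAITranslator(
    lang_in="en",
    lang_out="zh",
    model="gpt-4o-mini",
    api_key="your-api-key-here",
)
config = TranslationConfig(
    input_file="example.pdf",
    translator=translator,
    lang_in="en",
    lang_out="zh",
)
result = babeldoc.high_level.translate(config)
```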
Please note:

1. Make sure to call `babeldoc.high_level.init()` before using the API.
2. The current `TranslationConfig` does not fully validate input parameters, so you need to ensure their validity yourself.
3. For offline assets management, you can use the following functions:

```python
# Generate an offline assets package
from pathlib import Path
import babeldoc.assets.assets

# Generate a package to a specific directory
# path is optional; the default is ~/.cache/babeldoc/assets/offline_assets_{hash}.zip
babeldoc.assets.assets.generate_offline_assets_package(Path("/path/to/output/dir"))

# Restore from a package file
# path is optional; the default is ~/.cache/babeldoc/assets/offline_assets_{hash}.zip
babeldoc.assets.assets.restore_offline_assets_package(Path("/path/to/offline_assets_package.zip"))

# You can also restore from a directory containing the offline assets package.
# The tool will automatically find the correct package file based on the hash.
babeldoc.assets.assets.restore_offline_assets_package(Path("/path/to/directory"))
```

> [!TIP]
>
> 1. The offline assets package name cannot be modified because the file list hash is encoded in the name.
> 2. When using BabelDOC in production environments, it's recommended to pre-generate the assets package and include it with your application distribution.
> 3. The package verification ensures that all required assets are intact and match their expected checksums.
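For reference, the hash in the package name is derived from the asset file list itself; here is a minimal sketch mirroring `get_offline_assets_tag()` in `babeldoc/assets/assets.py` (shown later in this document):

```python
import hashlib
import orjson

def offline_assets_tag(file_list: dict) -> str:
    # Canonicalize the file list (sorted keys, stable formatting), then hash it,
    # so any change to an asset name or checksum changes the tag.
    canonical = orjson.dumps(
        file_list,
        option=orjson.OPT_APPEND_NEWLINE | orjson.OPT_INDENT_2 | orjson.OPT_SORT_KEYS,
    )
    return hashlib.sha3_256(canonical).hexdigest()
```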
## Background

There are a lot of projects and teams working to make document editing and translation easier, such as:

- [mathpix](https://mathpix.com/)
- [Doc2X](https://doc2x.noedgeai.com/)
- [minerU](https://github.com/opendatalab/MinerU)
- [PDFMathTranslate](https://github.com/funstory-ai/yadt)

There are also some solutions for specific parts of the problem, such as:

- [layoutreader](https://github.com/microsoft/unilm/tree/master/layoutreader): the reading order of text blocks in a PDF
- [Surya](https://github.com/surya-is/surya): the structure of the PDF

This project hopes to promote a standard pipeline and interface to solve the problem.

In fact, a PDF parser or translator has two main stages:

- **Parsing**: extracting the structure of the PDF, such as text blocks, images, tables, etc.
- **Rendering**: rendering that structure into a new PDF or another format.

A service like mathpix parses the PDF into a structure, perhaps in an XML format, and then renders it using a single-column reading order, as [layoutreader](https://github.com/microsoft/unilm/tree/master/layoutreader) does. The bad news is that the original structure is lost.

Some people use the Adobe PDF Parser because it generates a Word document that keeps the original structure, but it is somewhat expensive. Moreover, neither a PDF nor a Word document is a good format for reading on mobile devices.

We offer an intermediate representation of the parser's results that can be rendered into a new PDF or another format. The pipeline is also a plugin-based system to which anyone can add their own models, OCR engines, renderers, etc.

## Roadmap

- [ ] Add line support
- [ ] Add table support
- [ ] Add cross-page/cross-column paragraph support
- [ ] More advanced typesetting features
- [ ] Outline support
- [ ] ...

Our first 1.0 version goal is to finish translating the [PDF Reference, Version 1.7](https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.7old.pdf) into the following languages:

- Simplified Chinese
- Traditional Chinese
- Japanese
- Spanish

And meet the following requirements:

- layout error less than 1%
- content loss less than 1%

## Known Issues

1. Parsing errors in the author and reference sections; they get merged into one paragraph after translation.
2. Lines are not supported.
3. Drop caps are not supported.
4. Large pages will be skipped.

## How to Contribute

We encourage you to contribute to YADT! Please check out the [CONTRIBUTING](https://github.com/funstory-ai/yadt/blob/main/docs/CONTRIBUTING.md) guide.

Everyone interacting in YADT and its sub-projects' codebases, issue trackers, chat rooms, and mailing lists is expected to follow the YADT [Code of Conduct](https://github.com/funstory-ai/yadt/blob/main/docs/CODE_OF_CONDUCT.md).

[Immersive Translation](https://immersivetranslate.com) sponsors monthly Pro membership redemption codes for active contributors to this project; see details at [CONTRIBUTOR_REWARD.md](https://github.com/funstory-ai/BabelDOC/blob/main/docs/CONTRIBUTOR_REWARD.md).

## Acknowledgements

- [PDFMathTranslate](https://github.com/Byaidu/PDFMathTranslate)
- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
- [pdfminer](https://github.com/pdfminer/pdfminer.six)
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
- [Asynchronize](https://github.com/multimeric/Asynchronize/tree/master?tab=readme-ov-file)
- [PriorityThreadPoolExecutor](https://github.com/oleglpts/PriorityThreadPoolExecutor)

## Star History
Star History Chart ## /babeldoc/__init__.py ```py path="/babeldoc/__init__.py" __version__ = "0.3.27" ``` ## /babeldoc/assets/assets.py ```py path="/babeldoc/assets/assets.py" import asyncio import hashlib import logging import threading import zipfile from pathlib import Path import httpx from babeldoc.assets import embedding_assets_metadata from babeldoc.assets.embedding_assets_metadata import DOC_LAYOUT_ONNX_MODEL_URL from babeldoc.assets.embedding_assets_metadata import ( DOCLAYOUT_YOLO_DOCSTRUCTBENCH_IMGSZ1024ONNX_SHA3_256, ) from babeldoc.assets.embedding_assets_metadata import EMBEDDING_FONT_METADATA from babeldoc.assets.embedding_assets_metadata import FONT_METADATA_URL from babeldoc.assets.embedding_assets_metadata import FONT_URL_BY_UPSTREAM from babeldoc.assets.embedding_assets_metadata import ( TABLE_DETECTION_RAPIDOCR_MODEL_SHA3_256, ) from babeldoc.assets.embedding_assets_metadata import TABLE_DETECTION_RAPIDOCR_MODEL_URL from babeldoc.assets.embedding_assets_metadata import TIKTOKEN_CACHES from babeldoc.const import get_cache_file_path from tenacity import retry from tenacity import stop_after_attempt from tenacity import wait_exponential logger = logging.getLogger(__name__) class ResultContainer: def __init__(self): self.result = None def set_result(self, result): self.result = result def run_in_another_thread(coro): result_container = ResultContainer() def _wrapper(): result_container.set_result(asyncio.run(coro)) thread = threading.Thread(target=_wrapper) thread.start() thread.join() return result_container.result def run_coro(coro): return run_in_another_thread(coro) def _retry_if_not_cancelled_and_failed(retry_state): """Only retry if the exception is not CancelledError and the attempt failed.""" if retry_state.outcome.failed: exception = retry_state.outcome.exception() # Don't retry on CancelledError if isinstance(exception, asyncio.CancelledError): logger.debug("Operation was cancelled, not retrying") return False # Retry on network related errors if isinstance( exception, httpx.HTTPError | ConnectionError | ValueError | TimeoutError ): logger.warning(f"Network error occurred: {exception}, will retry") return True # Don't retry on success return False def verify_file(path: Path, sha3_256: str): if not path.exists(): return False hash_ = hashlib.sha3_256() with path.open("rb") as f: while True: chunk = f.read(1024 * 1024) if not chunk: break hash_.update(chunk) return hash_.hexdigest() == sha3_256 @retry( retry=_retry_if_not_cancelled_and_failed, stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=15), before_sleep=lambda retry_state: logger.warning( f"Download file failed, retrying in {retry_state.next_action.sleep} seconds... " f"(Attempt {retry_state.attempt_number}/3)" ), ) async def download_file( client: httpx.AsyncClient | None = None, url: str = None, path: Path = None, sha3_256: str = None, ): if client is None: async with httpx.AsyncClient() as client: response = await client.get(url, follow_redirects=True) else: response = await client.get(url, follow_redirects=True) response.raise_for_status() with path.open("wb") as f: f.write(response.content) if not verify_file(path, sha3_256): path.unlink(missing_ok=True) raise ValueError(f"File {path} is corrupted") @retry( retry=_retry_if_not_cancelled_and_failed, stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=15), before_sleep=lambda retry_state: logger.warning( f"Get font metadata failed, retrying in {retry_state.next_action.sleep} seconds... 
" f"(Attempt {retry_state.attempt_number}/3)" ), ) async def get_font_metadata( client: httpx.AsyncClient | None = None, upstream: str = None ): if upstream not in FONT_METADATA_URL: logger.critical(f"Invalid upstream: {upstream}") exit(1) if client is None: async with httpx.AsyncClient() as client: response = await client.get( FONT_METADATA_URL[upstream], follow_redirects=True ) else: response = await client.get(FONT_METADATA_URL[upstream], follow_redirects=True) response.raise_for_status() logger.debug(f"Get font metadata from {upstream} success") return upstream, response.json() async def get_fastest_upstream_for_font( client: httpx.AsyncClient | None = None, exclude_upstream: list[str] = None ): tasks: list[asyncio.Task[tuple[str, dict]]] = [] for upstream in FONT_METADATA_URL: if exclude_upstream and upstream in exclude_upstream: continue tasks.append(asyncio.create_task(get_font_metadata(client, upstream))) for future in asyncio.as_completed(tasks): try: result = await future for task in tasks: if not task.done(): task.cancel() return result except Exception as e: logger.exception(f"Error getting font metadata: {e}") logger.error("All upstreams failed") return None, None async def get_fastest_upstream_for_model(client: httpx.AsyncClient | None = None): return await get_fastest_upstream_for_font(client, exclude_upstream=["github"]) async def get_fastest_upstream(client: httpx.AsyncClient | None = None): ( fastest_upstream_for_font, online_font_metadata, ) = await get_fastest_upstream_for_font(client) if fastest_upstream_for_font is None: logger.error("Failed to get fastest upstream") exit(1) if fastest_upstream_for_font == "github": # since github is only store font, we need to get the fastest upstream for model fastest_upstream_for_model, _ = await get_fastest_upstream_for_model(client) if fastest_upstream_for_model is None: logger.error("Failed to get fastest upstream") exit(1) else: fastest_upstream_for_model = fastest_upstream_for_font return online_font_metadata, fastest_upstream_for_font, fastest_upstream_for_model async def get_doclayout_onnx_model_path_async(client: httpx.AsyncClient | None = None): onnx_path = get_cache_file_path( "doclayout_yolo_docstructbench_imgsz1024.onnx", "models" ) if verify_file(onnx_path, DOCLAYOUT_YOLO_DOCSTRUCTBENCH_IMGSZ1024ONNX_SHA3_256): return onnx_path logger.info("doclayout onnx model not found or corrupted, downloading...") fastest_upstream, _ = await get_fastest_upstream_for_model(client) if fastest_upstream is None: logger.error("Failed to get fastest upstream") exit(1) url = DOC_LAYOUT_ONNX_MODEL_URL[fastest_upstream] await download_file( client, url, onnx_path, DOCLAYOUT_YOLO_DOCSTRUCTBENCH_IMGSZ1024ONNX_SHA3_256 ) logger.info(f"Download doclayout onnx model from {fastest_upstream} success") return onnx_path async def get_table_detection_rapidocr_model_path_async( client: httpx.AsyncClient | None = None, ): onnx_path = get_cache_file_path("ch_PP-OCRv4_det_infer.onnx", "models") if verify_file(onnx_path, TABLE_DETECTION_RAPIDOCR_MODEL_SHA3_256): return onnx_path logger.info("table detection rapidocr model not found or corrupted, downloading...") fastest_upstream, _ = await get_fastest_upstream_for_model(client) if fastest_upstream is None: logger.error("Failed to get fastest upstream") exit(1) url = TABLE_DETECTION_RAPIDOCR_MODEL_URL[fastest_upstream] await download_file(client, url, onnx_path, TABLE_DETECTION_RAPIDOCR_MODEL_SHA3_256) logger.info( f"Download table detection rapidocr model from {fastest_upstream} success" ) return 
onnx_path def get_doclayout_onnx_model_path(): return run_coro(get_doclayout_onnx_model_path_async()) def get_table_detection_rapidocr_model_path(): return run_coro(get_table_detection_rapidocr_model_path_async()) def get_font_url_by_name_and_upstream(font_file_name: str, upstream: str): if upstream not in FONT_URL_BY_UPSTREAM: logger.critical(f"Invalid upstream: {upstream}") exit(1) return FONT_URL_BY_UPSTREAM[upstream](font_file_name) async def get_font_and_metadata_async( font_file_name: str, client: httpx.AsyncClient | None = None, fastest_upstream: str | None = None, font_metadata: dict | None = None, ): cache_file_path = get_cache_file_path(font_file_name, "fonts") if font_file_name in EMBEDDING_FONT_METADATA and verify_file( cache_file_path, EMBEDDING_FONT_METADATA[font_file_name]["sha3_256"] ): return cache_file_path, EMBEDDING_FONT_METADATA[font_file_name] logger.info(f"Font {cache_file_path} not found or corrupted, downloading...") if fastest_upstream is None: fastest_upstream, font_metadata = await get_fastest_upstream_for_font(client) if fastest_upstream is None: logger.critical("Failed to get fastest upstream") exit(1) if font_file_name not in font_metadata: logger.critical(f"Font {font_file_name} not found in {font_metadata}") exit(1) if verify_file(cache_file_path, font_metadata[font_file_name]["sha3_256"]): return cache_file_path, font_metadata[font_file_name] assert font_metadata is not None url = get_font_url_by_name_and_upstream(font_file_name, fastest_upstream) if "sha3_256" not in font_metadata[font_file_name]: logger.critical(f"Font {font_file_name} not found in {font_metadata}") exit(1) await download_file( client, url, cache_file_path, font_metadata[font_file_name]["sha3_256"] ) return cache_file_path, font_metadata[font_file_name] def get_font_and_metadata(font_file_name: str): return run_coro(get_font_and_metadata_async(font_file_name)) def get_font_family(lang_code: str): font_family = embedding_assets_metadata.get_font_family(lang_code) return font_family async def download_all_fonts_async(client: httpx.AsyncClient | None = None): for font_file_name in EMBEDDING_FONT_METADATA: if not verify_file( get_cache_file_path(font_file_name, "fonts"), EMBEDDING_FONT_METADATA[font_file_name]["sha3_256"], ): break else: logger.debug("All fonts are already downloaded") return fastest_upstream, font_metadata = await get_fastest_upstream_for_font(client) if fastest_upstream is None: logger.error("Failed to get fastest upstream") exit(1) logger.info(f"Downloading fonts from {fastest_upstream}") font_tasks = [ asyncio.create_task( get_font_and_metadata_async( font_file_name, client, fastest_upstream, font_metadata ) ) for font_file_name in EMBEDDING_FONT_METADATA ] await asyncio.gather(*font_tasks) async def async_warmup(): logger.info("Downloading all assets...") from tiktoken import encoding_for_model _ = encoding_for_model("gpt-4o") async with httpx.AsyncClient() as client: onnx_task = asyncio.create_task(get_doclayout_onnx_model_path_async(client)) onnx_task2 = asyncio.create_task( get_table_detection_rapidocr_model_path_async(client) ) font_tasks = asyncio.create_task(download_all_fonts_async(client)) await asyncio.gather(onnx_task, onnx_task2, font_tasks) def warmup(): run_coro(async_warmup()) def generate_all_assets_file_list(): result = {} result["fonts"] = [] result["models"] = [] result["tiktoken"] = [] for font_file_name in EMBEDDING_FONT_METADATA: result["fonts"].append( { "name": font_file_name, "sha3_256": EMBEDDING_FONT_METADATA[font_file_name]["sha3_256"], } ) 
for tiktoken_file, sha3_256 in TIKTOKEN_CACHES.items(): result["tiktoken"].append( { "name": tiktoken_file, "sha3_256": sha3_256, } ) result["models"].append( { "name": "doclayout_yolo_docstructbench_imgsz1024.onnx", "sha3_256": DOCLAYOUT_YOLO_DOCSTRUCTBENCH_IMGSZ1024ONNX_SHA3_256, }, ) result["models"].append( { "name": "ch_PP-OCRv4_det_infer.onnx", "sha3_256": TABLE_DETECTION_RAPIDOCR_MODEL_SHA3_256, }, ) return result async def generate_offline_assets_package_async(output_directory: Path | None = None): await async_warmup() logger.info("Generating offline assets package...") file_list = generate_all_assets_file_list() offline_assets_tag = get_offline_assets_tag(file_list) if output_directory is None: output_path = get_cache_file_path( f"offline_assets_{offline_assets_tag}.zip", "assets" ) else: output_directory.mkdir(parents=True, exist_ok=True) output_path = output_directory / f"offline_assets_{offline_assets_tag}.zip" with zipfile.ZipFile( output_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=9 ) as zipf: for file_type, file_descs in file_list.items(): # zipf.mkdir(file_type) for file_desc in file_descs: file_name = file_desc["name"] sha3_256 = file_desc["sha3_256"] file_path = get_cache_file_path(file_name, file_type) if not verify_file(file_path, sha3_256): logger.error(f"File {file_path} is corrupted") exit(1) with file_path.open("rb") as f: zipf.writestr(f"{file_type}/{file_name}", f.read()) logger.info(f"Offline assets package generated at {output_path}") async def restore_offline_assets_package_async(input_path: Path | None = None): file_list = generate_all_assets_file_list() offline_assets_tag = get_offline_assets_tag(file_list) if input_path is None: input_path = get_cache_file_path( f"offline_assets_{offline_assets_tag}.zip", "assets" ) else: if input_path.exists() and input_path.is_dir(): input_path = input_path / f"offline_assets_{offline_assets_tag}.zip" if not input_path.exists(): logger.critical(f"Offline assets package not found: {input_path}") exit(1) import re offline_assets_tag_from_input_path = re.match( r"offline_assets_(.*)\.zip", input_path.name ).group(1) if offline_assets_tag != offline_assets_tag_from_input_path: logger.critical( f"Offline assets tag mismatch: {offline_assets_tag} != {offline_assets_tag_from_input_path}" ) exit(1) nothing_changed = True with zipfile.ZipFile(input_path, "r") as zipf: for file_type, file_descs in file_list.items(): for file_desc in file_descs: file_name = file_desc["name"] file_path = get_cache_file_path(file_name, file_type) if verify_file(file_path, file_desc["sha3_256"]): continue nothing_changed = False with zipf.open(f"{file_type}/{file_name}", "r") as f: with file_path.open("wb") as f2: f2.write(f.read()) if not verify_file(file_path, file_desc["sha3_256"]): logger.critical( "Offline assets package is corrupted, please delete it and try again" ) exit(1) if not nothing_changed: logger.info(f"Offline assets package restored from {input_path}") def get_offline_assets_tag(file_list: dict | None = None): if file_list is None: file_list = generate_all_assets_file_list() import orjson # noinspection PyTypeChecker offline_assets_tag = hashlib.sha3_256( orjson.dumps( file_list, option=orjson.OPT_APPEND_NEWLINE | orjson.OPT_INDENT_2 | orjson.OPT_SORT_KEYS, ) ).hexdigest() return offline_assets_tag def generate_offline_assets_package(output_directory: Path | None = None): return run_coro(generate_offline_assets_package_async(output_directory)) def restore_offline_assets_package(input_path: Path | None = None): return 
run_coro(restore_offline_assets_package_async(input_path)) if __name__ == "__main__": from rich.logging import RichHandler logging.basicConfig(level=logging.DEBUG, handlers=[RichHandler()]) logging.getLogger("httpx").setLevel(logging.WARNING) logging.getLogger("httpcore").setLevel(logging.WARNING) # warmup() # generate_offline_assets_package() # restore_offline_assets_package(Path( # '/Users/aw/.cache/babeldoc/assets/offline_assets_33971e4940e90ba0c35baacda44bbe83b214f4703a7bdb8b837de97d0383508c.zip')) # warmup() ``` ## /babeldoc/assets/embedding_assets_metadata.py ```py path="/babeldoc/assets/embedding_assets_metadata.py" import itertools DOCLAYOUT_YOLO_DOCSTRUCTBENCH_IMGSZ1024ONNX_SHA3_256 = ( "60be061226930524958b5465c8c04af3d7c03bcb0beb66454f5da9f792e3cf2a" ) TABLE_DETECTION_RAPIDOCR_MODEL_SHA3_256 = ( "062f4619afe91b33147c033acadecbb53f2a7b99ac703d157b96d5b10948da5e" ) TIKTOKEN_CACHES = { "fb374d419588a4632f3f557e76b4b70aebbca790": "cb04bcda5782cfbbe77f2f991d92c0ea785d9496ef1137c91dfc3c8c324528d6" } FONT_METADATA_URL = { "github": "https://raw.githubusercontent.com/funstory-ai/BabelDOC-Assets/refs/heads/main/font_metadata.json", "huggingface": "https://huggingface.co/datasets/awwaawwa/BabelDOC-Assets/resolve/main/font_metadata.json?download=true", "hf-mirror": "https://hf-mirror.com/datasets/awwaawwa/BabelDOC-Assets/resolve/main/font_metadata.json?download=true", "modelscope": "https://www.modelscope.cn/datasets/awwaawwa/BabelDOCAssets/resolve/master/font_metadata.json", } FONT_URL_BY_UPSTREAM = { "github": lambda name: f"https://raw.githubusercontent.com/funstory-ai/BabelDOC-Assets/refs/heads/main/fonts/{name}", "huggingface": lambda name: f"https://huggingface.co/datasets/awwaawwa/BabelDOC-Assets/resolve/main/fonts/{name}?download=true", "hf-mirror": lambda name: f"https://hf-mirror.com/datasets/awwaawwa/BabelDOC-Assets/resolve/main/fonts/{name}?download=true", "modelscope": lambda name: f"https://www.modelscope.cn/datasets/awwaawwa/BabelDOCAssets/resolve/master/fonts/{name}", } DOC_LAYOUT_ONNX_MODEL_URL = { "huggingface": "https://huggingface.co/wybxc/DocLayout-YOLO-DocStructBench-onnx/resolve/main/doclayout_yolo_docstructbench_imgsz1024.onnx?download=true", "hf-mirror": "https://hf-mirror.com/wybxc/DocLayout-YOLO-DocStructBench-onnx/resolve/main/doclayout_yolo_docstructbench_imgsz1024.onnx?download=true", "modelscope": "https://www.modelscope.cn/models/AI-ModelScope/DocLayout-YOLO-DocStructBench-onnx/resolve/master/doclayout_yolo_docstructbench_imgsz1024.onnx", } TABLE_DETECTION_RAPIDOCR_MODEL_URL = { "huggingface": "https://huggingface.co/spaces/RapidAI/RapidOCR/resolve/main/models/text_det/ch_PP-OCRv4_det_infer.onnx", "hf-mirror": "https://hf-mirror.com/spaces/RapidAI/RapidOCR/resolve/main/models/text_det/ch_PP-OCRv4_det_infer.onnx", "modelscope": "https://www.modelscope.cn/models/RapidAI/RapidOCR/resolve/master/onnx/PP-OCRv4/det/ch_PP-OCRv4_det_infer.onnx", } # from https://github.com/funstory-ai/BabelDOC-Assets/blob/main/font_metadata.json EMBEDDING_FONT_METADATA = { "GoNotoKurrent-Bold.ttf": { "ascent": 1069, "bold": 1, "descent": -293, "encoding_length": 2, "file_name": "GoNotoKurrent-Bold.ttf", "font_name": "Go Noto Kurrent-Bold Bold", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "000b37f592477945b27b7702dcad39f73e23e140e66ddff9847eb34f32389566", "size": 15303772, }, "GoNotoKurrent-Regular.ttf": { "ascent": 1069, "bold": 0, "descent": -293, "encoding_length": 2, "file_name": "GoNotoKurrent-Regular.ttf", "font_name": "Go Noto Kurrent-Regular Regular", "italic": 0, 
"monospace": 0, "serif": 1, "sha3_256": "4324a60d507c691e6efc97420647f4d2c2d86d9de35009d1c769861b76074ae6", "size": 15515760, }, "KleeOne-Regular.ttf": { "ascent": 1160, "bold": 0, "descent": -288, "encoding_length": 2, "file_name": "KleeOne-Regular.ttf", "font_name": "Klee One Regular", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "8585c29f89b322d937f83739f61ede5d84297873e1465cad9a120a208ac55ce0", "size": 8724704, }, "LXGWWenKaiGB-Regular.ttf": { "ascent": 928, "bold": 0, "descent": -256, "encoding_length": 2, "file_name": "LXGWWenKaiGB-Regular.ttf", "font_name": "LXGW WenKai GB Regular", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "b563a5e8d9db4cd15602a3a3700b01925e80a21f99fb88e1b763b1fb8685f8ee", "size": 19558756, }, "LXGWWenKaiMonoTC-Regular.ttf": { "ascent": 928, "bold": 0, "descent": -241, "encoding_length": 2, "file_name": "LXGWWenKaiMonoTC-Regular.ttf", "font_name": "LXGW WenKai Mono TC Regular", "italic": 0, "monospace": 1, "serif": 1, "sha3_256": "596b278d11418d374a1cfa3a50cbfb82b31db82d3650cfacae8f94311b27fdc5", "size": 13115416, }, "LXGWWenKaiTC-Regular.ttf": { "ascent": 928, "bold": 0, "descent": -256, "encoding_length": 2, "file_name": "LXGWWenKaiTC-Regular.ttf", "font_name": "LXGW WenKai TC Regular", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "66ccd0ffe8e56cd585dabde8d1292c3f551b390d8ed85f81d7a844825f9c2379", "size": 13100328, }, "MaruBuri-Regular.ttf": { "ascent": 800, "bold": 0, "descent": -200, "encoding_length": 2, "file_name": "MaruBuri-Regular.ttf", "font_name": "MaruBuri Regular", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "abb672dde7b89e06914ce27c59159b7a2933f26207bfcc47981c67c11c41e6d1", "size": 3268988, }, "NotoSans-Bold.ttf": { "ascent": 1069, "bold": 1, "descent": -293, "encoding_length": 2, "file_name": "NotoSans-Bold.ttf", "font_name": "Noto Sans Bold", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "ecd38d472c1cad07d8a5dffd2b5a0f72edcd40fff2b4e68d770da8f2ef343a82", "size": 630964, }, "NotoSans-BoldItalic.ttf": { "ascent": 1069, "bold": 1, "descent": -293, "encoding_length": 2, "file_name": "NotoSans-BoldItalic.ttf", "font_name": "Noto Sans Bold Italic", "italic": 1, "monospace": 0, "serif": 1, "sha3_256": "0b6c690a4a6b7d605b2ecbde00c7ac1a23e60feb17fa30d8b972d61ec3ff732b", "size": 644340, }, "NotoSans-Italic.ttf": { "ascent": 1069, "bold": 0, "descent": -293, "encoding_length": 2, "file_name": "NotoSans-Italic.ttf", "font_name": "Noto Sans Italic", "italic": 1, "monospace": 0, "serif": 1, "sha3_256": "830652f61724c017e5a29a96225b484a2ccbd25f69a1b3f47e5f466a2dbed1ad", "size": 642344, }, "NotoSans-Regular.ttf": { "ascent": 1069, "bold": 0, "descent": -293, "encoding_length": 2, "file_name": "NotoSans-Regular.ttf", "font_name": "Noto Sans Regular", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "7dfe2bbf97dc04c852d1223b220b63430e6ad03b0dbb28ebe6328a20a2d45eb8", "size": 629024, }, "NotoSerif-Bold.ttf": { "ascent": 1069, "bold": 1, "descent": -293, "encoding_length": 2, "file_name": "NotoSerif-Bold.ttf", "font_name": "Noto Serif Bold", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "28d88d924285eadb9f9ce49f2d2b95473f89a307b226c5f6ebed87a654898312", "size": 506864, }, "NotoSerif-BoldItalic.ttf": { "ascent": 1069, "bold": 1, "descent": -293, "encoding_length": 2, "file_name": "NotoSerif-BoldItalic.ttf", "font_name": "Noto Serif Bold Italic", "italic": 1, "monospace": 0, "serif": 1, "sha3_256": "b69ee56af6351b2fb4fbce623f8e1c1f9fb19170686a9e5db2cf260b8cf24ac7", "size": 535724, }, "NotoSerif-Italic.ttf": { "ascent": 
1069, "bold": 0, "descent": -293, "encoding_length": 2, "file_name": "NotoSerif-Italic.ttf", "font_name": "Noto Serif Italic", "italic": 1, "monospace": 0, "serif": 1, "sha3_256": "9b7773c24ab8a29e3c1c03efa4ab652d051e4c209134431953463aa946d62868", "size": 535340, }, "NotoSerif-Regular.ttf": { "ascent": 1069, "bold": 0, "descent": -293, "encoding_length": 2, "file_name": "NotoSerif-Regular.ttf", "font_name": "Noto Serif Regular", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "c2bbe984e65bafd3bcd38b3cb1e1344f3b7b79d6beffc7a3d883b57f8358559d", "size": 504932, }, "SourceHanSansCN-Bold.ttf": { "ascent": 1160, "bold": 1, "descent": -288, "encoding_length": 2, "file_name": "SourceHanSansCN-Bold.ttf", "font_name": "Source Han Sans CN Bold", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "82314c11016a04ef03e7afd00abe0ccc8df54b922dee79abf6424f3002a31825", "size": 10174460, }, "SourceHanSansCN-Regular.ttf": { "ascent": 1160, "bold": 0, "descent": -288, "encoding_length": 2, "file_name": "SourceHanSansCN-Regular.ttf", "font_name": "Source Han Sans CN Regular", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "b45a80cf3650bfc62aa014e58243c6325e182c4b0c5819e41a583c699cce9a8f", "size": 10397552, }, "SourceHanSansHK-Bold.ttf": { "ascent": 1160, "bold": 1, "descent": -288, "encoding_length": 2, "file_name": "SourceHanSansHK-Bold.ttf", "font_name": "Source Han Sans HK Bold", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "3eecd57457ba9a0fbad6c794f40e7ae704c4f825091aef2ac18902ffdde50608", "size": 6856692, }, "SourceHanSansHK-Regular.ttf": { "ascent": 1160, "bold": 0, "descent": -288, "encoding_length": 2, "file_name": "SourceHanSansHK-Regular.ttf", "font_name": "Source Han Sans HK Regular", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "5fe4141f9164c03616323400b2936ee4c8265314492e2b822c3a6fbfb63ffe08", "size": 6999792, }, "SourceHanSansJP-Bold.ttf": { "ascent": 1160, "bold": 1, "descent": -288, "encoding_length": 2, "file_name": "SourceHanSansJP-Bold.ttf", "font_name": "Source Han Sans JP Bold", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "fb05bd84d62e8064117ee357ab6a4481e1cde931e8e984c0553c8c4b09dc3938", "size": 5603068, }, "SourceHanSansJP-Regular.ttf": { "ascent": 1160, "bold": 0, "descent": -288, "encoding_length": 2, "file_name": "SourceHanSansJP-Regular.ttf", "font_name": "Source Han Sans JP Regular", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "722cfbdcc0fd83fe07a3d1b10e9e64343c924a351d02cfe8dbb6ec4c6bc38230", "size": 5723960, }, "SourceHanSansKR-Bold.ttf": { "ascent": 1160, "bold": 1, "descent": -288, "encoding_length": 2, "file_name": "SourceHanSansKR-Bold.ttf", "font_name": "Source Han Sans KR Bold", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "02959eb2c1eea0786a736aeb50b6e61f2ab873cd69c659389b7511f80f734838", "size": 5858892, }, "SourceHanSansKR-Regular.ttf": { "ascent": 1160, "bold": 0, "descent": -288, "encoding_length": 2, "file_name": "SourceHanSansKR-Regular.ttf", "font_name": "Source Han Sans KR Regular", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "aba70109eff718e8f796f0185f8dca38026c1661b43c195883c84577e501adf2", "size": 5961704, }, "SourceHanSansTW-Bold.ttf": { "ascent": 1160, "bold": 1, "descent": -288, "encoding_length": 2, "file_name": "SourceHanSansTW-Bold.ttf", "font_name": "Source Han Sans TW Bold", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "4a92730e644a1348e87bba7c77e9b462f257f381bd6abbeac5860d8f8306aee6", "size": 6883224, }, "SourceHanSansTW-Regular.ttf": { "ascent": 1160, "bold": 0, "descent": -288, 
"encoding_length": 2, "file_name": "SourceHanSansTW-Regular.ttf", "font_name": "Source Han Sans TW Regular", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "6129b68ff4b0814624cac7edca61fbacf8f4d79db6f4c3cfc46b1c48ea2f81ac", "size": 7024812, }, "SourceHanSerifCN-Bold.ttf": { "ascent": 1150, "bold": 1, "descent": -286, "encoding_length": 2, "file_name": "SourceHanSerifCN-Bold.ttf", "font_name": "Source Han Serif CN Bold", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "77816a54957616e140e25a36a41fc061ddb505a1107de4e6a65f561e5dcf8310", "size": 14134156, }, "SourceHanSerifCN-Regular.ttf": { "ascent": 1150, "bold": 0, "descent": -286, "encoding_length": 2, "file_name": "SourceHanSerifCN-Regular.ttf", "font_name": "Source Han Serif CN Regular", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "c8bf74da2c3b7457c9d887465b42fb6f80d3d84f361cfe5b0673a317fb1f85ad", "size": 14047768, }, "SourceHanSerifHK-Bold.ttf": { "ascent": 1150, "bold": 1, "descent": -286, "encoding_length": 2, "file_name": "SourceHanSerifHK-Bold.ttf", "font_name": "Source Han Serif HK Bold", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "0f81296f22846b622a26f7342433d6c5038af708a32fc4b892420c150227f4bb", "size": 9532580, }, "SourceHanSerifHK-Regular.ttf": { "ascent": 1150, "bold": 0, "descent": -286, "encoding_length": 2, "file_name": "SourceHanSerifHK-Regular.ttf", "font_name": "Source Han Serif HK Regular", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "d5232ec3adf4fb8604bb4779091169ec9bd9d574b513e4a75752e614193afebe", "size": 9467292, }, "SourceHanSerifJP-Bold.ttf": { "ascent": 1150, "bold": 1, "descent": -286, "encoding_length": 2, "file_name": "SourceHanSerifJP-Bold.ttf", "font_name": "Source Han Serif JP Bold", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "a4a8c22e8ec7bb6e66b9caaff1e12c7a52b5a4201eec3d074b35957c0126faef", "size": 7811832, }, "SourceHanSerifJP-Regular.ttf": { "ascent": 1150, "bold": 0, "descent": -286, "encoding_length": 2, "file_name": "SourceHanSerifJP-Regular.ttf", "font_name": "Source Han Serif JP Regular", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "3d1f9933c7f3abc8c285e317119a533e6dcfe6027d1f5f066ba71b3eb9161e9c", "size": 7748816, }, "SourceHanSerifKR-Bold.ttf": { "ascent": 1150, "bold": 1, "descent": -286, "encoding_length": 2, "file_name": "SourceHanSerifKR-Bold.ttf", "font_name": "Source Han Serif KR Bold", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "b071b1aecb042aa779e1198767048438dc756d0da8f90660408abb421393f5cb", "size": 12387920, }, "SourceHanSerifKR-Regular.ttf": { "ascent": 1150, "bold": 0, "descent": -286, "encoding_length": 2, "file_name": "SourceHanSerifKR-Regular.ttf", "font_name": "Source Han Serif KR Regular", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "a85913439f0a49024ca77c02dfede4318e503ee6b2b7d8fef01eb42435f27b61", "size": 12459924, }, "SourceHanSerifTW-Bold.ttf": { "ascent": 1150, "bold": 1, "descent": -286, "encoding_length": 2, "file_name": "SourceHanSerifTW-Bold.ttf", "font_name": "Source Han Serif TW Bold", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "562eea88895ab79ffefab7eabb4d322352a7b1963764c524c6d5242ca456bb6e", "size": 9551724, }, "SourceHanSerifTW-Regular.ttf": { "ascent": 1150, "bold": 0, "descent": -286, "encoding_length": 2, "file_name": "SourceHanSerifTW-Regular.ttf", "font_name": "Source Han Serif TW Regular", "italic": 0, "monospace": 0, "serif": 1, "sha3_256": "85c1d6460b2e169b3d53ac60f6fb7a219fb99923027d78fb64b679475e2ddae4", "size": 9486772, }, } FONT_NAMES = {v["font_name"] for v in 
EMBEDDING_FONT_METADATA.values()} CN_FONT_FAMILY = { # script (handwriting) fonts "script": [ "LXGWWenKaiGB-Regular.ttf", ], # body text fonts "normal": [ "SourceHanSerifCN-Bold.ttf", "SourceHanSerifCN-Regular.ttf", "SourceHanSansCN-Bold.ttf", "SourceHanSansCN-Regular.ttf", ], # fallback fonts "fallback": [ "GoNotoKurrent-Regular.ttf", "GoNotoKurrent-Bold.ttf", ], "base": ["SourceHanSansCN-Regular.ttf"], } HK_FONT_FAMILY = { "script": ["LXGWWenKaiTC-Regular.ttf"], "normal": [ "SourceHanSerifHK-Bold.ttf", "SourceHanSerifHK-Regular.ttf", "SourceHanSansHK-Bold.ttf", "SourceHanSansHK-Regular.ttf", ], "fallback": [ "GoNotoKurrent-Regular.ttf", "GoNotoKurrent-Bold.ttf", ], "base": ["SourceHanSansCN-Regular.ttf"], } TW_FONT_FAMILY = { "script": ["LXGWWenKaiTC-Regular.ttf"], "normal": [ "SourceHanSerifTW-Bold.ttf", "SourceHanSerifTW-Regular.ttf", "SourceHanSansTW-Bold.ttf", "SourceHanSansTW-Regular.ttf", ], "fallback": [ "GoNotoKurrent-Regular.ttf", "GoNotoKurrent-Bold.ttf", ], "base": ["SourceHanSansCN-Regular.ttf"], } KR_FONT_FAMILY = { "script": ["MaruBuri-Regular.ttf"], "normal": [ "SourceHanSerifKR-Bold.ttf", "SourceHanSerifKR-Regular.ttf", "SourceHanSansKR-Bold.ttf", "SourceHanSansKR-Regular.ttf", ], "fallback": [ "GoNotoKurrent-Regular.ttf", "GoNotoKurrent-Bold.ttf", ], "base": ["SourceHanSansCN-Regular.ttf"], } JP_FONT_FAMILY = { "script": ["KleeOne-Regular.ttf"], "normal": [ "SourceHanSerifJP-Bold.ttf", "SourceHanSerifJP-Regular.ttf", "SourceHanSansJP-Bold.ttf", "SourceHanSansJP-Regular.ttf", ], "fallback": [ "GoNotoKurrent-Regular.ttf", "GoNotoKurrent-Bold.ttf", ], "base": ["SourceHanSansCN-Regular.ttf"], } EN_FONT_FAMILY = { "script": [ "NotoSans-Italic.ttf", "NotoSans-BoldItalic.ttf", "NotoSerif-Italic.ttf", "NotoSerif-BoldItalic.ttf", ], "normal": [ "NotoSerif-Regular.ttf", "NotoSerif-Bold.ttf", "NotoSans-Regular.ttf", "NotoSans-Bold.ttf", ], "fallback": [ "GoNotoKurrent-Regular.ttf", "GoNotoKurrent-Bold.ttf", ], "base": [ "NotoSans-Regular.ttf", ], } ALL_FONT_FAMILY = { "CN": CN_FONT_FAMILY, "TW": TW_FONT_FAMILY, "HK": HK_FONT_FAMILY, "KR": KR_FONT_FAMILY, "JP": JP_FONT_FAMILY, "EN": EN_FONT_FAMILY, } def __add_fallback_to_font_family(): for lang1, family1 in ALL_FONT_FAMILY.items(): added_font = set() for font in itertools.chain.from_iterable(family1.values()): added_font.add(font) for lang2, family2 in ALL_FONT_FAMILY.items(): if lang1 != lang2: for type_ in family1: for font in family2[type_]: if font not in added_font: family1[type_].append(font) added_font.add(font) __add_fallback_to_font_family() def get_font_family(lang_code: str): lang_code = lang_code.upper() if "KR" in lang_code: font_family = KR_FONT_FAMILY elif "JP" in lang_code: font_family = JP_FONT_FAMILY elif "HK" in lang_code: font_family = HK_FONT_FAMILY elif "TW" in lang_code: font_family = TW_FONT_FAMILY elif "EN" in lang_code: font_family = EN_FONT_FAMILY elif "CN" in lang_code: font_family = CN_FONT_FAMILY else: font_family = EN_FONT_FAMILY verify_font_family(font_family) return font_family def verify_font_family(font_family: str | dict): if isinstance(font_family, str): font_family = ALL_FONT_FAMILY[font_family] for k in font_family: if k not in ["script", "normal", "fallback", "base"]: raise ValueError(f"Invalid font family: {font_family}") for font_file_name in font_family[k]: if font_file_name not in EMBEDDING_FONT_METADATA: raise ValueError(f"Invalid font file: {font_file_name}") if __name__ == "__main__": for k in ALL_FONT_FAMILY: verify_font_family(k) ``` ## /babeldoc/asynchronize/__init__.py ```py path="/babeldoc/asynchronize/__init__.py" import
asyncio import time class Args: def __init__(self, args, kwargs): self.args = args self.kwargs = kwargs class AsyncCallback: def __init__(self): self.queue = asyncio.Queue() self.finished = False self.loop = asyncio.get_event_loop() def step_callback(self, *args, **kwargs): # Whenever a step is called, add to the queue but don't set finished to True, so __anext__ will continue args = Args(args, kwargs) # We have to use the threadsafe call so that it wakes up the event loop, in case it's sleeping: # https://stackoverflow.com/a/49912853/2148718 self.loop.call_soon_threadsafe(self.queue.put_nowait, args) # Add a small delay to release the GIL, ensuring the event loop has time to process messages time.sleep(0.01) def finished_callback(self, *args, **kwargs): # Whenever a finished is called, add to the queue as with step, but also set finished to True, so __anext__ # will terminate after processing the remaining items if self.finished: return self.step_callback(*args, **kwargs) self.finished = True def __await__(self): # Since this implements __anext__, this can return itself return self.queue.get().__await__() def __aiter__(self): # Since this implements __anext__, this can return itself return self async def __anext__(self): # Keep waiting for the queue if a) we haven't finished, or b) if the queue is still full. This lets us finish # processing the remaining items even after we've finished if self.finished and self.queue.empty(): raise StopAsyncIteration result = await self.queue.get() return result ``` ## /babeldoc/const.py ```py path="/babeldoc/const.py" import os import shutil import subprocess from pathlib import Path __version__ = "0.3.27" CACHE_FOLDER = Path.home() / ".cache" / "babeldoc" def get_cache_file_path(filename: str, sub_folder: str | None = None) -> Path: if sub_folder is not None: sub_folder = sub_folder.strip("/") sub_folder_path = CACHE_FOLDER / sub_folder sub_folder_path.mkdir(parents=True, exist_ok=True) return sub_folder_path / filename return CACHE_FOLDER / filename try: git_path = shutil.which("git") if git_path is None: raise FileNotFoundError("git executable not found") two_parent = Path(__file__).resolve().parent.parent md_ = two_parent / "docs" / "README.md" if two_parent.name == "site-packages" or not md_.exists(): raise FileNotFoundError("not in git repo") WATERMARK_VERSION = ( subprocess.check_output( # noqa: S603 [git_path, "describe", "--always"], cwd=Path(__file__).resolve().parent, ) .strip() .decode() ) except (OSError, FileNotFoundError, subprocess.CalledProcessError): WATERMARK_VERSION = f"v{__version__}" TIKTOKEN_CACHE_FOLDER = CACHE_FOLDER / "tiktoken" TIKTOKEN_CACHE_FOLDER.mkdir(parents=True, exist_ok=True) os.environ["TIKTOKEN_CACHE_DIR"] = str(TIKTOKEN_CACHE_FOLDER) ``` ## /babeldoc/converter.py ```py path="/babeldoc/converter.py" import logging import re import unicodedata import numpy as np from pdfminer.converter import PDFConverter from pdfminer.layout import LTChar from pdfminer.layout import LTComponent from pdfminer.layout import LTFigure from pdfminer.layout import LTLine from pdfminer.layout import LTPage from pdfminer.layout import LTText from pdfminer.pdfcolor import PDFColorSpace from pdfminer.pdffont import PDFCIDFont from pdfminer.pdffont import PDFFont from pdfminer.pdffont import PDFUnicodeNotDefined from pdfminer.pdfinterp import PDFGraphicState from pdfminer.pdfinterp import PDFResourceManager from pdfminer.utils import Matrix from pdfminer.utils import apply_matrix_pt from pdfminer.utils import bbox2str from pdfminer.utils import 
matrix2str from pdfminer.utils import mult_matrix from pymupdf import Font from babeldoc.document_il.frontend.il_creater import ILCreater log = logging.getLogger(__name__) class PDFConverterEx(PDFConverter): def __init__( self, rsrcmgr: PDFResourceManager, il_creater: ILCreater | None = None, ) -> None: PDFConverter.__init__(self, rsrcmgr, None, "utf-8", 1, None) self.il_creater = il_creater def begin_page(self, page, ctm) -> None: # override: replace the cropbox (x0, y0, x1, y1) = page.cropbox (x0, y0) = apply_matrix_pt(ctm, (x0, y0)) (x1, y1) = apply_matrix_pt(ctm, (x1, y1)) mediabox = (0, 0, abs(x0 - x1), abs(y0 - y1)) self.il_creater.on_page_media_box( mediabox[0], mediabox[1], mediabox[2], mediabox[3], ) self.il_creater.on_page_number(page.pageno) self.cur_item = LTPage(page.pageno, mediabox) def end_page(self, _page) -> None: # override: return the instruction stream return self.receive_layout(self.cur_item) def begin_figure(self, name, bbox, matrix) -> None: # override: set pageid self._stack.append(self.cur_item) self.cur_item = LTFigure(name, bbox, mult_matrix(matrix, self.ctm)) self.cur_item.pageid = self._stack[-1].pageid def end_figure(self, _: str) -> None: # override: return the instruction stream fig = self.cur_item if not isinstance(self.cur_item, LTFigure): raise ValueError(f"Unexpected item type: {type(self.cur_item)}") self.cur_item = self._stack.pop() self.cur_item.add(fig) return self.receive_layout(fig) def render_char( self, matrix, font, fontsize: float, scaling: float, rise: float, cid: int, ncs, graphicstate: PDFGraphicState, ) -> float: # override: set cid and font try: text = font.to_unichr(cid) if not isinstance(text, str): raise TypeError(f"Expected string, got {type(text)}") except PDFUnicodeNotDefined: text = self.handle_undefined_char(font, cid) textwidth = font.char_width(cid) textdisp = font.char_disp(cid) if not hasattr(font, "xobj_id"): log.debug( f"Font {font.fontname} does not have xobj_id attribute.", ) font_id = "UNKNOW" else: font_id = self.il_creater.current_page_font_name_id_map.get( font.xobj_id, None ) item = AWLTChar( matrix, font, fontsize, scaling, rise, text, textwidth, textdisp, ncs, graphicstate, self.il_creater.xobj_id, font_id, ) self.cur_item.add(item) item.cid = cid # hack: attach the original character code item.font = font # hack: attach the original character font return item.adv class AWLTChar(LTChar): """Actual letter in the text as a Unicode string.""" def __init__( self, matrix: Matrix, font: PDFFont, fontsize: float, scaling: float, rise: float, text: str, textwidth: float, textdisp: float | tuple[float | None, float], ncs: PDFColorSpace, graphicstate: PDFGraphicState, xobj_id: int, font_id: str, ) -> None: LTText.__init__(self) self._text = text self.matrix = matrix self.fontname = font.fontname self.ncs = ncs self.graphicstate = graphicstate self.xobj_id = xobj_id self.adv = textwidth * fontsize * scaling self.aw_font_id = font_id # compute the boundary rectangle.
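# The block below computes the glyph bounding box: for vertical fonts the
# displacement from char_disp() is expressed in 1/1000 em units and the box is
# built around the vertical advance; for horizontal fonts it spans from the
# font descent to descent + fontsize over the horizontal advance. Both corners
# are then mapped through the text matrix and normalized so x0 <= x1, y0 <= y1.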
if font.is_vertical(): # vertical assert isinstance(textdisp, tuple) (vx, vy) = textdisp if vx is None: vx = fontsize * 0.5 else: vx = vx * fontsize * 0.001 vy = (1000 - vy) * fontsize * 0.001 bbox_lower_left = (-vx, vy + rise + self.adv) bbox_upper_right = (-vx + fontsize, vy + rise) else: # horizontal descent = font.get_descent() * fontsize bbox_lower_left = (0, descent + rise) bbox_upper_right = (self.adv, descent + rise + fontsize) (a, b, c, d, e, f) = self.matrix self.upright = a * d * scaling > 0 and b * c <= 0 (x0, y0) = apply_matrix_pt(self.matrix, bbox_lower_left) (x1, y1) = apply_matrix_pt(self.matrix, bbox_upper_right) if x1 < x0: (x0, x1) = (x1, x0) if y1 < y0: (y0, y1) = (y1, y0) LTComponent.__init__(self, (x0, y0, x1, y1)) if font.is_vertical() or matrix[0] == 0: self.size = self.width else: self.size = self.height return def __repr__(self) -> str: return f"<{self.__class__.__name__} {bbox2str(self.bbox)} matrix={matrix2str(self.matrix)} font={self.fontname!r} adv={self.adv} text={self.get_text()!r}>" def get_text(self) -> str: return self._text class Paragraph: def __init__(self, y, x, x0, x1, size, brk): self.y: float = y # initial y coordinate self.x: float = x # initial x coordinate self.x0: float = x0 # left boundary self.x1: float = x1 # right boundary self.size: float = size # font size self.brk: bool = brk # line-break flag # fmt: off class TranslateConverter(PDFConverterEx): def __init__( self, rsrcmgr, vfont: str | None = None, vchar: str | None = None, thread: int = 0, layout: dict | None = None, lang_in: str = "", # parameter kept but marked as unused _lang_out: str = "", # changed to an unused parameter _service: str = "", # changed to an unused parameter resfont: str = "", noto: Font | None = None, envs: dict | None = None, _prompt: list | None = None, # changed to an unused parameter il_creater: ILCreater | None = None, ): layout = layout or {} super().__init__(rsrcmgr, il_creater) self.vfont = vfont self.vchar = vchar self.thread = thread self.layout = layout self.resfont = resfont self.noto = noto def receive_layout(self, ltpage: LTPage): # paragraphs sstk: list[str] = [] # paragraph text stack pstk: list[Paragraph] = [] # paragraph property stack vbkt: int = 0 # formula bracket counter within the paragraph # formula group vstk: list[LTChar] = [] # formula symbol group vlstk: list[LTLine] = [] # formula line group vfix: float = 0 # formula vertical offset # formula group stacks var: list[list[LTChar]] = [] # formula symbol group stack varl: list[list[LTLine]] = [] # formula line group stack varf: list[float] = [] # formula vertical offset stack vlen: list[float] = [] # formula width stack # global lstk: list[LTLine] = [] # global line stack xt: LTChar = None # previous character xt_cls: int = -1 # class of the previous character; ensures the first character always starts a new paragraph, whatever its class vmax: float = ltpage.width / 4 # maximum width of an inline formula ops: str = "" # rendering result def vflag(font: str, char: str): # match formula (and sub/superscript) fonts if isinstance(font, bytes): # may not decode; convert to str directly font = str(font) font = font.split("+")[-1] # truncate the font name if re.match(r"\(cid:", char): return True # decision based on font-name rules if self.vfont: if re.match(self.vfont, font): return True else: if re.match( # LaTeX fonts r"(CM[^R]|(MS|XY|MT|BL|RM|EU|LA|RS)[A-Z]|LINE|LCIRCLE|TeX-|rsfs|txsy|wasy|stmary|.*Mono|.*Code|.*Ital|.*Sym|.*Math)", font, ): return True # decision based on character-set rules if self.vchar: if re.match(self.vchar, char): return True else: if ( char and char != " " # not a space and ( unicodedata.category(char[0]) in ["Lm", "Mn", "Sk", "Sm", "Zl", "Zp", "Zs"] # modifier letters, math symbols, separators or ord(char[0]) in range(0x370, 0x400) # Greek letters ) ): return True return False
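# Overview of the parsing loop below: characters arrive in page order;
# sstk/pstk accumulate paragraph text and geometry while vstk/vlstk collect
# the glyphs and lines of the formula currently being read. When a formula
# ends, it is frozen into var/varl/varf and replaced in the paragraph text by
# a "{vN}" placeholder, which the typesetting pass (section C) expands again.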
############################################################ # A. Parse the original document for child in ltpage: if isinstance(child, LTChar): try: self.il_creater.on_lt_char(child) except Exception: log.exception( 'Error processing LTChar', ) continue cur_v = False layout = self.layout[ltpage.pageid] # ltpage.height may be the height inside a figure; use layout.shape uniformly h, w = layout.shape # read the class of the current character from the layout cx, cy = np.clip(int(child.x0), 0, w - 1), np.clip(int(child.y0), 0, h - 1) cls = layout[cy, cx] # anchor the position of bullets in the document if child.get_text() == "•": cls = 0 # determine whether the current character belongs to a formula if ( # determine whether the current character belongs to a formula cls == 0 # 1. its class is a reserved region or (cls == xt_cls and len(sstk[-1].strip()) > 1 and child.size < pstk[-1].size * 0.79) # 2. sub/superscript font; subscripts are about 0.76 and capitals about 0.799, so 0.79 splits the difference while allowing for enlarged initial letters or vflag(child.fontname, child.get_text()) # 3. formula font or (child.matrix[0] == 0 and child.matrix[3] == 0) # 4. vertical font ): cur_v = True # determine whether a bracket group belongs to the formula if not cur_v: if vstk and child.get_text() == "(": cur_v = True vbkt += 1 if vbkt and child.get_text() == ")": cur_v = True vbkt -= 1 if ( # determine whether the current formula has ended not cur_v # 1. the current character does not belong to a formula or cls != xt_cls # 2. the current character and the previous one are not in the same paragraph # or (abs(child.x0 - xt.x0) > vmax and cls != 0) # 3. line break within the paragraph; could be a long italic run or an in-paragraph fraction break, so a threshold distinguishes the two # forbid line breaks in pure-formula (code) paragraphs until text starts again, so only two cases exist: # A. pure formula (code) paragraph (anchored at absolute position): sstk[-1]=="" -> sstk[-1]=="{v*}" # B. paragraph starting with text (typeset at relative position): sstk[-1]!="" or (sstk[-1] != "" and abs(child.x0 - xt.x0) > vmax) # since cls==xt_cls==0 implies sstk[-1]=="", there is no need to also check cls!=0 ): if vstk: if ( # correct the formula's vertical offset using the text to its right not cur_v # 1. the current character does not belong to a formula and cls == xt_cls # 2. the current character and the previous one are in the same paragraph and child.x0 > max([vch.x0 for vch in vstk]) # 3. the current character is to the right of the formula ): vfix = vstk[0].y0 - child.y0 if sstk[-1] == "": xt_cls = -1 # forbid appending to a pure-formula paragraph (sstk[-1]=="{v*}"); the new character must still connect with the following ones, so modify the previous character's class instead sstk[-1] += f"{{v{len(var)}}}" var.append(vstk) varl.append(vlstk) varf.append(vfix) vstk = [] vlstk = [] vfix = 0 # the current character is not part of a formula, or is the first character of one if not vstk: if cls == xt_cls: # same paragraph as the previous character if child.x0 > xt.x1 + 1: # add an inline space sstk[-1] += " " elif child.x1 < xt.x0: # add a line-break space and mark that the original paragraph wraps sstk[-1] += " " pstk[-1].brk = True else: # start a new paragraph from the current character sstk.append("") pstk.append(Paragraph(child.y0, child.x0, child.x0, child.x0, child.size, False)) if not cur_v: # push text if ( # correct paragraph properties from the current character child.size > pstk[-1].size / 0.79 # 1. the current character is significantly larger than the paragraph font or len(sstk[-1].strip()) == 1 # 2. the current character is the paragraph's second glyph (allowing for an enlarged initial letter) ) and child.get_text() != " ": # 3. the current character is not a space pstk[-1].y -= child.size - pstk[-1].size # correct the paragraph's initial y coordinate, assuming the top edges of differently sized characters align pstk[-1].size = child.size sstk[-1] += child.get_text() else: # push formula if ( # correct the formula's vertical offset using the text to its left not vstk # 1. the current character is the first of the formula and cls == xt_cls # 2. the current character and the previous one are in the same paragraph and child.x0 > xt.x0 # 3.
the previous character is to the left of the formula ): vfix = child.y0 - xt.y0 vstk.append(child) # update paragraph boundaries; after an in-paragraph line break the line may start with a formula, so handle this outside pstk[-1].x0 = min(pstk[-1].x0, child.x0) pstk[-1].x1 = max(pstk[-1].x1, child.x1) # update the previous character xt = child xt_cls = cls elif isinstance(child, LTFigure): # figures self.il_creater.on_pdf_figure(child) pass elif isinstance(child, LTLine): # lines continue layout = self.layout[ltpage.pageid] # ltpage.height may be the height inside a figure; use layout.shape uniformly h, w = layout.shape # read the class of the current line from the layout cx, cy = np.clip(int(child.x0), 0, w - 1), np.clip(int(child.y0), 0, h - 1) cls = layout[cy, cx] if vstk and cls == xt_cls: # formula line vlstk.append(child) else: # global line lstk.append(child) else: pass return # handle the tail if vstk: # pop the pending formula sstk[-1] += f"{{v{len(var)}}}" var.append(vstk) varl.append(vlstk) varf.append(vfix) log.debug("\n==========[VSTACK]==========\n") for var_id, v in enumerate(var): # compute formula widths l = max([vch.x1 for vch in v]) - v[0].x0 log.debug(f'< {l:.1f} {v[0].x0:.1f} {v[0].y0:.1f} {v[0].cid} {v[0].fontname} {len(varl[var_id])} > v{var_id} = {"".join([ch.get_text() for ch in v])}') vlen.append(l) ############################################################ # B. Paragraph translation log.debug("\n==========[SSTACK]==========\n") news = sstk.copy() ############################################################ # C. Typeset the new document def raw_string(fcur: str, cstk: str): # encode the string if fcur == 'noto': return "".join([f"{self.noto.has_glyph(ord(c)):04x}" for c in cstk]) elif isinstance(self.fontmap[fcur], PDFCIDFont): # decide the encoding length return "".join([f"{ord(c):04x}" for c in cstk]) else: return "".join([f"{ord(c):02x}" for c in cstk]) _x, _y = 0, 0 for para_id, new in enumerate(news): x: float = pstk[para_id].x # paragraph initial x coordinate y: float = pstk[para_id].y # paragraph initial y coordinate x0: float = pstk[para_id].x0 # paragraph left boundary x1: float = pstk[para_id].x1 # paragraph right boundary size: float = pstk[para_id].size # paragraph font size brk: bool = pstk[para_id].brk # paragraph line-break flag cstk: str = "" # current text buffer fcur: str = None # current font ID tx = x fcur_ = fcur ptr = 0 log.debug(f"< {y} {x} {x0} {x1} {size} {brk} > {sstk[para_id]} | {new}") while ptr < len(new): vy_regex = re.match( r"\{\s*v([\d\s]+)\}", new[ptr:], re.IGNORECASE, ) # match the {vn} formula placeholder mod = 0 # text modifier if vy_regex: # load a formula ptr += len(vy_regex.group(0)) try: vid = int(vy_regex.group(1).replace(" ", "")) adv = vlen[vid] except Exception as e: log.debug("Skipping formula placeholder due to: %s", e) continue # the translator may emit an out-of-range formula placeholder if var[vid][-1].get_text() and unicodedata.category(var[vid][-1].get_text()[0]) in ["Lm", "Mn", "Sk"]: # text modifier mod = var[vid][-1].width else: # load text ch = new[ptr] fcur_ = None try: if fcur_ is None and self.fontmap["tiro"].to_unichr(ord(ch)) == ch: fcur_ = "tiro" # default Latin font except Exception: pass if fcur_ is None: fcur_ = self.resfont # default non-Latin font if fcur_ == 'noto': adv = self.noto.char_lengths(ch, size)[0] else: adv = self.fontmap[fcur_].char_width(ord(ch)) * size ptr += 1 if ( # flush the text buffer fcur_ != fcur # 1. the font changed or vy_regex # 2. a formula is being inserted or x + adv > x1 + 0.1 * size # 3.
        _x, _y = 0, 0
        ops = ""  # accumulated PDF text operators; built up with += below
        for para_id, new in enumerate(news):
            x: float = pstk[para_id].x  # paragraph initial x coordinate
            y: float = pstk[para_id].y  # paragraph initial y coordinate
            x0: float = pstk[para_id].x0  # paragraph left bound
            x1: float = pstk[para_id].x1  # paragraph right bound
            size: float = pstk[para_id].size  # paragraph font size
            brk: bool = pstk[para_id].brk  # paragraph line-break flag
            cstk: str = ""  # current text buffer
            fcur: str = None  # current font ID
            tx = x
            fcur_ = fcur
            ptr = 0
            log.debug(f"< {y} {x} {x0} {x1} {size} {brk} > {sstk[para_id]} | {new}")
            while ptr < len(new):
                vy_regex = re.match(
                    r"\{\s*v([\d\s]+)\}",
                    new[ptr:],
                    re.IGNORECASE,
                )  # match a {vN} formula placeholder
                mod = 0  # text modifier
                if vy_regex:  # load a formula
                    ptr += len(vy_regex.group(0))
                    try:
                        vid = int(vy_regex.group(1).replace(" ", ""))
                        adv = vlen[vid]
                    except Exception as e:
                        log.debug("Skipping formula placeholder due to: %s", e)
                        continue  # the translator may emit an out-of-range formula placeholder
                    if var[vid][-1].get_text() and unicodedata.category(var[vid][-1].get_text()[0]) in ["Lm", "Mn", "Sk"]:  # text modifier
                        mod = var[vid][-1].width
                else:  # load text
                    ch = new[ptr]
                    fcur_ = None
                    try:
                        if fcur_ is None and self.fontmap["tiro"].to_unichr(ord(ch)) == ch:
                            fcur_ = "tiro"  # default Latin font
                    except Exception:
                        pass
                    if fcur_ is None:
                        fcur_ = self.resfont  # default non-Latin font
                    if fcur_ == 'noto':
                        adv = self.noto.char_lengths(ch, size)[0]
                    else:
                        adv = self.fontmap[fcur_].char_width(ord(ch)) * size
                    ptr += 1
                if (  # flush the text buffer
                    fcur_ != fcur  # 1. the font changed
                    or vy_regex  # 2. a formula is being inserted
                    or x + adv > x1 + 0.1 * size  # 3. the right bound was reached (a whole line may consist of symbols, so allow for floating-point error)
                ):
                    if cstk:
                        ops += f"/{fcur} {size:f} Tf 1 0 0 1 {tx:f} {y:f} Tm [<{raw_string(fcur, cstk)}>] TJ "
                        cstk = ""
                if brk and x + adv > x1 + 0.1 * size:  # the right bound was reached and the source paragraph wraps
                    x = x0
                    lang_space = {"zh-cn": 1.4, "zh-tw": 1.4, "zh-hans": 1.4, "zh-hant": 1.4, "zh": 1.4, "ja": 1.1, "ko": 1.2, "en": 1.2, "ar": 1.0, "ru": 0.8, "uk": 0.8, "ta": 0.8}
                    # y -= size * lang_space.get(self.translator.lang_out.lower(), 1.1)  # most other languages fit 1.1 well
                    y -= size * 1.4
                if vy_regex:  # insert the formula
                    fix = 0
                    if fcur is not None:  # correct the vertical offset of an in-paragraph formula
                        fix = varf[vid]
                    for vch in var[vid]:  # typeset the formula's characters
                        vc = chr(vch.cid)
                        ops += f"/{self.fontid[vch.font]} {vch.size:f} Tf 1 0 0 1 {x + vch.x0 - var[vid][0].x0:f} {fix + y + vch.y0 - var[vid][0].y0:f} Tm <{raw_string(self.fontid[vch.font], vc)}> TJ "
                        if log.isEnabledFor(logging.DEBUG):
                            lstk.append(LTLine(0.1, (_x, _y), (x + vch.x0 - var[vid][0].x0, fix + y + vch.y0 - var[vid][0].y0)))
                            _x, _y = x + vch.x0 - var[vid][0].x0, fix + y + vch.y0 - var[vid][0].y0
                    for l in varl[vid]:  # typeset the formula's lines
                        if l.linewidth < 5:  # hack: some documents use thick lines as image backgrounds
                            ops += f"ET q 1 0 0 1 {l.pts[0][0] + x - var[vid][0].x0:f} {l.pts[0][1] + fix + y - var[vid][0].y0:f} cm [] 0 d 0 J {l.linewidth:f} w 0 0 m {l.pts[1][0] - l.pts[0][0]:f} {l.pts[1][1] - l.pts[0][1]:f} l S Q BT "
                else:  # append to the text buffer
                    if not cstk:  # start of a line
                        tx = x
                        if x == x0 and ch == " ":  # drop the space coming from a paragraph line break
                            adv = 0
                        else:
                            cstk += ch
                    else:
                        cstk += ch
                adv -= mod  # text modifier
                fcur = fcur_
                x += adv
                if log.isEnabledFor(logging.DEBUG):
                    lstk.append(LTLine(0.1, (_x, _y), (x, y)))
                    _x, _y = x, y
            # handle the tail
            if cstk:
                ops += f"/{fcur} {size:f} Tf 1 0 0 1 {tx:f} {y:f} Tm <{raw_string(fcur, cstk)}> TJ "
        for l in lstk:  # typeset the global lines
            if l.linewidth < 5:  # hack: some documents use thick lines as image backgrounds
                ops += f"ET q 1 0 0 1 {l.pts[0][0]:f} {l.pts[0][1]:f} cm [] 0 d 0 J {l.linewidth:f} w 0 0 m {l.pts[1][0] - l.pts[0][0]:f} {l.pts[1][1] - l.pts[0][1]:f} l S Q BT "
        ops = f"BT {ops}ET "
        return ops
```
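The typesetting pass above communicates with the paragraph stage through `{vN}` placeholders: every detected formula is flushed from `vstk` into `var`, and the paragraph text keeps only a marker such as `{v0}` that the layout loop later expands back into positioned glyphs. Below is a minimal, self-contained sketch of that placeholder round-trip; it reuses the loop's regex, while the sample strings are invented for illustration.

```py
import re

# Same pattern as the typesetting loop; tolerant of stray spaces ("{ v12 }", "{v 3}")
PLACEHOLDER = re.compile(r"\{\s*v([\d\s]+)\}", re.IGNORECASE)

for text in ["{v0}", "{ v12 }", "{v 3}", "plain text"]:
    m = PLACEHOLDER.match(text)
    if m:
        # Mirror the loop: strip spaces before parsing the formula index
        vid = int(m.group(1).replace(" ", ""))
        print(f"{text!r} -> formula #{vid}")
    else:
        print(f"{text!r} -> ordinary text")
```

Out-of-range indices are skipped at load time (the `try`/`except` around `vlen[vid]`), since a translator occasionally emits a placeholder that was never assigned.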
"PdfSameStyleUnicodeCharacters", "PdfStyle", "PdfXobject", "VisualBbox", ] ``` ## /babeldoc/document_il/babeldoc_exception/BabelDOCException.py ```py path="/babeldoc/document_il/babeldoc_exception/BabelDOCException.py" class ScannedPDFError(Exception): def __init__(self, message): super().__init__(message) ``` ## /babeldoc/document_il/backend/__init__.py ```py path="/babeldoc/document_il/backend/__init__.py" ``` ## /babeldoc/document_il/backend/pdf_creater.py ```py path="/babeldoc/document_il/backend/pdf_creater.py" import io import itertools import logging import os import re import time import unicodedata from multiprocessing import Process from pathlib import Path import freetype import pymupdf from bitstring import BitStream from babeldoc.assets.embedding_assets_metadata import FONT_NAMES from babeldoc.document_il import il_version_1 from babeldoc.document_il.utils.fontmap import FontMapper from babeldoc.document_il.utils.zstd_helper import zstd_decompress from babeldoc.translation_config import TranslateResult from babeldoc.translation_config import TranslationConfig from babeldoc.translation_config import WatermarkOutputMode logger = logging.getLogger(__name__) SUBSET_FONT_STAGE_NAME = "Subset font" SAVE_PDF_STAGE_NAME = "Save PDF" def to_int(src): return int(re.search(r"\d+", src).group(0)) def parse_mapping(text): mapping = [] for x in re.finditer(rb"<(?P[a-fA-F0-9]+)>", text): mapping.append(int(x.group("num"), 16)) return mapping def apply_normalization(cmap, gid, code): need = False if 0x2F00 <= code <= 0x2FD5: # Kangxi Radicals need = True if 0xF900 <= code <= 0xFAFF: # CJK Compatibility Ideographs need = True if need: norm = unicodedata.normalize("NFD", chr(code)) cmap[gid] = ord(norm) else: cmap[gid] = code def batched(iterable, n, *, strict=False): # batched('ABCDEFG', 3) → ABC DEF G if n < 1: raise ValueError("n must be at least one") iterator = iter(iterable) while batch := tuple(itertools.islice(iterator, n)): if strict and len(batch) != n: raise ValueError("batched(): incomplete batch") yield batch def update_tounicode_cmap_pair(cmap, data): for start, stop, value in batched(data, 3): for gid in range(start, stop + 1): code = value + gid - start apply_normalization(cmap, gid, code) def update_tounicode_cmap_code(cmap, data): for gid, code in batched(data, 2): apply_normalization(cmap, gid, code) def parse_tounicode_cmap(data): cmap = {} for x in re.finditer( rb"\s+beginbfrange\s*(?P(<[0-9a-fA-F]+>\s*)+)endbfrange\s+", data ): update_tounicode_cmap_pair(cmap, parse_mapping(x.group("r"))) for x in re.finditer( rb"\s+beginbfchar\s*(?P(<[0-9a-fA-F]+>\s*)+)endbfchar", data ): update_tounicode_cmap_code(cmap, parse_mapping(x.group("c"))) return cmap def parse_truetype_data(data): glyph_in_use = [] face = freetype.Face(io.BytesIO(data)) for i in range(face.num_glyphs): face.load_glyph(i) if face.glyph.outline.contours: glyph_in_use.append(i) return glyph_in_use TOUNICODE_HEAD = """\ /CIDInit /ProcSet findresource begin 12 dict begin /CIDSystemInfo <> def /CMapName /Adobe-Identity-UCS def /CMapType 2 def 1 begincodespacerange <0000> endcodespacerange""" TOUNICODE_TAIL = """\ endcmap CMapName currentdict /CMap defineresource pop end end""" def make_tounicode(cmap, used): short = [] for x in used: if x in cmap: short.append((x, cmap[x])) line = [TOUNICODE_HEAD] for block in batched(short, 100): line.append(f"{len(block)} beginbfchar") for glyph, code in block: if code < 0x10000: line.append(f"<{glyph:04x}><{code:04x}>") else: line.append(f"<{glyph:04x}><{code:08x}>") 
line.append("endbfchar") line.append(TOUNICODE_TAIL) return "\n".join(line) def reproduce_one_font(doc, index): m = doc.xref_get_key(index, "ToUnicode") f = doc.xref_get_key(index, "DescendantFonts") if m[0] == "xref" and f[0] == "array": mi = to_int(m[1]) fi = to_int(f[1]) ff = doc.xref_get_key(fi, "FontDescriptor/FontFile2") ms = doc.xref_stream(mi) fs = doc.xref_stream(to_int(ff[1])) cmap = parse_tounicode_cmap(ms) used = parse_truetype_data(fs) text = make_tounicode(cmap, used) doc.update_stream(mi, bytes(text, "U8")) def reproduce_cmap(doc): assert doc font_set = set() for page in doc: font_list = page.get_fonts() for font in font_list: if font[1] == "ttf" and font[3] in FONT_NAMES and ".ttf" in font[4]: font_set.add(font) for font in font_set: reproduce_one_font(doc, font[0]) return doc def _subset_fonts_process(pdf_path, output_path): """Function to run in subprocess for font subsetting. Args: pdf_path: Path to the PDF file to subset output_path: Path where to save the result """ try: pdf = pymupdf.open(pdf_path) pdf.subset_fonts(fallback=False) pdf.save(output_path) # 返回 0 表示成功 os._exit(0) except Exception as e: logger.error(f"Error in font subsetting subprocess: {e}") # 返回 1 表示失败 os._exit(1) def _save_pdf_clean_process( pdf_path, output_path, garbage=1, deflate=True, clean=True, deflate_fonts=True, linear=False, ): """Function to run in subprocess for saving PDF with clean=True which can be time-consuming. Args: pdf_path: Path to the PDF file to save output_path: Path where to save the result garbage: Garbage collection level (0, 1, 2, 3, 4) deflate: Whether to deflate the PDF clean: Whether to clean the PDF deflate_fonts: Whether to deflate fonts linear: Whether to linearize the PDF """ try: pdf = pymupdf.open(pdf_path) pdf.save( output_path, garbage=garbage, deflate=deflate, clean=clean, deflate_fonts=deflate_fonts, linear=linear, ) # 返回 0 表示成功 os._exit(0) except Exception as e: logger.error(f"Error in save PDF with clean=True subprocess: {e}") # 返回 1 表示失败 os._exit(1) class PDFCreater: stage_name = "Generate drawing instructions" def __init__( self, original_pdf_path: str, document: il_version_1.Document, translation_config: TranslationConfig, mediabox_data: dict, ): self.original_pdf_path = original_pdf_path self.docs = document self.font_path = translation_config.font self.font_mapper = FontMapper(translation_config) self.translation_config = translation_config self.mediabox_data = mediabox_data def render_graphic_state( self, draw_op: BitStream, graphic_state: il_version_1.GraphicState, ): if graphic_state is None: return # if graphic_state.stroking_color_space_name: # draw_op.append( # f"/{graphic_state.stroking_color_space_name} CS \n".encode() # ) # if graphic_state.non_stroking_color_space_name: # draw_op.append( # f"/{graphic_state.non_stroking_color_space_name}" # f" cs \n".encode() # ) # if graphic_state.ncolor is not None: # if len(graphic_state.ncolor) == 1: # draw_op.append(f"{graphic_state.ncolor[0]} g \n".encode()) # elif len(graphic_state.ncolor) == 3: # draw_op.append( # f"{' '.join((str(x) for x in graphic_state.ncolor))} sc \n".encode() # ) # if graphic_state.scolor is not None: # if len(graphic_state.scolor) == 1: # draw_op.append(f"{graphic_state.scolor[0]} G \n".encode()) # elif len(graphic_state.scolor) == 3: # draw_op.append( # f"{' '.join((str(x) for x in graphic_state.scolor))} SC \n".encode() # ) if graphic_state.passthrough_per_char_instruction: draw_op.append( f"{graphic_state.passthrough_per_char_instruction} \n".encode(), ) def 
render_paragraph_to_char( self, paragraph: il_version_1.PdfParagraph, ) -> list[il_version_1.PdfCharacter]: chars = [] for composition in paragraph.pdf_paragraph_composition: if not isinstance(composition.pdf_character, il_version_1.PdfCharacter): logger.error( f"Unknown composition type. " f"This type only appears in the IL " f"after the translation is completed." f"During pdf rendering, this type is not supported." f"Composition: {composition}. " f"Paragraph: {paragraph}. ", ) continue chars.append(composition.pdf_character) if not chars and paragraph.unicode: logger.error( f"Unable to export paragraphs that have " f"not yet been formatted: {paragraph}", ) return chars return chars def get_available_font_list(self, pdf, page): page_xref_id = pdf[page.page_number].xref return self.get_xobj_available_fonts(page_xref_id, pdf) def get_xobj_available_fonts(self, page_xref_id, pdf): try: resources_type, r_id = pdf.xref_get_key(page_xref_id, "Resources") if resources_type == "xref": resource_xref_id = re.search("(\\d+) 0 R", r_id).group(1) r_id = pdf.xref_object(int(resource_xref_id)) resources_type = "dict" if resources_type == "dict": xref_id = re.search("/Font (\\d+) 0 R", r_id) if xref_id is not None: xref_id = xref_id.group(1) font_dict = pdf.xref_object(int(xref_id)) else: search = re.search("/Font *<<(.+?)>>", r_id.replace("\n", " ")) if search is None: # Have resources but no fonts return set() font_dict = search.group(1) else: r_id = int(r_id.split(" ")[0]) _, font_dict = pdf.xref_get_key(r_id, "Font") fonts = re.findall("/([^ ]+?) ", font_dict) return set(fonts) except Exception: return set() def _render_rectangle( self, draw_op: BitStream, rectangle: il_version_1.PdfRectangle, line_width: float = 1, ): """Draw a rectangle in PDF for visualization purposes. Args: draw_op: BitStream to append PDF drawing operations rectangle: Rectangle object containing position information line_width: Line width """ x1 = rectangle.box.x y1 = rectangle.box.y x2 = rectangle.box.x2 y2 = rectangle.box.y2 width = x2 - x1 height = y2 - y1 # Save graphics state draw_op.append(b"q ") # Set green color for debug visibility draw_op.append( rectangle.graphic_state.passthrough_per_char_instruction.encode(), ) # Green stroke if line_width > 0: draw_op.append(f" {line_width} w ".encode()) # Line width draw_op.append(f"{x1} {y1} {width} {height} re ".encode()) if rectangle.fill_background: draw_op.append(b" f ") else: draw_op.append(b" S ") # Restore graphics state draw_op.append(b"Q\n") def create_side_by_side_dual_pdf( self, original_pdf: pymupdf.Document, translated_pdf: pymupdf.Document, dual_out_path: str, translation_config: TranslationConfig, ) -> pymupdf.Document: """Create a dual PDF with side-by-side pages (original and translation). 
Args: original_pdf: Original PDF document translated_pdf: Translated PDF document dual_out_path: Output path for the dual PDF translation_config: Translation configuration Returns: The created dual PDF document """ # Create a new PDF for side-by-side pages dual = pymupdf.open() page_count = min(original_pdf.page_count, translated_pdf.page_count) for page_id in range(page_count): # Get pages from both PDFs orig_page = original_pdf[page_id] trans_page = translated_pdf[page_id] # Calculate total width and use max height total_width = orig_page.rect.width + trans_page.rect.width max_height = max(orig_page.rect.height, trans_page.rect.height) # Create new page with combined width dual_page = dual.new_page(width=total_width, height=max_height) # Define rectangles for left and right sides left_width = ( orig_page.rect.width if not translation_config.dual_translate_first else trans_page.rect.width ) rect_left = pymupdf.Rect(0, 0, left_width, max_height) rect_right = pymupdf.Rect(left_width, 0, total_width, max_height) # Show pages according to dual_translate_first setting if translation_config.dual_translate_first: # Show translated page on left and original on right rect_left, rect_right = rect_right, rect_left try: # Show original page on left and translated on right (default) dual_page.show_pdf_page( rect_left, original_pdf, page_id, keep_proportion=True, ) except Exception as e: logger.warning( f"Failed to show original page on left and translated on right (default). " f"Page ID: {page_id}. " f"Original PDF: {self.original_pdf_path}. " f"Translated PDF: {translation_config.input_file}. ", exc_info=e, ) try: dual_page.show_pdf_page( rect_right, translated_pdf, page_id, keep_proportion=True, ) except Exception as e: logger.warning( f"Failed to show translated page on left and original on right. " f"Page ID: {page_id}. " f"Original PDF: {self.original_pdf_path}. " f"Translated PDF: {translation_config.input_file}. ", exc_info=e, ) return dual def create_alternating_pages_dual_pdf( self, original_pdf_path: str, translated_pdf: pymupdf.Document, translation_config: TranslationConfig, ) -> pymupdf.Document: """Create a dual PDF with alternating pages (original and translation). 

        Args:
            original_pdf_path: Path to the original PDF
            translated_pdf: Translated PDF document
            translation_config: Translation configuration

        Returns:
            The created dual PDF document
        """
        # Open the original PDF and insert translated PDF
        dual = pymupdf.open(original_pdf_path)
        dual.insert_file(translated_pdf)

        # Rearrange pages to alternate between original and translated.
        # Worked example for a two-page document: after insert_file the order
        # is [O0, O1, T0, T1]; move_page(2, 1) yields [O0, T0, O1, T1] and
        # move_page(3, 3) leaves T1 in place.
        page_count = translated_pdf.page_count
        for page_id in range(page_count):
            if translation_config.dual_translate_first:
                dual.move_page(page_count + page_id, page_id * 2)
            else:
                dual.move_page(page_count + page_id, page_id * 2 + 1)

        return dual

    def write_debug_info(
        self,
        pdf: pymupdf.Document,
        translation_config: TranslationConfig,
    ):
        self.font_mapper.add_font(pdf, self.docs)
        for page in self.docs.page:
            _, r_id = pdf.xref_get_key(pdf[page.page_number].xref, "Contents")
            resource_xref_id = re.search("(\\d+) 0 R", r_id).group(1)
            base_op = pdf.xref_stream(int(resource_xref_id))
            translation_config.raise_if_cancelled()
            xobj_available_fonts = {}
            xobj_draw_ops = {}
            xobj_encoding_length_map = {}
            available_font_list = self.get_available_font_list(pdf, page)

            page_encoding_length_map = {
                f.font_id: f.encoding_length for f in page.pdf_font
            }
            page_op = BitStream()
            # q {ops_base}Q 1 0 0 1 {x0} {y0} cm {ops_new}
            page_op.append(b"q ")
            if base_op is not None:
                page_op.append(base_op)
            page_op.append(b" Q ")
            page_op.append(
                f"q Q 1 0 0 1 {page.cropbox.box.x} {page.cropbox.box.y} cm \n".encode(),
            )
            # collect all characters
            chars = []
            # first add the page-level characters
            if page.pdf_character:
                chars.extend(page.pdf_character)
            # then add the characters from paragraphs
            for paragraph in page.pdf_paragraph:
                chars.extend(self.render_paragraph_to_char(paragraph))
            # render all characters
            for char in chars:
                if not getattr(char, "debug_info", False):
                    continue
                if char.char_unicode == "\n":
                    continue
                if char.pdf_character_id is None:
                    # dummy char
                    continue
                char_size = char.pdf_style.font_size
                font_id = char.pdf_style.font_id
                if font_id not in available_font_list:
                    continue
                draw_op = page_op
                encoding_length_map = page_encoding_length_map
                draw_op.append(b"q ")
                self.render_graphic_state(draw_op, char.pdf_style.graphic_state)
                if char.vertical:
                    draw_op.append(
                        f"BT /{font_id} {char_size:f} Tf 0 1 -1 0 {char.box.x2:f} {char.box.y:f} Tm ".encode(),
                    )
                else:
                    draw_op.append(
                        f"BT /{font_id} {char_size:f} Tf 1 0 0 1 {char.box.x:f} {char.box.y:f} Tm ".encode(),
                    )
                encoding_length = encoding_length_map[font_id]
                # pdf32000-2008 page14:
                # As hexadecimal data enclosed in angle brackets < >
                # see 7.3.4.3, "Hexadecimal Strings."
                draw_op.append(
                    f"<{char.pdf_character_id:0{encoding_length * 2}x}>".upper().encode(),
                )
                draw_op.append(b" Tj ET Q \n")
            for rect in page.pdf_rectangle:
                if not rect.debug_info:
                    continue
                self._render_rectangle(page_op, rect)
            draw_op = page_op
            # Since this is a draw instruction container,
            # no additional information is needed
            pdf.update_stream(int(resource_xref_id), draw_op.tobytes())
        translation_config.raise_if_cancelled()
        # run font subsetting in a subprocess
        if not translation_config.skip_clean:
            pdf = self.subset_fonts_in_subprocess(pdf, translation_config, tag="debug")
        return pdf
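    # PyMuPDF's subset_fonts may take a very long time on some documents, so
    # the helper below runs it in a separate process and abandons it after a
    # 60-second budget, keeping the original (non-subsetted) document instead.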
    @staticmethod
    def subset_fonts_in_subprocess(
        pdf: pymupdf.Document, translation_config: TranslationConfig, tag: str
    ) -> pymupdf.Document:
        """Run font subsetting in a subprocess with timeout.

        Args:
            pdf: The PDF document object
            translation_config: Translation configuration
            tag: Tag used to name the temporary files

        Returns:
            The PDF with subsetted fonts, or the original document if
            subsetting failed or timed out
        """
        original_pdf = pdf

        # Create temporary file paths
        temp_input = str(
            translation_config.get_working_file_path(f"temp_subset_input_{tag}.pdf")
        )
        temp_output = str(
            translation_config.get_working_file_path(f"temp_subset_output_{tag}.pdf")
        )

        # Save PDF to temporary file without subsetting
        pdf.save(temp_input)

        # Create and start subprocess
        process = Process(target=_subset_fonts_process, args=(temp_input, temp_output))
        process.start()

        # Wait for subprocess with timeout (1 minute)
        timeout = 60  # seconds
        start_time = time.time()

        while process.is_alive():
            if time.time() - start_time > timeout:
                logger.warning(
                    f"Font subsetting timeout after {timeout} seconds, terminating subprocess"
                )
                process.terminate()
                try:
                    process.join(5)  # Give it 5 seconds to clean up
                    if process.is_alive():
                        logger.warning("Subprocess did not terminate, killing it")
                        process.kill()
                except Exception as e:
                    logger.error(f"Error terminating font subsetting process: {e}")
                return original_pdf
            time.sleep(0.5)  # Check every half second

        # Process completed, check exit code
        exit_code = process.exitcode
        success = exit_code == 0

        # Check if subsetting was successful
        if (
            success
            and Path(temp_output).exists()
            and Path(temp_output).stat().st_size > 0
        ):
            logger.info("Font subsetting completed successfully")
            return pymupdf.open(temp_output)
        else:
            logger.warning(
                f"Font subsetting failed with exit code {exit_code} or produced empty file"
            )
            return original_pdf
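    # Usage sketch (mirrors the call sites in write() and write_debug_info()):
    #   pdf = PDFCreater.subset_fonts_in_subprocess(pdf, translation_config, tag="mono")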
    @staticmethod
    def save_pdf_with_timeout(
        pdf: pymupdf.Document,
        output_path: str,
        translation_config: TranslationConfig,
        garbage: int = 1,
        deflate: bool = True,
        clean: bool = True,
        deflate_fonts: bool = True,
        linear: bool = False,
        timeout: int = 120,
        tag: str = "",
    ) -> bool:
        """Save a PDF document with a timeout for the clean=True operation.

        Args:
            pdf: The PDF document object
            output_path: Path where to save the PDF
            translation_config: Translation configuration
            garbage: Garbage collection level (0, 1, 2, 3, 4)
            deflate: Whether to deflate the PDF
            clean: Whether to clean the PDF
            deflate_fonts: Whether to deflate fonts
            linear: Whether to linearize the PDF
            timeout: Timeout in seconds (default: 2 minutes)
            tag: Tag used to name the temporary files

        Returns:
            True if saved with clean=True successfully,
            False if fallback to clean=False was used
        """
        # Create temporary file paths
        temp_input = str(
            translation_config.get_working_file_path(f"temp_save_input_{tag}.pdf")
        )
        temp_output = str(
            translation_config.get_working_file_path(f"temp_save_output_{tag}.pdf")
        )

        # Save PDF to temporary file first
        pdf.save(temp_input)

        # Try to save with clean=True in a subprocess
        process = Process(
            target=_save_pdf_clean_process,
            args=(
                temp_input,
                temp_output,
                garbage,
                deflate,
                clean,
                deflate_fonts,
                linear,
            ),
        )
        process.start()

        # Wait for subprocess with timeout
        start_time = time.time()
        while process.is_alive():
            if time.time() - start_time > timeout:
                logger.warning(
                    f"PDF save with clean={clean} timeout after {timeout} seconds, terminating subprocess"
                )
                process.terminate()
                try:
                    process.join(5)  # Give it 5 seconds to clean up
                    if process.is_alive():
                        logger.warning("Subprocess did not terminate, killing it")
                        process.kill()
                except Exception as e:
                    logger.error(f"Error terminating PDF save process: {e}")

                # Fallback to save without clean parameter
                logger.info("Falling back to save with clean=False")
                try:
                    pdf.save(
                        output_path,
                        garbage=garbage,
                        deflate=deflate,
                        clean=False,
                        deflate_fonts=deflate_fonts,
                        linear=linear,
                    )
                    return False
                except Exception as e:
                    logger.error(f"Error in fallback save: {e}")
                    # Last resort: basic save
                    pdf.save(output_path)
                    return False
            time.sleep(0.5)  # Check every half second

        # Process completed, check exit code
        exit_code = process.exitcode
        success = exit_code == 0

        # Check if save was successful
        if (
            success
            and Path(temp_output).exists()
            and Path(temp_output).stat().st_size > 0
        ):
            logger.info(f"PDF save with clean={clean} completed successfully")
            # Copy the successfully created file to the target path
            try:
                import shutil

                shutil.copy2(temp_output, output_path)
                return True
            except Exception as e:
                logger.error(f"Error copying saved PDF: {e}")
                pdf.save(output_path)  # Fallback to direct save
                return False
            finally:
                Path(temp_input).unlink()
                Path(temp_output).unlink()
        else:
            logger.warning(
                f"PDF save with clean={clean} failed with exit code {exit_code} or produced empty file"
            )
            # Fallback to save without clean parameter
            try:
                pdf.save(
                    output_path,
                    garbage=garbage,
                    deflate=deflate,
                    clean=False,
                    deflate_fonts=deflate_fonts,
                    linear=linear,
                )
            except Exception as e:
                logger.error(f"Error in fallback save: {e}")
                # Last resort: basic save
                pdf.save(output_path)
            return False

    def restore_media_box(self, doc: pymupdf.Document, mediabox_data: dict) -> None:
        for pageno, page_box_data in mediabox_data.items():
            for name, box in page_box_data.items():
                try:
                    doc.xref_set_key(doc[pageno].xref, name, box)
                except Exception:
                    logger.error(f"Error restoring media box {name} from PDF")

    def write(
        self, translation_config: TranslationConfig, check_font_exists: bool = False
    ) -> TranslateResult:
        try:
            basename = Path(translation_config.input_file).stem
            debug_suffix = ".debug" if translation_config.debug else ""
            if (
                translation_config.watermark_output_mode
                != WatermarkOutputMode.Watermarked
            ):
                debug_suffix += ".no_watermark"
            mono_out_path = translation_config.get_output_file_path(
                f"{basename}{debug_suffix}.{translation_config.lang_out}.mono.pdf",
            )
            pdf = pymupdf.open(self.original_pdf_path)
            self.font_mapper.add_font(pdf, self.docs)
            with self.translation_config.progress_monitor.stage_start(
                self.stage_name,
                len(self.docs.page),
            ) as pbar:
                for page in self.docs.page:
                    translation_config.raise_if_cancelled()
                    xobj_available_fonts = {}
                    xobj_draw_ops = {}
                    xobj_encoding_length_map = {}
                    available_font_list = self.get_available_font_list(pdf, page)

                    page_encoding_length_map = {
                        f.font_id: f.encoding_length for f in page.pdf_font
                    }
                    all_encoding_length_map = page_encoding_length_map.copy()
                    for xobj in page.pdf_xobject:
                        xobj_available_fonts[xobj.xobj_id] = available_font_list.copy()
                        try:
                            xobj_available_fonts[xobj.xobj_id].update(
                                self.get_xobj_available_fonts(xobj.xref_id, pdf),
                            )
                        except Exception:
                            pass
                        xobj_encoding_length_map[xobj.xobj_id] = {
                            f.font_id: f.encoding_length for f in xobj.pdf_font
                        }
                        all_encoding_length_map.update(
                            xobj_encoding_length_map[xobj.xobj_id]
                        )
                        xobj_encoding_length_map[xobj.xobj_id].update(
                            page_encoding_length_map
                        )
                        xobj_op = BitStream()
                        base_op = xobj.base_operations.value
                        base_op = zstd_decompress(base_op)
                        xobj_op.append(base_op.encode())
                        xobj_draw_ops[xobj.xobj_id] = xobj_op
                    page_op = BitStream()
                    # q {ops_base}Q 1 0 0 1 {x0} {y0} cm {ops_new}
                    # page_op.append(b"q ")
                    base_op = page.base_operations.value
                    base_op = zstd_decompress(base_op)
                    page_op.append(base_op.encode())
                    page_op.append(b" \n")
                    # page_op.append(b" Q ")
                    # page_op.append(
                    #     f"q Q 1 0 0 1 {page.cropbox.box.x} {page.cropbox.box.y} cm \n".encode(),
                    # )
                    # collect all characters
                    chars = []
                    # first add the page-level characters
                    if page.pdf_character:
                        chars.extend(page.pdf_character)
                    # then add the characters from paragraphs
                    for paragraph in page.pdf_paragraph:
                        chars.extend(self.render_paragraph_to_char(paragraph))
                    for rect in page.pdf_rectangle:
                        if (
                            translation_config.ocr_workaround
                            and not rect.debug_info
                            and rect.fill_background
                        ):
                            if rect.xobj_id in xobj_available_fonts:
                                draw_op = xobj_draw_ops[rect.xobj_id]
                            else:
                                draw_op = page_op
                            self._render_rectangle(draw_op, rect, line_width=0.1)
                    # render all characters
                    for char in chars:
                        if char.char_unicode == "\n":
                            continue
                        if char.pdf_character_id is None:
                            # dummy char
                            continue
                        char_size = char.pdf_style.font_size
                        font_id = char.pdf_style.font_id

                        if char.xobj_id in xobj_available_fonts:
                            if (
                                check_font_exists
                                and font_id not in xobj_available_fonts[char.xobj_id]
                            ):
                                continue
                            draw_op = xobj_draw_ops[char.xobj_id]
                            encoding_length_map = xobj_encoding_length_map[char.xobj_id]
                        else:
                            if check_font_exists and font_id not in available_font_list:
                                continue
                            draw_op = page_op
                            encoding_length_map = page_encoding_length_map

                        draw_op.append(b"q ")
                        self.render_graphic_state(draw_op, char.pdf_style.graphic_state)
                        if char.vertical:
                            draw_op.append(
                                f"BT /{font_id} {char_size:f} Tf 0 1 -1 0 {char.box.x2:f} {char.box.y:f} Tm ".encode(),
                            )
                        else:
                            draw_op.append(
                                f"BT /{font_id} {char_size:f} Tf 1 0 0 1 {char.box.x:f} {char.box.y:f} Tm ".encode(),
                            )
                        encoding_length = encoding_length_map.get(font_id, None)
                        if encoding_length is None:
                            if font_id in all_encoding_length_map:
                                encoding_length = all_encoding_length_map[font_id]
                            else:
                                logger.debug(
                                    f"Font {font_id} not found in encoding length map for page {page.page_number}"
                                )
                                continue
                        # pdf32000-2008 page14:
                        # As hexadecimal data enclosed in angle brackets < >
                        # see 7.3.4.3, "Hexadecimal Strings."
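                        # e.g. with encoding_length == 2, character id 0x01A3 is emitted as <01A3>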
draw_op.append( f"<{char.pdf_character_id:0{encoding_length * 2}x}>".upper().encode(), ) draw_op.append(b" Tj ET Q \n") for xobj in page.pdf_xobject: draw_op = xobj_draw_ops[xobj.xobj_id] try: pdf.update_stream(xobj.xref_id, draw_op.tobytes()) except Exception: logger.warning( f"update xref {xobj.xref_id} stream fail, continue" ) # pdf.update_stream(xobj.xref_id, b'') for rect in page.pdf_rectangle: if translation_config.debug and rect.debug_info: self._render_rectangle(page_op, rect) draw_op = page_op op_container = pdf.get_new_xref() # Since this is a draw instruction container, # no additional information is needed pdf.update_object(op_container, "<<>>") pdf.update_stream(op_container, draw_op.tobytes()) pdf[page.page_number].set_contents(op_container) pbar.advance() translation_config.raise_if_cancelled() gc_level = 1 if self.translation_config.ocr_workaround: gc_level = 4 with self.translation_config.progress_monitor.stage_start( SUBSET_FONT_STAGE_NAME, 1, ) as pbar: if not translation_config.skip_clean: pdf = self.subset_fonts_in_subprocess( pdf, translation_config, tag="mono" ) pbar.advance() try: self.restore_media_box(pdf, self.mediabox_data) except Exception: logger.exception("restore media box failed") with self.translation_config.progress_monitor.stage_start( SAVE_PDF_STAGE_NAME, 2, ) as pbar: if not translation_config.no_mono: if translation_config.debug: translation_config.raise_if_cancelled() pdf.save( f"{mono_out_path}.decompressed.pdf", expand=True, pretty=True, ) translation_config.raise_if_cancelled() self.save_pdf_with_timeout( pdf, mono_out_path, translation_config, garbage=gc_level, deflate=True, clean=not translation_config.skip_clean, deflate_fonts=True, linear=False, tag="mono", ) pbar.advance() dual_out_path = None if not translation_config.no_dual: dual_out_path = translation_config.get_output_file_path( f"{basename}{debug_suffix}.{translation_config.lang_out}.dual.pdf", ) translation_config.raise_if_cancelled() original_pdf = pymupdf.open(self.original_pdf_path) translated_pdf = pdf # Choose between alternating pages and side-by-side format # Default to side-by-side if not specified use_alternating_pages = ( translation_config.use_alternating_pages_dual ) if use_alternating_pages: # Create a dual PDF with alternating pages (original and translation) dual = self.create_alternating_pages_dual_pdf( self.original_pdf_path, translated_pdf, translation_config, ) else: # Create a dual PDF with side-by-side pages (original and translation) dual = self.create_side_by_side_dual_pdf( original_pdf, translated_pdf, dual_out_path, translation_config, ) if translation_config.debug: translation_config.raise_if_cancelled() try: dual = self.write_debug_info(dual, translation_config) except Exception: logger.warning( "Failed to write debug info to dual PDF", exc_info=True, ) self.save_pdf_with_timeout( dual, dual_out_path, translation_config, garbage=gc_level, deflate=True, clean=not translation_config.skip_clean, deflate_fonts=True, linear=False, tag="dual", ) if translation_config.debug: translation_config.raise_if_cancelled() dual.save( f"{dual_out_path}.decompressed.pdf", expand=True, pretty=True, ) pbar.advance() return TranslateResult(mono_out_path, dual_out_path) except Exception: logger.exception( "Failed to create PDF: %s", translation_config.input_file, ) if not check_font_exists: return self.write(translation_config, True) raise ``` ## /babeldoc/document_il/frontend/__init__.py ```py path="/babeldoc/document_il/frontend/__init__.py" ``` ## 
/babeldoc/document_il/frontend/il_creater.py ```py path="/babeldoc/document_il/frontend/il_creater.py" import base64 import functools import logging import re from functools import wraps from io import BytesIO from itertools import islice import freetype import pdfminer.pdfinterp import pymupdf from pdfminer.layout import LTChar from pdfminer.layout import LTFigure from pdfminer.pdffont import PDFCIDFont from pdfminer.pdffont import PDFFont from pdfminer.pdfpage import PDFPage as PDFMinerPDFPage from pdfminer.pdftypes import PDFObjRef as PDFMinerPDFObjRef from pdfminer.pdftypes import resolve1 as pdftypes_resolve1 from pdfminer.psparser import PSLiteral from babeldoc.document_il import il_version_1 from babeldoc.document_il.utils import zstd_helper from babeldoc.document_il.utils.style_helper import BLACK from babeldoc.document_il.utils.style_helper import YELLOW from babeldoc.translation_config import TranslationConfig def batched(iterable, n, *, strict=False): # batched('ABCDEFG', 3) → ABC DEF G if n < 1: raise ValueError("n must be at least one") iterator = iter(iterable) while batch := tuple(islice(iterator, n)): if strict and len(batch) != n: raise ValueError("batched(): incomplete batch") yield batch logger = logging.getLogger(__name__) def create_hook(func, hook): @wraps(func) def wrapper(*args, **kwargs): hook(*args, **kwargs) return func(*args, **kwargs) return wrapper def hook_pdfminer_pdf_page_init(*args): attrs = args[3] try: while isinstance(attrs["MediaBox"], PDFMinerPDFObjRef): attrs["MediaBox"] = pdftypes_resolve1(attrs["MediaBox"]) except Exception: logger.exception(f"try to fix mediabox failed: {attrs}") PDFMinerPDFPage.__init__ = create_hook( PDFMinerPDFPage.__init__, hook_pdfminer_pdf_page_init ) def indirect(obj): if isinstance(obj, tuple) and obj[0] == "xref": return int(obj[1].split(" ")[0]) def get_glyph_cbox(face, g): face.load_glyph(g, freetype.FT_LOAD_NO_SCALE) cbox = face.glyph.outline.get_bbox() return cbox.xMin, cbox.yMin, cbox.xMax, cbox.yMax def get_char_cbox(face, idx): g = face.get_char_index(idx) return get_glyph_cbox(face, g) def get_name_cbox(face, name): g = face.get_name_index(name) return get_glyph_cbox(face, g) WinAnsiEncoding = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 8364, 0, 8218, 402, 8222, 8230, 8224, 8225, 710, 8240, 352, 8249, 338, 0, 381, 0, 0, 8216, 8217, 8220, 8221, 8226, 8211, 8212, 732, 8482, 353, 8250, 339, 0, 382, 376, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, ] def parse_font_file(doc, idx, encoding, differences): bbox_list = [] data = doc.xref_stream(idx) face = freetype.Face(BytesIO(data)) scale = 1000 / face.units_per_EM 
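    # freetype reports glyph boxes in font units; 1000 / units_per_EM rescales
    # them to the PDF text-space convention of 1000 units per em
    # (e.g. units_per_EM == 2048 for many TrueType fonts gives scale ≈ 0.488)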
    for charmap in face.charmaps:
        if charmap.encoding_name == "FT_ENCODING_ADOBE_CUSTOM":
            face.select_charmap(freetype.FT_ENCODING_ADOBE_CUSTOM)
            break
    bbox_list = [get_char_cbox(face, x) for x in encoding]
    if differences:
        for code, name in differences:
            bbox_list[code] = get_name_cbox(face, name.encode("U8"))
    norm_bbox_list = [[v * scale for v in box] for box in bbox_list]
    return norm_bbox_list


def parse_encoding(obj_str):
    delta = []
    current = 0
    # Only the "c" (character code) and "n" (glyph name) groups are consumed;
    # the bracket and catch-all groups merely advance the scanner.
    for x in re.finditer(
        r"(?P<s>[\[\]])|(?P<c>\d+)|(?P<n>/[a-zA-Z0-9]+)|(?P<o>.)", obj_str
    ):
        key = x.lastgroup
        val = x.group()
        if key == "c":
            current = int(val)
        if key == "n":
            delta.append((current, val[1:]))
            current += 1
    return delta


def parse_mapping(text):
    mapping = []
    for x in re.finditer(r"<(?P<num>[a-fA-F0-9]+)>", text):
        mapping.append(x.group("num"))
    return mapping


def update_cmap_pair(cmap, data):
    for start_str, stop_str, value_str in batched(data, 3):
        start = int(start_str, 16)
        stop = int(stop_str, 16)
        value = base64.b16decode(value_str).decode("UTF-16-BE")
        for code in range(start, stop + 1):
            cmap[code] = value


def update_cmap_code(cmap, data):
    for code_str, value_str in batched(data, 2):
        code = int(code_str, 16)
        value = base64.b16decode(value_str).decode("UTF-16-BE")
        cmap[code] = value


def parse_cmap(cmap_str):
    cmap = {}
    for x in re.finditer(
        r"\s+beginbfrange\s*(?P<r>(<[0-9a-fA-F]+>\s*)+)endbfrange\s+", cmap_str
    ):
        update_cmap_pair(cmap, parse_mapping(x.group("r")))
    for x in re.finditer(
        r"\s+beginbfchar\s*(?P<c>(<[0-9a-fA-F]+>\s*)+)endbfchar", cmap_str
    ):
        update_cmap_code(cmap, parse_mapping(x.group("c")))
    return cmap


def get_code(cmap, c):
    for k, v in cmap.items():
        if v == c:
            return k
    return -1


def get_bbox(bbox, size, c, x, y):
    x_min, y_min, x_max, y_max = bbox[c]
    factor = 1 / 1000 * size
    x_min = x_min * factor
    y_min = -y_min * factor
    x_max = x_max * factor
    y_max = -y_max * factor
    ll = (x + x_min, y + y_min)
    lr = (x + x_max, y + y_min)
    ul = (x + x_min, y + y_max)
    ur = (x + x_max, y + y_max)
    return pymupdf.Quad(ll, lr, ul, ur)


# Code points of common Unicode space characters
unicode_spaces = [
    "\u0020",  # space
    "\u00a0",  # no-break space
    "\u1680",  # Ogham space mark
    "\u2000",  # en quad
    "\u2001",  # em quad
    "\u2002",  # en space
    "\u2003",  # em space
    "\u2004",  # three-per-em space
    "\u2005",  # four-per-em space
    "\u2006",  # six-per-em space
    "\u2007",  # figure space
    "\u2008",  # punctuation space
    "\u2009",  # thin space
    "\u200a",  # hair space
    "\u202f",  # narrow no-break space
    "\u205f",  # medium mathematical space
    "\u3000",  # ideographic (full-width) space
    "\u200b",  # zero-width space
    "\u2060",  # word joiner (zero-width non-breaking)
    "\t",  # horizontal tab
]

# Build the regular expression
pattern = "^[" + "".join(unicode_spaces) + "]+$"

# Compile the regular expression
space_regex = re.compile(pattern)


class ILCreater:
    stage_name = "Parse PDF and Create Intermediate Representation"

    def __init__(self, translation_config: TranslationConfig):
        self.progress = None
        self.current_page: il_version_1.Page = None
        self.mupdf: pymupdf.Document = None
        self.model = translation_config.doc_layout_model
        self.docs = il_version_1.Document(page=[])
        self.stroking_color_space_name = None
        self.non_stroking_color_space_name = None
        self.passthrough_per_char_instruction: list[tuple[str, str]] = []
        self.translation_config = translation_config
        self.passthrough_per_char_instruction_stack: list[list[tuple[str, str]]] = []
        self.xobj_id = 0
        self.xobj_inc = 0
        self.xobj_map: dict[int, il_version_1.PdfXobject] = {}
        self.xobj_stack = []
        self.current_page_font_name_id_map = {}
        self.current_page_font_char_bounding_box_map = {}
        self.mupdf_font_map: dict[int, pymupdf.Font] = {}
        self.graphic_state_pool = {}

    def on_finish(self):
        self.progress.__exit__(None, None, None)

    def is_passthrough_per_char_operation(self, operator: str):
        return re.match("^(sc|scn|g|rg|k|cs|gs|ri)$", operator, re.IGNORECASE)

    def on_passthrough_per_char(self, operator: str, args: list[str]):
        if not self.is_passthrough_per_char_operation(operator):
            logger.error("Unknown passthrough_per_char operation: %s", operator)
            return
        # logger.debug("xobj_id: %d, on_passthrough_per_char: %s ( %s )", self.xobj_id, operator, args)
        args = [self.parse_arg(arg) for arg in args]
        for _i, value in enumerate(self.passthrough_per_char_instruction.copy()):
            op, arg =
value if op == operator: self.passthrough_per_char_instruction.remove(value) break self.passthrough_per_char_instruction.append((operator, " ".join(args))) pass def remove_latest_passthrough_per_char_instruction(self): if self.passthrough_per_char_instruction: self.passthrough_per_char_instruction.pop() def parse_arg(self, arg: str): if isinstance(arg, PSLiteral): return f"/{arg.name}" if not isinstance(arg, str): return str(arg) return arg def pop_passthrough_per_char_instruction(self): if self.passthrough_per_char_instruction_stack: self.passthrough_per_char_instruction = ( self.passthrough_per_char_instruction_stack.pop() ) else: self.passthrough_per_char_instruction = [] logging.error( "pop_passthrough_per_char_instruction error on page: %s", self.current_page.page_number, ) def push_passthrough_per_char_instruction(self): self.passthrough_per_char_instruction_stack.append( self.passthrough_per_char_instruction.copy(), ) # pdf32000 page 171 def on_stroking_color_space(self, color_space_name): self.stroking_color_space_name = color_space_name def on_non_stroking_color_space(self, color_space_name): self.non_stroking_color_space_name = color_space_name def on_new_stream(self): self.stroking_color_space_name = None self.non_stroking_color_space_name = None self.passthrough_per_char_instruction = [] def push_xobj(self): self.xobj_stack.append( ( self.current_page_font_name_id_map.copy(), self.current_page_font_char_bounding_box_map.copy(), self.xobj_id, ), ) self.current_page_font_name_id_map = {} self.current_page_font_char_bounding_box_map = {} def pop_xobj(self): ( self.current_page_font_name_id_map, self.current_page_font_char_bounding_box_map, self.xobj_id, ) = self.xobj_stack.pop() def on_xobj_begin(self, bbox, xref_id): self.push_passthrough_per_char_instruction() self.push_xobj() self.xobj_inc += 1 self.xobj_id = self.xobj_inc xobject = il_version_1.PdfXobject( box=il_version_1.Box( x=float(bbox[0]), y=float(bbox[1]), x2=float(bbox[2]), y2=float(bbox[3]), ), xobj_id=self.xobj_id, xref_id=xref_id, ) self.current_page.pdf_xobject.append(xobject) self.xobj_map[self.xobj_id] = xobject return self.xobj_id def on_xobj_end(self, xobj_id, base_op): self.pop_passthrough_per_char_instruction() self.pop_xobj() xobj = self.xobj_map[xobj_id] base_op = zstd_helper.zstd_compress(base_op) xobj.base_operations = il_version_1.BaseOperations(value=base_op) self.xobj_inc += 1 def on_page_start(self): self.current_page = il_version_1.Page( pdf_font=[], pdf_character=[], page_layout=[], # currently don't support UserUnit page parameter # pdf32000 page 79 unit="point", ) self.current_page_font_name_id_map = {} self.current_page_font_char_bounding_box_map = {} self.passthrough_per_char_instruction_stack = [] self.xobj_stack = [] self.non_stroking_color_space_name = None self.stroking_color_space_name = None self.docs.page.append(self.current_page) def on_page_end(self): self.progress.advance(1) def on_page_crop_box( self, x0: float | int, y0: float | int, x1: float | int, y1: float | int, ): box = il_version_1.Box(x=float(x0), y=float(y0), x2=float(x1), y2=float(y1)) self.current_page.cropbox = il_version_1.Cropbox(box=box) def on_page_media_box( self, x0: float | int, y0: float | int, x1: float | int, y1: float | int, ): box = il_version_1.Box(x=float(x0), y=float(y0), x2=float(x1), y2=float(y1)) self.current_page.mediabox = il_version_1.Mediabox(box=box) def on_page_number(self, page_number: int): assert isinstance(page_number, int) assert page_number >= 0 self.current_page.page_number = page_number def 
on_page_base_operation(self, operation: str): operation = zstd_helper.zstd_compress(operation) self.current_page.base_operations = il_version_1.BaseOperations(value=operation) def on_page_resource_font(self, font: PDFFont, xref_id: int, font_id: str): font_name = font.fontname if isinstance(font_name, bytes): try: font_name = font_name.decode("utf-8") except UnicodeDecodeError: font_name = "BASE64:" + base64.b64encode(font_name).decode("utf-8") encoding_length = 1 if isinstance(font, PDFCIDFont): try: # pdf 32000:2008 page 273 # Table 118 - Predefined CJK CMap names _, encoding = self.mupdf.xref_get_key(xref_id, "Encoding") if encoding == "/Identity-H" or encoding == "/Identity-V": encoding_length = 2 if encoding == "/WinAnsiEncoding": encoding_length = 1 else: _, to_unicode_id = self.mupdf.xref_get_key(xref_id, "ToUnicode") if to_unicode_id is not None: to_unicode_bytes = self.mupdf.xref_stream( int(to_unicode_id.split(" ")[0]), ) code_range = re.search( b"begincodespacerange\n?.*<(\\d+?)>.*", to_unicode_bytes, ).group(1) encoding_length = len(code_range) // 2 except Exception: if ( font.unicode_map and font.unicode_map.cid2unichr and max(font.unicode_map.cid2unichr.keys()) > 255 ): encoding_length = 2 else: encoding_length = 1 try: if xref_id in self.mupdf_font_map: mupdf_font = self.mupdf_font_map[xref_id] else: mupdf_font = pymupdf.Font( fontbuffer=self.mupdf.extract_font(xref_id)[3] ) mupdf_font.has_glyph = functools.lru_cache(maxsize=10240, typed=True)( mupdf_font.has_glyph, ) bold = mupdf_font.is_bold italic = mupdf_font.is_italic monospaced = mupdf_font.is_monospaced serif = mupdf_font.is_serif self.mupdf_font_map[xref_id] = mupdf_font except Exception: bold = None italic = None monospaced = None serif = None il_font_metadata = il_version_1.PdfFont( name=font_name, xref_id=xref_id, font_id=font_id, encoding_length=encoding_length, bold=bold, italic=italic, monospace=monospaced, serif=serif, ascent=font.ascent, descent=font.descent, pdf_font_char_bounding_box=[], ) try: bbox_list, cmap = self.parse_font_xobj_id(xref_id) font_char_bounding_box_map = {} if not cmap: cmap = {x: x for x in range(257)} for char_id in cmap: if char_id < 0 or char_id >= len(bbox_list): continue bbox = bbox_list[char_id] x, y, x2, y2 = bbox if ( x == 0 and y == 0 and x2 == 500 and y2 == 698 or x == 0 and y == 0 and x2 == 0 and y2 == 0 ): # ignore default bounding box continue il_font_metadata.pdf_font_char_bounding_box.append( il_version_1.PdfFontCharBoundingBox( x=x, y=y, x2=x2, y2=y2, char_id=char_id, ) ) font_char_bounding_box_map[char_id] = bbox if self.xobj_id in self.xobj_map: if self.xobj_id not in self.current_page_font_char_bounding_box_map: self.current_page_font_char_bounding_box_map[self.xobj_id] = {} self.current_page_font_char_bounding_box_map[self.xobj_id][font_id] = ( font_char_bounding_box_map ) else: self.current_page_font_char_bounding_box_map[font_id] = ( font_char_bounding_box_map ) except Exception: pass self.current_page_font_name_id_map[xref_id] = font_id if self.xobj_id in self.xobj_map: self.xobj_map[self.xobj_id].pdf_font.append(il_font_metadata) else: self.current_page.pdf_font.append(il_font_metadata) def parse_font_xobj_id(self, xobj_id: int): bbox_list = [] encoding = list(range(256)) font_encoding = self.mupdf.xref_get_key(xobj_id, "Encoding") if font_encoding[1] == "/WinAnsiEncoding": encoding = WinAnsiEncoding differences = [] font_differences = self.mupdf.xref_get_key(xobj_id, "Encoding/Differences") if font_differences: differences = parse_encoding(font_differences[1]) 
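        # An embedded font program may live under FontFile (Type 1),
        # FontFile2 (TrueType) or FontFile3 (CFF and other subtypes)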
for file_key in ["FontFile", "FontFile2", "FontFile3"]: font_file = self.mupdf.xref_get_key(xobj_id, f"FontDescriptor/{file_key}") if file_idx := indirect(font_file): bbox_list = parse_font_file(self.mupdf, file_idx, encoding, differences) cmap = {} to_unicode = self.mupdf.xref_get_key(xobj_id, "ToUnicode") if to_unicode_idx := indirect(to_unicode): cmap = parse_cmap(self.mupdf.xref_stream(to_unicode_idx).decode("U8")) return bbox_list, cmap def create_graphic_state(self, gs: pdfminer.pdfinterp.PDFGraphicState): graphic_state = il_version_1.GraphicState() for k, v in gs.__dict__.items(): if v is None: continue if k in ["scolor", "ncolor"]: if isinstance(v, tuple): v = list(v) else: v = [v] setattr(graphic_state, k, v) continue if k == "linewidth": graphic_state.linewidth = float(v) continue continue raise NotImplementedError graphic_state.stroking_color_space_name = self.stroking_color_space_name graphic_state.non_stroking_color_space_name = self.non_stroking_color_space_name graphic_state.passthrough_per_char_instruction = " ".join( f"{arg} {op}" for op, arg in gs.passthrough_instruction ) # 可能会影响部分 graphic state 准确度。不过 BabelDOC 仅使用 passthrough_per_char_instruction # 所以应该是没啥影响 # 但是池化 graphic state 后可以减少内存占用 if ( graphic_state.passthrough_per_char_instruction not in self.graphic_state_pool ): self.graphic_state_pool[graphic_state.passthrough_per_char_instruction] = ( graphic_state ) else: graphic_state = self.graphic_state_pool[ graphic_state.passthrough_per_char_instruction ] return graphic_state def on_lt_char(self, char: LTChar): if char.aw_font_id is None: return gs = self.create_graphic_state(char.graphicstate) # Get font from current page or xobject font = None for pdf_font in self.xobj_map.get(self.xobj_id, self.current_page).pdf_font: if pdf_font.font_id == char.aw_font_id: font = pdf_font break # Get descent from font descent = 0 if font and hasattr(font, "descent"): descent = font.descent * char.size / 1000 char_id = char.cid try: if ( font_bounding_box_map := self.current_page_font_char_bounding_box_map.get( self.xobj_id, self.current_page_font_char_bounding_box_map ).get(font.font_id) ): char_bounding_box = font_bounding_box_map.get(char_id, None) else: char_bounding_box = None except Exception: # logger.debug( # "Failed to get font bounding box for char %s", # char.get_text(), # ) char_bounding_box = None char_unicode = char.get_text() if "(cid:" not in char_unicode and len(char_unicode) > 1: return if space_regex.match(char_unicode): char_unicode = " " advance = char.adv bbox = il_version_1.Box( x=char.bbox[0], y=char.bbox[1], x2=char.bbox[2], y2=char.bbox[3], ) if char.matrix[0] == 0 and char.matrix[3] == 0: vertical = True visual_bbox = il_version_1.Box( x=char.bbox[0] - descent, y=char.bbox[1], x2=char.bbox[2] - descent, y2=char.bbox[3], ) else: vertical = False # Add descent to y coordinates visual_bbox = il_version_1.Box( x=char.bbox[0], y=char.bbox[1] + descent, x2=char.bbox[2], y2=char.bbox[3] + descent, ) visual_bbox = il_version_1.VisualBbox(box=visual_bbox) pdf_style = il_version_1.PdfStyle( font_id=char.aw_font_id, font_size=char.size, graphic_state=gs, ) if font: font_xref_id = font.xref_id if font_xref_id in self.mupdf_font_map: mupdf_font = self.mupdf_font_map[font_xref_id] # if "(cid:" not in char_unicode: # if mupdf_cid := mupdf_font.has_glyph(ord(char_unicode)): # char_id = mupdf_cid pdf_char = il_version_1.PdfCharacter( box=bbox, pdf_character_id=char_id, advance=advance, char_unicode=char_unicode, vertical=vertical, pdf_style=pdf_style, 
xobj_id=char.xobj_id, visual_bbox=visual_bbox, ) if self.translation_config.ocr_workaround: pdf_char.pdf_style.graphic_state = BLACK if pdf_style.font_size == 0.0: logger.warning( "Font size is 0.0 for character %s. Skip it.", char_unicode, ) return if char_bounding_box: x_min, y_min, x_max, y_max = char_bounding_box factor = 1 / 1000 * pdf_style.font_size x_min = x_min * factor y_min = y_min * factor x_max = x_max * factor y_max = y_max * factor ll = (char.bbox[0] + x_min, char.bbox[1] + y_min) ur = (char.bbox[0] + x_max, char.bbox[1] + y_max) pdf_char.visual_bbox = il_version_1.VisualBbox( il_version_1.Box(ll[0], ll[1], ur[0], ur[1]) ) self.current_page.pdf_character.append(pdf_char) if self.translation_config.show_char_box: self.current_page.pdf_rectangle.append( il_version_1.PdfRectangle( box=pdf_char.visual_bbox.box, graphic_state=YELLOW, debug_info=True, ) ) def create_il(self): pages = [ page for page in self.docs.page if self.translation_config.should_translate_page(page.page_number + 1) ] self.docs.page = pages return self.docs def on_total_pages(self, total_pages: int): assert isinstance(total_pages, int) assert total_pages > 0 self.docs.total_pages = total_pages total = 0 for page in range(total_pages): if self.translation_config.should_translate_page(page + 1) is False: continue total += 1 self.progress = self.translation_config.progress_monitor.stage_start( self.stage_name, total, ) def on_pdf_figure(self, figure: LTFigure): box = il_version_1.Box( figure.bbox[0], figure.bbox[1], figure.bbox[2], figure.bbox[3], ) self.current_page.pdf_figure.append(il_version_1.PdfFigure(box=box)) ``` ## /babeldoc/document_il/il_version_1.py ```py path="/babeldoc/document_il/il_version_1.py" from dataclasses import dataclass from dataclasses import field @dataclass class BaseOperations: class Meta: name = "baseOperations" value: str = field( default="", metadata={ "required": True, }, ) @dataclass class Box: class Meta: name = "box" x: float | None = field( default=None, metadata={ "type": "Attribute", "required": True, }, ) y: float | None = field( default=None, metadata={ "type": "Attribute", "required": True, }, ) x2: float | None = field( default=None, metadata={ "type": "Attribute", "required": True, }, ) y2: float | None = field( default=None, metadata={ "type": "Attribute", "required": True, }, ) @dataclass class GraphicState: class Meta: name = "graphicState" linewidth: float | None = field( default=None, metadata={ "type": "Attribute", }, ) dash: list[float] = field( default_factory=list, metadata={ "type": "Attribute", "min_length": 1, "tokens": True, }, ) flatness: float | None = field( default=None, metadata={ "type": "Attribute", }, ) intent: str | None = field( default=None, metadata={ "type": "Attribute", }, ) linecap: int | None = field( default=None, metadata={ "type": "Attribute", }, ) linejoin: int | None = field( default=None, metadata={ "type": "Attribute", }, ) miterlimit: float | None = field( default=None, metadata={ "type": "Attribute", }, ) ncolor: list[float] = field( default_factory=list, metadata={ "type": "Attribute", "min_length": 1, "tokens": True, }, ) scolor: list[float] = field( default_factory=list, metadata={ "type": "Attribute", "min_length": 1, "tokens": True, }, ) stroking_color_space_name: str | None = field( default=None, metadata={ "name": "strokingColorSpaceName", "type": "Attribute", }, ) non_stroking_color_space_name: str | None = field( default=None, metadata={ "name": "nonStrokingColorSpaceName", "type": "Attribute", }, ) 
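    # Raw per-character PDF operators (colour, ExtGState, rendering intent, ...)
    # captured verbatim at parse time and replayed before drawing each character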
passthrough_per_char_instruction: str | None = field( default=None, metadata={ "name": "passthroughPerCharInstruction", "type": "Attribute", }, ) @dataclass class PdfFontCharBoundingBox: class Meta: name = "pdfFontCharBoundingBox" x: float | None = field( default=None, metadata={ "type": "Attribute", "required": True, }, ) y: float | None = field( default=None, metadata={ "type": "Attribute", "required": True, }, ) x2: float | None = field( default=None, metadata={ "type": "Attribute", "required": True, }, ) y2: float | None = field( default=None, metadata={ "type": "Attribute", "required": True, }, ) char_id: int | None = field( default=None, metadata={ "type": "Attribute", "required": True, }, ) @dataclass class Cropbox: class Meta: name = "cropbox" box: Box | None = field( default=None, metadata={ "type": "Element", "required": True, }, ) @dataclass class Mediabox: class Meta: name = "mediabox" box: Box | None = field( default=None, metadata={ "type": "Element", "required": True, }, ) @dataclass class PageLayout: class Meta: name = "pageLayout" box: Box | None = field( default=None, metadata={ "type": "Element", "required": True, }, ) id: int | None = field( default=None, metadata={ "type": "Attribute", "required": True, }, ) conf: float | None = field( default=None, metadata={ "type": "Attribute", "required": True, }, ) class_name: str | None = field( default=None, metadata={ "type": "Attribute", "required": True, }, ) @dataclass class PdfFigure: class Meta: name = "pdfFigure" box: Box | None = field( default=None, metadata={ "type": "Element", "required": True, }, ) @dataclass class PdfFont: class Meta: name = "pdfFont" pdf_font_char_bounding_box: list[PdfFontCharBoundingBox] = field( default_factory=list, metadata={ "name": "pdfFontCharBoundingBox", "type": "Element", }, ) name: str | None = field( default=None, metadata={ "type": "Attribute", "required": True, }, ) font_id: str | None = field( default=None, metadata={ "name": "fontId", "type": "Attribute", "required": True, }, ) xref_id: int | None = field( default=None, metadata={ "name": "xrefId", "type": "Attribute", "required": True, }, ) encoding_length: int | None = field( default=None, metadata={ "name": "encodingLength", "type": "Attribute", "required": True, }, ) bold: bool | None = field( default=None, metadata={ "type": "Attribute", }, ) italic: bool | None = field( default=None, metadata={ "type": "Attribute", }, ) monospace: bool | None = field( default=None, metadata={ "type": "Attribute", }, ) serif: bool | None = field( default=None, metadata={ "type": "Attribute", }, ) ascent: float | None = field( default=None, metadata={ "type": "Attribute", }, ) descent: float | None = field( default=None, metadata={ "type": "Attribute", }, ) @dataclass class PdfRectangle: class Meta: name = "pdfRectangle" box: Box | None = field( default=None, metadata={ "type": "Element", "required": True, }, ) graphic_state: GraphicState | None = field( default=None, metadata={ "name": "graphicState", "type": "Element", "required": True, }, ) debug_info: bool | None = field( default=None, metadata={ "type": "Attribute", }, ) fill_background: bool | None = field( default=None, metadata={ "type": "Attribute", }, ) xobj_id: int | None = field( default=None, metadata={ "type": "Attribute", }, ) @dataclass class PdfStyle: class Meta: name = "pdfStyle" graphic_state: GraphicState | None = field( default=None, metadata={ "name": "graphicState", "type": "Element", "required": True, }, ) font_id: str | None = field( default=None, metadata={ "type": 
"Attribute", "required": True, }, ) font_size: float | None = field( default=None, metadata={ "type": "Attribute", "required": True, }, ) @dataclass class VisualBbox: class Meta: name = "visual_bbox" box: Box | None = field( default=None, metadata={ "type": "Element", "required": True, }, ) @dataclass class PdfCharacter: class Meta: name = "pdfCharacter" pdf_style: PdfStyle | None = field( default=None, metadata={ "name": "pdfStyle", "type": "Element", "required": True, }, ) box: Box | None = field( default=None, metadata={ "type": "Element", "required": True, }, ) visual_bbox: VisualBbox | None = field( default=None, metadata={ "type": "Element", }, ) vertical: bool | None = field( default=None, metadata={ "type": "Attribute", }, ) scale: float | None = field( default=None, metadata={ "type": "Attribute", }, ) pdf_character_id: int | None = field( default=None, metadata={ "name": "pdfCharacterId", "type": "Attribute", }, ) char_unicode: str | None = field( default=None, metadata={ "type": "Attribute", "required": True, }, ) advance: float | None = field( default=None, metadata={ "type": "Attribute", }, ) xobj_id: int | None = field( default=None, metadata={ "name": "xobjId", "type": "Attribute", }, ) debug_info: bool | None = field( default=None, metadata={ "type": "Attribute", }, ) @dataclass class PdfSameStyleUnicodeCharacters: class Meta: name = "pdfSameStyleUnicodeCharacters" pdf_style: PdfStyle | None = field( default=None, metadata={ "name": "pdfStyle", "type": "Element", }, ) unicode: str | None = field( default=None, metadata={ "type": "Attribute", "required": True, }, ) debug_info: bool | None = field( default=None, metadata={ "type": "Attribute", }, ) @dataclass class PdfXobject: class Meta: name = "pdfXobject" box: Box | None = field( default=None, metadata={ "type": "Element", "required": True, }, ) pdf_font: list[PdfFont] = field( default_factory=list, metadata={ "name": "pdfFont", "type": "Element", }, ) base_operations: BaseOperations | None = field( default=None, metadata={ "name": "baseOperations", "type": "Element", "required": True, }, ) xobj_id: int | None = field( default=None, metadata={ "name": "xobjId", "type": "Attribute", "required": True, }, ) xref_id: int | None = field( default=None, metadata={ "name": "xrefId", "type": "Attribute", "required": True, }, ) @dataclass class PdfFormula: class Meta: name = "pdfFormula" box: Box | None = field( default=None, metadata={ "type": "Element", "required": True, }, ) pdf_character: list[PdfCharacter] = field( default_factory=list, metadata={ "name": "pdfCharacter", "type": "Element", "min_occurs": 1, }, ) x_offset: float | None = field( default=None, metadata={ "type": "Attribute", "required": True, }, ) y_offset: float | None = field( default=None, metadata={ "type": "Attribute", "required": True, }, ) @dataclass class PdfLine: class Meta: name = "pdfLine" box: Box | None = field( default=None, metadata={ "type": "Element", "required": True, }, ) pdf_character: list[PdfCharacter] = field( default_factory=list, metadata={ "name": "pdfCharacter", "type": "Element", "min_occurs": 1, }, ) @dataclass class PdfSameStyleCharacters: class Meta: name = "pdfSameStyleCharacters" box: Box | None = field( default=None, metadata={ "type": "Element", "required": True, }, ) pdf_style: PdfStyle | None = field( default=None, metadata={ "name": "pdfStyle", "type": "Element", "required": True, }, ) pdf_character: list[PdfCharacter] = field( default_factory=list, metadata={ "name": "pdfCharacter", "type": "Element", "min_occurs": 1, }, ) 
@dataclass
class PdfParagraphComposition:
    class Meta:
        name = "pdfParagraphComposition"

    pdf_line: PdfLine | None = field(
        default=None,
        metadata={
            "name": "pdfLine",
            "type": "Element",
        },
    )
    pdf_formula: PdfFormula | None = field(
        default=None,
        metadata={
            "name": "pdfFormula",
            "type": "Element",
        },
    )
    pdf_same_style_characters: PdfSameStyleCharacters | None = field(
        default=None,
        metadata={
            "name": "pdfSameStyleCharacters",
            "type": "Element",
        },
    )
    pdf_character: PdfCharacter | None = field(
        default=None,
        metadata={
            "name": "pdfCharacter",
            "type": "Element",
        },
    )
    pdf_same_style_unicode_characters: PdfSameStyleUnicodeCharacters | None = field(
        default=None,
        metadata={
            "name": "pdfSameStyleUnicodeCharacters",
            "type": "Element",
        },
    )


@dataclass
class PdfParagraph:
    class Meta:
        name = "pdfParagraph"

    box: Box | None = field(
        default=None,
        metadata={
            "type": "Element",
            "required": True,
        },
    )
    pdf_style: PdfStyle | None = field(
        default=None,
        metadata={
            "name": "pdfStyle",
            "type": "Element",
            "required": True,
        },
    )
    pdf_paragraph_composition: list[PdfParagraphComposition] = field(
        default_factory=list,
        metadata={
            "name": "pdfParagraphComposition",
            "type": "Element",
        },
    )
    xobj_id: int | None = field(
        default=None,
        metadata={
            "name": "xobjId",
            "type": "Attribute",
        },
    )
    unicode: str | None = field(
        default=None,
        metadata={
            "type": "Attribute",
            "required": True,
        },
    )
    scale: float | None = field(
        default=None,
        metadata={
            "type": "Attribute",
        },
    )
    vertical: bool | None = field(
        default=None,
        metadata={
            "type": "Attribute",
        },
    )
    first_line_indent: bool | None = field(
        default=None,
        metadata={
            "name": "FirstLineIndent",
            "type": "Attribute",
        },
    )
    debug_id: str | None = field(
        default=None,
        metadata={
            "type": "Attribute",
        },
    )
    layout_label: str | None = field(
        default=None,
        metadata={
            "type": "Attribute",
        },
    )
    layout_id: int | None = field(
        default=None,
        metadata={
            "type": "Attribute",
        },
    )


@dataclass
class Page:
    class Meta:
        name = "page"

    mediabox: Mediabox | None = field(
        default=None,
        metadata={
            "type": "Element",
            "required": True,
        },
    )
    cropbox: Cropbox | None = field(
        default=None,
        metadata={
            "type": "Element",
            "required": True,
        },
    )
    pdf_xobject: list[PdfXobject] = field(
        default_factory=list,
        metadata={
            "name": "pdfXobject",
            "type": "Element",
        },
    )
    page_layout: list[PageLayout] = field(
        default_factory=list,
        metadata={
            "name": "pageLayout",
            "type": "Element",
        },
    )
    pdf_rectangle: list[PdfRectangle] = field(
        default_factory=list,
        metadata={
            "name": "pdfRectangle",
            "type": "Element",
        },
    )
    pdf_font: list[PdfFont] = field(
        default_factory=list,
        metadata={
            "name": "pdfFont",
            "type": "Element",
        },
    )
    pdf_paragraph: list[PdfParagraph] = field(
        default_factory=list,
        metadata={
            "name": "pdfParagraph",
            "type": "Element",
        },
    )
    pdf_figure: list[PdfFigure] = field(
        default_factory=list,
        metadata={
            "name": "pdfFigure",
            "type": "Element",
        },
    )
    pdf_character: list[PdfCharacter] = field(
        default_factory=list,
        metadata={
            "name": "pdfCharacter",
            "type": "Element",
        },
    )
    base_operations: BaseOperations | None = field(
        default=None,
        metadata={
            "name": "baseOperations",
            "type": "Element",
            "required": True,
        },
    )
    page_number: int | None = field(
        default=None,
        metadata={
            "name": "pageNumber",
            "type": "Attribute",
            "required": True,
        },
    )
    unit: str | None = field(
        default=None,
        metadata={
            "name": "Unit",
            "type": "Attribute",
            "required": True,
        },
    )


@dataclass
class Document:
    class Meta:
        name = "document"

    page: list[Page] = field(
        default_factory=list,
        metadata={
            "type": "Element",
            "min_occurs": 1,
        },
    )
    total_pages: int | None = field(
        default=None,
        metadata={
            "name": "totalPages",
            "type": "Attribute",
            "required": True,
        },
    )
```

## /babeldoc/document_il/il_version_1.rnc

```rnc path="/babeldoc/document_il/il_version_1.rnc"
start = Document

Document = element document {
    Page+,
    attribute totalPages { xsd:int }
}

Page = element page {
    element mediabox { Box },
    element cropbox { Box },
    PDFXobject*,
    PageLayout*,
    PDFRectangle*,
    PDFFont*,
    PDFParagraph*,
    PDFFigure*,
    PDFCharacter*,
    attribute pageNumber { xsd:int },
    attribute Unit { xsd:string },
    element baseOperations { xsd:string }
}

Box = element box {
    # from (x,y) to (x2,y2)
    attribute x { xsd:float },
    attribute y { xsd:float },
    attribute x2 { xsd:float },
    attribute y2 { xsd:float }
}

PDFXrefId = xsd:int

PDFFont = element pdfFont {
    attribute name { xsd:string },
    attribute fontId { xsd:string },
    attribute xrefId { PDFXrefId },
    attribute encodingLength { xsd:int },
    attribute bold { xsd:boolean }?,
    attribute italic { xsd:boolean }?,
    attribute monospace { xsd:boolean }?,
    attribute serif { xsd:boolean }?,
    attribute ascent { xsd:float }?,
    attribute descent { xsd:float }?,
    PDFFontCharBoundingBox*
}

PDFFontCharBoundingBox = element pdfFontCharBoundingBox {
    attribute x { xsd:float },
    attribute y { xsd:float },
    attribute x2 { xsd:float },
    attribute y2 { xsd:float },
    attribute char_id { xsd:int }
}

PDFXobject = element pdfXobject {
    attribute xobjId { xsd:int },
    attribute xrefId { PDFXrefId },
    Box,
    PDFFont*,
    element baseOperations { xsd:string }
}

PDFCharacter = element pdfCharacter {
    attribute vertical { xsd:boolean }?,
    attribute scale { xsd:float }?,
    attribute pdfCharacterId { xsd:int }?,
    attribute char_unicode { xsd:string },
    attribute advance { xsd:float }?,
    # xobject nesting depth
    attribute xobjId { xsd:int }?,
    attribute debug_info { xsd:boolean }?,
    PDFStyle,
    Box,
    element visual_bbox { Box }?
}

PageLayout = element pageLayout {
    attribute id { xsd:int },
    attribute conf { xsd:float },
    attribute class_name { xsd:string },
    Box
}

GraphicState = element graphicState {
    attribute linewidth { xsd:float }?,
    attribute dash { list { xsd:float+ } }?,
    attribute flatness { xsd:float }?,
    attribute intent { xsd:string }?,
    attribute linecap { xsd:int }?,
    attribute linejoin { xsd:int }?,
    attribute miterlimit { xsd:float }?,
    attribute ncolor { list { xsd:float+ } }?,
    attribute scolor { list { xsd:float+ } }?,
    attribute strokingColorSpaceName { xsd:string }?,
    attribute nonStrokingColorSpaceName { xsd:string }?,
    attribute passthroughPerCharInstruction { xsd:string }?
}

PDFStyle = element pdfStyle {
    attribute font_id { xsd:string },
    attribute font_size { xsd:float },
    GraphicState
}

PDFParagraph = element pdfParagraph {
    attribute xobjId { xsd:int }?,
    attribute unicode { xsd:string },
    attribute scale { xsd:float }?,
    attribute vertical { xsd:boolean }?,
    attribute FirstLineIndent { xsd:boolean }?,
    attribute debug_id { xsd:string }?,
    attribute layout_label { xsd:string }?,
    attribute layout_id { xsd:int }?,
    Box,
    PDFStyle,
    PDFParagraphComposition*
}

PDFParagraphComposition = element pdfParagraphComposition {
    PDFLine
    | PDFFormula
    | PDFSameStyleCharacters
    | PDFCharacter
    | PDFSameStyleUnicodeCharacters
}

PDFLine = element pdfLine {
    Box,
    PDFCharacter+
}

PDFSameStyleCharacters = element pdfSameStyleCharacters {
    Box,
    PDFStyle,
    PDFCharacter+
}

PDFSameStyleUnicodeCharacters = element pdfSameStyleUnicodeCharacters {
    PDFStyle?,
    attribute unicode { xsd:string },
    attribute debug_info { xsd:boolean }?
}
PDFFormula = element pdfFormula {
    Box,
    PDFCharacter+,
    attribute x_offset { xsd:float },
    attribute y_offset { xsd:float }
}

PDFFigure = element pdfFigure {
    Box
}

PDFRectangle = element pdfRectangle {
    Box,
    GraphicState,
    attribute debug_info { xsd:boolean }?,
    attribute fill_background { xsd:boolean }?,
    attribute xobjId { xsd:int }?
}
```

## /babeldoc/document_il/il_version_1.rng

```rng path="/babeldoc/document_il/il_version_1.rng"
```
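The dataclasses in `il_version_1.py` use xsdata-style field metadata (`"type": "Element"` / `"Attribute"` with `name` overrides), so an intermediate-language file should round-trip through the `xsdata` runtime. The following is a minimal sketch under that assumption; `il.xml` is a hypothetical IL document, not a file shipped with the repository:

```python
# Round-trip sketch for the IL dataclasses (assumes the xsdata package is
# installed; "il.xml" is a hypothetical input -- substitute a real IL file).
from xsdata.formats.dataclass.parsers import XmlParser
from xsdata.formats.dataclass.serializers import XmlSerializer

from babeldoc.document_il.il_version_1 import Document

# Deserialize the XML into the typed dataclass tree.
document = XmlParser().parse("il.xml", Document)

# Walk the tree: each page carries its paragraphs, fonts, and characters.
for page in document.page:
    print(page.page_number, len(page.pdf_paragraph))

# Serialize the (possibly modified) tree back to an XML string.
xml = XmlSerializer().render(document)
```

Because every child is either an optional scalar or a `default_factory=list` collection, a `Document` can also be built up incrementally in code before serialization.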
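The `.rnc` file above is the RELAX NG compact syntax for this schema; `il_version_1.rng` (content omitted from this dump) is the equivalent XML syntax, which standard validators consume directly. A sketch of validating an IL document with `lxml`, again assuming the hypothetical `il.xml`:

```python
# Sketch: validate an IL document against the RELAX NG schema (XML syntax).
# Assumes lxml is installed; "il.xml" is a hypothetical IL file.
from lxml import etree

schema = etree.RelaxNG(etree.parse("babeldoc/document_il/il_version_1.rng"))
doc = etree.parse("il.xml")

if not schema.validate(doc):
    print(schema.error_log)  # per-error line/column diagnostics
```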