VectifyAI/PageIndex/main (70k tokens)
```
├── .gitignore
├── CHANGELOG.md (100 tokens)
├── LICENSE (omitted)
├── README.md (1900 tokens)
├── docs/
   ├── 2023-annual-report-truncated.pdf
   ├── 2023-annual-report.pdf
   ├── PRML.pdf
   ├── Regulation Best Interest_Interpretive release.pdf
   ├── Regulation Best Interest_proposed rule.pdf
   ├── earthmover.pdf
   ├── four-lectures.pdf
   ├── q1-fy25-earnings.pdf
├── pageindex/
   ├── __init__.py
   ├── config.yaml
   ├── page_index.py (9.7k tokens)
   ├── utils.py (4.4k tokens)
├── requirements.txt
├── results/
   ├── 2023-annual-report-truncated_structure.json (400 tokens)
   ├── 2023-annual-report_structure.json (2.7k tokens)
   ├── PRML_structure.json (9.8k tokens)
   ├── Regulation Best Interest_Interpretive release_structure.json (2.5k tokens)
   ├── Regulation Best Interest_proposed rule_structure.json (26.4k tokens)
   ├── earthmover_structure.json (600 tokens)
   ├── four-lectures_structure.json (1500 tokens)
   ├── q1-fy25-earnings_structure.json (9.9k tokens)
├── run_pageindex.py (400 tokens)
```


## /.gitignore

```gitignore path="/.gitignore" 
.ipynb_checkpoints
__pycache__
files
index
temp/*
chroma-collections.parquet
chroma-embeddings.parquet
.DS_Store
.env*
notebook
SDK/*
log/*
logs/
parts/*
json_results/*

```

## /CHANGELOG.md

# Change Log
All notable changes to this project will be documented in this file.

## Beta - 2025-04-23

### Fixed
- [x] Fixed a bug introduced on April 18 where `start_index` was incorrectly passed.

## Beta - 2025-04-03

### Added
- [x] Added node_id and node summary
- [x] Added document description

### Changed
- [x] Changed "child_nodes" -> "nodes" to simplify the structure


## /README.md

<div align="center">
  <a href="https://vectify.ai/pageindex" target="_blank">
    <img src="https://github.com/user-attachments/assets/15c609f9-443d-4d81-a1f3-aa5a6051b676" alt="pg_logo_small" width="300px">
  </a>
</div>

# 📄 PageIndex

Are you frustrated with vector database retrieval accuracy for long professional documents? Traditional vector-based RAG relies on semantic *similarity* rather than true *relevance*. But **similarity ≠ relevance** — what we truly need in retrieval is **relevance**, and that requires **reasoning**. When working with professional documents that demand domain expertise and multi-step reasoning, similarity search often falls short.

🧠 **Reasoning-based RAG** offers a better alternative: enabling LLMs to *think* and *reason* their way to the most relevant document sections. Inspired by AlphaGo, we use *tree search* to perform structured document retrieval. 

**[PageIndex](https://vectify.ai/pageindex)** is a *document indexing system* that builds *search tree structures* from long documents, making them ready for reasoning-based RAG.  It has been used to develop a RAG system that achieved 98.7% accuracy on [FinanceBench](https://vectify.ai/blog/Mafin2.5), demonstrating state-of-the-art performance in document analysis.

<div align="center">
  <a href="https://vectify.ai/pageindex">
    <img src="https://github.com/user-attachments/assets/6604d932-bdf7-435e-8c28-2213e6ea6a5b" alt="PageIndex" width="700px"/>
  </a>
</div>

Self-host it with this open-source repo, or try our ☁️ [Cloud service](https://pageindex.vectify.ai/) — no setup required, with advanced features like OCR for complex and scanned PDFs.

Built by <a href="https://vectify.ai" target="_blank">Vectify AI</a>.

---

# **⭐ What is PageIndex**

PageIndex can transform lengthy PDF documents into a semantic **tree structure**, similar to a *"table of contents"* but optimized for use with Large Language Models (LLMs).
It's ideal for: financial reports, regulatory filings, academic textbooks, legal or technical manuals, and any document that exceeds LLM context limits.

### ✅ Key Features
    
- **Hierarchical Tree Structure**  
  Enables LLMs to traverse documents logically — like an intelligent, LLM-optimized table of contents.

- **Chunk-Free Segmentation**  
  No arbitrary chunking. Nodes follow the natural structure of the document.

- **Precise Page Referencing**  
  Every node carries a summary and the physical start/end page indices of its section, allowing pinpoint retrieval.

- **Scales to Massive Documents**  
  Designed to handle hundreds or even thousands of pages with ease.

### 📦 PageIndex Format

Here is an example output. See more [example documents](https://github.com/VectifyAI/PageIndex/tree/main/docs) and [generated trees](https://github.com/VectifyAI/PageIndex/tree/main/results).

```
...
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    },
    {
      "title": "Domestic and International Cooperation and Coordination",
      "node_id": "0008",
      "start_index": 28,
      "end_index": 31,
      "summary": "In 2023, the Federal Reserve collaborated ..."
    }
  ]
}
...
```
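
To give a sense of how this structure is consumed, here is a minimal sketch (plain Python, no dependencies) that flattens a generated tree into a `node_id` lookup table. It assumes the top-level `structure` key used by this repo's output files; the file name is one of the bundled results:

```python
import json

def flatten_tree(nodes, table=None):
    """Recursively index every node in a PageIndex tree by its node_id."""
    if table is None:
        table = {}
    for node in nodes:
        table[node["node_id"]] = node
        flatten_tree(node.get("nodes", []), table)
    return table

with open("results/2023-annual-report_structure.json") as f:
    result = json.load(f)

lookup = flatten_tree(result["structure"])
print(lookup["0006"]["title"])  # "Financial Stability" in the excerpt above
```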

---

### ⚠️ Bug Fix Notice

A bug introduced on **April 18** has now been fixed.

If you pulled the repo between **April 18–23**, please update to the latest version:

```bash
git pull origin main
```

Thanks for your understanding 🙏


---

# 🚀 Package Usage

Follow these steps to generate a PageIndex tree from a PDF document.

### 1. Install dependencies

```bash
pip3 install -r requirements.txt
```

### 2. Set your OpenAI API key

Create a `.env` file in the root directory and add your API key:

```bash
CHATGPT_API_KEY=your_openai_key_here
```

### 3. Run PageIndex on your PDF

```bash
python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
```
You can customize the processing with additional optional arguments:

```
--model                 OpenAI model to use (default: gpt-4o-2024-11-20)
--toc-check-pages       Pages to check for table of contents (default: 20)
--max-pages-per-node    Max pages per node (default: 10)
--max-tokens-per-node   Max tokens per node (default: 20000)
--if-add-node-id        Add node ID (yes/no, default: yes)
--if-add-node-summary   Add node summary (yes/no, default: no)
--if-add-doc-description Add doc description (yes/no, default: yes)
```
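
For example, to process one of the bundled sample documents with node summaries enabled:

```bash
python3 run_pageindex.py --pdf_path docs/2023-annual-report.pdf \
    --model gpt-4o-2024-11-20 \
    --if-add-node-summary yes
```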

---

# ☁️ Cloud API & Platform (Beta)

Don't want to host it yourself? Try our [hosted API](https://pageindex.vectify.ai/) for PageIndex. The hosted service leverages our custom OCR model for more accurate PDF recognition, delivering better tree structures for complex documents. Ideal for rapid prototyping, production environments, and documents requiring advanced OCR.

You can also upload PDFs from your browser and explore results visually with our [web Dashboard](https://pageindex.ai/files) — no coding needed.

Leave your email in [this form](https://ii2abc2jejf.typeform.com/to/meB40zV0) to receive 1,000 pages for free.

---

# 📈 Case Study: Mafin 2.5 on FinanceBench

[Mafin 2.5](https://vectify.ai/) is a state-of-the-art reasoning-based RAG model designed specifically for financial document analysis. Powered by **PageIndex**, it achieved a market-leading [**98.7% accuracy**](https://vectify.ai/blog/Mafin2.5) on the [FinanceBench](https://arxiv.org/abs/2311.11944) benchmark — significantly outperforming traditional vector-based RAG systems.

PageIndex's hierarchical indexing enabled precise navigation and extraction of relevant content from complex financial reports, such as SEC filings and earnings disclosures.

👉 See the full [benchmark results](https://github.com/VectifyAI/Mafin2.5-FinanceBench) and our [blog post](https://vectify.ai/blog/Mafin2.5) for detailed comparisons and performance metrics.

<div align="center">
  <a href="https://github.com/VectifyAI/Mafin2.5-FinanceBench">
    <img src="https://github.com/user-attachments/assets/571aa074-d803-43c7-80c4-a04254b782a3" width="90%">
  </a>
</div>

---

# 🧠 Reasoning-Based RAG with PageIndex

Use PageIndex to build **reasoning-based retrieval systems** without relying on semantic similarity. Great for domain-specific tasks where nuance matters ([more examples](https://pageindex.vectify.ai/examples/rag)).

### 🔖 Preprocessing Workflow Example
1. Process documents using PageIndex to generate tree structures.
2. Store the tree structures and their corresponding document IDs in a database table.
3. Store the contents of each node in a separate table, indexed by node ID and tree ID (a minimal schema sketch follows).
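
As one way to realize this layout, here is a minimal SQLite sketch; the table and column names are illustrative, not part of PageIndex:

```python
import sqlite3

conn = sqlite3.connect("pageindex.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS trees (
    doc_id TEXT PRIMARY KEY,   -- document ID
    tree   TEXT                -- PageIndex tree structure as JSON
);
CREATE TABLE IF NOT EXISTS nodes (
    doc_id  TEXT,              -- tree/document ID
    node_id TEXT,              -- node ID from the tree
    text    TEXT,              -- content of the node's pages
    PRIMARY KEY (doc_id, node_id)
);
""")
conn.commit()
```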

### 🔖 Reasoning-Based RAG Framework Example
1. Query Preprocessing:
    - Analyze the query to identify the required knowledge
2. Document Selection: 
    - Search for relevant documents and their IDs
    - Fetch the corresponding tree structures from the database
3. Node Selection:
    - Search through tree structures to identify relevant nodes
4. LLM Generation:
    - Fetch the corresponding contents of the selected nodes from the database
    - Format and extract the relevant information
    - Send the assembled context along with the original query to the LLM
    - Generate contextually informed responses (see the end-to-end sketch after the prompt example below)


### 🔖 Example Prompt for Node Selection

```python
prompt = f"""
You are given a question and a tree structure of a document.
You need to find all nodes that are likely to contain the answer.

Question: {question}

Document tree structure: {structure}

Reply in the following JSON format:
{{
    "thinking": <reasoning about where to look>,
    "node_list": [node_id1, node_id2, ...]
}}
"""
```
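
Below is a minimal end-to-end sketch of steps 3 and 4, reusing this repo's own `ChatGPT_API` and `extract_json` helpers from `pageindex.utils`; the `node_texts` mapping stands in for the node-content table and is hypothetical:

```python
import json
from pageindex.utils import ChatGPT_API, extract_json

def answer_with_pageindex(question, structure, node_texts, model="gpt-4o-2024-11-20"):
    # Step 3: node selection with the prompt shown above
    prompt = f"""
    You are given a question and a tree structure of a document.
    You need to find all nodes that are likely to contain the answer.

    Question: {question}

    Document tree structure: {json.dumps(structure)}

    Reply in the following JSON format:
    {{
        "thinking": <reasoning about where to look>,
        "node_list": [node_id1, node_id2, ...]
    }}
    """
    node_list = extract_json(ChatGPT_API(model=model, prompt=prompt))["node_list"]

    # Step 4: fetch the selected nodes' contents and generate the answer
    context = "\n\n".join(node_texts[node_id] for node_id in node_list)
    answer_prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer based on the context above."
    return ChatGPT_API(model=model, prompt=answer_prompt)
```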
👉 For more examples, see the [PageIndex Dashboard](https://pageindex.vectify.ai/).

---

# 🛤 Roadmap

- [x]  [Detailed examples of document selection, node selection, and RAG pipelines](https://pageindex.vectify.ai/examples/rag)
- [x]  [Integration of reasoning-based retrieval and semantic-based retrieval](https://pageindex.vectify.ai/examples/hybrid-rag)
- [ ]  Efficient tree search methods introduction
- [ ]  Technical report on the design of PageIndex

---

# 🚧 Notice
This project is in early beta development, and all progress will remain open and transparent. We welcome you to raise issues, reach out with questions, or contribute directly to the project.  

Due to the diverse structures of PDF documents, you may encounter instability during usage. For a more accurate and stable version with a leading OCR integration, please try our [hosted API for PageIndex](https://pageindex.vectify.ai/). Leave your email in [this form](https://ii2abc2jejf.typeform.com/to/meB40zV0) to receive 1,000 pages for free.

Together, let's push forward the revolution of reasoning-based RAG systems.

### 🙋 FAQ
- **Does PageIndex support other LLMs besides OpenAI?**  
  Currently optimized for GPT models, but future versions will support more.

- **Can PageIndex handle scanned PDFs?**  
  Yes! Our [Cloud API](https://pageindex.vectify.ai/) includes advanced OCR specifically for scanned and complex PDFs.

---

# 📬 Contact Us

Need customized support for your documents or reasoning-based RAG system?

:loudspeaker: [Join our Discord](https://discord.com/invite/nnyyEdT2RG)

:envelope: [Leave us a message](https://ii2abc2jejf.typeform.com/to/meB40zV0)

<div align="center">
  <a href="https://vectify.ai" target="_blank">
    <img src="https://github.com/user-attachments/assets/55abe487-9d21-44ad-b686-a008c2d2b7e7" alt="Vectify AI Logo" width="180">
  </a>
</div>


## /docs/2023-annual-report-truncated.pdf

Binary file available at https://raw.githubusercontent.com/VectifyAI/PageIndex/refs/heads/main/docs/2023-annual-report-truncated.pdf

## /docs/2023-annual-report.pdf

Binary file available at https://raw.githubusercontent.com/VectifyAI/PageIndex/refs/heads/main/docs/2023-annual-report.pdf

## /docs/PRML.pdf

Binary file available at https://raw.githubusercontent.com/VectifyAI/PageIndex/refs/heads/main/docs/PRML.pdf

## /docs/Regulation Best Interest_Interpretive release.pdf

Binary file available at https://raw.githubusercontent.com/VectifyAI/PageIndex/refs/heads/main/docs/Regulation Best Interest_Interpretive release.pdf

## /docs/Regulation Best Interest_proposed rule.pdf

Binary file available at https://raw.githubusercontent.com/VectifyAI/PageIndex/refs/heads/main/docs/Regulation Best Interest_proposed rule.pdf

## /docs/earthmover.pdf

Binary file available at https://raw.githubusercontent.com/VectifyAI/PageIndex/refs/heads/main/docs/earthmover.pdf

## /docs/four-lectures.pdf

Binary file available at https://raw.githubusercontent.com/VectifyAI/PageIndex/refs/heads/main/docs/four-lectures.pdf

## /docs/q1-fy25-earnings.pdf

Binary file available at https://raw.githubusercontent.com/VectifyAI/PageIndex/refs/heads/main/docs/q1-fy25-earnings.pdf

## /pageindex/__init__.py

```py path="/pageindex/__init__.py" 
from .page_index import *
```

## /pageindex/config.yaml

```yaml path="/pageindex/config.yaml" 
model: "gpt-4o-2024-11-20"
toc_check_page_num: 20
max_page_num_each_node: 10
max_token_num_each_node: 20000
if_add_node_id: "yes"
if_add_node_summary: "no"
if_add_doc_description: "yes"
if_add_node_text: "no"
```
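
These keys map onto the keyword arguments of the `page_index()` entry point defined in `page_index.py` below. A minimal programmatic sketch, assuming arguments left unset fall back to these config defaults:

```python
from pageindex import page_index

result = page_index(
    "docs/2023-annual-report.pdf",   # bundled sample PDF
    if_add_node_summary="yes",       # override the config.yaml default
)
print(result["doc_name"])
```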

## /pageindex/page_index.py

```py path="/pageindex/page_index.py" 
import os
import json
import copy
import math
import random
import re
from .utils import *
from concurrent.futures import ThreadPoolExecutor, as_completed


################### check title in page #########################################################
async def check_title_appearance(item, page_list, start_index=1, model=None):
    title = item['title']
    if 'physical_index' not in item or item['physical_index'] is None:
        return {'list_index': item.get('list_index'), 'answer': 'no', 'title': title, 'page_number': None}

    page_number = item['physical_index']
    page_text = page_list[page_number-start_index][0]

    prompt = f"""
    Your job is to check if the given section appears or starts in the given page_text.

    Note: do fuzzy matching, ignore any space inconsistency in the page_text.

    The given section title is {title}.
    The given page_text is {page_text}.
    
    Reply format:
    {{
        
        "thinking": <why do you think the section appears or starts in the page_text>
        "answer": "yes or no" (yes if the section appears or starts in the page_text, no otherwise)
    }}
    Directly return the final JSON structure. Do not output anything else."""

    response = await ChatGPT_API_async(model=model, prompt=prompt)
    response = extract_json(response)
    if 'answer' in response:
        answer = response['answer']
    else:
        answer = 'no'
    return {'list_index': item['list_index'], 'answer': answer, 'title': title, 'page_number': page_number}


async def check_title_appearance_in_start(title, page_text, model=None, logger=None):    
    prompt = f"""
    You will be given the current section title and the current page_text.
    Your job is to check if the current section starts in the beginning of the given page_text.
    If there are other contents before the current section title, then the current section does not start in the beginning of the given page_text.
    If the current section title is the first content in the given page_text, then the current section starts in the beginning of the given page_text.

    Note: do fuzzy matching, ignore any space inconsistency in the page_text.

    The given section title is {title}.
    The given page_text is {page_text}.
    
    reply format:
    {{
        "thinking": <why do you think the section appears or starts in the page_text>
        "start_begin": "yes or no" (yes if the section starts in the beginning of the page_text, no otherwise)
    }}
    Directly return the final JSON structure. Do not output anything else."""

    response = await ChatGPT_API_async(model=model, prompt=prompt)
    response = extract_json(response)
    if logger:
        logger.info(f"Response: {response}")
    return response.get("start_begin", "no")


async def check_title_appearance_in_start_concurrent(structure, page_list, model=None, logger=None):
    if logger:
        logger.info("Checking title appearance in start concurrently")
    
    # skip items without physical_index
    for item in structure:
        if item.get('physical_index') is None:
            item['appear_start'] = 'no'

    # only for items with valid physical_index
    tasks = []
    valid_items = []
    for item in structure:
        if item.get('physical_index') is not None:
            page_text = page_list[item['physical_index'] - 1][0]
            tasks.append(check_title_appearance_in_start(item['title'], page_text, model=model, logger=logger))
            valid_items.append(item)

    results = await asyncio.gather(*tasks, return_exceptions=True)
    for item, result in zip(valid_items, results):
        if isinstance(result, Exception):
            if logger:
                logger.error(f"Error checking start for {item['title']}: {result}")
            item['appear_start'] = 'no'
        else:
            item['appear_start'] = result

    return structure


def toc_detector_single_page(content, model=None):
    prompt = f"""
    Your job is to detect if there is a table of content provided in the given text.

    Given text: {content}

    return the following JSON format:
    {{
        "thinking": <why do you think there is a table of content in the given text>
        "toc_detected": "<yes or no>",
    }}

    Directly return the final JSON structure. Do not output anything else.
    Please note: abstract, summary, notation list, figure list, table list, etc. are not table of contents."""

    response = ChatGPT_API(model=model, prompt=prompt)
    # print('response', response)
    json_content = extract_json(response)    
    return json_content['toc_detected']


def check_if_toc_extraction_is_complete(content, toc, model=None):
    prompt = f"""
    You are given a partial document and a table of contents.
    Your job is to check if the table of contents is complete, i.e. whether it contains all the main sections in the partial document.

    Reply format:
    {{
        "thinking": <why do you think the table of contents is complete or not>
        "completed": "yes" or "no"
    }}
    Directly return the final JSON structure. Do not output anything else."""

    prompt = prompt + '\n Document:\n' + content + '\n Table of contents:\n' + toc
    response = ChatGPT_API(model=model, prompt=prompt)
    json_content = extract_json(response)
    return json_content['completed']


def check_if_toc_transformation_is_complete(content, toc, model=None):
    prompt = f"""
    You are given a raw table of contents and a cleaned table of contents.
    Your job is to check if the cleaned table of contents is complete.

    Reply format:
    {{
        "thinking": <why do you think the cleaned table of contents is complete or not>
        "completed": "yes" or "no"
    }}
    Directly return the final JSON structure. Do not output anything else."""

    prompt = prompt + '\n Raw Table of contents:\n' + content + '\n Cleaned Table of contents:\n' + toc
    response = ChatGPT_API(model=model, prompt=prompt)
    json_content = extract_json(response)
    return json_content['completed']

def extract_toc_content(content, model=None):
    prompt = f"""
    Your job is to extract the full table of contents from the given text, replacing any runs of leader dots (...) with a colon (:).

    Given text: {content}

    Directly return the full table of contents content. Do not output anything else."""

    response, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt)
    
    if_complete = check_if_toc_transformation_is_complete(content, response, model)
    if if_complete == "yes" and finish_reason == "finished":
        return response
    
    chat_history = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    prompt = "please continue the generation of the table of contents, directly output the remaining part of the structure"
    new_response, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt, chat_history=chat_history)
    response = response + new_response
    if_complete = check_if_toc_transformation_is_complete(content, response, model)

    attempts = 0
    max_attempts = 10  # retry limit to prevent infinite loops
    while not (if_complete == "yes" and finish_reason == "finished"):
        chat_history = [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ]
        new_response, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt, chat_history=chat_history)
        response = response + new_response
        if_complete = check_if_toc_transformation_is_complete(content, response, model)

        attempts += 1
        if attempts >= max_attempts:
            raise Exception('Failed to complete table of contents after maximum retries')
    
    return response

def detect_page_index(toc_content, model=None):
    print('start detect_page_index')
    prompt = f"""
    You will be given a table of contents.

    Your job is to detect if there are page numbers/indices given within the table of contents.

    Given text: {toc_content}

    Reply format:
    {{
        "thinking": <why do you think there are page numbers/indices given within the table of contents>
        "page_index_given_in_toc": "<yes or no>"
    }}
    Directly return the final JSON structure. Do not output anything else."""

    response = ChatGPT_API(model=model, prompt=prompt)
    json_content = extract_json(response)
    return json_content['page_index_given_in_toc']

def toc_extractor(page_list, toc_page_list, model):
    def transform_dots_to_colon(text):
        text = re.sub(r'\.{5,}', ': ', text)
        # Handle dots separated by spaces
        text = re.sub(r'(?:\. ){5,}\.?', ': ', text)
        return text
    
    toc_content = ""
    for page_index in toc_page_list:
        toc_content += page_list[page_index][0]
    toc_content = transform_dots_to_colon(toc_content)
    has_page_index = detect_page_index(toc_content, model=model)
    
    return {
        "toc_content": toc_content,
        "page_index_given_in_toc": has_page_index
    }




def toc_index_extractor(toc, content, model=None):
    print('start toc_index_extractor')
    toc_extractor_prompt = """
    You are given a table of contents in a json format and several pages of a document, your job is to add the physical_index to the table of contents in the json format.

    The provided pages contain tags like <physical_index_X> at the start and end of each page X to indicate its physical location.

    The structure variable is the numeric system which represents the index of the hierarchy section in the table of contents. For example, the first section has structure index 1, the first subsection has structure index 1.1, the second subsection has structure index 1.2, etc.

    The response should be in the following JSON format: 
    [
        {
            "structure": <structure index, "x.x.x" or None> (string),
            "title": <title of the section>,
            "physical_index": "<physical_index_X>" (keep the format)
        },
        ...
    ]

    Only add the physical_index to the sections that are in the provided pages.
    If the section is not in the provided pages, do not add the physical_index to it.
    Directly return the final JSON structure. Do not output anything else."""

    prompt = toc_extractor_prompt + '\nTable of contents:\n' + str(toc) + '\nDocument pages:\n' + content
    response = ChatGPT_API(model=model, prompt=prompt)
    json_content = extract_json(response)    
    return json_content



def toc_transformer(toc_content, model=None):
    print('start toc_transformer')
    init_prompt = """
    You are given a table of contents. Your job is to transform the whole table of contents into a JSON object that contains a table_of_contents field.

    structure is the numeric system which represents the index of the hierarchy section in the table of contents. For example, the first section has structure index 1, the first subsection has structure index 1.1, the second subsection has structure index 1.2, etc.

    The response should be in the following JSON format: 
    {
    table_of_contents: [
        {
            "structure": <structure index, "x.x.x" or None> (string),
            "title": <title of the section>,
            "page": <page number or None>,
        },
        ...
        ],
    }
    You should transform the full table of contents in one go.
    Directly return the final JSON structure, do not output anything else. """

    prompt = init_prompt + '\n Given table of contents\n:' + toc_content
    last_complete, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt)
    if_complete = check_if_toc_transformation_is_complete(toc_content, last_complete, model)
    if if_complete == "yes" and finish_reason == "finished":
        last_complete = extract_json(last_complete)
        cleaned_response=convert_page_to_int(last_complete['table_of_contents'])
        return cleaned_response
    
    last_complete = get_json_content(last_complete)
    while not (if_complete == "yes" and finish_reason == "finished"):
        position = last_complete.rfind('}')
        if position != -1:
            last_complete = last_complete[:position+2]
        prompt = f"""
        Your task is to continue the table of contents json structure, directly output the remaining part of the json structure.
        The response should be in the following JSON format: 

        The raw table of contents json structure is:
        {toc_content}

        The incomplete transformed table of contents json structure is:
        {last_complete}

        Please continue the json structure, directly output the remaining part of the json structure."""

        new_complete, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt)

        if new_complete.startswith('\`\`\`json'):
            new_complete =  get_json_content(new_complete)
            last_complete = last_complete+new_complete

        if_complete = check_if_toc_transformation_is_complete(toc_content, last_complete, model)
        

    last_complete = json.loads(last_complete)

    cleaned_response=convert_page_to_int(last_complete['table_of_contents'])
    return cleaned_response
    



def find_toc_pages(start_page_index, page_list, opt, logger=None):
    print('start find_toc_pages')
    last_page_is_yes = False
    toc_page_list = []
    i = start_page_index
    
    while i < len(page_list):
        # Only check beyond max_pages if we're still finding TOC pages
        if i >= opt.toc_check_page_num and not last_page_is_yes:
            break
        detected_result = toc_detector_single_page(page_list[i][0],model=opt.model)
        if detected_result == 'yes':
            if logger:
                logger.info(f'Page {i} has toc')
            toc_page_list.append(i)
            last_page_is_yes = True
        elif detected_result == 'no' and last_page_is_yes:
            if logger:
                logger.info(f'Found the last page with toc: {i-1}')
            break
        i += 1
    
    if not toc_page_list and logger:
        logger.info('No toc found')
        
    return toc_page_list

def remove_page_number(data):
    if isinstance(data, dict):
        data.pop('page_number', None)  
        for key in list(data.keys()):
            if 'nodes' in key:
                remove_page_number(data[key])
    elif isinstance(data, list):
        for item in data:
            remove_page_number(item)
    return data

def extract_matching_page_pairs(toc_page, toc_physical_index, start_page_index):
    pairs = []
    for phy_item in toc_physical_index:
        for page_item in toc_page:
            if phy_item.get('title') == page_item.get('title'):
                physical_index = phy_item.get('physical_index')
                if physical_index is not None and int(physical_index) >= start_page_index:
                    pairs.append({
                        'title': phy_item.get('title'),
                        'page': page_item.get('page'),
                        'physical_index': physical_index
                    })
    return pairs


def calculate_page_offset(pairs):
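    """Return the most common (physical_index - page) difference across the
    matched pairs, i.e. the offset between printed and physical page numbers."""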
    differences = []
    for pair in pairs:
        try:
            physical_index = pair['physical_index']
            page_number = pair['page']
            difference = physical_index - page_number
            differences.append(difference)
        except (KeyError, TypeError):
            continue
    
    if not differences:
        return None
    
    difference_counts = {}
    for diff in differences:
        difference_counts[diff] = difference_counts.get(diff, 0) + 1
    
    most_common = max(difference_counts.items(), key=lambda x: x[1])[0]
    
    return most_common

def add_page_offset_to_toc_json(data, offset):
    for i in range(len(data)):
        if data[i].get('page') is not None and isinstance(data[i]['page'], int):
            data[i]['physical_index'] = data[i]['page'] + offset
            del data[i]['page']
    
    return data



def page_list_to_group_text(page_contents, token_lengths, max_tokens=20000, overlap_page=1):    
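    """Merge pages into token-bounded groups of roughly equal size, overlapping
    each new group by `overlap_page` pages so that section starts near a group
    boundary are not lost."""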
    num_tokens = sum(token_lengths)
    
    if num_tokens <= max_tokens:
        # merge all pages into one text
        page_text = "".join(page_contents)
        return [page_text]
    
    subsets = []
    current_subset = []
    current_token_count = 0

    expected_parts_num = math.ceil(num_tokens / max_tokens)
    average_tokens_per_part = math.ceil(((num_tokens / expected_parts_num) + max_tokens) / 2)
    
    for i, (page_content, page_tokens) in enumerate(zip(page_contents, token_lengths)):
        if current_token_count + page_tokens > average_tokens_per_part:

            subsets.append(''.join(current_subset))
            # Start new subset from overlap if specified
            overlap_start = max(i - overlap_page, 0)
            current_subset = page_contents[overlap_start:i]
            current_token_count = sum(token_lengths[overlap_start:i])
        
        # Add current page to the subset
        current_subset.append(page_content)
        current_token_count += page_tokens

    # Add the last subset if it contains any pages
    if current_subset:
        subsets.append(''.join(current_subset))
    
    print('divide page_list to groups', len(subsets))
    return subsets

def add_page_number_to_toc(part, structure, model=None):
    fill_prompt_seq = """
    You are given an JSON structure of a document and a partial part of the document. Your task is to check if the title that is described in the structure is started in the partial given document.

    The provided text contains tags like <physical_index_X> and <physical_index_X> to indicate the physical location of the page X. 

    If the full target section starts in the partial given document, insert the given JSON structure with the "start": "yes", and "start_index": "<physical_index_X>".

    If the full target section does not start in the partial given document, insert "start": "no",  "start_index": None.

    The response should be in the following format. 
        [
            {
                "structure": <structure index, "x.x.x" or None> (string),
                "title": <title of the section>,
                "start": "<yes or no>",
                "physical_index": "<physical_index_X> (keep the format)" or None
            },
            ...
        ]    
    The given structure contains the result of the previous part, you need to fill the result of the current part, do not change the previous result.
    Directly return the final JSON structure. Do not output anything else."""

    prompt = fill_prompt_seq + f"\n\nCurrent Partial Document:\n{part}\n\nGiven Structure\n{json.dumps(structure, indent=2)}\n"
    current_json_raw = ChatGPT_API(model=model, prompt=prompt)
    json_result = extract_json(current_json_raw)
    
    for item in json_result:
        if 'start' in item:
            del item['start']
    return json_result


def remove_first_physical_index_section(text):
    """
    Removes the first section between <physical_index_X> and <physical_index_X> tags,
    and returns the remaining text.
    """
    pattern = r'<physical_index_\d+>.*?<physical_index_\d+>'
    match = re.search(pattern, text, re.DOTALL)
    if match:
        # Remove the first matched section
        return text.replace(match.group(0), '', 1)
    return text

### add verify completeness
def generate_toc_continue(toc_content, part, model="gpt-4o-2024-11-20"):
    print('start generate_toc_continue')
    prompt = """
    You are an expert in extracting hierarchical tree structure.
    You are given a tree structure of the previous part and the text of the current part.
    Your task is to continue the tree structure from the previous part to include the current part.

    The structure variable is the numeric system which represents the index of the hierarchy section in the table of contents. For example, the first section has structure index 1, the first subsection has structure index 1.1, the second subsection has structure index 1.2, etc.

    For the title, you need to extract the original title from the text, only fix the space inconsistency.

    The provided text contains tags like <physical_index_X> at the start and end of each page X.

    For the physical_index, you need to extract the physical index of the start of the section from the text. Keep the <physical_index_X> format.

    The response should be in the following format. 
        [
            {
                "structure": <structure index, "x.x.x"> (string),
                "title": <title of the section, keep the original title>,
                "physical_index": "<physical_index_X> (keep the format)"
            },
            ...
        ]    

    Directly return the additional part of the final JSON structure. Do not output anything else."""

    prompt = prompt + '\nGiven text\n:' + part + '\nPrevious tree structure\n:' + json.dumps(toc_content, indent=2)
    response, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt)
    if finish_reason == 'finished':
        return extract_json(response)
    else:
        raise Exception(f'finish reason: {finish_reason}')
    
### add verify completeness
def generate_toc_init(part, model=None):
    print('start generate_toc_init')
    prompt = """
    You are an expert in extracting hierarchical tree structure, your task is to generate the tree structure of the document.

    The structure variable is the numeric system which represents the index of the hierarchy section in the table of contents. For example, the first section has structure index 1, the first subsection has structure index 1.1, the second subsection has structure index 1.2, etc.

    For the title, you need to extract the original title from the text, only fix the space inconsistency.

    The provided text contains tags like <physical_index_X> at the start and end of each page X.

    For the physical_index, you need to extract the physical index of the start of the section from the text. Keep the <physical_index_X> format.

    The response should be in the following format. 
        [
            {
                "structure": <structure index, "x.x.x"> (string),
                "title": <title of the section, keep the original title>,
                "physical_index": "<physical_index_X> (keep the format)"
            },
            ...
        ]


    Directly return the final JSON structure. Do not output anything else."""

    prompt = prompt + '\nGiven text\n:' + part
    response, finish_reason = ChatGPT_API_with_finish_reason(model=model, prompt=prompt)

    if finish_reason == 'finished':
        return extract_json(response)
    else:
        raise Exception(f'finish reason: {finish_reason}')

def process_no_toc(page_list, start_index=1, model=None, logger=None):
    page_contents=[]
    token_lengths=[]
    for page_index in range(start_index, start_index+len(page_list)):
        page_text = f"<physical_index_{page_index}>\n{page_list[page_index-start_index][0]}\n<physical_index_{page_index}>\n\n"
        page_contents.append(page_text)
        token_lengths.append(count_tokens(page_text, model))
    group_texts = page_list_to_group_text(page_contents, token_lengths)
    logger.info(f'len(group_texts): {len(group_texts)}')

    toc_with_page_number= generate_toc_init(group_texts[0], model)
    for group_text in group_texts[1:]:
        toc_with_page_number_additional = generate_toc_continue(toc_with_page_number, group_text, model)    
        toc_with_page_number.extend(toc_with_page_number_additional)
    logger.info(f'generate_toc: {toc_with_page_number}')

    toc_with_page_number = convert_physical_index_to_int(toc_with_page_number)
    logger.info(f'convert_physical_index_to_int: {toc_with_page_number}')

    return toc_with_page_number

def process_toc_no_page_numbers(toc_content, toc_page_list, page_list,  start_index=1, model=None, logger=None):
    page_contents=[]
    token_lengths=[]
    toc_content = toc_transformer(toc_content, model)
    logger.info(f'toc_transformer: {toc_content}')
    for page_index in range(start_index, start_index+len(page_list)):
        page_text = f"<physical_index_{page_index}>\n{page_list[page_index-start_index][0]}\n<physical_index_{page_index}>\n\n"
        page_contents.append(page_text)
        token_lengths.append(count_tokens(page_text, model))
    
    group_texts = page_list_to_group_text(page_contents, token_lengths)
    logger.info(f'len(group_texts): {len(group_texts)}')

    toc_with_page_number=copy.deepcopy(toc_content)
    for group_text in group_texts:
        toc_with_page_number = add_page_number_to_toc(group_text, toc_with_page_number, model)
    logger.info(f'add_page_number_to_toc: {toc_with_page_number}')

    toc_with_page_number = convert_physical_index_to_int(toc_with_page_number)
    logger.info(f'convert_physical_index_to_int: {toc_with_page_number}')

    return toc_with_page_number



def process_toc_with_page_numbers(toc_content, toc_page_list, page_list, toc_check_page_num=None, model=None, logger=None):
    toc_with_page_number = toc_transformer(toc_content, model)
    logger.info(f'toc_with_page_number: {toc_with_page_number}')

    toc_no_page_number = remove_page_number(copy.deepcopy(toc_with_page_number))
    
    start_page_index = toc_page_list[-1] + 1
    main_content = ""
    for page_index in range(start_page_index, min(start_page_index + toc_check_page_num, len(page_list))):
        main_content += f"<physical_index_{page_index+1}>\n{page_list[page_index][0]}\n<physical_index_{page_index+1}>\n\n"

    toc_with_physical_index = toc_index_extractor(toc_no_page_number, main_content, model)
    logger.info(f'toc_with_physical_index: {toc_with_physical_index}')

    toc_with_physical_index = convert_physical_index_to_int(toc_with_physical_index)
    logger.info(f'toc_with_physical_index: {toc_with_physical_index}')

    matching_pairs = extract_matching_page_pairs(toc_with_page_number, toc_with_physical_index, start_page_index)
    logger.info(f'matching_pairs: {matching_pairs}')

    offset = calculate_page_offset(matching_pairs)
    logger.info(f'offset: {offset}')

    toc_with_page_number = add_page_offset_to_toc_json(toc_with_page_number, offset)
    logger.info(f'toc_with_page_number: {toc_with_page_number}')

    toc_with_page_number = process_none_page_numbers(toc_with_page_number, page_list, model=model)
    logger.info(f'toc_with_page_number: {toc_with_page_number}')

    return toc_with_page_number



## fill in physical_index for TOC items that lack page numbers
def process_none_page_numbers(toc_items, page_list, start_index=1, model=None):
    for i, item in enumerate(toc_items):
        if "physical_index" not in item:
            # logger.info(f"fix item: {item}")
            # Find previous physical_index
            prev_physical_index = 0  # Default if no previous item exists
            for j in range(i - 1, -1, -1):
                if toc_items[j].get('physical_index') is not None:
                    prev_physical_index = toc_items[j]['physical_index']
                    break
            
            # Find next physical_index
            next_physical_index = -1  # Default if no next item exists
            for j in range(i + 1, len(toc_items)):
                if toc_items[j].get('physical_index') is not None:
                    next_physical_index = toc_items[j]['physical_index']
                    break

            page_contents = []
            for page_index in range(prev_physical_index, next_physical_index+1):
                # Add bounds checking to prevent IndexError
                list_index = page_index - start_index
                if list_index >= 0 and list_index < len(page_list):
                    page_text = f"<physical_index_{page_index}>\n{page_list[list_index][0]}\n<physical_index_{page_index}>\n\n"
                    page_contents.append(page_text)
                else:
                    continue

            item_copy = copy.deepcopy(item)
            del item_copy['page']
            # Join the pages into a single string, as the prompt expects text
            result = add_page_number_to_toc(''.join(page_contents), item_copy, model)
            if isinstance(result[0]['physical_index'], str) and result[0]['physical_index'].startswith('<physical_index'):
                item['physical_index'] = int(result[0]['physical_index'].split('_')[-1].rstrip('>').strip())
                del item['page']
    
    return toc_items




def check_toc(page_list, opt=None):
    toc_page_list = find_toc_pages(start_page_index=0, page_list=page_list, opt=opt)
    if len(toc_page_list) == 0:
        print('no toc found')
        return {'toc_content': None, 'toc_page_list': [], 'page_index_given_in_toc': 'no'}
    else:
        print('toc found')
        toc_json = toc_extractor(page_list, toc_page_list, opt.model)

        if toc_json['page_index_given_in_toc'] == 'yes':
            print('index found')
            return {'toc_content': toc_json['toc_content'], 'toc_page_list': toc_page_list, 'page_index_given_in_toc': 'yes'}
        else:
            current_start_index = toc_page_list[-1] + 1
            
            while (toc_json['page_index_given_in_toc'] == 'no' and 
                   current_start_index < len(page_list) and 
                   current_start_index < opt.toc_check_page_num):
                
                additional_toc_pages = find_toc_pages(
                    start_page_index=current_start_index,
                    page_list=page_list,
                    opt=opt
                )
                
                if len(additional_toc_pages) == 0:
                    break

                additional_toc_json = toc_extractor(page_list, additional_toc_pages, opt.model)
                if additional_toc_json['page_index_given_in_toc'] == 'yes':
                    print('index found')
                    return {'toc_content': additional_toc_json['toc_content'], 'toc_page_list': additional_toc_pages, 'page_index_given_in_toc': 'yes'}

                else:
                    current_start_index = additional_toc_pages[-1] + 1
            print('index not found')
            return {'toc_content': toc_json['toc_content'], 'toc_page_list': toc_page_list, 'page_index_given_in_toc': 'no'}






################### fix incorrect toc #########################################################
def single_toc_item_index_fixer(section_title, content, model="gpt-4o-2024-11-20"):
    toc_extractor_prompt = """
    You are given a section title and several pages of a document, your job is to find the physical index of the start page of the section in the partial document.

    The provided pages contain tags like <physical_index_X> at the start and end of each page X to indicate its physical location.

    Reply in a JSON format:
    {
        "thinking": <explain which page, started and closed by <physical_index_X>, contains the start of this section>,
        "physical_index": "<physical_index_X>" (keep the format)
    }
    Directly return the final JSON structure. Do not output anything else."""

    prompt = toc_extractor_prompt + '\nSection Title:\n' + str(section_title) + '\nDocument pages:\n' + content
    response = ChatGPT_API(model=model, prompt=prompt)
    json_content = extract_json(response)    
    return convert_physical_index_to_int(json_content['physical_index'])



async def fix_incorrect_toc(toc_with_page_number, page_list, incorrect_results, start_index=1, model=None, logger=None):
    print(f'start fix_incorrect_toc with {len(incorrect_results)} incorrect results')
    incorrect_indices = {result['list_index'] for result in incorrect_results}
    
    end_index = len(page_list) + start_index - 1
    
    incorrect_results_and_range_logs = []
    # Helper function to process and check a single incorrect item
    async def process_and_check_item(incorrect_item):
        list_index = incorrect_item['list_index']
        
        # Check if list_index is valid
        if list_index < 0 or list_index >= len(toc_with_page_number):
            # Return an invalid result for out-of-bounds indices
            return {
                'list_index': list_index,
                'title': incorrect_item['title'],
                'physical_index': incorrect_item.get('physical_index'),
                'is_valid': False
            }
        
        # Find the previous correct item
        prev_correct = None
        for i in range(list_index-1, -1, -1):
            if i not in incorrect_indices and i >= 0 and i < len(toc_with_page_number):
                physical_index = toc_with_page_number[i].get('physical_index')
                if physical_index is not None:
                    prev_correct = physical_index
                    break
        # If no previous correct item found, use start_index
        if prev_correct is None:
            prev_correct = start_index - 1
        
        # Find the next correct item
        next_correct = None
        for i in range(list_index+1, len(toc_with_page_number)):
            if i not in incorrect_indices and i >= 0 and i < len(toc_with_page_number):
                physical_index = toc_with_page_number[i].get('physical_index')
                if physical_index is not None:
                    next_correct = physical_index
                    break
        # If no next correct item found, use end_index
        if next_correct is None:
            next_correct = end_index
        
        incorrect_results_and_range_logs.append({
            'list_index': list_index,
            'title': incorrect_item['title'],
            'prev_correct': prev_correct,
            'next_correct': next_correct
        })

        page_contents = []
        for page_index in range(prev_correct, next_correct+1):
            # Bounds checking to prevent IndexError. Use a separate variable
            # here: reusing `list_index` would clobber the item's index
            # returned below and cause the wrong TOC entry to be updated.
            page_pos = page_index - start_index
            if page_pos >= 0 and page_pos < len(page_list):
                page_text = f"<physical_index_{page_index}>\n{page_list[page_pos][0]}\n<physical_index_{page_index}>\n\n"
                page_contents.append(page_text)
            else:
                continue
        content_range = ''.join(page_contents)
        
        physical_index_int = single_toc_item_index_fixer(incorrect_item['title'], content_range, model)
        
        # Check if the result is correct
        check_item = incorrect_item.copy()
        check_item['physical_index'] = physical_index_int
        check_result = await check_title_appearance(check_item, page_list, start_index, model)

        return {
            'list_index': list_index,
            'title': incorrect_item['title'],
            'physical_index': physical_index_int,
            'is_valid': check_result['answer'] == 'yes'
        }

    # Process incorrect items concurrently
    tasks = [
        process_and_check_item(item)
        for item in incorrect_results
    ]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    for item, result in zip(incorrect_results, results):
        if isinstance(result, Exception):
            print(f"Processing item {item} generated an exception: {result}")
            continue
    results = [result for result in results if not isinstance(result, Exception)]

    # Update the toc_with_page_number with the fixed indices and check for any invalid results
    invalid_results = []
    for result in results:
        if result['is_valid']:
            # Add bounds checking to prevent IndexError
            list_idx = result['list_index']
            if 0 <= list_idx < len(toc_with_page_number):
                toc_with_page_number[list_idx]['physical_index'] = result['physical_index']
            else:
                # Index is out of bounds, treat as invalid
                invalid_results.append({
                    'list_index': result['list_index'],
                    'title': result['title'],
                    'physical_index': result['physical_index'],
                })
        else:
            invalid_results.append({
                'list_index': result['list_index'],
                'title': result['title'],
                'physical_index': result['physical_index'],
            })

    logger.info(f'incorrect_results_and_range_logs: {incorrect_results_and_range_logs}')
    logger.info(f'invalid_results: {invalid_results}')

    return toc_with_page_number, invalid_results



async def fix_incorrect_toc_with_retries(toc_with_page_number, page_list, incorrect_results, start_index=1, max_attempts=3, model=None, logger=None):
    print('start fix_incorrect_toc')
    fix_attempt = 0
    current_toc = toc_with_page_number
    current_incorrect = incorrect_results

    while current_incorrect:
        print(f"Fixing {len(current_incorrect)} incorrect results")
        
        current_toc, current_incorrect = await fix_incorrect_toc(current_toc, page_list, current_incorrect, start_index, model, logger)
                
        fix_attempt += 1
        if fix_attempt >= max_attempts:
            logger.info("Maximum fix attempts reached")
            break
    
    return current_toc, current_incorrect




################### verify toc #########################################################
async def verify_toc(page_list, list_result, start_index=1, N=None, model=None):
    print('start verify_toc')
    # Find the last non-None physical_index
    last_physical_index = None
    for item in reversed(list_result):
        if item.get('physical_index') is not None:
            last_physical_index = item['physical_index']
            break
    
    # Early return if we don't have valid physical indices
    if last_physical_index is None or last_physical_index < len(page_list)/2:
        return 0, []
    
    # Determine which items to check
    if N is None:
        print('check all items')
        sample_indices = range(0, len(list_result))
    else:
        N = min(N, len(list_result))
        print(f'check {N} items')
        sample_indices = random.sample(range(0, len(list_result)), N)

    # Prepare items with their list indices
    indexed_sample_list = []
    for idx in sample_indices:
        item = list_result[idx]
        # Skip items with None physical_index (these were invalidated by validate_and_truncate_physical_indices)
        if item.get('physical_index') is not None:
            item_with_index = item.copy()
            item_with_index['list_index'] = idx  # Add the original index in list_result
            indexed_sample_list.append(item_with_index)

    # Run checks concurrently
    tasks = [
        check_title_appearance(item, page_list, start_index, model)
        for item in indexed_sample_list
    ]
    results = await asyncio.gather(*tasks)
    
    # Process results
    correct_count = 0
    incorrect_results = []
    for result in results:
        if result['answer'] == 'yes':
            correct_count += 1
        else:
            incorrect_results.append(result)
    
    # Calculate accuracy
    checked_count = len(results)
    accuracy = correct_count / checked_count if checked_count > 0 else 0
    print(f"accuracy: {accuracy*100:.2f}%")
    return accuracy, incorrect_results





################### main process #########################################################
async def meta_processor(page_list, mode=None, toc_content=None, toc_page_list=None, start_index=1, opt=None, logger=None):
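    """Build a TOC with physical page indices using the requested mode, verify
    it by checking titles against their pages, repair incorrect entries, and
    fall back to a simpler mode if accuracy is too low."""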
    print(mode)
    print(f'start_index: {start_index}')
    
    if mode == 'process_toc_with_page_numbers':
        toc_with_page_number = process_toc_with_page_numbers(toc_content, toc_page_list, page_list, toc_check_page_num=opt.toc_check_page_num, model=opt.model, logger=logger)
    elif mode == 'process_toc_no_page_numbers':
        toc_with_page_number = process_toc_no_page_numbers(toc_content, toc_page_list, page_list, model=opt.model, logger=logger)
    else:
        toc_with_page_number = process_no_toc(page_list, start_index=start_index, model=opt.model, logger=logger)
            
    toc_with_page_number = [item for item in toc_with_page_number if item.get('physical_index') is not None] 
    
    toc_with_page_number = validate_and_truncate_physical_indices(
        toc_with_page_number, 
        len(page_list), 
        start_index=start_index, 
        logger=logger
    )
    
    accuracy, incorrect_results = await verify_toc(page_list, toc_with_page_number, start_index=start_index, model=opt.model)
        
    logger.info({
        'mode': mode,
        'accuracy': accuracy,
        'incorrect_results': incorrect_results
    })
    if accuracy == 1.0 and len(incorrect_results) == 0:
        return toc_with_page_number
    if accuracy > 0.6 and len(incorrect_results) > 0:
        toc_with_page_number, incorrect_results = await fix_incorrect_toc_with_retries(toc_with_page_number, page_list, incorrect_results,start_index=start_index, max_attempts=3, model=opt.model, logger=logger)
        return toc_with_page_number
    else:
        if mode == 'process_toc_with_page_numbers':
            return await meta_processor(page_list, mode='process_toc_no_page_numbers', toc_content=toc_content, toc_page_list=toc_page_list, start_index=start_index, opt=opt, logger=logger)
        elif mode == 'process_toc_no_page_numbers':
            return await meta_processor(page_list, mode='process_no_toc', start_index=start_index, opt=opt, logger=logger)
        else:
            raise Exception('Processing failed')
        
 
async def process_large_node_recursively(node, page_list, opt=None, logger=None):
    node_page_list = page_list[node['start_index']-1:node['end_index']]
    token_num = sum([page[1] for page in node_page_list])
    
    if node['end_index'] - node['start_index'] > opt.max_page_num_each_node and token_num >= opt.max_token_num_each_node:
        print('large node:', node['title'], 'start_index:', node['start_index'], 'end_index:', node['end_index'], 'token_num:', token_num)

        node_toc_tree = await meta_processor(node_page_list, mode='process_no_toc', start_index=node['start_index'], opt=opt, logger=logger)
        node_toc_tree = await check_title_appearance_in_start_concurrent(node_toc_tree, page_list, model=opt.model, logger=logger)
        
        # Filter out items with None physical_index before post_processing
        valid_node_toc_items = [item for item in node_toc_tree if item.get('physical_index') is not None]
        
        if valid_node_toc_items and node['title'].strip() == valid_node_toc_items[0]['title'].strip():
            node['nodes'] = post_processing(valid_node_toc_items[1:], node['end_index'])
            node['end_index'] = valid_node_toc_items[1]['start_index'] if len(valid_node_toc_items) > 1 else node['end_index']
        else:
            node['nodes'] = post_processing(valid_node_toc_items, node['end_index'])
            node['end_index'] = valid_node_toc_items[0]['start_index'] if valid_node_toc_items else node['end_index']
        
    if 'nodes' in node and node['nodes']:
        tasks = [
            process_large_node_recursively(child_node, page_list, opt, logger=logger)
            for child_node in node['nodes']
        ]
        await asyncio.gather(*tasks)
    
    return node

async def tree_parser(page_list, opt, doc=None, logger=None):
    check_toc_result = check_toc(page_list, opt)
    logger.info(check_toc_result)

    if check_toc_result.get("toc_content") and check_toc_result["toc_content"].strip() and check_toc_result["page_index_given_in_toc"] == "yes":
        toc_with_page_number = await meta_processor(
            page_list, 
            mode='process_toc_with_page_numbers', 
            start_index=1, 
            toc_content=check_toc_result['toc_content'], 
            toc_page_list=check_toc_result['toc_page_list'], 
            opt=opt,
            logger=logger)
    else:
        toc_with_page_number = await meta_processor(
            page_list, 
            mode='process_no_toc', 
            start_index=1, 
            opt=opt,
            logger=logger)

    toc_with_page_number = add_preface_if_needed(toc_with_page_number)
    toc_with_page_number = await check_title_appearance_in_start_concurrent(toc_with_page_number, page_list, model=opt.model, logger=logger)
    
    # Filter out items with None physical_index before post_processing
    valid_toc_items = [item for item in toc_with_page_number if item.get('physical_index') is not None]
    
    toc_tree = post_processing(valid_toc_items, len(page_list))
    tasks = [
        process_large_node_recursively(node, page_list, opt, logger=logger)
        for node in toc_tree
    ]
    await asyncio.gather(*tasks)
    
    return toc_tree


def page_index_main(doc, opt=None):
    logger = JsonLogger(doc)
    
    is_valid_pdf = (
        (isinstance(doc, str) and os.path.isfile(doc) and doc.lower().endswith(".pdf")) or 
        isinstance(doc, BytesIO)
    )
    if not is_valid_pdf:
        raise ValueError("Unsupported input type. Expected a PDF file path or BytesIO object.")

    print('Parsing PDF...')
    page_list = get_page_tokens(doc)

    logger.info({'total_page_number': len(page_list)})
    logger.info({'total_token': sum([page[1] for page in page_list])})
    
    structure = asyncio.run(tree_parser(page_list, opt, doc=doc, logger=logger))
    if opt.if_add_node_id == 'yes':
        write_node_id(structure)    
    if opt.if_add_node_text == 'yes':
        add_node_text(structure, page_list)
    if opt.if_add_node_summary == 'yes':
        if opt.if_add_node_text == 'no':
            add_node_text(structure, page_list)
        asyncio.run(generate_summaries_for_structure(structure, model=opt.model))
        if opt.if_add_node_text == 'no':
            remove_structure_text(structure)
        if opt.if_add_doc_description == 'yes':
            doc_description = generate_doc_description(structure, model=opt.model)
            return {
                'doc_name': get_pdf_name(doc),
                'doc_description': doc_description,
                'structure': structure,
            }
    return {
        'doc_name': get_pdf_name(doc),
        'structure': structure,
    }


def page_index(doc, model=None, toc_check_page_num=None, max_page_num_each_node=None, max_token_num_each_node=None,
               if_add_node_id=None, if_add_node_summary=None, if_add_doc_description=None, if_add_node_text=None):
    
    user_opt = {
        arg: value for arg, value in locals().items()
        if arg != "doc" and value is not None
    }
    opt = ConfigLoader().load(user_opt)
    return page_index_main(doc, opt)


def validate_and_truncate_physical_indices(toc_with_page_number, page_list_length, start_index=1, logger=None):
    """
    Validates and truncates physical indices that exceed the actual document length.
    This prevents errors when the TOC references pages that don't exist in the document (e.g., the file is broken or incomplete).
    """
    if not toc_with_page_number:
        return toc_with_page_number
    
    max_allowed_page = page_list_length + start_index - 1
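    # physical_index is an absolute page label, so with an offset start the
    # largest valid label is start_index + page_list_length - 1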
    truncated_items = []
    
    for i, item in enumerate(toc_with_page_number):
        if item.get('physical_index') is not None:
            original_index = item['physical_index']
            if original_index > max_allowed_page:
                item['physical_index'] = None
                truncated_items.append({
                    'title': item.get('title', 'Unknown'),
                    'original_index': original_index
                })
                if logger:
                    logger.info(f"Removed physical_index for '{item.get('title', 'Unknown')}' (was {original_index}, too far beyond document)")
    
    if truncated_items and logger:
        logger.info(f"Total removed items: {len(truncated_items)}")
        
    print(f"Document validation: {page_list_length} pages, max allowed index: {max_allowed_page}")
    if truncated_items:
        print(f"Truncated {len(truncated_items)} TOC items that exceeded document length")
     
    return toc_with_page_number
```
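A minimal usage sketch for the `page_index` entry point above (the PDF path and model name below are placeholders; unset options fall back to the defaults in `pageindex/config.yaml`, and `page_index` can also be imported from `pageindex.page_index` if the package root does not re-export it):

```py
from pageindex import page_index

# Build the tree structure for a local PDF. Boolean-like options are the
# 'yes'/'no' strings consumed by page_index_main above.
result = page_index(
    "docs/2023-annual-report.pdf",  # placeholder path
    model="gpt-4o-2024-11-20",      # assumed model name
    if_add_node_id="yes",
    if_add_node_summary="no",
)

print(result["doc_name"])   # e.g. "2023-annual-report.pdf"
print(result["structure"])  # nested nodes with title/start_index/end_index
```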

## /pageindex/utils.py

```py path="/pageindex/utils.py" 
import tiktoken
import openai
import logging
import os
from datetime import datetime
import time
import json
import re
import PyPDF2
import copy
import asyncio
import pymupdf
from io import BytesIO
from dotenv import load_dotenv
load_dotenv()
import yaml
from pathlib import Path
from types import SimpleNamespace as config

CHATGPT_API_KEY = os.getenv("CHATGPT_API_KEY")


def count_tokens(text, model):
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(text)
    return len(tokens)

def ChatGPT_API_with_finish_reason(model, prompt, api_key=CHATGPT_API_KEY, chat_history=None):
    max_retries = 10
    client = openai.OpenAI(api_key=api_key)
    for i in range(max_retries):
        try:
            if chat_history:
                messages = chat_history
                messages.append({"role": "user", "content": prompt})
            else:
                messages = [{"role": "user", "content": prompt}]
            
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=0,
            )
            if response.choices[0].finish_reason == "length":
                return response.choices[0].message.content, "max_output_reached"
            else:
                return response.choices[0].message.content, "finished"

        except Exception as e:
            print('************* Retrying *************')
            logging.error(f"Error: {e}")
            if i < max_retries - 1:
                time.sleep(1)  # Wait for 1 second before retrying
            else:
                logging.error('Max retries reached for prompt: ' + prompt)
                # Return a tuple so callers can unpack consistently with the success path
                return "Error", "error"



def ChatGPT_API(model, prompt, api_key=CHATGPT_API_KEY, chat_history=None):
    max_retries = 10
    client = openai.OpenAI(api_key=api_key)
    for i in range(max_retries):
        try:
            if chat_history:
                messages = chat_history
                messages.append({"role": "user", "content": prompt})
            else:
                messages = [{"role": "user", "content": prompt}]
            
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=0,
            )
   
            return response.choices[0].message.content
        except Exception as e:
            print('************* Retrying *************')
            logging.error(f"Error: {e}")
            if i < max_retries - 1:
                time.sleep(1)  # Wait for 1 second before retrying
            else:
                logging.error('Max retries reached for prompt: ' + prompt)
                return "Error"
            

async def ChatGPT_API_async(model, prompt, api_key=CHATGPT_API_KEY):
    max_retries = 10
    client = openai.AsyncOpenAI(api_key=api_key)
    for i in range(max_retries):
        try:
            messages = [{"role": "user", "content": prompt}]
            response = await client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=0,
            )
            return response.choices[0].message.content
        except Exception as e:
            print('************* Retrying *************')
            logging.error(f"Error: {e}")
            if i < max_retries - 1:
                await asyncio.sleep(1)  # Wait for 1 second before retrying
            else:
                logging.error('Max retries reached for prompt: ' + prompt)
                return "Error"  
            
def get_json_content(response):
    start_idx = response.find("\`\`\`json")
    if start_idx != -1:
        start_idx += 7
        response = response[start_idx:]
        
    end_idx = response.rfind("\`\`\`")
    if end_idx != -1:
        response = response[:end_idx]
    
    json_content = response.strip()
    return json_content
         

def extract_json(content):
    try:
        # First, try to extract JSON enclosed within \`\`\`json and \`\`\`
        start_idx = content.find("\`\`\`json")
        if start_idx != -1:
            start_idx += 7  # Adjust index to start after the delimiter
            end_idx = content.rfind("\`\`\`")
            json_content = content[start_idx:end_idx].strip()
        else:
            # If no delimiters, assume entire content could be JSON
            json_content = content.strip()

        # Clean up common issues that might cause parsing errors
        json_content = json_content.replace('None', 'null')  # Replace Python None with JSON null
        json_content = json_content.replace('\n', ' ').replace('\r', ' ')  # Remove newlines
        json_content = ' '.join(json_content.split())  # Normalize whitespace

        # Attempt to parse and return the JSON object
        return json.loads(json_content)
    except json.JSONDecodeError as e:
        logging.error(f"Failed to extract JSON: {e}")
        # Try to clean up the content further if initial parsing fails
        try:
            # Remove any trailing commas before closing brackets/braces
            json_content = json_content.replace(',]', ']').replace(',}', '}')
            return json.loads(json_content)
        except:
            logging.error("Failed to parse JSON even after cleanup")
            return {}
    except Exception as e:
        logging.error(f"Unexpected error while extracting JSON: {e}")
        return {}

def write_node_id(data, node_id=0):
    if isinstance(data, dict):
        data['node_id'] = str(node_id).zfill(4)
        node_id += 1
        for key in list(data.keys()):
            if 'nodes' in key:
                node_id = write_node_id(data[key], node_id)
    elif isinstance(data, list):
        for index in range(len(data)):
            node_id = write_node_id(data[index], node_id)
    return node_id

def get_nodes(structure):
    if isinstance(structure, dict):
        structure_node = copy.deepcopy(structure)
        structure_node.pop('nodes', None)
        nodes = [structure_node]
        for key in list(structure.keys()):
            if 'nodes' in key:
                nodes.extend(get_nodes(structure[key]))
        return nodes
    elif isinstance(structure, list):
        nodes = []
        for item in structure:
            nodes.extend(get_nodes(item))
        return nodes
    
def structure_to_list(structure):
    if isinstance(structure, dict):
        nodes = []
        nodes.append(structure)
        if 'nodes' in structure:
            nodes.extend(structure_to_list(structure['nodes']))
        return nodes
    elif isinstance(structure, list):
        nodes = []
        for item in structure:
            nodes.extend(structure_to_list(item))
        return nodes

    
def get_leaf_nodes(structure):
    if isinstance(structure, dict):
        if not structure['nodes']:
            structure_node = copy.deepcopy(structure)
            structure_node.pop('nodes', None)
            return [structure_node]
        else:
            leaf_nodes = []
            for key in list(structure.keys()):
                if 'nodes' in key:
                    leaf_nodes.extend(get_leaf_nodes(structure[key]))
            return leaf_nodes
    elif isinstance(structure, list):
        leaf_nodes = []
        for item in structure:
            leaf_nodes.extend(get_leaf_nodes(item))
        return leaf_nodes

def is_leaf_node(data, node_id):
    # Helper function to find the node by its node_id
    def find_node(data, node_id):
        if isinstance(data, dict):
            if data.get('node_id') == node_id:
                return data
            for key in data.keys():
                if 'nodes' in key:
                    result = find_node(data[key], node_id)
                    if result:
                        return result
        elif isinstance(data, list):
            for item in data:
                result = find_node(item, node_id)
                if result:
                    return result
        return None

    # Find the node with the given node_id
    node = find_node(data, node_id)

    # Check if the node is a leaf node
    if node and not node.get('nodes'):
        return True
    return False

def get_last_node(structure):
    return structure[-1]


def extract_text_from_pdf(pdf_path):
    pdf_reader = PyPDF2.PdfReader(pdf_path)
    # Return the full text as a single string (not a per-page list)
    text = ""
    for page_num in range(len(pdf_reader.pages)):
        page = pdf_reader.pages[page_num]
        text += page.extract_text()
    return text

def get_pdf_title(pdf_path):
    pdf_reader = PyPDF2.PdfReader(pdf_path)
    meta = pdf_reader.metadata
    title = meta.title if meta and meta.title else 'Untitled'
    return title

def get_text_of_pages(pdf_path, start_page, end_page, tag=True):
    pdf_reader = PyPDF2.PdfReader(pdf_path)
    text = ""
    for page_num in range(start_page-1, end_page):
        page = pdf_reader.pages[page_num]
        page_text = page.extract_text()
        if tag:
            text += f"<start_index_{page_num+1}>\n{page_text}\n<end_index_{page_num+1}>\n"
        else:
            text += page_text
    return text

def get_first_start_page_from_text(text):
    start_page = -1
    start_page_match = re.search(r'<start_index_(\d+)>', text)
    if start_page_match:
        start_page = int(start_page_match.group(1))
    return start_page

def get_last_start_page_from_text(text):
    start_page = -1
    # Find all matches of start_index tags
    start_page_matches = re.finditer(r'<start_index_(\d+)>', text)
    # Convert iterator to list and get the last match if any exist
    matches_list = list(start_page_matches)
    if matches_list:
        start_page = int(matches_list[-1].group(1))
    return start_page


def sanitize_filename(filename, replacement='-'):
    # In Linux, only '/' and '\0' (null) are invalid in filenames.
    # Null can't be represented in strings, so we only handle '/'.
    return filename.replace('/', replacement)

def get_pdf_name(pdf_path):
    # Extract PDF name
    if isinstance(pdf_path, str):
        pdf_name = os.path.basename(pdf_path)
    elif isinstance(pdf_path, BytesIO):
        pdf_reader = PyPDF2.PdfReader(pdf_path)
        meta = pdf_reader.metadata
        pdf_name = meta.title if meta and meta.title else 'Untitled'
        pdf_name = sanitize_filename(pdf_name)
    return pdf_name


class JsonLogger:
    def __init__(self, file_path):
        # Extract PDF name for logger name
        pdf_name = get_pdf_name(file_path)
            
        current_time = datetime.now().strftime("%Y%m%d_%H%M%S")
        self.filename = f"{pdf_name}_{current_time}.json"
        os.makedirs("./logs", exist_ok=True)
        # Initialize empty list to store all messages
        self.log_data = []

    def log(self, level, message, **kwargs):
        # Add the new message to the in-memory log data
        if isinstance(message, dict):
            self.log_data.append(message)
        else:
            self.log_data.append({'message': message})
        # Write the entire log data to file
        with open(self._filepath(), "w") as f:
            json.dump(self.log_data, f, indent=2)

    def info(self, message, **kwargs):
        self.log("INFO", message, **kwargs)

    def error(self, message, **kwargs):
        self.log("ERROR", message, **kwargs)

    def debug(self, message, **kwargs):
        self.log("DEBUG", message, **kwargs)

    def exception(self, message, **kwargs):
        kwargs["exception"] = True
        self.log("ERROR", message, **kwargs)

    def _filepath(self):
        return os.path.join("logs", self.filename)
    



def list_to_tree(data):
    def get_parent_structure(structure):
        """Helper function to get the parent structure code"""
        if not structure:
            return None
        parts = str(structure).split('.')
        return '.'.join(parts[:-1]) if len(parts) > 1 else None
    
    # First pass: Create nodes and track parent-child relationships
    nodes = {}
    root_nodes = []
    
    for item in data:
        structure = item.get('structure')
        node = {
            'title': item.get('title'),
            'start_index': item.get('start_index'),
            'end_index': item.get('end_index'),
            'nodes': []
        }
        
        nodes[structure] = node
        
        # Find parent
        parent_structure = get_parent_structure(structure)
        
        if parent_structure:
            # Add as child to parent if parent exists
            if parent_structure in nodes:
                nodes[parent_structure]['nodes'].append(node)
            else:
                root_nodes.append(node)
        else:
            # No parent, this is a root node
            root_nodes.append(node)
    
    # Helper function to clean empty children arrays
    def clean_node(node):
        if not node['nodes']:
            del node['nodes']
        else:
            for child in node['nodes']:
                clean_node(child)
        return node
    
    # Clean and return the tree
    return [clean_node(node) for node in root_nodes]

def add_preface_if_needed(data):
    if not isinstance(data, list) or not data:
        return data

    if data[0]['physical_index'] is not None and data[0]['physical_index'] > 1:
        preface_node = {
            "structure": "0",
            "title": "Preface",
            "physical_index": 1,
        }
        data.insert(0, preface_node)
    return data



def get_page_tokens(pdf_path, model="gpt-4o-2024-11-20", pdf_parser="PyPDF2"):
    enc = tiktoken.encoding_for_model(model)
    if pdf_parser == "PyPDF2":
        pdf_reader = PyPDF2.PdfReader(pdf_path)
        page_list = []
        for page_num in range(len(pdf_reader.pages)):
            page = pdf_reader.pages[page_num]
            page_text = page.extract_text()
            token_length = len(enc.encode(page_text))
            page_list.append((page_text, token_length))
        return page_list
    elif pdf_parser == "PyMuPDF":
        if isinstance(pdf_path, BytesIO):
            pdf_stream = pdf_path
            doc = pymupdf.open(stream=pdf_stream, filetype="pdf")
        elif isinstance(pdf_path, str) and os.path.isfile(pdf_path) and pdf_path.lower().endswith(".pdf"):
            doc = pymupdf.open(pdf_path)
        else:
            raise ValueError("Unsupported input type. Expected a PDF file path or BytesIO object.")
        page_list = []
        for page in doc:
            page_text = page.get_text()
            token_length = len(enc.encode(page_text))
            page_list.append((page_text, token_length))
        return page_list
    else:
        raise ValueError(f"Unsupported PDF parser: {pdf_parser}")

        

def get_text_of_pdf_pages(pdf_pages, start_page, end_page):
    text = ""
    for page_num in range(start_page-1, end_page):
        text += pdf_pages[page_num][0]
    return text

def get_text_of_pdf_pages_with_labels(pdf_pages, start_page, end_page):
    text = ""
    for page_num in range(start_page-1, end_page):
        text += f"<physical_index_{page_num+1}>\n{pdf_pages[page_num][0]}\n<physical_index_{page_num+1}>\n"
    return text

def get_number_of_pages(pdf_path):
    pdf_reader = PyPDF2.PdfReader(pdf_path)
    num = len(pdf_reader.pages)
    return num



def post_processing(structure, end_physical_index):
    # First convert page_number to start_index in flat list
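    # 'appear_start' == 'yes' means the next section's title sits at the top of
    # its page, so the current section ends on the previous page; otherwise the
    # boundary page is shared by both sections.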
    for i, item in enumerate(structure):
        item['start_index'] = item.get('physical_index')
        if i < len(structure) - 1:
            if structure[i + 1].get('appear_start') == 'yes':
                item['end_index'] = structure[i + 1]['physical_index']-1
            else:
                item['end_index'] = structure[i + 1]['physical_index']
        else:
            item['end_index'] = end_physical_index
    tree = list_to_tree(structure)
    if len(tree)!=0:
        return tree
    else:
        # Tree construction failed: return the flat list, stripping helper keys
        for node in structure:
            node.pop('appear_start', None)
            node.pop('physical_index', None)
        return structure

def clean_structure_post(data):
    if isinstance(data, dict):
        data.pop('page_number', None)
        data.pop('start_index', None)
        data.pop('end_index', None)
        if 'nodes' in data:
            clean_structure_post(data['nodes'])
    elif isinstance(data, list):
        for section in data:
            clean_structure_post(section)
    return data


def remove_structure_text(data):
    if isinstance(data, dict):
        data.pop('text', None)
        if 'nodes' in data:
            remove_structure_text(data['nodes'])
    elif isinstance(data, list):
        for item in data:
            remove_structure_text(item)
    return data


def check_token_limit(structure, limit=110000):
    nodes = structure_to_list(structure)
    for node in nodes:
        num_tokens = count_tokens(node['text'], model='gpt-4o')
        if num_tokens > limit:
            print(f"Node ID: {node['node_id']} has {num_tokens} tokens")
            print("Start Index:", node['start_index'])
            print("End Index:", node['end_index'])
            print("Title:", node['title'])
            print("\n")


def convert_physical_index_to_int(data):
    if isinstance(data, list):
        for i in range(len(data)):
            # Check if item is a dictionary and has 'physical_index' key
            if isinstance(data[i], dict) and 'physical_index' in data[i]:
                if isinstance(data[i]['physical_index'], str):
                    if data[i]['physical_index'].startswith('<physical_index_'):
                        data[i]['physical_index'] = int(data[i]['physical_index'].split('_')[-1].rstrip('>').strip())
                    elif data[i]['physical_index'].startswith('physical_index_'):
                        data[i]['physical_index'] = int(data[i]['physical_index'].split('_')[-1].strip())
    elif isinstance(data, str):
        if data.startswith('<physical_index_'):
            data = int(data.split('_')[-1].rstrip('>').strip())
        elif data.startswith('physical_index_'):
            data = int(data.split('_')[-1].strip())
        # Check data is int
        if isinstance(data, int):
            return data
        else:
            return None
    return data


def convert_page_to_int(data):
    for item in data:
        if 'page' in item and isinstance(item['page'], str):
            try:
                item['page'] = int(item['page'])
            except ValueError:
                # Keep original value if conversion fails
                pass
    return data


def add_node_text(node, pdf_pages):
    if isinstance(node, dict):
        start_page = node.get('start_index')
        end_page = node.get('end_index')
        node['text'] = get_text_of_pdf_pages(pdf_pages, start_page, end_page)
        if 'nodes' in node:
            add_node_text(node['nodes'], pdf_pages)
    elif isinstance(node, list):
        for index in range(len(node)):
            add_node_text(node[index], pdf_pages)
    return


def add_node_text_with_labels(node, pdf_pages):
    if isinstance(node, dict):
        start_page = node.get('start_index')
        end_page = node.get('end_index')
        node['text'] = get_text_of_pdf_pages_with_labels(pdf_pages, start_page, end_page)
        if 'nodes' in node:
            add_node_text_with_labels(node['nodes'], pdf_pages)
    elif isinstance(node, list):
        for index in range(len(node)):
            add_node_text_with_labels(node[index], pdf_pages)
    return


async def generate_node_summary(node, model=None):
    prompt = f"""You are given a part of a document, your task is to generate a description of the partial document about what are main points covered in the partial document.

    Partial Document Text: {node['text']}
    
    Directly return the description, do not include any other text.
    """
    response = await ChatGPT_API_async(model, prompt)
    return response


async def generate_summaries_for_structure(structure, model=None):
    nodes = structure_to_list(structure)
    tasks = [generate_node_summary(node, model=model) for node in nodes]
    summaries = await asyncio.gather(*tasks)
    
    for node, summary in zip(nodes, summaries):
        node['summary'] = summary
    return structure


def generate_doc_description(structure, model=None):
    prompt = f"""Your are an expert in generating descriptions for a document.
    You are given a structure of a document. Your task is to generate a one-sentence description for the document, which makes it easy to distinguish the document from other documents.
        
    Document Structure: {structure}
    
    Directly return the description, do not include any other text.
    """
    response = ChatGPT_API(model, prompt)
    return response


class ConfigLoader:
    def __init__(self, default_path: str = None):
        if default_path is None:
            default_path = Path(__file__).parent / "config.yaml"
        self._default_dict = self._load_yaml(default_path)

    @staticmethod
    def _load_yaml(path):
        with open(path, "r", encoding="utf-8") as f:
            return yaml.safe_load(f) or {}

    def _validate_keys(self, user_dict):
        unknown_keys = set(user_dict) - set(self._default_dict)
        if unknown_keys:
            raise ValueError(f"Unknown config keys: {unknown_keys}")

    def load(self, user_opt=None) -> config:
        """
        Load the configuration, merging user options with default values.
        """
        if user_opt is None:
            user_dict = {}
        elif isinstance(user_opt, config):
            user_dict = vars(user_opt)
        elif isinstance(user_opt, dict):
            user_dict = user_opt
        else:
            raise TypeError("user_opt must be dict, config(SimpleNamespace) or None")

        self._validate_keys(user_dict)
        merged = {**self._default_dict, **user_dict}
        return config(**merged)
```
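The tree builders above key parent/child relationships off a dotted `structure` code (`"1"`, `"1.2"`, `"1.2.3"`). A small illustration of `list_to_tree` under that convention (the section titles and indices are made up):

```py
from pageindex.utils import list_to_tree

flat = [
    {"structure": "1",   "title": "Introduction", "start_index": 1, "end_index": 4},
    {"structure": "1.1", "title": "Background",   "start_index": 2, "end_index": 3},
    {"structure": "2",   "title": "Methods",      "start_index": 5, "end_index": 9},
]

tree = list_to_tree(flat)
# "1.1" nests under "1"; empty 'nodes' lists are removed by clean_node:
# [{'title': 'Introduction', 'start_index': 1, 'end_index': 4,
#   'nodes': [{'title': 'Background', 'start_index': 2, 'end_index': 3}]},
#  {'title': 'Methods', 'start_index': 5, 'end_index': 9}]
```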

## /requirements.txt

```txt path="/requirements.txt" 
openai==1.70.0
pymupdf==1.25.5
PyPDF2==3.0.1
python-dotenv==1.1.0
tiktoken==0.7.0
pyyaml==6.0.2
```
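The pinned dependencies can be installed with `pip install -r requirements.txt` (a reasonably recent Python 3 is assumed).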


## /results/2023-annual-report-truncated_structure.json

```json path="/results/2023-annual-report-truncated_structure.json" 
{
  "doc_name": "2023-annual-report-truncated.pdf",
  "structure": [
    {
      "title": "Preface",
      "start_index": 1,
      "end_index": 4,
      "node_id": "0000"
    },
    {
      "title": "About the Federal Reserve",
      "start_index": 5,
      "end_index": 7,
      "node_id": "0001"
    },
    {
      "title": "Overview",
      "start_index": 7,
      "end_index": 8,
      "node_id": "0002"
    },
    {
      "title": "Monetary Policy and Economic Developments",
      "start_index": 9,
      "end_index": 9,
      "nodes": [
        {
          "title": "March 2024 Summary",
          "start_index": 9,
          "end_index": 14,
          "node_id": "0004"
        },
        {
          "title": "June 2023 Summary",
          "start_index": 15,
          "end_index": 20,
          "node_id": "0005"
        }
      ],
      "node_id": "0003"
    },
    {
      "title": "Financial Stability",
      "start_index": 21,
      "end_index": 21,
      "nodes": [
        {
          "title": "Monitoring Financial Vulnerabilities",
          "start_index": 22,
          "end_index": 28,
          "node_id": "0007"
        },
        {
          "title": "Domestic and International Cooperation and Coordination",
          "start_index": 28,
          "end_index": 30,
          "node_id": "0008"
        }
      ],
      "node_id": "0006"
    },
    {
      "title": "Supervision and Regulation",
      "start_index": 31,
      "end_index": 32,
      "nodes": [
        {
          "title": "Supervised and Regulated Institutions",
          "start_index": 32,
          "end_index": 35,
          "node_id": "0010"
        },
        {
          "title": "Supervisory Developments",
          "start_index": 35,
          "end_index": 50,
          "node_id": "0011"
        }
      ],
      "node_id": "0009"
    }
  ]
}
```

## /results/2023-annual-report_structure.json

```json path="/results/2023-annual-report_structure.json" 
{
  "doc_name": "2023-annual-report.pdf",
  "structure": [
    {
      "title": "Preface",
      "start_index": 1,
      "end_index": 4,
      "node_id": "0000"
    },
    {
      "title": "About the Federal Reserve",
      "start_index": 5,
      "end_index": 6,
      "node_id": "0001"
    },
    {
      "title": "Overview",
      "start_index": 7,
      "end_index": 8,
      "node_id": "0002"
    },
    {
      "title": "Monetary Policy and Economic Developments",
      "start_index": 9,
      "end_index": 9,
      "nodes": [
        {
          "title": "March 2024 Summary",
          "start_index": 9,
          "end_index": 14,
          "node_id": "0004"
        },
        {
          "title": "June 2023 Summary",
          "start_index": 15,
          "end_index": 20,
          "node_id": "0005"
        }
      ],
      "node_id": "0003"
    },
    {
      "title": "Financial Stability",
      "start_index": 21,
      "end_index": 21,
      "nodes": [
        {
          "title": "Monitoring Financial Vulnerabilities",
          "start_index": 22,
          "end_index": 28,
          "node_id": "0007"
        },
        {
          "title": "Domestic and International Cooperation and Coordination",
          "start_index": 28,
          "end_index": 31,
          "node_id": "0008"
        }
      ],
      "node_id": "0006"
    },
    {
      "title": "Supervision and Regulation",
      "start_index": 31,
      "end_index": 31,
      "nodes": [
        {
          "title": "Supervised and Regulated Institutions",
          "start_index": 32,
          "end_index": 35,
          "node_id": "0010"
        },
        {
          "title": "Supervisory Developments",
          "start_index": 35,
          "end_index": 54,
          "node_id": "0011"
        },
        {
          "title": "Regulatory Developments",
          "start_index": 55,
          "end_index": 59,
          "node_id": "0012"
        }
      ],
      "node_id": "0009"
    },
    {
      "title": "Payment System and Reserve Bank Oversight",
      "start_index": 59,
      "end_index": 59,
      "nodes": [
        {
          "title": "Payment Services to Depository and Other Institutions",
          "start_index": 60,
          "end_index": 65,
          "node_id": "0014"
        },
        {
          "title": "Currency and Coin",
          "start_index": 66,
          "end_index": 68,
          "node_id": "0015"
        },
        {
          "title": "Fiscal Agency and Government Depository Services",
          "start_index": 69,
          "end_index": 72,
          "node_id": "0016"
        },
        {
          "title": "Evolutions and Improvements to the System",
          "start_index": 72,
          "end_index": 75,
          "node_id": "0017"
        },
        {
          "title": "Oversight of Federal Reserve Banks",
          "start_index": 75,
          "end_index": 81,
          "node_id": "0018"
        },
        {
          "title": "Pro Forma Financial Statements for Federal Reserve Priced Services",
          "start_index": 82,
          "end_index": 88,
          "node_id": "0019"
        }
      ],
      "node_id": "0013"
    },
    {
      "title": "Consumer and Community Affairs",
      "start_index": 89,
      "end_index": 89,
      "nodes": [
        {
          "title": "Consumer Compliance Supervision",
          "start_index": 89,
          "end_index": 101,
          "node_id": "0021"
        },
        {
          "title": "Consumer Laws and Regulations",
          "start_index": 101,
          "end_index": 102,
          "node_id": "0022"
        },
        {
          "title": "Consumer Research and Analysis of Emerging Issues and Policy",
          "start_index": 102,
          "end_index": 105,
          "node_id": "0023"
        },
        {
          "title": "Community Development",
          "start_index": 105,
          "end_index": 106,
          "node_id": "0024"
        }
      ],
      "node_id": "0020"
    },
    {
      "title": "Appendixes",
      "start_index": 107,
      "end_index": 109,
      "node_id": "0025"
    },
    {
      "title": "Federal Reserve System Organization",
      "start_index": 109,
      "end_index": 109,
      "nodes": [
        {
          "title": "Board of Governors",
          "start_index": 109,
          "end_index": 116,
          "node_id": "0027"
        },
        {
          "title": "Federal Open Market Committee",
          "start_index": 117,
          "end_index": 118,
          "node_id": "0028"
        },
        {
          "title": "Board of Governors Advisory Councils",
          "start_index": 119,
          "end_index": 122,
          "node_id": "0029"
        },
        {
          "title": "Federal Reserve Banks and Branches",
          "start_index": 123,
          "end_index": 146,
          "node_id": "0030"
        }
      ],
      "node_id": "0026"
    },
    {
      "title": "Minutes of Federal Open Market Committee Meetings",
      "start_index": 147,
      "end_index": 147,
      "nodes": [
        {
          "title": "Meeting Minutes",
          "start_index": 147,
          "end_index": 149,
          "node_id": "0032"
        }
      ],
      "node_id": "0031"
    },
    {
      "title": "Federal Reserve System Audits",
      "start_index": 149,
      "end_index": 149,
      "nodes": [
        {
          "title": "Office of Inspector General Activities",
          "start_index": 149,
          "end_index": 151,
          "node_id": "0034"
        },
        {
          "title": "Government Accountability Office Reviews",
          "start_index": 151,
          "end_index": 152,
          "node_id": "0035"
        }
      ],
      "node_id": "0033"
    },
    {
      "title": "Federal Reserve System Budgets",
      "start_index": 153,
      "end_index": 153,
      "nodes": [
        {
          "title": "System Budgets Overview",
          "start_index": 153,
          "end_index": 157,
          "node_id": "0037"
        },
        {
          "title": "Board of Governors Budgets",
          "start_index": 157,
          "end_index": 163,
          "node_id": "0038"
        },
        {
          "title": "Federal Reserve Banks Budgets",
          "start_index": 163,
          "end_index": 169,
          "node_id": "0039"
        },
        {
          "title": "Currency Budget",
          "start_index": 169,
          "end_index": 174,
          "node_id": "0040"
        }
      ],
      "node_id": "0036"
    },
    {
      "title": "Record of Policy Actions of the Board of Governors",
      "start_index": 175,
      "end_index": 175,
      "nodes": [
        {
          "title": "Rules and Regulations",
          "start_index": 175,
          "end_index": 176,
          "node_id": "0042"
        },
        {
          "title": "Policy Statements and Other Actions",
          "start_index": 177,
          "end_index": 181,
          "node_id": "0043"
        },
        {
          "title": "Discount Rates for Depository Institutions in 2023",
          "start_index": 181,
          "end_index": 183,
          "node_id": "0044"
        },
        {
          "title": "The Board of Governors and the Government Performance and Results Act",
          "start_index": 184,
          "end_index": 184,
          "node_id": "0045"
        }
      ],
      "node_id": "0041"
    },
    {
      "title": "Litigation",
      "start_index": 185,
      "end_index": 185,
      "nodes": [
        {
          "title": "Pending",
          "start_index": 185,
          "end_index": 186,
          "node_id": "0047"
        },
        {
          "title": "Resolved",
          "start_index": 186,
          "end_index": 187,
          "node_id": "0048"
        }
      ],
      "node_id": "0046"
    },
    {
      "title": "Statistical Tables",
      "start_index": 187,
      "end_index": 187,
      "nodes": [
        {
          "title": "Federal Reserve open market transactions, 2023",
          "start_index": 187,
          "end_index": 187,
          "nodes": [
            {
              "title": "Type of security and transaction",
              "start_index": 187,
              "end_index": 188,
              "node_id": "0051"
            },
            {
              "title": "Federal agency obligations",
              "start_index": 188,
              "end_index": 188,
              "node_id": "0052"
            },
            {
              "title": "Mortgage-backed securities",
              "start_index": 188,
              "end_index": 188,
              "node_id": "0053"
            },
            {
              "title": "Temporary transactions",
              "start_index": 188,
              "end_index": 188,
              "node_id": "0054"
            }
          ],
          "node_id": "0050"
        },
        {
          "title": "Federal Reserve Bank holdings of U.S. Treasury and federal agency securities, December 31, 2021\u201323",
          "start_index": 189,
          "end_index": 189,
          "nodes": [
            {
              "title": "By remaining maturity",
              "start_index": 189,
              "end_index": 189,
              "node_id": "0056"
            },
            {
              "title": "By type",
              "start_index": 189,
              "end_index": 190,
              "node_id": "0057"
            },
            {
              "title": "By issuer",
              "start_index": 190,
              "end_index": 190,
              "node_id": "0058"
            }
          ],
          "node_id": "0055"
        },
        {
          "title": "Reserve requirements of depository institutions, December 31, 2023",
          "start_index": 191,
          "end_index": 191,
          "node_id": "0059"
        },
        {
          "title": "Banking offices and banks affiliated with bank holding companies in the United States, December 31, 2022 and 2023",
          "start_index": 192,
          "end_index": 192,
          "node_id": "0060"
        },
        {
          "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1984\u20132023 and month-end 2023",
          "start_index": 193,
          "end_index": 196,
          "node_id": "0061"
        },
        {
          "title": "Reserves of depository institutions, Federal Reserve Bank credit, and related items, year-end 1918\u20131983",
          "start_index": 197,
          "end_index": 200,
          "node_id": "0062"
        },
        {
          "title": "Principal assets and liabilities of insured commercial banks, by class of bank, June 30, 2023 and 2022",
          "start_index": 201,
          "end_index": 201,
          "node_id": "0063"
        },
        {
          "title": "Initial margin requirements under Regulations T, U, and X",
          "start_index": 202,
          "end_index": 203,
          "node_id": "0064"
        },
        {
          "title": "Statement of condition of the Federal Reserve Banks, by Bank, December 31, 2023 and 2022",
          "start_index": 203,
          "end_index": 209,
          "node_id": "0065"
        },
        {
          "title": "Statement of condition of the Federal Reserve Banks, December 31, 2023 and 2022",
          "start_index": 209,
          "end_index": 210,
          "node_id": "0066"
        },
        {
          "title": "Income and expenses of the Federal Reserve Banks, by Bank, 2023",
          "start_index": 210,
          "end_index": 212,
          "nodes": [
            {
              "title": "Income and expenses of the Federal Reserve Banks, by Bank, 2023\u2014continued",
              "start_index": 212,
              "end_index": 214,
              "node_id": "0068"
            }
          ],
          "node_id": "0067"
        },
        {
          "title": "Income and expenses of the Federal Reserve Banks, 1914\u20132023",
          "start_index": 214,
          "end_index": 215,
          "nodes": [
            {
              "title": "Income and expenses of the Federal Reserve Banks, 1914\u20132023\u2014continued",
              "start_index": 215,
              "end_index": 216,
              "node_id": "0070"
            },
            {
              "title": "Income and expenses of the Federal Reserve Banks, 1914\u20132023\u2014continued",
              "start_index": 216,
              "end_index": 217,
              "node_id": "0071"
            },
            {
              "title": "Income and expenses of the Federal Reserve Banks, 1914\u20132023\u2014continued",
              "start_index": 217,
              "end_index": 217,
              "node_id": "0072"
            }
          ],
          "node_id": "0069"
        },
        {
          "title": "Operations in principal departments of the Federal Reserve Banks, 2020\u201323",
          "start_index": 218,
          "end_index": 218,
          "node_id": "0073"
        },
        {
          "title": "Number and annual salaries of officers and employees of the Federal Reserve Banks, December 31, 2023",
          "start_index": 219,
          "end_index": 220,
          "node_id": "0074"
        },
        {
          "title": "Acquisition costs and net book value of the premises of the Federal Reserve Banks and Branches, December 31, 2023",
          "start_index": 220,
          "end_index": 222,
          "node_id": "0075"
        }
      ],
      "node_id": "0049"
    }
  ]
}
```

## /results/PRML_structure.json

```json path="/results/PRML_structure.json" 
{
  "doc_name": "PRML.pdf",
  "structure": [
    {
      "title": "Preface",
      "start_index": 1,
      "end_index": 6,
      "node_id": "0000"
    },
    {
      "title": "Preface",
      "start_index": 7,
      "end_index": 10,
      "node_id": "0001"
    },
    {
      "title": "Mathematical notation",
      "start_index": 11,
      "end_index": 13,
      "node_id": "0002"
    },
    {
      "title": "Contents",
      "start_index": 13,
      "end_index": 20,
      "node_id": "0003"
    },
    {
      "title": "Introduction",
      "start_index": 21,
      "end_index": 24,
      "nodes": [
        {
          "title": "Example: Polynomial Curve Fitting",
          "start_index": 24,
          "end_index": 32,
          "node_id": "0005"
        },
        {
          "title": "Probability Theory",
          "start_index": 32,
          "end_index": 37,
          "nodes": [
            {
              "title": "Probability densities",
              "start_index": 37,
              "end_index": 39,
              "node_id": "0007"
            },
            {
              "title": "Expectations and covariances",
              "start_index": 39,
              "end_index": 41,
              "node_id": "0008"
            },
            {
              "title": "Bayesian probabilities",
              "start_index": 41,
              "end_index": 44,
              "node_id": "0009"
            },
            {
              "title": "The Gaussian distribution",
              "start_index": 44,
              "end_index": 48,
              "node_id": "0010"
            },
            {
              "title": "Curve fitting re-visited",
              "start_index": 48,
              "end_index": 50,
              "node_id": "0011"
            },
            {
              "title": "Bayesian curve fitting",
              "start_index": 50,
              "end_index": 52,
              "node_id": "0012"
            }
          ],
          "node_id": "0006"
        },
        {
          "title": "Model Selection",
          "start_index": 52,
          "end_index": 53,
          "node_id": "0013"
        },
        {
          "title": "The Curse of Dimensionality",
          "start_index": 53,
          "end_index": 58,
          "node_id": "0014"
        },
        {
          "title": "Decision Theory",
          "start_index": 58,
          "end_index": 59,
          "nodes": [
            {
              "title": "Minimizing the misclassification rate",
              "start_index": 59,
              "end_index": 61,
              "node_id": "0016"
            },
            {
              "title": "Minimizing the expected loss",
              "start_index": 61,
              "end_index": 62,
              "node_id": "0017"
            },
            {
              "title": "The reject option",
              "start_index": 62,
              "end_index": 62,
              "node_id": "0018"
            },
            {
              "title": "Inference and decision",
              "start_index": 62,
              "end_index": 66,
              "node_id": "0019"
            },
            {
              "title": "Loss functions for regression",
              "start_index": 66,
              "end_index": 68,
              "node_id": "0020"
            }
          ],
          "node_id": "0015"
        },
        {
          "title": "Information Theory",
          "start_index": 68,
          "end_index": 75,
          "nodes": [
            {
              "title": "Relative entropy and mutual information",
              "start_index": 75,
              "end_index": 78,
              "node_id": "0022"
            }
          ],
          "node_id": "0021"
        }
      ],
      "node_id": "0004"
    },
    {
      "title": "Exercises",
      "start_index": 78,
      "end_index": 87,
      "node_id": "0023"
    },
    {
      "title": "Probability Distributions",
      "start_index": 87,
      "end_index": 88,
      "nodes": [
        {
          "title": "Binary Variables",
          "start_index": 88,
          "end_index": 91,
          "nodes": [
            {
              "title": "The beta distribution",
              "start_index": 91,
              "end_index": 94,
              "node_id": "0026"
            }
          ],
          "node_id": "0025"
        },
        {
          "title": "Multinomial Variables",
          "start_index": 94,
          "end_index": 96,
          "nodes": [
            {
              "title": "The Dirichlet distribution",
              "start_index": 96,
              "end_index": 98,
              "node_id": "0028"
            }
          ],
          "node_id": "0027"
        },
        {
          "title": "The Gaussian Distribution",
          "start_index": 98,
          "end_index": 105,
          "nodes": [
            {
              "title": "Conditional Gaussian distributions",
              "start_index": 105,
              "end_index": 108,
              "node_id": "0030"
            },
            {
              "title": "Marginal Gaussian distributions",
              "start_index": 108,
              "end_index": 110,
              "node_id": "0031"
            },
            {
              "title": "Bayes\u2019 theorem for Gaussian variables",
              "start_index": 110,
              "end_index": 113,
              "node_id": "0032"
            },
            {
              "title": "Maximum likelihood for the Gaussian",
              "start_index": 113,
              "end_index": 114,
              "node_id": "0033"
            },
            {
              "title": "Sequential estimation",
              "start_index": 114,
              "end_index": 117,
              "node_id": "0034"
            },
            {
              "title": "Bayesian inference for the Gaussian",
              "start_index": 117,
              "end_index": 122,
              "node_id": "0035"
            },
            {
              "title": "Student\u2019s t-distribution",
              "start_index": 122,
              "end_index": 125,
              "node_id": "0036"
            },
            {
              "title": "Periodic variables",
              "start_index": 125,
              "end_index": 130,
              "node_id": "0037"
            },
            {
              "title": "Mixtures of Gaussians",
              "start_index": 130,
              "end_index": 133,
              "node_id": "0038"
            }
          ],
          "node_id": "0029"
        },
        {
          "title": "The Exponential Family",
          "start_index": 133,
          "end_index": 136,
          "nodes": [
            {
              "title": "Maximum likelihood and sufficient statistics",
              "start_index": 136,
              "end_index": 137,
              "node_id": "0040"
            },
            {
              "title": "Conjugate priors",
              "start_index": 137,
              "end_index": 137,
              "node_id": "0041"
            },
            {
              "title": "Noninformative priors",
              "start_index": 137,
              "end_index": 140,
              "node_id": "0042"
            }
          ],
          "node_id": "0039"
        },
        {
          "title": "Nonparametric Methods",
          "start_index": 140,
          "end_index": 142,
          "nodes": [
            {
              "title": "Kernel density estimators",
              "start_index": 142,
              "end_index": 144,
              "node_id": "0044"
            },
            {
              "title": "Nearest-neighbour methods",
              "start_index": 144,
              "end_index": 147,
              "node_id": "0045"
            }
          ],
          "node_id": "0043"
        }
      ],
      "node_id": "0024"
    },
    {
      "title": "Exercises",
      "start_index": 147,
      "end_index": 156,
      "node_id": "0046"
    },
    {
      "title": "Linear Models for Regression",
      "start_index": 157,
      "end_index": 158,
      "nodes": [
        {
          "title": "Linear Basis Function Models",
          "start_index": 158,
          "end_index": 160,
          "nodes": [
            {
              "title": "Maximum likelihood and least squares",
              "start_index": 160,
              "end_index": 163,
              "node_id": "0049"
            },
            {
              "title": "Geometry of least squares",
              "start_index": 163,
              "end_index": 163,
              "node_id": "0050"
            },
            {
              "title": "Sequential learning",
              "start_index": 163,
              "end_index": 164,
              "node_id": "0051"
            },
            {
              "title": "Regularized least squares",
              "start_index": 164,
              "end_index": 166,
              "node_id": "0052"
            },
            {
              "title": "Multiple outputs",
              "start_index": 166,
              "end_index": 167,
              "node_id": "0053"
            }
          ],
          "node_id": "0048"
        },
        {
          "title": "The Bias-Variance Decomposition",
          "start_index": 167,
          "end_index": 172,
          "node_id": "0054"
        },
        {
          "title": "Bayesian Linear Regression",
          "start_index": 172,
          "end_index": 172,
          "nodes": [
            {
              "title": "Parameter distribution",
              "start_index": 172,
              "end_index": 176,
              "node_id": "0056"
            },
            {
              "title": "Predictive distribution",
              "start_index": 176,
              "end_index": 179,
              "node_id": "0057"
            },
            {
              "title": "Equivalent kernel",
              "start_index": 179,
              "end_index": 181,
              "node_id": "0058"
            }
          ],
          "node_id": "0055"
        },
        {
          "title": "Bayesian Model Comparison",
          "start_index": 181,
          "end_index": 185,
          "node_id": "0059"
        },
        {
          "title": "The Evidence Approximation",
          "start_index": 185,
          "end_index": 186,
          "nodes": [
            {
              "title": "Evaluation of the evidence function",
              "start_index": 186,
              "end_index": 188,
              "node_id": "0061"
            },
            {
              "title": "Maximizing the evidence function",
              "start_index": 188,
              "end_index": 190,
              "node_id": "0062"
            },
            {
              "title": "Effective number of parameters",
              "start_index": 190,
              "end_index": 192,
              "node_id": "0063"
            }
          ],
          "node_id": "0060"
        },
        {
          "title": "Limitations of Fixed Basis Functions",
          "start_index": 192,
          "end_index": 193,
          "node_id": "0064"
        }
      ],
      "node_id": "0047"
    },
    {
      "title": "Exercises",
      "start_index": 193,
      "end_index": 199,
      "node_id": "0065"
    },
    {
      "title": "Linear Models for Classification",
      "start_index": 199,
      "end_index": 201,
      "nodes": [
        {
          "title": "Discriminant Functions",
          "start_index": 201,
          "end_index": 201,
          "nodes": [
            {
              "title": "Two classes",
              "start_index": 201,
              "end_index": 202,
              "node_id": "0068"
            },
            {
              "title": "Multiple classes",
              "start_index": 202,
              "end_index": 204,
              "node_id": "0069"
            },
            {
              "title": "Least squares for classification",
              "start_index": 204,
              "end_index": 206,
              "node_id": "0070"
            },
            {
              "title": "Fisher\u2019s linear discriminant",
              "start_index": 206,
              "end_index": 209,
              "node_id": "0071"
            },
            {
              "title": "Relation to least squares",
              "start_index": 209,
              "end_index": 211,
              "node_id": "0072"
            },
            {
              "title": "Fisher\u2019s discriminant for multiple classes",
              "start_index": 211,
              "end_index": 212,
              "node_id": "0073"
            },
            {
              "title": "The perceptron algorithm",
              "start_index": 212,
              "end_index": 216,
              "node_id": "0074"
            }
          ],
          "node_id": "0067"
        },
        {
          "title": "Probabilistic Generative Models",
          "start_index": 216,
          "end_index": 218,
          "nodes": [
            {
              "title": "Continuous inputs",
              "start_index": 218,
              "end_index": 220,
              "node_id": "0076"
            },
            {
              "title": "Maximum likelihood solution",
              "start_index": 220,
              "end_index": 222,
              "node_id": "0077"
            },
            {
              "title": "Discrete features",
              "start_index": 222,
              "end_index": 222,
              "node_id": "0078"
            },
            {
              "title": "Exponential family",
              "start_index": 222,
              "end_index": 223,
              "node_id": "0079"
            }
          ],
          "node_id": "0075"
        },
        {
          "title": "Probabilistic Discriminative Models",
          "start_index": 223,
          "end_index": 224,
          "nodes": [
            {
              "title": "Fixed basis functions",
              "start_index": 224,
              "end_index": 225,
              "node_id": "0081"
            },
            {
              "title": "Logistic regression",
              "start_index": 225,
              "end_index": 227,
              "node_id": "0082"
            },
            {
              "title": "Iterative reweighted least squares",
              "start_index": 227,
              "end_index": 229,
              "node_id": "0083"
            },
            {
              "title": "Multiclass logistic regression",
              "start_index": 229,
              "end_index": 230,
              "node_id": "0084"
            },
            {
              "title": "Probit regression",
              "start_index": 230,
              "end_index": 232,
              "node_id": "0085"
            },
            {
              "title": "Canonical link functions",
              "start_index": 232,
              "end_index": 232,
              "node_id": "0086"
            }
          ],
          "node_id": "0080"
        },
        {
          "title": "The Laplace Approximation",
          "start_index": 233,
          "end_index": 236,
          "nodes": [
            {
              "title": "Model comparison and BIC",
              "start_index": 236,
              "end_index": 237,
              "node_id": "0088"
            }
          ],
          "node_id": "0087"
        },
        {
          "title": "Bayesian Logistic Regression",
          "start_index": 237,
          "end_index": 237,
          "nodes": [
            {
              "title": "Laplace approximation",
              "start_index": 237,
              "end_index": 238,
              "node_id": "0090"
            },
            {
              "title": "Predictive distribution",
              "start_index": 238,
              "end_index": 240,
              "node_id": "0091"
            }
          ],
          "node_id": "0089"
        }
      ],
      "node_id": "0066"
    },
    {
      "title": "Exercises",
      "start_index": 240,
      "end_index": 245,
      "node_id": "0092"
    },
    {
      "title": "Neural Networks",
      "start_index": 245,
      "end_index": 247,
      "nodes": [
        {
          "title": "Feed-forward Network Functions",
          "start_index": 247,
          "end_index": 251,
          "nodes": [
            {
              "title": "Weight-space symmetries",
              "start_index": 251,
              "end_index": 252,
              "node_id": "0095"
            }
          ],
          "node_id": "0094"
        },
        {
          "title": "Network Training",
          "start_index": 252,
          "end_index": 256,
          "nodes": [
            {
              "title": "Parameter optimization",
              "start_index": 256,
              "end_index": 257,
              "node_id": "0097"
            },
            {
              "title": "Local quadratic approximation",
              "start_index": 257,
              "end_index": 259,
              "node_id": "0098"
            },
            {
              "title": "Use of gradient information",
              "start_index": 259,
              "end_index": 260,
              "node_id": "0099"
            },
            {
              "title": "Gradient descent optimization",
              "start_index": 260,
              "end_index": 261,
              "node_id": "0100"
            }
          ],
          "node_id": "0096"
        },
        {
          "title": "Error Backpropagation",
          "start_index": 261,
          "end_index": 262,
          "nodes": [
            {
              "title": "Evaluation of error-function derivatives",
              "start_index": 262,
              "end_index": 265,
              "node_id": "0102"
            },
            {
              "title": "A simple example",
              "start_index": 265,
              "end_index": 266,
              "node_id": "0103"
            },
            {
              "title": "Efficiency of backpropagation",
              "start_index": 266,
              "end_index": 267,
              "node_id": "0104"
            },
            {
              "title": "The Jacobian matrix",
              "start_index": 267,
              "end_index": 269,
              "node_id": "0105"
            }
          ],
          "node_id": "0101"
        },
        {
          "title": "The Hessian Matrix",
          "start_index": 269,
          "end_index": 270,
          "nodes": [
            {
              "title": "Diagonal approximation",
              "start_index": 270,
              "end_index": 271,
              "node_id": "0107"
            },
            {
              "title": "Outer product approximation",
              "start_index": 271,
              "end_index": 272,
              "node_id": "0108"
            },
            {
              "title": "Inverse Hessian",
              "start_index": 272,
              "end_index": 272,
              "node_id": "0109"
            },
            {
              "title": "Finite differences",
              "start_index": 272,
              "end_index": 273,
              "node_id": "0110"
            },
            {
              "title": "Exact evaluation of the Hessian",
              "start_index": 273,
              "end_index": 274,
              "node_id": "0111"
            },
            {
              "title": "Fast multiplication by the Hessian",
              "start_index": 274,
              "end_index": 276,
              "node_id": "0112"
            }
          ],
          "node_id": "0106"
        },
        {
          "title": "Regularization in Neural Networks",
          "start_index": 276,
          "end_index": 277,
          "nodes": [
            {
              "title": "Consistent Gaussian priors",
              "start_index": 277,
              "end_index": 279,
              "node_id": "0114"
            },
            {
              "title": "Early stopping",
              "start_index": 279,
              "end_index": 281,
              "node_id": "0115"
            },
            {
              "title": "Invariances",
              "start_index": 281,
              "end_index": 283,
              "node_id": "0116"
            },
            {
              "title": "Tangent propagation",
              "start_index": 283,
              "end_index": 285,
              "node_id": "0117"
            },
            {
              "title": "Training with transformed data",
              "start_index": 285,
              "end_index": 287,
              "node_id": "0118"
            },
            {
              "title": "Convolutional networks",
              "start_index": 287,
              "end_index": 289,
              "node_id": "0119"
            },
            {
              "title": "Soft weight sharing",
              "start_index": 289,
              "end_index": 292,
              "node_id": "0120"
            }
          ],
          "node_id": "0113"
        },
        {
          "title": "Mixture Density Networks",
          "start_index": 292,
          "end_index": 297,
          "node_id": "0121"
        },
        {
          "title": "Bayesian Neural Networks",
          "start_index": 297,
          "end_index": 298,
          "nodes": [
            {
              "title": "Posterior parameter distribution",
              "start_index": 298,
              "end_index": 300,
              "node_id": "0123"
            },
            {
              "title": "Hyperparameter optimization",
              "start_index": 300,
              "end_index": 301,
              "node_id": "0124"
            },
            {
              "title": "Bayesian neural networks for classification",
              "start_index": 301,
              "end_index": 304,
              "node_id": "0125"
            }
          ],
          "node_id": "0122"
        }
      ],
      "node_id": "0093"
    },
    {
      "title": "Exercises",
      "start_index": 304,
      "end_index": 311,
      "node_id": "0126"
    },
    {
      "title": "Kernel Methods",
      "start_index": 311,
      "end_index": 313,
      "nodes": [
        {
          "title": "Dual Representations",
          "start_index": 313,
          "end_index": 314,
          "node_id": "0128"
        },
        {
          "title": "Constructing Kernels",
          "start_index": 314,
          "end_index": 319,
          "node_id": "0129"
        },
        {
          "title": "Radial Basis Function Networks",
          "start_index": 319,
          "end_index": 321,
          "nodes": [
            {
              "title": "Nadaraya-Watson model",
              "start_index": 321,
              "end_index": 323,
              "node_id": "0131"
            }
          ],
          "node_id": "0130"
        },
        {
          "title": "Gaussian Processes",
          "start_index": 323,
          "end_index": 324,
          "nodes": [
            {
              "title": "Linear regression revisited",
              "start_index": 324,
              "end_index": 326,
              "node_id": "0133"
            },
            {
              "title": "Gaussian processes for regression",
              "start_index": 326,
              "end_index": 331,
              "node_id": "0134"
            },
            {
              "title": "Learning the hyperparameters",
              "start_index": 331,
              "end_index": 332,
              "node_id": "0135"
            },
            {
              "title": "Automatic relevance determination",
              "start_index": 332,
              "end_index": 333,
              "node_id": "0136"
            },
            {
              "title": "Gaussian processes for classification",
              "start_index": 333,
              "end_index": 335,
              "node_id": "0137"
            },
            {
              "title": "Laplace approximation",
              "start_index": 335,
              "end_index": 339,
              "node_id": "0138"
            },
            {
              "title": "Connection to neural networks",
              "start_index": 339,
              "end_index": 340,
              "node_id": "0139"
            }
          ],
          "node_id": "0132"
        }
      ],
      "node_id": "0127"
    },
    {
      "title": "Exercises",
      "start_index": 340,
      "end_index": 344,
      "node_id": "0140"
    },
    {
      "title": "Sparse Kernel Machines",
      "start_index": 345,
      "end_index": 346,
      "nodes": [
        {
          "title": "Maximum Margin Classifiers",
          "start_index": 346,
          "end_index": 351,
          "nodes": [
            {
              "title": "Overlapping class distributions",
              "start_index": 351,
              "end_index": 356,
              "node_id": "0143"
            },
            {
              "title": "Relation to logistic regression",
              "start_index": 356,
              "end_index": 358,
              "node_id": "0144"
            },
            {
              "title": "Multiclass SVMs",
              "start_index": 358,
              "end_index": 359,
              "node_id": "0145"
            },
            {
              "title": "SVMs for regression",
              "start_index": 359,
              "end_index": 364,
              "node_id": "0146"
            },
            {
              "title": "Computational learning theory",
              "start_index": 364,
              "end_index": 365,
              "node_id": "0147"
            }
          ],
          "node_id": "0142"
        },
        {
          "title": "Relevance Vector Machines",
          "start_index": 365,
          "end_index": 365,
          "nodes": [
            {
              "title": "RVM for regression",
              "start_index": 365,
              "end_index": 369,
              "node_id": "0149"
            },
            {
              "title": "Analysis of sparsity",
              "start_index": 369,
              "end_index": 373,
              "node_id": "0150"
            },
            {
              "title": "RVM for classification",
              "start_index": 373,
              "end_index": 377,
              "node_id": "0151"
            }
          ],
          "node_id": "0148"
        }
      ],
      "node_id": "0141"
    },
    {
      "title": "Exercises",
      "start_index": 377,
      "end_index": 379,
      "node_id": "0152"
    },
    {
      "title": "Graphical Models",
      "start_index": 379,
      "end_index": 380,
      "nodes": [
        {
          "title": "Bayesian Networks",
          "start_index": 380,
          "end_index": 382,
          "nodes": [
            {
              "title": "Example: Polynomial regression",
              "start_index": 382,
              "end_index": 385,
              "node_id": "0155"
            },
            {
              "title": "Generative models",
              "start_index": 385,
              "end_index": 386,
              "node_id": "0156"
            },
            {
              "title": "Discrete variables",
              "start_index": 386,
              "end_index": 390,
              "node_id": "0157"
            },
            {
              "title": "Linear-Gaussian models",
              "start_index": 390,
              "end_index": 392,
              "node_id": "0158"
            }
          ],
          "node_id": "0154"
        },
        {
          "title": "Conditional Independence",
          "start_index": 392,
          "end_index": 393,
          "nodes": [
            {
              "title": "Three example graphs",
              "start_index": 393,
              "end_index": 398,
              "node_id": "0160"
            },
            {
              "title": "D-separation",
              "start_index": 398,
              "end_index": 403,
              "node_id": "0161"
            }
          ],
          "node_id": "0159"
        },
        {
          "title": "Markov Random Fields",
          "start_index": 403,
          "end_index": 403,
          "nodes": [
            {
              "title": "Conditional independence properties",
              "start_index": 403,
              "end_index": 404,
              "node_id": "0163"
            },
            {
              "title": "Factorization properties",
              "start_index": 404,
              "end_index": 407,
              "node_id": "0164"
            },
            {
              "title": "Illustration: Image de-noising",
              "start_index": 407,
              "end_index": 410,
              "node_id": "0165"
            },
            {
              "title": "Relation to directed graphs",
              "start_index": 410,
              "end_index": 413,
              "node_id": "0166"
            }
          ],
          "node_id": "0162"
        },
        {
          "title": "Inference in Graphical Models",
          "start_index": 413,
          "end_index": 414,
          "nodes": [
            {
              "title": "Inference on a chain",
              "start_index": 414,
              "end_index": 418,
              "node_id": "0168"
            },
            {
              "title": "Trees",
              "start_index": 418,
              "end_index": 419,
              "node_id": "0169"
            },
            {
              "title": "Factor graphs",
              "start_index": 419,
              "end_index": 422,
              "node_id": "0170"
            },
            {
              "title": "The sum-product algorithm",
              "start_index": 422,
              "end_index": 431,
              "node_id": "0171"
            },
            {
              "title": "The max-sum algorithm",
              "start_index": 431,
              "end_index": 436,
              "node_id": "0172"
            },
            {
              "title": "Exact inference in general graphs",
              "start_index": 436,
              "end_index": 437,
              "node_id": "0173"
            },
            {
              "title": "Loopy belief propagation",
              "start_index": 437,
              "end_index": 438,
              "node_id": "0174"
            },
            {
              "title": "Learning the graph structure",
              "start_index": 438,
              "end_index": 438,
              "node_id": "0175"
            }
          ],
          "node_id": "0167"
        }
      ],
      "node_id": "0153"
    },
    {
      "title": "Exercises",
      "start_index": 438,
      "end_index": 443,
      "node_id": "0176"
    },
    {
      "title": "Mixture Models and EM",
      "start_index": 443,
      "end_index": 444,
      "nodes": [
        {
          "title": "K-means Clustering",
          "start_index": 444,
          "end_index": 448,
          "nodes": [
            {
              "title": "Image segmentation and compression",
              "start_index": 448,
              "end_index": 450,
              "node_id": "0179"
            }
          ],
          "node_id": "0178"
        },
        {
          "title": "Mixtures of Gaussians",
          "start_index": 450,
          "end_index": 452,
          "nodes": [
            {
              "title": "Maximum likelihood",
              "start_index": 452,
              "end_index": 455,
              "node_id": "0181"
            },
            {
              "title": "EM for Gaussian mixtures",
              "start_index": 455,
              "end_index": 459,
              "node_id": "0182"
            }
          ],
          "node_id": "0180"
        },
        {
          "title": "An Alternative View of EM",
          "start_index": 459,
          "end_index": 461,
          "nodes": [
            {
              "title": "Gaussian mixtures revisited",
              "start_index": 461,
              "end_index": 463,
              "node_id": "0184"
            },
            {
              "title": "Relation to K-means",
              "start_index": 463,
              "end_index": 464,
              "node_id": "0185"
            },
            {
              "title": "Mixtures of Bernoulli distributions",
              "start_index": 464,
              "end_index": 468,
              "node_id": "0186"
            },
            {
              "title": "EM for Bayesian linear regression",
              "start_index": 468,
              "end_index": 470,
              "node_id": "0187"
            }
          ],
          "node_id": "0183"
        },
        {
          "title": "The EM Algorithm in General",
          "start_index": 470,
          "end_index": 475,
          "node_id": "0188"
        }
      ],
      "node_id": "0177"
    },
    {
      "title": "Exercises",
      "start_index": 475,
      "end_index": 480,
      "node_id": "0189"
    },
    {
      "title": "Approximate Inference",
      "start_index": 481,
      "end_index": 482,
      "nodes": [
        {
          "title": "Variational Inference",
          "start_index": 482,
          "end_index": 484,
          "nodes": [
            {
              "title": "Factorized distributions",
              "start_index": 484,
              "end_index": 486,
              "node_id": "0192"
            },
            {
              "title": "Properties of factorized approximations",
              "start_index": 486,
              "end_index": 490,
              "node_id": "0193"
            },
            {
              "title": "Example: The univariate Gaussian",
              "start_index": 490,
              "end_index": 493,
              "node_id": "0194"
            },
            {
              "title": "Model comparison",
              "start_index": 493,
              "end_index": 494,
              "node_id": "0195"
            }
          ],
          "node_id": "0191"
        },
        {
          "title": "Illustration: Variational Mixture of Gaussians",
          "start_index": 494,
          "end_index": 495,
          "nodes": [
            {
              "title": "Variational distribution",
              "start_index": 495,
              "end_index": 501,
              "node_id": "0197"
            },
            {
              "title": "Variational lower bound",
              "start_index": 501,
              "end_index": 502,
              "node_id": "0198"
            },
            {
              "title": "Predictive density",
              "start_index": 502,
              "end_index": 503,
              "node_id": "0199"
            },
            {
              "title": "Determining the number of components",
              "start_index": 503,
              "end_index": 505,
              "node_id": "0200"
            },
            {
              "title": "Induced factorizations",
              "start_index": 505,
              "end_index": 506,
              "node_id": "0201"
            }
          ],
          "node_id": "0196"
        },
        {
          "title": "Variational Linear Regression",
          "start_index": 506,
          "end_index": 506,
          "nodes": [
            {
              "title": "Variational distribution",
              "start_index": 506,
              "end_index": 508,
              "node_id": "0203"
            },
            {
              "title": "Predictive distribution",
              "start_index": 508,
              "end_index": 509,
              "node_id": "0204"
            },
            {
              "title": "Lower bound",
              "start_index": 509,
              "end_index": 510,
              "node_id": "0205"
            }
          ],
          "node_id": "0202"
        },
        {
          "title": "Exponential Family Distributions",
          "start_index": 510,
          "end_index": 511,
          "nodes": [
            {
              "title": "Variational message passing",
              "start_index": 511,
              "end_index": 512,
              "node_id": "0207"
            }
          ],
          "node_id": "0206"
        },
        {
          "title": "Local Variational Methods",
          "start_index": 513,
          "end_index": 518,
          "node_id": "0208"
        },
        {
          "title": "Variational Logistic Regression",
          "start_index": 518,
          "end_index": 518,
          "nodes": [
            {
              "title": "Variational posterior distribution",
              "start_index": 518,
              "end_index": 520,
              "node_id": "0210"
            },
            {
              "title": "Optimizing the variational parameters",
              "start_index": 520,
              "end_index": 522,
              "node_id": "0211"
            },
            {
              "title": "Inference of hyperparameters",
              "start_index": 522,
              "end_index": 525,
              "node_id": "0212"
            }
          ],
          "node_id": "0209"
        },
        {
          "title": "Expectation Propagation",
          "start_index": 525,
          "end_index": 531,
          "nodes": [
            {
              "title": "Example: The clutter problem",
              "start_index": 531,
              "end_index": 533,
              "node_id": "0214"
            },
            {
              "title": "Expectation propagation on graphs",
              "start_index": 533,
              "end_index": 537,
              "node_id": "0215"
            }
          ],
          "node_id": "0213"
        }
      ],
      "node_id": "0190"
    },
    {
      "title": "Exercises",
      "start_index": 537,
      "end_index": 542,
      "node_id": "0216"
    },
    {
      "title": "Sampling Methods",
      "start_index": 543,
      "end_index": 546,
      "nodes": [
        {
          "title": "Basic Sampling Algorithms",
          "start_index": 546,
          "end_index": 546,
          "nodes": [
            {
              "title": "Standard distributions",
              "start_index": 546,
              "end_index": 548,
              "node_id": "0219"
            },
            {
              "title": "Rejection sampling",
              "start_index": 548,
              "end_index": 550,
              "node_id": "0220"
            },
            {
              "title": "Adaptive rejection sampling",
              "start_index": 550,
              "end_index": 552,
              "node_id": "0221"
            },
            {
              "title": "Importance sampling",
              "start_index": 552,
              "end_index": 554,
              "node_id": "0222"
            },
            {
              "title": "Sampling-importance-resampling",
              "start_index": 554,
              "end_index": 556,
              "node_id": "0223"
            },
            {
              "title": "Sampling and the EM algorithm",
              "start_index": 556,
              "end_index": 556,
              "node_id": "0224"
            }
          ],
          "node_id": "0218"
        },
        {
          "title": "Markov Chain Monte Carlo",
          "start_index": 557,
          "end_index": 559,
          "nodes": [
            {
              "title": "Markov chains",
              "start_index": 559,
              "end_index": 561,
              "node_id": "0226"
            },
            {
              "title": "The Metropolis-Hastings algorithm",
              "start_index": 561,
              "end_index": 562,
              "node_id": "0227"
            }
          ],
          "node_id": "0225"
        },
        {
          "title": "Gibbs Sampling",
          "start_index": 562,
          "end_index": 566,
          "node_id": "0228"
        },
        {
          "title": "Slice Sampling",
          "start_index": 566,
          "end_index": 568,
          "node_id": "0229"
        },
        {
          "title": "The Hybrid Monte Carlo Algorithm",
          "start_index": 568,
          "end_index": 568,
          "nodes": [
            {
              "title": "Dynamical systems",
              "start_index": 568,
              "end_index": 572,
              "node_id": "0231"
            },
            {
              "title": "Hybrid Monte Carlo",
              "start_index": 572,
              "end_index": 574,
              "node_id": "0232"
            }
          ],
          "node_id": "0230"
        },
        {
          "title": "Estimating the Partition Function",
          "start_index": 574,
          "end_index": 576,
          "node_id": "0233"
        }
      ],
      "node_id": "0217"
    },
    {
      "title": "Exercises",
      "start_index": 576,
      "end_index": 579,
      "node_id": "0234"
    },
    {
      "title": "Continuous Latent Variables",
      "start_index": 579,
      "end_index": 581,
      "nodes": [
        {
          "title": "Principal Component Analysis",
          "start_index": 581,
          "end_index": 581,
          "nodes": [
            {
              "title": "Maximum variance formulation",
              "start_index": 581,
              "end_index": 583,
              "node_id": "0237"
            },
            {
              "title": "Minimum-error formulation",
              "start_index": 583,
              "end_index": 585,
              "node_id": "0238"
            },
            {
              "title": "Applications of PCA",
              "start_index": 585,
              "end_index": 589,
              "node_id": "0239"
            },
            {
              "title": "PCA for high-dimensional data",
              "start_index": 589,
              "end_index": 590,
              "node_id": "0240"
            }
          ],
          "node_id": "0236"
        },
        {
          "title": "Probabilistic PCA",
          "start_index": 590,
          "end_index": 594,
          "nodes": [
            {
              "title": "Maximum likelihood PCA",
              "start_index": 594,
              "end_index": 597,
              "node_id": "0242"
            },
            {
              "title": "EM algorithm for PCA",
              "start_index": 597,
              "end_index": 600,
              "node_id": "0243"
            },
            {
              "title": "Bayesian PCA",
              "start_index": 600,
              "end_index": 603,
              "node_id": "0244"
            },
            {
              "title": "Factor analysis",
              "start_index": 603,
              "end_index": 606,
              "node_id": "0245"
            }
          ],
          "node_id": "0241"
        },
        {
          "title": "Kernel PCA",
          "start_index": 606,
          "end_index": 610,
          "node_id": "0246"
        },
        {
          "title": "Nonlinear Latent Variable Models",
          "start_index": 611,
          "end_index": 611,
          "nodes": [
            {
              "title": "Independent component analysis",
              "start_index": 611,
              "end_index": 612,
              "node_id": "0248"
            },
            {
              "title": "Autoassociative neural networks",
              "start_index": 612,
              "end_index": 615,
              "node_id": "0249"
            },
            {
              "title": "Modelling nonlinear manifolds",
              "start_index": 615,
              "end_index": 619,
              "node_id": "0250"
            }
          ],
          "node_id": "0247"
        }
      ],
      "node_id": "0235"
    },
    {
      "title": "Exercises",
      "start_index": 619,
      "end_index": 624,
      "node_id": "0251"
    },
    {
      "title": "Sequential Data",
      "start_index": 625,
      "end_index": 627,
      "nodes": [
        {
          "title": "Markov Models",
          "start_index": 627,
          "end_index": 630,
          "node_id": "0253"
        },
        {
          "title": "Hidden Markov Models",
          "start_index": 630,
          "end_index": 635,
          "nodes": [
            {
              "title": "Maximum likelihood for the HMM",
              "start_index": 635,
              "end_index": 638,
              "node_id": "0255"
            },
            {
              "title": "The forward-backward algorithm",
              "start_index": 638,
              "end_index": 645,
              "node_id": "0256"
            },
            {
              "title": "The sum-product algorithm for the HMM",
              "start_index": 645,
              "end_index": 647,
              "node_id": "0257"
            },
            {
              "title": "Scaling factors",
              "start_index": 647,
              "end_index": 649,
              "node_id": "0258"
            },
            {
              "title": "The Viterbi algorithm",
              "start_index": 649,
              "end_index": 651,
              "node_id": "0259"
            },
            {
              "title": "Extensions of the hidden Markov model",
              "start_index": 651,
              "end_index": 655,
              "node_id": "0260"
            }
          ],
          "node_id": "0254"
        },
        {
          "title": "Linear Dynamical Systems",
          "start_index": 655,
          "end_index": 658,
          "nodes": [
            {
              "title": "Inference in LDS",
              "start_index": 658,
              "end_index": 662,
              "node_id": "0262"
            },
            {
              "title": "Learning in LDS",
              "start_index": 662,
              "end_index": 664,
              "node_id": "0263"
            },
            {
              "title": "Extensions of LDS",
              "start_index": 664,
              "end_index": 665,
              "node_id": "0264"
            },
            {
              "title": "Particle filters",
              "start_index": 665,
              "end_index": 666,
              "node_id": "0265"
            }
          ],
          "node_id": "0261"
        }
      ],
      "node_id": "0252"
    },
    {
      "title": "Exercises",
      "start_index": 666,
      "end_index": 672,
      "node_id": "0266"
    },
    {
      "title": "Combining Models",
      "start_index": 673,
      "end_index": 674,
      "nodes": [
        {
          "title": "Bayesian Model Averaging",
          "start_index": 674,
          "end_index": 675,
          "node_id": "0268"
        },
        {
          "title": "Committees",
          "start_index": 675,
          "end_index": 677,
          "node_id": "0269"
        },
        {
          "title": "Boosting",
          "start_index": 677,
          "end_index": 679,
          "nodes": [
            {
              "title": "Minimizing exponential error",
              "start_index": 679,
              "end_index": 681,
              "node_id": "0271"
            },
            {
              "title": "Error functions for boosting",
              "start_index": 681,
              "end_index": 683,
              "node_id": "0272"
            }
          ],
          "node_id": "0270"
        },
        {
          "title": "Tree-based Models",
          "start_index": 683,
          "end_index": 686,
          "node_id": "0273"
        },
        {
          "title": "Conditional Mixture Models",
          "start_index": 686,
          "end_index": 687,
          "nodes": [
            {
              "title": "Mixtures of linear regression models",
              "start_index": 687,
              "end_index": 690,
              "node_id": "0275"
            },
            {
              "title": "Mixtures of logistic models",
              "start_index": 690,
              "end_index": 692,
              "node_id": "0276"
            },
            {
              "title": "Mixtures of experts",
              "start_index": 692,
              "end_index": 694,
              "node_id": "0277"
            }
          ],
          "node_id": "0274"
        }
      ],
      "node_id": "0267"
    },
    {
      "title": "Exercises",
      "start_index": 694,
      "end_index": 696,
      "node_id": "0278"
    },
    {
      "title": "Appendix A Data Sets",
      "start_index": 697,
      "end_index": 704,
      "node_id": "0279"
    },
    {
      "title": "Appendix B Probability Distributions",
      "start_index": 705,
      "end_index": 714,
      "node_id": "0280"
    },
    {
      "title": "Appendix C Properties of Matrices",
      "start_index": 715,
      "end_index": 722,
      "node_id": "0281"
    },
    {
      "title": "Appendix D Calculus of Variations",
      "start_index": 723,
      "end_index": 726,
      "node_id": "0282"
    },
    {
      "title": "Appendix E Lagrange Multipliers",
      "start_index": 727,
      "end_index": 730,
      "node_id": "0283"
    },
    {
      "title": "References",
      "start_index": 731,
      "end_index": 749,
      "node_id": "0284"
    },
    {
      "title": "Index",
      "start_index": 749,
      "end_index": 758,
      "node_id": "0285"
    }
  ]
}
```
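
Each node in the structure above is a plain JSON object with a `title`, an inclusive page range (`start_index`/`end_index`, which appear to be 1-based physical page numbers), an optional `nodes` list of children, and a `node_id`. A minimal sketch of consuming such a file — the `print_toc` helper is illustrative, not part of the package — that renders the tree as an indented table of contents:

```python
import json

def print_toc(nodes, depth=0):
    """Recursively print a PageIndex tree as an indented table of contents."""
    for node in nodes:
        print(f"{'  ' * depth}[{node['node_id']}] {node['title']} "
              f"(pp. {node['start_index']}-{node['end_index']})")
        # Children, when present, live under the optional "nodes" key.
        print_toc(node.get("nodes", []), depth + 1)

with open("results/PRML_structure.json") as f:
    tree = json.load(f)

print_toc(tree["structure"])
```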

## /results/Regulation Best Interest_Interpretive release_structure.json

```json path="/results/Regulation Best Interest_Interpretive release_structure.json" 
{
  "doc_name": "Regulation Best Interest_Interpretive release.pdf",
  "doc_description": "A detailed analysis of the SEC's interpretation of the \"solely incidental\" prong of the broker-dealer exclusion under the Investment Advisers Act of 1940, including its historical context, application guidance, economic implications, and regulatory considerations.",
  "structure": [
    {
      "title": "Preface",
      "start_index": 1,
      "end_index": 2,
      "node_id": "0000",
      "summary": "The partial document outlines an interpretation by the Securities and Exchange Commission (SEC) regarding the \"solely incidental\" prong of the broker-dealer exclusion under the Investment Advisers Act of 1940. It clarifies that brokers or dealers providing advisory services that are incidental to their primary business and for which they receive no special compensation are excluded from the definition of \"investment adviser\" under the Act. The document includes a historical and legislative context, the scope of the \"solely incidental\" prong, guidance on its application, and economic considerations related to the interpretation. It also provides contact information for further inquiries and specifies the effective date of the interpretation as July 12, 2019."
    },
    {
      "title": "Introduction",
      "start_index": 2,
      "end_index": 6,
      "node_id": "0001",
      "summary": "The partial document discusses the regulation of investment advisers under the Advisers Act, specifically focusing on the \"broker-dealer exclusion,\" which exempts brokers and dealers from being classified as investment advisers under certain conditions. Key points include:\n\n1. **Introduction to the Advisers Act**: Overview of the regulation of investment advisers and the broker-dealer exclusion, which applies when advisory services are \"solely incidental\" to brokerage business and no special compensation is received.\n\n2. **Historical Context and Legislative History**: Examination of the historical practices of broker-dealers providing investment advice, distinguishing between auxiliary advice as part of brokerage services and separate advisory services.\n\n3. **Interpretation of the Solely Incidental Prong**: Clarification of the \"solely incidental\" condition of the broker-dealer exclusion, including its application to activities like investment discretion and account monitoring.\n\n4. **Economic Considerations**: Discussion of the potential economic effects of the interpretation and application of the broker-dealer exclusion.\n\n5. **Regulatory Developments**: Reference to the Commission's 2018 proposals, including Regulation Best Interest (Reg. BI), the Proposed Fiduciary Interpretation, and the Relationship Summary Proposal, aimed at enhancing standards of conduct and investor understanding.\n\n6. **Public Comments and Feedback**: Summary of public comments on the scope and interpretation of the broker-dealer exclusion, highlighting disagreements and requests for clarification on the \"solely incidental\" prong.\n\n7. **Adoption of Interpretation**: The Commission's adoption of an interpretation to confirm and clarify its position on the \"solely incidental\" prong, complementing related rules and forms to improve investor understanding of broker-dealer and adviser relationships."
    },
    {
      "title": "Interpretation and Application",
      "start_index": 6,
      "end_index": 8,
      "nodes": [
        {
          "title": "Historical Context and Legislative History",
          "start_index": 8,
          "end_index": 10,
          "node_id": "0003",
          "summary": "The partial document discusses the historical context and legislative development of the Investment Advisers Act of 1940. It highlights the findings of a congressional study conducted by the SEC between 1935 and 1939, which identified issues with distinguishing legitimate investment counselors from unregulated \"tipster\" organizations and problems in the organization and operation of investment counsel institutions. The document explains how these findings led to the passage of the Advisers Act, which broadly defined \"investment adviser\" and established regulatory oversight for those providing investment advice for compensation. It also addresses the exclusion of certain professionals, such as broker-dealers, from the definition of \"investment adviser\" if their advice is incidental to their primary business and not specially compensated. Additionally, the document explores the scope of the \"solely incidental\" prong of the broker-dealer exclusion, referencing interpretations and rules by the SEC, including a 2005 rule regarding fee-based brokerage accounts."
        },
        {
          "title": "Scope of the Solely Incidental Prong of the Broker-Dealer Exclusion",
          "start_index": 10,
          "end_index": 14,
          "node_id": "0004",
          "summary": "The partial document discusses the \"broker-dealer exclusion\" under the Investment Advisers Act, specifically focusing on the \"solely incidental\" prong. It examines the scope of this exclusion, emphasizing that investment advice provided by broker-dealers is considered \"solely incidental\" if it is connected to and reasonably related to their primary business of effecting securities transactions. The document references historical interpretations, court rulings (e.g., Financial Planning Association v. SEC and Thomas v. Metropolitan Life Insurance Company), and legislative history to clarify this standard. It highlights that the frequency or importance of advice does not determine whether it meets the \"solely incidental\" standard, but rather its relationship to the broker-dealer's primary business. The document also provides guidance on applying this interpretation to specific practices, such as exercising investment discretion and account monitoring, noting that certain discretionary activities may fall outside the scope of the exclusion."
        },
        {
          "title": "Guidance on Applying the Interpretation of the Solely Incidental Prong",
          "start_index": 14,
          "end_index": 22,
          "node_id": "0005",
          "summary": "The partial document provides guidance on the application of the \"solely incidental\" prong of the broker-dealer exclusion under the Advisers Act. It focuses on two key areas: (1) the exercise of investment discretion by broker-dealers over customer accounts and (2) account monitoring. The document discusses the Commission's interpretation that unlimited investment discretion is not \"solely incidental\" to a broker-dealer's business, as it indicates a primarily advisory relationship. However, temporary or limited discretion in specific scenarios (e.g., cash management, tax-loss sales, or margin requirements) may be consistent with the \"solely incidental\" prong. It also addresses account monitoring, stating that agreed-upon periodic monitoring for buy, sell, or hold recommendations may align with the broker-dealer exclusion, while continuous monitoring or advisory-like services would not. The document includes examples, refinements to prior interpretations, and considerations for broker-dealers to adopt policies ensuring compliance. It concludes with economic considerations, highlighting the potential impact on broker-dealers, customers, and the financial advice market."
        }
      ],
      "node_id": "0002",
      "summary": "The partial document discusses the historical context and legislative history of the Advisers Act of 1940, focusing on the roles of broker-dealers in providing investment advice. It highlights two distinct ways broker-dealers offered advice: as part of traditional brokerage services with fixed commissions and as separate advisory services for a fee. The document examines the concept of \"brokerage house advice,\" detailing the types of information and services provided, such as market analyses, tax information, and investment recommendations. It also references a congressional study conducted between 1935 and 1939, which identified issues with distinguishing legitimate investment counselors from \"tipster\" organizations and problems in the organization and operation of investment counsel institutions. These findings led to the enactment of the Advisers Act, which broadly defined \"investment adviser\" to regulate those providing investment advice for compensation. The document also references various reports, hearings, and literature that informed the development of the Act."
    },
    {
      "title": "Economic Considerations",
      "start_index": 22,
      "end_index": 22,
      "nodes": [
        {
          "title": "Background",
          "start_index": 22,
          "end_index": 23,
          "node_id": "0007",
          "summary": "The partial document discusses the U.S. Securities and Exchange Commission's (SEC) interpretation of the \"solely incidental\" prong of the broker-dealer exclusion, clarifying its understanding without creating new legal obligations. It examines the potential economic effects of this interpretation on broker-dealers, their associated persons, customers, and the broader financial advice market. The document provides background data on broker-dealers, including their assets, customer accounts, and dual registration as investment advisers. It highlights compliance costs for broker-dealers to align with the interpretation and notes the limited circumstances under which broker-dealers exercise temporary or limited investment discretion. The document also references the lack of data received during the Reg. BI Proposal to analyze the economic impact further."
        },
        {
          "title": "Potential Economic Effects",
          "start_index": 23,
          "end_index": 28,
          "node_id": "0008",
          "summary": "The partial document discusses the economic effects and regulatory implications of the SEC's interpretation of the \"solely incidental\" prong of the broker-dealer exclusion from the definition of an investment adviser. Key points include:\n\n1. **Compliance Costs**: Broker-dealers currently incur costs to align their practices with the \"solely incidental\" prong, and the interpretation may lead to additional costs for evaluating and adjusting practices.\n\n2. **Impact on Broker-Dealer Practices**: Broker-dealers providing advisory services beyond the scope of the interpretation may need to adjust their practices, potentially resulting in reduced services, loss of customers, or a shift to advisory accounts.\n\n3. **Market Effects**: The interpretation could lead to decreased competition, increased fees, and a diminished number of broker-dealers offering commission-based services. It may also shift demand from broker-dealers to investment advisers.\n\n4. **Regulatory Adjustments**: Broker-dealers may choose to register as investment advisers, incurring new compliance costs, or migrate customers to advisory accounts of affiliates.\n\n5. **Potential Benefits**: Some broker-dealers may expand limited discretionary services or monitoring activities, benefiting investors with more efficient access to these services.\n\n6. **Regulatory Arbitrage Risks**: The interpretation raises concerns about regulatory arbitrage, though these risks may be mitigated by enhanced standards of conduct for broker-dealers.\n\n7. **Amendments to Regulations**: The document includes amendments to the Code of Federal Regulations, adding an interpretive release regarding the \"solely incidental\" prong, dated June 5, 2019."
        }
      ],
      "node_id": "0006",
      "summary": "The partial document discusses the SEC's interpretation of the \"solely incidental\" prong of the broker-dealer exclusion, clarifying that it does not impose new legal obligations but may have economic implications if broker-dealer practices deviate from this interpretation. It provides background on the potential effects on broker-dealers, their associated persons, customers, and the broader financial advice market. The document includes data on the number of registered broker-dealers, their customer accounts, total assets, and the prevalence of dual registrants (firms registered as both broker-dealers and investment advisers) as of December 2018."
    }
  ]
}
```
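
Unlike the PRML result above, this structure carries a `doc_description` plus a `summary` on every node. A hedged sketch, using a hypothetical `flatten` helper, of turning the tree into flat records that an LLM could reason over when picking the most relevant `node_id`s for a query:

```python
import json

def flatten(nodes, path=()):
    """Yield one record per node, keeping the title path for context."""
    for node in nodes:
        here = path + (node["title"],)
        yield {
            "node_id": node["node_id"],
            "path": " > ".join(here),
            "pages": (node["start_index"], node["end_index"]),
            "summary": node.get("summary", ""),
        }
        yield from flatten(node.get("nodes", []), here)

with open("results/Regulation Best Interest_Interpretive release_structure.json") as f:
    doc = json.load(f)

records = list(flatten(doc["structure"]))
# `records` can now be serialized into a prompt so the model can reason over
# titles and summaries, returning node_ids whose pages should be retrieved.
```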

## /results/earthmover_structure.json

```json path="/results/earthmover_structure.json" 
{
  "doc_name": "earthmover.pdf",
  "structure": [
    {
      "title": "Earth Mover\u2019s Distance based Similarity Search at Scale",
      "start_index": 1,
      "end_index": 1,
      "node_id": "0000"
    },
    {
      "title": "ABSTRACT",
      "start_index": 1,
      "end_index": 1,
      "node_id": "0001"
    },
    {
      "title": "INTRODUCTION",
      "start_index": 1,
      "end_index": 2,
      "node_id": "0002"
    },
    {
      "title": "PRELIMINARIES",
      "start_index": 2,
      "end_index": 2,
      "nodes": [
        {
          "title": "Computing the EMD",
          "start_index": 3,
          "end_index": 3,
          "node_id": "0004"
        },
        {
          "title": "Filter-and-Refinement Framework",
          "start_index": 3,
          "end_index": 4,
          "node_id": "0005"
        }
      ],
      "node_id": "0003"
    },
    {
      "title": "SCALING UP SSP",
      "start_index": 4,
      "end_index": 5,
      "node_id": "0006"
    },
    {
      "title": "BOOSTING THE REFINEMENT PHASE",
      "start_index": 5,
      "end_index": 5,
      "nodes": [
        {
          "title": "Analysis of EMD Calculation",
          "start_index": 5,
          "end_index": 6,
          "node_id": "0008"
        },
        {
          "title": "Progressive Bounding",
          "start_index": 6,
          "end_index": 6,
          "node_id": "0009"
        },
        {
          "title": "Sensitivity to Refinement Order",
          "start_index": 6,
          "end_index": 7,
          "node_id": "0010"
        },
        {
          "title": "Dynamic Refinement Ordering",
          "start_index": 7,
          "end_index": 8,
          "node_id": "0011"
        },
        {
          "title": "Running Upper Bound",
          "start_index": 8,
          "end_index": 8,
          "node_id": "0012"
        }
      ],
      "node_id": "0007"
    },
    {
      "title": "EXPERIMENTAL EVALUATION",
      "start_index": 8,
      "end_index": 9,
      "nodes": [
        {
          "title": "Performance Improvement",
          "start_index": 9,
          "end_index": 10,
          "node_id": "0014"
        },
        {
          "title": "Scalability Experiments",
          "start_index": 10,
          "end_index": 11,
          "node_id": "0015"
        },
        {
          "title": "Parameter Tuning in DRO",
          "start_index": 11,
          "end_index": 12,
          "node_id": "0016"
        }
      ],
      "node_id": "0013"
    },
    {
      "title": "RELATED WORK",
      "start_index": 12,
      "end_index": 12,
      "node_id": "0017"
    },
    {
      "title": "CONCLUSION",
      "start_index": 12,
      "end_index": 12,
      "node_id": "0018"
    },
    {
      "title": "ACKNOWLEDGMENT",
      "start_index": 12,
      "end_index": 12,
      "node_id": "0019"
    },
    {
      "title": "REFERENCES",
      "start_index": 12,
      "end_index": 12,
      "node_id": "0020"
    }
  ]
}
```
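
Note that a parent's own page range covers only its preamble text (e.g., PRELIMINARIES spans pages 2-2 while its children sit on pages 3-4), so mapping a page number back to its most specific section has to search the whole tree rather than prune by the parent's range. A sketch under that assumption, with a hypothetical `nodes_for_page` helper:

```python
import json

def nodes_for_page(nodes, page, depth=0):
    """Yield (depth, node) for every node whose own range contains `page`.

    Parent ranges cover only the parent's preamble, so the whole tree is
    searched instead of pruning subtrees by the parent's range.
    """
    for node in nodes:
        if node["start_index"] <= page <= node["end_index"]:
            yield depth, node
        yield from nodes_for_page(node.get("nodes", []), page, depth + 1)

with open("results/earthmover_structure.json") as f:
    doc = json.load(f)

# Boundary pages can be shared by adjacent sections; prefer the deepest hit.
hits = sorted(nodes_for_page(doc["structure"], 7), key=lambda hit: hit[0])
if hits:
    _, best = hits[-1]
    print(best["node_id"], best["title"])
```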

## /results/four-lectures_structure.json

```json path="/results/four-lectures_structure.json" 
{
  "doc_name": "four-lectures.pdf",
  "structure": [
    {
      "title": "Preface",
      "start_index": 1,
      "end_index": 1,
      "node_id": "0000"
    },
    {
      "title": "ML at a Glance",
      "start_index": 2,
      "end_index": 2,
      "nodes": [
        {
          "title": "An ML session",
          "start_index": 2,
          "end_index": 3,
          "node_id": "0002"
        },
        {
          "title": "Types and Values",
          "start_index": 3,
          "end_index": 4,
          "node_id": "0003"
        },
        {
          "title": "Recursive Functions",
          "start_index": 4,
          "end_index": 4,
          "node_id": "0004"
        },
        {
          "title": "Raising Exceptions",
          "start_index": 4,
          "end_index": 5,
          "node_id": "0005"
        },
        {
          "title": "Structures",
          "start_index": 5,
          "end_index": 6,
          "node_id": "0006"
        },
        {
          "title": "Signatures",
          "start_index": 6,
          "end_index": 7,
          "node_id": "0007"
        },
        {
          "title": "Coercive Signature Matching",
          "start_index": 7,
          "end_index": 8,
          "node_id": "0008"
        },
        {
          "title": "Functor Declaration",
          "start_index": 8,
          "end_index": 9,
          "node_id": "0009"
        },
        {
          "title": "Functor Application",
          "start_index": 9,
          "end_index": 9,
          "node_id": "0010"
        },
        {
          "title": "Summary",
          "start_index": 9,
          "end_index": 9,
          "node_id": "0011"
        }
      ],
      "node_id": "0001"
    },
    {
      "title": "Programming with ML Modules",
      "start_index": 10,
      "end_index": 10,
      "nodes": [
        {
          "title": "Introduction",
          "start_index": 10,
          "end_index": 11,
          "node_id": "0013"
        },
        {
          "title": "Signatures",
          "start_index": 11,
          "end_index": 12,
          "node_id": "0014"
        },
        {
          "title": "Structures",
          "start_index": 12,
          "end_index": 13,
          "node_id": "0015"
        },
        {
          "title": "Functors",
          "start_index": 13,
          "end_index": 14,
          "node_id": "0016"
        },
        {
          "title": "Substructures",
          "start_index": 14,
          "end_index": 15,
          "node_id": "0017"
        },
        {
          "title": "Sharing",
          "start_index": 15,
          "end_index": 16,
          "node_id": "0018"
        },
        {
          "title": "Building the System",
          "start_index": 16,
          "end_index": 17,
          "node_id": "0019"
        },
        {
          "title": "Separate Compilation",
          "start_index": 17,
          "end_index": 18,
          "node_id": "0020"
        },
        {
          "title": "Good Style",
          "start_index": 18,
          "end_index": 18,
          "node_id": "0021"
        },
        {
          "title": "Bad Style",
          "start_index": 18,
          "end_index": 19,
          "node_id": "0022"
        }
      ],
      "node_id": "0012"
    },
    {
      "title": "The Static Semantics of Modules",
      "start_index": 20,
      "end_index": 20,
      "nodes": [
        {
          "title": "Elaboration",
          "start_index": 20,
          "end_index": 21,
          "node_id": "0024"
        },
        {
          "title": "Names",
          "start_index": 21,
          "end_index": 21,
          "node_id": "0025"
        },
        {
          "title": "Decorating Structures",
          "start_index": 21,
          "end_index": 21,
          "node_id": "0026"
        },
        {
          "title": "Decorating Signatures",
          "start_index": 22,
          "end_index": 23,
          "node_id": "0027"
        },
        {
          "title": "Signature Instantiation",
          "start_index": 23,
          "end_index": 24,
          "node_id": "0028"
        },
        {
          "title": "Signature Matching",
          "start_index": 24,
          "end_index": 25,
          "node_id": "0029"
        },
        {
          "title": "Signature Constraints",
          "start_index": 25,
          "end_index": 25,
          "node_id": "0030"
        },
        {
          "title": "Decorating Functors",
          "start_index": 26,
          "end_index": 26,
          "node_id": "0031"
        },
        {
          "title": "External Sharing",
          "start_index": 26,
          "end_index": 27,
          "node_id": "0032"
        },
        {
          "title": "Functors with Arguments",
          "start_index": 27,
          "end_index": 28,
          "node_id": "0033"
        },
        {
          "title": "Sharing Between Argument and Result",
          "start_index": 28,
          "end_index": 28,
          "node_id": "0034"
        },
        {
          "title": "Explicit Result Signatures",
          "start_index": 28,
          "end_index": 29,
          "node_id": "0035"
        }
      ],
      "node_id": "0023"
    },
    {
      "title": "Implementing an Interpreter in ML",
      "start_index": 30,
      "end_index": 32,
      "nodes": [
        {
          "title": "Version 1: The Bare Typechecker",
          "start_index": 32,
          "end_index": 33,
          "node_id": "0037"
        },
        {
          "title": "Version 2: Adding Lists and Polymorphism",
          "start_index": 33,
          "end_index": 37,
          "node_id": "0038"
        },
        {
          "title": "Version 3: A Different Implementation of Types",
          "start_index": 37,
          "end_index": 39,
          "node_id": "0039"
        },
        {
          "title": "Version 4: Introducing Variables and Let",
          "start_index": 39,
          "end_index": 43,
          "node_id": "0040"
        },
        {
          "title": "Acknowledgement",
          "start_index": 43,
          "end_index": 43,
          "node_id": "0041"
        }
      ],
      "node_id": "0036"
    },
    {
      "title": "Appendix A: The Bare Interpreter",
      "start_index": 44,
      "end_index": 44,
      "nodes": [
        {
          "title": "Syntax",
          "start_index": 44,
          "end_index": 44,
          "node_id": "0043"
        },
        {
          "title": "Parsing",
          "start_index": 44,
          "end_index": 45,
          "node_id": "0044"
        },
        {
          "title": "Environments",
          "start_index": 45,
          "end_index": 45,
          "node_id": "0045"
        },
        {
          "title": "Evaluation",
          "start_index": 45,
          "end_index": 46,
          "node_id": "0046"
        },
        {
          "title": "Type Checking",
          "start_index": 46,
          "end_index": 46,
          "node_id": "0047"
        },
        {
          "title": "The Interpreter",
          "start_index": 46,
          "end_index": 47,
          "node_id": "0048"
        },
        {
          "title": "The Evaluator",
          "start_index": 47,
          "end_index": 48,
          "node_id": "0049"
        },
        {
          "title": "The Typechecker",
          "start_index": 48,
          "end_index": 49,
          "node_id": "0050"
        },
        {
          "title": "The Basics",
          "start_index": 50,
          "end_index": 52,
          "node_id": "0051"
        }
      ],
      "node_id": "0042"
    },
    {
      "title": "Appendix B: Files",
      "start_index": 53,
      "end_index": 53,
      "node_id": "0052"
    }
  ]
}
```
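
These `*_structure.json` files are plain trees: each node carries a `title`, a `start_index`/`end_index` page range, and a `node_id`, with child sections nested under `nodes`. A minimal sketch of a depth-first walk over such a file (the `flatten` helper and the file path are illustrative, not part of the repo):

```py
import json

def flatten(nodes, depth=0):
    """Yield (node_id, title, start_index, end_index, depth) for every
    node in a PageIndex structure tree, depth-first."""
    for node in nodes:
        yield (node.get("node_id"), node["title"],
               node.get("start_index"), node.get("end_index"), depth)
        # Child sections, when present, live under the "nodes" key.
        yield from flatten(node.get("nodes", []), depth + 1)

# Path is illustrative; any file under ./results works the same way.
with open("./results/four-lectures_structure.json", encoding="utf-8") as f:
    tree = json.load(f)

for node_id, title, start, end, depth in flatten(tree["structure"]):
    print(f'{"  " * depth}[{node_id}] {title} (pp. {start}-{end})')
```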

## /results/q1-fy25-earnings_structure.json

```json path="/results/q1-fy25-earnings_structure.json" 
{
  "doc_name": "q1-fy25-earnings.pdf",
  "doc_description": "A comprehensive financial report detailing The Walt Disney Company's first-quarter fiscal 2025 performance, including revenue growth, segment highlights, guidance for fiscal 2025, and key financial metrics such as adjusted EPS, operating income, and cash flow.",
  "structure": [
    {
      "title": "THE WALT DISNEY COMPANY REPORTS FIRST QUARTER EARNINGS FOR FISCAL 2025",
      "start_index": 1,
      "end_index": 1,
      "nodes": [
        {
          "title": "Financial Results for the Quarter",
          "start_index": 1,
          "end_index": 1,
          "nodes": [
            {
              "title": "Key Points",
              "start_index": 1,
              "end_index": 1,
              "node_id": "0002",
              "summary": "The partial document outlines The Walt Disney Company's financial performance for the first fiscal quarter of 2025, ending December 28, 2024. Key points include:\n\n1. **Financial Results**: \n   - Revenue increased by 5% to $24.7 billion.\n   - Income before taxes rose by 27% to $3.7 billion.\n   - Diluted EPS grew by 35% to $1.40.\n   - Total segment operating income increased by 31% to $5.1 billion, with adjusted EPS up 44% to $1.76.\n\n2. **Entertainment Segment**:\n   - Operating income increased by $0.8 billion to $1.7 billion.\n   - Direct-to-Consumer operating income rose by $431 million to $293 million, with advertising revenue (excluding Disney+ Hotstar in India) up 16%.\n   - Disney+ and Hulu subscriptions increased by 0.9 million, while Disney+ subscribers decreased by 0.7 million.\n   - Content sales/licensing income grew by $536 million, driven by the success of *Moana 2*.\n\n3. **Sports Segment**:\n   - Operating income increased by $350 million to $247 million.\n   - Domestic ESPN advertising revenue grew by 15%.\n\n4. **Experiences Segment**:\n   - Operating income remained at $3.1 billion, with a 6 percentage-point adverse impact due to Hurricanes Milton and Helene and pre-opening expenses for the Disney Treasure.\n   - Domestic Parks & Experiences income declined by 5%, while International Parks & Experiences income increased by 28%."
            }
          ],
          "node_id": "0001",
          "summary": "The partial document is a report from The Walt Disney Company detailing its financial performance for the first fiscal quarter of 2025, ending December 28, 2024. Key points include:\n\n1. **Financial Performance**:\n   - Revenue increased by 5% to $24.7 billion.\n   - Income before taxes rose by 27% to $3.7 billion.\n   - Diluted EPS grew by 35% to $1.40.\n   - Total segment operating income increased by 31% to $5.1 billion, with adjusted EPS up 44% to $1.76.\n\n2. **Segment Highlights**:\n   - **Entertainment**: Operating income increased by $0.8 billion to $1.7 billion. Direct-to-Consumer income rose by $431 million, though advertising revenue declined 2% (up 16% excluding Disney+ Hotstar in India). Disney+ and Hulu subscriptions increased slightly, while Disney+ subscribers decreased by 0.7 million. Content sales/licensing income grew, driven by the success of *Moana 2*.\n   - **Sports**: Operating income increased by $350 million to $247 million, with ESPN domestic advertising revenue up 15%.\n   - **Experiences**: Operating income remained at $3.1 billion, with adverse impacts from hurricanes and pre-opening expenses for the Disney Treasure. Domestic Parks & Experiences income declined by 5%, while International Parks & Experiences income rose by 28%.\n\n3. **Additional Notes**:\n   - Non-GAAP financial measures are used for certain metrics.\n   - Disney+ Hotstar in India saw a significant decline in advertising revenue compared to the previous year."
        },
        {
          "title": "Guidance and Outlook",
          "start_index": 2,
          "end_index": 2,
          "nodes": [
            {
              "title": "Star India deconsolidated in Q1",
              "start_index": 2,
              "end_index": 2,
              "node_id": "0004",
              "summary": "The partial document outlines Disney's financial guidance and outlook for fiscal 2025, including the deconsolidation of Star India and its impact on operating income for the Entertainment and Sports segments. It highlights expectations for Q2 fiscal 2025, such as a modest decline in Disney+ subscribers, adverse impacts on Sports segment income, and pre-opening expenses for Disney Cruise Line. For fiscal 2025, the company projects high-single-digit adjusted EPS growth, $15 billion in cash from operations, and segment operating income growth across Entertainment, Sports, and Experiences. The CEO emphasizes strong Q1 results, including box office success, improved profitability in streaming, advancements in ESPN\u2019s digital strategy, and continued investments in the Experiences segment, expressing confidence in Disney's growth strategy."
            },
            {
              "title": "Q2 Fiscal 2025",
              "start_index": 2,
              "end_index": 2,
              "node_id": "0005",
              "summary": "The partial document outlines Disney's financial guidance and outlook for fiscal 2025, including the deconsolidation of Star India and its impact on operating income for the Entertainment and Sports segments. It highlights expectations for Q2 fiscal 2025, such as a modest decline in Disney+ subscribers, adverse impacts on Sports segment income, and pre-opening expenses for Disney Cruise Line. For the full fiscal year 2025, it projects high-single-digit adjusted EPS growth, $15 billion in cash from operations, and segment operating income growth across Entertainment, Sports, and Experiences. The CEO emphasizes Disney's strong start to the fiscal year, citing achievements in box office performance, improved streaming profitability, ESPN's digital strategy, and the enduring appeal of the Experiences segment."
            },
            {
              "title": "Fiscal Year 2025",
              "start_index": 2,
              "end_index": 2,
              "node_id": "0006",
              "summary": "The partial document outlines Disney's financial guidance and outlook for fiscal 2025, including the deconsolidation of Star India and its impact on operating income for the Entertainment and Sports segments. It highlights expectations for Q2 fiscal 2025, such as a modest decline in Disney+ subscribers, adverse impacts on Sports segment income, and pre-opening expenses for Disney Cruise Line. For the full fiscal year 2025, it projects high-single-digit adjusted EPS growth, $15 billion in cash from operations, and segment operating income growth across Entertainment, Sports, and Experiences. The CEO emphasizes Disney's creative and financial strength, strong box office performance, improved streaming profitability, advancements in ESPN's digital strategy, and continued global investments in the Experiences segment."
            }
          ],
          "node_id": "0003",
          "summary": "The partial document outlines Disney's financial guidance and outlook for fiscal 2025, including the deconsolidation of Star India and its impact on operating income for the Entertainment and Sports segments. It highlights expectations for Q2 fiscal 2025, such as a modest decline in Disney+ subscribers, adverse impacts on Sports segment income, and pre-opening expenses for Disney Cruise Line. For the full fiscal year 2025, it projects high-single-digit adjusted EPS growth, $15 billion in cash from operations, and segment operating income growth across Entertainment, Sports, and Experiences. The CEO emphasizes strong Q1 results, including box office success, improved profitability in streaming, advancements in ESPN\u2019s digital strategy, and continued investment in global experiences."
        },
        {
          "title": "Message From Our CEO",
          "start_index": 2,
          "end_index": 2,
          "node_id": "0007",
          "summary": "The partial document outlines Disney's financial guidance and outlook for fiscal 2025, including the deconsolidation of Star India and its impact on operating income for the Entertainment and Sports segments. It highlights expectations for Q2 fiscal 2025, such as a modest decline in Disney+ subscribers, adverse impacts on Sports segment income, and pre-opening expenses for Disney Cruise Line. For the full fiscal year 2025, it projects high-single-digit adjusted EPS growth, $15 billion in cash from operations, and segment operating income growth across Entertainment, Sports, and Experiences. The CEO emphasizes strong Q1 results, including box office success, improved profitability in streaming, advancements in ESPN\u2019s digital strategy, and continued investment in global experiences."
        }
      ],
      "node_id": "0000",
      "summary": "The partial document is a report from The Walt Disney Company detailing its financial performance for the first fiscal quarter of 2025, ending December 28, 2024. Key points include:\n\n1. **Financial Results**:  \n   - Revenue increased by 5% to $24.7 billion.  \n   - Income before taxes rose by 27% to $3.7 billion.  \n   - Diluted EPS grew by 35% to $1.40.  \n   - Total segment operating income increased by 31% to $5.1 billion, and adjusted EPS rose by 44% to $1.76.  \n\n2. **Entertainment Segment**:  \n   - Operating income increased by $0.8 billion to $1.7 billion.  \n   - Direct-to-Consumer operating income rose by $431 million to $293 million, with advertising revenue up 16% (excluding Disney+ Hotstar in India).  \n   - Disney+ and Hulu subscriptions increased by 0.9 million, while Disney+ subscribers decreased by 0.7 million.  \n   - Content sales/licensing income grew by $536 million, driven by the success of *Moana 2*.  \n\n3. **Sports Segment**:  \n   - Operating income increased by $350 million to $247 million.  \n   - Domestic ESPN advertising revenue grew by 15%.  \n\n4. **Experiences Segment**:  \n   - Operating income remained at $3.1 billion, with a 6 percentage-point adverse impact due to Hurricanes Milton and Helene and pre-opening expenses for the Disney Treasure.  \n   - Domestic Parks & Experiences income declined by 5%, while International Parks & Experiences income increased by 28%.  \n\nThe report also includes non-GAAP financial measures and notes the impact of Disney+ Hotstar's advertising revenue in India."
    },
    {
      "title": "SUMMARIZED FINANCIAL RESULTS",
      "start_index": 3,
      "end_index": 3,
      "nodes": [
        {
          "title": "SUMMARIZED SEGMENT FINANCIAL RESULTS",
          "start_index": 3,
          "end_index": 3,
          "node_id": "0009",
          "summary": "The partial document provides a summarized overview of financial results for the first quarter of fiscal years 2025 and 2024. Key points include:\n\n1. **Overall Financial Performance**:\n   - Revenues increased by 5% from $23,549 million in 2024 to $24,690 million in 2025.\n   - Income before income taxes rose by 27%.\n   - Total segment operating income grew by 31%.\n   - Diluted EPS increased by 35%, and diluted EPS excluding certain items rose by 44%.\n   - Cash provided by operations increased by 47%, while free cash flow decreased by 17%.\n\n2. **Segment Financial Results**:\n   - Revenue growth was observed in the Entertainment segment (9%) and Experiences segment (3%), while Sports revenue remained flat.\n   - Segment operating income for Entertainment increased significantly by 95%, while Sports shifted from a loss to a positive income. Experiences segment operating income remained stable.\n\n3. **Non-GAAP Measures**:\n   - The document highlights the use of non-GAAP financial measures such as total segment operating income, diluted EPS excluding certain items, and free cash flow, with references to further details and reconciliations provided elsewhere in the report."
        }
      ],
      "node_id": "0008",
      "summary": "The partial document provides a summarized overview of financial results for the first quarter of fiscal years 2025 and 2024. Key points include:\n\n1. **Overall Financial Performance**:\n   - Revenues increased by 5% from $23,549 million in 2024 to $24,690 million in 2025.\n   - Income before income taxes rose by 27%.\n   - Total segment operating income grew by 31%.\n   - Diluted EPS increased by 35%, and diluted EPS excluding certain items rose by 44%.\n   - Cash provided by operations increased by 47%, while free cash flow decreased by 17%.\n\n2. **Segment Financial Results**:\n   - Revenue growth was observed in the Entertainment segment (9%) and Experiences segment (3%), while Sports revenue remained flat.\n   - Segment operating income for Entertainment increased significantly by 95%, while Sports shifted from a loss to a positive income. Experiences segment operating income remained stable.\n\n3. **Non-GAAP Measures**:\n   - The document highlights the use of non-GAAP financial measures such as total segment operating income, diluted EPS excluding certain items, and free cash flow, with references to further details and reconciliations provided in later sections."
    },
    {
      "title": "DISCUSSION OF FIRST QUARTER SEGMENT RESULTS",
      "start_index": 4,
      "end_index": 4,
      "nodes": [
        {
          "title": "Star India",
          "start_index": 4,
          "end_index": 4,
          "node_id": "0011",
          "summary": "The partial document discusses the first-quarter segment results, focusing on the Star India joint venture formed between the Company and Reliance Industries Limited (RIL) on November 14, 2024. The joint venture combines Star-branded entertainment and sports television channels, Disney+ Hotstar, and certain RIL-controlled media businesses, with RIL holding a 56% controlling interest, the Company holding 37%, and a third-party investment company holding 7%. The Company now recognizes its 37% share of the joint venture\u2019s results under \"Equity in the income of investees.\" Additionally, the document provides financial results for the Entertainment segment, showing a 9% increase in total revenues and a 95% increase in operating income compared to the prior-year quarter. The growth in operating income is attributed to improved results in Content Sales/Licensing and Direct-to-Consumer, partially offset by a decline in Linear Networks."
        },
        {
          "title": "Entertainment",
          "start_index": 4,
          "end_index": 4,
          "nodes": [
            {
              "title": "Linear Networks",
              "start_index": 5,
              "end_index": 5,
              "node_id": "0013",
              "summary": "The partial document provides financial performance details for Linear Networks and Direct-to-Consumer segments for the quarters ending December 28, 2024, and December 30, 2023. Key points include:\n\n1. **Linear Networks**:\n   - Revenue decreased by 7%, with domestic revenue remaining flat and international revenue declining by 31%.\n   - Operating income decreased by 11%, with domestic income stable and international income dropping by 39%.\n   - Domestic operating income was impacted by higher programming costs (due to the 2023 guild strikes), lower affiliate revenue (fewer subscribers), lower technology costs, and higher advertising revenue (driven by political advertising but offset by lower viewership).\n   - International operating income decline was attributed to the Star India Transaction.\n   - Equity income from investees decreased due to lower income from A+E Television Networks, reduced advertising and affiliate revenue, and the absence of a prior-year gain from an investment sale.\n\n2. **Direct-to-Consumer**:\n   - Revenue increased by 9%, driven by higher subscription revenue due to increased pricing and more subscribers, partially offset by unfavorable foreign exchange impacts.\n   - Operating income improved significantly, moving from a loss in the prior year to a profit, reflecting subscription revenue growth."
            },
            {
              "title": "Direct-to-Consumer",
              "start_index": 5,
              "end_index": 7,
              "node_id": "0014",
              "summary": "The partial document provides a financial performance overview of various segments for the quarter ended December 28, 2024, compared to the prior-year quarter. Key points include:\n\n1. **Linear Networks**:\n   - Revenue decreased by 7%, with domestic revenue flat and international revenue down 31%.\n   - Operating income decreased by 11%, with domestic income flat and international income down 39%, primarily due to the Star India transaction.\n   - Equity income from investees declined by 29%, driven by lower income from A+E Television Networks and the absence of a prior-year gain on an investment sale.\n\n2. **Direct-to-Consumer (DTC)**:\n   - Revenue increased by 9%, and operating income improved significantly from a loss of $138 million to a profit of $293 million.\n   - Growth was driven by higher subscription revenue due to pricing increases and more subscribers, partially offset by higher costs and lower advertising revenue.\n   - Key metrics showed slight changes in Disney+ and Hulu subscriber numbers, with increases in average monthly revenue per paid subscriber due to pricing adjustments.\n\n3. **Content Sales/Licensing and Other**:\n   - Revenue increased by 34%, and operating income improved significantly, driven by strong theatrical performance, particularly from \"Moana 2,\" and contributions from \"Mufasa: The Lion King.\"\n\n4. **Sports**:\n   - ESPN revenue grew by 8%, with domestic and international segments showing increases, while Star India revenue dropped by 90%.\n   - Operating income for ESPN improved by 15%, while Star India shifted from a loss to a small profit.\n\nThe document highlights revenue trends, operating income changes, and key drivers for each segment, including programming costs, subscriber growth, pricing adjustments, and content performance."
            },
            {
              "title": "Content Sales/Licensing and Other",
              "start_index": 7,
              "end_index": 7,
              "node_id": "0015",
              "summary": "The partial document discusses the financial performance of Disney's streaming services, content sales, and sports segment. Key points include:\n\n1. **Disney+ Revenue**: Domestic and international Disney+ average monthly revenue per paid subscriber increased due to pricing hikes, partially offset by promotional offerings. International revenue also benefited from higher advertising revenue.\n\n2. **Hulu Revenue**: Hulu SVOD Only revenue remained stable, with pricing increases offsetting lower advertising revenue. Hulu Live TV + SVOD revenue increased due to pricing hikes.\n\n3. **Content Sales/Licensing**: Revenue and operating income improved significantly, driven by strong theatrical distribution results, particularly from \"Moana 2,\" and contributions from \"Mufasa: The Lion King.\"\n\n4. **Sports Revenue**: ESPN domestic and international revenues grew, while Star India revenue declined sharply. Operating income for ESPN improved, with domestic income slightly down and international losses reduced. Star India showed a notable recovery in operating income."
            }
          ],
          "node_id": "0012",
          "summary": "The partial document discusses the first-quarter segment results, focusing on the Star India joint venture formed between the Company and Reliance Industries Limited (RIL) on November 14, 2024. The joint venture combines Star-branded entertainment and sports television channels and the Disney+ Hotstar service in India, with RIL holding a 56% controlling interest, the Company holding 37%, and a third-party investment company holding 7%. The Company now recognizes its 37% share of the joint venture\u2019s results under \u201cEquity in the income of investees.\u201d Additionally, the document provides financial results for the Entertainment segment, showing a 9% increase in total revenues compared to the prior year, driven by growth in Direct-to-Consumer and Content Sales/Licensing and Other, despite a decline in Linear Networks. Operating income increased by 95%, primarily due to improved results in Content Sales/Licensing and Other and Direct-to-Consumer, partially offset by a decrease in Linear Networks."
        },
        {
          "title": "Sports",
          "start_index": 7,
          "end_index": 7,
          "nodes": [
            {
              "title": "Domestic ESPN",
              "start_index": 8,
              "end_index": 8,
              "node_id": "0017",
              "summary": "The partial document discusses the financial performance of ESPN, including domestic and international operations, as well as Star India, for the current quarter compared to the prior-year quarter. Key points include:\n\n1. **Domestic ESPN**: \n   - Decrease in operating results due to higher programming and production costs, primarily from expanded college football programming rights and changes in the College Football Playoff (CFP) format.\n   - Increase in advertising revenue due to higher rates.\n   - Revenue from sub-licensing CFP programming rights.\n   - Affiliate revenue remained comparable, with rate increases offset by fewer subscribers.\n\n2. **International ESPN**: \n   - Decrease in operating loss driven by higher fees from the Entertainment segment for Disney+ sports content.\n   - Increased programming and production costs due to higher soccer rights costs.\n   - Lower affiliate revenue due to fewer subscribers.\n\n3. **Star India**: \n   - Improved operating results due to the absence of significant cricket events in the current quarter compared to the prior-year quarter, which included the ICC Cricket World Cup.\n\n4. **Key Metrics for ESPN+**:\n   - Paid subscribers decreased from 25.6 million to 24.9 million.\n   - Average monthly revenue per paid subscriber increased from $5.94 to $6.36, driven by pricing increases and higher advertising revenue."
            },
            {
              "title": "International ESPN",
              "start_index": 8,
              "end_index": 8,
              "node_id": "0018",
              "summary": "The partial document discusses the financial performance of ESPN, including domestic and international operations, as well as Star India, for the current quarter compared to the prior-year quarter. Key points include:\n\n1. **Domestic ESPN**: \n   - Decrease in operating results due to higher programming and production costs, primarily from expanded college football programming rights and changes in the College Football Playoff (CFP) format.\n   - Increase in advertising revenue due to higher rates.\n   - Revenue from sub-licensing CFP programming rights.\n   - Affiliate revenue remained comparable, with rate increases offset by fewer subscribers.\n\n2. **International ESPN**: \n   - Decrease in operating loss driven by higher fees from the Entertainment segment for Disney+ sports content.\n   - Increased programming and production costs due to higher soccer rights costs.\n   - Lower affiliate revenue due to fewer subscribers.\n\n3. **Star India**: \n   - Improved operating results due to the absence of significant cricket events in the current quarter compared to the ICC Cricket World Cup in the prior-year quarter.\n\n4. **Key Metrics for ESPN+**:\n   - Paid subscribers decreased from 25.6 million to 24.9 million.\n   - Average monthly revenue per paid subscriber increased from $5.94 to $6.36, driven by pricing increases and higher advertising revenue."
            },
            {
              "title": "Star India",
              "start_index": 8,
              "end_index": 8,
              "node_id": "0019",
              "summary": "The partial document discusses the financial performance of ESPN, including domestic and international operations, as well as Star India, for a specific quarter. Key points include:\n\n1. **Domestic ESPN**: \n   - Decrease in operating results due to higher programming and production costs, primarily from expanded college football programming rights, including additional College Football Playoff (CFP) games under a revised format.\n   - Increase in advertising revenue due to higher rates.\n   - Revenue from sub-licensing CFP programming rights.\n   - Affiliate revenue remained comparable to the prior year due to effective rate increases offset by fewer subscribers.\n\n2. **International ESPN**: \n   - Decrease in operating loss driven by higher fees from the Entertainment segment for sports content on Disney+.\n   - Increased programming and production costs due to higher soccer rights costs.\n   - Lower affiliate revenue due to fewer subscribers.\n\n3. **Star India**: \n   - Improvement in operating results due to the absence of significant cricket events in the current quarter compared to the prior year, which included the ICC Cricket World Cup.\n\n4. **Key Metrics for ESPN+**:\n   - Paid subscribers decreased from 25.6 million to 24.9 million.\n   - Average monthly revenue per paid subscriber increased from $5.94 to $6.36, driven by pricing increases and higher advertising revenue."
            }
          ],
          "node_id": "0016",
          "summary": "The partial document discusses the financial performance of Disney's streaming services, content sales, and sports segment. Key points include:\n\n1. **Disney+ Revenue**: Domestic and international Disney+ average monthly revenue per paid subscriber increased due to pricing hikes, partially offset by promotional offerings. International revenue also benefited from higher advertising revenue.\n\n2. **Hulu Revenue**: Hulu SVOD Only revenue remained stable, with pricing increases offsetting lower advertising revenue. Hulu Live TV + SVOD revenue increased due to pricing hikes.\n\n3. **Content Sales/Licensing**: Revenue and operating income improved significantly, driven by strong theatrical performance, particularly from \"Moana 2,\" and contributions from \"Mufasa: The Lion King.\"\n\n4. **Sports Revenue**: ESPN domestic and international revenues grew, while Star India revenue declined sharply. Operating income for ESPN improved, with domestic income slightly down and international income showing significant recovery. Star India showed a notable turnaround in operating income."
        },
        {
          "title": "Experiences",
          "start_index": 9,
          "end_index": 9,
          "node_id": "0020",
          "summary": "The partial document provides financial performance details for the Parks & Experiences segment, including revenues and operating income for domestic and international operations, as well as consumer products. It highlights a 3% increase in total revenue and stable operating income compared to the prior year. Domestic parks and experiences were negatively impacted by hurricanes, leading to lower volumes and higher costs, despite increased guest spending. International parks and experiences saw growth in operating income due to higher guest spending, increased attendance, and new offerings. The document also notes increased corporate expenses due to a legal settlement and a $143 million loss related to the Star India Transaction."
        }
      ],
      "node_id": "0010",
      "summary": "The partial document discusses the first-quarter segment results, focusing on the Star India joint venture formed between the Company and Reliance Industries Limited (RIL) on November 14, 2024. The joint venture combines Star-branded entertainment and sports television channels, Disney+ Hotstar, and certain RIL-controlled media businesses, with RIL holding a 56% controlling interest, the Company holding 37%, and a third-party investment company holding 7%. The Company now recognizes its 37% share of the joint venture\u2019s results under \"Equity in the income of investees.\" Additionally, the document provides financial results for the Entertainment segment, showing a 9% increase in total revenues and a 95% increase in operating income compared to the prior-year quarter. The growth in operating income is attributed to improved results in Content Sales/Licensing and Direct-to-Consumer, partially offset by a decline in Linear Networks."
    },
    {
      "title": "OTHER FINANCIAL INFORMATION",
      "start_index": 9,
      "end_index": 9,
      "nodes": [
        {
          "title": "Corporate and Unallocated Shared Expenses",
          "start_index": 9,
          "end_index": 9,
          "node_id": "0022",
          "summary": "The partial document provides a financial overview of revenues and operating income for Parks & Experiences, including Domestic, International, and Consumer Products segments, comparing the quarters ending December 28, 2024, and December 30, 2023. It highlights a 3% increase in overall revenue and stable operating income. Domestic Parks and Experiences were negatively impacted by Hurricanes Milton and Helene, leading to closures, cancellations, higher costs, and lower attendance, despite increased guest spending. International Parks and Experiences saw growth in operating income due to higher guest spending, increased attendance, and new offerings, offset by higher costs. The document also notes a $152 million increase in corporate and unallocated shared expenses due to a legal settlement and a $143 million loss related to the Star India Transaction."
        },
        {
          "title": "Restructuring and Impairment Charges",
          "start_index": 9,
          "end_index": 9,
          "node_id": "0023",
          "summary": "The partial document provides financial performance details for the Parks & Experiences segment, including revenues and operating income for domestic and international operations, as well as consumer products. It highlights a 3% increase in overall revenue and stable operating income compared to the prior year. Domestic parks and experiences were negatively impacted by hurricanes, leading to lower volumes and higher costs, despite increased guest spending. International parks and experiences saw growth in operating income due to higher guest spending, increased attendance, and new offerings, though costs also rose. Additionally, corporate and unallocated shared expenses increased due to a legal settlement, and a $143 million loss was recorded related to the Star India Transaction."
        },
        {
          "title": "Interest Expense, net",
          "start_index": 10,
          "end_index": 10,
          "node_id": "0024",
          "summary": "The partial document provides a financial analysis of interest expense, net, equity in the income of investees, and income taxes for the quarters ending December 28, 2024, and December 30, 2023. Key points include:\n\n1. **Interest Expense, Net**: A decrease in interest expense due to lower average rates and debt balances, partially offset by reduced capitalized interest. Interest income and investment income declined due to lower cash balances, pension-related costs, and investment losses compared to prior-year gains.\n\n2. **Equity in the Income of Investees**: A $89 million decrease in income from investees, primarily due to lower income from A+E and losses from the India joint venture.\n\n3. **Income Taxes**: An increase in the effective income tax rate from 25.1% to 27.8%, driven by a non-cash tax charge related to the Star India Transaction, partially offset by favorable adjustments related to prior years, lower foreign tax rates, and a comparison to unfavorable prior-year effects of employee share-based awards."
        },
        {
          "title": "Equity in the Income of Investees",
          "start_index": 10,
          "end_index": 10,
          "node_id": "0025",
          "summary": "The partial document provides a financial analysis of interest expense, net, equity in the income of investees, and income taxes for the quarters ended December 28, 2024, and December 30, 2023. It highlights a decrease in net interest expense due to lower average rates and debt balances, offset by reduced capitalized interest. Interest income and investment income declined due to lower cash balances, pension-related costs, and investment losses. Equity income from investees decreased significantly, driven by lower income from A+E and losses from the India joint venture. The effective income tax rate increased due to a non-cash tax charge related to the Star India Transaction, partially offset by favorable adjustments related to prior years, lower foreign tax rates, and a comparison to unfavorable prior-year effects."
        },
        {
          "title": "Income Taxes",
          "start_index": 10,
          "end_index": 10,
          "node_id": "0026",
          "summary": "The partial document provides a financial analysis of interest expense, net, equity in the income of investees, and income taxes for the quarters ended December 28, 2024, and December 30, 2023. It highlights a decrease in net interest expense due to lower average rates and debt balances, offset by reduced capitalized interest. Interest income and investment income declined due to lower cash balances, pension-related costs, and investment losses. Equity income from investees dropped significantly, driven by lower income from A+E and losses from the India joint venture. The effective income tax rate increased due to a non-cash tax charge related to the Star India Transaction, partially offset by favorable adjustments related to prior years, lower foreign tax rates, and a comparison to unfavorable prior-year effects."
        },
        {
          "title": "Noncontrolling Interests",
          "start_index": 11,
          "end_index": 11,
          "node_id": "0027",
          "summary": "The partial document covers two main points:\n\n1. **Noncontrolling Interests**: It discusses the net income attributable to noncontrolling interests, which decreased by 63% compared to the prior-year quarter. The decrease is attributed to the prior-year accretion of NBC Universal\u2019s interest in Hulu. The calculation of net income attributable to noncontrolling interests is based on income after royalties, management fees, financing costs, and income taxes.\n\n2. **Cash from Operations**: It details cash provided by operations and free cash flow, showing an increase in cash provided by operations by $1.0 billion to $3.2 billion in the current quarter. The increase is driven by lower tax payments, higher operating income at Entertainment, and higher film and television production spending, along with the timing of payments for sports rights. Free cash flow decreased by $147 million compared to the prior-year quarter."
        },
        {
          "title": "Cash from Operations",
          "start_index": 11,
          "end_index": 11,
          "node_id": "0028",
          "summary": "The partial document covers two main points:\n\n1. **Noncontrolling Interests**: It discusses the net income attributable to noncontrolling interests, which decreased by 63% in the quarter ended December 28, 2024, compared to the prior-year quarter. The decrease is attributed to the prior-year accretion of NBC Universal\u2019s interest in Hulu. The calculation of net income attributable to noncontrolling interests includes royalties, management fees, financing costs, and income taxes.\n\n2. **Cash from Operations**: It details cash provided by operations and free cash flow for the quarter ended December 28, 2024, compared to the prior-year quarter. Cash provided by operations increased by $1.0 billion, driven by lower tax payments, higher operating income at Entertainment, and higher film and television production spending, along with the timing of payments for sports rights. Free cash flow decreased by $147 million due to increased investments in parks, resorts, and other property."
        },
        {
          "title": "Capital Expenditures",
          "start_index": 12,
          "end_index": 12,
          "node_id": "0029",
          "summary": "The partial document provides details on capital expenditures and depreciation expenses for parks, resorts, and other properties. It highlights an increase in capital expenditures from $1.3 billion to $2.5 billion, primarily due to higher spending on cruise ship fleet expansion in the Experiences segment. The document also breaks down investments and depreciation expenses by category (Entertainment, Sports, Domestic and International Experiences, and Corporate) for the quarters ending December 28, 2024, and December 30, 2023. Depreciation expenses increased from $823 million to $909 million, with detailed figures provided for each segment."
        },
        {
          "title": "Depreciation Expense",
          "start_index": 12,
          "end_index": 12,
          "node_id": "0030",
          "summary": "The partial document provides details on capital expenditures and depreciation expenses for parks, resorts, and other properties. It highlights an increase in capital expenditures from $1.3 billion to $2.5 billion, primarily due to higher spending on cruise ship fleet expansion in the Experiences segment. The breakdown of investments and depreciation expenses is provided for Entertainment, Sports, Domestic and International Experiences, and Corporate segments for the quarters ending December 28, 2024, and December 30, 2023. Depreciation expenses also increased from $823 million to $909 million, with detailed segment-wise allocations."
        }
      ],
      "node_id": "0021",
      "summary": "The partial document provides a financial overview of revenues and operating income for Parks & Experiences, including Domestic, International, and Consumer Products segments, comparing the quarters ending December 28, 2024, and December 30, 2023. It highlights a 3% increase in total revenue and stable operating income. Domestic Parks and Experiences were negatively impacted by Hurricanes Milton and Helene, leading to closures, cancellations, higher costs, and lower attendance, despite increased guest spending. International Parks and Experiences saw growth in operating income due to higher guest spending, increased attendance, and new offerings, offset by increased costs. The document also notes a rise in corporate and unallocated shared expenses due to a legal settlement and a $143 million loss related to the Star India Transaction."
    },
    {
      "title": "THE WALT DISNEY COMPANY CONDENSED CONSOLIDATED STATEMENTS OF INCOME",
      "start_index": 13,
      "end_index": 13,
      "node_id": "0031",
      "summary": "The partial document provides a condensed consolidated statement of income for The Walt Disney Company for the quarters ended December 28, 2024, and December 30, 2023. It includes details on revenues, costs and expenses, restructuring and impairment charges, net interest expense, equity in the income of investees, income before income taxes, income taxes, and net income. It also breaks down net income attributable to noncontrolling interests and The Walt Disney Company. Additionally, it provides earnings per share (diluted and basic) and the weighted average number of shares outstanding (diluted and basic) for both periods."
    },
    {
      "title": "THE WALT DISNEY COMPANY CONDENSED CONSOLIDATED BALANCE SHEETS",
      "start_index": 14,
      "end_index": 14,
      "node_id": "0032",
      "summary": "The partial document is a condensed consolidated balance sheet for The Walt Disney Company, comparing financial data as of December 28, 2024, and September 28, 2024. It details the company's assets, liabilities, and equity. Key points include:\n\n1. **Assets**: Breakdown of current assets (cash, receivables, inventories, content advances, and other assets), produced and licensed content costs, investments, property (attractions, buildings, equipment, projects in progress, and land), intangible assets, goodwill, and other assets. Total assets increased slightly from $196.2 billion to $197 billion.\n\n2. **Liabilities**: Includes current liabilities (accounts payable, borrowings, deferred revenue), long-term borrowings, deferred income taxes, and other long-term liabilities. Total liabilities remained relatively stable.\n\n3. **Equity**: Details Disney shareholders' equity, including common stock, retained earnings, accumulated other comprehensive loss, and treasury stock. Noncontrolling interests are also included. Total equity increased from $105.5 billion to $106.7 billion.\n\n4. **Overall Financial Position**: The balance sheet reflects a stable financial position with slight changes in assets, liabilities, and equity over the period."
    },
    {
      "title": "THE WALT DISNEY COMPANY CONDENSED CONSOLIDATED STATEMENTS OF CASH FLOWS",
      "start_index": 15,
      "end_index": 15,
      "node_id": "0033",
      "summary": "The partial document provides a condensed consolidated statement of cash flows for The Walt Disney Company for the quarters ended December 28, 2024, and December 30, 2023. It details cash flow activities categorized into operating, investing, and financing activities. Key points include:\n\n1. **Operating Activities**: Net income increased from $2,151 million in 2023 to $2,644 million in 2024. Other significant changes include variations in depreciation, deferred taxes, equity income, content costs, and changes in operating assets and liabilities, resulting in cash provided by operations of $3,205 million in 2024 compared to $2,185 million in 2023.\n\n2. **Investing Activities**: Investments in parks, resorts, and other properties increased significantly in 2024 ($2,466 million) compared to 2023 ($1,299 million), leading to higher cash used in investing activities.\n\n3. **Financing Activities**: The company saw a net cash outflow in financing activities, including commercial paper borrowings, stock repurchases, and debt reduction. In 2024, cash used in financing activities was $997 million, a significant improvement from $8,006 million in 2023.\n\n4. **Exchange Rate Impact**: Exchange rates negatively impacted cash in 2024 by $153 million, compared to a positive impact of $79 million in 2023.\n\n5. **Overall Cash Position**: The company\u2019s cash, cash equivalents, and restricted cash decreased from $14,235 million at the beginning of the 2023 period to $5,582 million at the end of the 2024 period."
    },
    {
      "title": "DTC PRODUCT DESCRIPTIONS AND KEY DEFINITIONS",
      "start_index": 16,
      "end_index": 16,
      "node_id": "0034",
      "summary": "The partial document provides an overview of Disney's Direct-to-Consumer (DTC) product offerings, key definitions, and metrics. It details the availability of Disney+, ESPN+, and Hulu as standalone services or bundled offerings in the U.S., including Hulu Live TV + SVOD, which incorporates Disney+ and ESPN+. It explains the global reach of Disney+ in over 150 countries and the various purchase channels, including websites, third-party platforms, and wholesale arrangements. The document defines \"paid subscribers\" as those generating subscription revenue, excluding extra member add-ons, and outlines how subscribers are counted for multi-product offerings. It also describes the calculation of average monthly revenue per paid subscriber for Hulu, ESPN+, and Disney+, including revenue components like subscription fees, advertising, and add-ons, while noting differences in revenue allocation and the impact of wholesale arrangements on average revenue."
    },
    {
      "title": "NON-GAAP FINANCIAL MEASURES",
      "start_index": 17,
      "end_index": 17,
      "nodes": [
        {
          "title": "Diluted EPS excluding certain items",
          "start_index": 17,
          "end_index": 18,
          "node_id": "0036",
          "summary": "The partial document discusses the use of non-GAAP financial measures, specifically diluted EPS excluding certain items (adjusted EPS), total segment operating income, and free cash flow. It explains that these measures are not defined by GAAP but are important for evaluating the company's performance. The document highlights that these measures should be reviewed alongside comparable GAAP measures and may not be directly comparable to similar measures from other companies. It provides details on the adjustments made to diluted EPS, including the exclusion of certain items affecting comparability and amortization of TFCF and Hulu intangible assets, to better reflect operational performance. The document also includes a reconciliation table comparing reported diluted EPS to adjusted EPS for specific quarters, showing the impact of excluded items such as restructuring charges and intangible asset amortization. Additionally, it notes the challenges in providing forward-looking GAAP measures due to unpredictable factors."
        },
        {
          "title": "Total segment operating income",
          "start_index": 19,
          "end_index": 20,
          "node_id": "0037",
          "summary": "The partial document focuses on the evaluation of the company's performance through two key financial metrics: total segment operating income and free cash flow. It explains that total segment operating income is used to assess the performance of operating segments separately from non-operational factors, providing insights into operational results. A reconciliation table is provided, showing the calculation of total segment operating income for two quarters, highlighting changes in various components such as corporate expenses, restructuring charges, and interest expenses. Additionally, the document discusses free cash flow as a measure of cash available for purposes beyond capital expenditures, such as debt servicing, acquisitions, and shareholder returns. A summary of consolidated cash flows and a reconciliation of cash provided by operations to free cash flow are presented, comparing figures for two quarters and highlighting changes in cash flow components."
        },
        {
          "title": "Free cash flow",
          "start_index": 20,
          "end_index": 20,
          "node_id": "0038",
          "summary": "The partial document provides a reconciliation of the company's consolidated cash provided by operations to free cash flow for the quarters ended December 28, 2024, and December 30, 2023. It highlights a $1,020 million increase in cash provided by operations, a $1,167 million increase in investments in parks, resorts, and other property, and a $147 million decrease in free cash flow."
        }
      ],
      "node_id": "0035",
      "summary": "The partial document discusses the use of non-GAAP financial measures by the company, including diluted EPS excluding certain items (adjusted EPS), total segment operating income, and free cash flow. It explains that these measures are not defined by GAAP but are important for evaluating the company's performance. The document emphasizes that these measures should be reviewed alongside comparable GAAP measures and may not be directly comparable to similar measures from other companies. It highlights the company's inability to provide forward-looking GAAP measures or reconciliations due to uncertainties in predicting significant items. Additionally, the document details the rationale for excluding certain items and amortization of TFCF and Hulu intangible assets from diluted EPS to enhance comparability and provide a clearer evaluation of operational performance, particularly given the significant impact of the 2019 TFCF and Hulu acquisition."
    },
    {
      "title": "FORWARD-LOOKING STATEMENTS",
      "start_index": 21,
      "end_index": 21,
      "node_id": "0039",
      "summary": "The partial document outlines the inclusion of forward-looking statements in an earnings release, emphasizing that these statements are based on management's views and assumptions about future events and business performance. It highlights that actual results may differ materially due to various factors, including company actions (e.g., restructuring, strategic initiatives, cost rationalization), external developments (e.g., economic conditions, competition, consumer behavior, regulatory changes, technological advancements, labor market activities, and natural disasters), and their potential impacts on operations, profitability, content performance, advertising markets, and taxation. The document also references additional risk factors and analyses detailed in the company's filings with the SEC, such as annual and quarterly reports."
    },
    {
      "title": "PREPARED EARNINGS REMARKS AND CONFERENCE CALL INFORMATION",
      "start_index": 22,
      "end_index": 22,
      "node_id": "0040",
      "summary": "The partial document provides information about The Walt Disney Company's prepared management remarks and a conference call scheduled for February 5, 2025, at 8:30 AM EST/5:30 AM PST, accessible via a live webcast on their investor website. It also mentions that a replay of the webcast will be available on the site. Additionally, contact details for Corporate Communications (David Jefferson) and Investor Relations (Carlos Gomez) are provided."
    }
  ]
}
```
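
Unlike the structure above, this file also carries per-node `summary` fields (and a top-level `doc_description`), which is the material a reasoning-based retriever would hand to an LLM when deciding which nodes are relevant to a query. A small sketch that collects those summaries, assuming only the fields visible in the file (the `collect_summaries` helper is illustrative):

```py
import json

def collect_summaries(nodes, acc=None):
    """Gather (node_id, title, summary) triples from a structure tree.
    Summary fields appear when indexing ran with node summaries enabled."""
    if acc is None:
        acc = []
    for node in nodes:
        if "summary" in node:
            acc.append((node["node_id"], node["title"], node["summary"]))
        collect_summaries(node.get("nodes", []), acc)
    return acc

with open("./results/q1-fy25-earnings_structure.json", encoding="utf-8") as f:
    doc = json.load(f)

# A reasoning-based retriever could pass these (node_id, summary) pairs to
# an LLM and ask which nodes answer a query; that step is omitted here.
for node_id, title, summary in collect_summaries(doc["structure"]):
    print(f"[{node_id}] {title}: {summary[:80]}...")
```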

## /run_pageindex.py

```py path="/run_pageindex.py" 
import argparse
import json  # used below to write the output structure
import os    # used below to derive the output filename

from pageindex import *

if __name__ == "__main__":
    # Set up argument parser
    parser = argparse.ArgumentParser(description='Process PDF document and generate structure')
    parser.add_argument('--pdf_path', type=str, help='Path to the PDF file')
    parser.add_argument('--model', type=str, default='gpt-4o-2024-11-20', help='Model to use')
    parser.add_argument('--toc-check-pages', type=int, default=20, 
                      help='Number of pages to check for table of contents')
    parser.add_argument('--max-pages-per-node', type=int, default=10,
                      help='Maximum number of pages per node')
    parser.add_argument('--max-tokens-per-node', type=int, default=20000,
                      help='Maximum number of tokens per node')
    parser.add_argument('--if-add-node-id', type=str, default='yes',
                      help='Whether to add node id to the node')
    parser.add_argument('--if-add-node-summary', type=str, default='no',
                      help='Whether to add summary to the node')
    parser.add_argument('--if-add-doc-description', type=str, default='yes',
                      help='Whether to add doc description to the doc')
    parser.add_argument('--if-add-node-text', type=str, default='no',
                      help='Whether to add text to the node')
    args = parser.parse_args()

    # Configure options
    opt = config(
        model=args.model,
        toc_check_page_num=args.toc_check_pages,
        max_page_num_each_node=args.max_pages_per_node,
        max_token_num_each_node=args.max_tokens_per_node,
        if_add_node_id=args.if_add_node_id,
        if_add_node_summary=args.if_add_node_summary,
        if_add_doc_description=args.if_add_doc_description,
        if_add_node_text=args.if_add_node_text
    )

    # Process the PDF
    toc_with_page_number = page_index_main(args.pdf_path, opt)
    print('Parsing done, saving to file...')
    
    # Save results
    pdf_name = os.path.splitext(os.path.basename(args.pdf_path))[0]    
    os.makedirs('./results', exist_ok=True)
    
    with open(f'./results/{pdf_name}_structure.json', 'w', encoding='utf-8') as f:
        json.dump(toc_with_page_number, f, indent=2)
```
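
With the flags defined above, a typical invocation looks like `python run_pageindex.py --pdf_path ./docs/q1-fy25-earnings.pdf --if-add-node-summary yes`, which writes the resulting tree to `./results/q1-fy25-earnings_structure.json` in the format shown in the earlier sections.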

