```
├── .gitignore
├── .python-version
├── LICENSE (omitted)
├── README.md (400 tokens)
├── app.py (3.1k tokens)
├── cli.py (1000 tokens)
├── configs/
   ├── lang_vocab.json (100 tokens)
   ├── train_manifest.json (100 tokens)
├── dataset/
   ├── multilang_tts_dataset.py (400 tokens)
├── dia/
   ├── __init__.py
   ├── audio.py (2000 tokens)
   ├── config.py (1700 tokens)
   ├── layers.py (6.2k tokens)
   ├── model.py (600 tokens)
   ├── static/
      ├── images/
         ├── banner.png
├── docker/
   ├── Dockerfile (100 tokens)
   ├── launch.sh (100 tokens)
├── example/
   ├── simple.py (100 tokens)
   ├── voice_clone.py (200 tokens)
├── example_prompt.mp3
├── pyproject.toml (300 tokens)
├── scripts/
   ├── infer_dia.py (300 tokens)
   ├── train_dia.py (300 tokens)
   ├── validate.py (100 tokens)
├── tests/
   ├── test_spk_injection.py (100 tokens)
   ├── visualize_speakers.ipynb (200 tokens)
├── tools/
   ├── audio_cleaner.py (100 tokens)
   ├── hf_dataset_loader.py (100 tokens)
   ├── speaker_encoder.py (200 tokens)
   ├── test_manifest_gen.py (100 tokens)
├── uv.lock (omitted)
```


## /.gitignore

```gitignore path="/.gitignore" 
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info

# Virtual environments
.venv

.gradio

**/*.pth
**/*.mp3
!example_prompt.mp3
**/*.txt

.ruff_cache
.ipynb_checkpoints
config.json
```

## /.python-version

```python-version path="/.python-version" 
3.10

```

## /README.md

# DIA-Multilingual (StyleTTS2-based)

This is a fork of the original DIA model extended for **multilingual TTS**, with support for 30+ languages (same as ElevenLabs). Built on top of StyleTTS2 with language token injection, espeak-ng phonemization, and support for reference audio-based style transfer.

---

## 🧠 Supported Languages

Supports over 30 languages via `<lang>` token injection, including:

`en`, `es`, `de`, `fr`, `it`, `pt`, `pl`, `ro`, `nl`, `tr`, `sv`, `cs`, `el`, `hu`, `fi`, `da`, `sk`, `bg`, `hi`, `ar`, `zh`, `ja`, `ko`, ...

---

## 🚀 Quickstart (RunPod)

**Build your container:**
```bash
docker build -t dia-multilang -f docker/Dockerfile .
```

**Launch training (inside container):**
```bash
bash docker/launch.sh
```

This will:
- Load `train_manifest.json` and `lang_vocab.json`
- Start training from scratch using espeak-based phoneme inputs
- Save checkpoints in `/workspace/checkpoints`

---

## 🧾 File Structure

```bash
├── dataset/
│   └── multilang_tts_dataset.py     # Dataset + phonemizer + collate_fn
├── scripts/
│   ├── train_dia.py                 # Main training loop
│   ├── validate.py                  # Eval script (loss only)
│   └── infer_dia.py                 # Generate audio from text
├── docker/
│   ├── Dockerfile                   # GPU-enabled training container
│   └── launch.sh                    # Entrypoint script
├── lang_vocab.json                  # Maps <lang> → token_id
├── train_manifest.json              # Manifest (audio, text, lang)
```

---

## 🎙️ Inference

Generate audio from text using:

```bash
python3 scripts/infer_dia.py \
  --model_path checkpoints/epoch49.pt \
  --lang_vocab lang_vocab.json \
  --text "Ciao, come stai?" \
  --lang it \
  --output_dir samples/
```

To use reference audio (zero-shot style cloning):

```bash
  --reference_wav samples/italian_female.wav
```

---

## 💡 Notes

- Uses espeak-ng to phonemize all input text (per-language IPA)
- Pretrained `xlm-roberta-base` recommended for phoneme encodings
- Output speech is high-fidelity and respects cross-language style transfer

---

## 🧠 Credits
Built on top of:
- [DIA](https://github.com/nari-labs/dia)
- [StyleTTS2](https://github.com/yl4579/StyleTTS2)

## /app.py

```py path="/app.py" 
import argparse
import tempfile
import time
from pathlib import Path
from typing import Optional, Tuple

import gradio as gr
import numpy as np
import soundfile as sf
import torch

from dia.model import Dia


# --- Global Setup ---
parser = argparse.ArgumentParser(description="Gradio interface for Nari TTS")
parser.add_argument(
    "--device", type=str, default=None, help="Force device (e.g., 'cuda', 'mps', 'cpu')"
)
parser.add_argument("--share", action="store_true", help="Enable Gradio sharing")

args = parser.parse_args()


# Determine device
if args.device:
    device = torch.device(args.device)
elif torch.cuda.is_available():
    device = torch.device("cuda")
# Simplified MPS check for broader compatibility
elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    # Basic check is usually sufficient, detailed check can be problematic
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"Using device: {device}")

# Load Nari model and config
print("Loading Nari model...")
try:
    # Use the function from inference.py
    model = Dia.from_pretrained("nari-labs/Dia-1.6B", device=device)
except Exception as e:
    print(f"Error loading Nari model: {e}")
    raise


def run_inference(
    text_input: str,
    audio_prompt_input: Optional[Tuple[int, np.ndarray]],
    max_new_tokens: int,
    cfg_scale: float,
    temperature: float,
    top_p: float,
    cfg_filter_top_k: int,
    speed_factor: float,
):
    """
    Runs Nari inference using the globally loaded model and provided inputs.
    Uses temporary files for text and audio prompt compatibility with inference.generate.
    """
    global model, device  # Access global model, config, device

    if not text_input or text_input.isspace():
        raise gr.Error("Text input cannot be empty.")

    temp_txt_file_path = None
    temp_audio_prompt_path = None
    output_audio = (44100, np.zeros(1, dtype=np.float32))

    try:
        prompt_path_for_generate = None
        if audio_prompt_input is not None:
            sr, audio_data = audio_prompt_input
            # Check if audio_data is valid
            if (
                audio_data is None or audio_data.size == 0 or audio_data.max() == 0
            ):  # Check for silence/empty
                gr.Warning("Audio prompt seems empty or silent, ignoring prompt.")
            else:
                # Save prompt audio to a temporary WAV file
                with tempfile.NamedTemporaryFile(
                    mode="wb", suffix=".wav", delete=False
                ) as f_audio:
                    temp_audio_prompt_path = f_audio.name  # Store path for cleanup

                    # Basic audio preprocessing for consistency
                    # Convert to float32 in [-1, 1] range if integer type
                    if np.issubdtype(audio_data.dtype, np.integer):
                        max_val = np.iinfo(audio_data.dtype).max
                        audio_data = audio_data.astype(np.float32) / max_val
                    elif not np.issubdtype(audio_data.dtype, np.floating):
                        gr.Warning(
                            f"Unsupported audio prompt dtype {audio_data.dtype}, attempting conversion."
                        )
                        # Attempt conversion, might fail for complex types
                        try:
                            audio_data = audio_data.astype(np.float32)
                        except Exception as conv_e:
                            raise gr.Error(
                                f"Failed to convert audio prompt to float32: {conv_e}"
                            )

                    # Ensure mono (average channels if stereo)
                    if audio_data.ndim > 1:
                        if audio_data.shape[0] == 2:  # Assume (2, N)
                            audio_data = np.mean(audio_data, axis=0)
                        elif audio_data.shape[1] == 2:  # Assume (N, 2)
                            audio_data = np.mean(audio_data, axis=1)
                        else:
                            gr.Warning(
                                f"Audio prompt has unexpected shape {audio_data.shape}, taking first channel/axis."
                            )
                            audio_data = (
                                audio_data[0]
                                if audio_data.shape[0] < audio_data.shape[1]
                                else audio_data[:, 0]
                            )
                        audio_data = np.ascontiguousarray(
                            audio_data
                        )  # Ensure contiguous after slicing/mean

                    # Write using soundfile
                    try:
                        sf.write(
                            temp_audio_prompt_path, audio_data, sr, subtype="FLOAT"
                        )  # Explicitly use FLOAT subtype
                        prompt_path_for_generate = temp_audio_prompt_path
                        print(
                            f"Created temporary audio prompt file: {temp_audio_prompt_path} (orig sr: {sr})"
                        )
                    except Exception as write_e:
                        print(f"Error writing temporary audio file: {write_e}")
                        raise gr.Error(f"Failed to save audio prompt: {write_e}")

        # 3. Run Generation

        start_time = time.time()

        # Use torch.inference_mode() context manager for the generation call
        with torch.inference_mode():
            output_audio_np = model.generate(
                text_input,
                max_tokens=max_new_tokens,
                cfg_scale=cfg_scale,
                temperature=temperature,
                top_p=top_p,
                use_cfg_filter=True,
                cfg_filter_top_k=cfg_filter_top_k,  # Pass the value here
                use_torch_compile=False,  # Keep False for Gradio stability
                audio_prompt_path=prompt_path_for_generate,
            )

        end_time = time.time()
        print(f"Generation finished in {end_time - start_time:.2f} seconds.")

        # 4. Convert Codes to Audio
        if output_audio_np is not None:
            # Get sample rate from the loaded DAC model
            output_sr = 44100

            # --- Slow down audio ---
            original_len = len(output_audio_np)
            # Ensure speed_factor is positive and not excessively small/large to avoid issues
            speed_factor = max(0.1, min(speed_factor, 5.0))
            target_len = int(
                original_len / speed_factor
            )  # Target length based on speed_factor
            if (
                target_len != original_len and target_len > 0
            ):  # Only interpolate if length changes and is valid
                x_original = np.arange(original_len)
                x_resampled = np.linspace(0, original_len - 1, target_len)
                resampled_audio_np = np.interp(x_resampled, x_original, output_audio_np)
                output_audio = (
                    output_sr,
                    resampled_audio_np.astype(np.float32),
                )  # Use resampled audio
                print(
                    f"Resampled audio from {original_len} to {target_len} samples for {speed_factor:.2f}x speed."
                )
            else:
                output_audio = (
                    output_sr,
                    output_audio_np,
                )  # Keep original if calculation fails or no change
                print(f"Skipping audio speed adjustment (factor: {speed_factor:.2f}).")
            # --- End slowdown ---

            print(
                f"Audio conversion successful. Final shape: {output_audio[1].shape}, Sample Rate: {output_sr}"
            )

        else:
            print("\nGeneration finished, but no valid tokens were produced.")
            # Return default silence
            gr.Warning("Generation produced no output.")

    except Exception as e:
        print(f"Error during inference: {e}")
        import traceback

        traceback.print_exc()
        # Re-raise as Gradio error to display nicely in the UI
        raise gr.Error(f"Inference failed: {e}")

    finally:
        # 5. Cleanup Temporary Files defensively
        if temp_txt_file_path and Path(temp_txt_file_path).exists():
            try:
                Path(temp_txt_file_path).unlink()
                print(f"Deleted temporary text file: {temp_txt_file_path}")
            except OSError as e:
                print(
                    f"Warning: Error deleting temporary text file {temp_txt_file_path}: {e}"
                )
        if temp_audio_prompt_path and Path(temp_audio_prompt_path).exists():
            try:
                Path(temp_audio_prompt_path).unlink()
                print(f"Deleted temporary audio prompt file: {temp_audio_prompt_path}")
            except OSError as e:
                print(
                    f"Warning: Error deleting temporary audio prompt file {temp_audio_prompt_path}: {e}"
                )

    return output_audio


# --- Create Gradio Interface ---
css = """
#col-container {max-width: 90%; margin-left: auto; margin-right: auto;}
"""
# Attempt to load default text from example.txt
default_text = "[S1] Dia is an open weights text to dialogue model. \n[S2] You get full control over scripts and voices. \n[S1] Wow. Amazing. (laughs) \n[S2] Try it now on Git hub or Hugging Face."
example_txt_path = Path("./example.txt")
if example_txt_path.exists():
    try:
        default_text = example_txt_path.read_text(encoding="utf-8").strip()
        if not default_text:  # Handle empty example file
            default_text = "Example text file was empty."
    except Exception as e:
        print(f"Warning: Could not read example.txt: {e}")


# Build Gradio UI
with gr.Blocks(css=css) as demo:
    gr.Markdown("# Nari Text-to-Speech Synthesis")

    with gr.Row(equal_height=False):
        with gr.Column(scale=1):
            text_input = gr.Textbox(
                label="Input Text",
                placeholder="Enter text here...",
                value=default_text,
                lines=5,  # Increased lines
            )
            audio_prompt_input = gr.Audio(
                label="Audio Prompt (Optional)",
                show_label=True,
                sources=["upload", "microphone"],
                type="numpy",
            )
            with gr.Accordion("Generation Parameters", open=False):
                max_new_tokens = gr.Slider(
                    label="Max New Tokens (Audio Length)",
                    minimum=860,
                    maximum=3072,
                    value=model.config.data.audio_length,  # Use config default if available, else fallback
                    step=50,
                    info="Controls the maximum length of the generated audio (more tokens = longer audio).",
                )
                cfg_scale = gr.Slider(
                    label="CFG Scale (Guidance Strength)",
                    minimum=1.0,
                    maximum=5.0,
                    value=3.0,  # Default from inference.py
                    step=0.1,
                    info="Higher values increase adherence to the text prompt.",
                )
                temperature = gr.Slider(
                    label="Temperature (Randomness)",
                    minimum=1.0,
                    maximum=1.5,
                    value=1.3,  # Default from inference.py
                    step=0.05,
                    info="Lower values make the output more deterministic, higher values increase randomness.",
                )
                top_p = gr.Slider(
                    label="Top P (Nucleus Sampling)",
                    minimum=0.80,
                    maximum=1.0,
                    value=0.95,  # Default from inference.py
                    step=0.01,
                    info="Filters vocabulary to the most likely tokens cumulatively reaching probability P.",
                )
                cfg_filter_top_k = gr.Slider(
                    label="CFG Filter Top K",
                    minimum=15,
                    maximum=50,
                    value=30,
                    step=1,
                    info="Top k filter for CFG guidance.",
                )
                speed_factor_slider = gr.Slider(
                    label="Speed Factor",
                    minimum=0.8,
                    maximum=1.0,
                    value=0.94,
                    step=0.02,
                    info="Adjusts the speed of the generated audio (1.0 = original speed).",
                )

            run_button = gr.Button("Generate Audio", variant="primary")

        with gr.Column(scale=1):
            audio_output = gr.Audio(
                label="Generated Audio",
                type="numpy",
                autoplay=False,
            )

    # Link button click to function
    run_button.click(
        fn=run_inference,
        inputs=[
            text_input,
            audio_prompt_input,
            max_new_tokens,
            cfg_scale,
            temperature,
            top_p,
            cfg_filter_top_k,
            speed_factor_slider,
        ],
        outputs=[audio_output],  # Add status_output here if using it
        api_name="generate_audio",
    )

    # Add examples (ensure the prompt path is correct or remove it if example file doesn't exist)
    example_prompt_path = "./example_prompt.mp3"  # Adjust if needed
    examples_list = [
        [
            "[S1] Oh fire! Oh my goodness! What's the procedure? What to we do people? The smoke could be coming through an air duct! \n[S2] Oh my god! Okay.. it's happening. Everybody stay calm! \n[S1] What's the procedure... \n[S2] Everybody stay fucking calm!!!... Everybody fucking calm down!!!!! \n[S1] No! No! If you touch the handle, if its hot there might be a fire down the hallway! ",
            None,
            3072,
            3.0,
            1.3,
            0.95,
            35,
            0.94,
        ],
        [
            "[S1] Open weights text to dialogue model. \n[S2] You get full control over scripts and voices. \n[S1] I'm biased, but I think we clearly won. \n[S2] Hard to disagree. (laughs) \n[S1] Thanks for listening to this demo. \n[S2] Try it now on Git hub and Hugging Face. \n[S1] If you liked our model, please give us a star and share to your friends. \n[S2] This was Nari Labs.",
            example_prompt_path if Path(example_prompt_path).exists() else None,
            3072,
            3.0,
            1.3,
            0.95,
            35,
            0.94,
        ],
    ]

    if examples_list:
        gr.Examples(
            examples=examples_list,
            inputs=[
                text_input,
                audio_prompt_input,
                max_new_tokens,
                cfg_scale,
                temperature,
                top_p,
                cfg_filter_top_k,
                speed_factor_slider,
            ],
            outputs=[audio_output],
            fn=run_inference,
            cache_examples=False,
            label="Examples (Click to Run)",
        )
    else:
        gr.Markdown("_(No examples configured or example prompt file missing)_")


# --- Launch the App ---
if __name__ == "__main__":
    print("Launching Gradio interface...")
    demo.launch(share=args.share)

```

## /cli.py

```py path="/cli.py" 
import argparse
import os
import random

import numpy as np
import soundfile as sf
import torch

from dia.model import Dia


def set_seed(seed: int):
    """Sets the random seed for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    # Ensure deterministic behavior for cuDNN (if used)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


def main():
    parser = argparse.ArgumentParser(description="Generate audio using the Dia model.")

    parser.add_argument("text", type=str, help="Input text for speech generation.")
    parser.add_argument(
        "--output", type=str, required=True, help="Path to save the generated audio file (e.g., output.wav)."
    )

    parser.add_argument(
        "--repo-id",
        type=str,
        default="nari-labs/Dia-1.6B",
        help="Hugging Face repository ID (e.g., nari-labs/Dia-1.6B).",
    )
    parser.add_argument(
        "--local-paths", action="store_true", help="Load model from local config and checkpoint files."
    )

    parser.add_argument(
        "--config", type=str, help="Path to local config.json file (required if --local-paths is set)."
    )
    parser.add_argument(
        "--checkpoint", type=str, help="Path to local model checkpoint .pth file (required if --local-paths is set)."
    )
    parser.add_argument(
        "--audio-prompt", type=str, default=None, help="Path to an optional audio prompt WAV file for voice cloning."
    )

    gen_group = parser.add_argument_group("Generation Parameters")
    gen_group.add_argument(
        "--max-tokens",
        type=int,
        default=None,
        help="Maximum number of audio tokens to generate (defaults to config value).",
    )
    gen_group.add_argument(
        "--cfg-scale", type=float, default=3.0, help="Classifier-Free Guidance scale (default: 3.0)."
    )
    gen_group.add_argument(
        "--temperature", type=float, default=1.3, help="Sampling temperature (higher is more random, default: 0.7)."
    )
    gen_group.add_argument("--top-p", type=float, default=0.95, help="Nucleus sampling probability (default: 0.95).")

    infra_group = parser.add_argument_group("Infrastructure")
    infra_group.add_argument("--seed", type=int, default=None, help="Random seed for reproducibility.")
    infra_group.add_argument(
        "--device",
        type=str,
        default="cuda" if torch.cuda.is_available() else "cpu",
        help="Device to run inference on (e.g., 'cuda', 'cpu', default: auto).",
    )

    args = parser.parse_args()

    # Validation for local paths
    if args.local_paths:
        if not args.config:
            parser.error("--config is required when --local-paths is set.")
        if not args.checkpoint:
            parser.error("--checkpoint is required when --local-paths is set.")
        if not os.path.exists(args.config):
            parser.error(f"Config file not found: {args.config}")
        if not os.path.exists(args.checkpoint):
            parser.error(f"Checkpoint file not found: {args.checkpoint}")

    # Set seed if provided
    if args.seed is not None:
        set_seed(args.seed)
        print(f"Using random seed: {args.seed}")

    # Determine device
    device = torch.device(args.device)
    print(f"Using device: {device}")

    # Load model
    print("Loading model...")
    if args.local_paths:
        print(f"Loading from local paths: config='{args.config}', checkpoint='{args.checkpoint}'")
        try:
            model = Dia.from_local(args.config, args.checkpoint, device=device)
        except Exception as e:
            print(f"Error loading local model: {e}")
            exit(1)
    else:
        print(f"Loading from Hugging Face Hub: repo_id='{args.repo_id}'")
        try:
            model = Dia.from_pretrained(args.repo_id, device=device)
        except Exception as e:
            print(f"Error loading model from Hub: {e}")
            exit(1)
    print("Model loaded.")

    # Generate audio
    print("Generating audio...")
    try:
        sample_rate = 44100  # Default assumption

        output_audio = model.generate(
            text=args.text,
            audio_prompt_path=args.audio_prompt,
            max_tokens=args.max_tokens,
            cfg_scale=args.cfg_scale,
            temperature=args.temperature,
            top_p=args.top_p,
        )
        print("Audio generation complete.")

        print(f"Saving audio to {args.output}...")
        os.makedirs(os.path.dirname(args.output) or ".", exist_ok=True)

        sf.write(args.output, output_audio, sample_rate)
        print(f"Audio successfully saved to {args.output}")

    except Exception as e:
        print(f"Error during audio generation or saving: {e}")
        exit(1)


if __name__ == "__main__":
    main()

```

## /configs/lang_vocab.json

```json path="/configs/lang_vocab.json" 
{
  "<en>": 0,
  "<es>": 1,
  "<de>": 2,
  "<fr>": 3,
  "<it>": 4,
  "<pt>": 5,
  "<pl>": 6,
  "<ro>": 7,
  "<nl>": 8,
  "<tr>": 9,
  "<sv>": 10,
  "<cs>": 11,
  "<el>": 12,
  "<hu>": 13,
  "<fi>": 14,
  "<da>": 15,
  "<sk>": 16,
  "<bg>": 17,
  "<hr>": 18,
  "<lt>": 19,
  "<lv>": 20,
  "<et>": 21,
  "<sl>": 22,
  "<hi>": 23,
  "<ar>": 24,
  "<ru>": 25,
  "<he>": 26,
  "<zh>": 27,
  "<ja>": 28,
  "<ko>": 29
}
```

## /configs/train_manifest.json

```json path="/configs/train_manifest.json" 
[
  {
    "audio": "data/en/clips/sample1.wav",
    "text": "Hello world",
    "phonemes": "h\u0259\u02c8lo\u028a w\u025d\u02d0ld",
    "lang": "en"
  },
  {
    "audio": "data/es/clips/sample2.wav",
    "text": "Hola mundo",
    "phonemes": "ola mundo",
    "lang": "es"
  }
]
```

## /dataset/multilang_tts_dataset.py

```py path="/dataset/multilang_tts_dataset.py" 
import json
import torch
from torch.utils.data import Dataset
from pathlib import Path
import torchaudio

class MultilangTTSDataset(Dataset):
    def __init__(self, manifest_path, lang_vocab, sample_rate=22050):
        self.data = json.load(open(manifest_path))
        self.lang_vocab = lang_vocab
        self.sample_rate = sample_rate
        self.phoneme_tokenizer = self._build_tokenizer()

    def _build_tokenizer(self):
        from collections import Counter
        chars = Counter()
        for sample in self.data:
            chars.update(sample["phonemes"])
        vocab = {c: i+10 for i, c in enumerate(sorted(chars))}
        vocab["<pad>"] = 0
        vocab["<unk>"] = 1
        self.phoneme_vocab = vocab
        return lambda p: [vocab.get(c, 1) for c in p]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]
        path = Path(sample["audio"])
        wav, sr = torchaudio.load(path)
        if sr != self.sample_rate:
            wav = torchaudio.functional.resample(wav, sr, self.sample_rate)
        wav = wav.squeeze(0)

        phoneme_ids = self.phoneme_tokenizer(sample["phonemes"])
        lang_token = f"<{sample['lang']}>"
        lang_token_id = self.lang_vocab.get(lang_token, self.lang_vocab.get("<unk>", 0))
        input_ids = [lang_token_id] + phoneme_ids

        return {
            "input_ids": torch.tensor(input_ids, dtype=torch.long),
            "waveform": wav,
            "lang": sample["lang"],
            "lang_token_id": lang_token_id,
            "path": str(path)
        }

def collate_fn(batch):
    from torch.nn.utils.rnn import pad_sequence
    input_seqs = [b["input_ids"] for b in batch]
    waveforms = [b["waveform"] for b in batch]
    padded_inputs = pad_sequence(input_seqs, batch_first=True, padding_value=0)
    wav_lens = [len(w) for w in waveforms]
    padded_audio = pad_sequence(waveforms, batch_first=True)

    return {
        "input_ids": padded_inputs,
        "audio": padded_audio,
        "audio_lens": wav_lens,
        "langs": [b["lang"] for b in batch],
        "lang_token_ids": torch.tensor([b["lang_token_id"] for b in batch])
    }
```

## /dia/__init__.py

```py path="/dia/__init__.py" 

```

## /dia/audio.py

```py path="/dia/audio.py" 
import typing as tp

import torch

from .config import DataConfig


def build_delay_indices(B: int, T: int, C: int, delay_pattern: tp.List[int]) -> tp.Tuple[torch.Tensor, torch.Tensor]:
    """
    Precompute (t_idx_BxTxC, indices_BTCx3) so that out[t, c] = in[t - delay[c], c].
    Negative t_idx => BOS; t_idx >= T => PAD.
    """
    delay_arr = torch.tensor(delay_pattern, dtype=torch.int32)

    t_idx_BxT = torch.broadcast_to(
        torch.arange(T, dtype=torch.int32)[None, :],
        [B, T],
    )
    t_idx_BxTx1 = t_idx_BxT[..., None]
    t_idx_BxTxC = t_idx_BxTx1 - delay_arr.view(1, 1, C)

    b_idx_BxTxC = torch.broadcast_to(
        torch.arange(B, dtype=torch.int32).view(B, 1, 1),
        [B, T, C],
    )
    c_idx_BxTxC = torch.broadcast_to(
        torch.arange(C, dtype=torch.int32).view(1, 1, C),
        [B, T, C],
    )

    # We must clamp time indices to [0..T-1] so gather_nd equivalent won't fail
    t_clamped_BxTxC = torch.clamp(t_idx_BxTxC, 0, T - 1)

    indices_BTCx3 = torch.stack(
        [
            b_idx_BxTxC.reshape(-1),
            t_clamped_BxTxC.reshape(-1),
            c_idx_BxTxC.reshape(-1),
        ],
        dim=1,
    ).long()  # Ensure indices are long type for indexing

    return t_idx_BxTxC, indices_BTCx3


def apply_audio_delay(
    audio_BxTxC: torch.Tensor,
    pad_value: int,
    bos_value: int,
    precomp: tp.Tuple[torch.Tensor, torch.Tensor],
) -> torch.Tensor:
    """
    Applies the delay pattern to batched audio tokens using precomputed indices,
    inserting BOS where t_idx < 0 and PAD where t_idx >= T.

    Args:
        audio_BxTxC: [B, T, C] int16 audio tokens (or int32/float)
        pad_value: the padding token
        bos_value: the BOS token
        precomp:  (t_idx_BxTxC, indices_BTCx3) from build_delay_indices

    Returns:
        result_BxTxC: [B, T, C] delayed audio tokens
    """
    device = audio_BxTxC.device  # Get device from input tensor
    t_idx_BxTxC, indices_BTCx3 = precomp
    t_idx_BxTxC = t_idx_BxTxC.to(device)  # Move precomputed indices to device
    indices_BTCx3 = indices_BTCx3.to(device)

    # Equivalent of tf.gather_nd using advanced indexing
    # Ensure indices are long type if not already (build_delay_indices should handle this)
    gathered_flat = audio_BxTxC[indices_BTCx3[:, 0], indices_BTCx3[:, 1], indices_BTCx3[:, 2]]
    gathered_BxTxC = gathered_flat.view(audio_BxTxC.shape)

    # Create masks on the correct device
    mask_bos = t_idx_BxTxC < 0  # => place bos_value
    mask_pad = t_idx_BxTxC >= audio_BxTxC.shape[1]  # => place pad_value

    # Create scalar tensors on the correct device
    bos_tensor = torch.tensor(bos_value, dtype=audio_BxTxC.dtype, device=device)
    pad_tensor = torch.tensor(pad_value, dtype=audio_BxTxC.dtype, device=device)

    # If mask_bos, BOS; else if mask_pad, PAD; else original gather
    # All tensors should now be on the same device
    result_BxTxC = torch.where(mask_bos, bos_tensor, torch.where(mask_pad, pad_tensor, gathered_BxTxC))

    return result_BxTxC


@torch.no_grad()
@torch.inference_mode()
def audio_to_codebook(
    model,
    input_values,
    data_config: DataConfig,
    padding_mask=None,
    sample_rate=44100,
):
    """
    Encodes the input audio waveform into discrete codes.

    Args:
        model: The model to use for encoding.
        input_values (`torch.Tensor` of shape `(batch_size, channels, sequence_length)`):
            Float values of the input audio waveform.
        padding_mask (`torch.Tensor` of shape `(batch_size, channels, sequence_length)`):
            Padding mask used to pad the `input_values`.
        sample_rate (`int`, *optional*) :
            Signal sampling_rate

    Returns:
        A list of frames containing the discrete encoded codes for the input audio waveform, along with rescaling
        factors for each chunk when `normalize` is True. Each frames is a tuple `(codebook, scale)`, with
        `codebook` of shape `[batch_size, num_codebooks, frames]`.
        Scale is not used here.

    """
    audio_data = model.preprocess(input_values, sample_rate)

    if padding_mask is None:
        padding_mask = torch.ones_like(input_values).bool()

    _, encoded_frame, _, _, _ = model.encode(audio_data, n_quantizers=None)  # 1, C, T
    seq_length = encoded_frame.shape[2]

    t_idx_BxTxC, indices_BTCx3 = build_delay_indices(
        B=1,
        T=seq_length,
        C=data_config.channels,
        delay_pattern=data_config.delay_pattern,
    )

    encoded_frame = apply_audio_delay(
        audio_BxTxC=encoded_frame.transpose(1, 2),  # 1, T, C
        pad_value=data_config.audio_pad_value,
        bos_value=data_config.audio_bos_value,
        precomp=(t_idx_BxTxC, indices_BTCx3),
    )

    return encoded_frame


def build_revert_indices(B: int, T: int, C: int, delay_pattern: tp.List[int]) -> tp.Tuple[torch.Tensor, torch.Tensor]:
    """
    Precompute indices for the revert operation using PyTorch.

    Returns:
        A tuple (t_idx_BxTxC, indices_BTCx3) where:
            - t_idx_BxTxC is a tensor of shape [B, T, C] computed as time indices plus the delay.
            - indices_BTCx3 is a tensor of shape [B*T*C, 3] used for gathering, computed from:
                batch indices, clamped time indices, and channel indices.
    """
    # Use default device unless specified otherwise; assumes inputs might define device later
    device = None  # Or determine dynamically if needed, e.g., from a model parameter

    delay_arr = torch.tensor(delay_pattern, dtype=torch.int32, device=device)

    t_idx_BT1 = torch.broadcast_to(torch.arange(T, device=device).unsqueeze(0), [B, T])
    t_idx_BT1 = t_idx_BT1.unsqueeze(-1)

    t_idx_BxTxC = torch.minimum(
        t_idx_BT1 + delay_arr.view(1, 1, C),
        torch.tensor(T - 1, device=device),
    )
    b_idx_BxTxC = torch.broadcast_to(torch.arange(B, device=device).view(B, 1, 1), [B, T, C])
    c_idx_BxTxC = torch.broadcast_to(torch.arange(C, device=device).view(1, 1, C), [B, T, C])

    indices_BTCx3 = torch.stack(
        [
            b_idx_BxTxC.reshape(-1),
            t_idx_BxTxC.reshape(-1),
            c_idx_BxTxC.reshape(-1),
        ],
        axis=1,
    ).long()  # Ensure indices are long type

    return t_idx_BxTxC, indices_BTCx3


def revert_audio_delay(
    audio_BxTxC: torch.Tensor,
    pad_value: int,
    precomp: tp.Tuple[torch.Tensor, torch.Tensor],
    T: int,
) -> torch.Tensor:
    """
    Reverts a delay pattern from batched audio tokens using precomputed indices (PyTorch version).

    Args:
        audio_BxTxC: Input delayed audio tensor
        pad_value: Padding value for out-of-bounds indices
        precomp: Precomputed revert indices tuple containing:
            - t_idx_BxTxC: Time offset indices tensor
            - indices_BTCx3: Gather indices tensor for original audio
        T: Original sequence length before padding

    Returns:
        Reverted audio tensor with same shape as input
    """
    t_idx_BxTxC, indices_BTCx3 = precomp
    device = audio_BxTxC.device  # Get device from input tensor

    # Move precomputed indices to the same device as audio_BxTxC if they aren't already
    t_idx_BxTxC = t_idx_BxTxC.to(device)
    indices_BTCx3 = indices_BTCx3.to(device)

    # Using PyTorch advanced indexing (equivalent to tf.gather_nd or np equivalent)
    gathered_flat = audio_BxTxC[indices_BTCx3[:, 0], indices_BTCx3[:, 1], indices_BTCx3[:, 2]]
    gathered_BxTxC = gathered_flat.view(audio_BxTxC.size())  # Use .size() for robust reshaping

    # Create pad_tensor on the correct device
    pad_tensor = torch.tensor(pad_value, dtype=audio_BxTxC.dtype, device=device)
    # Create T tensor on the correct device for comparison
    T_tensor = torch.tensor(T, device=device)

    result_BxTxC = torch.where(t_idx_BxTxC >= T_tensor, pad_tensor, gathered_BxTxC)  # Changed np.where to torch.where

    return result_BxTxC


@torch.no_grad()
@torch.inference_mode()
def decode(
    model,
    audio_codes,
):
    """
    Decodes the given frames into an output audio waveform
    """
    if len(audio_codes) != 1:
        raise ValueError(f"Expected one frame, got {len(audio_codes)}")

    try:
        audio_values = model.quantizer.from_codes(audio_codes)
        audio_values = model.decode(audio_values[0])

        return audio_values
    except Exception as e:
        print(f"Error in decode method: {str(e)}")
        raise


def codebook_to_audio(generated_codes: torch.Tensor, model, delay_pattern, B=1, T=2600, C=9):
    """Process a single codebook file to generate audio"""
    # Remove BOS token
    generated_codes = generated_codes[:, 1:]

    if generated_codes.shape[1] > T:
        generated_codes = generated_codes[:, :T]

    seq_length = generated_codes.shape[1]

    # Build revert indices
    t_idx_BxTxC, indices_BTCx3 = build_revert_indices(B=B, T=seq_length, C=C, delay_pattern=delay_pattern)

    # Transpose and add batch dimension
    audio_BxTxC = generated_codes.transpose(1, 0).unsqueeze(0)
    reverted_codebook = revert_audio_delay(
        audio_BxTxC=audio_BxTxC,
        pad_value=0,
        precomp=(t_idx_BxTxC, indices_BTCx3),
        T=seq_length,
    )
    reverted_codebook = reverted_codebook[:, :-30, :]

    codebook = reverted_codebook.transpose(1, 2)

    min_valid_index = 0
    max_valid_index = 1023
    invalid_mask = (codebook < min_valid_index) | (codebook > max_valid_index)

    num_invalid = torch.sum(invalid_mask).item()
    if num_invalid > 0:
        print(f"Warning: Clamping {num_invalid} indices outside range [{min_valid_index}, {max_valid_index}] to 0.")

    # Set invalid values to 0 (modify the tensor in-place)
    codebook[invalid_mask] = 0
    audio_array = decode(model, codebook)

    return audio_array

```

## /dia/config.py

```py path="/dia/config.py" 
"""Configuration management module for the Dia model.

This module provides comprehensive configuration management for the Dia model,
utilizing Pydantic for validation. It defines configurations for data processing,
model architecture (encoder and decoder), and training settings.

Key components:
- DataConfig: Parameters for data loading and preprocessing.
- EncoderConfig: Architecture details for the encoder module.
- DecoderConfig: Architecture details for the decoder module.
- ModelConfig: Combined model architecture settings.
- TrainingConfig: Training hyperparameters and settings.
- DiaConfig: Master configuration combining all components.
"""

import os
from typing import Annotated

from pydantic import BaseModel, BeforeValidator, Field


class DataConfig(BaseModel, frozen=True):
    """Configuration for data loading and preprocessing.

    Attributes:
        text_length: Maximum length of text sequences (must be multiple of 128).
        audio_length: Maximum length of audio sequences (must be multiple of 128).
        channels: Number of audio channels.
        text_pad_value: Value used for padding text sequences.
        audio_eos_value: Value representing the end of audio sequences.
        audio_bos_value: Value representing the beginning of audio sequences.
        audio_pad_value: Value used for padding audio sequences.
        delay_pattern: List of delay values for each audio channel.
    """

    text_length: Annotated[int, BeforeValidator(lambda x: (x + 127) // 128 * 128)] = Field(gt=0, multiple_of=128)
    audio_length: Annotated[int, BeforeValidator(lambda x: (x + 127) // 128 * 128)] = Field(gt=0, multiple_of=128)
    channels: int = Field(default=9, gt=0, multiple_of=1)
    text_pad_value: int = Field(default=0)
    audio_eos_value: int = Field(default=1024)
    audio_pad_value: int = Field(default=1025)
    audio_bos_value: int = Field(default=1026)
    delay_pattern: list[Annotated[int, Field(ge=0)]] = Field(default_factory=lambda: [0, 8, 9, 10, 11, 12, 13, 14, 15])

    def __hash__(self) -> int:
        """Generate a hash based on all fields of the config."""
        return hash(
            (
                self.text_length,
                self.audio_length,
                self.channels,
                self.text_pad_value,
                self.audio_pad_value,
                self.audio_bos_value,
                self.audio_eos_value,
                tuple(self.delay_pattern),
            )
        )


class EncoderConfig(BaseModel, frozen=True):
    """Configuration for the encoder component of the Dia model.

    Attributes:
        n_layer: Number of transformer layers.
        n_embd: Embedding dimension.
        n_hidden: Hidden dimension size in the MLP layers.
        n_head: Number of attention heads.
        head_dim: Dimension per attention head.
        mlp_activations: List of activation functions for the MLP layers.
        use_pre_norm: Whether to use pre-normalization (LayerNorm before attention/MLP).
    """

    n_layer: int = Field(gt=0)
    n_embd: int = Field(gt=0)
    n_hidden: int = Field(gt=0)
    n_head: int = Field(gt=0)
    head_dim: int = Field(gt=0)
    mlp_activations: list[str] = Field(default=["silu", "linear"])
    use_pre_norm: bool = Field(default=False)


class DecoderConfig(BaseModel, frozen=True):
    """Configuration for the decoder component of the Dia model.

    Attributes:
        n_layer: Number of transformer layers.
        n_embd: Embedding dimension.
        n_hidden: Hidden dimension size in the MLP layers.
        gqa_query_heads: Number of query heads for grouped-query self-attention.
        kv_heads: Number of key/value heads for grouped-query self-attention.
        gqa_head_dim: Dimension per query head for grouped-query self-attention.
        cross_query_heads: Number of query heads for cross-attention.
        cross_head_dim: Dimension per cross-attention head.
        mlp_activations: List of activation functions for the MLP layers.
        use_pre_norm: Whether to use pre-normalization.
    """

    n_layer: int = Field(gt=0)
    n_embd: int = Field(gt=0)
    n_hidden: int = Field(gt=0)
    gqa_query_heads: int = Field(gt=0)
    kv_heads: int = Field(gt=0)
    gqa_head_dim: int = Field(gt=0)
    cross_query_heads: int = Field(gt=0)
    cross_head_dim: int = Field(gt=0)
    mlp_activations: list[str] = Field(default=["silu", "linear"])
    use_pre_norm: bool = Field(default=False)


class ModelConfig(BaseModel, frozen=True):
    """Main configuration container for the Dia model architecture.

    Attributes:
        encoder: Configuration for the encoder component.
        decoder: Configuration for the decoder component.
        src_vocab_size: Size of the source (text) vocabulary.
        tgt_vocab_size: Size of the target (audio code) vocabulary.
        dropout: Dropout probability applied within the model.
        normalization_layer_epsilon: Epsilon value for normalization layers (e.g., LayerNorm).
        weight_dtype: Data type for model weights (e.g., "float32", "bfloat16").
        rope_min_timescale: Minimum timescale for Rotary Positional Embeddings (RoPE).
        rope_max_timescale: Maximum timescale for Rotary Positional Embeddings (RoPE).
    """

    encoder: EncoderConfig
    decoder: DecoderConfig
    src_vocab_size: int = Field(default=128, gt=0)
    tgt_vocab_size: int = Field(default=1028, gt=0)
    dropout: float = Field(default=0.0, ge=0.0, lt=1.0)
    normalization_layer_epsilon: float = Field(default=1.0e-5, ge=0.0)
    weight_dtype: str = Field(default="float32", description="Weight precision")
    rope_min_timescale: int = Field(default=1, description="Timescale For global Attention")
    rope_max_timescale: int = Field(default=10_000, description="Timescale For global Attention")


class TrainingConfig(BaseModel, frozen=True):
    """Training process configuration and hyperparameters.

    Note: This configuration currently only includes precision settings.
    Other training parameters (like batch size, learning rate, optimizer settings)
    are assumed to be handled externally.

    Attributes:
        dtype: Data type for activations during training (e.g., "bfloat16", "float32").
        logits_dot_in_fp32: Whether to compute the final logits dot product in fp32 for stability.
    """

    dtype: str = Field(default="bfloat16", description="Activation precision")
    logits_dot_in_fp32: bool = Field(default=False)


class DiaConfig(BaseModel, frozen=True):
    """Master configuration for the Dia model.

    Combines all sub-configurations into a single validated object.

    Attributes:
        version: Configuration version string.
        model: Model architecture configuration.
        training: Training process configuration (precision settings).
        data: Data loading and processing configuration.
    """

    version: str = Field(default="1.0")
    model: ModelConfig
    training: TrainingConfig
    data: DataConfig

    def save(self, path: str) -> None:
        """Save the current configuration instance to a JSON file.

        Ensures the parent directory exists and the file has a .json extension.

        Args:
            path: The target file path to save the configuration.

        Raises:
            ValueError: If the path is not a file with a .json extension.
        """
        os.makedirs(os.path.dirname(path), exist_ok=True)
        config_json = self.model_dump_json(indent=2)
        with open(path, "w") as f:
            f.write(config_json)

    @classmethod
    def load(cls, path: str) -> "DiaConfig | None":
        """Load and validate a Dia configuration from a JSON file.

        Args:
            path: The path to the configuration file.

        Returns:
            A validated DiaConfig instance if the file exists and is valid,
            otherwise None if the file is not found.

        Raises:
            ValueError: If the path does not point to an existing .json file.
            pydantic.ValidationError: If the JSON content fails validation against the DiaConfig schema.
        """
        try:
            with open(path, "r") as f:
                content = f.read()
            return cls.model_validate_json(content)
        except FileNotFoundError:
            return None

```

## /dia/layers.py

```py path="/dia/layers.py" 
from typing import Any

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor
from torch.nn import RMSNorm

from .config import DiaConfig


def _normalize_axes(axes: tuple[int, ...], ndim: int) -> tuple[int, ...]:
    return tuple(ax if ax >= 0 else ndim + ax for ax in axes)


def _str_to_dtype(dtype_str: str) -> torch.dtype | None:
    # Allow None for default behavior
    if dtype_str is None or dtype_str.lower() == "none":
        return None
    if dtype_str == "float32":
        return torch.float32
    elif dtype_str == "float16":
        return torch.float16
    elif dtype_str == "bfloat16":
        return torch.bfloat16
    else:
        raise ValueError(f"Unsupported dtype string: {dtype_str}")


class DenseGeneral(nn.Module):
    """
    PyTorch equivalent of flax.linen.DenseGeneral with shapes defined at init.

    Stores weights (`kernel`) in the same layout as Jax and uses torch.tensordot
    for the generalized matrix multiplication. Weight/bias shapes are calculated
    and parameters created during initialization based on config.
    `load_weights` validates shapes and copies data.

    Attributes:
        axis (Tuple[int, ...]): Input axis or axes to contract.
        in_shapes (Tuple[int, ...]): Sizes of the input dimensions specified by `axis`.
        out_features (Tuple[int, ...]): Shape of the output features (non-contracted dims).
        use_bias (bool): Whether to add a bias term.
        weight (nn.Parameter): The kernel parameter.
        bias (Optional[nn.Parameter]): The bias parameter (if use_bias=True).
    """

    def __init__(
        self,
        in_shapes: tuple[int, ...],
        out_features: tuple[int, ...],
        axis: tuple[int, ...] = (-1,),
        dtype: torch.dtype | None = None,
        weight_dtype: torch.dtype | None = None,
        device: torch.device | None = None,
    ):
        super().__init__()
        self.in_shapes = in_shapes
        self.out_features = out_features
        self.axis = axis
        self.dtype = dtype
        self.kernel_shape = self.in_shapes + self.out_features

        factory_kwargs = {"device": device, "dtype": weight_dtype}
        self.weight = nn.Parameter(torch.empty(self.kernel_shape, **factory_kwargs))
        self.register_parameter("bias", None)

    def forward(self, inputs: Tensor) -> Tensor:
        norm_axis = _normalize_axes(self.axis, inputs.ndim)
        kernel_contract_axes = tuple(range(len(norm_axis)))

        output = torch.tensordot(
            inputs.float(),
            self.weight.float(),
            dims=(norm_axis, kernel_contract_axes),
        ).to(inputs.dtype)
        return output


def get_activation_fn(activation_string: str) -> nn.Module:  # Return Module instance
    """Maps activation string to PyTorch activation function module."""
    if activation_string == "gelu":
        return nn.GELU()
    elif activation_string == "relu":
        return nn.ReLU()
    elif activation_string == "silu" or activation_string == "swish":
        return nn.SiLU()
    elif activation_string == "linear":
        return nn.Identity()
    else:
        raise ValueError(f"Unsupported activation function: {activation_string}")


class MlpBlock(nn.Module):
    """MLP block using DenseGeneral."""

    def __init__(
        self,
        config: DiaConfig,
        embed_dim: int,
        intermediate_dim: int,
        dropout_rate: float,
        activations: list[str] = ["silu", "linear"],
        use_pre_norm: bool = False,
    ):
        super().__init__()
        self.use_pre_norm = use_pre_norm
        num_activations = len(activations)
        compute_dtype = _str_to_dtype(config.training.dtype)
        weight_dtype = _str_to_dtype(config.model.weight_dtype)
        self.dtype = compute_dtype
        # Assume default device for now, could be passed in config

        if use_pre_norm:
            self.pre_norm = RMSNorm(
                embed_dim,
                eps=config.model.normalization_layer_epsilon,
                dtype=torch.float32,
            )

        self.wi_fused = DenseGeneral(
            in_shapes=(embed_dim,),
            out_features=(
                num_activations,
                intermediate_dim,
            ),
            axis=(-1,),
            dtype=compute_dtype,
            weight_dtype=weight_dtype,
        )

        self.activation_fn_0 = get_activation_fn(activations[0])  # silu
        self.activation_fn_1 = get_activation_fn(activations[1])  # linear

        self.dropout = nn.Dropout(dropout_rate)

        # Output layer using DenseGeneral
        self.wo = DenseGeneral(
            in_shapes=(intermediate_dim,),
            out_features=(embed_dim,),
            axis=(-1,),
            dtype=compute_dtype,
            weight_dtype=weight_dtype,
        )

    def forward(self, x: torch.Tensor, deterministic: bool) -> torch.Tensor:
        """Forward pass."""
        if self.use_pre_norm and hasattr(self, "pre_norm"):
            x = self.pre_norm(x)

        fused_x = self.wi_fused(x)

        gate_input = fused_x[..., 0, :]
        up_input = fused_x[..., 1, :]

        gate = self.activation_fn_0(gate_input)
        up = self.activation_fn_1(up_input)
        hidden = torch.mul(gate, up).to(self.dtype)

        if not deterministic:
            hidden = self.dropout(hidden)

        output = self.wo(hidden)
        return output


class RotaryEmbedding(nn.Module):
    """Rotary Position Embedding (RoPE) implementation in PyTorch."""

    def __init__(
        self,
        embedding_dims: int,
        min_timescale: int = 1,
        max_timescale: int = 10000,
        dtype: torch.dtype = torch.float32,
    ):
        super().__init__()
        if embedding_dims % 2 != 0:
            raise ValueError("Embedding dim must be even for RoPE.")
        self.embedding_dims = embedding_dims
        self.min_timescale = min_timescale
        self.max_timescale = max_timescale
        self.dtype = dtype

        half_embedding_dim = embedding_dims // 2
        fraction = (2.0 * torch.arange(0, half_embedding_dim)) / embedding_dims
        self.register_buffer(
            "timescale",
            self.min_timescale * (self.max_timescale / self.min_timescale) ** fraction,
            persistent=False,
        )

    def extra_repr(self) -> str:
        s = f"{self.timescale.shape}"
        return s

    def forward(self, inputs: torch.Tensor, position: torch.Tensor):
        """Applies RoPE."""
        position = position.unsqueeze(-1).unsqueeze(-1)
        timescale = self.timescale.to(inputs.device)
        sinusoid_inp = position / timescale
        sin = torch.sin(sinusoid_inp).to(inputs.dtype)
        cos = torch.cos(sinusoid_inp).to(inputs.dtype)
        first_half, second_half = torch.chunk(inputs, 2, dim=-1)
        first_part = first_half * cos - second_half * sin
        second_part = second_half * cos + first_half * sin
        return torch.cat((first_part, second_part), dim=-1)


class KVCache:
    def __init__(self, num_heads, max_len, head_dim, device, k=None, v=None):
        self.k = torch.zeros((2, num_heads, max_len, head_dim), device=device) if k is None else k
        self.v = torch.zeros((2, num_heads, max_len, head_dim), device=device) if v is None else v
        self.current_idx = 0
        self.max_len = max_len

    def get_kv_for_attention(self, current_k, current_v):
        if self.current_idx == 0:
            return current_k, current_v
        else:
            past_k = self.k[:, :, : self.current_idx, :]
            past_v = self.v[:, :, : self.current_idx, :]
            attn_k = torch.cat((past_k, current_k), dim=2)
            attn_v = torch.cat((past_v, current_v), dim=2)
            return attn_k, attn_v

    def update_cache(self, k, v):
        assert self.current_idx < self.max_len
        self.k[:, :, self.current_idx : self.current_idx + 1, :] = k
        self.v[:, :, self.current_idx : self.current_idx + 1, :] = v
        self.current_idx += 1

    def prefill_kv(self, k, v):
        prefill_len = k.shape[2]
        assert prefill_len <= self.max_len
        self.k[:, :, :prefill_len, :] = k
        self.v[:, :, :prefill_len, :] = v
        self.current_idx = prefill_len


class Attention(nn.Module):
    """Attention using DenseGeneral."""

    def __init__(
        self,
        config: DiaConfig,
        q_embed_dim: int,
        kv_embed_dim: int,
        num_query_heads: int,
        num_kv_heads: int,
        head_dim: int,
        dropout_rate: float,
        is_cross_attn: bool = False,
        out_embed_dim: int | None = None,
    ):
        super().__init__()
        self.num_query_heads = num_query_heads
        self.num_kv_heads = num_kv_heads
        self.head_dim = head_dim
        self.is_cross_attn = is_cross_attn
        self.dropout_rate = dropout_rate
        compute_dtype = _str_to_dtype(config.training.dtype)
        weight_dtype = _str_to_dtype(config.model.weight_dtype)
        self.output_dim = out_embed_dim if out_embed_dim is not None else q_embed_dim
        self.projected_query_dim = num_query_heads * head_dim
        if num_query_heads % num_kv_heads != 0:
            raise ValueError(f"num_query_heads ({num_query_heads}) must be divisible by num_kv_heads ({num_kv_heads})")
        self.num_gqa_groups = num_query_heads // num_kv_heads

        # --- Projection Layers using DenseGeneral ---
        self.q_proj = DenseGeneral(
            in_shapes=(q_embed_dim,),
            out_features=(num_query_heads, head_dim),
            axis=(-1,),
            dtype=compute_dtype,
            weight_dtype=weight_dtype,
        )
        self.k_proj = DenseGeneral(
            in_shapes=(kv_embed_dim,),
            out_features=(num_kv_heads, head_dim),
            axis=(-1,),
            dtype=compute_dtype,
            weight_dtype=weight_dtype,
        )
        self.v_proj = DenseGeneral(
            in_shapes=(kv_embed_dim,),
            out_features=(num_kv_heads, head_dim),
            axis=(-1,),
            dtype=compute_dtype,
            weight_dtype=weight_dtype,
        )
        self.o_proj = DenseGeneral(
            in_shapes=(num_query_heads, head_dim),
            out_features=(self.output_dim,),
            axis=(-2, -1),
            dtype=compute_dtype,
            weight_dtype=weight_dtype,
        )

        # --- Rotary Embedding ---
        self.rotary_emb = RotaryEmbedding(
            embedding_dims=self.head_dim,
            min_timescale=config.model.rope_min_timescale,
            max_timescale=config.model.rope_max_timescale,
            dtype=compute_dtype,
        )

    def forward(
        self,
        Xq: torch.Tensor,  # (B, T, D) T = 1 in AR generation
        Xkv: torch.Tensor,  # (B, S, E) S = 1 in AR generation
        q_positions: torch.Tensor,  # (B, T)
        kv_positions: torch.Tensor | None = None,  # (B, S)
        deterministic: bool = True,
        attn_mask: torch.Tensor | None = None,  # None in Decoder Self Attention, Valid mask in Others
        cache: KVCache | None = None,  # None in Encoder, KVCache in Decoder
        prefill: bool = False,  # True only when prefilling KV Cache
    ) -> tuple[torch.Tensor, tuple[torch.Tensor, torch.Tensor] | None]:
        """
        Performs attention calculation with optional KV caching.

        Args:
            Xq: Query tensor (B, T, D). T=1 during single-step decoding.
            Xkv: Key/Value source tensor (B, S, E). S=1 during single-step decoding for self-attn.
            q_positions: Positions for queries (B, T).
            kv_positions: Positions for keys/values (B, S). If None, uses q_positions.
            deterministic: If True, disable dropout.
            attn_mask: Attention mask.
            cache: KVCache.
            prefill: If True, use prefill mode.

        Returns:
            A tuple containing:
            - output: The attention output tensor (B, T, output_dim).
            - present_kv: The K/V state to be cached for the next step ((B, N, S_new, H), (B, N, S_new, H)). For self-attn, S_new = S_past + S. For cross-attn, S_new = S_kv.
        """
        if kv_positions is None:
            kv_positions = q_positions
        original_dtype = Xq.dtype

        Xq_BxTxNxH = self.q_proj(Xq)
        Xq_BxTxNxH = self.rotary_emb(Xq_BxTxNxH, position=q_positions)
        Xq_BxNxTxH = Xq_BxTxNxH.transpose(1, 2)

        # Input values into attention calculation
        attn_k: torch.Tensor | None = None
        attn_v: torch.Tensor | None = None
        new_kv_cache: tuple[torch.Tensor, torch.Tensor] | None = None

        # Decoder Cross Attention
        if self.is_cross_attn:
            # Directly use cache (no need to check index)
            attn_k, attn_v = cache.k, cache.v
            if attn_k.shape[1] != self.num_query_heads or attn_v.shape[1] != self.num_query_heads:
                raise ValueError(
                    f"Cross-attention cache head dimension ({attn_k.shape[1]}) "
                    f"does not match num_query_heads ({self.num_query_heads}). "
                    "Cache should be pre-repeated for GQA."
                )
        # Self Attention
        else:
            Xk_BxSxKxH = self.k_proj(Xkv)  # (B, S, K, H)
            Xv_BxSxKxH = self.v_proj(Xkv)  # (B, S, K, H)
            Xk_BxSxKxH = self.rotary_emb(Xk_BxSxKxH, position=kv_positions)  # (B, S, K, H)

            Xk_BxKxSxH = Xk_BxSxKxH.transpose(1, 2)  # (B, K, S, H)
            Xv_BxKxSxH = Xv_BxSxKxH.transpose(1, 2)  # (B, K, S, H)
            # S=1 for Decode Step

            if self.num_gqa_groups > 1:
                Xk_BxNxSxH = Xk_BxKxSxH.repeat_interleave(self.num_gqa_groups, dim=1)
                Xv_BxNxSxH = Xv_BxKxSxH.repeat_interleave(self.num_gqa_groups, dim=1)
            else:
                Xk_BxNxSxH = Xk_BxKxSxH
                Xv_BxNxSxH = Xv_BxKxSxH

            # Encoder Self Attention
            if cache is None:
                attn_k = Xk_BxNxSxH
                attn_v = Xv_BxNxSxH
            # Decoder Self Attention
            else:
                # In prefill mode, we fill in cache until prefill length
                if prefill:
                    attn_k, attn_v = Xk_BxNxSxH, Xv_BxNxSxH
                    cache.prefill_kv(attn_k, attn_v)
                # In decode step, we add current K/V to cache step by step
                else:
                    new_kv_cache = Xk_BxNxSxH, Xv_BxNxSxH
                    attn_k, attn_v = cache.get_kv_for_attention(Xk_BxNxSxH, Xv_BxNxSxH)

        attn_output = F.scaled_dot_product_attention(
            Xq_BxNxTxH,
            attn_k,
            attn_v,
            attn_mask=attn_mask,
            dropout_p=self.dropout_rate if not deterministic else 0.0,
            scale=1.0,
        )

        attn_output = attn_output.transpose(1, 2).contiguous()  # (B, T, N, H)
        output = self.o_proj(attn_output)

        return output.to(original_dtype), new_kv_cache


class EncoderLayer(nn.Module):
    """Transformer Encoder Layer using DenseGeneral."""

    def __init__(self, config: DiaConfig):
        super().__init__()
        self.config = config
        model_config = config.model
        enc_config = config.model.encoder
        embed_dim = enc_config.n_embd

        self.pre_sa_norm = RMSNorm(
            embed_dim,
            eps=model_config.normalization_layer_epsilon,
            dtype=torch.float32,
        )
        self.self_attention = Attention(
            config=config,
            q_embed_dim=embed_dim,
            kv_embed_dim=embed_dim,
            num_query_heads=enc_config.n_head,
            num_kv_heads=enc_config.n_head,
            head_dim=enc_config.head_dim,
            dropout_rate=model_config.dropout,
            is_cross_attn=False,
            out_embed_dim=embed_dim,
        )
        self.post_sa_norm = RMSNorm(
            embed_dim,
            eps=model_config.normalization_layer_epsilon,
            dtype=torch.float32,
        )
        self.mlp = MlpBlock(
            config=config,
            embed_dim=embed_dim,
            intermediate_dim=enc_config.n_hidden,
            activations=enc_config.mlp_activations,
            dropout_rate=model_config.dropout,
            use_pre_norm=enc_config.use_pre_norm,
        )
        self.dropout = nn.Dropout(model_config.dropout)

    def forward(
        self,
        x: torch.Tensor,
        src_positions: torch.Tensor | None = None,
        deterministic: bool = True,
        attn_mask: torch.Tensor | None = None,
    ) -> torch.Tensor:
        residual = x
        x_norm = self.pre_sa_norm(x)

        sa_out, _ = self.self_attention(
            Xq=x_norm,
            Xkv=x_norm,
            q_positions=src_positions,
            kv_positions=src_positions,
            deterministic=deterministic,
            attn_mask=attn_mask,
        )
        x = residual + sa_out

        residual = x
        x_norm = self.post_sa_norm(x)
        mlp_out = self.mlp(x_norm, deterministic=deterministic)
        x = residual + mlp_out

        if not deterministic:
            x = self.dropout(x)
        return x


class Encoder(nn.Module):
    """Transformer Encoder Stack using DenseGeneral."""

    def __init__(self, config: DiaConfig):
        super().__init__()
        self.config = config
        model_config = config.model
        enc_config = config.model.encoder
        compute_dtype = _str_to_dtype(config.training.dtype)

        self.embedding = nn.Embedding(
            model_config.src_vocab_size,
            enc_config.n_embd,
            dtype=compute_dtype,
        )
        self.dropout = nn.Dropout(model_config.dropout)
        self.layers = nn.ModuleList([EncoderLayer(config=config) for _ in range(enc_config.n_layer)])
        self.norm = RMSNorm(
            enc_config.n_embd,
            eps=model_config.normalization_layer_epsilon,
            dtype=torch.float32,
        )

    def forward(
        self,
        x_ids: torch.Tensor,
        src_positions: torch.Tensor | None = None,
        deterministic: bool = True,
        attn_mask: torch.Tensor | None = None,
    ) -> torch.Tensor:
        x = self.embedding(x_ids)

        if not deterministic:
            x = self.dropout(x)

        for layer in self.layers:
            x = layer(
                x,
                src_positions=src_positions,
                deterministic=deterministic,
                attn_mask=attn_mask,
            )
        x = self.norm(x)
        if not deterministic:
            x = self.dropout(x)
        return x


class DecoderLayer(nn.Module):
    """Transformer Decoder Layer using DenseGeneral."""

    def __init__(self, config: DiaConfig):
        super().__init__()
        self.config = config
        model_config = config.model
        dec_config = config.model.decoder
        enc_config = config.model.encoder
        dec_embed_dim = dec_config.n_embd
        enc_embed_dim = enc_config.n_embd

        # Norms
        self.pre_sa_norm = RMSNorm(
            dec_embed_dim,
            eps=model_config.normalization_layer_epsilon,
            dtype=torch.float32,
        )
        self.pre_ca_norm = RMSNorm(
            dec_embed_dim,
            eps=model_config.normalization_layer_epsilon,
            dtype=torch.float32,
        )
        self.pre_mlp_norm = RMSNorm(
            dec_embed_dim,
            eps=model_config.normalization_layer_epsilon,
            dtype=torch.float32,
        )

        # Self-Attention (GQA) with Causal Masking
        self.self_attention = Attention(
            config=config,
            q_embed_dim=dec_embed_dim,
            kv_embed_dim=dec_embed_dim,
            num_query_heads=dec_config.gqa_query_heads,
            num_kv_heads=dec_config.kv_heads,
            head_dim=dec_config.gqa_head_dim,
            dropout_rate=model_config.dropout,
            is_cross_attn=False,
            out_embed_dim=dec_embed_dim,
        )
        # Cross-Attention (MHA)
        self.cross_attention = Attention(
            config=config,
            q_embed_dim=dec_embed_dim,
            kv_embed_dim=enc_embed_dim,  # Note kv_embed_dim
            num_query_heads=dec_config.cross_query_heads,
            num_kv_heads=dec_config.cross_query_heads,
            head_dim=dec_config.cross_head_dim,
            dropout_rate=model_config.dropout,
            is_cross_attn=True,
            out_embed_dim=dec_embed_dim,
        )
        # MLP
        self.mlp = MlpBlock(
            config=config,
            embed_dim=dec_embed_dim,
            intermediate_dim=dec_config.n_hidden,
            activations=dec_config.mlp_activations,
            dropout_rate=model_config.dropout,
            use_pre_norm=dec_config.use_pre_norm,
        )

    def forward(
        self,
        x: torch.Tensor,
        encoder_out: torch.Tensor,
        tgt_positions: torch.Tensor,
        src_positions: torch.Tensor | None,
        deterministic: bool,
        self_attn_mask: torch.Tensor,
        cross_attn_mask: torch.Tensor,
        self_attn_cache: KVCache,
        cross_attn_cache: KVCache,
        prefill: bool = False,
    ) -> torch.Tensor:
        residual = x
        x_norm = self.pre_sa_norm(x)

        sa_out, new_kv_cache = self.self_attention(
            Xq=x_norm,  # (2, 1, D)
            Xkv=x_norm,  # (2, 1, D)
            q_positions=tgt_positions,  # (2, 1)
            kv_positions=tgt_positions,  # (2, 1)
            deterministic=deterministic,
            attn_mask=self_attn_mask,  # (2, 1, 1, S_max)
            cache=self_attn_cache,
            prefill=prefill,
        )

        x = residual + sa_out

        # 2. Cross-Attention
        residual = x
        x_norm = self.pre_ca_norm(x)
        ca_out, _ = self.cross_attention(
            Xq=x_norm,
            Xkv=encoder_out,
            q_positions=tgt_positions,
            kv_positions=src_positions,
            deterministic=deterministic,
            attn_mask=cross_attn_mask,
            cache=cross_attn_cache,
        )
        x = residual + ca_out

        # 3. MLP
        residual = x
        x_norm = self.pre_mlp_norm(x)
        mlp_out = self.mlp(x_norm, deterministic=deterministic)
        x = residual + mlp_out

        return x, new_kv_cache


class Decoder(nn.Module):
    """Transformer Decoder Stack using DenseGeneral."""

    def __init__(self, config: DiaConfig):
        super().__init__()
        self.config = config
        model_config = config.model
        dec_config = config.model.decoder
        train_config = config.training
        data_config = config.data
        compute_dtype = _str_to_dtype(config.training.dtype)
        weight_dtype = _str_to_dtype(config.model.weight_dtype)
        self.num_channels = data_config.channels
        self.num_layers = dec_config.n_layer

        self.embeddings = nn.ModuleList(
            [
                nn.Embedding(model_config.tgt_vocab_size, dec_config.n_embd, dtype=compute_dtype)
                for _ in range(self.num_channels)
            ]
        )
        self.dropout = nn.Dropout(model_config.dropout)
        self.layers = nn.ModuleList([DecoderLayer(config=config) for _ in range(self.num_layers)])
        self.norm = RMSNorm(
            dec_config.n_embd,
            eps=model_config.normalization_layer_epsilon,
            dtype=torch.float32,
        )

        # Final Logits Projection using DenseGeneral
        self.logits_dense = DenseGeneral(
            in_shapes=(dec_config.n_embd,),
            out_features=(self.num_channels, model_config.tgt_vocab_size),
            axis=(-1,),
            dtype=(torch.float32 if train_config.logits_dot_in_fp32 else compute_dtype),
            weight_dtype=weight_dtype,
        )
        self.logits_in_fp32 = train_config.logits_dot_in_fp32

    def precompute_cross_attention_kv(
        self,
        max_len: int,
        encoder_out: torch.Tensor,  # (B, S, E)
        src_positions: torch.Tensor | None,  # (B, S)
    ) -> list[KVCache]:
        """
        Computes the Key and Value tensors for cross-attention for each layer from the encoder output.
        """
        per_layer_kv_cache: list[KVCache] = []

        for layer in self.layers:
            cross_attn_module = layer.cross_attention
            k_proj = cross_attn_module.k_proj(encoder_out)
            v_proj = cross_attn_module.v_proj(encoder_out)

            k_proj = cross_attn_module.rotary_emb(k_proj, position=src_positions)
            k = k_proj.transpose(1, 2)
            v = v_proj.transpose(1, 2)

            per_layer_kv_cache.append(
                KVCache(
                    cross_attn_module.num_kv_heads,
                    max_len,
                    cross_attn_module.head_dim,
                    k.device,
                    k=k,
                    v=v,
                )
            )

        return per_layer_kv_cache

    def decode_step(
        self,
        tgt_ids_Bx1xC: torch.Tensor,  # [B, 1, C]
        tgt_pos_Bx1: torch.Tensor,  # [B, 1]
        encoder_out: torch.Tensor,  # [B, S, E]
        self_attn_mask: Any,  # None
        cross_attn_mask: torch.Tensor,  # [B, 1, 1, S]
        self_attention_cache: list[KVCache],
        cross_attention_cache: list[KVCache],
    ) -> torch.Tensor:
        """
        Performs a single decoding step, managing KV caches layer by layer.

        Returns:
            A tuple containing:
            - logits_Bx1xCV: The final output logits for the current step (B, 1, C*V), cast to float32.
        """
        assert self_attn_mask is None, "Self-attention mask should be None, kept for pattern"

        x = None
        for i in range(self.num_channels):
            channel_tokens = tgt_ids_Bx1xC[..., i]
            channel_embed = self.embeddings[i](channel_tokens)
            x = channel_embed if x is None else x + channel_embed

        new_cache = []

        for i, layer in enumerate(self.layers):
            self_cache = self_attention_cache[i]
            cross_cache = cross_attention_cache[i]
            x, new_kv_cache = layer(
                x,  # (2, 1, D)
                encoder_out,  # (2, S, E)
                src_positions=None,  # CA KV is already computed
                tgt_positions=tgt_pos_Bx1,  # (2, 1)
                deterministic=True,
                self_attn_mask=None,
                cross_attn_mask=cross_attn_mask,
                self_attn_cache=self_cache,
                cross_attn_cache=cross_cache,
            )
            new_cache.append(new_kv_cache)

        x = self.norm(x)
        logits_Bx1xCxV = self.logits_dense(x)

        return logits_Bx1xCxV.to(torch.float32), new_cache

    def forward(
        self,
        tgt_ids_BxTxC: torch.Tensor,
        encoder_out: torch.Tensor,
        tgt_positions: torch.Tensor,
        src_positions: torch.Tensor,
        deterministic: bool,
        self_attn_mask: torch.Tensor,
        cross_attn_mask: torch.Tensor,
        self_attention_cache: list[KVCache],
        cross_attention_cache: list[KVCache],
    ) -> torch.Tensor:
        """
        Forward pass for the Decoder stack, managing KV caches.

        Args:
            tgt_ids_BxTxC: Target token IDs (B, T, C).
            encoder_out: Output from the encoder (B, S, E).
            tgt_positions: Positions for target sequence (B, T).
            src_positions: Positions for source sequence (B, S).
            deterministic: Disable dropout if True.
            self_attn_mask: Mask for self-attention.
            cross_attn_mask: Mask for cross-attention.
            past_key_values: List containing the self-attention KV cache for each layer
                             from the previous decoding step. `len(past_key_values)` should
                             equal `num_layers`.
            precomputed_cross_attn_kv: A single tuple containing the pre-computed K/V cache
                                      derived from `encoder_out`. This is passed identically
                                      to all layers.

        Returns:
            A tuple containing:
            - logits: The final output logits (B, T, C * V), cast to float32.
            - present_key_values: A list containing the updated self-attention KV cache
                                 for each layer for the *current* decoding step.
        """
        _, _, num_channels_in = tgt_ids_BxTxC.shape
        assert num_channels_in == self.num_channels, "Input channels mismatch"

        # Embeddings
        x = None
        for i in range(self.num_channels):
            channel_tokens = tgt_ids_BxTxC[..., i]
            channel_embed = self.embeddings[i](channel_tokens)
            x = channel_embed if x is None else x + channel_embed

        if not deterministic:
            x = self.dropout(x)

        for i, layer in enumerate(self.layers):
            x, _ = layer(
                x,
                encoder_out,
                tgt_positions=tgt_positions,
                src_positions=src_positions,
                deterministic=deterministic,
                self_attn_mask=self_attn_mask,
                cross_attn_mask=cross_attn_mask,
                self_attn_cache=self_attention_cache[i],
                cross_attn_cache=cross_attention_cache[i],
                prefill=True,
            )

        # Final Norm
        x = self.norm(x)
        logits_BxTxCxV = self.logits_dense(x)

        return logits_BxTxCxV.to(torch.float32)


class DiaModel(nn.Module):
    """PyTorch Dia Model using DenseGeneral."""

    def __init__(self, config: DiaConfig):
        super().__init__()
        self.config = config
        self.encoder = Encoder(config)
        self.decoder = Decoder(config)

    def forward(
        self,
        src_BxS: torch.Tensor,
        tgt_BxTxC: torch.Tensor,
        src_positions: torch.Tensor | None = None,
        tgt_positions: torch.Tensor | None = None,
        enc_self_attn_mask: torch.Tensor | None = None,
        dec_self_attn_mask: torch.Tensor | None = None,
        dec_cross_attn_mask: torch.Tensor | None = None,
        enable_dropout: bool = True,
    ):
        deterministic = not enable_dropout

        # --- Encoder Pass ---
        encoder_out = self.encoder(
            x_ids=src_BxS,
            src_positions=src_positions,
            deterministic=deterministic,
            attn_mask=enc_self_attn_mask,
        )

        # --- Decoder Pass ---
        logits, _ = self.decoder(
            tgt_ids_BxTxC=tgt_BxTxC,
            encoder_out=encoder_out,
            tgt_positions=tgt_positions,
            src_positions=src_positions,
            deterministic=deterministic,
            self_attn_mask=dec_self_attn_mask,
            cross_attn_mask=dec_cross_attn_mask,
            precomputed_cross_attn_kv=None,
        )

        return logits

```

## /dia/model.py

```py path="/dia/model.py" 
# DiaModel with diffusion-style decoder (StyleTTS2 inspired)

import torch
import torch.nn as nn
import torch.nn.functional as F

class DiaModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config

        model_dim = config.model.decoder.d_model
        vocab_size = config.model.tgt_vocab_size
        self.spk_proj = nn.Linear(192, model_dim)

        # Encoder: text + lang embedding
        self.encoder_embed = nn.Embedding(config.model.encoder_vocab_size, model_dim)
        self.encoder_proj = nn.Linear(model_dim, model_dim)

        # Diffusion decoder setup
        self.token_proj = nn.Linear(config.model.input_dim, model_dim)
        self.diffusion_steps = config.model.diffusion_steps
        self.time_embed = nn.Embedding(self.diffusion_steps, model_dim)

        # Conditional U-Net style block
        self.diffusion_layers = nn.Sequential(
            nn.LayerNorm(model_dim),
            nn.Linear(model_dim, model_dim),
            nn.ReLU(),
            nn.Linear(model_dim, model_dim)
        )

        self.to_logits = nn.Linear(model_dim, vocab_size)

    def encoder(self, input_ids, lang_ids):
        x = self.encoder_embed(input_ids)
        return self.encoder_proj(x)

    def diffusion_step(self, x, t_embed, spk_embed=None, encoder_out=None):
        cond = x + t_embed
        if spk_embed is not None:
            spk = self.spk_proj(spk_embed).unsqueeze(1).expand_as(cond)
            cond = cond + spk
        if encoder_out is not None:
            cond = cond + encoder_out  # simple cross-attn via addition
        return self.diffusion_layers(cond)

    def decoder_forward(self, tgt_ids_BxTxC, spk_embed=None, encoder_out=None, **kwargs):
        B, T, _ = tgt_ids_BxTxC.shape
        x = self.token_proj(tgt_ids_BxTxC)

        for t in range(self.diffusion_steps):
            t_embed = self.time_embed(torch.tensor([t], device=x.device)).unsqueeze(0).expand(B, T, -1)
            x = self.diffusion_step(x, t_embed, spk_embed, encoder_out)

        return x

    def decode_step(self, tgt_ids_Bx1xC, tgt_pos_Bx1, encoder_out, spk_embed=None, **kwargs):
        B, T, _ = tgt_ids_Bx1xC.shape
        x = self.token_proj(tgt_ids_Bx1xC)
        t_embed = self.time_embed(torch.tensor([0], device=x.device)).unsqueeze(0).expand(B, T, -1)
        x = self.diffusion_step(x, t_embed, spk_embed, encoder_out)
        return self.to_logits(x), {}

    def compute_loss(self, output, targets):
        return F.mse_loss(output, targets)

    def extract_kv_cache(self):
        return {}

    def forward(self, batch):
        input_ids = batch["input_ids"]
        audio = batch["audio"]
        lang_ids = batch["lang_token_ids"]
        spk_embed = batch.get("spk_embed", None)

        enc_out = self.encoder(input_ids, lang_ids)
        output = self.decoder_forward(audio, spk_embed=spk_embed, encoder_out=enc_out)
        loss = self.compute_loss(output, audio)
        return loss

```

## /dia/static/images/banner.png

Binary file available at https://raw.githubusercontent.com/anan235/dia-multilingual/refs/heads/main/dia/static/images/banner.png

## /docker/Dockerfile

``` path="/docker/Dockerfile" 
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04

RUN apt update && apt install -y git wget espeak-ng ffmpeg libsndfile1-dev libespeak-ng-dev build-essential python3 python3-pip python-is-python3

RUN pip install --upgrade pip
RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
RUN pip install transformers accelerate datasets huggingface_hub librosa phonemizer wandb einops matplotlib

RUN ln -s /usr/bin/espeak-ng /usr/bin/espeak

WORKDIR /workspace
```

## /docker/launch.sh

```sh path="/docker/launch.sh" 
#!/bin/bash
set -e

export HF_HOME=/workspace/cache/huggingface
export WANDB_PROJECT=multilang-tts

DATA_DIR=/workspace/data
MANIFEST_PATH=/workspace/train_manifest.json
LANG_VOCAB=/workspace/lang_vocab.json

python3 scripts/train_dia.py \
  --manifest $MANIFEST_PATH \
  --lang_vocab $LANG_VOCAB \
  --data_root $DATA_DIR \
  --output_dir /workspace/checkpoints \
  --epochs 50 \
  --batch_size 16 \
  --lr 1e-4 \
  --num_workers 4
```

## /example/simple.py

```py path="/example/simple.py" 
import soundfile as sf

from dia.model import Dia


model = Dia.from_pretrained("nari-labs/Dia-1.6B")

text = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face."

output = model.generate(text)

sf.write("simple.mp3", output, 44100)

```

## /example/voice_clone.py

```py path="/example/voice_clone.py" 
import soundfile as sf

from dia.model import Dia


model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# You should put the transcript of the voice you want to clone
# We will use the audio created by running simple.py as an example.
# Note that you will be REQUIRED TO RUN simple.py for the script to work as-is.
clone_from_text = "[S1] Dia is an open weights text to dialogue model. [S2] You get full control over scripts and voices. [S1] Wow. Amazing. (laughs) [S2] Try it now on Git hub or Hugging Face."
clone_from_audio = "simple.mp3"

# For your custom needs, replace above with below and add your audio file to this directory:
# clone_from_text = "[S1] ... [S2] ... [S1] ... corresponding to your_audio_name.mp3"
# clone_from_audio = "your_audio_name.mp3"

# Text to generate
text_to_generate = "[S1] Hello, how are you? [S2] I'm good, thank you. [S1] What's your name? [S2] My name is Dia. [S1] Nice to meet you. [S2] Nice to meet you too."

# It will only return the audio from the text_to_generate
output = model.generate(clone_from_text + text_to_generate, audio_prompt_path=clone_from_audio)

sf.write("voice_clone.mp3", output, 44100)

```

## /example_prompt.mp3

Binary file available at https://raw.githubusercontent.com/anan235/dia-multilingual/refs/heads/main/example_prompt.mp3

## /pyproject.toml

```toml path="/pyproject.toml" 
[project]
name = "nari-tts"
version = "0.1.0"
description = "Dia - A text-to-speech model for dialogue generation"
readme = "README.md"
requires-python = ">=3.10"
license = {file = "LICENSE"}
authors = [
    {name = "Nari Labs", email = "contact@narilabs.ai"}
]
dependencies = [
    "descript-audio-codec>=1.0.0",
    "gradio>=5.25.2",
    "huggingface-hub>=0.30.2",
    "numpy>=2.2.4",
    "pydantic>=2.11.3",
    "soundfile>=0.13.1",
    "torch>=2.6.0",
    "torchaudio>=2.6.0",
    "speechbrain>=0.5.14",
    # Multilingual Conformer speaker encoder
    "tflite-hub>=0.0.5",
    "tflite-runtime>=2.13.0 ; platform_machine == 'x86_64' and sys_platform != 'darwin'",
    "triton>=3.2.0 ; sys_platform == 'linux'",
    "triton-windows>=3.2.0.post18 ; sys_platform == 'win32'",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project.urls]
"Homepage" = "https://github.com/nari-labs/dia"
"Bug Tracker" = "https://github.com/nari-labs/dia/issues"

[tool.hatch.build.targets.wheel]
packages = ["dia"]

[tool.ruff]
# Never enforce `E501` (line length violations).
lint.ignore = ["C901", "E501", "E741", "W605"]
lint.select = ["C", "E", "F", "I", "W"]
line-length = 119

# Ignore import violations in all `__init__.py` files.
[tool.ruff.lint.per-file-ignores]
"__init__.py" = ["E402", "F401", "F403", "F811"]

[tool.ruff.lint.isort]
lines-after-imports = 2

[tool.uv.sources]
torch = [
  { index = "pytorch-cu126", marker = "sys_platform == 'linux' or sys_platform == 'win32'" },
]
torchaudio = [
  { index = "pytorch-cu126", marker = "sys_platform == 'linux' or sys_platform == 'win32'" },
]

[[tool.uv.index]]
name = "pytorch-cu126"
url = "https://download.pytorch.org/whl/cu126"
explicit = true

```

## /scripts/infer_dia.py

```py path="/scripts/infer_dia.py" 
import argparse
import json
import torch
import torchaudio
from pathlib import Path
from dia.model import DiaModel
from subprocess import check_output

parser = argparse.ArgumentParser()
parser.add_argument("--model_path")
parser.add_argument("--lang_vocab")
parser.add_argument("--output_dir")
parser.add_argument("--text")
parser.add_argument("--lang")
parser.add_argument("--reference_wav", default=None)
parser.add_argument("--sample_rate", type=int, default=22050)
args = parser.parse_args()

lang_vocab = json.load(open(args.lang_vocab))
model = DiaModel.load_from_checkpoint(args.model_path, map_location="cpu")
model.eval()

lang_token = f"<{args.lang}>"
lang_token_id = lang_vocab.get(lang_token, 0)

ipa = check_output(f"echo '{args.text}' | espeak-ng -v {args.lang} --ipa -q", shell=True, text=True).strip()
print("Phonemes:", ipa)

char_vocab = {c: i+10 for i, c in enumerate(sorted(set(ipa)))}
char_vocab["<pad>"] = 0
char_vocab["<unk>"] = 1
phoneme_ids = [char_vocab.get(c, 1) for c in ipa]
input_ids = torch.tensor([lang_token_id] + phoneme_ids).unsqueeze(0)

ref = None
if args.reference_wav:
    wav, sr = torchaudio.load(args.reference_wav)
    if sr != args.sample_rate:
        wav = torchaudio.functional.resample(wav, sr, args.sample_rate)
    ref = wav.squeeze(0).unsqueeze(0)

with torch.no_grad():
    audio = model.infer(input_ids=input_ids, ref_audio=ref)

Path(args.output_dir).mkdir(parents=True, exist_ok=True)
outpath = Path(args.output_dir) / f"tts_{args.lang}.wav"
torchaudio.save(str(outpath), audio.cpu(), args.sample_rate)
print("Saved:", outpath)
```

## /scripts/train_dia.py

```py path="/scripts/train_dia.py" 
import argparse, json
from torch.utils.data import DataLoader
from dataset.multilang_tts_dataset import MultilangTTSDataset, collate_fn
from tools.speaker_encoder import SpeakerEncoder
from dia.model import DiaModel
import torch
import os

parser = argparse.ArgumentParser()
parser.add_argument("--manifest")
parser.add_argument("--lang_vocab")
parser.add_argument("--data_root")
parser.add_argument("--output_dir")
parser.add_argument("--epochs", type=int, default=20)
parser.add_argument("--batch_size", type=int, default=16)
parser.add_argument("--lr", type=float, default=1e-4)
parser.add_argument("--num_workers", type=int, default=4)
args = parser.parse_args()

lang_vocab = json.load(open(args.lang_vocab))
dataset = MultilangTTSDataset(args.manifest, lang_vocab)
dataloader = DataLoader(dataset, batch_size=args.batch_size, shuffle=True, collate_fn=collate_fn, num_workers=args.num_workers)

model = DiaModel(lang_vocab=lang_vocab)
optimizer = torch.optim.AdamW(model.parameters(), lr=args.lr)
spk_encoder = SpeakerEncoder()

os.makedirs(args.output_dir, exist_ok=True)

for epoch in range(args.epochs):
    for batch in dataloader:
        with torch.no_grad():
            batch["spk_embed"] = torch.stack([spk_encoder.encode(p) for p in batch["path"]])
        loss = model(batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    ckpt_path = os.path.join(args.output_dir, f"epoch{epoch}.pt")
    torch.save(model.state_dict(), ckpt_path)
    print("Saved:", ckpt_path)
```

## /scripts/validate.py

```py path="/scripts/validate.py" 
import torch
from torch.utils.data import DataLoader
from dataset.multilang_tts_dataset import MultilangTTSDataset, collate_fn
from dia.model import DiaModel
import json

lang_vocab = json.load(open("lang_vocab.json"))
dataset = MultilangTTSDataset("valid_manifest.json", lang_vocab)
dataloader = DataLoader(dataset, batch_size=8, collate_fn=collate_fn)

model = DiaModel.load_from_checkpoint("checkpoints/best.pth")
model.eval()

losses = []
for batch in dataloader:
    with torch.no_grad():
        loss = model(batch)
        losses.append(loss.item())

print("Avg validation loss:", sum(losses)/len(losses))
```

## /tests/test_spk_injection.py

```py path="/tests/test_spk_injection.py" 
import torch
from dia.model import DiaModel
from tools.speaker_encoder import SpeakerEncoder

# Mock config
class DummyConfig:
    class decoder:
        d_model = 256
    model = decoder()

# Initialize model
model = DiaModel(config=DummyConfig())
model.eval()

# Simulate input
B, T, C = 2, 16, 256
dummy_audio_tokens = torch.randint(0, 100, (B, T, C)).float()
dummy_spk_embed = torch.randn(B, 192)

# Forward pass
with torch.no_grad():
    out = model.decoder_forward(dummy_audio_tokens, spk_embed=dummy_spk_embed)

print("✅ Decoder output shape:", out.shape)
```

## /tests/visualize_speakers.ipynb

```ipynb path="/tests/visualize_speakers.ipynb" 
import torch
import torchaudio
from tools.speaker_encoder import SpeakerEncoder
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

# Load ECAPA encoder
encoder = SpeakerEncoder()

# Define your speaker sample paths
paths = {
    "EN Male": "samples/en_male.wav",
    "EN Female": "samples/en_female.wav",
    "ES Male": "samples/es_male.wav",
    "ES Female": "samples/es_female.wav",
}

# Encode
embeddings = {}
for name, path in paths.items():
    try:
        vec = encoder.encode(path).squeeze().numpy()
        embeddings[name] = vec
    except Exception as e:
        print(f"❌ Failed: {name} ({path}):", str(e))

# Reduce to 2D
if embeddings:
    pca = PCA(n_components=2)
    X = list(embeddings.values())
    names = list(embeddings.keys())
    coords = pca.fit_transform(X)

    # Plot
    plt.figure(figsize=(6, 6))
    sns.scatterplot(x=coords[:,0], y=coords[:,1], hue=names, s=150)
    plt.title("Speaker Embedding Space (PCA)")
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.grid(True)
    plt.show()
```

## /tools/audio_cleaner.py

```py path="/tools/audio_cleaner.py" 
import torchaudio
import torchaudio.transforms as T
from pathlib import Path

def normalize_and_trim(path, out_path, target_db=-20):
    wav, sr = torchaudio.load(path)
    vol = T.Vol(target_db, gain_type='db')
    wav = vol(wav)
    trimmed = T.Vad(sample_rate=sr)(wav)
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    torchaudio.save(out_path, trimmed, sr)

# Example usage:
# normalize_and_trim("input.wav", "cleaned/input.wav")
```

## /tools/hf_dataset_loader.py

```py path="/tools/hf_dataset_loader.py" 
from datasets import load_dataset
import json
from pathlib import Path

def extract_commonvoice(lang, output_dir):
    ds = load_dataset("mozilla-foundation/common_voice_13_0", lang, split="train")
    manifest = []
    for sample in ds:
        if sample["audio"] and sample["sentence"]:
            manifest.append({
                "audio": sample["audio"]["path"],
                "text": sample["sentence"],
                "phonemes": "",  # fill via phonemizer
                "lang": lang
            })
    with open(Path(output_dir) / f"manifest_{lang}.json", "w") as f:
        json.dump(manifest, f, indent=2)
    print(f"✅ Saved: manifest_{lang}.json")

# extract_commonvoice("hi", "data/")
```

## /tools/speaker_encoder.py

```py path="/tools/speaker_encoder.py" 
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

class SpeakerEncoder:
    def __init__(self, device="cuda" if torch.cuda.is_available() else "cpu"):
        self.model = EncoderClassifier.from_hparams(
            # Conformer‑Based Multilingual Speaker Encoder
            # Model: tflite-hub/conformer-speaker-encoder
            # Architecture: Conformer | Loss: GE2E | Languages: 100+
            source="tflite-hub/conformer-speaker-encoder",
            run_opts={"device": device}
        )

    def encode(self, wav_path):
        signal, fs = torchaudio.load(wav_path)
        if fs != 16000:
            signal = torchaudio.functional.resample(signal, fs, 16000)
        embedding = self.model.encode_batch(signal).squeeze(0)
        return embedding.detach()

# Usage:
# enc = SpeakerEncoder()
# spk_vec = enc.encode("sample.wav")
```

## /tools/test_manifest_gen.py

```py path="/tools/test_manifest_gen.py" 
import json
from collections import defaultdict
from pathlib import Path
import random

def generate_test_manifest(train_manifest_path, output_path, samples_per_lang=5):
    with open(train_manifest_path) as f:
        data = json.load(f)

    lang_buckets = defaultdict(list)
    for item in data:
        lang_buckets[item["lang"]].append(item)

    test_manifest = []
    for lang, items in lang_buckets.items():
        test_manifest += random.sample(items, min(samples_per_lang, len(items)))

    with open(output_path, "w") as f:
        json.dump(test_manifest, f, indent=2)

    print(f"✅ test_manifest.json created with {len(test_manifest)} samples.")

# Usage:
# generate_test_manifest("train_manifest.json", "test_manifest.json")
```


The better and more specific the context, the better the LLM can follow instructions. If the context seems verbose, the user can refine the filter using uithub. Thank you for using https://uithub.com - Perfect LLM context for any GitHub repo.