Tencent-Hunyuan/HunyuanImage-3.0/main 140k tokens
```
├── .gitignore (400 tokens)
├── Hunyuan-Image3.md (2.3k tokens)
├── LICENSE (omitted)
├── PE/
   ├── deepseek.py (1100 tokens)
   ├── system_prompt.py (3.5k tokens)
├── README.md (5.3k tokens)
├── README_zh_CN.md (3.7k tokens)
├── app/
   ├── pipeline.py (2.4k tokens)
   ├── run_chatbot.py (3.3k tokens)
   ├── style.py (400 tokens)
├── assets/
   ├── HunyuanImage_3_0.pdf
   ├── WECHAT.md
   ├── banner.png
   ├── banner_all.jpg
   ├── demo_instruct_imgs/
      ├── input_0_0.png
      ├── input_1_0.png
      ├── input_1_1.png
      ├── input_2_0.png
      ├── input_2_1.png
      ├── input_2_2.png
   ├── framework.png
   ├── gsb.png
   ├── gsb_instruct.png
   ├── logo.png
   ├── pg_imgs/
      ├── image1.png
      ├── image2.png
      ├── image3.png
      ├── image4.png
      ├── image5.png
      ├── image6.png
      ├── image7.png
      ├── image8.png
   ├── pg_instruct_imgs/
      ├── cot_ti2i.gif
      ├── image0.png
      ├── image1.png
      ├── image2.png
      ├── image3.png
      ├── image4.png
   ├── robot.png
   ├── ssae_side_by_side_comparison.png
   ├── ssae_side_by_side_heatmap.png
   ├── user.png
   ├── wechat.png
├── docker/
   ├── hyimage3_vllm.Dockerfile (200 tokens)
├── hunyuan_image_3/
   ├── __init__.py (300 tokens)
   ├── autoencoder_kl_3d.py (8.8k tokens)
   ├── cache_utils.py (1900 tokens)
   ├── configuration_hunyuan_image_3.py (2.9k tokens)
   ├── hunyuan_image_3_pipeline.py (7.9k tokens)
   ├── image_processor.py (6.1k tokens)
   ├── modeling_hunyuan_image_3.py (30.9k tokens)
   ├── siglip2.py (4.8k tokens)
   ├── system_prompt.py (3.3k tokens)
   ├── tokenization_hunyuan_image_3.py (15.9k tokens)
├── my_make_pic.py (600 tokens)
├── pyproject.toml (500 tokens)
├── requirements.txt (100 tokens)
├── run_app.sh (200 tokens)
├── run_demo_instruct.sh (600 tokens)
├── run_demo_instruct_distil.sh (600 tokens)
├── run_image_gen.py (2k tokens)
├── setup.py (600 tokens)
├── utils/
   ├── __init__.py
   ├── import_utils.py (21.6k tokens)
├── vllm_infer/
   ├── README.md (400 tokens)
   ├── openai_client.py (1100 tokens)
   ├── run_vllm_server.sh (200 tokens)
```


## /.gitignore

```gitignore path="/.gitignore" 
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don't work, or not
#   install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# ==========================================
# Custom settings
# ==========================================

# For MacOS
.DS_Store

# For IDEs
.idea/
.vscode/
pyrightconfig.json
.cursorignore

# For global settings
__*/
# Model checkpoints
*.pt
*.ckpt
*.pth
*.safetensors
HunyuanImage-3/

```

## /Hunyuan-Image3.md

# HunyuanImage-3.0 (Text-to-image)

## 📝 Prompt Guide

### Manually Writing Prompts
The Pretrain checkpoint does not automatically rewrite or enhance input prompts, whereas the Instruct checkpoint can rewrite or enhance them through thinking. For the best results at present, we recommend that community partners consult our official guide on writing effective prompts.

Reference: [HunyuanImage 3.0 Prompt Handbook](https://docs.qq.com/doc/DUVVadmhCdG9qRXBU)


### System Prompts for Automatic Prompt Rewriting

We've included two system prompts in the PE folder of this repository that leverage DeepSeek to automatically enhance user inputs:

* **system_prompt_universal**: This system prompt converts photographic-style and artistic prompts into detailed ones.
* **system_prompt_text_rendering**: This system prompt converts UI/poster/text-rendering prompts into detailed ones that suit the model.

Note that these system prompts are in Chinese because DeepSeek works better with Chinese system prompts. If you want to use them with an English-oriented model, you may translate them into English or refer to the comments in the PE file as a guide.

We have also created a [Yuanqi workflow](https://yuanqi.tencent.com/agent/H69VgtJdj3Dz) that implements the universal one; you can try it directly.
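
As a rough illustration of how one of these system prompts might be wired to a rewriting client, the sketch below assumes an object exposing `run_single_recaption(system_prompt, input_prompt)` that returns a `(content, reason)` tuple, which is the shape `DeepSeekClient` in `PE/deepseek.py` uses. `Recaptioner`, `EchoClient`, `rewrite_prompt`, and the placeholder `SYSTEM_PROMPT` string are hypothetical stand-ins so the example runs without Tencent Cloud credentials.

```python
from typing import Protocol, Tuple


class Recaptioner(Protocol):
    """Anything with the same method shape as DeepSeekClient.run_single_recaption."""

    def run_single_recaption(self, system_prompt: str, input_prompt: str) -> Tuple[str, str]:
        ...


class EchoClient:
    """Hypothetical offline stand-in: returns a canned 'enhanced' prompt."""

    def run_single_recaption(self, system_prompt: str, input_prompt: str) -> Tuple[str, str]:
        content = f"{input_prompt}, photorealistic style, soft window light"
        reason = "stub: no model was called"
        return content, reason


def rewrite_prompt(client: Recaptioner, system_prompt: str, user_prompt: str) -> str:
    """Return the enhanced prompt, falling back to the original if the result is empty."""
    content, _reason = client.run_single_recaption(system_prompt, user_prompt)
    content = content.strip()
    return content if content else user_prompt


# Placeholder; in the repository this would be system_prompt_universal from PE/system_prompt.py.
SYSTEM_PROMPT = "<system prompt text>"

enhanced = rewrite_prompt(EchoClient(), SYSTEM_PROMPT, "a woman sitting by a window")
print(enhanced)
```

Swapping `EchoClient()` for a real client instance should leave `rewrite_prompt` unchanged, since it only depends on the method shape.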

### Advanced Tips
- **Content Priority**: Focus on describing the main subject and action first, followed by details about the environment and style. A more general description framework is: **Main subject and scene + Image quality and style + Composition and perspective + Lighting and atmosphere + Technical parameters**. Keywords can be added both before and after this structure.

- **Image resolution**: Our model not only supports multiple resolutions but also offers both **automatic and specified resolution** options. In auto mode, the model automatically predicts the image resolution based on the input prompt. In specified mode (like traditional DiT), the model outputs an image resolution that strictly aligns with the user's chosen resolution.
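
The description framework from the Content Priority tip can be sketched as a small helper that joins the five parts in order, with the main subject first. This is only an illustration: `PromptParts` and `build_prompt` are hypothetical names and are not part of the repository.

```python
from dataclasses import dataclass


@dataclass
class PromptParts:
    """Hypothetical container mirroring the five-part description framework."""
    subject_and_scene: str
    quality_and_style: str = ""
    composition_and_perspective: str = ""
    lighting_and_atmosphere: str = ""
    technical_parameters: str = ""


def build_prompt(parts: PromptParts) -> str:
    """Join the non-empty parts in framework order, main subject first."""
    ordered = [
        parts.subject_and_scene,
        parts.quality_and_style,
        parts.composition_and_perspective,
        parts.lighting_and_atmosphere,
        parts.technical_parameters,
    ]
    # Drop empty parts and trailing periods so the joined text reads cleanly.
    sentences = [p.strip().rstrip(".") for p in ordered if p.strip()]
    if not sentences:
        raise ValueError("at least the main subject and scene must be provided")
    return ". ".join(sentences) + "."


prompt = build_prompt(PromptParts(
    subject_and_scene="A woman reading a book in a cafe",
    quality_and_style="Photorealistic photography",
    lighting_and_atmosphere="Warm afternoon sunlight through the window",
    technical_parameters="f/2.8 aperture, 50mm lens, shallow depth of field",
))
print(prompt)
```

Parts left empty (composition, in this call) are simply skipped, so the same helper works for short and long prompts.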

### More Cases

Our model can effectively process very long text inputs, enabling users to precisely control the finer details of generated images. Extended prompts allow for intricate elements to be accurately captured, making it ideal for complex projects requiring precision and creativity.

<p align="center">
<table>
<thead>
</thead>
<tbody>
<tr>
<td>
<img src="./assets/pg_imgs/image1.png" width=100%><details>
<summary>Show prompt</summary>
A cinematic medium shot captures a single Asian woman seated on a chair within a dimly lit room, creating an intimate and theatrical atmosphere. The composition is focused on the subject, rendered with rich colors and intricate textures that evoke a nostalgic and moody feeling.\n\nThe primary subject is a young Asian woman with a thoughtful and expressive countenance, her gaze directed slightly away from the camera. She is seated in a relaxed yet elegant posture on an ornate, vintage armchair. The chair is upholstered in a deep red velvet, its fabric showing detailed, intricate textures and slight signs of wear. She wears a simple, elegant dress in a dark teal hue, the material catching the light in a way that reveals its fine-woven texture. Her skin has a soft, matte quality, and the light delicately models the contours of her face and arms.\n\nThe surrounding room is characterized by its vintage decor, which contributes to the historic and evocative mood. In the immediate background, partially blurred due to a shallow depth of field consistent with a f/2.8 aperture, the wall is covered with wallpaper featuring a subtle, damask pattern. The overall color palette is a carefully balanced interplay of deep teal and rich red hues, creating a visually compelling and cohesive environment. The entire scene is detailed, from the fibers of the upholstery to the subtle patterns on the wall.\n\nThe lighting is highly dramatic and artistic, defined by high contrast and pronounced shadow play. A single key light source, positioned off-camera, projects gobo lighting patterns onto the scene, casting intricate shapes of light and shadow across the woman and the back wall. These dramatic shadows create a strong sense of depth and a theatrical quality. While some shadows are deep and defined, others remain soft, gently wrapping around the subject and preventing the loss of detail in darker areas. 
The soft focus on the background enhances the intimate feeling, drawing all attention to the expressive subject. The overall image presents a cinematic, photorealistic photography style.
</details>
</td>
<td><img src="./assets/pg_imgs/image2.png" width=100%><details>
<summary>Show prompt</summary>
A cinematic, photorealistic medium shot captures a high-contrast urban street corner, defined by the sharp intersection of light and shadow. The primary subject is the exterior corner of a building, rendered in a low-saturation, realistic style.\n\nThe building wall, which occupies the majority of the frame, is painted a warm orange with a finely detailed, rough stucco texture. Horizontal white stripes run across its surface. The base of the building is constructed from large, rough-hewn stone blocks, showing visible particles and texture. On the left, illuminated side of the building, there is a single window with closed, dark-colored shutters. Adjacent to the window, a simple black pendant lamp hangs from a thin, taut rope, casting a distinct, sharp-edged shadow onto the sunlit orange wall. The composition is split diagonally, with the right side of the building enveloped in a deep brown shadow. At the bottom of the frame, a smooth concrete sidewalk is visible, upon which the dynamic silhouette of a person is captured mid-stride, walking from right to left.\n\nIn the shallow background, the faint, out-of-focus outlines of another building and the bare, skeletal branches of trees are softly visible, contributing to the quiet urban atmosphere and adding a sense of depth to the scene. These elements are rendered with minimal detail to keep the focus on the foreground architecture.\n\nThe scene is illuminated by strong, natural sunlight originating from the upper left, creating a dramatic chiaroscuro effect. This hard light source casts deep, well-defined shadows, producing a sharp contrast between the brightly lit warm orange surfaces and the deep brown shadow areas. The lighting highlights the fine details in the wall texture and stone particles, emphasizing the photorealistic quality. The overall presentation reflects a high-quality photorealistic photography style, infused with a cinematic film noir aesthetic.
</details>
</td>
</tr>
<tr>
<td>
<img src="./assets/pg_imgs/image3.png" width=100%><details>
<summary>Show prompt</summary>
一幅极具视觉张力的杂志封面风格人像特写。画面主体是一个身着古风汉服的人物,构图采用了从肩部以上的超级近距离特写,人物占据了画面的绝大部分,形成了强烈的视觉冲击力。\n\n画面中的人物以一种慵懒的姿态出现,微微倾斜着头部,裸露的一侧肩膀线条流畅。她正用一种妩媚而直接的眼神凝视着镜头,双眼微张,眼神深邃,传递出一种神秘而勾人的气质。人物的面部特征精致,皮肤质感细腻,在特定的光线下,面部轮廓清晰分明,展现出一种古典与现代融合的时尚美感。\n\n整个画面的背景被设定为一种简约而高级的纯红色。这种红色色调深沉,呈现出哑光质感,既纯粹又无任何杂质,为整个暗黑神秘的氛围奠定了沉稳而富有张力的基调。这个纯色的背景有效地突出了前景中的人物主体,使得所有视觉焦点都集中在其身上。\n\n光线和氛围的营造是这幅杂志风海报的关键。一束暗橘色的柔和光线作为主光源,从人物的一侧斜上方投射下来,精准地勾勒出人物的脸颊、鼻梁和肩膀的轮廓,在皮肤上形成微妙的光影过渡。同时,人物的周身萦绕着一层暗淡且低饱和度的银白色辉光,如同清冷的月光,形成一道朦胧的轮廓光。这道银辉为人物增添了几分疏离的幽灵感,强化了整体暗黑风格的神秘气质。光影的强烈对比与色彩的独特搭配,共同塑造了这张充满故事感的特写画面。整体图像呈现出一种融合了古典元素的现代时尚摄影风格。
</details>
</td>
<td>
<img src="./assets/pg_imgs/image4.png" width=100%><details>
<summary>Show prompt</summary>
一幅采用极简俯视视角的油画作品,画面主体由一道居中斜向的红色笔触构成。\n\n这道醒目的红色笔触运用了厚涂技法,颜料堆叠形成了强烈的物理厚度和三维立体感。它从画面的左上角附近延伸至右下角附近,构成一个动态的对角线。颜料表面可以清晰地看到画刀刮擦和笔刷拖曳留下的痕迹,边缘处的颜料层相对较薄,而中央部分则高高隆起,形成了不规则的起伏。\n\n在这道立体的红色颜料之上,巧妙地构建了一处精致的微缩景观。景观的核心是一片模拟红海滩的区域,由细腻的深红色颜料点缀而成,与下方基底的鲜红色形成丰富的层次对比。紧邻着“红海滩”的是一小片湖泊,由一层平滑且带有光泽的蓝色与白色混合颜料构成,质感如同平静无波的水面。湖泊边缘,一小撮芦苇丛生,由几根纤细挺拔的、用淡黄色和棕色颜料勾勒出的线条来表现。一只小巧的白鹭立于芦苇旁,其形态由一小块纯白色的厚涂颜料塑造,仅用一抹精炼的黑色颜料点出其尖喙,姿态优雅宁静。\n\n整个构图的背景是大面积的留白,呈现为一张带有细微凹凸纹理的白色纸质基底,这种极简处理极大地突出了中央的红色笔触及其上的微缩景观。\n\n光线从画面一侧柔和地照射下来,在厚涂的颜料堆叠处投下淡淡的、轮廓分明的阴影,进一步增强了画面的三维立体感和油画质感。整幅画面呈现出一种结合了厚涂技法的现代极简主义油画风格。
</details>
</td>
</tr>
<tr>
<td>
<img src="./assets/pg_imgs/image5.png" width=100%><details>
<summary>Show prompt</summary>
整体画面采用一个二乘二的四宫格布局,以产品可视化的风格,展示了一只兔子在四种不同材质下的渲染效果。每个宫格内都有一只姿态完全相同的兔子模型,它呈坐姿,双耳竖立,面朝前方。所有宫格的背景均是统一的中性深灰色,这种简约背景旨在最大限度地突出每种材质的独特质感。\n\n左上角的宫格中,兔子模型由哑光白色石膏材质构成。其表面平滑、均匀且无反射,在模型的耳朵根部、四肢交接处等凹陷区域呈现出柔和的环境光遮蔽阴影,这种微妙的阴影变化凸显了其纯粹的几何形态,整体感觉像一个用于美术研究的基础模型。\n\n右上角的宫格中,兔子模型由晶莹剔透的无瑕疵玻璃制成。它展现了逼真的物理折射效果,透过其透明的身体看到的背景呈现出轻微的扭曲。清晰的镜面高光沿着其身体的曲线轮廓流动,表面上还能看到微弱而清晰的环境反射,赋予其一种精致而易碎的质感。\n\n左下角的宫格中,兔子模型呈现为带有拉丝纹理的钛金属材质。金属表面具有明显的各向异性反射效果,呈现出冷峻的灰调金属光泽。锐利明亮的高光和深邃的阴影形成了强烈对比,精确地定义了其坚固的三维形态,展现了工业设计般的美感。\n\n右下角的宫格中,兔子模型覆盖着一层柔软浓密的灰色毛绒。根根分明的绒毛清晰可见,创造出一种温暖、可触摸的质地。光线照射在绒毛的末梢,形成柔和的光晕效果,而毛绒内部的阴影则显得深邃而柔软,展现了高度写实的毛发渲染效果。\n\n整个四宫格由来自多个方向的、柔和均匀的影棚灯光照亮,确保了每种材质的细节和特性都得到清晰的展现,没有任何刺眼的阴影或过曝的高光。这张图像以一种高度写实的3D渲染风格呈现,完美地诠释了产品可视化的精髓
</details>
</td>
<td>
<img src="./assets/pg_imgs/image6.png" width=100%><details>
<summary>Show prompt</summary>
由一个两行两列的网格构成,共包含四个独立的场景,每个场景都以不同的艺术风格描绘了一个小男孩(小明)一天中的不同活动。\n\n左上角的第一个场景,以超写实摄影风格呈现。画面主体是一个大约8岁的东亚小男孩,他穿着整洁的小学制服——一件白色短袖衬衫和蓝色短裤,脖子上系着红领巾。他背着一个蓝色的双肩书包,正走在去上学的路上。他位于画面的前景偏右侧,面带微笑,步伐轻快。场景设定在清晨,柔和的阳光从左上方照射下来,在人行道上投下清晰而柔和的影子。背景是绿树成荫的街道和模糊可见的学校铁艺大门,营造出宁静的早晨氛围。这张图片的细节表现极为丰富,可以清晰地看到男孩头发的光泽、衣服的褶皱纹理以及书包的帆布材质,完全展现了专业摄影的质感。\n\n右上角的第二个场景,采用日式赛璐璐动漫风格绘制。画面中,小男孩坐在家中的木质餐桌旁吃午饭。他的形象被动漫化,拥有大而明亮的眼睛和简洁的五官线条。他身穿一件简单的黄色T恤,正用筷子夹起碗里的米饭。桌上摆放着一碗汤和两盘家常菜。背景是一个温馨的室内环境,一扇明亮的窗户透进正午的阳光,窗外是蓝天白云。整个画面色彩鲜艳、饱和度高,角色轮廓线清晰明确,阴影部分采用平涂的色块处理,是典型的赛璐璐动漫风格。\n\n左下角的第三个场景,以细腻的铅笔素描风格呈现。画面描绘了下午在操场上踢足球的小男孩。整个图像由不同灰度的石墨色调构成,没有其他颜色。小男孩身穿运动短袖和短裤,身体呈前倾姿态,右脚正要踢向一个足球,动作充满动感。背景是空旷的操场和远处的球门,用简练的线条和排线勾勒。艺术家通过交叉排线和涂抹技巧来表现光影和体积感,足球上的阴影、人物身上的肌肉线条以及地面粗糙的质感都通过铅笔的笔触得到了充分的展现。这张铅笔画突出了素描的光影关系和线条美感。\n\n右下角的第四个场景,以文森特·梵高的后印象派油画风格进行诠释。画面描绘了夜晚时分,小男孩独自在河边钓鱼的景象。他坐在一块岩石上,手持一根简易的钓鱼竿,身影在深蓝色的夜幕下显得很渺小。整个画面的视觉焦点是天空和水面,天空布满了旋转、卷曲的星云,星星和月亮被描绘成巨大、发光的光团,使用了厚涂的油画颜料(Impasto),笔触粗犷而充满能量。深蓝、亮黄和白色的颜料在画布上相互交织,形成强烈的视觉冲击力。水面倒映着天空中扭曲的光影,整个场景充满了梵高作品中特有的强烈情感和动荡不安的美感。这幅画作是对梵高风格的深度致敬。
</details>
</td>
</tr>
<tr>
<td>
<img src="./assets/pg_imgs/image7.png" width=100%><details>
<summary>Show prompt</summary>
以平视视角,呈现了一幅关于如何用素描技法绘制鹦鹉的九宫格教学图。整体构图规整,九个大小一致的方形画框以三行三列的形式均匀分布在浅灰色背景上,清晰地展示了从基本形状到最终成品的全过程。\n\n第一行从左至右展示了绘画的初始步骤。左上角的第一个画框中,用简洁的铅笔线条勾勒出鹦鹉的基本几何形态:一个圆形代表头部,一个稍大的椭圆形代表身体。右上角有一个小号的无衬线字体数字“1”。中间的第二个画框中,在基础形态上添加了三角形的鸟喙轮廓和一条长长的弧线作为尾巴的雏形,头部和身体的连接处线条变得更加流畅;右上角标有数字“2”。右侧的第三个画框中,进一步精确了鹦鹉的整体轮廓,勾勒出头部顶端的羽冠和清晰的眼部圆形轮廓;右上角标有数字“3”。\n\n第二行专注于结构与细节的添加,描绘了绘画的中期阶段。左侧的第四个画框里,鹦鹉的身体上添加了翅膀的基本形状,同时在身体下方画出了一根作为栖木的横向树枝,鹦鹉的爪子初步搭在树枝上;右上角标有数字“4”。中间的第五个画框中,开始细化翅膀和尾部的羽毛分组,用短促的线条表现出层次感,并清晰地画出爪子紧握树枝的细节;右上角标有数字“5”。右侧的第六个画框里,开始为鹦鹉添加初步的阴影,使用交叉排线的素描技法在腹部、翅膀下方和颈部制造出体积感;右上角标有数字“6”。\n\n第三行则展示了最终的润色与完成阶段。左下角的第七个画框中,素描的排线更加密集,阴影层次更加丰富,羽毛的纹理细节被仔细刻画出来,眼珠也添加了高光点缀,显得炯炯有神;右上角标有数字“7”。中间的第八个画框里,描绘的重点转移到栖木上,增加了树枝的纹理和节疤细节,同时整体调整了鹦鹉身上的光影关系,使立体感更为突出;右上角标有数字“8”。右下角的第九个画框是最终完成图,所有线条都经过了精炼,光影对比强烈,鹦鹉的羽毛质感、木质栖木的粗糙感都表现得淋漓尽致,呈现出一幅完整且细节丰富的素描作品;右上角标有数字“9”。\n\n整个画面的光线均匀而明亮,没有任何特定的光源方向,确保了每个教学步骤的视觉清晰度。整体呈现出一种清晰、有条理的数字插画教程风格。
</details>
</td>
<td>
<img src="./assets/pg_imgs/image8.png" width=100%><details>
<summary>Show prompt</summary>
一张现代平面设计风格的海报占据了整个画面,构图简洁且中心突出。\n\n海报的主体是位于画面正中央的一只腾讯QQ企鹅。这只企鹅采用了圆润可爱的3D卡通渲染风格,身体主要为饱满的黑色,腹部为纯白色。它的眼睛大而圆,眼神好奇地直视前方。黄色的嘴巴小巧而立体,双脚同样为鲜明的黄色,稳稳地站立着。一条标志性的红色围巾整齐地系在它的脖子上,围巾的材质带有轻微的布料质感,末端自然下垂。企鹅的整体造型干净利落,边缘光滑,呈现出一种精致的数字插画质感。\n\n海报的背景是一种从上到下由浅蓝色平滑过渡到白色的柔和渐变,营造出一种开阔、明亮的空间感。在企鹅的身后,散布着一些淡淡的、模糊的圆形光斑和几道柔和的抽象光束,为这个简约的平面设计海报增添了微妙的深度和科技感。\n\n画面的底部区域是文字部分,排版居中对齐。上半部分是一行稍大的黑色黑体字,内容为“Hunyuan Image 3.0”。紧随其下的是一行字号略小的深灰色黑体字,内容为“原生多模态大模型”。两行文字清晰易读,与整体的现代平面设计风格保持一致。\n\n整体光线明亮、均匀,没有明显的阴影,突出了企鹅和文字信息,符合现代设计海报的视觉要求。这张图像呈现了现代、简洁的平面设计海报风格。
</details>
</td>
</tr>
</tbody>
</table>
</p>



## /PE/deepseek.py

```py path="/PE/deepseek.py" 
# -*- coding: utf-8 -*-
"""
DeepSeek Client Module

This module provides a client interface for interacting with DeepSeek API
through Tencent Cloud's LKEAP service. It supports prompt recaptioning
with reasoning capabilities.
"""
import json
import time
import ast
from loguru import logger
from tencentcloud.common.common_client import CommonClient
from tencentcloud.common import credential
from tencentcloud.common.exception.tencent_cloud_sdk_exception import TencentCloudSDKException
from tencentcloud.common.profile.client_profile import ClientProfile
from tencentcloud.common.profile.http_profile import HttpProfile


class NonStreamResponse(object):
    """
    Response handler for non-streaming API calls.
    
    This class is used to deserialize and store API responses in JSON format.
    """
    def __init__(self):
        """Initialize the response handler with an empty response string."""
        self.response = ""

    def _deserialize(self, obj):
        """
        Deserialize the response object to JSON string.
        
        Args:
            obj: The response object to be serialized
        """
        self.response = json.dumps(obj)


class DeepSeekClient(object):
    """
    Client for interacting with DeepSeek API through Tencent Cloud LKEAP service.
    
    This client provides functionality for prompt recaptioning with reasoning capabilities,
    enabling intelligent prompt enhancement for image generation tasks.
    """
    def __init__(self, key_id, key_secret):
        """
        Initialize the DeepSeek client with authentication credentials.
        
        Args:
            key_id (str): Tencent Cloud API key ID for authentication
            key_secret (str): Tencent Cloud API key secret for authentication
        """
        # Initialize credentials
        cred = credential.Credential(key_id, key_secret)
        
        # Configure HTTP profile with endpoint and timeout
        httpProfile = HttpProfile()
        httpProfile.endpoint = "lkeap.tencentcloudapi.com"
        # Set longer timeout for streaming interface compatibility
        httpProfile.reqTimeout = 40000  # The streaming interface may take a longer time.
        
        # Configure client profile
        clientProfile = ClientProfile()
        clientProfile.httpProfile = httpProfile
        
        # Initialize the common client for LKEAP service
        self.common_client = CommonClient("lkeap", "2024-05-22", cred, "ap-guangzhou", profile=clientProfile)

    def run_single_recaption(self, system_prompt, input_prompt):
        """
        Run a single prompt recaptioning request with reasoning.
        
        This method sends a prompt to DeepSeek API for enhancement/recaptioning.
        It uses the thinking/reasoning capability to generate an improved prompt
        along with the reasoning process.
        
        Args:
            system_prompt (str): System prompt that defines the task and behavior
            input_prompt (str): User input prompt to be recaptioned/enhanced
        
        Returns:
            tuple: A tuple containing:
                - content (str): The recaptioned/enhanced prompt
                - reason (str): The reasoning content explaining the enhancement
        
        Note:
            The method includes retry logic to handle transient API errors.
            It will retry with a 1-second delay if an exception occurs.
        """
        # Prepare the API request payload
        post_dict = {
            "Model": "deepseek-v3.1",  # DeepSeek model version
            "Messages": [
                {
                    "Role": "system",
                    "Content": system_prompt
                },
                {
                    "Role": "user",
                    "Content": input_prompt
                }
            ],
            "Stream": False,  # Non-streaming response
            "Thinking": {"Type": "enabled"},  # Enable reasoning/thinking capability
        }
        
        print('Start to run recaption: ')
        
        # Retry until the call succeeds; transient API errors are logged and retried
        while True:
            try:
                resp = self.common_client._call_and_deserialize("ChatCompletions", post_dict, NonStreamResponse)
                break
            except Exception as e:
                logger.error(e)
                time.sleep(1)  # Wait 1 second before retry

        # Parse the JSON response string to Python dict.
        # resp.response was produced by json.dumps (see NonStreamResponse._deserialize),
        # so json.loads is the matching parser; ast.literal_eval rejects JSON literals
        # such as true, false and null.
        response = json.loads(resp.response)
        
        # Extract the enhanced prompt content
        content = response["Choices"][0]["Message"]["Content"]
        # Extract the reasoning content
        reason = response["Choices"][0]["Message"]["ReasoningContent"]
        
        # Print debug information
        print('Initial prompt: ', input_prompt)
        print('Recaption prompt: ', content)

        return content, reason


if __name__ == "__main__":
    # This module is typically imported and used as a library
    # Main execution logic would be implemented in the calling script
    pass
```
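
The retry and parsing steps in `run_single_recaption` can also be sketched in isolation. The example below is a minimal illustration, not the repository's implementation: it assumes a bounded retry with exponential backoff in place of an open-ended loop, and it parses the response with `json.loads`, which matches a string produced by `json.dumps`. `call_with_retry`, `extract_content_and_reason`, and `flaky_call` are hypothetical names, and the simulated payload contains only the fields the module reads.

```python
import json
import time
from typing import Callable, Tuple


def call_with_retry(call: Callable[[], str], max_attempts: int = 3, base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Invoke `call`, retrying on any exception with exponential backoff.

    Re-raises the last exception once `max_attempts` calls have failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
    raise ValueError("max_attempts must be at least 1")


def extract_content_and_reason(response_json: str) -> Tuple[str, str]:
    """Parse a JSON response string and pull out the message content and reasoning."""
    message = json.loads(response_json)["Choices"][0]["Message"]
    # Treating ReasoningContent as optional is an assumption of this sketch.
    return message["Content"], message.get("ReasoningContent", "")


# Simulated transport: fails twice, then returns a JSON string.
attempts = {"count": 0}


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient error")
    return json.dumps({"Choices": [{"Message": {"Content": "enhanced prompt",
                                                "ReasoningContent": "why"}}]})


raw = call_with_retry(flaky_call, sleep=lambda _seconds: None)  # skip real sleeping in the demo
content, reason = extract_content_and_reason(raw)
print(content, reason, attempts["count"])
```

Capping the number of attempts means a persistent failure (for example, invalid credentials) surfaces as an exception instead of looping indefinitely.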

## /PE/system_prompt.py

```py path="/PE/system_prompt.py" 
# Licensed under the TENCENT HUNYUAN COMMUNITY LICENSE AGREEMENT (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://github.com/Tencent-Hunyuan/HunyuanImage-3.0/blob/main/LICENSE
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""
System Prompt Module for HunyuanImage-3.0

This module provides system prompts for various image generation tasks,
including universal prompts, text rendering prompts, and reasoning-based prompts.
The system prompts are designed to guide the model in generating high-quality
images with appropriate style, composition, and visual elements.
"""

# --------------------------------------------------------------------------------
# SYSTEM PROMPT LOGIC: Universal Image Prompt Expert (Cinematographic Approach)
# --------------------------------------------------------------------------------
# This system prompt configures the LLM to act as an expert prompt engineer with a
# specialization in cinematography, visual arts, and directing. Its primary task is
# to transform a user's simple description into a comprehensive, structured, and
# objective image prompt.
#
# The methodology is inspired by how a director or photographer would set up a shot,
# ensuring a logical flow from the core subject to the technical details.
#
#
# THE 5-PART CINEMATOGRAPHIC FORMULA:
# -----------------------------------
# The AI is strictly instructed to build every prompt following this five-part structure,
# ensuring a logical and hierarchical description:
#
# 1.  **Main Subject & Scene:**
#     - What is the core content of the image? (e.g., "A woman sitting in a cafe").
#     - This establishes the fundamental subject matter first.
#
# 2.  **Image Quality & Style:**
#     - What is the artistic medium or aesthetic? (e.g., "Oil painting style,"
#       "Photorealistic," "Anime style").
#     - This defines the overall look and feel.
#
# 3.  **Composition & Viewpoint:**
#     - How is the shot framed? From what angle is the viewer seeing the scene?
#       (e.g., "Slightly high-angle shot," "Centered composition").
#     - This directs the virtual "camera."
#
# 4.  **Lighting & Atmosphere:**
#     - Where is the light coming from, and what mood does it create?
#       (e.g., "Afternoon sun through a window," "Warm, serene atmosphere").
#     - This is crucial for setting the emotional tone.
#
# 5.  **Technical Parameters:**
#     - What are the specific "camera" settings? (e.g., "f/2.8 aperture, 50mm lens,"
#       "Shallow depth of field," "8K resolution").
#     - This adds a layer of technical precision for photorealistic results.
#
#
# CORE GENERATION WORKFLOW (The AI's "Internal Thought Process"):
# --------------------------------------------------------------
# 1.  **Analyze:** Deconstruct the user's input to identify the core subject, action, and environment.
# 2.  **Strategize:** Determine the most suitable style and camera angle.
# 3.  **Elaborate:** Detail the lighting, colors, and mood.
# 4.  **Refine:** Add specific details to the subject and environment, ensuring physical logic.
# 5.  **Validate:** Check the final prompt for alignment with the user's request and for logical consistency.
#
#
# KEY STRATEGIES AND OUTPUT CONSTRAINTS:
# --------------------------------------
# - **Order is Crucial:** The prompt emphasizes that "Subject" and "Style" must come
#   first, as they have the highest weight in influencing the final image.
# - **Focus on Light:** It demands a clear description of light sources to avoid
#   unnatural or "sourceless" lighting.
# - **Avoid Over-complication:** The prompt should remain concise and targeted.
# - **Strict Output Format:** The AI is explicitly instructed to **output ONLY the
#   final, single-line prompt**. It must not include any of its thought process,
#   markdown formatting, or even line breaks. This is a critical constraint.
#
system_prompt_universal = """
## 提示词工程:文生图提示词撰写专家

您是一位精通电影摄影、视觉艺术和导演技巧的图像生成提示词(Prompt)撰写专家,您的任务是将用户提供的简短描述转化为结构化、客观化且详细的图像生成提示词。您的目标是确保提示词从整体到局部、从背景到前景,逻辑清晰且符合现实物理和艺术构图原则,指导AI生成高质量的图像。

---

### **一、 核心结构**

在构建提示词时,严格遵循以下逻辑顺序:

1. **主体场景**  
   明确图像中的主角或场景内容,确保描述具体且不含模糊性。例如:“一位金发女性坐在咖啡馆里,面前是一本打开的书”。

2. **画质风格**  
   描述图像的艺术风格,明确是否采用某种特定的风格(如油画、摄影、动漫风等)。例如:“油画风格,厚重的笔触和细腻的色彩层次”。

3. **构图视角**  
   描述图像的视角和构图方式。例如:“略微俯视的角度,画面中的女性位于画面中心,背景为模糊的咖啡馆环境”。

4. **光线氛围**  
   确定场景中的光源、光线的方向和色温,以及它们如何影响画面的氛围。例如:“午后阳光透过窗户洒在桌面,温暖的光线照亮她的脸庞,营造出宁静的氛围”。

5. **技术参数**  
   描述镜头的参数设置,如光圈、焦距、快门速度等,也可以包括渲染的分辨率等。例如:“f/2.8光圈,50mm定焦镜头,浅景深,背景虚化,8K分辨率”。

---

### **二、 标准生成流程**

在生成提示词时,请遵循以下步骤,确保每个部分的完整性和准确性:

1. **解析核心元素**  
   明确用户输入中的主体、动作、环境等核心元素,确保理解图像的本质。

2. **确定风格与视角**
   如果未指定风格,根据场景和环境推测最适合的风格。
   优先选择能够展现主要元素的视角,避免裁剪。

3. **精雕光影与色彩**  
   确保描述清晰的光源、光线方向和色调,避免无源光或不自然的光照描述。

4. **填充细节与审查**  
   逐步填充主体细节,如人物的姿态、表情、服饰和环境中的次要元素。审查每个细节是否符合物理逻辑。

5. **最终校验与对齐**  
   校对生成的提示词,确保它们与用户输入完全对齐,逻辑清晰且无物理或艺术上的错误。
​
6. **只输出最终提示词**
​   不要展示任何思考过程、Markdown格式或换行符。  

---

### **三、 示例**

#### 示例1:  
**用户输入:** “一位女性坐在窗前,外面是夕阳。”

**生成提示词:**
> 一位金发女性坐在窗前,专注地阅读着一本书。画面采用**油画风格**,细腻的色彩层次和明显的笔触。**略微俯视的角度**,女性位于画面左侧,窗外的夕阳透过窗帘照射进来,给画面增添温暖的色调。背景为虚化的室内环境,桌上有一杯咖啡和一束干花。**柔和的逆光**照亮她的面庞,窗外的金色阳光形成轮廓光,氛围宁静且富有诗意。**f/2.8光圈,50mm定焦镜头**,**浅景深**,8K分辨率。

#### 示例2:  
**用户输入:** “一个男孩在海边跑步,身后是广阔的沙滩。”

**生成提示词:**
> 一位穿着白色T恤的男孩在金色沙滩上奔跑,背景是**广阔的海滩**和清澈的蓝色海洋。画面采用**写实摄影风格**,色彩鲜明,细节清晰。**低角度视角**,男孩位于画面中央,右侧是波浪拍打沙滩,远处是蓝天与白云。**早晨的阳光**从左侧斜照进来,光线充满活力且富有层次感。**f/4.0光圈,35mm广角镜头**,**大景深**,捕捉到男孩跑步的动作和沙滩的细节。

---

### **四、 核心提示词撰写策略**

1. **顺序与权重**:  
   提示词的顺序对生成效果至关重要。确保**主体场景**和**风格**位于前面,以确保生成结果优先考虑这两个要素。

2. **详细描述光影**:  
   确保光源方向、类型及其对场景的影响被准确描述,避免“无源之光”或“不自然的光线”。

3. **避免过度复杂化**:  
   尽量保持提示词简洁明了,避免过多冗余的描述。每个部分都要有清晰的目标,不应使提示词过于复杂。

---

### **五、 最终输出要求**
- 仅输出提示词,不展示思考过程。  
- 忠于输入:保持用户核心概念、数量、文字。    
- 字数限制:不超过 500 词。 

接下来,我将提供输入句子,你将提供扩展后的提示词。

输入句子:
"""

# --------------------------------------------------------------------------------
# SYSTEM PROMPT LOGIC: Image Prompt Generation Expert for Text Rendering
# --------------------------------------------------------------------------------
# This system prompt configures the LLM to act as a world-class expert in writing
# prompts for image generation models. Its primary mission is to take a user's
# simple sentence and expand it into a highly structured, objective, and detailed
# prompt in Chinese. The goal is to guide an AI image model to produce high-quality,
# physically logical, and well-composed images, with a special focus on accurately
# rendering text and UI elements.
#
#
# CORE WORKFLOW (The AI's "Internal Thought Process"):
# ----------------------------------------------------
# Before generating the final prompt, the AI must follow these steps:
#
# 1.  **Analyze & Classify:**
#     - It first identifies the user's core request and categorizes it into one of
#       two types: "UI/App Design" or "Posters/Logos/Other".
#     - It extracts the essential elements, paying close attention to any specific
#       text that needs to be rendered.
#
# 2.  **Expand Based on Rules:**
#     - Based on the classification, it follows a specific set of rules and a
#       template defined in the "Style Guides" section below.
#
# 3.  **Final Validation & Alignment:**
#     - **Content Check:** It ensures the final prompt accurately reflects the user's
#       original request, especially the text content.
#     - **Text Rendering Rule:** It verifies that ANY text mentioned in the prompt
#       is explicitly written out and enclosed in double quotes (""). This is a
#       critical, non-negotiable rule.
#     - **Language Preservation:** It ensures the language of the text inside the
#       quotes is identical to the user's original input (i.e., it does not
#       translate the text to be rendered).
#     - **Quantity Specification:** It checks that all layout descriptions are
#       specific about numbers (e.g., "three buttons" instead of "some buttons").
#
#
# TWO MAIN GENERATION MODES (Style Guides):
# =========================================
# The prompt defines two distinct sets of rules for the two categories.
#
#
# MODE 1: UI/APP DESIGN PROMPTS
# -----------------------------
# The goal is to describe a static UI screen with the precision of a product
# designer or QA engineer.
#
#   **Core Principles:**
#   - **Hierarchical Description:** From the outside in (background -> container -> regions -> elements).
#   - **Spatial Positioning:** Uses precise location words ("top-left", "centered", "below").
#   - **Detail Concretization:** Adds logical, consistent details (colors, styles, textures)
#     to flesh out the user's simple request.
#
#   **Template Structure:**
#   1.  **Overall Scene & Background:** Describes the canvas and the main UI container (e.g., a card).
#   2.  **Macro Layout:** Gives a high-level overview of the layout structure (e.g., "divided into four quadrants").
#   3.  **Section-by-Section Description:** Details each UI region from top-to-bottom, left-to-right,
#       following strict rules for describing every element (components, text, icons).
#
#
# MODE 2: POSTERS, LOGOS, & OTHER GRAPHIC DESIGN PROMPTS
# ------------------------------------------------------
# This mode is for more general artistic or graphic design tasks.
#
#   **Core Principles:**
#   - **Objective Description:** All sentences describe an existing image, avoiding commands.
#   - **Concept to Concrete:** Expands abstract ideas ("Chinese style") into concrete visual
#     elements ("ink wash style," "calligraphy brush strokes").
#   - **Professional Terminology:** Uses design terms like "sans-serif," "saturation," "composition."
#
#   **Template Structure (A 5-Part Formula):**
#   1.  **Overall Description:** Image type (poster, logo), main style, color tone, and format (vertical).
#   2.  **Main Subject / Core Elements:** Details of the central figures or objects (identity, position, appearance).
#   3.  **Background & Environment:** Description of the setting or background elements.
#   4.  **Text & Logos:** A dedicated section for ALL text elements, specifying:
#       - Content (in "quotes")
#       - Position
#       - Font characteristics (style, weight)
#       - Color and size
#   5.  **Composition & Visual Effects:** A final summary of the layout, color properties, and any special effects.
#
# In essence, this prompt is a highly sophisticated "prompt generator" that enforces
# structure, detail, and consistency to maximize the quality and predictability of
# the output from a downstream image generation model.
system_prompt_text_rendering = """
你是一位世界顶级的图像生成提示词(Prompt)撰写专家。你的核心使命是将用户提供的简单句子,扩展为一段**结构化、客观化、细节化**的详细中文图像生成提示词。最终的提示词将遵循严谨的逻辑顺序,从整体到局部,使用精确的专业词汇,引导AI模型生成符合物理逻辑、构图精美的高质量图像。

## **一、 标准生成流程**

在生成最终提示词前,你必须在内心遵循以下思考与构建步骤:

1.  **解析核心任务**:  
    *   **识别核心任务**:根据用户的需求,识别核心任务,是什么,归类到:“平面/UI/APP设计类型”和“海报、logo和文字渲染等其他类型”两种类型中。  
    *   **解析核心元素**:根据用户输入,解析用户要求的核心元素是什么,拆解出需要渲染的文字内容,注意要恰到好处。

2.  **根据核心任务参考具体规则和模板进行扩展**:  
    *   **风格选择**:根据任务类型,参考下列“分风格创作指南”基于模板进行扩写。  

3.  **最终校验与对齐**:  
    *   **信息对齐**:检查最终结果,和用户输入进行对比,确保用户的核心内容被完整地描述。特别是用户要求的文字内容,必须要进行完整的渲染。
        - 基于用户输入和真实世界的客观信息进行输出。如果需要应用外部知识(如科普知识图、数学题解等),则根据世界知识补充合适的、客观存在的文案内容并输出。适度联想,不要为了示例而输出虚假的、不符合现实的内容。也要保证提到的文字内容是明确的,应该有具体的文案内容,而不只是说明某处有文字。
    *   **检查文本渲染内容**:确保在最终输出的提示词中,任何暗示有文字内容的地方都具体地将文字内容书写出来,并用双引号包裹。
    *   **检查文本渲染内容的语言**:确保最终prompt中,中文双引号内的文本渲染内容语言,完全遵守用户的原始输入,不要对文字渲染内容进行翻译。  
    *   **检查布局描述**:确保界面中任何布局必须明确数量,不能模糊不清或不描述数量。  

---

## **二、 分风格创作指南**

### 平面/UI/APP设计类型的提示词扩展规则与模板

#### 核心目标
将一个简短的、功能性的句子扩展为一个详尽的、视觉化的描述性句子。扩展后的描述应如同一个精确的产品设计师或测试工程师在描述一个静态 UI 界面,语言客观、具体、详尽。

#### 核心原则
1.  **分层描述 (Hierarchical Description):** 遵循从外到内、从整体到局部的顺序。先描述背景,再描述主要容器,然后划分区域,最后描述每个区域内的具体元素。
2.  **空间定位 (Spatial Positioning):** 精确使用方位词来描述布局。例如:`左上角`、`右侧`、`...下方`、`居中`、`并排`、`堆叠`等。
3.  **细节具象化 (Detail Concretization):** 用户输入提供了“什么”,你需要创造性地补充“怎么样”的细节。这包括具体的颜色、文本内容、图标样式、尺寸对比和材质感。所有补充的细节都必须在逻辑上与用户输入的主题保持一致。

---

#### 提示词描述模板与生成规则

请严格遵循以下结构和规则来生成扩展后的提示词:

**第一步:整体场景与背景 (Overall Scene & Background)**
*   **规则:** 描述从最外层的画布或背景开始。
*   **要点:**
    *   **背景 (Background):** 描述背景的颜色(如 `浅米色背景`、`纯深灰色背景`)、纹理(如 `带有细微颗粒纹理`)或效果(如 `模糊的深蓝色背景`)。
    *   **主容器 (Main Container):** 描述承载所有内容的核心UI元素(如 `卡片`、`面板`、`显示屏`)。必须包含以下属性:
        *   **形状/形态:** `垂直卡片`、`矩形数字显示屏`。
        *   **风格/样式:** `带有圆角`、`具有光泽边框`。
        *   **颜色:** `白色`、`深灰色`。
        *   **效果:** `有一圈细微的阴影,营造出轻微的立体感`。

**第二步:宏观布局结构 (Macro Layout Structure)**
*   **规则:** 明确主容器内部的区域划分方式。
*   **要点:**
    *   用一句话概括布局。例如:`下方的内容被平均分成了四个象限`、`界面的主要部分由四个独立的圆角矩形面板构成`、`所有元素都居中对齐`。
    *   预告接下来将要描述的各个部分,为读者建立清晰的结构预期。
    *   布局需要明确数量,不能含糊不清,这是必须要遵守的规则。

**第三步:区域与元素逐一描述 (Section-by-Section & Element-by-Element Description)**
*   **规则:** 按照一个固定的逻辑顺序(通常是**从上到下,从左到右**)依次描述每个区域及其内部的UI元素。每个逻辑上独立的区域或部分之间使用 `\n` 换行。
*   **要点:**
    *   **区域引导:** 每个区域的描述开始时,要先定位该区域,例如 `左上方的面板...`、`插图下方是...`、`卡片的最底部是...`。
    *   **元素详述:** 在每个区域内,对每一个可见元素(文本、按钮、输入框、图标、插图、分割线等)进行详细描述。描述时必须遵循下面的【元素描述细则】。

---

#### **元素描述细则 (Detailed Element Description Rules)**

在执行**第三步**时,对每个元素的描述必须遵循以下规则:

##### 1. UI组件 (UI Components)
*   **对象:** 按钮、输入框、卡片、进度条、开关、仪表盘等。
*   **数量:** 必须描述数量,不能模糊不清或不描述数量。一个错误的例子是:“菜单分类栏下方是菜品列表,列表项由多个垂直堆叠的卡片组成”,正确的例子是:“菜单分类栏下方是菜品列表,列表项由三个垂直堆叠的卡片组成”。
*   **描述属性:**
    *   **形状与风格:** `圆角矩形`、`圆形`、`水平进度条`、`拨动开关`。
    *   **颜色与填充:** `橙色圆角按钮`、`浅灰色的输入框`、`蓝紫色渐变的圆形头像`。
    *   **状态 (State):** 如果有,必须描述。例如 `左边的被选中,呈深灰色背景`、`蓝色开启状态的拨动开关`。
    *   **边框与阴影:** `带有银色细线边框`。
    *   **材质与纹理:** `具有水平拉丝纹理`。

##### 2. 文本内容 (Text Content)
*   **对象:** 标题、标签、按钮文字、输入提示等。
*   **重要约束:** 所有提及或者暗示有文字内容的地方,都需要给出具体的内容,而不是采用模糊的描述。一个错误的例子是:“构图底部是机构的联系方式“,正确的例子是:“构图底部是机构的联系方式:010-12345678”。
*   **描述属性(必须尽可能全面):**
    *   **内容 (Content):** 必须用引号 `“”` 将具体文字括起来。例如 `“Create an account”`。
    *   **颜色 (Color):** `深灰色字体`、`白色文字`、`橙色数字`。
    *   **字号 (Size):** 使用相对描述。例如 `大号`、`较小的`、`醒目的、字号很大的`。
    *   **字重 (Weight):** `粗体字`。
    *   **大小写 (Case):** `大写字母文字`。
    *   **字体 (Font Family):** 如果特征明显,可以提及。例如 `无衬线字体`。

##### 3. 图标与插图 (Icons & Illustrations)
*   **对象:** 功能性图标、装饰性插图、头像等。
*   **描述属性:**
    *   **风格 (Style):** `卡通风格的插图`、`橙色的人物轮廓图标`。
    *   **内容 (Content):** 描述其描绘的具体事物。例如 `描绘了一位...女性`、`一个橙色的信封图标`、`一个白色的汽车图标`。
    *   **形状与颜色:** 描述图标/插图本身的形状和颜色,以及其容器的形状和颜色。例如 `淡橙色的圆形图标,里面有一个橙色的对勾符号`、`绿色的圆形图标,里面有一个白色的对勾符号`。

#### 结构化输出示例
最终生成的提示词应该是一个连贯的段落,通过 `\n` 分隔不同的逻辑区块,其内在结构应如下所示:

`[整体背景与主容器描述]。\n[宏观布局描述]。\n[区域一(如顶部/标题栏)的位置和内部元素描述,遵循元素细则]。\n[区域二(如内容区)的位置和内部元素描述,遵循元素细则]。\n[区域三(如底部/操作栏)的位置和内部元素描述,遵循元素细则]。`

### 海报、logo和文字渲染等其他类型的提示词扩展规则与模板

#### **描述模板结构**
请严格遵循以下结构和顺序来组织扩展后prompt的内容,确保描述的逻辑性和完整性。

**1.  总体描述 (Overall Description)**
*   **开篇句式:** 以“这是一张/一个/一幅...”或类似的客观陈述开头。
*   **核心内容:**
    *   **图像类型:** 明确指出是“海报”、“标志(logo)”、“插画”、“方形图像”等。
    *   **主要风格:** 定义作品的整体艺术风格,例如“水墨风格”、“2D卡通动画风格”、“文艺风格”、“图形标志”等。
    *   **整体色调与氛围:** 描述画面的主色调、色彩关系和给人的感觉,例如“黑白灰色调”、“以红色为主色调”、“柔和的渐变色”、“宁静的视觉质感”、“强烈的视觉对比”等。
    *   **构图格式:** 指明是“竖版”、“方形”等。

**2.  主体/核心元素 (Main Subject / Core Elements)**
*   **描述顺序:** 从画面的视觉中心或最主要的角色/物体开始描述。
*   **核心内容:**
    *   **身份与数量:** 明确主体的身份和数量,如“七名东亚年轻男子”、“一个卡通男孩”、“两名男性运动员”。
    *   **位置与姿态:** 精确描述主体在画面中的位置(如“站在一段宽阔的白色楼梯中央”、“位于构图的中间偏上位置”)和姿态/动作(如“面向前方”、“身体前倾,正伸出右脚”、“呈咆哮姿态”)。
    *   **外观细节:** 尽可能详细地描述:
        *   **人物:** 发型、发色、肤色、五官特征、表情(“神情平静”、“悲伤或沮丧的表情”)。
        *   **服装:** 款式、颜色、材质、装饰(“现代全黑服装,包括衬衫和长裤”、“红色短袖上衣,上面有白色的条纹装饰”)。
        *   **物体/图形:** 形状、颜色、材质、构成方式(如Logo:“由一个粗体的、橙色的...字母‘G’构成...被一个更大的深蓝色不完整圆弧所环绕”)。

**3.  背景与环境 (Background & Environment)**
*   **描述主体周围的环境和背景。**
*   **核心内容:**
    *   **类型:** 明确背景是“纯白色背景”、“纯蓝色背景”、“柔和渐变色”,还是具体的场景(如“绿色的足球场草地和带有红色座椅的体育场看台”)。
    *   **细节与装饰:** 描述背景中的具体元素、纹理和细节,例如“地毯上印有对称的、卷曲的中式古典花纹”、“深色的木质扶手,扶手上雕刻着中式窗格图案”、“带有从中心向外扩散的放射状线条”。

**4.  文字与标识 (Text & Logos)**
*   **单独、清晰地描述画面中所有的文字和符号元素。**
*   **核心内容(针对每一处文字/标识):**
    *   **内容:** 准确引用文字内容,如“‘三重楼’”、“‘Magic i’”。所有文字和扩展后的文字内容,都必须用双引号包裹,这是必须要遵守的规则,例如环境中的文字也需要用双引号包裹(包括书籍、招牌、黑板上的文字内容等)
    *   **位置:** 精确说明其在画面中的位置,如“海报顶部中央”、“在人物足下的台阶上”、“右下角”、“正下方”。
    *   **字体特征:** 详细描述字体类型、风格和粗细,例如“黑色毛笔书写的大号艺术字”、“较小的宋体字”、“创意手写字体”、“华丽的黑色哥特式字体”、“白色无衬线大写字母”。
    *   **颜色与大小:** 明确文字的颜色和相对大小(如“大号”、“较小”、“占据了约画面1/5的大小,十分醒目”)。
    *   **排列方式:** 指明是“横排”还是“竖排”。

**5.  构图与视觉效果 (Composition & Visual Effects)**
*   **在描述的最后,对整体的视觉构成和特殊效果进行总结。**
*   **核心内容:**
    *   **元素布局:** 总结各元素之间的空间关系,如“构图极为简洁”、“黑色的人物和文字与明亮的蓝色背景形成了强烈的视觉对比”。
    *   **色彩属性:** 补充描述色彩的专业属性,如“色彩饱和度低,对比度柔和”。
    *   **特殊效果:** 描述任何额外的视觉处理,如“整个画面的上下边缘有模糊的黑色水墨笔触效果,营造出古典氛围”。

#### **句法与风格规则**

1.  **使用客观陈述句:** 避免使用“设计”、“要求”等指令性词语。所有句子都应是对一个已存在画面的客观描述。
2.  **细节具象化:** 将用户输入中的模糊概念具体化。
    *   “中国风” -> 扩展为“水墨风格”、“中式古典花纹”、“毛笔书写”。
    *   “艺术字体” -> 扩展为具体的字体风格,如“哥特式”、“手写体”、“衬线体”。
    *   “背景是蓝色的” -> 扩展为“背景是无任何杂质的纯蓝色”。
3.  **空间定位精确化:** 大量使用方位词来明确元素位置,如“中央”、“顶部”、“底部”、“左侧”、“右下角”、“...的正下方”、“中间偏上”。
4.  **使用专业词汇:** 在适当的时候使用设计和艺术领域的专业术语,如“无衬线字体 (sans-serif)”、“衬线体 (serif)”、“饱和度”、“对比度”、“2D动画风格”、“构图”等。
5.  **结构先行,内容填充:** 严格按照上述模板的五大模块进行思考和组织,确保不遗漏任何一个方面,使得最终的 long prompt 既全面又富有条理。

接下来,我将提供输入句子,你将提供扩展后的提示词。

输入句子:
"""

```
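The template above ends with the label `输入句子:`, which suggests that the user's sentence is appended directly after it, and its validation step requires every rendered text to be wrapped in double quotes. The sketch below illustrates both ideas. The helpers `build_rewrite_prompt` and `extract_quoted_text` are hypothetical names introduced for illustration and are not part of the repository; how `PE/deepseek.py` actually assembles the request may differ.

```python
import re


def build_rewrite_prompt(system_prompt: str, user_sentence: str) -> str:
    """Append the user's sentence after the template's trailing label (assumed usage)."""
    return system_prompt.rstrip() + user_sentence.strip()


def extract_quoted_text(expanded_prompt: str) -> list:
    """Return every span wrapped in Chinese or ASCII double quotes."""
    pattern = r'“([^”]*)”|"([^"]*)"'
    return [cn or ascii_ for cn, ascii_ in re.findall(pattern, expanded_prompt)]


# Stand-in for the last lines of the real template.
template_tail = "接下来,我将提供输入句子,你将提供扩展后的提示词。\n\n输入句子:\n"
full_prompt = build_rewrite_prompt(template_tail, "一张写着 Hello 的海报")

# Quoted spans found in a hand-written sample of an expanded prompt.
quoted = extract_quoted_text('海报顶部中央是白色大号文字“Hello”,右下角是较小的文字"2026"。')
```

A check like `extract_quoted_text` could be used to confirm that text requested by the user still appears verbatim, untranslated, in the expanded prompt.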

## /README.md

[中文文档](./README_zh_CN.md)

<div align="center">

<img src="./assets/logo.png" alt="HunyuanImage-3.0 Logo" width="600">

# 🎨 HunyuanImage-3.0: A Powerful Native Multimodal Model for Image Generation

</div>


<div align="center">
<img src="./assets/banner.png" alt="HunyuanImage-3.0 Banner" width="800">

</div>

<div align="center">
  <a href=https://hunyuan.tencent.com/image target="_blank"><img src=https://img.shields.io/badge/Official%20Site-333399.svg?logo=homepage height=22px></a>
  <a href=https://huggingface.co/tencent/HunyuanImage-3.0 target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-T2I-d96902.svg height=22px></a>
  <a href=https://huggingface.co/tencent/HunyuanImage-3.0-Instruct target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Instruct(I2I)-d96902.svg height=22px></a>
  <a href=https://huggingface.co/tencent/HunyuanImage-3.0-Instruct-Distil target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Instruct(I2I)--Distil-d96902.svg height=22px></a>
  <a href=https://github.com/Tencent-Hunyuan/HunyuanImage-3.0 target="_blank"><img src= https://img.shields.io/badge/Page-bb8a2e.svg?logo=github height=22px></a>
  <a href=https://arxiv.org/pdf/2509.23951 target="_blank"><img src=https://img.shields.io/badge/Report-b5212f.svg?logo=arxiv height=22px></a>
  <a href=https://x.com/TencentHunyuan target="_blank"><img src=https://img.shields.io/badge/Hunyuan-black.svg?logo=x height=22px></a>
  <a href=https://docs.qq.com/doc/DUVVadmhCdG9qRXBU target="_blank"><img src=https://img.shields.io/badge/📚-PromptHandBook-blue.svg?logo=book height=22px></a>
</div>


<p align="center">
    👏 Join our <a href="./assets/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/ehjWMqF5wY">Discord</a> | 
💻 <a href="https://hunyuan.tencent.com/chat/HunyuanDefault?from=modelSquare&modelId=Hunyuan-Image-3.0-Instruct">Try our model on the official website!</a>&nbsp;&nbsp;
</p>

## 🔥🔥🔥 News

- **January 26, 2026**: 🚀 **[HunyuanImage-3.0-Instruct-Distil](https://huggingface.co/tencent/HunyuanImage-3.0-Instruct-Distil)** - Distilled checkpoint for efficient deployment (8 steps sampling recommended).
- **January 26, 2026**: 🎉 **[HunyuanImage-3.0-Instruct](https://huggingface.co/tencent/HunyuanImage-3.0-Instruct)** - Release of **Instruct (with reasoning)** for intelligent prompt enhancement and **Image-to-Image** generation for creative editing.
- **October 30, 2025**: 🚀 **[HunyuanImage-3.0 vLLM Acceleration](./vllm_infer/README.md)** - Significantly faster inference with vLLM support.
- **September 28, 2025**: 📖 **[HunyuanImage-3.0 Technical Report](https://arxiv.org/pdf/2509.23951)** - Comprehensive technical documentation now available.
- **September 28, 2025**: 🎉 **[HunyuanImage-3.0 Open Source](https://github.com/Tencent-Hunyuan/HunyuanImage-3.0)** - Inference code and model weights publicly available.


## 🧩 Community Contributions

If you develop with or use HunyuanImage-3.0 in your projects, please let us know.

## 📑 Open-source Plan

- HunyuanImage-3.0 (Image Generation Model)
  - [x] Inference 
  - [x] HunyuanImage-3.0 Checkpoints
  - [x] HunyuanImage-3.0-Instruct Checkpoints (with reasoning)
  - [x] vLLM Support
  - [x] Distilled Checkpoints
  - [x] Image-to-Image Generation
  - [ ] Multi-turn Interaction


## 🗂️ Contents
- [🔥🔥🔥 News](#-news)
- [🧩 Community Contributions](#-community-contributions)
- [📑 Open-source Plan](#-open-source-plan)
- [📖 Introduction](#-introduction)
- [✨ Key Features](#-key-features)
- [🚀 Usage](#-usage)
  - [📦 Environment Setup](#-environment-setup)
    - [📥 Install Dependencies](#-install-dependencies)
  - [HunyuanImage-3.0-Instruct](#hunyuanimage-30-instruct-instruction-reasoning-and-image-to-image-generation-including-editing-and-multi-image-fusion)
    - [🔥 Quick Start with Transformers](#-quick-start-with-transformers)
      - [1️⃣ Download model weights](#1-download-model-weights)
      - [2️⃣ Run with Transformers](#2-run-with-transformers)
    - [🏠 Local Installation & Usage](#-local-installation--usage)
      - [1️⃣ Clone the Repository](#1-clone-the-repository)
      - [2️⃣ Download Model Weights](#2-download-model-weights)
      - [3️⃣ Run the Demo](#3-run-the-demo)
      - [4️⃣ Command Line Arguments](#4-command-line-arguments)
      - [5️⃣ For fewer Sampling Steps](#5-for-fewer-sampling-steps)
  - [HunyuanImage-3.0 (Text-to-image)](#hunyuanimage-30-text-to-image)
    - [🔥 Quick Start with Transformers](#-quick-start-with-transformers-1)
      - [1️⃣ Download model weights](#1-download-model-weights-1)
      - [2️⃣ Run with Transformers](#2-run-with-transformers-1)
    - [🏠 Local Installation & Usage](#-local-installation--usage-1)
      - [1️⃣ Clone the Repository](#1-clone-the-repository-1)
      - [2️⃣ Download Model Weights](#2-download-model-weights-1)
      - [3️⃣ Run the Demo](#3-run-the-demo-1)
      - [4️⃣ Command Line Arguments](#4-command-line-arguments-1)
    - [🎨 Interactive Gradio Demo](#-interactive-gradio-demo)
      - [1️⃣ Install Gradio](#1-install-gradio)
      - [2️⃣ Configure Environment](#2-configure-environment)
      - [3️⃣ Launch the Web Interface](#3-launch-the-web-interface)
      - [4️⃣ Access the Interface](#4-access-the-interface)
- [🧱 Models Cards](#-models-cards)
- [📊 Evaluation](#-evaluation)
  - [Evaluation of HunyuanImage-3.0-Instruct](#evaluation-of-hunyuanimage-30-instruct)
  - [Evaluation of HunyuanImage-3.0 (Text-to-Image)](#evaluation-of-hunyuanimage-30-text-to-image)
- [🖼️ Showcase](#-showcase)
  - [Showcases of HunyuanImage-3.0-Instruct](#showcases-of-hunyuanimage-30-instruct)
- [📚 Citation](#-citation)
- [🙏 Acknowledgements](#-acknowledgements)
- [🌟🚀 Github Star History](#-github-star-history)

---

## 📖 Introduction

**HunyuanImage-3.0** is a groundbreaking native multimodal model that unifies multimodal understanding and generation within an autoregressive framework. Our text-to-image and image-to-image model achieves performance **comparable to or surpassing** leading closed-source models.


<div align="center">
  <img src="./assets/framework.png" alt="HunyuanImage-3.0 Framework" width="90%">
</div>

## ✨ Key Features

* 🧠 **Unified Multimodal Architecture:** Moving beyond the prevalent DiT-based architectures, HunyuanImage-3.0 employs a unified autoregressive framework. This design enables a more direct and integrated modeling of text and image modalities, leading to surprisingly effective and contextually rich image generation.

* 🏆 **The Largest Image Generation MoE Model:** This is the largest open-source image generation Mixture of Experts (MoE) model to date. It features 64 experts and a total of 80 billion parameters, with 13 billion activated per token, significantly enhancing its capacity and performance.

* 🎨 **Superior Image Generation Performance:** Through rigorous dataset curation and advanced reinforcement learning post-training, we've achieved an optimal balance between semantic accuracy and visual excellence. The model demonstrates exceptional prompt adherence while delivering photorealistic imagery with stunning aesthetic quality and fine-grained details.

* 💭 **Intelligent Image Understanding and World-Knowledge Reasoning:** The unified multimodal architecture endows HunyuanImage-3.0 with powerful reasoning capabilities. It understands the user's input image and leverages its extensive world knowledge to interpret user intent, automatically elaborating on sparse prompts with contextually appropriate details to produce superior, more complete visual outputs.


## 🚀 Usage

### 📦 Environment Setup

* 🐍 **Python:** 3.12+ (recommended and tested)
* ⚡ **CUDA:** 12.8

#### 📥 Install Dependencies

```bash
# 1. First install PyTorch (CUDA 12.8 Version)
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128

# 2. Install tencentcloud-sdk for Prompt Enhancement (PE). Only needed for HunyuanImage-3.0, not for HunyuanImage-3.0-Instruct
pip install -i https://mirrors.tencent.com/pypi/simple/ --upgrade tencentcloud-sdk-python

# 3. Then install other dependencies
pip install -r requirements.txt
```

For **up to 3x faster inference**, install these optimizations:

```bash
# FlashInfer for optimized MoE inference. v0.5.0 is tested.
pip install flashinfer-python==0.5.0
```
> 💡**Installation Tips:** It is critical that the CUDA version used by PyTorch matches the system's CUDA version. 
> FlashInfer relies on this compatibility when compiling kernels at runtime.
> GCC version >=9 is recommended for compiling FlashAttention and FlashInfer.

> ⚡ **Performance Tips:** These optimizations can significantly speed up your inference!

> 💡**Note:** When FlashInfer is enabled, the first inference may be slower (about 10 minutes) due to kernel compilation. Subsequent inferences on the same machine will be much faster.

### HunyuanImage-3.0-Instruct (Instruction reasoning and Image-to-image generation, including editing and multi-image fusion)

#### 🔥 Quick Start with Transformers

##### 1️⃣ Download model weights

```bash
# Download from HuggingFace and rename the directory.
# Note that the directory name should not contain dots, which may cause issues when loading with Transformers.
hf download tencent/HunyuanImage-3.0-Instruct --local-dir ./HunyuanImage-3-Instruct
```
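The download command above maps the repository name `HunyuanImage-3.0-Instruct` to the dot-free directory `HunyuanImage-3-Instruct`. A minimal sketch of that renaming rule, assuming a hypothetical helper `dot_free_local_dir` that is not part of the repository:

```python
import re


def dot_free_local_dir(repo_id: str) -> str:
    """Derive a local directory name without dots from a Hub repo id."""
    name = repo_id.split("/")[-1]         # drop the "tencent/" namespace
    name = re.sub(r"\.\d+", "", name)     # "3.0" -> "3"
    return "./" + name.replace(".", "-")  # replace any remaining dots


local_dir = dot_free_local_dir("tencent/HunyuanImage-3.0-Instruct")
```

The result can be passed as the `--local-dir` value and later reused as `model_id` when loading the model.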

##### 2️⃣ Run with Transformers

```python
from transformers import AutoModelForCausalLM

# Load the model
model_id = "./HunyuanImage-3-Instruct"
# The model currently cannot be loaded directly by its HF model_id `tencent/HunyuanImage-3.0-Instruct`
# because of the dot in the name.

kwargs = dict(
    attn_implementation="sdpa", 
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
    moe_impl="eager",   # Use "flashinfer" if FlashInfer is installed
    moe_drop_tokens=True,
)

model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
model.load_tokenizer(model_id)

# Image-to-Image generation (TI2I)
prompt = "基于图一的logo,参考图二中冰箱贴的材质,制作一个新的冰箱贴"

input_img1 = "./assets/demo_instruct_imgs/input_1_0.png"
input_img2 = "./assets/demo_instruct_imgs/input_1_1.png"
imgs_input = [input_img1, input_img2]

cot_text, samples = model.generate_image(
    prompt=prompt,
    image=imgs_input,
    seed=42,
    image_size="auto",
    use_system_prompt="en_unified",
    bot_task="think_recaption",  # Use "think_recaption" for reasoning and enhancement
    infer_align_image_size=True,  # Align output image size to input image size
    diff_infer_steps=50, 
    verbose=2
)

# Save the generated image
samples[0].save("image_edit.png")
```

#### 🏠 Local Installation & Usage

##### 1️⃣ Clone the Repository

```bash
git clone https://github.com/Tencent-Hunyuan/HunyuanImage-3.0.git
cd HunyuanImage-3.0/
```

##### 2️⃣ Download Model Weights

```bash
# Download from HuggingFace
hf download tencent/HunyuanImage-3.0-Instruct --local-dir ./HunyuanImage-3-Instruct
```

##### 3️⃣ Run the Demo

More demos are available in `run_demo_instruct.sh`.

```bash
export MODEL_PATH="./HunyuanImage-3-Instruct"
bash run_demo_instruct.sh
```

##### 4️⃣ Command Line Arguments

| Arguments               | Description                                                  | Recommended    |
| ----------------------- | ------------------------------------------------------------ | ----------- |
| `--prompt`              | Input prompt                                                 | (Required)  |
| `--image`               | Input image path. For multiple images, use comma-separated paths (e.g., 'img1.png,img2.png') | (Required)      |
| `--model-id`            | Model path                                                   | (Required)  |
| `--attn-impl`           | Attention implementation. Currently only `sdpa` is supported | `sdpa`      |
| `--moe-impl`            | MoE implementation. Either `eager` or `flashinfer`           | `flashinfer`     |
| `--seed`                | Random seed for image generation. Use None for random seed   | `None`      |
| `--diff-infer-steps`    | Number of inference steps                                   | `50`        |
| `--image-size`          | Image resolution. Can be `auto`, like `1280x768` or `16:9`  | `auto`      |
| `--use-system-prompt`   | System prompt type. Options: `None`, `dynamic`, `en_vanilla`, `en_recaption`, `en_think_recaption`, `en_unified`, `custom` | `en_unified` |
| `--system-prompt`       | Custom system prompt. Used when `--use-system-prompt` is `custom` | `None`      |
| `--bot-task`            | Task type. `image` for direct generation; `auto` for text; `recaption` for re-write->image; `think_recaption` for think->re-write->image | `think_recaption` |
| `--save`                | Image save path                                              | `image.png` |
| `--verbose`             | Verbose level                                                | `2`         |
| `--reproduce`           | Whether to reproduce the results                            | `True`     |
| `--infer-align-image-size` | Whether to align the target image size to the src image size | `True`     |
| `--max_new_tokens`      | Maximum number of new tokens to generate                     | `2048` |
| `--use-taylor-cache`    | Use Taylor Cache when sampling                              | `False`     |

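As an illustration of how the flags in the table fit together, the sketch below assembles a command line in Python. It assumes the flags are passed to `run_image_gen.py`, as in the text-to-image demo; the helper `build_instruct_command` is hypothetical, and only a subset of the flags is shown.

```python
import shlex


def build_instruct_command(prompt, images, model_path, steps=50, seed=None):
    """Assemble an argv list from the flags listed in the table."""
    argv = [
        "python3", "run_image_gen.py",      # assumed entry point
        "--model-id", model_path,
        "--prompt", prompt,
        "--image", ",".join(images),        # multiple images are comma-separated
        "--bot-task", "think_recaption",
        "--use-system-prompt", "en_unified",
        "--image-size", "auto",
        "--diff-infer-steps", str(steps),
    ]
    if seed is not None:                    # omit the flag to keep a random seed
        argv += ["--seed", str(seed)]
    return argv


cmd = build_instruct_command(
    "make a fridge magnet", ["img1.png", "img2.png"], "./HunyuanImage-3-Instruct"
)
print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
```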
##### 5️⃣ For fewer Sampling Steps

We recommend using the model [HunyuanImage-3.0-Instruct-Distil](https://huggingface.co/tencent/HunyuanImage-3.0-Instruct-Distil) with `--diff-infer-steps 8`, while keeping all other recommended parameter values **unchanged**.

```bash
# Download HunyuanImage-3.0-Instruct-Distil from HuggingFace
hf download tencent/HunyuanImage-3.0-Instruct-Distil --local-dir ./HunyuanImage-3-Instruct-Distil

# Run the demo with 8 sampling steps
export MODEL_PATH="./HunyuanImage-3-Instruct-Distil"
bash run_demo_instruct_distil.sh
```

<details>
<summary> Previous Version (Pure Text-to-Image) </summary>

### HunyuanImage-3.0 (Text-to-image)

#### 🔥 Quick Start with Transformers

##### 1️⃣ Download model weights

```bash
# Download from HuggingFace and rename the directory.
# Note that the directory name should not contain dots, which may cause issues when loading with Transformers.
hf download tencent/HunyuanImage-3.0 --local-dir ./HunyuanImage-3
```

##### 2️⃣ Run with Transformers

```python
from transformers import AutoModelForCausalLM

# Load the model
model_id = "./HunyuanImage-3"
# The model currently cannot be loaded directly by its HF model_id `tencent/HunyuanImage-3.0`
# because of the dot in the name.

kwargs = dict(
    attn_implementation="sdpa",     # Use "flash_attention_2" if FlashAttention is installed
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
    moe_impl="eager",   # Use "flashinfer" if FlashInfer is installed
)

model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
model.load_tokenizer(model_id)

# generate the image
prompt = "A brown and white dog is running on the grass"
image = model.generate_image(prompt=prompt, stream=True)
image.save("image.png")
```


#### 🏠 Local Installation & Usage

##### 1️⃣ Clone the Repository

```bash
git clone https://github.com/Tencent-Hunyuan/HunyuanImage-3.0.git
cd HunyuanImage-3.0/
```

##### 2️⃣ Download Model Weights

```bash
# Download from HuggingFace
hf download tencent/HunyuanImage-3.0 --local-dir ./HunyuanImage-3
```

##### 3️⃣ Run the Demo
The pretrained checkpoint does not automatically rewrite or enhance input prompts. For best results, we currently recommend using DeepSeek to rewrite prompts. You can apply for an API key on [Tencent Cloud](https://cloud.tencent.com/document/product/1772/115963#.E5.BF.AB.E9.80.9F.E6.8E.A5.E5.85.A5).

```bash
# Without PE
export MODEL_PATH="./HunyuanImage-3"
python3 run_image_gen.py \
    --model-id $MODEL_PATH \
    --verbose 1 \
    --prompt "A brown and white dog is running on the grass" \
    --bot-task image \
    --image-size "1024x1024" \
    --save ./image.png \
    --moe-impl flashinfer

# With PE
export DEEPSEEK_KEY_ID="your_deepseek_key_id"
export DEEPSEEK_KEY_SECRET="your_deepseek_key_secret"
export MODEL_PATH="./HunyuanImage-3"
python3 run_image_gen.py \
    --model-id $MODEL_PATH \
    --verbose 1 \
    --prompt "A brown and white dog is running on the grass" \
    --bot-task image \
    --image-size "1024x1024" \
    --save ./image.png \
    --moe-impl flashinfer \
    --rewrite 1

```

##### 4️⃣ Command Line Arguments

| Arguments               | Description                                                  | Recommended     |
| ----------------------- | ------------------------------------------------------------ | ----------- |
| `--prompt`              | Input prompt                                                 | (Required)  |
| `--model-id`            | Model path                                                   | (Required)  |
| `--attn-impl`           | Attention implementation. Either `sdpa` or `flash_attention_2`. | `sdpa`      |
| `--moe-impl`            | MoE implementation. Either `eager` or `flashinfer`           | `flashinfer`     |
| `--seed`                | Random seed for image generation                             | `None`      |
| `--diff-infer-steps`    | Diffusion infer steps                                        | `50`        |
| `--image-size`          | Image resolution. Can be `auto`, like `1280x768` or `16:9`   | `auto`      |
| `--save`                | Image save path.                                             | `image.png` |
| `--verbose`             | Verbose level. 0: No log; 1: log inference information.      | `0`         |
| `--rewrite`             | Whether to enable rewriting                                  | `1`         |

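The `--image-size` flag accepts three kinds of values: `auto`, an explicit resolution such as `1280x768`, or an aspect ratio such as `16:9`. The sketch below shows one way to tell them apart. It is an illustration only: `parse_image_size` is a hypothetical helper, not the repository's parser, and it returns the two numbers in the order written without deciding which is width and which is height.

```python
import re


def parse_image_size(value: str):
    """Classify an --image-size value as auto, explicit resolution, or aspect ratio."""
    value = value.strip().lower()
    if value == "auto":
        return ("auto", None)
    match = re.fullmatch(r"(\d+)x(\d+)", value)
    if match:
        return ("resolution", (int(match.group(1)), int(match.group(2))))
    match = re.fullmatch(r"(\d+):(\d+)", value)
    if match:
        return ("ratio", (int(match.group(1)), int(match.group(2))))
    raise ValueError(f"unsupported image size: {value!r}")


parsed = [parse_image_size(v) for v in ("auto", "1280x768", "16:9")]
```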
#### 🎨 Interactive Gradio Demo

Launch an interactive web interface for easy text-to-image generation.

##### 1️⃣ Install Gradio

```bash
pip install "gradio>=4.21.0"
```

##### 2️⃣ Configure Environment

```bash
# Set your model path
export MODEL_ID="path/to/your/model"

# Optional: Configure GPU usage (default: 0,1,2,3)
export GPUS="0,1,2,3"

# Optional: Configure host and port (default: 0.0.0.0:443)
export HOST="0.0.0.0"
export PORT="443"
```
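For reference, the same variables and defaults can be read from Python as sketched below. This mirrors the comments in the block above and is not taken from `run_app.sh`; the helper `load_app_config` is hypothetical.

```python
import os


def load_app_config(env=None):
    """Read the demo settings from environment variables, using the documented defaults."""
    env = os.environ if env is None else env
    return {
        "model_id": env["MODEL_ID"],  # required, raises KeyError if unset
        "gpus": [int(g) for g in env.get("GPUS", "0,1,2,3").split(",")],
        "host": env.get("HOST", "0.0.0.0"),
        "port": int(env.get("PORT", "443")),
    }


config = load_app_config({"MODEL_ID": "./HunyuanImage-3", "PORT": "8080"})
```

On Linux, binding to a port below 1024 such as 443 usually requires elevated privileges, so a higher port may be more convenient for local testing.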

##### 3️⃣ Launch the Web Interface

**Basic Launch:**
```bash
sh run_app.sh
```

**With Performance Optimizations:**
```bash
# Use both optimizations for maximum performance
sh run_app.sh --moe-impl flashinfer --attn-impl flash_attention_2
```

##### 4️⃣ Access the Interface

> 🌐 **Web Interface:** Open your browser and navigate to `http://localhost:443` (or your configured port)

</details>

## 🧱 Models Cards

| Model                     | Params | Download | Recommended VRAM | Supported |
|---------------------------| --- | --- | --- | --- |
| HunyuanImage-3.0          | 80B total (13B active) | [HuggingFace](https://huggingface.co/tencent/HunyuanImage-3.0) | ≥ 3 × 80 GB | ✅ Text-to-Image |
| HunyuanImage-3.0-Instruct | 80B total (13B active) | [HuggingFace](https://huggingface.co/tencent/HunyuanImage-3.0-Instruct) | ≥ 8 × 80 GB | ✅ Text-to-Image<br>✅ Text-Image-to-Image<br>✅ Prompt Self-Rewrite <br>✅ CoT Think |
| HunyuanImage-3.0-Instruct-Distil | 80B total (13B active) | [HuggingFace](https://huggingface.co/tencent/HunyuanImage-3.0-Instruct-Distil) | ≥ 8 × 80 GB | ✅ Text-to-Image<br>✅ Text-Image-to-Image<br>✅ Prompt Self-Rewrite <br>✅ CoT Think <br>✅ Fewer sampling steps (8 steps recommended) |

Notes:
- Install performance extras (FlashAttention, FlashInfer) for faster inference.
- Multi‑GPU inference is recommended for the Base model.

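A rough back-of-the-envelope check of the VRAM figures in the table, assuming weights stored in a 16-bit format (2 bytes per parameter). That precision is an assumption, not something the table states, and the estimate ignores activations, caches, and framework overhead.

```python
import math


def weight_memory_gb(total_params_billion: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone: 1e9 params * bytes each, expressed in GB (1e9 bytes)."""
    return total_params_billion * bytes_per_param


weights_gb = weight_memory_gb(80)                  # all 80B parameters must be resident
min_gpus_for_weights = math.ceil(weights_gb / 80)  # number of 80 GB devices for weights only
active_fraction = 13 / 80                          # share of parameters used per token

print(weights_gb, min_gpus_for_weights, active_fraction)
```

Under this assumption the weights alone need about 160 GB, so the recommended 3 × 80 GB leaves headroom beyond the two devices that would hold the weights. Although only 13B parameters are active per token, every expert still has to be loaded.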
## 📊 Evaluation

### Evaluation of HunyuanImage-3.0-Instruct
* 👥 **GSB (Human Evaluation)** 
We adopted the GSB (Good/Same/Bad) evaluation method commonly used to assess the relative performance between two models from an overall image perception perspective. In total, we utilized 1,000+ single- and multi-image editing cases, generating an equal number of image samples for all compared models in a single run. For a fair comparison, we conducted inference only once for each prompt, avoiding any cherry-picking of results. When comparing with the baseline methods, we maintained the default settings for all selected models. The evaluation was performed by more than 100 professional evaluators. 

<p align="center">
  <img src="./assets/gsb_instruct.png" width=60% alt="Human Evaluation with Other Models">
</p>


### Evaluation of HunyuanImage-3.0 (Text-to-Image)

* 🤖 **SSAE (Machine Evaluation)**   
SSAE (Structured Semantic Alignment Evaluation) is an intelligent evaluation metric for image-text alignment based on advanced multimodal large language models (MLLMs). We extracted 3500 key points across 12 categories, then used multimodal large language models to automatically evaluate and score by comparing the generated images with these key points based on the visual content of the images. Mean Image Accuracy represents the image-wise average score across all key points, while Global Accuracy directly calculates the average score across all key points.

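The difference between the two SSAE metrics can be seen in a small worked example. The scores below are made up for illustration (1 means a key point is satisfied, 0 means it is not) and are not taken from the evaluation.

```python
def average(values):
    """Arithmetic mean of a non-empty sequence."""
    values = list(values)
    return sum(values) / len(values)


# One inner list per generated image, one score per key point of that image.
scores_per_image = [
    [1, 1],        # image 1: 2 key points, both satisfied
    [1, 0, 0, 0],  # image 2: 4 key points, one satisfied
]

# Mean Image Accuracy: average within each image first, then across images.
mean_image_accuracy = average(average(s) for s in scores_per_image)

# Global Accuracy: pool all key points and average once.
global_accuracy = average(x for s in scores_per_image for x in s)

print(mean_image_accuracy, global_accuracy)  # 0.625 0.5
```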
<p align="center">
  <img src="./assets/ssae_side_by_side_comparison.png" width=98% alt="SSAE Side-by-Side Comparison with Other Models">
</p>

<p align="center">
  <img src="./assets/ssae_side_by_side_heatmap.png" width=98% alt="SSAE Side-by-Side Heatmap">
</p>


* 👥 **GSB (Human Evaluation)** 

We adopted the GSB (Good/Same/Bad) evaluation method commonly used to assess the relative performance between two models from an overall image perception perspective. In total, we utilized 1,000 text prompts, generating an equal number of image samples for all compared models in a single run. For a fair comparison, we conducted inference only once for each prompt, avoiding any cherry-picking of results. When comparing with the baseline methods, we maintained the default settings for all selected models. The evaluation was performed by more than 100 professional evaluators. 

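The text above does not state how the Good/Same/Bad votes are turned into a single number. One common convention is the relative win rate, (G - B) / (G + S + B), sketched below as an assumption. The vote counts are illustrative and are not results from this evaluation.

```python
def relative_win_rate(good: int, same: int, bad: int) -> float:
    """Relative win rate (G - B) / (G + S + B); positive means the first model is preferred."""
    total = good + same + bad
    if total == 0:
        raise ValueError("no votes")
    return (good - bad) / total


# Illustrative counts for 1,000 prompts.
rate = relative_win_rate(good=450, same=300, bad=250)
print(f"{rate:.1%}")  # 20.0%
```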
<p align="center">
  <img src="./assets/gsb.png" width=98% alt="Human Evaluation with Other Models">
</p>

## 🖼️ Showcase

Our model can follow complex instructions to generate high‑quality, creative images.

<div align="center">
  <img src="./assets/banner_all.jpg" width=100% alt="HunyuanImage 3.0 Demo">
</div>

For text-to-image showcases of HunyuanImage-3.0, see the following link:

- [HunyuanImage-3.0](./Hunyuan-Image3.md)

### Showcases of HunyuanImage-3.0-Instruct

HunyuanImage-3.0-Instruct demonstrates powerful capabilities in intelligent image generation and editing. The following showcases highlight its core features:

* 🧠 **Intelligent Visual Understanding and Reasoning (CoT Think)**: The model performs structured thinking to analyze the user's input image and prompt. It expands the user's intent and editing tasks into structured, comprehensive instructions, breaking down complex prompts and editing tasks into detailed visual components including subject, composition, lighting, color palette, and style. This leads to better image generation and editing performance.

* ✏️ **Prompt Self-Rewrite**: Automatically enhances sparse or vague prompts into professional-grade, detail-rich descriptions that capture the user's intent more accurately.

* 🎨 **Text-to-Image (T2I)**: Generates high-quality images from text prompts with exceptional prompt adherence and photorealistic quality.

* 🖼️ **Image-to-Image (TI2I)**: Supports creative image editing, including adding elements, removing objects, modifying styles, and seamless background replacement while preserving key visual elements.

* 🔀 **Multi-Image Fusion**: Intelligently combines multiple reference images (up to 3 inputs) to create coherent composite images that integrate visual elements from different sources.


**Showcase 1: Detailed Thought and Reasoning Process**

<div align="center">
  <img src="./assets/pg_instruct_imgs/cot_ti2i.gif" alt="HunyuanImage-3.0-Instruct Showcase 1" width="90%">
</div>

**Showcase 2: Creative T2I Generation with Complex Scene Understanding**

> Prompt: 3D 毛绒质感拟人化马,暖棕浅棕肌理,穿藏蓝西装、白衬衫,戴深棕手套;疲惫带期待,坐于电脑前,旁置印 "HAPPY AGAIN" 的马克杯。橙红渐变背景,配超大号藏蓝粗体 "马上下班",叠加米黄 "Happy New Year" 并标 "(2026)"。橙红为主,藏蓝米黄撞色,毛绒温暖柔和。

<div align="center">
  <img src="./assets/pg_instruct_imgs/image0.png" alt="HunyuanImage-3.0-Instruct Showcase 2" width="75%">
</div>

**Showcase 3: Precise Image Editing with Element Preservation**

<div align="center">
  <img src="./assets/pg_instruct_imgs/image1.png" alt="HunyuanImage-3.0-Instruct Showcase 3" width="85%">
</div>

**Showcase 4: Style Transformation with Thematic Enhancement**

<div align="center">
  <img src="./assets/pg_instruct_imgs/image2.png" alt="HunyuanImage-3.0-Instruct Showcase 4" width="85%">
</div>


**Showcase 5: Advanced Style Transfer and Product Mockup Generation**

<div align="center">
  <img src="./assets/pg_instruct_imgs/image3.png" alt="HunyuanImage-3.0-Instruct Showcase 5" width="85%">
</div>


**Showcase 6: Multi-Image Fusion and Creative Composition**

<div align="center">
  <img src="./assets/pg_instruct_imgs/image4.png" alt="HunyuanImage-3.0-Instruct Showcase 6" width="85%">
</div>


## 📚 Citation

If you find HunyuanImage-3.0 useful in your research, please cite our work:

```bibtex
@article{cao2025hunyuanimage,
  title={HunyuanImage 3.0 Technical Report},
  author={Cao, Siyu and Chen, Hangting and Chen, Peng and Cheng, Yiji and Cui, Yutao and Deng, Xinchi and Dong, Ying and Gong, Kipper and Gu, Tianpeng and Gu, Xiusen and others},
  journal={arXiv preprint arXiv:2509.23951},
  year={2025}
}
```

## 🙏 Acknowledgements

We extend our heartfelt gratitude to the following open-source projects and communities for their invaluable contributions:

* 🤗 [Transformers](https://github.com/huggingface/transformers) - State-of-the-art NLP library
* 🎨 [Diffusers](https://github.com/huggingface/diffusers) - Diffusion models library  
* 🌐 [HuggingFace](https://huggingface.co/) - AI model hub and community
* ⚡ [FlashAttention](https://github.com/Dao-AILab/flash-attention) - Memory-efficient attention
* 🚀 [FlashInfer](https://github.com/flashinfer-ai/flashinfer) - Optimized inference engine

## 🌟🚀 GitHub Star History

[![GitHub stars](https://img.shields.io/github/stars/Tencent-Hunyuan/HunyuanImage-3.0?style=social)](https://github.com/Tencent-Hunyuan/HunyuanImage-3.0)
[![GitHub forks](https://img.shields.io/github/forks/Tencent-Hunyuan/HunyuanImage-3.0?style=social)](https://github.com/Tencent-Hunyuan/HunyuanImage-3.0)


[![Star History Chart](https://api.star-history.com/svg?repos=Tencent-Hunyuan/HunyuanImage-3.0&type=Date)](https://www.star-history.com/#Tencent-Hunyuan/HunyuanImage-3.0&Date)


## /README_zh_CN.md

[English Documentation](./README.md)

<div align="center">

<img src="./assets/logo.png" alt="HunyuanImage-3.0 Logo" width="600">

# 🎨 HunyuanImage-3.0: A Powerful Native Multimodal Model for Image Generation

</div>


<div align="center">
<img src="./assets/banner.png" alt="HunyuanImage-3.0 Banner" width="800">

</div>

<div align="center">
  <a href=https://hunyuan.tencent.com/image target="_blank"><img src=https://img.shields.io/badge/Official%20Site-333399.svg?logo=homepage height=22px></a>
  <a href=https://huggingface.co/tencent/HunyuanImage-3.0 target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-T2I-d96902.svg height=22px></a>
  <a href=https://huggingface.co/tencent/HunyuanImage-3.0-Instruct target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Instruct(I2I)-d96902.svg height=22px></a>
  <a href=https://huggingface.co/tencent/HunyuanImage-3.0-Instruct-Distil target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Instruct(I2I)--Distil-d96902.svg height=22px></a>
  <a href=https://github.com/Tencent-Hunyuan/HunyuanImage-3.0 target="_blank"><img src= https://img.shields.io/badge/Page-bb8a2e.svg?logo=github height=22px></a>
  <a href=https://arxiv.org/pdf/2509.23951 target="_blank"><img src=https://img.shields.io/badge/Report-b5212f.svg?logo=arxiv height=22px></a>
  <a href=https://x.com/TencentHunyuan target="_blank"><img src=https://img.shields.io/badge/Hunyuan-black.svg?logo=x height=22px></a>
  <a href=https://docs.qq.com/doc/DUVVadmhCdG9qRXBU target="_blank"><img src=https://img.shields.io/badge/📚-提示词手册-blue.svg?logo=book height=22px></a>
</div>


<p align="center">
    👏 Join our <a href="./assets/WECHAT.md" target="_blank">WeChat</a> and <a href="https://discord.gg/ehjWMqF5wY">Discord</a> | 
💻 <a href="https://hunyuan.tencent.com/chat/HunyuanDefault?from=modelSquare&modelId=Hunyuan-Image-3.0-Instruct">Try our model on the official site!</a>&nbsp;&nbsp;
</p>

## 🔥🔥🔥 最新消息

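- **January 26, 2026**: 🚀 **[HunyuanImage-3.0-Instruct-Distil](https://huggingface.co/tencent/HunyuanImage-3.0-Instruct-Distil)** - Distilled checkpoint for efficient deployment (8 sampling steps recommended).
- **January 26, 2026**: 🎉 **[HunyuanImage-3.0-Instruct](https://huggingface.co/tencent/HunyuanImage-3.0-Instruct)** - Released the **Instruct (with reasoning)** version, which supports intelligent prompt enhancement and **image-to-image** generation for creative editing.
- **October 30, 2025**: 🚀 **[HunyuanImage-3.0 vLLM Acceleration](./vllm_infer/README.md)** - Significantly faster inference through vLLM support.
- **September 28, 2025**: 📖 **[HunyuanImage-3.0 Technical Report](https://arxiv.org/pdf/2509.23951)** - The full technical report is now available.
- **September 28, 2025**: 🎉 **[HunyuanImage-3.0 Open Source](https://github.com/Tencent-Hunyuan/HunyuanImage-3.0)** - Inference code and model weights are now publicly available.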


## 🧩 社区贡献

If you use or build on HunyuanImage-3.0 in your projects, please let us know.

## 📑 开源计划

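- HunyuanImage-3.0 (Image Generation Model)
  - [x] Inference code
  - [x] HunyuanImage-3.0 checkpoints
  - [x] HunyuanImage-3.0-Instruct checkpoints (with reasoning)
  - [x] vLLM support
  - [x] Distilled checkpoints
  - [x] Image-to-image generation
  - [ ] Multi-turn interaction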


## 🗂️ 目录
- [🔥🔥🔥 最新消息](#-最新消息)
- [🧩 社区贡献](#-社区贡献)
- [📑 开源计划](#-开源计划)
- [📖 概览](#-概览)
- [✨ 模型亮点](#-模型亮点)
- [🚀 使用方法](#-使用方法)
  - [📦 环境配置](#-环境配置)
    - [📥 安装依赖](#-安装依赖)
  - [HunyuanImage-3.0-Instruct](#hunyuanimage-30-instruct-指令推理和图像到图像生成包括编辑和多图像融合)
    - [🔥 使用 Transformers 快速开始](#-使用-transformers-快速开始)
      - [1️⃣ 下载模型权重](#1-下载模型权重)
      - [2️⃣ 使用 Transformers 运行](#2-使用-transformers-运行)
    - [🏠 本地安装和使用](#-本地安装和使用)
      - [1️⃣ 克隆仓库](#1-克隆仓库)
      - [2️⃣ 下载模型权重](#2-下载模型权重)
      - [3️⃣ 运行演示](#3-运行演示)
      - [4️⃣ 命令行参数](#4-命令行参数)
      - [5️⃣ 更少的采样步数](#5-更少的采样步数)
  - [HunyuanImage-3.0 (文本生成图像)](#hunyuanimage-30-文本生成图像)
    - [📥 安装依赖](#-安装依赖-1)
    - [🔥 使用 Transformers 快速开始](#-使用-transformers-快速开始-1)
      - [1️⃣ 下载模型权重](#1-下载模型权重-1)
      - [2️⃣ 使用 Transformers 运行](#2-使用-transformers-运行-1)
    - [🏠 本地安装和使用](#-本地安装和使用-1)
      - [1️⃣ 克隆仓库](#1-克隆仓库-1)
      - [2️⃣ 下载模型权重](#2-下载模型权重-1)
      - [3️⃣ 运行演示](#3-运行演示-1)
      - [4️⃣ 命令行参数](#4-命令行参数-1)
    - [🎨 交互式 Gradio 演示](#-交互式-gradio-演示)
      - [1️⃣ 安装 Gradio](#1-安装-gradio)
      - [2️⃣ 配置环境](#2-配置环境)
      - [3️⃣ 启动 Web 界面](#3-启动-web-界面)
      - [4️⃣ 访问界面](#4-访问界面)
- [🧱 模型卡片](#-模型卡片)
- [📊 评估结果](#-评估结果)
  - [HunyuanImage-3.0-Instruct 评估](#hunyuanimage-30-instruct-评估)
  - [HunyuanImage-3.0 评估](#hunyuanimage-30-评估)
- [🖼️ 展示](#-展示)
  - [HunyuanImage-3.0-Instruct 展示](#hunyuanimage-30-instruct-展示)
- [📚 引用](#-引用)
- [🙏 致谢](#-致谢)
- [🌟🚀 GitHub Star 历史](#-github-star-历史)

---

## 📖 概览

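**HunyuanImage-3.0** is a native multimodal model that unifies multimodal understanding and generation within a single autoregressive framework. Its text-to-image and image-to-image capabilities achieve performance **comparable to or surpassing** leading closed-source models.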


<div align="center">
  <img src="./assets/framework.png" alt="HunyuanImage-3.0 Framework" width="90%">
</div>

## ✨ 模型亮点

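* 🧠 **Unified Multimodal Architecture:** Moving beyond the prevalent DiT-based architectures, HunyuanImage-3.0 adopts a unified autoregressive framework. This design models the text and image modalities more directly and uniformly, tightly coupling semantic understanding with image generation and yielding strikingly effective, contextually rich images.

* 🏆 **The Largest Image Generation MoE Model:** This is currently the largest open-source image generation Mixture-of-Experts (MoE) model by parameter count. It has 64 experts and 80 billion total parameters, with 13 billion activated per token, which significantly increases model capacity and performance.

* 🎨 **Superior Image Generation Quality:** Through careful dataset curation and reinforcement-learning post-training, we balance semantic accuracy with visual quality. The model follows prompts precisely and produces richly detailed images with photorealistic fidelity and aesthetic appeal.

* 💭 **Intelligent Image Understanding and World-Knowledge Reasoning:** Thanks to the unified multimodal architecture, HunyuanImage-3.0 has strong reasoning capabilities. It can deeply understand user-provided images and draw on its extensive world knowledge to interpret user intent. For sparse prompts, it automatically fills in contextually appropriate details to produce better, more complete results.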


## 🚀 使用方法

### 📦 环境配置

* 🐍 **Python:** 3.12+ (recommended and tested)
* ⚡ **CUDA:** 12.8

#### 📥 安装依赖

```bash
# 1. First install PyTorch (CUDA 12.8 build)
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128

# 2. Install tencentcloud-sdk
pip install -i https://mirrors.tencent.com/pypi/simple/ --upgrade tencentcloud-sdk-python

# 3. Then install the remaining dependencies
pip install -r requirements.txt
```

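For **up to 3x faster inference**, install the following optimization:

```bash
# FlashInfer for optimized MoE inference. v0.5.0 is tested.
pip install flashinfer-python==0.5.0
```
> 💡**Installation Tips:** It is critical that the CUDA version used by PyTorch matches the system's CUDA version. 
> FlashInfer relies on this compatibility when compiling kernels at runtime.
> GCC version >=9 is recommended for compiling FlashAttention and FlashInfer.

> ⚡ **Performance Tips:** These optimizations can significantly speed up your inference!

> 💡**Note:** With FlashInfer enabled, the first inference may be slow (about 10 minutes) because kernels need to be compiled. Subsequent inferences on the same machine will be much faster.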
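
Because FlashInfer is optional, a script may want to choose the `moe_impl` value at runtime rather than hard-coding it. The following is a minimal sketch, assuming the `flashinfer-python` package is importable as `flashinfer`; the helper name `pick_moe_impl` is hypothetical and not part of this repository.

```python
import importlib.util


def pick_moe_impl(module_name: str = "flashinfer") -> str:
    """Return "flashinfer" if the module can be found, otherwise "eager"."""
    # find_spec only locates the module; it does not import it or compile kernels.
    found = importlib.util.find_spec(module_name) is not None
    return "flashinfer" if found else "eager"


if __name__ == "__main__":
    print("moe_impl =", pick_moe_impl())
```

The returned string can be passed as the `moe_impl` keyword when loading the model. Note that this only checks that the package is present; it does not verify that the CUDA versions match.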

### HunyuanImage-3.0-Instruct (指令推理和图像到图像生成,包括编辑和多图像融合)

#### 🔥 使用 Transformers 快速开始

##### 1️⃣ 下载模型权重

```bash
# Download from HuggingFace into a renamed local directory.
# Note: the directory name should not contain dots, which may cause issues when loading with Transformers.
hf download tencent/HunyuanImage-3.0-Instruct --local-dir ./HunyuanImage-3-Instruct
```

##### 2️⃣ 使用 Transformers 运行

```python
from transformers import AutoModelForCausalLM

# Load the model
model_id = "./HunyuanImage-3-Instruct"
# Currently the model cannot be loaded directly with the HF model ID
# `tencent/HunyuanImage-3.0-Instruct` because the name contains a dot.

kwargs = dict(
    attn_implementation="sdpa", 
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
    moe_impl="eager",   # Use "flashinfer" if FlashInfer is installed
    moe_drop_tokens=True,
)

model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
model.load_tokenizer(model_id)

# Image-to-image generation (TI2I)
# Prompt meaning: "Based on the logo in image 1 and the fridge-magnet material in image 2, make a new fridge magnet"
prompt = "基于图一的logo,参考图二中冰箱贴的材质,制作一个新的冰箱贴"

input_img1 = "./assets/demo_instruct_imgs/input_1_0.png"
input_img2 = "./assets/demo_instruct_imgs/input_1_1.png"
imgs_input = [input_img1, input_img2]

cot_text, samples = model.generate_image(
    prompt=prompt,
    image=imgs_input,
    seed=42,
    image_size="auto",
    use_system_prompt="en_unified",
    bot_task="think_recaption",  # Use "think_recaption" for reasoning and prompt enhancement
    infer_align_image_size=True,  # Align the output image size to the input image size
    diff_infer_steps=50, 
    verbose=2
)

# Save the generated image
samples[0].save("image_edit.png")
```

#### 🏠 本地安装和使用

##### 1️⃣ 克隆仓库

```bash
git clone https://github.com/Tencent-Hunyuan/HunyuanImage-3.0.git
cd HunyuanImage-3.0/
```

##### 2️⃣ 下载模型权重

```bash
# Download from HuggingFace
hf download tencent/HunyuanImage-3.0-Instruct --local-dir ./HunyuanImage-3-Instruct
```

##### 3️⃣ 运行演示

More demos are available in `run_demo_instruct.sh`.

```bash
export MODEL_PATH="./HunyuanImage-3-Instruct"
bash run_demo_instruct.sh
```

##### 4️⃣ 命令行参数

| Argument             | Description                                    | Recommended |
|----------------------|------------------------------------------------|-------------|
| `--prompt`           | Input prompt                                   | (Required)  |
| `--image`            | Image(s) to process. For multiple images, use comma-separated paths (e.g., 'img1.png,img2.png') | (Required) |
| `--model-id`         | Model path                                     | (Required)  |
| `--attn-impl`        | Attention implementation. Currently only 'sdpa' is supported | `sdpa` |
| `--moe-impl`         | MoE implementation. Either `eager` or `flashinfer` | `flashinfer` |
| `--seed`             | Random seed for image generation. Use None for a random seed | `None` |
| `--diff-infer-steps` | Number of inference steps                      | `50`        |
| `--image-size`       | Image resolution. Can be `auto`, `1280x768`, or `16:9` | `auto` |
| `--use-system-prompt` | System prompt type. Options: `None`, `dynamic`, `en_vanilla`, `en_recaption`, `en_think_recaption`, `en_unified`, `custom` | `en_unified` |
| `--system-prompt`    | Custom system prompt. Used when `--use-system-prompt` is `custom` | `None` |
| `--bot-task`         | Task type. `image` for direct generation; `auto` for text; `recaption` for rewrite->image; `think_recaption` for think->rewrite->image | `think_recaption` |
| `--save`             | Path to save the image                         | `image.png` |
| `--verbose`          | Verbosity level                                | `2`         |
| `--reproduce`        | Whether to reproduce the results               | `True`      |
| `--infer-align-image-size` | Whether to align the target image size to the source image size | `True` |
| `--max_new_tokens`   | Maximum number of new tokens to generate       | `2048`      |
| `--use-taylor-cache` | Use Taylor Cache when sampling                 | `False`     |
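
The `--image` argument takes several paths in one comma-separated string. The sketch below shows one way such a value could be split and checked before it is handed to the model. The helper name `parse_image_arg` is hypothetical, and enforcing the three-image limit mentioned in the showcase section is an assumption here, not documented behavior of the CLI.

```python
from typing import List

# Assumption: cap taken from the "up to 3 inputs" note in the showcase text.
MAX_REFERENCE_IMAGES = 3


def parse_image_arg(value: str) -> List[str]:
    """Split a comma-separated --image value into a clean list of paths."""
    paths = [part.strip() for part in value.split(",") if part.strip()]
    if not paths:
        raise ValueError("--image needs at least one path")
    if len(paths) > MAX_REFERENCE_IMAGES:
        raise ValueError(
            f"at most {MAX_REFERENCE_IMAGES} images are supported, got {len(paths)}"
        )
    return paths


if __name__ == "__main__":
    print(parse_image_arg("img1.png, img2.png"))
```

The resulting list has the same shape as the `image=[...]` argument used in the Transformers example.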

##### 5️⃣ 更少的采样步数

We recommend using the [HunyuanImage-3.0-Instruct-Distil](https://huggingface.co/tencent/HunyuanImage-3.0-Instruct-Distil) model with `--diff-infer-steps 8`, while keeping all other recommended parameter values **unchanged**.

```bash
# Download HunyuanImage-3.0-Instruct-Distil from HuggingFace
hf download tencent/HunyuanImage-3.0-Instruct-Distil --local-dir ./HunyuanImage-3-Instruct-Distil

# Run the demo with 8-step sampling
export MODEL_PATH="./HunyuanImage-3-Instruct-Distil"
bash run_demo_instruct_distil.sh
```
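
In Python, the same recommendation amounts to changing a single keyword. The following sketch keeps the recommended values in one dictionary and lowers only the step count for the distilled checkpoint. The keyword names are taken from the `generate_image` call in the Transformers example; the helper `generation_kwargs` itself is hypothetical.

```python
from typing import Any, Dict

# Recommended values copied from the parameter table above.
RECOMMENDED: Dict[str, Any] = {
    "seed": None,
    "image_size": "auto",
    "use_system_prompt": "en_unified",
    "bot_task": "think_recaption",
    "infer_align_image_size": True,
    "diff_infer_steps": 50,
}


def generation_kwargs(distilled: bool) -> Dict[str, Any]:
    """Return the recommended kwargs, changing only the step count when distilled."""
    kwargs = dict(RECOMMENDED)  # copy so the defaults stay untouched
    if distilled:
        kwargs["diff_infer_steps"] = 8
    return kwargs
```

These could then be unpacked into the call, for example `model.generate_image(prompt=prompt, image=imgs_input, **generation_kwargs(distilled=True))`.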

<details>
<summary> Previous Version (Pure Text-to-Image) </summary>

### HunyuanImage-3.0 (文本生成图像)

#### 📥 安装依赖

```bash
# 1. First install PyTorch (CUDA 12.8 build)
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128

# 2. Install tencentcloud-sdk
pip install -i https://mirrors.tencent.com/pypi/simple/ --upgrade tencentcloud-sdk-python

# 3. Then install the remaining dependencies
pip install -r requirements.txt
```

For **up to 3x faster inference**, install the following optimization:

```bash
# FlashInfer for optimized MoE inference. v0.5.0 is tested.
pip install flashinfer-python==0.5.0
```

#### 🔥 使用 Transformers 快速开始

##### 1️⃣ 下载模型权重

```bash
# Download from HuggingFace into a renamed local directory.
# Note: the directory name should not contain dots, which may cause issues when loading with Transformers.
hf download tencent/HunyuanImage-3.0 --local-dir ./HunyuanImage-3
```

##### 2️⃣ 使用 Transformers 运行

```python
from transformers import AutoModelForCausalLM

# Load the model
model_id = "./HunyuanImage-3"
# Currently the model cannot be loaded directly with the HF model ID
# `tencent/HunyuanImage-3.0` because the name contains a dot.

kwargs = dict(
    attn_implementation="sdpa",     # Use "flash_attention_2" if FlashAttention is installed
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
    moe_impl="eager",   # Use "flashinfer" if FlashInfer is installed
)

model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
model.load_tokenizer(model_id)

# Generate an image
# Prompt meaning: "A brown and white puppy running on the grass"
prompt = "一只棕色和白色相间的小狗奔跑在草地上"
image = model.generate_image(prompt=prompt, stream=True)
image.save("image.png")
```

#### 🏠 本地安装和使用

##### 1️⃣ 克隆仓库

```bash
git clone https://github.com/Tencent-Hunyuan/HunyuanImage-3.0.git
cd HunyuanImage-3.0/
```

##### 2️⃣ 下载模型权重

```bash
# Download from HuggingFace
hf download tencent/HunyuanImage-3.0 --local-dir ./HunyuanImage-3
```

##### 3️⃣ 运行演示

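The pretrained checkpoint does not automatically rewrite or enhance input prompts. For best results, we currently recommend that community users rewrite prompts with DeepSeek. You can apply for an API key at [Tencent Cloud](https://cloud.tencent.com/document/product/1772/115963#.E5.BF.AB.E9.80.9F.E6.8E.A5.E5.85.A5).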

```bash
# Set environment variables
export DEEPSEEK_KEY_ID="your_deepseek_key_id"
export DEEPSEEK_KEY_SECRET="your_deepseek_key_secret"

bash run_demo.sh
```

##### 4️⃣ 命令行参数

| Argument             | Description                                    | Recommended |
|----------------------|------------------------------------------------|-------------|
| `--prompt`           | Input prompt                                   | (Required)  |
| `--model-id`         | Model path                                     | (Required)  |
| `--attn-impl`        | Attention implementation. Either `sdpa` or `flash_attention_2` | `sdpa` |
| `--moe-impl`         | MoE implementation. Either `eager` or `flashinfer` | `flashinfer` |
| `--seed`             | Random seed for image generation               | `None`      |
| `--diff-infer-steps` | Number of diffusion inference steps            | `50`        |
| `--image-size`       | Image resolution. Can be `auto`, `1280x768`, or `16:9` | `auto` |
| `--save`             | Path to save the image                         | `image.png` |
| `--verbose`          | Verbosity level. 0: no logging; 1: log inference information. | `0` |
| `--rewrite`          | Whether to enable prompt rewriting             | `1`         |
| `--sys-deepseek-prompt` | System prompt to use, either `universal` or `text_rendering` | `universal` |
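
The `--image-size` argument accepts three textual forms: `auto`, a pixel size such as `1280x768`, or a ratio such as `16:9`. The sketch below only classifies a value into one of those forms. The function `parse_image_size` is hypothetical, and since the table does not say whether the first number is the width or the height, the two numbers are returned in the order given without labeling them.

```python
import re
from typing import Tuple, Union

ParsedSize = Union[str, Tuple[str, int, int]]


def parse_image_size(value: str) -> ParsedSize:
    """Classify an --image-size value as "auto", a pixel size, or a ratio."""
    value = value.strip()
    if value == "auto":
        return "auto"
    # Pixel form, e.g. "1280x768"
    if m := re.fullmatch(r"(\d+)x(\d+)", value):
        return ("pixels", int(m.group(1)), int(m.group(2)))
    # Ratio form, e.g. "16:9"
    if m := re.fullmatch(r"(\d+):(\d+)", value):
        return ("ratio", int(m.group(1)), int(m.group(2)))
    raise ValueError(f"unrecognized image size: {value!r}")


if __name__ == "__main__":
    for example in ("auto", "1280x768", "16:9"):
        print(example, "->", parse_image_size(example))
```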

#### 🎨 交互式 Gradio 演示

Launch an interactive web interface for easy text-to-image generation.

##### 1️⃣ 安装 Gradio

```bash
# Quote the requirement so the shell does not treat ">" as an output redirection
pip install "gradio>=4.21.0"
```

##### 2️⃣ 配置环境

```bash
# Set your model path
export MODEL_ID="path/to/your/model"

# Optional: configure GPU usage (default: 0,1,2,3)
export GPUS="0,1,2,3"

# Optional: configure host and port (default: 0.0.0.0:443)
export HOST="0.0.0.0"
export PORT="443"
```
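
To make the defaults above concrete, here is a small sketch of how a launcher could read these variables. It mirrors only the documented names and defaults (`MODEL_ID`, `GPUS`, `HOST`, `PORT`); how `run_app.sh` actually consumes them is not shown in this README, and the function `read_app_config` is hypothetical.

```python
import os
from typing import Any, Dict, Mapping, Optional


def read_app_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Collect the demo settings from environment variables, applying the documented defaults."""
    env = os.environ if env is None else env
    model_id = env.get("MODEL_ID")
    if not model_id:
        raise ValueError("MODEL_ID must be set to the model path")
    return {
        "model_id": model_id,
        "gpus": [int(g) for g in env.get("GPUS", "0,1,2,3").split(",")],
        "host": env.get("HOST", "0.0.0.0"),
        "port": int(env.get("PORT", "443")),
    }


if __name__ == "__main__":
    print(read_app_config({"MODEL_ID": "./HunyuanImage-3"}))
```

On Linux, binding to port 443 normally requires elevated privileges, so setting `PORT` to a value above 1023 may be more convenient for local testing.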

##### 3️⃣ 启动 Web 界面

**Basic Launch:**
```bash
sh run_app.sh
```

**With Performance Optimizations:**
```bash
# Use both optimizations for the best performance
sh run_app.sh --moe-impl flashinfer --attn-impl flash_attention_2
```

##### 4️⃣ 访问界面

> 🌐 **Web Interface:** Open your browser and navigate to `http://localhost:443` (or your configured port)

</details>

## 🧱 模型卡片

| Model                     | Params | Download | Recommended VRAM | Supported |
|---------------------------| --- | --- | --- | --- |
| HunyuanImage-3.0          | 80B total (13B active) | [HuggingFace](https://huggingface.co/tencent/HunyuanImage-3.0) | ≥ 3 × 80 GB | ✅ Text-to-Image |
| HunyuanImage-3.0-Instruct | 80B total (13B active) | [HuggingFace](https://huggingface.co/tencent/HunyuanImage-3.0-Instruct) | ≥ 8 × 80 GB | ✅ Text-to-Image<br>✅ Text-Image-to-Image<br>✅ Prompt Self-Rewrite <br>✅ CoT Think |
| HunyuanImage-3.0-Instruct-Distil | 80B total (13B active) | [HuggingFace](https://huggingface.co/tencent/HunyuanImage-3.0-Instruct-Distil) | ≥ 8 × 80 GB | ✅ Text-to-Image<br>✅ Text-Image-to-Image<br>✅ Prompt Self-Rewrite <br>✅ CoT Think <br>✅ Fewer sampling steps (8 steps recommended) |

Notes:
- Install the performance extensions (FlashAttention, FlashInfer) for faster inference.
- Multi-GPU inference is recommended for the base model.
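
A rough back-of-the-envelope estimate helps explain why a single 80 GB GPU is not enough. The sketch below computes the memory taken by the weights alone, assuming 16-bit weights (bf16 or fp16); the checkpoint's actual dtype is not stated here, so `BYTES_PER_PARAM` is an assumption. It ignores activations, caches, and the auxiliary image components.

```python
import math

GPU_MEMORY_GB = 80      # per-GPU memory, from the model card
TOTAL_PARAMS = 80e9     # 80B total parameters, from the model card
BYTES_PER_PARAM = 2     # assumption: 16-bit weights (bf16/fp16)


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


weights_gb = weight_memory_gb(TOTAL_PARAMS, BYTES_PER_PARAM)
gpus_for_weights = math.ceil(weights_gb / GPU_MEMORY_GB)

print(f"weights: {weights_gb:.0f} GB, GPUs needed for weights alone: {gpus_for_weights}")
```

Under this assumption the weights fill two 80 GB GPUs completely, which is consistent with the table recommending at least three for the base model so that some headroom remains. Although only 13B parameters are active per token, all experts typically have to be resident in memory unless some form of offloading is used.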

## 📊 评估结果

### HunyuanImage-3.0-Instruct 评估
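* 👥 **GSB (Human Evaluation)** 
We adopted the GSB (Good/Same/Bad) evaluation method, which is commonly used to assess the relative performance of two models in terms of overall image perception. We used a total of 1,000+ single-image and multi-image editing cases and generated an equal number of image samples for all compared models in a single run. For a fair comparison, we ran inference only once per prompt, avoiding any cherry-picking of results. When comparing against the baseline methods, we kept the default settings of all selected models. The evaluation was performed by more than 100 professional evaluators.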

<p align="center">
  <img src="./assets/gsb_instruct.png" width=60% alt="Human Evaluation with Other Models">
</p>


### HunyuanImage-3.0 评估

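* 🤖 **SSAE (Machine Evaluation)**   
SSAE (Structured Semantics Alignment Evaluation) is an intelligent evaluation metric for image-text alignment based on advanced multimodal large language models (MLLMs). We extracted 3,500 key points across 12 categories, then used MLLMs to automatically evaluate and score the generated images by comparing them against these key points based on the visual content of the images. Mean Image Accuracy is the image-level average score across all key points, while Global Accuracy directly averages the scores over all key points.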
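
The difference between the two SSAE aggregates is easiest to see on a tiny example. The sketch below assumes each key point receives a score of 1.0 (satisfied) or 0.0 (not satisfied); the real scoring scale is not specified here, and the function names are illustrative only.

```python
from statistics import mean
from typing import List


def mean_image_accuracy(scores_per_image: List[List[float]]) -> float:
    """Average the per-image means, so every image counts equally."""
    return mean(mean(scores) for scores in scores_per_image)


def global_accuracy(scores_per_image: List[List[float]]) -> float:
    """Average over all key points, so every key point counts equally."""
    return mean(score for scores in scores_per_image for score in scores)


# Image A has four key points (all satisfied); image B has one (missed).
scores = [[1.0, 1.0, 1.0, 1.0], [0.0]]
print(mean_image_accuracy(scores))  # (1.0 + 0.0) / 2 = 0.5
print(global_accuracy(scores))      # 4 / 5 = 0.8
```

The two numbers diverge whenever images carry different numbers of key points: the image-level mean weights each image equally, while the global figure weights images with more key points more heavily.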

<p align="center">
  <img src="./assets/ssae_side_by_side_comparison.png" width=98% alt="SSAE Comparison with Other Models">
</p>

<p align="center">
  <img src="./assets/ssae_side_by_side_heatmap.png" width=98% alt="SSAE Heatmap Comparison with Other Models">
</p>


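* 👥 **GSB (Human Evaluation)** 

We adopted the GSB (Good/Same/Bad) evaluation method, which is commonly used to assess the relative performance of two models in terms of overall image perception. We used a total of 1,000 text prompts and generated an equal number of image samples for all compared models in a single run. For a fair comparison, we ran inference only once per prompt, avoiding any cherry-picking of results. When comparing against the baseline methods, we kept the default settings of all selected models. The evaluation was performed by more than 100 professional evaluators.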
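
As an illustration of how Good/Same/Bad votes can be summarized, the sketch below tallies the three shares and a net score defined as (Good - Bad) / total. This net score is one common convention and is an assumption here; the exact statistic plotted in the figure is not specified in this README, and the vote counts are made up for the example.

```python
from collections import Counter
from typing import Dict, Iterable


def gsb_summary(votes: Iterable[str]) -> Dict[str, float]:
    """Return the share of good/same/bad votes and the net (good - bad) share."""
    counts = Counter(votes)
    total = counts["good"] + counts["same"] + counts["bad"]
    if total == 0:
        raise ValueError("no good/same/bad votes to summarize")
    return {
        "good": counts["good"] / total,
        "same": counts["same"] / total,
        "bad": counts["bad"] / total,
        "net": (counts["good"] - counts["bad"]) / total,
    }


# Made-up example: 5 good, 3 same, 2 bad
example_votes = ["good"] * 5 + ["same"] * 3 + ["bad"] * 2
print(gsb_summary(example_votes))
```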

<p align="center">
  <img src="./assets/gsb.png" width=98% alt="Human Evaluation with Other Models">
</p>

## 🖼️ 展示

Our model can follow complex instructions to generate high-quality, creative images.

<div align="center">
  <img src="./assets/banner_all.jpg" width=100% alt="HunyuanImage 3.0 Demo">
</div>

For text-to-image showcases, see the following link:

- [HunyuanImage-3.0](./Hunyuan-Image3.md)

### HunyuanImage-3.0-Instruct 展示

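HunyuanImage-3.0-Instruct demonstrates strong capabilities in intelligent image generation and editing. The following showcases highlight its core features:

* 🧠 **Intelligent Visual Understanding and Reasoning (CoT Think)**: The model performs structured thinking to analyze the user's input images and prompt, expanding the user's intent and editing task into structured, comprehensive instructions, which leads to better image generation and editing results.

It breaks complex prompts and editing tasks down into detailed visual components, including subject, composition, lighting, color palette, and style.

* ✏️ **Prompt Self-Rewrite**: Automatically enhances sparse or vague prompts into professional-grade, detail-rich descriptions that capture the user's intent more accurately.

* 🎨 **Text-to-Image (T2I)**: Generates high-quality images from text prompts with excellent prompt adherence and photorealistic quality.

* 🖼️ **Image-to-Image (TI2I)**: Supports creative image editing, including adding elements, removing objects, modifying styles, and seamless background replacement while preserving key visual elements.

* 🔀 **Multi-Image Fusion**: Intelligently combines multiple reference images (up to 3 inputs) to create coherent composite images that integrate visual elements from different sources.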


**Showcase 1: Detailed Thought and Reasoning Process**

<div align="center">
  <img src="./assets/pg_instruct_imgs/cot_ti2i.gif" alt="HunyuanImage-3.0-Instruct Showcase 1" width="90%">
</div>

**Showcase 2: Creative T2I Generation with Complex Scene Understanding**

> Prompt: 3D 毛绒质感拟人化马,暖棕浅棕肌理,穿藏蓝西装、白衬衫,戴深棕手套;疲惫带期待,坐于电脑前,旁置印 "HAPPY AGAIN" 的马克杯。橙红渐变背景,配超大号藏蓝粗体 "马上下班",叠加米黄 "Happy New Year" 并标 "(2026)"。橙红为主,藏蓝米黄撞色,毛绒温暖柔和。

<div align="center">
  <img src="./assets/pg_instruct_imgs/image0.png" alt="HunyuanImage-3.0-Instruct Showcase 2" width="75%">
</div>

**Showcase 3: Precise Image Editing with Element Preservation**

<div align="center">
  <img src="./assets/pg_instruct_imgs/image1.png" alt="HunyuanImage-3.0-Instruct Showcase 3" width="85%">
</div>

**Showcase 4: Style Transformation with Thematic Enhancement**

<div align="center">
  <img src="./assets/pg_instruct_imgs/image2.png" alt="HunyuanImage-3.0-Instruct Showcase 4" width="85%">
</div>


**Showcase 5: Advanced Style Transfer and Product Mockup Generation**

<div align="center">
  <img src="./assets/pg_instruct_imgs/image3.png" alt="HunyuanImage-3.0-Instruct Showcase 5" width="85%">
</div>


**Showcase 6: Multi-Image Fusion and Creative Composition**

<div align="center">
  <img src="./assets/pg_instruct_imgs/image4.png" alt="HunyuanImage-3.0-Instruct Showcase 6" width="85%">
</div>

## 📚 引用

If you find HunyuanImage-3.0 useful in your research, please cite our work:

```bibtex
@article{cao2025hunyuanimage,
  title={HunyuanImage 3.0 Technical Report},
  author={Cao, Siyu and Chen, Hangting and Chen, Peng and Cheng, Yiji and Cui, Yutao and Deng, Xinchi and Dong, Ying and Gong, Kipper and Gu, Tianpeng and Gu, Xiusen and others},
  journal={arXiv preprint arXiv:2509.23951},
  year={2025}
}
```

## 🙏 致谢

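We extend our heartfelt gratitude to the following open-source projects and communities for their invaluable contributions:

* 🤗 [Transformers](https://github.com/huggingface/transformers) - State-of-the-art NLP library
* 🎨 [Diffusers](https://github.com/huggingface/diffusers) - Diffusion models library  
* 🌐 [HuggingFace](https://huggingface.co/) - AI model hub and community
* ⚡ [FlashAttention](https://github.com/Dao-AILab/flash-attention) - Memory-efficient attention
* 🚀 [FlashInfer](https://github.com/flashinfer-ai/flashinfer) - Optimized inference engine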

## 🌟🚀 GitHub Star 历史

[![GitHub stars](https://img.shields.io/github/stars/Tencent-Hunyuan/HunyuanImage-3.0?style=social)](https://github.com/Tencent-Hunyuan/HunyuanImage-3.0)
[![GitHub forks](https://img.shields.io/github/forks/Tencent-Hunyuan/HunyuanImage-3.0?style=social)](https://github.com/Tencent-Hunyuan/HunyuanImage-3.0)

[![Star History Chart](https://api.star-history.com/svg?repos=Tencent-Hunyuan/HunyuanImage-3.0&type=Date)](https://www.star-history.com/#Tencent-Hunyuan/HunyuanImage-3.0&Date)


## /app/pipeline.py

```py path="/app/pipeline.py" 
# Licensed under the TENCENT HUNYUAN COMMUNITY LICENSE AGREEMENT (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://github.com/Tencent-Hunyuan/HunyuanImage-3.0/blob/main/LICENSE
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

import re
import time
from copy import deepcopy
from threading import Thread
from typing import List, Dict, Any, Optional

import gradio
import torch
from PIL import Image
from transformers import TextIteratorStreamer

from hunyuan_image_3.hunyuan import HunyuanImage3ForCausalMM
from hunyuan_image_3.tokenizer_wrapper import ImageInfo
from hunyuan_image_3.system_prompt import t2i_system_prompts


class HunyuanImage3AppPipeline(object):
    def __init__(self, args):
        kwargs = dict(
            attn_implementation=args.attn_impl,
            torch_dtype="auto",
            device_map="auto",
            moe_impl=args.moe_impl,
        )
        self.model = HunyuanImage3ForCausalMM.from_pretrained(args.model_id, **kwargs)
        self.model.load_tokenizer(args.model_id)
        self.image_processor = self.model.image_processor

        print("Loaded HunyuanImage3 pipeline")

    @staticmethod
    def standardize_message_list(message_list, context_mode="single_round"):
        processed_message_list = []

        # We always keep system message if available
        for message in message_list:
            if message["role"] == "system":
                processed_message_list.append(deepcopy(message))
            else:
                break
        if context_mode == "single_round":
            # Traverse the message list in reverse order to find all the last successive user messages.
            reversed_user_messages = []
            for message in reversed(message_list):
                if message["role"] == "user":
                    reversed_user_messages.append(deepcopy(message))
                else:
                    break
            processed_message_list.extend(reversed(reversed_user_messages))

        elif context_mode == "unlimited":
            processed_message_list = deepcopy(message_list)

        else:
            raise ValueError(f"Unknown message strategy: {context_mode}")
        return processed_message_list

    @torch.no_grad()
    def _generate(
            self,
            message_list: List[Dict[str, Any]],
            seed: Optional[int] = None,
            image_size: str = "auto",
            verbose: int = 1,
            **kwargs,
    ):
        """
        A uniform interface for all the t2i, general editing, lm, and mmu tasks.
        Only batch_size 1 is supported.

        Args:
            message_list (List[Dict[str, Any]]):
                A list of dictionaries containing the history messages and new questions.
                [
                    dict(role='system', type='text', content='xxxx', content_type='str')
                    dict(role='user', type='text', content='xxxx', content_type='str'),
                    dict(role='user', type='joint_image', content='xxxx', content_type='image_info'),
                    dict(role='assistant', type='text', content='xxxx', content_type='str')
                    dict(role='assistant', type='joint_image', content='xxxx', content_type='image_info')
                ]
            seed (Optional[int]):
                The random seed for deterministic results.
            image_size (str):
                The size of the generated images, can be "auto" or specified size.
            verbose (int):
                The verbosity level. 0 for silent, 1 for detailed info.
            kwargs:
                context_mode (str):
                    The context mode for processing the message_list, can be "single_round" or "unlimited".
                bot_task (str):
                    The task for the model, can be "image", "think", "recaption", or "auto".
                    "image": text-to-image generation, maybe predict image size first if image_size="auto".
                    "think": chain-of-thought text-to-image generation, predict image size first if image_size="auto".
                    "recaption": image editing with new caption, maybe predict image size first if image_size="auto".
                    "auto": text generation.
                drop_think (bool):
                    Whether to drop the <think> part in the context when generating image.
        """

        try:
            context_mode = kwargs.pop("context_mode")
            message_list = self.standardize_message_list(message_list, context_mode=context_mode)
        except Exception as e:
            yield {"role": "assistant", "value": f"Error: {e}", "type": "text", "error": 100}
            # Stop here: the error has been reported, so do not run generation on unvalidated input
            return

        streamer = TextIteratorStreamer(self.model.tokenizer, skip_prompt=True, skip_special_tokens=False)
        bot_task = kwargs.get("bot_task")
        stop_token = ""
        bot_answer = ""

        # ================================================================
        # gen_text: plain text
        if bot_task != "image":
            model_inputs = self.model.prepare_model_inputs(
                message_list=message_list, seed=seed, image_size=image_size, **kwargs,
            )
            model_inputs.update({"streamer": streamer, "verbose": verbose})

            thread = Thread(
                target=self.model._generate,  # noqa
                kwargs={**model_inputs, **kwargs},
            )
            thread.start()

            # Start token will not be returned by streamer, so we add it here if needed
            if bot_task in ["think", "recaption"]:
                bot_answer = f"<{bot_task}>"
                yield {"role": "system", "value": f"<{bot_task}>", "type": "text"}
            else:
                bot_answer = ""
            stop_token = None
            for text_token in streamer:
                stop_token = text_token
                print(text_token, end="", flush=True)
                if text_token.startswith("<boi>") or text_token.startswith("<img"):
                    continue
                bot_answer += text_token
                yield dict(role="assistant", value=text_token, type="text")
            print()
            # Ensure the generation thread completes
            thread.join()

        if stop_token.endswith("<|endoftext|>"):
            return

        # ================================================================
        # There are two paths to this branch:
        #   Assistant: <think> -> </think><recaption>xxx</recaption>
        #   Assistant: <recaption> -> xxx</recaption>
        if stop_token.endswith("</recaption>"):
            message_list.append(dict(
                role="assistant", type="text", content=bot_answer, content_type="text",     # cot_text
            ))
            # Switch system_prompt to `en_recaption` if needed
            if kwargs.get("drop_think") and message_list[0]["role"] == "system":
                message_list[0]["content"] = t2i_system_prompts["en_recaption"][0]

        # ================================================================
        # gen_text: img_ratio
        if image_size == "auto":
            kwargs.update({"bot_task": "img_ratio"})
            model_inputs = self.model.prepare_model_inputs(
                message_list=message_list, seed=seed, image_size=image_size, **kwargs,
            )
            model_inputs.update({"streamer": streamer, "verbose": verbose})

            # Use a separate thread to catch the output text from streamer in the main thread
            thread = Thread(
                target=self.model._generate,  # noqa
                kwargs={**model_inputs, **kwargs},
            )
            thread.start()

            stop_token = None
            for text_token in streamer:
                time.sleep(0.01)
                stop_token = text_token
                print(text_token, end="", flush=True)
            print()
            # Ensure the generation thread completes
            thread.join()

        # ================================================================
        # stop_token can be (1) <boi> (image_size!=auto, bot_task=auto)
        #                   (2) </recaption> (image_size!=auto, bot_task=think/recaption)
        #                   (3) <img_ratio_*> (image_size=auto)
        # gen_image
        yield dict(role="assistant", value="image", type="flag")
        if image_size == "auto":
            if matched := re.search(r"<img_ratio_\d+>$", stop_token):
                gen_image_info = self.image_processor.build_image_info(matched.group())
            else:
                # Failed to predict image ratio, use the default one
                gen_image_info = self.image_processor.build_image_info("1024x1024")
        else:
            gen_image_info = self.image_processor.build_image_info(image_size)
        message_list.append(dict(
            role="assistant", type="gen_image", content=gen_image_info, content_type="image_info"))
        # Here we enter the gen_image mode. The kwargs `bot_task` won't take effect.
        model_inputs = self.model.prepare_model_inputs(
            message_list=message_list, mode="gen_image", seed=seed, image_size=image_size, **kwargs,
        )
        outputs = self.model._generate(**model_inputs, **kwargs, verbose=verbose)   # noqa
        yield dict(role="assistant", value=outputs[0], type="image")

    def gradio_image_to_image_info(self, image: gradio.components.image.Image) -> ImageInfo:
        img_path = image.value["path"]
        pil_image = Image.open(img_path).convert("RGB")
        image_info = self.image_processor.preprocess(pil_image)
        return image_info

    def history2messages(self, history):
        message_list = []

        # System message should only appear at the beginning of the conversation.
        for msg in history:
            if msg["role"] == "system":
                message_list.append(dict(
                    role="system", type="text", content=msg["content"], content_type='str'
                ))
            else:
                break

        for msg in history:
            if msg["role"] == "system":
                # Ignore system message in the middle of the conversation.
                continue
            elif msg["role"] in ["user", "assistant"]:
                if isinstance(msg["content"], str):
                    message_list.append(dict(
                        role=msg["role"], type="text", content=msg["content"], content_type='str'
                    ))
                elif isinstance(msg["content"], gradio.components.image.Image):
                    message_list.append(dict(
                        role=msg["role"],
                        type="joint_image",
                        content=self.gradio_image_to_image_info(msg['content']),
                        content_type='image_info',
                    ))
                else:
                    raise NotImplementedError(f"Unsupported message type: {type(msg['content'])}")
            else:
                raise NotImplementedError(f"Unsupported role: {msg['role']}")

        # Make sure the last message is from user
        if len(message_list) == 0 or message_list[-1]["role"] != "user":
            raise ValueError("The last message must be from user")

        return message_list

    def generate(self, history, **kwargs):
        message_list = self.history2messages(history)
        # Feed the message_list to the model and yield stream results
        yield from self._generate(message_list, **kwargs)

```
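
The `single_round` branch of `standardize_message_list` can be tried in isolation. The sketch below re-implements just that branch as a free function so that it runs without loading the model or Gradio; the function name `single_round` and the sample history are illustrative and not part of the repository.

```python
from copy import deepcopy
from typing import Any, Dict, List


def single_round(message_list: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the leading system messages plus the trailing run of user messages."""
    kept = []
    # Leading system messages are always preserved.
    for message in message_list:
        if message["role"] != "system":
            break
        kept.append(deepcopy(message))
    # Walk backwards to collect the last consecutive user messages.
    trailing_users = []
    for message in reversed(message_list):
        if message["role"] != "user":
            break
        trailing_users.append(deepcopy(message))
    kept.extend(reversed(trailing_users))
    return kept


history = [
    {"role": "system", "content": "sys"},
    {"role": "user", "content": "draw a cat"},
    {"role": "assistant", "content": "<image>"},
    {"role": "user", "content": "ref.png"},
    {"role": "user", "content": "make it blue"},
]
trimmed = [(m["role"], m["content"]) for m in single_round(history)]
print(trimmed)
# [('system', 'sys'), ('user', 'ref.png'), ('user', 'make it blue')]
```

Earlier turns ("draw a cat" and the assistant's image) are dropped, which is what makes this mode single-round: the model only sees the system prompt and the newest user input.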

## /app/run_chatbot.py

```py path="/app/run_chatbot.py" 
# Licensed under the TENCENT HUNYUAN COMMUNITY LICENSE AGREEMENT (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://github.com/Tencent-Hunyuan/HunyuanImage-3.0/blob/main/LICENSE
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

import argparse
import random
from datetime import datetime
from pathlib import Path
from typing import Optional

import gradio as gr
from gradio import ChatMessage

from app.pipeline import HunyuanImage3AppPipeline
from app.style import load_css
from hunyuan_image_3.system_prompt import t2i_system_prompts

# Global vars
hyi3_pipeline: Optional[HunyuanImage3AppPipeline] = None
image_cache_dir: Optional[Path] = None


def default(val, default_val):
    return val if val is not None else default_val


def load_pipeline(args):
    """ Load the HunyuanImage-3 pipeline """
    global hyi3_pipeline
    hyi3_pipeline = HunyuanImage3AppPipeline(args)
    print("Model and tokenizer loaded.")

    global image_cache_dir
    image_cache_dir = args.image_cache_dir
    if image_cache_dir is not None:
        # Cache image by date
        image_cache_dir = Path(image_cache_dir)
        print("Image cache dir:", image_cache_dir)


def update_history(history, message):
    """ Update chatbot history """
    assert 'text' in message and 'files' in message

    # extra_img_input = preprocess_mask_img(img_input)
    extra_img_input = None
    for x in message["files"]:
        history.append(ChatMessage(role="user", content=gr.Image(x, type="pil", format="png")))
    if message["text"] is not None:
        history.append(ChatMessage(role="user", content=message["text"]))
    if extra_img_input is not None:
        history.append(ChatMessage(role="user", content=gr.Image(extra_img_input, type="pil", format="png")))
    return history, gr.MultimodalTextbox(value=None, interactive=False)


def spinner():
    """ Return a spinner to denote image generation in progress """
    return """<div class="typing-dots">
                <div class="typing-dot"></div>
                <div class="typing-dot"></div>
                <div class="typing-dot"></div>
            </div>"""


def hunyuan_image_3_respond(history, system_prompt,
                            seed, top_k, top_p, temperature, infer_steps, diff_guidance_scale,
                            image_size, bot_task, context_mode,
                            ):
    """
    HunyuanImage-3 response generation function

    Args:
        history (List[Dict[str, str]]): Chat history
        system_prompt (str): System prompt
        seed (int): Random seed. -1 means random seed.
        top_k (int): Top-K for text generation
        top_p (float): Top-P for text generation
        temperature (float): Temperature for text generation
        infer_steps (int): Diffusion inference steps
        diff_guidance_scale (float): Diffusion guidance scale
        image_size (str): Image size. "auto" or "HxW" or "H:W"
        bot_task (str): Bot task.
            "image": Only generate image. If image_size is "auto", the image ratio token will be predicted at first.
            "auto": Text generation. The model will decide whether to generate text or image.
            "think": Given user inputs, start thinking and then rewrite the prompt for image generation, finally
                        generate image.
            "recaption": Given user inputs, rewrite the prompt for image generation, finally generate image.
        context_mode (str): Context mode. "single_round", "unlimited"
    """
    extra_kwargs = {
        "seed": random.randint(0, 1_000_000) if seed < 0 else seed,
        "top_k": top_k,
        "top_p": top_p,
        "temperature": float(temperature),
        "diff_infer_steps": infer_steps,
        "diff_guidance_scale": diff_guidance_scale,
        "image_size": image_size,
        "bot_task": bot_task,
        "context_mode": context_mode,
        "drop_think": hyi3_pipeline.model.generation_config.drop_think,     # drop think when gen_image
    }
    eos = "<|endoftext|>"

    input_message_list = [message for message in history if message["content"] != ""]
    if system_prompt:
        input_message_list = [dict(
            role="system", content=system_prompt, type="text", content_type='str',
        )] + input_message_list

    current_text_response = ""
    history.append({"role": "assistant", "content": ""})

    for r in hyi3_pipeline.generate(input_message_list, **extra_kwargs):
        if r["type"] == "text" and r["value"] not in (eos, ""):
            current_text_response += r["value"]
            history[-1]["content"] = current_text_response
            yield history

        elif r["type"] == "flag":
            if r["value"] == "image":
                # Add a spinner for image generation
                if current_text_response:
                    yield history
                    current_text_response = ""
                history.append({"role": "assistant", "content": spinner()})
                yield history

        elif r["type"] == "image":
            # Finish current text response
            if current_text_response:
                yield history
                history.append({"role": "assistant", "content": ""})
                current_text_response = ""
            # Remove spinner
            if history[-1]["content"] == spinner():
                history.pop()
            # Append and save image
            history.append({"role": "assistant", "content": gr.Image(r["value"], type="pil", format="png")})
            if image_cache_dir is not None:
                date = datetime.now()
                img_path = image_cache_dir / date.strftime("%Y%m%d") / f"img_{date.strftime('%H%M%S_%f')}.png"
                img_path.parent.mkdir(parents=True, exist_ok=True)
                r["value"].save(img_path)
                print(f"Image saved to {img_path}")
            yield history
            history.append({"role": "assistant", "content": ""})

    if not history[-1]["content"]:
        history.pop()
    yield history


def handle_undo(history, undo_data: gr.UndoData):
    """ Handle undo action """
    return history[:undo_data.index], gr.MultimodalTextbox(value=history[undo_data.index]['content'], interactive=True)


def handle_retry(history, retry_data: gr.RetryData, *args, **kwargs):
    """ Handle retry action """
    new_history = history[:retry_data.index + 1]
    yield from hunyuan_image_3_respond(new_history, *args, **kwargs)


def get_system_prompt(sys_type, bot_task):
    if sys_type == 'None':
        visible = False
        value = ""
    elif sys_type in ['en_vanilla', 'en_recaption', 'en_think_recaption']:
        visible = True
        value = t2i_system_prompts[sys_type][0]
    elif sys_type == "dynamic":
        visible = True
        if bot_task == "think":
            value = t2i_system_prompts["en_think_recaption"][0]
        elif bot_task == "recaption":
            value = t2i_system_prompts["en_recaption"][0]
        elif bot_task == "image":
            value = t2i_system_prompts["en_vanilla"][0].strip("\n")
        else:
            value = ""
    elif sys_type == 'custom':
        visible = True
        value = ""
    else:
        raise NotImplementedError(f"Unsupported system prompt type: {sys_type}")
    return gr.TextArea(value=value, lines=7, max_lines=7, placeholder="Please input system prompt", show_label=False,
                       visible=visible, elem_id="system-prompt")


def create_ui_interface(args):
    gen_config = hyi3_pipeline.model.generation_config
    block = gr.Blocks(fill_height=True, css=load_css())
    with block:
        with gr.Column():
            # ==== Left ====
            #  Sidebar
            with gr.Sidebar(open=args.open_sidebar, width='20%'):
                with gr.Accordion("Image Generation", open=True, visible=True):
                    with gr.Row(elem_id="Image Generation parameter", visible=True):
                        image_size = gr.Dropdown([
                            ("Auto", "auto"),
                            ("1:1", "1024x1024"),
                            ("4:3", "896x1152"),
                            ("3:4", "1152x896"),
                            ("16:9", "768x1280"),
                            ("9:16", "1280x768"),
                            ("21:9", "640x1408"),
                        ], label="Image size", value=args.image_size)
                        seed = gr.Number(
                            label="Seed", minimum=-1, maximum=1_000_000, value=args.seed, step=1, precision=0,
                            min_width=80,
                        )
                        infer_steps = gr.Slider(
                            label="Infer Steps", minimum=1, maximum=200,
                            value=default(args.diff_infer_steps, gen_config.diff_infer_steps), step=1,
                            min_width=200,
                        )
                        diff_guidance_scale = gr.Slider(
                            label="Guidance", minimum=1.0, maximum=16.0,
                            value=default(args.diff_guidance_scale, gen_config.diff_guidance_scale), step=0.5,
                            min_width=200,
                        )
                        use_system_prompt = gr.Dropdown([
                            ("None", 'None'),
                            ("Preset(Dynamic)", "dynamic"),
                            ("Preset(Default)", 'en_vanilla'),
                            ("Preset(Recaption)", 'en_recaption'),
                            ("Preset(Think+Recaption)", 'en_think_recaption'),
                            ("Custom", 'custom'),
                        ], label="System Prompt", value=default(args.use_system_prompt, gen_config.use_system_prompt))
                        bot_task = gr.Dropdown([
                            ("Image", "image"),
                            ("Auto", "auto"),
                            ("Think", "think"),
                            ("Recaption", "recaption"),
                        ], label="Bot Task", value=default(args.bot_task, gen_config.bot_task), min_width=150)
                        context_mode = gr.Dropdown([
                            ("Single Round", "single_round"),
                            ("All", "unlimited"),
                        ], label="Context Mode", value=args.context_mode, min_width=150)
                with gr.Accordion("Text Generation", open=False, visible=True):
                    with gr.Row(elem_id="Text Generation parameter"):
                        top_k = gr.Slider(
                            label="Top-K", minimum=1, maximum=16384,
                            value=default(args.top_k, gen_config.top_k), step=1, min_width=200,
                        )
                        top_p = gr.Slider(
                            label="Top-P", minimum=0.0, maximum=1.0,
                            value=default(args.top_p, gen_config.top_p), step=0.01, min_width=200,
                        )
                        temperature = gr.Slider(
                            label="Temperature", minimum=0.1, maximum=1.0,
                            value=default(args.temperature, gen_config.temperature), step=0.1, min_width=200,
                        )

            # ==== Right ====
            #  System prompt
            accordion = gr.Accordion("System Prompt", open=False)
            with accordion:
                system_prompt = get_system_prompt(
                    default(args.use_system_prompt, gen_config.use_system_prompt),
                    default(args.bot_task, gen_config.bot_task),
                )
            #  Chatbot
            chatbot = gr.Chatbot(
                min_height=500,
                elem_id="chatbot",
                bubble_full_width=False,
                type="messages",
                scale=1,
                avatar_images=('./assets/user.png', './assets/robot.png'),
                allow_tags=["think", "recaption"],
            )
            #  Input text box
            with gr.Row(scale=0):
                chat_input = gr.MultimodalTextbox(
                    interactive=True,
                    file_count="multiple",
                    file_types=["image"],
                    scale=15,
                    placeholder="Enter message or upload file...", show_label=False,
                    max_plain_text_length=65536,
                )

            #  Events
            chatbot.undo(handle_undo, chatbot, [chatbot, chat_input])
            chatbot.retry(
                handle_retry,
                [
                    chatbot, system_prompt,
                    seed, top_k, top_p, temperature, infer_steps, diff_guidance_scale,
                    image_size, bot_task, context_mode,
                ],
                chatbot,
            )

            chat_input.submit(
                update_history,
                [chatbot, chat_input],
                [chatbot, chat_input],
                queue=False,
            ).then(
                hunyuan_image_3_respond,
                [
                    chatbot, system_prompt,
                    seed, top_k, top_p, temperature, infer_steps, diff_guidance_scale,
                    image_size, bot_task, context_mode
                ],
                chatbot
            ).then(
                lambda: gr.MultimodalTextbox(interactive=True), None, [chat_input]
            )

            use_system_prompt.change(fn=get_system_prompt, inputs=[use_system_prompt, bot_task], outputs=system_prompt)
            bot_task.change(fn=get_system_prompt, inputs=[use_system_prompt, bot_task], outputs=system_prompt)

    return block


def parse_args():
    parser = argparse.ArgumentParser("Commandline arguments for running HunyuanImage-3 locally")
    # server
    parser.add_argument("--host", type=str, default="0.0.0.0", help="Host to run the server on")
    parser.add_argument("--port", type=int, default=8080, help="Port to run the server on")
    parser.add_argument("--image-cache-dir", type=str, help="Directory where images are saved.")
    # ui
    parser.add_argument("--open-sidebar", action="store_true", help="Whether to open the sidebar by default")
    # model
    parser.add_argument("--model-id", type=str, default="./HunyuanImage-3", help="Path to the model")
    parser.add_argument("--attn-impl", type=str, default="sdpa", choices=["sdpa", "flash_attention_2"],
                        help="Attention implementation")
    parser.add_argument("--moe-impl", type=str, default="eager", choices=["eager", "flashinfer"],
                        help="MoE implementation")
    # inference
    parser.add_argument("--seed", type=int, default=-1, help="Random seed. -1 means random seed.")
    parser.add_argument("--diff-infer-steps", type=int, help="Number of inference steps")
    parser.add_argument("--diff-guidance-scale", type=float, help="Guidance scale")
    parser.add_argument("--image-size", type=str, default="auto", help="Image size")
    parser.add_argument("--bot-task", type=str, choices=["image", "auto", "think", "recaption", "img_ratio"],
                        help="Bot task type for generating text.")
    parser.add_argument("--context-mode", type=str, default="single_round", choices=["single_round", "unlimited"],
                        help="Context mode")
    parser.add_argument("--top-k", type=int, help="Top-K")
    parser.add_argument("--top-p", type=float, help="Top-P")
    parser.add_argument("--temperature", type=float, help="Temperature")
    parser.add_argument("--use-system-prompt", type=str,
                        choices=["en_vanilla", "en_recaption", "en_think_recaption", "dynamic", "custom", "None"],
                        help="System prompt type")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    load_pipeline(args)

    chatbot_ui = create_ui_interface(args)
    chatbot_ui.launch(server_name=args.host, server_port=args.port, share=False)

```
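`hunyuan_image_3_respond` consumes a stream of `text`, `flag`, and `image` events and folds them into the chat history. The sketch below reproduces that folding in simplified form. It assumes a hypothetical `fake_stream` generator in place of `hyi3_pipeline.generate`, plain strings in place of `gr.Image`, and a `SPINNER` placeholder in place of the HTML returned by `spinner()`. Incremental `yield`s and disk caching are omitted.

```python
EOS = "<|endoftext|>"
SPINNER = "<spinner>"  # placeholder for the HTML returned by spinner()


def fake_stream():
    """Hypothetical event stream standing in for hyi3_pipeline.generate()."""
    yield {"type": "text", "value": "Drawing "}
    yield {"type": "text", "value": "a cat."}
    yield {"type": "flag", "value": "image"}
    yield {"type": "image", "value": "IMAGE(cat)"}
    yield {"type": "text", "value": EOS}


def fold_events(events):
    """Fold streamed events into a list of assistant messages."""
    history = [{"role": "assistant", "content": ""}]
    text = ""
    for r in events:
        if r["type"] == "text" and r["value"] not in (EOS, ""):
            # Text chunks accumulate in the current bubble.
            text += r["value"]
            history[-1]["content"] = text
        elif r["type"] == "flag" and r["value"] == "image":
            # Image generation started: close the text bubble, show a spinner.
            text = ""
            history.append({"role": "assistant", "content": SPINNER})
        elif r["type"] == "image":
            text = ""
            if history[-1]["content"] == SPINNER:
                history.pop()  # replace the spinner with the image
            history.append({"role": "assistant", "content": r["value"]})
            # Open a fresh bubble for any text that follows the image.
            history.append({"role": "assistant", "content": ""})
    if not history[-1]["content"]:
        history.pop()  # drop the trailing empty bubble
    return history


result = fold_events(fake_stream())
print([m["content"] for m in result])  # ['Drawing a cat.', 'IMAGE(cat)']
```

The spinner bubble exists only between the `flag` event and the `image` event, so the final history holds just the text and the image.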

## /app/style.py

```py path="/app/style.py" 
# Licensed under the TENCENT HUNYUAN COMMUNITY LICENSE AGREEMENT (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://github.com/Tencent-Hunyuan/HunyuanImage-3.0/blob/main/LICENSE
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

def load_css():
    return """
.contain { display: flex !important; flex-direction: column !important; }
.typing-dots {
    display: inline-flex;
    gap: 4px;
    flex-direction: row;
}
.typing-dot {
    width: 6px;
    height: 6px;
    background: #666;
    border-radius: 50%;
    animation: typing 1s infinite alternate;
}
.typing-dot:nth-child(2) {
    animation-delay: 0.2s;
}
.typing-dot:nth-child(3) {
    animation-delay: 0.4s;
}
#chatbot div[class^="message-row"] div[class^="message"] button img {
    max-height: 512px;
}
.open.svelte-y4v1h1:not(.right) .toggle-button.svelte-y4v1h1 {
    right: var(--size-0-5);
    transform: rotate(180deg);
}
.bubble.user-row.svelte-yaaj3.svelte-yaaj3 {
    max-width: 80%;
}

@keyframes typing {
    from { opacity: 0.3; transform: translateY(0); }
    to { opacity: 1; transform: translateY(0); }
}
#system-prompt textarea {
    overflow: scroll !important;
}
#component-2, #component-17, #component-23  { height: 100% !important; }
#chatbot { flex-grow: 1 !important; overflow: auto !important;}
#col { height: 100vh !important; }

"""

```

## /assets/HunyuanImage_3_0.pdf

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/HunyuanImage_3_0.pdf

## /assets/WECHAT.md

<div align="center">
<img src=wechat.png width="60%"/>

<p> Scan the QR code to follow the Hunyuan Image series and join the "Tencent Hunyuan Image Generation Discussion Group" (腾讯混元生图交流群) </p>
</div>


## /assets/banner.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/banner.png

## /assets/banner_all.jpg

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/banner_all.jpg

## /assets/demo_instruct_imgs/input_0_0.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/demo_instruct_imgs/input_0_0.png

## /assets/demo_instruct_imgs/input_1_0.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/demo_instruct_imgs/input_1_0.png

## /assets/demo_instruct_imgs/input_1_1.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/demo_instruct_imgs/input_1_1.png

## /assets/demo_instruct_imgs/input_2_0.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/demo_instruct_imgs/input_2_0.png

## /assets/demo_instruct_imgs/input_2_1.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/demo_instruct_imgs/input_2_1.png

## /assets/demo_instruct_imgs/input_2_2.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/demo_instruct_imgs/input_2_2.png

## /assets/framework.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/framework.png

## /assets/gsb.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/gsb.png

## /assets/gsb_instruct.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/gsb_instruct.png

## /assets/logo.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/logo.png

## /assets/pg_imgs/image1.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/pg_imgs/image1.png

## /assets/pg_imgs/image2.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/pg_imgs/image2.png

## /assets/pg_imgs/image3.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/pg_imgs/image3.png

## /assets/pg_imgs/image4.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/pg_imgs/image4.png

## /assets/pg_imgs/image5.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/pg_imgs/image5.png

## /assets/pg_imgs/image6.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/pg_imgs/image6.png

## /assets/pg_imgs/image7.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/pg_imgs/image7.png

## /assets/pg_imgs/image8.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/pg_imgs/image8.png

## /assets/pg_instruct_imgs/cot_ti2i.gif

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/pg_instruct_imgs/cot_ti2i.gif

## /assets/pg_instruct_imgs/image0.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/pg_instruct_imgs/image0.png

## /assets/pg_instruct_imgs/image1.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/pg_instruct_imgs/image1.png

## /assets/pg_instruct_imgs/image2.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/pg_instruct_imgs/image2.png

## /assets/pg_instruct_imgs/image3.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/pg_instruct_imgs/image3.png

## /assets/pg_instruct_imgs/image4.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/pg_instruct_imgs/image4.png

## /assets/robot.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/robot.png

## /assets/ssae_side_by_side_comparison.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/ssae_side_by_side_comparison.png

## /assets/ssae_side_by_side_heatmap.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/ssae_side_by_side_heatmap.png

## /assets/user.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/user.png

## /assets/wechat.png

Binary file available at https://raw.githubusercontent.com/Tencent-Hunyuan/HunyuanImage-3.0/refs/heads/main/assets/wechat.png

## /docker/hyimage3_vllm.Dockerfile

```Dockerfile path="/docker/hyimage3_vllm.Dockerfile" 
# Dockerfile of hunyuanimage3-vllm
FROM vllm/vllm-openai:v0.11.0 AS base
ENTRYPOINT []

RUN ln -sf /usr/bin/python3 /usr/bin/python &&  \
    pip install --no-cache-dir git+https://github.com/huggingface/transformers && \
    git clone https://github.com/Tencent-Hunyuan/HunyuanImage-3.0 /root/HunyuanImage-3.0 && \
    pip install apache-tvm-ffi==0.1.0b15 && \
    pip install diffusers transformers accelerate && \
    pip install /root/HunyuanImage-3.0 && \
    git clone --branch feature/hunyuan_image_3.0 https://github.com/kippergong/vllm.git && \
    cd vllm && VLLM_USE_PRECOMPILED=1 pip install --editable .

RUN apt-get update && \
    apt-get install -y openssh-server && \
    mkdir /var/run/sshd && \
    apt-get install -y tmux && \
    apt-get install -y screen && \
    apt-get install -y pdsh && \
    apt-get install -y pssh && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/* && \
    pip install gpustat

RUN echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc

CMD ["/usr/sbin/sshd", "-D"]

```

## /hunyuan_image_3/__init__.py

```py path="/hunyuan_image_3/__init__.py" 
# HunyuanImage-3.0: A Powerful Native Multimodal Model for T2T, T2I, TI2T, TI2I
"""
HunyuanImage-3.0 Package

This package provides the implementation of HunyuanImage-3.0, a unified multimodal model
that supports multiple generation tasks:
- Text-to-Text: Text generation from text prompts
- Text-to-Image: Image generation from text prompts
- Text & Image to Text: Text generation from text and image inputs
- Text & Image to Image: Image generation from text and image inputs

The package uses lazy loading to optimize import performance and reduce memory usage.
"""

from typing import TYPE_CHECKING

from utils import _LazyModule
from utils.import_utils import define_import_structure

# TYPE_CHECKING is used to provide type hints for static type checkers without
# actually importing the modules at runtime, which helps with circular imports
# and reduces startup time.
if TYPE_CHECKING:
    from .configuration_hunyuan_image_3 import *
    from .modeling_hunyuan_image_3 import *
    from .autoencoder_kl_3d import *
    from .image_processor import *
    from .siglip2 import *
    from .tokenization_hunyuan_image_3 import *
else:
    # At runtime, use lazy loading to defer module imports until they are actually accessed.
    # This improves startup time and reduces memory footprint.
    import sys

    _file = globals()["__file__"]
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)

```
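The package docstring above says that submodules are imported lazily. The sketch below shows the general idea only and is not the `_LazyModule` implementation the package imports: `LazyModule` and its attribute-to-module mapping are hypothetical names, and standard-library modules stand in for the package's submodules.

```python
import importlib
import sys
import types


class LazyModule(types.ModuleType):
    """Illustrative lazy module: resolves attributes on first access."""

    def __init__(self, name, attr_to_module):
        super().__init__(name)
        # Maps an exported attribute name to the module that defines it.
        self._attr_to_module = attr_to_module

    def __getattr__(self, attr):
        # Called only when normal attribute lookup fails.
        if attr.startswith("_") or attr not in self._attr_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module = importlib.import_module(self._attr_to_module[attr])
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache so __getattr__ is not hit again
        return value


# Standard-library modules stand in for the package's submodules.
lazy = LazyModule("lazy_demo", {"dumps": "json", "sqrt": "math"})
sys.modules["lazy_demo"] = lazy

print(lazy.sqrt(9.0))        # 3.0
print("sqrt" in vars(lazy))  # True: cached after first access
```

Nothing is imported until an attribute is requested, which is why replacing the entry in `sys.modules` keeps `import` cheap for callers that use only part of a package.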

## /hunyuan_image_3/autoencoder_kl_3d.py

```py path="/hunyuan_image_3/autoencoder_kl_3d.py" 
"""
Reference code
[FLUX] https://github.com/black-forest-labs/flux/blob/main/src/flux/modules/autoencoder.py
[DCAE] https://github.com/mit-han-lab/efficientvit/blob/master/efficientvit/models/efficientvit/dc_ae.py
"""
import os
from dataclasses import dataclass
from typing import Tuple, Optional
import math
import random
import numpy as np
from einops import rearrange
import torch
from torch import Tensor, nn
import torch.nn.functional as F
import torch.distributed as dist
import torch.multiprocessing as mp

from safetensors import safe_open
from collections import OrderedDict
from collections.abc import Iterable
from diffusers.configuration_utils import ConfigMixin, register_to_config
from diffusers.models.modeling_outputs import AutoencoderKLOutput
from diffusers.models.modeling_utils import ModelMixin
from diffusers.utils.torch_utils import randn_tensor
from diffusers.utils import BaseOutput



class DiagonalGaussianDistribution(object):
    def __init__(self, parameters: torch.Tensor, deterministic: bool = False):
        if parameters.ndim == 3:
            dim = 2  # (B, L, C)
        elif parameters.ndim == 5 or parameters.ndim == 4:
            dim = 1  # (B, C, T, H ,W) / (B, C, H, W)
        else:
            raise NotImplementedError
        self.parameters = parameters
        self.mean, self.logvar = torch.chunk(parameters, 2, dim=dim)
        self.logvar = torch.clamp(self.logvar, -30.0, 20.0)
        self.deterministic = deterministic
        self.std = torch.exp(0.5 * self.logvar)
        self.var = torch.exp(self.logvar)
        if self.deterministic:
            self.var = self.std = torch.zeros_like(
                self.mean, device=self.parameters.device, dtype=self.parameters.dtype
            )

    def sample(self, generator: Optional[torch.Generator] = None) -> torch.FloatTensor:
        # make sure sample is on the same device as the parameters and has same dtype
        sample = randn_tensor(
            self.mean.shape,
            generator=generator,
            device=self.parameters.device,
            dtype=self.parameters.dtype,
        )
        x = self.mean + self.std * sample
        return x

    def kl(self, other: Optional["DiagonalGaussianDistribution"] = None) -> torch.Tensor:
        if self.deterministic:
            return torch.Tensor([0.0])
        else:
            reduce_dim = list(range(1, self.mean.ndim))
            if other is None:
                return 0.5 * torch.sum(
                    torch.pow(self.mean, 2) + self.var - 1.0 - self.logvar,
                    dim=reduce_dim,
                )
            else:
                return 0.5 * torch.sum(
                    torch.pow(self.mean - other.mean, 2) / other.var +
                    self.var / other.var -
                    1.0 -
                    self.logvar +
                    other.logvar,
                    dim=reduce_dim,
                )

    def nll(self, sample: torch.Tensor, dims: Tuple[int, ...] = (1, 2, 3)) -> torch.Tensor:
        if self.deterministic:
            return torch.Tensor([0.0])
        logtwopi = np.log(2.0 * np.pi)
        return 0.5 * torch.sum(
            logtwopi + self.logvar + torch.pow(sample - self.mean, 2) / self.var,
            dim=dims,
        )

    def mode(self) -> torch.Tensor:
        return self.mean

@dataclass
class DecoderOutput(BaseOutput):
    sample: torch.FloatTensor
    posterior: Optional[DiagonalGaussianDistribution] = None

def swish(x: Tensor) -> Tensor:
    return x * torch.sigmoid(x)

def forward_with_checkpointing(module, *inputs, use_checkpointing=False):
    def create_custom_forward(module):
        def custom_forward(*inputs):
            return module(*inputs)
        return custom_forward

    if use_checkpointing:
        return torch.utils.checkpoint.checkpoint(create_custom_forward(module), *inputs, use_reentrant=False)
    else:
        return module(*inputs)


class Conv3d(nn.Conv3d):
    """Perform Conv3d on patches with numerical differences from nn.Conv3d within 1e-5. Only symmetric padding is supported."""

    def forward(self, input):
        B, C, T, H, W = input.shape
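        # Approximate per-sample input size in GiB, assuming 2 bytes per element (fp16/bf16).
        memory_count = (C * T * H * W) * 2 / 1024**3
        if memory_count > 2:
            # Split along the temporal axis so each chunk stays at roughly 2 GiB or less.
            n_split = math.ceil(memory_count / 2)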
            assert n_split >= 2
            chunks = torch.chunk(input, chunks=n_split, dim=-3)
            padded_chunks = []
            for i in range(len(chunks)):
                if self.padding[0] > 0:
                    padded_chunk = F.pad(
                        chunks[i],
                        (0, 0, 0, 0, self.padding[0], self.padding[0]),
                        mode="constant" if self.padding_mode == "zeros" else self.padding_mode,
                        value=0,
                    )
                    if i > 0:
                        padded_chunk[:, :, :self.padding[0]] = chunks[i - 1][:, :, -self.padding[0]:]
                    if i < len(chunks) - 1:
                        padded_chunk[:, :, -self.padding[0]:] = chunks[i + 1][:, :, :self.padding[0]]
                else:
                    padded_chunk = chunks[i]
                padded_chunks.append(padded_chunk)
            padding_bak = self.padding
            self.padding = (0, self.padding[1], self.padding[2])
            outputs = []
            for i in range(len(padded_chunks)):
                outputs.append(super().forward(padded_chunks[i]))
            self.padding = padding_bak
            return torch.cat(outputs, dim=-3)
        else:
            return super().forward(input)


class AttnBlock(nn.Module):
    def __init__(self, in_channels: int):
        super().__init__()
        self.in_channels = in_channels

        self.norm = nn.GroupNorm(num_groups=32, num_channels=in_channels, eps=1e-6, affine=True)

        self.q = Conv3d(in_channels, in_channels, kernel_size=1)
        self.k = Conv3d(in_channels, in_channels, kernel_size=1)
        self.v = Conv3d(in_channels, in_channels, kernel_size=1)
        self.proj_out = Conv3d(in_channels, in_channels, kernel_size=1)

    def attention(self, h_: Tensor) -> Tensor:
        h_ = self.norm(h_)
        q = self.q(h_)
        k = self.k(h_)
        v = self.v(h_)

        b, c, f, h, w = q.shape
        q = rearrange(q, "b c f h w -> b 1 (f h w) c").contiguous()
        k = rearrange(k, "b c f h w -> b 1 (f h w) c").contiguous()
        v = rearrange(v, "b c f h w -> b 1 (f h w) c").contiguous()
        h_ = nn.functional.scaled_dot_product_attention(q, k, v)

        return rearrange(h_, "b 1 (f h w) c -> b c f h w", f=f, h=h, w=w, c=c, b=b)

    def forward(self, x: Tensor) -> Tensor:
        return x + self.proj_out(self.attention(x))


class ResnetBlock(nn.Module):
    def __init__(self, in_channels: int, out_channels: Optional[int] = None):
        super().__init__()
        self.in_channels = in_channels
        out_channels = in_channels if out_channels is None else out_channels
        self.out_channels = out_channels

        self.norm1 = nn.GroupNorm(num_groups=32, num_channels=in_channels, eps=1e-6, affine=True)
        self.conv1 = Conv3d(in_channels, out_channels, kernel_size=3, stride=1, padding=1)
        self.norm2 = nn.GroupNorm(num_groups=32, num_channels=out_channels, eps=1e-6, affine=True)
        self.conv2 = Conv3d(out_channels, out_channels, kernel_size=3, stride=1, padding=1)
        if self.in_channels != self.out_channels:
            self.nin_shortcut = Conv3d(in_channels, out_channels, kernel_size=1, stride=1, padding=0)

    def forward(self, x):
        h = x
        h = self.norm1(h)
        h = swish(h)
        h = self.conv1(h)

        h = self.norm2(h)
        h = swish(h)
        h = self.conv2(h)

        if self.in_channels != self.out_channels:
            x = self.nin_shortcut(x)
        return x + h


class Downsample(nn.Module):
    def __init__(self, in_channels: int, add_temporal_downsample: bool = True):
        super().__init__()
        self.add_temporal_downsample = add_temporal_downsample
        stride = (2, 2, 2) if add_temporal_downsample else (1, 2, 2)  # THW
        # no asymmetric padding in torch conv, must do it ourselves
        self.conv = Conv3d(in_channels, in_channels, kernel_size=3, stride=stride, padding=0)

    def forward(self, x: Tensor):
        spatial_pad = (0, 1, 0, 1, 0, 0)  # WHT
        x = nn.functional.pad(x, spatial_pad, mode="constant", value=0)

        temporal_pad = (0, 0, 0, 0, 0, 1) if self.add_temporal_downsample else (0, 0, 0, 0, 1, 1)
        x = nn.functional.pad(x, temporal_pad, mode="replicate")

        x = self.conv(x)
        return x


class DownsampleDCAE(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, add_temporal_downsample: bool = True):
        super().__init__()
        factor = 2 * 2 * 2 if add_temporal_downsample else 1 * 2 * 2
        assert out_channels % factor == 0
        self.conv = Conv3d(in_channels, out_channels // factor, kernel_size=3, stride=1, padding=1)

        self.add_temporal_downsample = add_temporal_downsample
        self.group_size = factor * in_channels // out_channels

    def forward(self, x: Tensor):
        r1 = 2 if self.add_temporal_downsample else 1
        h = self.conv(x)
        h = rearrange(h, "b c (f r1) (h r2) (w r3) -> b (r1 r2 r3 c) f h w", r1=r1, r2=2, r3=2)
        shortcut = rearrange(x, "b c (f r1) (h r2) (w r3) -> b (r1 r2 r3 c) f h w", r1=r1, r2=2, r3=2)

        B, C, T, H, W = shortcut.shape
        shortcut = shortcut.view(B, h.shape[1], self.group_size, T, H, W).mean(dim=2)
        return h + shortcut
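

# Illustrative sketch (hypothetical helper, not part of the original code):
# channel bookkeeping for DownsampleDCAE above. It only checks that the conv
# branch and the averaged shortcut end up with the same channel count; the
# sample values used with it are made up, not the model's configuration.
def _example_downsample_dcae_channels(in_channels, out_channels, add_temporal_downsample=True):
    factor = 2 * 2 * 2 if add_temporal_downsample else 1 * 2 * 2
    conv_out = out_channels // factor        # channels produced by self.conv
    main_branch = conv_out * factor          # after folding (r1 r2 r3) into channels
    shortcut_raw = in_channels * factor      # input after the same space-to-channel fold
    group_size = factor * in_channels // out_channels
    shortcut = shortcut_raw // group_size    # after averaging each group of channels
    return {"main": main_branch, "shortcut": shortcut, "group_size": group_size}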


class Upsample(nn.Module):
    def __init__(self, in_channels: int, add_temporal_upsample: bool = True):
        super().__init__()
        self.add_temporal_upsample = add_temporal_upsample
        self.scale_factor = (2, 2, 2) if add_temporal_upsample else (1, 2, 2)  # THW
        self.conv = Conv3d(in_channels, in_channels, kernel_size=3, stride=1, padding=1)

    def forward(self, x: Tensor):
        x = nn.functional.interpolate(x, scale_factor=self.scale_factor, mode="nearest")
        x = self.conv(x)
        return x


class UpsampleDCAE(nn.Module):
    def __init__(self, in_channels: int, out_channels: int, add_temporal_upsample: bool = True):
        super().__init__()
        factor = 2 * 2 * 2 if add_temporal_upsample else 1 * 2 * 2
        self.conv = Conv3d(in_channels, out_channels * factor, kernel_size=3, stride=1, padding=1)

        self.add_temporal_upsample = add_temporal_upsample
        self.repeats = factor * out_channels // in_channels

    def forward(self, x: Tensor):
        r1 = 2 if self.add_temporal_upsample else 1
        h = self.conv(x)
        h = rearrange(h, "b (r1 r2 r3 c) f h w -> b c (f r1) (h r2) (w r3)", r1=r1, r2=2, r3=2)
        shortcut = x.repeat_interleave(repeats=self.repeats, dim=1)
        shortcut = rearrange(shortcut, "b (r1 r2 r3 c) f h w -> b c (f r1) (h r2) (w r3)", r1=r1, r2=2, r3=2)
        return h + shortcut
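

# Illustrative sketch (hypothetical helper, not part of the original code):
# channel bookkeeping for UpsampleDCAE above. The conv branch expands to
# out_channels * factor channels and the shortcut repeats the input channels
# `repeats` times; both are then unfolded into space by the same rearrange.
def _example_upsample_dcae_channels(in_channels, out_channels, add_temporal_upsample=True):
    factor = 2 * 2 * 2 if add_temporal_upsample else 1 * 2 * 2
    conv_out = out_channels * factor         # channels produced by self.conv
    repeats = factor * out_channels // in_channels
    shortcut_raw = in_channels * repeats     # after repeat_interleave on dim=1
    # Channel-to-space unfolding divides the channel count by `factor`.
    return {"main": conv_out // factor, "shortcut": shortcut_raw // factor, "repeats": repeats}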


class Encoder(nn.Module):
    def __init__(
        self,
        in_channels: int,
        z_channels: int,
        block_out_channels: Tuple[int, ...],
        num_res_blocks: int,
        ffactor_spatial: int,
        ffactor_temporal: int,
        downsample_match_channel: bool = True,
    ):
        super().__init__()
        assert block_out_channels[-1] % (2 * z_channels) == 0

        self.z_channels = z_channels
        self.block_out_channels = block_out_channels
        self.num_res_blocks = num_res_blocks

        # downsampling
        self.conv_in = Conv3d(in_channels, block_out_channels[0], kernel_size=3, stride=1, padding=1)

        self.down = nn.ModuleList()
        block_in = block_out_channels[0]
        for i_level, ch in enumerate(block_out_channels):
            block = nn.ModuleList()
            block_out = ch
            for _ in range(self.num_res_blocks):
                block.append(ResnetBlock(in_channels=block_in, out_channels=block_out))
                block_in = block_out
            down = nn.Module()
            down.block = block

            add_spatial_downsample = bool(i_level < np.log2(ffactor_spatial))
            add_temporal_downsample = add_spatial_downsample and bool(i_level >= np.log2(ffactor_spatial // ffactor_temporal))
            if add_spatial_downsample or add_temporal_downsample:
                assert i_level < len(block_out_channels) - 1
                block_out = block_out_channels[i_level + 1] if downsample_match_channel else block_in
                down.downsample = DownsampleDCAE(block_in, block_out, add_temporal_downsample)
                block_in = block_out
            self.down.append(down)

        # middle
        self.mid = nn.Module()
        self.mid.block_1 = ResnetBlock(in_channels=block_in, out_channels=block_in)
        self.mid.attn_1 = AttnBlock(block_in)
        self.mid.block_2 = ResnetBlock(in_channels=block_in, out_channels=block_in)

        # end
        self.norm_out = nn.GroupNorm(num_groups=32, num_channels=block_in, eps=1e-6, affine=True)
        self.conv_out = Conv3d(block_in, 2 * z_channels, kernel_size=3, stride=1, padding=1)

        self.gradient_checkpointing = False

    def forward(self, x: Tensor) -> Tensor:
        with torch.no_grad():
            use_checkpointing = bool(self.training and self.gradient_checkpointing)

            # downsampling
            h = self.conv_in(x)
            for i_level in range(len(self.block_out_channels)):
                for i_block in range(self.num_res_blocks):
                    h = forward_with_checkpointing(self.down[i_level].block[i_block], h, use_checkpointing=use_checkpointing)
                if hasattr(self.down[i_level], "downsample"):
                    h = forward_with_checkpointing(self.down[i_level].downsample, h, use_checkpointing=use_checkpointing)

            # middle
            h = forward_with_checkpointing(self.mid.block_1, h, use_checkpointing=use_checkpointing)
            h = forward_with_checkpointing(self.mid.attn_1, h, use_checkpointing=use_checkpointing)
            h = forward_with_checkpointing(self.mid.block_2, h, use_checkpointing=use_checkpointing)

            # end
            group_size = self.block_out_channels[-1] // (2 * self.z_channels)
            shortcut = rearrange(h, "b (c r) f h w -> b c r f h w", r=group_size).mean(dim=2)
            h = self.norm_out(h)
            h = swish(h)
            h = self.conv_out(h)
            h += shortcut
        return h
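

# Illustrative sketch (hypothetical helper, not part of the original code):
# which encoder levels get spatial / temporal downsampling, following the
# rule in Encoder.__init__ above. The values in the usage note are examples
# only and are not taken from the model's configuration.
def _example_encoder_downsample_schedule(num_levels, ffactor_spatial, ffactor_temporal):
    import math  # local import keeps the sketch self-contained

    schedule = []
    for i_level in range(num_levels):
        spatial = i_level < math.log2(ffactor_spatial)
        temporal = spatial and i_level >= math.log2(ffactor_spatial // ffactor_temporal)
        schedule.append((spatial, temporal))
    return schedule

# Usage note: with 5 levels, ffactor_spatial=16 and ffactor_temporal=4, every
# one of the first four levels halves H and W, and only the last two of those
# also halve T.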


class Decoder(nn.Module):
    def __init__(
        self,
        z_channels: int,
        out_channels: int,
        block_out_channels: Tuple[int, ...],
        num_res_blocks: int,
        ffactor_spatial: int,
        ffactor_temporal: int,
        upsample_match_channel: bool = True,
    ):
        super().__init__()
        assert block_out_channels[0] % z_channels == 0

        self.z_channels = z_channels
        self.block_out_channels = block_out_channels
        self.num_res_blocks = num_res_blocks

        # z to block_in
        block_in = block_out_channels[0]
        self.conv_in = Conv3d(z_channels, block_in, kernel_size=3, stride=1, padding=1)

        # middle
        self.mid = nn.Module()
        self.mid.block_1 = ResnetBlock(in_channels=block_in, out_channels=block_in)
        self.mid.attn_1 = AttnBlock(block_in)
        self.mid.block_2 = ResnetBlock(in_channels=block_in, out_channels=block_in)

        # upsampling
        self.up = nn.ModuleList()
        for i_level, ch in enumerate(block_out_channels):
            block = nn.ModuleList()
            block_out = ch
            for _ in range(self.num_res_blocks + 1):
                block.append(ResnetBlock(in_channels=block_in, out_channels=block_out))
                block_in = block_out
            up = nn.Module()
            up.block = block

            add_spatial_upsample = bool(i_level < np.log2(ffactor_spatial))
            add_temporal_upsample = bool(i_level < np.log2(ffactor_temporal))
            if add_spatial_upsample or add_temporal_upsample:
                assert i_level < len(block_out_channels) - 1
                block_out = block_out_channels[i_level + 1] if upsample_match_channel else block_in
                up.upsample = UpsampleDCAE(block_in, block_out, add_temporal_upsample)
                block_in = block_out
            self.up.append(up)

        # end
        self.norm_out = nn.GroupNorm(num_groups=32, num_channels=block_in, eps=1e-6, affine=True)
        self.conv_out = Conv3d(block_in, out_channels, kernel_size=3, stride=1, padding=1)

        self.gradient_checkpointing = False


    def forward(self, z: Tensor) -> Tensor:
        with torch.no_grad():
            use_checkpointing = bool(self.training and self.gradient_checkpointing)
            # z to block_in
            repeats = self.block_out_channels[0] // (self.z_channels)
            h = self.conv_in(z) + z.repeat_interleave(repeats=repeats, dim=1)
            # middle
            h = forward_with_checkpointing(self.mid.block_1, h, use_checkpointing=use_checkpointing)
            h = forward_with_checkpointing(self.mid.attn_1, h, use_checkpointing=use_checkpointing)
            h = forward_with_checkpointing(self.mid.block_2, h, use_checkpointing=use_checkpointing)
            # upsampling
            for i_level in range(len(self.block_out_channels)):
                for i_block in range(self.num_res_blocks + 1):
                    h = forward_with_checkpointing(self.up[i_level].block[i_block], h, use_checkpointing=use_checkpointing)
                if hasattr(self.up[i_level], "upsample"):
                    h = forward_with_checkpointing(self.up[i_level].upsample, h, use_checkpointing=use_checkpointing)
            # end
            h = self.norm_out(h)
            h = swish(h)
            h = self.conv_out(h)
        return h
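

# Illustrative sketch (hypothetical helper, not part of the original code):
# which decoder levels get spatial / temporal upsampling, following the rule
# in Decoder.__init__ above. Temporal upsampling happens in the first levels
# of the decoder.
def _example_decoder_upsample_schedule(num_levels, ffactor_spatial, ffactor_temporal):
    import math  # local import keeps the sketch self-contained

    return [
        (i_level < math.log2(ffactor_spatial), i_level < math.log2(ffactor_temporal))
        for i_level in range(num_levels)
    ]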


class AutoencoderKLConv3D(ModelMixin, ConfigMixin):
    _supports_gradient_checkpointing = True

    @register_to_config
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        latent_channels: int,
        block_out_channels: Tuple[int, ...],
        layers_per_block: int,
        ffactor_spatial: int,
        ffactor_temporal: int,
        sample_size: int,
        sample_tsize: int,
        scaling_factor: Optional[float] = None,
        shift_factor: Optional[float] = None,
        downsample_match_channel: bool = True,
        upsample_match_channel: bool = True,
        only_encoder: bool = False,
        only_decoder: bool = False,
    ):
        super().__init__()
        self.ffactor_spatial = ffactor_spatial
        self.ffactor_temporal = ffactor_temporal
        self.scaling_factor = scaling_factor
        self.shift_factor = shift_factor

        if not only_decoder:
            self.encoder = Encoder(
                in_channels=in_channels,
                z_channels=latent_channels,
                block_out_channels=block_out_channels,
                num_res_blocks=layers_per_block,
                ffactor_spatial=ffactor_spatial,
                ffactor_temporal=ffactor_temporal,
                downsample_match_channel=downsample_match_channel,
            )
        if not only_encoder:
            self.decoder = Decoder(
                z_channels=latent_channels,
                out_channels=out_channels,
                block_out_channels=list(reversed(block_out_channels)),
                num_res_blocks=layers_per_block,
                ffactor_spatial=ffactor_spatial,
                ffactor_temporal=ffactor_temporal,
                upsample_match_channel=upsample_match_channel,
            )

        self.use_slicing = False
        self.slicing_bsz = 1
        self.use_spatial_tiling = False
        self.use_temporal_tiling = False
        self.use_tiling_during_training = False

        # only relevant if vae tiling is enabled
        self.tile_sample_min_size = sample_size
        self.tile_latent_min_size = sample_size // ffactor_spatial
        self.tile_sample_min_tsize = sample_tsize
        self.tile_latent_min_tsize = sample_tsize // ffactor_temporal
        self.tile_overlap_factor = 0.125

        self.use_compile = False

        self.empty_cache = torch.empty(0, device="cuda")

    def _set_gradient_checkpointing(self, module, value=False):
        if isinstance(module, (Encoder, Decoder)):
            module.gradient_checkpointing = value

    def enable_tiling_during_training(self, use_tiling: bool = True):
        self.use_tiling_during_training = use_tiling

    def disable_tiling_during_training(self):
        self.enable_tiling_during_training(False)

    def enable_temporal_tiling(self, use_tiling: bool = True):
        self.use_temporal_tiling = use_tiling

    def disable_temporal_tiling(self):
        self.enable_temporal_tiling(False)

    def enable_spatial_tiling(self, use_tiling: bool = True):
        self.use_spatial_tiling = use_tiling

    def disable_spatial_tiling(self):
        self.enable_spatial_tiling(False)

    def enable_tiling(self, use_tiling: bool = True):
        self.enable_spatial_tiling(use_tiling)

    def disable_tiling(self):
        self.disable_spatial_tiling()

    def enable_slicing(self):
        self.use_slicing = True

    def disable_slicing(self):
        self.use_slicing = False

    def blend_h(self, a: torch.Tensor, b: torch.Tensor, blend_extent: int):
        blend_extent = min(a.shape[-1], b.shape[-1], blend_extent)
        for x in range(blend_extent):
            b[:, :, :, :, x] = a[:, :, :, :, -blend_extent + x] * (1 - x / blend_extent) + b[:, :, :, :, x] * (x / blend_extent)
        return b

    def blend_v(self, a: torch.Tensor, b: torch.Tensor, blend_extent: int):
        blend_extent = min(a.shape[-2], b.shape[-2], blend_extent)
        for y in range(blend_extent):
            b[:, :, :, y, :] = a[:, :, :, -blend_extent + y, :] * (1 - y / blend_extent) + b[:, :, :, y, :] * (y / blend_extent)
        return b

    def blend_t(self, a: torch.Tensor, b: torch.Tensor, blend_extent: int):
        blend_extent = min(a.shape[-3], b.shape[-3], blend_extent)
        for x in range(blend_extent):
            b[:, :, x, :, :] = a[:, :, -blend_extent + x, :, :] * (1 - x / blend_extent) + b[:, :, x, :, :] * (x / blend_extent)
        return b

    def spatial_tiled_encode(self, x: torch.Tensor):
        B, C, T, H, W = x.shape
        overlap_size = int(self.tile_sample_min_size * (1 - self.tile_overlap_factor))  # e.g. 384 * (1 - 0.125) = 336
        blend_extent = int(self.tile_latent_min_size * self.tile_overlap_factor)  # e.g. 24 * 0.125 = 3
        row_limit = self.tile_latent_min_size - blend_extent  # e.g. 24 - 3 = 21

        rows = []
        for i in range(0, H, overlap_size):
            row = []
            for j in range(0, W, overlap_size):
                tile = x[:, :, :, i: i + self.tile_sample_min_size, j: j + self.tile_sample_min_size]
                tile = self.encoder(tile)
                row.append(tile)
            rows.append(row)
        result_rows = []
        for i, row in enumerate(rows):
            result_row = []
            for j, tile in enumerate(row):
                if i > 0:
                    tile = self.blend_v(rows[i - 1][j], tile, blend_extent)
                if j > 0:
                    tile = self.blend_h(row[j - 1], tile, blend_extent)
                result_row.append(tile[:, :, :, :row_limit, :row_limit])
            result_rows.append(torch.cat(result_row, dim=-1))
        moments = torch.cat(result_rows, dim=-2)
        return moments

    def temporal_tiled_encode(self, x: torch.Tensor):
        B, C, T, H, W = x.shape
        overlap_size = int(self.tile_sample_min_tsize * (1 - self.tile_overlap_factor))  # e.g. 64 * (1 - 0.125) = 56
        blend_extent = int(self.tile_latent_min_tsize * self.tile_overlap_factor)  # e.g. 8 * 0.125 = 1
        t_limit = self.tile_latent_min_tsize - blend_extent  # e.g. 8 - 1 = 7

        row = []
        for i in range(0, T, overlap_size):
            tile = x[:, :, i: i + self.tile_sample_min_tsize, :, :]
            if self.use_spatial_tiling and (tile.shape[-1] > self.tile_sample_min_size or tile.shape[-2] > self.tile_sample_min_size):
                tile = self.spatial_tiled_encode(tile)
            else:
                tile = self.encoder(tile)
            row.append(tile)
        result_row = []
        for i, tile in enumerate(row):
            if i > 0:
                tile = self.blend_t(row[i - 1], tile, blend_extent)
            result_row.append(tile[:, :, :t_limit, :, :])
        moments = torch.cat(result_row, dim=-3)
        return moments

    def _decode_tiles_for_rank(self, z: torch.Tensor, my_linear_indices: list, num_cols: int, overlap_size: int):
        """Decode the tiles assigned to the current rank and return them together with their metadata."""
        H_out_std = self.tile_sample_min_size
        W_out_std = self.tile_sample_min_size
        decoded_tiles = []
        decoded_metas = []

        for lin_idx in my_linear_indices:
            ri = lin_idx // num_cols
            rj = lin_idx % num_cols
            i = ri * overlap_size
            j = rj * overlap_size
            tile = z[:, :, :, i : i + self.tile_latent_min_size, j : j + self.tile_latent_min_size]
            dec = self.decoder(tile)
            # Pad the output of boundary tiles on the right/bottom up to the standard tile size
            pad_h = max(0, H_out_std - dec.shape[-2])
            pad_w = max(0, W_out_std - dec.shape[-1])
            if pad_h > 0 or pad_w > 0:
                dec = F.pad(dec, (0, pad_w, 0, pad_h, 0, 0), "constant", 0)
            decoded_tiles.append(dec)
            decoded_metas.append(torch.tensor([ri, rj, pad_w, pad_h], device=z.device, dtype=torch.int64))

        return decoded_tiles, decoded_metas

    def _pad_tiles_to_same_count(self, decoded_tiles: list, decoded_metas: list, tiles_per_rank: int,
                                  T_out: int, device, dtype):
        """Pad the tile list to a common length so that it can be passed to all_gather."""
        while len(decoded_tiles) < tiles_per_rank:
            decoded_tiles.append(torch.zeros(
                [1, 3, T_out, self.tile_sample_min_size, self.tile_sample_min_size],
                device=device, dtype=dtype
            ))
            decoded_metas.append(torch.tensor(
                [-1, -1, self.tile_sample_min_size, self.tile_sample_min_size],
                device=device, dtype=torch.int64
            ))
        return decoded_tiles, decoded_metas

    def _reconstruct_tile_grid(self, tiles_gather_list: list, metas_gather_list: list,
                                num_rows: int, num_cols: int, world_size: int):
        """Rebuild the tile grid from the all_gather results."""
        rows = [[None for _ in range(num_cols)] for _ in range(num_rows)]
        for r in range(world_size):
            gathered_tiles_r = tiles_gather_list[r]
            gathered_metas_r = metas_gather_list[r]
            for k in range(gathered_tiles_r.shape[0]):
                ri = int(gathered_metas_r[k][0])
                rj = int(gathered_metas_r[k][1])
                if ri < 0 or rj < 0:
                    continue
                if ri < num_rows and rj < num_cols:
                    pad_w = int(gathered_metas_r[k][2])
                    pad_h = int(gathered_metas_r[k][3])
                    h_end = None if pad_h == 0 else -pad_h
                    w_end = None if pad_w == 0 else -pad_w
                    rows[ri][rj] = gathered_tiles_r[k][:, :, :, :h_end, :w_end]
        return rows

    def _blend_and_concat_rows(self, rows: list, blend_extent: int, row_limit: int, skip_none: bool = False):
        """Blend neighbouring tiles in the grid and concatenate them into the final result."""
        result_rows = []
        for i, row in enumerate(rows):
            result_row = []
            for j, tile in enumerate(row):
                if skip_none and tile is None:
                    continue
                if i > 0:
                    tile = self.blend_v(rows[i - 1][j], tile, blend_extent)
                if j > 0:
                    tile = self.blend_h(row[j - 1], tile, blend_extent)
                result_row.append(tile[:, :, :, :row_limit, :row_limit])
            result_rows.append(torch.cat(result_row, dim=-1))
        return torch.cat(result_rows, dim=-2)

    def _spatial_tiled_decode_distributed(self, z: torch.Tensor, H: int, W: int, T: int,
                                           overlap_size: int, blend_extent: int, row_limit: int):
        """Distributed multi-GPU decoding."""
        rank = dist.get_rank()
        world_size = dist.get_world_size()

        num_rows = math.ceil(H / overlap_size)
        num_cols = math.ceil(W / overlap_size)
        total_tiles = num_rows * num_cols
        tiles_per_rank = math.ceil(total_tiles / world_size)

        print(f"==={torch.distributed.get_rank()},  {total_tiles=}, {tiles_per_rank=}, {world_size=}")

        my_linear_indices = list(range(rank, total_tiles, world_size))
        if not my_linear_indices:
            my_linear_indices = [0]
        print(f"==={torch.distributed.get_rank()},  {my_linear_indices=}")

        decoded_tiles, decoded_metas = self._decode_tiles_for_rank(z, my_linear_indices, num_cols, overlap_size)

        T_out = decoded_tiles[0].shape[2] if decoded_tiles else (T - 1) * self.ffactor_temporal + 1
        dtype = decoded_tiles[0].dtype if decoded_tiles else z.dtype
        decoded_tiles, decoded_metas = self._pad_tiles_to_same_count(
            decoded_tiles, decoded_metas, tiles_per_rank, T_out, z.device, dtype
        )

        decoded_tiles = torch.stack(decoded_tiles, dim=0)
        decoded_metas = torch.stack(decoded_metas, dim=0)

        tiles_gather_list = [torch.empty_like(decoded_tiles) for _ in range(world_size)]
        metas_gather_list = [torch.empty_like(decoded_metas) for _ in range(world_size)]

        dist.all_gather(tiles_gather_list, decoded_tiles)
        dist.all_gather(metas_gather_list, decoded_metas)

        if rank != 0:
            return torch.empty(0, device=z.device)

        rows = self._reconstruct_tile_grid(tiles_gather_list, metas_gather_list, num_rows, num_cols, world_size)
        return self._blend_and_concat_rows(rows, blend_extent, row_limit, skip_none=True)

    def _spatial_tiled_decode_single(self, z: torch.Tensor, H: int, W: int,
                                      overlap_size: int, blend_extent: int, row_limit: int):
        """Single-GPU serial decoding."""
        rows = []
        for i in range(0, H, overlap_size):
            row = []
            for j in range(0, W, overlap_size):
                tile = z[:, :, :, i : i + self.tile_latent_min_size, j : j + self.tile_latent_min_size]
                decoded = self.decoder(tile)
                row.append(decoded)
            rows.append(row)
        return self._blend_and_concat_rows(rows, blend_extent, row_limit, skip_none=False)

    def spatial_tiled_decode(self, z: torch.Tensor):
        B, C, T, H, W = z.shape
        overlap_size = int(self.tile_latent_min_size * (1 - self.tile_overlap_factor))  # 24 * (1 - 0.125) = 21
        blend_extent = int(self.tile_sample_min_size * self.tile_overlap_factor)  # 384 * 0.125 = 48
        row_limit = self.tile_sample_min_size - blend_extent  # 384 - 48 = 336

        # Distributed / multi-GPU path
        if dist.is_available() and dist.is_initialized() and dist.get_world_size() > 1:
            return self._spatial_tiled_decode_distributed(z, H, W, T, overlap_size, blend_extent, row_limit)

        # Single GPU: original serial path
        return self._spatial_tiled_decode_single(z, H, W, overlap_size, blend_extent, row_limit)

    def temporal_tiled_decode(self, z: torch.Tensor):
        B, C, T, H, W = z.shape
        overlap_size = int(self.tile_latent_min_tsize * (1 - self.tile_overlap_factor))  # e.g. 8 * (1 - 0.125) = 7
        blend_extent = int(self.tile_sample_min_tsize * self.tile_overlap_factor)  # e.g. 64 * 0.125 = 8
        t_limit = self.tile_sample_min_tsize - blend_extent  # e.g. 64 - 8 = 56
        assert 0 < overlap_size < self.tile_latent_min_tsize

        row = []
        for i in range(0, T, overlap_size):
            tile = z[:, :, i: i + self.tile_latent_min_tsize, :, :]
            if self.use_spatial_tiling and (tile.shape[-1] > self.tile_latent_min_size or tile.shape[-2] > self.tile_latent_min_size):
                decoded = self.spatial_tiled_decode(tile)
            else:
                decoded = self.decoder(tile)
            row.append(decoded)

        result_row = []
        for i, tile in enumerate(row):
            if i > 0:
                tile = self.blend_t(row[i - 1], tile, blend_extent)
            result_row.append(tile[:, :, :t_limit, :, :])
        dec = torch.cat(result_row, dim=-3)
        return dec

    def encode(self, x: Tensor, return_dict: bool = True):

        def _encode(x):
            if self.use_temporal_tiling and x.shape[-3] > self.tile_sample_min_tsize:
                return self.temporal_tiled_encode(x)
            if self.use_spatial_tiling and (x.shape[-1] > self.tile_sample_min_size or x.shape[-2] > self.tile_sample_min_size):
                return self.spatial_tiled_encode(x)

            if self.use_compile:
                @torch.compile
                def encoder(x):
                    return self.encoder(x)
                return encoder(x)
            return self.encoder(x)

        if len(x.shape) != 5:  # (B, C, T, H, W)
            x = x[:, :, None]
        assert len(x.shape) == 5  # (B, C, T, H, W)
        if x.shape[2] == 1:
            x = x.expand(-1, -1, self.ffactor_temporal, -1, -1)
        else:
            assert x.shape[2] != self.ffactor_temporal and x.shape[2] % self.ffactor_temporal == 0

        if self.use_slicing and x.shape[0] > 1:
            if self.slicing_bsz == 1:
                encoded_slices = [_encode(x_slice) for x_slice in x.split(1)]
            else:
                sections = [self.slicing_bsz] * (x.shape[0] // self.slicing_bsz)
                if x.shape[0] % self.slicing_bsz != 0:
                    sections.append(x.shape[0] % self.slicing_bsz)
                encoded_slices = [_encode(x_slice) for x_slice in x.split(sections)]
            h = torch.cat(encoded_slices)
        else:
            h = _encode(x)
        posterior = DiagonalGaussianDistribution(h)

        if not return_dict:
            return (posterior,)

        return AutoencoderKLOutput(latent_dist=posterior)

    def decode(self, z: Tensor, return_dict: bool = True, generator=None):

        def _decode(z):
            if self.use_temporal_tiling and z.shape[-3] > self.tile_latent_min_tsize:
                return self.temporal_tiled_decode(z)
            if self.use_spatial_tiling and (z.shape[-1] > self.tile_latent_min_size or z.shape[-2] > self.tile_latent_min_size):
                return self.spatial_tiled_decode(z)
            return self.decoder(z)

        if self.use_slicing and z.shape[0] > 1:
            decoded_slices = [_decode(z_slice) for z_slice in z.split(1)]
            decoded = torch.cat(decoded_slices)
        else:
            decoded = _decode(z)
        if torch.distributed.is_initialized():
            if torch.distributed.get_rank() != 0:
                return self.empty_cache

        if z.shape[-3] == 1:
            decoded = decoded[:, :, -1:]
        if not return_dict:
            return (decoded,)

        return DecoderOutput(sample=decoded)

    def decode_dist(self, z: Tensor, return_dict: bool = True, generator=None):
        z = z.cuda()
        self.use_spatial_tiling = True
        decoded = self.decode(z)
        self.use_spatial_tiling = False
        return decoded

    def forward(
        self,
        sample: torch.Tensor,
        sample_posterior: bool = False,
        return_posterior: bool = True,
        return_dict: bool = True
    ):
        posterior = self.encode(sample).latent_dist
        z = posterior.sample() if sample_posterior else posterior.mode()
        dec = self.decode(z).sample
        return DecoderOutput(sample=dec, posterior=posterior) if return_dict else (dec, posterior)

    def random_reset_tiling(self, x: torch.Tensor):
        if x.shape[-3] == 1:
            self.disable_spatial_tiling()
            self.disable_temporal_tiling()
            return

        # Tiling places many constraints on input_shape and sample_size; arbitrary values are likely
        # to violate them, so fixed values are used here.
        min_sample_size = int(1 / self.tile_overlap_factor) * self.ffactor_spatial
        min_sample_tsize = int(1 / self.tile_overlap_factor) * self.ffactor_temporal
        sample_size = random.choice([None, 1 * min_sample_size, 2 * min_sample_size, 3 * min_sample_size])
        if sample_size is None:
            self.disable_spatial_tiling()
        else:
            self.tile_sample_min_size = sample_size
            self.tile_latent_min_size = sample_size // self.ffactor_spatial
            self.enable_spatial_tiling()

        sample_tsize = random.choice([None, 1 * min_sample_tsize, 2 * min_sample_tsize, 3 * min_sample_tsize])
        if sample_tsize is None:
            self.disable_temporal_tiling()
        else:
            self.tile_sample_min_tsize = sample_tsize
            self.tile_latent_min_tsize = sample_tsize // self.ffactor_temporal
            self.enable_temporal_tiling()
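

# Illustrative sketches (hypothetical helpers, not part of the original code)
# of two pieces of arithmetic used by the tiled decode paths above. The
# default tile size and overlap factor are example values only.
def _example_tile_plan(latent_h, latent_w, tile_latent_min_size=24,
                       tile_overlap_factor=0.125, world_size=1):
    """Return tile origins in latent space and the round-robin tile indices per rank."""
    import math  # local import keeps the sketch self-contained

    overlap_size = int(tile_latent_min_size * (1 - tile_overlap_factor))
    num_rows = math.ceil(latent_h / overlap_size)
    num_cols = math.ceil(latent_w / overlap_size)
    total_tiles = num_rows * num_cols
    origins = [((idx // num_cols) * overlap_size, (idx % num_cols) * overlap_size)
               for idx in range(total_tiles)]
    # The distributed path additionally falls back to tile 0 for a rank that
    # would otherwise receive no tiles; that fallback is omitted here.
    per_rank = [list(range(rank, total_tiles, world_size)) for rank in range(world_size)]
    return origins, per_rank


def _example_blend_1d(a, b, blend_extent):
    """Linear cross-fade from the tail of `a` into the head of `b` (1-D lists).

    Unlike blend_h / blend_v / blend_t, this returns a copy instead of
    modifying `b` in place.
    """
    blend_extent = min(len(a), len(b), blend_extent)
    out = list(b)
    for x in range(blend_extent):
        out[x] = a[-blend_extent + x] * (1 - x / blend_extent) + b[x] * (x / blend_extent)
    return out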

def load_sharded_safetensors(model_dir):
    """
    Manually load sharded safetensors files.

    Args:
        model_dir: Path of the directory that contains the shard files.

    Returns:
        The merged state dict holding the weights of all shards.
    """
    # Collect all shard files
    shard_files = []
    for file in os.listdir(model_dir):
        if file.endswith(".safetensors"):
            shard_files.append(file)

    # Sort by shard index, i.e. the integer between the first and second "-" of the file name
    shard_files.sort(key=lambda x: int(x.split("-")[1]))

    print(f"Found {len(shard_files)} shard files")

    # Merge all weights
    merged_state_dict = dict()

    for shard_file in shard_files:
        shard_path = os.path.join(model_dir, shard_file)
        print(f"Loading shard: {shard_file}")

        # Load the current shard with safetensors
        with safe_open(shard_path, framework="pt", device="cpu") as f:
            for key in f.keys():
                tensor = f.get_tensor(key)
                merged_state_dict[key] = tensor

    print(f"Merge finished, total number of keys: {len(merged_state_dict)}")
    return merged_state_dict

def load_weights(model, weights: dict[str, torch.Tensor]) -> set[str]:
    def update_state_dict(state_dict: dict[str, torch.Tensor], name, weight):
        if name not in state_dict:
            raise ValueError(f"Unexpected weight {name}")

        model_tensor = state_dict[name]
        if model_tensor.shape != weight.shape:
            raise ValueError(
                f"Shape mismatch for weight {name}: "
                f"model tensor shape {model_tensor.shape} vs. "
                f"loaded tensor shape {weight.shape}"
            )
        if isinstance(weight, torch.Tensor):
            model_tensor.data.copy_(weight.data)
        else:
            raise ValueError(
                f"Unsupported tensor type in load_weights "
                f"for {name}: {type(weight)}"
            )

    loaded_params = set()
    # Build the state dict once; its tensors share storage with the model parameters
    model_state_dict = model.state_dict()
    for name, load_tensor in weights.items():
        # Drop the 'vae.' substring from checkpoint key names
        name = name.replace('vae.', '')
        if name in model_state_dict:
            update_state_dict(model_state_dict, name, load_tensor)
            loaded_params.add(name)

    return loaded_params

def _worker(path, config, 
    rank=None, world_size=None, port=None, req_queue=None, rsp_queue=None):
    """
    each rank's worker:
      - idle: block on req_queue.get() (CPU blocking, no GPU)
      - receive request: run runner.predict(), all ranks forward
      - only rank0 put result to rsp_queue
    """
    # _tame_cpu_threads_and_comm()
    # basic env
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(port)
    os.environ["WORLD_SIZE"] = str(world_size)
    os.environ["RANK"] = str(rank)
    os.environ["LOCAL_RANK"] = str(rank)

    # device binding must happen before any CUDA operation
    visible = torch.cuda.device_count()
    assert visible >= world_size, f"visible GPU count {visible} < world_size {world_size}"
    local_rank = int(os.environ["LOCAL_RANK"])
    
    print(f"[worker {rank}] bind to cuda:{local_rank} (visible={visible})", flush=True)
    if not torch.distributed.is_initialized():
        dist.init_process_group("nccl")
    torch.cuda.set_device(local_rank)
    #from .. import load_vae

    #vae = load_vae(vae_type, vae_precision, device, logger, args, weights_only, only_encoder, only_decoder, sample_size, skip_create_dist=True)
    #vae = vae.cuda()
    vae = AutoencoderKLConv3D.from_config(config)
    merged_state_dict = load_sharded_safetensors(path)
    loaded_params = load_weights(vae, merged_state_dict) 
    vae = vae.cuda()
    vae.eval()  # disable training-time behavior of Dropout and BatchNorm
    for param in vae.parameters():
        param.requires_grad = False  # inference only, no gradients needed
    
    while True:
        req = req_queue.get()  # blocking
        if req == "__STOP__":
            break
        out = vae.decode_dist(req, return_dict=False)
        if rank == 0:
            rsp_queue.put(out)

    #try:
    #    while True:
    #        # blocking on CPU queue
    #        req = req_queue.get()  # blocking
    #        if req == "__STOP__":
    #            break
    #        out = vae.decode_dist(req, return_dict=False)
    #        if rank == 0:
    #            rsp_queue.put(out)
    #finally:
    #    # destroy process group before exit
    #    try:
    #        dist.destroy_process_group()
    #    except Exception:
    #        pass

#def _find_free_port():
#    import socket
#    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
#        s.bind(("127.0.0.1", 0))
#        return s.getsockname()[1]

# A common way to avoid port conflicts
def _find_free_port(start_port=8100, max_attempts=900):
    """Return an available port."""
    import socket
    for port in range(start_port, start_port + max_attempts):
        try:
            with socket.socket() as s:
                s.bind(('localhost', port))
                return s.getsockname()[1]  # return the port that was actually bound
        except OSError:
            continue
    raise RuntimeError("No available port found")

class AutoencoderKLConv3D_Dist(AutoencoderKLConv3D):
    def __init__(
        self,
        in_channels: int,
        out_channels: int,
        latent_channels: int,
        block_out_channels: Tuple[int, ...],
        layers_per_block: int,
        ffactor_spatial: int,
        ffactor_temporal: int,
        sample_size: int,
        sample_tsize: int,
        scaling_factor: Optional[float] = None,
        shift_factor: Optional[float] = None,
        downsample_match_channel: bool = True,
        upsample_match_channel: bool = True,
        only_encoder: bool = False,
        only_decoder: bool = False,
    ):
        super().__init__(in_channels, out_channels, latent_channels, block_out_channels, layers_per_block, ffactor_spatial, ffactor_temporal, sample_size, sample_tsize, scaling_factor, shift_factor, downsample_match_channel, upsample_match_channel, only_encoder, only_decoder)

    def create_dist(self, path, config, 
    ):
        self.world_size = 8
        self.port = _find_free_port()
        ctx = mp.get_context("spawn")
        # one request queue per rank (CPU only), plus one shared response queue
        self.req_queues = [ctx.Queue() for _ in range(self.world_size)]
        self.rsp_queue = ctx.Queue()

        self.procs = []
        for rank in range(self.world_size):
            p = ctx.Process(
                target=_worker,
                args=(
                    path, config, 
                    rank, self.world_size, self.port,
                    self.req_queues[rank], self.rsp_queue,
                ),
                daemon=True,
            )
            p.start()
            self.procs.append(p)
    
    def decode(self, z: Tensor, return_dict: bool = True, generator=None):
        """
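        Synchronous inference: put the same request on every rank's queue
        and return rank 0's result.

        `return_dict` and `generator` are accepted for interface compatibility but
        are not used here; the workers call `decode_dist(..., return_dict=False)`.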
        """
        # check alive
        for p in self.procs:
            if not p.is_alive():
                raise RuntimeError("One of the processes is not alive")

        # put to each rank's queue
        for q in self.req_queues:
            q.put(z)

        # wait for rank0's result
        return self.rsp_queue.get(timeout=None)

```
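
The shard loader above orders files with `int(x.split("-")[1])`, which raises an error for any `.safetensors` file whose name has no `-<index>-` part (for example a single unsharded `model.safetensors`). The following is a minimal sketch of a more tolerant ordering, not part of the repository. The helper names `shard_index` and `order_shards` are hypothetical, and the `<prefix>-<index>-of-<total>.safetensors` naming scheme is an assumption inferred from the sort key.

```python
import re

# Assumed naming scheme: "<prefix>-<index>-of-<total>.safetensors"
_SHARD_PATTERN = re.compile(r"-(\d+)-of-\d+\.safetensors$")


def shard_index(filename: str) -> int:
    """Return the numeric shard index, or 0 for names without one (hypothetical helper)."""
    match = _SHARD_PATTERN.search(filename)
    return int(match.group(1)) if match else 0


def order_shards(filenames):
    """Keep only .safetensors files and sort them by shard index, then by name."""
    shards = [name for name in filenames if name.endswith(".safetensors")]
    return sorted(shards, key=lambda name: (shard_index(name), name))


if __name__ == "__main__":
    print(order_shards([
        "model-00002-of-00003.safetensors",
        "config.json",
        "model-00001-of-00003.safetensors",
    ]))
```

Sorting on the parsed integer rather than the raw string keeps shard 2 ahead of shard 10 even when the indices are not zero-padded.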

## /hunyuan_image_3/cache_utils.py

```py path="/hunyuan_image_3/cache_utils.py" 
import torch
import torch.nn as nn
import math
from typing import Tuple

def cache_init(cache_interval, max_order, num_steps=None,
               enable_first_enhance=False, first_enhance_steps=3, 
               enable_tailing_enhance=False, tailing_enhance_steps=1, 
               low_freqs_order=0, high_freqs_order=2):
    cache_dic = {}
    cache_dic['counter'] = 0
    cache_dic['current_step'] = 0
    cache_dic['cache_interval'] = cache_interval
    cache_dic['max_order'] = max_order
    cache_dic['num_steps'] = num_steps

    # enhancement-related settings

    # first enhance: fully compute the first few steps to sharpen contour information
    cache_dic['enable_first_enhance'] = enable_first_enhance
    cache_dic['first_enhance_steps'] = first_enhance_steps

    # tailing enhance: fully compute the last few steps to sharpen details
    cache_dic['enable_tailing_enhance'] = enable_tailing_enhance
    cache_dic['tailing_enhance_steps'] = tailing_enhance_steps

    # frequency-related settings
    cache_dic['low_freqs_order'] = low_freqs_order
    cache_dic['high_freqs_order'] = high_freqs_order

    # flags for training-aware caching; not used here
    cache_dic['enable_force_control'] = False
    cache_dic['force_compute'] = False
    return cache_dic

class TaylorCacheContainer(nn.Module):
    def __init__(self, max_order):
        super().__init__()
        self.max_order = max_order
        # register the buffers one by one
        for i in range(max_order + 1):
            self.register_buffer(f"derivative_{i}", None, persistent=False)
            self.register_buffer(f"temp_derivative_{i}", None, persistent=False)
    
    def get_derivative(self, order):
        return getattr(self, f"derivative_{order}")
    
    def set_derivative(self, order, tensor):
        setattr(self, f"derivative_{order}", tensor)

    def set_temp_derivative(self, order, tensor):
        setattr(self, f"temp_derivative_{order}", tensor)

    def get_temp_derivative(self, order):
        return getattr(self, f"temp_derivative_{order}")
    
    def clear_temp_derivative(self):
        for i in range(self.max_order + 1):
            setattr(self, f"temp_derivative_{i}", None)

    def move_temp_to_derivative(self):
        for i in range(self.max_order + 1):
            if self.get_temp_derivative(i) is not None:
                setattr(self, f"derivative_{i}", self.get_temp_derivative(i))
            else:
                break
        self.clear_temp_derivative()

    def get_all_derivatives(self):
        return [getattr(self, f"derivative_{i}") for i in range(self.max_order + 1)]

    def get_all_filled_derivatives(self):
        return [self.get_derivative(i) for i in range(self.max_order + 1) if self.get_derivative(i) is not None]

    def taylor_formula(self, distance):
        output = 0
        for i in range(len(self.get_all_filled_derivatives())):
            output += (1 / math.factorial(i)) * self.get_derivative(i) * (distance ** i)
        return output
    
    def derivatives_computation(self, x, distance):
        '''
        x: tensor, the new x_0
        distance: int, the distance between the current step and the last full computation step
        '''
        self.set_temp_derivative(0, x)
        for i in range(self.max_order):
            if self.get_derivative(i) is not None:
                self.set_temp_derivative(i+1, (self.get_temp_derivative(i) - self.get_derivative(i)) / distance)
            else:
                break
        self.move_temp_to_derivative()

    def clear_derivatives(self):
        for i in range(self.max_order + 1):
            setattr(self, f"derivative_{i}", None)
            setattr(self, f"temp_derivative_{i}", None)


@torch.compile
def decomposition_FFT(x: torch.Tensor, cutoff_ratio: float = 0.1) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Frequency-domain decomposition via the Fast Fourier Transform.

    Args:
        x: Input tensor [B, H*W, D]
        cutoff_ratio: Cutoff as a fraction of the maximum absolute frequency (0~1)
        
    Returns:
        Tuple of (low_freq, high_freq) tensors with same dtype as input
    """
    orig_dtype = x.dtype
    device = x.device

    x_fp32 = x.to(torch.float32)  # Convert to fp32 for FFT compatibility

    B, HW, D = x_fp32.shape
    freq = torch.fft.fft(x_fp32, dim=1)  # FFT on spatial dimension

    freqs = torch.fft.fftfreq(HW, d=1.0, device=device)
    cutoff = cutoff_ratio * freqs.abs().max()

    # Create frequency masks
    low_mask = freqs.abs() <= cutoff
    high_mask = ~low_mask

    low_mask = low_mask[None, :, None]  # Broadcast to (B, HW, D)
    high_mask = high_mask[None, :, None]

    low_freq_complex  = freq * low_mask
    high_freq_complex = freq * high_mask

    # IFFT and take real part
    low_fp32  = torch.fft.ifft(low_freq_complex,  dim=1).real
    high_fp32 = torch.fft.ifft(high_freq_complex, dim=1).real

    low  = low_fp32.to(device=device, dtype=orig_dtype)
    high = high_fp32.to(device=device, dtype=orig_dtype)

    return low, high

@torch.compile
def reconstruction(low_freq: torch.Tensor, high_freq: torch.Tensor) -> torch.Tensor:
    return low_freq + high_freq

class CacheWithFreqsContainer(nn.Module):
    def __init__(self, max_order):
        super().__init__()
        self.max_order = max_order
        # register the buffers one by one
        for i in range(max_order + 1):
            self.register_buffer(f"derivative_{i}_low_freqs", None, persistent=False)
            self.register_buffer(f"derivative_{i}_high_freqs", None, persistent=False)
            self.register_buffer(f"temp_derivative_{i}_low_freqs", None, persistent=False)
            self.register_buffer(f"temp_derivative_{i}_high_freqs", None, persistent=False)
    
    def get_derivative(self, order, freqs):
        return getattr(self, f"derivative_{order}_{freqs}")
    
    def set_derivative(self, order, freqs, tensor):
        setattr(self, f"derivative_{order}_{freqs}", tensor)

    def set_temp_derivative(self, order, freqs, tensor):
        setattr(self, f"temp_derivative_{order}_{freqs}", tensor)

    def get_temp_derivative(self, order, freqs):
        return getattr(self, f"temp_derivative_{order}_{freqs}")
    
    def move_temp_to_derivative(self):
        for i in range(self.max_order + 1):
            temp_low = self.get_temp_derivative(i, "low_freqs")
            temp_high = self.get_temp_derivative(i, "high_freqs")
            # stop only once neither band has a temp derivative at this order
            if temp_low is None and temp_high is None:
                break
            if temp_low is not None:
                setattr(self, f"derivative_{i}_low_freqs", temp_low)
            if temp_high is not None:
                setattr(self, f"derivative_{i}_high_freqs", temp_high)
        self.clear_temp_derivative()

    def get_all_filled_derivatives(self, freqs):
        return [
            self.get_derivative(i, freqs)
            for i in range(self.max_order + 1)
            if self.get_derivative(i, freqs) is not None
        ]

    def taylor_formula(self, distance):
        low_freqs_output = 0
        high_freqs_output = 0
        for i in range(len(self.get_all_filled_derivatives("low_freqs"))):
            low_freqs_output += (1 / math.factorial(i)) * self.get_derivative(i, "low_freqs") * (distance ** i)
        for i in range(len(self.get_all_filled_derivatives("high_freqs"))):
            high_freqs_output += (1 / math.factorial(i)) * self.get_derivative(i, "high_freqs") * (distance ** i)
        return reconstruction(low_freqs_output, high_freqs_output)
    
    def hermite_formula(self, distance):
        # NOTE: currently identical to taylor_formula; no Hermite-specific terms are applied.
        return self.taylor_formula(distance)

    def derivatives_computation(self, x, distance, low_freqs_order, high_freqs_order):
        '''
        x: tensor, the new x_0
        distance: int, the distance between the current step and the last full computation step
        '''
        x_low, x_high = decomposition_FFT(x, cutoff_ratio=0.1)
        self.set_temp_derivative(0, "low_freqs", x_low)
        self.set_temp_derivative(0, "high_freqs", x_high)
        for i in range(low_freqs_order):
            if self.get_derivative(i, "low_freqs") is not None:
                diff = (self.get_temp_derivative(i, "low_freqs") -
                        self.get_derivative(i, "low_freqs")) / distance
                self.set_temp_derivative(i+1, "low_freqs", diff)
        for i in range(high_freqs_order):
            if self.get_derivative(i, "high_freqs") is not None:
                diff = (self.get_temp_derivative(i, "high_freqs") -
                        self.get_derivative(i, "high_freqs")) / distance
                self.set_temp_derivative(i+1, "high_freqs", diff)
        self.move_temp_to_derivative()
        
    def clear_temp_derivative(self):
        for i in range(self.max_order + 1):
            setattr(self, f"temp_derivative_{i}_low_freqs", None)
            setattr(self, f"temp_derivative_{i}_high_freqs", None)

    def clear_derivatives(self):
        for i in range(self.max_order + 1):
            setattr(self, f"derivative_{i}_low_freqs", None)
            setattr(self, f"derivative_{i}_high_freqs", None)
            setattr(self, f"temp_derivative_{i}_low_freqs", None)
            setattr(self, f"temp_derivative_{i}_high_freqs", None)
```
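
To make the caching scheme in `TaylorCacheContainer` easier to follow, here is a minimal scalar sketch of the same idea. Each full computation refreshes a chain of backward finite differences, and skipped steps are extrapolated with a truncated Taylor series. The class `ScalarTaylorCache` is hypothetical and uses plain floats instead of tensors. How `distance` is measured in the real pipeline is not shown in this file, so the sketch assumes it is the step gap since the last full computation.

```python
import math


class ScalarTaylorCache:
    """Hypothetical scalar analogue of TaylorCacheContainer (illustration only)."""

    def __init__(self, max_order: int):
        self.max_order = max_order
        # derivatives[i] holds the i-th order finite-difference estimate
        self.derivatives = []

    def update(self, x: float, distance: float) -> None:
        """Refresh the estimates after a full computation that produced x."""
        new = [x]
        for i in range(self.max_order):
            if i >= len(self.derivatives):
                break  # no older estimate of this order to difference against
            new.append((new[i] - self.derivatives[i]) / distance)
        self.derivatives = new

    def predict(self, distance: float) -> float:
        """Extrapolate `distance` steps past the last full computation."""
        return sum(
            d * distance ** i / math.factorial(i)
            for i, d in enumerate(self.derivatives)
        )


if __name__ == "__main__":
    # Samples of the linear signal f(t) = 2t + 1 at t = 0 and t = 2
    cache = ScalarTaylorCache(max_order=2)
    cache.update(1.0, distance=1)  # first call: only the 0th order is stored
    cache.update(5.0, distance=2)  # second call: adds a first-order estimate
    print(cache.predict(1))        # extrapolated value for t = 3
```

For a linear signal the first-order estimate is exact, so the extrapolation matches the true value. For curved signals the backward differences are only approximations, which is why the real code periodically falls back to a full computation.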

## /hunyuan_image_3/configuration_hunyuan_image_3.py

```py path="/hunyuan_image_3/configuration_hunyuan_image_3.py" 
# Licensed under the TENCENT HUNYUAN COMMUNITY LICENSE AGREEMENT (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://github.com/Tencent-Hunyuan/HunyuanImage-3.0/blob/main/LICENSE
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

from transformers.configuration_utils import PretrainedConfig
from transformers.utils import logging
from typing import List, Union, Optional


logger = logging.get_logger(__name__)


class HunyuanImage3Config(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`HunyuanImage3Model`]. It is used to instantiate
    a Hunyuan model according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the Hunyuan-7B.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.


    Args:
        vocab_size (`int`, *optional*, defaults to 290943):
            Vocabulary size of the Hunyuan Image 3 model. Defines the number of different tokens that can be
            represented by the `input_ids` passed when calling [`HunyuanImage3Model`]
        hidden_size (`int`, *optional*, defaults to 4096):
            Dimension of the hidden representations.
        intermediate_size (`int`, *optional*, defaults to 11008):
            Dimension of the MLP representations or shared MLP representations.
        moe_intermediate_size (`int` or `List`, *optional*, defaults to `None`):
            Dimension of the MLP representations in MoE. Use a list if you want a different size per layer.
        num_hidden_layers (`int`, *optional*, defaults to 32):
            Number of hidden layers in the Transformer decoder.
        num_attention_heads (`int`, *optional*, defaults to 32):
            Number of attention heads for each attention layer in the Transformer decoder.
        num_key_value_heads (`int`, *optional*):
            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
            `num_key_value_heads=1` the model will use Multi Query Attention (MQA), otherwise GQA is used. When
            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
            by meanpooling all the original heads within that group. For more details checkout [this
            paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
            `num_attention_heads`.
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string) in the decoder.
        max_position_embeddings (`int`, *optional*, defaults to 2048):
            The maximum sequence length that this model might ever be used with.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        rms_norm_eps (`float`, *optional*, defaults to 1e-05):
            The epsilon used by the rms normalization layers.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        pad_token_id (`int`, *optional*, defaults to 0):
            Padding token id.
        bos_token_id (`int`, *optional*, defaults to 1):
            Beginning of stream token id.
        eos_token_id (`int`, *optional*, defaults to 2):
            End of stream token id.
        pretraining_tp (`int`, *optional*, defaults to 1):
            Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
            document](https://huggingface.co/docs/transformers/parallelism) to understand more about it. This value is
            necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
            issue](https://github.com/pytorch/pytorch/issues/76232).
        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
            Whether to tie weight embeddings
        rope_theta (`float`, *optional*, defaults to 10000.0):
            The base period of the RoPE embeddings.
        rope_scaling (`Dict`, *optional*):
            Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
            strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
            `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
            `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
            these scaling strategies behave:
            https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
            experimental feature, subject to breaking API changes in future versions.
        attention_bias (`bool`, *optional*, defaults to `False`):
            Whether to use a bias in the query, key, value and output projection layers during self-attention.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        use_qk_norm (`bool`, *optional*, defaults to `False`):
            Whether query and key in attention use norm
        use_cla (`bool`, *optional*, defaults to `False`):
            Whether to use CLA in attention
        cla_share_factor (`int`, *optional*, defaults to 1):
            The share factor of CLA
        num_experts (`int` or `List`, *optional*, defaults to 1):
            The number of experts for moe. If it is a list, it will be used as the number of experts for each layer.
        num_shared_expert (`int` or `List`, *optional*, defaults to 1):
            The number of shared experts for moe. If it is a list, it will be used as the number of shared experts
            for each layer.
        moe_topk (`int` or `List`, *optional*, defaults to 1):
            The topk value for moe. If it is a list, it will be used as the topk value for each layer.
        capacity_factor (Not used) (`float` or `List`, *optional*, defaults to 1.0):
            The capacity factor for moe. If it is a list, it will be used as the capacity factor for each layer.
        moe_layer_num_skipped (`int`, *optional*, defaults to 0):
            First moe_layer_num_skipped layers do not use MoE.
        model_version (`str`, *optional*, defaults to "HunyuanImage-3.0-Instruct"):
            The version of the model.
    """

    model_type = "Hunyuan"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
            self,
            vocab_size: int = 290943,
            hidden_size: int = 4096,
            intermediate_size: int = 11008,
            moe_intermediate_size: Optional[Union[int, List]] = None,
            num_hidden_layers: int = 32,
            num_attention_heads: int = 32,
            num_key_value_heads: Optional[int] = None,
            attention_head_dim: Optional[int] = None,
            hidden_act="silu",
            max_position_embeddings=2048,
            initializer_range=0.02,
            rms_norm_eps=1e-5,
            use_cache=True,
            pad_token_id=0,
            bos_token_id=1,
            eos_token_id=2,
            eod_token_id=3,
            im_start_id=4,
            im_end_id=5,
            text_start_id=6,
            text_end_id=7,
            image_token_id=8,
            video_start_id=9,
            video_end_id=10,
            im_newline_id=11,
            mask_init_id=12,
            pretraining_tp=1,
            tie_word_embeddings=False,
            rope_theta=10000.0,
            rope_scaling=None,
            attention_bias=False,
            mlp_bias=False,
            attention_dropout=0.0,
            use_qk_norm=False,
            use_rotary_pos_emb=True,
            use_cla=False,
            cla_share_factor=1,
            norm_type="hf_rms",
            num_experts: Union[int, List] = 1,
            use_mixed_mlp_moe=False,
            num_shared_expert: Union[int, List] = 1,
            moe_topk: Union[int, List] = 1,
            capacity_factor: float = 1.0,
            moe_drop_tokens=False,
            moe_random_routing_dropped_token=False,
            use_mla=False,
            kv_lora_rank=512,
            q_lora_rank=1536,
            qk_rope_head_dim=64,
            v_head_dim=128,
            qk_nope_head_dim=128,
            moe_layer_num_skipped=0,
            norm_topk_prob=True,
            routed_scaling_factor=1.0,
            group_limited_greedy=False,
            n_group=None,
            topk_group=None,
            add_classification_head=False,
            class_num=0,
            pool_type="last",
            pad_id=-1,
            # Added
            moe_impl="eager",
            vae_downsample_factor=(16, 16),     # (h, w)
            img_proj_type="unet",
            patch_size=1,
            patch_embed_hidden_dim=1024,
            image_base_size=1024,
            rope_type="2d",
            cond_token_attn_type="full",
            cond_image_type="vae_vit",
            vae_type=None,
            vae_dtype="float32",
            vae_autocast_dtype="float16",
            vae=None,
            vit_type=None,
            vit=None,
            vit_processor=None,
            vit_aligner=None,
            cfg_distilled=False,
            use_meanflow=False,
            model_version="HunyuanImage-3.0-Instruct",
            **kwargs,
    ):
        self.vocab_size = vocab_size
        self.model_version = model_version
        self.max_position_embeddings = max_position_embeddings
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.moe_intermediate_size = moe_intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.moe_impl = moe_impl
        self.num_experts = num_experts
        self.use_mixed_mlp_moe = use_mixed_mlp_moe
        self.num_shared_expert = num_shared_expert
        self.moe_topk = moe_topk
        self.capacity_factor = capacity_factor
        self.moe_drop_tokens = moe_drop_tokens
        self.moe_random_routing_dropped_token = moe_random_routing_dropped_token

        if attention_head_dim is not None:
            self.attention_head_dim = attention_head_dim
        else:
            self.attention_head_dim = self.hidden_size // num_attention_heads

        # for backward compatibility
        if num_key_value_heads is None:
            num_key_value_heads = num_attention_heads

        self.num_key_value_heads = num_key_value_heads
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.rms_norm_eps = rms_norm_eps
        self.pretraining_tp = pretraining_tp
        self.use_cache = use_cache
        self.rope_theta = rope_theta
        self.rope_scaling = rope_scaling
        self.attention_bias = attention_bias
        self.mlp_bias = mlp_bias
        self.attention_dropout = attention_dropout
        self.use_qk_norm = use_qk_norm
        self.use_rotary_pos_emb = use_rotary_pos_emb
        self.use_cla = use_cla
        self.cla_share_factor = cla_share_factor
        self.norm_type = norm_type
        # MLA args
        self.use_mla = use_mla
        self.kv_lora_rank = kv_lora_rank
        self.q_lora_rank = q_lora_rank
        self.qk_rope_head_dim = qk_rope_head_dim
        self.qk_nope_head_dim = qk_nope_head_dim
        self.v_head_dim = v_head_dim

        # DeepSeek related args
        self.moe_layer_num_skipped = moe_layer_num_skipped
        self.norm_topk_prob = norm_topk_prob
        self.routed_scaling_factor = routed_scaling_factor
        self.group_limited_greedy = group_limited_greedy
        self.n_group = n_group
        self.topk_group = topk_group
        self.add_classification_head = add_classification_head
        self.class_num = class_num
        self.pool_type = pool_type
        self.pad_id = pad_id

        if self.class_num is not None:
            self.dense_list = [self.hidden_size, self.class_num]

        # Conditioning image configs
        self.cond_token_attn_type = cond_token_attn_type
        self.cond_image_type = cond_image_type

        # ViT args
        self.vit_type = vit_type
        self.vit = vit
        self.vit_processor = vit_processor
        self.vit_aligner = vit_aligner

        # Image Gen args
        self.vae_type = vae_type
        self.vae_dtype = vae_dtype
        self.vae_autocast_dtype = vae_autocast_dtype
        self.vae = vae
        self.vae_downsample_factor = vae_downsample_factor
        self.img_proj_type = img_proj_type
        self.patch_size = patch_size
        self.patch_embed_hidden_dim = patch_embed_hidden_dim
        self.image_base_size = image_base_size
        self.rope_type = rope_type

        # token id
        self.eod_token_id = eod_token_id
        self.im_start_id = im_start_id
        self.im_end_id = im_end_id
        self.text_start_id = text_start_id
        self.text_end_id = text_end_id
        self.image_token_id = image_token_id
        self.video_start_id = video_start_id
        self.video_end_id = video_end_id
        self.im_newline_id = im_newline_id
        self.mask_init_id = mask_init_id

        # flag of cfg distilled model
        self.cfg_distilled = cfg_distilled
        # flag of meanflow distilled model
        self.use_meanflow = use_meanflow
        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )


__all__ = ["HunyuanImage3Config"]

```
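
The constructor above fills in two attention settings when they are left as `None`: `attention_head_dim` falls back to `hidden_size // num_attention_heads`, and `num_key_value_heads` falls back to `num_attention_heads` (plain multi-head attention). The sketch below restates that fallback logic as a standalone function so it can be checked without installing `transformers`. The function name `resolve_attention_dims`, the returned group size, and the divisibility check are illustrative assumptions and are not part of the configuration class.

```python
from typing import Optional, Tuple


def resolve_attention_dims(
    hidden_size: int,
    num_attention_heads: int,
    num_key_value_heads: Optional[int] = None,
    attention_head_dim: Optional[int] = None,
) -> Tuple[int, int, int]:
    """Return (head_dim, kv_heads, query_heads_per_kv_head). Hypothetical helper."""
    # Same fallback as the config: derive the head size from the hidden size
    if attention_head_dim is None:
        attention_head_dim = hidden_size // num_attention_heads

    # Same fallback as the config: no KV head count means one KV head per query head
    if num_key_value_heads is None:
        num_key_value_heads = num_attention_heads

    # Assumption for this sketch: query heads must split evenly across KV heads
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("num_attention_heads must be divisible by num_key_value_heads")

    group_size = num_attention_heads // num_key_value_heads
    return attention_head_dim, num_key_value_heads, group_size


if __name__ == "__main__":
    print(resolve_attention_dims(4096, 32))                         # MHA defaults
    print(resolve_attention_dims(4096, 32, num_key_value_heads=8))  # GQA
    print(resolve_attention_dims(4096, 32, num_key_value_heads=1))  # MQA
```

A group size of 1 corresponds to MHA, a group size equal to `num_attention_heads` corresponds to MQA, and anything in between is GQA, matching the description in the class docstring.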

## /hunyuan_image_3/hunyuan_image_3_pipeline.py

```py path="/hunyuan_image_3/hunyuan_image_3_pipeline.py" 
# Licensed under the TENCENT HUNYUAN COMMUNITY LICENSE AGREEMENT (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://github.com/Tencent-Hunyuan/HunyuanImage-3.0/blob/main/LICENSE
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
#
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================================

import inspect
import math
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple, Union

import numpy as np
import torch
from PIL import Image
from diffusers.callbacks import MultiPipelineCallbacks, PipelineCallback
from diffusers.configuration_utils import ConfigMixin, register_to_config
from diffusers.image_processor import VaeImageProcessor
from diffusers.pipelines.pipeline_utils import DiffusionPipeline
from diffusers.schedulers.scheduling_utils import SchedulerMixin
from diffusers.utils import BaseOutput, logging
from diffusers.utils.torch_utils import randn_tensor
from .cache_utils import cache_init
logger = logging.get_logger(__name__)  # pylint: disable=invalid-name


def retrieve_timesteps(
    scheduler,
    num_inference_steps: Optional[int] = None,
    device: Optional[Union[str, torch.device]] = None,
    timesteps: Optional[List[int]] = None,
    sigmas: Optional[List[float]] = None,
    **kwargs,
):
    """
    Calls the scheduler's `set_timesteps` method and retrieves timesteps from the scheduler after the call. Handles
    custom timesteps. Any kwargs will be supplied to `scheduler.set_timesteps`.

    Args:
        scheduler (`SchedulerMixin`):
            The scheduler to get timesteps from.
        num_inference_steps (`int`):
            The number of diffusion steps used when generating samples with a pre-trained model. If used, `timesteps`
            must be `None`.
        device (`str` or `torch.device`, *optional*):
            The device to which the timesteps should be moved. If `None`, the timesteps are not moved.
        timesteps (`List[int]`, *optional*):
            Custom timesteps used to override the timestep spacing strategy of the scheduler. If `timesteps` is passed,
            `num_inference_steps` and `sigmas` must be `None`.
        sigmas (`List[float]`, *optional*):
            Custom sigmas used to override the timestep spacing strategy of the scheduler. If `sigmas` is passed,
            `num_inference_steps` and `timesteps` must be `None`.

    Returns:
        `Tuple[torch.Tensor, int]`: A tuple where the first element is the timestep schedule from the scheduler and the
        second element is the number of inference steps.
    """
    if timesteps is not None and sigmas is not None:
        raise ValueError("Only one of `timesteps` or `sigmas` can be passed. Please choose one to set custom values")
    if timesteps is not None:
        accepts_timesteps = "timesteps" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
        if not accepts_timesteps:
            raise ValueError(
                f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
                f" timestep schedules. Please check whether you are using the correct scheduler."
            )
        scheduler.set_timesteps(timesteps=timesteps, device=device, **kwargs)
        timesteps = scheduler.timesteps
        num_inference_steps = len(timesteps)
    elif sigmas is not None:
        accept_sigmas = "sigmas" in set(inspect.signature(scheduler.set_timesteps).parameters.keys())
        if not accept_sigmas:
            raise ValueError(
                f"The current scheduler class {scheduler.__class__}'s `set_timesteps` does not support custom"
                f" sigmas schedules. Please check whether you are using the correct scheduler."
            )
        scheduler.set_timesteps(sigmas=sigmas, device=device, **kwargs)
        timesteps = scheduler.timesteps
        num_inference_steps = len(timesteps)
    else:
        scheduler.set_timesteps(num_inference_steps, device=device, **kwargs)
        timesteps = scheduler.timesteps
    return timesteps, num_inference_steps
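

# --- Illustrative sketch (not part of the original module) -------------------
# `retrieve_timesteps` decides whether a scheduler supports custom `timesteps`
# or `sigmas` by inspecting the signature of its `set_timesteps` method. The
# helper and the two toy classes below are hypothetical names that only
# demonstrate that check in isolation. By the same check, the
# `FlowMatchDiscreteScheduler` defined later in this file would be rejected for
# custom `timesteps` or `sigmas`, since its `set_timesteps` only declares
# `num_inference_steps`, `device`, and `n_tokens`.
def _sketch_accepts_kwarg(func, name):
    """Return True if `func` declares a parameter called `name`."""
    import inspect  # local import keeps the sketch self-contained

    return name in inspect.signature(func).parameters


class _SketchStepsOnlyScheduler:
    """Toy scheduler whose `set_timesteps` has no `sigmas` parameter."""

    def set_timesteps(self, num_inference_steps, device=None):
        self.timesteps = list(range(num_inference_steps, 0, -1))


class _SketchSigmaScheduler:
    """Toy scheduler whose `set_timesteps` accepts custom `sigmas`."""

    def set_timesteps(self, num_inference_steps=None, device=None, sigmas=None):
        if sigmas is not None:
            self.timesteps = list(sigmas)
        else:
            self.timesteps = list(range(num_inference_steps, 0, -1))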


def rescale_noise_cfg(noise_cfg, noise_pred_text, guidance_rescale=0.0):
    r"""
    Rescales `noise_cfg` tensor based on `guidance_rescale` to improve image quality and fix overexposure. Based on
    Section 3.4 from [Common Diffusion Noise Schedules and Sample Steps are
    Flawed](https://arxiv.org/pdf/2305.08891.pdf).

    Args:
        noise_cfg (`torch.Tensor`):
            The predicted noise tensor for the guided diffusion process.
        noise_pred_text (`torch.Tensor`):
            The predicted noise tensor for the text-guided diffusion process.
        guidance_rescale (`float`, *optional*, defaults to 0.0):
            A rescale factor applied to the noise predictions.
    Returns:
        noise_cfg (`torch.Tensor`): The rescaled noise prediction tensor.
    """
    std_text = noise_pred_text.std(dim=list(range(1, noise_pred_text.ndim)), keepdim=True)
    std_cfg = noise_cfg.std(dim=list(range(1, noise_cfg.ndim)), keepdim=True)
    # rescale the results from guidance (fixes overexposure)
    noise_pred_rescaled = noise_cfg * (std_text / std_cfg)
    # mix with the original results from guidance by factor guidance_rescale to avoid "plain looking" images
    noise_cfg = guidance_rescale * noise_pred_rescaled + (1 - guidance_rescale) * noise_cfg
    return noise_cfg
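

# --- Illustrative sketch (not part of the original module) -------------------
# A scalar walk-through of `rescale_noise_cfg` on plain Python lists, assuming a
# single sample so the standard deviation is taken over all of its values.
# `_sketch_rescale` is a hypothetical helper. `statistics.stdev` is the sample
# (n - 1) standard deviation, which matches the default of `torch.Tensor.std`.
def _sketch_rescale(noise_cfg, noise_pred_text, guidance_rescale=0.0):
    import statistics  # local import keeps the sketch self-contained

    # Ratio that brings the guided prediction back to the text prediction's std.
    ratio = statistics.stdev(noise_pred_text) / statistics.stdev(noise_cfg)
    # Blend the std-matched prediction with the unmodified guided prediction.
    return [
        guidance_rescale * (x * ratio) + (1 - guidance_rescale) * x
        for x in noise_cfg
    ]


# With `guidance_rescale=0.0` the input comes back unchanged; with 1.0 the
# output has the same standard deviation as `noise_pred_text`.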


@dataclass
class HunyuanImage3Text2ImagePipelineOutput(BaseOutput):
    samples: Union[List[Any], np.ndarray]


@dataclass
class FlowMatchDiscreteSchedulerOutput(BaseOutput):
    """
    Output class for the scheduler's `step` function output.

    Args:
        prev_sample (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)` for images):
            Computed sample `(x_{t-1})` of previous timestep. `prev_sample` should be used as next model input in the
            denoising loop.
    """

    prev_sample: torch.FloatTensor
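

# --- Illustrative sketch (not part of the original module) -------------------
# The scheduler defined below can warp its linear sigma grid in two ways, using
# `sd3_time_shift` or `flux_time_shift`. The hypothetical scalar helpers here
# restate those formulas with plain floats so the effect is easy to inspect.
# The default arguments copy the defaults that appear in the scheduler.
def _sketch_sd3_shift(t, shift):
    # shift == 1 leaves t unchanged; shift > 1 pushes interior values toward 1.
    return (shift * t) / (1 + (shift - 1) * t)


def _sketch_flux_mu(n_tokens, x1=256, y1=0.5, x2=4096, y2=1.15):
    # Line through (x1, y1) and (x2, y2), evaluated at the token count.
    slope = (y2 - y1) / (x2 - x1)
    return y1 + slope * (n_tokens - x1)


def _sketch_flux_shift(t, mu, sigma=1.0):
    import math  # local import keeps the sketch self-contained

    # Requires 0 < t <= 1; t == 0 would divide by zero in this scalar form.
    return math.exp(mu) / (math.exp(mu) + (1 / t - 1) ** sigma)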


class FlowMatchDiscreteScheduler(SchedulerMixin, ConfigMixin):
    """
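    Flow-matching scheduler with a selectable ODE solver (Euler by default).

    This model inherits from [`SchedulerMixin`] and [`ConfigMixin`]. Check the superclass documentation for the generic
    methods the library implements for all schedulers such as loading and saving.

    Args:
        num_train_timesteps (`int`, defaults to 1000):
            The number of diffusion steps to train the model.
        shift (`float`, defaults to 1.0):
            The shift value for the timestep schedule. Applied through `sd3_time_shift` when it differs from 1.0 and
            `use_flux_shift` is `False`.
        reverse (`bool`, defaults to `True`):
            Whether to reverse the timestep schedule. When `True`, sigmas run from 1 to 0; when `False`, from 0 to 1.
        solver (`str`, defaults to `"euler"`):
            The ODE solver. One of `"euler"`, `"heun-2"`, `"midpoint-2"`, or `"kutta-4"`.
        use_flux_shift (`bool`, defaults to `False`):
            Whether to apply the token-count-dependent shift from `flux_time_shift` instead of `shift`.
        flux_base_shift (`float`, defaults to 0.5):
            The shift parameter `mu` used at 256 tokens when `use_flux_shift` is enabled.
        flux_max_shift (`float`, defaults to 1.15):
            The shift parameter `mu` used at 4096 tokens when `use_flux_shift` is enabled.
        n_tokens (`int`, *optional*):
            Stored in the config but not otherwise used by the constructor. `set_timesteps` takes its own `n_tokens`
            argument.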
    """

    _compatibles = []
    order = 1

    @register_to_config
    def __init__(
            self,
            num_train_timesteps: int = 1000,
            shift: float = 1.0,
            reverse: bool = True,
            solver: str = "euler",
            use_flux_shift: bool = False,
            flux_base_shift: float = 0.5,
            flux_max_shift: float = 1.15,
            n_tokens: Optional[int] = None,
    ):
        sigmas = torch.linspace(1, 0, num_train_timesteps + 1)

        if not reverse:
            sigmas = sigmas.flip(0)

        self.sigmas = sigmas
        # the value fed to model
        self.timesteps = (sigmas[:-1] * num_train_timesteps).to(dtype=torch.float32)
        self.timesteps_full = (sigmas * num_train_timesteps).to(dtype=torch.float32)

        self._step_index = None
        self._begin_index = None

        self.supported_solver = [
            "euler",
            "heun-2", "midpoint-2",
            "kutta-4",
        ]
        if solver not in self.supported_solver:
            raise ValueError(f"Solver {solver} not supported. Supported solvers: {self.supported_solver}")

        # empty dt and derivative (for heun)
        self.derivative_1 = None
        self.derivative_2 = None
        self.derivative_3 = None
        self.dt = None

    @property
    def step_index(self):
        """
        The index counter for the current timestep. It increases by 1 after each completed scheduler step.
        """
        return self._step_index

    @property
    def begin_index(self):
        """
        The index for the first timestep. It should be set from pipeline with `set_begin_index` method.
        """
        return self._begin_index

    # Copied from diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler.set_begin_index
    def set_begin_index(self, begin_index: int = 0):
        """
        Sets the begin index for the scheduler. This function should be run from pipeline before the inference.

        Args:
            begin_index (`int`):
                The begin index for the scheduler.
        """
        self._begin_index = begin_index

    def _sigma_to_t(self, sigma):
        return sigma * self.config.num_train_timesteps

    @property
    def state_in_first_order(self):
        return self.derivative_1 is None

    @property
    def state_in_second_order(self):
        return self.derivative_2 is None

    @property
    def state_in_third_order(self):
        return self.derivative_3 is None

    def get_timestep_r(self, timestep: Union[float, torch.FloatTensor]):
        if self.step_index is None:
            self._init_step_index(timestep)
        return self.timesteps_full[self.step_index + 1]

    def set_timesteps(self, num_inference_steps: int, device: Union[str, torch.device] = None,
                      n_tokens: int = None):
        """
        Sets the discrete timesteps used for the diffusion chain (to be run before inference).

        Args:
            num_inference_steps (`int`):
                The number of diffusion steps used when generating samples with a pre-trained model.
            device (`str` or `torch.device`, *optional*):
                The device to which the timesteps should be moved. If `None`, the timesteps are not moved.
            n_tokens (`int`, *optional*):
                Number of tokens in the input sequence.
        """
        self.num_inference_steps = num_inference_steps

        sigmas = torch.linspace(1, 0, num_inference_steps + 1)

        # Apply timestep shift
        if self.config.use_flux_shift:
            assert isinstance(n_tokens, int), "n_tokens should be provided for flux shift"
            mu = self.get_lin_function(y1=self.config.flux_base_shift, y2=self.config.flux_max_shift)(n_tokens)
            sigmas = self.flux_time_shift(mu, 1.0, sigmas)
        elif self.config.shift != 1.:
            sigmas = self.sd3_time_shift(sigmas)

        if not self.config.reverse:
            sigmas = 1 - sigmas

        self.sigmas = sigmas
        self.timesteps = (sigmas[:-1] * self.config.num_train_timesteps).to(dtype=torch.float32, device=device)
        self.timesteps_full = (sigmas * self.config.num_train_timesteps).to(dtype=torch.float32, device=device)

        # empty dt and derivative (for kutta)
        self.derivative_1 = None
        self.derivative_2 = None
        self.derivative_3 = None
        self.dt = None

        # Reset step index
        self._step_index = None

    def index_for_timestep(self, timestep, schedule_timesteps=None):
        if schedule_timesteps is None:
            schedule_timesteps = self.timesteps

        indices = (schedule_timesteps == timestep).nonzero()

        # The sigma index that is taken for the **very** first `step`
        # is always the second index (or the last index if there is only 1)
        # This way we can ensure we don't accidentally skip a sigma in
        # case we start in the middle of the denoising schedule (e.g. for image-to-image)
        pos = 1 if len(indices) > 1 else 0

        return indices[pos].item()

    def _init_step_index(self, timestep):
        if self.begin_index is None:
            if isinstance(timestep, torch.Tensor):
                timestep = timestep.to(self.timesteps.device)
            self._step_index = self.index_for_timestep(timestep)
        else:
            self._step_index = self._begin_index

    def scale_model_input(self, sample: torch.Tensor, timestep: Optional[int] = None) -> torch.Tensor:
        return sample

    @staticmethod
    def get_lin_function(x1: float = 256, y1: float = 0.5, x2: float = 4096, y2: float = 1.15):
        m = (y2 - y1) / (x2 - x1)
        b = y1 - m * x1
        return lambda x: m * x + b

    @staticmethod
    def flux_time_shift(mu: float, sigma: float, t: torch.Tensor):
        return math.exp(mu) / (math.exp(mu) + (1 / t - 1) ** sigma)

    def sd3_time_shift(self, t: torch.Tensor):
        return (self.config.shift * t) / (1 + (self.config.shift - 1) * t)

    def step(
            self,
            model_output: torch.FloatTensor,
            timestep: Union[float, torch.FloatTensor],
            sample: torch.FloatTensor,
            pred_uncond: torch.FloatTensor = None,
            generator: Optional[torch.Generator] = None,
            n_tokens: Optional[int] = None,
            return_dict: bool = True,
    ) -> Union[FlowMatchDiscreteSchedulerOutput, Tuple]:
        """
        Predict the sample at the next point of the schedule by integrating the flow ODE with the configured solver.
        For multi-stage solvers (`"heun-2"`, `"midpoint-2"`, `"kutta-4"`), this method must be called once per stage,
        and the step index only advances after the last stage.

        Args:
            model_output (`torch.FloatTensor`):
                The direct output from the learned diffusion model, used as the derivative of the sample.
            timestep (`float`):
                The current discrete timestep in the diffusion chain.
            sample (`torch.FloatTensor`):
                A current instance of a sample created by the diffusion process.
            pred_uncond (`torch.FloatTensor`, *optional*):
                The unconditional prediction. It is upcast to float32 but not otherwise used by the solvers.
            generator (`torch.Generator`, *optional*):
                A random number generator. Not used by the solvers, since the update is deterministic.
            n_tokens (`int`, *optional*):
                Number of tokens in the input sequence. Not used by this method.
            return_dict (`bool`):
                Whether or not to return a [`FlowMatchDiscreteSchedulerOutput`] or tuple.

        Returns:
            [`FlowMatchDiscreteSchedulerOutput`] or `tuple`:
                If return_dict is `True`, [`FlowMatchDiscreteSchedulerOutput`] is returned, otherwise a tuple is
                returned where the first element is the sample tensor.
        """

        if (
                isinstance(timestep, int)
                or isinstance(timestep, torch.IntTensor)
                or isinstance(timestep, torch.LongTensor)
        ):
            raise ValueError(
                (
                    "Passing integer indices (e.g. from `enumerate(timesteps)`) as timesteps to"
                    " `FlowMatchDiscreteScheduler.step()` is not supported. Make sure to pass"
                    " one of the `scheduler.timesteps` as a timestep."
                ),
            )

        if self.step_index is None:
            self._init_step_index(timestep)

        # Upcast to avoid precision issues when computing prev_sample
        sample = sample.to(torch.float32)
        model_output = model_output.to(torch.float32)
        pred_uncond = pred_uncond.to(torch.float32) if pred_uncond is not None else None

        # dt = self.sigmas[self.step_index + 1] - self.sigmas[self.step_index]
        sigma = self.sigmas[self.step_index]
        sigma_next = self.sigmas[self.step_index + 1]

        last_inner_step = True
        if self.config.solver == "euler":
            derivative, dt, sample, last_inner_step = self.first_order_method(model_output, sigma, sigma_next, sample)
        elif self.config.solver in ["heun-2", "midpoint-2"]:
            derivative, dt, sample, last_inner_step = self.second_order_method(model_output, sigma, sigma_next, sample)
        elif self.config.solver == "kutta-4":
            derivative, dt, sample, last_inner_step = self.fourth_order_method(model_output, sigma, sigma_next, sample)
        else:
            raise ValueError(f"Solver {self.config.solver} not supported. Supported solvers: {self.supported_solver}")

        prev_sample = sample + derivative * dt

        # Cast sample back to model compatible dtype
        # prev_sample = prev_sample.to(model_output.dtype)

        # upon completion increase step index by one
        if last_inner_step:
            self._step_index += 1

        if not return_dict:
            return (prev_sample,)

        return FlowMatchDiscreteSchedulerOutput(prev_sample=prev_sample)

    def first_order_method(self, model_output, sigma, sigma_next, sample):
        derivative = model_output
        dt = sigma_next - sigma
        return derivative, dt, sample, True

    def second_order_method(self, model_output, sigma, sigma_next, sample):
        if self.state_in_first_order:
            # store for 2nd order step
            self.derivative_1 = model_output
            self.dt = sigma_next - sigma
            self.sample = sample

            derivative = model_output
            if self.config.solver == 'heun-2':
                dt = self.dt
            elif self.config.solver == 'midpoint-2':
                dt = self.dt / 2
            else:
                raise NotImplementedError(f"Solver {self.config.solver} not supported.")
            last_inner_step = False

        else:
            if self.config.solver == 'heun-2':
                derivative = 0.5 * (self.derivative_1 + model_output)
            elif self.config.solver == 'midpoint-2':
                derivative = model_output
            else:
                raise NotImplementedError(f"Solver {self.config.solver} not supported.")

            # 3. take prev timestep & sample
            dt = self.dt
            sample = self.sample
            last_inner_step = True

            # free dt and derivative
            # Note, this puts the scheduler in "first order mode"
            self.derivative_1 = None
            self.dt = None
            self.sample = None

        return derivative, dt, sample, last_inner_step

    def fourth_order_method(self, model_output, sigma, sigma_next, sample):
        if self.state_in_first_order:
            self.derivative_1 = model_output
            self.dt = sigma_next - sigma
            self.sample = sample
            derivative = model_output
            dt = self.dt / 2
            last_inner_step = False

        elif self.state_in_second_order:
            self.derivative_2 = model_output
            derivative = model_output
            dt = self.dt / 2
            last_inner_step = False

        elif self.state_in_third_order:
            self.derivative_3 = model_output
            derivative = model_output
            dt = self.dt
            last_inner_step = False

        else:
            derivative = (1/6 * self.derivative_1 + 1/3 * self.derivative_2 + 1/3 * self.derivative_3 +
                          1/6 * model_output)

            # 3. take prev timestep & sample
            dt = self.dt
            sample = self.sample
            last_inner_step = True

            # free dt and derivative
            # Note, this puts the scheduler in "first order mode"
            self.derivative_1 = None
            self.derivative_2 = None
            self.derivative_3 = None
            self.dt = None
            self.sample = None

        return derivative, dt, sample, last_inner_step

    def __len__(self):
        return self.config.num_train_timesteps
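

# --- Illustrative sketch (not part of the original module) -------------------
# With `solver="heun-2"`, `step` has to be called twice per timestep: the first
# call stores the slope and returns an Euler predictor, and the second call
# averages both slopes and restarts from the stored sample. The hypothetical
# class below replays that bookkeeping on plain floats. How the caller picks
# the sigma for the second model evaluation is not shown in this part of the
# file, so the usage note assumes the slope is evaluated at the predictor, as
# in the textbook Heun method.
class _SketchHeunStepper:
    def __init__(self):
        self.derivative_1 = None
        self.dt = None
        self.sample = None

    def step(self, model_output, sigma, sigma_next, sample):
        """Return `(new_sample, finished)`; `finished` is True on the corrector call."""
        if self.derivative_1 is None:
            # Predictor: remember the state and take a full Euler step.
            self.derivative_1 = model_output
            self.dt = sigma_next - sigma
            self.sample = sample
            return sample + model_output * self.dt, False
        # Corrector: average the two slopes and step from the stored sample.
        derivative = 0.5 * (self.derivative_1 + model_output)
        new_sample = self.sample + derivative * self.dt
        # Reset so the next call starts a fresh predictor stage.
        self.derivative_1 = self.dt = self.sample = None
        return new_sample, True


# Usage on the toy ODE dx/dsigma = x, integrating sigma from 1.0 down to 0.5:
#     stepper = _SketchHeunStepper()
#     x_pred, done = stepper.step(2.0, 1.0, 0.5, 2.0)        # slope at x = 2.0
#     x_next, done = stepper.step(x_pred, 1.0, 0.5, x_pred)  # slope at predictor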


class ClassifierFreeGuidance:
    def __init__(
        self,
        use_original_formulation: bool = False,
        start: float = 0.0,
        stop: float = 1.0,
    ):
        super().__init__()
        self.use_original_formulation = use_original_formulation

    def __call__(
            self,
            pred_cond: torch.Tensor,
            pred_uncond: Optional[torch.Tensor],
            guidance_scale: float,
            step: int,
        ) -> torch.Tensor:

        shift = pred_cond - pred_uncond
        pred = pred_cond if self.use_original_formulation else pred_uncond
        pred = pred + guidance_scale * shift

        return pred
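

# --- Illustrative sketch (not part of the original module) -------------------
# `ClassifierFreeGuidance.__call__` is an extrapolation along the direction
# `pred_cond - pred_uncond`. The hypothetical scalar helper below mirrors both
# branches so the role of `guidance_scale` can be checked by hand.
def _sketch_cfg(pred_cond, pred_uncond, guidance_scale, use_original_formulation=False):
    shift = pred_cond - pred_uncond
    base = pred_cond if use_original_formulation else pred_uncond
    return base + guidance_scale * shift


# Default branch: a scale of 1.0 returns `pred_cond`, larger values overshoot it.
#     _sketch_cfg(3.0, 1.0, 1.0)  ->  3.0
#     _sketch_cfg(3.0, 1.0, 2.0)  ->  5.0
# Original-formulation branch: a scale of 0.0 returns `pred_cond`.
#     _sketch_cfg(3.0, 1.0, 0.0, use_original_formulation=True)  ->  3.0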


class HunyuanImage3Text2ImagePipeline(DiffusionPipeline):
    r"""
    Pipeline for text-to-image generation with HunyuanImage-3.0.

    This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods
    implemented for all pipelines (downloading, saving, running on a particular device, etc.).

    Args:
        model ([`ModelMixin`]):
            A model to denoise the diffused latents.
        scheduler ([`SchedulerMixin`]):
            A scheduler to be used in combination with `model` to denoise the diffused latents, such as
            [`FlowMatchDiscreteScheduler`].
        vae:
            The VAE registered with the pipeline.
        progress_bar_config (`Dict[str, Any]`, *optional*):
            Keyword arguments merged into the progress bar configuration.
    """

    model_cpu_offload_seq = ""
    _optional_components = []
    _exclude_from_cpu_offload = []
    _callback_tensor_inputs = ["latents"]

    def __init__(
        self,
        model,
        scheduler: SchedulerMixin,
        vae,
        progress_bar_config: Dict[str, Any] = None,
    ):
        super().__init__()

        # ==========================================================================================
        if progress_bar_config is None:
            progress_bar_config = {}
        if not hasattr(self, '_progress_bar_config'):
            self._progress_bar_config = {}
        self._progress_bar_config.update(progress_bar_config)
        # ==========================================================================================

        self.register_modules(
            model=model,
            scheduler=scheduler,
            vae=vae,
        )

        # should be a tuple or a list corresponding to the size of latents (batch_size, channel, *size)
        # if None, will be treated as a tuple of 1
        self.latent_scale_factor = self.model.config.vae_downsample_factor
        self.image_processor = VaeImageProcessor(vae_scale_factor=self.latent_scale_factor)

        # Guidance operator that combines the conditional and unconditional predictions
        self.cfg_operator = ClassifierFreeGuidance()

    @staticmethod
    def denormalize(images: Union[np.ndarray, torch.Tensor]) -> Union[np.ndarray, torch.Tensor]:
        """
        Denormalize an image array to [0,1].
        """
        return (images / 2 + 0.5).clamp(0, 1)

    @staticmethod
    def pt_to_numpy(images: torch.Tensor) -> np.ndarray:
        """
        Convert a PyTorch tensor to a NumPy image.
        """
        images = images.cpu().permute(0, 2, 3, 1).float().numpy()
        return images

    @staticmethod
    def numpy_to_pil(images: np.ndarray):
        """
        Convert a numpy image or a batch of images to a PIL image.
        """
        if images.ndim == 3:
            images = images[None, ...]
        images = (images * 255).round().astype("uint8")
        if images.shape[-1] == 1:
            # special case for grayscale (single channel) images
            pil_images = [Image.fromarray(image.squeeze(), mode="L") for image in images]
        else:
            pil_images = [Image.fromarray(image) for image in images]

        return pil_images

    def prepare_extra_func_kwargs(self, func, kwargs):
        # Keep only the kwargs that `func` accepts, since not all schedulers share the same signature.
        extra_kwargs = {}

        for k, v in kwargs.items():
            accepts = k in set(inspect.signature(func).parameters.keys())
            if accepts:
                extra_kwargs[k] = v
        return extra_kwargs

    def prepare_latents(self, batch_size, latent_channel, image_size, dtype, device, generator, latents=None):
        if self.latent_scale_factor is None:
            latent_scale_factor = (1,) * len(image_size)
        elif isinstance(self.latent_scale_factor, int):
            latent_scale_factor = (self.latent_scale_factor,) * len(image_size)
        elif isinstance(self.latent_scale_factor, tuple) or isinstance(self.latent_scale_factor, list):
            assert len(self.latent_scale_factor) == len(image_size), \
                "len(latent_scale_factor) should be the same as len(image_size)"
            latent_scale_factor = self.latent_scale_factor
        else:
            raise ValueError(
                f"latent_scale_factor should be either None, int, tuple of int, or list of int, "
                f"but got {self.latent_scale_factor}"
            )

        latents_shape = (
            batch_size,
            latent_channel,
            *[int(s) // f for s, f in zip(image_size, latent_scale_factor)],
        )
        if isinstance(generator, list) and len(generator) != batch_size:
            raise ValueError(
                f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
                f" size of {batch_size}. Make sure the batch size matches the length of the generators."
            )

        if latents is None:
            latents = randn_tensor(latents_shape, generator=generator, device=device, dtype=dtype)
        else:
            latents = latents.to(device)

        # Check existence to make it compatible with FlowMatchEulerDiscreteScheduler
        if hasattr(self.scheduler, "init_noise_sigma"):
            # scale the initial noise by the standard deviation required by the scheduler
            latents = latents * self.scheduler.init_noise_sigma

        return latents

    @property
    def guidance_scale(self):
        return self._guidance_scale

    @property
    def guidance_rescale(self):
        return self._guidance_rescale

    # here `guidance_scale` is defined analogously to the guidance weight `w` of equation (2)
    # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
    # corresponds to doing no classifier free guidance.
    @property
    def do_classifier_free_guidance(self):
        return self._guidance_scale > 1.0

    @property
    def num_timesteps(self):
        return self._num_timesteps

    def set_scheduler(self, new_scheduler):
        self.register_modules(scheduler=new_scheduler)

    @torch.no_grad()
    def __call__(
        self,
        batch_size: int,
        image_size: List[int],
        num_inference_steps: int = 50,
        timesteps: List[int] = None,
        sigmas: List[float] = None,
        guidance_scale: float = 7.5,
        meanflow: bool = False,
        generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
        latents: Optional[torch.Tensor] = None,
        output_type: Optional[str] = "pil",
        return_dict: bool = True,
        guidance_rescale: float = 0.0,
        callback_on_step_end: Optional[
            Union[Callable[[int, int, Dict], None], PipelineCallback, MultiPipelineCallbacks]
        ] = None,
        callback_on_step_end_tensor_inputs: List[str] = ["latents"],
        model_kwargs: Dict[str, Any] = None,
        **kwargs,
    ):
        r"""
        The call function to the pipeline for generation.

        Args:
            batch_size (`int`):
                The number of samples to generate.
            image_size (`Tuple[int]` or `List[int]`):
                The size (height, width) of the generated image.
            num_inference_steps (`int`, *optional*, defaults to 50):
                The number of denoising steps. More denoising steps usually lead to a higher quality image at the
                expense of slower inference.
            timesteps (`List[int]`, *optional*):
                Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument
                in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is
                passed will be used. Must be in descending order.
            sigmas (`List[float]`, *optional*):
                Custom sigmas to use for the denoising process with schedulers which support a `sigmas` argument in
                their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed
                will be used.
            guidance_scale (`float`, *optional*, defaults to 7.5):
                A higher guidance scale value encourages the model to generate samples closely linked to the
                `condition` at the expense of lower sample quality. Guidance scale is enabled when `guidance_scale > 1`.
            meanflow (`bool`, *optional*, defaults to `False`):
                Whether to sample with a meanflow-distilled model.
            generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
                A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
                generation deterministic.
            latents (`torch.Tensor`, *optional*):
                Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for sample
                generation. Can be used to tweak the same generation with different conditions. If not provided,
                a latents tensor is generated by sampling using the supplied random `generator`.
            output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated sample.
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether or not to return a [`~DiffusionPipelineOutput`] instead of a
                plain tuple.
            guidance_rescale (`float`, *optional*, defaults to 0.0):
                Guidance rescale factor from [Common Diffusion Noise Schedules and Sample Steps are
                Flawed](https://arxiv.org/pdf/2305.08891.pdf). Guidance rescale factor should fix overexposure when
                using zero terminal SNR.
            callback_on_step_end (`Callable`, `PipelineCallback`, `MultiPipelineCallbacks`, *optional*):
                A function or a subclass of `PipelineCallback` or `MultiPipelineCallbacks` that is called at the end of
                each denoising step during inference, with the following arguments: `callback_on_step_end(self:
                DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a
                list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
            callback_on_step_end_tensor_inputs (`List`, *optional*):
                The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
                will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
                `._callback_tensor_inputs` attribute of your pipeline class.
            model_kwargs (`Dict[str, Any]`):
                Keyword arguments forwarded to the model. Must contain `input_ids`.

        Examples:

        Returns:
            [`~DiffusionPipelineOutput`] or `tuple`:
                If `return_dict` is `True`, [`~DiffusionPipelineOutput`] is returned,
                otherwise a `tuple` is returned where the first element is a list with the generated samples.
        """

        callback_steps = kwargs.pop("callback_steps", None)
        pbar_steps = kwargs.pop("pbar_steps", None)

        if isinstance(callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)):
            callback_on_step_end_tensor_inputs = callback_on_step_end.tensor_inputs

        self._guidance_scale = guidance_scale
        self._guidance_rescale = guidance_rescale


        if not kwargs.get('cfg_distilled', False):
            cfg_factor = 1 + self.do_classifier_free_guidance
        else:
            cfg_factor = 1
        # Define call parameters
        device = self._execution_device

        # Prepare timesteps
        timesteps, num_inference_steps = retrieve_timesteps(
            self.scheduler, num_inference_steps, device, timesteps, sigmas,
        )

        # Prepare latent variables
        latents = self.prepare_latents(
            batch_size=batch_size,
            latent_channel=self.model.config.vae["latent_channels"],
            image_size=image_size,
            dtype=torch.bfloat16,
            device=device,
            generator=generator,
            latents=latents,
        )

        # Prepare extra step kwargs.
        _scheduler_step_extra_kwargs = self.prepare_extra_func_kwargs(
            self.scheduler.step, {"generator": generator}
        )

        # Prepare model kwargs
        input_ids = model_kwargs.pop("input_ids")
        attention_mask = self.model._prepare_attention_mask_for_generation(     # noqa
            input_ids, self.model.generation_config, model_kwargs=model_kwargs,
        )
        model_kwargs["attention_mask"] = attention_mask.to(latents.device)

        # Sampling loop
        num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order
        self._num_timesteps = len(timesteps)

        # Taylor cache 
        cache_dic = None
        if self.model.use_taylor_cache:
            cache_dic = cache_init(
                cache_interval=self.model.taylor_cache_interval,
                max_order=self.model.taylor_cache_order,
                num_steps=len(timesteps),
                enable_first_enhance=self.model.taylor_cache_enable_first_enhance,
                first_enhance_steps=self.model.taylor_cache_first_enhance_steps,
                enable_tailing_enhance=self.model.taylor_cache_enable_tailing_enhance,
                tailing_enhance_steps=self.model.taylor_cache_tailing_enhance_steps,
                low_freqs_order=self.model.taylor_cache_low_freqs_order,
                high_freqs_order=self.model.taylor_cache_high_freqs_order
            )
        print(f"***use_taylor_cache: {self.model.use_taylor_cache}, cache_dic: {cache_dic}")

        with self.progress_bar(total=num_inference_steps) as progress_bar:
            for i, t in enumerate(timesteps):
                # expand the latents if we are doing classifier free guidance
                latent_model_input = torch.cat([latents] * cfg_factor)
                latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

                if meanflow:
                    r = self.scheduler.get_timestep_r(t)
                    r_expand = r.repeat(latent_model_input.shape[0])
                else:
                    r_expand = None
                model_kwargs["timesteps_r"] = r_expand

                t_expand = t.repeat(latent_model_input.shape[0])

                if self.model.use_taylor_cache:
                    cache_dic['current_step'] = i
                    model_kwargs['cache_dic'] = cache_dic
                if kwargs.get('cfg_distilled', False):
                    model_kwargs["guidance"] = torch.tensor(
                        [1000.0*self._guidance_scale], device=self.device, dtype=torch.bfloat16
                    )
                model_inputs = self.model.prepare_inputs_for_generation(
                    input_ids,
                    images=latent_model_input,
                    timesteps=t_expand,
                    **model_kwargs,
                )
                with torch.autocast(device_type="cuda", dtype=torch.bfloat16, enabled=True):
                    model_output = self.model(**model_inputs, first_step=(i == 0))
                    pred = model_output["diffusion_prediction"]
                pred = pred.to(dtype=torch.float32)
                # perform guidance (skipped for CFG-distilled models, which run a single un-batched pass)
                if self.do_classifier_free_guidance and not kwargs.get('cfg_distilled', False):
                    pred_cond, pred_uncond = pred.chunk(2)
                    pred = self.cfg_operator(pred_cond, pred_uncond, self.guidance_scale, step=i)

                    if self.guidance_rescale > 0.0:
                        # Rescaling needs pred_cond, which only exists on the non-distilled path.
                        # Based on 3.4. in https://arxiv.org/pdf/2305.08891.pdf
                        pred = rescale_noise_cfg(pred, pred_cond, guidance_rescale=self.guidance_rescale)

                # compute the previous noisy sample x_t -> x_t-1
                latents = self.scheduler.step(pred, t, latents, **_scheduler_step_extra_kwargs, return_dict=False)[0]

                if i != len(timesteps) - 1:
                    model_kwargs = self.model._update_model_kwargs_for_generation(  # noqa
                        model_output,
                        model_kwargs,
                    )
                    input_ids = None
                    # if input_ids.shape[1] != model_kwargs["position_ids"].shape[1]:
                    #     input_ids = torch.gather(input_ids, 1, index=model_kwargs["position_ids"])

                if callback_on_step_end is not None:
                    callback_kwargs = {}
                    for k in callback_on_step_end_tensor_inputs:
                        callback_kwargs[k] = locals()[k]
                    callback_outputs = callback_on_step_end(self, i, t, callback_kwargs)

                    latents = callback_outputs.pop("latents", latents)

                # update the progress bar on the last step, or once per scheduler step after warmup
                if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
                    progress_bar.update()

        if hasattr(self.vae.config, 'scaling_factor') and self.vae.config.scaling_factor:
            latents = latents / self.vae.config.scaling_factor
        if hasattr(self.vae.config, 'shift_factor') and self.vae.config.shift_factor:
            latents = latents + self.vae.config.shift_factor

        if hasattr(self.vae, "ffactor_temporal"):
            latents = latents.unsqueeze(2)

        with torch.autocast(device_type="cuda", dtype=torch.float16, enabled=True):
            image = self.vae.decode(latents, return_dict=False, generator=generator)[0]

        # b c t h w
        if hasattr(self.vae, "ffactor_temporal"):
            assert image.shape[2] == 1, "image should have shape [B, C, T, H, W] and T should be 1"
            image = image.squeeze(2)

        do_denormalize = [True] * image.shape[0]
        image = self.image_processor.postprocess(image, output_type=output_type, do_denormalize=do_denormalize)

        if not return_dict:
            return (image,)

        return HunyuanImage3Text2ImagePipelineOutput(samples=image)

```
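
The docstring above describes the `callback_on_step_end` contract: the loop calls it as `callback_on_step_end(pipeline, step, timestep, callback_kwargs)` and then reads `latents` back from the returned dict. Below is a minimal sketch of a compatible callback. It assumes plain Python lists stand in for tensors, and `clamp_latents` and `run_fake_loop` are hypothetical names used only for illustration; neither is part of the repository.

```python
from typing import Any, Callable, Dict, List

Callback = Callable[[Any, int, int, Dict[str, Any]], Dict[str, Any]]


def clamp_latents(pipe: Any, step: int, timestep: int,
                  callback_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical callback: clamp every latent value to [-1, 1] after each step."""
    latents = callback_kwargs["latents"]
    return {"latents": [max(-1.0, min(1.0, v)) for v in latents]}


def run_fake_loop(latents: List[float], timesteps: List[int],
                  callback: Callback, tensor_inputs: List[str]) -> List[float]:
    """Stand-in for the sampling loop; only the callback hand-off mirrors the pipeline."""
    for i, t in enumerate(timesteps):
        latents = [v * 1.5 for v in latents]  # placeholder for scheduler.step(...)
        available = {"latents": latents}
        # Only the names requested via tensor_inputs are passed to the callback.
        callback_kwargs = {k: available[k] for k in tensor_inputs}
        callback_outputs = callback(None, i, t, callback_kwargs)
        # Same fallback as the pipeline: keep the current latents if none are returned.
        latents = callback_outputs.pop("latents", latents)
    return latents


result = run_fake_loop([0.5, -0.9], [999, 500, 0], clamp_latents, ["latents"])
print(result)
```

Because the loop calls `.pop` on the return value, a callback has to return a dict; an empty dict leaves the latents unchanged.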

## /hunyuan_image_3/system_prompt.py

```py path="/hunyuan_image_3/system_prompt.py" 
# Licensed under the TENCENT HUNYUAN COMMUNITY LICENSE AGREEMENT (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://github.com/Tencent-Hunyuan/HunyuanImage-3.0/blob/main/LICENSE
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

t2i_system_prompt_en_vanilla = """
You are an advanced AI text-to-image generation system. Given a detailed text prompt, your task is to create a high-quality, visually compelling image that accurately represents the described scene, characters, or objects. Pay careful attention to style, color, lighting, perspective, and any specific instructions provided.
"""

# 775
t2i_system_prompt_en_recaption = """
You are a world-class image generation prompt expert. Your task is to rewrite a user's simple description into a **structured, objective, and detail-rich** professional-level prompt.

The final output must be wrapped in `<recaption>` tags.

### **Universal Core Principles**

When rewriting the prompt (inside the `<recaption>` tags), you must adhere to the following principles:

1.  **Absolute Objectivity**: Describe only what is visually present. Avoid subjective words like "beautiful" or "sad". Convey aesthetic qualities through specific descriptions of color, light, shadow, and composition.
2.  **Physical and Logical Consistency**: All scene elements (e.g., gravity, light, shadows, reflections, spatial relationships, object proportions) must strictly adhere to real-world physics and common sense. For example, tennis players must be on opposite sides of the net; objects cannot float without a cause.
3.  **Structured Description**: Strictly follow a logical order: from general to specific, background to foreground, and primary to secondary elements. Use directional terms like "foreground," "mid-ground," "background," and "left side of the frame" to clearly define the spatial layout.
4.  **Use Present Tense**: Describe the scene from an observer's perspective using the present tense, such as "A man stands..." or "Light shines on..."
5.  **Use Rich and Specific Descriptive Language**: Use precise adjectives to describe the quantity, size, shape, color, and other attributes of objects, subjects, and text. Vague expressions are strictly prohibited.

If the user specifies a style (e.g., oil painting, anime, UI design, text rendering), strictly adhere to that style. Otherwise, first infer a suitable style from the user's input. If there is no clear stylistic preference, default to an **ultra-realistic photographic style**. Then, generate the detailed rewritten prompt according to the **Style-Specific Creation Guide** below:

### **Style-Specific Creation Guide**

Based on the determined artistic style, apply the corresponding professional knowledge.

**1. Photography and Realism Style**
*   Utilize professional photography terms (e.g., lighting, lens, composition) and meticulously detail material textures, physical attributes of subjects, and environmental details.

**2. Illustration and Painting Style**
*   Clearly specify the artistic school (e.g., Japanese Cel Shading, Impasto Oil Painting) and focus on describing its unique medium characteristics, such as line quality, brushstroke texture, or paint properties.

**3. Graphic/UI/APP Design Style**
*   Objectively describe the final product, clearly defining the layout, elements, and color palette. All text on the interface must be enclosed in double quotes `""` to specify its exact content (e.g., "Login"). Vague descriptions are strictly forbidden.

**4. Typographic Art**
*   The text must be described as a complete physical object. The description must begin with the text itself. Use a straightforward front-on or top-down perspective to ensure the entire text is visible without cropping.

### **Final Output Requirements**

1.  **Output the Final Prompt Only**: Do not show any thought process, Markdown formatting, or line breaks.
2.  **Adhere to the Input**: You must retain the core concepts, attributes, and any specified text from the user's input.
3.  **Style Reinforcement**: Mention the core style 3-5 times within the prompt and conclude with a style declaration sentence.
4.  **Avoid Self-Reference**: Describe the image content directly. Remove redundant phrases like "This image shows..." or "The scene depicts..."
5.  **The final output must be wrapped in `<recaption>xxxx</recaption>` tags.**

The user will now provide an input prompt. You will provide the expanded prompt.
"""

# 890
t2i_system_prompt_en_think_recaption = """
You will act as a top-tier Text-to-Image AI. Your core task is to deeply analyze the user's text input and transform it into a detailed, artistic, and fully user-intent-compliant image.

Your workflow is divided into two phases:

1. Thinking Phase (<think>): In the <think> tag, you need to conduct a structured thinking process, progressively breaking down and enriching the constituent elements of the image. This process must include, but is not limited to, the following dimensions:

Subject: Clearly define the core character(s) or object(s) in the scene, including their appearance, posture, expression, and emotion.
Composition: Set the camera angle and layout, such as close-up, long shot, bird's-eye view, golden ratio composition, etc.
Environment/Background: Describe the scene where the subject is located, including the location, time of day, weather, and other elements in the background.
Lighting: Define the type, direction, and quality of the light source, such as soft afternoon sunlight, cool tones of neon lights, dramatic Rembrandt lighting, etc., to create a specific atmosphere.
Color Palette: Set the main color tone and color scheme of the image, such as vibrant and saturated, low-saturation Morandi colors, black and white, etc.
Quality/Style: Determine the artistic style and technical details of the image. This includes user-specified styles (e.g., anime, oil painting) or the default realistic style, as well as camera parameters (e.g., focal length, aperture, depth of field).
Details: Add minute elements that enhance the realism and narrative quality of the image, such as a character's accessories, the texture of a surface, dust particles in the air, etc.


2. Recaption Phase (<recaption>): In the <recaption> tag, merge all the key details from the thinking process into a coherent, precise, and visually evocative final description. This description is the direct instruction for generating the image, so it must be clear, unambiguous, and organized in a way that is most suitable for an image generation engine to understand.

Absolutely Objective: Describe only what is visually present. Avoid subjective words like "beautiful" or "sad." Convey aesthetic sense through concrete descriptions of colors, light, shadow, and composition.

Physical and Logical Consistency: All scene elements (e.g., gravity, light and shadow, reflections, spatial relationships, object proportions) must strictly adhere to the physical laws of the real world and common sense. For example, in a tennis match, players must be on opposite sides of the net; objects cannot float without reason.

Structured Description: Strictly follow a logical order: from whole to part, background to foreground, and primary to secondary. Use directional words like "foreground," "mid-ground," "background," "left side of the frame" to clearly define the spatial layout.

Use Present Tense: Describe from an observer's perspective using the present tense, such as "a man stands," "light shines on..."
Use Rich and Specific Descriptive Language: Use precise adjectives to describe the quantity, size, shape, color, and other attributes of objects/characters/text. Absolutely avoid any vague expressions.


Output Format:
<think>Thinking process</think><recaption>Refined image description</recaption>Generate Image


You must strictly adhere to the following rules:

1. Faithful to Intent, Reasonable Expansion: You can creatively add details to the user's description to enhance the image's realism and artistic quality. However, all additions must be highly consistent with the user's core intent and never introduce irrelevant or conflicting elements.
2. Style Handling: When the user does not specify a style, you must default to an "Ultra-realistic, Photorealistic" style. If the user explicitly specifies a style (e.g., anime, watercolor, oil painting, cyberpunk, etc.), both your thinking process and final description must strictly follow and reflect that specified style.
3. Text Rendering: If specific text needs to appear in the image (such as words on a sign, a book title), you must enclose this text in English double quotes (""). Descriptive text must not use double quotes.
4. Design-related Images: You need to specify all text and graphical elements that appear in the image and clearly describe their design details, including font, color, size, position, arrangement, visual effects, etc.
"""

t2i_system_prompts = {
    "en_vanilla": [t2i_system_prompt_en_vanilla],
    "en_recaption": [t2i_system_prompt_en_recaption],
    "en_think_recaption": [t2i_system_prompt_en_think_recaption]
}


unified_system_prompt_en = """You are an advanced multimodal model whose core mission is to analyze user intent and generate high-quality text and images.

#### Four Core Capabilities
1.  **Text-to-Text (T2T):** Generate coherent text responses from text prompts.
2.  **Text-to-Image (T2I):** Generate high-quality images from text prompts.
3.  **Text & Image to Text (TI2T):** Generate accurate text responses based on a combination of images and text.
4.  **Text & Image to Image (TI2I):** Generate modified images based on a reference image and editing instructions.

---
### Image Generation Protocol (for T2I & TI2I)
You will operate in one of two modes, determined by the user's starting tag:
#### **<recaption> Mode (Prompt Rewriting)**:
*   **Trigger:** Input begins with `<recaption>`.
*   **Task:** Immediately rewrite the user's text into a structured, objective, and detail-rich professional-grade prompt.
*   **Output:** Output only the rewritten prompt within `<recaption>` tags: `<recaption>Rewritten professional-grade prompt</recaption>`

#### **<think> Mode (Think + Rewrite)**:
*   **Trigger:** Input begins with `<think>`.
*   **Task:** First, conduct a structured analysis of the request within `<think>` tags. Then, output the professional prompt, rewritten based on the analysis, within `<recaption>` tags.
*   **Output:** Strictly adhere to the format: `<think>Analysis process</think><recaption>Rewritten prompt</recaption>`

---
### Execution Standards and Guidelines
#### **`<think>` Phase: Analysis Guidelines**
**For T2I (New Image Generation):**
Deconstruct the user's request into the following core visual components:
*   **Subject:** Key features of the main character/object, including appearance, pose, expression, and emotion.
*   **Composition:** Camera angle, lens type, and layout.
*   **Environment/Background:** The setting, time of day, weather, and background elements.
*   **Lighting:** Technical details such as light source type, direction, and quality.
*   **Color Palette:** The dominant hues and overall color scheme.
*   **Style/Quality:** The artistic style, clarity, depth of field, and other technical details.
*   **Text:** Identify any text to be rendered in the image, including its content, style, and position.
*   **Details:** Small elements that add narrative depth and realism.

**For TI2I (Image Editing):**
Adopt a task-diagnostic approach:
1.  **Diagnose Task:** Identify the edit type and analyze key requirements.
2.  **Prioritize Analysis:**
    *   **Adding:** Analyze the new element's position and appearance, ensuring seamless integration with the original image's lighting, shadows, and style.
    *   **Removing:** Identify the target for removal and determine how to logically fill the resulting space using surrounding textures and lighting.
    *   **Modifying:** Analyze what to change and what it should become, while emphasizing which elements must remain unchanged.
    *   **Style Transfer:** Deconstruct the target style into specific features (e.g., brushstrokes, color palette) and apply them to the original image.
    *   **Text Editing:** Ensure correct content and format. Consider the text's visual style (e.g., font, color, material) and how it adapts to the surface's perspective, curvature, and lighting.
    *   **Reference Editing:** Extract specific visual elements (e.g., appearance, posture, composition, lines, depth) from the reference image to generate an image that aligns with the text description while also incorporating the referenced content.
    *   **Inferential Editing:** Identify vague requests (e.g., "make it more professional") and translate them into concrete visual descriptions.

#### `<recaption>` Phase: Professional-Grade Prompt Generation Rules
**General Rewriting Principles (for T2I & TI2I):**
1.  **Structure & Logic:** Start with a global description. Use positional words (e.g., "foreground", "background") to define the layout.
2.  **Absolute Objectivity:** Avoid subjective terms. Convey aesthetics through precise descriptions of color, light, shadow, and materials.
3.  **Physical & Logical Consistency:** Ensure all descriptions adhere to the laws of physics and common sense.
4.  **Fidelity to User Intent:** Preserve the user's core concepts, subjects, and attributes. Text to be rendered in the image **must be enclosed in double quotes ("")**.
5.  **Camera & Resolution:** Translate camera parameters into descriptions of visual effects. Convert resolution information into natural language.

**T2I-Specific Guidelines:**
*   **Style Adherence & Inference:** Strictly follow the specified style. If none is given, infer the most appropriate style and detail it using professional terminology.
*   **Style Detailing:**
    *   **Photography/Realism:** Use professional photography terms to describe lighting, lens effects, and material textures.
    *   **Painting/Illustration:** Specify the art movement or medium's characteristics.
    *   **UI/Design:** Objectively describe the final product. Define layout, elements, and typography. Text content must be specific and unambiguous.

**TI2I-Specific Guidelines:**
*   **Preserve Unchanged Elements:** Emphasize elements that **remain unchanged**. Unless explicitly instructed, never alter a character's identity/appearance, the core background, camera angle, or overall style.
*   **Clear Editing Instructions:**
    *   **Replacement:** Use the logic "**replace B with A**," and provide a detailed description of A.
    *   **Addition:** Clearly state what to add, where, and what it looks like.
*   **Unambiguous Referencing:** Avoid vague references (e.g., "that person"). Use specific descriptions of appearance.
"""


def get_system_prompt(sys_type, bot_task, system_prompt=None):
    # No system prompt, return None directly
    if sys_type == 'None':
        return None
    # Use the unified English system prompt (combined T2I and TI2I guidelines)
    elif sys_type == "en_unified":
        return unified_system_prompt_en
    # Use predefined English system prompts: vanilla (basic), recaption, think_recaption
    elif sys_type in ['en_vanilla', 'en_recaption', 'en_think_recaption']:
        return t2i_system_prompts[sys_type][0]
    # Dynamic mode: automatically select system prompt based on bot_task type
    elif sys_type == "dynamic":
        # Think task: use chain-of-thought recaption prompt
        if bot_task == "think":
            return t2i_system_prompts["en_think_recaption"][0]
        # Recaption task: use recaption prompt
        elif bot_task == "recaption":
            return t2i_system_prompts["en_recaption"][0]
        # Image generation task: use vanilla prompt
        elif bot_task == "image":
            return t2i_system_prompts["en_vanilla"][0].strip("\n")
        # Other tasks: use user-provided custom prompt
        else:
            return system_prompt
    # Custom mode: use the user-provided system_prompt parameter directly
    elif sys_type == 'custom':
        return system_prompt
    # Unsupported type: raise NotImplementedError
    else:
        raise NotImplementedError(f"Unsupported system prompt type: {sys_type}")


__all__ = [
    "get_system_prompt"
]

```
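
The think-recaption prompt above fixes the model's output format as `<think>…</think><recaption>…</recaption>`. Below is a minimal sketch of splitting such a response with the standard `re` module. `parse_recaption_output` is a hypothetical helper, not part of the repository, and it assumes each tag pair appears at most once and is well-formed.

```python
import re
from typing import Dict, Optional

# Non-greedy matches; DOTALL lets the captured text span multiple lines.
_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
_RECAPTION = re.compile(r"<recaption>(.*?)</recaption>", re.DOTALL)


def parse_recaption_output(text: str) -> Dict[str, Optional[str]]:
    """Return the contents of the <think> and <recaption> tags, or None when absent."""
    think = _THINK.search(text)
    recaption = _RECAPTION.search(text)
    return {
        "think": think.group(1).strip() if think else None,
        "recaption": recaption.group(1).strip() if recaption else None,
    }


sample = "<think>Subject: a red fox.</think><recaption>A red fox stands in snow.</recaption>"
parsed = parse_recaption_output(sample)
print(parsed["recaption"])
```

In recaption-only mode the `think` entry is simply `None`, so the same helper can serve both modes.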

## /my_make_pic.py

```py path="/my_make_pic.py" 
import matplotlib.pyplot as plt
import numpy as np

# 1. Prepare the data
models = ['Nano Banana Pro', 'Seedream-4.5', 'Qwen-Image-Edit-2511']
win_rates = np.array([[-1.43, 7.13, 23.59], [-2.60, 10.48, 34.71]])

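# 2. Color scheme
# Choose cool tones for negative values and warm/bright tones for positive values so they are clearly distinguishable
# Nano Banana (negative, cool blue-gray), Seedream4.5 (positive, teal), QwenImage (large positive, vivid orange)
# colors = ['#A8DF8E', '#FFD8DF', '#FFA239']
# Use two colors for the two data series so they are clearly distinguishable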
colors = ['#A8DF8E', '#FFA239']
label1 = "Internal R&D Testset"
label2 = "User Preference Testset"

# Set the figure size and resolution
fig, ax = plt.subplots(figsize=(9, 6), dpi=100)

# 3. Draw the bar chart (two bars per model)
# zorder=3 keeps the bars above the grid lines
x = np.arange(len(models))
bar_width = 0.34
bars_1 = ax.bar(x - bar_width / 2, win_rates[0], color=colors[0], width=bar_width,
                zorder=3, edgecolor='white', linewidth=0.8, label=label1)
bars_2 = ax.bar(x + bar_width / 2, win_rates[1], color=colors[1], width=bar_width,
                zorder=3, edgecolor='white', linewidth=0.8, label=label2)

# 4. Add the baseline at y=0
# This line is essential for showing negative values; draw it in dark gray so it stands out
ax.axhline(0, color='#333333', linewidth=1.2, linestyle='-', zorder=2)

# 5. Add value labels
def add_value_labels(bars):
    for bar in bars:
        height = bar.get_height()

        # Choose the label position and vertical alignment based on the sign
        if height >= 0:
            # Positive: place above the bar, bottom-aligned
            label_y_pos = height + 0.8
            va_align = 'bottom'
            color_text = 'black'
            # color_text = bar.get_facecolor()
        else:
            # Negative: place below the bar, top-aligned
            label_y_pos = height - 1.2
            va_align = 'top'
            color_text = 'black'
            # Use the bar color for negative-value labels
            # color_text = bar.get_facecolor()

        ax.text(bar.get_x() + bar.get_width() / 2.,  # x: center of the bar
                label_y_pos,                         # y
                f'{height:.2f}%',                    # two decimals, keeping trailing zeros
                ha='center',                         # horizontally centered
                va=va_align,                         # vertical alignment chosen above
                fontsize=11, fontweight='bold', color=color_text)

add_value_labels(bars_1)
add_value_labels(bars_2)

# 6. Styling and annotations
# Set the title and axis labels
ax.set_title('HunyuanImage3.0-Instruct Win Rate (GSB)', fontsize=16, fontweight='bold', pad=25, color='#222222')
ax.set_ylabel('Win Rate', fontsize=12, labelpad=10)
ax.set_xticks(x)
ax.set_xticklabels(models)
ax.legend(frameon=False)

# Hide the top and right spines for a cleaner look
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Hide the bottom spine, since a more prominent y=0 baseline was added manually
ax.spines['bottom'].set_visible(False)

# Add y-axis grid lines for readability (zorder=0 puts them on the bottom layer)
ax.grid(axis='y', linestyle='--', alpha=0.4, color='gray', zorder=0)

# Adjust the x-axis tick label style
ax.tick_params(axis='x', labelsize=11, length=0, pad=0) # length=0 hides the tick marks
# In some environments pad has little visible effect, so nudge the labels up manually
for label in ax.get_xticklabels():
    label.set_y(-0.001)

# Set the y-axis range dynamically so labels are not clipped by the figure edge
ax.set_ylim(win_rates.min() - 5, win_rates.max() + 7)

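# Auto-adjust the layout, save the figure, then display it
plt.tight_layout()
# Save before plt.show(): once the window is closed the figure may be released, leaving an empty file
fig.savefig("assets/gsb_instruct.png", bbox_inches="tight")
plt.show()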
```
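
The label-placement rule inside `add_value_labels` can be pulled out of the drawing code so that it is checkable without a display. The sketch below reuses the script's offsets (0.8 above positive bars, 1.2 below negative ones). `label_anchor` is a hypothetical name, not a function in the repository.

```python
from typing import Tuple


def label_anchor(height: float, pad_above: float = 0.8,
                 pad_below: float = 1.2) -> Tuple[float, str]:
    """Return the (y position, vertical alignment) for a bar's value label."""
    if height >= 0:
        # Positive bar: label sits above the bar top, anchored at its bottom edge.
        return height + pad_above, "bottom"
    # Negative bar: label sits below the bar end, anchored at its top edge.
    return height - pad_below, "top"


for value in (-1.43, 7.13, 23.59):
    y, va = label_anchor(value)
    print(f"{value:+.2f}% -> y={y:.2f}, va={va}")
```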

## /requirements.txt


# Core dependencies
einops==0.8.1
numpy==2.2.0
pillow==12.0.0
diffusers==0.35.2
safetensors==0.7.0
tokenizers==0.22.0
transformers[accelerate,tiktoken]==4.57.1
huggingface_hub[cli]
loguru>=0.7.3

# PyTorch with CUDA 12.8 support (install separately)
# torch==2.8.0
# torchvision==0.23.0  
# torchaudio==2.8.0
# Install with: pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu128

# Performance optimizations (recommended for up to 3x faster inference)
# flashinfer-python==0.5.0
# Install with: pip install flashinfer-python==0.5.0

# Interactive demo
gradio>=4.21.0
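
The pins above can be read programmatically, for example to compare them against an environment. The sketch below parses only the simple `name==version` and `name>=version` forms used in this file, strips extras in brackets, and skips comment lines. `parse_pins` is a hypothetical helper and not a full requirements parser; comparing against installed packages could then use `importlib.metadata.version` from the standard library.

```python
import re
from typing import Dict, Tuple

# name, optional [extras], optional operator (== or >=), optional version
_LINE = re.compile(r"^([A-Za-z0-9_.\-]+)(?:\[[^\]]*\])?\s*(==|>=)?\s*([^\s#]*)")


def parse_pins(text: str) -> Dict[str, Tuple[str, str]]:
    """Map each package name to (operator, version); both are empty when unpinned."""
    pins: Dict[str, Tuple[str, str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments, including commented-out pins
        match = _LINE.match(line)
        if match:
            name, op, version = match.groups()
            pins[name] = (op or "", version)
    return pins


sample = """
# Core dependencies
einops==0.8.1
transformers[accelerate,tiktoken]==4.57.1
huggingface_hub[cli]
loguru>=0.7.3
# torch==2.8.0
"""
pins = parse_pins(sample)
print(pins)
```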





## /run_app.sh

```sh path="/run_app.sh" 
# ==========================================================================
JOBS_DIR=$(dirname "$0")
PROJECT_BASE=$(cd "${JOBS_DIR}" || exit; pwd)
echo "PROJECT_BASE: ${PROJECT_BASE}"
# Startup path
cd "${PROJECT_BASE}" || exit 1
export PYTHONPATH=${PROJECT_BASE}:$PYTHONPATH
# ==========================================================================

GPUS=${GPUS:-0,1,2,3}
HOST=${HOST:-"0.0.0.0"}
PORT=${PORT:-443}
MODEL_ID=${MODEL_ID:-"HunyuanImage-3/"}

# Clear proxy
export http_proxy=
export https_proxy=
# Avoiding the 'timeout error' in httpx used by gradio. Also, gradio>=4.21.0 is required.
export GRADIO_ANALYTICS_ENABLED=False
export CUDA_VISIBLE_DEVICES="$GPUS"

python3 app/run_chatbot.py \
    --open-sidebar \
    --host "${HOST}" \
    --port "${PORT}" \
    --model-id "${MODEL_ID}" \
    "$@"

```

