doc: update README to include support for SDXL models and additional command-line options for gen_img.py

2026-04-16 17:02:45 +00:00 · 2025-09-01 21:32:24 +09:00
parent 142d0be180
commit 9984868154
2 changed files with 144 additions and 12 deletions
--- a/docs/gen_img_README-ja.md
+++ b/docs/gen_img_README-ja.md
@@ -3,7 +3,7 @@ SD 1.xおよび2.xのモデル、当リポジトリで学習したLoRA、Control
 # 概要

 * Diffusers (v0.10.2) ベースの推論（画像生成）スクリプト。
-* SD 1.xおよび2.x (base/v-parameterization)モデルに対応。
+* SD 1.x、2.x (base/v-parameterization)、およびSDXLモデルに対応。
 * txt2img、img2img、inpaintingに対応。
 * 対話モード、およびファイルからのプロンプト読み込み、連続生成に対応。
 * プロンプト1行あたりの生成枚数を指定可能。
@@ -96,14 +96,20 @@ python gen_img_diffusers.py --ckpt <モデル名> --outdir <画像出力先>

 - `--ckpt <モデル名>`：モデル名を指定します。`--ckpt`オプションは必須です。Stable Diffusionのcheckpointファイル、またはDiffusersのモデルフォルダ、Hugging FaceのモデルIDを指定できます。

+- `--v1`：Stable Diffusion 1.x系のモデルを使う場合に指定します。これがデフォルトの動作です。
+
 - `--v2`：Stable Diffusion 2.x系のモデルを使う場合に指定します。1.x系の場合には指定不要です。

+- `--sdxl`：Stable Diffusion XLモデルを使う場合に指定します。
+
 - `--v_parameterization`：v-parameterizationを使うモデルを使う場合に指定します（`768-v-ema.ckpt`およびそこからの追加学習モデル、Waifu Diffusion v1.5など）。
    
-    `--v2`の指定有無が間違っているとモデル読み込み時にエラーになります。`--v_parameterization`の指定有無が間違っていると茶色い画像が表示されます。
+    `--v2`や`--sdxl`の指定有無が間違っているとモデル読み込み時にエラーになります。`--v_parameterization`の指定有無が間違っていると茶色い画像が表示されます。

 - `--vae`：使用するVAEを指定します。未指定時はモデル内のVAEを使用します。

+- `--tokenizer_cache_dir`：トークナイザーのキャッシュディレクトリを指定します（オフライン利用のため）。
+
 ## 画像生成と出力

 - `--interactive`：インタラクティブモードで動作します。プロンプトを入力すると画像が生成されます。
@@ -112,6 +118,10 @@ python gen_img_diffusers.py --ckpt <モデル名> --outdir <画像出力先>

 - `--from_file <プロンプトファイル名>`：プロンプトが記述されたファイルを指定します。1行1プロンプトで記述してください。なお画像サイズやguidance scaleはプロンプトオプション（後述）で指定できます。

+- `--from_module <モジュールファイル>`：Pythonモジュールからプロンプトを読み込みます。モジュールは`get_prompter(args, pipe, networks)`関数を実装している必要があります。
+
+- `--prompter_module_args`：prompterモジュールに渡す追加の引数を指定します。
+
 - `--W <画像幅>`：画像の幅を指定します。デフォルトは`512`です。

 - `--H <画像高さ>`：画像の高さを指定します。デフォルトは`512`です。
@@ -132,6 +142,24 @@ python gen_img_diffusers.py --ckpt <モデル名> --outdir <画像出力先>

 - `--negative_scale` : uncoditioningのguidance scaleを個別に指定します。[gcem156氏のこちらの記事](https://note.com/gcem156/n/ne9a53e4a6f43)を参考に実装したものです。

+- `--emb_normalize_mode`：embedding正規化モードを指定します。"original"（デフォルト）、"abs"、"none"から選択できます。プロンプトの重みの正規化方法に影響します。
+
+## SDXL固有のオプション
+
+SDXL モデル（`--sdxl`フラグ付き）を使用する場合、追加のコンディショニングオプションが利用できます：
+
+- `--original_height`：SDXL コンディショニング用の元の高さを指定します。これはモデルの対象解像度の理解に影響します。
+
+- `--original_width`：SDXL コンディショニング用の元の幅を指定します。これはモデルの対象解像度の理解に影響します。
+
+- `--original_height_negative`：SDXL ネガティブコンディショニング用の元の高さを指定します。
+
+- `--original_width_negative`：SDXL ネガティブコンディショニング用の元の幅を指定します。
+
+- `--crop_top`：SDXL コンディショニング用のクロップ上オフセットを指定します。
+
+- `--crop_left`：SDXL コンディショニング用のクロップ左オフセットを指定します。
+
 ## メモリ使用量や生成速度の調整

 - `--batch_size <バッチサイズ>`：バッチサイズを指定します。デフォルトは`1`です。バッチサイズが大きいとメモリを多く消費しますが、生成速度が速くなります。
@@ -139,8 +167,16 @@ python gen_img_diffusers.py --ckpt <モデル名> --outdir <画像出力先>
 - `--vae_batch_size <VAEのバッチサイズ>`：VAEのバッチサイズを指定します。デフォルトはバッチサイズと同じです。
    VAEのほうがメモリを多く消費するため、デノイジング後（stepが100%になった後）でメモリ不足になる場合があります。このような場合にはVAEのバッチサイズを小さくしてください。

+- `--vae_slices <スライス数>`：VAE処理時に画像をスライスに分割してVRAM使用量を削減します。None（デフォルト）で分割なし。16や32のような値が推奨されます。有効にすると処理が遅くなりますが、VRAM使用量が少なくなります。
+
+- `--no_half_vae`：VAE処理でfp16/bf16精度の使用を防ぎます。代わりにfp32を使用します。VAE関連の問題やアーティファクトが発生した場合に使用してください。
+
 - `--xformers`：xformersを使う場合に指定します。

+- `--sdpa`：最適化のためにPyTorch 2のscaled dot-product attentionを使用します。
+
+- `--diffusers_xformers`：Diffusers経由でxformersを使用します（注：Hypernetworksと互換性がありません）。
+
 - `--fp16`：fp16（単精度）での推論を行います。`fp16`と`bf16`をどちらも指定しない場合はfp32（単精度）での推論を行います。

 - `--bf16`：bf16（bfloat16）での推論を行います。RTX 30系のGPUでのみ指定可能です。`--bf16`オプションはRTX 30系以外のGPUではエラーになります。`fp16`よりも`bf16`のほうが推論結果がNaNになる（真っ黒の画像になる）可能性が低いようです。
@@ -157,6 +193,12 @@ python gen_img_diffusers.py --ckpt <モデル名> --outdir <画像出力先>

 - `--network_pre_calc`：使用する追加ネットワークの重みを生成ごとにあらかじめ計算します。プロンプトオプションの`--am`が使用できます。LoRA未使用時と同じ程度まで生成は高速化されますが、生成前に重みを計算する時間が必要で、またメモリ使用量も若干増加します。Regional LoRA使用時は無効になります 。

+- `--network_regional_mask_max_color_codes`：リージョナルマスクに使用する色コードの最大数を指定します。指定されていない場合、マスクはチャンネルごとに適用されます。Regional LoRAと組み合わせて、マスク内の色で定義できるリージョン数を制御するために使用されます。
+
+- `--network_args`：key=value形式でネットワークモジュールに渡す追加引数を指定します。例: `--network_args "alpha=1.0,dropout=0.1"`。
+
+- `--network_merge_n_models`：ネットワークマージを使用する場合、マージするモデル数を指定します（全ての読み込み済みネットワークをマージする代わりに）。
+
 # 主なオプションの指定例

 次は同一プロンプトで64枚をバッチサイズ4で一括生成する例です。
@@ -235,7 +277,9 @@ python gen_img_diffusers.py --ckpt model.safetensors

 - `--sequential_file_name`：ファイル名を連番にするかどうかを指定します。指定すると生成されるファイル名が`im_000001.png`からの連番になります。

-- `--use_original_file_name`：指定すると生成ファイル名がオリジナルのファイル名と同じになります。
+- `--use_original_file_name`：指定すると生成ファイル名がオリジナルのファイル名の前に追加されます（img2imgモード用）。
+
+- `--clip_vision_strength`：指定した強度でimg2img用のCLIP Vision Conditioningを有効にします。CLIP Visionモデルを使用して入力画像からのコンディショニングを強化します。

 ## コマンドラインからの実行例

@@ -306,7 +350,9 @@ img2imgと併用できません。
 - `--highres_fix_upscaler`：2nd stageに任意のupscalerを利用します。現在は`--highres_fix_upscaler tools.latent_upscaler` のみ対応しています。

 - `--highres_fix_upscaler_args`：`--highres_fix_upscaler`で指定したupscalerに渡す引数を指定します。
-    `tools.latent_upscaler`の場合は、`--highres_fix_upscaler_args "weights=D:\Work\SD\Models\others\etc\upscaler-v1-e100-220.safetensors"`のように重みファイルを指定します。 
+    `tools.latent_upscaler`の場合は、`--highres_fix_upscaler_args "weights=D:\Work\SD\Models\others\etc\upscaler-v1-e100-220.safetensors"`のように重みファイルを指定します。
+
+- `--highres_fix_disable_control_net`：Highres fixの2nd stageでControlNetを無効にします。デフォルトでは、ControlNetは両ステージで使用されます。

 コマンドラインの例です。

@@ -319,6 +365,34 @@ python gen_img_diffusers.py  --ckpt trinart_characters_it4_v1_vae_merged.ckpt
    --highres_fix_scale 0.5 --highres_fix_steps 28 --strength 0.5
 ```

+## Deep Shrink
+
+Deep Shrinkは、異なるタイムステップで異なる深度のUNetを使用して生成プロセスを最適化する技術です。生成品質と効率を向上させることができます。
+
+以下のオプションがあります：
+
+- `--ds_depth_1`：第1フェーズでこの深度のDeep Shrinkを有効にします。有効な値は0から8です。
+
+- `--ds_timesteps_1`：このタイムステップまでDeep Shrink深度1を適用します。デフォルトは650です。
+
+- `--ds_depth_2`：Deep Shrinkの第2フェーズの深度を指定します。
+
+- `--ds_timesteps_2`：このタイムステップまでDeep Shrink深度2を適用します。デフォルトは650です。
+
+- `--ds_ratio`：Deep Shrinkでのダウンサンプリングの比率を指定します。デフォルトは0.5です。
+
+これらのパラメータはプロンプトオプションでも指定できます：
+
+- `--dsd1`：プロンプトからDeep Shrink深度1を指定します。
+  
+- `--dst1`：プロンプトからDeep Shrinkタイムステップ1を指定します。
+  
+- `--dsd2`：プロンプトからDeep Shrink深度2を指定します。
+  
+- `--dst2`：プロンプトからDeep Shrinkタイムステップ2を指定します。
+  
+- `--dsr`：プロンプトからDeep Shrink比率を指定します。
+
 ## ControlNet

 現在はControlNet 1.0のみ動作確認しています。プリプロセスはCannyのみサポートしています。
@@ -346,6 +420,20 @@ python gen_img_diffusers.py --ckpt model_ckpt --scale 8 --steps 48 --outdir txt2
    --guide_image_path guide.png --control_net_ratios 1.0 --interactive
 ```

+## ControlNet-LLLite
+
+ControlNet-LLLiteは、類似の誘導目的に使用できるControlNetの軽量な代替手段です。
+
+以下のオプションがあります：
+
+- `--control_net_lllite_models`：ControlNet-LLLiteモデルファイルを指定します。
+
+- `--control_net_multipliers`：ControlNet-LLLiteの倍率を指定します（重みに類似）。
+
+- `--control_net_ratios`：ControlNet-LLLiteを適用するステップの比率を指定します。
+
+注意：ControlNetとControlNet-LLLiteは同時に使用できません。
+
 ## Attention Couple + Reginal LoRA

 プロンプトをいくつかの部分に分割し、それぞれのプロンプトを画像内のどの領域に適用するかを指定できる機能です。個別のオプションはありませんが、`mask_path`とプロンプトで指定します。
@@ -450,7 +538,9 @@ python gen_img_diffusers.py --ckpt wd-v1-3-full-pruned-half.ckpt

 - `--opt_channels_last` : 推論時にテンソルのチャンネルを最後に配置します。場合によっては高速化されることがあります。

-- `--network_show_meta` : 追加ネットワークのメタデータを表示します。
+- `--shuffle_prompts`：繰り返し時にプロンプトの順序をシャッフルします。`--from_file`で複数のプロンプトを使用する場合に便利です。
+
+- `--network_show_meta`：追加ネットワークのメタデータを表示します。


 --- 
@@ -478,6 +568,8 @@ latentのサイズを徐々に大きくしていくHires fixです。`gen_img.py
 - `--gradual_latent_ratio` : latentの初期サイズを指定します。デフォルトは 0.5 で、デフォルトの latent サイズの半分のサイズから始めます。
 - `--gradual_latent_ratio_step`: latentのサイズを大きくする割合を指定します。デフォルトは 0.125 で、latentのサイズを 0.625, 0.75, 0.875, 1.0 と徐々に大きくします。
 - `--gradual_latent_ratio_every_n_steps`: latentのサイズを大きくする間隔を指定します。デフォルトは 3 で、3ステップごとに latent のサイズを大きくします。
+- `--gradual_latent_s_noise`：Gradual LatentのS_noiseパラメータを指定します。デフォルトは1.0です。
+- `--gradual_latent_unsharp_params`：Gradual Latentのアンシャープマスクパラメータをksize,sigma,strength,target-x形式で指定します（target-x: 1=True, 0=False）。推奨値：`3,0.5,0.5,1`または`3,1.0,1.0,0`。

 それぞれのオプションは、プロンプトオプション、`--glt`、`--glr`、`--gls`、`--gle` でも指定できます。

--- a/docs/gen_img_README.md
+++ b/docs/gen_img_README.md
@@ -4,7 +4,7 @@ This is an inference (image generation) script that supports SD 1.x and 2.x mode
 # Overview

 * Inference (image generation) script.
-* Supports SD 1.x and 2.x (base/v-parameterization) models.
+* Supports SD 1.x, 2.x (base/v-parameterization), and SDXL models.
 * Supports txt2img, img2img, and inpainting.
 * Supports interactive mode, prompt reading from files, and continuous generation.
 * The number of images generated per prompt line can be specified.
@@ -13,7 +13,7 @@ This is an inference (image generation) script that supports SD 1.x and 2.x mode
 * Supports xformers for high-speed generation.
    * Although xformers are used for memory-saving generation, it is not as optimized as Automatic 1111's Web UI, so it uses about 6GB of VRAM for 512*512 image generation.
 * Extension of prompts to 225 tokens. Supports negative prompts and weighting.
-* Supports various samplers from Diffusers (fewer samplers than Web UI).
+* Supports various samplers from Diffusers including ddim, pndm, lms, euler, euler_a, heun, dpm_2, dpm_2_a, dpmsolver, dpmsolver++, dpmsingle.
 * Supports clip skip (uses the output of the nth layer from the end) of Text Encoder.
 * Separate loading of VAE.
 * Supports CLIP Guided Stable Diffusion, VGG16 Guided Stable Diffusion, Highres. fix, and upscale.
@@ -100,14 +100,20 @@ Specify from the command line.

 - `--ckpt <model_name>`: Specifies the model name. The `--ckpt` option is mandatory. You can specify a Stable Diffusion checkpoint file, a Diffusers model folder, or a Hugging Face model ID.

+- `--v1`: Specify when using Stable Diffusion 1.x series models. This is the default behavior.
+
 - `--v2`: Specify when using Stable Diffusion 2.x series models. Not required for 1.x series.

+- `--sdxl`: Specify when using Stable Diffusion XL models.
+
 - `--v_parameterization`: Specify when using models that use v-parameterization (`768-v-ema.ckpt` and models with additional training from it, Waifu Diffusion v1.5, etc.).

-    If the `--v2` specification is incorrect, an error will occur when loading the model. If the `--v_parameterization` specification is incorrect, a brown image will be displayed.
+    If the `--v2` or `--sdxl` specification is incorrect, an error will occur when loading the model. If the `--v_parameterization` specification is incorrect, a brown image will be displayed.

 - `--vae`: Specifies the VAE to use. If not specified, the VAE in the model will be used.

+- `--tokenizer_cache_dir`: Specifies the cache directory for the tokenizer (for offline usage).
+
 ## Image Generation and Output

 - `--interactive`: Operates in interactive mode. Images are generated when prompts are entered.
@@ -118,6 +124,8 @@ Specify from the command line.

 - `--from_module <module_file>`: Loads prompts from a Python module. The module should implement a `get_prompter(args, pipe, networks)` function.

+- `--prompter_module_args`: Specifies additional arguments to pass to the prompter module.
+
 - `--W <image_width>`: Specifies the width of the image. The default is `512`.

 - `--H <image_height>`: Specifies the height of the image. The default is `512`.
@@ -126,7 +134,7 @@ Specify from the command line.

 - `--scale <guidance_scale>`: Specifies the unconditional guidance scale. The default is `7.5`.

-- `--sampler <sampler_name>`: Specifies the sampler. The default is `ddim`. ddim, pndm, dpmsolver, dpmsolver+++, lms, euler, euler_a provided by Diffusers can be specified (the last three can also be specified as k_lms, k_euler, k_euler_a).
+- `--sampler <sampler_name>`: Specifies the sampler. The default is `ddim`. The following samplers are supported: ddim, pndm, lms, euler, euler_a, heun, dpm_2, dpm_2_a, dpmsolver, dpmsolver++, dpmsingle. Some can also be specified with k_ prefix (k_lms, k_euler, k_euler_a, k_dpm_2, k_dpm_2_a).

 - `--outdir <image_output_destination_folder>`: Specifies the output destination for images.

@@ -140,6 +148,22 @@ Specify from the command line.

 - `--emb_normalize_mode`: Specifies the embedding normalization mode. Options are "original" (default), "abs", and "none". This affects how prompt weights are normalized.

+## SDXL-Specific Options
+
+When using SDXL models (with `--sdxl` flag), additional conditioning options are available:
+
+- `--original_height`: Specifies the original height for SDXL conditioning. This affects the model's understanding of the target resolution.
+
+- `--original_width`: Specifies the original width for SDXL conditioning. This affects the model's understanding of the target resolution.
+
+- `--original_height_negative`: Specifies the original height for SDXL negative conditioning.
+
+- `--original_width_negative`: Specifies the original width for SDXL negative conditioning.
+
+- `--crop_top`: Specifies the crop top offset for SDXL conditioning.
+
+- `--crop_left`: Specifies the crop left offset for SDXL conditioning.
+
 ## Adjusting Memory Usage and Generation Speed

 - `--batch_size <batch_size>`: Specifies the batch size. The default is `1`. A larger batch size consumes more memory but speeds up generation.
@@ -149,12 +173,14 @@ Specify from the command line.

 - `--vae_slices <number_of_slices>`: Splits the image into slices for VAE processing to reduce VRAM usage. None (default) for no splitting. Values like 16 or 32 are recommended. Enabling this is slower but uses less VRAM.

-- `--no_half_vae`: Prevents using fp16/bf16 precision for VAE processing. Uses fp32 instead.
+- `--no_half_vae`: Prevents using fp16/bf16 precision for VAE processing. Uses fp32 instead. Use this if you encounter VAE-related issues or artifacts.

 - `--xformers`: Specify when using xformers.

 - `--sdpa`: Use scaled dot-product attention in PyTorch 2 for optimization.

+- `--diffusers_xformers`: Use xformers via Diffusers (note: incompatible with Hypernetworks).
+
 - `--fp16`: Performs inference in fp16 (single precision). If neither `fp16` nor `bf16` is specified, inference is performed in fp32 (single precision).

 - `--bf16`: Performs inference in bf16 (bfloat16). Can only be specified for RTX 30 series GPUs. The `--bf16` option will cause an error on GPUs other than the RTX 30 series. It seems that `bf16` is less likely to result in NaN (black image) inference results than `fp16`.
@@ -173,6 +199,10 @@ Specify from the command line.

 - `--network_regional_mask_max_color_codes`: Specifies the maximum number of color codes to use for regional masks. If not specified, masks are applied by channel. Used with Regional LoRA to control the number of regions that can be defined by colors in the mask.

+- `--network_args`: Specifies additional arguments to pass to the network module in key=value format. For example: `--network_args "alpha=1.0,dropout=0.1"`.
+
+- `--network_merge_n_models`: When using network merging, specifies the number of models to merge (instead of merging all loaded networks).
+
 # Examples of Main Option Specifications

 The following is an example of batch generating 64 images with the same prompt and a batch size of 4.
@@ -259,7 +289,7 @@ Example:

 - `--sequential_file_name`: Specifies whether to make file names sequential. If specified, the generated file names will be sequential starting from `im_000001.png`.

-- `--use_original_file_name`: If specified, the generated file name will be the same as the original file name.
+- `--use_original_file_name`: If specified, the generated file name will be prepended with the original file name (for img2img mode).

 - `--clip_vision_strength`: Enables CLIP Vision Conditioning for img2img with the specified strength. Uses the CLIP Vision model to enhance conditioning from the input image.

@@ -375,6 +405,16 @@ These parameters can also be specified through prompt options:
  
 - `--dsr`: Specifies Deep Shrink ratio from the prompt.

+*Additional prompt options for Gradual Latent (requires `euler_a` sampler):*
+
+- `--glt`: Specifies the timestep to start increasing the size of the latent for Gradual Latent. Overrides the command line specification.
+
+- `--glr`: Specifies the initial size of the latent for Gradual Latent as a ratio. Overrides the command line specification.
+
+- `--gls`: Specifies the ratio to increase the size of the latent for Gradual Latent. Overrides the command line specification.
+
+- `--gle`: Specifies the interval to increase the size of the latent for Gradual Latent. Overrides the command line specification.
+
 ## ControlNet

 Currently, only ControlNet 1.0 has been confirmed to work. Only Canny is supported for preprocessing.
@@ -536,7 +576,7 @@ Gradual Latent is a Hires fix that gradually increases the size of the latent.
 - `--gradual_latent_ratio_step`: Specifies the ratio to increase the size of the latent. The default is 0.125, which means the latent size is gradually increased to 0.625, 0.75, 0.875, 1.0.
 - `--gradual_latent_ratio_every_n_steps`: Specifies the interval to increase the size of the latent. The default is 3, which means the latent size is increased every 3 steps.
 - `--gradual_latent_s_noise`: Specifies the s_noise parameter for Gradual Latent. Default is 1.0.
-- `--gradual_latent_unsharp_params`: Specifies unsharp mask parameters for Gradual Latent: ksize, sigma, strength, target-x (1 means True). Values like `3,0.5,0.5,1` or `3,1.0,1.0,0` are recommended.
+- `--gradual_latent_unsharp_params`: Specifies unsharp mask parameters for Gradual Latent in the format: ksize,sigma,strength,target-x (where target-x: 1=True, 0=False). Recommended values: `3,0.5,0.5,1` or `3,1.0,1.0,0`.

 Each option can also be specified with prompt options, `--glt`, `--glr`, `--gls`, `--gle`.
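
The `ksize,sigma,strength,target-x` format documented for `--gradual_latent_unsharp_params` can be illustrated with a small parser. This is only a sketch of how such a comma-separated value might be split and validated. The names `UnsharpParams` and `parse_unsharp_params` are hypothetical and are not taken from `gen_img.py`, whose actual parsing may differ.

```python
from typing import NamedTuple


class UnsharpParams(NamedTuple):
    """Hypothetical container for the four documented fields."""
    ksize: int
    sigma: float
    strength: float
    target_x: bool


def parse_unsharp_params(value: str) -> UnsharpParams:
    """Parse 'ksize,sigma,strength,target-x' where target-x is 1 (True) or 0 (False)."""
    parts = [p.strip() for p in value.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated values, got {len(parts)}")
    ksize, sigma, strength, target_x = parts
    if target_x not in ("0", "1"):
        raise ValueError("target-x must be 1 (True) or 0 (False)")
    return UnsharpParams(int(ksize), float(sigma), float(strength), target_x == "1")


if __name__ == "__main__":
    # The two recommended values from the README text.
    print(parse_unsharp_params("3,0.5,0.5,1"))
    print(parse_unsharp_params("3,1.0,1.0,0"))
```

Rejecting malformed input up front keeps a mistyped value from being silently misread as a different setting.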