Merge pull request #776 from kohya-ss/dev

add multiplier, steps range, dataset synthesis
2026-04-08 14:34:23 +00:00 · 2023-08-22 20:55:19 +09:00
parent 014c4b47c9 85f1114c4a
commit 71369ac98b
4 changed files with 291 additions and 26 deletions
--- a/docs/train_lllite_README-ja.md
+++ b/docs/train_lllite_README-ja.md
@@ -21,7 +21,9 @@ ComfyUIのカスタムノードを用意しています。: https://github.com/k
 ## モデルの学習

 ### データセットの準備
-通常のdatasetに加え、`conditioning_data_dir` で指定したディレクトリにconditioning imageを格納してください。conditioning imageは学習用画像と同じbasenameを持つ必要があります。また、conditioning imageは学習用画像と同じサイズに自動的にリサイズされます。
+通常のdatasetに加え、`conditioning_data_dir` で指定したディレクトリにconditioning imageを格納してください。conditioning imageは学習用画像と同じbasenameを持つ必要があります。また、conditioning imageは学習用画像と同じサイズに自動的にリサイズされます。conditioning imageにはキャプションファイルは不要です。
+
+たとえば DreamBooth 方式でキャプションファイルを用いる場合の設定ファイルは以下のようになります。

 ```toml
 [[datasets.subsets]]
@@ -32,9 +34,9 @@ conditioning_data_dir = "path/to/conditioning/image/dir"

 現時点の制約として、random_cropは使用できません。

-学習データとしては、元のモデルで生成した画像を学習用画像として、そこから加工した画像をconditioning imageとするのが良いようです。元モデルと異なる画風の画像を学習用画像とすると、制御に加えて、その画風についても学ぶ必要が生じます。ControlNet-LLLiteは容量が少ないため、画風学習には不向きです。
+学習データとしては、元のモデルで生成した画像を学習用画像として、そこから加工した画像をconditioning imageとした、合成によるデータセットを用いるのがもっとも簡単です（データセットの品質的には問題があるかもしれません）。具体的なデータセットの合成方法については後述します。

-もし生成画像以外を学習用画像とする場合には、後述の次元数を多めにしてください。
+なお、元モデルと異なる画風の画像を学習用画像とすると、制御に加えて、その画風についても学ぶ必要が生じます。ControlNet-LLLiteは容量が少ないため、画風学習には不向きです。このような場合には、後述の次元数を多めにしてください。

 ### 学習
 スクリプトで生成する場合は、`sdxl_train_control_net_lllite.py` を実行してください。`--cond_emb_dim` でconditioning image embeddingの次元数を指定できます。`--network_dim` でLoRA的モジュールのrankを指定できます。その他のオプションは`sdxl_train_network.py`に準じますが、`--network_module`の指定は不要です。
@@ -51,6 +53,122 @@ conditioning image embeddingの次元数は、サンプルのCannyでは32を指

 `--guide_image_path`で推論に用いるconditioning imageを指定してください。なおpreprocessは行われないため、たとえばCannyならCanny処理を行った画像を指定してください（背景黒に白線）。`--control_net_preps`, `--control_net_weights`, `--control_net_ratios` には未対応です。

+## データセットの合成方法
+
+### 学習用画像の生成
+
+学習のベースとなるモデルで画像生成を行います。Web UIやComfyUIなどで生成してください。画像サイズはモデルのデフォルトサイズで良いと思われます（1024x1024など）。bucketingを用いることもできます。その場合は適宜適切な解像度で生成してください。
+
+生成時のキャプション等は、ControlNet-LLLiteの利用時に生成したい画像にあわせるのが良いと思われます。
+
+生成した画像を任意のディレクトリに保存してください。このディレクトリをデータセットの設定ファイルで指定します。
+
+当リポジトリ内の `sdxl_gen_img.py` でも生成できます。例えば以下のように実行します。
+
+```dos
+python sdxl_gen_img.py --ckpt path/to/model.safetensors --n_iter 1 --scale 10 --steps 36 --outdir path/to/output/dir --xformers --W 1024 --H 1024 --original_width 2048 --original_height 2048 --bf16 --sampler ddim --batch_size 4 --vae_batch_size 2 --images_per_prompt 512 --max_embeddings_multiples 1 --prompt "{portrait|digital art|anime screen cap|detailed illustration} of 1girl, {standing|sitting|walking|running|dancing} on {classroom|street|town|beach|indoors|outdoors}, {looking at viewer|looking away|looking at another}, {in|wearing} {shirt and skirt|school uniform|casual wear} { |, dynamic pose}, (solo), teen age, {0-1$$smile,|blush,|kind smile,|expression less,|happy,|sadness,} {0-1$$upper body,|full body,|cowboy shot,|face focus,} trending on pixiv, {0-2$$depth of fields,|8k wallpaper,|highly detailed,|pov,} {0-1$$summer, |winter, |spring, |autumn, } beautiful face { |, from below|, from above|, from side|, from behind|, from back} --n nsfw, bad face, lowres, low quality, worst quality, low effort, watermark, signature, ugly, poorly drawn"
+```
+
+VRAM 24GBの設定です。VRAMサイズにより`--batch_size` `--vae_batch_size`を調整してください。
+
+`--prompt`でワイルドカードを利用してランダムに生成しています。適宜調整してください。
+
+### 画像の加工
+
+外部のプログラムを用いて、生成した画像を加工します。加工した画像を任意のディレクトリに保存してください。これらがconditioning imageになります。
+
+加工にはたとえばCannyなら以下のようなスクリプトが使えます。
+
+```python
+import glob
+import os
+import random
+import cv2
+import numpy as np
+
+IMAGES_DIR = "path/to/generated/images"
+CANNY_DIR = "path/to/canny/images"
+
+os.makedirs(CANNY_DIR, exist_ok=True)
+img_files = glob.glob(IMAGES_DIR + "/*.png")
+for img_file in img_files:
+    can_file = os.path.join(CANNY_DIR, os.path.basename(img_file))
+    if os.path.exists(can_file):
+        print("Skip: " + img_file)
+        continue
+
+    print(img_file)
+
+    img = cv2.imread(img_file)
+
+    # random threshold
+    # while True:
+    #     threshold1 = random.randint(0, 127)
+    #     threshold2 = random.randint(128, 255)
+    #     if threshold2 - threshold1 > 80:
+    #         break
+
+    # fixed threshold
+    threshold1 = 100
+    threshold2 = 200
+
+    img = cv2.Canny(img, threshold1, threshold2)
+
+    cv2.imwrite(can_file, img)
+```
+
+### キャプションファイルの作成
+
+学習用画像のbasenameと同じ名前で、それぞれの画像に対応したキャプションファイルを作成してください。生成時のプロンプトをそのまま利用すれば良いと思われます。
+
+`sdxl_gen_img.py` で生成した場合は、画像内のメタデータに生成時のプロンプトが記録されていますので、以下のようなスクリプトで学習用画像と同じディレクトリにキャプションファイルを作成できます（拡張子 `.txt`）。
+
+```python
+import glob
+import os
+from PIL import Image
+
+IMAGES_DIR = "path/to/generated/images"
+
+img_files = glob.glob(IMAGES_DIR + "/*.png")
+for img_file in img_files:
+    cap_file = img_file.replace(".png", ".txt")
+    if os.path.exists(cap_file):
+        print(f"Skip: {img_file}")
+        continue
+    print(img_file)
+
+    img = Image.open(img_file)
+    prompt = img.text["prompt"] if "prompt" in img.text else ""
+    if prompt == "":
+        print(f"Prompt not found in {img_file}")
+
+    with open(cap_file, "w") as f:
+        f.write(prompt + "\n")
+```
+
+### データセットの設定ファイルの作成
+
+コマンドラインオプションからの指定も可能ですが、`.toml`ファイルを作成する場合は `conditioning_data_dir` に加工した画像を保存したディレクトリを指定します。
+
+以下は設定ファイルの例です。
+
+```toml
+[general]
+flip_aug = false
+color_aug = false
+resolution = [1024,1024]
+
+[[datasets]]
+batch_size = 8
+enable_bucket = false
+
+    [[datasets.subsets]]
+    image_dir = "path/to/generated/image/dir"
+    caption_extension = ".txt"
+    conditioning_data_dir = "path/to/canny/image/dir"
+```
+
 ## 謝辞

 ControlNetの作者である lllyasviel 氏、実装上のアドバイスとトラブル解決へのご尽力をいただいた furusu 氏、ControlNetデータセットを実装していただいた ddPn08 氏に感謝いたします。
--- a/docs/train_lllite_README.md
+++ b/docs/train_lllite_README.md
@@ -26,7 +26,7 @@ Due to the limitations of the inference environment, only CrossAttention (attn1

 ### Preparing the dataset

-In addition to the normal dataset, please store the conditioning image in the directory specified by `conditioning_data_dir`. The conditioning image must have the same basename as the training image. The conditioning image will be automatically resized to the same size as the training image.
+In addition to the normal dataset, please store the conditioning image in the directory specified by `conditioning_data_dir`. The conditioning image must have the same basename as the training image. The conditioning image will be automatically resized to the same size as the training image. The conditioning image does not require a caption file.

 ```toml
 [[datasets.subsets]]
@@ -37,9 +37,9 @@ conditioning_data_dir = "path/to/conditioning/image/dir"

 At the moment, random_crop cannot be used.

-As a training data, it seems to be better to use the images generated by the original model as training images and the images processed from them as conditioning images. If you use images with a different style from the original model as training images, the model will have to learn not only the control but also the style. ControlNet-LLLite is not suitable for style learning because of its small capacity.
+For training data, it is easiest to use a synthetic dataset with the original model-generated images as training images and processed images as conditioning images (the quality of the dataset may be problematic). See below for specific methods of synthesizing datasets.

-If you use images other than the generated images as training images, please increase the dimension as described below.
+Note that if you use an image with a different art style than the original model as a training image, the model will have to learn not only the control but also the art style. ControlNet-LLLite has a small capacity, so it is not suitable for learning art styles. In such cases, increase the number of dimensions as described below.

 ### Training

@@ -57,6 +57,121 @@ If you want to generate images with a script, run `sdxl_gen_img.py`. You can spe

 Specify the conditioning image to be used for inference with `--guide_image_path`. Since preprocess is not performed, if it is Canny, specify an image processed with Canny (white line on black background). `--control_net_preps`, `--control_net_weights`, and `--control_net_ratios` are not supported.

+## How to synthesize a dataset
+
+### Generating training images
+
+Generate images with the base model for training. Please generate them with Web UI or ComfyUI etc. The image size should be the default size of the model (1024x1024, etc.). You can also use bucketing. In that case, generate the images at resolutions appropriate for the buckets.
+
+The captions and other settings used when generating the images should match those you intend to use when generating images with the trained ControlNet-LLLite model.
+
+Save the generated images in an arbitrary directory. Specify this directory in the dataset configuration file.
+
+
+You can also generate them with `sdxl_gen_img.py` in this repository. For example, run as follows:
+
+```dos
+python sdxl_gen_img.py --ckpt path/to/model.safetensors --n_iter 1 --scale 10 --steps 36 --outdir path/to/output/dir --xformers --W 1024 --H 1024 --original_width 2048 --original_height 2048 --bf16 --sampler ddim --batch_size 4 --vae_batch_size 2 --images_per_prompt 512 --max_embeddings_multiples 1 --prompt "{portrait|digital art|anime screen cap|detailed illustration} of 1girl, {standing|sitting|walking|running|dancing} on {classroom|street|town|beach|indoors|outdoors}, {looking at viewer|looking away|looking at another}, {in|wearing} {shirt and skirt|school uniform|casual wear} { |, dynamic pose}, (solo), teen age, {0-1$$smile,|blush,|kind smile,|expression less,|happy,|sadness,} {0-1$$upper body,|full body,|cowboy shot,|face focus,} trending on pixiv, {0-2$$depth of fields,|8k wallpaper,|highly detailed,|pov,} {0-1$$summer, |winter, |spring, |autumn, } beautiful face { |, from below|, from above|, from side|, from behind|, from back} --n nsfw, bad face, lowres, low quality, worst quality, low effort, watermark, signature, ugly, poorly drawn"
+```
+
+These settings assume 24GB of VRAM. Adjust `--batch_size` and `--vae_batch_size` according to the VRAM size.
+
+The images are generated randomly using wildcards in `--prompt`. Adjust as necessary.
+
+### Processing images
+
+Use an external program to process the generated images. Save the processed images in an arbitrary directory. These will be the conditioning images.
+
+For example, you can use the following script to process the images with Canny.
+
+```python
+import glob
+import os
+import random
+import cv2
+import numpy as np
+
+IMAGES_DIR = "path/to/generated/images"
+CANNY_DIR = "path/to/canny/images"
+
+os.makedirs(CANNY_DIR, exist_ok=True)
+img_files = glob.glob(IMAGES_DIR + "/*.png")
+for img_file in img_files:
+    can_file = os.path.join(CANNY_DIR, os.path.basename(img_file))
+    if os.path.exists(can_file):
+        print("Skip: " + img_file)
+        continue
+
+    print(img_file)
+
+    img = cv2.imread(img_file)
+
+    # random threshold
+    # while True:
+    #     threshold1 = random.randint(0, 127)
+    #     threshold2 = random.randint(128, 255)
+    #     if threshold2 - threshold1 > 80:
+    #         break
+
+    # fixed threshold
+    threshold1 = 100
+    threshold2 = 200
+
+    img = cv2.Canny(img, threshold1, threshold2)
+
+    cv2.imwrite(can_file, img)
+```
+
+### Creating caption files
+
+Create a caption file for each image with the same basename as the training image. It is fine to use the same caption as the one used when generating the image.
+
+If you generated the images with `sdxl_gen_img.py`, you can use the following script to create the caption files (`*.txt`) from the metadata in the generated images.
+
+```python
+import glob
+import os
+from PIL import Image
+
+IMAGES_DIR = "path/to/generated/images"
+
+img_files = glob.glob(IMAGES_DIR + "/*.png")
+for img_file in img_files:
+    cap_file = img_file.replace(".png", ".txt")
+    if os.path.exists(cap_file):
+        print(f"Skip: {img_file}")
+        continue
+    print(img_file)
+
+    img = Image.open(img_file)
+    prompt = img.text["prompt"] if "prompt" in img.text else ""
+    if prompt == "":
+        print(f"Prompt not found in {img_file}")
+
+    with open(cap_file, "w") as f:
+        f.write(prompt + "\n")
+```
+
+### Creating a dataset configuration file
+
+You can use the command line arguments of `sdxl_train_control_net_lllite.py` to specify the conditioning image directory. However, if you want to use a `.toml` file, specify the conditioning image directory in `conditioning_data_dir`.
+
+```toml
+[general]
+flip_aug = false
+color_aug = false
+resolution = [1024,1024]
+
+[[datasets]]
+batch_size = 8
+enable_bucket = false
+
+    [[datasets.subsets]]
+    image_dir = "path/to/generated/image/dir"
+    caption_extension = ".txt"
+    conditioning_data_dir = "path/to/canny/image/dir"
+```
+
 ## Credit

 I would like to thank lllyasviel, the author of ControlNet, furusu, who provided me with advice on implementation and helped me solve problems, and ddPn08, who implemented the ControlNet dataset.
--- a/networks/control_net_lllite.py
+++ b/networks/control_net_lllite.py
@@ -33,7 +33,7 @@ TRANSFORMER_MAX_BLOCK_INDEX = None


 class LLLiteModule(torch.nn.Module):
-    def __init__(self, depth, cond_emb_dim, name, org_module, mlp_dim, dropout=None):
+    def __init__(self, depth, cond_emb_dim, name, org_module, mlp_dim, dropout=None, multiplier=1.0):
        super().__init__()

        self.is_conv2d = org_module.__class__.__name__ == "Conv2d"
@@ -41,6 +41,7 @@ class LLLiteModule(torch.nn.Module):
        self.cond_emb_dim = cond_emb_dim
        self.org_module = [org_module]
        self.dropout = dropout
+        self.multiplier = multiplier

        if self.is_conv2d:
            in_dim = org_module.in_channels
@@ -119,6 +120,10 @@ class LLLiteModule(torch.nn.Module):
        中でモデルを呼び出すので必要ならwith torch.no_grad()で囲む
        / call the model inside, so if necessary, surround it with torch.no_grad()
        """
+        if cond_image is None:
+            self.cond_emb = None
+            return
+
        # timestepごとに呼ばれないので、あらかじめ計算しておく / it is not called for each timestep, so calculate it in advance
        # print(f"C {self.lllite_name}, cond_image.shape={cond_image.shape}")
        cx = self.conditioning1(cond_image)
@@ -141,6 +146,9 @@ class LLLiteModule(torch.nn.Module):
        学習用の便利forward。元のモジュールのforwardを呼び出す
        / convenient forward for training. call the forward of the original module
        """
+        if self.multiplier == 0.0 or self.cond_emb is None:
+            return self.org_forward(x)
+
        cx = self.cond_emb

        if not self.batch_cond_only and x.shape[0] // 2 == cx.shape[0]:  # inference only
@@ -160,11 +168,13 @@ class LLLiteModule(torch.nn.Module):
        if self.dropout is not None and self.training:
            cx = torch.nn.functional.dropout(cx, p=self.dropout)

-        cx = self.up(cx)
+        cx = self.up(cx) * self.multiplier

-        # residua (x) lを加算して元のforwardを呼び出す / add residual (x) and call the original forward
+        # residual (x) を加算して元のforwardを呼び出す / add residual (x) and call the original forward
        if self.batch_cond_only:
-            cx = torch.zeros_like(x)[1::2] + cx
+            zx = torch.zeros_like(x)
+            zx[1::2] += cx
+            cx = zx

        x = self.org_forward(x + cx)  # ここで元のモジュールを呼び出す / call the original module here
        return x
@@ -181,6 +191,7 @@ class ControlNetLLLite(torch.nn.Module):
        mlp_dim: int = 16,
        dropout: Optional[float] = None,
        varbose: Optional[bool] = False,
+        multiplier: Optional[float] = 1.0,
    ) -> None:
        super().__init__()
        # self.unets = [unet]
@@ -264,6 +275,7 @@ class ControlNetLLLite(torch.nn.Module):
                                child_module,
                                mlp_dim,
                                dropout=dropout,
+                                multiplier=multiplier,
                            )
                            modules.append(module)
            return modules
@@ -291,6 +303,10 @@ class ControlNetLLLite(torch.nn.Module):
        for module in self.unet_modules:
            module.set_batch_cond_only(cond_only, zeros)

+    def set_multiplier(self, multiplier):
+        for module in self.unet_modules:
+            module.multiplier = multiplier
+
    def load_weights(self, file):
        if os.path.splitext(file)[1] == ".safetensors":
            from safetensors.torch import load_file
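Note on the `batch_cond_only` change above: the old expression `torch.zeros_like(x)[1::2] + cx` produces a tensor with only half the batch rows, whereas the new code builds a full-batch zero tensor and adds `cx` to its odd-indexed rows. The following is a minimal plain-Python sketch of that indexing, with lists standing in for tensors and scalars standing in for feature maps. The function name `scatter_to_odd_rows` is hypothetical and is not part of `control_net_lllite.py`.

```python
def scatter_to_odd_rows(batch_size, cx, multiplier=1.0):
    """Build a full-batch list that is zero at even indices and
    multiplier * cx[k] at odd index 2 * k + 1.

    Mirrors `cx = up(cx) * multiplier; zx = zeros_like(x); zx[1::2] += cx`
    from the diff, using lists instead of tensors (an illustration only).
    """
    if len(cx) != batch_size // 2:
        raise ValueError("cx must have one entry per odd index of the batch")
    zx = [0.0] * batch_size
    # Extended-slice assignment touches only indices 1, 3, 5, ...
    zx[1::2] = [multiplier * v for v in cx]
    return zx


if __name__ == "__main__":
    x = [10.0, 20.0, 30.0, 40.0]  # stand-in for a batch of 4 inputs
    zx = scatter_to_odd_rows(len(x), [1.0, 2.0], multiplier=0.5)
    # Even rows pass through unchanged; odd rows receive the scaled signal.
    print([a + b for a, b in zip(x, zx)])  # [10.0, 20.5, 30.0, 41.0]
```

With `multiplier=0.0` every entry of the result is zero, which is consistent with the early return to `org_forward(x)` that the diff adds for that case.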
--- a/sdxl_gen_img.py
+++ b/sdxl_gen_img.py
@@ -661,21 +661,28 @@ class PipelineLike:
        if self.control_nets:
            # guided_hints = original_control_net.get_guided_hints(self.control_nets, num_latent_input, batch_size, clip_guide_images)
            if self.control_net_enabled:
-                for control_net in self.control_nets:
+                for control_net, _ in self.control_nets:
                    with torch.no_grad():
                        control_net.set_cond_image(clip_guide_images)
            else:
-                for control_net in self.control_nets:
+                for control_net, _ in self.control_nets:
                    control_net.set_cond_image(None)

+        each_control_net_enabled = [self.control_net_enabled] * len(self.control_nets)
        for i, t in enumerate(tqdm(timesteps)):
            # expand the latents if we are doing classifier free guidance
            latent_model_input = latents.repeat((num_latent_input, 1, 1, 1))
            latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

-            # # disable control net if ratio is set
-            # if self.control_nets and self.control_net_enabled:
-            #     pass # TODO
+            # disable control net if ratio is set
+            if self.control_nets and self.control_net_enabled:
+                for j, ((control_net, ratio), enabled) in enumerate(zip(self.control_nets, each_control_net_enabled)):
+                    if not enabled or ratio >= 1.0:
+                        continue
+                    if ratio < i / len(timesteps):
+                        print(f"ControlNet {j} is disabled (ratio={ratio} at {i} / {len(timesteps)})")
+                        control_net.set_cond_image(None)
+                        each_control_net_enabled[j] = False

            # predict the noise residual
            # TODO Diffusers' ControlNet
@@ -1567,7 +1574,7 @@ def main(args):
        upscaler.to(dtype).to(device)

    # ControlNetの処理
-    control_nets: List[ControlNetLLLite] = []
+    control_nets: List[Tuple[ControlNetLLLite, float]] = []
    # if args.control_net_models:
    #     for i, model in enumerate(args.control_net_models):
    #         prep_type = None if not args.control_net_preps or len(args.control_net_preps) <= i else args.control_net_preps[i]
@@ -1595,12 +1602,19 @@ def main(args):
                    break
            assert mlp_dim is not None and cond_emb_dim is not None, f"invalid control net: {model_file}"

-            control_net = ControlNetLLLite(unet, cond_emb_dim, mlp_dim)
+            multiplier = (
+                1.0
+                if not args.control_net_multipliers or len(args.control_net_multipliers) <= i
+                else args.control_net_multipliers[i]
+            )
+            ratio = 1.0 if not args.control_net_ratios or len(args.control_net_ratios) <= i else args.control_net_ratios[i]
+
+            control_net = ControlNetLLLite(unet, cond_emb_dim, mlp_dim, multiplier=multiplier)
            control_net.apply_to()
            control_net.load_state_dict(state_dict)
            control_net.to(dtype).to(device)
            control_net.set_batch_cond_only(False, False)
-            control_nets.append(control_net)
+            control_nets.append((control_net, ratio))

    if args.opt_channels_last:
        print(f"set optimizing: channels last")
@@ -2623,14 +2637,16 @@ def setup_parser() -> argparse.ArgumentParser:
    # parser.add_argument(
    #     "--control_net_preps", type=str, default=None, nargs="*", help="ControlNet preprocess to use / 使用するControlNetのプリプロセス名"
    # )
-    # parser.add_argument("--control_net_multiplier", type=float, default=None, nargs="*", help="ControlNet multiplier / ControlNetの適用率")
-    # parser.add_argument(
-    #     "--control_net_ratios",
-    #     type=float,
-    #     default=None,
-    #     nargs="*",
-    #     help="ControlNet guidance ratio for steps / ControlNetでガイドするステップ比率",
-    # )
+    parser.add_argument(
+        "--control_net_multipliers", type=float, default=None, nargs="*", help="ControlNet multiplier / ControlNetの適用率"
+    )
+    parser.add_argument(
+        "--control_net_ratios",
+        type=float,
+        default=None,
+        nargs="*",
+        help="ControlNet guidance ratio for steps / ControlNetでガイドするステップ比率",
+    )
    # # parser.add_argument(
    #     "--control_net_image_path", type=str, default=None, nargs="*", help="image for ControlNet guidance / ControlNetでガイドに使う画像"
    # )
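Note on `--control_net_ratios`: in the denoising loop above, a ControlNet is switched off at the first step index `i` for which `ratio < i / len(timesteps)`, and a ratio of 1.0 or more keeps it on for every step. The sketch below reproduces only that comparison and the per-model default of 1.0 so the cut-off step can be checked by hand. The helper names `pick` and `first_disabled_step` are hypothetical and do not exist in `sdxl_gen_img.py`.

```python
def pick(values, i, default=1.0):
    """Return values[i], or the default when the list is missing or too short.

    Mirrors how the diff resolves per-model multipliers and ratios.
    """
    return default if not values or len(values) <= i else values[i]


def first_disabled_step(ratio, num_steps):
    """Return the first step index at which the ControlNet would be disabled,
    or None if it stays enabled for all steps."""
    if ratio >= 1.0:
        return None
    for i in range(num_steps):
        if ratio < i / num_steps:
            return i
    return None


if __name__ == "__main__":
    ratios = [0.5]  # e.g. one value passed on the command line
    for model_index in range(2):
        ratio = pick(ratios, model_index)
        print(model_index, ratio, first_disabled_step(ratio, 36))
```

Because the comparison is strict, a ratio of 0.5 with 36 steps keeps the ControlNet active through step index 18 and disables it from step index 19. A ratio of 0.0 still applies the ControlNet at step index 0.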