Initial plan

Merge pull request #2070 from kohya-ss/fix-mean-ar-error-nan
Fix mean image aspect ratio error calculation to avoid NaN values
2026-04-14 16:22:28 +00:00 · 2025-07-05 00:41:07 +00:00 · 2025-04-29 21:29:42 +09:00 · 2025-04-29 21:27:04 +09:00 · 2025-04-27 21:29:44 +09:00 · 2025-04-27 21:28:38 +09:00
18 changed files with 393 additions and 583 deletions
--- a/.github/FUNDING.yml
+++ b/.github/FUNDING.yml
@@ -0,0 +1,3 @@
+# These are supported funding model platforms
+
+github: kohya-ss
--- a/README-ja.md
+++ b/README-ja.md
@@ -36,6 +36,8 @@ Python 3.10.6およびGitが必要です。
 - Python 3.10.6: https://www.python.org/ftp/python/3.10.6/python-3.10.6-amd64.exe
 - git: https://git-scm.com/download/win

+Python 3.10.x、3.11.x、3.12.xでも恐らく動作しますが、3.10.6でテストしています。
+
 PowerShellを使う場合、venvを使えるようにするためには以下の手順でセキュリティ設定を変更してください。
 （venvに限らずスクリプトの実行が可能になりますので注意してください。）

@@ -45,7 +47,7 @@ PowerShellを使う場合、venvを使えるようにするためには以下の

 ## Windows環境でのインストール

-スクリプトはPyTorch 2.1.2でテストしています。PyTorch 2.0.1、1.12.1でも動作すると思われます。
+スクリプトはPyTorch 2.1.2でテストしています。PyTorch 2.2以降でも恐らく動作します。

 （なお、python -m venv～の行で「python」とだけ表示された場合、py -m venv～のようにpythonをpyに変更してください。）

@@ -67,10 +69,12 @@ accelerate config

 コマンドプロンプトでも同一です。

-注：`bitsandbytes==0.43.0`、`prodigyopt==1.0`、`lion-pytorch==0.0.6` は `requirements.txt` に含まれるようになりました。他のバージョンを使う場合は適宜インストールしてください。
+注：`bitsandbytes==0.44.0`、`prodigyopt==1.0`、`lion-pytorch==0.0.6` は `requirements.txt` に含まれるようになりました。他のバージョンを使う場合は適宜インストールしてください。

 この例では PyTorch および xfomers は2.1.2／CUDA 11.8版をインストールします。CUDA 12.1版やPyTorch 1.12.1を使う場合は適宜書き換えください。たとえば CUDA 12.1版の場合は `pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu121` および `pip install xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu121` としてください。

+PyTorch 2.2以降を用いる場合は、`torch==2.1.2` と `torchvision==0.16.2` 、および `xformers==0.0.23.post1` を適宜変更してください。
+
 accelerate configの質問には以下のように答えてください。（bf16で学習する場合、最後の質問にはbf16と答えてください。）

 ```txt
--- a/README.md
+++ b/README.md
@@ -3,6 +3,8 @@ This repository contains training, generation and utility scripts for Stable Dif
 [__Change History__](#change-history) is moved to the bottom of the page. 
 更新履歴は[ページ末尾](#change-history)に移しました。

+Latest update: 2025-03-21 (Version 0.9.1)
+
 [日本語版READMEはこちら](./README-ja.md)

 The development version is in the `dev` branch. Please check the dev branch for the latest changes.
@@ -25,7 +27,7 @@ This repository contains the scripts for:

 The file does not contain requirements for PyTorch. Because the version of PyTorch depends on the environment, it is not included in the file. Please install PyTorch first according to the environment. See installation instructions below.

-The scripts are tested with Pytorch 2.1.2. 2.0.1 and 1.12.1 is not tested but should work.
+The scripts are tested with Pytorch 2.1.2. PyTorch 2.2 or later will work. Please install the appropriate version of PyTorch and xformers.

 ## Links to usage documentation

@@ -52,6 +54,8 @@ Python 3.10.6 and Git:
 - Python 3.10.6: https://www.python.org/ftp/python/3.10.6/python-3.10.6-amd64.exe
 - git: https://git-scm.com/download/win

+Python 3.10.x, 3.11.x, and 3.12.x will work but not tested.
+
 Give unrestricted script access to powershell so venv can work:

 - Open an administrator powershell window
@@ -78,10 +82,12 @@ accelerate config

 If `python -m venv` shows only `python`, change `python` to `py`.

-__Note:__ Now `bitsandbytes==0.43.0`, `prodigyopt==1.0` and `lion-pytorch==0.0.6` are included in the requirements.txt. If you'd like to use the another version, please install it manually.
+Note: Now `bitsandbytes==0.44.0`, `prodigyopt==1.0` and `lion-pytorch==0.0.6` are included in the requirements.txt. If you'd like to use the another version, please install it manually.

 This installation is for CUDA 11.8. If you use a different version of CUDA, please install the appropriate version of PyTorch and xformers. For example, if you use CUDA 12, please install `pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu121` and `pip install xformers==0.0.23.post1 --index-url https://download.pytorch.org/whl/cu121`.

+If you use PyTorch 2.2 or later, please change `torch==2.1.2` and `torchvision==0.16.2` and `xformers==0.0.23.post1` to the appropriate version.
+
 <!-- 
 cp .\bitsandbytes_windows\*.dll .\venv\Lib\site-packages\bitsandbytes\
 cp .\bitsandbytes_windows\cextension.py .\venv\Lib\site-packages\bitsandbytes\cextension.py
@@ -142,12 +148,23 @@ The majority of scripts is licensed under ASL 2.0 (including codes from Diffuser

 ## Change History

-### Working in progress
+### Mar 21, 2025 /  2025-03-21 Version 0.9.1
+
+- Fixed a bug where some of LoRA modules for CLIP Text Encoder were not trained. Thank you Nekotekina for PR [#1964](https://github.com/kohya-ss/sd-scripts/pull/1964)
+  - The LoRA modules for CLIP Text Encoder are now 264 modules, which is the same as before. Only 88 modules were trained in the previous version. 
+
+### Jan 17, 2025 /  2025-01-17 Version 0.9.0

 - __important__ The dependent libraries are updated. Please see [Upgrade](#upgrade) and update the libraries.
  - bitsandbytes, transformers, accelerate and huggingface_hub are updated. 
  - If you encounter any issues, please report them.

+- The dev branch is merged into main. The documentation is delayed, and I apologize for that. I will gradually improve it.
+- The state just before the merge is released as Version 0.8.8, so please use it if you encounter any issues.
+- The following changes are included.
+
+#### Changes
+
 - Fixed a bug where the loss weight was incorrect when `--debiased_estimation_loss` was specified with `--v_parameterization`. PR [#1715](https://github.com/kohya-ss/sd-scripts/pull/1715) Thanks to catboxanon! See [the PR](https://github.com/kohya-ss/sd-scripts/pull/1715) for details.
  - Removed the warning when `--v_parameterization` is specified in SDXL and SD1.5. PR [#1717](https://github.com/kohya-ss/sd-scripts/pull/1717)

@@ -188,7 +205,6 @@ The majority of scripts is licensed under ASL 2.0 (including codes from Diffuser
  - See the [transformers documentation](https://huggingface.co/docs/transformers/v4.44.2/en/main_classes/optimizer_schedules#schedules) for details on each scheduler.
  - `--lr_warmup_steps` and `--lr_decay_steps` can now be specified as a ratio of the number of training steps, not just the step value. Example: `--lr_warmup_steps=0.1` or `--lr_warmup_steps=10%`, etc.

-https://github.com/kohya-ss/sd-scripts/pull/1393
 - When enlarging images in the script (when the size of the training image is small and bucket_no_upscale is not specified), it has been changed to use Pillow's resize and LANCZOS interpolation instead of OpenCV2's resize and Lanczos4 interpolation. The quality of the image enlargement may be slightly improved. PR [#1426](https://github.com/kohya-ss/sd-scripts/pull/1426) Thanks to sdbds!

 - Sample image generation during training now works on non-CUDA devices. PR [#1433](https://github.com/kohya-ss/sd-scripts/pull/1433) Thanks to millie-v!
@@ -258,6 +274,12 @@ https://github.com/kohya-ss/sd-scripts/pull/1290) Thanks to frodo821!

 - Added a prompt option `--f` to `gen_imgs.py` to specify the file name when saving. Also, Diffusers-based keys for LoRA weights are now supported.

+#### 変更点
+
+- devブランチがmainにマージされました。ドキュメントの整備が遅れており申し訳ありません。少しずつ整備していきます。
+- マージ直前の状態が Version 0.8.8 としてリリースされていますので、問題があればそちらをご利用ください。
+- 以下の変更が含まれます。
+
 - SDXL の学習時に Fused optimizer が使えるようになりました。PR [#1259](https://github.com/kohya-ss/sd-scripts/pull/1259) 2kpr 氏に感謝します。
  - optimizer の backward pass に step を統合することで学習時のメモリ使用量を大きく削減します。学習結果は未適用時と同一ですが、メモリが潤沢にある場合は速度は遅くなります。
  - `sdxl_train.py` に `--fused_backward_pass` オプションを指定してください。現時点では optimizer は AdaFactor のみ対応しています。また gradient accumulation は使えません。
--- a/library/device_utils.py
+++ b/library/device_utils.py
@@ -2,6 +2,13 @@ import functools
 import gc

 import torch
+try:
+    # intel gpu support for pytorch older than 2.5
+    # ipex is not needed after pytorch 2.5
+    import intel_extension_for_pytorch as ipex  # noqa
+except Exception:
+    pass
+

 try:
    HAS_CUDA = torch.cuda.is_available()
@@ -14,8 +21,6 @@ except Exception:
    HAS_MPS = False

 try:
-    import intel_extension_for_pytorch as ipex  # noqa
-
    HAS_XPU = torch.xpu.is_available()
 except Exception:
    HAS_XPU = False
@@ -69,7 +74,7 @@ def init_ipex():

    This function should run right after importing torch and before doing anything else.

-    If IPEX is not available, this function does nothing.
+    If xpu is not available, this function does nothing.
    """
    try:
        if HAS_XPU:
--- a/library/ipex/init.py
+++ b/library/ipex/init.py
@@ -2,7 +2,11 @@ import os
 import sys
 import contextlib
 import torch
-import intel_extension_for_pytorch as ipex # pylint: disable=import-error, unused-import
+try:
+    import intel_extension_for_pytorch as ipex # pylint: disable=import-error, unused-import
+    legacy = True
+except Exception:
+    legacy = False
 from .hijacks import ipex_hijacks

 # pylint: disable=protected-access, missing-function-docstring, line-too-long
@@ -12,6 +16,13 @@ def ipex_init(): # pylint: disable=too-many-statements
        if hasattr(torch, "cuda") and hasattr(torch.cuda, "is_xpu_hijacked") and torch.cuda.is_xpu_hijacked:
            return True, "Skipping IPEX hijack"
        else:
+            try: # force xpu device on torch compile and triton
+                torch._inductor.utils.GPU_TYPES = ["xpu"]
+                torch._inductor.utils.get_gpu_type = lambda *args, **kwargs: "xpu"
+                from triton import backends as triton_backends # pylint: disable=import-error
+                triton_backends.backends["nvidia"].driver.is_active = lambda *args, **kwargs: False
+            except Exception:
+                pass
            # Replace cuda with xpu:
            torch.cuda.current_device = torch.xpu.current_device
            torch.cuda.current_stream = torch.xpu.current_stream
@@ -26,84 +37,99 @@ def ipex_init(): # pylint: disable=too-many-statements
            torch.cuda.is_current_stream_capturing = lambda: False
            torch.cuda.set_device = torch.xpu.set_device
            torch.cuda.stream = torch.xpu.stream
-            torch.cuda.synchronize = torch.xpu.synchronize
            torch.cuda.Event = torch.xpu.Event
            torch.cuda.Stream = torch.xpu.Stream
-            torch.cuda.FloatTensor = torch.xpu.FloatTensor
            torch.Tensor.cuda = torch.Tensor.xpu
            torch.Tensor.is_cuda = torch.Tensor.is_xpu
            torch.nn.Module.cuda = torch.nn.Module.xpu
-            torch.UntypedStorage.cuda = torch.UntypedStorage.xpu
-            torch.cuda._initialization_lock = torch.xpu.lazy_init._initialization_lock
-            torch.cuda._initialized = torch.xpu.lazy_init._initialized
-            torch.cuda._lazy_seed_tracker = torch.xpu.lazy_init._lazy_seed_tracker
-            torch.cuda._queued_calls = torch.xpu.lazy_init._queued_calls
-            torch.cuda._tls = torch.xpu.lazy_init._tls
-            torch.cuda.threading = torch.xpu.lazy_init.threading
-            torch.cuda.traceback = torch.xpu.lazy_init.traceback
            torch.cuda.Optional = torch.xpu.Optional
            torch.cuda.__cached__ = torch.xpu.__cached__
            torch.cuda.__loader__ = torch.xpu.__loader__
-            torch.cuda.ComplexFloatStorage = torch.xpu.ComplexFloatStorage
            torch.cuda.Tuple = torch.xpu.Tuple
            torch.cuda.streams = torch.xpu.streams
-            torch.cuda._lazy_new = torch.xpu._lazy_new
-            torch.cuda.FloatStorage = torch.xpu.FloatStorage
            torch.cuda.Any = torch.xpu.Any
            torch.cuda.__doc__ = torch.xpu.__doc__
            torch.cuda.default_generators = torch.xpu.default_generators
-            torch.cuda.HalfTensor = torch.xpu.HalfTensor
            torch.cuda._get_device_index = torch.xpu._get_device_index
            torch.cuda.__path__ = torch.xpu.__path__
-            torch.cuda.Device = torch.xpu.Device
-            torch.cuda.IntTensor = torch.xpu.IntTensor
-            torch.cuda.ByteStorage = torch.xpu.ByteStorage
            torch.cuda.set_stream = torch.xpu.set_stream
-            torch.cuda.BoolStorage = torch.xpu.BoolStorage
-            torch.cuda.os = torch.xpu.os
            torch.cuda.torch = torch.xpu.torch
-            torch.cuda.BFloat16Storage = torch.xpu.BFloat16Storage
            torch.cuda.Union = torch.xpu.Union
-            torch.cuda.DoubleTensor = torch.xpu.DoubleTensor
-            torch.cuda.ShortTensor = torch.xpu.ShortTensor
-            torch.cuda.LongTensor = torch.xpu.LongTensor
-            torch.cuda.IntStorage = torch.xpu.IntStorage
-            torch.cuda.LongStorage = torch.xpu.LongStorage
            torch.cuda.__annotations__ = torch.xpu.__annotations__
            torch.cuda.__package__ = torch.xpu.__package__
            torch.cuda.__builtins__ = torch.xpu.__builtins__
-            torch.cuda.CharTensor = torch.xpu.CharTensor
            torch.cuda.List = torch.xpu.List
            torch.cuda._lazy_init = torch.xpu._lazy_init
-            torch.cuda.BFloat16Tensor = torch.xpu.BFloat16Tensor
-            torch.cuda.DoubleStorage = torch.xpu.DoubleStorage
-            torch.cuda.ByteTensor = torch.xpu.ByteTensor
            torch.cuda.StreamContext = torch.xpu.StreamContext
-            torch.cuda.ComplexDoubleStorage = torch.xpu.ComplexDoubleStorage
-            torch.cuda.ShortStorage = torch.xpu.ShortStorage
            torch.cuda._lazy_call = torch.xpu._lazy_call
-            torch.cuda.HalfStorage = torch.xpu.HalfStorage
            torch.cuda.random = torch.xpu.random
            torch.cuda._device = torch.xpu._device
-            torch.cuda.classproperty = torch.xpu.classproperty
            torch.cuda.__name__ = torch.xpu.__name__
            torch.cuda._device_t = torch.xpu._device_t
-            torch.cuda.warnings = torch.xpu.warnings
            torch.cuda.__spec__ = torch.xpu.__spec__
-            torch.cuda.BoolTensor = torch.xpu.BoolTensor
-            torch.cuda.CharStorage = torch.xpu.CharStorage
            torch.cuda.__file__ = torch.xpu.__file__
-            torch.cuda._is_in_bad_fork = torch.xpu.lazy_init._is_in_bad_fork
            # torch.cuda.is_current_stream_capturing = torch.xpu.is_current_stream_capturing

+            if legacy:
+                torch.cuda.os = torch.xpu.os
+                torch.cuda.Device = torch.xpu.Device
+                torch.cuda.warnings = torch.xpu.warnings
+                torch.cuda.classproperty = torch.xpu.classproperty
+                torch.UntypedStorage.cuda = torch.UntypedStorage.xpu
+                if float(ipex.__version__[:3]) < 2.3:
+                    torch.cuda._initialization_lock = torch.xpu.lazy_init._initialization_lock
+                    torch.cuda._initialized = torch.xpu.lazy_init._initialized
+                    torch.cuda._is_in_bad_fork = torch.xpu.lazy_init._is_in_bad_fork
+                    torch.cuda._lazy_seed_tracker = torch.xpu.lazy_init._lazy_seed_tracker
+                    torch.cuda._queued_calls = torch.xpu.lazy_init._queued_calls
+                    torch.cuda._tls = torch.xpu.lazy_init._tls
+                    torch.cuda.threading = torch.xpu.lazy_init.threading
+                    torch.cuda.traceback = torch.xpu.lazy_init.traceback
+                    torch.cuda._lazy_new = torch.xpu._lazy_new
+
+                    torch.cuda.FloatTensor = torch.xpu.FloatTensor
+                    torch.cuda.FloatStorage = torch.xpu.FloatStorage
+                    torch.cuda.BFloat16Tensor = torch.xpu.BFloat16Tensor
+                    torch.cuda.BFloat16Storage = torch.xpu.BFloat16Storage
+                    torch.cuda.HalfTensor = torch.xpu.HalfTensor
+                    torch.cuda.HalfStorage = torch.xpu.HalfStorage
+                    torch.cuda.ByteTensor = torch.xpu.ByteTensor
+                    torch.cuda.ByteStorage = torch.xpu.ByteStorage
+                    torch.cuda.DoubleTensor = torch.xpu.DoubleTensor
+                    torch.cuda.DoubleStorage = torch.xpu.DoubleStorage
+                    torch.cuda.ShortTensor = torch.xpu.ShortTensor
+                    torch.cuda.ShortStorage = torch.xpu.ShortStorage
+                    torch.cuda.LongTensor = torch.xpu.LongTensor
+                    torch.cuda.LongStorage = torch.xpu.LongStorage
+                    torch.cuda.IntTensor = torch.xpu.IntTensor
+                    torch.cuda.IntStorage = torch.xpu.IntStorage
+                    torch.cuda.CharTensor = torch.xpu.CharTensor
+                    torch.cuda.CharStorage = torch.xpu.CharStorage
+                    torch.cuda.BoolTensor = torch.xpu.BoolTensor
+                    torch.cuda.BoolStorage = torch.xpu.BoolStorage
+                    torch.cuda.ComplexFloatStorage = torch.xpu.ComplexFloatStorage
+                    torch.cuda.ComplexDoubleStorage = torch.xpu.ComplexDoubleStorage
+
+            if not legacy or float(ipex.__version__[:3]) >= 2.3:
+                torch.cuda._initialization_lock = torch.xpu._initialization_lock
+                torch.cuda._initialized = torch.xpu._initialized
+                torch.cuda._is_in_bad_fork = torch.xpu._is_in_bad_fork
+                torch.cuda._lazy_seed_tracker = torch.xpu._lazy_seed_tracker
+                torch.cuda._queued_calls = torch.xpu._queued_calls
+                torch.cuda._tls = torch.xpu._tls
+                torch.cuda.threading = torch.xpu.threading
+                torch.cuda.traceback = torch.xpu.traceback
+
            # Memory:
-            torch.cuda.memory = torch.xpu.memory
            if 'linux' in sys.platform and "WSL2" in os.popen("uname -a").read():
                torch.xpu.empty_cache = lambda: None
            torch.cuda.empty_cache = torch.xpu.empty_cache
+
+            if legacy:
+                torch.cuda.memory_summary = torch.xpu.memory_summary
+                torch.cuda.memory_snapshot = torch.xpu.memory_snapshot
+            torch.cuda.memory = torch.xpu.memory
            torch.cuda.memory_stats = torch.xpu.memory_stats
-            torch.cuda.memory_summary = torch.xpu.memory_summary
-            torch.cuda.memory_snapshot = torch.xpu.memory_snapshot
            torch.cuda.memory_allocated = torch.xpu.memory_allocated
            torch.cuda.max_memory_allocated = torch.xpu.max_memory_allocated
            torch.cuda.memory_reserved = torch.xpu.memory_reserved
@@ -128,32 +154,44 @@ def ipex_init(): # pylint: disable=too-many-statements
            torch.cuda.initial_seed = torch.xpu.initial_seed

            # AMP:
-            torch.cuda.amp = torch.xpu.amp
-            torch.is_autocast_enabled = torch.xpu.is_autocast_xpu_enabled
-            torch.get_autocast_gpu_dtype = torch.xpu.get_autocast_xpu_dtype
+            if legacy:
+                torch.xpu.amp.custom_fwd = torch.cuda.amp.custom_fwd
+                torch.xpu.amp.custom_bwd = torch.cuda.amp.custom_bwd
+                torch.cuda.amp = torch.xpu.amp
+                if float(ipex.__version__[:3]) < 2.3:
+                    torch.is_autocast_enabled = torch.xpu.is_autocast_xpu_enabled
+                    torch.get_autocast_gpu_dtype = torch.xpu.get_autocast_xpu_dtype

-            if not hasattr(torch.cuda.amp, "common"):
-                torch.cuda.amp.common = contextlib.nullcontext()
-            torch.cuda.amp.common.amp_definitely_not_available = lambda: False
+                if not hasattr(torch.cuda.amp, "common"):
+                    torch.cuda.amp.common = contextlib.nullcontext()
+                torch.cuda.amp.common.amp_definitely_not_available = lambda: False

-            try:
-                torch.cuda.amp.GradScaler = torch.xpu.amp.GradScaler
-            except Exception: # pylint: disable=broad-exception-caught
                try:
-                    from .gradscaler import gradscaler_init # pylint: disable=import-outside-toplevel, import-error
-                    gradscaler_init()
                    torch.cuda.amp.GradScaler = torch.xpu.amp.GradScaler
                except Exception: # pylint: disable=broad-exception-caught
-                    torch.cuda.amp.GradScaler = ipex.cpu.autocast._grad_scaler.GradScaler
+                    try:
+                        from .gradscaler import gradscaler_init # pylint: disable=import-outside-toplevel, import-error
+                        gradscaler_init()
+                        torch.cuda.amp.GradScaler = torch.xpu.amp.GradScaler
+                    except Exception: # pylint: disable=broad-exception-caught
+                        torch.cuda.amp.GradScaler = ipex.cpu.autocast._grad_scaler.GradScaler

            # C
-            torch._C._cuda_getCurrentRawStream = ipex._C._getCurrentStream
-            ipex._C._DeviceProperties.multi_processor_count = ipex._C._DeviceProperties.gpu_subslice_count
-            ipex._C._DeviceProperties.major = 2024
-            ipex._C._DeviceProperties.minor = 0
+            if legacy and float(ipex.__version__[:3]) < 2.3:
+                torch._C._cuda_getCurrentRawStream = ipex._C._getCurrentRawStream
+                ipex._C._DeviceProperties.multi_processor_count = ipex._C._DeviceProperties.gpu_subslice_count
+                ipex._C._DeviceProperties.major = 12
+                ipex._C._DeviceProperties.minor = 1
+            else:
+                torch._C._cuda_getCurrentRawStream = torch._C._xpu_getCurrentRawStream
+                torch._C._XpuDeviceProperties.multi_processor_count = torch._C._XpuDeviceProperties.gpu_subslice_count
+                torch._C._XpuDeviceProperties.major = 12
+                torch._C._XpuDeviceProperties.minor = 1

            # Fix functions with ipex:
-            torch.cuda.mem_get_info = lambda device=None: [(torch.xpu.get_device_properties(device).total_memory - torch.xpu.memory_reserved(device)), torch.xpu.get_device_properties(device).total_memory]
+            # torch.xpu.mem_get_info always returns the total memory as free memory
+            torch.xpu.mem_get_info = lambda device=None: [(torch.xpu.get_device_properties(device).total_memory - torch.xpu.memory_reserved(device)), torch.xpu.get_device_properties(device).total_memory]
+            torch.cuda.mem_get_info = torch.xpu.mem_get_info
            torch._utils._get_available_device_type = lambda: "xpu"
            torch.has_cuda = True
            torch.cuda.has_half = True
@@ -161,19 +199,19 @@ def ipex_init(): # pylint: disable=too-many-statements
            torch.cuda.is_fp16_supported = lambda *args, **kwargs: True
            torch.backends.cuda.is_built = lambda *args, **kwargs: True
            torch.version.cuda = "12.1"
-            torch.cuda.get_device_capability = lambda *args, **kwargs: [12,1]
+            torch.cuda.get_arch_list = lambda: ["ats-m150", "pvc"]
+            torch.cuda.get_device_capability = lambda *args, **kwargs: (12,1)
            torch.cuda.get_device_properties.major = 12
            torch.cuda.get_device_properties.minor = 1
            torch.cuda.ipc_collect = lambda *args, **kwargs: None
            torch.cuda.utilization = lambda *args, **kwargs: 0

-            ipex_hijacks()
-            if not torch.xpu.has_fp64_dtype() or os.environ.get('IPEX_FORCE_ATTENTION_SLICE', None) is not None:
-                try:
-                    from .diffusers import ipex_diffusers
-                    ipex_diffusers()
-                except Exception: # pylint: disable=broad-exception-caught
-                    pass
+            device_supports_fp64, can_allocate_plus_4gb = ipex_hijacks(legacy=legacy)
+            try:
+                from .diffusers import ipex_diffusers
+                ipex_diffusers(device_supports_fp64=device_supports_fp64, can_allocate_plus_4gb=can_allocate_plus_4gb)
+            except Exception: # pylint: disable=broad-exception-caught
+                pass
            torch.cuda.is_xpu_hijacked = True
    except Exception as e:
        return False, e
--- a/library/ipex/attention.py
+++ b/library/ipex/attention.py
@@ -1,177 +1,119 @@
 import os
 import torch
-import intel_extension_for_pytorch as ipex # pylint: disable=import-error, unused-import
-from functools import cache
+from functools import cache, wraps

 # pylint: disable=protected-access, missing-function-docstring, line-too-long

 # ARC GPUs can't allocate more than 4GB to a single block so we slice the attention layers

-sdpa_slice_trigger_rate = float(os.environ.get('IPEX_SDPA_SLICE_TRIGGER_RATE', 4))
-attention_slice_rate = float(os.environ.get('IPEX_ATTENTION_SLICE_RATE', 4))
+sdpa_slice_trigger_rate = float(os.environ.get('IPEX_SDPA_SLICE_TRIGGER_RATE', 1))
+attention_slice_rate = float(os.environ.get('IPEX_ATTENTION_SLICE_RATE', 0.5))

 # Find something divisible with the input_tokens
@cache
-def find_slice_size(slice_size, slice_block_size):
-    while (slice_size * slice_block_size) > attention_slice_rate:
-        slice_size = slice_size // 2
-        if slice_size <= 1:
-            slice_size = 1
-            break
-    return slice_size
+def find_split_size(original_size, slice_block_size, slice_rate=2):
+    split_size = original_size
+    while True:
+        if (split_size * slice_block_size) <= slice_rate and original_size % split_size == 0:
+            return split_size
+        split_size = split_size - 1
+        if split_size <= 1:
+            return 1
+    return split_size
+

 # Find slice sizes for SDPA
@cache
-def find_sdpa_slice_sizes(query_shape, query_element_size):
-    if len(query_shape) == 3:
-        batch_size_attention, query_tokens, shape_three = query_shape
-        shape_four = 1
-    else:
-        batch_size_attention, query_tokens, shape_three, shape_four = query_shape
+def find_sdpa_slice_sizes(query_shape, key_shape, query_element_size, slice_rate=2, trigger_rate=3):
+    batch_size, attn_heads, query_len, _ = query_shape
+    _, _, key_len, _ = key_shape

-    slice_block_size = query_tokens * shape_three * shape_four / 1024 / 1024 * query_element_size
-    block_size = batch_size_attention * slice_block_size
+    slice_batch_size = attn_heads * (query_len * key_len) * query_element_size / 1024 / 1024 / 1024

-    split_slice_size = batch_size_attention
-    split_2_slice_size = query_tokens
-    split_3_slice_size = shape_three
+    split_batch_size = batch_size
+    split_head_size = attn_heads
+    split_query_size = query_len

-    do_split = False
-    do_split_2 = False
-    do_split_3 = False
+    do_batch_split = False
+    do_head_split = False
+    do_query_split = False

-    if block_size > sdpa_slice_trigger_rate:
-        do_split = True
-        split_slice_size = find_slice_size(split_slice_size, slice_block_size)
-        if split_slice_size * slice_block_size > attention_slice_rate:
-            slice_2_block_size = split_slice_size * shape_three * shape_four / 1024 / 1024 * query_element_size
-            do_split_2 = True
-            split_2_slice_size = find_slice_size(split_2_slice_size, slice_2_block_size)
-            if split_2_slice_size * slice_2_block_size > attention_slice_rate:
-                slice_3_block_size = split_slice_size * split_2_slice_size * shape_four / 1024 / 1024 * query_element_size
-                do_split_3 = True
-                split_3_slice_size = find_slice_size(split_3_slice_size, slice_3_block_size)
+    if batch_size * slice_batch_size >= trigger_rate:
+        do_batch_split = True
+        split_batch_size = find_split_size(batch_size, slice_batch_size, slice_rate=slice_rate)

-    return do_split, do_split_2, do_split_3, split_slice_size, split_2_slice_size, split_3_slice_size
+        if split_batch_size * slice_batch_size > slice_rate:
+            slice_head_size = split_batch_size * (query_len * key_len) * query_element_size / 1024 / 1024 / 1024
+            do_head_split = True
+            split_head_size = find_split_size(attn_heads, slice_head_size, slice_rate=slice_rate)

-# Find slice sizes for BMM
-@cache
-def find_bmm_slice_sizes(input_shape, input_element_size, mat2_shape):
-    batch_size_attention, input_tokens, mat2_atten_shape = input_shape[0], input_shape[1], mat2_shape[2]
-    slice_block_size = input_tokens * mat2_atten_shape / 1024 / 1024 * input_element_size
-    block_size = batch_size_attention * slice_block_size
+            if split_head_size * slice_head_size > slice_rate:
+                slice_query_size = split_batch_size * split_head_size * (key_len) * query_element_size / 1024 / 1024 / 1024
+                do_query_split = True
+                split_query_size = find_split_size(query_len, slice_query_size, slice_rate=slice_rate)

-    split_slice_size = batch_size_attention
-    split_2_slice_size = input_tokens
-    split_3_slice_size = mat2_atten_shape
+    return do_batch_split, do_head_split, do_query_split, split_batch_size, split_head_size, split_query_size

-    do_split = False
-    do_split_2 = False
-    do_split_3 = False
-
-    if block_size > attention_slice_rate:
-        do_split = True
-        split_slice_size = find_slice_size(split_slice_size, slice_block_size)
-        if split_slice_size * slice_block_size > attention_slice_rate:
-            slice_2_block_size = split_slice_size * mat2_atten_shape / 1024 / 1024 * input_element_size
-            do_split_2 = True
-            split_2_slice_size = find_slice_size(split_2_slice_size, slice_2_block_size)
-            if split_2_slice_size * slice_2_block_size > attention_slice_rate:
-                slice_3_block_size = split_slice_size * split_2_slice_size / 1024 / 1024 * input_element_size
-                do_split_3 = True
-                split_3_slice_size = find_slice_size(split_3_slice_size, slice_3_block_size)
-
-    return do_split, do_split_2, do_split_3, split_slice_size, split_2_slice_size, split_3_slice_size
-
-
-original_torch_bmm = torch.bmm
-def torch_bmm_32_bit(input, mat2, *, out=None):
-    if input.device.type != "xpu":
-        return original_torch_bmm(input, mat2, out=out)
-    do_split, do_split_2, do_split_3, split_slice_size, split_2_slice_size, split_3_slice_size = find_bmm_slice_sizes(input.shape, input.element_size(), mat2.shape)
-
-    # Slice BMM
-    if do_split:
-        batch_size_attention, input_tokens, mat2_atten_shape = input.shape[0], input.shape[1], mat2.shape[2]
-        hidden_states = torch.zeros(input.shape[0], input.shape[1], mat2.shape[2], device=input.device, dtype=input.dtype)
-        for i in range(batch_size_attention // split_slice_size):
-            start_idx = i * split_slice_size
-            end_idx = (i + 1) * split_slice_size
-            if do_split_2:
-                for i2 in range(input_tokens // split_2_slice_size): # pylint: disable=invalid-name
-                    start_idx_2 = i2 * split_2_slice_size
-                    end_idx_2 = (i2 + 1) * split_2_slice_size
-                    if do_split_3:
-                        for i3 in range(mat2_atten_shape // split_3_slice_size): # pylint: disable=invalid-name
-                            start_idx_3 = i3 * split_3_slice_size
-                            end_idx_3 = (i3 + 1) * split_3_slice_size
-                            hidden_states[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3] = original_torch_bmm(
-                                input[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3],
-                                mat2[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3],
-                                out=out
-                            )
-                    else:
-                        hidden_states[start_idx:end_idx, start_idx_2:end_idx_2] = original_torch_bmm(
-                            input[start_idx:end_idx, start_idx_2:end_idx_2],
-                            mat2[start_idx:end_idx, start_idx_2:end_idx_2],
-                            out=out
-                        )
-            else:
-                hidden_states[start_idx:end_idx] = original_torch_bmm(
-                    input[start_idx:end_idx],
-                    mat2[start_idx:end_idx],
-                    out=out
-                )
-        torch.xpu.synchronize(input.device)
-    else:
-        return original_torch_bmm(input, mat2, out=out)
-    return hidden_states

 original_scaled_dot_product_attention = torch.nn.functional.scaled_dot_product_attention
-def scaled_dot_product_attention_32_bit(query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False, **kwargs):
+@wraps(torch.nn.functional.scaled_dot_product_attention)
+def dynamic_scaled_dot_product_attention(query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False, **kwargs):
    if query.device.type != "xpu":
        return original_scaled_dot_product_attention(query, key, value, attn_mask=attn_mask, dropout_p=dropout_p, is_causal=is_causal, **kwargs)
-    do_split, do_split_2, do_split_3, split_slice_size, split_2_slice_size, split_3_slice_size = find_sdpa_slice_sizes(query.shape, query.element_size())
+    is_unsqueezed = False
+    if len(query.shape) == 3:
+        query = query.unsqueeze(0)
+        is_unsqueezed = True
+    if len(key.shape) == 3:
+        key = key.unsqueeze(0)
+    if len(value.shape) == 3:
+        value = value.unsqueeze(0)
+    do_batch_split, do_head_split, do_query_split, split_batch_size, split_head_size, split_query_size = find_sdpa_slice_sizes(query.shape, key.shape, query.element_size(), slice_rate=attention_slice_rate, trigger_rate=sdpa_slice_trigger_rate)

    # Slice SDPA
-    if do_split:
-        batch_size_attention, query_tokens, shape_three = query.shape[0], query.shape[1], query.shape[2]
-        hidden_states = torch.zeros(query.shape, device=query.device, dtype=query.dtype)
-        for i in range(batch_size_attention // split_slice_size):
-            start_idx = i * split_slice_size
-            end_idx = (i + 1) * split_slice_size
-            if do_split_2:
-                for i2 in range(query_tokens // split_2_slice_size): # pylint: disable=invalid-name
-                    start_idx_2 = i2 * split_2_slice_size
-                    end_idx_2 = (i2 + 1) * split_2_slice_size
-                    if do_split_3:
-                        for i3 in range(shape_three // split_3_slice_size): # pylint: disable=invalid-name
-                            start_idx_3 = i3 * split_3_slice_size
-                            end_idx_3 = (i3 + 1) * split_3_slice_size
-                            hidden_states[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3] = original_scaled_dot_product_attention(
-                                query[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3],
-                                key[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3],
-                                value[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3],
-                                attn_mask=attn_mask[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3] if attn_mask is not None else attn_mask,
+    if do_batch_split:
+        batch_size, attn_heads, query_len, _ = query.shape
+        _, _, _, head_dim = value.shape
+        hidden_states = torch.zeros((batch_size, attn_heads, query_len, head_dim), device=query.device, dtype=query.dtype)
+        if attn_mask is not None:
+            attn_mask = attn_mask.expand((query.shape[0], query.shape[1], query.shape[2], key.shape[-2]))
+        for ib in range(batch_size // split_batch_size):
+            start_idx = ib * split_batch_size
+            end_idx = (ib + 1) * split_batch_size
+            if do_head_split:
+                for ih in range(attn_heads // split_head_size): # pylint: disable=invalid-name
+                    start_idx_h = ih * split_head_size
+                    end_idx_h = (ih + 1) * split_head_size
+                    if do_query_split:
+                        for iq in range(query_len // split_query_size): # pylint: disable=invalid-name
+                            start_idx_q = iq * split_query_size
+                            end_idx_q = (iq + 1) * split_query_size
+                            hidden_states[start_idx:end_idx, start_idx_h:end_idx_h, start_idx_q:end_idx_q, :] = original_scaled_dot_product_attention(
+                                query[start_idx:end_idx, start_idx_h:end_idx_h, start_idx_q:end_idx_q, :],
+                                key[start_idx:end_idx, start_idx_h:end_idx_h, :, :],
+                                value[start_idx:end_idx, start_idx_h:end_idx_h, :, :],
+                                attn_mask=attn_mask[start_idx:end_idx, start_idx_h:end_idx_h, start_idx_q:end_idx_q, :] if attn_mask is not None else attn_mask,
                                dropout_p=dropout_p, is_causal=is_causal, **kwargs
                            )
                    else:
-                        hidden_states[start_idx:end_idx, start_idx_2:end_idx_2] = original_scaled_dot_product_attention(
-                            query[start_idx:end_idx, start_idx_2:end_idx_2],
-                            key[start_idx:end_idx, start_idx_2:end_idx_2],
-                            value[start_idx:end_idx, start_idx_2:end_idx_2],
-                            attn_mask=attn_mask[start_idx:end_idx, start_idx_2:end_idx_2] if attn_mask is not None else attn_mask,
+                        hidden_states[start_idx:end_idx, start_idx_h:end_idx_h, :, :] = original_scaled_dot_product_attention(
+                            query[start_idx:end_idx, start_idx_h:end_idx_h, :, :],
+                            key[start_idx:end_idx, start_idx_h:end_idx_h, :, :],
+                            value[start_idx:end_idx, start_idx_h:end_idx_h, :, :],
+                            attn_mask=attn_mask[start_idx:end_idx, start_idx_h:end_idx_h, :, :] if attn_mask is not None else attn_mask,
                            dropout_p=dropout_p, is_causal=is_causal, **kwargs
                        )
            else:
-                hidden_states[start_idx:end_idx] = original_scaled_dot_product_attention(
-                    query[start_idx:end_idx],
-                    key[start_idx:end_idx],
-                    value[start_idx:end_idx],
-                    attn_mask=attn_mask[start_idx:end_idx] if attn_mask is not None else attn_mask,
+                hidden_states[start_idx:end_idx, :, :, :] = original_scaled_dot_product_attention(
+                    query[start_idx:end_idx, :, :, :],
+                    key[start_idx:end_idx, :, :, :],
+                    value[start_idx:end_idx, :, :, :],
+                    attn_mask=attn_mask[start_idx:end_idx, :, :, :] if attn_mask is not None else attn_mask,
                    dropout_p=dropout_p, is_causal=is_causal, **kwargs
                )
        torch.xpu.synchronize(query.device)
    else:
-        return original_scaled_dot_product_attention(query, key, value, attn_mask=attn_mask, dropout_p=dropout_p, is_causal=is_causal, **kwargs)
+        hidden_states = original_scaled_dot_product_attention(query, key, value, attn_mask=attn_mask, dropout_p=dropout_p, is_causal=is_causal, **kwargs)
+    if is_unsqueezed:
+        hidden_states.squeeze(0)
    return hidden_states
--- a/library/ipex/diffusers.py
+++ b/library/ipex/diffusers.py
@@ -1,312 +1,47 @@
-import os
+from functools import wraps
 import torch
-import intel_extension_for_pytorch as ipex # pylint: disable=import-error, unused-import
-import diffusers #0.24.0 # pylint: disable=import-error
-from diffusers.models.attention_processor import Attention
-from diffusers.utils import USE_PEFT_BACKEND
-from functools import cache
+import diffusers # pylint: disable=import-error

 # pylint: disable=protected-access, missing-function-docstring, line-too-long

-attention_slice_rate = float(os.environ.get('IPEX_ATTENTION_SLICE_RATE', 4))

-@cache
-def find_slice_size(slice_size, slice_block_size):
-    while (slice_size * slice_block_size) > attention_slice_rate:
-        slice_size = slice_size // 2
-        if slice_size <= 1:
-            slice_size = 1
-            break
-    return slice_size
-
-@cache
-def find_attention_slice_sizes(query_shape, query_element_size, query_device_type, slice_size=None):
-    if len(query_shape) == 3:
-        batch_size_attention, query_tokens, shape_three = query_shape
-        shape_four = 1
-    else:
-        batch_size_attention, query_tokens, shape_three, shape_four = query_shape
-    if slice_size is not None:
-        batch_size_attention = slice_size
-
-    slice_block_size = query_tokens * shape_three * shape_four / 1024 / 1024 * query_element_size
-    block_size = batch_size_attention * slice_block_size
-
-    split_slice_size = batch_size_attention
-    split_2_slice_size = query_tokens
-    split_3_slice_size = shape_three
-
-    do_split = False
-    do_split_2 = False
-    do_split_3 = False
-
-    if query_device_type != "xpu":
-        return do_split, do_split_2, do_split_3, split_slice_size, split_2_slice_size, split_3_slice_size
-
-    if block_size > attention_slice_rate:
-        do_split = True
-        split_slice_size = find_slice_size(split_slice_size, slice_block_size)
-        if split_slice_size * slice_block_size > attention_slice_rate:
-            slice_2_block_size = split_slice_size * shape_three * shape_four / 1024 / 1024 * query_element_size
-            do_split_2 = True
-            split_2_slice_size = find_slice_size(split_2_slice_size, slice_2_block_size)
-            if split_2_slice_size * slice_2_block_size > attention_slice_rate:
-                slice_3_block_size = split_slice_size * split_2_slice_size * shape_four / 1024 / 1024 * query_element_size
-                do_split_3 = True
-                split_3_slice_size = find_slice_size(split_3_slice_size, slice_3_block_size)
-
-    return do_split, do_split_2, do_split_3, split_slice_size, split_2_slice_size, split_3_slice_size
-
-class SlicedAttnProcessor: # pylint: disable=too-few-public-methods
-    r"""
-    Processor for implementing sliced attention.
-
-    Args:
-        slice_size (`int`, *optional*):
-            The number of steps to compute attention. Uses as many slices as `attention_head_dim // slice_size`, and
-            `attention_head_dim` must be a multiple of the `slice_size`.
-    """
-
-    def __init__(self, slice_size):
-        self.slice_size = slice_size
-
-    def __call__(self, attn: Attention, hidden_states: torch.FloatTensor,
-    encoder_hidden_states=None, attention_mask=None) -> torch.FloatTensor: # pylint: disable=too-many-statements, too-many-locals, too-many-branches
-
-        residual = hidden_states
-
-        input_ndim = hidden_states.ndim
-
-        if input_ndim == 4:
-            batch_size, channel, height, width = hidden_states.shape
-            hidden_states = hidden_states.view(batch_size, channel, height * width).transpose(1, 2)
-
-        batch_size, sequence_length, _ = (
-            hidden_states.shape if encoder_hidden_states is None else encoder_hidden_states.shape
-        )
-        attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length, batch_size)
-
-        if attn.group_norm is not None:
-            hidden_states = attn.group_norm(hidden_states.transpose(1, 2)).transpose(1, 2)
-
-        query = attn.to_q(hidden_states)
-        dim = query.shape[-1]
-        query = attn.head_to_batch_dim(query)
-
-        if encoder_hidden_states is None:
-            encoder_hidden_states = hidden_states
-        elif attn.norm_cross:
-            encoder_hidden_states = attn.norm_encoder_hidden_states(encoder_hidden_states)
-
-        key = attn.to_k(encoder_hidden_states)
-        value = attn.to_v(encoder_hidden_states)
-        key = attn.head_to_batch_dim(key)
-        value = attn.head_to_batch_dim(value)
-
-        batch_size_attention, query_tokens, shape_three = query.shape
-        hidden_states = torch.zeros(
-            (batch_size_attention, query_tokens, dim // attn.heads), device=query.device, dtype=query.dtype
-        )
-
-        ####################################################################
-        # ARC GPUs can't allocate more than 4GB to a single block, Slice it:
-        _, do_split_2, do_split_3, split_slice_size, split_2_slice_size, split_3_slice_size = find_attention_slice_sizes(query.shape, query.element_size(), query.device.type, slice_size=self.slice_size)
-
-        for i in range(batch_size_attention // split_slice_size):
-            start_idx = i * split_slice_size
-            end_idx = (i + 1) * split_slice_size
-            if do_split_2:
-                for i2 in range(query_tokens // split_2_slice_size): # pylint: disable=invalid-name
-                    start_idx_2 = i2 * split_2_slice_size
-                    end_idx_2 = (i2 + 1) * split_2_slice_size
-                    if do_split_3:
-                        for i3 in range(shape_three // split_3_slice_size): # pylint: disable=invalid-name
-                            start_idx_3 = i3 * split_3_slice_size
-                            end_idx_3 = (i3 + 1) * split_3_slice_size
-
-                            query_slice = query[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3]
-                            key_slice = key[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3]
-                            attn_mask_slice = attention_mask[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3] if attention_mask is not None else None
-
-                            attn_slice = attn.get_attention_scores(query_slice, key_slice, attn_mask_slice)
-                            del query_slice
-                            del key_slice
-                            del attn_mask_slice
-                            attn_slice = torch.bmm(attn_slice, value[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3])
-
-                            hidden_states[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3] = attn_slice
-                            del attn_slice
-                    else:
-                        query_slice = query[start_idx:end_idx, start_idx_2:end_idx_2]
-                        key_slice = key[start_idx:end_idx, start_idx_2:end_idx_2]
-                        attn_mask_slice = attention_mask[start_idx:end_idx, start_idx_2:end_idx_2] if attention_mask is not None else None
-
-                        attn_slice = attn.get_attention_scores(query_slice, key_slice, attn_mask_slice)
-                        del query_slice
-                        del key_slice
-                        del attn_mask_slice
-                        attn_slice = torch.bmm(attn_slice, value[start_idx:end_idx, start_idx_2:end_idx_2])
-
-                        hidden_states[start_idx:end_idx, start_idx_2:end_idx_2] = attn_slice
-                        del attn_slice
-                torch.xpu.synchronize(query.device)
-            else:
-                query_slice = query[start_idx:end_idx]
-                key_slice = key[start_idx:end_idx]
-                attn_mask_slice = attention_mask[start_idx:end_idx] if attention_mask is not None else None
-
-                attn_slice = attn.get_attention_scores(query_slice, key_slice, attn_mask_slice)
-                del query_slice
-                del key_slice
-                del attn_mask_slice
-                attn_slice = torch.bmm(attn_slice, value[start_idx:end_idx])
-
-                hidden_states[start_idx:end_idx] = attn_slice
-                del attn_slice
-        ####################################################################
-
-        hidden_states = attn.batch_to_head_dim(hidden_states)
-
-        # linear proj
-        hidden_states = attn.to_out[0](hidden_states)
-        # dropout
-        hidden_states = attn.to_out[1](hidden_states)
-
-        if input_ndim == 4:
-            hidden_states = hidden_states.transpose(-1, -2).reshape(batch_size, channel, height, width)
-
-        if attn.residual_connection:
-            hidden_states = hidden_states + residual
-
-        hidden_states = hidden_states / attn.rescale_output_factor
-
-        return hidden_states
-
-
-class AttnProcessor:
-    r"""
-    Default processor for performing attention-related computations.
-    """
-
-    def __call__(self, attn: Attention, hidden_states: torch.FloatTensor,
-    encoder_hidden_states=None, attention_mask=None,
-    temb=None, scale: float = 1.0) -> torch.Tensor: # pylint: disable=too-many-statements, too-many-locals, too-many-branches
-
-        residual = hidden_states
-
-        args = () if USE_PEFT_BACKEND else (scale,)
-
-        if attn.spatial_norm is not None:
-            hidden_states = attn.spatial_norm(hidden_states, temb)
-
-        input_ndim = hidden_states.ndim
-
-        if input_ndim == 4:
-            batch_size, channel, height, width = hidden_states.shape
-            hidden_states = hidden_states.view(batch_size, channel, height * width).transpose(1, 2)
-
-        batch_size, sequence_length, _ = (
-            hidden_states.shape if encoder_hidden_states is None else encoder_hidden_states.shape
-        )
-        attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length, batch_size)
-
-        if attn.group_norm is not None:
-            hidden_states = attn.group_norm(hidden_states.transpose(1, 2)).transpose(1, 2)
-
-        query = attn.to_q(hidden_states, *args)
-
-        if encoder_hidden_states is None:
-            encoder_hidden_states = hidden_states
-        elif attn.norm_cross:
-            encoder_hidden_states = attn.norm_encoder_hidden_states(encoder_hidden_states)
-
-        key = attn.to_k(encoder_hidden_states, *args)
-        value = attn.to_v(encoder_hidden_states, *args)
-
-        query = attn.head_to_batch_dim(query)
-        key = attn.head_to_batch_dim(key)
-        value = attn.head_to_batch_dim(value)
-
-        ####################################################################
-        # ARC GPUs can't allocate more than 4GB to a single block, Slice it:
-        batch_size_attention, query_tokens, shape_three = query.shape[0], query.shape[1], query.shape[2]
-        hidden_states = torch.zeros(query.shape, device=query.device, dtype=query.dtype)
-        do_split, do_split_2, do_split_3, split_slice_size, split_2_slice_size, split_3_slice_size = find_attention_slice_sizes(query.shape, query.element_size(), query.device.type)
-
-        if do_split:
-            for i in range(batch_size_attention // split_slice_size):
-                start_idx = i * split_slice_size
-                end_idx = (i + 1) * split_slice_size
-                if do_split_2:
-                    for i2 in range(query_tokens // split_2_slice_size): # pylint: disable=invalid-name
-                        start_idx_2 = i2 * split_2_slice_size
-                        end_idx_2 = (i2 + 1) * split_2_slice_size
-                        if do_split_3:
-                            for i3 in range(shape_three // split_3_slice_size): # pylint: disable=invalid-name
-                                start_idx_3 = i3 * split_3_slice_size
-                                end_idx_3 = (i3 + 1) * split_3_slice_size
-
-                                query_slice = query[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3]
-                                key_slice = key[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3]
-                                attn_mask_slice = attention_mask[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3] if attention_mask is not None else None
-
-                                attn_slice = attn.get_attention_scores(query_slice, key_slice, attn_mask_slice)
-                                del query_slice
-                                del key_slice
-                                del attn_mask_slice
-                                attn_slice = torch.bmm(attn_slice, value[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3])
-
-                                hidden_states[start_idx:end_idx, start_idx_2:end_idx_2, start_idx_3:end_idx_3] = attn_slice
-                                del attn_slice
-                        else:
-                            query_slice = query[start_idx:end_idx, start_idx_2:end_idx_2]
-                            key_slice = key[start_idx:end_idx, start_idx_2:end_idx_2]
-                            attn_mask_slice = attention_mask[start_idx:end_idx, start_idx_2:end_idx_2] if attention_mask is not None else None
-
-                            attn_slice = attn.get_attention_scores(query_slice, key_slice, attn_mask_slice)
-                            del query_slice
-                            del key_slice
-                            del attn_mask_slice
-                            attn_slice = torch.bmm(attn_slice, value[start_idx:end_idx, start_idx_2:end_idx_2])
-
-                            hidden_states[start_idx:end_idx, start_idx_2:end_idx_2] = attn_slice
-                            del attn_slice
-                else:
-                    query_slice = query[start_idx:end_idx]
-                    key_slice = key[start_idx:end_idx]
-                    attn_mask_slice = attention_mask[start_idx:end_idx] if attention_mask is not None else None
-
-                    attn_slice = attn.get_attention_scores(query_slice, key_slice, attn_mask_slice)
-                    del query_slice
-                    del key_slice
-                    del attn_mask_slice
-                    attn_slice = torch.bmm(attn_slice, value[start_idx:end_idx])
-
-                    hidden_states[start_idx:end_idx] = attn_slice
-                    del attn_slice
-            torch.xpu.synchronize(query.device)
-        else:
-            attention_probs = attn.get_attention_scores(query, key, attention_mask)
-            hidden_states = torch.bmm(attention_probs, value)
-        ####################################################################
-        hidden_states = attn.batch_to_head_dim(hidden_states)
-
-        # linear proj
-        hidden_states = attn.to_out[0](hidden_states, *args)
-        # dropout
-        hidden_states = attn.to_out[1](hidden_states)
-
-        if input_ndim == 4:
-            hidden_states = hidden_states.transpose(-1, -2).reshape(batch_size, channel, height, width)
-
-        if attn.residual_connection:
-            hidden_states = hidden_states + residual
-
-        hidden_states = hidden_states / attn.rescale_output_factor
-
-        return hidden_states
-
-def ipex_diffusers():
-    #ARC GPUs can't allocate more than 4GB to a single block:
-    diffusers.models.attention_processor.SlicedAttnProcessor = SlicedAttnProcessor
-    diffusers.models.attention_processor.AttnProcessor = AttnProcessor
+# Diffusers FreeU
+original_fourier_filter = diffusers.utils.torch_utils.fourier_filter
+@wraps(diffusers.utils.torch_utils.fourier_filter)
+def fourier_filter(x_in, threshold, scale):
+    return_dtype = x_in.dtype
+    return original_fourier_filter(x_in.to(dtype=torch.float32), threshold, scale).to(dtype=return_dtype)
+
+
+# fp64 error
+class FluxPosEmbed(torch.nn.Module):
+    def __init__(self, theta: int, axes_dim):
+        super().__init__()
+        self.theta = theta
+        self.axes_dim = axes_dim
+
+    def forward(self, ids: torch.Tensor) -> torch.Tensor:
+        n_axes = ids.shape[-1]
+        cos_out = []
+        sin_out = []
+        pos = ids.float()
+        for i in range(n_axes):
+            cos, sin = diffusers.models.embeddings.get_1d_rotary_pos_embed(
+                self.axes_dim[i],
+                pos[:, i],
+                theta=self.theta,
+                repeat_interleave_real=True,
+                use_real=True,
+                freqs_dtype=torch.float32,
+            )
+            cos_out.append(cos)
+            sin_out.append(sin)
+        freqs_cos = torch.cat(cos_out, dim=-1).to(ids.device)
+        freqs_sin = torch.cat(sin_out, dim=-1).to(ids.device)
+        return freqs_cos, freqs_sin
+
+
+def ipex_diffusers(device_supports_fp64=False, can_allocate_plus_4gb=False):
+    diffusers.utils.torch_utils.fourier_filter = fourier_filter
+    if not device_supports_fp64:
+        diffusers.models.embeddings.FluxPosEmbed = FluxPosEmbed
--- a/library/ipex/gradscaler.py
+++ b/library/ipex/gradscaler.py
@@ -5,7 +5,7 @@ import intel_extension_for_pytorch._C as core # pylint: disable=import-error, un

 # pylint: disable=protected-access, missing-function-docstring, line-too-long

-device_supports_fp64 = torch.xpu.has_fp64_dtype()
+device_supports_fp64 = torch.xpu.has_fp64_dtype() if hasattr(torch.xpu, "has_fp64_dtype") else torch.xpu.get_device_properties("xpu").has_fp64
 OptState = ipex.cpu.autocast._grad_scaler.OptState
 _MultiDeviceReplicator = ipex.cpu.autocast._grad_scaler._MultiDeviceReplicator
 _refresh_per_optimizer_state = ipex.cpu.autocast._grad_scaler._refresh_per_optimizer_state
--- a/library/ipex/hijacks.py
+++ b/library/ipex/hijacks.py
@@ -2,10 +2,19 @@ import os
 from functools import wraps
 from contextlib import nullcontext
 import torch
-import intel_extension_for_pytorch as ipex # pylint: disable=import-error, unused-import
 import numpy as np

-device_supports_fp64 = torch.xpu.has_fp64_dtype()
+device_supports_fp64 = torch.xpu.has_fp64_dtype() if hasattr(torch.xpu, "has_fp64_dtype") else torch.xpu.get_device_properties("xpu").has_fp64
+if os.environ.get('IPEX_FORCE_ATTENTION_SLICE', '0') == '0' and (torch.xpu.get_device_properties("xpu").total_memory / 1024 / 1024 / 1024) > 4.1:
+    try:
+        x = torch.ones((33000,33000), dtype=torch.float32, device="xpu")
+        del x
+        torch.xpu.empty_cache()
+        can_allocate_plus_4gb = True
+    except Exception:
+        can_allocate_plus_4gb = False
+else:
+    can_allocate_plus_4gb = bool(os.environ.get('IPEX_FORCE_ATTENTION_SLICE', '0') == '-1')

 # pylint: disable=protected-access, missing-function-docstring, line-too-long, unnecessary-lambda, no-else-return

@@ -26,7 +35,7 @@ def check_device(device):
    return bool((isinstance(device, torch.device) and device.type == "cuda") or (isinstance(device, str) and "cuda" in device) or isinstance(device, int))

 def return_xpu(device):
-    return f"xpu:{device.split(':')[-1]}" if isinstance(device, str) and ":" in device else f"xpu:{device}" if isinstance(device, int) else torch.device("xpu") if isinstance(device, torch.device) else "xpu"
+    return f"xpu:{device.split(':')[-1]}" if isinstance(device, str) and ":" in device else f"xpu:{device}" if isinstance(device, int) else torch.device(f"xpu:{device.index}" if device.index is not None else "xpu") if isinstance(device, torch.device) else "xpu"


 # Autocast
@@ -42,7 +51,7 @@ def autocast_init(self, device_type, dtype=None, enabled=True, cache_enabled=Non
 original_interpolate = torch.nn.functional.interpolate
@wraps(torch.nn.functional.interpolate)
 def interpolate(tensor, size=None, scale_factor=None, mode='nearest', align_corners=None, recompute_scale_factor=None, antialias=False): # pylint: disable=too-many-arguments
-    if antialias or align_corners is not None or mode == 'bicubic':
+    if mode in {'bicubic', 'bilinear'}:
        return_device = tensor.device
        return_dtype = tensor.dtype
        return original_interpolate(tensor.to("cpu", dtype=torch.float32), size=size, scale_factor=scale_factor, mode=mode,
@@ -73,35 +82,46 @@ def as_tensor(data, dtype=None, device=None):
        return original_as_tensor(data, dtype=dtype, device=device)


-if device_supports_fp64 and os.environ.get('IPEX_FORCE_ATTENTION_SLICE', None) is None:
-    original_torch_bmm = torch.bmm
+if can_allocate_plus_4gb:
    original_scaled_dot_product_attention = torch.nn.functional.scaled_dot_product_attention
 else:
    # 32 bit attention workarounds for Alchemist:
    try:
-        from .attention import torch_bmm_32_bit as original_torch_bmm
-        from .attention import scaled_dot_product_attention_32_bit as original_scaled_dot_product_attention
+        from .attention import dynamic_scaled_dot_product_attention as original_scaled_dot_product_attention
    except Exception: # pylint: disable=broad-exception-caught
-        original_torch_bmm = torch.bmm
        original_scaled_dot_product_attention = torch.nn.functional.scaled_dot_product_attention

-
-# Data Type Errors:
-@wraps(torch.bmm)
-def torch_bmm(input, mat2, *, out=None):
-    if input.dtype != mat2.dtype:
-        mat2 = mat2.to(input.dtype)
-    return original_torch_bmm(input, mat2, out=out)
-
@wraps(torch.nn.functional.scaled_dot_product_attention)
-def scaled_dot_product_attention(query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False):
+def scaled_dot_product_attention(query, key, value, attn_mask=None, dropout_p=0.0, is_causal=False, **kwargs):
    if query.dtype != key.dtype:
        key = key.to(dtype=query.dtype)
    if query.dtype != value.dtype:
        value = value.to(dtype=query.dtype)
    if attn_mask is not None and query.dtype != attn_mask.dtype:
        attn_mask = attn_mask.to(dtype=query.dtype)
-    return original_scaled_dot_product_attention(query, key, value, attn_mask=attn_mask, dropout_p=dropout_p, is_causal=is_causal)
+    return original_scaled_dot_product_attention(query, key, value, attn_mask=attn_mask, dropout_p=dropout_p, is_causal=is_causal, **kwargs)
+
+# Data Type Errors:
+original_torch_bmm = torch.bmm
+@wraps(torch.bmm)
+def torch_bmm(input, mat2, *, out=None):
+    if input.dtype != mat2.dtype:
+        mat2 = mat2.to(input.dtype)
+    return original_torch_bmm(input, mat2, out=out)
+
+# Diffusers FreeU
+original_fft_fftn = torch.fft.fftn
+@wraps(torch.fft.fftn)
+def fft_fftn(input, s=None, dim=None, norm=None, *, out=None):
+    return_dtype = input.dtype
+    return original_fft_fftn(input.to(dtype=torch.float32), s=s, dim=dim, norm=norm, out=out).to(dtype=return_dtype)
+
+# Diffusers FreeU
+original_fft_ifftn = torch.fft.ifftn
+@wraps(torch.fft.ifftn)
+def fft_ifftn(input, s=None, dim=None, norm=None, *, out=None):
+    return_dtype = input.dtype
+    return original_fft_ifftn(input.to(dtype=torch.float32), s=s, dim=dim, norm=norm, out=out).to(dtype=return_dtype)

 # A1111 FP16
 original_functional_group_norm = torch.nn.functional.group_norm
@@ -133,6 +153,15 @@ def functional_linear(input, weight, bias=None):
        bias.data = bias.data.to(dtype=weight.data.dtype)
    return original_functional_linear(input, weight, bias=bias)

+original_functional_conv1d = torch.nn.functional.conv1d
+@wraps(torch.nn.functional.conv1d)
+def functional_conv1d(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1):
+    if input.dtype != weight.data.dtype:
+        input = input.to(dtype=weight.data.dtype)
+    if bias is not None and bias.data.dtype != weight.data.dtype:
+        bias.data = bias.data.to(dtype=weight.data.dtype)
+    return original_functional_conv1d(input, weight, bias=bias, stride=stride, padding=padding, dilation=dilation, groups=groups)
+
 original_functional_conv2d = torch.nn.functional.conv2d
@wraps(torch.nn.functional.conv2d)
 def functional_conv2d(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1):
@@ -142,14 +171,15 @@ def functional_conv2d(input, weight, bias=None, stride=1, padding=0, dilation=1,
        bias.data = bias.data.to(dtype=weight.data.dtype)
    return original_functional_conv2d(input, weight, bias=bias, stride=stride, padding=padding, dilation=dilation, groups=groups)

-# A1111 Embedding BF16
-original_torch_cat = torch.cat
-@wraps(torch.cat)
-def torch_cat(tensor, *args, **kwargs):
-    if len(tensor) == 3 and (tensor[0].dtype != tensor[1].dtype or tensor[2].dtype != tensor[1].dtype):
-        return original_torch_cat([tensor[0].to(tensor[1].dtype), tensor[1], tensor[2].to(tensor[1].dtype)], *args, **kwargs)
-    else:
-        return original_torch_cat(tensor, *args, **kwargs)
+# LTX Video
+original_functional_conv3d = torch.nn.functional.conv3d
+@wraps(torch.nn.functional.conv3d)
+def functional_conv3d(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1):
+    if input.dtype != weight.data.dtype:
+        input = input.to(dtype=weight.data.dtype)
+    if bias is not None and bias.data.dtype != weight.data.dtype:
+        bias.data = bias.data.to(dtype=weight.data.dtype)
+    return original_functional_conv3d(input, weight, bias=bias, stride=stride, padding=padding, dilation=dilation, groups=groups)

 # SwinIR BF16:
 original_functional_pad = torch.nn.functional.pad
@@ -164,6 +194,7 @@ def functional_pad(input, pad, mode='constant', value=None):
 original_torch_tensor = torch.tensor
@wraps(torch.tensor)
 def torch_tensor(data, *args, dtype=None, device=None, **kwargs):
+    global device_supports_fp64
    if check_device(device):
        device = return_xpu(device)
    if not device_supports_fp64:
@@ -227,7 +258,7 @@ def torch_empty(*args, device=None, **kwargs):
 original_torch_randn = torch.randn
@wraps(torch.randn)
 def torch_randn(*args, device=None, dtype=None, **kwargs):
-    if dtype == bytes:
+    if dtype is bytes:
        dtype = None
    if check_device(device):
        return original_torch_randn(*args, device=return_xpu(device), **kwargs)
@@ -250,6 +281,14 @@ def torch_zeros(*args, device=None, **kwargs):
    else:
        return original_torch_zeros(*args, device=device, **kwargs)

+original_torch_full = torch.full
+@wraps(torch.full)
+def torch_full(*args, device=None, **kwargs):
+    if check_device(device):
+        return original_torch_full(*args, device=return_xpu(device), **kwargs)
+    else:
+        return original_torch_full(*args, device=device, **kwargs)
+
 original_torch_linspace = torch.linspace
@wraps(torch.linspace)
 def torch_linspace(*args, device=None, **kwargs):
@@ -258,14 +297,6 @@ def torch_linspace(*args, device=None, **kwargs):
    else:
        return original_torch_linspace(*args, device=device, **kwargs)

-original_torch_Generator = torch.Generator
-@wraps(torch.Generator)
-def torch_Generator(device=None):
-    if check_device(device):
-        return original_torch_Generator(return_xpu(device))
-    else:
-        return original_torch_Generator(device)
-
 original_torch_load = torch.load
@wraps(torch.load)
 def torch_load(f, map_location=None, *args, **kwargs):
@@ -276,9 +307,27 @@ def torch_load(f, map_location=None, *args, **kwargs):
    else:
        return original_torch_load(f, *args, map_location=map_location, **kwargs)

+original_torch_Generator = torch.Generator
+@wraps(torch.Generator)
+def torch_Generator(device=None):
+    if check_device(device):
+        return original_torch_Generator(return_xpu(device))
+    else:
+        return original_torch_Generator(device)
+
+@wraps(torch.cuda.synchronize)
+def torch_cuda_synchronize(device=None):
+    if check_device(device):
+        return torch.xpu.synchronize(return_xpu(device))
+    else:
+        return torch.xpu.synchronize(device)
+

 # Hijack Functions:
-def ipex_hijacks():
+def ipex_hijacks(legacy=True):
+    global device_supports_fp64, can_allocate_plus_4gb
+    if legacy and float(torch.__version__[:3]) < 2.5:
+        torch.nn.functional.interpolate = interpolate
    torch.tensor = torch_tensor
    torch.Tensor.to = Tensor_to
    torch.Tensor.cuda = Tensor_cuda
@@ -289,9 +338,11 @@ def ipex_hijacks():
    torch.randn = torch_randn
    torch.ones = torch_ones
    torch.zeros = torch_zeros
+    torch.full = torch_full
    torch.linspace = torch_linspace
-    torch.Generator = torch_Generator
    torch.load = torch_load
+    torch.Generator = torch_Generator
+    torch.cuda.synchronize = torch_cuda_synchronize

    torch.backends.cuda.sdp_kernel = return_null_context
    torch.nn.DataParallel = DummyDataParallel
@@ -302,12 +353,15 @@ def ipex_hijacks():
    torch.nn.functional.group_norm = functional_group_norm
    torch.nn.functional.layer_norm = functional_layer_norm
    torch.nn.functional.linear = functional_linear
+    torch.nn.functional.conv1d = functional_conv1d
    torch.nn.functional.conv2d = functional_conv2d
-    torch.nn.functional.interpolate = interpolate
+    torch.nn.functional.conv3d = functional_conv3d
    torch.nn.functional.pad = functional_pad

    torch.bmm = torch_bmm
-    torch.cat = torch_cat
+    torch.fft.fftn = fft_fftn
+    torch.fft.ifftn = fft_ifftn
    if not device_supports_fp64:
        torch.from_numpy = from_numpy
        torch.as_tensor = as_tensor
+    return device_supports_fp64, can_allocate_plus_4gb
--- a/library/model_util.py
+++ b/library/model_util.py
@@ -643,16 +643,15 @@ def convert_ldm_clip_checkpoint_v2(checkpoint, max_length):
            new_sd[key_pfx + "k_proj" + key_suffix] = values[1]
            new_sd[key_pfx + "v_proj" + key_suffix] = values[2]

-    # rename or add position_ids
+    # remove position_ids for newer transformer, which causes error :(
    ANOTHER_POSITION_IDS_KEY = "text_model.encoder.text_model.embeddings.position_ids"
    if ANOTHER_POSITION_IDS_KEY in new_sd:
        # waifu diffusion v1.4
-        position_ids = new_sd[ANOTHER_POSITION_IDS_KEY]
        del new_sd[ANOTHER_POSITION_IDS_KEY]
-    else:
-        position_ids = torch.Tensor([list(range(max_length))]).to(torch.int64)

-    new_sd["text_model.embeddings.position_ids"] = position_ids
+    if "text_model.embeddings.position_ids" in new_sd:
+        del new_sd["text_model.embeddings.position_ids"]
+
    return new_sd


--- a/library/sdxl_train_util.py
+++ b/library/sdxl_train_util.py
@@ -345,8 +345,6 @@ def add_sdxl_training_arguments(parser: argparse.ArgumentParser):

 def verify_sdxl_training_args(args: argparse.Namespace, supportTextEncoderCaching: bool = True):
    assert not args.v2, "v2 cannot be enabled in SDXL training / SDXL学習ではv2を有効にすることはできません"
-    if args.v_parameterization:
-        logger.warning("v_parameterization will be unexpected / SDXL学習ではv_parameterizationは想定外の動作になります")

    if args.clip_skip is not None:
        logger.warning("clip_skip will be unexpected / SDXL学習ではclip_skipは動作しません")
--- a/library/train_util.py
+++ b/library/train_util.py
@@ -957,8 +957,11 @@ class BaseDataset(torch.utils.data.Dataset):
                    self.bucket_info["buckets"][i] = {"resolution": reso, "count": len(bucket)}
                    logger.info(f"bucket {i}: resolution {reso}, count: {len(bucket)}")

-            img_ar_errors = np.array(img_ar_errors)
-            mean_img_ar_error = np.mean(np.abs(img_ar_errors))
+            if len(img_ar_errors) == 0:
+                mean_img_ar_error = 0  # avoid NaN
+            else:
+                img_ar_errors = np.array(img_ar_errors)
+                mean_img_ar_error = np.mean(np.abs(img_ar_errors))
            self.bucket_info["mean_img_ar_error"] = mean_img_ar_error
            logger.info(f"mean ar error (without repeats): {mean_img_ar_error}")

--- a/networks/dylora.py
+++ b/networks/dylora.py
@@ -268,7 +268,7 @@ def create_network_from_weights(multiplier, file, vae, text_encoder, unet, weigh
 class DyLoRANetwork(torch.nn.Module):
    UNET_TARGET_REPLACE_MODULE = ["Transformer2DModel"]
    UNET_TARGET_REPLACE_MODULE_CONV2D_3X3 = ["ResnetBlock2D", "Downsample2D", "Upsample2D"]
-    TEXT_ENCODER_TARGET_REPLACE_MODULE = ["CLIPAttention", "CLIPMLP"]
+    TEXT_ENCODER_TARGET_REPLACE_MODULE = ["CLIPAttention", "CLIPSdpaAttention", "CLIPMLP"]
    LORA_PREFIX_UNET = "lora_unet"
    LORA_PREFIX_TEXT_ENCODER = "lora_te"

--- a/networks/lora.py
+++ b/networks/lora.py
@@ -866,7 +866,7 @@ class LoRANetwork(torch.nn.Module):

    UNET_TARGET_REPLACE_MODULE = ["Transformer2DModel"]
    UNET_TARGET_REPLACE_MODULE_CONV2D_3X3 = ["ResnetBlock2D", "Downsample2D", "Upsample2D"]
-    TEXT_ENCODER_TARGET_REPLACE_MODULE = ["CLIPAttention", "CLIPMLP"]
+    TEXT_ENCODER_TARGET_REPLACE_MODULE = ["CLIPAttention", "CLIPSdpaAttention", "CLIPMLP"]
    LORA_PREFIX_UNET = "lora_unet"
    LORA_PREFIX_TEXT_ENCODER = "lora_te"

--- a/networks/lora_diffusers.py
+++ b/networks/lora_diffusers.py
@@ -278,7 +278,7 @@ def merge_lora_weights(pipe, weights_sd: Dict, multiplier: float = 1.0):
 class LoRANetwork(torch.nn.Module):
    UNET_TARGET_REPLACE_MODULE = ["Transformer2DModel"]
    UNET_TARGET_REPLACE_MODULE_CONV2D_3X3 = ["ResnetBlock2D", "Downsample2D", "Upsample2D"]
-    TEXT_ENCODER_TARGET_REPLACE_MODULE = ["CLIPAttention", "CLIPMLP"]
+    TEXT_ENCODER_TARGET_REPLACE_MODULE = ["CLIPAttention", "CLIPSdpaAttention", "CLIPMLP"]
    LORA_PREFIX_UNET = "lora_unet"
    LORA_PREFIX_TEXT_ENCODER = "lora_te"

--- a/networks/lora_fa.py
+++ b/networks/lora_fa.py
@@ -755,7 +755,7 @@ class LoRANetwork(torch.nn.Module):

    UNET_TARGET_REPLACE_MODULE = ["Transformer2DModel"]
    UNET_TARGET_REPLACE_MODULE_CONV2D_3X3 = ["ResnetBlock2D", "Downsample2D", "Upsample2D"]
-    TEXT_ENCODER_TARGET_REPLACE_MODULE = ["CLIPAttention", "CLIPMLP"]
+    TEXT_ENCODER_TARGET_REPLACE_MODULE = ["CLIPAttention", "CLIPSdpaAttention", "CLIPMLP"]
    LORA_PREFIX_UNET = "lora_unet"
    LORA_PREFIX_TEXT_ENCODER = "lora_te"

--- a/sdxl_train_network.py
+++ b/sdxl_train_network.py
@@ -18,7 +18,6 @@ class SdxlNetworkTrainer(train_network.NetworkTrainer):
        self.is_sdxl = True

    def assert_extra_args(self, args, train_dataset_group):
-        super().assert_extra_args(args, train_dataset_group)
        sdxl_train_util.verify_sdxl_training_args(args)

        if args.cache_text_encoder_outputs:
--- a/train_network.py
+++ b/train_network.py
@@ -912,14 +912,22 @@ class NetworkTrainer:
                    if "latents" in batch and batch["latents"] is not None:
                        latents = batch["latents"].to(accelerator.device).to(dtype=weight_dtype)
                    else:
-                        with torch.no_grad():
-                            # latentに変換
-                            latents = vae.encode(batch["images"].to(dtype=vae_dtype)).latent_dist.sample().to(dtype=weight_dtype)
-
+                        if args.vae_batch_size is None or len(batch["images"]) <= args.vae_batch_size:
+                            with torch.no_grad():
+                                # latentに変換
+                                latents = vae.encode(batch["images"].to(dtype=vae_dtype)).latent_dist.sample().to(dtype=weight_dtype)
+                        else:
+                            chunks = [batch["images"][i:i + args.vae_batch_size] for i in range(0, len(batch["images"]), args.vae_batch_size)]
+                            list_latents = []
+                            for chunk in chunks:
+                                with torch.no_grad():
+                                # latentに変換
+                                    list_latents.append(vae.encode(chunk.to(dtype=vae_dtype)).latent_dist.sample().to(dtype=weight_dtype))
+                            latents = torch.cat(list_latents, dim=0)
                            # NaNが含まれていれば警告を表示し0に置き換える
-                            if torch.any(torch.isnan(latents)):
-                                accelerator.print("NaN found in latents, replacing with zeros")
-                                latents = torch.nan_to_num(latents, 0, out=latents)
+                        if torch.any(torch.isnan(latents)):
+                            accelerator.print("NaN found in latents, replacing with zeros")
+                            latents = torch.nan_to_num(latents, 0, out=latents)
                    latents = latents * self.vae_scale_factor

                    # get multiplier for each sample
Author	SHA1	Message	Date
copilot-swe-agent[bot]	7dc64e04e2	Initial plan	2025-07-05 00:41:07 +00:00
Kohya S.	a21b6a917e	Merge pull request #2070 from kohya-ss/fix-mean-ar-error-nan Fix mean image aspect ratio error calculation to avoid NaN values	2025-04-29 21:29:42 +09:00
Kohya S	4625b34f4e	Fix mean image aspect ratio error calculation to avoid NaN values	2025-04-29 21:27:04 +09:00
Kohya S.	3b25de1f17	Merge pull request #2065 from kohya-ss/kohya-ss-funding-yml Create FUNDING.yml	2025-04-27 21:29:44 +09:00
Kohya S.	f0b07c52ab	Create FUNDING.yml	2025-04-27 21:28:38 +09:00
Kohya S.	52c8dec953	Merge pull request #2015 from DKnight54/uncache_vae_batch Using --vae_batch_size to set batch size for dynamic latent generation	2025-04-07 21:48:02 +09:00
DKnight54	381303d64f	Update train_network.py	2025-03-29 02:26:18 +08:00
Kohya S.	a0f11730f7	Merge pull request #1966 from sdbds/faster_fix_sdxl Fatser fix bug for SDXL super SD1.5 assert cant use 32	2025-03-24 21:53:42 +09:00
Kohya S	5253a38783	Merge branch 'main' into dev	2025-03-21 22:07:03 +09:00
Kohya S	8f4ee8fc34	doc: update README for latest	2025-03-21 22:05:48 +09:00
Kohya S.	367f348430	Merge pull request #1964 from Nekotekina/main Fix missing text encoder attn modules	2025-03-21 21:59:03 +09:00
sdbds	3f49053c90	fatser fix bug for SDXL super SD1.5 assert cant use 32	2025-03-02 19:32:06 +08:00
Ivan Chikish	acdca2abb7	Fix [occasionally] missing text encoder attn modules Should fix #1952 I added alternative name for CLIPAttention. I have no idea why this name changed. Now it should accept both names.	2025-03-01 20:35:45 +03:00
Kohya S.	b286304e5f	Merge pull request #1953 from Disty0/dev Update IPEX libs	2025-02-26 21:03:09 +09:00
Disty0	f68702f71c	Update IPEX libs	2025-02-25 21:27:41 +03:00
Kohya S.	386b7332c6	Merge pull request #1918 from tsukimiya/fix_vperd_warning Remove v-pred warning.	2025-02-24 18:55:25 +09:00
Kohya S.	59ae9ea20c	Merge pull request #1945 from yidiq7/dev Remove position_ids for V2	2025-02-24 18:53:46 +09:00
Yidi	13df47516d	Remove position_ids for V2 The postions_ids cause errors for the newer version of transformer. This has already been fixed in convert_ldm_clip_checkpoint_v1() but not in v2. The new code applies the same fix to convert_ldm_clip_checkpoint_v2().	2025-02-20 04:49:51 -05:00
tsukimiya	4a71687d20	不要な警告の削除 (おそらく `be14c06267` の修正漏れ )	2025-02-04 00:42:27 +09:00
Kohya S.	6e3c1d0b58	Merge pull request #1879 from kohya-ss/dev merge dev to main	2025-01-17 23:25:56 +09:00
Kohya S	345daaa986	update README for merging	2025-01-17 23:22:38 +09:00