InvokeAI/invokeai/backend/model_manager/load/model_cache
psychedelicious 38343917f8 fix(backend): revert non-blocking device transfer
In #6490 we enabled non-blocking torch device transfers throughout the model manager's memory management code. When using this torch feature, torch attempts to wait until the tensor transfer has completed before allowing any access to the tensor. Theoretically, that should make this a safe feature to use.

This provides a small performance improvement but causes race conditions in some situations. Specific platforms/systems are affected, and complicated data dependencies can make this unsafe.

- Intermittent black images on MPS devices - reported on discord and #6545, fixed with special handling in #6549.
- Intermittent OOMs and black images on a P4000 GPU on Windows - reported in #6613, fixed in this commit.

On my system, I haven't experience any issues with generation, but targeted testing of non-blocking ops did expose a race condition when moving tensors from CUDA to CPU.

One workaround is to use torch streams with manual sync points. Our application logic is complicated enough that this would be a lot of work and feels ripe for edge cases and missed spots.

Much safer is to fully revert non-locking - which is what this change does.
2024-07-16 08:59:42 +10:00
..
__init__.py BREAKING CHANGES: invocations now require model key, not base/type/name 2024-03-01 10:42:33 +11:00
model_cache_base.py Merge remote-tracking branch 'origin/main' into lstein/feat/simple-mm2-api 2024-06-07 14:23:41 +10:00
model_cache_default.py fix(backend): revert non-blocking device transfer 2024-07-16 08:59:42 +10:00
model_locker.py Apply ruff rule to disallow all relative imports. 2024-07-04 09:35:37 -04:00