In #6490 we enabled non-blocking torch device transfers throughout the model manager's memory management code. When using this torch feature, torch attempts to wait until the tensor transfer has completed before allowing any access to the tensor. Theoretically, that should make this a safe feature to use.
This provides a small performance improvement but causes race conditions in some situations. Specific platforms/systems are affected, and complicated data dependencies can make this unsafe.
- Intermittent black images on MPS devices - reported on discord and #6545, fixed with special handling in #6549.
- Intermittent OOMs and black images on a P4000 GPU on Windows - reported in #6613, fixed in this commit.
On my system, I haven't experience any issues with generation, but targeted testing of non-blocking ops did expose a race condition when moving tensors from CUDA to CPU.
One workaround is to use torch streams with manual sync points. Our application logic is complicated enough that this would be a lot of work and feels ripe for edge cases and missed spots.
Much safer is to fully revert non-locking - which is what this change does.
* use model_class.load_singlefile() instead of converting; works, but performance is poor
* adjust the convert api - not right just yet
* working, needs sql migrator update
* rename migration_11 before conflict merge with main
* Update invokeai/backend/model_manager/load/model_loaders/stable_diffusion.py
Co-authored-by: Ryan Dick <ryanjdick3@gmail.com>
* Update invokeai/backend/model_manager/load/model_loaders/stable_diffusion.py
Co-authored-by: Ryan Dick <ryanjdick3@gmail.com>
* implement lightweight version-by-version config migration
* simplified config schema migration code
* associate sdxl config with sdxl VAEs
* remove use of original_config_file in load_single_file()
---------
Co-authored-by: Lincoln Stein <lstein@gmail.com>
Co-authored-by: Ryan Dick <ryanjdick3@gmail.com>
We can get black outputs when moving tensors from CPU to MPS. It appears MPS to CPU is fine. See:
- https://github.com/pytorch/pytorch/issues/107455
- https://discuss.pytorch.org/t/should-we-set-non-blocking-to-true/38234/28
Changes:
- Add properties for each device on `TorchDevice` as a convenience.
- Add `get_non_blocking` static method on `TorchDevice`. This utility takes a torch device and returns the flag to be used for non_blocking when moving a tensor to the device provided.
- Update model patching and caching APIs to use this new utility.
Fixes: #6545
* add support for probing and loading SDXL VAE checkpoint files
* broaden regexp probe for SDXL VAEs
---------
Co-authored-by: Lincoln Stein <lstein@gmail.com>
- Fix type errors
- Enable SilenceWarnings to be used as both a context manager and a decorator
- Remove duplicate implementation
- Check the initial verbosity on __enter__() rather than __init__()
* allow model patcher to optimize away the unpatching step when feasible
* remove lazy_offloading functionality
* allow model patcher to optimize away the unpatching step when feasible
* remove lazy_offloading functionality
* do not save original weights if there is a CPU copy of state dict
* Update invokeai/backend/model_manager/load/load_base.py
Co-authored-by: Ryan Dick <ryanjdick3@gmail.com>
* documentation fixes requested during penultimate review
* add non-blocking=True parameters to several torch.nn.Module.to() calls, for slight performance increases
* fix ruff errors
* prevent crash on non-cuda-enabled systems
---------
Co-authored-by: Lincoln Stein <lstein@gmail.com>
Co-authored-by: Kent Keirsey <31807370+hipsterusername@users.noreply.github.com>
Co-authored-by: Ryan Dick <ryanjdick3@gmail.com>
* allow model patcher to optimize away the unpatching step when feasible
* remove lazy_offloading functionality
* allow model patcher to optimize away the unpatching step when feasible
* remove lazy_offloading functionality
* do not save original weights if there is a CPU copy of state dict
* Update invokeai/backend/model_manager/load/load_base.py
Co-authored-by: Ryan Dick <ryanjdick3@gmail.com>
* documentation fixes added during penultimate review
---------
Co-authored-by: Lincoln Stein <lstein@gmail.com>
Co-authored-by: Kent Keirsey <31807370+hipsterusername@users.noreply.github.com>
Co-authored-by: Ryan Dick <ryanjdick3@gmail.com>
* avoid copying model back from cuda to cpu
* handle models that don't have state dicts
* add assertions that models need a `device()` method
* do not rely on torch.nn.Module having the device() method
* apply all patches after model is on the execution device
* fix model patching in latents too
* log patched tokenizer
* closes#6375
---------
Co-authored-by: Lincoln Stein <lstein@gmail.com>
There are only a couple SDXL inpainting models, and my tests indicate they are not as good as SD1.5 inpainting, but at least we support them now.
- Add the config file. This matches what is used in A1111. The only difference from the non-inpainting SDXL config is the number of in-channels.
- Update the legacy config maps to use this config file.
Pending:
- Move model install calls into model manager and create passthrus in invocation_context.
- Consider splitting load_model_from_url() into a call to get the path and a call to load the path.
* introduce new abstraction layer for GPU devices
* add unit test for device abstraction
* fix ruff
* convert TorchDeviceSelect into a stateless class
* move logic to select context-specific execution device into context API
* add mock hardware environments to pytest
* remove dangling mocker fixture
* fix unit test for running on non-CUDA systems
* remove unimplemented get_execution_device() call
* remove autocast precision
* Multiple changes:
1. Remove TorchDeviceSelect.get_execution_device(), as well as calls to
context.models.get_execution_device().
2. Rename TorchDeviceSelect to TorchDevice
3. Added back the legacy public API defined in `invocation_api`, including
choose_precision().
4. Added a config file migration script to accommodate removal of precision=autocast.
* add deprecation warnings to choose_torch_device() and choose_precision()
* fix test crash
* remove app_config argument from choose_torch_device() and choose_torch_dtype()
---------
Co-authored-by: Lincoln Stein <lstein@gmail.com>
When using refiner with a mask (i.e. inpainting), we don't have noise provided as an input to the node.
This situation uniquely hits a code path that wasn't reviewed when gradient denoising was implemented.
That code path does two things wrong:
- It lerp'd the input latents. This was fixed in 5a1f4cb1ce.
- It added noise to the latents an extra time. This is fixed in this change.
We don't need to add noise in `latents_from_embeddings` because we do it just a lines later in `AddsMaskGuidance`.
- Remove the extraneous call to `add_noise`
- Make `seed` a required arg. We never call the function without seed anyways. If we refactor this in the future, it will be clearer that we need to look at how seed is handled.
- Move the call to create the noise to a deeper conditional, just before we call `AddsMaskGuidance`. The created noise tensor is now only used in that function, no need to create it every time.
Note: Whether or not having both noise and latents as inputs on the node is correct is a separate conversation. This change just fixes the issue with the current setup.
- Allow user-defined precision on MPS.
- Use more explicit logic to handle all possible cases.
- Add comments.
- Remove the app_config args (they were effectively unused, just get the config using the singleton getter util)
The previous algorithm errored if the image wasn't divisible by the tile size. I've reimplemented it from scratch to mitigate this issue.
The new algorithm is simpler. We create a pool of tiles, then use them to create an image composed completely of tiles. If there is any awkwardly sized space on the edge of the image, the tiles are cropped to fit.
Finally, paste the original image over the tile image.
I've added a jupyter notebook to do a smoke test of infilling methods, and 10 test images.
The other infill algorithms can be easily tested with the notebook on the same images, though I didn't set that up yet.
Tested and confirmed this gives results just as good as the earlier infill, though of course they aren't the same due to the change in the algorithm.
Add `dump_path` arg to the converter function & save the model to disk inside the conversion function. This is the same pattern as in the other conversion functions.