Around the time we (I) implemented pydantic events, I noticed a short pause between progress images every 4 or 5 steps when generating with SDXL. It didn't happen with SD1.5, but I did notice that with SD1.5, we'd get 4 or 5 progress events simultaneously. I'd expect one event every ~25ms, matching my it/s with SD1.5. Mysterious!
Digging in, I found an issue is related to our use of a synchronous queue for events. When the event queue is empty, we must call `asyncio.sleep` before checking again. We were sleeping for 100ms.
Said another way, every time we clear the event queue, we have to wait 100ms before another event can be dispatched, even if it is put on the queue immediately after we start waiting. In practice, this means our events get buffered into batches, dispatched once every 100ms.
This explains why I was getting batches of 4 or 5 SD1.5 progress events at once, but not the intermittent SDXL delay.
But this 100ms wait has another effect when the events are put on the queue in intervals that don't perfectly line up with the 100ms wait. This is most noticeable when the time between events is >100ms, and can add up to 100ms delay before the event is dispatched.
For example, say the queue is empty and we start a 100ms wait. Then, immediately after - like 0.01ms later - we push an event on to the queue. We still need to wait another 99.9ms before that event will be dispatched. That's the SDXL delay.
The easy fix is to reduce the sleep to something like 0.01 seconds, but this feels kinda dirty. Can't we just wait on the queue and dispatch every event immediately? Not with the normal synchronous queue - but we can with `asyncio.Queue`.
I switched the events queue to use `asyncio.Queue` (as seen in this commit), which lets us asynchronous wait on the queue in a loop.
Unfortunately, I ran into another issue - events now felt like their timing was inconsistent, but in a different way than with the 100ms sleep. The time between pushing events on the queue and dispatching them was not consistently ~0ms as I'd expect - it was highly variable from ~0ms up to ~100ms.
This is resolved by passing the asyncio loop directly into the events service and using its methods to create the task and interact with the queue. I don't fully understand why this resolved the issue, because either way we are interacting with the same event loop (as shown by `asyncio.get_running_loop()`). I suppose there's some scheduling magic happening.
Our events handling and implementation has a couple pain points:
- Adding or removing data from event payloads requires changes wherever the events are dispatched from.
- We have no type safety for events and need to rely on string matching and dict access when interacting with events.
- Frontend types for socket events must be manually typed. This has caused several bugs.
`fastapi-events` has a neat feature where you can create a pydantic model as an event payload, give it an `__event_name__` attr, and then dispatch the model directly.
This allows us to eliminate a layer of indirection and some unpleasant complexity:
- Event handler callbacks get type hints for their event payloads, and can use `isinstance` on them if needed.
- Event payload construction is now the responsibility of the event itself (a pydantic model), not the service. Every event model has a `build` class method, encapsulating this logic. The build methods are provided as few args as possible. For example, `InvocationStartedEvent.build()` gets the invocation instance and queue item, and can choose the data it wants to include in the event payload.
- Frontend event types may be autogenerated from the OpenAPI schema. We use the payload registry feature of `fastapi-events` to collect all payload models into one place, making it trivial to keep our schema and frontend types in sync.
This commit moves the backend over to this improved event handling setup.
- Use protocol to define callbacks, this allows them to have kwargs
- Shuffle the profiler around a bit
- Move `thread_limit` and `polling_interval` to `__init__`; `start` is called programmatically and will never get these args in practice
- Add `OnNodeError` and `OnNonFatalProcessorError` callbacks
- Move all session/node callbacks to `SessionRunner` - this ensures we dump perf stats before resetting them and generally makes sense to me
- Remove `complete` event from `SessionRunner`, it's essentially the same as `OnAfterRunSession`
- Remove extraneous `next_invocation` block, which would treat a processor error as a node error
- Simplify loops
- Add some callbacks for testing, to be removed before merge
- Metadata is merged with the config. We can simplify the MM substantially and remove the handling for metadata.
- Per discussion, we don't have an ETA for frontend implementation of tags, and with the realization that the tags from CivitAI are largely useless, there's no reason to keep tags in the MM right now. When we are ready to implement tags on the frontend, we can refer back to the implementation here and use it if it supports the design.
- Fix all tests.
Consolidate graph processing logic into session processor.
With graphs as the unit of work, and the session queue distributing graphs, we no longer need the invocation queue or processor.
Instead, the session processor dequeues the next session and processes it in a simple loop, greatly simplifying the app.
- Remove `graph_execution_manager` service.
- Remove `queue` (invocation queue) service.
- Remove `processor` (invocation processor) service.
- Remove queue-related logic from `Invoker`. It now only starts and stops the services, providing them with access to other services.
- Remove unused `invocation_retrieval_error` and `session_retrieval_error` events, these are no longer needed.
- Clean up stats service now that it is less coupled to the rest of the app.
- Refactor cancellation logic - cancellations now originate from session queue (i.e. HTTP cancel endpoint) and are emitted as events. Processor gets the events and sets the canceled event. Access to this event is provided to the invocation context for e.g. the step callback.
- Remove `sessions` router; it provided access to `graph_executions` but that no longer exists.
- ModelMetadataStoreService is now injected into ModelRecordStoreService
(these two services are really joined at the hip, and should someday be merged)
- ModelRecordStoreService is now injected into ModelManagerService
- Reduced timeout value for the various installer and download wait*() methods
- Introduced a Mock modelmanager for testing
- Removed bare print() statement with _logger in the install helper backend.
- Removed unused code from model loader init file
- Made `locker` a private variable in the `LoadedModel` object.
- Fixed up model merge frontend (will be deprecated anyway!)
- Replace legacy model manager service with the v2 manager.
- Update invocations to use new load interface.
- Fixed many but not all type checking errors in the invocations. Most
were unrelated to model manager
- Updated routes. All the new routes live under the route tag
`model_manager_v2`. To avoid confusion with the old routes,
they have the URL prefix `/api/v2/models`. The old routes
have been de-registered.
- Added a pytest for the loader.
- Updated documentation in contributing/MODEL_MANAGER.md
Replace `delete_on_startup: bool` & associated logic with `ephemeral: bool` and `TemporaryDirectory`.
The temp dir is created inside of `output_dir`. For example, if `output_dir` is `invokeai/outputs/tensors/`, then the temp dir might be `invokeai/outputs/tensors/tmpvj35ht7b/`.
The temp dir is cleaned up when the service is stopped, or when it is GC'd if not properly stopped.
In the event of a catastrophic crash where the temp files are not cleaned up, the user can delete the tempdir themselves.
This situation may not occur in normal use, but if you kill the process, python cannot clean up the temp dir itself. This includes running the app in a debugger and killing the debugger process - something I do relatively often.
Tests updated.
- The default is to not delete on startup - feels safer.
- The two services using this class _do_ delete on startup.
- The class has "ephemeral" removed from its name.
- Tests & app updated for this change.
Turns out they are just different enough in purpose that the implementations would be rather unintuitive. I've made a separate ObjectSerializer service to handle tensors and conditioning.
Refined the class a bit too.
Turns out `ItemStorageABC` was almost identical to `PickleStorageBase`. Instead of maintaining separate classes, we can use `ItemStorageABC` for both.
There's only one change needed - the `ItemStorageABC.set` method must return the newly stored item's ID. This allows us to let the service handle the responsibility of naming the item, but still create the requisite output objects during node execution.
The naming implementation is improved here. It extracts the name of the generic and appends a UUID to that string when saving items.
- New generic class `PickleStorageBase`, implements the same API as `LatentsStorageBase`, use for storing non-serializable data via pickling
- Implementation `PickleStorageTorch` uses `torch.save` and `torch.load`, same as `LatentsStorageDisk`
- Add `tensors: PickleStorageBase[torch.Tensor]` to `InvocationServices`
- Add `conditioning: PickleStorageBase[ConditioningFieldData]` to `InvocationServices`
- Remove `latents` service and all `LatentsStorage` classes
- Update `InvocationContext` and all usage of old `latents` service to use the new services/context wrapper methods
* add basic functionality for model metadata fetching from hf and civitai
* add storage
* start unit tests
* add unit tests and documentation
* add missing dependency for pytests
* remove redundant fetch; add modified/published dates; updated docs
* add code to select diffusers files based on the variant type
* implement Civitai installs
* make huggingface parallel downloading work
* add unit tests for model installation manager
- Fixed race condition on selection of download destination path
- Add fixtures common to several model_manager_2 unit tests
- Added dummy model files for testing diffusers and safetensors downloading/probing
- Refactored code for selecting proper variant from list of huggingface repo files
- Regrouped ordering of methods in model_install_default.py
* improve Civitai model downloading
- Provide a better error message when Civitai requires an access token (doesn't give a 403 forbidden, but redirects
to the HTML of an authorization page -- arrgh)
- Handle case of Civitai providing a primary download link plus additional links for VAEs, config files, etc
* add routes for retrieving metadata and tags
* code tidying and documentation
* fix ruff errors
* add file needed to maintain test root diretory in repo for unit tests
* fix self->cls in classmethod
* add pydantic plugin for mypy
* use TestSession instead of requests.Session to prevent any internet activity
improve logging
fix error message formatting
fix logging again
fix forward vs reverse slash issue in Windows install tests
* Several fixes of problems detected during PR review:
- Implement cancel_model_install_job and get_model_install_job routes
to allow for better control of model download and install.
- Fix thread deadlock that occurred after cancelling an install.
- Remove unneeded pytest_plugins section from tests/conftest.py
- Remove unused _in_terminal_state() from model_install_default.
- Remove outdated documentation from several spots.
- Add workaround for Civitai API results which don't return correct
URL for the default model.
* fix docs and tests to match get_job_by_source() rather than get_job()
* Update invokeai/backend/model_manager/metadata/fetch/huggingface.py
Co-authored-by: Ryan Dick <ryanjdick3@gmail.com>
* Call CivitaiMetadata.model_validate_json() directly
Co-authored-by: Ryan Dick <ryanjdick3@gmail.com>
* Second round of revisions suggested by @ryanjdick:
- Fix type mismatch in `list_all_metadata()` route.
- Do not have a default value for the model install job id
- Remove static class variable declarations from non Pydantic classes
- Change `id` field to `model_id` for the sqlite3 `model_tags` table.
- Changed AFTER DELETE triggers to ON DELETE CASCADE for the metadata and tags tables.
- Made the `id` field of the `model_metadata` table into a primary key to achieve uniqueness.
* Code cleanup suggested in PR review:
- Narrowed the declaration of the `parts` attribute of the download progress event
- Removed auto-conversion of str to Url in Url-containing sources
- Fixed handling of `InvalidModelConfigException`
- Made unknown sources raise `NotImplementedError` rather than `Exception`
- Improved status reporting on cached HuggingFace access tokens
* Multiple fixes:
- `job.total_size` returns a valid size for locally installed models
- new route `list_models` returns a paged summary of model, name,
description, tags and other essential info
- fix a few type errors
* consolidated all invokeai root pytest fixtures into a single location
* Update invokeai/backend/model_manager/metadata/metadata_store.py
Co-authored-by: psychedelicious <4822129+psychedelicious@users.noreply.github.com>
* Small tweaks in response to review comments:
- Remove flake8 configuration from pyproject.toml
- Use `id` rather than `modelId` for huggingface `ModelInfo` object
- Use `last_modified` rather than `LastModified` for huggingface `ModelInfo` object
- Add `sha256` field to file metadata downloaded from huggingface
- Add `Invoker` argument to the model installer `start()` and `stop()` routines
(but made it optional in order to facilitate use of the service outside the API)
- Removed redundant `PRAGMA foreign_keys` from metadata store initialization code.
* Additional tweaks and minor bug fixes
- Fix calculation of aggregate diffusers model size to only count the
size of files, not files + directories (which gives different unit test
results on different filesystems).
- Refactor _get_metadata() and _get_download_urls() to have distinct code paths
for Civitai, HuggingFace and URL sources.
- Forward the `inplace` flag from the source to the job and added unit test for this.
- Attach cached model metadata to the job rather than to the model install service.
* fix unit test that was breaking on windows due to CR/LF changing size of test json files
* fix ruff formatting
* a few last minor fixes before merging:
- Turn job `error` and `error_type` into properties derived from the exception.
- Add TODO comment about the reason for handling temporary directory destruction
manually rather than using tempfile.tmpdir().
* add unit tests for reporting HTTP download errors
---------
Co-authored-by: Lincoln Stein <lstein@gmail.com>
Co-authored-by: Ryan Dick <ryanjdick3@gmail.com>
Co-authored-by: psychedelicious <4822129+psychedelicious@users.noreply.github.com>
* add base definition of download manager
* basic functionality working
* add unit tests for download queue
* add documentation and FastAPI route
* fix docs
* add missing test dependency; fix import ordering
* fix file path length checking on windows
* fix ruff check error
* move release() into the __del__ method
* disable testing of stderr messages due to issues with pytest capsys fixture
* fix unsorted imports
* harmonized implementation of start() and stop() calls in download and & install modules
* Update invokeai/app/services/download/download_base.py
Co-authored-by: Ryan Dick <ryanjdick3@gmail.com>
* replace test datadir fixture with tmp_path
* replace DownloadJobBase->DownloadJob in download manager documentation
* make source and dest arguments to download_queue.download() an AnyHttpURL and Path respectively
* fix pydantic typecheck errors in the download unit test
* ruff formatting
* add "job cancelled" as an event rather than an exception
* fix ruff errors
* Update invokeai/app/services/download/download_default.py
Co-authored-by: psychedelicious <4822129+psychedelicious@users.noreply.github.com>
* use threading.Event to stop service worker threads; handle unfinished job edge cases
* remove dangling STOP job definition
* fix ruff complaint
* fix ruff check again
* avoid race condition when start() and stop() are called simultaneously from different threads
* avoid race condition in stop() when a job becomes active while shutting down
---------
Co-authored-by: Lincoln Stein <lstein@gmail.com>
Co-authored-by: Ryan Dick <ryanjdick3@gmail.com>
Co-authored-by: psychedelicious <4822129+psychedelicious@users.noreply.github.com>
Co-authored-by: Kent Keirsey <31807370+hipsterusername@users.noreply.github.com>
The graph library occasionally causes issues when the default graph changes substantially between versions and pydantic validation fails. See #5289 for an example.
We are not currently using the graph library, so we can disable it until we are ready to use it. It's possible that the workflow library will supersede it anyways.
- use simpler pattern for migration dependencies
- move SqliteDatabase & migration to utility method `init_db`, use this in both the app and tests, ensuring the same db schema is used in both
Simplifies a couple things:
- Init is more straightforward
- It's clear in the migrator that the connection we are working with is related to the SqliteDatabase
- Simplify init args to path (None means use memory), logger, and verbose
- Add docstrings to SqliteDatabase (it had almost none)
- Update all usages of the class