616 Commits

Author SHA1 Message Date
d958d2e5a0 feat(mm): iterate on cache callbacks API 2025-05-15 14:37:22 +10:00
823ca214e6 feat(mm): iterate on cache callbacks API 2025-05-15 13:28:51 +10:00
a33da450fd feat(mm): support cache callbacks 2025-05-15 11:23:58 +10:00
518a896521 feat(mm): add usage_info to model config 2025-05-06 09:07:52 -04:00
1f63b60021 Implementing support for Non-Standard LoRA Format (#7985)
* integrate LoRA

* idk anymore tbh

* enable fused matrix for quantized models

* ruff fix

---------

Co-authored-by: Sam <bhaskarmdutt@gmail.com>
Co-authored-by: psychedelicious <4822129+psychedelicious@users.noreply.github.com>
2025-05-05 09:40:38 -04:00
fb91f48722 change base model for ChatGPT 4o 2025-04-29 09:12:49 +10:00
04c005284c add gpt-image to possible base model types 2025-04-28 15:39:11 -04:00
14944872c4 feat(mm): add model taxonomy for API models & Imagen3 as base model type 2025-04-28 13:31:26 -04:00
814406d98a feat(mm): siglip model loading supports partial loading
In the previous commit, the LLaVA model was updated to support partial loading.

In this commit, the SigLIP model is updated in the same way.

This model is used for FLUX Redux. It's <4GB and only ever run in isolation, so it won't benefit from partial loading for the vast majority of users. Regardless, I think it is best if we make _all_ models work with partial loading.

PS: I also fixed the initial load dtype issue, described in the prev commit. It's probably a non-issue for this model, but we may as well fix it.
2025-04-18 10:12:03 +10:00
c054501103 feat(mm): llava model loading supports partial loading; fix OOM crash on initial load
The model manager has two types of model cache entries:
- `CachedModelOnlyFullLoad`: The model may only ever be loaded and unloaded as a single object.
- `CachedModelWithPartialLoad`: The model may be partially loaded and unloaded.

Partial loading is enabled by overriding certain torch layer classes, adding the ability to autocast a layer to a device on the fly. See `CustomLinear` for an example.

So, to take advantage of partial loading and be cached as a `CachedModelWithPartialLoad`, the model must inherit from `torch.nn.Module`.
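The dispatch implied above can be sketched roughly as follows. This is a hedged illustration, not the model manager's actual code: the class names come from the commit message, but the constructors and the boolean check are my assumptions (the real check would be `isinstance(model, torch.nn.Module)`).

```python
class CachedModelOnlyFullLoad:
    def __init__(self, model):
        self.model = model  # loaded/unloaded only as a single object


class CachedModelWithPartialLoad:
    def __init__(self, model):
        self.model = model  # layers can be autocast to a device on the fly


def make_cache_entry(model, is_torch_module):
    # Partial loading works by patching torch layer classes, so it is only
    # possible when the cached object is itself a torch.nn.Module.
    if is_torch_module:
        return CachedModelWithPartialLoad(model)
    return CachedModelOnlyFullLoad(model)
```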

The LLaVA classes provided by `transformers` do inherit from `torch.nn.Module`, but we wrap those classes in a separate class called `LlavaOnevisionModel`. The wrapper encapsulates both the LLaVA model and its "processor" - a lightweight class that prepares model inputs like text and images.

While it is more elegant to encapsulate both the model and processor in a single entity, doing so prevents the model cache from enabling partial loading for the chunky vLLM model.

Fixing this involved a few changes.
- Update the `LlavaOnevisionModelLoader` class to operate on the vLLM model directly, instead of on the `LlavaOnevisionModel` wrapper class.
- Instantiate the processor directly in the node. The processor is lightweight and does its business on the CPU. We don't need to worry about caching in the model manager.
- Remove caching support code from the `LlavaOnevisionModel` wrapper class. It's not needed, because we do not cache this class. The class now only handles running the models provided to it.
- Rename `LlavaOnevisionModel` to `LlavaOnevisionPipeline` to better represent its purpose.

These changes have a bonus effect of fixing an OOM crash when initially loading the models. This was most apparent when loading LLaVA 7B, which is pretty chunky.

The initial load is onto CPU RAM. In the old version of the loaders, we ignored the loader's target dtype for the initial load. Instead, we loaded the model at `transformers`'s "default" dtype of fp32.

LLaVA 7B is fp16 and weighs ~17GB. Loading as fp32 means we need double that amount (~34GB) of CPU RAM. Many users only have 32GB RAM, so this causes a _CPU_ OOM - which is a hard crash of the whole process.

With the updated loaders, the initial load logic now uses the target dtype for the initial load. LLaVA now needs the expected ~17GB RAM for its initial load.

PS: If we didn't make the accompanying partial loading changes, we still could have solved this OOM. We'd just need to pass the initial load dtype to the wrapper class and have it load on that dtype. But we may as well fix both issues.

PPS: There are other models whose model classes are wrappers around a torch module class, and thus cannot be partially loaded. However, these models are typically fairly small and/or are run only on their own, so they don't benefit as much from partial loading. It's the really big models (like LLaVA 7B) that benefit most from the partial loading.
2025-04-18 10:12:03 +10:00
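The RAM arithmetic behind the OOM fix in the commit above can be checked back-of-envelope (the ~17 GB figure for LLaVA 7B at fp16 is taken from the commit message as given):

```python
# fp16 uses 2 bytes per parameter, fp32 uses 4.
BYTES_FP16 = 2
BYTES_FP32 = 4

checkpoint_gb_fp16 = 17.0                          # on-disk fp16 weights
params_billion = checkpoint_gb_fp16 / BYTES_FP16   # ~8.5B parameters implied

ram_gb_fp16 = params_billion * BYTES_FP16  # ~17 GB: fits in 32 GB of RAM
ram_gb_fp32 = params_billion * BYTES_FP32  # ~34 GB: exceeds 32 GB -> hard CPU OOM
```

Loading at the target dtype instead of `transformers`'s fp32 default is what halves the initial-load footprint.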
9846229e52 build graph for cogview4 2025-04-10 10:50:13 +10:00
46316e43f0 typegen 2025-04-10 10:50:13 +10:00
321c2d358c Add CogView4 model loader. And various other fixes to get a CogView4 workflow running (though quality is still below expectations). 2025-04-10 10:50:13 +10:00
0338983895 Update CogView4 starter model entry with approximate bundle size. 2025-04-10 10:50:13 +10:00
e2c4ea8e89 Add CogView4 model probing. 2025-04-10 10:50:13 +10:00
52a8ad1c18 chore: rename model.size to model.file_size
to disambiguate from RAM size or pixel size
2025-04-10 09:53:03 +10:00
f09aacf992 fix: ModelProbe.probe needs to return a size field 2025-04-10 09:53:03 +10:00
9590e8ff39 feat: expose model storage size 2025-04-10 09:53:03 +10:00
8294e2cdea feat(mm): support size calculation for onnx models 2025-04-07 11:37:55 +10:00
9868c3bfe3 Merge branch 'main' into lora-classification 2025-03-31 16:43:26 +11:00
a44bfb4658 fix(mm): handle FLUX models w/ diff in_channels keys
Before FLUX Fill was merged, we didn't do any checks for the model variant. We always returned "normal".

To determine if a model is a FLUX Fill model, we need to check the state dict for a specific key. Initially, this logic was too strict and rejected quantized FLUX models. This issue was resolved, but it turns out there is another failure mode - some fine-tunes use a different key.

This change further reduces the strictness, handling the alternate key and also falling back to "normal" if we don't see either key. This effectively restores the previous probing behaviour for all FLUX models.

Closes #7856
Closes #7859
2025-03-31 12:32:55 +11:00
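The fix described in the commit above - try an alternate key, then fall back - can be sketched like this. The key names are illustrative stand-ins (the commit only says some fine-tunes use a different key), not the actual probe code:

```python
def get_flux_in_channels(state_dict):
    """Return the input-channel count of a FLUX state dict, or None.

    Checks two candidate keys for the input projection weight; key names
    here are assumptions, not the probe's real key list.
    """
    for key in ("img_in.weight", "model.diffusion_model.img_in.weight"):
        if key in state_dict:
            return state_dict[key].shape[1]  # weight columns = in_channels
    return None  # unknown layout: caller falls back to the "normal" variant
```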
965753bf8b Ruff formatting 2025-03-31 08:18:00 +11:00
40c53ab95c Guard 2025-03-29 09:58:02 +11:00
c25f6d1f84 Merge branch 'main' into lora-classification 2025-03-28 12:32:22 +11:00
1af9930951 Merge branch 'main' into small-improvements 2025-03-28 12:11:09 +11:00
c276c1cbee Comment 2025-03-28 10:57:46 +11:00
c619348f29 Extract ModelOnDisk to its own module 2025-03-28 10:35:13 +11:00
0d75c99476 Caching 2025-03-27 17:55:09 +11:00
323d409fb6 Make ruff happy 2025-03-27 17:47:57 +11:00
f251722f56 LoRA classification API 2025-03-27 17:47:01 +11:00
7004fde41b fix(mm): vllm model calculates its own size 2025-03-27 09:36:14 +11:00
efd14ec0e4 Make ruff happy 2025-03-27 08:11:39 +11:00
82dd2d508f Deprecate checkpoint as file, diffusers as directory terminology 2025-03-27 08:10:12 +11:00
60b5aef16a Log error -> warning 2025-03-27 06:56:22 +11:00
0e8b5484d5 Error handling 2025-03-26 19:31:57 +11:00
454506c83e Type hints 2025-03-26 19:12:49 +11:00
8f6ab67376 Logs 2025-03-26 16:34:32 +11:00
5afcc7778f Redundant 2025-03-26 16:32:19 +11:00
325e07d330 Error handling 2025-03-26 16:30:45 +11:00
a016bdc159 Add todo 2025-03-26 16:17:26 +11:00
a14f0b2864 Fail early on invalid config 2025-03-26 16:10:32 +11:00
721483318a Extend ModelOnDisk 2025-03-26 16:10:00 +11:00
182580ff69 Imports 2025-03-26 12:55:10 +11:00
8e9d5c1187 Ruff formatting 2025-03-26 12:30:31 +11:00
99aac5870e Remove star imports 2025-03-26 12:27:00 +11:00
ffa0beba7a Merge branch 'main' into llava 2025-03-24 15:17:33 +11:00
0b4c6f0ab4 fix(mm): flux model variant probing
In #7780 we added FLUX Fill support, and needed the probe to be able to distinguish between "normal" FLUX models and FLUX Fill models.

Logic was added to the probe to check a particular state dict key (input channels), which should be 384 for FLUX Fill and 64 for other FLUX models.

The new logic was stricter: instead of falling back on the "normal" variant, it raised when an unexpected value for input channels was detected.

This caused failures to probe for BNB-NF4 quantized FLUX Dev/Schnell, which apparently only have 1 input channel.

After checking a variety of FLUX models, I loosened the strictness of the variant probing logic to only special-case the new FLUX Fill model, and otherwise fall back to returning the "normal" variant. This better matches the old behaviour and fixes the import errors.

Closes #7822
2025-03-24 12:36:18 +11:00
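The loosened mapping described above boils down to a one-line special case. The variant names below are illustrative, not InvokeAI's actual enum values:

```python
def flux_variant(in_channels):
    # Lenient mapping, per the commit: only 384 input channels is
    # special-cased as FLUX Fill; everything else (64 for standard
    # models, apparently 1 for some BNB-NF4 quantizations) falls back
    # to "normal", matching the pre-Fill probing behaviour.
    return "fill" if in_channels == 384 else "normal"
```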
d8450033ea Fix 2025-03-21 17:46:18 +11:00
3938736bd8 Ruff formatting 2025-03-21 17:35:12 +11:00
fb2c7b9566 Defaults 2025-03-21 17:35:04 +11:00