add documentation and minor bug fixes

- Added new documentation for textual inversion training process
- Move `main.py` into the deprecated scripts folder
- Fix bug in `textual_inversion.py` which was causing it to not load the globals module correctly
- Sort models alphabetically in console front end
- Only show diffusers models in console front end
2024-08-30 20:32:17 +00:00 · 2023-01-20 16:55:50 -05:00 · 2023-01-20 16:55:50 -05:00 · 080fc4b380
commit 080fc4b380
parent 195294e74f
6 changed files with 239 additions and 63 deletions
--- a/docs/assets/textual-inversion/ti-frontend.png
+++ b/docs/assets/textual-inversion/ti-frontend.png
--- a/docs/features/TEXTUAL_INVERSION.md
+++ b/docs/features/TEXTUAL_INVERSION.md
@@ -10,83 +10,259 @@ You may personalize the generated images to provide your own styles or objects
 by training a new LDM checkpoint and introducing a new vocabulary to the fixed
 model as a (.pt) embeddings file. Alternatively, you may use or train
 HuggingFace Concepts embeddings files (.bin) from
-<https://huggingface.co/sd-concepts-library> and its associated notebooks.
+<https://huggingface.co/sd-concepts-library> and its associated
+notebooks.

-## **Training**
+## **Hardware and Software Requirements**

-To train, prepare a folder that contains images sized at 512x512 and execute the
-following:
+You will need a GPU to perform training in a reasonable length of
+time, and at least 12 GB of VRAM. We recommend using the [`xformers`
+library](../installation/070_INSTALL_XFORMERS) to accelerate the
+training process further. During training, about 8 GB of disk space is
+temporarily needed to store intermediate models, checkpoints and logs.

-### WINDOWS
+## **Preparing for Training**

-As the default backend is not available on Windows, if you're using that
-platform, set the environment variable `PL_TORCH_DISTRIBUTED_BACKEND` to `gloo`
+To train, prepare a folder that contains 3-5 images that illustrate
+the object or concept. It is good to provide a variety of examples or
+poses to avoid overtraining the system. Format these images as PNG
+(preferred) or JPG. You do not need to resize or crop the images in
+advance, but for more control you may wish to do so.

-```bash
-python3 ./main.py -t \
-    --base ./configs/stable-diffusion/v1-finetune.yaml \
-    --actual_resume ./models/ldm/stable-diffusion-v1/model.ckpt \
-    -n my_cat \
-    --gpus 0 \
-    --data_root D:/textual-inversion/my_cat \
-    --init_word 'cat'
+Place the training images in a directory on the machine InvokeAI runs
+on. We recommend placing them in a subdirectory of the
+`text-inversion-training-data` folder located in the InvokeAI root
+directory, ordinarily `~/invokeai` (Linux/Mac), or
+`C:\Users\your_name\invokeai` (Windows). For example, to create an
+embedding for the "psychedelic" style, you'd place the training images
+into the directory
+`~/invokeai/text-inversion-training-data/psychedelic`.
+
+## **Launching Training Using the Console Front End**
+
+InvokeAI 2.3 and higher comes with a text console-based training front
+end. From within the `invoke.sh`/`invoke.bat` Invoke launcher script,
+start the front end by selecting choice (3):
+
+```sh
+Do you want to generate images using the
+1. command-line
+2. browser-based UI
+3. textual inversion training
+4. open the developer console
+Please enter 1, 2, 3, or 4: [1] 3
 ```

-During the training process, files will be created in
-`/logs/[project][time][project]/` where you can see the process.
+From the command line, with the InvokeAI virtual environment active,
+you can launch the front end with the command
+`textual_inversion_fe`.

-Conditioning contains the training prompts inputs, reconstruction the input
-images for the training epoch samples, samples scaled for a sample of the prompt
-and one with the init word provided.
+This will launch a text-based front end that will look like this:

-On a RTX3090, the process for SD will take ~1h @1.6 iterations/sec.
+<figure markdown>
+![ti-frontend](../assets/textual-inversion/ti-frontend.png)
+</figure>

-!!! note
+The interface is keyboard-based. Move from field to field using
+control-N (^N) to move to the next field and control-P (^P) to the
+previous one. <Tab> and <Shift-Tab> work as well. Once a field is
+active, use the cursor keys. In a checkbox group, use the up and down
+cursor keys to move from choice to choice, and <space> to select a
+choice. In a scrollbar, use the left and right cursor keys to adjust
+the value. In text fields, type the desired
+values.

-    According to the associated paper, the optimal number of
-    images is 3-5. Your model may not converge if you use more images than
-    that.
+The number of parameters may look intimidating, but in most cases the
+predefined defaults work fine. The red circled fields in the above
+illustration are the ones you will adjust most frequently.

-Training will run indefinitely, but you may wish to stop it (with ctrl-c) before
-the heat death of the universe, when you find a low loss epoch or around ~5000
-iterations. Note that you can set a fixed limit on the number of training steps
-by decreasing the "max_steps" option in
-configs/stable_diffusion/v1-finetune.yaml (currently set to 4000000)
+### Model Name

-## **Run the Model**
+This will list all the diffusers models that are currently
+installed. Select the one you wish to use as the basis for your
+embedding. Be aware that if you use a SD-1.X-based model for your
+training, you will only be able to use this embedding with other
+SD-1.X-based models. Similarly, if you train on SD-2.X, you will only
+be able to use the embeddings with models based on SD-2.X.

-Once the model is trained, specify the trained .pt or .bin file when starting
-invoke using
+### Trigger Term

-```bash
-python3 ./scripts/invoke.py \
-    --embedding_path /path/to/embedding.pt
+This is the prompt term you will use to trigger the embedding. Type a
+single word or phrase you wish to use as the trigger, for example
+"psychedelic" (without angle brackets). Within InvokeAI, you will then
+be able to activate the trigger using the syntax `<psychedelic>`.
+
+### Initializer
+
+This is a single character that is used internally during the training
+process as a placeholder for the trigger term. It defaults to "*" and
+can usually be left alone.
+
+### Resume from last saved checkpoint
+
+As training proceeds, textual inversion will write a series of
+intermediate files that can be used to resume training from where it
+was left off in the case of an interruption. This checkbox will be
+automatically selected if you provide a previously used trigger term
+and at least one checkpoint file is found on disk.
+
+Note that as of 20 January 2023, resume does not seem to be working
+properly due to an issue with the upstream code.
+
+### Data Training Directory
+
+This is the location of the images to be used for training. When you
+select a trigger term like "my-trigger", the frontend will prepopulate
+this field with `~/invokeai/text-inversion-training-data/my-trigger`,
+but you can change the path to wherever you want.
+
+### Output Destination Directory
+
+This is the location of the logs, checkpoint files, and embedding
+files created during training. When you select a trigger term like
+"my-trigger", the frontend will prepopulate this field with
+`~/invokeai/text-inversion-output/my-trigger`, but you can change the
+path to wherever you want.
+
+### Image resolution
+
+The images in the training directory will be automatically scaled to
+the value you use here. For best results, use the default
+resolution of the underlying model (512 pixels for
+SD-1.5, 768 for the larger version of SD-2.1).
+
+### Center crop images
+
+If this is selected, your images will be center cropped to make them
+square before resizing them to the desired resolution. Center cropping
+can indiscriminately cut off the top of subjects' heads for portrait
+aspect images, so if you have images like this, you may wish to use a
+photo editor to manually crop them to a square aspect ratio.
+
+### Mixed precision
+
+Select the floating point precision for the embedding. "no" will
+result in a full 32-bit precision, "fp16" will provide 16-bit
+precision, and "bf16" will provide mixed precision (only available
+when XFormers is used).
+
+### Max training steps
+
+The maximum number of steps the training run will take. Most
+training sets will converge within 2000-3000 steps.
+
+### Batch size
+
+This adjusts how many training images are processed simultaneously in
+each step. Higher values will cause the training process to run more
+quickly, but use more memory. The default size will run on GPUs with
+as little as 12 GB of VRAM.
+
+### Learning rate
+
+The rate at which the system adjusts its internal weights during
+training. Higher values risk overtraining (getting the same image each
+time), and lower values will take more steps to train a good
+model. The default of 0.0005 is conservative; you may wish to increase
+it to 0.005 to speed up training.
+
+### Scale learning rate by number of GPUs, steps and batch size
+
+If this is selected (the default), the system will adjust the provided
+learning rate to improve performance.
+
+### Use xformers acceleration
+
+This will activate XFormers memory-efficient attention. You need to
+have XFormers installed for this to have an effect.
+
+### Learning rate scheduler
+
+This adjusts how the learning rate changes over the course of
+training. The default "constant" means to use a constant learning rate
+for the entire training session. The other values scale the learning
+rate according to various formulas.
+
+Only "constant" is supported by the XFormers library.
+
+### Gradient accumulation steps
+
+This is a parameter that allows you to use bigger batch sizes than
+your GPU's VRAM would ordinarily accommodate, at the cost of some
+performance.
+
+### Warmup steps
+
+If "constant_with_warmup" is selected in the learning rate scheduler,
+then this provides the number of warmup steps. Warmup steps have a
+very low learning rate, and are one way of preventing early
+overtraining.
+
+## The training run
+
+Start the training run by advancing to the OK button (bottom right)
+and pressing <enter>. A series of progress messages will be displayed
+as the training process proceeds. This may take an hour or two,
+depending on settings and the speed of your system. Various log and
+checkpoint files will be written into the output directory (ordinarily
+`~/invokeai/text-inversion-output/my-trigger/`).
+
+At the end of successful training, the system will copy the file
+`learned_embeds.bin` into the InvokeAI root directory's `embeddings`
+directory, using a subdirectory named after the trigger token. For
+example, if the trigger token was `psychedelic`, then look for the
+embeddings file in
+`~/invokeai/embeddings/psychedelic/learned_embeds.bin`.
+
+You may now launch InvokeAI and try out a prompt that uses the trigger
+term. For example, `a plate of banana sushi in <psychedelic> style`.
+
+## **Training with the Command-Line Script**
+
+InvokeAI also comes with a traditional command-line script for
+launching textual inversion training. It is named
+`textual_inversion`, and can be launched from within the
+"developer's console", or from the command line after activating
+InvokeAI's virtual environment.
+
+It accepts a large number of arguments, which can be summarized by
+passing the `--help` argument:
+
+```sh
+textual_inversion --help
 ```

-Then, to utilize your subject at the invoke prompt
-
-```bash
-invoke> "a photo of *"
+Typical usage is shown here:
+
+```sh
+python textual_inversion.py \
+       --model=stable-diffusion-1.5 \
+       --resolution=512 \
+       --learnable_property=style \
+       --initializer_token='*' \
+       --placeholder_token='<psychedelic>' \
+       --train_data_dir=/home/lstein/invokeai/text-inversion-training-data/psychedelic \
+       --output_dir=/home/lstein/invokeai/text-inversion-output/psychedelic \
+       --scale_lr \
+       --train_batch_size=8 \
+       --gradient_accumulation_steps=4 \
+       --max_train_steps=3000 \
+       --learning_rate=0.0005 \
+       --resume_from_checkpoint=latest \
+       --lr_scheduler=constant \
+       --mixed_precision=fp16 \
+       --only_save_embeds
 ```

-This also works with image2image
+## Reading

-```bash
-invoke> "waterfall and rainbow in the style of *" --init_img=./init-images/crude_drawing.png --strength=0.5 -s100 -n4
-```
+For more information on textual inversion, please see the following
+resources:

-For .pt files it's also possible to train multiple tokens (modify the
-placeholder string in `configs/stable-diffusion/v1-finetune.yaml`) and combine
-LDM checkpoints using:
+* The [textual inversion repository](https://github.com/rinongal/textual_inversion) and
+  associated paper for details and limitations.
+* [HuggingFace's textual inversion training
+  page](https://huggingface.co/docs/diffusers/training/text_inversion)

-```bash
-python3 ./scripts/merge_embeddings.py \
-    --manager_ckpts /path/to/first/embedding.pt \
-    [</path/to/second/embedding.pt>,[...]] \
-    --output_path /path/to/output/embedding.pt
-```
+---

-Credit goes to rinongal and the repository
-
-Please see [the repository](https://github.com/rinongal/textual_inversion) and
-associated paper for details and limitations.
+Copyright (c) 2023, Lincoln Stein and the InvokeAI Development Team
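
The directory conventions described in `TEXTUAL_INVERSION.md` can be summarized in a short sketch. The helper name `default_ti_paths` and the example root directory are hypothetical and are not part of InvokeAI; only the folder names and the `learned_embeds.bin` file name are taken from the documentation above.

```python
from pathlib import PurePosixPath


def default_ti_paths(root: str, trigger: str) -> dict:
    """Return the documented default locations for a trigger term.

    Hypothetical helper for illustration only; not an InvokeAI API.
    """
    base = PurePosixPath(root)
    return {
        # Where the 3-5 training images are expected by default
        "train_data_dir": base / "text-inversion-training-data" / trigger,
        # Where logs, checkpoints and intermediate embeddings are written
        "output_dir": base / "text-inversion-output" / trigger,
        # Where the finished embedding is copied after a successful run
        "embedding": base / "embeddings" / trigger / "learned_embeds.bin",
    }


paths = default_ti_paths("/home/user/invokeai", "psychedelic")
for label, path in paths.items():
    print(f"{label}: {path}")
```

`PurePosixPath` is used so the sketch behaves the same on every platform; on Windows the real root would be something like `C:\Users\your_name\invokeai`.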
--- a/scripts/configure_invokeai.py
+++ b/scripts/configure_invokeai.py
@@ -746,7 +746,7 @@ def initialize_rootdir(root:str,yes_to_all:bool=False):

    safety_checker = '--nsfw_checker' if enable_safety_checker else '--no-nsfw_checker'

-    for name in ('models','configs','embeddings'):
+    for name in ('models','configs','embeddings','text-inversion-data','text-inversion-training-data'):
        os.makedirs(os.path.join(root,name), exist_ok=True)
    for src in (['configs']):
        dest = os.path.join(root,src)
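
The directory-creation loop in this hunk can be exercised in isolation with the sketch below. The function name `ensure_subdirs` is hypothetical. Note also that the hunk creates `text-inversion-data`, while the front end's `TRAINING_DIR` constant is `text-inversion-output`; the sketch uses the latter, which is an assumption about the intended name rather than what the commit does.

```python
import os
import tempfile


def ensure_subdirs(root, names):
    """Create each named subdirectory under root; safe to call repeatedly."""
    created = []
    for name in names:
        path = os.path.join(root, name)
        # exist_ok=True makes the call idempotent: no error if it already exists
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created


# Assumed directory names; the last one differs from the hunk above (see lead-in).
SUBDIRS = (
    "models",
    "configs",
    "embeddings",
    "text-inversion-training-data",
    "text-inversion-output",
)

with tempfile.TemporaryDirectory() as root:
    ensure_subdirs(root, SUBDIRS)
    ensure_subdirs(root, SUBDIRS)  # second call is a no-op
    print(sorted(os.listdir(root)))
```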
--- a/scripts/orig_scripts/main.py
+++ b/scripts/orig_scripts/main.py
--- a/scripts/textual_inversion.py
+++ b/scripts/textual_inversion.py
@@ -1,11 +1,11 @@
 #!/usr/bin/env python

 # Copyright 2023, Lincoln Stein @lstein
-from ldm.invoke.globals import Globals, set_root
+from ldm.invoke.globals import Globals, global_set_root
 from ldm.invoke.textual_inversion_training import parse_args, do_textual_inversion_training

 if __name__ == "__main__":
    args = parse_args()
-    set_root(args.root_dir or Globals.root)
+    global_set_root(args.root_dir or Globals.root)
    kwargs = vars(args)
    do_textual_inversion_training(**kwargs)
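
The interaction between `--train_batch_size`, `--gradient_accumulation_steps` and `--scale_lr` in the documented typical usage can be worked through numerically. This is a sketch that assumes the scaling rule used by the Hugging Face diffusers textual inversion example (learning rate multiplied by accumulation steps, batch size and process count). This diff does not confirm that InvokeAI's training module applies exactly this rule, and both function names are hypothetical.

```python
def effective_batch_size(train_batch_size, gradient_accumulation_steps, num_processes=1):
    """Number of images contributing to each optimizer update."""
    return train_batch_size * gradient_accumulation_steps * num_processes


def scaled_learning_rate(learning_rate, train_batch_size,
                         gradient_accumulation_steps, num_processes=1):
    """Learning rate after scaling (assumed rule, see the lead-in)."""
    return learning_rate * effective_batch_size(
        train_batch_size, gradient_accumulation_steps, num_processes
    )


# Values from the typical usage example: batch size 8, accumulation 4, lr 0.0005
print(effective_batch_size(8, 4))
print(scaled_learning_rate(0.0005, 8, 4))
```

Under this assumption, each optimizer update sees 32 images on a single GPU and the nominal learning rate of 0.0005 is multiplied by 32, which is worth keeping in mind before raising the base rate further.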
--- a/scripts/textual_inversion_fe.py
+++ b/scripts/textual_inversion_fe.py
@@ -13,8 +13,8 @@ from pathlib import Path
 from typing import List
 import argparse

-TRAINING_DATA = 'training-data'
-TRAINING_DIR = 'text-inversion-training'
+TRAINING_DATA = 'text-inversion-training-data'
+TRAINING_DIR = 'text-inversion-output'
 CONF_FILE = 'preferences.conf'

 class textualInversionForm(npyscreen.FormMultiPageAction):
@@ -219,7 +219,7 @@ class textualInversionForm(npyscreen.FormMultiPageAction):

    def get_model_names(self)->(List[str],int):
        conf = OmegaConf.load(os.path.join(Globals.root,'configs/models.yaml'))
-        model_names = sorted(list(conf.keys()))
+        model_names = [idx for idx in sorted(list(conf.keys())) if conf[idx].get('format',None)=='diffusers']
        defaults = [idx for idx in range(len(model_names)) if 'default' in conf[model_names[idx]]]
        return (model_names,defaults[0])
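
The filtering introduced in `get_model_names` can be tried outside the form with a plain dictionary standing in for the loaded `models.yaml`. This is a sketch: the function name `diffusers_model_names` and the sample configuration are hypothetical. Unlike the method above, it falls back to index 0 when no diffusers model is marked as default, instead of raising `IndexError` on `defaults[0]`.

```python
from typing import Dict, List, Tuple


def diffusers_model_names(conf: Dict[str, dict]) -> Tuple[List[str], int]:
    """Return alphabetically sorted diffusers model names and the default's index."""
    names = [
        name for name in sorted(conf)
        if conf[name].get("format") == "diffusers"
    ]
    # As in the original, a model counts as default if the key is present at all
    defaults = [i for i, name in enumerate(names) if "default" in conf[name]]
    # Deviation from the original: fall back to the first entry when none is marked
    return names, (defaults[0] if defaults else 0)


# Hypothetical stand-in for configs/models.yaml
sample_conf = {
    "stable-diffusion-2.1": {"format": "diffusers"},
    "legacy-model": {"format": "ckpt", "default": True},
    "stable-diffusion-1.5": {"format": "diffusers", "default": True},
}

names, default_index = diffusers_model_names(sample_conf)
print(names, default_index)
```

Because the default index is computed after filtering, a default flag on a non-diffusers model (as with `legacy-model` here) is ignored.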