add documentation and minor bug fixes

- Add new documentation for the textual inversion training process
- Move `main.py` into the deprecated scripts folder
- Fix a bug in `textual_inversion.py` that prevented it from loading
  the globals module correctly
- Sort models alphabetically in the console front end
- Show only diffusers models in the console front end
Lincoln Stein 2023-01-20 16:55:50 -05:00
parent 195294e74f
commit 080fc4b380
6 changed files with 239 additions and 63 deletions

Binary image file not shown (124 KiB).

@@ -10,83 +10,259 @@ You may personalize the generated images to provide your own styles or objects
by training a new LDM checkpoint and introducing a new vocabulary to the fixed
model as a (.pt) embeddings file. Alternatively, you may use or train
HuggingFace Concepts embeddings files (.bin) from
<https://huggingface.co/sd-concepts-library> and its associated
notebooks.

## **Hardware and Software Requirements**

You will need a GPU to perform training in a reasonable length of
time, and at least 12 GB of VRAM. We recommend using the [`xformers`
library](../installation/070_INSTALL_XFORMERS) to accelerate the
training process further. During training, about 8 GB is temporarily
needed in order to store intermediate models, checkpoints and logs.
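
If you are unsure how much VRAM your GPU has, a quick way to check on
an NVIDIA card is shown below (this sketch assumes the standard
`nvidia-smi` tool is installed):

```sh
# report the GPU model and total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv
```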

## **Preparing for Training**

To train, prepare a folder that contains 3-5 images that illustrate
the object or concept. It is good to provide a variety of examples or
poses to avoid overtraining the system. Format these images as PNG
(preferred) or JPG. You do not need to resize or crop the images in
advance, but for more control you may wish to do so.

Place the training images in a directory on the machine InvokeAI runs
on. We recommend placing them in a subdirectory of the
`text-inversion-training-data` folder located in the InvokeAI root
directory, ordinarily `~/invokeai` (Linux/Mac) or
`C:\Users\your_name\invokeai` (Windows). For example, to create an
embedding for the "psychedelic" style, you'd place the training images
into the directory
`~/invokeai/text-inversion-training-data/psychedelic`.
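
A minimal sketch of this setup from the shell (the source image folder
below is just an example; substitute wherever your pictures actually
live):

```sh
# create a training-data subdirectory named after the concept
mkdir -p ~/invokeai/text-inversion-training-data/psychedelic

# copy 3-5 representative images into it
cp ~/Pictures/psychedelic-samples/*.png \
   ~/invokeai/text-inversion-training-data/psychedelic/
```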

## **Launching Training Using the Console Front End**

InvokeAI 2.3 and higher comes with a text console-based training front
end. From within the `invoke.sh`/`invoke.bat` Invoke launcher script,
start the front end by selecting choice (3):

```sh
Do you want to generate images using the
1. command-line
2. browser-based UI
3. textual inversion training
4. open the developer console
Please enter 1, 2, 3, or 4: [1] 3
```

From the command line, with the InvokeAI virtual environment active,
you can launch the front end with the command `textual_inversion_fe`.
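
A sketch of doing this directly from a shell, assuming a standard
install with the virtual environment in `.venv` under the InvokeAI
root directory (your environment's location may differ):

```sh
# activate the InvokeAI virtual environment (path is an assumption)
source ~/invokeai/.venv/bin/activate

# launch the textual inversion training front end
textual_inversion_fe
```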

This will launch a text-based front end that will look like this:

<figure markdown>
![ti-frontend](../assets/textual-inversion/ti-frontend.png)
</figure>

The interface is keyboard-based. Move from field to field using
control-N (^N) to move to the next field and control-P (^P) to move to
the previous one. <Tab> and <shift-TAB> work as well. Once a field is
active, use the cursor keys. In a checkbox group, use the up and down
cursor keys to move from choice to choice, and <space> to select a
choice. In a scrollbar, use the left and right cursor keys to increase
and decrease the value of the scroll. In textfields, type the desired
values.

The number of parameters may look intimidating, but in most cases the
predefined defaults work fine. The red circled fields in the above
illustration are the ones you will adjust most frequently.

### Model Name

This will list all the diffusers models that are currently
installed. Select the one you wish to use as the basis for your
embedding. Be aware that if you use an SD-1.X-based model for your
training, you will only be able to use this embedding with other
SD-1.X-based models. Similarly, if you train on SD-2.X, you will only
be able to use the embeddings with models based on SD-2.X.
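
The list of models is read from `configs/models.yaml` in the InvokeAI
root directory (see the `get_model_names()` change at the bottom of
this commit). A rough way to preview which entries will be offered, as
a sketch assuming the default root location:

```sh
# show the stanzas in models.yaml that are marked as diffusers models
grep -B2 "format: diffusers" ~/invokeai/configs/models.yaml
```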

### Trigger Term

This is the prompt term you will use to trigger the embedding. Type a
single word or phrase you wish to use as the trigger, for example
"psychedelic" (without angle brackets). Within InvokeAI, you will then
be able to activate the trigger using the syntax `<psychedelic>`.

### Initializer

This is a single character that is used internally during the training
process as a placeholder for the trigger term. It defaults to "*" and
can usually be left alone.

### Resume from last saved checkpoint

As training proceeds, textual inversion will write a series of
intermediate files that can be used to resume training from where it
was left off in the case of an interruption. This checkbox will be
automatically selected if you provide a previously used trigger term
and at least one checkpoint file is found on disk.

Note that as of 20 January 2023, resume does not seem to be working
properly due to an issue with the upstream code.
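
To see whether resumable checkpoints already exist for a trigger term,
look in its output directory. This is only a sketch: the
`checkpoint-*` naming follows the upstream diffusers convention and is
an assumption here.

```sh
# list any intermediate checkpoints saved for the "psychedelic" run
ls -d ~/invokeai/text-inversion-output/psychedelic/checkpoint-* 2>/dev/null
```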

### Data Training Directory

This is the location of the images to be used for training. When you
select a trigger term like "my-trigger", the frontend will prepopulate
this field with `~/invokeai/text-inversion-training-data/my-trigger`,
but you can change the path to wherever you want.

### Output Destination Directory

This is the location of the logs, checkpoint files, and embedding
files created during training. When you select a trigger term like
"my-trigger", the frontend will prepopulate this field with
`~/invokeai/text-inversion-output/my-trigger`, but you can change the
path to wherever you want.

### Image resolution

The images in the training directory will be automatically scaled to
the value you use here. For best results, you will want to use the
same default resolution as the underlying model (512 pixels for
SD-1.5, 768 for the larger version of SD-2.1).

### Center crop images

If this is selected, your images will be center cropped to make them
square before being resized to the desired resolution. Center cropping
can indiscriminately cut off the top of subjects' heads in
portrait-aspect images, so if you have images like this, you may wish
to use a photo editor to manually crop them to a square aspect ratio.

### Mixed precision

Select the floating point precision for the embedding. "no" will
result in full 32-bit precision, "fp16" will provide 16-bit precision,
and "bf16" will provide mixed precision (only available when XFormers
is used).

### Max training steps

How many steps the training will take before the model converges. Most
training sets will converge with 2000-3000 steps.

### Batch size

This adjusts how many training images are processed simultaneously in
each step. Higher values will cause the training process to run more
quickly, but use more memory. The default size will run on GPUs with
as little as 12 GB of VRAM.

### Learning rate

The rate at which the system adjusts its internal weights during
training. Higher values risk overtraining (getting the same image each
time), and lower values will take more steps to train a good
model. The default of 0.0005 is conservative; you may wish to increase
it to 0.005 to speed up training.

### Scale learning rate by number of GPUs, steps and batch size

If this is selected (the default), the system will adjust the provided
learning rate to improve performance.
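
As the field name suggests, the scaled rate is derived from the number
of GPUs, the gradient accumulation steps, and the batch size. A sketch
of the arithmetic, assuming the multiplicative convention used by the
upstream diffusers training script and the values from the
command-line example later in this document:

```sh
# assumed scaling:
# effective_lr = learning_rate x gradient_accumulation_steps x train_batch_size x num_gpus
python3 -c 'print(0.0005 * 4 * 8 * 1)'   # => 0.016 on a single GPU
```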

### Use xformers acceleration

This will activate XFormers memory-efficient attention. You need to
have XFormers installed for this to have an effect.

### Learning rate scheduler

This adjusts how the learning rate changes over the course of
training. The default "constant" means to use a constant learning rate
for the entire training session. The other values scale the learning
rate according to various formulas.

Only "constant" is supported by the XFormers library.

### Gradient accumulation steps

This is a parameter that allows you to use bigger batch sizes than
your GPU's VRAM would ordinarily accommodate, at the cost of some
performance.
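
Gradients are accumulated over several smaller batches before each
weight update, so the effective batch size is the batch size
multiplied by the number of accumulation steps. Using the values from
the command-line example later in this document:

```sh
# effective batch size = train_batch_size x gradient_accumulation_steps
echo $(( 8 * 4 ))   # => 32 images contribute to each weight update
```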

### Warmup steps

If "constant_with_warmup" is selected in the learning rate scheduler,
then this provides the number of warmup steps. Warmup steps have a
very low learning rate, and are one way of preventing early
overtraining.

## The training run

Start the training run by advancing to the OK button (bottom right)
and pressing <enter>. A series of progress messages will be displayed
as the training process proceeds. This may take an hour or two,
depending on settings and the speed of your system. Various log and
checkpoint files will be written into the output directory (ordinarily
`~/invokeai/text-inversion-output/my-model/`).

At the end of successful training, the system will copy the file
`learned_embeds.bin` into the InvokeAI root directory's `embeddings`
directory, using a subdirectory named after the trigger token. For
example, if the trigger token was `psychedelic`, then look for the
embeddings file in
`~/invokeai/embeddings/psychedelic/learned_embeds.bin`.

You may now launch InvokeAI and try out a prompt that uses the trigger
term. For example `a plate of banana sushi in <psychedelic> style`.
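
A sketch of trying this out from the command-line client (the launcher
path below assumes the default root directory):

```sh
# start the command-line client via the launcher (choice 1), then at the
# invoke> prompt include the trigger term in angle brackets, for example:
#   invoke> a plate of banana sushi in <psychedelic> style
~/invokeai/invoke.sh
```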

## **Training with the Command-Line Script**

InvokeAI also comes with a traditional command-line script for
launching textual inversion training. It is named
`textual_inversion`, and can be launched from within the
"developer's console", or from the command line after activating
InvokeAI's virtual environment.

It accepts a large number of arguments, which can be summarized by
passing the `--help` argument:

```sh
textual_inversion --help
```

Typical usage is shown here:

```sh
python textual_inversion.py \
  --model=stable-diffusion-1.5 \
  --resolution=512 \
  --learnable_property=style \
  --initializer_token='*' \
  --placeholder_token='<psychedelic>' \
  --train_data_dir=/home/lstein/invokeai/training-data/psychedelic \
  --output_dir=/home/lstein/invokeai/text-inversion-training/psychedelic \
  --scale_lr \
  --train_batch_size=8 \
  --gradient_accumulation_steps=4 \
  --max_train_steps=3000 \
  --learning_rate=0.0005 \
  --resume_from_checkpoint=latest \
  --lr_scheduler=constant \
  --mixed_precision=fp16 \
  --only_save_embeds
```

## Reading

For more information on textual inversion, please see the following
resources:

* The [textual inversion repository](https://github.com/rinongal/textual_inversion)
  and associated paper for details and limitations.
* [HuggingFace's textual inversion training
  page](https://huggingface.co/docs/diffusers/training/text_inversion)

---

copyright (c) 2023, Lincoln Stein and the InvokeAI Development Team


@@ -746,7 +746,7 @@ def initialize_rootdir(root:str,yes_to_all:bool=False):
     safety_checker = '--nsfw_checker' if enable_safety_checker else '--no-nsfw_checker'
-    for name in ('models','configs','embeddings'):
+    for name in ('models','configs','embeddings','text-inversion-data','text-inversion-training-data'):
         os.makedirs(os.path.join(root,name), exist_ok=True)
     for src in (['configs']):
         dest = os.path.join(root,src)


@@ -1,11 +1,11 @@
 #!/usr/bin/env python
 # Copyright 2023, Lincoln Stein @lstein
-from ldm.invoke.globals import Globals, set_root
+from ldm.invoke.globals import Globals, global_set_root
 from ldm.invoke.textual_inversion_training import parse_args, do_textual_inversion_training

 if __name__ == "__main__":
     args = parse_args()
-    set_root(args.root_dir or Globals.root)
+    global_set_root(args.root_dir or Globals.root)
     kwargs = vars(args)
     do_textual_inversion_training(**kwargs)


@@ -13,8 +13,8 @@ from pathlib import Path
 from typing import List
 import argparse

-TRAINING_DATA = 'training-data'
-TRAINING_DIR = 'text-inversion-training'
+TRAINING_DATA = 'text-inversion-training-data'
+TRAINING_DIR = 'text-inversion-output'
 CONF_FILE = 'preferences.conf'

 class textualInversionForm(npyscreen.FormMultiPageAction):
@@ -219,7 +219,7 @@ class textualInversionForm(npyscreen.FormMultiPageAction):
     def get_model_names(self)->(List[str],int):
         conf = OmegaConf.load(os.path.join(Globals.root,'configs/models.yaml'))
-        model_names = sorted(list(conf.keys()))
+        model_names = [idx for idx in sorted(list(conf.keys())) if conf[idx].get('format',None)=='diffusers']
         defaults = [idx for idx in range(len(model_names)) if 'default' in conf[model_names[idx]]]
         return (model_names,defaults[0])