diff --git a/docs/assets/textual-inversion/ti-frontend.png b/docs/assets/textual-inversion/ti-frontend.png new file mode 100644 index 0000000000..0500e9b132 Binary files /dev/null and b/docs/assets/textual-inversion/ti-frontend.png differ diff --git a/docs/features/TEXTUAL_INVERSION.md b/docs/features/TEXTUAL_INVERSION.md index 7ce0f41c5a..7d54ea971c 100644 --- a/docs/features/TEXTUAL_INVERSION.md +++ b/docs/features/TEXTUAL_INVERSION.md @@ -10,83 +10,263 @@ You may personalize the generated images to provide your own styles or objects by training a new LDM checkpoint and introducing a new vocabulary to the fixed model as a (.pt) embeddings file. Alternatively, you may use or train HuggingFace Concepts embeddings files (.bin) from - and its associated notebooks. + and its associated +notebooks. -## **Training** +## **Hardware and Software Requirements** -To train, prepare a folder that contains images sized at 512x512 and execute the -following: +You will need a GPU to perform training in a reasonable length of +time, and at least 12 GB of VRAM. We recommend using the [`xformers` +library](../installation/070_INSTALL_XFORMERS) to accelerate the +training process further. During training, about ~8 GB is temporarily +needed in order to store intermediate models, checkpoints and logs. -### WINDOWS +## **Preparing for Training** -As the default backend is not available on Windows, if you're using that -platform, set the environment variable `PL_TORCH_DISTRIBUTED_BACKEND` to `gloo` +To train, prepare a folder that contains 3-5 images that illustrate +the object or concept. It is good to provide a variety of examples or +poses to avoid overtraining the system. Format these images as PNG +(preferred) or JPG. You do not need to resize or crop the images in +advance, but for more control you may wish to do so. -```bash -python3 ./main.py -t \ - --base ./configs/stable-diffusion/v1-finetune.yaml \ - --actual_resume ./models/ldm/stable-diffusion-v1/model.ckpt \ - -n my_cat \ - --gpus 0 \ - --data_root D:/textual-inversion/my_cat \ - --init_word 'cat' +Place the training images in a directory on the machine InvokeAI runs +on. We recommend placing them in a subdirectory of the +`text-inversion-training-data` folder located in the InvokeAI root +directory, ordinarily `~/invokeai` (Linux/Mac), or +`C:\Users\your_name\invokeai` (Windows). For example, to create an +embedding for the "psychedelic" style, you'd place the training images +into the directory +`~invokeai/text-inversion-training-data/psychedelic`. + +## **Launching Training Using the Console Front End** + +InvokeAI 2.3 and higher comes with a text console-based training front +end. From within the `invoke.sh`/`invoke.bat` Invoke launcher script, +start the front end by selecting choice (3): + +```sh +Do you want to generate images using the +1. command-line +2. browser-based UI +3. textual inversion training +4. open the developer console +Please enter 1, 2, 3, or 4: [1] 3 ``` -During the training process, files will be created in -`/logs/[project][time][project]/` where you can see the process. +From the command line, with the InvokeAI virtual environment active, +you can launch the front end with the command +`textual_inversion_fe`. -Conditioning contains the training prompts inputs, reconstruction the input -images for the training epoch samples, samples scaled for a sample of the prompt -and one with the init word provided. 
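As a concrete illustration of the command-line route, the steps below show one way to start the front end from a terminal. This is only a sketch: the virtual-environment path is an assumption for a default Linux/Mac install and may differ on your system (alternatively, the launcher's "open the developer console" option should give you a shell where the command is already available).

```sh
# Activate the InvokeAI virtual environment first; this path is an example
# for a default install and may be different on your machine.
source ~/invokeai/.venv/bin/activate

# Launch the console-based textual inversion training front end
textual_inversion_fe
```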
+This will launch a text-based front end that will look like this: -On a RTX3090, the process for SD will take ~1h @1.6 iterations/sec. +
+![ti-frontend](../assets/textual-inversion/ti-frontend.png) +
-!!! note +The interface is keyboard-based. Move from field to field using +control-N (^N) to move to the next field and control-P (^P) to the +previous one. and work as well. Once a field is +active, use the cursor keys. In a checkbox group, use the up and down +cursor keys to move from choice to choice, and to select a +choice. In a scrollbar, use the left and right cursor keys to increase +and decrease the value of the scroll. In textfields, type the desired +values. - According to the associated paper, the optimal number of - images is 3-5. Your model may not converge if you use more images than - that. +The number of parameters may look intimidating, but in most cases the +predefined defaults work fine. The red circled fields in the above +illustration are the ones you will adjust most frequently. -Training will run indefinitely, but you may wish to stop it (with ctrl-c) before -the heat death of the universe, when you find a low loss epoch or around ~5000 -iterations. Note that you can set a fixed limit on the number of training steps -by decreasing the "max_steps" option in -configs/stable_diffusion/v1-finetune.yaml (currently set to 4000000) +### Model Name -## **Run the Model** +This will list all the diffusers models that are currently +installed. Select the one you wish to use as the basis for your +embedding. Be aware that if you use a SD-1.X-based model for your +training, you will only be able to use this embedding with other +SD-1.X-based models. Similarly, if you train on SD-2.X, you will only +be able to use the embeddings with models based on SD-2.X. -Once the model is trained, specify the trained .pt or .bin file when starting -invoke using +### Trigger Term -```bash -python3 ./scripts/invoke.py \ - --embedding_path /path/to/embedding.pt +This is the prompt term you will use to trigger the embedding. Type a +single word or phrase you wish to use as the trigger, example +"psychedelic" (without angle brackets). Within InvokeAI, you will then +be able to activate the trigger using the syntax ``. + +### Initializer + +This is a single character that is used internally during the training +process as a placeholder for the trigger term. It defaults to "*" and +can usually be left alone. + +### Resume from last saved checkpoint + +As training proceeds, textual inversion will write a series of +intermediate files that can be used to resume training from where it +was left off in the case of an interruption. This checkbox will be +automatically selected if you provide a previously used trigger term +and at least one checkpoint file is found on disk. + +Note that as of 20 January 2023, resume does not seem to be working +properly due to an issue with the upstream code. + +### Data Training Directory + +This is the location of the images to be used for training. When you +select a trigger term like "my-trigger", the frontend will prepopulate +this field with `~/invokeai/text-inversion-training-data/my-trigger`, +but you can change the path to wherever you want. + +### Output Destination Directory + +This is the location of the logs, checkpoint files, and embedding +files created during training. When you select a trigger term like +"my-trigger", the frontend will prepopulate this field with +`~/invokeai/text-inversion-output/my-trigger`, but you can change the +path to wherever you want. + +### Image resolution + +The images in the training directory will be automatically scaled to +the value you use here. 
For best results, you will want to use the +same default resolution of the underlying model (512 pixels for +SD-1.5, 768 for the larger version of SD-2.1). + +### Center crop images + +If this is selected, your images will be center cropped to make them +square before resizing them to the desired resolution. Center cropping +can indiscriminately cut off the top of subjects' heads for portrait +aspect images, so if you have images like this, you may wish to use a +photoeditor to manually crop them to a square aspect ratio. + +### Mixed precision + +Select the floating point precision for the embedding. "no" will +result in a full 32-bit precision, "fp16" will provide 16-bit +precision, and "bf16" will provide mixed precision (only available +when XFormers is used). + +### Max training steps + +How many steps the training will take before the model converges. Most +training sets will converge with 2000-3000 steps. + +### Batch size + +This adjusts how many training images are processed simultaneously in +each step. Higher values will cause the training process to run more +quickly, but use more memory. The default size will run with GPUs with +as little as 12 GB. + +### Learning rate + +The rate at which the system adjusts its internal weights during +training. Higher values risk overtraining (getting the same image each +time), and lower values will take more steps to train a good +model. The default of 0.0005 is conservative; you may wish to increase +it to 0.005 to speed up training. + +### Scale learning rate by number of GPUs, steps and batch size + +If this is selected (the default) the system will adjust the provided +learning rate to improve performance. + +### Use xformers acceleration + +This will activate XFormers memory-efficient attention. You need to +have XFormers installed for this to have an effect. + +### Learning rate scheduler + +This adjusts how the learning rate changes over the course of +training. The default "constant" means to use a constant learning rate +for the entire training session. The other values scale the learning +rate according to various formulas. + +Only "constant" is supported by the XFormers library. + +### Gradient accumulation steps + +This is a parameter that allows you to use bigger batch sizes than +your GPU's VRAM would ordinarily accommodate, at the cost of some +performance. + +### Warmup steps + +If "constant_with_warmup" is selected in the learning rate scheduler, +then this provides the number of warmup steps. Warmup steps have a +very low learning rate, and are one way of preventing early +overtraining. + +## The training run + +Start the training run by advancing to the OK button (bottom right) +and pressing . A series of progress messages will be displayed +as the training process proceeds. This may take an hour or two, +depending on settings and the speed of your system. Various log and +checkpoint files will be written into the output directory (ordinarily +`~/invokeai/text-inversion-output/my-model/`) + +At the end of successful training, the system will copy the file +`learned_embeds.bin` into the InvokeAI root directory's `embeddings` +directory, using a subdirectory named after the trigger token. For +example, if the trigger token was `psychedelic`, then look for the +embeddings file in +`~/invokeai/embeddings/psychedelic/learned_embeds.bin` + +You may now launch InvokeAI and try out a prompt that uses the trigger +term. For example `a plate of banana sushi in style`. 
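To make the trigger syntax explicit: the trigger term is wrapped in angle brackets inside the prompt, matching the "Trigger by using <your-term> in your prompts" hint displayed by the front end. Assuming the `psychedelic` trigger used in the example above, a session might look like this:

```sh
# At the InvokeAI prompt, wrap the trigger term in angle brackets
invoke> a plate of banana sushi in <psychedelic> style
```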
+ +## **Training with the Command-Line Script** + +InvokeAI also comes with a traditional command-line script for +launching textual inversion training. It is named +`textual_inversion`, and can be launched from within the +"developer's console", or from the command line after activating +InvokeAI's virtual environment. + +It accepts a large number of arguments, which can be summarized by +passing the `--help` argument: + +```sh +textual_inversion --help ``` -Then, to utilize your subject at the invoke prompt - -```bash -invoke> "a photo of *" +Typical usage is shown here: +```sh +python textual_inversion.py \ + --model=stable-diffusion-1.5 \ + --resolution=512 \ + --learnable_property=style \ + --initializer_token='*' \ + --placeholder_token='' \ + --train_data_dir=/home/lstein/invokeai/training-data/psychedelic \ + --output_dir=/home/lstein/invokeai/text-inversion-training/psychedelic \ + --scale_lr \ + --train_batch_size=8 \ + --gradient_accumulation_steps=4 \ + --max_train_steps=3000 \ + --learning_rate=0.0005 \ + --resume_from_checkpoint=latest \ + --lr_scheduler=constant \ + --mixed_precision=fp16 \ + --only_save_embeds ``` -This also works with image2image +## Reading -```bash -invoke> "waterfall and rainbow in the style of *" --init_img=./init-images/crude_drawing.png --strength=0.5 -s100 -n4 -``` +For more information on textual inversion, please see the following +resources: -For .pt files it's also possible to train multiple tokens (modify the -placeholder string in `configs/stable-diffusion/v1-finetune.yaml`) and combine -LDM checkpoints using: +* The [textual inversion repository](https://github.com/rinongal/textual_inversion) and + associated paper for details and limitations. +* [HuggingFace's textual inversion training + page](https://huggingface.co/docs/diffusers/training/text_inversion) +* [HuggingFace example script + documentation](https://github.com/huggingface/diffusers/tree/main/examples/textual_inversion) + (Note that this script is similar to, but not identical, to + `textual_inversion`, but produces embed files that are completely compatible. -```bash -python3 ./scripts/merge_embeddings.py \ - --manager_ckpts /path/to/first/embedding.pt \ - [,[...]] \ - --output_path /path/to/output/embedding.pt -``` +--- -Credit goes to rinongal and the repository - -Please see [the repository](https://github.com/rinongal/textual_inversion) and -associated paper for details and limitations. 
+copyright (c) 2023, Lincoln Stein and the InvokeAI Development Team \ No newline at end of file diff --git a/ldm/invoke/textual_inversion_training.py b/ldm/invoke/textual_inversion_training.py index ab53f0801d..7003a149fb 100644 --- a/ldm/invoke/textual_inversion_training.py +++ b/ldm/invoke/textual_inversion_training.py @@ -4,7 +4,6 @@ # and modified slightly by Lincoln Stein (@lstein) to work with InvokeAI import argparse -from argparse import Namespace import logging import math import os @@ -207,6 +206,12 @@ def parse_args(): parser.add_argument("--adam_epsilon", type=float, default=1e-08, help="Epsilon value for the Adam optimizer") parser.add_argument("--push_to_hub", action="store_true", help="Whether or not to push the model to the Hub.") parser.add_argument("--hub_token", type=str, default=None, help="The token to use to push to the Model Hub.") + parser.add_argument( + "--hub_model_id", + type=str, + default=None, + help="The name of the repository to keep in sync with the local `output_dir`.", + ) parser.add_argument( "--logging_dir", type=Path, @@ -455,7 +460,8 @@ def do_textual_inversion_training( checkpointing_steps:int=500, resume_from_checkpoint:Path=None, enable_xformers_memory_efficient_attention:bool=False, - root_dir:Path=None + root_dir:Path=None, + hub_model_id:str=None, ): env_local_rank = int(os.environ.get("LOCAL_RANK", -1)) if env_local_rank != -1 and env_local_rank != local_rank: @@ -518,10 +524,10 @@ def do_textual_inversion_training( pretrained_model_name_or_path = model_conf.get('repo_id',None) or Path(model_conf.get('path')) assert pretrained_model_name_or_path, f"models.yaml error: neither 'repo_id' nor 'path' is defined for {model}" pipeline_args = dict(cache_dir=global_cache_dir('diffusers')) - + # Load tokenizer if tokenizer_name: - tokenizer = CLIPTokenizer.from_pretrained(tokenizer_name,cache_dir=global_cache_dir('transformers')) + tokenizer = CLIPTokenizer.from_pretrained(tokenizer_name,**pipeline_args) else: tokenizer = CLIPTokenizer.from_pretrained(pretrained_model_name_or_path, subfolder="tokenizer", **pipeline_args) @@ -631,7 +637,7 @@ def do_textual_inversion_training( text_encoder, optimizer, train_dataloader, lr_scheduler ) - # For mixed precision training we cast the text_encoder and vae weights to half-precision + # For mixed precision training we cast the unet and vae weights to half-precision # as these models are only used for inference, keeping weights in full precision is not required. weight_dtype = torch.float32 if accelerator.mixed_precision == "fp16": @@ -670,6 +676,7 @@ def do_textual_inversion_training( logger.info(f" Total optimization steps = {max_train_steps}") global_step = 0 first_epoch = 0 + resume_step = None # Potentially load in the weights and states from a previous save if resume_from_checkpoint: @@ -680,15 +687,22 @@ def do_textual_inversion_training( dirs = os.listdir(output_dir) dirs = [d for d in dirs if d.startswith("checkpoint")] dirs = sorted(dirs, key=lambda x: int(x.split("-")[1])) - path = dirs[-1] - accelerator.print(f"Resuming from checkpoint {path}") - accelerator.load_state(os.path.join(output_dir, path)) - global_step = int(path.split("-")[1]) - - resume_global_step = global_step * gradient_accumulation_steps - first_epoch = resume_global_step // num_update_steps_per_epoch - resume_step = resume_global_step % num_update_steps_per_epoch + path = dirs[-1] if len(dirs) > 0 else None + + if path is None: + accelerator.print( + f"Checkpoint '{resume_from_checkpoint}' does not exist. 
Starting a new training run." + ) + resume_from_checkpoint = None + else: + accelerator.print(f"Resuming from checkpoint {path}") + accelerator.load_state(os.path.join(output_dir, path)) + global_step = int(path.split("-")[1]) + resume_global_step = global_step * gradient_accumulation_steps + first_epoch = global_step // num_update_steps_per_epoch + resume_step = resume_global_step % (num_update_steps_per_epoch * gradient_accumulation_steps) + # Only show the progress bar once on each machine. progress_bar = tqdm(range(global_step, max_train_steps), disable=not accelerator.is_local_main_process) progress_bar.set_description("Steps") @@ -700,7 +714,7 @@ def do_textual_inversion_training( text_encoder.train() for step, batch in enumerate(train_dataloader): # Skip steps until we reach the resumed step - if resume_from_checkpoint and epoch == first_epoch and step < resume_step: + if resume_step and resume_from_checkpoint and epoch == first_epoch and step < resume_step: if step % gradient_accumulation_steps == 0: progress_bar.update(1) continue diff --git a/ldm/modules/textual_inversion_manager.py b/ldm/modules/textual_inversion_manager.py index cf28cf8c7a..2e61be6b12 100644 --- a/ldm/modules/textual_inversion_manager.py +++ b/ldm/modules/textual_inversion_manager.py @@ -72,8 +72,9 @@ class TextualInversionManager(): self._add_textual_inversion(embedding_info['name'], embedding_info['embedding'], defer_injecting_tokens=defer_injecting_tokens) - except ValueError: - print(f' | ignoring incompatible embedding {embedding_info["name"]}') + except ValueError as e: + print(f' | Ignoring incompatible embedding {embedding_info["name"]}') + print(f' | The error was {str(e)}') else: print(f'>> Failed to load embedding located at {ckpt_path}. Unsupported file.') @@ -157,7 +158,8 @@ class TextualInversionManager(): try: self._inject_tokens_and_assign_embeddings(ti) except ValueError as e: - print(f' | ignoring incompatible embedding trigger {ti.trigger_string}') + print(f' | Ignoring incompatible embedding trigger {ti.trigger_string}') + print(f' | The error was {str(e)}') continue injected_token_ids.append(ti.trigger_token_id) injected_token_ids.extend(ti.pad_token_ids) diff --git a/scripts/configure_invokeai.py b/scripts/configure_invokeai.py index fec1cc6135..70f80f7846 100755 --- a/scripts/configure_invokeai.py +++ b/scripts/configure_invokeai.py @@ -747,7 +747,7 @@ def initialize_rootdir(root:str,yes_to_all:bool=False): safety_checker = '--nsfw_checker' if enable_safety_checker else '--no-nsfw_checker' - for name in ('models','configs','embeddings'): + for name in ('models','configs','embeddings','text-inversion-data','text-inversion-training-data'): os.makedirs(os.path.join(root,name), exist_ok=True) for src in (['configs']): dest = os.path.join(root,src) diff --git a/main.py b/scripts/orig_scripts/main.py similarity index 100% rename from main.py rename to scripts/orig_scripts/main.py diff --git a/scripts/textual_inversion.py b/scripts/textual_inversion.py index fb176a5eec..a7aae8f40f 100755 --- a/scripts/textual_inversion.py +++ b/scripts/textual_inversion.py @@ -1,11 +1,11 @@ #!/usr/bin/env python # Copyright 2023, Lincoln Stein @lstein -from ldm.invoke.globals import Globals, set_root +from ldm.invoke.globals import Globals, global_set_root from ldm.invoke.textual_inversion_training import parse_args, do_textual_inversion_training if __name__ == "__main__": args = parse_args() - set_root(args.root_dir or Globals.root) + global_set_root(args.root_dir or Globals.root) kwargs = vars(args) 
do_textual_inversion_training(**kwargs) diff --git a/scripts/textual_inversion_fe.py b/scripts/textual_inversion_fe.py index 82446e98a7..0639d9c2c8 100755 --- a/scripts/textual_inversion_fe.py +++ b/scripts/textual_inversion_fe.py @@ -6,14 +6,15 @@ import sys import re import shutil import traceback +import curses from ldm.invoke.globals import Globals, global_set_root from omegaconf import OmegaConf from pathlib import Path from typing import List import argparse -TRAINING_DATA = 'training-data' -TRAINING_DIR = 'text-inversion-training' +TRAINING_DATA = 'text-inversion-training-data' +TRAINING_DIR = 'text-inversion-output' CONF_FILE = 'preferences.conf' class textualInversionForm(npyscreen.FormMultiPageAction): @@ -43,6 +44,11 @@ class textualInversionForm(npyscreen.FormMultiPageAction): except: pass + self.add_widget_intelligent( + npyscreen.FixedText, + value='Use ctrl-N and ctrl-P to move to the ext and
revious fields, cursor arrows to make a selection, and space to toggle checkboxes.' + ) + self.model = self.add_widget_intelligent( npyscreen.TitleSelectOne, name='Model Name:', @@ -82,18 +88,18 @@ class textualInversionForm(npyscreen.FormMultiPageAction): max_height=4, ) self.train_data_dir = self.add_widget_intelligent( - npyscreen.TitleFilenameCombo, + npyscreen.TitleFilename, name='Data Training Directory:', select_dir=True, - must_exist=True, - value=saved_args.get('train_data_dir',Path(Globals.root) / TRAINING_DATA / default_placeholder_token) + must_exist=False, + value=str(saved_args.get('train_data_dir',Path(Globals.root) / TRAINING_DATA / default_placeholder_token)) ) self.output_dir = self.add_widget_intelligent( - npyscreen.TitleFilenameCombo, + npyscreen.TitleFilename, name='Output Destination Directory:', select_dir=True, must_exist=False, - value=saved_args.get('output_dir',Path(Globals.root) / TRAINING_DIR / default_placeholder_token) + value=str(saved_args.get('output_dir',Path(Globals.root) / TRAINING_DIR / default_placeholder_token)) ) self.resolution = self.add_widget_intelligent( npyscreen.TitleSelectOne, @@ -182,8 +188,8 @@ class textualInversionForm(npyscreen.FormMultiPageAction): def initializer_changed(self): placeholder = self.placeholder_token.value self.prompt_token.value = f'(Trigger by using <{placeholder}> in your prompts)' - self.train_data_dir.value = Path(Globals.root) / TRAINING_DATA / placeholder - self.output_dir.value = Path(Globals.root) / TRAINING_DIR / placeholder + self.train_data_dir.value = str(Path(Globals.root) / TRAINING_DATA / placeholder) + self.output_dir.value = str(Path(Globals.root) / TRAINING_DIR / placeholder) self.resume_from_checkpoint.value = Path(self.output_dir.value).exists() def on_ok(self): @@ -221,7 +227,7 @@ class textualInversionForm(npyscreen.FormMultiPageAction): def get_model_names(self)->(List[str],int): conf = OmegaConf.load(os.path.join(Globals.root,'configs/models.yaml')) - model_names = list(conf.keys()) + model_names = [idx for idx in sorted(list(conf.keys())) if conf[idx].get('format',None)=='diffusers'] defaults = [idx for idx in range(len(model_names)) if 'default' in conf[model_names[idx]]] return (model_names,defaults[0]) @@ -288,7 +294,9 @@ def save_args(args:dict): ''' Save the current argument values to an omegaconf file ''' - conf_file = Path(Globals.root) / TRAINING_DIR / CONF_FILE + dest_dir = Path(Globals.root) / TRAINING_DIR + os.makedirs(dest_dir, exist_ok=True) + conf_file = dest_dir / CONF_FILE conf = OmegaConf.create(args) OmegaConf.save(config=conf, f=conf_file)
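For orientation, with the default root of `~/invokeai` and the `psychedelic` example used in the documentation above, the locations referenced by these changes can be inspected from a shell. This is a sketch only; the paths follow the defaults described here and may differ on your install.

```sh
# Training images (new default location used by the front end)
ls ~/invokeai/text-inversion-training-data/psychedelic/

# Logs, checkpoints, and the saved front-end settings (preferences.conf)
ls ~/invokeai/text-inversion-output/psychedelic/
cat ~/invokeai/text-inversion-output/preferences.conf

# The finished embedding copied into the embeddings folder after training
ls ~/invokeai/embeddings/psychedelic/learned_embeds.bin
```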