From 65cfb0f312bafa9019b83bb4bab212849831b50a Mon Sep 17 00:00:00 2001
From: Any-Winter-4079 <50542132+Any-Winter-4079@users.noreply.github.com>
Date: Sat, 24 Sep 2022 17:07:33 +0200
Subject: [PATCH 1/3] Create Sampler Tips (SAMPLER_CONVERGENCE.md)

---
 docs/help/SAMPLER_CONVERGENCE.md | 141 +++++++++++++++++++++++++++++++
 1 file changed, 141 insertions(+)
 create mode 100644 docs/help/SAMPLER_CONVERGENCE.md

diff --git a/docs/help/SAMPLER_CONVERGENCE.md b/docs/help/SAMPLER_CONVERGENCE.md
new file mode 100644
index 0000000000..5dfee5dc4e
--- /dev/null
+++ b/docs/help/SAMPLER_CONVERGENCE.md
@@ -0,0 +1,141 @@
+---
+title: SAMPLER CONVERGENCE
+---
+
+## *Sampler Convergence*
+
+As features keep increasing, making the right choices for your needs can become increasingly difficult. What sampler should you use? And for how many steps? Do you change the CFG value? Do you use prompt weighting? Do you allow variations?
+
+Even once you have a result, do you blend it with other images? Pass it through `img2img`? With what strength? Do you use inpainting to correct small details? Outpainting to extend cropped sections?
+
+The purpose of this series of documents is to help you better understand these tools, so you can make the most of them. Feel free to contribute your own findings!
+
+In this document, we will talk about sampler convergence.
+
+Looking for a short version? Here's a TL;DR in 3 tables.
+
+| Remember |
+|:---|
+| Results converge as steps (`-s`) are increased (except for `K_DPM_2_A` and `K_EULER_A`). Often at ≥ `-s100`, but they may require ≥ `-s700`. |
+| Producing a batch of candidate images at low (`-s8` to `-s30`) step counts can save you hours of computation. |
+| `K_HEUN` and `K_DPM_2` converge in fewer steps (but are slower). |
+| `K_DPM_2_A` and `K_EULER_A` incorporate a lot of creativity/variability. |
+
+| Sampler | (3-sample avg) it/s (M1 Max 64GB, 512x512) |
+|---|---|
+| `DDIM` | 1.89 |
+| `PLMS` | 1.86 |
+| `K_EULER` | 1.86 |
+| `K_LMS` | 1.91 |
+| `K_HEUN` | 0.95 (slower) |
+| `K_DPM_2` | 0.95 (slower) |
+| `K_DPM_2_A` | 0.95 (slower) |
+| `K_EULER_A` | 1.86 |
+
+| Suggestions |
+|:---|
+| For most use cases, `K_LMS`, `K_HEUN` and `K_DPM_2` are the best choices (the latter two run at half the speed, but tend to converge twice as quickly as `K_LMS`). At very low step counts (≤ `-s8`), `K_HEUN` and `K_DPM_2` are not recommended. Use `K_LMS` instead. |
+| For variability, use `K_EULER_A` (runs 2x as quick as `K_DPM_2_A`). |
+
+---
+
+### *Sampler results*
+
+Let's start by choosing a prompt and using it with each of our 8 samplers, running it for 10, 20, 30, 40, 50 and 100 steps.
+
+Anime. `"an anime girl" -W512 -H512 -C7.5 -S3031912972`
+
+![191636411-083c8282-6ed1-4f78-9273-ee87c0a0f1b6-min (1)](https://user-images.githubusercontent.com/50542132/191868725-7f7af991-e254-4c1f-83e7-bed8c9b2d34f.png)
+
+### *Sampler convergence*
+
+Immediately, you can notice that results tend to converge: as `-s` (step) values increase, images look more and more similar until there comes a point where the image no longer changes.
+
+You can also notice how `DDIM` and `PLMS` eventually tend to converge to K-sampler results as steps are increased.
+Among K-samplers, `K_HEUN` and `K_DPM_2` seem to require the fewest steps to converge, and even at low step counts they are good indicators of the final result. And finally, `K_DPM_2_A` and `K_EULER_A` seem to do a bit of their own thing and don't keep much similarity with the rest of the samplers.
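As a quick sanity check beyond eyeballing the grids, you can compare renders of the same prompt and seed at increasing step counts and watch the pixel-wise difference shrink. Below is a minimal sketch of that idea; the filenames are hypothetical placeholders for wherever your own outputs are saved.

```python
# Minimal sketch: quantify convergence by diffing renders of the same
# prompt/seed at increasing step counts. Filenames are hypothetical.
import numpy as np
from PIL import Image

def rmse(path_a: str, path_b: str) -> float:
    """Root-mean-square pixel difference between two same-sized images."""
    a = np.asarray(Image.open(path_a).convert("RGB"), dtype=np.float32)
    b = np.asarray(Image.open(path_b).convert("RGB"), dtype=np.float32)
    return float(np.sqrt(np.mean((a - b) ** 2)))

steps = [10, 20, 30, 40, 50, 100]
for prev, curr in zip(steps, steps[1:]):
    diff = rmse(f"anime_girl_s{prev}.png", f"anime_girl_s{curr}.png")
    print(f"-s{prev} -> -s{curr}: RMSE {diff:.2f}")
# Once the RMSE stops shrinking meaningfully, extra steps are no longer
# changing the image, i.e. the sampler has converged for this prompt/seed.
```

A near-zero difference between consecutive step counts is a practical stand-in for the "image no longer changes" observation above.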
+
+### *Batch generation speedup*
+
+This realization is very useful because it means you don't need to create a batch of 100 images (`-n100`) at `-s100` to choose your favorite 2 or 3 images.
+You can produce the same 100 images at `-s10` to `-s30` using a K-sampler (since they converge faster), get a rough idea of the final result, choose your 2 or 3 favorite ones, and then run `-s100` on those images to polish some details.
+The latter technique is 3-8x faster.
+
+Example:
+
+Assuming 60s per 100 steps:
+
+(Option A) 60s * 100 images = 6000s (100 images at `-s100`, manually picking 3 favorites)
+
+(Option B) 6s * 100 images + 60s * 3 images = 780s (100 images at `-s10`, manually picking 3 favorites, and running those 3 at `-s100` to polish details)
+
+The result is 1 hour and 40 minutes (Option A) vs 13 minutes (Option B).
+
+### *Topic convergence*
+
+Now, these results seem interesting, but do they hold for other topics? How about nature? Food? People? Animals? Let's try!
+
+Nature. `"valley landscape wallpaper, d&d art, fantasy, painted, 4k, high detail, sharp focus, washed colors, elaborate excellent painted illustration" -W512 -H512 -C7.5 -S1458228930`
+
+![191736091-dda76929-00d1-4590-bef4-7314ea4ea419-min (1)](https://user-images.githubusercontent.com/50542132/191868763-b151c69e-0a72-4cf1-a151-5a64edd0c93e.png)
+
+With nature, you can see how initial results are even more indicative of the final result than they are with characters/people. `K_HEUN` and `K_DPM_2` are again the quickest indicators, almost right from the start. Results also converge faster (e.g. `K_HEUN` converged at `-s21`).
+
+Food. `"a hamburger with a bowl of french fries" -W512 -H512 -C7.5 -S4053222918`
+
+![191639011-f81d9d38-0a15-45f0-9442-a5e8d5c25f1f-min (1)](https://user-images.githubusercontent.com/50542132/191868898-98801a62-885f-4ea1-aee8-563503522aa9.png)
+
+Again, `K_HEUN` and `K_DPM_2` take the fewest steps to become good indicators of the final result. `K_DPM_2_A` and `K_EULER_A` seem to incorporate a lot of creativity/variability, capable of producing rotten hamburgers, but also of adding lettuce to the mix. And they're the only samplers that produced an actual 'bowl of fries'!
+
+Animals. `"grown tiger, full body" -W512 -H512 -C7.5 -S3721629802`
+
+![191771922-6029a4f5-f707-4684-9011-c6f96e25fe56-min (1)](https://user-images.githubusercontent.com/50542132/191868870-9e3b7d82-b909-429f-893a-13f6ec343454.png)
+
+`K_HEUN` and `K_DPM_2` once again require the fewest steps to be indicative of the final result (around `-s30`), while other samplers are still struggling with several tails or malformed back legs.
+
+This topic also takes longer to converge (for comparison, `K_HEUN` required around 150 steps). This is normal, as producing human/animal faces/bodies is one of the things the model struggles with the most. For these topics, running for more steps will often increase coherence within the composition.
+
+People. `"Ultra realistic photo, (Miranda Bloom-Kerr), young, stunning model, blue eyes, blond hair, beautiful face, intricate, highly detailed, smooth, art by artgerm and greg rutkowski and alphonse mucha, stained glass" -W512 -H512 -C7.5 -S2131956332`. This time, we will go up to 300 steps.
+
+![Screenshot 2022-09-23 at 02 05 48-min (1)](https://user-images.githubusercontent.com/50542132/191871743-6802f199-0ffd-4986-98c5-df2d8db30d18.png)
+
+Observing the results, it again takes longer for all samplers to converge (`K_HEUN` took around 150 steps), but good indicative results appear much earlier (see: `K_HEUN`). Conversely, `DDIM` and `PLMS` are still undergoing moderate changes (see: lace around her neck), even at `-s300`.
+
+In fact, as we can see in this other experiment, some samplers can take 700+ steps to converge when generating people.
+
+![191988191-c586b75a-2d7f-4351-b705-83cc1149881a-min (1)](https://user-images.githubusercontent.com/50542132/191992123-7e0759d6-6220-42c4-a961-88c7071c5ee6.png)
+
+Note also that the point of convergence may not be the most desirable state (e.g. I prefer an earlier version of the face, more rounded), but it will probably be the most coherent in terms of arms/hands/face attributes. You can always merge different images with a photo editing tool and pass the result through `img2img` to smooth out the composition.
+
+### *Sampler generation times*
+
+Once we understand the concept of sampler convergence, we must look into the performance of each sampler in terms of steps (iterations) per second, as not all samplers run at the same speed.
+
+On my M1 Max with 64GB of RAM, for a 512x512 image:
+| Sampler | (3-sample average) it/s |
+|---|---|
+| `DDIM` | 1.89 |
+| `PLMS` | 1.86 |
+| `K_EULER` | 1.86 |
+| `K_LMS` | 1.91 |
+| `K_HEUN` | 0.95 (slower) |
+| `K_DPM_2` | 0.95 (slower) |
+| `K_DPM_2_A` | 0.95 (slower) |
+| `K_EULER_A` | 1.86 |
+
+Combining our results with the steps per second of each sampler, three choices come out on top: `K_LMS`, `K_HEUN` and `K_DPM_2` (the latter two run at half the speed but tend to converge twice as quickly as `K_LMS`). For creativity and a lot of variation between iterations, `K_EULER_A` can be a good choice (it runs 2x as quick as `K_DPM_2_A`).
+
+Additionally, image generation at very low step counts (≤ `-s8`) is not recommended for `K_HEUN` and `K_DPM_2`. Use `K_LMS` instead.
+
+192044949-67d5d441-a0d5-4d5a-be30-5dda4fc28a00-min
+
+### *Three key points*
+
+Finally, it is relevant to mention that, in general, there are three important moments in the process of image formation as steps increase:
+
+* The (earliest) point at which an image becomes a good indicator of the final result (useful for batch generation at low step values, to then improve the quality/coherence of the chosen images by running the same prompt and seed for more steps).
+
+* The (earliest) point at which an image becomes coherent, even if it differs from the result when steps are increased (useful for batch generation at low step values, where quality/coherence is improved via techniques other than increasing the steps, e.g. via inpainting).
+
+* The point at which an image fully converges.
+
+Hence, remember that your workflow/strategy should define your optimal number of steps, even for the same prompt and seed (for example, if you seek full convergence, you may run `K_LMS` for `-s200` in the case of the red-haired girl, but `K_LMS` at `-s20`, taking one tenth of the time, may do just as well if your workflow includes adding small details, such as the missing shoulder strap, via `img2img`).
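To put numbers on a candidate workflow before committing to it, you can combine the it/s table above with the batch-then-polish strategy from the speedup section. The sketch below is a rough planning aid using only the figures measured in this document; it ignores per-image overhead such as model loading and image decoding, so treat its output as an estimate rather than a benchmark.

```python
# Rough planning helper: estimate wall-clock time for the two strategies
# discussed above, using the it/s figures from this document (M1 Max, 512x512).
# Pure arithmetic, not a benchmark.
ITS_PER_SEC = {"K_LMS": 1.91, "K_HEUN": 0.95, "K_EULER_A": 1.86}

def seconds(sampler: str, steps: int, images: int = 1) -> float:
    return images * steps / ITS_PER_SEC[sampler]

# Option A: 100 images straight at -s100.
option_a = seconds("K_LMS", steps=100, images=100)
# Option B: 100 candidates at -s10, then polish the 3 favourites at -s100.
option_b = seconds("K_LMS", steps=10, images=100) + seconds("K_LMS", steps=100, images=3)

print(f"Option A: ~{option_a / 60:.0f} min, Option B: ~{option_b / 60:.0f} min")
```

With `K_LMS` this works out to roughly 87 vs 11 minutes, in line with the 100-minute vs 13-minute example given earlier.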
From e19aab4a9b4f2d38dc2caec5f896c378dbfe0d1d Mon Sep 17 00:00:00 2001 From: Any-Winter-4079 <50542132+Any-Winter-4079@users.noreply.github.com> Date: Sun, 25 Sep 2022 19:12:11 +0200 Subject: [PATCH 2/3] Textual Inversion for M1 Update main.py Update ddpm.py Update personalized.py Update personalized_style.py Update v1-finetune.yaml Update environment-mac.yaml Rename v1-finetune.yaml to v1-m1-finetune.yaml Create v1-finetune.yaml Update main.py Update main.py Update environment-mac.yaml Update v1-inference.yaml --- configs/stable-diffusion/v1-finetune.yaml | 2 +- configs/stable-diffusion/v1-inference.yaml | 4 +- configs/stable-diffusion/v1-m1-finetune.yaml | 110 +++++++++++++++++++ environment-mac.yaml | 6 +- ldm/data/personalized.py | 2 +- ldm/data/personalized_style.py | 2 +- ldm/models/diffusion/ddpm.py | 4 +- ldm/modules/embedding_manager.py | 11 +- main.py | 45 ++++++-- 9 files changed, 163 insertions(+), 23 deletions(-) create mode 100644 configs/stable-diffusion/v1-m1-finetune.yaml diff --git a/configs/stable-diffusion/v1-finetune.yaml b/configs/stable-diffusion/v1-finetune.yaml index 7bc31168e7..df22987fa5 100644 --- a/configs/stable-diffusion/v1-finetune.yaml +++ b/configs/stable-diffusion/v1-finetune.yaml @@ -107,4 +107,4 @@ lightning: benchmark: True max_steps: 4000000 # max_steps: 4000 - \ No newline at end of file + diff --git a/configs/stable-diffusion/v1-inference.yaml b/configs/stable-diffusion/v1-inference.yaml index 59d8f33125..da4770ffc7 100644 --- a/configs/stable-diffusion/v1-inference.yaml +++ b/configs/stable-diffusion/v1-inference.yaml @@ -30,9 +30,9 @@ model: target: ldm.modules.embedding_manager.EmbeddingManager params: placeholder_strings: ["*"] - initializer_words: ["sculpture"] + initializer_words: ['face', 'man', 'photo', 'africanmale'] per_image_tokens: false - num_vectors_per_token: 1 + num_vectors_per_token: 6 progressive_words: False unet_config: diff --git a/configs/stable-diffusion/v1-m1-finetune.yaml b/configs/stable-diffusion/v1-m1-finetune.yaml new file mode 100644 index 0000000000..af37f1ec7e --- /dev/null +++ b/configs/stable-diffusion/v1-m1-finetune.yaml @@ -0,0 +1,110 @@ +model: + base_learning_rate: 5.0e-03 + target: ldm.models.diffusion.ddpm.LatentDiffusion + params: + linear_start: 0.00085 + linear_end: 0.0120 + num_timesteps_cond: 1 + log_every_t: 200 + timesteps: 1000 + first_stage_key: image + cond_stage_key: caption + image_size: 64 + channels: 4 + cond_stage_trainable: true # Note: different from the one we trained before + conditioning_key: crossattn + monitor: val/loss_simple_ema + scale_factor: 0.18215 + use_ema: False + embedding_reg_weight: 0.0 + + personalization_config: + target: ldm.modules.embedding_manager.EmbeddingManager + params: + placeholder_strings: ["*"] + initializer_words: ['face', 'man', 'photo', 'africanmale'] + per_image_tokens: false + num_vectors_per_token: 6 + progressive_words: False + + unet_config: + target: ldm.modules.diffusionmodules.openaimodel.UNetModel + params: + image_size: 32 # unused + in_channels: 4 + out_channels: 4 + model_channels: 320 + attention_resolutions: [ 4, 2, 1 ] + num_res_blocks: 2 + channel_mult: [ 1, 2, 4, 4 ] + num_heads: 8 + use_spatial_transformer: True + transformer_depth: 1 + context_dim: 768 + use_checkpoint: True + legacy: False + + first_stage_config: + target: ldm.models.autoencoder.AutoencoderKL + params: + embed_dim: 4 + monitor: val/rec_loss + ddconfig: + double_z: true + z_channels: 4 + resolution: 256 + in_channels: 3 + out_ch: 3 + ch: 128 + ch_mult: + - 1 + - 2 + - 4 + - 
4 + num_res_blocks: 2 + attn_resolutions: [] + dropout: 0.0 + lossconfig: + target: torch.nn.Identity + + cond_stage_config: + target: ldm.modules.encoders.modules.FrozenCLIPEmbedder + +data: + target: main.DataModuleFromConfig + params: + batch_size: 1 + num_workers: 2 + wrap: false + train: + target: ldm.data.personalized.PersonalizedBase + params: + size: 512 + set: train + per_image_tokens: false + repeats: 100 + validation: + target: ldm.data.personalized.PersonalizedBase + params: + size: 512 + set: val + per_image_tokens: false + repeats: 10 + +lightning: + modelcheckpoint: + params: + every_n_train_steps: 500 + callbacks: + image_logger: + target: main.ImageLogger + params: + batch_frequency: 500 + max_images: 5 + increase_log_steps: False + + trainer: + benchmark: False + max_steps: 6200 +# max_steps: 4000 + diff --git a/environment-mac.yaml b/environment-mac.yaml index 95f38438e2..74cd66ff4b 100644 --- a/environment-mac.yaml +++ b/environment-mac.yaml @@ -32,13 +32,13 @@ dependencies: - omegaconf==2.1.1 - onnx==1.12.0 - onnxruntime==1.12.1 - - protobuf==3.20.1 + - protobuf==3.19.5 - pudb==2022.1 - - pytorch-lightning==1.6.5 + - pytorch-lightning==1.7.5 - scipy==1.9.1 - streamlit==1.12.2 - sympy==1.10.1 - - tensorboard==2.9.0 + - tensorboard==2.10.0 - torchmetrics==0.9.3 - pip: - flask==2.1.3 diff --git a/ldm/data/personalized.py b/ldm/data/personalized.py index 15fc8a8d2d..8d9573fbc6 100644 --- a/ldm/data/personalized.py +++ b/ldm/data/personalized.py @@ -117,7 +117,7 @@ class PersonalizedBase(Dataset): self.image_paths = [ os.path.join(self.data_root, file_path) - for file_path in os.listdir(self.data_root) + for file_path in os.listdir(self.data_root) if file_path != ".DS_Store" ] # self._length = len(self.image_paths) diff --git a/ldm/data/personalized_style.py b/ldm/data/personalized_style.py index 56d77d7e81..118d5be991 100644 --- a/ldm/data/personalized_style.py +++ b/ldm/data/personalized_style.py @@ -93,7 +93,7 @@ class PersonalizedBase(Dataset): self.image_paths = [ os.path.join(self.data_root, file_path) - for file_path in os.listdir(self.data_root) + for file_path in os.listdir(self.data_root) if file_path != ".DS_Store" ] # self._length = len(self.image_paths) diff --git a/ldm/models/diffusion/ddpm.py b/ldm/models/diffusion/ddpm.py index ccfffa9b9b..3f103da767 100644 --- a/ldm/models/diffusion/ddpm.py +++ b/ldm/models/diffusion/ddpm.py @@ -701,7 +701,7 @@ class LatentDiffusion(DDPM): @rank_zero_only @torch.no_grad() - def on_train_batch_start(self, batch, batch_idx, dataloader_idx): + def on_train_batch_start(self, batch, batch_idx, dataloader_idx=None): # only for very first batch if ( self.scale_by_std @@ -1890,7 +1890,7 @@ class LatentDiffusion(DDPM): N=8, n_row=4, sample=True, - ddim_steps=200, + ddim_steps=50, ddim_eta=1.0, return_keys=None, quantize_denoised=True, diff --git a/ldm/modules/embedding_manager.py b/ldm/modules/embedding_manager.py index 09e6f495ab..18688708f9 100644 --- a/ldm/modules/embedding_manager.py +++ b/ldm/modules/embedding_manager.py @@ -169,9 +169,14 @@ class EmbeddingManager(nn.Module): placeholder_embedding.shape[0], max_step_tokens ) - placeholder_rows, placeholder_cols = torch.where( - tokenized_text == placeholder_token.to(device) - ) + if torch.cuda.is_available(): + placeholder_rows, placeholder_cols = torch.where( + tokenized_text == placeholder_token.to(device) + ) + else: + placeholder_rows, placeholder_cols = torch.where( + tokenized_text == placeholder_token + ) if placeholder_rows.nelement() == 0: continue diff --git a/main.py 
b/main.py index 72aaa49c3b..436b7251ba 100644 --- a/main.py +++ b/main.py @@ -25,6 +25,23 @@ from pytorch_lightning.utilities import rank_zero_info from ldm.data.base import Txt2ImgIterableBaseDataset from ldm.util import instantiate_from_config +def fix_func(orig): + if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available(): + def new_func(*args, **kw): + device = kw.get("device", "mps") + kw["device"]="cpu" + return orig(*args, **kw).to(device) + return new_func + return orig + +torch.rand = fix_func(torch.rand) +torch.rand_like = fix_func(torch.rand_like) +torch.randn = fix_func(torch.randn) +torch.randn_like = fix_func(torch.randn_like) +torch.randint = fix_func(torch.randint) +torch.randint_like = fix_func(torch.randint_like) +torch.bernoulli = fix_func(torch.bernoulli) +torch.multinomial = fix_func(torch.multinomial) def load_model_from_config(config, ckpt, verbose=False): print(f'Loading model from {ckpt}') @@ -422,9 +439,7 @@ class ImageLogger(Callback): self.rescale = rescale self.batch_freq = batch_frequency self.max_images = max_images - self.logger_log_images = { - pl.loggers.TestTubeLogger: self._testtube, - } + self.logger_log_images = { pl.loggers.TestTubeLogger: self._testtube, } if torch.cuda.is_available() else { } self.log_steps = [ 2**n for n in range(int(np.log2(self.batch_freq)) + 1) ] @@ -527,7 +542,7 @@ class ImageLogger(Callback): return False def on_train_batch_end( - self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx + self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx=None ): if not self.disabled and ( pl_module.global_step > 0 or self.log_first_step @@ -535,7 +550,7 @@ class ImageLogger(Callback): self.log_img(pl_module, batch, batch_idx, split='train') def on_validation_batch_end( - self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx + self, trainer, pl_module, outputs, batch, batch_idx, dataloader_idx=None ): if not self.disabled and pl_module.global_step > 0: self.log_img(pl_module, batch, batch_idx, split='val') @@ -555,7 +570,7 @@ class CUDACallback(Callback): torch.cuda.synchronize(trainer.root_gpu) self.start_time = time.time() - def on_train_epoch_end(self, trainer, pl_module, outputs): + def on_train_epoch_end(self, trainer, pl_module, outputs=None): if torch.cuda.is_available(): torch.cuda.synchronize(trainer.root_gpu) epoch_time = time.time() - self.start_time @@ -736,6 +751,12 @@ if __name__ == '__main__': trainer_kwargs = dict() # default logger configs + if torch.cuda.is_available(): + def_logger = 'testtube' + def_logger_target = 'TestTubeLogger' + else: + def_logger = 'csv' + def_logger_target = 'CSVLogger' default_logger_cfgs = { 'wandb': { 'target': 'pytorch_lightning.loggers.WandbLogger', @@ -746,15 +767,15 @@ if __name__ == '__main__': 'id': nowname, }, }, - 'testtube': { - 'target': 'pytorch_lightning.loggers.TestTubeLogger', + def_logger: { + 'target': 'pytorch_lightning.loggers.' 
+ def_logger_target, 'params': { - 'name': 'testtube', + 'name': def_logger, 'save_dir': logdir, }, }, } - default_logger_cfg = default_logger_cfgs['testtube'] + default_logger_cfg = default_logger_cfgs[def_logger] if 'logger' in lightning_config: logger_cfg = lightning_config.logger else: @@ -868,6 +889,10 @@ if __name__ == '__main__': ] trainer_kwargs['max_steps'] = trainer_opt.max_steps + if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available(): + trainer_opt.accelerator = 'mps' + trainer_opt.detect_anomaly = False + trainer = Trainer.from_argparse_args(trainer_opt, **trainer_kwargs) trainer.logdir = logdir ### From c78ae752bb5052f79431c5cfc60d09a041caf181 Mon Sep 17 00:00:00 2001 From: Any-Winter-4079 <50542132+Any-Winter-4079@users.noreply.github.com> Date: Tue, 27 Sep 2022 12:46:10 +0200 Subject: [PATCH 3/3] Fix dlopen err from choices: protobuf<3.20,>=3.9.2 pytorch-lightning==1.7.5 requires protobuf<3.20,>=3.9.2 but 3.19.5 seems to cause dlopen error on some setups. Downgrading to 3.19.4 seems to fix it. --- environment-mac.yaml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/environment-mac.yaml b/environment-mac.yaml index 74cd66ff4b..e0c7b7168d 100644 --- a/environment-mac.yaml +++ b/environment-mac.yaml @@ -32,7 +32,7 @@ dependencies: - omegaconf==2.1.1 - onnx==1.12.0 - onnxruntime==1.12.1 - - protobuf==3.19.5 + - protobuf==3.19.4 - pudb==2022.1 - pytorch-lightning==1.7.5 - scipy==1.9.1
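For readers wondering what the `fix_func` wrapper added to `main.py` in the second patch is doing: it remembers the device the caller asked for, forces the random tensor to be created on the CPU, and only then moves it to that device, apparently to sidestep random-number ops that did not behave reliably on the MPS backend at the time. A standalone sketch of the same idea follows; the `randn` usage at the bottom is illustrative only.

```python
# Standalone illustration of the fix_func idea from the main.py change above:
# on Apple Silicon, create random tensors on the CPU, then move them to the
# device the caller actually asked for (MPS by default).
import torch

def fix_func(orig):
    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        def new_func(*args, **kw):
            device = kw.get("device", "mps")   # where the caller wants the tensor
            kw["device"] = "cpu"               # ...but create it on the CPU first
            return orig(*args, **kw).to(device)
        return new_func
    return orig  # no MPS available: leave the original function untouched

# Example: a patched randn behaves like torch.randn but detours through the CPU.
randn = fix_func(torch.randn)
x = randn(2, 2)
print(x.device)  # "mps:0" on Apple Silicon with MPS, otherwise "cpu"
```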