---
title: CompViz-Readme
---

# _README from [CompVis/stable-diffusion](https://github.com/CompVis/stable-diffusion)_

_Stable Diffusion was made possible thanks to a collaboration with
[Stability AI](https://stability.ai/) and [Runway](https://runwayml.com/) and
builds upon our previous work:_

[**High-Resolution Image Synthesis with Latent Diffusion Models**](https://ommer-lab.com/research/latent-diffusion-models/)<br/>
[Robin Rombach](https://github.com/rromb)\*,
[Andreas Blattmann](https://github.com/ablattmann)\*,
[Dominik Lorenz](https://github.com/qp-qp),
[Patrick Esser](https://github.com/pesser),
[Björn Ommer](https://hci.iwr.uni-heidelberg.de/Staff/bommer)<br/>

## **CVPR '22 Oral**

The code for this work is available on
[GitHub](https://github.com/CompVis/latent-diffusion), and the PDF is at
[arXiv](https://arxiv.org/abs/2112.10752). Please also visit our
[Project page](https://ommer-lab.com/research/latent-diffusion-models/).

![txt2img-stable2](../assets/stable-samples/txt2img/merged-0006.png)
[Stable Diffusion](#stable-diffusion-v1) is a latent text-to-image diffusion
model. Thanks to a generous compute donation from
[Stability AI](https://stability.ai/) and support from
[LAION](https://laion.ai/), we were able to train a Latent Diffusion Model on
512x512 images from a subset of the [LAION-5B](https://laion.ai/blog/laion-5b/)
database. Similar to Google's [Imagen](https://arxiv.org/abs/2205.11487), this
model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text
prompts. With its 860M UNet and 123M text encoder, the model is relatively
lightweight and runs on a GPU with at least 10GB VRAM. See
[this section](#stable-diffusion-v1) below and the
[model card](https://huggingface.co/CompVis/stable-diffusion).

## Requirements

A suitable [conda](https://conda.io/) environment named `ldm` can be created and
activated with:

```bash
conda env create -f environment.yaml
conda activate ldm
```

You can also update an existing
[latent diffusion](https://github.com/CompVis/latent-diffusion) environment by
running

```bash
conda install pytorch torchvision -c pytorch
pip install transformers==4.19.2
pip install -e .
```

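
Once the environment is active, it can be handy to confirm that PyTorch sees a GPU with enough memory. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, and it assumes the "10GB VRAM" figure from the introduction can be read as 10 GiB.

```python
import importlib.util

# Assumption: the 10GB guideline from the introduction, interpreted as GiB.
REQUIRED_VRAM_GIB = 10


def has_enough_vram(total_bytes: int, required_gib: float = REQUIRED_VRAM_GIB) -> bool:
    """Return True if a device with `total_bytes` of memory meets the guideline."""
    return total_bytes >= required_gib * 1024**3


def describe_environment() -> str:
    """Summarise whether PyTorch and a large-enough CUDA GPU are available."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment."
    import torch

    if not torch.cuda.is_available():
        return "PyTorch is installed, but no CUDA device is visible."
    props = torch.cuda.get_device_properties(0)
    verdict = "meets" if has_enough_vram(props.total_memory) else "is below"
    gib = props.total_memory / 1024**3
    return f"{props.name}: {gib:.1f} GiB, {verdict} the {REQUIRED_VRAM_GIB} GiB guideline."


if __name__ == "__main__":
    print(describe_environment())
```

The check only inspects device 0; on multi-GPU machines the index would need to be adjusted.
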
## Stable Diffusion v1

Stable Diffusion v1 refers to a specific configuration of the model architecture
that uses a downsampling-factor 8 autoencoder with an 860M UNet and CLIP
ViT-L/14 text encoder for the diffusion model. The model was pretrained on
256x256 images and then finetuned on 512x512 images.

\*Note: Stable Diffusion v1 is a general text-to-image diffusion model and
therefore mirrors biases and (mis-)conceptions that are present in its training
data. Details on the training procedure and data, as well as the intended use of
the model can be found in the corresponding
[model card](https://huggingface.co/CompVis/stable-diffusion). Research into the
safe deployment of general text-to-image models is an ongoing effort. To prevent
misuse and harm, we currently provide access to the checkpoints only for
[academic research purposes upon request](https://stability.ai/academia-access-form).
**This is an experiment in safe and community-driven publication of a capable
and general text-to-image model. We are working on a public release with a more
permissive license that also incorporates ethical considerations.\***

[Request access to Stable Diffusion v1 checkpoints for academic research](https://stability.ai/academia-access-form)

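
To make the downsampling factor of 8 described in this section concrete, the sketch below computes the shape of the latent that the UNet operates on for a given image size. The function name is hypothetical, and the default of 4 latent channels is an assumption that is not stated in this document.

```python
from typing import Tuple


def latent_shape(height: int, width: int, channels: int = 4, factor: int = 8) -> Tuple[int, int, int]:
    """Shape (C, H // f, W // f) of the latent for an image of the given pixel size.

    `channels=4` is an assumed default; `factor=8` is the autoencoder's
    downsampling factor quoted in the text.
    """
    if height % factor or width % factor:
        raise ValueError(f"height and width must be multiples of {factor}")
    return (channels, height // factor, width // factor)


if __name__ == "__main__":
    c, h, w = latent_shape(512, 512)
    pixels = 3 * 512 * 512   # values in an RGB image
    latents = c * h * w      # values in the latent
    print((c, h, w), f"{pixels // latents}x fewer values than pixel space")
```

Under these assumptions a 512x512 image maps to a 64x64 latent grid, which is a large part of why the diffusion model is comparatively cheap to run.
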
### Weights

We currently provide three checkpoints, `sd-v1-1.ckpt`, `sd-v1-2.ckpt` and
`sd-v1-3.ckpt`, which were trained as follows:

- `sd-v1-1.ckpt`: 237k steps at resolution `256x256` on
  [laion2B-en](https://huggingface.co/datasets/laion/laion2B-en). 194k steps at
  resolution `512x512` on
  [laion-high-resolution](https://huggingface.co/datasets/laion/laion-high-resolution)
  (170M examples from LAION-5B with resolution `>= 1024x1024`).
- `sd-v1-2.ckpt`: Resumed from `sd-v1-1.ckpt`. 515k steps at resolution
  `512x512` on "laion-improved-aesthetics" (a subset of laion2B-en, filtered to
  images with an original size `>= 512x512`, estimated aesthetics score `> 5.0`,
  and an estimated watermark probability `< 0.5`. The watermark estimate is from
  the LAION-5B metadata, the aesthetics score is estimated using an
  [improved aesthetics estimator](https://github.com/christophschuhmann/improved-aesthetic-predictor)).
- `sd-v1-3.ckpt`: Resumed from `sd-v1-2.ckpt`. 195k steps at resolution
  `512x512` on "laion-improved-aesthetics" and 10% dropping of the
  text-conditioning to improve
  [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).

Evaluations with different classifier-free guidance scales (1.5, 2.0, 3.0, 4.0,
5.0, 6.0, 7.0, 8.0) and 50 PLMS sampling steps show the relative improvements of
the checkpoints: ![sd evaluation results](../assets/v1-variants-scores.jpg)

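
The "laion-improved-aesthetics" filter used for `sd-v1-2.ckpt` can be expressed as a simple predicate. This is only an illustrative sketch: the record type and its field names are hypothetical (they are not the LAION metadata schema), and it assumes "original size `>= 512x512`" means both sides are at least 512 pixels.

```python
from dataclasses import dataclass


@dataclass
class ImageRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    width: int
    height: int
    aesthetic_score: float   # estimated aesthetics score
    watermark_prob: float    # estimated watermark probability


def keep_for_improved_aesthetics(rec: ImageRecord) -> bool:
    """Apply the three thresholds quoted in the checkpoint description."""
    return (
        rec.width >= 512
        and rec.height >= 512
        and rec.aesthetic_score > 5.0
        and rec.watermark_prob < 0.5
    )


if __name__ == "__main__":
    records = [
        ImageRecord(1024, 768, 5.6, 0.1),   # passes all three checks
        ImageRecord(400, 600, 6.2, 0.0),    # too small
        ImageRecord(800, 800, 5.0, 0.2),    # score not strictly above 5.0
        ImageRecord(800, 800, 5.5, 0.7),    # likely watermarked
    ]
    kept = [r for r in records if keep_for_improved_aesthetics(r)]
    print(f"kept {len(kept)} of {len(records)} records")
```
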
### Text-to-Image with Stable Diffusion

![txt2img-stable2](../assets/stable-samples/txt2img/merged-0005.png)
![txt2img-stable2](../assets/stable-samples/txt2img/merged-0007.png)

Stable Diffusion is a latent diffusion model conditioned on the (non-pooled)
text embeddings of a CLIP ViT-L/14 text encoder.

#### Sampling Script

After [obtaining the weights](#weights), link them

```bash
mkdir -p models/ldm/stable-diffusion-v1/
ln -s <path/to/model.ckpt> models/ldm/stable-diffusion-v1/model.ckpt
```

and sample with

```bash
python scripts/txt2img.py --prompt "a photograph of an astronaut riding a horse" --plms
```

By default, this uses a guidance scale of `--scale 7.5`,
[Katherine Crowson's implementation](https://github.com/CompVis/latent-diffusion/pull/51)
of the [PLMS](https://arxiv.org/abs/2202.09778) sampler, and renders images of
size 512x512 (which it was trained on) in 50 steps. All supported arguments are
listed below (type `python scripts/txt2img.py --help`).

```commandline
usage: txt2img.py [-h] [--prompt [PROMPT]] [--outdir [OUTDIR]] [--skip_grid] [--skip_save] [--ddim_steps DDIM_STEPS] [--plms] [--laion400m] [--fixed_code] [--ddim_eta DDIM_ETA] [--n_iter N_ITER] [--H H] [--W W] [--C C] [--f F] [--n_samples N_SAMPLES] [--n_rows N_ROWS]
                  [--scale SCALE] [--from-file FROM_FILE] [--config CONFIG] [--ckpt CKPT] [--seed SEED] [--precision {full,autocast}]

optional arguments:
  -h, --help            show this help message and exit
  --prompt [PROMPT]     the prompt to render
  --outdir [OUTDIR]     dir to write results to
  --skip_grid           do not save a grid, only individual samples. Helpful when evaluating lots of samples
  --skip_save           do not save individual samples. For speed measurements.
  --ddim_steps DDIM_STEPS
                        number of ddim sampling steps
  --plms                use plms sampling
  --laion400m           uses the LAION400M model
  --fixed_code          if enabled, uses the same starting code across samples
  --ddim_eta DDIM_ETA   ddim eta (eta=0.0 corresponds to deterministic sampling
  --n_iter N_ITER       sample this often
  --H H                 image height, in pixel space
  --W W                 image width, in pixel space
  --C C                 latent channels
  --f F                 downsampling factor
  --n_samples N_SAMPLES
                        how many samples to produce for each given prompt. A.k.a. batch size
                        (note that the seeds for each image in the batch will be unavailable)
  --n_rows N_ROWS       rows in the grid (default: n_samples)
  --scale SCALE         unconditional guidance scale: eps = eps(x, empty) + scale * (eps(x, cond) - eps(x, empty))
  --from-file FROM_FILE
                        if specified, load prompts from this file
  --config CONFIG       path to config which constructs model
  --ckpt CKPT           path to checkpoint of model
  --seed SEED           the seed (for reproducible sampling)
  --precision {full,autocast}
                        evaluate at this precision
```

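
The `--scale` help text above gives the guidance formula `eps = eps(x, empty) + scale * (eps(x, cond) - eps(x, empty))`. As a minimal sketch of that arithmetic, the hypothetical helper below applies it element-wise to two noise predictions represented as plain lists of floats. In the real sampler these would be tensors produced by the UNet.

```python
from typing import List, Sequence


def guided_eps(eps_uncond: Sequence[float], eps_cond: Sequence[float], scale: float = 7.5) -> List[float]:
    """Combine unconditional and conditional noise predictions.

    Implements eps = eps_uncond + scale * (eps_cond - eps_uncond) element-wise.
    The default of 7.5 mirrors the `--scale 7.5` default mentioned in the text.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    uncond = [0.0, 1.0]
    cond = [1.0, 3.0]
    print(guided_eps(uncond, cond, scale=1.0))  # scale 1 reproduces the conditional prediction
    print(guided_eps(uncond, cond, scale=2.0))  # larger scales push further along (cond - uncond)
```

A scale of 0 returns the unconditional prediction, a scale of 1 returns the conditional one, and larger values extrapolate beyond it.
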
Note: The inference config for all v1 versions is designed to be used with
EMA-only checkpoints. For this reason, `use_ema=False` is set in the
configuration; otherwise, the code will try to switch from non-EMA to EMA
weights. If you want to examine the effect of EMA vs no EMA, we provide "full"
checkpoints which contain both types of weights. For these, `use_ema=False` will
load and use the non-EMA weights.

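
For readers unfamiliar with the term, EMA weights are an exponential moving average of the training weights. The sketch below illustrates the generic update rule on plain floats. It is not the repository's implementation: the function name is hypothetical and the decay value is an assumption chosen for illustration.

```python
from typing import Dict


def ema_update(ema_weights: Dict[str, float], weights: Dict[str, float], decay: float = 0.999) -> None:
    """Update `ema_weights` in place: ema = decay * ema + (1 - decay) * current.

    `decay=0.999` is an illustrative assumption, not a value taken from the
    Stable Diffusion training configuration.
    """
    for name, value in weights.items():
        ema_weights[name] = decay * ema_weights[name] + (1.0 - decay) * value


if __name__ == "__main__":
    current = {"w": 1.0}
    ema = {"w": 0.0}
    for _ in range(3):
        ema_update(ema, current, decay=0.5)
    print(ema["w"])  # moves towards the current weight: 0.5, 0.75, then 0.875
```
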
#### Diffusers Integration

Another way to download and sample Stable Diffusion is by using the
[diffusers library](https://github.com/huggingface/diffusers/tree/main#new--stable-diffusion-is-now-fully-compatible-with-diffusers):

```py
# make sure you're logged in with `huggingface-cli login`
from torch import autocast
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-3-diffusers",
    use_auth_token=True,
).to("cuda")  # move the pipeline to the GPU so that autocast("cuda") applies

prompt = "a photo of an astronaut riding a horse on mars"
with autocast("cuda"):
    # recent diffusers releases expose the results as `.images`;
    # early releases used `pipe(prompt)["sample"][0]` instead
    image = pipe(prompt).images[0]

image.save("astronaut_rides_horse.png")
```

### Image Modification with Stable Diffusion

By using a diffusion-denoising mechanism as first proposed by
[SDEdit](https://arxiv.org/abs/2108.01073), the model can be used for different
tasks such as text-guided image-to-image translation and upscaling. Similar to
the txt2img sampling script, we provide a script to perform image modification
with Stable Diffusion.

The following describes an example where a rough sketch made in
[Pinta](https://www.pinta-project.com/) is converted into a detailed artwork.

```bash
python scripts/img2img.py --prompt "A fantasy landscape, trending on artstation" --init-img <path-to-img.jpg> --strength 0.8
```

Here, `strength` is a value between 0.0 and 1.0 that controls the amount of
noise added to the input image. Values approaching 1.0 allow for lots of
variation but will also produce images that are not semantically consistent
with the input. See the following example.

**Input**

![sketch-in](../assets/stable-samples/img2img/sketch-mountains-input.jpg)

**Outputs**

![out3](../assets/stable-samples/img2img/mountains-3.png)
![out2](../assets/stable-samples/img2img/mountains-2.png)

This procedure can, for example, also be used to upscale samples from the base
model.

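
One way to build intuition for `--strength` is to look at how many sampler steps of noise it corresponds to. The sketch below assumes the script maps strength onto steps as `int(strength * steps)` and that 50 sampling steps are used. Both are assumptions for illustration, and the function name is hypothetical.

```python
def noising_steps(strength: float, sampling_steps: int = 50) -> int:
    """Number of sampler steps by which the input image is noised before denoising.

    Assumes the mapping int(strength * sampling_steps); `sampling_steps=50`
    is an assumed default, not a value stated in this section.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0.0, 1.0]")
    return int(strength * sampling_steps)


if __name__ == "__main__":
    for s in (0.0, 0.25, 0.5, 1.0):
        print(f"strength={s}: {noising_steps(s)} of 50 steps")
```

Under this reading, a strength of 0.0 leaves the input untouched, while 1.0 noises it for the full schedule, so little of the original image survives.
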
## Comments

- Our codebase for the diffusion models builds heavily on
  [OpenAI's ADM codebase](https://github.com/openai/guided-diffusion) and
  [https://github.com/lucidrains/denoising-diffusion-pytorch](https://github.com/lucidrains/denoising-diffusion-pytorch).
  Thanks for open-sourcing!

- The implementation of the transformer encoder is from
  [x-transformers](https://github.com/lucidrains/x-transformers) by
  [lucidrains](https://github.com/lucidrains?tab=repositories).

## BibTeX

```
@misc{rombach2021highresolution,
      title={High-Resolution Image Synthesis with Latent Diffusion Models},
      author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
      year={2021},
      eprint={2112.10752},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```