---
title: Image-to-Image
---
# :material-image-multiple: Image-to-Image
## `img2img`
This script also provides an `img2img` feature that lets you seed your creations with an initial
drawing or photo. This is a really cool feature that tells stable diffusion to build the prompt on
top of the image you provide, preserving the original's basic shape and layout. To use it, provide
the `--init_img` option as shown here:

```bash
tree on a hill with a river, nature photograph, national geographic -I./test-pictures/tree-and-river-sketch.png
```
This will take the original image shown here:
<img src="https://user-images.githubusercontent.com/50542132/193946000-c42a96d8-5a74-4f8a-b4c3-5213e6cadcce.png" width=350> <div align="center" markdown>
![original image](https://user-images.githubusercontent.com/50542132/193946000-c42a96d8-5a74-4f8a-b4c3-5213e6cadcce.png){width=350}
</div>
and generate a new image based on it as shown here:
<img src="https://user-images.githubusercontent.com/111189/194135515-53d4c060-e994-4016-8121-7c685e281ac9.png" width=350> <div align="center" markdown>
![generated result](https://user-images.githubusercontent.com/111189/194135515-53d4c060-e994-4016-8121-7c685e281ac9.png){width=350}
</div>
The `--init_img (-I)` option gives the path to the seed picture. `--strength (-f)` controls how much
the original will be modified, ranging from `0.0` (keep the original intact) to `1.0` (ignore the
original completely).
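For example, the same seed picture gives very different results at a low and a high strength. The `-f` values below are arbitrary picks of mine, and the sketch path is the one from the example above:

```bash
# low strength: stays close to the original sketch
invoke> "tree on a hill with a river, nature photograph, national geographic" -I./test-pictures/tree-and-river-sketch.png -f 0.3

# high strength: departs freely from the sketch
invoke> "tree on a hill with a river, nature photograph, national geographic" -I./test-pictures/tree-and-river-sketch.png -f 0.9
```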
You may also pass a `-v<variation_amount>` option together with `-n<iterations>` to generate
variants on the original image. This is done by passing the first generated image
back into img2img the requested number of times. It generates
interesting variants.
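For instance, something like this should produce four variants that each drift a little from the original result (the `-v` and `-n` values here are arbitrary picks of mine, not recommendations):

```bash
invoke> "tree on a hill with a river, nature photograph, national geographic" -I./test-pictures/tree-and-river-sketch.png -v0.2 -n4
```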
Note that the prompt makes a big difference. For example, this slight variation on the prompt produces
a very different image: `photograph of a tree on a hill with a river`
<div align="center" markdown>
![](https://user-images.githubusercontent.com/111189/194135220-16b62181-b60c-4248-8989-4834a8fd7fbd.png){width=350}
</div>
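If you want to try the altered prompt yourself, the invocation would look something like this (a reconstruction on my part, not necessarily the exact command used for the image above):

```bash
invoke> "photograph of a tree on a hill with a river" -I./test-pictures/tree-and-river-sketch.png
```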
!!! tip

    When designing prompts, think about how the images scraped from the internet were captioned.
    Very few photographs will be labeled "photograph" or "photorealistic." They will, however, be
    captioned with the publication, photographer, camera model, or film settings.
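For instance, a caption-style prompt (purely hypothetical, just to illustrate the idea) might read:

```bash
invoke> "tree on a hill with a river, Ektachrome film, National Geographic" -I./test-pictures/tree-and-river-sketch.png
```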
If the initial image contains transparent regions, then Stable Diffusion will only draw within the
transparent regions, a process called
[`inpainting`](./INPAINTING.md#creating-transparent-regions-for-inpainting). However, for this to
work correctly, the color information underneath the transparent regions needs to be preserved,
not erased.
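For example, you can erase part of a photo to transparency in an image editor, save it as a PNG, and pass it in as the initial image (the file name here is hypothetical):

```bash
# only the transparent (erased) region of the image will be repainted
invoke> "a roaring fire in the fireplace" -I./fireplace-with-transparent-hole.png
```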
!!! warning

    `img2img` does not work properly on initial images smaller than 512x512. Please scale your
    image to at least 512x512 before using it. Larger images are not a problem, but may run out of
    VRAM on your GPU card.

    To fix this, use the `--fit` option, which downscales the initial image to fit within the box
    specified by width x height:

    ```bash
    tree on a hill with a river, national geographic -I./test-pictures/big-sketch.png -H512 -W512 --fit
    ```
## How does it actually work, though?
The main difference between `img2img` and `prompt2img` is the starting point: `prompt2img` always starts from a frame of pure
gaussian noise and progressively refines it over the requested number of steps, while `img2img` starts partway through that sequence, using your initial image mixed with noise as its starting point.
**Let's start** by thinking about vanilla `prompt2img`, just generating an image from a prompt. If the step count is 10, then the "latent space" (Stable Diffusion's internal representation of the image) for the prompt "fire" with seed `1592514025` develops something like this:
```bash
invoke> "fire" -s10 -W384 -H384 -S1592514025 invoke> "fire" -s10 -W384 -H384 -S1592514025
```
<div align="center" markdown>
![latent steps](../assets/img2img/000019.steps.png)
</div>
Put simply: starting from a frame of fuzz/static, SD finds details in each frame that it thinks look like "fire" and brings them a little bit more into focus, gradually scrubbing out the fuzz until a clear image remains.
**With `img2img`**, some of those earlier steps are skipped, and your initial image, mixed with the right amount of noise for the point where it is inserted, is used as the starting point instead. This is where the strength parameter comes in: it controls how far back in the denoising sequence your image gets inserted.
Say I want SD to draw a fire based on this hand-drawn image:
<div align="center" markdown>
![drawing of a fireplace](../assets/img2img/fire-drawing.png)
</div>
Let's only do 10 steps, to make it easier to see what's happening. If strength is `0.7`, this is what the internal steps the algorithm has to take will look like:
<div align="center" markdown>
![](../assets/img2img/000032.steps.gravity.png)
</div>
With strength `0.4`, the steps look more like this:
<div align="center" markdown>
![](../assets/img2img/000030.steps.gravity.png)
</div>
Notice how much more fuzzy the starting image is for strength `0.7` compared to `0.4`, and notice also how much longer the sequence is with `0.7`:
| | strength = 0.7 | strength = 0.4 |
| -- | :--: | :--: |
| initial image that SD sees | ![](../assets/img2img/000032.step-0.png) | ![](../assets/img2img/000030.step-0.png) |
| steps argument to `invoke>` | `-s10` | `-s10` |
| steps actually taken | 7 | 4 |
Both of the outputs look kind of like what I was thinking of. With the strength higher, my input becomes more vague, *and* Stable Diffusion has more steps to refine its output. But it's not really making what I want, which is a picture of a cheery open fire. With the strength lower, my input is more clear, *but* Stable Diffusion has less chance to refine itself, so the result ends up inheriting all the problems of my bad drawing.
If you want to try this out yourself, all of these are using a seed of `1592514025` with a width/height of `384`, step count `10`, the default sampler (`k_lms`), and the single-word prompt `"fire"`:

```bash
invoke> "fire" -s10 -W384 -H384 -S1592514025 -I /tmp/fire-drawing.png --strength 0.7 invoke> "fire" -s10 -W384 -H384 -S1592514025 -I /tmp/fire-drawing.png --strength 0.7
```
So what happens if we increase the step count to compensate, so that SD takes the same number of effective steps from the image regardless of strength?

Here's strength `0.4` (note step count `50`, which is `20 ÷ 0.4` to make sure SD does `20` steps from my image):

```bash
invoke> "fire" -s50 -W384 -H384 -S1592514025 -I /tmp/fire-drawing.png -f 0.4
```
<div align="center" markdown>
![](../assets/img2img/000035.1592514025.png)
</div>
and here is strength `0.7` (note step count `30`, which is roughly `20 ÷ 0.7` to make sure SD does `20` steps from my image):
```bash
invoke> "fire" -s30 -W384 -H384 -S1592514025 -I /tmp/fire-drawing.png -f 0.7 invoke> "fire" -s30 -W384 -H384 -S1592514025 -I /tmp/fire-drawing.png -f 0.7
```
<div align="center" markdown>
![](../assets/img2img/000046.1592514025.png)
</div>
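Generalizing: to give SD `n` effective steps at strength `f`, request roughly `n ÷ f` steps. For example, a middle-ground strength of `0.55` would need a step count of about `36` to get `20` effective steps. This just extends the arithmetic above; I haven't verified this particular combination:

```bash
# 20 effective steps ÷ 0.55 strength ≈ 36 requested steps
invoke> "fire" -s36 -W384 -H384 -S1592514025 -I /tmp/fire-drawing.png -f 0.55
```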
In both cases the image is nice and clean and "finished", but because at strength `0.7` Stable Diffusion has been given so much more freedom to improve on my badly-drawn flames, they've come out looking much better. You can really see the difference when looking at the latent steps: there's more noise on the first image with strength `0.7` than with `0.4`, and that extra noise gives the algorithm more choices when it is evaluating how to denoise it into the final image.
Unfortunately, it seems that `img2img` is very sensitive to the step count. Here's strength `0.7` with a step count of `29` (SD did 19 steps from my image):
<div align="center" markdown>
![](../assets/img2img/000045.1592514025.png)
</div>
By comparing the latents we can sort of see that something got interpreted differently enough on the third or fourth step to lead to a rather different interpretation of the flames.
<div align="center" markdown>
![](../assets/img2img/000046.steps.gravity.png)
![](../assets/img2img/000045.steps.gravity.png)
</div>
This is the result of a difference in the de-noising "schedule": basically, the noise has to be cleaned by a certain degree each step or the model won't "converge" on the image properly (see the [stable diffusion blog](https://huggingface.co/blog/stable_diffusion) for more about that). A different step count means a different schedule, which means things get interpreted slightly differently at every step.
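The practical upshot, as far as I can tell: to reproduce an `img2img` result exactly, every parameter that affects the schedule has to match, not just the seed. Compare the two invocations above:

```bash
# same seed, same strength, but a different step count gives a different
# denoising schedule, and therefore a noticeably different image
invoke> "fire" -s30 -W384 -H384 -S1592514025 -I /tmp/fire-drawing.png -f 0.7
invoke> "fire" -s29 -W384 -H384 -S1592514025 -I /tmp/fire-drawing.png -f 0.7
```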