---
title: Image-to-Image
---

# :material-image-multiple: Image-to-Image

## `img2img`

This script also provides an `img2img` feature that lets you seed your creations with an initial
drawing or photo. This is a really cool feature that tells stable diffusion to build the prompt on
top of the image you provide.
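
For example, an invocation along these lines turns a rough sketch into a nature photograph (the
sketch filename here is an illustrative stand-in):

```bash
# the sketch path is illustrative
tree on a hill with a river, nature photograph, national geographic -I./test-pictures/tree-and-river-sketch.png
```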

This will take the original image shown here:

<div align="center" markdown>
{width=350}
</div>

and generate a new image based on it as shown here:

<div align="center" markdown>
{width=350}
</div>

The `--init_img (-I)` option gives the path to the seed picture. `--strength (-f)` controls how much
the original will be modified, ranging from `0.0` (keep the original intact), to `1.0` (ignore the
original completely).
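
For instance, a middle-of-the-road strength keeps the composition recognizable while letting the
model rework the details (same illustrative sketch path as above):

```bash
# -f 0.5 lets the model rework roughly half of the original
tree on a hill with a river, nature photograph, national geographic -I./test-pictures/tree-and-river-sketch.png -f 0.5
```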

Passing a generated image back into img2img the requested number of times generates
interesting variants.
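
Conceptually, that loop looks like this (the output filename is illustrative):

```bash
tree on a hill with a river, nature photograph, national geographic -I./test-pictures/tree-and-river-sketch.png
# feed the generated file back in as the next initial image
tree on a hill with a river, nature photograph, national geographic -I outputs/img-samples/000001.png
```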

Note that the prompt makes a big difference. For example, this slight variation on the prompt
produces a very different image: `photograph of a tree on a hill with a river`

<div align="center" markdown>
<img src="https://user-images.githubusercontent.com/111189/194135220-16b62181-b60c-4248-8989-4834a8fd7fbd.png" width=350>
</div>

!!! tip

    When designing prompts, think about how the images scraped from the internet were captioned.
    Very few photographs will be labeled "photograph" or "photorealistic." They will, however, be
    captioned with the publication, photographer, camera model, or film settings.

If the initial image contains transparent regions, then Stable Diffusion will only draw within the
transparent regions, a process called [`inpainting`](./INPAINTING.md#creating-transparent-regions-for-inpainting).
However, for this to work correctly, the color information underneath the transparent regions
needs to be preserved, not erased.
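
For example, erase the sky of the sketch to transparency in an image editor and pass the result
in; only the transparent region will be repainted (the filename is hypothetical):

```bash
# only the erased (transparent) sky will be redrawn
tree on a hill with a river, national geographic -I./test-pictures/tree-sketch-transparent-sky.png
```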

!!! warning

    `img2img` does not work properly on initial images smaller than 512x512. Please scale your
    image to at least 512x512 before using it. Larger images are not a problem, but may run out
    of VRAM on your GPU card.

To fix this, use the `--fit` option, which downscales the initial image to fit within the box
specified by width x height:

```bash
tree on a hill with a river, national geographic -I./test-pictures/big-sketch.png -H512 -W512 --fit
```

## How does it actually work, though?

**Let's start** by thinking about vanilla `prompt2img`, just generating an image from a prompt. If the step count is 10, then the "latent space" (Stable Diffusion's internal representation of the image) for the prompt "fire" with seed `1592514025` develops something like this:

```bash
invoke> "fire" -s10 -W384 -H384 -S1592514025
```

<div align="center" markdown>

</div>

Put simply: starting from a frame of fuzz/static, SD finds details in each frame that it thinks look like "fire" and brings them a little bit more into focus, gradually scrubbing out the fuzz until a clear image remains.

Say I want SD to draw a fire based on this hand-drawn image:

<div align="center" markdown>

</div>

Let's only do 10 steps, to make it easier to see what's happening. If strength is `0.7`, here is what the algorithm's internal steps look like:

<div align="center" markdown>

</div>

With strength `0.4`, the steps look more like this:

<div align="center" markdown>

</div>

Notice how much fuzzier the starting image is for strength `0.7` compared to `0.4`, and notice also how much longer the sequence is with `0.7`:

|                              | strength = 0.7 | strength = 0.4 |
| ---------------------------- | :--: | :--: |
| initial image that SD sees   |  |  |
| steps argument to `invoke>`  | `-s10` | `-s10` |
| steps actually taken         | 7 | 4 |

In other words, the number of steps actually taken appears to be roughly `strength × steps`: `0.7 × 10 = 7` and `0.4 × 10 = 4`, which also matches the step counts chosen in the examples below.

Both of the outputs look kind of like what I was thinking of. With the strength higher, my input becomes more vague, *and* Stable Diffusion has more steps to refine its output. But it's not really making what I want, which is a picture of a cheery open fire. With the strength lower, my input is clearer, *but* Stable Diffusion has less chance to refine itself, so the result ends up inheriting all the problems of my bad drawing.

If you want to try this out yourself, all of these are using a seed of `1592514025` with a width/height of `384`, step count `10`, the default sampler (`k_lms`), and the single-word prompt `"fire"`:

```bash
invoke> "fire" -s10 -W384 -H384 -S1592514025 -I /tmp/fire-drawing.png --strength 0.7
```

Here's strength `0.4` (note step count `50`, which is `20 ÷ 0.4` to make sure SD does `20` steps from my image):

```bash
invoke> "fire" -s50 -W384 -H384 -S1592514025 -I /tmp/fire-drawing.png -f 0.4
```

<div align="center" markdown>

</div>

and here is strength `0.7` (note step count `30`, which is roughly `20 ÷ 0.7` to make sure SD does `20` steps from my image):

```bash
invoke> "fire" -s30 -W384 -H384 -S1592514025 -I /tmp/fire-drawing.png -f 0.7
```

<div align="center" markdown>

</div>

In both cases the image is nice and clean and "finished", but because at strength `0.7` Stable Diffusion has been given so much more freedom to improve on my badly-drawn flames, they've come out looking much better. You can really see the difference when looking at the latent steps. There's more noise on the first image with strength `0.7`:

Unfortunately, it seems that `img2img` is very sensitive to the step count. Here's strength `0.7` with a step count of `29` (SD did 19 steps from my image):
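
Following the pattern of the earlier commands, that run would presumably be:

```bash
invoke> "fire" -s29 -W384 -H384 -S1592514025 -I /tmp/fire-drawing.png -f 0.7
```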

<div align="center" markdown>

</div>

By comparing the latents we can sort of see that something got interpreted differently enough on the third or fourth step to lead to a rather different interpretation of the flames.




This is the result of a difference in the de-noising "schedule": the noise has to be cleaned up by a certain degree at each step or the model won't "converge" on the image properly (see the [Stable Diffusion blog post](https://huggingface.co/blog/stable_diffusion) for more about that). A different step count means a different schedule, which means things get interpreted slightly differently at every step; going from `-s30` to `-s29`, for instance, changes the noise level assigned to every individual step, which is why the two runs diverge.