update IMG2IMG.md

2024-08-30 20:32:17 +00:00 · 2022-10-11 06:51:29 +02:00 · 2022-10-11 06:51:29 +02:00 · a19f148a8e
commit a19f148a8e
parent c1f1dfa714
1 changed files with 55 additions and 32 deletions
--- a/docs/features/IMG2IMG.md
+++ b/docs/features/IMG2IMG.md
@ -2,7 +2,9 @@
 title: Image-to-Image
 ---

-# :material-image-multiple: **IMG2IMG**
+# :material-image-multiple: Image-to-Image
+
+## `img2img`

 This script also provides an `img2img` feature that lets you seed your creations with an initial
 drawing or photo. This is a really cool feature that tells stable diffusion to build the prompt on
@ -15,11 +17,15 @@ tree on a hill with a river, nature photograph, national geographic -I./test-pic

 This will take the original image shown here:

-<img src="https://user-images.githubusercontent.com/50542132/193946000-c42a96d8-5a74-4f8a-b4c3-5213e6cadcce.png" width=350>
+<div align="center" markdown>
+![original image](https://user-images.githubusercontent.com/50542132/193946000-c42a96d8-5a74-4f8a-b4c3-5213e6cadcce.png){width=350}
+</div>

 and generate a new image based on it as shown here:

-<img src="https://user-images.githubusercontent.com/111189/194135515-53d4c060-e994-4016-8121-7c685e281ac9.png" width=350>
+<div align="center" markdown>
+![generated result](https://user-images.githubusercontent.com/111189/194135515-53d4c060-e994-4016-8121-7c685e281ac9.png){width=350}
+</div>

 The `--init_img (-I)` option gives the path to the seed picture. `--strength (-f)` controls how much
 the original will be modified, ranging from `0.0` (keep the original intact), to `1.0` (ignore the
@ -33,30 +39,34 @@ back into img2img the requested number of times. It generates
 interesting variants.

 Note that the prompt makes a big difference. For example, this slight variation on the prompt produces
-a very different image:
-
-`photograph of a tree on a hill with a river`
+a very different image: `photograph of a tree on a hill with a river`

+<div align="center" markdown>
 <img src="https://user-images.githubusercontent.com/111189/194135220-16b62181-b60c-4248-8989-4834a8fd7fbd.png" width=350>
+</div>

-(When designing prompts, think about how the images scraped from the internet were captioned. Very few photographs will
-be labeled "photograph" or "photorealistic." They will, however, be captioned with the publication, photographer, camera
-model, or film settings.)
+!!! tip
+
+    When designing prompts, think about how the images scraped from the internet were captioned. Very few photographs will
+    be labeled "photograph" or "photorealistic." They will, however, be captioned with the publication, photographer, camera
+    model, or film settings.

 If the initial image contains transparent regions, then Stable Diffusion will only draw within the
-transparent regions, a process called "inpainting". However, for this to work correctly, the color
+transparent regions, a process called [`inpainting`](./INPAINTING.md#creating-transparent-regions-for-inpainting). However, for this to work correctly, the color
 information underneath the transparent regions needs to be preserved, not erased.

-More details can be found here:
-[Creating Transparent Images For Inpainting](./INPAINTING.md#creating-transparent-regions-for-inpainting)
+!!! warning

-**IMPORTANT ISSUE** `img2img` does not work properly on initial images smaller than 512x512. Please scale your
-image to at least 512x512 before using it. Larger images are not a problem, but may run out of VRAM on your
-GPU card. To fix this, use the --fit option, which downscales the initial image to fit within the box specified
-by width x height:
-~~~
-tree on a hill with a river, national geographic -I./test-pictures/big-sketch.png -H512 -W512 --fit
-~~~
+    `img2img` does not work properly on initial images smaller than 512x512. Please scale your
+    image to at least 512x512 before using it. Larger images are not a problem, but may run out of VRAM on your
+    GPU card.
+
+    To fix this, use the `--fit` option, which downscales the initial image to fit within the box specified
+    by width x height:
+
+    ```bash
+    tree on a hill with a river, national geographic -I./test-pictures/big-sketch.png -H512 -W512 --fit
+    ```
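
As a rough illustration of what "fit within the box" means, the sketch below computes a downscaled size for an image. The helper name `fit_within`, the aspect-ratio-preserving behavior, and the choice to never upscale are assumptions made for illustration; the actual `--fit` implementation may differ, for example by rounding dimensions to sizes the model prefers.

```python
def fit_within(width, height, max_width, max_height):
    """Return (width, height) scaled down to fit inside the given box.

    Hypothetical helper for illustration only, not the tool's actual code.
    """
    # Use the tighter of the two constraints, and cap the factor at 1.0
    # so that images already inside the box are left at their original size.
    scale = min(max_width / width, max_height / height, 1.0)
    return round(width * scale), round(height * scale)


print(fit_within(1024, 768, 512, 512))  # (512, 384)
print(fit_within(400, 300, 512, 512))   # (400, 300)
```

Under these assumptions, a 1024x768 sketch passed with `-H512 -W512 --fit` would be treated as a 512x384 image.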

 ## How does it actually work, though?

@ -66,11 +76,13 @@ gaussian noise and progressively refines it over the requested number of steps,

 **Let's start** by thinking about vanilla `prompt2img`, just generating an image from a prompt. If the step count is 10, then the "latent space" (Stable Diffusion's internal representation of the image) for the prompt "fire" with seed `1592514025` develops something like this:

-```commandline
+```bash
 invoke> "fire" -s10 -W384 -H384 -S1592514025
 ```

+<div align="center" markdown>
 ![latent steps](../assets/img2img/000019.steps.png)
+</div>

 Put simply: starting from a frame of fuzz/static, SD finds details in each frame that it thinks look like "fire" and brings them a little bit more into focus, gradually scrubbing out the fuzz until a clear image remains.

@ -80,20 +92,26 @@ Put simply: starting from a frame of fuzz/static, SD finds details in each frame

 Say I want SD to draw a fire based on this hand-drawn image:

+<div align="center" markdown>
 ![drawing of a fireplace](../assets/img2img/fire-drawing.png)
+</div>

 Let's only do 10 steps, to make it easier to see what's happening. If strength is `0.7`, this is what the internal steps the algorithm has to take will look like:

+<div align="center" markdown>
 ![](../assets/img2img/000032.steps.gravity.png)
+</div>

 With strength `0.4`, the steps look more like this:

+<div align="center" markdown>
 ![](../assets/img2img/000030.steps.gravity.png)
+</div>

 Notice how much more fuzzy the starting image is for strength `0.7` compared to `0.4`, and notice also how much longer the sequence is with `0.7`:

 |  | strength = 0.7 | strength = 0.4 |
-| -- | -- | -- |
+| -- | :--: | :--: |
 | initial image that SD sees | ![](../assets/img2img/000032.step-0.png) | ![](../assets/img2img/000030.step-0.png) |
 | steps argument to `invoke>` | `-s10` | `-s10` |
 | steps actually taken | 7 | 4 |
@ -102,10 +120,9 @@ Notice how much more fuzzy the starting image is for strength `0.7` compared to

 Both of the outputs look kind of like what I was thinking of. With the strength higher, my input becomes more vague, *and* Stable Diffusion has more steps to refine its output. But it's not really making what I want, which is a picture of a cheery open fire. With the strength lower, my input is more clear, *but* Stable Diffusion has less chance to refine itself, so the result ends up inheriting all the problems of my bad drawing.
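
The table above suggests a rule of thumb: the number of steps actually taken is roughly the requested step count multiplied by the strength. The sketch below captures that arithmetic. Both function names are hypothetical, and the relation is only an approximation inferred from the numbers on this page; the exact count depends on the sampler and implementation.

```python
def approx_steps_taken(requested_steps, strength):
    """Rule-of-thumb estimate of the steps img2img runs from the init image."""
    return round(requested_steps * strength)


def requested_steps_for(target_steps, strength):
    """Step count to request so that roughly target_steps are actually taken."""
    return target_steps / strength


print(approx_steps_taken(10, 0.7))             # 7
print(approx_steps_taken(10, 0.4))             # 4
print(round(requested_steps_for(20, 0.4), 1))  # 50.0
print(round(requested_steps_for(20, 0.7), 1))  # 28.6
```

The first two results match the "steps actually taken" row of the table. The last two show where step counts such as `50` and roughly `30` come from when the goal is about 20 steps from the initial image.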

+If you want to try this out yourself, all of these are using a seed of `1592514025` with a width/height of `384`, step count `10`, the default sampler (`k_lms`), and the single-word prompt `"fire"`:

-If you want to try this out yourself, all of these are using a seed of `1592514025` with a width/height of `384`, step count `10`, the default sampler (`k_lms`), and the single-word prompt `fire`:
-
-```commandline
+```bash
 invoke> "fire" -s10 -W384 -H384 -S1592514025 -I /tmp/fire-drawing.png --strength 0.7
 ```

@ -121,15 +138,19 @@ Here's strength `0.4` (note step count `50`, which is `20 ÷ 0.4` to make sure S
 invoke> "fire" -s50 -W384 -H384 -S1592514025 -I /tmp/fire-drawing.png -f 0.4
 ```

+<div align="center" markdown>
 ![](../assets/img2img/000035.1592514025.png)
+</div>

-and strength `0.7` (note step count `30`, which is roughly `20 ÷ 0.7` to make sure SD does `20` steps from my image):
+and here is strength `0.7` (note step count `30`, which is roughly `20 ÷ 0.7` to make sure SD does `20` steps from my image):

-```commandline
+```bash
 invoke> "fire" -s30 -W384 -H384 -S1592514025 -I /tmp/fire-drawing.png -f 0.7
 ```

+<div align="center" markdown>
 ![](../assets/img2img/000046.1592514025.png)
+</div>

 In both cases the image is nice and clean and "finished", but because at strength `0.7` Stable Diffusion has been given so much more freedom to improve on my badly-drawn flames, they've come out looking much better. You can really see the difference when looking at the latent steps. There's more noise on the first image with strength `0.7`:

@ -143,11 +164,13 @@ and that extra noise gives the algorithm more choices when it is evaluating how

 Unfortunately, it seems that `img2img` is very sensitive to the step count. Here's strength `0.7` with a step count of `29` (SD did 19 steps from my image):

+<div align="center" markdown>
 ![](../assets/img2img/000045.1592514025.png)
+</div>

 By comparing the latents we can sort of see that something got interpreted differently enough on the third or fourth step to lead to a rather different interpretation of the flames.

 ![](../assets/img2img/000046.steps.gravity.png)
 ![](../assets/img2img/000045.steps.gravity.png)

-This is the result of a difference in the de-noising "schedule" - basically the noise has to be cleaned by a certain degree each step or the model won't "converge" on the image properly (see https://huggingface.co/blog/stable_diffusion for more about that). A different step count means a different schedule, which means things get interpreted slightly differently at every step. 
+This is the result of a difference in the de-noising "schedule": basically, the noise has to be cleaned by a certain degree at each step or the model won't "converge" on the image properly (see the [Hugging Face Stable Diffusion blog post](https://huggingface.co/blog/stable_diffusion) for more about that). A different step count means a different schedule, which means things get interpreted slightly differently at every step.
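
To see why changing the step count by one shifts every step, here is a minimal sketch that spaces timesteps evenly across a training schedule. The function name, the evenly spaced layout, and the figure of 1000 training timesteps are all assumptions for illustration; samplers such as `k_lms` work with noise-level schedules that are computed differently, but the same effect applies.

```python
def evenly_spaced_timesteps(num_steps, train_steps=1000):
    """Descending timesteps, evenly spaced over a hypothetical training schedule."""
    stride = train_steps / num_steps
    return [round(train_steps - 1 - i * stride) for i in range(num_steps)]


thirty = evenly_spaced_timesteps(30)
twenty_nine = evenly_spaced_timesteps(29)

# Both schedules start at the same noisiest timestep, then drift apart.
print(thirty[:4])
print(twenty_nine[:4])
```

In this simplified picture the two schedules share a starting point but visit different noise levels at almost every later step, which is consistent with the diverging latents shown above.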