Merge branch 'development' of github.com:lstein/stable-diffusion into asymmetric-tiling

This commit is contained in:
Carson Katri 2022-10-18 13:34:10 -04:00
commit 9d19213b8a
21 changed files with 321 additions and 50 deletions

New binary image added (338 KiB); preview not shown.

New binary image added (59 KiB); preview not shown.

View File

@ -85,6 +85,7 @@ overridden on a per-prompt basis (see [List of prompt arguments](#list-of-prompt
| `--from_file <path>` | | `None` | Read list of prompts from a file. Use `-` to read from standard input |
| `--model <modelname>` | | `stable-diffusion-1.4` | Loads model specified in configs/models.yaml. Currently one of "stable-diffusion-1.4" or "laion400m" |
| `--full_precision` | `-F` | `False` | Run in slower full-precision mode. Needed for Macintosh M1/M2 hardware and some older video cards. |
| `--png_compression <0-9>` | `-z<0-9>` | `6` | Select level of compression for output files, from 0 (no compression) to 9 (max compression) |
| `--web` | | `False` | Start in web server mode |
| `--host <ip addr>` | | `localhost` | Which network interface web server should listen on. Set to 0.0.0.0 to listen on any. |
| `--port <port>` | | `9090` | Which port web server should listen for requests on. |
@ -153,6 +154,7 @@ Here are the invoke> commands that apply to txt2img:
| --seed <int> | -S<int> | None | Set the random seed for the next series of images. This can be used to recreate an image generated previously.|
| --sampler <sampler>| -A<sampler>| k_lms | Sampler to use. Use -h to get list of available samplers. |
| --hires_fix | | | Larger images often have duplication artefacts. This option suppresses duplicates by generating the image at low res, and then using img2img to increase the resolution |
| --png_compression <0-9> | -z<0-9> | 6 | Select level of compression for output files, from 0 (no compression) to 9 (max compression) |
| --grid | -g | False | Turn on grid mode to return a single image combining all the images generated by this prompt |
| --individual | -i | True | Turn off grid mode (deprecated; leave off --grid instead) |
| --outdir <path> | -o<path> | outputs/img_samples | Temporarily change the location of these images |
@ -211,11 +213,35 @@ accepts additional options:
[Inpainting](./INPAINTING.md) for details.
inpainting accepts all the arguments used for txt2img and img2img, as
well as the --mask (-M) argument:
well as the --mask (-M) and --text_mask (-tm) arguments:
| Argument <img width="100" align="right"/> | Shortcut | Default | Description |
|--------------------|------------|---------------------|--------------|
| `--init_mask <path>` | `-M<path>` | `None` |Path to an image the same size as the initial_image, with areas for inpainting made transparent.|
| `--text_mask <prompt> [<float>]` | `-tm <prompt> [<float>]` | <none> | Create a mask from a text prompt describing part of the image|
`--text_mask` (short form `-tm`) is a way to generate a mask using a
text description of the part of the image to replace. For example, if
you have an image of a breakfast plate with a bagel, toast and
scrambled eggs, you can selectively mask the bagel and replace it with
a piece of cake this way:
~~~
invoke> a piece of cake -I /path/to/breakfast.png -tm bagel
~~~
The algorithm uses <a
href="https://github.com/timojl/clipseg">clipseg</a> to classify
different regions of the image. The classifier puts out a confidence
score for each region it identifies. Generally regions that score
above 0.5 are reliable, but if you are getting too much or too little
masking you can adjust the threshold down (to get more mask), or up
(to get less). In this example, by passing `-tm` a higher value, we
are insisting on a more stringent classification.
~~~
invoke> a piece of cake -I /path/to/breakfast.png -tm bagel 0.6
~~~
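The same thresholding can be exercised directly in Python through the `Txt2Mask` helper that ships with this feature. The snippet below is a minimal sketch rather than part of the CLI: the image path and prompt are illustrative, and it assumes the clipseg weights have already been downloaded by the preload script.
~~~
from PIL import Image
from ldm.invoke.txt2mask import Txt2Mask

# Illustrative path and prompt; substitute your own image and description.
txt2mask  = Txt2Mask(device='cuda')
segmented = txt2mask.segment(Image.open('/path/to/breakfast.png'), 'bagel')

# A higher threshold keeps only the pixels clipseg is most confident about.
loose_mask = segmented.to_mask(threshold=0.5)
tight_mask = segmented.to_mask(threshold=0.6)
~~~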
# Other Commands

View File

@ -34,7 +34,46 @@ original unedited image and the masked (partially transparent) image:
invoke> "man with cat on shoulder" -I./images/man.png -M./images/man-transparent.png
```
We are hoping to get rid of the need for this workaround in an upcoming release.
## **Masking using Text**
You can also create a mask using a text prompt to select the part of
the image you want to alter, using the <a
href="https://github.com/timojl/clipseg">clipseg</a> algorithm. This
works on any image, not just ones generated by InvokeAI.
The `--text_mask` (short form `-tm`) option takes two arguments. The
first argument is a text description of the part of the image you wish
to mask (paint over). If the text description contains a space, you must
surround it with quotation marks. The optional second argument is the
minimum threshold for the mask classifier's confidence score, described
in more detail below.
To see how this works in practice, here's an image of a still life
painting that I got off the web.
<img src="../assets/still-life-scaled.jpg">
You can selectively mask out the
orange and replace it with a baseball in this way:
~~~
invoke> a baseball -I /path/to/still_life.png -tm orange
~~~
<img src="../assets/still-life-inpainted.png">
The clipseg classifier produces a confidence score for each region it
identifies. Generally regions that score above 0.5 are reliable, but
if you are getting too much or too little masking you can adjust the
threshold down (to get more mask), or up (to get less). In this
example, by passing `-tm` a higher value, we are insisting on a tighter
mask. However, if you make it too high, the orange may not be picked
up at all!
~~~
invoke> a baseball -I /path/to/still_life.png -tm orange 0.6
~~~
### Inpainting is not changing the masked region enough!

View File

@ -57,6 +57,7 @@ dependencies:
- -e git+https://github.com/openai/CLIP.git@main#egg=clip
- -e git+https://github.com/Birch-san/k-diffusion.git@mps#egg=k_diffusion
- -e git+https://github.com/TencentARC/GFPGAN.git#egg=gfpgan
- -e git+https://github.com/invoke-ai/clipseg.git#egg=clipseg
- -e .
variables:
PYTORCH_ENABLE_MPS_FALLBACK: 1

View File

@ -37,4 +37,5 @@ dependencies:
- -e git+https://github.com/CompVis/taming-transformers.git@master#egg=taming-transformers
- -e git+https://github.com/Birch-san/k-diffusion.git@mps#egg=k_diffusion
- -e git+https://github.com/TencentARC/GFPGAN.git#egg=gfpgan
- -e git+https://github.com/invoke-ai/clipseg.git#egg=clipseg
- -e .

File diff suppressed because one or more lines are too long

View File

@ -6,7 +6,7 @@
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>InvokeAI - A Stable Diffusion Toolkit</title>
<link rel="shortcut icon" type="icon" href="/assets/favicon.0d253ced.ico" />
<script type="module" crossorigin src="/assets/index.ea68b5f5.js"></script>
<script type="module" crossorigin src="/assets/index.89883620.js"></script>
<link rel="stylesheet" href="/assets/index.58175ea1.css">
</head>

View File

@ -22,9 +22,9 @@ import * as InvokeAI from '../invokeai';
* some new action to handle whatever data was sent from the server.
*/
export const socketioMiddleware = () => {
const { hostname, port } = new URL(window.location.href);
const { origin } = new URL(window.location.href);
const socketio = io(`http://${hostname}:${port}`, {
const socketio = io(origin, {
timeout: 60000,
});

View File

@ -35,7 +35,8 @@ from ldm.invoke.devices import choose_torch_device, choose_precision
from ldm.invoke.conditioning import get_uc_and_c
from ldm.invoke.model_cache import ModelCache
from ldm.invoke.seamless import configure_model_padding
from ldm.invoke.txt2mask import Txt2Mask, SegmentedGrayscale
def fix_func(orig):
if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
def new_func(*args, **kw):
@ -190,6 +191,7 @@ class Generate:
self.esrgan = esrgan
self.free_gpu_mem = free_gpu_mem
self.size_matters = True # used to warn once about large image sizes and VRAM
self.txt2mask = None
# Note that in previous versions, there was an option to pass the
# device to Generate(). However the device was then ignored, so
@ -269,6 +271,7 @@ class Generate:
# these are specific to img2img and inpaint
init_img = None,
init_mask = None,
text_mask = None,
fit = False,
strength = None,
init_color = None,
@ -301,6 +304,8 @@ class Generate:
seamless // whether the generated image should tile
hires_fix // whether the Hires Fix should be applied during generation
init_img // path to an initial image
init_mask // path to a mask for the initial image
text_mask // a text string that will be used to guide clipseg generation of the init_mask
strength // strength for noising/unnoising init_img. 0.0 preserves image exactly, 1.0 replaces it completely
facetool_strength // strength for GFPGAN/CodeFormer. 0.0 preserves image exactly, 1.0 replaces it completely
ddim_eta // image randomness (eta=0.0 means the same seed always produces the same image)
@ -407,6 +412,7 @@ class Generate:
width,
height,
fit=fit,
text_mask=text_mask,
)
# TODO: Hacky selection of operation to perform. Needs to be refactored.
@ -622,17 +628,14 @@ class Generate:
width,
height,
fit=False,
text_mask=None,
):
init_image = None
init_mask = None
if not img:
return None, None
image = self._load_img(
img,
width,
height,
)
image = self._load_img(img)
if image.width < self.width and image.height < self.height:
print(f'>> WARNING: img2img and inpainting may produce unexpected results with initial images smaller than {self.width}x{self.height} in both dimensions')
@ -650,10 +653,12 @@ class Generate:
init_image = self._create_init_image(image,width,height,fit=fit) # this returns a torch tensor
if mask:
mask_image = self._load_img(
mask, width, height) # this returns an Image
mask_image = self._load_img(mask) # this returns an Image
init_mask = self._create_init_mask(mask_image,width,height,fit=fit)
elif text_mask:
init_mask = self._txt2mask(image, text_mask, width, height, fit=fit)
return init_image, init_mask
def _make_base(self):
@ -832,7 +837,7 @@ class Generate:
print(msg)
def _load_img(self, img, width, height)->Image:
def _load_img(self, img)->Image:
if isinstance(img, Image.Image):
image = img
print(
@ -894,6 +899,29 @@ class Generate:
mask = ImageOps.invert(mask)
return mask
# TODO: The latter part of this method repeats code from _create_init_mask()
def _txt2mask(self, image:Image, text_mask:list, width, height, fit=True) -> Image:
prompt = text_mask[0]
confidence_level = text_mask[1] if len(text_mask)>1 else 0.5
if self.txt2mask is None:
self.txt2mask = Txt2Mask(device = self.device)
segmented = self.txt2mask.segment(image, prompt)
mask = segmented.to_mask(float(confidence_level))
mask = mask.convert('RGB')
# now we adjust the size
if fit:
mask = self._fit_image(mask, (width, height))
else:
mask = self._squeeze_image(mask)
mask = mask.resize((mask.width//downsampling, mask.height //
downsampling), resample=Image.Resampling.NEAREST)
mask = np.array(mask)
mask = mask.astype(np.float32) / 255.0
mask = mask[None].transpose(0, 3, 1, 2)
mask = torch.from_numpy(mask)
return mask.to(self.device)
def _has_transparency(self, image):
if image.info.get("transparency", None) is not None:
return True

View File

@ -378,6 +378,14 @@ class Args(object):
default='stable-diffusion-1.4',
help='Indicates which diffusion model to load. (currently "stable-diffusion-1.4" (default) or "laion400m")',
)
model_group.add_argument(
'--png_compression','-z',
type=int,
default=6,
choices=range(0,10),
dest='png_compression',
help='level of PNG compression, from 0 (none) to 9 (maximum). Default is 6.'
)
model_group.add_argument(
'--sampler',
'-A',
@ -649,6 +657,14 @@ class Args(object):
dest='save_intermediates',
help='Save every nth intermediate image into an "intermediates" directory within the output directory'
)
render_group.add_argument(
'--png_compression','-z',
type=int,
default=6,
choices=range(0,10),
dest='png_compression',
help='level of PNG compression, from 0 (none) to 9 (maximum). Default is 6.'
)
img2img_group.add_argument(
'-I',
'--init_img',
@ -661,6 +677,14 @@ class Args(object):
type=str,
help='Path to input mask for inpainting mode (supersedes width and height)',
)
img2img_group.add_argument(
'-tm',
'--text_mask',
nargs='+',
type=str,
help='Use the clipseg classifier to generate the mask area for inpainting. Provide a description of the area to mask ("a mug"), optionally followed by the confidence level threshold (0-1.0; defaults to 0.5).',
default=None,
)
img2img_group.add_argument(
'--init_color',
type=str,
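Because `--text_mask` is declared with `nargs='+'`, the prompt and the optional confidence threshold arrive together as a list of strings. The following is a minimal standalone sketch (a throwaway parser, not the project's `Args` class) of how that list is consumed downstream:
~~~
import argparse

# Standalone sketch mirroring the -tm/--text_mask declaration above.
parser = argparse.ArgumentParser()
parser.add_argument('-tm', '--text_mask', nargs='+', type=str, default=None)

opt = parser.parse_args(['-tm', 'bagel', '0.6'])
prompt     = opt.text_mask[0]                                        # 'bagel'
confidence = float(opt.text_mask[1]) if len(opt.text_mask) > 1 else 0.5
~~~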

View File

@ -74,3 +74,4 @@ class Txt2Img(Generator):
if self.perlin > 0.0:
x = (1-self.perlin)*x + self.perlin*self.get_perlin_noise(width // self.downsampling_factor, height // self.downsampling_factor)
return x

View File

@ -33,13 +33,13 @@ class PngWriter:
# saves image named _image_ to outdir/name, writing metadata from prompt
# returns full path of output
def save_image_and_prompt_to_png(self, image, dream_prompt, name, metadata=None):
def save_image_and_prompt_to_png(self, image, dream_prompt, name, metadata=None, compress_level=6):
path = os.path.join(self.outdir, name)
info = PngImagePlugin.PngInfo()
info.add_text('Dream', dream_prompt)
if metadata:
info.add_text('sd-metadata', json.dumps(metadata))
image.save(path, 'PNG', pnginfo=info)
image.save(path, 'PNG', pnginfo=info, compress_level=compress_level)
return path
def retrieve_metadata(self,img_basename):
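The new `compress_level` parameter is passed straight through to Pillow's PNG encoder. Below is a minimal sketch of a caller exercising it; it assumes `PngWriter` is constructed with an output directory (as it is elsewhere in the codebase), and the directory, prompt string, and filename are illustrative.
~~~
from PIL import Image
from ldm.invoke.pngwriter import PngWriter

writer = PngWriter('outputs/img-samples')      # illustrative output directory
image  = Image.new('RGB', (512, 512))          # stand-in for a generated image

# compress_level=9 yields the smallest file at the cost of write speed; 0 disables compression.
path = writer.save_image_and_prompt_to_png(
    image, 'a piece of cake -S 42', '000001.123456789.png', compress_level=9
)
~~~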

View File

@ -53,6 +53,8 @@ COMMANDS = (
'--log_tokenization','-t',
'--hires_fix',
'--inpaint_replace','-r',
'--png_compression','-z',
'--text_mask','-tm',
'!fix','!fetch','!history','!search','!clear',
'!models','!switch','!import_model','!edit_model'
)

ldm/invoke/txt2mask.py (new file, 122 lines)
View File

@ -0,0 +1,122 @@
'''Makes available the Txt2Mask class, which assists in the automatic
assignment of masks via text prompt using clipseg.
Here is typical usage:
from ldm.invoke.txt2mask import Txt2Mask, SegmentedGrayscale
from PIL import Image
txt2mask = Txt2Mask(self.device)
segmented = txt2mask.segment(Image.open('/path/to/img.png'),'a bagel')
# this will return a grayscale Image of the segmented data
grayscale = segmented.to_grayscale()
# this will return a semi-transparent image in which the
# selected object(s) are opaque and the rest is at various
# levels of transparency
transparent = segmented.to_transparent()
# this will return a masked image suitable for use in inpainting:
mask = segmented.to_mask(threshold=0.5)
The threshold used in the call to to_mask() selects pixels for use in
the mask that exceed the indicated confidence threshold. Values range
from 0.0 to 1.0. The higher the threshold, the more confident the
classifier must be before a pixel is included in the mask. In limited
testing, I have found that values around 0.5 work fine.
'''
import torch
import numpy as np
from models.clipseg import CLIPDensePredT
from einops import rearrange, repeat
from PIL import Image
from torchvision import transforms
CLIP_VERSION = 'ViT-B/16'
CLIPSEG_WEIGHTS = 'src/clipseg/weights/rd64-uni.pth'
CLIPSEG_SIZE = 352
class SegmentedGrayscale(object):
def __init__(self, image:Image, heatmap:torch.Tensor):
self.heatmap = heatmap
self.image = image
def to_grayscale(self)->Image:
return self._rescale(Image.fromarray(np.uint8(self.heatmap*255)))
def to_mask(self,threshold:float=0.5)->Image:
discrete_heatmap = self.heatmap.lt(threshold).int()
return self._rescale(Image.fromarray(np.uint8(discrete_heatmap*255),mode='L'))
def to_transparent(self)->Image:
transparent_image = self.image.copy()
transparent_image.putalpha(self.to_grayscale())
return transparent_image
# unscales and uncrops the 352x352 heatmap so that it matches the image again
def _rescale(self, heatmap:Image)->Image:
size = self.image.width if (self.image.width > self.image.height) else self.image.height
resized_image = heatmap.resize(
(size,size),
resample=Image.Resampling.LANCZOS
)
return resized_image.crop((0,0,self.image.width,self.image.height))
class Txt2Mask(object):
'''
Create new Txt2Mask object. The optional device argument can be one of
'cuda', 'mps' or 'cpu'.
'''
def __init__(self,device='cpu'):
print('>> Initializing clipseg model for text to mask inference')
self.device = device
self.model = CLIPDensePredT(version=CLIP_VERSION, reduce_dim=64, )
self.model.eval()
# initially we keep everything in cpu to conserve space
self.model.to('cpu')
self.model.load_state_dict(torch.load(CLIPSEG_WEIGHTS, map_location=torch.device('cpu')), strict=False)
@torch.no_grad()
def segment(self, image:Image, prompt:str) -> SegmentedGrayscale:
'''
Given a prompt string such as "a bagel", tries to identify the object in the
provided image and returns a SegmentedGrayscale object in which the brighter
pixels indicate where the object is inferred to be.
'''
self._to_device(self.device)
prompts = [prompt] # right now we operate on just a single prompt at a time
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
transforms.Resize((CLIPSEG_SIZE, CLIPSEG_SIZE)), # must be multiple of 64...
])
img = self._scale_and_crop(image)
img = transform(img).unsqueeze(0)
preds = self.model(img.repeat(len(prompts),1,1,1), prompts)[0]
heatmap = torch.sigmoid(preds[0][0]).cpu()
self._to_device('cpu')
return SegmentedGrayscale(image, heatmap)
def _to_device(self, device):
self.model.to(device)
def _scale_and_crop(self, image:Image)->Image:
scaled_image = Image.new('RGB',(CLIPSEG_SIZE,CLIPSEG_SIZE))
if image.width > image.height: # width is constraint
scale = CLIPSEG_SIZE / image.width
else:
scale = CLIPSEG_SIZE / image.height
scaled_image.paste(
image.resize(
(int(scale * image.width),
int(scale * image.height)
),
resample=Image.Resampling.LANCZOS
),box=(0,0)
)
return scaled_image

View File

@ -1353,7 +1353,7 @@ class LatentDiffusion(DDPM):
num_downs = self.first_stage_model.encoder.num_resolutions - 1
rescale_latent = 2 ** (num_downs)
# get top left postions of patches as conforming for the bbbox tokenizer, therefore we
# get top left positions of patches as conforming for the bbbox tokenizer, therefore we
# need to rescale the tl patch coordinates to be in between (0,1)
tl_patch_coordinates = [
(

View File

@ -64,7 +64,8 @@ def make_ddim_timesteps(
):
if ddim_discr_method == 'uniform':
c = num_ddpm_timesteps // num_ddim_timesteps
ddim_timesteps = np.asarray(list(range(0, num_ddpm_timesteps, c)))
# ddim_timesteps = np.asarray(list(range(0, num_ddpm_timesteps, c)))
ddim_timesteps = (np.arange(0, num_ddim_timesteps) * c).astype(int)
elif ddim_discr_method == 'quad':
ddim_timesteps = (
(
@ -81,8 +82,8 @@ def make_ddim_timesteps(
# assert ddim_timesteps.shape[0] == num_ddim_timesteps
# add one to get the final alpha values right (the ones from first scale to data during sampling)
# steps_out = ddim_timesteps + 1
steps_out = ddim_timesteps
steps_out = ddim_timesteps + 1
# steps_out = ddim_timesteps
if verbose:
print(f'Selected timesteps for ddim sampler: {steps_out}')
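A small self-contained sketch (assumed values: 1000 DDPM steps, 30 DDIM steps) of how the two uniform schedules differ: the old `range`-based spacing can return more entries than requested when the stride does not divide evenly, whereas the `arange`-based form always yields exactly `num_ddim_timesteps` values; the re-enabled `+ 1` then shifts them onto the alphas used during sampling.
~~~
import numpy as np

num_ddpm_timesteps, num_ddim_timesteps = 1000, 30    # assumed values for illustration
c = num_ddpm_timesteps // num_ddim_timesteps          # stride of 33

old = np.asarray(list(range(0, num_ddpm_timesteps, c)))    # 31 entries: 0, 33, ..., 990
new = (np.arange(0, num_ddim_timesteps) * c).astype(int)   # 30 entries: 0, 33, ..., 957

print(len(old), len(new))    # 31 30
steps_out = new + 1          # 1, 34, ..., 958
~~~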

View File

@ -22,4 +22,5 @@ transformers==4.19.2
-e git+https://github.com/CompVis/taming-transformers.git@master#egg=taming-transformers
-e git+https://github.com/lstein/k-diffusion.git@master#egg=k-diffusion
-e git+https://github.com/TencentARC/GFPGAN.git#egg=gfpgan
-e git+https://github.com/invoke-ai/clipseg.git#egg=clipseg
-e .

View File

@ -35,3 +35,4 @@ realesrgan
git+https://github.com/openai/CLIP.git@main#egg=clip
git+https://github.com/Birch-san/k-diffusion.git@mps#egg=k-diffusion
git+https://github.com/TencentARC/GFPGAN.git#egg=gfpgan
git+https://github.com/invoke-ai/clipseg.git#egg=clipseg

View File

@ -95,7 +95,10 @@ def main():
"\n* Initialization done! Awaiting your command (-h for help, 'q' to quit)"
)
main_loop(gen, opt, infile)
try:
main_loop(gen, opt, infile)
except KeyboardInterrupt:
print("\ngoodbye!")
# TODO: main_loop() has gotten busy. Needs to be refactored.
def main_loop(gen, opt, infile):
@ -270,6 +273,7 @@ def main_loop(gen, opt, infile):
model_hash = gen.model_hash,
),
name = filename,
compress_level = opt.png_compression,
)
# update rfc metadata

View File

@ -10,28 +10,31 @@ import sys
import transformers
import os
import warnings
import torch
import urllib.request
import zipfile
import traceback
transformers.logging.set_verbosity_error()
# this will preload the Bert tokenizer fles
print('preloading bert tokenizer...', end='')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
print('Loading bert tokenizer (ignore deprecation errors)...', end='')
with warnings.catch_warnings():
warnings.filterwarnings('ignore', category=DeprecationWarning)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
print('...success')
sys.stdout.flush()
# this will download requirements for Kornia
print('preloading Kornia requirements...', end='')
print('Loading Kornia requirements...', end='')
with warnings.catch_warnings():
warnings.filterwarnings('ignore', category=DeprecationWarning)
import kornia
print('...success')
version = 'openai/clip-vit-large-patch14'
print('preloading CLIP model...',end='')
sys.stdout.flush()
print('Loading CLIP model...',end='')
tokenizer = CLIPTokenizer.from_pretrained(version)
transformer = CLIPTextModel.from_pretrained(version)
print('...success')
@ -61,7 +64,6 @@ if gfpgan:
FaceRestoreHelper(1, det_model='retinaface_resnet50')
print('...success')
except Exception:
import traceback
print('Error loading ESRGAN:')
print(traceback.format_exc())
@ -89,13 +91,11 @@ if gfpgan:
urllib.request.urlretrieve(model_url,model_dest)
print('...success')
except Exception:
import traceback
print('Error loading GFPGAN:')
print(traceback.format_exc())
print('preloading CodeFormer model file...',end='')
try:
import urllib.request
model_url = 'https://github.com/sczhou/CodeFormer/releases/download/v0.1.0/codeformer.pth'
model_dest = 'ldm/invoke/restoration/codeformer/weights/codeformer.pth'
if not os.path.exists(model_dest):
@ -103,7 +103,27 @@ try:
os.makedirs(os.path.dirname(model_dest), exist_ok=True)
urllib.request.urlretrieve(model_url,model_dest)
except Exception:
import traceback
print('Error loading CodeFormer:')
print(traceback.format_exc())
print('...success')
print('Loading clipseg model for text-based masking...',end='')
try:
model_url = 'https://owncloud.gwdg.de/index.php/s/ioHbRzFx6th32hn/download'
model_dest = 'src/clipseg/clipseg_weights.zip'
if not os.path.exists(model_dest):
os.makedirs(os.path.dirname(model_dest), exist_ok=True)
urllib.request.urlretrieve(model_url,model_dest)
with zipfile.ZipFile(model_dest,'r') as zip:
zip.extractall('src/clipseg')
os.rename('src/clipseg/clipseg_weights','src/clipseg/weights')
from models.clipseg import CLIPDensePredT
model = CLIPDensePredT(version='ViT-B/16', reduce_dim=64, )
model.eval()
model.load_state_dict(torch.load('src/clipseg/weights/rd64-uni-refined.pth'), strict=False)
except Exception:
print('Error installing clipseg model:')
print(traceback.format_exc())
print('...success')