add support for Apple hardware using MPS acceleration

2024-08-30 20:32:17 +00:00 · 2022-08-31 00:33:23 -04:00 · 2022-08-31 00:33:23 -04:00 · bdb0651eb2
commit bdb0651eb2
parent 1714816fe2
16 changed files with 361 additions and 52 deletions
--- a/.gitignore
+++ b/.gitignore
@ -181,3 +181,5 @@ outputs
 logs
 testtube
 checkpoints
 # If it's a Mac
 .DS_Store
--- a/README-Mac-MPS.md
+++ b/README-Mac-MPS.md
@ -0,0 +1,228 @@
 # Apple Silicon Mac Users
 Several people have gotten Stable Diffusion to work on Apple Silicon
 Macs using Anaconda. I've gathered up most of their instructions and
 put them in this fork (and readme). I haven't tested anything besides
 Anaconda, and I've read about issues with things like miniforge, so if
 you have an issue that isn't dealt with in this fork then head on over
 to the [Apple
 Silicon](https://github.com/CompVis/stable-diffusion/issues/25) issue
 on GitHub (that page is so long that GitHub hides most of it by
 default, so you need to find the hidden part and expand it to view the
 whole thing). This fork would not have been possible without the work
 done by the people on that issue.
 You have to have macOS 12.3 Monterey or later. Anything earlier than that won't work.
 BTW, I haven't tested any of this on Intel Macs.
 How to:
 ```
 git clone https://github.com/lstein/stable-diffusion.git
 cd stable-diffusion
 mkdir -p models/ldm/stable-diffusion-v1/
 ln -s /path/to/ckpt/sd-v1-1.ckpt models/ldm/stable-diffusion-v1/model.ckpt
 conda env create -f environment-mac.yaml
 conda activate ldm
 ```
 These instructions are identical to the main repo except I added
 environment-mac.yaml because Mac doesn't have cudatoolkit.
 After you follow all the instructions and run txt2img.py you might get several errors. Here are the errors I've seen and found solutions for.
 ### Doesn't work anymore?
 We are using PyTorch nightly, which includes support for MPS. I don't
 know exactly how Anaconda does updates, but I woke up one morning and
 Stable Diffusion crashed and I couldn't think of anything I did that
 would've changed anything the night before, when it worked. A day and
 a half later I finally got it working again. I don't know what changed
 overnight. PyTorch-nightly changes overnight but I'm pretty sure I
 didn't manually update it. Either way, things are probably going to be
 bumpy on Apple Silicon until PyTorch releases a firm version that we
 can lock to.
 To manually update to the latest version of PyTorch nightly (which could fix issues), run this command.
 	conda install pytorch torchvision torchaudio -c pytorch-nightly
 ### Debugging?
 Tired of waiting for your renders to finish before you can see if it
 works? Reduce the steps! The picture won't look like anything but if it
 finishes, hey, it works! This could also help you figure out if you've
 got a memory problem, because I'm betting 1 step doesn't use much
 memory.
 	python ./scripts/txt2img.py --prompt "ocean" --ddim_steps 1
 ### "No module named cv2" (or some other module)
 Did you remember to `conda activate ldm`? If your terminal prompt
 begins with "(ldm)" then you activated it. If it begins with "(base)"
 or something else you haven't.
 If it says you're missing taming you need to rebuild your virtual
 environment.
 	conda env remove -n ldm
 	conda env create -f environment-mac.yaml
 If you have activated the ldm virtual environment and tried rebuilding
 it, maybe the problem could be that I have something installed that
 you don't and you'll just need to manually install it. Make sure you
 activate the virtual environment so it installs there instead of
 globally.
 	conda activate ldm
 	pip install *name*
 You might also need to install Rust (I mention this again below).
 ### "The operator [name] is not current implemented for the MPS device." (sic)
 Example error.
 ```
 ...
 NotImplementedError: The operator 'aten::index.Tensor' is not current implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on [https://github.com/pytorch/pytorch/issues/77764](https://github.com/pytorch/pytorch/issues/77764). As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
 ```
 Just do what it says:
 	export PYTORCH_ENABLE_MPS_FALLBACK=1
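If you launch from a Python script or notebook rather than a shell, the same variable can be set in-process. This is a minimal sketch, assuming the variable should be in place before `torch` is first imported (I have not verified exactly when PyTorch reads it, so setting it first is the cautious choice). `enable_mps_fallback` is a hypothetical helper name, not part of this fork.

```python
import os


def enable_mps_fallback(env=None):
    """Set PYTORCH_ENABLE_MPS_FALLBACK=1 unless a value is already present."""
    env = os.environ if env is None else env
    # setdefault leaves an explicit user choice (for example "0") untouched.
    env.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")
    return env["PYTORCH_ENABLE_MPS_FALLBACK"]


# Call this at the very top of the script, before `import torch`.
enable_mps_fallback()
```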
 ### "Could not build wheels for tokenizers"
 I have not seen this error because I had Rust installed on my computer before I started playing with Stable Diffusion. The fix is to install Rust.
 	curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
 ### How come `--seed` doesn't work?
 > Completely reproducible results are not guaranteed across PyTorch
 releases, individual commits, or different platforms. Furthermore,
 results may not be reproducible between CPU and GPU executions, even
 when using identical seeds.
 [PyTorch docs](https://pytorch.org/docs/stable/notes/randomness.html)
 There is an [open issue](https://github.com/pytorch/pytorch/issues/78035) (as of August 2022) in PyTorch regarding gradient inconsistency. I am guessing that's what is causing this.
 ### libiomp5.dylib error?
 	OMP: Error #15: Initializing libiomp5.dylib, but found libomp.dylib already initialized.
 There are several things you can do. First, you could use something
 besides Anaconda like miniforge. I read a lot of things online telling
 people to use something else, but I am stuck with Anaconda for other
 reasons.
 Or you can try this.
 	export KMP_DUPLICATE_LIB_OK=True
 Or this (which takes forever on my computer and didn't work anyway).
 	conda install nomkl
 This error happens with Anaconda on Macs, and
 [nomkl](https://stackoverflow.com/questions/66224879/what-is-the-nomkl-python-package-used-for)
 is supposed to fix the issue (it isn't a module but a fix of some
 sort). [There are more
 suggestions](https://stackoverflow.com/questions/53014306/error-15-initializing-libiomp5-dylib-but-found-libiomp5-dylib-already-initial),
 like uninstalling tensorflow and reinstalling. I haven't tried them.
 ### Not enough memory.
 This seems to be a common problem and is probably the underlying
 problem for a lot of symptoms (listed below). The fix is to lower your
 image size or to add `model.half()` right after the model is loaded. I
 should probably test it out. I've read that the reason this fixes
 problems is because it converts the model from 32-bit to 16-bit and
 that leaves more RAM for other things. I have no idea how that would
 affect the quality of the images though.
 See [this issue](https://github.com/CompVis/stable-diffusion/issues/71).
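A rough back-of-the-envelope for why `model.half()` frees memory: a 32-bit float takes 4 bytes and a 16-bit float takes 2, so the storage for the weights halves. The sketch below is only arithmetic. The parameter count is a made-up round number for illustration, not a measured figure for any Stable Diffusion checkpoint, and it ignores activations and other runtime buffers.

```python
BYTES_FP32 = 4  # one 32-bit float
BYTES_FP16 = 2  # one 16-bit float


def weight_bytes(n_params: int, bytes_per_param: int) -> int:
    """Memory needed to hold the weights alone."""
    return n_params * bytes_per_param


n_params = 1_000_000_000  # hypothetical round number, not a real model size
full = weight_bytes(n_params, BYTES_FP32)
half = weight_bytes(n_params, BYTES_FP16)
print(f"fp32: {full / 2**30:.2f} GiB, fp16: {half / 2**30:.2f} GiB")
```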
 ### "Error: product of dimension sizes > 2**31'"
 This error happens with img2img, which I haven't played with too much
 yet. But I know it's because your image is too big or the resolution
 isn't a multiple of 32x32. Because the stable-diffusion model was
 trained on images that were 512 x 512, it's always best to use that
 output size (which is the default). However, if you're using that size
 and you get the above error, try 256 x 256 or 512 x 256 or something
 as the source image.
 BTW, 2**31-1 = [2,147,483,647](https://en.wikipedia.org/wiki/2,147,483,647#In_computing), which is also 32-bit signed [LONG_MAX](https://en.wikipedia.org/wiki/C_data_types) in C.
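To see how quickly that limit is reached, you can multiply out the dimensions yourself. A small sketch follows. `exceeds_int32_limit` is a hypothetical helper and the shapes are arbitrary examples, not the actual tensor shapes img2img produces.

```python
import math

INT32_LIMIT = 2**31  # the threshold named in the error message


def exceeds_int32_limit(shape) -> bool:
    """True if the product of the dimension sizes is greater than 2**31."""
    return math.prod(shape) > INT32_LIMIT


small = exceeds_int32_limit([1, 4, 64, 64])       # 16,384 elements
large = exceeds_int32_limit([8, 4096, 4096, 32])  # 2**32 elements
print(small, large)
```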
 ### I just got Rickrolled! Do I have a virus?
 You don't have a virus. It's part of the project. Here's
 [Rick](https://github.com/lstein/stable-diffusion/blob/main/assets/rick.jpeg)
 and here's [the
 code](https://github.com/lstein/stable-diffusion/blob/69ae4b35e0a0f6ee1af8bb9a5d0016ccb27e36dc/scripts/txt2img.py#L79)
 that swaps him in. It's an NSFW filter, which IMO doesn't work very
 well (and we call this "computer vision", sheesh).
 Actually, this could be happening because there's not enough RAM. You could try the `model.half()` suggestion or specify smaller output images.
 ### My images come out black
 I haven't solved this issue. I just throw away my black
 images. There's a [similar
 issue](https://github.com/CompVis/stable-diffusion/issues/69) on CUDA
 GPUs where the images come out green. Maybe it's the same issue?
 Someone in that issue says to use "--precision full", but this fork
 actually disables that flag. I don't know why; someone else provided
 that code and I don't know what it does. Maybe the `model.half()`
 suggestion above would fix this issue too. I should probably test it.
 ### "view size is not compatible with input tensor's size and stride"
 ```
  File "/opt/anaconda3/envs/ldm/lib/python3.10/site-packages/torch/nn/functional.py", line 2511, in layer_norm
    return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
 RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
 ```
 Update to the latest version of lstein/stable-diffusion. We were
 patching pytorch but we found a file in stable-diffusion that we could
 change instead. This is a 32-bit vs 16-bit problem.
 ### The processor must support the Intel bla bla bla
 What? Intel? On Apple Silicon?
 	Intel MKL FATAL ERROR: This system does not meet the minimum requirements for use of the Intel(R) Math Kernel Library.
 	The processor must support the Intel(R) Supplemental Streaming SIMD Extensions 3 (Intel(R) SSSE3) instructions.
 	The processor must support the Intel(R) Streaming SIMD Extensions 4.2 (Intel(R) SSE4.2) instructions.
 	The processor must support the Intel(R) Advanced Vector Extensions (Intel(R) AVX) instructions.
 This fixed it for me:
 	conda clean --yes --all
 ### Still slow?
 I changed the defaults of n_samples and n_iter to 1 so that it uses
 less RAM and makes fewer images so it will be faster the first time you
 use it. I don't actually know what n_samples does internally, but I
 know it consumes a lot more RAM. The n_iter flag just loops around the
 image creation code, so it shouldn't consume more RAM (it should be
 faster if you're going to do multiple images because the libraries and
 model will already be loaded--use a prompt file to get this speed
 boost).
 These flags are the default sample and iter settings in this fork/branch:
 ~~~~
 python scripts/txt2img.py --prompt "ocean" --n_samples=1 --n_iter=1
 ~~~~
--- a/README.md
+++ b/README.md
@ -387,7 +387,7 @@ Credit goes to @rinongal and the repository located at
 https://github.com/rinongal/textual_inversion Please see the
 repository and associated paper for details and limitations.
-# Latest
+# Latest Changes
 - v1.13 (in process)
@ -403,9 +403,9 @@ For older changelogs, please visit **[CHANGELOGS](CHANGELOG.md)**.
 # Installation
-There are separate installation walkthroughs for [Linux/Mac](#linuxmac) and [Windows](#windows).
+There are separate installation walkthroughs for [Linux](#linux), [Windows](#windows) and [Macintosh](#macintosh).
-## Linux/Mac
+## Linux
 1. You will need to install the following prerequisites if they are not already available. Use your
   operating system's preferred installer
@ -580,7 +580,15 @@ python scripts\dream.py -l
 python scripts\dream.py
 ```
-10. Subsequently, to relaunch the script, first activate the Anaconda command window (step 3), enter the stable-diffusion directory (step 5, "cd \path\to\stable-diffusion"), run "conda activate ldm" (step 6b), and then launch the dream script (step 9).
+10. Subsequently, to relaunch the script, first activate the Anaconda
 command window (step 3), enter the stable-diffusion directory (step 5,
 "cd \path\to\stable-diffusion"), run "conda activate ldm" (step 6b),
 and then launch the dream script (step 9).
 **Note:** Tildebyte has written an alternative ["Easy peasy Windows
 install"](https://github.com/lstein/stable-diffusion/wiki/Easy-peasy-Windows-install)
 which uses the Windows Powershell and pew. If you are having trouble
 with Anaconda on Windows, give this a try (or try it first!)
 ### Updating to newer versions of the script
@ -595,11 +603,16 @@ git pull
 This will bring your local copy into sync with the remote one.
-## Simplified API for text to image generation
+## Macintosh
 See [README-Mac-MPS](README-Mac-MPS.md) for instructions.
 # Simplified API for text to image generation
 For programmers who wish to incorporate stable-diffusion into other
-products, this repository includes a simplified API for text to image generation, which
+products, this repository includes a simplified API for text to image
-lets you create images from a prompt in just three lines of code:
+generation, which lets you create images from a prompt in just three
 lines of code:
 ```
 from ldm.simplet2i import T2I
@ -608,9 +621,10 @@ outputs = model.txt2img("a unicorn in manhattan")
 ```
 Outputs is a list of lists in the format [[filename1,seed1],[filename2,seed2]...]
-Please see ldm/simplet2i.py for more information.
+Please see ldm/simplet2i.py for more information. A set of example scripts is
 coming RSN.
-## Workaround for machines with limited internet connectivity
+# Workaround for machines with limited internet connectivity
 My development machine is a GPU node in a high-performance compute
 cluster which has no connection to the internet. During model
--- a/environment-mac.yaml
+++ b/environment-mac.yaml
@ -0,0 +1,32 @@
 name: ldm
 channels:
  - apple
  - conda-forge
  - pytorch-nightly
  - defaults
 dependencies:
  - python=3.10.4
  - pip=22.1.2
  - pytorch
  - torchvision
  - numpy=1.23.1
  - pip:
    - albumentations==0.4.6
    - opencv-python==4.6.0.66
    - pudb==2019.2
    - imageio==2.9.0
    - imageio-ffmpeg==0.4.2
    - pytorch-lightning==1.4.2
    - omegaconf==2.1.1
    - test-tube>=0.7.5
    - streamlit==1.12.0
    - pillow==9.2.0
    - einops==0.3.0
    - torch-fidelity==0.3.0
    - transformers==4.19.2
    - torchmetrics==0.6.0
    - kornia==0.6.0
    - -e git+https://github.com/openai/CLIP.git@main#egg=clip
    - -e git+https://github.com/CompVis/taming-transformers.git@master#egg=taming-transformers
    - -e git+https://github.com/lstein/k-diffusion.git@master#egg=k-diffusion
    - -e .
--- a/ldm/dream/devices.py
+++ b/ldm/dream/devices.py
@ -0,0 +1,11 @@
 import torch
 def choose_torch_device() -> str:
    '''Convenience routine for guessing which GPU device to run model on'''
    if torch.cuda.is_available():
        return 'cuda'
    if torch.backends.mps.is_available():
        return 'mps'
    return 'cpu'
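One thing to watch: `torch.backends.mps` only exists in PyTorch builds recent enough to ship MPS support, so on an older install the check above may raise `AttributeError` instead of falling through to CPU. A defensive variant could look like the sketch below. It takes the `torch` module as a parameter purely so it can be exercised without a GPU. `choose_device_guarded` is a hypothetical name, not something this commit adds.

```python
from types import SimpleNamespace


def choose_device_guarded(torch_module) -> str:
    """Same priority as choose_torch_device (cuda, then mps, then cpu),
    but tolerant of builds that have no torch.backends.mps."""
    if torch_module.cuda.is_available():
        return 'cuda'
    mps = getattr(torch_module.backends, 'mps', None)
    if mps is not None and mps.is_available():
        return 'mps'
    return 'cpu'


# Stand-in for an old CPU-only build: no `mps` attribute at all.
old_torch = SimpleNamespace(
    cuda=SimpleNamespace(is_available=lambda: False),
    backends=SimpleNamespace(),
)
print(choose_device_guarded(old_torch))
```

In real code you would call it as `choose_device_guarded(torch)` after `import torch`.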
--- a/ldm/models/diffusion/ddim.py
+++ b/ldm/models/diffusion/ddim.py
@ -4,6 +4,7 @@ import torch
 import numpy as np
 from tqdm import tqdm
 from functools import partial
 from ldm.dream.devices import choose_torch_device
 from ldm.modules.diffusionmodules.util import (
    make_ddim_sampling_parameters,
@ -14,17 +15,17 @@ from ldm.modules.diffusionmodules.util import (
 class DDIMSampler(object):
-    def __init__(self, model, schedule='linear', device='cuda', **kwargs):
+    def __init__(self, model, schedule='linear', device=None, **kwargs):
        super().__init__()
        self.model = model
        self.ddpm_num_timesteps = model.num_timesteps
        self.schedule = schedule
-        self.device = device
+        self.device   = device or choose_torch_device()
    def register_buffer(self, name, attr):
        if type(attr) == torch.Tensor:
            if attr.device != torch.device(self.device):
-                attr = attr.to(torch.device(self.device))
+                attr = attr.to(dtype=torch.float32, device=self.device)
        setattr(self, name, attr)
    def make_schedule(
--- a/ldm/models/diffusion/ksampler.py
+++ b/ldm/models/diffusion/ksampler.py
@ -2,7 +2,7 @@
 import k_diffusion as K
 import torch
 import torch.nn as nn
-
+from ldm.dream.devices import choose_torch_device
 class CFGDenoiser(nn.Module):
    def __init__(self, model):
@ -18,11 +18,11 @@ class CFGDenoiser(nn.Module):
 class KSampler(object):
-    def __init__(self, model, schedule='lms', device='cuda', **kwargs):
+    def __init__(self, model, schedule='lms', device=None, **kwargs):
        super().__init__()
        self.model = K.external.CompVisDenoiser(model)
        self.schedule = schedule
-        self.device = device
+        self.device   = device or choose_torch_device()
        def forward(self, x, sigma, uncond, cond, cond_scale):
            x_in = torch.cat([x] * 2)
--- a/ldm/models/diffusion/plms.py
+++ b/ldm/models/diffusion/plms.py
@ -4,6 +4,7 @@ import torch
 import numpy as np
 from tqdm import tqdm
 from functools import partial
 from ldm.dream.devices import choose_torch_device
 from ldm.modules.diffusionmodules.util import (
    make_ddim_sampling_parameters,
@ -13,18 +14,17 @@ from ldm.modules.diffusionmodules.util import (
 class PLMSSampler(object):
-    def __init__(self, model, schedule='linear', device='cuda', **kwargs):
+    def __init__(self, model, schedule='linear', device=None, **kwargs):
        super().__init__()
        self.model = model
        self.ddpm_num_timesteps = model.num_timesteps
        self.schedule = schedule
-        self.device = device
+        self.device   = device if device else choose_torch_device()
    def register_buffer(self, name, attr):
        if type(attr) == torch.Tensor:
            if attr.device != torch.device(self.device):
-                attr = attr.to(torch.device(self.device))
+                attr = attr.to(torch.float32).to(torch.device(self.device))
        setattr(self, name, attr)
    def make_schedule(
--- a/ldm/modules/attention.py
+++ b/ldm/modules/attention.py
@ -234,6 +234,7 @@ class BasicTransformerBlock(nn.Module):
        )
    def _forward(self, x, context=None):
        x = x.contiguous() if x.device.type == 'mps' else x
        x = self.attn1(self.norm1(x)) + x
        x = self.attn2(self.norm2(x), context=context) + x
        x = self.ff(self.norm3(x)) + x
--- a/ldm/modules/encoders/modules.py
+++ b/ldm/modules/encoders/modules.py
@ -5,6 +5,7 @@ import clip
 from einops import rearrange, repeat
 from transformers import CLIPTokenizer, CLIPTextModel
 import kornia
 from ldm.dream.devices import choose_torch_device
 from ldm.modules.x_transformer import (
    Encoder,
@ -67,7 +68,12 @@ class TransformerEmbedder(AbstractEncoder):
    """Some transformer encoder layers"""
    def __init__(
-        self, n_embed, n_layer, vocab_size, max_seq_len=77, device='cuda'
+        self,
        n_embed,
        n_layer,
        vocab_size,
        max_seq_len=77,
        device=choose_torch_device(),
    ):
        super().__init__()
        self.device = device
@ -89,7 +95,9 @@ class TransformerEmbedder(AbstractEncoder):
 class BERTTokenizer(AbstractEncoder):
    """Uses a pretrained BERT tokenizer by huggingface. Vocab size: 30522 (?)"""
-    def __init__(self, device='cuda', vq_interface=True, max_length=77):
+    def __init__(
        self, device=choose_torch_device(), vq_interface=True, max_length=77
    ):
        super().__init__()
        from transformers import (
            BertTokenizerFast,
@ -145,7 +153,7 @@ class BERTEmbedder(AbstractEncoder):
        n_layer,
        vocab_size=30522,
        max_seq_len=77,
-        device='cuda',
+        device=choose_torch_device(),
        use_tokenizer=True,
        embedding_dropout=0.0,
    ):
@ -230,7 +238,7 @@ class FrozenCLIPEmbedder(AbstractEncoder):
    def __init__(
        self,
        version='openai/clip-vit-large-patch14',
-        device='cuda',
+        device=choose_torch_device(),
        max_length=77,
    ):
        super().__init__()
@ -455,13 +463,13 @@ class FrozenCLIPTextEmbedder(nn.Module):
    def __init__(
        self,
        version='ViT-L/14',
-        device='cuda',
+        device=choose_torch_device(),
        max_length=77,
        n_repeat=1,
        normalize=True,
    ):
        super().__init__()
-        self.model, _ = clip.load(version, jit=False, device='cpu')
+        self.model, _ = clip.load(version, jit=False, device=device)
        self.device = device
        self.max_length = max_length
        self.n_repeat = n_repeat
@ -496,7 +504,7 @@ class FrozenClipImageEmbedder(nn.Module):
        self,
        model,
        jit=False,
-        device='cuda' if torch.cuda.is_available() else 'cpu',
+        device=choose_torch_device(),
        antialias=False,
    ):
        super().__init__()
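Note that a call placed in a default argument, as in `device=choose_torch_device()`, is evaluated once when the `def` statement runs (at import time), not on each call. The samplers in this commit use `device=None` plus `device or choose_torch_device()` instead, which defers the lookup until the object is built. The difference can be shown with a stand-in probe. `probe`, `eager` and `lazy` are hypothetical names for illustration only.

```python
calls = []


def probe() -> str:
    """Stand-in for choose_torch_device(); records every invocation."""
    calls.append(1)
    return 'cpu'


def eager(device=probe()):   # probe() runs here, once, at definition time
    return device


def lazy(device=None):       # probe() runs on each call that omits device
    return device or probe()


after_def = len(calls)       # only eager's default has been evaluated
eager()
eager()
after_eager = len(calls)     # unchanged: the default was cached
lazy()
lazy()
after_lazy = len(calls)      # two more probe() calls
```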
--- a/ldm/simplet2i.py
+++ b/ldm/simplet2i.py
@ -28,6 +28,7 @@ from ldm.models.diffusion.plms import PLMSSampler
 from ldm.models.diffusion.ksampler import KSampler
 from ldm.dream.pngwriter import PngWriter
 from ldm.dream.image_util import InitImageResizer
 from ldm.dream.devices import choose_torch_device
 """Simplified text to image API for stable diffusion/latent diffusion
@ -523,19 +524,15 @@ class T2I:
        return self.seed
    def _get_device(self):
-        if torch.cuda.is_available():
+        device_type = choose_torch_device()
-            return torch.device('cuda')
+        return torch.device(device_type)
-        elif torch.backends.mps.is_available():
-            return torch.device('mps')
-        else:
-            return torch.device('cpu')
    def load_model(self):
        """Load and initialize the model from configuration variables passed at object creation time"""
        if self.model is None:
            seed_everything(self.seed)
            try:
-                config = OmegaConf.load(self.config)
+                config      = OmegaConf.load(self.config)
                self.device = self._get_device()
                model = self._load_model_from_config(config, self.weights)
                if self.embedding_path is not None:
--- a/notebook_helpers.py
+++ b/notebook_helpers.py
@ -14,7 +14,7 @@ from ldm.models.diffusion.ddim import DDIMSampler
 from ldm.util import ismap
 import time
 from omegaconf import OmegaConf
-
+from ldm.dream.devices import choose_torch_device
 def download_models(mode):
@ -117,7 +117,8 @@ def get_cond(mode, selected_path):
        c = rearrange(c, '1 c h w -> 1 h w c')
        c = 2. * c - 1.
-        c = c.to(torch.device("cuda"))
+        device = choose_torch_device()
        c = c.to(device)
        example["LR_image"] = c
        example["image"] = c_up
--- a/scripts/dream.py
+++ b/scripts/dream.py
@ -8,11 +8,11 @@ import re
 import sys
 import copy
 import warnings
 import time
 import ldm.dream.readline
 from ldm.dream.pngwriter import PngWriter, PromptFormatter
 from ldm.dream.server import DreamServer, ThreadingDreamServer
 def main():
    """Initialize command-line parsers and the diffusion model"""
    arg_parser = create_argv_parser()
@ -81,7 +81,11 @@ def main():
            sys.exit(-1)
    # preload the model
    tic = time.time()
    t2i.load_model()
    print(
        f'model loaded in {time.time() - tic:4.2f}s'
    )
    if not infile:
        print(
--- a/scripts/inpaint.py
+++ b/scripts/inpaint.py
@ -6,7 +6,7 @@ import numpy as np
 import torch
 from main import instantiate_from_config
 from ldm.models.diffusion.ddim import DDIMSampler
-
+from ldm.dream.devices import choose_torch_device
 def make_batch(image, mask, device):
    image = np.array(Image.open(image).convert("RGB"))
@ -61,8 +61,8 @@ if __name__ == "__main__":
    model.load_state_dict(torch.load("models/ldm/inpainting_big/last.ckpt")["state_dict"],
                          strict=False)
-    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
+    device  = choose_torch_device()
-    model = model.to(device)
+    model   = model.to(device)
    sampler = DDIMSampler(model)
    os.makedirs(opt.outdir, exist_ok=True)
--- a/scripts/orig_scripts/img2img.py
+++ b/scripts/orig_scripts/img2img.py
@ -18,6 +18,7 @@ from pytorch_lightning import seed_everything
 from ldm.util import instantiate_from_config
 from ldm.models.diffusion.ddim import DDIMSampler
 from ldm.models.diffusion.plms import PLMSSampler
 from ldm.dream.devices         import choose_torch_device
 def chunk(it, size):
@ -40,7 +41,7 @@ def load_model_from_config(config, ckpt, verbose=False):
        print("unexpected keys:")
        print(u)
-    model.cuda()
+    model.to(choose_torch_device())
    model.eval()
    return model
@ -199,7 +200,7 @@ def main():
    config = OmegaConf.load(f"{opt.config}")
    model = load_model_from_config(config, f"{opt.ckpt}")
-    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
+    device = torch.device(choose_torch_device())
    model = model.to(device)
    if opt.plms:
@ -241,8 +242,10 @@ def main():
    print(f"target t_enc is {t_enc} steps")
    precision_scope = autocast if opt.precision == "autocast" else nullcontext
    if device.type in ['mps', 'cpu']:
        precision_scope = nullcontext # have to use f32 on mps
    with torch.no_grad():
-        with precision_scope("cuda"):
+        with precision_scope(device.type):
            with model.ema_scope():
                tic = time.time()
                all_samples = list()
--- a/scripts/orig_scripts/txt2img.py
+++ b/scripts/orig_scripts/txt2img.py
@ -15,10 +15,10 @@ from contextlib import contextmanager, nullcontext
 import k_diffusion as K
 import torch.nn as nn
-from ldm.util import instantiate_from_config
+from ldm.util                  import instantiate_from_config
 from ldm.models.diffusion.ddim import DDIMSampler
 from ldm.models.diffusion.plms import PLMSSampler
-
+from ldm.dream.devices         import choose_torch_device
 def chunk(it, size):
    it = iter(it)
@ -40,7 +40,7 @@ def load_model_from_config(config, ckpt, verbose=False):
        print("unexpected keys:")
        print(u)
-    model.cuda()
+    model.to(choose_torch_device())
    model.eval()
    return model
@ -190,13 +190,14 @@ def main():
        opt.ckpt = "models/ldm/text2img-large/model.ckpt"
        opt.outdir = "outputs/txt2img-samples-laion400m"
    seed_everything(opt.seed)
    config = OmegaConf.load(f"{opt.config}")
    model = load_model_from_config(config, f"{opt.ckpt}")
-    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
+    seed_everything(opt.seed)
-    model = model.to(device)
+
    device = torch.device(choose_torch_device())
    model  = model.to(device)
    #for klms
    model_wrap = K.external.CompVisDenoiser(model)
@ -240,11 +241,17 @@ def main():
    start_code = None
    if opt.fixed_code:
-        start_code = torch.randn([opt.n_samples, opt.C, opt.H // opt.f, opt.W // opt.f], device=device)
+        shape = [opt.n_samples, opt.C, opt.H // opt.f, opt.W // opt.f]
        if device.type == 'mps':
            start_code = torch.randn(shape, device='cpu').to(device)
        else:
            start_code = torch.randn(shape, device=device)
    precision_scope = autocast if opt.precision=="autocast" else nullcontext
    if device.type in ['mps', 'cpu']:
        precision_scope = nullcontext # have to use f32 on mps
    with torch.no_grad():
-        with precision_scope("cuda"):
+        with precision_scope(device.type):
            with model.ema_scope():
                tic = time.time()
                all_samples = list()