A Human-in-the-loop? workflow for creating HD images from text
DALL·E Flow is an interactive workflow for generating high-definition images from text prompt. First, it leverages DALL·E-Mega to generate image candidates, and then calls CLIP-as-service to rank the candidates w.r.t. the prompt. The preferred candidate is fed to GLID-3 XL for diffusion, which often enriches the texture and background. Finally, the candidate is upscaled to 1024x1024 via SwinIR.
DALL·E Flow is built with Jina in a client-server architecture, which gives it high scalability, non-blocking streaming, and a modern Pythonic interface. Client can interact with the server via gRPC/Websocket/HTTP with TLS.
Why Human-in-the-loop? Generative art is a creative process. While recent advances of DALL·E unleash people's creativity, having a single-prompt-single-output UX/UI locks the imagination to a single possibility, which is bad no matter how fine this single result is. DALL·E Flow is an alternative to the one-liner, by formalizing the generative art as an iterative procedure.
DALL·E Flow is in client-server architecture.
⚠️ 2022/8/8 Started using CLIP-as-service as an external executor. Now you can easily deploy your own CLIP executor if you want. There is a small breaking change as a result of this improvement, so please reopen the notebook in Google Colab.⚠️ 2022/7/6 Demo server migration to AWS EKS for better availability and robustness, server URL is now changing togrpcs://dalle-flow.dev.jina.ai
. All connections are now with TLS encryption, please reopen the notebook in Google Colab.⚠️ 2022/6/25 Unexpected downtime between 6/25 0:00 - 12:00 CET due to out of GPU quotas. The new server now has 2 GPUs, add healthcheck in client notebook.- 2022/6/3 Reduce default number of images to 2 per pathway, 4 for diffusion.
- 🐳 2022/6/21 A prebuilt image is now available on Docker Hub! This image can be run out-of-the-box on CUDA 11.6. Fix an upstream bug in CLIP-as-service.
⚠️ 2022/5/23 Fix an upstream bug in CLIP-as-service. This bug makes the 2nd diffusion step irrelevant to the given texts. New Dockerfile proved to be reproducible on a AWS EC2p2.x8large
instance.- 2022/5/13b Removing TLS as Cloudflare gives 100s timeout, making DALLE Flow in usable Please reopen the notebook in Google Colab!.
- 🔐 2022/5/13 New Mega checkpoint! All connections are now with TLS, Please reopen the notebook in Google Colab!.
- 🐳 2022/5/10 A Dockerfile is added! Now you can easily deploy your own DALL·E Flow. New Mega checkpoint! Smaller memory-footprint, the whole Flow can now fit into one GPU with 21GB memory.
- 🌟 2022/5/7 New Mega checkpoint & multiple optimization on GLID3: less memory-footprint, use
ViT-L/14@336px
from CLIP-as-service,steps 100->200
. - 🌟 2022/5/6 DALL·E Flow just got updated! Please reopen the notebook in Google Colab!
- Revised the first step: 16 candidates are generated, 8 from DALL·E Mega, 8 from GLID3-XL; then ranked by CLIP-as-service.
- Improved the flow efficiency: the overall speed, including diffusion and upscaling are much faster now!
Using client is super easy. The following steps are best run in Jupyter notebook or Google Colab.
You will need to install DocArray and Jina first:
pip install "docarray[common]>=0.13.5" jina
We have provided a demo server for you to play:
⚠️ Due to the massive requests, our server may be delay in response. Yet we are very confident on keeping the uptime high. You can also deploy your own server by following the instruction here.
server_url = 'grpc://dalle-flow.jina.ai:51005'
Now let's define the prompt:
prompt = 'an oil painting of a humanoid robot playing chess in the style of Matisse'
Let's submit it to the server and visualize the results:
from docarray import Document
doc = Document(text=prompt).post(server_url, parameters={'num_images': 8})
da = doc.matches
da.plot_image_sprites(fig_size=(10,10), show_index=True)
Here we generate 16 candidates, 8 from DALLE-mega and 8 from GLID3 XL, this is as defined in num_images
, which takes about ~2 minutes. You can use a smaller value if it is too long for you.
The 16 candidates are sorted by CLIP-as-service, with index-0
as the best candidate judged by CLIP. Of course, you may think differently. Notice the number in the top-left corner? Select the one you like the most and get a better view:
fav_id = 3
fav = da[fav_id]
fav.embedding = doc.embedding
fav.display()
Now let's submit the selected candidates to the server for diffusion.
diffused = fav.post(f'{server_url}', parameters={'skip_rate': 0.5, 'num_images': 36}, target_executor='diffusion').matches
diffused.plot_image_sprites(fig_size=(10,10), show_index=True)
This will give 36 images based on the selected image. You may allow the model to improvise more by giving skip_rate
a near-zero value, or a near-one value to force its closeness to the given image. The whole procedure takes about ~2 minutes.
Select the image you like the most, and give it a closer look:
dfav_id = 34
fav = diffused[dfav_id]
fav.display()
Finally, submit to the server for the last step: upscaling to 1024 x 1024px.
fav = fav.post(f'{server_url}/upscale')
fav.display()
That's it! It is the one. If not satisfied, please repeat the procedure.
Btw, DocArray is a powerful and easy-to-use data structure for unstructured data. It is super productive for data scientists who work in cross-/multi-modal domain. To learn more about DocArray, please check out the docs.
You can host your own server by following the instruction below.
DALL·E Flow needs one GPU with 21GB VRAM at its peak. All services are squeezed into this one GPU, this includes (roughly)
- DALLE ~9GB
- GLID Diffusion ~6GB
- SwinIR ~3GB
- CLIP ViT-L/14-336px ~3GB
The following reasonable tricks can be used for further reducing VRAM:
- SwinIR can be moved to CPU (-3GB)
- CLIP can be delegated to CLIP-as-service demo server (-3GB)
It requires at least 40GB free space on the hard drive, mostly for downloading pretrained models.
High-speed internet is required. Slow/unstable internet may throw frustrating timeout when downloading models.
CPU-only environment is not tested and likely won't work. Google Colab is likely throwing OOM hence also won't work.
If you have installed Jina, the above flowchart can be generated via:
# pip install jina
jina export flowchart flow.yml flow.svg
We have provided a prebuilt Docker image that can be pull directly.
docker pull jinaai/dalle-flow:latest
We have provided a Dockerfile which allows you to run a server out of the box.
Our Dockerfile is using CUDA 11.6 as the base image, you may want to adjust it according to your system.
git clone https://github.com/jina-ai/dalle-flow.git
cd dalle-flow
docker build --build-arg GROUP_ID=$(id -g ${USER}) --build-arg USER_ID=$(id -u ${USER}) -t jinaai/dalle-flow .
The building will take 10 minutes with average internet speed, which results in a 18GB Docker image.
To run it, simply do:
docker run -p 51005:51005 -v $HOME/.cache:/home/dalle/.cache --gpus all jinaai/dalle-flow
- The first run will take ~10 minutes with average internet speed.
-v $HOME/.cache:/root/.cache
avoids repeated model downloading on every docker run.- The first part of
-p 51005:51005
is your host public port. Make sure people can access this port if you are serving publicly. The second par of it is the port defined in flow.yml.
You should see the screen like following once running:
Note that unlike running natively, running inside Docker may give less vivid progressbar, color logs, and prints. This is due to the limitations of the terminal in a Docker container. It does not affect the actual usage.
Running natively requires some manual steps, but it is often easier to debug.
mkdir dalle && cd dalle
git clone https://github.com/jina-ai/dalle-flow.git
git clone https://github.com/jina-ai/SwinIR.git
git clone https://github.com/CompVis/latent-diffusion.git
git clone https://github.com/jina-ai/glid-3-xl.git
You should have the following folder structure:
dalle/
|
|-- dalle-flow/
|-- SwinIR/
|-- glid-3-xl/
|-- latent-diffusion/
cd latent-diffusion && pip install -e . && cd -
cd SwinIR && pip install -e . && cd -
cd glid-3-xl && pip install -e . && cd -
There are couple models we need to download for GLID-3-XL:
cd glid-3-xl
wget https://dall-3.com/models/glid-3-xl/bert.pt
wget https://dall-3.com/models/glid-3-xl/kl-f8.pt
wget https://dall-3.com/models/glid-3-xl/finetune.pt
cd -
cd dalle-flow
pip install -r requirements.txt
pip install jax==0.3.13
Now you are under dalle-flow/
, run the following command:
jina flow --uses flow.yml
You should see this screen immediately:
On the first start it will take ~8 minutes for downloading the DALL·E mega model and other necessary models. The proceeding runs should only take ~1 minute to reach the success message.
When everything is ready, you will see:
Congrats! Now you should be able to run the client.
You can modify and extend the server flow as you like, e.g. changing the model, adding persistence, or even auto-posting to Instagram/OpenSea. With Jina and DocArray, you can easily make DALL·E Flow cloud-native and ready for production.
By default CLIPTorchEncoder
runs as an external executor.
If you want to run your own CLIP, you can do that by removing external executor related configs (host, port, tls and external
) from flow.yml
.
- To extend DALL·E Flow you will need to get familiar with Jina and DocArray.
- Join our Slack community and chat with other community members about ideas.
- Join our Engineering All Hands meet-up to discuss your use case and learn Jina's new features.
- When? The second Tuesday of every month
- Where? Zoom (see our public events calendar/.ical) and live stream on YouTube
- Subscribe to the latest video tutorials on our YouTube channel
DALL·E Flow is backed by Jina AI and licensed under Apache-2.0. We are actively hiring AI engineers, solution engineers to build the next neural search ecosystem in open-source.