This paper presents UltraEdit, a large-scale (~4M editing samples), automatically generated dataset for instruction-based image editing. Our key idea is to address the drawbacks of existing image editing datasets such as InstructPix2Pix and MagicBrush, and to provide a systematic approach to producing massive, high-quality image editing samples. UltraEdit offers several distinct advantages: 1) it features a broader range of editing instructions by leveraging the creativity of large language models (LLMs) alongside in-context editing examples from human raters; 2) its data sources are real images, including photographs and artworks, which provide greater diversity and less bias than datasets generated solely by text-to-image models; 3) it supports region-based editing, enhanced by high-quality, automatically produced region annotations. Our experiments show that canonical diffusion-based editing baselines trained on UltraEdit set new records on the MagicBrush and Emu-Edit benchmarks. Our analysis further confirms the crucial role of real image anchors and region-based editing data.
Construction of UltraEdit:
(Upper)
We use an LLM with in-context examples to produce editing instructions and target captions from the collected image captions.
(Middle)
For free-form editing, we use the collected images as anchors and invoke a regular diffusion pipeline followed by prompt-to-prompt (P2P) control to produce source and target images.
(Bottom)
For region-based editing, we first produce an editing region from the instruction, then invoke a modified inpainting diffusion pipeline to produce the images; a minimal sketch of this stage is given below.
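As a concrete illustration of the region-based branch, the snippet below is a minimal sketch of producing a (source, target) pair with an off-the-shelf diffusers inpainting pipeline. The checkpoint ID, file names, and caption are placeholders, and the actual UltraEdit pipeline uses a modified inpainting process, so this only approximates the idea.

```python
# Minimal sketch of the region-based stage, assuming a standard diffusers
# inpainting pipeline; the actual UltraEdit pipeline is modified and may differ.
import torch
from PIL import Image
from diffusers import AutoPipelineForInpainting

# Hypothetical inputs: a real source image, a binary editing-region mask,
# and the LLM-produced target caption for that region.
source_image = Image.open("source.jpg").convert("RGB").resize((512, 512))
region_mask = Image.open("region_mask.png").convert("L").resize((512, 512))
target_caption = "a red vintage car parked on the street"  # from the LLM stage

pipe = AutoPipelineForInpainting.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",  # placeholder checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# Only pixels inside the mask are regenerated, so the target image stays
# anchored to the real source image outside the editing region.
target_image = pipe(
    prompt=target_caption,
    image=source_image,
    mask_image=region_mask,
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]

target_image.save("target.jpg")
```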
Comparison of different image editing datasets. Both EditBench and MagicBrush are manually annotated but are limited in size.
InstructPix2Pix and HQ-Edit are large datasets automatically generated with T2I models such as Stable Diffusion and DALL-E, but they inherit notable biases from these generative models, which lead to failure cases.
UltraEdit offers large-scale samples with rich editing tasks and fewer biases.
Examples of UltraEdit: Free-form and Region-based Image Editing
Statistics of Free-form and Region-based Image Editing Data. The following table shows the number of instances, the number of unique instructions, and their respective proportions for each instruction type.
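The statistics above can be reproduced approximately by streaming the dataset and tallying instruction types. The snippet below is a minimal sketch; the Hugging Face dataset ID and the column name `edit_type` are assumptions and may differ from the released version.

```python
# Minimal sketch of loading UltraEdit and tallying instruction types.
# The dataset ID and column names below are assumptions and may differ
# from the released version on the Hugging Face Hub.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("BleachNick/UltraEdit", split="train", streaming=True)  # assumed ID

# Count instruction types over a small sample to reproduce table-style stats.
counts = Counter()
for i, sample in enumerate(ds):
    counts[sample.get("edit_type", "unknown")] += 1  # hypothetical field name
    if i >= 10_000:
        break

total = sum(counts.values())
for edit_type, n in counts.most_common():
    print(f"{edit_type}: {n} ({n / total:.1%})")
```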
Editing examples generated by Stable Diffusion 3 fine-tuned on the UltraEdit dataset. The model supports both free-form (without a mask) and region-based (with a mask) image editing. You can try it in the demo above.
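For reference, instruction-based editing inference typically looks like the sketch below. The released SD3-based checkpoints may require their own pipeline class; for a self-contained illustration we use the standard InstructPix2Pix pipeline in diffusers with a placeholder checkpoint, so the model ID, file names, and hyperparameters are assumptions.

```python
# Illustrative sketch of instruction-based editing inference with diffusers.
# The checkpoint name is a placeholder: the released SD3-based UltraEdit
# models may need a dedicated pipeline class rather than this one.
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix",  # placeholder; swap in an UltraEdit-trained checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("input.jpg").convert("RGB").resize((512, 512))

edited = pipe(
    prompt="Add a UFO in the sky",
    image=image,
    num_inference_steps=30,
    image_guidance_scale=1.5,  # how strongly to stay close to the input image
    guidance_scale=7.5,        # how strongly to follow the instruction
).images[0]

edited.save("edited.jpg")
```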
| Input Image | Edit Instruction | Edited Images |
|---|---|---|
| (input image) | Add a UFO in the sky | (edited images) |
| (input image) | Add a moon in the sky | (edited images) |
| (input image) | add cherry blossoms | (edited images) |
| (input image) | Please dress her in a short purple wedding dress adorned with white floral embroidery | (edited images) |
| (input image) | give her a chief's headdress. | (edited images) |
We evaluate the performance of canonical diffusion-based editing models trained on our dataset across different instruction-based image editing benchmarks.
We trained the same diffusion model on an equal amount of training data from the UltraEdit dataset and evaluated its performance against its counterparts.
Evaluation Results on the MagicBrush Benchmark
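The exact metric suites are defined by the benchmarks themselves; the snippet below is only a sketch of two commonly reported scores for instruction-based editing, pixel-level L1 distance to the ground-truth edit and CLIP text-image alignment (CLIP-T), and is not the benchmarks' official evaluation code.

```python
# Minimal sketch of two commonly reported editing metrics:
# pixel-level L1 distance to the ground-truth image and CLIP text-image
# alignment (CLIP-T). Illustrative only, not the benchmarks' official code.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def l1_distance(pred: Image.Image, target: Image.Image) -> float:
    """Mean absolute pixel difference in [0, 1]."""
    a = np.asarray(pred.resize(target.size), dtype=np.float32) / 255.0
    b = np.asarray(target, dtype=np.float32) / 255.0
    return float(np.abs(a - b).mean())

def clip_t(pred: Image.Image, target_caption: str) -> float:
    """Cosine similarity between the edited image and the target caption."""
    inputs = processor(text=[target_caption], images=pred,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum(dim=-1).item())
```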
We conduct qualitative evaluations to assess the consistency, instruction alignment, and image quality of the edited images generated by our model trained on UltraEdit, using the MagicBrush and Emu Test benchmarks.
Qualitative Evaluation of Single-turn Setting
Qualitative Evaluation of Multi-turn Setting
Qualitative Evaluation on Emu Test Set
Qualitative evaluation of using real images as anchors during image generation
This website is adapted from ArxivCap, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.