UltraEdit

UltraEdit: Instruction-based Fine-Grained Image Editing at Scale

Peking University, BIGAI, UCLA, UIUC
* Equal Contribution
† Corresponding Authors

Abstract

This paper presents UltraEdit, a large-scale (~4M editing samples), automatically generated dataset for instruction-based image editing. Our key idea is to address the drawbacks of existing image editing datasets such as InstructPix2Pix and MagicBrush, and to provide a systematic approach to producing massive, high-quality image editing samples. UltraEdit offers several distinct advantages: 1) It features a broader range of editing instructions by leveraging the creativity of large language models (LLMs) alongside in-context editing examples from human raters; 2) Its data sources are based on real images, including photographs and artworks, which provide greater diversity and less bias than datasets generated solely by text-to-image models; 3) It also supports region-based editing, enhanced by high-quality, automatically produced region annotations. Our experiments show that canonical diffusion-based editing baselines trained on UltraEdit set new records on the MagicBrush and Emu-Edit benchmarks. Our analysis further confirms the crucial role of real image anchors and region-based editing data.

UltraEdit

Construction of UltraEdit:
(Top) We use an LLM with in-context examples to produce editing instructions and target captions from the collected image captions.
(Middle) For free-form editing, we use the collected images as anchors and invoke regular diffusion followed by prompt-to-prompt (P2P) control to produce source and target images.
(Bottom) For region-based editing, we first produce an editing region based on the instruction, then invoke a modified inpainting diffusion pipeline to produce the images.
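
A rough sketch of the two generation branches is given below. It assumes OpenAI-style LLM access and Hugging Face diffusers pipelines; the checkpoint names, the LLM prompt, and the img2img strength are placeholders, and the prompt-to-prompt attention control of the actual pipeline is only approximated here by sharing the anchor image and the random seed.

```python
# Minimal sketch of the two generation branches described above.
# Assumptions: OpenAI-style LLM access, Hugging Face diffusers pipelines, and
# placeholder checkpoints/prompts; the real pipeline additionally applies
# prompt-to-prompt attention control and its own region extraction.
import torch
from openai import OpenAI
from diffusers import StableDiffusionImg2ImgPipeline, StableDiffusionInpaintPipeline

llm = OpenAI()  # assumes OPENAI_API_KEY is set
device = "cuda" if torch.cuda.is_available() else "cpu"

img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"  # placeholder checkpoint
).to(device)
inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting"  # placeholder checkpoint
).to(device)

def propose_edit(source_caption: str) -> tuple[str, str]:
    """Ask an LLM for an editing instruction and the caption of the edited image."""
    reply = llm.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": f"Caption: {source_caption}\n"
                       "Write one short editing instruction and the caption of the "
                       "edited image, separated by '||'.",
        }],
    ).choices[0].message.content
    instruction, target_caption = reply.split("||", 1)
    return instruction.strip(), target_caption.strip()

def free_form_pair(anchor_image, source_caption, target_caption, seed=0):
    """Generate a (source, target) pair anchored on a collected real image.
    Sharing the anchor and the seed only approximates the P2P control used in
    the actual pipeline."""
    g = torch.Generator(device).manual_seed(seed)
    src = img2img(prompt=source_caption, image=anchor_image, strength=0.75, generator=g).images[0]
    g = torch.Generator(device).manual_seed(seed)
    tgt = img2img(prompt=target_caption, image=anchor_image, strength=0.75, generator=g).images[0]
    return src, tgt

def region_based_target(source_image, mask_image, target_caption):
    """Produce the edited image by inpainting only inside the editing region."""
    return inpaint(prompt=target_caption, image=source_image, mask_image=mask_image).images[0]
```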



Comparison of different image editing datasets. Both EditBench and MagicBrush are manually annotated but limited in size. InstructPix2Pix and HQ-Edit are large datasets automatically generated with T2I models such as Stable Diffusion and DALL-E, but they inherit notable biases from these generative models, leading to failure cases. UltraEdit offers large-scale samples with rich editing tasks and fewer biases.

Example

Examples of UltraEdit: free-form and region-based image editing.

Statistics

Statistics of Free-form and Region-based Image Editing Data. The following table shows the number of instances, the number of unique instructions, and their respective proportions for each instruction type.
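
The per-type counts and proportions in this table can be reproduced from the dataset metadata along the lines of the sketch below; the field names (edit_type, instruction) are assumptions about the schema rather than the released field names.

```python
# Sketch of computing per-type instance counts, unique-instruction counts, and
# proportions from the dataset metadata. Field names are assumed, not official.
from collections import Counter, defaultdict

def instruction_stats(samples):
    per_type = Counter(s["edit_type"] for s in samples)   # instances per type
    unique = defaultdict(set)
    for s in samples:
        unique[s["edit_type"]].add(s["instruction"])       # unique instructions
    total = sum(per_type.values())
    for edit_type, count in per_type.most_common():
        print(f"{edit_type:>20}: {count:>8} instances, "
              f"{len(unique[edit_type]):>7} unique, {count / total:6.2%}")
```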

Editing Examples

Editing examples generated by Stable Diffusion 3 trained on the UltraEdit dataset. The model supports both free-form (without mask) and region-based (with mask) image editing. You can try it in the demo above.

Example edit instructions (input and edited images omitted):
Add a UFO in the sky
Add a moon in the sky
add cherry blossoms
Please dress her in a short purple wedding dress adorned with white floral embroidery
give her a chief's headdress.
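
For reference, the snippet below illustrates what instruction-based editing inference of this kind looks like with Hugging Face diffusers. It is only a sketch: the checkpoint is the public InstructPix2Pix model, used here to show the interface, since the SD3 model trained on UltraEdit is released with its own mask-aware pipeline; the guidance values are illustrative.

```python
# Hedged sketch of instruction-based editing inference with diffusers.
# The checkpoint below is the public InstructPix2Pix model, not the SD3 model
# trained on UltraEdit, and is used only to illustrate the interface.
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("input.jpg").convert("RGB").resize((512, 512))
edited = pipe(
    "Add a UFO in the sky",       # free-form instruction, no mask
    image=image,
    num_inference_steps=30,
    image_guidance_scale=1.5,     # how closely to stay to the input image
    guidance_scale=7.5,           # how strongly to follow the instruction
).images[0]
edited.save("edited.jpg")
```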

Evaluation

We evaluate the performance of canonical diffusion-based editing models trained on our dataset across different instruction-based image editing benchmarks.

For a fair comparison, we train the same diffusion model on an equal amount of training data from UltraEdit and evaluate it against counterparts trained on other instruction-based editing datasets.
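
As a concrete example of the metrics these benchmarks report, the sketch below computes a CLIP image-similarity score (CLIP-I) between an edited image and its ground-truth target; the official benchmarks use their own evaluation scripts, and the CLIP checkpoint here is only illustrative.

```python
# Minimal sketch of a CLIP-I style score: cosine similarity between CLIP image
# embeddings of the edited image and the ground-truth target. The checkpoint
# and preprocessing are illustrative, not the official evaluation setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_image_similarity(edited: Image.Image, reference: Image.Image) -> float:
    inputs = processor(images=[edited, reference], return_tensors="pt")
    feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize embeddings
    return float((feats[0] * feats[1]).sum())         # cosine similarity

score = clip_image_similarity(Image.open("edited.jpg"), Image.open("target.jpg"))
print(f"CLIP-I: {score:.4f}")
```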

Evaluation Results on the MagicBrush Benchmark

Qualitative Evaluations

We conduct qualitative evaluations to assess the consistency, instruction alignment, and image quality of the edited images generated by our model trained on UltraEdit, using the MagicBrush benchmark and the Emu Edit test set.

Qualitative Evaluation of the Single-turn Setting on the MagicBrush Benchmark

Qualitative Evaluation of the Multi-turn Setting on the MagicBrush Benchmark

Qualitative Evaluation on the Emu Edit Test Set

Qualitative Evaluation of Using Real Images as Anchors during Image Generation




Acknowledgement

This website is adapted from ArxivCap, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.