Self-Prompting Diffusion Transformer for Open-Vocabulary Scene Text Editing via In-Context Learning

¹MT Lab, Meitu Inc.  ²School of Computer Science & Technology, Beijing Institute of Technology

Introduction

Scene text editing aims to modify text inside images while keeping the edited results natural and visually consistent. However, existing methods often fail to preserve the original text’s style and are usually limited to a fixed set of words or languages.

We propose a self-prompting text editing method that learns directly from the original image, without requiring additional text encoders. By leveraging the contextual learning ability of modern generative models, our method can edit previously unseen text while preserving the original visual style.
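The self-prompting idea above can be illustrated with a minimal sketch: the original styled text region itself serves as the in-context prompt, so no external text encoder is required. All names, shapes, and the glyph renderer below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def render_glyph(text, h=32, w=128):
    """Stand-in glyph renderer (assumed): maps a string to a fixed-size
    binary map. A real system would rasterize the text with a font."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return (rng.random((h, w)) > 0.5).astype(np.float32)

def build_in_context_input(source_crop, source_text, target_text):
    """Assemble the in-context input (hypothetical layout): the pair
    (source crop, source glyphs) acts as the self-prompt that conveys the
    original visual style, and the target glyphs form the query the
    diffusion transformer must render in that style. Here we simply
    concatenate along the width axis for illustration."""
    prompt = np.concatenate([source_crop, render_glyph(source_text)], axis=1)
    query = render_glyph(target_text)
    return np.concatenate([prompt, query], axis=1)

# Original styled text region from the scene image (dummy tensor here).
source_crop = np.zeros((32, 128), dtype=np.float32)
x = build_in_context_input(source_crop, "Hello", "Bonjour")
print(x.shape)  # two prompt panels plus one query panel along the width
```

The point of the sketch is only the data flow: style information enters through the source crop rather than through a learned text encoder, which is what enables editing unseen words and scripts.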

Our approach supports open-vocabulary multilingual editing across languages such as Chinese, English, Japanese, Korean, Russian, and Thai. Experiments show that it produces more accurate and realistic editing results than existing methods.

Introduction overview

Method

Method framework
Cooldown strategy

Visualization

Results

Results on the AnyText benchmark
Results on the MSTEdit dataset