GARY

Diffusion & MLLMs

Exploring the boundless possibilities of diffusion · Building the multimodal future

About Me

I am Gary, a PhD student focusing on diffusion models and multimodal large models.

My research is dedicated to exploring the boundaries of generative AI, turning noise into art and enabling text, images, video, and 3D to flow freely across different modalities.

In this digital era, I believe AI is not just a tool but also an extension of creativity. By continuously optimizing the diffusion process, I am building intelligent systems capable of understanding and creating multimodal content.

When I am not training models, I enjoy thinking about the nature of consciousness and the emergence of intelligence—perhaps the next breakthrough is hidden somewhere deep in a dream.

Research Interests

Diffusion

Study the mathematical principles of the diffusion process in depth and explore more efficient sampling algorithms and noise-scheduling strategies, striving to shrink diffusion time from the near-infinite down to the millisecond level.
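
As a concrete taste of noise scheduling, here is a minimal sketch of the cosine schedule popularized by Improved DDPM; the step count and the `s` offset are illustrative defaults, not values from any specific paper of mine.

```python
import math

def cosine_alpha_bar(t: float, s: float = 0.008) -> float:
    """Cumulative signal rate alpha_bar(t), t in [0, 1] (cosine schedule)."""
    return math.cos((t + s) / (1 + s) * math.pi / 2) ** 2

def cosine_betas(num_steps: int, max_beta: float = 0.999) -> list:
    """Discretize alpha_bar into per-step noise rates beta_t."""
    betas = []
    for i in range(num_steps):
        a_cur = cosine_alpha_bar(i / num_steps)
        a_next = cosine_alpha_bar((i + 1) / num_steps)
        betas.append(min(1.0 - a_next / a_cur, max_beta))
    return betas

print(cosine_betas(10))  # beta_t grows as the forward process adds noise
```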

MLLMs & Multimodal Fusion

Construct a unified representation space in which modalities convert freely within the latent space, achieving true cross-modal understanding and generation.
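
A toy sketch of this idea, assuming a CLIP-style two-tower setup: each modality is projected into one shared latent space and aligned with a symmetric contrastive loss. The dimensions and temperature below are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedLatentSpace(nn.Module):
    """Project text and image features into one normalized latent space."""

    def __init__(self, text_dim=768, image_dim=1024, latent_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, latent_dim)
        self.image_proj = nn.Linear(image_dim, latent_dim)

    def forward(self, text_feat, image_feat):
        # L2-normalize so cosine similarity reduces to a dot product.
        z_t = F.normalize(self.text_proj(text_feat), dim=-1)
        z_i = F.normalize(self.image_proj(image_feat), dim=-1)
        return z_t, z_i

def contrastive_loss(z_t, z_i, temperature=0.07):
    """Symmetric InfoNCE: matched pairs sit on the diagonal."""
    logits = z_t @ z_i.T / temperature
    labels = torch.arange(z_t.size(0))
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

z_t, z_i = SharedLatentSpace()(torch.randn(8, 768), torch.randn(8, 1024))
print(contrastive_loss(z_t, z_i))
```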

Efficient Inference

Through model compression, quantization, and distillation, bring large models to edge devices, making AI omnipresent and omnipotent.
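
As one minimal, concrete instance of this toolbox, PyTorch's post-training dynamic quantization stores Linear weights in int8 with no retraining; the toy model below merely stands in for a real transformer block.

```python
import torch
import torch.nn as nn

# Toy stand-in for a large transformer feed-forward block.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

# Post-training dynamic quantization: int8 weights, activations
# quantized on the fly at inference time. No retraining needed.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 4096)
print(quantized(x).shape)  # same interface, ~4x smaller Linear weights
```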

AI Art

Treat algorithms as brushes and noise as pigment; explore the resonance between machine creativity and human aesthetics to create entirely new art forms.

Papers

Beyond Randomness: Understand the Order of the Noise in Diffusion

2025.11 | Gary et al. | Under Review

In text-driven content generation (T2C) diffusion models, the semantics of the generated content are mostly attributed to the interaction between text embeddings and the attention mechanism. The initial noise of the generation process is typically characterized as a random element that contributes to the diversity of the generated content. Contrary to this view, this paper reveals that strong, analyzable patterns lie beneath the random surface of the noise. Specifically, we first conduct a comprehensive analysis of the impact of random noise on the model's generation. We find that the noise not only contains rich semantic information, but that unwanted semantics can be erased from it in an extremely simple way grounded in information theory, and that semantics can be injected into the cleaned noise by exploiting the equivalence between the diffusion generation process and semantic injection. We then mathematically characterize these observations and propose a simple but efficient, training-free, and universal two-step "Semantic Erasure-Injection" process for modulating the initial noise of T2C diffusion models. Experimental results demonstrate that our method is consistently effective across T2C models built on both DiT and UNet architectures; it offers a novel perspective on optimizing diffusion generation and provides a universal tool for consistent generation.

Diffusion · Noise Optimization · Training-free
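
The exact erasure and injection operators belong to the paper under review; the sketch below is only a hypothetical illustration of what a two-step modulation of the initial latent could look like, with a random permutation standing in for erasure and a renormalized blend standing in for injection.

```python
import torch

def erase_semantics(noise: torch.Tensor) -> torch.Tensor:
    """Hypothetical erasure step (NOT the paper's procedure): shuffle
    away spatial structure, then restore unit-Gaussian statistics."""
    flat = noise.flatten()
    flat = flat[torch.randperm(flat.numel())]
    flat = (flat - flat.mean()) / flat.std()
    return flat.view_as(noise)

def inject_semantics(noise, ref_latent, strength=0.3):
    """Hypothetical injection step: blend a semantically chosen latent
    into the cleaned noise and renormalize so the sampler still sees
    approximately N(0, I)."""
    mixed = (1.0 - strength) * noise + strength * ref_latent
    return (mixed - mixed.mean()) / mixed.std()

x_T = torch.randn(1, 4, 64, 64)   # SD-style initial latent noise
ref = torch.randn(1, 4, 64, 64)   # placeholder for a reference latent
x_T = inject_semantics(erase_semantics(x_T), ref)
print(x_T.mean().item(), x_T.std().item())
```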

Break Stylistic Sophon: Are We Really Meant to Confine the Imagination in Style Transfer?

2025.05 | Gary et al. | Under Review

In image style transfer, existing algorithms that rely on a single reference style image face formidable challenges such as severe semantic drift, overfitting, color limitations, and the lack of a unified framework. These issues impede the generation of high-quality, diverse, and semantically accurate images. In this study, we introduce StyleWallfacer, an innovative unified training and inference framework that not only addresses the various issues traditional methods encounter during style transfer but also unifies the framework across different tasks, advancing the field by enabling high-quality style transfer and text-driven stylization. First, we propose a semantics-based style injection method that uses BLIP to generate text descriptions strictly aligned with the semantics of the style image in CLIP space. By leveraging a large language model to remove style-related terms from these descriptions, we create a semantic gap, which is then used to fine-tune the model, enabling efficient and drift-free injection of style knowledge. Second, we propose a data augmentation strategy based on human feedback, incorporating high-quality samples generated early in fine-tuning into the training set to facilitate progressive learning and significantly reduce overfitting. Finally, we design a training-free triple diffusion process using the fine-tuned model, which manipulates the features of the self-attention layers in a manner similar to the cross-attention mechanism: during generation, the key and value of the content-related process are replaced with those of the style-related process to inject style while maintaining text control over the model, and query preservation is introduced to mitigate disruptions to the original content. Under this design, we achieve high-quality image-driven style transfer and text-driven stylization while preserving the original image content. Moreover, we achieve image color editing during style transfer for the first time, further pushing the boundaries of controllable image generation and editing and breaking the limitations that reference images impose on style transfer. Our experimental results demonstrate that the proposed method outperforms state-of-the-art methods.

AI Art · Style Transfer · Diffusion
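
The key/value replacement described in the abstract can be sketched in a few lines, assuming (batch, tokens, dim) tensors already produced by the self-attention projections of the content and style branches; the head count and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def kv_swap_attention(q_content, k_style, v_style, num_heads=8):
    """Query preservation + K/V swap: the content branch keeps its own
    queries but attends to the style branch's keys and values."""
    b, n, d = q_content.shape
    split = lambda x: x.view(b, -1, num_heads, d // num_heads).transpose(1, 2)
    q, k, v = split(q_content), split(k_style), split(v_style)
    out = F.scaled_dot_product_attention(q, k, v)  # standard attention math
    return out.transpose(1, 2).reshape(b, n, d)

q = torch.randn(1, 256, 320)       # content queries (preserved)
k = v = torch.randn(1, 256, 320)   # style keys/values (swapped in)
print(kv_swap_attention(q, k, v).shape)  # (1, 256, 320)
```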

SHSRD: Efficient Conditional Diffusion Model for Single Hyperspectral Image Superresolution

2025.03 | Gary et al. | JSTARS 2025

The emergence of deep neural networks has spurred progress in single hyperspectral image (HSI) superresolution. However, mainstream models prioritize network architecture and optimization, overlooking the small scale and potential data imbalance of HSI datasets. Training models directly on such datasets can lead to issues such as overfitting. To better address these problems from the perspective of the dataset, we propose SHSRD, an advanced diffusion-based superresolution framework specifically designed for HSIs. It incorporates a spectral information injection module that selectively introduces diverse spectral information into the model in a conditional manner, enabling efficient spectral information perception. Furthermore, a two-stage training strategy is devised: the model is first pretrained on a large-scale natural image dataset, and the knowledge acquired there is then transferred to the single HSI superresolution task via transfer learning. This approach achieves high-quality superresolution for single HSIs and effectively addresses the challenge of limited training data in hyperspectral imaging. It also generalizes well: only one pretraining session on a natural image dataset is required per scale-specific task, allowing rapid transfer to any small HSI dataset. Extensive experiments on four public datasets demonstrate that SHSRD outperforms state-of-the-art methods.

Diffusion · HSI · Low-Level Vision
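
A minimal sketch of conditional spectral injection, assuming the simplest possible design: the bicubic-upsampled low-resolution HSI cube is concatenated to the noisy input as conditioning. The paper's module is more selective than plain concatenation, and timestep conditioning is omitted here for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalDenoiser(nn.Module):
    """Toy conditional diffusion denoiser for HSI superresolution."""

    def __init__(self, bands=31, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * bands, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, bands, 3, padding=1),
        )

    def forward(self, x_noisy, lr_hsi):
        # Upsample the low-res cube to the target size and inject it
        # as a condition by channel concatenation.
        cond = F.interpolate(lr_hsi, size=x_noisy.shape[-2:], mode="bicubic")
        return self.net(torch.cat([x_noisy, cond], dim=1))  # predict noise

model = ConditionalDenoiser(bands=31)
noisy = torch.randn(1, 31, 128, 128)
lr = torch.randn(1, 31, 32, 32)
print(model(noisy, lr).shape)  # (1, 31, 128, 128)
```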

Contact Me

Let's unlock AI's boundless potential together, finding order within noise and resonance across modalities.

Email
Gary_144@mail.ustc.edu.cn
RedNote
@gary爱登山
Google Scholar