Spotlight Session
Chair
Ji-Hoon Jeong
Foundational AI
Improving Diffusion Models for Authentic Virtual Try-on in the Wild
Yisol Choi
(KAIST)
This paper considers image-based virtual try-on, which renders an image of a person wearing a curated garment, given a pair of images depicting the person and the garment, respectively. Previous works adapt existing exemplar-based inpainting diffusion models for virtual try-on to improve the naturalness of the generated visuals compared to other methods (e.g., GAN-based), but they fail to preserve the identity of the garments. To overcome this limitation, we propose a novel diffusion model that improves garment fidelity and generates authentic virtual try-on images. Our method, coined IDM–VTON, uses two different modules to encode the semantics of the garment image; given the base UNet of the diffusion model, 1) the high-level semantics extracted from a visual encoder are fused into the cross-attention layer, and then 2) the low-level features extracted from a parallel UNet are fused into the self-attention layer. In addition, we provide detailed textual prompts for both garment and person images to enhance the authenticity of the generated visuals. Finally, we present a customization method using a pair of person-garment images, which significantly improves fidelity and authenticity. Our experimental results show that our method outperforms previous approaches (both diffusion-based and GAN-based) in preserving garment details and generating authentic virtual try-on images, both qualitatively and quantitatively. Furthermore, the proposed customization method demonstrates its effectiveness in a real-world scenario.
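To make the two fusion paths concrete, below is a minimal sketch in PyTorch, assuming a transformer block inside the base UNet; the class name, dimensions, and the exact way garment features enter the attention layers (concatenation into the self-attention keys/values, cross-attention onto image-encoder tokens) are illustrative assumptions rather than the authors' implementation.

```python
# Hedged sketch (not the IDM-VTON code): high-level garment tokens enter via
# cross-attention, low-level features from a parallel UNet are concatenated
# into self-attention. Names, shapes, and normalization are assumptions.
import torch
import torch.nn as nn


class TryOnFusionBlock(nn.Module):
    def __init__(self, dim: int = 320, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, garment_low, garment_high):
        # x:            (B, N, D) person/latent tokens in the base UNet
        # garment_low:  (B, M, D) low-level features from the parallel UNet
        # garment_high: (B, K, D) high-level tokens from the visual encoder
        h = self.norm1(x)
        kv = torch.cat([h, garment_low], dim=1)              # fuse into self-attention
        x = x + self.self_attn(h, kv, kv, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, garment_high, garment_high,
                                need_weights=False)[0]       # fuse into cross-attention
        return x
```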
Towards Interpretable Controllability in Object-Centric Learning
Jinwoo Kim
(Yonsei University)
The binding problem in artificial neural networks is actively explored with the goal of achieving human-level recognition skills through the comprehension of the world in terms of symbol-like entities. Especially in the field of computer vision, object-centric learning (OCL) is extensively researched to better understand complex scenes by acquiring object representations or slots. While recent studies in OCL have made strides with complex images or videos, the interpretability of and interactivity with object representations remain largely uncharted and still hold promise for the field of OCL. In this paper, we introduce a novel method, Slot Attention with Image Augmentation (SlotAug), to explore the possibility of learning interpretable controllability over slots in a self-supervised manner by utilizing an image augmentation strategy. We also devise the concept of sustainability in controllable slots by introducing iterative and reversible controls over slots with two proposed sub-methods: Auxiliary Identity Manipulation and Slot Consistency Loss.
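As a rough illustration of the consistency idea, the sketch below assumes an image augmentation whose parameters are mirrored by an explicit edit in slot space, with an MSE objective between the edited slots and the slots of the augmented image; the callables and the choice of loss are assumptions, not the paper's exact formulation.

```python
# Illustrative sketch, not the SlotAug code: mirror an image augmentation with
# an edit on slot representations, then require slots of the augmented image
# to agree with the edited slots (a slot-consistency objective).
import torch.nn.functional as F


def slot_consistency_loss(encoder, manipulate, image, aug_image, aug_params):
    # encoder:    image -> (B, num_slots, slot_dim) slots
    # manipulate: applies the same edit (e.g., a translation) in slot space
    slots = encoder(image)                        # slots of the original image
    edited_slots = manipulate(slots, aug_params)  # e.g., shift a positional code
    target_slots = encoder(aug_image)             # slots of the augmented image
    return F.mse_loss(edited_slots, target_slots)
```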
Salience-Based Adaptive Masking: Revisiting Token Dynamics for Enhanced Pre-training
Hyesong Choi
(Ewha Womans University)
In this paper, we introduce Saliency-Based Adaptive Masking (SBAM), a novel and cost-effective approach that significantly enhances the pre-training performance of Masked Image Modeling (MIM) approaches by prioritizing token salience. Our method provides robustness against variations in masking ratios, effectively mitigating the performance instability issues common in existing methods. This relaxed sensitivity of MIM-based pre-training to masking ratios, in turn, allows us to propose an adaptive strategy of ‘tailored’ masking ratios for each data sample, which no existing method provides. Toward this goal, we propose an Adaptive Masking Ratio (AMR) strategy that dynamically adjusts the proportion of masking for the unique content of each image based on token salience. We show that our method significantly improves over the state of the art in mask-based pre-training on the ImageNet-1K dataset.
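The sketch below illustrates one way a per-sample masking ratio could be tied to token salience: images whose salience is spread across many tokens receive a higher ratio, and the most salient tokens are masked first. The salience scores, the entropy-based ratio rule, and the masking direction are all assumptions for illustration, not the paper's AMR definition.

```python
# Hedged sketch of salience-prioritized masking with a per-image ratio.
# `salience` is a stand-in score; how SBAM derives it is not reproduced here.
import torch


def adaptive_salience_mask(salience, base_ratio=0.6, spread=0.15):
    # salience: (B, N) nonnegative token-salience scores per image
    B, N = salience.shape
    p = salience / salience.sum(dim=1, keepdim=True)
    entropy = -(p * (p + 1e-8).log()).sum(dim=1)              # how spread out salience is
    norm_ent = entropy / torch.log(torch.tensor(float(N)))
    ratio = base_ratio + spread * (norm_ent - 0.5) * 2        # per-image masking ratio
    n_mask = (ratio * N).long()
    order = salience.argsort(dim=1, descending=True)          # most salient tokens first
    mask = torch.zeros_like(salience, dtype=torch.bool)
    for i in range(B):
        mask[i, order[i, : n_mask[i]]] = True                 # mask the salient tokens
    return mask, ratio
```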
DEVIAS: Learning Disentangled Video Representations of Action and Scene
Geo Ahn
(Kyung Hee University)
Video recognition models often learn scene-biased action representations due to the spurious correlation between actions and scenes in the training data. Although scene-debiased action recognition models might address the issue, they often overlook valuable scene information in the data. To address this challenge, we propose to learn DisEntangled VIdeo representations of Action and Scene (DEVIAS). The architecture consists of a disentangling encoder (DE) and an action mask decoder (AMD). With the resulting disentangled representations, we can achieve robust performance across diverse scenarios, including both seen and unseen action-scene combinations.
Interaction AI with Reality
NVS-Adapter: Plug-and-Play Novel View Synthesis from a Single Image
Yoonwoo Jeong
(POSTECH)
Transfer learning of large-scale Text-to-Image (T2I) models has recently shown impressive potential for Novel View Synthesis (NVS) of diverse objects from a single image. While previous methods typically train large models on multi-view datasets for NVS, fine-tuning all parameters of a T2I model not only incurs a high cost but also reduces its generalization capacity for generating diverse images in a new domain. In this study, we propose an effective method, dubbed NVS-Adapter, which is a plug-and-play module for a T2I model, to synthesize novel multi-views of visual objects while fully exploiting the generalization capacity of T2I models. NVS-Adapter consists of two main components: view-consistency cross-attention learns the visual correspondences to align the local details of view features, and global semantic conditioning aligns the semantic structure of generated views with the reference view. Experimental results demonstrate that the NVS-Adapter can effectively synthesize geometrically consistent multi-views and also achieve high performance on benchmarks without full fine-tuning of T2I models.
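A minimal, hypothetical sketch of the two components named above is given below: cross-attention among target-view tokens for view consistency, and conditioning every view on a global embedding of the reference view. Shapes, module names, and how the block plugs into the frozen T2I UNet are assumptions.

```python
# Hedged sketch (not the NVS-Adapter release): view tokens attend to each other
# for consistency, and a global reference embedding conditions all views.
import torch
import torch.nn as nn


class NVSAdapterBlock(nn.Module):
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.view_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.cond_proj = nn.Linear(dim, dim)

    def forward(self, view_tokens, ref_global):
        # view_tokens: (B, V, N, D) tokens of V target views from the frozen T2I UNet
        # ref_global:  (B, D) global semantic embedding of the reference view
        B, V, N, D = view_tokens.shape
        x = view_tokens.reshape(B, V * N, D)                  # let views attend to each other
        h = self.norm(x)
        x = x + self.view_attn(h, h, h, need_weights=False)[0]
        x = x + self.cond_proj(ref_global)[:, None, :]        # global semantic conditioning
        return x.reshape(B, V, N, D)
```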
CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision
Junghyun Kim
(Seoul National University)
This paper explores how nonexperts can teach robots desired skills in their environments. We argue that natural language is an intuitive and accessible interface for robot learning. To this end, we investigate two key aspects: (1) how nonexperts collect robotic data using natural language supervision and (2) how pre-trained vision-language models learn end-to-end policies directly from this supervision. We propose a data collection framework that collects robot demonstrations based on natural language supervision (e.g., “move forward”) and further augments these demonstrations. Next, we introduce CLIP-RT, a model that learns language-conditioned policies from natural language supervision. Our model employs pre-trained CLIP models and learns to predict actions represented in language via contrastive imitation learning. We first train CLIP-RT on large-scale robotic data and then enable it to learn desired skills using data collected from our framework. CLIP-RT shows strong capabilities in acquiring novel manipulation skills, outperforming the state-of-the-art model, OpenVLA (7B parameters), by 17% in average success rates, while using 7x fewer parameters (1B).
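To convey the flavor of contrastive imitation learning, the sketch below scores candidate language actions against a CLIP-style observation embedding and trains the demonstrated action to score highest; the candidate-set formulation, embedding shapes, and temperature are assumptions rather than the released CLIP-RT training code.

```python
# Hedged sketch of contrastive imitation learning over language actions.
import torch
import torch.nn.functional as F


def contrastive_imitation_loss(obs_emb, action_embs, target_idx, temperature=0.07):
    # obs_emb:     (B, D)    image + instruction embedding for each state
    # action_embs: (B, K, D) embeddings of K candidate actions expressed in language
    # target_idx:  (B,)      index of the demonstrated action among the candidates
    obs = F.normalize(obs_emb, dim=-1)
    acts = F.normalize(action_embs, dim=-1)
    logits = torch.einsum("bd,bkd->bk", obs, acts) / temperature
    return F.cross_entropy(logits, target_idx)


def select_action(obs_emb, action_embs):
    # at test time, pick the language action most similar to the observation
    scores = torch.einsum("bd,bkd->bk",
                          F.normalize(obs_emb, dim=-1),
                          F.normalize(action_embs, dim=-1))
    return scores.argmax(dim=-1)
```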
RadiomicsFill-Mammo: Synthetic Mammogram Mass Manipulation with Radiomics Features
Inye Na
(Sungkyunkwan University)
Motivated by the question, "Can we generate tumors with desired attributes?" this study leverages radiomics features to explore the feasibility of generating synthetic tumor images. We present RadiomicsFill-Mammo, a technique that generates realistic mammogram mass images mirroring specific radiomics attributes and incorporating clinical variables such as BI-RADS and breast density. RadiomicsFill-Mammo improves mass detection capabilities and enables simulated sample generation, advancing medical imaging research and treatment planning.
AI for Scientific and Social Challenges
Learning Phoneme Sequences from Speech Neural Signals Using Diffusion Models for Open Vocabulary BCI
Soowon Kim
(Korea University)
Generating unconstrained speech or text from speech-related biosignals is a critical challenge in brain-computer interfaces. Phonemes, the smallest units of sound, are essential for constructing words and sentences. We propose a diffusion model-based method that decodes phoneme sequences from EEG signals using a limited speech corpus, enabling the generation of unseen words. By aligning EEG signals with phoneme sequences, our approach identifies minimal word sets for efficient learning, advancing noninvasive speech decoding for neural communication.
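For intuition, the sketch below shows a standard conditional-diffusion training step in which a denoiser predicts the noise added to embedded phoneme sequences given EEG features; the denoiser interface, embedding shapes, and noise-schedule handling are assumptions, not the paper's architecture.

```python
# Hedged sketch of a conditional diffusion training step over phoneme embeddings.
import torch
import torch.nn.functional as F


def diffusion_step(denoiser, phoneme_emb, eeg_feat, alphas_cumprod):
    # phoneme_emb: (B, L, D) embedded phoneme sequence (the diffusion target)
    # eeg_feat:    (B, T, D) encoded EEG signal used as conditioning
    alphas_cumprod = alphas_cumprod.to(phoneme_emb.device)
    B = phoneme_emb.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (B,), device=phoneme_emb.device)
    a = alphas_cumprod[t].view(B, 1, 1)
    noise = torch.randn_like(phoneme_emb)
    noisy = a.sqrt() * phoneme_emb + (1 - a).sqrt() * noise   # forward (noising) process
    pred = denoiser(noisy, t, eeg_feat)                       # EEG-conditioned denoiser
    return F.mse_loss(pred, noise)                            # standard epsilon-prediction loss
```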
Flexible Molecular Alignment Using Diffusion Generative Models
Iljung Kim
(Hanyang University)
The molecular structure of a ligand determines its binding affinity to proteins. Due to their structural flexibility, molecules can interact with multiple binding pocket conformations, a property that improves interaction prediction when incorporated into screening algorithms. Our novel approach utilizes diffusion generative models and conditional graph neural networks to generate diverse and chemically plausible conformations. This method explores a broader range of ligand conformations and significantly improves accuracy in structure-based virtual screening tasks.
Advancing Irregular Time Series Classification in Astronomy with Neural Stochastic Differential Equations
Seungsu Kam
(UNIST)
Addressing the classification challenges of irregular time series data in astronomical surveys such as the Large Synoptic Survey Telescope (LSST), this research leverages Neural Stochastic Differential Equations (Neural SDEs) to tackle data irregularity and incompleteness. We conduct a comprehensive analysis of the optimal initial condition of Neural Langevin-type SDEs, which plays a pivotal role in modelling the continuous latent state. Three strategies for selecting the initial condition are compared under regular and irregular scenarios using the LSST dataset. Our empirical evaluation with Langevin-type SDEs shows the superiority of the static approach over dynamic approaches for setting the initial condition. This finding highlights the effectiveness of well-chosen initial values in Neural SDEs for enhancing the performance of astronomical time series classification under irregular observations.
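The sketch below illustrates the comparison in simplified form: a Langevin-type latent SDE integrated with Euler-Maruyama over irregular observation times, where the initial latent state is either a learned static parameter or computed dynamically from an encoded first observation. The network sizes, drift parameterization, and integration scheme are assumptions for illustration.

```python
# Hedged sketch of a Langevin-type latent SDE with static vs. dynamic initial state.
import torch
import torch.nn as nn


class LangevinLatentSDE(nn.Module):
    def __init__(self, latent_dim=16, static_init=True):
        super().__init__()
        self.drift = nn.Sequential(nn.Linear(latent_dim, 64), nn.Tanh(),
                                   nn.Linear(64, latent_dim))
        self.log_sigma = nn.Parameter(torch.zeros(latent_dim))   # diffusion scale
        self.static_init = static_init
        self.z0 = nn.Parameter(torch.zeros(latent_dim))          # static initial condition
        self.init_net = nn.Linear(latent_dim, latent_dim)        # dynamic alternative

    def initial_state(self, first_obs):
        # first_obs: (B, latent_dim) encoded first observation of each light curve
        return self.z0.expand_as(first_obs) if self.static_init else self.init_net(first_obs)

    def integrate(self, z, ts):
        # Euler-Maruyama over (possibly irregular) observation times ts: (T,)
        path = [z]
        for i in range(1, ts.shape[0]):
            dt = (ts[i] - ts[i - 1]).clamp(min=1e-6)
            noise = torch.randn_like(z) * self.log_sigma.exp() * dt.sqrt()
            z = z + self.drift(z) * dt + noise
            path.append(z)
        return torch.stack(path, dim=1)                          # (B, T, latent_dim)
```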
MuLe: Multi-Grained Graph Learning for Multi-Behavior Recommendation
Geonwoo Ko
(Soongsil University)
Multi-behavior recommender systems leverage auxiliary interactions to enhance recommendations for a target behavior, such as purchase. However, existing methods struggle to effectively utilize multi-faceted behavior relationships and to handle uncertain auxiliary interactions. In this paper, we propose MuLe (Multi-Grained Graph Learning), a novel graph-based model that captures diverse aspects of behaviors through a multi-grained graph learning strategy. To handle uncertain auxiliary interactions, we apply graph attention to weight their relevance to the target behavior. Then, we use an attention mechanism to aggregate the diverse behavior embeddings. Experiments show that MuLe outperforms state-of-the-art methods, improving HR@10 by 44.6% and NDCG@10 by 52.9%.
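As a rough sketch of the weighting-and-aggregation step, the code below scores each auxiliary-behavior embedding against the target-behavior embedding so that uncertain auxiliary interactions contribute less, then fuses the weighted sum back into the target representation; the projection layers and fusion rule are assumptions, not MuLe's exact architecture.

```python
# Hedged sketch of attention-weighted aggregation of behavior embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BehaviorAttentionAggregator(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)

    def forward(self, target_emb, aux_embs):
        # target_emb: (B, D)    embedding learned from the target behavior (e.g., purchase)
        # aux_embs:   (B, M, D) embeddings from M auxiliary behaviors (view, cart, ...)
        scores = torch.einsum("bd,bmd->bm", self.q(target_emb), self.k(aux_embs))
        weights = F.softmax(scores / aux_embs.shape[-1] ** 0.5, dim=-1)
        aggregated = torch.einsum("bm,bmd->bd", weights, aux_embs)
        return target_emb + aggregated        # fuse target and weighted auxiliary signals
```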