Evaluating and Improving Text to Image Alignment with Latent Diffusion Models

The Data Science Discipline of the School of EECS is hosting the following guest seminar:

Evaluating and Improving Text to Image Alignment with Latent Diffusion Models

Speaker: Jaskirat Singh (Australian National University)
Host: Dr Ruihong Qiu

Abstract: Recent advancements in the domain of text-conditioned image generation, particularly through latent diffusion models, have shown impressive outcomes. However, as the complexity of textual prompts escalates, even advanced diffusion models occasionally misrepresent the intended semantics. Notably, such misalignments often bypass detection by established multi-modal models like CLIP. In response, our study delves into a novel decompositional approach to enhance and evaluate text-to-image alignment. We propose the Decompositional-Alignment-Score (DAS), which dissects intricate prompts into distinct assertions. These assertions' alignment with the resultant images is subsequently appraised using a VQA model. An aggregate of these scores then determines the overarching text-to-image alignment score. Our experiments demonstrate that the DAS metric resonates more closely with human judgment compared to traditional CLIP and BLIP scores. Additionally, the assertion-specific alignment scores offer valuable insights. These insights can guide an iterative refinement process to accentuate the prominence of each assertion in the produced images. Human evaluations underscore that our methodology exceeds the previous benchmark by 8.7% in overall text-to-image alignment precision.
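
To make the scoring idea in the abstract concrete, here is a minimal Python sketch. It is not the speaker's implementation. It assumes the prompt has already been decomposed into assertions, that a VQA scorer is supplied by the caller as a function returning a value between 0 and 1, and that the aggregate is a simple mean (the abstract only says "an aggregate"). The names `score_assertions`, `aggregate_score`, `weakest_assertion` and `toy_vqa_score` are hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List


def score_assertions(
    image: object,
    assertions: List[str],
    vqa_score: Callable[[object, str], float],
) -> Dict[str, float]:
    """Score each assertion against the image using a caller-supplied VQA scorer."""
    return {assertion: vqa_score(image, assertion) for assertion in assertions}


def aggregate_score(scores: Dict[str, float]) -> float:
    """Combine per-assertion scores into one alignment score.

    Assumption: a plain mean. The actual aggregation may differ.
    """
    return mean(scores.values())


def weakest_assertion(scores: Dict[str, float]) -> str:
    """Return the assertion with the lowest score, a candidate for refinement."""
    return min(scores, key=scores.get)


# Toy stand-in for a real VQA model: looks up a fixed score per assertion.
TOY_SCORES = {"there is a cat": 0.75, "the cat wears a red hat": 0.25}


def toy_vqa_score(image: object, assertion: str) -> float:
    return TOY_SCORES[assertion]


scores = score_assertions(None, list(TOY_SCORES), toy_vqa_score)
overall = aggregate_score(scores)
focus = weakest_assertion(scores)
```

In this toy run, `focus` identifies the assertion that the image satisfies least, which is the kind of per-assertion signal the abstract says can guide iterative refinement.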

Speaker Bio: Jaskirat Singh is a second-year Ph.D. student in the College of Engineering and Computer Science (CECS) at the Australian National University, where he is supervised by Prof. Liang Zheng and Prof. Stephen Gould. He is currently also an intern in the Creative Intelligence Lab at Adobe Research, working under Dr. Zhe Lin and Dr. Jianming Zhang. His main research interests are in the fields of controllable image synthesis and creative content generation, aiming to leverage recent multimodal foundation models to allow users to better express their ideas through visual media. On a broader level, the aim is to facilitate the development of systems that allow for a more intuitive and controllable expression of users' ideas through generative image and text.

About Data Science Seminar

This seminar series is hosted by EECS Data Science.

Venue

Building 14-115, and online via Zoom: https://uqz.zoom.us/j/82896549343