
dc.contributor.author: Zhang, Xujie
dc.contributor.author: Yang, Binbin
dc.contributor.author: Kampffmeyer, Michael Christian
dc.contributor.author: Zhang, Wenqing
dc.contributor.author: Zhang, Shiyue
dc.contributor.author: Lu, Guansong
dc.contributor.author: Lin, Liang
dc.contributor.author: Xu, Hang
dc.contributor.author: Liang, Xiaodan
dc.date.accessioned: 2024-02-16T13:36:48Z
dc.date.available: 2024-02-16T13:36:48Z
dc.date.issued: 2024-01-15
dc.description.abstract: Cross-modal garment synthesis and manipulation will significantly benefit how fashion designers generate garments and modify their designs via flexible linguistic interfaces. However, despite the significant progress that has been made in generic image synthesis using diffusion models, producing garment images with garment-part-level semantics that are well aligned with input text prompts, and then flexibly manipulating the generated results, remains a problem. Current approaches follow the general text-to-image paradigm and mine cross-modal relations via simple cross-attention modules, neglecting the structural correspondence between visual and textual representations in the fashion design domain. In this work, we instead introduce DiffCloth, a diffusion-based pipeline for cross-modal garment synthesis and manipulation, which empowers diffusion models with flexible compositionality in the fashion domain by structurally aligning the cross-modal semantics. Specifically, we formulate part-level cross-modal alignment as a bipartite matching problem between the linguistic Attribute-Phrases (APs) and the visual garment parts, which are obtained via constituency parsing and semantic segmentation, respectively. To mitigate the issue of attribute confusion, we further propose a semantic-bundled cross-attention that preserves the spatial structure similarities between the attention maps of attribute adjectives and part nouns in each AP. Moreover, DiffCloth allows manipulation of the generated results by simply replacing APs in the text prompts. The manipulation-irrelevant regions are recognized via blended masks obtained from the bundled attention maps of the APs and are kept unchanged. Extensive experiments on the CM-Fashion benchmark demonstrate that DiffCloth both yields state-of-the-art garment synthesis results by leveraging the inherent structural information and supports flexible manipulation with region consistency.
dc.identifier.citation: Zhang X, Yang B, Kampffmeyer MC, Zhang W, Zhang S, Lu G, Lin L, Xu H, Liang X. DiffCloth: Diffusion Based Garment Synthesis and Manipulation via Structural Cross-modal Semantic Alignment. IEEE International Conference on Computer Vision (ICCV). 2023.
dc.identifier.cristinID: FRIDAID 2185862
dc.identifier.doi: 10.1109/ICCV51070.2023.02116
dc.identifier.issn: 1550-5499
dc.identifier.issn: 2380-7504
dc.identifier.uri: https://hdl.handle.net/10037/32955
dc.language.iso: eng
dc.publisher: IEEE
dc.relation.journal: IEEE International Conference on Computer Vision (ICCV)
dc.relation.projectID: Norges forskningsråd: 309439
dc.relation.projectID: Norges forskningsråd: 315029
dc.rights.accessRights: openAccess
dc.rights.holder: Copyright 2023 The Author(s)
dc.title: DiffCloth: Diffusion Based Garment Synthesis and Manipulation via Structural Cross-modal Semantic Alignment
dc.type.version: acceptedVersion
dc.type: Journal article
dc.type: Peer reviewed
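
The abstract formulates part-level cross-modal alignment as a bipartite matching problem between the linguistic Attribute-Phrases (APs) and segmented garment parts. The sketch below illustrates that formulation using the Hungarian algorithm (scipy's linear_sum_assignment) over a cosine-similarity cost; the function name, embedding shapes, and cost design are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not DiffCloth's code): match Attribute-Phrase text
# embeddings to pooled garment-part features via optimal bipartite matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_aps_to_parts(ap_embeds: np.ndarray, part_embeds: np.ndarray):
    """ap_embeds: (num_aps, d) AP text embeddings.
    part_embeds: (num_parts, d) pooled features of segmented garment parts.
    Returns (ap_idx, part_idx) arrays giving the optimal one-to-one matching."""
    # Normalize rows so the dot product equals cosine similarity.
    ap = ap_embeds / np.linalg.norm(ap_embeds, axis=1, keepdims=True)
    parts = part_embeds / np.linalg.norm(part_embeds, axis=1, keepdims=True)
    cost = -ap @ parts.T  # negate similarity: the solver minimizes cost
    return linear_sum_assignment(cost)

# Toy usage: 3 APs (e.g. "red collar", "long sleeves", "pleated skirt"), 4 parts.
rng = np.random.default_rng(0)
ap_idx, part_idx = match_aps_to_parts(rng.normal(size=(3, 512)),
                                      rng.normal(size=(4, 512)))
print(list(zip(ap_idx.tolist(), part_idx.tolist())))
```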

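The semantic-bundled cross-attention is described as preserving spatial-structure similarity between the attention map of an attribute adjective and that of its part noun within each AP. A hedged sketch of one way such a constraint could be expressed as a loss follows; the tensor shapes, token indices, and cosine-similarity form are assumptions rather than the authors' exact formulation.

```python
# Illustrative bundling loss: penalize spatial mismatch between the
# cross-attention maps of a bundled (adjective, noun) token pair, so that
# e.g. "red" attends to the same region as "collar".
import torch
import torch.nn.functional as F

def bundling_loss(attn: torch.Tensor, adj_idx: int, noun_idx: int) -> torch.Tensor:
    """attn: (H*W, num_tokens) cross-attention probabilities for one layer.
    Returns 1 - cosine similarity between the two tokens' spatial maps."""
    return 1.0 - F.cosine_similarity(attn[:, adj_idx], attn[:, noun_idx], dim=0)

# Toy usage: 16x16 latent grid flattened to 256 positions, 77 text tokens.
attn = torch.softmax(torch.randn(256, 77), dim=-1)
print(float(bundling_loss(attn, adj_idx=5, noun_idx=6)))
```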

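Finally, the abstract states that manipulation-irrelevant regions are recognized via blended masks from the bundled attention maps and kept unchanged when an AP is replaced. The toy sketch below shows such a blending step under the assumption that a thresholded attention map serves as the edit mask; blend_latents and the threshold value are hypothetical.

```python
# Illustrative latent blending: regenerate only the region covered by the
# replaced AP's attention mask, copying everything else from the original.
import torch

def blend_latents(z_orig: torch.Tensor, z_edit: torch.Tensor,
                  attn_map: torch.Tensor, thresh: float = 0.5) -> torch.Tensor:
    """z_orig, z_edit: (C, H, W) latents; attn_map: (H, W) bundled attention
    for the replaced AP, scaled to [0, 1]."""
    mask = (attn_map >= thresh).float()           # 1 inside the edited region
    return mask * z_edit + (1.0 - mask) * z_orig  # rest stays unchanged

# Toy usage on a 4-channel 16x16 latent.
z = blend_latents(torch.randn(4, 16, 16), torch.randn(4, 16, 16),
                  torch.rand(16, 16))
print(z.shape)  # torch.Size([4, 16, 16])
```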