
DiffCloth: Diffusion Based Garment Synthesis and Manipulation via Structural Cross-modal Semantic Alignment

Permanent link
https://hdl.handle.net/10037/32955
DOI
https://doi.org/10.1109/ICCV51070.2023.02116
View/Open
article.pdf (3.112 MB)
Accepted manuscript version (PDF)
Date
2024-01-15
Type
Journal article
Peer reviewed

Author
Zhang, Xujie; Yang, Binbin; Kampffmeyer, Michael Christian; Zhang, Wenqing; Zhang, Shiyue; Lu, Guansong; Lin, Liang; Xu, Hang; Liang, Xiaodan
Abstract
Cross-modal garment synthesis and manipulation will significantly benefit the way fashion designers generate garments and modify their designs via flexible linguistic interfaces. However, despite the significant progress made in generic image synthesis with diffusion models, producing garment images whose part-level semantics are well aligned with input text prompts, and then flexibly manipulating the generated results, remains a problem. Current approaches follow the general text-to-image paradigm and mine cross-modal relations via simple cross-attention modules, neglecting the structural correspondence between visual and textual representations in the fashion design domain. In this work, we instead introduce DiffCloth, a diffusion-based pipeline for cross-modal garment synthesis and manipulation, which empowers diffusion models with flexible compositionality in the fashion domain by structurally aligning the cross-modal semantics. Specifically, we formulate part-level cross-modal alignment as a bipartite matching problem between the linguistic Attribute-Phrases (APs) and the visual garment parts, which are obtained via constituency parsing and semantic segmentation, respectively. To mitigate the issue of attribute confusion, we further propose a semantic-bundled cross-attention that preserves the spatial structure similarities between the attention maps of the attribute adjective and the part noun in each AP. Moreover, DiffCloth allows manipulation of the generated results by simply replacing APs in the text prompt; the manipulation-irrelevant regions are recognized by blended masks obtained from the bundled attention maps of the APs and are kept unchanged. Extensive experiments on the CM-Fashion benchmark demonstrate that DiffCloth both yields state-of-the-art garment synthesis results by leveraging the inherent structural information and supports flexible manipulation with region consistency.
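
The part-level alignment described in the abstract can be illustrated with a small, hypothetical sketch: matching Attribute-Phrase (AP) text embeddings to segmented garment-part embeddings as a bipartite assignment problem solved with the Hungarian algorithm. The embedding extraction (constituency parsing, segmentation, encoders) is assumed to happen elsewhere, and all names below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of part-level cross-modal alignment as bipartite matching.
# Assumes AP and garment-part embeddings have already been computed; names are
# illustrative and do not come from the DiffCloth code.
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_aps_to_parts(ap_embeddings: np.ndarray, part_embeddings: np.ndarray):
    """Return (ap_index, part_index) pairs maximizing total cosine similarity.

    ap_embeddings:   (num_aps, d)   embeddings of Attribute-Phrases, e.g. "red collar"
    part_embeddings: (num_parts, d) embeddings of segmented garment parts
    """
    # Normalize rows so the dot product equals cosine similarity.
    ap = ap_embeddings / np.linalg.norm(ap_embeddings, axis=1, keepdims=True)
    parts = part_embeddings / np.linalg.norm(part_embeddings, axis=1, keepdims=True)
    similarity = ap @ parts.T  # (num_aps, num_parts)

    # The Hungarian algorithm minimizes cost, so negate similarity to maximize it.
    ap_idx, part_idx = linear_sum_assignment(-similarity)
    return list(zip(ap_idx.tolist(), part_idx.tolist()))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    aps = rng.normal(size=(3, 512))    # e.g. "ruffled sleeve", "v-neckline", "blue hem"
    parts = rng.normal(size=(4, 512))  # segmented garment-part features
    print(match_aps_to_parts(aps, parts))
```

In this reading, the one-to-one assignment is what enforces that each AP attends to exactly one garment part, which the paper's semantic-bundled cross-attention then exploits during diffusion.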
Publisher
IEEE
Citation
Zhang X, Yang B, Kampffmeyer MC, Zhang W, Zhang S, Lu G, Lin L, Xu H, Liang X. DiffCloth: Diffusion Based Garment Synthesis and Manipulation via Structural Cross-modal Semantic Alignment. IEEE International Conference on Computer Vision (ICCV). 2023.
Collections
  • Artikler, rapporter og annet (fysikk og teknologi) [1057]
Copyright 2023 The Author(s)
