Coordinate Transformer: Achieving Single-stage Multi-person Mesh Recovery from Videos

Li, Haoyuan; Dong, Haoye; Jia, Hanchao; Huang, Dong; Kampffmeyer, Michael Christian; Lin, Liang; Liang, Xiaodan

dc.contributor.author	Li, Haoyuan
dc.contributor.author	Dong, Haoye
dc.contributor.author	Jia, Hanchao
dc.contributor.author	Huang, Dong
dc.contributor.author	Kampffmeyer, Michael Christian
dc.contributor.author	Lin, Liang
dc.contributor.author	Liang, Xiaodan
dc.date.accessioned	2024-02-16T13:13:48Z
dc.date.available	2024-02-16T13:13:48Z
dc.date.issued	2024-01-15
dc.description.abstract	Multi-person 3D mesh recovery from videos is a critical first step towards automatic perception of group behavior in virtual reality, physical therapy and beyond. However, existing approaches rely on multi-stage paradigms, where the person detection and tracking stages are performed in a multi-person setting, while temporal dynamics are only modeled for one person at a time. Consequently, their performance is severely limited by the lack of inter-person interactions in the spatial-temporal mesh recovery, as well as by detection and tracking defects. To address these challenges, we propose the Coordinate transFormer (CoordFormer) that directly models multi-person spatial-temporal relations and simultaneously performs multi-mesh recovery in an end-to-end manner. Instead of partitioning the feature map into coarse-scale patch-wise tokens, CoordFormer leverages a novel Coordinate-Aware Attention to preserve pixel-level spatial-temporal coordinate information. Additionally, we propose a simple, yet effective Body Center Attention mechanism to fuse position information. Extensive experiments on the 3DPW dataset demonstrate that CoordFormer significantly improves the state-of-the-art, outperforming the previously best results by 4.2%, 8.8% and 4.7% according to the MPJPE, PAMPJPE, and PVE metrics, respectively, while being 40% faster than recent video-based approaches. The released code can be found at https://github.com/Li-Hao-yuan/CoordFormer.	en_US
dc.identifier.citation	Li H, Dong H, Jia H, Huang D, Kampffmeyer MC, Lin L, Liang X. Coordinate Transformer: Achieving Single-stage Multi-person Mesh Recovery from Videos. IEEE International Conference on Computer Vision (ICCV). 2023	en_US
dc.identifier.cristinID	FRIDAID 2185859
dc.identifier.doi	10.1109/ICCV51070.2023.00803
dc.identifier.issn	1550-5499
dc.identifier.issn	2380-7504
dc.identifier.uri	https://hdl.handle.net/10037/32953
dc.language.iso	eng	en_US
dc.publisher	IEEE	en_US
dc.relation.journal	IEEE International Conference on Computer Vision (ICCV)
dc.relation.projectID	Norges forskningsråd: 309439	en_US
dc.relation.projectID	Norges forskningsråd: 315029	en_US
dc.rights.accessRights	openAccess	en_US
dc.rights.holder	Copyright 2023 The Author(s)	en_US
dc.title	Coordinate Transformer: Achieving Single-stage Multi-person Mesh Recovery from Videos	en_US
dc.type.version	acceptedVersion	en_US
dc.type	Journal article	en_US
dc.type	Peer reviewed	en_US

Associated file(s)

Name: article.pdf
Size: 28.08 MB
Format: PDF

This item appears in the following collection(s)

Articles, reports and other (physics and technology) [1062]
