IMITATOR: PERSONALIZED SPEECH-DRIVEN 3D FACIAL ANIMATION

Balamurugan Thambiraja¹, Ikhsanul Habibie², Sadegh Aliakbarian³, Darren Cosker³, Christian Theobalt², Justus Thies¹
¹ Max Planck Institute for Intelligent Systems, Tübingen, Germany
² Max Planck Institute for Informatics, Saarland, Germany
³ Mesh Labs, Microsoft, Cambridge, UK

Paper | arXiv | Video | Code (coming soon)

OVERVIEW

Imitator is a novel method for personalized speech-driven 3D facial animation. Given an audio sequence and a personalized style embedding as input, we generate person-specific motion sequences with accurate lip closures for bilabial consonants ('m', 'b', 'p'). The style embedding of a subject can be computed from a short reference video (e.g., 5 s).

ABSTRACT

Speech-driven 3D facial animation has been widely explored, with applications in gaming, character animation, virtual reality, and telepresence systems. State-of-the-art methods deform the face topology of the target actor to sync with the input audio without considering the identity-specific speaking style and facial idiosyncrasies of the target actor, resulting in unrealistic and inaccurate lip movements. To address this, we present Imitator, a speech-driven facial expression synthesis method that learns identity-specific details from a short input video and produces novel facial expressions matching the identity-specific speaking style and facial idiosyncrasies of the target actor. Specifically, we train a style-agnostic transformer on a large facial expression dataset, which we use as a prior for audio-driven facial expressions. Based on this prior, we optimize for the identity-specific speaking style from a short reference video. To train the prior, we introduce a novel loss function based on detected bilabial consonants to ensure plausible lip closures and, consequently, improve the realism of the generated expressions. Through detailed experiments and a user study, we show that our approach produces temporally coherent facial expressions from input audio while preserving the speaking style of the target actors.

VIDEO

PROPOSED METHOD

Overview of the proposed method: our method takes audio as input and encodes it into an audio embedding â1:T using a pre-trained Wav2Vec2.0 model. This audio embedding is interpreted by an auto-regressive viseme decoder, which generates generalized motion features v̂1:T. A style-adaptable motion decoder maps these motion features to person-specific facial expressions ŷ1:T, expressed as vertex displacements on top of a template mesh. (Minimal code sketches of this pipeline and of the bilabial lip-contact loss are given at the end of this page.)

IMPACT OF PERSONALIZATION

IMPACT OF LIP CONTACT LOSS

BIBTEX

@inproceedings{Thambiraja2022Imitator,
  author    = {Thambiraja, Balamurugan and Habibie, Ikhsanul and Aliakbarian, Sadegh and Cosker, Darren and Theobalt, Christian and Thies, Justus},
  title     = {Imitator: Personalized Speech-driven 3D Facial Animation},
  publisher = {arXiv},
  year      = {2022},
}

The source code of this website is mainly borrowed from Keunhong Park's Nerfies website. Please contact Balamurugan Thambiraja for feedback and questions.
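CODE SKETCH: PIPELINE

For readers who want a concrete picture of the three stages described under PROPOSED METHOD, the following PyTorch sketch mirrors the described data flow: an audio encoder producing â1:T (here only a stand-in for the pre-trained Wav2Vec2.0 model), an auto-regressive viseme decoder producing v̂1:T, and a style-adaptable motion decoder producing per-vertex displacements ŷ1:T conditioned on a subject style embedding. All class names, layer sizes, the style-embedding dimension, and the FLAME-like vertex count are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class AudioEncoder(nn.Module):
    """Stand-in for the pre-trained Wav2Vec2.0 audio encoder (assumption: kept frozen)."""

    def __init__(self, dim=768):
        super().__init__()
        # Placeholder projection from per-frame scalar features to the embedding size.
        self.proj = nn.Linear(1, dim)

    def forward(self, audio_frames):          # (B, T, 1)
        return self.proj(audio_frames)        # (B, T, dim)  ~  â_{1:T}


class VisemeDecoder(nn.Module):
    """Auto-regressive transformer decoder producing generalized motion features v̂_{1:T}."""

    def __init__(self, dim=768, n_heads=8, n_layers=4):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)

    def forward(self, motion_tokens, audio_emb):
        # Causal mask so frame t only attends to frames <= t (auto-regressive decoding).
        T = motion_tokens.size(1)
        causal = torch.triu(
            torch.full((T, T), float("-inf"), device=motion_tokens.device), diagonal=1
        )
        return self.decoder(motion_tokens, audio_emb, tgt_mask=causal)  # (B, T, dim)


class StyleAdaptableMotionDecoder(nn.Module):
    """Maps motion features plus a per-subject style embedding to vertex displacements ŷ_{1:T}."""

    def __init__(self, dim=768, style_dim=64, n_vertices=5023):  # 5023: FLAME-style mesh (assumption)
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim + style_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, n_vertices * 3),
        )

    def forward(self, motion_feats, style):   # motion_feats: (B, T, dim), style: (B, style_dim)
        style = style.unsqueeze(1).expand(-1, motion_feats.size(1), -1)
        out = self.mlp(torch.cat([motion_feats, style], dim=-1))
        B, T, _ = motion_feats.shape
        return out.view(B, T, -1, 3)          # (B, T, V, 3) displacements over the template mesh


if __name__ == "__main__":
    B, T = 1, 30
    audio = torch.randn(B, T, 1)
    style = torch.randn(B, 64)                                # per-subject style embedding
    a_hat = AudioEncoder()(audio)
    v_hat = VisemeDecoder()(torch.zeros(B, T, 768), a_hat)    # zeros as dummy previous-motion tokens
    y_hat = StyleAdaptableMotionDecoder()(v_hat, style)
    print(y_hat.shape)                                        # torch.Size([1, 30, 5023, 3])
```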
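CODE SKETCH: LIP-CONTACT LOSS

The lip-contact loss is only described at a high level on this page. The sketch below shows one way such a term could look: at frames aligned with detected bilabial consonants ('m', 'b', 'p'), the gap between corresponding upper- and lower-lip vertices of the predicted mesh is penalized. The vertex index sets, the frame mask, and the averaging scheme are assumptions for illustration, not the paper's exact formulation.

```python
import torch


def lip_contact_loss(pred_verts, bilabial_mask, upper_lip_idx, lower_lip_idx):
    """
    pred_verts:     (B, T, V, 3) predicted vertex positions (template + displacements).
    bilabial_mask:  (B, T) float mask, 1.0 at frames aligned with detected 'm'/'b'/'p'.
    upper_lip_idx / lower_lip_idx: matching 1-D index tensors of corresponding lip vertices.
    """
    upper = pred_verts[:, :, upper_lip_idx, :]            # (B, T, K, 3)
    lower = pred_verts[:, :, lower_lip_idx, :]            # (B, T, K, 3)
    gap = (upper - lower).norm(dim=-1).mean(dim=-1)       # (B, T) mean upper/lower lip distance
    # Penalize the lip gap only at bilabial frames, averaged over those frames.
    return (gap * bilabial_mask).sum() / bilabial_mask.sum().clamp(min=1.0)


if __name__ == "__main__":
    B, T, V = 1, 30, 5023
    mask = torch.zeros(B, T); mask[:, 0] = 1.0            # e.g., a 'p' detected at frame 0
    loss = lip_contact_loss(torch.randn(B, T, V, 3), mask,
                            torch.arange(0, 10), torch.arange(10, 20))
    print(loss.item())
```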