


IMITATOR: PERSONALIZED SPEECH-DRIVEN 3D FACIAL ANIMATION

Balamurugan Thambiraja1, Ikhsanul Habibie2, Sadegh Aliakbarian3, Darren Cosker3, Christian Theobalt2, Justus Thies1

1 Max Planck Institute for Intelligent Systems, Tübingen, Germany
2 Max Planck Institute for Informatics, Saarland, Germany
3 Mesh Labs, Microsoft, Cambridge, UK
Paper arXiv Video Code (coming soon)


OVERVIEW

Imitator is a novel method for personalized speech-driven 3D facial animation.
Given an audio sequence and a personalized style-embedding as input, we generate
person-specific motion sequences with accurate lip closures for bilabial
consonants ('m', 'b', 'p'). The style-embedding of a subject can be computed
from a short reference video (e.g., 5 s).
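
To make the personalization step more concrete, here is a minimal, self-contained sketch of the idea: a pre-trained prior is kept frozen and only a low-dimensional style embedding is optimized against the tracked geometry of the reference clip. The stand-in frozen_prior module, all dimensions, and the plain MSE objective are illustrative assumptions, not the released implementation (the official code is listed as coming soon).

import torch

T, V, d_style = 150, 5023, 64          # ~5 s of frames, mesh vertex count, style dim

# Stand-in for the frozen, style-agnostic prior (motion features + style -> vertices).
frozen_prior = torch.nn.Linear(128 + d_style, V * 3).requires_grad_(False)

motion_feats = torch.randn(T, 128)      # generalized motion features for the clip
ref_vertices = torch.randn(T, V, 3)     # tracked vertices from the reference video

style = torch.zeros(d_style, requires_grad=True)     # the only trainable parameter
opt = torch.optim.Adam([style], lr=1e-2)

for step in range(200):
    pred = frozen_prior(torch.cat([motion_feats, style.expand(T, -1)], dim=-1))
    loss = (pred.view(T, V, 3) - ref_vertices).square().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
# 'style' now encodes the actor's speaking style and can be reused for new audio.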


ABSTRACT

Speech-driven 3D facial animation has been widely explored, with applications in
gaming, character animation, virtual reality, and telepresence systems.
State-of-the-art methods deform the face topology of the target actor to sync
with the input audio, but do not consider the identity-specific speaking style
and facial idiosyncrasies of the target actor, resulting in unrealistic and
inaccurate lip movements.

To address this, we present Imitator, a speech-driven facial expression
synthesis method, which learns identity-specific details from a short input
video and produces novel facial expressions matching the identity-specific
speaking style and facial idiosyncrasies of the target actor.

Specifically, we train a style-agnostic transformer on a large facial expression
dataset, which we use as a prior for audio-driven facial expressions. Based on
this prior, we optimize for the identity-specific speaking style from a short
reference video. To train the prior, we introduce a novel loss function based on
detected bilabial consonants to ensure plausible lip closures and, consequently,
improve the realism of the generated expressions. Through detailed experiments
and a user study, we show that our approach produces temporally coherent facial
expressions from input audio while preserving the speaking style of the target
actors.
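
As an illustration of the bilabial-consonant loss described above, the following is a minimal sketch of a lip-contact penalty. The exact formulation, vertex indexing, and weighting used by the authors are not public; UPPER_LIP_IDX, LOWER_LIP_IDX, and the frame mask are illustrative assumptions.

import torch

# Hypothetical vertex indices of the upper/lower lip contour on the template mesh.
UPPER_LIP_IDX = torch.tensor([3506, 3507, 3508])
LOWER_LIP_IDX = torch.tensor([3533, 3534, 3535])

def lip_contact_loss(pred_vertices: torch.Tensor,
                     bilabial_mask: torch.Tensor) -> torch.Tensor:
    """Penalize open lips on frames where a bilabial consonant (m/b/p) is detected.

    pred_vertices: (T, V, 3) predicted mesh vertices per frame.
    bilabial_mask: (T,) boolean mask, True where a phoneme detector / forced
                   aligner marks an 'm', 'b' or 'p' sound.
    """
    upper = pred_vertices[:, UPPER_LIP_IDX]            # (T, K, 3)
    lower = pred_vertices[:, LOWER_LIP_IDX]            # (T, K, 3)
    gap = (upper - lower).norm(dim=-1).mean(dim=-1)    # mean lip opening per frame
    # Only frames with a detected bilabial consonant contribute to the loss.
    w = bilabial_mask.float()
    return (gap * w).sum() / w.sum().clamp(min=1.0)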


VIDEO




PROPOSED METHOD

Overview of the proposed method. Our method takes audio as input and encodes it
into an audio embedding using a pre-trained Wav2Vec2.0 model. This audio
embedding â1:T is interpreted by an auto-regressive viseme decoder, which
generates generalized motion features v̂1:T. A style-adaptable motion decoder
maps these motion features to person-specific facial expressions ŷ1:T in terms
of vertex displacements on top of a template mesh.
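
For readers who prefer code, below is a rough, simplified sketch of this pipeline, not the authors' implementation: raw audio is encoded with Wav2Vec2.0, an auto-regressive transformer decoder produces motion features, and a style-conditioned decoder maps them to per-vertex displacements. The dimensions, layer counts, the ImitatorSketch name, and the concatenation-based style conditioning are assumptions.

import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class ImitatorSketch(nn.Module):
    def __init__(self, n_vertices=5023, d_model=128, d_style=64):
        super().__init__()
        # Pre-trained speech encoder producing the audio embedding a_hat_{1:T}.
        self.audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        self.audio_proj = nn.Linear(self.audio_encoder.config.hidden_size, d_model)
        # Auto-regressive viseme decoder: attends to previously generated motion
        # features and cross-attends to the audio embedding, producing v_hat_{1:T}.
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.viseme_decoder = nn.TransformerDecoder(layer, num_layers=4)
        # Style-adaptable motion decoder: motion features + style embedding ->
        # vertex displacements y_hat_{1:T} on top of the template mesh.
        self.motion_decoder = nn.Sequential(
            nn.Linear(d_model + d_style, 256), nn.ReLU(),
            nn.Linear(256, n_vertices * 3),
        )
        self.start_token = nn.Parameter(torch.zeros(1, 1, d_model))

    def forward(self, waveform, style):
        # waveform: (1, n_samples) raw audio; style: (d_style,) person embedding.
        audio = self.audio_proj(self.audio_encoder(waveform).last_hidden_state)  # (1, T, d_model)
        T = audio.shape[1]
        motion_feats = self.start_token
        for _ in range(T):  # auto-regressive roll-out of motion features
            L = motion_feats.shape[1]
            causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
            out = self.viseme_decoder(motion_feats, audio, tgt_mask=causal)
            motion_feats = torch.cat([motion_feats, out[:, -1:]], dim=1)
        v_hat = motion_feats[:, 1:]                               # (1, T, d_model)
        style_seq = style.view(1, 1, -1).expand(1, T, -1)
        disp = self.motion_decoder(torch.cat([v_hat, style_seq], dim=-1))
        return disp.view(1, T, -1, 3)  # per-frame vertex displacements

Note that Wav2Vec2.0 produces features at roughly 50 Hz, so in practice they would need to be resampled to the animation frame rate; that step is omitted here for brevity.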


IMPACT OF PERSONALIZATION




IMPACT OF LIP CONTACT LOSS




BIBTEX

@inproceedings{Thambiraja2022Imitator,
  author    = {Thambiraja, Balamurugan and Habibie, Ikhsanul and Aliakbarian, Sadegh and Cosker, Darren and Theobalt, Christian and Thies, Justus},
  title     = {Imitator: Personalized Speech-driven 3D Facial Animation},
  publisher = {arXiv},
  year      = {2022},
}

The source code of this website is mainly borrowed from Keunhong Park's Nerfies website.

Please contact Balamurugan Thambiraja for feedback and questions.