| 2025 |
[DGTalker] DGTalker: Disentangled Generative Latent Space Learning for Audio-Driven Gaussian Talking Heads |
ICCV 2025 |
|
Project |
Gaussian, Latent Space |
| 2025 |
Talking Head Generation via Viewpoint and Lighting Simulation Based on Global Representation |
ACM MM 2025 |
|
|
Depth-based |
| 2025 |
[PESTalk] PESTalk: Speech-Driven 3D Facial Animation with Personalized Emotional Styles |
ACM MM 2025 |
|
|
FLAME |
| 2025 |
[GOES] GOES: 3D Gaussian-based One-shot Head Animation with Any Emotion and Any Style |
ACM MM 2025 |
|
|
One-Shot, 3DGS |
| 2025 |
[See the Speaker] See the Speaker: Crafting High-Resolution Talking Faces from Speech with Prior Guidance and Region Refinement |
TASLP 2025 |
|
|
High-Resolution, Talking Faces, Speech-to-Face, Diffusion |
| 2025 |
[LSF-Animation] LSF-Animation: Label-Free Speech-Driven Facial Animation via Implicit Feature Representation |
SIGGRAPH Asia 2025 |
Code |
|
Label-Free, Speech-Driven, Facial Animation, FLAME |
| 2025 |
[MOSPA] MOSPA: Human Motion Generation Driven by Spatial Audio |
NeurIPS 2025 (Spotlight) |
Code |
|
Spatial Audio, Human Motion Generation, Virtual Human |
| 2025 |
[Unmasking Puppeteers] Unmasking Puppeteers: Leveraging Biometric Leakage to Disarm Impersonation in AI-based Videoconferencing |
Arxiv 2025 |
|
|
Biometric Leakage, AI Videoconferencing, Security |
| 2025 |
[Lookahead Anchoring] Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human Animation |
Arxiv 2025 |
|
Project |
Character Identity, Audio-Driven, Human Animation, Temporal Consistency |
| 2025 |
[MAGIC-Talk] MAGIC-Talk: Motion-aware Audio-Driven Talking Face Generation with Customizable Identity Control |
Arxiv 2025 |
|
|
Motion-aware, Audio-Driven, Talking Face, Identity Control |
| 2025 |
[Playmate2] Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback |
Arxiv 2025 |
|
|
Multi-Character, Audio-Driven, Diffusion Transformer, Reward Feedback |
| 2025 |
[DEMO] DEMO: Disentangled Motion Latent Flow Matching for Fine-Grained Controllable Talking Portrait Synthesis |
Arxiv 2025 |
|
|
Disentangled Motion, Flow Matching, Talking Portrait, Controllable |
| 2025 |
[SyncLipMAE] SyncLipMAE: Contrastive Masked Pretraining for Audio-Visual Talking-Face Representation |
Arxiv 2025 |
|
|
Contrastive Masked Pretraining, Audio-Visual, Talking-Face |
| 2025 |
[EGSTalker] EGSTalker: Real-Time Audio-Driven Talking Head Generation with Efficient Gaussian Deformation |
IEEE SMC 2025 |
|
|
Real-Time, Audio-Driven, Gaussian Deformation, Talking Head |
| 2025 |
[AvatarSync] AvatarSync: Rethinking Talking-Head Animation through Phoneme-Guided Autoregressive Perspective |
Arxiv 2025 |
|
|
Phoneme-Guided, Autoregressive, Talking-Head Animation |
| 2025 |
[When Words Smile] When Words Smile: Generating Diverse Emotional Facial Expressions from Text |
EMNLP 2025 |
|
Project |
Text-to-Expression, Emotional Facial Animation, 3D Avatar |
| 2025 |
[Efficient Long-duration Talking Video Synthesis] Efficient Long-duration Talking Video Synthesis with Linear Diffusion Transformer under Multimodal Guidance |
Arxiv 2025 |
|
|
Long-duration, Talking Video, Diffusion Transformer, Multimodal |
| 2025 |
[PASE] PASE: Phoneme-Aware Speech Encoder to Improve Lip Sync Accuracy for Talking Head Synthesis |
Arxiv 2025 |
|
|
Phoneme-Aware, Lip Sync, Talking Head Synthesis |
| 2025 |
[3DiFACE] 3DiFACE: Synthesizing and Editing Holistic 3D Facial Animation |
Arxiv 2025 |
|
Project |
3D Facial Animation, Diffusion, Editing, Speech-Driven |
| 2025 |
[AudioRTA] Audio Driven Real-Time Facial Animation for Social Telepresence |
SIGGRAPH Asia 2025 |
|
Project |
Real-Time, VR, Diffusion, Social Telepresence |
| 2025 |
[StableDub] StableDub: Taming Diffusion Prior for Generalized and Efficient Visual Dubbing |
Arxiv 2025 |
|
Project |
Visual Dubbing, Diffusion, Mamba-Transformer |
| 2025 |
[KSDiff] KSDiff: Keyframe-Augmented Speech-Aware Dual-Path Diffusion for Facial Animation |
Arxiv 2025 |
|
|
Keyframe, Diffusion, Dual-Path, Facial Animation |
| 2025 |
[SynchroRaMa] SynchroRaMa: Lip-Synchronized and Emotion-Aware Talking Face Generation via Multi-Modal Emotion Embedding |
WACV 2026 |
|
Project |
Multi-Modal, Emotion-Aware, LLM |
| 2025 |
[AU-Landmark] Talking Head Generation via AU-Guided Landmark Prediction |
Arxiv 2025 |
|
|
Action Units, Landmark Prediction, Diffusion |
| 2025 |
[Tiny Voice2Face] Tiny is not small enough: High-quality, low-resource facial animation models through hybrid knowledge distillation |
ACM TOG 2025 (SIGGRAPH) |
|
Project |
Knowledge Distillation, Low-Resource, Real-Time, 3D |
| 2025 |
[PGSTalker] PGSTalker: Real-Time Audio-Driven Talking Head Generation via 3D Gaussian Splatting with Pixel-Aware Density Control |
ICONIP 2025 |
|
|
3DGS, Real-Time, Pixel-Aware, Audio-Driven |
| 2025 |
[StyGazeTalk] Beat on Gaze: Learning Stylized Generation of Gaze and Head Dynamics |
Arxiv 2025 |
|
|
Gaze Control, Head Motion, Style-Aware, 3D |
| 2025 |
[EmoCAST] EmoCAST: Emotional Talking Portrait via Emotive Text Description |
Arxiv 2025 |
|
Project |
Emotional Talking Portrait, Text-Driven, Diffusion |
| 2025 |
[UnAvgLip] Removing Averaging: Personalized Lip-Sync Driven Characters Based on Identity Adapter |
Arxiv 2025 |
|
|
Personalized Lip-Sync, Identity Preservation, Diffusion |
| 2025 |
[Think2Sing] Think2Sing: Orchestrating Structured Motion Subtitles for Singing-Driven 3D Head Animation |
Arxiv 2025 |
|
|
Singing-Driven, 3D Head, Diffusion |
| 2025 |
[InfinityHuman] InfinityHuman: Towards Long-Term Audio-Driven Human |
Arxiv 2025 |
|
Project |
Long-Term, Hand Motion, Pose-Guided |
| 2025 |
[MagicTalk] MagicTalk: Implicit and Explicit Correlation Learning for Diffusion-based Emotional Talking Face Generation |
CVM 2026 |
|
|
Implicit and Explicit Correlation Learning, Emotional Talking Face Generation |
| 2025 |
[Audio2Face-3D] Audio2Face-3D: Audio-driven Realistic Facial Animation For Digital Avatars |
Arxiv 2025 |
|
|
Audio-driven Realistic Facial Animation, Digital Avatars |
| 2025 |
[DisenEmo] DisenEmo: Learning disentangled emotional representation from facial motion for 3D talking head generation |
ICIP 2025 |
|
|
Disentangled Emotional Representation, 3D Talking Head Generation |
| 2025 |
[ExpTalk] ExpTalk: Diverse Emotional Expression via Adaptive Disentanglement and Refined Alignment for Speech-Driven 3D Facial Animation |
IJCAI 2025 |
|
|
Adaptive Disentanglement, Refined Alignment, 3D Facial Animation |
| 2025 |
[SyncGaussian] SyncGaussian: Stable 3D Gaussian-Based Talking Head Generation with Enhanced Lip Sync via Discriminative Speech Feature |
IJCAI 2025 |
|
|
Stable 3D Gaussian-Based Talking Head Generation, Enhanced Lip Sync, Discriminative Speech Feature |
| 2025 |
[Wan-S2V] Wan-S2V: Audio-Driven Cinematic Video Generation |
Arxiv 2025 |
|
|
Cinematic, Audio-Driven, Video Generation |
| 2025 |
[InfiniteTalk] InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing |
Arxiv 2025 |
|
|
Sparse-Frame Dubbing, Full-Body |
| 2025 |
[D^3-Talker] D^3-Talker: Dual-Branch Decoupled Deformation Fields for Few-Shot 3D Talking Head Synthesis |
ECAI 2025 |
|
|
Few-Shot, 3DGS, Deformation Fields |
| 2025 |
[RealTalk] RealTalk: Realistic Emotion-Aware Lifelike Talking-Head Synthesis |
ICCV 2025 Workshop (Artificial Social Intelligence) |
|
|
Emotion, NeRF, VAE |
| 2025 |
[FantasyTalking2] FantasyTalking2: Timestep-Layer Adaptive Preference Optimization for Audio-Driven Portrait Animation |
Arxiv 2025 |
|
Project |
Audio-Driven, Portrait Animation, Preference Optimization |
| 2025 |
[HM-Talker] HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis |
Arxiv 2025 |
|
|
Hybrid Motion, High-Fidelity, Talking Head |
| 2025 |
[StableAvatar] StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation |
Arxiv 2025 |
Code |
Project |
Stable Diffusion |
| 2025 |
[READ] READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation |
Arxiv 2025 |
|
Project |
Real-time, Asynchronous Diffusion, Audio-driven |
| 2025 |
[X-Actor] X-Actor: Emotional and Expressive Long-Range Portrait Acting from Audio |
Arxiv 2025 |
|
Project |
Emotional Portrait, Long-range, Audio-driven |
| 2025 |
[DICE-Talk] DICE-Talk: Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation |
ACM MM 2025 |
|
|
Emotional Portrait, Identity Preservation, Emotion Cooperation |
| 2025 |
[RAP] RAP: Real-time Audio-driven Portrait Animation with Video Diffusion Transformer |
Arxiv 2025 |
|
|
Real-time, Video Diffusion Transformer, Audio-driven |
| 2025 |
[UniTalker] UniTalker: Conversational Speech-Visual Synthesis |
ACM MM 2025 |
|
|
Conversational Speech-Visual, Multimodal, Emotion |
| 2025 |
[M2DAO-Talker] M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation |
Arxiv 2025 |
|
Project |
Multi-granular Motion, Decoupling, Optimization |
| 2025 |
[Preview WB-DH] Preview WB-DH: Towards Whole Body Digital Human Bench for the Generation of Whole-body Talking Avatar Videos |
ICCV 2025 Workshop MMFM4 |
|
Project |
Whole-Body Avatar, Benchmark Dataset |
| 2025 |
[MEDTalk] MEDTalk: Multimodal Controlled 3D Facial Animation with Dynamic Emotions by Disentangled Embedding |
Arxiv 2025 |
Code |
|
Multimodal, 3D Facial Animation, Dynamic Emotions |
| 2025 |
[Think Before You Talk] Think Before You Talk: Enhancing Meaningful Dialogue Generation in Full-Duplex Speech Language Models with Planning-Inspired Text Guidance |
Arxiv 2025 |
|
Project |
Dialogue Generation, Speech Language Models, Planning |
| 2025 |
[KLASSify to Verify] KLASSify to Verify: Audio-Visual Deepfake Detection Using SSL-based Audio and Handcrafted Visual Features |
ACM MM 2025 |
|
|
Deepfake Detection, Audio-Visual, SSL |
| 2025 |
[DiTalker] DiTalker: A Unified DiT-based Framework for High-Quality and Speaking Styles Controllable Portrait Animation |
Arxiv 2025 |
|
Project |
DiT, Portrait Animation, Speaking Styles |
| 2025 |
[Learning Phonetic Context-Dependent Viseme] Learning Phonetic Context-Dependent Viseme for Enhancing Speech-Driven 3D Facial Animation |
Interspeech 2025 |
Project |
|
Phonetic Context, Viseme, 3D Facial Animation |
| 2025 |
[SpA2V] SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation |
ACM MM 2025 |
|
|
Spatial Audio, Video Generation, MLLM |
| 2025 |
[Biometric Verification in Avatar Videos] Is It Really You? Exploring Biometric Verification Scenarios in Photorealistic Talking-Head Avatar Videos |
IEEE IJCB 2025 |
|
|
Biometric Verification, Avatar Security, Facial Motion |
| 2025 |
[Who is a Better Talker] Who is a Better Talker: Subjective and Objective Quality Assessment for AI-Generated Talking Heads |
Arxiv 2025 |
|
|
Quality Assessment, Dataset, THQA-10K |
| 2025 |
[JWB-DH-V1] JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1 |
WiCV @ ICCV 2025 |
|
Project |
Whole-Body Avatar, Benchmark Dataset |
| 2025 |
[Mask-Free Audio-driven Talking Face Generation] Mask-Free Audio-driven Talking Face Generation for Enhanced Visual Quality and Identity Preservation |
Arxiv 2025 |
|
|
Mask-Free, Identity Preservation, Audio-driven |
| 2025 |
[MemoryTalker] MemoryTalker: Personalized Speech-Driven 3D Facial Animation via Audio-Guided Stylization |
ICCV 2025 |
|
Project |
Personalized, 3D Facial Animation, Memory |
| 2025 |
[EchoMimicV3] EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation |
Arxiv 2025 |
|
|
Multi-Modal, Multi-Task, Human Animation |
| 2025 |
[Real-time Generation of Various Types of Nodding] Real-time Generation of Various Types of Nodding for Avatar Attentive Listening System |
ICMI 2025 |
Code |
|
Real-time, Nodding Generation, Avatar Interaction |
| 2025 |
[MoDA] MoDA: Multi-modal Diffusion Architecture for Talking Head Generation |
Arxiv 2025 |
|
Project |
Multi-modal, Diffusion, Talking Head Generation |
| 2025 |
[GGTalker] GGTalker: Talking Head Synthesis with Generalizable Gaussian Priors and Identity-Specific Adaptation |
ICCV 2025 |
Code |
Project |
3D Talking Head, Gaussian Priors, Identity Adaptation |
| 2025 |
[FixTalk] FixTalk: Taming Identity Leakage for High-Quality Talking Head Generation in Extreme Cases |
Arxiv 2025 |
|
|
Identity Leakage, Extreme Cases |
| 2025 |
[OmniHuman-1] OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models |
ICCV 2025 |
|
|
Human Animation, Scaling |
| 2025 |
[Audio-Plane] Audio-Plane: Audio Factorization Plane Gaussian Splatting for Real-Time Talking Head Synthesis |
Arxiv 2025 |
|
|
Audio Factorization, Gaussian Splatting |
| 2025 |
[FIAG] Few-Shot Identity Adaptation for 3D Talking Heads via Global Gaussian Field |
Arxiv 2025 |
|
|
Few-Shot, Global Gaussian Field, 3DGS |
| 2025 |
[MirrorMe] MirrorMe: Towards Realtime and High Fidelity Audio-Driven Halfbody Animation |
Arxiv 2025 |
|
|
Real-time, Half-body Animation |
| 2025 |
[JAM-Flow] JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching |
Arxiv 2025 |
|
|
Flow Matching, Audio-Motion |
| 2025 |
[ARTalk] ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model |
Arxiv 2025 |
|
|
Autoregressive, FLAME, 3D |
| 2025 |
[OmniAvatar] OmniAvatar: Efficient Audio-Driven Avatar Video Generation with Adaptive Body Animation |
Arxiv 2025 |
|
|
Audio-Driven, Body Animation |
| 2025 |
[Audio-Visual Driven Compression] Audio-Visual Driven Compression for Low-Bitrate Talking Head Videos |
ICMR 2025 |
|
|
Compression, Low-Bitrate |
| 2025 |
[SyncTalk++] SyncTalk++: High-Fidelity and Efficient Synchronized Talking Heads Synthesis Using Gaussian Splatting |
Arxiv 2025 |
|
|
3DGS, Synchronization |
| 2025 |
[Loudspeaker Beamforming] Loudspeaker Beamforming to Enhance Speech Recognition Performance of Voice Driven Applications |
ICASSP 2025 |
|
|
Speech Recognition, Beamforming |
| 2025 |
[AlignHuman] AlignHuman: Improving Motion and Fidelity via Timestep-Segment Preference Optimization for Audio-Driven Human Animation |
Arxiv 2025 |
|
|
Preference Optimization, Human Animation |
| 2025 |
[Controllable Expressive 3D Facial Animation] Controllable Expressive 3D Facial Animation via Diffusion in a Unified Multimodal Space |
ICME 2025 |
|
|
3D, Diffusion, Multimodal |
| 2025 |
[PAHA] A Unit Enhancement and Guidance Framework for Audio-Driven Avatar Video Generation |
Arxiv 2025 |
|
|
Parts-Aware, Enhancement |
| 2025 |
[HunyuanVideo-HOMA] HunyuanVideo-HOMA: Generic Human-Object Interaction in Multimodal Driven Human Animation |
Arxiv 2025 |
|
|
Human-Object Interaction, Animation |
| 2025 |
[EmoVOCA] EmoVOCA: Speech-Driven Emotional 3D Talking Heads |
WACV 2025 |
|
|
Emotional, 3D, VOCA |
| 2025 |
[Lipschitz-Driven Noise Robustness] Lipschitz-Driven Noise Robustness in VQ-AE for High-Frequency Texture Repair in ID-Specific Talking Heads |
Arxiv 2025 |
|
|
Noise Robustness, VQ-AE, High-Frequency |
| 2025 |
[LLIA] LLIA -- Enabling Low-Latency Interactive Avatars: Real-Time Audio-Driven Portrait Video Generation with Diffusion Models |
Arxiv 2025 |
|
|
Low-Latency, Real-Time, Interactive |
| 2025 |
[Sonic] Sonic: Shifting Focus to Global Audio Perception in Portrait Animation |
CVPR 2025 |
|
|
Global Audio Perception, Portrait Animation |
| 2025 |
[HALT] High Accuracy, Less Talk (HALT): Reliable LLMs through Capability-Aligned Finetuning |
Arxiv 2025 |
|
|
LLM, Reliability |
| 2025 |
[OmniTalker] OmniTalker: One-shot Real-time Text-Driven Talking Audio-Video Generation With Multimodal Style Mimicking |
Arxiv 2025 |
|
|
Text-Driven, Multimodal Style |
| 2025 |
[Silencer] Silence is Golden: Leveraging Adversarial Examples to Nullify Audio Control in LDM-based Talking-Head Generation |
CVPR 2025 |
|
|
Adversarial Defense, Privacy |
| 2025 |
[EchoMimicV2] EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation |
CVPR 2025 |
|
|
Talking Body |
| 2025 |
[Cocktail-Party AVSR] Cocktail-Party Audio-Visual Speech Recognition |
Interspeech 2025 |
|
|
Audio-Visual Speech Recognition |
| 2025 |
[TalkingMachines] TalkingMachines: Real-Time Audio-Driven FaceTime-Style Video via Autoregressive Diffusion Models |
Arxiv 2025 |
|
|
Real-Time, Autoregressive Diffusion |
| 2025 |
[MMGT] MMGT: Motion Mask Guided Two-Stage Network for Co-Speech Gesture Video Generation |
Arxiv 2025 |
|
|
Co-Speech Gesture, Two-Stage |
| 2025 |
[TalkingHeadBench] TalkingHeadBench: A Multi-Modal Benchmark & Analysis of Talking-Head DeepFake Detection |
Arxiv 2025 |
|
|
DeepFake Detection, Benchmark |
| 2025 |
[V2SFlow] V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow |
ICASSP 2025 |
|
|
Video-to-Speech, Speech Decomposition |
| 2025 |
[IM-Portrait] IM-Portrait: Learning 3D-aware Video Diffusion for Photorealistic Talking Heads from Monocular Videos |
CVPR 2025 |
|
|
3D-aware, Video Diffusion |
| 2025 |
[GAN-based Voice Conversion] Generative Adversarial Network based Voice Conversion: Techniques, Challenges, and Recent Advancements |
Arxiv 2025 |
|
|
Voice Conversion, Survey |
| 2025 |
[HunyuanVideo-Avatar] HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters |
Arxiv 2025 |
|
|
Multi-Character, Animation |
| 2025 |
[MultiTalk] Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation |
Arxiv 2025 |
|
|
Multi-Person, Conversational |
| 2025 |
[FaceEditTalker] FaceEditTalker: Interactive Talking Head Generation with Facial Attribute Editing |
Arxiv 2025 |
|
|
Attribute Editing, Interactive |
| 2025 |
[EdiDub] Video Editing for Audio-Visual Dubbing |
Arxiv 2025 |
|
|
Video Editing, Dubbing |
| 2025 |
[Wav2Sem] Wav2Sem: Plug-and-Play Audio Semantic Decoupling for 3D Speech-Driven Facial Animation |
CVPR 2025 |
|
|
3D, Semantic Decoupling |
| 2025 |
[DualTalk] Dual-Speaker Interaction for 3D Talking Head Conversations |
CVPR 2025 |
|
|
3D, Interaction, Dual, FLAME |
| 2025 |
[AsynFusion] AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars |
Arxiv 2025 |
|
|
Whole-Body, Diffusion |
| 2025 |
Supervising 3D Talking Head Avatars with Analysis-by-Audio-Synthesis |
Arxiv 2025 |
|
|
3D, Diffusion |
| 2025 |
[Playmate] Playmate: Flexible Control of Portrait Animation via 3D-Implicit Space Guided Diffusion |
Arxiv 2025 |
|
|
Diffusion, 3D |
| 2025 |
[M3G] M3G: Multi-Granular Gesture Generator for Audio-Driven Full-Body Human Motion Synthesis |
NeurIPS 2025 |
|
|
Gesture, Full-Body, 3DGS |
| 2025 |
[VTutor] VTutor: An Animated Pedagogical Agent SDK that Provide Real Time Multi-Model Feedback |
Arxiv 2025 |
|
|
SDK, LLM, Real-time |
| 2025 |
[PAHA] PAHA: Parts-Aware Audio-Driven Human Animation with Diffusion Model |
Arxiv 2025 |
|
|
Diffusion |
| 2025 |
[OT-Talk] OT-Talk: Animating 3D Talking Head with Optimal Transportation |
Arxiv 2025 |
|
|
FLAME, 3D |
| 2025 |
[GenSync] GenSync: A Generalized Talking Head Framework for Audio-driven Multi-Subject Lip-Sync using 3D Gaussian Splatting |
CVPRW 2025 |
|
|
3DGS |
| 2025 |
[Model See Model Do] Model See Model Do: Speech-Driven Facial Animation with Style Control |
SIGGRAPH 2025 |
|
|
|
| 2025 |
[FlowDubber] FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing |
Arxiv 2025 |
|
|
LLM, Qwen |
| 2025 |
[KeySync] KeySync: A Robust Approach for Leakage-free Lip Synchronization in High Resolution |
Arxiv 2025 |
|
|
|
| 2025 |
[FREAK] FREAK: Frequency-modulated High-fidelity and Real-time Audio-driven Talking Portrait Synthesis |
ICMR 2025 |
|
|
|
| 2025 |
[MobilePortrait] MobilePortrait: Real-Time One-Shot Neural Head Avatars on Mobile Devices |
CVPR 2025 |
|
|
100+fps |
| 2025 |
[ACTalk] Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation |
Arxiv 2025 |
|
|
|
| 2025 |
[FADA] FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation |
CVPR 2025 |
|
|
Fast Diffusion 12.5X speedup |
| 2025 |
[Loopy] Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency |
ICLR 2025 (Oral) |
|
|
|
| 2025 |
[OmniTalker] OmniTalker: Real-Time Text-Driven Talking Head Generation with In-Context Audio-Visual Style Replication |
Arxiv 2025 |
|
|
Omni |
| 2025 |
[Follow Your Motion] Follow Your Motion: A Generic Temporal Consistency Portrait Editing Framework with Trajectory Guidance |
Arxiv 2025 |
|
|
|
| 2025 |
[Audio-driven Gesture Generation] Audio-driven Gesture Generation via Deviation Feature in the Latent Space |
Arxiv 2025 |
|
|
Gesture |
| 2025 |
[Perceptually Accurate 3D Talking Head] Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics |
CVPR 2025 |
|
|
|
| 2025 |
[MGGTalk] Monocular and Generalizable Gaussian Talking Head Animation |
CVPR 2025 |
Project |
|
One Shot, 3DGS |
| 2025 |
[DeepDubber-V1] DeepDubber-V1: Towards High Quality and Dialogue, Narration, Monologue Adaptive Movie Dubbing Via Multi-Modal Chain-of-Thoughts Reasoning Guidance |
Arxiv 2025 |
|
|
Dubbing |
| 2025 |
[DAMC] Dual Audio-Centric Modality Coupling for Talking Head Generation |
Arxiv 2025 |
|
|
NeRF |
| 2025 |
[AudCast] AudCast: Audio-Driven Human Video Generation by Cascaded Diffusion Transformers |
CVPR 2025 |
|
|
DiT |
| 2025 |
[DisentTalk] DisentTalk: Cross-lingual Talking Face Generation via Semantic Disentangled Diffusion Model |
ICME 2025 |
|
|
|
| 2025 |
[Teller] Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation |
CVPR 2025 |
|
|
Autoregressive |
| 2025 |
[DiffusionTalker] DiffusionTalker: Efficient and Compact Speech-Driven 3D Talking Head via Personalizer-Guided Distillation |
ICME 2025 |
|
|
Diffusion, 3D |
| 2025 |
[Synergizing Motion and Appearance] Synergizing Motion and Appearance: Multi-Scale Compensatory Codebooks for Talking Head Video Generation |
CVPR 2025 |
|
|
|
| 2025 |
[HunyuanPortrait] HunyuanPortrait: Implicit Condition Control for Enhanced Portrait Animation |
CVPR 2025 |
|
|
Hunyuan |
| 2025 |
[MuseTalk] MuseTalk: Real-Time High-Fidelity Video Dubbing via Spatio-Temporal Sampling |
Arxiv 2025 |
|
|
|
| 2025 |
[StyleSpeaker] StyleSpeaker: Audio-Enhanced Fine-Grained Style Modeling for Speech-Driven 3D Facial Animation |
Arxiv 2025 |
|
|
3D |
| 2025 |
[LatentSync] LatentSync: Taming Audio-Conditioned Latent Diffusion Models for Lip Sync with SyncNet Supervision |
Arxiv 2025 |
|
|
|
| 2025 |
[VersaAnimator] Versatile Multimodal Controls for Whole-Body Talking Human Animation |
Arxiv 2025 |
|
|
|
| 2025 |
[MagicInfinite] MagicInfinite: Generating Infinite Talking Videos with Your Words and Voice |
Arxiv 2025 |
|
|
|
| 2025 |
[KeyFace] KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation |
CVPR 2025 |
|
|
Diffusion, Long Sequences |
| 2025 |
[TexTalk] Towards High-fidelity 3D Talking Avatar with Personalized Dynamic Texture |
CVPR 2025 |
|
|
Texture |
| 2025 |
[AdaMesh] AdaMesh: Personalized Facial Expressions and Head Poses for Adaptive Speech-Driven 3D Facial Animation |
IEEE Transactions on Multimedia |
|
Project |
MLoRA, Personalized |
| 2025 |
[InsTaG] InsTaG: Learning Personalized 3D Talking Head from Few-Second Video |
CVPR 2025 |
|
|
Few Shot, 3DGS |
| 2025 |
[FLAP] FLAP: Fully-controllable Audio-driven Portrait Video Generation through 3D head conditioned diffusion model |
Arxiv 2025 |
|
|
Diffusion |
| 2025 |
[NeRF-3DTalker] NeRF-3DTalker: Neural Radiance Field with 3D Prior Aided Audio Disentanglement for Talking Head Synthesis |
ICASSP 2025 |
|
|
|
| 2025 |
Emotional Face-to-Speech |
Arxiv 2025 |
|
|
emotion, face2speech |
| 2025 |
[EmoTalkingGaussian] EmoTalkingGaussian: Continuous Emotion-conditioned Talking Head Synthesis |
Arxiv 2025 |
|
|
emotion, 3DGS |
| 2025 |
[EmoFace] EmoFace: Emotion-Content Disentangled Speech-Driven 3D Talking Face Animation |
Arxiv 2025 |
|
|
Emotion, 3D |
| 2025 |
Towards Dynamic Neural Communication and Speech Neuroprosthesis Based on Viseme Decoding |
ICASSP 2025 |
|
|
Viseme |
| 2025 |
[SyncAnimation] SyncAnimation: A Real-Time End-to-End Framework for Audio-Driven Human Pose and Talking Head Animation |
Arxiv 2025 |
|
|
Human Pose |
| 2025 |
[JoyGen] JoyGen: Audio-Driven 3D Depth-Aware Talking-Face Video Editing |
Arxiv 2025 |
|
|
Depth, JD work |
| 2025 |
[IPTalk] Identity-Preserving Video Dubbing Using Motion Warping |
Arxiv 2025 |
|
|
Video Dubbing |
| 2025 |
[LipGen] LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition |
ICASSP 2025 |
|
|
VSR |
| 2025 |
[DEGSTalk] DEGSTalk: Decomposed Per-Embedding Gaussian Fields for Hair-Preserving Talking Face Synthesis |
ICASSP 2025 |
|
|
Hair-Preserving |
| 2025 |
[UniAvatar] UniAvatar: Taming Lifelike Audio-Driven Talking Head Generation with Comprehensive Motion and Lighting Control |
Arxiv 2025 |
|
|
SD, Lighting control |
| 2024 |
[GLCF] GLCF: A Global-Local Multimodal Coherence Analysis Framework for Talking Face Generation Detection |
Arxiv 2024 |
|
|
Dataset |
| 2024 |
[VQTalker] VQTalker: Towards Multilingual Talking Avatars through Facial Motion Tokenization |
Arxiv 2024 |
|
|
visemes, code book |
| 2024 |
[PointTalk] PointTalk: Audio-Driven Dynamic Lip Point Cloud for 3D Gaussian-based Talking Head Synthesis |
AAAI 2025 |
|
|
Point Cloud, Gaussian Splatting |
| 2024 |
[EmotiveTalk] EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion |
CVPR 2025 review |
|
Project |
Emotion, Expressive, Diffusion |
| 2024 |
[GoHD] GoHD: Gaze-oriented and Highly Disentangled Portrait Animation with Rhythmic Poses and Realistic Expression |
AAAI 2025 |
|
|
Gaze-oriented |
| 2024 |
[EmoDubber] EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing |
CVPR 2025 |
|
|
Emotion, Dubber |
| 2024 |
[PortraitTalk] PortraitTalk: Towards Customizable One-Shot Audio-to-Talking Face Generation |
Arxiv 2024 |
|
|
Diffusion, Attention, One-Shot |
| 2024 |
[DEEPTalk] DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation |
AAAI 2025 |
|
|
3D face, FLAME, Emotion |
| 2024 |
[LatentSync] LatentSync: Audio Conditioned Latent Diffusion Models for Lip Sync |
Arxiv 2024 |
|
|
Diffusion, SyncNet |
| 2024 |
[FLOAT] FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait |
ICCV 2025 |
|
Project |
Flow Matching |
| 2024 |
[SVP] SVP: Style-Enhanced Vivid Portrait Talking Head Diffusion Model |
Arxiv 2024 |
|
|
Diffusion, Style |
| 2024 |
[Mini-Omni] Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming |
Technical report |
|
|
Omni!!! |
| 2024 |
[ControlTalk] Controllable Talking Face Generation by Implicit Facial Keypoints Editing |
Arxiv 2024 |
|
|
Face Edit |
| 2024 |
[SPEAK] SPEAK: Speech-Driven Pose and Emotion-Adjustable Talking Head Generation |
Arxiv 2024 |
|
|
|
| 2024 |
[LokiTalk] LokiTalk: Learning Fine-Grained and Generalizable Correspondences to Enhance NeRF-based Talking Head Synthesis |
Arxiv 2024 |
|
|
NeRF |
| 2024 |
[MEMO] MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation |
Arxiv 2024 |
|
|
Memory |
| 2024 |
[INFP] INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations |
Arxiv 2024 |
|
|
Dyadic Conversations |
| 2024 |
[IF-MDM] IF-MDM: Implicit Face Motion Diffusion Model for High-Fidelity Realtime Talking Head Generation |
Arxiv 2024 |
|
|
Motion Diffusion Model |
| 2024 |
[MemFace] Memories are One-to-Many Mapping Alleviators in Talking Face Generation |
IEEE 2024 |
|
|
Memory |
| 2024 |
[Ditto] Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis |
Arxiv 2024 |
|
|
Diffusion |
| 2024 |
[GaussianSpeech] GaussianSpeech: Audio-Driven Gaussian Avatars |
Arxiv 2024 |
|
|
3DGS, 3D |
| 2024 |
[LetsTalk] LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis |
Arxiv 2024 |
|
|
|
| 2024 |
[S^3D-NeRF] S^3D-NeRF: Single-Shot Speech-Driven Neural Radiance Field for High Fidelity Talking Head Synthesis |
ECCV 2024 |
|
|
|
| 2024 |
[LES-Talker] LES-Talker: Fine-Grained Emotion Editing for Talking Head Generation in Linear Emotion Space |
Arxiv 2024 |
|
|
Fine-Grained Emotion |
| 2024 |
[JoyVASA] JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation |
Arxiv 2024 |
|
|
Diffusion, VASA |
| 2024 |
[JoyHallo] JoyHallo: Digital human model for Mandarin |
Arxiv 2024 |
|
|
Diffusion, Hallo |
| 2024 |
[Hallo2] Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation |
ICLR 2025 |
|
|
Diffusion, Hallo |
| 2024 |
Audio-Driven Emotional 3D Talking-Head Generation |
Arxiv 2024 |
|
|
Emotion |
| 2024 |
[Stereo-Talker] Stereo-Talker: Audio-driven 3D Human Synthesis with Prior-Guided Mixture-of-Experts |
Arxiv 2024 |
|
|
|
| 2024 |
[Takin-ADA] Takin-ADA: Emotion Controllable Audio-Driven Animation with Canonical and Landmark Loss Optimization |
Arxiv 2024 |
|
|
|
| 2024 |
[DAWN] DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation |
Arxiv 2024 |
|
|
Non-autoregressive Diffusion |
| 2024 |
[LaDTalk] LaDTalk: Latent Denoising for Synthesizing Talking Head Videos with High Frequency Details |
Arxiv 2024 |
|
|
|
| 2024 |
Diverse Code Query Learning for Speech-Driven Facial Animation |
Arxiv 2024 |
|
|
|
| 2024 |
[TalkinNeRF] TalkinNeRF: Animatable Neural Fields for Full-Body Talking Humans |
ECCVW 2024 |
|
|
NeRF |
| 2024 |
[ProbTalk3D] ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE |
SIGGRAPH MIG 2024 |
|
|
3D |
| 2024 |
[JEAN] JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation |
BMVC 2024 |
|
|
NeRF |
| 2024 |
[3DFacePolicy] 3DFacePolicy: Speech-Driven 3D Facial Animation with Diffusion Policy |
Arxiv 2024 |
|
|
|
| 2024 |
[LawDNet] LawDNet: Enhanced Audio-Driven Lip Synthesis via Local Affine Warping Deformation |
Arxiv 2024 |
|
|
|
| 2024 |
[StyleTalk++] StyleTalk++: A Unified Framework for Controlling the Speaking Styles of Talking Heads |
TPAMI 2024 |
|
|
|
| 2024 |
[DiffTED] DiffTED: One-shot Audio-driven TED Talk Video Generation with Diffusion-based Co-speech Gestures |
Arxiv 2024 |
|
|
diffusion |
| 2024 |
[EMOdiffhead] EMOdiffhead: Continuously Emotional Control in Talking Head Generation via Diffusion |
Arxiv 2024 |
|
|
Diffusion |
| 2024 |
[PersonaTalk] PersonaTalk: Bring Attention to Your Persona in Visual Dubbing |
SIGGRAPH Asia 2024 |
|
|
|
| 2024 |
KAN-Based Fusion of Dual-Domain for Audio-Driven Facial Landmarks Generation |
Arxiv 2024 |
|
|
KAN |
| 2024 |
[TalkLoRA] TalkLoRA: Low-Rank Adaptation for Speech-Driven Animation |
Arxiv 2024 |
|
|
LoRA |
| 2024 |
[Avatar Concept Slider] Avatar Concept Slider: Manipulate Concepts In Your Human Avatar With Fine-grained Control |
Arxiv 2024 |
|
|
|
| 2024 |
[G3FA] G3FA: Geometry-guided GAN for Face Animation |
BMVC 2024 |
|
|
|
| 2024 |
[Meta-Face] Meta-Learning Empowered Meta-Face: Personalized Speaking Style Adaptation for Audio-Driven 3D Talking Face Animation |
Arxiv 2024 |
|
|
|
| 2024 |
[DEEPTalk] DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation |
Arxiv 2024 |
|
|
|
| 2024 |
High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model |
TIP ? |
|
|
|
| 2024 |
Style-Preserving Lip Sync via Audio-Aware Style Reference |
TIP ? |
|
|
|
| 2024 |
[Talk to the Wall] Talk to the Wall: The Role of Speech Interaction in Collaborative Visual Analytics |
TVCG 2024 |
|
|
Collaborative |
| 2024 |
[MDT-A2G] MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation |
Arxiv 2024 |
|
|
Co-Speech Gesture |
| 2024 |
[GLDiTalker] GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer |
Arxiv 2024 |
|
|
|
| 2024 |
[UniTalker] UniTalker: Scaling up Audio-Driven 3D Facial Animation through A Unified Model |
Arxiv 2024 |
|
|
|
| 2024 |
[DiM-Gesture] DiM-Gesture: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 framework |
Arxiv 2024 |
|
|
|
| 2024 |
[What if Red Can Talk?] What if Red Can Talk? Dynamic Dialogue Generation Using Large Language Models |
ACL Wordplay 2024 |
|
|
|
| 2024 |
[LinguaLinker] LinguaLinker: Audio-Driven Portraits Animation with Implicit Facial Control Enhancement |
Arxiv 2024 |
|
|
|
| 2024 |
[RealTalk] RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network |
Arxiv 2024 |
|
|
|
| 2024 |
Landmark-guided Diffusion Model for High-fidelity and Temporally Coherent Talking Head Generation |
Arxiv 2024 |
|
|
|
| 2024 |
[JambaTalk] JambaTalk: Speech-Driven 3D Talking Head Generation Based on Hybrid Transformer-Mamba Model |
Arxiv 2024 |
|
|
3D |
| 2024 |
[Talk Less, Interact Better] Talk Less, Interact Better: Evaluating In-context Conversational Adaptation in Multimodal LLMs |
COLM 2024 |
|
|
LLM |
| 2024 |
[Digital Avatars] Digital Avatars: Framework Development and Their Evaluation |
Arxiv 2024 |
|
|
|
| 2024 |
[EmoTalk3D] EmoTalk3D: High-Fidelity Free-View Synthesis of Emotional 3D Talking Head |
ECCV 2024 |
|
|
|
| 2024 |
[PAV] PAV: Personalized Head Avatar from Unstructured Video Collection |
ECCV 2024 |
|
|
|
| 2024 |
Text-based Talking Video Editing with Cascaded Conditional Diffusion |
Arxiv 2024 |
|
|
|
| 2024 |
[EmoFace] EmoFace: Audio-driven Emotional 3D Face Animation |
IEEE Conference Virtual Reality and 3D User Interfaces (VR). IEEE, 2024 |
|
|
|
| 2024 |
Learning Online Scale Transformation for Talking Head Video Generation |
Arxiv 2024 |
|
|
|
| 2024 |
[EchoMimic] EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditioning |
AAAI 2025 |
|
|
🔥Alibaba |
| 2024 |
Audio-driven High-resolution Seamless Talking Head Video Editing via StyleGAN |
Arxiv 2024 |
|
|
StyleGAN |
| 2024 |
Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert |
Interspeech 2024 |
|
|
3D |
| 2024 |
[MultiTalk] MultiTalk: Enhancing 3D Talking Head Generation Across Languages with Multilingual Video Dataset |
Interspeech 2024 |
|
|
3D, Dataset |
| 2024 |
[NLDF] NLDF: Neural Light Dynamic Fields for Efficient 3D Talking Head Generation |
Arxiv 2024 |
|
|
NeRF |
| 2024 |
[Make Your Actor Talk] Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement |
Arxiv 2024 |
|
|
|
| 2024 |
[Talk With Human-like Agents] Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction |
ACL 2024 |
|
|
|
| 2024 |
[V-Express] V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation |
Technical Report |
|
|
🔥EMO, Diffusion, Open-source |
| 2024 |
[CVTHead] CVTHead: One-shot Controllable Head Avatar with Vertex-feature Transformer |
WACV 2024 |
|
|
|
| 2024 |
[Hallo] Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation |
Arxiv 2024 |
|
|
🔥EMO, Diffusion, Open-source |
| 2024 |
[Emotional Conversation] Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation |
Arxiv 2024 |
|
|
Emotion |
| 2024 |
[MultiDialog] Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation |
ACL 2024 |
|
|
dataset |
| 2024 |
[ControlTalk] Controllable Talking Face Generation by Implicit Facial Keypoints Editing |
Arxiv 2024 |
|
|
Controller |
| 2024 |
[InstructAvatar] InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation |
Arxiv 2024 |
|
|
Text-Guided |
| 2024 |
[Listen, Disentangle, and Control] Listen, Disentangle, and Control: Controllable Speech-Driven Talking Head Generation |
Arxiv 2024 |
|
|
A Benchmark and Survey |
| 2024 |
[NeRFFaceSpeech] NeRFFaceSpeech: One-shot Audio-driven 3D Talking Head Synthesis via Generative Prior |
CVPRW 2024 |
|
|
SadTalker+NeRF |
| 2024 |
[SwapTalk] SwapTalk: Audio-Driven Talking Face Generation with One-Shot Customization in Latent Space |
ICASSP 2025 |
|
|
|
| 2024 |
[AniTalker] AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding |
Arxiv 2024 |
|
|
|
| 2024 |
[EMOPortraits] EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars |
Arxiv 2024 |
|
|
EMO |
| 2024 |
[GaussianTalker] GaussianTalker: Real-Time High-Fidelity Talking Head Synthesis with Audio-Driven 3D Gaussian Splatting |
ACM MM 2024 |
|
|
🔥Gaussian Splatting |
| 2024 |
[CSTalk] CSTalk: Correlation Supervised Speech-driven 3D Emotional Facial Animation Generation |
Arxiv 2024 |
|
|
Emotion |
| 2024 |
[GSTalker] GSTalker: Real-time Audio-Driven Talking Face Generation via Deformable Gaussian Splatting |
Arxiv 2024 |
|
|
🔥Gaussian Splatting |
| 2024 |
[GaussianTalker] GaussianTalker: Speaker-specific Talking Head Synthesis via 3D Gaussian Splatting |
ACM MM 2024 |
|
|
🔥Gaussian Splatting |
| 2024 |
[TalkingGaussian] TalkingGaussian: Structure-Persistent 3D Talking Head Synthesis via Gaussian Splatting |
ECCV 2024 |
|
|
🔥Gaussian Splatting |
| 2024 |
[Learn2Talk] Learn2Talk: 3D Talking Face Learns from 2D Talking Face |
Arxiv 2024 |
|
|
🔥Gaussian Splatting |
| 2024 |
[VASA-1] VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time |
NeurIPS 2024 (Oral) |
|
|
🔥🔥🔥Awesome,Microsoft |
| 2024 |
Pose-Aware 3D Talking Face Synthesis using Geometry-guided Audio-Vertices Attention |
IEEE 2024 |
|
|
|
| 2024 |
[THQA] THQA: A Perceptual Quality Assessment Database for Talking Heads |
Arxiv 2024 |
|
|
|
| 2024 |
[EDTalk] EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis |
ECCV 2024 (Oral) |
|
|
Emotion |
| 2024 |
[FaceChain-ImagineID] FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio |
Arxiv 2024 |
|
|
|
| 2024 |
[Talk3D] Talk3D: High-Fidelity Talking Portrait Synthesis via Personalized 3D Generative Prior |
Arxiv 2024 |
|
|
|
| 2024 |
[AniPortrait] AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation |
Arxiv 2024 |
|
|
🔥🔥🔥Similar to EMO |
| 2024 |
[Make-Your-Anchor] Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework |
CVPR 2024 |
|
|
|
| 2024 |
Adaptive Super Resolution For One-Shot Talking-Head Generation |
ICASSP 2024 |
|
|
|
| 2024 |
[VLOGGER] VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis |
Arxiv 2024 |
|
|
Embodied |
| 2024 |
[EmoVOCA] EmoVOCA: Speech-Driven Emotional 3D Talking Heads |
Arxiv 2024 |
|
|
3D, VOCA |
| 2024 |
[ScanTalk] ScanTalk: 3D Talking Heads from Unregistered Scans |
ECCV 2024 |
|
|
3D |
| 2024 |
[Style2Talker] Style2Talker: High-Resolution Talking Head Generation with Emotion Style and Art Style |
Arxiv 2024 |
|
|
|
| 2024 |
[EMO] EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions |
Arxiv 2024 |
|
|
🔥🔥🔥Amazing, Diffusion |
| 2024 |
[G4G] G4G: A Generic Framework for High Fidelity Talking Face Generation with Fine-grained Intra-modal Alignment |
Arxiv 2024 |
|
|
A Generic Framework |
| 2024 |
Learning Dynamic Tetrahedra for High-Quality Talking Head Synthesis |
CVPR 2024 |
|
|
High-Quality |
| 2024 |
[DiffSpeaker] DiffSpeaker: Speech-Driven 3D Facial Animation with Diffusion Transformer |
Arxiv 2024 |
|
|
3D |
| 2024 |
[EmoSpeaker] EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face Generation |
Arxiv 2024 |
|
|
Emotion |
| 2024 |
[NeRF-AD] NeRF-AD: Neural Radiance Field with Attention-based Disentanglement for Talking Face Synthesis |
ICASSP 2024 |
|
|
AU |
| 2024 |
[Real3D-Portrait] Real3D-Portrait: One-shot Realistic 3D Talking Portrait Synthesis |
ICLR 2024 |
|
|
3D, One-Shot, Realistic |
| 2024 |
[SyncTalk] SyncTalk: The Devil😈 is in the Synchronization for Talking Head Synthesis |
CVPR 2024 |
|
|
😈Talking Head |
| 2024 |
[AdaMesh] AdaMesh: Personalized Facial Expressions and Head Poses for Adaptive Speech-Driven 3D Facial Animation |
Arxiv 2024 |
|
|
3D,Mesh |
| 2024 |
[DREAM-Talk] DREAM-Talk: Diffusion-based Realistic Emotional Audio-driven Method for Single Image Talking Face Generation |
Arxiv 2024 |
|
|
Emotion |
| 2024 |
[AE-NeRF] AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis |
AAAI 2024 |
|
|
|
| 2024 |
[R2-Talker] R2-Talker: Realistic Real-Time Talking Head Synthesis with Hash Grid Landmarks Encoding and Progressive Multilayer Conditioning |
Arxiv 2024 |
|
|
Based on RAD-NeRF |
| 2024 |
[DT-NeRF] DT-NeRF: Decomposed Triplane-Hash Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis |
ICASSP 2024 |
- |
- |
ER-NeRF |
| 2023 |
[ER-NeRF] Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis |
ICCV 2023 |
|
|
Tri-plane |
| 2023 |
[LipNeRF] LipNeRF: What is the right feature space to lip-sync a NeRF? |
FG 2023 |
|
|
Wav2lip |
| 2024 |
[VectorTalker] VectorTalker: SVG Talking Face Generation with Progressive Vectorisation |
Arxiv 2024 |
|
|
SVG |
| 2024 |
[Mimic] Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation |
AAAI 2024 |
|
|
3D |
| 2024 |
[DreamTalk] DreamTalk: When Expressive Talking Head Generation Meets Diffusion Probabilistic Models |
Arxiv 2024 |
|
|
Diffusion |
| 2024 |
[FaceTalk] FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models |
Arxiv 2024 |
|
|
|
| 2024 |
[GSmoothFace] GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance |
Arxiv 2024 |
|
|
3D |
| 2024 |
[GMTalker] GMTalker: Gaussian Mixture based Emotional talking video Portraits |
Arxiv 2024 |
|
|
Emotion |
| 2024 |
[VividTalk] VividTalk: One-Shot Audio-Driven Talking Head Generation Based on 3D Hybrid Prior |
Arxiv 2024 |
|
|
Mesh |
| 2024 |
[GAIA] GAIA: Zero-shot Talking Avatar Generation |
Arxiv 2024 |
Code (coming) |
|
😲😲😲 |
| 2023 |
Implicit Identity Representation Conditioned Memory Compensation Network for Talking Head Video Generation |
ICCV 2023 |
|
|
- |
| 2023 |
[ToonTalker] ToonTalker: Cross-Domain Face Reenactment |
ICCV 2023 |
- |
- |
- |
| 2023 |
Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation |
ICCV 2023 |
|
|
- |
| 2023 |
[EMMN] EMMN: Emotional Motion Memory Network for Audio-driven Emotional Talking Face Generation |
ICCV 2023 |
- |
- |
Emotion |
| 2023 |
Emotional Listener Portrait: Realistic Listener Motion Simulation in Conversation |
ICCV 2023 |
- |
- |
Emotion,LHG |
| 2023 |
[MODA] MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions |
ICCV 2023 |
- |
- |
- |
| 2023 |
[FaceDiffuser] FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using Diffusion |
ACM SIGGRAPH MIG 2023 |
|
|
🔥Diffusion,3D |
| 2023 |
Audio-Driven Dubbing for User Generated Contents via Style-Aware Semi-Parametric Synthesis |
TCSVT 2023 |
- |
- |
|
| 2023 |
[SadTalker] SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation |
CVPR 2023 |
|
|
3D,Single Image |
| 2023 |
[EmoTalk] EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation |
ICCV 2023 |
|
|
3D,Emotion |
| 2023 |
Emotional Talking Head Generation based on Memory-Sharing and Attention-Augmented Networks |
Interspeech 2023 |
|
|
Emotion |
| 2023 |
[DINet] DINet: Deformation Inpainting Network for Realistic Face Visually Dubbing on High Resolution Video |
AAAI 2023 |
|
|
|
| 2023 |
[StyleTalk] StyleTalk: One-shot Talking Head Generation with Controllable Speaking Styles |
AAAI 2023 |
|
|
Style |
| 2023 |
High-fidelity Generalized Emotional Talking Face Generation with Multi-modal Emotion Space Learning |
CVPR 2023 |
|
|
Emotion |
| 2023 |
[StyleSync] StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-based Generator |
CVPR 2023 |
|
|
- |
| 2023 |
[TalkLip] TalkLip: Seeing What You Said - Talking Face Generation Guided by a Lip Reading Expert |
CVPR 2023 |
|
|
|
| 2023 |
[CodeTalker] CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior |
CVPR 2023 |
|
|
3D,codebook |
| 2023 |
[EmoGen] Emotionally Enhanced Talking Face Generation |
Arxiv 2023 |
|
|
Emotion |
| 2023 |
[DAE-Talker] DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder |
ACM MM 2023 |
|
|
🔥Diffusion |
| 2023 |
[READ Avatars] READ Avatars: Realistic Emotion-controllable Audio Driven Avatars |
Arxiv 2023 |
|
|
- |
| 2023 |
[DiffTalk] DiffTalk: Crafting Diffusion Models for Generalized Talking Head Synthesis |
CVPR 2023 |
|
|
🔥Diffusion |
| 2023 |
[Diffused Heads] Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation |
Arxiv 2023 |
- |
|
🔥Diffusion |
| 2022 |
[VideoReTalking] VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild |
SIGGRAPH 2022 |
|
|
|
| 2022 |
[GC-AVT] Expressive Talking Head Generation with Granular Audio-Visual Control |
CVPR 2022 |
- |
- |
- |
| 2022 |
Talking Face Generation with Multilingual TTS |
CVPR 2022 |
|
|
- |
| 2022 |
[EAMM] EAMM: One-Shot Emotional Talking Face via Audio-Based Emotion-Aware Motion Model |
SIGGRAPH 2022 |
- |
- |
Emotion |
| 2022 |
[SPACEx] SPACEx 🚀: Speech-driven Portrait Animation with Controllable Expression |
arXiv 2022 |
- |
Project |
- |
| 2022 |
[AV-CAT] Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers |
SIGGRAPH Asia 2022 |
- |
- |
- |
| 2022 |
[MemFace] Memories are One-to-Many Mapping Alleviators in Talking Face Generation |
arXiv 2022 |
- |
- |
- |
| 2021 |
[PC-AVS] PC-AVS: Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation |
CVPR 2021 |
|
|
- |
| 2021 |
[IATS] Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis |
ACM MM 2021 |
- |
- |
- |
| 2021 |
[Speech2Talking-Face] Speech2Talking-Face: Inferring and Driving a Face with Synchronized Audio-Visual Representation |
IJCAI 2021 |
- |
- |
- |
| 2021 |
[FAU] Talking Head Generation with Audio and Speech Related Facial Action Units |
BMVC 2021 |
- |
- |
AU |
| 2021 |
[EVP] Audio-Driven Emotional Video Portraits |
CVPR 2021 |
|
|
Emotion |
| 2020 |
[Wav2Lip] A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild |
ACM Multimedia 2020 |
|
|
- |
| 2020 |
[RhythmicHead] Talking-head Generation with Rhythmic Head Motion |
ECCV 2020 |
|
|
- |
| 2020 |
[MakeItTalk] Speaker-Aware Talking-Head Animation |
SIGGRAPH Asia 2020 |
|
|
- |
| 2020 |
[Neural Voice Puppetry] Audio-driven Facial Reenactment |
ECCV 2020 |
|
|
- |
| 2020 |
[MEAD] A Large-scale Audio-visual Dataset for Emotional Talking-face Generation |
ECCV 2020 |
|
|
- |
| 2020 |
Realistic Speech-Driven Facial Animation with GANs |
IJCV 2020 |
|
|
- |
| 2019 |
[DAVS] Talking Face Generation by Adversarially Disentangled Audio-Visual Representation |
AAAI 2019 |
|
|
- |
| 2019 |
[ATVGnet] Hierarchical Cross-modal Talking Face Generation with Dynamic Pixel-wise Loss |
CVPR 2019 |
|
|
- |
| 2018 |
Lip Movements Generation at a Glance |
ECCV 2018 |
|
|
- |
| 2018 |
[VisemeNet] Audio-Driven Animator-Centric Speech Animation |
SIGGRAPH 2018 |
|
|
- |
| 2017 |
[Synthesizing Obama] Learning Lip Sync From Audio |
SIGGRAPH 2017 |
|
|
- |
| 2017 |
[You Said That?] Synthesising Talking Faces From Audio |
BMVC 2017 |
|
|
- |
| 2017 |
Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion |
SIGGRAPH 2017 |
|
|
- |
| 2017 |
A Deep Learning Approach for Generalized Speech Animation |
SIGGRAPH 2017 |
|
|
- |
| 2016 |
[LRW] Lip Reading in the Wild |
ACCV 2016 |
|
|
- |