MoDA: Multi-modal Diffusion Architecture for Talking Head Generation

Abstract

Generating talking heads for arbitrary identities from speech audio remains a crucial problem for the virtual metaverse. Despite recent progress, current methods still struggle to synthesize diverse facial expressions and natural head movements while keeping lip motion synchronized with the audio. The main challenge lies in the stylistic discrepancies among speech audio, individual identity, and portrait dynamics. To address this inter-modal inconsistency, we introduce MoDA, a multi-modal diffusion architecture built on two key designs. First, MoDA explicitly models the interaction among motion, audio, and auxiliary conditions, enhancing overall facial expressions and head dynamics. Second, a coarse-to-fine fusion strategy progressively integrates the different conditions, ensuring effective feature fusion. Experimental results demonstrate that MoDA improves video diversity, realism, and efficiency, making it suitable for real-world applications.
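
The abstract describes a two-stage, coarse-to-fine fusion of motion, audio, and auxiliary conditions. The sketch below is not the authors' implementation; it is a minimal PyTorch illustration of one way such a fusion could be wired, assuming cross-attention blocks and illustrative module names (CrossAttentionBlock, CoarseToFineFusion), dimensions, and a two-stage split that are not taken from the paper.

```python
# Conceptual sketch (hypothetical, not MoDA's released code): coarse-to-fine
# fusion of motion, audio, and auxiliary condition embeddings with standard
# PyTorch cross-attention. All names and shapes are illustrative assumptions.
import torch
import torch.nn as nn


class CrossAttentionBlock(nn.Module):
    """Motion tokens attend to a conditioning sequence, then pass through an FFN."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(query=x, key=cond, value=cond)
        x = self.norm(x + attn_out)
        return x + self.ffn(x)


class CoarseToFineFusion(nn.Module):
    """Coarse pass over all conditions jointly, then fine passes per modality."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.coarse = CrossAttentionBlock(dim)
        self.fine_audio = CrossAttentionBlock(dim)
        self.fine_aux = CrossAttentionBlock(dim)

    def forward(self, motion: torch.Tensor, audio: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        # Coarse stage: fuse all conditions at once for global context.
        joint_cond = torch.cat([audio, aux], dim=1)
        x = self.coarse(motion, joint_cond)
        # Fine stage: refine against each condition stream separately.
        x = self.fine_audio(x, audio)
        x = self.fine_aux(x, aux)
        return x


if __name__ == "__main__":
    B, T, D = 2, 16, 512
    fusion = CoarseToFineFusion(D)
    motion = torch.randn(B, T, D)   # noisy motion latents
    audio = torch.randn(B, T, D)    # audio features (e.g., from a speech encoder)
    aux = torch.randn(B, 4, D)      # auxiliary conditions (identity, emotion, ...)
    print(fusion(motion, audio, aux).shape)  # torch.Size([2, 16, 512])
```

The two-stage ordering here simply mirrors the "progressively integrate different conditions" idea: a coarse joint pass establishes global context before per-modality refinement.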

Gallery

Qualitative Evaluation.