Automated Lip-Reading

Automated Lip-Reading: Extracting Speech from Video of a Talking Face

Project Team: Yue Wang (Linguistics, SFU), Ghassan Hamarneh (Computing Science, SFU), Paul Tupper (Mathematics, SFU), Dawn Behne (Psychology, The Norwegian University of Science and Technology), Joan Sereno (Linguistics, University of Kansas), Allard Jongman (Linguistics, University of Kansas). 

In face-to-face conversation, listeners perceive speech using both the voice and the coordinated facial movements of the speaker. In noisy environments, seeing a speaker’s facial movements makes speech perception easier. Similarly, with multimedia, we rely on visual cues when the audio is transmitted poorly (e.g., during video conferencing) or is masked by background noise. In the current era of social media, we increasingly encounter multimedia-induced challenges in which the audio signal in a video is of poor quality or misaligned (e.g., via Skype). The next big question for speech scientists, and one relevant to all multimedia users, is what speech information can be extracted from a face and whether the corresponding audio signal can be recreated from it to enhance speech.