This image demonstrates how the JALI technology adds layers to its animation of facial expressions, to procedurally generate emotion-driven speech performances. (Image credit: JALI Research Inc.)

How an SFU and U of T collaboration is powering one of the most popular video games of 2020

December 15, 2020

A linguistic AI technology automatically generates character speech performances.

By Suraaj Aulakh

Released last week, Cyberpunk 2077 is one of the most highly anticipated video games of the year and is being praised for the realistic facial animation of its characters. A team of Canadian computer scientists, including SFU’s own Eugene Fiume, is behind the technology that is powering the game through automatically generated facial expressions.

Fiume is the dean of the Faculty of Applied Sciences and a professor in the School of Computing Science. He is well known in the field of computer graphics for his pioneering work in developing key CGI technologies that we take for granted today. Fiume’s contributions in the field have led to many accolades including appointment as a Fellow of the Royal Society of Canada and, most recently, induction to the ACM SIGGRAPH Academy.

He is also the co-founder of JALI Research (short for Jaw And Lip Integration Research), where he and his colleagues from the University of Toronto explore the use of anatomical, linguistic, and artificial intelligence (AI) methods to create 3D facial animations that are informed by real human expressions. Their research caught the eye of video game developer CD Projekt Red, which recruited the team to transfer the technology to Cyberpunk 2077.

However, the intention is not merely to accurately animate parts of the face.

“We are not just focused on getting the perfect lip synch, eye blinks and all the details one might think of when animating parts of the face,” says Fiume. “Instead we’re interested in the creation of a believable performance. People want to look at the whole, not just the parts.

“Our goal is to take a spoken performance and turn it into a visual one. An animator can always get involved, although we can create compelling performances automatically.”

SFU professor and dean Eugene Fiume is one of the researchers behind the technology.

The JALI technology allows 3D character faces to automatically perform the audio dialogue that is provided, with a high degree of accuracy. This eliminates the need for facial motion capture – in which animations are created from facial expressions captured during an actor’s performance of the same dialogue – and for the long hours spent animating a scene one key frame at a time. That said, the JALI technology is also able to work in tandem with motion capture techniques.

“Human speech contains a huge amount of information,” notes Fiume. “We can extract expression and emphasis from human speech in many different languages and, putting that information together with a text transcript, we can animate a 3D virtual model of the face.

“In effect, we transform a voice actor into a synthetic facial performer in a virtual environment.”

This is a game-changer (pardon the pun) as it allows game designers to add more complexity to their virtual world without being limited by their animation capacity. This creates opportunities to include more characters, varying facial structures and speaking styles, longer storylines, multiple ways for conversations to unfold, and the ability to switch between languages. JALI allows character speech performances to seamlessly adapt to the situation and the choices of the player, making games like Cyberpunk 2077 very appealing.

How the technology works

Creating 3D characters that have the ability to perform in this manner requires extensive research on human facial expressions as well as machine learning and rule-based AI methodologies.

“When we speak, our emotions physically project into our facial muscle actions,” explains Fiume.

“Not only does this affect the movement of our lips, tongue and jaw, but human emotions also affect the blinking of our eyes, pupil dilation, saccades, and eye gaze, as well as furrowing of our eyebrows and forehead, and motions of our neck and head.”

Taking these factors into consideration, the researchers developed a procedural AI workflow that allows animators to automatically generate realistic character performances, with the ability to customize a performance as desired. Recently the JALI and CD Projekt Red teams presented this research at SIGGRAPH 2020.

With this new technology, the animator can feed in the text transcript of the dialogue, a voice recording and any other custom parameters. The system will analyze the volume, pitch and speech rate of the audio file, as well as the phoneme alignment of the transcript (i.e., the breakdown of sounds within words), to determine the optimal animation of the jaw, lips, tongue, eyes, eyelids, eyebrows, head and neck.
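
As a rough illustration of the kind of analysis described above, the following Python sketch takes raw audio samples and a phoneme alignment, measures loudness per phoneme and overall speech rate, and turns loudness into a simple jaw-opening value. It is not JALI's method: the names PhonemeSpan, speech_rate and jaw_curve, and the rule that louder phonemes open the jaw wider, are assumptions made purely for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class PhonemeSpan:
    """One aligned phoneme (hypothetical structure, not JALI's format)."""
    phoneme: str
    start: float  # seconds
    end: float    # seconds


def rms(samples):
    """Root-mean-square amplitude of a list of samples (0.0 if empty)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def speech_rate(spans):
    """Phonemes per second across the aligned stretch of audio."""
    if not spans:
        return 0.0
    duration = spans[-1].end - spans[0].start
    return len(spans) / duration if duration > 0 else 0.0


def jaw_curve(samples, sample_rate, spans):
    """Return (phoneme, start_time, jaw_open) tuples, jaw_open in [0, 1].

    Assumed rule for illustration only: jaw opening is the phoneme's
    loudness divided by the loudest phoneme in the line.
    """
    levels = []
    for span in spans:
        lo = int(span.start * sample_rate)
        hi = int(span.end * sample_rate)
        levels.append(rms(samples[lo:hi]))
    peak = max(levels, default=0.0)
    return [
        (span.phoneme, span.start, level / peak if peak > 0 else 0.0)
        for span, level in zip(spans, levels)
    ]


if __name__ == "__main__":
    # Two seconds of made-up audio at 100 samples per second:
    # a quiet first second followed by a louder second one.
    audio = [0.5] * 100 + [1.0] * 100
    alignment = [PhonemeSpan("HH", 0.0, 1.0), PhonemeSpan("AY", 1.0, 2.0)]
    print(speech_rate(alignment))
    print(jaw_curve(audio, 100, alignment))
```

A production system would also need pitch tracking, an automatic aligner to produce the phoneme timings, and separate controls for the lips, tongue, eyes, brows, head and neck; this sketch covers only the loudness and timing signals.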

To take it one step further, the AI workflow is also able to adapt to different languages. Through machine learning methods, the system recognizes which words to emphasize in different languages, and how that emphasis affects the facial expression. For game designers this means they can make their games more accessible to global audiences by offering high-quality language options.

Animation has traditionally been a highly manual process, which motivates computer graphics researchers such as Fiume to identify roadblocks in the production pipeline and build technologies that help deliver content faster and at a high quality.

In the case of Cyberpunk 2077, the vision to produce high-quality animations for all characters with hundreds of hours of dialogue in its open-world game may have been too ambitious for animators to tackle alone. Now with the help of JALI’s technology, emotion-driven facial expressions can be procedurally generated, allowing the animators to focus more of their attention on enhancing each character’s overall performance to match the storyline, and less on the lip-synching and basic facial animation itself.

“The ultimate goal is to generate performances that are so convincing that you perceive the whole, and don’t notice the pieces. If you are not talking about the lip-synching or facial expressions, then that means you are not being distracted by them, and we know we’ve done our job.”