We humans hear a sound and then process the message. After processing it, we speak and show expressions. That's exactly the process I'm using in this solution.
Mic Input -> STT (Speech to Text AI) -> LLM (Large Language Model AI) -> TTS (Text to Speech AI) -> Audio Output -> Lip Sync & Animations
Here, I'll first talk about the Python script and then dive into Unreal Engine. The Python script focuses on 'receive and process the message', and its output is just audio. Unreal Engine receives that audio input and turns the audio data into facial expressions, animation, lip movements, and the final audio output.
Python 3.10
Windows 10 or 11
Visual Studio 2019
Unreal Engine 5.1
Unreal Engine 5.1 - Quixel Bridge Plugin
You can copy the text down below, paste it into a text editor and then save it as 'requirements.txt'.
SpeechRecognition
openai~=0.27.1
pyttsx3~=2.90
pydub~=0.25.1
python-osc~=1.8.1
In your project terminal, run 'pip install -r requirements.txt'
Next, copy the script down below.
import speech_recognition as sr
import pyttsx3
import openai

# LLM AI: OpenAI GPT-3
openai.api_key = 'YOUR API KEY'

# Text-to-Speech AI
engine = pyttsx3.init()
voices = engine.getProperty('voices')
# voices[] holds your installed Microsoft language voices; pick one by index
engine.setProperty('voice', voices[0].id)

# Speech-to-Text AI
r = sr.Recognizer()
# Set up the mic input index
mic = sr.Microphone(device_index=0)

# Define the prompt for GPT-3
conversation = ''
user_name = 'Vance'
bot_name = 'Vance_CloneAI'

# Run the conversation loop
while True:
    with mic as source:
        print('\n Listening...')
        r.adjust_for_ambient_noise(source, duration=0.1)
        audio = r.listen(source)
    print('no longer listening')

    try:
        user_input = r.recognize_google(audio, language="en-US")
    except Exception:
        # Nothing was understood (or the request failed); listen again
        continue

    prompt = user_name + ':' + user_input + '\n' + bot_name + ':'
    conversation += prompt

    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=conversation,
        temperature=0.5,
        max_tokens=256,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )

    # Keep only the bot's reply, stripping any extra turns GPT-3 generates
    response_str = response["choices"][0]['text'].replace('\n', '')
    response_str = response_str.split(
        user_name + ':', 1)[0].split(bot_name + ':', 1)[0]
    conversation += response_str + '\n'

    print(response_str)
    engine.say(response_str)
    engine.runAndWait()
Remember to fill in your OpenAI API key at 'YOUR API KEY'. (You can find your API key here: Overview - OpenAI API.)
Also, make sure your microphone input is set to the right device_index. (My mini Razer microphone is device_index 0.)
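If you're not sure which device_index to pick, the SpeechRecognition package can list every input device it sees. A quick check looks like this:

import speech_recognition as sr

# Print every audio input device with its index so you can pick the right one
# for sr.Microphone(device_index=...)
for index, name in enumerate(sr.Microphone.list_microphone_names()):
    print(index, name)

Use the index of your microphone as device_index in the script above.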
Run the script.
Say hello. Then you'll see the output console display:
Listening...
no longer listening
result2:
{ 'alternative': [ { 'confidence': 0.98762906,
'transcript': 'hello how are you doing'}],
'final': True}
Hi there, I'm doing great. How about you?
Listening...
The output console is telling you:
It first prints 'Listening...', which is your cue to start speaking.
Once you finish talking and the recognizer detects enough silence, the script stops listening and prints 'no longer listening'. (The 0.1 second in adjust_for_ambient_noise only controls how long background noise is sampled; see the snippet below if you want to tune the silence detection.)
The Google recognizer then shows you the STT result: 'hello how are you doing'.
GPT-3 answers: 'Hi there, I'm doing great. How about you?'
pyttsx3 then turns the GPT-3 answer into audio output.
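If the recognizer cuts you off too early or waits too long, the value to tune is the Recognizer's pause_threshold, the number of seconds of silence that marks the end of a phrase (0.8 by default in the SpeechRecognition library). A minimal sketch, added right after creating the Recognizer in the script above:

import speech_recognition as sr

r = sr.Recognizer()
# Seconds of non-speaking audio before a phrase is considered complete (default 0.8)
r.pause_threshold = 1.0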
That wraps up the first part, the Python script. You can stop the script and close it.
Go to your project folder, right-click OVRLipSyncDemo.uproject and choose Switch Unreal Engine version...
After switching to 5.1, double-click OVRLipSyncDemo.sln to open Visual Studio 2019.
Select OVRLipSync.Build.cs
Go to line 39 and add "AndroidPermission".
Go to OVRLipSyncLiveActorComponent.cpp
Find line 35, DEFAULT_DEVICE_NAME TEXT("Default Device"), and change "Default Device" to an empty string, "".
Go to Build -> Rebuild Solution OVRLipSyncDemo
Click Local Windows Debugger to run the project.
Make sure your microphone is working properly and check whether the avatar's lips are moving.
If the mic input works, you can stop the game.
Go to Window -> Quixel Bridge
Sign in, download a MetaHuman, and export it.
Double-click your MetaHuman Blueprint.
Add an OVRLipSyncActor component.
Event Graph
Here I can read the viseme data from the Blueprint and store it in a float array for further use.
AnimGraph
Once you have the viseme data, you need to assign and map it to ARKit blendshapes. In my case, one float value controls either one blendshape or multiple blendshapes at the same time (see the sketch after this paragraph).
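The actual mapping lives in the AnimGraph, but here is a conceptual sketch of the idea in Python. The 15 viseme names are Oculus LipSync's standard set and the curve names are real ARKit blendshapes, but the pairings and weights below are illustrative placeholders, not the exact values from my AnimGraph:

# One viseme float can drive one or more ARKit blendshape curves.
OVR_VISEMES = [
    "sil", "PP", "FF", "TH", "DD", "kk", "CH", "SS",
    "nn", "RR", "aa", "E", "ih", "oh", "ou",
]

# Illustrative mapping: viseme -> list of (ARKit curve, weight scale)
VISEME_TO_ARKIT = {
    "PP": [("mouthClose", 1.0)],
    "aa": [("jawOpen", 1.0)],
    "ou": [("mouthPucker", 0.8), ("jawOpen", 0.2)],
    # ... the remaining visemes are mapped the same way
}

def visemes_to_curves(viseme_values):
    """Turn the 15 viseme floats into ARKit curve name -> weight pairs."""
    curves = {}
    for viseme, value in zip(OVR_VISEMES, viseme_values):
        for curve, scale in VISEME_TO_ARKIT.get(viseme, []):
            curves[curve] = max(curves.get(curve, 0.0), value * scale)
    return curves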
You only need the script above to run the test. However, if you'd like to see the full scripts, here's my GitHub repo: easylife122/ChatGPTVA (github.com).
Before we run the script and Unreal Engine, there is one more tool to download and install: VB-Audio Virtual Cable. You can find it here: VB-Audio Virtual Apps. (Currently it only supports Windows and Mac.)
After you install it, search for 'Sound mixer options' in Windows, open it, and then run your script.
Select CABLE Input (VB-Audio Virtual Cable) as Python Script Output.
Select CABLE Output (VB-Audio Virtual Cable) as Unreal Editor Input.
There are already many online services, such as InworldAI, MetahumanSDK, and the Amazon Polly UE Project. Here are my quick reviews:
InworldAI: Easy to implement and the AI is customizable, but the audio breaks up and the TTS/STT options are limited.
MetahumanSDK: The GPT-3 model can be fine-tuned, but TTS is limited and it is slow.
Amazon Polly UE Project: Flexible GPT-3. A bit faster.
Python + Oculus LipSync: The most flexible option, and it can be executed completely locally.
STT - SpeechRecognition, OpenAI Whisper (see the sketch after this list), Google Cloud, AWS, Azure, AssemblyAI, IBM Watson STT, Scriptix, etc.
AI - Google Chat, AWS, Azure, Facebook Messenger, Slack bot, ChatBot, Crisp, Bot Libre, Wit.ai, Twilio, GPT-3, etc.
TTS - gTTS, Google Cloud, Amazon Polly, Microsoft Azure, NVIDIA NeMo, etc.
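As an example of how easily the STT stage can be swapped in the Python approach, here is a rough sketch that replaces recognize_google with OpenAI's Whisper API. It assumes the openai~=0.27 package from requirements.txt and that openai.api_key is already set; 'whisper-1' is OpenAI's hosted Whisper model:

import io
import openai
import speech_recognition as sr

def transcribe_with_whisper(audio: sr.AudioData) -> str:
    """Alternative STT backend: send the captured audio to OpenAI Whisper
    instead of recognize_google."""
    wav_file = io.BytesIO(audio.get_wav_data())
    wav_file.name = 'speech.wav'  # the client infers the audio format from the name
    result = openai.Audio.transcribe('whisper-1', wav_file)
    return result['text']

# In the main loop you would then call:
#     user_input = transcribe_with_whisper(audio)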
A few things I still need to work on:
Cut down the response time: 5-15 s -> 0-2 s
Computer vision: read the face, follow the face, read fingers and hand gestures
Animation state varieties
Stable Diffusion implementation
Backend system implementation
Prompt development
Run the app on lower-end devices
Run the app with Pixel Streaming
Last Update: May 4, 2023
Software: Unreal Engine 5.1, Python 3.10, Visual Studio 2019, VB-Cable
OS: Windows 10
Specs: RTX 3080, AMD 3900x, 64GB RAM