GPT-3 | UE5.1: Virtual AI
A quick tutorial on how to set up a Python script and UE 5.1 for a Metahuman Virtual AI
In this tutorial, the first part covers setting up a Python script that connects to OpenAI GPT-3 via Google Speech Recognition, so the script can turn your voice into an OpenAI prompt in real time. The second part uses the Oculus LipSync (OVRLipSync) plugin for Unreal Engine and sets up a Metahuman as the Virtual AI. You can see the final result in the video below.
How do I approach a virtual AI solution?
We, as humans, hear a sound and then process the message. After processing the message, we speak and show expressions. That's the exact same process I'm using in this solution.
Mic Input -> STT (Speech to Text AI) -> LLM (Large Language Model AI) -> TTS (Text to Speech AI) -> Audio Output -> Lip Sync & Animations
Here, I'll first walk through the Python script and then dive into Unreal Engine. The Python script handles receiving and processing the message, and its output is plain audio. Unreal Engine then receives that audio input and turns the audio data into facial expressions, animations, lip movements, and the final audio output.
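To make the flow concrete, here is a minimal, hedged sketch of that pipeline in Python. The stage functions (listen, transcribe, generate_reply, speak) are hypothetical stand-ins for the real STT, LLM, and TTS calls; the actual implementations come from SpeechRecognition, OpenAI, and pyttsx3 in the script further below.

# Hypothetical sketch of the pipeline; the stage functions are placeholders only
def listen():
    return b""                      # stand-in for recording from the microphone

def transcribe(audio):
    return "hello"                  # stand-in for the Speech-to-Text service

def generate_reply(text):
    return "Hi there!"              # stand-in for the Large Language Model

def speak(reply):
    print(reply)                    # stand-in for Text-to-Speech / audio output

audio = listen()                    # Mic Input
text = transcribe(audio)            # STT
reply = generate_reply(text)        # LLM
speak(reply)                        # TTS -> Audio Output; lip sync happens in UE
# The real script wraps these four steps in a while True loop.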
Here's the diagram of the complete solution:
Prerequisites
Python 3.10
Windows 10 or 11
Visual Studio 2019
Unreal Engine 5.1
Unreal Engine 5.1 - Quixel Bridge Plugin
Python Script
Step 1: Install the required Python Packages:
You can copy the text down below, paste it into a text editor and then save it as 'requirements.txt'.
SpeechRecognition
openai~=0.27.1
pyttsx3~=2.90
pydub~=0.25.1
python-osc~=1.8.1
In your project terminal, run 'pip install -r requirements.txt'
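Optionally, before moving on you can sanity-check the install with a tiny throwaway script; this just verifies that the packages import cleanly (note that python-osc is imported as pythonosc):

# Quick sanity check that the required packages import cleanly
import speech_recognition
import openai
import pyttsx3
import pydub
import pythonosc

print('SpeechRecognition version:', speech_recognition.__version__)
print('All required packages imported successfully')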
Step 2: Copy and run the script:
Next, copy the script down below.
GoogleRecognizeOpenaiPyttsx3.py
import speech_recognition as sr
import pyttsx3
import openai

# LLM AI: OpenAI GPT-3
openai.api_key = 'YOUR API KEY'

# Text-to-Speech AI
engine = pyttsx3.init()
voices = engine.getProperty('voices')
# Choose one of the Microsoft language voices installed on your system
engine.setProperty('voice', voices[0].id)

# Speech-to-Text AI
r = sr.Recognizer()
# Set the mic input device index
mic = sr.Microphone(device_index=0)

# Define the prompt for GPT-3
conversation = ''
user_name = 'Vance'
bot_name = 'Vance_CloneAI'

# Run the conversation loop
while True:
    with mic as source:
        print('\n Listening...')
        r.adjust_for_ambient_noise(source, duration=0.1)
        audio = r.listen(source)
    print("no longer listening")

    # Transcribe the recorded audio; skip this turn if nothing was recognized
    try:
        user_input = r.recognize_google(audio, language="en-US")
    except:
        continue

    prompt = user_name + ':' + user_input + '\n' + bot_name + ':'
    conversation += prompt

    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=conversation,
        temperature=0.5,
        max_tokens=256,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )

    # Keep only the bot's reply and append it to the conversation history
    response_str = response["choices"][0]['text'].replace('\n', '')
    response_str = response_str.split(
        user_name + ':', 1)[0].split(bot_name + ':', 1)[0]
    conversation += response_str + '\n'
    print(response_str)

    # Speak the reply
    engine.say(response_str)
    engine.runAndWait()
Remember to fill in your OpenAI API key at 'YOUR API KEY'. (You can find your API key here: Overview - OpenAI API.)
Also, make sure your microphone input is set to the right device_index. (My Razer mini microphone is device_index 0.)
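If you're not sure which device_index to use, the SpeechRecognition library can list your audio input devices (note that sr.Microphone requires PyAudio, which may need to be installed separately if it isn't already present). A quick way to find the right index:

import speech_recognition as sr

# Print every audio input device with its index so you can pick the right one
for index, name in enumerate(sr.Microphone.list_microphone_names()):
    print(index, name)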
Run the script.
Say hello. Then you'll see the output console display:
Listening...
no longer listening
result2:
{ 'alternative': [ { 'confidence': 0.98762906,
'transcript': 'hello how are you doing'}],
'final': True}
Hi there, I'm doing great. How about you?
Listening...
The output console is telling you:
It first prints 'Listening...', which is your cue to start speaking.
Once you finish talking and the recognizer detects enough silence, the script stops listening and shows 'no longer listening'. (The 0.1 second passed to adjust_for_ambient_noise is only the ambient-noise calibration time; the pause that ends listening is controlled by the recognizer's pause_threshold setting, 0.8 seconds by default.)
Google Recognizer will then show you the STT result: 'hello how are you doing'.
GPT-3 answers: 'Hi there, I'm doing great. How about you?'
pyttsx3 turns the GPT-3 answer into audio output.
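The script above uses the legacy Completion endpoint with text-davinci-003. If you prefer a chat model such as gpt-3.5-turbo, the openai 0.27 package also exposes a ChatCompletion endpoint. Here's a minimal, hedged sketch of a helper you could call in place of the openai.Completion.create block; the system prompt is just an example and not part of the original script:

import openai

def chat_reply(user_input, bot_name):
    # Hedged alternative to the Completion call, using the gpt-3.5-turbo chat model
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            # Example system prompt (an assumption, adjust to taste)
            {"role": "system", "content": "You are " + bot_name + ", a friendly virtual clone."},
            {"role": "user", "content": user_input},
        ],
        temperature=0.5,
        max_tokens=256,
    )
    return response["choices"][0]["message"]["content"].strip()

Note that the chat endpoint takes a list of messages, so if you switch to it you would pass the conversation history as messages rather than as one concatenated prompt string.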
That wraps up the first part, the Python script. You can stop the script and close it.
Unreal Engine
Step 2: Recompile the plugin for UE 5.1:
Go to your project folder, right-click on OVRLipSyncDemo.uproject, and choose Switch Unreal Engine version...
After switching to 5.1, double-click OVRLipSyncDemo.sln to open it in Visual Studio 2019.
Select OVRLipSync.Build.cs
Go to line 39: Add "AndroidPermission"
Go to OVRLipSyncLiveActorComponent.cpp
Find line 35, DEFAULT_DEVICE_NAME TEXT("Default Device"), and change "Default Device" to an empty string ("").
Go to Build -> Rebuild Solution to rebuild OVRLipSyncDemo
Click Local Windows Debugger to run the project
Step 3: Play and test mic input:
Make sure your microphone is working properly and check whether the avatar's lips are moving.
If the mic input works, you can stop the game.
Step 4: Import Metahuman:
Go to Window -> Quixel Bridge
Sign in, download your Metahuman, and export it.
Step 5: Add an OVRLipSyncActor component to the Metahuman Blueprint
Double click your Metahuman Blueprint
Add OVRLipSyncActor Component
Step 6: Animation Blueprint Modification
Event Graph
I call the viseme data from the Blueprint and store it in a float array for further use.
AnimGraph
Once you have the viseme data, you need to assign and map it to ARKit blendshapes. In my case, each float value controls one or multiple blendshapes at the same time.
You only need the script above to run the test. However, if you'd like to see the scripts, here's my GitHub repo link - GitHub: easylife122/ChatGPTVA (github.com)
VB-Audio Virtual Cable
Before we run the script and Unreal Engine, there is one more tool to download and install: VB-Audio Virtual Cable. You can find it here: VB-Audio Virtual Apps. (Currently it only supports Windows and Mac.)
After you install it, search for 'Sound mixer options' in Windows, open it, and run your script.
Select CABLE Input (VB-Audio Virtual Cable) as Python Script Output.
Select CABLE Output (VB-Audio Virtual Cable) as Unreal Editor Input.
Solutions Comparison
There are already many online services, such as InworldAI, MetahumanSDK, and the Amazon Polly UE Project. Here are my brief reviews.
InworldAI: easy to implement and the AI can be customized, but the audio sometimes breaks up and the TTS/STT options are limited.
MetahumanSDK: the GPT-3 model can be fine-tuned, but the TTS is limited and it is slow.
Amazon Polly UE Project: flexible GPT-3 usage and a bit faster.
Python - Oculus LipSync: the most flexible option, and it can be executed completely locally.
AI Web Services/AI pretrained models: TTS, Chatbot, STT
STT - SpeechRecognition, OpenAI Whisper, Google Cloud, AWS, Azure, AssemblyAI, IBM Watson STT, Scriptix, etc.
AI - Google Chat, AWS, Azure, Facebook Messenger, Slack bot, ChatBot, Crisp, Bot Libre, Wit.ai, Twilio, GPT-3, etc.
TTS - gTTS, Google Cloud, Amazon Polly, Microsoft Azure, NVIDIA NeMo, etc.
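As an example of swapping one of these stages, the recognize_google call in the script could be replaced with OpenAI Whisper through the same openai package (0.27.x exposes openai.Audio.transcribe). This is a hedged sketch, not part of the original script; it reuses the API key already set in the script:

import io
import openai
import speech_recognition as sr

def transcribe_with_whisper(audio: sr.AudioData) -> str:
    # Drop-in alternative to r.recognize_google(audio)
    wav_file = io.BytesIO(audio.get_wav_data())   # AudioData -> WAV bytes
    wav_file.name = "speech.wav"                  # the API infers the format from the file name
    transcript = openai.Audio.transcribe("whisper-1", wav_file)
    return transcript["text"]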
Issues
A few things still need work:
Cut down the response time from 5-15 s to 0-2 s
Computer Vision: Read face, follow face, read fingers, read hand gestures.
More animation state varieties
Stable Diffusion implementation.
Backend system implementation.
Prompt Development
Run the app on lower-end devices
Run the app via Pixel Streaming
Last Update: May 4, 2023
Software: Unreal Engine 5.1, Python 3.10, Visual Studio 2019, VB-Cable
OS: Windows 10
Specs: RTX 3080, AMD 3900x, 64GB RAM