We humans hear a sound and then process the message. After processing it, we speak and show expressions. That's exactly the process I'm using in this solution.
Mic Input -> STT (Speech to Text AI) -> LLM (Large Language Model AI) -> TTS (Text to Speech AI) -> Audio Output -> Lip Sync & Animations
Here, I'll first talk about the Python script and then dive into Unreal Engine. The Python script focuses on 'receive and process the message', and its output is just audio. Unreal Engine receives that audio input and turns the audio data into facial expressions, animation, lip movements, and the final audio output.
Python 3.10
Windows 10 or 11
Visual Studio 2019
Unreal Engine 5.1
Unreal Engine 5.1 - Quixel Bridge Plugin
You can copy the text down below, paste it into a text editor and then save it as 'requirements.txt'.
SpeechRecognition
openai~=0.27.1
pyttsx3~=2.90
pydub~=0.25.1
python-osc~=1.8.1
In your project terminal, run 'pip install -r requirements.txt'
Next, copy the script down below.
import speech_recognition as sr
import pyttsx3
import openai

# LLM AI: OpenAI GPT-3
openai.api_key = 'YOUR API KEY'

# Text-to-Speech AI
engine = pyttsx3.init()
voices = engine.getProperty('voices')
# voices[] holds your installed Microsoft language voices; pick one by index
engine.setProperty('voice', voices[0].id)

# Speech-to-Text AI
r = sr.Recognizer()
# Set up the mic input index
mic = sr.Microphone(device_index=0)

# Define the prompt for GPT-3
conversation = ''
user_name = 'Vance'
bot_name = 'Vance_CloneAI'

# Run the conversation loop
while True:
    with mic as source:
        print('\n Listening...')
        r.adjust_for_ambient_noise(source, duration=0.1)
        audio = r.listen(source)
    print('no longer listening')

    try:
        user_input = r.recognize_google(audio, language="en-US")
    except Exception:
        # Nothing was understood (or the request failed); listen again
        continue

    prompt = user_name + ':' + user_input + '\n' + bot_name + ':'
    conversation += prompt

    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=conversation,
        temperature=0.5,
        max_tokens=256,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )

    # Keep only the bot's reply, stripping any extra turns GPT-3 generates
    response_str = response["choices"][0]['text'].replace('\n', '')
    response_str = response_str.split(
        user_name + ':', 1)[0].split(bot_name + ':', 1)[0]
    conversation += response_str + '\n'

    print(response_str)
    engine.say(response_str)
    engine.runAndWait()
Remember to fill in your OpenAI API key at 'YOUR API KEY'. (You can find your API key here: Overview - OpenAI API.)
Also, make sure your microphone input is set to the right device_index. (My mini Razer microphone is device_index 0.)
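If you're not sure which device_index to pick, the SpeechRecognition package can list every input device it sees. A quick check looks like this:

import speech_recognition as sr

# Print every audio input device with its index so you can pick the right one
# for sr.Microphone(device_index=...)
for index, name in enumerate(sr.Microphone.list_microphone_names()):
    print(index, name)

Use the index of your microphone as device_index in the script above.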
Run the script.
Say hello. Then you'll see the output console display:
Listening...
no longer listening
result2:
{ 'alternative': [ { 'confidence': 0.98762906,
'transcript': 'hello how are you doing'}],
'final': True}
Hi there, I'm doing great. How about you?
Listening...
The output console is telling you:
It first prints 'Listening...', which is your cue to start speaking.
Once you finish talking and the recognizer detects enough silence, the script stops listening and prints 'no longer listening'. (The 0.1 second in adjust_for_ambient_noise only controls how long background noise is sampled; see the snippet below if you want to tune the silence detection.)
The Google recognizer then shows you the STT result: 'hello how are you doing'.
GPT-3 answers: 'Hi there, I'm doing great. How about you?'
pyttsx3 then turns the GPT-3 answer into audio output.
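If the recognizer cuts you off too early or waits too long, the value to tune is the Recognizer's pause_threshold, the number of seconds of silence that marks the end of a phrase (0.8 by default in the SpeechRecognition library). A minimal sketch, added right after creating the Recognizer in the script above:

import speech_recognition as sr

r = sr.Recognizer()
# Seconds of non-speaking audio before a phrase is considered complete (default 0.8)
r.pause_threshold = 1.0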
That wraps up the first part, the Python script. You can stop the script and close it.
Go to your project folder, right-click OVRLipSyncDemo.uproject and choose Switch Unreal Engine version...
After switching to 5.1, double-click OVRLipSyncDemo.sln to open Visual Studio 2019.
Select OVRLipSync.Build.cs
Go to line 39 and add "AndroidPermission".
Go to OVRLipSyncLiveActorComponent.cpp
Find line 35, DEFAULT_DEVICE_NAME TEXT("Default Device"), and change "Default Device" to an empty string, "".
Go to Build -> Rebuild Solution OVRLipSyncDemo
Click Local Windows Debugger to run the project.
Make sure your microphone is working properly and check whether the avatar's lips are moving.
If the mic input works, you can stop the game.
Go to Window -> Quixel Bridge
Sign in, download a MetaHuman, and export it.
Double-click your MetaHuman Blueprint.
Add an OVRLipSyncActor component.
Event Graph
Here I can read the viseme data from the Blueprint and store it in a float array for further use.
AnimGraph
Once you have the viseme data, you need to assign and map it to ARKit blendshapes. In my case, one float value controls either one blendshape or multiple blendshapes at the same time (see the sketch after this paragraph).
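The actual mapping lives in the AnimGraph, but here is a conceptual sketch of the idea in Python. The 15 viseme names are Oculus LipSync's standard set and the curve names are real ARKit blendshapes, but the pairings and weights below are illustrative placeholders, not the exact values from my AnimGraph:

# One viseme float can drive one or more ARKit blendshape curves.
OVR_VISEMES = [
    "sil", "PP", "FF", "TH", "DD", "kk", "CH", "SS",
    "nn", "RR", "aa", "E", "ih", "oh", "ou",
]

# Illustrative mapping: viseme -> list of (ARKit curve, weight scale)
VISEME_TO_ARKIT = {
    "PP": [("mouthClose", 1.0)],
    "aa": [("jawOpen", 1.0)],
    "ou": [("mouthPucker", 0.8), ("jawOpen", 0.2)],
    # ... the remaining visemes are mapped the same way
}

def visemes_to_curves(viseme_values):
    """Turn the 15 viseme floats into ARKit curve name -> weight pairs."""
    curves = {}
    for viseme, value in zip(OVR_VISEMES, viseme_values):
        for curve, scale in VISEME_TO_ARKIT.get(viseme, []):
            curves[curve] = max(curves.get(curve, 0.0), value * scale)
    return curves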
You only need the script above to run the test. However, if you'd like to see the full scripts, here's my GitHub repo: easylife122/ChatGPTVA (github.com).
Before we run the script and Unreal Engine, there is one more tool to download and install: VB-Audio Virtual Cable. You can find it here: VB-Audio Virtual Apps. (Currently it only supports Windows and Mac.)
After you install it, search for 'Sound mixer options' in Windows, open it, and then run your script.
Select CABLE Input (VB-Audio Virtual Cable) as Python Script Output.
Select CABLE Output (VB-Audio Virtual Cable) as Unreal Editor Input.
There are already many online services, such as InworldAI, MetahumanSDK, and the Amazon Polly UE Project. Here are my quick reviews:
InworldAI: Easy to implement and the AI is customizable, but the audio breaks up and the TTS/STT options are limited.
MetahumanSDK: The GPT-3 model can be fine-tuned, but TTS is limited and it is slow.
Amazon Polly UE Project: Flexible GPT-3. A bit faster.
Python + Oculus LipSync: The most flexible option, and it can be executed completely locally.
STT - SpeechRecognition, OpenAI Whisper (see the sketch after this list), Google Cloud, AWS, Azure, AssemblyAI, IBM Watson STT, Scriptix, etc.
AI - Google Chat, AWS, Azure, Facebook Messenger, Slack bot, ChatBot, Crisp, Bot Libre, Wit.ai, Twilio, GPT-3, etc.
TTS - gTTS, Google Cloud, Amazon Polly, Microsoft Azure, NVIDIA NeMo, etc.
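As an example of how easily the STT stage can be swapped in the Python approach, here is a rough sketch that replaces recognize_google with OpenAI's Whisper API. It assumes the openai~=0.27 package from requirements.txt and that openai.api_key is already set; 'whisper-1' is OpenAI's hosted Whisper model:

import io
import openai
import speech_recognition as sr

def transcribe_with_whisper(audio: sr.AudioData) -> str:
    """Alternative STT backend: send the captured audio to OpenAI Whisper
    instead of recognize_google."""
    wav_file = io.BytesIO(audio.get_wav_data())
    wav_file.name = 'speech.wav'  # the client infers the audio format from the name
    result = openai.Audio.transcribe('whisper-1', wav_file)
    return result['text']

# In the main loop you would then call:
#     user_input = transcribe_with_whisper(audio)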
A few things I still need to work on:
Cut down the response time: 5-15 s -> 0-2 s
Computer vision: read the face, follow the face, read fingers and hand gestures
Animation state varieties
Stable Diffusion implementation
Backend system implementation
Prompt development
Run the app on lower-end devices
Run the app with Pixel Streaming
Last Update: May 4, 2023
Software: Unreal Engine 5.1, Python 3.10, Visual Studio 2019, VB-Cable
OS: Windows 10
Specs: RTX 3080, AMD 3900x, 64GB RAM