GPT-3 | UE5.1: Virtual AI

A quick tutorial on how to set up a Python script and UE 5.1 for a Metahuman Virtual AI

In this tutorial, the first part is setting up a Python script that turns your voice into an OpenAI prompt in real time, using Google speech recognition for the transcription and OpenAI GPT-3 for the reply. The second part is using the Oculus LipSync plugin for Unreal Engine and setting up a Metahuman as the Virtual AI. You can see the final result in the video below.

How do I approach a virtual AI solution? 

We humans hear a sound and then process the message. After processing it, we speak and show expressions. That's exactly the process I'm following in this solution.

Here, I'll first talk about the Python script and then dive into Unreal Engine. The Python script focuses on receiving and processing the message, and its output is just audio. Unreal Engine then receives that audio input and turns the audio data into facial expressions, animation, lip movement, and the final audio output.
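Before going into detail, here is the Python side of that split as a minimal sketch. The function names are placeholders, not real calls; the full script later in this tutorial implements them with SpeechRecognition, openai and pyttsx3.

# Minimal sketch of the Python side: listen -> transcribe -> ask GPT -> speak.
# These function names are placeholders; the full script below implements
# them with SpeechRecognition, openai and pyttsx3.

def listen_and_transcribe() -> str:
    """Capture microphone audio and return the recognized text (STT)."""
    ...

def ask_gpt(user_text: str) -> str:
    """Send the text to OpenAI and return the model's reply."""
    ...

def speak(reply: str) -> None:
    """Play the reply as audio; Unreal Engine picks this audio up for lip sync."""
    ...

while True:
    user_text = listen_and_transcribe()
    reply = ask_gpt(user_text)
    speak(reply)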

Here's the diagram of the complete solution:

Prerequisites

Python Script

Step 1: Install the required Python Packages:

You can copy the text down below, paste it into a text editor and then save it as 'requirements.txt'. 

SpeechRecognition

openai~=0.27.1

pyttsx3~=2.90

pydub~=0.25.1

python-osc~=1.8.1

In your project terminal, run 'pip install -r requirements.txt'
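One note: sr.Microphone in the script below also relies on PyAudio, which isn't in the list above, so install it as well if the microphone step complains (pip install PyAudio). To confirm the packages are available, a quick sanity check is to import them; note that SpeechRecognition imports as speech_recognition and python-osc imports as pythonosc.

# Quick sanity check that the required packages are importable.
# Note: the pip names and import names differ for two of them.
import speech_recognition  # installed as SpeechRecognition
import openai
import pyttsx3
import pydub
import pythonosc           # installed as python-osc

print('All packages imported successfully.')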

Step 2: Copy and run the script:

Next, copy the script down below.

GoogleRecognizeOpenaiPyttsx3.py

import speech_recognition as sr
import pyttsx3
import openai

# LLM AI: OpenAI GPT-3
openai.api_key = 'YOUR API KEY'

# Text-to-Speech AI
engine = pyttsx3.init()
voices = engine.getProperty('voices')
# Pick one of your installed Microsoft/system voices by index
engine.setProperty('voice', voices[0].id)

# Speech-to-Text AI
r = sr.Recognizer()
# Set the mic input index to match your microphone
mic = sr.Microphone(device_index=0)

# Define the prompt for GPT-3
conversation = ''
user_name = 'Vance'
bot_name = 'Vance_CloneAI'

# Run the conversation loop
while True:
    with mic as source:
        print('\n Listening...')
        r.adjust_for_ambient_noise(source, duration=0.1)
        audio = r.listen(source)
    print("no longer listening")

    # Transcribe the recorded audio with Google's speech recognition;
    # if it can't be understood or the request fails, listen again
    try:
        user_input = r.recognize_google(audio, language="en-US")
    except (sr.UnknownValueError, sr.RequestError):
        continue

    # Append the user's line to the running conversation
    prompt = user_name + ':' + user_input + '\n' + bot_name + ':'
    conversation += prompt

    # Send the whole conversation to GPT-3 and get the bot's reply
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=conversation,
        temperature=0.5,
        max_tokens=256,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0
    )

    # Keep only the bot's reply, stripping any echoed speaker labels
    response_str = response["choices"][0]['text'].replace('\n', '')
    response_str = response_str.split(
        user_name + ':', 1)[0].split(bot_name + ':', 1)[0]

    conversation += response_str + '\n'
    print(response_str)

    # Speak the reply out loud
    engine.say(response_str)
    engine.runAndWait()


Remember to fill in your OpenAI API key at 'YOUR API KEY'. (You can find your API key here: Overview - OpenAI API.)
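Hard-coding the key is fine for a quick local test, but a safer pattern is to read it from an environment variable. Here's a minimal sketch, assuming you set OPENAI_API_KEY before running the script:

# Optional: load the API key from an environment variable instead of
# hard-coding it in the script. Assumes OPENAI_API_KEY is set.
import os
import openai

openai.api_key = os.environ.get('OPENAI_API_KEY')
if not openai.api_key:
    raise RuntimeError('Set the OPENAI_API_KEY environment variable first.')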

Also, make sure your microphone input is set to the right device_index. (My Razer mini microphone is device_index 0.)
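If you're not sure which index your microphone has, SpeechRecognition can list the available input devices. Run this small snippet and use the printed index for device_index:

# Print every microphone the system can see, with its device_index.
import speech_recognition as sr

for index, name in enumerate(sr.Microphone.list_microphone_names()):
    print(index, name)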

Run the script.

Say hello. Then the console will display:

 Listening...

no longer listening

result2:

{   'alternative': [   {   'confidence': 0.98762906,

                           'transcript': 'hello how are you doing'}],

    'final': True}

 Hi there, I'm doing great. How about you?


 Listening...

The console output shows the listening state, the transcript that Google recognized ('hello how are you doing'), and the reply GPT-3 generated, which is then spoken out loud.

With that, the first part, the Python script, is finished. You can stop the script and close it.

Unreal Engine

Step 1: Download Oculus LipSync Unreal project

 Downloads - Oculus Lipsync Unreal 

Step 2: Recompile the plugin for UE 5.1:

Step 3: Play and test mic input:

Step 4: Import Metahuman:

Step 5: Add an OVRLipSyncActor component to the Metahuman Blueprint

Step 6: Animation Blueprint Modification

I read the viseme data from the Blueprint and store it in a float array for further use.


Once you have the viseme data, you need to assign and map it to ARKit blendshapes. In my case, each float value drives either one blendshape or multiple blendshapes at the same time, as shown in the sketch below.
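The mapping itself lives in the Animation Blueprint, but conceptually it is just a lookup from each OVRLipSync viseme to one or more ARKit curves. Here's a small Python sketch of that idea; the curve names and weights below are illustrative assumptions, not the exact values from my project.

# Conceptual sketch of the viseme-to-blendshape mapping done in the
# Animation Blueprint. The ARKit curve names and weights here are
# illustrative assumptions.
VISEME_TO_ARKIT = {
    'sil': [],                                         # silence drives nothing
    'PP':  [('MouthClose', 1.0)],                      # one viseme -> one curve
    'FF':  [('MouthFunnel', 0.4), ('JawOpen', 0.2)],   # one viseme -> several curves
    'aa':  [('JawOpen', 0.8)],
    'ou':  [('MouthPucker', 0.9)],
}

def apply_visemes(viseme_weights: dict[str, float]) -> dict[str, float]:
    """Blend OVRLipSync viseme weights into ARKit curve values."""
    curves: dict[str, float] = {}
    for viseme, weight in viseme_weights.items():
        for curve, scale in VISEME_TO_ARKIT.get(viseme, []):
            curves[curve] = max(curves.get(curve, 0.0), weight * scale)
    return curves

In the Animation Blueprint, this amounts to driving the corresponding facial curves with those float values every frame.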

You only need the script above to run the test. However, if you'd like to see the scripts, here's my GitHub repo link - GitHub: easylife122/ChatGPTVA (github.com) 

VB-Audio Virtual Cable

Before we run the script and Unreal Engine, there is one more tool to download and install: VB-Audio Virtual Cable. You can find it here: VB-Audio Virtual Apps. (Currently it only supports Windows and macOS.)

After you install it, search for 'Sound mixer options' in Windows and open it. Then run your script.
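To confirm the virtual cable is visible from Python, you can list the audio output devices and look for a 'CABLE' entry. This is just a diagnostic sketch, and it assumes PyAudio is installed (sr.Microphone needs it anyway):

# List audio output devices; the VB-Cable endpoint shows up as 'CABLE Input'.
import pyaudio

pa = pyaudio.PyAudio()
for i in range(pa.get_device_count()):
    info = pa.get_device_info_by_index(i)
    if info.get('maxOutputChannels', 0) > 0 and 'CABLE' in info.get('name', ''):
        print(i, info['name'])
pa.terminate()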

Solutions Comparison

There are already many online services, such as InworldAI, MetahumanSDK, and the Amazon Polly UE Project. Here are my brief reviews.

AI Web Services/AI pretrained models: TTS, Chatbot, STT 

Issues 

A few things still need work:

Last Update:  May 4, 2023

Software: Unreal Engine 5.1, Python 3.10, Visual Studio 2019, VB-Cable

OS: Windows 10

Specs: RTX 3080, AMD 3900x, 64GB RAM

