This is a follow-up to my previous post about Nao and GPT integration.

To recap: I previously had two code projects: a Java project that connected to Nao and ran the text-to-speech and conversation routine on the robot, and a Python project that ran a server which essentially wraps GPT. Nao would start a conversation using a random topic generated by GPT and then record a .wav file containing the user's response. The Python code would pick up the .wav file and send it to GPT, which would perform speech-to-text and return a response to what the user said. This response would be returned to the Java code, and Nao would say it using text-to-speech. Whew!
Using camera input from Nao with GPT

For my next experiment I used the camera on Nao to take a picture and integrated GPT to start a conversation about it. It sounds like a quick addition to my previous code but, alas, it was not! I had to refactor quite a bit to add the new vision feature to the Python server that connects to GPT. And I had to remember how the complicated server interaction worked between the Java code and the Python code. Why did I make it so complicated?! Oh that’s right, because Nao code is so version specific and I had Java code already, while GPT doesn’t have a native Java API. Damn.
For the curious, this is the vision code for GPT. For more, see the Java GitHub repository and the Python one.
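# OpenAIclient is the OpenAI SDK client created elsewhere in the server code.
# vision_prompt_text is the prompt string; base64_image is the base64-encoded
# JPEG taken with Nao's camera.
response = OpenAIclient.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                # Text part: the instruction for the model.
                {"type": "text", "text": vision_prompt_text},
                # Image part: the picture sent inline as a base64 data URL.
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    },
                },
            ],
        }
    ],
    max_tokens=300,  # cap the length of the reply
)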
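The snippet above expects `base64_image` to already hold the picture as a base64 string. As a minimal sketch of how that value could be produced, assuming the photo from Nao has been saved to disk as a JPEG, something like the following would work. The helper name `image_to_data_url` and the file path are hypothetical and are not taken from the actual repositories.

```python
import base64
from pathlib import Path


def image_to_data_url(image_path: str, mime_type: str = "image/jpeg") -> str:
    """Read an image file and return it as a base64 data URL.

    Hypothetical helper for illustration; not part of the original project.
    """
    raw_bytes = Path(image_path).read_bytes()
    # b64encode returns bytes, so decode to str before embedding it in the URL.
    base64_image = base64.b64encode(raw_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{base64_image}"
```

The returned string could then be passed directly as the `"url"` value in the `image_url` part of the message, for example `image_to_data_url("nao_photo.jpg")`, where the file name is just a placeholder.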
The end result is startlingly engaging!
Watch this video in which Nao and I end up having a meaningful conversation about my cultural background:
In this second video, I try to lead Nao to show GPT’s sense of humor:
Compare this to my early prototype with Microsoft Vision API from 8 years ago. This kind of fluency is what I dreamed of then; you can see the potential.
Conclusion:
I have to say, I thoroughly enjoy speaking to GPT with Nao! The conversations are so engaging, and since I added the endless loop, they can go on for as long as I wish, getting deeper and evolving the topic. The conversations flow well from turn to turn and are quite entertaining. Using the environmental input from the camera seeds the conversation much more personally and makes it more relevant. With this level of technology, I can really see robots combined with AI being used to combat loneliness in the future.
I do miss something in the interaction though: I really think Nao needs some movement to make it seem more lifelike. In the next project I will try to record some movements using the Choregraphe software and replay them during speech. The other thing is the long pauses after listening: part of that is the connection to OpenAI and part of that is a wait that I’ve put in to give the user some time to reply. I’d love to get rid of it, but for now I don’t quite know how to signal to Nao to stop recording when the user is done speaking.
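One common way to approach the "when is the user done speaking" problem is energy-based endpointing: watch the loudness of the incoming audio and stop once a stretch of silence follows some speech. The sketch below is only an illustration of that idea, not something from this project. It assumes the audio is available as a sequence of short chunks of 16-bit little-endian mono PCM, which may not match how Nao's recorder exposes audio, and the function names, threshold and frame count are arbitrary placeholders that would need tuning.

```python
import math
import struct


def rms(frame: bytes) -> float:
    """Root-mean-square level of a chunk of 16-bit little-endian mono PCM."""
    count = len(frame) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", frame[: count * 2])
    return math.sqrt(sum(s * s for s in samples) / count)


def end_of_speech_index(frames, threshold=500.0, silent_frames_needed=25):
    """Return the index of the frame where recording could stop, or None.

    Silence before the user starts talking is ignored; only a run of
    `silent_frames_needed` quiet frames after some speech counts as the end.
    Both default values are assumptions, not measured settings.
    """
    heard_speech = False
    silent_run = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            heard_speech = True
            silent_run = 0  # speech resumed, so restart the silence count
        elif heard_speech:
            silent_run += 1
            if silent_run >= silent_frames_needed:
                return i
    return None
```

With 20 ms frames, `silent_frames_needed=25` would correspond to roughly half a second of silence. A fixed threshold like this is sensitive to background noise, so a real setup would probably need calibration against the room Nao is in.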
Well, that’s it for this post and project, until the next iteration, thanks for reading!
