Integrating Vision and Conversation: How NAO and GPT Take Social Robotics to the Next Level

This is a follow-up to my previous post about Nao and GPT integration.

To recap: I previously had two code projects – a Java project that connected to Nao and ran the text-to-speech and conversation routine on the robot, and a Python project that ran a server which essentially wraps GPT. Nao would start a conversation using a random topic generated by GPT and then record a .wav file containing the user’s response. The Python code would pick up the .wav file and send it to GPT, which would convert the speech to text and return a response to what the user said. This response would be returned to the Java code and Nao would say it using text-to-speech. Whew!
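As a rough illustration of one turn on the Python side, here is a minimal sketch. It is not the code from my server: the function name `run_turn`, the `history` list and the use of the `whisper-1` transcription model are assumptions made for the example, and `client` is assumed to be an `openai.OpenAI()` instance.

```python
from pathlib import Path


def run_turn(client, wav_path, history, model="gpt-4o"):
    """Transcribe one recorded reply and return the robot's next line.

    `client` is assumed to be an `openai.OpenAI()` instance. `history` is a
    list of chat messages (hypothetical helper state) extended in place.
    """
    # Speech to text: send the recorded .wav to the transcription endpoint.
    with Path(wav_path).open("rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=audio_file
        )
    history.append({"role": "user", "content": transcript.text})

    # Ask the chat model to respond to everything said so far.
    response = client.chat.completions.create(model=model, messages=history)
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply
```

The returned string is what would be handed back to the Java side for Nao to speak. Keeping the whole `history` in each request is one way to let the topic evolve from turn to turn.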

Using camera input from Nao with GPT

For my next experiment I used the camera on Nao to take a picture and integrated GPT to start a conversation about it. It sounds like a quick addition to my previous code but, alas, it was not! I had to refactor quite a bit to add the new vision feature to the Python server that connects to GPT. And I had to remember how the complicated server interaction worked between the Java code and the Python code – why did I make it so complicated?! Oh that’s right, because Nao code is so version-specific and I had Java code already, while GPT doesn’t have a native Java API. Damn.

For the curious, this is the vision code for GPT – for more, see the Java GitHub repository and the Python one.

    # vision_prompt_text: the text prompt sent along with the picture
    # base64_image: the JPEG from Nao's camera, base64-encoded
    response = OpenAIclient.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": vision_prompt_text},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}"
                        },
                    },
                ],
            }
        ],
        max_tokens=300,
    )
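The snippet above expects `base64_image` to already hold the encoded picture. A minimal sketch of that step, using only the standard library, could look like this. The helper names `image_to_data_url` and `build_vision_messages` are made up for the example and are not taken from my repository.

```python
import base64
from pathlib import Path


def image_to_data_url(image_path):
    """Return the JPEG file at `image_path` as a base64 data URL."""
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return f"data:image/jpeg;base64,{encoded}"


def build_vision_messages(prompt_text, image_path):
    """Build a `messages` list with one text part and one image part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt_text},
                {
                    "type": "image_url",
                    "image_url": {"url": image_to_data_url(image_path)},
                },
            ],
        }
    ]
```

The result of `build_vision_messages(...)` has the same shape as the `messages` argument in the call above, so it could be passed straight to `chat.completions.create`.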

The end result is startlingly engaging!

Watch this video in which Nao and I end up having a meaningful conversation about my cultural background:

In this second video, I try to lead Nao to show GPT’s sense of humor:

Compare this to my early prototype with the Microsoft Vision API from 8 years ago – this kind of fluency is what I dreamed of then – you can see the potential.

Conclusion:

I have to say, I thoroughly enjoy speaking to GPT with Nao! The conversations are so engaging, and since I added the endless loop they can go on for as long as I wish, getting deeper and evolving the topic. The conversations flow well from turn to turn and are quite entertaining. Using the environmental input from the camera seeds the conversation in a much more personal way and makes it more relevant. With this level of technology, I can really see robots combined with AI being used to combat loneliness in the future.

I do miss something in the interaction though – I really think Nao needs some movement to make it seem more lifelike. In the next project I will try to record some movements using the Choregraphe software and replay them during speech. The other thing is the long pauses after listening – part of that is the connection to OpenAI and part of it is a wait that I’ve put in to give the user some time to reply. I’d love to get rid of it, but for now I don’t quite know how to signal to Nao to stop recording when the user is done speaking.
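One common idea for the end-of-speech problem is simple energy-based silence detection: stop once the user has said something and the audio then stays quiet for a while. The sketch below is only an illustration of that idea on raw 16-bit mono PCM, not something I have running on Nao. The function names and all threshold values are assumptions that would need tuning, and wiring the result into the robot's recording is left out.

```python
import math
import sys
from array import array


def frame_rms(samples):
    """Root-mean-square amplitude of a sequence of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def end_of_speech_index(pcm_bytes, sample_rate=16000, frame_ms=30,
                        silence_threshold=500, max_silence_ms=800):
    """Return the sample index where recording could stop, or None.

    Assumes little-endian 16-bit mono PCM (the usual .wav payload).
    Speech counts as finished once at least one loud frame was heard and
    the following `max_silence_ms` of frames all stay below the threshold.
    """
    samples = array("h")
    samples.frombytes(pcm_bytes)
    if sys.byteorder == "big":
        samples.byteswap()  # input is little-endian, array is native-endian

    frame_len = sample_rate * frame_ms // 1000
    frames_needed = math.ceil(max_silence_ms / frame_ms)
    heard_speech = False
    quiet_frames = 0

    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        if frame_rms(frame) >= silence_threshold:
            heard_speech = True
            quiet_frames = 0
        elif heard_speech:
            quiet_frames += 1
            if quiet_frames >= frames_needed:
                return start + frame_len
    return None
```

Requiring some speech before counting silence keeps the check from firing while the user is still thinking about what to say. A fixed threshold is fragile in a noisy room, so a real setup would probably calibrate it against background noise first.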

Well, that’s it for this post and project. Until the next iteration, thanks for reading!

