Multi-Modal AI on the Web
What if you could use multimodal LLMs to interact with websites or IoT devices using motion control? As advances in multimodal AI open up new possibilities for this technology, I started wondering how it could be applied to human-computer interaction. In this talk, I will walk you through my research building motion-controlled prototypes with LLMs in JavaScript.
00:00 Introduction & Speaker Background
02:01 The Value of Conferences & Making Connections
03:54 Motion Control with Multimodal AI: The Core Question
04:32 Inspiration: The Room E Project & Context-Aware Systems
06:38 What is Multimodal AI?
07:19 Project Goal: Controlling Devices with Hand Gestures & Gemini
07:48 Approaches to Motion Control with LLMs
09:31 Demo 1: Gesture Detection with Gemini
12:58 Demo 2: Function Calling – Toggling a Light with Gestures
16:11 Demo 3: Multi-Turn Function Calling for Multiple Lights
20:04 Demo 4: Combining Gemini with TensorFlow.js for Color Control
23:27 Demo 5: Custom Gestures with Vector Embeddings & Vector Databases
28:12 Research, Resources & Future Directions
29:43 Final Thoughts & Q&A
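
The function-calling demos pair a gesture that Gemini sees in a camera frame with a device action. As a rough illustration of that pattern (not the exact code from the talk), here is a minimal sketch assuming the @google/generative-ai JavaScript SDK; the model name, the prompt, and the toggleLight helper are placeholders you would replace with your own device code.

```js
// Minimal sketch: send a webcam frame to Gemini and let it decide whether
// to call a light-toggling function based on the gesture it detects.
// toggleLight() is a hypothetical stand-in for real smart-light control code.
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);

const model = genAI.getGenerativeModel({
  model: "gemini-1.5-flash",
  tools: [{
    functionDeclarations: [{
      name: "toggleLight",
      description: "Turn the light on or off",
      parameters: {
        type: "OBJECT",
        properties: { on: { type: "BOOLEAN" } },
        required: ["on"],
      },
    }],
  }],
});

// Placeholder device action: swap in your IoT / smart-bulb API here.
function toggleLight(on) {
  console.log(`Light is now ${on ? "on" : "off"}`);
}

async function handleFrame(base64Jpeg) {
  const result = await model.generateContent([
    { inlineData: { mimeType: "image/jpeg", data: base64Jpeg } },
    "If the hand in this image shows a thumbs-up, turn the light on; " +
      "if it shows a thumbs-down, turn it off. Otherwise do nothing.",
  ]);

  // If Gemini decided a function call is needed, run it against the device.
  for (const call of result.response.functionCalls() ?? []) {
    if (call.name === "toggleLight") toggleLight(call.args.on);
  }
}
```

In the later demos this idea is extended with multi-turn function calling for several lights, combined with TensorFlow.js hand tracking for finer control, and swapped for vector embeddings when matching custom gestures.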
