Practical AI – Episode #254

Large Action Models (LAMs) & Rabbits 🐇

Get Fully-Connected with Chris & Daniel


Recently, the release of the rabbit r1 device resulted in huge interest in both the device and "Large Action Models" (or LAMs). What is a LAM? Is this something new? Did these models come out of nowhere, or are they related to other things we are already using? Chris and Daniel dig into LAMs in this episode and discuss neuro-symbolic AI, AI tool usage, multimodal models, and more.

Featuring

Sponsors

Read Write Own – Read, Write, Own: Building the Next Era of the Internet – a new book from entrepreneur and investor Chris Dixon – explores one possible solution to the internet's authenticity problem: blockchains. From AI that tracks its source material to generative programs that compensate, rather than cannibalize, creators. It's a call to action for a more open, transparent, and democratic internet. One that opens the black box of AI, tracks the origins we see online, and much more. Order your copy of Read, Write, Own today at readwriteown.com

Shopify – Sign up for a $1/month trial period at shopify.com/practicalai

Fly.io – The home of Changelog.com – Deploy your apps and databases close to your users. In minutes you can run your Ruby, Go, Node, Deno, Python, or Elixir app (and databases!) all over the world. No ops required. Learn more at fly.io/changelog and check out the speedrun in their docs.

Notes & Links

šŸ“ Edit Notes

Chapters

1 00:07 Welcome to Practical AI
2 00:43 Daniel is in Atlanta?
3 01:49 Daniel's embarrassing moment
4 02:48 New AI devices
5 03:28 Where does privacy fit?
6 04:47 We've been giving data for years
7 06:59 Perception of humanity
8 09:27 A look at the rabbit r1
9 11:43 Why hardware?
10 13:32 Moving past our phones
11 14:46 Sponsor: Read Write Own
12 15:54 Explaining the LAM
13 20:28 Different inputs for models
14 23:10 Integrating external systems
15 26:06 Structuring an unstructured world
16 29:59 Sponsor: Shopify
17 32:35 Origins of LAMs
18 38:35 How instruments could be used as inputs
19 41:18 If this approach sticks
20 45:00 Predictions on LAMs
21 46:14 This has been fun!
22 47:27 Outro

Transcript

šŸ“ Edit Transcript


Play the audio to listen along while you enjoy the transcript. 🎧

Welcome to another Fully Connected episode of the Practical AI podcast. In these Fully Connected episodes we try to keep you up to date with everything that's happening in the AI and machine learning world, and try to give you a few learning resources to level up your AI game. This is Daniel Whitenack. I'm the founder and CEO of Prediction Guard, and I'm joined as always by my co-host, Chris Benson, who's a tech strategist at Lockheed Martin. How are you doing, Chris?

Doing very well, Daniel. Enjoying the day. And by the way, since you've traveled to the Atlanta area tonight, we haven't gotten together, but you're just a few minutes away, actually, so… Welcome to Atlanta!

Just got in. Yeah, we're within – maybe not a short drive, depending on your view of what a short drive is…

Anything under three hours is short in Atlanta, and I think you're like 45 minutes away from me right now.

Yeah, so hopefully we'll get a chance to catch up tomorrow, which will be awesome, because we rarely get to see each other in person. It's been an interesting couple of weeks for me. So for those that are listening from abroad maybe, we had some major ice and snow type storms recently, and my great and embarrassing moment was I was walking back from the office, in the freezing rain, and I slipped and fell, and my laptop bag, with laptop in it, broke my fall, which is maybe good… But that also broke the laptop. So – actually, the laptop works, it's just the screen doesn't work, so maybe I'll be able to resolve that…

It's like a mini portable server there, isn't it?

Yeah, exactly. You have enough monitors around, it's not that much of an issue… But yeah, I had to put Ubuntu on a burner laptop for the trip. So yeah… It's always a fun time. Speaking of personal devices, there's been a lot of interesting news and releases, not of – well, I guess of models, but also of interesting actual hardware devices related to AI recently. One of those is the Rabbit R1, which was announced and sort of launched to preorders with a lot of acclaim.

Another one that I saw was the AI PIN, which is like a little – I don't know, my grandma would call it a brooch, maybe. Like a large pin that you put on your jacket, or something like that… I am wondering, Chris, as you see these devices - and I want to dig a lot more into some of the interesting research and models and data behind some of these things like Rabbit… But just generally, what are your thoughts on this sort of trend of AI-driven personal devices to help you with all of your personal things, and plugged in to all of your personal data, and sort of AI attached to everything in your life?

Well, I think it's coming. Maybe it's here… But I know that I am definitely torn. I mean, I love the idea of all this help along the way. There's so many – I forget everything. I'm terrible. If I don't write something down and then follow up on the list, I am not a naturally organized person. My wife is, and my wife is always reminding me that I really struggle in this area. And usually she's not being very nice in the way that she does it. It's all love, I'm sure… But yeah, so part of me is like "Wow, this is the way I can actually be all there, get all the things done." But the idea of just giving up all my data, and just being – like so many others, that aspect is not appealing. So I guess I'm – I'm not leaping.

How different do you think this sort of thing is from everything we already give over with our smartphones?

It's a good point you're making.

I mean, we've had computing devices with us in our pocket or on our person 24/7 for at least the past 10 years; at least for those that have adopted the iPhone or whatever, when it came out. But yeah, so in terms of location, certainly account access and certain automations, what do you think makes – because obviously, this is something on the mind of the makers of these devices, because I think both the AI PIN and the Rabbit make some sort of explicit statements in their launch, and on their website about "Privacy is really important to us. This is how we're doing things, because we really care about this." So obviously, they anticipated some kind of additional reaction. But we all already have smartphones. I think most of us, if we are willing to admit it, we know that we're being tracked everywhere, and all of our data goes everywhere… So I don't know, what is it about this AI element that you think either makes an actual difference in terms of the substance of what's happening with the data? Or is it just a perception thing?

[00:06:11.02] It's probably a perception thing with me. Because everything that you said, I agree with; you're dead on. And we've been giving this data for years, and we've gotten comfortable with it, and that's just something that we all kind of don't like about it, but we've been accepting it for years. And I guess it's the expectation with these AI assistants that we've been hearing about for so long coming, and we're starting to see things like the Rabbit come into market, and such, that there's probably a whole new level of kind of analysis of us, and all the things, and in a sense knowing you better than you do, that is uncomfortable and probably will not be as uncomfortable in the years to come, because we'll grow used to that as well. But I have to admit, right now it's an emotional reaction, and it makes me a little bit leery.

Yeah, maybe prior to these sorts of devices there was sort of the perception, at least, that "Yes, my data is going somewhere. Maybe there's a nefarious person behind this, but there's sort of a person behind this. Like, the data is going all to Facebook or Meta, and maybe they're even listening in on me, and putting ads for mattresses in my feed", or whatever the thing is… So that perception has been around for quite some time, regardless of whether Facebook is actually listening in or whatever. Or it's another party, like the NSA and the government's listening in… But I think all of those perceptions really relied on this idea that even if there's something bad happening that I don't want happening with my data, there's sort of a group of people back there doing something with it. And now there's this sort of idea of this agentic entity behind the scenes that's doing something with my data, without human oversight. I think maybe that's – if there's anything sort of fundamentally different here, I think it's the level of automation and sort of agentic nature of this, which does provide some sort of difference. Although there's always like - you know, if you're processing voice or something, there's voice analytics, and you can put that to text, and… Then there are always NLP models in the background doing various things, or whatever. So there's some level of automation that's already been there, but…

I agree. You mentioned perception up front, and I think that makes a big difference… And you mentioned NSA. Intelligence agencies - I think we all just assume that they're all listening to all the things, all the time now, and that's one of those things that's completely beyond your control. And so there's almost no reason to worry about it, I suppose, unless you happen to be one of the people that an intelligence agency would care about, which I don't particularly think I am. So it just goes someplace and you just kind of shrug it off.

There's a certain amount of what we've done these years with mobile, where you're opting in. I think it's leveling up, as we're saying, with some of these AI agents coming out; we know how much data about ourselves is going to be there, and so it's just escalating the opt-in up to a whole new level. So hopefully we'll see what happens… I hope it works out well.

Yeah. We haven't really – for the listeners maybe that are just listening to this and haven't actually… Maybe you're in parallel doing the search and looking at these devices, but in case you're on your run, or in your car, we can describe it a little bit… So I described the AI PIN thing a little bit… The Rabbit I thought was a really, really cool design. I don't know if there are any nerds out there that love the sort of synthesizer, analog sequencer, Teenage Engineering stuff that's out there… But actually, Teenage Engineering was involved in the hardware design in some way.

[00:10:05.03] So it's like a little square thing, the Rabbit R1. It's got like one button you can push and speak a command; it's got a little actual hardware wheel that you can spin to scroll, and the screen is kind of just - they show it as black most of the time, but it pops up with the song you're playing on Spotify, or some of the things you would expect to be happening on a touchscreen, or that sort of thing… But the primary interface is thought to be, in my understanding, speech; not that you would be pulling up a keyboard on the thing and typing in a lot. That's kind of not the point. The point would be this sort of speech-driven conversational - and I'd even call it an operating system - conversational operating system to do certain actions or tasks, which we'll talk a lot more about the kind of research behind… But that's kind of what the device is, and looks like.

It's interesting that they're going with the device route, and the fact that they're selling the actual unit itself… And over the years we started on our computer, or we started on desktops, and then went to laptops, and then went to our phones… And the phones have evolved over time. And we've been talking about wearables and things like that over the years as they've evolved, but I think there's a little bit of a gamble in actually having it as a physical device, because that's something else that they're presuming you're gonna put at the center of your life. That versus being kind of the traditional phone app approach, where you're using the thing that your customer already has in their hands. What are your thoughts about the physicalness of this offering?

I think it's interesting… One of the points, if you watch the release or launch or promotion video for the Rabbit R1, he talks about sort of the app-driven nature of smartphones, and there's an app for everything… And there's so many apps now that navigating apps is kind of a task in and of itself. And that Silicon Valley meme, "No one ever deletes an app", right? So you just accumulate more and more apps, and they kind of build up on your phone, and now you have to organize them into little groupings, or whatever… So I think the point being that it's nice that there's an app for everything, but the navigation and orchestration of those various apps is sometimes not seamless, and burdensome. I'm even thinking about myself, and kind of checking over here – I got into Uber, oh, I forgot to switch over my payment on my Uber app, so now I've got to open my bank app, and then grab my virtual card number, and copy that over… But then I've got to go to my password management app to copy my password… There's all these sorts of interactions between various things that aren't as seamless as you might think they would be. But it's easy for me to say in words, conversationally, "Hey, I want to update the payment on my current Uber ride", or whatever. So the thought that that would be an easy thing to express conversationally is interesting; and then have that be accomplished in the background, if it actually works, is also quite interesting.


I agree with that. And I can't help but wonder - if you look back at the advent of the phone, and the smartphone, and the iPhone comes out, and it really isn't so much a phone anymore, but a little computer… And so the idea of the phone being the base device in your life has been something that has been with us now for over 15 years. And so one of the things I wonder is, could there be a trend where maybe the phone doesn't become – if you think about it, you're texting, but a lot of your texting isn't really texting, it's messaging in apps… Maybe the phone is no longer the central device in your life going forward, and maybe something else is actually your primary thing. That would obviously play into Rabbit's approach, where they're giving you another device; it packages everything together in that AI OS that they're talking about, where conversationally it runs your life, if you expose your life to it the way you are across many apps on the phone… But it's an opportunity potentially to take a left turn with the way we think about devices, and maybe in the not so distant future the phone is no longer the centerpiece.

Alright, Chris, well, there's a few things interacting in the background here in terms of the technology behind the Rabbit device, and I'm sure other similar types of devices that have come out. Actually, there's some of this sort of technology that we've talked a little bit about on the podcast before. I don't know if you remember we had the episode with AskUI, which - they had this sort of multi-modal model; I think a lot of their focus over time was on testing. A lot of people might test web applications or websites using something like Selenium, or something like that, that automates desktop activity or interactions with web applications… And actually automates that for testing purposes or other purposes. AskUI had some of this technology a while back to kind of perform certain actions using AI on a user interface without sort of hard coding; like, click on 100 pixels this way, and 20 pixels down this way. So that I think has been going on for some time.

This adds a sort of different element to it, in that there's the voice interaction… But then they're really emphasizing the flexibility of this, and the updating of it… So actually, they emphasize – I think some of the examples they gave is I have a certain configuration on my laptop or on my screen that I'm using with a browser, with certain plugins that make it look a certain way… And everything sort of looks different for everybody, and it's all configured in their own sort of way. Even app-wise, apps kind of are very personalized now, which makes it a challenge to say "Click on this button at this place." It might not be at the same place for everybody all the time. And of course, apps update, and that sort of thing.

So the solution that Rabbit has come out with to deal with this is what they're calling a large action model. And specifically, they're talking about this large action model being a neurosymbolic model. And I want to talk through a little bit of that. But before I do, I think we sort of have to back up and talk a little bit about AI models, large language models; ChatGPT has been interacting with external things for some time now, and I think there's confusion at least about how that happens, and what the model is doing… So it might be good just to kind of set the stage for this in terms of how these models are interacting with external things.

The way that this looks, at least in the Rabbit case, is you click the button and you say "Oh, I want to change the payment card on my Uber [unintelligible 00:19:07.21] and stuff happens in the background and somehow the large action model interacts with Uber, and maybe my bank app or whatever, and actually makes the update. So the question is how this happens. Have you used any of the plugins or anything in ChatGPT, or the kind of search the web type of plugin to a chat interface, or anything like that?

Absolutely. I mean, that's what makes the – I mean, I think people tend to focus on the model itself. That's where all the glory is, and people say "Ah, this model versus that." But so much of the power comes in the plugins themselves, or other ways in which they interact with the world. And so as we're trying to kind of pave our way into the future and figure out how we're going to use these, and how they're going to impact our lives, whether it be the Rabbit way, or whether you're talking ChatGPT with its plugins - that's the key. It's all those interactions, it's the touchpoints with the different things that you care about which makes it worthwhile. So yes, absolutely, and I'm looking forward to doing it [unintelligible 00:20:12.05]

[00:20:14.27] Yeah. So there's a couple of things maybe that we can talk about, and actually, some of them are even highlighted in recent things that happened, that we may want to highlight also. One of those is, if you think about a large language model like that used in ChatGPT, or NeuralChat, LLaMA 2, whatever it is… You put text in, and you get text out. We've talked about that a lot on the show. So you put your prompt in, and you get a completion, it's like fancy autocomplete, and you get this completion out. Not that interesting.

We've talked a little bit about RAG on the show, which means I am programming some logic around my prompt such that when I get my user input, I'm searching some of my own data or some external data that I've stored in a vector database, or in a set of embeddings, to retrieve text that's semantically similar to my query, and just pushing that into the prompt as a sort of grounding mechanism to sort of ground the answer in that external data. So you've got sort of basic autocomplete, you've got retrieval to insert external data via a vector database, you've got some multimodal input… And by multimodal models, I'm meaning things like LLaVA. And actually, this week there was a great paper - published on January 24th, I saw it in the daily papers on Hugging Face… "MM-LLMs: Recent Advances in MultiModal Large Language Models." So if you're wanting to know sort of the state of the art and what's going on in multimodal large language models, that's probably a much deeper dive that you can go into. So check that out, and we'll link it in our show notes.
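
To make the retrieval step Daniel describes more concrete, here's a minimal sketch of the RAG pattern in Python. It assumes the sentence-transformers package, and the model name, documents, and question are placeholders, not anything from the episode.

```python
# Minimal RAG sketch: embed a few documents, retrieve the one closest to the
# user's question, and paste it into the prompt as grounding context.
# Assumes the sentence-transformers package; the model name, documents, and
# question are placeholders for illustration only.
from sentence_transformers import SentenceTransformer
import numpy as np

docs = [
    "Our return policy allows refunds within 30 days of purchase.",
    "Support hours are 9am to 5pm US Eastern, Monday through Friday.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(question: str) -> str:
    """Return the stored text most semantically similar to the question."""
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec  # cosine similarity, since vectors are normalized
    return docs[int(np.argmax(scores))]

question = "How long do I have to return an item?"
context = retrieve(question)

# The grounded prompt that would then be sent to the LLM:
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```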

But these are models that would not only take a text prompt, but might take a text prompt paired with an image, right? So you could put an image in, and you say – also have a text prompt that says "Is there a raccoon in this image?" And hopefully the reasoning happens and it says yes or no if there's a –
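
For reference, a text-plus-image prompt can look roughly like the sketch below, assuming an OpenAI-compatible chat endpoint that accepts images (the interface LLaVA-style servers commonly expose). The base URL, API key, model name, and image file are all placeholders.

```python
# Sketch of a text + image prompt against an OpenAI-compatible multimodal chat
# endpoint (LLaVA-style servers commonly expose this interface). The base_url,
# api_key, model name, and image file are placeholders, not a real deployment.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="placeholder-key")

with open("backyard.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="llava-example",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Is there a raccoon in this image? Answer yes or no."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```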

Is there always a raccoon in the image?

There's always a raccoon everywhere… That's one element of this; that would be a specialized model that allows you to integrate multiple modes of data. And there's similar ones out there for audio, and text, and other things. So again, in summary, you've got text-to-text autocomplete, you've got this retrieval mechanism to pull in some external text data into your text prompt, you've got specialized models that allow you to bring in an image and text… All of that's super-interesting, and I think it's connected to what Rabbit is doing. But there's actually more to what's going on with, let's say when people perform actions on external systems, or integrate external systems with these sorts of AI models. And this is what, in the sort of Langchain world - if you've interacted with Langchain at all - they would maybe call tools. And you even saw things in the past like Toolformer and other models where the idea was "Well, okay, I have - maybe it's the Google Search API", or one of these search APIs, right? I know that I can take a JSON object, send it off to that API, and get a search result, right? Okay, so now if I want to call that search API with an AI model, what I need to do is get the AI model to generate the right JSON-structured output that I can then just programmatically - not with any sort of fancy AI logic, but programmatically - take that JSON object and send it off to the API, get the response, and either plug that in in a sort of retrieval way that we talked about before… And just give it back to the user as the response that they wanted, right?

[00:24:28.08] So this has been happening for quite a while. This is kind of - like, we saw one of these cool AI demos every week, where "Oh, the AI is integrated with Kayak now, to get me a rental car. And the AI is integrated with this external system." All really cool, but at the heart of that was the idea that I would generate structured output that I could use in a regular computer programming way to call an API, and then get a result back, which I would then use in my system. So that's kind of this tool idea, which is still not quite what Rabbit is doing, but I think that's something that people don't realize is happening behind the scenes in these tools.
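
Here's a rough sketch of that tool pattern: the model is asked for strictly JSON output, then ordinary code parses it and calls a web API. The search endpoint is a placeholder and the model name is only an example; none of this is specific to the products discussed.

```python
# Sketch of the tool pattern described above: ask the model for JSON only,
# parse it, and call an ordinary web API with plain code. The search endpoint
# is a placeholder, and any chat-completion LLM could stand in for the client.
import json
import requests
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM = (
    "You translate user requests into a JSON object for a search tool. "
    'Respond with only JSON shaped like {"tool": "web_search", "query": "..."}.'
)

def run_search_tool(user_request: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # example model; any instruction-following LLM works
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": user_request},
        ],
    )
    call = json.loads(resp.choices[0].message.content)  # the structured output

    # Plain programmatic API call; no AI involved in this step.
    r = requests.get("https://example.com/search", params={"q": call["query"]})
    return r.text

print(run_search_tool("Find recent articles about large action models"))
```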

I think that's really popular "in the enterprise", with air quotes there, because in large organizations they're going to the cloud providers with their APIs - Microsoft has the relationship with OpenAI and they're wrapping that, Google has their APIs - and they're using RAG in that same way, to try to integrate with systems, instead of actually creating the models on their own. I would say that's a very, very popular approach right now in enterprise environments that are still more software-driven, and still trying to figure out how to use APIs for AI models.

Yeah, and I can give you a concrete example of something we did with a customer at Prediction Guard, which is the Shopify API. So eCommerce customer - the Shopify API has this sort of Shopify – I think it's called ShopifyQL – query language. It's structured, and you can call the regular API via GraphQL. And so it's a very structured sort of way you can call this API to get sales information, or order information, or do certain tasks. And so you can create a natural language query and say "Okay, well, don't try to give me natural language out. Give me ShopifyQL, or give me something that I can plug into a GraphQL query, and then I'm going to go off and query the Shopify API, and either perform some interaction or get some data." So this is very popular. This is how you sort of get AI on top of tools.
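
As an illustration of that natural-language-to-query idea (not the actual Prediction Guard integration), the sketch below has a model emit a GraphQL query and then POSTs it to a Shopify-style Admin GraphQL endpoint. The shop URL, API version, access token, and model are placeholders.

```python
# Illustrative version of the pattern described above: have the model emit a
# GraphQL query instead of prose, then POST it to the API yourself. The shop
# URL, API version, access token, and model are placeholders, and this is not
# the actual integration described in the episode.
import requests
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Convert this request into a single Shopify Admin GraphQL query. "
    "Return only the query text, with no explanation or code fences.\n\n"
    "Request: list the 5 most recent orders with their total prices"
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name
    messages=[{"role": "user", "content": PROMPT}],
)
graphql_query = resp.choices[0].message.content  # in practice, validate this first

r = requests.post(
    "https://your-shop.myshopify.com/admin/api/2024-01/graphql.json",  # placeholder shop
    headers={"X-Shopify-Access-Token": "REPLACE_ME"},
    json={"query": graphql_query},
)
print(r.json())
```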

What's interesting, I think, that Rabbit observes in what they're saying, and others have observed as well… I think you take the case like AskUI, like we talked about before… And the observation is that not everything has this sort of nice structured way you can interact with it with an API.

So think about – pull out your phone; you've got all of these apps on your phone. Some of them will have a nice API that's well defined, some of them will have an API that me as a user, I know nothing about. There's maybe an API that exists there, but it's hard to use, or not that well documented, or maybe I don't have the right account to use it, or something… There's all of these interactions that I want to do on my accounts, with my web apps, with my apps, that have no defined structured API to execute all of those things.

[00:27:58.09] So then the question comes - and that's why I wanted to lead up to this - because even if you can retrieve data to get grounded answers, even if you can integrate images, even if you can interact with APIs, all of that gets you pretty far, as we've seen, but ultimately, not everything is going to have a nice structured API, or it's not going to have an API that's updated, or has all the features that you want, or does all the things you want. So I think the fundamental question that the Rabbit research team is thinking about is "How do we then reformulate the problem in a flexible way, to allow a user to trigger an AI system to perform arbitrary actions across an arbitrary number of applications, or an application, without knowing beforehand the structure of that application or its API?" So I think that's the really interesting question.

I agree with you completely. And there's so much complexity… They refer to it as human intentions expressed through actions on a computer. And that sounds really, really simple when you say it like that, but that's quite a challenge to make that work in an unstructured world. So I'm really curious - they have the research page, but I don't think they've put out any papers that describe some of the research they've done yet, have they?

Just in general terms… And that's where we get to the exciting world of large action models.

Somehow that makes me think of like Arnold Schwarzenegger.

Large action heroes.

There you go. Exactly. Yeah.

Yeah, Chris, so coming from Arnold Schwarzenegger and large action heroes to large action models… I was wondering if this was a term that Rabbit came up with. I think it has existed for some amount of time; I saw it at least as far back as June of last year, 2023, in Silvio Savarese's article on the Salesforce AI Research blog about "LAMs, from large language models to large action models." I think the focus of that article was very much on the sort of agentic stuff that we talked about before, in terms of interacting with different systems, but in a very automated way. The term large action model, as far as Rabbit refers to it, is this new architecture that they are saying they've come up with - and I'm sure they have, because it seems like the device works… We don't know, I think, all of the details about it; at least I haven't seen all of the details, or it's sort of not transparent in the way that maybe a model release would be on Hugging Face, with code associated with it, and a long research paper… Maybe I'm missing that somewhere, or listeners can tell me if they've found it. I couldn't find that.

They do have a research page though, which gives us a few clues as to what's going on, and some explanation in kind of general terms. And what they've described is that their goal is to observe human interactions with a UI, and there seems to be some sort of multimodal model that is detecting what things are where in the UI… And they're mapping that onto some kind of flexible, symbolic, synthesized representation of a program.

So the user is doing this thing - so I'm changing the payment on my Uber app, and that's represented or synthesized behind the scenes in some sort of structured way, and kind of updated over time as it sees demonstrations, human demonstrations of this going on. And so the words that they – I'll just kind of read this, so people, if they're not looking at the article… They say "We designed the technical stack from the ground up, from the data collection platform to the new network architecture", and here's the sort of very dense, loaded wording that probably has a lot packed into it… They say "that utilizes both transformer-style attention, and graph-based message passing, combined with program synthesizers, that are demonstration and example-guided." So that's a lot in that statement, and of course, they mentioned a few, in more description, in other places. But it seems like my sort of interpretation of this is that the requested action comes in to the system, to the network architecture, and there's a neural layer… So this is a neuro-symbolic model.

[00:36:01.18] So there's a neural layer that somehow interprets that user action into a set of symbols, or representations that it's learned about the UI; the Shopify UI, or the Uber UI, or whatever. And then they use some sort of symbolic logic processing of this sort of synthesized program to actually execute a series of actions within the app, and perform an action that it's learned through demonstration.

So this is sort of what they mean, I think, when they're talking about neurosymbolic. So there's a neural network portion of this, kind of like when you put something into ChatGPT, or a transformer-based large language model, and you get something out. In the case of - we were talking about getting JSON-structured output when we're interacting with an external tool, but here it seems like you're getting some sort of thing out, whatever that is - a set of symbols, or some sort of structured thing - that's then passed through symbolic processing layers, that are essentially symbolic and rule-based ways to execute a learned program over this application. And by program here, I think they mean – they reference a couple of papers, and my best interpretation is that they mean not a computer program in the sense of Python code, but a logical program that represents an action, like "Here is the logical program to update the payment on the Uber app. You go here, and then you click this, and then you enter that, and then you blah, blah", you do those things. Except here, those programs - the synthesized programs - are learned by looking at human intentions, and what they do in an application. And that's how those programs are synthesized.
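
Since Rabbit hasn't published implementation details, here is a purely illustrative toy of the neural-to-symbolic split as interpreted above: a stand-in "neural" classifier maps an utterance to a symbol, and a rule-based executor walks a learned program of UI steps. Every name and step in it is invented for the sketch; this is not Rabbit's architecture.

```python
# Toy illustration of the neural-to-symbolic split as interpreted above: a
# stand-in "neural" classifier maps free-form speech to a symbol, and a
# rule-based executor walks a learned "program" (an ordered list of UI steps).
# Every action name, step, and the classify() stub are invented for this sketch;
# this is not Rabbit's architecture.
from dataclasses import dataclass

@dataclass
class UIStep:
    description: str  # e.g. "tap the Payment menu item"

# Symbolic programs, imagined here as having been learned from human demonstrations.
PROGRAMS = {
    "uber.update_payment": [
        UIStep("open the Uber app"),
        UIStep("tap Account, then Payment"),
        UIStep("select the new card and confirm"),
    ],
}

def classify(utterance: str) -> str:
    """Stand-in for the neural layer: map an utterance to a symbolic action."""
    return "uber.update_payment" if "payment" in utterance.lower() else "unknown"

def execute(utterance: str) -> None:
    symbol = classify(utterance)
    for step in PROGRAMS.get(symbol, []):
        # A real system would drive the UI here; we just print the steps.
        print(f"[{symbol}] {step.description}")

execute("change the payment card on my Uber ride")
```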

So that was a long one – I don't know how well that held together, but that was my best interpretation, at this point, without seeing anything else, from a single sort of blog post…

When you can keep me quiet for a couple of minutes there, it means you're doing a pretty good job. I have a question I want to throw out, and I don't know that you'll be able to answer it, obviously, but it's just to speculate… While we were talking about that, and thinking about multimodal, I'm wondering - the device itself comes with many of the same sensors that you're going to find in a cell phone these days… But I'm wondering if that feeds in more than just the speech. And it obviously has the camera on it, it comes with a [unintelligible 00:38:50.14] I can't say the word. GPS, accelerometer, and gyroscope. And obviously – so it's detecting motion, and location, all the things; it has the camera, it has the mic… How much of that do you think is relevant to the large action model in terms of inputs? Do you think that there is potentially relevance in the non-speech and non-camera concerns on it? Do you think the way people move could have some play in there? I know we're being purely speculative, but it just caught my imagination.

Yeah, I'm not sure. I mean, it could be that that's used in ways similar to how those sensors are used on smartphones these days. Like, if I'm asking Rabbit to book me an Uber, to here, or something like that, right? Now, it could infer the location maybe of where I am, based on where I'm wanting to go, or ask me where I am. But likely, the easiest thing would be to use a GPS sensor, to know my location and just put that as the pin in the Uber app, and now it knows.

So I think there's some level of interaction between these things. I'm not sure how much, but it seems like, at least in terms of location, I could definitely see that coming into play. I'm not sure on the other ones.

[00:40:16.10] Well, physically, it looks a lot like a smartphone without the phone.

Yeah, a smartphone – a different sort of aspect ratio, but still kind of touchscreen. I think you can still pull up a keyboard, and that sort of thing. And you see things when you prompt it. So yeah, I imagine that that's maybe an evolution of this over time, as sensory input of various things. I could imagine that being very interesting in running, or fitness type of scenarios. If I've got my Rabbit with me, and I instruct Rabbit to post a celebratory social media post every time I keep my mileage, or my time per mile, at a certain level, or something, and it's using some sort of sensors on the device to do that. I think there's probably ways that will work out [unintelligible 00:41:15.12]

It'll be interesting if this approach sticks - and I might make an analogy to things like the Oura ring for health; wearing that, and then competitors started coming out, and then Amazon has their own version of a health ring that's coming out. Along those lines, you have all these incumbent players in the AI space that are, for the most part, very large, well-funded cloud companies, and in at least one case, a retail company blended in there… And so if this might be an alternative, in some ways, to the smartphone being the dominant device, and it has all the same capabilities, plus more, and they have the LAM behind it to drive that functionality, how long does it take for an Amazon or a Google or a Microsoft to come along after this and start producing their own variant? …because they already have the infrastructure that they need to produce the backend, and they're going to be able to produce – Google and Amazon certainly produce frontend stuff quite a lot as well. So it'll be interesting to see if this is the beginning of a new marketplace opening up in the AI space as an [unintelligible 00:42:29.29]

So there's already really great hardware out there for smartphones, and I wonder if something like this is kind of a shock to the market. But in some ways, just as phones with external key buttons sort of morphed into smartphones with touchscreens, I could see smartphones that are primarily app-driven in the way that we interact with them now being pushed in a certain direction because of these interfaces. So smartphones won't look the same in two years as they do now, and they won't follow that same sort of app-driven trajectory like they are now, probably because of things that are rethought… And it might not be that we all have Rabbits in our pocket, but maybe smartphones become more like Rabbits over time. I'm not sure. I think that's very likely a thing that will happen.

[00:43:37.20] It's also interesting to me - it's a little bit hard to parse out for me what the workload is like between what's happening on the device and what's happening in the cloud, and what sort of connectivity is actually needed for full functionality with the device. Maybe that's something - if you want to share your own findings on that in our Slack community at Changelog.com/community, we'd love to hear about it.

My understanding is there is at least a good portion of the LAM and the LAM-powered routines that are operating in a centralized sort of platform and hardware. So there's not this kind of huge large model running on a very low-power device that might suck away all the energy… But I think that's also an interesting direction - how far could we get, especially with local models getting so good recently, with fine-tuned, local, optimized, quantized models doing action-related things on edge devices in our pockets, that aren't relying on stable and high-speed internet connections… Which also, of course, helps with the privacy-related issues as well.

I agree. By the way, I'm going to make a prediction… I'm predicting that a large cloud computing service provider will purchase Rabbit.

Alright, you heard it here first. I don't know what sort of odds Chris is giving, or… I'm not gonna bet against him, that's for sure. But yeah, I think that's interesting. I think there will be a lot of action models of some type, whether those will be tool-using LLMs, or LAMs, or SLMs, or whatever; whatever we've got coming up.

And they could have named it a Lamb instead of a Rabbit, I just wanna point out. They're getting their animals mixed up.

Yeah, that's a really good point. I don't know if they came up with Rabbit before LAM, but maybe they just had the lack of the b there… But I think they probably could have figured out something.

Yeah. And the only thing that could have been in [unintelligible 00:45:59.01] is a raccoon, of course. But that's beside the point. I had to come around full circle there.

Of course, of course. We'll leave that device up to you as well. [laughs] Alright. Well, this has been fun, Chris. I do recommend, in terms of - if people want to learn more, there's a really good research page on Rabbit.tech, rabbit.tech/research, and down at the bottom of the page there's a list of references that they share throughout, that people might find interesting as they explore the technology. I would also recommend that people look at Langchain's documentation on tools… And also maybe just check out a couple of these tools. They're not that complicated. Like I say, they expect JSON input, and then they run a software function and do a thing. That's sort of what's happening there. So maybe check out some of those in the array of tools that people have built for Langchain, and try using them. So yeah, this has been fun, Chris.
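
For anyone curious what "expect JSON input, then run a software function" looks like, here is the basic shape in plain Python. It deliberately avoids LangChain's own decorator API so it doesn't depend on a particular version; the tool name and function are made up for the example.

```python
# The basic shape of a "tool": JSON-like input in, ordinary code runs, a text
# result comes back for the model. Plain Python meant to mirror the pattern,
# not LangChain's exact decorator API; the tool name and function are made up.
import json

def get_weather(city: str) -> str:
    """Pretend weather lookup; a real tool would call an actual API here."""
    return f"It is 20 degrees C and clear in {city}."

TOOLS = {"get_weather": get_weather}

def dispatch(tool_call_json: str) -> str:
    call = json.loads(tool_call_json)  # e.g. JSON produced by an LLM
    return TOOLS[call["name"]](**call["arguments"])

print(dispatch('{"name": "get_weather", "arguments": {"city": "Atlanta"}}'))
```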

It was great. Thanks for bringing the Rabbit to our attention.

Yeah. Hopefully see you in person soon.

Thatā€™s right.

And yeah, we'll include some links in our show notes, so everyone, take a look at them. Talk to you soon, Chris.

Have a good one.


Our transcripts are open source on GitHub. Improvements are welcome. 💚
