We've seen a rise in interest recently, and a number of major announcements, related to local LLMs and AI PCs. NVIDIA, Apple, and Intel are getting into this space, along with models like the Phi family from Microsoft. In this episode, we dig into local AI tooling, frameworks, and optimizations to help you navigate this AI niche, and we talk about how this might impact AI adoption in the longer term.
Daniel Whitenack: Yeah. I was at a conference last week, and I was asked which direction things would be going - local AI models, or hosted in the cloud… And I think the answer is definitely both, in the same way that - if you just think about databases, for example, as a technology - there's a place for embedded local databases that operate where an application operates, there's a place for databases that run at the edge, but on a heavier compute node that serves some environment, and there's a use case for databases in the cloud. And sometimes those even coexist, for various reasons.
In this case, we're talking about AI models. So I have a bunch of files on my laptop; I may not want those files to leave my laptop, so it might be for privacy reasons that I want to search those files or ask questions of those files with an AI model. So it's a privacy and security type of thing; or in a healthcare environment things may have to be air-gapped, or offline; or a public utilities sort of scenario, where you can't be connected to the public internet… But then it might also just be because of latency or performance, inconsistent networks, or flaky networks, where you have to operate sort of online/offline… There's a whole variety of reasons to do this. But yeah, there's also a lot of ways that, as you said, this is rapidly developing, and people are finding all of these various ways of running models at the edge. And we can highlight - if you're just getting into this now, and getting into AI models, maybe you've used OpenAI's endpoint, or you've used an LLM API… If you wanted to run a large language model, or an AI model, on your laptop, there are a variety of easy ways to do that. I know a lot of people are using something like LM Studio - this is just an application that you can run and test out different models…
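To make that switch from a hosted API to a local one concrete: LM Studio (like several of the tools discussed next) can expose an OpenAI-compatible server on your machine, so existing client code often only needs its base URL changed. A minimal sketch, assuming LM Studio's local server is running on its default port (1234) with some model already loaded; the model identifier below is a placeholder:

```python
# Sketch: pointing the OpenAI Python client at a local, OpenAI-compatible
# server (e.g. LM Studio's local server mode) instead of api.openai.com.
# Assumes the server is running on localhost:1234 with a model loaded;
# the model name is a placeholder for whatever your server reports.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # local endpoint instead of the hosted API
    api_key="not-needed",                 # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="local-model",  # placeholder model identifier
    messages=[{"role": "user", "content": "Summarize the files I point you at."}],
)
print(response.choices[0].message.content)
```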
There's a project called Ollama, which I think is really nice and really easy to use. You kind of just spin it up; you can either use it as a Python library, or as a server that's running on your local machine, and interact with Ollama as you would an LLM API. And then there's things like llama.cpp, and a bunch of other things. These I would categorize as local model applications or systems, where there's either a UI, or a server, or a Python client that's geared specifically towards running these models locally.
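As a rough illustration of that Ollama workflow - a local server you talk to like an LLM API, or a thin Python client on top of it - here is a hedged sketch. It assumes Ollama is installed, its server is running locally, and a model has already been pulled (the llama3 name is just an example):

```python
# Sketch: chatting with a locally running Ollama server via its Python client.
# Assumes the Ollama server is running and the model has been pulled,
# e.g. `ollama pull llama3` (model name is only an example).
# Equivalently, you could POST the same payload to the local REST API.
import ollama

response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Why would I run an LLM locally?"}],
)
print(response["message"]["content"])
```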
And then there's a whole set of technologies - Python libraries, or optimization or compilation libraries - that might take a model that's maybe bigger, or not suited to run in a local or lower-power environment, and make it run locally.
[00:10:03.11] So if you're using the Transformers library from Hugging Face, you might use something like bitsandbytes as a library to quantize models and shrink them down… There's optimization libraries like Optimum, and MLC, OpenVINO… These have all been around for some period of time. Actually, I think in the past we've had the Apache TVM project on the show, and we talked about OctoML… So this is not a new concept, because we've been optimizing models for various hardware for some time. But these optimization or compilation libraries are also usually hardware-specific, so you optimize for a specific hardware target. Whereas some of these local model systems are more general-purpose, less optimized for specific hardware. I don't know if you've had a chance to try out any of these systems, Chris, running some models on your laptop…
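For a sense of what that quantize-and-shrink route looks like in code: with Transformers, bitsandbytes can load a model's weights in 4-bit form to cut its memory footprint. A minimal sketch, assuming a CUDA-capable GPU (bitsandbytes targets CUDA) and using a small Phi checkpoint purely as an example; any similar causal LM ID would do:

```python
# Sketch: loading a model in 4-bit with Transformers + bitsandbytes.
# Assumes a CUDA-capable GPU and a reasonably recent Transformers version
# (older releases may need trust_remote_code=True for Phi-3 checkpoints).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"  # example checkpoint; swap in your own

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit form
    bnb_4bit_compute_dtype=torch.float16,  # run the matmuls in fp16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # place layers on the available GPU(s)
)

inputs = tokenizer("Local models are useful because", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```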