Practical AI – Episode #239

Automate all the UIs!

with Dominik Klotz, Co-Founder & CTO of AskUI

Dominik Klotz from askui joins Daniel and Chris to discuss the automation of UI, and how AI empowers them to automate any use case on any operating system. Along the way, the trio explore various approaches and the integration of generative AI, large language models, and computer vision.

Sponsors

Statsig – Build faster with confidence. Startups to Fortune 500s rely on Statsig to make data-driven decisions. Ship smarter and faster with the unified platform for feature flags, experimentation, and analytics. Our listeners get free white-glove onboarding, migration support, and 5 million free events per month.

Changelog News – A podcast+newsletter combo that’s brief, entertaining & always on-point. Subscribe today.

Fastly – Our bandwidth partner. Fastly powers fast, secure, and scalable digital experiences. Move beyond your content delivery network to their powerful edge cloud platform. Learn more at fastly.com

Fly.io – The home of Changelog.com. Deploy your apps and databases close to your users. In minutes you can run your Ruby, Go, Node, Deno, Python, or Elixir app (and databases!) all over the world. No ops required. Learn more at fly.io/changelog and check out the speedrun in their docs.


Chapters

1. 00:05 – Welcome to Practical AI (00:30)
2. 00:35 – Sponsor: Statsig (03:25)
3. 04:12 – askui (03:37)
4. 07:49 – What does data look like? (02:04)
5. 09:53 – Tying in classification (01:58)
6. 11:51 – Range of uses (03:45)
7. 15:37 – Many platforms, 1 approach (01:05)
8. 16:42 – askui’s practical approach (03:40)
9. 20:22 – Deploying (01:49)
10. 22:11 – Daniel’s bad idea? (03:39)
11. 25:49 – Handling input & output (01:43)
12. 27:32 – Other tests (01:32)
13. 29:04 – Sponsor: Changelog News (01:41)
14. 30:45 – Facing AI challenges (04:48)
15. 35:33 – Getting started (02:03)
16. 37:37 – Roadmap challenges (01:55)
17. 39:32 – Future vision (02:00)
18. 41:32 – Conclusion (00:36)
19. 42:16 – Outro (00:44)

Transcript

Play the audio to listen along while you enjoy the transcript. 🎧

Welcome to another episode of Practical AI. This is Daniel Whitenack. I’m the founder of Prediction Guard, and I’m joined as always by my co-host, Chris Benson, who is a tech strategist at Lockheed Martin. How are you doing, Chris?

Doing well today, Daniel. How are you?

I can’t complain. It’s like these two weeks in the Midwest in the United States that I enjoy each year, where it’s like between hot and really, really cold… And so yeah, I’m really excited about that. And yeah, just cranking away… I’ll be at the Intel Innovation Conference next week, which is going to be a fun experience to see some of what they’re doing in the AI space, and talk to them about some of the stuff we’ve been trying on Intel hardware… So yeah, it’s a lot of good stuff coming up. I’m really excited, actually… It was through that Intel community that I met Dominik from askui, which is our guest today. Welcome, Dominik. It’s great to have you here.

Hey. Thanks, Dan, for having me.

Yeah. Well, askui - could you tell us a little bit about first maybe just what is it about UI that askui is concerned with, and how did you get to start thinking about UI and automation?

Maybe a little bit of background on what we are doing… We try to free humans from being robots. What does that mean in particular? You have repetitive tasks on user interfaces, and what we are trying to do is bridge the gap between having to be a programmer versus being able to describe in natural language what you want to do - your intention, what you want to automate. For example, if you want to log in to your Facebook account, with us you can say “Please click the Login button, please fill out this, this and this credential” in natural language, which also allows non-technical people to automate the user interface and lowers the barrier to getting started with automation. The second question was a little bit about…?

[00:06:27.19] Why did you start caring about this problem?

A little bit of background about this… My background is in software development. I was working at Siemens before I founded askui with Jonas, my co-founder. What I was doing there was plant automation systems, where we had to test everything, because our systems had to run 365 days a year; everything had to be tested well enough, so we were [unintelligible 00:06:58.14] test everything through. Then it switched a little bit to [unintelligible 00:07:03.22] the same thing - you have to test, you have to understand your application directly and test it. And then I moved to a new organization which was trying to modernize things and bring agility within Siemens. I learned a lot about Scrum, agility, all the new stuff and tools… But I also had a problem, because Selenium and other tools couldn’t solve the pain of writing user interface tests, and I was thinking “Hey, can we not solve this with AI? Because AI can understand visual information and natural language information.” So I thought, “Hey, can’t we combine those?” And with that, the journey started.

This might be a completely ignorant question, but I just want it for my own context as well… Now in my background as a data scientist, I’ve participated many times in the fun activity of web scraping. And through that, I know there’s a lot of ways in which web data or UI data might be exposed. So could you talk a little bit about what does the data look like? What does the problem look like as you’re trying to get an AI model to interact, or use an AI model or a natural language input to an AI model to then interact with either a UI or a web page, especially when a lot of those things can be quite varied in terms of how they’re built? Is it more of a visual thing, or is it something else?

Yeah, it’s really on the visual part. For example, when you’re talking about Selenium - Selenium only works with web interfaces, and you cannot use it, for example, on Android, and so on. What we are really doing is taking a screenshot of the system. That means we can take a screenshot of the application, or take a screenshot directly of the operating system, and then the AI model that we trained can detect the user interface elements; that means we’re detecting buttons, we’re detecting text, we’re detecting text fields, we’re detecting checkboxes, we’re detecting icons… So we have trained a model which can understand the visual representation of a user interface. And then we connected our natural language part to it, to match the intention that you have. For example, click on the Login button [unintelligible 00:09:33.25] and then we move the mouse there. So the concept is a little bit different versus connecting directly to the application and trying to scrape the source code of the application to get your information. Maybe this is the best way to differentiate.

So you’re kind of starting with a screenshot instead of doing web scraping.

Correct.

And then just doing classification on the screenshot itself to do that.

Correct. So we have an object detection model in place, which really takes a screenshot and detects all the elements on it.
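To make the loop Dominik describes concrete, here is a minimal TypeScript sketch of the screenshot-to-action cycle: grab a screenshot, send it with the natural-language instruction to an inference backend, and act on the element it returns. Every type and function name here is a hypothetical stand-in for illustration; this is not askui’s actual API.

```typescript
// Illustrative sketch only -- all names here are hypothetical, not askui's real API.
interface DetectedElement {
  kind: 'button' | 'text' | 'textfield' | 'checkbox' | 'icon';
  label: string; // e.g. the text rendered on a button
  x: number;     // element center, in screen coordinates
  y: number;
}

// Stub: a local controller would capture the whole OS screen here.
async function takeScreenshot(): Promise<Uint8Array> {
  return new Uint8Array(); // placeholder
}

// Stub: the inference backend runs object detection over the screenshot
// and matches the instruction ("click on the Login button") to one element.
async function inferTarget(screenshot: Uint8Array, instruction: string): Promise<DetectedElement> {
  return { kind: 'button', label: 'Login', x: 640, y: 360 }; // placeholder
}

// Stub: the controller would move the real OS mouse pointer and click.
async function moveMouseAndClick(x: number, y: number): Promise<void> {}

// The cycle from the conversation: screenshot -> detection + matching -> action.
async function executeInstruction(instruction: string): Promise<void> {
  const screenshot = await takeScreenshot();
  const target = await inferTarget(screenshot, instruction);
  await moveMouseAndClick(target.x, target.y);
}

executeInstruction('Click on the Login button');
```

Because the model works purely on pixels, the same cycle applies whether the screenshot comes from a browser, a legacy Windows app, or an Android device.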

[00:10:08.17] As a quick follow-up, when you do the classification and identify what you have in the screenshot, how do you tie that in with the desired test? So you have a collection of tests that you’re trying to automate… How are you tying “I’ve got a screenshot, I have a button, I have a text field…” into a particular unit test, or something?

Currently, we have our TypeScript application, which you can download, and which is also available as an askui npm package… With this you can get started directly: install it, and then you can write a standard test, where you have, for example, [unintelligible 00:10:42.24] test block, in which you can write “askui.click.withtext” and so on. In the background, what happens is that we have a controller installed on your local system, which connects to the operating system, takes a screenshot, and also has the ability to move the mouse. We take the screenshot, connect it to the instruction - the “click on button”, for example - send this to our inference backend, get the result back, and then we move the mouse there.

And with these single steps, as you would do in Selenium, you can then write your workflows or tests to automate every operating system. We currently run on Windows, on Mac, and on iOS. And especially for legacy applications in the Windows environment, we can test those applications as well. We are not limited to the web. Does this answer the question?

It does. Thank you. I appreciate it - that was a good explanation.
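Dominik’s “askui.click.withtext” hints at a fluent TypeScript API. As a rough sketch, a Jest-style workflow might look like the following; the helper import and the exact method chains are approximations, so treat askui’s own documentation as authoritative:

```typescript
// Approximate sketch of an askui-style UI test in Jest/TypeScript.
// 'aui' and the fluent method names are illustrative, not the verified API.
import { aui } from './helpers/askui-helper'; // hypothetical setup wiring up the local controller

describe('Login workflow', () => {
  it('logs in through the visual UI', async () => {
    // Each step sends a screenshot plus the instruction to the inference
    // backend, then drives the real mouse and keyboard.
    await aui.typeIn('jane@example.com').textfield().exec();
    await aui.click().button().withText('Login').exec();
  });
});
```

Because each step bottoms out in screenshots and mouse movement, the same test code can, in principle, drive a web page, a native Windows dialog, or a mobile screen.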

So you obviously had a certain perspective when you came to this problem, because you had worked at Siemens and had thought about this sort of automation… I’m wondering, as you’ve developed this technology and seen others who might have had this need or experienced this pain, what is the range of things that you’re seeing people either do or want to do with this kind of automation technology? Sometimes it might fit the use case that you had in mind initially, but sometimes it might be new things that you didn’t think about previously.

Currently, because we’re using AI technology, you’re a little bit more flexible in what you can describe - and of course, it can be learned from data, so you don’t have to explicitly describe every single thing you want to do [unintelligible 00:12:45.11] We started out wanting to do regression tests for classical user interface tests; then we moved a little bit in the direction of robotic process automation… And I think where we’re heading in the future is automatically transferring unstructured data to user interfaces. So imagine you have a PDF, a form, and then you want to say “Hey–” Normally, regular workers deal with this, copying, for example, information from the PDF over to another form. What you will be able to do with our technology is say “Hey, please automatically copy all the information you see on the left side over to the right side.” This is where we think the technology will go, so that you have [unintelligible 00:13:42.05] the matching of the files, everything. When you then also connect large language models, which have the ability to understand language a little better, a little closer to how we as humans define it, then you can build such systems.

[00:13:58.22] Yeah, that’s really interesting. I have a friend who works in the same office building that we’re in here, and he stopped by and we were talking today… He takes screenshots of his work every so often as he goes through the day, because he can’t always remember all the things he was doing - the screenshots are a sort of historical record of the various interfaces and what he was doing. I had that in my mind as you were talking through all of this. I’m like “Hey, this would be really cool for him.” He could potentially query certain things about his day, and these interfaces, and what he did, and create potential automations off of those.

That is also a use case we’re thinking about - a recorder in the background that collects repetitive tasks and then says, for example, “Hey, we’ve detected that you have done this five times over the last week. Should we automate this for you?”

There are also cases where the technology can [unintelligible 00:15:06.15] because we have the understanding of user interfaces, and with a lot of engineering around that, we can build such nice systems.

Automation and AI seem to scare people, for some maybe justified reasons, some unjustified - I don’t know what’s justified… But that’s an automation where - why wouldn’t I want that, to save myself some time? I would love to automate some of my repetitive tasks.

So one of the things I’m wondering, having looked at your website, is - you talk about all the different platforms: web apps, enterprise apps, and everything. Are you able to take the same approach across the different platforms that you’re targeting? Is that what makes it so flexible - doing screenshots and such? Or how do you break down the challenge of going from addressing one initial platform to spreading out across the other platforms?

The main technology is accessing the screenshot from the operating system and controlling it. If you try to do that, there is a lot of open source software already available that can do this. Once you have such technology accessible, you can take the screenshot. For example, we have Android support, where we can take a screenshot, and it’s the same model and the same technology behind it as we use on the Windows operating system or on the Linux operating system. We have one general model which can solve all the tasks.

So Dominik, I’m already thinking of a lot of use cases that maybe I might want to automate and interact with this system… And one of the things we were chatting about prior to actually hitting the Record button on this episode was maybe the unique approach that askui has taken in terms of more of a software engineering approach to understanding how these machine learning and AI systems work, and utilizing them in this sort of like more practical software engineering approach to this. I’m wondering if you could talk about that in a little bit more detail, and how you’ve approached these problems that you think is maybe unique, or at least represents your perspective on how to build systems like this?

Yeah. First of all, what I see, or what we saw in the past - the research area has built a lot of models and released them to the public, but then it stopped there. After you have published your paper, you have no interest anymore in bringing it to production. What we also see, beginning in the 2020s, is that a lot of new applications are coming up which try to solve this, to formalize it as software patterns, as I would call it. At the beginning, when I started with machine learning, I was wondering “Are there no software patterns - like a metric pattern, a trainer pattern, other kinds of patterns which you can reuse and communicate in a better way?” The reason we are good at software development nowadays is that we have standardized the patterns we use, and use them everywhere. So this is what I was always searching for a little bit, and also the tooling.

[00:18:32.13] And how did we approach this from the machine learning side - we have a few different teams. The first thing we did was build an application which used our model directly. We just plugged in the best first model we could produce - we started with a model that had only five images as a training dataset for object detection, which is not a lot. But we did this to prove that it’s possible end-to-end. And we [unintelligible 00:19:04.14] a lot of things away. Then we went out to the customers, saying “Hey, can you work with this?”, and they complained, “Yeah, it’s nice, but your object detection model doesn’t work so well.” So for the next round we collected more data, trained a better model, and went out to the customer again: “Is this enough now?” And they were like “Okay, it’s going in the right direction”, but we could only support one application at the beginning.

And then we iterated on that, and tried to connect all our things together. What this means, on the other side, is that we started by going directly to the customer, and then improving everything. And now we are making [unintelligible 00:19:47.09] This was hard at the beginning, because as I mentioned, in 2020 all the tools [unintelligible 00:19:54.11] everything came up which you could reuse. And now we’re migrating once more to the data pipeline, and trying to bring everything to the customer, so that they can also train by themselves. But this is our software engineering approach, where we say “Hey, bring everything to the customer, let the customer complain”, and then do the next iteration step. So that’s the lesson.

So as a follow-up, I want to ask you to flip what you just covered a little bit: what is it like to engage from the customer’s perspective? You described it from your perspective, so let’s take it over to theirs… If you’re a customer and you start to use askui and deploy it, what does that picture look like? What does the customer go through as they start deploying or utilizing the service?

There’s also a little bit of history to it, but as it is now: you go to our website, log in with your credentials, and then you can directly upload your first screenshot and simulate on it. So you do a simulation directly on the screenshot, so that you see the a-ha effect. Then, as the next step, once you have created your workflow, you want to automate it - and then schedule it, for example, in a Docker container in the background, which we already have, if you’re in a work environment. And then after some time you get the result, and you’re happy that you have automated the workflow with really simple steps.

What we are doing now is also reducing the hurdle, so that you have to learn less [unintelligible 00:21:37.09] Because this is the thing - we’re trying to take all the problems away from the user’s perspective, so that the user has a really easy life creating automations, maintaining automations, scheduling all the stuff, setting up the whole testing environment… Because it’s not only the automation part you’re interested in; you’re also interested in “Where can I schedule it?” and [unintelligible 00:22:02.00] “How can I connect data?” and so on. This is what we’re currently doing.

[00:22:09.18] That’s awesome. I want to propose what I think is probably a bad idea, but I want to get your reaction to it. So I’m wondering, there’s this way now that you’ve enabled people to automate their interactions with various UIs. And there’s plenty of cases where I don’t really care to interact with the UI, but I do need to accomplish a task. But I also – like, for example, one of these that I struggle with all the time is like AWS and its interface, which is just like, you can do everything right, but it’s super-hard to understand how to do anything. Is there an opportunity, let’s say – and again, I think this is probably from the start a bad idea, but let’s say I just gave some sort of agent, tied it to askui, my credit card information and whatever, and just said “Hey, I want you to create an AWS account and spin up this infrastructure and do this. And then when you’re done, give me the URL, so I can access my – like, here’s the GitHub repo; tell me when it’s ready, and here’s my URL.” Now, I imagine that would also need to tie into like other external knowledge, like the documentation from AWS, or something… But is this type of scenario anything that like you’ve been talking about internally, or see as things that might come about in the future? As soon as you start automating things around UIs, there’s the one side of it, which I think we’ve talked about a lot, which is automating the things you’ve already done with UIs. But what if you want to do things with UIs that you haven’t done yet, but don’t really care to learn how to do?

This is also one of our plans - to leverage that. We have already played around a little bit with large language models, giving them all the documentation to do this, and we also tried to translate, for example, a Google documentation and create out of it [unintelligible 00:24:14.03] This was working quite well, and the direction will go there. So in the future, we can do this.

There was another part… That was the intention part: “Please create me an EC2 instance, and here’s my credit card information.” That was the intention: “Please do this. You can do this.” But the main problem with this is human communication - if you talk to your colleague, or to someone from another country and so on, you will always have communication hurdles. So you always have to have the tiny steps in between, where you can then correct things a little bit as you go. That’s one thing. Another thing is, can we be trusted with credit card information? To that I have to say - you’re also submitting your security tokens for your AWS access to GitHub, or GitLab. There are standards already in place, which are also validated, that can handle this, and that is what we have to follow. We also have businesses - enterprise customers - which want these standards, so we have to be compliant with them. So from our side, it’s no problem, you can trust us. [laughter]

But the other way around, you also have the possibility - because it’s nice - to create the tests online, download everything, and execute it locally on your machine, without any access on our side. You can also keep all the information - all the secrets, I would say - on your device, and have the guarantee that they’re not leaked or anything.

Yeah, that was another thing I was going to ask, is like - hey, if I’m going to a site… Let’s just say – I’ll give an example, because I love the product and what they’re doing… So Chris, you remember we had Josh from Coqui on, talking about their voice studio, and all that…

[00:26:03.04] …and I’ve been using that recently… You go in, you can input your text - a sentence - and synthesize a voice, and then you can change the language or something, and then go through to export the file… So there’s not just UI interaction here; there’s input data and output data. That might include things like passwords, and it might also include things like “Hey, I generated a file with this UI. Where does that go?” Like, I clicked Download, or something… I don’t know. What are the best practices as you’re automating these things? How should a customer think about these inputs and outputs, and how do you handle them in terms of what you’re building?

I would follow up that question with: how do you normally do integration tests or end-to-end tests? You always have the same data there. So first of all, we recommend customers always try to use synthetic or generated data; don’t leak production data into it, because [unintelligible 00:27:01.17] what you learn when you start automating and testing: don’t use production data. And the other thing is that you have to apply security standards. There are environment variables or secret files which you can inject into it, and you can use our security function so that you’re not sending us anything related to it. This is what I would recommend to customers.
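One way to follow that advice in a TypeScript workflow is to inject credentials through environment variables instead of hard-coding them. A minimal sketch, with the same caveat that the fluent askui-style calls are illustrative:

```typescript
// Sketch: pull secrets from the environment rather than committing them.
import { aui } from './helpers/askui-helper'; // hypothetical setup, as above

const username = process.env.TEST_USERNAME; // injected by CI or a local shell
const password = process.env.TEST_PASSWORD; // never checked into the repo

if (!username || !password) {
  throw new Error('Set TEST_USERNAME and TEST_PASSWORD before running this workflow');
}

it('logs in with injected credentials', async () => {
  await aui.typeIn(username).textfield().exec();
  await aui.typeIn(password).textfield().exec();
  await aui.click().button().withText('Sign in').exec();
});
```

Combined with synthetic test accounts, this keeps production data and real secrets out of both the repository and anything sent to a remote inference service.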

I’m curious - we’ve been talking about testing for a while, and most of the use cases have been unit tests. Are there any other types of testing, like integration testing, that you’re able to do? Is there a workflow for those kinds of things? Or is this really focused on the screenshot itself, where each test stands alone? Is there any way to tie them together?

Yeah, you can tie them together. We are just a library which you can use in TypeScript. In the future, we will also support Python and other languages to make this technology more broadly available. But you can also combine this with Selenium, or some other techniques. You can connect to a Mongo database, get the data out, process it [unintelligible 00:28:12.06] to another system, and so on. We are really flexible in these ways, because of our main concept - and this is maybe another thing that’s unique on our side: we think that low-code user interface automation up to a certain level is nice, because you can give that ability to other people. But at a certain point you reach the limit, and then you need developers to build some nice stuff - to automate, or to connect, for example, to MongoDB or some other sort of system. In that case, there’s always the possibility to go from our low-code view to our code view and insert code directly. So there’s no problem, you can do that. And you can also install other libraries.
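As a concrete example of dropping from the low-code view into the code view, here is a hedged sketch that reads test records from MongoDB with the standard mongodb Node.js driver and feeds each one into a UI workflow. The driver calls are the real mongodb package API; the askui-style calls remain illustrative:

```typescript
// Sketch: mixing an ordinary TypeScript library (mongodb) with UI automation.
import { MongoClient } from 'mongodb';
import { aui } from './helpers/askui-helper'; // hypothetical setup, as above

interface Customer {
  name: string;
  email: string;
}

async function enterCustomersIntoLegacyApp(): Promise<void> {
  const client = new MongoClient(process.env.MONGO_URL ?? 'mongodb://localhost:27017');
  try {
    await client.connect();
    const customers = await client
      .db('testdata')
      .collection<Customer>('customers')
      .find()
      .toArray();

    for (const customer of customers) {
      // Visual steps against whatever app is on screen -- web or legacy desktop.
      // Field selectors are simplified; a real workflow would target each field.
      await aui.typeIn(customer.name).textfield().exec();
      await aui.typeIn(customer.email).textfield().exec();
      await aui.click().button().withText('Save').exec();
    }
  } finally {
    await client.close();
  }
}
```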

Break: [00:29:06.22]

I always like to ask this question of guests who have manifested some new idea that’s driven fundamentally by AI and machine learning: as you were building out this product, what challenges did you find? You already alluded to this a little bit with “Oh, researchers release all of these models, and then what happens after that? They’re not really supported, or maybe they die off”, and other things. So what specifically were the machine learning or AI challenges that you faced as you were trying to make this work? You alluded a little to the data side of things, and adding data over time… But I imagine there’s much, much more than that. So what are some of the things that stand out? Just practical things you faced in trying to apply this technology to a real-world automation problem.

Yeah, [unintelligible 00:31:44.00] at the beginning. Previously I was a software engineer, and I had no clue about machine learning. I had a little bit of theoretical knowledge, but theory is nice - in practice it’s completely different… I had no clue that, for example, if [unintelligible 00:31:58.13] the learning rate, you can bring your model to convergence. Such things I struggled with at the beginning; also how to connect the layers to make everything work. But then we solved that problem, and the next place where you start to struggle is making the experiments visible. You have to generate the right metrics, and see and understand what the model is learning and what it is not learning. Then, when you have solved that challenge for yourself - so you have tried out TensorBoard - you have the next challenge: “How can I manage the data? How can I increase the data? How can I version the data?” And then you come to the question of how you can do repeatable experiments, so that you can say “Hey, we have made progress, step by step.” And for this you have to search a lot for what tools are out there and which tools are good… So [unintelligible 00:32:52.08]

Then the next step is that you figure out you have messed everything up, and your code out there is total s**t, so you have to think a little bit about “How can I structure the code?” This is what I mentioned previously with patterns. So you look at other repositories, at how other developers structured their code so that it’s more maintainable and more reusable. And then you come across, for example, PyTorch Lightning [unintelligible 00:33:20.08] to build up the models in a modular way.

And then you’re not just one developer anymore - you’re two developers, two machine learning researchers. Then you have to communicate, you have to exchange data. So you start copying and pasting data and sending it over Slack, or something. And then you have to say “Hey, it’s totally stupid what we’re doing. We need a data platform.”

And then, going through it step by step, you reach full expertise, and you say “Now we need complete data search, we need a metrics system, this is how we exchange data between the teams, this is how we label data.”

[00:34:01.15] For example, another challenge is not only exchanging and getting data; there’s also the challenge of labeling good data. So you look at the labeling tools, and you figure out “Yeah, the labeling tools are nice for standard use cases, but sometimes - especially in our case, because we have five different models which are trained together in a nice way - you have to label different kinds of data for different models.” Of course, those model types fit this use case perfectly, so you start thinking about how you can improve the labeling process. So now we are building a new labeling tool based on Streamlit, so that we can easily connect, for example, our inference part, do a little bit of pre-labeling, automate stuff, and improve everything. And then you remember that at the beginning you once talked to some guy who always said “If you do machine learning, you will end up building a labeling tool” [unintelligible 00:34:52.29]

[laughs] You’ve reached the pinnacle.

But this is a journey, and I think if somebody asked me now how to start, I would answer in a completely different way [unintelligible 00:35:07.03]

That’s literally what I was about to ask you, because that was a fantastic journey you just took us on, through all the practicalities… You solve one problem, and you hit the next, and you hit the next… And you described coming into this completely new, and taking it all the way to being very, very productive. So that’s what I want to ask you. You said “I wouldn’t do it the same way.” There is at least one person out there right now, if not many, who is thinking about AI - they may have dabbled in it, maybe not… They have an idea for a startup, they’re listening to you, and they’re going “That’s the guy who started doing this, but I have my own idea.” What would you tell them? How do you get started, to get going? Because this is a daunting field to break into.

Do you mean only the machine learning part, or to start a startup based on machine learning?

Mostly the machine learning. How did you learn – it’s a skill set, it takes time to digest, and it’s constantly evolving… How did you digest that skill set so that you could be productive?

Okay. First of all, I would recommend directly introducing tools which support you. Use Hugging Face directly, and try to build models based on Hugging Face, so that you have libraries based on PyTorch Lightning, because they give you a lot of things for free.

The other thing is that I would directly introduce a version control system for data. We are currently using [unintelligible 00:36:43.26] but I would now recommend DVC from the beginning. And then - especially for the machine learning part - try to find one researcher and one software engineer, bring them together, and let them communicate, because then you get the efficiency from software engineers - especially cloud software engineers - and you get the research knowledge, and by bringing them together they can learn from each other and exchange ideas about how to do things in a software manner and how to do research.

If I were starting again, I would now say “Hey, here you have one team” - two people: one software person with a software development and DevOps background, and a strong machine learning researcher alongside them. I think that setup would benefit them the most.

Yeah. And you talked a little bit about your journey in terms of the technical side of learning about these tools, and also about bringing on more people… As you look forward to the next steps in the roadmap - and I love how you publish your roadmap on your site, which is really cool - what do you feel are the challenges you’re facing right now? Does it have to do with “Oh, now what do I do with all this generative AI stuff, and how does that factor into our product?” Or does it have to do with “How do we make these models better and support a wider set of use cases?” Or is it a combination, or something completely different?

[00:38:23.21] I would say it’s not so much about the technical challenges, because technical challenges you can solve with knowledge and a little bit of research. Normally, if it’s not physically impossible, you can solve things within a certain time. That is [unintelligible 00:38:39.10] no problem. The main problem I see is speeding up the development process itself, so that the right things are researched, the right things are designed, and the right things get built - so that you bring more focus onto one topic. And if you have a lot of people in your company, or a lot of, I would say, interfaces, then you have to bring them to one common understanding of what you want to achieve, how you want to work, and how you want to define a requirement. That, I would say, is currently the main challenge. And then on the technical side, we have to talk to customers, get the feedback, build the stuff as quickly as possible, and iterate on what the business wants.

So as we kind of wind up here – and this is pretty typical; we love to get kind of the benefit of your insight, not only for these short-term practicalities, but a little bit of the dreaming. You’ve come this far in this journey that you’ve described, and you now have this capability that didn’t exist, as an entrepreneur; as you’re looking at the future, what are the ideas - maybe speculative; it doesn’t have to be based in the realities of what we have today. Where do you want to go with this? What do you envision building over the next couple of years? You can pick the horizon - two years, five years, whatever you think. So when you lay in bed at night, you’re like “That’s the place I’m going eventually.” What does that look like?

Maybe to look back a little bit at the history of how this project started - I don’t think I’ve told this before… When I was at Siemens, I also did my Master’s thesis on visual question answering, which was meant to solve, as one end-to-end task, what we are doing now in separate tasks. So my dream is to take the technology that’s now available - large language models, including the visual part and the natural language part - and combine everything, so that we can teach, or bring, every kind of data into the model. What does “every kind of data” mean? It means we take, for example, software manuals which say “Hey, please click on this button.” Then a person comes to you and says “Hey, please create me an account in this one.” You give it a submenu, and then it’s done completely automatically, without any kind of extra learning, because it has learned how to interact with the operating system. This is one place we want to go - to bring today’s technology in, to make it accessible for users… And so that everyone, really everyone, can use it. Also your grandpa. [laughs]

That’s great. I am excited to see some of those things come down the line. And I think one of the things I’ve enjoyed about this conversation is that you’ve brought a lot of the sort of positive side of automation, that is really so helpful to technical people, but also other people that are doing these tasks that they really actually don’t want to do, or can’t scale to a certain point. So I think it’s really awesome, and - yeah, I’m looking forward to seeing your future work with askui. Thank you so much for joining the podcast. Really appreciate it, Dominik.

Thank you for having me.

Our transcripts are open source on GitHub. Improvements are welcome. 💚
