Practical AI – Episode #94
Operationalizing ML/AI with MemSQL
with Nikita Shamgunov, CEO of MemSQL
A lot of effort is put into the training of AI models, but, for those of us that actually want to run AI models in production, performance and scaling quickly become blockers. Nikita from MemSQL joins us to talk about how people are integrating ML/AI inference at scale into existing SQL-based workflows. He also touches on how model features and raw files can be managed and integrated with distributed databases.
Welcome to Practical AI. This is Daniel Whitenack. I’m a data scientist with SIL International. Normally, I’d be joined by my co-host, Chris Benson, who is a principal AI strategist at Lockheed Martin, but he’s in the midst of some family health-related things, so he’s taking the time that he needs… But we’re definitely excited to chat about a really interesting topic today.
Actually, in our Slack channel I remember some conversation a couple of weeks ago where we were discussing the issue of “Hey, I trained my model, it works great on my data. I evaluate and it all seems good. But then when I try to integrate this into code, the performance is actually really terrible and it’s kind of a mismatch between production things…” And I think that we’re gonna be able to get into some of those things today.
Today we have as our guest Nikita Shamgunov, who’s the CTO of MemSQL. I’m really excited to talk to you today, Nikita. Welcome.
Happy to be here. Like Daniel said, my name is Nikita. Actually, I’m co-CEO and founder of MemSQL. I don’t mind the confusion; I started as a CTO, and I took over as CEO in 2017…
…and recently, about a year ago, brought a co-CEO, Raj Verma, with the thinking that we’re gonna take the company public.
Oh, gotcha. On that note, why don’t you give a little bit of maybe the background, first of yourself, and then we can get into maybe a little bit of the background of MemSQL? I think that would be great context.
Definitely. So I’ve spent my career in data management, and databases specifically. I came to the United States after finishing my grad school in St. Petersburg, Russia, and joined the SQL Server Engine team. So I went from a kind of very research-oriented life and work, to basically system engineering. When you build databases, it’s actually a very different cadence, versus using databases.
When you use databases, you think about things like performance, you think about SQL as the API to the database, and then you think about reliability and uptime. And when you build a database, you think about quality; you think about the life of somebody who is using the database and how you make that life easier… And obviously, you think about performance and scalability and how the database user or developer can achieve that performance and scalability.
It sounds like an interesting transition from the academic world to the systems engineering world. Was it a hard shift for you, or was that focus on the user and reliability - was that something that you were already passionate about going into that work?
[04:12] I was very passionate about engineering in general. What I loved about building databases is that the product, the database engine, is like computer science in a box. It has algorithms, it has data structures, it has system engineering, you interface with networking I/O, CPU, caches… You need to be aware of the computer architecture in order to build world-class software. So that certainly resonated a lot. And that was the core premise of why I wanted to start working on a world-class industry product. And then, from there, the passion for the user came in, and over time just that curiosity about building new things, and breaking ground, and entrepreneurship came through the years, while working at SQL Server. Mind you, during that time Microsoft was going through a cloud transition. Everything that we’re seeing today at scale, at that time all of that stuff was being born, being conceived, and major architectural choices were made. Some of them were right, some of them were not right.
So that’s my big-company background. And then I switched and joined Facebook. In fact, one of the premises of me joining Facebook was not to make a lot of money, because that was pre-IPO, 2010, but to actually move to Silicon Valley and meet the kind of people I would later start a company with. And what happened is that as I walked into Facebook on day one, I met my future co-founder, Eric. And relatively shortly after, just within 6-8 months, we started MemSQL [unintelligible 00:05:52.03]
Oh, that’s a wild ride, I guess… Moving to a total new place, experiencing Facebook and that culture, especially at that sort of stage, and then founding something… So maybe describe a bit how that happened so fast, the idea for MemSQL and the motivation that this was something that was really needed. How did that occur?
Yeah, so distributed systems at SQL Server - we always knew that was the future, especially as you go into the cloud transition… And at the time, back in 2008-2010, Microsoft had a flagship product, SQL Server, which is a single-node database system. Very good, very powerful; I’m really proud I worked on this one. The main competitor, Oracle, had distributed systems at its disposal, Oracle Exadata and Oracle RAC. And the way the database market is structured is that the top-tier workloads that have high performance requirements, high availability requirements, do require distributed systems, and Microsoft didn’t have that. So that top of the market Microsoft was losing to Oracle then. Actually, I think they mostly caught up right now. But then, architecturally, single-node databases are very hard to turn into a distributed system. That was one of the moonshot projects that my co-founder and CTO Adam worked on at Microsoft, and that moonshot project didn’t succeed.
When I walked into Facebook, the need for distributed systems became apparent, because every Facebook workload is that high-level, high-end workload… And sometimes it’s from the reliability standpoint, sometimes it’s from the scale standpoint; most of the time it’s from a scale standpoint… Because you know, back in 2010 I think Facebook was on the march to cross a billion active users. So that was on everyone’s mind, that was everyone’s goal, “How can we cross a billion active users?” and obviously, history shows that Facebook blew through that goal quite successfully.
[08:02] Were they trying to architect something internally to deal with that, or was it sort of an open problem when you were there?
It’s more than one system, for something that is so big as Facebook. It turns out that all the data workloads split into categories. Some of them are data lakes… Hadoop basically got a lot of advancements at Facebook. Some of them are operational, powering Facebook.com, and multiple data management technologies are on the critical path, between typing facebook.com and actually seeing the newsfeed; there’s a separate data management solution for messaging, separate for the newsfeed, and the list goes on. And within that also there’s a whole bunch of point solutions for various analytical workloads. One is for time series… In fact, there’s a startup called SignalFx who took some of the ideas, and then the folks from Facebook left and started that company that was recently acquired by Splunk.
And then there was a system called Scuba that gives you real-time analytics, and a lot of ideas there influenced the MemSQL roadmap as well. So long story short, lots and lots of data management systems and data management workloads inside Facebook, but each and every one is a distributed system. So that pervasiveness of a distributed system [unintelligible 00:09:23.03] it really validated the thinking that the future of database systems is distributed… And that’s how we started MemSQL.
And we started this as in-memory, hence the name… Now in fact the name is kind of limiting, because MemSQL has evolved way past being in-memory and in-memory only. It’s the version one that was in-memory and single node, but very quickly we expanded to a distributed system, built tiered architecture from memory to disk, and now we expanded into S3 or other object stores.
Yeah, I’d be interested to hear a little bit about – I mean, you kind of gave us a sense of the initial founding and some of the initial ideas… I’d be curious as far as right now, with MemSQL, could you just give a high-level view of the sorts of things that people are turning to MemSQL for, the consistent things that you see really people getting value out of, and then maybe some of the newer things that are enabling maybe new sorts of workloads that you didn’t even anticipate in those early days?
Definitely. First of all, databases are a very long game… And the most successful database products on the planet, which are Postgres, MySQL, SQL Server and Oracle, are all 30+ years old, and we still use them today, which is – basically, if you turn to any other piece of technology, that’s not the case. Technology is very transient; we’re building something, and something new comes in and completely disrupts what was there before… But databases seem to stay for a long time.
Yeah. I think in my experience from working at the different places I’ve encountered Postgres a lot. Right now I’m working on a team that’s using SQL Server for certain things… Of course, I’ve encountered certain things like Mongo or other databases, like the NoSQL, or those sorts of databases, but I think – you were talking about the user experience, and it always seemed like to me the natural user experience… And you gained a lot of power with that SQL interface to the database.
The relational – yeah, for sure. Yeah, so the vision is a single pane of glass to all your data and all your workloads. But when you start peeling the layers and understanding what it would take to deliver on that vision, you start understanding how you scale storage, how you scale compute, and how you scale both storage and compute, from your low-latency operational workloads (think about powering your apps and loading your web page, which all need to come back to you in ideally sub-100 milliseconds) to running what are called “big, expensive” analytical queries that scan large volumes of data to give you insights.
[12:14] Those insights could be reports, those insights could be analytical information, which is also called decision support… You need to make a decision, so you need to know what works, what doesn’t, you need to know how your sales are doing in this state versus the other state, this product versus that product… And so that is a continuous process of evaluating and looking at data, and understanding – driving insights out of data.
So the interesting piece about what I’ve just described is that for your operational needs, you need a SQL database, like Postgres, like MySQL, like SQL Server. I mean, you don’t need it; you can use MongoDB, you can use a NoSQL database, but you need an operational database, let’s just put it this way. And for the majority of workloads today people are using relational databases that speak SQL; and for a smaller part of the market they use NoSQL databases, which is more about preference and user experience and whatnot.
And then for an analytical system, people use data warehouses - Teradata, Snowflake, BigQuery, and the interface to those databases is also SQL. And what you just started the podcast with is like “Oh, I trained my model against the data that sits in the data lake, or a data warehouse, and now I need to put it in production. I have data quality, data consistency issues, my performance is not the same…” A lot of that comes from the underlying data management. And if you really peel the layers, it comes from the fact that you run this very same model on top of data and data management systems that are different.
And one can argue that “Well, they’ve got their reasons for them to be different”, but there’s a more contrarian viewpoint here that is “We live in the world of clouds, where things are abstracted away from you”, and that gives an opportunity to build ideally a serverless interface that speaks SQL, and that gives you access to all your data. It gives you access to all your data for reporting capabilities, for low-latency capabilities for operational workloads… And that would allow you to never leave your data universe as you go and move from one workload to another. And that can be huge, because, to take your example, whatever data you used to train the model lives in that ocean of data, and that data is easily accessible to you. And then you trained the model, so now you need to convert new data that is incoming, or marry new data to old data and convert it into pixels, which is your app, or your website, or whatever… And that can be done right there, off of the same dataset you’ve been operating on… Which is certainly not the case today.
Today you have a data lake, a data warehouse, and a number of operational databases that can be integrated by a third piece of software, like ETL tools or integration tools, which just generates a lot of complexity… And a lot of that can be simplified if you imagine a world of having a serverless, SQL, low-latency API to all your data. That’s the vision we’re driving towards, and this is a multi-year, probably multi-decade – MemSQL is nine years old today, so it’s gonna be a multi-decade kind of life’s work. But the workloads that we see emerging, and the new workloads that are enabled by a system like this are real-time analytics and real-time decision support, when you need to go back and look at the history of what was happening to make a real-time decision, and do it at scale as well. So that is something that we see a lot in financial markets…
[16:20] MemSQL is a give-or-take 40-million-dollar run rate company with 70% growth. We’ve just had an article on TechCrunch that revealed our numbers… And a good amount of that revenue is coming from financial markets. And if you think about it, that’s what happens there; there’s a constant stream of information that’s coming in, modifying that data state that you have, and you need to make decisions about buy/sell, you need to make decisions in wealth management, you need to make decisions in portfolio management and trading… But also, you need to make decisions in various systems that for example monitor something that’s very large. In Morgan Stanley for example it’s a trading system, and we monitor this trading system, providing decision support to “Oh, should we provide some sort of maintenance? Should we re-route our trades?” All of those things. That’s what MemSQL is used for today, and we didn’t have a system on the market that had those capabilities before.
I definitely liked where you were going in terms of describing the sort of single-window to all of your data via the SQL interface… And I know that we talked a little bit, so we kind of touched on the AI and machine learning elements of this, and how they fit in… You were talking about going on this journey to create the single interface to all of your data… How did AI and machine learning workloads start to cross your path at MemSQL and start to be something that you felt like needed to be part of the strategy of how you were building out this system?
Yeah, this is a great question. When we did our analysis, we discovered that about 20% to 30% of all the workflows that MemSQL supports have some sort of machine learning or AI angle to this.
[20:04] So this is a very large number. And when we looked at it, we always wanted to have dedicated AI capabilities in the system, and we certainly used AI internally to make certain decisions around workload management, query optimization… But the fact that the modern workloads - and obviously people put modern workloads on MemSQL - have a lot of AI and ML capabilities was eye-opening to us.
In those cases that you noticed, was it like people that were - like you were saying - using MemSQL to do large queries to prepare their training data for an AI model?
And two specific examples - we have a great integration with Spark; we have a MemSQL-Spark connector that gives you very fast data exchange between MemSQL and Spark. Fast meaning cluster to cluster, multi-channel bus between the two.
We noticed that people put all their data into MemSQL, grab it through Spark, store models somewhere else. So we don’t take part in hosting models. So this is the first part. And what people like about MemSQL is that two-way path for data exchange between MemSQL and Spark. If you have something in Spark, you can persist it in MemSQL; if you have something in MemSQL, you can pull it into Spark… And MemSQL is a world-class [unintelligible 00:21:29.15] processing engine, so you can send a SQL query to it to do the first pass, and slice and dice data before it gets fed into training algorithms… Which MemSQL itself doesn’t support; it’s just the backbone for that data.
And the second use case that we started to see being pronounced is people build apps on top of MemSQL. And those apps have models, evaluate models real-time, and usually there’s some sort of an SLA for an app either displaying this information to the end user, or the app is completely back-office and they’re just crunching data; for that, they need to pull data from somewhere, run this data against a model, and based on the results that you see from that model, do something. A typical example is fraud.
We do in-transaction fraud detection for some of the major banks, where the SLA is 40 milliseconds to make a decision if that particular transaction is fraudulent or not. And in order to make that decision, you need to go – you have a model; that model’s already trained… Then you need to grab some data for that specific account, go back and look at the previous 1,000 transactions, feed those transactions against the model, and then the model will tell you if it’s a fraudulent transaction or not. So MemSQL is supporting use cases like this.
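Editor's sketch: the flow described above (look up an account's last 1,000 transactions, score the new one against a model, decide within a latency budget) might look roughly like this. It is a minimal illustration under stated assumptions, not MemSQL's or any bank's implementation: `sqlite3` stands in for the operational database, and the table layout, the `is_fraudulent` rule and the `decide` helper are all hypothetical.

```python
import sqlite3
import time

# sqlite3 is only a stand-in for the operational SQL database in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, account_id INTEGER, amount REAL)"
)
# Hypothetical history: account 7 usually spends between 20 and 29 units.
conn.executemany(
    "INSERT INTO transactions (account_id, amount) VALUES (?, ?)",
    [(7, 20.0 + (i % 10)) for i in range(1500)],
)


def fetch_history(conn, account_id, limit=1000):
    """Return the most recent `limit` amounts for one account."""
    rows = conn.execute(
        "SELECT amount FROM transactions WHERE account_id = ? ORDER BY id DESC LIMIT ?",
        (account_id, limit),
    ).fetchall()
    return [row[0] for row in rows]


def is_fraudulent(history, amount, factor=5.0):
    """Toy rule standing in for a trained model: flag amounts far above the average."""
    if not history:
        return False
    mean = sum(history) / len(history)
    return amount > factor * mean


def decide(conn, account_id, amount, sla_ms=40.0):
    """Fetch history, score the transaction, and report whether the budget was met."""
    start = time.perf_counter()
    history = fetch_history(conn, account_id)
    verdict = is_fraudulent(history, amount)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return verdict, elapsed_ms, elapsed_ms <= sla_ms


normal = decide(conn, 7, 25.0)
suspicious = decide(conn, 7, 5000.0)
print(normal[0], suspicious[0])
```

The timing check only measures the lookup plus scoring inside one process; a real deployment would also have to budget for network hops and the model call itself.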
Yeah, that’s really interesting.
So again, both sides of the spectrum, both just providing basically data lake or data warehouse capabilities, with all your data in one place, let data scientists play with that data and use whatever data science tools, the tools du jour; should it be Spark –
Yeah. Most talk SQL.
Yeah, yeah. Should it be Spark, should it be Pandas, should it be TensorFlow, or PyTorch, whatever. We provide very fast data exchange to whatever frameworks you use.
And the second one is “Oh, I wanna put my model into production, so I’m gonna register that model somewhere, either in Kubernetes, SageMaker…” There are tools for that now, and it’s a rapidly-evolving space. But it all starts with data anyway, so you need to have a data backbone and you need to have a data management system with system of record capabilities in order to provide uptime, low latency, all of those things.
[24:01] And where it’s going is we’re thinking to keep building world-class integrations with systems that both data scientists use for training, and engineers use for putting models into production, to enable that exchange from a push-button standpoint… Given you have a model, put that model somewhere, tell MemSQL about that model, and you’ll be able to consume that model either from SQL, through user-defined functions, or through an application that queries the model and queries the data, with the application providing the glue.
Okay. Yeah, I was curious about that piece… It sounds like right now this sort of workflow is you have an application like a Python application or whatever it is, you load your serialized model into memory, and then when it’s time to fulfill a user request, then you make a SQL query against MemSQL, get the data you need, run it through your model and respond to the user. Is that about right?
Yeah, that’s how it works today… And where it’s heading is this will still be probably 50% of the use cases, because certain things you still wanna control and write very custom logic… But we want to make MemSQL aware of models that are stored in a particular repository, and to be able, through SQL, to run data through those models and return results back into MemSQL.
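Editor's sketch: one way to picture "consuming a model from SQL through a user-defined function" is to register a scoring function with the database and call it inside a query. The example below uses Python's built-in `sqlite3` and its `create_function` call purely as a stand-in; it does not show MemSQL's UDF mechanism, and `model_score`, the table names and the weights are hypothetical.

```python
import sqlite3


def model_score(visits, spend):
    """Hypothetical model: a fixed linear score standing in for a trained model."""
    return 0.1 * visits + 0.01 * spend


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call it by name (sqlite3 API, not MemSQL's).
conn.create_function("model_score", 2, model_score)

conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, visits INTEGER, spend REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 10, 100.0), (2, 2, 50.0), (3, 30, 400.0)],
)

# Run the data through the model and store the results back in the database,
# all from a single SQL statement.
conn.execute("CREATE TABLE scores (customer_id INTEGER, score REAL)")
conn.execute("INSERT INTO scores SELECT id, model_score(visits, spend) FROM customers")

scores = conn.execute(
    "SELECT customer_id, score FROM scores ORDER BY customer_id"
).fetchall()
print(scores)
```

The point of the design is that the application no longer pulls rows out, scores them and writes them back; the query itself carries the data through the model.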
Yeah, that’s really interesting.
Yeah. The reason that’s useful is that sometimes you want to run that model against a very large volume of data… So if your application row-by-row pulls data from a database, runs it against the model, gets the results, potentially stores it back in the database, that is an extremely inefficient way. But what you can do is you can establish, similarly to the Spark connector, a multi-channel bus with optimized data formats - we’re thinking Apache Arrow, or something like this - where running a model against a billion records should be a one or two-second proposition.
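Editor's sketch: the contrast between row-by-row scoring and bulk exchange can be illustrated by moving data in chunks, so that each round trip carries many records. This is only a sketch of the batching idea, assuming `sqlite3` as a stand-in database and a hypothetical `score_batch` model; it does not use Apache Arrow or any real connector.

```python
import sqlite3


def score_batch(values):
    """Hypothetical model applied to a whole chunk; a real framework would vectorize this."""
    return [2.0 * v + 1.0 for v in values]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, value REAL)")
conn.execute("CREATE TABLE scores (record_id INTEGER PRIMARY KEY, score REAL)")
conn.executemany(
    "INSERT INTO records (id, value) VALUES (?, ?)",
    [(i, float(i)) for i in range(1, 10001)],
)


def score_in_batches(conn, batch_size=1000):
    """Read rows in chunks by key range, score each chunk, write each chunk back at once."""
    last_id = 0
    round_trips = 0
    while True:
        rows = conn.execute(
            "SELECT id, value FROM records WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        ids = [row[0] for row in rows]
        results = score_batch([row[1] for row in rows])
        conn.executemany(
            "INSERT INTO scores (record_id, score) VALUES (?, ?)",
            list(zip(ids, results)),
        )
        last_id = ids[-1]
        round_trips += 1
    return round_trips


round_trips = score_in_batches(conn)
total = conn.execute("SELECT COUNT(*) FROM scores").fetchone()[0]
print(round_trips, total)
```

With 10,000 records and a chunk size of 1,000, the loop makes ten read/write exchanges instead of ten thousand, which is the kind of saving a bulk data bus is meant to provide at much larger scale.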
Yeah, that’s awesome. I’m thinking of like a facial recognition use case or something like that, where you may want to compare the embedded representation of this image against thousands and thousands and thousands, or maybe even millions of records that you have in your database, that are reference faces from your facial recognition, or something like that. Am I following the right path here?
You are, and we have use cases like this where people store feature vectors in the database. In a way, people run this use case in MemSQL in a do-it-yourself kind of way. MemSQL supports vector/tensor operations as built-ins… And obviously, a facial recognition model is often, though not all the time, represented as kind of a TensorFlow DAG that evaluates, and the individual nodes in the DAG are vector math. They’re not something that’s spectacularly complex. It’s a vector dot product, also known as a scalar product. So MemSQL does that, and we have customers that in production do facial recognition over millions of faces, to enable things like, you know, when somebody walks into a supermarket and I want to custom-tailor the experience for that person. Or security systems in the airports.
What happens is there’s a camera. The camera looks at the next face. The custom logic extracts a feature vector out of the new face, and then you run a query against MemSQL that says “Give me all the records where a vector dot product, a feature vector is stored in the database, multiplied by the vector you just received, is between 0.9 and 1.” And that gives you all the similar faces.
[28:06] And because MemSQL is a distributed system, even though it’s a brute force way of doing it, there’s no index. You just go literally run the dot product against millions of faces stored in the database… But because everything is so tightly optimized, you can still run this within 50 to 100 milliseconds.
And that’s running in production, and like I said, for both government security use cases, as well as things like walking into a grocery store, and the system suggests [unintelligible 00:28:37.29] are not in the system right now, but go buy this/something else.”
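Editor's sketch: the query described above ("give me all the records where the dot product of the stored feature vector and the new vector is between 0.9 and 1") can be mimicked in a few lines. This assumes unit-length feature vectors, so the dot product behaves like a cosine similarity. `sqlite3` is a stand-in database, the `dot_product` function registered here is a hypothetical Python UDF rather than a database built-in, and the three-dimensional vectors are made up for illustration.

```python
import json
import math
import sqlite3


def normalize(vec):
    """Scale a vector to unit length so the dot product acts as a cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot_product(a_json, b_json):
    """Hypothetical UDF standing in for a database's built-in vector dot product."""
    a, b = json.loads(a_json), json.loads(b_json)
    return sum(x * y for x, y in zip(a, b))


conn = sqlite3.connect(":memory:")
conn.create_function("dot_product", 2, dot_product)
conn.execute("CREATE TABLE faces (id INTEGER PRIMARY KEY, name TEXT, features TEXT)")

# Made-up reference vectors; real face embeddings have hundreds of dimensions.
reference = {
    "alice": [1.0, 0.0, 0.0],
    "bob": [0.0, 1.0, 0.0],
    "carol": [0.9, 0.1, 0.0],
}
conn.executemany(
    "INSERT INTO faces (name, features) VALUES (?, ?)",
    [(name, json.dumps(normalize(vec))) for name, vec in reference.items()],
)


def similar_faces(conn, query_vec, low=0.9, high=1.0):
    """Brute-force scan: score every stored vector against the query, keep close matches."""
    query = json.dumps(normalize(query_vec))
    rows = conn.execute(
        "SELECT name FROM faces "
        "WHERE dot_product(features, ?) BETWEEN ? AND ? ORDER BY name",
        (query, low, high),
    ).fetchall()
    return [row[0] for row in rows]


matches = similar_faces(conn, [0.95, 0.05, 0.0])
print(matches)
```

As in the description, there is no index here: every stored vector is compared with the query. In a distributed system that scan is spread across nodes, which is what keeps the brute-force approach within an interactive latency budget.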
Gotcha. I guess that’s a computer vision use case, and I’m thinking about the types of data that are involved in machine learning and AI workloads, and we’ve got of course imagery, and video, and we’ve got a lot of natural language processing going on these days… And you know, some of these types of data I’ve dealt with in SQL databases before; of course, numbers and strings and that sort of thing… But I wouldn’t typically think of like “Oh, I’m gonna store this image, or video, or like an audio file, or something” in a database… So are you thinking that in the longer term a good workflow around this is that you’re storing this sort of feature vectors or embedded representations of maybe text, or audio, or maybe spectrograms of audio via the Tensor built-in, or those sorts of things? Or are there other ways around that?
This is a great question. To me it’s what it is now, and what is it gonna be as we go, and I will give you a very product-centric answer to this question… You know, like “What would a product manager think?” And they always start with the user. The user in this particular case is, again, data scientists from the training standpoint, and an engineer from building an app standpoint. I think today data scientists, with the tools that the data scientists use, it’s a lot more natural to store this data in a data lake; basically, in S3.
Yeah, just files.
It’s bottomless, it’s files, it’s cheap, and all the toolsets work out-of-the-box. And the reason to put that data into a database is only when you get some sort of additional benefits to that. When you put structured data, the benefits are obvious. The aggregations – so it enables low-latency access to that data, it enables very fast aggregations and reporting. So you can slice and dice that data in the database before pulling the data out, and use your custom tools to provide reporting.
For unstructured data, the only benefits that I see are governance. A database can provide that unified access layer to all your data, but it doesn’t give you any compute benefits over that [unintelligible 00:30:59.17] So that’s the way we think about it right now, as well as exploring.
I think what’s gonna happen in the future - databases just like MemSQL will give you an option to access that data that’s stored in the data lake and in the file system through the database API, with the benefit of marrying that data, and really understanding metadata, potentially building a full-text index against that data… So you can marry that data with the rest of your enterprise data, which is usually relational.
But do not yank the direct access to the file system, because that’s what data scientists do every day; and they would be confused if you remove that access pattern from it.
Yeah. I guess on that side of things - we kind of talked a lot about the operationalizing of models… On the training side now we’re kind of talking about access to files, and all of those things, and you’re saying you have the integration with Spark… For me, a lot of times I store everything in S3, like you were saying; it’s very natural for me. I just say “I want this file, and I’m gonna use it”, but there’s definitely issues that come up very quickly on that front, too. I know even this morning, I was trying to deal with 200 GB of audio data, and I was just sitting around for a while and making coffee… It’s not very productive or fun to deal with those sorts of things.
[32:37] I guess on the training side of things you have – maybe people that are used to the Spark interface can do that. Are there other ways with MemSQL that – like, if I wanna access my audio files in S3, is there a way to do that with MemSQL outside of Spark? Are there other sorts of interfaces I can use?
Not at the moment.
But I will share some of the thinking. Right now, there’s a lot of technology we’re building around just relational data, and providing that single pane of glass into all your relational data. That’s where we’re the strongest. When we think about S3, we think how we can offload all the data that’s not currently touched by the system into S3 - we call this thing bottomless, and making databases bottomless. If you think about Postgres, Postgres is not bottomless; it’s bound to the amount of hard drive that you run Postgres on. But we wanna make it completely bottomless and very cheap. S3 is probably one of the cheapest ways to store data in the cloud, and we have things like MinIO, that is one of the cheapest ways to store data on premises.
Specifically around that pattern that you describe - I have an audio file, it’s 200 GB and it’s a pain to go and transfer that file from one device to another, and it’s a pain to download it from S3 to your local storage and all those things… So the thinking there, again, is through integrations. If MemSQL is aware that “Here’s the file, in that particular format, stored in S3”, and then you want to somehow either bring computation to data, or you want access to a subset of that file, and only that you wanna bring into your training environment, either running it in the cloud or somewhere else - we wanna enable those things. That’s where it stops so far. That’s where our thinking stops so far.
We’re certainly aware of the scenarios and we’re aware of some of the pains that people go through. The place where we think MemSQL can add value is versioning, because you oftentimes need to run and rerun experiments, and the model – it’s not just the model, it’s the model and the data that it’s been training on; that’s really the unit that is consistent. And if the data changed, the model might be rendered obsolete, or it might not be… So just versioning makes a ton of sense, from the ability to run experiments, verify experiments, share and exchange the models and data across data scientists. So I think that’s where we can provide a non-linear amount of value over time.
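Editor's sketch: the idea that "the model and the data it's been trained on" form the consistent unit can be made concrete by deriving one version identifier from both. The snippet below is a minimal illustration under assumptions: `experiment_version` and its hashing scheme are hypothetical, not a MemSQL feature, and the training rows are assumed to be small, sortable tuples.

```python
import hashlib
import json


def fingerprint(payload: bytes) -> str:
    """Content hash of a byte string."""
    return hashlib.sha256(payload).hexdigest()


def experiment_version(model_bytes: bytes, rows) -> str:
    """Hash the model together with a canonical serialization of its training rows."""
    # Sorting makes the identifier independent of the order rows were read in.
    data_bytes = json.dumps(sorted(rows), separators=(",", ":")).encode("utf-8")
    combined = fingerprint(model_bytes) + fingerprint(data_bytes)
    return fingerprint(combined.encode("utf-8"))[:12]


model_v1 = b"serialized-model-weights"  # placeholder for a real serialized model
rows = [(1, 0.5), (2, 0.7), (3, 0.1)]

same_data = experiment_version(model_v1, list(reversed(rows)))
baseline = experiment_version(model_v1, rows)
changed_data = experiment_version(model_v1, rows + [(4, 0.9)])
changed_model = experiment_version(b"retrained-model-weights", rows)
print(baseline, changed_data, changed_model)
```

If either the model or the data changes, the identifier changes, which gives a simple signal that a previously validated experiment may need to be rerun.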
Alright, turning now a bit from the AI and ML integrations maybe to more analytical workloads… I know that when we were talking before the show and in conversations leading up to the show, it sounds like there’s some pretty interesting things going on in terms of MemSQL being used during the Covid-19 pandemic, and of course, there’s interesting tracing work going on, and all of those things that I’ve heard about… But I haven’t really heard about how some of those things are being enabled, so I’d be curious to hear a little bit more about that.
Definitely. So let’s step back for a second and think about what different parts of the world, and different companies and governments, what do they fundamentally want to accomplish as we go through the pandemic? The first one is simple - how do we stop the spread of the virus? Okay, well maybe we cannot really stop it, or let’s say we put our actions and efforts to do that… But since it’s spreading and it’s a matter of fact, what else can we do and how we can drive our decisions based on data? What kind of decisions? Well, it could be capacity planning for ventilators. We know there’s an outbreak there, and we will likely have our healthcare system overrun, and we need to provide extra-capacity to the healthcare system… But how much capacity?
All of those questions require answers, and the answers are in data. That’s where data science comes in, and it starts with collecting the data, putting it in one place, organizing the data and feeding this information to people who have the levers of power.
The second one is “Who wants the data?” We have obviously Apple and Google, who own the data because they have a device. Every individual on this planet - not every, but most of them - is now on a smartphone, so you can tap into that stream of data and get information about who is at which location at any point in time, and then marry that location with migration patterns, and marry that location with individual tracing.
Given that we know that this person has Covid-19, who are all the people that this person came across in the past two weeks, so we can go reach out to them and say “Hey, you probably want to be tested.”
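Editor's sketch: the tracing question above ("who are all the people this person came across in the past two weeks?") is essentially a self-join over location records. The example below assumes a hypothetical `pings` table in which `cell` is a coarse location bucket and `day` is an integer day number; `sqlite3` is a stand-in for the database, and real systems would use finer time windows and privacy safeguards that are not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pings (person TEXT, cell TEXT, day INTEGER)")
# Hypothetical location records.
conn.executemany(
    "INSERT INTO pings VALUES (?, ?, ?)",
    [
        ("p1", "A", 100), ("p2", "A", 100),  # p2 shared a cell with p1 on day 100
        ("p1", "B", 95), ("p3", "B", 95),    # p3 shared a cell with p1 on day 95
        ("p1", "C", 80), ("p4", "C", 80),    # older than two weeks on day 101
        ("p5", "A", 99),                     # same cell as p1, but on a different day
    ],
)


def contacts(conn, person, today, window_days=14):
    """People who shared a cell with `person` on the same day within the window."""
    rows = conn.execute(
        """
        SELECT DISTINCT other.person
        FROM pings AS me
        JOIN pings AS other
          ON other.cell = me.cell
         AND other.day = me.day
         AND other.person <> me.person
        WHERE me.person = ? AND me.day BETWEEN ? AND ?
        ORDER BY other.person
        """,
        (person, today - window_days, today),
    ).fetchall()
    return [row[0] for row in rows]


to_notify = contacts(conn, "p1", today=101)
print(to_notify)
```

Widening `window_days` brings older encounters such as the one with `p4` back into the result, which shows how the policy choice (two weeks here) maps directly onto a query parameter.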
The second entity that has that data - maybe government, but I don’t know about that… But certainly telcos. Telcos have this information; maybe not as accurate, because they don’t have a GPS on the device… Actually, they do have GPS on the device, but they may not be able to tap into the GPS. But they can triangulate the location based on cell towers.
So we’re working with some of the largest telecommunication operators here in the United States, as well as around the world, and I think the one that’s public is TrueDigital, one of the largest telcos in South-East Asia. And we do both; we do the migration patterns, where if you go back to March, February timeframe, we already knew that there was an outbreak in China, and there was an outbreak in Italy, and we already knew how bad that was… And looking at the flights from Italy, and tracing individuals that land, and then starting to see this pattern of people getting sick emerge, you can start driving decisions off of this. You can start putting policies in place that can stop the spread, you can start doing capacity planning, you can start manufacturing masks and ventilators and distributing them into places based on the patterns that we’re observing.
[40:10] So that’s how data management solutions are helpful to companies that have the data, and also the insights that those systems generate are useful for people with the levers of power to drive policy and to drive decisions. Google and Apple - especially Google - have the technology, but telcos don’t, and that’s where we partner and give them those abilities.
Yeah, it strikes me that – you know, the things you’re discussing, there’s definitely a lot of potential and value there, and earlier on in the episode, about facial recognition and a lot of things that are possible there on a large scale… And I think that as people are now in this pandemic, and kind of layered on top of that, all of the climate that’s in our country and around the world around injustice, and policing - there’s a lot of people asking really good questions about actually data management and security and privacy… And I’m curious, with you being in a position to have so many conversations with different types of entities around how they view data management, how that’s changing as we think about these powerful applications of large-scale analytics, but also the potential concern with privacy and tracking and all of those things - I’m just curious to get some of your thoughts on how large organizations are starting to view data management and security maybe now a little bit different than they might have in the past, given all of the things that are going on in our world.
Definitely. It’s a multi-faceted question. It starts with data management to highlight everything that’s going on, and the big problems that we face, and big issues that we face as a nation - how can data management help here. And I think one of the answers to that - of many; there’s so many things that would go into solving these big issues that you raised… But one of those things where data management can actually help is with data sharing and data consumption. Imagine the police data was given by the government to the whole world in the easiest way from the consumption standpoint, and it’s completely real-time.
So if you have an arrest, and that arrest by regulation has to be a part of a public record, that is in the system within ten seconds after that arrest happened. So that information is just live, real-time, for everyone’s consumption. And with our vision of a single pane of glass towards all your data and all workloads, we will be able to enable those things, and enable anybody to log in to our cloud service and consume that data, assuming the provider is willing to publish that data.
Imagine that climate change data is available to anybody in real-time, and it’s live, and it’s easy to consume. So where we live today, a lot of datasets are public, and there’s regulation that forces them to be public, but they are published in a non-standard, obscure way.
Yeah, they’re not discoverable.
They’re not discoverable. So to consume that dataset - it’s a project. It’s like going into a library or going to a court and asking for permission, and they will bring these papers and put them on the table. I’m inspired rewatching [unintelligible 00:43:42.13] when they got access to some sensitive data that had to, by law, be public. They had to jump through hoops.
But imagine all that data is discoverable, it is at your fingertips, and that data is up to date. So you don’t have to think about “Oh, I downloaded this last month. What changed between last month and today?” So it’s just there. That can make a lot of things easier, more transparent, and we’ll be living in a better world.
[44:12] We need to think about the implications of that, like what if bad guys had access to this data, but that’s a policy question; that’s not a data management question. I think data management should enable us to live in a world like this, and the technology is already there.
Yeah, and I imagine that if you have this sort of single way to interact with data that’s centered around SQL, and people are familiar with that, they’re able to use it, in addition to the sharing of data there’s the sharing of methodologies that can happen… For example, even in our last episode that we recorded, we talked about some tooling that’s out there around fairness and bias and other things… It’s a little bit – like, you have to read a good amount of documentation, you have to figure out how to use these things… I wouldn’t say it’s seamless and easily integrated into your workflow at this point, but I could imagine for example a suite of tooling that is easily accessible via certain SQL workloads that look for bias in your data on certain features, or highlight certain things in your dataset, and all those things… And like you say, whether you’re using TensorFlow, or PyTorch, or Spark, or whatever, you could potentially have access to those things in terms of people sharing their methodologies, because things are centralized in terms of the SQL language. Do you see that?
I’m wondering what’s the MemSQL community like, I guess, in terms of people working on projects built on top of MemSQL - what’s that community like, and do they share certain things like that, or certain things available that are maybe open source, that are built on top of MemSQL, that people can work on in a collaborative way?
The community is on forum.memsql.com, and then there’s a community of mostly enterprise developers, actually - because that’s been our focus so far - that are sharing through MemSQL events and conferences. Where we’re going is - you know, now that we’ve gotten here and we’re opening up the cloud more and more to the community, we’re thinking a lot in terms of free, and how we can make a lot of the things that got us here, got us to the 40 million run rate, with 70% growth… How can we take some of that and open it up, and by opening it up, provide a certain set of features and capabilities to the world for free? So on our dime, you go in the cloud, you log in, and there’s this free tier of stuff that you can do. That’s our current thinking so far, and I’m actually gonna be personally overseeing that effort here at MemSQL.
Yeah, that’s really exciting. I’ll be excited to dig in and play around with those things. One other thing that I guess is Covid-related and also related to our changing world is people’s workflow and productivity during this time. I’m just curious, with MemSQL growing so fast, and obviously a lot changing, a lot happening, how has that been for MemSQL and how do you see tech, work from home and productivity stuff moving forward, from your perspective as a CEO?
First of all, we are in uncharted territory. MemSQL wasn’t a company that was born remote-first. Even though we’re global and we have offices in San Francisco, Seattle, Lisbon, Kiev (Ukraine), Bangalore (India) and sales offices all over the place, there’s still concentration in each location, and usually a particular concentration over a component that people work on within an individual location.
[48:00] We weren’t impacted from a performance standpoint. It’s been one quarter of Covid; we basically just finished our Covid quarter, we demonstrated tremendous results, we’re very happy and excited about the future. And we obviously shifted all our workflows into working-from-home workflows.
Now, the worry that I have - and I’m being paid to be paranoid - is that it works fine so far because we are tapping into the social capital that we’ve built over the years. And a quarter of Covid is [unintelligible 00:48:35.02] social capital and all these social links are established between people, and they’ve built them while working at a particular location, and looking people in the eye, their friends and colleagues. So that’s gone, right? Every meeting is a formal meeting, if you think about it; we’re missing out on hallway conversations…
Yeah, I guess I haven’t thought about it that way, but it’s true.
Yeah, we’re missing out on hallway conversations, we’re missing out on grabbing coffee together and having these nice, positive experiences brainstorming while walking towards a nice coffee shop and grabbing a latte. So I want those things to be back… Hopefully, this will happen relatively soon, and we’ll have a dent in the social capital that we’ve built, and then we’ll kind of fill up that dent by getting back together. So that’s my hope. Obviously, we can’t control that; the situation controls us a little bit.
Yeah, it’s interesting… I’ve been working remote previous to Covid for maybe about 3-4 years now… And I definitely get what you’re saying. I’ve had to intentionally over time develop relationships with local data scientists or technical people that are maybe not working at the same organization that I am, but it’s a chance for me to get together with those people and just talk about things… Because sometimes I wonder, just sitting at my computer - I brainstorm a lot of things, and sometimes I wonder if I’m crazy, because I’m not talking about those things to anyone, except when I’m presenting them to my supervisor, and presenting them to a group, and I’m supposed to sound like I know what I’m talking about, hopefully, a little bit… So yeah, it’s not that sort of informal environment, and that’s a very interesting observation. I hope that some of that can come back.
Yeah. As we wrap up here, I’d love to give you a chance to just let people know – obviously, there’s MemSQL.com, we’ll have the links in the show notes… But as a data scientist or AI person, are there ways that people can play around with MemSQL and get a little hands-on, see what it feels like and how to do certain things? Where would you recommend that they start getting onboarded?
Definitely. If you want free forever, we have our software and we give our software to up to four servers - like I said, it’s a cluster software - to install whenever you want, and run forever. We call this our software free tier; it grew three times over the past year from a number of active users standpoint. It’s basically one of the best column stores on the planet. Data is highly compressed, it’s stored on disk, very fast reporting, everything is updatable, transactional system of record.
So where other companies that run on premises - the Verticas, the Greenplums - they wanna charge you for that, you get it free, and you can put billions and billions of data points in the system, and get very fast SQL response from it.
In the cloud, our free tier is time-based, so I encourage people to log in… You can play around with the system. That would allow you to not use any software, and consume everything as a service… But because we’re running it on our infrastructure, we’re limiting access to free for a period of time. We’ll be announcing more changes there. We’ll give the system to you for free forever, for limited usage, but that hasn’t come out yet, so that’s something we’re working on. So that would be probably the best place to start. And of course, go to forum.memsql.com to learn about the systems.
Awesome. Yeah, we’ll have those links in the show notes. I really appreciate you chatting about everything today. I think our listeners will really enjoy the content, and hopefully check some of these things out. Thank you so much for joining us, and I hope to have one of those hallway chats with you at some point when things are actually opened up.
Well, if you’re in Silicon Valley or I’m there, I will make sure to ping you and we’ll hopefully make that happen.
Yeah, definitely. Thank you so much.
Our transcripts are open source on GitHub. Improvements are welcome. 💚