Causal inference
With all the LLM hype, it’s worth remembering that enterprise stakeholders want answers to “why” questions. Enter causal inference. Paul Hünermund has been researching and writing on this topic for some time, and he joins us to introduce it. He also shares some relevant trends and tips for getting started with methods including double machine learning, experimentation, difference-in-differences, and more.
Matched from the episode's transcript 👇
Paul Hünermund: I’ll start with fairness, because that’s actually the very first example that I use in my own Causal Inference course here at Copenhagen Business School. It’s a case taken from Google, actually, from a while ago - I think in 2019. Well, already earlier - the story goes back longer, but they had been accused of underpaying women in their organization. So there we have a classic example of a protected attribute, like gender, race, and so forth, and we want to prevent bias in some form of automated or semi-automated decision-making, right? And that comes up all the time. I mean, in loan acceptance models, for example, we want to remove bias, and so forth.
[34:23] So to make the story quick: they had been accused of underpaying women in their organization, and then they did a fairly sophisticated analysis, published a whitepaper, and the result of that analysis was that they found they were actually underpaying men; at least they thought so. And not only men, but specifically high-level software engineers - so high-seniority software engineers at Google. And then, because they’re committed to fairness in their organization, they actually raised salary levels for these high-level software engineers based on the analysis. So it also had a practical component to it, or a policy implication.
We cannot analyze this case here in detail, but if you do that analysis, it’s very likely that they actually made some fairly common causal inference mistakes - they conditioned on variables that are downstream, that are affected by gender, like occupation, for example… And if you have discrimination already at that stage - that, for example, women don’t have it so easy to get into high-level positions, for various reasons that we know of - then that would be a classic mistake, and you can produce these kinds of, again, nonsensical correlations in the end, like the sharks and the ice cream.
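The mistake described here - conditioning on a variable downstream of the protected attribute - can be illustrated with a quick simulation. All numbers below are hypothetical, chosen only to demonstrate the mechanism, and have nothing to do with Google’s actual data: gender affects promotion into senior roles (the upstream discrimination), senior roles pay more, and comparing salaries *within* job level then makes the gap disappear.

```python
import numpy as np

# Hypothetical data-generating process (illustrative numbers only):
# gender -> seniority -> salary. There is no direct gender effect on
# salary; the entire gap runs through unequal promotion rates.
rng = np.random.default_rng(0)
n = 100_000
female = rng.random(n) < 0.5

# Upstream discrimination: women reach senior roles less often.
senior = rng.random(n) < np.where(female, 0.2, 0.4)

# Salary depends on seniority plus noise.
salary = 60_000 + 40_000 * senior + rng.normal(0, 5_000, n)

# Unconditional comparison: women earn less on average.
gap_total = salary[~female].mean() - salary[female].mean()

# Conditioning on the downstream variable "senior" hides the gap:
# within each job level, men and women earn about the same.
gap_within_senior = salary[~female & senior].mean() - salary[female & senior].mean()
gap_within_junior = salary[~female & ~senior].mean() - salary[female & ~senior].mean()

print(f"raw gender gap:      {gap_total:8.0f}")
print(f"gap within seniors:  {gap_within_senior:8.0f}")
print(f"gap within juniors:  {gap_within_junior:8.0f}")
```

The raw gap is large, while the within-level gaps are near zero - so an analysis that controls for occupation level would conclude there is no pay discrimination, even though discrimination drives the whole picture. Which analysis is right depends on the causal structure, not on the data alone.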
That’s one example that you can actually easily transport to other kinds of questions - like I mentioned, algorithmic bias. And that’s a causal question, because if you don’t understand how variables in your model causally interact and relate to each other, you cannot answer this question, you cannot decide how to correctly analyze the data.
Robustness, I mentioned - so the transportability, transfer learning kind of aspect of experimental knowledge, for which causal inference techniques have been developed… Also dealing with selection bias in data - so a dataset that might not be a representative sample of the population that you care about, but is measured with some form of selection bias, because only happy customers answer your consumer survey, or unhappy customers, but no one in between answers these questions…
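The survey example can be sketched in a few lines. This is a hypothetical simulation (made-up scale and response model, not from any real survey): if the probability of answering rises with satisfaction, the survey mean overstates the population mean.

```python
import numpy as np

# Hypothetical sketch of survey selection bias: satisfaction is on a
# roughly 0-10 scale, but we only observe scores from customers who
# choose to respond.
rng = np.random.default_rng(1)
satisfaction = rng.normal(5.0, 2.0, 200_000)  # true population scores

# Assumed response model: happier customers are more likely to answer,
# so the observed sample is selected on the outcome itself.
respond = rng.random(satisfaction.size) < satisfaction.clip(0, 10) / 10
survey = satisfaction[respond]

print(f"population mean: {satisfaction.mean():.2f}")
print(f"survey mean:     {survey.mean():.2f}")
print(f"response rate:   {respond.mean():.1%}")
```

The survey mean comes out noticeably above the population mean, even though every individual answer is recorded accurately - the bias is entirely in who ends up in the sample.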
And then lastly, explainability - I think explainability almost comes for free with causal inference. I mean, don’t get me wrong, causal inference is a hard task, but once you solve it, explainability almost comes for free, because - well, I mentioned “The Book of Why”, right? So causal questions are always related to why questions, counterfactuals as well… Like, “Why did my headache go away? Was it because I took the aspirin this morning?” I mentioned this example. This is the way we reason, this is the way we explain things to other humans, for example, and so there’s an immediate connection to explainability.