Ivan Kwiatkowski joins Natalie once again for a follow-up episode to Hacking with Go: Part 2. This time we’ll get Ivan’s perspective on the way Go’s security features are designed and used, from the user/hacker perspective. And of course we will also talk about how AI fits into all this…
| Chapter | Start | Title |
|---|---|---|
| 3 | 02:47 | It's Go Time! |
| 4 | 03:36 | Welcome back, Ivan! |
| 5 | 04:41 | IDA Pro helping reverse engineers |
| 6 | 06:36 | Is Go better for researchers or hackers? |
| 7 | 07:35 | Rust is the real menace |
| 8 | 09:31 | On Go's cross-compilation feature |
| 9 | 10:57 | Go support for exotic platforms |
| 10 | 13:49 | Security risks of listing module deps |
| 11 | 16:22 | Benefits of Go over C & C++ |
| 12 | 18:29 | Is Go code more secure overall? |
| 14 | 21:18 | Does COBOL malware exist? |
| 15 | 22:02 | Reversing Pascal malware |
| 17 | 25:09 | Is the reversing process always the same? |
| 18 | 29:13 | Those pesky goroutines |
| 19 | 33:11 | Visualizing the reversing process |
| 20 | 35:24 | Does Go's simplicity aid reversing? |
| 21 | 37:52 | The efficiency of Go's compiler |
| 22 | 38:47 | Do malware devs catch their errors? |
| 23 | 41:33 | Evaluating errors in Assembly |
| 24 | 43:23 | AI Tools for malware code review |
| 25 | 46:08 | AI and codegen |
| 26 | 49:37 | Ivan's (premature) unpop |
| 27 | 51:59 | AI for fingerprinting malware authors |
Today, Ivan Kwiatkowski, you are joining us again, to talk more about hacking with Go and cover all the things we did not manage to cover in our last episode.
Yes, very happy to be back here.
So for those who did not tune in to the previous episode, could you shortly introduce yourself?
Yes, of course. So my name is Ivan Kwiatkowski. I am a French cybersecurity researcher, and I work for Kaspersky. Specifically, I work in a threat intelligence team, and my role in this team, apart from writing reports, and that kind of stuff, is really to proceed with the reverse-engineering of the malware that is provided to me by my coworkers. So basically, they do the threat hunting, they find some interesting stuff to look at, they identify implants used by the attackers, and they give them to me. And then my job from there is using the fantastic tools of the reverse-engineer, which are IDA Pro, basically… I do read the assembly code of those programs and try to figure out what they do as best as I can. And that’s my life.
And in the previous episode, we talked a little bit about whether or not IDA supports Go, and that it got better over time, but there is still room to improve…
Yeah, absolutely. So if you tried reverse-engineering Go programs something like maybe two years ago, maybe just one year ago, then you would be in a lot of trouble, because the tools just weren’t there. So it involved using some third-party plugins, some code taken from - I wouldn’t say suspicious, but GitHub repositories that weren’t that well maintained, or that didn’t have really clear instructions… And so it was really a difficult path for the reverse-engineer that had to do it. And thankfully, over the years, the developers of IDA, this company in Belgium called Hex-Rays, they’ve been listening to the customer complaints, I suppose, and they have made a number of improvements that allow us to support Go programs a lot more easily. And that entails having some recognition for the various functions that come from the Go standard library, better support for Go executables as a whole… In the very latest versions, if I recall correctly, a number of the things that Juan Andres Guerrero-Saade from SentinelOne, as well as myself, had implemented manually in Python in some of the third-party plugins - those kinds of features, they have included them in IDA Pro, in the main line.
So I do expect that one year from now maybe this won’t even be a discussion anymore. I mean, sure, Go will still be a very kind of alien language for us to look at in terms of like the assembly code that it generates, but at the end of the day, I do suspect that the tooling problems are going to be over in the near future. So that’s a good thing as far as we are concerned.
So is Go a better language for a security researcher to pick up, or for a hacker?
Well, when it comes to security researchers, we don’t actually have to write that many programs. Most of the tools that we use are already provided to us by the community. I mentioned IDA Pro - nobody is ever going to redevelop IDA Pro. I mean, some guys did, so it wouldn’t be fair saying that they wouldn’t, but most people are not going to do this. And if they were going to do this, then I think that choosing either the Go language or C++ wouldn’t make that big of a difference considering the scale of such a project.
When it comes to a hacker, I think that for them the Go language is still probably a very good bet, because as far as I can tell from a very unscientific polling of my co-workers and other reverse engineers in the field, it feels like most people still really dislike having to work on the Go language; like really, really dislike it.
In my opinion though, if I were to write malicious programs, I would use Rust, because the code generated by Rust is actually even way worse. And at the moment, I’m not exactly sure how to approach that type of code. I have to work on that, but this would be my intuition - to use Rust, because I know that I’m going to make someone’s day miserable somewhere in the future.
And when you say “worse”, you mean basically because the steps that you need to do to reverse-engineer and kind of figure out what’s happening are actually more painful…
[08:06] Exactly. I say that in the sense that – maybe it’s just a personal thing, right? I have spent a bit of time trying to figure out how Go ticks, not at the language level, but at least at the assembly level; I would never pretend that I am a Go expert, or that I have deep insights about the inner workings of the Go language… But if you give me a binary written in Go, I expect that eventually I will be able to tell you what it’s supposed to be doing. When it comes to Rust, it’s kind of uncharted territory as far as I’m concerned. As far as I know - and again, this is not something that I’ve checked all that much, although this is something that I probably will have to do very soon… It feels like the tooling, like IDA, for instance, the disassembler that we use on the job, doesn’t seem to support the Rust language as well as it supports the Go language. It doesn’t recognize as many things. And at the end of the day, Rust tends to generate constructs that really look like C++, and C++ is kind of a mess to begin with. It’s a very powerful language, I really love C++; if I have to write some complex program, I will write in C++, because this is the one I have the most experience with. But when it comes to reading assembly generated from C++, oh my God, it’s just so convoluted, and there are so many levels of indirection added at every level. So this is not something I would be happy with, and Rust being a new, more complex, or a new, more alien C++ is really not something I would be happy with.
How about the easy cross-compilation of Go? The fact that you write your code, and then you create one binary, and then you kind of ship it as is, and on top of that you just write one more command and then you have it for any architecture? Does this make any difference for you?
It does, in the sense that I think as a developer, it’s pretty cool to have that. I also have always felt that this feature that is very often brought forward by defenders of the Go language was a bit – I don’t think it’s that important, right? Not that we don’t want cross-compilation or that we don’t want programs that can run anywhere, but I don’t feel like the Go language is especially adding something new there. I mean, when I write C++ code, it can already run everywhere, provided that I write it properly, of course. Way back when, probably like 10 years ago, when I was in school, I was learning Java, and I actually programmed a few projects in Java. Supposedly, Java was supposed to be working on any platform as well, right? So it didn’t work as well as we expected due to many reasons, but overall, it doesn’t seem to me that this ability to run on any platform, and to have code that will compile everywhere is really something that Go is actually bringing to the table. I think this is something that we already had, and that maybe Go is making easier for a lot of developers. But it’s not something new, and something that would make me switch languages, by any stretch. By the way - maybe this is a question for you… I have no idea about the support for Go on exotic platforms. What about running Go code for Solaris, or for the ARM architecture? Is that something that’s supported out of the box, or isn’t it? Because I know for a fact that when new CPUs come out, the first thing that the manufacturers release is going to be a C compiler, or a C++ compiler. So we know that eventually, those languages are always going to work. But I have the notion - and maybe I’m wrong about this - that when it comes to the Go language, if you have this new platform somewhere, then you will have to wait for Google to release the corresponding compiler, and that may take some time. Is that correct?
So as you’re asking, I googled the command [unintelligible 00:11:35.24] which is what you run to do this, and for Solaris, it does come out of the box. What was the second one you asked for?
This one is going to be supported, I would imagine. It would be to compile for ARM or MIPS, probably less used architectures… But I would imagine that at least ARM is supported very well.
Yes, ARM is out of the box indeed. Yeah.
ARM-64, and so on. Yeah. And MIPS - some of the variations are not, but most of them, yes. The ones that are not out of the box are MIPS-64 P32 and MIPS-64 P32LE.
[12:15] Okay. So yeah, overall, these are probably architectures that most people don’t care about. So I don’t think this is a big fault on the part of the Go language. What I will say though is that, as far as I’m concerned, C is already a multi-platform language as it is, and if other languages provide this as well, then good for them, but to me, it’s not something groundbreaking.
Yeah, that’s fair. Lots of DevOps people do love that feature, that you don’t need to do much to ship everything to everyone in your favorite architecture.
Yeah. That aspect is pretty important, and also very appreciated by malware authors; it’s the fact that - yeah, when you write some program in C, then you might have modules that are distributed in the form of a DLL file on Windows, or a .so shared object on Linux, and so on… And then you end up with a program, an executable file, and then several object libraries that come with it. And then when you want to distribute it, you have to send this big archive that contains many files. I would agree that when it comes to the Go language, you end up with a single binary, and that’s pretty useful, right? Especially in the context where you don’t have control over the client, the “victim’s machine”; in that case, of course, just having to send an executable and knowing that it is self-contained, and going to work everywhere, all the time, is going to be a big advantage. There are ways to do this in C, C++ and so on, but it will require some work… Which, agreed, is not required when you’re using Go. So this is one point for Go, I would say.
Another is - tell me what you think - the concept of modules, where it has this file that says what are all the dependencies, and what version specifically is used where, in case you’re using some package of an older version, and whatnot. So given that this is all kind of compiled into a module, and it’s being sent out as one - does this provide any value for a hacker or for security researcher?
Well, for a security researcher or a hacker, I’m not exactly sure. I mean, as far as developers go, it prevents you from falling into this pitfall of dependency hell, where – this is stuff I actually experienced last week, while working on a Python project. I updated all my packages, and some of them were not compatible with each other, and my whole project just broke down in production… Which is always fun. So this is something I could have prevented by just fixing the version numbers, which is what you’re supposed to be doing. But overall, having this mechanism is kind of a good thing.
When it comes to security, making sure that hackers are able to compile their thing is really not something we worry about too much. What we do worry about is the fact that when you end up with a single binary that contains everything, it’s kind of an issue for reverse engineers. When you compare this to a C program or C++ program that has some various DLLs, then the different files already represent some sort of separation between the code, right? The DLLs might or might not have a relevant name, but at the end of the day, you know that they’re going to be split according to some form of functionality, right? This type of code is going to go in this DLL, the main intelligence of the program will go in the main executable, and so on.
So when you have those big malware platforms that you have to work on, then having several files is actually a pretty good thing for us. When you have a big Go binary, that is 5 to 10 megabytes big, and then you have to just dive in there and try to find out where the interesting code is located… The good thing is that where the uninteresting code (which is the library code) is, that’s something that the tools can recognize pretty well. But you still have this big program that contains everything, and it’s just much easier if you already have this kind of separation, where you can already focus on some specific functionality, even though you haven’t been able to dig into the whole project. So in that sense, I would say that this feature is pretty useful for the offensive side.
[16:22] I wonder if there’s any particular feature that is good for the defensive side… But I’ll keep asking questions until we find something. Or do you have one in mind?
I do have one in mind… The best, and I think one of the strongest selling points of Go over C and C++ and all those unmanaged languages is going to be that when you write programs in Go as a developer, you know that you’re never going to have any problems with memory corruption, buffer overflows, and those kinds of issues. I would be very surprised if the Go language would allow you to read outside the bounds of an array, and that kind of stuff. So all this is already taken into account for you, and it’s not going to make my personal daily job as a reverse engineer easier, but what it’s going to provide is that by default, people will have a much harder time shooting themselves in the foot. And this is a good thing overall, because it means that if I download an application that was written by someone else in the Go language, I don’t have to worry as much about the code quality, because I know that the language is going to provide a number of guarantees, and that will make sure that at least a number of vulnerabilities are not going to affect me, ever. This is true as a whole, for the whole industry, even outside security - if programs are written in the Go language, like an FTP server, or an email server, or whatever… If such programs are written in the Go language, then we know that at least we won’t have to worry about buffer overflows, and this means fewer weekends spent in incident response engagements because some customer didn’t patch their program, or because there was some vulnerability discovered as a zero day, and that’s being exploited in the wild, for an application that has some buffer overflow vulnerability, and that is just available widely on the internet.
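The bounds-checking guarantee mentioned here can be sketched in a few lines of Go. The helper `readAt` is hypothetical, written only for this illustration. In ordinary Go code (that is, setting aside the `unsafe` package and cgo), an out-of-range index causes a runtime panic instead of a silent read of adjacent memory, and the sketch converts that panic into a regular error.

```go
package main

import "fmt"

// readAt is a hypothetical helper that indexes into a slice and turns the
// runtime's out-of-range panic into an ordinary error value.
func readAt(data []int, i int) (value int, err error) {
	defer func() {
		// recover returns non-nil only if the indexing below panicked.
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	return data[i], nil
}

func main() {
	data := []int{10, 20, 30}
	fmt.Println(readAt(data, 1)) // index within bounds
	fmt.Println(readAt(data, 7)) // index past the end: reported as an error
}
```

In a C program, the equivalent out-of-bounds read would be undefined behavior and might quietly return whatever happens to sit next to the array in memory.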
So overall, for defenders, less vulnerabilities and less ways for developers to make tragic, tragic mistakes is always going to be a good thing. And I think this, in fact, overshadows any advantage that the attackers are gaining over us, on the personal level, with the reverse engineers.
That is a very interesting point. I guess it would be interesting to see if overall Go code is more secure, however you would measure that. It’s probably going to be interesting to see how to do that.
Well, having precise metrics is always going to be difficult, but if you compare CVE counts for projects written in the Go language and for projects written in C, then I think it’s very likely that you would find that programs written in memory-unmanaged languages, like C, C++ and the like, are always going to have more bugs, just because by default there are more opportunities to shoot yourself in the foot.
So if you take the developers with equivalent skill, and for one of them some bugs are just unavailable, and for the other one you have twice as much mistakes that you can make, it feels very obvious to me that no matter if the developers have the same level of skill, then the person using the unsafe language is always going to make more mistakes.
Have you ever seen malware in COBOL?
I haven’t, actually.
I wonder how would you evaluate that - on the more safe, or on the less safe side, how would you say all that?
I actually have no idea. COBOL is one of the languages that I know of; I know that if you want to work in banks and be paid the big bucks, then you should definitely learn COBOL, because all the former COBOL developers - they died of old age by now, so they are kind of hard to find… Beyond this, never in my life have I heard about – or never in maybe the last ten years have I heard about malware written in COBOL. If I were to find one, then it would be a pretty cool blog article, but it would probably be a miserable week for me, because then I would have to probably learn the language and figure out how it works. Although, to be fair, this week I actually had to reverse-engineer a program that was written in Pascal…
Which is the spiritual parent of Go.
Yeah, I suppose. And also the spiritual parent of many other languages, because it came from the ’80s, maybe the ’70s, right? I remember learning Pascal, or at least the basics, when I was in school; something like 15 years ago now.
Same. It was my high school graduation project; it had to be in Pascal.
Yeah. There you go. I probably wrote a little bit of Pascal myself; probably never actual projects that did anything meaningful… But I did have to look at some APT malware, like real-life APT.
What does APT stand for?
Oh, sorry, it means Advanced Persistent Threat. So APTs are one of the categories of hackers that we track in our daily work. So you have on one end the attackers that are financially motivated, or cybercrime, ransomware groups and all the like, and on the other end of the spectrum, you have what we call the APTs, which basically are the state-sponsored actors, or the mercenary actors, all the groups that are focused on cyber espionage. Now, initially, the name APT was, I think, proposed by Mandiant - it’s probably around 2010, something like this - in their first report. And back then, I think it made sense to say, on one end, we have the low-scale cyber criminals that are doing run-of-the-mill crimeware, and on the other end of the spectrum, you have those state-sponsored attackers that are doing very sophisticated things. I think that today, trying to separate attackers between levels of sophistication doesn’t make that much sense anymore when you have extremely skilled ransomware groups that use very cutting-edge pentesting methodologies… And we do have APTs that are extremely bad, I would say; they have poor OpSec, they don’t know how to use their tools, and so on… So at this stage, I think in 2022 when you hear APT, you just have to think about espionage. I think this is going to be the way to understand this.
[23:59] But in any case - yeah, this week I was working on another APT case, taking place in some of the -stan countries, in the CIS, the Commonwealth of Independent States, I think… Anyway, one of the malware implants that we found there was actually written in Pascal, and so it was kind of a trip down memory lane to, one, figure out what Pascal was again, and also try to understand what kind of assembly was generated by this Pascal compiler. It wasn’t that bad, actually. Way less bad than having to discover the Go language.
That’s interesting to know, because there is a lot of similarity between Go and Pascal, but knowing that the translation of that is different…
Oh yeah, I can tell you, even though I’m not an expert in any of the languages, even though you might have some similarities on the code level, in maybe the constructs and the way that you declare things and so on, when it comes to assembly, the languages could not be more different from one another.
I thought if they’re conceptually similar, they might have a similar structure, but I guess not.
No. I suppose they took inspiration when it comes to how to write the code… But then when it comes to what the compiler does, then yeah, the Go compiler really does its own thing.
And when you say that each language has their own different thing in the assembly representation, or even when you reverse-engineer that into like a visual representation, how many different ways can you have – can it really be like every time completely different?
It’s not always completely, completely different, but there are some meaningful differences. I would say that the C language - maybe it’s a misconception I have, because the C language is traditionally what you learn reverse engineering on. So the C language to me is going to be the language that is closest to the CPU. When you compile a C program, then there’s going to be a kind of direct translation from your C code to the assembly language. The compiler isn’t going to be too smart about things; when you do something in C, then when you write it in code, then it kind of shows in the assembly language. And of course, you can add some compiler optimizations, for speed, for space, and so on, but overall, the translation is going to be a pretty – I wouldn’t say it’s easily reversible, but I think it’s pretty direct. You really find your bearings from the C code to the assembly language. And I think it’s actually not that much of a surprise that decompiler tools such as Ghidra, or such as the one sold with IDA Pro, do take the assembly and convert it back to C, because I think this is the closest. And then when you go to languages that have, I would say, higher levels of conceptual complexity, then this is where the compiler starts doing a lot of things on its own, and this is where the code that you write ends up being super-different from the assembly that you read. When it comes to C++ and when you use a std::string, it seems like a very simple thing, right? But under the hood, the std::string class is actually a template instantiation of a very, very complex series of nested templates, and you end up with… How can I explain this? You end up with a weird structure that has first a table that contains pointers to methods, which is something you never wrote in C; you end up with methods calling each other, nested methods that come from the template library from the C++ standard library, and so on, and things just get crazy from here.
Taking the example of Pascal - and again, I don’t write that much Pascal code, but it’s very obvious to me that when I look at the assembly code and I see reference counters being incremented and decremented automatically, and all that kind of stuff, then this is something that was automatically added by the compiler. And it’s, I guess, useful as far as running the program goes. But when it comes to me understanding the program, making sure that reference counts are handled properly, and that the objects are going to be freed when there are no more references to them is something that I really don’t care about, and it’s just cluttering my window, really. It’s just code added by the compiler that has no meaning, at least when it comes to what the program is supposed to be doing. It doesn’t add any intelligence to the program, it’s just something that gets in the way.
[28:06] Go is probably one of the far extremes of this, right there with C++, because the Go compiler is really doing a lot of stuff under the hood. It really – how can I put this…?
Yeah, optimizations. It in-lines anything that is not worth a function call… The calling convention is its own thing… Oh, you also have a garbage collection mechanism. When you write a simple Hello World program in the Go language, it ends up being an executable that’s something like one megabyte big, or something like this. I get that today storage space isn’t that expensive; we don’t care about one megabyte of code. Like, what’s the difference between this and seven kilobytes? I think not that much when it comes to hard drives. But when you are a reverse engineer, and you have to look at one megabyte of code instead of seven kilobytes, it’s actually a big deal, right? And this is the kind of thing that the Go compiler does to you, along with a number of optimizations, along with this very weird calling convention that they have etc.
Another thing that I don’t like about Go - I mean, it’s a good thing, right? I don’t like it as a reverse engineer; it’s the goroutines.
Yes. This would have been my next question. Yes, please do elaborate.
Great. So it seems like a very, very easy way to create threaded programs, as far as I understand, which is great as far as developers go… But it makes it a bit too easy for malware developers to create threaded programs as well. And when it comes to understanding what a program does, we really like linear programs. We want instructions that we can look at one after the other, we want programs that we can debug very easily… And as soon as many things start happening in many threads, then oh my God, following things around becomes extremely, extremely difficult. So I would actually like for threaded programs to be more difficult to write, and to be less available for attackers, if that were an option.
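As a rough illustration of how little code concurrency takes in Go, here is a minimal sketch; the function `runWorkers` and its behavior are invented for this example. Strictly speaking, goroutines are scheduled by the Go runtime onto operating-system threads rather than being threads themselves, but from the reader's point of view the effect is the same: several pieces of code can touch shared state at once.

```go
package main

import (
	"fmt"
	"sync"
)

// runWorkers is a hypothetical helper: it starts n goroutines that each add
// their own id to a shared total, waits for all of them, and returns the sum.
func runWorkers(n int) int {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex // protects total
		total int
	)
	for id := 1; id <= n; id++ {
		wg.Add(1)
		// The "go" keyword is all it takes to run this function concurrently.
		go func(id int) {
			defer wg.Done()
			mu.Lock()
			total += id
			mu.Unlock()
		}(id)
	}
	wg.Wait()
	return total
}

func main() {
	fmt.Println(runWorkers(3)) // prints 6
}
```

Even in this tiny program, the order in which the goroutines run is not fixed, which hints at why following the same logic in a disassembler is so much harder than following a linear program.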
How do threads represent themselves visually when you do reverse-engineering?
Well, they don’t really represent themselves, because threads as a concept, they are a fundamentally runtime object, right? A thread is going to be a unit of execution that is going to run some code. And when you have a single thread, which is the case for a lot of programs, then you can just follow what’s going on in the code linearly, and then you figure out what is going on. When you have several threads, then this is really an order of magnitude more of complexity that you have to wade through, in the sense that as a reverse engineer, you not only have to think about what is going on in the program or in the function that you are reading, but at all times, you have to think about the fact that there might be another thread running somewhere that might be doing things that are affecting what you’re doing right now, or what the current function that you’re reading is doing. And so you do not have the luxury of having all the information that you need in a single place. The functionality, the intelligence of the program ends up being spread over different units of execution, and you have to keep everything in your head to have any hope of understanding what is going on. So this is really a very heavy mental tax that is imposed on the reverse engineer. And of course, the more threads there are, then the more effort you have to go through to try to keep track of everything that is going on.
One good example of this is a Go program that I mentioned in the previous podcast, it’s called Stowaway. This is an open source project that is used to do various proxying operations as a pen testing tool. You can create tunnels, SOCKS proxies etc. and probably pipe them with each other. I’m not exactly sure… But what I’m sure of is that when I was reading the program’s assembly, it felt very, very miserable, because it was obvious that many things were happening at the same time, which of course is going to be the case, because when you have some network program, then packets can arrive from any end of the various terminals. And you can also have many tunnels running at the same time.
[32:05] So you have all these things taking place at the same time, and trying to figure out exactly what does what is extremely difficult. And if I hadn’t been able to figure out that this was actually an open source project for which I was able to go find the code, then probably I might not have been able to figure out everything that the program was doing at all, because there was just too much to work through, and too much to remember, because my memory is actually quite limited, as is the memory of any human being compared to a computer, really.
Until we started upgrading…
Yeah, I wish…
All the interesting terms that you’re mentioning, like APT, and also a Stowaway, which you reminded now - it will be in the show notes, for those who want to look back at that and see how can they handle threads in reverse engineering. It will be interesting to see some example of how that actually looks… Because you said that this kind of can tint the results, and whatnot… Maybe like there’s data dependency between the two, but… I still try to visualize and understand – so I’m referring back to the video that you have published on YouTube, which I will also link, of your “Reverse-engineering a Go program”, and you kind of build this block diagram of the different steps, and you write what’s happening there, what you guess, and so on. So affecting the results by sharing some data, or do a calculation return, and then on top of that - that all makes sense. Do commands just pop up randomly in what’s happening now, and you kind of try to paint that in the relevant context once you’re reverse-engineering something with multi-threads?
Actually, I think the best way to visualize it is not to try to think about the program as it is running, but imagine that instead of reading assembly code, you are reading the source code of a Go program. I think you are not going to dispute the fact that if you receive some Go project from a friend or a co-worker, and you know nothing about this project, and you have to read the source code, then this source code is going to be much easier to understand if the program is just a single thread that is doing a single thing, right? If you receive this program that, as soon as it begins, it launches three different threads that are supposed to do different things at the same time, then figuring out exactly what the program is going to be doing is going to be, I think, much more difficult. Now, imagine the same thing, but instead of receiving proper Go code, then you would receive Go code where all the variable names have been wiped, and all the variables are named A, B, and C etc. So you cannot even use the function names, or you cannot even use the variable names to try to understand what the program is supposed to do. Imagine that you have no comments inside the code as well. This is basically what reverse-engineers have to do.
You don’t always have to imagine that…
Yes, of course. [laughter] Yeah, this might be real life for a lot of people out there. Shout-out to them, I guess… But this is exactly what reverse-engineering is. You receive some source code; whether it is assembly or high-level code - that makes a bit of a difference, because assembly is hard to read… But basically, it’s going to be the same thing that you have to go through, right? You receive some code, you have to understand what it does, and the more complex this code is, the more sophisticated its operations are, then the harder time you’re going to have to understand exactly what is going on in there.
This brings me to the next question, that generally Go best practice, let’s say, or the right way to do Go is to write simple, readable code, rather than sophisticated, and like ternary operations and whatnot, and complicated things… Does this in any way help, or not?
[35:45] I wish, but it turns out that for the compiler, whether you write the simple way or the ternary operation, if the compiler is smart enough, then at the end of the day it’s going to generate the very same assembly construct. Hopefully, the compiler, if you do this “if then… else”, or if you use the ternary operator, it’s going to be able to recognize that is the same thing, and in the end it’s going to generate assembly that does exactly the same. So it’s a good thing as far as development practices are concerned… But when you reach the assembly level, then all those helpful things and all those precautions that you have taken to make sure that other people will be able to understand what you’re doing - they just get taken out by the compiler, because they are things for humans; they are not things for CPUs, and so they have no place in your compiled program.
Yeah, that’s a great point.
Actually, one callback from one question you asked earlier - you asked how many different things can the compiler do when it comes to different languages… An additional example I can give for Go is Go programs, Go functions, they can return any number of return values, right? This is not something that most languages are doing… So when you look at the assembly language, at the end of the day it turns out to be translated into CPU code in very different ways than normal functions are supposed to work, right? When you have a function that can only return a single value, which is the case for a lot of languages out there, like C, C++, etc. then you have a very simple convention. The convention says, “Oh, the return value will be in register EAX in assembly.” This is the rule; it’s very simple. When it comes to the Go language, then you don’t have a single place, because you can have several return values. And so they are returned differently, through the stack, and you have to go look for them… It’s just much more complex, and it’s very different from a traditional language.
And the differences between the languages are going to be small things like this when it comes to conventions, but actually, the sum of all of them is going to result in having a source code or an assembly code at the end of the day that is really extremely different from one another.
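For readers less familiar with the feature being described, a small sketch of a Go function with two return values follows. The function `divmod` is a made-up example, not something from the episode. Where the two results physically end up (stack slots or registers) depends on the Go version and target architecture; newer toolchains use a register-based calling convention on some platforms, so the stack-based layout described above should be read as what was observed in the analyzed binaries.

```go
package main

import "fmt"

// divmod is a hypothetical helper that returns two values at once:
// the quotient and the remainder of a divided by b.
func divmod(a, b int) (quotient, remainder int) {
	return a / b, a % b
}

func main() {
	// Both results are received in a single assignment.
	q, r := divmod(17, 5)
	fmt.Println(q, r)
}
```

A C function with the same job would typically return one value and write the other through a pointer argument, which is why the single-register convention is enough there.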
Would you say, from what you see, that the way Go handles on the compiler level the return of multiple arguments is efficient, overall?
Yeah, it really feels like it, as far as I recall. Maybe this is me making a mistake, but it feels like the return values from one function call are placed exactly where they should be on the stack, so that another function can use them as arguments immediately. And so chained calls between different functions, where you have a function calling – when you call the function by passing an argument which is the result of another function call, it feels like in assembly these function calls are going to be very close to each other. You won’t have to move stuff from the return values back onto the stack etc. It’s just already there, and I think that on that level, they’re going to be pretty efficient, and go pretty fast.
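The chained-call pattern described here has a direct source-level form in Go: when one function's return values match another function's parameters in number and type, the first call can be passed straight to the second. The sketch below uses invented helpers (`split`, `add`, `digitSum`) purely for illustration; it shows the language feature, and makes no claim about the exact machine code a given compiler version emits for it.

```go
package main

import "fmt"

// split is a hypothetical helper that returns the tens part and the
// last digit of n as two separate values.
func split(n int) (int, int) {
	return n / 10, n % 10
}

// add takes two ints, matching split's two results.
func add(a, b int) int {
	return a + b
}

// digitSum feeds both results of split directly into add,
// with no intermediate variables in the source.
func digitSum(n int) int {
	return add(split(n))
}

func main() {
	fmt.Println(digitSum(42))
}
```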
Nice. Good to know; it’s always encouraging… So Go is kind of built in a way that you don’t debug this line by line, with breakpoints and so on, as you do in many other languages, but you do something and you check for errors, all the time. I will not ask you whether malware is generally written like this or not, or how good are they with their error catching… Unless you know, and then please do share.
I do know, actually, because I end up reading the code, right? So what I learned about Go by reading assembly code, and also by trying it myself to understand what was going on, is the fact that the Go language will not allow you to not catch the errors, right? If you have a function that returns two return values, and if you do not catch them, then you’re going to be in trouble. The code is not going to compile. So I think you can probably create this underscore variable, that means “I don’t care.”
Right. But as far as I can tell now, at least for the programs that I’ve seen, they do catch the errors, and they check for the errors, and they handle them properly… Which I think makes sense, right? Because if the language forces you to do it, then you’re going to do it. Of course, you can circumvent this by using this special variable, and not actually checking, but if the mechanism is there and if the language creates the framework where you kind of have to do it, then it kind of feels foolish, I think, to not do it, even though you can. Because if you don’t want to do those things, then just go back to C and play without any safety belts, and just play by your own rules… But if you’re going to use the Go language, I think it makes sense to use the language as it was intended. This is in fact what I’m seeing when I look at malware code.
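The two options discussed above, checking the error or discarding it with the underscore, look roughly like this in source. This is a sketch with hypothetical function names (`parseChecked`, `parseIgnored`); only `strconv.Atoi` and `fmt.Errorf` are real standard library calls. As a small precision on the compile-time rule: what the compiler rejects is assigning two results to one variable, or declaring `err` and never using it, which is why the blank identifier `_` exists as an explicit opt-out.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseChecked handles the error the way the language nudges you to:
// the second return value is compared against nil right after the call.
func parseChecked(s string) (int, error) {
	n, err := strconv.Atoi(s)
	if err != nil {
		return 0, fmt.Errorf("parse %q: %w", s, err)
	}
	return n, nil
}

// parseIgnored opts out by assigning the error to the blank identifier.
// It compiles, but a failed parse is silently turned into 0.
func parseIgnored(s string) int {
	n, _ := strconv.Atoi(s)
	return n
}

func main() {
	if _, err := parseChecked("abc"); err != nil {
		fmt.Println("checked:", err)
	}
	fmt.Println("ignored:", parseIgnored("abc"))
}
```

The nil comparison in `parseChecked` is the construct that shows up after a call in the compiled code when the author did check the error.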
[40:25] That is a little bit sparking joy to know that, even [unintelligible 00:40:27.06] the best practices… But it’s true that errors have a lot of information in them.
Is that something that, as a developer, you are not seeing? Do you see a lot of co-workers and the code you receive where the error-checking is bypassed and not used at all?
No, usually those things will not pass peer review… But I just don’t know enough whether hackers do peer review. So that’s – that is interesting.
Well, this is what I’ve seen; it’s going to be anecdotal at best. Of course, there are always going to be hackers out there that do things their own way, their own bad way, just like real world developers that work on other legitimate projects… And so I can only speak about the few malware programs that I have seen, and for which I can say that they looked pretty well developed as far as I was concerned, but there’s bound to be someone out there that is going to be writing the most despicable Go code you can think of.
That, of course, makes sense. But still, I’m happy to hear that generally good practices are followed everywhere… But another thing that I wanted to ask about that is how does this represent in assembly, given that this is generally not a very common practice…? I guess because you don’t see lots of errors, you basically don’t see the representation of this?
Well, the way I see it is that when I look at the assembly code, most of the time, since I’m not a Go developer, I have to look up the functions that are being called. Sometimes the names are self-explanatory… But most of the time I have to go to the Go documentation… Which by the way, I think is extremely well done. Every time I look for a function’s documentation, I find it, which is always a good thing. There are languages where you try to find stuff, and you just don’t; even just basic functions. But anyway… So I go look at the documentation, and then I get information about what arguments this function is supposed to receive, and what return values it’s supposed to provide. And then when I do this looking up, then I get information about whether or not an error value is supposed to be returned by the function. And when that’s the case, I cannot recall an instance where I was supposed to see an error value returned from the program, and that was not checked.
So the way that you would see it in assembly would be like you have this function call, and then you see some random variable being taken back from the stack, and compared with value zero. So basically, if err == nil, and then you have a block, and depending on whether the error is or is not nil, then you can go into that block or go into another one. But that block is there, which means that the attacker or the malware author went through the trouble of actually making sure that the error – like, there was no error returned by the function. So this is the way that I observe it.
So earlier you mentioned something that you – you compared kind of your work to getting code from a colleague, but it’s all kind of no parameter names, no documentation, no function names, and so on… And that reminded me how sometimes you can use all sorts of AI tools like Codex and Copilot and whatnot to highlight that, and say, “Well, explain what it does.” So did you ever have a chance to use one of those?
[43:48] I didn’t. Now, I know that GitHub released this project. I personally have a very religious fear of such projects, just because I know that the way it works is that all the source code that I write gets uploaded into the cloud, and analyzed, and gets to feed the machine learning algorithm. And, it’s kind of stupid, because all the code I write ends up being open source anyway… But I don’t like that.
And if it’s on GitHub, it goes to the same place.
Exactly. It also ends up there. So overall, there’s not really a good reason for me not to do it. But I didn’t try it yet. I’ve been told by some co-workers… I think they’ve used it for Python, and I’ve been told that it’s amazing. It can pretty much guess what you’re thinking, which is kind of scary.
When you write code, or when you reverse it?
Yeah, exactly. When you write code. I am not aware of a machine learning project that would help you reverse-engineer programs… Although I am 100% sure that this is possible. I’ve been playing a lot with the image generation AI, especially Midjourney. I tried the one that generates text which is called Lex… All those AIs, as far as I’m concerned, produce incredible results. If you had told me one year ago that I would be able to type some text and I would get the corresponding image generated, and that the image would actually look pretty amazing, I would really not have believed it, for real; I would have said that this is science fiction, and it’s never going to happen in my lifetime. Or maybe when I’m old, and don’t understand what’s going on anymore. But we’re there; we’re there for many complex applications, such as understanding human language, and generating lots of content… And at this stage, I would be extremely shocked if you told me that recognizing functions that are actually generated by other computers is not something possible. This is 100% going to happen, eventually. I don’t know who is going to do it. Maybe I should, actually. I don’t know anything about [unintelligible 00:45:39.20] but this is an extremely worthy project, and I think that eventually, this is going to help us save so much time when we work on some unknown programs. We probably would have to have specialized AIs for different languages; we will need one for C, one for C++, one for Go etc. But I cannot imagine that this is not in our future, and probably in our near future, too. Hopefully, they won’t sell this too expensive, because I want it.
So a follow-up question to that… For code generation, some languages are better than others; for example, Go is performing even better than Python and such, just because in Go there’s this built-in linter, and there’s many things that it’s not either/or, but it’s definitely tabs. It’s definitely curly brackets and a new line. So the AI has a more consistent dataset to be trained on… Versus Python, and many other languages that you can write in whatever way… So it just sees lots of different examples, and it might, in the best case, generate inconsistent code of, you know, one file is different from another, but even just sometimes wrong, following two different paradigms in one file. Do you think that for the reverse part of it, will this benefit kind of keep rolling, or not? For the AI perspective of it.
On the other end, when you have some assembly code, then assembly code is going to be this very strict, unique language that probably all AIs will have to work on. And I’m not exactly sure how they are going to work their way back up to either recognizing a function, or actually generating corresponding high-level code…
[48:08] But the good thing that AIs will have going for them is that assembly is going to be like the exact opposite of ambiguous, right? Like, you have ambiguous, and at the exact other end of what’s possible you have assembly, which is 100% precise…
As consistent as it gets.
Yeah, consistent, and actually done in some ways, but it’s just very simple operations that can only do a single thing in a very defined way. So on that front, I think that this is actually going to be a very, very good thing for the AIs, whenever they are ready.
So there’s already good tools out there, that just take binary and translate that into assembly; not 100%, but a very good coverage. And assembly is consistent enough, so that means that some IDA Pro plugin that uses AI must be developed as we speak, to say “Here’s assembly input. Please translate that to Go code for me.”
Well, it’s a good question, because you would think that someone would be working on this… But when you look at the market for reverse engineering tools, it’s actually quite small. You have Hex-Rays, the creators of IDA Pro, and Hex-Rays are kind of an old-school company. Their product is amazing, but they haven’t really tried to create any form of disruption in the past 20 years. Now, they have been doing lots of improvements to their products, but I think part of it is only because they have been challenged by Ghidra, their open source competitor. It’s kind of my opinion that – but let’s make this my unpopular opinion, if you want to… But it’s kind of my opinion that if Ghidra had never appeared, then IDA Pro would basically have stayed kind of the same for the next decade or so, because the developers had kind of no incentive to make it significantly better, because just they had no competition there.
The way that the decompiler is working at the moment, as far as I understand, is purely through algorithmic means, and they do not use any form of machine learning. There is no AI applied to their decompilation process. Maybe they have started working on this, but as a company - and maybe I’m totally mistaken about this, I don’t work there, but Hex-Rays doesn’t strike me as a company that could be doing groundbreaking R&D, that would maybe prepare for the next generation of decompilers. I think they would rather make incremental changes on their existing product to make it slightly better year after year.
I think Ghidra, which has a decompiler as well, is open source, and I think this decompiler is doing pretty well, too… But I don’t think it uses AI in any shape or form either, and I’m not aware of any plans to start working on this. Probably developing some AI product focused on reverse engineering would require some very specific AI knowledge. And my feeling is – I don’t know everyone in the security field, but I tend to… Like, my perception of this is that we have people that are extremely focused and extremely talented and skilled in the specific field of cybersecurity that they are working on… But I cannot recall anyone that I’ve met that was both a great researcher, or a great reverse engineer, a great pen tester, and a very qualified data scientist as well, or someone that would be able to tune a machine learning algorithm that would change our lives forever. Maybe someone is working on this somewhere, but if it’s the case, then I’m not aware of it.
I think if it were public, or if it was out there, I think I would know about it; maybe I don’t. But overall, if there was this big project about to be released, I want to believe I would have heard about it. But I do still hope that some company somewhere, probably in Israel, is working on this in a secret lab, and eventually, they’re going to take the market by storm, and make my job a lot easier.
[51:59] Yeah, another idea for the followers who are tuning in. So in addition to an AI that translates from assembly to Go, it would be personally interesting for me to have some AI that says “This malware is written in the style of…” And what I mean by that is that already now you can write - if you go to Codex and other tools like that, you can say, “Write a Go program that does this, and that, and this, and do it in the style of…” And if you mention a GitHub handle of somebody who is a known developer, has lots of stars, or some other big presence, and their type of code has a flavor - which is maybe less common in Go, because it’s so structured - you will get their style of code. So eventually, a next step to this magical plugin would be “In whose style is this malware written?” This will be also interesting. And then all those language teachers out there will know “Oh, well, I taught that hacker.”
Yeah, that’s a very nice idea. And actually, there have been, maybe not rumors, but open research projects working exactly on this for probably decades in the [unintelligible 00:53:08.15] intelligence field. I think I recall a CCC presentation from maybe 2010 or 2012, one of those years… And basically, people were already working on obtaining code from open source repositories - they were basically downloading GitHub, and trying to extract maybe some penmanship characteristics from every single developer, and they were hoping that they would be able to take any program in the future and be able to tell you that “This might probably be the developer that created this executable.”
Now, it’s been 10 years, 15 years… I’m not sure – I haven’t heard about this for a while, so maybe this either didn’t work as well as they expected, or this was actually absorbed into some intelligence service somewhere… Because there are very obvious intelligence applications there, that I think would – like, those types of services would love to have such a capability, because they would be able to identify malware authors; they wouldn’t need really to have the burden of proof the way that police forces have… They would just be able to know who the guy behind some malware is, and just do their usual parallel construction stuff.
So this is something we know they want, and this is something that we also know they spent money on, and I remember that some universities were actually working on those types of projects… For research, not for intelligence, but those worlds, they tend to communicate with each other anyway, when there are applications.
If this came to fruition, then this did so in secret. The way that they used to be working on this was, again, algorithmic; they were trying to extract the characteristics, and they were not using those blackbox AI capabilities that we have now. Maybe this is a new avenue of research for those applications. Maybe we’ll know in the future. But as far as I know, this is an existing problem that people are trying to solve, and I haven’t seen any signs that they have, although it’s not exactly sure that I would.
This episode took very interesting turns, like new ideas for tools and projects and whatnot, and remember where you heard this first.
Yeah, I was not expecting that.
Yeah, but that’s very cool. That’s really inspiring to speak with you, Ivan. Thank you for this hour of the conversation.
Like the previous episode, it also ended with me having lots of open questions… Please consider joining me again next year for one more episode, or ten more. [laughs]
Yeah, of course. Eventually - and I think it will be pretty soon - we’re going to expand the whole knowledge that I have on the Go language, right? If I have to come back, then at this point I will really have to look into the language more, and maybe try to come with actual research that I can share with you… Because otherwise – I don’t want this conversation to be boring, but I’ll do my best.
Well, next time we’ll talk about generics maybe… See how that Go proposal goes.
And I’ll work on that.
Yeah. And instead of unpopular opinion, we have provided two unicorn ideas… So you’re welcome, everyone. [laughs]
I can pretty much give you my personal guarantee that if anyone in the audience actually implements one of those two things, that they’re going to be extremely rich. So there you go.
That’s almost as good as being on the rank of the most unpopular opinion. [laughs] Thanks, everyone, for joining. Thank you very much, Ivan.
Thank you. I was happy being here, it was a pleasure speaking to you, and see you next time.