The panel are joined by Teiva Harsanyi, author of 100 Go Mistakes, to talk about how best to make mistakes when writing Go.
Teiva Harsanyi: [laughs] Now, I believe actually that it’s even more of a belief in Go thanks to goroutines, as you have said, Mat… Because goroutines compared to threads are great, they are more lightweight, they are faster to spin up, they are faster to contexts-switch, and so on… So there shouldn’t be any real reason for a concurrent application to be slower than a sequential application.
I took here a concrete example, actually… I took as an example the merge sort algorithm. And just as a quick reminder about what the merge sort algorithm is - if we take for example the recursive implementation, we basically get a list of elements as an input, and we will break down repeatedly each sub-list into sub-lists, into two halves. We are going to do it repeatedly. And once we reach sub-lists of a single element, we go up again and we merge the two sub-lists in a sorting manner.
A quick example, if we have 2 and 1, for example, we are going to split it into two halves, two on one side, one on the other side, and as each sub-list contains a single element, meaning it’s already sorted, then we are going to merge it in a sorting manner, so we will have one and two.
So in a nutshell, the merge sort algorithm - we just get a slice as an input, we check the length of the slice, if it’s bigger than one, we compute the middle, we apply merge sort on the first half, merge sort on the second half, and then we merge.
So the structure, for example, for this algorithm seems like a perfect fit for concurrency, because we could say every time I can handle each half into a specific goroutine. So the first half in one goroutine, the second in another goroutine, and say I will introduce some sort of synchronization at some point to wait for both goroutines.
So if we implement this parallel version of the algorithm, I run it on my local computer with a certain number of elements, and actually this parallel version is about ten times slower than the sequential version. And despite the fact that the parallel version leverages multiple cores, right? So it’s more than ten times slower. And what is the reason for that, if we think a bit about it? As we said, the algorithm is about to repeatedly split lists into two sub-lists; so at some point we will have 1,024 elements, then 512, then 256 and so on, until we reach 8, 4, 2 and 1 elements.
Now let’s try to imagine, in your opinion, what’s the fastest between spinning up two goroutines that will both merge two elements and wait for them, or in the current goroutine merge two elements and then merge two other elements? And of course, it’s gonna be the latter here, right? Because it’s gonna be faster to do it in the current goroutine.
[20:10] And if we think about it actually, in the merge sort algorithm, the deeper we go, the less efficient it will be to spin up a goroutine. And sure, goroutines are fast, but spinning up a new goroutine - it has a cost, because we have to wait for its creation, we have to wait for the internal Go scheduler to execute it, we have also the fact that concurrency introduces some form of synchronization because of mutex, or channels, or whatever… So everything has a cost, right?
Here one possible solution for this algorithm - the goal is not, of course, to design the most optimal solution for the merge sort algorithm, but discuss about a potential solution. It could be to say I will define a threshold, and I will apply the parallel algorithm that we’ve just described, but if the number of elements is below a certain threshold, it’s simply not worth spinning up new goroutines. So instead, I am going to execute sequentially. And this threshold may depend on the machine, and everything. On my side it was about 2048 elements. And if I run this new hybrid version, let’s say, of the parallel algorithm, it’s about 40% faster this time compared to the sequential implementation.
And one very last thing to say - I have done the same test in Java actually, where we don’t have the principle of goroutines (we just have threads here) and the threshold actually was higher. It was around four times bigger, if I recall correctly, compared to goroutines. So it’s kind of interesting, because somehow it shows that goroutines are actually somehow more efficient than threads for concurrent workloads, because they are for example faster to spin up… But as we illustrated with the merge sort algorithm, it’s not magic nonetheless; concurrency isn’t always faster.
Break: [22:19]