Some time ago, I added an article on what Merkle Trees are and a simple implementation in Go. You can find the article here "Merkle Trees".
After all this time, I finally decided to write a follow-up article to compare how we can make the process even faster and see what is the cost of the attained speed.
I ran some benchmarks on an Apple machine with an M1 chip and 8 GB of RAM running MacOS Sonomo 14.0 using Go 1.18 and below are the results.
There are two methods which I used to generate the Merkle Tree.
Sequential Node Generation
Below is the implementation for sequentially generating the tree nodes.
func generateTree(s []byte, depth int) *TreeNode {
// length is the length of the data
// depth is the the maximum required depth of the tree
return generateTreeUtil(s, 0, length, depth)
}
func generateTreeUtil(s []byte, start, end, level int) *TreeNode {
// to check the leaf node condition
if start == end-1 || level == 1 {
data := sha256.Sum256(s[start:end]) // SHA256 hash of the data slice
return NewTreeNode(fmt.Sprintf("%x", data), nil, nil)
}
mid := (start + end) / 2
// getting left and right children
var leftNode, rightNode *TreeNode
leftNode = generateTreeUtil(s, start, mid, level-1, mode)
rightNode = generateTreeUtil(s, mid, end, level-1, mode)
// concatenating the children data and applying SHA256 hash
data := sha256.Sum256([]byte(leftNode.Data + rightNode.Data))
return NewTreeNode(fmt.Sprintf("%x", data), leftNode, rightNode)
}
Concurrent Node Generation
below is the implementation for concurrently generating the tree nodes.
The difference from the sequential node generation is that the left child and the right child will be created concurrently using goroutines. sync.WaitGroup
has been used to wait for the children to be created and then the hash value of the current node will be calculated.
func generateTree(s []byte) *TreeNode {
return generateTreeUtil(s, 0, length, depth, SERIES)
}
func generateTreeUtil(s []byte, start, end, level int, mode string) *TreeNode {
// to check the leaf node condition
if start == end-1 || level == 1 {
data := sha256.Sum256(s[start:end])
return NewTreeNode(fmt.Sprintf("%x", data), nil, nil)
}
mid := (start + end) / 2
var leftNode, rightNode *TreeNode
var wg sync.WaitGroup // create a waitgroup
wg.Add(2) // to wait for 2 nodes to be created
go func() { // goroutine to generate left child
leftNode = generateTreeUtil(s, start, mid, level-1, mode)
wg.Done()
}()
go func() { // goroutine to generate right child
rightNode = generateTreeUtil(s, mid, end, level-1, mode)
wg.Done()
}()
wg.Wait() // waiting for children to be created
// concatenating the children data and applying SHA256 hash
data := sha256.Sum256([]byte(leftNode.Data + rightNode.Data))
return NewTreeNode(fmt.Sprintf("%x", data), leftNode, rightNode)
}
Talk is cheap show me the
codebenchmark.
The benchmark was run using the standard testing
package of Go.
func Benchmark_generateTree(b *testing.B) {
b.ResetTimer()
for i := 0; i < b.N; i++ {
b.StopTimer()
s := generateSlice(length)
b.StartTimer()
generateTree(s)
}
}
as you can see, generateSlice
is a helper function that initialises a slice of random bytes using the crypto/rand
package.
// length is the length of the slice which we want to generate
func generateSlice(length int) []byte {
s := make([]byte, length)
rand.Read(s)
return s
}
Show me the
moneynumbers.
the execution command for the benchmark
> go test -bench=. -benchtime 2s -count 2 -benchmem -cpu 4 -run notest
the same set of parameters have been used for running the benchmarks and they are listed below.
length := 8388608 // 2^23
depth := 4
Sequential generation benchmark results
result saved in sequential.out
goos: darwin
goarch: arm64
pkg: github.com/satyarth42/merkle-tree
Benchmark_generateTree-4 680 3545200 ns/op 4985 B/op 92 allocs/op
Benchmark_generateTree-4 676 3598568 ns/op 4973 B/op 92 allocs/op
Benchmark_generateTree-4 669 3617644 ns/op 4983 B/op 92 allocs/op
Benchmark_generateTree-4 669 3592869 ns/op 4983 B/op 92 allocs/op
Benchmark_generateTree-4 669 3584948 ns/op 4985 B/op 92 allocs/op
Benchmark_generateTree-4 670 3575145 ns/op 4960 B/op 92 allocs/op
Benchmark_generateTree-4 667 3586166 ns/op 4988 B/op 92 allocs/op
Benchmark_generateTree-4 668 3575892 ns/op 4974 B/op 92 allocs/op
Benchmark_generateTree-4 669 3589222 ns/op 4984 B/op 92 allocs/op
Benchmark_generateTree-4 669 3572839 ns/op 4981 B/op 92 allocs/op
PASS
Concurrent generation Benchmark results
result saved in concurrent.out
goos: darwin
goarch: arm64
pkg: github.com/satyarth42/merkle-tree
Benchmark_generateTree-4 2060 1105763 ns/op 6501 B/op 113 allocs/op
Benchmark_generateTree-4 2173 1085570 ns/op 6520 B/op 113 allocs/op
Benchmark_generateTree-4 2260 1099327 ns/op 6505 B/op 113 allocs/op
Benchmark_generateTree-4 2122 1115856 ns/op 6514 B/op 113 allocs/op
Benchmark_generateTree-4 2176 1118865 ns/op 6509 B/op 113 allocs/op
Benchmark_generateTree-4 2200 1107257 ns/op 6519 B/op 113 allocs/op
Benchmark_generateTree-4 2384 1100632 ns/op 6499 B/op 113 allocs/op
Benchmark_generateTree-4 2302 1180999 ns/op 6507 B/op 113 allocs/op
Benchmark_generateTree-4 2350 1181622 ns/op 6516 B/op 113 allocs/op
Benchmark_generateTree-4 2287 1096340 ns/op 6507 B/op 113 allocs/op
PASS
Now, let's compare the results of the sequential and concurrent methodologies.
> benchstat sequential.out concurrent.out > compare.out
goos: darwin
goarch: arm64
pkg: github.com/satyarth42/merkle-tree
│ sequential.out │ concurrent.out │
│ sec/op │ sec/op vs base │
_generateTree-4 3.586m ± 0% 1.107m ± 7% -69.14% (p=0.000 n=10)
│ sequential.out │ concurrent.out │
│ B/op │ B/op vs base │
_generateTree-4 4.866Ki ± 0% 6.355Ki ± 0% +30.60% (p=0.000 n=10)
│ sequential.out │ concurrent.out │
│ allocs/op │ allocs/op vs base │
_generateTree-4 92.00 ± 0% 113.00 ± 0% +22.83% (p=0.000 n=10)
A whopping 69% improvement in terms of runtime.
But it has come at a cost, the cost is that the memory footprint increased by 30% which is not a small number and maybe something to evaluate deeper as per the use case involved, e.g. one might have strong constraints on available memory.
On looking at the number of allocs/op
, they increased by approximately 23%, which again is not a small number.
In summary, we got a tremendous improvement in runtime but one must be careful to not blindly start applying this concept everywhere not factoring in the additional memory and allocations the program had to endure.
Now, many of us may feel like we have found the holy grail of making our Go programmes faster, but alas, we haven't.
Image Source: imgflip.com/i/2pxoxn
Like many things, this is not a silver bullet to improve every program runtime.
Let's look at what happens when we change the parameters.
length = 32768
depth = 6
Sequential generation benchmark results
goos: darwin
goarch: arm64
pkg: github.com/satyarth42/merkle-tree
Benchmark_generateTree-4 63726 36536 ns/op 18522 B/op 376 allocs/op
Benchmark_generateTree-4 64778 37636 ns/op 18522 B/op 376 allocs/op
Benchmark_generateTree-4 64303 37534 ns/op 18522 B/op 376 allocs/op
Benchmark_generateTree-4 63836 37615 ns/op 18522 B/op 376 allocs/op
Benchmark_generateTree-4 63885 37529 ns/op 18522 B/op 376 allocs/op
Benchmark_generateTree-4 64074 37573 ns/op 18522 B/op 376 allocs/op
Benchmark_generateTree-4 64098 37729 ns/op 18522 B/op 376 allocs/op
Benchmark_generateTree-4 64902 37548 ns/op 18522 B/op 376 allocs/op
Benchmark_generateTree-4 63628 37673 ns/op 18522 B/op 376 allocs/op
Benchmark_generateTree-4 63628 37594 ns/op 18522 B/op 376 allocs/op
PASS
Concurrent generation benchmark results
goos: darwin
goarch: arm64
pkg: github.com/satyarth42/merkle-tree
Benchmark_generateTree-4 38347 62255 ns/op 25027 B/op 469 allocs/op
Benchmark_generateTree-4 39420 61349 ns/op 25025 B/op 469 allocs/op
Benchmark_generateTree-4 39004 60675 ns/op 25026 B/op 469 allocs/op
Benchmark_generateTree-4 38838 61620 ns/op 25025 B/op 469 allocs/op
Benchmark_generateTree-4 38646 62497 ns/op 25025 B/op 469 allocs/op
Benchmark_generateTree-4 39267 61431 ns/op 25027 B/op 469 allocs/op
Benchmark_generateTree-4 39352 60861 ns/op 25025 B/op 469 allocs/op
Benchmark_generateTree-4 39478 60414 ns/op 25031 B/op 469 allocs/op
Benchmark_generateTree-4 39658 61071 ns/op 25026 B/op 469 allocs/op
Benchmark_generateTree-4 38815 61448 ns/op 25026 B/op 469 allocs/op
PASS
Comparison
goos: darwin
goarch: arm64
pkg: github.com/satyarth42/merkle-tree
│ sequential.out │ concurrent.out │
│ sec/op │ sec/op vs base │
_generateTree-4 37.58µ ± 0% 61.39µ ± 1% +63.34% (p=0.000 n=10)
│ sequential.out │ concurrent.out │
│ B/op │ B/op vs base │
_generateTree-4 18.09Ki ± 0% 24.44Ki ± 0% +35.11% (p=0.000 n=10)
│ sequential.out │ concurrent.out │
│ allocs/op │ allocs/op vs base │
_generateTree-4 376.0 ± 0% 469.0 ± 0% +24.73% (p=0.000 n=10)
What the hell happened here? Why is the concurrent program slower than the sequential one?
We increased the depth of the tree and reduced the length of the slice.
A possible explanation would be that the overhead of using a WaitGroup and invoking goroutines is making the whole program slower.
Wrap Up
Simply changing the program flow from sequential to concurrent will not always make your program go faster. One needs to weigh different implementations and figure out which one works best for the use case.
You may encounter memory constraints, you may find latency to be of utmost importance, and you may encounter varied inputs which may impact the execution of your program.
A good system evolves with time and the changes required.