Golang Structs Memory Allocation - II

The benchmark numbers

In the previous article, I shared some of my understanding of how memory is allocated within a Go struct and how we can make minor tweaks to attain good optimizations in our programs. The link to the previous article is here (Golang Struct Memory Allocation).

Talk is cheap, show me the code numbers.

I ran some benchmarks on an Apple machine with an M1 chip and 8 GB of RAM running macOS Sonoma 14.0, and below are the results, which shed some light on the extent to which Go programs can be optimized this way.

First, let's look at the function used to run the benchmark:

func Sum(arr []Example) int32 {
    var sum int32
    for idx := range arr {
        sum = sum + arr[idx].d + int32(arr[idx].a) + int32(arr[idx].c)
    }
    return sum
}

As the name suggests, Sum takes a slice of type Example, iterates over it, adds the int8 and int32 fields of each element into a variable called sum, and returns the total.
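For context, Sum operates on the Example struct defined later in the article; here is a minimal self-contained sketch so the function can be run in isolation:

```go
package main

import "fmt"

// Example mirrors the (unoptimized) struct benchmarked later in the article.
type Example struct {
	a int8
	b string
	c int8
	d int32
}

// Sum adds the three numeric fields of every element.
func Sum(arr []Example) int32 {
	var sum int32
	for idx := range arr {
		sum = sum + arr[idx].d + int32(arr[idx].a) + int32(arr[idx].c)
	}
	return sum
}

func main() {
	arr := []Example{{a: 1, c: 2, d: 3}, {a: 4, c: 5, d: 6}}
	fmt.Println(Sum(arr)) // (1+2+3) + (4+5+6) = 21
}
```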

Now, let's take a look at the benchmark code.

const length = 100000000 // 10^8

func BenchmarkSum(b *testing.B) {
    arr := make([]Example, length)
    b.ResetTimer() // exclude the slice allocation from the measurement
    for i := 0; i < b.N; i++ {
        _ = Sum(arr)
    }
}

Benchmark functions must be prefixed with Benchmark and take a single parameter of type *testing.B.

b.ResetTimer resets the benchmark timer. It is helpful when heavy setup, e.g. a computationally expensive initialization, would otherwise skew the results of the benchmark.

Here, we are initializing a slice of type Example of length 10^8 and then passing it to the Sum function; the return value is deliberately ignored, since only the benchmark measurements matter.

Benchmark command

go test -bench=. -benchtime 2s -count 10 -benchmem -cpu 4 -run notest

-bench=. runs every benchmark in the package.

-benchtime 2s indicates how long each benchmark is run.

-count 10 repeats each benchmark ten times, giving multiple results whose variation helps us make better sense of the numbers.

-benchmem additionally reports memory statistics (B/op and allocs/op).

-cpu 4 runs the benchmarks with GOMAXPROCS set to 4.

-run notest skips regular tests by matching a test name that does not exist.

Unoptimized struct and results

// unsafe.Sizeof(Example{}) => 32
type Example struct {
    a int8
    b string
    c int8
    d int32
}

Benchmark Results: saved in unoptimized.txt

goos: darwin
goarch: arm64
pkg: github.com/satyarth42/blogs
BenchmarkSum-4       1000000000             0.5105 ns/op           3 B/op           0 allocs/op
BenchmarkSum-4       1000000000             0.5222 ns/op           3 B/op           0 allocs/op
BenchmarkSum-4       1000000000             0.5117 ns/op           3 B/op           0 allocs/op
BenchmarkSum-4       1000000000             0.5270 ns/op           3 B/op           0 allocs/op
BenchmarkSum-4       1000000000             0.5616 ns/op           3 B/op           0 allocs/op
BenchmarkSum-4       1000000000             0.5771 ns/op           3 B/op           0 allocs/op
BenchmarkSum-4       1000000000             0.5764 ns/op           3 B/op           0 allocs/op
BenchmarkSum-4       1000000000             0.5325 ns/op           3 B/op           0 allocs/op
BenchmarkSum-4       1000000000             0.5056 ns/op           3 B/op           0 allocs/op
BenchmarkSum-4       1000000000             0.4920 ns/op           3 B/op           0 allocs/op
PASS
ok      github.com/satyarth42/blogs    132.412s

Optimized Struct

// unsafe.Sizeof(Example{}) => 24
type Example struct {
    b string
    d int32
    a int8
    c int8
}

Benchmark Results: saved in optimized.txt

goos: darwin
goarch: arm64
pkg: github.com/satyarth42/blogs
BenchmarkSum-4       1000000000             0.4072 ns/op           2 B/op           0 allocs/op
BenchmarkSum-4       1000000000             0.3486 ns/op           2 B/op           0 allocs/op
BenchmarkSum-4       1000000000             0.3637 ns/op           2 B/op           0 allocs/op
BenchmarkSum-4       1000000000             0.3539 ns/op           2 B/op           0 allocs/op
BenchmarkSum-4       1000000000             0.3341 ns/op           2 B/op           0 allocs/op
BenchmarkSum-4       1000000000             0.3716 ns/op           2 B/op           0 allocs/op
BenchmarkSum-4       1000000000             0.3537 ns/op           2 B/op           0 allocs/op
BenchmarkSum-4       1000000000             0.3709 ns/op           2 B/op           0 allocs/op
BenchmarkSum-4       1000000000             0.3356 ns/op           2 B/op           0 allocs/op
BenchmarkSum-4       1000000000             0.3337 ns/op           2 B/op           0 allocs/op
PASS
ok      github.com/satyarth42/blogs    70.524s

Comparison

> benchstat unoptimized optimized
goos: darwin
goarch: arm64
pkg: github.com/satyarth42/blogs
      │  unoptimized  │              optimized               │
      │    sec/op     │    sec/op     vs base                │
Sum-4   0.5246n ± 10%   0.3538n ± 6%  -32.56% (p=0.000 n=10)

      │ unoptimized │             optimized              │
      │    B/op     │    B/op     vs base                │
Sum-4    3.000 ± 0%   2.000 ± 0%  -33.33% (p=0.000 n=10)

      │ unoptimized │           optimized            │
      │  allocs/op  │ allocs/op   vs base            │
Sum-4    0.000 ± 0%   0.000 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

Looking at the comparison results, we can infer a 32.56% improvement in run time per op and a 33.33% reduction in memory used per op.

But why did we see such a large performance improvement?

One straightforward reason is that less memory had to be allocated for the slice: the optimized struct is 24 bytes, whereas the unoptimized struct is 32 bytes. On creating a slice of length 10^8, we immediately save 8 * 10^8 bytes of memory.
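That arithmetic can be spelled out with a tiny program (the per-struct sizes are the unsafe.Sizeof values reported above):

```go
package main

import "fmt"

func main() {
	const length int64 = 100000000 // 10^8 elements, as in the benchmark

	const unoptimizedSize int64 = 32 // bytes per struct, per unsafe.Sizeof
	const optimizedSize int64 = 24

	// Total memory saved across the whole slice.
	saved := (unoptimizedSize - optimizedSize) * length
	fmt.Printf("saved %d bytes (%.0f MB)\n", saved, float64(saved)/1e6)
}
```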

The second reason has more to do with a CPU-level optimization termed locality of reference: with the reduction in struct size, more elements of the slice fit into each CPU cache line, so more of the data the loop touches is already cached, greatly improving memory access times.

Final Thoughts

After running these benchmarks, we can see how a good understanding of Go's memory management, combined with minor tweaks, can significantly improve the performance of our programs.