The informal design goal of the Prometheus 2.x TSDB startup was that it should take no more than about a minute. Over the past few months there’s been reports of it taking quite a bit more than this, which is a problem if your Prometheus restarts for some reason. Almost all of that time is loading the WAL (write ahead log), which are the samples in the last few hours which have yet to be compacted into a block. I finally got a chance to dig into this at the end of October, and the outcome was PR#440 which reduced CPU time by 6.5x and walltime by 4x. Let’s look at how I arrived at these improvements.
I’ve been meaning to get more familiar with pprof, the Go profiling tool, as my job revolves around working on and around Go microservices. My team has been able to see the impact of the Go experts who can quickly find issues buried in a stack of profiles collected on a service.
Brian’s post is a great example of 1) identifying an issue, 2) diagnosing said issue and 3) observing the implemented improvements using pprof. His parting paragraph is particularly insightful, specifically:
I did spend quite a bit of time pouring over the code, and had several dead ends such as removing the call to NumSamples, doing reading and decoding in separate threads, and a few variants of how the processWALSamples sharding worked
Profiling and optimization are a mix of knowing your codebase and being able to identify false leads. A tool like pprof is invaluable for identifying both issues and improvements in a measurable way.