Warning

This page can now be found at https://djharper.dev/post/2018/03/29/using-hyperloglog-in-production-a-retrospective/

Your browser should redirect to the new URL at djharper.dev in 5 seconds, but if this fails please click here

I've migrated my website away from github pages and will no longer be posting content here.

djhworld

thoughts


Using HyperLogLog in production, a retrospective

A few years ago I was involved in a project that required us to provide a time series metric on how many concurrent users were using our products, and what quality of service they were receiving.

On embarking on this journey, it quickly became apparent that the tricky part would be doing the count of unique concurrent users, over a set of dimensions in one minute windows. We’d run into the classic count-distinct problem.

Our dataset was stored in Amazon S3, with PrestoDB as our query backend, and the query we were running was slow and difficult to scale. As you can imagine, running count(distinct) on thousands of anonymous session IDs across millions of records isn’t an easy problem to solve. Well, until someone pointed us in the direction of the approx_distinct function.

tl;dr: We tried using HyperLogLog over fine grained time series data in production, but at the time it was not working as well as we’d hoped. I wanted to explore HyperLogLog further though as I thought it was cool, so spent some time building HyperLogLog Playground, a website to help visualise what this cool algorithm is doing under the hood.

By trading away some degree of accuracy, approx_distinct gave us a result in a much quicker time, with what looked like a reasonable trade off for production use. However, after a few weeks, it was becoming obvious that, while the speed of the query was satisfactory, sometimes the result added noise. You would see that over one minute windows, the line on our graphs would deviate in a sporadic fashion.

This meant users were questioning the validity of the data, was it the error in the approximation, or were there user impacting problems? In this case, at the time it seemed the approximate count was the wrong choice for fine grained time series analysis, and meant we had to pivot towards a streaming data pipeline that would give a fully accurate picture, at the cost of latency, added infrastructure and complexity.

Since then, it looks like the Presto contributors implemented a change to increase the precision options of approx_distinct, so this might have alleviated our noise problem, but we had moved on by that point. Our users liked having a near real time, accurate figure from the new pipeline.

So while, at the time, approx_distinct did not work for us, I was still intrigued by what approx_distinct was actually doing. What made it so quick? How could it get so close to the actual figure? This line of questioning inspired me to create a small website called HyperLogLog Playground where I could somewhat consolidate my understanding, while learning a bit of Javascript on the way. It was a fun exercise and I hope people like it.

Whenever you are presented with a count distinct problem, it’s always worth keeping HyperLogLog in mind as a tool to use, but just be 100% sure that the small loss of accuracy is tolerable, and for time series data make sure you test it against a variety of scenarios.

Warning

This page can now be found at https://djharper.dev/post/2018/03/29/using-hyperloglog-in-production-a-retrospective/

Your browser should redirect to the new URL at djharper.dev in 5 seconds, but if this fails please click here

I've migrated my website away from github pages and will no longer be posting content here.