Recently we achieved a 15-fold speedup in a core process of our optimization engine. Essentially, this shortcuts four doublings of computer throughput - four years of hardware advances.
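The "four doublings" figure is just the base-2 log of the speedup - a quick sanity check, assuming throughput doubles roughly once a year:

```python
import math

speedup = 15
doublings = math.log2(speedup)  # log2(15) ≈ 3.91, i.e. roughly four doublings
print(f"{doublings:.2f} doublings")
```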
There are multiple takeaways -
1. “I’d call a hunch the result of automatic reasoning below the conscious level on data you did not know you possessed.” - Robert Heinlein
2. Architecture or algorithm design is still essential for maximum performance even as advances in computer hardware and cloud architectures spoil generations of developers who don't optimize the low-level processes involved in executing algorithms or moving data.
3. Experiments are low cost and easy to conduct in the cloud, where capacity can be purchased by the minute. It's not like rocket hardware that has to be scrapped with each subsequent generation. Basically, it is a low-cost / low-risk environment in which to play a hunch.
4. Aim High - We are trying to achieve a 10X improvement in our core abilities, as my friends advocate, rather than spending the same amount of time on small incremental performance improvements. So we are willing to try some radical reconfigurations as experiments.
5. Virtualize but Verify - Coastal had a particular implementation of a high-speed cache, running on a VM, that our hardware team insisted was fine: it ran at a low percentage of CPU and appeared to perform well. Yet something always bothered me about it, and on a hunch I co-located this routine, expecting at least to save the network hops from one server to another - roughly a doubling of speed, from 120µs to 60µs. Over the network, this device returned 100,000 lookups/distance calculations in 29 seconds. When placed intra-machine, the same 100,000 calculations returned in 2 seconds - a 15X gain. I followed a hunch. This same pattern has recurred - sometimes with dramatic, project-saving effects - so frequently that I'm now conditioned to follow these hunches.
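A minimal sketch of the kind of measurement that surfaced the gap: the same toy lookup timed over a loopback socket (every call pays a network round trip) versus called in-process. The `distance_lookup` stub and the little TCP service are mine for illustration, not Coastal's code - the point is only that per-call hop overhead dwarfs the cost of a cheap lookup.

```python
import socket
import threading
import time

def distance_lookup(key: int) -> int:
    # Stand-in for the cache lookup; deliberately trivial
    return key * 2

def serve(listener: socket.socket) -> None:
    # Accept one client and answer lookups until it disconnects
    conn, _ = listener.accept()
    with conn:
        while True:
            data = conn.recv(64)
            if not data:
                break
            conn.sendall(str(distance_lookup(int(data))).encode())

# Start a loopback "remote" lookup service on an ephemeral port
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
port = listener.getsockname()[1]
threading.Thread(target=serve, args=(listener,), daemon=True).start()

client = socket.create_connection(("127.0.0.1", port))
N = 10_000

# Remote path: each lookup is a socket round trip
t0 = time.perf_counter()
for i in range(N):
    client.sendall(str(i).encode())
    client.recv(64)
remote = time.perf_counter() - t0

# Co-located path: same lookup, no hop
t0 = time.perf_counter()
for i in range(N):
    distance_lookup(i)
local = time.perf_counter() - t0

client.close()
print(f"remote: {remote:.3f}s  local: {local:.3f}s  "
      f"speedup: {remote / local:.0f}x")
```

Even on loopback, with no real network and no virtualization in the way, the hop-per-call path is orders of magnitude slower - which is why measuring CPU on the server tells you nothing about latency as seen by the requester.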
I could not have anticipated a 15X gain. I had to repeat the tests several times to believe it. But still, there was this nagging background thought that an improvement was there to be had. Listening to that hunch is what moved the needle - while not getting sidetracked by the daily task list. After all, our own hardware team said it was fine - but they were measuring the wrong thing: CPU use on the VM rather than throughput to the requester. It just seemed like it could be faster.
Even though these machines were co-located, the network delays were material - but more than that, the virtualized nature of the lookup machine likely introduced competition for ports, network bandwidth, or even storage. That last item, storage, has caused us so much VM disruption that it moved us to dedicated hardware and dedicated SSD drives where sustained, predictable performance is mandatory.
Play your hunches, especially when the price of failure is so small and the potential benefits can leapfrog four years of performance improvements. Trust but verify - even your own team. And step back and look at the big picture - the spot in the architecture where you can have the most impact. And test, test, test!