The routines were benchmarked individually with custom tools that iterate repeatedly and use statistical analysis to converge on a timing estimate. Each routine was always compared against a plain C reference implementation.
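A minimal sketch of that kind of converge-until-stable harness. Everything here is a hypothetical stand-in, not the actual tools: `fast_routine`/`ref_routine` are dummies, and the stopping rule (standard error under 1% of the mean) is just one simple way to "converge on an estimate".

```cpp
#include <chrono>
#include <cmath>
#include <cstdio>

// Hypothetical stand-ins for the routine under test and its C reference.
static volatile unsigned sink;
static void fast_routine() { unsigned x = 0; for (int i = 0; i < 1000; ++i) x += i; sink = x; }
static void ref_routine()  { unsigned x = 0; for (int i = 0; i < 4000; ++i) x += i; sink = x; }

// Time a single call.
static double time_once(void (*fn)()) {
    auto t0 = std::chrono::steady_clock::now();
    fn();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Iterate until the standard error of the mean drops below 1% of the mean.
static double converge(void (*fn)(), int min_runs = 30, int max_runs = 100000) {
    double sum = 0.0, sum_sq = 0.0;
    int n = 0;
    while (n < max_runs) {
        double t = time_once(fn);
        ++n;
        sum += t;
        sum_sq += t * t;
        if (n >= min_runs) {
            double mean = sum / n;
            double var = (sum_sq - n * mean * mean) / (n - 1);
            double sem = std::sqrt(var > 0.0 ? var / n : 0.0);
            if (sem < 0.01 * mean) return mean;  // estimate has stabilized
        }
    }
    return sum / n;  // hit the cap; report the mean anyway
}

int main() {
    double fast = converge(fast_routine);
    double ref = converge(ref_routine);
    std::printf("speedup vs C reference: %.2fx\n", ref / fast);
}
```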
Then there was a system for benchmarking the software as a whole on a wide variety of architectures, including NUMA machines, with lots of plots and statistics.
Usually you’d eventually reach a point where the improvements were below the noise floor, or they helped on some systems and caused regressions on others. The rule was usually “no regressions”.
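A minimal sketch of what such a gate could look like; the `Result` fields and the 2% noise figure are assumptions for illustration, not the actual system:

```cpp
#include <string>
#include <vector>

// One row per tested system, comparing the old and new builds.
struct Result {
    std::string system;    // e.g. "2-socket NUMA box" (hypothetical label)
    double baseline_s;     // old build, seconds
    double candidate_s;    // new build, seconds
    double noise_floor;    // relative measurement noise, e.g. 0.02 for 2%
};

// Accept a change only if no system slows down by more than its noise floor.
bool accept_change(const std::vector<Result>& results) {
    for (const Result& r : results) {
        double delta = (r.candidate_s - r.baseline_s) / r.baseline_s;
        if (delta > r.noise_floor) return false;  // regression: reject
    }
    return true;  // faster, or within noise, everywhere
}
```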
VTune for multithreading optimization. We built a fiber-based, lock-free system for efficient scheduling.
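Not the actual scheduler, but a minimal sketch of the kind of lock-free building block such a system leans on: a Treiber stack of ready tasks. The names are hypothetical, and a production version would need ABA protection (tagged pointers or hazard pointers) plus real fiber contexts instead of the `Task` placeholder.

```cpp
#include <atomic>

// Hypothetical ready-task node; a real fiber scheduler would store a saved
// execution context here rather than a bare function pointer.
struct Task {
    void (*run)(void*);
    void* arg;
    Task* next = nullptr;
};

// Treiber stack: push/pop via compare-and-swap, no locks. Classic caveat:
// pop() is exposed to ABA, so production code needs extra protection.
class ReadyStack {
    std::atomic<Task*> head_{nullptr};

public:
    void push(Task* t) {
        t->next = head_.load(std::memory_order_relaxed);
        // On failure the CAS reloads head_ into t->next, so we just retry.
        while (!head_.compare_exchange_weak(t->next, t,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
        }
    }

    Task* pop() {
        Task* t = head_.load(std::memory_order_acquire);
        // Swing head_ from t to t->next; short-circuit keeps t->next safe.
        while (t && !head_.compare_exchange_weak(t, t->next,
                                                 std::memory_order_acquire,
                                                 std::memory_order_relaxed)) {
        }
        return t;  // nullptr means the stack was empty
    }
};
```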