I teach my students to build a tree of promising optimizations and sub-optimizations (it can be rough), keep track of each interesting optimization, then implement the branch that seems most promising, and at each step lean heavily on the profiler: memory bound, compute bound, latency bound, number of FLOPs, cache accesses, etc.
After narrowing a branch down to its best leaf, move on to another branch and ask whether any of the optimizations already studied could give nice results there too.
With time they learn beforehand which branches are more promising and which optimizations suit their problem.
I guess this could be called branch and bound with memoization instead of brute force, haha.
Note: we write our code in CUDA.