One way to improve performance is to unroll loops and inline code. Unfortunately, both increase code size and put pressure on the instruction cache, which sometimes makes a program slower instead. It's probably a lot harder to balance these tradeoffs inside the compiler than to just... sometimes try both and measure.
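To make the tradeoff concrete, here is a minimal sketch in C of the same reduction written rolled and manually unrolled by four. The function names (`sum_rolled`, `sum_unrolled4`, `check_sums_agree`) and the unroll factor are assumptions picked for illustration, not anything from a particular compiler or codebase. The unrolled version does fewer branches per element but has a larger loop body plus a cleanup loop, which is exactly the extra code that ends up competing for instruction cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rolled version: small code, one add and one loop branch per element. */
uint64_t sum_rolled(const uint32_t *a, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Manually unrolled by 4 (the factor is an arbitrary choice for this sketch).
 * One loop branch per four elements, at the cost of a bigger loop body
 * and a second loop to handle the leftover elements. */
uint64_t sum_unrolled4(const uint32_t *a, size_t n) {
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)  /* cleanup for the remaining 0 to 3 elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}

/* Hypothetical self-check: both versions must return the same sum.
 * Unsigned integer addition is associative, so reordering is safe here. */
void check_sums_agree(void) {
    uint32_t data[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    for (size_t n = 0; n <= 10; n++)
        assert(sum_rolled(data, n) == sum_unrolled4(data, n));
}
```

Whether `sum_unrolled4` is actually faster depends on the compiler, the optimization flags, and the target CPU (an optimizing compiler may well unroll `sum_rolled` on its own), so the honest answer is to time both on the real workload.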