
No amount of measuring and squeezing--not even years of it--is a substitute for high-level thinking. And vice versa.

Imagine: function F() { for (i = 0; i < 10; i++) { A(); B(); C(); } }

If we profile this code, we might find that, e.g., B takes the majority of the time--let's say 90%. So you spend hours, days, weeks making B 2X faster. Great. Now you've removed 45% of execution time. But the loop in the outer function F is just a few instructions; it is not "hot"--it won't show up in profiles except for ones that capture stacks.

If you're just stuck in the weeds optimizing hot functions that show up in profiles, it's possible to completely overlook F. That loop might be completely redundant, causing 10X the workload by repeatedly computing A, B, and C, which may not need to be recomputed.
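To make it concrete, a minimal sketch of the fix, assuming A, B, and C really are loop-invariant here:

    // A, B, and C don't depend on the loop variable, so the loop is just
    // 10x the work -- compute each once.
    function F() {
      A();
      B();
      C();
    }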

There are bazillions of examples like this. Say you find out that a function is super, super hot. But it's just a simple function. There are calls to it all over the code. You can't make it any faster. Instead you need to figure out how to not call it at all, e.g. by caching or rethinking the whole algorithm.
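A rough sketch of the caching route (hotLookup and the key are made-up names, just for illustration):

    // Cheap per call but called everywhere: memoize by argument so most
    // call sites never pay for the work at all.
    const memo = new Map();
    function cachedLookup(key) {
      if (!memo.has(key)) memo.set(key, hotLookup(key));
      return memo.get(key);
    }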

> How could measuring be a substitute for thinking/analyzing/predicting/forming a plan?

This happens more than you think. Understanding how the system works in enough detail and also at a high level to formulate a plan is in short supply. Jumping in and hacking in things, like a cache or something, is surprisingly common.




Small functions need special attention not just because they show up as leaf nodes everywhere but also because they are difficult for profilers to account for properly. You get two functions listed as each taking 4% of CPU time, and one could easily be taking up twice as much compute as the other. The sort of memory pressure that small functions generate can also end up scapegoating a big function: the one that uses a large fraction of memory gets stuck with the cold caches and the GC pressure from the piddly functions fouling the nest.

One of my best examples of this: I had a function reported as still taking 10% of cumulative run time after I’d tweaked it as much as I could. But I’d set up a benchmark that called a code path a deterministic number of times, and this function was getting called twice as often as it should have been. I found two sibling methods asking the same question, rearranged them so the answer came in as an argument, and nixed the duplicate call. I reran the benchmark, and instead of getting a reduction of 5% (10/2), I got 20%. That was all memory pressure.
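Roughly the shape of that refactor (hypothetical names):

    // Before: two sibling methods each ask the same question themselves,
    //   siblingA() -> expensiveAnswer(), siblingB() -> expensiveAnswer().
    // After: the caller asks once and passes the answer down as an argument.
    function caller() {
      const answer = expensiveAnswer();
      siblingA(answer);
      siblingB(answer);
    }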

The worst memory pressure I ever fixed: I saw a 10x improvement from removing one duplicate call. Now, there was a quadratic part of that call, but it was a small enough n that I expected 3x and hoped for 4x, and was as shocked as anyone when it went from 30s to 3s with one refactor.


> it won't show up in profiles except for ones that capture stacks

I don't think I've ever used a profiler that couldn't report you were in F() here. One that only captures your innermost functions really doesn't seem that useful, for exactly the reasons you give.


The default usage of perf does this. There are also a few profilers I know of that will show the functions taking the most time.

IMO, those are (generally) nowhere near as useful as a flame/icicle graph.

Not saying they are never useful; sometimes people do really dumb things in one function. However, the actual performance bottleneck often lives at least a few levels up the stack.


Which is why the defaults for perf always drive me crazy. You want to see the entire call tree with the cumulative and exclusive time spent in all the functions.

I’m honestly curious why the defaults are the way they are. I have basically never found them to be what I want. Surely the perf people aren’t doing something completely different than I am?
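For reference, the view I actually want takes something like this (binary name is a placeholder; which unwind method works varies by build):

    # call stacks are not captured by default; opt in at record time
    perf record -g ./myprog                    # frame-pointer unwinding
    perf record --call-graph dwarf ./myprog    # DWARF unwinding, for builds without frame pointers
    # show the call tree with cumulative ("children") and exclusive ("self") time
    perf report --children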

I almost never find graph usage useful, TBH (and flamegraphs are worse than useless). And perf's support for stack traces is always wonky _somehow_, so it's not easy to find good defaults for the cases where I need them (I tend to switch between fp, lbr and dwarf depending on a whole lot of factors).

Tell me about it!

I think I've only been able to get good call stacks when I build everything myself with the right compilation options. This is a big contrast with what I remember from working with similar tools in MSFT environments (MS Profiler or vTune).

You can get it to work, but it's a pain.


To be honest, I don't like Linux profiling tools at all. Clearly the people working on them have a very different set of problems than I do.

I think it boils down to what Brendan Gregg likes. He must be doing a somewhat different type of work, and so he likes these defaults.

Agree with this, but it's not what I concluded from the OP. Architectural decisions made from the start are where most optimizations should happen. I remember from school some kids who wrote this super optimized loop, and the teacher said: do you really have to do that same calculation on every iteration?

But in the real world, code bases are massive, and it is hard to predict when worlds collide. Most things don't matter until they do. So measuring is the way to go, I believe.


Measuring is also useless once someone has introduced bottom up caching.

There’s so much noise at that point that even people who would usually catch problems start to miss them.

The usual response to this is, “well, you can turn caching off to do profiling,” but that’s incorrect, because once people know they can get a value from the cache they stop passing it on the stack. So your function that calls A() three times when it should have called it twice? You now find it’s being called ten times.

And the usual response to that is, “well, it’s free now, so who cares?” Except it’s not free. Every cache miss now either costs you multiple times over or requires much more complex cache bookkeeping, which is more overhead, and every hit resets the MRU data on that entry, making it more likely that other elements get evicted.

For instance, in NodeJS, concurrent fetches for the same resource often go into a promise cache, but now the context of the closure for the promise is captured in the cache, and it doesn’t take much to confuse v8 into keeping a bunch of data in scope that isn’t actually needed anymore. I’ve had to fix that a few times. Hundreds of megabytes in one case, because it kept an entire request handler in scope.
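A stripped-down sketch of that failure mode (names made up; the point is the closure sitting in the cache outliving the request):

    const inflight = new Map();
    function getResource(id, req) {
      if (!inflight.has(id)) {
        const p = fetchFromUpstream(id).then(data => {
          // This closure captures `req`, so as long as the promise sits in
          // the cache, the whole request handler context stays reachable.
          logAccess(req.user, id);
          return data;
        });
        inflight.set(id, p);
      }
      return inflight.get(id);
    }
    // One fix: delete the entry when the promise settles, and capture only
    // the fields you actually need rather than the whole request.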


And I forgot the worst part, which is that most of these workflows assume A() will return the same answer for the duration of the interaction, and that’s just not true. Passing value objects on the stack guarantees snapshotting: for the duration of the call sequence, all of the code will see the same A. Not so with the cache.

You may still run into problems where you expect A and B to have a relationship between them that doesn’t hold if there’s a gap between looking them up, but it’s often less likely or severe than if, for instance, half a page has the data in state S and half of it is in state T.



