
No amount of measuring and squeezing--not even years of it--is a substitute for high-level thinking. And vice versa.

Imagine: function F() { for (i = 0; i < 10; i++) { A(); B(); C(); } }

If we profile this code, we might find that, e.g., B takes the majority of the time--let's say 90%. So you spend hours, days, weeks making B 2X faster. Great. Now you've removed 45% of execution time. But the loop in the outer function F is just a few instructions; it is not "hot"--it won't show up in profiles except for ones that capture stacks.

If you're just stuck in the weeds optimizing hot functions that show up in profiles, it's possible to completely overlook F. That loop might be completely redundant, causing 10X the workload by repeatedly computing A, B, and C, which may not need to be recomputed.
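To make it concrete, a minimal sketch of the fix, assuming A, B, and C really are loop-invariant here:

    // A, B, and C don't depend on the loop variable, so the loop is just
    // 10x the work -- compute each once.
    function F() {
      A();
      B();
      C();
    }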

There are bazillions of examples like this. Say you find out that a function is super, super hot. But it's just a simple function. There are calls to it all over the code. You can't make it any faster. Instead you need to figure out how to not call it at all, e.g. by caching or rethinking the whole algorithm.
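A rough sketch of the caching route (hotLookup and the key are made-up names, just for illustration):

    // Cheap per call but called everywhere: memoize by argument so most
    // call sites never pay for the work at all.
    const memo = new Map();
    function cachedLookup(key) {
      if (!memo.has(key)) memo.set(key, hotLookup(key));
      return memo.get(key);
    }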

> How could measuring be a substitute for thinking/analyzing/predicting/forming a plan?

This happens more than you think. Understanding how the system works in enough detail and also at a high level to formulate a plan is in short supply. Jumping in and hacking in things, like a cache or something, is surprisingly common.




Small functions need special attention not just because they show up as leaf nodes everywhere but also because they are difficult for profilers to account for properly. You get two functions listed as each taking 4% of CPU time, and one could easily be taking up twice as much compute as the other. The sort of memory pressure that small functions generate can also end up scapegoating a big function: the one that uses a large fraction of memory gets stuck with the cold caches and the GC pressure from the piddly functions fouling the nest.

One of my best examples of this: I had a function reported as still taking 10% of cumulative run time after I’d tweaked it as much as I could. But I’d set up a benchmark that called a code path a deterministic number of times, and this function was getting called twice as often as it should have been. I found two sibling methods asking the same question, rearranged them so the answer came in as an argument, and nixed the duplicate call. I reran the benchmark, and instead of getting a reduction of 5% (10/2), I got 20%. That was all memory pressure.
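Roughly the shape of that refactor (hypothetical names):

    // Before: two sibling methods each ask the same question themselves,
    //   siblingA() -> expensiveAnswer(), siblingB() -> expensiveAnswer().
    // After: the caller asks once and passes the answer down as an argument.
    function caller() {
      const answer = expensiveAnswer();
      siblingA(answer);
      siblingB(answer);
    }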

The worst memory pressure I ever fixed: I saw a 10x improvement from removing one duplicate call. Now, there was a quadratic part of that call, but it was a small enough n that I expected 3x and hoped for 4x, and was as shocked as anyone when it went from 30s to 3s with one refactor.


> it won't show up in profiles except for ones that capture stacks

I don't think I've ever used a profiler that couldn't report you were in F() here. One that only captures your innermost functions really doesn't seem that useful, for exactly the reasons you give.


The default usage of perf does this. There are also a few profilers I know of that will show the functions taking the most time.

IMO, those are (generally) nowhere near as useful as a flame/icicle graph.

Not saying they are never useful; sometimes people do really dumb things in one function. However, the actual performance bottleneck often lives at least a few levels up the stack.


Which is why the defaults for perf always drive me crazy. You want to see the entire call tree with the cumulative and exclusive time spent in all the functions.

I’m honestly curious why the defaults are the way they are. I have basically never found them to be what I want. Surely the perf people aren’t doing something completely different than I am?
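For reference, the view I actually want takes something like this (binary name is a placeholder; which unwind method works varies by build):

    # call stacks are not captured by default; opt in at record time
    perf record -g ./myprog                    # frame-pointer unwinding
    perf record --call-graph dwarf ./myprog    # DWARF unwinding, for builds without frame pointers
    # show the call tree with cumulative ("children") and exclusive ("self") time
    perf report --children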

I almost never find graph usage useful, TBH (and flamegraphs are worse than useless). And perf's support for stack traces is always wonky _somehow_, so it's not easy to find good defaults for the cases where I need them (I tend to switch between fp, lbr and dwarf depending on a whole lot of factors).

Tell me about it!

I think I've only been able to get good call stacks when I build everything myself with the right compilation options. This is a big contrast with what I remember from working with similar tools in MSFT environments (MS Profiler or vTune).

You can get it to work, but it's a pain.


To be honest, I don't like Linux profiling tools at all. Clearly the people working on them have a very different set of problems than I do.

I think it boils down to what Brendan Gregg likes. He must be doing a somewhat different type of work, and so he likes these defaults.

Agree with this, but it's not what I concluded from the OP. Architectural decisions made from the start are where most optimizations should happen. I remember from school some kids who wrote this super optimized loop, and the teacher said: do you really have to do that same calculation on every iteration?

But in the real world, code bases are massive, and it is hard to predict when worlds collide. Most things don't matter until they do. So measuring is the way to go, I believe.


Measuring is also useless once someone has introduced bottom up caching.

There’s so much noise at that point that even people who would usually catch problems start to miss them.

The usual response to this is, “well, you can turn caching off to do profiling,” but that’s incorrect, because once people know they can get a value from the cache they stop passing it on the stack. So your function that calls A() three times when it should have called it twice? You now find it’s being called ten times.

And the usual response to that is, “well, it’s free now, so who cares?” Except it’s not free. Every cache miss now either costs you multiple times over or requires much more complex cache bookkeeping, which is more overhead, and every hit resets the MRU data on that entry, making it more likely that other elements get evicted.

For instance, in NodeJS, concurrent fetches for the same resource often go into a promise cache, but now the context of the closure for the promise is captured in the cache, and it doesn’t take much to confuse v8 into keeping a bunch of data in scope that isn’t actually needed anymore. I’ve had to fix that a few times. Hundreds of megabytes in one case, because it kept an entire request handler in scope.
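A stripped-down sketch of that failure mode (names made up; the point is the closure sitting in the cache outliving the request):

    const inflight = new Map();
    function getResource(id, req) {
      if (!inflight.has(id)) {
        const p = fetchFromUpstream(id).then(data => {
          // This closure captures `req`, so as long as the promise sits in
          // the cache, the whole request handler context stays reachable.
          logAccess(req.user, id);
          return data;
        });
        inflight.set(id, p);
      }
      return inflight.get(id);
    }
    // One fix: delete the entry when the promise settles, and capture only
    // the fields you actually need rather than the whole request.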


And I forgot the worst part, which is that most of these workflows assume A() will return the same answer for the duration of the interaction, and that’s just not true. Passing value objects on the stack guarantees snapshotting: for the duration of the call sequence, all of the code will see the same A. Not so with the cache.

You may still run into problems where you expect A and B to have a relationship between them that doesn’t hold if there’s a gap between looking them up, but it’s often less likely or severe than if, for instance, half a page has the data in state S and half of it is in state T.



