
> I dislike the “intuition doesn’t work, profile your code” mantra because it seemingly says profiling is a viable replacement for theoretical calculations, which it isn’t.

This seems like a nonsensical statement to me. How could measuring be a substitute for thinking/analyzing/predicting/forming a plan?

Measuring/profiling just means observing the system you want to optimize in a systematic way. You certainly won't be very effective at optimizing anything if you don't observe it.

Theoretical calculations means you've formed a model of what's happening and you're devising a plan to optimize against that model. But of course, a useful model needs to represent the significant aspects of your system (and a good model should exclude most insignificant ones). Failing to observe your system means your model could be bad -- focused on insignificant aspects and missing significant ones -- and you'd never know.




I think many people have seen both sides of this in practice. I’ve seen engineers follow the profiler into a dead end because nothing stands out in the profile or they don’t grok the essential nature of the code or system they are profiling. I’ve seen engineers with deep domain expertise consistently make accurate estimates of how a code change will impact performance without ever using a profiler, because their mental model of the code execution maps to reality with high fidelity.

Profilers fill in gaps in our mental model of code execution but they are not a substitute for it. Computer behavior is largely knowable from first principles, albeit requiring a considerable degree of expertise and detail. For some people in some contexts, the profiler adds little information to what they already understand. I know a few people that do almost all of their optimization work using pencil and paper with great effectiveness and precision. Not something I would recommend for most software engineers but not unreasonable at the limit either.


Optimize according to your mental model; profile to check that your mental model matches reality.

If you optimize infrequently, or haven't profiled code like this recently, or haven't profiled this specific codebase recently, then your mental model is probably due a refresh.


In a complex system it isn't always easy to reason about the behavior. Even if you fully understand the CPU, the compiler, the kernel, the disks, etc., the way they interact in a real application isn't easy to hold in your head or to deal with using theoretical tools.

I've spent a lot of time optimizing video codecs and while it is maybe possible to reason about these without a profiler it's much faster to profile. The profiler isn't going to tell you what you need to do though.


A lot of the optimisation I do largely consists of a series of highly educated guesses, and the resulting fixes are right in 99% of cases.

You're right, I could've phrased that better.

Profiling to find suboptimal code is perfectly fine. Then you need to figure out how to fix it. Many people don't understand how performance optimization works, so they blindly add caching, improve constant factors by invoking lower-level methods, etc. This obviously doesn't work, yet intuitively (to those people, anyway) it should produce good results.

That's why the mantra exists: don't trust your intuition, don't believe it when it says these changes improve performance, instead measure that performance and only apply changes that work. This is also perfectly fine, but this is a double-edged sword, and I've seen people go too far in this direction.

For example, they refuse to make any changes that don't immediately improve performance according to the profiler. If they modify one computation and performance decreases, they abandon that path altogether. They treat optimization as a game with a dense fog of war, and they refuse to apply deductive reasoning and, of course, intuition to pursue changes that, according to the profiler at least, are not immediately rewarding.


I think there's a related problem where profiling/measurements can be made poorly and not reflect the real world.

Eg: indexing or partitioning a database table may appear to make things slower if you don't have both a representative amount of data and representative query patterns when you're measuring the change.

You should still measure your changes, but sometimes you need to be careful about measuring them in the right way, and possibly simulating a future context (eg: more scale) before drawing a conclusion.

Intuition about how the context will evolve, and what effect that might have on the tradeoffs of different approaches, is helpful.


Sounds like a tricky balancing act. There are things that are extremely difficult to "game out." CPUs are very complicated. There are optimizations that seem like they could be cache friendly in theory, but aren't in practice.

Sure, but avoiding CPU cache misses is hardly the only form of optimization. Some optimizations are blatantly obvious. As an example, there was some code I was working on that called the database every single time it was executed. The data it was pulling from the database got updated once a year, and the entire dataset could fit into about 4kb of memory. Caching it in memory was an obvious fix, and given how fast the rest of the code ran once it had this data it was basically a 500% speedup that anyone with half a brain could have roughly estimated.
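
For concreteness, the fix amounted to something like this (a minimal sketch; the names, refresh interval, and loadFromDatabase call are illustrative, not the actual code):

    import java.time.Duration;
    import java.time.Instant;
    import java.util.List;

    // Sketch: cache a tiny, rarely-changing dataset in memory and refresh it
    // on a generous interval instead of hitting the DB on every call.
    class YearlyDataCache {
        private final Duration maxAge = Duration.ofDays(1); // refresh daily; data changes yearly
        private volatile List<String> cached;
        private volatile Instant loadedAt = Instant.EPOCH;

        List<String> get() {
            if (cached == null || Instant.now().isAfter(loadedAt.plus(maxAge))) {
                cached = loadFromDatabase();   // hypothetical: the query that used to run every time
                loadedAt = Instant.now();
            }
            return cached;
        }

        private List<String> loadFromDatabase() {
            // ... the ~4kb dataset that gets updated once a year ...
            return List.of();
        }
    }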

"Intuitively" literally means without having to learn something.

Adding caches or switching to lower-level calls is definitely something learned, and I wouldn't call it "intuitive".

What I think you are referring to is that sometimes, simply reading and understanding the code can tell you where the problem really is — still, my experience is that you want to measure the before and after to at least identify the general area you should be looking at, and more often than not, I could figure out what needs optimizing and how without having to get detailed profiles.

I did end up being surprised a few times, but it was mostly due to external code like buggy library implementations that didn't do the things they promised (eg. async library really synchronizing everything with a badly placed mutex).

At the same time, it's wrong to focus on a single case or a single profile (executing one code path in one set of circumstances), but what you are describing simply sounds like bad engineering — the fact that you can misuse the tool does not make the tool bad, it's still the engineer wielding it who's at fault.


And yet their statement makes perfect sense to me.

Caching and lower level calls are generic solutions that work everywhere, but are also generally the last and worst way to optimise (thus why they need such careful analysis since they so often have the opposite effect).

Better is to optimise the algorithms, where actual profiling is a lesser factor. Not a zero factor of course, as a rule of thumb it’s probably still wise to test your improvements, but if you manage to delete an n^2 loop then you really don’t need a profiler to tell you that you’ve made things better.


Which is exactly what I said: I am only arguing against their use of "intuitively".

I think the confusion sneaks in because “measuring” is something you do several times, and each iteration has a different connotation.

Once you “find” a candidate change you measure it to see if what you did made things worse and you put it back if it did, or maybe you try it in combination with other changes to see if its value is complementary.

But people fuck up all the time reading the initial telemetry, which is often where I come in. I get tired of hearing people say, “we’ve done all we can, look how flat this chart is,” and hand someone my beer. You won’t find all of the good candidates in the percentage of run time list. That’s not all the data that’s there, and not every change that works needs to even be supported by the initial data. It only needs to be supported by the delta afterward.


No amount of measuring and squeezing--not even years of it--is a substitute for high-level thinking. And vice versa.

Imagine: function F() { for (i = 0; i < 10; i++) { A(); B(); C(); } }

If we profile this code, we might find out, e.g. B takes the majority of the time--let's say 90%. So you spend hours, days, weeks, making B 2X faster. Great. Now you removed 45% of execution time. But the loop in the outer function F is just a few instructions, it is not "hot"--it won't show up in profiles except for ones that capture stacks.

If you're just stuck in the weeds optimizing hot functions that show up in profiles, it's possible to completely overlook F. That loop might be completely redundant, causing 10X the workload by repeatedly computing A, B, and C, which may not need to be recomputed.
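
A sketch of what fixing F looks like, assuming (hypothetically) that A, B and C really don't depend on the loop at all:

    // Hypothetical sketch: if a(), b() and c() don't depend on the loop,
    // deleting the repetition beats any amount of micro-tuning inside b().
    class HoistExample {
        void fBefore() { for (int i = 0; i < 10; i++) { a(); b(); c(); } }
        void fAfter()  { a(); b(); c(); }   // same result, a tenth of the work

        void a() {} void b() {} void c() {} // stand-ins for the real work
    }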

There are bazillions of examples like this. Say you find out that a function is super, super hot. But it's just a simple function. There are calls to it all over the code. You can't make it any faster. Instead you need to figure out how to not call it at all, e.g. by caching or rethinking the whole algorithm.

> How could measuring be a substitute for thinking/analyzing/predicting/forming a plan?

This happens more than you think. Understanding how the system works in enough detail and also at a high level to formulate a plan is in short supply. Jumping in and hacking in things, like a cache or something, is surprisingly common.


Small functions need special attention not just because they show up as leaf nodes everywhere but also because they are difficult for profilers to account properly. You get two functions listed as each taking 4% of CPU time, and one could easily be taking up twice as much compute as the other. The sort of memory pressure that small functions can generate can end up scapegoating a big function that uses a large fraction of memory because it gets stuck with cold cache or lots of GC pressure from the piddly functions fouling the nest.

One of my best examples of this, I had a function reported as still taking 10% of cumulative run time, after I’d tweaked it as much as I could. But I’d set up a benchmark that called a code path a deterministic number of times and this function was getting called twice as much as it should. I found two sibling methods asking the same question and rearranged them so the answer came as an argument and nixed the duplicate call. I reran the benchmark and instead of getting a reduction of 5% (10/2), I got 20%. That was all memory pressure.
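
The shape of that change, sketched with made-up names (not the actual code):

    // Sketch: two siblings each asked the same question, so the answer is
    // computed once and passed down as an argument instead.
    class Report {
        // before: both siblings called expensiveLookup() themselves
        void renderBefore() { renderHeader(expensiveLookup()); renderBody(expensiveLookup()); }

        // after: ask once, pass the answer down
        void renderAfter() {
            String answer = expensiveLookup();
            renderHeader(answer);
            renderBody(answer);
        }

        String expensiveLookup() { return "..."; }   // the call that showed up at 10%
        void renderHeader(String answer) {}
        void renderBody(String answer) {}
    }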

The worst memory pressure I ever fixed gave a 10x improvement from removing one duplicate call. Now, there was a quadratic part of that call, but it was a small enough n that I expected 3x and hoped for 4x, and was as shocked as anyone when it went from 30s to 3s with one refactor.


> it won't show up in profiles except for ones that capture stacks

I don't think I've ever used a profiler that couldn't report you were in F() here. One that only captures your innermost functions really doesn't seem that useful, for exactly the reasons you give.


The default usage of perf does this. There's also a few profilers I know of that will show the functions taking the most time.

IMO, those are (generally) nowhere near as useful as a flame/icicle graph.

Not saying they are never useful; Sometimes people do really dumb things in 1 function. However, the actual performance bottleneck often lives at least a few levels up the stack.


Which is why the defaults for perf always drive me crazy. You want to see the entire call tree with the cumulative and exclusive time spent in all the functions.

I’m honestly curious why the defaults are the way they are. I have basically never found them to be what I want. Surely the perf people aren’t doing something completely different than I am?

I almost never find graph usage useful, TBH (and flamegraphs are worse than useless). And perf's support for stack traces is always wonky _somehow_, so it's not easy to find good defaults for the cases where I need them (I tend to switch between fp, lbr and dwarf depending on a whole lot of factors).

Tell me about it!

I think I've only been able to get good call stacks when I build everything myself with the right compilation options. This is a big contrast with what I remember working with similar tools under MSFT environments (MS Profiler or vTune).

You can get it to work though but it's a pain.


To be honest I don't like Linux profiling tools at all. Clearly the people working on them have a very different set of problems than I do

I think it boils down to what Brendan Gregg likes. He must be doing somewhat different type of work and so he likes these defaults.

Agree with this, but it's not what I concluded from OP. Architectural decisions made from the start are where most optimizations should happen. I remember from school some kids did a super optimized loop and the teacher asked: do you really have to do that same calculation on every iteration?

But, in the real world, code bases are massive, and it is hard to predict when worlds collide. Most things do not matter until they do. So measuring is the way to go, I believe.


Measuring is also useless once someone has introduced bottom up caching.

There’s so much noise at that point that even people who would usually catch problems start to miss them.

The usual response to this is, “well you can turn caching off to do profiling,” but that’s incorrect because once people know they can get a value from the cache they stop passing it on the stack. So your function that calls A() three times when it should have called it twice? You find now that it’s being called ten times.

And the usual response to that is, “well it’s free now so who cares?” Except it’s not free. Every cache miss now either costs you multiple misses, or much more complex cache bookkeeping, which is more overhead, and every hit resets the MRU data on that entry, making it more likely that other elements get evicted.

For instance in NodeJS concurrent fetches for the same resource often go into a promise cache, but now the context of the closure for the promise is captured in the cache, and it doesn’t take much to confuse v8 into keeping a bunch of data in scope that’s not actually reachable. I’ve had to fix that a few times. Hundreds of megabytes in one case because it kept an entire request handler in scope.


And I forgot the worst part which is that most of these workflows assume that A() will return the same answer for the duration of the interaction and that’s just not true. Passing value objects on the stack guarantees snapshotting. For the duration of the call sequence, all of the code will see the same A. Not so with the cache.

You may still run into problems where you expect A and B to have a relationship between them that doesn’t hold if there’s a gap between looking them up, but it’s often less likely or severe than if, for instance, half a page has the data in state S and half of it is in state T.


I think what he’s saying here is you can’t skip the basic math step to arrive at good performance. Staring at profiling results will lead you to a local minimum.

Profiling doesn't mean you don't do the math. Profiling is simply there to show you that "hey, this is where the problems actually are".

You do profiling because it's WAY too easy to get obsessed about theoretical problems when a simple measurement will show you the actual problems.

You do the math on the actual problem location, not a method with O(n!) which only gets called with n=3.

You still have to look at the entire call stack when profiling (which means thinking about the overarching algorithm).


Or optimizing a scheduling noop loop dozens of times given local encapsulation obscures global functional context.

The fact remains most projects that do small trivial modular prototypes first will ultimately know which paths are viable before painting themselves into a corner algorithmically.

Best of luck =3


Well, he's saying that intuition does work... But does it really?

If a problem area is so intuitively obvious, why would you introduce the problem in the first place? In reality, performance optimizations are usually needed where you least expect them. Which means that you can't get there intuitively. Hence, the suggestion of using profiling to help track down where the problem is instead.


Most code I’ve seen is written in an environment where people can’t afford to “intuit”: thinking through the repercussions of specific language or algorithm choices comes second to solving a problem “well enough”. Building an intuition takes learning how you’ve built something poorly, and for a specific problem can take hours to gain context and walk yourself through many paths.

Bring in a performance expert and their intuition can quickly identify many performance improvements. The tough part for the business is when and where do you pay for the experience of performant code, and when is good enough to leave alone?


Do you want to claim you've never written quick and ugly code to get something working to come back and fix it up later?

Pretty much everyone I know will throw down an O(n^2) algorithm or whatever in their first pass and replace it with something more thought out once they have the time to think deeply about it.

If you're fretting about optimization at every stage of development, you're really doing it wrong. This is precisely the type of early optimization that everyone and their dog will tell you is bad. You should focus first on making your program work and laying out the logic and data flows. Optimization does not matter until the program is almost finished.


> Pretty much everyone I know will throw down an O(n^2) algorithm or whatever in their first pass and replace it with something more thought out once they have the time to think deeply about it.

Most of the times I've seen this, the faster algorithm is literally just a dictionary with a well-defined key.

I honestly do not understand why that's not the first solution most devs think of and why n^2 seems to dominate.

As an example, I've seen code like this quite a bit:

    result = []
    for (item: items) {
      for (item2: items) {
        if (item != item2 && item.foo == item2.foo) {
          result.add(item)
        }
      }
    }
Easily replaced and massively easier to read as

    result = []
    itemFoo = HashSet()
    
    for (item: items) {
      if (itemFoo.contains(item.foo))
        result.add(item)
      else
        itemFoo.add(item.foo)
    }
The mindset of re-looking for values in a collection you are already iterating through is just foreign to me. The first solution for something like this in my mind is always utilizing dictionaries.

The first version is more readable, faster for small n (which is the common case), cannot run out of memory due to the hash table, compiles to less code, does not need item to be hashable and cannot accidentally copy elements (depending on language). It _should_ be your default. (Well, except you should start the inner loop at the items after item, so that you don't need the item != item2 test.)

> The first version is more readable

Disagree

> faster for small n (which is the common case)

How can you know that? And for smaller n, the choice hardly matters.

> can not run out of memory due to the hash table

If n gets large enough that hash table size becomes a problem, n is large enough that you need the hash table or some other structure that reduces the complexity down from O(n^2).

> compiles to less code

This is just nonsense. Any codebase of significant size already has a hashtable setup and unless you are using a language like C++, the code size of using a hash table anywhere is constant. The cost is already paid.

The actual amount of compiled code from the switch is identical.

> does not need item to be hashable

True.

There would be reasons not to choose option 2, but it should be the default until it's been proven to be a problem. And even then, the next step is almost certainly not going to be option 1.

But then this could also be language and application specific. I deal primarily with backend services/Java where memory is plentiful, and allocation is fast. Algorithmic complexity is for my applications very often the better indicator of what good code should be.


For many people performance optimization really starts with your second version. I'm not even sure what this is supposed to do exactly but the O(N^2) is just bad/sloppy code. Optimization is usually more about getting the right algorithms to execute faster. At least that's how I think about it.

Yet, as you'll see in this thread, at least one other commenter is arguing that option 1 is better. What you and I see as bad/sloppy code, they see as the ideal starting point.

A pretty common attitude that I've ran into. People take the "premature optimization" quote to mean "Never think about big oh when writing code".


Beware of microbenchmarks, but at least on my machine under the specific circumstances I ran it under, the double loop algorithm is about two times faster than the hash map algorithm while n < 100. That might not mean much if you spend your days working with large datasets, but in my day-to-day, datasets with 100 entries or fewer are the norm. Readability is subjective, but at least from a performance perspective it is a reasonable starting point unless you know upfront that you are going to be working with large datasets – but in that case the optimization isn't premature, is it?
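
For what it's worth, the kind of harness I mean looks roughly like this (a JMH sketch with a made-up Item type; exact numbers will of course vary by machine and data):

    import java.util.*;
    import java.util.concurrent.TimeUnit;
    import org.openjdk.jmh.annotations.*;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    @State(Scope.Benchmark)
    public class DuplicateFooBench {
        public record Item(int foo) {}

        @Param({"10", "100", "1000"})
        public int n;

        List<Item> items;

        @Setup
        public void setup() {
            Random rnd = new Random(42);
            items = new ArrayList<>();
            for (int i = 0; i < n; i++) items.add(new Item(rnd.nextInt(n)));
        }

        @Benchmark
        public List<Item> doubleLoop() {            // the O(n^2) version upthread
            List<Item> result = new ArrayList<>();
            for (Item a : items)
                for (Item b : items)
                    if (a != b && a.foo() == b.foo()) result.add(a);
            return result;
        }

        @Benchmark
        public List<Item> hashSet() {               // the HashSet version upthread
            List<Item> result = new ArrayList<>();
            Set<Integer> seen = new HashSet<>();
            for (Item it : items) {
                if (seen.contains(it.foo())) result.add(it);
                else seen.add(it.foo());
            }
            return result;
        }
    }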

Here's the main issue I have with this reasoning. It builds in a fault to the program that doesn't need to be there.

If n is usually less than 100 then sure maybe it does make sense to do the more brute force approach. However, what's the impact of using the hash table first? If n is usually less than 100, now this method is slower usually, but is that a problem? Well, you should use a profiler to find out if it is or isn't. However, that may be entirely tolerable performance wise.

On the other hand, if your assumption that "n < 100" doesn't hold up and, indeed, you start seeing n > 100, or 1000, or 10000, then all of a sudden this method becomes a fault in the application you developed that needs to be rewritten.

What's devious about the "n > 100" problem is when you first write the code, you may have very well been correct that "n < 100". But time and inputs change the way code is executed.

A lot of the code I've fixed has been due to this. I've seen the counter where really searching a list/array was the better option only a handful of times.

To me, premature optimization is using a known algorithmically inferior approach, of the same complexity as the superior approach, because you know that for small n, it is faster without knowing whether or not that will be the case.


> It builds in a fault to the program that doesn't need to be there.

Not really. You're just exploring an unknown frontier. Once the frontier is known and you actually understand what you are trying to solve you're still going to put in the right algorithm.

I mean, you might not if you're vibe coding, which I guess is apparently a thing somehow, but I'm assuming we're talking about actual engineering here, not speaking the sordid tales of cowboys. The "premature optimization" line was definitely not directed at those who think code is a feeling.

When the frontier is unknown, whatever algorithm gets something operating as soon as possible is the right one. The given example may be so contrived that the hash map would be what you see as the quickest and dirtiest solution anyway, but with a little more complexity loops do tend to be easier to reason about on a first-pass basis.

> Now all the sudden this method becomes a fault in the application you developed that needs to be rewritten.

In the same way that finding out that a blue button doesn't resonate with users, requiring you to rewrite your application to make the button green. Such is the life of building software. But that's why engineers get paid the big bucks – nobody else is stupid enough to do it.

> A lot of the code I've fixed has been due to this.

You're not really fixing it, you're updating it to reflect the information you gleaned during the exploratory phase. I mean, unless you're coding on vibes or something, but, again, assuming actual engineering is taking place here. Nobody should be coding non-prototypes by the guide of their feelings.


We're not really benchmarking here. The author of the code does need to understand the size of the data they're operating on etc. I also didn't critique the hashmap version. Using hashmaps for everything is also "lazy".

Good code should follow something along these lines:

- A reasonable understanding of what needs to happen. What are the requirements. Including data set sizes. Including how "hot" this code path is.

- A reasonable pick of the algorithm to perform what needs to happen.

Sure. O(N^2) for very small data sets could outperform an O(NlogN) or O(N). The conflation of big oh notation with actual performance is just another form of sloppiness as is picking a bad algorithm. In both cases it reflects a lack of understanding of how computers work and how to do things efficiently.

Let's look at what we're trying to do here. We want to find all values that occur more than once in a vector. Using a hash map means you have to:

- Create this data structure on the fly.

- Hash the values.

- Do the insertions/lookup.

You should consider the size of your dataset, you should consider locality/caches etc.

100 entries squared is 10k though. If this is part of a critical code path, let's say your frontend code does this for every DOM element or every notification it needs to process this can end up the bulk of CPU usage. This sort of stuff is why I've seen applications cause the fans to go on or the browser to crash. I would say bad O(N^2) algorithms for large Ns are pretty common.

I would be surprised if e.g. a C++ implementation of the hash version doesn't win over the double loop for data sets of 100. But maybe that's borderline. For a dataset of 1000 there's no way the double loop is winning.

Other alternatives might be something like an efficient small bloom filter that can be stored in a register, sorting the array. I mean at least cut the iterations in half by not iterating over the entire range in the inner loop!

Then you won't have the trouble with:

  if (item != item2 && item.foo == item2.foo) {
being a bit less readable than you might think.
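
Something along these lines, say (a sketch with a made-up Item type, assuming java.util.* is imported; which side of each matching pair you collect depends on what the original loop was actually for):

    // Inner index starts at i + 1: half the comparisons, no item != item2
    // test, still O(n^2) but with a smaller constant.
    record Item(String foo) {}

    static List<Item> sameFooPairs(List<Item> items) {
        List<Item> result = new ArrayList<>();
        for (int i = 0; i < items.size(); i++) {
            for (int j = i + 1; j < items.size(); j++) {
                if (Objects.equals(items.get(i).foo(), items.get(j).foo())) {
                    result.add(items.get(i));
                    result.add(items.get(j));
                }
            }
        }
        return result;
    }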

> I mean at least cut the iterations in half by not iterating over the entire range in the inner loop!

The reason I wrote it that way is because the actual common way I find this is in Java where the team wants to use the `stream` api.

IE

    items.stream()
      .filter(item -> items.stream()
        .filter(item2 -> item2 != item)
        .anyMatch(item2 -> Objects.equals(item.foo, item2.foo)))
      .toList();
That's not really super readable or universalizable to other languages.

Even more commonly, instead of doing the `findAny` I often see the trip up being someone doing a `contains` or other O(n) operations on `List` objects inside iterations of that list.

It's this sort of fluent/functional style of programming that often seems to get tripped up on doing n^2 (or more) algorithms.


This only works if you don't need real time performance. For anything that does need to run in realtime, you have to be damn sure that whatever your initial approach is can be optimized to run at the required sample rate. Otherwise, you will be ripping out the entire design and starting over. You might as well optimize first.

Purposefully introducing underperforming code as a tradeoff to meet some other goal (faster delivery, for example) is something else entirely. It doesn't require intuition or profiling – at most requiring only memory of the choice. A fun aside, but not related to what we're talking about.

One could even say optimization does not matter until the program is already running.

FYI, the author is a woman.

I think the person you are responding to is criticizing the original "mantra" quote not the author's criticism of it(?)

Yeah, that's why I think it's nonsense.

Measuring doesn't mean don't think. Measuring and thinking are two different things. You need to do them both to optimize effectively.


> > I dislike the “intuition doesn’t work, profile your code” mantra because it seemingly says profiling is a viable replacement for theoretical calculations, which it isn’t.

> This seems like a nonsensical statement to me. How could measuring be a substitute for thinking/analyzing/predicting/forming a plan?

I believe the mantra “intuition doesn’t work, profile your code” is understood to mean “don't rely on your intuition alone; profile your work.” It's a warning that when it comes to performance optimization, intuition is often unreliable, so you need to supplement your mental model with actual data if you want to avoid big mistakes.

I don't know why the author of the blog post is interpreting the mantra literally, as if it claims intuition is obsoleted by profiling.


I think theoretical calculations could mean a detailed calculation of the number of times a certain function or operation is called. We all know from tech interviews that there is big O time complexity, but this is usually very hand-wavy and not precise enough. You can usually come up with a more precise formula, though it can get messy with recurrences if your algorithm involves recursion. You probably need a computer algebra system. Spending an afternoon doing these symbolic calculations might give you better intuition of why the profiling produced such a measurement.
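
As a toy illustration of the kind of calculation I mean (not from the article, just an example): a divide-and-conquer with T(n) = 2T(n/2) + 1 calls comes out to 2n - 1 calls when n is a power of two, and a counter makes that cheap to sanity-check:

    // Count calls of a toy recursive function and compare with the closed form.
    class CallCount {
        static long calls = 0;

        static void solve(int n) {
            calls++;
            if (n <= 1) return;
            solve(n / 2);
            solve(n / 2);
        }

        public static void main(String[] args) {
            for (int n = 1; n <= 1024; n *= 2) {
                calls = 0;
                solve(n);
                System.out.println("n=" + n + " calls=" + calls + " predicted=" + (2L * n - 1));
            }
        }
    }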

Agree. A classic example is compilers that let you choose between optimizing for speed or binary size. But a smaller sized binary is sometimes faster.

Why not both?

One way to improve performance is to unroll loops and inline code. Unfortunately this increases code size and puts pressure on the instruction cache, making a program sometimes slower. It's probably a lot harder to balance these out in the compiler than to just... sometimes try.

Impossible. The speed option may do things like loop unrolling, function inlining and today even far more complicated things than that, and therefore creates larger binaries.

While profiling and measuring is very important if you want to optimize performance, there are a lot of things you can do without any profiling. In many situations the consequences of each approach are well known or easily reasoned about. Most of the time it's simply "do less work" = "more performance", or avoiding obvious and well-known patterns like N+1 database queries.

But it's also very easy to mislead yourself that way; many "optimizations" might do much less than you think. So you should avoid implementing more complex or harder to understand code just because you think it is faster, but otherwise I'd certainly try to write faster code by default in areas I know well enough to judge that.
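
To illustrate the N+1 case mentioned above, roughly (a JDBC sketch with made-up table and column names; real code would use whatever data access layer you already have):

    import java.sql.*;
    import java.util.*;

    // Sketch of the N+1 query pattern versus a single round trip.
    class OrderItems {
        // N+1: one query per order id. Fine for 3 orders, painful for 3000.
        static void perOrder(Connection conn, List<Long> orderIds) throws SQLException {
            for (long id : orderIds) {
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT * FROM order_items WHERE order_id = ?")) {
                    ps.setLong(1, id);
                    try (ResultSet rs = ps.executeQuery()) { /* ... */ }
                }
            }
        }

        // One round trip: fetch everything, group in memory afterwards.
        static void oneQuery(Connection conn, List<Long> orderIds) throws SQLException {
            String in = String.join(",", Collections.nCopies(orderIds.size(), "?"));
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT * FROM order_items WHERE order_id IN (" + in + ")")) {
                for (int i = 0; i < orderIds.size(); i++) ps.setLong(i + 1, orderIds.get(i));
                try (ResultSet rs = ps.executeQuery()) { /* ... */ }
            }
        }
    }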



