We're not really benchmarking here. The author of the code does need to understand the size of the data they're operating on, etc. I also didn't critique the hashmap version. Using hashmaps for everything is also "lazy".
Good code should follow something along these lines:
- A reasonable understanding of what needs to happen: what the requirements are, including data set sizes and how "hot" this code path is.
- A reasonable pick of the algorithm to perform what needs to happen.
Sure. O(N^2) for very small data sets could outperform an O(N log N) or O(N) algorithm. Conflating big-O notation with actual performance is just another form of sloppiness, as is picking a bad algorithm. In both cases it reflects a lack of understanding of how computers work and how to do things efficiently.
Let's look at what we're trying to do here. We want to find all values that occur more than once in a vector. Using a hash map (sketched just after this list) means you have to:
- Create this data structure on the fly.
- Hash the values.
- Do the insertions/lookup.
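For concreteness, here's a minimal C++ sketch of that hash-map version (the function name, the `int` element type, and the example input are mine, purely for illustration):

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

// One pass, expected O(N), but it allocates the map on the fly and
// hashes every element -- exactly the costs listed above.
std::vector<int> duplicates_hashed(const std::vector<int>& values) {
    std::unordered_map<int, std::uint32_t> counts;
    counts.reserve(values.size());  // avoid rehashing mid-loop
    std::vector<int> result;
    for (int v : values) {
        // Report a value exactly once: the second time we see it.
        if (++counts[v] == 2) result.push_back(v);
    }
    return result;
}

int main() {
    for (int v : duplicates_hashed({3, 1, 4, 1, 5, 9, 2, 6, 5, 3}))
        std::cout << v << ' ';  // prints: 1 5 3
    std::cout << '\n';
}
```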
You should consider the size of your dataset, and you should consider locality, caches, etc.
100 entries squared is 10k though. If this is part of a critical code path (say your frontend code does this for every DOM element or every notification it needs to process), this can end up being the bulk of CPU usage. This sort of stuff is why I've seen applications cause the fans to spin up or the browser to crash. I would say bad O(N^2) algorithms for large Ns are pretty common.
I would be surprised if, e.g., a C++ implementation of the hash version doesn't win over the double loop for a data set of 100. But maybe that's borderline. For a data set of 1000 there's no way the double loop is winning.
Other alternatives might be something like an efficient small Bloom filter that fits in a register, or sorting the array. And at the very least, cut the iterations in half by not iterating over the entire range in the inner loop!
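Rough sketches of two of those alternatives (again illustrative only: the names and the `int` element type are assumptions, and the register-sized Bloom filter variant is left out). Sorting makes equal values adjacent, and the quadratic version at least starts the inner loop at `i + 1`:

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

// Sort-based: O(N log N), no hashing, good locality; equal values end up adjacent.
std::vector<int> duplicates_sorted(std::vector<int> values) {  // sorts its own copy
    std::sort(values.begin(), values.end());
    std::vector<int> result;
    for (std::size_t i = 1; i < values.size(); ++i) {
        // Take the first element of every run of equal values longer than one.
        if (values[i] == values[i - 1] &&
            (result.empty() || result.back() != values[i]))
            result.push_back(values[i]);
    }
    return result;
}

// Still O(N^2), but the inner loop starts at i + 1, which cuts the comparisons in half.
std::vector<int> duplicates_quadratic(const std::vector<int>& values) {
    std::vector<int> result;
    for (std::size_t i = 0; i < values.size(); ++i) {
        // Skip values we've already reported (result stays tiny in practice).
        if (std::find(result.begin(), result.end(), values[i]) != result.end())
            continue;
        for (std::size_t j = i + 1; j < values.size(); ++j) {
            if (values[i] == values[j]) {
                result.push_back(values[i]);
                break;
            }
        }
    }
    return result;
}

int main() {
    const std::vector<int> data{3, 1, 4, 1, 5, 9, 2, 6, 5, 3};
    for (int v : duplicates_sorted(data))    std::cout << v << ' ';  // 1 3 5
    std::cout << '\n';
    for (int v : duplicates_quadratic(data)) std::cout << v << ' ';  // 3 1 5
    std::cout << '\n';
}
```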
That's not really super readable or universalizable to other languages.
Even more commonly, instead of the `findAny`, the trip-up I often see is someone doing a `contains` or some other O(n) operation on a `List` inside an iteration over that same list.
It's this sort of fluent/functional style of programming that often seems to get tripped up into doing n^2 (or worse) algorithms.
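For what it's worth, here is a C++ analogue of that trip-up (the comment above is about Java's `List.contains`, but the shape is the same; the name here is made up): a linear scan hidden in the loop body quietly turns one tidy-looking loop into O(n^2).

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

// Looks like a single tidy loop, but std::count walks the whole vector for
// every element, so this is O(N^2) -- the same trap as calling contains()
// on a List inside an iteration over that same list.
std::vector<int> duplicates_contains_style(const std::vector<int>& values) {
    std::vector<int> result;
    for (int v : values) {
        if (std::count(values.begin(), values.end(), v) > 1 &&           // hidden O(N)
            std::find(result.begin(), result.end(), v) == result.end())  // report once
            result.push_back(v);
    }
    return result;
}

int main() {
    for (int v : duplicates_contains_style({3, 1, 4, 1, 5, 9, 2, 6, 5, 3}))
        std::cout << v << ' ';  // 3 1 5
    std::cout << '\n';
}
```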
Then you won't have the trouble with:
being a bit less readable than you might think.