This blog post announces
pts-line-bisect,
a C
program and an equivalent
Python
script and library for doing binary search in
line-sorted text files, and it also explains some of the design details
of the Python implementation.
Let's suppose you have a sorted text file, in which each line is
lexicographically larger than the previous one, and you want to find a
specific range of lines, or all lines with a specific prefix.
I've written
the program pts-line-bisect
for that recently. See the beginning of the
README
for many examples of how to use the C program. The Python program has fewer features;
here is how to use it:
- Closed-ended interval search: print all lines that are at least bar and at most foo:
$ python pts_line_bisect.py -t file.sorted bar foo
- Open-ended interval search: print all lines that are at least bar and smaller than foo:
$ python pts_line_bisect.py -e file.sorted bar foo
- Prepend position: print the smallest offset where foo can be inserted (please
note that this does not report whether foo is present; the exit status is always 0).
$ python pts_line_bisect.py -oe file.sorted foo
- Append position: print the largest offset where foo can be inserted (please
note that this does not report whether foo is present; the exit status is always 0).
$ python pts_line_bisect.py -oae file.sorted foo
- Exact range: print the range (start byte offset and after-the-last-byte end
offset) of lines containing only foo (the sketch after this list shows how to use these offsets):
$ python pts_line_bisect.py -ot file.sorted foo
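For example, the two byte offsets printed by the exact-range mode can be used to cut the matching block of lines out of the file. The following is an illustrative sketch only; it assumes the offsets are printed as two whitespace-separated decimal numbers (check the actual output of your version):

import subprocess

out = subprocess.check_output(
    ['python', 'pts_line_bisect.py', '-ot', 'file.sorted', 'foo'])
start, end = map(int, out.split())  # Assumed output format: '<start> <end>'.
with open('file.sorted', 'rb') as f:
  f.seek(start)
  print(f.read(end - start))  # Exactly the lines containing only 'foo'.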
Please note that these tools support duplicate lines correctly
(i.e. when the same line appears multiple times, forming a contiguous block).
Please note that these tools and libraries assume that the key is at the
beginning of the line. If you have a sorted file whose lines contain the sort key
somewhere in the middle, you need to make changes to the code. It's possible to do
so in a way that the code will still support duplicate keys (i.e. records with the
same key).
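For illustration only (this is a hypothetical sketch, not part of pts-line-bisect): if the sort key were, say, the second TAB-separated field of each line, the comparison done at each bisection step would extract the key first instead of comparing whole lines, roughly like this:

def get_key(line):
  # Hypothetical key extractor: the 2nd TAB-separated field is the sort key.
  fields = line.rstrip('\n').split('\t')
  return fields[1] if len(fields) > 1 else ''

# Inside the bisection, a raw comparison such as `x <= line' would become
# `x <= get_key(line)', which keeps working for duplicate keys (records
# sharing the same key).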
Analysis
I've written the article titled
Evolution
of a binary search implementation for line-sorted text files about the
topic, containing the problem statement, the possible pitfalls, an analysis
of some incorrect solutions available on the web as code examples, a detailed
specification and explanation of my solution (including a proof), disk seek
and speed analysis, a set of speedup ideas and their implementation, and
further notes about the speedups in the C implementation.
As a teaser, here is an incorrect solution in Python:
def bisect_left_incorrect(f, x):
  """... Warning: Incorrect implementation with corner case bugs!"""
  x = x.rstrip('\n')
  f.seek(0, 2)  # Seek to EOF.
  lo, hi = 0, f.tell()
  while lo < hi:
    mid = (lo + hi) >> 1
    f.seek(mid)
    f.readline()  # Ignore previous line, find our line.
    if x <= f.readline().rstrip('\n'):
      hi = mid
    else:
      lo = mid + 1
  return lo
Can you spot all 3 bugs?
Read
the
article for all the details and the solution.
About sorting: For the above implementation to work, files must be sorted
lexicographically, using the unsigned value (0..255) of each byte in the line. If you
don't sort the file, or you sort it using some language-specific collation or locale,
then the linked implementation won't find the results you are looking for. On Linux,
use LC_ALL=C sort <in.txt >out.txt
to sort lexicographically.
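As a small demonstration of the difference (assuming ASCII data): in byte-value order all uppercase letters (65..90) sort before all lowercase letters (97..122), while a locale-aware sort typically interleaves them:

words = ['banana', 'Apple', 'apple', 'Banana']
print(sorted(words))  # Byte-value (C locale) order: ['Apple', 'Banana', 'apple', 'banana'].
# A locale-aware sort (e.g. GNU sort with en_US.UTF-8) would typically produce
# apple, Apple, banana, Banana instead, and then the binary search won't find
# what you are looking for.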
As a reference, here is the correct implementation of the same algorithm
for finding the start of the interval in a sorted list or other sequences
(based on the bisect module):
def bisect_left(a, x, lo=0, hi=None):
    """Return the index where to insert item x in list a, assuming a is sorted.

    The return value i is such that all e in a[:i] have e < x, and all e in
    a[i:] have e >= x.  So if x already appears in the list, a.insert(x) will
    insert just before the leftmost x already there.

    Optional args lo (default 0) and hi (default len(a)) bound the
    slice of a to be searched.
    """
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo + hi) >> 1
        if x <= a[mid]:  # Change `<=' to `<', and you get bisect_right.
            hi = mid
        else:
            lo = mid + 1
    return lo
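A quick usage example with made-up data, showing the leftmost-insertion-point behavior that makes duplicate items work, and how two such searches select a half-open interval:

a = ['bar', 'baz', 'foo', 'foo', 'foo', 'qux']
print(bisect_left(a, 'foo'))  # 2: the start of the block of 'foo' duplicates.
print(bisect_left(a, 'fop'))  # 5: where a (missing) 'fop' would be inserted.
# Two searches select the half-open (open-ended) interval ['baz', 'foo'):
print(a[bisect_left(a, 'baz') : bisect_left(a, 'foo')])  # ['baz']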
A typical real-world use case for such a binary search tool is retrieving
lines corresponding to a particular time range in log files (or time-based
measurement records). These files are text files with variable-length lines, with
the log timestamp at the beginning of each line, and they are generated in
increasing timestamp order. Unfortunately the lines are not lexicographically
sorted, so the timestamp has to be decoded first for the comparison.
The bsearch tool does that;
it also supports parsing arbitrary, user-specifiable datetime formats, and it
can binary search in
gzip(1)ped files as well
(by building an index). It also has high performance and low overhead,
partially because it is written in C++. So bsearch is a practical tool
with lots of useful features. If you need anything more complicated than
a lexicographic binary search, use it instead.
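To illustrate why decoding is needed (just a sketch, not how bsearch works internally): with syslog-style timestamps, 'Feb ...' compares before 'Jan ...' in byte order, so a timestamp-aware search has to compare decoded keys rather than raw lines, for example:

import time

def timestamp_key(line):
  # Hypothetical decoder for syslog-style lines such as 'Jan  2 03:04:05 host ...'.
  # Returns a struct_time, which compares chronologically (the year defaults to
  # 1900, which is fine as long as all lines share the same implicit year).
  return time.strptime(line[:15], '%b %d %H:%M:%S')

print(timestamp_key('Jan  2 03:04:05 myhost prog: hello') <
      timestamp_key('Feb  1 00:00:00 myhost prog: world'))  # True, unlike byte order.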
Before getting too excited about binary search, please note that there
are much faster alternatives for data on disk.
In an external (file-based) binary search the slowest operation is the
disk seek needed for each bisection. Most of the software and hardware
components would be waiting for the hard disk to move the reading head.
(Except, of course, that seek times are negligible when non-spinning storage
hardware such as SSD or memory card is used.)
An out-of-the-box solution would be adding the data to a more
disk-efficient key-value store. There are several programs providing such
stores. Most of them are based on a
B-tree,
B*-tree or
B+-tree data structure if
sorted iteration and range searches have to be supported, or
disk-based
hashtables otherwise.
Some of the high-performance single-machine key-value stores:
cdb (read-only),
Tokyo Cabinet,
Kyoto Cabinet,
LevelDB;
see more in the NoSQL software list.
The fundamental speed difference between a B-tree search and a binary
search in a sorted list stems from the fact that B-trees have a branching
factor larger than 2 (possibly 100s or 1000s), thus each seeking step in a
B-tree search reduces the possible input size by a factor larger than 2, while
in a binary search each step reduces the input size only by a factor of 2
(i.e. we keep either the bottom half or the top half). So both kinds of
searches are logarithmic, but the base of the logarithm is different, and
this causes a constant factor difference in the disk seek count. By careful
tuning of the constants in B-trees it's usual to have only 2 or 3 disk seeks
for each search even for 100GB of data, while a binary search in such
a large file with 50 bytes per record would need 31 disk seeks. Taking
10ms as the seek time (see more info about
typical
hard disk seek times), a typical B-tree search takes 0.03 seconds, and a
typical binary search takes 0.31 seconds.
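For reference, here is how the back-of-the-envelope numbers above can be reproduced (a small sketch using the same assumptions: 100GB of data, 50-byte records, 10ms per seek):

import math

file_size = 100e9     # 100 GB of data.
record_size = 50      # Bytes per record (line), as assumed above.
seek_time = 0.010     # 10 ms per disk seek.

records = file_size / record_size                     # 2e9 records.
binary_seeks = int(math.ceil(math.log(records, 2)))   # 31 bisection steps.
print(binary_seeks * seek_time)                       # ~0.31 s for the binary search.
print(3 * seek_time)                                  # ~0.03 s for a 3-seek B-tree search.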
Have fun binary searching, but don't forget to sort your files first!