|
|
1.1 root 1: .fp 1 PA
2: .fp 2 PI
3: .fp 3 PB
4: .fp 4 PX
5: .EQ
6: delim $$
7: .EN
8: .ND January 2, 1981
9: .TM 81-11272-1 11173 11173-11
10: .TL
11: .UX
12: B-trees
13: .AU "MH 2C522" 7214
14: P.J. Weinberger
15: .AI
16: .MH
17: .AB
18: B-trees are a good data structure for storing, retrieving,
19: and updating records in files containing many records.
20: Ordinary
21: .UX
22: tools such as
23: $awk$ and $grep$ become too slow when used on files with more than
24: a few hundred lines,
25: and $ed$ can no longer be used for updating after a thousand or so lines.
26: This memo contains descriptions of subroutines and commands
27: for creating and maintaining B-trees,
28: and a complete example of their use in a data base with two files.
29: .AE
30: .CS 5 2 7 0 0 1
31: .NH
32: Introduction
33: .PP
34: This memo contains a description of subroutines and commands
35: which implement B-trees.
36: (See [Comer] for a bibliography on B-trees.)
37: The subroutines make the B-trees look to the user like a list
38: of
39: .I "key\(emvalue"
40: pairs,
41: sorted by
42: .I key .
43: The subroutines provide for positioning within the list,
44: reading the next pair, deleting pairs, and writing pairs.
45: The way the subroutines implement the list is by using
46: a prefix-compressed B-tree,
47: in which prefixes common to consecutive
48: .I keys
49: are factored out.
50: The details of the data structures are in sections 2 and 5.
51: .PP
52: There are several advantages to using these B-trees as the file
53: organization for data bases.
54: A given key can be searched for with a small number of file system
55: accesses, typically two or three.
56: Since the file looks to the user as if it is sorted on the keys,
57: it is easy to retrieve the lexically next key, and
58: so to scan the file in lexical order.
59: Because the keys are sorted,
60: adjacent keys tend to have a common prefix,
61: which can be compressed out.
62: In the case of
63: the dictionary
64: .I /usr/dict/words ,
65: where the
66: .I values
67: are all null,
68: the original file consists of 24473 words and 201032 bytes,
69: while the B-tree contains only 134144 bytes.
70: The B-tree is significantly shorter even though it contains
71: headers and other overhead.
72: .PP
73: Sections 3 and 4 contain descriptions of subroutines and commands.
74: .PP
75: There are 9 subroutines.
76: 4 of these correspond to
77: .UX
78: system calls $read$, $write$, $open$, $close$,
79: with the difference that the data stored in the B-trees is not just
80: a vector of bytes, but is a sequence of $key\(emvalue$ pairs.
81: There is one routine for deleting,
82: which has no corresponding system call.
83: The remaining 4 correspond to various aspects of $lseek$:
84: one positions, one reports on the position, one positions to the
85: beginning, and the last gives the length of the $value$ at the
86: current position.
87: .PP
88: The commands are utilities for building B-trees,
89: dealing with them at the command level,
90: and gathering statistics about them.
91: .PP
92: Section 6 contains a complete example constructing and using
93: a small but interesting data base.
94: .NH
95: B-trees
96: .PP
97: The B-trees used by the subroutines are slightly different from
98: standard B-trees.
99: The resemblance would be closer if the routines did not support the
100: .I value
101: half of the pair.
102: Then the
103: .UX
104: B-trees would be B-trees in which all the keys are in leaves,
105: and in which the keys can be of variable length, and are stored
106: with some compression.
107: Adding the
108: .I value
109: half of a pair allows very large records to be stored.
110: Further each record can be stored contiguously,
111: rather than broken across tree nodes.
112: .PP
113: The standard definition of B-tree requires some bounds on the number
114: of keys in a node,
115: as in 2-3 trees, where each node has 2 or 3 keys.
116: While this restriction is convenient for proving theorems
117: it is silly in practice.
118: Intuitively, trees which branch a lot at each node
119: have more leaves for a given height,
120: so that searching from the root to a leaf requires fewer file
121: system accesses the bushier the tree.
122: Therefore one wants as many keys as possible in each node,
123: all other things being equal.
124: The node size has been chosen to be the disk block size.
125: Given that,
126: various algorithms for data compression might be used to squeeze
127: as many keys as possible into a node.
128: .PP
129: The method I chose maintains the keys in lexical order in the node.
130: If a key begins with the same $n$ bytes as its immediate predecessor,
131: it is stored with the $n$ bytes replaced by the count $n$.
132: A trivial analysis shows how much space might be saved:
133: Suppose that the keys are chosen at random from strings of length $m$
134: in which each of $a$ characters appears with equal probability.
135: Since there are no more than $a sup k$ prefixes of length $k$,
136: one expects that the length of the common prefix of two of $N$
137: keys is about $log ( N) / log (a)$.
138: Since one byte is required to store the length of the common prefix,
139: the space saving should be around
140: $N( log N / log a - 1) $ bytes.
141: .PP
142: Using this compression technique does not make handling the keys
143: much more complicated.
144: The reader will have no trouble seeing that searching to find a
145: key is easy, that the size of a list of keys decreases when a key
146: is deleted, and that the size of a list of keys increases by no more
147: than the uncompressed size of the new key when a key is inserted.
148: .PP
149: The neighborhood of an interior node in a B-tree looks like
150: .KS
151: .nf
152: .so picture.out
153: .fi
154: .KE
155: Each node which is not a leaf contains one more pointer than key:
156: .EQ
157: p sub 0 , ~ k sub 0 , ~ p sub 1 , ~ k sub 1 , ~ ... ,
158: ~ k sub n , ~ p sub n+1 .
159: .EN
160: Of the keys in the subtree whose root is the given node,
161: $p sub j$ points to the subtree containing keys strictly greater than
162: $k sub j-1$ and no greater than $k sub j$.
163: $p sub 0$ points to the subtree containing keys no greater than $k sub 0$,
164: and $p sub n+1$ points to the subtree containing keys strictly greater
165: than $k sub n$.
166: If $n$ is 0, then there is only one subtree, and $p sub 0$ points to it.
167: This last case can occur when the tree is being built by the $build$
168: command, which may leave a single node nearly empty at each level of
169: the tree due to its zeal in making the rest of the nodes at that
170: level as full as possible.
171: More detail on file structure is in section 5.
172: .NH
173: Subroutines
174: .PP
175: There are 9 subroutines that C programs can use to access B-trees.
176: The normal order of events would be for a program to open one or more
177: B-trees using $bopen$.
178: To retrieve information the program would use $bseek$ to position
179: itself in the B-tree.
180: It could use $breclen$ to see how much it was about to read,
181: and then $bread$ to actually retrieve the data.
182: If it wanted to delete information it would use $bdelete$,
183: and to write new or changed information it would use $bwrite$.
184: .SH
185: /usr/include/cbt.h
186: .LP
187: contains definitions of some manifest constants and types used
188: by the subroutines.
189: Manifest constants are represented as CONSTANT.
190: A pointer of type
191: .I *bfile
192: is returned by the open routine and passed to each of the other
193: routines.
194: Data is passed in
195: an
196: .I mbuf :
197: .DS
198: .ft 8
199: typedef struct {
200: char *mdata;
201: unsigned short mlen;
202: } mbuf;
203: .ft P
204: .DE
205: .I mdata
206: points to
207: .I mlen
208: bytes of data.
209: .PP
210: An open B-tree is either positioned at its end or at some pair.
211: When the file is opened, it is positioned at the first pair.
212: Reading the B-tree gets the pair at the current position.
213: Each subroutine has a deterministic effect on the current position.
214: .PP
215: B-trees can have any one of the 4 types INDEX, READONLY, both,
216: or neither.
217: The type is determined when the B-tree is created.
218: INDEX says that the
219: .I value
220: part of a pair will always be null;
221: i.e.,
222: .I mlen
223: will be zero.
224: A B-tree should be READONLY if updates are infrequent and are never done
225: while someone else is reading in the B-tree.
226: .SH
227: bfile *bopen(b-tree-name, flag)
228: char *b-tree-name;
229: .LP
230: returns a pointer to an internal structure if the open is
231: successful, and NULL otherwise.
232: The file is positioned to its first pair.
233: .I flag
234: should be 0 if the file is only to be read, and 2 otherwise.
235: It is passed directly to the system
236: .I open
237: routine.
238: .SH
239: bclose(bf) bfile *bf;
240: .LP
241: closes the B-tree, flushing buffers and cleaning up in general.
242: If a process updates a B-tree which is not READONLY,
243: then it must call this routine before any other user can perceive
244: its updates.
245: .SH
246: bfirst(bf) bfile *bf;
247: .LP
248: resets the current position in the file to the beginning.
249: It returns EOF if the B-tree is empty.
250: .SH
251: bseek(bf, key) bfile *bf; mbuf key;
252: .LP
253: sets the current position in the file to the first key in the file
254: which is lexically greater than or equal to
255: .I key .
256: The routine returns FOUND if
257: .I key
258: was in the file,
259: EOF if
260: .I key
261: is greater than any key in the file,
262: and NOTFOUND otherwise.
263: Lexical ordering is that induced by the C type
264: .I char .
265: .SH
266: breclen(bf) bfile *bf;
267: .LP
268: returns the length of the
269: .I value
270: at the current position in the file,
271: or EOF if you are at the end of the file.
272: (On a machine where
273: $short$ and
274: .I int
275: are 16 bits long,
276: EOF will be a legal value for $mlen$,
277: name 65535.
278: If the user is on such a machine and writes objects of this length,
279: he must not rely on this subroutine to detect the end of the file.)
280: .SH
281: mbuf bkey(bf) bfile *bf;
282: .LP
283: returns the
284: .I key
285: at the current position of the file.
286: A key cannot have
287: .I mlen
288: of zero, so if the returned value does,
289: you are at the end of the file.
290: Do not change the data pointed to by
291: .I mdata
292: in the returned value,
293: as it is used by the subroutines.
294: .SH
295: bread(bf, key, rec) bfile *bf; mbuf *key, *rec;
296: .LP
297: returns 0 if successful and EOF otherwise.
298: If
299: .I key
300: is not NULL,
301: $ key->mlen $
302: is set to the length of the key at the current position of the file,
303: and the buffer pointed to by
304: $ key->mdata $
305: is filled with the key.
306: Likewise, if
307: .I rec
308: is not NULL,
309: $ rec->mlen $ is set to the length of the value at the current position
310: of the file,
311: and the buffer pointed to by
312: $rec->mdata $
313: is filled with the value.
314: Then the current position in the file is moved to the next pair.
315: It is the user's responsibility to make the buffers large enough
316: to hold what is put in them.
317: $breclen$ gives the necessary size for values,
318: and no key can be as large as the defined constant NDSZ from
319: .I /usr/include/cbt.h .
320: .SH
321: bdelete(bf, key) bfile *bf; mbuf key;
322: .LP
323: deletes the pair with the given
324: .I key
325: from the file if it is present.
326: The routine returns EOF on error,
327: FOUND if a pair was deleted,
328: and NOTFOUND otherwise.
329: Afterwards, the current position in the file is as if
330: .I "bseek(bf, key)"
331: had been called.
332: .SH
333: bwrite(bf, key, value) bfile *bf; mbuf key, value;
334: .LP
335: writes the pair
336: .I "key value"
337: into the file.
338: It returns EOF on error, FOUND if a pair was overwritten,
339: and NOTFOUND if a new pair was written.
340: When the routine returns,
341: the position of the file is as if
342: .I "bseek(bf, key)"
343: had been called.
344: Keys should not be more than about 120 bytes long.
345: .SH
346: Error Returns
347: .PP
348: If an error causes a routine to return EOF,
349: the external integer
350: .I errno
351: is set to one of the codes defined in the header file.
352: BNOWRITE means you tried to write on a B-tree not opened for writing,
353: BIOWRT means that the system wrote fewer bytes than requested,
354: BNOMEM means that a subroutine needed more memory and couldn't get it,
355: and BTALL means that the tree has grown too tall.
356: .I errno
357: can also be set because one of the system calls failed.
358: Then it will have one of the values from
359: .I /usr/include/errno.h .
360: .NH
361: Commands
362: .PP
363: The commands are all of the form
364: .I "cbt command args" .
365: The file
366: .I cbt
367: is a shell script which invokes the actual commands.
368: The commands are utilites.
369: $cat$ just uses the subroutines of the last section,
370: but $creat$, $build$, and $report$
371: require inside knowledge of data structures not available through
372: the B-tree subroutines.
373: This is all in the spirit of data abstraction.
374: The user, through the subroutines, gets certain services,
375: which are best provided through private data structures
376: not available to the user.
377: $build$ has another and more compelling justification;
378: it creates trees that require less space than if they were constructed
379: by a sequence of $bwrite$s.
380: .SH
381: cbt creat [flags] file
382: .LP
383: creates an empty B-tree.
384: In
385: .I flags ,
386: .I -i
387: makes the file have type INDEX,
388: and
389: .I -r
390: makes the file have type READONLY.
391: The
392: .UX
393: file created is
394: .I file.T .
395: If the B-tree is not of type INDEX,
396: .I file.F
397: is created to hold the values.
398: .SH
399: cbt build [-R] file
400: .LP
401: builds the B-tree
402: .I file
403: from the standard input.
404: The input data must be sorted by key.
405: The new file has each node as full as possible,
406: given that the program decides what to do with each key
407: before looking at the next.
408: .PP
409: If the optional argument is not present
410: each line of input contains a key and value separated by a tab.
411: All the bytes to the first tab on a line make up the key
412: (which must not be empty),
413: the tab is removed,
414: and the remaining bytes, not including the trailing newline,
415: make up the value.
416: Thus there is no way, without the optional argument to the command, to
417: get a tab or newline in a key, or a newline in a value.
418: .PP
419: If the optional argument is present,
420: the command expects a sequence of data of the form
421: .DS
422: .ft 8
423: struct {
424: struct
425: unsigned short klen;
426: char kdata[klen];
427: } key;
428: struct {
429: unsigned short vlen;
430: char vdata[vlen];
431: } value;
432: };
433: .ft P
434: .DE
435: The dictionary mentioned in the introduction was constructed by
436: .DS
437: .ft 8
438: cbt creat -i foo; sort /usr/dict/words | cbt build foo
439: .ft P
440: .DE
441: .SH
442: cbt cat [flags] files
443: .LP
444: writes the contents of the files on its standard output.
445: If the
446: .I R
447: flag is present the output is suitable for
448: .I R
449: input to
450: .I build,
451: otherwise the output is a sequence of `lines' of the form
452: key, tab, value, newline.
453: Thus the command
454: .DS
455: .ft 8
456: cbt cat -R file | cbt build -R newfile
457: .ft P
458: .DE
459: will compress the old file into the new file.
460: .SH
461: cbt report file
462: .LP
463: scans the file and reports on the number of pairs,
464: the total size, the amount of the tree in use, and
465: the amount of free space.
466: .NH
467: File Structure
468: .PP
469: The
470: .I .F
471: file contains the values concatenated together in the order they
472: were written.
473: Each time a pair is written, the new value is put on the end of
474: the
475: .I .F
476: file.
477: Deleting a pair has no effect on the
478: .I .F
479: file.
480: Therefore as a B-tree is updated,
481: the
482: .I .F
483: file grows.
484: .PP
485: The
486: .I .T
487: file contains the B-tree proper,
488: and pointers to the
489: .I .F
490: file if the B-tree is not an INDEX.
491: The
492: .I .T
493: contains only tree nodes,
494: each one NDSZ bytes long.
495: (NDSZ is defined in the header file.)
496: Each node has the form
497: .DS
498: header, keys, space, addresses, trailer.
499: .DE
500: The trailer is a $short$ containing the number of unused bytes in the
501: node,
502: which is the size of $space$.
503: .PP
504: The form of an $address$ depends on the type of the file
505: and the height of the node in the tree.
506: Leaves have height 0.
507: Each other node has height one greater than its children.
508: B-trees have the property that each node has a well-defined
509: height.
510: If the node is not a leaf, each $address$ is a $short$
511: which is the location of a child of the node.
512: The node pointed to by $address$ is the node starting at $address * NDSZ$
513: in the file.
514: Leaves in files of type INDEX don't have addresses.
515: Otherwise an $address$ is a structure containing the length of the
516: value and its starting position in the
517: .I .F
518: file.
519: .PP
520: Each $key$ structure contains its total length in a
521: one-byte field, then the number of bytes of prefix
522: in common with the preceding key in the node, and then the rest of the key.
523: .PP
524: The $header$ contains an identifier so that a program
525: can tell if it rewrote the node,
526: the number of keys in the node,
527: the type of the node,
528: and the height of the node.
529: The first node in a file is always the root.
530: The identifier is made from the process id and the time.
531: If a file is not READONLY the routines will not overwrite a node unless the
532: node was originally written by the process,
533: or unless the file is being closed and the node is the root.
534: In this way a file which is not READONLY supports one writer concurrently with
535: programs reading.
536: .NH
537: An example
538: .PP
539: The example is a data base with two files.
540: With it the user can translate from old department numbers
541: to new department numbers,
542: and get the title of a department from either department number.
543: The example illustrates how to construct the data base
544: and suggests one way of contstructing an index.
545: There is also a complete program for retrieving from the data base
546: whcih illustrates using all the subroutines which don't change
547: B-trees.
548: .PP
549: A file named $deptnos$ contains old department numbers,
550: new department numbers,
551: and department titles.
552: It is 1124 lines long and contains 56816 bytes.
553: Here are the first 4 and last 4 lines:
554: .DS
555: 0001 00001 Chairman of the Board
556: 0002 20000 Executive Vice President - Corporate Studies
557: 0003 30000 President/Project Planning
558: 0004 40000 Executive Vice President - Customer Systems
559: 9543 59543 Demand Economics Department
560: 9544 59544 Computer Systems Department
561: 9545 59545 Advanced Software Department
562: 9800 59800 Transfer of Non-Incremental Cost Between BIS and AT&T
563: .DE
564: For the example the file will go into 2 B-trees.
565: One named $dept$ will have the new department numbers as keys
566: and the department titles as values.
567: The other,
568: $olddept$,
569: will be an index which will give new department numbers as a
570: function of old.
571: By combining the two, a user will be able to determine
572: the department title from the old department number.
573: In both,
574: leading zeros will be stripped off department numbers.
575: The commands
576: .I "cbt creat dept"
577: and
578: .I "cbt creat -i olddept"
579: create empty B-trees.
580: The bold (but correct) command
581: .EQ
582: delim off
583: .EN
584: .DS
585: .ft 8
586: awk < deptnos '{$1 = ""; print}' |
587: sed 's/^ //
588: s/^0*//
589: s/ /\et/' | sort | cbt build dept
590: .ft P
591: .DE
592: .EQ
593: delim $$
594: .EN
595: builds $dept$.
596: The $awk$ command deletes the old department number and
597: replaces multiple blanks by a single blank.
598: The $sed$ command removes the leading blank
599: (which is the separator $awk$ put after the null first field)
600: and replaces the single blank separating the new
601: department number from the department title with a tab for
602: .I "cbt build" .
603: $sort$ sees 48089 bytes.
604: $dept.T$ contains 13284 bytes,
605: and $dept.F$ contains 39955 bytes.
606: Thus the additional space required for the B-tree is(39955+13284)/48089,
607: which is 1.118, about 12%.
608: The command
609: .EQ
610: delim off
611: .EN
612: .DS
613: .ft 8
614: awk < deptnos '{print $1, $2}' |
615: sed 's/^0*//
616: s/ 0*/ /' | sort | cbt build olddept
617: .ft P
618: .DE
619: .EQ
620: delim $$
621: .EN
622: builds $olddept$.
623: The $awk $ command prints the old and new department numbers,
624: separated by a space.
625: Each of the strings is the key of a pair in olddept.
626: The output of $sort$ is 12875 bytes long,
627: and $olddept.T$ is 11776 bytes long.
628: .PP
629: The program in the appendix provides a simple facility for using
630: $dept$ and $olddept$.
631: If the user types a number, all the new department numbers
632: which have that number
633: as a prefix are printed,
634: together with their department names.
635: If the user types a number preceded with the letter O,
636: then the program computes the set of old department numbers which
637: begin with the typed number,
638: and prints a set of lines consisting of the old and new
639: department numbers and the department title.
640: .NH
641: Conclusion
642: .PP
643: The subroutines and commands described in this memo can be used
644: to build and maintain data bases of moderate size.
645: The subroutines give a clean interface to C programs,
646: and the files themselves are efficient to use and
647: reasonably economical of space.
648: .SG
649: .SH
650: Bibliography
651: .LP
652: D. Comer,
653: .I
654: The Ubiquitous B-Tree,
655: .R
656: Computing Surveys
657: .B (11)
658: June 1979,
659: pp 121-137.
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.