Annotation of researchv10no/cmd/btree/memo, revision 1.1.1.1

1.1       root        1: .fp 1 PA
                      2: .fp 2 PI
                      3: .fp 3 PB
                      4: .fp 4 PX
                      5: .EQ
                      6: delim $$
                      7: .EN
                      8: .ND January 2, 1981
                      9: .TM 81-11272-1 11173 11173-11
                     10: .TL
                     11: .UX
                     12: B-trees
                     13: .AU "MH 2C522" 7214
                     14: P.J. Weinberger
                     15: .AI
                     16: .MH
                     17: .AB
                     18: B-trees are a good data structure for storing, retrieving,
                     19: and updating records in files containing many records.
                     20: Ordinary
                     21: .UX
                     22: tools such as
                     23: $awk$ and $grep$ become too slow when used on files with more than
                     24: a few hundred lines,
                     25: and $ed$ can no longer be used for updating after a thousand or so lines.
                     26: This memo contains descriptions of subroutines and commands
                     27: for creating and maintaining B-trees,
                     28: and a complete example of their use in a data base with two files.
                     29: .AE
                     30: .CS 5 2 7 0 0 1
                     31: .NH
                     32: Introduction
                     33: .PP
                     34: This memo contains a description of subroutines and commands
                     35: which implement B-trees.
                     36: (See [Comer] for a bibliography on B-trees.)
                     37: The subroutines make the B-trees look to the user like a list
                     38: of
                     39: .I "key\(emvalue"
                     40: pairs,
                     41: sorted by
                     42: .I key .
                     43: The subroutines provide for positioning within the list,
                     44: reading the next pair, deleting pairs, and writing pairs.
                     45: The way the subroutines implement the list is by using
                     46: a prefix-compressed B-tree,
                     47: in which prefixes common to consecutive
                     48: .I keys
                     49: are factored out.
                     50: The details of the data structures are in sections 2 and 5.
                     51: .PP
                     52: There are several advantages to using these B-trees as the file
                     53: organization for data bases.
                     54: A given key can be searched for with a small number of file system
                     55: accesses, typically two or three.
                     56: Since the file looks to the user as if it is sorted on the keys,
                     57: it is easy to retrieve the lexically next key, and
                     58: so to scan the file in lexical order.
                     59: Because the keys are sorted,
                     60: adjacent keys tend to have a common prefix,
                     61: which can be compressed out.
                     62: In the case of
                     63: the dictionary
                     64: .I /usr/dict/words ,
                     65: where the
                     66: .I values
                     67: are all null,
                     68: the original file consists of 24473 words and 201032 bytes,
                     69: while the B-tree contains only 134144 bytes.
                     70: The B-tree is significantly shorter even though it contains
                     71: headers and other overhead.
                     72: .PP
                     73: Sections 3 and 4 contain descriptions of subroutines and commands.
                     74: .PP
                     75: There are 9 subroutines.
                     76: 4 of these correspond to
                     77: .UX
                     78: system calls $read$, $write$, $open$, $close$,
                     79: with the difference that the data stored in the B-trees is not just
                     80: a vector of bytes, but is a sequence of $key\(emvalue$ pairs.
                     81: There is one routine for deleting,
                     82: which has no corresponding system call.
                     83: The remaining 4 correspond to various aspects of $lseek$:
                     84: one positions, one reports on the position, one positions to the
                     85: beginning, and the last gives the length of the $value$ at the
                     86: current position.
                     87: .PP
                     88: The commands are utilities for building B-trees,
                     89: dealing with them at the command level,
                     90: and gathering statistics about them.
                     91: .PP
                     92: Section 6 contains a complete example constructing and using
                     93: a small but interesting data base.
                     94: .NH
                     95: B-trees
                     96: .PP
                     97: The B-trees used by the subroutines are slightly different from
                     98: standard B-trees.
                     99: The resemblance would be closer if the routines did not support the
                    100: .I value
                    101: half of the pair.
                    102: Then the
                    103: .UX
                    104: B-trees would be B-trees in which all the keys are in leaves,
                    105: and in which the keys can be of variable length, and are stored
                    106: with some compression.
                    107: Adding the
                    108: .I value
                    109: half of a pair allows very large records to be stored.
                    110: Further each record can be stored contiguously,
                    111: rather than broken across tree nodes.
                    112: .PP
                    113: The standard definition of B-tree requires some bounds on the number
                    114: of keys in a node,
                    115: as in 2-3 trees, where each node has 2 or 3 keys.
                    116: While this restriction is convenient for proving theorems
                    117: it is silly in practice.
                    118: Intuitively, trees which branch a lot at each node
                    119: have more leaves for a given height,
                    120: so that searching from the root to a leaf requires fewer file
                    121: system accesses the bushier the tree.
                    122: Therefore one wants as many keys as possible in each node,
                    123: all other things being equal.
                    124: The node size has been chosen to be the disk block size.
                    125: Given that,
                    126: various algorithms for data compression might be used to squeeze
                    127: as many keys as possible into a node.
                    128: .PP
                    129: The method I chose maintains the keys in lexical order in the node.
                    130: If a key begins with the same $n$ bytes as its immediate predecessor,
                    131: it is stored with the $n$ bytes replaced by the count $n$.
                    132: A trivial analysis shows how much space might be saved:
                    133: Suppose that the keys are chosen at random from strings of length $m$
                    134: in which each of $a$ characters appears with equal probability.
                    135: Since there are no more than $a sup k$ prefixes of length $k$,
                    136: one expects that the length of the common prefix of two of $N$
                    137: keys is about $log ( N) / log (a)$.
                    138: Since one byte is required to store the length of the common prefix,
                    139: the space saving should be around
                    140: $N( log N / log a - 1) $ bytes.
                    141: .PP
                    142: Using this compression technique does not make handling the keys
                    143: much more complicated.
                    144: The reader will have no trouble seeing that searching to find a
                    145: key is easy, that the size of a list of keys decreases when a key
                    146: is deleted, and that the size of a list of keys increases by no more
                    147: than the uncompressed size of the new key when a key is inserted.
                    148: .PP
                    149: The neighborhood of an interior node in a B-tree looks like
                    150: .KS
                    151: .nf
                    152: .so picture.out
                    153: .fi
                    154: .KE
                    155: Each node which is not a leaf contains one more pointer than key:
                    156: .EQ
                    157: p sub 0 , ~ k sub 0 , ~  p sub 1 , ~  k sub 1 , ~  ... ,
                    158: ~ k sub n , ~  p sub n+1 .
                    159: .EN
                    160: Of the keys in the subtree whose root is the given node,
                    161: $p sub j$ points to the subtree containing keys strictly greater than
                    162: $k sub j-1$ and no greater than $k sub j$.
                    163: $p sub 0$ points to the subtree containing keys no greater than $k sub 0$,
                    164: and $p sub n+1$ points to the subtree containing keys strictly greater
                    165: than $k sub n$.
                    166: If $n$ is 0, then there is only one subtree, and $p sub 0$ points to it.
                    167: This last case can occur when the tree is being built by the $build$
                    168: command, which may leave a single node nearly empty at each level of
                    169: the tree due to its zeal in making the rest of the nodes at that
                    170: level as full as possible.
                    171: More detail on file structure is in section 5.
                    172: .NH
                    173: Subroutines
                    174: .PP
                    175: There are 9 subroutines that C programs can use to access B-trees.
                    176: The normal order of events would be for a program to open one or more
                    177: B-trees using $bopen$.
                    178: To retrieve information the program would use $bseek$ to position
                    179: itself in the B-tree.
                    180: It could use $breclen$ to see how much it was about to read,
                    181: and then $bread$ to actually retrieve the data.
                    182: If it wanted to delete information it would use $bdelete$,
                    183: and to write new or changed information it would use $bwrite$.
                    184: .SH
                    185: /usr/include/cbt.h
                    186: .LP
                    187: contains definitions of some manifest constants and types used
                    188: by the subroutines.
                    189: Manifest constants are represented as CONSTANT.
                    190: A pointer of type
                    191: .I *bfile
                    192: is returned by the open routine and passed to each of the other
                    193: routines.
                    194: Data is passed in
                    195: an
                    196: .I mbuf :
                    197: .DS
                    198: .ft 8
                    199: typedef struct {
                    200:        char            *mdata;
                    201:        unsigned short  mlen;
                    202: } mbuf;
                    203: .ft P
                    204: .DE
                    205: .I mdata
                    206: points to
                    207: .I mlen
                    208: bytes of data.
                    209: .PP
                    210: An open B-tree is either positioned at its end or at some pair.
                    211: When the file is opened, it is positioned at the first pair.
                    212: Reading the B-tree gets the pair at the current position.
                    213: Each subroutine has a deterministic effect on the current position.
                    214: .PP
                    215: B-trees can have any one of the 4 types INDEX, READONLY, both,
                    216: or neither.
                    217: The type is determined when the B-tree is created.
                    218: INDEX says that the
                    219: .I value
                    220: part of a pair will always be null;
                    221: i.e.,
                    222: .I mlen
                    223: will be zero.
                    224: A B-tree should be READONLY if updates are infrequent and are never done
                    225: while someone else is reading in the B-tree.
                    226: .SH
                    227: bfile *bopen(b-tree-name, flag)
                    228: char *b-tree-name;
                    229: .LP
                    230: returns a pointer to an internal structure if the open is
                    231: successful, and NULL otherwise.
                    232: The file is positioned to its first pair.
                    233: .I flag
                    234: should be 0 if the file is only to be read, and 2 otherwise.
                    235: It is passed directly to the system
                    236: .I open
                    237: routine.
                    238: .SH
                    239: bclose(bf) bfile *bf;
                    240: .LP
                    241: closes the B-tree, flushing buffers and cleaning up in general.
                    242: If a process updates a B-tree which is not READONLY,
                    243: then it must call this routine before any other user can perceive
                    244: its updates.
                    245: .SH
                    246: bfirst(bf) bfile *bf;
                    247: .LP
                    248: resets the current position in the file to the beginning.
                    249: It returns EOF if the B-tree is empty.
                    250: .SH
                    251: bseek(bf, key) bfile *bf; mbuf key;
                    252: .LP
                    253: sets the current position in the file to the first key in the file
                    254: which is lexically greater than or equal to
                    255: .I key .
                    256: The routine returns FOUND if
                    257: .I key
                    258: was in the file,
                    259: EOF if
                    260: .I key
                    261: is greater than any key in the file,
                    262: and NOTFOUND otherwise.
                    263: Lexical ordering is that induced by the C type
                    264: .I char .
                    265: .SH
                    266: breclen(bf) bfile *bf;
                    267: .LP
                    268: returns the length of the
                    269: .I value
                    270: at the current position in the file,
                    271: or EOF if you are at the end of the file.
                    272: (On a machine where
                    273: $short$ and
                    274: .I int
                    275: are 16 bits long,
                    276: EOF will be a legal value for $mlen$,
                    277: name 65535.
                    278: If the user is on such a machine and writes objects of this length,
                    279: he must not rely on this subroutine to detect the end of the file.)
                    280: .SH
                    281: mbuf bkey(bf) bfile *bf;
                    282: .LP
                    283: returns the
                    284: .I key
                    285: at the current position of the file.
                    286: A key cannot have
                    287: .I mlen
                    288: of zero, so if the returned value does,
                    289: you are at the end of the file.
                    290: Do not change the data pointed to by
                    291: .I mdata
                    292: in the returned value,
                    293: as it is used by the subroutines.
                    294: .SH
                    295: bread(bf, key, rec) bfile *bf; mbuf *key, *rec;
                    296: .LP
                    297: returns 0 if successful and EOF otherwise.
                    298: If
                    299: .I key
                    300: is not NULL,
                    301: $ key->mlen $
                    302: is set to the length of the key at the current position of the file,
                    303: and the buffer pointed to by
                    304: $ key->mdata $
                    305: is filled with the key.
                    306: Likewise, if
                    307: .I rec
                    308: is not NULL,
                    309: $ rec->mlen $ is set to the length of the value at the current position
                    310: of the file,
                    311: and the buffer pointed to by
                    312: $rec->mdata $
                    313: is filled with the value.
                    314: Then the current position in the file is moved to the next pair.
                    315: It is the user's responsibility to make the buffers large enough
                    316: to hold what is put in them.
                    317: $breclen$ gives the necessary size for values,
                    318: and no key can be as large as the defined constant NDSZ from
                    319: .I /usr/include/cbt.h .
                    320: .SH
                    321: bdelete(bf, key) bfile *bf; mbuf key;
                    322: .LP
                    323: deletes the pair with the given
                    324: .I key
                    325: from the file if it is present.
                    326: The routine returns EOF on error,
                    327: FOUND if a pair was deleted,
                    328: and NOTFOUND otherwise.
                    329: Afterwards, the current position in the file is as if
                    330: .I "bseek(bf, key)"
                    331: had been called.
                    332: .SH
                    333: bwrite(bf, key, value) bfile *bf; mbuf key, value;
                    334: .LP
                    335: writes the pair
                    336: .I "key value"
                    337: into the file.
                    338: It returns EOF on error, FOUND if a pair was overwritten,
                    339: and NOTFOUND if a new pair was written.
                    340: When the routine returns,
                    341: the position of the file is as if
                    342: .I "bseek(bf, key)"
                    343: had been called.
                    344: Keys should not be more than about 120 bytes long.
                    345: .SH
                    346: Error Returns
                    347: .PP
                    348: If an error causes a routine to return EOF,
                    349: the external integer
                    350: .I errno
                    351: is set to one of the codes defined in the header file.
                    352: BNOWRITE means you tried to write on a B-tree not opened for writing,
                    353: BIOWRT means that the system wrote fewer bytes than requested,
                    354: BNOMEM means that a subroutine needed more memory and couldn't get it,
                    355: and BTALL means that the tree has grown too tall.
                    356: .I errno
                    357: can also be set because one of the system calls failed.
                    358: Then it will have one of the values from
                    359: .I /usr/include/errno.h .
                    360: .NH
                    361: Commands
                    362: .PP
                    363: The commands are all of the form
                    364: .I "cbt command args" .
                    365: The file
                    366: .I cbt
                    367: is a shell script which invokes the actual commands.
                    368: The commands are utilites.
                    369: $cat$ just uses the subroutines of the last section,
                    370: but $creat$, $build$, and $report$
                    371: require inside knowledge of data structures not available through
                    372: the B-tree subroutines.
                    373: This is all in the spirit of data abstraction.
                    374: The user, through the subroutines, gets certain services,
                    375: which are best provided through private data structures
                    376: not available to the user.
                    377: $build$ has another and more compelling justification;
                    378: it creates trees that require less space than if they were constructed
                    379: by a sequence of $bwrite$s.
                    380: .SH
                    381: cbt creat [flags] file
                    382: .LP
                    383: creates an empty B-tree.
                    384: In
                    385: .I flags ,
                    386: .I -i
                    387: makes the file have type INDEX,
                    388: and
                    389: .I -r
                    390: makes the file have type READONLY.
                    391: The
                    392: .UX
                    393: file created is
                    394: .I file.T .
                    395: If the B-tree is not of type INDEX,
                    396: .I file.F
                    397: is created to hold the values.
                    398: .SH
                    399: cbt build [-R] file
                    400: .LP
                    401: builds the B-tree
                    402: .I file
                    403: from the standard input.
                    404: The input data must be sorted by key.
                    405: The new file has each node as full as possible,
                    406: given that the program decides what to do with each key
                    407: before looking at the next.
                    408: .PP
                    409: If the optional argument is not present
                    410: each line of input contains a key and value separated by a tab.
                    411: All the bytes to the first tab on a line make up the key
                    412: (which must not be empty),
                    413: the tab is removed,
                    414: and the remaining bytes, not including the trailing newline,
                    415: make up the value.
                    416: Thus there is no way, without the optional argument to the command, to
                    417: get a tab or newline in a key, or a newline in a value.
                    418: .PP
                    419: If the optional argument is present,
                    420: the command expects a sequence of data of the form
                    421: .DS
                    422: .ft 8
                    423: struct {
                    424:        struct
                    425:                unsigned short klen;
                    426:                char kdata[klen];
                    427:        } key;
                    428:        struct {
                    429:                unsigned short vlen;
                    430:                char vdata[vlen];
                    431:        } value;
                    432: };
                    433: .ft P
                    434: .DE
                    435: The dictionary mentioned in the introduction was constructed by
                    436: .DS
                    437: .ft 8
                    438: cbt creat -i foo; sort /usr/dict/words | cbt build foo
                    439: .ft P
                    440: .DE
                    441: .SH
                    442: cbt cat [flags] files
                    443: .LP
                    444: writes the contents of the files on its standard output.
                    445: If the
                    446: .I R
                    447: flag is present the output is suitable for
                    448: .I R
                    449: input to
                    450: .I build,
                    451: otherwise the output is a sequence of `lines' of the form
                    452: key, tab, value, newline.
                    453: Thus the command
                    454: .DS
                    455: .ft 8
                    456: cbt cat -R file | cbt build -R newfile
                    457: .ft P
                    458: .DE
                    459: will compress the old file into the new file.
                    460: .SH
                    461: cbt report file
                    462: .LP
                    463: scans the file and reports on the number of pairs,
                    464: the total size, the amount of the tree in use, and
                    465: the amount of free space.
                    466: .NH
                    467: File Structure
                    468: .PP
                    469: The
                    470: .I .F
                    471: file contains the values concatenated together in the order they
                    472: were written.
                    473: Each time a pair is written, the new value is put on the end of
                    474: the
                    475: .I .F
                    476: file.
                    477: Deleting a pair has no effect on the
                    478: .I .F
                    479: file.
                    480: Therefore as a B-tree is updated,
                    481: the
                    482: .I .F
                    483: file grows.
                    484: .PP
                    485: The
                    486: .I .T
                    487: file contains the B-tree proper,
                    488: and pointers to the
                    489: .I .F
                    490: file if the B-tree is not an INDEX.
                    491: The
                    492: .I .T
                    493: contains only tree nodes,
                    494: each one NDSZ bytes long.
                    495: (NDSZ is defined in the header file.)
                    496: Each node has the form
                    497: .DS
                    498: header, keys, space, addresses, trailer.
                    499: .DE
                    500: The trailer is a $short$ containing the number of unused bytes in the
                    501: node,
                    502: which is the size of $space$.
                    503: .PP
                    504: The form of an $address$ depends on the type of the file
                    505: and the height of the node in the tree.
                    506: Leaves have height 0.
                    507: Each other node has height one greater than its children.
                    508: B-trees have the property that each node has a well-defined
                    509: height.
                    510: If the node is not a leaf, each $address$ is a $short$
                    511: which is the location of a child of the node.
                    512: The node pointed to by $address$ is the node starting at $address * NDSZ$
                    513: in the file.
                    514: Leaves in files of type INDEX don't have addresses.
                    515: Otherwise an $address$ is a structure containing the length of the
                    516: value and its starting position in the
                    517: .I .F
                    518: file.
                    519: .PP
                    520: Each $key$ structure contains its total length in a
                    521: one-byte field, then the number of bytes of prefix
                    522: in common with the preceding key in the node, and then the rest of the key.
                    523: .PP
                    524: The $header$ contains an identifier so that a program
                    525: can tell if it rewrote the node,
                    526: the number of keys in the node,
                    527: the type of the node,
                    528: and the height of the node.
                    529: The first node in a file is always the root.
                    530: The identifier is made from the process id and the time.
                    531: If a file is not READONLY the routines will not overwrite a node unless the
                    532: node was originally written by the process,
                    533: or unless the file is being closed and the node is the root.
                    534: In this way a file which is not READONLY supports one writer concurrently with
                    535: programs reading.
                    536: .NH
                    537: An example
                    538: .PP
                    539: The example is a data base with two files.
                    540: With it the user can translate from old department numbers
                    541: to new department numbers,
                    542: and get the title of a department from either department number.
                    543: The example illustrates how to construct the data base
                    544: and suggests one way of contstructing an index.
                    545: There is also a complete program for retrieving from the data base
                    546: whcih illustrates using all the subroutines which don't change
                    547: B-trees.
                    548: .PP
                    549: A file named $deptnos$ contains old department numbers,
                    550: new department numbers,
                    551: and department titles.
                    552: It is 1124 lines long and contains 56816 bytes.
                    553: Here are the first 4 and last 4 lines:
                    554: .DS
                    555: 0001  00001   Chairman of the Board
                    556: 0002  20000   Executive Vice President - Corporate Studies
                    557: 0003  30000   President/Project Planning
                    558: 0004  40000   Executive Vice President - Customer Systems
                    559: 9543  59543   Demand Economics Department
                    560: 9544  59544   Computer Systems Department
                    561: 9545  59545   Advanced Software Department
                    562: 9800  59800   Transfer of Non-Incremental Cost Between BIS and AT&T
                    563: .DE
                    564: For the example the file will go into 2 B-trees.
                    565: One named $dept$ will have the new department numbers as keys
                    566: and the department titles as values.
                    567: The other,
                    568: $olddept$,
                    569: will be an index which will give new department numbers as a
                    570: function of old.
                    571: By combining the two, a user will be able to determine
                    572: the department title from the old department number.
                    573: In both,
                    574: leading zeros will be stripped off department numbers.
                    575: The commands
                    576: .I "cbt creat dept"
                    577: and
                    578: .I "cbt creat -i olddept"
                    579: create empty B-trees.
                    580: The bold (but correct) command
                    581: .EQ
                    582: delim off
                    583: .EN
                    584: .DS
                    585: .ft 8
                    586: awk < deptnos '{$1 = ""; print}' |
                    587: sed    's/^ //
                    588:        s/^0*//
                    589:        s/ /\et/' | sort | cbt build dept
                    590: .ft P
                    591: .DE
                    592: .EQ
                    593: delim $$
                    594: .EN
                    595: builds $dept$.
                    596: The $awk$ command deletes the old department number and
                    597: replaces multiple blanks by a single blank.
                    598: The $sed$ command removes the leading blank
                    599: (which is the separator $awk$ put after the null first field)
                    600: and replaces the single blank separating the new
                    601: department number from the department title with a tab for
                    602: .I "cbt build" .
                    603: $sort$ sees 48089 bytes.
                    604: $dept.T$ contains 13284 bytes,
                    605: and $dept.F$ contains 39955 bytes.
                    606: Thus the additional space required for the B-tree is(39955+13284)/48089,
                    607: which is 1.118, about 12%.
                    608: The command
                    609: .EQ
                    610: delim off
                    611: .EN
                    612: .DS
                    613: .ft 8
                    614: awk < deptnos '{print $1, $2}' |
                    615: sed    's/^0*//
                    616:        s/ 0*/ /' | sort | cbt build olddept
                    617: .ft P
                    618: .DE
                    619: .EQ
                    620: delim $$
                    621: .EN
                    622: builds $olddept$.
                    623: The $awk $ command prints the old and new department numbers,
                    624: separated by a space.
                    625: Each of the strings is the key of a pair in olddept.
                    626: The output of $sort$ is 12875 bytes long,
                    627: and $olddept.T$ is 11776 bytes long.
                    628: .PP
                    629: The program in the appendix provides a simple facility for using
                    630: $dept$ and $olddept$.
                    631: If the user types a number, all the new department numbers
                    632: which have that number
                    633: as a prefix are printed,
                    634: together with their department names.
                    635: If the user types a number preceded with the letter O,
                    636: then the program computes the set of old department numbers which
                    637: begin with the typed number,
                    638: and prints a set of lines consisting of the old and new
                    639: department numbers and the department title.
                    640: .NH
                    641: Conclusion
                    642: .PP
                    643: The subroutines and commands described in this memo can be used
                    644: to build and maintain data bases of moderate size.
                    645: The subroutines give a clean interface to C programs,
                    646: and the files themselves are efficient to use and
                    647: reasonably economical of space.
                    648: .SG
                    649: .SH
                    650: Bibliography
                    651: .LP
                    652: D. Comer,
                    653: .I
                    654: The Ubiquitous B-Tree,
                    655: .R
                    656: Computing Surveys
                    657: .B (11)
                    658: June 1979,
                    659: pp 121-137.

unix.superglobalmegacorp.com

This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.