|
|
1.1 ! root 1: .fp 1 PA ! 2: .fp 2 PI ! 3: .fp 3 PB ! 4: .fp 4 PX ! 5: .EQ ! 6: delim $$ ! 7: .EN ! 8: .ND January 2, 1981 ! 9: .TM 81-11272-1 11173 11173-11 ! 10: .TL ! 11: .UX ! 12: B-trees ! 13: .AU "MH 2C522" 7214 ! 14: P.J. Weinberger ! 15: .AI ! 16: .MH ! 17: .AB ! 18: B-trees are a good data structure for storing, retrieving, ! 19: and updating records in files containing many records. ! 20: Ordinary ! 21: .UX ! 22: tools such as ! 23: $awk$ and $grep$ become too slow when used on files with more than ! 24: a few hundred lines, ! 25: and $ed$ can no longer be used for updating after a thousand or so lines. ! 26: This memo contains descriptions of subroutines and commands ! 27: for creating and maintaining B-trees, ! 28: and a complete example of their use in a data base with two files. ! 29: .AE ! 30: .CS 5 2 7 0 0 1 ! 31: .NH ! 32: Introduction ! 33: .PP ! 34: This memo contains a description of subroutines and commands ! 35: which implement B-trees. ! 36: (See [Comer] for a bibliography on B-trees.) ! 37: The subroutines make the B-trees look to the user like a list ! 38: of ! 39: .I "key\(emvalue" ! 40: pairs, ! 41: sorted by ! 42: .I key . ! 43: The subroutines provide for positioning within the list, ! 44: reading the next pair, deleting pairs, and writing pairs. ! 45: The way the subroutines implement the list is by using ! 46: a prefix-compressed B-tree, ! 47: in which prefixes common to consecutive ! 48: .I keys ! 49: are factored out. ! 50: The details of the data structures are in sections 2 and 5. ! 51: .PP ! 52: There are several advantages to using these B-trees as the file ! 53: organization for data bases. ! 54: A given key can be searched for with a small number of file system ! 55: accesses, typically two or three. ! 56: Since the file looks to the user as if it is sorted on the keys, ! 57: it is easy to retrieve the lexically next key, and ! 58: so to scan the file in lexical order. ! 59: Because the keys are sorted, ! 60: adjacent keys tend to have a common prefix, ! 61: which can be compressed out. ! 62: In the case of ! 63: the dictionary ! 64: .I /usr/dict/words , ! 65: where the ! 66: .I values ! 67: are all null, ! 68: the original file consists of 24473 words and 201032 bytes, ! 69: while the B-tree contains only 134144 bytes. ! 70: The B-tree is significantly shorter even though it contains ! 71: headers and other overhead. ! 72: .PP ! 73: Sections 3 and 4 contain descriptions of subroutines and commands. ! 74: .PP ! 75: There are 9 subroutines. ! 76: 4 of these correspond to ! 77: .UX ! 78: system calls $read$, $write$, $open$, $close$, ! 79: with the difference that the data stored in the B-trees is not just ! 80: a vector of bytes, but is a sequence of $key\(emvalue$ pairs. ! 81: There is one routine for deleting, ! 82: which has no corresponding system call. ! 83: The remaining 4 correspond to various aspects of $lseek$: ! 84: one positions, one reports on the position, one positions to the ! 85: beginning, and the last gives the length of the $value$ at the ! 86: current position. ! 87: .PP ! 88: The commands are utilities for building B-trees, ! 89: dealing with them at the command level, ! 90: and gathering statistics about them. ! 91: .PP ! 92: Section 6 contains a complete example constructing and using ! 93: a small but interesting data base. ! 94: .NH ! 95: B-trees ! 96: .PP ! 97: The B-trees used by the subroutines are slightly different from ! 98: standard B-trees. ! 99: The resemblance would be closer if the routines did not support the ! 100: .I value ! 101: half of the pair. ! 102: Then the ! 103: .UX ! 104: B-trees would be B-trees in which all the keys are in leaves, ! 105: and in which the keys can be of variable length, and are stored ! 106: with some compression. ! 107: Adding the ! 108: .I value ! 109: half of a pair allows very large records to be stored. ! 110: Further each record can be stored contiguously, ! 111: rather than broken across tree nodes. ! 112: .PP ! 113: The standard definition of B-tree requires some bounds on the number ! 114: of keys in a node, ! 115: as in 2-3 trees, where each node has 2 or 3 keys. ! 116: While this restriction is convenient for proving theorems ! 117: it is silly in practice. ! 118: Intuitively, trees which branch a lot at each node ! 119: have more leaves for a given height, ! 120: so that searching from the root to a leaf requires fewer file ! 121: system accesses the bushier the tree. ! 122: Therefore one wants as many keys as possible in each node, ! 123: all other things being equal. ! 124: The node size has been chosen to be the disk block size. ! 125: Given that, ! 126: various algorithms for data compression might be used to squeeze ! 127: as many keys as possible into a node. ! 128: .PP ! 129: The method I chose maintains the keys in lexical order in the node. ! 130: If a key begins with the same $n$ bytes as its immediate predecessor, ! 131: it is stored with the $n$ bytes replaced by the count $n$. ! 132: A trivial analysis shows how much space might be saved: ! 133: Suppose that the keys are chosen at random from strings of length $m$ ! 134: in which each of $a$ characters appears with equal probability. ! 135: Since there are no more than $a sup k$ prefixes of length $k$, ! 136: one expects that the length of the common prefix of two of $N$ ! 137: keys is about $log ( N) / log (a)$. ! 138: Since one byte is required to store the length of the common prefix, ! 139: the space saving should be around ! 140: $N( log N / log a - 1) $ bytes. ! 141: .PP ! 142: Using this compression technique does not make handling the keys ! 143: much more complicated. ! 144: The reader will have no trouble seeing that searching to find a ! 145: key is easy, that the size of a list of keys decreases when a key ! 146: is deleted, and that the size of a list of keys increases by no more ! 147: than the uncompressed size of the new key when a key is inserted. ! 148: .PP ! 149: The neighborhood of an interior node in a B-tree looks like ! 150: .KS ! 151: .nf ! 152: .so picture.out ! 153: .fi ! 154: .KE ! 155: Each node which is not a leaf contains one more pointer than key: ! 156: .EQ ! 157: p sub 0 , ~ k sub 0 , ~ p sub 1 , ~ k sub 1 , ~ ... , ! 158: ~ k sub n , ~ p sub n+1 . ! 159: .EN ! 160: Of the keys in the subtree whose root is the given node, ! 161: $p sub j$ points to the subtree containing keys strictly greater than ! 162: $k sub j-1$ and no greater than $k sub j$. ! 163: $p sub 0$ points to the subtree containing keys no greater than $k sub 0$, ! 164: and $p sub n+1$ points to the subtree containing keys strictly greater ! 165: than $k sub n$. ! 166: If $n$ is 0, then there is only one subtree, and $p sub 0$ points to it. ! 167: This last case can occur when the tree is being built by the $build$ ! 168: command, which may leave a single node nearly empty at each level of ! 169: the tree due to its zeal in making the rest of the nodes at that ! 170: level as full as possible. ! 171: More detail on file structure is in section 5. ! 172: .NH ! 173: Subroutines ! 174: .PP ! 175: There are 9 subroutines that C programs can use to access B-trees. ! 176: The normal order of events would be for a program to open one or more ! 177: B-trees using $bopen$. ! 178: To retrieve information the program would use $bseek$ to position ! 179: itself in the B-tree. ! 180: It could use $breclen$ to see how much it was about to read, ! 181: and then $bread$ to actually retrieve the data. ! 182: If it wanted to delete information it would use $bdelete$, ! 183: and to write new or changed information it would use $bwrite$. ! 184: .SH ! 185: /usr/include/cbt.h ! 186: .LP ! 187: contains definitions of some manifest constants and types used ! 188: by the subroutines. ! 189: Manifest constants are represented as CONSTANT. ! 190: A pointer of type ! 191: .I *bfile ! 192: is returned by the open routine and passed to each of the other ! 193: routines. ! 194: Data is passed in ! 195: an ! 196: .I mbuf : ! 197: .DS ! 198: .ft 8 ! 199: typedef struct { ! 200: char *mdata; ! 201: unsigned short mlen; ! 202: } mbuf; ! 203: .ft P ! 204: .DE ! 205: .I mdata ! 206: points to ! 207: .I mlen ! 208: bytes of data. ! 209: .PP ! 210: An open B-tree is either positioned at its end or at some pair. ! 211: When the file is opened, it is positioned at the first pair. ! 212: Reading the B-tree gets the pair at the current position. ! 213: Each subroutine has a deterministic effect on the current position. ! 214: .PP ! 215: B-trees can have any one of the 4 types INDEX, READONLY, both, ! 216: or neither. ! 217: The type is determined when the B-tree is created. ! 218: INDEX says that the ! 219: .I value ! 220: part of a pair will always be null; ! 221: i.e., ! 222: .I mlen ! 223: will be zero. ! 224: A B-tree should be READONLY if updates are infrequent and are never done ! 225: while someone else is reading in the B-tree. ! 226: .SH ! 227: bfile *bopen(b-tree-name, flag) ! 228: char *b-tree-name; ! 229: .LP ! 230: returns a pointer to an internal structure if the open is ! 231: successful, and NULL otherwise. ! 232: The file is positioned to its first pair. ! 233: .I flag ! 234: should be 0 if the file is only to be read, and 2 otherwise. ! 235: It is passed directly to the system ! 236: .I open ! 237: routine. ! 238: .SH ! 239: bclose(bf) bfile *bf; ! 240: .LP ! 241: closes the B-tree, flushing buffers and cleaning up in general. ! 242: If a process updates a B-tree which is not READONLY, ! 243: then it must call this routine before any other user can perceive ! 244: its updates. ! 245: .SH ! 246: bfirst(bf) bfile *bf; ! 247: .LP ! 248: resets the current position in the file to the beginning. ! 249: It returns EOF if the B-tree is empty. ! 250: .SH ! 251: bseek(bf, key) bfile *bf; mbuf key; ! 252: .LP ! 253: sets the current position in the file to the first key in the file ! 254: which is lexically greater than or equal to ! 255: .I key . ! 256: The routine returns FOUND if ! 257: .I key ! 258: was in the file, ! 259: EOF if ! 260: .I key ! 261: is greater than any key in the file, ! 262: and NOTFOUND otherwise. ! 263: Lexical ordering is that induced by the C type ! 264: .I char . ! 265: .SH ! 266: breclen(bf) bfile *bf; ! 267: .LP ! 268: returns the length of the ! 269: .I value ! 270: at the current position in the file, ! 271: or EOF if you are at the end of the file. ! 272: (On a machine where ! 273: $short$ and ! 274: .I int ! 275: are 16 bits long, ! 276: EOF will be a legal value for $mlen$, ! 277: name 65535. ! 278: If the user is on such a machine and writes objects of this length, ! 279: he must not rely on this subroutine to detect the end of the file.) ! 280: .SH ! 281: mbuf bkey(bf) bfile *bf; ! 282: .LP ! 283: returns the ! 284: .I key ! 285: at the current position of the file. ! 286: A key cannot have ! 287: .I mlen ! 288: of zero, so if the returned value does, ! 289: you are at the end of the file. ! 290: Do not change the data pointed to by ! 291: .I mdata ! 292: in the returned value, ! 293: as it is used by the subroutines. ! 294: .SH ! 295: bread(bf, key, rec) bfile *bf; mbuf *key, *rec; ! 296: .LP ! 297: returns 0 if successful and EOF otherwise. ! 298: If ! 299: .I key ! 300: is not NULL, ! 301: $ key->mlen $ ! 302: is set to the length of the key at the current position of the file, ! 303: and the buffer pointed to by ! 304: $ key->mdata $ ! 305: is filled with the key. ! 306: Likewise, if ! 307: .I rec ! 308: is not NULL, ! 309: $ rec->mlen $ is set to the length of the value at the current position ! 310: of the file, ! 311: and the buffer pointed to by ! 312: $rec->mdata $ ! 313: is filled with the value. ! 314: Then the current position in the file is moved to the next pair. ! 315: It is the user's responsibility to make the buffers large enough ! 316: to hold what is put in them. ! 317: $breclen$ gives the necessary size for values, ! 318: and no key can be as large as the defined constant NDSZ from ! 319: .I /usr/include/cbt.h . ! 320: .SH ! 321: bdelete(bf, key) bfile *bf; mbuf key; ! 322: .LP ! 323: deletes the pair with the given ! 324: .I key ! 325: from the file if it is present. ! 326: The routine returns EOF on error, ! 327: FOUND if a pair was deleted, ! 328: and NOTFOUND otherwise. ! 329: Afterwards, the current position in the file is as if ! 330: .I "bseek(bf, key)" ! 331: had been called. ! 332: .SH ! 333: bwrite(bf, key, value) bfile *bf; mbuf key, value; ! 334: .LP ! 335: writes the pair ! 336: .I "key value" ! 337: into the file. ! 338: It returns EOF on error, FOUND if a pair was overwritten, ! 339: and NOTFOUND if a new pair was written. ! 340: When the routine returns, ! 341: the position of the file is as if ! 342: .I "bseek(bf, key)" ! 343: had been called. ! 344: Keys should not be more than about 120 bytes long. ! 345: .SH ! 346: Error Returns ! 347: .PP ! 348: If an error causes a routine to return EOF, ! 349: the external integer ! 350: .I errno ! 351: is set to one of the codes defined in the header file. ! 352: BNOWRITE means you tried to write on a B-tree not opened for writing, ! 353: BIOWRT means that the system wrote fewer bytes than requested, ! 354: BNOMEM means that a subroutine needed more memory and couldn't get it, ! 355: and BTALL means that the tree has grown too tall. ! 356: .I errno ! 357: can also be set because one of the system calls failed. ! 358: Then it will have one of the values from ! 359: .I /usr/include/errno.h . ! 360: .NH ! 361: Commands ! 362: .PP ! 363: The commands are all of the form ! 364: .I "cbt command args" . ! 365: The file ! 366: .I cbt ! 367: is a shell script which invokes the actual commands. ! 368: The commands are utilites. ! 369: $cat$ just uses the subroutines of the last section, ! 370: but $creat$, $build$, and $report$ ! 371: require inside knowledge of data structures not available through ! 372: the B-tree subroutines. ! 373: This is all in the spirit of data abstraction. ! 374: The user, through the subroutines, gets certain services, ! 375: which are best provided through private data structures ! 376: not available to the user. ! 377: $build$ has another and more compelling justification; ! 378: it creates trees that require less space than if they were constructed ! 379: by a sequence of $bwrite$s. ! 380: .SH ! 381: cbt creat [flags] file ! 382: .LP ! 383: creates an empty B-tree. ! 384: In ! 385: .I flags , ! 386: .I -i ! 387: makes the file have type INDEX, ! 388: and ! 389: .I -r ! 390: makes the file have type READONLY. ! 391: The ! 392: .UX ! 393: file created is ! 394: .I file.T . ! 395: If the B-tree is not of type INDEX, ! 396: .I file.F ! 397: is created to hold the values. ! 398: .SH ! 399: cbt build [-R] file ! 400: .LP ! 401: builds the B-tree ! 402: .I file ! 403: from the standard input. ! 404: The input data must be sorted by key. ! 405: The new file has each node as full as possible, ! 406: given that the program decides what to do with each key ! 407: before looking at the next. ! 408: .PP ! 409: If the optional argument is not present ! 410: each line of input contains a key and value separated by a tab. ! 411: All the bytes to the first tab on a line make up the key ! 412: (which must not be empty), ! 413: the tab is removed, ! 414: and the remaining bytes, not including the trailing newline, ! 415: make up the value. ! 416: Thus there is no way, without the optional argument to the command, to ! 417: get a tab or newline in a key, or a newline in a value. ! 418: .PP ! 419: If the optional argument is present, ! 420: the command expects a sequence of data of the form ! 421: .DS ! 422: .ft 8 ! 423: struct { ! 424: struct ! 425: unsigned short klen; ! 426: char kdata[klen]; ! 427: } key; ! 428: struct { ! 429: unsigned short vlen; ! 430: char vdata[vlen]; ! 431: } value; ! 432: }; ! 433: .ft P ! 434: .DE ! 435: The dictionary mentioned in the introduction was constructed by ! 436: .DS ! 437: .ft 8 ! 438: cbt creat -i foo; sort /usr/dict/words | cbt build foo ! 439: .ft P ! 440: .DE ! 441: .SH ! 442: cbt cat [flags] files ! 443: .LP ! 444: writes the contents of the files on its standard output. ! 445: If the ! 446: .I R ! 447: flag is present the output is suitable for ! 448: .I R ! 449: input to ! 450: .I build, ! 451: otherwise the output is a sequence of `lines' of the form ! 452: key, tab, value, newline. ! 453: Thus the command ! 454: .DS ! 455: .ft 8 ! 456: cbt cat -R file | cbt build -R newfile ! 457: .ft P ! 458: .DE ! 459: will compress the old file into the new file. ! 460: .SH ! 461: cbt report file ! 462: .LP ! 463: scans the file and reports on the number of pairs, ! 464: the total size, the amount of the tree in use, and ! 465: the amount of free space. ! 466: .NH ! 467: File Structure ! 468: .PP ! 469: The ! 470: .I .F ! 471: file contains the values concatenated together in the order they ! 472: were written. ! 473: Each time a pair is written, the new value is put on the end of ! 474: the ! 475: .I .F ! 476: file. ! 477: Deleting a pair has no effect on the ! 478: .I .F ! 479: file. ! 480: Therefore as a B-tree is updated, ! 481: the ! 482: .I .F ! 483: file grows. ! 484: .PP ! 485: The ! 486: .I .T ! 487: file contains the B-tree proper, ! 488: and pointers to the ! 489: .I .F ! 490: file if the B-tree is not an INDEX. ! 491: The ! 492: .I .T ! 493: contains only tree nodes, ! 494: each one NDSZ bytes long. ! 495: (NDSZ is defined in the header file.) ! 496: Each node has the form ! 497: .DS ! 498: header, keys, space, addresses, trailer. ! 499: .DE ! 500: The trailer is a $short$ containing the number of unused bytes in the ! 501: node, ! 502: which is the size of $space$. ! 503: .PP ! 504: The form of an $address$ depends on the type of the file ! 505: and the height of the node in the tree. ! 506: Leaves have height 0. ! 507: Each other node has height one greater than its children. ! 508: B-trees have the property that each node has a well-defined ! 509: height. ! 510: If the node is not a leaf, each $address$ is a $short$ ! 511: which is the location of a child of the node. ! 512: The node pointed to by $address$ is the node starting at $address * NDSZ$ ! 513: in the file. ! 514: Leaves in files of type INDEX don't have addresses. ! 515: Otherwise an $address$ is a structure containing the length of the ! 516: value and its starting position in the ! 517: .I .F ! 518: file. ! 519: .PP ! 520: Each $key$ structure contains its total length in a ! 521: one-byte field, then the number of bytes of prefix ! 522: in common with the preceding key in the node, and then the rest of the key. ! 523: .PP ! 524: The $header$ contains an identifier so that a program ! 525: can tell if it rewrote the node, ! 526: the number of keys in the node, ! 527: the type of the node, ! 528: and the height of the node. ! 529: The first node in a file is always the root. ! 530: The identifier is made from the process id and the time. ! 531: If a file is not READONLY the routines will not overwrite a node unless the ! 532: node was originally written by the process, ! 533: or unless the file is being closed and the node is the root. ! 534: In this way a file which is not READONLY supports one writer concurrently with ! 535: programs reading. ! 536: .NH ! 537: An example ! 538: .PP ! 539: The example is a data base with two files. ! 540: With it the user can translate from old department numbers ! 541: to new department numbers, ! 542: and get the title of a department from either department number. ! 543: The example illustrates how to construct the data base ! 544: and suggests one way of contstructing an index. ! 545: There is also a complete program for retrieving from the data base ! 546: whcih illustrates using all the subroutines which don't change ! 547: B-trees. ! 548: .PP ! 549: A file named $deptnos$ contains old department numbers, ! 550: new department numbers, ! 551: and department titles. ! 552: It is 1124 lines long and contains 56816 bytes. ! 553: Here are the first 4 and last 4 lines: ! 554: .DS ! 555: 0001 00001 Chairman of the Board ! 556: 0002 20000 Executive Vice President - Corporate Studies ! 557: 0003 30000 President/Project Planning ! 558: 0004 40000 Executive Vice President - Customer Systems ! 559: 9543 59543 Demand Economics Department ! 560: 9544 59544 Computer Systems Department ! 561: 9545 59545 Advanced Software Department ! 562: 9800 59800 Transfer of Non-Incremental Cost Between BIS and AT&T ! 563: .DE ! 564: For the example the file will go into 2 B-trees. ! 565: One named $dept$ will have the new department numbers as keys ! 566: and the department titles as values. ! 567: The other, ! 568: $olddept$, ! 569: will be an index which will give new department numbers as a ! 570: function of old. ! 571: By combining the two, a user will be able to determine ! 572: the department title from the old department number. ! 573: In both, ! 574: leading zeros will be stripped off department numbers. ! 575: The commands ! 576: .I "cbt creat dept" ! 577: and ! 578: .I "cbt creat -i olddept" ! 579: create empty B-trees. ! 580: The bold (but correct) command ! 581: .EQ ! 582: delim off ! 583: .EN ! 584: .DS ! 585: .ft 8 ! 586: awk < deptnos '{$1 = ""; print}' | ! 587: sed 's/^ // ! 588: s/^0*// ! 589: s/ /\et/' | sort | cbt build dept ! 590: .ft P ! 591: .DE ! 592: .EQ ! 593: delim $$ ! 594: .EN ! 595: builds $dept$. ! 596: The $awk$ command deletes the old department number and ! 597: replaces multiple blanks by a single blank. ! 598: The $sed$ command removes the leading blank ! 599: (which is the separator $awk$ put after the null first field) ! 600: and replaces the single blank separating the new ! 601: department number from the department title with a tab for ! 602: .I "cbt build" . ! 603: $sort$ sees 48089 bytes. ! 604: $dept.T$ contains 13284 bytes, ! 605: and $dept.F$ contains 39955 bytes. ! 606: Thus the additional space required for the B-tree is(39955+13284)/48089, ! 607: which is 1.118, about 12%. ! 608: The command ! 609: .EQ ! 610: delim off ! 611: .EN ! 612: .DS ! 613: .ft 8 ! 614: awk < deptnos '{print $1, $2}' | ! 615: sed 's/^0*// ! 616: s/ 0*/ /' | sort | cbt build olddept ! 617: .ft P ! 618: .DE ! 619: .EQ ! 620: delim $$ ! 621: .EN ! 622: builds $olddept$. ! 623: The $awk $ command prints the old and new department numbers, ! 624: separated by a space. ! 625: Each of the strings is the key of a pair in olddept. ! 626: The output of $sort$ is 12875 bytes long, ! 627: and $olddept.T$ is 11776 bytes long. ! 628: .PP ! 629: The program in the appendix provides a simple facility for using ! 630: $dept$ and $olddept$. ! 631: If the user types a number, all the new department numbers ! 632: which have that number ! 633: as a prefix are printed, ! 634: together with their department names. ! 635: If the user types a number preceded with the letter O, ! 636: then the program computes the set of old department numbers which ! 637: begin with the typed number, ! 638: and prints a set of lines consisting of the old and new ! 639: department numbers and the department title. ! 640: .NH ! 641: Conclusion ! 642: .PP ! 643: The subroutines and commands described in this memo can be used ! 644: to build and maintain data bases of moderate size. ! 645: The subroutines give a clean interface to C programs, ! 646: and the files themselves are efficient to use and ! 647: reasonably economical of space. ! 648: .SG ! 649: .SH ! 650: Bibliography ! 651: .LP ! 652: D. Comer, ! 653: .I ! 654: The Ubiquitous B-Tree, ! 655: .R ! 656: Computing Surveys ! 657: .B (11) ! 658: June 1979, ! 659: pp 121-137.
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.