|
|
1.1 root 1: .\" Copyright (c) 1986 The Regents of the University of California.
2: .\" All rights reserved.
3: .\"
4: .\" Redistribution and use in source and binary forms are permitted
5: .\" provided that the above copyright notice and this paragraph are
6: .\" duplicated in all such forms and that any documentation,
7: .\" advertising materials, and other materials related to such
8: .\" distribution and use acknowledge that the software was developed
9: .\" by the University of California, Berkeley. The name of the
10: .\" University may not be used to endorse or promote products derived
11: .\" from this software without specific prior written permission.
12: .\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
13: .\" IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
14: .\" WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
15: .\"
16: .\" @(#)3.t 6.2 (Berkeley) 3/7/89
17: .\"
18: .ds RH New file system
19: .NH
20: New file system organization
21: .PP
22: In the new file system organization (as in the
23: old file system organization),
24: each disk drive contains one or more file systems.
25: A file system is described by its super-block,
26: located at the beginning of the file system's disk partition.
27: Because the super-block contains critical data,
28: it is replicated to protect against catastrophic loss.
29: This is done when the file system is created;
30: since the super-block data does not change,
31: the copies need not be referenced unless a head crash
32: or other hard disk error causes the default super-block
33: to be unusable.
34: .PP
35: To insure that it is possible to create files as large as
36: $2 sup 32$ bytes with only two levels of indirection,
37: the minimum size of a file system block is 4096 bytes.
38: The size of file system blocks can be any power of two
39: greater than or equal to 4096.
40: The block size of a file system is recorded in the
41: file system's super-block
42: so it is possible for file systems with different block sizes
43: to be simultaneously accessible on the same system.
44: The block size must be decided at the time that
45: the file system is created;
46: it cannot be subsequently changed without rebuilding the file system.
47: .PP
48: The new file system organization divides a disk partition
49: into one or more areas called
50: .I "cylinder groups".
51: A cylinder group is comprised of one or more consecutive
52: cylinders on a disk.
53: Associated with each cylinder group is some bookkeeping information
54: that includes a redundant copy of the super-block,
55: space for inodes,
56: a bit map describing available blocks in the cylinder group,
57: and summary information describing the usage of data blocks
58: within the cylinder group.
59: The bit map of available blocks in the cylinder group replaces
60: the traditional file system's free list.
61: For each cylinder group a static number of inodes
62: is allocated at file system creation time.
63: The default policy is to allocate one inode for each 2048
64: bytes of space in the cylinder group, expecting this
65: to be far more than will ever be needed.
66: .PP
67: All the cylinder group bookkeeping information could be
68: placed at the beginning of each cylinder group.
69: However if this approach were used,
70: all the redundant information would be on the top platter.
71: A single hardware failure that destroyed the top platter
72: could cause the loss of all redundant copies of the super-block.
73: Thus the cylinder group bookkeeping information
74: begins at a varying offset from the beginning of the cylinder group.
75: The offset for each successive cylinder group is calculated to be
76: about one track further from the beginning of the cylinder group
77: than the preceding cylinder group.
78: In this way the redundant
79: information spirals down into the pack so that any single track, cylinder,
80: or platter can be lost without losing all copies of the super-block.
81: Except for the first cylinder group,
82: the space between the beginning of the cylinder group
83: and the beginning of the cylinder group information
84: is used for data blocks.\(dg
85: .FS
86: \(dg While it appears that the first cylinder group could be laid
87: out with its super-block at the ``known'' location,
88: this would not work for file systems
89: with blocks sizes of 16 kilobytes or greater.
90: This is because of a requirement that the first 8 kilobytes of the disk
91: be reserved for a bootstrap program and a separate requirement that
92: the cylinder group information begin on a file system block boundary.
93: To start the cylinder group on a file system block boundary,
94: file systems with block sizes larger than 8 kilobytes
95: would have to leave an empty space between the end of
96: the boot block and the beginning of the cylinder group.
97: Without knowing the size of the file system blocks,
98: the system would not know what roundup function to use
99: to find the beginning of the first cylinder group.
100: .FE
101: .NH 2
102: Optimizing storage utilization
103: .PP
104: Data is laid out so that larger blocks can be transferred
105: in a single disk transaction, greatly increasing file system throughput.
106: As an example, consider a file in the new file system
107: composed of 4096 byte data blocks.
108: In the old file system this file would be composed of 1024 byte blocks.
109: By increasing the block size, disk accesses in the new file
110: system may transfer up to four times as much information per
111: disk transaction.
112: In large files, several
113: 4096 byte blocks may be allocated from the same cylinder so that
114: even larger data transfers are possible before requiring a seek.
115: .PP
116: The main problem with
117: larger blocks is that most UNIX
118: file systems are composed of many small files.
119: A uniformly large block size wastes space.
120: Table 1 shows the effect of file system
121: block size on the amount of wasted space in the file system.
122: The files measured to obtain these figures reside on
123: one of our time sharing
124: systems that has roughly 1.2 gigabytes of on-line storage.
125: The measurements are based on the active user file systems containing
126: about 920 megabytes of formatted space.
127: .KF
128: .DS B
129: .TS
130: box;
131: l|l|l
132: a|n|l.
133: Space used % waste Organization
134: _
135: 775.2 Mb 0.0 Data only, no separation between files
136: 807.8 Mb 4.2 Data only, each file starts on 512 byte boundary
137: 828.7 Mb 6.9 Data + inodes, 512 byte block UNIX file system
138: 866.5 Mb 11.8 Data + inodes, 1024 byte block UNIX file system
139: 948.5 Mb 22.4 Data + inodes, 2048 byte block UNIX file system
140: 1128.3 Mb 45.6 Data + inodes, 4096 byte block UNIX file system
141: .TE
142: Table 1 \- Amount of wasted space as a function of block size.
143: .DE
144: .KE
145: The space wasted is calculated to be the percentage of space
146: on the disk not containing user data.
147: As the block size on the disk
148: increases, the waste rises quickly, to an intolerable
149: 45.6% waste with 4096 byte file system blocks.
150: .PP
151: To be able to use large blocks without undue waste,
152: small files must be stored in a more efficient way.
153: The new file system accomplishes this goal by allowing the division
154: of a single file system block into one or more
155: .I "fragments".
156: The file system fragment size is specified
157: at the time that the file system is created;
158: each file system block can optionally be broken into
159: 2, 4, or 8 fragments, each of which is addressable.
160: The lower bound on the size of these fragments is constrained
161: by the disk sector size,
162: typically 512 bytes.
163: The block map associated with each cylinder group
164: records the space available in a cylinder group
165: at the fragment level;
166: to determine if a block is available, aligned fragments are examined.
167: Figure 1 shows a piece of a map from a 4096/1024 file system.
168: .KF
169: .DS B
170: .TS
171: box;
172: l|c c c c.
173: Bits in map XXXX XXOO OOXX OOOO
174: Fragment numbers 0-3 4-7 8-11 12-15
175: Block numbers 0 1 2 3
176: .TE
177: Figure 1 \- Example layout of blocks and fragments in a 4096/1024 file system.
178: .DE
179: .KE
180: Each bit in the map records the status of a fragment;
181: an ``X'' shows that the fragment is in use,
182: while a ``O'' shows that the fragment is available for allocation.
183: In this example,
184: fragments 0\-5, 10, and 11 are in use,
185: while fragments 6\-9, and 12\-15 are free.
186: Fragments of adjoining blocks cannot be used as a full block,
187: even if they are large enough.
188: In this example,
189: fragments 6\-9 cannot be allocated as a full block;
190: only fragments 12\-15 can be coalesced into a full block.
191: .PP
192: On a file system with a block size of 4096 bytes
193: and a fragment size of 1024 bytes,
194: a file is represented by zero or more 4096 byte blocks of data,
195: and possibly a single fragmented block.
196: If a file system block must be fragmented to obtain
197: space for a small amount of data,
198: the remaining fragments of the block are made
199: available for allocation to other files.
200: As an example consider an 11000 byte file stored on
201: a 4096/1024 byte file system.
202: This file would uses two full size blocks and one
203: three fragment portion of another block.
204: If no block with three aligned fragments is
205: available at the time the file is created,
206: a full size block is split yielding the necessary
207: fragments and a single unused fragment.
208: This remaining fragment can be allocated to another file as needed.
209: .PP
210: Space is allocated to a file when a program does a \fIwrite\fP
211: system call.
212: Each time data is written to a file, the system checks to see if
213: the size of the file has increased*.
214: .FS
215: * A program may be overwriting data in the middle of an existing file
216: in which case space would already have been allocated.
217: .FE
218: If the file needs to be expanded to hold the new data,
219: one of three conditions exists:
220: .IP 1)
221: There is enough space left in an already allocated
222: block or fragment to hold the new data.
223: The new data is written into the available space.
224: .IP 2)
225: The file contains no fragmented blocks (and the last
226: block in the file
227: contains insufficient space to hold the new data).
228: If space exists in a block already allocated,
229: the space is filled with new data.
230: If the remainder of the new data contains more than
231: a full block of data, a full block is allocated and
232: the first full block of new data is written there.
233: This process is repeated until less than a full block
234: of new data remains.
235: If the remaining new data to be written will
236: fit in less than a full block,
237: a block with the necessary fragments is located,
238: otherwise a full block is located.
239: The remaining new data is written into the located space.
240: .IP 3)
241: The file contains one or more fragments (and the
242: fragments contain insufficient space to hold the new data).
243: If the size of the new data plus the size of the data
244: already in the fragments exceeds the size of a full block,
245: a new block is allocated.
246: The contents of the fragments are copied
247: to the beginning of the block
248: and the remainder of the block is filled with new data.
249: The process then continues as in (2) above.
250: Otherwise, if the new data to be written will
251: fit in less than a full block,
252: a block with the necessary fragments is located,
253: otherwise a full block is located.
254: The contents of the existing fragments
255: appended with the new data
256: are written into the allocated space.
257: .PP
258: The problem with expanding a file one fragment at a
259: a time is that data may be copied many times as a
260: fragmented block expands to a full block.
261: Fragment reallocation can be minimized
262: if the user program writes a full block at a time,
263: except for a partial block at the end of the file.
264: Since file systems with different block sizes may reside on
265: the same system,
266: the file system interface has been extended to provide
267: application programs the optimal size for a read or write.
268: For files the optimal size is the block size of the file system
269: on which the file is being accessed.
270: For other objects, such as pipes and sockets,
271: the optimal size is the underlying buffer size.
272: This feature is used by the Standard
273: Input/Output Library,
274: a package used by most user programs.
275: This feature is also used by
276: certain system utilities such as archivers and loaders
277: that do their own input and output management
278: and need the highest possible file system bandwidth.
279: .PP
280: The amount of wasted space in the 4096/1024 byte new file system
281: organization is empirically observed to be about the same as in the
282: 1024 byte old file system organization.
283: A file system with 4096 byte blocks and 512 byte fragments
284: has about the same amount of wasted space as the 512 byte
285: block UNIX file system.
286: The new file system uses less space
287: than the 512 byte or 1024 byte
288: file systems for indexing information for
289: large files and the same amount of space
290: for small files.
291: These savings are offset by the need to use
292: more space for keeping track of available free blocks.
293: The net result is about the same disk utilization
294: when a new file system's fragment size
295: equals an old file system's block size.
296: .PP
297: In order for the layout policies to be effective,
298: a file system cannot be kept completely full.
299: For each file system there is a parameter, termed
300: the free space reserve, that
301: gives the minimum acceptable percentage of file system
302: blocks that should be free.
303: If the number of free blocks drops below this level
304: only the system administrator can continue to allocate blocks.
305: The value of this parameter may be changed at any time,
306: even when the file system is mounted and active.
307: The transfer rates that appear in section 4 were measured on file
308: systems kept less than 90% full (a reserve of 10%).
309: If the number of free blocks falls to zero,
310: the file system throughput tends to be cut in half,
311: because of the inability of the file system to localize
312: blocks in a file.
313: If a file system's performance degrades because
314: of overfilling, it may be restored by removing
315: files until the amount of free space once again
316: reaches the minimum acceptable level.
317: Access rates for files created during periods of little
318: free space may be restored by moving their data once enough
319: space is available.
320: The free space reserve must be added to the
321: percentage of waste when comparing the organizations given
322: in Table 1.
323: Thus, the percentage of waste in
324: an old 1024 byte UNIX file system is roughly
325: comparable to a new 4096/512 byte file system
326: with the free space reserve set at 5%.
327: (Compare 11.8% wasted with the old file system
328: to 6.9% waste + 5% reserved space in the
329: new file system.)
330: .NH 2
331: File system parameterization
332: .PP
333: Except for the initial creation of the free list,
334: the old file system ignores the parameters of the underlying hardware.
335: It has no information about either the physical characteristics
336: of the mass storage device,
337: or the hardware that interacts with it.
338: A goal of the new file system is to parameterize the
339: processor capabilities and
340: mass storage characteristics
341: so that blocks can be allocated in an
342: optimum configuration-dependent way.
343: Parameters used include the speed of the processor,
344: the hardware support for mass storage transfers,
345: and the characteristics of the mass storage devices.
346: Disk technology is constantly improving and
347: a given installation can have several different disk technologies
348: running on a single processor.
349: Each file system is parameterized so that it can be
350: adapted to the characteristics of the disk on which
351: it is placed.
352: .PP
353: For mass storage devices such as disks,
354: the new file system tries to allocate new blocks
355: on the same cylinder as the previous block in the same file.
356: Optimally, these new blocks will also be
357: rotationally well positioned.
358: The distance between ``rotationally optimal'' blocks varies greatly;
359: it can be a consecutive block
360: or a rotationally delayed block
361: depending on system characteristics.
362: On a processor with an input/output channel that does not require
363: any processor intervention between mass storage transfer requests,
364: two consecutive disk blocks can often be accessed
365: without suffering lost time because of an intervening disk revolution.
366: For processors without input/output channels,
367: the main processor must field an interrupt and
368: prepare for a new disk transfer.
369: The expected time to service this interrupt and
370: schedule a new disk transfer depends on the
371: speed of the main processor.
372: .PP
373: The physical characteristics of each disk include
374: the number of blocks per track and the rate at which
375: the disk spins.
376: The allocation routines use this information to calculate
377: the number of milliseconds required to skip over a block.
378: The characteristics of the processor include
379: the expected time to service an interrupt and schedule a
380: new disk transfer.
381: Given a block allocated to a file,
382: the allocation routines calculate the number of blocks to
383: skip over so that the next block in the file will
384: come into position under the disk head in the expected
385: amount of time that it takes to start a new
386: disk transfer operation.
387: For programs that sequentially access large amounts of data,
388: this strategy minimizes the amount of time spent waiting for
389: the disk to position itself.
390: .PP
391: To ease the calculation of finding rotationally optimal blocks,
392: the cylinder group summary information includes
393: a count of the available blocks in a cylinder
394: group at different rotational positions.
395: Eight rotational positions are distinguished,
396: so the resolution of the
397: summary information is 2 milliseconds for a typical 3600
398: revolution per minute drive.
399: The super-block contains a vector of lists called
400: .I "rotational layout tables".
401: The vector is indexed by rotational position.
402: Each component of the vector
403: lists the index into the block map for every data block contained
404: in its rotational position.
405: When looking for an allocatable block,
406: the system first looks through the summary counts for a rotational
407: position with a non-zero block count.
408: It then uses the index of the rotational position to find the appropriate
409: list to use to index through
410: only the relevant parts of the block map to find a free block.
411: .PP
412: The parameter that defines the
413: minimum number of milliseconds between the completion of a data
414: transfer and the initiation of
415: another data transfer on the same cylinder
416: can be changed at any time,
417: even when the file system is mounted and active.
418: If a file system is parameterized to lay out blocks with
419: a rotational separation of 2 milliseconds,
420: and the disk pack is then moved to a system that has a
421: processor requiring 4 milliseconds to schedule a disk operation,
422: the throughput will drop precipitously because of lost disk revolutions
423: on nearly every block.
424: If the eventual target machine is known,
425: the file system can be parameterized for it
426: even though it is initially created on a different processor.
427: Even if the move is not known in advance,
428: the rotational layout delay can be reconfigured after the disk is moved
429: so that all further allocation is done based on the
430: characteristics of the new host.
431: .NH 2
432: Layout policies
433: .PP
434: The file system layout policies are divided into two distinct parts.
435: At the top level are global policies that use file system
436: wide summary information to make decisions regarding
437: the placement of new inodes and data blocks.
438: These routines are responsible for deciding the
439: placement of new directories and files.
440: They also calculate rotationally optimal block layouts,
441: and decide when to force a long seek to a new cylinder group
442: because there are insufficient blocks left
443: in the current cylinder group to do reasonable layouts.
444: Below the global policy routines are
445: the local allocation routines that use a locally optimal scheme to
446: lay out data blocks.
447: .PP
448: Two methods for improving file system performance are to increase
449: the locality of reference to minimize seek latency
450: as described by [Trivedi80], and
451: to improve the layout of data to make larger transfers possible
452: as described by [Nevalainen77].
453: The global layout policies try to improve performance
454: by clustering related information.
455: They cannot attempt to localize all data references,
456: but must also try to spread unrelated data
457: among different cylinder groups.
458: If too much localization is attempted,
459: the local cylinder group may run out of space
460: forcing the data to be scattered to non-local cylinder groups.
461: Taken to an extreme,
462: total localization can result in a single huge cluster of data
463: resembling the old file system.
464: The global policies try to balance the two conflicting
465: goals of localizing data that is concurrently accessed
466: while spreading out unrelated data.
467: .PP
468: One allocatable resource is inodes.
469: Inodes are used to describe both files and directories.
470: Inodes of files in the same directory are frequently accessed together.
471: For example, the ``list directory'' command often accesses
472: the inode for each file in a directory.
473: The layout policy tries to place all the inodes of
474: files in a directory in the same cylinder group.
475: To ensure that files are distributed throughout the disk,
476: a different policy is used for directory allocation.
477: A new directory is placed in a cylinder group that has a greater
478: than average number of free inodes,
479: and the smallest number of directories already in it.
480: The intent of this policy is to allow the inode clustering policy
481: to succeed most of the time.
482: The allocation of inodes within a cylinder group is done using a
483: next free strategy.
484: Although this allocates the inodes randomly within a cylinder group,
485: all the inodes for a particular cylinder group can be read with
486: 8 to 16 disk transfers.
487: (At most 16 disk transfers are required because a cylinder
488: group may have no more than 2048 inodes.)
489: This puts a small and constant upper bound on the number of
490: disk transfers required to access the inodes
491: for all the files in a directory.
492: In contrast, the old file system typically requires
493: one disk transfer to fetch the inode for each file in a directory.
494: .PP
495: The other major resource is data blocks.
496: Since data blocks for a file are typically accessed together,
497: the policy routines try to place all data
498: blocks for a file in the same cylinder group,
499: preferably at rotationally optimal positions in the same cylinder.
500: The problem with allocating all the data blocks
501: in the same cylinder group is that large files will
502: quickly use up available space in the cylinder group,
503: forcing a spill over to other areas.
504: Further, using all the space in a cylinder group
505: causes future allocations for any file in the cylinder group
506: to also spill to other areas.
507: Ideally none of the cylinder groups should ever become completely full.
508: The heuristic solution chosen is to
509: redirect block allocation
510: to a different cylinder group
511: when a file exceeds 48 kilobytes,
512: and at every megabyte thereafter.*
513: .FS
514: * The first spill over point at 48 kilobytes is the point
515: at which a file on a 4096 byte block file system first
516: requires a single indirect block. This appears to be
517: a natural first point at which to redirect block allocation.
518: The other spillover points are chosen with the intent of
519: forcing block allocation to be redirected when a
520: file has used about 25% of the data blocks in a cylinder group.
521: In observing the new file system in day to day use, the heuristics appear
522: to work well in minimizing the number of completely filled
523: cylinder groups.
524: .FE
525: The newly chosen cylinder group is selected from those cylinder
526: groups that have a greater than average number of free blocks left.
527: Although big files tend to be spread out over the disk,
528: a megabyte of data is typically accessible before
529: a long seek must be performed,
530: and the cost of one long seek per megabyte is small.
531: .PP
532: The global policy routines call local allocation routines with
533: requests for specific blocks.
534: The local allocation routines will
535: always allocate the requested block
536: if it is free, otherwise it
537: allocates a free block of the requested size that is
538: rotationally closest to the requested block.
539: If the global layout policies had complete information,
540: they could always request unused blocks and
541: the allocation routines would be reduced to simple bookkeeping.
542: However, maintaining complete information is costly;
543: thus the implementation of the global layout policy
544: uses heuristics that employ only partial information.
545: .PP
546: If a requested block is not available, the local allocator uses
547: a four level allocation strategy:
548: .IP 1)
549: Use the next available block rotationally closest
550: to the requested block on the same cylinder. It is assumed
551: here that head switching time is zero. On disk
552: controllers where this is not the case, it may be possible
553: to incorporate the time required to switch between disk platters
554: when constructing the rotational layout tables. This, however,
555: has not yet been tried.
556: .IP 2)
557: If there are no blocks available on the same cylinder,
558: use a block within the same cylinder group.
559: .IP 3)
560: If that cylinder group is entirely full,
561: quadratically hash the cylinder group number to choose
562: another cylinder group to look for a free block.
563: .IP 4)
564: Finally if the hash fails, apply an exhaustive search
565: to all cylinder groups.
566: .PP
567: Quadratic hash is used because of its speed in finding
568: unused slots in nearly full hash tables [Knuth75].
569: File systems that are parameterized to maintain at least
570: 10% free space rarely use this strategy.
571: File systems that are run without maintaining any free
572: space typically have so few free blocks that almost any
573: allocation is random;
574: the most important characteristic of
575: the strategy used under such conditions is that the strategy be fast.
576: .ds RH Performance
577: .sp 2
578: .ne 1i
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.