|
|
1.1 root 1: .\" Copyright (c) 1982 Regents of the University of California.
2: .\" All rights reserved. The Berkeley software License Agreement
3: .\" specifies the terms and conditions for redistribution.
4: .\"
5: .\" @(#)2.t 4.3 (Berkeley) 2/1/86
6: .\"
7: .ds RH Overview of the file system
8: .NH
9: Overview of the file system
10: .PP
11: The file system is discussed in detail in [Mckusick84];
12: this section gives a brief overview.
13: .NH 2
14: Superblock
15: .PP
16: A file system is described by its
17: .I "super-block" .
18: The super-block is built when the file system is created (\c
19: .I newfs (8))
20: and never changes.
21: The super-block
22: contains the basic parameters of the file system,
23: such as the number of data blocks it contains
24: and a count of the maximum number of files.
25: Because the super-block contains critical data,
26: .I newfs
27: replicates it to protect against catastrophic loss.
28: The
29: .I "default super block"
30: always resides at a fixed offset from the beginning
31: of the file system's disk partition.
32: The
33: .I "redundant super blocks"
34: are not referenced unless a head crash
35: or other hard disk error causes the default super-block
36: to be unusable.
37: The redundant blocks are sprinkled throughout the disk partition.
38: .PP
39: Within the file system are files.
40: Certain files are distinguished as directories and contain collections
41: of pointers to files that may themselves be directories.
42: Every file has a descriptor associated with it called an
43: .I "inode".
44: The inode contains information describing ownership of the file,
45: time stamps indicating modification and access times for the file,
46: and an array of indices pointing to the data blocks for the file.
47: In this section,
48: we assume that the first 12 blocks
49: of the file are directly referenced by values stored
50: in the inode structure itself\(dg.
51: .FS
52: \(dgThe actual number may vary from system to system, but is usually in
53: the range 5-13.
54: .FE
55: The inode structure may also contain references to indirect blocks
56: containing further data block indices.
57: In a file system with a 4096 byte block size, a singly indirect
58: block contains 1024 further block addresses,
59: a doubly indirect block contains 1024 addresses of further single indirect
60: blocks,
61: and a triply indirect block contains 1024 addresses of further doubly indirect
62: blocks (the triple indirect block is never needed in practice).
63: .PP
64: In order to create files with up to
65: 2\(ua32 bytes,
66: using only two levels of indirection,
67: the minimum size of a file system block is 4096 bytes.
68: The size of file system blocks can be any power of two
69: greater than or equal to 4096.
70: The block size of the file system is maintained in the super-block,
71: so it is possible for file systems of different block sizes
72: to be accessible simultaneously on the same system.
73: The block size must be decided when
74: .I newfs
75: creates the file system;
76: the block size cannot be subsequently
77: changed without rebuilding the file system.
78: .NH 2
79: Summary information
80: .PP
81: Associated with the super block is non replicated
82: .I "summary information" .
83: The summary information changes
84: as the file system is modified.
85: The summary information contains
86: the number of blocks, fragments, inodes and directories in the file system.
87: .NH 2
88: Cylinder groups
89: .PP
90: The file system partitions the disk into one or more areas called
91: .I "cylinder groups".
92: A cylinder group is comprised of one or more consecutive
93: cylinders on a disk.
94: Each cylinder group includes inode slots for files, a
95: .I "block map"
96: describing available blocks in the cylinder group,
97: and summary information describing the usage of data blocks
98: within the cylinder group.
99: A fixed number of inodes is allocated for each cylinder group
100: when the file system is created.
101: The current policy is to allocate one inode for each 2048
102: bytes of disk space;
103: this is expected to be far more inodes than will ever be needed.
104: .PP
105: All the cylinder group bookkeeping information could be
106: placed at the beginning of each cylinder group.
107: However if this approach were used,
108: all the redundant information would be on the top platter.
109: A single hardware failure that destroyed the top platter
110: could cause the loss of all copies of the redundant super-blocks.
111: Thus the cylinder group bookkeeping information
112: begins at a floating offset from the beginning of the cylinder group.
113: The offset for
114: the
115: .I "i+1" st
116: cylinder group is about one track further
117: from the beginning of the cylinder group
118: than it was for the
119: .I "i" th
120: cylinder group.
121: In this way,
122: the redundant
123: information spirals down into the pack;
124: any single track, cylinder,
125: or platter can be lost without losing all copies of the super-blocks.
126: Except for the first cylinder group,
127: the space between the beginning of the cylinder group
128: and the beginning of the cylinder group information stores data.
129: .NH 2
130: Fragments
131: .PP
132: To avoid waste in storing small files,
133: the file system space allocator divides a single
134: file system block into one or more
135: .I "fragments".
136: The fragmentation of the file system is specified
137: when the file system is created;
138: each file system block can be optionally broken into
139: 2, 4, or 8 addressable fragments.
140: The lower bound on the size of these fragments is constrained
141: by the disk sector size;
142: typically 512 bytes is the lower bound on fragment size.
143: The block map associated with each cylinder group
144: records the space availability at the fragment level.
145: Aligned fragments are examined
146: to determine block availability.
147: .PP
148: On a file system with a block size of 4096 bytes
149: and a fragment size of 1024 bytes,
150: a file is represented by zero or more 4096 byte blocks of data,
151: and possibly a single fragmented block.
152: If a file system block must be fragmented to obtain
153: space for a small amount of data,
154: the remainder of the block is made available for allocation
155: to other files.
156: For example,
157: consider an 11000 byte file stored on
158: a 4096/1024 byte file system.
159: This file uses two full size blocks and a 3072 byte fragment.
160: If no fragments with at least 3072 bytes
161: are available when the file is created,
162: a full size block is split yielding the necessary 3072 byte
163: fragment and an unused 1024 byte fragment.
164: This remaining fragment can be allocated to another file, as needed.
165: .NH 2
166: Updates to the file system
167: .PP
168: Every working day hundreds of files
169: are created, modified, and removed.
170: Every time a file is modified,
171: the operating system performs a
172: series of file system updates.
173: These updates, when written on disk, yield a consistent file system.
174: The file system stages
175: all modifications of critical information;
176: modification can
177: either be completed or cleanly backed out after a crash.
178: Knowing the information that is first written to the file system,
179: deterministic procedures can be developed to
180: repair a corrupted file system.
181: To understand this process,
182: the order that the update
183: requests were being honored must first be understood.
184: .PP
185: When a user program does an operation to change the file system,
186: such as a
187: .I write ,
188: the data to be written is copied into an internal
189: .I "in-core"
190: buffer in the kernel.
191: Normally, the disk update is handled asynchronously;
192: the user process is allowed to proceed even though
193: the data has not yet been written to the disk.
194: The data,
195: along with the inode information reflecting the change,
196: is eventually written out to disk.
197: The real disk write may not happen until long after the
198: .I write
199: system call has returned.
200: Thus at any given time, the file system,
201: as it resides on the disk,
202: lags the state of the file system represented by the in-core information.
203: .PP
204: The disk information is updated to reflect the in-core information
205: when the buffer is required for another use,
206: when a
207: .I sync (2)
208: is done (at 30 second intervals) by
209: .I "/etc/update" "(8),"
210: or by manual operator intervention with the
211: .I sync (8)
212: command.
213: If the system is halted without writing out the in-core information,
214: the file system on the disk will be in an inconsistent state.
215: .PP
216: If all updates are done asynchronously, several serious
217: inconsistencies can arise.
218: One inconsistency is that a block may be claimed by two inodes.
219: Such an inconsistency can occur when the system is halted before
220: the pointer to the block in the old inode has been cleared
221: in the copy of the old inode on the disk,
222: and after the pointer to the block in the new inode has been written out
223: to the copy of the new inode on the disk.
224: Here,
225: there is no deterministic method for deciding
226: which inode should really claim the block.
227: A similar problem can arise with a multiply claimed inode.
228: .PP
229: The problem with asynchronous inode updates
230: can be avoided by doing all inode deallocations synchronously.
231: Consequently,
232: inodes and indirect blocks are written to the disk synchronously
233: (\fIi.e.\fP the process blocks until the information is
234: really written to disk)
235: when they are being deallocated.
236: Similarly inodes are kept consistent by synchronously
237: deleting, adding, or changing directory entries.
238: .ds RH Fixing corrupted file systems
This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.