43BSDReno/share/doc/smm/13.kchanges/netinet.t - annotate

Annotation of 43BSDReno/share/doc/smm/13.kchanges/netinet.t, revision 1.1.1.1

1.1 root 1: .\" Copyright (c) 1986 Regents of the University of California.
2: .\" All rights reserved. The Berkeley software License Agreement
3: .\" specifies the terms and conditions for redistribution.
4: .\"
5: .\" @(#)netinet.t 1.8 (Berkeley) 4/11/86
6: .\"
7: .hw SUBNETSARELOCAL
8: .NH
9: Internet network protocols
10: .PP
11: There are numerous bug fixes and extensions in the Internet
12: protocol support (\fB/sys/netinet\fP).
13: This section describes some of the more important changes
14: with very little detail.
15: As many of the changes span several source files,
16: and as it is very difficult to merge this code with earlier versions
17: of these protocols,
18: it is strongly recommended that the 4.3BSD network be adopted
19: intact, with local hacks merged into it only if necessary.
20: .NH 2
21: Internet common code
22: .PP
23: By far, the most important change in IP and the shared Internet support
24: layer is the addition of subnetwork addressing.
25: This facility is used (and required) by a number of large university
26: and other networks that include multiple physical networks
27: as well as connections with the DARPA Internet.
28: Subnet support allows a collection of interconnected local networks
29: to share a single network number,
30: hiding the complexity of the local environment and routing
31: from external hosts and gateways.
32: The subnet support in 4.3BSD conforms with the Internet standard
33: for subnet addressing, RFC-950.
34: For each network interface, a network mask is set along with the address.
35: This mask determines which portion of the address is the network number,
36: including the subnet, and by default is set according to the network
37: class (A, B, or C, with 8, 16, or 24 bits of network part, respectively).
38: Within a subnetted network each subnet appears as a distinct network;
39: externally, the entire network appears to be a single entity.
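The mask rules above can be sketched in a few lines of C. This is only an illustration, not the 4.3BSD kernel code: the names default_netmask and net_of are hypothetical, and addresses are assumed to be in host byte order.

```c
#include <stdint.h>

/*
 * Default network mask implied by the address class (host byte order).
 * Returns 0 for addresses outside classes A, B and C, which this
 * sketch does not handle.
 */
uint32_t default_netmask(uint32_t addr)
{
    if ((addr & 0x80000000u) == 0)
        return 0xff000000u;             /* class A: 8 bits of network */
    if ((addr & 0xc0000000u) == 0x80000000u)
        return 0xffff0000u;             /* class B: 16 bits of network */
    if ((addr & 0xe0000000u) == 0xc0000000u)
        return 0xffffff00u;             /* class C: 24 bits of network */
    return 0;
}

/* Network number, including any subnet bits in the mask, kept unshifted. */
uint32_t net_of(uint32_t addr, uint32_t mask)
{
    return addr & mask;
}
```

With a class B address and a configured mask of 0xffffff00, net_of yields a distinct value for each subnet, while the default mask yields the same network number for all of them.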
40: .PP
41: Another important change in IP addressing
42: is a change to the default IP broadcast address.
43: The default broadcast address is the address with a host part of all ones
44: (using the definition INADDR_BROADCAST),
45: in conformance with RFC-919.
46: In 4.2BSD, the broadcast address was the address with a host part
47: of all zeros (INADDR_ANY).
48: To facilitate the conversion process,
49: and to help avoid breaking networks with forwarded broadcasts,
50: 4.3BSD allows the broadcast address to be set for each interface.
51: IP recognizes and accepts network broadcasts
52: as well as subnet broadcasts when subnets are enabled.
53: Such broadcasts normally originate from hosts that do not know about subnets.
54: IP also accepts old-style (4.2) broadcasts using a host part of all
55: zeros, either as a network or subnet broadcast.
56: An address of all ones
57: is recognized as ``broadcast on this network,'' and an address of all
58: zeros is accepted as well.
59: The latter two are sometimes used in
60: broadcast information requests or network mask requests in the course
61: of starting a diskless workstation.
62: ICMP includes support for the Network Mask Request and Response.
63: A new routine, \fIin_broadcast\fP,
64: was added for the use of link layer output routines
65: to determine whether an IP packet should be broadcast.
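The set of addresses accepted as broadcasts, as described above, might be tested along the following lines. This is a hedged sketch and not the actual \fIin_broadcast\fP routine: the function name is_broadcast and its argument list are assumptions, and all values are in host byte order.

```c
#include <stdint.h>

/*
 * Return 1 if dst would be treated as a broadcast on an interface with
 * address ifaddr, class-based mask netmask and configured mask
 * subnetmask (which includes the subnet bits).
 */
int is_broadcast(uint32_t dst, uint32_t ifaddr,
                 uint32_t netmask, uint32_t subnetmask)
{
    uint32_t nethost = ~netmask;        /* host bits by class */
    uint32_t subhost = ~subnetmask;     /* host bits within the subnet */

    /* All ones ("broadcast on this network") or all zeros. */
    if (dst == 0xffffffffu || dst == 0)
        return 1;
    /* Anything else must at least be on our network. */
    if ((dst & netmask) != (ifaddr & netmask))
        return 0;
    /* Network broadcast: host part all ones, or all zeros (4.2 style). */
    if ((dst & nethost) == nethost || (dst & nethost) == 0)
        return 1;
    /* Subnet broadcast on our own subnet, new or old style. */
    if ((dst & subnetmask) == (ifaddr & subnetmask) &&
        ((dst & subhost) == subhost || (dst & subhost) == 0))
        return 1;
    return 0;
}
```

The sketch ignores the per-interface broadcast address that may be configured explicitly; a fuller version would compare against that as well.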
66: .PP
67: Network numbers are now stored and used unshifted to
68: minimize conversions and reduce the overhead associated with comparisons.
69: 4.2BSD shifted network numbers to the low-order part of the word.
70: The structure defining Internet addresses no longer includes
71: the old IMP-host fields, but only a featureless 32-bit address.
72: .XP in.h
73: The definitions of Internet port numbers in this file
74: were deleted, as they have been superseded by the \fIgetservbyname\fP
75: interface.
76: A definition was added for the single
77: option at the IP level accessible through \fIsetsockopt\fP,
78: IP_OPTIONS.
79: .XP in_pcb.h
80: The Internet protocol control block includes a pointer to an optional
81: mbuf containing IP options.
82: .XP in_var.h
83: This new header file contains the declaration of the Internet
84: variety of the per-interface address information.
85: The \fIin_ifaddr\fP structure includes the network, subnet, network mask
86: and broadcast information.
87: .XP in.c
88: The \fIif_*\fP routines which manipulate Internet addresses
89: were renamed to \fIin_*\fP.
90: \fIin_netof\fP and \fIin_lnaof\fP check whether the address
91: is for a directly-connected network, and if so they use the local
92: network mask to return the subnet/net and host portions, respectively.
93: \fIin_localaddr\fP determines whether an address corresponds
94: to a directly-connected network.
95: By default, this includes any subnet of a local network;
96: a configuration option, SUBNETSARELOCAL=0, changes this to return
97: true only for a directly-connected subnet or non-subnetted network.
98: Interface \fIioctl\fPs that get or set addresses or related status information
99: are forwarded to \fIin_control\fP, which implements them.
100: \fIin_iaonnetof\fP replaces \fIif_ifonnetof\fP for Internet addresses only.
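The effect of SUBNETSARELOCAL on the locality test can be illustrated as follows. The structure and function here are hypothetical stand-ins (they are not \fIin_ifaddr\fP or \fIin_localaddr\fP), and the configuration option is modeled as an ordinary argument.

```c
#include <stdint.h>

/* Hypothetical per-interface record, host byte order throughout. */
struct ifnet_info {
    uint32_t net;           /* network number by class, unshifted */
    uint32_t netmask;       /* class-based mask */
    uint32_t subnet;        /* network number including the subnet */
    uint32_t subnetmask;    /* configured mask including subnet bits */
};

/*
 * Return 1 if addr is "local".  With subnets_are_local nonzero, any
 * subnet of a directly-connected network qualifies; with it zero, only
 * a directly-connected subnet (or non-subnetted network) does.
 */
int is_local_addr(uint32_t addr, const struct ifnet_info *ifs, int nifs,
                  int subnets_are_local)
{
    int i;

    for (i = 0; i < nifs; i++) {
        if (subnets_are_local) {
            if ((addr & ifs[i].netmask) == ifs[i].net)
                return 1;
        } else {
            if ((addr & ifs[i].subnetmask) == ifs[i].subnet)
                return 1;
        }
    }
    return 0;
}
```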
101: .XP in_pcb.c
102: The destination address of a \fIconnect\fP may be given as INADDR_ANY (0)
103: as a shorthand notation for ``this host.''
104: This simplifies the process of connecting to local servers
105: such as the name-domain server that translates host names to addresses.
106: Also, the shorthand address INADDR_BROADCAST is converted to the broadcast
107: address for the primary local network; it fails if that network
108: is incapable of broadcast.
109: The source address for a connection or datagram
110: is selected according to the outgoing interface;
111: the initial route is allocated at this time and stored
112: in the protocol control block, so that it may be used again
113: when actually sending the packet(s).
114: The \fIin_pcbnotify\fP routine was generalized to apply any function
115: and/or report an error to all connections to a destination;
116: it is used to notify connections of routing changes and other
117: non-error situations as well as errors.
118: New entries have been added to this level to invalidate cached
119: routes when routing changes occur,
120: as well as to report possible routing failures detected by
121: higher levels.
122: .XP in_proto.c
123: The protocol switch table for Internet protocols includes entries
124: for the \fIctloutput\fP routines.
125: ICMP may be used with raw sockets.
126: A raw wildcard entry allows raw sockets to use any protocol
127: not already implemented in the kernel (e.g., EGP).
128: .NH 2
129: IP
130: .PP
131: Support was added for IP source routing and other IP options
132: (partly derived from BBN's implementation).
133: On output, IP options such as strict or loose source route and record
134: may be set by a client process using TCP, UDP or raw IP sockets.
135: IP properly updates source-route and record-route options
136: when forwarding (and leaves them in the packet, unlike 4.2 which
137: stripped them out after updating).
138: IP input preserves any source-routing information in an incoming packet
139: and passes it up to the receiving protocol upon request,
140: reversing it and arranging it in the same way as user-supplied options.
141: Both TCP and ICMP retrieve incoming source routes for use in replies.
142: Most of the option-handling code has been converted to use
143: \fIbcopy\fP instead of structure assignments when copying addresses,
144: as the alignment in the incoming packet may not be correct for the host.
145: This is not required on the VAX, but is needed on most other machines
146: running 4.2BSD.
147: .XP ip.h
148: The IP time-to-live field is decremented by one when forwarding;
149: in 4.2BSD this value was five.
150: .XP ip_var.h
151: Data structures and definitions were added for storing
152: IP options.
153: New fields have been added to the structure containing IP statistics.
154: .XP ip_input.c
155: The changes to save and present incoming IP source-routing information
156: to higher level protocols are in this file.
157: The identity of the interface that received the packet is also
158: determined by \fIip_input\fP and passed to the next protocol
159: receiving the packet.
160: To avoid using uninitialized data structures,
161: IP must not begin receiving packets until at least one Internet address
162: has been set.
163: A bug in the reassembly of IP packets with options has been corrected.
164: Machines with only a single network interface (in addition to the loopback
165: interface) no longer attempt to forward received IP packets that are
166: not destined for them;
167: they also do not respond with ICMP errors unless configured with
168: the GATEWAY option.
169: This change prevents large increases in network activity which used to result
170: when an IP packet that was broadcast was not understood as a broadcast.
171: A one-element route cache was added to the IP forwarding routine.
172: When a packet is forwarded using the same interface on which it arrived,
173: if the source host is on the directly-attached network,
174: an ICMP redirect is sent to the source.
175: If the route used for forwarding was a route to a host
176: or a route to a subnet,
177: a host redirect is used, otherwise a network redirect is sent.
178: The generation of redirects may be disabled by a configuration option,
179: IPSENDREDIRECTS=0.
180: More statistics are collected, in particular on traffic and fragmentation.
181: The \fIip_ctlinput\fP routine was moved to each of the upper-level
182: protocols, as they each have somewhat different requirements.
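The redirect rules given above for the forwarding routine reduce to a small decision function. The sketch below is illustrative only; the enumeration, the function name and the flag arguments are all assumptions, and the IPSENDREDIRECTS option is modeled as a run-time argument.

```c
enum redirect { REDIRECT_NONE, REDIRECT_NET, REDIRECT_HOST };

/*
 * Decide which ICMP redirect, if any, to send when forwarding.
 * same_interface:      packet leaves on the interface it arrived on
 * source_on_local_net: source host is on the directly-attached network
 * route_is_host:       forwarding route was a route to a host
 * route_is_subnet:     forwarding route was a route to a subnet
 * send_redirects:      zero if redirects are disabled by configuration
 */
enum redirect choose_redirect(int same_interface, int source_on_local_net,
                              int route_is_host, int route_is_subnet,
                              int send_redirects)
{
    if (!send_redirects || !same_interface || !source_on_local_net)
        return REDIRECT_NONE;
    if (route_is_host || route_is_subnet)
        return REDIRECT_HOST;
    return REDIRECT_NET;
}
```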
183: .XP ip_output.c
184: The IP output routine manages a cached route in the protocol
185: control block for each TCP, UDP or raw IP socket.
186: If the destination has changed, the route has been marked down,
187: or the route was freed because of a routing change, a new route
188: is obtained.
189: The route is not used if the IP_ROUTETOIF (aka SO_DONTROUTE or MSG_DONTROUTE)
190: option is present.
191: Preformed IP options passed to \fIip_output\fP are inserted,
192: changing the destination address as required.
193: The \fIip_ctloutput\fP routine allows options to be set for an individual
194: socket, validating and internalizing them as appropriate.
195: .XP raw_ip.c
196: The type-of-service and offset fields in the IP header
197: are set to zero on output.
198: The SO_DONTROUTE flag is handled properly.
199: .NH 2
200: ICMP
201: .PP
202: There have been numerous fixes and corrections to ICMP.
203: Length calculations have been corrected, allowing
204: most ICMP packet lengths to be received and allowing errors
205: to be sent about smaller input packets.
206: ICMP now uses information about the interface on which a message
207: was received to determine the
208: correct source address on returned error packets
209: and replies to information requests.
210: Support was added for the Network Mask Request.
211: Responses to source-routed requests use the reversed source route
212: for the return trip.
213: Timestamps are created with \fImicrotime\fP, allowing 1-millisecond
214: resolution.
215: The \fIicmp_error\fP routine is capable of sending ICMP redirects.
216: When processing network redirects, the returned source address is converted
217: to a network address before passing it to the routing redirect handler.
218: The translation of ICMP errors to Unix error returns was updated.
219: .NH 2
220: TCP
221: .PP
222: In addition to bug fixes, several performance changes have been
223: made to TCP.
224: Several of these address overall network performance and congestion
225: avoidance, while others address performance of an individual connection.
226: The most important changes concern the TCP send policy.
227: First, the sender silly-window syndrome avoidance strategy was fixed.
228: In 4.2BSD, the amount that could be sent was compared to the offered window,
229: and thus small amounts could still be sent if the receiver offered
230: a silly window.
231: Once this was fixed, there were problems with peers that never offered
232: windows large enough for a maximum segment, or at least 512 bytes
233: (e.g., the peer is a TAC or an IBM PC).
234: Code was then added to maintain estimates of the peer's receive and send
235: buffer sizes.
236: The send policy will now send if the offered
237: window is at least one-half of the receiver's buffer, as well as when
238: the window is at least a full-sized segment.
239: (When the window is large enough for all data that is queued,
240: the data will also be sent.)
241: The send buffer size estimate is not yet used, but is desired for a new
242: delayed-acknowledgement scheme that has yet to be tested.
243: Another problem that was exposed when the silly-window avoidance was fixed
244: was that the persist code didn't expect to be used with a non-zero window.
245: The persist now lasts only until the first timeout, at which time
246: a packet is sent of the largest size allowed by the window.
247: If this packet is not acknowledged, the output routine must begin retransmission
248: rather than returning to the persist state.
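The window tests in the revised send policy can be summarized in code. This is a simplified sketch under stated assumptions, not the contents of tcp_output.c: the function name and arguments are hypothetical, and the persist and retransmission logic is omitted.

```c
#include <stdint.h>

/*
 * Return 1 if the offered window justifies sending now.
 * window:       window currently offered by the peer
 * queued:       bytes waiting in the send buffer
 * maxseg:       maximum segment size for the connection
 * peer_bufsize: estimate of the peer's receive buffer (0 if unknown)
 */
int window_allows_send(uint32_t window, uint32_t queued,
                       uint32_t maxseg, uint32_t peer_bufsize)
{
    if (queued == 0 || window == 0)
        return 0;                       /* nothing to send, or no room */
    if (window >= maxseg)
        return 1;                       /* a full-sized segment fits */
    if (window >= queued)
        return 1;                       /* all queued data fits */
    if (peer_bufsize > 0 && window >= peer_bufsize / 2)
        return 1;                       /* at least half the peer's buffer */
    return 0;                           /* silly window: wait */
}
```

The third test is what allows progress with peers whose buffers are smaller than a maximum segment.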
249: .PP
250: Another change related to the send policy is a strategy designed to minimize
251: the number of small packets outstanding on slow links.
252: This is an implementation of an algorithm proposed by John Nagle
253: in RFC-896.
254: The algorithm is very simple:
255: when there is outstanding, unacknowledged data pending
256: on a connection, new data are not sent unless they fill a maximum-sized
257: segment.
258: This allows bulk data transfers to proceed,
259: but causes small-packet traffic such as remote login to bundle together
260: data received during a single round-trip time.
261: On high-bandwidth, low-delay networks such as a local Ethernet,
262: this change seldom causes delay, but over slow links or across the Internet,
263: the number of small packets can be reduced considerably.
264: This algorithm does interact poorly with one type of usage, however,
265: as demonstrated by the X window system.
266: When small packets are sent in a stream, such as when doing rubber-banding
267: to position a new window, and when no echo or other acknowledgement
268: is being received from the other end of the connection,
269: the round-trip delay becomes as large as the delayed-acknowledgement timer
270: on the remote end.
271: For such clients, a TCP option may be set with \fIsetsockopt\fP
272: to defeat this part of the send policy.
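The small-packet rule described above amounts to a single test on the sending side. The following is a minimal sketch with hypothetical names; the nodelay argument stands for the per-socket option mentioned above, which on BSD-derived systems is commonly known as TCP_NODELAY.

```c
#include <stdint.h>

/*
 * Return 1 if len bytes of new data may be sent now.
 * unacked: bytes sent but not yet acknowledged
 * len:     bytes of new data available to send
 * maxseg:  maximum segment size
 * nodelay: nonzero if the client has disabled this policy
 */
int nagle_may_send(uint32_t unacked, uint32_t len,
                   uint32_t maxseg, int nodelay)
{
    if (len == 0)
        return 0;               /* nothing to send */
    if (nodelay || unacked == 0)
        return 1;               /* policy defeated, or connection idle */
    return len >= maxseg;       /* otherwise only full segments go out */
}
```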
273: .PP
274: For bulk-data transfers, the largest single change to improve performance
275: is to increase the size of the send and receive buffers.
276: The default buffer size in 4.3BSD is 4096 bytes, double the value in 4.2BSD.
277: These values allow more outstanding data and reduce the amount of time
278: waiting for a window update from the receiver.
279: They also improve the utility of the delayed-acknowledgement strategy.
280: The delayed-acknowledgement strategy withholds acknowledgements
281: until a window update would uncover at least 35% of the window;
282: in 4.2BSD, with 1024-byte packets on an Ethernet and 2048-byte windows,
283: this took only a single packet.
284: With 4096-byte windows, up to 50% of the acknowledgements may be avoided.
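The arithmetic behind these figures can be checked with a short sketch of the 35% test. The function name is hypothetical and the test is reduced to its essentials.

```c
#include <stdint.h>

/*
 * Return 1 if an acknowledgement should be sent now, that is, if the
 * window update would uncover at least 35% of the window.
 * uncovered: bytes the update would newly open in the window
 * window:    full size of the receive window
 */
int ack_now(uint32_t uncovered, uint32_t window)
{
    return (uint64_t)uncovered * 100 >= (uint64_t)window * 35;
}
```

With a 2048-byte window, one 1024-byte packet uncovers 50% and is acknowledged at once. With a 4096-byte window, one packet uncovers only 25%, so the acknowledgement waits for the second packet and half of the acknowledgements are avoided.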
285: .PP
286: The use of larger buffers might cause problems when bulk-data transfers
287: must traverse several networks and gateways with limited buffering capacity.
288: The source-quench ICMP message was provided to allow gateways in such
289: circumstances to cause source hosts to slow their rate of packet injection
290: into the network.
291: While 4.2BSD ignored such messages, the 4.3BSD TCP includes a mechanism
292: for throttling back the sender when a source quench is received.
293: This is done by creating an artificially small window (one which is 80%
294: of the outstanding data at the time the quench is received, but no less than
295: one segment).
296: This artificial congestion window is slowly opened as acknowledgements
297: are received.
298: The result under most circumstances is a slow fluctuation around the buffering
299: limit of the intermediate gateways, depending on the other traffic flowing
300: at the same time.
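The computation of the artificial window on receipt of a source quench can be sketched as below. The function name is an assumption, and the gradual reopening of the window as acknowledgements arrive is not shown.

```c
#include <stdint.h>

/*
 * Window to impose when a source quench arrives: 80% of the data
 * outstanding at that moment, but never less than one segment.
 */
uint32_t quench_window(uint32_t outstanding, uint32_t maxseg)
{
    uint32_t w = (uint32_t)((uint64_t)outstanding * 4 / 5);

    return w < maxseg ? maxseg : w;
}
```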
301: .PP
302: A final set of changes designed to improve network throughput
303: concerns the retransmission policy.
304: The retransmission timer is set according to the current round-trip
305: time estimate.
306: Unfortunately, the round-trip timing code in 4.2BSD had several bugs
307: which caused retransmissions to begin much too early.
308: These bugs in round-trip timing have been corrected.
309: Also, the retransmission code has been tuned, using a faster
310: backoff after the first retransmission.
311: On an initial connection request where there is no round-trip time estimate,
312: a much more conservative policy is used.
313: When a slow link intervenes between the sender and the destination,
314: this policy avoids queuing large numbers of retransmitted connection requests
315: before a reply can be received. It also avoids saturation when
316: the destination host
317: is down or nonexistent.
318: During a connection, when the retransmission timer expires,
319: only a single packet is sent.
320: When only a single packet has been lost, this avoids resending
321: data that was successfully received;
322: when a host has gone down or become unreachable, it avoids sending
323: multiple packets at each timeout.
324: Once another acknowledgement is received, the transmission policy
325: returns to normal.
326: .PP
327: 4.2BSD offered a maximum receive segment size of 1024 for all connections,
328: and accepted such offers whenever made.
329: However, that size was especially poor for the Arpanet
330: and other 1822-based IMP networks (sorry, make that PSN networks)
331: where the maximum packet size is 1007 bytes.
332: This was compounded by a bug in the LH/DH driver that did not allow
333: space for an end-of-packet bit in the receive buffer,
334: and thus maximum size packets that were received were split across buffers.
335: This, in turn, aggravated a hardware
336: problem causing small packets following a segmented packet to be concatenated
337: with the previous packet.
338: The result of this set of conditions was that performance across
339: the Arpanet was sometimes abominably slow.
340: The maximum size segment selected by 4.3BSD is chosen according
341: to the destination and the interface to be used.
342: The segment size chosen is somewhat less than the maximum transmission unit
343: of the outgoing interface.
344: If the destination is not local,
345: the segment size is a convenient small size near
346: the default maximum size (512 bytes).
347: This value is both the maximum segment size
348: offered to the sender by the receive side,
349: and the maximum size segment that will be sent.
350: Of course, the send size is also limited
351: to be no more than the receiver has indicated it is willing to receive.
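A rough sketch of the segment-size selection follows. The names are hypothetical, and the 40-byte allowance (20 bytes each for IP and TCP headers without options) is an assumption made for illustration; the text above says only that the size is ``somewhat less'' than the interface MTU.

```c
#include <stdint.h>

#define DEFAULT_MSS  512u   /* default maximum segment size */
#define HDR_OVERHEAD 40u    /* assumed: IP + TCP headers, no options */

/*
 * Choose the maximum size segment to send.
 * mtu:           maximum transmission unit of the outgoing interface
 * dest_is_local: nonzero if the destination is local
 * peer_offer:    segment size the receiver offered (0 if none)
 */
uint32_t choose_mss(uint32_t mtu, int dest_is_local, uint32_t peer_offer)
{
    uint32_t mss = mtu > HDR_OVERHEAD ? mtu - HDR_OVERHEAD : DEFAULT_MSS;

    if (!dest_is_local && mss > DEFAULT_MSS)
        mss = DEFAULT_MSS;          /* stay near the default off-net */
    if (peer_offer != 0 && peer_offer < mss)
        mss = peer_offer;           /* never exceed what the peer accepts */
    return mss;
}
```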
352: .PP
353: The initial sequence number prototype for TCP is now
354: incremented much more quickly; this has exposed two bugs.
355: Both the window-update receiving code and the urgent data receiving
356: code compared sequence numbers to 0 the first time they were called
357: on a connection. This fails if the initial sequence number has
358: wrapped around to negative numbers. Both are now initialized
359: when the connection is set up. This still remains a problem
360: in maintaining compatibility with 4.2BSD systems;
361: thus an option, TCP_COMPAT_42, was added to avoid using such sequence numbers
362: until 4.2 systems have been upgraded.
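The failure mode described above arises whenever sequence numbers are compared as plain integers. A common technique that tolerates wraparound is to compare the signed difference of the two values; the sketch below illustrates that general technique with hypothetical function names and is not a quotation of the 4.3BSD source.

```c
#include <stdint.h>

/* Return 1 if sequence number a precedes b, allowing for wraparound. */
int seq_lt(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Return 1 if sequence number a follows b, allowing for wraparound. */
int seq_gt(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) > 0;
}
```

A sequence number just below the wrap point, such as 0xfffffff0, correctly precedes 0x10 under seq_lt, whereas a direct unsigned comparison would order them the other way.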
363: .PP
364: Additional changes in TCP are listed by source file:
365: .XP tcp_input.c
366: The common case of TCP data input, the arrival of the next
367: expected data segment with an empty reassembly queue, was made
368: into a simplified macro for efficiency.
369: \fITcp_input\fP was modified to know when it needed to call the output side,
370: reducing unnecessary tests for most acknowledgement-only packets.
371: The receive window size calculation on input was modified
372: to avoid shrinking the offered window;
373: this change was needed due to a change in input data
374: packaging by the link layer.
375: A bug in handling TCP packets received with both data and options
376: (that are not supposed to be used) has been corrected.
377: If data is received on a connection after the process has closed,
378: the other end is sent a reset, preventing connections from
379: hanging in CLOSE_WAIT on one end and FIN_WAIT_2 on the other.
380: (4.2BSD contained code to do this, but it was never executed
381: because such input packets had already been dropped
382: as being outside of the receive window.)
383: A timer is now started upon entering
384: FIN_WAIT_2 state if the local user has closed, closing the connection
385: if the final FIN is not received within a reasonable time.
386: Half-open connections are now reset more reliably; there were circumstances
387: under which one end could be rebooted, and new connection requests
388: that used the same port number might not receive a reset.
389: The urgent-data code was modified to remember which data had
390: already been read by the user, avoiding possible confusion if two
391: urgent-data signals were received close together.
392: Another change was made specifically for connections with a TAC.
393: The TAC doesn't fill in the window field on its initial packet (SYN),
394: and the apparent window is random.
395: There is some question as to the validity of the window field
396: if the packet does not have ACK set,
397: and therefore TCP was changed to ignore the window information
398: on those packets.
399: .XP tcp_output.c
400: The advertised window is never allowed to shrink,
401: in correspondence with the earlier change in the input handler.
402: The retransmit code was changed to check for shrinking windows,
403: updating the connection state rather than timing out
404: while waiting for acknowledgement.
405: The modifications to the send policy described above are largely
406: within this file.
407: .XP tcp_timer.c
408: The timer routines were changed to allow a longer wait for acknowledgements.
409: (TCP would generally time out before the routing protocol
410: had changed routes.)
411: .NH 2
412: UDP
413: .PP
414: An error in the checksumming of output UDP packets was corrected.
415: Checksums are now checked by default, unless the COMPAT_42 configuration
416: option is specified; it is provided to allow communication with the 4.2BSD UDP
417: implementation, which generates incorrect checksums.
418: When UDP datagrams are received for a port at which no process is listening,
419: ICMP unreachable messages are sent in response unless the input packet
420: was a broadcast.
421: The size of the receive buffer was increased, as several large datagrams
422: and their attached addresses could otherwise fill the buffer.
423: The time-to-live of output datagrams was reduced from 255
424: to 30.
425: UDP uses its own \fIctlinput\fP routine for handling of ICMP errors,
426: so that errors may be reported to the sender without closing the socket.
427: .NH 2
428: Address Resolution Protocol
429: .PP
430: The address resolution protocol has been generalized somewhat.
431: It was specific to IP on 10\ Mb/s Ethernet; it now handles multiple
432: protocols on 10 Mb/s Ethernet and could easily be adapted to other
433: hardware as well.
434: This change was made while adding ARP resolution
435: of trailer protocol addresses.
436: Hosts desiring to receive trailer
437: encapsulations must now indicate that by the use of ARP. This allows
438: trailers to be used between cooperating 4.3 machines while using
439: non-trailer encapsulations with other hosts.
440: The negotiation need not be symmetrical: a VAX may request trailers,
441: for example, and a SUN may note this and send trailer packets
442: to the VAX without itself requesting trailers.
443: This change requires modifications to the 10 Mb/s Ethernet drivers,
444: which must provide an additional argument to \fIarpresolve\fP,
445: a pointer for the additional return value indicating whether trailer
446: encapsulations may be sent.
447: With this change, the IFF_NOTRAILERS flag on each interface is interpreted
448: to mean that trailers should not be requested.
449: Modifications to ARP from SUN Microsystems add \fIioctl\fP operations
450: to examine and modify entries in the ARP address translation table,
451: and to allow ARP translations to be ``published.''
452: When future requests are received for Ethernet address translations,
453: if the translation is in the table and is marked as published,
454: they will be answered for that host.
455: Those modifications superseded the ``oldmap'' algorithmic translation
456: from IP addresses, which has been removed.
457: Packets are not forwarded to the loopback interface if it is not marked
458: up, and a bug causing an mbuf to be freed twice
459: if the loopback output fails was corrected.
460: ARP complains if a host lists the broadcast address as its Ethernet address.
461: The ARP tables were enlarged to reflect larger network configurations
462: now in use.
463: A new function for use in driver messages, \fIether_sprintf\fP,
464: formats a 48-bit Ethernet address and returns a pointer to the resulting string.
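A user-level routine in the spirit of \fIether_sprintf\fP might look like the following. This is a sketch only: the name ether_format, the colon-separated two-digit hexadecimal layout and the use of a static buffer are assumptions, not a description of the kernel routine's exact output.

```c
#include <stdio.h>

/*
 * Format a 48-bit Ethernet address as six colon-separated hex bytes.
 * Returns a pointer to a static buffer, which is overwritten by the
 * next call.
 */
char *ether_format(const unsigned char addr[6])
{
    static char buf[18];    /* "xx:xx:xx:xx:xx:xx" plus terminating NUL */

    snprintf(buf, sizeof buf, "%02x:%02x:%02x:%02x:%02x:%02x",
             addr[0], addr[1], addr[2], addr[3], addr[4], addr[5]);
    return buf;
}
```

Because the result lives in a static buffer, a caller that needs two addresses in one message must copy the first before formatting the second.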
465: .NH 2
466: IMP support
467: .PP
468: The support facilities for connections to an 1822 (or X.25) IMP port
469: (\fB/sys/netimp\fP)
470: have had several bug fixes and one extension.
471: Unit numbers are now checked more carefully during autoconfiguration.
472: Code from BRL was installed to support class B and C networks.
473: Error packets received from the IMP such as Host Dead are queued
474: in the interrupt handler for reprocessing from a software interrupt,
475: avoiding state transitions in the protocols at priorities above \fIsplnet\fP.
476: The host-dead timer is no longer restarted when attempting new output,
477: as a persistent sender could otherwise prevent new output from being attempted
478: once a host was reported down.
479: The network number is always taken from the address
480: configured for the interface at boot time;
481: network 10 is no longer assumed.
482: A timer is used to prevent blocking if RFNM messages from the IMP are lost.
483: A race was fixed when freeing mbufs containing host table entries,
484: as the mbuf had been used after it was freed.
