Annotation of 43BSDReno/share/doc/smm/13.kchanges/netinet.t, revision 1.1.1.1

1.1       root        1: .\" Copyright (c) 1986 Regents of the University of California.
                      2: .\" All rights reserved.  The Berkeley software License Agreement
                      3: .\" specifies the terms and conditions for redistribution.
                      4: .\"
                      5: .\"    @(#)netinet.t   1.8 (Berkeley) 4/11/86
                      6: .\"
                      7: .hw SUBNETSARELOCAL
                      8: .NH
                      9: Internet network protocols
                     10: .PP
                     11: There are numerous bug fixes and extensions in the Internet
                     12: protocol support (\fB/sys/netinet\fP).
                     13: This section describes some of the more important changes
                     14: with very little detail.
                     15: As many of the changes span several source files,
                     16: and as it is very difficult to merge this code with earlier versions
                     17: of these protocols,
                     18: it is strongly recommended that the 4.3BSD network be adopted
                     19: intact, with local hacks merged into it only if necessary.
                     20: .NH 2
                     21: Internet common code
                     22: .PP
                     23: By far, the most important change in IP and the shared Internet support
                     24: layer is the addition of subnetwork addressing.
                     25: This facility is used (and required) by a number of large university
                     26: and other networks that include multiple physical networks
                     27: as well as connections with the DARPA Internet.
                     28: Subnet support allows a collection of interconnected local networks
                     29: to share a single network number,
                     30: hiding the complexity of the local environment and routing
                     31: from external hosts and gateways.
                     32: The subnet support in 4.3BSD conforms with the Internet standard
                     33: for subnet addressing, RFC-950.
                     34: For each network interface, a network mask is set along with the address.
                     35: This mask determines which portion of the address is the network number,
                     36: including the subnet, and by default is set according to the network
                     37: class (A, B, or C, with 8, 16, or 24 bits of network part, respectively).
                     38: Within a subnetted network each subnet appears as a distinct network;
                     39: externally, the entire network appears to be a single entity.
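As a rough illustration of the masking described above, the following C sketch derives the default mask implied by the address class and applies an interface mask to obtain the network number including the subnet. The function names are hypothetical, and addresses are held in host byte order for readability; the kernel's own routines differ.

```c
#include <stdint.h>

/* Default mask implied by the address class (host byte order). */
static uint32_t class_mask(uint32_t addr)
{
    if ((addr & 0x80000000u) == 0)            /* class A: leading bit 0   */
        return 0xff000000u;
    if ((addr & 0xc0000000u) == 0x80000000u)  /* class B: leading bits 10 */
        return 0xffff0000u;
    return 0xffffff00u;                       /* class C; this sketch treats
                                                 anything else the same way */
}

/* Network number, including the subnet, selected by the interface mask. */
static uint32_t subnet_of(uint32_t addr, uint32_t ifmask)
{
    return addr & ifmask;
}
```

For the class B address 128.32.10.5 with an interface mask of 255.255.255.0, subnet_of yields 128.32.10.0, while the class mask alone yields 128.32.0.0.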
                     40: .PP
                     41: Another important change in IP addressing
                     42: is a change to the default IP broadcast address.
                     43: The default broadcast address is the address with a host part of all ones
                     44: (using the definition INADDR_BROADCAST),
                     45: in conformance with RFC-919.
                     46: In 4.2BSD, the broadcast address was the address with a host part
                     47: of all zeros (INADDR_ANY).
                     48: To facilitate the conversion process,
                     49: and to help avoid breaking networks with forwarded broadcasts,
                     50: 4.3BSD allows the broadcast address to be set for each interface.
                     51: IP recognizes and accepts network broadcasts
                     52: as well as subnet broadcasts when subnets are enabled.
                     53: Such broadcasts normally originate from hosts that do not know about subnets.
                     54: IP also accepts old-style (4.2) broadcasts using a host part of all
                     55: zeros, either as a network or subnet broadcast.
                     56: An address of all ones
                     57: is recognized as ``broadcast on this network,'' and an address of all
                     58: zeros is accepted as well.
                     59: The latter two are sometimes used in
                     60: broadcast information requests or network mask requests in the course
                     61: of starting a diskless workstation.
                     62: ICMP includes support for the Network Mask Request and Response.
                     63: A new routine, \fIin_broadcast\fP,
                     64: was added for the use of link layer output routines
                     65: to determine whether an IP packet should be broadcast.
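The broadcast forms accepted above can be summarized in a small C sketch. It is not the kernel's in_broadcast routine; the structure and function names are invented for illustration, and addresses are in host byte order.

```c
#include <stdint.h>

struct ifaddr_info {         /* hypothetical; not the kernel's in_ifaddr */
    uint32_t addr;           /* interface address                        */
    uint32_t netmask;        /* mask implied by the address class        */
    uint32_t subnetmask;     /* interface mask, including subnet bits    */
    uint32_t broadaddr;      /* broadcast address set on the interface   */
};

/* True if dst is on the same (sub)network and its host part is
   all ones (new style) or all zeros (old 4.2BSD style). */
static int host_part_is_broadcast(uint32_t dst, uint32_t local, uint32_t mask)
{
    uint32_t host;

    if ((dst & mask) != (local & mask))
        return 0;
    host = dst & ~mask;
    return host == (uint32_t)~mask || host == 0;
}

static int is_broadcast(uint32_t dst, const struct ifaddr_info *ia)
{
    if (dst == 0xffffffffu || dst == 0)   /* "broadcast on this network" */
        return 1;
    if (dst == ia->broadaddr)             /* per-interface setting       */
        return 1;
    return host_part_is_broadcast(dst, ia->addr, ia->subnetmask) ||
           host_part_is_broadcast(dst, ia->addr, ia->netmask);
}
```

The second call to host_part_is_broadcast is what lets a subnetted host accept a network broadcast sent by a host that does not know about subnets.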
                     66: .PP
                     67: Network numbers are now stored and used unshifted to
                     68: minimize conversions and reduce the overhead associated with comparisons.
                     69: 4.2BSD shifted network numbers to the low-order part of the word.
                     70: The structure defining Internet addresses no longer includes
                     71: the old IMP-host fields, but only a featureless 32-bit address.
                     72: .XP in.h
                     73: The definitions of Internet port numbers in this file
                     74: were deleted, as they have been superseded by the \fIgetservbyname\fP
                     75: interface.
                     76: A definition was added for the single
                     77: option at the IP level accessible through \fIsetsockopt\fP,
                     78: IP_OPTIONS.
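A user process might set IP options along the following lines. This sketch builds a record-route option by hand, using the option values from RFC 791, and passes it to setsockopt at level IPPROTO_IP. The helper names are hypothetical, and the layout a given system expects for source-route options is not covered here.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

enum { OPT_EOL = 0, OPT_RR = 7, OPT_MINOFF = 4 };   /* RFC 791 values */

/* Fill buf with a record-route option holding `slots` empty entries,
   padded with end-of-list bytes to a multiple of 4.  Returns the total
   length, or 0 if the option does not fit in the 40 bytes IP allows. */
static size_t build_record_route(unsigned char *buf, size_t buflen,
                                 unsigned slots)
{
    size_t optlen = 3 + 4 * (size_t)slots;   /* type, length, pointer, slots */
    size_t total = (optlen + 3) & ~(size_t)3;

    if (slots == 0 || total > 40 || total > buflen)
        return 0;
    memset(buf, OPT_EOL, total);
    buf[0] = OPT_RR;
    buf[1] = (unsigned char)optlen;
    buf[2] = OPT_MINOFF;                     /* pointer to first free slot */
    return total;
}

/* Attach the option to a socket; returns 0 on success, -1 on failure. */
static int set_record_route(int sock, unsigned slots)
{
    unsigned char opts[40];
    size_t len = build_record_route(opts, sizeof opts, slots);

    if (len == 0)
        return -1;
    return setsockopt(sock, IPPROTO_IP, IP_OPTIONS, opts, (socklen_t)len);
}
```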
                     79: .XP in_pcb.h
                     80: The Internet protocol control block includes a pointer to an optional
                     81: mbuf containing IP options.
                     82: .XP in_var.h
                     83: This new header file contains the declaration of the Internet
                     84: variety of the per-interface address information.
                     85: The \fIin_ifaddr\fP structure includes the network, subnet, network mask
                     86: and broadcast information.
                     87: .XP in.c
                     88: The \fIif_*\fP routines which manipulate Internet addresses
                     89: were renamed to \fIin_*\fP.
                     90: \fIin_netof\fP and \fIin_lnaof\fP check whether the address
                     91: is for a directly-connected network, and if so they use the local
                     92: network mask to return the subnet/net and host portions, respectively.
                     93: \fIin_localaddr\fP determines whether an address corresponds
                     94: to a directly-connected network.
                     95: By default, this includes any subnet of a local network;
                     96: a configuration option, SUBNETSARELOCAL=0, changes this to return
                     97: true only for a directly-connected subnet or non-subnetted network.
                     98: Interface \fIioctl\fPs that get or set addresses or related status information
                     99: are forwarded to \fIin_control\fP, which implements them.
                    100: \fIin_iaonnetof\fP replaces \fIif_ifonnetof\fP for Internet addresses only.
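The effect of SUBNETSARELOCAL on the locality test might be sketched as follows. This is a simplified model, not the kernel's in_localaddr: the structure is hypothetical, the configuration option is passed as a run-time argument so both behaviors can be compared, and addresses are in host byte order.

```c
#include <stdint.h>

struct local_if {            /* hypothetical per-interface record      */
    uint32_t addr;           /* interface address                      */
    uint32_t netmask;        /* mask implied by the address class      */
    uint32_t subnetmask;     /* interface mask, including subnet bits  */
};

/* Return 1 if dst is considered local.  With subnets_are_local set,
   any subnet of a directly-connected network counts; otherwise only
   the directly-connected subnet (or non-subnetted network) does. */
static int is_local(uint32_t dst, const struct local_if *ifs, int n,
                    int subnets_are_local)
{
    int i;

    for (i = 0; i < n; i++) {
        uint32_t mask = subnets_are_local ? ifs[i].netmask
                                          : ifs[i].subnetmask;
        if ((dst & mask) == (ifs[i].addr & mask))
            return 1;
    }
    return 0;
}
```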
                    101: .XP in_pcb.c
                    102: The destination address of a \fIconnect\fP may be given as INADDR_ANY (0)
                    103: as a shorthand notation for ``this host.''
                    104: This simplifies the process of connecting to local servers
                    105: such as the name-domain server that translates host names to addresses.
                    106: Also, the shorthand address INADDR_BROADCAST is converted to the broadcast
                    107: address for the primary local network; it fails if that network
                    108: is incapable of broadcast.
                    109: The source address for a connection or datagram
                    110: is selected according to the outgoing interface;
                    111: the initial route is allocated at this time and stored
                    112: in the protocol control block, so that it may be used again
                    113: when actually sending the packet(s).
                    114: The \fIin_pcbnotify\fP routine was generalized to apply any function
                    115: and/or report an error to all connections to a destination;
                    116: it is used to notify connections of routing changes and other
                    117: non-error situations as well as errors.
                    118: New entries have been added to this level to invalidate cached
                    119: routes when routing changes occur,
                    120: as well as to report possible routing failures detected by
                    121: higher levels.
                    122: .XP in_proto.c
                    123: The protocol switch table for Internet protocols includes entries
                    124: for the \fIctloutput\fP routines.
                    125: ICMP may be used with raw sockets.
                    126: A raw wildcard entry allows raw sockets to use any protocol
                    127: not already implemented in the kernel (e.g., EGP).
                    128: .NH 2
                    129: IP
                    130: .PP
                    131: Support was added for IP source routing and other IP options
                    132: (partly derived from BBN's implementation).
                    133: On output, IP options such as strict or loose source route and record
                    134: may be set by a client process using TCP, UDP or raw IP sockets.
                    135: IP properly updates source-route and record-route options
                    136: when forwarding (and leaves them in the packet, unlike 4.2 which
                    137: stripped them out after updating).
                    138: IP input preserves any source-routing information in an incoming packet
                    139: and passes it up to the receiving protocol upon request,
                    140: reversing it and arranging it in the same way as user-supplied options.
                    141: Both TCP and ICMP retrieve incoming source routes for use in replies.
                    142: Most of the option-handling code has been converted to use
                    143: \fIbcopy\fP instead of structure assignments when copying addresses,
                    144: as the alignment in the incoming packet may not be correct for the host.
                    145: This is not required on the VAX, but is needed on most other machines
                    146: running 4.2BSD.
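The alignment concern can be shown with a short C sketch. It uses memcpy, the standard C counterpart of bcopy, to fetch a 4-byte address from an arbitrary byte offset within an option; the function name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Read the 4-byte address stored at byte offset `off` within an option.
   A structure assignment through a cast pointer could fault on machines
   that require aligned 32-bit loads; a byte copy is safe anywhere. */
static uint32_t read_option_addr(const unsigned char *opt, size_t off)
{
    uint32_t a;

    memcpy(&a, opt + off, sizeof a);
    return a;                        /* still in network byte order */
}
```

In a source-route option the first address begins at offset 3, after the type, length and pointer bytes, so it is never 4-byte aligned relative to the start of the option.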
                    147: .XP ip.h
                    148: The IP time-to-live field is decremented by one when forwarding;
                    149: in 4.2BSD this value was five.
                    150: .XP ip_var.h
                    151: Data structures and definitions were added for storing
                    152: IP options.
                    153: New fields have been added to the structure containing IP statistics.
                    154: .XP ip_input.c
                    155: The changes to save and present incoming IP source-routing information
                    156: to higher level protocols are in this file.
                    157: The identity of the interface that received the packet is also
                    158: determined by \fIip_input\fP and passed to the next protocol
                    159: receiving the packet.
                    160: To avoid using uninitialized data structures,
                    161: IP must not begin receiving packets until at least one Internet address
                    162: has been set.
                    163: A bug in the reassembly of IP packets with options has been corrected.
                    164: Machines with only a single network interface (in addition to the loopback
                    165: interface) no longer attempt to forward received IP packets that are
                    166: not destined for them;
                    167: they also do not respond with ICMP errors unless configured with
                    168: the GATEWAY option.
                    169: This change prevents large increases in network activity which used to result
                    170: when an IP packet that was broadcast was not understood as a broadcast.
                    171: A one-element route cache was added to the IP forwarding routine.
                    172: When a packet is forwarded using the same interface on which it arrived,
                    173: if the source host is on the directly-attached network,
                    174: an ICMP redirect is sent to the source.
                    175: If the route used for forwarding was a route to a host
                    176: or a route to a subnet,
                    177: a host redirect is used, otherwise a network redirect is sent.
                    178: The generation of redirects may be disabled by a configuration option,
                    179: IPSENDREDIRECTS=0.
                    180: More statistics are collected, in particular on traffic and fragmentation.
                    181: The \fIip_ctlinput\fP routine was moved to each of the upper-level
                    182: protocols, as they each have somewhat different requirements.
                    183: .XP ip_output.c
                    184: The IP output routine manages a cached route in the protocol
                    185: control block for each TCP, UDP or raw IP socket.
                    186: If the destination has changed, the route has been marked down,
                    187: or the route was freed because of a routing change, a new route
                    188: is obtained.
                    189: The route is not used if the IP_ROUTETOIF (aka SO_DONTROUTE or MSG_DONTROUTE)
                    190: option is present.
                    191: Preformed IP options passed to \fIip_output\fP are inserted,
                    192: changing the destination address as required.
                    193: The \fIip_ctloutput\fP routine allows options to be set for an individual
                    194: socket, validating and internalizing them as appropriate.
                    195: .XP raw_ip.c
                    196: The type-of-service and offset fields in the IP header
                    197: are set to zero on output.
                    198: The SO_DONTROUTE flag is handled properly.
                    199: .NH 2
                    200: ICMP
                    201: .PP
                    202: There have been numerous fixes and corrections to ICMP.
                    203: Length calculations have been corrected, allowing
                    204: most ICMP packet lengths to be received and allowing errors
                    205: to be sent about smaller input packets.
                    206: ICMP now uses information about the interface on which a message
                    207: was received to determine the
                    208: correct source address on returned error packets
                    209: and replies to information requests.
                    210: Support was added for the Network Mask Request.
                    211: Responses to source-routed requests use the reversed source route
                    212: for the return trip.
                    213: Timestamps are created with \fImicrotime\fP, allowing 1-millisecond
                    214: resolution.
                    215: The \fIicmp_error\fP routine is capable of sending ICMP redirects.
                    216: When processing network redirects, the returned source address is converted
                    217: to a network address before passing it to the routing redirect handler.
                    218: The translation of ICMP errors to Unix error returns was updated.
                    219: .NH 2
                    220: TCP
                    221: .PP
                    222: In addition to bug fixes, several performance changes have been
                    223: made to TCP.
                    224: Several of these address overall network performance and congestion
                    225: avoidance, while others address performance of an individual connection.
                    226: The most important changes concern the TCP send policy.
                    227: First, the sender silly-window syndrome avoidance strategy was fixed.
                    228: In 4.2BSD, the amount that could be sent was compared to the offered window,
                    229: and thus small amounts could still be sent if the receiver offered
                    230: a silly window.
                    231: Once this was fixed, there were problems with peers that never offered
                    232: windows large enough for a maximum segment, or at least 512 bytes
                    233: (e.g., the peer is a TAC or an IBM PC).
                    234: Code was then added to maintain estimates of the peer's receive and send
                    235: buffer sizes.
                    236: The send policy will now send if the offered
                    237: window is at least one-half of the receiver's buffer, as well as when
                    238: the window is at least a full-sized segment.
                    239: (When the window is large enough for all data that is queued,
                    240: the data will also be sent.)
                    241: The send buffer size estimate is not yet used, but is desired for a new
                    242: delayed-acknowledgement scheme that has yet to be tested.
                    243: Another problem that was exposed when the silly-window avoidance was fixed
                    244: was that the persist code didn't expect to be used with a non-zero window.
                    245: The persist state now lasts only until the first timeout, at which time
                    246: a packet is sent of the largest size allowed by the window.
                    247: If this packet is not acknowledged, the output routine must begin retransmission
                    248: rather than returning to the persist state.
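The revised send decision described above might be sketched in C as follows. This follows the prose rather than the kernel source; the function and parameter names are hypothetical, and peer_buf_est stands for the estimate of the peer's receive buffer size.

```c
/* Decide whether to transmit now.
   queued:       bytes waiting in the send buffer
   window:       window currently offered by the peer
   maxseg:       maximum segment size for the connection
   peer_buf_est: estimate of the peer's receive buffer size */
static int should_send(unsigned long queued, unsigned long window,
                       unsigned long maxseg, unsigned long peer_buf_est)
{
    unsigned long len = queued < window ? queued : window;

    if (len == 0)
        return 0;                   /* nothing can be sent              */
    if (len >= maxseg)
        return 1;                   /* a full-sized segment fits        */
    if (len == queued)
        return 1;                   /* window covers all queued data    */
    if (window >= peer_buf_est / 2)
        return 1;                   /* at least half the peer's buffer  */
    return 0;                       /* otherwise treat as a silly window */
}
```

With a peer whose buffer is estimated at 256 bytes, a 128-byte window is enough to send even though it is smaller than a 512-byte segment.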
                    249: .PP
                    250: Another change related to the send policy is a strategy designed to minimize
                    251: the number of small packets outstanding on slow links.
                    252: This is an implementation of an algorithm proposed by John Nagle
                    253: in RFC-896.
                    254: The algorithm is very simple:
                    255: when there is outstanding, unacknowledged data pending
                    256: on a connection, new data are not sent unless they fill a maximum-sized
                    257: segment.
                    258: This allows bulk data transfers to proceed,
                    259: but causes small-packet traffic such as remote login to bundle together
                    260: data received during a single round-trip time.
                    261: On high-bandwidth, low-delay networks such as a local Ethernet,
                    262: this change seldom causes delay, but over slow links or across the Internet,
                    263: the number of small packets can be reduced considerably.
                    264: This algorithm does interact poorly with one type of usage, however,
                    265: as demonstrated by the X window system.
                    266: When small packets are sent in a stream, such as when doing rubber-banding
                    267: to position a new window, and when no echo or other acknowledgement
                    268: is being received from the other end of the connection,
                    269: the round-trip delay becomes as large as the delayed-acknowledgement timer
                    270: on the remote end.
                    271: For such clients, a TCP option may be set with \fIsetsockopt\fP
                    272: to defeat this part of the send policy.
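The small-packet rule reduces to a short test, sketched here in C with hypothetical names. The second function shows how a client might turn the rule off; in BSD-derived socket interfaces the option is named TCP_NODELAY, a name the text above does not give.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Return 1 if `len` bytes of new data may be sent now.
   unacked: bytes sent but not yet acknowledged
   nodelay: nonzero if the application has disabled the rule */
static int nagle_allows_send(unsigned long len, unsigned long maxseg,
                             unsigned long unacked, int nodelay)
{
    if (len >= maxseg)
        return 1;            /* full segments always go out           */
    if (unacked == 0)
        return 1;            /* connection is idle: send immediately  */
    return nodelay != 0;     /* otherwise wait for an acknowledgement */
}

/* Disable the rule on a connected TCP socket; 0 on success, -1 on error. */
static int disable_nagle(int sock)
{
    int on = 1;

    return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &on, sizeof on);
}
```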
                    273: .PP
                    274: For bulk-data transfers, the largest single change to improve performance
                    275: is to increase the size of the send and receive buffers.
                    276: The default buffer size in 4.3BSD is 4096 bytes, double the value in 4.2BSD.
                    277: These values allow more outstanding data and reduce the amount of time
                    278: waiting for a window update from the receiver.
                    279: They also improve the utility of the delayed-acknowledgement strategy.
                    280: The delayed-acknowledgement strategy withholds acknowledgements
                    281: until a window update would uncover at least 35% of the window;
                    282: in 4.2BSD, with 1024-byte packets on an Ethernet and 2048-byte windows,
                    283: this took only a single packet.
                    284: With 4096-byte windows, up to 50% of the acknowledgements may be avoided.
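The arithmetic behind these figures can be checked with a small C sketch. The function name is hypothetical; it counts how many full packets must arrive before an acknowledgement would uncover at least 35% of the window.

```c
/* Number of full packets received before a window update
   would uncover at least 35% of the window. */
static unsigned packets_before_ack(unsigned long window,
                                   unsigned long pktsize)
{
    unsigned n = 1;

    if (pktsize == 0)
        return 0;                              /* no meaningful answer */
    while ((unsigned long)n * pktsize * 100 < 35 * window)
        n++;
    return n;
}
```

With 1024-byte packets, a 2048-byte window has a threshold of about 717 bytes, so every packet is acknowledged. A 4096-byte window has a threshold of about 1434 bytes, so an acknowledgement is needed only for every second packet, which is the 50% saving mentioned above.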
                    285: .PP
                    286: The use of larger buffers might cause problems when bulk-data transfers
                    287: must traverse several networks and gateways with limited buffering capacity.
                    288: The source-quench ICMP message was provided to allow gateways in such
                    289: circumstances to cause source hosts to slow their rate of packet injection
                    290: into the network.
                    291: While 4.2BSD ignored such messages, the 4.3BSD TCP includes a mechanism
                    292: for throttling back the sender when a source quench is received.
                    293: This is done by creating an artificially small window (one which is 80%
                    294: of the outstanding data at the time the quench is received, but no less than
                    295: one segment).
                    296: This artificial congestion window is slowly opened as acknowledgements
                    297: are received.
                    298: The result under most circumstances is a slow fluctuation around the buffering
                    299: limit of the intermediate gateways, depending on the other traffic flowing
                    300: at the same time.
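The window computed on receipt of a source quench might look like the following C sketch, which follows the description above; the function name is hypothetical, and the way the window is reopened as acknowledgements arrive is not modeled.

```c
/* Artificial window set when a source quench arrives:
   80% of the data outstanding, but never less than one segment. */
static unsigned long quench_window(unsigned long outstanding,
                                   unsigned long maxseg)
{
    unsigned long w = outstanding * 8 / 10;

    return w < maxseg ? maxseg : w;
}
```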
                    301: .PP
                    302: A final set of changes designed to improve network throughput
                    303: concerns the retransmission policy.
                    304: The retransmission timer is set according to the current round-trip
                    305: time estimate.
                    306: Unfortunately, the round-trip timing code in 4.2BSD had several bugs
                    307: which caused retransmissions to begin much too early.
                    308: These bugs in round trip timing have been corrected.
                    309: Also, the retransmission code has been tuned, using a faster
                    310: backoff after the first retransmission.
                    311: On an initial connection request where there is no round-trip time estimate,
                    312: a much more conservative policy is used.
                    313: When a slow link intervenes between the sender and the destination,
                    314: this policy avoids queuing large numbers of retransmitted connection requests
                    315: before a reply can be received. It also avoids saturation when
                    316: the destination host
                    317: is down or nonexistent.
                    318: During a connection, when the retransmission timer expires,
                    319: only a single packet is sent.
                    320: When only a single packet has been lost, this avoids resending
                    321: data that was successfully received;
                    322: when a host has gone down or become unreachable, it avoids sending
                    323: multiple packets at each timeout.
                    324: Once another acknowledgement is received, the transmission policy
                    325: returns to normal.
                    326: .PP
                    327: 4.2BSD offered a maximum receive segment size of 1024 for all connections,
                    328: and accepted such offers whenever made.
                    329: However, that size was especially poor for the Arpanet
                    330: and other 1822-based IMP networks (sorry, make that PSN networks)
                    331: where the maximum packet size is 1007 bytes.
                    332: This was compounded by a bug in the LH/DH driver that did not allow
                    333: space for an end-of-packet bit in the receive buffer,
                    334: and thus maximum size packets that were received were split across buffers.
                    335: This, in turn, aggravated a hardware
                    336: problem causing small packets following a segmented packet to be concatenated
                    337: with the previous packet.
                    338: The result of this set of conditions was that performance across
                    339: the Arpanet was sometimes abominably slow.
                    340: The maximum size segment selected by 4.3BSD is chosen according
                    341: to the destination and the interface to be used.
                    342: The segment size chosen is somewhat less than the maximum transmission unit
                    343: of the outgoing interface.
                    344: If the destination is not local,
                    345: the segment size is a convenient small size near
                    346: the default maximum size (512 bytes).
                    347: This value is both the maximum segment size
                    348: offered to the sender by the receive side,
                    349: and the maximum size segment that will be sent.
                    350: Of course, the send size is also limited
                    351: to be no more than the receiver has indicated it is willing to receive.
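A simplified model of the segment-size choice is sketched below in C. The names are hypothetical, and the 40-byte allowance for IP and TCP headers is an assumption of this sketch; the text says only that the size is somewhat less than the interface MTU, and the kernel may round the value further.

```c
enum {
    TCPIP_HDR = 40,      /* assumed: 20-byte IP plus 20-byte TCP header */
    DEFAULT_MSS = 512    /* default maximum segment size from the text  */
};

/* Segment size offered to the peer and used as the sending limit. */
static unsigned choose_mss(unsigned mtu, int dst_is_local)
{
    unsigned mss = mtu > TCPIP_HDR ? mtu - TCPIP_HDR : DEFAULT_MSS;

    if (!dst_is_local && mss > DEFAULT_MSS)
        mss = DEFAULT_MSS;           /* non-local: stay near the default */
    return mss;
}

/* The send size is further limited by what the peer offered
   (0 means the peer made no offer). */
static unsigned send_mss(unsigned mss, unsigned peer_offer)
{
    return (peer_offer != 0 && peer_offer < mss) ? peer_offer : mss;
}
```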
                    352: .PP
                    353: The initial sequence number prototype for TCP is now
                    354: incremented much more quickly; this has exposed two bugs.
                    355: Both the window-update receiving code and the urgent data receiving
                    356: code compared sequence numbers to 0 the first time they were called
                    357: on a connection.  This fails if the initial sequence number has
                    358: wrapped around to negative numbers.  Both are now initialized
                    359: when the connection is set up.  This still remains a problem
                    360: in maintaining compatibility with 4.2BSD systems;
                    361: thus an option, TCP_COMPAT_42, was added to avoid using such sequence numbers
                    362: until 4.2 systems have been upgraded.
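Why a comparison against 0 fails can be seen from the usual modular sequence-number comparison, sketched here in C. The names are hypothetical, and the cast assumes the two's-complement conversion found on essentially all current machines.

```c
#include <stdint.h>

typedef uint32_t tcp_seq;

/* a precedes b in sequence space if the 32-bit difference,
   viewed as a signed number, is negative. */
static int seq_lt(tcp_seq a, tcp_seq b)
{
    return (int32_t)(a - b) < 0;
}
```

With a state variable left at 0, seq_lt(0, 0x10000000) is true, but seq_lt(0, 0x90000000) is false: once the initial sequence number has its high bit set, the first segment appears to precede 0 and the update is wrongly rejected. Initializing the variable from the peer's initial sequence number keeps both operands in the same half of the sequence space.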
                    363: .PP
                    364: Additional changes in TCP are listed by source file:
                    365: .XP tcp_input.c
                    366: The common case of TCP data input, the arrival of the next
                    367: expected data segment with an empty reassembly queue, was made
                    368: into a simplified macro for efficiency.
                    369: \fITcp_input\fP was modified to know when it needed to call the output side,
                    370: reducing unnecessary tests for most acknowledgement-only packets.
                    371: The receive window size calculation on input was modified
                    372: to avoid shrinking the offered window;
                    373: this change was needed due to a change in input data
                    374: packaging by the link layer.
                    375: A bug in handling TCP packets received with both data and options
                    376: (that are not supposed to be used) has been corrected.
                    377: If data is received on a connection after the process has closed,
                    378: the other end is sent a reset, preventing connections from
                    379: hanging in CLOSE_WAIT on one end and FIN_WAIT_2 on the other.
                    380: (4.2BSD contained code to do this, but it was never executed
                    381: because such input packets had already been dropped
                    382: as being outside of the receive window.)
                    383: A timer is now started upon entering
                    384: FIN_WAIT_2 state if the local user has closed, closing the connection
                    385: if the final FIN is not received within a reasonable time.
                    386: Half-open connections are now reset more reliably; there were circumstances
                    387: under which one end could be rebooted, and new connection requests
                    388: that used the same port number might not receive a reset.
                    389: The urgent-data code was modified to remember which data had
                    390: already been read by the user, avoiding possible confusion if two
                    391: urgent-data signals were received close together.
                    392: Another change was made specifically for connections with a TAC.
                    393: The TAC doesn't fill in the window field on its initial packet (SYN),
                    394: and the apparent window is random.
                    395: There is some question as to the validity of the window field
                    396: if the packet does not have ACK set,
                    397: and therefore TCP was changed to ignore the window information
                    398: on those packets.
                    399: .XP tcp_output.c
                    400: The advertised window is never allowed to shrink,
                    401: in correspondence with the earlier change in the input handler.
                    402: The retransmit code was changed to check for shrinking windows,
                    403: updating the connection state rather than timing out
                    404: while waiting for acknowledgement.
                    405: The modifications to the send policy described above are largely
                    406: within this file.
                    407: .XP tcp_timer.c
                    408: The timer routines were changed to allow a longer wait for acknowledgements.
                    409: (Previously, TCP would generally time out before the routing protocol
                    410: had changed routes.)
                    411: .NH 2
                    412: UDP
                    413: .PP
                    414: An error in the checksumming of output UDP packets was corrected.
                    415: Checksums are now checked by default, unless the COMPAT_42 configuration
                    416: option is specified; it is provided to allow communication with the 4.2BSD UDP
                    417: implementation, which generates incorrect checksums.
                    418: When UDP datagrams are received for a port at which no process is listening,
                    419: ICMP unreachable messages are sent in response unless the input packet
                    420: was a broadcast.
                    421: The size of the receive buffer was increased, as several large datagrams
                    422: and their attached addresses could otherwise fill the buffer.
                    423: The time-to-live of output datagrams was reduced from 255
                    424: to 30.
                    425: UDP uses its own \fIctlinput\fP routine for handling of ICMP errors,
                    426: so that errors may be reported to the sender without closing the socket.
                    427: .NH 2
                    428: Address Resolution Protocol
                    429: .PP
                    430: The address resolution protocol has been generalized somewhat.
                    431: It was specific to IP on 10\ Mb/s Ethernet; it now handles multiple
                    432: protocols on 10 Mb/s Ethernet and could easily be adapted to other
                    433: hardware as well.
                    434: This change was made while adding ARP resolution
                    435: of trailer protocol addresses.
                    436: Hosts desiring to receive trailer
                    437: encapsulations must now indicate that by the use of ARP.  This allows
                    438: trailers to be used between cooperating 4.3 machines while using
                    439: non-trailer encapsulations with other hosts.
                    440: The negotiation need not be symmetrical: a VAX may request trailers,
                    441: for example, and a SUN may note this and send trailer packets
                    442: to the VAX without itself requesting trailers.
                    443: This change requires modifications to the 10 Mb/s Ethernet drivers,
                    444: which must provide an additional argument to \fIarpresolve\fP,
                    445: a pointer for the additional return value indicating whether trailer
                    446: encapsulations may be sent.
                    447: With this change, the IFF_NOTRAILERS flag on each interface is interpreted
                    448: to mean that trailers should not be requested.
                    449: Modifications to ARP from SUN Microsystems add \fIioctl\fP operations
                    450: to examine and modify entries in the ARP address translation table, 
                    451: and to allow ARP translations to be ``published.''
                    452: When a later request arrives for the Ethernet address of a host
                    453: whose translation is in the table and is marked as published,
                    454: the request is answered on behalf of that host.
                    455: Those modifications superseded the ``oldmap'' algorithmic translation
                    456: from IP addresses, which has been removed.
                    457: Packets are not forwarded to the loopback interface if it is not marked
                    458: up, and a bug causing an mbuf to be freed twice
                    459: if the loopback output fails was corrected.
                    460: ARP complains if a host lists the broadcast address as its Ethernet address.
                    461: The ARP tables were enlarged to reflect larger network configurations
                    462: now in use.
                    463: A new function for use in driver messages, \fIether_sprintf\fP,
                    464: formats a 48-bit Ethernet address and returns a pointer to the resulting string.
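A routine of the kind just described might look like the following userland sketch. The name ether_sprintf_sketch is hypothetical, and the output format (six colon-separated pairs of lowercase hexadecimal digits) is an assumption rather than something stated in the text above. As with any routine that returns a pointer to a static buffer, each call overwrites the previous result.

```c
#include <stdint.h>

/* Hypothetical stand-in for a routine like ether_sprintf: format the
 * six bytes at ap as "xx:xx:xx:xx:xx:xx" in a static buffer and
 * return a pointer to that buffer. */
char *ether_sprintf_sketch(const uint8_t *ap)
{
    static char buf[18];        /* 6 * 2 digits + 5 colons + NUL */
    static const char digits[] = "0123456789abcdef";
    char *cp = buf;
    int i;

    for (i = 0; i < 6; i++) {
        *cp++ = digits[ap[i] >> 4];
        *cp++ = digits[ap[i] & 0xf];
        *cp++ = ':';
    }
    *--cp = '\0';               /* replace the final colon */
    return buf;
}
```

A driver message would pass the returned pointer to a %s conversion and must copy the string if it is needed after the next call.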
                    465: .NH 2
                    466: IMP support
                    467: .PP
                    468: The support facilities for connections to an 1822 (or X.25) IMP port
                    469: (\fB/sys/netimp\fP)
                    470: have had several bug fixes and one extension.
                    471: Unit numbers are now checked more carefully during autoconfiguration.
                    472: Code from BRL was installed to support class B and C networks.
                    473: Error packets received from the IMP, such as Host Dead, are queued
                    474: in the interrupt handler for reprocessing from a software interrupt,
                    475: avoiding state transitions in the protocols at priorities above \fIsplnet\fP.
                    476: The host-dead timer is no longer restarted when attempting new output,
                    477: as a persistent sender could otherwise prevent new output from being attempted
                    478: once a host was reported down.
                    479: The network number is always taken from the address
                    480: configured for the interface at boot time;
                    481: network 10 is no longer assumed.
                    482: A timer is used to prevent blocking if RFNM messages from the IMP are lost.
                    483: A race was fixed when freeing mbufs containing host table entries,
                    484: as the mbuf had been used after it was freed.
