A Comprehensive Taxonomy and System Architecture for ATM
Host-Network Interfaces
Bo-kyun Na
May 12, 2000
Current Network Systems
The current networks are:
- Service dependent: Telephony, computer networks
- Inflexible: In adapting to changes or new service requirements
- Inefficient: In resource sharing between networks (Packet switching)
The single service-independent network is:
- Flexible and future-safe in bandwidth (Adaptive network)
- Efficient in the use of available resources (Statistical switching)
- Less expensive (A single network lowers manufacturing, operating,
and maintaining costs)
Review of Host-Network Interface Systems
Definition: A network I/O subsystem that connects a host to the
network, considered here with the TCP/IP protocol suites.
Figure
Problems of Current HNI Systems
- Poor protocol implementations:
- Complicated and slow transport protocol
- Poor interface between applications and HNI:
- Unnecessary data copying
- Low-speed system bus compared to CPU bus
- High overhead with OS functions and context switchings:
- Process scheduling
- Entity management - timers, buffers, and connection states
- Interprocess communication between application and
protocol-implementing processes
- Interrupts
- Context switching
Conventional Classifications of HNIs
- By the number of data copies required
- By the processing capability of HNIs
- By the memory access
- By the host access
The Number of Data Transfer Copies
- Peter A. Steenkiste and colleagues
- According to the system buffer, host-network interfaces are
classified as:
- Traditional 5-access data flow,
- Outboard buffering,
- DMA-based host-network interface architectures.
- Similarly, C. Dalton et al. analyze TCP/IP protocol
implementation in terms of minimizing data movement in memory
(the Afterburner HNI).
The Number of Data Transfer Copies
Figure
Limitations of Current Taxonomies
- Comparative power
- Only one category and non-hierarchical
- Predictive power
- No inference of similarities or differences
- Comprehensive power
- Covers only closely related architectures
- LAN, WAN, MAN, tightly-coupled or loosely-coupled systems
- Systematic power
- No combined view of the hardware and the software
functional units
Research Objectives and Approaches
HNI taxonomy is motivated and directed to:
- Construct an architectural knowledge base as a foundation (Descriptive)
- Understand the position of architectures relative to one another
(Comparative or hierarchical)
- Predict a new design optimized for the service requirements
(Predictive)
- Understand the relations between architectural components and
protocol processing functional units (Systematic)
- Accordingly, construct a comprehensive atlas of HNIs within the
unified framework of the taxonomy (Comprehensive)
Skillicorn's Taxonomy on Computer Systems
- Three reasons for classifying architectures:
- To understand what has already been achieved
- To reveal possible configurations to suggest other possibilities
- To allow useful performance models to be built and used
- More descriptive classification of the computer architectures
- Extends Flynn's taxonomy in the multiprocessor category
- A two-level hierarchy
- Upper level classifies computer architectures based on
the numbers of instruction and data processors and the
interconnections between them: IP-IM, DP-DM, IP-DP, and DP-DP
- Lower level distinguishes variants even more precisely,
based on the state-machine level of the processors
System Model
A two-context taxonomy is taken as the approach:
- Architectural context: The HNI architectures and their limitations
- Main components required to transfer data between host
and network
- Host CPU
- Main memory
- Cache memory
- Protocol processor or controller
- Network buffer
- Switch
- Connections and the interactions between these components
- PMS notation to simplify the block diagram of HNIs
- A system as an interconnected set of components or individual
devices associated with a set of operations.
- Protocol context: The implementation of network protocol functions
- Identifying all the software functions required to transfer
data (descriptive power)
- Mapping the software functions into the components identified
at the architectural classification (systematic power)
- Identifying all possible techniques to develop a HNI
(predictive power)
- Protocol functional units:
- Routing/switching
- Data copying
- Flow control
- Error handling
- Packetization/Depacketization
Performance Model
- Rearranging protocol functional units so that they take less time
(parallelization of protocol implementations)
- Reducing unnecessary data copying in main memory
- No data copying to the system communication buffer in kernel space
- Fast host interfacing (low-speed system bus vs. high-speed CPU bus)
- Increase the speed of the system bus
- Transfer through another medium (host register, dual-ported memory)
- Hardware implementation of protocol functional units
- On-board protocol processor
- CRC checking or Flow control
- HNI implements protocol functional units
- Avoiding system calls by OS and context switching
Illustrative Example: Jaguar
- University of California at Berkeley
- Providing Java applications with efficient access to system resources
- Retaining the protection of the Java environment
- Allowing Java applications to directly manipulate memory outside
of the Java heap, such as communication buffers
- JaguarVIA, a Java interface to the Berkeley VIA
communication layer
- Implemented with an Intel Pentium II, Linux OS, and a Myrinet network
interface board
- Showed a performance improvement by optimizing the API alone
Continue...
Figure
Continue...
Figure
Syntactic Notation of Conventional HNIs
This table shows our syntactic notation for host-network
interface designs that have been reported in the literature.
Figure
Design of Intelligent HNI
- Highest performance
- Reducing the heavy load on the host
- in protocol processing
- on the system peripheral bus
- Freeing the host to take on jobs that require more processing
Design Requirements
- In architectural context,
- Protocol processing in HNI
- A better connection component than a system peripheral bus
- CRC checking by hardware component while data being transferred
- In protocol-implementation context,
- Implementation in a pipelined or parallelized way
- Routing - source and destination addresses
- Data copying - direct transfer between MH and the network FIFO
- Error handling - high-speed CRC check for cell header and payload
- Flow control - on-board protocol processor counts cells
- Packetization - fixed-size cells built by PN; the message stays
as local data in the host
Taxonomy-based Design Approach
- In architectural context,
- PH - MN : Burst transfer by cache controller (Bcc) for
status and control information, start address and length of data, and
connection establishment parameters.
- MC - MN : Bcc for data transfer.
- In protocol-implementation context,
- The protocol processor packetizes data into fixed-size cells.
- Virtual circuit switching (ATM).
- User space memory - network FIFO data copying scheme.
- CRC-8 and CRC-32 by hardware, sequence number, MID, length
indicator of data. FEC + ARQ for error correction.
- Rate-based flow control.
Diagram for architectural context of intelligent HNI
Figure
Diagram for protocol context of intelligent HNI
Figure
Formal Performance Metrics: 1
Figure
Formal Performance Metrics: 2
- Tmp: Clock cycle time of CPU data bus in nanoseconds.
For Alpha 21164, it is 33 MHz or 30.3 nsec.
- Tmc: Clock cycle time of adaptor control
processor. MC88110 has 25 nsec data bus clock cycle time for 64 bits.
- Lfmp: The number of clock cycles per load from cache to
host CPU, or from network buffer to host CPU.
- Lfmc: The number of clock cycles per load from network
buffer to network-adaptor control processor. It takes 3-1-1-1 data
transfer clock cycles.
- Lsmp: The number of clock cycles per load from main memory
to host CPU. For typical EDO (Extended Data Out) dynamic RAMs,
7-2-2-2 data transfer clock cycles in a pipelined-burst mode.
- Lsmc: The number of clock cycles per load through the system
bus. It is assumed to have the same value as Lsmp.
- Sfmp: The number of clock cycles per store to the network
buffer, or to the cache from the host CPU.
- Sfmc: The number of clock cycles per store to one cell FIFO
in the network adaptor.
- Ssmc: The number of clock cycles per store to main memory
from host CPU or to cell FIFO from main memory in conventional
architecture.
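These metrics can be combined into a one-cell throughput estimate. The helper below is a sketch (the function name and structure are mine, not part of the original analysis); it reproduces the intelligent-HNI transmit figure on the following slide, using the cycle counts listed there.

```python
# Estimate one-cell transmit throughput from the metric definitions above.
# Cycle counts follow the intelligent-HNI transmit slide: 3 slow loads
# (7-2-2 = 11 cycles) + 3 fast stores (3-1-1 = 5 cycles) on the host side,
# and 6 fast loads (8 cycles) + 7 fast stores (9 cycles) + 6 computation
# cycles on the adaptor side.
CELL_PAYLOAD_BITS = 48 * 8  # one 48-byte ATM cell payload

def cell_throughput(host_cycles, t_mp, adaptor_cycles, t_mc):
    """Bits per second for one cell payload, given the total cycle counts
    and the host (t_mp) and adaptor (t_mc) clock cycle times in seconds."""
    cell_time = host_cycles * t_mp + adaptor_cycles * t_mc
    return CELL_PAYLOAD_BITS / cell_time

# Intelligent HNI: (11 + 5) host cycles at 30.3 ns, (8 + 9 + 6) adaptor
# cycles at 25 ns.
tput = cell_throughput(11 + 5, 30.3e-9, 8 + 9 + 6, 25e-9)
print(round(tput / 1e6, 1), "Mbit/s")  # ~362.3 Mbit/s
```

The same helper, fed with the OSIRIS cycle counts from the later slide, shows how the slower bus path lowers throughput.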
Transmitting Performance of Intelligent HNI
- For 48 bytes of data:
- 3 slow loads (MH, 128 bits/load) in 11 (7-2-2) host CPU clock cycles
- 3 fast stores (128 bits/store) in 5 (3-1-1) clock cycles to MN
- 6 fast loads (64 bits) in 8 (3-1-1-1-1-1) clock cycles in
pipelined burst
- 7 fast stores (64 bits) in 9 (8 payload and 1 header) clock cycles
- 5 arithmetic and 1 branch instructions
Throughput = (48 byte/cell × 8 bit/byte) / [(3 Lsmp + 3 Sfmp) Tmp + (6 Lfmc + 7 Sfmc + 6 computation) Tmc]
           = 384 / [(11 + 5)(30.3 × 10^-9) + (8 + 9 + 6)(25 × 10^-9)]
           ≈ 362 Mbit/s
Transmitting Performance of Conventional OSIRIS HNI
- PN loads control information first, then loads and stores the
cell payload directly into the cell FIFO
- 1 slow load and 1 slow store for control information
- 6 slow loads for data transfer in 7-2-2-2-7-2 clock cycles,
64 bits wide
- 7 fast stores - 6 for data and 1 for the header
- 5 arithmetic and 1 branch instructions
Throughput = (48 byte/cell × 8 bit/byte) / [(Lsmp + Ssmp) Tmp + (6 Lsmc + 7 Sfmc + 6 computation) Tmc]
           = 384 / [(7 + 7)(30.3 × 10^-9) + (22 + 9 + 6)(25 × 10^-9)]
           ≈ 285 Mbit/s
Segmentation Performance of intelligent HNI: worst case
Figure
Segmentation Performance of intelligent HNI: best case
Figure
Receiving Performance of intelligent HNI
- In the intelligent HNI for 48 bytes:
- 6 fast loads (3-1-1-1-1-1)
- 6 fast stores (3-1-1-1-1-1)
- 8 arithmetic and 4 branch instructions
Throughput = (48 byte/cell × 8 bit/byte) / [(6 Lfmc + 6 Sfmc + 12 computation) Tmc]
           = 384 / [(8 + 8 + 12)(25 × 10^-9)]
           ≈ 549 Mbit/s
Receiving Performance of OSIRIS HNI
- In OSIRIS for 48 bytes:
- 6 fast loads and stores to MN
- 8 arithmetic and 4 branch instructions
- 6 slow loads of DMA data transfers (64 bits/load)
- 3 fast stores to main memory (128 bits/store)
- The 6 fast stores to MN can be ignored owing to the dual-ported
memory feature
Throughput = (48 byte/cell × 8 bit/byte) / [3 Sfmp Tmp + (6 Lsmc + 6 Sfmc + 12 computation) Tmc]
           = 384 / [5 × 30.3 × 10^-9 + (22 + 8 + 12)(25 × 10^-9)]
           ≈ 320 Mbit/s
Reassembly Performance of intelligent HNI
Figure
Answer to Research Questions
Future Work
- Extension of the protocol-implementation context to include the API
- The overhead of the OS affects API implementations and causes the
whole communication environment to slow down, even though the
HNI has low latency and the host runs applications with high computing
power.
- Thus, the API should be treated as one important functional unit
in the protocol-implementation context of the taxonomic system, to
analyze HNIs more accurately.
- One example - JaguarVIA
- a Java API to improve the performance of communication and
I/O methods in Java programming.
Skillicorn's Taxonomy
Using his taxonomic scheme, Skillicorn identified a single table
of 28 classes:
class | nIP | nDP | IP-DP | IP-IM | DP-DM | DP-DP | Name
3     | 0   | n   | -     | -     | n-n   | n×n   | loosely coupled dataflow
4     | 0   | n   | -     | -     | n×n   | -     | tightly coupled dataflow
6     | 1   | 1   | 1-1   | 1-1   | 1-1   | -     | von Neumann uniprocessor
8     | 1   | n   | 1-n   | 1-1   | n-n   | n×n   | type 1 array processor
9     | 1   | n   | 1-n   | 1-1   | n×n   | -     | type 2 array processor
13    | n   | n   | n-n   | n-n   | n-n   | -     | separate von Neumann uniprocessors
14    | n   | n   | n-n   | n-n   | n-n   | n×n   | loosely coupled von Neumann
15    | n   | n   | n-n   | n-n   | n×n   | -     | tightly coupled von Neumann
The Processing Capability of HNI's
- E. Cooper et al.
- Owing to differences in performance and implementation
complexity, classify HNIs by:
- Protocol processor on board
- Memory access mode (user and kernel spaces in main memory)
- DMA capability
- Simple HNIs
- Host as protocol processor
- Memory access to both user and kernel spaces
- Intelligent HNIs
- On-board protocol processor
- Memory access only to user space
- DMA capability
Memory Access
According to the address-mapping spaces and data-accessing mechanisms,
David Banks et al. classify HNIs as mapping:
- In memory space
- In I/O space
- Programmed I/O
- Direct Memory Access (DMA)
- Burst mode transfer by memory controller
Host Access
D. Henry and C. Joerg propose four categories:
- OS-level DMA based
- User-level memory mapped
- User-level register mapped
- Hardwired
Architectural Context
- The characteristics are from the logical structures and
connections of hosts, host-network interfaces, and a network.
- Principal components are:
- Host CPU
- Main memory
- Cache memory
- Protocol processor or controller
- Network buffer
- Switch
- By PMS (processor-memory-switch) notations:
- The relations between processor, memory, and I/O devices.
- A system as an interconnected set of components or individual
devices, associated with a set of operations.
continue...
- The host CPU unit executes the instructions of the user
application, communicates with the protocol processor when the
application requires communication, and transforms data, usually in
ways that correspond to basic arithmetic operations.
- Main memory unit is an intelligent storage device that passes
data to and from the host CPU.
- Cache unit consists of a small fast memory that acts as a buffer
between the main memory and the host CPU.
- Network buffer unit is a staging and speed-matching area for
data in transit between the host and the network. It consists of a
network buffer memory and network FIFOs. Network buffer memory is the
storage for transport-layer data (the so-called message) and information
related to control and management parameters, while network FIFOs
are two sets of registers, for sending and receiving. It also provides
the protocol processor with contention-free memory access to the
packet data.
- Protocol processor unit manages packet processing and various
bookkeeping functions associated with the protocol.
- Switch unit provides connectivity between the other functional
units via programmed I/O, DMA (Direct Memory Access), burst
transfer by a memory controller, or register access.
Protocol-Implementation Context
- Packetization/Depacketization
- Data copying
- Flow control
- Error handling
- Routing/switching
Packetization/Depacketization
- Data is bundled and transmitted as packets of bit streams, with
some overhead bits to protect against network errors
- For a message larger than the packet data size:
- The OS or a protocol processing unit fragments the message into a
series of packet data units
- The receiving process reassembles the packets into the message
- Classified by who performs packetization and by packet length,
as follows:
- A variable-sized data unit packetized by the host.
- A variable-sized data unit packetized by the HNI.
- A fixed-sized data unit packetized by the host.
- A fixed-sized data unit packetized by the HNI.
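As a sketch of the last case (fixed-size units packetized by the HNI), the fragment below segments a message into 48-byte cell payloads and reassembles it. The zero-padding-plus-length scheme is a simplification of my own; the real AAL-5 format carries the length in a CPCS trailer.

```python
CELL_SIZE = 48  # ATM cell payload size in bytes

def segment(message: bytes):
    """Split a message into fixed-size cell payloads, zero-padding the
    last cell and remembering the original length for reassembly."""
    cells = []
    for i in range(0, len(message), CELL_SIZE):
        chunk = message[i:i + CELL_SIZE]
        cells.append(chunk.ljust(CELL_SIZE, b"\x00"))  # pad the final cell
    return cells, len(message)

def reassemble(cells, length):
    """Concatenate cell payloads and strip the padding."""
    return b"".join(cells)[:length]

msg = b"x" * 100          # a 100-byte message needs 3 cells
cells, n = segment(msg)
assert all(len(c) == CELL_SIZE for c in cells)
assert reassemble(cells, n) == msg
```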
Data Copying
- Memory space accessing - a user space and a kernel space
- Data Path - Programmed I/O, DMA, burst transfer by memory
controller
- Classification:
- 0-copy from user space on main memory to network.
- 1-copy from user space of main memory to network buffer memory,
then to network.
- 1-copy from user space of main memory to kernel space of main
memory.
- 2-copy from user space to kernel space of main memory then to
network buffer memory.
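The copy classification can be made concrete by counting main-memory bus crossings per byte. This is a hypothetical accounting of my own (one read plus one write per copy, plus the final read that feeds the network), meant only to show why extra copies dominate the cost:

```python
# Bus crossings per byte for each copy scheme: each in-memory copy is one
# read from and one write to main memory, and the final move toward the
# network always costs one more read.
def bus_crossings(copies_in_main_memory):
    reads = copies_in_main_memory + 1   # each copy reads, plus the read
                                        # that feeds the network
    writes = copies_in_main_memory      # each copy writes a new buffer
    return reads + writes

assert bus_crossings(0) == 1  # 0-copy: DMA straight from user space
assert bus_crossings(1) == 3  # 1-copy: user -> kernel (or buffer), then out
assert bus_crossings(2) == 5  # 2-copy: user -> kernel -> network buffer
```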
0-copy from user space on main memory to network
- The host processor informs data address and length to the
network protocol processor.
- The network protocol processor initiates DMA controller to
transfer data from a user space of main memory to network.
Figure
1-copy from user space of main memory to network buffer memory,
then to network
After data is copied to network buffer memory, the HNI performs
further protocol processing, such as CRC computation and composition
of header fields.
Figure
1-copy from user space of main memory to kernel space of main
memory
- To process protocol, data is copied to kernel space.
- Good security and protection of data between users.
Figure
2-copy from user space to kernel space of main memory then to
network buffer memory
Transport protocol processing on kernel space except CRC
checking which is supported on host-network interface.
Figure
Flow Control
- Keeping the offered data load just high enough to achieve a
throughput very close to the capacity of the resources
- Receiver flow control schemes (RFC):
- A receiving window by acknowledgements and sequence numbers
- Software-based for counting of the number of received packets
- Hardware-based for counting of the number of received packets
- Hardware-based measurement of the inter-packet arrival time
- Accepting/discarding packets based on buffer availability
- State-reporting schemes (SR):
- Window-based acknowledgements
- The current inter-packet arrival time or burst size of packets
- Start/stop control signal
- Transmitter flow control (TFC):
- Sending packets based on current window size
- Sending packets based on the current acceptable transmission rate
- Sending packets based on start/stop control signal
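A minimal sketch of the window-based transmitter scheme above (class and method names are my own; a real transport keeps far more state): the sender transmits while the window is open and reopens it on acknowledgements.

```python
class WindowedSender:
    """Toy sliding-window transmitter: sends while the number of
    unacknowledged packets stays below the window size, stalls otherwise."""
    def __init__(self, window):
        self.window = window
        self.next_seq = 0
        self.acked = -1  # highest cumulative acknowledgement received

    def can_send(self):
        return self.next_seq - self.acked - 1 < self.window

    def send(self):
        assert self.can_send()
        seq = self.next_seq
        self.next_seq += 1
        return seq

    def on_ack(self, seq):  # cumulative acknowledgement from the receiver
        self.acked = max(self.acked, seq)

s = WindowedSender(window=3)
for _ in range(3):
    s.send()
assert not s.can_send()   # window exhausted: transmitter stalls
s.on_ack(0)               # receiver acknowledges packet 0
assert s.can_send()       # window slides open again
```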
Error Handling
- Error detection, and error correction when the network is not
reliable.
- Error-detection functions (D):
- Cyclic Redundancy Check (CRC)
- Type: CRC-8, CRC-10, CRC-12, CRC-16, CRC-32, CRC-CCITT.
- Domain: A control part, A data part, or A control + data
part.
- Length indicator of data.
- Sequence number.
- Multiplexing identifier (MID).
- Error-correction functions (C):
- Forward Error Correction (FEC).
- Backward Error Correction/Automatic Repeat Request (ARQ).
- FEC + ARQ.
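The CRC detection functions above can be illustrated with a bitwise CRC-8. The sketch below uses the generator x^8 + x^2 + x + 1, the polynomial of the ATM header error check, though without the HEC's final XOR with 0x55; the header byte values are example numbers, not from the original.

```python
POLY = 0x07  # x^8 + x^2 + x + 1, the ATM HEC generator polynomial

def crc8(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-8: zero initial value, no reflection, no final XOR."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # shift left; XOR in the polynomial when a 1 falls off the top
            crc = ((crc << 1) ^ POLY) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

header = bytes([0x00, 0x00, 0x01, 0x40])  # 4 header octets (example values)
check = crc8(header)
# Appending the check byte makes the remainder zero -- the receiver's test.
assert crc8(header + bytes([check])) == 0
```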
Routing/Switching
- Needed when an application process sets up a connection to another
- Address-generating (or address-resolution) function
- Source address, destination address(es), and intermediate-node
addresses if needed
- At a minimum: port numbers, host numbers, network numbers, and
frame numbers in the TCP/IP suites
- Classification:
- Circuit switching
- Source routing/packet switching
- Internode routing/packet switching
- Virtual circuit switching (statistical switching)
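The virtual-circuit idea can be sketched as follows (table contents and names are hypothetical; real ATM labels are VPI/VCI pairs): connection setup installs a per-link label mapping, after which each cell is forwarded by a constant-time table lookup instead of full addresses.

```python
# Each switch keeps a table keyed by (input port, incoming VCI) that yields
# (output port, outgoing VCI) -- installed once at connection setup.
class VCSwitch:
    def __init__(self):
        self.table = {}

    def setup(self, in_port, in_vci, out_port, out_vci):
        """Install one hop of a virtual circuit."""
        self.table[(in_port, in_vci)] = (out_port, out_vci)

    def forward(self, in_port, cell):
        """Label-swap a (vci, payload) cell and pick its output port."""
        vci, payload = cell
        out_port, out_vci = self.table[(in_port, vci)]
        return out_port, (out_vci, payload)

sw = VCSwitch()
sw.setup(in_port=1, in_vci=42, out_port=3, out_vci=7)
port, cell = sw.forward(1, (42, b"payload"))
assert (port, cell) == (3, (7, b"payload"))
```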
Illustrative Example: AURORA
- AURORA - an experimental wide-area network testbed
- Operating at near gigabit-per-second bandwidths
- The HNI converts data between the network format (ATM cells) and
a format useful to the host (transport-level message - TPDU)
- OSIRIS for TURBOchannel bus on DECstation 5000
- ORBIT for MICROchannel bus on the IBM RS/6000
Continue...
Figure
Continue...
Figure
Simple HNI Design
- ATM network with AAL-5
- A cost-effective system: minimal hardware support
- Easy adjustment to changes in the environment
- e.g., a low-speed network or an environment requiring
multiple-protocol support
Design Requirements
In architectural context:
- Connection component - bandwidth comparable to that of the
system bus
- Fast processing of the cell-header error checking
In protocol context:
- Routing - only the source and destination addresses
- Data copying - main memory to network FIFO
- Error handling - in the ATM network: CRC checking, sequence
numbering, MID, data-length indicator, and FEC + ARQ
- Flow control - simple implementation
- Packetization - fixed-size payload
Taxonomy-based Design Approach
- In the architectural context,
- Connection between main memory and network FIFO by direct memory
access: DMA for MH - FIFO.
- In protocol implementation context,
- Virtual circuit switching.
- Zero data copying between user space of main memory and network FIFO.
- CRC-8 by hardware for header error check, CRC-10 by software for
data error check, sequence number, MID, and length indicator of data.
- Window-based and rate-based flow control
Diagram for architectural context of simple HNI
Figure
Diagram for protocol context of simple HNI
Figure
Transmitting Performance of simple HNI
To send 1 payload of data (384 bits):
6 slow loads and 6 slow stores through the 64-bit PCI system bus.
For the header information:
1 slow store (less than 1 fast load is needed for the header information
and CRC field), plus 5 arithmetic and 1 branch instructions.
Throughput = (48 byte/cell × 8 bit/byte) / [(6 Lsmp + 7 Ssmp + 6 computation) Tmp]
           = 384 / [(22 + 22 + 6)(30.3 × 10^-9)]
           ≈ 253 Mbit/s
Segmentation Performance of simple HNI
Figure
Receiving Performance of simple HNI
Reassembly by a lightly loaded host for an ATM cell:
6 loads (7-2-2-2-7-2) at 64-bit DMA bandwidth; 3 stores (7-2-2) at
128-bit memory bandwidth.
Throughput = (48 byte/cell × 8 bit/byte) / [(6 Lsmp + 3 Ssmp + 12 computation) Tmp]
           = 384 / [(22 + 11 + 12)(30.3 × 10^-9)]
           ≈ 282 Mbit/s
In a heavily loaded host, 3 additional slow loads from main memory
are required:

Throughput = (48 byte/cell × 8 bit/byte) / [(6 Lsmp + 3 Lsmp + 3 Ssmp + 12 computation) Tmp]
           = 384 / [(22 + 11 + 11 + 12)(30.3 × 10^-9)]
           ≈ 226 Mbit/s
Reassembly Performance of simple HNI
Figure