About RTP and the Audio-Video Transport Working Group: Some Frequently Asked Questions


Some Frequently Asked Questions

Is RTP a transport protocol or a kind of application protocol?

RTP has important properties of a transport protocol: it runs on end systems and it provides demultiplexing. It differs from transport protocols like TCP in that it (currently) does not offer any form of reliability or protocol-defined flow/congestion control. However, it provides the necessary hooks for adding reliability, where appropriate, and flow/congestion control. Some like to refer to this property as application-level framing (see D. Clark and D. Tennenhouse, "Architectural considerations for a new generation of protocols", SIGCOMM'90, Philadelphia). So far, RTP has mostly been implemented within applications, but that has no bearing on its role: TCP is still a transport protocol even if it is implemented as part of an application rather than the operating system kernel.

RTP does not ensure real-time delivery. So how come it is called a real-time protocol?

No end-to-end protocol, including RTP, can ensure in-time delivery. This always requires the support of lower layers that actually have control over resources in switches and routers. RTP provides functionality suited for carrying real-time content, e.g., a timestamp and control mechanisms for synchronizing different streams with timing properties.
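As an illustration of the timing information mentioned above, the following minimal sketch reads the fixed 12-byte header of an RTP version 2 packet and converts its timestamp into elapsed media time. It assumes the header layout given in the RTP specification; the names `RtpHeader`, `parse_rtp_header` and `media_time_seconds` are hypothetical, and the clock rate must be supplied by the caller since it depends on the payload format.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Parse the 12-byte fixed header of an RTP version 2 packet."""
    if len(packet) < 12:
        raise ValueError("packet shorter than the 12-byte fixed RTP header")
    # Network byte order: two flag bytes, 16-bit sequence number,
    # 32-bit timestamp, 32-bit synchronization source identifier.
    first, second, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=first >> 6,
        padding=bool(first & 0x20),
        extension=bool(first & 0x10),
        csrc_count=first & 0x0F,
        marker=bool(second & 0x80),
        payload_type=second & 0x7F,
        sequence_number=seq,
        timestamp=timestamp,
        ssrc=ssrc,
    )


def media_time_seconds(timestamp: int, first_timestamp: int, clock_rate: int) -> float:
    """Elapsed media time since the first packet, allowing for 32-bit wraparound."""
    return ((timestamp - first_timestamp) & 0xFFFFFFFF) / clock_rate
```

A receiver would typically use the elapsed media time to schedule playout; the timestamp says when the content belongs, not when the network will deliver it.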

Is RTP an unreliable protocol? Are there any mechanisms provided for error recovery in RTP?

As currently defined, RTP does not define any mechanisms for recovering from packet loss. Such mechanisms are likely to be highly dependent on the packet content. For example, for audio, adding low-bit-rate redundancy, offset in time, has been suggested. For other applications, retransmission of lost packets may be appropriate. This requires no additions to RTP: RTP probably has the necessary header information (like sequence numbers) for some forms of error recovery by retransmission.
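To illustrate how sequence numbers could support such recovery, here is a minimal sketch of a receiver that tracks which packets are still missing. It assumes 16-bit sequence numbers that wrap around, as in the RTP header; the function name `missing_sequence_numbers` is hypothetical, and what the application does with the result (request a retransmission, conceal the loss) is outside RTP itself.

```python
def missing_sequence_numbers(received):
    """Return the sequence numbers skipped so far, in the order they were missed.

    `received` is the list of 16-bit RTP sequence numbers in arrival order.
    A packet that arrives late is removed from the missing list again.
    """
    missing = []
    expected = None
    for seq in received:
        if expected is None:
            expected = seq
        gap = (seq - expected) & 0xFFFF  # distance modulo 2**16
        if gap < 0x8000:
            # seq is at or ahead of what we expected: record what was skipped.
            for i in range(gap):
                missing.append((expected + i) & 0xFFFF)
            expected = (seq + 1) & 0xFFFF
        elif seq in missing:
            # seq is behind the expected value: a late packet filling a gap.
            missing.remove(seq)
    return missing
```

For example, `missing_sequence_numbers([65534, 65535, 1, 2])` reports packet 0 as missing across the wraparound.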

Can RTP run over IPng? ATM?

Yes. RTP contains no specific assumptions about the capabilities of the lower layers, except that they provide framing. It contains no network-layer addresses, so that RTP is not affected by addressing changes. Any additional lower-layer capabilities such as security or quality-of-service guarantees can obviously be used by applications employing RTP. There are several implementations of video tools that run RTP directly over AAL5. It should be noted that the RTCP CNAME field is currently based on the assumption that hosts have Internet-style domain names.

Why can't we just use TCP for audio and video?

For delivering audio and video for playback, TCP may be appropriate. Also, with sufficiently long buffering and adequate average throughput, near-real-time delivery using TCP can be successful, as practiced by the Netscape WWW browser. TCP may often run over highly lossy networks (e.g., the German X.25 network) with acceptable throughput, even though the uncompensated losses would make audio or video communication impossible.

However, for real-time delivery of audio and video, TCP and other reliable transport protocols such as XTP are inappropriate. The three main reasons are:

Reliable transmission is inappropriate for delay-sensitive data such as real-time audio and video. By the time the sender has discovered the missing packet and retransmitted it, at least one round-trip time, likely more, has elapsed. The receiver either has to wait for the retransmission, increasing delay and incurring an audible gap in playout, or discard the retransmitted packet, defeating the TCP mechanism. Standard TCP implementations force the receiver application to wait, so that packet losses would always yield increased delay. Note that a single packet lost repeatedly could drastically increase delay, which would persist at least until the end of the talkspurt.
TCP cannot support multicast.
The TCP congestion control mechanism decreases the congestion window when packet losses are detected ("slow start"). Audio and video, on the other hand, have "natural" rates that cannot be suddenly decreased without starving the receiver. For example, standard PCM audio requires 64 kb/s, plus any header overhead, and cannot be delivered in less than that. Video could be more easily throttled simply by slowing the acquisition of frames at the sender when the transmitter's send buffer is full, with the corresponding delay. The correct congestion response for these media is to change the audio/video encoding, video frame rate, or video image size at the transmitter, based, for example, on feedback received through RTCP receiver report packets.
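The congestion response described in the last item can be sketched as follows. This is only an illustration under stated assumptions: the encoding ladder, its bit rates other than 64 kb/s PCM, and the loss thresholds are made up for the example, and the function name is hypothetical. The one detail taken from RTCP is that a receiver report carries the loss fraction as an 8-bit value, the fraction multiplied by 256.

```python
# Illustrative encoding ladder (name, kb/s), ordered from highest to lowest rate.
# Only the 64 kb/s PCM figure comes from the text; the rest are assumptions.
ENCODINGS = [("PCM", 64), ("ADPCM", 32), ("GSM", 13)]


def next_encoding_index(current, fraction_lost_byte, high=0.05, low=0.01):
    """Pick the encoding to use after one RTCP receiver report.

    `fraction_lost_byte` is the 8-bit "fraction lost" field (loss * 256).
    `high` and `low` are assumed thresholds: above `high` the sender steps
    down to a cheaper encoding, below `low` it steps back up.
    """
    loss = fraction_lost_byte / 256
    if loss > high and current < len(ENCODINGS) - 1:
        return current + 1  # congestion: switch to a lower-rate encoding
    if loss < low and current > 0:
        return current - 1  # network has recovered: restore quality
    return current
```

Unlike a TCP window reduction, the change takes effect at the encoder on the time scale of RTCP reports, so the receiver keeps getting a continuous stream at a lower rate.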

An additional small disadvantage is that the TCP and XTP headers are larger than a UDP header (at least 20 bytes for TCP, 40 bytes for XTP 3.6, and 32 bytes for XTP 4.0, compared to 8 bytes). Also, these reliable transport protocols do not contain the timestamp and encoding information needed by the receiving application, so they cannot replace RTP. (They would not need the sequence number, as these protocols ensure that no losses or reordering take place.)
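To put header sizes in perspective, the following small calculation estimates the bit rate on the wire for 64 kb/s PCM audio. The 20 ms packetization interval, the 20-byte IPv4 header and the 12-byte fixed RTP header are assumptions added for this example; only the 8-byte UDP header and the 64 kb/s media rate come from the text above. The function name is hypothetical.

```python
def wire_rate_kbps(media_kbps, packet_ms, header_bytes):
    """Bit rate on the wire, in kb/s, including per-packet header overhead."""
    payload_bytes = media_kbps * 1000 / 8 * packet_ms / 1000
    packets_per_second = 1000 / packet_ms
    return (payload_bytes + header_bytes) * 8 * packets_per_second / 1000


# Assumed header sizes in bytes: IPv4 (20) + UDP (8) + fixed RTP header (12).
rate = wire_rate_kbps(media_kbps=64, packet_ms=20, header_bytes=20 + 8 + 12)
print(f"{rate:.1f} kb/s on the wire for 64 kb/s of PCM audio")
```

With short packets the headers are a noticeable share of the total, which is why even a few extra bytes per header matter for low-rate audio.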

While LANs often have sufficient bandwidth and low enough losses not to trigger these problems, TCP does not offer any advantages in that scenario either, except for the recovery from rare packet losses. Even in a LAN with no losses, TCP would suffer from the initial slow start delay.

Can't we just use XTP?

Many of the arguments parallel those in the previous section.

The question of the relationship of RTP and XTP appears to arise frequently. (This may simply be due to the word 'transport' in both protocol names.) However, XTP and RTP are not replacements for each other. XTP is designed as a general, configurable network and transport protocol for both reliable and unreliable data communications. RTP has no reliability mechanisms (although these could be added if desired for specific applications) and no flow control like the rate control in XTP. RTP is not intended for regular, reliable data transfer (where TCP or XTP might be used instead). For real-time data, where retransmission is usually not possible due to timing constraints, XTP would have to disable retransmission. Flow/congestion control for real-time data is most likely inappropriate as the rate of such sources is inherently given and not modifiable on the time-scale of transport-protocol flow control, as explained in the previous section. It should be noted that RTP supports mechanisms that allow a form of congestion control on longer time scales, e.g., by modifying the source encoder if network congestion is detected.

RTP has no protocol state by itself and can thus be used over either connection-less networks, such as IP/UDP, or connection-oriented networks, such as XTP, ST-II or ATM (AAL3/4 or AAL5). Many real-time multimedia applications use multicast with a large fan-out, e.g., several hundred to thousands for a lecture or concert. Connection-oriented protocols like XTP have difficulty scaling to such a large number of receivers.

XTP does not offer timing or content type (media) information, and thus would need these services, as offered by RTP. XTP provides no RTP-like direct feedback of the received quality-of-service, and thus, again, would have to "import" these from another protocol. Looking at existing applications using XTP for real-time services confirms that they need to add a layer similar in content to the RTP data part "between" XTP and the actual media.

Is there an RTP library or kernel implementation?

RTP (in particular, the data part) is tightly coupled to the application, so that a kernel or library implementation makes little sense. However, NeVoT can be used as a linkable library that implements RTP for an audio tool, with a documented API. The sources to NeVoT, rtpdump and vic also contain RTCP processing modules which should be usable in other applications with minor modifications. Note also that the specification itself contains numerous code fragments. (Most of the other applications are using older versions of RTP and thus should not be relied upon for developments.)

What are some of the differences between the VAT protocol and RTP?

The VAT protocol was originally implemented in the VAT audio tool and subsequently also in other audio tools such as NeVoT. It is currently the most frequently used packet format for audio on the MBONE. The VAT header format is only described in header files. (See the VAT and NeVoT sources for details.) Many aspects of RTP and the VAT protocol are similar, but RTP improves upon the VAT protocol in a number of ways:

The VAT protocol was designed for audio only, while RTP is specified for audio and video and may be suitable for other real-time applications.
RTP is designed to be protocol-independent and can be used with non-IP protocols (ATM AAL5, for example) as well as, say, IPv6.
RTP source identification simplifies the use of mixers and translators.
RTP has a number of features that simplify use of application-level encryption (padding, etc.).
The RTP header is extensible, should the need arise in the future.
The RTP header has a sequence number which simplifies accurate loss detection and measurement and the handling of images transmitted in several packets.
The RTCP SDES packets contain additional information that simplifies tracing of misbehaving sources, e.g., their email address or telephone number.
The RTCP SDES CNAME items simplify the construction of multimedia applications from independent media agents.
RTCP sender and receiver reports allow the implementation of adaptive applications, that is, applications where senders scale their bandwidth consumption based on network load.
RTCP sender and receiver reports allow monitoring of the quality of service within, say, a multimedia conference.

A new version of VAT (currently in alpha-test) also implements RTP. As soon as there are a sufficient number of stable applications using RTP, it is anticipated that most Internet MBONE audio/video events will be transmitted using RTP.

What are the differences between RTP version 1 and 2?

Version 1 is of historical interest only. Applications should not be written for it. RTP version 2 is not backwards compatible with version 1. If you care, you can find a definition of version 1 in an old Internet draft.

Are there related ITU efforts?

Media formats:

G.711: Audio encoding at 64 kb/s (mu-law and A-law).
H.261: Video encoding.
H.263: Video encoding, improved version of H.261.
H.324: Audio and video over POTS at less than 20 kb/s.

For conferencing over ISDN:

H.221: Frame structure for a 64 to 1920 kbit/s channel in audiovisual teleservices.
H.320: Framework for transmitting audio and video over circuit-switched digital networks (primarily ISDN).
H.323: H.320 over LAN.

For conference control, application and data sharing, there are a number of recommendations:

T.120: Introduction to the audiographics and audiovisual conferencing recommendations.
T.121: Generic application template.
T.122: Multipoint communication service for audiographics and audiovisual conferencing service definition.
T.123: Protocol stack for audiographics and audiovisual teleconference applications.
T.124: Generic conference control.
T.125: Multipoint communication service protocol specification.
T.126: Still image protocol specification.
T.127: Multipoint binary file transfer protocol.

Are there other efforts in using the Internet for real-time audio and video?

Too many, some may say. Versions 3.4 and earlier of vat, one of the early (recent) Internet audio applications, use mostly the same audio encodings as specified in the RTP profile, but a different protocol. There are a number of "Internet telephones" (usually for PCs) using proprietary audio coding and protocols, meant for point-to-point connections:

Speak Freely for Unix and Windows
Internet Phone (Vocaltec)
Digiphone
Quarterdeck
Internet Telephone Company
Telescape Intercom by Telescape
Audio/video directly over ATM: Nemesys
FreeVue audio and video

For near real-time distribution of audio, e.g., the on-demand delivery of music or news:

RealAudio (Microsoft Windows only)
Xing
AudioSoft

CuSeeMe (for Windows PC and the Macintosh) is a combined audio and video tool using reflectors rather than IP-level multicast.

RealAudio writes what currently applies to all tools:

If the packet loss is high, it may be due to a busy network. If this is the case, there is little you can do to remedy the situation other than to try connecting to the site at a later time.

A survey can be found at www.von.com.

Henning Schulzrinne