TCE Design Issues


The Thread-based Communication Environment (TCE) has been designed as a portable library that provides an efficient integration of message passing and multithreading. The multithreaded component satisfies application needs for a consistent and uniform way of supporting pseudo-concurrency (on uniprocessors) and true concurrency (on multiprocessors), and allows computation to overlap with I/O and communication operations on all kinds of architectures. The communication component of TCE provides an efficient way of communicating between processes scattered across distributed-memory nodes. Message passing capabilities are of crucial importance, as most parallel programs are written for distributed-memory parallel machines, and distributed systems in general have to exchange information by passing values (in contrast to passing references in shared-memory machines).

The fundamental design decision was the choice of a multithreading model. Many modern operating systems and operating environments furnish their own lightweight multithreading support, either in the kernel or in user space. The multithreading functionality of our library could therefore have been structured as a higher-level, uniform interface to the various threaded systems available on each platform. The problem with this approach is that some architectures offer no thread support at all, which would jeopardize the portability of our environment. Secondly, closely integrating communication with multithreading requires access to the internal structures of the multithreading implementation. For these reasons, we decided to develop our own multithreaded package, which can then be ported to all platforms of interest.

For the development of the TCE multithreading we took advantage of a simple, portable thread package already developed by David Keppel (QuickThreads). This package provides only primitives for thread creation, destruction and context switching. On top of this functionality, we have developed a prioritized scheduler with preemptive capabilities and semaphores for thread synchronization. All aspects of thread scheduling can be customized, and the interface is flexible enough that an application can build its own scheduler to better match its specific needs.
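
As a concrete illustration, the sketch below shows how an application might spawn threads at different priorities and yield control to the scheduler. The names tce_thread_create, tce_thread_yield and the TCE_PRIO_* constants are hypothetical placeholders chosen for this example, not the documented TCE interface.

    /* Hypothetical sketch only: tce_thread_create(), tce_thread_yield() and the
     * TCE_PRIO_* constants are assumed names, not the actual TCE calls. */
    #include <stdio.h>

    #define TCE_PRIO_LOW   1
    #define TCE_PRIO_HIGH 10

    typedef void (*tce_thread_fn)(void *arg);

    /* Assumed primitives layered on QuickThreads-style creation and context switch. */
    int  tce_thread_create(tce_thread_fn fn, void *arg, int priority);
    void tce_thread_yield(void);

    static void worker(void *arg)
    {
        const char *name = arg;
        for (int i = 0; i < 3; i++) {
            printf("%s: step %d\n", name, i);
            tce_thread_yield();   /* let the prioritized scheduler pick the next runnable thread */
        }
    }

    int main(void)
    {
        /* With a prioritized scheduler, the high-priority thread is favoured
         * whenever both are runnable. */
        tce_thread_create(worker, "urgent",     TCE_PRIO_HIGH);
        tce_thread_create(worker, "background", TCE_PRIO_LOW);
        return 0;
    }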

A common concern in developing a threaded system is the cost of an inter-thread context switch. This issue is important enough to have stimulated the development of user-level multithreading, which is much more efficient than its kernel-level counterpart. TCE threads are implemented entirely in user space, but to make context swaps even more efficient we used the concept of a "floating" scheduler. In this model there is no designated scheduler thread; instead, the scheduler is built as non-threaded functions executed on the current thread's stack. This saves one context switch at the expense of a more complex implementation.

Concurrent threads need a synchronization mechanism to ensure a desired order of execution. Different synchronization methods have been developed over the years, such as semaphores, mutexes with condition variables, and monitors. We have chosen semaphores for our implementation because they are more powerful than mutexes. Monitors are higher-level synchronization constructs intended to be used at the language level; we abandoned the idea of implementing them since our library cannot assume anything about the language it will be invoked from.
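
Since TCE's own semaphore calls are not reproduced here, the following sketch uses POSIX threads and semaphores to illustrate the kind of ordering constraint semaphores express: the consumer thread blocks until the producer signals that its result is ready.

    /* Illustration with POSIX threads and semaphores (not the TCE interface):
     * a semaphore initialized to 0 forces the consumer to wait for the producer. */
    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    static sem_t ready;     /* starts at 0: consumer blocks until producer posts */
    static int   result;

    static void *producer(void *arg)
    {
        (void)arg;
        result = 42;        /* produce a value */
        sem_post(&ready);   /* signal that the value is available */
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        sem_wait(&ready);   /* block until the producer has signalled */
        printf("consumed %d\n", result);
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        sem_init(&ready, 0, 0);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        sem_destroy(&ready);
        return 0;
    }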

While designing the communication subsystem for TCE we considered both the message passing and the remote procedure call (RPC) paradigms. The message passing style excels in performance, while RPC has more robust semantics. We finally decided on message passing as the communication mechanism, mainly because RPC introduces larger overhead, requires an extra acknowledgment message, and its calling semantics is language dependent. Message passing, on the other hand, is fast and more flexible for building parallel applications. RPC interactions can still be implemented by the application itself using multiple threads and the message passing primitives.
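
For instance, an RPC-style exchange can be layered on top of send and receive by pairing each request with a reply and letting a helper thread service incoming requests; while the calling thread blocks on the reply, the other threads of the process keep running. The tce_channel_t type and the tce_send/tce_recv calls below are hypothetical stand-ins for TCE's message passing primitives.

    /* Hypothetical sketch: tce_channel_t, tce_send() and tce_recv() are assumed
     * names standing in for TCE's message passing primitives. */
    typedef struct tce_channel tce_channel_t;

    int tce_send(tce_channel_t *ch, const void *buf, int len);
    int tce_recv(tce_channel_t *ch, void *buf, int len);

    /* Application-defined request handler. */
    int handle_request(const void *req, int req_len, void *reply, int max_len);

    /* Client side: an "RPC call" is a request followed by a blocking receive of
     * the reply; only the calling thread blocks, not the whole process. */
    int rpc_call(tce_channel_t *ch, const void *req, int req_len,
                 void *reply, int reply_len)
    {
        if (tce_send(ch, req, req_len) < 0)
            return -1;
        return tce_recv(ch, reply, reply_len);
    }

    /* Server side: a helper thread runs this loop, turning every incoming
     * request into a reply message. */
    void rpc_server_loop(tce_channel_t *ch)
    {
        char req[256], reply[256];
        for (;;) {
            int n = tce_recv(ch, req, sizeof req);
            if (n < 0)
                break;
            int m = handle_request(req, n, reply, sizeof reply);
            tce_send(ch, reply, m);
        }
    }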

The majority of communication systems are single-threaded, so message recipients are simply addressed by relative or absolute process identifiers. In a multithreaded environment these methods are not "precise" enough. We would like to give application builders more flexibility and allow them to address particular threads as receivers of messages. On the other hand, such addressing should not require coupling thread identifiers with particular messages, as threads are often used as "helpers" that do their job and then exit. Therefore, we based communication in TCE on abstract communication objects called ports and channels. Ports represent a process's points of contact and are used to create communication links called channels. Depending on how an application defines ports and channels in terms of on- and off-stack variables, they are either accessible to all of a process's threads or local to a thread context. By using appropriate ports, two processes can establish an exclusive link in the form of a channel. Channels may be bidirectional, unidirectional, or have both directions closed (null channels). To better tailor the message passing style to application needs, channels may have different characteristics which describe details of the desired transfer protocol. TCE currently supports buffered and unbuffered (synchronous) message passing, as well as additional acknowledgments sent when a message reaches the destination process or is consumed by the receiving thread.
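
A hedged sketch of how the port and channel model might appear to an application follows; every identifier in it (tce_port_create, tce_channel_connect, the TCE_CHAN_* flags) is an assumption made for illustration rather than the documented API.

    /* Hypothetical sketch of the port/channel model; all names are assumed. */
    typedef struct tce_port    tce_port_t;
    typedef struct tce_channel tce_channel_t;

    /* Assumed characteristics selecting a channel's transfer protocol. */
    #define TCE_CHAN_BUFFERED     0x1  /* buffered message passing                        */
    #define TCE_CHAN_SYNC         0x2  /* unbuffered (synchronous) message passing        */
    #define TCE_CHAN_ACK_DELIVERY 0x4  /* acknowledge arrival at the destination process  */
    #define TCE_CHAN_ACK_CONSUMED 0x8  /* acknowledge consumption by the receiving thread */

    tce_port_t    *tce_port_create(const char *name);
    tce_channel_t *tce_channel_connect(tce_port_t *local, const char *remote_port, int flags);
    int            tce_send(tce_channel_t *ch, const void *buf, int len);

    void example(void)
    {
        /* An off-stack (global or heap) port is visible to all of the process's
         * threads; a port held in a thread's stack variable stays local to it. */
        tce_port_t *port = tce_port_create("solver");

        /* Establish an exclusive buffered channel to a peer process's port and
         * ask for an acknowledgment once the message is consumed there. */
        tce_channel_t *ch = tce_channel_connect(port, "master",
                                                TCE_CHAN_BUFFERED | TCE_CHAN_ACK_CONSUMED);

        double partial = 3.14;
        tce_send(ch, &partial, sizeof partial);
    }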

The final phase of the TCE design concerns interoperability between different programming models as well as various computer architectures. To support both the data parallel and the functional parallel paradigms, our system enforces the host-node programming style, according to which an application has complete control over how and where its processes are placed. This allows data and functional parallelism to be mixed freely, even within one set of cooperating processes. Since most parallel applications follow the data parallel paradigm, our environment furnishes special support for it by providing a rich set of collective communication operations, which can be performed on collections of channels called channel sets. We assumed that, using our library, it should be possible to take advantage of various, heterogeneous computing resources; except for SIMD machines, all other low- and high-end architectures ought to be covered. To provide dynamic access to all currently available resources, we decided to use a network of daemons. Daemons are placed on all nodes that are listed in the configuration files and are loosely coupled with other nodes; a tightly coupled parallel system has only one daemon, placed at its front end. Although a certain portion of a composite application may reside on parallel nodes that are not reachable from the Internet, communication is still possible via message forwarding. An application does not have to be aware of this, as the TCE runtime, together with the daemons, automatically forwards messages to the destination processes. Similarly, an application is not bothered by differences in machine-dependent data representations: all conversions, when needed, are performed by our library.
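
To make the channel-set idea concrete, the sketch below shows the host side of a host-node application broadcasting a block of parameters to its node processes over a set of channels; the tce_chanset_* names, like the other identifiers above, are hypothetical placeholders rather than the actual TCE calls.

    /* Hypothetical sketch of a collective operation over a channel set. */
    typedef struct tce_channel tce_channel_t;
    typedef struct tce_chanset tce_chanset_t;

    tce_chanset_t *tce_chanset_create(void);
    int            tce_chanset_add(tce_chanset_t *set, tce_channel_t *ch);
    int            tce_chanset_bcast(tce_chanset_t *set, const void *buf, int len);

    /* Host side of a host-node application: one channel per node process has
     * already been established; the host groups them into a channel set and
     * broadcasts the common problem parameters in one collective call.  Any
     * machine-dependent data conversions are left to the library. */
    void distribute_parameters(tce_channel_t **node_channels, int nnodes,
                               const double *params, int nparams)
    {
        tce_chanset_t *workers = tce_chanset_create();
        for (int i = 0; i < nnodes; i++)
            tce_chanset_add(workers, node_channels[i]);

        tce_chanset_bcast(workers, params, (int)(nparams * sizeof(double)));
    }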