Light Weight Threads: General Overview of the TCE System and its Use in Interpreted Environments
Janusz Niemiec, NPAC, Syracuse University, 111 College Place, Syracuse NY 13244-4100

Abstract of TCE Thread Presentation
This presentation was prepared by Janusz Niemiec and covers: an overview of multithreading; existing experience with multithreading on UNIX; experience with multithreading for parallel processing -- Nexus and Chant; and the design and implementation of TCE (Thread-based Communication Environment). TCE is based on experience with the MOVIE interpreted environment; the use of TCE in MOVIE and in other interpreted systems, such as parallel HTTP servers and Java from Sun, is explored.

General Concepts of Multithreading -- 1: What are threads and their benefits
Multithreading is defined as having multiple contexts of execution within the same address space. The benefits of having multiple threads of execution:
For a large class of applications it is much easier to express program control flow by concurrent actions than by the finite automaton model.
On multiprocessor systems, multiple threads can run in parallel, thus speeding up application execution.
Using multiple threads, an application can take advantage of overlap between IO and CPU operations (the CPU and IO channels can operate concurrently because they use different hardware circuitry).
Multithreading can mask a large percentage of the latency incurred during memory and IO operations.

General Concepts of Multithreading -- 2: Comparison of Threads and Processes
Having many threads within one address space is a better solution than having multiple heavy-weight processes cooperating with each other:
Different processes cannot easily share common data structures and other, Operating System dependent, resources.
Switching the CPU between processes is a costly operation which requires not only a swap of machine registers but also of OS specifics such as the kernel stack, process stack, page tables and signal vectors. Often the cache has to be purged too. A context switch for threads can be very efficient.
Since there is no common address space for different processes, interaction between them (IPC) has to be carried out by system-supplied mechanisms: semaphores, shared memory, sockets, pipes, etc. These interactions are much more expensive than accessing common data structures.

General Concepts of Multithreading -- 3: Efficiency of Threads vs. Processes
Threads and processes -- an efficiency comparison: switching between heavy-weight processes on contemporary RISC processors is in the millisecond range, while a thread context switch is in the microsecond range. Accessing a single data element takes from a couple to hundreds of nanoseconds; initiating a single IPC between different processes takes from a few to tens of milliseconds.

General Concepts of Multithreading -- 4: Application or Kernel Threads
Multithreading can be supported at the Operating System level (kernel-level threads), at the application level (user-level threads), or simultaneously at both levels. User-level threads are much more efficient than kernel-level threads. They don't require any OS support, so a given user-level thread package can be portable between different machines and OSes. The weakness of user-level threads is the possibility that a thread which blocks on a system call will block the whole process. To avoid that, user-level thread packages translate blocking system calls into non-blocking ones.
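To make the last point concrete, here is a minimal sketch of how a user-level package might wrap a blocking read() so that only the calling thread waits. The names thr_wrap_read and thr_yield are hypothetical, not from any package described here; a real package would park the thread on a scheduler queue and poll descriptors centrally rather than busy-yield, but the blocking-to-non-blocking translation is the same.

    #include <unistd.h>
    #include <fcntl.h>
    #include <errno.h>

    extern void thr_yield(void); /* hypothetical: hand the CPU to another user-level thread */

    /* Hypothetical wrapper: retry a non-blocking read, yielding between
       attempts, so the process keeps running its other threads. */
    ssize_t thr_wrap_read(int fd, void *buf, size_t len)
    {
        ssize_t n;
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK); /* make fd non-blocking */
        for (;;) {
            n = read(fd, buf, len);
            if (n >= 0)
                return n;        /* data arrived (or EOF) */
            if (errno != EAGAIN && errno != EWOULDBLOCK)
                return -1;       /* a real error */
            thr_yield();         /* would block: let other threads run */
        }
    }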
Kernel-level threads have the advantage that when a thread blocks on a system call, the Operating System, aware of that, can schedule another of the process' threads to run. These threads can also be scheduled on different processors (when the hardware offers more than one CPU) in a fashion transparent to the application. The main disadvantage of kernel-level threads is their poor efficiency -- an order of magnitude slower than their user-level counterparts.

General Concepts of Multithreading -- 5: Brief History and Motivation
Historically, multithreaded processes were first used to implement Operating System services; later on they migrated into different application layers. Threads were applied in developing servers (file servers, windowing servers, etc.). Each thread within such a server could serve one client, thus improving the server's response time. Subsequently, threads were used on the client's side, mainly in connection with RPC communication. Because RPC semantics is blocking, multiple RPC operations could be in progress when issued by different threads. Multiprocessor hardware led thread packages to be used by shared-memory parallel programs to improve their runtime performance. Currently, multithreading is being utilized for fine and coarse grain parallelism (both shared and distributed memory) at the level of compilers, runtime systems and applications.

Multithreading on Unix-compliant Operating Systems -- 1: Sun LWP
Probably the most widely known multithreading support has been provided by the SunOS Operating System from Sun Microsystems. The Light-Weight Processes (LWP) package is a user-level thread package. It can be briefly characterized by the following attributes:
The package delivers primitives for thread creation, destruction, blocking and resuming.
The package is non-preemptive, with prioritized scheduling. The LWP default scheduler can be replaced by a scheduler supplied by the application.
Threads can be synchronized by monitors with condition variables.
For each thread, it is possible to specify what kind of registers will be used (integer and/or floating point). Only registers which were declared as active are saved and restored.
LWP provides a way to map asynchronous signals into thread invocations, as well as traps into exceptions handled by threads.

Multithreading on Unix-compliant Operating Systems -- 2: LWP Functions
LWP threads can communicate using a rendezvous message passing style -- a sender and a receiver have to meet in order to communicate. There are three message passing calls -- msg_send(), msg_recv() and msg_reply(). Since LWP runs in user space, all its threads have to use non-blocking system calls to avoid blocking the whole Unix process.

Multithreading on Unix-compliant Operating Systems -- 3: DCE Threads
The most complete, and at the same time most complex, thread package has been furnished by the Open Software Foundation (OSF): the Distributed Computing Environment (DCE) threads package. It contains 51 primitives, some of them not strictly necessary but provided for convenience. This thread model can be implemented as kernel- or user-level threads. Its scheduling can be preemptive, non-preemptive, prioritized or round-robin, and other algorithms can be specified; the user has the choice of selecting one of them. There are calls for thread creation, destruction, killing and exiting. A newly created thread becomes a child of the thread which created it. The parent can either wait for the child's death (the join operator) or it may forget that the child exists -- detaching it from its execution (the detach operator).
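DCE threads closely follow the draft interface that later became POSIX threads, so the create/join/detach pattern just described can be sketched with the familiar pthreads calls (an illustration of the model under that assumption, not the literal DCE text):

    #include <pthread.h>
    #include <stdio.h>

    /* The child thread's body. */
    static void *child(void *arg)
    {
        printf("child running with arg %d\n", *(int *)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        int a1 = 1, a2 = 2;

        pthread_create(&t1, NULL, child, &a1); /* t1 becomes a child of main */
        pthread_create(&t2, NULL, child, &a2);

        pthread_join(t1, NULL);  /* wait for t1's death (the join operator) */
        pthread_detach(t2);      /* forget t2 (the detach operator) */
        pthread_exit(NULL);      /* exit main but let the detached thread finish */
    }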
Multithreading on Unix-compliant Operating Systems -- 4: DCE Thread Functions
Threads can synchronize their actions using mutexes (binary semaphores). There are two kinds of mutexes -- a fast mutex and a friendly mutex. To help in avoiding deadlocks, DCE offers condition variables with the wait, signal and broadcast operations on them. The notion of a per-thread global variable is supported: a thread can declare a variable to be global only to its own functions; functions which belong to other threads will not be able to use those variables. To avoid specifying many parameters for each call which creates threads, mutexes and condition variables, the notion of a template has been introduced. When created, threads, mutexes and condition variables are initialized with values from the appropriate templates. RPC is used to communicate between DCE multithreaded processes.

Multithreading on Unix-compliant Operating Systems -- 5: IRIX "Threads"
An interesting approach to multithreading has been proposed by the IRIX operating system from SGI. A regular IRIX process can create a shared process group by calling the sproc system call. The sproc call works similarly to a regular Unix fork, except that it creates a copy of the calling process which shares the same address space for both the code and the data segments. While this is not multithreading according to the strict definition, it certainly has many aspects of it.

Multithreading on Unix-compliant Operating Systems -- 6: IRIX Thread Characteristics
Processes created this way have their private registers, stacks and signal vectors; everything else is shared. All processes created by sproc are scheduled in a preemptive fashion by the OS. Synchronization can be done in the standard way -- using semaphores. Since these processes have all the attributes of heavy-weight tasks, their creation and context switch is more costly compared to "true" threads.

Multithreading on Unix-compliant Operating Systems -- 7: Mach Threads
The Mach Operating System (microkernel based) has both kernel- and user-level threads. Kernel threads: the kernel threads run in a preemptive, prioritized fashion. They can be either time-shared on one CPU or run concurrently on a multiprocessor. An application can directly control which processor a particular thread is assigned to, and the details of the scheduling policy can be controlled by the application too. Threads communicate through message passing. Messages are sent and received at ports, which can be global to the threads within a process or private to a thread. The kernel implementation offers around two dozen thread primitives, which can be used to implement other thread interfaces.

Multithreading on Unix-compliant Operating Systems -- 8: C-Threads
C-Threads -- user-level threads for Mach. This user-level thread package contains 6 calls for thread management (create, exit, join, detach, yield and self). Similarly to the DCE package, C-Threads offer mutexes and condition variables for synchronization; their semantics and syntax are also very similar to the DCE threads. C-Threads has had many different implementations on Mach, and it has been ported to other Operating Systems too. We briefly inspect two implementations based on the Mach kernel thread support. In the first implementation, C-Threads were managed entirely in user space: for each C-Thread process there was one Mach kernel thread. This solution suffered from the known problem of blocking the whole process on a blocking operation called by one thread. A second approach mapped all C-Threads 1-to-1 onto Mach kernel threads. The threads were then scheduled preemptively, taking advantage of multiprocessor hardware.
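Both DCE threads and C-Threads pair mutexes with condition variables in essentially the same way. A minimal sketch of the shared wait/signal pattern, again using the closely related pthreads names rather than the exact DCE or C-Threads spellings:

    #include <pthread.h>

    static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
    static int data_available = 0;

    /* Consumer: wait until a producer signals that data is available. */
    static void consume(void)
    {
        pthread_mutex_lock(&lock);
        while (!data_available)               /* re-check: wakeups may be spurious */
            pthread_cond_wait(&ready, &lock); /* atomically unlocks, sleeps, relocks */
        data_available = 0;                   /* consume the datum */
        pthread_mutex_unlock(&lock);
    }

    /* Producer: publish the datum under the mutex, then wake one waiter. */
    static void produce(void)
    {
        pthread_mutex_lock(&lock);
        data_available = 1;
        pthread_cond_signal(&ready);  /* broadcast would wake all waiters */
        pthread_mutex_unlock(&lock);
    }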
Multithreaded Environments for Parallel Processing -- Nexus
Nexus is a multithreaded communication environment developed at Argonne National Lab. It is intended as a compiler target or as the basis of a higher level library. The fundamental constructs in Nexus are:
Node -- a set of contexts on a computing node.
Context -- an address space plus executable code; there can be more than one data segment within a context.
Thread -- a thread of execution, which resides within a context and is not visible outside it.
Global Pointer -- a triplet {Node, Context, Address} which allows access to data within a different context.
Remote Service Request -- an RPC-like communication mechanism, which allows sending a buffer to a remote handler for execution.
Nexus' thread scheduling can be preemptive or non-preemptive. Its thread implementation can use a user-level thread package or kernel thread support.

Multithreaded Environments for Parallel Processing -- Chant - 1
The Chant environment has been developed at ICASE, NASA. Like Nexus, Chant can either take advantage of an existing thread package or use its own multithreading. The multithreading model, however, assumes non-preemptive execution. Chant's threads are globally visible and they can be created on remote nodes. Communication is either point-to-point or remote service invocation; the latter is built on top of the point-to-point method. Point-to-point messaging is based on explicit communication between threads. The sending thread has to submit the global id of the receiving thread, and likewise the receiving thread submits the sender's id. Those ids are triplets -- {node, process, thread} -- and are usually passed as a message tag. This way no copying is necessary when a message is received, since the message tag already addresses the receiving thread.

Multithreaded Environments for Parallel Processing -- Chant - 2
In order to eliminate the need for buffering altogether, a sending thread is blocked until a matching thread requests the message. A thread can request a blocking point-to-point operation or a non-blocking one. The blocking operation blocks only the thread that issued it; the non-blocking one allows the thread to proceed, returning an identifier. The identifier can later be checked to decide whether the operation has completed or not. Remote services differ only in that the servicing threads don't expect a request. Thus, there is a dedicated thread waiting for requests and activating the appropriate handlers. Unlike the point-to-point variation, this method requires buffering.

Thread-based Communication Environment (TCE) -- 1: Goals
To provide an efficient, thread-based communication library capable of supporting distributed and parallel processing on a variety of platforms.
To ensure interoperability between different types of architectures with different CPUs and Operating Systems.
To make the environment as simple as possible without compromising performance or functionality.
To assist the programmer in choosing computational nodes and the style of interactions between his processes.
Thread-based Communication Environment (TCE) -- 2: How Has it Been Achieved
Through abstract communication objects called ports and channels, it is possible to build client-server connections as well as peer-to-peer parallel relations. By mapping application processes onto computational nodes, both data parallelism and functional parallelism can be exploited; these two paradigms can even be mixed in one application. The multithreaded support is based on a user-level thread package, so TCE can be easily ported to different processors. Different architectures are supported:
clusters of heterogeneous workstations -- through ports and channels
shared memory parallel machines -- through multithreading
distributed memory parallel machines -- through ports and channels
Differences in data formats (byte ordering) are taken care of internally by the library. Machine-specific IPC operations are masked by higher level operations on channels and ports.

Thread-based Communication Environment (TCE) -- 3: Why it is Great
The set of interface functions in TCE has been minimized: by using multiple threads we could get rid of all non-blocking and asynchronous calls. As a result, much cleaner semantics and better optimized code have been obtained. Performance and functionality have not been jeopardized, since multiple threads provide more concurrency of execution. Although the higher abstraction level increases portability, a programmer may want to specify a particular machine to carry out a computation, or impose a specific method of IPC for efficiency reasons. These alternative operations are implemented as an additional set of operators on communication objects.

TCE -- Implementation -- 1: What is a thread and how do you make it!
Notion of a process and a thread: A TCE process is a regular Unix, DOS or Windows NT process. Although it is spawned according to the user's request, its creation and scheduling are carried out exclusively by the Operating System. A TCE thread is a light-weight thread of execution. Its creation and destruction, as well as its scheduling, are controlled by the TCE layer. The underlying Operating System is neither involved in those actions, nor is it aware that the process is multithreaded.
Thread creation and destruction: When a new TCE process is started, it contains only one thread of execution (a single-threaded process). For programs written in C or C++, the main function is the body of that thread. To create more threads, the program has to request it by calling tce_thrd_init, giving a function name and the addresses of parameters to that function.

TCE -- Implementation -- 2: tce_thrd_init
As a result, space for the thread's stack and descriptor is allocated. tce_thrd_init returns a pointer to that descriptor:

    void func(int *i) { .... }
    ....
    void main(int argc, char *argv[])
    {
        TCE_Thrd *thrd;
        int prm = 0;
        ....
        thrd = tce_thrd_init(func,&prm);
    }

TCE -- Implementation -- 3
The most important fields in the descriptor are:
the thread's registers and program counter
a pointer to the thread's stack
the thread priority
the thread's priority queue
a pointer to the parent thread
pointers to all of the thread's children
Thread destruction: When the function (declared as a thread) reaches its end, the thread is removed from the scheduling queue and from the parent thread's list of children. If that thread has no children it wishes to wait for, its descriptor and stack space will be reclaimed and the thread will be physically destroyed. Otherwise, the destruction stage will be postponed until all its children die. There are the following important aspects of this scheme of thread destruction:
clean semantics -- related threads form a tree which represents their relations
thread parameters which were declared as automatic variables in a parent thread will still be valid, even when the parent thread has finished its execution
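The presentation does not show the descriptor layout, but the fields enumerated above suggest a structure along the following lines. This is a hypothetical sketch only, not the actual TCE definition; the field types and names are assumptions.

    typedef struct TCE_Thrd TCE_Thrd;

    struct TCE_Thrd {
        unsigned long regs[32];   /* saved machine registers */
        unsigned long pc;         /* saved program counter */
        char         *stack;      /* pointer to the thread's private stack */
        int           priority;   /* 0..31, 31 is the highest */
        TCE_Thrd     *queue_next; /* link within this priority's ready queue */
        TCE_Thrd     *parent;     /* the thread that created this one */
        TCE_Thrd    **children;   /* pointers to all of this thread's children */
        int           nchildren;  /* number of live children */
    };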
TCE -- Implementation -- 4: Scheduling policy
All threads are scheduled according to their priorities. The priority range is 0 to 31, where 31 is the highest priority. Each priority has its own queue of ready threads, which can be empty (no threads of this priority ready to run). The TCE scheduler always chooses the highest priority thread to run. If there is more than one thread in the highest priority queue, these threads are scheduled in round-robin fashion. When all threads are blocked -- meaning there is no thread ready to be executed -- TCE starts its "Idle" thread. That thread just waits for an external event (such as a message or an OS signal). If that event causes one or more of the blocked threads to become ready, the "Idle" thread is stopped and the scheduler selects the ready thread with the highest priority.

TCE -- Implementation -- 5: Priority
A newly created thread inherits its parent's priority. The priority can be changed at any time by calling tce_thrd_priority with a new priority for a given thread:

    void main(int argc, char *argv[])
    {
        TCE_Thrd *thrd;
        int prm = 0;
        ....
        thrd = tce_thrd_init(func,&prm);
        tce_thrd_priority(thrd,31);
    }

The first thread gets priority 0.

TCE -- Implementation -- 6: Preemptive and Non-Preemptive scheduling
In the TCE environment the user has a choice between preemptive and non-preemptive multithreading. Non-preemptive scheduling is the default. It causes the scheduler to be invoked only when the currently running thread blocks itself or explicitly yields control (tce_thrd_yield()). If a new thread is being started, the state of the "old" thread (the state of the registers and the stack pointer) is preserved in that thread's descriptor. Similarly, the context of the "new" thread is loaded from its descriptor into the processor registers. This operation is called a context switch. Every time a thread is scheduled again, it restarts its execution where it was stopped, with identical processor state. Preemptive scheduling can be activated by calling tce_preempt_on(). After that call, the current thread can be preempted due to two possible conditions:

TCE -- Implementation -- 7: Time Slice per Thread in Scheduling
a higher priority thread becomes ready
the current thread has exhausted its time slice and there are one or more threads with the same priority as the current one
The time slice for a particular priority queue can be controlled by the application:

    void main(int argc, char *argv[])
    {
        TCE_Thrd *thrd;
        int prm = 0;
        int priority = 31;
        int slice = 10; /* 10 ms slice */
        ...
        tce_preempt_on();
        thrd = tce_thrd_init(func,&prm);
        tce_thrd_priority(thrd,priority);
        tce_clock_slice(priority,slice);
        tce_thrd_start(thrd);
        ...
        tce_preempt_off();
        ...
    }

TCE -- Implementation -- 8: Preemptive Scheduling
Once started, preemptive scheduling can be deactivated by calling tce_preempt_off. During program execution, preemptiveness can be switched on and off freely. To implement preemptive execution, TCE uses two signals -- the clock signal and the IO signal (SIGALRM and SIGIO on Unix-like Operating Systems). When one of those signals is generated, the current execution context is preempted and control is transferred to the appropriate signal handler. The handler then either accounts for the time used by the currently running thread (clock) or inspects IO descriptors. When any of these activities moves a higher priority thread into the ready state, a context switch takes place. Since preemptive scheduling can switch thread contexts at any time, the application should guard its critical sections with TCE semaphores. This way, the desired order of execution can be assured.
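Under the default non-preemptive policy, by contrast, a long-running thread must cooperate by yielding on its own. A minimal sketch (the function body and iteration counts are illustrative; only tce_thrd_yield comes from TCE):

    /* A CPU-bound thread body that cooperates with the non-preemptive
       TCE scheduler by yielding periodically. */
    void func(int *niter)
    {
        int i;
        for (i = 0; i < *niter; i++) {
            /* ... one unit of computation ... */
            if (i % 1000 == 0)
                tce_thrd_yield();   /* let other ready threads of equal priority run */
        }
    }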
TCE -- Implementation -- 9: To Preempt Or Not To Preempt, ...
Non-preemptive execution is in most cases a more efficient way to carry out computation:
Context switching between threads is a relatively expensive operation. In a non-preemptive environment it occurs only when a thread cannot proceed (has to be blocked) or yields control (under program control). Thus, the management of context switching is optimal (assuming that the application can control it optimally).
Unnecessary or too frequent switches not only waste CPU time but also disrupt caches and can produce page faults.
Handling asynchronous signals (mandatory for the preemptive mode) spoils cache locality and on some processors can generate a lot of extra operations.

TCE -- Implementation -- 10: When is Preemptive Scheduling Used
Preemptive execution is often required to satisfy the following demands:
Real-time operations -- refreshing a display at a constant rate, polling a device at a certain frequency, etc.
Irregular problems, in which the current flow of control depends, or might depend, heavily upon frequent updates from other processing units.
A high degree of overlap between computation and communication -- on some systems sending or receiving large buffers has to be done in several steps; failing to do so at the appropriate rate can diminish the prospects of having CPU and IO transactions carried out concurrently.
In order to satisfy these contradictory needs, TCE allows the user to decide which scheduling policy should be used.

TCE -- Implementation -- 11: Non-Blocking I/O operations
TCE offers user-level multithreading, which is advantageous where efficiency is concerned, but it also possesses a drawback -- a potentially blocking I/O operation will block the whole process, instead of blocking only the thread that issued the I/O request. To cope with this problem, TCE translates blocking I/O calls into non-blocking ones. A thread that requests such an operation is then blocked (by the scheduler) until the underlying Operating System signals that the non-blocking operation has completed. However, this solution is not fully general: on some Operating Systems certain I/O system calls don't have non-blocking (asynchronous) variants. Fortunately, all relevant OSes on workstations and MIMD parallel machines provide non-blocking interfaces for accessing terminals and network devices. This is why programs using TCE should always use TCE calls to interact with the keyboard or the network.
TCE -- Implementation -- 12: Thread Synchronization
In order to synchronize threads' activities, TCE provides operations on semaphores. A semaphore is an object which has to be created by the tce_sem_init function. After initialization, three additional operations are available:
tce_sem_up -- increments the semaphore; when the value has thereby reached 0, the first thread blocked on this semaphore is allowed to proceed and the counter is set to a negative value again.
tce_sem_down -- tries to decrement the specified semaphore; if the value is already less than 0, the thread issuing this operation is blocked on that semaphore and the value is not changed.
tce_sem_free -- allows all threads which have been blocked on this semaphore to proceed; the semaphore counter is reset to 0.

TCE -- Implementation -- 12: Synchronization Example

    TCE_Sem *sem,*wait;
    ...
    void func1(int *p1)
    {
        ...
        tce_sem_down(sem);   /* enter critical section */
        ...                  /* do some work */
        tce_sem_up(sem);     /* leave critical section */
        tce_sem_down(wait);  /* wait for parent to let me exit */
    }

    void func2(int *p2)
    {
        ...
        tce_sem_down(sem);
        ...                  /* do some work */
        tce_sem_up(sem);
        tce_sem_down(wait);
    }

    void main(int argc, char *argv[])
    {
        TCE_Thrd *thrd1,*thrd2;
        int prm1,prm2;
        ...
        wait = tce_sem_init(0);  /* create semaphore, set counter to 0 */
        sem = tce_sem_init(1);
        thrd1 = tce_thrd_init(func1,&prm1);
        tce_thrd_start(thrd1);
        thrd2 = tce_thrd_init(func2,&prm2);
        tce_thrd_start(thrd2);
        ...
        tce_sem_free(wait);      /* let them proceed */
        ...
    }

TCE -- Implementation -- 13: Parent-Child Join Synchronization
There is an additional synchronization mechanism in TCE which allows a parent thread to synchronize with its children at the moment of their termination. The tce_thrd_join function blocks the parent thread until the specified child terminates its execution:

    void func(int *p1)
    {
        ...   /* do some work */
    }

    void main(int argc, char *argv[])
    {
        TCE_Thrd *thrd;
        int prm;
        ...
        thrd = tce_thrd_init(func,&prm);
        tce_thrd_start(thrd);
        ...
        tce_thrd_join(thrd);  /* wait for my child's termination */
        ...
    }

TCE -- Implementation -- 14: Communication in TCE
Communication operators concern threads belonging to different processes. Threads within the same address space should use global data structures instead of sending messages (the latter is still possible but less efficient). TCE provides only one mechanism for inter-process communication -- message passing. To send or receive messages, a communication object called a channel has to be established. A channel is set up by presenting a pair of ports to the TCE kernel; when two matching pairs are found, the channel is constructed. One channel connects exactly two processes. It is up to those processes to decide which of their threads will be writing and/or reading to/from that channel.

TCE -- Implementation -- 15: Ports
Ports are processes' points of contact. There are four standard ports for each process:
TCE_InPort -- a general receiving port; requests to create a channel are directed to this type of port.
TCE_OutPort -- a general sending port; this port is used to create channels between an application and the TCE kernel.
TCE_ExcPort -- an exception port, used to create channels on which exceptions from the OS and the TCE kernel are reported.
TCE_NullPort -- a port used to close one or both directions of a channel.
Besides these standard ports, an application may have any number of sending ports, which can be treated as images of other processes' TCE_InPorts.

TCE -- Implementation -- 16: Creating Ports
Ports are obtained when a new process is created, or by receiving them in a message. In the following example, the SendingPorts array will be filled with ports created as a result of starting NumOfNodes copies of the SlaveName program on the nodes listed in Nodes:
    #define NumOfNodes 8
    #define SlaveName "/usr/janusz/slave_program"

    TCE_Port *SendingPorts[NumOfNodes];
    TCE_Chn *Channels[NumOfNodes];
    ...
    n = tce_exec(SendingPorts,Nodes,SlaveName,TCE_InPort,NumOfNodes);
    tce_chn_init(Channels,TCE_InPort,SendingPorts,n);
    for(i = 0; i < n; i++)
        tce_chn_snd(Channels[i],SendingPorts,n,FMT_Port);
    ...

Using those ports, the parent process can now establish a dedicated channel to each of its children by calling tce_chn_init. Through the channels it can pass the sending ports to all its children (tce_chn_snd).

TCE -- Implementation -- 17: Ports contd
There is another way to obtain a sending port -- receive it in a message. Afterwards, both the sender and the receiver of that message hold the same sending port pointing at the same receiving port. A newly created child process always passes the sending right to its receiving port back to its parent. The opposite is not true: the parent may pass the Null Port as the argument to tce_exec, thus preventing the child from opening a talking connection. Ports are an essential ingredient of the TCE model. They allow one to decouple the verification and authentication processes from regular operations on channels. By controlling access to its receiving port, a thread can prevent other, possibly malicious, threads from tampering with its internal state. As we will see later, one is permitted to create more than one channel per connection, as well as to specify whether a channel is two-way, one-way or null.

TCE -- Implementation -- 18: Channels
To create a channel, a pair of receiving and sending ports has to be given to tce_chn_init. Then a matching request posted by the other process has to be found. To simplify the process of creating channels, tce_chn_init accepts an array of sending ports and returns an array of established channels:

    ...
    tce_chn_init(Channels,TCE_InPort,SendingPorts,n);
    ...

If the Null port is given instead of the receiving or sending port, TCE will create a one-way channel. If both ports are replaced by Null ports, a null ("faked") channel is returned. Operations on non-existing channel directions are null functions. For example, a write operation on a channel with only the receive direction enabled will return immediately, reporting 0 items sent.

TCE -- Implementation -- 19: tce_chn_snd
Similarly to ports, a channel can be passed in a message to another process. The process that posts this operation, however, loses its access to that channel (a channel can link only two processes). There are only three basic calls which operate on a channel: tce_chn_snd, tce_chn_rcv and tce_chn_set.
tce_chn_snd -- sends a message; if required, the sending thread is blocked until the operation completes.

    ...
    blen = tce_chn_snd(Channels[0],buffer,len,FMT_Int);
    ...

TCE -- Implementation -- 20: tce_chn_rcv/set
tce_chn_rcv -- receives a message on the channel; the receiving thread is blocked when no matching message is available yet.

    ...
    blen = tce_chn_rcv(Channels[0],buffer,len,FMT_Int);
    ...

tce_chn_set -- imposes a different characteristic on the channel; the requesting thread can be blocked when the requested type of channel has to be accepted by the other peer.

    ...
    res = tce_chn_set(Channels[0],TCE_CHN_SYNC,1);
    ...
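For completeness, the slave side of the Creating Ports example above might look roughly as follows. This is a hypothetical sketch assembled from the calls just described: the channel variable, the buffer sizes, and the assumption that the child first receives the table of sending ports (broadcast with FMT_Port by the parent) are all illustrative, not taken from the original.

    TCE_Chn *MasterChn;   /* assumed: channel to the parent, built with tce_chn_init */
    TCE_Port *Ports[8];   /* will hold the port table broadcast by the parent */
    int buffer[1024];
    int n, blen;
    ...
    n = tce_chn_rcv(MasterChn,Ports,8,FMT_Port);        /* the table of sending ports */
    blen = tce_chn_rcv(MasterChn,buffer,1024,FMT_Int);  /* subsequent data messages */
    ...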
TCE -- Implementation -- 21: Channel Sets
Channels provide peer-to-peer communication links. To allow collective interaction without specifying every process individually, an object called a channel set is provided. To create a channel set object, the function tce_chnset_init has to be called. It takes an array of channels over which the channel set is to be spanned. All processes having access to those channels must participate (submitting identical arrays of channels):

    TCE_ChnSet *ChnSet;
    ...
    tce_chnset_init(&ChnSet,Channels,n);
    ...

TCE -- Implementation -- 22: Channel Set Functions
The following calls operate on channel sets. In order to complete, all processes which are part of a particular channel set have to submit matching operations:
tce_chnset_sync -- provides a barrier synchronization
tce_chnset_bcast -- broadcasts a message
tce_chnset_gather -- gathers different buffers in one process
tce_chnset_concat -- concatenates individual buffers and sends the resulting buffer to all
tce_chnset_scatter -- scatters different parts of a buffer among all processes
tce_chnset_reduce -- performs a reduction operation on the processes' buffers and places the result in one process
tce_chnset_combine -- similar to tce_chnset_reduce, except that all processes get the result
tce_chnset_prefix -- performs a reduction operation in an ordered fashion and sends out the results to all processes
tce_chnset_shift -- shifts data in either regular or circular fashion, in the left or right direction, by the specified number of nodes

TCE -- Implementation -- 23: Example of Use of Channel Set Functions

    #define len 1024
    #define circular 1
    ...
    extern void comp(int *, int *, int);
    TCE_ChnSet *ChnSet;
    int buffer[len],inbuf[len],outbuf[len];
    int src = 0,step = 1;
    ...
    tce_chnset_sync(ChnSet);
    tce_chnset_bcast(ChnSet,src,buffer,len,FMT_Int);
    tce_chnset_gather(ChnSet,src,inbuf,outbuf,len,FMT_Int);
    tce_chnset_concat(ChnSet,inbuf,outbuf,len,FMT_Int);
    tce_chnset_scatter(ChnSet,src,inbuf,outbuf,len,FMT_Int);
    tce_chnset_reduce(ChnSet,src,comp,inbuf,outbuf,len,FMT_Int);
    tce_chnset_combine(ChnSet,comp,inbuf,outbuf,len,FMT_Int);
    tce_chnset_prefix(ChnSet,comp,inbuf,outbuf,len,FMT_Int);
    tce_chnset_shift(ChnSet,circular,step,inbuf,outbuf,len,FMT_Int);
    ...

TCE -- Implementation -- 24: Using Communication Objects
All threads within a process can access channels and channel sets. The semantics of that access is blocking: when a requested operation cannot be completed, the thread which issued it is blocked. It is made ready again when the operation has completed or can proceed. This is in contrast to non-blocking or asynchronous operations. Since internally TCE uses those non-blocking calls, the performance of the thread-based communication can be slightly worse than using non-blocking calls directly. This is caused by the context switching costs, which have to be added to each non-blocking invocation which results in blocking a thread. In the majority of cases, however, this disadvantage is greatly outweighed by the clean communication semantics and a better overlap between communication and computation (greater concurrency). This approach is particularly well suited to taking advantage of dedicated communication co-processors, or of more than one CPU per node.

TCE -- Implementation -- 25: Communication Modes
While using channels, the user can request different communication modes:
a regular mode -- TCE_CHN_REG, which buffers a message on the receiving side when no thread waits for it.
a synchronous mode -- TCE_CHN_SYNC, which postpones a send until a matching receive has been posted. This guarantees that the message will not have to be buffered at the receiver's side.
a force mode -- TCE_CHN_FORCE, which discards the received message when nobody waits for it. The send operation proceeds without any delay.
Different modes can be used to better match the nature of an application or to obtain performance gains.
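As a sketch of how an application might pick among these modes, assume three illustrative channels (ResultChn, ArrayChn and FrameChn are hypothetical names; only tce_chn_set and the mode constants come from TCE):

    ...
    /* results: never lose data; let TCE buffer when the receiver is busy */
    res = tce_chn_set(ResultChn,TCE_CHN_REG,1);
    /* bulk array exchange: avoid receiver-side buffering of large messages */
    res = tce_chn_set(ArrayChn,TCE_CHN_SYNC,1);
    /* display refresh stream: stale frames may be dropped; never delay the sender */
    res = tce_chn_set(FrameChn,TCE_CHN_FORCE,1);
    ...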
TCE -- Implementation -- 26: Changing Channel Characteristics
To change channel characteristics, both communicating threads have to call tce_chn_set with identical parameters:

    ...
    res = tce_chn_set(Channel,TCE_CHN_SYNC,1);
    ...
    res = tce_chn_set(Channel,TCE_CHN_REG,1);
    ...

tce_chn_set can also be used to specify a tag for the messages being sent or received. In this case, the participation of the other process is not needed. When a thread blocks on a channel, its tag is preserved, even if a different thread changes that tag in the meantime.

    ...
    res = tce_chn_set(Channel,TCE_CHN_TAG,13);
    ...

TCE -- Implementation -- 27: tce_chnset_set
Just as with channels, channel sets have an operation called tce_chnset_set. This call helps to distinguish between different collective operations which are in progress, by defining different tags for them:

    ...
    res = tce_chnset_set(ChnSet,TCE_CHNSET_TAG,13);
    ...

TCE -- Implementation -- 28: Communication between heterogeneous machines
Each TCE call which sends or receives a buffer requires the user to submit the basic type of the buffer items. When a message has to be sent to a machine which has a different byte order, the user buffer is XDR encoded. XDR decoding then takes place at the receiving side, and the thread gets the buffer in its native data format. A program can enforce no encoding at all by declaring the data type as FMT_Byte. TCE detects automatically whether XDR translation is necessary, so an application should always specify the correct data type. TCE channels and channel sets can be spanned over different parallel MIMD machines. The translation between the native communication systems is performed internally by the TCE kernel.

TCE -- Implementation -- 29: TCE as a Distributed Environment
To support the distributed computing style, we need to support client-server interactions. In TCE this is possible by implementing a process (or a collection of processes) as a resident server. That server registers its ports under predefined names. Any client process can then obtain those ports (requesting them by name) and build a channel to the server. When the client-server interaction is completed, the channel is closed. The multithreaded support greatly helps in building such servers, since one server thread can be dedicated to accepting new requests, while the others work on the transactions already in progress. At the client side, multiple threads allow interaction with many servers concurrently. Preemptive scheduling ensures that all clients are treated fairly, despite load fluctuations. The popular RPC communication can be implemented as a pair of consecutive tce_chn_snd and tce_chn_rcv calls, as sketched below.
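A client-side sketch of that RPC pattern, assuming an established channel to the server (ServerChn, REQ_LEN and REP_LEN are illustrative names, not from the original):

    int request[REQ_LEN], reply[REP_LEN];
    int blen;
    ...
    blen = tce_chn_snd(ServerChn,request,REQ_LEN,FMT_Int);  /* ship the marshalled call */
    blen = tce_chn_rcv(ServerChn,reply,REP_LEN,FMT_Int);    /* block this thread on the result */
    ...

Only the calling thread waits between the two calls; the other threads in the client keep running.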
TCE -- Implementation -- 30: TCE as a Parallel Environment
TCE is well suited to supporting parallel programming. The host-node style that it applies supports both the data parallel and the functional parallel programming paradigms. The higher level communication masks the differences between message passing systems on different distributed architectures; thus, programs written for one collection of distributed nodes should run without any changes on a different collection of nodes. Shared memory parallel machines are naturally accommodated through the multithreading interface. The multithreaded support allows efficient parallel execution by masking latencies and communication delays. A user-definable scheduling policy, as well as the different communication styles on channels, provide a way to optimally map an application onto the underlying architecture.

MOVIE -- 1: Introduction To Total System
Multitasking Object-oriented Visual Interactive Environment (MOVIE) is a system of interpreters which interpret programs written in MovieScript (extended PostScript). The MOVIE interpreter acts like a virtual processor, executing MovieScript programs in a machine and processor independent manner. The MOVIE network model, as well as its thread model and scheduling, is very similar to TCE; as a matter of fact, the development of MOVIE laid the foundations for developing TCE. The basic difference between MOVIE and TCE is that MOVIE is an interpreted, stand-alone system while TCE is a compiled library. The unit of execution in MOVIE is an interpreted thread, which is part of an interpreted task. Similarly to a TCE process, a task is only a passive envelope for its threads. It contains three standard ports together with other resources shared by its threads.

MOVIE -- 2: MOVIE Threads
A MOVIE thread is an active entity whose code is interpreted by the MovieScript interpreter. Each thread contains its descriptor and three stacks:
Operand Stack -- used to store operands in the postfix notation
Execution Stack -- utilized internally by the interpreter
Dictionary Stack -- containing dictionaries which hold {name,object} pairs
During a context switch, instead of swapping the contents of registers, the pointers to those three stacks are swapped. After that, the interpreter starts interpreting the new thread. When a new task is started, a thread is automatically created within it. All other threads have to be explicitly created by the fork operator. That operator copies the contents of the parent's stacks onto the new thread's stacks. This allows new threads to inherit the parent's context (like global variables in TCE). All subsequent operations on the stacks are visible to neither the parent nor the child (like local variables in TCE).

MOVIE -- 3: Communication of Code
Since in the MOVIE system a thread's code is machine independent, it can be sent like any other message, to be interpreted (executed). Because of this, a thread can be created without its code, or it can remain active even when there is no code for it. To let the MOVIE kernel know that such a "daemon" thread is requested, the thread has to declare one of its channels as a "code" channel. The lack of code will then block that thread on this channel, and any message delivered to that channel will be treated as code, making the blocked thread ready. All operations on threads and semaphores have the same semantics as the TCE calls. They have a different syntax though, and are treated as operators.

MOVIE -- 4: Communication Model and Preemptive Structure
The MOVIE communication model is very similar to the previously discussed TCE model. Standard, task-wide ports are used to construct inter-task channels, and these are the basis for constructing channel sets. Different modes of communication can be imposed on channels. But unlike TCE, in MOVIE there is no need to instruct the interpreter what kind of data is being sent or received. The system is object-oriented, so this information is already available to the network subsystem. Instead of a poorly formatted chunk of memory, a formatted object is sent. At the receiving end, a copy of that object is reconstructed and passed to the appropriate channel.
The MOVIE system is preemptive at the interpreted level. This means that no C-level code is preempted; a context switch can take place between two operators, when the interpreter discovers that either the IO or the clock flag has been set. This solution is more efficient and allows a simpler implementation. The drawback is that the system cannot react immediately while executing an operator, so some required time constraints may not be met.

Multithreading for Interpreted Environments -- 1: Overview
So far we have discussed a compiled library and an interpreted system; both of them benefit from the multithreaded ingredient. For interpreted environments, multithreading support can be implemented in two ways:
Interpreted level only -- threads are implemented as interpretive objects. A context switch between threads does not involve swapping machine registers; instead, the interpreter is switched to interpret a different thread.
Interpreted and compiled level -- each interpreted thread is mapped onto a compiled thread. The compiled thread consists of the interpreter which interprets the interpreted thread.
An example of the first approach is MOVIE; of the other, the HotJava browser. Comparing the two, we can notice the following aspects:

Multithreading for Interpreted Environments -- 2: Comparison of Two Implementation Strategies
Multithreading at the interpreted level only is:
simpler to develop
more efficient as far as performance and memory consumption are concerned
The two-level multithreading is more responsive (when based on preemptive scheduling) and capable of supporting time-critical operations. It can also take advantage of multiprocessing hardware.

Multithreading for Interpreted Environments -- 3: Multithreading for HTTP Servers and Clients
A very practical example of an interpreted system which can benefit from a multithreaded communication model are HTTP servers and clients. A WWW server interprets the HTTP protocol while responding to clients' requests. Currently, every time a new client connects to a server, the server "forks" itself, creating a new copy of the server which then handles the established connection exclusively. Forking a new process is an expensive operation, and in addition memory and other system resources are wasted. An alternative solution is to have a multithreaded server, which creates a new thread each time a new connection is requested (see the sketch after the next slide). This is not only more efficient in terms of performance and memory consumption, but also makes it easy to implement interactions between different clients -- a natural next step in Web technologies.

Multithreading for Interpreted Environments -- 4: Preemptive WWW Server and Client Multithreading
An even more scalable approach is to have a mesh of WWW servers which can efficiently support more clients, while providing a shared Web space, often computationally intensive, for client interactions. Both preemptive multithreading and efficient, higher level communication are essential to implement this kind of service. At the WWW client side we already have an existing example: the new browser developed at Sun Microsystems, called HotJava, employs preemptive multithreading. Currently this multithreaded interpreter is based on the Solaris 2.3 and 2.4 preemptive thread support (http://www.sun.com). Porting efforts to other systems are underway.
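A sketch of the thread-per-connection server structure described above, written with the TCE thread calls from the earlier slides plus standard sockets. The header name tce.h, the handler body and the connection bookkeeping are assumptions for illustration only.

    #include <stdlib.h>
    #include <sys/socket.h>
    #include "tce.h"   /* assumed name for the TCE header */

    /* Illustrative handler: would interpret the HTTP request arriving
       on *connfd and send back the response. */
    void handle_client(int *connfd)
    {
        /* ... parse request, write response, close(*connfd), free(connfd) ... */
    }

    /* Thread-per-connection loop: a new TCE thread, not a forked process,
       serves each accepted connection. */
    void serve(int listenfd)
    {
        for (;;) {
            int *connfd = malloc(sizeof(int));
            *connfd = accept(listenfd, NULL, NULL); /* TCE blocks only this thread */
            tce_thrd_start(tce_thrd_init(handle_client, connfd));
        }
    }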
Multithreading for Interpreted Environments -- 5: HotJava Browser
The HotJava browser uses two-level multithreading. The main reason for this choice is the dual nature of its code execution: the regular mode of execution is to interpret precompiled code written in the Java(TM) language. The executable code is a stream of bytecodes, so the interpretation can be fairly fast. When higher performance is required, however, the stream of bytecodes is translated into machine code on the fly. Therefore both the interpreted and the compiled code can use the compiled thread support. The Java language is based on a restricted subset of C++, with some extensions added from Objective C. The main distinction between the HotJava browser and other browsers (Mosaic, for example) is that HotJava can acquire any information from a server without being aware of the information's type or structure. The necessary tools for presenting this information are transferred from the server, linked with the HotJava runtime, and started as interpreted or compiled threads.