Given by Janusz Niemiec and Geoffrey Fox at CPS600, Spring Semester 1995, in April 1995. Foils prepared July 6, 1995
Abstract * Foil Index for this file
This presentation was prepared by Janusz Niemiec and describes |
Overview of Multithreading |
Existing Experience with Multithreading on UNIX |
Experience with multithreading for parallel processing -- Nexus and Chant |
Design and Implementation of TCE (Thread-based Communication Environment) |
This was based on experience with the MOVIE interpreted environment; the use of TCE in this and other interpreted systems, such as parallel HTTP servers and Java from Sun, is explored |
Janusz Niemiec |
NPAC |
Syracuse University |
111 College Place |
Syracuse |
NY 13244-4100 |
Multithreading is defined as having multiple contexts of execution within the same address space. |
Benefits of having multiple threads of execution:
|
Having many threads within one address space is a better solution than having multiple, heavy-weight processes cooperating with each other:
|
Threads and Processes -- efficiency comparison
|
Multithreading can be supported at the Operating System level (kernel-level threads), at the application level (user-level threads) or simultaneously at both levels.
|
Historically, multithreaded processes were first used to implement Operating System services; later they migrated into different application layers.
|
Probably the most widely known multithreading support has been provided by SunOS from Sun Microsystems. The Light-Weight Processes (LWP) package is a user-level thread package. It can be briefly characterized by the following attributes:
|
LWP Threads can communicate using a rendezvous message passing style -- a sender and a receiver have to meet in order to communicate. There are three message passing calls -- msg_send(), msg_recv() and msg_reply(). |
Since LWP runs in the user space, all its threads have to use non-blocking system calls to avoid blocking the whole Unix process. |
The most complete, and at the same time most complex, thread package has been furnished by the Open Software Foundation (OSF): the Distributed Computing Environment (DCE) threads package. It contains 51 primitives, some of them not strictly necessary but provided for convenience only.
|
There are calls for thread creation, destruction, killing and exiting. A newly created thread becomes a child of the thread which created it. The parent can either wait for the child's death (join operator) or it may forget that the child exists, detaching it from its execution (detach operator). |
Threads can synchronize their actions using mutexes (binary semaphores). There are two kinds of mutexes -- a fast mutex and a friendly mutex. To help in avoiding deadlocks, DCE offers condition variables with the wait, signal and broadcast operations on them. |
The notion of a per-thread global variable is supported. A thread can declare a variable to be global only to its own functions. Functions which belong to other threads will not be able to use those variables. |
To avoid specifying many parameters for each call which creates threads, mutexes and condition variables, the notion of a template has been introduced. When created, threads, mutexes and condition variables are initialized with values from the appropriate templates.
|
An interesting approach to multithreading has been proposed by the IRIX operating system from SGI.
|
Processes created this way have their private registers, stacks and signal vectors; everything else is shared.
|
The Mach Operating System (microkernel based) has both the kernel- and the user-level threads.
|
C-Threads -- user-level threads for Mach
|
Nexus is a multithreaded communication environment developed at Argonne National Laboratory. It is intended as a compiler target or as the basis of a higher-level library.
|
The Chant environment has been developed at ICASE, NASA.
|
In order to eliminate the need for buffering altogether, a sending thread is blocked until a matching thread requests a message.
|
To provide an efficient, thread-based communication library capable of supporting distributed and parallel processing on a variety of platforms. |
To ensure interoperability between different types of architectures with different CPUs and Operating Systems. |
To make the environment as simple as possible without compromising the performance or functionality. |
To assist the programmer in choosing computational nodes and the style of interactions between their processes. |
Through abstract communication objects called ports and channels it is possible to build client-server connections as well as peer-to-peer parallel relations. By mapping application processes onto computational nodes, both data parallelism and functional parallelism can be exploited. These two paradigms can even be mixed in one application. |
The multithreaded support is based on a user-level thread package, thus TCE can be easily ported to different processors. Different architectures are supported:
|
The differences in data formats (byte ordering) are taken care of internally by the library. Machine-specific IPC operations are masked through higher-level operations on channels and ports. |
The set of interface functions in TCE has been minimized, owing to the fact that by using multiple threads we could get rid of all non-blocking and asynchronous calls. As a result, much cleaner semantics and better-optimized code have been obtained. Performance and functionality have not been jeopardized, since multiple threads provide more concurrency of execution. |
Although a higher abstraction level increases portability, a programmer may want to specify a particular machine to carry out the computation, or impose a specific method of IPC for efficiency reasons. These alternative operations are implemented as an additional set of operators on communication objects. |
Notion of a process and a thread
|
Thread creation and destruction
|
As a result, the space for the thread's stack and descriptor is allocated; tce_thrd_init returns a pointer to that descriptor. |
void func(int i) |
{ |
.... |
} |
.... |
void main(int argc, char *argv[]) |
{
|
} |
The most important fields in the descriptors are:
|
Thread Destruction: |
When the function (declared as a thread) reaches its end, it is removed from the scheduling queue and from the parent thread's list of children. If that thread has no children it wishes to wait for, its descriptor and stack space will be reclaimed and the thread will be physically destroyed. Otherwise, the destruction stage will be postponed until all its children die. There are the following important aspects of this scheme of thread destruction:
|
All threads are scheduled according to their priorities. |
The priority range is between 0 and 31, where 31 is the highest priority. |
Each priority has its queue of ready threads, which can be empty (no threads of this priority ready to run). |
The TCE scheduler always chooses the highest priority thread to run. |
If there is more than one thread in the highest priority queue, these threads are scheduled in round-robin fashion. |
When all threads are blocked -- meaning there is no thread ready to be executed -- TCE starts its "Idle" thread. That thread just waits for an external event (such as a message or an OS signal). If that event causes one or more of the blocked threads to become ready, the "Idle" thread is stopped and the scheduler selects the ready thread with the highest priority. |
A newly created thread inherits its parent's priority. It can be changed at any time by calling tce_thrd_priority with a new priority for a given thread: |
void main(int argc, char *argv[]) |
{
|
} |
The first thread gets priority 0. |
In the TCE environment the user has a choice between preemptive and non-preemptive multithreading. |
Non-preemptive scheduling is the default. The scheduler is then invoked only when the currently running thread blocks itself or explicitly yields control (tce_thrd_yield()). If a new thread is being started, the state of the "old" thread (its registers and stack pointer) is preserved in the thread's descriptor. Similarly, the context of the "new" thread is loaded from its descriptor into the processor registers. This operation is called a context switch. Every time a thread is scheduled again, it restarts its execution where it was stopped, with an identical processor state. |
Preemptive scheduling can be activated by calling tce_preempt_on(). After that call, the current thread can be preempted under two possible conditions: |
A higher priority thread becomes ready
|
void main(int argc, char *argv[]) |
{
|
} |
Once started, preemptive scheduling can be deactivated by calling tce_preempt_off. During program execution, preemption can be switched on and off freely.
|
Non-preemptive execution is in most cases a more efficient way to carry out computation:
|
Preemptive execution is often required to satisfy the following demands:
|
TCE offers user-level multithreading, which is advantageous where efficiency is concerned, but it also has a drawback -- a potentially blocking I/O operation will block the whole process, instead of only the thread that issued the I/O request. |
To cope with this problem, TCE translates blocking I/O calls into non-blocking ones. A thread that requests such an operation is then blocked (by the scheduler) until the underlying Operating System signals that the non-blocking operation has completed. |
However, the presented solution is not general enough. On some Operating Systems, certain I/O system calls have no non-blocking (asynchronous) variants. Fortunately, all relevant OSs on workstations and MIMD parallel machines provide non-blocking interfaces for accessing terminals and network devices. |
This is why programs using TCE should always use TCE calls to interact with the keyboard or the network. |
In order to synchronize threads' activities, TCE provides operations on semaphores. |
A semaphore is an object which has to be created by the tce_sem_init function. After initialization, three additional operations are available:
|
TCE_Sem *sem,*wait; |
... |
void func1(p1) |
{
|
} |
void func2(p2) |
{ ...
|
} |
void main(int argc, char *argv[]) |
{
|
} |
There is an additional synchronization mechanism in TCE, which allows a parent thread to synchronize with its children at the moment of their termination. The tce_thrd_join function will block the parent thread until the specified child terminates its execution: |
void func(p1) |
{
|
} |
void main(int argc, char *argv[]) |
{
|
} |
Communication operators concern threads belonging to different processes. Threads within the same address space should use global data structures instead of sending messages (the latter is still possible but less efficient). |
TCE provides only one mechanism for inter-process communication -- message passing. |
To send or receive messages, a communication object called channel has to be established. |
The channel can be set up by presenting a pair of ports to the TCE kernel. When two matching pairs are found, the channel will be constructed. |
One channel connects exactly two processes. It is up to those processes to decide which of their threads will be writing and/or reading to/from that channel. |
Ports are processes' points of contact; there are four standard ports for each process:
|
Besides those standard ports, an application may have any number of sending ports, which can be treated as images of other processes' TCE_InPorts. |
Ports are obtained when a new process is created, or by receiving them in a message. In the following example, the SendingPorts array will be filled with ports created as a result of starting NumOfNodes copies of the SlaveName program on the Nodes nodes. |
#define NumOfNodes 8 |
#define SlaveName "/usr/janusz/slave_program" |
TCE_Port *SendingPorts[NumOfNodes]; |
TCE_Chn *Channels[NumOfNodes]; |
... |
n = tce_exec(SendingPorts,Nodes,SlaveName,TCE_InPort,NumOfNodes); |
tce_chn_init(Channels,TCE_InPort,SendingPorts,n); |
for(i = 0; i < n; i++)
|
... |
Using those ports, the parent process can now establish a dedicated channel to each of its children by calling tce_chn_init. Through the channels it can pass the sending ports to all its children (tce_chn_snd). |
There is another way to obtain a sending port -- receive it in a message. This way both the sender and the receiver of that message have the same sending port pointing at the same receiving port.
|
Ports are essential ingredients of the TCE model. They allow one to decouple the verification and authentication processes from regular operations on channels. By controlling access to its receiving port, a thread can prevent other, possibly malicious, threads from tampering with its internal state. As we will see later, one is permitted to create more than one channel per connection, as well as to specify whether the channel is two-way, one-way or null. |
To create a channel, a pair of receiving and sending ports has to be given to tce_chn_init. Then a matching request posted by the other process has to be found. To simplify the process of creating channels, tce_chn_init accepts an array of sending ports and returns an array of established channels: |
... |
tce_chn_init(Channels,TCE_InPort,SendingPorts,n); |
... |
If the Null port is given instead of the receiving or sending port, TCE will create a one-way channel. If both ports are replaced by Null ports, a null ("faked") channel is returned. Operations on non-existing channel directions are null functions. For example, a write operation on a channel with only the receive direction enabled will return immediately, reporting 0 items sent. |
Similarly to ports, a channel can be passed in a message to another process. The process that posts this operation, however, will lose its access to that channel (a channel can link only two threads). |
There are only three basic calls which operate on a channel:
|
tce_chn_snd -- sends a message,
|
... |
blen = tce_chn_snd(Channels[0],buffer,len,FMT_Int); |
... |
tce_chn_rcv -- receives a message on the channel,
|
... |
blen = tce_chn_rcv(Channels[0],buffer,len,FMT_Int); |
... |
tce_chn_set -- imposes different characteristics on the channel
|
... |
res = tce_chn_set(Channels[0],TCE_CHN_SYNC,1); |
... |
Channels provide peer-to-peer communication links. To allow a collective interaction without specifying every process individually, an object called a channel set is provided. |
To create the channel set object, the function tce_chnset_init has to be called. It takes an array of channels over which the channel set is to be spanned. All processes having access to those channels must participate (submitting identical arrays of channels). |
TCE_ChnSet *ChnSet; |
... |
tce_chnset_init(&ChnSet,Channels,n); |
... |
The following calls operate on channel sets. In order to complete, all processes which are part of a particular channel set have to submit matching operations:
|
#define len 1024 |
#define circular 1 |
... |
extern void comp(int *, int *, int ); |
TCE_ChnSet *ChnSet; |
int buffer[len],inbuf[len],outbuf[len]; |
int src = 0,step = 1; |
... |
tce_chnset_sync(ChnSet); |
tce_chnset_bcast(ChnSet,src,buffer,len,FMT_Int); |
tce_chnset_gather(ChnSet,src,inbuf,outbuf,len,FMT_Int); |
tce_chnset_concat(ChnSet,inbuf,outbuf,len,FMT_Int); |
tce_chnset_scatter(ChnSet,src,inbuf,outbuf,len,FMT_Int); |
tce_chnset_reduce(ChnSet,src,comp,inbuf,outbuf,len ,FMT_Int); |
tce_chnset_combine(ChnSet,comp,inbuf,outbuf,len,FMT_Int); |
tce_chnset_prefix(ChnSet,comp,inbuf,outbuf,len,FMT_Int); |
tce_chnset_shift(ChnSet,circular,step,inbuf,outbuf,len,FMT_Int); |
... |
All threads within a process can access channels and channel sets. The semantics of that access is blocking: when a requested operation cannot be completed, the thread which issued it will be blocked. It will be made ready again when the operation has completed or can proceed. This is in contrast to non-blocking or asynchronous operations. |
Since internally TCE uses those non-blocking calls, the performance of thread-based communication should be slightly worse than using non-blocking calls directly. This is caused by the context-switching costs, which have to be added to each non-blocking invocation that results in blocking a thread. |
In the majority of cases, however, this disadvantage is greatly outweighed by the clean communication semantics and the better overlap between communication and computation (greater concurrency). This approach is particularly well suited to taking advantage of dedicated communication co-processors, or of more than one CPU per node. |
While using channels, the user can request different communication modes:
|
Different modes can be used to better match the nature of an application or to obtain performance gains. |
To change channel characteristics, both communicating threads have to call tce_chn_set with identical parameters:
|
tce_chn_set can also be used to specify a tag for sent or received messages. In this case, the participation of the other process is not needed. When a thread blocks on a channel, the tag is preserved, even if a different thread changes that tag in the meantime.
|
Just as with channels, channel sets have an operation called tce_chnset_set. This call helps to distinguish between different collective operations in progress, by defining different tags for them. |
... |
res = tce_chnset_set(ChnSet,TCE_CHNSET_TAG,13); |
... |
Each TCE call which sends or receives a buffer requires the user to submit the basic type of the buffer items. When a message has to be sent to a machine which has a different byte order, the user buffer is XDR-encoded. XDR decoding then takes place at the receiving side, and the thread gets the buffer in the native data format. |
The program can enforce no encoding at all by defining the data type as FMT_Byte. |
TCE detects automatically whether XDR translation is necessary; thus an application should always specify the correct data type. |
TCE channels and channel sets can be spanned over different parallel MIMD machines. The translation between the native communication systems is performed internally by the TCE kernel. |
To support the distributed computing style, we need to support client-server interactions. In TCE this is possible by implementing a process (or a collection of processes) as a resident server. That server registers its ports under predefined names. Any client process can then obtain those ports (requesting them by name) and build a channel to the server. When the client-server interaction is completed, the channel is closed. |
The multithreaded support greatly helps in building such servers, since one server thread can be dedicated to accepting new requests, while others work on the transactions already in progress. At the client side, multiple threads make it possible to interact with many servers concurrently. |
The preemptive scheduling ensures that all clients will be treated fairly, despite load fluctuations. |
The popular RPC communication can be implemented as a pair of consecutive tce_chn_snd and tce_chn_rcv calls. |
TCE is well suited to support parallel programming. The applied host-node style allows supporting both data-parallel and functional-parallel programming paradigms. |
The higher-level communication masks differences between message-passing systems on different distributed architectures. Thus, programs written for one collection of distributed nodes should run without any changes on a different collection of nodes. |
Shared-memory parallel machines can be naturally accommodated through the multithreading interface. |
Multithreaded support allows an efficient parallel execution, by masking latencies and communication delays. |
A user definable scheduling policy as well as different communication styles on channels provide a way to optimally map an application onto the underlying architecture. |
Multitasking Object-oriented Visual Interactive Environment (MOVIE) is a system of interpreters which interpret programs written in MovieScript (extended PostScript). |
The MOVIE interpreter acts like a virtual processor, executing MovieScript programs in a machine and processor independent manner. |
The MOVIE network model, as well as its thread model and scheduling, is very similar to TCE. As a matter of fact, the development of MOVIE laid the foundations for developing TCE. The basic difference between MOVIE and TCE is that MOVIE is an interpreted, stand-alone system while TCE is a compiled library. |
The unit of execution in MOVIE is an interpreted thread, which is part of an interpreted task. Similarly to a TCE process, a task is only a passive envelope for its threads. It contains three standard ports together with other resources shared by its threads. |
A MOVIE thread is an active entity whose code is interpreted by the MovieScript interpreter. |
Each thread contains its descriptor, and three stacks:
|
During a context switch, instead of swapping the contents of registers, the pointers to those three stacks are swapped. After that, the interpreter starts interpreting the new thread. |
When a new task is started, a thread is automatically created within it. All other threads have to be explicitly created by the fork operator. That operator copies the contents of the parent's stacks onto the new thread's stacks. This allows new threads to inherit the parent's context (like global variables in TCE). All subsequent operations on the stacks are private to each thread, invisible to the other party (like local variables in TCE). |
Since in the MOVIE system the thread's code is machine independent, it can be sent like any other message to be interpreted (executed). Because of this, a thread can be created without its code, or it can remain active even when there is no code for it. To let the MOVIE kernel know that such a "daemon" thread is requested, the thread has to declare one of its channels as a "code" channel. The lack of code will then block that thread on this channel, and any messages delivered to that channel will be treated as code, making the blocked thread ready.
|
The MOVIE communication model is very similar to the previously discussed TCE model. Standard, task-wide ports are used to construct inter-task channels, and these are the base for constructing channel sets. Different modes of communication can be imposed on channels. Unlike in TCE, however, in MOVIE there is no need to instruct the interpreter what kind of data is being sent or received. The system is object-oriented, so this information is already available to the network subsystem. Instead of a poorly formatted chunk of memory, a formatted object is sent. At the receiving end, a copy of that object is reconstructed and passed to the appropriate channel. |
The MOVIE system is preemptive at the interpreted level. This means that no C-level code is preempted. A context switch can take place between two operators, when the interpreter discovers that either the I/O or the clock flag has been set. This solution is more efficient and allows a simpler implementation. The drawback is that the system cannot react immediately while executing an operator, thus some required time constraints may not be met. |
So far we have discussed a compiled library and an interpreted system. Both of them benefit from the multithreaded ingredient. |
For interpreted environments the multithreading support can be implemented in two ways:
|
The example of the first approach is MOVIE; of the second, the HotJava browser. Comparing the two, we can notice the following aspects: |
Multithreading at the interpreted level only is:
|
The two-level multithreading is more responsive (when based on preemptive scheduling) and capable of supporting time-critical operations. It can take advantage of multiprocessing hardware. |
A very practical example of interpreted systems which can benefit from a multithreaded communication model is provided by HTTP servers and clients. |
A WWW server interprets the HTTP protocol while responding to clients' requests.
|
An even more scalable approach is to have a mesh of WWW servers, which can efficiently support more clients while providing a shared, often computationally intensive Web space for client interactions. Both preemptive multithreading and efficient, higher-level communication are essential to implement this kind of service.
|
The HotJava browser uses two-level multithreading. The main reason for this choice was the dual nature of code execution:
|
The main distinction between the HotJava browser and other browsers (Mosaic, for example) is that HotJava can acquire any information from a server without being aware of the information's type or structure. The necessary tools for presenting this information will be transferred from the server, linked with the HotJava runtime, and started as interpreted or compiled threads. |