Using Solaris Threads: Java Runtime Experiences

Pavani Diwanji (pavani@eng.sun.com)

Please do not redistribute or print this paper without the permission of the author.

ABSTRACT

This paper describes the limitations encountered in native Solaris threads while using them for the Java runtime environment. The three main areas that the paper discusses are thread priorities, garbage collection support, and interrupting or aborting threads out of blocking system calls. The paper describes in detail the short-term workarounds applied to these issues and the long-term solutions that are currently being worked on in cooperation with the multi-threading OS group. It also briefly discusses the impact of these issues on current and future applications and why it is important to remedy them.

0. OVERVIEW

In the first section, the paper gives some background by briefly discussing the basic Java thread model, the Java runtime environment and how the Java runtime uses Solaris threads. Section 2 talks about priority-related issues, including how the current two-level thread model in Solaris can make thread priorities ineffective. In section 3, the paper describes how garbage collection in Java interacts with threads and the issues encountered in that area when running on Solaris threads. The fourth section describes in detail the need for, and the problems encountered in, aborting threads out of lengthy blocking system calls. Lastly, the paper concludes by describing the current state of Java running on Solaris threads.

1. Java threads on Solaris

Java supports threads at the language level. Thread is a class from which you can instantiate new thread objects. The Thread class provides a rich collection of methods to start, run, stop and query the state of Java threads. Java threads are scheduled with preemptive priorities: a Java thread will continue to run until it blocks, yields control or is preempted because a higher priority thread becomes runnable. Java also has integrated thread synchronization through the use of re-entrant monitors. Every class and instantiated object has its own monitor. At the language level, methods that need to be thread-safe can be declared synchronized. Java threads can do a wait or a timed wait on a monitor. Waiting threads can be notified using the notify or notifyAll (broadcast) method of the object whose monitor they are waiting on.
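As a minimal sketch of the monitor model just described, the following one-slot mailbox uses synchronized methods together with wait and notifyAll. The class name Mailbox and its methods are hypothetical and chosen only for illustration; they are not part of the Java runtime.

```java
// Illustrative one-slot mailbox; Mailbox, put, take and demo are hypothetical names.
class Mailbox {
    private Integer slot = null;   // null means the slot is empty

    // Blocks while the slot is full, then stores the value.
    public synchronized void put(int value) throws InterruptedException {
        while (slot != null) {
            wait();                // releases this object's monitor while waiting
        }
        slot = value;
        notifyAll();               // wake any thread waiting in take()
    }

    // Blocks while the slot is empty, then removes and returns the value.
    public synchronized int take() throws InterruptedException {
        while (slot == null) {
            wait();
        }
        int value = slot;
        slot = null;
        notifyAll();               // wake any thread waiting in put()
        return value;
    }

    // Sends 1..count from a producer thread to the calling thread; returns the sum received.
    static int demo(int count) {
        Mailbox box = new Mailbox();
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= count; i++) {
                    box.put(i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int sum = 0;
        try {
            for (int i = 0; i < count; i++) {
                sum += box.take();
            }
            producer.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(demo(10));
    }
}
```

Both methods re-test their condition in a loop after wait returns, since a notified thread must re-enter the monitor and may find that another thread has changed the slot in the meantime.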

The figure given below illustrates the layering of the components in the Java runtime environment. The environment is cleanly divided into system-specific and system-independent portions separated by a host programming interface. This paper covers the issues encountered while using Solaris threads in the thread and system support layer below the host programming interface. Java threads are part of the Java class libraries and are layered upon the infrastructure provided by the thread and system support layer.

The current release of the Java runtime (version 1.0) on Solaris runs using a user-level threads library called "green threads" as part of the thread and system support layer. Because the green threads library is purely user-level, the Java runtime atop it is a single-threaded process as far as the system is concerned.

There were two main motivations for trying to replace the green threads library with native Solaris threads. One was to get true multiprocessing, i.e., on a multi-processor machine Java threads should be able to make use of multiple processors. The second was to get away from using nonblocking I/O. With green threads, any blocking system call made by the Java runtime or by any shared library can potentially block the whole process. As a result there is an I/O manager which shadows (overrides) all possible blocking system calls and converts them to equivalent nonblocking calls. This piece of code is complicated, and it needs to anticipate every blocking call that a piece of native code can make. If a shared library makes any blocking call that the I/O manager has not shadowed, the whole Java process will block. By moving to Solaris threads, we can get rid of the nonblocking I/O manager.

In the Java runtime using the native Solaris threads library, every Java thread maps onto an unbound user-level Solaris thread [3]. Java monitors are implemented using a single native mutex to guard the meta-state of the monitor and two condition variables. One condition variable is used by the threads waiting to enter the monitor. The second one is used by threads explicitly waiting on the monitor, i.e., executing the wait or the timed wait method. Because Solaris mutexes are not re-entrant, we could not simply map a monitor entry to a mutex lock; re-entering a monitor would deadlock the application.
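The monitor construction described above can be sketched roughly as follows. This is an illustration in Java rather than the runtime's actual native code: SketchMonitor and its method names are hypothetical, and java.util.concurrent's ReentrantLock merely stands in for the native mutex (the sketch never relies on its re-entrancy, since the lock only guards the owner and depth fields).

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Sketch only; SketchMonitor and its methods are hypothetical names.
class SketchMonitor {
    // Stands in for the native mutex. It guards only the fields below and is
    // never re-acquired by a thread that already holds it.
    private final ReentrantLock mutex = new ReentrantLock();
    private final Condition entryQueue = mutex.newCondition(); // waiting to enter
    private final Condition waitQueue = mutex.newCondition();  // inside monitorWait()
    private Thread owner = null;  // current monitor owner, or null
    private int depth = 0;        // how many times the owner has entered

    public void enter() {
        Thread me = Thread.currentThread();
        mutex.lock();
        try {
            if (owner == me) {    // re-entry: just bump the count
                depth++;
                return;
            }
            while (owner != null) {
                entryQueue.awaitUninterruptibly();
            }
            owner = me;
            depth = 1;
        } finally {
            mutex.unlock();
        }
    }

    public void exit() {
        mutex.lock();
        try {
            checkOwner();
            if (--depth == 0) {   // outermost exit releases the monitor
                owner = null;
                entryQueue.signal();
            }
        } finally {
            mutex.unlock();
        }
    }

    // Releases the monitor fully, waits for a notification, then re-enters with
    // the saved entry count. Callers should re-check their condition in a loop.
    public void monitorWait() {
        mutex.lock();
        try {
            checkOwner();
            int savedDepth = depth;
            owner = null;
            depth = 0;
            entryQueue.signal();
            waitQueue.awaitUninterruptibly();
            while (owner != null) {
                entryQueue.awaitUninterruptibly();
            }
            owner = Thread.currentThread();
            depth = savedDepth;
        } finally {
            mutex.unlock();
        }
    }

    public void monitorNotifyAll() {
        mutex.lock();
        try {
            checkOwner();
            waitQueue.signalAll();
        } finally {
            mutex.unlock();
        }
    }

    private void checkOwner() {
        if (owner != Thread.currentThread()) throw new IllegalMonitorStateException();
    }

    // Each thread enters the monitor twice (nested) around every increment.
    static int demo(int threads, int perThread) {
        SketchMonitor m = new SketchMonitor();
        int[] counter = {0};
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    m.enter();
                    m.enter();    // re-entry by the owner must not deadlock
                    counter[0]++;
                    m.exit();
                    m.exit();
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return counter[0];
    }
}
```

The owner and depth fields are what make the monitor re-entrant: the mutex is held only for the few instructions that inspect or update them, never for the duration of the monitor's critical section.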

2. Thread Priorities

In Solaris, the user-level threads library scheduler controls the scheduling of the application threads onto lwps, but it has no knowledge of or control over the scheduling of the lwps beneath. Since the underlying lwps are timesliced, one result of this design is that an lwp running a high priority thread can be preempted when its time slice is up, and a lower priority thread can be run in its place. This breaks one of the basic tenets of the Java threads model, which says that if there are n available processors in the system, the n highest priority threads will be running on them at any given time. It is important for us to preserve this model of strict priorities. Currently the most popular application using Java is the web browser HotJava, which supports embedded applets that use high priority threads to do multimedia (audio and graphics animation) and expect them to be scheduled and run before other threads of lower priority.

Running threads as bound threads belonging to the real-time scheduling class as a workaround was not an option for Java since we could not require our users to have root privileges. Scheduler activations [4] or Threads Control [5] offer solutions to this problem.

Support for strict priorities on other platforms such as Win32 (Windows NT and Windows 95) is also insufficient. The range of distinct priority levels that Win32 supports is smaller than the range of priorities available to Java threads.

One of the other features of the green threads library that is not supported in Solaris threads is a fix for the priority inversion problem [6]. Priority inversion is a common problem that occurs when a lower priority thread blocks the execution of a higher priority thread as threads compete for shared resources like monitors. The green threads library implements the priority inheritance scheme for dealing with the priority inversion problem. When a higher priority thread attempting to grab a monitor held by a lower priority thread blocks, the lower priority monitor owner thread inherits the priority of the blocked thread until it releases the monitor. The native Solaris threads library currently does not support priority inheritance. The Posix threads specification defines priority inheritance as an optional feature, but the Posix thread library on Solaris currently does not support this feature. In order for Java threads to be able to deal with the priority inversion problem, it is necessary for this optional feature to be supported by the native threads library on Solaris.
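The priority inheritance scheme described above can be illustrated with a small sketch. InheritingLock is a hypothetical, non-re-entrant lock written for this example, not the green threads implementation; it uses the standard Thread.getPriority and Thread.setPriority calls, and on many platforms Java priorities are only hints to the underlying scheduler, so the sketch shows the bookkeeping rather than a scheduling guarantee.

```java
// Sketch only; InheritingLock is a hypothetical, non-re-entrant lock.
class InheritingLock {
    private Thread owner = null;
    private int ownerBasePriority;   // owner's priority at the time it took the lock

    public synchronized void lock() {
        Thread me = Thread.currentThread();
        while (owner != null) {
            // Priority inheritance: lend our priority to a lower priority owner.
            if (me.getPriority() > owner.getPriority()) {
                owner.setPriority(me.getPriority());
            }
            try {
                wait();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        owner = me;
        ownerBasePriority = me.getPriority();
    }

    public synchronized void unlock() {
        owner.setPriority(ownerBasePriority);   // drop any inherited priority
        owner = null;
        notifyAll();
    }

    // Current owner's priority, or -1 if the lock is free.
    public synchronized int ownerPriority() {
        return owner == null ? -1 : owner.getPriority();
    }

    // Returns {priority of the low thread while a high thread waits,
    //          priority of the low thread after it unlocks}.
    static int[] demo() {
        InheritingLock lock = new InheritingLock();
        int[] seen = new int[2];
        Thread low = new Thread(() -> {
            lock.lock();
            // Hold the lock until a waiter has lent us a higher priority.
            while (lock.ownerPriority() == Thread.MIN_PRIORITY) {
                Thread.yield();
            }
            seen[0] = Thread.currentThread().getPriority();
            lock.unlock();
            seen[1] = Thread.currentThread().getPriority();
        });
        low.setPriority(Thread.MIN_PRIORITY);
        Thread high = new Thread(() -> {
            lock.lock();
            lock.unlock();
        });
        high.setPriority(Thread.MAX_PRIORITY);
        low.start();
        while (lock.ownerPriority() == -1) {
            Thread.yield();          // wait until the low priority thread owns the lock
        }
        high.start();
        try {
            low.join();
            high.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }
}
```

While the high priority thread is blocked, the low priority owner runs with the borrowed priority, so a medium priority thread could not starve it and indirectly delay the high priority thread.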

3. Garbage Collection

Automatic garbage collection is an integral part of the Java language and the runtime system. Once the programmer has allocated an object, the runtime automatically keeps track of the object's status and reclaims its memory when the object is no longer in use, freeing that memory for future use. The Java runtime system has two ways of collecting garbage: synchronous and asynchronous. Synchronous garbage collection is done explicitly by the encompassing application, e.g. when the system runs out of memory or the user requests it. The problem with synchronous garbage collection is that it can cause poor interactive responsiveness if done at the wrong times. The Java runtime solves this problem by running a background garbage collector in a lowest priority thread called the idle thread. Since the Java thread model advocates strict priorities, the idle thread will only run when there are no higher priority threads waiting to be run. The running of this idle thread is taken as an indication that the system is idle and that it is an appropriate time to do garbage collection. If another thread becomes runnable while the garbage collector is running, the collector backs out and shuts down gracefully and quickly to let the other threads run. For the existing asynchronous garbage collector to work with Solaris threads, we need the strict priority guarantee to hold and we need the ability to ask the scheduler whether any other thread is waiting to be run. This problem surfaces on the Win32 platform as well, because there is no way to find out whether any runnable threads are waiting to be run. In the absence of any mechanism to detect whether threads are waiting to run, we have had to disable asynchronous garbage collection in the Solaris threads and Win32 versions of Java. Instead these systems rely solely on Java's synchronous garbage collection, which is invoked under the control of the encompassing application.

The garbage collection operation in the Java runtime is a simple "stop and collect" operation. It requires pausing all the threads (except the one performing the garbage collection) before garbage collection starts and resuming them afterwards. It also needs to get hold of the stack pointer of each thread and requires that the register state for each thread be flushed out onto the stack. The garbage collector could not use the Solaris threads suspend call (thr_suspend) to pause the threads as it was not synchronous[1]: the suspend call returns before the thread is actually paused (Win32 threads also suffer from this asynchronous suspend bug).

The next solution we explored was a signal-based solution to stop all the threads, since delivering a signal forces the registers to be flushed onto the stack. The idea was to send a signal to each thread in turn to stop it. Each thread, after acknowledging receipt of the pause signal, waits on a condition variable inside the signal handler until the garbage collection is over. Waiting threads are signalled when the garbage collection completes. The first attempt used SIGUSR1 as the pause signal. The problem with using SIGUSR1 is that it is a maskable signal: we ran into a bug in the RPC library code, which masked all signals while waiting for network I/O to occur. The next step was to try using SIGLWP to stop the threads. SIGLWP is a special signal that threads are not allowed to mask. libthread internally uses SIGLWP for thread preemption, so we had to override the SIGLWP signal handler. The overriding signal handler passed the signal on to the thread library if no garbage collection was in progress. It was by no means a clean solution, and it ultimately did not work: we tripped on another serious bug where a Solaris thread doing a cond_wait, on getting a signal, had to reacquire the lock associated with the condition variable before calling the signal handler. If this lock was held by a thread that had already been stopped, the result was a deadlock. Since the fix required a patch to the thread library, in the short term we decided to explore a /proc-based solution.

The solution that finally worked for us involved using the /proc API to get a list of the threads currently bound to lwps, suspending them using the lwp_suspend call, and then waiting until all the lwps have been suspended before starting the garbage collection. Hans Boehm at Xerox PARC originally came up with this solution. We use the /proc API again to check the status of the lwps after trying to suspend them and, at the same time, to get the register state for each of them. The interesting challenge that we discovered here was getting the stack pointer of those threads that were not currently running on any lwp. We solved the problem by storing away the stack pointer before making any call in which the thread can possibly get preempted. This solution again is not clean, as it requires shadowing calls such as condition wait, lock etc.

Longer term, the Java runtime needs to use a more sophisticated garbage collection scheme which will not require pausing all the threads. Though the current /proc-based solution for the garbage collection problems is working reliably, it is a hack, as it relies on internal knowledge of the threads library. The Solaris threads library needs to provide a set of APIs that better support garbage collection. This set of APIs should include the ability to suspend a thread synchronously and to cleanly get the current values of the stack pointer and other registers.

4. Thread interruption: aborting blocking calls

Thread interruption and cancellation is crucial functionality in today's multi-threaded applications. Java threads can be aborted out of a blocking call through two mechanisms: interrupt and stop.

The Java thread interrupt mechanism is based on the Modula-3 alerts [7]. An interrupt is an indication to a thread that it should cease what it is doing. Interrupts are typically used to terminate a long running computation or to force a thread to stop waiting for I/O or a condition variable. For example, when a user clicks the Cancel button in HotJava, the thread currently doing the operation is interrupted. Any Java thread can interrupt any other Java thread in the same group by calling the Thread.interrupt() method. If a thread is performing a long CPU-bound operation, it can periodically test whether it has been interrupted by using the Thread.isInterrupted() method. This gives the thread an opportunity to pick a convenient time for stopping. When a thread is blocked in I/O or a condition wait, it cannot check whether it has been interrupted. In such cases, an exception is raised to interrupt the thread.
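The two ways an interrupt reaches a thread can be sketched as follows. InterruptDemo and its demo method are hypothetical names used only for this illustration; Thread.interrupt(), Thread.isInterrupted() and InterruptedException are the standard Java API. The blocked thread here waits on a monitor rather than on I/O, which keeps the example self-contained.

```java
import java.util.concurrent.CountDownLatch;

// Sketch only; InterruptDemo and demo are hypothetical names.
class InterruptDemo {
    // Returns how each of the two threads learned that it had been interrupted.
    static String[] demo() {
        String[] outcome = new String[2];
        CountDownLatch started = new CountDownLatch(2);
        Object monitor = new Object();

        // CPU-bound thread: polls its interrupt status at a point it chooses.
        Thread cpuBound = new Thread(() -> {
            started.countDown();
            long work = 0;
            while (!Thread.currentThread().isInterrupted()) {
                work++;                   // stands in for one step of a long computation
            }
            outcome[0] = "stopped by polling";
        });

        // Blocked thread: cannot poll, so the interrupt arrives as an exception.
        Thread blocked = new Thread(() -> {
            started.countDown();
            synchronized (monitor) {
                try {
                    while (true) {
                        monitor.wait();   // nobody ever notifies this monitor
                    }
                } catch (InterruptedException e) {
                    outcome[1] = "InterruptedException";
                }
            }
        });

        cpuBound.start();
        blocked.start();
        try {
            started.await();              // make sure both threads are running
            cpuBound.interrupt();
            blocked.interrupt();
            cpuBound.join();
            blocked.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return outcome;
    }

    public static void main(String[] args) {
        String[] outcome = demo();
        System.out.println(outcome[0] + ", " + outcome[1]);
    }
}
```

If the interrupt arrives before the second thread has reached wait, the pending interrupt status makes wait throw as soon as it is called, so the thread still ends up in the exception handler.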

The Java thread stop mechanism is used for thread cancellation. It is a harsher mechanism: when a thread is stopped it receives a thread death exception regardless of whatever it is currently executing.

The implementation of the interrupt and stop mechanisms for the user-level green threads involves setting an interrupt or a death flag on the thread. The internal Java runtime code checks the flag at safe points, and when the flag is found to be set, the appropriate exception (Interrupted or ThreadDeath exception) is raised in the thread. This works easily with the user-level green threads library on Solaris because we have complete control over the blocking synchronization primitives and convert all the blocking system calls into corresponding nonblocking calls, as explained in section 1.

When using Solaris threads, the main problem turned out to be aborting threads blocked in long-running blocking calls like read, recv, accept etc. so that an exception can be raised. The first solution involved the use of a signal to kick the thread out of the blocking system call after setting the thread interrupt or death flag. Let us assume that Thread A is trying to read the http response from the network when the user hits the stop button. Then:

Thread A:

1. while (!threadA.exception-flag)

2. err = read(fd);

Thread B: Browser <stop> button is hit.

1. Set an exception flag, e.g. the thread death flag, on thread A (threadA.exception-flag = true)

2. Use thr_kill to send a signal to thread A to kick it out of the blocked system call.

The proposed solution works if the thread is already in the read when the signal occurs, i.e., if the order of execution is A1, A2, B1, B2. The solution does not work if the signal is received by thread A after the check and before the read, i.e., if the order of execution is A1, B1, B2, A2: the signal has already been delivered and handled by the time the read begins, so nothing remains to kick thread A out of it. Protecting the operations with encompassing locks does not work either. For example, using {lock, A1, A2, unlock} and {lock, B1, B2, unlock} will simply prevent thread B from signalling while thread A is in the read. Hence the purpose of being able to cancel or interrupt a thread blocked in a long-running call is not achieved.

A workaround that is currently in place uses the setjmp and longjmp calls to get around the described race condition. Thread B still follows the same logic, but thread A does a setjmp call before going into the read. When the signal is received, the signal handler does a longjmp. The longjmp indicates that the thread has been interrupted and takes the thread past the read call, whether or not the read had already started.

A longer term solution for thread stopping is to move to using the Posix threads API so that we can use the Posix thread cancellation mechanism. The Posix thread cancellation mechanism defines the blocking system calls as implicit cancellation points. This will solve our thread stopping problem in a clean fashion, but we still need the solution described above to interrupt a thread without having to cancel it.

5. CONCLUSIONS

Through examining some interesting problems encountered by Java and Java applications, this paper hopes to drive the effort for ongoing work in the Solaris threads library like support for thread control and extensions for garbage collection. It also provides strong justifications for future work like supporting priority inheritance and making thread interruption easier for the user. Currently we have a version of the Java runtime running atop Solaris threads with user level workarounds for the garbage collection and the thread interruption issues as described above.

6. Acknowledgments

Firstly I want to thank Devang Shah, without whose help and support this project would have been long abandoned. I also want to thank Chris Warth and Tim Lindholm for providing me with a starting point. I also am grateful for continued guidance and comments provided by Tim Lindholm, Steve Kleiman, Arthur Van Hoff, Dave Connelly and Stuart Cheshire.

7. References

[1] James Gosling, Henry McGlinton. Java language environment - A white paper.
[2] James Gosling, Bill Joy, Guy Steele. Java language specification.
[3] Powell, Kleiman, Barton, Shah, Stein and Weeks. SunOS Multi-thread architecture.
[4] Anderson, Bershad, Lazowska and Levy. Scheduler Activations: Effective kernel support for user level management of parallelism.
[5] Andy Tucker. Thread control in Solaris.
[6] Steve Kleiman, Devang Shah and Bart Smaalders. Programming with threads.
[7] Samuel Harbison. Modula-3.

Footnote [1]: thr_suspend has been fixed in Solaris 2.5 to be synchronous.