HPCWire article on LSF

LSF OFFERS UTOPIA FOR SOCIOPHOBIC HPC WORKSTATIONS

In the past few years, growing numbers of industrial companies have boldly taken the leap into the "downsized" universe of RISC-powered, UNIX-operated workstations. Several of the more aggressive innovators have accumulated hundreds or even thousands of workstations, spread across LANs and WANs. As opposed to massively parallel or even vector "supercomputers", which are still the exclusive domain of the rich and adventurous, the RISC workstation appears to be carving out a genuine market in the lean, mean business world of the Stingy Nineties.

In most cases, the decision to take the RISC has been determined by the quest for "speed" and "power". Taken together with the performance per dollar figures put forward by the vendors, these often make for an irresistible "cost-effectiveness" and "competitive advantage" pitch to the hard-nosed executives who hold the purse strings. Chances are, however, that these executives aren't so pleased one or two years later -- when they are hard pressed to see how "downsizing" improves their bottom line. The savings due to the elimination of those outrageous utility bills for the old mainframe will be roughly cancelled by the yearly upgrades which the "power users" absolutely must have to "do their job".

So, what's the problem? Like the personal computer, and unlike the mainframe, the desktop or deskside workstation was designed to be largely autonomous. And the original portions of UNIX, the ones that are basically common to all vendors, reinforce this sociophobic attitude. If you buy all your machines from the same hardware vendor, you can take advantage of more sophisticated, but also more proprietary, systems software to achieve tighter coupling among your systems. Thus, all major vendors offer a range of systems, from X-terminals to powerful multiprocessor compute and file servers, designed to implement master-slave and workgroup computing. In the real world, however, one hardly ever finds homogeneous workstation networks... and simple master-slave computing is hardly ever the answer to real-life workload problems.

Sociophobic workstations have bursty loads, essentially determined by the work habits of their owners. Sherlock Holmes would have a field day deducing the minutest details of a subject's life by monitoring a "perfmeter" on her SUN: when she starts work in the morning, the mood she's in, when she breaks for coffee, goes for lunch, attends meetings, goes on trips etc. This means the "privately owned" workstation is idle or severely underutilized for at least 12-14 hours every weekday, and over most weekends. On the other hand, when the power user does get going, she can saturate "her" machine in an incredibly short time. She might then begin sending resource-hungry tasks to the most powerful server on the LAN or WAN that is accessible to her username and application. This server being a shared resource, however, it too will become saturated in short order, as every workstation user in the department attempts to alleviate the congestion of tasks on his or her "own" machine. Thus, everyone's would-be "interactive" turnaround takes forever, whereas batch queues grow desperately long. Very soon, all the machines are clogged and production is at a halt.

Thus, "downsizing" can only work for industry if one can reproduce the resource-sharing, time-sharing, administrative and monitoring functionality of a single-box mainframe or supercomputer while still maintaining the power and flexibility of a distributed workstation system. We need software that is analogous to the mainframe's operating system, but will function in today's heterogeneous distributed computing environments and can be optimized for each installation's particular circumstances.

If this sounds utopian -- it certainly was, until a few months ago, when the LSF "Load Sharing Facility", developed by Platform Computing Corporation in Toronto, was put on the market. To the surprise and delight of the growing number of companies and institutions who are test-driving and buying it, LSF is actually well on its way to realizing the above ideal. Dan Minior, MIS Manager for Pratt & Whitney in Hartford, Connecticut, says, "LSF is really a breakthrough technology. It's a shift in the way you use equipment, because you can use a lower-cost mix of workstations and get more out of them. We are using LSF on close to a thousand hosts, spreading quickly to the whole company. Our batch job throughput doubled after we switched from NQS to lsbatch. We're able to run our IPCs three times faster under the interactive load-sharing shell of LSF. A structure calculation that took one hour to run on a SPARCstation 2 was trimmed down to only five minutes by spreading the processing load over the cluster. LSF gives you total control of the resources, so you can optimize them for batch, parallel or interactive processing. A lot of people won't believe this until they see it for themselves."

In the words of Lorraine Galvin, Senior Manager of Computing Systems at Bell Northern Research Ltd. in Ottawa, Ontario, "we have been using LSF for both our internal and third party applications. LSF has allowed us to spread workload across machines of various architectures. This has resulted in a 30-40% decrease in response time for some of our applications and allowed us to take better advantage of our heterogeneous computing infrastructure. Some applications are available on only specific platforms - LSF allows (us) to transparently give access to these applications to users on other types of machines. This increase in performance and availability of applications has substantially improved user productivity as well as [our] leverage on limited computing resources... We are basing our corporate-wide vision of our distributed computing environment on LSF. We have been using LSF in production on 500-1000 hosts, and are deploying LSF company wide on several thousand hosts."

Perhaps the most exciting thing about LSF is that each site seems to discover a slightly different set of favorite features. In Western Digital's case, as recounted by Joe Orcino, Application Support Engineer: "In our environment, the design engineers' time is the most valuable resource. We want to make sure they are productive, but can't afford giving each of them all the resources they would like. LSF does two things for us. First, it helps our design engineers to get jobs done faster because they can now get to cycles they weren't able to access before. Second, the resource accounting in LSF helps the management to keep the pulse of the system and to decide whether more computing equipment is necessary to maintain the engineers' productivity."

Richard Fairfield, the Technical Director of the Math Sciences Computing Center at the University of Washington, tells us how LSF helped him to overcome a typical Chief Information Officer's nightmare situation: "We have a number of RISC servers supporting over 1000 active users on X terminals. LSF solved two big problems for us: 1) Users tend to cluster on one or two machines, leaving most compute servers idle while overloading a few. With load sharing login, users are now logged into the least loaded machine. 2) We run applications (such as LISP) which use massive amounts of resources (mainly memory and CPU power). They tend to overload compute servers for short periods even if only a few users are logged on. Using the load sharing shell, lstcsh, 'problem' applications always get run on the least loaded host... I used to spend a lot of time monitoring the system for load problems and dealing with users' load problems individually. Since installing LSF, we have stopped having problems with load balancing."

The LSF software system consists of a kernel, an API, and a set of ready-made utilities. Among the utilities, we note a load-sharing version of the tcsh UNIX shell, a load-sharing parallel version of "gnu-make" and an enhanced, load sharing version of the PVM message passing library. Another utility, lsbatch, is demonstrably the best load sharing batch queueing system on the market today. All these applications take advantage of the comprehensive information gathered by lightweight daemons running on each participating machine in an "elective monarchy" arrangement, which ensures maximum fault tolerance. A plethora of parameters and variables are monitored and compared to configured thresholds to decide whether a given host is suitable for the execution of a given task or application. LSF ensures scalability by organizing hosts into workgroups. Its algorithms for load information distribution and task placement take advantage of this organization. LSF assumes a shared file name space that's uniform throughout a workgroup (implemented by means of NFS and NIS, by means of AFS, etc.). It can take advantage of Kerberos authentication services.
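The threshold-based placement decision described above can be illustrated with a small sketch. This is not LSF's actual algorithm or API: the names HostLoad, is_eligible and pick_host, and the load indices "run_queue" and "mem_used_pct", are hypothetical and chosen only for illustration. The sketch assumes a host is suitable when every monitored index is at or below its configured threshold, and that the least loaded suitable host is preferred.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class HostLoad:
    """Load report for one host (hypothetical structure, not an LSF type)."""
    name: str
    metrics: Dict[str, float]  # e.g. {"run_queue": 0.9, "mem_used_pct": 55.0}


def is_eligible(host: HostLoad, thresholds: Dict[str, float]) -> bool:
    """A host qualifies only if every thresholded index is at or below its limit.

    An index the host did not report is treated as infinitely loaded,
    so a host with incomplete information is never selected.
    """
    return all(
        host.metrics.get(index, float("inf")) <= limit
        for index, limit in thresholds.items()
    )


def pick_host(
    hosts: List[HostLoad], thresholds: Dict[str, float], rank_by: str
) -> Optional[HostLoad]:
    """Return the eligible host with the lowest value of `rank_by`, or None."""
    eligible = [h for h in hosts if is_eligible(h, thresholds)]
    if not eligible:
        return None
    return min(eligible, key=lambda h: h.metrics.get(rank_by, float("inf")))


# Illustrative data only; the host names and numbers are made up.
hosts = [
    HostLoad("alpha", {"run_queue": 0.4, "mem_used_pct": 95.0}),
    HostLoad("beta", {"run_queue": 1.2, "mem_used_pct": 40.0}),
    HostLoad("gamma", {"run_queue": 0.9, "mem_used_pct": 55.0}),
]
thresholds = {"run_queue": 2.0, "mem_used_pct": 80.0}
chosen = pick_host(hosts, thresholds, rank_by="run_queue")
```

With the sample data, "alpha" has the shortest run queue but exceeds the memory threshold, so it is excluded and the choice falls on the least loaded of the remaining hosts. When no host passes every threshold, the function returns None, leaving it to the caller to decide whether to queue the task or run it locally.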

Because lsbatch makes its placement decisions based on exactly the same information as the local and distributed interactive tasks, there is never any conflict between batch and interactive applications requesting the same resources. Of course, every user has the option (subject to management policy) of locking his or her workstation out of the shared pool. Another unique advantage conferred to lsbatch by the LSF kernel is maximum fault tolerance.

Today, LSF is available in Sun-OS, Solaris-2, AIX, HP-UX, Ultrix, OSF-1, ConvexOS, and SGI IRIX versions. This means, for instance, that LSF can integrate DEC Alpha farms, Convex/HP Metacomputers, IBM's SP-1 systems, Sun SparcCenter 2000, or SGI Challenge and Onyx systems into existing heterogeneous workstation environments (and with each other). LSF's batch system supports the checkpointing services provided by the ConvexOS kernel. lsbatch also comes with a Motif-based GUI and can emulate NQS. LSF is well on its way to becoming the de-facto standard for resource sharing between HPC systems, with more and more hardware and software vendors announcing their support.

A conservative cost-benefits analysis (based on customers' production results) reveals that the payback period for a 20-license installation is about five months. It is less than four months for a 500-license installation. The factors that generate the economic benefit are engineers' time savings and equipment savings.

Try executing a few LSF commands for yourself!

Please send email to the address below if you're interested in a 45-day free trial of LSF.

Sergiu Sanielevici, sergio@speedy.hypercomp.ns.ca