Training is the most visible part of the PET program for many CEWES MSRC users. During Year 2, the CEWES MSRC PET training program passed several milestones. Remote classes have become a routine part of our training schedule. Service to remote users was also improved when classes moved to the Training and Education Facility (TEF). The TEF is furnished with professional-quality video production and recording equipment, which has enhanced the Mbone broadcasts and improved the quality of recorded classes in the tape library. The TEF is equipped with twelve SGI Indy R5000 student workstations and an SGI O2 for the instructor. Training material from any source (laptop, workstation, transparencies, etc.) can be projected onto the classroom screen for instruction, broadcast over Mbone, and saved on videotape.
The CEWES MSRC PET training team has worked closely with Jackson State University, the lead HBCU/MI institution in the CEWES MSRC PET effort. JSU has served as a testbed for, and consulted on, the development of the Tango collaborative environment system implemented by Syracuse University for offering remote training to off-site CEWES MSRC users.
Training Curriculum
CEWES MSRC PET training is designed to assist the CEWES MSRC user in transitioning to new programming environments and efficiently using the present and future SPP (Scalable Parallel Processing) hardware acquired under the HPCM program. The training curriculum is a living document with new topics being added continually to keep up with the fast pace of research and development in the field of HPC. The curriculum contains courses in the following general categories.
(a) Parallel programming
(b) Architecture- and software-specific topics
(c) Visualization and performance
(d) CTA-targeted classes, workshops, and forums
Training courses are conducted both on-site at the CEWES MSRC and at remote sites with concentrations of CEWES MSRC users. All material from the training courses is posted on the CEWES MSRC PET website, which also handles training course notices and registration. Table 7 gives a list of all training courses taught during Year 2, together with the evaluation score of each class on a scale of 1 (poor) to 5 (excellent). Course descriptions are included at the end of this section.
Internet-Based Training Workshop
The CEWES MSRC PET training team was actively involved in the DoD MSRC User Group Meeting in San Diego in June. The CEWES MSRC PET team organized and sponsored a Real-Time Distance Training Session for the Internet-Based Training Workshop held on June 24, 1997. The purpose of the workshop was to investigate cost-effective means for providing education and training to remote users. The following presentations were given during the CEWES MSRC PET session:
Paradigms for Distance Education - Alan Craig, NCSA, University of Illinois
Teaching over the Internet: A Low-Tech Approach with High-Impact Results - Nancy Davis, Georgia Tech
CEWES MSRC Training and Education Facilities Upgrade - John Eberle, Nichols Research
CEWES Graduate Institute
The CEWES Graduate Institute is an association of universities and CEWES through which academic credit and graduate degrees can be earned. The CEWES MSRC PET on-site staff supports the Graduate Institute by teaching graduate courses in high performance computing. During the spring semester of 1997, Dr Wayne Mastin, PET On-site Team Lead, taught the MSU graduate course MA 8463, Numerical Linear Algebra. Nine students from CEWES completed the course and earned three semester hours of graduate credit from MSU.
Seminars
The CEWES MSRC PET program offers seminars at the CEWES MSRC on an irregular basis. These are presentations by experts in their fields and are designed to introduce the CEWES MSRC users to current research topics in HPC. The following seminar presentations occurred during Year 2 of the CEWES MSRC PET program:
Solution of the Time-Dependent Incompressible Navier-Stokes Equations: Numerical, Parallelization and Performance Issues - Dr Danesh Tafti, NCSA, April 25, 1997

Java and HPC - Dr Geoffrey Fox, Syracuse University, July 23, 1997

PVMPI: Interoperating Multiple MPI Vendor Implementations with Unified Process Management - Graham E. Fagg, University of Tennessee, October 29, 1997 (JSU seminar on October 30, 1997)

Scalable Computing Using the Bulk Synchronous Parallel Model - Dr Jon Hill and Dr Bill McColl, Oxford University, March 20, 1998 (JSU seminar on the same day)
CD-ROMs
The CEWES MSRC PET partner universities have a sizable volume of educational materials on various areas of HPCC which are suitable for asynchronous use, for example by DoD researchers wishing to increase their familiarity with various HPCC techniques and tools through self-study. NPAC at Syracuse University is leading a joint effort to produce CD-ROMs of HPCC educational material to make available to DoD researchers for self-study. The first edition of the CD-ROM includes course materials for several Syracuse University computational science courses, on-line editions of two books on HPCC, various training materials, reference material, and standards documents. Plans are to continue and expand this effort, producing at least one new CD-ROM edition each year containing new and revised materials as the HPCC landscape changes.
Web-Based Training
During the Fall 1997 and Spring 1998 terms, Jackson State University, in Mississippi, twice offered the course CSC 499, "Programming for the Web," to its students in connection with the CEWES MSRC PET effort. The course was taught by instructors physically located at Syracuse University (SU) in New York using materials developed for use in regular SU courses. During the regularly scheduled lectures, the instructors would display lecture slides on a workstation in Syracuse, and students, attending class in a computer lab at JSU, would see the slides on their screens as the instructor displayed them. The lecture was delivered through an audio link, and the students could ask questions either via the audio link or a "chat" tool.
The technologies behind this distance education project are the Tango collaboratory tool and WebWisdom, an educational repository and presentation tool. The tools, developed by Syracuse University researchers, utilize Java and other web-based technologies to provide an environment for the full two-way exchange of multimedia content in real time. Capabilities include a shared web browser, chat tool, whiteboard, and two-way streaming audio and video.
The course, which covered the architecture of the world-wide web, HTML, CGI scripting, the Java programming language, and relational database technologies, was very well received by students. Syracuse and Jackson State are looking forward to expanding their distance education collaboration to include graduate-level courses, as well as JSU faculty teaching CSC 499 remotely to other universities. This project also served as an important pilot for the use of the Tango and WebWisdom tools to deliver CEWES MSRC PET-sponsored training classes to DoD researchers, thereby increasing the availability of classes and reducing the need to travel to attend them.
___________________________________________________________________
Training Course Descriptions
--------------------------------------------------------------------
Parallel Tools and Libraries
This course will cover parallel numerical libraries for the solution of dense and large sparse linear systems. LAPACK is designed to run efficiently on shared-memory vector and parallel processors. LAPACK provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, as are related computations such as reordering of the Schur factorizations and estimating condition numbers. Dense and banded matrices are handled, but not general sparse matrices.
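The kind of dense solve at the heart of LAPACK's linear-equation drivers can be illustrated with a minimal pure-Python sketch of Gaussian elimination with partial pivoting. This is an illustration of the underlying algorithm only, not the LAPACK interface.

```python
# Minimal LU-style solve (Gaussian elimination with partial pivoting),
# illustrating the kind of dense linear-system solve LAPACK provides.
# A teaching sketch, not the LAPACK calling sequence.

def lu_solve(A, b):
    """Solve A x = b by elimination with partial pivoting."""
    n = len(A)
    A = [row[:] for row in A]   # work on copies; caller's data is untouched
    x = b[:]
    for k in range(n):
        # Partial pivoting: bring the largest remaining pivot into place.
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[p] = A[p], A[k]
        x[k], x[p] = x[p], x[k]
        # Eliminate column k below the pivot.
        for i in range(k + 1, n):
            m = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= m * A[k][j]
            x[i] -= m * x[k]
    # Back substitution on the resulting upper-triangular system.
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (x[i] - s) / A[i][i]
    return x
```

For example, `lu_solve([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])` returns the solution of the 2x2 system 4x + y = 1, x + 3y = 2.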
The ScaLAPACK (or Scalable LAPACK) library includes a subset of LAPACK routines redesigned for distributed memory MIMD parallel computers. Like LAPACK, the ScaLAPACK routines are based on block-partitioned algorithms in order to minimize the frequency of data movement between different levels of the memory hierarchy. (For such machines, the memory hierarchy includes the off-processor memory of other processors, in addition to the hierarchy of registers, cache, and local memory on each processor.) The fundamental building blocks of the ScaLAPACK library are distributed memory versions (PBLAS) of the Level 1, 2 and 3 BLAS, and a set of Basic Linear Algebra Communication Subprograms (BLACS) for communication tasks that arise frequently in parallel linear algebra computations.
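ScaLAPACK distributes matrices over its process grid in a block-cyclic fashion; the one-dimensional version of that mapping can be sketched directly. The block size and process count below are hypothetical example values.

```python
# One-dimensional block-cyclic data distribution of the kind ScaLAPACK
# applies along each dimension of its process grid. Blocks of size nb are
# dealt out to the p processes in round-robin order. Illustrative sketch;
# nb and p here are hypothetical example parameters.

def global_to_local(g, nb, p):
    """Map global index g to (owning process, local index)."""
    block = g // nb                    # which block the index falls in
    proc = block % p                   # blocks are dealt out cyclically
    local = (block // p) * nb + g % nb # position within the process's local storage
    return proc, local

def local_to_global(proc, local, nb, p):
    """Inverse mapping: recover the global index."""
    local_block = local // nb
    return (local_block * p + proc) * nb + local % nb
```

The round trip global -> (process, local) -> global is the identity, which is a convenient sanity check when reasoning about distributed layouts.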
For many large-scale scientific applications that require solving large, sparse linear systems, iterative methods are the only practical solution. PETSc, the Portable, Extensible Toolkit for Scientific Computation, is a suite of data structures and routines for the uniprocessor and parallel solution of large-scale scientific application problems modeled by partial differential equations. It includes scalable parallel preconditioners and Krylov subspace methods for solving sparse linear systems, including ICC, ILU, block Jacobi, overlapping Schwarz, CG, GMRES, and Bi-CG-stab.
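The conjugate gradient (CG) method named above is representative of the Krylov subspace iterations PETSc provides. A minimal pure-Python CG for a small symmetric positive-definite system follows; it illustrates the method itself, not the PETSc API.

```python
# Conjugate gradient iteration for a symmetric positive-definite system,
# the simplest of the Krylov subspace methods listed above. Teaching
# sketch only; PETSc supplies production versions with preconditioning.

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def cg(A, b, tol=1e-10, max_iter=100):
    n = len(b)
    x = [0.0] * n
    r = b[:]                  # residual r = b - A x, with initial guess x = 0
    p = r[:]                  # first search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol * tol:         # converged: residual norm below tolerance
            break
        # New search direction, A-conjugate to the previous ones.
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

In exact arithmetic CG converges in at most n iterations for an n x n system; in practice a good preconditioner (ICC, ILU, etc.) is what makes it effective on large sparse problems.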
----------------------------------------------------------------------
Message Passing Interface (MPI)
Message-Passing Interface (MPI) is the de facto standard for message passing, developed by the Message-Passing Interface Forum (MPIF). MPI provides many features needed to build portable, efficient, scalable, and heterogeneous message-passing code. These features include point-to-point and collective communication, support for datatypes, virtual topologies, process-group and communication context management, and language bindings for Fortran and C. In this tutorial we will cover the important features of MPI with examples and illustrations. An introduction to the extensions of MPI (MPI-2) and to real-time message passing (MPI/RT) will also be provided.
The first day of the tutorial will be an introduction to parallel programming and the Message Passing Interface (MPI), followed by a discussion of point-to-point and collective communication. The second day will cover communicators, topologies, user-defined datatypes, and the profiling and debugging interfaces. On the third day we will discuss intercommunicators, extensions to MPI (MPI-2), and real-time MPI (MPI/RT), followed by an open session. Each session will consist of a lecture, an example illustrating the concepts and constructs discussed in the lecture, and a lab providing hands-on experience. Solutions to the labs will be discussed at the start of the following session.
---------------------------------------------------------------------
Performance Evaluation of Parallel Systems
This tutorial will give a comprehensive introduction to the methodology and practice of performance evaluation of parallel applications. The principles of performance evaluation will be introduced, as will the different levels at which computer systems are evaluated. These levels range from low-level benchmarks of basic system parameters to full application benchmarks of complete systems. The differences between these levels will be discussed.
The standard benchmarks at the different levels of performance evaluation, as well as their shortcomings, will be discussed in detail. This will include standard benchmarks such as the ParkBench benchmark suite, the NAS Parallel Benchmarks, the SPEC benchmark suite, and the Linpack benchmark. The latest results from these standard benchmarks will be presented and analyzed.
Various issues related to workload-driven evaluation and characterization of applications and systems will be discussed. These issues will be illustrated with real application codes.
The methods for analytical performance modeling and prediction will be introduced in detail. Their application will be shown in case studies. An introduction to scalability analysis of parallel applications will be given. Different statistical methods for the analysis of benchmark results and benchmark suites will be introduced and their application to standard benchmark suites such as NPB will be shown.
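The basic quantities underlying the scalability analysis mentioned above (speedup, parallel efficiency, and the bound given by Amdahl's law) are simple to state and compute. A short sketch:

```python
# The basic scalability metrics used throughout performance evaluation:
# speedup S(p) = T(1)/T(p), parallel efficiency E(p) = S(p)/p, and
# Amdahl's law, which bounds the achievable speedup when a fraction f
# of the work is inherently serial.

def speedup(t1, tp):
    """Speedup: single-processor time divided by p-processor time."""
    return t1 / tp

def efficiency(t1, tp, p):
    """Parallel efficiency: speedup divided by processor count."""
    return speedup(t1, tp) / p

def amdahl_speedup(serial_fraction, p):
    """Amdahl's law: predicted speedup on p processors when
    serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)
```

For example, a code that is 5% serial cannot exceed a speedup of 20 no matter how many processors are applied, which is why full-application benchmarks often tell a very different story than low-level kernel benchmarks.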
----------------------------------------------------------------------
T3E Applications Programming
This course is designed for applications programmers who must understand parallel processing concepts and write codes that run on CRAY T3E systems. It provides practical experience in developing, debugging, and analyzing performance of massively parallel programs using the Cray Research parallel programming paradigms and tools. No prior knowledge of parallel programming is assumed in this course.
Skills Addressed:
- Understand the CRAY T3E architecture
- Understand different parallel programming paradigms (message passing and shared memory routines)
- Demonstrate functionality in using CRAY T3E system performance analysis and debugging tools
- Identify the CRAY T3E I/O components
---------------------------------------------------------------------
Java and the World Wide Web
An overview of the Java language and its capabilities and the implications of Java for the World Wide Web.
Topics covered are:
- The architecture of the Web and examples of Web/Java applications
- Java language basics, object-oriented programming, and animation
- Building user interfaces with the Abstract Windowing Toolkit (AWT), I/O, and networking
---------------------------------------------------------------------
IBM SP Programming
This course will provide an overview of the features of the IBM RS/6000 SP. Students will learn the mechanics necessary to make efficient use of the machine. The course will not teach users how to write parallel programs, but rather how, given a program, to run it on the SP efficiently and properly.
Students will learn about the hardware characteristics of the SP as well as the software which is available. There will also be discussion of code optimization.
------------------------------------------------------------------
Visualization Systems and Toolkits
VTK is an object-oriented system intended to provide users with a powerful yet relatively simple method for constructing visualization tools. An application written using VTK may be run on a variety of machines from PCs to workstations without modification. In addition, a user wishing to develop rapid prototypes can opt to write a VTK application which is interpreted rather than compiled as a C++ program.
This course will discuss the features and capabilities of VTK, and look at many example applications. Several labs will give participants a chance to run and modify demos. This should provide an opportunity for each to assess the potential usability of the system for visualizing data from their own area of research.
-----------------------------------------------------------------
C++ Programming
The purpose of this course is to teach the philosophy and syntax of the C++ programming language. Special emphasis will be placed on two areas:
1. Improvements to the C programming language found in C++
2. The object-oriented programming features of C++
Since the object-oriented approach to writing code is a new technique for many users, the course will offer a description of and rationale for this powerful programming style.
List of Topics:
- The class data type for object creation and use
- The various types of C++ functions
- Function and operator overloading
- Inheritance and class hierarchies
- The C++ I/O stream
- Abstract data types
----------------------------------------------------------------
SGI ProDev Workshop
SpeedShop and WorkShop are SGI system tools that assist the programmer in analyzing memory allocation and checking the system resources in use, and are designed to support proficient use of the platform.
------------------------------------------------------------------
Parallel Programming Workshop for Fortran Programmers
The workshop will begin with a one-day lecture on strategy, tools, and examples in parallel programming. On the remaining days participants will work with their own codes.
-----------------------------------------------------------------
CTH: A Software Family for Multidimensional Continuum Mechanics Analysis
CTH is a family of codes under development at Sandia National Laboratories (SNL) for use in modeling complex multidimensional (one-, two-, and three-dimensional), multi-material problems that are characterized by large deformations and/or strong shocks. A two-step Eulerian solution algorithm is used to solve the mass, momentum, and energy conservation equations. The first step is a Lagrangian step in which the computational mesh distorts to follow material motion. The second step is a remap step in which the distorted mesh is mapped back to the original mesh, resulting in motion of the material through the mesh. CTH has been carefully designed to minimize the numerical dispersion present in many Eulerian codes. All quantities are fluxed through the computational mesh using second-order convection algorithms, and a high resolution interface tracking algorithm is used to prevent unrealistic breakup and distortion of material interfaces.
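The two-step Lagrangian-plus-remap idea described above can be sketched in one dimension for constant-velocity advection of a density field. This toy uses a first-order donor-cell remap on a periodic mesh; CTH itself uses second-order convection and multi-material interface tracking, so this is a conceptual illustration only.

```python
# One-dimensional sketch of the two-step Eulerian (Lagrange-plus-remap)
# algorithm described above, for constant-velocity advection of a cell-
# centered density field on a periodic mesh. Toy illustration only: CTH
# uses second-order convection and interface tracking.

def lagrange_remap_step(rho, u, dt, dx):
    """Advance densities one step: Lagrangian motion, then donor-cell remap."""
    c = u * dt / dx               # Courant number; stability requires 0 <= c <= 1
    assert 0.0 <= c <= 1.0
    n = len(rho)
    # Lagrangian step: every cell edge moves a distance u*dt with the material.
    # Remap step: map the shifted cells back onto the fixed mesh. For constant
    # u > 0 the overlap calculation reduces to a donor-cell (upwind) exchange:
    # each cell keeps a (1 - c) share of its contents and receives a c share
    # from its upwind neighbor.
    return [rho[i] + c * (rho[(i - 1) % n] - rho[i]) for i in range(n)]
```

Because the remap only exchanges material between neighboring cells, total mass is conserved exactly, which is the defining property of this class of Eulerian schemes.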
CTH has several models for calculating material response in strong shock, large deformation events. Models have been included for material strength, fracture, distended materials, high explosive detonation, and a variety of boundary conditions. The material strength model is elastic, perfectly-plastic with thermal softening, and fracture can be initiated based on pressure or principal stress. High explosive detonation is simulated using an automated programmed burn model; a Jones-Wilkins-Lee equation-of-state may be used to compute the thermodynamic properties of the high explosive reaction products. Highly accurate analytic equations-of-state may be used to model single-phase solid, liquid, and vapor states, mixed phase vapor-liquid and solid-liquid states, and solids with solid-solid phase changes.
Sophisticated graphics support is available in the CTH family of codes. A history graphics program (HISPLT) is available that allows graphical presentation of data recorded at Eulerian or Lagrangian history points during a calculation. The amount of output generated at execution by CTH is minimized, and a post processing program (CTHED) is provided that allows examination of the database at the desired times and level of detail. Finally, a very robust graphics program (CTHPLT) is available that allows graphical presentation of data in both two- and three-dimensions.
This course will present an extended overview of the features, capabilities, and usage of the CTH family of codes. Sample problems will be constructed, executed, and analyzed during interactive terminal sessions. Information will also be provided regarding ongoing CTH development activities. At the conclusion of the course, individuals should be prepared to use CTH as a tool in the analysis of realistic problems.
---------------------------------------------------------------
Techniques in Code Parallelization
The techniques needed to parallelize a code are described. These include partitioning, load balancing, preprocessing, and postprocessing. Examples of parallelization efforts carried out at the University of Texas will be given.
List of Topics
1. Analyze Structure of the Computational Problem (Mary Wheeler)
   A. Operator Splittings and Physical Decompositions
   B. Discretization, Spatial and Temporal
   C. Solvers
2. Partitioning Algorithms and Load Balancing (Carter Edwards)
3. Data Input Integration (Victor Parr)
   A. Preprocessor for Inputs
   B. Queries for Inputs
4. SPMD Data/Domain Decomposition Formulation (Clint Dawson)
   A. Inclusion of Message Passing Libraries
   B. Interactive Visualization/Steering
5. Post-processing (Victor Parr)
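The partitioning and load-balancing step above has a simple core idea: divide the mesh among processors so that no processor carries significantly more work than the others. A minimal sketch for a one-dimensional block partition follows; real partitioners for unstructured meshes must also minimize the interface (communication) between subdomains.

```python
# Minimal sketch of partitioning with load balancing: divide n_cells mesh
# cells among n_procs processors so the per-processor counts differ by at
# most one. Real mesh partitioners additionally minimize the subdomain
# interfaces that generate message-passing traffic.

def block_partition(n_cells, n_procs):
    """Return (start, end) half-open index ranges, one per processor."""
    base, extra = divmod(n_cells, n_procs)
    ranges, start = [], 0
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)  # first `extra` procs take one more cell
        ranges.append((start, start + size))
        start += size
    return ranges
```

In an SPMD formulation each processor then loops only over its own (start, end) range, exchanging boundary data with neighbors via a message-passing library.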
--------------------------------------------------------------
Workshop on Portable Parallel Performance Tools
This workshop will cover the basics of tool-assisted performance analysis and tuning, as well as introduce a number of tools, both research and commercial, that are available on multiple parallel platforms. Both post-mortem analysis of trace files generated during program execution, and run-time analysis using dynamic instrumentation, will be covered. Tools to be covered include AIMS, MPE logging and nupshot, Pablo, Paradyn, and VAMPIR. With the exception of VAMPIR, which is commercial, these tools will continue to be available on CEWES MSRC platforms following the workshop. (VAMPIR will be available for the week following the workshop on an evaluation license.) Workshop participants will be invited to participate in a follow-up usability study of the tools. Results of a preliminary evaluation of the tools may be found at http://www.cs.utk.edu/~browne/perftools-review/.
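The post-mortem style of analysis covered in the workshop boils down to aggregating timestamped enter/exit events from a trace file. The sketch below shows that core operation on a hypothetical toy trace format; real traces from tools such as MPE logging or VAMPIR carry richer records (processor ids, message events, etc.).

```python
# Post-mortem trace analysis in miniature: compute inclusive time per
# routine from timestamped enter/exit events, the basic operation behind
# trace-based tools. The (time, kind, name) record format here is a
# hypothetical toy, not any tool's actual trace format.

def profile_from_trace(events):
    """Return {routine: inclusive seconds} from (time, 'enter'|'exit', name) events."""
    totals, stack = {}, []
    for t, kind, name in sorted(events):   # process events in time order
        if kind == "enter":
            stack.append((name, t))        # remember when the routine began
        else:
            top, t0 = stack.pop()
            assert top == name, "mismatched enter/exit events in trace"
            totals[name] = totals.get(name, 0.0) + (t - t0)
    return totals
```

A timeline browser like nupshot or VAMPIR presents the same event stream graphically instead of as aggregate totals, which is what makes communication bottlenecks visible.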
---------------------------------------------------------------
Code Optimization for MPPs
This course will focus on the optimization of numeric intensive codes for MPPs. It will be a mixture of lecture and presentation with discussion and hands-on exercises in the afternoon.
The course will begin with a quick overview of the basics of performance and processor architecture. Then we will cover a wide variety of optimizations geared towards enhancing single-processor performance. Topics will include efficient use of the memory hierarchy and functional units, amortizing loop overhead, and dependency analysis. Common bottlenecks and caveats will be discussed, as well as proposed solutions and the logic behind them.
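Loop blocking (tiling) is a representative memory-hierarchy optimization of the kind the course covers: restructure the loop nest so data is reused while it is still resident in cache, without changing the computed result. The sketch below is in Python for clarity; the performance payoff appears in compiled code, where the blocked version keeps tiles of the matrices in cache.

```python
# Loop blocking (tiling) applied to matrix multiply: the blocked loop
# nest computes exactly the same result as the straightforward triple
# loop, but visits A and B in small tiles so that, in a compiled
# language, the working set stays resident in cache. Also shown:
# hoisting the loop-invariant A[i][k] out of the inner loop.

def matmul_naive(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matmul_blocked(A, B, nb=2):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):                     # loop over tiles of rows...
        for kk in range(0, n, nb):                 # ...and tiles of the k dimension
            for i in range(ii, min(ii + nb, n)):   # then work within each tile
                for k in range(kk, min(kk + nb, n)):
                    a = A[i][k]                    # loop-invariant value hoisted
                    for j in range(n):
                        C[i][j] += a * B[k][j]
    return C
```

Choosing the block size nb so that a tile of each operand fits in cache is exactly the kind of architecture-specific tuning that differs between the SP2, the T3E, and the Origin 2000.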
After covering the single-processor case, we will progress to optimizations geared specifically towards MPPs. Topics will include better data layout, appropriate granularity of computation, and reducing communication and contention for resources.
Lastly, we will cover some details of each of the architectures: the SP2, the T3E, and the Origin 2000. Specific limitations of the respective architectures will be discussed, as well as how they affect the various optimization techniques. In addition, each platform is shipped with programming tools designed to aid in the optimization process. The course will conclude with a quick survey of these tools and their usefulness to the application engineer.