Cornell Theory Center

Glossary of Supercomputing Terms

Using this glossary

This is only a partial listing intended to meet local needs; other glossaries that complement this one, such as NPAC's, are also available.


ACRI:
Advanced Computing Research Institute, a unit of the Cornell Theory Center that is closely connected with Cornell's Computer Science Department. ACRI's work is concerned with scientific computation research and its application to engineering and scientific problems, especially those involving the use of advanced computer architectures and environments. For more information, see ACRI's web page.

AFS:
A distributed file system that allows users on different machines to access the same files. AFS allows files to be shared not only on different machines at the same site, but also at different sites across the country. At CTC, all researchers' home directories are in AFS, and all of CTC's SP nodes and RS/6000s have access to this file system. The AFS acronym stands for Andrew File System, a project of Carnegie-Mellon University, but the version installed at CTC is a commercial product from a company called Transarc.

AIX:
Advanced Interactive eXecutive, IBM's distribution of UNIX for systems including RS/6000s and SPs. AIX includes features of both AT&T's System V UNIX and BSD (Berkeley Software Distribution) UNIX.

Amdahl's Law (Parallel Computers):
Let F be the fraction of a program that is serial, and let 1-F be the fraction of that program which can be parallelized. If N is the number of parallel processors used by the program, then the speedup, S, attained in performance is given by:

S = parallel speed / single-processor speed = 1 / (F + (1-F)/N).
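
For illustration, a minimal C sketch that evaluates this expression for a hypothetical program whose serial fraction is 10% (the function name and the chosen value of F are illustrative only):

#include <stdio.h>

/* Speedup predicted by Amdahl's Law for serial fraction F on N processors. */
static double amdahl_speedup(double F, int N)
{
    return 1.0 / (F + (1.0 - F) / N);
}

int main(void)
{
    double F = 0.10;   /* assumed serial fraction */
    int N;

    for (N = 1; N <= 64; N *= 2)
        printf("N = %2d   speedup = %5.2f\n", N, amdahl_speedup(F, N));
    /* Even with 64 processors the speedup is only about 8.8, and it can
       never exceed 1/F = 10. */
    return 0;
}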

architecture:
The design of the hardware components of a computer system, and the ways in which these components interact to produce the complete machine. For a concurrent processor, the architecture includes both the topology of the machine as a whole and the detailed design of each node.

array:
A sequence of objects in which order is significant (as opposed to a set, which is a group of objects in which order is not significant). An array variable in a program stores an n-dimensional matrix or vector of data. The term computer array is also used to describe the set of nodes of a concurrent processor; this term implies but does not require that the processor has a geometric or matrix-like connectivity.

ATM:
Asynchronous Transfer Mode, a data transfer protocol (also called cell switching) which features dynamic bandwidth allocation and a fixed cell length.


bandwidth:
A measure of the speed of information transfer, typically used to quantify the communication capability of concurrent computers. Bandwidth can be used to measure both node to node and collective (bus) communication capability. Bandwidth is usually measured in megabytes of data per second.

barrier:
Point of synchronization for multiple simultaneous processes. Each process reaches the barrier, and then waits until all of the other processes have reached that same barrier before proceeding.
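
For illustration, a minimal MPI sketch of a barrier in C (the printed messages are arbitrary):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    printf("rank %d: work before the barrier\n", rank);
    MPI_Barrier(MPI_COMM_WORLD);   /* no process passes this point until */
                                   /* every process has reached it       */
    printf("rank %d: after the barrier\n", rank);

    MPI_Finalize();
    return 0;
}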

batch:
A method of processing a series of commands on a computer with no human interaction. Typically, a list of commands is placed within a file, and then that file is executed. EASY-LL and LoadLeveler are batch systems that run on CTC's IBM SP.

benchmark:
A standardized program (or suite of programs) that is run to measure the performance of one computer against the performance of other computers running the same program.

blocking:
The action of communication routines that wait until their function is complete before returning control to the calling program. For example, a routine that sends a message might delay its exit until it receives confirmation that the message has been received.


C:
Programming language, originally based on "B", designed by Dennis Ritchie. C is a low-level language that has many features commonly found in higher-level languages. C and C++ are two of the most common programming languages used today.

C++:
An object-oriented superset of the C programming language. C++ allows the programmer to define abstract data types (classes) and to use other advanced methods of representing and manipulating data.

cache:
A fast memory used to hold commonly used variables, which are fetched automatically by hardware from the slower and larger main computer memory. Large memory requirements often lead to dense but slow memories: throughput is high for large blocks of data, but fetch times for individual items or small amounts of data can be very long. To overcome long fetch times, computer architects add smaller interface memories with faster access; the term cache is most often used when such a memory sits between the processor and main memory. If the required data are already stored in the cache, fetches are fast. If the required data are not in the cache, a cache miss results and the cache is refilled from main memory at the expense of time.
Cache memories are usually transparent to the user. A reference to a given area of main memory for one piece of data or one instruction is usually followed closely by several additional references to that same area for other data or instructions. Consequently, caches are filled automatically by a pre-defined algorithm; the computer system manages this "prefetch" process.

clustering:
A type of multiprocessor architecture in which there is a hierarchy of the units of replication. At the lowest level, "processors" are replicated to form a "cluster". The cluster consists of M processors and a shared switching network that provides communication among the processors and access to a shared local memory. At the next higher level, the clusters themselves are replicated: a clustered "system" consists of N clusters interconnected through a global network that allows communication among the clusters and access to a global memory shared among them. The purpose of clustering is to reduce conflicts in accessing shared resources, whether those resources are the communication network or the storage system.
Typically, a clustered architecture is a "hyperstar", that is, a star of stars. The cluster is a "star of processors" sharing a local communication network and a local memory. The system is then a "star of clusters" in which the shared resources are the global network and the global memory.

coarse grain:
See granularity

collective communication routines:
Message passing routines that exchange data among all the processors in a group. These routines usually include a barrier routine for synchronization, a broadcast routine for sending data from one processor to all processors, and gather/scatter routines.

communication overhead:
A measure of the additional workload incurred in a concurrent algorithm due to communication between the nodes of the concurrent processor. If communication is the only source of overhead, then the communication overhead is given by: ((number of processors * parallel run time) - sequential run time) / sequential run time.

computational science:
A field that concentrates on the effective use of computer software, hardware and mathematics to solve real problems. It is a term used when it is desirable to distinguish the more pragmatic aspects of computing from (1) computer science, which often deals with the more theoretical aspects of computing; and from (2) computer engineering, which deals primarily with the design and construction of computers themselves. Computational science is often thought of as the third leg of science along with experimental and theoretical science.

computer science:
The systematic study of computing systems and computation. The body of knowledge resulting from this discipline contains theories for understanding computing systems and methods; design methodology, algorithms, and tools; methods for the testing of concepts; methods of analysis and verification; and knowledge representation and implementation.

concurrent processor:
An architecture that allows more than one process to be executed at the same time, usually by having more than one CPU.

Connection Machine:
A SIMD concurrent computer once manufactured by the now defunct Thinking Machines Corporation.

contention:
A situation that occurs when several processes attempt to access the same resource simultaneously. An example is memory contention in shared memory multiprocessors, which occurs when two processors attempt to read from or write to a location in shared memory in the same clock cycle.

CPU:
Central Processing Unit, the arithmetic and control portions of a sequential computer.

critical section:
A section of code that should be executed by only one processor at a time. Typically such a section involves data that must be read, modified, and rewritten; if processor 2 is allowed to read the data after processor 1 has read it but before processor 1 has updated it, then processor 1's update will be overwritten and lost when processor 2 writes its update. Spin locks or semaphores are used to ensure strict sequential execution of a critical section of code by one processor at a time.
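
One common way to enforce this is with a lock; the following C sketch uses a POSIX threads mutex (the shared counter and thread function are hypothetical):

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                        /* shared data                */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *work(void *arg)
{
    int i;
    (void) arg;
    for (i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);              /* enter the critical section */
        counter = counter + 1;                  /* read, modify, rewrite      */
        pthread_mutex_unlock(&lock);            /* leave the critical section */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, work, NULL);
    pthread_create(&t2, NULL, work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);         /* 200000 when the lock is used */
    return 0;
}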

CSERG:
Computational Science and Engineering Research Group

CTC:
Cornell Theory Center


data decomposition:
A way of dividing arrays among local CPUs to minimize communication.

data dependency:
A situation that occurs when there are two or more references to the same memory location, and at least one of them is a write. Parallel programmers must be aware of data dependencies in their programs to ensure that modifications made to data by one processor are communicated to other processors that use the same data. See recurrence for an example of a particular type of dependency.

data parallel:
A programming model in which each processor performs the same work on a unique segment of the data. Either message passing libraries such as MPI or higher-level languages such as HPF can be used for coding with this model. An alternative to data parallel is functional parallel.

deadlock:
A situation in which processors of a concurrent processor are waiting on an event which will never occur. A simple version of deadlock for a loosely synchronous environment arises when blocking reads and writes are not correctly matched. For example, if two nodes both execute blocking writes to each other at the same time, deadlock will occur since neither write can complete until a complementary read is executed in the other node.
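
The scenario above can be reproduced with a short MPI sketch in C; MPI_Ssend is a synchronous (always blocking) send that cannot complete until a matching receive has started, so with two ranks both processes block forever:

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, other, out = 1, in;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;                           /* assumes exactly 2 ranks */

    MPI_Ssend(&out, 1, MPI_INT, other, 0, MPI_COMM_WORLD);          /* blocks forever */
    MPI_Recv(&in, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &status);   /* never reached  */

    MPI_Finalize();
    return 0;
}

Reversing the send/receive order on one of the two ranks (or using non-blocking calls) removes the deadlock.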

distributed memory:
Memory that is split up into segments, each of which may be directly accessed by only one node of a concurrent processor. Distributed memory and shared memory are two major architectures that require very different programming styles.

distributed processing:
Processing on a number of networked computers that do not share main memory. The computers involved are typically of different power and function: for example, a supercomputer, a minicomputer, and a workstation might all provide computation for a single application, with the processing distributed appropriately over all the machines.

distributed shared memory (DSM):
Memory that is physically distributed, but is masked by the operating system so that it appears to the user as shared memory with a single address space. Also called virtual shared memory.


EASY:
A mechanism developed at Argonne National Laboratory for scheduling parallel batch jobs on an IBM SP and ensuring that both large and small jobs have an equitable chance to run.

EASY-LL:
A combination of Argonne's EASY scheduling mechanism and IBM's LoadLeveler batch system. Developed jointly by CTC and IBM, EASY-LL now schedules most of the nodes on CTC's SP and will be made available to other SP sites that wish to install it.

efficiency:
A measure of the amount of time a parallel program spends doing computation as opposed to communication. Efficiency is defined as speedup / number of processors. The closer it is to 1, the more perfectly parallel the task is at that level of parallelism; the closer to 0, the less parallel.

ESSL:
The Engineering and Scientific Subroutine Library, a library of mathematical subroutines that have been highly optimized for the SP family of machines and so run much faster than equivalent routines from other libraries. For more information, see CTC's ESSL documentation page.


FDDI:
Fiber Distributed Data Interface, a networking standard based on fiber optics. FDDI specifies a data transmission rate of 100 megabits per second using a wavelength of 1300 nanometers (i.e., light). It uses token ring access control. An FDDI network has a length limit of 200 kilometers but requires repeaters no more than 2 kilometers apart.

FFT:
An acronym for Fast Fourier Transform, a technique for very fast computation of Fourier series. A discrete Fourier transform using N points can be computed in N log N steps by the fast method, whereas the straightforward method would take N**2 steps.

fine grain:
See granularity

FLOPS:
Floating point operations per second; a measure of a computer's floating-point performance, equal to the rate at which a machine can perform floating-point calculations.

Fortran:
Acronym for FORmula TRANslator, one of the oldest high-level programming languages but one that is still widely used in scientific computing because of its compact notation for equations, ease in handling large arrays, and huge selection of library routines for solving mathematical problems efficiently. Fortran 77 and Fortran 90 are the two standards currently in use. On AIX systems, the Fortran compiler is called XL Fortran or xlf; on other UNIX systems, the Fortran compiler is often called f77. HPF is a set of extensions to Fortran that support parallel programming.

Fortran 77:
A standard for the Fortran language that is currently implemented by virtually all available compilers.

Fortran 90:
A newer standard that extends Fortran 77's capabilities in several areas, but has not yet been fully implemented on every available compiler. Fortran 90 offers powerful new ways of manipulating arrays and allows the user to develop an object-oriented programming style. Other new features include optional arguments and recursion in procedures, dynamic storage allocation, user-defined data types, and pointers similar to those used in the C language.

functional decomposition:
A method of programming decomposition in which a problem is broken up into several independent tasks, or functions, which can be run simultaneously on different processors.

functional parallel:
A programming model in which a program is broken down by tasks, and parallelism is achieved by assigning each task to a different processor. Message passing libraries such as MPI are commonly used for communication among the processors. An alternative to functional parallel is data parallel.


gather/scatter:
A collective communication operation in which (for gather) one process collects data from each participating process and stores it in order by process number, or (for scatter) one process divides up some data and distributes a chunk to each participating process, again in order by process number.
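
For illustration, a minimal MPI sketch in C in which process 0 scatters one integer to each of four processes and then gathers the modified values back in rank order (the data values are arbitrary):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, piece;
    int senddata[4] = {10, 20, 30, 40};   /* significant on rank 0 only      */
    int recvdata[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Scatter(senddata, 1, MPI_INT, &piece, 1, MPI_INT, 0, MPI_COMM_WORLD);
    piece = piece + rank;                 /* each process works on its chunk */
    MPI_Gather(&piece, 1, MPI_INT, recvdata, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)                        /* results arrive ordered by rank  */
        printf("%d %d %d %d\n", recvdata[0], recvdata[1], recvdata[2], recvdata[3]);

    MPI_Finalize();
    return 0;
}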

gigabyte (GB):
2**30 (hex 40,000,000) bytes of data, i.e. 1,073,741,824 bytes.

gigaflop or gflop:
One billion (10**9) floating point operations per second.

global memory:
The main memory accessible by all processors or CPUs.

grain size:
The number of fundamental entities, or members, in a grain. For example, if a grid is spatially decomposed into subgrids, then the grain size is the number of grid points in a subgrid.

granularity:
A term often used in parallel processing to indicate the size of the independent processes that could be distributed to multiple CPUs. Fine granularity is illustrated by execution of statements or small loop iterations as separate processes; coarse granularity involves subroutines or sets of subroutines as separate processes. The more processes, the "finer" the granularity and the more overhead required to keep track of them. Granularity can also be related to the temporal duration of a task. It is not only the number of processes but also how much work each process does, relative to the time of synchronization, that determines the overhead and reduces speedup figures.

granule:
The fundamental grouping of members of a domain (system) into an object manipulated as a unit.


high-performance switch or high-speed switch:
See switch.

HiPPI:
High Performance Parallel Interface, a network technology standard that specifies a transmission speed of 100 megabytes per second and allows devices to be attached directly to the network without an intervening computer.

HPF:
High Performance Fortran, a set of extensions to Fortran 77 or Fortran 90 that provides: opportunities for parallel execution detected automatically by the compiler; various types of parallelism (MIMD, SIMD, or some combination); and control over the allocation of data among individual processor memories and the placement of data within a single processor.

HPSS:
High Performance Storage System, a new generation of hierarchical mass storage system software designed and built for scalability and performance. The system is expected to manage millions of files and petabytes of data, and it relies heavily on parallel I/O to attain high aggregate data rates.

hypercube architecture:
Multiple-CPU architecture with 2**N processors, each directly connected to its N nearest neighbors, just as each corner of an N-dimensional hypercube joins N edges. A 2**3 machine, for example, has eight CPUs arranged at the corners of a cube and connected along its edges.
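
A useful property of the hypercube is that the neighbors of a node are found by flipping one bit of its binary label at a time, as in this C sketch (the node label chosen is arbitrary):

#include <stdio.h>

int main(void)
{
    int N = 3;   /* dimension: 2**3 = 8 nodes      */
    int r = 5;   /* example node label, binary 101 */
    int k;

    for (k = 0; k < N; k++)
        printf("neighbor of node %d along dimension %d: %d\n", r, k, r ^ (1 << k));
    return 0;
}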


IBM:
International Business Machines Corporation, a firm that manufactures computer hardware and software including the IBM RS/6000 SP. CTC has a long history of joint work with IBM in developing and testing products needed by the scientific research community.

IMSL:
International Mathematical and Statistical Library, a useful set of computational subroutines written in Fortran.

inhomogeneous problem:
A problem whose underlying domain contains members of different types, e.g., an ecological problem such as WaTor with different species of fish. Inhomogeneous problems are typically irregular, but the converse is not generally true.

I/O:
Input/Output, the hardware and software mechanisms connecting a computer with the "outside world". This includes computer to disk and computer to terminal/network/graphics connections. Standard I/O is a particular software package developed under UNIX for the C language.

irregular problem:
A problem with a geometrically irregular domain containing many similar members, e.g., finite element nodal points.


kilobyte (KB):
2**10 (hex 400) bytes of data, i.e. 1,024 bytes.


LAPACK:
A library of Fortran 77 subroutines for solving the most common problems in numerical linear algebra: systems of linear equations, linear least squares problems, eigenvalue problems, and singular value problems. LAPACK has been designed to be efficient on a wide range of high-performance computers.

latency:
The time to send a zero-length message from one node of a concurrent processor to another. Non-zero latency arises from the overhead in initiating and completing the message transfer. Also called startup time.

load balance:
A goal for algorithms running on concurrent processors, which is achieved if all the nodes perform approximately equal amounts of work, so that no node is idle for a significant amount of time.

LoadLeveler:
A system for scheduling batch jobs on an SP system or cluster of RS/6000 workstations, developed by IBM based on a project called CONDOR from the University of Wisconsin.

local disk space:
Disk space attached to an individual processor (CPU) in a multiprocessor system. On CTC's IBM SP, each node's local disk space is used as scratch space for running jobs, because access to it is much faster than access to the shared file system, AFS.

local memory:
The memory associated with a single CPU in a multiple CPU architecture, or memory associated with a local node in a distributed system.

loop unrolling:
An optimization technique valid for both scalar and vector architectures. The number of iterations of an inner loop is decreased by a factor of two or more by explicitly including the work of the next one or several iterations in the loop body. Loop unrolling can allow traditional compilers to make better use of the registers and to improve the overlap of operations. On vector machines loop unrolling may either improve or degrade performance, and the process involves a tradeoff between overlap and register use on one hand and vector length on the other.
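
For illustration, a C sketch of a summation loop unrolled by a factor of four (the function names and the unrolling factor are illustrative):

/* Rolled version: one element per iteration. */
double sum_rolled(const double *a, int n)
{
    double s = 0.0;
    int i;
    for (i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled by four: fewer loop-control operations and more opportunity for
   the compiler to overlap independent additions in separate registers. */
double sum_unrolled(const double *a, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* remainder when n is not a multiple of 4 */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}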


main memory:
A level of random access memory that lies between cache or register memory and extended random access memory. Main memory has more capacity than cache or registers but slower access, and less capacity than extended random access memory but faster access. Unfortunately, some newer supercomputer models have introduced yet another tier of memory, so the distinction between cache, main, and extended memory is becoming somewhat blurred.

mass storage:
An external storage device capable of storing large amounts of data. Online, directly accessible disk space is limited on most systems; mass storage systems provide additional space that is slower and more difficult to access, but can be virtually unlimited in size. CTC's mass storage system currently uses the National Storage Laboratory (NSL) version of the UniTree software, but is being upgraded to HPSS.

master-worker:
A programming approach in which one process, designated the "master," assigns tasks to other processes known as "workers."

megabyte (MB):
2**20 (hex 100,000) bytes of data, i.e. 1,048,576 bytes.

member:
Members are grouped first into granules and then grains. In a finite difference problem, the members are grid points.

memory:
See cache and main memory.

message passing:
A communication paradigm in which processes communicate by exchanging messages via communication channels.

MIMD:
Multiple Instruction, Multiple Data, an architecture in which multiple instruction streams are executed simultaneously. Each single instruction may handle multiple data elements (e.g., one or more vectors in a vector machine). While single-processor vector computers are able to operate in MIMD mode because of overlapped functional units, MIMD terminology is used more generally to refer to multiprocessor machines. See also SIMD, SISD; for more information, see the "Taxonomy of Architectures" section of CTC's Introduction to Parallel Processing module.

MIPS:
Millions of instructions per second.

MOPS:
Millions of operations per second.

MPI:
Message Passing Interface, a de facto standard for communication among the nodes running a parallel program on a distributed memory system. MPI is a library of routines that can be called from both Fortran and C programs. MPI's advantage over older message passing libraries is that it is both portable (because MPI has been implemented for almost every distributed memory architecture) and fast (because each implementation is optimized for the hardware it runs on).
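
A minimal MPI program in C looks like the following sketch (rank 0 sends one integer to rank 1; compile with an MPI compiler wrapper such as mpicc and run with at least two processes):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}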

MPL:
Message Passing Library developed by IBM, a precursor of MPI for the RS/6000 and SP architectures.

MPP:
Massively parallel processor, a parallel system with many processors. "Many" is usually defined as 1000 or more processors.

multiprocessing:
The ability of a computer to intermix jobs on one or more CPUs.


NAG:
The Numerical Algorithms Group library, a set of Fortran 77 routines for the solution of numerical and statistical problems.

nearest neighbor:
A computer architecture whose connectivity can be interpreted as connections between adjacent members in geometric space.

node:
One of the individual computers that are linked together to form a parallel system. On an IBM SP, a node consists of an RS/6000 processor with its associated memory and disk space; future generations of IBM SP will have nodes with more than one processor. SP nodes are grouped into racks of 8 or 16 nodes.


object-oriented programming:
Style of programming characterized by the use of separate "objects" to perform different tasks within a program. These "objects" usually consist of an abstract data type or class, along with the methods and procedures used to manipulate that abstract data type.

optimization:
The act of tuning a program to achieve the fastest possible performance on the system where it is running. There are tools available to help with this process, including optimization flags on compiler commands and optimizing preprocessors such as KAP from Kuck & Associates. You can also optimize a program by hand, using profiling tools to identify "hot spots" where the program spends most of its execution time. Optimization requires repeated test runs and comparison of timings, so it is usually worthwhile only for production programs that will be rerun many times.

overhead:
There are four contributions to the overhead, f, defined so that the speedup = number of nodes / (1 + f). The communication overhead and load balance contributions are also defined in this glossary. There are also algorithmic and software contributions to the overhead.


parallel processing:
Processing with more than one CPU on a single application simultaneously.

parallelization:
The process of achieving a high percentage of the CPU time expended in parallel; minimizing idle CPU time in a parallel processing environment. For one program, parallelization refers to the splitting of program execution among many CPUs.

partition:
The nodes assigned to run a particular parallel program under the IBM SP's Parallel Operating Environment (POE). The size of the partition is defined as the number of nodes.

partitioning:
Restructuring a program or algorithm into semi-independent computational segments to take advantage of multiple CPUs simultaneously. The goal is to achieve roughly equivalent work in each segment with minimal need for intersegment communication. It is also worthwhile to have no fewer segments than CPUs on a dedicated system.

PE:
Parallel Environment, an environment designed for the development and execution of parallel FORTRAN, C, or C++ programs under AIX on an IBM SP. It consists of libraries, tools, and system support processes.

percentage parallelization:
The percentage of CPU expenditure processed in parallel on a single job. It is usually not possible for 100 percent of an application's processing time to be shared equally among all CPUs.

petabyte:
2**50 (hex 4,000,000,000,000) bytes of data, i.e. 1,125,899,906,842,624 bytes.

PIOFS:
Parallel I/O File System, an IBM product designed to support applications that require large temporary files and high I/O bandwidth. PIOFS allows files as large as 128 terabytes, which can be logically partitioned into subfiles for faster and more efficient parallel processing without the overhead of maintaining separate files. For more information, see CTC's PIOFS documentation page.

pipelining:
The execution of a sequence of data sets by a single processor in such a way that subsequent data elements or instructions in the sequence can begin execution before previous elements have completed execution, with all such elements executing at the same time; an assembly line approach. In vector supercomputers the floating-point operations were often pipelined with memory fetches and stores of the vector data sets. (The Denelcor HEP is an example of a pipelined instruction set; the Japanese supercomputers, the IBM 3090/VF, CRAY and Cyber 200 series are examples of pipelined arithmetic units.)

POE:
Parallel Operating Environment, the software used to compile, run, and monitor parallel programs under the PE.

polled communication:
Polling involves a node inspecting the communication hardware -- typically a flag bit -- to see if information has arrived or departed. Polling is an alternative to an interrupt-driven system and is typically the basis for implementing the crystalline operating systems. The natural synchronization of the nodes imposed by polling is used in the implementation of blocking communication primitives.

primary memory:
Main memory accessible by the CPU(s) without using input/output processes.

process:
The task executing on a given processor at a given time.

processor:
The part of the computer that actually executes your instructions. Also known as central processing unit or CPU.

PVM:
Parallel Virtual Machine, a message passing library and set of tools used to create and execute concurrent or parallel applications.


rack:
A group of 8 wide nodes or 16 thin nodes on an IBM SP.

RAID:
Redundant array of inexpensive disks; a file system containing many disks, some of which are used to hold redundant copies of data or error correction codes to increase reliability. RAIDs are often used as parallel access file systems, where the sheer size of the storage capacity required precludes using more conventional (but more expensive) disk technology.

recurrence:
A dependency in a DO-loop whereby a result depends upon completion of the previous iteration of the loop. Such dependencies inhibit vectorization. For example:

A(I) = A(I-1) + B(I)

In a loop on I, this process would not be vectorizable on most vector computers without marked degradation in performance. This is not an axiom or law, but rather is simply a fact resulting from current machine design.

reduced instruction set computer (RISC):
A philosophy of instruction set design where a small number of simple, fast instructions are implemented rather than a larger number of slower, more complex instructions.

rendering:
The process of turning mathematical data into a picture or graph.

RS/6000:
RISC System/6000, a type of workstation manufactured by IBM. The nodes on an SP system are RS/6000s.

RS/6000 SP:
See SP.


scalably parallel:
Describes a system in which all nodes are an equal communication distance apart and multiple routes between nodes avoid bottlenecks.

scattered decomposition:
A technique for decomposing data domains that involves scattering, or sprinkling, the elements of the domain over the nodes of the concurrent processor. This technique is used when locality, which is preserved by the alternate decomposition into connected domains often called domain decomposition, is less important than the gain in load balance obtained by associating each node with all parts of the domain.

scheduler:
The part of a computer system's software that determines which task is assigned to each system resource at any given time. In a batch system such as EASY-LL or LoadLeveler that maintains a queue of jobs waiting to run, the scheduler determines when each job can start based on such criteria as the order in which job requests were submitted and the availability of the system resources needed by each job.

sequential execution:
See serial processing. Parallel programs may have sections that must be executed sequentially; see critical section.

serial processing:
Running an application on a single node.

shared memory:
A memory that is directly accessed by more than one node of a concurrent processor. Shared memory and distributed memory are two major architectures that require very different programming styles.

SIMD:
Single Instruction, Multiple Data, an architecture that characterizes most vector computers. A single instruction initiates a process that sets in motion streams of data and results. The term is also applicable to parallel processors where one instruction causes more than one processor to perform the same operation synchronously, even on different pieces of data (e.g., ILLIAC). See also MIMD, SISD, and SPMD; for more information, see the "Taxonomy of Architectures" section of CTC's Introduction to Parallel Processing module.

SISD:
Single Instruction, Single Data, a traditional computer architecture where each instruction (of a single-instruction stream) deals with specific data elements or pairs of operands rather than "streams" of data. See also SIMD and MIMD; for more information, see the "Taxonomy of Architectures" section of CTC's Introduction to Parallel Processing module.

Smart Node Program:
An outreach program run by the Theory Center to distribute supercomputing information, expertise, support, and training to researchers at a consortium of universities, colleges, and government research laboratories throughout the United States.

SMP:
Symmetric Multi-Processor, a shared memory system from IBM featuring up to eight RS/6000 processors connected by a crossbar switch.

SP:
Scalable POWERparallel System, a distributed memory machine from IBM. It consists of nodes (RS/6000 processors with associated memory and disk) connected by an ethernet and by a high-performance switch.

speedup:
A measure of how much faster a given program runs when executed in parallel on several processors as compared to serial execution on a single processor. Speedup is defined as S = sequential run time / parallel run time.

SPMD:
Single Program, Multiple Data. A generalization of SIMD data-parallel programming, SPMD loosens the synchronization constraints on the processors: it specifies which program every processor must run, but not which instruction each should be executing at any particular time. Data distribution is still a key concern; in fact, another commonly used term for SPMD is data decomposition, alluding to the fact that the overall data set is decomposed and a part of it is given to each participating processor, each of which runs the same code against its own section.

square decomposition:
A strategy in which the array of nodes is decomposed into a two-dimensional mesh; we can then define scattered or local (domain) versions of this decomposition.

Standard I/O:
See I/O.

startup time:
See latency.

stride:
A term derived from the concept of walking (striding) through the data from one noncontiguous location to the next. If data are to be accessed as a number of evenly spaced, discontiguous blocks, then stride is the distance between the beginnings of successive blocks. For example, consider accessing rows of a column-stored matrix. The rows have elements that are spaced in memory by a stride of N, the dimension of the matrix.
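
For illustration, a C sketch that stores an N-by-N matrix by columns, as Fortran does, and then walks across one row; successive row elements are separated in memory by a stride of N (the array name and values are arbitrary):

#include <stdio.h>

#define N 4

int main(void)
{
    double a[N * N];   /* element (i,j) stored at a[i + j*N], i.e. by columns */
    int i, j;

    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            a[i + j * N] = 10.0 * i + j;

    i = 2;             /* walk across row 2: indices 2, 2+N, 2+2N, ...        */
    for (j = 0; j < N; j++)
        printf("a(%d,%d) = %g at index %d\n", i, j, a[i + j * N], i + j * N);
    return 0;
}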

striping:
Another technique for avoiding serialized I/O, in which each node writes its own portion of the data to its own file. This approach is particularly well suited to checkpointing, where saving each node's state in its own file is exactly what is wanted.
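
For illustration, a minimal MPI sketch in C in which each process writes its own portion of the data to a file named by its rank (the file name and contents are hypothetical):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    char fname[64];
    FILE *fp;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    sprintf(fname, "checkpoint.%04d", rank);   /* one file per node      */
    fp = fopen(fname, "w");
    fprintf(fp, "state of rank %d\n", rank);   /* stand-in for real data */
    fclose(fp);

    MPI_Finalize();
    return 0;
}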

supercomputer(s):
At any given time, that class of general-purpose computers that are both faster than their commercial competitors and have sufficient central memory to store the problem sets for which they are designed. Computer memory, throughput, computational rates, and other related computer capabilities contribute to performance. Consequently, a quantitative measure of computer power in large-scale scientific processing does not exist, and a precise definition of supercomputers is difficult to formulate.

supercomputing:
Both the art and the science of writing and using software that enables the user to obtain the highest performance achievable on a particular supercomputer system, making the most efficient use of the system and its resources.

superlinear:
Greater than linear; usually used in reporting results on parallel computers where the processing speed increases more rapidly than the number of processors. Example: a job taking 16 hours on a one-processor machine may take only a half-hour on a 16-processor machine instead of the one hour expected from linear speedup. Upon close examination, such examples are seen to be the result of algorithmic improvements. A demonstration is easily described: consider a uni-processor that simulates a multiple-processor machine (such as the 16-processor machine in the example). The uni-processor simulates the first cycle of each of the 16 parallel processors, then the second cycle of each, and so on until the program concludes. Ignoring overhead, this takes 16 times as long as it would on a 16-processor machine of the same cycle time, so any speedup seen by the 16-processor machine can be reproduced at one-sixteenth speed by the one-processor machine. The speed increase is therefore at best linear in the number of processors.

switch:
A high bandwidth data transmission device used to communicate between different nodes on an IBM SP.

synchronization:
The act of bringing two or more processes to known points in their execution at the same clock time. Explicit synchronization is not needed in SIMD programs (in which every processor either executes the same operation as every other or does nothing), but is often necessary in SPMD and MIMD programs. The time wasted by processes waiting for other processes to synchronize with them can be a major source of inefficiency in parallel programs.


task:
The basic unit of work to be performed by the computer. A program typically consists of a number of tasks, such as reading data, performing calculations, and writing output. A common way to parallelize a program is to divide up the tasks and assign one task to each processor or node. For this reason, the term "task" is sometimes used interchangeably with processor or node in discussions of parallel programming.

T1:
Network transmission of a DS1 formatted digital signal at a rate of 1.544 Mb/s.

T3:
Network transmission of a DS3 formatted digital signal at a rate of 44.736 (approximately 45) Mb/s.

TCP/IP:
Transmission Control Protocol/Internet Protocol, the protocol used for communications on the internet. TCP/IP software includes the telnet and ftp commands.

terabyte:
2**40 (hex 10,000,000,000) bytes of data, i.e. 1,099,511,627,776 bytes.

teraflop:
A processor speed of one trillion (10**12) floating point operations per second.

thin node:
The smaller of the two types of nodes on an IBM SP system. Roughly equivalent to an RS/6000 model 390, a thin node has a 64 KB data cache, a 64 bit memory bus, up to 4 GB of local disk space and up to 512 MB of memory. The smaller cache and bus can make a thin node slower than a wide node for some applications.

token:
When you log in to a machine in AFS, you must be issued a "token" in order to become an authenticated AFS user. The token is issued automatically when you enter your password at login. Every token has an expiration date associated with it; on CTC machines, tokens are set to expire after 100 hours. To see what tokens you have, enter the command "tokens".


UNIX:
An operating system originally developed by AT&T which, in various incarnations, is now available on most types of supercomputer.

UniTree:
A hierarchical file archival system. It uses disk as its first storage layer, and tape cartridges as its second layer. Files are moved from disk to tape depending on how much UniTree disk space is available and on the size and age of files in the system.


vector:
A computer vector is an array of numbers with a location prescribed according to a contiguous or random formula. A mathematical vector is a set of numbers or components with no conditions on their retrievability from computer memory. (See also stride.)

vector processing:
The practice of using computers that can process multiple vectors or arrays. Modern supercomputers achieve speed through pipelined arithmetic units. Pipelining, when coupled with instructions designed to process multiple vectors, arrays or numbers rather than one data pair at a time, leads to great performance improvements.

vectorization:
The act of tuning an application code to take advantage of vector architecture.

virtual concurrent processor (aka virtual machine):
The virtual concurrent processor is a hardware-independent, portable programming environment within the message passing paradigm. The virtual machine is composed of virtual nodes, each corresponding to an individual process; there may be several processes, and hence several virtual nodes, on a single node of a real computer.

virtual machine loosely synchronous communication system:
A specific realization of a hardware independent programming environment, or the virtual machine, which applies to the loosely synchronous problem class.

visualization:
In the broadest sense, visualization is the art or science of transforming information to a form "comprehensible" by the sense of sight. Visualization is broadly associated with graphical display in the form of pictures (printed or photo), workstation displays, or video.


wide node:
The larger of the two types of nodes on an IBM SP system. Roughly equivalent to an RS/6000 model 590, a wide node has a 256 KB data cache, a 256 bit memory bus, up to 18 GB of local disk space and up to 2048 MB of memory. CTC's SP system has one wide node with the maximum amount of memory and several with 1024 MB. The larger cache and bus can make a wide node faster than a thin node for some applications.

wormhole routing:
A technique for routing messages in which the head of the message establishes a path, which is reserved for the message until the tail has passed through it. Unlike virtual cut-through, the tail proceeds at a rate dictated by the progress of the head, which reduces the demand for intermediate buffering.

