NPAC Technical Report SCCS-751
Language and Runtime Support for the Execution of Heirarchical Clustering Applications on Distributed Memory Machines
Nawal Copty
Submitted January 14 1996
Abstract
Parallel computing has been quite successful in solving problems that
have very regular structure, because the structure naturally leads to
a balanced allocation of data and computations across the processors
and to efficient communication between them. Examples of such
problems can be found in matrix computations, signal/image processing,
and the natural sciences. However, in many important mathematical,
scientific, and industrial problems, data access patterns are
irregular and evolve during the computation. The parallelization of
these problems poses difficulties related to data partitioning, data
communication, and load balancing.
This dissertation examines particular types of irregular problems,
known as hierarchical clustering applications, and identifies the
language and runtime support needed for their efficient execution on
distributed memory machines.
Hierarchical clustering applications are found in a variety of
disciplines, including physics, image processing, document retrieval,
biology, and the social sciences. These applications start with a set
of objects and a set of variables describing these objects. The aim
is to divide the objects into groups or clusters that satisfy certain
constraints. The clusters are built gradually by putting the most
similar objects together and producing a nested sequence or hierarchy
of partitions that can be represented by trees or dendrograms.
The language and runtime support needed for the execution of
hierarchical clustering applications on distributed memory machines is
determined as follows. Starting with a definition of hierarchical
clustering, a graph-theoretic formulation of hierarchical clustering,
using the multigraph concept, is presented. This formulation is used
to describe a programming paradigm for hierarchical clustering. The
paradigm is applied to a representative clustering application, region
growing, and is implemented on a distributed memory machine in both
the data parallel and message passing models. The parallel
implementations provide insight into the language and runtime support
needed for region growing in particular, and for hierarchical
clustering applications in general.
The language and runtime support needed for the execution of
hierarchical clustering applications is described using High
Performance Fortran (HPF) and is divided into the following
categories: data distribution, runtime data re-distribution,
unstructured communication, processors, data structures, and library
routines. The results of the study are useful for the application
programmer, parallel language designer, as well as the compile-time
and runtime systems developer.