Improving the Performance of LU Factorization in Software DSM Environment

Zhongze Li

    Department of Computer Science and Engineering,

    Beijing University of Aeronautics and Astronautics

    Campus Box 2-52, BUAA, Beijing, P.R. China, 100083

    e-mail: para@cs.sebuaa.ac.cn

 

Geoffrey C. Fox

Northeast Parallel Architectures Center

111 College Place

Syracuse University

Syracuse, NY 13244-4100

       gcf@npac.syr.edu

 

Hongbin Liao

Department of Computer Science and Engineering,

Beijing University of Aeronautics and Astronautics

    Campus Box 1-85, BUAA, Beijing, P.R. China, 100083

para@cs.sebuaa.ac.cn

 

 Wei Li

    Department of Computer Science and Engineering,

    Beijing University of Aeronautics and Astronautics

Beijing, P.R. China, 100083

liwei@cs.sebuaa.ac.cn

 

Abstract

 
       Distributed shared memory (DSM) provides a virtual address space shared among processes on loosely coupled processors. The advantages offered by DSM include ease of programming and portability achieved through the shared-memory programming paradigm, the low cost of distributed-memory machines, and scalability resulting from the absence of hardware bottlenecks. In DSM systems, several processors may cache the same page. When one processor writes to this page, the DSM system must update the copies of the page on the other processors. Adve and Hill give the definition of a consistency model in [1]: a consistency model is essentially a contract between the software and the memory. It says that if the software agrees to obey certain rules, the memory promises to work correctly.

    A variety of consistency models have been proposed. In general, they fall into two categories:

  (a) Consistency models without synchronization operations: strict consistency, sequential consistency [7], causal consistency [8], processor consistency [11], and PRAM consistency [9].
  (b) Consistency models with synchronization operations: weak consistency [10], release consistency [2], and entry consistency [4].

The primary difference between type (a) and type (b) is the frequency with which consistency is enforced. Under a type (a) model, the changed values of variables on each processor are propagated to the other processors immediately, so there is frequent communication among processors. Under a type (b) model, changes are propagated only when a synchronization variable is accessed. The communication events in type (b) are smaller in number but larger in size than in type (a) schemes, so the weak consistency, release consistency, and entry consistency schemes should offer improved performance.
 
     The three models in type (b) differ in how synchronization is implemented. In release consistency (RC), when a release is performed, the processor doing the release pushes the modified data out to all other processors that already hold a cached copy and thus might potentially need it. There is no way to tell which of those processors will actually need the modified data, so, to be safe, all of them receive everything that has changed. RC can therefore lead to redundant communication.

     Lazy release consistency (LRC) [3] was proposed by Keleher to overcome this disadvantage of RC. In LRC, at the time of a release, nothing is sent anywhere. Instead, when an acquire is performed, the acquiring processor must obtain the most recent values of the variables from the machine or machines holding them. Unfortunately, LRC cannot hide communication latency, because it does not push data out until another processor performs an acquire.

    The entry consistency scheme (EC) was proposed to reduce false sharing. EC requires each ordinary shared variable to be explicitly associated with a synchronization variable, such as a lock or barrier, which protects access to that variable. EC reduces the effect of false sharing but at the same time adds substantial programming complexity compared with release consistency.
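    To make the annotation burden concrete, the following is a minimal sketch of an EC-style program. The API names (ec_bind, ec_acquire, ec_release) are hypothetical placeholders for the kind of explicit association a system such as Midway requires; they are not Midway's actual interface, and the stubs below only mark where a real runtime would act.

    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical EC-style lock that carries the data bound to it. */
    typedef struct { void *data; size_t size; } ec_lock_t;

    static void ec_bind(ec_lock_t *l, void *d, size_t n) { l->data = d; l->size = n; }
    static void ec_acquire(ec_lock_t *l) { (void)l; /* fetch fresh copies of bound data */ }
    static void ec_release(ec_lock_t *l) { (void)l; /* publish local updates to bound data */ }

    static double shared_row[8];

    int main(void)
    {
        ec_lock_t row_lock;
        ec_bind(&row_lock, shared_row, sizeof shared_row); /* explicit association */

        ec_acquire(&row_lock);          /* only shared_row is made consistent */
        for (int i = 0; i < 8; i++)
            shared_row[i] = 2.0 * i;
        ec_release(&row_lock);

        printf("%g\n", shared_row[3]);
        return 0;
    }

    Because the consistency unit is the bound variable rather than a whole page, false sharing disappears, but every shared object needs this per-variable bookkeeping.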

    In order to overcome the drawbacks of RC, LRC, and EC, we present a new consistency model, termed the group consistency model (GC). The group consistency model is motivated by the following observations:

  1. In parallel computation, the processors that take part in a barrier operation typically share updated data and constitute a group. This is also seen in message-passing approaches such as MPI, where the most elegant programming practices are built around collective communication primitives (such as shift or broadcast) that essentially define such groups of processors.
  2. In many programs, only a portion of the global data space needs to be visible to the processors in a group, as described in [12] and [13].
  3. Some sections of the global data space, although modified in one iteration, are recomputed in a later iteration before they are read by another processor; propagating updates to such sections is wasted work.

  The main idea of the group consistency model is to establish the relationship between processors and data explicitly. When diffs are produced (a diff is a run-length encoding of the modifications performed by one processor on a shared page), they are propagated to the processors identified by a shared group_id and are not sent to processors outside this group. Diffs are not produced in reuse sections (corresponding to observation 3). Correspondingly, we propose three kinds of APIs: barrier(group_id); open_critical_section(L) and close_critical_section(L); and open_reuse_section(L) and close_reuse_section(L). To improve the performance of the group consistency model, we use a home-based protocol as in [13] and [14], but unlike the protocols of [13] and [14], the home does not always transfer the entire page to the other nodes. Moreover, the group consistency model is more eager than LRC. Before describing the barrier function in detail, we give some definitions.
                    copy_set:   the set of processors that hold a copy of a page.
                    effect_tag: a two-dimensional array indexed by page number and group_id. If effect_tag is 1 for a page and group, the copies of the page held by the processors in that group are only partly updated: the home has applied diffs that have not yet been propagated to the group, so those copies must be invalidated at the next barrier(unknown).
                    send_tag:   the number of different groups that need to be updated by a given diff.
 
        So we can describe the barrier function with the following pseudocode:

    barrier(group_id)
        if group_id == unknown                  // caller cannot name a group: fall back to LRC-like behavior
            send diff to home
            invalidate the page on every processor in copy_set (excluding home)
            invalidate all pages with effect_tag == 1
        else if me != home of diff
            invalidate the page on processors in copy_set but not in group_id
            send diff to home
            receive send_tag from home
            if send_tag != 1                    // several groups need the diff
                broadcast diff within group_id
        else                                    // me == home of diff
            receive diffs from copy_set
            apply the diffs
            send send_tag to every processor in copy_set
            if send_tag == 1                    // exactly one group needs the page
                propagate the whole page within group_id
            else
                effect_tag = 1                  // mark remote copies as partly updated
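
     As a concrete illustration of this metadata, the sketch below shows one plausible representation in C. The constants and field layout (N_PAGES, N_GROUPS, the bitmask copy_set) are our own assumptions for illustration, not the actual implementation.

    #include <stdint.h>
    #include <stdio.h>

    #define N_PAGES  1024  /* assumed number of shared pages     */
    #define N_GROUPS 16    /* assumed number of processor groups */

    /* Per-page protocol metadata for the group consistency model. */
    typedef struct {
        uint32_t copy_set;             /* bitmask of processors holding a copy */
        int      home;                 /* home processor of the page           */
        uint8_t  effect_tag[N_GROUPS]; /* 1 => group's copies partly updated   */
    } page_meta_t;

    static page_meta_t pages[N_PAGES];

    /* send_tag for one diff: how many distinct groups must receive it.
       Only when send_tag == 1 can the home fold the diff into a single
       whole-page transfer to that group. */
    static int compute_send_tag(const uint8_t needs_update[N_GROUPS])
    {
        int tag = 0;
        for (int g = 0; g < N_GROUPS; g++)
            tag += needs_update[g] ? 1 : 0;
        return tag;
    }

    int main(void)
    {
        uint8_t needs[N_GROUPS] = {0};
        needs[2] = needs[5] = 1;           /* two groups need this diff...      */
        pages[0].effect_tag[2] = 1;        /* ...so the home marks the page as  */
        pages[0].effect_tag[5] = 1;        /* partly updated in those groups    */
        printf("send_tag = %d\n", compute_send_tag(needs));
        return 0;
    }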
 
 
     The group consistency model thus integrates the advantages of RC and LRC while removing their disadvantages. When group_id is unknown, the group consistency model degenerates to LRC; when group_id is global, it defaults to RC. RC and LRC are thus special cases of the group consistency model. The group consistency approach has four advantages over RC, LRC, and ScC:
1) It does not send diffs to all processors in copy_set; rather, it sends diffs only to the processors in the group. This reduces the redundant communication present in RC.

2) Because the set of processors that need the updated data is defined explicitly, updated data is sent eagerly to the processors in the group. The number of page faults is smaller than in LRC and ScC.

3) It produces fewer diffs than the standard RC, LRC, and ScC methods, because critical sections and reuse sections are indicated explicitly, so diffs are generated only where they are needed.

4) The protocol data and messages are much smaller than in LRC, because data is propagated directly to the processors in the group and there is no need to send a page-request message to the home.
 
We used the LU factorization program from SPLASH-2 to verify our idea. LU decomposition is a typical, important linear algebra kernel that can be programmed efficiently but is quite challenging to implement well. The parallel LU algorithm in SPLASH-2 is defined as follows:

                                                                     step1: factor diagonal block
                                                                     step2: barrier
                                                                     step3: update perimeter blocks
                                                                     step4: barrier
                                                                     step5: update interior blocks
 
                                                                                            Figure 1

All blocks that reside in the same row but to the right of the diagonal block, or in the same column but below it, are called perimeter blocks. All blocks in the interior region bounded by the perimeter blocks are called interior blocks. Ownership of blocks is assigned to processors using a 2D block-cyclic decomposition.
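
For concreteness, here is a minimal sketch of a 2D block-cyclic ownership computation over a P x Q processor grid; the function name and grid parameters are our own notation, not SPLASH-2 identifiers.

    #include <stdio.h>

    /* Owner of block (bi, bj): block rows cycle over grid rows and
       block columns cycle over grid columns. */
    static int block_owner(int bi, int bj, int P, int Q)
    {
        return (bi % P) * Q + (bj % Q);
    }

    int main(void)
    {
        /* With a 2x2 grid as in our experiment, print the owners of
           the top-left 4x4 corner of the block matrix. */
        for (int bi = 0; bi < 4; bi++) {
            for (int bj = 0; bj < 4; bj++)
                printf("%d ", block_owner(bi, bj, 2, 2));
            printf("\n");
        }
        return 0;
    }

Neighbouring blocks thus land on different processors, which balances the shrinking active region of LU across the grid.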

There are three disadvantages to using RC or LRC with the above algorithm:
  1) At the barrier in step 2, the factored diagonal block may be updated on all processors that hold a copy of it, although only the processors that own perimeter blocks actually need it.
  2) At the barrier in step 4, the perimeter blocks may be transferred to all processors, although they only need to be transferred to processors along the same row and column.
  3) Consider the barrier in step 2 of iteration n. Both the diagonal block factored in iteration n and the interior blocks updated in iteration n-1 may be propagated to all processors, whereas only the diagonal block factored in iteration n actually needs to be propagated.

Using the group consistency model, the algorithm of Figure 1 becomes the one shown in Figure 2:
                                                                   step1:   factor diagonal block
                                                                   step2:   barrier(group_diag_row)
                                                                   step3:   barrier(group_diag_col)
                                                                   step4:   update perimeter blocks
                                                                   step5:   barrier(group_in_myrow)
                                                                   step6:   barrier(group_in_mycol)
                                                                   step7:   if this is not the last block
                                                                                    open_reuse_section(l)
                                                                   step8:       update interior blocks
                                                                   step9:       close_reuse_section(l)
                                                                   step10:  else
                                                                                    update interior blocks
                                                                                    barrier(unknown)

Figure 2
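
To make the control flow of Figure 2 concrete, the following sketch expresses one outer iteration in C against the GC API introduced above. The group identifiers match the names in Figure 2, but the loop structure, the kernel placeholders (factor_diagonal, update_perimeter, update_interior), and the no-op API stubs are our own illustrative assumptions, not the actual SPLASH-2 or runtime code.

    #include <stdio.h>

    /* No-op stubs for the GC API; a real DSM runtime would implement
       the diff propagation described in the barrier pseudocode above. */
    enum { group_diag_row, group_diag_col, group_in_myrow,
           group_in_mycol, group_unknown };
    static void barrier(int group_id)      { (void)group_id; }
    static void open_reuse_section(int l)  { (void)l; }
    static void close_reuse_section(int l) { (void)l; }

    /* Placeholders for the LU kernels (not SPLASH-2 code). */
    static void factor_diagonal(int k)  { printf("factor block %d\n", k); }
    static void update_perimeter(int k) { (void)k; }
    static void update_interior(int k)  { (void)k; }

    static void lu_iteration(int k, int n_blocks)
    {
        factor_diagonal(k);            /* step 1 */
        barrier(group_diag_row);       /* steps 2-3: the factored diagonal  */
        barrier(group_diag_col);       /* block reaches only its row/column */
        update_perimeter(k);           /* step 4 */
        barrier(group_in_myrow);       /* steps 5-6: perimeter blocks reach */
        barrier(group_in_mycol);       /* only processors in my row/column  */
        if (k != n_blocks - 1) {       /* step 7 */
            open_reuse_section(1);     /* interior data will be recomputed, */
            update_interior(k);        /* step 8: so no diffs are produced  */
            close_reuse_section(1);    /* step 9 */
        } else {                       /* step 10 */
            update_interior(k);
            barrier(group_unknown);    /* final results must become global  */
        }
    }

    int main(void) { lu_iteration(0, 32); return 0; }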

    On a network of four 133 MHz Pentium machines connected by 100 Mbit/s Ethernet, we used PVM 3.3.11 to simulate both the RC and the group consistency model. The performance of LU factorization is recorded in Table 1. The matrix size is 512x512, the element block is 16x16, and the processor grid is 2x2. Because a 4 KB page holds 512 double-precision numbers, a page is shared among the members of one column of the processor grid. The poor performance of LU under the RC protocol results mainly from disadvantage 3) described above. From Table 1, we see that the performance of LU improves greatly because the GC protocol produces fewer diffs and less communication. Because we could not simulate the overhead of page faults, we did not test the LU example under the LRC protocol. ScC (scope consistency) is based on a lock mechanism rather than a barrier mechanism, so we did not evaluate it on this example [13].
In summary, the group consistency model integrates the advantages of RC and LRC, and so it should offer good performance in many DSM programs based on the barrier mechanism. We will test a greater variety of programs with different access patterns and use a wider variety of hardware with more nodes in the final version of the paper.

Table 1

    Consistency Model    time (sec)
    RC                   11.42
    GC                    6.58

 

References
1   Adve, S., and Hill, M.: "Weak Ordering: A New Definition," Proc. 17th Ann. Int'l Symp. on Computer Architecture, ACM, pp. 2-14, 1990.
2   Gharachorloo, K., Lenoski, D., Laudon, J., Gibbons, P., Gupta, A., and Hennessy, J.: "Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors," Proc. 17th Ann. Int'l Symp. on Computer Architecture, ACM, pp. 15-26, 1990.
3   Keleher, P., Cox, A.L., and Zwaenepoel, W.: "Lazy Release Consistency," Proc. 19th Ann. Int'l Symp. on Computer Architecture, ACM, pp. 13-21, 1992.
4   Bershad, B.N., Zekauskas, M.J., and Sawdon, W.A.: "The Midway Distributed Shared Memory System," Proc. IEEE COMPCON Conf., IEEE, pp. 528-537, 1993.
5   Iftode, L., Singh, J.P., and Li, K.: "Scope Consistency: A Bridge between Release Consistency and Entry Consistency," Proc. 8th Ann. ACM Symp. on Parallel Algorithms and Architectures, June 1996.
6   Woo, S.C., Singh, J.P., and Hennessy, J.L.: "The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors," Proc. 6th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), October 1994.
7   Lamport, L.: "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs," IEEE Trans. on Computers, vol. C-28, pp. 690-691, Sept. 1979.
8   Hutto, P.W., and Ahamad, M.: "Slow Memory: Weakening Consistency to Enhance Concurrency in Distributed Shared Memories," Proc. 10th Int'l Conf. on Distributed Computing Systems, IEEE, pp. 302-311, 1990.
9   Lipton, R.J., and Sandberg, J.S.: "PRAM: A Scalable Shared Memory," Tech. Rep. CS-TR-180-88, Princeton Univ., Sept. 1988.
10  Dubois, M., Scheurich, C., and Briggs, F.A.: "Memory Access Buffering in Multiprocessors," Proc. 13th Ann. Int'l Symp. on Computer Architecture, ACM, pp. 434-442, 1986.
11  Goodman, J.R.: "Cache Consistency and Sequential Consistency," Tech. Rep. 61, IEEE Scalable Coherent Interface Working Group, IEEE, 1989.
12  Tanenbaum, A.S.: "Distributed Operating Systems," Prentice Hall, p. 291, 1995.
13  Iftode, L., Singh, J.P., and Li, K.: "Scope Consistency: A Bridge between Release Consistency and Entry Consistency," Proc. 8th Ann. ACM Symp. on Parallel Algorithms and Architectures, June 1996.
14  Zhou, Y., Iftode, L., and Li, K.: "Performance Evaluation of Two Home-Based Lazy Release Consistency Protocols for Shared Virtual Memory Systems," Proc. Operating Systems Design and Implementation Symp., October 1996.