Improving the Performance of LU Factorization in Software DSM Environment

Zhongze Li
    Department of Computer Science and Engineering,
    Beijing University of Aeronautics and Astronautics
    Campus Box 2-52, BUAA, Beijing, P.R. China, 100083
    e-mail: para@cs.sebuaa.ac.cn
 
Geoffrey C. Fox
Northeast Parallel Architectures Center
111 College Place
Syracuse University
Syracuse, NY 13244-4100
       gcf@npac.syr.edu
 
Hongbin Liao
Department of Computer Science and Engineering,
Beijing University of Aeronautics and Astronautics
Campus Box 1-85, BUAA, Beijing, P.R. China, 100083
para@cs.sebuaa.ac.cn
 
 Wei Li
    Department of Computer Science and Engineering,
    Beijing University of Aeronautics and Astronautics
Beijing, P. R. China, 100083
liwei@cs.sebuaa.ac.cn
 
Abstract
 
       Distributed shared memory (DSM) provides a virtual address space shared among processes on loosely coupled processors. The advantages offered by DSM include the ease of programming and portability achieved through the shared-memory programming paradigm, the low cost of distributed-memory machines, and the scalability resulting from the absence of hardware bottlenecks. In DSM systems, one page may be cached by many different processors. When one processor writes to such a page, the DSM system must update the copies of the page held by the other processors. Adve and Hill give the following definition of a consistency model in [1]: a consistency model is essentially a contract between the software and the memory. It says that if the software agrees to obey certain rules, the memory promises to work correctly.

    A variety of consistency models have been proposed. In general, they can be sorted into two kinds: (a) consistency models that do not use synchronization operations; strict consistency, sequential consistency [7], causal consistency [8], processor consistency [11], and PRAM consistency [9] belong to type (a). (b) Models with synchronization operations; type (b) consists of weak consistency [10], release consistency [2], and entry consistency [4]. The primary difference between type (a) and type (b) is the frequency with which consistency is maintained. Under a type (a) model, the changes made to variables on each processor are propagated to the other processors immediately, so communication among processors is frequent. Under a type (b) model, changes are propagated only when a synchronization variable is accessed, so the number of messages is much smaller than in type (a), although each message carries more data. In a word, by tying propagation to synchronization accesses, the frequent, small-volume communication of type (a) is turned into infrequent, large-volume communication. Weak consistency, release consistency, and entry consistency should therefore offer better performance.
 
     The three models in type (b) differ in how synchronization works. In RC, when a release is done, the releasing processor pushes all the modified data out to every processor that holds a cached copy and thus might potentially need it. But there is no way to tell which of those processors actually will need the data; some processors that hold a copy of the modified data do not need it at all. To be safe, all of them receive everything that has changed. RC may therefore bring about redundant communication.

     LRC [3] was presented by Keleher to overcome this disadvantage of RC. In LRC, nothing is sent anywhere at the time of a release. Instead, when an acquire is done, the acquiring processor must fetch the most recent values of the variables from the machine or machines holding them. However, LRC cannot hide communication latency, because the data is not moved until the next acquire operation.
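     The difference between the two is purely one of when the update traffic moves. The following C sketch illustrates this with a hypothetical acquire/release lock API (dsm_acquire, dsm_release, use, and shared_x are our placeholder names, not part of any particular system):

    extern void dsm_acquire(int lock);   /* hypothetical DSM lock acquire */
    extern void dsm_release(int lock);   /* hypothetical DSM lock release */
    extern void use(int value);
    extern int  shared_x;                /* lives in the DSM-managed shared space */

    void writer(void)
    {
        dsm_acquire(1);
        shared_x++;        /* modify shared data inside the critical section */
        dsm_release(1);    /* RC: the modification is pushed here to every
                              processor caching the page, needed or not */
    }

    void reader(void)      /* runs later, on another processor */
    {
        dsm_acquire(1);    /* LRC: the modification is pulled here, at
                              acquire time, from the processor holding it */
        use(shared_x);
        dsm_release(1);
    }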

   Entry consistency (EC) was proposed to reduce false sharing. EC requires each ordinary shared variable to be explicitly associated with a synchronization variable, such as a lock or barrier, that protects access to it. EC reduces the effect of false sharing, but at the same time adds substantial programming complexity compared with release consistency.

    In order to overcome the drawbacks of RC, LRC, and EC, we present a new consistency model, named the Group Consistency Model. The group consistency model comes from the following observations:

   a) In parallel computation, the processors that take part in a barrier operation implicitly need the updated data. The processors that need the updated data constitute a group.
   b) In many programs, only a portion of the global space needs to be visible to the processors in a group, as described in [12][13].
   c) Some sections of the global space, although already computed, are computed again in later iterations before other processors read them, so propagating their intermediate contents is wasted work.

  The main idea of the group consistency model is to establish the relationship between processors and data explicitly. Diffs (a diff is a run-length encoding of the modifications performed by one processor on a shared page) are propagated only to the processors identified by group_id, not to processors outside the group. Diffs are not produced in reuse sections (corresponding to observation c). Correspondingly, we propose three kinds of APIs: barrier(group_id); open_critical_section(L) / close_critical_section(L); and open_reuse_section(L) / close_reuse_section(L). To improve performance, we use a home-based group consistency model, as is done in [13][14]. Unlike the protocols in [13][14], however, the entire page is not transferred from the home node to the other nodes. Moreover, the group consistency model is more eager than LRC. Before describing the barrier function in detail, we give some definitions.
    copy_set: the set of processors that hold a copy of a page.
    effect_tag: a two-dimensional array whose first dimension is the page number and whose second dimension is the group_id; effect_tag == 1 indicates that the processors in group_id hold a copy of the page that has been only partially updated.
    send_tag: the number of different groups that need to be updated by a diff.
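
    For concreteness, this protocol metadata could be laid out as follows in C (a sketch; the array bounds and the bitmask representation of copy_set are our assumptions, not part of the protocol definition):

    #define MAX_PAGES  4096   /* assumed number of shared pages */
    #define MAX_GROUPS 16     /* assumed number of processor groups */

    /* copy_set[p]: bitmask of the processors holding a copy of page p */
    unsigned long copy_set[MAX_PAGES];

    /* effect_tag[p][g] == 1: the processors in group g hold a copy of
       page p that has been only partially updated */
    char effect_tag[MAX_PAGES][MAX_GROUPS];

    /* send_tag[p]: number of different groups that must be updated by
       the diffs of page p */
    int send_tag[MAX_PAGES];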
 
        Using these definitions, the barrier function can be described in pseudocode:

    barrier(group_id)
        if (group_id == unknown)
            send diff to home
            invalidate the page on the processors in copy_set (not including home)
            invalidate all pages with effect_tag == 1
        else if (me != home of diff)
            invalidate the copies in copy_set but not in group_id
            send diff to home
            receive send_tag from home
            if (send_tag != 1)
                broadcast diff in group_id
        else  /* me == home of diff */
            receive diffs from copy_set
            apply the diffs
            send send_tag to every processor in copy_set
            if (send_tag == 1)
                propagate the whole page in group_id
            else
                effect_tag = 1
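
      In C-like form, the per-page control flow of this barrier might read as follows, reusing the metadata arrays sketched above. The messaging and invalidation primitives are placeholders for the underlying PVM calls, so this is a sketch of the protocol logic rather than our implementation:

    #define GROUP_UNKNOWN (-1)                    /* assumed sentinel value */

    /* Placeholders for the underlying PVM messaging and page invalidation. */
    extern void send_diff_to_home(int page);
    extern void recv_and_apply_diffs(int page);       /* home side */
    extern int  recv_send_tag_from_home(int page);
    extern void send_send_tag_to_copy_set(int page);
    extern void broadcast_diff_in_group(int page, int group);
    extern void propagate_page_in_group(int page, int group);
    extern void invalidate_copies_outside(int page, int group);
    extern void invalidate_pages_with_effect_tag(void);
    extern int  home_of(int page);

    void gc_barrier_page(int page, int group, int me)
    {
        if (group == GROUP_UNKNOWN) {
            /* Group unknown: behave like LRC, flushing to the home node. */
            send_diff_to_home(page);
            invalidate_copies_outside(page, GROUP_UNKNOWN);  /* all but home */
            invalidate_pages_with_effect_tag();
        } else if (me != home_of(page)) {
            invalidate_copies_outside(page, group);  /* in copy_set, not in group */
            send_diff_to_home(page);
            if (recv_send_tag_from_home(page) != 1)
                broadcast_diff_in_group(page, group);
        } else {
            recv_and_apply_diffs(page);              /* collect diffs from copy_set */
            send_send_tag_to_copy_set(page);
            if (send_tag[page] == 1)
                propagate_page_in_group(page, group);
            else
                effect_tag[page][group] = 1;         /* group copy partly updated */
        }
    }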
 
 
     Apparently, the group consistency model integrates the advantages of RC and LRC and removes their disadvantages. When group_id is unknown, the group consistency model degenerates into LRC; when group_id is global, it degenerates into RC. RC and LRC are thus special cases of the group consistency model. In a word, group consistency has four advantages over RC, LRC, and ScC (Scope Consistency [13]):
    1) It does not send diffs to all processors in copy_set, only to the processors in the group. The processors in the group need the updated data implicitly, which removes the redundant communication present in RC.
    2) Because the set of processors that need the updated data is known exactly, the updated data is sent to the processors in the group eagerly. The number of page faults is smaller than under LRC and ScC.
    3) It produces fewer diffs than standard RC, LRC, and ScC, because critical sections and reuse sections are indicated explicitly.
    4) Its protocol data and messages are much smaller than those of LRC, because data is propagated directly to the processors in the group, so no page-request messages need to be sent to the home node.
 
      We test the LU program from SPLASH-2 in order to verify our idea. LU decomposition is dominant in dense linear computing; it is representative of many dense linear algebra computations and can be performed efficiently in parallel. The parallel LU algorithm in SPLASH-2 is described as follows:

                    step 1: factor diagonal block
                    step 2: barrier
                    step 3: update perimeter blocks
                    step 4: barrier
                    step 5: update interior blocks

                              Figure 1

    All blocks that reside on the same row but to the right of the diagonal block, and those in the same column but below it, are called perimeter blocks. All blocks in the interior region bounded by the perimeter blocks are called interior blocks. Ownership of blocks is assigned to processors using a 2D block-cyclic decomposition.
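    Under a 2D block-cyclic decomposition, the owner of a block is a simple function of its block coordinates. A minimal sketch (the function name and the row-major processor numbering are our choices):

    /* Owner of block (i, j) under a 2D block-cyclic decomposition onto a
       Pr x Pc processor grid, with processors numbered row-major: block
       rows cycle over grid rows, block columns over grid columns. */
    int block_owner(int i, int j, int Pr, int Pc)
    {
        return (i % Pr) * Pc + (j % Pc);
    }

    On the 2x2 grid used in the experiment below, for example, block (3, 5) maps to processor (3 mod 2) * 2 + (5 mod 2) = 3.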

   There are three disadvantages in the above algorithm if RC or LRC is used:
   1) In step 2, the factored diagonal block may be updated on all processors that hold a copy of it, though only the processors that contain perimeter blocks actually need it.
   2) In step 4, the perimeter blocks need only be transferred to the other processors along the same row and column, yet they may be transferred to all processors.
   3) At the barrier in step 2 of iteration n, both the factored diagonal block of iteration n and the interior blocks updated in iteration n-1 are propagated to all processors, though only the factored diagonal block of iteration n needs to be propagated.

      Using the group consistency model, Figure 1 can be changed into Figure 2:

                    step 1:  factor diagonal block
                    step 2:  barrier(group_diag_row)
                    step 3:  barrier(group_diag_col)
                    step 4:  update perimeter blocks
                    step 5:  barrier(group_in_myrow)
                    step 6:  barrier(group_in_mycol)
                    step 7:  if this is not the last block
                                 open_reuse_section(L)
                    step 8:      update interior blocks
                    step 9:      close_reuse_section(L)
                    step 10: else
                                 update interior blocks
                                 barrier(unknown)

                              Figure 2
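    In C-like form, one iteration k of the restructured factorization could look as follows; barrier, open_reuse_section and close_reuse_section are the proposed API, while the group identifiers and the block-update helpers (i_own_diagonal_block, update_perimeter_blocks, and so on) are illustrative names of ours:

    for (k = 0; k < nblocks; k++) {
        if (i_own_diagonal_block(k))
            factor_diagonal_block(k);
        barrier(group_diag_row);        /* diagonal block goes to its row ... */
        barrier(group_diag_col);        /* ... and to its column only */

        update_perimeter_blocks(k);
        barrier(group_in_myrow);        /* perimeter blocks go only to the */
        barrier(group_in_mycol);        /* processors that will consume them */

        if (k != nblocks - 1) {
            open_reuse_section(L);      /* interior blocks will be overwritten
                                           next iteration: produce no diffs */
            update_interior_blocks(k);
            close_reuse_section(L);
        } else {
            update_interior_blocks(k);  /* last iteration: results are final */
            barrier(GROUP_UNKNOWN);     /* fall back to a full LRC-style flush */
        }
    }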
    On 4 Pentium 133 machines connected by 100 Mbit/sec Ethernet, we used PVM 3.3.11 to simulate the RC and group consistency models. The performance of LU factorization is shown in Table 1. The matrix size is 512x512, the element blocks are 16x16, and the processor grid is 2x2. Because one 4 KB page holds 512 double-precision numbers, a page is shared among a column of the processor grid. The poor performance of LU under the RC protocol results mainly from disadvantage 3) described above. Table 1 shows that the performance of LU improves greatly because the GC protocol produces fewer diffs and fewer messages. Because we cannot simulate the overhead of page faults, we did not test LU under the LRC protocol. LU is reported to be insensitive to ScC [13]; ScC is based on locks rather than barriers.
    In a word, the group consistency model integrates the advantages of RC and LRC, so it should offer good performance in many DSM programs based on the barrier mechanism. We will test a variety of programs with different access patterns.
Table 1

    Consistency Model    time (sec)
    RC                      11.42
    GC                       6.58
 

 References
1   Adve, S., and Hill, M.: "Weak Ordering: A New Definition," Proc. 17th Ann. Int'l Symp. on Computer Architecture, ACM, pp. 2-14, 1990.
2   Gharachorloo, K., Lenoski, D., Laudon, J., Gibbons, P., Gupta, A., and Hennessy, J.: "Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors," Proc. 17th Ann. Int'l Symp. on Computer Architecture, ACM, pp. 15-26, 1990.
3   Keleher, P., Cox, A.L., and Zwaenepoel, W.: "Lazy Release Consistency," Proc. 19th Ann. Int'l Symp. on Computer Architecture, ACM, pp. 13-21, 1992.
4   Bershad, B.N., Zekauskas, M.J., and Sawdon, W.A.: "The Midway Distributed Shared Memory System," Proc. IEEE COMPCON Conf., IEEE, pp. 528-537, 1993.
5   Iftode, L., Singh, J.P., and Li, K.: "Scope Consistency: A Bridge between Release Consistency and Entry Consistency," Proc. 8th Ann. ACM Symp. on Parallel Algorithms and Architectures, June 1996.
6   Woo, S.C., Singh, J.P., and Hennessy, J.L.: "The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors," Proc. 6th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), October 1994.
7   Lamport, L.: "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs," IEEE Trans. on Computers, vol. C-28, pp. 690-691, Sept. 1979.
8   Hutto, P.W., and Ahamad, M.: "Slow Memory: Weakening Consistency to Enhance Concurrency in Distributed Shared Memories," Proc. 10th Int'l Conf. on Distributed Computing Systems, IEEE, pp. 302-311, 1990.
9   Lipton, R.J., and Sandberg, J.S.: "PRAM: A Scalable Shared Memory," Tech. Rep. CS-TR-180-88, Princeton Univ., Sept. 1988.
10  Dubois, M., Scheurich, C., and Briggs, F.A.: "Memory Access Buffering in Multiprocessors," Proc. 13th Ann. Int'l Symp. on Computer Architecture, ACM, pp. 434-442, 1986.
11  Goodman, J.R.: "Cache Consistency and Sequential Consistency," Tech. Rep. 61, IEEE Scalable Coherent Interface Working Group, IEEE, 1989.
12  Tanenbaum, A.S.: "Distributed Operating Systems," Prentice Hall, p. 291, 1995.
13  Iftode, L., Singh, J.P., and Li, K.: "Scope Consistency: A Bridge between Release Consistency and Entry Consistency," Proc. 8th Ann. ACM Symp. on Parallel Algorithms and Architectures, June 1996.
14  Zhou, Y., Iftode, L., and Li, K.: "Performance Evaluation of Two Home-Based Lazy Release Consistency Protocols for Shared Virtual Memory Systems," Proc. Symp. on Operating Systems Design and Implementation, October 1996.