Zhongze Li
Department of Computer Science and Engineering,
Beijing University of Aeronautics and Astronautics
Campus Box 2-52, BUAA, Beijing, P.R. China, 100083
e-mail: para@cs.sebuaa.ac.cn
Geoffrey C. Fox
Northeast Parallel Architectures Center
111 College Place
Syracuse University
Syracuse, NY 13244-4100
gcf@npac.syr.edu
Hongbin Liao
Department of Computer Science and Engineering,
Beijing University of Aeronautics and Astronautics
Campus Box 1-85, BUAA, Beijing, P.R. China, 100083
para@cs.sebuaa.ac.cn
Wei Li
Department of Computer Science and Engineering,
Beijing University of Aeronautics and Astronautics
Beijing, P. R. China, 100083
liwei@cs.sebuaa.ac.cn
Abstract
Distributed shared memory (DSM) provides a virtual address space shared among processes on loosely coupled processors. The advantages offered by DSM include the ease of programming and portability of the shared-memory programming paradigm, the low cost of distributed-memory machines, and scalability resulting from the absence of hardware bottlenecks. In DSM systems, many different processors may cache the same page. When one processor writes to such a page, the DSM system must update the copies of the page on the other processors. Adve and Hill give the definition of a consistency model in [1]: a consistency model is essentially a contract between the software and the memory; if the software agrees to obey certain rules, the memory promises to work correctly.
A variety of consistency models have been proposed. In general, consistency models can be classified into two kinds: (a) models that propagate updates on every write, and (b) models that propagate updates only at synchronization points, such as weak consistency, release consistency (RC), and entry consistency (EC).
The primary difference between type (a) and type (b) is the frequency with which consistency is updated. If a consistency model of type (a) is used, the changed values of variables on each processor are propagated to the other processors immediately, so there is frequent communication among processors. With the models of type (b), however, changes are propagated only when a synchronization variable is accessed. The communication events in type (b) are fewer in number but larger in size than those in type (a). The weak consistency, release consistency, and entry consistency schemes should therefore offer improved performance.
The three models of type (b) differ in how synchronization is implemented. In RC, when a release is performed, the processor doing the release pushes the modified data out to all other processors that already hold a cached copy and thus might potentially need it. But there is no way to tell whether those processors will actually need the modified data, so, to be safe, all of them receive everything that has changed. RC can therefore lead to redundant communication.
Lazy release consistency (LRC) [3] was proposed by Keleher to overcome this disadvantage of RC. In LRC, nothing is sent anywhere at the time of a release. Instead, when an acquire is performed, the acquiring processor must obtain the most recent values of the variables from the machine or machines holding them. Unfortunately, LRC cannot hide communication latency, because it does not push data out until another processor performs an acquire.
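The timing difference between the two policies can be illustrated with a toy simulation. The classes and fields below are our own illustration, not any real DSM system's API, and they model only *when* modified data moves: RC pushes at release to every cached copy, while LRC records updates and ships them only to the processor that next performs an acquire.

```python
# Sketch contrasting eager (RC) and lazy (LRC) update propagation.
# All names are illustrative; only the timing of data movement is modeled.

class Processor:
    def __init__(self, pid):
        self.pid = pid
        self.memory = {}          # locally cached copies: var -> value
        self.messages_received = 0

class EagerRC:
    """At release, push modified variables to every other cached copy (RC)."""
    def __init__(self, procs):
        self.procs = procs

    def release(self, writer, modified):
        for p in self.procs:
            if p is not writer:
                p.memory.update(modified)     # pushed whether needed or not
                p.messages_received += 1

class LazyRC:
    """At release, record updates; ship them only on an acquire (LRC)."""
    def __init__(self, procs):
        self.pending = {}                     # var -> latest value

    def release(self, writer, modified):
        self.pending.update(modified)         # nothing is sent yet

    def acquire(self, reader):
        reader.memory.update(self.pending)    # only the acquirer pays
        reader.messages_received += 1

procs = [Processor(i) for i in range(4)]
rc, lrc = EagerRC(procs), LazyRC(procs)

rc.release(procs[0], {"x": 1})    # 3 messages: every other processor updated
lrc.release(procs[0], {"y": 2})   # 0 messages
lrc.acquire(procs[1])             # 1 message: only the acquiring processor
```

After this run, all three non-writers hold "x" (eager RC), but only processor 1 holds "y", having pulled it at its acquire (lazy LRC).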
The entry consistency (EC) scheme was proposed to reduce false sharing. EC requires each ordinary shared variable to be explicitly associated with some synchronization variable, such as a lock or barrier, which protects access to that variable. EC reduces the effect of false sharing, but at the cost of substantial programming complexity compared with release consistency.
To overcome the drawbacks of RC, LRC, and EC, we present a new consistency model, termed the group consistency model. The group consistency model is motivated by the following observations: a) in barrier-based programs, often only a subset of the processors holding a copy of a page actually needs the updated data; b) data written inside a critical section is needed only by the next holder of the corresponding lock; c) some data is repeatedly overwritten locally before any other processor reads it, so its intermediate updates need not be propagated at all.
The main idea of the group consistency model is to establish the relationship between processors and data explicitly. When diffs are produced (a diff is a run-length encoding of the modifications performed by one processor on a shared page), they are propagated to the processors identified by a shared group_id and are not sent to processors outside this group. No diffs are produced inside reuse sections (corresponding to situation c)). Accordingly, we propose three kinds of APIs: barrier(group_id); open_critical_section(L) and close_critical_section(L); and open_reuse_section(L) and close_reuse_section(L). To improve the performance of the group consistency model, we use a home-based group consistency model, as in [13] and [14]. Unlike the protocols of [13] and [14], however, the entire page is not transferred from the home node to the other nodes. Moreover, the group consistency model is more eager than LRC. Before describing the barrier function in detail, we give some definitions.
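As a sketch of how a program might invoke the three API families, the stubs below are hypothetical: they do nothing but record the intended meaning of each call, so the calling pattern can be demonstrated and checked.

```python
# Illustrative stubs for the three API families of the group consistency
# model. These are not an implementation; each stub only logs what the
# real call is supposed to do.

trace = []

def barrier(group_id):
    # Synchronize; propagate locally produced diffs only to the
    # processors named by group_id, not to the whole copy_set.
    trace.append(("barrier", group_id))

def open_critical_section(lock):
    # Writes until the matching close are protected by `lock`; their
    # diffs travel with lock ownership rather than at every barrier.
    trace.append(("open_cs", lock))

def close_critical_section(lock):
    trace.append(("close_cs", lock))

def open_reuse_section(lock):
    # Writes until the matching close are local re-writes of the same
    # data; no diffs are produced for them at all.
    trace.append(("open_reuse", lock))

def close_reuse_section(lock):
    trace.append(("close_reuse", lock))

# A caller repeatedly updates a block locally, then publishes the final
# values only to the processors in its row group.
open_reuse_section("L")
# ... repeated local updates; no diffs generated ...
close_reuse_section("L")
barrier("group_in_myrow")
```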
copy_set: the set of processors that hold a copy of a page.
effect_tag: a two-dimensional array whose first dimension is the page number and whose second dimension is the group_id. A value of 1 means that the processors in that group hold a copy of the page that has only been partly updated.
send_tag: the number of different groups that must be updated by a given diff.
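One possible concrete representation of this metadata, following the textual definitions above. The shapes and the helper function are our illustration, not part of the protocol specification.

```python
# Illustrative representations of the protocol metadata.

N_PAGES, N_GROUPS = 8, 3

# copy_set[page] = set of processor ids holding a copy of that page
copy_set = [set() for _ in range(N_PAGES)]

# effect_tag[page][group_id] = 1 if the processors in group_id hold a
# copy of `page` that has only been partly updated
effect_tag = [[0] * N_GROUPS for _ in range(N_PAGES)]

# send_tag for one diff: how many distinct groups that diff must update
def send_tag_for(diff_pages, page_to_group):
    return len({page_to_group[p] for p in diff_pages})

copy_set[0] = {0, 1, 2}
effect_tag[0][1] = 1                            # group 1 only partly updated
page_to_group = {0: 1, 1: 1, 2: 2}
print(send_tag_for([0, 1, 2], page_to_group))   # pages span groups {1, 2} -> 2
```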
So we can describe the barrier function with the following pseudocode:

barrier(group_id):
    if group_id == unknown:
        send diff to home
        invalidate the page on the processors in copy_set (not including home)
        invalidate all pages with effect_tag == 1
    else if me != home of the diff:
        invalidate the page on the processors in copy_set but not in group_id
        send diff to home
        receive send_tag from home
        if send_tag != 1:
            broadcast the diff within group_id
    else:  // me == home of the diff
        receive diffs from the processors in copy_set
        apply the diffs
        send send_tag to every processor in copy_set
        if send_tag == 1:
            propagate the whole page within group_id
        else:
            set effect_tag to 1
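The branch structure of the pseudocode can be exercised with an executable sketch in which messages are modeled as logged action records. This is a simplification: send_tag is passed in directly, whereas in the protocol it is computed by the home node and received from it.

```python
# An executable sketch of the barrier() pseudocode. Communication is
# modeled as appended log records; the branch structure mirrors the
# pseudocode, with send_tag supplied as an argument for simplicity.

def barrier(me, group_id, home, copy_set, group_members, send_tag, log):
    if group_id == "unknown":
        log.append(("send_diff_to_home", me, home))
        for p in sorted(copy_set - {home}):
            log.append(("invalidate", p))
        log.append(("invalidate_pages_with_effect_tag_1",))
    elif me != home:
        for p in sorted(copy_set - group_members):
            log.append(("invalidate", p))
        log.append(("send_diff_to_home", me, home))
        # receive send_tag from home (modeled by the argument)
        if send_tag != 1:
            log.append(("broadcast_diff", tuple(sorted(group_members))))
    else:  # me == home of the diff
        log.append(("receive_and_apply_diffs", me))
        for p in sorted(copy_set):
            log.append(("send_send_tag", p, send_tag))
        if send_tag == 1:
            log.append(("propagate_whole_page", tuple(sorted(group_members))))
        else:
            log.append(("set_effect_tag", 1))

# Non-home processor 2 reaches a barrier on group {1, 2}; home is 0 and
# the diff touches a single group (send_tag == 1), so no broadcast occurs.
log = []
barrier(2, "g", 0, copy_set={0, 1, 2, 3}, group_members={1, 2},
        send_tag=1, log=log)
```

Note how the non-home branch invalidates only the copies outside the group (processors 0 and 3 here), which is exactly the communication RC would waste.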
The group consistency model thus integrates the advantages of RC and LRC while removing their disadvantages. When group_id is unknown, the group consistency model reduces to LRC; when group_id is global, it reduces to RC. RC and LRC are therefore special cases of the group consistency model. The group consistency approach has four advantages over RC, LRC, and ScC:
1) It does not send diffs to all processors in copy_set; rather, it sends diffs only to the processors in the group. This reduces the redundant communication present in RC.
2) Because the set of processors that need the updated data is defined, updated data is sent eagerly to the processors in the group. The number of page faults is smaller than in LRC and ScC.
3) It produces fewer diffs than the standard RC, LRC, and ScC methods, because critical sections and reuse sections are indicated explicitly.
4) The protocol data and messages are much smaller than in LRC, because data is propagated directly to the processors in the group and there is no need to send a page-request message to the home node.
We used the LU factorization program from SPLASH-2 to verify our idea. LU decomposition is an important and representative linear algebra kernel that can be parallelized efficiently but is quite challenging to implement well. The parallel LU algorithm in SPLASH-2 is as follows:
step1: factor diagonal block
step2: barrier
step3: update perimeter blocks
step4: barrier
step5: update interior blocks.
Figure 1
All blocks that reside on the same row but to the right of the diagonal block, or in the same column but below it, are called perimeter blocks. All blocks in the interior region bounded by the perimeter blocks are called interior blocks. Ownership of blocks is assigned to processors using a 2D block-cyclic decomposition.
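Block ownership under a 2D block-cyclic decomposition, and the perimeter blocks of a diagonal block, can be computed as follows. This is the standard formulation; the helper names are ours.

```python
# Owner of block (i, j) under a 2D block-cyclic decomposition on a
# P x Q processor grid (the layout the SPLASH-2 LU kernel uses).

def block_owner(i, j, P, Q):
    return (i % P) * Q + (j % Q)

# With a 2x2 processor grid, ownership tiles in a checkerboard pattern:
#   [0, 1, 0, 1]
#   [2, 3, 2, 3]
#   [0, 1, 0, 1]
#   [2, 3, 2, 3]
owners = [[block_owner(i, j, 2, 2) for j in range(4)] for i in range(4)]

# Perimeter blocks of diagonal block (k, k): row k to its right and
# column k below it.
def perimeter_blocks(k, n_blocks):
    return ([(k, j) for j in range(k + 1, n_blocks)] +
            [(i, k) for i in range(k + 1, n_blocks)])
```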
There are three disadvantages to using RC or LRC with the above algorithm.
1) At the barrier in step 2, the factored diagonal block may be updated on all processors that hold a copy of it, although only the processors containing perimeter blocks actually need it.
2) At the barrier in step 4, the perimeter blocks may be transferred to all processors, although they only need to be transferred to the other processors along the same row and column.
3) Consider the barrier in step 2 of iteration n. Both the factored diagonal block of iteration n and the interior blocks updated in iteration n-1 may be propagated to all processors, whereas only the factored diagonal block of iteration n actually needs to be.
Using the group consistency model, the algorithm of Figure 1 is transformed into that of Figure 2.
step1: factor diagonal block
step2: barrier(group_diag_row)
step3: barrier(group_diag_col)
step4: update perimeter blocks
step5: barrier(group_in_myrow)
step6: barrier(group_in_mycol)
step7: if this is not the last block
           open_reuse_section(L)
step8:     update interior blocks
step9:     close_reuse_section(L)
step10: else
           update interior blocks
           barrier(unknown)
Figure 2
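The construction of the groups named in Figure 2 is not spelled out above; one plausible derivation from the processor grid is sketched below. All helper names are hypothetical.

```python
# Hypothetical derivation of the Figure 2 groups from a P x Q processor
# grid, with processors numbered row-major (pid = row * Q + col).

def grid_pos(pid, Q):
    return divmod(pid, Q)                  # (row, col) of processor pid

def group_in_myrow(pid, P, Q):
    r, _ = grid_pos(pid, Q)
    return {r * Q + c for c in range(Q)}   # processors sharing my grid row

def group_in_mycol(pid, P, Q):
    _, c = grid_pos(pid, Q)
    return {r * Q + c for r in range(P)}   # processors sharing my grid column

def group_diag_row(k, P, Q):
    # grid row of the processor owning diagonal block (k, k)
    return group_in_myrow((k % P) * Q + (k % Q), P, Q)

def group_diag_col(k, P, Q):
    # grid column of the processor owning diagonal block (k, k)
    return group_in_mycol((k % P) * Q + (k % Q), P, Q)

# On the 2x2 grid of our experiment, processor 0's row group is {0, 1}
# and its column group is {0, 2}.
```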
On a network of four Pentium 133 MHz machines connected by 100 Mbit/s Ethernet, we used PVM 3.3.11 to simulate both the RC and the group consistency (GC) models. The performance of LU factorization is recorded in Table 1. The matrix size is 512x512, the element block is 16x16, and the processor grid is 2x2. Because a 4 KB page holds 512 double-precision numbers, a page is shared among the members of a column of the processor grid. The poor performance of LU under the RC protocol results mainly from disadvantage 3) described above. From Table 1, we see that the performance of LU improves greatly because the GC protocol produces fewer diffs and less communication. Because we could not simulate the overhead of page faults, we did not test the LU example with the LRC protocol. ScC (scope consistency) is based on a lock rather than a barrier mechanism, so we did not evaluate it on this example [13].
In summary, the group consistency model integrates the advantages of RC and LRC, and so it should offer good performance in many DSM programs based on the barrier mechanism. In the final version of this paper we will test a greater variety of programs with different access patterns on a wider variety of hardware with more nodes.
Table 1

Consistency Model | time (sec)
RC                | 11.42
GC                | 6.58
References
1 Adve, S., and Hill, M.: "Weak Ordering: A New Definition," Proc. 17th Ann. Int'l Symp. on Computer Architecture, ACM, pp. 2-14, 1990.
2 Gharachorloo, K., Lenoski, D., Laudon, J., Gibbons, P., Gupta, A., and Hennessy, J.: "Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors," Proc. 17th Ann. Int'l Symp. on Computer Architecture, ACM, pp. 15-26, 1990.
3 Keleher, P., Cox, A.L., and Zwaenepoel, W.: "Lazy Release Consistency," Proc. 19th Ann. Int'l Symp. on Computer Architecture, ACM, pp. 13-21, 1992.
4 Bershad, B.N., Zekauskas, M.J., and Sawdon, W.A.: "The Midway Distributed Shared Memory System," Proc. IEEE COMPCON Conf., IEEE, pp. 528-537, 1993.
5 Iftode, L., Singh, J.P., and Li, K.: "Scope Consistency: A Bridge between Release Consistency and Entry Consistency," Proc. 8th Ann. ACM Symp. on Parallel Algorithms and Architectures, June 1996.
6 Woo, S.C., Singh, J.P., and Hennessy, J.L.: "The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors," Proc. 6th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), October 1994.
7 Lamport, L.: "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs," IEEE Trans. on Computers, vol. C-28, pp. 690-691, Sept. 1979.
8 Hutto, P.W., and Ahamad, M.: "Slow Memory: Weakening Consistency to Enhance Concurrency in Distributed Shared Memories," Proc. 10th Int'l Conf. on Distributed Computing Systems, IEEE, pp. 302-311, 1990.
9 Lipton, R.J., and Sandberg, J.S.: "PRAM: A Scalable Shared Memory," Tech. Rep. CS-TR-180-88, Princeton Univ., Sept. 1988.
10 Dubois, M., Scheurich, C., and Briggs, F.A.: "Memory Access Buffering in Multiprocessors," Proc. 13th Ann. Int'l Symp. on Computer Architecture, ACM, pp. 434-442, 1986.
11 Goodman, J.R.: "Cache Consistency and Sequential Consistency," Tech. Rep. 61, IEEE Scalable Coherent Interface Working Group, IEEE, 1989.
12 Tanenbaum, A.S.: "Distributed Operating Systems," Prentice Hall, p. 291, 1995.
13 Iftode, L., Singh, J.P., and Li, K.: "Scope Consistency: A Bridge between Release Consistency and Entry Consistency," Proc. 8th Ann. ACM Symp. on Parallel Algorithms and Architectures, June 1996.
14 Zhou, Y., Iftode, L., and Li, K.: "Performance Evaluation of Two Home-Based Lazy Release Consistency Protocols for Shared Virtual Memory Systems," Proc. Operating Systems Design and Implementation Symp., October 1996.