A variety of consistency models have been proposed. In general, they can be sorted into two kinds: (a) consistency models that do not use synchronization operations, and (b) models with synchronization operations. Strict consistency, sequential consistency [7], causal consistency [8], processor consistency [11], and PRAM consistency [9] belong to type (a). Type (b) consists of weak consistency [10], release consistency [2], and entry consistency [4]. The primary difference between type (a) and type (b) is the frequency of maintaining consistency. Under a type (a) model, the changes to variables on each processor are propagated to the other processors immediately, so communication among processors is frequent. Under a type (b) model, by contrast, changes are propagated only when a synchronization variable is accessed. The number of communications in type (b) is therefore much smaller than in type (a), although each communication carries a larger volume of data. In a word, by tying propagation to accesses to synchronization variables, the frequent, small-volume communication of type (a) is changed into infrequent, large-volume communication, so weak consistency, release consistency, and entry consistency should offer better performance.
The three models in type (b) differ in how synchronization works. In RC, when a release is done, the processor doing the release pushes all the modified data out to every other processor that already holds a cached copy and thus might potentially need it. There is no way to tell which of those processors will actually need the data; some processors that hold a copy of the modified data do not need it at all. To be safe, all of them receive everything that has changed, so RC may bring about redundant communication.
LRC [3] was presented by Keleher to overcome this disadvantage of RC. In LRC, nothing is sent anywhere at the time of a release. Instead, when an acquire is done, the processor performing the acquire must get the most recent values of the variables from the machine or machines holding them. LRC, however, cannot hide communication latency, because it does not push the data out until the next acquire operation.
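To make the contrast concrete, the following minimal sketch (ours, not taken from [2] or [3]) shows the same critical section under the two models; lock_acquire and lock_release are stand-in stubs for the DSM runtime's entry points:

    #include <stdio.h>

    /* Stub runtime hooks; the comments mark where each model acts. */
    static void lock_acquire(int lock)
    {
        /* LRC: fetch the diffs from the last releaser here. */
    }

    static void lock_release(int lock)
    {
        /* RC: push the diffs to every processor in copy_set here. */
    }

    static int shared_x;    /* stands in for a variable on a shared page */

    int main(void)
    {
        lock_acquire(0);
        shared_x = 42;      /* the modification is recorded as a diff    */
        lock_release(0);    /* RC sends the diff to the whole copy_set
                               now; LRC defers it until the next acquire */
        printf("shared_x = %d\n", shared_x);
        return 0;
    }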
Entry consistency (EC) was proposed to reduce false sharing. EC requires each ordinary shared variable to be explicitly associated with some synchronization variable, such as a lock or barrier, that protects access to it. EC reduces the effect of false sharing, but at the same time it adds substantial programming complexity compared with release consistency.
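The annotation burden can be illustrated with a sketch (dsm_bind is a hypothetical call invented here for illustration, not Midway's actual interface): every ordinary shared variable must be bound to the synchronization variable that protects it before it is used.

    #include <stddef.h>

    /* Hypothetical: associate a region of shared data with the lock
       that protects it, as EC requires. */
    extern void dsm_bind(void *data, size_t size, int lock_id);

    int shared_counter;

    void setup(void)
    {
        /* One such binding is needed for every shared variable;
           this bookkeeping is the added programming complexity. */
        dsm_bind(&shared_counter, sizeof shared_counter, 1);
    }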
To overcome the drawbacks of RC, LRC, and EC, we present a new consistency model, named the group consistency model. The model comes from the following observations:
a) In parallel computation, the processors that take part in a barrier operation implicitly need the updated data. The processors that need the updated data constitute a group.
b) In many programs, only a portion of the global space needs to be visible to the processors in a group, as described in [12][13].
c) Some sections of the global space that have already been computed are computed again.
The main idea of the group consistency model is to establish the relationship between processors and data explicitly. Diffs (a diff is a run-length encoding of the modifications performed by one processor on a shared page) are propagated only to the processors identified by a group_id, not to processors outside the group. Diffs are not produced in reuse sections (corresponding to observation c). Correspondingly, we propose three kinds of APIs: barrier(group_id); open_critical_section(L) and close_critical_section(L); open_reuse_section(L) and close_reuse_section(L).
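The following sketch shows how application code might use these APIs (the function compute_phase and its parameters are illustrative assumptions; only the five calls above come from the text):

    /* The proposed interface, as listed above. */
    void barrier(int group_id);
    void open_critical_section(int L);
    void close_critical_section(int L);
    void open_reuse_section(int L);
    void close_reuse_section(int L);

    void compute_phase(int my_group, int L)
    {
        open_critical_section(L);
        /* ... modify shared data; diffs are produced here ... */
        close_critical_section(L);

        barrier(my_group);  /* diffs reach only the processors in my_group */

        open_reuse_section(L);
        /* ... recompute a section that was computed before;
           no diffs are produced inside a reuse section ... */
        close_reuse_section(L);
    }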
To improve the performance of the group consistency model, we use a home-based group consistency model, as is done in [13][14]. Unlike the protocols in [13][14], however, home-based group consistency does not transfer the entire page from the home to the other nodes. Moreover, the group consistency model is more eager than LRC. Before describing the barrier function in detail, we give some definitions.
copy_set: the set of processors that hold a copy of a page.
effect_tag: a two-dimensional array whose first dimension is the page number and whose second dimension is the group_id; effect_tag = 1 indicates that the copies held by the processors in group_id have been only partly updated.
send_tag: the number of different groups that need to be updated by a diff.
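One possible in-memory layout for this metadata is sketched below (the array bounds and the bit-set representation of copy_set are our assumptions):

    #define MAX_PAGES  4096
    #define MAX_GROUPS 16

    typedef struct {
        unsigned long copy_set;       /* bit i set: processor i holds a
                                         copy of the page               */
        char effect_tag[MAX_GROUPS];  /* 1: the copies held by the
                                         processors in this group are
                                         only partly updated            */
    } page_meta_t;

    page_meta_t page_meta[MAX_PAGES]; /* indexed by page number */

    /* send_tag is computed per diff at the home: the number of
       different groups that the diff must update. */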
The barrier function can be described with the following pseudocode:

barrier(group_id)
    if (group_id == unknown)
        send the diff to the home
        invalidate the page on the processors in copy_set (not including the home)
        invalidate all pages whose effect_tag == 1
    else if (me != home of the diff)
        invalidate the page on the processors that are in copy_set but not in group_id
        send the diff to the home
        receive send_tag from the home
        if (send_tag != 1)
            broadcast the diff within group_id
    else  /* me == home of the diff */
        receive the diffs from copy_set
        apply the diffs
        send send_tag to every processor in copy_set
        if (send_tag == 1)
            propagate the whole page within group_id
        else
            effect_tag = 1
Apparently, group consistency integrates the advantages of RC and LRC and removes their disadvantages. When group_id equals unknown, the group consistency model degenerates into LRC; when group_id is global, it degenerates into RC. RC and LRC are therefore special cases of the group consistency model. In a word, group consistency has four advantages over RC, LRC, and scope consistency (ScC) [13]:
1) It does not send a diff to every processor in copy_set; it sends the diff only to the processors in the group. The processors in the group implicitly need the updated data, so the redundant communication that exists in RC is reduced.
2) Because the set of processors that need the updated data is known exactly, the updated data can be sent to the processors in the group eagerly. The number of page faults is therefore smaller than in LRC and ScC.
3) It produces fewer diffs than standard RC, LRC, and ScC, because critical sections and reuse sections are indicated explicitly.
4) The protocol data and messages are much smaller than in LRC, because data is propagated directly to the processors in the group, so no page-request message needs to be sent to the home.
We test the LU program from SPLASH-2 in order to verify our idea. LU decomposition is dominant in dense linear computing; it is representative of many dense linear algebra computations and can be performed efficiently. The parallel LU algorithm in SPLASH-2 is described as follows:
step 1: factor the diagonal block
step 2: barrier
step 3: update the perimeter blocks
step 4: barrier
step 5: update the interior blocks
Figure 1
All blocks that reside in the same row as the diagonal block but to its right, and in the same column but below it, are called perimeter blocks. All blocks in the interior region bounded by the perimeter blocks are called interior blocks. Ownership of blocks is assigned to processors using a 2D block-cyclic decomposition.
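For reference, the loop structure behind Figure 1 can be sketched as follows (a simplification; owner(), factor_block(), update_perimeter(), update_interior(), and barrier() are placeholders, not the SPLASH-2 source):

    extern int  owner(int i, int j);  /* 2D block-cyclic owner of block (i,j) */
    extern void factor_block(int k);
    extern void update_perimeter(int k, int i, int j);
    extern void update_interior(int k, int i, int j);
    extern void barrier(void);

    void lu(int nb, int me)           /* nb: blocks per dimension; me: my id */
    {
        for (int k = 0; k < nb; k++) {
            if (owner(k, k) == me)
                factor_block(k);                      /* step 1 */
            barrier();                                /* step 2 */
            for (int j = k + 1; j < nb; j++) {        /* step 3 */
                if (owner(k, j) == me)
                    update_perimeter(k, k, j);        /* row block    */
                if (owner(j, k) == me)
                    update_perimeter(k, j, k);        /* column block */
            }
            barrier();                                /* step 4 */
            for (int i = k + 1; i < nb; i++)          /* step 5 */
                for (int j = k + 1; j < nb; j++)
                    if (owner(i, j) == me)
                        update_interior(k, i, j);
        }
    }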
There are three disadvantages in the above algorithm if we use RC or LRC:
1) At the barrier in step 2, the factored diagonal block may be updated on every processor that holds a copy of it, although only the processors that contain perimeter blocks actually need it.
2) At the barrier in step 4, the perimeter blocks should be transferred only to the other processors along the same row and column, yet they may be transferred to all processors.
3) In iteration n, at the barrier in step 2, both the diagonal block factored in iteration n and the interior blocks updated in iteration n-1 are updated on all processors, although only the diagonal block factored in iteration n needs to be updated.
Using the group consistency model, Figure 1 can be changed into Figure 2:
step 1: factor the diagonal block
step 2: barrier(group_diag_row)
step 3: barrier(group_diag_col)
step 4: update the perimeter blocks
step 5: barrier(group_in_myrow)
step 6: barrier(group_in_mycol)
step 7: if this is not the last block
            open_reuse_section(l)
step 8:     update the interior blocks
step 9:     close_reuse_section(l)
step 10: else
            update the interior blocks
            barrier(unknown)
Figure 2
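Under the 2D block-cyclic decomposition, the groups used in Figure 2 can be derived directly from the processor grid. The sketch below assumes a Pr x Pc grid; the grid shape and the members array are illustrative:

    /* Block (i,j) is owned by the processor at grid position
       (i mod Pr, j mod Pc). */
    enum { Pr = 2, Pc = 2 };

    int owner(int i, int j) { return (i % Pr) * Pc + (j % Pc); }

    /* group_diag_row at iteration k: the Pc processors whose grid row
       owns block-row k; they hold the perimeter blocks to the right of
       the diagonal block and so need the factored block. */
    void group_diag_row(int k, int members[Pc])
    {
        for (int c = 0; c < Pc; c++)
            members[c] = (k % Pr) * Pc + c;
    }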
References
[1] Adve, S., and Hill, M.: "Weak Ordering: A New Definition," Proc. 17th Ann. Int'l Symp. on Computer Architecture, ACM, pp. 2-14, 1990.
[2] Gharachorloo, K., Lenoski, D., Laudon, J., Gibbons, P., Gupta, A., and Hennessy, J.: "Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors," Proc. 17th Ann. Int'l Symp. on Computer Architecture, ACM, pp. 15-26, 1990.
[3] Keleher, P., Cox, A.L., and Zwaenepoel, W.: "Lazy Release Consistency," Proc. 19th Ann. Int'l Symp. on Computer Architecture, ACM, pp. 13-21, 1992.
[4] Bershad, B.N., Zekauskas, M.J., and Sawdon, W.A.: "The Midway Distributed Shared Memory System," Proc. IEEE COMPCON Conf., IEEE, pp. 528-537, 1993.
[5] Iftode, L., Singh, J.P., and Li, K.: "Scope Consistency: A Bridge between Release Consistency and Entry Consistency," Proc. 8th Ann. ACM Symp. on Parallel Algorithms and Architectures, June 1996.
[6] Woo, S.C., Singh, J.P., and Hennessy, J.L.: "The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors," Proc. 6th Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), October 1994.
[7] Lamport, L.: "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs," IEEE Trans. on Computers, vol. C-28, pp. 690-691, Sept. 1979.
[8] Hutto, P.W., and Ahamad, M.: "Slow Memory: Weakening Consistency to Enhance Concurrency in Distributed Shared Memories," Proc. 10th Int'l Conf. on Distributed Computing Systems, IEEE, pp. 302-311, 1990.
[9] Lipton, R.J., and Sandberg, J.S.: "PRAM: A Scalable Shared Memory," Tech. Rep. CS-TR-180-88, Princeton Univ., Sept. 1988.
[10] Dubois, M., Scheurich, C., and Briggs, F.A.: "Memory Access Buffering in Multiprocessors," Proc. 13th Ann. Int'l Symp. on Computer Architecture, ACM, pp. 434-442, 1986.
[11] Goodman, J.R.: "Cache Consistency and Sequential Consistency," Tech. Rep. 61, IEEE Scalable Coherent Interface Working Group, IEEE, 1989.
[12] Tanenbaum, A.S.: "Distributed Operating Systems," Prentice Hall, p. 291, 1995.
[13] Iftode, L., Singh, J.P., and Li, K.: "Scope Consistency: A Bridge between Release Consistency and Entry Consistency," Proc. 8th Ann. ACM Symp. on Parallel Algorithms and Architectures, June 1996.
[14] Zhou, Y., Iftode, L., and Li, K.: "Performance Evaluation of Two Home-Based Lazy Release Consistency Protocols for Shared Virtual Memory Systems," Proc. Operating Systems Design and Implementation Symposium (OSDI), October 1996.