Subject: Re: C508: Comparing Windows NT, Linux, and QNX as the Basis for Cluster Systems
From: Dror Feitelson
Date: Thu, 05 Jul 2001 10:52:12 +0300
To: Geoffrey Fox

Dear Professor Fox:

Enclosed is the revised version of our paper "Comparing Windows NT, Linux, and QNX as the Basis for Cluster Systems" (number C508), which we have submitted for publication in Concurrency --- Practice & Experience. We have implemented most of the referees' suggestions, and provide a detailed list in the attachment. In particular, we have tried to emphasize the specific points raised in your cover letter.

We look forward to hearing your final editorial decision. We can easily provide the LaTeX source of the paper, and PostScript files for the figures, if so desired.

Sincerely,
-- Dror Feitelson

C508: Comparing Windows NT, Linux, and QNX as the Basis for Cluster Systems

Notes regarding comments made by referees
=========================================

Referee 1
---------

1) The referee suggested we might drop QNX. As you suggested, we prefer to retain it.

2) The main point was the lack of clarity regarding the operational use of the cluster we had in mind. We have clarified this point in section 1.1. As we note there, our focus is on computational clusters that support parallel HPC applications, but many of the features we review are also relevant to other modes of operation. In this context the referee also mentioned the issue of file sharing. We added a subsection on this topic in section 2.2.

3) Another concern was that a limiting factor for cluster usage is the availability of application development tools. While this is a valid concern, we feel that it is outside the scope of this paper. Rather, we prefer to focus on technical features of the operating systems themselves. However, we did add a note to this effect to the final section, which lists additional work that could be done.

4) The referee was also concerned about the performance measurements regarding networking. Again, we feel that a comprehensive survey of this issue is beyond the scope of this paper: it would need to consider different hardware platforms (Ethernet, Myrinet, Giganet), different software interfaces (sockets, MPI), and different communication patterns. Instead, we prefer to focus on those communication facilities that are actually implemented in the operating system -- namely the TCP/IP stack, and for QNX also the Fleet protocol. As suggested, we also performed some additional measurements using UDP.

5) Next, the referee suggested we note the results of the Top500 list as a possible metric. As this does show that actual large installations have a strong preference for Linux-based systems, we included this data as suggested.

6) Finally, the referee suggested we point out that the different systems may have distinct advantages in different niches. We have done so as suggested.

Referee 2
---------

0) The referee requested that we make a better case for including QNX. This was done.

1&2) Our coverage of Windows is based on Windows NT, because it was the dominant system at the time the work was done. We added coverage of Windows 2000 wherever it introduced significant new features. Thus, to a large degree, the current discussion of NT applies to Windows 2000 as well.

3&16-19) We have bundled our benchmark codes in a tar file that is now available on the web (at URL http://www.cs.huji.ac.il/~feit/papers/benchmarks.tar.gz).

Minor corrections suggested in items 4, 5, 7, 9, 10, 11, 12, 13, 14, 15, 20, 21, 22, 23, 24, 25, 26, 27, and 28 have also been made.

Item 8, regarding the definition of DCOM, is not accurate. We let it stand as is.

Referee 3
---------

The main points that don't overlap with previous referees are:

1) The referee noted that the availability of application-level software such as MPI and ScaLAPACK can have a strong influence on the choice of system. We agree, and have added a section about software availability. It seems that the most important difference between the systems is actually in cluster management systems.

2) Another concern is the importance of availability. Again we agree, but in this case we don't have any hard data. We therefore leave the situation as it was before: the study of reliability and availability is mentioned in the conclusions as a topic that requires further work.

3) Job submittal (for parallel programs) is not part of what the base system does -- it is handled by the cluster software built above it, as described in section 1.1. The focus of this paper is on services provided by the base system which enable the implementation of such a cluster management facility. However, we now mention the availability of cluster control software for the different systems in the new section on software availability.

4) The point about applications is not clear. We are dealing with the design of the cluster, and did not run applications on it (except for the benchmarks used in the performance evaluation section).

API comparison: What is "(man 2