The Quadrics Network (QsNET)[Overview] [Collective Communication] [I/O Traffic] [Multiple Network Rails] [Resource Management] [Recent Publications] [Home] OverviewQsNET is the interconnect that will be used in the 30 TeraOps ASCI Q-machine. From the architectural point of view, the Q-machine can be described as a cluster of Shared Memory Multiprocessors (SMP), which is expected to the deliver a peak performance in excess of 30 TeraOps (with approximately 15000 processors). Multiple independent QsNET network rails will interconnect the SMPs through their I/O ports (for example, PCI or PCI-X).Experimental results indicate that the version 3 of QsNET (Elan3) performs remarkably well, e.g., user-level latency under 2 mus and bandwidth over 300 MB/s, efficient support for collective communication patterns, and excellent contention resolution under heavy traffic. A preliminary description and evaluation of QsNet is reported in the following short paper. We describe the main features of the two building blocks of the QsNET, a network interface that can perform zero-copy user-level communication and a wormhole routing switch. We also focus our attention on the routing and flow control algorithms, deadlock avoidance and on how the processing nodes are integrated in a global, virtual shared memory. Pdf BibTeX. The following paper, is a comprehensive performance evaluation of the Quadrics network, and extends the previous paper with the analysis of collective permutation patterns, scalability of uniform traffic and analysis of I/O traffic. Collective CommunicationAnother interesting feature of QsNET is the native support for collective communication. In the following paper we present an in-depth description of how the QsNET supports both hardware- and software-based collectives. Experimental results conducted on 64-node AlphaServer cluster with 256 processors indicate that the time to complete the hardware-based barrier synchronization on the whole network is as low as 6 mus, with very good scalability. Good latency and scalability are also achieved with the software-based synchronization, which takes about 15 mus. With the broadcast, similar performance is achieved by the hardware- and software-based implementations, which can deliver messages of up to 256 bytes in 13 mus and can get a sustained asymptotic bandwidth of 288 Mbytes/sec on all the nodes. The hardware-based barrier is almost insensitive to the network congestion, with 93% of the synchronizations taking less than 20 when the network is flooded with a background traffic of unicast messages.Abstract Gzipped PostScript BibTeX. I/O TrafficA common trend in the design of large-scale clusters is to use a high-performance data network to integrate the processing nodes in a single parallel computer. In these systems the performance of the interconnect can be a limiting factor for the input/output (I/O), which is traditionally bottlenecked by the disk bandwidth. In the following paper we present an experimental analysis on a 64-node AlphaServer cluster based on the Quadrics network of the behavior of the interconnect under I/O traffic, and the influence of the placement of the I/O servers on the overall performance. The effects of using dedicated I/O nodes or overlapping I/O and computation on the I/O nodes are also analyzed. In addition, we evaluate how background I/O traffic interferes with other parallel applications running concurrently. Our experimental results show that a correct placement of the I/O servers can provide up to 20% increase in the available I/O bandwidth.Abstract Pdf BibTeX. Multiple Network RailsUsing multiple independent networks (also known as rails) is an emerging technique to overcome bandwidth limitations and enhance fault-tolerance of current high-performance clusters. In the following paper we present and analyze various venues of exploiting multiple rails. Different rail access policies are presented and compared, including static and dynamic allocation schemes. An analytical lower bound on the number of networks required for static rail allocation is shown. We also present an extensive experimental comparison of the behavior of various allocation schemes in terms of bandwidth and latency. The proposed allocation scheme scales well with the number of rails and message sizes. This in turn suggests that the performance obtained from the network can be increased through the use of multiple rails if an appropriate rail allocation scheme is used.Abstract Gzipped PostScript BibTeX. Resource ManagementQsNET integrates multiple SMPs using a Resource Management System called RMS. RMS provides a single point of interface to the system for resource management. It includes facilities for gathering information on resources (monitoring, auditing, accounting, fault diagnosis, and statistical data collection) and for resource handling (CPU allocation, access control, parallel jobs support, and execution, and scheduling). RMS is implemented as a set of UNIX commands and daemons that communicate using socket daemons and access a status database for storing or retrieving all the system details. In the following paper, we explore the performance of the RMS gang scheduler. This scheduler can take advantage of this network's unique capabilities, including a network interface card-based processor and memory and efficient user-level communication libraries. We developed a micro-benchmark to test the scheduler's performance under various aspects of parallel job workloads: memory usage, bandwidth and latency-bound communication, number of processes, timeslice quantum, and multiprogramming levels. Our experiments show that the gang scheduler performs relatively well under most workload conditions, is largely insensitive to the number of concurrent jobs in the system and scales almost linearly with number of nodes.Abstract Gzipped PostScript BibTeX. |