Table of Contents:
Cover
HIGH-PERFORMANCE COMPUTING
Contents
Preface
Contributors

PART 1  Programming Model

1  ClusterGOP: A High-Level Programming Environment for Clusters
   1.1  Introduction
   1.2  GOP Model and ClusterGOP Architecture
   1.3  VisualGOP
   1.4  The ClusterGOP Library
   1.5  MPMD Programming Support
   1.6  Programming Using ClusterGOP
   1.7  Summary

2  The Challenge of Providing a High-Level Programming Model for High-Performance Computing
   2.1  Introduction
   2.2  HPC Architectures
   2.3  HPC Programming Models: The First Generation
   2.4  The Second Generation of HPC Programming Models
   2.5  OpenMP for DMPs
   2.6  Experiments with OpenMP on DMPs
   2.7  Conclusions

3  SAT: Toward Structured Parallelism Using Skeletons
   3.1  Introduction
   3.2  SAT: A Methodology Outline
   3.3  Skeletons and Collective Operations
   3.4  Case Study: Maximum Segment Sum (MSS)
   3.5  Performance Aspect in SAT
   3.6  Conclusions and Related Work

4  Bulk-Synchronous Parallelism: An Emerging Paradigm of High-Performance Computing
   4.1  The BSP Model
   4.2  BSP Programming
   4.3  Conclusions

5  Cilk Versus MPI: Comparing Two Parallel Programming Styles on Heterogeneous Systems
   5.1  Introduction
   5.2  Experiments
   5.3  Results
   5.4  Conclusion

6  Nested Parallelism and Pipelining in OpenMP
   6.1  Introduction
   6.2  OpenMP Extensions for Nested Parallelism
   6.3  OpenMP Extensions for Thread Synchronization
   6.4  Summary

7  OpenMP for Chip Multiprocessors
   7.1  Introduction
   7.2  3SoC Architecture Overview
   7.3  The OpenMP Compiler/Translator
   7.4  Extensions to OpenMP for DSEs
   7.5  Optimization for OpenMP
   7.6  Implementation
   7.7  Performance Evaluation
   7.8  Conclusions

PART 2  Architectural and System Support

8  Compiler and Run-Time Parallelization Techniques for Scientific Computations on Distributed-Memory Parallel Computers
   8.1  Introduction
   8.2  Background Material
   8.3  Compiling Regular Programs on DMPCs
   8.4  Compiler and Run-Time Support for Irregular Programs
   8.5  Library Support for Irregular Applications
   8.6  Related Works
   8.7  Concluding Remarks

9  Enabling Partial Cache-Line Prefetching Through Data Compression
   9.1  Introduction
   9.2  Motivation of Partial Cache-Line Prefetching
   9.3  Cache Design Details
   9.4  Experimental Results
   9.5  Related Work
   9.6  Conclusion

10 MPI Atomicity and Concurrent Overlapping I/O
   10.1  Introduction
   10.2  Concurrent Overlapping I/O
   10.3  Implementation Strategies
   10.4  Experimental Results
   10.5  Summary

11 Code Tiling: One Size Fits All
   11.1  Introduction
   11.2  Cache Model
   11.3  Code Tiling
   11.4  Data Tiling
   11.5  Finding Optimal Tile Sizes
   11.6  Experimental Results
   11.7  Related Work
   11.8  Conclusion

12 Data Conversion for Heterogeneous Migration/Checkpointing
   12.1  Introduction
   12.2  Migration and Checkpointing
   12.3  Data Conversion
   12.4  Coarse-Grain Tagged RMR in MigThread
   12.5  Microbenchmarks and Experiments
   12.6  Related Work
   12.7  Conclusions and Future Work

13 Receiving-Message Prediction and Its Speculative Execution
   13.1  Background
   13.2  Receiving-Message Prediction Method
   13.3  Implementation of the Method in the MPI Libraries
   13.4  Experimental Results
   13.5  Concluding Remarks

14 An Investigation of the Applicability of Distributed FPGAs to High-Performance Computing
   14.1  Introduction
   14.2  High-Performance Computing with Cluster Computing
   14.3  Reconfigurable Computing with FPGAs
   14.4  DRMC: A Distributed Reconfigurable Metacomputer
   14.5  Algorithms Suited to Implementation on FPGAs/DRMC
   14.6  Algorithms Not Suited to Implementation on FPGAs/DRMC
   14.7  Summary

PART 3  Scheduling and Resource Management

15 Bandwidth-Aware Resource Allocation for Heterogeneous Computing Systems to Maximize Throughput
   15.1  Introduction
   15.2  Related Work
   15.3  System Model and Problem Statement
   15.4  Resource Allocation to Maximize System Throughput
   15.5  Experimental Results
   15.6  Conclusion

16 Scheduling Algorithms with Bus Bandwidth Considerations for SMPs
   16.1  Introduction