Efficient intra-node Communication for Chip Multiprocessors
ForfatterHenriksen, Torje Starbo
The microprocessor industry has reached limitations of sequential processing power due to power-efficiency and heat problems. With the integrated-circuit technology moving forward, chip-multithreading has become the trend, increasing parallel processing power. The shift of focus has resulted in the vast majority of supercomputers having chip-multiprocessors. While the high performance computing community has long written parallel applications using libraries such as MPI, the performance-characteristics have changed from the traditional uni-core cluster, to the current generation of multi-core clusters; more communication is between processes on the same node, and processes run on cores sharing hardware resources, such as cache and memory bus. We explore the possibilities of optimizing a widely used MPI implementation, Open MPI, to minimize communication overhead for communication between processes running on a single node. We take three approaches for optimization: First we measure the message-passing latency between the different cores, and reduce latency for large messages by keeping the sender and receiver synchronized. Second, we increase scalability by using two new queue-designs, reducing the number of communication queues that need to be polled to receive messages. Third, we experiment with mapping a parallel application to different cores, using only a single node. The mapping is done dynamically during runtime, with no prior knowledge of the application's communication pattern. Our results show that for large messages sent between cores sharing cache, message-passing latency can be significantly reduced. Results from running the NAS Parallel Benchmarks using the new queue-designs show that Open MPI can increase its scalability when running more than 64 processes on a single node. Our dynamic mapper performs close to our manual mapping, but rarely increases performance. We see from the experimental results, that the three techniques give performance increase in different scenarios. Combining techniques like these with other techniques, can be a key to unlocking the parallel performance for a broader range of parallel applications.
ForlagUniversitetet i Tromsø
University of Tromsø
Følgende lisensfil er knyttet til denne innførselen: