Using a virtual event space to understand parallel application communication behavior
We have developed EventSpace, a configurable data collection, management, and observation system for monitoring low-level synchronization and communication events, with the purpose of understanding the behavior of parallel applications on clusters and multi-clusters. Applications are instrumented by adding data-collecting code, in the form of event collectors, to an application's communication paths. When triggered, these collectors create virtual events and store them in a virtual event space. Based on the metadata describing the communication paths, virtual events can be combined to provide different views of the application's communication behavior. We used the data collected by EventSpace to do a post-mortem analysis of a wind-tunnel application, a river simulator, global clock synchronization, and a hierarchical barrier benchmark. The views allowed us to detect anomalous communication behavior, detect load-balance problems, analyze hierarchical barriers, synchronize the Pentium timestamp counters on the cluster nodes, and analyze the accuracy of the synchronization.
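The instrumentation idea in the abstract can be illustrated with a minimal sketch. This is not EventSpace's actual interface: the names `VirtualEvent`, `EventSpace.put`, `EventSpace.read`, `event_collector`, and `latency_view`, as well as the collector identifiers, are hypothetical and chosen only for illustration. The sketch assumes a single process and an in-memory store, whereas the report targets clusters and multi-clusters.

```python
import time
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualEvent:
    collector_id: str  # which collector on the communication path fired
    operation: str     # hypothetical label, e.g. "send" or "recv"
    start: float       # timestamp taken before the wrapped operation
    end: float         # timestamp taken after the wrapped operation


class EventSpace:
    """Hypothetical in-memory store of virtual events, keyed by collector."""

    def __init__(self):
        self._events = defaultdict(list)

    def put(self, event):
        self._events[event.collector_id].append(event)

    def read(self, collector_id):
        return list(self._events[collector_id])


def event_collector(space, collector_id, operation, clock=time.perf_counter):
    """Wrap a communication operation so each call stores one virtual event."""
    def wrap(func):
        def instrumented(*args, **kwargs):
            start = clock()
            result = func(*args, **kwargs)
            space.put(VirtualEvent(collector_id, operation, start, clock()))
            return result
        return instrumented
    return wrap


def latency_view(space, path):
    """Combine events from the collectors on one path into mean durations."""
    view = {}
    for collector_id in path:
        events = space.read(collector_id)
        if events:  # collectors that never fired are left out of the view
            view[collector_id] = sum(e.end - e.start for e in events) / len(events)
    return view


if __name__ == "__main__":
    space = EventSpace()

    @event_collector(space, "node0.send", "send")
    def send(message):
        return len(message)  # stand-in for a real communication call

    send("hello")
    print(latency_view(space, ["node0.send", "node1.recv"]))
```

In this sketch the `path` argument plays the role of the metadata describing a communication path: a view is computed by reading the events of the collectors named on that path and combining them, here into a mean duration per collector.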
Publisher: Universitetet i Tromsø (University of Tromsø)
Series: Tekniske rapporter / Institutt for informatikk 44(2003)