Abstracts for the first workshop of the Joint Laboratory for Petascale Computing
Wednesday, June 10
Workshop on Programming models
Parallel Shared Memory Programming Languages
Marc Snir, UIUC
The talk will discuss problems with the current crop of parallel shared-memory languages and propose research directions for the development of parallel programming languages that are easier to use and deliver higher performance. We shall briefly touch on ongoing, relevant research at UPCRC.
Programming hierarchical multicore using hybrid approaches: a runtime's perspective
Raymond Namyst, INRIA and University of Bordeaux
In the field of HPC, the current hardware trend is to build clusters of complex hierarchical multiprocessor architectures. Approaching the theoretical performance of these architectures is a complex issue, and often requires mixing different programming environments and parallel libraries (e.g. MPI, OpenMP, TBB, MKL, etc.). Recently, heterogeneous technologies such as specialized coprocessors (e.g. Cell/BE SPUs) or data-parallel accelerators (e.g. GPGPUs) have also been introduced into high-end parallel machines, pushing the need for hybrid execution models even further.
In this talk, I will present our experiences in designing runtime systems for hierarchical multicore architectures, and I will discuss several recent extensions we have developed to support hybrid programming models and heterogeneous architectures.
Hybrid parallelism on real applications and simulations
Jean-François Mehaut, Laboratoire d'Informatique de Grenoble
In this talk, I will describe "some" of the key problems in the development of scientific simulations on modern computing platforms. I will start my talk with atomistic simulations to compute the full electronic structure of new materials. Atomistic simulations are intensively used by physicists (CEA) in the area of nanosciences and nanotechnologies. The second type of application comes from seismology and the simulation of earthquakes.
Developing nanomaterials is particularly difficult and complex, which explains why it is important to predict their properties before designing them and to model the impact of manufacturing them as realistically as possible. To achieve this, we develop simulation codes (BigDFT) based on ab initio methods, with a physical description of the phenomena at the atomic scale. This computer-aided design requires enormous, massively parallel computing capacities. The BigDFT code shows systematic convergence properties and an excellent efficiency on homogeneous SMP clusters. We are also adapting the BigDFT code to run on hybrid computing clusters (CPU and GPU). BigDFT is one of the target applications of the national ANR/ProHMPT project, which involves several INRIA teams.
Numerical modelling of seismic wave propagation in complex three-dimensional media is an important research topic in seismology. Several approaches are studied, and their suitability with respect to the specific constraints of NUMA architectures is evaluated. These modelling approaches rely on modern numerical schemes, such as spectral elements and high-order finite differences, applied to realistic 3D models. The French national project ANR/NUMASIS focuses on issues related to parallel algorithms (scheduling, memory affinity) in order to optimize earthquake simulations on clusters of NUMA machines.
Several INRIA teams (Magique3D, Mescal, Paris, Runtime, ScAlApplix,...) collaborate on national projects (Numasis, ProHMPT,...). The CEA (DAM, INAC) is strongly associated with these projects, and some French companies (Bull, CAPS Entreprise, Total...) are involved in them as well. Multidisciplinarity is a key point of these projects.
Kaapi: an adaptive runtime system for parallel computing
Thierry Gautier, INRIA
The wide availability of multicore/manycore-based architectures seems very attractive to engineers because, at first sight, such computers aggregate high performance. Nevertheless, obtaining high performance from applications remains a challenging problem. For instance, the delay to access memory is non-uniform, and the irregularity of computations requires the use of scheduling algorithms to automatically balance the workload among the processors.
This presentation focuses on the MOAIS team's approach to this problem through the development of Kaapi, a research runtime system that can be adapted in order to specialize the scheduling algorithm or to add extra features such as fault-tolerance support. We illustrate our methodology with results obtained on several applications.
ProActive Parallel Suite: Strong Programming Model to bridge Distributed and Multi-Core Computing
Denis Caromel, INRIA-Univ. Nice Sophia Antipolis-CNRS
We will share our experience in simplifying the programming of applications that are distributed on Local Area Networks (LANs), on clusters of workstations, on grids, and, of course, on clouds. We will promote a Network-on-Chip style of approach to cope seamlessly with both distributed and shared-memory multicore machines. A theoretical foundation ensures consistent behavior, whatever the environment.
The point will be illustrated with ProActive, an open-source library for parallel, distributed, and concurrent computing, allowing us to showcase interactive graphical tools. Benchmarks will also be reported.
Programming Methodologies beyond petascale, based on adaptive runtime systems
Sanjay Kale, UIUC
Multiple PetaFLOPS class machines have appeared during the past year, and many multi-PetaFLOPS machines are on the anvil. It will be a substantial challenge to make existing parallel CSE applications run efficiently on them, and even more challenging to design new applications that can effectively leverage the large computational power of these machines. Multicore chips and SMP nodes are becoming popular and pose challenges of their own. Further, a new set of challenges in productivity arise, especially if we wish to have a broader set of applications and people to use these machines. I will review a set of techniques, incorporated in the Charm++ system, that have proved useful in my group’s work, on multiple parallel applications that have scaled to tens of thousands of processors, on machines like Blue Gene/L, Blue Gene/P, Cray XT3 and XT4. These techniques were developed in the context of our experience with several applications ranging from quantum chemistry, biomolecular simulations, simulation of solid propellant rockets, and computational astronomy. I will identify new challenges and potential solutions for the performance issues. Issues presented by multicore chips and SMP nodes will also be addressed. Finally, I will review some new and old ideas for increasing productivity in parallel programming substantially.
Thursday, June 11
Workshop on numerical libraries
Optimizing Sparse Data Structures for Matrix-Vector Multiply
William Gropp (UIUC) and Dahai Guo (NCSA)
Sparse matrix-vector multiply is an important operation, both in iterative methods for solving systems of linear equations and in explicit methods, where the application of an explicit linear operator can be written as a matrix-vector product. Much effort has been focused on developing effective data structures for this problem including the recent application of autotuning approaches to optimize aspects of the data structure. A major impediment to improving performance of this operation is that it is memory-bandwidth intensive. In this talk, we show how to take advantage of features of the IBM POWER architecture to achieve much higher sustained performance by creating new data structures that are a better match to the capabilities of the POWER architecture, and we show results with a variety of matrices. The new data structures also show benefits on Intel processors. The underlying approach can be applied to other operations that are limited by memory bandwidth.
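The baseline that such optimized data structures are measured against is the compressed sparse row (CSR) format. A minimal Python sketch (illustrative only, not the POWER-specific structures from the talk) shows why the kernel is memory-bandwidth bound:

```python
# Compressed Sparse Row (CSR) sparse matrix-vector multiply: y = A x.
# A minimal illustrative sketch; the talk's POWER-optimized structures differ.

def csr_matvec(row_ptr, col_idx, values, x):
    """Multiply a CSR matrix by a dense vector x."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        # Each nonzero costs one load of the value and one indexed load of x:
        # the loop is dominated by memory traffic, not arithmetic.
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y

# The 3x3 matrix [[2, 0, 1], [0, 3, 0], [4, 0, 5]] in CSR form:
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values  = [2.0, 1.0, 3.0, 4.0, 5.0]
print(csr_matvec(row_ptr, col_idx, values, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```

Optimized formats reorganize `values` and `col_idx` (e.g. blocking or padding rows) so that the indexed loads of `x` stream more regularly through the memory system.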
Communication optimal algorithms in linear algebra
Laura Grigori, INRIA Saclay
Joint work with J. Demmel, M. Hoemmen (UC Berkeley), J. Langou (CU Denver), H. Xiang (University Paris 6)
In this talk we present several recent results obtained in the design of parallel algorithms for dense and sparse linear algebra. The overall goal of this research is to reformulate and redesign linear algebra algorithms so that they are optimal in the amount of communication they perform, while retaining numerical stability. The work here involves both theoretical investigation and practical coding on diverse computational platforms. In the theoretical investigation we identified lower bounds on communication for different operations in linear algebra, where communication refers to data movement between processors in the parallel case, and to data movement between different levels of the memory hierarchy in the sequential case. We present new algorithms that attain the communication lower bounds (up to polylogarithmic factors), and thus greatly reduce communication relative to conventional algorithms as implemented in the widely used libraries LAPACK and ScaLAPACK. The results obtained to date concern the LU and QR factorizations of dense matrices.
Hybrid iterative-direct domain decomposition based solvers for the time-harmonic Maxwell equations
Stéphane Lanteri, INRIA Sophia Antipolis-Méditerranée, NACHOS project-team
Electromagnetic (EM) waves are ubiquitous in present day technology. Indeed, electromagnetism has found and continues to find applications in a wide array of areas, encompassing both industrial and military purposes. Equally notable are societal applications, in particular those concerned with the question of the existence of adverse effects resulting from the interaction of EM waves with humans, or those dealing with medical applications of EM waves. Although the principles of electromagnetics are well understood, solving Maxwell's equations for the simulation of realistic wave propagation problems is still a challenge. For practical applications, the solution of such problems is complicated by the detailed geometrical features of scattering objects, the physical properties of the propagation medium and the characteristics of the radiating sources. In addition, because the wavelength is often short, the algebraic systems resulting from the discretization of Maxwell's equations can be extremely large. Domain decomposition principles are thus ideally suited for the design of efficient and scalable solvers for such systems. In this talk we will discuss our recent efforts toward the development of hybrid iterative-direct domain decomposition based solvers for discontinuous Galerkin discretizations of the system of time-harmonic Maxwell equations, in view of the simulation of electromagnetic wave propagation problems involving heterogeneous media and complex domains.
The MUMPS library
Jean-Yves L'Excellent, INRIA Rhône-Alpes
We present the MUMPS library, a numerical library to solve sparse systems of linear equations of the form A x = b by direct methods. MUMPS is based on a multifrontal approach and uses message passing for parallelism. One of its originalities is its large spectrum of features, resulting from national and international collaborations as well as feedback from its wide community of users. Numerical stability relies on pivoting strategies that require dynamic computational task graphs; therefore an asynchronous approach with dynamic distributed schedulers has been designed. As is the case for other state-of-the-art direct solvers, out-of-core issues, memory scalability, parallelization of the symbolic and preprocessing phases, and adaptation to petascale computers are critical. MUMPS is available free of charge. Further information on the library is available at http://mumps.enseeiht.fr/ or http://graal.ens-lyon.fr/MUMPS/.
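Why pivoting forces dynamic task graphs can be seen even in a dense LU factorization with partial pivoting, a drastically simplified analogue of MUMPS's multifrontal sparse factorization: the pivot row is only known at runtime, once the numerical values are seen, so the order of work cannot be fixed during the symbolic analysis phase.

```python
# Dense LU factorization with partial pivoting -- a simplified analogue of
# the numerical pivoting that forces a solver like MUMPS to use dynamic
# task graphs: the pivot row is chosen at runtime from the numerical values.

def lu_partial_pivot(A):
    """Factor A in place so that P A = L U; returns the row permutation."""
    n = len(A)
    perm = list(range(n))
    for k in range(n):
        # Dynamic decision: pick the row with the largest pivot magnitude.
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        if p != k:
            A[k], A[p] = A[p], A[k]
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]                # multiplier (entry of L)
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]  # Schur complement update

    return perm

def solve(A, perm, b):
    """Solve L U x = P b using the packed factors from lu_partial_pivot."""
    n = len(A)
    y = [b[perm[i]] for i in range(n)]
    for i in range(n):                        # forward substitution
        for j in range(i):
            y[i] -= A[i][j] * y[j]
    for i in reversed(range(n)):              # backward substitution
        for j in range(i + 1, n):
            y[i] -= A[i][j] * y[j]
        y[i] /= A[i][i]
    return y

A = [[0.0, 2.0], [3.0, 1.0]]  # zero in position (0,0): pivoting is mandatory
x = solve(A, lu_partial_pivot(A), [4.0, 5.0])
print(x)  # [1.0, 2.0]
```

In the sparse multifrontal setting, such runtime pivot choices additionally change which fill-in entries appear, which is why MUMPS pairs them with asynchronous, dynamically scheduled execution.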
Toward robust hybrid parallel sparse solvers for large scale applications
Jean Roman, ENSEIRB, LaBRI and INRIA Bordeaux - Sud-Ouest HiePACS Project
In this work we investigate the parallel scalability of variants of additive Schwarz preconditioners for three dimensional non-overlapping domain decomposition methods. To alleviate the computational cost, both in terms of memory and floating-point complexity, we investigate variants based on a sparse approximation. The robustness of the preconditioners is illustrated on a set of linear systems arising from the finite element discretization of academic diffusion and convection-diffusion problems, and from real-life structural mechanical problems.
Parallel experiments exploiting one and two levels of parallelism on up to a thousand processors on some problems will be presented. The efficiency, from both numerical and parallel-performance viewpoints, is studied on problems ranging from a few hundred thousand unknowns up to a few tens of millions.
Solvers and partitioners in the Bacchus project
François Pellegrini, INRIA Futurs
The newly created INRIA team "Bacchus" aims at developing and validating numerical methods adapted to physical problems that are modeled by a set of partial differential equations whose main behavior is of hyperbolic type: fluid dynamics, which is the main specialty of the team, but also aeroacoustics, geophysics, and MHD. The efficient implementation of these methods on modern parallel architectures requires the development of various software libraries, comprising sparse linear system solvers and partitioners. Our talk will focus on these latter two aspects. We will outline the main capabilities of the NUMA-aware direct solver Pastix, and of the hybrid solver HIPS, both of which can be accessed through the Murge common interface. We will also present the main features of the PT-Scotch parallel library for sparse matrix ordering and static mapping which, unlike standard graph partitioners, can take advantage of heterogeneous architectures.
Workshop on fault-tolerance
Challenges in Resilience for Peta-ExaScale systems and research opportunities
Franck Cappello, INRIA
The emergence of Petascale systems and future Exascale systems reinvigorates the community's interest in how to manage faults in such systems and ensure that large applications complete successfully. Over the last decade, most of the community's attention was devoted to rollback-recovery mechanisms. Unfortunately, despite their high degree of optimization, existing approaches do not fit well with the challenging evolution of Peta-ExaScale systems. There is room, and even a need, for new approaches. Opportunities may come from different directions, such as adding hardware dedicated to fault tolerance, developing fault-oblivious algorithms, or introducing redundancy at the algorithmic level. We will sketch some of these opportunities and their currently known limitations.
Rollback Recovery in Message Passing Systems: MPICH-V and Open MPI
Thomas Hérault, INRIA
In this talk, I will present the study conducted in the Grand-Large project-team of INRIA at Orsay and the Parallelism team of the University Paris-Sud on transparent fault-tolerant protocols for message-passing systems (in particular MPICH). The talk will present the protocols that were studied over the last years, the results we obtained, and the ongoing research conducted in collaboration with the University of Tennessee on Open MPI.
ProActive SPMD and Fault Tolerance: Protocols and Benchmarks
Brian Amedro, INRIA-Univ. Nice Sophia Antipolis-CNRS
We will present a rollback-recovery fault tolerance protocol for the asynchronous communicating active objects model ASP (Asynchronous Sequential Processes), and its open source Java implementation ProActive. In particular, we will describe the protocol extension for grid computing and its connection with the SPMD programming model. Finally, we will consider the implications of core failures in upcoming manycore computing.
Asynchronous iterative methods and fault tolerance
Mourad Hakem, University of Franche-Comté
This talk deals with iterative algorithms and reliability. First, we present some features of parallel synchronous and asynchronous iterative algorithms and their application to numerical problems, with an emphasis on their convergence conditions. We also explain why asynchronous algorithms are more reliable and tolerant of message losses. Next, we provide an alternative solution to achieve reliability in grid computing environments without complicated mechanisms for failure detection and recovery. It is based on an active replication scheme, capable of tolerating RF (Reliability Factor) arbitrary fail-silent/fail-stop node failures.
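The tolerance of asynchronous iterations to message loss can be sketched with a Jacobi iteration on a diagonally dominant system: because the iteration is a contraction, components may keep stale values when an update is "lost" and the fixed point is still reached. This toy example (where every third component update is dropped) is illustrative only, not the talk's actual scheme.

```python
def jacobi_with_losses(A, b, iters=200):
    """Jacobi iteration where some component updates are 'lost' (the
    component keeps its stale value), mimicking the stale reads of an
    asynchronous iteration. Convergence still holds for a diagonally
    dominant A because the update is a contraction."""
    n = len(A)
    x = [0.0] * n
    step = 0
    for _ in range(iters):
        new_x = x[:]
        for i in range(n):
            step += 1
            if step % 3 == 0:
                continue  # simulate a lost update: keep the stale value
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            new_x[i] = (b[i] - s) / A[i][i]
        x = new_x
    return x

# Diagonally dominant system with exact solution x = [1, 2].
A = [[4.0, 1.0], [2.0, 5.0]]
b = [6.0, 12.0]
x = jacobi_with_losses(A, b)
print(x)  # converges close to [1.0, 2.0] despite the dropped updates
```

A synchronous algorithm, by contrast, would deadlock or fail as soon as an expected message never arrives, which is the reliability argument made in the talk.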
Providing consistency guarantees in large-scale distributed systems
Marc Shapiro, INRIA Paris-Rocquencourt et LIP6
Being able to share and update data consistently is a requirement of advanced Internet applications (e.g., Wikipedia or multi-user games) and of future cloud computing, edge computing or massive user collaborations. Application programmers require well-defined, rigorous consistency guarantees, in order to cope with complexity. However, current technologies do not support consistent updates well. Databases ensure strong consistency, but they are not scalable, as they serialize all updates and execute them at all sites. Cloud storage platforms sacrifice consistency in order to scale, being designed for read-only data. We propose several complementary approaches to breaking the scalability barrier while providing rigorous correctness guarantees. In order to scale strong consistency, replication will be partial, and consistency will be tailored to application-specified semantics. To scale even further, we study speculative execution and weak consistency, but ensure that, despite anomalies observable by applications, application-specified invariants are eventually satisfied.
Scalable Fault Tolerance Schemes using Adaptive Runtime Support
Eric Bohm, UIUC
HPC systems for Computational Science and Engineering have almost reached the threshold where some form of fault tolerance becomes mandatory. Although system-level checkpoint-restart keeps things simple for the application developer, it leads to high overhead. Meanwhile, application-level schemes are effort-intensive for the programmer. Schemes based on smart runtime systems appear to be at the right level for addressing fault tolerance. Our work, based on object-level virtualization and implemented by the Charm++ runtime system, supports such schemes.
Charm++ offers a series of techniques that can help tolerate faults in large parallel systems. These techniques include distributed checkpoints, message-logging with parallel recovery, and proactive object migration. When combined with the measurement-based load balancing facilities available in Charm++, one can both tolerate faults and continue execution on remaining resources with optimal efficiency. These techniques can also be applied to MPI applications running under AMPI, an MPI implementation based on Charm++.
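As a point of comparison for these runtime-level schemes, the simplest application-level approach is periodic checkpointing of program state; a minimal Python sketch (a generic illustration, not Charm++'s mechanism) shows the idea:

```python
import pickle

# Minimal application-level checkpoint/restart sketch: the iterative loop
# periodically serializes its state so that, after a crash, execution can
# resume from the last checkpoint instead of from scratch. This is the
# effort-intensive baseline that runtime-level schemes improve upon.

def run(total_steps, checkpoint_every=10, restart_from=None):
    state = pickle.loads(restart_from) if restart_from else {"step": 0, "acc": 0}
    checkpoint = restart_from
    while state["step"] < total_steps:
        state["acc"] += state["step"]         # the "computation"
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            checkpoint = pickle.dumps(state)  # would go to stable storage
    return state, checkpoint

# Run to completion, then show that restarting from a mid-run checkpoint
# reproduces the same final state.
final, _ = run(25)
_, ckpt = run(20)                   # pretend the run crashed at step 20
resumed, _ = run(25, restart_from=ckpt)
assert resumed == final
print(final["acc"])  # sum of 0..24 = 300
```

Runtime-based schemes like those in Charm++ automate this serialization at the level of migratable objects, which is what also enables recovery onto a different number of processors.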
Friday, June 12
Open topics: accelerators, compilers, communication libraries, etc.
New Abstractions for Data Parallel Programming
María Garzarán, UIUC
Developing applications is becoming increasingly difficult due to recent growth in machine complexity along many dimensions, especially that of parallelism. We are studying data types that can be used to represent data-parallel operations. Developing parallel programs with these data types has numerous advantages, and such a strategy should facilitate parallel programming and enable portability across machine classes and machine generations without significant performance degradation.
In this talk, I will discuss our vision of data parallel programming with powerful abstractions. I will first discuss earlier work on data parallel programming and list some of its limitations. Then, I will introduce several dimensions along which it is possible to develop more powerful data parallel programming abstractions. Finally, I will outline some of the work we are doing on library generation and autotuning.
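To give a flavor of the idea, a data-parallel operation can be expressed as a pure elementwise map whose parallel execution is left entirely to the runtime; this generic Python sketch (not the talk's actual data types) illustrates the separation between what is computed and how it is parallelized:

```python
from concurrent.futures import ThreadPoolExecutor

# Generic sketch of a data-parallel abstraction: the user states an
# elementwise operation; how the work is split across workers is the
# runtime's concern, so the same program ports across machine classes
# and generations without source changes.

def parallel_map(fn, data, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fn, data))

squares = parallel_map(lambda x: x * x, range(10))
print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Because `parallel_map` carries no machine-specific detail, the runtime (or an autotuner, as in the library-generation work mentioned above) is free to choose the number of workers, the blocking, and the schedule per target machine.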
Hybrid Parallel Programming
François Bodin, INRIA
By 2010 all PC processors will be multicore. Multicore processors are bringing such a huge computing power that many applications, from scientific to consumer, will be pushed forward at high speed. However multicore, especially when heterogeneous, comes at a price.
Multicore hardware is pervasive and strongly affects software applications. To achieve performance, application development has to change dramatically to harness the huge parallelism multicore offers.
Current exploitation of multicore requires mixing programming styles (e.g. OpenMP and MPI) and languages (e.g. Fortran and CUDA). As a consequence, fine-tuning applications can quickly turn into a nightmare, especially if code portability is required.
In this presentation we survey techniques and issues for dealing with hybrid multicore architectures.
ArchExplorer.org: Leveraging Modular Simulation to Automate Design-Space Exploration
Olivier Temam, INRIA
While processor architecture design is currently more an art than a systematic process, growing complexity and more stringent time-to-market constraints are strong incentives for streamlining architecture design into a more systematic process. Methods have emerged for quickly and automatically scanning the large ranges of hardware block parameters. But, at the moment, systematic exploration rarely goes beyond setting hardware block parameters.
In this study, we show that it is possible to broaden the scope of architecture design space exploration to automatically composing different architecture blocks together. For that purpose, we leverage recent trends in modular simulation design, coupled with appropriate standardized architecture module introspection capabilities and a module repository. We show that, with these features, it is possible to design an architecture browser that systematically tries out a vast range of possible designs. Moreover, it is possible to know the best tradeoff architecture under various constraints at any time by setting up a continuous rather than a one-time exploration process, which is restarted and updated every time a new variant of any given architecture component becomes available. Finally, because architecture design comparisons can be significantly affected by compiler performance, we show that this systematic process can be coupled with automatic compiler tuning for each new design point, and thus achieve accurate design-point comparisons.
High Performance Computing with Accelerators
Volodymyr Kindratenko, UIUC
The scientific computing community has been investigating the use of application accelerators, such as FPGAs, Cell/B.E., and more recently GPUs, to improve the performance of computationally intensive codes beyond what is possible on today's mainstream multi-processor systems. As the technology matures, we are starting to see application accelerators finding their way into production systems, compute clusters in particular. At NCSA, we have two such clusters outfitted with GPUs and FPGAs, and we have been working with a number of application teams to rewrite their codes to take advantage of the accelerators. This presentation will cover some of the issues in deploying and running accelerator clusters and our efforts on implementing applications for GPU clusters in particular.
Optimizing communication on multicore clusters
Alexandre Denis, INRIA
Processors are becoming massively multicore, and new programming models are emerging that mix message passing and multi-threading. Modern communication subsystems now have to deal with multithreading: the impact of thread safety, contention on network interfaces, and the consequences of data locality on performance have to be studied carefully.
In this talk, we will study the impact of threads on communication performance in several aspects. Having multiple threads polling at the same time leads to contention and thus a large performance drop. We will present major design issues to avoid contention when dealing with multiple threads accessing the communication library at the same time. Moreover, there is a cost implied by the locking used to ensure thread-safety in communication. Designing an efficient modern communication library requires precautions in order to limit the impact of thread-safety mechanisms on performance.
On the other hand, multicore chips are an opportunity for new communication optimizations. We will show that multiple streams from multiple threads are an opportunity to apply new packet-scheduling optimizations, mostly based on coalescing packets from multiple threads. Moreover, idle cores may be used to make non-blocking communication actually progress in the background.
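The packet-coalescing idea can be sketched in a few lines: sends from multiple threads are funneled through a shared queue, and whatever has accumulated is flushed as a single aggregate packet instead of many small ones. This is a toy model of the principle, not NewMadeleine's actual scheduling strategy.

```python
import queue
import threading

# Toy model of packet coalescing: several threads post small messages to a
# shared queue; a flusher drains whatever has accumulated and emits it as
# one aggregate packet, amortizing per-packet overhead on the network link.
# (Illustrative only -- NewMadeleine's real scheduling is more elaborate.)

outbox = queue.Queue()
packets = []  # each entry is one aggregate packet (a list of messages)

def worker(tid, n_msgs):
    """A compute thread issuing many small sends."""
    for i in range(n_msgs):
        outbox.put((tid, i))

def flush():
    """Drain everything currently queued into one coalesced packet."""
    batch = []
    while True:
        try:
            batch.append(outbox.get_nowait())
        except queue.Empty:
            break
    if batch:
        packets.append(batch)

threads = [threading.Thread(target=worker, args=(t, 100)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
flush()  # one aggregate packet instead of 400 small sends

print(len(packets), sum(len(p) for p in packets))  # 1 packet, 400 messages
```

In a real library the flusher runs concurrently with the workers (e.g. on an otherwise idle core), trading a small batching delay for far fewer network transactions.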
We will show how these methods have been implemented in the NewMadeleine communication library, the PIOMan I/O event manager, and have been integrated into MPICH2-Nemesis.