High-performance parallel graph analytics

Keshav Pingali and Andrew Lenhartf, University of Texas at Austin, USA
Manoj Kumar, IBM T. J. Watson Research Center
(tutorial sponsored by Austin|Portugal )

Full-Day tutorial (25th of August)

Future demand for computational capability and programming tools will be driven increasingly by analytics applications that process vast amounts of customer, citizen, and employee interactions other than their transactions. These analytics applications are scaling rapidly in terms of the size and variety of data analyzed, the complexity of models explored and tested, and the number of analytics professionals or data scientists supported concurrently. These applications have several components, each of which exhibits a different workload behavior. Ingestion of data from varying human (e.g., speech) or sensors (e.g., images, location) input, integration of information across the input sources, discovery of models that explain the information or predict future behavior, and extraction (through query) of information matching specified criterion, are all key components of an analytics workload exhibiting distinct workload behavior.

The dominant model for internal representation of data in analytics applications, once the information from various input sources is linked together, is a graph structure. This is evidenced in the data managed by Facebook, Wikipedia, or public collections such as web. Efficient execution of analytics applications requires design of a data tier for the large data sets that is highly efficient on operations common in analytics applications. The NoSQL databases, particularly graph databases, will play an important role in this space.

The performance required for these analytics applications cannot be met with traditional systems because single thread performance gains from Dennard scaling have reached a plateau. Additional performance gains will come from multi-core/many-core architectures and on-chip and of-chip accelerators (GPUs and FPGAs). However, parallel programming for these emerging architectures remains as difficult as it was 30-40 years ago despite the many promising approaches for exploiting parallelism, such as declarative languages, logic programming, and automatic parallelization proposed to date. These approaches have succeeded only in a few niche application areas because of the computation-centric abstractions used by them to think about parallelism fail to capture the fine grained data-dependent parallelism typical in graph applications.

New run-times are needed to exploit the fine grained data-dependent parallelism, or amorphous parallelism, in large graph analytics on future systems based on many-core/multi-core cpus, GPUs and FPGAs. New development environments are needed as well for driving programmer productivity in developing the parallel programs for graph analytics, and new parallel algorithms are needed to process the interaction data and sensor data, which is orders of magnitude larger than the transactional data of the previous era.

An emerging data-centric foundation for parallel programming called the operator formulation, in which algorithms are described in terms of actions on data structures, simplifies the programming of multi-core/many-core and GPU based systems. This data-centric view of parallel algorithms shows that a generalized form of data-parallelism called amorphous data-parallelism is ubiquitous even in complex, irregular graph applications such as mesh generation and partitioning algorithms, graph analytics, and machine learning applications. Binding time considerations provide a unification of parallelization techniques ranging from static parallelization to speculative parallelization.

In this tutorial we will cover the following topics, which collectively provide the essential background for developing applications, algorithms, run-times and systems for the emerging analytics applications.

  1. Examples of a few real world analytics applications driving the future computational requirements, such as consumer modeling and IT log analysis. We will identify their basic building blocks and the various data representations used by these building blocks.
  2. An overview of NoSQL databases, particularly graph databases that store the voluminous data processed by the above applications.
  3. . New programming methodologies and run-times to facilitate the development of the new analytics applications, and to leverage the emerging systems, with a deep dive on Galois, a programming model and run-time for exploiting amorphous data-parallelism on multi-cores and GPUs.
  4. A few challenging and commonly used algorithms in large scale data analytics, such as various centrality algorithms, and deep belief network-based learning algorithms, and the approaches to their parallelization.
  5. The emerging systems architectures and high performance implementation of key graph algorithms on them.

Keshav Pingali is a Professor in the Department of Computer Science at the University of Texas at Austin, and he holds the W.A."Tex" Moncrief Chair of Computing in the Institute for Computational Engineering and Sciences (ICES) at UT Austin. He was on the faculty of the Department of Computer Science at Cornell University from 1986 to 2006, where he held the India Chair of Computer Science. Keshav’s research has focused on programming models, compilers and runtime systems for parallel programming. Keshav is a Fellow of the ACM, IEEE and the AAAS. He received the Distinguished Alumnus Award from IIT Kanpur, India in 2013. He was the co-Editor-in-chief of the ACM Transactions on Programming Languages and Systems, and currently serves on the editorial boards of the International Journal of Parallel Programming and Distributed Computing. He also served on the NSF CISE Advisory Committee (2009-2012).

Manoj Kumar is Program Director for Analytics Applications at IBM’s T.J Watson Research Center. Prior to that, at IBM’s Information Management division he was responsible for analytics solutions and establishing IBM’s presence in financial risk management, and before that he worked on IBM’s early prototypes in parallel computing. Manoj received M.S. and Ph.D. degrees from Rice University, Houston, Texas, USA, in 1981 and 1984 respectively, all in electrical engineering.


Heterogeneous Memory Models

Benedict R. Gaster, Qualcomm, Inc

Half-Day tutorial (26th of August)

The ability for developers to reason about values returned from memory loads plays a fundamental role in defining correct programs. In the presence of concurrency, as offered by HSA¹s execution model, the developer must correctly deal with memory values that may have been updated by multiple concurrent actors. Shared memory computers and programming languages divide this complexity into models : in particular the memory model specifies safety, while the execution model specifies liveness. Previous sections of this tutorial have focused in HSA¹s execution model and in this section we address HSA¹s memory model, as seen by the application developer. Like other state of the art memory models, OpenCL¹s model is a shared memory consistency model. However, unlike C++ or Java, OpenCL also directly addresses issues with locality in a heterogeneous setting, adopting HRF¹s notion of scopes, allowing the developer to provide fine-grain control over store visibility. In this tutorial we introduce OpenCL¹s memory model, providing an introduction to its formal underpinnings and show how other modern memory models, such as C++ and HSA, map naturally to OpenCL¹s model. We detail some recent work on scoped based consistency and show how this can be used to formalize OpenCL¹s (sometimes) weakly defined model.

Benedict R. Gaster is an architect working at Qualcomm on next-generation heterogeneous processors. Benedict has contributed extensively to the OpenCL¹s design and recently has worked on extending data-race-free consistency models to be heterogeneous aware. Benedict has a Ph.D in computer science for his work on type systems for extensible records and variants.