Annual Meeting 2020

The UCF Consortium held its 2020 annual meeting and workshop virtually in December. The meeting covered the consortium’s growing projects, such as UCX, UCC, and OpenSNAPI, including their latest developments, usage, and future directions.

Date Time Topic Speaker/Moderator
11/30 08:00-09:00
UCF State of the Union (Video)
Download Slides

Unified Communication Framework (UCF) – Collaboration between industry, laboratories, and academia to create production grade communication frameworks and open standards for data-centric and high-performance applications. In this talk we will present recent advances in the development of UCF projects, including Open UCX and Apache Spark UCX, as well as incubation projects in the area of SmartNIC programming, benchmarking, and other areas of accelerated compute.

Gilad Shainer, Nvidia

Gilad Shainer serves as senior vice-president of marketing for Mellanox networking at NVIDIA, focusing on high-performance computing, artificial intelligence and the InfiniBand technology. Mr. Shainer joined Mellanox in 2001 as a design engineer and has served in senior marketing management roles since 2005. Mr. Shainer serves as the chairman of the HPC-AI Advisory Council organization, the president of the UCF and CCIX consortiums, a member of IBTA and a contributor to the PCISIG PCI-X and PCIe specifications. Mr. Shainer holds multiple patents in the field of high-speed networking. He is a recipient of the 2015 R&D100 award for his contribution to the CORE-Direct In-Network Computing technology and the 2019 R&D100 award for his contribution to the Unified Communication X (UCX) technology. Gilad Shainer holds an MSc degree and a BSc degree in Electrical Engineering from the Technion Institute of Technology in Israel.

Pavel Shamis (Pasha), Arm

Pavel Shamis is a Principal Research Engineer at Arm. His work is focused on the co-design of software and hardware building blocks for high-performance interconnect technologies, development of communication middleware, and novel programming models. Prior to joining Arm, he spent five years at Oak Ridge National Laboratory (ORNL) as a research scientist in the Computer Science and Math Division (CSMD). In this role, Pavel was responsible for research and development of multiple projects in high-performance communication domains, including Collective Communication Offload (CORE-Direct & Cheetah), OpenSHMEM, and OpenUCX. Before joining ORNL, Pavel spent ten years at Mellanox Technologies, where he led the Mellanox HPC team and was one of the key drivers in the enablement of the Mellanox HPC software stack, including the OFA software stack, OpenMPI, MVAPICH, OpenSHMEM, and others. Pavel is a board member of the UCF consortium and co-maintainer of Open UCX. He holds multiple patents in the area of in-network accelerators. Pavel is a recipient of the 2015 R&D100 award for his contribution to the development of the CORE-Direct in-network computing technology and the 2019 R&D100 award for the development of the Open Unified Communication X (Open UCX) software framework for HPC, data analytics, and AI.

09:00-10:00
GPU memory support (Video)
Download Slides

Seamless support for communication with GPU buffers introduces yet another dimension for designing optimal data transfer protocols. In this talk we will have an open discussion about outstanding issues and challenges, including out-of-box performance, PCI topology support, rendezvous protocols, memory type cache and memory hooks, and more.

Yossi Itigin, Nvidia

Yossi Itigin is a UCX team lead at NVIDIA, focusing on high-performance communication middleware, and a maintainer of the OpenUCX project. Prior to joining NVIDIA, Mr. Itigin spent nine years at Mellanox Technologies in different technical roles, all related to developing and optimizing RDMA software.

10:00-10:30
MPICH/UCX Update (Video)
Download Slides

An update on UCX usage within the MPICH library. Topics covered include point-to-point communication, RMA, multi-threading, and more.

Ken Raffenetti, Argonne National Laboratory

Ken is a Principal Software Development Specialist at Argonne National Laboratory.

10:30-11:15
RDMA-CORE: DMA-BUF based GPU RDMA Support (Video)
Download Slides

The capability of using GPU memory in RDMA operations (a.k.a. GPU Direct RDMA) is important for scale-out configurations of GPU computation. In this talk, a dma-buf based approach is presented to enable such capability in the upstream Linux kernel and the user space rdma-core libraries.

Jianxin Xiong, Intel

Jianxin Xiong works for Intel. His experience covers various layers of interconnection software stack, such as RDMA drivers in Linux kernel, RDMA device virtualization, Open Fabric Interface, DAPL, Tag Matching Interface, and MPI. His current focus is GPU/accelerator scale-out with RDMA devices.

11:15-12:15
UCX for Apache Spark (Video)
Download Slides

Utilizing accelerators in Apache Spark presents opportunities for significant speedup of ETL, ML and DL applications. In this talk I’ll describe acceleration of Spark data shuffle across network and GPU accelerators using the UCX library. I will show our existing solutions for speeding up host-to-host data transfer using one-sided communications, device-to-device acceleration for Rapids Spark using the tag API, and our roadmap to unify these two solutions.

Peter Rudenko, Nvidia

Peter Rudenko is a software engineer on the High Performance Computing team, focusing on accelerating data intensive applications and developing the UCX communication library and various big data solutions.

12:15-13:15
UCX Python – Dask/RAPIDS (Video)
Download Slides

Efficient communication is paramount to data science, yet it is very challenging because the field is dominated by high-level languages such as Python. We will show how UCX-Py takes full advantage of UCX performance for distributed frameworks such as Dask and the GPU-accelerated RAPIDS ecosystem. The performance achieved is seen in the COVID-19 research on Summit, the world’s largest supercomputer at ORNL.

Ben Zaitlen, NVIDIA

Benjamin Zaitlen is a system software manager for NVIDIA-RAPIDS, focusing on connecting the PyData Ecosystem to accelerated computing and networking.

Peter Entschev, NVIDIA

Peter Entschev is a system software engineer at NVIDIA, working on distributed computing and communications on the RAPIDS team.

Matthew Baker, ORNL

Matthew graduated from East Tennessee State University (ETSU) with a master’s in applied computer science. Matthew’s past work has included working on programming models and high performance network libraries. Matthew’s current work includes monitoring systems at large scale and allowing applications to make decisions at run time.

12/01 08:00-08:40
UCF – Future directions (Video)

Steve Poole, Los Alamos National Laboratory

Steve is the Chief Architect for Next Generation Platforms at Los Alamos National Laboratory.

08:40-09:40
UCP Protocols v2 (Video)
Download Slides

In order to achieve out-of-box performance, UCX must select the optimal protocol for the given scenario, considering factors such as buffer length, memory locality, and data layout. In this talk we will present and discuss the next version of the protocol selection mechanism and make sure it will be able to handle that task.

Yossi Itigin, Nvidia

Yossi Itigin is a UCX team lead at NVIDIA, focusing on high-performance communication middleware, and a maintainer of the OpenUCX project. Prior to joining NVIDIA, Mr. Itigin spent nine years at Mellanox Technologies in different technical roles, all related to developing and optimizing RDMA software.

09:40-10:40
UCP Active messages API (Video)
Download Slides

Active messages is a common messaging interface for various PGAS (Partitioned Global Address Space) APIs/libraries. In this talk I will give an overview of the existing UCP active messaging API, including recently introduced rendezvous protocol capabilities.

Mikhail Brinskii, Nvidia

Mikhail Brinskii is a UCX developer with a main focus on networking and HPC solutions. Before joining Nvidia, Mikhail worked at Intel developing the highly optimized Intel MPI Library.

10:40-11:40
UCX development in Huawei (Video)
Download Slides

Huawei is a unique vendor in the HPC landscape, producing all cluster components: Compute (Kunpeng 920 CPU), Storage (OceanStor solutions), Interconnect (RoCE NICs and Switches) and accelerators. This talk will present some of our recent work on UCX – and how it fits in the bigger picture of our HPC solutions.

Alex Margolin, HPC software architect and team leader, Huawei

Alex is a CS Ph.D. candidate in the area of MPI collective operations, and a software architect and developer in the areas surrounding MPI performance and low-latency communication. Today, he is the team leader of the cluster computing team at the Tel-Aviv Research Center, Huawei.

11:40-12:20
Open Smart NIC API – State of the Union (Video)

In this talk, we will give the community an update on what has been done over the last year with OpenSNAPI. We will also discuss two potential use cases for OpenSNAPI, covering both SmartNICs and Computational Storage devices.

Steve Poole, Los Alamos National Laboratory

Steve is the Chief Architect for Next Generation Platforms at Los Alamos National Laboratory.

12/02 08:00-09:00
BlazingSQL with UCX (Video)
Download Slides

BlazingSQL is an open-source SQL engine built on GPUs as part of the RAPIDS ecosystem. We have recently added the capacity to leverage UCX as a communication protocol in our engine. We will discuss how we integrated UCX into BlazingSQL. First, we re-implemented our communication library to support multiple networking protocols. Second, we alternated between the NB and NBR APIs to find the most appropriate fit for our code-base. We’ll discuss our implementation, current results when compared to our non-UCX communication protocols, and our future plans for UCX.

Rodrigo Aramburu

Rodrigo is one of the cofounders of BlazingSQL, an open-source GPU accelerated SQL engine contained within a Python package. Rodrigo started his career as a consultant at Deloitte building analytics systems for the financial services industry before working on startups, initially delivering analytics projects in Peru to multiple government entities and Latin American multi-nationals. Rodrigo started working on BlazingSQL with his brother, and they were inspired by work that came out of a consulting project for the Peruvian government. That work now makes its home within the RAPIDS GPU data science ecosystem to enable end-to-end workloads entirely on GPUs.

Felipe Aramburu, BlazingSQL

Felipe is a maker. From aquaponics, beer and cheese-making to home automation, he is obsessed with creating. Before becoming CTO of BlazingSQL, he and his brother had a consulting company based out of Peru, where they originally built BlazingSQL as a tool to help them with their own consulting work. Before this he was the CTO of kWhOURs, which provided a SaaS solution for energy auditing. Through BlazingSQL he has become a high performance junkie who spends nights dreaming about how hybrid processing systems are going to change the world.

09:00-10:00
Charm++ with UCX (Video)
Download Slides Part 1 | Part 2

The Charm++ parallel programming system supports a variety of networking hardware through its machine layers, with UCX as a recent addition. In this talk, we will discuss the issues and results seen with the UCX machine layer over the last year. We will also talk about upcoming support for direct GPU-GPU communication with UCX in Charm++.

Nitin Bhat, Charmworks

Nitin Bhat is a software engineer at Charmworks and has been a core developer of Charm++ since 2015. Prior to Charmworks, he graduated with an MS in Computer Science from University of Illinois Urbana-Champaign, where he was working as a Research Assistant at the Parallel Programming Laboratory.

Jaemin Choi, University of Illinois Urbana-Champaign

Jaemin Choi is a PhD candidate and research assistant at the University of Illinois Urbana-Champaign. His research revolves around support for GPUs in the Charm++ parallel programming system, including asynchronous progress of GPU tasks and optimizations for inter-GPU communication.

10:00-10:30
ROCm support in UCX: Status and Roadmap (Video)
Download Slides

This talk will present various improvements made to the performance of ROCm GPUs in UCX. I’ll also talk about future plans and what’s required to achieve them.

Sourav Chakraborty, AMD

Sourav Chakraborty works for the Radeon Technologies Group in AMD. His PhD research at The Ohio State University includes MPI semantics and implementation, fast interconnects, and scalable and reliable distributed systems.

10:40-11:40
UCX counters in Score-P and Vampir (Video)
Download Slides

Score-P, part of the VI-HPS project, is a data acquisition tool for collecting profiling and trace data from MPI applications. This talk will present (a) a brief introduction to some of the main VI-HPS tools (Score-P, Cube and Vampir), (b) UCX counters trace data acquisition using Score-P with the Huawei UCX statistics plugin, and (c) visualization in Vampir.

Shuki Zanyovka, HPC and Networking architect at Huawei

An Electrical and SW engineer with a background in Statistical Signal Processing, Wireless Communications and Data Science. Past experience: developing ARM based low-power IoT devices of different types, including Cellular Handsets, Digital TV Receivers, Automotive (V2X) and GPON FTTH switches.

11:40-12:40
Unified Communication Datatypes – State of the Union (Video)
Download Slides

Pavan Balaji, Argonne National Laboratory

12:40-13:00
Arm IP building blocks and standards for SmartNIC (Video)

This talk describes the building-blocks available from Arm for modern SmartNICs/DPUs along with relevant standards enabling operating systems to boot on Arm-based SmartNICs without modification. Support for interfaces like CXL in Arm IP that targets future SmartNIC architectures is also discussed.

Kshitij Sudan, Arm

Kshitij is the Technical Assistant to the GM of Arm’s Infrastructure business unit. He is a systems architect and focuses on enabling Arm customers targeting different market segments using Arm IP. Kshitij holds a PhD in computer architecture from the University of Utah.

12/03 08:00-09:00
UCC: Design and Implementation of Next Generation Collectives Library (Video)
Download Slides

UCC is a community-driven effort to develop a collective API and library implementation for applications in various domains, including High-Performance Computing, Artificial Intelligence, Data Center, and I/O. Over the last few months, the UCC working group has met weekly to develop the UCC specification. In this talk, I will highlight some of the design principles of the UCC v1.0 specification. I will also share the status of the UCC implementation and the upcoming plans of the working group. Further, we will share results from the experimental implementation, XCCL, which has helped us make informed decisions regarding UCC interfaces and semantics.

Manjunath Gorentla Venkata, Nvidia

Manjunath Gorentla Venkata is an HPC software architect at NVIDIA. His focus is on programming models and network libraries for HPC systems. Previously, he was a research scientist and the Languages Team lead at Oak Ridge National Laboratory. He’s served on open standards committees for parallel programming models, including OpenSHMEM and MPI for many years, and he is the author of more than 50 research papers in this area. Manju earned Ph.D. and M.S. degrees in computer science from the University of New Mexico.

09:00-09:30
One-to-many UCT transports, part I: Shared-memory (Video)
Download Slides

Though UCX stands for unified communication, the focus has been on P2P. This talk presents the first part of our work on broadcast communication – inter-process communication within the same host. This new UCT transport significantly accelerates MPI’s Barrier, Bcast, Reduce, Allreduce, Scatter and Gather.

Alex Margolin, HPC software architect and team leader, Huawei

Alex is a CS Ph.D. candidate in the area of MPI collective operations, and a software architect and developer in the areas surrounding MPI performance and low-latency communication. Today, he is the team leader of the cluster computing team at the Tel-Aviv Research Center, Huawei.

09:30-10:00
One-to-many UCT transports, part II: Multicast (Video)

Morad Horany, HPC software developer, Huawei

Holds a B.Sc. in Computer Engineering. Previously a firmware engineer for NICs at Mellanox in the areas of Quality of Service and Virtualization, now a member of the cluster computing team at the Tel-Aviv Research Center, Huawei.

10:00-11:00
Until UCC is available – UCG status update (Video)
Download Slides

A lot has happened since UCG’s first pull request back in 2018: the API has matured, and optimizations across UCX have been implemented (some accepted upstream) by team members in China and in Israel. This talk will present the journey so far and the road ahead.

Alex Margolin, HPC software architect and team leader, Huawei

Alex is a CS Ph.D. candidate in the area of MPI collective operations, and a software architect and developer in the areas surrounding MPI performance and low-latency communication. Today, he is the team leader of the cluster computing team at the Tel-Aviv Research Center, Huawei.

11:00-11:45
RDMA-CORE Linux kernel and user space updates (Video)
Download Slides

A review of interesting changes in the RDMA world over the last year, with a focus on elements of interest to the UCX community.

Jason Gunthorpe, Nvidia

Jason is the maintainer for RDMA in the Linux kernel and userspace rdma-core.

11:45-12:45
Scaling Facebook’s Deep Learning Recommender Model (DLRM) with UCC/XCCL (Video)
Download Slides

Recommender models comprise over 50% of the AI training workload in Facebook’s data centers. Facebook recently open-sourced a deep learning recommendation model (DLRM) that can be used as a tool for co-designing datacenter hardware and software for large-scale recommender model training. To these ends, we have implemented a new UCC process group for distributed PyTorch and have exposed various all-to-all and allreduce collectives through the UCC teams abstraction. We present an assessment of the suitability of the current UCC API for recommender workloads by running DLRM over the UCC process group on the Selene A100 SuperPOD on up to 256 GPUs and ConnectX-6 HCAs. We identify streams, collective topologies, and offload-devices as key abstractions that should be considered for inclusion in UCC.

Josh Ladd, Nvidia

Josh Ladd is a senior director of networking software at Mellanox/Nvidia where he leads the HPC middleware research and development team. Prior to joining Mellanox/Nvidia, Josh was a computer scientist at ORNL, where his research centered on the design and implementation of collective communication frameworks for extreme-scale supercomputers. Josh earned his PhD in mathematics from Colorado State University.

Srinivas, Facebook

12:45-13:30
Open Smart NIC API – OpenSHMEM I/O Extensions for Fine-grained Access to Persistent Memory Storage (Video)

Application workflows use files to communicate between stages of data processing and analysis kernel executions. To address speed and granularity issues with this method, we employ persistent memory (PMEM) devices that provide DRAM-like speeds and byte-granular access combined with persistent storage capabilities. We deploy an Arm-based Mellanox BlueField SmartNIC with attached NVDIMM-N modules. Both SmartNIC and PMEM introduce API design and system software integration challenges. We address this with the design and implementation of an innovative client-server software architecture with a client API extension to the OpenSHMEM library.

Megan Grodowitz, Arm

Megan Grodowitz has been a Staff Research Engineer at Arm, Ltd since joining Arm in 2017. Before Arm, she worked as computer science research staff at DoE and DoD labs after receiving her PhD from the University of Notre Dame in 2011.

Date Time Topic Speaker/Moderator
9 Dec 13:00-13:50 InfiniBand and RDMA recent advances
View Slides
Gilad – Mellanox
14:00-15:50 Proposal for collective operations API and implementation overview
View Slides

Collective Operations in UCX
View Slides

Manju – Mellanox

Alex – Toga

16:00-17:00 Hardware specific (u-arch) codes in UCX
View Slides
Alex – Toga
10 Dec 9:00-9:50 LANL Future Directions
View Slides
Stephen Poole – LANL
10:00-11:00 GPU Affinity
board discussion, no slides
Akshay – NVIDIA

Yossi – Mellanox

11:00-12:00 GPU Affinity
board discussion, no slides
Akshay – NVIDIA

Yossi – Mellanox

13:00-14:00 Pipelined transfers within the node, am_zcopy for cuda-ipc
board discussion, no slides
Devendar
14:00-14:50 RDMA-Core State of the Union
View Slides
Jason – Mellanox
15:00-15:30 UCX shared memory in containers
View Slides
Mikhail – Mellanox
15:30-16:00 Documentation
board discussion, no slides
Brent – Arm

Pasha – Arm

16:00-16:30 Release Procedure
board discussion, no slides
Yossi – Mellanox

Pasha – Arm

16:30-17:00 CI at Azure
board discussion, no slides
Jason – Mellanox

Yossi – Mellanox

11 Dec 09:00-09:50 Charm++ – Overview
View Slides
Nitin – Charmworks
10:00-10:30 Charm++ – Present work and results
board discussion, no slides
Mikhail – Mellanox
10:40-12:00 CI at Azure
board discussion, no slides
Yossi – Mellanox
12:00-13:00 Continue GPU protocols discussion
board discussion, no slides
Akshay – NVIDIA

Devendar

Yossi – Mellanox

13:00-14:00 UCP request API
View Slides
Mikhail – Mellanox
14:00-14:50 UCP active message API
board discussion, no slides
Sameh
15:00-15:50 UCX Protocols v2
View Slides
Yossi – Mellanox
16:00-17:00 RapidsAI/Dask/Joins
View Slides
Benjamin – NVIDIA

Nikolay – NVIDIA

12 Dec 09:00-09:50 Accelerating Spark with UCX
View Slides
Yossi – Mellanox
10:00-11:00 SmartNIC
board discussion, no slides
Pasha – Arm