Program

<< OpenACC Summit canceled >>

The OpenACC Summit at HPC Asia 2021 is scheduled for Monday, January 18, 2021 to Tuesday, January 19, 2021 as a completely digital, remote event prior to the HPC Asia 2021 main conference. The Summit will include a keynote, a Birds of a Feather (BoF) interactive discussion, an "Ask the Experts" session, several OpenACC talks, and a GPU Bootcamp. The agenda and speakers are still being finalized; please check back for updates.

https://www.openacc.org/events/openacc-summit-hpc-asia-2021

Opening Ceremony

Morning 1 [ 1/20 - 8:00 (Seoul), 0:00 (Brussels), 1/19 - 15:00 (US PST), 17:00 (US CST) ]

Welcome & Opening Remarks

Heon Young Yeom (HPC Asia 2021 General co-chair, Seoul National University)

Keynote 1 (ECP talk by Douglas Kothe)

* This talk has been rescheduled to the morning of Day 2

Session 1 (Accelerators and Architectures)

#Session Chair - Won Woo Ro (Yonsei Univ.)

Morning 2 [ 1/20 - 9:00 (Seoul), 1:00 (Brussels), 1/19 - 16:00 (US PST), 18:00 (US CST) ]

A Deep Reinforcement Learning Method for Solving Task Mapping Problems with Dynamic Traffic on Parallel Systems

Yu-Cheng Wang, Jerry Chou (National Tsing Hua University),

I-Hsin Chung (IBM T. J. Watson Research Center)

An Analysis of System Balance and Architectural Trends Based on Top500 Supercomputers

Awais Khan (Sogang University),

Hyogi Sim, Sudharshan S. Vazhkudai (Oak Ridge National Laboratory),

Ali R. Butt (Virginia Tech),

Youngjae Kim (Sogang University)

Performance Evaluation of OpenCL-Enabled Inter-FPGA Optical Link Communication Framework CIRCUS and SMI

Ryuta Kashino, Ryohei Kobayashi, Norihisa Fujita, Taisuke Boku (University of Tsukuba)

Spectral Element Simulations on the NEC SX-Aurora TSUBASA

Niclas Jansson (KTH Royal Institute of Technology)

Keynote 2 (EPI talk by Jean-Marc DENIS)

#Session Chair - John Kim (KAIST)

Afternoon 1 [ 1/20 - 14:00 (Seoul), 6:00 (Brussels), 1/19 - 21:00 (US PST), 23:00 (US CST) ]

Future Supercomputers are game changers for processor architecture. Why?

Jean-Marc DENIS (Chief of Staff, Innovation & Strategy, Atos / Chairman of the Board, European Processor Initiative)

Session 2 (Programming Models and System Software)

#Session Chair - Neelima Bayyapu (NITK Surathkal)

Afternoon 2 [ 1/20 - 15:00 (Seoul), 7:00 (Brussels), 1/19 - 22:00 (US PST), 24:00 (US CST) ]

HybridHadoop: CPU-GPU Hybrid Scheduling in Hadoop

Chanyoung Oh, Hyeonjin Jung (University of Seoul),

Saehanseul Yi (University of California, Irvine),

Illo Yoon, Youngmin Yi (University of Seoul)

neoSYCL: a SYCL implementation for SX-Aurora TSUBASA

Yinan Ke (Graduate School of Information Sciences, Tohoku University),

Mulya Agung, Hiroyuki Takizawa (Cyberscience Center, Tohoku University)

CSPACER: A Reduced API Set Runtime for the Space Consistency Model

Khaled Ibrahim (Lawrence Berkeley National Laboratory)

Keynote 1 (ECP talk by Douglas Kothe)

#Session Chair - Rio Yokota (Tokyo Tech.)

Morning 1 [ 1/21 - 8:00 (Seoul), 0:00 (Brussels), 1/20 - 15:00 (US PST), 17:00 (US CST) ]

Update on the US Department of Energy Exascale Computing Project

Douglas B. Kothe (Director, Exascale Computing Project (ECP), Oak Ridge National Laboratory)

Session 3 (Application and Algorithms)

#Session Chair - Sang Kyu Kwak (UNIST)

Morning 2 [ 1/21 - 9:00 (Seoul), 1:00 (Brussels), 1/20 - 16:00 (US PST), 18:00 (US CST) ]

Efficient Implementation of a Dimensionality Reduction Method Using a Complex Moment-Based Subspace

Takahiro Yano, Yasunori Futamura, Akira Imakura, Tetsuya Sakurai (University of Tsukuba)

Conjugate Gradient Solvers with High Accuracy and Bit-wise Reproducibility between CPU and GPU using Ozaki scheme

Daichi Mukunoki (RIKEN Center for Computational Science),

Katsuhisa Ozaki (Shibaura Institute of Technology),

Takeshi Ogita (Tokyo Woman's Christian University),

Roman Iakymchuk (Sorbonne University)

A Compressed, Divide and Conquer Algorithm for Scalable Distributed Matrix-Matrix Multiplication

Majid Rasouli, Robert M. Kirby, Hari Sundar (School of Computing, University of Utah)

GPU Acceleration of Multigrid Preconditioned Conjugate Gradient Solver on Block-Structured Cartesian Grid

Naoyuki Onodera, Yasuhiro Idomura, Yuta Hasegawa, Susumu Yamashita (Japan Atomic Energy Agency/Nuclear Science and Engineering Center),

Takashi Shimokawabe (Information Technology Center, The University of Tokyo),

Takayuki Aoki (Global Scientific Information and Computing Center, Tokyo Institute of Technology)

Keynote 3 (AI/Cloud talk by Jaejin Lee)

#Session Chair - Jae Hyuk Huh (KAIST)

Afternoon 1 [ 1/21 - 13:00 (Seoul), 5:00 (Brussels), 1/20 - 20:00 (US PST), 22:00 (US CST) ]

An HPC Deep Learning Framework for AI Cloud

Jaejin Lee (Professor, Dept. of Computer Science and Engineering, Seoul National University)

Poster Session

#Session Chair - Jik-Soo Kim (Myongji Univ.)

Afternoon 2 [ 1/21 - 14:00 (Seoul), 6:00 (Brussels), 1/20 - 21:00 (US PST), 23:00 (US CST) ]

Performance Modeling of HPC Applications on Overcommitted Systems

Shohei Minami, Toshio Endo, Akihiro Nomura (Tokyo Institute of Technology)

GPU Optimizations for Atmospheric Chemical Kinetics

Theodoros Christoudias (The Cyprus Institute),

Timo Kirfel, Astrid Kerkweg, Domenico Taraborrelli (Forschungszentrum Jülich GmbH, IEK-8),

Georges-Emmanuel Moulard, Erwan Raffin (Center for Excellence in Performance Programming, Atos),

Victor Azizi, Gijs van den Oord, Ben van Werkhoven (Netherlands eScience Center)

HPC LINPACK Parameter Optimization on Homo-/Heterogeneous System of ARM Neoverse N1SDP

Je-Seok Ham, Yong Cheol Peter Cho, Juyeob Kim, Chun-Gi Lyuh, Jinkyu Kim, Jinho Han, Youngsu Kwon (Electronics and Telecommunications Research Institute)

Closing

Afternoon 3 [ 1/21 - 15:00 (Seoul), 7:00 (Brussels), 1/20 - 22:00 (US PST), 24:00 (US CST) ]

Message from Steering Committee

Taisuke Boku (University of Tsukuba)

Introducing HPC Asia 2022 Plan

Miwako Tsuji (General vice-cochair of HPC Asia 2022, RIKEN)

Closing Remarks

Soonwook Hwang (HPC Asia 2021 General co-chair, KISTI)

Workshop 1 (Multi-scale, Multi-physics and Coupled Problems on highly parallel systems, MMCP)

1/22 [ 18:30 ~ 18:40 ]

Opening Remarks

Sabine Roller (STS, Zimt, University of Siegen),

Osni Marques (Scalable Solvers Group, Lawrence Berkeley National Laboratory)

1/22 [ 18:40 - 19:25 ]

Invited Talk: High Performance Computing and Its Industrial Applications in Fugaku Era

Chisachi Kato (University of Tokyo, Center for Research on Innovative Simulation Software (CISS))

1/22 [ 19:35 - 19:50 ]

Talk: Multi-scale Modelling of Urban Air Pollution with Coupled Weather Forecast and Traffic Simulation on HPC Architecture

László Környei (Széchenyi István University),

Zoltán Horváth, Ákos Kovács, Bence Liszkai, Andreas Ruopp

1/22 [ 19:50 - 20:05 ]

Talk: Advantages of Space-Time Finite Elements for Domains with Time Varying Topology

Norbert Hosters (RWTH Aachen),

Maximilian von Danwitz, Patrick Antony, Marek Behr

1/22 [ 20:05 - 20:20 ]

Talk: Molecular-Continuum Flow Simulation in the Exascale and Big Data Era

Philipp Neumann (Helmut-Schmidt University),

Vahid Jafari, Piet Jarmatz, Felix Maurer, Helene Wittenberg, Niklas Wittmer

1/22 [ 20:30 - 20:45 ]

Talk: An efficient halo approach for Euler-Lagrange simulations based on MPI-3 shared memory

Patrick Kopper (University of Stuttgart),

Marcel Pfeiffer, Stephen Copplestone, Andrea Beck

1/22 [ 20:45 - 21:00 ]

Talk: Node-level Performance Optimizations in CFD Codes

Peter Wauligmann (High Performance Computing Center Stuttgart (HLRS)),

Jakob Dürrwächter, Philipp Offenhäuser, Adrian Schlottke, Martin Bernreuther, Björn Dick

1/22 [ 21:00 ~ 21:10 ]

Closing Remarks

Sabine Roller (STS, Zimt, University of Siegen),

Osni Marques (Scalable Solvers Group, Lawrence Berkeley National Laboratory)

Workshop 2 (Intel eXtreme Performance Users Group, IXPUG)

1/22 [ 09:00 ~ 09:10 ]

Opening Remarks

Taisuke Boku (Workshop co-chair, University of Tsukuba)

1/22 [ 09:10 - 09:55 ]

Keynote Address: Advancing HPC Together

John K. Lee (Intel Corporation)

1/22 [ 09:55 - 10:25 ]

High Performance Simulations of Quantum Transport using Manycore Computing

Yosang Jeong, Hoon Ryu (KISTI)

1/22 [ 10:25 - 10:50 ]

Distributed MLPerf ResNet50 Training on Intel Xeon Architectures with TensorFlow

Wei Wang, Niranjan Hasabnis (Intel Corporation)

1/22 [ 11:15 - 11:50 ]

Invited Talk: oneAPI Industry Initiative for Accelerated Computing

Joe Curley (Intel Corporation)

1/22 [ 12:20 - 12:50 ]

A Comparison of Parallel Profiling Tools for Programs Utilizing the FFT

Brian Leu (Applied Dynamics International),

Samar Aseeri (KAUST),

Benson K. Muite (Kichakato Kizito)

1/22 [ 12:50 - 13:05 ]

Efficient Parallel Multigrid Method on Intel Xeon Phi Clusters

Kengo Nakajima (the University of Tokyo),

Balazs Gerofi (RIKEN R-CCS),

Yutaka Ishikawa (RIKEN R-CCS),

Masashi Horikoshi (Intel Corporation)

1/22 [ 13:05 ~ ]

Closing Remarks

Toshihiro Hanawa (Workshop co-chair, the University of Tokyo)

Workshop 3 (Recent Progresses on Quantum Information Science and Technology)

#Workshop Chair - Hoon Ryu (KISTI)

1/22 [ 13:00 ~ 13:30 ]

Introduction to Q-Center in South Korea

Prof. Yonuk Chong (SKKU, South Korea)

1/22 [ 13:30 - 14:30 ]

Quantum Computing with QISKIT ( I ) - Fundamentals

Dr. Hwajung Kang (IBM TJ Watson Research Center, USA)

1/22 [ 14:50 - 15:50 ]

Quantum Computing with QISKIT ( II ) - Introduction to Algorithms

Dr. Hwajung Kang (IBM TJ Watson Research Center, USA)

1/22 [ 16:00 - 16:30 ]

Learning Simulator for Quantum Algorithm Designs

Dr. Jeongho Bang (KIAS, South Korea)

1/22 [ 16:30 - 17:00 ]

Quantum Biology: Numerically Exact Simulation of Open Quantum System Dynamics

Dr. James Lim (Ulm University, Germany)

Keynote 1: (ECP talk by Douglas Kothe)

Update on the US Department of Energy Exascale Computing Project

Douglas B. Kothe (Director, Exascale Computing Project (ECP), Oak Ridge National Laboratory)

Abstract:
The vision of the U.S. Department of Energy (DOE) Exascale Computing Project (ECP) is to accelerate innovation with exascale simulation and data science solutions. ECP’s mission is to deliver exascale-ready applications and solutions that address currently intractable problems; create and deploy an expanded and vertically integrated software stack on exascale; and leverage research activities and products into HPC exascale systems. ECP’s RD&D activities, which encompass the development of applications, software technologies, and hardware technologies and architectures, are carried out by many small teams of scientists and engineers. Illustrative examples will be given on how the ECP teams are delivering in the project's three areas of technical focus:

* Applications: Creating or enhancing the predictive capability of applications through algorithmic and software advances via co-design centers; targeted development of requirements-based models, algorithms, and methods; systematic improvement of exascale system readiness and utilization; and demonstration and assessment of effective software integration.

* Software Technologies: Developing and delivering a vertically integrated software stack containing advanced mathematical libraries, extreme-scale programming environments, development tools, visualization libraries, and the software infrastructure to support large-scale data management and data science for science and security applications.

* Hardware and Integration: Supporting R&D focused on innovative architectures for competitive exascale system designs; objectively evaluating hardware designs; deploying an integrated and continuously tested exascale software ecosystem; accelerating application readiness on targeted exascale architectures; and training on key ECP technologies to accelerate the software development cycle and optimize productivity of application and software developers.



Bio:
Douglas B. Kothe (Doug) has thirty-five years of experience in conducting and leading applied R&D in computational science applications designed to simulate complex physical phenomena in the energy, defense, and manufacturing sectors. Doug is currently the Director of the U.S. Department of Energy (DOE) Exascale Computing Project. Prior to that, he was Deputy Associate Laboratory Director of the Computing and Computational Sciences Directorate at Oak Ridge National Laboratory (ORNL). Other positions for Doug at ORNL, where he has been since 2006, include Director of Science at the National Center for Computational Sciences (2006-2010) and Director of the Consortium for Advanced Simulation of Light Water Reactors (CASL), DOE’s first Energy Innovation Hub (2010-2015). In leading the CASL Hub, Doug drove the creation, application, and deployment of an innovative Virtual Environment for Reactor Applications (2016 R&D winner), which offered a technology step change for the US nuclear energy industry.

Before coming to ORNL, Doug spent 20 years at Los Alamos National Laboratory, where he held a number of technical and line and program management positions, with a common theme being the development and application of modeling and simulation technologies targeting multi-physics phenomena characterized by the presence of compressible or incompressible interfacial fluid flow, where his field-changing accomplishments are known internationally. Doug also spent one year at Lawrence Livermore National Laboratory in the late 1980s as a physicist in defense sciences.

Doug holds a Bachelor of Science in Chemical Engineering from the University of Missouri – Columbia (1983) and a Master of Science (1986) and Doctor of Philosophy (1987) in Nuclear Engineering from Purdue University.

Keynote 2: (EPI talk by Jean-Marc DENIS)

Future Supercomputers are game changers for processor architecture. Why?

Jean-Marc DENIS (Chief of Staff, Innovation & Strategy, Atos / Chairman of the Board, European Processor Initiative)

Abstract:
The rise of artificial intelligence in HPC, together with the data deluge and the transition from monolithic applications toward complex workflows, is leading the HPC community, especially hardware architects, to reconsider how next-generation supercomputers are designed.

In this talk, the transition from existing homogeneous architectures to future modular ones is discussed, and the consequences for the general-purpose processor are addressed. Ultimately, all these considerations led to the guidelines that governed the design of the European low-power HPC processor intended to power the world's top exascale supercomputers. We will also elaborate on the first information related to RHEA, the first European HPC processor.



Bio:
Since the beginning of 2020, Jean-Marc has been the Chief of Staff of the Innovation and Strategy Division at Atos. In addition, since mid-2018, Jean-Marc has also served as the elected Chair of the Board of the European Processor Initiative (EPI). Prior to that, Jean-Marc Denis held various positions in the HPC industry.

After five years of research as a mathematician on the development of new solvers for the Maxwell equations at Matra Defense (France) from 1990 to 1995, Jean-Marc Denis held several technical positions in the HPC industry between 1995 and 2004, from HPC pre-sales to Senior Solution Architect.

Since mid-2004, Jean-Marc has worked at the Bull SAS headquarters (France), where he started the HPC activity. In less than 10 years, the HPC revenue at Bull exploded from nothing in 2004 to 200M€ in 2015, making Bull the undisputed leader of the European HPC industry and the fourth in the world. From 2011 to the end of 2016, Jean-Marc led the worldwide business activity with the goal of consolidating the ATOS/Bull position in Europe and making ATOS/Bull a worldwide leader in Extreme Computing with a footprint in the Middle East, Asia, Africa and South America.

From 2018 to 2020, Jean-Marc was the head of Strategy and Plan at Atos/Bull, in charge of the global cross-Business Unit strategy and of the definition of the three-year business plan. In 2016 and 2017, Jean-Marc was in charge of defining the strategy for the BigData Division at ATOS/Bull. In that position, his role was to define the global approach for the different BigData business lines covering HPC, Legacy (mainframe), Enterprise computing, DataScience consulting and Software.

In parallel with his activities at ATOS/Bull, since 2008, Jean-Marc Denis has taught “Supercomputer Architecture” concepts in the Master 2 degree program at the University of Reims Champagne Ardennes (URCA), France.

Keynote 3: (AI/Cloud talk by Jaejin Lee)

An HPC Deep Learning Framework for AI Cloud

Jaejin Lee (Professor, Dept. of Computer Science and Engineering, Seoul National University)

Abstract:
In this talk, we introduce a research direction in deep learning frameworks that automatically executes a deep learning model in parallel. The target system is a heterogeneous cluster where various accelerators are mixed. The proposed deep learning framework is compatible with popular deep learning frameworks, such as PyTorch and TensorFlow. The user just runs a deep learning model written for a single GPU or a single CPU core in the proposed deep learning framework for inference or training. The underlying AI Cloud system exploits the proposed deep learning framework to parallelize the user's workload automatically. Also, we introduce the study of deep learning optimization techniques for GPUs whose performance surpasses that of state-of-the-art deep learning optimization frameworks, such as TensorFlow XLA, TensorRT, and TVM in inference and training.



Bio:
Jaejin Lee is a professor in the Department of Computer Science and Engineering and the Graduate School of Data Science (vice-dean of student affairs) at Seoul National University (SNU). He is also the leader of the Thunder research group at SNU. He received his PhD degree in Computer Science from the University of Illinois at Urbana-Champaign (UIUC) in 1999. His PhD study was supported in part by graduate fellowships from IBM and the Korea Foundation for Advanced Studies. He received an MS degree in Computer Science from Stanford University in 1995 and a BS degree in Physics from SNU in 1991. After obtaining the PhD degree, he spent half a year at UIUC as a visiting lecturer and postdoctoral research associate. He was an assistant professor in the Department of Computer Science and Engineering at Michigan State University from January 2000 to August 2002 before joining SNU. He is an IEEE Fellow and a member of ACM.

Session 1: Accelerators and Architectures

A Deep Reinforcement Learning Method for Solving Task Mapping Problems with Dynamic Traffic on Parallel Systems

Yu-Cheng Wang, Jerry Chou (National Tsing Hua University),

I-Hsin Chung (IBM T. J. Watson Research Center)

Efficient mapping of application communication patterns to the network topology is a critical problem for optimizing the performance of communication-bound applications on parallel computing systems. The problem has been extensively studied in the past, but prior work mostly formulates it as finding an isomorphic mapping between two static graphs with edges annotated by traffic volume and network bandwidth. In practice, however, network performance is difficult to estimate accurately, and communication patterns often change over time and are not easily obtained. Therefore, this work proposes a deep reinforcement learning (DRL) approach to explore better task mappings by utilizing the performance prediction and runtime communication behaviors provided by a simulator to learn an efficient task mapping algorithm. We extensively evaluated our approach using both synthetic and real applications with varied communication patterns on Torus and Dragonfly networks. Compared with several existing approaches from the literature and software libraries, our proposed approach found task mappings that consistently achieved comparable or better application performance. For a real application in particular, the average improvements of our approach on Torus and Dragonfly networks are 11% and 16%, respectively. In comparison, the average improvements of other approaches are all less than 6%.
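As a rough illustration of the static formulation that the abstract contrasts itself with, the sketch below scores a task mapping by traffic volume times hop distance on a small torus. It is not the authors' DRL method; the function names, the 4x4 torus, and the traffic figures are all hypothetical.

```python
def torus_hops(a, b, dims):
    """Shortest hop count between coordinates a and b on a torus (with wraparound)."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def hop_bytes(traffic, mapping, dims):
    """Static mapping cost: traffic volume times hop distance, summed over task pairs."""
    return sum(
        volume * torus_hops(mapping[src], mapping[dst], dims)
        for (src, dst), volume in traffic.items()
    )


# Hypothetical example: four tasks communicating in a ring on a 4x4 torus.
dims = (4, 4)
traffic = {(0, 1): 10, (1, 2): 10, (2, 3): 10, (3, 0): 10}

# Two candidate placements of tasks onto torus coordinates.
compact = {0: (0, 0), 1: (0, 1), 2: (1, 1), 3: (1, 0)}
scattered = {0: (0, 0), 1: (2, 2), 2: (0, 1), 3: (2, 3)}

print("compact cost:", hop_bytes(traffic, compact, dims))
print("scattered cost:", hop_bytes(traffic, scattered, dims))
```

A static score like this ignores congestion and time-varying traffic, which is the limitation the abstract cites as motivation for learning from simulator feedback instead.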

Session 1: Accelerators and Architectures

An Analysis of System Balance and Architectural Trends Based on Top500 Supercomputers

Awais Khan (Sogang University),

Hyogi Sim, Sudharshan S. Vazhkudai (Oak Ridge National Laboratory),

Ali R. Butt (Virginia Tech),

Youngjae Kim (Sogang University)

Supercomputer design is a complex, multi-dimensional optimization process, wherein several subsystems need to be reconciled to meet a desired figure of merit performance for a portfolio of applications and a budget constraint. However, overall, the HPC community has been gravitating towards ever more FLOPS, at the expense of many other subsystems. To draw attention to overall system balance, in this paper, we analyze balance ratios and architectural trends in the world’s most powerful supercomputers. Specifically, we have collected the performance characteristics of systems between 1993 and 2019 based on the Top500 lists, and then analyzed their architectures from diverse system design perspectives. Notably, our analysis studies the performance balance of the machines, across a variety of subsystems such as compute, memory, I/O, interconnect, intra-node connectivity and power. Our analysis reveals that balance ratios of the various subsystems need to be considered carefully alongside the application workload portfolio to provision the subsystem capacity and bandwidth specifications, which can help achieve optimal performance.
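One common way to express a balance ratio is subsystem bandwidth or capacity divided by peak compute rate. The sketch below assumes that definition, which the abstract does not spell out, and uses made-up figures for a hypothetical machine rather than data from the paper.

```python
def balance_ratios(peak_flops, mem_bandwidth, net_bandwidth, io_bandwidth, power_watts):
    """Return per-FLOP balance ratios for one system (assumed definitions)."""
    return {
        "memory_bytes_per_flop": mem_bandwidth / peak_flops,
        "network_bytes_per_flop": net_bandwidth / peak_flops,
        "io_bytes_per_flop": io_bandwidth / peak_flops,
        "flops_per_watt": peak_flops / power_watts,
    }


# Hypothetical system: 1 PFLOP/s peak, 200 TB/s aggregate memory bandwidth,
# 10 TB/s injection bandwidth, 1 TB/s file-system bandwidth, 2 MW power draw.
ratios = balance_ratios(
    peak_flops=1e15,
    mem_bandwidth=2e14,
    net_bandwidth=1e13,
    io_bandwidth=1e12,
    power_watts=2e6,
)

for name, value in ratios.items():
    print(f"{name}: {value:.3g}")
```

Tracking how such ratios drift across Top500 generations is one way to see compute outpacing the other subsystems.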

Session 1: Accelerators and Architectures

Performance Evaluation of OpenCL-Enabled Inter-FPGA Optical Link Communication Framework CIRCUS and SMI

Ryuta Kashino, Ryohei Kobayashi, Norihisa Fujita, Taisuke Boku (University of Tsukuba)

In recent years, FPGAs (Field Programmable Gate Arrays) have been receiving attention as accelerators in the HPC research area. One of the strong features of today's FPGAs, alongside their adjustability, is their high-bandwidth communication performance over direct optical links, which allows multi-FPGA platforms to be constructed. However, programming FPGAs is laborious for users who want to apply them to their applications. With a user-friendly programming environment, it becomes possible to apply FPGAs to various HPC applications on multi-FPGA platforms.

Among several research efforts toward high-level synthesis that utilize the FPGA communication feature, we focus on two such systems, CIRCUS and SMI, which are available on Intel FPGAs with direct optical links offering 40 to 100 Gbps of performance. In both systems, a user can access the optical link from OpenCL kernels, which makes high-level programming of HPC applications possible. In this paper, we introduce them for practical cases and compare their implementation and performance on real systems.

In conclusion, our evaluation shows that the CIRCUS system achieves up to 90 Gbps of bandwidth for single point-to-point communication over 100 Gbps optical links from OpenCL code. This is 2.7 times faster than SMI implemented on the same platform. We also confirmed that broadcast data transfer among four FPGAs with CIRCUS reaches up to 31 Gbps of bandwidth, which is 5.3 times faster than the SMI implementation. In addition, we identified the main cause of the performance bottleneck in SMI when it is applied to a 100 Gbps platform and compare it with the CIRCUS implementation.

Session 1: Accelerators and Architectures

Spectral Element Simulations on the NEC SX-Aurora TSUBASA

Niclas Jansson (KTH Royal Institute of Technology)

Following the recent transition in the high performance computing landscape to more heterogeneous architectures, application developers are faced with the challenge of ensuring good performance across a diverse set of platforms. In this paper, we present our work on porting the spectral element code Nek5000 to the recent vector architecture SX-Aurora TSUBASA. Using Nek5000's mini-app Nekbone, we formulate suitable loop transformations in key kernels, allowing for better vectorization, increasing the baseline performance by a factor of six. Using the new transformations, we demonstrate that the main compute-intensive matrix-vector and matrix-matrix multiplication kernels achieve close to half the peak performance of an SX-Aurora core. Our work also addresses the gather-scatter operations, a key kernel for efficient matrix-free spectral element formulation. We introduce a new implementation of Nek5000's gather-scatter library with mesh topology awareness for improved vectorization via exploitation of the SX-Aurora's hardware gather-scatter instructions, improving performance by up to 116%. A detailed description of the implementation is given together with a performance study, comparing both single node performance and strong scalability characteristics, running across multiple SX-Aurora cards.

Session 2: Programming Models and System Software

HybridHadoop: CPU-GPU Hybrid Scheduling in Hadoop

Chanyoung Oh, Hyeonjin Jung (University of Seoul),

Saehanseul Yi (University of California, Irvine),

Illo Yoon, Youngmin Yi (University of Seoul)

As the GPU has become an essential component in high performance computing, many works have attempted to leverage GPU computing in Hadoop. However, few works have considered fully utilizing the GPU in Hadoop, and only a few have studied utilizing both the CPU and the GPU at the same time. In this paper, we propose CPU-GPU hybrid scheduling in Hadoop, where both the CPUs and the GPUs in a node are exploited as much as possible in an adaptive manner. The technical barrier is that the optimal number of GPU tasks is not known in advance, and the total number of Containers in a node cannot be changed once a Hadoop job starts. In the proposed approach, we first determine the initial number of Containers as well as the hybrid execution mode; the proposed dynamic scheduler then adjusts the number of Containers for a GPU and a CPU with the help of a GPU monitor during job execution. It also employs a load-balancing algorithm for the tail. Experiments with various benchmarks show that the proposed CPU-GPU hybrid scheduling achieves a 3.79x speedup on average over the 12-core CPU-only Hadoop.

Session 2: Programming Models and System Software

neoSYCL: a SYCL implementation for SX-Aurora TSUBASA

Yinan Ke (Graduate School of Information Sciences, Tohoku University),

Mulya Agung, Hiroyuki Takizawa (Cyberscience Center, Tohoku University)

Recently, the high-performance computing world has moved to more heterogeneous architectures. Thus, it has become standard practice to offload a part of application execution to dedicated accelerators. However, low productivity is still a problem in programming for accelerators. This paper proposes neoSYCL: a SYCL implementation for SX-Aurora TSUBASA, aiming to improve productivity and achieve comparable performance with native implementations. Unlike other implementations, neoSYCL can identify and separate the kernel part of the SYCL code at the source code level. Thus, this approach can easily be moved to any heterogeneous architecture using the offload programming model. In this paper, we show the evaluation results on SX-Aurora. To quantitatively discuss not only performance but also productivity, we use two different benchmarks and code-complexity metrics for the evaluation. The results show that neoSYCL can improve productivity while reaching the same performance as native implementations.

Session 2: Programming Models and System Software

CSPACER: A Reduced API Set Runtime for the Space Consistency Model

Khaled Ibrahim (Lawrence Berkeley National Laboratory)

We present our design and implementation of a runtime for the Space Consistency model. The Space Consistency abstraction is a generalized form of the full-empty bit synchronization for distributed memory programming, where a memory region is associated with a counter that determines its consistency. The abstraction allows for efficient implementation of not only point-to-point data transfers but also collective communication primitives. In this work, we present interface design, implementation details, and performance results on Cray XC systems. Our reduced API design aims at low-overhead initiation of communication, threaded processing of runtime functions, and efficient pipelining to improve the computation-communication overlap. We show the performance benefits of using this runtime both at the microbenchmark level and in application settings.

Session 2: Programming Models and System Software

SeisSol on Distributed Multi-GPU Systems: CUDA Code Generation for the Modal Discontinuous Galerkin Method

Ravil Dorozhinskii, Michael Bader (Technical University of Munich)

We present a GPU implementation of the high order Discontinuous Galerkin (DG) scheme in SeisSol, a software package for simulating seismic waves and earthquake dynamics. Our particular focus is on providing a performance portable solution for heterogeneous distributed multi-GPU systems. We therefore redesigned SeisSol's code generation cascade for GPU programming models. This includes CUDA source code generation for the performance-critical small batched matrix multiplication kernels. The parallelisation extends the existing MPI+X scheme and supports SeisSol's cluster-wise Local Time Stepping (LTS) algorithm for ADER time integration.

We performed a Roofline model analysis to ensure that the generated batched matrix operations achieve the performance limits posed by the memory-bandwidth roofline. Our results also demonstrate that the generated GPU kernels outperform the corresponding cuBLAS subroutines by 2.5 times on average. We present strong and weak scaling studies of our implementation on the Marconi100 supercomputer (with 4 Nvidia Volta V100 GPUs per node) on up to 256 GPUs, which revealed good parallel performance and efficiency in the case of time integration using global time stepping (GTS). However, we show that directly mapping the LTS method from CPUs to distributed GPU environments results in lower hardware utilization. Nevertheless, due to the algorithmic advantages of local time stepping, the method still reduces time-to-solution by a factor of 1.3 on average compared with the GTS scheme.
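The Roofline bound mentioned in this abstract can be sketched in a few lines: attainable performance is the smaller of peak compute and memory bandwidth times arithmetic intensity. The peak and bandwidth figures below are illustrative assumptions for a V100-class GPU, not measurements from the paper, and the function name is hypothetical.

```python
def roofline_flops(arithmetic_intensity, peak_flops, mem_bandwidth):
    """Attainable FLOP/s for a kernel with the given FLOPs-per-byte intensity."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


# Assumed, approximate figures for illustration only.
PEAK_FP64 = 7.8e12      # FLOP/s
HBM_BANDWIDTH = 9.0e11  # bytes/s

# Intensity at which a kernel stops being memory-bound (the "ridge point").
ridge_point = PEAK_FP64 / HBM_BANDWIDTH

for intensity in (0.5, 1.0, 4.0, 16.0):
    bound = roofline_flops(intensity, PEAK_FP64, HBM_BANDWIDTH)
    regime = "memory-bound" if intensity < ridge_point else "compute-bound"
    print(f"AI={intensity:5.1f} FLOP/B -> {bound / 1e12:.2f} TFLOP/s ({regime})")
```

Small batched matrix multiplications typically sit left of the ridge point, so their target is the bandwidth roofline rather than peak FLOP/s.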

Session 3: Application and Algorithms

Efficient Implementation of a Dimensionality Reduction Method Using a Complex Moment-Based Subspace

Takahiro Yano, Yasunori Futamura, Akira Imakura, Tetsuya Sakurai (University of Tsukuba)

Dimensionality reduction methods are widely used for processing data efficiently. Recently, Imakura et al. proposed a novel dimensionality reduction method using a complex moment-based subspace. The method can use more eigenvectors than existing matrix trace optimization-based dimensionality reduction methods and is thus reported to achieve higher precision. However, its computational complexity is higher than that of existing methods, particularly for the nonlinear kernel version. To reduce the computational complexity, we propose a practical parallel implementation of the method by introducing the Nyström approximation. We evaluate the parallel performance of our implementation using the Oakforest-PACS supercomputer.

Session 3: Application and Algorithms

Efficient Contour Integral-based Eigenvalue Computation using an Iterative Linear Solver with Shift-Invert Preconditioning

Yasunori Futamura, Tetsuya Sakurai (University of Tsukuba)

Contour integral-based (CI) eigenvalue solvers are among the efficient and robust approaches for sparse eigenvalue problems. They have attracted attention owing to their inherent parallelism. For implementing a CI eigensolver, the inner linear systems arising in the algorithm need to be solved using an efficient method. One widely used method is a sparse direct linear solver provided by a well-established numerical library; it is numerically robust and provides good load balancing in parallel execution of the CI eigensolver. However, owing to its high total computational and memory cost, the performance of the direct solver approach is suboptimal. In this study, we propose an alternative method that utilizes a block Krylov iterative linear solver and shift-invert preconditioning that can take advantage of the shift-invariance of the block Krylov subspace. Our approach adaptively sets a preconditioning parameter according to the number of parallel processes to reduce the iteration counts. Several numerical examples confirm that our method outperforms the direct solver approach.

Session 3 : Application and Algorithms

Conjugate Gradient Solvers with High Accuracy and Bit-wise Reproducibility between CPU and GPU using Ozaki scheme

Daichi Mukunoki (RIKEN Center for Computational Science),

Katsuhisa Ozaki (Shibaura Institute of Technology),

Takeshi Ogita (Tokyo Woman's Christian University),

Roman Iakymchuk (Sorbonne University)

In Krylov subspace methods such as the Conjugate Gradient (CG) method, the number of iterations until convergence may increase due to the loss of computational accuracy caused by rounding errors in floating-point computations. At the same time, because the order of the computation is nondeterministic in parallel computation, the result and the convergence behavior may differ between computational environments, even for the same input. In this study, we present an accurate and reproducible implementation of the unpreconditioned CG method on x86 CPUs and NVIDIA GPUs. In our method, while all variables are stored in FP64, all inner product operations (including matrix-vector multiplications) are performed using the Ozaki scheme. The scheme delivers correctly rounded computation as well as bit-level reproducibility among different computational environments. In this paper, we show some examples where the standard FP64 implementation of CG produces nonidentical results across different CPUs and GPUs. We then demonstrate the applicability and effectiveness of our approach in terms of accuracy and reproducibility, and evaluate its performance on both CPUs and GPUs. Furthermore, we compare the performance of our method against an existing accurate and reproducible CG implementation based on the Exact Basic Linear Algebra Subprograms (ExBLAS) on CPUs.
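The order dependence described in this abstract can be reproduced with three FP64 numbers. The sketch below is only an illustration: it uses Python's `math.fsum`, which returns a correctly rounded sum, as a stand-in for an order-independent reduction. It is not the Ozaki scheme, and the helper name `naive_sum` is hypothetical.

```python
import math

def naive_sum(values):
    """Left-to-right FP64 accumulation, as a simple sequential loop would do."""
    total = 0.0
    for v in values:
        total += v
    return total

data = [0.1, 0.2, 0.3]

# The same numbers summed in two different orders give different FP64 results,
# which is what happens when a parallel reduction changes its order between runs.
forward = naive_sum(data)             # (0.1 + 0.2) + 0.3
backward = naive_sum(reversed(data))  # (0.3 + 0.2) + 0.1

# A correctly rounded sum does not depend on the order of the inputs.
exact_forward = math.fsum(data)
exact_backward = math.fsum(reversed(data))

print(forward == backward)              # False: order changes the rounded result
print(exact_forward == exact_backward)  # True: correctly rounded, hence reproducible
```

In a CG iteration such differences enter through every inner product, so two runs can take different convergence paths even though both are valid FP64 computations.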

Session 3 : Application and Algorithms

A Compressed, Divide and Conquer Algorithm for Scalable Distributed Matrix-Matrix Multiplication

Majid Rasouli, Robert M. Kirby, Hari Sundar (School of Computing, University of Utah)

Matrix-matrix multiplication (GEMM) is a widely used linear algebra primitive common in scientific computing and data sciences. While several highly-tuned libraries and implementations exist, these typically target either sparse or dense matrices. The performance of these tuned implementations on unsupported types can be poor, and this is critical in cases where the structure of the computations is associated with varying degrees of sparsity. One such example is Algebraic Multigrid (AMG), a popular solver and preconditioner for large sparse linear systems. In this work, we present a new divide and conquer sparse GEMM, that is also highly performant and scalable when the matrix becomes dense, as in the case of AMG matrix hierarchies. In addition, we implement a lossless data compression algorithm to reduce the communication cost. We combine this with an efficient communication pattern during distributed-memory GEMM to provide 2.24 times (on average) better performance than the state-of-the-art sparse matrix library PETSc. Additionally, we show that the performance and scalability of our method surpass PETSc even more when the density of the matrix increases. We demonstrate the efficacy of our methods by comparing our GEMM with PETSc on a wide range of matrices.

Session 3 : Application and Algorithms

GPU Acceleration of Multigrid Preconditioned Conjugate Gradient Solver on Block-Structured Cartesian Grid

Naoyuki Onodera, Yasuhiro Idomura, Yuta Hasegawa, Susumu Yamashita (Japan Atomic Energy Agency/Nuclear Science and Engineering Center),

Takashi Shimokawabe (Information Technology Center, The University of Tokyo),

Takayuki Aoki (Global Scientific Information and Computing Center, Tokyo Institute of Technology)

We develop a multigrid preconditioned conjugate gradient (MG-CG) solver for the pressure Poisson equation in a two-phase flow CFD code JUPITER. The JUPITER code is redesigned to realize efficient CFD simulations including complex boundaries and objects based on a block-structured Cartesian grid system. The code is written in CUDA, and is tuned to achieve high performance on GPU-based supercomputers. The main kernels of the MG-CG solver achieve more than 90% of the roofline performance. The MG preconditioner is constructed based on the geometric MG method with a three-stage V-cycle, and a red-black SOR (RB-SOR) smoother and its variant with cache-reuse optimization (CR-SOR) are applied at each stage. The numerical experiments are conducted for two-phase flows in a fuel bundle of a nuclear reactor. Thanks to the block-structured data format, grids inside fuel pins are removed without performance degradation, and the total number of grids is reduced to 2.26×10^9, which is about 70% of the original Cartesian grid. The MG-CG solvers with the RB-SOR and CR-SOR smoothers reduce the number of iterations to less than 15% and 9% of the original preconditioned CG method, leading to ×3.1 and ×5.9 speedups, respectively. In the strong scaling test, the MG-CG solver with the CR-SOR smoother is accelerated by 2.1× between 64 and 256 GPUs. The obtained performance indicates that the MG-CG solver designed for the block-structured grid is highly efficient and enables large-scale simulations of two-phase flows on GPU-based supercomputers.
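For readers unfamiliar with the red-black SOR smoother named in this abstract, the following is a minimal sketch on a small 2D Poisson problem. It is written in plain Python rather than CUDA, and the grid size, the relaxation factor 1.5 and the function names are illustrative assumptions, not taken from JUPITER.

```python
def rb_sor_sweep(u, f, h, omega):
    """One red-black SOR sweep for -laplace(u) = f with the 5-point stencil.

    Boundary values of u are kept fixed (Dirichlet). Points of one color only
    read neighbors of the other color, so each half-sweep could run in parallel.
    """
    n = len(u)
    for color in (0, 1):  # 0 = "red" points, 1 = "black" points
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                if (i + j) % 2 != color:
                    continue
                gauss_seidel = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                       + u[i][j - 1] + u[i][j + 1]
                                       + h * h * f[i][j])
                u[i][j] = (1.0 - omega) * u[i][j] + omega * gauss_seidel

def residual_norm(u, f, h):
    """Maximum norm of the residual f + laplace(u) over interior points."""
    n = len(u)
    worst = 0.0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            minus_lap = (4.0 * u[i][j] - u[i - 1][j] - u[i + 1][j]
                         - u[i][j - 1] - u[i][j + 1]) / (h * h)
            worst = max(worst, abs(f[i][j] - minus_lap))
    return worst

n = 17                      # grid points per direction (assumed for the demo)
h = 1.0 / (n - 1)
u = [[0.0] * n for _ in range(n)]
f = [[1.0] * n for _ in range(n)]

r0 = residual_norm(u, f, h)
for _ in range(50):
    rb_sor_sweep(u, f, h, omega=1.5)
r1 = residual_norm(u, f, h)
print(r0, r1)
```

The two-color ordering is what makes the smoother attractive on GPUs: all points of one color can be updated by independent threads.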

Poster Session:

Performance Modeling of HPC Applications on Overcommitted Systems

Shohei Minami, Toshio Endo, Akihiro Nomura (Tokyo Institute of Technology)

Recently, the use of interactive jobs in addition to traditional batch jobs has become common in supercomputer systems. We expect overcommitting scheduling, in which multiple HPC jobs share one computational resource, to accommodate both kinds of jobs in a single system and to make effective use of resources. On the other hand, the performance of a job is inevitably affected when it runs in an overcommitted state, so the impact of overcommitting needs to be investigated. We evaluated the performance change when a single computational resource was shared by multiple HPC jobs, and built a model to predict the performance change using performance counters as input variables. The model predicted the performance in the overcommitted state with good accuracy, and fatal mispredictions accounted for only 1.4% of the total cases.

Poster Session:

Toward Data-Adaptable TinyML using Model Partial Replacement for Resource Frugal Edge Device

Jisu Kwon, Daejin Park (Kyungpook National University)

A machine learning (ML) model trained for inference consists of a network and weights. Training the model requires enormous hardware resources, so it is usually conducted on a server or in the cloud. However, the device that is closely connected to the sensor and receives data generated from the surrounding environment is the edge device. If the server performs inference, the data may be affected by disturbances during transmission from the edge device to the server, and device power management becomes difficult because message transmission accounts for a large proportion of energy consumption. Therefore, TinyML, a paradigm for using ML on microcontroller unit (MCU) based edge devices that are close to users rather than on servers or clouds, is receiving attention. TinyML focuses on performing self-inference on data input to edge devices through sensors. We focus on the overhead of updating the entire firmware in flash memory, which the characteristics of embedded devices impose when inference on input from a new domain is required. Our objective is for the inference result of the fixed model to maintain a reasonable score even if the domain of the input data, the target of inference, is different.

Poster Session:

GPU Optimizations for Atmospheric Chemical Kinetics

Theodoros Christoudias (The Cyprus Institute),

Timo Kirfel, Astrid Kerkweg, Domenico Taraborrelli (Forschungszentrum Jülich GmbH, IEK-8),

Georges-Emmanuel Moulard, Erwan Raffin (Center for Excellence in Performance Programming, Atos),

Victor Azizi, Gijs van den Oord, Ben van Werkhoven (Netherlands eScience Center)

We present a series of optimizations to alleviate stack memory overflow issues and improve overall performance of GPU computational kernels in atmospheric chemical kinetics model simulations. We use heap memory in numerical solvers for stiff ODEs, move chemical reaction constants and tracer concentration arrays from stack to global memory, use direct pointer indexing for array memory access, and use CUDA streams to overlap computation with memory transfer to the device. Overall, an order of magnitude reduction in GPU memory requirements is achieved, allowing for simultaneous offloading from multiple MPI processes per node and/or increasing the chemical mechanism complexity.

Poster Session:

HPC LINPACK Parameter Optimization on Homo-/Heterogeneous System of ARM Neoverse N1SDP

Je-Seok Ham, Yong Cheol Peter Cho, Juyeob Kim, Chun-Gi Lyuh, Jinkyu Kim, Jinho Han, Youngsu Kwon (Electronics and Telecommunications Research Institute)

HPL (High Performance Linpack) is the standard benchmark used to evaluate supercomputers (high-performance computing systems) around the world. HPL solves a linear system of equations, Ax=b, through a series of mathematical processes such as 2D block-cyclic matrix distribution and LU decomposition. To achieve the best performance, optimization of key HPL parameters is essential. In this paper, we propose an optimization analysis technique for HPL parameters based on HPLinpack performance results and the efficiency graph pattern on a homogeneous system consisting of the Neoverse N1SDP (System Development Platform) from ARM. The results show the significant influence of GEMM operations on performance, and the need for their acceleration was further verified through HPL function call profiling. Finally, we show the effectiveness of the proposed parameter optimization methods on a heterogeneous system. An HPL modified for heterogeneous systems exhibited considerably improved performance when tested on the Neoverse N1SDP and the Nvidia RTX 2060 GPU.
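The 2D block-cyclic distribution mentioned in this abstract can be sketched in a few lines. The example assumes 0-based indices and that the first block lives on process (0, 0). The block size `nb` and the process grid `p` x `q` correspond to the NB, P and Q parameters that HPL exposes for tuning, while the function name and the sizes used here are hypothetical.

```python
from collections import Counter

def block_cyclic_owner(i, j, nb, p, q):
    """Return (process_row, process_col) that owns global matrix entry (i, j).

    The matrix is cut into nb x nb blocks, and the blocks are dealt out
    cyclically over a p x q process grid in both dimensions.
    """
    return (i // nb) % p, (j // nb) % q

# Illustrative sizes (assumed): an 8 x 8 matrix, 2 x 2 blocks, 2 x 2 process grid.
n, nb, p, q = 8, 2, 2, 2
owners = [[block_cyclic_owner(i, j, nb, p, q) for j in range(n)] for i in range(n)]

# Print the ownership map, one "rowcol" process label per matrix entry.
for row in owners:
    print(" ".join(f"{r}{c}" for r, c in row))

# Count how many entries each process holds to check the load balance.
load = Counter(owner for row in owners for owner in row)
print(dict(load))
```

Because the blocks wrap around the process grid, every process keeps a share of the trailing submatrix as the LU factorization proceeds, which is why the choice of block size and grid shape influences the achieved efficiency.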

Workshop: (Recent Progresses on Quantum Information Science and Technology)

Introduction to Q-Center in South Korea

Prof. Yonuk Chong (Sungkyunkwan University, South Korea)

The Q-center was founded in mid-2020 with the support of the National Research Foundation of Korea. This talk not only presents a brief introduction to the main functions and activities of the center, including the ongoing effort to provide quantum cloud services to (domestic) researchers, but also discusses diverse upcoming action plans for supporting R&D and educational activities in the field of quantum information science & engineering.

Date & Time for online presentation

2021-Jan-22, 13:00-13:30

Workshop: (Recent Progresses on Quantum Information Science and Technology)

Tutorial: Quantum Computing with IBM QISKIT

Dr. Hwajung Kang (IBM TJ Watson Research Center, USA)

QISKIT is an open-source development framework for quantum computing. It provides tools for creating and manipulating programs that run on IBM-powered gate-based quantum computers (Q Experience) or on simulators on a local computer. Offering researchers in the field of computer science a great chance to learn the basic concepts of quantum computing and how they can be realized in Python-based programming, this 2-hour tutorial will be conducted in two sessions: (1) Fundamentals and (2) Introduction to Quantum Algorithms.

Date & Time for online presentation

2021-Jan-22, 13:30-14:30 (Part 1)

2021-Jan-22, 14:50-15:50 (Part 2)

Workshop: (Recent Progresses on Quantum Information Science and Technology)

Learning Simulator for Quantum Algorithm Designs

Dr. Jeongho Bang (Korea Institute for Advanced Study, South Korea)

In this talk, we introduce a machine-learning-based method to ‘learn’ (the structure of) a quantum algorithm. The essence of the method is to use a classical-quantum hybrid learning simulator, in which a quantum system is taught by a classical learning algorithm. Our learning simulator is applied to learn a quantum algorithm for solving a function decision problem known as the Deutsch problem. Numerical simulations demonstrate that our simulator can always learn a quantum algorithm that corresponds to, but is not exactly equal to, the original Deutsch’s quantum algorithm. The most remarkable result is that the learning time is proportional to the square root of the search-space dimension, in contrast to the exponential (or polynomial at best) tendency found in classical learning. This is because the learning simulator reflects the (quantum) speedup of the algorithm identified in its (classical) learning.
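As background for the Deutsch problem referred to in this abstract, the sketch below simulates the textbook Deutsch algorithm with a plain 2-qubit state vector. It decides with a single oracle call whether a function f: {0,1} -> {0,1} is constant or balanced. This is the standard algorithm only, not the learning simulator presented in the talk, and all function names are hypothetical.

```python
import math

def hadamard_on(state, qubit):
    """Apply a Hadamard gate to one qubit of a 2-qubit state.

    Amplitudes are indexed as 2*x + y, with qubit 0 = x (query) and qubit 1 = y.
    """
    s = 1.0 / math.sqrt(2.0)
    new = [0.0] * 4
    for idx, amp in enumerate(state):
        x, y = idx >> 1, idx & 1
        bit = x if qubit == 0 else y
        for b in (0, 1):
            sign = -1.0 if (bit == 1 and b == 1) else 1.0
            nx, ny = (b, y) if qubit == 0 else (x, b)
            new[2 * nx + ny] += s * sign * amp
    return new

def oracle(state, f):
    """Apply U_f: |x, y> -> |x, y XOR f(x)>."""
    new = [0.0] * 4
    for idx, amp in enumerate(state):
        x, y = idx >> 1, idx & 1
        new[2 * x + (y ^ f(x))] += amp
    return new

def deutsch(f):
    """Return 'constant' or 'balanced' using one application of the oracle."""
    state = [0.0, 1.0, 0.0, 0.0]      # start in |x=0, y=1>
    state = hadamard_on(state, 0)
    state = hadamard_on(state, 1)
    state = oracle(state, f)
    state = hadamard_on(state, 0)
    prob_x_is_1 = state[2] ** 2 + state[3] ** 2
    return "balanced" if prob_x_is_1 > 0.5 else "constant"

for name, f in [("f(x)=0", lambda x: 0), ("f(x)=1", lambda x: 1),
                ("f(x)=x", lambda x: x), ("f(x)=1-x", lambda x: 1 - x)]:
    print(name, "->", deutsch(f))
```

A classical procedure needs two evaluations of f to answer the same question, which is the speedup the abstract alludes to.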

Date & Time for online presentation

2021-Jan-22, 16:00-16:30

Workshop: (Recent Progresses on Quantum Information Science and Technology)

Quantum Biology: Numerically Exact Simulation of Open Quantum System Dynamics

Dr. James Lim (Ulm University, Germany)

In experiments, quantum mechanical systems are not completely isolated from their environments, and it is important to develop a theoretical framework for treating system-environment couplings in a reliable and efficient way. In this talk, I will discuss the theory of open quantum systems and its application to spontaneous emission from a two-level atom where radiative environments are well approximated by a Markovian bath. In addition, I will report recent theoretical efforts towards numerically exact simulations of photosynthetic pigment-protein complexes, where electronic excitations are coupled to highly structured vibrational environments at finite temperatures, leading to non-Markovian effects, and perturbation theory cannot be used.

Date & Time for online presentation

2021-Jan-22, 16:30-17:00

Workshop: (Intel eXtreme Performance Users Group, IXPUG)

Opening Remarks

Taisuke Boku (Workshop co-chair, University of Tsukuba)

Date & Time for online presentation

2021-Jan-22, 09:00-09:10

Workshop: (Intel eXtreme Performance Users Group, IXPUG)

Keynote Address: Advancing HPC Together

John K. Lee (Intel Corporation)

(TBD)

Date & Time for online presentation

2021-Jan-22, 09:10-09:55

Workshop: (Intel eXtreme Performance Users Group, IXPUG)

High Performance Simulations of Quantum Transport using Manycore Computing

Yosang Jeong, Hoon Ryu (KISTI)

The Non-Equilibrium Green’s Function (NEGF) has been widely utilized in the field of nanoscience and nanotechnology to predict carrier transport behaviors in electronic device channels of sizes in a quantum regime. This work explores how much performance improvement can be achieved for NEGF computations with the unique features of manycore computing, where the core numerical step of NEGF computations involves a recursive process of matrix-matrix multiplication. The major techniques adopted for the performance enhancement are data-restructuring, matrix-tiling, thread-scheduling, and offload computing, and we present an in-depth discussion of why they are critical to fully exploit the power of manycore computing hardware, including Intel Xeon Phi Knights Landing systems and NVIDIA general-purpose graphics processing unit (GPU) devices. Performance of the optimized algorithm has been tested on a single computing node, where the host is a Xeon Phi 7210 equipped with two NVIDIA Quadro GV100 GPU devices. The target structure of NEGF simulations is a [100] silicon nanowire that consists of 100K atoms involving a 1000K×1000K complex Hamiltonian matrix. Through rigorous benchmark tests, we show that, with the optimization techniques explained in detail, the workload can be accelerated by a factor of up to ∼20 compared to the unoptimized case.

Date & Time for online presentation

2021-Jan-22, 09:55-10:25

Workshop: (Intel eXtreme Performance Users Group, IXPUG)

Distributed MLPerf ResNet50 Training on Intel Xeon Architectures with TensorFlow

Wei Wang, Niranjan Hasabnis (Intel Corporation)

MLPerf benchmarks, which measure training and inference performance of ML hardware and software, have published three sets of ML training results so far. In all sets of results, ResNet50v1.5 was used as a standard benchmark to showcase the latest developments on image recognition tasks. The latest MLPerf training round (v0.7) featured Intel’s submission with TensorFlow. In this paper, we describe the recent optimization work that enabled this submission. In particular, we enabled the BFloat16 data type in the ResNet50v1.5 model as well as in Intel-optimized TensorFlow to exploit the full potential of 3rd generation Intel Xeon scalable processors, which have built-in BFloat16 support. We also describe the performance optimizations as well as the state-of-the-art accuracy/convergence results of the ResNet50v1.5 model, achieved with large-scale distributed training (with up to 256 MPI workers) with Horovod. These results lay a strong foundation for future MLPerf training submissions with large-scale Intel Xeon clusters.
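The BFloat16 data type mentioned in this abstract keeps the sign bit and the 8 exponent bits of an IEEE-754 single-precision number but only 7 explicit mantissa bits, so a value can be converted by keeping the upper 16 bits of its FP32 pattern. The sketch below shows that conversion with round-to-nearest-even. The function names are hypothetical, NaN inputs are not handled, and it makes no claim about how TensorFlow or the hardware implement the conversion.

```python
import math
import struct

def to_bfloat16_bits(x):
    """Round a float to bfloat16 and return its 16-bit pattern (NaN not handled)."""
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]
    # Rounding bias: 0x7FFF plus the lowest kept bit, so exact ties go to even.
    rounding = 0x7FFF + ((bits32 >> 16) & 1)
    return ((bits32 + rounding) >> 16) & 0xFFFF

def from_bfloat16_bits(bits16):
    """Expand a bfloat16 bit pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]

for value in (1.0, math.pi, 1.00390625):
    bits = to_bfloat16_bits(value)
    print(f"{value!r:>20} -> 0x{bits:04X} -> {from_bfloat16_bits(bits)!r}")
```

The dynamic range matches FP32 because the exponent width is unchanged, while precision drops to about three significant decimal digits, as the conversion of pi to 3.140625 illustrates.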

Date & Time for online presentation

2021-Jan-22, 10:25-10:50

Workshop: (Intel eXtreme Performance Users Group, IXPUG)

Invited Talk: oneAPI Industry Initiative for Accelerated Computing

Joe Curley (Intel Corporation)

(TBD)

Date & Time for online presentation

2021-Jan-22, 11:15-11:50

Workshop: (Intel eXtreme Performance Users Group, IXPUG)

Single-Precision Calculation of Iterative Refinement of Eigenpairs of a Real Symmetric-Definite Generalized Eigenproblem by Using a Filter Composed of a Single Resolvent

Hiroshi Murakami (Tokyo Metropolitan University)

By using a filter, we calculate approximate eigenpairs of a real symmetric-definite generalized eigenproblem 𝐴v = 𝜆𝐵v whose eigenvalues are in a specified interval. In the experiments in this paper, the IEEE-754 single-precision floating-point (32-bit binary) number system is used for calculations. In general, a filter is constructed by using several resolvents R(𝜌) with different shifts 𝜌. For a given vector x, the action of a resolvent y := R(𝜌)x is obtained by solving a system of linear equations 𝐶(𝜌)y = 𝐵x for y, where the coefficient matrix 𝐶(𝜌) = 𝐴 − 𝜌𝐵 is symmetric. We assume that this system of linear equations is solved by matrix factorization of 𝐶(𝜌), for example by the modified Cholesky method (𝐿𝐷𝐿^𝑇 decomposition method). When both matrices 𝐴 and 𝐵 are banded, 𝐶(𝜌) is also banded and the modified Cholesky method for banded systems can be used to solve the system of linear equations. The filter we use is either a polynomial of a resolvent with a real shift, or a polynomial of the imaginary part of a resolvent with an imaginary shift. We use only a single resolvent to construct the filter in order to reduce both the amount of calculation needed to factor matrices and, especially, the storage needed to hold the matrix factors. The main disadvantage of using only a single resolvent rather than many is that such a filter has poor properties, especially when computation is performed in single precision. Therefore, the required approximate eigenpairs are not obtained with good accuracy if they are extracted from the set of vectors made by one application of a combination of 𝐵-orthonormalization and filtering to a set of initial random vectors. However, experiments show that the required approximate eigenpairs are refined well if they are extracted from the set of vectors obtained by a few applications of a combination of 𝐵-orthonormalization and filtering to a set of initial random vectors.
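A short derivation, using only the definitions given in this abstract, shows why a resolvent can act as a filter. For an eigenpair (𝜆, v) and a shift 𝜌 that is not an eigenvalue:

```latex
A v = \lambda B v
\;\Longrightarrow\;
(A - \rho B)\, v = (\lambda - \rho)\, B v
\;\Longrightarrow\;
R(\rho)\, v = C(\rho)^{-1} B v = \frac{1}{\lambda - \rho}\, v .
```

Each eigenvector is therefore scaled by 1/(𝜆 − 𝜌), so components whose eigenvalues lie close to the shift are amplified relative to the others, and a polynomial of the resolvent reshapes this scaling into the desired passband.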

Date & Time for online presentation

2021-Jan-22, 11:50-12:20

Workshop: (Intel eXtreme Performance Users Group, IXPUG)

A Comparison of Parallel Profiling Tools for Programs Utilizing the FFT

Brian Leu (Applied Dynamics International),

Samar Aseeri (KAUST),

Benson K. Muite (Kichakato Kizito)

Performance monitoring is an important component of code optimization. Performance monitoring is also important for the beginning user, but can be difficult to configure appropriately. The overheads of the performance monitoring tools Craypat, FPMP, mpiP, Scalasca and TAU are measured using default configurations likely to be chosen by a novice user, and are shown to be small when profiling Fast Fourier Transform based solvers for the Klein-Gordon equation based on 2decomp&FFT and on FFTE. Performance measurements help explain why, despite FFTE having a more efficient parallel algorithm, it is not always faster than 2decomp&FFT: the compiled single-core FFT is not as fast as that in FFTW, which is used in 2decomp&FFT.

Date & Time for online presentation

2021-Jan-22, 12:20-12:50

Workshop: (Intel eXtreme Performance Users Group, IXPUG)

Efficient Parallel Multigrid Method on Intel Xeon Phi Clusters

Kengo Nakajima (the University of Tokyo),

Balazs Gerofi (RIKEN R-CCS),

Yutaka Ishikawa (RIKEN R-CCS),

Masaaki Horikoshi (Intel Corporation)

The parallel multigrid method is expected to play an important role in scientific computing on exa-scale supercomputer systems for solving large-scale linear equations with sparse matrices. Because solving sparse linear systems is a very memory-bound process, an efficient method for storing coefficient matrices is a crucial issue. In previous work, the authors applied the sliced ELL method to parallel conjugate gradient solvers with multigrid preconditioning (MGCG) for an application on 3D groundwater flow through heterogeneous porous media (pGW3D-FVM), and excellent performance was obtained on large-scale multicore/manycore clusters. In the present work, the authors introduced SELL-C-σ to the MGCG solver, and evaluated the performance of the solver with various types of OpenMP/MPI hybrid parallel programming models on the Oakforest-PACS (OFP) system at JCAHPC using up to 1,024 nodes of Intel Xeon Phi. Because SELL-C-σ is suitable for wide-SIMD architectures, such as Xeon Phi, the improvement in performance over the sliced ELL was more than 20%. This is one of the first examples of SELL-C-σ applied to forward/backward substitutions in the ILU-type smoother of a multigrid solver. Furthermore, the effects of IHK/McKernel have been investigated, and it achieved an 11% improvement on 1,024 nodes.
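The storage layout discussed in this abstract can be illustrated with a small sketch of the SELL-C-σ format (sliced ELL with chunk height C and sorting window σ). Rows are sorted by length inside windows of σ rows, grouped into chunks of C rows, padded to the longest row of each chunk, and stored column-major inside the chunk so that C consecutive entries map onto SIMD lanes. The input representation, function names and test matrix are assumptions for illustration and are unrelated to the authors' code.

```python
def to_sell_c_sigma(rows, C, sigma):
    """Convert rows (each a list of (col, value) pairs) to a SELL-C-sigma layout."""
    n = len(rows)
    # 1) Sort rows by descending length inside each window of sigma rows.
    perm = []
    for start in range(0, n, sigma):
        window = sorted(range(start, min(start + sigma, n)),
                        key=lambda r: len(rows[r]), reverse=True)
        perm.extend(window)
    # 2) Cut the permuted rows into chunks of C rows, padded to the chunk width.
    chunk_ptr, chunk_len, cols, vals = [0], [], [], []
    for start in range(0, n, C):
        chunk = perm[start:start + C]
        width = max(len(rows[r]) for r in chunk)
        chunk_len.append(width)
        # 3) Store column-major: the k-th entries of all C lanes are contiguous.
        for k in range(width):
            for lane in range(C):
                if lane < len(chunk) and k < len(rows[chunk[lane]]):
                    col, val = rows[chunk[lane]][k]
                else:
                    col, val = 0, 0.0  # zero padding
                cols.append(col)
                vals.append(val)
        chunk_ptr.append(len(vals))
    return perm, chunk_ptr, chunk_len, cols, vals

def sell_spmv(perm, chunk_ptr, chunk_len, cols, vals, x, C):
    """Sparse matrix-vector product y = A x on the SELL-C-sigma layout."""
    y = [0.0] * len(perm)
    for ci, width in enumerate(chunk_len):
        base = chunk_ptr[ci]
        acc = [0.0] * C
        for k in range(width):
            for lane in range(C):  # this inner loop is the SIMD-friendly one
                pos = base + k * C + lane
                acc[lane] += vals[pos] * x[cols[pos]]
        for lane in range(C):
            p = ci * C + lane
            if p < len(perm):
                y[perm[p]] = acc[lane]
    return y

rows = [
    [(0, 4.0), (1, -1.0)],
    [(0, -1.0), (1, 4.0), (2, -1.0)],
    [(2, 4.0)],
    [(1, -1.0), (3, 4.0), (4, -1.0)],
    [(3, -1.0), (4, 4.0)],
]
x = [1.0, 2.0, 3.0, 4.0, 5.0]
layout = to_sell_c_sigma(rows, C=2, sigma=4)
y = sell_spmv(*layout, x, C=2)
print(y)
```

Sorting only within windows of σ rows limits the padding inside each chunk while keeping rows near their original position, which is the trade-off the two parameters express.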

Date & Time for online presentation

2021-Jan-22, 12:50-13:05

Workshop: (Intel eXtreme Performance Users Group, IXPUG)

Closing Remarks

Toshihiro Hanawa (the University of Tokyo)

Date & Time for online presentation

2021-Jan-22, 13:05 ~

Copyright ©2020. Korean Society for Computational Science and Engineering. All Rights Reserved.