Scalability of 3D-DFT by block tensor-matrix multiplication on the JUWELS Cluster

Authors: Nitin Malapally, Viacheslav Bolnykh, Estela Suarez, Paolo Carloni, Thomas Lippert, Davide Mandelli


The 3D Discrete Fourier Transform (DFT) is a technique used to solve problems in disparate fields. Nowadays, the commonly adopted implementation of the 3D-DFT is derived from the Fast Fourier Transform (FFT) algorithm. However, evidence indicates that the distributed-memory 3D-FFT algorithm does not scale well due to its use of all-to-all communication. Here, building on the work of Sedukhin et al. [Proceedings of the 30th International Conference on Computers and Their Applications, CATA 2015, pp. 193–200 (Jan. 2015)], we revisit the possibility of improving the scaling of the 3D-DFT by using an alternative approach based on point-to-point communication, albeit at a higher arithmetic complexity. The new algorithm exploits tensor-matrix multiplications on a volumetrically decomposed domain via three specially adapted variants of Cannon's algorithm. It has been implemented here as a C++ library called S3DFT and tested on the JUWELS Cluster at the Jülich Supercomputing Centre. Our implementation of the shared-memory tensor-matrix multiplication attained 88% of the theoretical single-node peak performance. One variant of the distributed-memory tensor-matrix multiplication shows excellent scaling, while the other two show poorer performance, which can be attributed to their intrinsic communication patterns. A comparison of S3DFT with the Intel MKL (iMKL) and FFTW3 libraries indicates that currently iMKL performs best overall, followed in order by FFTW3 and S3DFT. This picture might change with further improvements of the algorithm and/or when running on clusters that use network connections with higher latency, e.g. on cloud platforms.
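The central idea exploited by the abstract — that the 3D-DFT is separable and can be computed as three successive tensor-matrix multiplications, one along each axis of the decomposed domain — can be illustrated with a minimal sketch. The code below is not from the S3DFT library; it is a plain-Python toy (names like `mode_multiply` and `dft3_tensor` are ours) that applies the dense DFT matrix along each of the three modes of a small tensor and checks the result against the direct triple-sum definition of the 3D-DFT:

```python
import cmath

def dft_matrix(n):
    # Forward DFT matrix: F[k][j] = exp(-2*pi*i*k*j / n).
    return [[cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n)]
            for k in range(n)]

def mode_multiply(x, F, mode, n):
    # Tensor-matrix multiplication: contract F with x along axis `mode`.
    y = [[[0j] * n for _ in range(n)] for _ in range(n)]
    for a in range(n):
        for b in range(n):
            for c in range(n):
                s = 0j
                for t in range(n):
                    if mode == 0:
                        s += F[a][t] * x[t][b][c]
                    elif mode == 1:
                        s += F[b][t] * x[a][t][c]
                    else:
                        s += F[c][t] * x[a][b][t]
                y[a][b][c] = s
    return y

def dft3_tensor(x, n):
    # Separable 3D-DFT: one tensor-matrix multiplication per axis.
    F = dft_matrix(n)
    for mode in range(3):
        x = mode_multiply(x, F, mode, n)
    return x

def dft3_direct(x, n):
    # Reference O(n^6) triple-sum definition of the 3D-DFT.
    y = [[[0j] * n for _ in range(n)] for _ in range(n)]
    for k1 in range(n):
        for k2 in range(n):
            for k3 in range(n):
                s = 0j
                for j1 in range(n):
                    for j2 in range(n):
                        for j3 in range(n):
                            s += x[j1][j2][j3] * cmath.exp(
                                -2j * cmath.pi * (k1*j1 + k2*j2 + k3*j3) / n)
                y[k1][k2][k3] = s
    return y

n = 4
x = [[[complex((a + 2*b + 3*c) % 5, 0) for c in range(n)]
      for b in range(n)] for a in range(n)]
y1 = dft3_tensor(x, n)
y2 = dft3_direct(x, n)
err = max(abs(y1[a][b][c] - y2[a][b][c])
          for a in range(n) for b in range(n) for c in range(n))
```

In a distributed setting, each of the three mode-wise contractions becomes a block tensor-matrix multiplication over the volumetric domain decomposition, which is where the three adapted variants of Cannon's algorithm (with their point-to-point shifts) come in.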
