Single-precision complex matrix multiplication is the primary computation of quantum circuit simulation by tensor network contraction.
We improve the throughput of the simulation using SGEMM emulation on Tensor Cores.
Two implementations of SGEMM emulation are available: TF32TCEC and FP16TCEC.
TF32TCEC uses TF32 Tensor Cores, covers the same exponent range as FP32, and can be used as an alternative to SGEMM, whereas FP16TCEC uses FP16 Tensor Cores and supports a narrower exponent range for the input matrices than FP32.
However, FP16TCEC has higher throughput than TF32TCEC.
Therefore, there is a trade-off between the supported exponent range and throughput.
We propose an automatic precision selection method that chooses which Tensor Cores to use for each matrix multiplication.
This method checks the exponent distribution of the input matrices before computing a matrix multiplication.
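The exponent check described above can be illustrated with a small sketch. This is not the criterion used in the paper: the function names `exponent_range` and `select_kernel` are hypothetical, and the thresholds are simply the exponent limits of normal FP16 values, taken here as an assumed stand-in for the range FP16TCEC supports. The sketch picks FP16TCEC only when every nonzero real and imaginary part of both inputs fits in that range, and falls back to TF32TCEC otherwise.

```python
import math

# Assumption: use the exponent limits of normal FP16 numbers (2**-14 .. 2**15)
# as a stand-in for the range FP16TCEC can handle. The real criterion may differ.
FP16_MIN_EXP = -14
FP16_MAX_EXP = 15


def exponent_range(matrix):
    """Return (min, max) base-2 exponent over all nonzero real and imaginary
    parts of a nested-list matrix, or None if every entry is zero."""
    exps = []
    for row in matrix:
        for x in row:
            for part in (x.real, x.imag):
                if part != 0.0:
                    # frexp gives part = m * 2**e with 0.5 <= |m| < 1,
                    # so the IEEE-style exponent is e - 1.
                    exps.append(math.frexp(part)[1] - 1)
    if not exps:
        return None
    return min(exps), max(exps)


def select_kernel(a, b):
    """Hypothetical selector: return "FP16TCEC" when both inputs fit the
    assumed FP16 exponent range, otherwise "TF32TCEC"."""
    for matrix in (a, b):
        rng = exponent_range(matrix)
        if rng is None:
            continue  # an all-zero matrix imposes no constraint
        lo, hi = rng
        if lo < FP16_MIN_EXP or hi > FP16_MAX_EXP:
            return "TF32TCEC"
    return "FP16TCEC"


if __name__ == "__main__":
    a = [[1.0 + 0.5j, 0.25], [0.0, 2.0]]
    b = [[1e-6, 1.0], [1.0, 1.0]]
    print(select_kernel(a, a))
    print(select_kernel(a, b))
```

A real implementation would run this scan on the GPU and might tolerate a small fraction of out-of-range elements instead of requiring all of them to fit; both of those choices are outside what this sketch shows.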
See our paper (ISC High Performance 2023) for more detail.