RCT-Net: TDNN based Speaker Verification with 2D Res2Nets on Frame Level Feature Extractor

Khamsehashari, Razieh; Miao, Fengying; Polzehl, Tim; Möller, Sebastian

SIGNAL 2023, The Eighth International Conference on Advances in Signal, Image and Video Processing

Authors:
Razieh Khamsehashari
Fengying Miao
Tim Polzehl
Sebastian Möller

Keywords: ResNet, Residual blocks, TDNN, RCT-Net, speaker verification, automatic speaker verification (ASV)

Abstract:
In speaker verification, Time Delay Neural Networks (TDNNs) and Residual Networks (ResNets) currently achieve state-of-the-art results. These architectures have very different structural characteristics, and the development of hybrid networks appears to be a promising path forward. In this study, inspired by the combination of Convolutional Neural Network (CNN) blocks and multi-scale architectures, we present a Residual-based CNN TDNN (RCT) system and evaluate the effect of integrating different residual blocks into a TDNN-based structure. We extend the state-of-the-art speaker embedding model for speaker recognition, namely the Emphasized Channel Attention, Propagation, and Aggregation based CNN-TDNN (ECAPA CNN-TDNN), by gradually incorporating the proposed 2D convolutional stem with various bottleneck residual blocks. We evaluate our models on the standard VoxCeleb1-O test set to investigate how residual blocks and TDNNs perform in the speaker verification domain. The proposed models significantly outperform the state of the art, improving the equal error rate (EER) by up to 14.6%.
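The abstract reports its results as EER, so a minimal sketch of how that metric is commonly computed from verification trial scores may help. This is an illustration under stated assumptions, not the authors' evaluation code: the function name `equal_error_rate` and the toy scores are hypothetical, and the threshold sweep is a simple approximation rather than an interpolated ROC-based estimate.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by trying every observed score as a threshold.

    target_scores: scores of same-speaker trials (higher means more similar).
    nontarget_scores: scores of different-speaker trials.
    Both inputs are hypothetical; real evaluations use a full trial list.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("need at least one target and one non-target score")

    best_gap, eer = None, None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        # The EER is where the two error rates meet; keep the closest point.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy example with made-up scores: one target trial scores lower than
# two of the four non-target trials.
toy_targets = [0.9, 0.8, 0.7, 0.3]
toy_nontargets = [0.6, 0.2, 0.1, 0.4]
toy_eer = equal_error_rate(toy_targets, toy_nontargets)
```

For the toy scores above, the two error rates meet at a threshold of 0.6, where one of four target trials is rejected and one of four non-target trials is accepted, giving an EER of 0.25. A lower EER indicates a better verification system.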

Pages: 37 to 42

Copyright: Copyright (c) IARIA, 2023

Publication date: March 13, 2023

Published in: conference

ISSN: 2519-8432

ISBN: 978-1-68558-057-5

Location: Barcelona, Spain

Dates: from March 13, 2023 to March 17, 2023