Baidu Releases Deep Speaker: End-to-End System for Voiceprint Recognition

(Original title: Baidu releases Deep Speaker: an end-to-end system for large-scale voiceprint recognition)

Baidu Research's U.S. lab recently announced a breakthrough in voiceprint recognition. The research shows that a deep-learning approach achieves a significant improvement in recognition accuracy over the traditional i-vector method.

Voiceprint recognition algorithms seek to identify a speaker from audio. Two common tasks are speaker verification (confirming that the speaker is who they claim to be) and speaker identification (determining which of a set of speakers produced a given utterance). The technology already has a variety of applications: a voiceprint can be used to log in to a device, speaker verification can serve as an extra security measure for financial transactions, and shared devices such as smart home assistants can use it to provide personalized service.

Recent papers that apply neural networks to voiceprint recognition have improved on the traditional i-vector approach (see the original paper or the slides from the Interspeech tutorial). The i-vector method assumes that a speech utterance can be decomposed into two parts: one that depends on speaker and channel variability, and one that depends on other factors. i-vector voiceprint recognition is a multi-step pipeline that involves estimating a universal background model (usually a Gaussian mixture model) from data covering many speakers, collecting sufficient statistics, extracting i-vectors, and finally applying a classifier to the recognition task. Some papers replace parts of the i-vector pipeline with neural networks; other researchers have trained end-to-end speaker recognition systems that are either text-dependent (users must all say the same phrase) or text-independent (the model does not depend on what is said).

Deep Speaker is an end-to-end neural voiceprint recognition system that performs well in both text-dependent and text-independent scenarios. This means the system can be trained to recognize who is speaking, whether you are saying a wake word to your home assistant or speaking in a meeting. Deep Speaker consists of deep neural network layers that extract features from audio, temporal pooling, and a triplet loss based on cosine similarity. Baidu Research explored the effect of ResNet-style convolutional models and recurrent models on acoustic feature extraction.

Note: Baidu Research uses the triplet loss that has been applied in face recognition. During training, they choose an utterance from a speaker and compute an embedding (labeled "Anchor"). Two more embeddings are then produced, one from the same speaker (labeled "Positive") and one from a different speaker (labeled "Negative"). The training objective is to make the cosine similarity between the Anchor and the Positive embedding higher than the cosine similarity between the Anchor and the Negative embedding.
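The interaction between cosine similarity and the triplet loss is easy to see in code. Below is a minimal sketch, not Baidu's released implementation: the margin value, the feature dimensions, and the simple average pooling standing in for Deep Speaker's temporal pooling layer are all illustrative assumptions.

```python
# Minimal sketch of a cosine-similarity triplet loss (illustrative, not Baidu's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineTripletLoss(nn.Module):
    def __init__(self, margin: float = 0.1):
        super().__init__()
        self.margin = margin  # assumed margin value, chosen for illustration

    def forward(self, anchor, positive, negative):
        # Length-normalize so the dot product equals cosine similarity.
        anchor = F.normalize(anchor, dim=-1)
        positive = F.normalize(positive, dim=-1)
        negative = F.normalize(negative, dim=-1)
        sim_ap = (anchor * positive).sum(dim=-1)  # cos(Anchor, Positive)
        sim_an = (anchor * negative).sum(dim=-1)  # cos(Anchor, Negative)
        # Penalize triplets where the Positive is not at least `margin` more similar.
        return F.relu(self.margin + sim_an - sim_ap).mean()

# Toy usage: frame-level features -> temporal average pooling -> utterance embedding.
frames = torch.randn(3, 200, 64)           # (utterances, frames, feature_dim), assumed shapes
utterance_embedding = frames.mean(dim=1)    # simple stand-in for temporal pooling
loss = CosineTripletLoss()(utterance_embedding[0:1],   # Anchor
                           utterance_embedding[1:2],   # Positive
                           utterance_embedding[2:3])   # Negative
print(loss.item())
```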
Baidu Research demonstrated the effectiveness of Deep Speaker on three different datasets, covering both text-dependent and text-independent tasks. One of them, the UIDs dataset, contains approximately 250,000 speakers and is, to the team's knowledge, the largest such dataset reported in the literature. Experimental results show that Deep Speaker clearly outperforms the DNN-based i-vector method. For example, on a text-independent dataset with 100 randomly selected candidate speakers, Deep Speaker achieved a 1.83% equal error rate on speaker verification and 92.58% accuracy on speaker identification, a roughly 50% relative reduction in error rate and a 60% relative improvement in accuracy compared with the DNN-based i-vector method.

Explanation: the datasets used in the experiments were UIDs, XiaoDu, and MTurk. UIDs and XiaoDu are Mandarin datasets, while MTurk is an English dataset. UIDs and MTurk are text-independent; XiaoDu is a text-dependent dataset based on Baidu's wake word. To experiment with training sets of different sizes, the team used both the full UIDs dataset (250,000 speakers) and a 50,000-speaker subset. In the evaluation phase, they selected an Anchor and then randomly drew one positive sample and 99 negative samples from the test set.

The team also found that Deep Speaker learns language-independent features. When trained only on Mandarin, Deep Speaker reaches a 5.57% error rate and 88% accuracy on English verification and identification tasks. In addition, training on Mandarin first and then on English improves English recognition accuracy compared with training on English alone. These results show that even though different languages sound very different, Deep Speaker learns cross-language voice characteristics. The results echo those of Deep Speech 2, where the same architecture can be used to recognize speech in different languages.
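The identification protocol described above (one positive and 99 negatives per Anchor) can be illustrated with a short script. The sketch below uses randomly generated stand-in embeddings rather than real Deep Speaker outputs; the embedding dimension, the number of trials, and the perturbation used to fake a same-speaker pair are assumptions made for the example.

```python
# Sketch of the 1-positive / 99-negative identification evaluation (assumed details).
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def identification_trial(anchor, positive, negatives):
    """Return 1 if the positive has the highest cosine similarity to the anchor, else 0."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    return int(np.argmax(sims) == 0)

dim = 512  # assumed embedding dimension, not the paper's setting
trials = []
for _ in range(1000):
    anchor = rng.normal(size=dim)
    positive = anchor + 0.5 * rng.normal(size=dim)   # same-speaker embedding, perturbed
    negatives = rng.normal(size=(99, dim))           # 99 different-speaker embeddings
    trials.append(identification_trial(anchor, positive, negatives))

print("top-1 identification accuracy:", np.mean(trials))
```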