DeepSegmentor
Sequence Segmentation using Joint RNN and Structured Prediction Models (ICASSP 2017)
We describe and analyze a simple and effective algorithm for sequence segmentation applied to speech processing tasks. We propose a neural architecture that is composed of two modules trained jointly: a recurrent neural network (RNN) module and a structured prediction model. The RNN outputs are considered as feature functions to the structured model. The overall model is trained with a structured loss function which can be tailored to the given segmentation task. We demonstrate the effectiveness of our method by applying it to two simple tasks commonly used in phonetic studies: word segmentation and voice onset time segmentation. Results suggest the proposed model is superior to previous methods, obtaining state-of-the-art results on the tested datasets.
Sequence segmentation is an important task for many speech and audio applications such as speaker diarization, laboratory phonology research, speech synthesis, and automatic speech recognition (ASR). Segmentation models can be used as a pre-processing step to clean the data (e.g., removing non-speech regions such as music or noise to reduce ASR error [1, 2]). They can also be used as tools in clinically- or theoretically-focused phonetic studies that utilize acoustic properties as a dependent measure. For example, voice onset time, a key feature distinguishing voiced and voiceless consonants across languages [3], is important in ASR [4], clinical [5], and theoretical [6] studies.
Previous work on speech sequence segmentation focuses on generative models such as hidden Markov models (see for example [7] and the references therein); on discriminative methods [2, 8, 9]; or on deep learning [10, 11]. Inspired by the recent work on combined deep network and structured prediction models [12, 13, 14, 15, 16], we would like to further improve performance on speech sequence segmentation and propose a new, efficient joint deep network and structured prediction model. Specifically, we jointly optimize RNN and structured loss parameters by using RNN outputs as feature functions for a structured prediction model. First, an RNN encodes the entire speech utterance and outputs a new representation for each of the frames. Then, an efficient search is applied over all possible segments so that the most probable one can be selected. We evaluate this approach using two tasks: word segmentation and voice onset time segmentation. In both tasks the input is a speech segment and the goal is to determine the boundaries of the defined event. We show that the proposed approach outperforms previous methods on these two segmentation tasks.
In the problem of speech segmentation we are provided with a speech utterance, denoted as $\bar{x} = (x_1, \ldots, x_T)$, represented as a sequence of acoustic feature vectors, where each $x_t \in \mathbb{R}^D$ ($1 \le t \le T$) is a $D$-dimensional vector. The length of the speech utterance, $T$, is not a fixed value, since the input utterances can have different durations.
Each input utterance is associated with a timing sequence, denoted by $\bar{y} = (y_1, \ldots, y_p)$, where $p$ can vary across different inputs. Each element $y_i \in \mathcal{Y}$, where $\mathcal{Y} = \{1, \ldots, T\}$, indicates the start time of a new event in the speech signal. We denote the set of all possible timing sequences of size $p$ by $\mathcal{Y}^p$.
For example, in word segmentation the goal is to segment a word from silence and noise in the signal. In this case the size of $\bar{y}$ is 2, namely word onset and offset. However, in phoneme segmentation the goal is to segment every phoneme in a spoken word. In this case the size of $\bar{y}$ is different for each input sequence.
Generally, our method is suitable for different sequence sizes $p$. In this paper we focus on $p = 2$, and leave the problem of $p > 2$ to future work.
We now describe our model in greater detail. First, we present the structured prediction framework and then discuss how it is combined with an RNN.
We consider the following prediction rule with weight vector $w$, such that $\bar{y}'_w(\bar{x})$ is a good approximation to the true label of $\bar{x}$:
$$\bar{y}'_w(\bar{x}) = \arg\max_{\bar{y} \in \mathcal{Y}^p} w^\top \phi(\bar{x}, \bar{y})$$
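The prediction rule above can be illustrated with a small brute-force search. This is only a sketch under stated assumptions: frames are 0-indexed, candidate pairs are restricted to onset < offset, and `toy_phi` is a hypothetical feature map standing in for $\phi(\bar{x}, \bar{y})$ (it closes over the input rather than taking it as an argument).

```python
from itertools import combinations

def predict(w, phi, num_frames):
    """Return the (onset, offset) pair that maximizes the linear score w . phi(y)."""
    def score(y):
        return sum(wk * fk for wk, fk in zip(w, phi(y)))
    # Exhaustive search over all pairs with onset < offset (an assumed constraint).
    return max(combinations(range(num_frames), 2), key=score)

def toy_phi(y):
    """Hypothetical feature map: negative distance of each boundary from frames 3 and 7."""
    onset, offset = y
    return [-abs(onset - 3), -abs(offset - 7)]

print(predict([1.0, 1.0], toy_phi, num_frames=10))  # (3, 7)
```

Exhaustive search is quadratic in the number of frames for two boundaries, which is affordable for short utterances; nothing here is claimed about the search procedure the authors actually used.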
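Following the structured prediction framework, we assume there exists some unknown probability distribution $\rho$ over pairs $(\bar{x}, \bar{y})$, where $\bar{y}$ is the desired output (or reference output) for input $\bar{x}$. Both $\bar{x}$ and $\bar{y}$ are usually structured objects such as sequences, trees, etc. Our goal is to set $w$ so as to minimize the expected cost, or the risk,
$$w^* = \arg\min_{w} \mathbb{E}_{(\bar{x}, \bar{y}) \sim \rho}\left[\ell(\bar{y}, \bar{y}'_w(\bar{x}))\right] \qquad (1)$$
where $\ell(\bar{y}, \bar{y}')$ denotes the cost of predicting $\bar{y}'$ when the reference output is $\bar{y}$.
This objective function is hard to minimize directly since the distribution $\rho$ is unknown. We use a training set $S = \{(\bar{x}_1, \bar{y}_1), \ldots, (\bar{x}_m, \bar{y}_m)\}$ of $m$ examples that are drawn i.i.d. from $\rho$, and replace the expectation in (1) with a mean over the training set.
The cost is often a combinatorial non-convex quantity, which is hard to minimize. Hence, instead of minimizing the cost directly, we minimize a slightly different function called a surrogate loss, denoted $\bar{\ell}(w, \bar{x}, \bar{y})$, and closely related to the cost. Overall, the objective function in (1) transforms into the following objective function, denoted as $F$:
$$F(w, \bar{x}, \bar{y}) = \frac{1}{m} \sum_{i=1}^{m} \bar{\ell}(w, \bar{x}_i, \bar{y}_i)$$
In this work the surrogate loss function is the structural hinge loss [17], defined as
$$\bar{\ell}(w, \bar{x}, \bar{y}) = \max_{\bar{y}' \in \mathcal{Y}^p} \left[\ell(\bar{y}, \bar{y}') - w^\top \phi(\bar{x}, \bar{y}) + w^\top \phi(\bar{x}, \bar{y}')\right]$$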
Usually, $\phi$ is manually chosen using data analysis techniques and involves manipulation of local and global features. In the next subsection we describe how to use an RNN as feature functions.
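As an illustration of the structural hinge loss named above, the following sketch evaluates it by enumerating every candidate segmentation. It assumes the standard form of that loss (the maximum over candidates of cost minus true score plus candidate score); `toy_phi` and `boundary_cost` are hypothetical stand-ins, not the paper's feature functions or cost.

```python
from itertools import combinations

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def structural_hinge(w, phi, cost, y_true, num_frames):
    """Return (loss, most violating y') for max_y' [cost(y, y') - w.phi(y) + w.phi(y')]."""
    true_score = dot(w, phi(y_true))
    best_val, best_y = None, None
    for y in combinations(range(num_frames), 2):  # all pairs with onset < offset
        val = cost(y_true, y) - true_score + dot(w, phi(y))
        if best_val is None or val > best_val:
            best_val, best_y = val, y
    return best_val, best_y

def toy_phi(y):
    """Hypothetical features: negative distance of each boundary from frames 3 and 7."""
    return [-abs(y[0] - 3), -abs(y[1] - 7)]

def boundary_cost(y_true, y_pred):
    """Hypothetical cost: total boundary displacement in frames."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred))

# With zero weights the loss equals the largest possible cost.
print(structural_hinge([0.0, 0.0], toy_phi, boundary_cost, (3, 7), 10))
# With sufficiently large weights the true segmentation wins by a margin and the loss is 0.
print(structural_hinge([2.0, 2.0], toy_phi, boundary_cost, (3, 7), 10)[0])
```

Because the true segmentation is itself a candidate with zero cost, the loss computed this way is never negative.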
An RNN is a deep network architecture that can model the behavior of dynamic temporal sequences using an internal state, which can be thought of as memory [18, 19]. An RNN can predict the current frame label based on the previous frames. A bidirectional RNN is a model composed of two RNNs: the first is a standard RNN, while the second reads the input backwards. Such a model can predict the current frame based on both past and future frames. By using the RNN outputs as feature functions, we can jointly train the structured and network models.
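To make the bidirectional idea concrete, here is a deliberately tiny sketch with a scalar hidden state and fixed, hand-picked weights. It is an assumption-laden toy (a simple Elman-style recurrence, not the LSTMs used in the experiments) meant only to show how each frame receives one value summarizing the past and one summarizing the future.

```python
import math

def rnn_pass(xs, w_in, w_rec):
    """Scalar-state recurrence: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h, out = 0.0, []
    for x in xs:
        h = math.tanh(w_in * x + w_rec * h)
        out.append(h)
    return out

def bidirectional(xs, fwd_weights, bwd_weights):
    """Pair each frame's forward state with the state of an RNN run over the reversed input."""
    forward = rnn_pass(xs, *fwd_weights)
    backward = rnn_pass(xs[::-1], *bwd_weights)[::-1]  # re-align to original frame order
    return list(zip(forward, backward))

# An impulse at frame 0: the forward state at frame 2 still remembers it,
# while the backward state at frame 2 has only seen frame 2 itself.
frames = bidirectional([1.0, 0.0, 0.0], (1.0, 0.5), (1.0, 0.5))
for t, (f, b) in enumerate(frames):
    print(t, round(f, 3), round(b, 3))
```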
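Recall our prediction rule from the structured prediction framework: notice that $\phi(\bar{x}, \bar{y})$ can be viewed as $\sum_{i=1}^{p} \phi'(\bar{x}, y_i)$, where each $\phi'$ can be extracted using different techniques, e.g., hand-crafted features, feed-forward neural networks, RNNs, etc. We can formulate the prediction rule as follows:
$$\bar{y}'_w(\bar{x}) = \arg\max_{\bar{y} \in \mathcal{Y}^p} w^\top \phi(\bar{x}, \bar{y}) = \arg\max_{\bar{y} \in \mathcal{Y}^p} w^\top \sum_{i=1}^{p} \phi'(\bar{x}, y_i) = \arg\max_{\bar{y} \in \mathcal{Y}^p} w^\top \sum_{i=1}^{p} \mathrm{RNN}(\bar{x}, y_i)$$
where the RNN can be of any type and architecture. For example, we can use a bidirectional RNN and consider $\phi'$ as the concatenation of the forward and backward outputs. This is depicted in Figure 1. We call our model DeepSegmentor.
Our goal is to find the model parameters so as to minimize the risk as in Eq. (1). Recall that we use the structural hinge loss function; since both the loss function and the RNN are differentiable, we can optimize them using gradient-based methods such as stochastic gradient descent (SGD). In order to optimize the network parameters using the back-propagation algorithm [20], we must find the outer derivative of each layer with respect to the model parameters and inputs.
The derivative of the loss layer with respect to the layer parameters $w$ for the training example $(\bar{x}_i, \bar{y}_i)$ is
$$\frac{\partial \bar{\ell}}{\partial w} = \phi(\bar{x}_i, \bar{y}^{\ell}_w) - \phi(\bar{x}_i, \bar{y}_i)$$
where
$$\bar{y}^{\ell}_w = \arg\max_{\bar{y} \in \mathcal{Y}^p} \left[ w^\top \phi(\bar{x}_i, \bar{y}) + \ell(\bar{y}_i, \bar{y}) \right] \qquad (2)$$
Similarly, the derivatives with respect to the layer's inputs are
$$\frac{\partial \bar{\ell}}{\partial \phi(\bar{x}_i, \bar{y}^{\ell}_w)} = w, \qquad \frac{\partial \bar{\ell}}{\partial \phi(\bar{x}_i, \bar{y}_i)} = -w$$
The derivatives of the rest of the layers are the same as in an RNN model.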
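The loss-layer derivatives can be sketched as follows, assuming the standard subgradient of the structural hinge loss: the feature difference between the most violating segmentation and the reference for the weights, and plus or minus the weight vector for the per-frame inputs. Here `h` is a hypothetical list of per-frame vectors standing in for the RNN outputs, `phi` sums them over the boundary frames, and the cost is a placeholder.

```python
from itertools import combinations

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(h, y):
    """Sum of per-boundary features; h[t] stands in for the RNN output at frame t."""
    return [sum(h[t][k] for t in y) for k in range(len(h[0]))]

def hinge_gradients(w, h, y_true, cost):
    # Loss-augmented inference: the most violating segmentation under the current w.
    y_hat = max(combinations(range(len(h)), 2),
                key=lambda y: cost(y_true, y) + dot(w, phi(h, y)))
    # Subgradient w.r.t. w: phi(x, y_hat) - phi(x, y_true).
    grad_w = [a - b for a, b in zip(phi(h, y_hat), phi(h, y_true))]
    # Subgradient w.r.t. the per-frame inputs: +w at predicted frames, -w at true frames.
    grad_h = [[0.0] * len(w) for _ in h]
    for t in y_hat:
        grad_h[t] = [g + wk for g, wk in zip(grad_h[t], w)]
    for t in y_true:
        grad_h[t] = [g - wk for g, wk in zip(grad_h[t], w)]
    return y_hat, grad_w, grad_h

def boundary_cost(y_true, y_pred):
    """Placeholder cost: total boundary displacement in frames."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred))

h = [[0.5], [1.0], [0.0], [1.0], [0.0]]  # five frames, one feature each (toy values)
y_hat, grad_w, grad_h = hinge_gradients([1.0], h, (1, 3), boundary_cost)
print(y_hat, grad_w, grad_h)
```

In a full model, `grad_h` is what would be passed back into the recurrent layers; frames shared by the predicted and reference segmentations receive contributions that cancel.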
We investigate two segmentation problems: word segmentation and voice onset time segmentation. We describe each of them in detail in the following subsections. All models were implemented using the Torch7 toolkit [21, 22].
In the problem of word segmentation we are provided with a speech utterance which contains a single word; our goal is to predict its start and end times. The ability to determine these timings is crucial to phonetic studies that measure speaker properties (e.g. response time [23]) or as a preprocessing step for other phonetic analysis tools [11, 10, 9, 8, 24].
Our dataset comes from a laboratory study by Fink and Goldrick [23]. Native English speakers were shown a set of 90 pictures. Some participants produced the name of the picture (e.g., saying “cat”, “chair”) while others performed a semantic classification task (e.g., saying “natural”, “man-made”). Productions other than the intended response or disfluencies were excluded. Recordings were randomly assigned to two transcribers who annotated the onset and offset of each word. We analyze a subset of the recordings, including data from 60 participants, evenly distributed across tasks.
We compare our model to an RNN that was trained using the Negative-Log-Likelihood (NLL) loss. The NLL model makes a binary decision in every frame to predict whether there is voice activity or not. Recall that our goal is to find the start and end times of the word; in this task, the RNN leaves us with a distribution over all possible onsets. To account for this, we apply a smoothing algorithm and find the most probable pair of timings.
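The text does not specify the smoothing algorithm or the search used for the NLL baseline, so the following is only one plausible sketch: a moving average over the frame-level speech probabilities, followed by a maximum-sum-subarray search for the single contiguous speech segment with the highest log-likelihood. The window size, the `eps` clipping, and the exclusive offset convention are all assumptions.

```python
import math

def smooth(probs, radius=1):
    """Moving average over a window of 2*radius+1 frames (truncated at the edges)."""
    out = []
    for t in range(len(probs)):
        window = probs[max(0, t - radius): t + radius + 1]
        out.append(sum(window) / len(window))
    return out

def best_segment(probs, eps=1e-6):
    """Return (onset, offset) maximizing the log-likelihood of 'speech inside
    [onset, offset), non-speech outside' under independent per-frame posteriors."""
    # gain[t]: log-likelihood change from labeling frame t speech instead of non-speech.
    gain = [math.log(max(p, eps)) - math.log(max(1 - p, eps)) for p in probs]
    best, best_pair = -math.inf, None
    running, start = 0.0, 0
    for t, g in enumerate(gain):
        if running <= 0:  # a non-positive prefix never helps: restart the segment here
            running, start = 0.0, t
        running += g
        if running > best:
            best, best_pair = running, (start, t + 1)
    return best_pair

posteriors = [0.1, 0.2, 0.9, 0.8, 0.3, 0.9, 0.1, 0.1]  # toy frame-level speech probabilities
print(best_segment(posteriors))
print(best_segment(smooth(posteriors)))
```

The two calls can disagree, which shows how the choice of smoothing shifts the decoded boundaries.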
We trained the DeepSegmentor model using the structured loss function in (3), denoted as the Combined Duration (CD) loss. This choice is motivated by disparities in the manual annotations, which are common and stem both from human error and from objective difficulties in placing the boundaries. Hence we chose a loss function that takes into account the variations in the annotations:
$$\ell(\bar{y}, \bar{y}') = \left[\,|y_1 - y'_1| - \tau\,\right]_+ + \left[\,|y_2 - y'_2| - \tau\,\right]_+ \qquad (3)$$
where $[\pi]_+ = \max\{0, \pi\}$, and $\tau$ is a user defined tolerance parameter.
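A tolerance-based boundary cost of this kind might look like the sketch below. It assumes the cost sums, over onset and offset, the deviation that exceeds the tolerance; this assumption is consistent with the CD row of Table 1 being the sum of the Onset and Offset rows, but the exact form and the units of the tolerance are not confirmed by the text.

```python
def cd_loss(y_true, y_pred, tau=0):
    """Sum over boundaries of max(0, |true - predicted| - tau).

    Deviations within the tolerance tau are not penalized, which absorbs
    small disagreements between annotators.
    """
    return sum(max(0, abs(a - b) - tau) for a, b in zip(y_true, y_pred))

# Onset off by 2 (within tolerance), offset off by 5 (3 beyond tolerance).
print(cd_loss((10, 50), (12, 45), tau=2))  # 3
```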
We use two layers of bidirectional LSTMs for the DeepSegmentor model with dropout [25] after each recurrent layer. We extract 13 Mel-Frequency Cepstral Coefficients (MFCCs), without the deltas, every 10 ms, and use them as inputs to the network. We optimize the networks using AdaGrad [26]. All parameters were tuned on a dedicated development set for both of the models. As for the NLL models, we trained 4 different models: LSTM with one and two layers, and bidirectional LSTM with one and two layers, denoted as RNN, 2-RNN, BI-RNN and BI-2-RNN, respectively. Table 1 summarizes the results for both models.
Measure | RNN | 2-RNN | BI-RNN | BI-2-RNN | DeepSeg. |
---|---|---|---|---|---|
Onset | 6.0 | 5.84 | 2.88 | 3.48 | 2.02 |
Offset | 9.43 | 8.92 | 4.46 | 3.75 | 3.96 |
CD | 15.42 | 14.76 | 7.35 | 7.24 | 5.98 |
Besides being efficient and more elegant, DeepSegmentor is superior to the NLL models when measuring (3) on the onset and CD measures. The exception is the offset measure, where BI-2-RNN was slightly better (3.75 vs. 3.96).
Voice onset time (VOT) is the time between the onset of a stop burst and the onset of voicing. As noted in the introduction, it is widely used in theoretical and clinical studies as well as ASR tasks. In this problem the input is a speech utterance containing a single stop consonant, and the output is the VOT onset and offset times.
We compared our model to two other methods for VOT measurement. The first is the AutoVOT algorithm [9], which follows the structured prediction approach of a linear classifier with hand-crafted features and feature functions. The second is the DeepVOT algorithm [11], which uses RNNs with NLL as the loss function. Hence, it predicts for each frame whether it is related to the VOT or not. Using the RNN predictions, a dynamic programming algorithm is applied to find the best onset and offset times. Our approach combines both of these methods by jointly training an RNN with a structured loss function.
We use two different datasets. The first one, pgwords, is from a laboratory study by Paterson and Goldrick [6]. American English monolinguals and Brazilian Portuguese (L1)-English bilinguals (24 participants each) named a set of 144 pictures. Productions other than the intended label as well as those with code-switching or disfluencies were excluded. VOT of the remaining words was annotated by one transcriber.
For the pgwords dataset we use two layers of bidirectional LSTMs with dropout after each recurrent layer. We use (3) as our loss function. The input features are the same as in [9, 11]; overall we have 63 features per frame. We optimize the networks using AdaGrad. All parameters were tuned on a dedicated development set. Table 2 summarizes the results using the same loss function as in [9]. The results suggest that DeepSegmentor outperforms the AutoVOT model at all tolerance values. However, when comparing to DeepVOT, the picture is mixed: at the lower tolerance values (2 and 5) DeepSegmentor is superior to DeepVOT, while at higher values DeepVOT performs better. We believe this is because DeepVOT is less fine-grained and solves a much coarser problem than DeepSegmentor; hence, it performs better when considering high tolerance values. We believe that integrating the two systems (using DeepVOT as pre-training for DeepSegmentor) will yield more accurate and robust results. We leave this investigation for future work.
Model | 2 | 5 | 10 | 15 | 25 | 50 |
---|---|---|---|---|---|---|
AutoVOT | 49.1 | 81.3 | 93.9 | 96.0 | 97.2 | 98.1 |
DeepVOT | 53.8 | 91.6 | 97.6 | 98.7 | 99.6 | 100 |
DeepSeg. | 78.2 | 94.1 | 97.1 | 98.6 | 99.1 | 99.4 |
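One way to compute figures of the kind shown in the table is sketched below. It assumes each column reports the percentage of tokens whose predicted VOT duration differs from the manual one by at most the given tolerance; that reading, the function name, and the toy data are assumptions and are not taken from the paper.

```python
def accuracy_at_tolerances(true_pairs, pred_pairs, tolerances=(2, 5, 10, 15, 25, 50)):
    """Percentage of examples whose duration error is within each tolerance.

    Each pair is (onset, offset); the duration is offset - onset.
    """
    diffs = [abs((p_off - p_on) - (t_off - t_on))
             for (t_on, t_off), (p_on, p_off) in zip(true_pairs, pred_pairs)]
    return {tol: 100.0 * sum(d <= tol for d in diffs) / len(diffs) for tol in tolerances}

# Toy annotations and predictions (made up for illustration only).
manual = [(10, 40), (20, 80), (5, 25), (30, 45)]
predicted = [(11, 40), (20, 86), (5, 37), (0, 75)]
print(accuracy_at_tolerances(manual, predicted))
```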
For the bb dataset we use two layers of LSTMs with dropout after each recurrent layer. We experimented with bidirectional LSTMs as well, but the forward-only LSTM performs better on this dataset. We use (3) as our loss function. We use the same features as in [9, 11]; overall we have 51 features per frame. We optimize the networks using AdaGrad. All parameters were tuned on a dedicated development set. Table 3 summarizes the results using the loss function as in [9]. It is worth noting that we see the same behavior on this dataset as well: DeepVOT performs better than DeepSegmentor at high tolerance values.
Model | 2 | 5 | 10 | 15 | 25 | 50 |
---|---|---|---|---|---|---|
AutoVOT | 59.1 | 80.5 | 89.9 | 94.3 | 96.8 | 98.1 |
DeepVOT | 60.3 | 84.2 | 94.3 | 94.9 | 98.1 | 98.7 |
DeepSeg. | 64.8 | 85.5 | 94.3 | 95.0 | 96.2 | 97.5 |
Future work will explore timing sequences of length greater than 2, for instance in phoneme segmentation, where the sequence length varies across training examples. The model's robustness to noise and length, as well as its ability to generalize, are also key areas of future development. We would therefore like to explore training the model in two stages: first as a multi-class version and then fine-tuning using the structured loss. With respect to machine learning, future directions include studying the effect of network size, depth, and loss function on model performance.
In this paper we present a new algorithm for speech segmentation and evaluate its performance on two different tasks. The proposed algorithm combines a structured loss function with recurrent neural networks and outperforms current state-of-the-art methods.
“Neural architectures for named entity recognition,” arXiv preprint, 2016.
“Distributed representations, simple recurrent networks, and grammatical structure,” Machine learning, vol. 7, no. 2-3, pp. 195–225, 1991.
“rnn: Recurrent library for torch,” arXiv preprint, 2015.
“Adaptive subgradient methods for online learning and stochastic optimization,” JMLR, vol. 12, pp. 2121–2159, 2011.