Deep Bottleneck Features for Spoken Language Identification Bing Jiang1, Yan Song1*, Si Wei2, Jun-Hua Liu2, Ian Vince McLoughlin1, Li-Rong Dai1 1 National Engineering Laboratory of Speech and Language Information Processing, University of Science and Technology of China, Hefei, AnHui, China, 2 iFlytek Research, Anhui USTC iFlytek Co., Ltd., Hefei, AnHui, China Abstract A key problem in spoken language identification (LID) is to design effective representations which are specific to language information. For example, in recent years, representations based on both phonotactic and acoustic features have proven their effectiveness for LID. Although advances in machine learning have led to significant improvements, LID performance is still lacking, especially for short duration speech utterances. With the hypothesis that language information is weak and represented only latently in speech, and is largely dependent on the statistical properties of the speech content, existing representations may be insufficient. Furthermore they may be susceptible to the variations caused by different speakers, specific content of the speech segments, and background noise. To address this, we propose using Deep Bottleneck Features (DBF) for spoken LID, motivated by the success of Deep Neural Networks (DNN) in speech recognition. We show that DBFs can form a low-dimensional compact representation of the original inputs with a powerful descriptive and discriminative capability. To evaluate the effectiveness of this, we design two acoustic models, termed DBF-TV and parallel DBF-TV (PDBF-TV), using a DBF based i-vector representation for each speech utterance. Results on NIST language recognition evaluation 2009 (LRE09) show significant improvements over state-of-the-art systems. By fusing the output of phonotactic and acoustic approaches, we achieve an EER of 1.08%, 1.89% and 7.01% for 30 s, 10 s and 3 s test utterances respectively. Furthermore, various DBF configurations have been extensively evaluated, and an optimal system proposed. Citation: Jiang B, Song Y, Wei S, Liu J-H, McLoughlin IV, et al. (2014) Deep Bottleneck Features for Spoken Language Identification. PLoS ONE 9(7): e100795. doi:10.1371/journal.pone.0100795 Editor: Donald A. Robin, University of Texas Health Science Center at San Antonio, Research Imaging Institute, United States of America Received March 11, 2014; Accepted May 29, 2014; Published July 1, 2014 Copyright: ß 2014 Jiang et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Funding: This work was partially funded by the National Nature Science Foundation of China (Grant No. 61273264), the National 973 program of China (Grant No. 2012CB326405) and Chinese Universities Scientific Fund (Grant No. Wk2100060008). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing Interests: Authors Si Wei and Jun-Hua Liu are employed by Anhui USTC iFlytek Co, which is a private spin-off company from the authors’ university. There are no patents, products in development or marketed products to declare. This does not alter the authors’ adherence to all the PLOS ONE policies on sharing data and materials. 
Introduction

Language identification (LID) is the task of determining the identity of the spoken language present within a speech utterance. LID is a key pre-processing technique for future multi-lingual speech processing systems, such as audio and video information retrieval, automatic machine translation, diarization, multi-lingual speech recognition, intelligent surveillance and so on. A major problem in LID is how to design a language specific and effective representation for speech utterances. It is challenging due to the large variations introduced by different speech content, speakers, channels and background noises. Over the past few decades, intensive research efforts have studied the effectiveness of different representations from various research domains, such as phonotactic and acoustic information [1-3], lexical knowledge [4], prosodic information [5], articulatory parameters [6], and universal attributes [7]. Among existing representations, Eady [5], Matrouf et al. [4] and Kirchoff et al. [6] show that appropriate incorporation of extra language-related cues may help to improve the effectiveness of the representation. In this paper, we mainly focus on the phonotactic and acoustic representations, which are considered to be the most common ones for LID [8,9].

Phonotactic representations focus on capturing the statistics of phonemic constraints and patterns for each language. The phonotactic representation of a given utterance is typically the token sequence or lattice output from a phone recognizer (PR). The corresponding approaches, such as Parallel Phone Recognizers followed by Language Models (PPR-LM) [3] and Parallel Phone Recognizers followed by Support Vector Machines (PPR-SVM) [10,11], have achieved state-of-the-art performance. However, the effectiveness of such representations relies heavily on the performance of the phone recognizer [12]. When the labelled dataset size is limited, it is difficult to achieve good PR results. Furthermore, the recognition stage is time consuming, which constrains the wide applicability of phonotactic approaches.

By contrast, acoustic representations mainly capture the spectral feature distribution for each language, which is more efficient and does not require prior linguistic knowledge. Two important factors for an effective acoustic representation are: (1) a front-end feature extractor which forms the frame level representation based on spectral features, and (2) a back-end model which constructs the acoustic representation for spoken LID. A popular feature is Shifted Delta Cepstra (SDC), which is effectively an extension of traditional MFCC or PLP features [13-15]. Typical back-end models include the Gaussian Mixture Model-Universal Background Model (GMM-UBM) [15] and the Gaussian Mixture Model-Support Vector Machine (GMM-SVM) [16,17]. With the help of modern machine learning techniques, such as discriminative training [18-20], Factor Analysis (FA) [21-23] and Total Variability (TV) modeling [24,25], the performance of acoustic approaches tends to be comparable to, or even exceed, that of phonotactic ones. In fact, even greater performance improvement can be achieved by exploiting both phonotactic and acoustic approaches, through fusing their results [26-28].
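For readers unfamiliar with the SDC front-end mentioned above, the following is a minimal sketch of how SDC features are commonly stacked from frame-level cepstra. The function name, the edge-padding choice and the widely used 7-1-3-7 parameter setting are illustrative assumptions, not the configuration used in this paper.

```python
import numpy as np

def sdc(cepstra, d=1, P=3, k=7):
    """Shifted Delta Cepstra: for each frame, stack k delta blocks computed
    over a +/- d frame span and spaced P frames apart (the common 7-1-3-7
    setup uses 7 static coefficients with d=1, P=3, k=7).
    `cepstra` is a (T, N) array of MFCC or PLP frames."""
    T, N = cepstra.shape
    # Edge-pad so every frame has the left/right context the deltas need.
    padded = np.pad(cepstra, ((d, d + (k - 1) * P), (0, 0)), mode="edge")
    blocks = []
    for i in range(k):
        plus = padded[2 * d + i * P : 2 * d + i * P + T]   # c(t + iP + d)
        minus = padded[i * P : i * P + T]                   # c(t + iP - d)
        blocks.append(plus - minus)
    return np.concatenate(blocks, axis=1)                   # shape (T, N * k)
```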
Despite significant recent advances in LID techniques, performance is still far from satisfactory, especially for short duration utterances [9]. This may be because language characteristics are a kind of weak information latently contained in the speech signal and largely dependent on its statistical properties. For short duration utterances especially, existing representations are deficient by being overly susceptible to variations caused by different speakers, channels, speech content and background noises. To address this, more powerful features, having higher discriminative and descriptive capabilities, are preferred.

Recently, deep learning techniques have achieved significant performance gains in a number of applications, including large scale speech recognition and image classification [29,30], largely due to their powerful modeling capabilities, aided by the availability of large scale datasets. In this paper, we aim to apply deep learning techniques to the spoken LID task. Our preliminary work demonstrated that an acoustic system based on deep bottleneck features (DBF) can effectively mine the contextual information embedded in speech frames [31]. Specifically, DBFs were generated by a structured Deep Neural Network (DNN) containing a narrow internal bottleneck layer. Since the number of hidden nodes in the bottleneck layer is much smaller than those in other layers, DNN training forces the activation signals in the bottleneck layer to form a low-dimensional compact representation of the original inputs. It should be noted that this is unlike the work by Diez et al. [32,33], in which the log-likelihood ratios of posterior probabilities, called Phone Log-Likelihood Ratios (PLLR), output from a multi-layered perceptron (MLP), were used as frame level features for LID. We will present a more detailed discussion and comparison later in this article.

This paper extends our preliminary work in five main ways:

- The DBF extractor and DNN structure are analyzed and evaluated together with the crucial DBF training and extraction process (including assessing two alternative training corpora and their configurations). In addition, the relationship to the conventional SDC [13-15] and recently proposed PLLR [32,33] approaches is explored;
- Two new acoustic systems are presented, i.e. DBF-TV and parallel DBF-TV (PDBF-TV), and systematically evaluated across various configurations of the DBF extractor. The systems are evaluated for a range of input feature temporal window sizes and numbers of bottleneck layer hidden nodes;
- The relationship between DBF and different test conditions is explored, based on analysis of the evaluation results;
- An optimal LID system configuration is proposed based on the NIST language recognition evaluation 2009 (LRE09) dataset, and compared to other high performance published approaches;
- A phonotactic representation is constructed, using a GMM-HMM based phone recognizer (PR) trained with DBF. Its output is fused with that of the acoustic representation (using two alternative fusion methods) to achieve extremely good performance.

Experimental results will demonstrate that an acoustic representation based on DBF significantly improves on state-of-the-art performance, especially for short duration utterances. The proposed phonotactic and acoustic fusion achieves equal error rate (EER) figures of 1.08%, 1.89% and 7.01% for 30 s, 10 s and 3 s test utterances respectively. This clearly exceeds the performance of the best currently reported LID system [9], as well as our own previous work [31] (in which the EER for 30 s, 10 s and 3 s test utterances was 1.98%, 3.47% and 9.71% respectively).
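Since EER is the headline metric quoted above and throughout the evaluation, here is a minimal sketch of how it can be estimated from per-utterance detection scores. The score/label array layout and the simple threshold sweep are illustrative assumptions, not the scoring tooling used by the authors.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Approximate the EER from detection scores and binary labels
    (1 = target language trial, 0 = non-target trial)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    eer, best_gap = 1.0, np.inf
    # Sweep candidate thresholds taken from the observed scores.
    for t in np.unique(scores):
        accept = scores >= t
        far = np.mean(accept[labels == 0])    # false acceptance rate
        frr = np.mean(~accept[labels == 1])   # false rejection rate
        gap = abs(far - frr)
        if gap < best_gap:                    # keep the point where FAR ~= FRR
            best_gap, eer = gap, (far + frr) / 2.0
    return eer
```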
The paper is organized as follows. How to generate the DBF from a DNN is first briefly introduced, including the two main phases, generative pre-training and discriminative fine-tuning. Then, our proposed LID systems are presented in detail. Finally, the experimental setup and results are presented and analyzed, followed by the conclusion and future work.

Methods

Deep Bottleneck Features

In this section, we discuss the DBF extraction procedure and structure shown in Figure 1, used as an acoustic frontend for the spoken LID task. We first describe the DNN training process, including the generative pre-training and discriminative fine-tuning phases, followed by the DBF extraction process. We then detail the configuration of DBF extraction for LID. Finally, we discuss the relation to several existing frame level features, e.g. SDC and PLLR.

Figure 1. An illustration of the DNN training and DBF extraction procedure. Left: Pre-training of a stack of RBMs, with the first layer hosting a Gaussian-Bernoulli RBM and all other layers being Bernoulli-Bernoulli RBMs. The inputs to each RBM are the outputs of the lower layer RBM. Middle: The generative DBN model constructed from the stack of RBMs. Right: The corresponding DNN and DBF extractor. The DNN is created by adding a randomly initialized softmax output layer on top of the DBN, and the parameters of the DNN are obtained in a fine-tuning phase. The final DBF extractor (bottom right dashed rectangle) is obtained by removing the layers above the bottleneck layer. doi:10.1371/journal.pone.0100795.g001

DNN Training

The DNN training process includes pre-training and fine-tuning phases [34]. During the pre-training phase, a generative Deep Belief Net (DBN) with stacked Restricted Boltzmann Machines (RBM) is trained in an unsupervised way. During the discriminative fine-tuning phase, a randomly initialized softmax layer is added on top of the DBN, and all the parameters are fine-tuned jointly using back-propagation (BP). Generally, the pre-training phase provides a region of the weight space that allows the fine-tuning phase to converge to a better local optimum and reduces overfitting [35].

Pre-Training Phase. The basic idea of pre-training is to fit a generative DBN model to the input data. Conceptually, the DBN can be trained greedily in a layer-by-layer manner, by treating each pair of layers as an RBM [36], as shown in the left part of Figure 1. An RBM is a bipartite graph model in which the visible stochastic units are only connected to the hidden stochastic units [37]. The RBM is a two-layer structure with V visible stochastic units v = [v_1, v_2, ..., v_V]^T and H hidden stochastic units h = [h_1, h_2, ..., h_H]^T. The most frequently used RBMs are the Gaussian-Bernoulli RBM and the Bernoulli-Bernoulli RBM. In a Bernoulli-Bernoulli RBM, v ∈ {0,1}^V and h ∈ {0,1}^H are assumed to be binary, and the energy function of the state E(v,h) is defined as:

E(v,h) = -\sum_{i=1}^{V}\sum_{j=1}^{H} v_i h_j w_{ji} - \sum_{i=1}^{V} v_i b_i^v - \sum_{j=1}^{H} h_j b_j^h    (1)

where w_{ji} represents the weight between visible unit i and hidden unit j, and b_i^v and b_j^h denote the real-valued biases of visible unit i and hidden unit j respectively. The Bernoulli-Bernoulli RBM model parameters can be defined as θ = {W, b^h, b^v}, where W = {w_{ij}}_{V×H}, b^h = [b_1^h, b_2^h, ..., b_H^h]^T and b^v = [b_1^v, b_2^v, ..., b_V^v]^T.

For a Gaussian-Bernoulli RBM, the visible units are real-valued, which means v ∈ R^V, while h ∈ {0,1}^H remains binary. Thus, the energy function is defined as follows:

E(v,h) = -\sum_{i=1}^{V}\sum_{j=1}^{H} \frac{v_i}{s_i} h_j w_{ji} + \sum_{i=1}^{V} \frac{(v_i - b_i^v)^2}{2 s_i^2} - \sum_{j=1}^{H} h_j b_j^h    (2)

where v_i is the real-valued activity of visible unit i. Each visible unit adds a parabolic offset to the energy function, governed by s_i. The Gaussian-Bernoulli RBM model parameter set can be defined similarly as θ = {W, b^h, b^v, s^2}, where the variance parameters s_i^2 are commonly fixed to a pre-determined value instead of being learnt.

According to the energy functions E(v,h) in Eqs. (1) and (2), the joint probability associated with a configuration (v,h) is defined as follows:

p(v,h; θ) = \frac{\exp(-E(v,h; θ))}{Z}    (3)

where

Z = \sum_{v}\sum_{h} \exp(-E(v,h; θ))    (4)

is the partition function. Given a training set, the RBM model parameters θ can be estimated by maximum likelihood learning via the contrastive divergence (CD) algorithm [38]. After the RBM of a lower layer is trained, the inferred states of its hidden units can be used as the visible data for training the RBM of the next higher layer. This process is repeated to produce multiple layers of RBMs. Finally, the RBMs are stacked to produce the DBN, as shown in the middle part of Figure 1.
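To make the pre-training step concrete, here is a minimal sketch of a single CD-1 parameter update for a Bernoulli-Bernoulli RBM; for the Gaussian-Bernoulli RBM used in the first layer, the visible reconstruction is real-valued rather than a sigmoid probability. The batch layout, learning rate and in-place updates are illustrative assumptions, not the authors' actual training recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, bv, bh, lr=0.01):
    """One contrastive divergence (CD-1) step for a Bernoulli-Bernoulli RBM.
    v0: (batch, V) binary visible data; W: (V, H); bv: (V,); bh: (H,)."""
    # Positive phase: hidden probabilities and samples given the data.
    ph0 = sigmoid(v0 @ W + bh)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step back to the visible layer and up again.
    pv1 = sigmoid(h0 @ W.T + bv)
    ph1 = sigmoid(pv1 @ W + bh)
    # Approximate log-likelihood gradient: data statistics minus reconstruction statistics.
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / v0.shape[0]
    bv += lr * (v0 - pv1).mean(axis=0)
    bh += lr * (ph0 - ph1).mean(axis=0)
    return W, bv, bh
```

After one RBM converges, its hidden activations serve as the visible data for the next RBM up the stack, which is exactly the greedy layer-wise procedure described above.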
Fine-Tuning Phase. The fine-tuning phase is shown in the right part of Figure 1, in which an output labelling layer is added on top of the pre-trained DBN. For a multiclass classification problem, there are K units in the output layer. In our work, these units correspond to the language-specific phonemes. Each unit corresponds to a label of the input features, and the layer converts a number of Bernoulli distributed units h into a multinomial distribution through the following softmax function:

p(k | h; θ_{DNN}) = \frac{\exp\left(\sum_{i=1}^{H} w_{ki} h_i + b_k\right)}{\sum_{p=1}^{K} \exp\left(\sum_{i=1}^{H} w_{pi} h_i + b_p\right)}    (5)

where k is an index over all classes, θ_{DNN} are the DNN model parameters, and p(k | h; θ_{DNN}) denotes the probability that the input is classified into the k-th class. The cost function C defines the cross-entropy error between the true class label d and the predicted label from the softmax operation:

C = -\sum_{k=1}^{K} d_k \log p(k | h; θ_{DNN})    (6)

where K is the total number of classes, and d = [d_1, ..., d_K] ∈ {0,1}^K is the target vector indicating the class label with a 1-of-K coding scheme. The BP algorithm is used to jointly tune all model parameters by minimizing the cross-entropy function in Eq. (6).

DBF Extraction. After fine-tuning, the layers above the bottleneck are discarded (bottom right of Figure 1) and the linear activation of the m-th bottleneck unit is taken as the frame-level feature:

y_m(x) = \sum_{j=1}^{M_2} w_{mj}^3 \,\sigma\!\left(\sum_{i=1}^{M_1} w_{ji}^2 \,\sigma\!\left(\sum_{d=1}^{D} w_{id}^1 x_d + b_i^1\right) + b_j^2\right) + b_m^3    (7)

where σ(z) = 1/(1 + \exp(-z)) is the logistic sigmoid function, x = [x_1, x_2, ..., x_D]^T is the D-dimensional input feature, concatenated from multiple frames of MFCC and prosodic features, w_{ji}^l is the weight on the connection to unit j in the l-th hidden layer from unit i in the layer below, and b_i^l is the bias of unit i in the l-th hidden layer.
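As a concrete illustration of Eqs. (5)-(7), the minimal NumPy sketch below runs the forward pass through the sigmoid hidden layers, computes the softmax output and cross-entropy cost used during fine-tuning, and extracts the DBF as the linear (pre-sigmoid) activation of the bottleneck layer once the layers above it are dropped. The layer-list layout, the bottleneck_index argument and the omission of the full back-propagation update are simplifying assumptions, not details from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)          # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def forward(x, weights, biases):
    """Forward pass: sigmoid hidden layers, then the softmax output layer
    added for fine-tuning (Eq. 5). weights/biases are lists, lowest layer first."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(h @ W + b)
    return softmax(h @ weights[-1] + biases[-1])

def cross_entropy(p, d):
    """Eq. (6): d is a 1-of-K coded label matrix, p the softmax output."""
    return -np.mean(np.sum(d * np.log(p + 1e-12), axis=1))

def extract_dbf(x, weights, biases, bottleneck_index):
    """Eq. (7): discard all layers above the bottleneck and return its
    linear (pre-sigmoid) activation as the frame-level DBF."""
    h = x
    for l in range(bottleneck_index):
        h = sigmoid(h @ weights[l] + biases[l])
    return h @ weights[bottleneck_index] + biases[bottleneck_index]

# With two hidden layers below the bottleneck, extract_dbf(x, weights, biases, 2)
# reproduces Eq. (7) exactly.
```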
DNN Training Settings

Corpus. Two separate DNNs, used for forming DBF extractors, are evaluated in this paper. The Mandarin DNN (MA-DNN) is trained from conversational telephone speech, consisting of more than 1,600,000 utterances of about 1,000 hours duration, recorded from 32,950 Mandarin speakers. The English DNN (EN-DNN) uses the well-known Switchboard corpus, consisting of the Switchboard-I training set and 20 hours of Call Home English data, totalling about 300 hours. This data is only used to train and construct the two DBF feature extractors (MA-DBF and EN-DBF). Each feature extractor will later be evaluated for LID, using completely different multilingual data.

Figure 2. Block diagram of our proposed DBF-TV LID system. This system consists of two main phases, the acoustic frontend and the TV modeling back-end. doi:10.1371/journal.pone.0100795.g002

Figure 3. Block diagrams of two PDBF-TV LID systems. The diagram above the dashed line is PDBF-TV with late fusion. The diagram below the dashed line is PDBF-TV with early fusion. doi:10.1371/journal.pone.0100795.g003
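The TV modeling back-end of DBF-TV is not detailed in this excerpt. For orientation, the sketch below shows the standard i-vector posterior-mean computation that such a back-end typically applies to Baum-Welch statistics accumulated from DBF frames against a UBM; all function and variable names, array shapes and the diagonal-covariance assumption are illustrative, not taken from the paper.

```python
import numpy as np

def extract_ivector(N_c, F_c, ubm_means, T, Sigma):
    """Posterior-mean i-vector under the TV model M = m + T w.
    N_c: (C,) zero-order stats; F_c: (C, D) first-order stats;
    ubm_means: (C, D); T: (C*D, R) total-variability matrix;
    Sigma: (C, D) diagonal UBM covariances."""
    C, D = ubm_means.shape
    R = T.shape[1]
    # Centre the first-order statistics around the UBM means.
    F_tilde = (F_c - N_c[:, None] * ubm_means).reshape(C * D)
    Sigma_inv = 1.0 / Sigma.reshape(C * D)
    # Precision matrix L = I + T' Sigma^-1 N T, accumulated block by block.
    L = np.eye(R)
    for c in range(C):
        Tc = T[c * D:(c + 1) * D]                                  # (D, R) block
        L += N_c[c] * (Tc.T * Sigma_inv[c * D:(c + 1) * D]) @ Tc
    b = T.T @ (Sigma_inv * F_tilde)                                # T' Sigma^-1 F~
    return np.linalg.solve(L, b)                                   # (R,) i-vector
```

In a pipeline of this kind, one such vector is extracted per utterance and then passed to a back-end language classifier; the specific scoring and fusion used in the paper are not described in this excerpt.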