Acoustic Space Learning for Sound-Source Separation and Localization on Binaural Manifolds

    In this paper, we address the problems of modeling the acoustic space generated by a full-spectrum sound source and of using the learned model for the localization and separation of multiple sources that simultaneously emit sparse-spectrum sounds. We lay the theoretical and methodological grounds for the binaural manifold paradigm. We perform an in-depth study of the latent low-dimensional structure of high-dimensional interaural spectral data, based on a corpus recorded with a human-like audiomotor robot head. A nonlinear dimensionality reduction technique is used to show that these data lie on a two-dimensional (2D) smooth manifold parameterized by the motor states of the listener, or equivalently, the sound-source directions. We propose a probabilistic piecewise affine mapping model (PPAM) specifically designed for high-dimensional data exhibiting an intrinsic piecewise-linear structure. We derive a closed-form expectation-maximization (EM) procedure for estimating the model parameters, followed by Bayes inversion to obtain the full posterior density of a sound-source direction. We extend this solution to handle missing data and redundancy in real-world spectrograms, and hence to the 2D localization of natural sound sources such as speech. We further generalize the model to the challenging case of multiple sound sources and propose a variational EM framework. The associated algorithm, referred to as variational EM for source separation and localization (VESSL), yields a Bayesian estimate of the 2D locations and time-frequency masks of all the sources. Comparisons with several existing methods reveal that the combination of acoustic-space learning with Bayesian inference enables our approach to outperform the state of the art.
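
    The Bayes-inversion step described above admits a compact illustration: once each local affine piece of the learned mapping is known, the posterior over the low-dimensional source direction given a high-dimensional interaural observation is a Gaussian mixture that can be evaluated in closed form. The sketch below is a minimal, hypothetical Python/NumPy rendition of that inversion under standard linear-Gaussian assumptions; it is not the paper's PPAM implementation, and all parameter names (pi, c, Gamma, A, b, sigma2) are placeholders for quantities that an EM procedure would learn.

```python
# Minimal sketch (not the authors' exact PPAM code): inverting a piecewise-affine
# forward model y = A_k x + b_k + noise with Bayes' rule. Parameters are assumed
# to have been learned beforehand; the values used here are random placeholders.
import numpy as np
from scipy.stats import multivariate_normal

def posterior_mean_x(y, pi, c, Gamma, A, b, sigma2):
    """Posterior mean of the low-dimensional variable x (e.g. a 2D source
    direction) given a high-dimensional observation y (e.g. an interaural
    spectral vector), under a mixture of K local linear-Gaussian models."""
    K, L = c.shape            # K affine pieces, L = latent dimension (2 here)
    D = y.shape[0]            # observation dimension
    log_w = np.empty(K)
    mu_post = np.empty((K, L))
    for k in range(K):
        # Marginal of y under piece k: N(A_k c_k + b_k, A_k Gamma_k A_k^T + sigma2_k I)
        mean_y = A[k] @ c[k] + b[k]
        cov_y = A[k] @ Gamma[k] @ A[k].T + sigma2[k] * np.eye(D)
        log_w[k] = np.log(pi[k]) + multivariate_normal.logpdf(y, mean_y, cov_y)
        # Gaussian posterior of x given y and piece k (standard conjugate update)
        prec_post = np.linalg.inv(Gamma[k]) + (A[k].T @ A[k]) / sigma2[k]
        cov_post = np.linalg.inv(prec_post)
        mu_post[k] = cov_post @ (np.linalg.solve(Gamma[k], c[k])
                                 + A[k].T @ (y - b[k]) / sigma2[k])
    # Responsibilities computed in log space to avoid underflow for large D
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    return w @ mu_post        # mixture-of-experts style weighted average

# Toy usage with random placeholder parameters (K=3 pieces, 2D latent, 10D observations)
rng = np.random.default_rng(0)
K, L, D = 3, 2, 10
params = dict(pi=np.full(K, 1.0 / K),
              c=rng.normal(size=(K, L)),
              Gamma=np.stack([np.eye(L)] * K),
              A=rng.normal(size=(K, D, L)),
              b=rng.normal(size=(K, D)),
              sigma2=np.full(K, 0.1))
x_hat = posterior_mean_x(rng.normal(size=D), **params)
print(x_hat)
```

    Working in log space when weighting the local experts is a deliberate choice: with the high observation dimensions considered in the paper, the component likelihoods can underflow if multiplied directly.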

    Published: 2 April 2014