Special Issue: Selected Papers from the IEEE International Symposium on Multimedia (ISM2009); Guest Editors: Gerald Friedland and Mei-Ling Shyu


    In this paper, we first show the importance of face-voice correlation for audio-visual person recognition. We propose a simple multimodal fusion technique that preserves the correlation between audio and visual features during speech, and we evaluate it against audio-only, video-only, and audio-visual systems that use both audio and visual features but neglect the interdependency between a person's spoken utterance and the associated facial movements. Experiments on the VidTIMIT dataset show that the proposed fusion scheme has a lower error rate than all comparison conditions and is more robust against replay attacks. The simplicity of the fusion technique allows a low-complexity design suited to a low-cost, real-time DSP implementation. We then discuss some problems associated with the previously proposed design and, to address them, propose two novel classifier designs that provide more flexibility and a convenient way to represent multimodal data in which each modality has different characteristics. We also show that these classifier designs offer superior performance in terms of both accuracy and robustness.
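
    The abstract does not spell out the fusion rule, so the following is only an illustrative sketch and not the authors' implementation. It assumes one common way to keep face-voice correlation available to a classifier: concatenating time-aligned audio and visual feature frames (feature-level fusion) so that a joint model can see the cross-modal covariance, which is lost when each modality is modeled separately. The function names, feature dimensions, frame rates, and synthetic data are all hypothetical.

```python
import math
import random


def align_and_concatenate(audio_feats, video_feats):
    """Pair every audio frame with the video frame covering the same time
    span (nearest earlier frame) and concatenate the two feature vectors."""
    n_audio, n_video = len(audio_feats), len(video_feats)
    joint = []
    for i, audio_frame in enumerate(audio_feats):
        j = min(i * n_video // n_audio, n_video - 1)
        joint.append(list(audio_frame) + list(video_feats[j]))
    return joint


def covariance(xs, ys):
    """Unbiased sample covariance of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def cross_modal_covariance(joint, n_audio_dims):
    """Audio-by-video covariance block of the joint feature vectors.
    Modeling the modalities independently treats this block as zero."""
    columns = list(zip(*joint))
    audio_cols, video_cols = columns[:n_audio_dims], columns[n_audio_dims:]
    return [[covariance(a, v) for v in video_cols] for a in audio_cols]


# Synthetic example (assumption): one shared "articulation" signal drives
# both a 3-dimensional audio stream and a 2-dimensional video stream.
rng = random.Random(0)
n_video, n_audio = 50, 200  # hypothetical: 4 audio frames per video frame
articulation = [math.sin(4 * math.pi * k / (n_video - 1)) for k in range(n_video)]

video = [[s + 0.05 * rng.gauss(0, 1), 0.5 * s + 0.05 * rng.gauss(0, 1)]
         for s in articulation]
audio = [[s + 0.05 * rng.gauss(0, 1),
          -s + 0.05 * rng.gauss(0, 1),
          0.5 * s + 0.05 * rng.gauss(0, 1)]
         for s in articulation
         for _ in range(n_audio // n_video)]

joint = align_and_concatenate(audio, video)
cross = cross_modal_covariance(joint, n_audio_dims=3)

print(len(joint), len(joint[0]))  # number of fused frames, fused dimension
for row in cross:
    print([round(c, 2) for c in row])
```

    In this toy setup the cross-modal block is clearly non-zero, which is the information a joint model over the concatenated frames can exploit. A replayed audio track paired with unrelated video would be expected to weaken that block, though the sketch does not test this.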

