Data-Driven Audio Feature Space Clustering for Automatic Sound Recognition in Radio Broadcast News

Aiming at an automatic sound recognizer for radio broadcasting events, this work investigates a methodology for clustering the audio feature space using the discrimination ability of the audio descriptors as a criterion. From a given closed set of audio events commonly found in broadcast news transmissions, a large set of audio descriptors is extracted and their data-driven relevance ranking is clustered, providing a more robust feature selection. The clusters of the feature space feed machine learning algorithms, implemented as classification models, during the experimental evaluation. The methodology showed that support vector machines provide particularly good results in terms of accuracy, owing to their ability to cope well with high-dimensional experimental conditions.


Introduction
Over the last decade, there has been a continuous increase in the amount of available data, which is accessible to a growing number of people. Radio and TV broadcast transmissions and web-based multimedia offer an enormous amount of audiovisual data. The availability of these resources has led research to focus on a vast number of applications related to the automatic processing of such multimedia data, including automatic TV program handling, story classification, automatic highlighting of events, sports news handling, automatic transcription extraction, automatic commercial detection, summarization etc. [1-9]
Concerning the audio data, the automatic analysis of the audio signals can offer users useful information. In the case of broadcast news, automatic processing is related to tasks such as sound recognition [10,11], speaker recognition [12], anchor detection [13], role detection [14-16], story boundary detection [2,17,18], summary construction from anchor speech [9,19], channel quality detection [20], sound event detection [21,22], detection of non-linguistic human-produced sounds [5,6,23-25], audio type segmentation in sports games [4,26,27], highlight scene extraction from sports games [3], violence scene detection [28], music characteristics classification [29,30], jingle detection [1], commercial block detection [8], voice activity detection [31], language recognition [32], emotion recognition [33] and speech recognition [34]. Sound recognition is the cornerstone of the analysis, as it typically precedes the other stages.
During sound recognition the audio signal is decomposed into discrete intervals corresponding to sound events of interest. In broadcast news signals, in addition to the major sound categories of speech and music, common sounds are non-linguistic sounds, noise from the recording/transmission conditions, bubble noise, background/environmental sounds and superpositions of sounds. For the decomposition of the broadcast signal, the signal is first preprocessed and parameterized. Subsequently, the parameterized signal is processed by a pattern recognition algorithm.
Over the years, several time-domain and frequency-domain features have been used for parameterizing broadcast audio signals [10,35,36]. The zero-crossing rate and the Mel frequency cepstral coefficients are the most commonly used features in the time domain and in the frequency domain, respectively. Other commonly used features are the pitch, perceptual linear predictive coefficients, harmonics-to-noise ratio, linear predictive coding coefficients, chroma, autocorrelation etc. [10,26,35,36,38,39]
In the pattern recognition stage, a wide variety of probabilistic and discriminative machine learning algorithms has been proposed. The most commonly used are Gaussian mixture models and hidden Markov models [10,11,14,26,37,40]. Also widely used are support vector machines [11,14,38,39,41], artificial neural networks [10], the k-nearest neighbor algorithm [14,38], decision trees [10,38], genetic algorithms [2], fuzzy logic [42] and boosting techniques [41,43]. Related architectures incorporate fusion frameworks among recognition models [28,44] and combinations of model-based and distance-based algorithms [13,26,27,39,40]. Post-processing schemes can improve the overall recognition accuracy. Among them are (i) transformation of the feature matrix [23,44-46], (ii) correction of logical errors based on empirical rules [11], (iii) isolation of the segments of interest in cases where the post-processing is focused on specific classes [10,11,13,38,40,47] and (iv) merging and separation of sound events in a post-processing stage [28]. The structure of the sound analysis divides the task into different types of problems. Some of the widely used are: (i) the multi-class problem [10], (ii) the binary-class problem [37], (iii) the hierarchical class structure problem [11], (iv) the two-group or multi-group class problem [28] and (v) the problem of detecting one class against the other classes [19,48].
In this work, we present a broadcast news sound recognition methodology based on widely known and used audio features. The implemented framework clusters the audio feature space into subspaces, based on data-driven criteria. Subsequently, the subspaces found useful in terms of their sound discrimination ability are utilized in the sound recognition task. We investigate our methodology on the basis of the following main hypotheses, which are expected to be verified. The first hypothesis is that clustering the audio feature space using the discrimination ability of the audio descriptors as a criterion will be beneficial to the task. Secondly, even though most machine learning algorithms are able to identify the most appropriate features and discard the rest, the use of irrelevant features often deteriorates their effectiveness. We also hypothesize that clustering the features by their discrimination ability not only avoids this deterioration but also introduces a more robust stage between feature extraction and recognition that assists the methodology. Finally, algorithms able to cope well with high-dimensional feature spaces, like SVMs, are expected to perform very well across all models. In this way, the classification accuracy is expected to be optimized while avoiding a time-consuming greedy feature selection approach.
The rest of the article is organized as follows. In Section 2, the proposed methodology for recognition of sounds using clustering of the audio feature space is described. In Section 3, the experimental setup is given and in Section 4, the experimental results are presented. Finally, Section 5 concludes this work.

Sound Recognition with Unsupervised Audio Feature Space Clustering
In the proposed scheme the recognition of the sounds of interest is based on short-time analysis in the time and frequency domains. A selection of clusters of the feature space that are expected to be more discriminative with respect to the sounds of interest is implemented. Figure 1 illustrates this framework. In the feature extraction block, the computed feature values are concatenated in one feature vector, $V^i$, per audio frame $i$. After decomposing the audio signals into sequences of feature vectors, a ranking algorithm is applied by the feature evaluation block. The output of this block is a number of feature clusters, $C_j$, with $1 \le j \le J$, which divide the feature space into $J$ clusters with respect to the estimated ranking score, i.e. the discriminative ability of the features. Consequently, cluster $C_1$ includes the most discriminative features, while cluster $C_J$ includes the least discriminative ones. The number of clusters $J$ is either manually defined or determined by a threshold criterion with respect to the sparseness of each cluster. The clustering of the feature space allows the training of sound type models with subsets of the features instead of the entire feature space. During the test phase, the sub-feature sequence restricted to the selected clusters, $V_C^i$, is forwarded to the classifier $f$, where a decision $d$ is taken with respect to the corresponding model $M_j$ of the selected features, i.e.
$$d = f(V_C^i, M_j).$$
The recognition is based on frame-level classification among a closed set of sound types. Further post-processing of the results can be performed for fine-tuning of the estimated sound type intervals. The described architecture allows the exploitation of the feature subspaces that contribute to robust discrimination, while excluding the feature subspaces that do not contribute.
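As a minimal sketch of how the ranked clusters can be turned into nested feature subsets, assuming that model $M_j$ is trained on the $j$ top-ranked clusters, the following Python fragment builds the feature indices for each model and projects a frame-level feature vector onto them. The function names and the list-based representation are hypothetical and serve only as an illustration.

```python
def cumulative_cluster_subsets(cluster_of_feature, n_clusters):
    """Return, for j = 1..J, the feature indices used by model M_j.

    cluster_of_feature[a] is the 1-based cluster index of feature a,
    where cluster 1 holds the most discriminative features.
    """
    subsets = []
    for j in range(1, n_clusters + 1):
        # Model M_j uses every feature that belongs to clusters 1..j.
        subsets.append([a for a, c in enumerate(cluster_of_feature) if c <= j])
    return subsets


def project(frame_vector, feature_indices):
    """Keep only the selected features of one frame-level feature vector."""
    return [frame_vector[a] for a in feature_indices]


# Toy example: four features assigned to three clusters.
subsets = cumulative_cluster_subsets([2, 1, 3, 1], n_clusters=3)
reduced = project([10.0, 20.0, 30.0, 40.0], subsets[0])
```

In this toy example the first subset contains only the two features of cluster 1, and the last subset coincides with the full feature vector.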

Experimental Setup
The experimental setup for the evaluation of the architecture described in Section 2 is presented here. In this framework we examine our methodology by validating the main hypotheses stated in the Introduction. Since SVMs cope well with high-dimensional feature spaces, they are expected to perform very well in comparison with the other models and probably to outperform them. Clustering the audio feature space using the discrimination ability of the audio descriptors as a criterion is also expected to be beneficial to the task. The audio data used for the evaluation, the feature extraction algorithms and the classification methods are also described in this section.

Audio data description
For our task, due to the lack of a single database appropriate for sound type recognition from broadcast recordings, we relied on a number of existing audio data collections. The data collections used are (i) the Voice of America (VOA) radio broadcast news [49] for the Greek language, which is part of the NIST 2009 Language Recognition Evaluation [50], (ii) the BBC FX Library [51], (iii) the BBC broadcast news database [52], (iv) the Partners In Rhyme database [53] and (v) the SoundBible database [54]. Sound instances acquired from non-broadcast collections were convolved with randomly selected silence intervals from broadcast audio signals. All audio data were stored in single-channel audio files with a sampling frequency of 8 kHz and a resolution of 8 bits per sample. The selected audio data collection consists of recordings with a total duration of approximately 8 min. The duration distribution per sound type is shown in Table 1.
The collected data include the most common sounds found in radio broadcasts. The entire evaluation audio dataset was manually annotated by an expert audio engineer.

Feature extraction
The sound types appearing in radio broadcast signals differ in kind (speech, music, etc.).
In the literature most of the feature extraction algorithms are dedicated to specific audio signals, mainly speech and music. In this study, we rely on the OpenSmile [35] framework for extracting a number of features that have been widely used in applications related to speech, music and sound recognition. The audio signal is initially frame blocked into overlapping frames of constant length of 25 msec. A 1st order FIR pre-emphasis filter followed by Hamming windowing is applied to each frame. From each frame we compute (i) the zero-crossing rate and (ii) the frame energy. After computing the spectral magnitude we compute (iii) the Mel frequency cepstral coefficients [55], (iv) the pitch envelope, (v) the voicing probability, (vi) the chroma coefficients [56,57], and the following spectral magnitude statistics: (vii) the energy in four bands equally distributed over 0 to FS/2, where FS is the sampling frequency, (viii) the roll-off, (ix) the flux, (x) the centroid, and the frequencies with (xi) the maximum and (xii) the minimum magnitude. All audio features are concatenated into a common feature vector, which is further expanded with first and second derivatives (delta and delta-delta coefficients).
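The front end described above can be illustrated with a short Python sketch covering frame blocking, pre-emphasis, Hamming windowing and the two time-domain features. It is not the OpenSmile implementation: the 10 msec frame shift and the pre-emphasis coefficient of 0.97 are assumptions (the text specifies only the 25 msec frame length), and the function name is hypothetical.

```python
import math


def frame_features(signal, sample_rate=8000, frame_ms=25, shift_ms=10, preemph=0.97):
    """Return one (zero_crossing_rate, energy) pair per overlapping frame."""
    frame_len = int(sample_rate * frame_ms / 1000)  # 200 samples at 8 kHz
    shift = int(sample_rate * shift_ms / 1000)      # assumed frame shift
    # Hamming window coefficients
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    features = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        frame = signal[start:start + frame_len]
        # First-order FIR pre-emphasis: y[n] = x[n] - preemph * x[n-1]
        emphasized = [frame[0]] + [frame[n] - preemph * frame[n - 1]
                                   for n in range(1, frame_len)]
        windowed = [e * w for e, w in zip(emphasized, window)]
        # Fraction of adjacent sample pairs whose signs differ
        crossings = sum(1 for a, b in zip(windowed, windowed[1:])
                        if (a >= 0) != (b >= 0))
        zcr = crossings / (frame_len - 1)
        # Mean squared amplitude of the windowed frame
        energy = sum(v * v for v in windowed) / frame_len
        features.append((zcr, energy))
    return features


# Toy input: a signal that changes sign at every sample.
alternating = [1.0 if n % 2 == 0 else -1.0 for n in range(400)]
feats = frame_features(alternating)
```

The spectral features listed above (MFCCs, chroma, roll-off, flux, centroid) would be computed from the magnitude spectrum of the same windowed frames and appended to the per-frame vector.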

Feature evaluation and clustering
After computing the audio features described in subsection 3.2, the feature evaluation block estimates the importance of each feature with respect to its discriminative ability on the task. For the evaluation we relied on the ReliefF algorithm [58]. The ReliefF algorithm computes a vector W containing the quality estimates of all the audio features.
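To make the role of the vector W concrete, the following Python sketch gives a simplified ReliefF-style estimator for numeric features. It is an illustration under stated assumptions and not the implementation used in the experiments: every instance is used as a query, distances are Manhattan distances over range-normalized features, and the neighborhood size k and the function name are hypothetical choices.

```python
from collections import Counter


def relieff(X, y, k=1):
    """Simplified ReliefF: return one quality estimate per feature."""
    n, n_feat = len(X), len(X[0])
    # Per-feature value ranges, used to normalise differences to [0, 1].
    lo = [min(row[a] for row in X) for a in range(n_feat)]
    hi = [max(row[a] for row in X) for a in range(n_feat)]
    span = [(hi[a] - lo[a]) or 1.0 for a in range(n_feat)]
    prior = {c: cnt / n for c, cnt in Counter(y).items()}

    def diff(a, i, j):
        return abs(X[i][a] - X[j][a]) / span[a]

    def dist(i, j):
        return sum(diff(a, i, j) for a in range(n_feat))

    w = [0.0] * n_feat
    for i in range(n):  # every instance serves as a query
        for c in prior:
            # k nearest neighbours of instance i inside class c
            members = [j for j in range(n) if j != i and y[j] == c]
            nearest = sorted(members, key=lambda j: dist(i, j))[:k]
            if not nearest:
                continue
            if c == y[i]:
                scale = -1.0  # near hits: differing values lower the score
            else:
                # near misses, weighted by the prior of the other class
                scale = prior[c] / (1.0 - prior[y[i]])
            for j in nearest:
                for a in range(n_feat):
                    w[a] += scale * diff(a, i, j) / (n * len(nearest))
    return w


# Toy data: feature 0 separates the two classes, feature 1 does not.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.9, 0.8], [1.0, 0.2], [0.8, 0.5]]
y = [0, 0, 0, 1, 1, 1]
W = relieff(X, y, k=1)
```

A feature receives a high score when its values differ between neighboring instances of different classes and agree between neighboring instances of the same class, which is the notion of discriminative ability used for the ranking.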
The ranking position of each feature is defined by its ranking score, i.e. the corresponding quality estimate, w ∈ R, which indicates the degree of importance of that feature. These ranking scores are used to cluster the feature space into five clusters using the EM algorithm [59]. The number of clusters was chosen based on empirical knowledge and preliminary experiments. In detail, the ranking scores, w ∈ R, are used to iteratively train five one-dimensional Gaussian distributions, one for each cluster. After the completion of the EM training, each feature is assigned to the cluster under which its ranking score has the maximum likelihood. The use of EM provides maximum-likelihood estimates of the parameters of the distribution models. This clustering procedure ensures that features with close ranking scores are grouped together in the same cluster, since their importance is alike, resulting in clusters that correspond to meaningful subsets of features. The clusters are used for estimating classification models with different sets of clusters during the training phase and for cluster-specific feature extraction during the test phase.
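The clustering of the ranking scores can be sketched as a one-dimensional Gaussian mixture fitted with EM, followed by a maximum-likelihood assignment of each score. The Python fragment below is a minimal sketch, not the implementation used in the experiments; the quantile-based initialization, the variance floor, the iteration count and the function name are assumptions introduced for illustration.

```python
import math


def em_cluster_scores(scores, n_clusters, n_iter=100, min_var=1e-6):
    """Fit a 1-D Gaussian mixture with EM and return a cluster index per score.

    Cluster 0 is the component with the highest mean (best-ranked features).
    """
    s = sorted(scores)
    n = len(s)
    # Assumed initialisation: means on evenly spaced quantiles, shared variance.
    means = [s[int((c + 0.5) * n / n_clusters)] for c in range(n_clusters)]
    var0 = max(min_var, (s[-1] - s[0]) ** 2 / (4 * n_clusters ** 2))
    variances = [var0] * n_clusters
    weights = [1.0 / n_clusters] * n_clusters

    def pdf(x, mu, var):
        return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

    for _ in range(n_iter):
        # E-step: responsibility of each component for each score
        resp = []
        for x in scores:
            p = [weights[c] * pdf(x, means[c], variances[c])
                 for c in range(n_clusters)]
            total = sum(p) or 1e-300
            resp.append([pc / total for pc in p])
        # M-step: re-estimate weight, mean and variance of each component
        for c in range(n_clusters):
            nc = sum(r[c] for r in resp)
            if nc < 1e-12:
                continue  # leave an empty component unchanged
            weights[c] = nc / n
            means[c] = sum(r[c] * x for r, x in zip(resp, scores)) / nc
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, scores)) / nc
            variances[c] = max(min_var, var)

    # Assign each score to the component under which it is most likely.
    labels = [max(range(n_clusters), key=lambda c: pdf(x, means[c], variances[c]))
              for x in scores]
    # Relabel so that cluster 0 has the highest mean score.
    order = sorted(range(n_clusters), key=lambda c: -means[c])
    rank = {c: r for r, c in enumerate(order)}
    return [rank[label] for label in labels]


# Toy ranking scores forming three well-separated groups.
toy_scores = [0.90, 0.88, 0.91, 0.50, 0.52, 0.49, 0.10, 0.11, 0.09]
toy_labels = em_cluster_scores(toy_scores, n_clusters=3)
```

With five components instead of three, the same procedure yields the five-cluster partition of the feature space described above.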

Sound type classification
For the construction of the classification models we used the implementations of machine learning algorithms in the WEKA software toolkit [60]. Algorithms that are well known and widely used in the areas of audio, speech and music processing were selected [10,11,38]. The evaluated algorithms are: (i) a two-layered back-propagation multilayer perceptron (MLP) neural network [61], (ii) a support vector machine (SVM) classifier with a radial basis function kernel, trained with the sequential minimal optimization algorithm [62], (iii) a k-nearest neighbor classifier (IBk) [63] and (iv) a C4.5 decision tree learner (J48) [64]. The hyper-parameters of all algorithms were selected using grid search. For the purpose of direct comparison, all algorithms were trained with the same training data and evaluated on the same test data. For each cluster combination, one model was trained.

Experimental Results
The architecture presented in Section 2 was evaluated using the experimental protocol described in Section 3. The performance of the four algorithms was evaluated at the frame level. In order to avoid overlap between the training and test subsets, a ten-fold cross-validation experimental setup was followed. The results achieved with the full audio feature vector are shown in Table 2.
As can be seen in Table 2, the SVM classification algorithm outperformed all the other algorithms, achieving an accuracy of 96.02%. The second-best performing algorithm was IBk, which achieved approximately 1% lower performance. Both the decision tree and the neural network achieved significantly lower performance. The advantage of the SVM algorithm can be explained by the ability of SVMs to cope well with the high dimensionality of the feature space with respect to the amount of data, since they do not suffer from the curse of dimensionality [62] and, in contrast to the other algorithms, converge to the globally optimal parameter values and thus do not provide suboptimal performance.
In a further step we evaluated the discriminative ability of each feature in order to investigate the effect of dimensionality reduction. The choice of five clusters was made empirically; this does not preclude criteria-based decisions that respect the likelihood of the data. The resulting feature clusters consisted of 16, 18, 12, 71 and 90 features, respectively. In Table 3, we present the audio features of the 1st cluster. The selection of clusters was defined during the training phase. As can be seen in Table 3, among the most discriminative features are the pitch (absolute value and envelope), several MFCCs, the energy, the voicing probability, some spectral magnitude statistics and the zero-crossing rate. These results are in agreement with Refs. [23] and [65], where the MFCCs, the zero-crossing rate, the voicing probability and the pitch were found to be discriminative features.
In Table 4, we present the accuracy for all clusters and for each algorithm. The performance of each method for the full feature set is also reported. As can be seen in Table 4, the use of subsets of features improves the overall performance. Specifically, the 117 best features in terms of discriminative ability ranking, which correspond to the four best clusters, i.e. a 43.5% reduction in the number of features compared to the full feature set, achieved the highest performance for all the evaluated algorithms. These results show that clustering the features by their ranking is an effective method for discarding irrelevant features and works in favor of the outcome, even though most machine learning algorithms are able to learn which features are the most appropriate. This is due to the significant reduction of the feature space dimension, which reduces the effect of the curse of dimensionality, as well as to the fact that the use of fewer features helps to prevent overfitting.
For all sets of clusters the relative performance of the four classifiers is similar to that obtained with the full feature set, i.e. the SVM algorithm outperforms all the other algorithms and is followed by the IBk algorithm. The main advantage of the SVM that leads it to outperform all the other models is its fundamental property of coping well with high-dimensional feature spaces [66], along with its ability to learn complex relationships between the input and the output of the data [67]. Moreover, in all cases IBk achieved performance close to that of the SVM. In one case, i.e. $C_{1,2,3}$, IBk even slightly (by approximately 0.5%) outperformed the SVM and gave the highest performance for this set of clusters. On the contrary, when the dimensionality of the feature space increases considerably, i.e. when the number of features of $C_{1,2,3,4}$ and $C_{1,2,3,4,5}$ becomes 2.5 and 4.5 times the dimension of $C_{1,2,3}$, respectively, the performance of IBk deteriorates, since, due to the large number of features, all data vectors are almost equidistant from the query vector in terms of the Euclidean distance [68]. Finally, in no case did MLP or J48 achieve a performance close to that of the SVM or IBk models. In the case of MLP, this could be attributed to the amount of available training data with respect to the feature space [69,70] and, in the case of J48, to the over-fitting of the model to the training data [64]. For all algorithms, the addition of the 5th cluster showed that enlarging the parameterization served only to decrease the achieved accuracy. The small changes in accuracy after the addition of the last clusters show the efficiency of selecting features by clustering.
In a further step, the confusion matrix of the SVM model for the $C_{1,2,3,4}$ feature set was calculated and is shown in Table 5. As can be seen in the table, applause, music and silence are the types that achieved the highest recognition accuracies, with rates above 96%. Speech, bubble-noise, laugh and especially cough showed lower accuracy, achieving recognition rates between 85.59% and 94.05%.
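The feature counts quoted in this section can be cross-checked with a few lines of arithmetic. The Python sketch below uses only the reported cluster sizes; the variable names are illustrative.

```python
# Cluster sizes as reported for the five feature clusters C1..C5.
cluster_sizes = [16, 18, 12, 71, 90]

total = sum(cluster_sizes)  # 207 features in the full set
# Number of features when the j best clusters are used, j = 1..5.
cumulative = [sum(cluster_sizes[:j]) for j in range(1, len(cluster_sizes) + 1)]
# cumulative == [16, 34, 46, 117, 207]

# Reduction achieved by the four best clusters relative to the full set.
reduction_percent = 100 * (total - cumulative[3]) / total  # about 43.5

# Growth in dimensionality relative to the three best clusters.
ratio_four = cumulative[3] / cumulative[2]  # about 2.5
ratio_five = cumulative[4] / cumulative[2]  # 4.5
```

The four best clusters therefore contain 117 of the 207 features, which matches the reported 43.5% reduction and the stated factors of 2.5 and 4.5 relative to the 46 features of the three best clusters.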

Conclusions
The development of automatic event processing is driven by the availability of events and the quantity of applications. Since automatic audio recognizers have been a cornerstone of audio event processing, several methodologies have been investigated.
In the present work, we studied an automatic sound recognition framework based on short-time analysis of audio events commonly found in radio broadcast transmissions. The set of audio descriptors was organized into clusters based on their discrimination ability, providing a more robust method of selection. Several well-known and widely used machine learning algorithms were evaluated. The SVM outperformed the other algorithms due to its ability to cope well with high-dimensional problems. A t-test showed that the SVM offered statistically significantly better results than the rest. IBk gave high accuracies, due to the nature of the examined audio data set. The addition of clusters with less significant features did not preserve the maximum accuracy and could even reduce it.
As shown in Figure 1, the proposed scheme is divided into two phases, the training phase and the test phase. During the training phase a set of R audio recordings $X = \{X_r\}$, $1 \le r \le R$, with known sound labels, is used to train models for each of the sound types of interest. The training phase consists of the pre-processing, feature extraction, evaluation of features for clustering and classification model construction steps. During pre-processing the training audio files, $X = \{X_r\}$, are frame blocked into overlapping frames, $O = \{O_r\}$, of constant length with a constant time-shift step. The sequences of audio frames are decomposed into sequences of feature vectors, $V = \{V_r\}$, in the feature extraction block. The feature extraction block applies a number of feature extraction algorithms to each audio frame.
During the training step of the classification model, the $j$ most valuable clusters are used to train model $M_j$, i.e. model $M_1$ is trained with the features of cluster $C_1$ and so on, until the last model $M_J$ is trained with all clusters. The training phase results in $J$ models for radio broadcast sound type classification. During the test phase, an unknown test audio file, $Y$, is pre-processed similarly to the training phase, i.e. with the same frame length and time-shift step, resulting in a sequence of frames, $O_Y$. The pre-processed audio signal, $O_Y$, is then processed by the feature extraction block, which estimates those features, $V_C^i$, that belong to a number of selected clusters $C_j$. The selection of the clusters is performed manually depending on the experimental setup.

Table 1. Duration distribution of the sound types in the collected audio data.

Table 2. Broadcast news sound recognition accuracy for different algorithms.

Table 3. Top-16 audio features (assigned to cluster 1) according to the ReliefF criterion.

Table 4. Accuracy for different audio feature subsets and classification algorithms.