A Novel Unsupervised Audio Clustering Approach in Noisy Environments
Detecting conversations in a noisy environment is challenging. We propose a novel three-stage framework for audio clustering. First, we apply computational auditory scene analysis (CASA) as a front end to separate speech from non-speech background noise. Inspired by auditory perception, CASA typically segregates speech from noise by producing a binary time-frequency (T-F) mask, which is then used to reconstruct clean speech. Second, because the reconstructed speech may contain more than one speaker's voice, we propose an unsupervised audio clustering approach to perform speech separation. Unreliable T-F units in simultaneous streams are reconstructed using a speech prior, and cepstral features are then derived for clustering. We search for the two clusters that exhibit the largest speaker difference, measured by the trace of the ratio of the between-cluster scatter matrix to the within-cluster scatter matrix. To speed up the search, a genetic algorithm (GA) is employed. Third, after extracting each speaker's audio stream, we apply the latest speaker identification algorithm developed by our team to each separated voice stream. A robust algorithm is needed because residual noise may remain in the separated voice streams.
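The clustering criterion and GA search described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not state: the scatter-matrix ratio is read as trace(S_w^{-1} S_b), each candidate solution is a binary label vector over feature frames, and the GA uses truncation selection, one-point crossover, bit-flip mutation, and elitism. The names `scatter_ratio` and `ga_cluster` and all parameter values are hypothetical, and plain NumPy arrays stand in for cepstral features.

```python
import numpy as np


def scatter_ratio(features, labels):
    """Return trace(inv(S_w) @ S_b) for a two-cluster (0/1) labeling.

    Assumption: this is one common reading of the "trace of the
    between- and within-cluster scatter matrix ratio".
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    overall_mean = features.mean(axis=0)
    dim = features.shape[1]
    s_w = np.zeros((dim, dim))  # within-cluster scatter
    s_b = np.zeros((dim, dim))  # between-cluster scatter
    for k in (0, 1):
        members = features[labels == k]
        if len(members) < 2:
            return -np.inf  # reject degenerate splits
        mean_k = members.mean(axis=0)
        centered = members - mean_k
        s_w += centered.T @ centered
        diff = (mean_k - overall_mean)[:, None]
        s_b += len(members) * (diff @ diff.T)
    s_w += 1e-6 * np.eye(dim)  # small ridge term keeps S_w invertible
    return float(np.trace(np.linalg.solve(s_w, s_b)))


def ga_cluster(features, pop_size=40, generations=60,
               mutation_rate=0.02, seed=0):
    """Search for a high-scoring binary labeling with a simple GA.

    Assumption: the operators below are illustrative choices, not the
    ones used in the proposal.
    """
    rng = np.random.default_rng(seed)
    n = len(features)
    population = rng.integers(0, 2, size=(pop_size, n))
    best_labels, best_score = population[0].copy(), -np.inf
    for _ in range(generations):
        scores = np.array([scatter_ratio(features, ind)
                           for ind in population])
        order = np.argsort(scores)[::-1]  # best individuals first
        if scores[order[0]] > best_score:
            best_score = float(scores[order[0]])
            best_labels = population[order[0]].copy()
        parents = population[order[: pop_size // 2]]  # truncation selection
        children = []
        while len(children) < pop_size - 1:
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n)  # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n) < mutation_rate  # bit-flip mutation
            children.append(np.where(flip, 1 - child, child))
        # Elitism: carry the best labeling found so far into the next round.
        population = np.vstack([best_labels] + children)
    return best_labels, best_score
```

Calling `ga_cluster(features)` on an array of shape (frames, coefficients) returns a 0/1 label per frame together with the criterion value for that labeling. Because swapping the two labels leaves the criterion unchanged, which speaker receives label 0 is arbitrary.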
Small Business Information at Submission:
SIGNAL PROCESSING, INC.
13619 Valley Oak Circle ROCKVILLE, MD -
Number of Employees: