In The Journal of the Acoustical Society of America
In acoustic scene classification (ASC), acoustic features play a crucial role in the extraction of scene information, which can be stored over different time scales. Moreover, the limited size of the dataset may lead to a biased model with a poor performance for recordings from unseen cities and confusing scene classes. This paper proposes a long-term wavelet feature that captures discriminative long-term scene information. The extracted scalogram requires a lower storage capacity and can be classified faster and more accurately compared with classic Mel filter bank coefficients (FBank). Furthermore, a data augmentation scheme is adopted to improve the generalization of the ASC systems, which extends the database iteratively with auxiliary classifier generative adversarial neural networks (ACGANs) and a deep learning-based sample filter. Experiments were conducted on datasets from the Detection and Classification of Acoustic Scenes and Events (DCASE) challenges. The DCASE17 and DCASE19 datasets marked a performance boost of the proposed techniques compared with the FBank classifier. Moreover, the ACGAN-based data augmentation scheme achieved an absolute accuracy improvement of 6.10% on recordings from unseen cities, far exceeding classic augmentation methods.
Chen Hangting, Liu Zuozhen, Liu Zongming, Zhang Pengyuan