"What's That Sound?": A Versatile, Robust, and Lightweight Convolutional Transformer for Environment Sound Recognition
Abstract – Conventional hearing aids are both costly and limited in scope, as they are not designed to detect non-speech audio. Our objective is to develop a machine learning solution that provides a more accurate and affordable mechanism for identifying surrounding sounds, improving the safety of the hearing impaired, e.g., when a car honks behind a pedestrian or a gunshot is fired and they must move away from the source. By applying randomized augmentations to the audio, concatenating Mel-Frequency Cepstral Coefficients (MFCCs) with a log-mel spectrogram, and embedding Convolutional Neural Networks (CNNs) within a Transformer architecture, the Randomized Audiomentational Layered Convolutional Transformers (RALCT) model efficiently extracts features from diverse audio representations. In addition, with only approximately 310,000 parameters, RALCT is small enough to be deployed on mobile devices. On the UrbanSound8K dataset, all variations of RALCT achieve accuracy consistently above 93%, with the highest at 94.56%, reaching state-of-the-art levels. To leverage the capabilities of this technology, a mobile app is developed and integrated with the model to provide real-time safety alerts. RALCT thus represents a robust, lightweight, affordable, and versatile deep learning tool to aid the navigation and safety of the hearing impaired.