DEMON style neural networks front-end features for passive sonar classification

Sangmin Lee,1 Jaeyoung Hwang,1 Yoonchang Han,1 Donmoon Lee,1† Do Kyung Shin,2 Seung Hwan Kim,2 and Young Dae Kim2
(이상민,1 황재영,1 한윤창,1 이돈문,1† 신도경,2 김승환,2 김영대2)
1Cochl, Inc., 2LIG Nex1
(Received November 20, 2024; revised January 3, 2025; accepted February 17, 2025)

ABSTRACT: This study proposes a novel neural network front-end feature based on conventional sonar signal processing. It simplifies the extraction of the Detection Envelope Modulation On Noise (DEMON)gram, a method used in passive sonar signal processing, by implementing it as two consecutive Short-Time Fourier Transform (STFT) operations. This converts the 1-dimensional sonar signal into a 2-dimensional feature that effectively captures the frequency modulation characteristics of cavitation generated by propellers. When combined with the conventional Mel spectrogram features used in audio classification, this DEMONgram-based front-end feature yields higher performance. Experimental results on the ShipsEar dataset show that the proposed method achieves an accuracy of 81.0 %, a 5.8 percentage-point improvement over conventional Mel spectrogram features, demonstrating its effectiveness in passive sonar classification tasks.

Keywords: Passive sonar, Neural networks, Detection Envelope Modulation On Noise (DEMON) analysis, Vessel classification
PACS numbers: 43.30.Sf, 43.60.Bf

The Journal of the Acoustical Society of Korea Vol.44, No.2 pp. 85~93 (2025)
https://doi.org/10.7776/ASK.2025.44.2.085
pISSN: 1225-4428, eISSN: 2287-3775
†Corresponding author: Donmoon Lee (dmlee@cochl.ai), Department of Research, Cochl, Inc., 41 Bongeunsa-ro 33-gil, Gangnam-gu, Seoul 06107, Republic of Korea (Fax: 82-2-6918-0714)
Copyright ⓒ 2025 The Acoustical Society of Korea. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

I. Introduction

In underwater environments such as submarines, passive Sound Navigation and Ranging (SONAR) is one of the most important means of recognizing surroundings and threats.[1,2] Most of the work based on this information has been performed by skilled human operators, whose analysis relies on the visual characteristics of signal-processed representations such as Low-frequency Analysis and Recording (LOFAR) or Detection Envelope Modulation On Noise (DEMON).[3] However, these systems have human-related issues, such as inconsistency due to a lack of skill or fatigue.[4]

Neural networks have demonstrated remarkable recognition power in various domains, such as images,[5,6] natural language,[7-9] and audio.[10,11] Their ability to learn complex patterns and make accurate predictions has been proven, and they have begun to reduce human labor in many fields. Passive sonar analysis has also recently begun to adopt neural-network-based algorithms, although this remains at an early stage, largely applying algorithms developed for general acoustic analysis of similar one-dimensional (1D) time series. However, because the physical properties of underwater sound waves differ from those of airborne sound waves, there are inherent limitations in applying existing acoustic analysis methods. In addition, unlike airborne sound, whose differences humans can hear and perceive directly, sonar signals are first transformed into representations of mechanical characteristics, and analysis is then performed on those representations to find visual differences.
Most studies have followed the conventions of general audio analysis. First, the passive sonar waveform is transformed into two-dimensional (2D) features such as a spectrogram[12] by applying a Short-Time Fourier Transform (STFT)-based transformation. The focus of this process is to mimic human auditory perception, and features designed to express differences in human auditory perception, such as the Mel spectrogram and the Constant-Q Transform (CQT), have shown better performance than raw spectrograms. Another approach is to apply a 1D convolution-based network directly to passive sonar signals.[13] However, these approaches have limitations: since human hearing is limited to the audible frequency range, systems that mimic this ability cannot fully utilize the information in passive sonar signals, and the scarcity of passive sonar data is a significant hurdle for 1D networks, which typically require the large datasets abundant in general audio analysis.

In this study, we propose a novel front-end feature for neural networks that closely resembles DEMON, a widely used technique for analyzing passive sonar data. The proposed method reflects the physical characteristics of machinery underwater, as DEMON is intended to, and provides structural flexibility because it is composed of operations that can be implemented within a neural network. To the best of our knowledge, this is the first study to model the characteristics of DEMON within a neural network structure. Our contributions are as follows:

1) We propose a DEMON-like front-end feature that improves the performance of passive sonar classification in neural networks.
2) We statistically evaluate the performance of the proposed method in a reproducible experimental setup using strictly partitioned data.

II. Proposed Method

2.1 Network front-end for sonar signals

Sonar signals contain a variety of noises, including mechanical noise from machinery, cavitation noise from the propeller, and flow noise from the interaction of fluid with the hull. One way to extract the desired information from such a complex signal is to look for distinct visual elements, as in the LOFARgram or DEMONgram.

Fig. 1 shows the overall process of obtaining a LOFARgram from a sonar signal. First, decimation is performed considering the underwater environment and computational efficiency; specific frequency ranges are extracted by applying frequency filters such as low-pass, high-pass, or band-pass filters; and STFT operations are performed considering the target object. After that, post-processing such as Two-Pass Split-Window (TPSW) filtering is applied to make the visual elements more distinct.

Fig. 1. An overview of LOFARgram generation.

This process emphasizes visual elements during computation but is not fundamentally different from an STFT-based front-end for general audio processing. Therefore, in terms of feature representation, the LOFARgram can be considered functionally equivalent to a standard spectrogram.

Unlike the LOFARgram, which directly analyzes the frequency components of the sound, the DEMONgram focuses on extracting characteristics indicative of propeller cavitation.[14] As shown in Fig. 2, DEMON performs an energy calculation, including windowing, before the STFT, a key difference from LOFAR's standard STFT approach. This preliminary processing enables the DEMONgram to analyze temporally repetitive components within specific frequency bands, rather than simply the frequency content of the original signal. We believe these unique properties of the DEMONgram cannot be expressed by existing spectrogram features.
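TPSW, mentioned above as the usual LOFAR post-processing step, estimates a local background with a notched ("split") averaging window, clips outliers toward that estimate, and then re-estimates. A minimal single-spectrum sketch follows; the window half-width, notch half-width, and clipping factor are illustrative choices, not values from this paper:

```python
import numpy as np

def tpsw(spectrum, win=16, gap=2, alpha=2.0):
    """Two-Pass Split-Window background normalization of one spectrum.

    win   : half-width of the averaging window (bins)
    gap   : half-width of the central notch excluded from the average
    alpha : clipping factor applied after the first pass
    """
    # split-window kernel: ones with a central notch, normalized to sum 1
    k = np.ones(2 * win + 1)
    k[win - gap:win + gap + 1] = 0.0
    k /= k.sum()
    # pass 1: local background estimate around (but excluding) each bin
    bg = np.convolve(spectrum, k, mode="same")
    # clip strong peaks toward the background so they do not bias pass 2
    clipped = np.where(spectrum > alpha * bg, bg, spectrum)
    # pass 2: re-estimate the background from the clipped spectrum
    bg = np.convolve(clipped, k, mode="same")
    # normalize: a flat background maps to ~1, narrowband lines stand out
    return spectrum / np.maximum(bg, 1e-12)
```

On a spectrum that is flat except for one tonal line, the background stays near 1 while the line is strongly emphasized, which is exactly the visual enhancement role TPSW plays in LOFAR displays.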
We therefore implemented a front-end feature that mimics the DEMONgram by performing the STFT operation twice, which is well suited to Graphics Processing Unit (GPU) computation,[15] as shown in Fig. 3. Front-end features in neural networks generally refer to features extracted through the front layers of the network. They consist of transformations based on fixed operations rather than learnable variables and are responsible for any preprocessing that must be performed on the raw data. This is advantageous for large-scale data requiring various types of preprocessing, both in terms of the storage space needed and the diversity of preprocessing options. In particular, for audio, front-end feature extraction refers to the process of converting a waveform into a spectrogram, which allows flexible adjustment of the accompanying hyperparameters.[15]

To achieve these benefits, we implemented the following DEMON-like feature as a front-end layer within the network, considering computational cost and memory usage. The first STFT uses a relatively small Fast Fourier Transform (FFT) window, as it aims to decimate the signal and select the cavitation frequency band in order to compute its energy over time. One advantage of this design is that cavitation frequency selection becomes a simple indexing operation, which is computationally cheap. After that, the second STFT is performed in the same way as for the original DEMONgram. We do not employ post-processing techniques such as TPSW, relying instead on the network's capacity to extract essential information directly from the input rather than on visual enhancement. To improve the efficiency of the network, we instead reduce the input dimensionality by using only a portion of the frequency components of the DEMON-like feature. We call this process post-indexing.
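The two consecutive STFTs described above can be sketched in plain NumPy. The helper names and the simple frame-based STFT are illustrative (the paper implements this as network layers on GPU); the window/hop sizes, the 5 kHz cavitation cutoff, and the 75 % post-indexing ratio follow the settings reported in Sec. 3.3:

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude STFT via framing + rFFT with a Hann window."""
    w = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * w
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))   # (frames, n_fft // 2 + 1)

def demon_style_feature(x, sr=52734, n_fft1=64, hop1=32, cav_hz=5000.0,
                        n_fft2=2048, hop2=64, r_pi=0.75):
    # STFT 1: small FFT window; selecting the cavitation band is then
    # a simple indexing operation along the frequency axis
    S1 = stft_mag(x, n_fft1, hop1)
    freqs = np.fft.rfftfreq(n_fft1, d=1.0 / sr)
    envelope = np.sum(S1[:, freqs >= cav_hz] ** 2, axis=1)  # band energy over time
    # STFT 2: plays the same role as the final STFT of a conventional DEMONgram
    S2 = stft_mag(envelope, n_fft2, hop2)
    # post-indexing: keep only the lower r_pi fraction of frequency bins
    n_keep = int(S2.shape[1] * r_pi)
    return S2[:, :n_keep].T                       # (frequency, time)
```

Calling `demon_style_feature` on a waveform returns a 2D array whose rows are the retained low-frequency modulation bins, ready to be fed to a 2D classification network.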
Post-indexing employs a low-pass-filter-like frequency selection, keeping only a specific portion of the frequency components so that relevant data can be analyzed efficiently with reduced, less redundant input dimensionality. The post-indexing ratio (r_pi) defines the percentage of frequency components used relative to the total number of frequency components. The rationale is that the important features of the DEMONgram exist mainly in the low-frequency band, so we can reduce the size of the result of the second STFT and obtain computational benefits.

Fig. 2. An overview of DEMONgram generation.

Fig. 3. Proposed DEMON-style front-end. The operations required to obtain a DEMONgram are based on two STFTs.

Fig. 4 shows the results of the conventional DEMONgram and the proposed DEMON-style feature. As with the conventional DEMONgram, we can observe characteristics that appear as vertical lines in the low-frequency band. That is, the proposed method yields features similar to the typical DEMONgram, and their dimensionality can be adjusted through post-indexing, as shown in the figure.

Fig. 4. (Color available online) Comparison between the conventional DEMONgram (a) and the proposed DEMON-style feature (b). The DEMONgram was obtained using a 64-point band-pass filter, followed by a 64-point low-pass filter and decimation. This example uses 50 % post-indexing, which was applied to handle the increased frequency range of the DEMON-style features.

2.2 Network architecture

Fig. 5 illustrates the proposed network architecture. The proposed network has an end-to-end structure and extracts a Mel spectrogram and a DEMONgram through two front-end layers. Each feature is then analyzed by a separate network, and the outputs are finally connected to predict the final result. In the proposed method, the base model for analyzing each front-end feature can be any classification network that takes 2D input. The raw waveform input is converted into a typical 2D input by the network's front-end layer, allowing diverse 2D classification models to be used. By introducing two parallel front-end layers, we can analyze and leverage information from both the spectrogram and the DEMONgram.

Fig. 5. Diagram of the proposed network architecture. The sonar signal passes through the MEL layer and the DEMON layer in parallel, and the extracted features are input to two independent base models. The outputs of the two models are concatenated into one linear layer, from which the probabilities of the predicted ship types are obtained.

III. Experiments

The experiment was designed to verify the effectiveness of the proposed DEMON-style front-end features. We compared the proposed method with existing STFT-based methods, using the common approach of converting sonar data into 2D front-end features and then applying known networks. In this process, we intentionally minimized variables such as augmentation and compared classification performance according to the difference in front-end features alone.

3.1 Dataset

We used the ShipsEar[16] dataset, a public dataset for sonar classification. ShipsEar is a dataset of underwater ship sounds collected directly from hydrophones. It consists of 90 sound sources covering 11 types of ships, with a total length of approximately 3 h. Unfortunately, the number of recordings per ship type is imbalanced, which limits the data available to train the neural network. To ensure a fair performance comparison and future reproducibility, we followed the data processing methodology of the previous study.[17]

Target classes: A total of 9 classes were used, comprising 8 ship types and ambient noise, excluding 3 ship types (pilot ship, trawler, and tug boat).

Data split and audio segmentation: To ensure generalization and fair comparison, the training, validation, and test sets were split at the audio source level. The samples forming each split were those specified in the previous study.[17] Each sound source was cut into 30-second segments with a 15-second overlap, and segments shorter than 30 s were discarded.

3.2 Experimental conditions

In this experiment, we compare the proposed method with the Mel spectrogram, a commonly used front-end feature in general audio analysis. This spectrogram-based feature uses a logarithmic rather than linear frequency scale and is known to be more useful than the raw spectrogram for general audio classification as well as passive sonar classification.[17]

MEL denotes a classification network that uses the Mel spectrogram as a front-end. The 30-second sonar input is first converted into a Mel spectrogram, which is then fed to a base model such as ResNet18,[6] ResNet50V2,[18] or MobileNetV2.[19] After passing through the network, the information in each channel is collected through global average pooling and converted into a 1D embedding, which is connected to a fully connected layer corresponding to the 9 classes. DEMON replaces the Mel spectrogram with the proposed DEMON-style features under the same conditions. To compare the information contained in the proposed method and the Mel spectrogram, we also applied both features simultaneously: in MEL + DEMON, the proposed network structure, the input sonar data passes through two independent layers, the MEL layer and the DEMON layer, in parallel, and is converted into a Mel spectrogram and DEMON-style features.
In this case, two identical base models are applied to the two front-end features, producing two embedding vectors. These embedding vectors are concatenated and connected to the 9 outputs through a fully connected layer, as before. However, this doubles the size of the entire network, so performance could improve simply because of the larger model size rather than the combined features. For a fair comparison, we therefore added MEL + MEL and DEMON + DEMON conditions, in which both parallel branches use the MEL layer or the DEMON layer, respectively; in each condition, either the Mel spectrogram or the proposed feature is used in both branches.

3.3 Training details

The parameters for extracting the Mel spectrogram were selected heuristically based on previous studies. For audio with a sample rate of 52,734 Hz, the STFT was performed with a window size of 2,048 (approximately 40 ms) and an overlap of 50 %, with 300 Mel bins covering the 0 kHz ~ 16 kHz band. In the first STFT for DEMON extraction, the FFT window was 64, the hop size was 32, and the energy of components above 5 kHz was used. In the second STFT, the FFT window was 2,048, the hop size was 64, and the post-indexing ratio (r_pi) was set to 75 %, meaning that the lower 75 % of the frequency components were kept. These parameters, such as the FFT window and hop sizes and r_pi, were chosen empirically.

The Adam optimizer[20] with a learning rate of 1e-6 was used, and training was performed for 250 epochs with a batch size of 16. Evaluation used the model with the lowest validation loss. For statistical comparison, 5 repeated experiments were performed under each condition. The experiments used the ResNet50V2 and MobileNetV2 architectures provided by the TensorFlow library, while ResNet18 was implemented by us.
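Shape-wise, the fusion stage described above (two branch embeddings, concatenation, then one fully connected layer over the 9 classes) can be sketched as follows. The 512-dimensional embeddings and the random weights are hypothetical placeholders standing in for the global-average-pooled outputs of the trained base models:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
batch, emb_dim, n_classes = 4, 512, 9   # hypothetical sizes

# stand-ins for the pooled embeddings of the two parallel branches
e_mel = rng.standard_normal((batch, emb_dim))     # Mel-spectrogram branch
e_demon = rng.standard_normal((batch, emb_dim))   # DEMON-style branch

# concatenate the two embeddings and map to the 9 ship classes
W = 0.01 * rng.standard_normal((2 * emb_dim, n_classes))
b = np.zeros(n_classes)
probs = softmax(np.concatenate([e_mel, e_demon], axis=1) @ W + b)
print(probs.shape)   # each row is a distribution over the 9 classes
```

Because the fusion is a plain concatenation followed by one linear layer, the MEL + MEL and DEMON + DEMON control conditions differ only in which front-end feeds the two branches, leaving this head unchanged.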
During the experiments, stability and performance were better when the networks were initialized with ImageNet pre-trained weights,[21] so ResNet50V2 and MobileNetV2 were initialized with the weights provided by TensorFlow.

IV. Results & discussion

4.1 Classification performance of the proposed methods

Table 1 shows the experimental results under each condition. Under the same network conditions, MEL always outperforms DEMON. However, for ResNet50V2, the performance of DEMON is close to that of MEL with ResNet18, suggesting that DEMON-style features contain useful information for sonar classification. When using only the Mel spectrogram, as in MEL and MEL + MEL, classification performance improves in the order ResNet18, ResNet50V2, MobileNetV2. In contrast, DEMON and DEMON + DEMON improve in the order ResNet18, MobileNetV2, ResNet50V2. The DEMON + DEMON method shows better results than DEMON on ResNet18 and MobileNetV2, which appears to be because the network size is doubled, as noted in Section 3.2. Based on this, we can assume that the optimal network capacity may vary with the front-end feature being analyzed. We assume that analyzing DEMON-style features requires a relatively large network such as ResNet50V2, and our results support this by showing performance variation across the parallel networks. When using the same two front-end features in parallel, the proposed method shows significant performance improvement on small networks compared to the Mel spectrogram. The proposed method (MEL + DEMON) performed best under all network conditions.
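The per-condition comparison over five repeated runs can be checked with a Student's paired t-test, as used in the paper. A self-contained stdlib sketch follows; the accuracy lists are made-up illustrations, not the paper's raw run results:

```python
import math
import statistics

def paired_t(a, b):
    """Student's paired t-statistic for two matched samples."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

# hypothetical per-run test accuracies for one architecture (5 repeats)
acc_mel = [0.74, 0.76, 0.77, 0.73, 0.76]
acc_mel_demon = [0.80, 0.81, 0.82, 0.79, 0.83]

t = paired_t(acc_mel_demon, acc_mel)
# two-tailed critical value for df = n - 1 = 4 at the 0.05 level is about 2.776
significant = abs(t) > 2.776
```

Pairing runs within a condition controls for run-to-run variation; with these illustrative numbers the difference is significant, whereas in the paper significance held only for MobileNetV2.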
Although the difference was statistically significant (Student's paired t-test) only under the MobileNetV2 condition, this shows that, overall, the two features complement each other and can improve classification performance.

Table 1. Classification performance results according to front-end features. Performance refers to the average test accuracy and standard deviation over five repetitions under the same conditions.

Condition        ResNet18        ResNet50V2      MobileNetV2
MEL              0.643 ± 0.042   0.729 ± 0.034   0.752 ± 0.028
DEMON            0.450 ± 0.025   0.647 ± 0.030   0.514 ± 0.040
MEL + MEL        0.607 ± 0.026   0.734 ± 0.022   0.776 ± 0.016
DEMON + DEMON    0.497 ± 0.047   0.641 ± 0.012   0.612 ± 0.017
MEL + DEMON      0.672 ± 0.044   0.743 ± 0.022   0.810 ± 0.012

4.2 The effect of variables on the DEMON-style front-end features

The proposed method extracts DEMON-style features using several variables. Among these, we conducted additional experiments to verify the effects of the cavitation frequency and post-indexing, varying their values in the proposed method (MEL + DEMON).

Fig. 6 shows the classification performance according to the choice of cavitation frequency when a high-pass filter is applied in the proposed method. In most cases, the more low-frequency components are removed, the better the classification performance. These results differ from the common knowledge that the frequency band reflecting cavitation is 3 kHz to 8 kHz, which we attribute to the distribution of noise components in the data. A process for setting an appropriate frequency band for each dataset therefore seems necessary.

Fig. 6. (Color available online) Effect of cavitation frequency in DEMON-style front-end features.

Fig. 7 shows the classification performance according to post-indexing.
We applied various values of r_pi (25 %, 50 %, 75 %, and 100 %) to examine the effect of input dimensionality. For example, under the 25 % condition, only the values below 25 % of the maximum frequency are used. This series of experiments showed that using more data does not always lead to better results, even for ResNet50V2, which has a relatively large model capacity. We believe this is related to the amount of data, not just the model capacity. For small datasets such as the ShipsEar data used in this experiment, it may be more effective to reduce the size of the input data, which again requires a process for finding the optimal point. With sufficient data, we expect these trends to vary with the amount of data and the size of the model.

V. Conclusions

In this study, we extracted neural network front-end features based on characteristics traditionally specialized for sonar signal analysis and performed passive sonar classification with them. Through experiments under various conditions, we demonstrated several valid parameter settings for extracting DEMON-style features, which have the potential to improve sonar signal classification performance. Since the proposed method is a feature extractor that generates 2D data, it can be applied to any task and network that uses sonar data.

We believe there is considerable room for improvement in the proposed approach. Since the method is structured as a network, it allows data-driven tuning rather than requiring a human to find the optimal point; in our experiments, numerous variables influenced the performance, making it challenging to accurately evaluate their impact through repeated trials alone. Another direction is to apply network structures and augmentation methods optimized for DEMON features. We believe many more applications and developments are possible beyond existing methods.
Fig. 7. (Color available online) Effect of post-indexing in DEMON-style front-end features.

Acknowledgement

This work was supported by a Korea Research Institute for defense Technology planning and advancement (KRIT) grant funded by the Korean Government Defense Acquisition Program Administration (No. KRIT-CT-22-023-431 02, 2022).

References

1. D. J. Creasey, Remote Sensing for Environmental Sciences (Springer, Berlin, Heidelberg, 1976), pp. 277-303.
2. W. S. Burdic, Underwater Acoustic System Analysis (Prentice-Hall, New Jersey, 1984), pp. 113.
3. G. R. Arrabito, B. E. Cooke, and S. M. McFadden, "Recommendations for enhancing the role of the auditory modality for processing sonar data," Appl. Acoust. 66, 986-1005 (2005).
4. D. Kobus and L. Lewandowski, "Critical factors in sonar operation: A survey of experienced operators," NHRC Tech. Rep., 1991.
5. A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," Adv. Neural. Inf. Process. Syst. 26, 1097-1105 (2012).
6. K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Proc. IEEE CVPR, 770-778 (2016).
7. J. Devlin, M. W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," Proc. NAACL, 4171-