The Journal of the Acoustical Society of Korea, Vol. 44, No. 4 (2025)

Jin Soo Seo

88 subbands, the short-time mean-square power (local energy) is computed. After appending 20 zeros at the beginning and 12 at the end, as described in Reference [2], logarithmic compression is applied to construct a 120-dimensional short-time pitch representation X[p, n], where n is the frame index and p corresponds to MIDI pitches between 1 and 120.

Unlike conventional chromagrams [2], both CRP and CTP remove the slowly varying components of the pitch representation in each frame. These components are claimed to be closely related to timbre and should therefore be eliminated to obtain a timbre-invariant representation. In CRP, the slowly varying components of the pitch representation are obtained from the lower-frequency DCT components of the spectral energy. In CTP, the spectral energy is detrended directly by using a moving average or the Hodrick-Prescott filter [7], which decomposes a signal into two components: a medium-to-long-term trend component and a short-term cycle component.

As shown in Fig. 1, the CRP and CTP extraction is performed for every frame in X as follows: 1) estimating the slowly varying components of the 120-dimensional logarithmically compressed pitch representation; 2) subtracting the estimated slowly varying components from the pitch representation; 3) taking the positive part of the trend-subtracted pitch by half-wave rectification in the case of CTP; and 4) performing chroma binning to derive a 12-bin chromagram.

To further improve the CSI performance of the CRP and the CTP, which currently rely solely on pitch-directional discriminant feature extraction, we propose pre- and post-processing steps, including temporal smoothing and soft decision of chroma activation.
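The four steps above can be sketched for the CTP case as follows. This is a minimal illustration rather than the author's implementation: it assumes a centered moving average as the trend estimator, and the function names and the window width `width=13` are hypothetical choices made only for this example.

```python
import numpy as np


def moving_average_trend(x, width=13):
    """Step 1: estimate the slowly varying trend along the pitch axis."""
    if width % 2 == 0:
        raise ValueError("width must be odd so the window is centered")
    kernel = np.ones(width) / width
    # Edge padding keeps the output the same length as the input.
    padded = np.pad(x, width // 2, mode="edge")
    return np.convolve(padded, kernel, mode="valid")


def ctp_frame(log_pitch, width=13):
    """Turn one log-compressed pitch frame X[:, n] into a 12-bin chroma vector."""
    log_pitch = np.asarray(log_pitch, dtype=float)
    trend = moving_average_trend(log_pitch, width)   # step 1: estimate the trend
    detrended = log_pitch - trend                    # step 2: subtract the trend
    rectified = np.maximum(detrended, 0.0)           # step 3: half-wave rectify
    # Step 4: chroma binning. MIDI pitch p (1-based) falls into bin p mod 12,
    # so bin 0 is C and bin 9 is A under the usual MIDI convention.
    midi = np.arange(1, log_pitch.size + 1)
    return np.bincount(midi % 12, weights=rectified, minlength=12)


frame = np.zeros(120)
frame[69 - 1] = 1.0                   # a single active pitch: MIDI 69 (A4)
chroma = ctp_frame(frame)
print(int(np.argmax(chroma)))         # prints 9, the chroma bin of A
```

For CRP, step 1 would instead take the trend from the low-order DCT coefficients of the frame, and step 3 would be skipped.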
Fig. 1. (Color available online) Overview of the chromagram extraction from an audio signal with temporal smoothing and soft decision: (a) chroma DCT-reduced log pitch and (b) chroma trend-removed log pitch.

Salient chromagram extraction by using temporal smoothing and soft decision for cover song identification, The Journal of the Acoustical Society of Korea Vol. 44, No. 4 (2025)

2.2 Temporal smoothing for the pitch representation

Temporal smoothing plays a critical role in reducing noise while preserving essential signal characteristics across various disciplines, including speech and audio processing, biomedical signal analysis, financial time-series forecasting, computer vision, and climate science. Previous works [1-3] have primarily focused on pitch-directional processing for chromagram estimation. In this study, temporal smoothing is employed as a preprocessing step to reduce noise and fluctuations introduced during cover song generation, thereby improving chromagram continuity and mitigating pitch estimation errors. For chromagram extraction, temporal smoothing is applied to the pitch representation X[p, n] along the n-direction (frame) for each p, as shown in Fig. 1.

We evaluate four temporal smoothing methods: Gaussian smoothing [8], Fourier smoothing [9], morphological smoothing [10], and shape-adapted smoothing [11]. Each method offers distinct advantages suited to different applications. Gaussian smoothing is effective for local noise reduction, Fourier smoothing removes periodic noise, morphological smoothing preserves edges, and shape-adapted smoothing adapts to structural details.

Gaussian smoothing is a low-pass filter that reduces high-frequency noise while preserving low-frequency components. It smooths the pitch representation by convolving it with a Gaussian kernel, which assigns higher weights to nearby frames and lower weights to distant ones. The filter window size is typically a multiple of the kernel's standard deviation.
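As a rough illustration of the Gaussian case, the sketch below smooths X[p, n] along the frame axis with a truncated, normalized Gaussian kernel. The function name and the values `sigma=2.0` (in frames) and `truncate=3.0` (window half-width in units of the standard deviation) are assumptions for this example; this section does not state the settings that were actually used.

```python
import numpy as np


def gaussian_smooth(X, sigma=2.0, truncate=3.0):
    """Smooth a pitch representation X[p, n] along the frame axis n."""
    X = np.asarray(X, dtype=float)
    radius = int(np.ceil(truncate * sigma))      # window half-width in frames
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    kernel /= kernel.sum()                       # unit gain for constant input
    # Edge padding keeps the number of frames unchanged.
    padded = np.pad(X, ((0, 0), (radius, radius)), mode="edge")
    return np.stack([np.convolve(row, kernel, mode="valid") for row in padded])


X = np.zeros((3, 50))
X[1, 25] = 1.0                                   # an isolated one-frame activation
Y = gaussian_smooth(X)
print(Y.shape, Y[1].max() < 1.0)
```

The isolated activation is spread over neighboring frames while its total energy along the row is preserved, which is the behavior the paragraph above describes.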
Fourier smoothing reduces high-frequency noise by transforming the pitch representation into the frequency domain, applying a low-pass filter with a cutoff (typically 0.05), and reconstructing the pitch representation via the inverse Fourier transform. Fourier smoothing is known to be effective in removing periodic noise while maintaining the main structure, though sharp cutoffs may introduce artifacts.

Morphological smoothing applies nonlinear filtering to reduce noise while preserving structural details. It operates through erosion (replacing each point with the minimum in its neighborhood, which shrinks objects) and dilation (replacing each point with the maximum, which expands objects). Opening (erosion followed by dilation) removes small noise, while closing (dilation followed by erosion) fills gaps and smooths boundaries. We use opening followed by closing to denoise the pitch representation while preserving its overall shape.

Shape-adapted smoothing, such as anisotropic diffusion (Perona-Malik filtering), selectively smooths regions while preserving edges. This method iteratively updates each point using a diffusion coefficient that decreases at edges, preventing excessive smoothing across boundaries. It is applied to the pitch representation to reduce noise in homogeneous regions while preserving sharp musical note changes.

2.3 Soft decision of chroma activation

Previous works directly utilize the 12-dimensional output of chroma binning in Fig. 1, which adds up the values of the pitch representation that belong to the same chroma. This paper introduces an additional step after chroma binning, soft decision of chroma activation, which decides whether each chroma (pitch class) is actually active or inactive. Instead of a hard decision, where only the active and inactive states are considered, we propose a soft decision method that retains a gradual transition between on and off, preserving information about the confidence in pitch-class presence.
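Looking back at the temporal smoothing methods of Section 2.2, the Fourier, morphological (opening followed by closing), and anisotropic-diffusion smoothers can each be sketched in a few lines of NumPy, applied along the frame axis of X[p, n]. All function names and parameter values (`cutoff`, `size`, `n_iter`, `kappa`, `step`) are illustrative assumptions. In particular, the cutoff is interpreted here as a frequency in cycles per frame, and the exponential conductance is one of the two functions proposed by Perona and Malik [11], chosen without knowing which variant was used in the experiments.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def fourier_smooth(X, cutoff=0.05):
    """Ideal low-pass filter along the frame axis (cutoff in cycles per frame)."""
    n_frames = X.shape[1]
    spectrum = np.fft.rfft(X, axis=1)
    freqs = np.fft.rfftfreq(n_frames)          # 0 ... 0.5 cycles per frame
    spectrum[:, freqs > cutoff] = 0.0          # sharp cutoff, may cause ringing
    return np.fft.irfft(spectrum, n=n_frames, axis=1)


def _windowed(X, size, reducer):
    """Apply np.min or np.max over a centered window of odd length `size`."""
    radius = size // 2
    padded = np.pad(X, ((0, 0), (radius, radius)), mode="edge")
    return reducer(sliding_window_view(padded, size, axis=1), axis=-1)


def morphological_smooth(X, size=3):
    """Opening (erosion, dilation) followed by closing (dilation, erosion)."""
    opened = _windowed(_windowed(X, size, np.min), size, np.max)
    return _windowed(_windowed(opened, size, np.max), size, np.min)


def anisotropic_diffusion(X, n_iter=20, kappa=1.0, step=0.25):
    """1-D Perona-Malik diffusion along the frame axis."""
    Y = np.array(X, dtype=float)
    for _ in range(n_iter):
        grad = np.diff(Y, axis=1)                    # Y[:, i+1] - Y[:, i]
        flux = grad * np.exp(-(grad / kappa) ** 2)   # conductance drops at edges
        Y[:, :-1] += step * flux                     # move frame i toward frame i+1
        Y[:, 1:] -= step * flux                      # move frame i+1 toward frame i
    return Y
```

In this sketch `kappa` sets the gradient magnitude above which a change between neighboring frames is treated as a note boundary and left largely untouched, while smaller fluctuations are averaged out.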
In this paper, soft decision of chroma activation refers to smoothly mapping the real-valued output $R[c, n]$ of the chroma binning step to a chroma activation score $A[c, n]$, which has a continuous range between 0 and 1. To make this mapping adaptive to the input music signal, we first apply the modified Z-score normalization [12] to $R[c, n]$ and take the maximum value as follows:

$$Z[c, n] = \frac{1}{\mathrm{MAD}(R)} \, \mathrm{MAX}\bigl(0,\; R[c, n] - \mathrm{MD}(R)\bigr) \qquad (1)$$

for the chroma value c = 1, 2, ..., 12, where MD(R) denotes the median of R, and MAD(R) is computed as the median of the absolute deviations from the median of R as follows:

$$\mathrm{MAD}(R) = \mathrm{MD}\bigl(\,\lvert R - \mathrm{MD}(R) \rvert\,\bigr) \qquad (2)$$

To incorporate song-level information into the resulting chromagram, both MD(R) and MAD(R) are calculated over the entire sequence R, treating all elements jointly. Unlike conventional Z-score normalization, which relies on the mean and standard deviation, the modified Z-score approach is more robust to outliers and less sensitive to the underlying distribution, which is often difficult to estimate.

To quantify the salience of pitch classes in the chromagram, we propose a chroma activation score based on a sigmoid transformation. This transformation smoothly maps input values into the range between 0 and 1, allowing for soft decision-making rather than binary activation. The chroma activation score $A[c, n]$ for pitch bin c at time frame n is computed as:

$$A[c, n] = \frac{1}{1 + \exp\bigl(-(Z[c, n] - k)\bigr)}, \qquad (3)$$

where k is the threshold that determines decision sensitivity, that is, the normalized score above which a value is considered activated. Typical values of k range from 1 to 5, corresponding to thresholds from MAD(R) to 5MAD(R) above the median. The choice of k should reflect the desired specificity of chroma activation, which may vary depending on the application.
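A minimal sketch of the soft decision follows. It assumes the reading of Eqs. (1)-(3) in which the modified Z-score is clipped at zero and then passed through a logistic function centered at k. The function name, the default `k=3.0`, and the small constant `eps` that guards against a zero MAD are assumptions of this example rather than details given in the text.

```python
import numpy as np


def chroma_activation(R, k=3.0, eps=1e-12):
    """Map chroma-binning output R[c, n] (shape 12 x N) to scores in (0, 1)."""
    R = np.asarray(R, dtype=float)
    md = np.median(R)                            # MD(R), taken over the whole song
    mad = np.median(np.abs(R - md))              # MAD(R), Eq. (2)
    z = np.maximum(0.0, R - md) / (mad + eps)    # clipped modified Z-score, Eq. (1)
    return 1.0 / (1.0 + np.exp(-(z - k)))        # sigmoid soft decision, Eq. (3)


# Toy input: chroma bin c holds the constant value c in every frame.
R = np.tile(np.arange(12.0)[:, None], (1, 4))
A = chroma_activation(R, k=1.0)
print(A.shape, float(A.min()) > 0.0, float(A.max()) < 1.0)
```

Under this reading, every value at or below the song-level median receives the same small score, 1 / (1 + exp(k)), and a value that lies k MADs above the median receives a score of 0.5.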
The chroma activation score quantifies the dominance of each pitch class at a given moment in the audio signal. Finally, the chromagram is obtained by normalizing the 12-dimensional chroma activation score vector of each frame with respect to the Euclidean norm, ensuring that each vector has unit length.

An illustrative example of the chromagram considered in this study is presented in Fig. 2, alongside the ground truth derived from the Isophonics chord annotation dataset [13]. In comparison to the conventional CTP, the postprocessing with soft activation effectively attenuated less prominent false activations while retaining stronger ones, resulting in a chromagram that more closely resembles the ground truth. Additionally, temporal smoothing contributed to the removal of isolated false activations.

III. Experimental results

The CSI performance of the proposed salient chromagram was evaluated on two cover song datasets. The first dataset (abbreviated as covers80) is the one used by Dan Ellis in his work [14]. It consists of 80 pairs of original and cover songs (160 songs in total), which are available online. The second dataset (abbreviated as covers330) is composed of 1000 songs, of which 330 are test data (30 original songs and 10 cover versions of each original song) and the other 670 were embedded as impostors. The covers330 dataset was collected by the author.

Each song in the datasets was converted to mono at a sampling frequency of 22050 Hz and then divided into 200 ms frames with 100 ms overlap, and a 12-dimensional chromagram vector was computed as a low-level feature for each frame. The 12-dimensional chromagram vector was normalized with respect to the Euclidean norm to have unit length. To extract the chromagram, we utilized the pitch representation in the chroma toolbox [2] with the default parameter settings.
From the pitch representation, we extracted CRP [1] and CTP [3] with the proposed temporal smoothing and soft decision of chroma activation.

Fig. 2. (Color available online) Chromagram extracted from the 10-second excerpt of the song "Let it be" by the Beatles: (a) ground truth, (b) CTP, (c) CTP with soft activation (k = 3), and (d) CTP with both soft activation and temporal smoothing using anisotropic diffusion.

3.1 Baseline CSI methods and evaluation metrics

CSI can be performed using two approaches: sequence alignment and song-level feature matching. Sequence alignment-based methods [4] attempt to find the optimal alignment between the feature sequences of two songs by leveraging techniques from speech recognition and DNA sequence analysis, such as Dynamic Time Warping (DTW) or Smith-Waterman (SW) distance. In contrast, song-level feature matching methods [6] compute a whole-song or segment-level feature summary, which is then compared with the corresponding summary from another song to assess similarity.

This paper employs two baseline CSI methods to evaluate the extracted chromagram. The first method is based on sequence alignment, where the Optimal Transposition Index (OTI) [4] is used to measure chromagram similarity and the Smith-Waterman (SW) algorithm [15] is applied for local sequence alignment. The OTI-SW method [4] consists of three modules: preprocessing, similarity matrix creation, and sequence alignment. The second method is musically motivated version embeddings (MOVE) [6], which computes song-level similarity using the Euclidean distance between feature embeddings. These embeddings are extracted from a five-layer convolutional neural network with multi-channel adaptive attention.
We utilized the experimental code and the pre-trained neural network model of MOVE, trained on 44,909 songs, which are available in a GitHub repository [16].

Based on the computed music similarity, a ranked list of the most similar songs is generated as potential cover versions of the query. CSI performance is evaluated using mean average precision (MAP) and mean rank of the first correctly identified cover (MR1), following prior research [3-6]. MAP assesses ranking performance by considering the positions of all correct cover songs, assigning higher scores to methods that rank multiple covers near the top. MR1 quantifies the position of the first correctly identified cover song, emphasizing early retrieval. Higher MAP values indicate better performance, whereas lower MR1 values are preferable.

3.2 Results

Fig. 3 presents the MAP of CSI using the MOVE method for different values of the soft decision threshold k without applying temporal smoothing. For both datasets, the use of soft decision improved MAP for k values between 3 and 5 compared to the MAP without soft decision. As k decreases, soft decision becomes more sensitive to activation, leading to an increased number of false activations. Conversely, as k increases, soft decision becomes less sensitive to activation, potentially missing real activations. The highest cumulative MAP across the two datasets was observed at k = 3.5 for the CRP and k = 3 for the CTP. Therefore, these values were adopted in the subsequent temporal smoothing experiments.

Tables 1 and 2 present the CSI performance of the CRP and CTP representations with various temporal smoothing methods, evaluated using the OTI-SW and MOVE CSI baselines on the covers80 and covers330 datasets, respectively. The parameters for each temporal smoothing method were selected to achieve the best performance.
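For reference, MAP and MR1 as described in Section 3.1 can be computed from per-query ranked lists as in the sketch below. It is an illustration only: the function names and the boolean-list input format are assumptions, and average precision is normalized by the number of covers found in the list, which equals the total number of covers when the whole database is ranked.

```python
import numpy as np


def average_precision(relevance):
    """relevance: ranked list of booleans, True where the item is a cover of the query."""
    rel = np.asarray(relevance, dtype=bool)
    if not rel.any():
        return 0.0
    hits = np.cumsum(rel)                           # covers found up to each rank
    ranks = np.arange(1, rel.size + 1)
    return float(np.mean(hits[rel] / ranks[rel]))   # precision at each cover's rank


def rank_of_first_cover(relevance):
    """1-based rank of the first correctly identified cover."""
    rel = np.asarray(relevance, dtype=bool)
    if not rel.any():
        raise ValueError("ranked list contains no cover of the query")
    return int(np.argmax(rel)) + 1


def map_and_mr1(ranked_lists):
    """Average the per-query scores over all queries."""
    ap = [average_precision(r) for r in ranked_lists]
    r1 = [rank_of_first_cover(r) for r in ranked_lists]
    return float(np.mean(ap)), float(np.mean(r1))


queries = [[True, False, True], [False, True, False]]
map_score, mr1 = map_and_mr1(queries)
print(round(map_score, 4), mr1)                     # prints 0.6667 1.5
```

The first toy query has covers at ranks 1 and 3 (average precision 5/6), and the second has its only cover at rank 2 (average precision 1/2), which gives the printed values.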
Notably, prior work [3] employed linear interpolation along the temporal axis for fast sequence matching in OTI-SW, while also reducing the feature rate through decimation. In contrast, the proposed temporal smoothing approaches yielded higher MAP than Reference [3] in nearly all cases. Regardless of the chromagram type, temporal smoothing improved both MAP and MR1 across the two CSI baselines in most cases. Although the improvements were not universal, the overall trend indicates that temporal smoothing is beneficial for both precision and ranking performance, which highlights its practical utility in real-world CSI scenarios.

Among the evaluated methods, anisotropic diffusion achieved the best performance in most cases, followed by either Fourier or morphological smoothing, depending on the dataset and baseline. These results suggest that preserving sharp note transitions while suppressing noise in homogeneous regions, as achieved by anisotropic diffusion, is critical for improving the CSI accuracy of the chromagram. These findings underscore the importance of selecting appropriate preprocessing and postprocessing strategies in chromagram extraction. Experimental results on the two datasets demonstrate that combining temporal smoothing as a preprocessing step with soft decision postprocessing improves CSI accuracy.

Fig. 3. (Color available online) MAP of cover song identification using MOVE as a function of the soft decision threshold k for (a) the covers80 and (b) the covers330 dataset.

IV. Conclusions

Enhancing chromagram saliency is crucial for reliable CSI, as cover song generation introduces various distortions. This paper improves chromagram saliency through preprocessing and postprocessing techniques.
Temporal smoothing is applied as a preprocessing step to reduce noise and fluctuations, enhancing continuity and mitigating pitch estimation errors. As a postprocessing step, a soft decision of chroma activation is proposed to determine pitch-class presence more reliably. Experimental results on two datasets demonstrate that both temporal smoothing and soft decision effectively improve cover song retrieval accuracy.

Acknowledgement

This study was supported by Gangneung-Wonju National University.

Table 1. Cover song identification performance of the CRP and the CTP on the covers80 dataset for different temporal smoothing methods. The evaluation metrics are mean average precision (MAP) and mean rank of the first correctly identified cover (MR1).

Temporal smoothing method    CRP MAP   CRP MR1   CTP MAP   CTP MR1
Results by using the MOVE method
Without smoothing            0.653     9.76      0.675     9.09
Gaussian smoothing           0.679     9.92      0.671     8.66
Fourier smoothing            0.678     10.66     0.676     9.04
Morphological smoothing      0.681     11.80     0.671     10.54
Anisotropic smoothing        0.674     10.12     0.705     8.66
Results by using the OTI-SW method
Without smoothing            0.592     21.48     0.646     18.67
Gaussian smoothing           0.650     18.32     0.665     17.09
Fourier smoothing            0.671     15.29     0.697     15.43
Morphological smoothing      0.670     15.97     0.682     15.54
Anisotropic smoothing        0.671     15.46     0.699     13.95
Post interpolation [3]       0.605     24.86     0.669     17.23

Table 2. Cover song identification performance of the CRP and the CTP on the covers330 dataset for different temporal smoothing methods. The evaluation metrics are mean average precision (MAP) and mean rank of the first correctly identified cover (MR1).

Temporal smoothing method    CRP MAP   CRP MR1   CTP MAP   CTP MR1
Results by using the MOVE method
Without smoothing            0.647     4.17      0.726     3.58
Gaussian smoothing           0.672     3.22      0.749     2.78
Fourier smoothing            0.675     3.09      0.757     2.09
Morphological smoothing      0.686     2.34      0.733     3.23
Anisotropic smoothing        0.687     2.72      0.767     2.12
Results by using the OTI-SW method
Without smoothing            0.729     7.15      0.781     5.79
Gaussian smoothing           0.745     5.77      0.799     5.50
Fourier smoothing            0.749     5.35      0.802     5.43
Morphological smoothing      0.742     5.32      0.794     5.27
Anisotropic smoothing        0.747     5.26      0.798     5.42
Post interpolation [3]       0.710     5.70      0.767     4.24

References

1. M. Muller and S. Ewert, "Towards timbre-invariant audio features for harmony-based music," IEEE Trans. Audio Speech Lang. Process. 18, 649-662 (2010).
2. M. Muller and S. Ewert, "Chroma Toolbox: MATLAB implementations for extracting variants of chroma-based audio features," Proc. ISMIR-2011, 215-220 (2011).
3. J. Seo, "Salient chromagram extraction based on trend removal for cover song identification," IEICE Trans. Inf. & Syst. 104, 51-54 (2021).
4. J. Serra, E. Gomez, P. Herrera, and X. Serra, "Chroma binary similarity and local alignment applied to cover song identification," IEEE Trans. Audio Speech Lang. Process. 16, 1138-1151 (2008).
5. J. Seo, "A relevance-based pairwise chromagram similarity for improving cover song retrieval accuracy" (in Korean), J. Acoust. Soc. Kr. 43, 200-206 (2024).
6. F. Yesiler, J. Serra, and E. Gomez, "Accurate and scalable version identification using musically-motivated embeddings," Proc. ICASSP-2020, 21-25 (2020).
7. R. Hodrick and E. Prescott, "Postwar U.S. business cycles: An empirical investigation," J. Money, Credit and Banking, 29, 1-16 (1997).
8. A. Wink and J. Roerdink, "Denoising functional MR images: a comparison of wavelet denoising and Gaussian smoothing," IEEE Trans. Medical Imaging, 23, 374-387 (2004).
9. D. Song, A. Baek, and N. Kim, "Forecasting stock market indices using padding-based Fourier transform denoising and time series deep learning models," IEEE Access, 9, 83786-83796 (2021).
10. P. Maragos, "Tutorial on advances in morphological image processing and analysis," Optical Engineering, 26, 623-632 (1987).
11. P. Perona and J. Malik, "Scale-space and edge detection using anisotropic diffusion," IEEE Trans. Pattern Analysis and Machine Intelligence, 12, 629-639 (1990).
12. A. S. Yaro, F. Maly, P. Prazak, and K. Maly, "Outlier detection performance of a modified z-score method in time-series rss observation with hybrid scale estimators," IEEE Access, 12, 12785-12796 (2024).
13. C. Harte, Towards automatic extraction of harmony information from music signals, (Ph.D. dissertation, Queen Mary University of London, 2010).
14. Covers80 Cover Song Data Set, http://labrosa.ee.columbia.edu/projects/coversongs/covers80/, (Last viewed March 7, 2025).
15. T. F. Smith and M. S. Waterman, "Identification of common molecular subsequences," J. Molecular Biology, 147, 195-197 (1981).
16. MOVE Code, https://github.com/furkanyesiler/move, (Last viewed March 7, 2025).

Profile

Jin Soo Seo (서진수)
He received the B.S., M.S., and Ph.D. degrees from Korea Advanced Institute of Science and Technology in 1998, 2000, and 2005, respectively, all in electrical engineering. He was a senior researcher at ETRI from 2006 to 2008. He joined the Department of Electrical Engineering at Gangneung-Wonju National University in 2008. His research interests are speech and audio processing, multimedia retrieval, and pattern recognition.

2025 Journal of the Acoustical Society of Korea Special Issue: "Underwater Acoustics"
Special issue editor-in-chief: Prof. 변기훈 (Korea Maritime & Ocean University)

Underwater target scattering model with multipath effects
이근화
Arrival-angle fluctuations of mid-frequency acoustic propagation signals affected by sea ice, measured at medium range with a vertical line array in the Arctic Ocean acoustic experiment (KAMAS-24)
박정수, 손수욱, 박중용, 이대혁, 김우식, 배호석, 김한수, 윤영글, 조성호, 강돈혁, 손우주
Impedance matching circuit design for maximum power transfer in Janus Helmholtz transducers
Yoonsang Jeong, Kibae Lee, Hyun Hee Yim, and Chong Hyun Lee
A study on spreading factor selection criteria for code division multiple access in underwater communication environments
홍예권, 정지원, 김준호, 안병선
Doppler shift frequency estimation using single chirp and continuous wave replicas
오단비, 김기만, 김준호, 안병선
Performance comparison of matched field processing and machine learning-based source range estimation under mismatch conditions
박소연, 김근환, 변기훈
Phase compensation-based matched field processing for array tilt mismatch
이유진, 김동현, 변기훈
Localization of multiple uncorrelated sources using matched field processing with multiple tilt constraints
김동현, 변기훈

Underwater target scattering model with multipath effects

Keunhwa Lee (이근화) 1†
1 Department of Defense Systems Engineering, Sejong University
(Received April 7, 2025; accepted July 14, 2025)

The Journal of the Acoustical Society of Korea Vol. 44, No. 4, pp. 347-353 (2025)
https://doi.org/10.7776/ASK.2025.44.4.347
pISSN: 1225-4428, eISSN: 2287-3775

†Corresponding author: Keunhwa Lee (nasalkh2@sejong.ac.kr), Department of Ocean Systems Engineering, Sejong University, 209 Neungdong-ro, Gwangjin-gu, Seoul 05006, Republic of Korea (Tel: 82-2-3408-3508, Fax: 82-2-3408-4340)

Copyright ⓒ 2025 The Acoustical Society of Korea. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract (translated from the Korean): Underwater target detection is advancing toward making maximal use of signal coherence. This study introduces a target scattering model that accounts for the coupling between the underwater waveguide and the underwater target. The generalized target scattering model is obtained with the waveguide Green's function method and is expressed in the form of an integral equation. In this study, physical optics theory is applied to the underwater target to simplify the integral equation into an algebraic equation. When a Green's function obtained by the ray or normal mode method is then applied, the final expression is represented as a set of ray or normal mode pairs. This scattering model is useful for simulating target scattering in an underwater waveguide more realistically.

ABSTRACT: Underwater target detection has advanced toward fully exploiting the coherence of the signal. In this paper, we introduce a target scattering model that considers the coupling between target scattering and propagation in the ocean waveguide. The proposed target scattering model is derived using the waveguide Green's function method and is expressed in the form of an integral equation. By applying physical optics theory to this integral equation, an explicit form for the scattered pressure is obtained, and the final expression is represented as a set of ray or normal mode pairs using the waveguide Green's function. This scattering model will be useful for simulating target scattering in the underwater environment more realistically.

Keywords: Target scattering model, Ocean waveguide, Multipath, Sonar equation
PACS numbers: 43.30.Gv, 43.30.Bp.

I. Introduction

In the active sonar equation, target strength quantifies the redistribution of acoustic energy by an underwater target. Target strength is defined as 10 times the common logarithm of the ratio of the scattered intensity at a point 1 m from the acoustic center of the target to the incident intensity arriving from a distant source. Here, the underwater target and the source are assumed to lie in an unbounded medium [1].

Several studies [2-4] have addressed underwater acoustic target strength models in an unbounded environment. In a real ocean environment, however, the source and the underwater target are bounded by the sea surface and the seabed. The signal traveling from the source to the underwater target arrives through the multiple paths of the underwater waveguide. The energy of the incident signal is redistributed by volume scattering at the target, and the signal again travels along multiple paths as it propagates to the receiver. While the energy is being redistributed by the target, the waves arriving along the individual paths interfere with one another, so that the finally received signal is a complicated sum of the waves over all two-way paths. Ingenito [5], for the target arising underwater as described above,