Table of Contents

Sr.No Topic
01 Abstract
02 Introduction
03 Problem
04 Methodology
05 Analysis
06 Conclusion
07 References

ABSTRACT
Pitch detection is a pivotal component of speech signal processing, integral to applications such as
voice recognition, music analysis, and diverse audio processing tasks. This comprehensive report
delves into an array of pitch detection methodologies, encompassing both time and frequency
domain analyses, as well as auto-correlation and cepstrum techniques. The primary objective of
this study is to elevate the precision of fundamental frequency estimation, thereby fostering
advancements in pitch detection within speech signals.
By combining various analytical approaches, this research strives to refine the understanding of
pitch-related features and contribute to the ongoing evolution of pitch detection methods. The
significance of accurate pitch detection reverberates across domains, influencing fields like
telecommunications, entertainment, and human-computer interaction. As technological
applications relying on vocal interfaces continue to burgeon, the outcomes of this study hold the
potential to augment the efficacy of speech-based systems and enhance user experiences in
manifold domains.

INTRODUCTION
Everyday sounds reach our ears as a mixture of sources, each with its own pitch. In speech, pitch
is the perceptual correlate of the fundamental frequency, the rate at which the vocal folds vibrate,
and it is a crucial part of how we speak. As science and technology advance, things get simpler for
us. But with more progress comes more discovery and new problems to solve. This report explores the
world of sounds, investigating how we perceive and understand them. We aim to unravel the
complexities, striking a balance between the straightforward and the intricate, as we navigate the
realms of human perception and technological innovation.

WHAT IS THE PROBLEM?


Our keen interest in exploring the realm of real and computer-generated sounds motivates the
selection of this research topic. The focus is particularly on delving into the world of interactive
music to address potential shortcomings in electronic music spontaneity. However, a significant
challenge arises in the precise measurement of pitch periods using only one acoustic signal,
especially in noisy speech environments. The detailed structure of the signal can become
corrupted, leading to inaccurate fundamental frequency measurements. This problem intensifies in
complex settings such as streets, shopping malls, or cafes, where distinguishing between speech
and various other acoustical signals becomes challenging due to high noise levels. The presence
of background noise with statistical characteristics similar to speech further complicates the
detection of the speech signal of interest, particularly in spaces like exhibition halls.

METHODOLOGY:
1. Data Collection:
Source: https://www.kaggle.com/competitions/pitch/data
Our investigation into pitch detection techniques begins with a comprehensive dataset from
Kaggle's Pitch Competition. This dataset encompasses a diverse range of sounds, forming the basis
for our analysis.

2. Data Preprocessing:
Specific sound samples were meticulously chosen from the Kaggle dataset for the application of
pitch detection techniques. These samples were strategically selected to represent a variety of
acoustic scenarios and challenges.

3. Implementation of Pitch Detection Techniques:
Various techniques have been developed to address the challenges associated with pitch detection.
These methods fall into three distinct categories:
1. Time Domain Detection
2. Frequency Domain Detection
3. Statistical Detection (Machine Detectors based on models of Ears)
A time-domain pitch detector gauges the pitch period by identifying glottal closure instants (GCIs)
and measuring the time intervals between these events. Complementarily, a frequency-domain pitch
detector transforms the speech signal frame by frame and reads the pitch from the resulting
spectrum. In our project, we focus on both Time Domain Detection and Frequency Domain Detection
to determine the pitch of the input speech signals.

1. TIME DOMAIN DETECTION


Time domain detection primarily leverages the time-domain properties of speech signals. Time-domain
analysis studies how a quantity varies over time, and a time domain graph visually illustrates how
a signal changes with time. Pitch detectors operating in the time domain directly estimate the pitch
period from the speech waveform. The underlying assumption is that, for quasiperiodic signals, once
the influence of formant structure has been reduced, straightforward time-domain measurements
yield reliable period estimates. In time domain feature detection methods, the signal is typically
preprocessed to emphasize certain time-domain features, such as peaks or zero crossings. The period
of the signal is then calculated as the difference between the times at which these features occur.
In our pitch detection approach, two methods are applied to the time-domain waveform:

• Fast Fourier Transform (FFT)


• Autocorrelation
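
As a minimal sketch of the feature-interval idea described above, the following Python example uses
upward zero crossings as the time-domain feature. Zero crossings are chosen here only for
illustration and are not one of the methods used in this report; the function name
`zero_crossing_pitch` and the synthetic 200 Hz tone are assumptions. Real speech would normally
need low-pass filtering first so that harmonics do not add extra crossings.

```python
import numpy as np

def zero_crossing_pitch(signal, fs):
    """Estimate pitch from the spacing of upward zero crossings (illustrative only)."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()                     # remove any DC offset
    # Indices where the waveform goes from negative to non-negative.
    rising = np.flatnonzero((signal[:-1] < 0) & (signal[1:] >= 0))
    if len(rising) < 2:
        return None                                     # not enough features to measure a period
    periods = np.diff(rising) / fs                      # time between successive features, in seconds
    return 1.0 / np.median(periods)                     # median is robust to an occasional bad interval

# Assumed test signal: a pure 200 Hz tone sampled at 8 kHz.
fs = 8000
t = np.arange(2000) / fs
tone = np.sin(2 * np.pi * 200 * t + 0.3)
estimate = zero_crossing_pitch(tone, fs)
print(f"Zero-crossing estimate: {estimate:.1f} Hz")
```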

• Fast Fourier Transform:


The Fast Fourier Transform (FFT) is an efficient algorithm for computing the discrete Fourier
transform, which converts a signal from time-domain data to frequency space. Used for pitch
detection, where the pitch is read from the peaks of the resulting spectrum, it has demonstrated its
effectiveness and robustness in practice, especially when applied to natural sounds like voice and
classical musical instruments.
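
One simple way to turn the FFT into a pitch estimate is to pick the largest spectral peak inside a
plausible pitch range. The sketch below assumes NumPy, a Hann window, and a search range of
50 to 500 Hz; the function name `fft_pitch` and the synthetic tone are illustrative assumptions,
not the report's implementation. Peak picking of this kind only works when the fundamental is the
strongest component in the search range.

```python
import numpy as np

def fft_pitch(signal, fs, fmin=50.0, fmax=500.0):
    """Estimate pitch as the frequency of the largest spectral peak in [fmin, fmax]."""
    signal = np.asarray(signal, dtype=float)
    windowed = signal * np.hanning(len(signal))        # taper the frame to reduce spectral leakage
    spectrum = np.abs(np.fft.rfft(windowed))           # magnitude spectrum from 0 to fs/2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)   # frequency of each FFT bin in Hz
    band = (freqs >= fmin) & (freqs <= fmax)           # keep only the assumed pitch range
    return freqs[band][np.argmax(spectrum[band])]

# Assumed test signal: 220 Hz fundamental plus a weaker second harmonic.
fs = 8000
t = np.arange(4096) / fs
tone = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)
estimate = fft_pitch(tone, fs)
print(f"FFT estimate: {estimate:.1f} Hz")
```

The resolution of this estimate is limited by the bin spacing fs/N, roughly 2 Hz for the frame
length assumed here.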

• Autocorrelation:
The autocorrelation function serves as a fundamental element in time-domain pitch detection
algorithms. The key concept behind employing this function is to generate a representation that
exhibits significant peaks at lags corresponding to multiples of the period of the waveform, with
the largest peak occurring first, at zero lag. The technique works entirely in the time domain by
correlating a segment of a signal with a delayed copy of itself. Autocorrelation becomes
particularly useful when dealing with low-frequency components. The essence of this method lies in
evaluating the distance between the positions of the maximum (zero lag) and the second maximum of
the correlation.
In the context of the autocorrelation function (ACF), correlation measures the similarity between
two input functions. For the autocorrelation function Γ(d), the input functions are identical,
represented by the same signal x(n), as illustrated in the equation:

\[ \Gamma(d) = \sum_{n=0}^{N-1-d} x(n)\, x(n+d) \]

Here, 'd' represents the lag or delay between the signal and a delayed segment, and 'N' denotes the
number of samples in the input signal. In cases where the signal is periodic or quasi-periodic, the
similarity between x(n) and x(n+d) is highest, and the correlation registers high values, when the
lag equals a period or a multiple thereof. The pitch is therefore estimated as the frequency
\( \frac{fs}{d} \), where \( d \) is the nonzero lag at which the maximum of the ACF occurs and
\( fs \) is the sampling frequency of the speech signal. Because the autocorrelation function is
the inverse Fourier Transform of the power spectrum of the input signal, it discards phase
information, so the technique is independent of unknown phase relations. Formant structure can
still introduce competing peaks, which is why the search is usually restricted to a plausible
range of lags.
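
The lag-to-pitch rule above can be sketched in a few lines of Python. This is a minimal
illustration assuming NumPy and a search range of 50 to 500 Hz; the name `autocorr_pitch`, the
bounds `fmin` and `fmax`, and the synthetic 200 Hz tone are assumptions rather than the report's
actual code. The ACF is computed directly here, without any normalisation.

```python
import numpy as np

def autocorr_pitch(signal, fs, fmin=50.0, fmax=500.0):
    """Estimate pitch as fs/d, where d is the lag of the largest ACF peak in the search range."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()                    # remove DC so it does not dominate the ACF
    n = len(signal)
    # Direct ACF: sum over n of x(n) * x(n + d); keep lags d = 0 .. n-1 only.
    acf = np.correlate(signal, signal, mode="full")[n - 1:]
    d_min = int(fs / fmax)                             # shortest period considered, in samples
    d_max = min(int(fs / fmin), n - 1)                 # longest period considered, in samples
    d = d_min + np.argmax(acf[d_min:d_max + 1])        # skip the zero-lag maximum
    return fs / d

# Assumed test signal: 200 Hz fundamental (period of 40 samples) plus a second harmonic.
fs = 8000
t = np.arange(2000) / fs
tone = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 400 * t)
estimate = autocorr_pitch(tone, fs)
print(f"Autocorrelation estimate: {estimate:.1f} Hz")
```

Restricting the search to lags between `d_min` and `d_max` is what keeps the zero-lag peak and
very short lags from being picked.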

2. FREQUENCY DOMAIN DETECTION


This category primarily leverages the frequency-domain properties inherent in speech signals.
Frequency-domain analysis describes a signal in terms of frequency, the number of cycles per unit
time, rather than time. A frequency domain graph visually depicts how the signal's energy is
distributed across a range of frequencies.
The key principle employed in frequency domain pitch detection relies on the periodic nature of
signals. In a periodic signal, the frequency spectrum comprises a series of impulses at the
fundamental frequency (f0) and its harmonics (2f0, 3f0, and so forth). The methodology involves
transforming the signal into the frequency domain and then either locating the first harmonic
peak or finding the greatest common divisor of all harmonic frequencies, among other
manifestations of the period.
To enhance accuracy in locating harmonic peaks, windowing the signal is a common practice, as it
reduces spectral leakage. Depending on the window type, a minimum number of periods of the signal
must be included in each frame. Simple measurements can then be conducted on the frequency spectrum
of the signal, or on a nonlinearly transformed version of it, as in the cepstral pitch detector, to
estimate the period of the signal.
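
The effect of windowing on spectral leakage can be illustrated with a short experiment. The sketch
below assumes NumPy and compares a rectangular window with a Hann window on a sinusoid placed
halfway between two FFT bins; the helper name `leakage_fraction` and the `guard_bins` parameter
are hypothetical and introduced only for this illustration.

```python
import numpy as np

def leakage_fraction(frame, window, guard_bins=4):
    """Fraction of spectral energy lying more than guard_bins away from the main peak."""
    power = np.abs(np.fft.rfft(frame * window)) ** 2
    peak = int(np.argmax(power))
    lo, hi = max(peak - guard_bins, 0), peak + guard_bins + 1
    return 1.0 - power[lo:hi].sum() / power.sum()

fs = 8000
n = 1024
t = np.arange(n) / fs
# Assumed worst case: the tone sits halfway between FFT bins 28 and 29.
frame = np.sin(2 * np.pi * (28.5 * fs / n) * t)

rect_leak = leakage_fraction(frame, np.ones(n))        # no tapering
hann_leak = leakage_fraction(frame, np.hanning(n))     # Hann taper
print(f"rectangular: {rect_leak:.4f}, Hann: {hann_leak:.6f}")
```

The Hann window trades a wider main lobe for much lower side lobes, which is why less energy
spills into distant bins.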

Within frequency-based methods, the signal frame undergoes transformation into the frequency
domain, often facilitated by the Fourier transform. A notable technique employed in frequency
domain pitch detection is the Cepstrum Method.

• Cepstrum Method:
A Cepstrum is the result of performing the Inverse Fourier Transform (IFT) on the logarithm of
the estimated spectrum of a signal. This technique finds application in the analysis of human
speech. The term 'Cepstrum' is derived by reversing the first four letters of 'spectrum'. Operations
on cepstra are named in the same way: quefrency analysis, liftering, and cepstral analysis. The
power Cepstrum proves useful in exploring the periodicity of harmonic signals in the frequency
representation: the harmonics form a regularly spaced pattern in the log spectrum, so transforming
it once more reveals a peak at the quefrency corresponding to the fundamental period. This
process can be interpreted as a de-convolution, especially when the input signal is produced by a
train of impulses convolved with a filter. Convolution in time becomes multiplication in the
frequency domain, the logarithm turns that multiplication into addition, and the second transform
separates the slowly varying filter contribution from the periodic excitation, ultimately revealing
the fundamental frequency.
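
The cepstrum procedure described above can be sketched as follows. This is a minimal real-cepstrum
example assuming NumPy, a Hann window, and a search range of 50 to 500 Hz; the function name
`cepstrum_pitch` and the synthetic frame (an impulse train passed through a simple decaying filter,
with a little added noise) are assumptions made for illustration, not the report's implementation.

```python
import numpy as np

def cepstrum_pitch(frame, fs, fmin=50.0, fmax=500.0):
    """Estimate pitch from the largest real-cepstrum peak in the assumed pitch range."""
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hanning(len(frame))
    # Log magnitude spectrum; the small offset avoids log(0).
    log_spectrum = np.log(np.abs(np.fft.rfft(windowed)) + 1e-12)
    cepstrum = np.fft.irfft(log_spectrum)              # index = quefrency in samples
    q_min, q_max = int(fs / fmax), int(fs / fmin)      # quefrency bounds for fmax .. fmin
    q = q_min + np.argmax(cepstrum[q_min:q_max + 1])   # quefrency of the strongest peak
    return fs / q

# Assumed test frame: impulses every 40 samples (200 Hz at 8 kHz) convolved with a decaying filter.
fs, period, n = 8000, 40, 512
rng = np.random.default_rng(0)
excitation = np.zeros(n)
excitation[::period] = 1.0
frame = np.convolve(excitation, 0.9 ** np.arange(50))[:n]
frame = frame + rng.normal(0.0, 0.01, n)               # small noise floor, as in a real recording
estimate = cepstrum_pitch(frame, fs)
print(f"Cepstrum estimate: {estimate:.1f} Hz")
```

The filter contributes mainly to low quefrencies, so limiting the search to `q_min` and above keeps
it from being mistaken for the pitch peak.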

ANALYSIS
Exploring effective pitch detection methods is essential in the realm of sound analysis. Two
primary techniques, Autocorrelation and Cepstrum, stand out for their unique characteristics. This
empirical evaluation aims to provide insights into their complexities and efficiencies.

1. Autocorrelation:
Advantages:
• Simplicity and Efficiency
• Conceptual Ease in Mathematical Modeling
Disadvantages:
• Challenge in Peak Level Selection
• Moderate Error in Pitch Calculation

2. Cepstrum Analysis:
Advantages:

• Simplified Spectral Component Estimation


• Moderate Time Computation with FFT and IFFT

Disadvantages:

• Computational Intensity of FFT and IFFT


• Inherent Low-Pass Filtering and Sensitivity to Dominant Frequencies

Empirical Comparison
Time Complexity:
Autocorrelation (AUTO): O(n) per lag, O(n^2) over all lags when computed directly
Cepstrum (CEPS): O(n log n) with FFT and IFFT

Latency Ranking:
Cepstrum (CEPS) >> Autocorrelation (AUTO)

Error in Computation Ranking:


Cepstrum (CEPS) >> Autocorrelation (AUTO)

Trade-offs:
Autocorrelation: Simplicity and Efficiency
Cepstrum: Computational Intensity with Enhanced Pitch Analysis
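
A comparison of this kind can be reproduced with a small timing and error harness. The sketch
below is one possible setup, assuming NumPy, compact versions of the two estimators, and a
synthetic 200 Hz frame; the function `evaluate`, the `repeats` parameter, and the test frame are
hypothetical. Measured timings depend on the machine and frame length, so no expected figures are
given.

```python
import time
import numpy as np

def autocorr_pitch(x, fs, fmin=50.0, fmax=500.0):
    x = x - x.mean()
    acf = np.correlate(x, x, mode="full")[len(x) - 1:]    # direct ACF, non-negative lags
    d_min, d_max = int(fs / fmax), int(fs / fmin)
    return fs / (d_min + np.argmax(acf[d_min:d_max + 1]))

def cepstrum_pitch(x, fs, fmin=50.0, fmax=500.0):
    log_spec = np.log(np.abs(np.fft.rfft(x * np.hanning(len(x)))) + 1e-12)
    ceps = np.fft.irfft(log_spec)                         # real cepstrum
    q_min, q_max = int(fs / fmax), int(fs / fmin)
    return fs / (q_min + np.argmax(ceps[q_min:q_max + 1]))

def evaluate(estimator, frame, fs, true_f0, repeats=20):
    """Return the mean runtime per call and the absolute pitch error in Hz."""
    start = time.perf_counter()
    for _ in range(repeats):
        estimate = estimator(frame, fs)
    elapsed = (time.perf_counter() - start) / repeats
    return {"seconds": elapsed, "error_hz": abs(estimate - true_f0)}

# Assumed test frame: impulse train (period 40 samples at 8 kHz) through a decaying filter.
fs, period, n = 8000, 40, 512
rng = np.random.default_rng(0)
excitation = np.zeros(n)
excitation[::period] = 1.0
frame = np.convolve(excitation, 0.9 ** np.arange(50))[:n] + rng.normal(0.0, 0.01, n)

results = {name: evaluate(fn, frame, fs, fs / period)
           for name, fn in [("AUTO", autocorr_pitch), ("CEPS", cepstrum_pitch)]}
for name, r in results.items():
    print(f"{name}: {r['seconds'] * 1e6:.0f} us per frame, error {r['error_hz']:.2f} Hz")
```

A fair comparison would average over many frames and noise levels rather than a single synthetic
frame.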

CONCLUSION
In summary, our exploration of pitch detection techniques, focusing on Autocorrelation and
Cepstrum analysis, revealed distinct trade-offs. Autocorrelation excelled in simplicity and real-
time applications, while Cepstrum offered enhanced accuracy despite higher computational
demands. The empirical evaluation showcased each method's strengths and weaknesses, guiding
their applicability based on specific requirements. Looking ahead, potential hybrid approaches and
the integration of machine learning could further refine pitch detection in diverse acoustic
scenarios. This project lays a foundational understanding, contributing to ongoing advancements
in sound analysis methodologies.

REFERENCES
https://en.wikipedia.org/wiki/Pitch_detection
https://www.kaggle.com/competitions/pitch/data
https://www.ijsr.net/archive/v4i3/SUB151957.pdf
https://mural.maynoothuniversity.ie/14192/1/JT_an%20investigation.pdf
https://ccrma.stanford.edu/~pdelac/154/m154paper.htm
https://www.section.io/engineering-education/machine-learning-for-audio-classification/
https://ieeexplore.ieee.org/document/9277448
