
Audio Classification on Edge AI


We see text data and image data in our day-to-day life, but have you ever thought about what audio data looks like? Is it just an audio file such as .mp3 or .wav? And how can we use it for audio analysis tasks such as audio classification, speech recognition, environmental sound classification, or voice assistants like Alexa? This series of blogs will give you an insight into all of that.

  • In this first blog we will see what audio data looks like and how it is related to the Fourier series and the Fourier transform.
  • In the second blog we will look in depth at the logic behind representing audio data as STFT and MFCC, and at the advantages of each.
  • In the third blog we will take raw audio data, convert it into MFCC format, train a model using the Keras EfficientNetV2 architecture, and compare its accuracy with other audio data representations.
  • In the last blog we will convert the trained model into C++ format for deployment on a microcontroller device like the ESP-32.

Audio data format:

We know that ML only understands numbers, i.e. any data fed into an ML or DL model has to be represented as real numbers. Whether your data is text, image, or audio, it is converted into a specific numeric format that can be fed into the model as input. The catch is that the representations of text, audio, and image data can be quite different, and each representation has its own advantages and disadvantages. I know that last line might be confusing, so let me pull up an example.

1. Audio data representation as real numbers:

Similar to text and images, audio is unstructured data, meaning that it is not arranged in tables with connected rows and columns. Audio is often recorded in analog format (e.g. physical tape, disc, etc.), but it is then converted into digital format. Why is that? Digital data is easy to share and edit, and you can manipulate it without hassle. How do we manipulate it? For that you should definitely know what the Fourier series and the Fourier transform are. Now, coming back, let me give you a short insight into what analog and digital data look like in diagram-1 below.

(Note: In the above diagram, the red curve is the analog data and the blue vertical bars over the red curve are the digital data, which is obtained by sampling the analog curve at regular intervals and rounding each sample to a fixed set of levels.)

Let me show the difference between analog and digital audio data in diagram-2 below. From this we can clearly see that the digital sound wave may look slightly different from the analog one, but it is a very close approximation in which much of the information about the audio is preserved, and we can easily manipulate this digital audio data. How is that? Each sample is stored as a number, so the digital audio is ultimately just a sequence of zeros and ones that a program can read and modify.
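To make the analog-to-digital step a bit more concrete, here is a minimal sketch of sampling and quantization using NumPy. It is only an illustration: the function name `sample_and_quantize`, the 5 Hz sine standing in for the analog curve, the 50 Hz sample rate, and the 8-bit depth are all assumptions chosen for this example, not values from a real recording.

```python
import numpy as np

def sample_and_quantize(freq_hz, duration_s, sample_rate, n_bits):
    """Sample a sine tone and round each sample to a signed integer level."""
    n_samples = int(round(duration_s * sample_rate))
    t = np.arange(n_samples) / sample_rate        # sampling instants in seconds
    analog = np.sin(2 * np.pi * freq_hz * t)      # stand-in for the analog curve, in [-1, 1]
    max_level = 2 ** (n_bits - 1) - 1             # e.g. 127 for 8 bits
    digital = np.round(analog * max_level).astype(np.int16)
    return t, analog, digital

# Hypothetical settings: a 5 Hz tone, 1 second, 50 samples per second, 8 bits
t, analog, digital = sample_and_quantize(freq_hz=5.0, duration_s=1.0,
                                         sample_rate=50, n_bits=8)

# Map the integers back to [-1, 1] to see how close the approximation is
reconstructed = digital / 127.0
max_error = np.max(np.abs(reconstructed - analog))

print(digital[:10])
print(f"largest quantization error: {max_error:.4f}")
```

The integers in `digital` play the role of the blue bars: a finite list of numbers that approximates the smooth curve. Raising the sample rate or the bit depth makes the approximation closer.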

Now, coming to the main part: audio data can be represented in many different ways, such as raw form, STFT, FFT, MFCC, chromagram, Mel-spectrogram, etc. Each has its own advantages and disadvantages. In the upcoming blog we will discuss the most commonly used representations of audio data, i.e. STFT and MFCC.

Before we begin, we must know some basics of audio characteristics.

Basics: A sound is a wave of vibrations traveling through a medium like air or water and finally reaching our ears. It has three key characteristics to be considered when analyzing audio data: time period, amplitude, and frequency.

Time period is how many seconds it takes to complete one cycle of vibration; it is the inverse of the frequency.

Amplitude is the size of the vibration. The corresponding sound level is usually expressed in decibels (dB), and we perceive it as loudness.

Frequency, measured in Hertz (Hz), indicates how many sound vibrations happen per second. People interpret frequency as low or high pitch. While frequency is an objective parameter, pitch is subjective. The human hearing range lies between 20 and 20,000 Hz. Scientists claim that most people perceive all sounds below 500 Hz as low pitch, like the roar of a plane engine. In turn, high pitch for us is everything beyond 2,000 Hz (for example, a whistle).
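The relationship between these quantities can be sketched in a few lines of Python. The helper names `period_seconds` and `amplitude_ratio_to_db` are hypothetical, and the decibel formula used here is the standard 20·log10 of an amplitude ratio against a reference value that you choose yourself.

```python
import math

def period_seconds(frequency_hz):
    """Time for one full cycle: the inverse of the frequency."""
    return 1.0 / frequency_hz

def amplitude_ratio_to_db(amplitude, reference):
    """Level in decibels of an amplitude relative to a chosen reference."""
    return 20.0 * math.log10(amplitude / reference)

# A 500 Hz tone completes one cycle every 2 milliseconds
print(period_seconds(500.0))              # 0.002

# An amplitude 10 times the reference is 20 dB above it
print(amplitude_ratio_to_db(10.0, 1.0))   # 20.0
```

Note that decibels are always relative: the same amplitude gives a different dB value if you change the reference.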

Fourier series:

Diagram-3 below is a digital representation of an audio signal. Let's assume that I have sampled this data at 16 kHz. That means one second of audio is a sequence of 16,000 amplitude values (samples), which you can think of as one long vector. So if the audio is 10 seconds long, there are 16,000 × 10 = 160,000 amplitudes in total, and that is a lot! So how do we extract the necessary information from this giant set of amplitudes? This is where the Fourier series and the Fourier transform help us.
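To see that "giant set of amplitudes" for yourself, here is a small sketch that writes a 10 second, 16 kHz tone to an in-memory WAV file with Python's standard `wave` module and reads it back as a NumPy array. The 440 Hz tone, the 16-bit mono format, and the variable names are assumptions made for the example; with a real recording you would open your own .wav file instead.

```python
import io
import wave
import numpy as np

sample_rate = 16000          # samples per second (16 kHz, as in the text)
duration_s = 10
n_samples = sample_rate * duration_s

# Synthesize a placeholder 440 Hz tone and store it as 16-bit integers
t = np.arange(n_samples) / sample_rate
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)
pcm = np.round(tone * 32767).astype("<i2")

# Write a mono, 16-bit WAV file into memory
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav_out:
    wav_out.setnchannels(1)
    wav_out.setsampwidth(2)          # 2 bytes = 16 bits per sample
    wav_out.setframerate(sample_rate)
    wav_out.writeframes(pcm.tobytes())

# Read it back: the "audio file" is just a long vector of numbers
buffer.seek(0)
with wave.open(buffer, "rb") as wav_in:
    rate = wav_in.getframerate()
    frames = wav_in.readframes(wav_in.getnframes())
samples = np.frombuffer(frames, dtype="<i2")

print(rate, samples.shape)   # 16000 (160000,)
print(samples[:8])
```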

In simple terms, the Fourier series is a mathematical representation of a periodic wave (such as a steady tone) in the time-amplitude domain as a sum of sine and cosine functions.
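A classic way to see this in action is the square wave, whose Fourier series contains only odd sine harmonics: (4/π)·(sin(x) + sin(3x)/3 + sin(5x)/5 + ...). The sketch below (the function name `square_wave_partial_sum` and the choice of a square wave are mine, purely for illustration) adds up more and more terms and measures how close the sum gets to the real square wave.

```python
import numpy as np

def square_wave_partial_sum(t, period, n_terms):
    """Sum the first n_terms odd sine harmonics of a square wave."""
    total = np.zeros_like(t, dtype=float)
    for i in range(n_terms):
        k = 2 * i + 1                                   # odd harmonics: 1, 3, 5, ...
        total += np.sin(2 * np.pi * k * t / period) / k
    return 4.0 / np.pi * total

# One period of a square wave with period 1 second
t = np.linspace(0.0, 1.0, 1000, endpoint=False)
target = np.sign(np.sin(2 * np.pi * t))

errors = []
for n_terms in (1, 5, 50):
    approx = square_wave_partial_sum(t, period=1.0, n_terms=n_terms)
    rms_error = np.sqrt(np.mean((approx - target) ** 2))
    errors.append(rms_error)
    print(f"{n_terms:3d} terms -> RMS error {rms_error:.3f}")
```

The error shrinks as terms are added, which is the whole idea: a periodic wave, however angular, can be built from plain sines and cosines.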

Fourier transform:

It is a mathematical representation of aperiodic (non-periodic) waves in the frequency-amplitude domain. How is it done? The Fourier transform takes a time-domain signal and transforms it into a frequency-domain representation. It represents the frequency content of an entire signal by computing the complex amplitudes of its individual frequency components. The resulting spectrum gives a detailed view of the frequency content of the entire signal, but it does not provide any information about how the frequency content changes over time. Let me show you what the individual frequencies look like in the Fourier transform:
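The "no information about time" point is easy to check with NumPy's FFT. In this sketch (the 100 Hz and 300 Hz tones and the 8 kHz sample rate are arbitrary choices for the example), two signals play the same two tones in opposite order, and their magnitude spectra turn out to be identical.

```python
import numpy as np

sample_rate = 8000
half = np.arange(sample_rate // 2) / sample_rate    # time axis for 0.5 seconds

low = np.sin(2 * np.pi * 100.0 * half)     # 100 Hz tone, 0.5 s
high = np.sin(2 * np.pi * 300.0 * half)    # 300 Hz tone, 0.5 s

signal_a = np.concatenate([low, high])     # low tone first, then high
signal_b = np.concatenate([high, low])     # high tone first, then low

# Magnitude spectrum of each whole signal
mag_a = np.abs(np.fft.rfft(signal_a))
mag_b = np.abs(np.fft.rfft(signal_b))

print("same waveform?          ", np.allclose(signal_a, signal_b))
print("same magnitude spectrum?", np.allclose(mag_a, mag_b))
```

The ordering of the tones is hidden in the phase of the complex spectrum, not in the magnitudes, so a plain magnitude spectrum cannot tell these two signals apart.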

In layman's terms: an audio signal is made up of several single-frequency sound waves (in the above diagram we can see three single-frequency sound waves). In the time-amplitude representation, we were only able to see the red squiggly wave that results from adding the amplitudes of all the waves at different frequencies at each time step. The Fourier transform helps us here by decomposing a signal into its individual frequencies and the amplitude corresponding to each frequency in the frequency-amplitude domain. Now you may be wondering whether there is an operation that tells us how the different frequencies change over time. This is where STFT is used.
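Here is a minimal sketch of that decomposition using `numpy.fft`. The three component frequencies (50, 120 and 300 Hz) and their amplitudes are made up for the example; the point is that the FFT recovers them from the summed wave alone.

```python
import numpy as np

sample_rate = 1000                          # samples per second
t = np.arange(sample_rate) / sample_rate    # one second of time stamps

# Hypothetical ingredients: frequency in Hz -> amplitude
components = {50.0: 1.0, 120.0: 0.5, 300.0: 0.25}

# The "squiggly" wave: the sum of the three single-frequency waves
signal = sum(a * np.sin(2 * np.pi * f * t) for f, a in components.items())

# Fourier transform of the real-valued signal
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)

# Scale the magnitudes so that they read as sine-wave amplitudes
amplitudes = 2.0 * np.abs(spectrum) / len(signal)

# Pick the three strongest frequency bins
top = np.sort(np.argsort(amplitudes)[-3:])
detected = {float(freqs[i]): round(float(amplitudes[i]), 3) for i in top}
print(detected)
```

The recovery is this clean because each tone completes a whole number of cycles in the one-second window; with real audio the energy spreads over neighbouring bins.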


In the upcoming blogs we will see how to take raw audio data, convert it into MFCC format, train a model using the Keras EfficientNetV2 architecture, and convert it into C++ format for deployment on a microcontroller like the ESP-32. We will also compare the results with and without MFCC.


Ideas2IT Team
