Anyone familiar with mel frequency cepstral coefficients?

dewdrop_world

I'm looking for pseudocode to calculate MFCCs (mel frequency cepstral coefficients) from a time-domain signal. All I can find online are engineering papers (which cover theory but not implementation), or very basic definitions with absolutely nothing about how to do it. Or, "use this library" and be done with it. No help if I want to do an implementation in SuperCollider.

The intent is to grab microsound events from an input stream using feature detection (SC has PV_JensenAndersen and PV_HainsworthFoote feature detectors) and then analyze them for spectral similarity so that I can spit out grain clouds of similar-but-not-exactly-the-same events. Eventually this will be based on live mic input.

I've been trying all week to work with raw FFT data but it's just not good enough... two sounds can be very different sonically but have surprisingly similar FFT magnitudes.

Thanks.

hjh

(I will also be looking on musicdsp.org but I can't access the site from work, for an unknown reason.)

DrJustice · Posted: Tue Sep 12, 2006 2:53 pm Post subject:

Hmm... no replies yet?

Well, I'll give it a try although I'm not familiar with MFCC as such. However, I got curious, and you're right there's very little in the way of real examples out there. I include the definition (for a magnitude only output vector) for the benefit of our readers:

mfcc[] = fft( mel( log( fft( signal[] ) ) ) )

Of course it's the mel() mapping of the spectrum from Hertz to Mel that is the crux. I have no idea how to best choose mel frequency scaling and possibly truncation before the last fft, but I guess experimentation could shed some light on that. The equations to convert from Hz to mel and vice versa are:

mel = 1127.01048 * log( 1 + fhz / 700 )
fhz = 700 *( exp( mel / 1127.01048 ) - 1 )

If you can do the fft and log steps, the bit that's left is the spectral mapping. I assume you can simply interpolate in the Hz spectrum to pick the magnitudes at the frequencies corresponding to your chosen "mel bins" before the last fft, truncating (windowing?) the mel series as necessary when/if you overshoot the Hz spectrum. Perhaps the scaling/truncation can be dealt with by scaling the mel spectrum to exactly span the full Hz spectrum for the number of mel bins you choose.

Below is some code to implement the latter mapping. It assumes that you have the log magnitude spectrum in mag_hz_v[] and that the fft has has enough points (more than the mel spectrum) to give adequate resolution in the low frequencies. The resultant vector, mag_mel_v[], is the remapped spectrum ready for the final fft. The code is a working Lua program; note that the table index in Lua starts at 1.

dewdrop_world · Posted: Thu Sep 14, 2006 12:42 pm Post subject:

Thanks, quite helpful.

Turns out I didn't need to do the log and the second fft. I had been differentiating the fft curve and locating places where the slope changes from positive to negative to identify magnitude peaks. Simply mapping the linear frequency scale onto mel frequencies and applying that algorithm is getting entirely acceptable results.

I will post the sc code sooner or later, just have some cleanup to do.

Thanks!
hjh

DrJustice · Posted: Fri Sep 15, 2006 2:27 pm Post subject:

I'm glad it was of some use Smile

Do you mean that just taking the fft of the signal and doing a Hz to Mel mapping is sufficient for use as a feature extraction in your case?

I was thinking that it might be nice to give the output of the code so people can see what is going on - I.e. that there is increased spectral resolution Hertz-wise in the lower frequencies (still no interpolation though, so the magnitudes are a bit off):

dewdrop_world · Posted: Sat Sep 23, 2006 7:53 pm Post subject:

Yes, the resolution in terms of real frequency is better at the low end.

Here's the beginning and end of the 256-element table. mel increases linearly, fhz starts out with a delta below 20 Hz at first, increasing to around 1300 Hz by the time we hit Nyquist.

I know this isn't truly mathematically ideal but it gets reasonable results to the ear Smile

hjh

mel: 0, fhz: 0 .. 0
mel: 15.325758062803, fhz: 0 .. 21.533203125
mel: 30.651516125605, fhz: 21.533203125 .. 21.533203125
mel: 45.977274188408, fhz: 21.533203125 .. 43.06640625
mel: 61.30303225121, fhz: 43.06640625 .. 43.06640625
mel: 76.628790314013, fhz: 43.06640625 .. 64.599609375
mel: 91.954548376815, fhz: 64.599609375 .. 64.599609375
mel: 107.28030643962, fhz: 64.599609375 .. 86.1328125
mel: 122.60606450242, fhz: 86.1328125 .. 86.1328125
mel: 137.93182256522, fhz: 86.1328125 .. 107.666015625

...

mel: 3816.1137576378, fhz: 19982.8125 .. 20262.744140625
mel: 3831.4395157006, fhz: 20262.744140625 .. 20564.208984375
mel: 3846.7652737634, fhz: 20564.208984375 .. 20844.140625
mel: 3862.0910318263, fhz: 20844.140625 .. 21145.60546875
mel: 3877.4167898891, fhz: 21145.60546875 .. 21447.0703125
mel: 3892.7425479519, fhz: 21447.0703125 .. 21748.53515625
mel: 3908.0683060147, fhz: 21748.53515625 .. 22050

// partial code to make the table, taken out of context