Saturday 2 June 2012

The Cocktail Party Effect (Part 2 of 2)

Late at night, when the boys are (finally) asleep, we like to be able to hear the voices of the people on the television without waking up the neighbors.  In part one I described how to build a custom digital audio filter by specifying a frequency response and running an optimizer to determine the optimal coefficients.

This post looks at one of the actual filters I use on my television in more detail.

First up, here's the response curve for the filter.
(Plot: white = filter, blue = desired response.)
You can see the very steep roll-off below 1000Hz: the response drops quickly, to -10dB at 300Hz and -20dB at 50Hz.  In the other direction, the response is almost flat between 1000Hz and 5000Hz, followed by a very gradual drop to -3dB at 10kHz.

This corresponds nicely with the human voice, which ranges from 800Hz to 8000Hz, with the majority of the sound energy between 1500Hz and 4000Hz.

Here's the actual code I use to set the desired frequency response:

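    #include <cmath>

    // Desired gain (1.0 = unchanged) at the given frequency in Hz.
    // All comparisons are made on log(frequency).
    float GetDesiredResponse(float freq)
    {
        float logFreq = log(freq);
        float cutOff0 = log(1000.0f);
        float cutOff1 = log(3400.0f);
        float cutOff2 = log(8000.0f);

        float result = 1.0f;
        if(logFreq < cutOff0)
        {
            // Steep roll-off below 1000Hz.
            float factor = logFreq / cutOff0;
            result *= pow(factor, 10.0f);
        }
        if(logFreq > cutOff1)
        {
            // Gentle roll-off above 3400Hz...
            float factor = 2.0f - logFreq / cutOff1;
            result *= pow(factor, 2.0f);
        }
        if(logFreq > cutOff2)
        {
            // ...with a second gentle roll-off stacked on top above 8000Hz.
            float factor = 2.0f - logFreq / cutOff2;
            result *= pow(factor, 2.0f);
        }
        return result;
    }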


As you can see, this method makes it relatively easy to get precise control over the frequency response.
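To sanity-check a curve like this before handing it to the optimizer, it can help to print it in decibels at a few spot frequencies. The sketch below assumes the response is an amplitude ratio (so dB = 20*log10 of the gain), which the post does not state explicitly. It repeats GetDesiredResponse so that it compiles on its own, and the helpers ToDecibels and PrintDesiredResponse are hypothetical names, not part of the original code.

```cpp
#include <cmath>
#include <cstdio>

// Same curve as GetDesiredResponse above, repeated so this sketch compiles on its own.
float GetDesiredResponse(float freq)
{
    float logFreq = std::log(freq);
    float cutOff0 = std::log(1000.0f);
    float cutOff1 = std::log(3400.0f);
    float cutOff2 = std::log(8000.0f);

    float result = 1.0f;
    if (logFreq < cutOff0) { result *= std::pow(logFreq / cutOff0, 10.0f); }
    if (logFreq > cutOff1) { result *= std::pow(2.0f - logFreq / cutOff1, 2.0f); }
    if (logFreq > cutOff2) { result *= std::pow(2.0f - logFreq / cutOff2, 2.0f); }
    return result;
}

// Hypothetical helper: assumes the response is an amplitude ratio, not a power ratio.
float ToDecibels(float amplitude)
{
    return 20.0f * std::log10(amplitude);
}

// Hypothetical helper: prints the desired response at a few spot frequencies.
void PrintDesiredResponse()
{
    const float freqs[] = {50.0f, 300.0f, 1000.0f, 2000.0f, 5000.0f, 10000.0f};
    for (float freq : freqs)
    {
        std::printf("%8.0f Hz  %7.2f dB\n", freq, ToDecibels(GetDesiredResponse(freq)));
    }
}
```

Under that amplitude assumption, the desired curve at 10kHz works out to roughly -2.9dB, which lines up with the -3dB point described earlier.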

A couple quick notes:
  • It's important that the frequency response be a quasiconcave function, that is, one that rises to a single peak or plateau and then falls, with no dips in between.  This ensures there are no kinks in the response, which would (1) cause visiting audiophiles to complain, and (2) make some voices more difficult to comprehend.
  • Be careful when specifying a very steep transition, or trying to completely block some frequencies.  A digital filter has some pretty strict limits on the kinds of things it can filter.  Like the proverbial genie, if you try to go there, the optimizer will give you exactly what you ask for.
  • I can't hear past ~17kHz, so I took extra care to test with high frequencies, but played back at half speed to make sure I wasn't torturing the cats and dogs.
  • This is also a good time to read up about Odd and Even functions, which correspond to odd/even numbers of coefficients in the polynomial.  An Odd function, for example, will start at -∞ and rise to +∞.  That's useful if you want a high-pass or low-pass filter, but bad for a band-pass filter.
  • Having problems with the output y racing off to infinity? Your Y polynomial may have an unstable feedback loop.  Try a longer or shorter filter, or add an A-stability criterion to your optimizer.

 

Weighting

So you're probably wondering why the filter matches the desired function so accurately over the vocal range, but seems to drift in other areas.  Here's the weighting function I use to compute the error:

That is, I sample 129 frequencies logarithmically spaced between 9Hz and 18000Hz.  For each frequency, I add the square of the L2 error, weighted by the square of (desired response + a quarter).
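The weighting described above can be sketched roughly as follows. This is a reading of the prose, not the original optimizer code. It assumes a 44.1kHz sample rate (the post does not give one), that the "error" is the difference between the filter's magnitude response and the desired response, and that the coefficients are laid out like the filter in this post (four feed-forward taps b, four feedback taps a). FilterMagnitude and WeightedError are hypothetical names.

```cpp
#include <cmath>
#include <complex>
#include <functional>

// Assumption: the post does not state the sample rate; 44.1kHz is a guess.
const double kSampleRate = 44100.0;

// Magnitude response at freq (Hz) of the recurrence
//   y[n] = b[0]*x[n] + ... + b[3]*x[n-3] + a[0]*y[n-1] + ... + a[3]*y[n-4]
double FilterMagnitude(const double b[4], const double a[4], double freq)
{
    const double pi = 3.14159265358979323846;
    const double w = 2.0 * pi * freq / kSampleRate;
    std::complex<double> num(0.0, 0.0);
    std::complex<double> den(1.0, 0.0);
    for (int k = 0; k < 4; k++)
    {
        num += b[k] * std::polar(1.0, -w * k);        // b[k] * z^-k
        den -= a[k] * std::polar(1.0, -w * (k + 1));  // feedback taps start at z^-1
    }
    return std::abs(num / den);
}

// Weighted squared error over 129 log-spaced frequencies from 9Hz to 18000Hz.
double WeightedError(const double b[4], const double a[4],
                     const std::function<double(double)>& desired)
{
    const int count = 129;
    const double logLo = std::log(9.0);
    const double logHi = std::log(18000.0);
    double total = 0.0;
    for (int i = 0; i < count; i++)
    {
        double freq = std::exp(logLo + (logHi - logLo) * i / (count - 1));
        double want = desired(freq);
        double error = FilterMagnitude(b, a, freq) - want;
        double weight = want + 0.25;  // "desired response + a quarter"
        total += (weight * weight) * (error * error);
    }
    return total;
}
```

Because the weight grows with the desired response, mismatches in the pass band cost far more than mismatches down in the heavily attenuated bass, which is consistent with the filter tracking the curve closely over the vocal range and drifting elsewhere.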

Why so much squaring?  To help the function minimizer converge to a global minimum, it's nice to give big parabolas for the optimizer to slide inside.

Oh, and before I forget, here's the actual filter in C++, licensed under CC0:

    // From http://missingbytes.blogspot.com/2012/06/cocktail-party-effect-part-2-of-2.html
    static float x[5] = {0};
    for(int i=0; i<length; i++)
    {
        float sampleLeft = GetNextLeftSample();
        float sampleRight = GetNextRightSample();

        for(int j=4; j>0; j--)
        {
            x[j] = x[j-1];
        }
        x[0] = sampleLeft + sampleRight;

        // The filter!
        static float y1=0.0f, y2=0.0f, y3=0.0f, y4=0.0f;
        float y0=+0.588746*x[0]-0.492725*x[1]-0.757061*x[2]+0.661062*x[3]
            +1.242758*y1+0.545942*y2-1.044236*y3+0.238844*y4;
        y4=y3;y3=y2;y2=y1;y1=y0;

        float value = y0*volume;
        float outValue = bound(value, -32768, 32767);
        if(value != outValue){clippingCount++;}
        // Go via signed short: converting a negative float straight to unsigned is undefined.
        unsigned short emit = (unsigned short)(short)outValue;
        *dest++ = emit;// Output left
        *dest++ = emit;// Output right
    }
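One cheap way to catch the runaway-feedback problem mentioned in the notes is to feed a filter a single impulse and watch whether the output dies away. The sketch below is a heuristic rather than a proof of stability; the function name, the step count and the 1e-6 threshold are all arbitrary, hypothetical choices. It uses the same coefficient layout as the filter above.

```cpp
#include <cmath>

// Heuristic stability check (hypothetical helper, not part of the original code).
// Feeds a unit impulse through the recurrence
//   y0 = b[0]*x0 + b[1]*x1 + b[2]*x2 + b[3]*x3 + a[0]*y1 + a[1]*y2 + a[2]*y3 + a[3]*y4
// and reports whether the output has decayed to almost nothing by the end of the run.
bool ImpulseResponseDecays(const double b[4], const double a[4], int steps = 10000)
{
    double x[4] = {0.0, 0.0, 0.0, 0.0};  // x[0] is the newest input
    double y[4] = {0.0, 0.0, 0.0, 0.0};  // y[0] is the previous output (y1)
    double tail = 0.0;                   // largest |output| in the final tenth of the run
    for (int n = 0; n < steps; n++)
    {
        for (int j = 3; j > 0; j--) { x[j] = x[j - 1]; }
        x[0] = (n == 0) ? 1.0 : 0.0;

        double y0 = 0.0;
        for (int k = 0; k < 4; k++) { y0 += b[k] * x[k] + a[k] * y[k]; }
        if (!std::isfinite(y0)) { return false; }  // output blew up

        for (int j = 3; j > 0; j--) { y[j] = y[j - 1]; }
        y[0] = y0;

        if (n >= steps - steps / 10) { tail = std::fmax(tail, std::fabs(y0)); }
    }
    return tail < 1e-6;
}
```

Called with b = {0.588746, -0.492725, -0.757061, 0.661062} and a = {1.242758, 0.545942, -1.044236, 0.238844}, the coefficients from the filter above, the impulse response dies away; a deliberately broken filter such as a = {2, 0, 0, 0} does not.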


Results

So what does it sound like?  Well, for music, pretty bad actually :)  But that's not the point!  It's designed for voice, and that's where it really shines: even though the voice quality sounds a little unnatural, you can understand what people are saying, even at very low volume levels.

    (Oops, having some video encoding problems - updated video coming real soon now)

The Future

Can we do better than this?  Yes, we can!  When I finally get around to setting up a 7.1 audio system, I plan to measure the room response function using a calibrated microphone.  Then it's a simple matter of taking the FFT of the incoming audio, dividing through by the room response, then taking the inverse Fourier transform before sending it to the sound card.  Normally this would entail some delay/lag, but that's not a problem when playing buffered video.  I could even have different responses for different listener locations, or dynamically respond to changes in temperature or humidity.
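A minimal sketch of the divide-by-the-room-response idea, under several assumptions: the room response is available as a short impulse response, a naive O(N^2) DFT stands in for a real FFT, and each block is treated as circular (a real streaming version would need overlap handling between blocks). Dft and CorrectBlock are hypothetical names, and the guard against near-zero bins is just one simple choice.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

typedef std::complex<double> Complex;

// Naive O(N^2) discrete Fourier transform. sign = -1 is the forward transform,
// sign = +1 is the inverse without the 1/N scaling.
std::vector<Complex> Dft(const std::vector<Complex>& in, int sign)
{
    const double pi = 3.14159265358979323846;
    const std::size_t n = in.size();
    std::vector<Complex> out(n);
    for (std::size_t k = 0; k < n; k++)
    {
        Complex sum(0.0, 0.0);
        for (std::size_t t = 0; t < n; t++)
        {
            double angle = sign * 2.0 * pi * double((k * t) % n) / double(n);
            sum += in[t] * std::polar(1.0, angle);
        }
        out[k] = sum;
    }
    return out;
}

// Divide one block of audio by the room response in the frequency domain.
// Bins where the room response is nearly zero are left untouched rather than
// boosted without limit (an assumption; a real system would want something smarter).
std::vector<double> CorrectBlock(const std::vector<double>& audio,
                                 const std::vector<double>& roomImpulse)
{
    const std::size_t n = audio.size();
    std::vector<Complex> a(n), r(n);
    for (std::size_t i = 0; i < n; i++)
    {
        a[i] = audio[i];
        r[i] = (i < roomImpulse.size()) ? roomImpulse[i] : 0.0;  // zero-pad the room response
    }
    std::vector<Complex> A = Dft(a, -1);
    std::vector<Complex> R = Dft(r, -1);
    for (std::size_t k = 0; k < n; k++)
    {
        if (std::abs(R[k]) > 1e-6) { A[k] /= R[k]; }
    }
    std::vector<Complex> corrected = Dft(A, +1);
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; i++) { out[i] = corrected[i].real() / double(n); }
    return out;
}
```

The buffering mentioned above is what makes this workable: the whole block has to be available before the transform can run, so the output lags the input by at least one block.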

... watch this space for a future update.



