nota bene: this article is under construction. Some specific figures about scientifically measuring latencies in an example setup are still missing.
And they’re still missing – I originally posted this text on my old website about four years ago – but the article remains valid.
Ever since the first musicians started to use laptop computers as effects or virtual instruments, there has been a discussion about latency. Makers of audio interfaces specified latencies, and musicians made bold statements as to why their specific playing style required some extremely low latency which only selected interfaces and very powerful computers could provide.
This article deals with the concept of latency. What it is, what causes it and how we can affect it. It will also try to dispel some myths that have formed when discussing this.
What is it?
Latency is defined as the time span between the moment something is initiated and its consequences are observed. This becomes most apparent with systems which are kind of a black box. If you have a vending machine where you throw in a coin and then get a candy bar, the latency of this black box is the time from when you throw the coin until the bar drops into the slot and you can retrieve it.
If in this example we do not only look at the machine with its stimulus (throw in a coin) and the action (candy bar drops down) but at the whole system including the human in front of the machine, our stimulus might be the thought “I want a candy bar” and the action is when he starts to shove the thing into his mouth.
In this case, the total latency is made up of three different processes (digging out a coin and throwing it into the machine, the machine dispensing the bar, the human picking up the bar, unwrapping it and eating it), and the total latency is the sum of all three latencies.
Summarizing, there are two important points in our application (which, again, is the musician playing an instrument and listening to it):
- The complete system must be identified and defined in a meaningful way. The alpha and the omega here is always the musician!
- The system might be chained up from different subsystems, each with their own latency. The total latency is the sum of these latencies.
What causes it?
So now that we know how to look at our system and how to put together the total latency from the sub-systems’ latencies, let’s look at a typical example of today:
A musician stands in a room with an electric guitar. This guitar is connected (e.g. via DI – and two cables) to the analogue in of a computer’s audio interface. The signal is then processed in the computer (like with an amp modeller), and the processed output is sent out through the interface’s analogue out (through a cable) into an amplifier which in turn feeds a speaker (again through a cable) which the musician hears.
We call this setup “S1”. Later on, we’ll be comparing this with two variants:
In setup “S2”, the speaker is replaced with a pair of headphones.
In setup “S3”, the computer (with its interface) is replaced by an analogue guitar preamp (the speaker stays the same as in S1). S3 is basically a typical guitarist’s setup.
If we look at the signal chain in S3, we have the following latencies:
A: from the moment the musician decides to play a note until his finger moves.
A1: the musician decides to play a note. This decision is transformed into a command to move the finger.
A2: the command to move the finger travels down the nerves to the muscle.
A3: the muscle starts the movement, somewhat later the finger is moving.
B: from the moment the string is plucked to the electrical signal leaving the guitar.
C: the way from the guitar to the speaker.
C1: electrical signal travelling through the cable to the amplifier.
C2: the signal being processed by the amplifier.
C3: the signal travelling from the amplifier through the cable to the speaker.
C4: the speaker transforming the electrical signal into air movements.
D: the sound travelling from the speaker to the performer’s ears.
E: from the ear to the musician’s brain.
E1: the sound wave generating a neural response in the ear.
E2: the neural signal being perceived as the guitar’s sound.
Now, let us first look at the electrical signals travelling through audio cables (C1, C3). The signal travels at roughly the speed of light, which is in the order of 3E8 m/s. That means if we have a total cable length of 30m (which should be more than sufficient for our application), the total latency caused here is about 0.0001ms. As we will see further on, the latencies we’re discussing are in the ms range, so the effect of C1 and C3 can be neglected.
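For the sceptical, the cable figure is easy to verify (a quick Python sketch; the 30m length is the example from the text, and the speed of light serves as an upper-bound approximation for propagation in copper):

```python
# Propagation delay of an electrical signal in an audio cable.
SPEED_OF_LIGHT_M_PER_S = 3e8

def cable_delay_ms(cable_length_m: float) -> float:
    """Return the propagation delay in milliseconds."""
    return cable_length_m / SPEED_OF_LIGHT_M_PER_S * 1000

print(f"{cable_delay_ms(30):.4f} ms")  # 0.0001 ms -- negligible next to ms-range effects
```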
A and E are effects which happen in the human body. While their total is well in the 100ms range, we need to understand that a) these are latencies which are completely beyond our control, and b) we are simply used to them. So it suffices to note that a) latencies of this dimension exist, and b) the musician has intuitively learned to live with them.
B has mainly to do with the mechanical inertia of the string. While we can easily understand that this, too, is something every guitarist has learned to live with, it’s interesting to look at this effect with some other instruments: if e.g. you look at a church organ with a 32ft register, the time it takes a) from the player hitting the key until the organ starts to blow air into the pipe, and b) for an oscillation to establish itself in the large amount of air in the pipe, is considerable. So these are also ms-and-up effects musicians have learned to live with.
An interesting point is C2, simply because this is the place which will be taken by our computer in S1 and S2. There are two things to be said about this: a) the delay is greater than zero, and b) we can estimate how big it is. Coming from signal theory, an EQ will delay the signal by a phase shift of 90° at its resonant frequency. So if we have the typical Bass/Middle/Treble EQ in our amp, the bass EQ (which may have a center frequency of 150Hz) will delay our signal by around 1.7ms. Add to that the (smaller) delays of the mid and treble EQ (in such circuits, the different bands are usually set up in series), and you’ve got a delay of >2ms from the EQ alone. The rest of the amplifier will have a much smaller effect on the latency, so let’s just state that our C2 latency in setup S3 is 2ms.
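The 1.7ms figure follows directly from the phase shift: a 90° shift is a quarter period. A quick check in Python:

```python
def phase_delay_ms(phase_deg: float, freq_hz: float) -> float:
    """Delay corresponding to a given phase shift at a given frequency."""
    return phase_deg / 360.0 / freq_hz * 1000

# Bass EQ: ~90 degrees of phase shift at a 150 Hz center frequency.
print(f"{phase_delay_ms(90, 150):.2f} ms")  # 1.67 ms
```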
Remaining in the C section is C4. Frankly, I have no idea about the dimension of this latency, but it is safe to say that a) every electric guitarist has got used to it, and b) the latency can be reduced by using a smaller speaker and lower volumes – as we do in S2 with the headphones.
Now the remaining latency is D. D is interesting insofar as a) it is also something that the typical electric guitarist has got used to, and b) it is something we have a direct influence on, simply by moving closer to or further away from the speaker. We can even nearly eliminate it by using headphones, as in S2.
To get a feeling for this latency, we first note that the speed of sound is roughly 300m/s at room temperature. This means we get a latency of 3.3ms for every meter of distance. If we’re 3m away from the speaker (a typical rehearsal room situation), this latency becomes 10ms. For 8m (as when playing on a big stage), it rises to more than 25ms.
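This is simple enough to verify with a few lines of Python (using the article’s rounded 300m/s figure):

```python
SPEED_OF_SOUND_M_PER_S = 300  # rough room-temperature value used in the text

def air_delay_ms(distance_m: float) -> float:
    """Time for sound to travel the given distance, in milliseconds."""
    return distance_m / SPEED_OF_SOUND_M_PER_S * 1000

for d in (1, 3, 8):
    print(f"{d} m -> {air_delay_ms(d):.1f} ms")
# 1 m -> 3.3 ms, 3 m -> 10.0 ms, 8 m -> 26.7 ms
```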
So, summarizing for S3, we get
A: precise values not known (above 100ms range), compensated for by every musician
B: ms-range, compensated for by every musician
C2: ~2ms, compensated for by every electric guitarist
C4: ms-range, compensated for by every electric guitarist
D: 5-25ms, compensated for by every electric guitarist
E: precise values not known, compensated for by every musician
If we summarize the values of C and D as “the things that happen outside of the human body and the instrument”, we get a range of slightly above 8ms to about 30ms.
Now let us look at our laptop setups S1 and S2, starting with S1.
Comparing this with the discussion of S3, we see that the only difference is found in C2: “the signal is processed by the amplifier” becomes “the signal is processed by the laptop and then runs through a power amplifier” – or, if we neglect the latency of the power amplifier (which, for a linear amplifier with linear group delay – in more mundane terms, “without any EQ” – is again in the sub-ms range), it becomes “the signal is processed by the laptop”.
To get a value for our new C2 (let’s call it S1:C2), we will further split up the latencies found here:
a) the signal is converted by the A/D converters of the audio interface
b) the signal is buffered by the audio interface’s input drivers
c) the signal is processed by the audio application (e.g. an amp simulator)
d) the signal is buffered by the audio interface’s output drivers
e) the signal is converted by the D/A converters of the audio interface
Ok, so we normally get a specification from the interface manufacturer which says something like “latency down to 4ms with zero-latency monitoring”. So what does that mean – that C2 is 4ms, and that if we only monitor the signal (i.e. leave out steps b) to d)) we get zero latency? Certainly not. The latency specified in the interface documentation covers only b) and d).
So first, what about the converter influences, a) and e)? You normally have a hard time finding these out, even when looking at the data sheets of some really high-class gear (think Apogee). You might find them if you know which converter chips are used and then look up their specs. What comes out is something like this: there are several latching delays etc. in the ns range (which we will not consider further). Then there is the delay of the digital filter, usually specified as a multiple of the inverse sampling clock, which comes out in the 0.1ms range (this at 44.1/48kHz sampling rates, less for higher ones). You may or may not find a specification of a pipeline delay, again in the 0.1ms range. What you don’t see is the influence of the analogue anti-aliasing filters on the input side, but assuming a 2nd order lowpass with a phase shift of less than 180° at its resonant frequency, you get about half a sample clock – which corresponds to the 0.01ms range even at 44.1kHz. Even when summing all these effects, we get something around 0.2-0.3ms in each direction, or about half a ms in total – so we won’t consider it any further.
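As a sanity check, the order-of-magnitude figures above can be summed up; note that these are rough estimates from the discussion, not data-sheet values:

```python
# Rough converter-latency budget per direction (all values in ms).
# These are the order-of-magnitude estimates from the text, not measurements.
digital_filter_ms = 0.1     # group delay of the converter's digital filter
pipeline_ms = 0.1           # possible pipeline delay
analog_antialias_ms = 0.01  # analogue anti-aliasing filter, ~half a sample clock

per_direction_ms = digital_filter_ms + pipeline_ms + analog_antialias_ms
print(f"~{per_direction_ms:.2f} ms per direction, "
      f"~{2 * per_direction_ms:.2f} ms for A/D plus D/A")
```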
Now to our b) and d) values. Usually, you have different settings for your sound interface driver which allow you to set this latency to something in between 1ms and 16ms either way. You may or may not get the information that this has something to do with system load or system performance.
What happens here is this: with this parameter, you adjust the buffer size for the input/output drivers. This buffer is basically a pipeline which, for the input, sits between the audio input and the software accessing the audio – like your amp simulator – or, for the output, between the amp simulator and the audio output. It frees the software from having to read and write one sample precisely at each sample clock; instead, it can procrastinate during short periods of high system load without interrupting the audio stream – an interruption of the audio stream is what you hear as clicks. Of course, making this buffer bigger gives the audio application (or the computer software in general) more room to attend to other business, but it also increases the latency. So it’s always a tradeoff: if you require extremely short latency, you can set a small buffer, but will then only be able to load your CPU to around 50 or 60%. Or you decide to load your computer close to the 100% redline, but have to increase your buffers and thus your latency. Or you use a very radical combination of both and decide to live with the ugly clicks (something I would advise against). By the way, working with multi-processor platforms does help here too; RME Audio has a nice tutorial on how to optimize multi-processor systems for low latency.
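The buffer-to-latency relation itself is trivial: the one-way latency is the buffer size divided by the sample rate. A small Python sketch (the buffer sizes here are just typical example settings):

```python
def buffer_latency_ms(buffer_samples: int, sample_rate_hz: int) -> float:
    """One-way latency added by a driver buffer of the given size."""
    return buffer_samples / sample_rate_hz * 1000

# Typical selectable buffer sizes at 44.1 kHz:
for buf in (64, 128, 256, 512):
    print(f"{buf:4d} samples -> {buffer_latency_ms(buf, 44100):.2f} ms")
# 64 samples -> ~1.45 ms ... 512 samples -> ~11.61 ms
```

This also shows why higher sample rates help: the same buffer size at 96kHz costs less than half the time it does at 44.1kHz.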
This buffer can of course also be used to compensate for timing problems of the audio interface. The range of allowed buffer sizes is usually constricted in a way that will always give a performance without artefacts caused by the interface itself. But it is important to understand that while all interfaces are equal insofar as they will induce a specific latency for a specific buffer size, there are differences in which buffer size will work for your application, caused by the quality of both the audio hardware and the drivers.
Summarizing, in a typical setup you will have latencies in between 1.5ms and 8ms each way (for most interfaces, they are identical).
Finally to our remaining influence, c). The short version: it’s a little bit tricky. I’ve not seen a single piece of software in this area (read: standalone audio effects/amp simulators, DAW software, VST hosts) which specifies this time. In the case of VST plugins, there is a software “VST Scanner” (freeware) which displays the latency reported by a plugin – but still, you don’t know about the VST host or the standalone version of this plugin (if there is any). So if you need to know it in detail, there’s no way ‘round measuring it (see the annex for a howto).
How to deal with it
Summarizing, we found out the following:
- In a typical laptop setup, the total latency through the laptop (S1:C2) is in the ms to lower 10ms range.
- There is also a latency through our guitar amp (S3:C2), but it’s considerably smaller (lower ms range).
- A big part of the equation for the S3 setup is S3:D – the time of the sound travelling through the air, but it’s something we have a direct influence on.
So the simple solution to the latency problem: go with our setup S2 – use headphones! The total perceived latency of a computer setup using headphones is about the same as that of a guitar amplifier heard through the air. Note that this also works perfectly well when playing live, provided you get the other players’ monitor signals through your headphones as well!
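To make the “about the same” claim concrete, here is a back-of-the-envelope comparison in Python, using only the rough figures from this article (the 2ms buffer setting per direction is an example, not a measurement):

```python
# All values in ms, taken from the estimates in the text.
converters_round_trip = 0.5   # A/D plus D/A, both directions
buffer_each_way = 2.0         # example driver buffer setting
amp_eq_delay = 2.0            # analogue amp EQ (S3:C2)
air_delay_3m = 10.0           # 3 m of air at ~3.3 ms per meter (S3:D)

s2_laptop = converters_round_trip + 2 * buffer_each_way  # headphones: no air path
s3_amp = amp_eq_delay + air_delay_3m

print(f"S2 (laptop + headphones): ~{s2_laptop:.1f} ms")
print(f"S3 (amp + speaker, 3 m):  ~{s3_amp:.1f} ms")
```

With these numbers the laptop-plus-headphones chain actually comes out ahead of the classic amp setup in a rehearsal room.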
Annex: How to measure Latency
In this section, we will focus on how to measure the latency of the computer, or more specifically, what we named as S1:C2. This measurement setup is designed to rely on components found in a typical “hobby musician with a computer” setup and gives us an accuracy in the 0.05ms range or better.
Here’s what you need:
- your DUT (device under test): the laptop with its audio interface with analogue ins and audio processing software.
- an analogue audio signal source able to feed a (roughly) line level input with two identical outputs: this can be about anything: an electric guitar (with or without a preamp/booster), a microphone with a preamp, a synthesizer, a CD player, a computer with a soundcard, a professional arbitrary waveform generator – you name it. The tricky part here might be to get two identical signals.
- an analogue audio signal recorder with (at least) two channels. Again, your choice of DAT, computer, iPod and some means to analyze the recording offline (like a simple audio editor software), OR
- a sampling signal analyzer with at least two channels. As this is a little bit more tricky, we’ll stick with the solution above (chances are that if you own a sampling signal analyzer, you won’t need this tutorial anyway).
We’ll be working in the following way:
- Send a test signal both into one channel of your recorder and into your DUT and the DUT’s output into another channel of your recorder.
- Analyze the recorded test signals on both channels (or more specifically, their timing relation).
In our rundown, we’ll first assume that your DUT is already set up and works as planned – or at least will output an analogue audio signal if you send an analogue audio signal into the input. To make the following analysis as easy as possible, set it up in a way that will least affect the signal (and preserves the stereo image) without removing any processing components from the direct path. That means if e.g. your sound processing consists of an amp simulator and a compressor, you’d set the amp simulator to a clean sound with all EQs in the neutral position and the compressor to a threshold of 0dB. Note that you don’t need to consider any send effects here. Also, make sure here that in the soundcard’s settings, any “input monitoring” has been turned off.
Secondly, to the choice of the remaining components. While I did already mention that we have quite some choice both with the signal source and the signal recorder, by far the easiest way is to use another computer with a (stereo line level) audio input and output. If none is available to you, the second best choice would be a CD player and a digital recorder with the option to transfer the signal digitally to the computer (which is now your DUT) after the measurement has been made.
I will not go into details how to do this using analogue tape recorders and guitars etc., simply because I want to keep this article relatively simple and without too many “if you are using a xxx, you have to do yyy for this step”. So everything from now on will address the setup with a computer. When using the solution with a CD player and a digital audio recorder, you simply have to burn the test signals to a CD to be played in your CD player and then transfer the recorded signal digitally to your computer.
First, we have to create a test signal. For reasons of a simple analysis, the best thing would be to use an impulse or a step or a square pulse. We’ll describe it using an impulse.
Open the wave editor of your choice. Create a new mono audio signal (length: a few seconds) at the maximum sample rate possible for your analogue output of the computer (in case of transferring to CD, this would be 44.1 kHz). You now should have a “blank” waveform (meaning the signal is in the center of the screen for the entire length). Now move into the middle of the waveform in the time axis and zoom in to your maximum zoom ratio. Most probably, the waveform now appears as a number of small dots, crosses or circles which are connected with line segments. Now switch to paint or draw mode (if the program doesn’t do this automatically) and take one of the dots and move it to the top of the window. You now have a flat wave with a short spike in the middle – the impulse.
Now create a new stereo audio signal with the same sample rate and length as before. Mark and copy your entire waveform from the mono file and paste it to both channels of the stereo waveform. Make sure that they are in parallel – the impulse is at exactly the same point in time. Save this waveform – this is our test signal.
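If you’d rather script this than draw the impulse by hand, the same stereo test file can be generated with a few lines of Python (standard library only; the filename test_signal.wav and the 44.1kHz rate are just example choices matching the CD-transfer case):

```python
import struct
import wave

SAMPLE_RATE = 44100      # Hz; use your interface's maximum rate if not going via CD
LENGTH_S = 2             # a couple of seconds of silence around the impulse
N = SAMPLE_RATE * LENGTH_S

# Silent 16-bit samples with a single full-scale impulse in the middle.
samples = [0] * N
samples[N // 2] = 32767

with wave.open("test_signal.wav", "wb") as wav:
    wav.setnchannels(2)  # stereo: the identical impulse on both channels
    wav.setsampwidth(2)  # 16 bit
    wav.setframerate(SAMPLE_RATE)
    # Interleave: every frame is (left, right) with identical values,
    # so both channels are exactly in parallel.
    wav.writeframes(b"".join(struct.pack("<hh", s, s) for s in samples))
```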
In all following steps, it might be a good idea not to actually play back the test signal via speakers or headphones. Best to turn all those things off, just to make sure.
Now make the following connections: connect one output from your “analysis computer” to the input of your DUT. Connect the other output to one input of your analysis computer (before doing so, make sure that “input monitoring” has been turned off!). Connect the output of your DUT to the other input of your analysis computer. If you use more than one input and output on your DUT, make sure to use the corresponding pair (i.e. left input and left output). You now have set up two signal paths: analysis computer out to analysis computer in (we call this “direct signal”) and analysis computer out through DUT to analysis computer in (“DUT signal”).
Now set up the software for your analysis computer. What you will need is a) an audio playback program, b) an audio recorder. Make sure that both can access the sound hardware without getting in each other’s way – or use a software that can playback and record at the same time (like any audio sequencer – Krystal is a freeware variety which comes to mind).
Start the audio recording. Play back your test signal. Stop the audio recording.
Now on to analyzing the recorded audio. There are two ways to do this: either by using a normal audio editor, or by using a signal analysis tool that does something called “cross correlation”. Cross correlation basically is a mathematical process which shows (among other things) if and by how much two signals are delayed compared to each other. You don’t really need to do this here, but it sure looks cooler if you want to demonstrate to someone just how professional you are and how sound your analysis is.
Load your recorded audio file into the wave editor (or the recorded files, if you didn’t record it as one stereo file). Now you should see two waveforms, each with one impulse at different points in time. First, look at the direct signal (if you didn’t keep track here – this is the signal where the impulse appears earlier). Put one marker exactly at the peak of the impulse. Then look at the DUT signal and put one marker exactly at the peak of the impulse. Measure or calculate the distance between both markers. This time is S1:C2.
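For the cross-correlation variant, a minimal sketch in Python using NumPy might look like this (the toy signals below are made up for illustration; in practice you would feed it the two recorded channels):

```python
import numpy as np

def measure_latency_ms(direct, dut, sample_rate_hz):
    """Estimate the delay of `dut` relative to `direct` via cross-correlation."""
    corr = np.correlate(dut, direct, mode="full")
    # The peak position of the correlation gives the lag in samples.
    lag_samples = int(np.argmax(corr)) - (len(direct) - 1)
    return lag_samples / sample_rate_hz * 1000

# Toy example: the same impulse, delayed by 200 samples on the DUT channel.
fs = 44100
direct = np.zeros(1000); direct[100] = 1.0
dut = np.zeros(1000); dut[300] = 1.0
print(f"{measure_latency_ms(direct, dut, fs):.2f} ms")  # 200 samples at 44.1 kHz ~ 4.5 ms
```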
So you (or some smartypants guy questioning the validity of your measurements) might wonder how precise this setup is. There are four sources of error in this measurement, some of which can be optimized:
- quantization in the time domain: by sampling our input signal, it becomes quantized not only in the amplitude but also in the time domain. This effect is statistical (it depends when in relation to the sample clock your signal hits the converters) but has a maximum value of +/-0.5 sample clocks. This corresponds in your “worst case” of f=44100Hz to roughly +/-0.01ms or even less for higher sample rates. In other words, it’s considerably below the dimension we’re looking at here.
- interchannel phase deviation after the input jacks or before the output jacks of the converter: this is fairly minimal for all current converters, falls below the ns range and can also be neglected.
- clock stability of the converter: what’s of interest is the absolute deviation from the specified sampling frequency. This directly affects our time measurement: a deviation of 1% will also mean a deviation in the measurement of 1%. Sadly, manufacturers tend not to specify these values, but experience with these components leads me to believe the deviations are considerably below 0.5%, at least if you do your measurements at room temperature.
- different cable length on both channels: the total cable length in both your direct and DUT signal path may be different. Dimension: if there is a difference in cable length of 1m, the effect is in the ns range. If you still want to exclude this effect, make all total cable lengths the same in each signal path.
Taking all influences together, we’re well below an error of 1% – which should be good enough for our application.
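The dominant error term – the quantization in the time domain – is simply half a sample period, which is trivial to compute:

```python
def sampling_jitter_ms(sample_rate_hz: int) -> float:
    """Worst-case timing error from sampling: half a sample period."""
    return 0.5 / sample_rate_hz * 1000

# Worst case is the lowest sample rate, 44.1 kHz:
print(f"+/-{sampling_jitter_ms(44100):.4f} ms")  # +/-0.0113 ms
```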