All In The Mix - The Importance Of Real-Time Mixing In Video Games

This is a transcript of a talk I gave at the Develop Conference in Brighton in July 2010 on audio mixing for video games.


Remember, this was written to accompany a Keynote presentation, but I can't post that here, so you'll have to imagine it.  It looked awesome ;-)




Today I’m going to be talking about audio mixing: what the purpose of mixing is, over and above getting the levels right, and why a good mix is so important.  I’m also going to talk about the stages a mix engineer goes through when working in linear media, and what lessons and techniques we can carry over from the linear world when working with interactive material.

So, here’s what I’m going to talk about today.  Firstly, a short introduction.  Secondly, I’ll ask “what is mixing?”  What’s the purpose of it, and what’s to be gained by good mixing practices.

Thirdly, I’d like to talk about different approaches to mixing systems within interactive entertainment.

Fourth, I want to look at some of the tools and techniques we can use when mixing our games.

And lastly, I want to look at monitoring and mixing standards in our industry.


INTRODUCTION

The industry has come a long way in the last couple of years in terms of the quality of the audio content created for games.  Big developers now spend a considerable amount of money recording new audio content, and usually even more money on the writing and recording of original music for their titles. 

However, the audio assets that go into a game are only 50% of the complete experience.  The other 50% is down to implementation of those assets, with a good mix being a large part of that.

In my view, the mixing process is something that in the past has been overlooked.  In my experience, this is usually down to the minimal amount of time scheduled between when a game is content complete and when the game is mastered.



WHAT IS MIXING?

OK, so let’s start with the basics.  Mixing is the process of bringing together all of your audio assets (sound effects, dialogue and music) and making them sit together nicely so that the whole becomes a coherent audio experience.  Technically, it’s about achieving clarity.  If you’ve got too many sounds sharing the same sonic characteristics being played at once, the whole thing will just mush together and you won’t be able to discern any detail.

Artistically however, mixing is about focus.  It’s about using all of the audio material that you have put together in your title, and modifying it in real time, in order to manipulate the person playing the game into feeling what you want them to feel, and to make them focus on what you feel is important.  By dynamically changing the mix, we have a massive amount of power over how the player perceives the situation they’re in. 

This definition of mixing is the same whether you’re mixing a game, film, TV or music.  However, when it comes to mixing for games specifically, mixing processes fall into two categories.


Active and Passive mixing

Active Mixing is where event triggers that come directly from within the game itself change the audio mix.  An example of this might be recalling a set of volumes for a group of sounds (a snapshot) in response to an in-game event.  Another is the tinnitus effect in a first-person shooter when a grenade goes off close to you: all the sound is filtered, except the ringing in your ears.

The other category is Passive Mixing.  This is a bit more subtle, and is more akin to the way that you would configure a music or film mix in ProTools or Nuendo.  Passive mixing is what I would describe as the configuration of dynamics processors, and how they interact with each other.

As an example of passive mixing, here’s probably the simplest setup possible.  



Here we have a simple routing diagram, detailing how the dialogue and the music interact with each other.

We have a compressor on the music, with a side chain input coming from the dialogue.  So, although the compressor is on the music track, it’s not actually listening to the music, it’s listening to the dialogue.

The louder the dialogue, the more the compressor will attenuate the volume of the music track. 
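To make that concrete, here’s a minimal sketch in Python of how the music gain could be derived from the dialogue level.  The function name, the threshold, the ratio and the cap on the reduction are all assumptions of mine for illustration, and a real compressor would also smooth the gain with attack and release times, which are left out here.

```python
def duck_gain_db(key_level_db, threshold_db=-30.0, ratio=4.0, max_reduction_db=12.0):
    """Gain in dB (0 or negative) to apply to the music track.

    key_level_db is the measured level of the side-chain signal, i.e. the
    dialogue, not the music that the compressor actually sits on.
    """
    over = key_level_db - threshold_db
    if over <= 0.0:
        return 0.0  # dialogue is silent or quiet: leave the music alone
    # With ratio R, every R dB over the threshold only counts as 1 dB,
    # so the gain reduction is over * (1 - 1/R).
    reduction = over * (1.0 - 1.0 / ratio)
    return -min(reduction, max_reduction_db)


quiet = duck_gain_db(-40.0)   # below the threshold: no ducking
spoken = duck_gain_db(-22.0)  # 8 dB over the threshold: music pulled down
shouted = duck_gain_db(0.0)   # very loud dialogue: reduction hits the cap
```

The louder the dialogue gets, the more negative the returned gain becomes, until it reaches the cap.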

As I’ve said, this is a very simple setup.  I’ll go into a bit more detail on routing a bit later.

The key difference between active and passive mixing is that with active mixing the mix changes are triggered by events in the game.  The system isn’t actually aware what is coming out of the speakers, only that a certain event has been triggered. 

With passive mixing the mix system is actually listening to the audio signals themselves and then adjusting levels depending on actual volume levels of the channels or sub-groups.  As I said, this is more akin to how you would set up the routing and audio processors on a mixing desk.

So, in my view, a perfect setup would be a combination of both active and passive systems.


Different approaches to mixing

In videogames, developers use a wide variety of different mixing techniques, depending on what technology they have available to them.  I’d like to show you a couple of these different approaches.

Snapshots

The first of these I want to touch on is the snapshot mixing system.  This is probably the easiest to implement, and because of that, the most widely used technique.  It’s something that’s been used for years on mixing consoles in film and music production, and now that the technology has trickled down, it’s found on a lot of live music mixing consoles.

It enables the sound designer or mixer to take a snapshot of the volumes of all the different channels at any particular time, and then recall them as and when they’re needed.  In live music, for example, an engineer might have one snapshot for the support act, and then switch to another for the main act, or they may want a different mix for each song.

Similarly, in games, we may want a snapshot to be recalled when we reach a certain stage or location in the game.  In order to heighten the tension we may choose to reduce the ambience and music in a certain location and push up the foley, footsteps or the protagonist’s breathing, for example. 

If we do this over a slow enough period of time, 9 times out of 10 the player will not notice this change has taken place.  This is a very powerful hook we can use to get to the player.  Subtle changes in the mix over time, triggered by snapshots, can be used to make the player, either consciously or subconsciously, focus on whatever the designers feel they should be focusing on.

Or, we may want to recall a snapshot when the pause button is pressed so that all the in-game sounds are muted, and then recall another snapshot when the player continues in the game.  This, by the way, has the added advantage that if you use a mix snapshot when pausing a game, you don’t have to individually stop every single sound, and then restart each and every sound when the player continues.  You just recall a snapshot in which the in-game sounds carry on in the background, but have their volume reduced to zero.
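As a rough sketch of the idea, a snapshot can be thought of as a table of group volumes, and recalling one slowly is just a blend from the current table to the new one.  The group names, the volume values (linear, 0.0 to 1.0) and the function name below are made up for illustration, and the sketch assumes every snapshot lists the same groups.

```python
DEFAULT = {"ambience": 1.0, "music": 0.75, "foley": 0.25, "ui": 1.0}
TENSION = {"ambience": 0.25, "music": 0.25, "foley": 1.0, "ui": 1.0}
# Pause: in-game groups keep playing but at zero volume, UI stays audible.
PAUSED = {"ambience": 0.0, "music": 0.0, "foley": 0.0, "ui": 1.0}


def blend_snapshots(current, target, progress):
    """Linearly interpolate every group volume between two snapshots.

    progress runs from 0.0 (all current) to 1.0 (all target); the game
    would advance it a little each frame to get a slow, unnoticed change.
    """
    progress = max(0.0, min(1.0, progress))
    return {
        group: current[group] + (target[group] - current[group]) * progress
        for group in current
    }


halfway = blend_snapshots(DEFAULT, TENSION, 0.5)
paused = blend_snapshots(DEFAULT, PAUSED, 1.0)
```

Recalling the pause snapshot with a progress of 1.0 in one step gives the instant mute, while stepping the progress slowly towards the tension snapshot gives the gradual shift described above.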

In Motorstorm, the team deal with snapshots slightly differently.  They use snapshots within a much smaller timeframe, changing the mix for specific events.  They have a default mix that’s set up initially, and then game events trigger changes in the volumes of groups of sounds for very short events such as car impacts, that last for the length of the impact, and then revert back to the default mix.
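I can’t show the Motorstorm code, so the following is only a sketch of the general pattern: a default mix, plus a short-lived override that a game event switches on and that expires by itself.  The class, the method names, the group names and the timings are all hypothetical.

```python
class EventMixer:
    """Default mix plus one short-lived override triggered by a game event."""

    def __init__(self, default_mix):
        self.default_mix = dict(default_mix)
        self.overrides = {}
        self.override_ends_at = 0.0

    def trigger(self, overrides, duration_s, now_s):
        # Called by the game, for example when a car impact starts.
        self.overrides = dict(overrides)
        self.override_ends_at = now_s + duration_s

    def current_mix(self, now_s):
        # Start from the default and apply the override while it is live.
        mix = dict(self.default_mix)
        if now_s < self.override_ends_at:
            mix.update(self.overrides)
        return mix


mixer = EventMixer({"engine": 1.0, "music": 0.75, "impacts": 1.0})
mixer.trigger({"engine": 0.5, "music": 0.25}, duration_s=0.5, now_s=10.0)
during_impact = mixer.current_mix(now_s=10.2)
after_impact = mixer.current_mix(now_s=10.6)
```

A production version would fade in and out of the override rather than switching instantly.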

As I’ve said, snapshots are the easiest and most widely used run-time mixing technique used these days.  However, in the last couple of years, other ways of doing things have started to appear.

HDR Audio

Certain developers, such as DICE, Splash Damage and our own Guerrilla Studios, have started to tackle the problem of huge numbers of sounds being triggered at once, using what’s been called High Dynamic Range, or ‘HDR Audio’.

The way it works is this. 

A window is defined between a range of dB values to cover the dynamic range for a given system.  The size of that window is governed by the type of sound system the user is using, whether it be headphones, TV or home cinema type system.

Templates are created which contain the samples as well as control data, including real-world loudness data for each sound.



If a really loud sound is triggered, the window will jump up so that that sound is contained at the top of the window, and any other sounds that are triggered that fall below the lower limit of the window are discarded.  As the loud sound dies away, the window, whose size overall doesn’t change, will move down again.  And so what you have is this range of a constant size, moving up and down the loudness range, letting sounds pass through if they’re within the range, and blocking other sounds if they’re not.
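Here’s a much simplified sketch of that moving window.  It assumes the top of the window snaps straight to the loudest active sound, whereas a real system would let the window fall back gradually as the loud sound dies away.  The function name, the sound names and the loudness figures are invented for the example.

```python
def audible_sounds(loudness_db, window_size_db):
    """Return the names of the sounds that fall inside the HDR window.

    loudness_db maps each active sound to its real-world loudness in dB.
    The window has a fixed size; its top sits at the loudest active sound
    and anything below its lower limit is discarded.
    """
    if not loudness_db:
        return []
    top = max(loudness_db.values())
    bottom = top - window_size_db
    return sorted(name for name, level in loudness_db.items() if level >= bottom)


active = {"footsteps": 40.0, "dialogue": 70.0, "rifle": 110.0, "explosion": 140.0}
with_explosion = audible_sounds(active, window_size_db=50.0)

# Once the explosion has died away the window drops back down.
del active["explosion"]
after_explosion = audible_sounds(active, window_size_db=50.0)
```

While the explosion is playing only the explosion and the rifle get through; afterwards the dialogue comes back into the window.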

In some cases, this system can deal with about 80% of the mix changes within the game automatically, and then snapshot changes are used for special cases.


Self-aware systems

Up until recently, most mix systems were event based.  Send the system an event, and it would then change the overall mix in some way that you’ve specified in advance.

However, with the increase in computational power available to designers and programmers working on current generation machines, it’s now possible for us to go one stage further and have systems that are aware of what they’re actually outputting, and to make passive mix decisions accordingly.

If you can store spectral information as metadata about each audio file in your game, and you know how loud each sound is being played and where it is in the 3D world, the system can know exactly what is coming out of each speaker at any moment.

By laying down a set of rules beforehand, you can increase or decrease the volume of any sound or set of sounds, or, more crucially, increase or decrease certain spectral components of any sound or set of sounds on the fly, to leave space in the overall mix for the stuff you want to cut through.  Basically we’re talking on the fly EQ’ing automated by a set of rules you specify beforehand.
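To give a flavour of what such a rule might look like, here’s a deliberately tiny sketch.  It assumes each sound carries metadata with its energy in three broad bands, and applies one rule: wherever the priority sound is strong, cut that band on the competing sound.  The band names, the threshold, the size of the cut and the function name are all assumptions of mine, not a description of anyone’s actual tech.

```python
def carve_space(priority_spectrum_db, threshold_db=-12.0, cut_db=-6.0):
    """Per-band EQ gains (in dB) to apply to a competing sound.

    priority_spectrum_db maps band name -> energy of the sound we want to
    cut through (from stored metadata). Bands where it is at or above the
    threshold get cut on the competing sound; other bands are untouched.
    """
    return {
        band: (cut_db if energy_db >= threshold_db else 0.0)
        for band, energy_db in priority_spectrum_db.items()
    }


# Hypothetical metadata: dialogue energy concentrated in the mid band.
dialogue_spectrum = {"low": -30.0, "mid": -6.0, "high": -18.0}
music_eq = carve_space(dialogue_spectrum)
```

In this example only the mid band of the music is pulled down, leaving its low and high content alone.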

Obviously it’s still early days, but I know of at least a couple of developers, ourselves included, that are doing work on this type of tech at the moment.


TOOLS AND TECHNIQUES

So leaving aside these systems, I want to focus now on some of the more basic tools and techniques that can be used for mixing game audio.


In order to do the job effectively and give us as much control as possible, we need to arm ourselves with the right tools.  In this section, I’m going to give you a brief overview of the sorts of tools and technologies that are useful to you when mixing.

I’m going to talk about the importance of setting priority values for different sets of sounds, so that if there are too many sounds playing you can intelligently drop sounds that the player doesn’t need to hear, or won’t notice if they stop playing.

Next, how to organise the different types of sounds into sub-groups so you can manipulate large amounts of sound simultaneously. 

I’ve already spoken about the ‘whys’ of mixer snapshots.  I’d like to speak a little about the ‘hows’.

And lastly, I’ll talk about dynamics processors, and how to use them effectively.  By dynamics processors, I mean compressors and limiters.


Prioritising Sound Effects

I worked on a PS3 project a couple of years ago, and the audio engine we used had 40 channels that could be used at once.  If all 40 channels are already in use and another sound is called, it will steal a voice from something else, and something, somewhere will stop playing.  If the sound that gets stolen is a looped ambience, it will stop and won’t be started again, which could destroy the atmosphere of a game.

Now 40 sounds may sound like a lot, but when you take into consideration that the ambiences in Heavenly Sword took up a minimum of say 10 channels, the weapons took up between 10 and 15, footsteps could take up to 10 channels, depending on how many characters you had close to you, you soon realise that 40 channels doesn’t really go a long way. 

So, in order to make sure that important sounds are always played and unimportant sounds are stopped when the number of channels available is low, each sound effect needs to be given a priority value. 

A sound with a low priority value should never steal a voice from a sound with a higher priority value.  Similarly, if a sound effect is triggered with a high priority value and all the channels are being used, the system should stop playing the sound with the lowest priority value.
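Those two rules can be sketched as a small voice pool.  This is an illustration only: the class and method names are hypothetical, a higher number means a more important sound, and I’ve assumed that a new sound is rejected when it merely ties with the lowest priority already playing.

```python
class VoicePool:
    """Fixed number of voices with priority-based stealing."""

    def __init__(self, max_voices):
        self.max_voices = max_voices
        self.playing = []  # list of (priority, name) tuples

    def request(self, name, priority):
        """Try to start a sound. Returns (started, stolen_name)."""
        if len(self.playing) < self.max_voices:
            self.playing.append((priority, name))
            return True, None
        # All voices are busy: look at the least important playing sound.
        lowest = min(self.playing, key=lambda voice: voice[0])
        if lowest[0] >= priority:
            return False, None  # never steal from an equal or higher priority
        self.playing.remove(lowest)
        self.playing.append((priority, name))
        return True, lowest[1]


pool = VoicePool(max_voices=2)
first = pool.request("ambience_loop", priority=5)
second = pool.request("footsteps", priority=1)
third = pool.request("critical_dialogue", priority=10)  # steals footsteps
fourth = pool.request("debris", priority=1)             # rejected
```

Note that the looped ambience survives here because its priority is higher than the footsteps, which is exactly the situation the priority values are meant to protect.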

This way you can always be sure that the sounds that are crucial to the gameplay experience, such as critical dialogue or the player’s weapon sounds, will always be played.  Sounds such as footsteps, which the player would never miss when there’s lots going on, give way to more important sounds.

It also helps, in the case of a game in a 3D environment, to weight these priorities depending on the intrinsic loudness of a particular event, and on its distance from the camera.  Even if there’s a lot going on, you may still want to hear really close footsteps, for example, or you may want to ignore really distant explosions.
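One possible way of doing that weighting is sketched below.  The formula and both weights are purely illustrative assumptions; the point is only that loudness pushes the priority up and distance pulls it down, and the numbers would need tuning per game.

```python
def effective_priority(base_priority, loudness_db, distance_m,
                       loudness_weight=0.02, distance_weight=0.05):
    """Base priority, raised for loud events and lowered for distant ones."""
    return (base_priority
            + loudness_weight * loudness_db
            - distance_weight * distance_m)


close_footsteps = effective_priority(base_priority=2, loudness_db=50.0, distance_m=1.0)
distant_explosion = effective_priority(base_priority=5, loudness_db=130.0, distance_m=300.0)
nearby_explosion = effective_priority(base_priority=5, loudness_db=130.0, distance_m=10.0)
```

With these made-up weights the close footsteps outrank the distant explosion, while a nearby explosion still outranks both.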


Sub-Grouping Your Sounds

Grouping your sounds is very important, as it enables you to manipulate the levels of large groups of sounds together, instead of having to modify the volumes of large numbers of individual channels simultaneously.  It’s a lot easier to put 30 different sounds on one mix group and then modify the volume for just that one group than it is to modify those 30 sounds individually.

How you group your sounds depends on the sort of game you’re making, but you also have to take into consideration how you want to manipulate the mix in real time to achieve the right effect from an artistic standpoint, and then group your sounds accordingly. 

When mixing a game with a large number of sounds, or lots going on at once, it’s always easier to pre-mix your sounds into groups first.  On the majority of the cutscenes I’ve worked on I’ve had upwards of 100 different channels in total.

Say you have 15 different channels that contain all the elements of the game ambience.  If you pre-mix all the ambience elements first, so they’re all at the right levels in relation to each other, then put them all into one group, you only then need to worry about the volume of that one group during the final mix.
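In code terms, the idea is simply that each channel keeps its pre-mixed level and the group fader is added on top (working in dB, so gains add).  The channel names, group names and levels below are invented for the sketch.

```python
# Which group each channel belongs to.
GROUP_OF = {"wind": "ambience", "birds": "ambience", "hero_line": "dialogue"}

# Pre-mix: channel levels relative to each other, set once (in dB).
PREMIX_DB = {"wind": -3.0, "birds": -8.0, "hero_line": 0.0}

# Final mix: one fader per group (in dB).
group_fader_db = {"ambience": 0.0, "dialogue": 0.0}


def output_level_db(channel):
    """Level of a channel after its pre-mix level and its group fader."""
    return PREMIX_DB[channel] + group_fader_db[GROUP_OF[channel]]


# Pull the whole ambience down by 6 dB with a single fader move.
group_fader_db["ambience"] = -6.0
```

Both ambience channels drop by 6 dB and keep their balance relative to each other, while the dialogue is untouched.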

On the final mix of a game, or a cutscene, I tend to pre-mix everything so that I end up with sometimes just 5 faders for the whole game.  They might be:
· Dialogue
· Ambience
· Sound Effects
· Music
· UI

The other advantage that grouping sounds together gives you is that you can use dynamics processing on whole groups, instead of individual sounds.  I’ll explain the benefits and give you some examples of this later in the talk.


More About Snapshots

I’ve already spoken a bit about mixer snapshots. 

The main requirement of a decent mixer should be that levels can be adjusted in real-time, without having to restart the game.  If you have to restart your game every time you make the tiniest change, you’ll be there forever.  On a recent project, we didn’t have this functionality, and in order to hear mix changes I had to change the volume of each sound individually, rebuild the data and start the game again.  This meant it took approximately 10 minutes to change the volume of a sound and then hear that change in the game.



Dynamics Processing

When I talk of dynamics processors, I’m talking about compressors and limiters. 

I think it was a Pirelli advert that had the tag-line ‘Power is nothing without control’.  Well, the same applies to the tools we use to make our games.

Compressors and limiters are all about control, and if you’ve got literally hundreds of sounds potentially all being triggered at the same time, then without control, things can easily get out of hand. 

One way of keeping things under control is to put dynamics processors on subgroups that have the potential to get out of hand, groups that are likely to have lots of big transients on them, such as weapons, explosions, or in the case of a driving game for example, car impacts.

Putting compressors and limiters on all of your sub-groups, and setting them up correctly, will help you to maintain control and give you more clarity when there are a lot of sounds being played at once.

Routing Example

Here’s a diagram of a routing configuration for a hypothetical game, indicating how I would set up the subgroups, and where I would use dynamics processors. 




The red lines indicate side-chains from channels or groups that are fed to what are called the key inputs of compressors on other channels or groups.  This is one example of the passive mixing I mentioned earlier.

When using key inputs, the compressor isn’t actually listening to the audio on the channel it sits on.  It’s reducing the gain on that channel depending on what another channel is doing.

There’s a couple of points of interest here.

Firstly, the dialogue has been split up into critical and non-critical dialogue.  Critical dialogue is dialogue the player absolutely must hear.  Non-critical dialogue is throwaway stuff that merely adds to the mood and sets the scene.  In this example, the critical dialogue is sent to the key inputs of compressors on the sound effects, music and non-critical dialogue.  If critical dialogue is triggered and there’s a lot happening on the other channels, the compressors on those channels will reduce their gain so that the critical dialogue comes through.

The other example here is the bullet-bys and NPC weapons.  The player’s weapon and the bullet-bys are sent to the key input of the compressors on the NPC weapons, so if the player fires their gun, or the bullet-bys are triggered loudly, the volume of the NPC weapons is automatically reduced.
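The side-chain routing described in these two examples can be written down as a simple table.  The sketch below is reconstructed from the description in the text only, with hypothetical group names, made-up compressor settings, and one simplification: where a compressor has more than one key source, it just follows the loudest of them.

```python
# Compressed group -> groups feeding its key input.
SIDE_CHAINS = {
    "sound_effects": ["critical_dialogue"],
    "music": ["critical_dialogue"],
    "non_critical_dialogue": ["critical_dialogue"],
    "npc_weapons": ["player_weapon", "bullet_bys"],
}


def gain_reduction_db(key_level_db, threshold_db=-30.0, ratio=4.0):
    """Gain (0 or negative dB) for a given key-input level."""
    over = key_level_db - threshold_db
    return -over * (1.0 - 1.0 / ratio) if over > 0.0 else 0.0


def apply_routing(levels_db):
    """levels_db maps group -> current level in dB (absent means silent).

    Returns the gain to apply to each compressed group.
    """
    gains = {}
    for target, keys in SIDE_CHAINS.items():
        active = [levels_db[key] for key in keys if key in levels_db]
        gains[target] = gain_reduction_db(max(active)) if active else 0.0
    return gains


dialogue_playing = apply_routing({"critical_dialogue": -22.0})
player_firing = apply_routing({"player_weapon": -10.0, "bullet_bys": -40.0})
```

When critical dialogue plays, the sound effects, music and non-critical dialogue are all pulled down and the NPC weapons are left alone; when the player fires, only the NPC weapons are reduced.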


MONITORING

I’d like to say a quick word about monitoring your game.  Regardless of the type of game you’re making, it’s preferable to mix your game in a critical listening environment.  By a critical listening environment, I mean a room that is acoustically accurate, with a monitoring system that has been properly calibrated.

I know that not all developers have the luxury of custom built recording studios or an acoustically accurate room, but there are plenty of commercial recording studios that are dying to get into game audio.  With the music industry in a bit of a mess, a lot of the major recording studios in London are now actively courting game developers, trying to bring in new business. 

In my view it’s a very worthwhile investment to book a couple of days in a studio to play through the game on a monitoring system that is different from what you’re used to working on.

One of the most common mistakes is to judge a mix on a consumer surround setup, where the chances of the system being set up and calibrated properly are minimal.

Therefore, listening to your game in a properly calibrated room on a properly calibrated monitoring system will show up any problems with your mix that may not be obvious on your own setup.  If it sounds good on an accurate system, it’ll sound good anywhere.


STANDARDS

The film and TV industries have had audio standards for years.  The audio systems in most cinemas are set up in a certain way, and film and TV mixers know exactly how their audience will be listening to their content.

However, in the world of video games, no standards currently exist, and to be honest, it really shows in the wide ranging differences between the sound from one game to another.

Generally, games have been excessively loud in the past.  Do any of you remember the startup sound on the Playstation 2?  That sound was played on the machine at full volume, as loud as it could be played.  So, when you switched your machine on, you would set the volume on your TV based on that sound.

That meant that most games would need to make all of their audio really loud in order to match the volume of that initial startup sound.  

Now, a really loud game means very little dynamic range.  No light or shade at all.  I remember producers telling me that they wanted ‘everything louder than everything else!’

However, we as an industry are starting to rectify this, and standards are beginning to emerge*.  Most of the people involved are talking of a dialogue level standard of between 18-22dB RMS.

When mixing, we at Sony are adopting the DVD standard reference level for mixing at 79dB.  DVDs are obviously tailored for viewing in the home, on consumer level equipment, and so it makes sense for us to follow that lead.


* The section on standards was written before the advent of ITU-R BS.1770.

