Capturing Audio Per-Application on Windows: Research
Diagram explaining a possible implementation

If you're developing only for Windows 11 and later versions of Windows 10, there's now an official API that does this; more here and here.

Intro

A couple of years ago I began, emphasis on began, a project to capture audio per application, but it never escaped the research stage. Maybe one day I'll go back and do it properly, but for now I'll share what I researched.

The Research

Starting with Windows Vista, Microsoft introduced a new way for applications to interact with the audio system: the Windows Audio Session API, or WASAPI. Most modern applications use WASAPI; it provides everything competing driver models provide, such as bit-perfect transfer and low latency, along with continued development and support. Unless the software uses device-specific drivers (e.g. ASIO), the majority of applications route their audio through WASAPI.

Windows Audio Stack. Source: https://learn.microsoft.com/en-us/windows-hardware/drivers/audio/windows-audio-architecture

Windows does not mix audio together until it reaches the audio engine ("The audio engine: Mixes and processes audio streams…"), so at some point an audio buffer belonging to a specific application must be exposed. In the WASAPI documentation we find that:


The audio engine is the user-mode audio component through which applications share access to an audio endpoint device. The audio engine transports audio data between an endpoint buffer and an endpoint device. To play an audio stream through a rendering endpoint device, an application periodically writes audio data to a rendering endpoint buffer. The audio engine mixes the streams from the various applications.

Thus we know we need to find the functions that write audio data to the rendering endpoint buffer. In the WASAPI docs we find "IAudioRenderClient: Enables a client to write output data to a rendering endpoint buffer."

IAudioRenderClient exposes two methods to interact with, GetBuffer and ReleaseBuffer. With GetBuffer, the application requests the number of audio frames it plans to write, and in return it receives a pointer to the next place in the endpoint buffer it should write to.

The application learns the maximum number of frames it can request via IAudioClient::GetBufferSize. Once the application has initially filled the buffer, the number of audio frames to request in subsequent GetBuffer calls must be calculated with GetCurrentPadding, which "indicates the amount of valid, unread data that the endpoint buffer currently contains": the free space is the total buffer size minus the padding.

So at GetBuffer we know how much the application wants to write and the pointer the OS gives it to write into the buffer. Each GetBuffer must be followed by a call to ReleaseBuffer, and ReleaseBuffer is the only point at which we know the buffer has actually been filled. ReleaseBuffer takes two inputs: the number of audio frames written and a flag (AUDCLNT_BUFFERFLAGS_SILENT) which tells the engine to treat the data as if it were silence.
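
To make the flow concrete, here is a condensed sketch of the render loop described above, from the playing application's point of view. Initialization (device enumeration, IAudioClient::Initialize, event setup) and error handling are omitted, and FillAudio is a placeholder for the application's own code, not a real API.

```cpp
// Sketch of a WASAPI shared-mode render loop: total buffer size, padding,
// GetBuffer, fill, ReleaseBuffer.
#include <windows.h>
#include <audioclient.h>

void RenderLoop(IAudioClient* audioClient, IAudioRenderClient* renderClient,
                const WAVEFORMATEX* format)
{
    UINT32 bufferFrames = 0;
    audioClient->GetBufferSize(&bufferFrames);      // total size of the endpoint buffer

    for (;;)
    {
        // How much valid, unread data is still queued in the buffer?
        UINT32 padding = 0;
        audioClient->GetCurrentPadding(&padding);

        // Free space = total buffer minus the padding.
        UINT32 framesToWrite = bufferFrames - padding;
        if (framesToWrite == 0)
            continue;   // in real code the app waits on the buffer event instead of spinning

        BYTE* data = nullptr;
        renderClient->GetBuffer(framesToWrite, &data);

        // The application writes framesToWrite * format->nBlockAlign bytes here.
        // FillAudio(data, framesToWrite, format);   // placeholder, not a real API

        // Only after ReleaseBuffer does the engine consider the data written.
        renderClient->ReleaseBuffer(framesToWrite, 0);
    }
}
```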

We will also need the WAVEFORMATEX structure used by the stream, as the "size of an audio frame is specified by the nBlockAlign member"; it can be obtained via GetMixFormat. As well as the frame size, you will also need the sample rate, number of channels, and bits per sample, so you can interpret the audio data in your own application.
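
A minimal sketch of pulling those fields out of the mix format; note that GetMixFormat allocates the structure with CoTaskMemAlloc, so the caller has to free it with CoTaskMemFree.

```cpp
// Ask IAudioClient for the shared-mode mix format and print the fields needed
// to interpret raw buffer data.
#include <windows.h>
#include <audioclient.h>
#include <objbase.h>
#include <cstdio>

void PrintMixFormat(IAudioClient* audioClient)
{
    WAVEFORMATEX* fmt = nullptr;
    if (SUCCEEDED(audioClient->GetMixFormat(&fmt)))
    {
        std::printf("frame size (nBlockAlign): %u bytes\n", fmt->nBlockAlign);
        std::printf("sample rate:              %lu Hz\n", fmt->nSamplesPerSec);
        std::printf("channels:                 %u\n", fmt->nChannels);
        std::printf("bits per sample:          %u\n", fmt->wBitsPerSample);
        CoTaskMemFree(fmt);
    }
}
```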

With an understanding of how an application outputs audio, we know that we need to hook at the very least GetBuffer and ReleaseBuffer. GetMixFormat returns the format of the endpoint, which all shared-mode applications should share, so retrieving it should likely be the responsibility of the application receiving the intercepted audio. There are many hooking libraries available; Microsoft offers its own open-source library called Detours.
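
Here is a rough sketch of what hooking those two calls with Detours might look like, inside a DLL injected into the target process. It assumes the usual COM vtable layout (the three IUnknown slots followed by GetBuffer and ReleaseBuffer), keeps only one global "last buffer" instead of proper per-instance bookkeeping, and SendToCaptureSink is a placeholder I made up for whatever carries the data out, not a real API.

```cpp
// Sketch: intercept IAudioRenderClient::GetBuffer/ReleaseBuffer with Detours.
#include <windows.h>
#include <audioclient.h>
#include <detours.h>

void SendToCaptureSink(const BYTE* data, UINT32 bytes);   // placeholder, see the IPC sketch later

typedef HRESULT(STDMETHODCALLTYPE* GetBuffer_t)(IAudioRenderClient*, UINT32, BYTE**);
typedef HRESULT(STDMETHODCALLTYPE* ReleaseBuffer_t)(IAudioRenderClient*, UINT32, DWORD);

static GetBuffer_t     RealGetBuffer     = nullptr;
static ReleaseBuffer_t RealReleaseBuffer = nullptr;

static BYTE*  g_lastBuffer     = nullptr;   // last pointer handed out by GetBuffer
static UINT32 g_frameSizeBytes = 0;         // nBlockAlign from the mix format

// Read the real method addresses out of the vtable of any IAudioRenderClient
// instance (e.g. one the DLL creates itself): IUnknown occupies slots 0-2,
// then GetBuffer (3) and ReleaseBuffer (4).
static void ResolveRealPointers(IAudioRenderClient* client)
{
    void** vtbl = *reinterpret_cast<void***>(client);
    RealGetBuffer     = reinterpret_cast<GetBuffer_t>(vtbl[3]);
    RealReleaseBuffer = reinterpret_cast<ReleaseBuffer_t>(vtbl[4]);
}

static HRESULT STDMETHODCALLTYPE HookGetBuffer(IAudioRenderClient* self,
                                               UINT32 frames, BYTE** data)
{
    HRESULT hr = RealGetBuffer(self, frames, data);
    if (SUCCEEDED(hr))
        g_lastBuffer = *data;               // remember where the app will write
    return hr;
}

static HRESULT STDMETHODCALLTYPE HookReleaseBuffer(IAudioRenderClient* self,
                                                   UINT32 framesWritten, DWORD flags)
{
    // Only now is the data actually valid, unless the app marked it as silence.
    if (g_lastBuffer && framesWritten && !(flags & AUDCLNT_BUFFERFLAGS_SILENT))
        SendToCaptureSink(g_lastBuffer, framesWritten * g_frameSizeBytes);
    return RealReleaseBuffer(self, framesWritten, flags);
}

static void InstallHooks()   // RealGetBuffer/RealReleaseBuffer must be resolved first
{
    DetourTransactionBegin();
    DetourUpdateThread(GetCurrentThread());
    DetourAttach(&(PVOID&)RealGetBuffer, HookGetBuffer);
    DetourAttach(&(PVOID&)RealReleaseBuffer, HookReleaseBuffer);
    DetourTransactionCommit();
}
```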

A theoretical implementation

Some Considerations

I only covered WASAPI, but for all the good its backwards compatibility provides, Windows also carries baggage. MME is an old Windows audio API that, while increasingly uncommon, is still used by some applications; it should still be possible to hook the right functions to capture that audio (a rough sketch follows below). Unlike MME, older software that works via WDM/KS is likely impossible to cover with a universal solution, as there's no single implementation or standard API to hook.
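
For MME the natural interception point would be waveOutWrite, since the WAVEHDR it receives already carries the sample data and its length. A rough sketch, assuming the same Detours setup and the same hypothetical SendToCaptureSink placeholder as the WASAPI sketch above:

```cpp
// Sketch: intercept MME playback at waveOutWrite.
#include <windows.h>
#include <mmsystem.h>
#include <detours.h>

void SendToCaptureSink(const BYTE* data, UINT32 bytes);   // same placeholder as above

typedef MMRESULT(WINAPI* waveOutWrite_t)(HWAVEOUT, LPWAVEHDR, UINT);
static waveOutWrite_t RealWaveOutWrite = waveOutWrite;

static MMRESULT WINAPI HookWaveOutWrite(HWAVEOUT hwo, LPWAVEHDR hdr, UINT cbwh)
{
    // The WAVEHDR already describes the audio about to be played.
    if (hdr && hdr->lpData && hdr->dwBufferLength)
        SendToCaptureSink(reinterpret_cast<const BYTE*>(hdr->lpData), hdr->dwBufferLength);
    return RealWaveOutWrite(hwo, hdr, cbwh);
}

static void InstallMmeHook()
{
    DetourTransactionBegin();
    DetourUpdateThread(GetCurrentThread());
    DetourAttach(&(PVOID&)RealWaveOutWrite, HookWaveOutWrite);
    DetourTransactionCommit();
}
```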

A potential problem is the introduction of noticeable latency. Some will obviously be added as a byproduct, and while WASAPI in shared mode is not by design an ultra-low-latency system, the extra latency could still be noticeable to the end user. I can't say how likely it is to be a problem, I'd guess it won't be, but I wouldn't discount it entirely.

Another design choice that will need to be made is how to actually return the data to the requesting app. It might be as simple as handing it a buffer and having it free it, or it could be more complex depending on the requirements.

Another is anti-cheat and anti-malware software: any injection into another process is inherently suspicious. In video games this will likely lead to a ban unless you get explicit permission from the anti-cheat vendor, or, more maliciously, you avoid detection, which is a pretty bad idea.

Another factor worth mentioning, and one that's always fun to work with, is IPC, or interprocess communication. Shared memory is the first thing that comes to mind, but in my opinion pipes are likely more appropriate.
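
As a minimal sketch of the pipe idea, the injected DLL could connect to a named pipe owned by the capturing application and stream raw audio into it; this would also be one way to implement the SendToCaptureSink placeholder from the earlier sketches. The pipe name here is invented for illustration, and framing, the format handshake (sending the WAVEFORMATEX), and error recovery are left out.

```cpp
// Sketch: pipe-backed implementation of the capture sink inside the injected DLL.
#include <windows.h>

static HANDLE g_pipe = INVALID_HANDLE_VALUE;

// Connect to a pipe created by the capturing application (hypothetical name).
bool ConnectCapturePipe()
{
    g_pipe = CreateFileW(L"\\\\.\\pipe\\per_app_audio_capture",
                         GENERIC_WRITE, 0, nullptr, OPEN_EXISTING, 0, nullptr);
    return g_pipe != INVALID_HANDLE_VALUE;
}

// Push raw audio bytes down the pipe; blocking write is fine for a sketch.
void SendToCaptureSink(const BYTE* data, UINT32 bytes)
{
    DWORD written = 0;
    if (g_pipe != INVALID_HANDLE_VALUE)
        WriteFile(g_pipe, data, bytes, &written, nullptr);
}
```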

A final consideration is that I could be entirely wrong about some or all of what's written here. WASAPI, and the Windows API in general, is a complex system, and despite my best efforts I could have made any number of mistakes.

Conclusion

Maybe one day I'll do something more with this, but at the very least I hope this was a good primer for anybody interested.