17 February, 2021 10:24
Summary:
Jigar Dani, Principal PM Manager, Microsoft
Sriram Srinivasan, Principal Software Engineering Manager, Microsoft
Over a decade ago, Skype invented the Silk audio codec to transmit speech over the internet, and it catalyzed the voice over internet protocol (VoIP) industry. The primary codec used in VoIP at the time was G.722, which required 64 kbps to transmit wideband (16 kHz) speech; Silk, on the other hand, offered wideband quality starting at just 14 kbps. Additionally, Silk was an adaptive variable bitrate codec that switched seamlessly from delivering narrowband (8 kHz) speech at an ultra-low bandwidth of 6 kbps to offering near-transparent speech quality at higher bitrates. This was critical for the dial-up and limited broadband internet available at that time, and it served us well as the default codec for Skype and Microsoft Teams. Silk is also the basis of the voice mode in the Opus codec, one of the default WebRTC codecs.
As we enter a new decade, users can now choose from several high-end connectivity alternatives such as high-speed broadband, optical fiber, and 5G. Yet, large segments of Microsoft’s user base are still limited to low cable internet speeds or slower 3G and 4G cellular networks. They often experience situations with over 50% packet loss and sporadic loss of coverage when moving between cell towers, commuting, or switching between network types. Network availability can even be unpredictable in their homes where many share bandwidth with others who are working and learning remotely. After all these years, it turns out that utilization of available bitrate is every bit as important today as it was in the dial-up world. Any bitrate savings can be used to provide additional resiliency and improve experiences on other workloads like modern video or content sharing.
Our challenge is to deliver a virtual voice experience that’s as good as talking in person even over ultra-low bandwidth and in highly constrained network conditions. To truly serve our customers, we know they need to be able to communicate and collaborate on the go, on all device types, over any network, in every environment.
That’s why we’re excited to share the details of our new AI-powered audio codec named Satin. Satin can deliver super wideband speech starting at a bitrate of 6 kbps, and full-band stereo music starting at a bitrate of 17 kbps, with progressively higher quality at higher bitrates. Satin has been designed to provide great audio quality even under high packet loss. In addition, its great quality at low bitrates allows us to use more of the available bandwidth to provide better resiliency to packet loss. The two audio samples that follow demonstrate the net effect of our improved resiliency algorithms and the new Satin codec (please use your favorite headset to listen to them).
Silk at 6 kbps, burst packet loss:
[Audio sample]
Satin at 6 kbps with improved resilience, burst packet loss:
[Audio sample]
Our team built this new codec by combining decades of algorithmic experience and advanced machine learning techniques. Let’s take a deeper dive into how Satin works.
What’s narrowband, wideband, and super wideband voice?
Our ears can generally perceive sounds that range in frequency from 20 Hz to 20 kHz. When dealing with discrete time signals, we need to sample the audio waveform at a minimum of twice the highest frequency we wish to reproduce. This is why CD-quality music is sampled at 44.1 kHz (44100 samples per second) or 48 kHz. Early telephony systems used a sampling rate of 8 kHz and could reproduce frequencies up to 4 kHz (in practice up to 3.4 kHz), which was considered sufficient at the time for speech communication. While a lower sampling rate means fewer bits per second to transmit over the wire, it resulted in the all too familiar tinny voice quality over the phone, because the higher vocal frequencies present in natural speech could not be reproduced. VoIP solutions, which were no longer limited by the narrowband telephony infrastructure, introduced us to the magic of wideband speech (reproducing frequencies up to 8 kHz, sampled at 16 kHz), and users were immediately able to appreciate the crisper, more natural and intelligible sound.
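The relationship between sampling rate and reproducible bandwidth described above can be sketched in a few lines of Python. This is only an illustration of the "twice the highest frequency" rule; the function names and the band labels in the dictionary are chosen here for readability and are not part of any codec API.

```python
# Sampling rates mentioned in the text, in Hz. The labels are illustrative.
SAMPLING_RATES_HZ = {
    "narrowband telephony": 8_000,
    "wideband VoIP": 16_000,
    "CD-quality music": 44_100,
    "48 kHz audio": 48_000,
}


def max_reproducible_frequency_hz(sampling_rate_hz: float) -> float:
    """Highest frequency that a given sampling rate can represent."""
    return sampling_rate_hz / 2


def min_sampling_rate_hz(highest_frequency_hz: float) -> float:
    """Lowest sampling rate needed to reproduce a given frequency."""
    return 2 * highest_frequency_hz


if __name__ == "__main__":
    for name, rate in SAMPLING_RATES_HZ.items():
        limit = max_reproducible_frequency_hz(rate)
        print(f"{name}: {rate} Hz sampling reproduces up to {limit:.0f} Hz")

    # Covering the full 20 kHz range of hearing needs at least 40 kHz sampling,
    # which is why 44.1 kHz and 48 kHz are used for music.
    print(min_sampling_rate_hz(20_000))
```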
Codecs like Silk and Opus took this a step further with the introduction of super wideband voice, capturing frequencies up to 12 kHz, sampled at 24 kHz (for human voice, energy drops off rapidly at frequencies above 12 kHz). As mentioned earlier, higher sampling rates imply a higher bitrate. Satin redefines super wideband to cover frequencies up to 16 kHz (sampled at 32 kHz) for greater clarity and sibilance, and its efficient compression enables super wideband voice at 6 kbps.
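To get a feel for how tight a 6 kbps budget is at a 32 kHz sampling rate, the following back-of-the-envelope sketch computes the bits available per sample and per packet. The 16-bit PCM reference and the 20 ms frame length are assumptions made for this illustration (the article does not state Satin's frame size), and the helper names are hypothetical.

```python
PCM_BITS_PER_SAMPLE = 16  # assumption: uncompressed 16-bit mono PCM as reference
FRAME_MS = 20             # assumption: a common VoIP frame length, not from the text


def bits_per_sample(bitrate_bps: float, sampling_rate_hz: float) -> float:
    """Average number of coded bits available for each audio sample."""
    return bitrate_bps / sampling_rate_hz


def pcm_bitrate_bps(sampling_rate_hz: float, bits: int = PCM_BITS_PER_SAMPLE) -> float:
    """Bitrate of uncompressed mono PCM at the given sampling rate."""
    return sampling_rate_hz * bits


def payload_bytes_per_frame(bitrate_bps: float, frame_ms: float = FRAME_MS) -> float:
    """Codec payload size for one frame, ignoring packet headers."""
    return bitrate_bps * frame_ms / 1000 / 8


if __name__ == "__main__":
    rate_hz, bitrate = 32_000, 6_000  # super wideband at 6 kbps, as in the text
    print(bits_per_sample(bitrate, rate_hz))    # bits per sample
    print(pcm_bitrate_bps(rate_hz) / bitrate)   # compression ratio vs. 16-bit PCM
    print(payload_bytes_per_frame(bitrate))     # bytes per assumed 20 ms frame
```

Under these assumptions the codec has well under one bit per sample to work with, which is why a sparse parametric representation is needed rather than coding the waveform directly.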
Frequency components of the sound /t/ in the word “suit.” There is a significant amount of energy well beyond the narrowband cutoff of 4 kHz and even the wideband cutoff of 8 kHz. Preserving energy in the higher spectral components results in more natural sounding speech.
Listen to these two samples below on your headphones. The Satin super wideband speech sample sounds a lot more natural and intelligible, much like what you hear when you are talking to someone in person.
Silk narrowband at 6 kbps:
[Audio sample]
Satin super wideband at 6 kbps:
[Audio sample]
How do you achieve super wideband at 6 kbps?
To achieve super wideband quality at 6 kbps, Satin uses a deep understanding of speech production, modelling and psychoacoustics to extract and encode a sparse representation of the signal. To further reduce the required bitrate, Satin only encodes and transmits certain parameters in the lower frequency bands. At the decoder, Satin uses deep neural networks to estimate the high-band parameters from two inputs: the received low-band parameters and a minimal amount of side information sent over the wire.
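The data flow described above (transmit low-band parameters plus a little side information, then estimate the high band at the decoder) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the parameter counts, the tiny untrained network with random placeholder weights, and the function names. It does not reflect Satin's actual parameters, network architecture, or training.

```python
import math
import random


def make_untrained_mlp(n_in: int, n_hidden: int, n_out: int, seed: int = 0):
    """Build a tiny one-hidden-layer network with random placeholder weights.

    A real system would load trained weights; these are stand-ins only.
    """
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
    return w1, w2


def mlp_forward(weights, inputs):
    """Forward pass: tanh hidden layer followed by a linear output layer."""
    w1, w2 = weights
    hidden = [math.tanh(sum(w * x for w, x in zip(row, inputs))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def decode_frame(weights, low_band_params, side_info):
    """Decoder side: estimate high-band parameters from what was received."""
    high_band_params = mlp_forward(weights, low_band_params + side_info)
    return low_band_params + high_band_params


if __name__ == "__main__":
    low_band = [0.8, 0.5, 0.3, 0.2]  # hypothetical transmitted low-band parameters
    side_info = [0.1]                # hypothetical side information
    weights = make_untrained_mlp(n_in=5, n_hidden=8, n_out=4)

    full_band = decode_frame(weights, low_band, side_info)
    sent = len(low_band) + len(side_info)
    print(f"values sent: {sent}, values reconstructed: {len(full_band)}")
```

The point of the sketch is only the asymmetry: fewer values cross the wire than the decoder ends up with, because the high band is predicted rather than transmitted.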
While this approach solved the primary challenge of reproducing super wideband voice at ultra-low bitrates, it introduced a new challenge of computational complexity. The analysis of the input speech signal to extract a low-dimensional representation is computationally intensive. Real-time inference on deep neural networks adds even more complexity. To solve this, the team focused on both algorithmic optimizations and techniques like loop vectorization beyond what the compiler could achieve. This achieved a nearly 40% reduction in computational complexity and allowed us to run on all our users’ devices.
As with all new features, we A/B tested Satin before widely rolling it out, both to ensure there were no regressions and to quantify the positive impact for our users. The A/B tests showed a statistically significant increase in call duration for Satin compared to Silk at these low bitrates. Offline, crowdsourced subjective tests to evaluate codec quality at 6 kbps showed the mean opinion score (MOS) rating of Satin to be 1.7 MOS higher than that of Silk.
How resilient is Satin to packet loss?
The majority of calls are on Wi-Fi and mobile networks, where packet loss is common and can adversely affect call quality. Satin is uniquely positioned to compensate for this. Unlike most other voice codecs, Satin encodes each packet independently, so losing one packet does not affect the quality of subsequent packets. The codec is also designed to facilitate high quality packet loss concealment in an internal parametric domain. These features help Satin seamlessly handle random losses where one or two packets are lost at a time.
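The benefit of independently decodable packets can be shown with a deliberately simplified sketch. It compares a toy delta-coded stream, where the decoder state depends on every earlier packet, with a toy stream of absolute parameter values, where a lost frame is concealed by interpolating between its neighbours. Both decoders, the single-parameter "track", and the interpolation rule are assumptions for illustration; real predictive codecs limit error propagation better than this toy, and Satin's actual concealment method is not described in the article.

```python
def decode_predictive(deltas, lost):
    """Toy decoder whose state depends on all earlier packets (delta coding)."""
    value, out = 0.0, []
    for i, delta in enumerate(deltas):
        if i not in lost:
            value += delta  # a lost delta is never applied, so the error persists
        out.append(value)
    return out


def decode_independent(values, lost):
    """Toy decoder where every packet carries absolute parameter values.

    Lost frames are concealed by averaging the previous output and the next
    received value (assumes the next packet is already in the jitter buffer).
    """
    out = []
    for i, value in enumerate(values):
        if i not in lost:
            out.append(value)
            continue
        prev = out[-1] if out else None
        nxt = next((values[j] for j in range(i + 1, len(values)) if j not in lost), None)
        neighbours = [x for x in (prev, nxt) if x is not None]
        out.append(sum(neighbours) / len(neighbours) if neighbours else 0.0)
    return out


if __name__ == "__main__":
    track = [1.0, 2.0, 4.0, 3.0, 5.0]  # hypothetical parameter value per frame
    deltas = [b - a for a, b in zip([0.0] + track, track)]
    lost = {1}                         # packet 1 is dropped

    print(decode_predictive(deltas, lost))   # error carries into later frames
    print(decode_independent(track, lost))   # only the lost frame is affected
```

In the delta-coded case every frame after the loss is wrong until the decoder is resynchronized, while in the independent case only the concealed frame deviates from the original.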
Another type of packet loss, which is even more detrimental to perceived quality, is when several packets are lost in a burst. Here, Satin’s ability to deliver great audio at a low rate of 6 kbps provides the flexibility to use some of the available bitrate to add redundancy and forward error correction to quickly recover from these situations. Satin does this without compromising overall audio quality.
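One common way to spend spare bitrate on burst-loss protection is to piggyback copies of earlier frames on later packets. The sketch below is a minimal illustration of that general idea, not Satin's actual redundancy or forward error correction scheme, which the article does not detail. The packet layout, the offsets, and the assumption that redundant copies cost the same as the primary frame are all hypothetical.

```python
def packetize(frames, redundancy_offsets):
    """Build packets that carry frame n plus copies of frames n - k."""
    packets = []
    for n in range(len(frames)):
        payload = {n: frames[n]}
        for k in redundancy_offsets:
            if n - k >= 0:
                payload[n - k] = frames[n - k]
        packets.append(payload)
    return packets


def receive(packets, lost_packets, n_frames):
    """Reassemble frames from the packets that arrived; None marks a gap."""
    recovered = {}
    for n, payload in enumerate(packets):
        if n in lost_packets:
            continue
        for index, frame in payload.items():
            recovered.setdefault(index, frame)
    return [recovered.get(i) for i in range(n_frames)]


def total_bitrate_kbps(primary_kbps, copies):
    """Total rate if each redundant copy costs as much as the primary frame."""
    return primary_kbps * (1 + copies)


if __name__ == "__main__":
    frames = [f"f{i}" for i in range(8)]
    burst = {2, 3}  # two consecutive packets lost

    print(receive(packetize(frames, ()), burst, len(frames)))      # frames 2 and 3 missing
    print(receive(packetize(frames, (1, 2)), burst, len(frames)))  # both recovered from packet 4
    print(total_bitrate_kbps(6, copies=2))                         # kbps needed in this toy model
```

The trade-off is that recovered frames arrive late (here, up to two packets after they were due), so the receiver needs enough buffering to make use of them.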
Satin is already being used for all Teams and Skype two-party calls and will roll out for Teams meetings soon. It currently operates in wideband voice mode within a bitrate range of 6 – 36 kbps and will be extended to support full-band stereo music at a maximum sampling rate of 48 kHz in the near future. We are very excited for you to try this new codec and let us know what you think.
Subscribe to Teams Engineering Blog RSS feed to stay in touch with the latest innovations from Teams.
Want to work on the team that builds bleeding-edge AI technology? See AI Jobs in M365 Intelligent Conversations and Communications Cloud Team.
Date: 2021-02-17 16:00:00Z
Link: https://techcommunity.microsoft.com/t5/microsoft-teams-blog/satin-microsoft-s-latest-ai-powered-audio-codec-for-real-time/ba-p/2141382