UPD: the FreeSWITCH integration scripts are available at https://github.com/voxserv/fsqa
A customer has requested to set up a QA service that would continuously monitor the voice quality in their telephony infrastructure. They use a number of telephony carriers, and a set of applications on top of Plivo and FreeSWITCH. Also the conference module in FreeSWITCH is actively used.
Measuring jitter and packet loss, like it’s done in VoIPmonitor, is not sufficient, as we need to monitor end-to-end performance, including that of the FreeSWITCH server itself. So, there has to be a software component that compares the source audio with the recording on the other end of a call.
There are currently two major player on the market for voice quality measurements:
- ITU-T PESQ algorithm is proposed as an ITU recommendation P.862. Its source code is available at the ITU website and on Github. But the algorithm is patented, and the source code license does not allow any production use. The evaluation went quite smoothly, and the algorithm was able to detect even minor distortions, like one 20ms frame loss in a 2-minute call. The PESQ algorithm is designed and calibrated to be used for audio files of 6 to 20 seconds in length. Processing of a 2-minute recording takes approximately 5 seconds on a modern Xeon CPU. Commercial software is provided by OPTICOM and PsyTechnics.
- Sevana.biz is an Estonian company that provides their own algorithms and software product for voice quality assessment. Their AQuA (Audio Quality Analyzer) software provides a fast and reliable way to compare the audio files: processing of a 2-minutes recording took about half a second on a modern CPU. Sevana has kindly provided a 10-days evaluation license and a fully functional software package, and the customer decided to go ahead with purchasing the license.
The simplest single-server license for Sevana AQuA allows running only one AQuA process at a time, so we wrapped its execution into a Perl script that utilizes a simple exclusive locking mechanism and performs audio file processing one at a time.
AQuA produces two scores in each measurement: the similarity percentage, and the MOS score. Both metrics are useful for quality analysis (for example, a 20ms frame added or lost inside of a silent pause influences the similarity score more significantly than MOS). It also takes a number of command-line options which can increase its tolerance to certain types of distortions, such as frequencies outside of G.711 range.
FreeSWITCH software is used as the SIP server for sending and terminating voice calls and for recording the received audio. It allows recording in several different formats: a) raw codec recording, done in the same thread as RTP processing; b) 16-bit signed PCM in WAV format, and file writing is done in a separate thread; c) compressed voice in a number of formats. The first two options produce similar results (raw codec recording had difficulties in the beginning). In case of raw codec recording, an additional step is required to convert the input files into 16-bit PCM WAV.
The call recording server requires to have a precise clock reference, so a baremetal hardware is required. Virtualized environments add up some uncontrollable imprecision to the virtual machines, although a thorough lab test is requires to verify this. It also depends on the type of hypervisor, as they implement the system clock differently.
The Linux kernel provides access to various clock sources. TSC is commonly used as default, and there is also HPET clock on modern hardware platforms. HPET is supposed to provide a more precise clock source, but it appears that it depends on CPU load: we accidentally discovered that audio recording in FreeSWITCH is significantly distorted when there’s some CPU activity is done in parallel (Debian package builder was working on the same 8-core machine). So far, TSC clock on a baremetal server provided the most reliable results.
The recording is done into a tmpfs mounted partition, in order to avoid any dependency on I/O load. The processing script performs the quality assessment on recorded files, and then moves or deletes them, depending on the measured score.
The SIP service was attached to an unusual UDP port, as port 5060 is frequently accessed by port scanners in public Internet. The DNS NAPTR and SRV records are used in order to use a universal SIP URI string, without having to reconfigure the remote servers if the IP address or UDP port changes.
Jitter buffer is disabled by default in FreeSWITCH, and it has to be activated whenever the calls are terminated on the server. In our case, the “jitterbuffer_msec” variable is set to “50:50” in the dialplan before answering and recording the call. With this, the jitter buffer is not allowed to grow dynamically above 50ms. So, we tolerate most of typical Internet-imposed jitter, but clock drift on the sending side would cause packet drop on the receiver.
The dialplan is designed to accept direct SIP calls from remote servers, and PSTN calls from telephony providers. If a remote server calls our QA service directly, it encodes the source name in the user part of the SIP URI. Also there are two options for a QA call: it can playback the test audio, or send silence. In case of PSTN calls, the caller ID is used as the source identifier. The dialplan activates audio recording into a WAV file on a tmpfs partition, and launches the processing script after the hangup.
The conference dialer is used for testing the conferencing performance on a production FreeSWITCH server. It requires a conferencing profile that does not play any greetings to conference participants. Also in case of more than two participants, only one has to be chosen as a speaker, and all others would be listeners. A dedicated SIP URI on the QA server is reserved to playback the test audio and not to perform any recording.
Each measurement result for QA calls is stored in an SQL database for further processing, and also sent to Syslog for real-time monitoring.
The test audio is a concatenation of speech samples from ITU-T Recommendation P.50 Appendix I, resampled from 16KHz to 8KHz and stored as 16-bit signed PCM audio.