Date: 2023-11-16

Netflix, the popular streaming platform, has recently released a research paper highlighting the importance of Speech and Music Activity Detection (SMAD) in localization and dubbing. SMAD is a process that allows users to determine the amount of speech and music in each frame of an audio file. This technology has various practical uses in the media and entertainment industry, such as translation and dub-script generation, spoken-language identification, and speech transcription.
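
To make the idea of frame-level activity concrete, the following is a minimal sketch of how per-frame speech and music scores might be turned into active regions. It is an illustration only, not the method from the Netflix paper: the function name `active_segments`, the 0.5 threshold, and the sample scores are all assumptions. Speech and music are handled as two independent tracks, so a frame can be active in both at once.

```python
def active_segments(probs, threshold=0.5):
    """Merge consecutive frames scoring >= threshold into (first, end) index pairs.

    `end` is exclusive. The threshold value is an assumption for illustration.
    """
    segments = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i                      # a new active run begins
        elif p < threshold and start is not None:
            segments.append((start, i))    # the run ended at the previous frame
            start = None
    if start is not None:
        segments.append((start, len(probs)))  # run extends to the last frame
    return segments


# Hypothetical per-frame scores for a five-frame clip.
speech_probs = [0.1, 0.9, 0.8, 0.2, 0.7]
music_probs = [0.6, 0.7, 0.1, 0.1, 0.9]

speech_segments = active_segments(speech_probs)
music_segments = active_segments(music_probs)
```

In this toy clip, frames 1 and 4 are active in both tracks, which is the kind of overlap between speech and music that occurs in series and films.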

In their research, Netflix’s team of authors identified a challenge in labeling speech and music activity in audio frames on a large scale. The task is expensive, labor-intensive, and often hindered by copyright limitations that prevent public sharing of audio content. Existing datasets also have limitations: some focus solely on speech detection or solely on music detection, and so lack the overlapping speech and music segments that occur in series and films.

To overcome these limitations, Netflix decided to create its own large-scale dataset using its extensive catalog of TV series and films. Named the TV Speech and Music (TVSM) dataset, it comprises approximately 1600 hours of professionally recorded and produced audio. The dataset includes content from different genres published between 2016 and 2019, with 60% originating from the United States. English, Spanish, and Japanese are the three languages represented in the dataset.

The team leveraged various sources to obtain labels for the dataset, including subtitles, scripted musical cue sheets, and predictions from pre-trained models. Subtitles proved particularly useful for speech labels, as they provide reliable timestamps for speech utterances and often include lyrics from singing voices. The team trained their model on subsets of 20-second segments with noisy labels derived from subtitles, as well as clean annotations manually created for comparison.
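
As a rough illustration of how subtitle timestamps could yield noisy speech labels for a 20-second segment, here is a minimal sketch. It is not the team's actual pipeline: the `Cue` type, the function name `cues_to_frame_labels`, and the 100 ms frame hop are assumptions, and times are kept in integer milliseconds to avoid floating-point rounding.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """A subtitle cue with start and end times in milliseconds (hypothetical type)."""
    start_ms: int
    end_ms: int


def cues_to_frame_labels(cues, segment_start_ms, segment_len_ms=20_000, hop_ms=100):
    """Return one 0/1 speech label per frame of the segment.

    A frame is labeled 1 if any subtitle cue overlaps it at all. The 100 ms
    hop is an assumed value, not one taken from the paper.
    """
    n_frames = segment_len_ms // hop_ms
    labels = [0] * n_frames
    segment_end_ms = segment_start_ms + segment_len_ms
    for cue in cues:
        # Clip the cue to the segment boundaries.
        start = max(cue.start_ms, segment_start_ms)
        end = min(cue.end_ms, segment_end_ms)
        if end <= start:
            continue  # the cue lies entirely outside this segment
        first = (start - segment_start_ms) // hop_ms
        last = -(-(end - segment_start_ms) // hop_ms)  # ceiling division
        for i in range(first, min(last, n_frames)):
            labels[i] = 1
    return labels


# Hypothetical cues: one inside the segment, one crossing its end, one outside it.
cues = [Cue(1_000, 2_500), Cue(19_500, 21_000), Cue(30_000, 31_000)]
labels = cues_to_frame_labels(cues, segment_start_ms=0)
```

Labels produced this way are noisy because subtitle timing only approximates when speech is actually audible, which is why manually created clean annotations are useful for comparison.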

The results of their research indicated that their benchmark methods outperformed existing models trained on synthetic and smaller datasets. The team’s ultimate goal is to improve productivity across teams working in multiple languages worldwide. While Netflix has made its audio features and labels available on Zenodo and a GitHub repository, the TVSM dataset itself remains private due to copyright concerns.

(Source: Netflix Technology Blog)