    AI model from OpenAI automatically recognizes speech and translates it to English

    Benj Edwards / Ars Technica

    On Wednesday, OpenAI released a new open source AI model called Whisper that recognizes and translates audio at a level that approaches human recognition ability. It can transcribe interviews, podcasts, conversations, and more.

    OpenAI trained Whisper on 680,000 hours of audio data and matching transcripts in 98 languages collected from the web. According to OpenAI, this open-collection approach has led to “improved robustness to accents, background noise, and technical language.” It can also detect the spoken language and translate it to English.

    OpenAI describes Whisper as an encoder-decoder transformer, a type of neural network that can use context gleaned from input data to learn associations that can then be translated into the model’s output. OpenAI presents this overview of Whisper’s operation:

    Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
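
    As a rough illustration of the first step in that overview, the sketch below splits a mono waveform into fixed 30-second chunks and pads the final chunk with silence. It is not OpenAI's code: the function name `split_into_chunks`, the 16 kHz sample rate, and the zero-padding of the last chunk are assumptions made for this example, and the spectrogram and encoder stages are not shown.

```python
from typing import List

SAMPLE_RATE = 16_000   # assumed sample rate in Hz; the article does not state one
CHUNK_SECONDS = 30     # chunk length given in OpenAI's overview


def split_into_chunks(samples: List[float],
                      sample_rate: int = SAMPLE_RATE,
                      chunk_seconds: int = CHUNK_SECONDS) -> List[List[float]]:
    """Split a mono waveform into equal-length chunks (hypothetical helper).

    The last chunk is zero-padded so that every chunk has the same length.
    """
    chunk_len = sample_rate * chunk_seconds
    chunks = []
    for start in range(0, len(samples), chunk_len):
        chunk = samples[start:start + chunk_len]
        # Pad a short final chunk with silence up to the full chunk length.
        chunk = chunk + [0.0] * (chunk_len - len(chunk))
        chunks.append(chunk)
    return chunks


if __name__ == "__main__":
    # 70 seconds of silence stands in for real audio.
    audio = [0.0] * (70 * SAMPLE_RATE)
    chunks = split_into_chunks(audio)
    print(len(chunks), "chunks of", len(chunks[0]), "samples each")
```

    In a full pipeline, each chunk would then be converted to a log-Mel spectrogram before reaching the encoder.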

    By open-sourcing Whisper, OpenAI hopes to introduce a new foundation model that others can build on in the future to improve speech processing and accessibility tools. OpenAI has a significant track record on this front. In January 2021, OpenAI released CLIP, an open source computer vision model that arguably ignited the recent era of rapidly progressing image synthesis technology such as DALL-E 2 and Stable Diffusion.

    At Ars Technica, we tested Whisper from code available on GitHub, and we fed it multiple samples, including a podcast episode and a particularly difficult-to-understand section of audio taken from a telephone interview. Although it took some time while running on a standard Intel desktop CPU (the technology doesn't work in real time yet), Whisper did a good job of transcribing the audio into text through the demonstration Python program, far better than some AI-powered audio transcription services we have tried in the past.

    Example console output from OpenAI's Whisper demonstration program as it transcribes a podcast.

    With the proper setup, Whisper could easily be used to transcribe interviews and podcasts, and potentially to translate podcasts produced in non-English languages into English, on your own machine and for free. That's a potent combination that might eventually disrupt the transcription industry.

    As with almost every major new AI model these days, Whisper brings positive advantages and the potential for misuse. On Whisper’s model card (under the “Broader Implications” section), OpenAI warns that Whisper could be used to automate surveillance or identify individual speakers in a conversation, but the company hopes it will be used “primarily for beneficial purposes.”