While relatively underexplored in the speech domain, a growing body of research demonstrates the potential of learning from unlabeled data. Microsoft is using unsupervised learning to extract knowledge about disruptions to its cloud services. More recently, Facebook itself announced SEER, an unsupervised model trained on a billion images that achieves state-of-the-art results on a range of computer vision benchmarks.

Wav2vec-U learns purely from recorded speech and text, eliminating the need for transcriptions. Using a self-supervised model from Facebook's wav2vec 2.0 framework together with a clustering method, Wav2vec-U segments recordings into units that loosely correspond to particular sounds.

To learn to recognize the words in a recording, Facebook trained a generative adversarial network (GAN) consisting of a generator and a discriminator. The generator takes audio segments and predicts a phoneme (i.e., a unit of sound) corresponding to each sound in the language. It's trained by trying to fool the discriminator, which assesses whether the predicted sequences seem realistic. The discriminator, in turn, learns to distinguish the generator's speech recognition output from real text drawn from sources that were "phonemized." While the GAN's transcriptions are initially poor in quality, they improve with the feedback of the discriminator.

"It takes about half a day - roughly 12 to 15 hours on a single GPU - to train an average Wav2vec-U model. This excludes self-supervised pre-training of the model, but we previously made these models publicly available for others to use," Facebook AI research scientist manager Michael Auli told VentureBeat via email. "Half a day on a single GPU is not very much, and this makes the technology accessible to a wider audience to build speech technology for many more languages of the world."

To get a sense of how well Wav2vec-U works in practice, Facebook says it evaluated the model first on a benchmark called TIMIT. Trained on as little as 9.6 hours of speech and 3,000 sentences of text data, Wav2vec-U reduced the error rate by 63% compared with the next-best unsupervised method. Wav2vec-U was also as accurate as the state-of-the-art supervised speech recognition methods of only a few years ago, which were trained on hundreds of hours of speech data.

Future work

AI has a well-known bias problem, and unsupervised learning doesn't eliminate the potential for bias in a system's predictions. For example, unsupervised computer vision systems can pick up racial and gender stereotypes present in training datasets. Facebook acknowledges that more research must be done to figure out the best way to address bias. Some experts, including Facebook chief scientist Yann LeCun, theorize that removing these biases might require specialized training of unsupervised models with additional, smaller datasets curated to "unteach" specific biases.

"Our focus was on developing a method to remove the need for supervision," Auli said. "We have not yet investigated potential biases in the model."
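The clustering step that Wav2vec-U relies on, grouping frame-level speech representations into units that loosely correspond to sounds, can be sketched in plain Python. This is an illustrative toy rather than Facebook's implementation: the feature vectors below are random stand-ins for wav2vec 2.0 representations, and the function names are hypothetical.

```python
import random

def kmeans(frames, k, iters=20, seed=0):
    """Plain k-means over feature vectors; returns one cluster ID per frame."""
    rng = random.Random(seed)
    centers = rng.sample(frames, k)
    labels = [0] * len(frames)
    for _ in range(iters):
        # Assign each frame to its nearest center (squared Euclidean distance).
        for i, f in enumerate(frames):
            labels[i] = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(f, centers[j])),
            )
        # Move each center to the mean of its assigned frames.
        for j in range(k):
            members = [frames[i] for i in range(len(frames)) if labels[i] == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def merge_runs(labels):
    """Collapse consecutive identical cluster IDs into (id, length) segments."""
    segments = []
    for lab in labels:
        if segments and segments[-1][0] == lab:
            segments[-1] = (lab, segments[-1][1] + 1)
        else:
            segments.append((lab, 1))
    return segments

# Stand-in for frame-level wav2vec 2.0 features: 200 frames of 8-dim vectors.
rng = random.Random(1)
frames = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
segments = merge_runs(kmeans(frames, k=8))
```

Each resulting segment plays the role of a candidate sound unit; in the real system, the generator then maps such segments to phoneme predictions for the discriminator to judge.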