Mistakes in Common Voice 19

Neurlang

2025/03/02

Common Voice is a great resource for natural language processing enthusiasts. Today I ran it through Goruut 0.4.0, which uncovered several outliers.

Language Outlier
ar Meaning: Someone in a difficult situation will do anything to get out of it.
bn Google Play
ca "
es וְתִסְמְכֵנוּ לְשָלוֹם.
fa amir šāyān
ja a
ja A
ja fgtyht
ja gfvrv
ja Give us back our time.
ja green
ja hello
ja I don’t know
ja If you love you, please love me.
ja I saw him today
ja I wanna fly so high Yeah, I know my wings are dried
ja jdjdjd
ja ksskskksksks
ja tyyjnybt
ja
ja Windows
ja X軸Y軸
ja You must be teasing us.
ja 歴史的修正主義者De Ste。
ka ???????? ???????? ????? ?????? ??????.
mn ??????
ru ?????? ???? ?????, ??? ????????? ???? ??? ??????
ru Firefox
sr Sreća je u malim stvarima
ur ﺍﺱ ﭘﺮ ﺟﻮﺍﮨﺮ ﻻﻝ ﻧﮩﺮﻭ ﻧﮯ ﻣﺴﮑﺮﺍ ﮐﺮ ﮐﮩﺎ ﺗﮭﺎ
ur ﭨﮭﻮﻧﺲ ﺩﯾﺘﮯ ﺗﮭﮯ
ur ﮈﺍﮐﯿﮧ ﭘﮭﺮ ﭘﺮﯾﺸﺎﻥ ﮨﻮ ﮔﯿﺎ
ur ﮐﯿﺌﺮﭨﯿﮑﺮ ﺣﮑﻮﻣﺖ ﺑﮭﯽ ﻗﻮﻣﯽ ﺣﮑﻮﻣﺖ ﺑﻦ ﺳﮑﺘﯽ ﮨﮯ
ur ﻟﻮﮔﻮﮞ ﻧﮯ ﺣﯿﺮﺕ ﺳﮯ ﭘﻮﭼﮭﺎ

A number of English texts appear in non-English datasets. I have no idea what a Hebrew phrase is doing in the Spanish dataset. The word “Firefox” should probably be in Cyrillic, and the same goes for “Sreća je u malim stvarima”. As for the Urdu outliers, I have no idea why they were flagged, but I am listing them just in case.
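For readers who want to look for similar outliers themselves, here is a minimal sketch of a script check in Python. It is not how Goruut finds outliers; it only compares the Unicode script of each letter with the scripts one would expect for a language. The EXPECTED_SCRIPTS table and the function names are assumptions made up for this illustration, and the first word of a character's Unicode name is only a rough stand-in for its script. The second helper tests for Arabic presentation-form code points. The Urdu lines above appear to use those instead of the base Arabic letters, which might be why they stood out, but that is a guess.

```python
import unicodedata
from collections import Counter

# Hypothetical table: locale code -> scripts we expect to see in its sentences.
EXPECTED_SCRIPTS = {
    "es": {"LATIN"},
    "ja": {"CJK", "HIRAGANA", "KATAKANA", "KATAKANA-HIRAGANA"},
    "ru": {"CYRILLIC"},
    "sr": {"CYRILLIC"},
    "ur": {"ARABIC"},
}


def scripts_of(text):
    """Count letters per script, using the first word of the Unicode
    character name (e.g. "LATIN", "HEBREW") as a rough script label."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


def is_script_outlier(lang, text):
    """Return True if the text contains a letter from an unexpected script.
    Languages missing from EXPECTED_SCRIPTS are never flagged."""
    expected = EXPECTED_SCRIPTS.get(lang)
    if expected is None:
        return False
    return any(script not in expected for script in scripts_of(text))


def has_presentation_forms(text):
    """Return True if the text contains code points from Arabic
    Presentation Forms-A (U+FB50..U+FDFF) or Forms-B (U+FE70..U+FEFF)."""
    return any(
        0xFB50 <= ord(ch) <= 0xFDFF or 0xFE70 <= ord(ch) <= 0xFEFF
        for ch in text
    )


if __name__ == "__main__":
    samples = [
        ("ru", "Firefox"),
        ("sr", "Sreća je u malim stvarima"),
        ("ja", "X軸Y軸"),
    ]
    for lang, sentence in samples:
        print(lang, sentence, is_script_outlier(lang, sentence))
```

A check like this flags mixed-script sentences such as “X軸Y軸” as well, so its output still needs a manual look before anything is reported as a mistake.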