Common Voice is a great resource for natural language processing enthusiasts. Today I ran it through Goruut 0.4.0, which uncovered several outliers.
Language | Outlier |
---|---|
ar | Meaning: Someone in a difficult situation will do anything to get out of it. |
bn | Google Play |
ca | " |
es | וְתִסְמְכֵנוּ לְשָלוֹם. |
fa | amir šāyān |
ja | a |
ja | A |
ja | fgtyht |
ja | gfvrv |
ja | Give us back our time. |
ja | green |
ja | hello |
ja | I don’t know |
ja | If you love you, please love me. |
ja | I saw him today |
ja | I wanna fly so high Yeah, I know my wings are dried |
ja | jdjdjd |
ja | ksskskksksks |
ja | tyyjnybt |
ja | v |
ja | Windows |
ja | X軸Y軸 |
ja | You must be teasing us. |
ja | 歴史的修正主義者De Ste。 |
ka | ???????? ???????? ????? ?????? ??????. |
mn | ?????? |
ru | ?????? ???? ?????, ??? ????????? ???? ??? ?????? |
ru | Firefox |
sr | Sreća je u malim stvarima |
ur | ﺍﺱ ﭘﺮ ﺟﻮﺍﮨﺮ ﻻﻝ ﻧﮩﺮﻭ ﻧﮯ ﻣﺴﮑﺮﺍ ﮐﺮ ﮐﮩﺎ ﺗﮭﺎ |
ur | ﭨﮭﻮﻧﺲ ﺩﯾﺘﮯ ﺗﮭﮯ |
ur | ﮈﺍﮐﯿﮧ ﭘﮭﺮ ﭘﺮﯾﺸﺎﻥ ﮨﻮ ﮔﯿﺎ |
ur | ﮐﯿﺌﺮﭨﯿﮑﺮ ﺣﮑﻮﻣﺖ ﺑﮭﯽ ﻗﻮﻣﯽ ﺣﮑﻮﻣﺖ ﺑﻦ ﺳﮑﺘﯽ ﮨﮯ |
ur | ﻟﻮﮔﻮﮞ ﻧﮯ ﺣﯿﺮﺕ ﺳﮯ ﭘﻮﭼﮭﺎ |
A number of these are English texts in non-English datasets. I have no idea what a Hebrew phrase is doing in the Spanish dataset. The word “Firefox” should probably be in Cyrillic, and the same goes for “Sreća je u malim stvarima”. As for the Urdu outliers, I have no idea why they were flagged, but I am listing them just in case.
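For anyone who wants to triage rows like these without Goruut, here is a minimal Python sketch of a script-mismatch check. It is not how Goruut detects outliers. The `EXPECTED_SCRIPTS` mapping and both function names are hypothetical, and the script is guessed from the first word of each character's Unicode name, which is only a rough heuristic. The second helper tests one guess about the Urdu rows: the characters look like Arabic presentation forms (U+FB50 to U+FDFF and U+FE70 to U+FEFC) rather than the base Arabic letters, which could explain why they stood out. I have not confirmed that this is the cause.

```python
import unicodedata

# Hypothetical mapping from language code to acceptable script-name prefixes
# (the first word of the Unicode character name). Not taken from Goruut.
EXPECTED_SCRIPTS = {
    "es": {"LATIN"},
    "ca": {"LATIN"},
    "ru": {"CYRILLIC"},
    "sr": {"CYRILLIC"},
    "ja": {"CJK", "HIRAGANA", "KATAKANA", "KATAKANA-HIRAGANA"},
}


def script_of(ch):
    """Return the first word of a letter's Unicode name, or None for non-letters."""
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def is_script_outlier(lang, sentence):
    """True when the sentence has no letter in any script expected for lang.

    Sentences with no letters at all (a lone quote, a run of '?') are
    therefore flagged too. Unknown language codes are never flagged.
    """
    expected = EXPECTED_SCRIPTS.get(lang)
    if expected is None:
        return False
    scripts = {s for s in map(script_of, sentence) if s}
    return not (scripts & expected)


def uses_presentation_forms(sentence):
    """True if any character lies in the Arabic Presentation Forms A or B ranges."""
    return any(
        0xFB50 <= ord(ch) <= 0xFDFF or 0xFE70 <= ord(ch) <= 0xFEFC
        for ch in sentence
    )


if __name__ == "__main__":
    print(is_script_outlier("ru", "Firefox"))
    print(is_script_outlier("ja", "X軸Y軸"))
    print(uses_presentation_forms("\ufe8d\ufeb1"))
```

Mixed-script sentences such as “X軸Y軸” pass this check because at least one letter is in an expected script, so it is looser than whatever Goruut flagged. If the presentation-form guess holds, `unicodedata.normalize("NFKC", sentence)` maps those characters back to the base Arabic letters.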