As the title says, we have a problem (for Italian).

Right now the majority of the datasets come from the academic world; they don’t have any license, but they require a citation of the paper.

So for the Italian model https://github.com/MozillaItalia/DeepSpeech-Italian-Model/ we are avoiding them because we don’t know how to deal with them.

The post at https://hacks.mozilla.org/2019/12/deepspeech-0-6-mozillas-speech-to-text-engine/ mentions two academic datasets that have this issue: no license, but citation required.

So my question is: can we use them and release a public domain model? Or do we need to state that we are using them, and do the users of the model need to do the same?

We have the same problem for audio+text and text-only datasets, and also with using CC-licensed material (including non-commercial) to generate a model.

I also started a discussion in Italian on Reddit https://www.reddit.com/r/ItalyInformatica/comments/e6ffyg/licenze_open_source_e_paper_accademici/ to better understand the problem.

If we can use this material and license the model as public domain, even though we generate it from resources that come from different sources with different licenses, that will change our project, because we will no longer have any limits.

The point we raised is whether using material licensed in one way, and then releasing something derived from that material (or maybe from just a part of it), creates an issue for the whole project.

Mozilla's legal team can probably help us understand this, including the issue that every country has different regulations…