Scrape sentences from wikidata #92

stefangrotz · 2020-02-17T18:48:09Z

Wikidata is completely under CC0, which makes it very attractive for the project. It contains both sentences and sometimes audio, but for this issue I want to focus on sentences.

This issue is a work in progress. I want to collect possible sources for sentences in Wikidata:

P5831 (usage example): an example sentence for a word, often with the language added in brackets.
A "Description" exists in many languages for many Wikidata items, but it isn't always a complete sentence.

The next step would be to write a script to scrape these sentences.
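A minimal sketch of what such a script could look like, assuming the public Wikidata Query Service SPARQL endpoint and assuming that P5831 values are language-tagged text. The function names (`build_query`, `parse_bindings`, `fetch_sentences`) are made up for illustration, and the query would likely need tuning before real use.

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(language=None, limit=100):
    """Build a SPARQL query for P5831 (usage example) values.

    Assumption: P5831 values are language-tagged literals, so LANG() works.
    """
    lines = [
        "SELECT ?lexeme ?sentence WHERE {",
        "  ?lexeme wdt:P5831 ?sentence .",
    ]
    if language is not None:
        lines.append(f'  FILTER(LANG(?sentence) = "{language}")')
    lines.append("}")
    lines.append(f"LIMIT {int(limit)}")
    return "\n".join(lines)


def parse_bindings(result):
    """Turn a SPARQL JSON result into (language, sentence) pairs."""
    pairs = []
    for row in result["results"]["bindings"]:
        cell = row["sentence"]
        pairs.append((cell.get("xml:lang", ""), cell["value"].strip()))
    return pairs


def fetch_sentences(language=None, limit=100):
    """Run the query over HTTP (hypothetical helper, needs network access)."""
    params = urllib.parse.urlencode(
        {"query": build_query(language, limit), "format": "json"}
    )
    request = urllib.request.Request(
        f"{ENDPOINT}?{params}",
        headers={"User-Agent": "sentence-scraper-sketch/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        return parse_bindings(json.load(response))
```

Something like `fetch_sentences("de", limit=50)` would then return a list of `(language, sentence)` pairs that could be written to a text file for the existing scraper rules to process.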

MichaelKohler · 2020-02-17T19:42:25Z

Looks like these are indeed CC0. I don't think we need to ask legal for this. @nukeador do you agree?

Would love to see a selection of these sentences. Also, I assume you are aware of the scraper capabilities for other resources? As long as we can get it into a parseable state, it then can directly be integrated in the scraper to use the rules and everything. More details in the last part of the README. Also happy to explain further if needed.

stefangrotz · 2020-02-17T20:10:25Z

> As long as we can get it into a parseable state, it then can directly be integrated in the scraper to use the rules and everything.

This was exactly what I was thinking. Right now the example sentences for a datatype called "lexemes" are relatively new. They have existed since 2018. But they are planning to move all Wiktionary data into Wikidata, so we will likely have more sentences in the future.

Wikidata is huge, I am sure that there are more data types that contain sentences.

> Would love to see a selection of these sentences.

I always wanted to learn Wikidata queries, and this is a nice little project to finally do it. I will post some examples tomorrow or so.

nukeador · 2020-02-18T11:30:11Z

Note that only these 4 namespaces are CC0:

> All structured data from the main, Property, Lexeme, and EntitySchema namespaces is available under the Creative Commons CC0 License; text in the other namespaces is available under the Creative Commons Attribution-ShareAlike License;

Do we have data on how many sentences we have for each language?

Adrijaned · 2020-04-07T07:16:27Z

I've already suggested using P5831 earlier in the sentence-collector project (common-voice/sentence-collector#260), but, as per this query, there are currently only about 4000 sentences in P5831, some of which are probably repetitions. (After uncommenting the first line of the query, you should be able to filter sentences by language using the query helper, accessible by clicking the (i) on the left sidebar.)

All of those should be in the Lexeme namespace, so license-wise there should be no issue.
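To get a rough answer to the per-language question and to see how many repetitions there are, the extracted sentences could be deduplicated and tallied. A small sketch, assuming the sentences are already available as `(language, sentence)` pairs; the helper names and the sample data are made up for illustration.

```python
from collections import Counter


def dedupe(pairs):
    """Drop repeated (language, sentence) pairs, keeping first occurrences in order."""
    seen = set()
    unique = []
    for language, sentence in pairs:
        key = (language, sentence.strip())
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def count_by_language(pairs):
    """Count sentences per language tag."""
    return Counter(language for language, _ in pairs)


# Made-up sample data, only for illustration.
sample = [
    ("en", "The cat sleeps."),
    ("en", "The cat sleeps."),
    ("cs", "Kočka spí."),
]
counts = count_by_language(dedupe(sample))
print(counts["en"], counts["cs"])  # 1 1
```

Comparing `len(pairs)` with `len(dedupe(pairs))` would show how much of the roughly 4000 sentences is actually unique.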

stefangrotz changed the title ~~scrape sentences from wikidata~~ scrap sentences from wikidata Feb 17, 2020

MichaelKohler assigned MichaelKohler and stefangrotz and unassigned MichaelKohler Feb 17, 2020

MichaelKohler added the enhancement, extract-improvements, and tooling labels Feb 17, 2020

MichaelKohler mentioned this issue May 3, 2020

Research: ability/feasibility of sentence import from WikiData common-voice/sentence-collector#260

Closed

MichaelKohler changed the title ~~scrap sentences from wikidata~~ Scrape sentences from wikidata Oct 9, 2021

CapitainFlam mentioned this issue Sep 13, 2022

[WIP] additionnal lib/cleanup for French language to improve quality of inputs common-voice/sentence-collector#635

Closed
