Skip to content

Latest commit

 

History

History
42 lines (34 loc) · 2.34 KB

README.md

File metadata and controls

42 lines (34 loc) · 2.34 KB

Yorùbá text

This repository contains fully diacritized Yorùbá text, converted to Unicode Normalization Form Composition (NFC) format, where diacritized characters are composed into a single character with the following code:

def convert_to_NFC(filename, outfilename):
    text=''.join(c for c in unicodedata.normalize('NFC', open(filename).read()))
    with open(outfilename, 'w') as f:
        f.write(text)

Web sources:

Social Media sources:

Text has been gathered with permission from online sources, and lightly preprocessed for use in NLP, TTS, ASR applications. Note, some of the sentences may have errors, please submit a pull-request if you have corrections!

Resources