Corpus Loading Features #12

Open
Olamyy opened this issue Jul 13, 2019 · 0 comments
Assignees
Labels
enhancement New feature or request

Comments

Contributor

Olamyy commented Jul 13, 2019

I had some free time this week and was able to write down some features I'm hoping we can include. These are:

  1. A class for handling all forms of scraping. The API for this feature can be an interface that other scrapers are built on. We can leverage either bs4 or scrapy. I'm thinking something like:
import scrapy


class BaseScrapper(scrapy.Spider):
    def __init__(self, name, urls, **kwargs):
        super(BaseScrapper, self).__init__(name, **kwargs)
        # Scrapy starts crawling from start_urls.
        self.start_urls = list(urls)

    def parse_urls(self):
        # Do something to the URLs before starting
        pass

    def parse(self, response):
        # Crawling logic
        pass

Then a scraper like the Bibeli scraper can build on this class:

class BibeliScrapper(BaseScrapper):
    # Logic goes here
    pass

The major advantage here is reusability: anyone can build their own Yoruba scraper with a minimal amount of work.

  2. Corpus class and DirectoryCorpus class (inspired by gensim)
    This would be a class for loading Yoruba corpora in various formats through a single API. It should support:
  • Streaming files
  • Reading various file formats: txt, gzip, csv
  • Validating a file format. For example, if a user loads an Owe file, it should be able to validate that the content of the file conforms to that format.
  • Preprocessing while reading.
  • Generating random text
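
The streaming and multi-format points above could be covered by one generator that picks an opener based on the format. This is only a minimal sketch, not the code from the commit: `stream_lines` and the accepted `fformat` values are hypothetical names chosen for illustration, and only the standard library is used.

```python
import csv
import gzip
from typing import Iterator


def stream_lines(path: str, fformat: str = "txt", encoding: str = "utf-8") -> Iterator[str]:
    """Yield one record at a time so a large corpus never sits fully in memory.

    Hypothetical helper: the name and the accepted `fformat` values are
    assumptions made for this sketch.
    """
    if fformat == "txt":
        with open(path, encoding=encoding) as fobj:
            for line in fobj:
                yield line.rstrip("\n")
    elif fformat == "gzip":
        # "rt" decodes the compressed bytes to text on the fly.
        with gzip.open(path, "rt", encoding=encoding) as fobj:
            for line in fobj:
                yield line.rstrip("\n")
    elif fformat == "csv":
        # newline="" is what the csv module expects for file objects.
        with open(path, encoding=encoding, newline="") as fobj:
            for row in csv.reader(fobj):
                # Cells are joined into one string; a real loader may keep columns.
                yield " ".join(row)
    else:
        raise ValueError("Unsupported file format: {}".format(fformat))
```

Because this is a generator, nothing is read until the caller iterates, so an unsupported format is only reported on the first `next()` call. A real `Corpus` might prefer to check the format eagerly in `__init__`.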

A commit for this is available here
The interface is described below:

class Corpus(interfaces.CorpusABC):
    def __init__(self, path=None, text=None, stream=False, fformat='txt', cformat=None, labels=False, preprocess=None):
        """

        Args:
            path:
            text:
        """
        self.path = path
        self.text = text
        self.labels = labels
        self.stream = stream
        self.fformat = fformat
        self.cformat = cformat
        self.preprocess = preprocess
        if not self.preprocess:
            self.preprocess = [normalize_diacritics_text]
        self.data = self.read_file_filename_or_text(text=text) if text else self.read_file_filename_or_text()
        self.validate_format()

    def __iter__(self):
        for line in self.data:
            yield line

    def __len__(self):
        return len(self.data)

    @staticmethod
    def save_corpus(fname, corpus, id2word=None, metadata=False):
        pass

    def streamfile(self, fobj):
        pass

    def read_file_filename_or_text(self, f=None, text=None):
        """

        Returns:

        """
        pass

    def handle_preprocessing(self, text):
        if callable(self.preprocess):
            return self.preprocess(text)
        if isinstance(self.preprocess, list):
            for technique in self.preprocess:
                text = technique(text)
            return text

    def validate_format(self):
        """

        Returns:

        """


    def generate(self, size):
        """

        Args:
            size:

        Returns:

        """
        if not self.cformat:
            raise ValueError("You need to specify a format for generating random text")


import os


class DirectoryCorpus(Corpus):
    def __init__(self, path, **kwargs):
        self.path_dir = path
        # os.walk yields (dirpath, dirnames, filenames) tuples; only the top level is used.
        walked = list(os.walk(self.path_dir))
        self.depth = walked[0][0]
        self.dirnames = walked[0][1]
        self.flist = walked[0][2]
        self.path = list(self.read_files())
        super(DirectoryCorpus, self).__init__(path=self.path, **kwargs)

    def read_files(self):
        for path in self.flist:
            yield os.path.join(self.path_dir, path)
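
To illustrate the "preprocessing while reading" idea behind `handle_preprocessing`, here is a small standalone sketch. It is an assumption-laden illustration rather than the proposed implementation: `apply_preprocess`, `preprocessed` and `nfc_normalize` are hypothetical names, and `nfc_normalize` is only a stand-in for `normalize_diacritics_text` (it composes combining marks via Unicode NFC and does nothing Yoruba-specific).

```python
import unicodedata
from typing import Callable, Iterable, Iterator, List, Union

Step = Callable[[str], str]


def nfc_normalize(text: str) -> str:
    # Stand-in for normalize_diacritics_text: composes combining marks (NFC) only.
    return unicodedata.normalize("NFC", text)


def apply_preprocess(text: str, preprocess: Union[Step, List[Step], None]) -> str:
    """Run a single callable, or an ordered list of callables, over `text`."""
    if preprocess is None:
        return text
    if callable(preprocess):
        return preprocess(text)
    for step in preprocess:
        text = step(text)
    return text


def preprocessed(lines: Iterable[str], preprocess=None) -> Iterator[str]:
    # Lazy: each line is transformed only when the consumer asks for it.
    for line in lines:
        yield apply_preprocess(line, preprocess)
```

One difference from the method above: this sketch returns the text unchanged when no step is configured, instead of falling through and returning `None`.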
  3. Loaders: These would be responsible for loading the corpora made available by iranlowo. They should return a Corpus object.
class OweLoader(DirectoryCorpus):
    def __init__(self, path, **kwargs):
        # Delegate to DirectoryCorpus, which sets self.path from the directory listing.
        super(OweLoader, self).__init__(path=path, **kwargs)

One downside I can imagine is that these features might make the project bloated, but I think the benefits would outweigh that.

@Olamyy Olamyy changed the title Possible features Corpus Loading Features Jul 13, 2019
@ruohoruotsi ruohoruotsi assigned ruohoruotsi and Olamyy and unassigned ruohoruotsi Jul 13, 2019
@ruohoruotsi ruohoruotsi added the enhancement New feature or request label Jul 13, 2019
@Olamyy Olamyy mentioned this issue Jul 15, 2019

2 participants