Small curated input dataset for continuous integration github actions workflow #3

Open
kopardev opened this issue Nov 7, 2020 · 2 comments
Labels
ci (Github actions or CI related task), enhancement (New feature or request)

Comments

@kopardev
Collaborator

kopardev commented Nov 7, 2020

@skchronicles said

@kopardev
Yes, I will look into it.

I was also thinking we should create a custom set of references for the ci workflow.

Here is what I am thinking:

  1. Find a dataset with a differentially expressed (DE) gene
    The reads for the DE gene should all be uniquely mapped (reads mapping to only one location). This is so we can spike this gene into a pre-computed counts matrix later on.
    Optional: Differential expression is validated through a secondary method
  2. Extract these uniquely mapped reads for said DE gene to create the following:
    Sub-sampled fastq files for testing purposes
    Custom reference files (with a custom ref.fa and genes.gtf)
    The ref.fa should only contain the sequence for the gene of interest (you can pad it with +/- 10 kb). The GTF file will have to be modified to match the new ref.fa, and it should only contain our gene of interest.
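A rough sketch of how step 2's custom reference could be derived, assuming a standard 9-column GTF with `gene_id "..."` attributes and 1-based inclusive coordinates. The function names (`gene_window`, `rebase_gtf`, `slice_sequence`) and the idea of renaming the contig are hypothetical, not something the pipeline already provides:

```python
def _is_gene_record(fields, gene_id):
    # A record belongs to the gene if its attribute column carries the gene_id.
    return len(fields) >= 9 and f'gene_id "{gene_id}"' in fields[8]


def gene_window(gtf_lines, gene_id, pad=10_000):
    """Return (chrom, start, end) of the gene span padded by `pad` bases."""
    chrom, lo, hi = None, None, None
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if not _is_gene_record(fields, gene_id):
            continue
        start, end = int(fields[3]), int(fields[4])
        chrom = fields[0]
        lo = start if lo is None else min(lo, start)
        hi = end if hi is None else max(hi, end)
    if chrom is None:
        raise ValueError(f"gene_id {gene_id!r} not found in GTF")
    # Clamp the left edge at position 1; the right edge is clipped when slicing.
    return chrom, max(1, lo - pad), hi + pad


def rebase_gtf(gtf_lines, gene_id, window_start, new_name):
    """Keep only records for `gene_id`, shifted so `window_start` becomes 1."""
    offset = window_start - 1
    out = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if not _is_gene_record(fields, gene_id):
            continue
        fields[0] = new_name
        fields[3] = str(int(fields[3]) - offset)
        fields[4] = str(int(fields[4]) - offset)
        out.append("\t".join(fields))
    return out


def slice_sequence(chrom_seq, start, end):
    """Extract a 1-based inclusive window from a chromosome sequence string."""
    return chrom_seq[start - 1:end]
```

The window would be computed first, then used both to cut the sequence for the new ref.fa and to shift the GTF records, so the two files stay consistent with each other.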

Do you have some time to look into this more?

@kopardev
Collaborator Author

kopardev commented Nov 7, 2020

Yes, I can look into this.
My 2 cents: do we really need to create a custom ref.fa and custom GTF file? If the uniquely aligning reads are preselected for the said gene locus, they should only align there even with the full ref.fa/GTF files. Are you thinking that using the full ref.fa/GTF is somehow going to be restrained due to the limited compute resources available for CI via github actions? If this is indeed the case, then I agree we should create stripped-down versions of the ref.fa and genes.gtf files.

@skchronicles
Owner

@kopardev

Are you thinking that using the full ref.fa/GTF is somehow going to be restrained due to the limited compute resources available for CI via github actions? If this is indeed the case, then I agree we should create stripped-down versions of the ref.fa and genes.gtf files.

Yes, these free-tier VMs are pretty low spec:
[image]

This will also ensure that the workflow runs as efficiently as possible, but I see what you're saying. This may be overkill to a certain extent. We could start off by just limiting the ref.fa and genes.gtf to the chromosome of interest.
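Limiting both files to one chromosome could look roughly like the sketch below. It assumes the FASTA header name (the first whitespace-delimited token after `>`) matches the first GTF column exactly; `filter_fasta` and `filter_gtf` are hypothetical helper names:

```python
def filter_fasta(lines, chrom):
    """Yield only the lines of the FASTA record named `chrom`."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # The record name is the first token after '>'.
            parts = line[1:].split()
            keep = bool(parts) and parts[0] == chrom
        if keep:
            yield line


def filter_gtf(lines, chrom):
    """Yield comment lines plus records whose first column equals `chrom`."""
    for line in lines:
        if line.startswith("#") or line.split("\t", 1)[0] == chrom:
            yield line
```

Both helpers are generators over lines, so they can be applied to open file handles without loading the full reference into memory, which matters on low-spec runners.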

@skchronicles skchronicles added the enhancement New feature or request label Nov 18, 2020
@skchronicles skchronicles added the ci Github actions or CI related task label Mar 25, 2021