Random sample of large set of search results #194

Open

hlappen opened this issue Jul 6, 2024 · 1 comment

hlappen commented Jul 6, 2024

I'm working on a project where I want to extract a few pieces of information from the XML of PubMed records. I've managed to do this on a smaller scale (~1k records).

The problem is that I need to do this on much larger sets of records. My searches return about 600k results, and I'd like to take a random sample of those (at least 15%) to extract from, since the full set might be overkill. The trouble I'm running into is getting the full list of result ids from which to generate that random sample. I can get a list of the ids up to the retmax limit of 10k, but I can't figure out how to get the rest of them.
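
For illustration, paging the search itself with retstart looks like the sketch below, but as far as I can tell PubMed's esearch only serves ids within the first 10,000 records of a result set, so paging the search alone can't reach the rest of 600k.

library(rentrez)

# Sketch: page through search results with retstart. Assumptions:
# rentrez forwards retstart/retmax to NCBI's esearch, and (for PubMed)
# esearch caps retrievable records at the first 10,000.
search_query <- "your big query here"  # placeholder term
page2 <- entrez_search(db = "pubmed", term = search_query,
                       retstart = 5000, retmax = 5000)
page2$ids  # ids 5001-10000 of the result set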

I know I can use fetch on the web history and iterate through the set in batches, but that would mean downloading the whole XML record of all 600k articles just to get the PMIDs, at which point I might as well extract all the info I want from every record. I've also considered breaking the search up into smaller result sets, but the best I can do is limit by year, and those are still about 50k-80k each.
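
To make that batch idea concrete, the loop would look roughly like the sketch below. One thing I'm unsure of: efetch documents a rettype = "uilist" that should return bare PMIDs (one per line) instead of full XML, which would keep the download small; I'm also assuming entrez_fetch passes rettype, retstart, and retmax straight through to efetch.

library(rentrez)

# Sketch: res is any entrez_search(..., use_history = TRUE) result.
# rettype = "uilist" should return one PMID per line rather than
# full XML (assumption: entrez_fetch forwards retstart/retmax).
batch_size <- 10000
all_ids <- character(0)
for (start in seq(0, res$count - 1, by = batch_size)) {
  chunk <- entrez_fetch(db = "pubmed",
                        web_history = res$web_history,
                        rettype = "uilist", retmode = "text",
                        retstart = start, retmax = batch_size)
  all_ids <- c(all_ids, strsplit(chunk, "\n")[[1]])
  Sys.sleep(0.4)  # stay under NCBI's request rate limit
}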

As a simple (and small) example,

library(rentrez)

search_query <- "clowns AND hospital AND randomized"
clowns <- entrez_search(db = "pubmed", term = search_query, use_history = TRUE)

This search finds 37 results, all stored in the web history, but returns only 20 ids because of the default retmax. Now, I know I can increase that limit for this example in order to get all 37, but I can't for the larger searches I'm doing. So, for the sake of the example, let's assume that's the limit.

I can get the 20 ids using clowns$ids, but is there a way to efficiently get a list of all 37 ids from the web history, so I could do something like sample_ids <- sample(clowns$ids, 30)?
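
If that worked, the sampling step itself is the easy part; e.g., reusing the all_ids vector from the batch sketch above:

set.seed(1)  # make the sample reproducible
sample_ids <- sample(all_ids, size = ceiling(0.15 * length(all_ids)))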

Probably not important, but I'm grabbing the PMID, indexing method, last-modified date, and MeSH terms from each record and putting them into a data.frame.
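
In case it's useful context, that extraction can be sketched like this with xml2. The element and attribute paths (PMID, MedlineCitation/@IndexingMethod, DateRevised, MeshHeadingList) are my reading of the PubMed XML DTD, so worth double-checking.

library(rentrez)
library(xml2)

# Sketch: fetch a batch of full records and flatten the fields of
# interest into a data.frame. XPath expressions are assumptions
# based on the PubMed XML DTD.
recs  <- entrez_fetch(db = "pubmed", id = sample_ids[1:100],  # first 100 for illustration
                      rettype = "xml")
doc   <- read_xml(recs)
cites <- xml_find_all(doc, "//MedlineCitation")

records <- data.frame(
  pmid     = xml_text(xml_find_first(cites, "./PMID")),
  indexing = xml_attr(cites, "IndexingMethod"),
  revised  = vapply(cites, function(cit) {
    paste(xml_text(xml_find_all(cit, "./DateRevised/*")), collapse = "-")
  }, character(1)),
  mesh     = vapply(cites, function(cit) {
    paste(xml_text(xml_find_all(cit, ".//MeshHeading/DescriptorName")),
          collapse = "; ")
  }, character(1)),
  stringsAsFactors = FALSE
)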

@allenbaron

rentrez is a wrapper around the Entrez Utilities published by NCBI and covers most of the functionality of those tools. It would probably be best to ask NCBI directly. If you do, please consider posting their response here for others. Questions like this are fairly common in rentrez's issues, and it would be helpful for others to see this information.
