Get PubMed metadata from a large number of articles #180

Open
LauraVP1994 opened this issue Jul 12, 2022 · 3 comments
@LauraVP1994

I have a few questions regarding this tool. I have a query that results in 124,575 hits with:

```r
pubmed_search <- entrez_search(db = "pubmed", term = new_query, use_history = TRUE)
```

However, this count differs from the one I get when I run the same query on the PubMed website.

Secondly, I was wondering how I can retrieve the metadata of all these papers (first author, PMID, DOI, ...) and put the results in a data table.

Thanks for the help!

@allenbaron

> However, this count differs from the one I get when I run the same query on the PubMed website.

I recall the Entrez documentation saying something about this. I don't think the website and the API use the same search algorithm, but I'm not sure.

> Secondly, I was wondering how I can retrieve the metadata of all these papers (first author, PMID, DOI, ...) and put the results in a data table.

This is going to be a big job, so you should review NCBI's E-utilities documentation on rate limits and the best times of day to run large requests. You can get the information with entrez_summary(), but that currently fails with the version of rentrez on CRAN because of a bug; see #178 for details.
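Once that bug is fixed (or with the development version of the package), the shape of the solution would be roughly the following. This is a hedged sketch, not a tested recipe: the batch size is arbitrary, `retmax`/`retstart` are passed through to the API, and the esummary fields `uid`, `title`, `sortfirstauthor`, and `elocationid` (which usually carries the DOI) are assumptions about which fields you want.

```r
library(rentrez)

# Page through the esummary records attached to the stored web history.
# retstart is zero-based in the E-utilities API.
batch_size <- 100
starts <- seq(0, pubmed_search$count - 1, batch_size)

summary_df_list <- lapply(starts, function(start) {
    recs <- entrez_summary(db = "pubmed",
                           web_history = pubmed_search$web_history,
                           retmax = batch_size, retstart = start,
                           always_return_list = TRUE)  # safe for 1-record batches
    data.frame(
        PMID        = extract_from_esummary(recs, "uid"),
        Title       = extract_from_esummary(recs, "title"),
        FirstAuthor = extract_from_esummary(recs, "sortfirstauthor"),
        DOI         = extract_from_esummary(recs, "elocationid"),
        stringsAsFactors = FALSE, row.names = NULL
    )
})
summary_df <- do.call(rbind, summary_df_list)
```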

@LauraVP1994

LauraVP1994 commented Jul 15, 2022

Thank you for the help. I found a workaround for now, since both my query and the amount of data were too large. I solved the query-length problem by splitting the query into three parts and combining them on the history server, and the data-volume problem with an lapply over batches (a for loop would of course work too). There was also a bug in the PMID retrieval: an unanchored XPath like `.//ArticleId[@IdType='pubmed']` also picks up PMIDs cited in the papers' reference lists. In my case the first match was always the correct one, but anchoring the paths (as in the code below) avoids the problem entirely.

```r
library(rentrez)

# Split the over-long query into three parts that share one web environment,
# then combine them on the history server via their query keys.
q1 <- entrez_search(db = "pubmed", term = query_part1, use_history = TRUE)
q2 <- entrez_search(db = "pubmed", term = query_part2, use_history = TRUE,
                    WebEnv = q1$web_history$WebEnv)
q3 <- entrez_search(db = "pubmed", term = query_part3, use_history = TRUE,
                    WebEnv = q2$web_history$WebEnv)

query_combine_1_2_3 <- sprintf("#%s AND #%s NOT #%s",
                               q1$web_history$QueryKey,
                               q2$web_history$QueryKey,
                               q3$web_history$QueryKey)
pubmed_search <- entrez_search(db = "pubmed", term = query_combine_1_2_3,
                               use_history = TRUE,
                               WebEnv = q1$web_history$WebEnv)
```

```r
library(XML)

# Evaluate an XPath expression against a node: NA if nothing matches,
# a comma-separated string if there is more than one match.
xpath2 <- function(x, path, fun = xmlValue, ...) {
    y <- xpathSApply(x, path, fun, ...)
    ifelse(length(y) == 0, NA,
           ifelse(length(y) > 1, paste(unlist(y), collapse = ", "), y))
}
```

```r
# Fetch the records in batches of 100 (retstart is zero-based in E-utilities)
# and extract the fields of interest from the XML.
pubmed_search_df_list <- lapply(seq(0, pubmed_search$count - 1, 100), function(x) {
    recs <- entrez_fetch(db = "pubmed", web_history = pubmed_search$web_history,
                         rettype = "xml", parsed = TRUE, retmax = 100, retstart = x)
    articles <- getNodeSet(recs, "//PubmedArticle")
    data.frame(
        Title    = sapply(articles, xpath2, ".//ArticleTitle"),
        Journal  = sapply(articles, xpath2, ".//Journal/Title"),
        PubYear  = sapply(articles, xpath2, ".//PubDate/Year"),
        Keywords = sapply(articles, xpath2, ".//Keyword"),
        Authors  = sapply(articles, xpath2, ".//Author/LastName"),
        Abstract = sapply(articles, xpath2, ".//AbstractText"),
        # Anchored paths so PMIDs/DOIs cited in the reference list don't match.
        PMID     = sapply(articles, xpath2, ".//MedlineCitation/PMID"),
        DOI      = sapply(articles, xpath2,
                          ".//PubmedData/ArticleIdList/ArticleId[@IdType='doi']"),
        stringsAsFactors = FALSE
    )
})
pubmed_search_df <- do.call(rbind, pubmed_search_df_list)
```
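One practical note for a job this size: an NCBI API key raises the rate limit from 3 to 10 requests per second, and a short pause per batch keeps you under it. `set_entrez_key()` is a real rentrez helper; the key value and the `fetch_batch()` wrapper below are illustrative only.

```r
library(rentrez)

# Placeholder: use the API key from your NCBI account settings.
set_entrez_key("YOUR_NCBI_API_KEY")

# Illustrative wrapper: throttle each batch request to respect the rate limit.
fetch_batch <- function(start) {
    Sys.sleep(0.15)  # well under 10 requests/second with a key
    entrez_fetch(db = "pubmed", web_history = pubmed_search$web_history,
                 rettype = "xml", parsed = TRUE, retmax = 100, retstart = start)
}
```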

I also contacted NCBI support about the discrepancy between the website results and the API results, and this was their answer:

> The E-utilities API is retrieving data from the legacy PubMed database. The new PubMed is built with updated technology. Since they are two separate systems, counts will not always match up. Indexing is done at different times for each system, so additions, deletions (e.g., removing duplicates), or updates to records aren't taking effect at the same time in both places.
>
> Additionally, there are some changes in search syntax and search translations for the new PubMed that may affect the number of search results. For example, Automatic Term Mapping has been augmented to include additional British and American spellings, singular and plural word forms, and other synonyms to provide more consistent and comprehensive search retrieval. To see how PubMed translated your query, see the search details included in History and Search Details on the Advanced Search page. Using this translated query in your E-utilities search can help bring the E-utilities results closer to the website results; however, the different update schedules mean that results may still be slightly different.
>
> Later this year, NCBI will be moving to an updated version of the E-utilities API for PubMed using the same technology as the PubMed web interface. This update will keep E-utilities in sync with the web interface.
>
> Please test your API calls on the test server, report issues, and provide your feedback.
> Test Server base URL: https://eutilspreview.ncbi.nlm.nih.gov/entrez/eutils/
>
> See our NCBI Insights post (https://go.usa.gov/xzGK4) for more information.
>
> Useful links:
> • About E-utilities: https://go.usa.gov/xz7Gr
> • NCBI Insights: https://go.usa.gov/xzGK4
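Incidentally, the translated query NCBI mentions is also exposed by rentrez itself: the object returned by `entrez_search()` carries a `QueryTranslation` field, so (assuming that field is populated for your search) you can inspect or reuse PubMed's expanded form without visiting the Advanced Search page:

```r
# PubMed's expansion of the query under Automatic Term Mapping
pubmed_search$QueryTranslation
```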

They suggest trying their test server; however, I have no idea how to point rentrez at it...

@allenbaron

Glad to hear you got the info you needed.

To try the NCBI test server with rentrez you'd have to fork the package and change the internal base URL it uses.
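Alternatively, you can call the preview server directly and bypass rentrez. A minimal sketch using httr against the documented esearch endpoint; the parameters mirror what `entrez_search()` sends, and nothing here is rentrez-specific:

```r
library(httr)
library(XML)

# Query the preview E-utilities server directly.
preview_base <- "https://eutilspreview.ncbi.nlm.nih.gov/entrez/eutils/"
resp <- GET(paste0(preview_base, "esearch.fcgi"),
            query = list(db = "pubmed", term = new_query, usehistory = "y"))
stop_for_status(resp)

# Parse the XML reply and pull out the hit count for comparison.
doc <- xmlParse(content(resp, as = "text", encoding = "UTF-8"))
count <- as.integer(xpathSApply(doc, "/eSearchResult/Count", xmlValue))
```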
