Get PubMed metadata from a large number of articles #180

Open
LauraVP1994 opened this issue Jul 12, 2022 · 3 comments
@LauraVP1994

I have a few questions regarding this tool. I have a query that results in 124,575 hits with:

```r
pubmed_search <- entrez_search(db = "pubmed", term = new_query, use_history = TRUE)
```

However, this count differs from the one I get when I run the same query on the PubMed website.

Secondly, I was wondering how I can retrieve the metadata of all these papers (first author, PMID, DOI, ...) and put the results in a data table.

Thanks for the help!

@allenbaron

> However, this count differs from the one I get when I run the same query on the PubMed website.

I recall the Entrez documentation saying something about this. I don't think the website and the API use the same search algorithm, but I'm not sure.

> Secondly, I was wondering how I can retrieve the metadata of all these papers (first author, PMID, DOI, ...) and put the results in a data table.

This is going to be a big job, so you should review NCBI's E-utilities documentation on rate limits and the best times of day to run large requests. You can get the information with entrez_summary(), but that currently fails with the version of rentrez on CRAN because of a bug; see #178 for details.
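Once that bug is fixed (or with the development version of the package), the shape of the solution would be roughly the following. This is a hedged sketch, not a tested recipe: the batch size is arbitrary, `retmax`/`retstart` are passed through to the API, and the esummary fields `uid`, `title`, `sortfirstauthor`, and `elocationid` (which usually carries the DOI) are assumptions about which fields you want.

```r
library(rentrez)

# Page through the esummary records attached to the stored web history.
# retstart is zero-based in the E-utilities API.
batch_size <- 100
starts <- seq(0, pubmed_search$count - 1, batch_size)

summary_df_list <- lapply(starts, function(start) {
    recs <- entrez_summary(db = "pubmed",
                           web_history = pubmed_search$web_history,
                           retmax = batch_size, retstart = start,
                           always_return_list = TRUE)  # safe for 1-record batches
    data.frame(
        PMID        = extract_from_esummary(recs, "uid"),
        Title       = extract_from_esummary(recs, "title"),
        FirstAuthor = extract_from_esummary(recs, "sortfirstauthor"),
        DOI         = extract_from_esummary(recs, "elocationid"),
        stringsAsFactors = FALSE, row.names = NULL
    )
})
summary_df <- do.call(rbind, summary_df_list)
```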

@LauraVP1994

LauraVP1994 commented Jul 15, 2022

Thank you for the help. I found a workaround for now, since both my query and the amount of data were too large. I solved the query-length problem by splitting the query into three parts and combining them on the history server, and the data-volume problem with an lapply over batches (a for loop would of course work too). There was also a bug in the PMID retrieval: an unanchored XPath like `.//ArticleId[@IdType='pubmed']` also picks up PMIDs cited in the papers' reference lists. In my case the first match was always the correct one, but anchoring the paths (as in the code below) avoids the problem entirely.

```r
library(rentrez)

# Split the over-long query into three parts that share one web environment,
# then combine them on the history server via their query keys.
q1 <- entrez_search(db = "pubmed", term = query_part1, use_history = TRUE)
q2 <- entrez_search(db = "pubmed", term = query_part2, use_history = TRUE,
                    WebEnv = q1$web_history$WebEnv)
q3 <- entrez_search(db = "pubmed", term = query_part3, use_history = TRUE,
                    WebEnv = q2$web_history$WebEnv)

query_combine_1_2_3 <- sprintf("#%s AND #%s NOT #%s",
                               q1$web_history$QueryKey,
                               q2$web_history$QueryKey,
                               q3$web_history$QueryKey)
pubmed_search <- entrez_search(db = "pubmed", term = query_combine_1_2_3,
                               use_history = TRUE,
                               WebEnv = q1$web_history$WebEnv)
```

```r
library(XML)

# Evaluate an XPath expression against a node: NA if nothing matches,
# a comma-separated string if there is more than one match.
xpath2 <- function(x, path, fun = xmlValue, ...) {
    y <- xpathSApply(x, path, fun, ...)
    ifelse(length(y) == 0, NA,
           ifelse(length(y) > 1, paste(unlist(y), collapse = ", "), y))
}
```

```r
# Fetch the records in batches of 100 (retstart is zero-based in E-utilities)
# and extract the fields of interest from the XML.
pubmed_search_df_list <- lapply(seq(0, pubmed_search$count - 1, 100), function(x) {
    recs <- entrez_fetch(db = "pubmed", web_history = pubmed_search$web_history,
                         rettype = "xml", parsed = TRUE, retmax = 100, retstart = x)
    articles <- getNodeSet(recs, "//PubmedArticle")
    data.frame(
        Title    = sapply(articles, xpath2, ".//ArticleTitle"),
        Journal  = sapply(articles, xpath2, ".//Journal/Title"),
        PubYear  = sapply(articles, xpath2, ".//PubDate/Year"),
        Keywords = sapply(articles, xpath2, ".//Keyword"),
        Authors  = sapply(articles, xpath2, ".//Author/LastName"),
        Abstract = sapply(articles, xpath2, ".//AbstractText"),
        # Anchored paths so PMIDs/DOIs cited in the reference list don't match.
        PMID     = sapply(articles, xpath2, ".//MedlineCitation/PMID"),
        DOI      = sapply(articles, xpath2,
                          ".//PubmedData/ArticleIdList/ArticleId[@IdType='doi']"),
        stringsAsFactors = FALSE
    )
})
pubmed_search_df <- do.call(rbind, pubmed_search_df_list)
```
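One practical note for a job this size: an NCBI API key raises the rate limit from 3 to 10 requests per second, and a short pause per batch keeps you under it. `set_entrez_key()` is a real rentrez helper; the key value and the `fetch_batch()` wrapper below are illustrative only.

```r
library(rentrez)

# Placeholder: use the API key from your NCBI account settings.
set_entrez_key("YOUR_NCBI_API_KEY")

# Illustrative wrapper: throttle each batch request to respect the rate limit.
fetch_batch <- function(start) {
    Sys.sleep(0.15)  # well under 10 requests/second with a key
    entrez_fetch(db = "pubmed", web_history = pubmed_search$web_history,
                 rettype = "xml", parsed = TRUE, retmax = 100, retstart = start)
}
```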

I also contacted NCBI support about the discrepancy between the website results and the API results, and this was their answer:

> The E-utilities API is retrieving data from the legacy PubMed database. The new PubMed is built with updated technology. Since they are two separate systems, counts will not always match up. Indexing is done at different times for each system, so additions, deletions (e.g., removing duplicates), or updates to records aren't taking effect at the same time in both places.
>
> Additionally, there are some changes in search syntax and search translations for the new PubMed that may affect the number of search results. For example, Automatic Term Mapping has been augmented to include additional British and American spellings, singular and plural word forms, and other synonyms to provide more consistent and comprehensive search retrieval. To see how PubMed translated your query, see the search details included in History and Search Details on the Advanced Search page. Using this translated query in your E-utilities search can help bring the E-utilities results closer to the website results; however, the different update schedules mean that results may still be slightly different.
>
> Later this year, NCBI will be moving to an updated version of the E-utilities API for PubMed using the same technology as the PubMed web interface. This update will keep E-utilities in sync with the web interface.
>
> Please test your API calls on the test server, report issues, and provide your feedback.
> Test Server base URL: https://eutilspreview.ncbi.nlm.nih.gov/entrez/eutils/
>
> See our NCBI Insights post (https://go.usa.gov/xzGK4) for more information.
>
> Useful links:
> • About E-utilities: https://go.usa.gov/xz7Gr
> • NCBI Insights: https://go.usa.gov/xzGK4
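Incidentally, the translated query NCBI mentions is also exposed by rentrez itself: the object returned by `entrez_search()` carries a `QueryTranslation` field, so (assuming that field is populated for your search) you can inspect or reuse PubMed's expanded form without visiting the Advanced Search page:

```r
# PubMed's expansion of the query under Automatic Term Mapping
pubmed_search$QueryTranslation
```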

They suggest trying their test server; however, I have no idea how to point rentrez at it...

@allenbaron

Glad to hear you got the info you needed.

To try the NCBI test server with rentrez you'd have to fork the package and change the internal base URL it uses.
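Alternatively, you can call the preview server directly and bypass rentrez. A minimal sketch using httr against the documented esearch endpoint; the parameters mirror what `entrez_search()` sends, and nothing here is rentrez-specific:

```r
library(httr)
library(XML)

# Query the preview E-utilities server directly.
preview_base <- "https://eutilspreview.ncbi.nlm.nih.gov/entrez/eutils/"
resp <- GET(paste0(preview_base, "esearch.fcgi"),
            query = list(db = "pubmed", term = new_query, usehistory = "y"))
stop_for_status(resp)

# Parse the XML reply and pull out the hit count for comparison.
doc <- xmlParse(content(resp, as = "text", encoding = "UTF-8"))
count <- as.integer(xpathSApply(doc, "/eSearchResult/Count", xmlValue))
```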
