Try to characterize publication_type in pub_hash better for Pubmed records #1090

Open
peetucket opened this issue Jun 26, 2019 · 2 comments

Currently, for Pubmed source records, we set every publication type to article (see https://github.com/sul-dlss/sul_pub/blob/master/app/models/pubmed_source_record.rb#L140). The Pubmed source records may contain information about the type that could be used to set it more accurately to one of the following supported types:

- inproceedings
- book
- article

For example, the pubmed source XML has a node called PublicationTypeList that looks like this:

            <PublicationTypeList>
                <PublicationType UI="D016428">Journal Article</PublicationType>
            </PublicationTypeList>

which suggests it may hold publication type.

For example, in prod, see the output of: puts PubmedSourceRecord.find_by(pmid: 27397405).source_data


peetucket commented Jun 26, 2019

Current values in pubmed source records:

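total = PubmedSourceRecord.count
pub_types = Hash.new(0) # tally of the first PublicationType value per record
n = 0
PubmedSourceRecord.find_each do |pmsr|
  n += 1
  pub_doc = Nokogiri::XML(pmsr.source_data)
  begin
    # only the first PublicationType node is counted; a record can list several
    article_type = pub_doc.xpath('//PubmedArticle/MedlineCitation/Article/PublicationTypeList/PublicationType')[0].children[0].text
  rescue NoMethodError
    # the xpath matched nothing, or the matched node had no children
    article_type = "NODE_NOT_FOUND"
  end
  pub_types[article_type] += 1
  puts "#{n} of #{total} : #{article_type}"
end;nil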
puts total
=> 423676
puts pub_types.sort_by { |_key, value| -value }.to_h
=> {"Journal Article"=>333665,
 "Comparative Study"=>23977,
 "Case Reports"=>19433,
 "Clinical Trial"=>8464,
 "JOURNAL ARTICLE"=>5703,
 "Comment"=>5322,
 "Letter"=>4609,
 "Editorial"=>4526,
 "English Abstract"=>3698,
 "Evaluation Studies"=>3346,
 "In Vitro"=>2361,
 "Historical Article"=>982,
 "Clinical Trial, Phase II"=>928,
 "Clinical Trial, Phase III"=>707,
 "Biography"=>690,
 "Clinical Trial, Phase I"=>651,
 "NODE_NOT_FOUND"=>612,
 "Consensus Development Conference"=>473,
 "News"=>468,
 "Published Erratum"=>416,
 "Congresses"=>375,
 "Review"=>325,
 "Controlled Clinical Trial"=>287,
 "Guideline"=>277,
 "Introductory Journal Article"=>237,
 "Congress"=>176,
 "Interview"=>143,
 "REVIEW"=>135,
 "LETTER"=>46,
 "Clinical Study"=>46,
 "Autobiography"=>45,
 "Clinical Trial, Phase IV"=>41,
 "Lectures"=>38,
 "Addresses"=>38,
 "Consensus Development Conference, NIH"=>38,
 "Bibliography"=>35,
 "Address"=>33,
 "EDITORIAL"=>32,
 "Retraction of Publication"=>30,
 "Clinical Conference"=>24,
 "Dataset"=>23,
 "Newspaper Article"=>22,
 "Corrected and Republished Article"=>21,
 "Research Support, Non-U.S. Gov't"=>16,
 "Classical Article"=>15,
 "Lecture"=>15,
 "Clinical Trial, Veterinary"=>13,
 "Research Support, N.I.H., Extramural"=>11,
 "Duplicate Publication"=>11,
 "Interactive Tutorial"=>11,
 "Patient Education Handout"=>11,
 "Legal Case"=>11,
 "Directory"=>9,
 "Clinical Trial Protocol"=>8,
 "Research Support, U.S. Gov't, P.H.S."=>7,
 "Equivalence Trial"=>6,
 "Personal Narrative"=>5,
 "Systematic Review"=>4,
 "Practice Guideline"=>3,
 "Research Support, U.S. Gov't, Non-P.H.S."=>3,
 "Legislation"=>3,
 "Festschrift"=>2,
 "Meta-Analysis"=>2,
 "Dictionary"=>2,
 "Overall"=>2,
 "PUBLISHED ERRATUM"=>2,
 "Technical Report"=>2,
 "Legal Cases"=>1,
 "CASE REPORTS"=>1,
 "Video-Audio Media"=>1,
 "Adaptive Clinical Trial"=>1}

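Several values in the tally above differ only in letter case (for example "Journal Article" and "JOURNAL ARTICLE"), so any mapping would probably want to normalize case first. A minimal sketch of that, using a handful of counts copied from the output above; the helper name `merge_case_variants` is made up for illustration and is not part of sul_pub:

```ruby
# Subset of the tally above: values that differ only in letter case.
raw_counts = {
  'Journal Article' => 333_665,
  'JOURNAL ARTICLE' => 5_703,
  'Letter' => 4_609,
  'LETTER' => 46,
  'Review' => 325,
  'REVIEW' => 135
}

# Hypothetical helper: sums counts whose labels match after downcasing.
def merge_case_variants(counts)
  counts.each_with_object(Hash.new(0)) do |(label, count), merged|
    merged[label.downcase] += count
  end
end

merged = merge_case_variants(raw_counts)
merged.each { |label, count| puts "#{label}: #{count}" }
# journal article: 339368
# letter: 4655
# review: 460
```

This only folds case; singular/plural pairs such as "Congress" and "Congresses" would still need their own handling.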
@peetucket

Pubmed Documented publication types:

It's unclear from this controlled vocabulary, though, which values we would map to conference proceedings and which to books.
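To make the open question concrete, here is a rough sketch of what a lookup could look like once a mapping is agreed on. Everything in it is an assumption: the method name `sul_pub_type`, the two constants, and above all the choice of "Congress"/"Congresses" as conference proceedings are placeholders, not decisions from this issue. Anything unrecognized falls back to article, which matches the current behaviour.

```ruby
# ASSUMPTION: which PubMed PublicationType values count as conference
# proceedings is a guess for illustration, not an agreed mapping.
ASSUMED_INPROCEEDINGS_TYPES = ['congress', 'congresses'].freeze

# ASSUMPTION: left empty because no value in the tally above clearly
# denotes a book.
ASSUMED_BOOK_TYPES = [].freeze

# Hypothetical helper. Takes the PublicationType strings from one record and
# returns one of the supported types: 'inproceedings', 'book' or 'article'.
def sul_pub_type(pubmed_types)
  normalized = Array(pubmed_types).map { |type| type.to_s.strip.downcase }
  return 'inproceedings' if (normalized & ASSUMED_INPROCEEDINGS_TYPES).any?
  return 'book' if (normalized & ASSUMED_BOOK_TYPES).any?

  'article' # fallback: what every Pubmed record is set to today
end

puts sul_pub_type(['Journal Article'])
puts sul_pub_type(['Congress', 'Journal Article'])
puts sul_pub_type([])
```

Keeping the candidate values in constants means the mapping can be changed in one place without touching the lookup logic.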
