Parsing literature searches

Rob MacLeod

Here I describe a program (actually a combination of shell and awk scripts) called parseref that converts the output of a PubMed literature search into something BibTeX can understand. Running it requires the Bourne shell and the awk program, both of which are included with Unix and Mac OS X and are available as add-on utilities for Windows. In fact, you will need the GNU awk program, called gawk on many systems, for some of the functions in the program.
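A quick way to confirm that the prerequisites are present is sketched below. The function names `have_command` and `check_parseref_prereqs` are made up for this illustration; only the tool names sh and gawk come from the text above.

```shell
#!/bin/sh
# Succeeds if the named command can be found on the PATH.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: print one line per missing prerequisite and
# return non-zero if anything is missing.
check_parseref_prereqs() {
    missing=0
    for tool in sh gawk; do
        if ! have_command "$tool"; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `check_parseref_prereqs` before the first run should report a missing gawk rather than leaving you with a less obvious failure later.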

1 The code and installation

Right click (control-click on the Mac) on the links below to download each of the files:

parseref.sh is the main script.
parsepubmed.awk is the awk program for references in the PubMed format.
parseref.tar is a tar file that contains all of the files above.

To work properly, both the shell and awk files must be in the same directory as the files you wish to convert, or they must at least be available in the PATH (or whatever it is called in your operating system). My preference is to place both the shell script and the awk file into my /bin directory, which is almost always in the PATH. Then I make a symbolic link pointing parseref to parseref.sh, so that execution requires only typing parseref.
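The installation just described might be sketched as follows. The default target directory `$HOME/bin` is an assumption (substitute whichever directory on your PATH you prefer), and the function name `install_parseref` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: copy the two parseref files from the download
# directory into a directory on the PATH and create the parseref link.
install_parseref() {
    src_dir=$1                   # where parseref.sh and parsepubmed.awk were saved
    bin_dir=${2:-"$HOME/bin"}    # assumed install location; must be in the PATH

    mkdir -p "$bin_dir" || return 1
    cp "$src_dir/parseref.sh" "$src_dir/parsepubmed.awk" "$bin_dir/" || return 1
    chmod +x "$bin_dir/parseref.sh" || return 1
    # Relative symbolic link, so that typing parseref runs parseref.sh
    ln -sf parseref.sh "$bin_dir/parseref"
}
```

For example, `install_parseref ~/Downloads` would install from a downloads directory into the default location.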

2 Using `parseref`

At present, parseref.sh works best for searches of the PubMed version of Medline, but in principle, any entries that adhere to the MEDLINE format should work.

Here are the steps required to do the parsing.

2.1 The Search

- Go to PubMed, carry out the search, and identify the reference(s) you want to convert for your BibTeX database.
- In PubMed, look for the menu that usually says "SUMMARY"; it is next to the "Display" button. From this menu, select the option "MEDLINE" and then hit "Display". The result will be a detailed, plain-text listing of all the elements of the reference citations, not meant for human reading, but great for computer programs.
- Highlight the part of this listing that includes the reference(s) you want to keep, then copy and paste it into a standard text editor window (e.g., emacs).
- Repeat these steps until you have collected all the references you want to keep, then save the file (I like using an extension of .pubmed so I know what the file contains).
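Before parsing, it can help to confirm that the saved file really contains MEDLINE-format records. In that format each record starts with a `PMID-` line, so counting those lines gives the number of references you pasted. This is only a sanity check; the function name `count_medline_records` and the file name `refs.pubmed` are made up for illustration.

```shell
#!/bin/sh
# Count the records in a MEDLINE-format file by counting the lines that
# begin with the PMID- tag (one such line per reference).
count_medline_records() {
    grep -c '^PMID-' "$1"
}
```

Running `count_medline_records refs.pubmed` should print the number of references you expect. A count of 0 suggests that the listing was copied from the summary view rather than the MEDLINE view.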

2.2 The Parsing

Run the program parseref on the file you have saved, here called infilename, as follows:
```
 parseref.sh -p [-l] [-i ABC] infilename 
```
where
- -p means to use the PubMed conventions.
- -l controls the formatting of authors' names: the default is Initials Last-name, e.g., "R.S. MacLeod", but with the -l option the format is Last-name, Initials, e.g., "MacLeod, R.S."
- -i ABC sets the initials that begin the key of each converted entry. Our local convention is that each key begins with the initials of the owner of the BibTeX file. If you do not use the -i ABC option, all keys will begin with "RSM", my initials.
- infilename is the name of the file in MEDLINE format that you want to parse.

If you forget the arguments or syntax, just type parseref.sh by itself and you will get some help.

The result of this operation should be a file with a .bib extension and the same base filename as your input file. It should be in BibTeX format and so can be merged into your existing BibTeX file.
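If you have several saved searches, the call can be wrapped in a loop. The sketch below assumes that the input files are plain file names in the current directory, that the output lands there too, and that parseref.sh returns a non-zero status on failure; none of this is confirmed by the script itself. The function names `bib_name` and `parse_all` are made up, and ABC is a placeholder for your initials.

```shell
#!/bin/sh
# Expected output name: same base name as the input, with a .bib extension.
# Assumes a plain file name (no directory part) that does not start with a dot.
bib_name() {
    printf '%s.bib\n' "${1%.*}"
}

# Run parseref.sh on each file given and report whether a non-empty
# .bib file appeared afterwards.
parse_all() {
    for infile in "$@"; do
        if parseref.sh -p -i ABC "$infile" && [ -s "$(bib_name "$infile")" ]; then
            echo "$infile -> $(bib_name "$infile")"
        else
            echo "parseref failed or wrote nothing for $infile" >&2
        fi
    done
}
```

A typical call would be `parse_all *.pubmed`.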

You can also run the awk script directly as follows if you have problems with the Unix shell script (enter the entire command on a single line):

    gawk -f parsepubmed.awk [initials=ABC] [authortype=lastnamefirst] infilename.pubmed >> outfilename.bib

where infilename.pubmed and outfilename.bib are the input and output files, respectively. Because the command uses >>, the output is appended to outfilename.bib if that file already exists.
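The direct call can be wrapped in a small function that adds the optional settings only when they are supplied. This is a sketch: the function name `run_parsepubmed` and the `GAWK` override variable are my own additions, while the `initials=` and `authortype=` assignments are the ones described in this section.

```shell
#!/bin/sh
# usage: run_parsepubmed infile outfile [initials] [authortype]
# Appends the converted entries to outfile, like the >> form of the command.
run_parsepubmed() {
    infile=$1
    outfile=$2
    initials=${3:-}
    authortype=${4:-}

    set --    # rebuild the argument list with only the options supplied
    [ -n "$initials" ] && set -- "$@" "initials=$initials"
    [ -n "$authortype" ] && set -- "$@" "authortype=$authortype"

    # GAWK is a hypothetical override for systems where GNU awk has another name.
    ${GAWK:-gawk} -f parsepubmed.awk "$@" "$infile" >> "$outfile"
}
```

For example, `run_parsepubmed heart.pubmed all.bib ABC lastnamefirst` followed by `run_parsepubmed lungs.pubmed all.bib ABC lastnamefirst` would collect both searches in all.bib, since each call appends.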

Note that the output file may also contain some string variables that we use a lot to ensure consistent listings for journal names. The current set of such strings is as follows:

  @String{j-BME = "IEEE Trans Biomed Eng"}
  @String{j-CR = "Circ Res"}
  @String{j-C = "Circulation"}
  @String{j-AJP = "Am J Physiol"}
  @String{j-ABE = "Ann Biomed Eng"}
  @String{j-JE = "J Electrocardiol"}

These will appear at the end of the output file and you can either use them or replace the journal variables with whatever strings you prefer.
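One way to substitute your own journal name is to change the definition of the string rather than every entry that uses it. The sketch below assumes that each definition sits on its own line in exactly the form shown above; the function name `set_journal_string` is made up, and the new name must not contain the characters `|`, `&`, or `\`.

```shell
#!/bin/sh
# usage: set_journal_string file key "New journal name"
# Prints the file to standard output with the value of one @String
# definition replaced; the input file itself is left unchanged.
set_journal_string() {
    sed "s|^\([[:space:]]*@String{$2 = \)\"[^\"]*\"}|\1\"$3\"}|" "$1"
}
```

For example, `set_journal_string refs.bib j-BME "IEEE Transactions on Biomedical Engineering" > refs-long.bib` would write a copy that uses the full journal title.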

About this document ...

This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.71)

The translation was initiated by Rob Macleod on 2009-10-10