CSC 594 Topics in AI: Applied Natural Language Processing
Fall 2009/2010

(a) Information Extraction from biomedical texts (with Prof. Lytinen)

NEW: 9/25


The goal of this project is to extract the findings reported in biomedical texts, in particular the abstracts of research papers submitted to a Shock/Trauma conference.
Most 'findings' concern causation (e.g. "X caused Y", "X influenced Y"), for instance the effect of a drug on a shock/trauma condition or on body cells.

Some work has already been done on this project by Prof. Lytinen, his colleague Dr. Gary An (a Shock surgeon), and Emilia Apostolova.  In particular, we collected the abstracts submitted to the 2009 conference, pre-processed them with a tool called Metamap (http://metamap.nlm.nih.gov/ , developed at the National Institutes of Health (NIH)), and extracted the named entities of the semantic types that interest us.  [Since Metamap draws on a very large thesaurus of medical terms and semantic concepts, we selected only the semantic types that are relevant to the area of Shock/Trauma.]  Those semantic types are:

  • Drug/Chemical Compound/Therapeutic Modality
  • Molecule
  • Cell Type
  • Condition
  • Experimental Platform

Then we asked the authors of the papers to verify the output of Metamap (i.e., the named entities of those semantic types that Metamap identified).  Since Metamap is an automatic system, some entities were identified incorrectly and others were missed, so we asked the authors to correct the errors manually and add the missed entities.  This much has been done.

Now for the project in this course, we will start from this data (i.e., verified named entities, plus the original abstracts).  See below.

Some important notes:

  • We will rely on Dr. An's expertise to guide us as to the 'findings' he is interested in.
  • An issue that we will need to address is what part of the abstract should be processed by our system. We may need a preliminary step to identify important sentences in the abstract, and then focus on these sentences in particular to extract the findings.
  • A finding reported in an abstract is likely to involve named entities, for at least one of the components in the causation relation (i.e., either X or Y, or both, in "X caused Y" for example). 
  • The relation that expresses causation is often a verb (e.g. "caused", "influenced").  Note that the set of such verbs is not finite: there are many ways to express causation.  So we will start by identifying 'patterns' which express causation.
  • Also note that the pre-processed data we have so far is only a list of named entities.  So our first task is to match/map the named entities with the original abstracts, then extract the sentence fragments which express causation between named entities.
  • In order to find those fragments, we will use a shallow parser (probably the Stanford Parser) to extract the syntactic phrases that involve named entities (as subject/object nouns).  Below is an example of the Stanford Parser's output (the input sentence was "MM-CLINICAL-RESEARCH have documented poor outcomes in MM-SEPTICEMIA.").  Note that in this example some words are already tagged with a part of speech (POS) in the input (e.g. "MM-CLINICAL-RESEARCH/NP").
    Stanford Partial Parser
    
    Loading parser from serialized file englishPCFG.ser.gz ... done [6.1 sec].
    Parsing [sent. 1 len. 8]: [MM-CLINICAL-RESEARCH/NP, have, documented, 
                               poor, outcomes, in, MM-SEPTICEMIA/NN, .]
    (ROOT
      (S
        (NP (NNP MM-CLINICAL-RESEARCH/NP))
        (VP (VBP have)
          (VP (VBN documented)
            (NP (JJ poor) (NNS outcomes))
            (PP (IN in)
              (NP (NNP MM-SEPTICEMIA/NN)))))
        (. .)))

The output of the Stanford Partial Parser consists of a "parse tree" (i.e., a graph of the syntactic structure of the sentence), as well as certain "dependencies" which the parser extracts (not shown above).
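To give an idea of how the bracketed parse tree can be used, here is a minimal Python sketch that reads the tree shown above into nested lists and pulls out the subject phrase, the verbs, and the named entities in the predicate.  The function names (parse_tree, entity_frame, etc.) and the "MM-" prefix test are our own assumptions for illustration; they are not part of the Stanford Parser, and the sketch only handles simple sentences of this shape.

```python
EXAMPLE = """
(ROOT
  (S
    (NP (NNP MM-CLINICAL-RESEARCH/NP))
    (VP (VBP have)
      (VP (VBN documented)
        (NP (JJ poor) (NNS outcomes))
        (PP (IN in)
          (NP (NNP MM-SEPTICEMIA/NN)))))
    (. .)))
"""

def parse_tree(text):
    """Read a bracketed parse into nested lists: [label, child, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok              # a label or a word
        node = []
        while tokens[pos] != ")":
            node.append(read())
        pos += 1                    # skip the closing ")"
        return node

    return read()

def find(node, label):
    """Yield every subtree with the given label, depth-first."""
    if isinstance(node, str):
        return
    if node[0] == label:
        yield node
    for child in node[1:]:
        yield from find(child, label)

def preterminals(node):
    """Yield every [POS-tag, word] pair in the subtree, left to right."""
    if isinstance(node, str):
        return
    if len(node) == 2 and isinstance(node[1], str):
        yield node
        return
    for child in node[1:]:
        yield from preterminals(child)

def entity_frame(tree):
    """Subject words, verbs, and MM- entities in the predicate of the first S."""
    s = next(find(tree, "S"))
    subj = next(c for c in s[1:] if c[0] == "NP")   # NP directly under S
    vp = next(c for c in s[1:] if c[0] == "VP")     # VP directly under S
    return {
        "subject": [word for _, word in preterminals(subj)],
        "verbs": [word for tag, word in preterminals(vp) if tag.startswith("VB")],
        # Assumption: pre-tagged named entities carry the "MM-" prefix.
        "predicate_entities": [word for _, word in preterminals(vp)
                               if word.startswith("MM-")],
    }

frame = entity_frame(parse_tree(EXAMPLE))
print(frame)
```

For the example sentence, this pairs the subject entity with the verb group ("have", "documented") and the entity inside the prepositional phrase.  Real abstracts will contain coordinated and embedded clauses, so a real implementation would need to walk every S node rather than only the first.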

  • One more note.  A relation verb is also a semantic concept in Metamap.  So after extracting the fragments, we will eventually have to go back to Metamap and find the semantic type of the relation.
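The first steps in the notes above (mapping the verified named entities back onto the abstract, then looking for a causation pattern between two of them) could be sketched roughly as follows in Python.  This is only a surface-level stand-in that uses regular expressions before any parsing is done.  The cue list CAUSAL_CUES, the function names, and the example abstract are all invented for illustration; the actual patterns would be identified with Dr. An's guidance.

```python
import re

# Hypothetical seed list of causation cues (an assumption, not project data).
CAUSAL_CUES = ["caused", "induced", "influenced", "increased",
               "decreased", "inhibited"]

def split_sentences(abstract):
    """Very rough splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", abstract) if s.strip()]

def find_mentions(sentence, entities):
    """Return sorted (start, end, entity) for each entity found in the sentence."""
    mentions = []
    for ent in entities:
        # Lookarounds instead of \b so entities ending in punctuation still match.
        pattern = r"(?<!\w)" + re.escape(ent) + r"(?!\w)"
        for m in re.finditer(pattern, sentence, re.IGNORECASE):
            mentions.append((m.start(), m.end(), ent))
    return sorted(mentions)

def causal_candidates(abstract, entities):
    """Return (X, cue, Y) where entity X precedes a cue and entity Y follows it."""
    cue_re = re.compile(r"\b(" + "|".join(CAUSAL_CUES) + r")\b", re.IGNORECASE)
    results = []
    for sent in split_sentences(abstract):
        mentions = find_mentions(sent, entities)
        for cue in cue_re.finditer(sent):
            left = [m for m in mentions if m[1] <= cue.start()]
            right = [m for m in mentions if m[0] >= cue.end()]
            if left and right:
                # Take the nearest entity on each side of the cue.
                results.append((left[-1][2], cue.group(1).lower(), right[0][2]))
    return results

# Invented example text and entity list, for illustration only.
abstract = ("Sepsis remains a major problem. "
            "In mice, drug A decreased IL-6 levels after hemorrhagic shock.")
entities = ["drug A", "IL-6", "hemorrhagic shock", "sepsis"]

print(causal_candidates(abstract, entities))
```

Taking the nearest entity on each side of the cue is a deliberately crude choice.  It ignores passives ("Y was caused by X") and negation, which is one reason the notes above turn to a parser for the syntactic phrases.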

Data we will use (to start):

Finally, for your interest, here are the abstracts from the 2005 conference: "Shock2005Abstracts.pdf"