594 Projects

CSC 594 Topics in AI: Applied Natural Language Processing
Fall 2009/2010

(a) Information Extraction from biomedical texts (with Prof. Lytinen)

NEW: 9/25

Project write-up by Prof. Lytinen
2009 abstracts in ascii files [188 kb zip file]
Metamap output of 2009 abstracts (with all semantic classes) [3.2 MB zip file]

The goal of this project is to extract the findings reported in biomedical texts, in particular the abstracts of the research papers submitted to a Shock/Trauma conference.
Most 'findings' concern causation (e.g. "X caused Y", "X influenced Y"), for instance the effect of a drug on a shock/trauma condition or on body cells.

Some work has already been done for this project by Prof. Lytinen, his colleague Dr. Gary An (a shock surgeon), and Emilia Apostolova. In particular, we collected the abstracts submitted to the 2009 conference, pre-processed them with a tool called Metamap (http://metamap.nlm.nih.gov/ , developed at the National Institutes of Health (NIH)), and extracted the named entities of the semantic types that interest us. [Since Metamap covers a very large thesaurus of medical terms and semantic concepts, we selected only those that are relevant to the area of Shock/Trauma.] Those semantic types are:

Drug/Chemical Compound/Therapeutic Modality
Molecule
Cell Type
Condition
Experimental Platform

Then we asked the authors of the papers to verify the output of Metamap (i.e., the named entities of those semantic types that Metamap identified). Since Metamap is an automatic system, some entities were wrong and some were missed, so we asked the authors to manually correct the errors and add the missing entities. This much has been done.

Now for the project in this course, we will start from this data (i.e., verified named entities, plus the original abstracts). See below.

Some important notes:

We will rely on Dr. An's expertise to guide us as to the 'findings' he is interested in.
An issue that we will need to address is what part of the abstract should be processed by our system. We may need a preliminary step to identify important sentences in the abstract, and then focus on these sentences in particular to extract the findings.
A finding reported in an abstract is likely to involve named entities, for at least one of the components in the causation relation (i.e., either X or Y, or both, in "X caused Y" for example).
The relation that expresses causation is oftentimes a verb (e.g. "caused", "influenced"). Note that the set of such verbs is open-ended: there are many ways to express causation. So we will start by identifying 'patterns' which express causation.
Also note that the pre-processed data we have so far is only a list of named entities. So our first task is to map the named entities back to the original abstracts, then extract the sentence fragments which express causation between named entities.
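
As a rough illustration of the pattern idea in the notes above, here is a minimal sketch in Python. It assumes we already know which named entities occur in a sentence. The verb list CAUSAL_VERBS, the function find_causal_fragments, and the example sentence are all made up for illustration; the actual patterns will have to come from the abstracts and from Dr. An's guidance.

```python
import re

# Hypothetical seed list of causation verbs (an assumption, not a project resource).
CAUSAL_VERBS = ["caused", "causes", "influenced", "influences", "induced", "induces"]
VERB_PATTERN = re.compile(r"\b(" + "|".join(CAUSAL_VERBS) + r")\b", re.IGNORECASE)


def find_causal_fragments(sentence, entities):
    """Return (X, verb, Y) triples where entity X occurs before a causal
    verb and entity Y occurs after it in the same sentence."""
    fragments = []
    for match in VERB_PATTERN.finditer(sentence):
        left = sentence[:match.start()].lower()
        right = sentence[match.end():].lower()
        subjects = [e for e in entities if e.lower() in left]
        objects = [e for e in entities if e.lower() in right]
        for x in subjects:
            for y in objects:
                fragments.append((x, match.group(1).lower(), y))
    return fragments


# Invented example sentence in the "X influenced Y" style.
print(find_causal_fragments("Drug A influenced condition B.",
                            ["Drug A", "condition B"]))
```

This only looks at word order around the verb, so it will over-generate on long sentences; that is one reason we want the syntactic phrases from a parser rather than plain string matching.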

In order to find those fragments, we will use a shallow parser (probably the Stanford Parser) to extract the syntactic phrases that involve named entities (as subject/object nouns). Below is an example output of the Stanford parser, where the input sentence was "MM-CLINICAL-RESEARCH have documented poor outcomes in MM-SEPTICEMIA." Note that some words are already tagged with POS (e.g. "MM-CLINICAL-RESEARCH/NP") in this example.
```
Stanford Partial Parser

Loading parser from serialized file englishPCFG.ser.gz ... done [6.1 sec].
Parsing [sent. 1 len. 8]: [MM-CLINICAL-RESEARCH/NP, have, documented, 
                           poor, outcomes, in, MM-SEPTICEMIA/NN, .]
(ROOT
  (S
    (NP (NNP MM-CLINICAL-RESEARCH/NP))
    (VP (VBP have)
      (VP (VBN documented)
        (NP (JJ poor) (NNS outcomes))
        (PP (IN in)
          (NP (NNP MM-SEPTICEMIA/NN)))))
    (. .)))
```

The Stanford Partial Parser produces a "parse tree" (i.e., a graph of the syntactic structure of the sentence), as well as certain "dependencies" which the parser extracts (not shown above).
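
To show how the bracketed parse tree above could be read back into a program, here is a small sketch in Python. It does not call the Stanford parser; it only reads the printed tree. The helper names parse_brackets, leaves and phrases are our own for this illustration.

```python
TREE_TEXT = """
(ROOT
  (S
    (NP (NNP MM-CLINICAL-RESEARCH/NP))
    (VP (VBP have)
      (VP (VBN documented)
        (NP (JJ poor) (NNS outcomes))
        (PP (IN in)
          (NP (NNP MM-SEPTICEMIA/NN)))))
    (. .)))
"""


def parse_brackets(text):
    """Parse a bracketed tree into nested lists of the form [label, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        node = [tokens[pos]]  # node label, e.g. "NP"
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())  # nested phrase
            else:
                node.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing ")"
        return node

    return read()


def leaves(node):
    """Return the words under a node, left to right."""
    words = []
    for child in node[1:]:
        words.extend(leaves(child) if isinstance(child, list) else [child])
    return words


def phrases(node, label):
    """Yield every subtree with the given label, in left-to-right order."""
    if node[0] == label:
        yield node
    for child in node[1:]:
        if isinstance(child, list):
            yield from phrases(child, label)


tree = parse_brackets(TREE_TEXT)
# The noun phrases are the candidates for the X and Y of a causation relation.
print([" ".join(leaves(np)) for np in phrases(tree, "NP")])
```

For the example sentence this lists three noun phrases, two of which are named-entity placeholders (the "MM-" tokens).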

One more note. A relation verb is also a semantic concept in Metamap. So after extracting the fragments, we will eventually have to go back to Metamap and find the semantic type of the relation.

Data we will use (to start):

2009_shock_abstracts.csv -- submitted abstracts in a text format
2009_ShockNamedEntity.csv -- a list of verified named entities; each named entity is listed with the number of the abstract in which it appeared and its byte offsets in that abstract.
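
As a sketch of how the two files could be joined, the Python below looks up each named entity in its abstract by byte offset. The column names (abstract_id, text, start, end, semantic_type), the UTF-8 encoding, and the convention that "end" is exclusive are all assumptions made for this example; please check them against the actual files. The sample rows are invented.

```python
import csv
import io

# Invented sample data with a hypothetical column layout.
ABSTRACTS_CSV = "abstract_id,text\n1,Drug A influenced condition B.\n"
ENTITIES_CSV = (
    "abstract_id,start,end,semantic_type\n"
    "1,0,6,Drug\n"
    "1,18,29,Condition\n"
)


def entity_mentions(abstracts_csv, entities_csv, encoding="utf-8"):
    """Return (abstract_id, semantic_type, mention_text) for each entity row."""
    # Keep each abstract as bytes, because the offsets are byte offsets.
    abstracts = {
        row["abstract_id"]: row["text"].encode(encoding)
        for row in csv.DictReader(io.StringIO(abstracts_csv))
    }
    mentions = []
    for row in csv.DictReader(io.StringIO(entities_csv)):
        raw = abstracts[row["abstract_id"]]
        span = raw[int(row["start"]):int(row["end"])]  # assumes end is exclusive
        mentions.append(
            (row["abstract_id"], row["semantic_type"], span.decode(encoding))
        )
    return mentions


for mention in entity_mentions(ABSTRACTS_CSV, ENTITIES_CSV):
    print(mention)
```

Slicing the encoded bytes rather than the decoded string matters only when an abstract contains non-ASCII characters, but in that case character offsets and byte offsets no longer agree.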

Finally, for your interest, here are the abstracts from the 2005 conference: "Shock2005Abstracts.pdf"
