CSC 594 Topics in AI: Applied Natural Language
Processing
Fall 2009/2010
(a) Information Extraction from biomedical texts (with Prof.
Lytinen)
NEW: 9/25
The goal of this project is to extract the findings reported in
biomedical texts, in particular the abstracts of research papers submitted to
a Shock/Trauma conference.
Most 'findings' concern causation (e.g. "X caused Y", "X
influenced Y"), for instance the effect of a drug on a shock/trauma
condition or on body cells.
Some work has already been done for this project
by Prof. Lytinen, his colleague Dr. Gary An (a shock surgeon),
and Emilia Apostolova. In particular, we collected
the abstracts submitted to the 2009 conference, pre-processed them with a
tool called Metamap (http://metamap.nlm.nih.gov/,
developed by the National Institutes of Health (NIH)), and extracted
the named entities of the semantic types we are interested in.
[Since Metamap draws on a very large thesaurus of medical terms and semantic
concepts, we selected only those that are relevant to the area of
Shock/Trauma.] Those semantic types are:
- Drug/Chemical Compound/Therapeutic Modality
- Molecule
- Cell Type
- Condition
- Experimental Platform
Then we asked the authors of the
papers to verify the output of Metamap (i.e., the named entities of those
semantic types identified by Metamap). Since Metamap is an automatic
system, some entities were identified incorrectly and others were missed, so we
asked the authors to manually correct the errors and add the missing
entities. This much has been done.
Now for the project in this
course, we will start from this data (i.e., verified named entities,
plus the original abstracts). See below.
Some important notes:
- We will rely on Dr. An's expertise to
guide us as to the 'findings' he is interested in.
- An issue that we will need to address is
what part of the abstract should be processed by our system. We may
need a preliminary step to identify important sentences in the
abstract, and then focus on these sentences in particular to extract
the findings.
- A finding reported in an
abstract is likely to involve named entities, for at least one of
the components in the causation relation (i.e., either X or Y, or
both, in "X caused Y" for example).
- The relation that expresses
causation is oftentimes a verb (e.g. "caused", "influenced").
Note that the set of such verbs is not fixed -- there are many ways to
express causation. So we will start by identifying 'patterns'
which express causation.
- Also note that the
pre-processed data we have so far is only a list of named entities.
So our first task is to match/map the named entities with the
original abstracts, then extract the sentence fragments which
express causation between named entities.
- In order to find those
fragments, we will use a shallow parser (probably the Stanford Parser)
to extract the syntactic phrases that involve named entities (as
subject/object nouns). Below is an example output of the Stanford parser
(where the input sentence was "MM-CLINICAL-RESEARCH have documented
poor outcomes in MM-SEPTICEMIA."). Note that some words are
already tagged with POS (e.g. "MM-CLINICAL-RESEARCH/NP")
in this example.
Stanford Partial Parser
Loading parser from serialized file englishPCFG.ser.gz ... done [6.1 sec].
Parsing [sent. 1 len. 8]: [MM-CLINICAL-RESEARCH/NP, have, documented, poor, outcomes, in, MM-SEPTICEMIA/NN, .]
(ROOT
  (S
    (NP (NNP MM-CLINICAL-RESEARCH/NP))
    (VP (VBP have)
      (VP (VBN documented)
        (NP (JJ poor) (NNS outcomes))
        (PP (IN in)
          (NP (NNP MM-SEPTICEMIA/NN)))))
    (. .)))
The output of the Stanford
Partial Parser is a "parse tree" (i.e., a graph of the syntactic
structure of the sentence), as well as certain "dependencies" which
the parser extracts (not shown above).
- One more note. A
relation verb is also a semantic concept in Metamap. So after
extracting the fragments, we will eventually have to go back to
Metamap and find the semantic type of the relation.
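As a rough illustration of how the parser output shown earlier could be read programmatically, here is a minimal sketch in Python. It is not the project's actual code: the function names are hypothetical, and it assumes that named entities can be recognized by the "MM-" prefix seen in the example sentence and that verbs carry Penn-style tags starting with "VB". It reads the bracketed tree, then lists the named entities and the verbs found in it.

```python
import re

# The parse tree from the example above, copied as a string.
TREE = """
(ROOT
  (S
    (NP (NNP MM-CLINICAL-RESEARCH/NP))
    (VP (VBP have)
      (VP (VBN documented)
        (NP (JJ poor) (NNS outcomes))
        (PP (IN in)
          (NP (NNP MM-SEPTICEMIA/NN)))))
    (. .)))
"""


def parse_bracketed(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()


def preterminals(tree):
    """Yield (POS tag, word) pairs in sentence order."""
    label, children = tree
    for child in children:
        if isinstance(child, str):
            yield (label, child)
        else:
            yield from preterminals(child)


def entities_and_verbs(tree, entity_prefix="MM-"):
    """Collect named entities (assumed to start with entity_prefix) and verbs."""
    entities, verbs = [], []
    for tag, word in preterminals(tree):
        if word.startswith(entity_prefix):
            entities.append(word.split("/")[0])  # drop the "/NP" style suffix
        elif tag.startswith("VB"):
            verbs.append(word)
    return entities, verbs


entities, verbs = entities_and_verbs(parse_bracketed(TREE))
print(entities)  # ['MM-CLINICAL-RESEARCH', 'MM-SEPTICEMIA']
print(verbs)     # ['have', 'documented']
```

A real pattern for causation would also need to check which entity is the subject and which is the object of the verb; the sketch only shows that the tree can be walked to find the pieces.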
Data we will use (to start):
- 2009_shock_abstracts.csv -- submitted abstracts in a text format
- 2009_ShockNamedEntity.csv -- a list of verified named entities;
each named entity is listed with the number of the abstract in which
it appeared and its byte offsets in that abstract.
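To make the first task (mapping the named entities back onto the abstracts) more concrete, here is a minimal sketch in Python. It is only an illustration under several assumptions: the column layout of the CSV files is not described here, so the code works on an abstract string and a list of (start, end) offsets that would first have to be read from the files; offsets are treated as character positions with an exclusive end (which matches byte offsets only for plain ASCII text); sentences are split naively on end punctuation; and the sample abstract is made up. All function names are hypothetical.

```python
import re


def sentence_spans(text):
    """Return (start, end) spans of sentences, split naively on '.', '!' or '?'."""
    spans = []
    start = 0
    for match in re.finditer(r"[.!?](?:\s+|$)", text):
        spans.append((start, match.end()))
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))  # trailing text without end punctuation
    return spans


def candidate_fragments(text, entities):
    """Return (left entity, text between, right entity) triples.

    Only entities that fall inside the same sentence and are neighbours
    in that sentence are paired.
    """
    results = []
    for s_start, s_end in sentence_spans(text):
        inside = sorted((a, b) for a, b in entities if s_start <= a and b <= s_end)
        for (a1, b1), (a2, b2) in zip(inside, inside[1:]):
            results.append((text[a1:b1], text[b1:a2].strip(), text[a2:b2]))
    return results


# Made-up sample abstract and offsets, for illustration only.
SAMPLE = "Drug A reduced sepsis in mice. Controls were untreated."
OFFSETS = [(0, 6), (15, 21)]  # "Drug A" and "sepsis"

print(candidate_fragments(SAMPLE, OFFSETS))
# [('Drug A', 'reduced', 'sepsis')]
```

The text between two neighbouring entities is only a first guess at the relation fragment; the parser-based step described in the notes would be needed to confirm that the entities are actually the subject and object of the verb.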
Finally, for your interest, here are the
abstracts from the 2005 conference:
"Shock2005Abstracts.pdf"