- Step 1:
Text is fed into MetaMap. Goal: each named entity in the
text should be replaced with its named entity category, plus
the entity's part of speech.
The categories are:
- Drug/Chemical Compound/Therapeutic Modality
- Molecule
- Cell Type
- Condition
- Experimental Platform
Example input and output:
Input: Development of sepsis or septic shock in patients
significantly increases mortality.
Output: Development of MM-CONDITION/NN or MM-CONDITION/NN in patients
significantly increases MM-CONDITION/NN .
MetaMap finds and tags named entities. There are 3 problems with
the MetaMap output:
- they are too specific; we are only interested in the
5 categories listed above. Emilia should have
a list of all the MetaMap categories and which of the 5
more general categories these correspond to.
- It is not clear how to automatically extract the MetaMap tags from
the system's output. We'll have to investigate how to make MetaMap
produce output which is more useful for us.
- Also, we'll have to make sure that the part of speech tags
generated by MetaMap match the tags used by the Stanford Partial
Parser.
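A minimal sketch of the replacement in Step 1, assuming the MetaMap tags
have already been extracted somehow. The mapping table, the function name
tag_entities, and the hand-written mentions are all hypothetical; the real
table is the one Emilia has, and the semantic types shown are illustrative
rather than actual MetaMap output.

```python
# Hypothetical mapping from MetaMap semantic-type abbreviations to our
# general categories. The real table is the one Emilia maintains.
CATEGORY_OF = {
    "dsyn": "CONDITION",
    "fndg": "CONDITION",
    "phsu": "DRUG",
    "aapp": "MOLECULE",
    "cell": "CELL-TYPE",
}


def tag_entities(text, mentions):
    """Replace each mention with MM-<CATEGORY>/<POS>.

    mentions: (surface string, semantic type, part of speech) tuples,
    listed in the order in which they occur in the text. Mentions that
    are not found, or whose type has no general category, are skipped.
    """
    out, cursor = [], 0
    for surface, semtype, pos in mentions:
        start = text.find(surface, cursor)
        category = CATEGORY_OF.get(semtype)
        if start < 0 or category is None:
            continue
        out.append(text[cursor:start])
        out.append(f"MM-{category}/{pos}")
        cursor = start + len(surface)
    out.append(text[cursor:])
    return "".join(out)


sentence = ("Development of sepsis or septic shock in patients "
            "significantly increases mortality.")
# Hand-written stand-ins for what we hope to extract from MetaMap.
mentions = [
    ("sepsis", "dsyn", "NN"),
    ("septic shock", "dsyn", "NN"),
    ("mortality", "fndg", "NN"),
]
tagged = tag_entities(sentence, mentions)
print(tagged)
```

Tokenization (for example, separating the final period) is left out of
this sketch.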
- Step 2:
Output from Step 1 is fed into the Stanford Dependency
Parser (aka Stanford Partial Parser).
The Stanford parser can handle partially processed input. For
example, the named entities identified by MetaMap (and tagged with
their part of speech) are not analyzed further by the Stanford parser.
The part of speech of the named entity determines how it is combined
with the surrounding words in the sentence. For example:
Parsing [sent. 1 len. 11]: [Development, of, MM-Condition\/NP,
or, MM-Condition\/NP, in, patients, significantly, increases,
MM-Condition\/NP, .]
(ROOT
  (S
    (NP
      (NP (NNP Development))
      (PP (IN of)
        (NP
          (NP (NN MM-Condition\/NP)
            (CC or)
            (NN MM-Condition\/NP))
          (PP (IN in)
            (NP (NNS patients))))))
    (ADVP (RB significantly))
    (VP (VBZ increases)
      (NP (NN MM-Condition\/NP)))
    (. .)))
nsubj(increases-9, Development-1)
prep_of(Development-1, MM-Condition\/NP-3)
prep_of(Development-1, MM-Condition\/NP-5)
conj_or(MM-Condition\/NP-3, MM-Condition\/NP-5)
prep_in(MM-Condition\/NP-3, patients-7)
advmod(increases-9, significantly-8)
dobj(increases-9, MM-Condition\/NP-10)
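Before Step 3 can use these dependencies, they have to be read back into
a structured form. The following is a rough sketch of how that might be
done, assuming the parser prints one dependency per line in exactly the
format shown above. The names Dependency and parse_dependency are our own
and not part of the Stanford parser.

```python
import re
from typing import NamedTuple


class Dependency(NamedTuple):
    relation: str
    governor: str
    governor_index: int
    dependent: str
    dependent_index: int


# relation(governor-N, dependent-M). Tokens may themselves contain
# hyphens, so the index is taken to be the digits after the last hyphen.
_DEP_LINE = re.compile(r"^(\w+)\((.+)-(\d+), (.+)-(\d+)\)$")


def parse_dependency(line):
    """Turn one printed dependency line into a Dependency tuple."""
    match = _DEP_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a dependency line: {line!r}")
    rel, gov, gov_i, dep, dep_i = match.groups()
    return Dependency(rel, gov, int(gov_i), dep, int(dep_i))


lines = [
    r"nsubj(increases-9, Development-1)",
    r"prep_of(Development-1, MM-Condition\/NP-3)",
    r"conj_or(MM-Condition\/NP-3, MM-Condition\/NP-5)",
]
deps = [parse_dependency(line) for line in lines]
for dep in deps:
    print(dep)
```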
- Step 3:
This step is where the bulk of the work in this project will take
place. The output of the Stanford Parser will be fed into a
(still to be developed) pattern matcher or rule-based system.
This system will try to match either the sentence structure or the
dependencies against patterns (to be written by us) to produce a set
of propositions expressed by the sentence.
So there are two major tasks to complete this step:
- Design and implement the pattern matcher / rule-based system
- Write the patterns or rules that can be used by the matcher
to identify the propositions that are expressed by the sentence
The exact format of these propositions has not yet been determined,
and we will have to rely on Dr. Ong's expertise to decide what
these should be. However, the above example illustrates the kind of
rules/patterns that might be needed to extract propositions from the
input:
nsubj(increases-?a, Development-?b)
prep_of(Development-?b, MM-Condition\/NP-?x)
dobj(increases-?a, MM-Condition\/NP-?y)
=>
increases(MM-Condition\/NP-?x, MM-Condition\/NP-?y)
Or in this case,
increases(sepsis, mortality)
increases(septic shock, mortality)
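One possible shape for the matcher is sketched below. This is not a
design decision: it is a small backtracking matcher that binds the ?
variables of the rule above against the dependencies from Step 2. The
function names, the tuple layout, and the entity_at table (which assumes
Step 1 remembers which original text each placeholder replaced) are all
assumptions made for illustration.

```python
def unify(pattern, fact, bindings):
    """Return bindings extended so that pattern equals fact, else None."""
    if len(pattern) != len(fact):
        return None
    result = dict(bindings)
    for p, f in zip(pattern, fact):
        if isinstance(p, str) and p.startswith("?"):
            # A variable: bind it, or check it against its earlier binding.
            if result.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return result


def match_rule(patterns, facts, bindings=None):
    """Yield every binding under which all patterns match some fact."""
    bindings = bindings or {}
    if not patterns:
        yield bindings
        return
    first, rest = patterns[0], patterns[1:]
    for fact in facts:
        extended = unify(first, fact, bindings)
        if extended is not None:
            yield from match_rule(rest, facts, extended)


ENTITY = r"MM-Condition\/NP"
# (relation, governor, governor index, dependent, dependent index)
facts = [
    ("nsubj", "increases", 9, "Development", 1),
    ("prep_of", "Development", 1, ENTITY, 3),
    ("prep_of", "Development", 1, ENTITY, 5),
    ("conj_or", ENTITY, 3, ENTITY, 5),
    ("prep_in", ENTITY, 3, "patients", 7),
    ("advmod", "increases", 9, "significantly", 8),
    ("dobj", "increases", 9, ENTITY, 10),
]
rule = [
    ("nsubj", "increases", "?a", "Development", "?b"),
    ("prep_of", "Development", "?b", ENTITY, "?x"),
    ("dobj", "increases", "?a", ENTITY, "?y"),
]
# Assumed to be recorded in Step 1: token index -> original entity text.
entity_at = {3: "sepsis", 5: "septic shock", 10: "mortality"}

propositions = [
    f"increases({entity_at[b['?x']]}, {entity_at[b['?y']]})"
    for b in match_rule(rule, facts)
]
print("\n".join(propositions))
```

Because ?a and ?b are shared between patterns, the rule only fires when
the subject, the prep_of attachment, and the object all hang off the same
two words, which is what ties each condition to "increases".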