Proposed Shock Information Extraction System

Here are the steps in the processing of abstracts by our proposed system.
  1. Step 1: Text is fed into MetaMap. Goal: replace each named entity in the text with its named entity category, tagged with the entity's part of speech. The categories are:

    1. Drug/Chemical Compound/Therapeutic Modality
    2. Molecule
    3. Cell Type
    4. Condition
    5. Experimental Platform

    Example input and output

    Input: Development of sepsis or septic shock in patients significantly increases mortality.

    Output: Development of MM-CONDITION/NN or MM-CONDITION/NN in patients significantly increases MM-CONDITION/NN .

    MetaMap finds and tags named entities. There are 3 problems with the MetaMap tags:

    1. They are too specific; we are only interested in the 5 categories listed above. Emilia should have a list of all the MetaMap categories and which of the 5 more general categories each one corresponds to.
    2. It is not clear how to automatically extract the MetaMap tags from the system's output. We'll have to investigate how to make MetaMap produce output that is more useful for us.
    3. We'll have to make sure that the part of speech tags generated by MetaMap match the tags used by the Stanford Partial Parser.
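
One way to handle the first problem is a lookup table from MetaMap semantic types to the five general categories. The sketch below is only an illustration. The table entries, the category names other than CONDITION, the `span` and `tag_entities` helpers, and the hand-written entity spans are all assumptions; the real table would come from Emilia's list, and the semantic types given to the example entities are guesses rather than actual MetaMap output. The code does not call MetaMap itself.

```python
# Hypothetical mapping from UMLS semantic type abbreviations (as reported by
# MetaMap) to the five general categories. The real table is Emilia's list.
SEMTYPE_TO_CATEGORY = {
    "dsyn": "CONDITION",   # Disease or Syndrome
    "patf": "CONDITION",   # Pathologic Function
    "fndg": "CONDITION",   # Finding
    "phsu": "DRUG",        # Pharmacologic Substance
    "aapp": "MOLECULE",    # Amino Acid, Peptide, or Protein
    "cell": "CELL-TYPE",   # Cell
    "lbpr": "PLATFORM",    # Laboratory Procedure
}


def span(text, phrase, semtype, pos):
    """Build a (start, end, semtype, pos) tuple for the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), semtype, pos)


def tag_entities(text, entities):
    """Replace each entity span with MM-<CATEGORY>/<pos>.

    Spans with an unmapped semantic type, or overlapping an earlier span,
    are left as plain text.
    """
    out = []
    cursor = 0
    for start, end, semtype, pos in sorted(entities):
        category = SEMTYPE_TO_CATEGORY.get(semtype)
        if category is None or start < cursor:
            continue
        out.append(text[cursor:start])
        out.append(f"MM-{category}/{pos}")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


text = ("Development of sepsis or septic shock in patients "
        "significantly increases mortality.")
# Hand-written spans standing in for MetaMap output (semantic types are guesses).
entities = [
    span(text, "sepsis", "dsyn", "NN"),
    span(text, "septic shock", "dsyn", "NN"),
    span(text, "mortality", "fndg", "NN"),
]
tagged = tag_entities(text, entities)
print(tagged)
```

Keeping the table as plain data means the category assignments can be revised without touching the tagging code.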

  2. Step 2: Output from Step 1 is fed into Stanford Dependency Parser (aka Stanford Partial Parser)

    The Stanford parser can handle partially processed input. For example, the named entities identified by MetaMap (and tagged with their part of speech) are not analyzed further by the Stanford parser. The part of speech of the named entity determines how it is combined with the surrounding words in the sentence. For example:

    Parsing [sent. 1 len. 11]: [Development, of, MM-Condition\/NP, or, MM-Condition\/NP, in, patients, significantly, increases, MM-Condition\/NP, .]

    (ROOT (S (NP (NP (NNP Development)) (PP (IN of) (NP (NP (NN MM-Condition\/NP) (CC or) (NN MM-Condition\/NP)) (PP (IN in) (NP (NNS patients)))))) (ADVP (RB significantly)) (VP (VBZ increases) (NP (NN MM-Condition\/NP))) (. .)))

    nsubj(increases-9, Development-1)
    prep_of(Development-1, MM-Condition\/NP-3)
    prep_of(Development-1, MM-Condition\/NP-5)
    conj_or(MM-Condition\/NP-3, MM-Condition\/NP-5)
    prep_in(MM-Condition\/NP-3, patients-7)
    advmod(increases-9, significantly-8)
    dobj(increases-9, MM-Condition\/NP-10)

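
Before any patterns can be matched, the dependency list printed by the parser has to be turned into structured data. The following is a minimal sketch of that conversion, assuming the dependencies arrive as text in the `relation(governor-index, dependent-index)` form shown in the example. The `Dependency` type, the `parse_dependencies` function, and the regular expression are illustrative names and choices, not part of the Stanford parser.

```python
import re
from typing import NamedTuple


class Dependency(NamedTuple):
    rel: str      # relation name, e.g. "nsubj"
    gov: str      # governor token
    gov_idx: int  # governor position in the sentence
    dep: str      # dependent token
    dep_idx: int  # dependent position in the sentence


# relation(governor-N, dependent-M); tokens may themselves contain hyphens,
# so the position is taken to be the digits just before ", " or ")".
DEP_RE = re.compile(r"(\w+)\((.+?)-(\d+), (.+?)-(\d+)\)")


def parse_dependencies(text):
    """Extract every typed dependency found in the given parser output."""
    return [
        Dependency(m[1], m[2], int(m[3]), m[4], int(m[5]))
        for m in DEP_RE.finditer(text)
    ]


PARSER_OUTPUT = r"""
nsubj(increases-9, Development-1)
prep_of(Development-1, MM-Condition\/NP-3)
prep_of(Development-1, MM-Condition\/NP-5)
conj_or(MM-Condition\/NP-3, MM-Condition\/NP-5)
prep_in(MM-Condition\/NP-3, patients-7)
advmod(increases-9, significantly-8)
dobj(increases-9, MM-Condition\/NP-10)
"""

deps = parse_dependencies(PARSER_OUTPUT)
for d in deps:
    print(d)
```

The escaped slash in `MM-Condition\/NP` is kept exactly as the parser printed it, so the tokens here match the ones used in the rules of Step 3.
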
  3. Step 3: This step is where the bulk of the work in this project will take place. The output of the Stanford Parser will be fed into a (still to be developed) pattern matcher or rule-based system. This system will try to match either the sentence structure or the dependencies against patterns (to be written by us) to produce a set of propositions expressed by the sentence.

    So there are two major tasks to complete this step:

    1. Design and implement the pattern matcher / rule-based system
    2. Write the patterns or rules that can be used by the matcher to identify the propositions that are expressed by the sentence

    The exact format of these propositions has not yet been determined, and we will have to rely on Dr. Ong's expertise to decide what these should be. However, the above example illustrates the kind of rules/patterns that might be needed to extract propositions from the input:

    nsubj(increases-?a, Development-?b)
    prep_of(Development-?b, MM-Condition\/NP-?x)
    dobj(increases-?a, MM-Condition\/NP-?y)
    => increases(MM-Condition\/NP-?x, MM-Condition\/NP-?y)

    Or in this case:

    increases(sepsis, mortality)
    increases(septic shock, mortality)
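
Since the matcher is still to be designed, the following is only a rough sketch of how a rule like the one above could be applied to the example dependencies. It assumes that dependencies are held as `(relation, governor, governor_index, dependent, dependent_index)` tuples, that a rule is a list of the same shape with variable names in place of indices, and that a table from token position back to the original entity text is kept from Step 1. The names `match_rule`, `RULE`, and `ENTITY_AT` are hypothetical.

```python
COND = r"MM-Condition\/NP"

# Dependencies from the Step 2 example: (rel, gov, gov_idx, dep, dep_idx).
DEPS = [
    ("nsubj", "increases", 9, "Development", 1),
    ("prep_of", "Development", 1, COND, 3),
    ("prep_of", "Development", 1, COND, 5),
    ("conj_or", COND, 3, COND, 5),
    ("prep_in", COND, 3, "patients", 7),
    ("advmod", "increases", 9, "significantly", 8),
    ("dobj", "increases", 9, COND, 10),
]

# The rule's left-hand side, with variable names where the indices would be.
RULE = [
    ("nsubj", "increases", "?a", "Development", "?b"),
    ("prep_of", "Development", "?b", COND, "?x"),
    ("dobj", "increases", "?a", COND, "?y"),
]

# Assumed to be recorded in Step 1: token position -> original entity text.
ENTITY_AT = {3: "sepsis", 5: "septic shock", 10: "mortality"}


def match_rule(patterns, deps, bindings=None):
    """Yield each consistent variable binding that satisfies all patterns."""
    bindings = bindings or {}
    if not patterns:
        yield dict(bindings)
        return
    (rel, gov, gov_var, dep, dep_var), rest = patterns[0], patterns[1:]
    for d_rel, d_gov, d_gi, d_dep, d_di in deps:
        if (d_rel, d_gov, d_dep) != (rel, gov, dep):
            continue
        new = dict(bindings)
        # A variable already bound must keep the same token position.
        if new.setdefault(gov_var, d_gi) != d_gi:
            continue
        if new.setdefault(dep_var, d_di) != d_di:
            continue
        yield from match_rule(rest, deps, new)


# Right-hand side of the rule: increases(?x, ?y), resolved to entity text.
propositions = [
    ("increases", ENTITY_AT[b["?x"]], ENTITY_AT[b["?y"]])
    for b in match_rule(RULE, DEPS)
]
for name, subj, obj in propositions:
    print(f"{name}({subj}, {obj})")
```

Because `?x` can bind to either of the two `prep_of` dependents, the single rule produces both propositions listed above. A real rule language would probably also need wildcards over words and relations, which this sketch leaves out.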