### Computational Scientific Discovery Overview

Computational Scientific Discovery ("CSD") is a small but growing field concerned with developing algorithms and software to help practicing scientists build scientific models. It borrows from (but is distinct from) the fields of:

• science,
• the philosophy of science, and
• computer science

From Science it borrows:

From the Philosophy of Science it borrows:

• Theories of explanation
• Theories of prediction
• Theories of model preference

From Computer Science (especially Artificial Intelligence) it borrows:

### Processes

CSD differs from other domains that use knowledge representation and reasoning in that it has a well-developed language for describing change. Of course, describing change has been important in Artificial Intelligence since Blocks World planners, but CSD has developed the notion of processes.

Processes group together the information needed to describe change. As in "classical artificial intelligence", this includes descriptions of the objects that can undergo the change, along with preconditions and postconditions. However, it also includes the notions of continuity that scientists often use in the form of differential equations. In fact, CSD in large part grew out of the development of a series of ever-more-powerful equation finders for which differential equations were a testbed.
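To make the idea concrete, here is a minimal sketch in Python (my own illustration, not the notation of any CSD system): a process bundles the object types it applies to, discrete preconditions, and a continuous law given as a differential equation, which a simple Euler integrator can then simulate. All names here (`Process`, `euler_step`, the free-fall fields) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Process:
    """A toy process record: the kinds of objects it applies to, discrete
    preconditions, and a continuous law given as the right-hand side of a
    differential equation dstate/dt = rhs(state)."""
    name: str
    applies_to: List[str]                         # object types that can undergo the change
    preconditions: List[Callable[[Dict], bool]]
    rhs: Callable[[Dict[str, float]], Dict[str, float]]

def euler_step(proc: Process, state: Dict[str, float], dt: float) -> Dict[str, float]:
    """Advance the state by one Euler step if all preconditions hold."""
    if not all(pre(state) for pre in proc.preconditions):
        return state
    deriv = proc.rhs(state)
    return {k: state[k] + dt * deriv.get(k, 0.0) for k in state}

# Uniform gravitational acceleration as a process: dvelocity/dt = -g, dheight/dt = velocity.
free_fall = Process(
    name="uniform-acceleration",
    applies_to=["rigid-body"],
    preconditions=[lambda s: s["height"] > 0],    # only while above the ground
    rhs=lambda s: {"velocity": -9.8, "height": s["velocity"]},
)

state = {"height": 100.0, "velocity": 0.0}
for _ in range(100):                              # simulate 1 second in 10 ms steps
    state = euler_step(free_fall, state, 0.01)
```

After the loop, `state["height"]` is close to the analytic 100 - 0.5·9.8·1² ≈ 95.1 meters, Euler discretization error aside.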

Processes are also inherently compositional in two respects.

1. The first respect in which processes are compositional is that an abstract process can be used as a template for the construction of one or more concrete processes. For example, the abstract equation for uniform acceleration in a gravitational field is:
```
d^2height/dt^2 = -constant
```
It can be instantiated as several concrete equations, including:
• Free fall close to the Earth's surface, where sea level is 0 meters:
```
height(t) = height(0) - 0.5 * (9.8 meters/second^2) * (t seconds)^2
```
• Free fall close to the Moon's surface, where the average surface level is 0 meters:
```
height(t) = height(0) - 0.5 * (1.6 meters/second^2) * (t seconds)^2
```
2. The second respect is that several basic processes can be combined to form more complex processes. For example, we can combine the equation for uniform acceleration in a gravitational field:
```
d^2height/dt^2 = -constant
```
with one for the velocity-dependent resistance of an object obj moving through a fluid fluid:
```
mass(obj) * d^2height/dt^2 = dragCoefficient(obj) * density(fluid) * (dheight/dt)^2 * crossArea(obj) / 2
```
to form an equation that can model falling in a gravitational field with terminal-velocity wind resistance:
```
gravitationalForce(obj) - dragForce(obj) = 0

gravitationalForce(obj) = dragForce(obj)

(9.8 meters/second^2) * mass(obj) = dragCoefficient(obj) * density(fluid) * (dheight/dt)^2 * crossArea(obj) / 2

dheight/dt = sqrt(2 * (9.8 meters/second^2) * mass(obj) / (dragCoefficient(obj) * density(fluid) * crossArea(obj)))
```
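The terminal-velocity expression can be sanity-checked numerically. A small Python sketch, with skydiver-like numbers that are invented purely for illustration:

```python
from math import sqrt

def terminal_velocity(mass, drag_coefficient, fluid_density, cross_area, g=9.8):
    """Terminal velocity: the speed at which gravity balances drag,
    m*g = Cd * rho * v^2 * A / 2, solved for v."""
    return sqrt(2 * mass * g / (drag_coefficient * fluid_density * cross_area))

# Hypothetical values: an 80 kg object, Cd = 1.0, air at 1.2 kg/m^3, 0.7 m^2 cross-section.
v = terminal_velocity(80.0, 1.0, 1.2, 0.7)   # roughly 43 meters/second
```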

### My Approach

Just as CSD has extended "classical artificial intelligence" with a richer, process-modeled notion of change, my research extends "classical" computational scientific discovery with more knowledge. Specifically, I am interested in adding cultural knowledge and explanatory awareness to CSD.

Some of the kinds of cultural knowledge that I represent include:

• Definitions. For example:
  • meters measure length,
  • there are 100 centimeters in 1 meter.
• Expectations. For example, many scientists believe that they themselves are both in and of the system being studied, but not special within that system. Specific examples include:
  • Astronomers generally believe that the Earth is not in a privileged position in the Universe (such as its center).
  • Chemists generally believe that the types of atoms and molecular bonds used in living things are the same as those used in non-living things.
  • Biologists generally believe that H. sapiens is one of many evolving species.
• Analytical knowledge. For example:
  • the Pythagorean Theorem,
  • how to change from Cartesian coordinates into spherical or polar coordinates.
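The analytical knowledge in the last bullet is easy to illustrate. A small Python sketch (the function names are mine, not from any Scienceomatic): converting between Cartesian and polar coordinates, with the radius coming straight from the Pythagorean Theorem.

```python
from math import atan2, cos, hypot, sin

def cartesian_to_polar(x, y):
    """(x, y) -> (r, theta). The radius is the Pythagorean Theorem at work."""
    return hypot(x, y), atan2(y, x)

def polar_to_cartesian(r, theta):
    """Inverse transform, so the two functions round-trip."""
    return r * cos(theta), r * sin(theta)

r, theta = cartesian_to_polar(3.0, 4.0)   # the classic 3-4-5 right triangle
x, y = polar_to_cartesian(r, theta)       # recovers (3.0, 4.0)
```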

Having this extra knowledge can serve several purposes:

1. It helps scientists be honest with themselves and with other scientists about their background assumptions.
2. It allows the system to be more confident that the revisions it suggests to the given model will be acceptable to scientists.
3. If the model is well accepted, it lets scientists use the model to double-check the appropriateness of their cultural beliefs.

One application of combining processes with cultural knowledge is work I did with Ron Edwards, an evolutionary biologist at DePaul University, and Raghuveer Kumarakrishnan, a former DePaul student who earned his Masters in Computer Science. We took at face value the Intelligent Design claim that evolution is improbable because it requires three simultaneous conditions:
1. the geographic isolation of a population, and
2. the production by mutation of a new trait, and
3. the superior fitness of that new trait relative to existing traits
We developed an uncontested biological model of population growth (one that included, for example, logistic growth and Mendelian genetics). The model had geographic isolation only (the Founder Effect, or, when the isolation follows regrowth after an extreme die-off, the Bottleneck Effect); in particular, it had no mutation of any kind. We then looked at the change in prevalence of a trait of intermediate (not maximal) fitness, and found that geographic isolation alone was enough for that prevalence to increase a statistically significant number of times. This work is detailed here.
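This is not the model from that work, but the founder-effect setup can be caricatured in a few lines of Python. Everything below — the population sizes, the trait encoding, the crude growth rule — is invented for illustration; the one faithful constraint is that no mutation occurs anywhere.

```python
import random

def founder_drift(pop_alleles, founders, generations, carrying_capacity, seed=0):
    """Minimal sketch: draw a small founder group from a parent population
    (geographic isolation, no mutation anywhere), then let it regrow under a
    crude logistic cap while trait prevalence drifts by random sampling."""
    rng = random.Random(seed)
    pop = rng.sample(pop_alleles, founders)            # the isolated founders
    for _ in range(generations):
        growth = min(len(pop) * 2, carrying_capacity)  # crude logistic growth
        pop = [rng.choice(pop) for _ in range(growth)] # offspring resample parents
    return pop.count("a") / len(pop)                   # prevalence of trait "a"

# Parent population: trait "a" starts at 30% prevalence.
parent = ["a"] * 30 + ["A"] * 70
freq = founder_drift(parent, founders=5, generations=20, carrying_capacity=200)
```

Running this over many seeds shows the founder prevalence of "a" drifting well away from 30% with no mutation involved, which is the qualitative point of the study.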

Explanatory awareness is the second way in which my research extends "classical" Computational Scientific Discovery. One of the jobs of scientists is to build a self-consistent web of mutually supportive (and even redundant) explanations. Oftentimes several different explanations for a phenomenon are possible. Some of these explanations agree with each other, while others contradict.

My research supports scientists in this regard by representing:

• Justification traces, which trace the paths taken through the knowledge base to answer questions. A justification trace embodies both an "explanation" of the answer and a Lisp program that can recompute it. (A Lisp program is returned because the reasoning may be stochastic: the same program may give a distribution of answers, depending on a random number generator.)
• Justification stories, which compare two or more justification traces and argue why some are more acceptable than others. Justification stories are meant to be loosely analogous to "conference papers": records of a number of approaches that might have been (or even were) taken, of which a proper subset (ideally just one) is superior to the others.
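As a rough illustration of a justification trace (in Python rather than Lisp, and with a toy deterministic question; all names here are hypothetical): answering a unit-conversion question can yield the answer, a step-by-step trace of the knowledge used, and a closure that stands in for the returned program.

```python
def convert(knowledge, value, src, dst):
    """Answer 'how many dst is value src?' by chaining known conversion
    factors. Returns (answer, trace, program): the trace records each fact
    used, and `program` is a closure that recomputes the answer on demand
    (playing the role of the returned Lisp program)."""
    trace = []
    unit, factor = src, 1.0
    while unit != dst:
        next_unit, f = knowledge[unit]            # follow one known conversion
        trace.append(f"1 {unit} = {f} {next_unit}")
        factor *= f
        unit = next_unit
    program = lambda v: v * factor                # reusable "program" for the answer
    return program(value), trace, program

# A two-fact knowledge base: kilometers -> meters -> centimeters.
KB = {"kilometers": ("meters", 1000.0), "meters": ("centimeters", 100.0)}
answer, trace, program = convert(KB, 2.0, "kilometers", "centimeters")
# answer is 200000.0; trace lists the two conversion facts that justify it.
```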

### The Scienceomatic

Over the course of my research I have developed a series of increasingly powerful programs for CSD. They are collectively, and unrepentantly, named the Scienceomatic.

Here is a brief history of the Scienceomatic research campaign. Please note: the name of the computer programs is "Scienceomatic". However, since the Scienceomatic V I have distinguished the name of the programming environment from the name of the language that the environment implements. The language is called "Scirep" (or "SciRep"), for Science Representation.

**Scienceomatic I (1997)**

An inferior frame-based system that was all representation and little reasoning. It was never implemented.
**Scienceomatic II (1998?)**

I've forgotten its details. They must not have been too impressive, because I soon moved on to the Scienceomatic III.
**Scienceomatic III (1998-2000)**

Basically a system with a language of "things" and a built-in, information-theoretically inspired model preference algorithm. Its language described things with a combination of equations (for numeric knowledge), decision trees (for symbolic knowledge) and frame `is-a` inheritance (for either).

The system was written in C++ and used CGI scripting to implement a web-based interface. All Scienceomatics since the Scienceomatic III have been written to be full environments, in the sense that they may be used for either prediction or knowledge discovery. I used the Scienceomatic III to do data-mining knowledge re-discovery in a seismological database, and earned my Ph.D. with it.

**Scienceomatic IV (2000-2001)**

An early attempt to get around the limitations of the Scienceomatic III that I identified in my thesis. While the Scienceomatic III dealt with data and theory, the Scienceomatic IV added an "in-between" layer of laws (later renamed generalizations). Soon I appreciated the usefulness of a theory/expectation component, as well as a mathematics one (later renamed analytical). I also decided to build around Prolog.

These revisions happened so quickly that the Scienceomatic IV was never implemented. Instead effort went into the design and manufacture of the Scienceomatic V.

**Scienceomatic V (2002-2007)**

A Prolog-inspired language of "things", processes and cultural knowledge. Cultural and domain knowledge was supported by divvying the knowledge base up into five "components":

• Definition/expectation: for definitions, expectations and assumptions.
• Observation: for observations (e.g. Tycho Brahe's planetary data)
• Generalization: for generalizations of data that are explainable by theory (e.g. Kepler's Laws)
• Theory: for high-level theory (e.g. Newton's Laws of motion and Gravitation)
• Analytics: for analytical knowledge

Combining processes and cultural knowledge gave the Scienceomatic V the ability to do simulations that compared the "culture" of conventional evolution with that of Intelligent Design. Press here for a summary, or here for more details.

• It was designed to support a more general information-theoretic model preference algorithm than the Scienceomatic III's, which is described here.
• It had a Java-implemented GUI that made it easier to use.
• It had a similar equation/decision tree/frame-inheritance notation to the Scienceomatic III's, except that the equation and decision tree notations were unified. This gave it the ability to describe "split" equations like:
```
            height(0)                 (t <= 0)
height(t) = height(0) - 0.5*g*t^2     (0 < t <= tmax)
            0                         (tmax < t)
```
with just one expression.
• It kept meta-data associated with all numbers and concepts to keep track of the dimensions and units of all values. It used this meta-data for automatic unification and error checking. For example:
  • `(100 centimeters == 1 meter)` yields `true`
  • `(101 centimeters == 1 meter)` yields `false`
  • `(100 centimeters == 1 second)` yields an error
• Deep in the bowels of the Scienceomatic V lurked a Prolog interpreter. Knowledge that was ugly to represent in terms of equations, decision trees, or frame inheritance could be stated in terms of Prolog sentences.
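A toy illustration of metadata-carrying values, matching the three examples above (this is my own Python sketch, not Scirep notation): each number keeps its unit and dimension, comparisons convert units within a dimension, and comparing across dimensions raises an error rather than quietly returning false.

```python
class Quantity:
    """A number that carries unit/dimension metadata. Comparisons convert
    units within a dimension; comparing across dimensions is an error."""
    SCALE = {"meter": 1.0, "centimeter": 0.01, "second": 1.0}
    DIM = {"meter": "length", "centimeter": "length", "second": "time"}

    def __init__(self, value, unit):
        self.value, self.unit = value, unit

    def __eq__(self, other):
        if self.DIM[self.unit] != self.DIM[other.unit]:
            raise TypeError(f"cannot compare {self.unit} with {other.unit}")
        # Convert both sides to the base unit of their shared dimension.
        return self.value * self.SCALE[self.unit] == other.value * self.SCALE[other.unit]

Quantity(100, "centimeter") == Quantity(1, "meter")   # True
Quantity(101, "centimeter") == Quantity(1, "meter")   # False
# Quantity(100, "centimeter") == Quantity(1, "second")  would raise TypeError
```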

**Scienceomatic 6 (2007-2008)**

The Scienceomatic 6 built upon the Scienceomatic V's Prolog infrastructure to add justification traces, an early form of explanatory awareness. It also improved the Scienceomatic V's notation for differential equations.

I spent several months designing it but never implemented it, because I realized that Lisp might be a better language than Prolog for representing explanatory traces and constructed programs. This realization naturally led to the Scienceomatic 7A.

**Scienceomatic 7A (2008)**

The Scienceomatic 7A broke from the Prolog tradition present since 2001-2002 in the Scienceomatic IV/Scienceomatic V. Instead, it was designed to use Lisp's ability to define new Lisp functions on the fly to represent stochastic justification traces. Another advantage of freeing its notation from Prolog's sentence-oriented notation was that a more natural, Java-like class/object notation could be developed. Otherwise it built upon the process and cultural knowledge principles pioneered in the Scienceomatic V and Scienceomatic 6.

I started implementing the Scienceomatic 7A but abandoned it when I realized I should abandon C++ as the main implementation language.

**Scienceomatic 7B (2008-2009Mar)**

In the interest of universality I finally broke with C++ as the computational-engine implementation language used from the Scienceomatic III all the way through to the Scienceomatic 7A. The "computational engine" of the Scienceomatic 7B is written in Lisp, and the Lisp interpreter is (or will be) implemented in Java. Relying on the Java Virtual Machine will give the Scienceomatic 7B both:

• the ability to run on multiple platforms, not just the Unix-based ones for which the Scienceomatic III - Scienceomatic 7A were designed, and
• the ability to use Java's windowing primitives for GUI support

The Scienceomatic 7B further extends explanatory awareness by also supporting justification stories in addition to justification traces.

**Scienceomatic 7C and 7D (2009Mar-2009Dec)**

The Lisp-Java combination of the SOM 7B made sharing some objects unnatural. Also, Lisp programming can be a pain-in-the-ass if one wants relatively simple data structures, like red-black self-balancing binary trees.

Versions 7C and 7D were to be written in Java exclusively.

**Scienceomatic 8A (2010Jan-)**

Two important breakthroughs occurred in Jan/Feb of 2010:

1. The first was solving how to handle meta-data manipulation, via the invention of multiple addition operators. Prior to this I had complicated rules for which pieces of meta-data were taken from which operand when constructing a result, and those rules had nasty associativity consequences. A cleaner solution is to define several types of addition operator:
  • `+grp` — Grouping addition: adding the heights of the ten tallest buildings in a city and dividing by ten gives you information about no real building.
  • `+ext` — Extending addition: putting 20 more liters into a tub that already holds 30 liters extends the volume in the tub.
  • `+dGrp` — Delta grouping addition:
  • `+dExt` — Delta extending addition: adding 1 ml to a tub that already holds 30 liters semantically gives you the same volume, slightly perturbed.
All four operators "add" the same way: 2 + 2 = 4 under each of them. Their only difference is how they manipulate meta-data. (Other operators may have multiple versions defined for them too.) This work is discussed here.
2. The second came when Chris Diteresi, a Ph.D. graduate of IIT and the University of Chicago, told me of the idea of "integration without unification". Within even one sufficiently developed science, multiple high-level frameworks for how knowledge should be organized can exist. For example, within Biology one may start from an evolutionary approach and build all explanations from that viewpoint, or one may start from a developmental approach and build all explanations from that viewpoint. Rather than a priori forcing one viewpoint to be subservient to the other, one could:
• establish enough common ontology for all the sides to agree on basic terminology
• when specific opportunities for collaboration come about, collaborate (integrate), but with each approach coming at the problem from its own viewpoint and getting something different out of it (without unification).

Both of these ideas are being incorporated into the Scienceomatic 8A.
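The first breakthrough, the multiple addition operators, can be caricatured in Python. The dict-based metadata and the exact fields below are my invention, and `+dGrp` is left out because its description is not given above; the point is only that every operator computes the same numeric sum while handling meta-data differently.

```python
def add_grp(a, b):
    """Grouping addition: the numeric sum describes no real individual,
    so referent metadata is dropped."""
    return {"value": a["value"] + b["value"], "referent": None}

def add_ext(a, b):
    """Extending addition: b's quantity is poured into a, so the result
    keeps a's referent (the same tub, now fuller)."""
    return {"value": a["value"] + b["value"], "referent": a["referent"]}

def add_dext(a, b):
    """Delta extending addition: a tiny extension; same referent, but the
    result is flagged as a perturbation of the original value."""
    return {"value": a["value"] + b["value"],
            "referent": a["referent"], "perturbed": True}

tub = {"value": 30.0, "referent": "tub-1"}
more = {"value": 20.0, "referent": "bucket-1"}

extended = add_ext(tub, more)    # 50.0 liters, still referring to tub-1
grouped = add_grp(tub, more)     # 50.0 liters, referring to nothing real
```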