DePaul BioMedicalInformatics Research
Click here for a 32 MB AVI file
The DePaul BioMedicalInformatics Research Lab was created in early 2003 by Dave Angulo. Dave's main research interests lie in Grid Programming, BioMedicalInformatics, and Web Programming (especially XML and Web Services). The lab specializes in BioInformatics and MedicalInformatics research, mainly in applications that are compute-intensive or data-intensive and thus require the resources of Grid Programming. BioInformatics, MedicalInformatics, and Grid Programming are each fertile grounds for research, and the intersection of the three areas is novel and exciting.
For more information on any of these efforts, please contact Dave Angulo at dangulo@cti.depaul.edu
The DePaul BioMedicalInformatics Research Lab is conducting a number of active research activities, which are described below.
Other items of interest:
Dave Angulo is the primary co-founder of the Illinois Bio-Grid, along with collaborators from the Chicago Technology Park, the Supercomputing Center of Chicago, the Argonne National Lab MCS, the University of Chicago BCR and Proteomics lab, the Illinois Institute of Technology CS department, the DePaul Supercomputing Center, and the DePaul Geography department.
The main purpose of the Illinois Bio-Grid is to share compute and software resources amongst the members. It is a computational grid for Grid and BioMedicalInformatics software research and for production BioMedicalInformatics research.
DePaul CTI and the Illinois Bio-Grid are home to many computational resources that support researchers with bioinformatics software. Titan is a Sun Fire 6800 MidFrame Server, a powerhouse incorporating 24 processors. It features the 64-bit UltraSPARC III processor with an 8 MB pipelined-burst level 2 cache, a four-way associative on-chip level 1 cache (64 KB data, 32 KB instruction), and an integrated memory controller capable of addressing up to 16 GB of main memory per processor at 2.4 GB/s. Storage is provided by a Sun T3 StorEdge array with 2 terabytes of disk storage and a fiber optic interconnect to the Sun Fire. Titan (previously Kaveri) is available without charge to researchers at the member institutions of the Illinois Bio-Grid. The Illinois Bio-Grid also has thousands of other processors available for use.
The following software is currently available on Titan. Additionally, the Illinois Bio-Grid team is developing software for use on this machine and other machines on the grid.
For more information, please contact <dangulo@cti.depaul.edu>.
The z-cluster is a high-performance Linux cluster installed at the School of Computer Science, Telecommunications and Information Systems (CTI) of DePaul University in Chicago. It is a core computational facility at CTI.
The z-cluster consists of 20 computational nodes. Each node has a 3.2 GHz P4 processor with Hyper-Threading technology, 2 GB of RAM, and a 200 GB hard disk. The nodes are connected by a Gigabit switch.
Information on the z-cluster can be found at http://z-login.cti.depaul.edu/ganglia/
Genomics is the field of investigating proteins based on their primary structure (DNA nucleic acid sequence or amino acid sequence). Biologists can frequently determine, at low cost, the sequence of amino acids in their proteins (or the nucleic acid sequence of the DNA that encodes the protein). They then often want to look for homologous proteins, that is, proteins with a similar evolutionary origin. Software tools search for homologous proteins in a national database of sequenced proteins: the NCBI's GenBank. If such a protein is found and the protein in GenBank has a known function, the biologists have a good idea of what the function of the new protein probably is.
The growth of data at NCBI (GenBank) has been exponential, and the computation time grows at least as the square of the size of the data. This is quickly outgrowing the capacity of ordinary computers. Additionally, Biologists would like to search for homologous proteins against a batch of input protein sequences (derived from mass spectrometry equipment), finding target proteins that are homologous to all of the input sequences. It is unlikely that the NCBI will ever expand their software to include such functionality because it is so computationally intensive. We are developing a toolkit of such software. This toolkit, called the IBG Workbench, includes FASTA, BLAST, and Smith-Waterman algorithms, all converted to run with batches of input sequences and to run in a distributed environment on the Grid. Parts of this workbench were demonstrated at the SuperComputing 2002 convention and won two of the three Grand Challenge competitions.
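As a rough illustration of the batch search idea, the sketch below scores every query in a batch against every database entry with a basic Smith-Waterman local alignment and keeps only the targets that score well against all of the queries. It is a minimal sketch, not the IBG Workbench code: the function names, the scoring parameters (match 2, mismatch -1, linear gap -2), the threshold, and the tiny in-memory database are all assumptions made for the example.

```python
from typing import Dict, List


def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def targets_matching_all(queries: List[str], database: Dict[str, str],
                         threshold: int) -> List[str]:
    """Names of database entries that meet the threshold for every query."""
    return [name for name, seq in database.items()
            if all(smith_waterman_score(q, seq) >= threshold for q in queries)]


# Hypothetical batch of query sequences and a toy database.
queries = ["HEAGAWGHEE", "PAWHEAE"]
database = {
    "target_1": "PAWHEAEHEAGAWGHEE",
    "target_2": "HEAGAWGHEE",
    "target_3": "MKTLLVAGG",
}
hits = targets_matching_all(queries, database, threshold=12)
print(hits)  # ['target_1']
```

Each query-target pair is scored independently of every other pair, so the work divides naturally across many machines, which is what makes this kind of batch search a candidate for the Grid.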
In a second Genomics project, we are working on a Grid-enabled version of software algorithms that take raw data from a mass spectrometer and calculate the amino acid sequence of the input protein. For example, a Biologist might start with a whole cell digest of some organism. They would inject a sample of the extract into a series of columns, where peptides released from one column are separated on a second column and are then detected and fragmented by the mass spectrometer. The mass spectrometer acquires data at a rate of about 3000 spectra per hour. Massive calculations must be done on each spectrum for de novo sequencing. To handle this huge compute load, we are working on an algorithm that does this in parallel on the Grid. This tool will be part of the IBG Workbench.
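To give a feel for the core idea behind de novo sequencing, the toy sketch below reads a peptide off an idealized fragment ladder: the mass gap between consecutive peaks is matched against the standard monoisotopic residue masses. This is a heavily simplified illustration, not the algorithm under development. It assumes a complete, noise-free ladder of a single ion series, and the function names and the 0.01 Da tolerance are assumptions chosen for the example. Real spectra have missing peaks, noise, and mixed ion types, which is where the heavy computation comes from.

```python
from typing import List

# Standard monoisotopic residue masses in daltons.
# Isoleucine and leucine have identical mass, so only "L" is listed.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(peaks: List[float], tol: float = 0.01) -> str:
    """Translate gaps between consecutive peaks into residues ('?' if none fits)."""
    ordered = sorted(peaks)
    residues = []
    for low, high in zip(ordered, ordered[1:]):
        delta = high - low
        # Pick the residue whose mass is closest to the observed gap.
        name, mass = min(RESIDUE_MASS.items(), key=lambda kv: abs(kv[1] - delta))
        residues.append(name if abs(mass - delta) <= tol else "?")
    return "".join(residues)


def ideal_ladder(peptide: str, start: float = 0.0) -> List[float]:
    """Build a noise-free ladder of cumulative masses (helper for the demo only)."""
    peaks = [start]
    for residue in peptide:
        peaks.append(peaks[-1] + RESIDUE_MASS[residue])
    return peaks


print(read_ladder(ideal_ladder("PEPTLDE")))  # PEPTLDE
```

Because each spectrum is interpreted independently, thousands of spectra per hour can in principle be farmed out to separate Grid nodes.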
We are working on additional modules for the IBG Workbench (mentioned above) that will be useful to proteomics researchers trying to predict tertiary structures of proteins from their amino acid sequences. The intention is to produce reusable modules that can be loaded together, allowing researchers to concentrate on their particular areas of research interest. This framework will include modules to read DNA and amino acid sequences from the various GenBank databases, as well as primary, secondary, and tertiary structures of proteins. These IBG Workbench modules will also include chemical libraries to calculate energy levels of molecules, as well as modules that use these chemical libraries to perform ab-initio calculations of protein folding. Other methodologies for predicting protein structure, including rule-based and “lego” algorithms, will also be supported with their own modules. Having a suite of modules to choose from will allow researchers to minimize their development time, because they will only need to concentrate on the portion of the problem that their research addresses.
Phylogenetics is the study of evolutionary relationships (phylogeny). We are working with Phylogenetics collaborators at the Field Museum of Natural History in Chicago to determine feasible evolutionary relationships among given taxa. The approach looks at differences in DNA sequences and determines an evolutionary tree, starting at a hypothetical evolutionary ancestor of all of the taxa, that requires the minimum number of mutations to reproduce the differences among the taxa studied.
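Counting the minimum number of mutations for one candidate tree can be sketched with the classic Fitch small-parsimony algorithm, shown below. This is an illustrative sketch and not necessarily the method used in the collaboration: it scores a single fixed, rooted binary tree, whereas the search over all candidate trees is the expensive part. The tree encoding (nested tuples), the function names, and the four-taxon alignment are assumptions made for the example.

```python
from typing import Dict, Set, Tuple, Union

# A leaf is a taxon name; an internal node is a (left, right) tuple.
Tree = Union[str, tuple]


def fitch_site(tree: Tree, states: Dict[str, str]) -> Tuple[Set[str], int]:
    """Return (possible ancestral states, minimum mutations) for one alignment column."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can agree on a state, so no mutation is needed here.
        return common, left_cost + right_cost
    # The children disagree: one mutation is required on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree: Tree, alignment: Dict[str, str]) -> int:
    """Sum the per-column minimum mutation counts over the whole alignment."""
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        column = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


# Hypothetical aligned DNA sequences for four taxa.
alignment = {"A": "AAG", "B": "AAG", "C": "TCG", "D": "TCT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree_1, alignment))  # 3
print(parsimony_score(tree_2, alignment))  # 5
```

Under the parsimony criterion, tree_1 would be preferred here because it explains the same sequences with fewer mutations.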
All of the above BioMedicalInformatics applications share quite a bit of functionality. The interactions with the Grid are certainly common to all of them, and connections to the NCBI databases (GenBank), sequence comparisons, and similar operations are common to many. This common functionality is useful to a wide array of other BioMedical applications as well. Because a workbench of such tools, together with a platform for developing further tools on the common infrastructure, would be broadly useful, we are developing the IBG Workbench of these modular Grid-enabled tools. All software developed will be open source and available to all Computer Science or BioMedicalInformatics researchers world-wide.
GeneDesigner is an open-source product available for free download. It is still in development, but it can be used to design a nucleotide sequence for a particular protein (amino acid sequence) such that it will be optimized for the highest possible expression when used as a recombinant DNA fragment in a particular organism.
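One simple way to approach this kind of design is to pick, for each amino acid, the codon the host organism uses most often. The sketch below shows that idea only; it is not a description of how GeneDesigner itself works. The codon-to-amino-acid assignments follow the standard genetic code, but the usage frequencies, the host, and the function name are hypothetical, and the table covers just a handful of amino acids. Real design tools may also weigh factors that this sketch ignores.

```python
from typing import Dict

# HYPOTHETICAL codon usage table for an unnamed host organism:
# amino acid -> {codon: relative frequency}. The frequencies are invented
# for illustration; only the codon assignments are the standard genetic code.
CODON_USAGE: Dict[str, Dict[str, float]] = {
    "M": {"ATG": 1.00},
    "K": {"AAA": 0.74, "AAG": 0.26},
    "L": {"CTG": 0.47, "TTA": 0.14, "TTG": 0.13,
          "CTT": 0.12, "CTC": 0.10, "CTA": 0.04},
    "E": {"GAA": 0.68, "GAG": 0.32},
    "G": {"GGC": 0.37, "GGT": 0.35, "GGG": 0.15, "GGA": 0.13},
    "*": {"TAA": 0.61, "TGA": 0.30, "TAG": 0.09},  # stop codons
}


def design_sequence(protein: str, usage: Dict[str, Dict[str, float]]) -> str:
    """Back-translate a protein using the most frequent codon for each residue."""
    codons = []
    for residue in protein:
        if residue not in usage:
            raise ValueError(f"no codon usage data for residue {residue!r}")
        table = usage[residue]
        codons.append(max(table, key=table.get))
    return "".join(codons)


print(design_sequence("MKLEG*", CODON_USAGE))  # ATGAAACTGGAAGGCTAA
```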
Get GeneDesigner here.
This library addresses the problem of searching huge biological databases, on the scale of several gigabytes, by utilizing parallel processing. Biological databases storing DNA sequences, protein sequences, or mass spectra are growing exponentially, and searches through these databases consume exponentially growing computational resources as well. The library provides a general-use, MPI-based C++ framework for generically splitting databases amongst several computational nodes. The combined RAM of the nodes working in tandem is often sufficient to keep the entire database in memory, and therefore to search it efficiently without paging to disk.
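The splitting idea can be sketched in a few lines. The example below is plain, single-process Python, not the library's MPI-based C++ API: each shard stands in for the portion of the database one node would hold in RAM, and the loop over shards stands in for the nodes searching in parallel. The function names, the greedy balancing strategy, and the substring search are assumptions made for the illustration.

```python
from typing import Dict, List, Tuple


def partition(database: Dict[str, str], n_nodes: int) -> List[Dict[str, str]]:
    """Greedy split: place records, largest first, on the currently lightest shard."""
    shards: List[Dict[str, str]] = [{} for _ in range(n_nodes)]
    loads = [0] * n_nodes
    ordered = sorted(database.items(), key=lambda kv: len(kv[1]), reverse=True)
    for name, seq in ordered:
        lightest = loads.index(min(loads))
        shards[lightest][name] = seq
        loads[lightest] += len(seq)
    return shards


def search_shard(shard: Dict[str, str], pattern: str) -> List[Tuple[str, int]]:
    """Return (record name, first match offset) for records containing the pattern."""
    return [(name, seq.find(pattern)) for name, seq in shard.items() if pattern in seq]


def distributed_search(database: Dict[str, str], pattern: str,
                       n_nodes: int) -> List[Tuple[str, int]]:
    """Search every shard and merge the hits, as a coordinating node might do."""
    hits: List[Tuple[str, int]] = []
    for shard in partition(database, n_nodes):  # one iteration per simulated node
        hits.extend(search_shard(shard, pattern))
    return sorted(hits)


# Hypothetical toy database of sequences.
database = {"seq1": "ACGTACGTGG", "seq2": "TTGACA", "seq3": "GGGACGT", "seq4": "CC"}
print(distributed_search(database, "ACGT", n_nodes=2))  # [('seq1', 0), ('seq3', 3)]
```

Balancing shards by total sequence length rather than record count keeps the memory footprint per node roughly even, which matters when the goal is to fit the whole database in the combined RAM.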
Get the IBG High Throughput Task Allocator here