| Abstract: |
This thesis uses parallel processing to address the
problem of searching huge biological databases on the scale of several
gigabytes. Biological databases storing
DNA sequences, protein sequences, or mass spectra are growing
exponentially. Searches through these databases consume exponentially
growing computational resources as well. The thesis demonstrates and
analyzes a general-purpose, MPI-based C++ framework for generically
splitting databases among several computational nodes. The combined
RAM of the nodes working in tandem is often sufficient to keep the
entire database in memory, and therefore to search it efficiently
without paging to disk. The framework runs as a persistent service
that processes all submitted queries, which allows query reordering and
better memory utilization. As a result, we achieve superlinear
speedups over single-processor implementations. We demonstrate
the utility and speedup of the framework using a real biological
database and an actual search algorithm for mass spectrometry. |