Science Benchmark

Discussions at the XLDB’08 and XLDB’09 workshops have led to the proposal of a new benchmark for scientific data management systems called SS-DB. This benchmark, loosely modeled on an astronomy workload, is intended to simulate applications that manipulate array-oriented data through relatively sophisticated user-defined functions. SS-DB is representative of the processing performed in a number of scientific domains in addition to astronomy, including earth science, oceanography, and medical image analysis. The benchmark includes three types of operations: (1) manipulation of raw imagery, including processing pixels to extract geo-spatial observations; (2) manipulation of observations, including spatial aggregation and grouping into related sets; and (3) manipulation of groups, including a number of relatively sophisticated geometric operations.

Several architectural features are believed to be important for any scientific data manager to perform well on SS-DB: columnar storage; aggressive compression (e.g., eliminating the need to store array indices); a storage manager that supports tiled chunks that can be stored contiguously and read sequentially; and support for overlap across chunk borders. UDFs should also be easy to parallelize, and load time deserves attention, for example through pre-allocation of files.
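To make the idea of tiled chunks with overlap across chunk borders concrete, here is a minimal sketch. It is not part of the SS-DB specification: the names `Chunk` and `tile_with_overlap`, the chunk size of 4, and the overlap of 1 cell are all hypothetical and chosen only for illustration. Each chunk owns a core region, and its stored region extends past the core by the overlap margin, so a neighborhood-based UDF can process cells near a border without reading the adjacent chunk.

```python
from typing import Iterator, NamedTuple


class Chunk(NamedTuple):
    # Core region owned by this chunk: rows [r0, r1), columns [c0, c1).
    r0: int
    r1: int
    c0: int
    c1: int
    # Stored region: the core plus the overlap margin, clipped to the array.
    sr0: int
    sr1: int
    sc0: int
    sc1: int


def tile_with_overlap(rows: int, cols: int, chunk: int, overlap: int) -> Iterator[Chunk]:
    """Yield tiles covering a rows x cols array, in row-major tile order.

    Illustrative only: the parameters are assumptions, not SS-DB settings.
    """
    for r0 in range(0, rows, chunk):
        r1 = min(r0 + chunk, rows)
        for c0 in range(0, cols, chunk):
            c1 = min(c0 + chunk, cols)
            yield Chunk(
                r0, r1, c0, c1,
                max(r0 - overlap, 0), min(r1 + overlap, rows),
                max(c0 - overlap, 0), min(c1 + overlap, cols),
            )


# Hypothetical example: a 10 x 10 array, 4 x 4 tiles, 1 cell of overlap.
chunks = list(tile_with_overlap(rows=10, cols=10, chunk=4, overlap=1))
```

The core regions partition the array, while the stored regions duplicate a thin band of cells along each interior border. That duplication costs some storage but lets each chunk be processed independently, which is one way to keep UDFs easy to parallelize.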

All solution providers are invited to run this benchmark to provide additional architectural data points.

The following additional information is part of the benchmark specification:

  • Data generator code (C++): benchGen.cc
  • Data file for data generator (20MB): tileData
  • Pseudocode for function F1 (cooking); use with threshold 1000: f1.pseudo
  • F1′ is the same as F1 but with a threshold of 900 instead of 1000.
  • Pseudocode for function F2 (grouping): f2.pseudo
  • The starting points of the “slabs” used in the benchmark queries (each used three times):
    x y
    503000 503000
    504000 491000
    500000 504000
    504000 501000
    504000 493000
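For anyone scripting a benchmark run, the slab starting points listed above can be kept in a small table and expanded into a run schedule. This is only a sketch: the name `slab_schedule` is hypothetical, and the order in which the repeats occur (each point three times back to back) is an assumption, since the list states only that each starting point is used three times.

```python
# Starting points copied from the list above, as (x, y) pairs.
SLAB_STARTS = [
    (503000, 503000),
    (504000, 491000),
    (500000, 504000),
    (504000, 501000),
    (504000, 493000),
]

# Each starting point is used three times.
REPEATS = 3


def slab_schedule(starts, repeats):
    """Return every starting point repeated `repeats` times, back to back.

    The back-to-back ordering is an assumption made for illustration.
    """
    return [point for point in starts for _ in range(repeats)]


schedule = slab_schedule(SLAB_STARTS, REPEATS)
```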

Source code for the MySQL version of the benchmark implementation is available here: SS-DB-MySQL.tgz. Note that this code was written for functionality and performance, not for readability or portability.