Overview

DataCrunch is a Hadoop-based framework that aims to make it easy and fast to extract and compute meaningful information from earth observation datasets (e.g. satellite data, in situ ocean buoys and drifters, models...).

DataCrunch is a framework developed in-house at Cersat. It grew out of our experiments with Big Data technologies and takes advantage of the Map/Reduce paradigm and of scalable distributed processing and storage on cheap, fault-tolerant platforms.

DataCrunch's primary use is currently to extract metrics from the datasets available at Cersat (through a dedicated web GUI), but as a framework it is not limited to this usage.

Why Hadoop for satellite data analytics?

  • Map/Reduce: easy to merge datasets and extract metrics
  • Scalable, distributed Big Data processing: high performance
  • Fully fault-tolerant (hardware and software): easy to manage
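
To make the Map/Reduce point concrete, here is a minimal, self-contained Python sketch of the paradigm itself. It is not DataCrunch or Hadoop code: the sample records, the function names and the choice of a monthly mean as the metric are all assumptions made for illustration. It merges observations from several sources and reduces them to one value per month.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample records: (source, month, sea surface temperature in deg C).
observations = [
    ("satellite", "2010-01", 14.0),
    ("buoy", "2010-01", 16.0),
    ("satellite", "2010-02", 15.0),
    ("buoy", "2010-02", 15.5),
    ("model", "2010-02", 16.0),
]


def map_record(record):
    """Map step: emit a (key, value) pair; the value is a partial (sum, count)."""
    _source, month, temperature = record
    return month, (temperature, 1)


def shuffle(pairs):
    """Group mapped values by key (a real cluster does this between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_values(values):
    """Reduce step: add up the partial (sum, count) pairs, then derive the mean."""
    total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), values)
    return total / count


def monthly_means(records):
    """Run map, shuffle and reduce in sequence on an in-memory list of records."""
    grouped = shuffle(map(map_record, records))
    return {month: reduce_values(values) for month, values in grouped.items()}


means = monthly_means(observations)
print(means)
```

Because the reducer only ever adds (sum, count) pairs, partial results computed on different machines can be combined in any order, which is what lets this kind of metric scale across a cluster.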

DataCrunch web interface, a Cersat Dataset Analytics tool

TODO: describe the available datasets in detail

The DataCrunch web interface relies on the DataCrunch framework and is the easiest way to run common analytics on registered datasets.

  • Output products: global and regional statistics time series, maps, climatologies, Hovmöller diagrams, …
  • Inputs: datasets available on the Cersat Cloud
  • Format: native (NetCDF, HDF, ...)
  • Data: satellite (L0 to L4, swath or gridded), models, buoys, …
  • Size: from a few MB to hundreds of TB
  • Dataset time range: from a few days to decades
  • Adding a new dataset: usually requires fewer than 50 lines of Python or Matlab
  • Processing time: usually the length of a (quick) coffee break. It remains batch processing, so expect a few minutes to several hours for very large jobs.

Brief history

We turned to Hadoop in 2010 while looking for better ways to manage and use large volumes of satellite data. This open-source product was not as well known then as it is today ;-), but it was already used by major web companies such as Google, Yahoo and Facebook for data-intensive workloads at low cost. It offered several promising features and a real revolution in data processing and management, so we gave it a try, even though it was a priori not well suited to our satellite data usage.

