DataCrunch - Design

This new article about DataCrunch follows DataCrunch - Introduction and focuses on the design and technical implementation of the framework.

This article details how we perform analytics computations on all our scientific datasets in their native formats (NetCDF, HDF, ...), spanning terabytes of data and millions of files, with only 30 lines of Python ...
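To give a rough feel for what such a short analytics job can look like (the actual DataCrunch code is covered in the rest of the article), here is a minimal standalone sketch using the netCDF4 and numpy packages; the archive path and the variable name are hypothetical examples, not our real dataset layout.

```python
# Minimal sketch: per-granule statistics over a collection of NetCDF files.
# The "/archive/sst" path and the "sea_surface_temperature" variable are
# placeholders for illustration only.
import glob

import numpy
import netCDF4


def granule_stats(path, varname="sea_surface_temperature"):
    """Return (count, sum, min, max) of the valid values in one granule."""
    with netCDF4.Dataset(path) as nc:
        values = numpy.ma.masked_invalid(nc.variables[varname][:])
        values = values.compressed()                # keep only valid values
        if values.size == 0:
            return 0, 0.0, numpy.inf, -numpy.inf
        return values.size, float(values.sum()), float(values.min()), float(values.max())


if __name__ == "__main__":
    count, total, vmin, vmax = 0, 0.0, numpy.inf, -numpy.inf
    for path in glob.glob("/archive/sst/*.nc"):     # hypothetical archive layout
        n, s, lo, hi = granule_stats(path)
        count, total = count + n, total + s
        vmin, vmax = min(vmin, lo), max(vmax, hi)
    if count:
        print("mean=%.3f min=%.3f max=%.3f over %d values"
              % (total / count, vmin, vmax, count))
```

Scaling this kind of loop over terabytes of granules is what the Map/Reduce layer described in the introduction below takes care of.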


DataCrunch - Introduction

Overview

DataCrunch is a framework based on Hadoop which aims to make it easy and fast to extract and compute meaningful information from earth observation datasets (i.e. satellite, in-situ ocean buoys/drifters, models, ...).

DataCrunch is a framework developed in-house at Cersat, the result of our experiments with Big Data technologies, taking advantage of Map/Reduce ...
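To make the Map/Reduce idea concrete, here is a hedged sketch of a Hadoop Streaming style mapper/reducer pair in Python. It is not DataCrunch's own API, and the record format (one "<date> <value>" pair per line) is an assumption for illustration.

```python
# Sketch of the Map/Reduce decomposition as used by Hadoop Streaming: the
# mapper emits tab-separated key/value pairs, the reducer receives them
# grouped by key and computes a per-key (here per-day) mean.
import sys


def mapper(stream=sys.stdin):
    for line in stream:
        parts = line.split()
        if len(parts) != 2:
            continue                               # skip malformed records
        date, value = parts
        print("%s\t%s" % (date, value))


def reducer(stream=sys.stdin):
    current_key, total, count = None, 0.0, 0
    for line in stream:
        key, value = line.rstrip("\n").split("\t")
        if current_key is not None and key != current_key:
            print("%s\t%.3f" % (current_key, total / count))
            total, count = 0.0, 0
        current_key = key
        total += float(value)
        count += 1
    if current_key is not None:
        print("%s\t%.3f" % (current_key, total / count))


if __name__ == "__main__":
    # Hadoop Streaming runs the same script in both roles, e.g.:
    #   hadoop jar hadoop-streaming.jar -input ... -output ... \
    #       -mapper "python mapreduce_sketch.py map" \
    #       -reducer "python mapreduce_sketch.py reduce"
    if sys.argv[1:] == ["map"]:
        mapper()
    else:
        reducer()
```

The mapper runs in parallel on every data node, while Hadoop groups the emitted keys and feeds them to the reducers.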


Distributed FileSystem on the Cersat platform

Purposes

At Cersat, we manage and use data from dozens of satellite mission archives every day. This requires a large archive storage capacity, since satellite datasets nowadays range from a few hundred gigabytes to hundreds of terabytes, and even petabytes for new missions.


Software architecture

Overview

The platform software architecture relies on several layers, each containing one or more components (a minimal sketch of the processing layer follows the list):

  • Distributed Processing: Torque/Maui, Hadoop Map/Reduce, Celery, Gearman
  • Virtual Environments / Containers: Chroot, Docker, Eucalyptus/OpenStack
  • Distributed Filesystem: MooseFS, Hadoop HDFS
  • Bare Metal / OS: Ubuntu 12.04 LTS
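As announced above, here is a minimal sketch of a job on the distributed processing layer using Celery, one of the listed components; the broker and backend URLs, the task body and the "/mfs" mount point of the distributed filesystem are assumptions for illustration only, not the platform's actual configuration.

```python
# Sketch of a Celery task dispatched to the worker pool of the distributed
# processing layer; broker/backend URLs are placeholders.
import os

from celery import Celery

app = Celery("datacrunch_sketch",
             broker="amqp://guest@localhost//",
             backend="rpc://")


@app.task
def granule_size(path):
    """Trivial per-granule job: return the size in bytes of one file stored
    on the shared distributed filesystem (e.g. a MooseFS or HDFS FUSE mount)."""
    return os.path.getsize(path)


# With workers started on the cluster (celery -A <this module> worker), a
# client fans jobs out over the pool and gathers the results, e.g.:
#   results = [granule_size.delay("/mfs/archive/sst/%04d.nc" % i) for i in range(10)]
#   print(sum(r.get() for r in results))
```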