Scalable Job Management for Data Ingestion



The Data Management and Archive System (DMAS) is the heart of the PO.DAAC data ingestion process, extracting and cataloging metadata. Because of both a steady increase in nominal processing and unpredictable spikes in processing and ingestion (reprocessing, migrations, metadata mining), PO.DAAC requires a system that can scale dynamically to meet today's demands as well as tomorrow's requirements. This poster presents a federated architecture for job assignment and management. In this architecture, a communication layer serves as the job assignment layer, enabling horizontally scalable data ingestion processing. The foundation of this communication layer is Apache ZooKeeper, which provides synchronization and naming services. The job lifecycle manager (the Manager) receives jobs and assigns them through ZooKeeper. ZooKeeper, as the bridge between the Managers and the ingestion workers, keeps track of the jobs to be processed and their priorities. Ingestion workers process the jobs in the order and priority recorded in ZooKeeper. The resulting architecture allows an arbitrary number of Managers and ingestion workers to be added and removed dynamically with zero downtime, which gives PO.DAAC an elastic approach to job management. Implemented in pure Java, DMAS is now an elastic, distributed ingestion system. It currently operates in our clusters at JPL, but because its job framing is elastic and the system is portable, it can also be deployed in a computing cloud to take advantage of the cloud’s elastic nature.
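
The job flow described above (Managers submit prioritized jobs, workers claim them in order) can be illustrated with a small Java sketch. This is not the DMAS implementation: an in-memory PriorityBlockingQueue stands in for ZooKeeper so that the example runs without a cluster, and every name in it (JobQueueSketch, Job, JobQueue, submit, claim, the job names) is hypothetical. It also assumes that a lower priority number means a more urgent job and that jobs of equal priority are served in submission order, neither of which is stated in the abstract.

```java
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class JobQueueSketch {

    /** A queued job. All fields are illustrative, not taken from DMAS. */
    static final class Job implements Comparable<Job> {
        final String name;
        final int priority;   // assumption: lower value = more urgent
        final long sequence;  // submission order, breaks ties within a priority

        Job(String name, int priority, long sequence) {
            this.name = name;
            this.priority = priority;
            this.sequence = sequence;
        }

        @Override
        public int compareTo(Job other) {
            int byPriority = Integer.compare(priority, other.priority);
            return byPriority != 0 ? byPriority : Long.compare(sequence, other.sequence);
        }
    }

    /** In-memory stand-in for the coordination layer (ZooKeeper in the real system). */
    static final class JobQueue {
        private final PriorityBlockingQueue<Job> pending = new PriorityBlockingQueue<>();
        private final AtomicLong nextSequence = new AtomicLong();

        /** Called by a manager: record a job and its priority. */
        void submit(String name, int priority) {
            pending.add(new Job(name, priority, nextSequence.getAndIncrement()));
        }

        /** Called by a worker: take the most urgent job, or null if none is pending. */
        Job claim() {
            return pending.poll();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        JobQueue queue = new JobQueue();

        // A manager submits a bulk reprocessing job and two nominal jobs.
        queue.submit("reprocess-granule-A", 5);
        queue.submit("nominal-granule-B", 1);
        queue.submit("nominal-granule-C", 1);

        // Any number of workers can drain the same queue; each job is claimed once.
        Runnable worker = () -> {
            Job job;
            while ((job = queue.claim()) != null) {
                System.out.println(Thread.currentThread().getName() + " ingests " + job.name);
            }
        };
        Thread w1 = new Thread(worker, "worker-1");
        Thread w2 = new Thread(worker, "worker-2");
        w1.start();
        w2.start();
        w1.join();
        w2.join();
    }
}
```

Because managers and workers share only the queue, either side can be started or stopped without the other noticing, which is the property the architecture relies on for elasticity. In a real deployment the queue state would live in ZooKeeper rather than in one process's memory.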
