Data-driven Processing with dCache, Apache NiFi and OSCAR

Tuesday, March 21, 2023

This post describes the integration between OSCAR and dCache. When a file is stored in dCache, it triggers the invocation of an OSCAR service that processes that file within the scalable OSCAR cluster. This work is being done in the context of the interTwin EU project.

What is dCache?

dCache is a system for storing data across distributed, heterogeneous server nodes that presents them as a single virtual filesystem tree. The system can be expanded or contracted by adding or removing data servers at any time. dCache is developed by DESY.

What is OSCAR?

OSCAR is an open-source serverless platform for event-driven data-processing containerized applications that execute on elastic Kubernetes clusters that are dynamically provisioned on multiple Clouds.

OSCAR version 2.6.1 and newer can handle dCache storage events.

What is NiFi, and why use it?

Apache NiFi was made for dataflows. It supports highly configurable directed data routing, transformation, and system mediation logic graphs.

NiFi works as an event ingestion platform between dCache and OSCAR in this architecture. SSE (Server-Sent Events) is one way to listen actively for storage events in dCache. NiFi listens to dCache so that a file upload triggers an OSCAR service. NiFi does not include a processor that supports the SSE specification. Therefore, a new Docker image named ghcr.io/grycap/nifi-sse (based on the NiFi image, version 1.20.0) has been created. This new image includes a Python-based client-side implementation of the SSE support in dCache, kindly provided by Paul Millar.
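To make the SSE listening step more concrete, here is a minimal sketch of how a client can split a Server-Sent Events stream into individual events. It is not the client shipped in the nifi-sse image. The function name `parse_sse`, the `inotify` event name and the sample payload are assumptions chosen for illustration, and the sketch reads from an in-memory string instead of a live dCache connection.

```python
import io
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per Server-Sent Event found in an iterable of text lines."""
    event_type, data_lines, event_id = "message", [], None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event and dispatches it.
            if data_lines:
                yield {"event": event_type, "id": event_id,
                       "data": "\n".join(data_lines)}
            event_type, data_lines = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "event":
            event_type = value
        elif field == "data":
            data_lines.append(value)
        elif field == "id":
            event_id = value


# Hypothetical stream content; the payload shape is illustrative only.
SAMPLE = (
    ": keep-alive\n"
    "event: inotify\n"
    "id: 1\n"
    'data: {"event": {"name": "image.jpg", "mask": ["IN_CLOSE_WRITE"]}}\n'
    "\n"
)

events = list(parse_sse(io.StringIO(SAMPLE)))
payload = json.loads(events[0]["data"])
print(events[0]["event"], payload["event"]["name"])
```

In a real deployment the lines would come from a long-lived HTTP response rather than a string, and each parsed event would be handed to the next stage of the dataflow.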

Using NiFi has several advantages over running a dedicated pod that listens for new files and triggers OSCAR services.

  • It is a generic tool that can be configured to build a specific dataflow.
  • It can create complex dataflows, routing one event to several services.
  • The dataflow can be changed and adapted through the web user interface.
  • Dataflow recipes can be created to automate deployment.
  • The data ingestion rate into OSCAR can be changed at any time, which decouples the file ingestion rate in dCache from the data processing rate in OSCAR.
  • In this case, NiFi is deployed on the cluster node so that its state persists.

To make it easier to define the NiFi dataflow between dCache and OSCAR, a new client-side tool called dCNiOS has been created. It provides a YAML-based definition of the endpoints and a command-line interface for deploying and modifying the dataflow at runtime, as shown in the figure below:

[Figure: workflow]

The example dataflow between dCache and OSCAR has two process groups: dcachelistening and invokecowsay.

[Figure: dataflow]

The first process group, dcachelistening, contains two processors:

  • ExecuteProcess: Keeps listening for events in dCache over the SSE protocol and stores each event in a temporary folder.
  • GetFile: Watches that folder and feeds the events into the dataflow.

[Figure: dataflow]
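The hand-off between the two processors follows a spool-folder pattern: one side writes each event as a file, the other picks the files up and removes them. The sketch below illustrates that pattern in plain Python. It is not NiFi code. The function names `spool_event` and `drain_events`, the `.json` naming scheme and the event contents are hypothetical.

```python
import json
import os
import tempfile
import uuid
from pathlib import Path


def spool_event(spool_dir: Path, event: dict) -> Path:
    """Listener side: write one event as a JSON file in the spool folder."""
    spool_dir.mkdir(parents=True, exist_ok=True)
    final = spool_dir / f"{uuid.uuid4().hex}.json"
    tmp = final.with_suffix(".tmp")
    tmp.write_text(json.dumps(event), encoding="utf-8")
    # Rename only after the file is complete, so the consumer
    # never picks up a half-written event.
    os.replace(tmp, final)
    return final


def drain_events(spool_dir: Path) -> list:
    """Consumer side: read every complete event file, then delete it."""
    events = []
    for path in sorted(spool_dir.glob("*.json")):
        events.append(json.loads(path.read_text(encoding="utf-8")))
        path.unlink()
    return events


with tempfile.TemporaryDirectory() as folder:
    spool = Path(folder) / "events"
    spool_event(spool, {"name": "a.jpg"})  # hypothetical event contents
    spool_event(spool, {"name": "b.jpg"})
    drained = drain_events(spool)
    leftover = list(spool.iterdir())

print(len(drained), leftover)
```

Writing to a temporary name and renaming keeps the two sides independent: the listener can run at its own pace, and the consumer only ever sees finished files.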

The second process group, invokecowsay, makes an HTTP call to the OSCAR API that creates an asynchronous invocation of an OSCAR service.

[Figure: dataflow]
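As a rough illustration of that HTTP call, the sketch below builds a request for an asynchronous invocation without sending it. It assumes the `/job/<service>` path and bearer-token authorization, which is our understanding of the OSCAR API for asynchronous invocations; check the OSCAR API documentation for your version. The endpoint, service name, token and payload are all placeholders, and `build_async_invocation` is a hypothetical helper.

```python
import json
import urllib.request


def build_async_invocation(endpoint: str, service: str, token: str,
                           event: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for an asynchronous invocation."""
    # Assumed path layout: <endpoint>/job/<service name>
    url = f"{endpoint.rstrip('/')}/job/{service}"
    return urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_async_invocation(
    "https://oscar.example.org",  # hypothetical cluster endpoint
    "cowsay",                     # hypothetical service name
    "SERVICE_TOKEN",              # placeholder, not a real token
    {"message": "hello"},         # hypothetical event payload
)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
print(request.get_method(), request.full_url)
```

In the dataflow described here, NiFi performs this call itself; the sketch only shows the shape of the request.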

Finally, we have a video that shows all the steps to connect NiFi and OSCAR in more detail.

OSCAR and IM are developed by the GRyCAP research group at the Universitat Politècnica de València.