SNU Analytics: Government Program Data Pipeline

Creating a pipeline to generate event databases from production data for the Service National Universel, enabling detailed government statistics and regulatory compliance.

Context

We had a database made up of MongoDB collections representing young people signing up to a government program (the Service National Universel, a sort of state-operated summer camp) and subsequently carrying out missions of public interest in organizations all over the country. For both phases, applications went through a multi-stage process: user creation, completion, administrator validation, and potential refusal or correction requests. Each step was recorded in "status" fields. Although we did record the date of the last update, we did not keep a history of each status change in the user data itself.

During the project's third year, the client (the Ministry of Education) received a request from a regulatory body for detailed statistics on the evolution of the applicant pool. This required clearly presenting daily figures for created, completed, processed, validated, and refused applications, along with projections and comparisons against annual objectives.

To meet this need as quickly and simply as possible, we chose to build our own pipeline from scratch, using the tools already deployed on the project (a Node.js instance and databases, all managed by the CleverCloud service).

Ingestion

Luckily, an existing system for automatic event data generation was already in place: every update to the user collection generated a document in the "patches" collection, recording the updated document's ID and the values of changed fields. This collection served security purposes (tracking changes and their origin) and enabled data rollback in case of bugs or human error. It would now also be leveraged for our fledgling analytics service.
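The exact schema of these patches is not reproduced here, but a document in the collection looked roughly like the sketch below; the field names are illustrative assumptions, not the project's actual schema.

```js
// Illustrative shape of a "patches" document (field names are assumptions).
// Each patch records which user document changed, when, by whom, and the
// before/after value of every modified field.
const examplePatch = {
  _id: "5f6e...",                          // id of the patch itself
  ref: "5d2a...",                          // _id of the updated document in the user collection
  modelName: "young",                      // which collection the patch belongs to
  date: new Date("2022-03-01T09:30:00Z"),
  user: { role: "referent" },              // origin of the change, kept for auditing
  ops: [
    {
      path: "/status",
      value: "VALIDATED",                  // value after the update
      originalValue: "WAITING_VALIDATION", // value before the update
    },
  ],
};
```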

However, these patches did not fully meet the client's requirements. To provide a database suitable for cross-referencing and for generating detailed anonymized reports, such as the number of created and validated applications by gender, age, or applicant region, the data had to be transformed. An automated job was therefore configured to run this processing every night.
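The scheduling mechanism itself is not detailed here; the sketch below uses node-cron as one plausible way to trigger the processing every night on a Node.js instance (CleverCloud also offers its own cron configuration).

```js
// Minimal nightly scheduling sketch using node-cron (an assumption: the
// project's actual trigger mechanism is not described in this article).
const cron = require("node-cron");

// Every night at 03:00, process the patches created since the previous run.
// processPatchesSince is a hypothetical entry point into the transformation step.
cron.schedule("0 3 * * *", async () => {
  const since = new Date(Date.now() - 24 * 60 * 60 * 1000);
  await processPatchesSince(since);
});
```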

Transformation

The need:

  • Complete data: we needed to rebuild user state before and after each recorded event.

  • Structured data: the records had to be converted from our document store to a relational database for ease of use with the chosen data visualization tool (Metabase).

  • Anonymized data: no identifying information could be stored in this database, which would be accessed by people outside the client's team and had to comply strictly with GDPR rules.

We developed JavaScript scripts to populate an external PostgreSQL database. For each event (e.g., application creation, validation, process completion), a new entry was generated in the analytics table. Each entry included the event name and reconstructed user data, both pre- and post-change. This reconstruction utilized event information from the patch collection and the most recent user document, with all identifying details removed. These reconstructed user profiles were stored in JSONB columns, and key fields were duplicated into separate columns to streamline subsequent queries.
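A simplified sketch of that reconstruction logic follows; the field names, the anonymization list, and the eventNameFor helper are illustrative assumptions. Starting from the most recent user document, the patches are replayed backwards so that each event carries the user's state before and after the change.

```js
// Fields stripped before anything is written to the analytics database
// (an illustrative list, not the project's exact one).
const IDENTIFYING_FIELDS = ["firstName", "lastName", "email", "phone", "address"];

function anonymize(user) {
  const copy = { ...user };
  for (const field of IDENTIFYING_FIELDS) delete copy[field];
  return copy;
}

// `patches` must be sorted from newest to oldest.
function buildEvents(latestUser, patches) {
  const events = [];
  let after = { ...latestUser };
  for (const patch of patches) {
    // Undo the patch to recover the state the user was in just before it.
    const before = { ...after };
    for (const op of patch.ops) {
      before[op.path.replace("/", "")] = op.originalValue;
    }
    events.push({
      patchId: String(patch._id),
      name: eventNameFor(patch), // hypothetical mapping, e.g. "application_validated"
      date: patch.date,
      userBefore: anonymize(before),
      userAfter: anonymize(after),
    });
    after = before;             // keep walking backwards in time
  }
  return events;
}
```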

Additionally, these scripts needed to support manual execution, in the event of a server failure or misconfiguration, without generating duplicate entries in the events table.
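One way to satisfy that constraint, assumed here rather than taken from the project, is to store the source patch id with each event and let PostgreSQL skip rows it has already seen. Table and column names below are illustrative.

```js
// Idempotent load sketch using node-postgres. The important part is the
// unique key on the source patch id.
//
// CREATE TABLE young_events (
//   patch_id    TEXT PRIMARY KEY,
//   event_name  TEXT NOT NULL,
//   event_date  TIMESTAMPTZ NOT NULL,
//   status      TEXT,   -- key fields duplicated out of the JSONB for fast queries
//   region      TEXT,
//   user_before JSONB,
//   user_after  JSONB
// );
const { Pool } = require("pg");
const pool = new Pool({ connectionString: process.env.POSTGRES_URL });

async function insertEvent(event) {
  // ON CONFLICT DO NOTHING makes a manual re-run safe: rows generated from a
  // patch that was already processed are simply skipped.
  await pool.query(
    `INSERT INTO young_events
       (patch_id, event_name, event_date, status, region, user_before, user_after)
     VALUES ($1, $2, $3, $4, $5, $6, $7)
     ON CONFLICT (patch_id) DO NOTHING`,
    [
      event.patchId,
      event.name,
      event.date,
      event.userAfter.status,
      event.userAfter.region,
      event.userBefore, // pg serializes plain objects to JSON for JSONB columns
      event.userAfter,
    ]
  );
}
```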

Data Sharing

With these events neatly recorded in a relational database, the data department was able to create beautiful dashboards and PDF reports, answering most of the administration's questions and showcasing the impact of the program as a whole.