Data Science

Building a distributed data auditing system

Bhagwati Malav
October 22, 2020

This article would focus on building a data auditing system which is highly available, horizontally scalable, fault tolerant, works in isolated manner. Any organisation would need such system at some time to manage the data changes so that it can easily be tracked who has changed the data, what was the existing data, what got changed eg. you have distributed data crawling framework where you can configure site configuration, and it is getting used internally, configuration related to products in e-commerce, margin/profit configuration at product level etc. All these cases are very critical, and we should be able to track these changes in future if something goes wrong in your system/ or auditing purpose.

So let’s define basic problem statement here.

1. We need to design a data auditing system which is highly available, we should not loose any data change in any case.

2. If something goes wrong in any one component, system should still be able to perform what it does. Let’s say any node of data store or audit service goes down, we still want things to work in expected manner.

3. We should be able to access the audit data in efficient, and faster manner with minimal latency, response time.

Architecture of Data Auditing System

So let’s start the basic design with keeping above given problem statement in mind. There are sequential steps which we can follow to crack given problem.

  1. How do i find out the diff between old and new data entity ?

There are multiple ways to do it. Lets assume we are going to use apache common library to get the difference. It allows us to define the fields we want to audit instead of getting the difference for all the attributes at given object. Therefore it is easy to get the difference/data changes for any given data entity interaction here.

2. Should we persist it at micro- service data store level or build a separate centralised service to handle it ?

One thing which we could do here is, define a common data model at each micro-service, transform the difference into the audit data model, and persist this into databases. This is easy to do, but comes with few consequences, we are violating the DRY principle(Don’t repeat yourself). We should not write the transformation and persisting logic for each service as it is duplication of work, difficult to manage. If we realise in future that we need to modify something in logic, will have to write it at multiple places. So whenever we see we are writing duplication, separate out that part. It will be very easy to understand, extend, scale, manage that part in future. Therefore we can build a separate micro-service around it, and it will be responsible for managing complete audit capability here.

3. How should we define communication between other micro-services and audit service ?

This is another question whether it has to be sync or async communication. If we try to do it in sync manner, it will have multiple problems. It might increase overall response time for data interaction at other micro-services which is not at all acceptable. As we previously said, we can’t afford to loose any audit data in any case. So if it is sync communication, and audit system is down/or having some problem then we are stuck here, it will also have impact on upstream service. We can make these loosely coupled as they are meant to do two different things in isolation without impacting each other. We can go for async communication. We can pass this audit data to kafka topic, once we send it to kafka, kafka will make sure that delivery of data audit event irrespective of availability of audit system. If audit system is not available, it will make sure to pass that event to audit service once it comes up. We will not loose data audit event in any case. There is another advantage of using this approach, we can come up with multiple partitions based on required throughput after initial benchmarking, it is easy to scale here without changing any code as it is horizontally scalable system. We need to make sure that we scale our audit data store along with it for read/write operation.

4. How should we process this data at data audit service ?

This is another interesting question. Since we need to support reads here from huge data, it would be slightly difficult to achieve that with database alone as data will grow too soon, it might be difficult to get the read with required performance with some search criteria/filters. So we can think of using elastic search or solr here. Let’s assume we are using elastic search to store this data. We will be running multiple nodes of audit service here based on no of partitions we have at kafka level for audit topic, these nodes will be acting as consumers within a consumer group. This will help us a lot in terms of scaling later as it is built in elastic manner. As soon as we receive these event at audit service level, we can run basic validation, store it into primary data store eg: mongodb(with sharding or go for write heavy databases eg: cassandra, dynamo, this system would be write heavy system), and index this data to elastic. It is not recommended practice to use search data store as a primary source of truth thats why we are keeping mongodb as a primary data source, and then performing upsert operation on elastic. As data will grow too soon, we should go for n shards, can come up with this number after initial investigation.

5. How should i expose read interface for audit data ?

Now we have audit data stored at database, and elastic search level, we can build an api which takes search parameters/filters, and get the audit data accordingly. It is very easy to do with elastic search. We can build request dynamically, talk to elastic, and get the required data. It will be very easy to do it with good performance since we already have basic things in placed at audit service. This is a single interface for read operation from audit service. We can also follow CQRS(Command Query Responsibility Segregation) pattern so that we have separate read and write concerns, and both can work separately as we have already divided our read/write data stores here, we can extend, scale, manage these easily.

6. What if data service failed to push audit data event to kafka ?

This is very rare but since we are talking about not loosing any data at any cost, can think about this use case. We can do a small change to handle this at data service level. As soon as we find the audit data, persist it into database of that data service. And then pass it to kafka audit topic, and clean it up from database. If anything goes wrong in terms of sending the audit data to kafka. We will have failures stores in database, we can run a scheduler in frequent manner which will read the data from failure data table, and try to republish it.

7. How will external world/dashboard/UI see this data ?

We can configure api gateway to talk to this read api endpoint, and it can further talk to audit service. So ui will talk to gateway which internally fetches this data from audit system, and pass it back.

8. When should we archive audit data ?

It is essential to have this feature in our system as data will grow too fast, and data storage comes with cost. It completely depends on business use case. We can discuss with business, and come up with some duration. Once we cross that duration, we need to archive the data from both the data stores eg: i want to maintain audit data for only last 3 years, we can have expiry mechanism in placed at data audit document level which will archive data from both the data stores.

So we have basic data auditing system ready which will take care of basic things.There may be other areas of improvement which we could add here, but I tried to highlight basic things which we can at least follow while designing similar system, and we can always extend as we move forward.

Thanks for reading the article, I love discussing system design, distributed system. You can reach out to me for any suggestion/improvements. Would keep writing more on similar topics in future.

Distributed System

About Quinbay

Quinbay is a dynamic one stop technology company driven by the passion to disrupt technology today and define the future.
We private label and create digital future tech platforms for you.

Digitized . Automated . Intelligent