Challenges of data engineering people don’t talk about

Imagine yourself being a data engineer. You wake up with birds chirping outside. Feeling fresh, so you quickly brush your teeth and take a shower. On a way to work you grab yourself a nice latte from your favourite coffee place. The one which you smell already before you turn the corner. The indistinguishable smell of freshly ground coffee and sourdough bread when is just taken out from an oven.

You sit down at your desk, put your bag down, take out your laptop and you are about to take the first sip of your delicious treat.

“Hey, can we track the payment from being keyed in, through the movement through bank accounts and tie it to the refund process as well? We’ve seen a recent rise in complaints about payments getting delayed.” You get asked by a randomly passing colleague.

Good question. We want to provide better service, so you want to get to the bottom of it. A customer comes first.

Alright, so what happens next? Well, someone will have to try to tie all the systems together to figure out how to track all the steps for payment for the first time.

By now, someone could just easily say “Well, all the systems have been integrated, so it will be a piece of cake”.

Let’s have a look how a process for you can look like. 

Cool, I can export data at a granular level through an API from the Ordering system, to get all orders and their payments with quite a lot of additional data. Great, the first steps seem to be ok. 

“Can I get documentation for the API fields?… Oh, it’s experimental only, so you didn’t write one yet. How long has it been experimental?… never mind…”.

So you start going through them one by one to try to figure out what individual dates mean and pingback question to the Ordering system development team. “What is Received Date? Is it when the payment hit our or clients bank account?”. If you are lucky you will get “Yes, that’s exactly what it is. In all the cases and never anything else.” What is more likely is “Yeah, it depends on a workflow, it’s also populated with expected received date which is not correct in all the cases at the moment. There is a bug to fix this.” You will try to find the bug and it’s the lowest of low priorities because it doesn’t affect front end or payment processing. So you grind sand between your teeth and try to work around it. You might start to derive the date yourself based on payment type and Date Ordered/Paid. You will find some odd dates when something was ordered in the year 1414, so you will question it and get back “Yeah, it’s a workaround for a bug, which was there for half a year and it was never a priority to fix it after because those payments were processed and reconciled correctly manually in the end. It’s just a display historical issue. No one looks at those anymore.”

Let’s not forget that you’ve started to look at this with “Yeah, there is API, they have some examples there, so it should be ok.”

Ok, you’ve managed to move into an accounting system. It’s an off-the-shelf system, which was customised 7 years ago. The Ordering system has been in place for 3 years only. “Hmm…” Because you are a fast learner, you think to yourself “I bet there will be something odd here”.

So you will start to dive into it. You remembered the meeting with all the heads of departments, where someone confidently said that there is integration between both systems because it was a key part of the delivery of the system.

You get excited and set up an initial meeting with the accounting team. “Oh, yeah… that integration worked for the first year and then it broke during one of the updates and fix was never prioritised. So we export data manually, run these excel macros to prepare it for bulk upload and upload it on daily basis. Yeah, we want it to change, but it was never prioritised because we have a functional workaround, which takes us 30 minutes each day.”

“Ooft,” you say.

“Do you have a ticket reference to the change, it seems to make sense for you not to do it and it doesn’t seem like a big change on the Ordering system side.” At this moment, you might realise that you are not there to push changes, but to analyse data, so you move your focus back on the task at hand.

Back to getting the data. It’s an off-the-shelf product and they have a well-documented API. Awesome, you connect and export all payments. You open the extract and start analysing it. Suspiciously, fields you would expect to be populated are nulls. You ask the accounting team and they don’t have an answer. However, they will show you the user interface and run you through their process. You’ll notice that in the list of entities are Payments, but it’s not what they click on. It’s Payments Accounting“Oh yeah, the consulting company created this for us, because they couldn’t add our custom fields into the default Payments.”. You dig more into the API documentation to find out how to export custom entities. You prepare the script, run it and see the error “Entity doesn’t exist”. “Hmm”, you run GetAllEntities and notice “Display name = Payments Accounting”, “Internal name = PaymentsNew_”. 

Great, you’ve changed the name, exported the data, checked the file and notice all fields are named as CF1 to CF34. “WHAT ARE THOSE?!”. You contact the vendor and they looked at you strangely, that no one ever wanted before to export custom entity, but they will put the request to replace it with display names in Q4 next year, maybe. Well, what can you do? You reverse engineer the field names, find what you need and then hardcode the names into your process.

Now you have data from two systems with the granularity you want. Brilliant. What’s left is to just join them together. You start looking for the ID or reference fields to match them and that’s it! You will be finally done. Well, not done, you will still need to analyse it… anyway. Nothing matches based on the name, but there are few integers, so you try them. No combination makes sense. You go back to both teams.

It’s not anywhere.

Well, the status is updated back to the Ordering system to inform the customer. How is it done? It’s matched on Date, Amount and Bank Details. You just wished one of the systems stored the original reference from the other. You think about requesting a change for the systems to capture it. It is clear that no one will touch the Accounting system and for the Ordering system, it is a low priority*.

“Surely, I won’t have to map those two together, the same way as it’s already done twice in both systems.”

And in the next minute, you are doing it. It’s easy for the accounting system. On the other hand, the Ordering system doesn’t have matching functionality exposed as public API and neither any bulk export of payments with their bank details.

Do you remember the excel spreadsheet with macros that is used as a workaround? I bet you’ve already forgotten by now. You plea to gods to not force you to use an Excel file which is used on daily basis, but there is no answer. You are using the Excel spreadsheet.

I’m starting to realise that this post is getting quite long, so let’s fast forward.

You take data from the Ordering system, the workaround spreadsheet and the accounting system. Match them together in 98% of cases, because there are manual changes that you won’t pick up. The process fails once every few weeks because the Excel file is saved with different name or dates are formatted to numbers etc. You hope that none of the dependencies you’ve now created change or at least you will be informed about all changes which might impact you. Everyone confirmed that you will be told. In fairness, you are in all the cases when the explicitly used API is changed, however not when underlying data changes meaning or there is a process change, which changes which field is getting populated with Received Date.

If you are a data engineer, data scientist, BI developer, analyst. Whatever the role is called. If you need to grab data from multiple systems and use them for a purpose they haven’t been used before. These are some of the problems you might be dealing with.

And look at it from the perspective of the other teams involved. The Ordering system development team is stretched because it should have been only two years project, but they are being asked for more and more new features. They are doing the best they can to keep the system running and covering all critical features and bugs and rightfully move data fixes into low priority.

Accounting is doing its best to reconcile all payments and refunds and making sure they hit the right bank account in time. They ask for the integration to be fixed frequently, however, it is pushed back as there is a workaround and a project to change the accounting system completely in Q2… for 3 years straight now.

And why you won’t hear anyone talking about data engineering in relation to change management, process improvement, vendor and product management and development roadmaps? One of the reason is that there is no silver bullet. Your work affects everyone in your organisation and it is affected by more decisions that you admit. As a data person you need to learn to live in the middle layer between all systems, processes and people. And similar to other complex problems, no product or framework would solve all your problems, so it is a very hard sell. You just need to take one problem at the time and start to unwind it.

Leave a comment