A financial institution’s file sync led to a disaster

aRoot automatic file transfer, data synchronization

The cash cow is sick: Data access is the bottleneck but file sync almost killed it

Mark’s team of developers was busy at work getting a new application ready for deployment. The team was spread between India, the UK, France and New York.

 

The previous generation of the application was used by hundreds of financial analysts all over the world, and revenues in excess of one billion dollars were generated using the application and its various derivative products. However, the old application had several limitations:

  • It did not scale well: as the number of concurrent users increased, the response time went from about 2 seconds to 85 seconds. Hence the number of customers that could be serviced dropped dramatically. Traders and analysts were up in arms.
  • Once in a while the data and the results of the analysis did not match reality, which exposed the firm to legal action and loss of customers.

When Mark was asked “So what was the problem?”, he answered: “Well, many issues. The load was high, but our queries to get data from remote servers were contributing about 60 to 80% of the round-trip time, and there were some issues with the financial models.”

Re-engineering the application with FTP (File Transfer Protocol)

Mark said that he huddled with his team for several weeks, trying to find the bottleneck and quickly devise a solution. They all worked seven days a week, each logging more than 80 hours per week. He was taking a beating from management, operations, traders and financial analysts. He tasked a couple of architects and engineers with performing a thorough analysis. They found the bottleneck and everyone was ecstatic, since no code changes were required. The only modifications were to the names of the data servers. The team decided to build a test bed in each of the four countries and to examine how much speed they could gain by accessing data on local data servers.
They used FTP to copy the test data from the production servers to the test sites. The team ran various load simulations and the access times went down to around one second. Hurrah! Victory, and a celebration was in order. Then Mark got an email from his manager asking him to include some code written by the financial computing team. The code was ready and needed to be part of the release, but the analysts had to fine-tune some parameters of the various financial models before it could be pushed to production. This delayed the release by another two weeks while securities specialists adjusted the cash flow models for the complex securities. Some of their Generalized Linear Models (GLMs) needed to use the data. Mark was not pleased; he wanted separation of concerns, but he complied under pressure. The analysts reran their GLMs and fine-tuned them.

The software was pushed to production on a Friday night and all was ok.

The call that you never want to get on a Sunday night or a Monday morning

While Mark was having dinner with his family in a small restaurant on Sunday night, his phone rang. It was Mr. Sigh from New Delhi. Mark also received a text from the night operations manager in New York: “We need to revert to the old version”. Mark left the table and stepped outside. What the …? “The teams in India and Japan are reporting negative cash flows for a lot of securities. We saw no errors in the logs, nothing …”. Mark called the head of analytics in New York and, since he was out at dinner, asked him to send over the diff between the old release and the new release.

Mark spotted changes in a few lines of code, in the equations that computed the cash flow. Some parameter coefficients had changed dramatically.
He noticed that it was now possible to get a negative cash flow and asked the analyst about it. “We ran our models on the data on your test server,” said the analyst. Mark realized that the models had been calibrated using the stale data set in testing, which was not being updated in real time from production. The unbalanced design was severely unbalanced: several levels of each factor had only a few replications and data points, and thus only a few degrees of freedom.
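
One way to catch this kind of sparsity before calibrating is to count the observations behind each factor level. The following is a minimal Python sketch, not the team's actual tooling: the function name `sparse_levels`, the `rating` factor, the sample rows and the threshold of five observations are all assumptions made for illustration.

```python
from collections import Counter


def sparse_levels(rows, factor, min_count=5):
    """Return {level: count} for levels of `factor` seen fewer than min_count times.

    `rows` is any iterable of dict-like records. The default threshold is an
    arbitrary illustrative value, not a statistical rule.
    """
    counts = Counter(row[factor] for row in rows)
    return {level: n for level, n in counts.items() if n < min_count}


# Hypothetical calibration sample: one well-populated level, two sparse ones.
sample = (
    [{"rating": "AAA"}] * 40
    + [{"rating": "BB"}] * 3
    + [{"rating": "CCC"}] * 1
)

print(sparse_levels(sample, "rating"))  # {'BB': 3, 'CCC': 1}
```

A non-empty result would be a signal to stop and look at the data set before trusting any coefficients fitted to it.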

Mark asked the analyst for clarification and then asked him to rerun the calibration immediately on real-time data so that the equations could be fixed. Within two hours, it was all clear. Mark went home with his now disappointed wife and child, who had been looking forward to their first dinner together in more than a month.

The moral of the story is to make sure that you test on real data and not only on subsets of it. All Mark needed to do was to sync data in real time from production to the test beds.
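
As a rough illustration of keeping a test bed in step with production, here is a minimal one-way sync sketch in Python. It is an assumption-laden example rather than the setup Mark's team used: the function names `file_digest` and `sync_one_way` and the directory paths are hypothetical, and it only refreshes the copy each time it runs, so it would have to be scheduled frequently (or replaced by a dedicated sync tool) to approach real time.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def sync_one_way(src_dir, dst_dir):
    """Copy files that are missing from dst_dir or whose contents differ.

    Returns the relative paths that were copied. Files that exist only in
    dst_dir are left alone; this sketch never deletes anything.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    copied = []
    for src in sorted(src_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(src_dir)
        dst = dst_dir / rel
        if not dst.exists() or file_digest(src) != file_digest(dst):
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # copies contents and timestamps
            copied.append(rel.as_posix())
    return copied


# Hypothetical usage, with made-up paths:
# changed = sync_one_way("/mnt/production/data", "/srv/testbed/data")
```

Comparing content hashes rather than timestamps is a deliberate choice here: it is slower, but it does not depend on the clocks of the two sites agreeing.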

Mark is not out of the woods yet. We will publish another part of the article about his other troubles ….

 

A financial institution’s file sync led to a disaster was last modified: April 2nd, 2018 by aRoot