Problem:
Recently a report was handed over to me. Its job is to reconcile the trades received by our real-time SWIFT messaging system with the data stored in the investment admin system. The report has to run twice a day and requires near-real-time data.
During testing, we noticed that the report took about two minutes to complete. When it ran while a large number of SWIFT messages were being fed into the system, we sometimes saw database timeout errors.
Analysis:
After digging into this report, I found that the bottleneck of the underlying SQL query was retrieving the records received within a given date-time range, based on a "ReceiveDateTime" field in the message table. No index is defined on this field, and adding one is something we do not want to do at all. Our messaging system is very busy, especially during peak windows. In our testing, an index on this field made the message persistence process more than twice as slow, while the report only ran around 30 seconds faster.
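For context, the problematic filter looked something like the one below. The table name dbo.Message is a hypothetical stand-in for our real message table:

    -- With no index on ReceiveDateTime, this predicate forces SQL Server
    -- to scan the whole clustered index of the message table.
    SELECT *
    FROM dbo.Message
    WHERE ReceiveDateTime >= @StartDateTime
      AND ReceiveDateTime < @EndDateTime;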
What options do we have, then?
Options:
A. Set up a reporting database using SQL Server transactional replication to replicate near-real-time data from the production database to the reporting database. The report can then be executed against the reporting database.
B. Further tuning of the underlying SQL query.
I tasked one team member with investigating option A. His testing showed that transactional replication can replicate the changes within seconds. However, we do not have the luxury of setting up a standalone publishing server; the publisher would have to live on the same database server that already hosts a number of databases, and we cannot tell what performance impact that would have on the server and its various applications without thorough testing. In addition, this report was developed as a short-term solution. Do we really want to spend the time and effort this option requires on a one-off solution?
What is the real problem we are trying to solve?
I asked myself: given a date-time range, how can I retrieve the data from a database table quickly? As we all know, for a large amount of data, a clustered index seek is the most efficient way to select it. Is it possible to transform the problem of retrieving records within a date-time range into a problem of looking up records between a start message ID and an end message ID? After examining the data at hand, the answer was YES.
The primary key of the message table, the message ID, is an auto-incrementing identity field, and the "ReceiveDateTime" field is set at the time the message is inserted into the table. What does that mean? It means the two values grow together: the data are ordered both by the primary key and by the "ReceiveDateTime" field. The problem of searching records by "ReceiveDateTime" can therefore be translated into a problem of searching for messages by message ID.
For example, when asked to retrieve the range of data between "2012-07-01 08:00" and "2012-07-01 15:00", you are actually asked to find:
1. The start message ID: the one whose receive date time is the closest to, and no earlier than (>=), "2012-07-01 08:00".
2. The end message ID: the one whose receive date time is the closest to, and no later than (<=), "2012-07-01 15:00".
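Once those two boundary IDs are known, the report query becomes a simple range seek on the clustered primary key. A minimal sketch, using a hypothetical dbo.Message table with MessageId as its identity primary key:

    -- @StartMessageId and @EndMessageId are located by the search described below.
    -- MessageId is the clustered primary key, so this runs as a clustered index seek.
    SELECT *
    FROM dbo.Message
    WHERE MessageId BETWEEN @StartMessageId AND @EndMessageId;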
What is the classic search algorithm that pops into your mind when given a set of sorted data? To me, it is binary search. I then implemented in SQL an adapted version of the binary search algorithm to locate the closest message ID matching a given start date time, and the closest message ID matching a given end date time (see the sketch after the steps below).
I did not want the search to cover all the records stored in the table, so I implemented the following to take a best guess at the start and end message IDs before any binary search begins:
Given:
The average number of messages received per day: N.
The top 1 message ID, hm, and its ReceiveDateTime, hd, selected in descending ID order from the message table at the time of searching (the newest message).
The top 1 message ID, lm, and its ReceiveDateTime, ld, selected in ascending ID order from the message table at the time of searching (the oldest message).
1. If the given start date time > hd, the start time is in the future and there is no point in searching. Return.
2. If the given start date time < ld, the start time is out of scope. Return.
3. If the given start date time = hd or = ld, bingo: return the corresponding message ID, hm or lm.
4. Calculate the difference D, in days, between the current date and the start date time.
5. Take a guess at the start message ID: a = hm - (D + 1) * N.
6. If the receive date time of message a is <= the given start time, binary search for the start message ID with the message ID ranging between a and hm.
7. If the receive date time of message a is > the given start time, take another guess (lower a further) until its receive date time is <= the given start time, then repeat step 6 to locate the closest message ID matching the given start date time.
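Putting the guess and the binary search together, here is a T-SQL sketch of the start-ID search. The table dbo.Message, the column names, and the @AvgPerDay figure are assumptions for illustration only:

    -- Sketch: locate the closest MessageId with ReceiveDateTime >= @StartDateTime.
    DECLARE @StartDateTime DATETIME = '2012-07-01 08:00';
    DECLARE @AvgPerDay BIGINT = 50000;   -- N: average daily volume (made-up figure)
    DECLARE @hm BIGINT, @hd DATETIME, @lm BIGINT, @ld DATETIME;

    SELECT TOP 1 @hm = MessageId, @hd = ReceiveDateTime
    FROM dbo.Message ORDER BY MessageId DESC;          -- newest message
    SELECT TOP 1 @lm = MessageId, @ld = ReceiveDateTime
    FROM dbo.Message ORDER BY MessageId ASC;           -- oldest message

    IF @StartDateTime > @hd OR @StartDateTime < @ld
        RETURN;                                        -- steps 1 and 2: out of scope

    -- Steps 4 and 5: initial guess, working back from the newest message.
    DECLARE @D INT = DATEDIFF(DAY, @StartDateTime, GETDATE());
    DECLARE @lo BIGINT = @hm - (@D + 1) * @AvgPerDay;
    IF @lo < @lm SET @lo = @lm;

    -- Step 7: widen the guess until it falls at or before the start time.
    -- Identity values can have gaps, so probe the first ID at or above the guess.
    DECLARE @probeTime DATETIME;
    SELECT TOP 1 @probeTime = ReceiveDateTime
    FROM dbo.Message WHERE MessageId >= @lo ORDER BY MessageId;
    WHILE @probeTime > @StartDateTime AND @lo > @lm
    BEGIN
        SET @lo = @lo - @AvgPerDay;
        IF @lo < @lm SET @lo = @lm;
        SELECT TOP 1 @probeTime = ReceiveDateTime
        FROM dbo.Message WHERE MessageId >= @lo ORDER BY MessageId;
    END

    -- Step 6: binary search [@lo, @hm] for the smallest ID at or above which
    -- the first existing message was received at or after @StartDateTime.
    DECLARE @hi BIGINT = @hm, @mid BIGINT, @midTime DATETIME;
    WHILE @lo < @hi
    BEGIN
        SET @mid = @lo + (@hi - @lo) / 2;
        SELECT TOP 1 @midTime = ReceiveDateTime
        FROM dbo.Message WHERE MessageId >= @mid ORDER BY MessageId;
        IF @midTime >= @StartDateTime
            SET @hi = @mid;
        ELSE
            SET @lo = @mid + 1;
    END

    DECLARE @StartMessageId BIGINT;
    SELECT TOP 1 @StartMessageId = MessageId
    FROM dbo.Message WHERE MessageId >= @lo ORDER BY MessageId;
    -- @StartMessageId now holds the closest ID with ReceiveDateTime >= @StartDateTime.

Each probe is a single-row clustered index seek on the primary key, so even over millions of rows the whole search touches only a handful of pages.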
A very similar algorithm, with the comparisons mirrored, locates the closest end message ID matching the given end date time: there the binary search keeps the largest ID whose receive date time is <= the given end date time.
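A sketch of the mirrored binary-search core, under the same assumptions as above, with @lo and @hi bracketing the answer much as in the start-ID search:

    -- Mirrored core: find the largest ID at or below which the last existing
    -- message was received at or before @EndDateTime.
    WHILE @lo < @hi
    BEGIN
        SET @mid = @lo + (@hi - @lo + 1) / 2;          -- bias the midpoint upward
        SELECT TOP 1 @midTime = ReceiveDateTime
        FROM dbo.Message WHERE MessageId <= @mid ORDER BY MessageId DESC;
        IF @midTime <= @EndDateTime
            SET @lo = @mid;                            -- @mid is still early enough
        ELSE
            SET @hi = @mid - 1;
    END
    -- The end message ID is the last existing ID at or below @lo.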
After this change, together with some other minor query tweaks, the improvement was dramatic: the running time of this SQL query dropped from around two minutes to less than one second.