Within 24 hours of rolling out the system for live testing, a major issue was uncovered.
The Good News:
The good news was that, because the live test covered only a couple of card types and ran side by side with the old system, there was zero downtime.
The Bad News:
Unfortunately, the problem affected a fundamental design principle of the new system. In a departure from previous ID systems, the new system was designed from the ground up to store all images within the database itself (previous systems stored the image files in the file system).
During local testing, database storage worked well and everything ran smoothly. The images stored in the system ranged from 50 KB to 100 KB (sizes chosen as an average of the images used for previous card types). The number of issued cards tested was moderate, ranging from 5 to 500 for a given test card type.
The problem found in live testing was that the images used for the test card types ranged from 200 KB to 5 MB. The system itself had no difficulty dealing with these images; the problem arose with the scheduled backup processes.
After a single day with only 2 card types migrated, the database file had grown to 2.5 GB. By comparison, the previous system's backup was a mere 350 MB (with image files handled separately). Whilst 2.5 GB isn't a huge amount of space, the resource allocation for the client's virtual servers meant that they would struggle with backups that were already 7x larger. Considering the initial rollout was only for 2 card types, a 100% rollout would have meant a substantially larger backup file (estimates put it at 25 GB, roughly 70x larger than the existing backups).
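For anyone wanting to check the arithmetic, here is a quick sketch of the ratios using only the figures quoted above. It assumes decimal units (1 GB = 1000 MB), and the constant and function names are made up for illustration.

```python
# Figures quoted in the post, in megabytes.
# Assumption: decimal units, so 1 GB = 1000 MB.
OLD_BACKUP_MB = 350            # previous system's backup (images stored separately)
PILOT_BACKUP_MB = 2.5 * 1000   # database file after one day with 2 card types
FULL_ESTIMATE_MB = 25 * 1000   # estimated database size at 100% rollout


def growth_factor(new_mb: float, old_mb: float) -> float:
    """Return how many times larger new_mb is than old_mb."""
    return new_mb / old_mb


pilot_factor = growth_factor(PILOT_BACKUP_MB, OLD_BACKUP_MB)
full_factor = growth_factor(FULL_ESTIMATE_MB, OLD_BACKUP_MB)
estimate_vs_pilot = growth_factor(FULL_ESTIMATE_MB, PILOT_BACKUP_MB)

print(f"Pilot backup vs old backup: {pilot_factor:.1f}x")
print(f"Full rollout estimate vs old backup: {full_factor:.1f}x")
print(f"Full rollout estimate vs pilot: {estimate_vs_pilot:.0f}x")
```

With these inputs the pilot comes out at about 7.1x and the full rollout estimate at about 71.4x, which matches the rounded 7x and 70x figures. The estimate is also 10x the size of the two-card-type pilot.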
After careful consideration, it was decided that the system would be changed to store image files in the file system. This decision was not taken lightly, and it was by no means an easy one to make, considering database storage for the images was such a fundamental concept of the new system. Ultimately, however, the continued increase in average image size was the key factor behind the decision. As capture equipment becomes more advanced and captures higher quality images, image sizes are only going to keep trending upward.
Stress Free Changeover:
Fortunately, the structure behind the scenes of the new system allowed the image storage to be changed relatively easily. Even though it was such a fundamental piece of the system, the changeover was relatively painless, and for the most part it was completed in less than a day.
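To illustrate the general idea of file system storage (this is not the actual code from the system), here is a minimal sketch in Python. The image bytes go to disk and the database keeps only a small path reference, so the database backup stays small. The class name `FileImageStore`, the `card_images` table, and the use of SQLite are all assumptions for the example.

```python
import sqlite3
import tempfile
from pathlib import Path


class FileImageStore:
    """Hypothetical store: image bytes live on disk, the database holds only the file name."""

    def __init__(self, db: sqlite3.Connection, root: Path) -> None:
        self.db = db
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS card_images ("
            "card_id INTEGER PRIMARY KEY, image_path TEXT NOT NULL)"
        )

    def save(self, card_id: int, data: bytes, suffix: str = ".jpg") -> Path:
        """Write the image to disk and record its file name against the card."""
        path = self.root / f"{card_id}{suffix}"
        path.write_bytes(data)
        self.db.execute(
            "INSERT OR REPLACE INTO card_images (card_id, image_path) VALUES (?, ?)",
            (card_id, path.name),
        )
        self.db.commit()
        return path

    def load(self, card_id: int) -> bytes:
        """Look up the file name in the database and read the image from disk."""
        row = self.db.execute(
            "SELECT image_path FROM card_images WHERE card_id = ?", (card_id,)
        ).fetchone()
        if row is None:
            raise KeyError(card_id)
        return (self.root / row[0]).read_bytes()


# Small demonstration using an in-memory database and a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    store = FileImageStore(sqlite3.connect(":memory:"), Path(tmp) / "images")
    store.save(42, b"fake image bytes")
    print(len(store.load(42)), "bytes read back from disk")
```

Keeping the rest of the application talking only to `save` and `load` is one way a storage change like this can stay contained, since nothing else needs to know where the bytes actually live. The trade-off is that the image folder now needs its own backup alongside the database.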
Live Testing – Take 2:
Two days after the initial live testing rollout, the second live testing rollout took place. Once again just 2 card types were migrated for testing purposes. So far we have had 6 days of live testing with the updated system, and no other major issues have been uncovered. Multiple (mostly minor) bugs have been found and quickly fixed. I am confident that the system will continue to test well, and within the next week we shall migrate additional card types over.
While a major issue was found, I feel the considered and planned approach to the testing has enabled it to be relatively successful. The scaled rollout allowed me to resolve issues more easily and ensured the client had no downtime (even with a test system).
A good lesson to take from this scenario is that no matter how much unit testing and bug hunting you do, there are often issues outside the realm of the application itself that can affect it, and these are often not picked up until it is too late and the system is out in the wild. However, considered planning and vigilance should help reduce the problems these issues cause.
Gradual rollout in parallel with old system… If only someone had tried that with Novopay…
Hopefully other government departments will learn from these mistakes (it will be interesting to see what the IRD ends up doing).