Fault Tolerance at Mahul Branch
Within six months of joining CMS, I was assigned the role of ‘Escalation Support Engineer’, with an important KRA: reducing the escalations raised to the Central Tech Support Team. This meant resolving every escalation forwarded to me, which included server, network and even OS-level issues.
My typical work day used to start after lunch, even though attendance started at 9:00 am, and while my other colleagues left at 6:00 pm, the HD Team and I were regulars in the office till 9:00 pm. From noon onwards my pager would start beeping with emergencies. Most got resolved over phone calls; however, to make life more challenging, many a time a personal onsite visit was required.
The ‘Vadapaav shop’ (pls. google vada pav 😊) adjacent to the office was a blessing, and as regular customers we enjoyed its privileged customer membership. It was Friday, April 29th, 1999; I was enjoying my vada pav when the HD Coordinator came running to say that Mr. Kurian was looking for me.
Kurian John, VP Sales, was a dashing personality: 5’6″ in height, always in casuals, far from the stereotypical suited-and-booted Sales VP. But man, what skills he had! I remember he was the only one who consistently overachieved the numbers for the region. So I had every reason to become tense when I got his message. In my heart of hearts, I was happy that he had already left when I reached his office. His PA connected me with him over the phone, and there he was: ‘Swapnil, my friend, we need a miracle to happen. Deepak told me you would be the guy. Have a word with the team and sort this out.’
Now, this was very surprising, as Deepak happened to be GM Services, and when the recommendation came from the GM I realized it was a serious escalation. I called the team; it was 10:45 pm and they were going berserk. They just kept repeating, ‘Swapnil, we got this order on our technical merits and this issue is still not resolved as per the SLA. Vivek’s team has screwed up.’ Now that was more serious, as Vivek was my reporting manager and the issue had been escalated to GM Services, bypassing others. I took a cab to ‘Mahul Village’ (yes, I had the privilege of a cab, while other engineers had to justify one 😊).
When I reached the site, I met the Bank Manager (the customer), who was standing there tense. We spoke and I assured him that the issue would be resolved. Then I met my team's Field Engineer, who was sitting in the midst of the servers with no clue how to solve the issue. He breathed a sigh of relief when he saw me. As per the organization's process protocol the issue was now owned by me, so, like in a war movie, I called our HD Team and said in my style, ‘Taking over Mahul Issue’. The poor Field Engineer had not faced such a situation before, so he could not fix the issue, but he briefed me on what had been reported and what steps he had followed. He had been struggling since the afternoon, and by then it was past 11:00 pm.
I started looking into the problem. In those days many nationalized banks had their application running on Novell NetWare SFT III. This setup had been down since the afternoon, and the branch couldn’t perform its End of Day (EOD) procedure. It took almost an hour to conclude that there was a problem with the heartbeat card and even the system board. Replacing a system board was a long process. The replacement board came in around 3:00 am, and by 5:00 am I was able to start one server, which managed to run EOD. I then focused on the second server, but the customer had some relief that they could start work. The second server had a problem with one SCSI controller card. Around 7:00 am I briefed the branch manager that he could run the branch operation on one server and that the second server would be functional by afternoon; since it was urgent, I would go to the office and get the required controller.
Only then did the Bank Manager relax, and he said, ‘Swapnil, I understand you have worked overnight and that you would get the required part to restore the redundancy, but I request you to stay back. Today is salary day and all the refinery workers will be visiting the branch to withdraw money. If for some reason they can’t withdraw it, it will become a major issue.’ He took me out to the main lobby, where some 300-odd refinery workers had gathered. The branch manager continued, ‘If the system fails and we tell them about systems, they will ransack the branch, but if they see someone trying to fix it, we can calm them down.’
It was like the climax of Argo, very tense. The other support engineer went to the office and got the card, and the second server's problem was solved. I reported back to the office: ‘Issue resolved, going back to normal.’ The branch manager thanked us personally, and as we walked out of the server room towards the main exit, we were taken by surprise to see the bank staff members clapping for us. That was a very proud moment, etched in my heart and memory!
Swapnil Gupte, Lead Solution Architect – Alpha Data