Enduring success in the drive toward foundational improvements in IT systems and infrastructure seldom comes fast, cheap, or easy – or without plenty of lessons to share for the next agency in line pursuing similar goals.
The Justice Department’s Bureau of Alcohol, Tobacco, Firearms and Explosives – more commonly known as ATF – is nearing the finish line of a six-year campaign to shift away from antiquated and unstable on-premises IT systems, and to recreate its next-generation vision of IT securely within all-cloud infrastructure.
In the first installment of a two-part exclusive interview with MeriTalk, Mason McDaniel, ATF’s chief technology officer since late 2015 and a prime mover in the agency’s IT makeover, takes us through the twists and turns of the six-years-and-counting process. His frank recounting of planning and progress – and the mistakes and challenges along the way – makes an invaluable guidebook and (soon-to-be) after-action report for other organizations wanting to take the same route.
First, a bullet-point snapshot of the IT modernization mission and its key milestones:
- Pre-2016 – Mostly IT firefighting on out-of-date on-prem systems, with disaster recovery capabilities severely curtailed;
- 2016 – Assessments and analysis of all applications, data, and databases, resulting in best cloud migration strategies and initial cloud contracts;
- 2018 – Completion of data migration to the cloud, and start of rewriting and updating applications;
- 2019 – A year spent trying to build out cloud-based services based upon unreliable source code derived from on-prem systems, resulting in defects requiring major strategy correction;
- 2020 – Go-live in the cloud with ATF’s case-management portfolio – the first of three major portfolios destined for the cloud;
- December 2021 – Go-live in the cloud with the second major portfolio of licensing applications;
- Coming up soon – Go-live in the cloud for final remaining major portfolio;
- Enduring – Ability to quickly improve systems and adapt to future policy directives.
The Technical Debt Problem
MeriTalk: You’ve spoken publicly for years about the need to get rid of technical debt at ATF. We hear that term sometimes defined as an inability to support software, but also for older systems, an inability to make them secure in the modern age of cyber threats. How do you see it?
McDaniel: That’s a large part of it, plus other aspects unique to our environment. Technical debt is typically discussed in terms of software development, and measured within the software and its code. I take a much broader view of it across the enterprise – it’s really any kind of work that needs to be done to keep a system running the way it should, but that is not done.
For years we had been talking about all the updates that we needed, but not having a lot of success at getting the buy-in to really invest in them. In fact, we were under-investing in IT for many years.
MeriTalk: Who else is in tech leadership at ATF, and how did the leadership team start things moving away from that IT investment impasse?
McDaniel: Our CIO, Roger Beasley, is my boss. We have a great working relationship, and we officially started at ATF on the same day about six years ago. That reflects unusual stability for agency technology leadership, given that CIOs have an average tenure of about two years.
What really turned it around was not communicating IT to agency executives in IT terms, but communicating IT in business terms, and in terms of the mission impact of that technical debt.
One great example of that came in 2016, when a snowstorm dropped 38 inches of snow where our data center is located. We had to evacuate that building for two days because the structural engineers did not think the building was rated to support the weight of that much snow. If that roof had collapsed, our data center would have been completely offline.
What made that more severe was that in 2013 we had taken on a huge amount of technical debt when we shut down our disaster recovery site, largely due to sequestration budget cuts. We had moved the entire disaster recovery site into our primary data center, so we had one aisle of separation between our primary and fail-over systems. If that roof had collapsed, we would have lost both our primary and secondary capabilities, with no timeline to restore either.
In 2013 during budget sequestration, again as part of cost cutting efforts, ATF cut the application operations and maintenance teams. As a result, there was nobody left who knew the source code to our legacy systems – those developers were gone. So, we went years without any updates.
Within my first week at ATF, I was getting demonstrations of our various applications. A subject matter expert was showing me one of them, and I pointed out a typo on the static splash page shown before logging into one app. He said, ‘Yeah, we know. That’s been there for two years, and we don’t have anyone who can fix it.’ I thought, if you’re not fixing that, what other bugs or security vulnerabilities aren’t you fixing?
You are much more likely to get leadership buy-in when you put it that way, rather than discussing capex versus opex, or requesting funds to upgrade Solaris X.
Sizing the Problem
MeriTalk: Help us to understand the size of the task. What’s the size and complexity of ATF’s footprint – how many people do you have to support, how many offices, what’s the geographic spread?
McDaniel: We have around 5,500 employees, with over 7,000 core accounts including some contractors and task force members. The broader ATF IT infrastructure supports about 7,000 people, primarily U.S.-based, in about 200 offices. We have some foreign liaison offices, so we have an international presence, but that’s relatively small compared to our national footprint.
MeriTalk: The older systems that you set out to replace, what was the rough age on those?
McDaniel: About 10 to 15 years old.
MeriTalk: Which doesn’t sound too old…
McDaniel: It isn’t old for people, but it is for dogs and IT. Besides, the technologies were pretty antiquated when they were put in. About 70 percent of our applications were SPARC Solaris, which is obsolete and no longer under support, but also cannot run well in any cloud aside from Oracle’s. We decided not to go that route.
As another example, we had a Java 1.4 application – that version is so old you can’t find manuals for it. We put out a contract to duplicate that existing system for the explosives industry, and the contractor (a Java expert) came back and said it was going to take longer and cost more to build the system because they had to learn the language first. That’s because nobody knows Java 1.4 anymore.
MeriTalk: Tell us about the three major applications portfolios that needed to go to the cloud…
McDaniel: One is our legacy case management portfolio – the core investigative regulatory applications that people in the field have used to do their day-to-day jobs for years. Those are really the core ATF missions and applications, and they are fairly tightly intertwined with each other. We have a separate program that is working through business process reengineering and replacing those, but until that completes, we needed to go ahead and migrate the legacy applications into the cloud.
Next, we have a licensing regulatory mission, where we process various license requests, such as to become a Federal firearms licensee – basically a gun manufacturer or dealer – or a Federal explosives licensee. We also process applications where people want to manufacture, transfer, or export restricted weapons which are highly regulated under the National Firearms Act, such as silencers, machine guns, or destructive devices. Those are all included in this licensing portfolio, and that’s about ten applications with some cross-functionality.
The third portfolio is around our tracing mission. When firearms are recovered at crime scenes, we need to identify the owner of the firearm. ATF is not allowed to maintain a database linking law-abiding citizens with firearms, so there is no place we can go search. Tracing the ownership history of crime guns can be a very labor-intensive process involving many calls to firearms manufacturers and dealers; it is managed by the tracing team and about half a dozen main applications.
Digging out of Debt
MeriTalk: Let’s talk about digging out of that technical debt, and the agency’s drive to go to 100 percent cloud-based systems, or very near to it. What’s been the path, and how close are you to getting there?
McDaniel: We started in January of 2016 when we awarded the initial contracts. It did not start out as a direct migration contract.
We started out doing assessments of our IT systems, our applications, and our data and databases. The output of that was a 700-page analysis document going into all the technical details of the application and data architectures, with recommendations for each application on how cloud-ready it was and the best migration path for it. That information proved critical to us later on as we went forward with the actual migration.
We knew on day one that we wanted to go full-bore into the cloud. We did not want to build out a small cloud infrastructure per application and have them vary all over the place. We wanted to build an enterprise infrastructure in the cloud that all our applications could go into.
So, we spent the first three to four months just building out that infrastructure on AWS, and then about a month later, we got the first cloud-ready production system migrated into the cloud and running in that new environment.
ATF frankly had a poor track record of modernizing our IT systems. Most of our systems were tightly inter-connected at the data level, not through nice, clean APIs. In the past we had tried to pull out individual systems and modernize them in isolation. That hairball of data-level interconnections was largely what caused those efforts to fail. This time, what really made the difference for us is that we flipped it around.
Instead of working on an individual application, we started out focusing on ATF’s entire data tier. We analyzed all our databases, converted from Oracle to our target database format, and migrated all ATF data from our on-premises environment into our shiny new cloud-hosted databases. By converting and migrating the data from all of our systems first, those data-level interdependencies were no longer major hurdles as we started working one by one through the process of migrating individual applications.
We determined that 70 percent of our enterprise portfolio was going to have to be rewritten and rebuilt on new technology stacks. That was part of why our path has taken so long. We were not just lifting and shifting our applications exactly as they were, but have been refactoring and rewriting them.
There was so much to do that we made a key strategic decision. We could not do full business process reengineering (BPR) and redesign every system to the ideal new state ATF wanted. It would have been too complex with too many moving parts to possibly succeed. We decided to pay off the technical debt first – rebuilding our existing capabilities and processes on modern technology frameworks, while redesigning around automation that would enable us to rapidly evolve our business processes and systems once we were in the cloud. Essentially, the finish line of our cloud migration was getting us to the starting line for the kinds of continuous improvement that would enable ATF to rapidly adapt.
MeriTalk: We’ve heard many times that lift and shift may be quick, but you’ll end up with the same lousy system running on somebody else’s machines…
McDaniel: I 100 percent agree with that.
MeriTalk: Is ATF mostly using AWS, or a range of providers?
McDaniel: Our mission applications run primarily in AWS. We have some in Azure, but we’re primarily using Microsoft’s cloud for Office 365 and SharePoint.
MeriTalk: ATF is driving to go to 100 percent cloud-based systems, or very near to it. What have been the major milestones, and how close are you to getting there?
McDaniel: We broke our migration efforts down into the three major portfolios, planning on each group going live independently. We’ve been working on modernizing all of those and moving them to the cloud in increments. We went live with the case management (smallest) portfolio at the end of 2020, and the licensing (largest) portfolio at the end of 2021. We are ramping up on the third and final tracing portfolio now.
MeriTalk: Any hiccups the first time out?
McDaniel: We started with the smallest of the three groups of applications, but since it was our case management applications, it was critical to get it right. When we tried to go live with it, though, one of the applications had some significant problems that would not have been acceptable to our users and would typically have caused a failed deployment.
We were deploying into our new cloud environment, though, using our new automated deployment pipeline. During the deployment outage window, the system owners and software developers worked closely together, found the underlying causes of the problems, made code changes to fix them, and passed them through the full testing regimen.
Instead of a failed deployment and rolling back to the previous system, we were able to “fail forward.” Within hours of finding the defect, we had fixed it, tested the fixes, and successfully went live with the corrected version of the applications. We saw that as a huge success, despite the fact that the deployment initially failed.
Then, in late December 2021, despite major challenges we went live with our licensing portfolio.
MeriTalk: So out of the three portfolios, you have one still operating from a data center – the tracing portfolio – but getting close to moving that to the cloud. Is there any date you are looking at for that one?
McDaniel: We have a small set of applications left, and we’re going to be rewriting those and migrating them over. For now, there are still quite a few unknowns as we really dig into those apps. So, we are not putting a firm date on it yet, but we are planning to complete them this year. As soon as we finish that last migration, we will be able to shut down and decommission the data center.
MeriTalk: And then you’ll be all-cloud, and ready to take advantage of what would appear to be much more of a continuous modernization path?
McDaniel: Exactly. So much of the discussion around cloud centers on cost savings. But my boss and I agreed early on, and we made the pitch to our management, that we are not approaching it from a cost savings perspective.
Instead, we’re approaching it from a functionality perspective – what it’s going to give the agency when we come out of it. We’ve been working on paying off technical debt and rebuilding vital capabilities. We had been under-investing so much that even with the efficiencies of the cloud, it was going to cost more because we were going to be doing it right. But as I put it earlier, this journey is getting us to the starting line for continuous improvement, so we won’t fall into the same technical debt hole in which we found ourselves.
In the second part of our interview tomorrow, McDaniel takes us through the end of the story – including how ATF navigated through course corrections in the IT rebuild, and what kind of mission payoffs the agency will be enjoying for years to come.