Government and private sector experts said at an August 11 event organized by ATARC that efforts to implement Site Reliability Engineering (SRE) practices, which apply software engineering to DevOps and operations problems, face a number of hurdles, including training issues.
The officials detailed some of the challenges they have been grappling with, and some of the solutions, when integrating DevOps and cloud infrastructure with SRE practices.
Robert Brown, Chief Technology Officer in the Office of Information Technology at U.S. Citizenship and Immigration Services, talked about the training gap he is seeing with the relatively new SRE practices.
“I would say that we’re in a state right now where we have a bit of a training skills gap,” Brown said. “Because really, at the end of the day, we need people to be more like engineers than just operators.”
“That’s probably one of our predominant challenges is that skills gap and getting the right folks in to help us,” he said.
Another difficulty discussed at the panel was the need to implement SRE practices as inexpensively and quickly as possible.
Brian Mikkelsen, Vice President and General Manager, U.S. Public Sector, at Datadog, explained the need to implement SRE practices quickly and without breaking the bank.
“There’s this constant pressure of need to innovate in order to move things forward, reduce costs, reduce churn, and pain in terms of training,” Mikkelsen said. “But at the same time, there’s a challenge to risk avoidance, to legacy systems, to prioritization of your major programs or records that need to be migrated.”
Sunil Madhugiri, Chief Technology Officer at the Department of Homeland Security’s U.S. Customs and Border Protection (CBP) component, talked about CBP’s approach to SRE practices.
“SRE plays, of course as you can imagine, a very important role,” Madhugiri said. “What we have done is taken a two-prong approach, which has made it easier for us to move forward with new SRE approaches.”
“First, we have made everything highly available [so it] can be replicated to make sure that from a disaster perspective, we have things ready when disaster hits. And we also have things in place where we monitor things internally,” he said.