Building DevOps Greatness Part III: Success and Growth
This is Part III of our series about building DevOps greatness. Parts I and II presented the foundations for establishing DevOps, the three principles involved, and matching up the right people. This post dives deeper into the technical challenges of implementing each of the three principles, integrating InfoSec into the DevOps process, and the future of DevOps.
DevOps organizations succeed by adopting principles in three areas. First, the principle of flow optimizes lead times via continuous delivery. Second, the principle of feedback uses telemetry to understand and react to the current status. And third, the principle of continuous learning leverages the first two principles to build a culture of positive habits and experiments.
The principles of flow and feedback require more technical effort than the principle of continuous learning, and achieving these technical goals is no small task. However, these challenges, along with clear-cut goals from management, are serious motivators to teams of engineers. And this in turn creates a positive culture that grows people’s skills and increases employee retention.
Let’s begin at the beginning and discuss how a team develops and ships software.
Trunk-Based Development: Time Well Spent
Continuous delivery almost sells itself as organizations see the inherent value in moving quickly and providing constant value to the customer. However, many trip over the biggest detail of implementing CD: trunk-based development. Trunk-based development makes merging into trunk (or master, or mainline) part of everyone’s daily work, which obviates long-running topic branches, messy merges, and complex branching strategies.
Gary Gruver’s experience as the director of engineering for HP’s LaserJet Firmware division demonstrates the benefits:
- Time spent by developers on driving innovation and writing new features increased from 5% to 40%.
- Overall development costs fell by about 40%.
- The number of programs under development increased by about 140%.
- Development costs per program were reduced by 78%.
But unlike the overall approach of CD, trunk-based development is a hard sell to people who’ve spent years working in their own long-running topic branches, handling difficult merges, or using branching strategies such as the Gitflow Workflow. Gruver himself experienced a difficult transition. The truth is, adopting trunk-based development requires a mental shift, just like adopting DevOps.
Stack Overflow co-founder Jeff Atwood summed it up nicely when he described the two ways to work with code: private branches optimize individual productivity; shared branches optimize team productivity. DevOps emphasizes the team over the individual, and the same holds for working with code in version control.
Trunk-based development also means working in smaller batches—a natural side effect of daily (or more) commits where it is simply not possible to create huge chunks of code.
Smart Monitoring: High Signal-to-Noise Telemetry
Software must be monitored once it goes into production. This is where the principle of feedback takes over. Telemetry data such as error counts or response latencies helps teams detect production issues. Telemetry data such as user conversions, product funnels, or other business KPIs shows whether business decisions are working as expected. But bear in mind that these two categories are not meant for different audiences, and all parties should have access to all data.
Poor user conversions, product funnels, and other business KPIs may be caused by technical errors or other product-related decisions, meaning that building, using, and improving this type of telemetry data provides insights into all levels of a business to everyone’s benefit.
Teams can flounder here by creating mountains of telemetry data that produce more noise than anything else, so identifying high-value telemetry at the outset is essential. Plaster the high-value data in a deploy dashboard where everyone in the team can see and use it to verify that deployments are working in production and that business features are performing correctly. Here are some examples of high-value telemetry:
- Purchases (for an ecommerce business)
- Posts (for social networking businesses)
- Time for key product features (such as time to post an ad, review an item, or sign up)
- Successful / failed interactions (such as button clicks, form submissions, or pretty much anything else!)
- Signups (for any business)
- Failed logins (for any business)
- Error responses (for operations)
- Latencies (for operations)
Odds are that you know the most important KPIs for your team, business, or service, but make sure your telemetry strategy provides this data in real time. This is critical, since continuous delivery may push changes multiple times an hour, and any lag time in business KPIs is unacceptable. Also, remember to account for this requirement across the engineering process.
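To make the real-time requirement concrete, here is a minimal sketch of a sliding-window counter that reports how many times a business event happened in the last minute. The class name, the event names, and the 60-second window are all assumptions made for illustration; a production setup would more likely send these events to a dedicated metrics system than keep them in memory.

```python
import time
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Counts named events (e.g. "purchase", "failed_login") over a recent window."""

    def __init__(self, window_seconds=60.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the sketch can be tested without sleeping
        self._events = defaultdict(deque)  # event name -> timestamps, oldest first

    def record(self, name):
        """Record one occurrence of the event right now."""
        self._events[name].append(self.clock())

    def count(self, name):
        """Return how many `name` events fall inside the current window."""
        cutoff = self.clock() - self.window_seconds
        timestamps = self._events[name]
        # Drop timestamps that have aged out of the window.
        while timestamps and timestamps[0] <= cutoff:
            timestamps.popleft()
        return len(timestamps)
```

A deploy dashboard could poll `count("purchase")` or `count("failed_login")` every few seconds, so a drop or spike shows up within moments of a deployment rather than in a nightly report.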
Teams can misstep by placing telemetry too far back from the user as well. Consider a common web service. There may be telemetry for what happens on the server side but not on the client (user) side. This is where Real User Monitoring (RUM) comes into play. RUM is an important piece of the telemetry strategy because it tells you exactly what’s happening on your users’ computers or smartphones. It can be used to detect performance issues between your users and your services, problems with their software, and user-specific errors, and to track usage, to name a few.
However, RUM is really only suitable for use in production environments and doesn’t make sense in controlled pre-production environments. RUM tools are also more costly than other telemetry solutions and are generally priced according to the total number of users, so be sure to account for this in your budget.
The goal here is not to produce endless data streams. It is to move KPIs, such as lead time, deployment success rate, the ratio of failed to successful interactions, and other business targets, in the right direction. Your telemetry data should help you understand the root causes behind these KPIs and act accordingly. If it doesn’t, then it’s not worth your time.
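The KPIs named above reduce to simple arithmetic once the raw counts and timestamps are available. The following is a sketch only: the function names and input shapes are assumptions, and the median is just one reasonable way to summarize lead time.

```python
from datetime import datetime, timedelta
from statistics import median


def deployment_success_rate(succeeded, failed):
    """Fraction of deployments that succeeded; 0.0 when nothing was deployed."""
    total = succeeded + failed
    return succeeded / total if total else 0.0


def failure_ratio(failed, successful):
    """Ratio of failed to successful interactions; None when there were no successes."""
    return failed / successful if successful else None


def median_lead_time(changes):
    """Median commit-to-production time for (committed_at, deployed_at) datetime pairs."""
    return median(deployed - committed for committed, deployed in changes)
```

Tracking how these values move from week to week is usually more informative than any single reading.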
Grow a Resilient Culture
Inevitably, as organizations learn how to see and solve problems efficiently, they need to adjust their thinking. This is where the principle of continuous learning comes into play. Teams should strive to value improvement in their daily work more than the daily work itself. This mindset provides everyone on a team a place to excel, which is a great way to foster and grow DevOps talent.
Building such a culture begins with blameless failures and post-mortems. Blameless post-mortems state the issue as it happened with actions taken along the way; their goal is to expose the root cause of a problem and determine corrective actions. In fact, post-mortems are really the regression tests for your own processes. It’s easy to take a negative view on failures, especially when they’re catastrophic, but they actually contain an excellent opportunity to focus on continuous improvement.
The 2014 State of DevOps Report indicated that high-performing DevOps organizations will fail and make mistakes more often. The difference is that these companies are more likely to overcome their failures and improve because of them. Roy Rapoport from Netflix recounts one story about an engineer who brought down Netflix’s production environment twice in 18 months. This resulted in Netflix’s automation moving forward not miles but light years, which enabled more frequent deployments with higher safety. DevOps allows this type of innovation and risk taking. It’s a framework for growth, wherein a team can improve while working in a safe environment.
Rafael Garcia, Acting VP of R&D IT at Hewlett Packard Enterprise, stated at the 2015 DevOps Enterprise Summit: “Internally, we described our goal as creating ‘buoys, not boundaries.’ Instead of drawing hard boundaries that everyone has to stay within, we put buoys that indicate deep areas of the channel where you’re safe and supported. You can go past the buoys as long as you follow the organizational principles. After all, how are we ever going to see the next innovation that helps us win if we’re not exploring and testing at the edges?”
Boundaries and buoys help engineers learn, experiment, and grow. Again, this is one surefire way to make engineers feel appreciated and want to stay in your organization. The DevOps community has its own set of buoys and boundaries, and if you push DevOps thinking far enough, you’ll start to wade out into InfoSec waters. Luckily, existing organizations are redefining this area and pushing their buoys farther out to sea.
DevSecOps: The Next Logical Step
One of the top objections to implementing DevOps is that “Information security and compliance won’t let us.” The truth is, DevOps practices are the best way to radically transform and improve InfoSec. This is especially relevant since the typical ratio of developers to operations to InfoSec staff is 100:10:1. In order to succeed with auditing and compliance requirements, organizations need to leverage the knowledge of existing InfoSec staff via automation to bring InfoSec concerns into everyone’s daily work.
DevSecOps and Rugged DevOps are two different practices that leverage the principle of feedback to help achieve this. Their goals are to provide both Dev and Ops with fast feedback on their work so that they are notified whenever they commit changes that are potentially insecure. These automated security tests are run as part of the deployment pipeline along with other static analysis tools, but testing is not confined to static analysis. Dynamic testing tools, such as OWASP ZAP, probe running applications for known vulnerabilities and poor practices as well.
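One way to turn scanner output into fast feedback is a gate step that stops the pipeline when serious findings appear. The sketch below assumes findings have already been normalized into dictionaries with `severity`, `rule`, and `location` keys; that shape, the severity names, and the default threshold are hypothetical and not the output format of any particular tool.

```python
# Hypothetical severity scale; real scanners each define their own.
SEVERITY_ORDER = {"low": 1, "medium": 2, "high": 3, "critical": 4}


def blocking_findings(findings, fail_at="high"):
    """Return the findings at or above the `fail_at` severity."""
    threshold = SEVERITY_ORDER[fail_at]
    return [f for f in findings if SEVERITY_ORDER[f["severity"]] >= threshold]


def gate(findings, fail_at="high"):
    """Return a process exit code: 0 lets the pipeline continue, 1 stops it."""
    blockers = blocking_findings(findings, fail_at)
    for finding in blockers:
        print(f"BLOCKING [{finding['severity']}] {finding['rule']}: {finding['location']}")
    return 1 if blockers else 0
```

A pipeline step could pass the result of `gate(...)` to `sys.exit`, so the developer who committed the change sees the failure within minutes.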
InfoSec becomes more approachable by adding security-related telemetry, including items such as the number of detected SQL injection attempts (determined by looking for keywords in input fields), password resets, failed logins, security group changes, firewall changes, or networking changes. Etsy’s Nick Galbreath shared an internal graph of potential SQL injection attacks in his talk “DevOpsSec: Applying DevOps Principles to Security” at DevOpsDays Austin 2012.
The simple act of increasing visibility made developers realize they were being attacked all the time and thus change their attitude towards security.
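Counting potential SQL injection attempts by looking for keywords in input fields can be sketched in a few lines. The pattern list below is a deliberately small, illustrative assumption; it will miss many real attacks and can flag innocent input, so it is useful as a visibility signal for a graph, not as a defense.

```python
import re

# Illustrative patterns only; real attack traffic is far more varied.
SQLI_PATTERN = re.compile(
    r"\bunion\s+select\b|\bor\s+1\s*=\s*1\b|\bdrop\s+table\b|;\s*select\b",
    re.IGNORECASE,
)


def looks_like_sqli(value):
    """Return True when a single input value matches one of the patterns."""
    return bool(SQLI_PATTERN.search(value))


def count_sqli_attempts(form_submissions):
    """Count submissions (dicts of field name -> value) with any suspicious field."""
    return sum(
        1
        for fields in form_submissions
        if any(looks_like_sqli(value) for value in fields.values())
    )
```

Emitting this count as a metric alongside failed logins and password resets gives developers the same kind of visibility described above.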
Telemetry may also be used to overcome auditing bottlenecks. Today’s engineering processes produce more data than ever before, including server text logs, chat room history, deploy logs, and more. All this telemetry can be piped to something like Splunk or Kibana to provide self-service access to auditors. The DevOps Audit Defense Toolkit documents the end-to-end process for designing a deploy pipeline to mitigate stated risks and includes examples of control attestations and artifacts to demonstrate effectiveness.
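Deploy logs are easiest for auditors to search when each event is written as one structured line. The sketch below builds such a line as JSON; the field names (`deployer`, `approved_by`, and so on) are assumptions chosen for illustration, not a schema required by Splunk, Kibana, or the toolkit.

```python
import json
from datetime import datetime, timezone


def deploy_audit_event(service, version, deployer, approved_by, timestamp=None):
    """Build one JSON line describing a deployment, ready for a log shipper to collect."""
    timestamp = timestamp or datetime.now(timezone.utc)
    return json.dumps(
        {
            "event": "deploy",
            "service": service,
            "version": version,
            "deployer": deployer,        # who triggered the deployment
            "approved_by": approved_by,  # who reviewed or approved the change
            "timestamp": timestamp.isoformat(),
        },
        sort_keys=True,  # stable key order keeps lines easy to diff and test
    )
```

Writing one such line per deployment to standard output or a log file gives auditors a searchable record of who deployed what and when, without anyone having to assemble it by hand.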
Applying these principles and practices will take your team past the InfoSec hurdle and closer to building DevOps greatness.
We’ve covered a lot of ground in this series. We’ve touched on the foundational principles and practices, technical challenges, and tackling InfoSec. Now we’re left to consider what’s next. Let’s extrapolate out into the future by applying the principles and collapsing organizational silos. It seems to end in a place where there are no operations, everything is automated, and there are no servers. This sounds like the confusingly named NoOps movement.
NoOps, contrary to its name, is not about removing operations. Instead, it’s about climbing the abstraction ladder to a place where everyone on the team has access to automated operational procedures for things like environment setup, application deployment, or scaling. Engineers will then have more time to spend working on building products or services instead of worrying about how to operationalize them.
The industry is certainly moving in this direction with technologies such as containers, a plethora of managed PaaS offerings, and now serverless architecture or FaaS. It’s important to understand that these technologies are not obviating operations; they are simply pushing it up the stack. Engineers of the future may be less capable of building and operating infrastructure but more capable of building operational characteristics into their software.
It’s important to remember that all these things exist in a continuum, and the software development life cycle (SDLC) does not end with DevOps, or even NoOps. As tired and corny as it may sound, it’s all a journey. Every team or organization will always have improvements to make and mistakes to learn from, and you will make progress as long as you keep pushing forward with the principles of flow and feedback. And don’t be afraid to apply the third principle, continuous learning, along the way. You’re building DevOps greatness that enables teams to work together to survive, learn, thrive, and delight customers as well as help the organization succeed in the marketplace.