I’ve spent my entire career in operations and infrastructure-related roles, and over the years I’ve come to understand why most operations teams don’t scale and why they become the bottleneck for most companies. Now, this isn’t going to be an argument for just using Heroku, or AWS, or Cloud Foundry. I’m going to explain why these services contribute to the reasons IT organizations don’t scale, and how you can prevent your company from making the same mistakes “everyone else” makes.
It’s 2015, and we’re quickly approaching 2016. If you haven’t yet figured out that automating the management of your infrastructure with tools like Puppet, Chef, or Salt is the foundation for running infrastructure at scale, then this blog isn’t for you. However, if you have graduated past DevOps 101 and have a configuration management system in place, but are still confused by the fact that your operations team is blocking your organization from 10 deployments per day, listen up.
We’ve obviously learned by now that managing 100s or even 1000s of servers requires cookie-cutter OS installations and standardized database and application server installations, but what many companies have yet to learn is how to streamline the entire application delivery process end to end across ALL applications. See, products like VMware, Mesosphere, Kubernetes, and Cloud Foundry are great: they provide “flexibility”, a word which sounds great in a marketing pitch but should give any seasoned IT professional the willies, because “flexibility” is just another word for “snowflake”.
See, in the “olden days” system admins would write documentation around things like naming conventions and process docs around code promotion, and really good teams even had an environments strategy which defined how code was promoted from one environment to the next, describing which sets of “sign-offs” were necessary to move code between environments. The problem is that by taking advantage of the “flexibility” that AWS or Mesosphere or VMware offered, you often ended up with consistently built infrastructure, but each application deployed on these “flexible” platforms was a unique and beautifully crafted snowflake. If your company is only managing one or two apps that isn’t a huge issue, but if you work in an environment with 100s of APIs and web apps in a microservices architecture, the task of managing, deploying, and promoting these applications quickly turns into a full-time job for a large operations team. Now, before I go any further, I want to assure you that I’m a huge advocate of solutions like Mesosphere, VMware, Kubernetes, AWS, etc.
The Infrastructure Abstraction Layer
Having built multiple self-service deployment platforms for large enterprise companies, platforms which supported 100s to 1000s of deployments each month, I’m consistently asked why I “roll my own solution”. The answer is simple: it’s the only way infrastructure organizations scale. Now, this statement is immediately open to misinterpretation. It would be crazy for any company not to leverage AWS, Mesosphere, Kubernetes, VMware, OpenStack, etc. What isn’t crazy is for an operations organization to transform all of those documented standards around code deployments, code promotion, naming conventions, etc. into code. I haven’t yet found any open-source or commercial software that pulls it all together and defines the “business logic” for how your company wants to name Docker containers or AWS instances, which DNS domain to use in development vs. staging vs. prod, or even predefines an environment strategy for each application and a path forward for promotion. If you look at Heroku and read their documentation, I challenge you to find any “official documentation” for managing environments; on the other hand, a Google search offers multiple suggestions (flexibility / snowflakes). The same is true for Mesosphere, Kubernetes, etc. People write blogs all day on possible options; some even suggest using the same clusters for dev, staging, and prod. However, the problem remains the same: either operations teams are stuck reading a document with a set of steps to deploy and promote code, or engineering teams invent their own solution on these “flexible” platforms, and a wide range of beautiful snowflakes emerge.
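To make “turning documented standards into code” concrete, here is a minimal sketch of what encoding naming conventions and an environment promotion path might look like. Every name, domain, and rule below is an illustrative assumption, not a real product’s API:

```python
# Hypothetical sketch: the "business logic" from a process doc,
# expressed as code instead of a wiki page. All names and rules
# here are made up for illustration.

# Ordered promotion path: code may only move one step to the right.
PROMOTION_PATH = ["dev", "staging", "prod"]

# Per-environment DNS conventions that would otherwise live in a doc.
DNS_DOMAINS = {
    "dev": "dev.example.com",
    "staging": "stg.example.com",
    "prod": "example.com",
}

def instance_name(app: str, env: str, index: int) -> str:
    """Derive a standard instance/container name, e.g. 'billing-stg-003'."""
    short = {"dev": "dev", "staging": "stg", "prod": "prd"}[env]
    return f"{app}-{short}-{index:03d}"

def can_promote(src: str, dst: str) -> bool:
    """Allow promotion only along the defined path, one step at a time."""
    return (
        src in PROMOTION_PATH
        and dst in PROMOTION_PATH
        and PROMOTION_PATH.index(dst) == PROMOTION_PATH.index(src) + 1
    )
```

Once rules like these live in code, a deployment platform can enforce them on every one of your 100s of apps automatically, instead of trusting each team to have read the document.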
I recently met with the VP of Infrastructure for a large Fortune 50 company. He spent half the day explaining “open standards” to me and how he built his entire infrastructure with these “open standards”, which prevented “vendor lock-in”. At the time I wasn’t sure if he was convincing me or himself. I only had about 30 minutes to explain how nothing he presented prevented vendor lock-in; as a matter of fact, by not building an infrastructure abstraction layer, everything the engineering teams did, every bit of code they wrote, promoted vendor lock-in. See, we all know DNS is an “open standard”, but if your company deploys Infoblox, for instance, to manage DNS and tomorrow moves to Route53, every engineering team needs to rewrite their code to support the new API contracts. By building or leveraging an infrastructure abstraction layer, DNS entries can be migrated from one backend system to the next in a few minutes with a code change in the abstraction layer alone, and there is literally no impact to the consumers of that layer, because the contracts they use to consume DNS services never change from their perspective.
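The DNS scenario above can be sketched as code. This is a minimal illustration of the pattern, assuming hypothetical backend classes; the real Infoblox and Route53 client APIs are not shown here:

```python
# Illustrative sketch of an infrastructure abstraction layer for DNS.
# The contract engineering teams consume stays fixed; only the vendor
# adapter behind it changes. Backend classes are stand-ins, not real
# Infoblox or Route53 client code.
from abc import ABC, abstractmethod

class DnsBackend(ABC):
    """The private side of the layer: one adapter per vendor."""
    @abstractmethod
    def create_record(self, name: str, ip: str) -> str: ...

class InfobloxBackend(DnsBackend):
    def create_record(self, name: str, ip: str) -> str:
        # In a real system this would call the Infoblox API.
        return f"infoblox:{name}:{ip}"

class Route53Backend(DnsBackend):
    def create_record(self, name: str, ip: str) -> str:
        # In a real system this would call the AWS Route53 API.
        return f"route53:{name}:{ip}"

class DnsService:
    """The public contract teams code against; its signature never
    changes when the backend is swapped."""
    def __init__(self, backend: DnsBackend):
        self._backend = backend

    def register(self, name: str, ip: str) -> str:
        return self._backend.create_record(name, ip)

# Migrating from Infoblox to Route53 is one line inside the
# abstraction layer; every consumer of register() is untouched.
dns = DnsService(Route53Backend())
record = dns.register("api.example.com", "10.0.0.5")
```

The point isn’t the specific classes; it’s that the only code that knows about the vendor is the adapter, so swapping vendors is an operations change, not an every-engineering-team rewrite.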
While I continue to advocate for solutions like Mesosphere, AWS, or OpenStack, I also strongly believe that for any infrastructure or operations team to scale, they need to focus on taking their “business logic” or “processes” and turning them into code, creating contracts that engineering teams can leverage in a consistent way when interfacing with a wide range of flexible backend services like DNS, PaaS, configuration management systems, etc.