August 24, 2017
Updated 7/24/2014: This blog post was updated to more accurately reflect Arup’s talk.
Arup Chakrabarti, PagerDuty’s operations engineering manager, stopped by Heavybit Industries’ HQ to discuss the biggest mistakes an operations team can make and how to head them off. To check out the full video, visit Heavybit’s video library.
A lot of people use personal accounts when setting up enterprise infrastructure deployments. Instead, create new accounts using corporate addresses to enforce consistency.
Be wary of how you store passwords. Keeping them in your git repo could require you to wipe out your entire git history at a later date. It’s better to save passwords within configuration management so they can be plugged in as needed.
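As a minimal sketch of the idea, the application reads its secrets from the environment, where a configuration-management tool can inject them at deploy time (the variable name and value here are made up for illustration):

```python
import os

def get_db_password():
    """Fetch the database password from the environment, where config
    management (Chef, Puppet, Ansible, etc.) injects it at deploy time,
    rather than from a file checked into git."""
    password = os.environ.get("DB_PASSWORD")
    if password is None:
        raise RuntimeError("DB_PASSWORD is not set; check your config management")
    return password

# The deploy tooling would export DB_PASSWORD before the app starts:
os.environ["DB_PASSWORD"] = "s3cret"   # stand-in for the injected value
print(get_db_password())
```

The payoff is that rotating a leaked password means changing one value in config management, not rewriting git history.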
Another good move with new deployments: select your tools wisely. For example, leverage PaaS tools as long as possible – that way, you can focus on acquiring customers instead of building infrastructure. And don’t be afraid to employ “boring” products like Java. Well-established, tried-and-true tech can let you do some really cool stuff.
You don’t want to risk your test and production environments mingling in any way. Be sure to set up test environments with different hosting and provider accounts from what you use in production.
Performing local development? There’s no way around it: applications will run differently on local machines and in production. To simulate a production environment as closely as possible, create VMs with a tool like Vagrant.
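For reference, a Vagrant setup can be as small as a few lines (the box name and provisioning step here are illustrative; match them to your production OS):

```ruby
# Vagrantfile: one VM that mirrors production as closely as practical
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"   # pick the box that matches prod
  config.vm.provision "shell", inline: "apt-get update -y"
end
```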
Both these Ansible and Salt are tools that are really easy to learn. Specifically, Ansible makes infrastructure-as-code deployment super-simple for ops teams.
What is infrastructure-as-code? Essentially, it’s the process of building infrastructure in such a way that it can be spun up or down quickly and consistently. Server configurations are going to get screwed up regardless of where your infrastructure is running, so you have to be prepared to restore your servers in as little time as possible.
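The core mechanic is declaring desired state as data and converging toward it idempotently. A toy sketch of that idea (real tools like Ansible, Salt, or Terraform do this against actual servers; the service names here are invented):

```python
# Desired state is data; converging twice should change nothing the
# second time -- that idempotency is what makes restores fast and safe.
desired_state = {"nginx": "running", "postgres": "running", "memcached": "stopped"}

def converge(current_state, desired_state):
    """Return the actions needed to move current_state to desired_state."""
    actions = []
    for service, wanted in desired_state.items():
        if current_state.get(service) != wanted:
            actions.append((service, wanted))
            current_state[service] = wanted  # apply the change
    return actions

current = {"nginx": "stopped", "postgres": "running"}
print(converge(current, desired_state))  # actions taken on first run
print(converge(current, desired_state))  # [] -- already converged
```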
Whatever tool you use, as a rule of thumb, it’s best to limit the number of automation software tools you’re using. Each one is a source of truth in your infrastructure which means it’s also a point of failure.
Every piece of code must be deployed in as similar a fashion as possible. But getting all of your engineers to practice consistency can be a challenge.
Powerful automation software can certainly help enforce consistency, but heavyweight automation is only appropriate for big deployments. When you’re getting started, Arup suggests deploying from git with an orchestration tool – for example, Capistrano for Rails, Fabric for Python, or Ansible and Salt for both orchestration and configuration management.
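To make the “deploy from git” pattern concrete, here is a sketch of a minimal Ansible playbook (the host group, repo URL, paths, and service name are all made up for illustration):

```yaml
# deploy.yml -- pull the latest release and restart the app
- hosts: appservers
  become: yes
  tasks:
    - name: Pull the release from git
      git:
        repo: "git@example.com:myapp.git"
        dest: /srv/myapp
        version: master

    - name: Restart the app
      service:
        name: myapp
        state: restarted
```

Because the playbook is itself in version control, every engineer deploys the same way by running the same file.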
Creating and documenting an incident management process is absolutely necessary, even if the process isn’t perfect.
You should be prepared to review the incident-management document on an ongoing basis, too. If you aren’t experiencing much downtime, frequent reviews won’t really be necessary.
It’s becoming less and less common for companies to have dedicated on-call teams – instead, everyone who touches production code is expected to be reachable in the event of downtime.
This requires a platform (like PagerDuty) that can notify different people in different ways. What really matters is getting a hold of the right people at the right time.
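A sketch of the escalation logic such a platform automates (the policy levels, contact methods, and timeouts below are invented for illustration):

```python
# Walk an escalation policy until someone acknowledges the page.
escalation_policy = [
    {"name": "primary on-call",      "contact": "sms",   "timeout_s": 300},
    {"name": "secondary on-call",    "contact": "phone", "timeout_s": 300},
    {"name": "engineering manager",  "contact": "phone", "timeout_s": 600},
]

def notify(incident, acked_by=None):
    """Return the (responder, contact method) attempts made for an incident."""
    attempts = []
    for level in escalation_policy:
        attempts.append((level["name"], level["contact"]))
        if level["name"] == acked_by:
            break  # acknowledged -- stop escalating
    return attempts

print(notify("db down", acked_by="secondary on-call"))
```

The point is that escalation is policy, not heroics: if the first responder doesn’t answer, the page moves on without anyone having to remember who’s next.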
The specific tool you use for monitoring is less important than just putting something in place. PagerDuty uses StatsD in concert with Datadog; open-source tools like Nagios can be just as effective.
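Part of why StatsD is so easy to adopt is that its wire protocol is just a tiny text format sent over UDP: `name:value|type`, with an optional sample rate. A sketch of the packets a client library builds (the metric names are illustrative):

```python
# Build StatsD protocol packets: "name:value|type" with optional "|@rate".
def statsd_packet(metric, value, metric_type="c", sample_rate=None):
    packet = f"{metric}:{value}|{metric_type}"
    if sample_rate is not None:
        packet += f"|@{sample_rate}"
    return packet

print(statsd_packet("deploys.count", 1))          # counter
print(statsd_packet("request.time", 320, "ms"))   # timer, milliseconds
print(statsd_packet("page.views", 1, "c", 0.1))   # counter sampled at 10%
```

Because it’s fire-and-forget UDP, instrumenting an app with StatsD adds essentially no latency or failure risk.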
If you have the money, an application performance management tool like New Relic might be a good fit. But, what matters most is that you have a monitoring tool on deck.
“You have no excuse to not have any monitoring and alerting on your app, even when you first launch.” – Arup Chakrabarti, Engineering Manager, PagerDuty
Just like monitoring and alerting, backing up your data is non-negotiable. Scheduling regular backups to S3 is a standard industry practice today.
You should try restoring your production dataset into a test environment to confirm that your backups are working as designed at least once a month.
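One simple way to make that monthly check mechanical is to compare checksums of the production dump and the copy restored into test. A sketch of the idea (in real life the “restore” step would load the dump into a test database and re-dump it):

```python
import hashlib

# Verify a restore by comparing checksums of the dump before and after.
def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

production_dump = b"-- pg_dump output would go here --"
restored_copy = production_dump  # stand-in for: restore to test, re-dump

assert checksum(production_dump) == checksum(restored_copy), "restore drifted!"
print("backup verified:", checksum(production_dump)[:12])
```

A backup you’ve never restored is a hope, not a backup; this turns the hope into a pass/fail check.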
Having multiple servers at every layer (multiple stateless app servers, multiple load balancers) is a no-brainer. Only with multiple failover options can you truly say you’ve optimized for high availability (HA).
Clustered datastores (like Cassandra) are essential because, with multimaster clusters, individual nodes can be taken out with absolutely no customer-facing impact. That makes them ideal in fast-moving deployment environments.
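A toy sketch of why node loss is invisible in a multimaster cluster: clients simply route around unhealthy nodes (the node names are invented; a real driver does this for you):

```python
import random

# node -> healthy?  In a real cluster the driver tracks this via gossip.
nodes = {"cass-1": True, "cass-2": True, "cass-3": True}

def pick_node():
    """Pick any healthy node to serve the request."""
    healthy = [n for n, up in nodes.items() if up]
    if not healthy:
        raise RuntimeError("total cluster outage")
    return random.choice(healthy)

nodes["cass-2"] = False          # take a node out for maintenance
assert pick_node() != "cass-2"   # traffic just routes around it
print("request served by", pick_node())
```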
Instead of SSHing directly into your database servers and load balancers, use gateway boxes. You can run proxies through these gateways and lock traffic down if you suspect an incursion.
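With OpenSSH this pattern is a few lines of client configuration (host names here are illustrative):

```
# ~/.ssh/config -- route all access to the db tier through a gateway box
Host gateway
    HostName gateway.example.com
    User ops

Host db-*.internal
    ProxyJump gateway
```

If the gateway is the only machine that can reach those hosts, cutting off a compromised key or user is a single change on a single box.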
When an employee leaves your organization, you’ll want to be able to revoke their access quickly. But there are other reasons to set people up with individual user accounts for your various tools: someone’s laptop may get lost, or an individual might need a password reset. It’s a lot easier, Arup notes, to revoke or reset one user’s password than a master account password.
Making encryption a part of the development cycle helps you catch security-related bugs early in development. Plus, forcing devs to think constantly about security is simply a good practice.
IT isn’t always ops’ concern. But on certain issues, both teams are stakeholders. For example:
Commonality in equipment: If an engineer loses her custom-built laptop, how long will it take to get her a replacement? Strive for consistency in hardware to streamline machine deployments.
Granting access to the right tools: On-boarding documents are a good way to share login information with new hires.
Imaging local machines: With disk images stored on USB, provisioning or reprovisioning equipment is a snap.
Turning on disk encryption: With full-disk encryption enabled, there’s no need to worry if a machine gets lost.
There are millions more mistakes that operations teams can make. But these 10 tend to be the most commonly seen, even at companies like Amazon, Netflix and PagerDuty.
Have your own ops mistake you’d like to share? Let us know in the comments below.