You like sleep and weekends. Your customers hate losing access to your system because of maintenance. PagerDuty operations engineer Doug Barth has a solution: stop doing scheduled maintenance.
That sounds like a bold proposition. But as Doug explained at DevOps Days Chicago, it actually makes a lot of sense.
Scheduled maintenance tends to take place late at night on weekends, which is hard on operations engineers and admins. Customers need access at all hours, not just during the day. And scheduling maintenance implies your system is less reliable than you think, because you’re afraid to change it during the workday.
The solution? Avoid it altogether, and replace it with fast, iterative maintenance strategies that don’t compromise your entire system.
That might sound a bit ‘out there.’ But shelving scheduled maintenance is easier than you think. In his talk, Doug offered four ways to do it.
First things first: if you discard scheduled maintenance, your deployments need to be rock-solid. They should be scripted, fast, and quick to roll back, and the rollbacks themselves should be tested periodically to ensure they don’t lag.
They also need to be forward and backward compatible by one version. It’s not an option to stop the presses when you push out a new version. Red-blue-green deployments are crucial here, as they ensure only a third of your infrastructure undergoes changes at any given time.
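As a rough illustration of changing only a third of the infrastructure at a time, here is a minimal sketch of a scripted rollout that updates hosts group by group and rolls everything back if a group fails its health check. The `deploy`, `healthy`, and `rollback` callables and the host names are hypothetical placeholders, not anything described in the talk.

```python
def split_into_groups(hosts, group_count=3):
    """Split hosts into group_count roughly equal groups (thirds by default)."""
    return [hosts[i::group_count] for i in range(group_count)]


def rolling_deploy(hosts, deploy, healthy, rollback, group_count=3):
    """Deploy one group at a time; roll back every updated host on failure.

    deploy, healthy, and rollback are caller-supplied functions that each
    take a single host. Returns True if every group was updated and passed
    its health check, False if the rollout was rolled back.
    """
    updated = []
    for group in split_into_groups(hosts, group_count):
        for host in group:
            deploy(host)
            updated.append(host)
        if not all(healthy(host) for host in group):
            # Undo in reverse order so the most recent change is reverted first.
            for host in reversed(updated):
                rollback(host)
            return False
    return True
```

Because the old and new versions run side by side while the rollout is in progress, this approach only works if the two versions are compatible with each other, which is the point of the one-version compatibility rule.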
Lastly, stateless apps must be the norm. You should be able to reboot an app without any effect on the customer (like forced logouts or lost shopping carts).
Use canary deploys judiciously to test rollouts, judge their integrity and compare results. These test deployments affect only a small segment of your system, so bad code or an unexpected error doesn’t spell disaster for your entire service.
Doug suggested a few practical ways to accomplish this:
As Doug summed it up for the DevOps Days crowd: “Avoid knife-edge changes like the plague.”
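One way to picture a canary rollout is sketched below, under some assumptions: requests are assigned to the canary by hashing a stable key (a hypothetical customer or request ID), and the canary is judged by comparing its error rate against the baseline. The function names and the tolerance value are illustrative only.

```python
import hashlib


def in_canary(request_key: str, canary_percent: float) -> bool:
    """Deterministically route about canary_percent of keys to the canary.

    Hashing a stable key means the same customer always lands on the same
    side, which keeps the comparison between the two groups clean.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < canary_percent * 100


def canary_is_healthy(baseline_errors, baseline_total,
                      canary_errors, canary_total, tolerance=0.01):
    """Return True if the canary error rate is within tolerance of baseline."""
    if baseline_total == 0 or canary_total == 0:
        return False  # not enough traffic to judge either side
    baseline_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total
    return canary_rate <= baseline_rate + tolerance
```

Raising `canary_percent` in small steps while `canary_is_healthy` keeps returning True turns a single all-or-nothing switch into a gradual ramp.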
Your system should be loaded with retries. Build them into every hop between service layers, and use exponential backoff to avoid overwhelming the downstream system. The requests between service layers must be idempotent, Doug emphasized. When they are, you’ll be able to reissue requests to new servers without double-applying changes.
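A minimal sketch of retries with exponential backoff, paired with an idempotent service, might look like the following. `TransientError`, `call_with_retries`, and `LedgerService` are hypothetical names invented for this example, and the request ID scheme is an assumption about how idempotency could be achieved.

```python
import random
import time


class TransientError(Exception):
    """Raised by a hypothetical downstream call when a retry might succeed."""


def call_with_retries(operation, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep):
    """Call operation(), retrying on TransientError with backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            # The delay ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, ceiling))


class LedgerService:
    """Toy idempotent service: a repeated request ID is applied only once."""

    def __init__(self):
        self.balance = 0
        self._seen = {}

    def credit(self, request_id, amount):
        if request_id in self._seen:
            return self._seen[request_id]  # replay: return the earlier result
        self.balance += amount
        self._seen[request_id] = self.balance
        return self.balance
```

If a response is lost after the change was applied, the client can safely reissue the same `request_id`, and the balance is credited only once.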
Where you don’t care about the response, use queues to decouple the client from the server. If you’re stuck with a request/response flow, use a circuit breaker approach, where your client library returns minimal results if a service is down, reducing front-end latency and errors.
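The circuit breaker idea can be sketched as follows. This is a simplified illustration, not a production library: the `CircuitBreaker` class, its thresholds, and the `fallback` callable that supplies the minimal result are all assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Fail fast with a fallback result while a downstream service is down."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: skip the call entirely
            # Timeout elapsed: half-open, so let one trial request through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if (self.opened_at is not None
                    or self.failures >= self.failure_threshold):
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get the fallback immediately instead of waiting on timeouts, and the struggling service gets room to recover.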
Distribute your data to many servers, so that no one server is so important you can’t safely work on it.
At PagerDuty, the team uses multi-master clusters, which help with operations and vertical scaling. They also use multi-server databases like Cassandra, where no one server is special, which means operational work can happen during the day.
Put together, these strategies help admins and operations engineers sleep more, worry less and maintain better, all ahead of schedule.