As a freelance developer, inheriting projects is a necessary evil. Almost every project has legacy code that the team is afraid to touch, but when you inherit a project as a freelancer, more often than not, the entire codebase is “legacy.” While dealing with an unfamiliar codebase is tough, what can be even more difficult is getting that codebase running in a production environment.
Last October, I inherited a project that drove me to near insanity. The source code itself was in shambles for sure, but what made the project such a nightmare was the lack of documentation and communication from the previous developers. As a result, I had to reverse engineer the application just to get it running in the new production environment.
I was essentially playing a guessing game with the architecture. I had an idea of what type of resources I needed to provide, but without getting it in front of users, I really didn’t know what to expect. As I’m sure you can guess, this didn’t end well. Due to inefficient programming patterns, the site required four times the resources it should have in order to achieve some modicum of stability.
Luckily for me, one of the first things I did was integrate some incident management tools into the project. This let me identify specific pain points early and often, and fix them immediately. Those findings in turn guided strategic resource and project upgrades that improved the stability of the project.
So, what exactly did I see?
While I felt like I was playing whack-a-mole with half a dozen issues at any given time, there were two that cropped up infrequently enough that I would not have noticed their impact without incident management tools: database locking and memory issues. Both are relatively common in development, but they can be difficult to diagnose and solve.
After stabilizing the production site, one of the first things I noticed was that the site was crashing every hour, on the hour, for about 15 minutes each time. Thanks to the information provided by my incident management tools, I was able to narrow down the problem to an hourly cron job. A critical cron job was locking a primary database table every time it ran, effectively taking down the site until the process was done. Once I knew that, refactoring that particular script was straightforward, and doing so increased the uptime of the site and reduced user frustration.
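The post doesn’t say how the cron script was refactored, so the following is only a sketch of one common approach: doing the work in small batches, each in its own short transaction, so that no single statement holds a lock for the whole run. The `orders` table, the `processed` column, and the function name are hypothetical, and SQLite is used purely so the example is self-contained (it locks at the database level rather than per table, but the batching pattern carries over).

```python
import sqlite3


def mark_processed_in_batches(conn, batch_size=500):
    """Hypothetical hourly job: flag pending rows, one small batch at a time.

    Each batch runs in its own short transaction, so other readers and
    writers get a chance to run between batches instead of waiting for
    one long-running UPDATE to finish.
    """
    total = 0
    while True:
        with conn:  # commits (or rolls back on error) when the block exits
            cur = conn.execute(
                "UPDATE orders SET processed = 1 WHERE id IN ("
                "SELECT id FROM orders WHERE processed = 0 LIMIT ?)",
                (batch_size,),
            )
        if cur.rowcount == 0:
            return total  # nothing left to do
        total += cur.rowcount
```

The batch size is a trade-off: smaller batches release locks sooner but add per-transaction overhead, so the right number depends on the database and workload.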
Memory leaks suck. In a complicated application, they can be incredibly difficult to track down — especially when they occur in a production environment. Unfortunately for me, this project was filled to the brim with them. Some are easy to fix, like log entries showing the Redis server running out of memory (insert more memory here), but others can be pretty elusive.
One common and seemingly random symptom was timeouts. Occasionally, the site would start timing out for users after attempting to load for five minutes. While I knew from experience that this was likely caused by inefficient database queries, narrowing down the exact queries was a bit of a challenge. Again, thanks to the incident management framework I’d put in place, I was able to identify a specific set of profile pages that were taking almost half an hour to retrieve data from the database. Because this process took too long, users kept reloading the page and restarting the whole process.
The first thing I was able to do was identify exactly how long users were waiting before they reloaded the page or gave up (about 1 minute). Then, I changed both the web and database server configurations to terminate any request or query that ran longer than 1 minute. This gave me some breathing room, so those pages didn’t crash the rest of the site.
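The change described here was made in server configuration, and the post doesn’t name the servers or settings involved. As a rough illustration of the same idea at the application level, here is a minimal sketch that aborts a query once it exceeds a deadline. It uses Python’s built-in `sqlite3` module and its progress handler; the `run_with_deadline` helper and the 60-second default are assumptions made for the example.

```python
import sqlite3
import time


def run_with_deadline(conn, sql, params=(), max_seconds=60.0):
    """Run a query, aborting it if it runs longer than max_seconds.

    SQLite calls the progress handler periodically while a statement is
    executing; returning a non-zero value interrupts the statement, and
    sqlite3 raises OperationalError in the caller.
    """
    deadline = time.monotonic() + max_seconds

    def over_deadline():
        return 1 if time.monotonic() > deadline else 0

    # Check the clock every 1000 SQLite virtual machine instructions.
    conn.set_progress_handler(over_deadline, 1000)
    try:
        return conn.execute(sql, params).fetchall()
    finally:
        conn.set_progress_handler(None, 1000)  # remove the handler
```

A caller would catch `sqlite3.OperationalError` and return an error page right away, rather than leaving the request hanging while users keep reloading.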
Then, I had to identify the exact queries that were causing the problems. Unfortunately, these particular pages were pretty query-heavy, but after referencing the logs I was able to narrow it down to one particular line that was querying over 1GB of data from the database server without caching the result. From here, the next steps were to refactor the query, cache the result for an appropriate time, and get the fix out to users as soon as possible.
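The post doesn’t show the caching code, so this is only a sketch of the “cache the result for an appropriate time” step: a small in-memory cache with a time-to-live. The `TTLCache` class, the key name, and the five-minute lifetime in the usage note are all illustrative assumptions; a production setup might use a shared cache instead.

```python
import time


class TTLCache:
    """Minimal in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}    # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for key, calling compute() only on a miss."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: skip the expensive work
        value = compute()
        self._store[key] = (now + self.ttl, value)
        return value
```

With something like `cache = TTLCache(ttl_seconds=300)`, a profile page could call `cache.get_or_compute("profile:42", load_profile)`, where `load_profile` is a hypothetical function that runs the heavy query. Repeated page loads within five minutes would then reuse the stored result instead of hitting the database again.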
These are just a few examples of the problems I was able to solve thanks to my historical incident management data. If I hadn’t implemented the toolset early on, I would probably still be playing guess-and-check with various solutions. Don’t get me wrong, though. The same incident management tools can also be used to plan upgrades for a well-architected application. Identifying the circumstances where your servers overload or things start slowing down is crucial to scaling your project to accommodate growth.
Learn more about how you can visualize patterns across all your systems data for improved incident management by checking out the PagerDuty Operations Command Console.