Infrastructure

Triaging an Nginx 502 Bad Gateway on a busy WordPress server

An Nginx → PHP-FPM stack started returning 502s under load with no errors in the application logs. Here's the exact triage path I used to find the cause — checking PHP-FPM workers, the socket, and the error log — plus the one tuning change that fixed it for good.

Note

**TL;DR — How to fix Nginx 502 Bad Gateway with PHP-FPM:** (1) `tail` the Nginx error log for the upstream message — it tells you whether the issue is a timeout, a connection refused, or a broken socket. (2) Check `systemctl status php8.x-fpm` to confirm FPM is running. (3) Look at the FPM status page (`pm.status_path`) or `ps aux | grep php-fpm` to count active workers vs. `pm.max_children`. (4) If workers are saturated, raise `pm.max_children` (carefully — every worker eats RAM) or shorten slow request timeouts. (5) Reload, don't restart, with `systemctl reload php8.x-fpm`.

Friday afternoon. A client's WordPress site started returning intermittent 502 Bad Gateway errors. About one in fifteen requests, totally inconsistent. Nothing in WordPress's debug.log, no PHP fatals, no database lock. From the user's side it just looked broken: refresh and it works, refresh again and it doesn't.

If you've run Nginx in front of PHP-FPM long enough, you've seen this. A 502 from Nginx basically means "I tried to talk to PHP and something went wrong." The useful question is which something. The answer is almost always one of five things, and this post is the triage path I use to narrow it down before I start changing anything.

What does an Nginx 502 actually mean?

A 502 from Nginx isn't your application breaking. It's Nginx itself, failing to get a response from whatever it was trying to proxy to. In a normal WordPress stack the upstream is PHP-FPM, usually talking over a Unix socket at `/run/php/php8.2-fpm.sock` or sometimes a TCP port like `127.0.0.1:9000`. When Nginx can't get an answer, it's one of:

  • PHP-FPM isn't running at all (worker pool crashed).
  • PHP-FPM is running but every worker is busy — Nginx couldn't get one within the timeout.
  • A specific PHP request crashed the worker mid-response (segfault, OOM kill).
  • The Unix socket file got deleted or has wrong permissions.
  • Nginx's `fastcgi_read_timeout` is shorter than how long the request actually takes.

Each of these has a different fix. The point of the triage is to figure out which one in three minutes flat, before you start guessing.

Where do you find the real error message?

The Nginx error log is your single best source of truth here, and most engineers either don't know where it is or grep for the wrong thing. On Ubuntu it lives at `/var/log/nginx/error.log`. Tail it while you're reproducing the 502:

```bash
tail -f /var/log/nginx/error.log | grep -i upstream
```

The line you want to find looks something like this:

```text
2026/04/12 14:18:09 [error] 1284#1284: *9182 connect() to unix:/run/php/php8.2-fpm.sock failed (11: Resource temporarily unavailable) while connecting to upstream
```

That `(11: Resource temporarily unavailable)` is what you want to see. Errno 11 (EAGAIN) means PHP-FPM's listen queue is full. Every worker is busy and no new connections can be accepted. So this isn't a crash. It's saturation, which is a much friendlier failure mode. Other error strings tell you very different things:

  • **(11: Resource temporarily unavailable)** — workers saturated, raise `pm.max_children` or shorten request times.
  • **(2: No such file or directory)** — the socket file doesn't exist, FPM isn't running.
  • **(13: Permission denied)** — the socket exists but Nginx can't open it (wrong owner/group).
  • **upstream prematurely closed connection** — a worker died mid-request (segfault, OOM).
  • **upstream timed out** — the request took longer than `fastcgi_read_timeout` (default 60s).
Tip

If you don't see *any* upstream-related lines in the error log, you're probably tailing the wrong file. WordPress hosts often have multiple Nginx vhosts, each with its own error_log directive. Run `nginx -T 2>/dev/null | grep error_log` to see every error log path Nginx knows about.
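The errno-to-cause mapping above is mechanical enough to script. Here's a throwaway helper I keep around — a sketch, not a standard tool, and `classify_502` is my own name for it — that buckets an upstream error line into the five causes:

```bash
# Sketch: classify an Nginx upstream error line by the five patterns above.
# classify_502 is a hypothetical helper name, not a standard command.
classify_502() {
  case "$1" in
    *"(11:"*)               echo "workers saturated (EAGAIN)" ;;
    *"(2:"*)                echo "socket missing: PHP-FPM not running" ;;
    *"(13:"*)               echo "permission denied on the socket" ;;
    *"prematurely closed"*) echo "worker died mid-request" ;;
    *"timed out"*)          echo "exceeded fastcgi_read_timeout" ;;
    *)                      echo "unrecognized: read the full line" ;;
  esac
}

classify_502 "connect() to unix:/run/php/php8.2-fpm.sock failed (11: Resource temporarily unavailable)"
# prints: workers saturated (EAGAIN)
```

Pipe the grep from earlier into a loop over this function and you get a running tally of failure modes instead of a wall of log lines.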

How do you check if PHP-FPM workers are saturated?

Once the error log points at saturation, you need to confirm it. The fastest way is enabling FPM's status endpoint. Edit `/etc/php/8.2/fpm/pool.d/www.conf` and uncomment:

```ini
pm.status_path = /fpm-status
```

Reload FPM, then add a temporary Nginx location block (locked down to localhost) and curl it:

```bash
curl -s http://127.0.0.1/fpm-status
```
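For reference, the temporary location block can be as small as this — a sketch that assumes the php8.2 socket path from earlier; remove it (or keep it permanently behind the localhost allow-list, as I argue at the end) once triage is done:

```nginx
# Temporary, localhost-only status endpoint. Matches pm.status_path above.
location = /fpm-status {
    allow 127.0.0.1;
    deny all;
    access_log off;
    include fastcgi_params;
    fastcgi_pass unix:/run/php/php8.2-fpm.sock;
}
```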

You'll get output like:

```text
pool:                 www
process manager:      dynamic
accepted conn:        184228
listen queue:         12
max listen queue:     38
idle processes:       0
active processes:     20
total processes:      20
max active processes: 20
max children reached: 412
slow requests:        7
```

Two numbers matter here. `max children reached` was 412. That's how many times FPM hit the ceiling. `idle processes: 0` says there's no headroom at all. The pool is sized for 20 workers and every one of them is busy any time the site gets real traffic. The 502s aren't mysterious anymore.
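Since only those two numbers matter, the check is easy to script. A sketch — the here-doc stands in for the live `curl -s http://127.0.0.1/fpm-status` output, so swap it out in real use:

```bash
# Stub of the status page output; replace with:
#   status=$(curl -s http://127.0.0.1/fpm-status)
status=$(cat <<'EOF'
pool:                 www
idle processes:       0
active processes:     20
total processes:      20
max children reached: 412
EOF
)

# Split each line on "colon + spaces" and pull the value column.
idle=$(printf '%s\n' "$status" | awk -F': *' '/^idle processes/ {print $2}')
ceiling=$(printf '%s\n' "$status" | awk -F': *' '/^max children reached/ {print $2}')

if [ "$idle" -eq 0 ] && [ "$ceiling" -gt 0 ]; then
  echo "pool saturated: no idle workers, ceiling hit ${ceiling} times"
fi
```

Wired into cron or a Prometheus textfile collector, this is exactly the early warning I complain about not having at the end of this post.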

How much should you raise pm.max_children?

This is where most people make the wrong move. The instinct is to crank `pm.max_children` to something huge like 200 and call it done. But every worker holds memory, and once you exceed physical RAM the kernel starts swapping or the OOM killer starts murdering processes. Congratulations, you've replaced one outage with a worse one.

The actual calculation: average worker memory × target worker count has to fit in available RAM, with headroom left over for MariaDB and OS overhead. To find your average worker memory in MB:

```bash
# RSS is column 8 of `ps -ly` output, in KB; NR>1 skips the header row.
ps -ylC php-fpm8.2 --sort=rss | awk 'NR>1 {sum+=$8; count++} END {print sum/count/1024 " MB avg per worker"}'
```

On this server it came back at 78 MB per worker. The box had 8 GB total, about 2 GB sitting with MariaDB and another 1 GB for the OS, so I had roughly 5 GB to give to FPM. 5,000 ÷ 78 is about 64 workers. I rounded down to 60 to leave breathing room, set `pm.start_servers = 15`, `pm.min_spare_servers = 10`, `pm.max_spare_servers = 25`, and reloaded:
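The budget arithmetic is worth keeping as a script so you rerun it after every plugin change. This sketch hard-codes the numbers from this incident (the prose above rounds 5120 MB down to "about 5 GB", hence 64 there vs. 65 here); substitute your own measurements:

```bash
# Numbers from this incident; substitute your own measurements.
total_ram_mb=8192     # physical RAM on the box
mariadb_mb=2048       # observed MariaDB footprint
os_mb=1024            # OS + everything else
worker_mb=78          # avg php-fpm RSS from the ps one-liner above

available_mb=$(( total_ram_mb - mariadb_mb - os_mb ))
max_children=$(( available_mb / worker_mb ))
echo "budget ${available_mb} MB / ${worker_mb} MB per worker -> pm.max_children ceiling ${max_children}"
# prints: budget 5120 MB / 78 MB per worker -> pm.max_children ceiling 65
```

Round the result down, never up — the average hides outlier workers that balloon well past the mean.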

```bash
systemctl reload php8.2-fpm
```
Warning

Use `reload`, not `restart`. Reload tells PHP-FPM to gracefully cycle workers without dropping in-flight requests. Restart kills everything, and any user mid-request will see a 502. Which is, you know, the thing you're trying to fix.

What about slow requests holding workers hostage?

Raising the worker count is a band-aid if what's actually wrong is one slow query (or one slow external API call) holding workers hostage for 30+ seconds at a time. After bumping `pm.max_children`, I always enable FPM's slow log to find out what's really blocking things:

```ini
request_slowlog_timeout = 5s
slowlog = /var/log/php8.2-fpm-slow.log
```

Reload FPM, wait an hour, come back. The slow log captures a full PHP stack trace for any request that took more than 5 seconds, so you get exactly which file and line was the problem. Nine times out of ten it's either an unindexed database query (see my MariaDB CPU post for that one) or a synchronous external HTTP call to a slow third party.
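Once entries accumulate, I rank offenders rather than reading the log top to bottom — each slow-log entry includes a `script_filename` line, so a grep/uniq pipeline surfaces the worst scripts. A sketch, with a fabricated three-line sample standing in for the real log:

```bash
# Fabricated sample; in production, point the pipeline at
# /var/log/php8.2-fpm-slow.log instead.
cat > /tmp/fpm-slow-sample.log <<'EOF'
script_filename = /var/www/html/wp-admin/admin-ajax.php
script_filename = /var/www/html/index.php
script_filename = /var/www/html/wp-admin/admin-ajax.php
EOF

# Rank scripts by how often they tripped the 5s slow-request timeout.
grep 'script_filename' /tmp/fpm-slow-sample.log \
  | awk '{print $3}' | sort | uniq -c | sort -rn
```

On WordPress boxes the top entry is very often `admin-ajax.php`, which then sends you digging into which plugin hook is behind it.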

The whole triage path in one checklist

  1. `tail -f /var/log/nginx/error.log | grep -i upstream` — get the actual error string while reproducing the 502.
  2. Map the error to the cause: errno 11 = saturated, errno 2 = FPM down, errno 13 = permissions, prematurely closed = worker crash, timed out = slow request.
  3. If saturation: enable FPM `pm.status_path` and confirm `idle processes: 0` and `max children reached > 0`.
  4. Calculate safe worker count from average worker RSS and free RAM.
  5. Raise `pm.max_children` to the calculated ceiling, reload (don't restart) FPM.
  6. Enable `request_slowlog_timeout = 5s` to find the slow requests holding workers hostage and fix them at the source.

The dumb thing I should have done a year ago

The honest takeaway from this incident isn't about FPM tuning. It's that I didn't have `pm.status_path` enabled in the first place. If a Prometheus scraper had been hitting it every 30 seconds for the last few months, I would have had a graph showing worker saturation creeping up over the prior week. The 502s wouldn't have been a Friday-afternoon surprise. They would have been a Tuesday-morning ticket I picked up between coffees.

If you run any non-trivial PHP-FPM setup and don't have worker pool metrics graphed somewhere, this is the single highest-leverage observability change you can make. Worker exhaustion is the most common cause of WordPress 502s I see, and it's also the easiest one to predict — if you're actually watching for it. I enabled the status endpoint and built a Grafana panel for it the same evening this happened, which is the kind of fix that makes you feel slightly stupid for not shipping it a year earlier.

Topics
Nginx · 502 Bad Gateway · PHP-FPM · incident response · WordPress hosting · Linux server administration · PHP worker tuning · production debugging · pm.max_children
Zunaid Amin

Manages Linux infrastructure at Rocket.net. WordPress Core Contributor since 6.3 and Hosting Team Representative for WordPress.org. Based in Dhaka, Bangladesh.

zunaid321@gmail.com