**TL;DR — How to diagnose a MariaDB process pinning 100% CPU:** (1) Run `top -o %CPU` to confirm `mysqld` is the offender. (2) Run `mysqladmin processlist --verbose` to find queries stuck in *Sending data* state for tens of seconds. (3) Tail `/var/log/mysql/mysql-slow.log` for the full query text and frequency. (4) Run `EXPLAIN` on a replica — `type: ALL` and high `rows` mean a missing index. (5) `KILL` the stuck connection IDs to stop the bleeding. (6) Add the missing composite index to prevent recurrence.
Around 2pm, the alert fired. One of the shared hosting servers on our fleet was pinned at 100% CPU and had been for about four minutes. No deploy in the last few hours, no traffic spike in the edge logs, no obvious trigger to point at. The first instinct in this situation is panic. The second is to start running random commands. Neither helps.
What follows is the actual sequence of commands I ran to find the culprit. They're standard Linux and MariaDB tools, nothing exotic, but the order matters. This is the playbook I wish I'd had in my fingers the first time I had to deal with this under pressure.
## How do you confirm MariaDB is actually the CPU bottleneck?
Before touching MariaDB, I wanted to confirm `mysqld` was actually the process burning the CPU. Could just as easily have been PHP, a runaway cron job, or some shared-hosting tenant doing something dumb in a fork loop. SSH in and run `top` sorted by CPU:

```shell
top -o %CPU
```

The output left no room for doubt. A single `mysqld` process was sitting at ~780% CPU on an 8-core box. `php-fpm` and `nginx` were both quiet. Redis was idle. Whatever was wrong, it was inside MariaDB. That's a much smaller haystack to search.
`htop` is nicer if it's already installed. I don't bother installing anything new during an incident; whatever's on the box is what I'm using.
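`top` is interactive, which is fine at the keyboard but useless in a script or an incident channel. When I want the same answer in scriptable form, `ps` can produce it; a minimal sketch using standard procps options:

```shell
# The five hungriest processes by CPU, highest first. Same answer as
# `top -o %CPU`, but greppable and pasteable.
ps -eo pid,comm,%cpu --sort=-%cpu | head -n 6
```

If `mysqld` tops that list, `top -H -p "$(pidof mysqld)"` will break its CPU use down per thread, which is occasionally useful for telling one hot query thread apart from general load.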
## Which MariaDB queries are eating the CPU right now?
The single most useful command at this point is `mysqladmin processlist`. It shows you every active query, how long each has been running, and what state it's in:

```shell
mysqladmin -u root processlist --verbose
```

Most of the rows were short-lived SELECTs from wp-cron and page loads, the usual background noise. But three rows stood out: the same SELECT, running from three different connections, each stuck in the *Sending data* state for 90+ seconds. That's almost always the tell. Short queries don't burn CPU for a minute and a half. Something was scanning a big table without an index.
### Reading the processlist output
A few of these states are worth recognising on sight. *Sending data* means the server is reading rows and evaluating WHERE clauses, and when it's stuck there with high CPU you're almost always looking at a bad query plan. *Copying to tmp table on disk* means you've blown past your `tmp_table_size` and the server is hitting disk to assemble results, which is slow and obvious. *Waiting for table metadata lock* means something is blocking DDL. In this case it was very clearly the first one.
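Eyeballing the processlist works for three stuck queries; it doesn't for thirty. A quick filter can do the spotting instead. This is a sketch assuming the default pipe-delimited table output of `mysqladmin processlist`; the function name `long_queries` and the 30-second default threshold are my own choices:

```shell
# long_queries: read `mysqladmin processlist --verbose` output on stdin
# and print every connection whose Time column exceeds a threshold.
# With -F'|' the table fields land as: $2=Id $3=User $4=Host $5=db
# $6=Command $7=Time $8=State $9=Info.
long_queries() {
  threshold="${1:-30}"
  awk -F'|' -v t="$threshold" '$7+0 > t {
    gsub(/ /, "", $2)           # strip the padding around the Id
    gsub(/^ +| +$/, "", $8)     # trim the padding around the State
    print "id=" $2 " time=" $7+0 "s state=" $8
  }'
}
# Usage: mysqladmin -u root processlist --verbose | long_queries 30
```

Header and separator rows fall out naturally because their Time field doesn't parse as a number above the threshold.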
## How do you find the exact query in the MariaDB slow query log?
`processlist` shows you the query, but the text gets truncated and you lose the context of how often it runs. The slow query log gives you the full picture. I checked that it was enabled and tailed it:

```shell
mysql -e "SHOW VARIABLES LIKE 'slow_query_log%'"
tail -n 200 /var/log/mysql/mysql-slow.log
```

In the last hour, the same SELECT had been logged 47 times, with an average query time of 62 seconds. It was hitting `wp_postmeta`, the WordPress table that gets enormous on any site with serious product or post-type volume, with a WHERE clause on `meta_value`. That should have been a red flag the moment I saw the column name.
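Counting occurrences and averaging times by eye is error-prone. `mysqldumpslow` (which ships with MariaDB) or `pt-query-digest` do this properly, but a crude grep/awk pass gets you the same frequency-and-average picture. A sketch, assuming the standard `# Query_time:` comment lines of the slow log format; the helper name `avg_slow` is mine, and the `-B2` offset may need adjusting if your log writes extra comment lines between the stats and the statement:

```shell
# avg_slow: count entries in a slow query log that match a pattern and
# average their Query_time. On a "# Query_time: 62.38 Lock_time: ..."
# line, $3 is the duration in seconds.
avg_slow() {
  pattern="$1" log="$2"
  grep -B2 "$pattern" "$log" \
    | awk '/^# Query_time:/ { sum += $3; n++ }
           END { if (n) printf "%.1fs average over %d queries\n", sum/n, n }'
}
# Usage: avg_slow "FROM wp_postmeta" /var/log/mysql/mysql-slow.log
```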
## How do you read EXPLAIN output to find a missing index?
Once I had the query text, I ran `EXPLAIN` against it on a read replica (never in prod on an already-stressed primary):

```sql
EXPLAIN SELECT post_id FROM wp_postmeta
WHERE meta_key = 'product_sku'
  AND meta_value LIKE 'SKU-2026%';
```

`EXPLAIN` said `type: ALL` and `rows: 4,218,774`. A full table scan over four million rows. Every single request. The WordPress `wp_postmeta` schema has an index on `meta_key` but not on `meta_value`, and the `LIKE 'SKU-2026%'` pattern was forcing a scan even after the `meta_key` filter narrowed things down. Of course it was burning CPU.
Don't run `EXPLAIN ANALYZE` on a production primary during an incident. It actually executes the query. Use a replica, or at minimum wait until load is back to normal. Plain `EXPLAIN` is fine.
## How do you kill a runaway MariaDB query without restarting the server?
There are two things to do here and the order matters. First, stop the bleeding. Kill the in-flight queries so CPU drops and the other tenants on the box stop suffering for someone else's bug. Second, fix the underlying cause so it doesn't come right back in ten minutes.
To kill the stuck queries, I grabbed their connection IDs from processlist and killed them one at a time:
```sql
KILL 184223;
KILL 184298;
KILL 184311;
```

CPU dropped from 780% to around 45% within 20 seconds. Not back to normal yet, because the backlog of legitimate traffic was still draining, but the fire was out.
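Three connection IDs are fine to type by hand; thirty are not. The same processlist output can be turned into KILL statements to review before running. A sketch under the same pipe-delimited-output assumption as before; the function name `kill_stmts` and the 60-second default are my own:

```shell
# kill_stmts: read `mysqladmin processlist` output on stdin and emit a
# KILL statement for every Query running longer than the threshold.
# Always inspect the output before feeding it back to the server.
kill_stmts() {
  threshold="${1:-60}"
  awk -F'|' -v t="$threshold" '$7+0 > t && $6 ~ /Query/ {
    gsub(/ /, "", $2)
    print "KILL " $2 ";"
  }'
}
# Usage:
#   mysqladmin -u root processlist --verbose | kill_stmts 60                 # inspect first
#   mysqladmin -u root processlist --verbose | kill_stmts 60 | mysql -u root # then execute
```

The `$6 ~ /Query/` guard skips Sleep and replication connections so you only ever generate kills for active queries.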
For the actual fix, the answer depends on who owns the code. In this case the query was coming from a plugin doing a prefix search on SKUs, which is something it had no business doing on `meta_value` without a dedicated lookup table. I worked with the site owner and we did two things. Added a composite index on `(meta_key, meta_value(32))` as a short-term patch, and opened a ticket asking the plugin author to move SKU lookups into a proper indexed column for the long term. The index alone took the query from 62 seconds down to 40ms.
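For reference, the short-term patch was a one-line DDL. A sketch of what it looks like; the index name is my own invention, and the `(32)` prefix length is there because `meta_value` is a LONGTEXT column that can't be fully indexed anyway:

```sql
-- Composite prefix index so the meta_key filter and the
-- LIKE 'SKU-2026%' prefix match can both use the index.
ALTER TABLE wp_postmeta
  ADD INDEX sku_lookup (meta_key, meta_value(32));
```

On a table this size the ALTER itself takes time and locks; on a busy primary it's worth scheduling it off-peak or using an online schema change tool.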
## The whole thing in one checklist
- `top -o %CPU` — confirm `mysqld` is the process burning the CPU.
- `mysqladmin processlist --verbose` — look for long-running queries in the *Sending data* or *Copying to tmp table on disk* state.
- Check the slow query log — tail it to see frequency and full query text.
- `EXPLAIN` the query on a replica — look for `type: ALL` and high `rows` counts, which mean full table scans.
- `KILL` the stuck connection IDs to stop the bleeding.
- Add the missing index or coordinate a real fix with whoever owns the query.
## The thing the incident was actually about
The technical lesson here isn't very interesting. Find the slow query, kill it, add the index, move on. The real lesson was about monitoring. By the time my alert fired, the server had already been at 100% for four minutes and real users had felt every second of it. An alert that fires when any single query crosses 30 seconds would have caught this before it ever became a CPU spike. I added that alert the same afternoon, which is the kind of thing you wish past-you had done six months earlier.
One more thing worth saying. The playbook above looks long when it's written out, but in practice it's about six commands and three minutes of actual work. If you're a Linux admin and you don't have this sequence in your fingers, practice it on a staging box once or twice before the day you need it at 2pm on a Tuesday. Muscle memory is more useful than knowledge when you're under pressure.

