Fixing Clamd stuck at 100% CPU

My Postfix mail server was taking ages (well, minutes, but that really is ages in the computer world) to handle a single email, with multiple processes each hogging a full CPU core, all the time! What?

I got it solved (after a couple of hours making U-turns in dead ends) and as always, the answer was really trivial. But I know I’ll be losing another couple of hours next time I run a “yum update”, so here is another in the “let’s blog about it so I remember how to fix it next time” series!

The problem

After updating my CentOS 7 server, I noticed that clamd was taking 100% of a CPU core. On top of that, a number of clamscan processes were running, each also claiming most of a CPU core.

  • Clamd is the daemon process for clamav-server which is used by Amavis as anti-virus protection and should be a relatively light-weight process.
  • Clamscan is the “start up, read all the virus signatures into memory, scan a single item for viruses and exit” version that has mostly been replaced by clamd.

So a server using clamd should really never be running clamscan at all…

Looking at the postfix maillog under /var/log, things looked even worse. The log showed amavisd complaining that clamd was unresponsive. This immediately explained the presence of clamscan processes: amavisd will start up a clamscan for an email if it cannot reach clamd, as a fallback.

amavis[20362]: (20362-01) (!)connect to /var/run/clamd.amavisd/clamd.sock failed, attempt #1: Can't connect to a UNIX socket /var/run/clamd.amavisd/clamd.sock: Connection refused

amavis[20362]: (20362-01) (!)ClamAV-clamd av-scanner FAILED: run_av error: Too many retries to talk to /var/run/clamd.amavisd/clamd.sock (All attempts (1) failed connecting to /var/run/clamd.amavisd/clamd.sock) at (eval 134) line 659.\n

amavis[20362]: (20362-01) (!)WARN: all primary virus scanners failed, considering backups
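To get a feel for how often amavis is falling back to clamscan, you can count the fallback warnings in the mail log. This is only a minimal sketch: it assumes the log lives at /var/log/maillog (adjust the path for your setup), and the function name count_clamd_fallbacks is made up for this example.

```shell
# Count how often amavis gave up on clamd and considered its backup scanners.
# Assumption: the mail log is /var/log/maillog unless a path is passed in.
count_clamd_fallbacks() {
    local logfile="${1:-/var/log/maillog}"
    # grep -c prints the number of matching lines (0 if there are none)
    grep -c 'all primary virus scanners failed' "$logfile"
}
```

Call it as count_clamd_fallbacks, or pass another log file as the first argument. A count that keeps growing means amavis still cannot reach clamd.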

The plot thickens

Sadly, this log does not tell us anything about why clamd is unresponsive so “strace” to the rescue!

Well, no…

Launching strace and attaching it to the running clamd process (using strace -p <pid>) showed nothing obviously wrong, except that the process kept getting killed. It started, got killed, got restarted, and so on, each time with a new process ID, which broke the strace session.

I’m not going to paste the full strace output here because it is extremely long (it contains loads and loads of read calls, which I believe is clamd reading all the virus signatures and setting up the in-memory state that allows it to process mails so quickly), but if your strace ends like this:

mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fda413a1000
read(6, "443a6f443a6b3f3566382e66382e6638"..., 24576) = 24576
read(6, "be9568b7540ff765cff1550504000;55"..., 4096) = 4096
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fda41361000
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=1, si_uid=0} ---
+++ killed by SIGTERM +++

That is not good.

It should end with something like this:

munmap(0x7f60ec03c000, 262144) = 0
munmap(0x7f60ec07c000, 262144) = 0
munmap(0x7f60ec0bc000, 262144) = 0
munmap(0x7f60ec0fc000, 262144) = 0
munmap(0x7f60ec13c000, 262144) = 0
exit_group(0) = ?
+++ exited with 0 +++

But the auto-restarting was an important clue, especially when I noticed that the clamd process was killed and restarted exactly every 90 seconds….

That simply reeks of a dirty rotten timeout!
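If you want to confirm the restart rhythm on your own box, a small loop that prints a timestamp whenever the clamd PID changes will do. This is a rough sketch, not a polished tool: it assumes pgrep is installed and that only one clamd instance is running, and the function name watch_clamd_restarts is invented for this post.

```shell
# Print a timestamped line every time the PID of clamd changes.
# Arguments (both optional): number of checks, seconds between checks.
watch_clamd_restarts() {
    local checks="${1:-60}" interval="${2:-5}" last="" pid i=0
    while [ "$i" -lt "$checks" ]; do
        # -o picks the oldest matching process, -x matches the exact name
        pid=$(pgrep -o -x clamd || true)
        if [ -n "$pid" ] && [ "$pid" != "$last" ]; then
            echo "$(date '+%H:%M:%S') clamd pid $pid"
            last="$pid"
        fi
        i=$((i + 1))
        sleep "$interval"
    done
}
```

Running watch_clamd_restarts 60 5 watches for about five minutes. A new line at a very regular interval is a strong hint that something is killing clamd on a timer.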

The solution

SystemD, which saved us from “initscript hell” (I won’t mind if you don’t agree! 🙂 ), is a much more complex animal, and one of the things it has is a “TimeoutStartSec” setting that tells SystemD how long to wait for a service to start. When that timeout expires, SystemD kills the process and, because the unit has Restart = on-failure and a start timeout counts as a failure, restarts it ad infinitum.

The default value is set in /etc/systemd/system.conf (the line is commented out there, so it shows the built-in default):

 #DefaultTimeoutStartSec=90s

So it looks like something in clamav changed, causing the startup time to be a lot longer now, clocking in at over 3 minutes.

And nice little SystemD really thinks that 90 seconds should be enough for anybody and promptly restarts it again, and again… and again…

This keeps a fresh clamd process claiming a full CPU core until it is almost ready, and causes amavisd to spawn clamscan processes to work around the unresponsive clamd daemon.
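To see which start timeout SystemD actually applies to the unit, you can ask it directly with systemctl show. A small sketch, assuming the instance is called clamd@amavisd as on my server; the wrapper name show_start_timeout is just for illustration, and the exact output format may differ between SystemD versions.

```shell
# Print the effective start timeout SystemD uses for a unit.
# Assumption: the unit is clamd@amavisd unless another name is passed in.
show_start_timeout() {
    local unit="${1:-clamd@amavisd}"
    # TimeoutStartUSec is the property behind the TimeoutStartSec setting
    systemctl show "$unit" -p TimeoutStartUSec
}
```

Run show_start_timeout before and after changing the unit to check that the new value was picked up.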

The fix

Yep, this is why you came here. Finally! 🙂

Simply tell SystemD to wait a little longer for the clamd process to finish starting up. As I wrote earlier: trivial.

Edit the /lib/systemd/system/clamd@.service file and add the TimeoutStartSec setting to the [Service] section:

[Unit]
Description = clamd scanner (%i) daemon
Documentation=man:clamd(8) man:clamd.conf(5) https://www.clamav.net/documents/
# Check for database existence
# ConditionPathExistsGlob=@DBDIR@/main.{c[vl]d,inc}
# ConditionPathExistsGlob=@DBDIR@/daily.{c[vl]d,inc}
After = syslog.target nss-lookup.target network.target

[Service]
Type = forking
TimeoutStartSec = 10min
ExecStart = /usr/sbin/clamd -c /etc/clamd.d/%i.conf
Restart = on-failure

I set it to 10 minutes, which is longer than it strictly needs to be but I like a nice margin and I am happy with it. Feel free to experiment and find a shorter period that still works.
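One caveat: the file under /lib/systemd/system belongs to the package, so a later “yum update” can replace it and silently drop the setting. An alternative that should survive updates is a drop-in file under /etc/systemd/system. The sketch below is an assumption on my part, not something I tested on this exact server: the drop-in directory for the template unit, the file name timeout.conf and the function name write_timeout_dropin are all choices made for this example.

```shell
# Write a drop-in that raises the start timeout for every clamd@ instance.
# Assumptions: drop-in directory and file name as below; run as root.
write_timeout_dropin() {
    local dir="${1:-/etc/systemd/system/clamd@.service.d}" timeout="${2:-10min}"
    mkdir -p "$dir" &&
    # A drop-in only needs the section header and the setting to override
    printf '[Service]\nTimeoutStartSec=%s\n' "$timeout" > "$dir/timeout.conf"
}
```

Calling write_timeout_dropin with no arguments writes the 10 minute value used above. SystemD still needs a “systemctl daemon-reload” afterwards to notice the new file.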

The Cleanup

After this change, make sure to activate it.

First stop both amavisd and clamd@amavisd:

systemctl stop amavisd
systemctl stop clamd@amavisd

Do check that there are no clamd or clamscan processes running anymore (ps -ef | grep clam). If there are, either wait for them to go away or kill them.
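If you would rather not stare at ps output, the waiting part can be scripted. A minimal sketch, assuming pgrep is available; the function name wait_for_clam_exit is made up, and the pattern deliberately matches only clamd and clamscan so that a running freshclam is left out of the check.

```shell
# Wait until no clamd or clamscan process is left.
# Arguments (both optional): number of tries, seconds between tries.
# Returns 0 once they are gone, 1 if they are still around at the end.
wait_for_clam_exit() {
    local tries="${1:-30}" interval="${2:-2}"
    while [ "$tries" -gt 0 ]; do
        # -x makes the pattern match the whole process name
        pgrep -x 'clamd|clamscan' > /dev/null || return 0
        tries=$((tries - 1))
        sleep "$interval"
    done
    return 1
}
```

For example, wait_for_clam_exit || echo "still running" tells you whether you need to kill the stragglers by hand.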

Next, tell SystemD to reload the config and restart amavisd (which should start clamd for you):

systemctl daemon-reload
systemctl start amavisd

You should now see clamd running at 100% CPU again for about 3 to 4 minutes, after which it will detach from the startup process and happily play nice in the background.

Use the “top” command to see it right at the top for 3 to 4 minutes, after which it should disappear. Check with “ps -ef | grep clam” to confirm that the process is indeed still there!
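Instead of watching top, you can also poll SystemD until it reports the unit as active. With Type = forking the unit should only become active once clamd has finished starting up. Again just a sketch: the unit name clamd@amavisd is the one from my server, and wait_until_active is a name invented for this example.

```shell
# Poll SystemD until the unit is active, or give up after a number of tries.
# Defaults: 60 tries, 10 seconds apart, so roughly the 10 minute timeout.
wait_until_active() {
    local unit="${1:-clamd@amavisd}" tries="${2:-60}" interval="${3:-10}"
    while [ "$tries" -gt 0 ]; do
        if systemctl is-active --quiet "$unit"; then
            echo "$unit is active"
            return 0
        fi
        tries=$((tries - 1))
        sleep "$interval"
    done
    echo "$unit did not become active in time" >&2
    return 1
}
```

If wait_until_active gives up, the start is probably still timing out and the mail log and the unit's timeout are worth another look.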

Now amavis should no longer be complaining about unresponsive clamd and no more clamscan processes should appear!

Get on with your day and enjoy that coffee.

Jhon Masschelein

Tackler of advanced Cloud and Hadoop challenges in a world of open-source technologies. – Impossible is merely a matter of time and effort. –

9 thoughts on “Fixing Clamd stuck at 100% CPU”

  1. Thanks, your solution solved our problem as well.
    A little note: after “systemctl daemon-reload” there was probably meant to be “systemctl start amavisd”.
    Unless that was there to check if the reader is before that coffee 🙂

  2. I have spent the last two days trying to resolve this exact issue and this worked like a charm. Thank you, you my good sir have my gratitude.

  3. Good find!

    My clamd was just on the fence, taking between 80 and 105 seconds, so it sometimes took a long time (when restarted on the running system) before it worked again, sometimes it worked right away (usually on boot); the kind of behaviour for a bug to drive me up the walls…

    So: thank you

  4. Many thanks for this analysis. Since I suffered an automatic upgrade of my CentOS 7 kernel and the consequent reboot, this has been bedeviling me for a day and a half. Should this be a bug report?
