
Request: Incorrect core config doesn't max out CPU

Posted: Tue Feb 07, 2023 5:58 pm
by Simonb
Hi,

We've found that if the remote agent has an incorrect unimus.access.key, or the Unimus Core service cannot connect to the Unimus server on TCP port 5509, it seems to go into a tight loop that uses maximum CPU. This makes the whole server unresponsive: we cannot log on, and the server cannot perform its normal functions.

This is a two-core VMware machine running Windows and Unimus Core version 2.2.4, and it needed rebooting from the VMware console. We think we have seen this behaviour on more than one system.

The log files do help troubleshoot it, but here is a feature suggestion: detect a failed connection to the main server, then go into a sleep state with occasional retries, or shut the service down cleanly instead.

Re: Request: Incorrect core config doesn't max out CPU

Posted: Thu Feb 09, 2023 2:11 pm
by Vik@Unimus
Hello,

When the connection to the Unimus Server is lost (as you mentioned, the server may be unavailable or the configuration invalid), the Unimus Core will start retrying the connection. Two main timeouts apply to this communication. The first is the connection timeout itself, which is set to 5 seconds. An attempt may or may not take that long, depending on the type of issue (e.g. a connection timing out vs. a connection reset by peer or a connection being refused). The second is a delay between any two connection attempts, which is also set to 5 seconds.
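
As a rough illustration of the two timeouts described above, here is a minimal Python sketch. It is not Unimus Core's actual code; the function name, the `max_attempts`, `connect` and `sleep` parameters, and the host name in the usage note are all assumptions made for the example.

```python
import socket
import time

CONNECT_TIMEOUT_S = 5.0  # connection timeout mentioned in the post
RETRY_DELAY_S = 5.0      # delay between two attempts mentioned in the post


def connect_with_retry(host, port, max_attempts=None,
                       connect=socket.create_connection, sleep=time.sleep):
    """Retry a TCP connection, pausing between failed attempts.

    Hypothetical helper: max_attempts, connect and sleep exist only so the
    sketch can be bounded and tested. max_attempts=None retries forever.
    """
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        try:
            # A refused or reset connection fails almost immediately;
            # an unreachable host takes up to CONNECT_TIMEOUT_S to fail.
            return connect((host, port), timeout=CONNECT_TIMEOUT_S)
        except OSError:
            # Without this pause, fast failures (e.g. "refused") would
            # turn the loop into a busy spin.
            sleep(RETRY_DELAY_S)
    return None
```

Calling `connect_with_retry("unimus.example", 5509)` (placeholder host name) would keep trying until the server accepts the connection. With fast failures such as a refused connection, the fixed delay limits this sketch to roughly one attempt every 5 seconds.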

The whole connection retry process amounts to a single TCP connection attempt repeated over time, and it should not cause any hardware resource exhaustion. If it does, it is important for us to look at it, as it might reveal an issue on our end, and those are exactly the issues we want to locate and fix as soon as possible. We have located a couple of such issues in the past, but all were fixed in prior versions, and 2.2.4 should be smooth as butter in this respect.
In any case, you mentioned that debug logs helped to troubleshoot it. Would you mind providing us with more details? Which logs, what did the issue turn out to be in your case, etc.? If possible, I would recommend submitting a support ticket so that we can take a closer look at all the information and troubleshoot it further, if needed.

Lastly, as for the feature request part: currently, we don't plan to add any form of automatic throttling or auto-shutdown after a timeout. Our main goal is for any remote Core to reconnect and be available again within seconds, as seamlessly as possible, once a connection is restored after an interruption.
We believe that if we were to shut down the process after some time, some network deployments could end up with multiple Cores shutting down, each requiring manual intervention to restart the service.