[Fixed in 1.10.2] Devices stuck in Discover or Backup

Unimus support forum
SteveLamb
Posts: 18
Joined: Fri Dec 22, 2017 3:34 pm

Mon May 13, 2019 9:38 pm

We have several instances of Unimus running. It seems that when one instance has > 1000 devices, some devices fail to finish discovery or backup. Once a device is stuck in this state, no further action can be attempted on it. Restarting the Unimus instance, or deleting and re-adding the device, seems to resolve the issue.

Some of our backups may take a very long time, as we have some switches and routers with very large VLAN tables.

We are currently running version 1.10.1.
Below is from the error log when this appears to occur.


prod_unimus_unimus.1.xvshfhsqctw3@test.example.com    | 2019-04-30 03:07:18.274  WARN 1 --- [  discovery-106] net.unimus.core.api.CoreImpl             : Error during discovery of 10.109.41.5
prod_unimus_unimus.1.xvshfhsqctw3@test.example.com    | 
prod_unimus_unimus.1.xvshfhsqctw3@test.example.com    | java.lang.IllegalStateException: Can't start StopWatch: it's already running
prod_unimus_unimus.1.xvshfhsqctw3@test.example.com    |   at org.springframework.util.StopWatch.start(StopWatch.java:127)
prod_unimus_unimus.1.xvshfhsqctw3@test.example.com    |   at org.springframework.util.StopWatch.start(StopWatch.java:116)
prod_unimus_unimus.1.xvshfhsqctw3@test.example.com    |   at net.unimus.core.util.metrics.JobDurationMetrics.startMeasuring(JobDurationMetrics.java:27)
prod_unimus_unimus.1.xvshfhsqctw3@test.example.com    |   at net.unimus.core.api.CoreImpl$DiscoveryExecutor.doRun(CoreImpl.java:348)
prod_unimus_unimus.1.xvshfhsqctw3@test.example.com    |   at net.unimus.core.api.CoreImpl$ErrorHandlingExecutor.run(CoreImpl.java:302)
prod_unimus_unimus.1.xvshfhsqctw3@test.example.com    |   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
prod_unimus_unimus.1.xvshfhsqctw3@test.example.com    |   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
prod_unimus_unimus.1.xvshfhsqctw3@test.example.com    |   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
prod_unimus_unimus.1.xvshfhsqctw3@test.example.com    |   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
prod_unimus_unimus.1.xvshfhsqctw3@test.example.com    |   at java.lang.Thread.run(Thread.java:745)
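
Editor's note: the exception in the trace above is what a start/stop timer raises when start() is called while a previous measurement is still open. The thread does not show how Unimus manages its timers, so the following is only a minimal sketch of one way such an error can arise: a timer shared across job runs where a failed run never calls stop(). `SimpleStopWatch`, `runLeaky`, and `runSafe` are hypothetical names made up for illustration; they are not Unimus or Spring code.

```java
public class StopWatchDemo {

    /** Hypothetical stand-in for a timer that refuses to start twice. */
    static final class SimpleStopWatch {
        private boolean running;
        private long startNanos;
        private long totalNanos;

        void start() {
            if (running) {
                throw new IllegalStateException("Can't start StopWatch: it's already running");
            }
            running = true;
            startNanos = System.nanoTime();
        }

        void stop() {
            if (!running) {
                throw new IllegalStateException("Can't stop StopWatch: it's not running");
            }
            totalNanos += System.nanoTime() - startNanos;
            running = false;
        }
    }

    /** Assumed failure mode: the timer is reused, and a failing job skips stop(). */
    static String runLeaky(SimpleStopWatch shared, boolean fail) {
        try {
            shared.start();
            if (fail) {
                throw new RuntimeException("device timed out");
            }
            shared.stop();
            return "ok";
        } catch (IllegalStateException e) {
            return e.getMessage();
        } catch (RuntimeException e) {
            return "failed: " + e.getMessage(); // timer is left running here
        }
    }

    /** One timer per job, always stopped in finally, so no state leaks between runs. */
    static String runSafe(boolean fail) {
        SimpleStopWatch own = new SimpleStopWatch();
        own.start();
        try {
            if (fail) {
                throw new RuntimeException("device timed out");
            }
            return "ok";
        } catch (RuntimeException e) {
            return "failed: " + e.getMessage();
        } finally {
            own.stop();
        }
    }

    public static void main(String[] args) {
        SimpleStopWatch shared = new SimpleStopWatch();
        System.out.println(runLeaky(shared, true));  // first job fails, timer stays open
        System.out.println(runLeaky(shared, false)); // next job cannot start the timer
        System.out.println(runSafe(true));
        System.out.println(runSafe(false));
    }
}
```

In the leaky variant, every run after the first failure hits the same exception, which would be consistent with a device staying stuck until a restart. Whether this matches the actual root cause is not stated anywhere in this thread.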
Tomas
Posts: 1206
Joined: Sat Jun 25, 2016 12:33 pm

Fri May 17, 2019 3:38 am

Just an update on this - we are investigating this issue right now.

We have had 1 additional customer report this as well, so it is definitely something on our end.
It seems this issue is not widespread, however, as we have had no reports other than these 2.

I will post an update as soon as we have any news.
Tomas
Posts: 1206
Joined: Sat Jun 25, 2016 12:33 pm

Mon May 20, 2019 3:54 pm

Update:

This issue should now be solved, and the fix will be available in 1.10.2.

Can you please test with the latest 1.10.2 Beta release and let us know if this fixes it for you?
viewtopic.php?p=2172#p2172

Thanks!
SteveLamb
Posts: 18
Joined: Fri Dec 22, 2017 3:34 pm

Thu May 23, 2019 3:52 pm

This is working better but not totally fixed. After the first scheduled event we have 3 devices stuck in discovery. Running version 1.10.2-Beta2. Is there information I can provide that will assist with this?

Thanks
Tomas
Posts: 1206
Joined: Sat Jun 25, 2016 12:33 pm

Thu May 23, 2019 4:13 pm

We are investigating - we also have a report from another customer that this is not yet fully fixed.

Will provide an update ASAP.
Tomas
Posts: 1206
Joined: Sat Jun 25, 2016 12:33 pm

Tue Jun 04, 2019 11:36 am

We have found additional rare cases where jobs could get stuck.

Could you please try with the latest Beta build and let us know if you are still seeing issues?
viewtopic.php?p=2172#p2172

Thanks!
SteveLamb
Posts: 18
Joined: Fri Dec 22, 2017 3:34 pm

Mon Jun 10, 2019 9:30 pm

Good news. On the first run through of 1200 devices I had 0 fail to discover. I will let you know if I see others fail later in the week.

This is on Beta5.
SteveLamb
Posts: 18
Joined: Fri Dec 22, 2017 3:34 pm

Wed Jun 12, 2019 1:24 pm

I may have spoken too soon.

We don't have as much of an issue with discovery; after 2 days I am seeing 5 out of ~1200. But 75% of them are sitting stuck in backup.

Let me know if I can provide any information that would help with this.

1.10.2-Beta5
Tomas
Posts: 1206
Joined: Sat Jun 25, 2016 12:33 pm

Thu Jun 13, 2019 3:04 pm

Could we schedule a Webex session to investigate this in detail?
Sent a PM to work out the details.