Tomas wrote: ↑Fri Aug 17, 2018 11:25 am
JAz wrote: ↑Wed Aug 15, 2018 8:44 pm
The error message is not helpful/explanatory. Suggest you see what can be done about that.
Indeed, the error could use a cleaner name.
Will work on that
Thank you, Tomas. Thank you for understanding that there's more to software than just the "output", and thanks for hearing and accepting criticism and suggestions and actually trying to do something about it. I can only speak for myself, but I truly appreciate that about you.
JAz wrote: ↑Wed Aug 15, 2018 8:44 pm
Why do my devices take so long
|| why is the default timeout so low?
Why do my devices take so long?
This is probably a RouterOS limitation in how regex matching against long strings is done.
Since the little boards have very limited CPUs (the original 750 has a 400MHz MIPS CPU), if there are a lot of schedulers (containing lots of code), then due to how the MikroTik CLI is interpreted into backend config parsing, it can indeed fully load such a low-end CPU.
This makes a lot of sense. I have a few scripts on each of these devices and they are very lengthy. They are loaded with 'debug' routines and statements that are normally 'off' in production, but the code is still there, and a statement like the one that stalled certainly tried to parse the whole thing. Since it boils down to the RouterOS regex implementation, there's nothing to be done and it is what it is.
Why is the default timeout so low?
This timeout (how long to wait for commands to finish, or for requested output to be returned) is set at 20 seconds so jobs don't take forever to time out / end when things don't go as expected.
If we increased this to, say, 1 minute, then the minimum time each job would take to fail would rise to 1 minute.
That is not really good UX (user experience): waiting for jobs to finish for long periods of time without knowing what's happening.
We assumed 20 seconds would be enough for any device to return at least some output (see the mentioned wiki article for details on this).
Apparently, in some cases this is not as we had hoped, though.
I gave this some thought. I clearly understand your need to 'balance' between too low and too high. (Everything in our world is a balance, isn't it?)
But let me think out loud here a bit. Maybe it's a solution, maybe just food for thought.
Borrowing from things I've seen elsewhere (nmap probing, I think? some others as well?), perhaps you could employ "progressive" timeout logic so that jobs can fail quickly and gracefully, keep a feedback loop to the end user (or job log), and progressively increase the timeout/retry up to a maximum.
I recognize that this would be more involved from a coding standpoint (the work is not lost on me), so forgive me for being so liberal as to 'assign you homework', but the logic should be reusable and might serve to solve this and similar problems in more than one place in the software...
I envision it looking something like this:
- You start with a short timeout. The shortest plausible value, maybe 5 seconds.
- If it completes, great. If not, update the user/log ("previous attempt timed out, retrying with a higher timeout value X") and retry with a higher timeout - say 15 seconds.
- If it completes, great. If not, update the user/log and retry with a higher timeout - say 30 seconds.
- If it completes now, great. If not, update the user/log yet again and retry with a still higher timeout - say 60 seconds.
- Continue this iterative loop up to a user-defined maximum/impractical value - say 600 or even 1200 seconds.
- Here I've used roughly a 2x multiplier (doubling each time).
- --- Ideally it would be up to the user to supply this multiplier (1.5, 3, etc.),
- --- the user would define the maximum (300 seconds? 1200 seconds? 12,000 seconds?), and
- --- the user would define the 'defer' value (see the next paragraph).
Additionally, as a step further, you designate a value at which you don't retry immediately (the user-defined 'defer' value above) and defer those retries to the end of the job, so the remaining actions of the job can finish first, and then batch the retries/failures at the end (or, if possible, spin the retries off into their own threads to run concurrently?...)
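To make the idea a bit more concrete, here is a minimal sketch of that loop in Python. This is purely illustrative and makes no assumptions about how Unimus is actually built; run_command, on_feedback and all the default values below are placeholder names and numbers of my own.
[code]
# Minimal sketch of progressive (growing) timeouts with a 'defer' threshold.
# run_command, on_feedback and the defaults are placeholders, not Unimus code.

class TimeoutExceeded(Exception):
    """Raised when a single attempt does not finish within its timeout."""


def run_with_progressive_timeout(run_command, on_feedback,
                                 initial=5.0, multiplier=2.0,
                                 maximum=600.0, defer_threshold=60.0):
    """Try run_command(timeout=...) with progressively larger timeouts.

    Returns (result, None) on success. If the next attempt would need a
    timeout above defer_threshold, returns (None, next_timeout) so the
    caller can queue that retry for the end of the job instead of
    blocking the rest of the push right now.
    """
    timeout = initial
    while True:
        try:
            return run_command(timeout=timeout), None
        except TimeoutExceeded:
            if timeout >= maximum:
                raise  # already tried the user-defined maximum, give up
            next_timeout = min(timeout * multiplier, maximum)
            on_feedback(f"attempt with a {timeout:.0f}s timeout failed, "
                        f"next attempt will use {next_timeout:.0f}s")
            if next_timeout > defer_threshold:
                # Past the 'defer' value: hand the retry back to the job
                # so the remaining devices/actions can finish first.
                return None, next_timeout
            timeout = next_timeout
[/code]
The deferred retries the caller collects (each with its suggested next timeout) could then either be batched and run after everything else in the job has finished, or handed off to their own worker threads to run concurrently, matching the two options mentioned above.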
An approach like that should enable more useful and more frequent feedback (to the console and/or log) for the user, while still allowing Unimus to get progressively more aggressive, even to extreme values, in trying to complete the job, without 'mysteriously' stalling out from the user's perspective.
Bonus points if this is configurable per Push rather than global to the server.
Or perhaps this logic can be applied by device/processor type/capability.
Yes. I know. Why not ask you to solve peace in the Middle East too while you're at it?
I apologized in advance, didn't I?
Final words:
I think most of the issues here can be fixed with better UX.
(maybe by allowing timeout configuration directly in the Mass Config Push preset)
Also, better error wording and better visibility into why this is happening and how to fix it.
We will work on all of these.
Thank you Tomas.
I appreciate your efforts and respect your dedication.
Salud