[Clusterusers] NIMBYed nodes

Lee Spector lspector at hampshire.edu
Mon Apr 20 15:43:50 EDT 2015


Maybe it's finally time for me to do this. As Tom points out, I should have a better handle on it before he leaves. But the runs that I do are a little different from what he usually does.

Is it now possible to do this with all of the following properties?

- My run starts immediately.
- It runs on a node that I specify.
- No other jobs run on that node until my run is done, even if mine temporarily drops to low CPU utilization.
- I can see stderr as well as stdout, both of which I'll be piping to a file (which has to be somewhere I can grep it).
- I can easily and instantly kill and restart my runs.
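Concretely, for the last two points, something like this minimal sketch is what I have in mind (run_job, LOG, and PIDFILE are placeholders for the real run and paths):

```shell
#!/bin/sh
# Minimal sketch of the last two requirements above; run_job is a
# stand-in for the actual compute job.
LOG=./run.log                 # somewhere I can grep
PIDFILE=./run.pid

run_job() {                   # placeholder for the real run command
    echo "generation 1 best fitness 0.93"
    echo "warning: low diversity" 1>&2
}

run_job >"$LOG" 2>&1 &        # 2>&1 folds stderr into the same file
echo $! >"$PIDFILE"           # saved so the run can be killed instantly:
                              #   kill "$(cat "$PIDFILE")"
wait                          # (here: just let the toy job finish)
```

Killing is then one command against the saved PID, and restarting is just rerunning the script.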

If so, then yes, I guess I should try to switch to a tractor-based workflow for my work on fly.
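If I go that route, this is a sketch of the sort of invocation I mean; the tractor-spool flag names here (--priority, -c) are assumptions about whatever version we run and would need checking against tractor-spool --help before anyone relies on them:

```shell
#!/bin/sh
# Sketch only, not verified against our Tractor install: spool a run at
# high priority instead of NIMBYing a node by hand. The flag names
# (--priority, -c) are assumptions; check tractor-spool --help first.
SPOOL="tractor-spool --priority=100 -c /path/to/my_run.sh"
echo "would spool with: $SPOOL"  # echo until the flags are verified
```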

 -Lee


> On Apr 20, 2015, at 3:26 PM, Chris Perry <perry at hampshire.edu> wrote:
> 
> 
> I have a different ideal solution to propose: use tractor!
> 
> What you are looking to do with your NIMBY mechanism is to pull machines out of tractor temporarily so they can run your job(s). But that is exactly what tractor does: it gives a job to an idle machine, then keeps that machine from running new jobs until the first job is killed or completed. If you spool with high enough priority, tractor kills and respools a lower-priority job to free up the machine, so you don't have to wait if immediate access is a concern of yours.
> 
> Not to mention, the tractor interface shows running jobs, so there would be a “Lee’s job” item in green on the list. That might make it less likely that you lose track of runs in the way you describe happening now.
> 
> Worth considering? I’m sure there are a few kinks to figure out (such as how to tag your jobs so that they have the full machine, guaranteed) but I feel confident that we can do this.
> 
> - chris
> 
> On Apr 20, 2015, at 2:31 PM, Lee Spector <lspector at hampshire.edu> wrote:
> 
>> 
>> Some of these were probably my doing, but I only recall nimbying 1-4 and 4-5 in the recent past.
>> 
>> It's not a problem with a node that causes me to do this; it's an interest in having total control of a node for a compute-intensive multithreaded run, with no chance that any other processes will be allocated on it. Sometimes I'll start one of these, check in on it regularly for a while, and then check in less frequently if it's not doing anything really interesting but I'm not ready to kill it in case it might still do something good. Then sometimes I lose track. Right now I have nothing running, and any nimbying that I've done can be unnimbyed, although I'm not 100% sure what I may have left nimbyed.
>> 
>> Ideally, I guess, we'd have a nimby command that records who nimbyed and maybe periodically asks them whether they still want it nimbyed. I'm not sure how difficult that would be. If somebody is inspired to do this, it would also be nice if the command for nimbying/unnimbying were more straightforward than the current one (which takes a 1 or a 0 as an argument, which is a little confusing), if it returned a value that made the new status clear, and if there were a simple way just to check the status (maybe there is now? if so, I don't know it).
>> 
>> -Lee
>> 
>> 
>>> On Apr 20, 2015, at 1:36 PM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:
>>> 
>>> Hi all,
>>>  Why are nodes 1-17, 1-18, 1-2, 1-4, and 1-9 NIMBYed? If you are
>>> having a problem with a node that causes you to need to NIMBY it, please
>>> let me know, because maybe it just means I screwed up the service keys
>>> or something. I'm not omniscient :)
>>> 
>>> -- 
>>> Wm. Josiah Erikson
>>> Assistant Director of IT, Infrastructure Group
>>> System Administrator, School of CS
>>> Hampshire College
>>> Amherst, MA 01002
>>> (413) 559-6091
>>> 
>>> _______________________________________________
>>> Clusterusers mailing list
>>> Clusterusers at lists.hampshire.edu
>>> https://lists.hampshire.edu/mailman/listinfo/clusterusers
>> 
>> --
>> Lee Spector, Professor of Computer Science
>> Director, Institute for Computational Intelligence
>> Cognitive Science, Hampshire College
>> 893 West Street, Amherst, MA 01002-3359
>> lspector at hampshire.edu, http://hampshire.edu/lspector/
>> Phone: 413-559-5352, Fax: 413-559-5438
>> 
> 



