[Clusterusers] NIMBYed nodes

Wm. Josiah Erikson wjerikson at hampshire.edu
Wed Apr 22 17:32:03 EDT 2015


Nope, I don't think so. It errored because it ran out of RAM. Keep in mind
it has 24 GB of RAM, so something is probably still up with your scene.
Maybe if you got lucky and landed on one of the really big nodes, that
frame made it through :) I just rebooted it and will un-NIMBY it.
    -Josiah



On 4/22/15 3:16 PM, Piper Odegard wrote:
>
> Hey Josiah,
>
> I just NIMBY'd compute 2-1 because it errored out 3 tasks in a row on
> one of my jobs. In the future, is there a way to tell a job, on the fly,
> not to use a particular machine for the rest of its run?
>
> - Piper
>
>  
>
>  
>
> On 2015-04-20 18:59, Thomas Helmuth wrote:
>
>> Awesome, thanks Chris! I should be able to merge that idea into my
>> current launching scripts to get Lee going, but we'll page you if we
>> run into trouble.
>>
>> Tom
>>
>> On Mon, Apr 20, 2015 at 6:57 PM, Lee Spector <lspector at hampshire.edu> wrote:
>>
>>     Thanks Chris!
>>      
>>      -Lee
>>      
>>
>>>     On Apr 20, 2015, at 6:47 PM, Chris Perry <perry at hampshire.edu> wrote:
>>>
>>>      
>>>     My forum post resulted in a solution that could work for us if
>>>     we keep the slots as they are. The RemoteCmd line to use is:
>>>      
>>>     RemoteCmd { command } -samehost 1 -atmost 4 -atleast 4 -service
>>>     { blah }
>>>      
>>>     That will check out exactly 4 slots on a single node. So all Lee
>>>     needs is to know how many slots are on the machine he wants and
>>>     we (should be) good to go using a RemoteCmd structure like this. 
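>>>
>>>     For instance, here's a minimal sketch of a launcher that writes out
>>>     such a job file from Python (the job title, run command, and slot
>>>     count are hypothetical placeholders, and it assumes the usual
>>>     Job/Task wrapper around the RemoteCmd line above):
>>>
>>>     #!/usr/bin/env python
>>>     # Sketch only: wrap one command in the RemoteCmd structure above so
>>>     # that it checks out all of a node's slots at once.
>>>     def make_job(command, slots, service="blah", title="whole-node run"):
>>>         """Return Tractor job-file text checking out `slots` slots on one host."""
>>>         return (
>>>             "Job -title {%s} -subtasks {\n"
>>>             "    Task {run} -cmds {\n"
>>>             "        RemoteCmd {%s} -samehost 1"
>>>             " -atmost %d -atleast %d -service {%s}\n"
>>>             "    }\n"
>>>             "}\n" % (title, command, slots, slots, service)
>>>         )
>>>
>>>     if __name__ == "__main__":
>>>         # e.g. a machine with 8 slots; spool the resulting .alf with the
>>>         # Tractor spooler as usual
>>>         with open("wholenode.alf", "w") as f:
>>>             f.write(make_job("sleep 60", slots=8))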
>>>      
>>>     I’m happy to help you guys make these scripts work if and when
>>>     the time comes.
>>>      
>>>     - chris
>>>
>>>     On Apr 20, 2015, at 5:28 PM, Lee Spector <lspector at hampshire.edu> wrote:
>>>
>>>>      
>>>>     This, in conjunction with our observations of weird
>>>>     multithreading performance pathologies on the JVM, makes me
>>>>     think that we SHOULDN'T change the slots per node from whatever
>>>>     we're currently doing, at least not universally or by default.
>>>>     Is there a way to make this work as it currently does most of
>>>>     the time, but for me to grab a machine in 1-slot mode when I
>>>>     want one?
>>>>      
>>>>     My use case is the exception, not the rule, and I usually need
>>>>     one or a small handful of these sessions going at a time (I run
>>>>     them individually and watch them). At the moment I'm just using
>>>>     my desktop Mac, and I think that actually runs my code faster
>>>>     than any fly node (although 1-4 seemed to perform well). I may
>>>>     want to run on a fly node or two at some point too, but I
>>>>     wouldn't want the performance of other work (especially Tom's
>>>>     many, many large runs) to take a hit for this.
>>>>      
>>>>      -Lee
>>>>      
>>>>      
>>>>
>>>>>     On Apr 20, 2015, at 4:59 PM, Thomas Helmuth <thelmuth at cs.umass.edu> wrote:
>>>>>
>>>>>     I agree with the others that it should be pretty easy to do
>>>>>     everything you need. And I can walk you through how to easily
>>>>>     do it in Clojush with the python script I have to launch runs.
>>>>>
>>>>>     I'm not sure if I'd get more or less performance with one slot
>>>>>     per node, but I'm guessing less -- I often get multiple runs
>>>>>     going on a node, so I'd need a 4x or 8x speedup from
>>>>>     multithreading a single run to get the same amount of work
>>>>>     done. But maybe this would just happen automatically with how
>>>>>     we have it set up in Clojush.
>>>>>
>>>>>     Tom
>>>>>
>>>>>     On Mon, Apr 20, 2015 at 4:19 PM, Chris Perry <perry at hampshire.edu> wrote:
>>>>>
>>>>>
>>>>>         I just wrote a post to the tractor developer forum asking
>>>>>         about the one piece of this that I’m not 100% sure about,
>>>>>         but if we go with Josiah’s recommendation of one slot per
>>>>>         blade then my post will be irrelevant and YES we can do
>>>>>         exactly what you want.
>>>>>
>>>>>         And Josiah, I do think that you’re right about one slot
>>>>>         per node, not only because of the great multithreading
>>>>>         that everyone’s doing, but because our fileserver can’t
>>>>>         keep up with too many concurrent reads/writes anyway.
>>>>>
>>>>>         - chris
>>>>>
>>>>>         On Apr 20, 2015, at 3:55 PM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:
>>>>>
>>>>>         >
>>>>>         >
>>>>>         > On 4/20/15 3:43 PM, Lee Spector wrote:
>>>>>         >> Maybe it's finally time for me to do this. As Tom points
>>>>>         >> out, I should have a better handle on it before he leaves.
>>>>>         >> But the runs that I do are a little different from what he
>>>>>         >> usually does.
>>>>>         >>
>>>>>         >> Is it now possible to do this with all of these properties?:
>>>>>         >>
>>>>>         >> - My run starts immediately.
>>>>>         > Yes. If you spool with a high enough priority, it will
>>>>>         > actually terminate and re-spool other jobs that are in your way.
>>>>>         >> - It runs on a node that I specify.
>>>>>         > Yes. The service key would just be the node you want.
>>>>>         >> - No other jobs run on that node until my run is done, even
>>>>>         >> if mine temporarily goes to low CPU utilization.
>>>>>         > We would have to make it so that your node had one slot, and
>>>>>         > then you did something like this if you had more than one command:
>>>>>         > http://fly.hampshire.edu/docs/Tractor/scripting.html#sharing-one-server-check-out-among-several-commands
>>>>>         > I think that most things we are doing these days are
>>>>>         > multithreaded enough that one slot per node might be a valid
>>>>>         > choice cluster-wide... at least at the moment with Maxwell
>>>>>         > and prman. What say others?
>>>>>         >> - I can see stderr as well as stdout, which I'll be piping
>>>>>         >> to a file (which has to be somewhere I can grep it).
>>>>>         > Yes, just pipe each one to whatever file you want.
>>>>>         >> - I can easily and instantly kill and restart my runs.
>>>>>         > Yes. Right-click and restart, interrupt, or delete.
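>>>>>         >
>>>>>         > Putting those answers together, here's a minimal sketch of
>>>>>         > what such a job could look like, written out from Python
>>>>>         > (the node name, paths, and run command are made-up
>>>>>         > placeholders, and it assumes the target node has been set to
>>>>>         > one slot as discussed above):
>>>>>         >
>>>>>         > #!/usr/bin/env python
>>>>>         > # Sketch only: one job pinned to one specific node (service
>>>>>         > # key = node name), with stdout and stderr piped to a
>>>>>         > # greppable log file.
>>>>>         > JOB = """Job -title {interactive whole-node run} -subtasks {
>>>>>         >     Task {run} -cmds {
>>>>>         >         RemoteCmd {/bin/sh -c "my_run > /home/lee/runs/run1.log 2>&1"} -service {compute-1-4}
>>>>>         >     }
>>>>>         > }
>>>>>         > """
>>>>>         > with open("lee_run.alf", "w") as f:
>>>>>         >     f.write(JOB)
>>>>>         > # Spool lee_run.alf with the Tractor spooler; a high enough
>>>>>         > # priority gets it started right away, as noted above.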
>>>>>         >>
>>>>>         >> If so then yes, I guess I should try to switch to a
>>>>>         tractor-based workflow for my work on fly.
>>>>>         >>
>>>>>         >> -Lee
>>>>>         >>
>>>>>         >>
>>>>>         >>> On Apr 20, 2015, at 3:26 PM, Chris Perry <perry at hampshire.edu> wrote:
>>>>>         >>>
>>>>>         >>>
>>>>>         >>> I have a different ideal solution to propose: use tractor!
>>>>>         >>>
>>>>>         >>> What you are looking to do with your NIMBY mechanism is
>>>>>         >>> to pull machines out of tractor temporarily so they can
>>>>>         >>> run your job(s). But this is exactly what tractor does: it
>>>>>         >>> gives a job to an idle machine, then keeps that machine
>>>>>         >>> from running new jobs until the first job is killed or
>>>>>         >>> completed. If you spool with a high enough priority, tractor
>>>>>         >>> kills and respools a lower-priority job so as to free up
>>>>>         >>> the machine, so you don’t have to wait if immediate access
>>>>>         >>> is a concern of yours.
>>>>>         >>>
>>>>>         >>> Not to mention, the tractor interface will show running
>>>>>         >>> jobs, so there would be a “Lee’s job” item in green on the
>>>>>         >>> list. This might make you less likely to lose track in the
>>>>>         >>> way you describe happening now.
>>>>>         >>>
>>>>>         >>> Worth considering? I’m sure there are a few kinks to figure
>>>>>         >>> out (such as how to tag your jobs so that they have the
>>>>>         >>> full machine, guaranteed), but I feel confident that we can
>>>>>         >>> do this.
>>>>>         >>>
>>>>>         >>> - chris
>>>>>         >>>
>>>>>         >>> On Apr 20, 2015, at 2:31 PM, Lee Spector <lspector at hampshire.edu> wrote:
>>>>>         >>>
>>>>>         >>>> Some of these were probably my doing, but I only recall
>>>>>         >>>> nimbying 1-4 and 4-5 in the recent past.
>>>>>         >>>>
>>>>>         >>>> It's not a problem with a node that causes me to do this;
>>>>>         >>>> it's an interest in having total control of a node for a
>>>>>         >>>> compute-intensive multithreaded run, with no chance that
>>>>>         >>>> any other processes will be allocated on it. Sometimes
>>>>>         >>>> I'll start one of these, check in on it regularly for a
>>>>>         >>>> while, and then check in less frequently if it's not doing
>>>>>         >>>> anything really interesting but I'm not ready to kill it
>>>>>         >>>> in case it might still do something good. Then sometimes I
>>>>>         >>>> lose track. Right now I have nothing running, and any
>>>>>         >>>> nimbying that I've done can be unnimbyed, although I'm not
>>>>>         >>>> 100% sure what I may have left nimbyed.
>>>>>         >>>>
>>>>>         >>>> Ideally, I guess, we'd have a nimby command that records
>>>>>         >>>> who nimbyed and maybe periodically asks them if they still
>>>>>         >>>> want it nimbyed. I'm not sure how difficult that would be.
>>>>>         >>>> If somebody is inspired to do this, it would also be nice
>>>>>         >>>> if the command for nimbying/unnimbying were more
>>>>>         >>>> straightforward than the current one (which takes a 1 or a
>>>>>         >>>> 0 as an argument, which is a little confusing), if it
>>>>>         >>>> returned a value that made the new status clear, and if
>>>>>         >>>> there were a simple way just to check the status (maybe
>>>>>         >>>> there is now? If so, I don't know it).
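>>>>>         >>>>
>>>>>         >>>> Something along these lines could cover most of that. It
>>>>>         >>>> is purely a hypothetical sketch: the log path and the
>>>>>         >>>> location of the real nimby command are guesses, and the
>>>>>         >>>> only thing assumed about the current tool is that it takes
>>>>>         >>>> a 1 or a 0:
>>>>>         >>>>
>>>>>         >>>> #!/usr/bin/env python
>>>>>         >>>> # Hypothetical "nimbyctl" wrapper: takes on/off/status
>>>>>         >>>> # instead of 1/0, records who toggled NIMBY and when, and
>>>>>         >>>> # reports the new state.
>>>>>         >>>> import getpass, subprocess, sys, time
>>>>>         >>>>
>>>>>         >>>> NIMBY_CMD = "/usr/local/bin/nimby"  # placeholder for the real command
>>>>>         >>>> LOG = "/var/log/nimby.log"          # placeholder log location
>>>>>         >>>>
>>>>>         >>>> def main():
>>>>>         >>>>     action = sys.argv[1] if len(sys.argv) > 1 else "status"
>>>>>         >>>>     if action in ("on", "off"):
>>>>>         >>>>         subprocess.call([NIMBY_CMD, "1" if action == "on" else "0"])
>>>>>         >>>>         with open(LOG, "a") as f:
>>>>>         >>>>             f.write("%s %s nimby=%s\n"
>>>>>         >>>>                     % (time.ctime(), getpass.getuser(), action))
>>>>>         >>>>         print("NIMBY is now %s" % action)
>>>>>         >>>>     else:
>>>>>         >>>>         # "status": show the most recent toggle from the log.
>>>>>         >>>>         try:
>>>>>         >>>>             with open(LOG) as f:
>>>>>         >>>>                 last = f.readlines()[-1].strip()
>>>>>         >>>>         except (IOError, IndexError):
>>>>>         >>>>             last = "no NIMBY changes logged yet"
>>>>>         >>>>         print(last)
>>>>>         >>>>
>>>>>         >>>> if __name__ == "__main__":
>>>>>         >>>>     main()
>>>>>         >>>>
>>>>>         >>>> A cron job that reads the log could handle the periodic
>>>>>         >>>> "do you still want this NIMBYed?" reminder.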
>>>>>         >>>>
>>>>>         >>>> -Lee
>>>>>         >>>>
>>>>>         >>>>
>>>>>         >>>>> On Apr 20, 2015, at 1:36 PM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:
>>>>>         >>>>>
>>>>>         >>>>> Hi all,
>>>>>         >>>>> Why are nodes 1-17, 1-18, 1-2, 1-4, and 1-9 NIMBYed? If
>>>>>         >>>>> you are having a problem with a node that causes you to
>>>>>         >>>>> need to NIMBY it, please let me know, because maybe it
>>>>>         >>>>> just means I screwed up the service keys or something.
>>>>>         >>>>> I'm not omniscient :)

-- 
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091


