[Clusterusers] NIMBYed nodes

Wm. Josiah Erikson wjerikson at hampshire.edu
Thu Apr 23 09:27:46 EDT 2015


I added him to the list.
    -Josiah


On 4/22/15 9:20 PM, Piper Odegard wrote:
>
> Huh, other frames in the scene had been rendering fine on the rest of
> rack2—something was funky. Whatever the case is, we're forging ahead.
>
> Also, I just NIMBY'd asheclass-08 because someone is rendering their
> anim 2 final on it locally.
>
> Also also, I've cc'd Jordan Miron—he's started rendering his final
> project on the farm, so it wouldn't hurt for him to be in the loop.
>
> - Piper
>
>  
>
>  
>
> On 2015-04-22 17:32, Wm. Josiah Erikson wrote:
>
>> Nope, don't think so. It errored because it ran out of RAM. Keep in
>> mind it has 24GB of RAM, so something is probably still up with your
>> scene. Maybe, if you got lucky and the frame went to one of the really
>> big nodes, it made it through :) I just rebooted it and I will
>> un-NIMBY it.
>>     -Josiah
>>
>>
>>
>> On 4/22/15 3:16 PM, Piper Odegard wrote:
>>>
>>> Hey Josiah,
>>>
>>> I just NIMBY'd compute 2-1 because it errored out 3 tasks in a row
>>> on one of my jobs. For future reference, is there a way to tell a
>>> running job, on the fly, not to use a particular machine for the
>>> rest of its run?
>>>
>>> - Piper
>>>
>>>  
>>>
>>>  
>>>
>>> On 2015-04-20 18:59, Thomas Helmuth wrote:
>>>
>>>     Awesome, thanks Chris! I should be able to merge that idea into
>>>     my current launching scripts to get Lee going, but we'll page
>>>     you if we run into trouble.
>>>
>>>     Tom
>>>
>>>     On Mon, Apr 20, 2015 at 6:57 PM, Lee Spector <lspector at hampshire.edu> wrote:
>>>
>>>         Thanks Chris!
>>>          
>>>          -Lee
>>>          
>>>
>>>             On Apr 20, 2015, at 6:47 PM, Chris Perry <perry at hampshire.edu> wrote:
>>>
>>>              
>>>             My forum post resulted in a solution that could work for
>>>             us if we keep the slots as they are. The RemoteCmd line
>>>             to use is:
>>>              
>>>             RemoteCmd { command } -samehost 1 -atmost 4 -atleast 4 -service { blah }
>>>              
>>>             That will check out exactly 4 slots on a single node. So
>>>             all Lee needs to know is how many slots are on the
>>>             machine he wants, and we (should be) good to go using a
>>>             RemoteCmd structure like this.
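>>>              
>>>             For example, a minimal job using that structure might look
>>>             like the sketch below. The job title, command, and service
>>>             key are just placeholders, and I haven't run this exact
>>>             script, so treat it as a starting point rather than a
>>>             recipe:
>>>              
>>>             # check out exactly 4 slots on one blade for this command
>>>             Job -title {lee run} -subtasks {
>>>                 Task -title {run} -cmds {
>>>                     RemoteCmd {my_run} -samehost 1 -atmost 4 -atleast 4 -service {compute1-4}
>>>                 }
>>>             }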
>>>              
>>>             I’m happy to help you guys make these scripts work if
>>>             and when the time comes.
>>>              
>>>             - chris
>>>
>>>             On Apr 20, 2015, at 5:28 PM, Lee Spector <lspector at hampshire.edu> wrote:
>>>
>>>                  
>>>                 This, in conjunction with our observations of weird
>>>                 multithreading performance pathologies on the JVM,
>>>                 makes me think that we SHOULDN'T change the slots
>>>                 per node from whatever we're currently doing, at
>>>                 least not universally or by default. Is there a way
>>>                 to make this work as it currently does most of the
>>>                 time, but for me to grab a machine in 1-slot mode
>>>                 when I want one?
>>>                  
>>>                 My use case is the exception, not the rule, and I
>>>                 usually need one or a small handful of these
>>>                 sessions going at a time (I run them individually
>>>                 and watch them). At the moment I'm just using my
>>>                 desktop Mac, and I think that actually runs my code
>>>                 faster than any fly node (although 1-4 seemed to
>>>                 perform well). I may want to run on a fly node or
>>>                 two at some point too, but I wouldn't want the
>>>                 performance of other work (especially Tom's many
>>>                 many large runs) to take a hit for this.
>>>                  
>>>                  -Lee
>>>                  
>>>                  
>>>
>>>                     On Apr 20, 2015, at 4:59 PM, Thomas Helmuth <thelmuth at cs.umass.edu> wrote:
>>>
>>>                     I agree with the others that it should be pretty
>>>                     easy to do everything you need. And I can walk
>>>                     you through how to easily do it in Clojush with
>>>                     the python script I have to launch runs.
>>>
>>>                     I'm not sure if I'd get more or less performance
>>>                     with one slot per node, but I'm guessing less --
>>>                     I often get multiple runs going on a node, so
>>>                     I'd need to get 4 or 8 times speedups from
>>>                     multi-threading in a single run to get the same
>>>                     amount of work done. But maybe this would just
>>>                     happen automatically with how we have it set up
>>>                     in Clojush.
>>>
>>>                     Tom
>>>
>>>                     On Mon, Apr 20, 2015 at 4:19 PM, Chris Perry <perry at hampshire.edu> wrote:
>>>
>>>
>>>                         I just wrote a post to the tractor developer
>>>                         forum asking about the one piece of this
>>>                         that I’m not 100% sure about, but if we go
>>>                         with Josiah’s recommendation of one slot per
>>>                         blade then my post will be irrelevant and
>>>                         YES we can do exactly what you want.
>>>
>>>                         And Josiah, I do think that you’re right
>>>                         about one slot per node, not only because of
>>>                         the great multithreading that everyone’s
>>>                         doing, but because our fileserver can’t keep
>>>                         up with too many concurrent reads/writes anyway.
>>>
>>>                         - chris
>>>
>>>                         On Apr 20, 2015, at 3:55 PM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:
>>>
>>>                         >
>>>                         >
>>>                         > On 4/20/15 3:43 PM, Lee Spector wrote:
>>>                         >> Maybe it's finally time for me to do this. As Tom points out, I should have a better handle on it before he leaves. But the runs that I do are a little different from what he usually does.
>>>                         >>
>>>                         >> Is it now possible to do this with all of these properties?
>>>                         >>
>>>                         >> - My run starts immediately.
>>>                         > Yes. If you spool with a high enough priority, it will actually terminate and re-spool other jobs that are in your way.
>>>                         >> - It runs on a node that I specify.
>>>                         > Yes. The service key would just be the node you want.
>>>                         >> - No other jobs run on that node until my run is done, even if mine temporarily goes to low CPU utilization.
>>>                         > We would have to make it so that your node had one slot, and then you did something like this if you had more than one command:
>>>                         > http://fly.hampshire.edu/docs/Tractor/scripting.html#sharing-one-server-check-out-among-several-commands
>>>                         > I think that most things that we are doing these days are multithreaded enough that one slot per node might be a valid choice cluster-wide... at least at the moment with Maxwell and prman. What say others?
>>>                         >> - I can see stderr as well as stdout, which I'll be piping to a file (which has to be somewhere I can grep it).
>>>                         > Yes. Just pipe each one to whatever file you want.
>>>                         >> - I can easily and instantly kill and restart my runs.
>>>                         > Yes. Right-click and restart, interrupt, or delete.
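>>>                         > For instance, a single command pinned to one node, with stdout and stderr piped to files, could look roughly like the line below. The command, node name, and file paths are placeholders, and I haven't tested this exact line:
>>>                         >
>>>                         > RemoteCmd {/bin/sh -c "my_run > /home/me/run.out 2> /home/me/run.err"} -service {compute1-4}
>>>                         >
>>>                         > The /bin/sh -c wrapper is there so that a shell, rather than tractor itself, handles the redirection.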
>>>                         >>
>>>                         >> If so then yes, I guess I should try to switch to a tractor-based workflow for my work on fly.
>>>                         >>
>>>                         >> -Lee
>>>                         >>
>>>                         >>
>>>                         >>> On Apr 20, 2015, at 3:26 PM, Chris Perry <perry at hampshire.edu> wrote:
>>>                         >>>
>>>                         >>> I have a different ideal solution to propose: use tractor!
>>>                         >>>
>>>                         >>> What you are looking to do with your NIMBY mechanism is to pull machines out of tractor temporarily so they can run your job(s). But this is exactly what tractor does: it gives a job to an idle machine, then keeps that machine from running new jobs until the first job is killed or completed. If you spool with a high enough priority, tractor kills and respools a lower-priority job so as to free up the machine, so you don’t have to wait if immediate access is a concern of yours.
>>>                         >>>
>>>                         >>> Not to mention, the tractor interface will show running jobs, so there would be a “Lee’s job” item in green on the list. That might make it less likely that you lose track of runs in the way you describe happening now.
>>>                         >>>
>>>                         >>> Worth considering? I’m sure there are a few kinks to figure out (such as how to tag your jobs so that they have the full machine, guaranteed), but I feel confident that we can do this.
>>>                         >>>
>>>                         >>> - chris
>>>                         >>>
>>>                         >>> On Apr 20, 2015, at 2:31 PM, Lee Spector <lspector at hampshire.edu> wrote:
>>>                         >>>
>>>                         >>>> Some of these were probably my doing, but I only recall nimbying 1-4 and 4-5 in the recent past.
>>>                         >>>>
>>>                         >>>> It's not a problem with a node that causes me to do this; it's an interest in having total control of a node for a compute-intensive multithreaded run, with no chance that any other processes will be allocated on it. Sometimes I'll start one of these, check in on it regularly for a while, and then check in less frequently if it's not doing anything really interesting but I'm not ready to kill it in case it might still do something good. Then sometimes I lose track. Right now I have nothing running, and any nimbying that I've done can be unnimbyed, although I'm not 100% sure what I may have left nimbyed.
>>>                         >>>>
>>>                         >>>> Ideally, I guess, we'd have a nimby command that records who nimbyed and maybe periodically asks them if they still want it nimbyed. I'm not sure how difficult that would be. If somebody is inspired to do this, it would also be nice if the command for nimbying/unnimbying were more straightforward than the current one (which takes a 1 or a 0 as an argument, which is a little confusing), if it returned a value that made the new status clearer, and if there were a simple way just to check the status (which maybe there is now? If so, I don't know it).
>>>                         >>>>
>>>                         >>>> -Lee
>>>                         >>>>
>>>                         >>>>
>>>                         >>>>> On Apr 20, 2015, at 1:36 PM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:
>>>                         >>>>>
>>>                         >>>>> Hi all,
>>>                         >>>>> Why are nodes 1-17, 1-18, 1-2, 1-4, and 1-9 NIMBYed? If you are having a problem with a node that causes you to need to NIMBY it, please let me know, because maybe it just means I screwed up the service keys or something. I'm not omniscient :)
>>>                         >>>>>
>>>
>>>
>>>
>>>                  
>>>
>>>
>>>
>>>          
>>>
>>>         --
>>>         Lee Spector, Professor of Computer Science
>>>         Director, Institute for Computational Intelligence
>>>         Cognitive Science, Hampshire College
>>>         893 West Street, Amherst, MA 01002-3359
>>>         lspector at hampshire.edu, http://hampshire.edu/lspector/
>>>         Phone: 413-559-5352, Fax: 413-559-5438
>>>
>>

-- 
Wm. Josiah Erikson
Assistant Director of IT, Infrastructure Group
System Administrator, School of CS
Hampshire College
Amherst, MA 01002
(413) 559-6091
