[Clusterusers] NIMBYed nodes

Wm. Josiah Erikson wjerikson at hampshire.edu
Tue Apr 21 09:19:59 EDT 2015


Nice! Did I miss that in the docs or is it not there? Thanks!

On April 20, 2015 6:47:00 PM EDT, Chris Perry <perry at hampshire.edu> wrote:
>
>My forum post resulted in a solution that could work for us if we keep
>the slots as they are. The RemoteCmd line to use is:
>
>RemoteCmd { command } -samehost 1 -atmost 4 -atleast 4 -service { blah }
>
>That will check out exactly 4 slots on a single node. So all Lee needs to
>know is how many slots are on the machine he wants, and we should be good
>to go using a RemoteCmd structure like this.
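>
>For example, a rough, untested sketch of a whole job wrapped around that
>line might look something like this (the Job/Task wrapper syntax is from
>memory, and the job title, the {1-4} service key, and the script path are
>placeholders -- check it against the Tractor scripting docs before using):
>
>    Job -title {lee-run} -subtasks {
>        Task {run} -cmds {
>            RemoteCmd {/path/to/my-run.sh} -samehost 1 -atmost 4 -atleast 4 -service {1-4}
>        }
>    }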
>
>I’m happy to help you guys make these scripts work if and when the time
>comes.
>
>- chris
>
>On Apr 20, 2015, at 5:28 PM, Lee Spector <lspector at hampshire.edu> wrote:
>
>> 
>> This, in conjunction with our observations of weird multithreading
>> performance pathologies on the JVM, makes me think that we SHOULDN'T
>> change the slots per node from whatever we're currently doing, at least
>> not universally or by default. Is there a way to make this work as it
>> currently does most of the time, but for me to grab a machine in 1-slot
>> mode when I want one?
>> 
>> My use case is the exception, not the rule, and I usually need one or a
>> small handful of these sessions going at a time (I run them individually
>> and watch them). At the moment I'm just using my desktop Mac, and I think
>> it actually runs my code faster than any fly node (although 1-4 seemed to
>> perform well). I may want to run on a fly node or two at some point too,
>> but I wouldn't want the performance of other work (especially Tom's many
>> many large runs) to take a hit for this.
>> 
>>  -Lee
>> 
>> 
>> 
>>> On Apr 20, 2015, at 4:59 PM, Thomas Helmuth <thelmuth at cs.umass.edu> wrote:
>>> 
>>> I agree with the others that it should be pretty easy to do everything
>>> you need. And I can walk you through how to easily do it in Clojush with
>>> the python script I have to launch runs.
>>> 
>>> I'm not sure if I'd get more or less performance with one slot per node,
>>> but I'm guessing less -- I often get multiple runs going on a node, so
>>> I'd need to get 4x or 8x speedups from multithreading in a single run to
>>> get the same amount of work done. But maybe this would just happen
>>> automatically with how we have it set up in Clojush.
>>> 
>>> Tom
>>> 
>>> On Mon, Apr 20, 2015 at 4:19 PM, Chris Perry <perry at hampshire.edu> wrote:
>>> 
>>> I just wrote a post to the tractor developer forum asking about the one
>>> piece of this that I’m not 100% sure about, but if we go with Josiah’s
>>> recommendation of one slot per blade then my post will be irrelevant and
>>> YES we can do exactly what you want.
>>> 
>>> And Josiah, I do think that you’re right about one slot per node, not
>>> only because of the great multithreading that everyone’s doing, but
>>> because our fileserver can’t keep up with too many concurrent
>>> reads/writes anyway.
>>> 
>>> - chris
>>> 
>>> On Apr 20, 2015, at 3:55 PM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:
>>> 
>>> >
>>> >
>>> > On 4/20/15 3:43 PM, Lee Spector wrote:
>>> >> Maybe it's finally time for me to do this. As Tom points out I should
>>> >> have a better handle on it before he leaves. But the runs that I do
>>> >> are a little different from what he usually does.
>>> >>
>>> >> Is it now possible to do this with all of these properties?:
>>> >>
>>> >> - My run starts immediately.
>>> > Yes. If you spool with a high enough priority, it will actually
>>> > terminate and re-spool other jobs that are in your way.
>>> >> - It runs on a node that I specify.
>>> > Yes. The service key would just be the node you want.
>>> >> - No other jobs run on that node until my run is done, even if mine
>>> >> temporarily goes to low cpu utilization.
>>> > We would have to make it so that your node had one slot, and then you
>>> > did something like this if you had more than one command:
>>> > http://fly.hampshire.edu/docs/Tractor/scripting.html#sharing-one-server-check-out-among-several-commands
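>>> > If I'm remembering the syntax on that page right, the idea is to give
>>> > the first command an -id and have the later commands point back at it
>>> > with -refersto so they all share the same slot check-out. A rough,
>>> > untested sketch (the step-*.sh names and the {1-4} service key are just
>>> > placeholders; check the exact option names against that doc page):
>>> >
>>> >     RemoteCmd {step-one.sh} -service {1-4} -samehost 1 -atleast 4 -atmost 4 -id {leeslots}
>>> >     RemoteCmd {step-two.sh} -refersto {leeslots}
>>> >     RemoteCmd {step-three.sh} -refersto {leeslots}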
>>> > I think that most things that we are doing these days are multithreaded
>>> > enough that one slot per node might be a valid choice cluster-wide... at
>>> > least at the moment with Maxwell and prman. What say others?
>>> >> - I can see stderr as well as stdout, which I'll be piping to a file
>>> >> (which has to be somewhere I can grep it).
>>> > Yes. Just pipe each one to whatever file you want.
>>> >> - I can easily and instantly kill and restart my runs.
>>> > Yes. Right-click and restart, interrupt or delete.
>>> >>
>>> >> If so then yes, I guess I should try to switch to a tractor-based
>>> >> workflow for my work on fly.
>>> >>
>>> >> -Lee
>>> >>
>>> >>
>>> >>> On Apr 20, 2015, at 3:26 PM, Chris Perry <perry at hampshire.edu> wrote:
>>> >>>
>>> >>>
>>> >>> I have a different ideal solution to propose: use tractor!
>>> >>>
>>> >>> What you are looking to do with your NIMBY mechanism is to pull
>>> >>> machines out of tractor temporarily so they can run your job(s). But
>>> >>> this is exactly what tractor does: it gives a job to an idle machine,
>>> >>> then keeps that machine from running new jobs until the first job is
>>> >>> killed or completed. If you spool with high enough priority, tractor
>>> >>> kills and respools a lower-priority job so as to free up the machine,
>>> >>> so you don’t have to wait if immediate access is a concern of yours.
>>> >>>
>>> >>> Not to mention, the tractor interface will show running jobs, so
>>> >>> there would be a “Lee’s job” item in green on the list. This might
>>> >>> make you less likely to lose track of a run the way you describe
>>> >>> happening now.
>>> >>>
>>> >>> Worth considering? I’m sure there are a few kinks to figure out (such
>>> >>> as how to tag your jobs so that they have the full machine,
>>> >>> guaranteed) but I feel confident that we can do this.
>>> >>>
>>> >>> - chris
>>> >>>
>>> >>> On Apr 20, 2015, at 2:31 PM, Lee Spector <lspector at hampshire.edu> wrote:
>>> >>>
>>> >>>> Some of these were probably my doing, but I only recall nimbying 1-4
>>> >>>> and 4-5 in the recent past.
>>> >>>>
>>> >>>> It's not a problem with a node that causes me to do this; it's an
>>> >>>> interest in having total control of a node for a compute-intensive
>>> >>>> multithreaded run, with no chance that any other processes will be
>>> >>>> allocated on it. Sometimes I'll start one of these, check in on it
>>> >>>> regularly for a while, and then check in less frequently if it's not
>>> >>>> doing anything really interesting but I'm not ready to kill it in
>>> >>>> case it might still do something good. Then sometimes I lose track.
>>> >>>> Right now I have nothing running, and any nimbying that I've done
>>> >>>> can be unnimbyed, although I'm not 100% sure what I may have left
>>> >>>> nimbyed.
>>> >>>>
>>> >>>> Ideally, I guess, we'd have a nimby command that records who nimbyed
>>> >>>> and maybe periodically asks them if they still want it nimbyed. I'm
>>> >>>> not sure how difficult that would be. If somebody is inspired to do
>>> >>>> this, then it would also be nice if the command for
>>> >>>> nimbying/unnimbying was more straightforward than the current one
>>> >>>> (which takes a 1 or a 0 as an argument, which is a little confusing),
>>> >>>> if it returned a value that made it clearer what the new status is,
>>> >>>> and if there was a simple way just to check the status (which maybe
>>> >>>> there is now? if so I don't know it).
>>> >>>>
>>> >>>> -Lee
>>> >>>>
>>> >>>>
>>> >>>>> On Apr 20, 2015, at 1:36 PM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:
>>> >>>>>
>>> >>>>> Hi all,
>>> >>>>> Why are nodes 1-17, 1-18, 1-2, 1-4, and 1-9 NIMBYed? If you are
>>> >>>>> having a problem with a node that causes you to need to NIMBY it,
>>> >>>>> please let me know, because maybe it just means I screwed up the
>>> >>>>> service keys or something. I'm not omniscient :)
>>> >>>>>
>>> >>>>> --
>>> >>>>> Wm. Josiah Erikson
>>> >>>>> Assistant Director of IT, Infrastructure Group
>>> >>>>> System Administrator, School of CS
>>> >>>>> Hampshire College
>>> >>>>> Amherst, MA 01002
>>> >>>>> (413) 559-6091
>>> >>>>>
>>> >>>> --
>>> >>>> Lee Spector, Professor of Computer Science
>>> >>>> Director, Institute for Computational Intelligence
>>> >>>> Cognitive Science, Hampshire College
>>> >>>> 893 West Street, Amherst, MA 01002-3359
>>> >>>> lspector at hampshire.edu, http://hampshire.edu/lspector/
>>> >>>> Phone: 413-559-5352, Fax: 413-559-5438
>>> >>>>
>>> >
>>> 
>> 
>
>
>

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.