[Clusterusers] NIMBYed nodes

Piper Odegard pto10 at hampshire.edu
Wed Apr 22 15:16:59 EDT 2015


 

Hey Josiah,

I just NIMBY'd compute 2-1 because it errored out 3 tasks in a row on
one of my jobs. In the future, is there a way to tell a job on the fly
not to use a particular machine for the rest of its run?

- Piper 

On 2015-04-20 18:59, Thomas Helmuth wrote: 

> Awesome, thanks Chris! I should be able to merge that idea into my current launching scripts to get Lee going, but we'll page you if we run into trouble.
> 
> Tom 
> 
> On Mon, Apr 20, 2015 at 6:57 PM, Lee Spector <lspector at hampshire.edu> wrote:
> 
> Thanks Chris! 
> 
> -Lee 
> 
> On Apr 20, 2015, at 6:47 PM, Chris Perry <perry at hampshire.edu> wrote: 
> 
> My forum post resulted in a solution that could work for us if we keep the slots as they are. The RemoteCmd line to use is: 
> 
> RemoteCmd { command } -samehost 1 -atmost 4 -atleast 4 -service { blah } 
> 
> That will check out exactly 4 slots on a single node. So all Lee needs is to know how many slots are on the machine he wants and we (should be) good to go using a RemoteCmd structure like this. 
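
For illustration, here is a minimal Python sketch of how a launcher script might emit a job script around that RemoteCmd line. The flags are the ones quoted above. The surrounding Job/Task wrapper, the function names, and the example command and service key are assumptions for the sake of the example, not something taken from this thread.

```python
def remote_cmd(command, service, slots):
    """Format a RemoteCmd line that checks out exactly `slots` slots on one host."""
    return ("RemoteCmd {%s} -samehost 1 -atmost %d -atleast %d -service {%s}"
            % (command, slots, slots, service))


def job_script(title, command, service, slots):
    """Wrap the RemoteCmd in a one-task job (the wrapper layout is an assumption)."""
    return "\n".join([
        "Job -title {%s} -subtasks {" % title,
        "    Task {%s} -cmds {" % title,
        "        " + remote_cmd(command, service, slots),
        "    }",
        "}",
    ])


if __name__ == "__main__":
    # Hypothetical command and service key, for illustration only.
    print(job_script("lee-run", "lein run my.namespace", "compute-1-4", 4))
```

Setting -atleast and -atmost to the same value is what pins the checkout to an exact slot count, so `slots` should match however many slots the target node is configured with.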
> 
> I'm happy to help you guys make these scripts work if and when the time comes. 
> 
> - chris 
> 
> On Apr 20, 2015, at 5:28 PM, Lee Spector <lspector at hampshire.edu> wrote: 
> 
> This, in conjunction with our observations of weird multithreading performance pathologies on the JVM, makes me think that we SHOULDN'T change the slots per node from whatever we're currently doing, at least not universally or by default. Is there a way to make this work as it currently does most of the time, but for me to grab a machine in 1-slot mode when I want one? 
> 
> My use case is the exception, not the rule, and I usually need one or a small handful of these sessions going at a time (I run them individually and watch them). At the moment I'm just using my desktop Mac, which I think actually runs my code faster than any fly node (although 1-4 seemed to perform well). I may want to run on a fly node or two at some point too, but I wouldn't want the performance of other work (especially Tom's many many large runs) to take a hit for this. 
> 
> -Lee 
> 
> On Apr 20, 2015, at 4:59 PM, Thomas Helmuth <thelmuth at cs.umass.edu> wrote: 
> 
> I agree with the others that it should be pretty easy to do everything you need. And I can walk you through how to easily do it in Clojush with the python script I have to launch runs.
> 
> I'm not sure if I'd get more or less performance with one slot per node, but I'm guessing less -- I often get multiple runs going on a node, so I'd need a 4x or 8x speedup from multithreading within a single run to get the same amount of work done. But maybe this would just happen automatically with how we have it set up in Clojush.
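
As a rough check on that arithmetic, here is a small Python sketch. It assumes, idealistically, that N single-threaded runs sharing a node each proceed at full speed with no contention; the function name is illustrative only.

```python
def relative_throughput(speedup, slots_per_node):
    """Work done by one multithreaded run with the given speedup, relative to
    `slots_per_node` independent single-threaded runs on the same node."""
    return speedup / slots_per_node


if __name__ == "__main__":
    for slots in (4, 8):
        for speedup in (2.0, 4.0, 8.0):
            print(slots, speedup, relative_throughput(speedup, slots))
```

A value below 1.0 means the node gets less total work done in one-slot mode than it does running `slots_per_node` separate runs.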
> 
> Tom 
> 
> On Mon, Apr 20, 2015 at 4:19 PM, Chris Perry <perry at hampshire.edu> wrote:
> 
> I just wrote a post to the tractor developer forum asking about the one piece of this that I'm not 100% sure about, but if we go with Josiah's recommendation of one slot per blade then my post will be irrelevant and YES we can do exactly what you want.
> 
> And Josiah, I do think that you're right about one slot per node, not only because of the great multithreading that everyone's doing, but because our fileserver can't keep up with too many concurrent reads/writes anyway.
> 
> - chris
> 
> On Apr 20, 2015, at 3:55 PM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:
> 
>>
>>
>> On 4/20/15 3:43 PM, Lee Spector wrote:
>>> Maybe it's finally time for me to do this. As Tom points out I should have a better handle on it before he leaves. But the runs that I do are a little different from what he usually does.
>>>
>>> Is it now possible to do this with all of these properties?:
>>>
>>> - My run starts immediately.
>> Yes. If you spool with a high enough priority, it will actually
>> terminate and re-spool other jobs that are in your way.
>>> - It runs on a node that I specify.
>> Yes. The service key would just be the node you want.
>>> - No other jobs run on that node until my run is done, even if mine temporarily goes to low cpu utilization.
>> We would have to make it so that your node had one slot, and then you
>> did something like this if you had more than one command:
>> http://fly.hampshire.edu/docs/Tractor/scripting.html#sharing-one-server-check-out-among-several-commands [1].
>> I think that most things that we are doing these days are multithreaded
>> enough that one slot per node might be a valid choice cluster-wide... at
>> least at the moment with Maxwell and prman. What say others?
>>> - I can see stderr as well as std out, which I'll be piping to a file (which has to be somewhere I can grep it).
>> Yes. Just redirect each one to whatever file you want.
>>> - I can easily and instantly kill and restart my runs.
>> Yes. Right-click and restart, interrupt or delete.
>>>
>>> If so then yes, I guess I should try to switch to a tractor-based workflow for my work on fly.
>>>
>>> -Lee
>>>
>>>
>>>> On Apr 20, 2015, at 3:26 PM, Chris Perry <perry at hampshire.edu> wrote:
>>>>
>>>>
>>>> I have a different ideal solution to propose: use tractor!
>>>>
>>>> What you are looking to do with your NIMBY mechanism is to pull machines out of tractor temporarily so they can run your job(s). But this is exactly what tractor does: it gives a job to an idle machine then keeps that machine from running new jobs until the first job is killed or completed. If you spool with high enough priority, tractor kills and respools a lower priority job so as to free up the machine, so you don't have to wait if immediate access is a concern of yours.
>>>>
>>>> Not to mention, the tractor interface will show running jobs so there would be a "Lee's job" item in green on the list. This might make you less likely to lose track in the way you describe happening now.
>>>>
>>>> Worth considering? I'm sure there are a few kinks to figure out (such as how to tag your jobs so that they have the full machine, guaranteed) but I feel confident that we can do this.
>>>>
>>>> - chris
>>>>
>>>> On Apr 20, 2015, at 2:31 PM, Lee Spector <lspector at hampshire.edu> wrote:
>>>>
>>>>> Some of these were probably my doing, but I only recall nimbying 1-4 and 4-5 in the recent past.
>>>>>
>>>>> It's not a problem with a node that causes me to do this, it's an interest in having total control of a node for a compute-intensive multithreaded run, with no chance that any other processes will be allocated on it. Sometimes I'll start one of these, check in on it regularly for a while, and then check in less frequently if it's not doing anything really interesting but I'm not ready to kill it in case it might still do something good. Then sometimes I lose track. Right now I have nothing running, and any nimbying that I've done can be unnimbyed, although I'm not 100% sure what I may have left nimbyed.
>>>>>
>>>>> Ideally, I guess, we'd have a nimby command that records who nimbyed and maybe periodically asks them if they still want it nimbyed. I'm not sure how difficult that would be. If somebody is inspired to do this, then it would also be nice if the command for nimbying/unnimbying was more straightforward than the current one (which takes a 1 or a 0 as an argument, which is a little confusing), if it returned a value that made it more clear what the new status is, and if there was a simple way just to check the status (which maybe there is now? if so I don't know it).
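
One way the bookkeeping side of such a command could look, as a hedged Python sketch: it only records who NIMBYed which node and when, in a JSON file, and returns a readable status string. The class name, file layout, and node names are hypothetical, and the actual call that NIMBYs or un-NIMBYs a node is deliberately left out because the thread does not show it.

```python
import getpass
import json
import time
from pathlib import Path


class NimbyRegistry:
    """Record who NIMBYed which node and when (a sketch, not the cluster's tool)."""

    def __init__(self, path):
        self.path = Path(path)

    def _load(self):
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {}

    def _save(self, data):
        self.path.write_text(json.dumps(data, indent=2))

    def on(self, node, user=None):
        data = self._load()
        data[node] = {"user": user or getpass.getuser(), "since": time.time()}
        self._save(data)
        # The real command that NIMBYs the node would be invoked here (not shown).
        return self.status(node)

    def off(self, node):
        data = self._load()
        data.pop(node, None)
        self._save(data)
        # The real command that un-NIMBYs the node would be invoked here (not shown).
        return self.status(node)

    def status(self, node):
        entry = self._load().get(node)
        if entry is None:
            return "%s: not NIMBYed" % node
        return "%s: NIMBYed by %s" % (node, entry["user"])

    def stale(self, max_age_seconds, now=None):
        """Nodes NIMBYed for longer than max_age_seconds: candidates for a reminder."""
        now = time.time() if now is None else now
        return sorted(n for n, e in self._load().items()
                      if now - e["since"] > max_age_seconds)
```

Separate `on`, `off`, and `status` calls avoid the 1-or-0 argument, each returns the new status, and `stale` gives a periodic job something to ask owners about.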
>>>>>
>>>>> -Lee
>>>>>
>>>>>
>>>>>> On Apr 20, 2015, at 1:36 PM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>> Why are nodes 1-17, 1-18, 1-2, 1-4, and 1-9 NIMBYed? If you are
>>>>>> having a problem with a node that causes you to need to NIMBY it, please
>>>>>> let me know, because maybe it just means I screwed up the service keys
>>>>>> or something. I'm not omniscient :)
>>>>>>
>>>>>> --
>>>>>> Wm. Josiah Erikson
>>>>>> Assistant Director of IT, Infrastructure Group
>>>>>> System Administrator, School of CS
>>>>>> Hampshire College
>>>>>> Amherst, MA 01002
>>>>>> (413) 559-6091 [2]
>>>>>>
>>>>>> _______________________________________________
>>>>>> Clusterusers mailing list
>>>>>> Clusterusers at lists.hampshire.edu
>>>>>> https://lists.hampshire.edu/mailman/listinfo/clusterusers [3]
>>>>> --
>>>>> Lee Spector, Professor of Computer Science
>>>>> Director, Institute for Computational Intelligence
>>>>> Cognitive Science, Hampshire College
>>>>> 893 West Street, Amherst, MA 01002-3359
>>>>> lspector at hampshire.edu, http://hampshire.edu/lspector/ [4]
>>>>> Phone: 413-559-5352 [5], Fax: 413-559-5438 [6]


Links:
------
[1] http://fly.hampshire.edu/docs/Tractor/scripting.html#sharing-one-server-check-out-among-several-commands
[2] tel:%28413%29%20559-6091
[3] https://lists.hampshire.edu/mailman/listinfo/clusterusers
[4] http://hampshire.edu/lspector/
[5] tel:413-559-5352
[6] tel:413-559-5438