[Clusterusers] breve segmentation faults on fly
Lee Spector
lspector at hampshire.edu
Tue Oct 30 13:43:25 EDT 2007
No problem -- I've just killed my run. Let me know when I can start a
new one.
-Lee
On Oct 30, 2007, at 11:36 AM, Chris Perry wrote:
>
> Lee -
>
> We need to upgrade some RenderMan software which will require
> rebooting/re-imaging the cluster nodes. It's something we would like
> to do soon, but I see that you've got a run going at the moment. Do
> you see a point in the near future when we might be able to take
> things down for a bit?
>
> Thanks!
>
> - chris
>
> Quoting Lee Spector <lspector at hampshire.edu>:
>
>>
>> Ah -- right. Yes, that gives me a pretty good idea, and then looking
>> directly at the individual nodes in ganglia I can actually see
>> exactly where the run starts and where the CPU stops. And.... nothing
>> to see there, really. I've got one segfault in my current run which
>> started around 14:00 yesterday, and I can see all of the nodes
>> ramping up but then compute-1-9, without any other apparent surges of
>> activity, drops off around 20:00.
>>
>> -Lee
>>
>> On Oct 29, 2007, at 9:42 PM, Chris Perry wrote:
>>
>>>
>>> Maybe your log/output files for those runs stop when they segfault?
>>> That might be a timestamp you can use.
>>>
>>> - chris
>>>
>>> On Oct 29, 2007, at 9:19 PM, Lee Spector wrote:
>>>
>>>>
>>>> Interesting idea but not trivial to tell, since the seg fault
>>>> messages aren't time stamped and I'm not sure exactly when they
>>>> happened.... but this might be worth looking into more carefully...
>>>>
>>>> -Lee
>>>>
>>>> On Oct 29, 2007, at 9:06 PM, Chris Perry wrote:
>>>>
>>>>>
>>>>> I know we've been rendering fairly hard recently - do the graphs
>>>>> on ganglia show a correlation between your seg fault times/
>>>>> machines and a lack of RAM on those machines? I would expect a
>>>>> different error than seg fault in that case, but it's worth
>>>>> seeing if there's a possible connection.
>>>>>
>>>>> - chris
>>>>>
>>>>> On Oct 29, 2007, at 2:09 PM, Lee Spector wrote:
>>>>>
>>>>>>
>>>>>> Great -- I didn't see any core files, but where would I find
>>>>>> them from a run like this?
>>>>>>
>>>>>> -Lee
>>>>>>
>>>>>> On Oct 29, 2007, at 1:55 PM, jon klein wrote:
>>>>>>
>>>>>>>
>>>>>>> I can't really say why that build of breve is crashing for you,
>>>>>>> but it is pretty old at this point and due to be updated soon.
>>>>>>> The latest builds do have some potential memory savings, so it
>>>>>>> could help out. I'll make a new build soon and let you know
>>>>>>> when it's installed.
>>>>>>>
>>>>>>> Did the crashes leave any core files behind? That would be the
>>>>>>> only way to get a hint of why they might have crashed.
>>>>>>>
>>>>>>> -- jon klein
>>>>>>>
>>>>>>>
>>>>>>> On Oct 29, 2007, at 12:41 PM, Lee Spector wrote:
>>>>>>>
>>>>>>>> On Oct 29, 2007, at 11:24 AM, Wm. Josiah Erikson wrote:
>>>>>>>>> Hum. Well, uh, unless you tell me something that seems like
>>>>>>>>> evidence to the contrary, I'll assume the ball is in some
>>>>>>>>> court other than mine for the time being?
>>>>>>>>
>>>>>>>> Josiah: I guess so. The OS memory limit seemed like such a
>>>>>>>> good theory that I'm having a hard time letting go of it :-),
>>>>>>>> but if the system is back to where it was previously then I
>>>>>>>> don't see what could be causing this on your end.
>>>>>>>>
>>>>>>>> Jon: I'm using /share/apps/breve/dev/bin/breve_cli, which
>>>>>>>> looks to be a version that has been there since May:
>>>>>>>>
>>>>>>>> $ ls -l /share/apps/breve/dev/bin/breve_cli
>>>>>>>> -rwxr-xr-x 1 1000 1000 305 May 26 07:22 /share/apps/breve/dev/bin/breve_cli
>>>>>>>>
>>>>>>>> Might that have something to do with this, and should it be
>>>>>>>> updated?
>>>>>>>>
>>>>>>>> I can kill my current run at any point if I should do that for
>>>>>>>> upgrading breve.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> -Lee
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Lee Spector, Professor of Computer Science
>>>>>> School of Cognitive Science, Hampshire College
>>>>>> 893 West Street, Amherst, MA 01002-3359
>>>>>> lspector at hampshire.edu, http://hampshire.edu/lspector/
>>>>>> Phone: 413-559-5352, Fax: 413-559-5438
>>>>>>
>>>>>> _______________________________________________
>>>>>> Clusterusers mailing list
>>>>>> Clusterusers at lists.hampshire.edu
>>>>>> http://lists.hampshire.edu/mailman/listinfo/clusterusers
--
Lee Spector, Professor of Computer Science
School of Cognitive Science, Hampshire College
893 West Street, Amherst, MA 01002-3359
lspector at hampshire.edu, http://hampshire.edu/lspector/
Phone: 413-559-5352, Fax: 413-559-5438
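
The suggestion quoted above, using the point where each run's log/output file stops as a stand-in for the missing segfault timestamp, could be sketched roughly as follows. This is only an illustration, not anything that was run on the cluster: the function names, the "*.log" pattern, and the idea that each node writes its own output file are all assumptions.

```python
import time
from pathlib import Path


def last_write_times(log_dir, pattern="*.log"):
    """Return (path, mtime) pairs for matching files, oldest write first.

    The file that stopped being written earliest is a candidate for the
    run that died first. The "*.log" pattern is a hypothetical default.
    """
    entries = [
        (p, p.stat().st_mtime)
        for p in Path(log_dir).glob(pattern)
        if p.is_file()
    ]
    return sorted(entries, key=lambda entry: entry[1])


def format_report(entries):
    """Render each entry as 'YYYY-MM-DD HH:MM:SS  filename' in local time."""
    return [
        f"{time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(mtime))}  {path.name}"
        for path, mtime in entries
    ]
```

Printing format_report(last_write_times(some_output_dir)) gives one line per file. A file whose last write is well before the others would be the one to line up against the per-node ganglia graphs, bearing in mind that a run can stop writing output for reasons other than a crash.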