[Clusterusers] fly's hardware problem

Tue Dec 22 11:25:22 EST 2015

Following up on Tom's comment, I'm not actually sure what runs we might decide we need in the run-up to our January deadlines (for GECCO). We have a meeting tomorrow morning, when maybe more will become clear.

In any event I agree with Jaime in being really grateful for the time you're dedicating to this, and I'm sure we'll manage one way or another... but it'd be great to have the system up if possible.

 -Lee

> On Dec 22, 2015, at 11:12 AM, Bassam Kurdali <bassam at urchn.org> wrote:
> 
> I'm fine to wait till Jan too. I was more thinking of pushing some renders since other people might not be using the cluster over the hols, not for a specific deadline.
> cheers
> B
> 
> On Tue, 2015-12-22 at 10:21 -0500, Thomas Helmuth wrote:
>> Hi Josiah,
>> 
>> As far as I know, neither Lee nor I plan on using the cluster for GP runs over the break. On the other hand, we are working on some papers with deadlines in late January that have a bunch of data on fly that we'll need. Will we not even be able to access that data? That could be a problem.
>> 
>> Basically, I don't think we'll need tractor or access to the nodes, but it would be good to be able to access and gather data. If that's not possible, it's not a huge deal since we have some initial data, but we'd need access to it ASAP after break.
>> 
>> Thanks!
>> Tom
>> 
>> On Tue, Dec 22, 2015 at 10:13 AM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:
>>> Hey folks,
>>>     I'm trying to get an idea of how much people were planning on using
>>> the cluster over the holidays, i.e. how much effort I should put into
>>> this. Are there folks who were planning to do important work over the
>>> break whose plans will be totally screwed up if it's not back until
>>> after the new year?
>>>     I will come in tomorrow and probably be able to get it back up
>>> anyway, just want to know what kind of priority to give it in case of
>>> mishap.
>>>     Thanks and happy holidays!
>>>     -Josiah
>>> 
>>> 
>>> On 12/22/15, 8:47 AM, Wm. Josiah Erikson wrote:
>>> > I got some time this morning and came in on a vacation day...
>>> >
>>> > After some mucking around with BIOS and CPUs and memory config, things
>>> > deteriorated quickly...
>>> >
>>> > Well, the machine won't even POST now, so I'll have to move everything
>>> > over to a new machine and get a new license file from Pixar. I'll try to
>>> > get this done tomorrow morning, but this may mean that the cluster will
>>> > be down until January if everything doesn't go smoothly.
>>> >
>>> > Apologies!
>>> >     -Josiah
>>> >
>>> >
>>> > On 12/21/15 9:28 PM, Wm. Josiah Erikson wrote:
>>> >> ...or not. It appears to be happening again. I'm out tomorrow, but I
>>> >> should be in briefly on Wednesday and will make another attempt.
>>> >>     -Josiah
>>> >>
>>> >>
>>> >> On 12/21/15, 11:04 AM, Wm. Josiah Erikson wrote:
>>> >>> One of the power supplies was probably bad (dim/flickering lights, and
>>> >>> Google searches had other people experiencing this problem with a bad
>>> >>> power supply). I replaced it with a spare I had sitting around (thanks,
>>> >>> Mt. Holyoke!) and hopefully that will be the end of that. Of course, it
>>> >>> could be something else too :)
>>> >>> For now, fly's back up. Hopefully for good, but no guarantees.
>>> >>>     -Josiah
>>> >>>
>>> >>>
>>> >>> On 12/21/15 10:35 AM, Wm. Josiah Erikson wrote:
>>> >>>> Well, it just kernel panicked. Neat. That's new. So it's down now until
>>> >>>> I get it back up :) I'll keep you updated...
>>> >>>>     -Josiah
>>> >>>>
>>> >>>>
>>> >>>> On 12/21/15 10:08 AM, Wm. Josiah Erikson wrote:
>>> >>>>> Hello all,
>>> >>>>>     I rebooted fly again - it is not fixed yet, though it is up and
>>> >>>>> running currently. I will probably take it down later today to try
>>> >>>>> removing one of the CPUs to see if it is the problem. This should
>>> >>>>> result in less than 10 minutes of downtime, and jobs should resume where
>>> >>>>> they left off, so no need to hold off launching renders or whatever. My
>>> >>>>> current theory is that CPU #2 is faulty. It could also be a bad power
>>> >>>>> supply. I'll keep you updated.
>>> >>>>>
>>> 
>>> --
>>> -----
>>> Wm. Josiah Erikson
>>> Head, Systems and Networking
>>> Hampshire College
>>> Amherst, MA 01002
>>> 
>>> _______________________________________________
>>> Clusterusers mailing list
>>> Clusterusers at lists.hampshire.edu
>>> https://lists.hampshire.edu/mailman/listinfo/clusterusers
>>> 
>> 

--
Lee Spector, Professor of Computer Science
Acting Dean, School of Cognitive Science
Director, Institute for Computational Intelligence
Hampshire College, Amherst, Massachusetts, USA
lspector at hampshire.edu, http://hampshire.edu/lspector/, 413-559-5352
