[Clusterusers] Switching to new keg
Thomas Helmuth
thelmuth at cs.umass.edu
Mon Jun 22 16:51:59 EDT 2015
Go for it.
On Mon, Jun 22, 2015 at 4:38 PM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:
> I've got a couple of nodes (compute-1-4 and compute-2-17) that won't
> remount /helga. It looks, Tom, like your processes are still running and
> won't let go of /helga because they were in fact writing logs to it, which
> I did not anticipate. If I can reboot those nodes, we should be all set:
>
> [root at fly install]# ssh compute-1-4
> Rocks Compute Node
> Rocks 6.1.1 (Sand Boa)
> Profile built 09:02 08-Jun-2015
>
> Kickstarted 09:09 08-Jun-2015
> [root at compute-1-4 ~]# ls /helga
> ls: cannot access /helga: Stale file handle
> [root at compute-1-4 ~]# umount -f /helga
> umount2: Device or resource busy
> umount.nfs: /helga: device is busy
> umount2: Device or resource busy
> umount.nfs: /helga: device is busy
> [root at compute-1-4 ~]# ps aux | grep max
> root 723 0.0 0.0 105312 864 pts/0 S+ 16:31 0:00 grep max
> root 3717 0.1 0.0 349216 13068 ? Sl Jun08 38:24 /opt/pixar/tractor-blade-1.7.2/python/bin/python-bin /opt/pixar/tractor-blade-1.7.2/blade-modules/tractor-blade.py --dyld_framework_path=None --ld_library_path=/opt/pixar/tractor-blade-1.7.2/python/lib:/opt/gridengine/lib/linux-x64:/opt/openmpi/lib:/opt/python/lib:/share/apps/maxwell-3.1:/share/apps/maxwell64-3.1 --debug --log /var/log/tractor-blade.log -P 8000
> [root at compute-1-4 ~]# ps aux | grep Max
> root 725 0.0 0.0 105312 864 pts/0 S+ 16:32 0:00 grep Max
> [root at compute-1-4 ~]# lsof | grep /helga
> lsof: WARNING: can't stat() nfs4 file system /helga
> Output information may be incomplete.
> python-bi 3717 root 5w unknown /helga/public_html/tractor/cmd-logs/thelmuth/J1506210001/T2.log (stat: Stale file handle)
> python-bi 3717 root 8w unknown /helga/public_html/tractor/cmd-logs/thelmuth/J1506210001/T9.log (stat: Stale file handle)
> python-bi 3717 root 12w unknown /helga/public_html/tractor/cmd-logs/thelmuth/J1506210001/T14.log (stat: Stale file handle)
> python-bi 3717 root 16w unknown /helga/public_html/tractor/cmd-logs/thelmuth/J1506210001/T24.log (stat: Stale file handle)
> [root at compute-1-4 ~]# top
>
> top - 16:32:23 up 14 days, 7:21, 1 user, load average: 33.77, 35.33, 35.73
> Tasks: 230 total, 1 running, 229 sleeping, 0 stopped, 0 zombie
> Cpu(s): 50.2%us, 0.2%sy, 0.0%ni, 49.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 15928760k total, 8015840k used, 7912920k free, 238044k buffers
> Swap: 2047992k total, 0k used, 2047992k free, 1094888k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 8428 thelmuth 20 0 6089m 1.5g 11m S 254.9 10.1 2809:37 java
> 8499 thelmuth 20 0 6089m 1.5g 11m S 243.6 10.1 2829:15 java
> 8367 thelmuth 20 0 6089m 1.5g 11m S 236.0 10.0 2829:42 java
> 8397 thelmuth 20 0 6089m 1.5g 11m S 28.3 10.1 2787:46 java
> 737 root 20 0 15036 1220 840 R 1.9 0.0 0:00.01 top
> 1536 root 20 0 0 0 0 S 1.9 0.0 10:24.07 kondemand/6
> 1 root 20 0 19344 1552 1232 S 0.0 0.0 0:00.46 init
> 2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd
> 3 root RT 0 0 0 0 S 0.0 0.0 0:00.53 migration/0
> 4 root 20 0 0 0 0 S 0.0 0.0 0:02.85 ksoftirqd/0
> 5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
> 6 root RT 0 0 0 0 S 0.0 0.0 0:01.50 watchdog/0
> 7 root RT 0 0 0 0 S 0.0 0.0 0:00.49 migration/1
> 8 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/1
> 9 root 20 0 0 0 0 S 0.0 0.0 0:01.85 ksoftirqd/1
> 10 root RT 0 0 0 0 S 0.0 0.0 0:01.21 watchdog/1
> 11 root RT 0 0 0 0 S 0.0 0.0 0:00.50 migration/2
> 12 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
> 13 root 20 0 0 0 0 S 0.0 0.0 0:02.12 ksoftirqd/2
> 14 root RT 0 0 0 0 S 0.0 0.0 0:01.06 watchdog/2
> 15 root RT 0 0 0 0 S 0.0 0.0 0:01.81 migration/3
> 16 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
> 17 root 20 0 0 0 0 S 0.0 0.0 0:00.50 ksoftirqd/3
> 18 root RT 0 0 0 0 S 0.0 0.0 0:00.87 watchdog/3
> 19 root RT 0 0 0 0 S 0.0 0.0 0:01.76 migration/4
> 20 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/4
> 21 root 20 0 0 0 0 S 0.0 0.0 0:04.57 ksoftirqd/4
> 22 root RT 0 0 0 0 S 0.0 0.0 0:01.03 watchdog/4
> 23 root RT 0 0 0 0 S 0.0 0.0 0:00.51 migration/5
> 24 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/5
> 25 root 20 0 0 0 0 S 0.0 0.0 0:02.27 ksoftirqd/5
> [root at compute-1-4 ~]# umount -f /helga
> umount2: Device or resource busy
> umount.nfs: /helga: device is busy
> umount2: Device or resource busy
> umount.nfs: /helga: device is busy
> [root at compute-1-4 ~]# umount -f /helga
> umount2: Device or resource busy
> umount.nfs: /helga: device is busy
> umount2: Device or resource busy
> umount.nfs: /helga: device is busy
> [root at compute-1-4 ~]# mount -o remount /helga
> mount.nfs: Stale file handle
> [root at compute-1-4 ~]# mount -o remount /helga
> mount.nfs: Stale file handle
> [root at compute-1-4 ~]# top
>
> top - 16:33:19 up 14 days, 7:22, 1 user, load average: 36.17, 35.76, 35.86
> Tasks: 229 total, 1 running, 228 sleeping, 0 stopped, 0 zombie
> Cpu(s): 99.6%us, 0.4%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 15928760k total, 8016408k used, 7912352k free, 238044k buffers
> Swap: 2047992k total, 0k used, 2047992k free, 1094888k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 8499 thelmuth 20 0 6089m 1.5g 11m S 209.2 10.1 2831:08 java
> 8397 thelmuth 20 0 6089m 1.5g 11m S 198.9 10.1 2789:37 java
> 8367 thelmuth 20 0 6089m 1.5g 11m S 196.2 10.0 2831:37 java
> 8428 thelmuth 20 0 6089m 1.5g 11m S 192.9 10.1 2811:28 java
> 764 root 20 0 15164 1352 952 R 0.7 0.0 0:00.03 top
> 1532 root 20 0 0 0 0 S 0.3 0.0 20:31.35 kondemand/2
> 1534 root 20 0 0 0 0 S 0.3 0.0 12:32.23 kondemand/4
> 1536 root 20 0 0 0 0 S 0.3 0.0 10:24.14 kondemand/6
> 1537 root 20 0 0 0 0 S 0.3 0.0 17:00.03 kondemand/7
> 3717 root 20 0 341m 12m 3000 S 0.3 0.1 38:24.92 python-bin
> 1 root 20 0 19344 1552 1232 S 0.0 0.0 0:00.46 init
> 2 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kthreadd
> 3 root RT 0 0 0 0 S 0.0 0.0 0:00.53 migration/0
> 4 root 20 0 0 0 0 S 0.0 0.0 0:02.85 ksoftirqd/0
> 5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
> 6 root RT 0 0 0 0 S 0.0 0.0 0:01.50 watchdog/0
> 7 root RT 0 0 0 0 S 0.0 0.0 0:00.49 migration/1
> 8 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/1
> 9 root 20 0 0 0 0 S 0.0 0.0 0:01.85 ksoftirqd/1
> 10 root RT 0 0 0 0 S 0.0 0.0 0:01.21 watchdog/1
> 11 root RT 0 0 0 0 S 0.0 0.0 0:00.50 migration/2
> 12 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
> 13 root 20 0 0 0 0 S 0.0 0.0 0:02.12 ksoftirqd/2
> 14 root RT 0 0 0 0 S 0.0 0.0 0:01.06 watchdog/2
> 15 root RT 0 0 0 0 S 0.0 0.0 0:01.81 migration/3
> 16 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
> 17 root 20 0 0 0 0 S 0.0 0.0 0:00.50 ksoftirqd/3
> 18 root RT 0 0 0 0 S 0.0 0.0 0:00.87 watchdog/3
> 19 root RT 0 0 0 0 S 0.0 0.0 0:01.76 migration/4
> 20 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/4
> 21 root 20 0 0 0 0 S 0.0 0.0 0:04.57 ksoftirqd/4
> [root at compute-1-4 ~]# umount -f /helga
> umount2: Device or resource busy
> umount.nfs: /helga: device is busy
> umount2: Device or resource busy
> umount.nfs: /helga: device is busy
> [root at compute-1-4 ~]# mount /helga
> mount.nfs: Stale file handle
> [root at compute-1-4 ~]# logout
> Connection to compute-1-4 closed.
> [root at fly install]# ssh compute-2-17
> Rocks Compute Node
> Rocks 6.1.1 (Sand Boa)
> Profile built 23:42 25-Dec-2014
>
> Kickstarted 23:58 25-Dec-2014
> [root at compute-2-17 ~]# ls /helga
> ls: cannot access /helga: Stale file handle
> [root at compute-2-17 ~]# umount -f /helga
> umount2: Device or resource busy
> umount.nfs: /helga: device is busy
> umount2: Device or resource busy
> umount.nfs: /helga: device is busy
> [root at compute-2-17 ~]# lsof | grep /helga
> lsof: WARNING: can't stat() nfs4 file system /helga
> Output information may be incomplete.
> python-bi 28615 root 8w unknown /helga/public_html/tractor/cmd-logs/hz12/J1504210040/T3.log (stat: Stale file handle)
> [root at compute-2-17 ~]#
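>
> (For the record, a rough non-reboot alternative, sketched here but not actually run: the lsof output above shows the tractor-blade process, PID 3717 on compute-1-4 and 28615 on compute-2-17, holding log files under /helga, so in principle one could stop that process, lazily detach the stale mount, and remount. Treat the following as an untested sketch; the PID is taken from the output above.)
>
> fuser -vm /helga    # list processes holding files under the mount (may be incomplete on a stale NFS mount)
> kill 3717           # the tractor-blade process shown by lsof above (restart tractor-blade afterward)
> umount -l /helga    # lazy unmount: detach the stale mount even with lingering references
> mount /helga        # then mount it fresh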
>
>
>
> On 6/22/15 4:33 PM, Wm. Josiah Erikson wrote:
>
> Uh... shouldn't have been. I wonder if it was because I yanked its logfile
> directory out from under it? I did not expect that to happen! It's back now,
> but it looks like all of your runs errored out. I'm so sorry!
> -Josiah
>
>
> On 6/22/15 4:00 PM, Thomas Helmuth wrote:
>
> Very cool!
>
> It looks like tractor is down. Is that related to this move?
>
> Tom
>
> On Mon, Jun 22, 2015 at 1:50 PM, Wm. Josiah Erikson <wjerikson at hampshire.edu> wrote:
>
>> Keg will be going down shortly, and coming back up soon after as a harder,
>> faster, better, stronger version, I hope :)
>> -Josiah
>>
>>
>> On 6/11/15 9:53 AM, Wm. Josiah Erikson wrote:
>> > Hi all,
>> > I'm pretty sure that everyone who cares about keg also cares about
>> > fly at this point (though there are some people who care about fly and
>> > not so much keg, like Lee and Tom - sorry), so I'm sending this to this
>> > list.
>> > I have a new keg all ready to go: a 32TB 14+2 RAID6 with 24GB of
>> > RAM, and it's superfast (way faster than gigabit)! I would like to bring
>> > it up on Monday, June 22nd. It would be ideal if rendering were NOT
>> > happening at that time, to make my rsyncing life easier :) Any objections?
>> > -Josiah
>> >
>>
>> --
>> Wm. Josiah Erikson
>> Assistant Director of IT, Infrastructure Group
>> System Administrator, School of CS
>> Hampshire College
>> Amherst, MA 01002
>> (413) 559-6091
>>
>>
>
>
>
>
> --
> Wm. Josiah Erikson
> Assistant Director of IT, Infrastructure Group
> System Administrator, School of CS
> Hampshire College
> Amherst, MA 01002
> (413) 559-6091
>
>
>
>
> --
> Wm. Josiah Erikson
> Assistant Director of IT, Infrastructure Group
> System Administrator, School of CS
> Hampshire College
> Amherst, MA 01002
> (413) 559-6091
>
>
> _______________________________________________
> Clusterusers mailing list
> Clusterusers at lists.hampshire.edu
> https://lists.hampshire.edu/mailman/listinfo/clusterusers
>
>