Search Results

Search found 4 results on 1 page for 'ajdecon'.


  • Parallel prologue and epilogue in Grid Engine

    - by ajdecon
    We have a cluster being used to run MPI jobs for a customer. Previously this cluster used Torque as the scheduler, but we are transitioning to Grid Engine 6.2u5 (for some other features). Unfortunately, we are having trouble duplicating some of our maintenance scripts in the Grid Engine environment.

    In Torque, we have a prologue.parallel script which carries out an automated health check on each node. If this script returns a fail condition, Torque helpfully offlines the node and re-queues the job to use a different group of nodes.

    In Grid Engine, however, the queue "prolog" only runs on the head node of the job. We can run our prologue script manually from the startmpi.sh initialization script for the mpi parallel environment, but I can't figure out how to detect a fail condition and carry out the same "mark offline and requeue" procedure. Any suggestions?

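    One possible direction, sketched below rather than verified: sge_conf(5) documents that a prolog script exiting with status 99 asks Grid Engine to re-queue the job, and qmod -d can take a queue instance offline. The health-check path, the queue name "all.q", and the idea of invoking this from the PE's startmpi.sh on every node in the pe_hostfile are assumptions, not a tested recipe.

        #!/bin/sh
        # Hedged sketch of a per-node health-check hook for Grid Engine.
        # /opt/cluster/sbin/node_healthcheck.sh and the queue name "all.q"
        # are placeholders for site-specific pieces.
        HOST=$(hostname -s)

        if ! /opt/cluster/sbin/node_healthcheck.sh; then
            # Take this queue instance offline so no new jobs land here.
            qmod -d "all.q@${HOST}"

            # Per sge_conf(5), a prolog exit status of 99 tells Grid Engine
            # to re-queue the job instead of failing it.
            exit 99
        fi

        exit 0

    Whether the exit-99 convention is honored when the script is run from start_proc_args rather than from the queue prolog is exactly the open question here, so treat this as a starting point.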

  • NFS denies mount, even though the client is listed in exports

    - by ajdecon
    We have a couple of servers (part of an HPC cluster) which are currently showing some NFS behavior that is not making sense to me. node1 exports its /lscratch directory via NFS to node2, where it is mounted at /scratch/node1. node2 likewise exports its own /lscratch, which is correspondingly mounted at /scratch/node2 on node1.

    Unfortunately, whenever I attempt to mount either NFS export on the opposite node, I get the following error:

        mount: node1:/lscratch failed, reason given by server: Permission denied

    This is despite the fact that I have included first the IP range (10.6.0.0) and then the specific IPs (10.6.7.1, 10.6.7.2) in /etc/exports. Any suggestions?

    Edit to remove ambiguity: I've made sure that exports only contains either the range or the specific IPs, not both at the same time.

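    One thing worth ruling out first (a sketch, not a diagnosis): a bare network address like 10.6.0.0 is not a valid client specification in /etc/exports; it needs a netmask or CIDR suffix, and a stale export table can produce the same error. The export and mount options below are assumptions.

        # On node1 (the server). The export options shown are assumptions.
        cat /etc/exports
        # /lscratch  10.6.0.0/16(rw,sync,no_root_squash)
        # ...or, host by host:
        # /lscratch  10.6.7.1(rw,sync) 10.6.7.2(rw,sync)

        exportfs -ra        # re-read /etc/exports
        exportfs -v         # show what is actually exported, with options
        showmount -e node1  # what the server advertises to clients

        # On node2 (the client), mount verbosely to see the server's reply:
        mount -v -t nfs node1:/lscratch /scratch/node1

    The key check is whether exportfs -v and showmount -e actually list the address the mount request is arriving from.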

  • How to correctly set up iWARP? Preferably on loopback

    - by ajdecon
    iWARP is a protocol for doing remote direct memory access (RDMA) on top of TCP/IP, so that it can work with Ethernet and other network types as opposed to Infiniband. It works with many of the standard IB interfaces - the IB verbs, for example - so it's all pretty transparent.

    I'm doing some IB-verbs programming (mostly for the sake of learning more about how they work), and it would be wonderfully convenient if I could use iWARP to do RDMA over my loopback interface, so that I could test some of my code without getting on our IB-connected cluster. :-) But I cannot figure out how to get a "local development environment" set up: I'm not aware of any tutorials for setting up iWARP from scratch on a server or a network interface.

    Can anyone give me a tutorial or point me in the right direction? The environment is Fedora 16 running in VirtualBox.

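    A possible route to a local playground, sketched under heavy assumptions: recent kernels (far newer than Fedora 16) ship a software iWARP driver, siw, that can be bound to an ordinary netdev with the iproute2 rdma tool; on the Fedora 16 vintage, the out-of-tree softiwarp module played the same role. Whether siw will bind to lo specifically is untested here; binding to the regular Ethernet interface and connecting to its own address is the fallback.

        # Hedged sketch using the in-tree Soft-iWARP driver (needs a recent
        # kernel and the iproute2 "rdma" utility). Device names are made up.
        modprobe siw
        rdma link add siw0 type siw netdev lo   # software iWARP device on loopback

        ibv_devices                             # siw0 should now appear
        ibv_devinfo -d siw0

        # Quick RDMA round trip with rping from librdmacm-utils:
        rping -s -a 127.0.0.1 -v &
        rping -c -a 127.0.0.1 -v -C 5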

  • How can I determine which switch the Infiniband subnet manager is running on?

    - by ajdecon
    I recently inherited an Infiniband network containing multiple switches, and I know that one of these switches is running the subnet manager. The rest supposedly have that feature turned off, or it was never enabled on them. The trouble is, I have no idea which switch it is... I'd like to replace the switch-based subnet manager with OpenSM running on a couple of my infrastructure servers. Is there any way, short of logging into each switch individually, to determine which switch is running the SM?

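    A sketch of the usual non-invasive approach, using the infiniband-diags tools from any host on the fabric (exact output formats vary by OFED version):

        # Ask the fabric which port currently holds the master subnet manager:
        sminfo                   # reports the master SM's LID and port GUID

        # Translate that LID into a node description; for a switch-hosted SM
        # this is normally the switch's name (substitute the LID by hand):
        smpquery ND <sm-lid>

        # Cross-check against the full list of switches on the fabric:
        ibswitches

    Once OpenSM is running on the infrastructure servers with a higher SM priority, sminfo should start reporting their port GUID instead, which also gives a way to confirm the cutover.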
