Linux information FAQs

...

...

For example, queues with the _public suffix only submit jobs to the public nodes (hosts named nodeXX). Queues with the _all suffix can use user-contributed nodes (hosts named contribXX) as well as public nodes. However, jobs dispatched to user-contributed nodes may be preempted without warning by jobs that the owners of those nodes submit to them.
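As a rough illustration of the suffix rule, the sketch below builds a queue name from a base queue and whether the job can tolerate preemption. The helper name pick_queue and the program name ./myprogram are made up for this example; only short, normal and long have _all variants in the queue list on this page.

```shell
#!/bin/sh
# Hypothetical helper: choose between the _public and _all variant of a queue.
# $1 = base queue name (short, normal or long)
# $2 = "yes" if the job can tolerate being preempted on contributed nodes
pick_queue() {
    base=$1
    tolerate_preempt=$2
    if [ "$tolerate_preempt" = "yes" ]; then
        echo "${base}_all"      # public plus contributed nodes, may be preempted
    else
        echo "${base}_public"   # public nodes only
    fi
}

# Print (do not run) the submission command that would result.
echo "bsub -q $(pick_queue normal yes) ./myprogram"
```

Jobs that checkpoint or can simply be rerun are the natural candidates for the _all queues.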

...

| Node owner | New queue | Old queue | Comments |
|---|---|---|---|
| Public | admin_public | admin | system usage only |
| | express_public | express | short term high priority, special needs |
| | int_public | | interactive jobs, high priority, 4 hour limit |
| | paralleltest_public | paralleltest | short term debugging |
| | short_public | short | 2 hour limit |
| | parallel_public | parallel | 2 week limit, 2-64 cpus |
| | normal_public | normal | default queue - 3 days limit 1 cpu |
| | long_public | long | 28 days 1 cpu |
| | dregs_public | dregs | 364 days 1 cpu |
| Public shared | short_all | | |
| | normal_all | | shared across all nodes, lower priority |
| | long_all | | shared across all nodes, lower priority |
| ATLAS | atlas_prod | | Physics Atlas support |
| | atlas_analysis | | Physics Atlas support |
| Khardon | int_khardon | | contributed nodes, interactive 4 hour limit |
| | express_khardon | | contributed nodes, 30 minute limit |
| | short_khardon | | contributed nodes, 2 hour limit |
| | normal_khardon | | contributed nodes, 3 day limit |
| | long_khardon | | contributed nodes, 14 day limit |
| Miller | int_miller | | contributed nodes, interactive 4 hour limit |
| | express_miller | | contributed nodes, 30 minute limit |
| | short_miller | | contributed nodes, 2 hour limit |
| | normal_miller | | contributed nodes, 3 day limit |
| | long_miller | | contributed nodes, 14 day limit |
| Abriola | int_abriola | | contributed nodes, interactive 4 hour limit |
| | express_abriola | | contributed nodes, 30 minute limit |
| | short_abriola | | contributed nodes, 2 hour limit |
| | normal_abriola | | contributed nodes, 3 day limit |
| | long_abriola | | contributed nodes, 1 year limit |
| Napier | int_napier | | contributed nodes, interactive 4 hour limit |
| | express_napier | | contributed nodes, 30 minute limit |
| | short_napier | | contributed nodes, 2 hour limit |
| | normal_napier | | contributed nodes, 3 day limit |
| | long_napier | | contributed nodes, 1 year limit |

How do I choose between queues:

You may view queue properties with the bqueues command:

-bash-3.2$ bqueues

And extra details by queue name:

-bash-3.2$ bqueues -l atlas_prod

What is the default queue?
If you do not specify a queue by name in your bsub arguments, your job goes to the default queue, which is normal_public.

Where do I find basic unix/linux resources?

...

Most usage is centered around a dozen or so commands:

ls, more, less, cat, nano, pwd, cd, man, bsub, bkill, bjobs, ps, scp, ssh, cp, chmod, rm, mkdir, passwd, history, zip, unzip, tar, df, du, head, tail, grep

 

See the man pages for complete documentation. Here is a short description of some.

Basic Unix Commands

| Action Needed | Command | Usage |
|---|---|---|
| Display contents of a file | cat | cat filename |
| Copy a file | cp | cp [-op] source destination |
| Change file protection | chmod | chmod mode filename or chmod mode directory_name |
| Change working directory | cd | cd pathname |
| Display file (w/ pauses) | more | more filename |
| Display first page of text | less | less filename |
| Display help | man | man command or man -k topic |
| Rename a file | mv | mv [-op] filename1 filename2 or mv [-op] directory1 directory2 or mv [-op] filename directory |
| Compare file | diff | diff file1 file2 |
| Delete file | rm | rm [-op] filename |
| Create a directory | mkdir | mkdir directory_name |
| Delete a directory w/ files in it | rm -r | rm -r directory_name |
| Delete a directory | rmdir | rmdir directory_name |
| Display a list of files | ls | ls [-op] directory_name |
| Change the password | passwd | passwd |
| Display a long list (details) | ls -l | ls -l directory_name |
| Display current directory | pwd | pwd |
| Display mounted filesystems | df | df |


What text editors are available?

nano, nedit, vi, vim, emacs

 

How do I strip out ^M embedded characters in my files I transferred from my PC?

Use the dos2unix command on the file. There is man page documentation available for further info. Also check this page for additional tips.

> man dos2unix
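If dos2unix is not at hand, the same cleanup can be approximated with tr. This is only a sketch: the function name strip_cr is made up here, and note that it removes every carriage return in the file, not just those at line ends.

```shell
#!/bin/sh
# Hypothetical helper: copy a file while dropping all carriage returns (^M).
# $1 = input file transferred from the PC, $2 = cleaned output file
strip_cr() {
    tr -d '\r' < "$1" > "$2"
}
```

For example, strip_cr report_dos.txt report.txt writes a cleaned copy and leaves the original untouched.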


What is a man page?

man pages are linux/unix style text based documentation. For example, to obtain documentation on the command cat:

> man cat

> xman
is the command for the X-based interface to man.

> man man
is the man documentation.

> man -k graphics
finds all related commands concerning graphics.

...

Yes. You should not hard code the names anywhere. The convention is node01, node02, ...

What is LSF?
Load Sharing Facility

Why do I have to submit jobs to compute nodes via LSF?

The cluster has been configured to allocate work to compute nodes in a manner that provides efficient and fair use of resources. A job queueing system called LSF is provided as the work interface to the compute nodes. Your work is then distributed to queues that provide compute node resources. Logging in to compute nodes via ssh is discouraged, and you will be asked to refrain from using the resources in that manner; let LSF do it!

How do I find LSF documentation?
Documentation is available on the vendor website, in the cluster man pages, and at this local link.

My program needs access to more than 16 GB of RAM, what are my options?
An LSF resource has been defined to identify those nodes with 32 GB of RAM. You access this through the bsub command line option -R when you submit your job.

-bash-3.2$ bsub -R bigmem -q normal_public ./myprogram

Note: -bash-3.2$ is the default prompt for your bash shell on the cluster. The command is what follows it.

I have a program that is threaded. Is it possible to submit a job that will "take over" a node, or at least the 4 cores on a single chip?

You can force your job to run exclusively on a compute node (the only job running) by using bsub -x when you submit the job. It may take a bit longer to run though as it will have to wait in the queue for a compute node to become fully available.

You should also be able to use a combination of -n and -R to request a specific number of CPUs on one host. The following example should reserve 4 CPUs for your job on one compute node:

-bash-3.2$ bsub -n 4 -R "span[hosts=1]" ./yourprogram

How does one use a node exclusively?
Currently the only queue that allows exclusive use is the Express queue. You must request this access. Not all jobs are suitable, so please send your request to the cluster support email and describe what you intend to do.

How does one actually invoke a job exclusively?
The LSF bsub command has the -x option. To send your job to a node that has extra memory and run it exclusively:
-bash-3.2$ bsub -x -q express_public -R bigmem ...

What are some of the most common LSF commands:

| Action Needed | Command | Usage |
|---|---|---|
| System verification | lsid | lsid |
| Display load levels | lsmon | lsmon |
| Display hosts | lshosts | lshosts |
| Summarize past usage | bacct | bacct or bacct job ID # |
| Display hosts | bhosts | bhosts |
| View current jobs | bjobs | bjobs or bjobs job ID # |
| Run LSF batch job | bsub | bsub [-op] filename |
| Kill a job | bkill | bkill job ID # |
| Review/select queue | bqueues | bqueues or bqueues queue_name |
| Suspend a job | bstop | bstop job ID # |
| Changes job order (new or pending) | btop | btop job_ID or btop "job_ID[index_list]" |
| Resume suspended jobs | bresume | bresume job ID # |
| View job history | bhist | bhist job ID # |
| Modifying or Migrating jobs | bmod | see man page |

How can I get notified when my lsf submitted jobs finish?

By default no mail is generated. You need to add the -u option to bsub.
-bash-3.2$ bsub ... -u firstname.lastname@tufts.edu sleep 10

This will cause an e-mail to be sent when the job finishes, containing a summary of the job, the output, CPU & memory utilization, etc.

Also note that this may send enough output to your email account to put you over your email quota, which would prevent you from receiving further mail!

I need to submit 100s of jobs using the same program but with different input data, what is the best practice?

LSF provides a structure called a job array that allows a sequence of jobs that share the same executable and resource requirements, but have different input files, to be submitted, controlled, and monitored as a single unit. Using the standard LSF commands, you can also control and monitor individual jobs and groups of jobs submitted from a job array.
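A minimal sketch of a job array element script follows. LSF sets the LSB_JOBINDEX environment variable to the element's index when it runs each element. The file naming scheme input.N, the function names, the script name run_one.sh and the program ./myprogram are assumptions made for illustration only.

```shell
#!/bin/sh
# run_one.sh (hypothetical name): the work done by one element of a job array.

# Map an array index to its input file; the input.N naming is an assumption.
input_for_index() {
    echo "input.$1"
}

# Process the element with the given index.
process_element() {
    infile=$(input_for_index "$1")
    echo "processing $infile"
    # ./myprogram < "$infile"    # hypothetical program, left commented out
}

# LSF exports LSB_JOBINDEX for each array element; do nothing outside LSF.
if [ -n "${LSB_JOBINDEX:-}" ]; then
    process_element "$LSB_JOBINDEX"
fi

# Submission of 100 elements, shown as a comment because it needs LSF:
#   bsub -J "myarray[1-100]" -o "output.%I" ./run_one.sh
```

In the -o argument, %I is replaced by the element index, so each element writes its own output file.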

Now that I have submitted 100s of jobs, I realize that I don't want them to run after all. How do I kill them all?
All the jobs listed when you use the bjobs command can be removed by doing the following:
-bash-3.2$ bkill 0

I have a job in one queue, but would rather have it in another. How do I migrate the job?
Use the LSF command bmod with its -q option, giving the destination queue and the job number; see man bmod for details.

How can I verify that my requested storage is mounted?
Use the df command.

How can I view my most recent files?

> ls -lt 
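When a directory holds many files, it can help to show only the newest few. The sketch below wraps ls -t and head; the function name newest_files and its defaults are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print the most recently modified entries, newest first.
# $1 = directory to inspect (default: current directory)
# $2 = how many names to print (default: 10)
newest_files() {
    dir=${1:-.}
    count=${2:-10}
    ls -t -- "$dir" | head -n "$count"
}
```

For example, newest_files . 5 prints the five most recently modified names in the current directory.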

 

I sometimes notice that my job duration can vary when I rerun a program with exactly the same inputs, conditions, etc. Why?
The cluster has a mix of several different Intel CPU and motherboard combinations. The absolute performance potential is similar among them, but given the mix of other jobs sharing nodes, your results will vary. This is not predictable when the cluster is heavily loaded.

How do I convert mixed case file names in a directory to lower case?

Issue the following in the directory of interest:

> find . -name "*[A-Z]*" | cut -c 3- | awk '{print $1,tolower($1)}' | xargs -i echo "mv {}" | csh

 

This will find everything with uppercase letters and rename it to the
same thing with all lowercase.
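The one-line pipeline can misbehave on names that contain spaces. A more defensive variant could look like the sketch below, which is an alternative approach rather than the method above: it handles only the top level of the given directory (it does not recurse) and refuses to overwrite an existing lowercase name. The function name lowercase_names is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: rename entries in a directory to all-lowercase names.
# $1 = directory to process (default: current directory); not recursive.
lowercase_names() {
    dir=${1:-.}
    for path in "$dir"/*; do
        [ -e "$path" ] || continue            # empty directory: glob matched nothing
        name=${path##*/}                      # strip the leading directory part
        lower=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')
        [ "$name" = "$lower" ] && continue    # already lowercase
        if [ -e "$dir/$lower" ]; then
            echo "skipping $name: $lower already exists" >&2
            continue
        fi
        mv -- "$path" "$dir/$lower"
    done
}
```

Running lowercase_names . in the directory of interest renames its entries in place.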

How do I uncompress and extract many .gz and corresponding tar files?

> ls -1 *.gz | xargs -n 1 -r -I {} tar zxf {}
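An equivalent loop form, which also copes with spaces in file names, is sketched below. It assumes the archives are gzip-compressed tar files named *.tar.gz or *.tgz; a plain .gz file that is not a tar archive would need gunzip instead. The function name extract_all is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: extract every gzip-compressed tar archive in a directory.
# $1 = directory holding the archives (default: current directory)
extract_all() {
    dir=${1:-.}
    for archive in "$dir"/*.tar.gz "$dir"/*.tgz; do
        [ -e "$archive" ] || continue    # pattern matched nothing
        # Extract next to the archive; report failures but keep going.
        tar -xzf "$archive" -C "$dir" || echo "failed: $archive" >&2
    done
}
```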


Sometimes I get a cryptic message about too many open files, what is that?
There are several resource settings associated with a default account. To see the settings:

-bash-3.2$ ulimit -a

The default setting is 2048 for the open files parameter. A user may increase it up to 10K in 1024 increments. To set it to 4096:

-bash-3.2$ ulimit -n 4096

Separately, to move a job to a different queue, use bmod:

-bash-3.2$ bmod -q express_public <job_number>

This migrates job <job_number> to the express_public queue; name a different queue after -q to send it elsewhere.

The contributed nodes often seem idle. How do I use them if I am not in a particular contributed node queue user group?
There are three queues that will make use of all compute nodes. The Public Shared queues allow job placement to all nodes via LSF. When contributed nodes are idle and there are many jobs already in the normal_public or long_public queue, use of the Public Shared queues will likely land your jobs on idle contributed nodes. See above table for corresponding Public Shared queue names and properties. For more detail on a particular queue:

-bash-3.2$ bqueues -l short_all

How can I submit jobs to LSF on the cluster from my workstation without actually logging into the cluster?

If you have ssh on your workstation, try the following:
> ssh cluster.uit.tufts.edu ". /etc/profile.d/lsf.sh && bsub -q queuename ./yourprogram"
where queuename is one of the above queues.

Suppose I want to copy data via scp from my bash script that is running on a compute node to the /scratch/utln storage area of the login node. How do I reference it?

scp filename h01.uit.tufts.edu:/scratch/utln
Note, your utln username is needed.

How does a program or shell on one compute node reference data from another compute node's local scratch storage?

The local scratch directory on each compute node is automounted when requested. For example, to access file abcd.data on compute node 07 from compute node 19; the path is:

/cluster/scratch/node07/utln/abcd.data

This will give you access to your scratch directory and file on node07.

What is the path to reference from a job on a compute node to the storage on the login node?
/cluster/scratch/h01/utln/ ....

Additional User contributed Cluster Documentation

The following has been contributed by Rachel Lomasky. Click Here for the web version or under Wiki Page Operations a Pdf version is available as an attachment.

Note that running ulimit -n 4096 at the login prompt changes the setting on the headnode only.

Since jobs execute on compute nodes, you should include the ulimit command in a simple shell script along with your sbatch command, so that the new setting takes effect on the compute node as well.
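One way to arrange this is a small wrapper inside the job script that raises the limit and then starts the program, so the setting applies on whichever compute node runs the job. This is a sketch only: the function name run_with_nofile_limit and the program name ./myprogram are made up, and the limit you can request depends on what the node allows.

```shell
#!/bin/sh
# Hypothetical helper: run a command with a given soft limit on open files.
# $1 = limit to request, remaining arguments = command to run
run_with_nofile_limit() {
    limit=$1
    shift
    (
        # The subshell keeps the change local to this one command.
        ulimit -S -n "$limit" || exit 1
        "$@"
    )
}

# Inside a job script this might be used as:
#   run_with_nofile_limit 4096 ./myprogram
```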

Sometimes I would like to have multiple shells available on the login node, how do I do this?

There are many ways. Most often you may just log in two or more times from your desktop. Or, refer to the Slurm section with examples using srun.

It is best not to ssh into the cluster while you are on the login node of the cluster.

 

I compiled a program locally in my account and I would like to add the location of the executables to my PATH, how do I do this?

Suppose your software directory is located in your home directory and your username is jdoe02. Your software directory is flight_program/ and it contains a bin subdirectory holding the executables.

$PATH is a global environment variable defined for the cluster for general access to important locations. You may add to it by prepending the path you wish to be searched.

PATH=/cluster/home/j/d/jdoe02/flight_program/bin:$PATH

export PATH

To see if it got updated:

> printenv  | grep  PATH
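If the assignment is repeated, for example by placing it in a shell startup file that is read more than once, PATH can accumulate duplicate entries. The sketch below prepends a directory only when it is not already present; the function name prepend_path is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: put a directory at the front of PATH, but only once.
# $1 = directory to add
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already on PATH, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

# Example use with the directory from the text above:
#   prepend_path /cluster/home/j/d/jdoe02/flight_program/bin
```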