Linux and LSF information FAQs
What are the new Cluster queues as of May 27, 2009?
Additional nodes were added to the cluster along with new queues. These changes streamline performance and application integration support.
In addition, nodes contributed by faculty are listed below. Use of a contributed node must be authorized by the node's owner. If you are not authorized to use the contributed node(s), you may use the short_all, normal_all, or long_all queues at lower priority, subject to preemption and other constraints.
For example, note that queues with the _public suffix will only submit jobs to the public nodes (hosts named nodeXX). Queues with the _all suffix can take advantage of user-contributed nodes (hosts named contribXX) as well as public nodes; however, jobs dispatched to user-contributed nodes may be preempted without warning by jobs that the owners of those nodes submit to them. An example of submitting to each type of queue follows the table below.
...
Node |
---|
...
owner | New queue | Old queue | Comments |
---|---|---|---|
Public | admin_public | admin | system usage only |
 | express_public | express | short term high priority, special needs |
 | starp_public | starp | reserved for StarP applications |
 | int_public |  | interactive jobs, high priority, 4 hour limit |
 | paralleltest_public | paralleltest | short term debugging |
 | short_public | short | 2 hour limit |
 | parallel_public | parallel | 2 week limit 2-64cpus |
 | normal_public | normal | default queue - 3 days limit 1 cpu |
 | long_public | long | 28 days 1 cpu |
 | dregs_public | dregs | 364 days 1 cpu |
Public shared | short_all |  | shared across all nodes, lower priority |
 | normal_all |  | shared across all nodes, lower priority |
 | long_all |  | shared across all nodes, lower priority |
ATLAS | atlas_prod |  | Physics Atlas support |
 | atlas_analysis |  | Physics Atlas support |
Khardon | int_khardon |  | contributed nodes, interactive 4 hour limit |
 | express_khardon |  | contributed nodes, 30 minute limit |
 | short_khardon |  | contributed nodes, 2 hour limit |
 | normal_khardon |  | contributed nodes, 3 day limit |
 | long_khardon |  | contributed nodes, 14 day limit |
Miller | int_miller |  | contributed nodes, interactive 4 hour limit |
 | express_miller |  | contributed nodes, 30 minute limit |
 | short_miller |  | contributed nodes, 2 hour limit |
 | normal_miller |  | contributed nodes, 3 day limit |
 | long_miller |  | contributed nodes, 14 day limit |
Abriola | int_abriola |  | contributed nodes, interactive 4 hour limit |
 | express_abriola |  | contributed nodes, 30 minute limit |
 | short_abriola |  | contributed nodes, 2 hour limit |
 | normal_abriola |  | contributed nodes, 3 day limit |
 | long_abriola |  | contributed nodes, 1 year limit |
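For example, the same program can be submitted to the public-only short queue or to the shared queue that can also use contributed nodes (the program name ./myprogram is illustrative):
-bash-3.2$ bsub -q short_public ./myprogram
-bash-3.2$ bsub -q short_all ./myprogram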
How do I choose between queues?
You may view queue properties with the bqueues command:
-bash-3.2$ bqueues
And extra details by queue name:
-bash-3.2$ bqueues -l atlas_prod
What is the default queue?
If you do not specify a queue by name in your bsub arguments, your job goes to the default queue, which is normal_public.
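For example, the following two submissions are equivalent (the program name is illustrative):
-bash-3.2$ bsub ./myprogram
-bash-3.2$ bsub -q normal_public ./myprogram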
Where do I find basic unix/linux resources?
There are many web-based tutorials and how-tos for anything Linux oriented. Some sites of interest:
linux-tutorial, Unix info, linux.org
What are some of the basic linux and related commands?
Most usage is centered around a dozen or so commands:
ls, more, less, cat, nano, pwd, cd, man, bsub, bkill, bjobs, ps, scp, ssh, cp, chmod, rm, mkdir, passwd, history, zip, unzip, tar, df, du
See the man pages for complete documentation. Here is a short description of some.
Basic Unix Commands
Action Needed | Command | Usage |
---|---|---|
Display contents of a file | cat | cat filename |
Copy a file | cp | cp [-op] source destination |
Change file protection | chmod | chmod mode filename |
Change working directory | cd | cd pathname |
Display a file (with pauses) | more | more filename |
Display a file one page at a time | less | less filename |
Display help | man | man command |
Rename a file | mv | mv [-op] filename1 filename2 |
Compare files | diff | diff file1 file2 |
Delete file | rm | rm [-op] filename |
Create a directory | mkdir | mkdir directory_name |
Delete a directory with files in it | rm -r | rm -r directory_name |
Delete a directory | rmdir | rmdir directory_name |
Display a list of files | ls | ls [-op] directory_name |
Change the password | passwd | passwd |
Display a long list (details) | ls -l | ls -l directory_name |
Display current directory | pwd | pwd |
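A short example session using a few of these commands (the file and directory names are illustrative):
-bash-3.2$ pwd
-bash-3.2$ mkdir results
-bash-3.2$ cp data.txt results/
-bash-3.2$ ls -l results
-bash-3.2$ less results/data.txt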
What is a man page?
Man pages are the standard Linux/Unix text-based documentation. For example, to obtain documentation on the cat command:
> man cat
> xman is the command for the X-based graphical interface to man.
> man man displays the documentation for man itself.
> man -k graphics finds all commands related to graphics.
Are the compute nodes named differently from the old cluster compute nodes?
Yes. You should not hard-code node names anywhere. The naming convention is node01, node02, ...
What is LSF?
LSF stands for Load Sharing Facility, the job queueing system used on the cluster.
Why do I have to submit jobs to compute nodes via LSF?
The cluster has been configured to allocate work to the compute nodes in a manner that provides efficient and fair use of resources. A job queueing system called LSF is provided as the work interface to the compute nodes. Your work is distributed to queues that provide compute node resources. Logging in to compute nodes via ssh is discouraged, and you will be asked to refrain from using the resources in that manner; let LSF do it!
How do I find LSF documentation?
Documentation is available on the vendor website, in the cluster man pages, and at this local link.
My program needs access to more than 16 GB of RAM; what are my options?
An LSF resource has been defined to identify the nodes with 32 GB of RAM. You access this through a bsub command line option, -R, when you submit your job.
-bash-3.2$ bsub -R bigmem -q normal_public ./myprogram
Note: -bash-3.2$ is the default prompt for your bash shell on the cluster. The command is what follows it.
I have a program that is threaded. Is it possible to submit a job that will "take over" a node, or at least the 4 cores on a single chip?
You can force your job to run exclusively on a compute node (the only job running) by using bsub -x when you submit the job. It may take a bit longer to run though as it will have to wait in the queue for a compute node to become fully available.
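For example (the queue name and program path are illustrative):
-bash-3.2$ bsub -x -q normal_public ./myprogram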
You should also be able to use a combination of -n and -R to request a specific number of CPUs on one host. The following example should reserve 4 CPUs for your job on one compute node:
-bash-3.2$ bsub -n 4 -R "span[hosts=1]" ./yourprogram
What are some of the most common LSF commands?
Action Needed | Command | Usage |
---|---|---|
System verification | lsid | lsid |
Display load levels | lsmon | lsmon |
Display static host information | lshosts | lshosts |
Summarize past usage | bacct | bacct |
Display batch host status | bhosts | bhosts |
View current jobs | bjobs | bjobs |
Run LSF batch job | bsub | bsub [-op] filename |
Kill a job | bkill | bkill job id # |
Review/select queue | bqueues | bqueues |
Suspend a job | bstop | bstop job ID # |
Change job order (new or pending) | btop | btop job ID # |
Resume suspended jobs | bresume | bresume job ID # |
View job history | bhist | bhist job ID # |
Modifying or Migrating jobs | bmod | see man page |
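A typical job lifecycle using a few of these commands might look like the following (the program name and job ID are illustrative):
-bash-3.2$ bsub -q normal_public ./myprogram
-bash-3.2$ bjobs
-bash-3.2$ bhist 12345
-bash-3.2$ bkill 12345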
How can I get notified when my LSF-submitted jobs finish?
By default no mail is generated. You need to add the -u option to bsub.
-bash-3.2$ bsub -u firstname.lastname@tufts.edu sleep 10
This will cause an email to be sent when the job finishes, containing a summary of the job, the output, CPU & memory utilization, etc. Also note that this action might send enough output to your email account to put you over your email quota, thus preventing receipt of further email!
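If the output itself is large, one option is to direct it to a file with bsub's -o option rather than receiving it in the mail (a sketch; the output file name is illustrative):
-bash-3.2$ bsub -u firstname.lastname@tufts.edu -o myprogram.out ./myprogram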
...
I need to submit 100s of jobs using the same program but with different input data; what is the best practice?
LSF provides a structure called a job array that allows a sequence of jobs that share the same executable and resource requirements, but have different input files, to be submitted, controlled, and monitored as a single unit. Using the standard LSF commands, you can also control and monitor individual jobs and groups of jobs submitted from a job array.
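A minimal sketch of a job array submission, assuming 100 input files named input.1 through input.100 (the job name, file names, and program are illustrative). LSF substitutes the array index for %I in the -i and -o options, and the index is also available to each job as $LSB_JOBINDEX:
-bash-3.2$ bsub -J "myarray[1-100]" -i "input.%I" -o "output.%I" ./myprogram
-bash-3.2$ bjobs -J myarray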
...
Now that I submitted 100s of jobs, I realized that I don't want them to run after all; how do I kill them all?
All the jobs listed when you use the bjobs command can be removed by doing the following:
-bash-3.2$ bkill 0
...
I have a job in one queue, but would rather have it in another. How do I migrate the job?
Use the LSF command bmod. For example:
-bash-3.2$ bmod -q express_public <job_number>
This will migrate your job with <job_number> to the express_public queue (or any other queue you specify).
...
The contributed nodes often seem idle. How do I use them if I am not in a particular contributed node queue user group?
There are three queues that will make use of all compute nodes. The Public Shared queues allow job placement to all nodes via LSF. When contributed nodes are idle and there are many jobs already in the normal_public or long_public queue, use of the Public Shared queues will likely land your jobs on idle contributed nodes. See the table above for the corresponding Public Shared queue names and properties. For more detail on a particular queue:
-bash-3.2$ bqueues -l short_all
...
How can I submit jobs to LSF on the cluster from my workstation without actually logging into the cluster?
If you have ssh on your workstation, try the following:
> ssh cluster.uit.tufts.edu ". /etc/profile.d/lsf.sh && bsub -q queuename ./yourprogram"
where queuename is one of the above queues.
...
Suppose I want to copy data via scp from my bash script that is running on a compute node to the /scratch/utln storage area of the login node. How do I reference it?
scp filename h01.uit.tufts.edu:/scratch/utln
Note, your utln username is needed.
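For example, if your utln username were jsmith (illustrative, as is the file name), the command would be:
scp myresults.dat jsmith@h01.uit.tufts.edu:/scratch/jsmith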
...
How does a program or shell on one compute node reference data from another compute node's local scratch storage?
The local scratch directory on each compute node is automounted when requested. For example, to access the file abcd.data on compute node 07 from compute node 19, the path is:
/cluster/scratch/node07/utln/abcd.data
This will give you access to your scratch directory and file on node07.
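For example, a script running on node19 could copy that file into its current working directory using the automounted path above:
cp /cluster/scratch/node07/utln/abcd.data .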
...
What is the path to reference from a job on a compute node to the storage on the login node?
/cluster/scratch/h01/utln/....
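For example, a job running on a compute node could copy a result file to the login node's scratch area via this path (the file name is illustrative; substitute your own utln username):
cp results.dat /cluster/scratch/h01/utln/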
...
How do I convert mixed case file names in a directory to lower case?
Issue the following in the directory of interest:
find . -name "*[A-Z]*" | cut -c 3- | awk '{print $1,tolower($1)}' | xargs -i echo "mv {}" | csh
This will find everything with uppercase letters in its name and rename it to the same name in all lowercase.
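To preview the renames before running them, you can drop the final | csh so the mv commands are only printed (a sketch based on the command above):
find . -name "*[A-Z]*" | cut -c 3- | awk '{print $1,tolower($1)}' | xargs -i echo "mv {}"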
...
Additional User contributed Cluster Documentation
The following has been contributed by Rachel Lomasky. It is available as a web version, or under Wiki Page Operations as an attachment.