Linux and LSF information FAQs
What are the new Cluster queues as of May 27, 2009?
Additional nodes were added to the cluster along with new queues. These changes streamline performance and application integration support.
In addition, nodes contributed by faculty are listed below. Use of a contributed node must be authorized by the node's owner. If you are not authorized to use the contributed node(s), you may use the short_all, normal_all, or long_all queues at lower priority, subject to preemption and other constraints.
For example, note that queues with the _public suffix will only submit jobs to the public nodes (hosts named nodeXX). Queues with the _all suffix can take advantage of user-contributed nodes (hosts named contribXX) as well as public nodes; however, jobs dispatched to user-contributed nodes may be preempted without warning by jobs that the owners of those nodes submit to them. An example of submitting to each type of queue follows the table below.
...
Node |
---|
...
owner | New queue | Old queue | Comments |
---|---|---|---|
Public | admin_public | admin | system usage only |
 | express_public | express | short term high priority, special needs |
 | starp_public | starp | reserved for StarP applications |
 | int_public |  | interactive jobs, high priority, 4 hour limit |
 | paralleltest_public | paralleltest | short term debugging |
 | short_public | short | 2 hour limit |
 | parallel_public | parallel | 2 week limit 2-64cpus |
 | normal_public | normal | default queue - 3 days limit 1 cpu |
 | long_public | long | 28 days 1 cpu |
 | dregs_public | dregs | 364 days 1 cpu |
Public shared | short_all |  | shared across all nodes, lower priority |
 | normal_all |  | shared across all nodes, lower priority |
 | long_all |  | shared across all nodes, lower priority |
ATLAS | atlas_prod |  | Physics Atlas support |
 | atlas_analysis |  | Physics Atlas support |
Khardon | int_khardon |  | contributed nodes, interactive 4 hour limit |
 | express_khardon |  | contributed nodes, 30 minute limit |
 | short_khardon |  | contributed nodes, 2 hour limit |
 | normal_khardon |  | contributed nodes, 3 day limit |
 | long_khardon |  | contributed nodes, 14 day limit |
Miller | int_miller |  | contributed nodes, interactive 4 hour limit |
 | express_miller |  | contributed nodes, 30 minute limit |
 | short_miller |  | contributed nodes, 2 hour limit |
 | normal_miller |  | contributed nodes, 3 day limit |
 | long_miller |  | contributed nodes, 14 day limit |
Abriola | int_abriola |  | contributed nodes, interactive 4 hour limit |
 | express_abriola |  | contributed nodes, 30 minute limit |
 | short_abriola |  | contributed nodes, 2 hour limit |
 | normal_abriola |  | contributed nodes, 3 day limit |
 | long_abriola |  | contributed nodes, 1 year limit |
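For example, the same program can be submitted to the public-only short queue or to the shared queue that can also use contributed nodes (the program name ./myprogram is illustrative):
-bash-3.2$ bsub -q short_public ./myprogram
-bash-3.2$ bsub -q short_all ./myprogram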
How do I choose between queues?
You may view queue properties with the bqueues command:
-bash-3.2$ bqueues
And extra details by queue name:
-bash-3.2$ bqueues -l atlas_prod
What is the default queue?
If you do not specify a queue by name in your bsub arguments, your job goes to the default queue, which is normal_public.
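For example, the following two submissions are equivalent (the program name is illustrative):
-bash-3.2$ bsub ./myprogram
-bash-3.2$ bsub -q normal_public ./myprogram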
Where do I find basic unix/linux resources?
There are many web-based tutorials and how-tos for anything Linux oriented. Some sites of interest:
linux-tutorial, Unix info, linux.org
What are some of the basic linux and related commands?
Most usage is centered around a dozen or so commands:
ls, more, less, cat, nano, pwd, cd, man, bsub, bkill, bjobs, ps, scp, ssh, cp, chmod, rm, mkdir, passwd, history, zip, unzip, tar, df, du
See the man pages for complete documentation. Here is a short description of some.
Basic Unix Commands
Action Needed | Command | Usage |
---|---|---|
Display contents of a file | cat | cat filename |
Copy a file | cp | cp [-op] source destination |
Change file protection | chmod | chmod mode filename |
Change working directory | cd | cd pathname |
Display a file (with pauses) | more | more filename |
Display a file one page at a time | less | less filename |
Display help | man | man command |
Rename a file | mv | mv [-op] filename1 filename2 |
Compare files | diff | diff file1 file2 |
Delete file | rm | rm [-op] filename |
Create a directory | mkdir | mkdir directory_name |
Delete a directory with files in it | rm -r | rm -r directory_name |
Delete a directory | rmdir | rmdir directory_name |
Display a list of files | ls | ls [-op] directory_name |
Change the password | passwd | passwd |
Display a long list (details) | ls -l | ls -l directory_name |
Display current directory | pwd | pwd |
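A short example session using a few of these commands (the file and directory names are illustrative):
-bash-3.2$ pwd
-bash-3.2$ mkdir results
-bash-3.2$ cp data.txt results/
-bash-3.2$ ls -l results
-bash-3.2$ less results/data.txt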
What is a man page?
Man pages are the standard Linux/Unix text-based documentation. For example, to obtain documentation on the cat command:
> man cat
> xman is the command for the X-based graphical interface to man.
> man man displays the documentation for man itself.
> man -k graphics finds all commands related to graphics.
Are the compute nodes named differently from the old cluster compute nodes?
Yes. You should not hard-code node names anywhere. The naming convention is node01, node02, ...
What is LSF?
LSF stands for Load Sharing Facility, the job queueing system used on the cluster.
Why do I have to submit jobs to compute nodes via LSF?
The cluster has been configured to allocate work to the compute nodes in a manner that provides efficient and fair use of resources. A job queueing system called LSF is provided as the work interface to the compute nodes. Your work is distributed to queues that provide compute node resources. Logging in to compute nodes via ssh is discouraged, and you will be asked to refrain from using the resources in that manner; let LSF do it!
How do I find LSF documentation?
Documentation is available on the vendor website, in the cluster man pages, and at this local link.
My program needs access to more than 16 GB of RAM; what are my options?
An LSF resource has been defined to identify the nodes with 32 GB of RAM. You access this through a bsub command line option, -R, when you submit your job.
-bash-3.2$ bsub -R bigmem -q normal_public ./myprogram
Note: -bash-3.2$ is the default prompt for your bash shell on the cluster. The command is what follows it.
I have a program that is threaded. Is it possible to submit a job that will "take over" a node, or at least the 4 cores on a single chip?
You can force your job to run exclusively on a compute node (the only job running) by using bsub -x when you submit the job. It may take a bit longer to run though as it will have to wait in the queue for a compute node to become fully available.
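For example (the queue name and program path are illustrative):
-bash-3.2$ bsub -x -q normal_public ./myprogram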
You should also be able to use a combination of -n and -R to request a specific number of CPUs on one host. The following example should reserve 4 CPUs for your job on one compute node:
-bash-3.2$ bsub -n 4 -R "span[hosts=1]" ./yourprogram
What are some of the most common LSF commands?
Action Needed | Command | Usage |
---|---|---|
System verification | lsid | lsid |
Display load levels | lsmon | lsmon |
Display static host information | lshosts | lshosts |
Summarize past usage | bacct | bacct |
Display batch host status | bhosts | bhosts |
View current jobs | bjobs | bjobs |
Run LSF batch job | bsub | bsub [-op] filename |
Kill a job | bkill | bkill job id # |
Review/select queue | bqueues | bqueues |
Suspend a job | bstop | bstop job ID # |
Change job order (new or pending) | btop | btop job ID # |
Resume suspended jobs | bresume | bresume job ID # |
View job history | bhist | bhist job ID # |
Modifying or Migrating jobs | bmod | see man page |
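A typical job lifecycle using a few of these commands might look like the following (the program name and job ID are illustrative):
-bash-3.2$ bsub -q normal_public ./myprogram
-bash-3.2$ bjobs
-bash-3.2$ bhist 12345
-bash-3.2$ bkill 12345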
How can I get notified when my LSF-submitted jobs finish?
By default no mail is generated. You need to add the -u option to bsub.
-bash-3.2$ bsub -u firstname.lastname@tufts.edu sleep 10
This will cause an email to be sent when the job finishes, containing a summary of the job, the output, CPU & memory utilization, etc. Also note that this action might send enough output to your email account to put you over your email quota, thus preventing receipt of further email!
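If the output itself is large, one option is to direct it to a file with bsub's -o option rather than receiving it in the mail (a sketch; the output file name is illustrative):
-bash-3.2$ bsub -u firstname.lastname@tufts.edu -o myprogram.out ./myprogram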
...
I need to submit 100s of jobs using the same program but with different input data; what is the best practice?
LSF provides a structure called a job array that allows a sequence of jobs that share the same executable and resource requirements, but have different input files, to be submitted, controlled, and monitored as a single unit. Using the standard LSF commands, you can also control and monitor individual jobs and groups of jobs submitted from a job array.
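A minimal sketch of a job array submission, assuming 100 input files named input.1 through input.100 (the job name, file names, and program are illustrative). LSF substitutes the array index for %I in the -i and -o options, and the index is also available to each job as $LSB_JOBINDEX:
-bash-3.2$ bsub -J "myarray[1-100]" -i "input.%I" -o "output.%I" ./myprogram
-bash-3.2$ bjobs -J myarray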
...
Now that I submitted 100s of jobs, I realized that I don't want them to run after all; how do I kill them all?
All the jobs listed when you use the bjobs command can be removed by doing the following:
-bash-3.2$ bkill 0
...
I have a job in one queue, but would rather have it in another. How do I migrate the job?
Use the LSF command bmod. For example:
-bash-3.2$ bmod -q express_public <job_number>
This will migrate your job with <job_number> to the express_public queue (or any other queue you specify).
...
The contributed nodes often seem idle. How do I use them if I am not in a particular contributed node queue user group?
There are three queues that will make use of all compute nodes. The Public Shared queues allow job placement to all nodes via LSF. When contributed nodes are idle and there are many jobs already in the normal_public or long_public queue, use of the Public Shared queues will likely land your jobs on idle contributed nodes. See the table above for the corresponding Public Shared queue names and properties. For more detail on a particular queue:
-bash-3.2$ bqueues -l short_all
...
How can I submit jobs to LSF on the cluster from my workstation without actually logging into the cluster?
If you have ssh on your workstation, try the following:
> ssh cluster.uit.tufts.edu ". /etc/profile.d/lsf.sh && bsub -q queuename ./yourprogram"
where queuename is one of the above queues.
...
Suppose I want to copy data via scp from my bash script that is running on a compute node to the /scratch/utln storage area of the login node. How do I reference it?
scp filename h01.uit.tufts.edu:/scratch/utln
Note, your utln username is needed.
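For example, if your utln username were jsmith (illustrative, as is the file name), the command would be:
scp myresults.dat jsmith@h01.uit.tufts.edu:/scratch/jsmith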
...
How does a program or shell on one compute node reference data from another compute node's local scratch storage?
The local scratch directory on each compute node is automounted when requested. For example, to access the file abcd.data on compute node 07 from compute node 19, the path is:
/cluster/scratch/node07/utln/abcd.data
This will give you access to your scratch directory and file on node07.
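For example, a script running on node19 could copy that file into its current working directory using the automounted path above:
cp /cluster/scratch/node07/utln/abcd.data .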
...
What is the path to reference from a job on a compute node to the storage on the login node?
/cluster/scratch/h01/utln/....
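For example, a job running on a compute node could copy a result file to the login node's scratch area via this path (the file name is illustrative; substitute your own utln username):
cp results.dat /cluster/scratch/h01/utln/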
...
How do I convert mixed case file names in a directory to lower case?
Issue the following in the directory of interest:
find . -name "*[A-Z]*" | cut -c 3- | awk '{print $1,tolower($1)}' | xargs -i echo "mv {}" | csh
This will find everything with uppercase letters in its name and rename it to the same name in all lowercase.
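To preview the renames before running them, you can drop the final | csh so the mv commands are only printed (a sketch based on the command above):
find . -name "*[A-Z]*" | cut -c 3- | awk '{print $1,tolower($1)}' | xargs -i echo "mv {}"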
...
Additional User contributed Cluster Documentation
The following has been contributed by Rachel Lomasky. It is available as a web version, or under Wiki Page Operations as an attachment.