The Tufts High Performance Compute (HPC) cluster delivers 35,845,920 cpu hours and 59,427,840 gpu hours of free compute time per year to the user community.

Teraflops: 60+ (60+ trillion floating point operations per second)
CPU: 4000 cores
GPU: 6784 cores
Interconnect: 40GB low latency ethernet

For additional information, please contact Research Technology Services at tts-research@tufts.edu


Linux and LSF information FAQs

What are the new cluster queues as of May 27, 2009?
Additional nodes and new queues were added to the cluster. These changes streamline performance and application integration support. Nodes contributed by faculty are also listed below. Use of contributed nodes must be authorized by the node owner. If you are not authorized to use a contributed node, you may still reach it through the short_all, normal_all or long_all queues, at lower priority and subject to preemption and other constraints.

Queues with the _public suffix submit jobs only to the public nodes (hosts named nodeXX). Queues with the _all suffix can use contributed nodes (hosts named contribXX) as well as public nodes. However, jobs dispatched to contributed nodes may be preempted without warning by jobs that the node owner submits to those nodes.

Node owner     New queue            Old queue     Comments
Public         admin_public         admin         system usage only
               express_public       express       short term high priority, special needs
               starp_public         starp         reserved for StarP applications
               int_public           -             interactive jobs, high priority, 4 hour limit
               paralleltest_public  paralleltest  short term debugging
               short_public         short         2 hour limit
               parallel_public      parallel      2 week limit, 2-64 cpus
               normal_public        normal        default queue, 3 day limit, 1 cpu
               long_public          long          28 days, 1 cpu
               dregs_public         dregs         364 days, 1 cpu
Public shared  short_all            -             shared across all nodes, lower priority
               normal_all           -             shared across all nodes, lower priority
               long_all             -             shared across all nodes, lower priority
ATLAS          atlas_prod           -             Physics Atlas support
               atlas_analysis       -             Physics Atlas support
Khardon        int_khardon          -             contributed nodes, interactive 4 hour limit
               express_khardon      -             contributed nodes, 30 minute limit
               short_khardon        -             contributed nodes, 2 hour limit
               normal_khardon       -             contributed nodes, 3 day limit
               long_khardon         -             contributed nodes, 14 day limit
Miller         int_miller           -             contributed nodes, interactive 4 hour limit
               express_miller       -             contributed nodes, 30 minute limit
               short_miller         -             contributed nodes, 2 hour limit
               normal_miller        -             contributed nodes, 3 day limit
               long_miller          -             contributed nodes, 14 day limit
Abriola        int_abriola          -             contributed nodes, interactive 4 hour limit
               express_abriola      -             contributed nodes, 30 minute limit
               short_abriola        -             contributed nodes, 2 hour limit
               normal_abriola       -             contributed nodes, 3 day limit
               long_abriola         -             contributed nodes, 1 year limit

How do I choose between queues?

You may view queue properties with the bqueues command:

-bash-3.2$ bqueues

For more detail on a single queue, use the -l option with the queue name:

-bash-3.2$ bqueues -l atlas_prod

What is the default queue?
If you do not specify a queue by name in your bsub arguments, your job goes to the default queue, which is normal_public.
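
As a rough aid for picking a public queue, the limits in the queue table can be encoded in a small shell function. This is only a sketch: the function name suggest_public_queue and its arguments are hypothetical (not part of LSF), and it assumes that any job using more than one CPU belongs in parallel_public. Always confirm the real limits with bqueues -l.

```shell
#!/bin/bash
# Suggest a public queue from an estimated run time (whole hours) and a CPU count.
# Limits come from the queue table on this page; the function itself is
# illustrative and not an LSF command.
suggest_public_queue() {
    hours=$1
    cpus=${2:-1}
    if [ "$cpus" -gt 1 ]; then
        # parallel_public: 2 to 64 CPUs, 2 week (336 hour) limit
        if [ "$cpus" -le 64 ] && [ "$hours" -le 336 ]; then
            echo parallel_public
        else
            echo "no matching public queue" >&2
            return 1
        fi
    elif [ "$hours" -le 2 ]; then
        echo short_public      # 2 hour limit
    elif [ "$hours" -le 72 ]; then
        echo normal_public     # 3 day limit
    elif [ "$hours" -le 672 ]; then
        echo long_public       # 28 day limit
    elif [ "$hours" -le 8736 ]; then
        echo dregs_public      # 364 day limit
    else
        echo "no matching public queue" >&2
        return 1
    fi
}

suggest_public_queue 1        # prints short_public
suggest_public_queue 48       # prints normal_public
suggest_public_queue 100 16   # prints parallel_public
```

On the cluster, the result could be passed straight to bsub, for example: bsub -q "$(suggest_public_queue 48)" ./myprogram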

Where do I find basic unix/linux resources?

There are many web-based tutorials and how-tos for anything Linux oriented. Some sites of interest:

linux-tutorial, Unix info, linux.org

What are some of the basic linux and related commands?

Most usage is centered around a couple dozen commands:

ls, more, less, cat, nano, pwd, cd, man, bsub, bkill, bjobs, ps, scp, ssh, cp, chmod, rm, mkdir, passwd, history, zip, unzip, tar, df, du
See the man pages for complete documentation. Short descriptions of some of them follow.

Basic Unix Commands

Action Needed                        Command  Usage
Display contents of a file           cat      cat filename
Copy a file                          cp       cp [-op] source destination
Change file protection               chmod    chmod mode filename or
                                              chmod mode directory_name
Change working directory             cd       cd pathname
Display file (w/ pauses)             more     more filename
Display file a page at a time        less     less filename
Display help                         man      man command or
                                              man -k topic
Rename a file                        mv       mv [-op] filename1 filename2 or
                                              mv [-op] directory1 directory2 or
                                              mv [-op] filename directory
Compare files                        diff     diff file1 file2
Delete file                          rm       rm [-op] filename
Create a directory                   mkdir    mkdir directory_name
Delete a directory w/ files in it    rm -r    rm -r directory_name
Delete an empty directory            rmdir    rmdir directory_name
Display a list of files              ls       ls [-op] directory_name
Change the password                  passwd   passwd
Display a long list (details)        ls -l    ls -l directory_name
Display current directory            pwd      pwd

What is a man page?

man pages are Linux/Unix style text-based documentation. To obtain documentation on the command cat:

> man cat

> xman is the command for the X-based interface to man.

> man man displays the documentation for man itself.

> man -k graphics finds all commands related to graphics.

Are the compute nodes named differently from the old cluster compute nodes?

Yes. You should not hard code the names anywhere. The convention is node01, node02, ...

What is LSF?
LSF stands for Load Sharing Facility. It is the job queueing system used to run work on the cluster's compute nodes.

Why do I have to submit jobs to compute nodes via LSF?

The cluster has been configured to allocate work to compute nodes in a manner that provides efficient and fair use of resources. A job queueing system called LSF is provided as the work interface to the compute nodes. Your work is then distributed to queues that provide compute node resources. Logging in to compute nodes directly via ssh is discouraged, and you will be asked to refrain from using the resources in that manner; let LSF do it!

How do I find LSF documentation?
Documentation is available on the vendor website, in the cluster man pages, and at this local link.

My program needs access to more than 16 GB of RAM. What are my options?
An LSF resource named bigmem has been defined to identify the nodes with 32 GB of RAM. You request it with the bsub command line option -R when you submit your job.

-bash-3.2$ bsub -R bigmem -q normal_public ./myprogram

Note: -bash-3.2$ is the default prompt for your bash shell on the cluster. The command is what follows it.

What are some of the most common LSF commands?

Action Needed                       Command  Usage
System verification                 lsid     lsid
Display load levels                 lsmon    lsmon
Display hosts and static resources  lshosts  lshosts
Summarize past usage                bacct    bacct or
                                             bacct job_ID
Display batch host status           bhosts   bhosts
View current jobs                   bjobs    bjobs or
                                             bjobs job_ID
Run LSF batch job                   bsub     bsub [-op] command
Kill a job                          bkill    bkill job_ID
Review/select queue                 bqueues  bqueues or
                                             bqueues queue_name
Suspend a job                       bstop    bstop job_ID
Change job order (new or pending)   btop     btop job_ID | "job_ID[index_list]"
Resume suspended jobs               bresume  bresume job_ID
View job history                    bhist    bhist job_ID
Modify or migrate jobs              bmod     see man page

How can I get notified when my LSF jobs finish?

You need to add the -u option to bsub.
-bash-3.2$ bsub ... -u firstname.lastname@tufts.edu sleep 10

This will cause an e-mail to be sent when the job finishes, containing a summary of the job, the output, CPU & memory utilization, etc.
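
The bsub options shown on this page (-q, -R, -u) can also be kept in a job script as #BSUB comment lines, which LSF reads when the script is fed to bsub on standard input. The following is a minimal sketch: the file name myjob.lsf is a placeholder, and the queue, resource and address are simply the examples used elsewhere on this page. It only writes the file and lists the embedded options; it does not submit anything.

```shell
#!/bin/bash
# Write a small LSF job script. myjob.lsf is a hypothetical file name.
cat > myjob.lsf <<'EOF'
#!/bin/bash
#BSUB -q normal_public
#BSUB -R bigmem
#BSUB -u firstname.lastname@tufts.edu
./myprogram
EOF

# List the options embedded in the script.
grep '^#BSUB' myjob.lsf
```

On the cluster, the script would then be submitted with: bsub < myjob.lsf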

I need to submit hundreds of jobs using the same program but with different input data. What is the best practice?

LSF provides a structure called a job array that allows a sequence of jobs that share the same executable and resource requirements, but have different input files, to be submitted, controlled, and monitored as a single unit. Using the standard LSF commands, you can also control and monitor individual jobs and groups of jobs submitted from a job array.
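
A job array is typically submitted with the bsub -J option, giving a name and an index range, and each element can read its own index from the LSB_JOBINDEX environment variable that LSF sets. The sketch below shows one way an element might map its index to its own files. The names myarray, run_element.sh, input.N and output.N are assumptions made for illustration, and the script only prints the command it would run, so it can be tried outside LSF.

```shell
#!/bin/bash
# Body of a hypothetical job script, run_element.sh, run once per array element.
# LSF sets LSB_JOBINDEX to the element's index; default to 1 so the script
# can be tried by hand outside LSF.
element_command() {
    index=${LSB_JOBINDEX:-1}
    # input.N and output.N are an assumed naming scheme
    echo "./myprogram < input.${index} > output.${index}"
}

# Print the command this element would run (replace echo with the real call).
element_command

# Submission, run once from the login node (100 elements, indices 1-100):
#   bsub -J "myarray[1-100]" ./run_element.sh
# A single element can be checked or killed by index, e.g. bjobs "job_ID[5]".
```

Keeping the index-to-file mapping in one function makes it easy to change the naming scheme without touching the submission command.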

How can I submit jobs to LSF on the cluster from my workstation without actually logging into the cluster?

If you have ssh on your workstation, try the following:
> ssh cluster.uit.tufts.edu ". /etc/profile.d/lsf.sh && bsub -q queuename ./yourprogram"
where queuename is one of the queues listed above.

Additional user-contributed cluster documentation

The following has been contributed by Rachel Lomasky. A web version is available, and a PDF version is attached to this page under Wiki Page Operations.
