Update Announcement

Attention Cluster users:

This is an update regarding the Tufts Research Cluster expansion project. The final phase of the project is now underway. The three earlier phases (hardware acquisition, installation, and production migration) have been completed successfully.

We expect the final phase to complete in early September. Work is ongoing with IBM to finalize network hardware and connectivity for about 30 nodes. In addition, the ATLAS project host, Cobalt, is expected to rejoin the cluster this week. Once the combined clusters are stable, we will address software updates and other minor issues that may appear as a result of this upgrade.

SSH-related login key issues

We are aware that changes to the ssh keys have caused some connectivity
problems with the new cluster head node. A simple solution is to remove the
old cluster keys from your ssh client. This process depends on how you
connect. If you use ssh directly (Mac OS X, or Windows with Cygwin), you can
remove the old key in the following manner:

Launch a terminal application:

  • Mac OS X: launch Terminal.app
  • Windows with Cygwin: launch Cygwin and then startx

From the shell prompt, run the following command:

rm -f ~/.ssh/known_hosts
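
Alternatively, if you prefer not to delete all of your saved host keys, OpenSSH-based clients can remove individual entries with ssh-keygen. The hostnames below are examples; substitute whichever name you use to connect:

ssh-keygen -R cluster6.uit.tufts.edu
ssh-keygen -R login.cluster.tufts.edu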

If you are using a Windows ssh or sftp client, you may need to use its
built-in key management tool. Unfortunately, there are too many ssh clients
for us to provide exact steps for each; your client's documentation is the
best source of information for making this change.

Once you have connected to the new head node, you may need to clear old
ssh keys to ensure connectivity with all of the new cluster nodes. The
easiest way to do this is to run the following command after logging in:

cleanSSH.sh

...

2014 Cluster Upgrade Information


Cluster Changes Overview

  • Compute nodes consist of Intel-based Cisco and IBM servers
  • Slurm replaces LSF as the job scheduling and load management software
  • ssh logins to compute nodes are restricted
  • Compute-node cross-mounts of local temporary disk storage, such as /scratch, are removed
  • On compute nodes, /scratch2 is replaced with /scratch
  • The Slurm interactive partition is limited to 2 nodes for increased performance and quality of service
  • New login and file transfer nodes provide better user service


Background

Every 3 years, the Tufts high-performance computing cluster environment is partially refreshed to offer increased capacity. We did so in 2008 and 2011, and again in 2014. We have had 4 generations of IBM hardware, and we will soon welcome a new generation of Cisco UCS hardware. Our oldest IBM hardware (HS21 and HS22 blades) has been retired, and our newest Cisco hardware has been installed as the basis of the new cluster. We will integrate current cluster IBM hardware into the new Cisco environment during the Fall 2014 semester. There will be two migrations of IBM hardware, roughly 5 weeks apart. As a result of these migrations, cluster6.uit.tufts.edu will have fewer nodes. Concurrent user access to the old and new clusters will be available to allow current cluster account holders to transition during the fall semester.

By late Dec. 2014, the hardware transition to the new cluster production environment will provide a total of 2680 cores (1680 IBM and 1000 Cisco cores), compared to 2110 cores during the 2013-2014 academic year. Due to a significantly different and newer architecture, the 1000 Cisco cores are roughly twice as powerful. Network performance between nodes is also improved due to new hardware. In practical terms, this refresh represents the largest increase in compute power for the Tufts HPC community.



User-facing access nodes for new cluster services

There are two new nodes for accessing the new cluster:

  • login.cluster.tufts.edu
  • xfer.cluster.tufts.edu

Your Tufts utln and password (the same credentials used on cluster6.uit.tufts.edu) are required for normal access to these nodes.
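
For example, from a terminal you can reach the login node with ssh; replace utln below with your own Tufts username:

ssh utln@login.cluster.tufts.edu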

The login node functions in the same fashion as the old cluster, cluster6.uit.tufts.edu. This includes compiling, Slurm job submissions, editing, etc. However, the need to transfer data into or out of the cluster is now separated from the login node. A second node, xfer.cluster.tufts.edu, is provided for file transfers. This ensures a better quality of service for login node users and minimizes storage-related logistics. Access to xfer.cluster.tufts.edu is via scp, sftp, and rsync; scp-based desktop file transfer programs such as WinSCP, FileZilla, etc. will work as well. One may also ssh into the transfer node to initiate transfers as needed from either the head node, login.cluster.tufts.edu, or a desktop/laptop. Note that xfer.cluster.tufts.edu is a file transfer service only, not another head node; normal head node functions such as Slurm, compilers, etc. are unavailable on xfer.cluster.tufts.edu.
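
As an illustration, command-line transfers from a desktop or laptop to the transfer node might look like the following; the username jdoe and the file and directory names are placeholders only:

scp mydata.tar.gz jdoe@xfer.cluster.tufts.edu:~/
rsync -av results/ jdoe@xfer.cluster.tufts.edu:~/results/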

 

Slurm replaces LSF

As part of this hardware refresh, we have also replaced the LSF load management software with Slurm on the new cluster. We made this decision as have many other high-performance computing centers worldwide. The benefit to Tufts users is an experience built on a common understanding and implementation strategy for scaling HPC jobs.

See the wiki page aboutslurm for Tufts LSF users.

Documentation sources and examples are provided for orientation purposes.
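
As a quick orientation (a sketch only; the partition names and resource values below are illustrative and may differ on the Tufts cluster), common LSF commands map to Slurm roughly as follows:

bsub < myjob.sh   ->   sbatch myjob.sh
bjobs             ->   squeue -u $USER
bkill <jobid>     ->   scancel <jobid>
bqueues           ->   sinfo

A minimal Slurm batch script might look like this:

#!/bin/bash
# Minimal example only; the partition name, time, and memory values are illustrative.
#SBATCH --job-name=example
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --mem=2G

# Replace with your actual program or commands.
./my_program

Submit it with "sbatch myjob.sh" and check its status with "squeue -u $USER". For an interactive session on the interactive partition mentioned in the overview above, something like "srun -p interactive --pty bash" can be used.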


Cluster Maintenance Window

During Sept. 2015, a maintenance window (Wednesdays, 6-7am) was established to support the installation and configuration of software and licenses, similar to that found at other HPC sites. For the most part, maintenance activities during this window will not have any direct effect on access or job placement. With history as a guide, these events last several minutes or less and do not happen often. In cases where user impact is expected, sufficient notification will be given. If you have any concerns, please contact tts-support@tufts.edu.