The HPC facility (SURYA) comprises 16 CPU compute nodes (640 cores) and 4 GPU compute nodes (160 CPU cores, each node equipped with 2 Nvidia Tesla V100 GPUs, giving 8 GPUs and 40,960 CUDA cores in total), along with 8.5 TB of RAM. The facility uses a DDN GRIDScaler parallel file system (~200 TB) delivering ~15 GB/s throughput over a 100 Gbps interconnect network.
When a job is submitted, it is placed in a queue. Different queues are available for different purposes. Users must select the queue from the list below that is appropriate for their computational needs.
| Queue Name | No. of Nodes | x86 Processors | CUDA Cores | Node(s) | Walltime | Max Jobs per User |
| --- | --- | --- | --- | --- | --- | --- |
| CORE160 | 4 | 160 | - | Any CPU node {1-16} | 360 hrs | 1 |
| CORE320 | 8 | 320 | - | Any CPU node {1-16} | 24 hrs | 1 |
| GPU | 1 | 40 | 10,240 | Any GPU node {17-20} | 360 hrs | 1 |
Based on the queuing system given above, the node configurations can be summarized as follows:
| Queue Type | Queue Name | Node Configuration |
| --- | --- | --- |
| CPU | CORE160 | CPU: 160 cores, RAM: 1,536 GB |
| CPU | CORE320 | CPU: 320 cores, RAM: 3,072 GB |
| GPU | GPU | CPU: 40 cores, RAM: 384 GB, 2x Tesla V100 (16 GB each) |
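Before picking a queue, it can help to check the current queue limits and node availability from the login node. The sketch below uses standard PBS/Torque client utilities; their availability on SURYA's login node is assumed and not stated in this guide.

```
# List all configured queues with their limits and current job counts (standard PBS command)
qstat -q

# Show the state of every compute node, e.g. free or job-exclusive
# (standard PBS/Torque utility; output format varies between PBS versions)
pbsnodes -a
```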
```
#!/bin/bash
# Sample job submission script for the CORE160 queue (4 nodes x 40 cores = 160 MPI processes)
#PBS -u FACULTY_NAME
#PBS -N STUDENT_NAME
#PBS -q core160
#PBS -l nodes=4:ppn=40
#PBS -o out.log
#PBS -j oe
#PBS -V

# Load the Intel compiler and MPI environment
module load compilers/intel/parallel_studio_xe_2018_update3_cluster_edition

# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR

# Launch the MPI job on the allocated nodes
mpiexec.hydra -f $PBS_NODEFILE -np 160 "script_name.sh"
./Job_script.sh
exit;
```
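To submit the script above, save it to a file and pass it to qsub; the filename submit_core160.sh below is only a placeholder. The same procedure applies to the CORE320 and GPU scripts that follow.

```
# Submit the CORE160 job script; qsub prints the job ID on success
qsub submit_core160.sh

# Confirm the job is queued (Q) or running (R)
qstat -a
```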
```
#!/bin/bash
# Sample job submission script for the CORE320 queue (8 nodes x 40 cores = 320 MPI processes)
#PBS -u FACULTY_NAME
#PBS -N STUDENT_NAME
#PBS -q core320
#PBS -l nodes=8:ppn=40
#PBS -o out.log
#PBS -j oe
#PBS -V

# Load the Intel compiler and MPI environment
module load compilers/intel/parallel_studio_xe_2018_update3_cluster_edition

# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR

# Launch the MPI job on the allocated nodes
mpiexec.hydra -f $PBS_NODEFILE -np 320 "script_name.sh"
./Job_script.sh
exit;
```
```
#!/bin/bash
# Sample job submission script for the GPU queue
#PBS -u FACULTY_NAME
#PBS -N STUDENT_NAME
#PBS -q gpu
# Request one GPU:
#PBS -l select=1:ncpus=20:ngpus=1
# For two GPUs, use this line instead:
##PBS -l select=2:ncpus=20:ngpus=1
# To request a specific GPU node (17-20), use this line instead:
##PBS -l select=2:ncpus=20:ngpus=1:host=node{17-20}
#PBS -o out.log
#PBS -j oe
#PBS -V

# Load the Intel compiler and MPI environment and make the CUDA libraries visible
module load compilers/intel/parallel_studio_xe_2018_update3_cluster_edition
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export CUDA_VISIBLE_DEVICES=0,1

# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR

python your_script_name.py
mpirun -np 2 your_script_name.sh
./Job_script.sh
exit;
```
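To confirm that a GPU job actually sees the requested Tesla V100s, a quick check can be added after the export lines in the script above, or run interactively on the allocated GPU node. nvidia-smi is the standard Nvidia driver utility; it is assumed to be on the default PATH of the GPU nodes.

```
# List the GPUs visible to this job; with CUDA_VISIBLE_DEVICES=0,1 both V100s should appear
nvidia-smi

# Print only the GPU model and memory (supported by recent Nvidia drivers)
nvidia-smi --query-gpu=name,memory.total --format=csv
```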
Useful Commands
```
ssh <username>@172.20.70.12   # log in to the cluster
qsub submit_script.sh         # submit a job script to the queue
qstat {-a, -s, -n}            # check the status of jobs and queues
ssh node{1-20}                # log in to a compute node (1-20)
qdel <job-id>                 # delete a queued or running job
```
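A typical end-to-end session with these commands might look like the following sketch; the username, script name, node number, and job ID are placeholders.

```
# Log in to the cluster
ssh myuser@172.20.70.12

# Submit a job script; qsub prints the job ID (e.g. 1234)
qsub submit_script.sh

# Monitor jobs and their node assignments
qstat -a
qstat -n

# Optionally inspect a compute node the job is running on
ssh node5

# Cancel the job if required
qdel 1234
```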
Usage Guidelines