How to use BLAS on taki

Note: This page has not been updated to reflect the move to taki from maya.


The BLAS (Basic Linear Algebra Subprograms) library is useful for efficient matrix computations. You may want to consider using it instead of writing your own routines. In general there are three “levels” of BLAS functions. Level 1 contains simple operations like the dot-product. Level 2 contains operations like matrix-vector products. Level 3 contains operations like matrix-matrix products. In this tutorial, we will show how to compile and run a program that uses BLAS to multiply matrices. Before you begin, make sure to read the tutorial for compiling C programs.

Where to find documentation

On this page, we’ll show how to use BLAS functionality within AMD Core Math Library (ACML). We’ll assume that you’re using the default “pgi-9.0-mvapich2-1.4rc2” switcher setting – other MPI+compiler combinations may not work with the code below.

For more information about the BLAS routines in ACML, see

  • AMD’s ACML website
  • BLAS at Netlib has a lot of information about the library. One useful section contains FORTRAN code for the procedures
  • The manual pages on maya (e.g. “man dgemm”) also show the FORTRAN function interfaces
  • LAPACK User’s Guide
  • The ACML header file on maya shows the C interface. This file is located in: /usr/cluster/pgi/linux86-64/9.0-4/include/acml.h. You can browse through this file and see all the available functions and their arguments & return values. Note that there are two versions of most functions listed – one with an underscore (“_”) at the end, and one without an underscore. Both are versions of the same functional, but the underscore version is compatible with FORTRAN conventions, and the no-underscore version is a bit easier to use.


In this example we’ll multiply two matrices together using the BLAS dgemm function. We will multiply A (m x k) with B (k x n) to produce C (m x n). Our matrices will contain doubles, and be stored in column-major order. Here is the code

#include <stdio.h>
#include <acml.h>

#define MATRIX_IDX(n, i, j) j*n + i
#define MATRIX_ELEMENT(A, m, n, i, j) A[ MATRIX_IDX(m, i, j) ]

void init_matrix(double* A, int m, int n)
   double element = 1.0;
   for (int j = 0; j < n; j++)
      for (int i = 0; i < m; i++)
         MATRIX_ELEMENT(A, m, n, i, j) = element;
         element *= 0.9;

void print_matrix(const double* A, int m, int n)
   for (int i = 0; i < m; i++)
      for (int j = 0; j < n; j++)
          printf("%8.4f", MATRIX_ELEMENT(A, m, n, i, j));

int main(int argc, char** argv)
   int m = 3;
   int n = 4;
   int k = 5;

   double A[m * k];
   double B[k * n];
   double C[m * n];

   init_matrix(A, m, k);
   init_matrix(B, k, n);

   printf("Matrix A (%d x %d) is:\n", m, k);
   print_matrix(A, m, k);

   printf("\nMatrix B (%d x %d) is:\n", k, n);
   print_matrix(B, k, n);

   dgemm('N', 'N', m, n, k, 1.0, A, m, B, k, 0.0, C, m);

   printf("\nMatrix C (%d x %d) = AB is:\n", m, n);
   print_matrix(C, m, n);

   return 0;

Download: ../code/blas-matrix-multiply/pgi_acml/main.c

The important line is

dgemm('N', 'N', m, n, k, 1.0, A, m, B, k, 0.0, C, m);

which is the actual matrix multiplication call. The ‘N’ arguments indicate that we do not want BLAS to use the transpose of either A or B, but rather A and B themselves. The dgemm function is a little bit more general than just a matrix multiply. As the Netlib FORTRAN documentation indicates, it can handle operations of the form

C := alpha*op( A )*op( B ) + beta*C,

which allows us to easily accumulate results into C if we wish. In this example we have taken the constant alpha to be 1.0 and the constant beta to be 0.0. Notice that we did not have to initialize C’s entries to zeros for this reason.

Next is the Makefile. The important part is that we need to link to two more libraries: libacml and libpgftnrtl. This is accomplished by adding them to the LDFLAGS variable.

EXECUTABLE := matrix_multiply
OBJS := main.o

CFLAGS := -O3 -c99 -Minform=warn -fastsse

LDFLAGS := -lm -lacml -lpgftnrtl

CC := mpicc $(INCLUDES)

%.o: %.c %.h
    $(CC) $(CFLAGS) $(DEFS) $(INCLUDES) -c $< -o $@


    -rm -f *.o $(EXECUTABLE)

Download: ../code/blas-matrix-multiply/pgi_acml/Makefile

Compiling the code should look something like this

[araim1@maya-usr1 pgi_acml]$ make
mpicc  -O3 -c99 -Minform=warn -fastsse   -c -o main.o main.c
mpicc  -O3 -c99 -Minform=warn -fastsse   main.o -o matrix_multiply  -lm -lacml -lpgftnrtl
[araim1@maya-usr1 pgi_acml]$ ls
main.c  main.o  Makefile  matrix_multiply
[araim1@maya-usr1 pgi_acml]$ 

Then running the code should look this

[araim1@maya-usr1 pgi_acml]$ ./matrix_multiply
Matrix A (3 x 5) is:
  1.0000  0.7290  0.5314  0.3874  0.2824
  0.9000  0.6561  0.4783  0.3487  0.2542
  0.8100  0.5905  0.4305  0.3138  0.2288

Matrix B (5 x 4) is:
  1.0000  0.5905  0.3487  0.2059
  0.9000  0.5314  0.3138  0.1853
  0.8100  0.4783  0.2824  0.1668
  0.7290  0.4305  0.2542  0.1501
  0.6561  0.3874  0.2288  0.1351

Matrix C (3 x 4) = AB is:
  2.5543  1.5083  0.8906  0.5259
  2.2989  1.3575  0.8016  0.4733
  2.0690  1.2217  0.7214  0.4260
[araim1@maya-usr1 pgi_acml]$

Running the code through the batch system should require no special steps. A standard SLURM script like the one below should be sufficient.

#SBATCH --job-name=matrix_multiply
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=develop
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1


Download: ../code/blas-matrix-multiply/pgi_acml/run.slurm