=======================================================
Stencil Domain Specific Language Compiler (sdslc) 0.3.2
=======================================================

The Stencil Domain Specific Language Compiler (sdslc) is a source-to-source 
translator for C/C++/MATLAB files with embedded sections of the Stencil Domain 
Specific Language (SDSL).

-----------------
REQUIRED SOFTWARE
-----------------
The following components are required to build the SDSL compiler:

  * Apache Ant
  * Bison
  * CMake 2.8 or higher
  * gcc/g++ 4.4 or higher
  * Java JDK 1.6 or higher
    - Must be JDK, not JRE
  * LLVM 3.0 or higher
    - Must be built with CMake
  * Nvidia CUDA SDK 5.0 or higher
  * Python 2.7
  * MATLAB

The SDSL compiler has been successfully built and tested on Fedora 16, 
Ubuntu 12.04, and RHEL 6.3.
    
--------
BUILDING
--------
To set up the build environment, please set the following environment variables:

  * JAVA_HOME: Set to installation path of Java JDK
    e.g. 'export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk'
  * PATH: Make sure 'ant' is available on your PATH
  
The entire build process is controlled by a CMake script. You can
generate the makefiles for the project by creating a build directory
and invoking CMake. The following CMake options are recognized:

  * OT_LLVM_BINARY_DIR
    - The installation path of LLVM
    - Required
  * CUDA_INSTALL_DIR
    - The installation path of the CUDA SDK 
    - Defaults to /usr/local/cuda
  * SDSLC_INSTALL_DIR
    - Path to install sdslc at
    - Defaults to /usr/local

The following series of commands will build sdslc starting from a tar.gz 
distribution, and install it at /usr/local/sdslc-0.3.2:

  $ tar xzvf sdslc-0.3.2.tar.gz
  $ cd sdslc-0.3.2
  $ mkdir build
  $ cd build
  $ cmake -DOT_LLVM_BINARY_DIR=$LLVM_ROOT -DCUDA_INSTALL_DIR=$CUDA_ROOT \
          -DSDSLC_INSTALL_DIR=/usr/local/sdslc-0.3.2 ..
  $ make
  $ make install
  
Root user or sudo access may be required for the 'make install' command, 
depending on the value of SDSLC_INSTALL_DIR. The main executable produced is 
the $SDSLC_INSTALL_DIR/bin/sdslc script that wraps the sdslc Java program.

-----
USAGE
-----
The basic usage of the SDSL compiler involves writing a C/C++ source file with 
embedded SDSL syntax. The embedded SDSL code must be placed between 
'#pragma sdsl begin' and '#pragma sdsl end' statements, e.g.

  #pragma sdsl begin
    int dim0;
    int dim1;

    iterate { ... }
  #pragma sdsl end
  
Any grid data or parameters defined in SDSL must have corresponding arrays or 
variables with the same name and type defined in the local C/C++/MATLAB scope.

--------
EXAMPLES
--------
Full examples are included in the examples/general directory (and its 
subdirectories) and in the examples/cdsc directory.

The affine versions of the general benchmarks can be built with the following 
commands:

  $ cd $SDSLC_INSTALL_DIR/share/sdslc/examples/general
  $ make affine

This will call the sdslc compiler to produce intermediate C code and gcc
to produce executables in each benchmark's subdirectory.
  
These codes are built to run on the CPU and contain affine C sections 
demarcated by '#pragma scop begin' and '#pragma scop end'. They can be 
further optimized with polyhedral compilation tools such as the following:

  * PoCC 
    - http://www.cs.ucla.edu/~pouchet/software/pocc
  * PolyOpt/C 
    - http://www.cs.ucla.edu/~pouchet/software/polyopt
    
The overlap tiled (overtile) versions of the general benchmarks can be built 
with the following commands:

  $ cd $SDSLC_INSTALL_DIR/share/sdslc/examples/general
  $ make overtile

This will call the sdslc compiler to produce intermediate CUDA code and nvcc
to produce executables in each benchmark's subdirectory.

These codes are built to run on CUDA-capable Nvidia GPUs of the Fermi and Kepler
generations. For GT2xx series chips (GeForce GTX2xx, Tesla C10xx), it is 
necessary to add the '--legacy-gpu' option to the SDSLC_FLAGS variable in the 
examples/common.mk file.

Nested and hybrid split-tiled versions of the general benchmarks can be built 
for the SSE2 vector ISA with the following commands:

  $ cd $SDSLC_INSTALL_DIR/share/sdslc/examples/general
  $ make nest-dlt
  $ make hyb-dlt

This source distribution also contains a number of medical imaging codes from 
the Center for Domain Specific Computing (CDSC). Each CDSC example can be built 
individually with the following commands:

  $ cd $SDSLC_INSTALL_DIR/share/sdslc/examples/cdsc/<benchmark>
  $ make

This will build affine, overtile, nested split-tiled, and hybrid split-tiled 
versions of the benchmarks. The pipeline app can also be built as a standalone 
object file by executing 'make pipeline-obj' in the pipeline-3d directory.

----------
AUTOTUNING
----------
The overtile GPU versions of all benchmarks can be autotuned to achieve maximum 
performance on the current GPU. For the general benchmarks, single and double 
precision versions can be autotuned by executing the following commands:

  $ cd $SDSLC_INSTALL_DIR/share/sdslc/examples/general/<benchmark>
  $ make autotune-sp
  $ make autotune-dp

Each CDSC benchmark can be autotuned with the following commands:

  $ cd $SDSLC_INSTALL_DIR/share/sdslc/examples/cdsc/<benchmark>
  $ make autotune-<benchmark>
  
The autotuner works by repeatedly executing a benchmark with different thread
block sizes, space tile sizes, and time tile sizes. This process can take a very
long time (multiple hours) and can also lead to combinations of sizes that are 
not compatible with the current GPU. When a size combination is unable to 
execute on the current GPU, an error message is printed and the next size 
combination is tried. It is perfectly normal to see long stretches where 
most execution attempts fail.

The files examples/general/autotune*.conf and examples/cdsc/autotune*.conf are 
used to configure the autotuner. Thread block and tile size ranges along with 
flags for sdslc (such as '--legacy-gpu') and nvcc can be specified in these 
files.

At any time during an autotuning run the fastest code is available in both SDSL 
and CUDA source as <benchmark>.sdsl.autotuned.<orig suffix> and 
<benchmark>.autotuned.cu.


=================================
INTEGRATING SDSL INTO MATLAB CODE
=================================
Many programs for scientific computation and visualization are written in 
MATLAB (licenses can be obtained from www.mathworks.com). MATLAB provides 
a simple programming environment, but running MATLAB code is often 
time-consuming. A common technique for accelerating MATLAB code is to use 
MEX-files.

The compute-intensive parts of a MATLAB program can be rewritten in SDSL,
and the resulting MATLAB code can then be compiled by the sdslc compiler to
generate a C/Overtile MEX-file. The generated MEX-function can be 
independently optimized for different architectures. 

To illustrate the integration of SDSL into MATLAB, a MATLAB code for
denoising (riciandenoise.m) and its SDSL-integrated equivalent 
(riciandenoise_sdsl.m) are provided in docs. The loop at lines 70-109
represents a stencil computation that can be better optimized if expressed
in SDSL. Since all the arrays used within the loop are passed as arguments 
to the MEX-function, these arrays must already be defined before the SDSL 
stencil in the SDSL-integrated MATLAB code.

The MATLAB code in riciandenoise.m uses a 2d-array 'ulast' to create a copy
of u. However, in SDSL, the array u is defined on timesteps 0,1 as: 
     double griddata u on gr at 0,1;
In SDSL, ulast can be replaced by [0]u[][]. Hence, there is no explicit
declaration for ulast in the SDSL code.

When writing the SDSL stencil, the grid dimensions must be declared first
(fastest varying dimension first), followed by the parameters defined in 
the surrounding host program. For example, in riciandenoise_sdsl.m, the 
SDSL stencil begins with the declarations:
    int N2;
    int N1;

    double dt;
    ...
    double gamma;

    grid gr[N1][N2];
    ...

The 2d-grid gr is defined to be of dimensions N1 x N2, with N2 being the 
fastest varying dimension. Hence, N2 is declared first, followed by N1. 
If the grid were 3-dimensional (N1 x N2 x N3), N3 would be declared 
before N2. After all the dimensions, the other scalars of the host 
program (dt, epsilon, ..., gamma) are declared, then the grid, and lastly 
the griddata on the grid at different timesteps. Further restrictions of 
the current SDSL syntax are explained in the user guide. 

---------------------
COMPILING MATLAB CODE
---------------------
The SDSL-integrated MATLAB code can be compiled with the following command 
line:
'sdslc -b <backend-arg> -o <output-filename> -i <isa> -p -f <input-filename>'
where <backend-arg> selects the backend for the MEX-function. The
supported backends are:

1. Affine-C (maffine-c) 
2. Overtile (movertile)
3. Nested split-tile (mnest-split-dlt)
4. Hybrid split-tile (mhyb-split-dlt)

The vector ISA (<isa>) can be specified for the split-tile backends. 
<output-filename> is the name of the .m file that will be generated, and it 
must be the same as the name of the MATLAB function in the input file. For 
example, in riciandenoise_sdsl.m, the function name is riciandenoise_c, and 
hence the output file name must be riciandenoise_c.m. If unspecified, the 
output file name defaults to a.m. More details about the command line 
arguments can be obtained by running 'sdslc --help'.
 
After compiling the SDSL-embedded input file (e.g. riciandenoise_sdsl.m), 
the SDSL compiler generates a .m output file (riciandenoise_c.m) that is 
similar to the input file, except that the embedded SDSL code is replaced 
by a call to a MEX-function. It also generates a source MEX-file: a .c file 
(if the backend was maffine-c, mhyb-split-dlt, or mnest-split-dlt) or a .cu 
file (if the backend was movertile), which can then be compiled in MATLAB 
by a MEX compiler. Executing the output .m file in MATLAB calls the 
generated MEX-function, which executes faster than the native MATLAB code. 

----------------------------------------------
TILE SIZE CONSTRAINTS FOR SPLIT TILING BACKEND
----------------------------------------------
In nested split-tiling, a d-dimensional loop nest is recursively split-tiled 
along each dimension. Hybrid split-tiling combines standard tiling on the 
outer loop with split-tiling on the inner (d-1) loops.
The current implementation of the split-tile backends imposes the following
constraints on the input problem size and SDSL tile sizes: 
 
1. For the innermost dimension that is split-tiled, the sum of the upright tile 
   size (SDSL_UPR_TILE_SIZE) and inverted tile size (SDSL_INV_TILE_SIZE) must
   be a factor of (problem size in the innermost dimension / VF). That is,
       SDSL_UPR_TILE_SIZE_0 + SDSL_INV_TILE_SIZE_0 = PB_SIZE_0 / VF;

   MATLAB stores matrices in column-major order. We retain the same storage 
   order when passing the arrays as arguments to the MEX-function, and change 
   the stencil's access patterns instead to avoid array permutations. Thus,
   a (380x420) array in MATLAB will be passed as a (420x380) array to the 
   MEX-function. PB_SIZE_0 will therefore change accordingly for the 
   split-tile backends with MATLAB. 

2. The time tile size (SDSL_TT_SIZE) must not exceed the number of iterations 
   of the outer loop of the stencil (defined in the iterate construct).
   For example, in the benchmark examples/cdsc/register-3d, the outer loop of
   the stencil iterates 3 times. Therefore, SDSL_TT_SIZE <= 3.
   
3. If the SDSL code performs a reduction after every K iterations (with K 
   specified in the check every () clause), the generated code will have 
   a guarding if-statement surrounding the reduction:
       if ((tt + SDSL_TT_SIZE) % m == 0)
   The values of m and SDSL_TT_SIZE must be chosen such that 
   m * SDSL_TT_SIZE = K.

The tile sizes are set in the generated C code using compile-time constant 
#defines. We can change their values to ensure that the above-mentioned 
constraints are satisfied. 

To illustrate all the above-mentioned constraints, consider the SDSL code in 
examples/cdsc/denoise-3d. The problem size is 128x128x128. The reduction is
performed after every 4 iterations, and the outer loop of the stencil iterates 
for a maximum of 50 iterations. For hybrid split-tiled code with the SSE2 
extension, the vectorization factor (VF) is 4.

  * Since the reduction is performed every 4 iterations, we can set SDSL_TT_SIZE  
    to 4 and m to 1, so that the guarding if-statement surrounding the 
    reduction becomes
           'if ((tt + SDSL_TT_SIZE) % 1 == 0)'.   
  * For the innermost dimension, PB_SIZE_0 = 128. Thus, the constraint is: 
       SDSL_UPR_TILE_SIZE_0 + SDSL_INV_TILE_SIZE_0 = 128/4;
    We can set the values of SDSL_UPR_TILE_SIZE and SDSL_INV_TILE_SIZE for both 
    inner dimensions so that their sum is a factor of 32. 

  A possible value assignment is:
  
        #define SDSL_TT_SIZE (4)                               
        #define SDSL_UPR_TILE_SIZE_1 (24)
        #define SDSL_INV_TILE_SIZE_1 (8)
        #define SDSL_UPR_TILE_SIZE_0 (24)                                 
        #define SDSL_INV_TILE_SIZE_0 (8)

The split-tile backends are experimental in this release and are expected to
undergo further revisions. The code generated by the hybrid and nested 
split-tiling backends passes the correctness checks for all the benchmarks in 
examples/general and examples/cdsc; if you discover a bug, please notify the 
author.  
 
---------------
FURTHER READING
---------------

More details about SDSL and the SDSL compiler can be found in the user guide 
located at share/sdslc/doc/sdsl-guide-0.3.2.pdf