=======================================================
Stencil Domain Specific Language Compiler (sdslc) 0.3.2
=======================================================

The Stencil Domain Specific Language Compiler (sdslc) is a source-to-source
translator for C/C++/MATLAB files with embedded sections of the Stencil Domain
Specific Language (SDSL).

-----------------
REQUIRED SOFTWARE
-----------------

The following components are required to build the SDSL compiler:

* Apache Ant
* Bison
* CMake 2.8 or higher
* gcc/g++ 4.4 or higher
* Java JDK 1.6 or higher
  - Must be JDK, not JRE
* LLVM 3.0 or higher
  - Must be built with CMake
* Nvidia CUDA SDK 5.0 or higher
* Python 2.7
* MATLAB

The SDSL compiler has been successfully built and tested on Fedora 16,
Ubuntu 12.04, and RHEL 6.3.

--------
BUILDING
--------

To set up the build environment, please set the following environment
variables:

* JAVA_HOME: Set to the installation path of the Java JDK
  e.g. 'export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk'
* PATH: Make sure 'ant' is available on your PATH

The entire build process is controlled by a CMake script. You can generate the
makefiles for the project by creating a build directory and invoking CMake.
The following CMake options are recognized:

* OT_LLVM_BINARY_DIR
  - The installation path of LLVM
  - Required
* CUDA_INSTALL_DIR
  - The installation path of the CUDA SDK
  - Defaults to /usr/local/cuda
* SDSLC_INSTALL_DIR
  - Path to install sdslc at
  - Defaults to /usr/local

The following series of commands will build sdslc starting from a tar.gz
distribution, and install it at /usr/local/sdslc-0.3.2:

  $ tar xzvf sdslc-0.3.2.tar.gz
  $ cd sdslc-0.3.2
  $ mkdir build
  $ cd build
  $ cmake -DOT_LLVM_BINARY_DIR=$LLVM_ROOT -DCUDA_INSTALL_DIR=$CUDA_ROOT \
          -DSDSLC_INSTALL_DIR=/usr/local/sdslc-0.3.2 ..
  $ make
  $ make install

Root user or sudo access may be required for the 'make install' command,
depending on the value of SDSLC_INSTALL_DIR.
The main executable produced is the $SDSLC_INSTALL_DIR/bin/sdslc script, which
wraps the sdslc Java program.

-----
USAGE
-----

The basic usage of the SDSL compiler involves writing a C/C++ source file with
embedded SDSL syntax. The embedded SDSL code must be placed between
'#pragma sdsl begin' and '#pragma sdsl end' statements, e.g.

  #pragma sdsl begin
  int dim0;
  int dim1;
  iterate {
    ...
  }
  #pragma sdsl end

Any grid data or parameters defined in SDSL must have corresponding arrays or
variables with the same name and type defined in the local C/C++/MATLAB scope.

--------
EXAMPLES
--------

Full examples are included in the examples/general directory (and its
subdirectories) and in the examples/cdsc directory.

The affine versions of the general benchmarks can be built with the following
commands:

  $ cd $SDSLC_INSTALL_DIR/share/sdslc/examples/general
  $ make affine

This will call the sdslc compiler to produce intermediate C code and gcc to
produce executables in each benchmark's subdirectory. These codes are built to
run on the CPU and contain affine C sections demarcated by
'#pragma scop begin' and '#pragma scop end'. They are intended to be further
optimized with polyhedral compilation tools such as the following:

* PoCC      - http://www.cs.ucla.edu/~pouchet/software/pocc
* PolyOpt/C - http://www.cs.ucla.edu/~pouchet/software/polyopt

The overlap tiled (overtile) versions of the general benchmarks can be built
with the following commands:

  $ cd $SDSLC_INSTALL_DIR/share/sdslc/examples/general
  $ make overtile

This will call the sdslc compiler to produce intermediate CUDA code and nvcc to
produce executables in each benchmark's subdirectory. These codes are built to
run on CUDA-capable Nvidia GPUs of the Fermi and Kepler generations. For GT2xx
series chips (GeForce GTX2xx, TESLA C10xx) it is necessary to add the
'--legacy-gpu' option to the SDSLC_FLAGS variable in the examples/common.mk
file.
Nested and hybrid split-tiled versions of the general benchmarks can be built
for the SSE2 vector ISA with the following commands:

  $ cd $SDSLC_INSTALL_DIR/share/sdslc/examples/general
  $ make nest-dlt
  $ make hyb-dlt

This source distribution also contains a number of medical imaging codes from
the Center for Domain Specific Computing (CDSC). Each CDSC example can be built
individually with the following commands:

  $ cd $SDSLC_INSTALL_DIR/share/sdslc/examples/cdsc/
  $ make

This will build affine, overtile, nested split-tiled, and hybrid split-tiled
versions of the benchmarks. The pipeline app can also be built as a standalone
object file by executing 'make pipeline-obj' in the pipeline-3d directory.

----------
AUTOTUNING
----------

The overtile GPU versions of all benchmarks can be autotuned to achieve maximum
performance on the current GPU. For the general benchmarks, single and double
precision versions can be autotuned by executing the following commands:

  $ cd $SDSLC_INSTALL_DIR/share/sdslc/examples/general/
  $ make autotune-sp
  $ make autotune-dp

The CDSC pipeline can be autotuned with the following commands:

  $ cd $SDSLC_INSTALL_DIR/share/sdslc/examples/cdsc/
  $ make autotune-

The autotuner works by repeatedly executing a benchmark with different thread
block sizes, space tile sizes, and time tile sizes. This process can take
multiple hours, and it can also generate size combinations that are not
compatible with the current GPU. When a size combination is unable to execute
on the current GPU, an error message is printed and the next size combination
is tried. It is perfectly normal to see long stretches where most execution
attempts fail.

The files examples/general/autotune*.conf and examples/cdsc/autotune*.conf are
used to configure the autotuner. Thread block and tile size ranges, along with
flags for sdslc (such as '--legacy-gpu') and nvcc, can be specified in these
files.
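The search strategy described above can be sketched in C as an exhaustive loop
over the three size ranges that skips failed runs and keeps the fastest
configuration. This is only an illustration, not the autotuner shipped with
sdslc: the names tune_result, run_fn, autotune, and mock_run are hypothetical,
and mock_run is a stand-in cost model rather than a real benchmark run.

```c
#include <stddef.h>

/* Best configuration seen so far (hypothetical type, not part of sdslc). */
struct tune_result {
    int block_size;   /* thread block size */
    int space_tile;   /* space tile size */
    int time_tile;    /* time tile size */
    double seconds;   /* run time of the best configuration */
    int found;        /* 0 if every combination failed */
};

/* Hypothetical callback: runs the benchmark once and returns the elapsed
 * time in seconds, or a negative value if the combination cannot execute. */
typedef double (*run_fn)(int block_size, int space_tile, int time_tile);

/* Try every combination, skip failures, and keep the fastest run. */
static struct tune_result autotune(const int *blocks, size_t n_blocks,
                                   const int *spaces, size_t n_spaces,
                                   const int *times, size_t n_times,
                                   run_fn run)
{
    struct tune_result best = {0, 0, 0, 0.0, 0};

    for (size_t b = 0; b < n_blocks; b++) {
        for (size_t s = 0; s < n_spaces; s++) {
            for (size_t t = 0; t < n_times; t++) {
                double secs = run(blocks[b], spaces[s], times[t]);
                if (secs < 0.0)
                    continue;  /* incompatible sizes: try the next one */
                if (!best.found || secs < best.seconds) {
                    best.block_size = blocks[b];
                    best.space_tile = spaces[s];
                    best.time_tile = times[t];
                    best.seconds = secs;
                    best.found = 1;
                }
            }
        }
    }
    return best;
}

/* Stand-in for a real benchmark run (assumption for illustration only):
 * rejects combinations above a made-up resource limit and otherwise
 * returns a synthetic run time. */
static double mock_run(int block_size, int space_tile, int time_tile)
{
    if (block_size * space_tile > 2048)
        return -1.0;
    return 1000.0 / (double)(block_size * time_tile);
}
```

Calling autotune with mock_run and small arrays of candidate sizes returns the
fastest combination that did not fail; a real driver would replace mock_run
with a function that compiles and times the generated CUDA code.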
At any time during an autotuning run, the fastest code is available in both
SDSL and CUDA source as .sdsl.autotuned. and .autotuned.cu.

=================================
INTEGRATING SDSL INTO MATLAB CODE
=================================

Many programs for scientific computation and visualization are written in
MATLAB, the license for which can be obtained from www.mathworks.com. MATLAB
provides a simple programming environment, but running MATLAB code is often
time-consuming. A general technique to accelerate MATLAB is to use MEX-files.
The compute-intensive parts of a MATLAB program can be rewritten in SDSL, and
the resulting MATLAB code can then be compiled by the sdslc compiler to
generate a C/Overtile MEX-file. The generated MEX-function can be independently
optimized for different architectures.

To illustrate the integration of SDSL into MATLAB, a MATLAB code for denoising
(riciandenoise.m) and its SDSL-integrated equivalent (riciandenoise_sdsl.m) are
made available in docs. The loop at lines 70-109 represents a stencil
computation which can be better optimized if represented in SDSL. Since all the
arrays used within the loop will be passed as arguments to the MEX-function,
these arrays must already appear before the definition of the SDSL stencil in
the SDSL-integrated MATLAB code.

The MATLAB code in riciandenoise.m uses a 2d-array 'ulast' to create a copy of
u. In SDSL, however, the array u is defined on timesteps 0,1 as:

  double griddata u on gr at 0,1;

In SDSL, ulast can be replaced by [0]u[][]. Hence, there is no explicit
declaration for ulast in the SDSL code.

When writing the SDSL stencil, the grid dimensions must be declared first
(starting with the fastest varying dimension), followed by the declaration of
parameters in the surrounding host program. For example, in
riciandenoise_sdsl.m, the SDSL stencil begins with the declaration:

  int N2;
  int N1;
  double dt;
  ...
  double gamma;
  grid gr[N1][N2];
  ...
The 2d-grid gr is defined to be of dimensions N1 x N2, with N2 being the
fastest varying dimension. Hence, N2 is declared first, followed by N1. If the
grid were 3-dimensional (N1 x N2 x N3), then N3 would be declared before N2.
After declaring all the dimensions, we declare the other scalars in the host
program (dt, epsilon, ..., gamma). Then we declare the grid, and lastly the
griddata on the grid at different timesteps. More restrictions on the current
SDSL syntax are explained in greater detail in the user guide.

---------------------
COMPILING MATLAB CODE
---------------------

The SDSL-integrated MATLAB code can be compiled using the following command
line options:

  'sdslc -b -o -i -p -f '

where the backend argument selects the appropriate backend for the
MEX-function. The supported backends are:

1. Affine-C (maffine-c)
2. Overtile (movertile)
3. Nested split-tile (mnest-split-dlt)
4. Hybrid split-tile (mhyb-split-dlt)

The vector ISA can be specified for the split-tile backends. The output
argument is the name of the .m file that will be generated, and it must be the
same as the name of the MATLAB function in the input file. For example, in
riciandenoise_sdsl.m, the function name is riciandenoise_c, and hence the
output file name must be riciandenoise_c.m. If unspecified, the output file
name defaults to a.m. More details about the command line arguments can be
obtained by running the command 'sdslc --help'.

After compiling the SDSL-embedded input file (e.g. riciandenoise_sdsl.m), the
sdsl compiler generates a .m output file (riciandenoise_c.m) that is similar to
the input file, except that the embedded SDSL code is replaced by a call to a
MEX-function. It also generates a MEX-file. This can be a .c file (if the
backend was maffine-c, mhyb-split-dlt, or mnest-split-dlt) or a .cu file (if
the backend was movertile), which can then be compiled in MATLAB by a MEX
compiler.
Executing the output .m file in MATLAB will call the generated MEX-function,
which executes faster than the native MATLAB code.

----------------------------------------------
TILE SIZE CONSTRAINTS FOR SPLIT TILING BACKEND
----------------------------------------------

In nested split-tiling, a d-dimensional loop nest is recursively split-tiled
along each dimension. Hybrid split-tiling is a hybrid of standard tiling on the
outer loop and split-tiling on the inner (d-1) loops. The current
implementation of the split-tile backend introduces the following constraints
on the input problem size and SDSL tile sizes:

1. For the innermost dimension that is split-tiled, the sum of the upright tile
   size (SDSL_UPR_TILE_SIZE) and the inverted tile size (SDSL_INV_TILE_SIZE)
   must be a factor of (problem size in the innermost dimension / VF). That is,

     SDSL_UPR_TILE_SIZE_0 + SDSL_INV_TILE_SIZE_0 = PB_SIZE_0 / VF;

   MATLAB stores matrices in column-major order. We retain the same storage
   order while passing the arrays as arguments to the MEX-function, and change
   the stencil's access patterns instead to avoid array permutations. Thus, a
   (380x420) array in MATLAB will be passed as a (420x380) array to the
   MEX-function. PB_SIZE_0 will therefore change accordingly for the
   split-tiled backends with MATLAB.

2. The time tile size (SDSL_TT_SIZE) must not exceed the number of iterations
   of the outer loop of the stencil (defined in the iterate construct). For
   example, in the benchmark examples/cdsc/register-3d, the outer loop of the
   stencil iterates 3 times. Therefore, SDSL_TT_SIZE <= 3.

3. If the SDSL code performs a reduction after every K iterations (with K
   specified in the check every () clause), the generated code will have a
   guarding if-statement surrounding the reduction:

     if ((tt + SDSL_TT_SIZE) % m == 0)

   The values of m and SDSL_TT_SIZE must be chosen such that
   m * SDSL_TT_SIZE = K.

The tile sizes are set in the generated C code using compile-time constant
#defines.
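The three constraints above can be checked mechanically before rebuilding the
generated code. The following is a minimal sketch, not part of sdslc: the
struct and function names are hypothetical, "factor of" is read as "evenly
divides", and the check additionally assumes that the problem size is a
multiple of VF.

```c
#include <stdbool.h>

/* Hypothetical description of one split-tile configuration. */
struct split_tile_config {
    int pb_size_0;    /* problem size in the innermost dimension */
    int vf;           /* vectorization factor */
    int upr_tile_0;   /* SDSL_UPR_TILE_SIZE_0 */
    int inv_tile_0;   /* SDSL_INV_TILE_SIZE_0 */
    int tt_size;      /* SDSL_TT_SIZE */
    int iterations;   /* iteration count of the outer stencil loop */
    int check_every;  /* K from the check every () clause, 0 if absent */
    int guard_m;      /* m in the guarding if-statement */
};

/* Constraint 1: (UPR + INV) must evenly divide PB_SIZE_0 / VF.
 * Assumption: PB_SIZE_0 itself must be a multiple of VF. */
static bool inner_tile_ok(const struct split_tile_config *c)
{
    if (c->vf <= 0 || c->pb_size_0 % c->vf != 0)
        return false;
    int tile_sum = c->upr_tile_0 + c->inv_tile_0;
    if (tile_sum <= 0)
        return false;
    return (c->pb_size_0 / c->vf) % tile_sum == 0;
}

/* Constraint 2: the time tile must not exceed the outer-loop iterations. */
static bool time_tile_ok(const struct split_tile_config *c)
{
    return c->tt_size >= 1 && c->tt_size <= c->iterations;
}

/* Constraint 3: m * SDSL_TT_SIZE must equal K when a reduction is present. */
static bool reduction_ok(const struct split_tile_config *c)
{
    if (c->check_every == 0)
        return true;  /* no reduction, nothing to check */
    return c->guard_m * c->tt_size == c->check_every;
}

/* All three constraints together. */
static bool split_tile_config_ok(const struct split_tile_config *c)
{
    return inner_tile_ok(c) && time_tile_ok(c) && reduction_ok(c);
}
```

With the denoise-3d values used in this section (PB_SIZE_0 = 128, VF = 4, tile
sizes 24 and 8, SDSL_TT_SIZE = 4, 50 iterations, K = 4, m = 1),
split_tile_config_ok returns true; changing the inverted tile size to 6 makes
the first check fail, because 30 does not divide 32.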
The values of the tile-size #defines can be changed to ensure that the
above-mentioned constraints are satisfied.

To illustrate all the above-mentioned constraints, consider the SDSL code in
examples/cdsc/denoise-3d. The problem size is 128x128x128. The reduction is
performed after every 4 iterations, and the outer loop of the stencil iterates
for a maximum of 50 iterations. For hybrid split-tiled code with the SSE2
extension, the vectorization factor (VF) is 4.

* Since the reduction is performed every 4 iterations, we can set SDSL_TT_SIZE
  to 4, and the guarding if-statement surrounding the reduction to
  'if ((tt + SDSL_TT_SIZE) % 1 == 0)'.

* For the innermost dimension, PB_SIZE_0 = 128. Thus, the constraint is:

    SDSL_UPR_TILE_SIZE_0 + SDSL_INV_TILE_SIZE_0 = 128/4;

  We can set the values of SDSL_UPR_TILE_SIZE and SDSL_INV_TILE_SIZE for both
  inner dimensions so that their sum is a factor of 32.

A possible value assignment is:

  #define SDSL_TT_SIZE (4)
  #define SDSL_UPR_TILE_SIZE_1 (24)
  #define SDSL_INV_TILE_SIZE_1 (8)
  #define SDSL_UPR_TILE_SIZE_0 (24)
  #define SDSL_INV_TILE_SIZE_0 (8)

The split-tile backend is experimental in this release, and is intended to
undergo further revisions. The code generated by the hybrid and nested
split-tiling backends passes the correctness check for all the benchmarks in
examples/general and examples/cdsc. If you discover a bug, please notify the
author.

---------------
FURTHER READING
---------------

More details about SDSL and the SDSL compiler can be found in the user guide
located at share/sdslc/doc/sdsl-guide-0.3.2.pdf