base16384-sycl/README.md

# Base16384-SYCL

A high-performance Base16384 encoding library implemented using Intel SYCL for accelerated computation on heterogeneous hardware platforms.

## Overview

> [!Note]
> This library requires Intel oneAPI DPC++/SYCL runtime. Please ensure proper environment setup before building and running the applications.

Base16384-SYCL is an optimized implementation of the [Base16384 encoding algorithm](https://github.com/fumiama/base16384) that leverages Intel SYCL (oneAPI Data Parallel C++) to achieve superior performance on both CPU and GPU architectures. The library provides efficient encoding and decoding capabilities while maintaining cross-platform compatibility.

## Features

- **Hardware Acceleration**: Utilizes Intel SYCL for parallel processing on CPUs, GPUs, and other accelerators
- **Cross-Platform Support**: Compatible with Windows and Unix-like systems
- **Performance Optimized**: Includes vectorization and memory optimization for maximum throughput
- **Robust Error Handling**: Comprehensive exception handling with detailed error reporting
- **Modern C++**: Written in C++20 with modern programming practices

## Prerequisites

### Required Dependencies

- **Intel oneAPI Toolkit**: DPC++/SYCL compiler and runtime
- **CMake**: Version 3.4 or higher

### Windows-Specific Requirements

- Visual Studio Build Tools or Visual Studio IDE
- Intel DPC++ compiler (icx-cl)
- NMake (included with Visual Studio)

### Unix/Linux Requirements

- Intel DPC++ compiler (icpx)
- Standard build tools (make, etc.)

## Installation

### 1. Environment Setup

> [!Tip]
> **For VS Code Users**: If you're using Visual Studio Code, the environment variable setup commands will be executed automatically when you open a terminal. If this fails, it may be due to a non-standard installation path. Please modify the paths in `.vscode/settings.json` accordingly.

**Windows:**

```powershell
# Navigate to your Intel oneAPI installation directory
# Typically: C:\Program Files (x86)\Intel\oneAPI\
setvars.bat
```

**Linux/Unix:**

```bash
# Navigate to your Intel oneAPI installation directory
# Typically: /opt/intel/oneapi/
source setvars.sh
```

### 2. Build Process

**Clone and navigate to the project:**

```cmd
git clone https://github.com/fumiama/base16384-sycl.git
cd base16384-sycl
mkdir build
cd build
```

**Configure the build system:**

> Add `-DBUILD=test` to enable testing.

- Windows
  ```cmd
  cmake -G "NMake Makefiles" -DCMAKE_BUILD_TYPE=Release ..
  ```
- Unix-Like
  ```sh
  cmake -DCMAKE_BUILD_TYPE=Release ..
  ```

**Compile the project:**

```cmd
cmake --build .
```

### 3. Testing

**Run the test suite:**

```cmd
ctest
```

### 4. Performance Analysis with Intel VTune

Intel VTune Profiler is a powerful performance analysis tool that can help you identify bottlenecks and optimize the applications.

#### Prerequisites

- Intel VTune Profiler (included in Intel oneAPI Base Toolkit)
- Compiled Base16384-SYCL application or tests with debug symbols (use `RelWithDebInfo` build type)

#### Running VTune Analysis

**1. Launch VTune GUI:**

```bash
vtune-gui
```

**2. Create a New Project:**

- Click "New Project" in the welcome screen
- Set project name and location
- Configure the target application path

**3. Configure Analysis Type:**

Choose an analysis type based on your profiling goals:

- **Hotspots Analysis**: Identify CPU-intensive functions
- **GPU Offload Analysis**: Analyze GPU kernel performance and host-device data transfer
- **Memory Consumption**: Track memory usage patterns
- **Threading Analysis**: Detect threading issues and analyze parallelism

**4. Run the Analysis:**

- Click the "Start" button to begin profiling
- VTune will execute your application and collect performance data

**5. Analyze Results:**

![VTune Analysis Results of basic test](./assets/vtune-b14-test-basic.png)

**Key metrics to examine:**

- **Kernel Execution Time**: Time spent in SYCL kernels
- **Memory Transfer Overhead**: Host-to-device and device-to-host data transfer time
- **CPU Utilization**: Host CPU usage during GPU operations
- **GPU Utilization**: GPU compute unit occupancy

#### Optimization Tips

Based on VTune analysis, consider these optimization strategies:

1. **Reduce Host-Device Transfer**: Minimize data copying between CPU and GPU
2. **Increase Kernel Occupancy**: Optimize work-group sizes and global range
3. **Use Shared Memory**: Leverage local memory for frequently accessed data
4. **Batch Operations**: Process larger data chunks to amortize kernel launch overhead

## Build Configuration

The project supports multiple build configurations:

- **Release**: Optimized for maximum performance (`-O3`, `/O2`)
- **Debug**: Includes debugging symbols and reduced optimization
- **RelWithDebInfo**: Release optimization with debug information
- **MinSizeRel**: Optimized for minimal binary size

## Compatibility

- **Operating Systems**: Windows 10/11, Linux, macOS
- **Architectures**: x86-64, ARM64 (where Intel oneAPI is supported)
- **Hardware**: Intel CPUs, Intel GPUs, NVIDIA GPUs (via Level Zero), AMD GPUs (experimental)

## Contributing

Contributions are welcome! Please ensure that:

1. Code follows the existing style and conventions
2. All tests pass (`ctest`)
3. New features include appropriate test coverage
4. Documentation is updated for significant changes

## License

This project is licensed under the GNU General Public License v3.0 (GPL-3.0). See the [LICENSE](LICENSE) file for detailed information.

## Acknowledgments

- Intel oneAPI team for the SYCL implementation
- Base16384 algorithm developers
- Contributors to the open-source community