I build high-performance rendering engines, CUDA compute kernels, and systems-level infrastructure — from Vulkan pipelines and warp-synchronous GPU algorithms to SIMD-optimised CPU rasterisers and lock-free concurrent systems.
Vulkan 1.3, SPIR-V, Compute Shaders, OpenGL, CUDA Kernel Optimisation, Real-time & Offline Rendering
C++17 (primary), C, Python, GLSL, Slang, CUDA C/C++, ARM Assembly
SIMD (SSE/AVX2), TBB, OpenMP, std::atomic, Lock-Free Concurrency, Warp Intrinsics, Boost.Asio
Nsight Systems/Compute/Graphics, RenderDoc, perf, flame graphs, GDB/LLDB, Linux, Docker, CMake
A Vulkan 1.3 rendering engine with a multi-queue architecture, timeline-semaphore synchronisation, GPU particle systems, and dynamic rendering, tuned for GPU utilisation through profiler-driven iteration with Nsight and RenderDoc.
A CUDA and C++ HPC library with warp-synchronous algorithms, lock-free concurrency, and CPU memory hierarchy benchmarks — profiled end-to-end with NVTX, Nsight Compute, and perf flame graphs.
A microservice messaging platform built on long-lived Boost.Asio TCP connections, a gRPC service mesh, and a custom TLV binary protocol with compile-time safety constraints; stress-tested to 50K concurrent connections.
A CPU rendering pipeline combining AVX2 rasterisation, Whitted-style ray tracing, and Monte Carlo path tracing with importance sampling — optimised with SIMD vectorisation and TBB parallelism.
Bare-metal bring-up on Cortex-A72 (Raspberry Pi 4B): assembly boot, MMU configuration, interrupt subsystem, and cooperative scheduling — debugged through JTAG, GDB, and UART tracing.
Seeking GPU software engineering, graphics, and systems roles. Open to opportunities in Shanghai and across the APAC region.