GPU · Graphics · High-Performance Systems

Jonathan Liu
Graphics & Systems Engineer

I build high-performance rendering engines, CUDA compute kernels, and systems-level infrastructure — from Vulkan pipelines and warp-synchronous GPU algorithms to SIMD-optimised CPU rasterisers and lock-free concurrent systems.

01

GPU & Graphics

Vulkan 1.3, SPIR-V, Compute Shaders, OpenGL, CUDA Kernel Optimisation, Real-time & Offline Rendering

Languages

C++17 (primary), C, Python, GLSL, Slang, CUDA C/C++, ARM Assembly

HPC & Concurrency

SIMD (SSE/AVX2), TBB, OpenMP, std::atomic, Lock-Free, Warp Intrinsics, Boost.Asio

Profiling & Systems

Nsight Systems/Compute/Graphics, RenderDoc, perf, flame graphs, GDB/LLDB, Linux, Docker, CMake

02
Real-Time Rendering

Vulkan-Platform

A Vulkan 1.3 rendering engine with multi-queue architecture, timeline semaphore synchronisation, GPU particle systems, and dynamic rendering — optimised for maximum GPU utilisation with profiler-driven iteration using Nsight and RenderDoc.

Vulkan 1.3 · C++17 · GLSL / Slang · Compute Shaders · SPIR-V
  • ~91× frame time reduction (424ms → 4.66ms) via profiler-guided pipeline redesign
  • 268 FPS sustained at 700K triangles, CPU submission overhead 0.02–0.1ms
  • Compute-shader particle system — 16K particles/dispatch, <0.01ms
  • Cross-platform: Linux, Windows, macOS (MoltenVK)
View on GitHub
High-Performance Computing

libHPC

A CUDA and C++ HPC library with warp-synchronous algorithms, lock-free concurrency, and CPU memory hierarchy benchmarks — profiled end-to-end with NVTX, Nsight Compute, and perf flame graphs.

CUDA · C++17 · Lock-Free · SIMD · Nsight
  • GPU radix sort: 500M keys in 360ms / 1.39B keys/s on RTX 3080 Ti
  • ABA-safe lock-free MPMC queue: 48-bit virtual address + 16-bit refcount packed into a single 64-bit atomic
  • Cache-tiling 6.4× and false-sharing elimination 5.6× speedups on i7-12800HX
  • Full NVTX instrumentation with Nsight Compute multi-pass metrics
View on GitHub
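The pointer-plus-counter packing mentioned above can be sketched roughly as follows. This is an illustration only, not libHPC's code: `Node`, `pack`, `unpack_ptr`, `unpack_count` and `swap_head` are hypothetical names, the 16-bit field is treated as a counter bumped on every update (an assumption about how the refcount is used), and user-space addresses are assumed to fit in 48 bits.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical names throughout; a sketch of the packing idea only.
constexpr std::uint64_t kAddrBits = 48;
constexpr std::uint64_t kAddrMask = (std::uint64_t{1} << kAddrBits) - 1;

struct Node {
    int value;
    Node* next;
};

// Pack a pointer (assumed to fit in 48 bits) and a 16-bit counter into one word.
inline std::uint64_t pack(Node* p, std::uint16_t count) {
    const auto addr = static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(p));
    return (static_cast<std::uint64_t>(count) << kAddrBits) | (addr & kAddrMask);
}

inline Node* unpack_ptr(std::uint64_t word) {
    return reinterpret_cast<Node*>(static_cast<std::uintptr_t>(word & kAddrMask));
}

inline std::uint16_t unpack_count(std::uint64_t word) {
    return static_cast<std::uint16_t>(word >> kAddrBits);
}

// Try to swing `head` from the snapshot `expected` to `desired`.
// The counter is incremented on every successful swap, so a stale snapshot
// fails the CAS even if the same address has been reused in the meantime.
inline bool swap_head(std::atomic<std::uint64_t>& head,
                      std::uint64_t expected, Node* desired) {
    const auto next_count = static_cast<std::uint16_t>(unpack_count(expected) + 1);
    return head.compare_exchange_strong(expected, pack(desired, next_count),
                                        std::memory_order_acq_rel,
                                        std::memory_order_acquire);
}
```

Because both fields live in one 64-bit word, a plain `std::atomic<std::uint64_t>` compare-exchange covers them together and no double-width CAS is needed. The 16-bit counter wraps after 65,536 updates, which a real implementation would have to account for.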
Systems & Concurrency

Distributed IM System

A microservice messaging platform built on Boost.Asio TCP long-connections, a gRPC service mesh, and a custom TLV binary protocol with compile-time safety constraints, stress-tested to 50K concurrent connections.

Boost.Asio · gRPC · C++17 · Redis · Docker
  • 50K concurrent connections, ~60K msg/s, p99 5.4ms RTT on loopback
  • 17.9M messages delivered with zero message loss
  • Custom SFINAE-constrained TLV serialisation across Qt + Boost stacks
  • ASan + GDB regression harness for concurrency stress testing
View on GitHub
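A compile-time constraint on TLV serialisation of the kind described above might look like the sketch below. The wire layout (2-byte tag, 4-byte length, value bytes in host byte order) and the names `tlv_write`, `tlv_read` and `TlvSafe` are assumptions made for illustration; the project's actual protocol is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical constraint: only trivially copyable, non-pointer types may be
// written as raw bytes. Anything else is rejected at compile time via SFINAE.
template <typename T>
using TlvSafe = std::enable_if_t<std::is_trivially_copyable<T>::value &&
                                 !std::is_pointer<T>::value>;

constexpr std::size_t kHeaderSize = sizeof(std::uint16_t) + sizeof(std::uint32_t);

// Append one record: tag, length, then the value bytes.
template <typename T, typename = TlvSafe<T>>
void tlv_write(std::vector<std::uint8_t>& out, std::uint16_t tag, const T& value) {
    const std::uint32_t len = sizeof(T);
    const std::size_t base = out.size();
    out.resize(base + kHeaderSize + len);
    std::memcpy(out.data() + base, &tag, sizeof(tag));
    std::memcpy(out.data() + base + sizeof(tag), &len, sizeof(len));
    std::memcpy(out.data() + base + kHeaderSize, &value, len);
}

// Read one record at `offset`; returns false on a short buffer or on a
// tag/length mismatch, and advances `offset` only on success.
template <typename T, typename = TlvSafe<T>>
bool tlv_read(const std::vector<std::uint8_t>& in, std::size_t& offset,
              std::uint16_t expected_tag, T& value) {
    if (offset > in.size() || in.size() - offset < kHeaderSize) return false;
    std::uint16_t tag = 0;
    std::uint32_t len = 0;
    std::memcpy(&tag, in.data() + offset, sizeof(tag));
    std::memcpy(&len, in.data() + offset + sizeof(tag), sizeof(len));
    if (tag != expected_tag || len != sizeof(T)) return false;
    if (in.size() - offset - kHeaderSize < len) return false;
    std::memcpy(&value, in.data() + offset + kHeaderSize, len);
    offset += kHeaderSize + len;
    return true;
}
```

With this constraint, a call such as `tlv_write(buf, 3, std::string{})` has no viable overload and fails to compile, since `std::string` is not trivially copyable. A production protocol would also fix the byte order on the wire rather than rely on the host's.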
Offline Rendering

Software-Rasterizer

A CPU rendering pipeline combining AVX2 rasterisation, Whitted-style ray tracing, and Monte Carlo path tracing with importance sampling — optimised with SIMD vectorisation and TBB parallelism.

AVX2/SSE · TBB · Path Tracing · BVH · C++17
  • 17.1ms/frame median at 1024×1024, ~6K triangles (1000-frame benchmark)
  • 8 fragments/vector AVX2 pipeline with early-out on coverage + z-test masks
  • 2048 SPP Cornell Box in ~14 min via TBB parallel_reduce
  • Cross-platform SIMD via SIMDe abstraction layer
View on GitHub
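As one example of importance sampling in a path tracer, the sketch below draws directions with a cosine-weighted density over the hemisphere. Which sampling strategy the project actually uses is not stated, so cosine weighting is an assumption here, and `Vec3`, `sample_cosine_hemisphere` and `cosine_hemisphere_pdf` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 {
    double x, y, z;
};

constexpr double kPi = 3.14159265358979323846;

// Map two uniform random numbers u1, u2 in [0, 1) to a unit direction in the
// hemisphere around +Z, distributed with density cos(theta) / pi.
// (Uniform point on the unit disk, projected up onto the hemisphere.)
inline Vec3 sample_cosine_hemisphere(double u1, double u2) {
    const double r = std::sqrt(u1);            // disk radius
    const double phi = 2.0 * kPi * u2;         // disk angle
    return {r * std::cos(phi), r * std::sin(phi), std::sqrt(1.0 - u1)};
}

// Probability density of the direction `d` under the sampler above.
inline double cosine_hemisphere_pdf(const Vec3& d) {
    return d.z > 0.0 ? d.z / kPi : 0.0;
}
```

For a Lambertian surface with BRDF `albedo / pi`, the estimator weight `brdf * cos(theta) / pdf` reduces to `albedo`, which is why this density tends to lower variance compared with uniform hemisphere sampling.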
Bare-Metal Systems

ARMv8-A Kernel

Bare-metal bring-up on Cortex-A72 (Raspberry Pi 4B): assembly boot, MMU configuration, interrupt subsystem, and cooperative scheduling — debugged through JTAG, GDB, and UART tracing.

ARMv8-A · C/C++ · ARM Assembly · GDB/OpenOCD
  • EL1→EL0 exception level transition with full context save/restore
  • TTBR-based multi-level page tables for kernel, MMIO, and user-space
  • IRQ routing for timer, UART, GPIO peripherals with SVC trap handling
  • Cooperative context switching validated via single-step JTAG debugging
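To illustrate how a multi-level table walk indexes into each level, the sketch below splits a virtual address into per-level indices. It assumes a 4 KB translation granule with 48-bit virtual addresses (four levels of 9 bits plus a 12-bit page offset); the kernel's actual granule and address width are not stated above, and `TableIndices`, `split_va` and `in_upper_half` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Assumed configuration: 4 KB granule, 48-bit VA, levels 0..3.
struct TableIndices {
    unsigned l0, l1, l2, l3;   // index into each 512-entry table
    std::uint32_t offset;      // byte offset within the 4 KB page
};

constexpr TableIndices split_va(std::uint64_t va) {
    return {
        static_cast<unsigned>((va >> 39) & 0x1FF),  // bits 47:39
        static_cast<unsigned>((va >> 30) & 0x1FF),  // bits 38:30
        static_cast<unsigned>((va >> 21) & 0x1FF),  // bits 29:21
        static_cast<unsigned>((va >> 12) & 0x1FF),  // bits 20:12
        static_cast<std::uint32_t>(va & 0xFFF)      // bits 11:0
    };
}

// With 48-bit addressing, addresses whose top 16 bits are all ones fall in the
// upper range translated via TTBR1_EL1; all zeros fall under TTBR0_EL1.
constexpr bool in_upper_half(std::uint64_t va) {
    return (va >> 48) == 0xFFFF;
}
```

A common arrangement is to map the kernel and MMIO in the upper range and user space in the lower range, so that switching user address spaces only requires rewriting TTBR0_EL1.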

Let's talk

Seeking GPU software engineering, graphics, and systems roles. Open to opportunities in Shanghai and across the APAC region.