GPU · Graphics · High-Performance Systems

Jonathan Liu
Graphics & Systems Engineer

I build high-performance rendering engines, CUDA compute kernels, and systems-level infrastructure — from Vulkan pipelines and warp-synchronous GPU algorithms to SIMD-optimised CPU rasterisers and lock-free concurrent systems.


Core Expertise

GPU & Graphics

Vulkan 1.3, SPIR-V, Compute Shaders, OpenGL, CUDA Kernel Optimisation, Real-time & Offline Rendering

Languages

C++17 (primary), C, Python, GLSL, Slang, CUDA C/C++, ARM Assembly

HPC & Concurrency

SIMD (SSE/AVX2), TBB, OpenMP, std::atomic, Lock-Free, Warp Intrinsics, Boost.Asio

Profiling & Systems

Nsight Systems/Compute/Graphics, RenderDoc, perf, flame graphs, GDB/LLDB, Linux, Docker, CMake

Selected Work

Real-Time Rendering

Vulkan-Platform

A Vulkan 1.3 rendering engine with multi-queue architecture, timeline semaphore synchronisation, GPU particle systems, and dynamic rendering — optimised for maximum GPU utilisation with profiler-driven iteration using Nsight and RenderDoc.

Vulkan 1.3 · C++17 · GLSL / Slang · Compute Shaders · SPIR-V

~91× frame time reduction (424ms → 4.66ms) via profiler-guided pipeline redesign
268 FPS sustained at 700K triangles, CPU submission overhead 0.02–0.1ms
Compute-shader particle system — 16K particles/dispatch, <0.01ms
Cross-platform: Linux, Windows, macOS (MoltenVK)


High-Performance Computing

libHPC

A CUDA and C++ HPC library with warp-synchronous algorithms, lock-free concurrency, and CPU memory hierarchy benchmarks — profiled end-to-end with NVTX, Nsight Compute, and perf flame graphs.

CUDA · C++17 · Lock-Free · SIMD · Nsight

GPU radix sort: 500M keys in 360ms / 1.39B keys/s on RTX 3080 Ti
ABA-safe lock-free MPMC queue: 48-bit VA + 16-bit refcount packed into a single 64-bit atomic
Cache-tiling 6.4× and false-sharing elimination 5.6× speedups on i7-12800HX
Full NVTX instrumentation with Nsight Compute multi-pass metrics
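
As a rough illustration of the packing idea behind the MPMC queue bullet above, the sketch below stores a 48-bit address and a 16-bit field in one 64-bit atomic word. It is not taken from libHPC: all names (`Node`, `pack_tagged`, `try_swing`) are hypothetical, it assumes user-space pointers whose upper 16 bits are zero, and it uses the 16-bit field as a simple modification counter, which is a different and simpler scheme than the refcount the queue is described as using.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical node type; not from libHPC.
struct Node { int value; Node* next; };

static_assert(sizeof(void*) == 8, "sketch assumes 64-bit pointers");

constexpr std::uint64_t kAddrBits = 48;
constexpr std::uint64_t kAddrMask = (std::uint64_t{1} << kAddrBits) - 1;

// Pack a pointer (low 48 bits) and a 16-bit tag (high 16 bits) into one word.
// Assumption: the pointer's upper 16 bits are zero (typical user-space address).
inline std::uint64_t pack_tagged(Node* p, std::uint16_t tag) {
    const auto bits = static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(p));
    return (static_cast<std::uint64_t>(tag) << kAddrBits) | (bits & kAddrMask);
}

inline Node* unpack_ptr(std::uint64_t word) {
    return reinterpret_cast<Node*>(static_cast<std::uintptr_t>(word & kAddrMask));
}

inline std::uint16_t unpack_tag(std::uint64_t word) {
    return static_cast<std::uint16_t>(word >> kAddrBits);
}

// Try to swing `head` from the exact word `expected` to `desired`, bumping the tag.
// A thread holding a stale word fails even if the pointer bits match again,
// because the tag has moved on in the meantime.
inline bool try_swing(std::atomic<std::uint64_t>& head, std::uint64_t expected, Node* desired) {
    const std::uint64_t next =
        pack_tagged(desired, static_cast<std::uint16_t>(unpack_tag(expected) + 1));
    return head.compare_exchange_strong(expected, next,
                                        std::memory_order_acq_rel,
                                        std::memory_order_acquire);
}
```

With this layout, a head that goes A → B → A ends up with the same pointer bits but a different tag, so a compare-exchange using the word read before the first swap is rejected. A 16-bit counter wraps after 65,536 updates, which is one reason production designs often pair it with other reclamation schemes.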



Systems & Concurrency

Distributed IM System

A microservice messaging platform built on long-lived Boost.Asio TCP connections, a gRPC service mesh, and a custom TLV binary protocol with compile-time safety constraints, stress-tested to 50K concurrent connections.

Boost.Asio · gRPC · C++17 · Redis · Docker

50K concurrent connections, ~60K msg/s, p99 5.4ms RTT on loopback
17.9M messages delivered with zero loss (100% delivery)
Custom SFINAE-constrained TLV serialisation across Qt + Boost stacks
ASan + GDB regression harness for concurrency stress testing
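
To give a flavour of what a SFINAE-constrained TLV encoder can look like, here is a minimal sketch. It is not the project's protocol: the wire layout (2-byte type, 4-byte length, value bytes in host byte order) and the names `EnableIfWireSafe`, `tlv_append`, and `tlv_read` are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical constraint: only trivially copyable, non-pointer types may be
// written to the wire. Anything else is removed from overload resolution.
template <typename T>
using EnableIfWireSafe = std::enable_if_t<
    std::is_trivially_copyable<T>::value && !std::is_pointer<T>::value>;

constexpr std::size_t kTlvHeader = sizeof(std::uint16_t) + sizeof(std::uint32_t);

// Append one record: [type:2][length:4][value:length], host byte order.
// (A real protocol would pin the byte order; this sketch does not.)
template <typename T, typename = EnableIfWireSafe<T>>
void tlv_append(std::vector<std::uint8_t>& out, std::uint16_t type, const T& value) {
    const std::uint32_t len = static_cast<std::uint32_t>(sizeof(T));
    const std::size_t base = out.size();
    out.resize(base + kTlvHeader + len);
    std::memcpy(out.data() + base, &type, sizeof(type));
    std::memcpy(out.data() + base + sizeof(type), &len, sizeof(len));
    std::memcpy(out.data() + base + kTlvHeader, &value, len);
}

// Read one record at `offset`; fails on a short buffer, wrong type, or wrong length.
template <typename T, typename = EnableIfWireSafe<T>>
bool tlv_read(const std::vector<std::uint8_t>& in, std::size_t& offset,
              std::uint16_t expected_type, T& value) {
    if (offset + kTlvHeader > in.size()) return false;
    std::uint16_t type = 0;
    std::uint32_t len = 0;
    std::memcpy(&type, in.data() + offset, sizeof(type));
    std::memcpy(&len, in.data() + offset + sizeof(type), sizeof(len));
    if (type != expected_type || len != sizeof(T)) return false;
    if (offset + kTlvHeader + len > in.size()) return false;
    std::memcpy(&value, in.data() + offset + kTlvHeader, len);
    offset += kTlvHeader + len;
    return true;
}
```

Under this constraint, a call such as `tlv_append(buf, 1, std::string{"x"})` fails to compile because `std::string` is not trivially copyable, so the mistake is caught at build time rather than on the wire.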



Offline Rendering

Software-Rasterizer

A CPU rendering pipeline combining AVX2 rasterisation, Whitted-style ray tracing, and Monte Carlo path tracing with importance sampling — optimised with SIMD vectorisation and TBB parallelism.

AVX2/SSE · TBB · Path Tracing · BVH · C++17

17.1ms/frame median at 1024×1024, ~6K triangles (1000-frame benchmark)
8 fragments/vector AVX2 pipeline with early-out on coverage + z-test masks
2048 SPP Cornell Box in ~14 min via TBB parallel_reduce
Cross-platform SIMD via SIMDe abstraction layer
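
The importance sampling mentioned in the project description can be illustrated with a textbook case: estimating the hemisphere integral of cos θ, which equals π. The sketch below is generic and not taken from Software-Rasterizer; `sample_cosine_hemisphere` and `estimate_cosine_integral` are hypothetical names. It compares cosine-weighted sampling (pdf = cos θ / π) against uniform hemisphere sampling (pdf = 1 / 2π).

```cpp
#include <cassert>
#include <cmath>
#include <random>

constexpr double kPi = 3.14159265358979323846;

struct Vec3 { double x, y, z; };

// Cosine-weighted direction around +Z (Malley's method): pick a point
// uniformly on the unit disk, then project it up onto the hemisphere.
inline Vec3 sample_cosine_hemisphere(double u1, double u2) {
    const double r = std::sqrt(u1);
    const double phi = 2.0 * kPi * u2;
    return { r * std::cos(phi), r * std::sin(phi), std::sqrt(1.0 - u1) };
}

// Monte Carlo estimate of the integral of cos(theta) over the hemisphere.
// importance == true : cosine-weighted samples, pdf = cos(theta) / pi
// importance == false: uniform hemisphere samples, pdf = 1 / (2 * pi)
inline double estimate_cosine_integral(int n, bool importance, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        const double u1 = uni(rng);
        const double u2 = uni(rng);
        double cos_theta = 0.0;
        double pdf = 0.0;
        if (importance) {
            const Vec3 d = sample_cosine_hemisphere(u1, u2);
            cos_theta = d.z;
            pdf = d.z / kPi;
        } else {
            // For uniform hemisphere sampling, z is uniform in [0, 1);
            // the azimuth does not affect this integrand.
            cos_theta = u1;
            pdf = 1.0 / (2.0 * kPi);
        }
        if (pdf > 0.0) sum += cos_theta / pdf;  // skip degenerate samples
    }
    return sum / n;
}
```

Because the cosine-weighted pdf is proportional to the integrand, every sample contributes exactly π and the estimator has zero variance in this idealised case. In a path tracer the integrand also includes incoming radiance and the BRDF, so the variance is reduced rather than eliminated.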


Bare-Metal Systems

ARMv8-A Kernel

Bare-metal bring-up on Cortex-A72 (Raspberry Pi 4B): assembly boot, MMU configuration, interrupt subsystem, and cooperative scheduling — debugged through JTAG, GDB, and UART tracing.

ARMv8-A · C/C++ · ARM Assembly · GDB/OpenOCD

EL1→EL0 exception level transition with full context save/restore
TTBR-based multi-level page tables for kernel, MMIO, and user-space
IRQ routing for timer, UART, GPIO peripherals with SVC trap handling
Cooperative context switching validated via single-step JTAG debugging
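
For readers unfamiliar with multi-level translation tables, the sketch below shows how a virtual address splits into table indices. It assumes a 4 KiB granule with 48-bit virtual addresses (four levels of 9-bit indices plus a 12-bit page offset); the kernel's actual granule and address width are not stated here, and `VaSplit` and `split_va` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Indices into the level 0..3 tables plus the byte offset within the page.
struct VaSplit {
    std::array<std::uint16_t, 4> index;
    std::uint16_t offset;
};

// Assumption: 4 KiB granule, 48-bit VA.
//   level 0: bits [47:39]   level 1: bits [38:30]
//   level 2: bits [29:21]   level 3: bits [20:12]
//   page offset: bits [11:0]
inline VaSplit split_va(std::uint64_t va) {
    VaSplit s{};
    for (int level = 0; level < 4; ++level) {
        const int shift = 39 - 9 * level;
        s.index[level] = static_cast<std::uint16_t>((va >> shift) & 0x1FF);
    }
    s.offset = static_cast<std::uint16_t>(va & 0xFFF);
    return s;
}
```

For example, under these assumptions the address `0x40201234` resolves through entry 0 of the level 0 table, then entry 1 at each of levels 1, 2, and 3, with a page offset of `0x234`.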


Let's talk

Seeking GPU software engineering, graphics, and systems roles. Open to opportunities in Shanghai and across the APAC region.

