https://github.com/yonigozlan/OptimVision
Optimize the inference speed of vision models within the Hugging Face Transformers library, with a focus on models compiled using PyTorch's torch.compile.
Most useful dashboards:

- Overview
- Trace
torch.compile modes:

- "default": the default mode, offering a good balance between performance and compilation overhead.
- "reduce-overhead": reduces Python overhead by using CUDA graphs, which is useful for small batches. Reducing overhead can come at the cost of higher memory usage, because the workspace memory required for the invocation is cached so it does not have to be reallocated on subsequent runs. Overhead reduction is not guaranteed to work; today, overhead is only reduced for CUDA-only graphs that do not mutate their inputs. There are other circumstances where CUDA graphs are not applicable; use TORCH_LOGS=perf_hints to debug.
- "max-autotune": leverages Triton-based matrix multiplications and convolutions. It enables CUDA graphs by default.

A minimal usage sketch follows the list.
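The sketch below shows how one of these modes might be applied to a Transformers vision model; the checkpoint name, input shape, and iteration count are placeholders chosen for illustration, not part of this repository's benchmark setup.

```python
import torch
from transformers import AutoModelForImageClassification

# Placeholder checkpoint; any Transformers vision classification model works the same way.
model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")
model = model.to("cuda").eval()

# Pick one of the modes described above.
# "reduce-overhead" relies on CUDA graphs, so inputs should keep a fixed shape across calls.
compiled_model = torch.compile(model, mode="reduce-overhead")

# Dummy batch with a static shape (batch of 1, 224x224 RGB).
pixel_values = torch.randn(1, 3, 224, 224, device="cuda")

with torch.no_grad():
    # The first calls are slow (compilation and CUDA graph capture);
    # subsequent calls reuse the cached graph and run with reduced Python overhead.
    for _ in range(3):
        logits = compiled_model(pixel_values=pixel_values).logits
```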