https://github.com/yonigozlan/OptimVision
Optimize the inference speed of vision models within the Hugging Face Transformers library, with a focus on models compiled using PyTorch's torch.compile.
Most useful dashboards:

- Overview
- Trace
torch.compile modes:

- "default": the default mode, offering a good balance between performance and compilation overhead.
- "reduce-overhead": reduces Python overhead by using CUDA graphs, which is useful for small batches. Reducing overhead can come at the cost of higher memory usage, because the workspace memory required for the invocation is cached so it does not have to be reallocated on subsequent runs. Overhead reduction is not guaranteed to work; today, overhead is only reduced for CUDA-only graphs that do not mutate their inputs. There are other circumstances where CUDA graphs are not applicable; use TORCH_LOGS=perf_hints to debug.
- "max-autotune": leverages Triton-based matrix multiplications and convolutions. It enables CUDA graphs by default.

A minimal usage sketch follows the list.
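The sketch below shows how one of these modes might be applied to a Transformers vision model; the checkpoint name, input shape, and iteration count are placeholders chosen for illustration, not part of this repository's benchmark setup.

```python
import torch
from transformers import AutoModelForImageClassification

# Placeholder checkpoint; any Transformers vision classification model works the same way.
model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")
model = model.to("cuda").eval()

# Pick one of the modes described above.
# "reduce-overhead" relies on CUDA graphs, so inputs should keep a fixed shape across calls.
compiled_model = torch.compile(model, mode="reduce-overhead")

# Dummy batch with a static shape (batch of 1, 224x224 RGB).
pixel_values = torch.randn(1, 3, 224, 224, device="cuda")

with torch.no_grad():
    # The first calls are slow (compilation and CUDA graph capture);
    # subsequent calls reuse the cached graph and run with reduced Python overhead.
    for _ in range(3):
        logits = compiled_model(pixel_values=pixel_values).logits
```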