Using serialisation and zero-copy initialisation libraries
Tensorizer (recommended)
Tensorizer is a library that lets you load your model from storage into GPU memory in a single step. While initially built to fetch models from S3, it can also load models from a file, so it can be used to load models from Cerebrium’s persistent storage, which offers read speeds of nearly 2 GB/s. For large models (20B+ parameters), we’ve observed a 30–50% decrease in model loading time, and the improvement grows with model size. For more information on the underlying methods, take a look at their GitHub page here. In this section below, we’ll show you how to use Tensorizer to load your model from storage straight into GPU memory in a single step.
Installation
Add the following to the [cerebrium.dependencies.pip] section of your cerebrium.toml file to install Tensorizer in your deployment:
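A minimal sketch of the relevant cerebrium.toml section is shown below; the version pin is an assumption, so adjust it to the release you want to use:

```toml
# cerebrium.toml — add Tensorizer as a pip dependency
[cerebrium.dependencies.pip]
tensorizer = "latest"  # assumed pin; replace with a specific version if you need reproducible builds
```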