LLMs are AI models that use neural networks for natural language processing. They are used as chatbots, text generators and summarizers, language translators, and code generators, and they find applications in fields like healthcare, education, and e-commerce. Deployment means making a product accessible for use, so LLM deployment refers to making an LLM available to users.
Hugging Face is a hub for AI and ML models that focuses mainly on NLP. It provides easy and free deployment options: there is a free storage quota, and beyond that it offers paid resources. It also supports inference providers, meaning API calls can be directed to third-party inference services. There are two approaches to deploying LLMs on Hugging Face: deploying the LLM as a model and deploying it as a Space.
We will go through a tutorial for both approaches and compare them in this blog.
In this approach, the model is hosted by Hugging Face and is ready for public use. For this, we need to upload our trained LLM and its supporting files to Hugging Face. The model can then be used via the Inference API or by loading it directly with the Transformers library. There are two ways to call the Inference API: with a custom key or through routing. With a custom key, inference calls are sent directly to a third-party inference provider, while with routing, the calls are routed by Hugging Face and billed to your Hugging Face account.
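As a rough sketch of what this looks like in practice, the snippet below shows both usage modes: loading the hosted model locally with the Transformers library, and calling it through the Inference API with the `huggingface_hub` client. The repo ID `your-username/your-llm`, the token, and the prompt are placeholders; substitute your own.

```python
# pip install transformers huggingface_hub

from transformers import pipeline
from huggingface_hub import InferenceClient

REPO_ID = "your-username/your-llm"  # hypothetical repo ID; replace with your model

# Option 1: download and run the model locally via the Transformers library
generator = pipeline("text-generation", model=REPO_ID)
result = generator("Deploying an LLM on Hugging Face is", max_new_tokens=50)
print(result[0]["generated_text"])

# Option 2: call the hosted model through the Inference API
# (routed by Hugging Face and billed to your account, unless you
# configure a custom key for a specific inference provider)
client = InferenceClient(model=REPO_ID, token="hf_...")  # your access token
print(client.text_generation("Deploying an LLM on Hugging Face is", max_new_tokens=50))
```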
Hugging Face provides two templates: a base template and a diffusion LoRA template. The base template serves as the core model architecture without any fine-tuning or modifications. The diffusion LoRA template refers to a LoRA adapter applied to a diffusion model, so it is relevant for diffusion models rather than LLMs. We will therefore use the base template for deploying LLMs.
Let’s go through the steps to deploy an LLM as a model:
4. Add a name for your model and select public or private. Public means that everyone can see the model, but only you and your organization members can make changes to it. Private means that only you and your organization members can see and change the model.
5. You can also add a license for your model to restrict commercial use and modifications. Common licenses include MIT, Apache 2.0, CC BY 4.0, Non-Commercial, and Proprietary. MIT and Apache 2.0 are open and allow commercial use and modifications. CC BY 4.0 is open and allows commercial use as long as attribution is given. Non-Commercial licenses are more restrictive: they forbid commercial use but typically allow modifications. Proprietary licenses are closed and usually allow neither commercial use nor modifications.
6. Click on Create Model, and your model is created.
7. Upload your pre-trained model and its supporting files, such as the model weights, configuration, tokenizer, and inference scripts (see the upload sketch after these steps).
8. Your model is now hosted by Hugging Face and ready for users.
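For step 7, you can upload the files through the web UI, or do it programmatically. The sketch below uses the `push_to_hub` helpers from the Transformers library; the local directory `./my-finetuned-llm` and the repo name are placeholders for your own artifacts.

```python
# pip install transformers huggingface_hub
# Authenticate once beforehand with: huggingface-cli login

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load your locally fine-tuned model and tokenizer (hypothetical path)
model = AutoModelForCausalLM.from_pretrained("./my-finetuned-llm")
tokenizer = AutoTokenizer.from_pretrained("./my-finetuned-llm")

# Push the weights, configuration, and tokenizer files to your model repo
model.push_to_hub("your-username/your-llm")
tokenizer.push_to_hub("your-username/your-llm")
```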
In this approach, our LLM is deployed as a web-based app. For this, we need two files: app.py and requirements.txt. The app.py file contains the application that users interact with, and requirements.txt lists the libraries and packages that must be installed for the app to work. Hugging Face builds and runs the app automatically.
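As a minimal sketch, assuming a Gradio Space (Gradio is one of the app templates Hugging Face offers), app.py could look like the following; requirements.txt here would simply list transformers, torch, and gradio. The model repo ID is a placeholder.

```python
# app.py -- minimal Gradio Space for a text-generation model
import gradio as gr
from transformers import pipeline

# Hypothetical model repo; replace with your own
generator = pipeline("text-generation", model="your-username/your-llm")

def generate(prompt: str) -> str:
    # Return a continuation of the user's prompt
    return generator(prompt, max_new_tokens=100)[0]["generated_text"]

demo = gr.Interface(fn=generate, inputs="text", outputs="text", title="My LLM Demo")
demo.launch()
```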
Let’s go through the steps to deploy an LLM as a Space:
You can find an example tutorial on deploying a chatbot using the Mistral API here: Link
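To give a feel for what such a chatbot Space looks like, here is a minimal sketch, assuming the mistralai Python SDK (v1.x) and a MISTRAL_API_KEY stored as a Space secret; mistral-small-latest is one of Mistral’s hosted models.

```python
# app.py -- chatbot Space backed by the Mistral API
# requirements.txt would list: gradio, mistralai
import os
import gradio as gr
from mistralai import Mistral

# Read the API key from the Space's secrets
client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

def respond(message, history):
    # Send the latest user message to Mistral and return the reply
    resp = client.chat.complete(
        model="mistral-small-latest",
        messages=[{"role": "user", "content": message}],
    )
    return resp.choices[0].message.content

gr.ChatInterface(respond).launch()
```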
| Deployment as a Model | Deployment as a Space |
| --- | --- |
| Users can access the model via API calls without any complexity. | The app is less efficient for API calls. |
| It provides limited customization, i.e., no user interface. | It provides several templates for building interactive user interfaces. |
| It cannot be used by non-technical users, since there is no UI. | Non-technical users can easily interact with the app through the UI. |
| Users can optimize the LLM and fine-tune it for their specific use. | It is difficult to download and fine-tune models from Spaces. |
| It provides less control over hardware than Spaces. | It provides more control over hardware. |
| It is limited to technical users. | It can be used by technical as well as non-technical users. |
| Users can easily load the model from the Transformers library. | The app may be slow if the model is large and resources are insufficient. |
Hugging Face makes deployment easy for both direct API usage and interactive apps. Both approaches have their own pros and cons: deployment as a model is fast and efficient but provides limited customization, while deployment as a Space provides interactive user interfaces and built-in templates but is inefficient for API usage. Choosing the right approach depends on the use case and the resources available.