
How to Integrate Hugging Face with Kubernetes

January 24, 2025

Learn to seamlessly deploy Hugging Face models using Kubernetes in this step-by-step integration guide, perfect for enhancing your AI applications.

How to Connect Hugging Face to Kubernetes: A Simple Guide

 

Set Up Your Environment

 

  • Ensure you have a Kubernetes cluster running. You can use Minikube for local development or any cloud provider's Kubernetes service for deployment.
  • Install the necessary CLI tools: kubectl for managing Kubernetes clusters and the Hugging Face CLI for interacting with Hugging Face services.

 

# Install kubectl
brew install kubectl

# Install Hugging Face CLI
pip install huggingface_hub

 

Create a Hugging Face Model API

 

  • Sign up for a Hugging Face account and navigate to the Hub. Select a model you want to run.
  • Create an API token from your Hugging Face account settings and save it securely. You'll need it to access the Hugging Face model from your Kubernetes deployment (see the login example below).
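With the token in hand, you can authenticate your local environment before building anything. This is a minimal check, assuming you have exported the token into an environment variable named HF_API_TOKEN; the --token flag simply skips the interactive prompt.

# Log the CLI in with your token (assumes HF_API_TOKEN is set in your shell)
huggingface-cli login --token $HF_API_TOKEN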

 

Containerize Your Application

 

  • Create a Dockerfile for your application. This file should contain instructions to set up the environment and run your application.
  • Install the necessary libraries and copy your codebase into the Docker image.

 

FROM python:3.8-slim

WORKDIR /app

COPY . /app

RUN pip install -r requirements.txt

ENTRYPOINT ["python", "app.py"]

 

Push Docker Image to a Registry

 

  • Build the Docker image and tag it appropriately.
  • Push the Docker image to a container registry like Docker Hub or Google Container Registry.

 

# Build the Docker image
docker build -t username/hf-kubernetes-example:latest .

# Push the Docker image 
docker push username/hf-kubernetes-example:latest

 

Deploy to Kubernetes

 

  • Create a Kubernetes deployment manifest. This YAML file should include specifications for your app's deployment, such as image location and replication settings.
  • Apply the deployment manifest to your Kubernetes cluster.

 

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hf-kubernetes-example
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hf-kubernetes-example
  template:
    metadata:
      labels:
        app: hf-kubernetes-example
    spec:
      containers:
      - name: hf-kubernetes-example
        image: username/hf-kubernetes-example:latest
        ports:
        - containerPort: 80
        env:
        - name: HF_API_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: api-token
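With the manifest saved (deployment.yaml here is just the filename used in this guide), apply it to the cluster:

# Create or update the Deployment from the manifest
kubectl apply -f deployment.yaml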

 

Expose Your Application

 

  • Create a Service to expose your application. Use a LoadBalancer if you want external access.
  • Apply the Service manifest to your Kubernetes cluster.

 

apiVersion: v1
kind: Service
metadata:
  name: hf-kubernetes-service
spec:
  type: LoadBalancer
  ports:
  - port: 80
  selector:
    app: hf-kubernetes-example
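Apply the Service manifest the same way; the service.yaml filename is simply whatever you saved it as.

# Create or update the Service from the manifest
kubectl apply -f service.yaml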

 

Secure Your Deployment

 

  • Create Kubernetes Secrets for sensitive information such as API tokens.
  • Ensure your application can retrieve these secrets and use them safely.

 

kubectl create secret generic hf-secret --from-literal=api-token=your_hf_api_token_here

 

Monitor and Troubleshoot

 

  • Use `kubectl get pods` to check the status of your pods and ensure they're running.
  • Use `kubectl logs <pod_name>` to view logs and troubleshoot any issues in your deployment.

 

# Check the status of all running pods
kubectl get pods

# View detailed logs of a specific pod
kubectl logs <pod_name>
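Two more standard kubectl commands are often helpful at this stage; the Service name below matches the manifest created earlier.

# Inspect scheduling events and restarts when a pod is Pending or crash-looping
kubectl describe pod <pod_name>

# Find the external IP assigned to the LoadBalancer Service
kubectl get service hf-kubernetes-service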

 

How to Use Hugging Face with Kubernetes: Use Cases

 

Deploying Large-Scale NLP Models with Hugging Face and Kubernetes

 

  • Leverage the power of Kubernetes to seamlessly manage the deployment of large-scale NLP models trained using Hugging Face's Transformers library.
  • Ensure dynamic scaling of infrastructure based on computational needs, allowing for cost-effective and efficient model inference.

 

Setting Up the Environment

 

  • Create a Kubernetes cluster using a cloud provider such as Google Kubernetes Engine (GKE), Amazon Elastic Kubernetes Service (EKS), or Azure Kubernetes Service (AKS).
  • Install necessary command-line tools: Kubernetes CLI (kubectl) and Helm for managing your cluster and its applications.

 

Model Serving with Hugging Face

 

  • Use Hugging Face's `transformers` library to prepare and export the NLP model. Hugging Face provides pre-trained models that can be fine-tuned or used out of the box for various NLP tasks.

    from transformers import pipeline
    model = pipeline('sentiment-analysis')

  • Package the model into a Docker container with all dependencies required for execution. This ensures portability and compatibility across diverse environments.

    FROM python:3.8-slim
    WORKDIR /app
    COPY . /app
    RUN pip install transformers
    CMD ["python", "serve.py"]

Deploying on Kubernetes

 

  • Define a Kubernetes Deployment configuration to manage model replicas and ensure high availability of your service.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nlp-model
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: nlp-model
      template:
        metadata:
          labels:
            app: nlp-model
        spec:
          containers:
          - name: nlp-model
            image: your-docker-repo/nlp-model:latest
            ports:
            - containerPort: 80
    

     

  • Use a Kubernetes Service to expose your deployed model to external traffic, enabling application access.
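A minimal Service for the Deployment above could look like the following; the Service name is an assumption, and the port mirrors the containerPort declared in the Deployment.

apiVersion: v1
kind: Service
metadata:
  name: nlp-model-service
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 80
  selector:
    app: nlp-model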

 

Monitoring and Autoscaling

 

  • Integrate Kubernetes Horizontal Pod Autoscaler to automatically adjust the number of replicas in response to traffic loads, ensuring performance under varying workload conditions.
  • Set up monitoring tools like Prometheus and Grafana for real-time insights on resource utilization and application health.
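As a concrete starting point, a HorizontalPodAutoscaler targeting the nlp-model Deployment above might look like this; the maximum replica count and the 70% CPU target are illustrative values, not recommendations.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nlp-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nlp-model
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70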

 

 

Integrating Hugging Face Model Endpoints with Kubernetes for Real-Time Inference

 

  • Utilize Kubernetes to orchestrate Hugging Face's model inference endpoints for real-time data processing and analytics applications.
  • Achieve high availability and redundancy for NLP services by deploying multiple instances across a Kubernetes cluster.

 

Environment Configuration

 

  • Set up a Kubernetes cluster using providers like GKE, EKS, or AKS for robust infrastructure management.
  • Install `kubectl`, Helm, and the Hugging Face CLI to interact with Kubernetes and Hugging Face's services.

 

Creating and Exporting Models

 

  • Use the Hugging Face `transformers` library to train and finalize the model for deployment.

    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

  • Containerize the model with Docker for seamless deployment in a Kubernetes environment.

    FROM python:3.8
    WORKDIR /app
    COPY . /app
    RUN pip install transformers
    CMD ["python", "inference.py"]

     

Deploying Hugging Face Models on Kubernetes

 

  • Use Kubernetes Deployments to manage load balancing and rolling updates for the Hugging Face model endpoint.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: hf-model
    spec:
      replicas: 5
      selector:
        matchLabels:
          app: hf-model
      template:
        metadata:
          labels:
            app: hf-model
        spec:
          containers:
          - name: hf-model
            image: your-docker-repo/hf-model:latest
            ports:
            - containerPort: 8080
    

     

  • Configure a Kubernetes Service to make the Hugging Face model endpoint accessible to external applications.

 

Ensuring Scalability and Resilience

 

  • Implement Horizontal Pod Autoscaler in Kubernetes to dynamically scale the model serving pods based on CPU and memory utilization.
  • Deploy Prometheus and Grafana for continuous monitoring of system performance, resource utilization, and application metrics.

 


Troubleshooting Hugging Face and Kubernetes Integration

How to deploy a Hugging Face model on Kubernetes?

 

Set Up Your Environment

 

  • Install the Kubernetes command-line tool `kubectl` and set up a Kubernetes cluster.
  • Authenticate your environment to ensure that `kubectl` can interact with your desired cluster.

 

Prepare the Docker Image

 

  • Create a Dockerfile for your Hugging Face model. Base it on an image such as `python:3.9` and install the necessary libraries.
  • Create an entry-point script that runs your model inference API, e.g., using Flask or FastAPI.

 

FROM python:3.9  
COPY . /app  
WORKDIR /app  
RUN pip install -r requirements.txt  
CMD ["python", "app.py"]  

 

Deploy to Kubernetes

 

  • Build the Docker image and push it to your container registry:

    docker build -t <your-docker-repo-url> .
    docker push <your-docker-repo-url>

  • Create a Kubernetes deployment YAML file for your model. Configure replicas, container specs, and expose the necessary ports.
  • Deploy using `kubectl apply`:

 

apiVersion: apps/v1
kind: Deployment
metadata:
  name: huggingface-model
spec:
  replicas: 2
  selector:
    matchLabels:
      app: huggingface
  template:
    metadata:
      labels:
        app: huggingface
    spec:
      containers:
      - name: model-container
        image: <your-docker-repo-url>
        ports:
        - containerPort: 5000

 

kubectl apply -f deployment.yaml  

What are common errors when scaling Hugging Face models in Kubernetes?

 

Common Errors When Scaling Hugging Face Models in Kubernetes

 

  • Resource Allocation Issues: Ensure proper CPU and memory requests/limits are set. Default resource settings can lead to under- or over-provisioning, hurting performance or inflating cost.
  • Concurrency Management Problems: Use the Kubernetes Horizontal Pod Autoscaler to manage load. Incorrect configuration can limit scaling benefits.
  • Load Balancing Challenges: Implement efficient load balancing, for example with a service mesh like Istio. Default settings might not distribute traffic evenly, leading to bottlenecks.
  • Persistence Misconfigurations: Avoid using local storage for model artifacts. Use shared storage such as persistent volume claims (PVCs) to ensure data availability across pods (see the example below).
  • Networking Constraints: Check network configuration, such as a service mesh, as it can affect latency and performance.

 

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-model
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-model
  template:
    metadata:
      labels:
        app: my-model
    spec:
      containers:
      - name: huggingface-model
        image: huggingface/transformers
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
          limits:
            memory: "8Gi"
            cpu: "4"

 

How to optimize Hugging Face inference on Kubernetes for cost and speed?

 

Consider Efficient Deployment

 

  • Use lightweight base images and multi-stage builds to reduce container size, which shortens image pulls and pod startup and lowers storage cost.
  • Autoscale pods based on CPU, memory, and custom metrics using the HorizontalPodAutoscaler to handle load efficiently.

 

Optimize Model Loading

 

  • Load models asynchronously to reduce initialization time. Use libraries that support lazy loading to minimize startup latency.
  • Cache model files on shared volumes so pods avoid repeated downloads, saving time and computational resources.

 

Use Efficient Hardware

 

  • Use GPU instances for high-throughput models. Leverage node selectors and tolerations to schedule pods onto the right hardware (see the scheduling snippet after the manifest below).
  • Implement inference acceleration technologies such as the NVIDIA Triton Inference Server for supported models to optimize speed.

 

apiVersion: apps/v1
kind: Deployment
metadata:
  name: huggingface-inference
spec:
  replicas: 3
  selector:
    matchLabels:
      app: huggingface-inference
  template:
    metadata:
      labels:
        app: huggingface-inference
    spec:
      containers:
      - name: model-server
        image: huggingface/transformers
        resources:
          limits:
            nvidia.com/gpu: 1
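To pin these pods to GPU nodes, as mentioned above, fields like the following can be added under spec.template.spec of the manifest; the node label and the taint key are assumptions about how your cluster labels and taints its GPU node pool.

# Illustrative fragment for spec.template.spec; adjust the label and taint to your cluster
nodeSelector:
  accelerator: nvidia-gpu
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule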

 

