NimblePros

Choosing an AI Model with Self-Hosted AI

March 05, 2026
Categories: #AI

Sarah Dutkiewicz, Senior Trainer

The rise of large language models has fundamentally shifted how we approach problem-solving. But increasingly, developers and architects are questioning the reliance on centralized AI platforms like ChatGPT and Gemini. Concerns around data privacy, vendor lock-in, unpredictable costs, and limited control have fueled a growing interest in self-hosting AI models.

This isn’t about a nostalgic return to the past. It’s about reclaiming control: building a more resilient, efficient, and ultimately more strategic approach to leveraging the power of AI.

For developers and architects, the ability to host AI models directly on your infrastructure offers a critical advantage: complete data sovereignty, optimized performance, and the flexibility to tailor AI to your precise needs. Platforms like Open WebUI and OpenCode are emerging as viable options, empowering you to move beyond the limitations of cloud-based solutions.

In this post, we’ll cover the key considerations for choosing the right AI model and platform - focusing on the technical realities of self-hosting and equipping you with the knowledge to make an informed decision.

Model Selection

Choosing the right AI model is the foundation of your self-hosted AI endeavor. It’s far more than just selecting the ‘biggest’ or ‘most impressive’ model. Practical considerations, particularly hardware limitations, dictate the feasibility of deployment.

Size vs. Hardware Realities

Don’t fall into the trap of assuming you can comfortably run a 70B (70 billion) parameter Llama 2 model on a consumer-grade GPU; the memory requirements are significant. Parameters are the weights a model learns during training, essentially the connections and strengths within the neural network, and the “B” in a model name stands for billions of them. The more parameters a model has, the more complex the patterns it can learn and, generally, the more capable it is, though it also demands more resources. Start with 7B or 13B models; they represent a pragmatic balance between performance and resource utilization. Benchmark data on the Hugging Face Model Hub can provide valuable insight into expected VRAM requirements.
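
You can get a useful back-of-the-envelope VRAM estimate from the parameter count alone. As a sketch (the 20% overhead factor for activations and KV cache is a rule-of-thumb assumption, not a measured value):

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: float,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: parameter memory plus ~20% overhead
    (activations, KV cache). The 1.2 factor is an assumption."""
    return params_billion * 1e9 * bytes_per_param * overhead / 2**30

# A 70B model at 16-bit (2 bytes/param) vs. a 7B model at 4-bit (0.5 bytes/param):
print(f"70B @ fp16:  ~{estimate_vram_gb(70, 2.0):.0f} GB")  # far beyond any consumer GPU
print(f"7B  @ 4-bit: ~{estimate_vram_gb(7, 0.5):.1f} GB")   # fits on an 8 GB card
```

Running the numbers this way makes the 7B/13B recommendation concrete: a 70B model at 16-bit precision needs on the order of 150+ GB, while a 4-bit 7B model fits comfortably in 8 GB of VRAM.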

Quantization Techniques

A critical technique is quantization: reducing the precision of the model’s weights by converting them from 32-bit floating-point numbers (very precise, but memory-hungry) to lower-precision formats such as 8-bit integers. This lets you run larger models on the same hardware while improving inference speed and reducing memory usage. 4-bit or 8-bit quantization can dramatically shrink a model’s memory footprint without a catastrophic drop in quality. Experiment with different quantization methods to find the optimal balance for your workload.
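
To build intuition for why this works, here is a toy sketch of symmetric 8-bit quantization on a handful of weights (real quantization schemes like GPTQ or AWQ are far more sophisticated, but the core idea is the same):

```python
def quantize_int8(weights):
    """Symmetric 8-bit quantization: map floats in [-max, max] to ints in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [v * scale for v in q]

weights = [0.42, -1.37, 0.05, 0.91, -0.66]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(q)        # small integers: 1 byte each instead of 4
print(max_err)  # reconstruction error stays below half a quantization step
```

Each weight now takes one byte instead of four, and the round-trip error is bounded by half the quantization step, which is why quality usually degrades gracefully rather than collapsing.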

Framework Compatibility

Verify model compatibility with your preferred framework (PyTorch, TensorFlow, etc.). Ensure the model’s API and documentation align with your existing development workflow.
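
Before downloading gigabytes of model weights in a particular format, it can be worth a quick sanity check that the matching runtime is even importable. A minimal sketch (the candidate framework names here are just examples; adjust to your stack):

```python
import importlib.util

def installed_frameworks(candidates=("torch", "tensorflow", "onnxruntime")):
    """Return which of the candidate inference frameworks are importable
    in this environment, without actually importing them."""
    return [name for name in candidates if importlib.util.find_spec(name) is not None]

print(installed_frameworks())  # check before committing to a model format
```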

API Design & Flexibility

Prioritize models with well-defined APIs. You’ll likely need to integrate the AI model into your own applications, so a robust API is crucial for efficient interaction. Look for models with clear documentation outlining the API endpoints and data formats.

Benchmarking & Performance Metrics

Don’t rely solely on model size. Track key performance metrics:

  • Inference Latency: The time it takes to generate a response.
  • Throughput: The number of requests the model can handle per second.
  • Accuracy/Perplexity: Quality metrics such as benchmark accuracy and perplexity, often reported in model cards and leaderboards, offer insight into the model’s capabilities.
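
Latency and throughput are easy to measure yourself. Here is a minimal timing harness; the lambda is a stand-in for a real inference call (swap in your model’s generate function):

```python
import time

def benchmark(generate, prompt, runs=20):
    """Time repeated calls to `generate(prompt)` and report average
    latency and throughput. `generate` is any callable you want to measure."""
    start = time.perf_counter()
    for _ in range(runs):
        generate(prompt)
    elapsed = time.perf_counter() - start
    return {"avg_latency_s": elapsed / runs, "throughput_rps": runs / elapsed}

# Stub "model" for illustration only; replace with a real inference call.
stats = benchmark(lambda p: p.upper(), "hello", runs=100)
print(stats)
```

For meaningful numbers, run against realistic prompt lengths and discard the first call, since model warm-up and cache effects can skew the average.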

Platform Considerations

Choosing the right platform is crucial for a successful self-hosted AI deployment. While Open WebUI and OpenCode both provide valuable capabilities, understanding their differences is key.

Open WebUI

Open WebUI is a user-friendly, web-based interface for running and interacting with AI models. It excels in:

  • Ease of Use: Its intuitive web UI makes it accessible to users with limited technical expertise.
  • Model Loading & Inference: Supports a wide range of models and provides a straightforward method for loading and running them.
  • API Access: Offers a REST API for programmatic access to the model.
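
As a sketch of that API access, Open WebUI exposes an OpenAI-compatible chat endpoint. The `/api/chat/completions` path, Bearer-token auth, the port 3000 mapping, and the `gemma3` model name below are assumptions based on a typical Docker setup; verify them against your own instance:

```python
import json
import urllib.request

BASE_URL = "http://localhost:3000"  # assumed Docker port mapping; adjust to yours
API_KEY = "sk-..."                  # placeholder: generate a key in Open WebUI settings

def build_chat_request(model, prompt):
    """Build an OpenAI-style chat completion request for an Open WebUI instance."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/api/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {API_KEY}",
                 "Content-Type": "application/json"},
    )

req = build_chat_request("gemma3", "Outline a talk on self-hosted AI.")
# Uncomment to actually send the request to a running instance:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```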

I use Open WebUI when I’m brainstorming and when I need to talk through some coding hurdles without getting into my code directly. These are some of the models I work with and how I use them:

Open WebUI models list

  • Gemma 3 - created by Google - used for general AI and reasoning. I use it to help with figuring out social media strategies, presentation pacing, brainstorming, and content creation.
  • CodeLlama - created by Meta - used for coding brainstorming.
  • Mistral - created by Mistral AI - used when I want to deal with text generation and summarization.
  • Qwen2.5 - created by Alibaba - used when I’m dealing with multilingual support in applications.

This is how I set up Open WebUI to run locally.

OpenCode

OpenCode emphasizes collaboration and community-driven development. It’s designed for:

  • Version Control & Collaboration: Integrated with Git for seamless version control and collaborative development.
  • Customization & Extensions: Provides a flexible architecture for extending functionality and adding custom components.
  • Community Support: Leverages a community-driven ecosystem for support and knowledge sharing.

One of my teammates mentioned this in a recent AI discussion. This is on my list of technologies to explore and possibly cover in a future blog post!

Platform Comparison

Feature       | Open WebUI | OpenCode
------------- | ---------- | --------
Ease of Use   | High       | Medium
Customization | Low        | High
Collaboration | Limited    | Strong
API Access    | Yes        | Yes

Ultimately, choosing a self-hosted AI deployment offers significant advantages - increased control over your data, reduced reliance on external services, and potentially lower long-term costs. When considering the options, Open WebUI provides a powerful and accessible entry point for developers and users seeking a straightforward model inference experience. OpenCode, with its emphasis on collaboration and customization, represents a strategic choice for teams prioritizing long-term maintainability and adaptability. Both platforms offer viable solutions for self-hosting AI, and your selection will largely depend on your specific needs and technical expertise. As you continue to explore the possibilities of self-hosted AI, remember that a robust understanding of these platforms and a commitment to ongoing optimization are key to unlocking their full potential.

Deployment Considerations

Successfully deploying a self-hosted AI model goes beyond simply choosing the right platform. Careful planning and consideration of your infrastructure are crucial for optimal performance and reliability.

Hardware Requirements

Selecting appropriate hardware is the foundation of a robust self-hosted AI deployment. The right components directly impact the performance and scalability of your model.

  • GPU: A dedicated GPU is highly recommended for AI inference. The amount of VRAM (Video RAM) significantly impacts the size and complexity of models you can run.
  • CPU: A decent CPU is necessary for handling data preprocessing and other supporting tasks.
  • RAM: Sufficient RAM is essential for loading the model and handling incoming requests.

Locally, I run Open WebUI on an ASUS ROG Strix G53ZW with a 12th-generation Intel® Core™ i9 processor, 32GB of RAM, and an NVIDIA GeForce RTX 3070 Ti Laptop GPU.

We also have Open WebUI running in our homelab. Our Docker setup runs on an Intel Core i7-7700K (4C/8T) with 48 GB of DDR4 RAM (2133 MHz) and an NVIDIA RTX 3060 Ti GPU. The RTX 3060 Ti (8 GB VRAM) handles most inference workloads, which is enough to comfortably run 7B–13B models locally with quantization.

The RAM is a mixed configuration (16 GB + 8 GB + 16 GB + 8 GB), which is perfectly fine for a homelab setup even if it’s not the most optimal pairing. The GPU sits on a PCIe x16 slot and does the heavy lifting for local model inference.

Deployment Options

Choosing the appropriate deployment method depends heavily on your needs and technical expertise. Several options exist, each with its own trade-offs.

  • Local Machine: For experimentation and small-scale deployments, running the model directly on your local machine can be a viable option.
  • Virtual Machines (VMs): Using VMs provides isolation and portability, making it easier to manage and scale your deployment.
  • Containerization (Docker): Docker simplifies deployment by packaging the model and its dependencies into a portable container.

Locally and in our homelab, we are using Docker.

Monitoring & Maintenance

Ongoing monitoring and proactive maintenance are essential for ensuring the continued operation and performance of your self-hosted AI system.

  • Resource Monitoring: Track CPU usage, GPU utilization, memory consumption, and network traffic to identify potential bottlenecks.
  • Log Management: Implement logging to capture errors, track performance, and troubleshoot issues.
  • Regular Updates: Keep your software (including the AI model and platform) up-to-date to benefit from security patches and performance improvements.
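
For the log management piece, even a thin wrapper around your inference calls pays off. A minimal sketch using Python’s standard logging module (the wrapper and logger name are illustrative, not part of any platform’s API):

```python
import logging
import time

logging.basicConfig(
    level=logging.INFO,  # point this at a file or log shipper in production
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("inference")

def logged_inference(generate, prompt):
    """Wrap a model call so latency and failures land in the logs."""
    start = time.perf_counter()
    try:
        result = generate(prompt)
    except Exception:
        log.exception("inference failed for prompt of length %d", len(prompt))
        raise
    log.info("inference ok in %.3fs", time.perf_counter() - start)
    return result

print(logged_inference(str.upper, "hello"))  # stub call; swap in a real model
```

With latency in every log line, spotting a slow degradation (a filling disk, thermal throttling, a memory leak) becomes a grep rather than a guessing game.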

Conclusion

Throughout this post, we’ve explored the critical components of self-hosted AI deployment, from model selection to platform considerations and ongoing maintenance. The decision to self-host isn’t simply a technical one; it’s a strategic move offering greater control over your data, reduced operational costs, and the potential to tailor your AI solutions to your precise requirements.

Key Considerations: When choosing your approach, remember to carefully assess:

  • Model Size & Complexity: Select a model that aligns with your hardware capabilities.
  • Platform Compatibility: Choose a platform that supports your chosen model and technical expertise.
  • Resource Requirements: Ensure you have adequate hardware and infrastructure to support your deployment.

Getting Started: For newcomers, we recommend starting with a smaller, more manageable model (7B or 13B parameters) on a local machine or a lightweight VM. Open WebUI provides a user-friendly entry point, while OpenCode offers a solid foundation for collaborative development.

Now is the time to take action! Don’t just read about self-hosted AI: begin your own exploration. Experiment with different models, explore the capabilities of Open WebUI and OpenCode, and contribute to the growing community. The future of AI is being shaped by those who dare to self-host.

When evaluating models, consider their inherent capabilities. Raw parameter count is only a starting point; a truly effective AI solution leverages a model’s strengths, whether that’s robust reasoning, natural language understanding, or image generation, depending on your intended application. And don’t underestimate the impact of fine-tuning: adapting a pre-trained model to your specific dataset can dramatically improve performance. Start building your own AI solutions today!