Use Case: Private AI Development Environment
Imagine you’re working on a project that involves sensitive code, proprietary algorithms, or customer data. You want to use AI assistance for code generation, debugging, and documentation, but you’re concerned about:
- Data Privacy: Sending code to cloud-based AI services means your proprietary code leaves your environment
- Cost Control: Cloud AI services charge per token, and costs can add up quickly during active development
- Offline Development: You need AI assistance even when internet connectivity is unreliable
- Custom Models: You want to fine-tune models on your specific codebase or use specialized models
- Regulatory Compliance: Your organization requires data to stay on-premises for compliance reasons
The Challenge: You need an AI coding assistant that:
- Runs entirely on your local machine
- Provides fast, responsive code suggestions
- Integrates seamlessly with your development workflow
- Doesn’t require constant internet connectivity
- Keeps all your code and data private
The Solution: Ollama provides an easy way to run large language models locally, and Cursor IDE can be configured to use these local models instead of cloud-based services. This gives you the benefits of AI-assisted coding while maintaining complete control over your data and infrastructure.
This guide walks through setting up Ollama, running local models, and configuring Cursor to use them as your AI coding assistant.
Prerequisites
Before getting started, ensure you have:
- System Requirements (a quick way to check these follows this list):
  - macOS, Linux, or Windows
  - At least 16GB RAM (32GB recommended for larger models)
  - 20GB+ free disk space for models
  - Modern CPU (a GPU is optional but recommended for better performance)
- Software:
  - Cursor IDE installed (cursor.sh)
  - Terminal/command line access
  - Homebrew (macOS) or a package manager (Linux)
- Basic Knowledge:
  - Command line usage
  - Understanding of AI models and their capabilities
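A quick way to check the RAM and disk requirements from a terminal (standard system tools; models are stored under ~/.ollama by default):
# Check installed RAM
sysctl -n hw.memsize | awk '{print $1/1073741824 " GB"}'   # macOS
free -h                                                     # Linux
# Check free disk space in your home directory (models live under ~/.ollama by default)
df -h ~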
Installing Ollama
1. macOS Installation
# Install using Homebrew
brew install ollama
# Or download from official website
# Visit https://ollama.ai/download
2. Linux Installation
# Install using the official script
curl -fsSL https://ollama.ai/install.sh | sh
# Or using package manager (Ubuntu/Debian)
# Download .deb package from https://ollama.ai/download
3. Windows Installation
# Download installer from https://ollama.ai/download
# Run the installer executable
4. Verify Installation
# Check Ollama version
ollama --version
# Start Ollama service (if not running automatically)
ollama serve
# In another terminal, test the installation
ollama list
Setting Up Local Models
1. Available Models
Ollama supports various open-source models. Popular choices for coding:
- CodeLlama: Specialized for code generation (7B, 13B, 34B variants)
- Llama 2: General-purpose model (7B, 13B, 70B variants)
- Mistral: Efficient and capable (7B variant)
- DeepSeek Coder: Code-focused model
- StarCoder: Code generation specialist
2. Pulling Models
# Pull CodeLlama 7B (good balance of performance and speed)
ollama pull codellama:7b
# Pull CodeLlama 13B (better quality, slower)
ollama pull codellama:13b
# Pull Mistral (efficient and fast)
ollama pull mistral:7b
# Pull DeepSeek Coder (specialized for coding)
ollama pull deepseek-coder:6.7b
# List available models
ollama list
3. Testing Models
# Test CodeLlama with a simple prompt
ollama run codellama:7b "Write a Python function to calculate fibonacci numbers"
# Interactive mode
ollama run codellama:7b
# Then type your prompts interactively
# Type /bye to exit
4. Model Management
# Show model information
ollama show codellama:7b
# Copy a model
ollama cp codellama:7b my-custom-codellama
# Remove a model (frees disk space)
ollama rm codellama:7b
# List all models
ollama list
Configuring Ollama Server
1. Server Configuration
Ollama runs a local server by default. Configure it for optimal performance:
# Check if Ollama is running
curl http://localhost:11434/api/tags
# Set environment variables for configuration
export OLLAMA_HOST=0.0.0.0:11434    # listen on all interfaces; use 127.0.0.1:11434 to stay local-only
export OLLAMA_NUM_PARALLEL=1        # number of requests served in parallel
export OLLAMA_MAX_LOADED_MODELS=1   # keep at most one model loaded in memory
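These exports only affect the current shell. If Ollama was installed with the Linux install script, it typically runs as a systemd service named ollama; in that case the variables can be set persistently through a service override (a sketch assuming that default setup):
# Open an override file for the service
sudo systemctl edit ollama
# Add the following under [Service], then save and exit:
#   Environment="OLLAMA_HOST=127.0.0.1:11434"
#   Environment="OLLAMA_MAX_LOADED_MODELS=1"
# Reload and restart
sudo systemctl daemon-reload
sudo systemctl restart ollama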
2. API Endpoint
Ollama provides a REST API that Cursor can connect to:
# Test API endpoint
curl http://localhost:11434/api/generate -d '{
  "model": "codellama:7b",
  "prompt": "Hello, how are you?",
  "stream": false
}'
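Ollama also exposes a chat endpoint, /api/chat, which takes a messages array instead of a single prompt; the Python and Node.js clients later in this guide use it. A quick smoke test:
curl http://localhost:11434/api/chat -d '{
  "model": "codellama:7b",
  "messages": [
    {"role": "user", "content": "Write a one-line Python hello world"}
  ],
  "stream": false
}'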
3. Performance Optimization
# For systems with GPU (CUDA)
export CUDA_VISIBLE_DEVICES=0
# For systems with Apple Silicon (M1/M2/M3)
# Ollama automatically uses Metal acceleration
# Limit memory usage by keeping only one model loaded at a time
export OLLAMA_MAX_LOADED_MODELS=1
Connecting Cursor to Ollama
1. Cursor Settings Configuration
Cursor can be configured to use local models via the settings:
- Open Cursor Settings:
  - macOS: Cmd + , or Cursor > Settings
  - Windows/Linux: Ctrl + , or File > Preferences > Settings
- Navigate to AI Settings:
  - Search for “AI” or “Model” in settings
  - Look for “Model Provider” or “AI Provider” settings
- Configure Custom Model:
  - Find the “Custom Model” or “Local Model” option
  - Set the API endpoint: http://localhost:11434
  - Set the model name: codellama:7b (or your preferred model)
2. Using Cursor Settings JSON
Alternatively, edit Cursor’s settings JSON directly. The exact keys vary between Cursor versions, so treat the snippet below as a template rather than an exact schema:
{
  "cursor.ai.model": "codellama:7b",
  "cursor.ai.provider": "custom",
  "cursor.ai.endpoint": "http://localhost:11434/api/generate",
  "cursor.ai.apiKey": "",
  "cursor.ai.temperature": 0.7,
  "cursor.ai.maxTokens": 2048
}
3. Cursor Configuration File
If your Cursor version supports project-level configuration, create or edit .cursor/config.json in your project (again, the keys shown are illustrative):
{
  "ai": {
    "provider": "ollama",
    "model": "codellama:7b",
    "endpoint": "http://localhost:11434",
    "temperature": 0.7,
    "maxTokens": 2048,
    "stream": true
  }
}
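Depending on the Cursor version, the custom-model settings may expect an OpenAI-compatible API rather than Ollama’s native one (for example, an “Override OpenAI Base URL” field). Ollama serves an OpenAI-compatible API under /v1, which you can verify before pointing Cursor at it; check your Cursor version’s documentation for the exact setting names:
# Verify Ollama's OpenAI-compatible endpoint (no API key is needed for a direct call)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "codellama:7b",
    "messages": [{"role": "user", "content": "Say hello"}]
  }'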
Advanced Configuration
1. Custom Model Configuration
Create a custom model configuration file:
# Create Modelfile
cat > Modelfile << EOF
FROM codellama:7b
# Set custom parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 4096
# Set system prompt for coding
SYSTEM """You are a helpful coding assistant.
You write clean, efficient, and well-documented code.
Always explain your code and suggest best practices."""
# Set a prompt template (Ollama Modelfile Go-template syntax)
TEMPLATE """{{ .System }}

User: {{ .Prompt }}
Assistant: """
EOF
# Create custom model
ollama create my-coder -f Modelfile
# Use the custom model
ollama run my-coder
2. Python Integration
# ollama_client.py
import requests
import json


class OllamaClient:
    def __init__(self, base_url="http://localhost:11434"):
        self.base_url = base_url

    def generate(self, model: str, prompt: str, stream: bool = False):
        """Generate text using Ollama's /api/generate endpoint."""
        url = f"{self.base_url}/api/generate"
        payload = {
            "model": model,
            "prompt": prompt,
            "stream": stream
        }
        # stream=stream keeps the connection open so chunks can be read as they arrive
        response = requests.post(url, json=payload, stream=stream)
        if stream:
            return self._handle_stream(response)
        else:
            return response.json()

    def chat(self, model: str, messages: list, stream: bool = False):
        """Chat with a model using the messages format (/api/chat)."""
        url = f"{self.base_url}/api/chat"
        payload = {
            "model": model,
            "messages": messages,
            "stream": stream
        }
        response = requests.post(url, json=payload, stream=stream)
        if stream:
            return self._handle_stream(response)
        else:
            return response.json()

    def _handle_stream(self, response):
        """Yield each JSON chunk of a streaming response."""
        for line in response.iter_lines():
            if line:
                data = json.loads(line)
                yield data

    def list_models(self):
        """List locally available models (/api/tags)."""
        url = f"{self.base_url}/api/tags"
        response = requests.get(url)
        return response.json()


# Usage example
if __name__ == "__main__":
    client = OllamaClient()

    # List models
    models = client.list_models()
    print("Available models:", models)

    # Generate code
    response = client.generate(
        model="codellama:7b",
        prompt="Write a Python function to sort a list of dictionaries by a key"
    )
    print(response['response'])
3. Node.js Integration
// ollama-client.js
const axios = require('axios');

class OllamaClient {
  constructor(baseUrl = 'http://localhost:11434') {
    this.baseUrl = baseUrl;
  }

  async generate(model, prompt, stream = false) {
    const url = `${this.baseUrl}/api/generate`;
    const response = await axios.post(url, {
      model,
      prompt,
      stream
    }, {
      responseType: stream ? 'stream' : 'json'
    });
    if (stream) {
      return this.handleStream(response.data);
    }
    return response.data;
  }

  async chat(model, messages, stream = false) {
    const url = `${this.baseUrl}/api/chat`;
    const response = await axios.post(url, {
      model,
      messages,
      stream
    }, {
      responseType: stream ? 'stream' : 'json'
    });
    if (stream) {
      return this.handleStream(response.data);
    }
    return response.data;
  }

  handleStream(stream) {
    return new Promise((resolve, reject) => {
      let fullResponse = '';
      stream.on('data', (chunk) => {
        const lines = chunk.toString().split('\n').filter(line => line.trim());
        lines.forEach(line => {
          try {
            const data = JSON.parse(line);
            if (data.response) {
              fullResponse += data.response;
              process.stdout.write(data.response);
            }
            if (data.done) {
              resolve(fullResponse);
            }
          } catch (e) {
            // Skip invalid JSON
          }
        });
      });
      stream.on('error', reject);
    });
  }

  async listModels() {
    const url = `${this.baseUrl}/api/tags`;
    const response = await axios.get(url);
    return response.data;
  }
}

// Usage example
async function main() {
  const client = new OllamaClient();

  // List models
  const models = await client.listModels();
  console.log('Available models:', models);

  // Generate code
  const response = await client.generate(
    'codellama:7b',
    'Write a JavaScript function to debounce a function call'
  );
  console.log('\nResponse:', response.response);
}

main().catch(console.error);
Testing the Integration
1. Verify Ollama is Running
# Check if Ollama service is running
curl http://localhost:11434/api/tags
# Expected output: JSON with list of models
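If jq is available, you can reduce that JSON to just the model names (this assumes the default response shape: a top-level models array with name fields):
curl -s http://localhost:11434/api/tags | jq -r '.models[].name'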
2. Test Model Response
# Test model directly
ollama run codellama:7b "Write a hello world function in Python"
# Test via API
curl http://localhost:11434/api/generate -d '{
  "model": "codellama:7b",
  "prompt": "Write a hello world function in Python",
  "stream": false
}'
3. Test Cursor Integration
- Open Cursor IDE
- Open any code file
- Use Cursor’s AI features (Cmd/Ctrl + K for inline edit, Cmd/Ctrl + L for chat)
- Verify that responses are coming from your local Ollama model (the monitoring commands in the next step help confirm that requests are reaching Ollama)
4. Monitor Performance
# Check Ollama logs
# macOS
tail -f ~/.ollama/logs/server.log
# Linux (systemd service)
journalctl -u ollama -f
# Check system resources
# macOS
top -pid $(pgrep ollama)
# Linux
top -p $(pgrep ollama)
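ollama ps is also useful for monitoring: it lists the models currently loaded in memory, their size, and whether they are running on CPU or GPU.
# Show loaded models, their memory footprint, and CPU/GPU placement
ollama ps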
Performance Optimization
1. Model Selection
Choose models based on your hardware:
# For 16GB RAM systems
ollama pull codellama:7b # ~4GB, fast
ollama pull mistral:7b # ~4GB, efficient
# For 32GB+ RAM systems
ollama pull codellama:13b # ~7GB, better quality
ollama pull llama2:13b # ~7GB, general purpose
# For systems with powerful GPUs
ollama pull codellama:34b # ~20GB, best quality
2. GPU Acceleration
# Check whether a loaded model is running on CPU or GPU (PROCESSOR column)
ollama ps
# For NVIDIA GPUs, ensure CUDA is available
nvidia-smi
# Ollama should automatically use GPU if available
3. Memory Management
# Limit number of loaded models
export OLLAMA_MAX_LOADED_MODELS=1
# Set context window size (affects memory usage)
# Edit model's Modelfile
PARAMETER num_ctx 2048 # Lower = less memory
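The API also accepts an options object per request, so num_ctx can be overridden without maintaining a separate Modelfile (a quick sketch using the generate endpoint):
curl http://localhost:11434/api/generate -d '{
  "model": "codellama:7b",
  "prompt": "Write a short docstring for a sort function",
  "options": {"num_ctx": 2048},
  "stream": false
}'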
4. Response Speed
# Use smaller models for faster responses
ollama pull codellama:7b
# Reduce context window for faster processing
PARAMETER num_ctx 1024
# Adjust temperature for more deterministic output
PARAMETER temperature 0.5 # Lower = more deterministic; little effect on speed
Troubleshooting
1. Common Issues
Issue: Ollama not starting
# Check if port is already in use
lsof -i :11434
# Kill existing process
killall ollama
# Restart Ollama
ollama serve
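On Linux, the install script typically registers Ollama as a systemd service, so a manual ollama serve can fail because the service already owns port 11434. In that case, manage it through systemd instead (assuming the default service name):
# Check and restart the background service
sudo systemctl status ollama
sudo systemctl restart ollama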
Issue: Model not found
# List available models
ollama list
# Pull the model again
ollama pull codellama:7b
# Check model name spelling
ollama show codellama:7b
Issue: Slow responses
# Check system resources
top
# Use smaller model
ollama pull codellama:7b # Instead of 13b or 34b
# Reduce context window
# Edit Modelfile with: PARAMETER num_ctx 1024
Issue: Cursor not connecting
# Verify Ollama API is accessible
curl http://localhost:11434/api/tags
# Check Cursor settings
# Ensure endpoint is: http://localhost:11434
# Ensure model name matches: codellama:7b
# Check Cursor logs
# macOS: ~/Library/Logs/Cursor/
# Linux: ~/.config/Cursor/logs/
# Windows: %APPDATA%\Cursor\logs\
2. Debug Mode
# Run Ollama in debug mode
OLLAMA_DEBUG=1 ollama serve
# Check detailed logs
tail -f ~/.ollama/logs/server.log   # macOS
journalctl -u ollama -f             # Linux (systemd service)
Best Practices
1. Model Management
# Keep only models you actively use
ollama list
ollama rm unused-model
# Regularly update models
ollama pull codellama:7b # Re-pull to get updates
2. Resource Management
# Monitor disk usage
du -sh ~/.ollama/models/
# Clean up unused models
ollama list
ollama rm old-model-name
3. Security Considerations
# If exposing Ollama over network, use authentication
# Set OLLAMA_HOST to specific interface
export OLLAMA_HOST=127.0.0.1:11434 # Local only
# For remote access, use reverse proxy with authentication
# Example: nginx with basic auth
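As a rough sketch of the reverse-proxy approach, the commands below put HTTP basic auth in front of Ollama with nginx; the username, port, server_name, and file paths are placeholders to adapt to your environment:
# Create a credentials file (htpasswd comes from apache2-utils / httpd-tools)
sudo htpasswd -c /etc/nginx/.htpasswd devuser
# Write a minimal proxy config
sudo tee /etc/nginx/conf.d/ollama.conf << 'EOF'
server {
    listen 8443;
    server_name ollama.internal.example;
    location / {
        auth_basic           "Ollama";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass           http://127.0.0.1:11434;
    }
}
EOF
sudo nginx -s reload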
4. Development Workflow
- Start with smaller models for faster iteration
- Use larger models for complex code generation
- Test prompts before using in production code
- Monitor performance and adjust model selection
- Keep models updated for latest improvements
Complete Setup Script
#!/bin/bash
# setup-ollama-cursor.sh
set -e

echo "Setting up Ollama for Cursor integration..."

# Install Ollama
if [[ "$OSTYPE" == "darwin"* ]]; then
    # macOS
    if ! command -v ollama &> /dev/null; then
        echo "Installing Ollama via Homebrew..."
        brew install ollama
    fi
elif [[ "$OSTYPE" == "linux-gnu"* ]]; then
    # Linux
    if ! command -v ollama &> /dev/null; then
        echo "Installing Ollama..."
        curl -fsSL https://ollama.ai/install.sh | sh
    fi
fi

# Start the Ollama service if it is not already running
if ! curl -s http://localhost:11434/api/tags > /dev/null; then
    echo "Starting Ollama service..."
    ollama serve &
    # Wait for the service to start
    sleep 5
fi

# Pull recommended model
echo "Pulling CodeLlama 7B model..."
ollama pull codellama:7b

# Verify installation
echo "Verifying installation..."
ollama list

echo ""
echo "Setup complete!"
echo ""
echo "Next steps:"
echo "1. Configure Cursor to use: http://localhost:11434"
echo "2. Set model to: codellama:7b"
echo "3. Test the integration in Cursor"
echo ""
echo "To test: ollama run codellama:7b 'Hello, world!'"
Conclusion
Running local AI models with Ollama and connecting them to Cursor provides:
- Complete Privacy: All code and data stays on your machine
- Cost Control: No per-token charges, just hardware costs
- Offline Capability: Works without internet connectivity
- Customization: Fine-tune models for your specific needs
- Performance: Fast responses with local processing
Key takeaways:
- Ollama makes it easy to run large language models locally
- Choose model size based on your hardware capabilities
- Cursor can be configured to use local models via API endpoint
- Start with smaller models (7B) for faster iteration
- Monitor system resources and adjust accordingly
- Keep models updated for latest improvements
By following this guide, you can set up a private, cost-effective AI coding assistant that runs entirely on your local machine while maintaining the benefits of AI-assisted development.