
Performance and GPU

GPU Support

Automatic Detection

The AI Service automatically detects and uses GPU when available:

import tensorflow as tf

# Check GPU availability
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    logger.info("GPU detected")
    # Configure GPU memory growth
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
else:
    logger.info("No GPU detected, using CPU")

Environment Variable

Force CPU mode even if GPU is available:

USE_GPU=false uv run uvicorn src.main:app
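
A minimal sketch of how such a flag can be honored before the model loads; the configure_devices helper below is hypothetical and not necessarily how the service implements it:

import os
import tensorflow as tf

def configure_devices() -> None:
    """Hide all GPUs from TensorFlow when USE_GPU=false (hypothetical helper)."""
    if os.getenv("USE_GPU", "true").lower() in ("false", "0", "no"):
        # Must run before TensorFlow initializes any GPU
        tf.config.set_visible_devices([], "GPU")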

GPU Requirements

Minimum:

  • CUDA 11.2+
  • cuDNN 8.1+
  • 4GB GPU memory

Recommended:

  • CUDA 12.0+
  • cuDNN 8.9+
  • 8GB+ GPU memory
  • NVIDIA GPU (Tesla T4, V100, A100)
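
To check which CUDA and cuDNN versions a TensorFlow build was compiled against, and which GPUs it can see at runtime, the following works on TensorFlow 2.x (the version keys are only populated on GPU-enabled builds):

import tensorflow as tf

# Report the CUDA/cuDNN versions this TensorFlow build expects
build = tf.sysconfig.get_build_info()
print("CUDA:", build.get("cuda_version"), "cuDNN:", build.get("cudnn_version"))

# List the GPUs visible to TensorFlow at runtime
print("GPUs:", tf.config.list_physical_devices("GPU"))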

Performance Benchmarks

Inference Latency

Latency by configuration:

  • CPU (4 cores): p50: 350ms, p95: 600ms, p99: 800ms
  • GPU (T4): p50: 80ms, p95: 120ms, p99: 150ms
  • GPU (V100): p50: 45ms, p95: 70ms, p99: 90ms
  • GPU (A100): p50: 30ms, p95: 50ms, p99: 65ms
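
Percentiles like these can be measured with a simple client-side loop; a minimal sketch, assuming the service listens on localhost:8080 and exposes a /predict endpoint (both assumptions, and the request body is a placeholder):

import time
import statistics
import requests  # third-party HTTP client

URL = "http://localhost:8080/predict"       # assumed endpoint
payload = {"features": [0.0] * 32}          # placeholder body; use the service's real schema

latencies = []
for _ in range(200):
    start = time.perf_counter()
    requests.post(URL, json=payload, timeout=5)
    latencies.append((time.perf_counter() - start) * 1000)  # milliseconds

# statistics.quantiles with n=100 returns the 1st..99th percentile cut points
cuts = statistics.quantiles(latencies, n=100)
print(f"p50={cuts[49]:.0f}ms  p95={cuts[94]:.0f}ms  p99={cuts[98]:.0f}ms")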

Throughput

Requests per second by configuration:

  • CPU (4 cores): 10 req/s (20 concurrent users)
  • GPU (T4): 50 req/s (100 concurrent users)
  • GPU (V100): 100 req/s (200 concurrent users)
  • GPU (A100): 150 req/s (300 concurrent users)

Memory Usage

Memory requirements:

  • CPU Mode: 1.5GB total (500MB model + 1GB runtime)
  • GPU Mode: 2GB RAM + 3GB VRAM (model: 500MB RAM + 2GB VRAM; runtime: 1.5GB RAM + 1GB VRAM)

Optimization Strategies

Model Optimization

TensorFlow Lite

Convert the model to TFLite for faster inference on CPU:

# Convert model
python scripts/convert_to_tflite.py \
  --model models/v2.1.0/model/best_model.h5 \
  --output models/v2.1.0/model/model.tflite

# Reduces size by ~4x
# Faster CPU inference by ~2x
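
Once converted, the .tflite model is loaded with the TensorFlow Lite interpreter rather than Keras; a minimal inference sketch (the model path matches the conversion command above, and the dummy input is a placeholder):

import numpy as np
import tensorflow as tf

# Load the converted model and run a single inference
interpreter = tf.lite.Interpreter(model_path="models/v2.1.0/model/model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input matching the model's expected shape and dtype
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
prediction = interpreter.get_tensor(output_details[0]["index"])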

Quantization

Reduce model size and improve performance:

# INT8 quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# Size: 50MB → 12MB
# Accuracy drop: <1%
# Inference speed: +40%
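
The snippet above applies the default (dynamic-range) optimization. Full integer quantization additionally needs a representative dataset so activation ranges can be calibrated; a sketch, where representative_samples stands in for a few hundred real inputs:

import numpy as np
import tensorflow as tf

def representative_dataset():
    # Yield real inputs so the converter can calibrate activation ranges
    for sample in representative_samples:
        yield [np.expand_dims(sample, axis=0).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8    # optional: quantize the I/O tensors too
converter.inference_output_type = tf.int8
tflite_model = converter.convert()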

Batching

Process multiple requests in a single batch:

# Instead of:
for request in requests:
    prediction = model.predict(request)

# Use batching:
batch = np.stack([r.features for r in requests])
predictions = model.predict(batch)
# 3x faster
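
In a live service the same idea applies to concurrent requests: collect whatever arrives within a few milliseconds and run one predict() call for the group. A micro-batching sketch (the MicroBatcher class is hypothetical, not the service's actual implementation):

import asyncio
import numpy as np

class MicroBatcher:
    """Hypothetical helper: groups requests arriving within a short window into one batch."""

    def __init__(self, model, max_batch: int = 32, window_ms: float = 10.0):
        self.model = model
        self.max_batch = max_batch
        self.window = window_ms / 1000.0
        self.queue: asyncio.Queue = asyncio.Queue()

    async def predict(self, features: np.ndarray) -> np.ndarray:
        # Callers await a future that the batch loop resolves
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((features, future))
        return await future

    async def run(self) -> None:
        # Start as a background task at startup: asyncio.create_task(batcher.run())
        while True:
            items = [await self.queue.get()]
            deadline = asyncio.get_running_loop().time() + self.window
            while len(items) < self.max_batch:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    items.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            batch = np.stack([features for features, _ in items])
            # Run the blocking predict() off the event loop (Python 3.9+)
            predictions = await asyncio.to_thread(self.model.predict, batch)
            for (_, future), prediction in zip(items, predictions):
                future.set_result(prediction)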

Caching

Cache predictions for frequently seen inputs:

from functools import lru_cache
import numpy as np

@lru_cache(maxsize=1000)
def cached_predict(features_key: tuple) -> np.ndarray:
    # lru_cache needs hashable arguments, so callers pass the features as a tuple
    features = np.array(features_key).reshape(1, -1)
    return model.predict(features)

# Instant response for cached predictions
# Useful for demo/test accounts

Model Warm-up

Pre-warm the model at startup to avoid cold-start latency:

from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Warm up the model to avoid a slow first request
    dummy_input = create_dummy_input()
    _ = predictor.predict(dummy_input)
    logger.info("Model warmed up")
    yield

Scaling

Horizontal Scaling

Add more service replicas:

# Kubernetes
kubectl scale deployment ai-service --replicas=5

# Docker Compose
docker-compose up --scale ai-service=5

Load Balancing

Distribute requests across replicas:

# Kubernetes Service
apiVersion: v1
kind: Service
metadata:
  name: ai-service
spec:
  selector:
    app: ai-service
  type: LoadBalancer
  ports:
    - port: 8080
      targetPort: 8080

Auto-scaling

Scale based on CPU/GPU usage:

# Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
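
The example above scales on CPU only. Scaling on GPU utilization typically requires exposing GPU metrics as custom metrics, for example via NVIDIA's DCGM exporter plus the Prometheus Adapter; a sketch of the metrics block, assuming a per-pod metric named DCGM_FI_DEV_GPU_UTIL is available through the custom metrics API:

  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL
        target:
          type: AverageValue
          averageValue: "70"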

Monitoring

Key Metrics

Latency:

  • Request processing time
  • Model inference time
  • Data preprocessing time

Throughput:

  • Requests per second
  • Concurrent requests
  • Queue depth

Resource Usage:

  • CPU utilization
  • GPU utilization
  • Memory consumption
  • GPU memory

Accuracy:

  • Prediction confidence distribution
  • Model version usage
  • Error rates

Prometheus Metrics

Export metrics for monitoring:

from prometheus_client import Counter, Histogram

# Request metrics
inference_duration = Histogram(
    'inference_duration_seconds',
    'Time spent in model inference',
    ['model_version']
)
prediction_count = Counter(
    'predictions_total',
    'Total number of predictions',
    ['model_version', 'status']
)

# Usage
with inference_duration.labels(model_version='v2.1.0').time():
    prediction = model.predict(model_input)
prediction_count.labels(model_version='v2.1.0', status='success').inc()
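
These metrics still have to be exposed over HTTP for Prometheus to scrape. With FastAPI, one common approach is to mount prometheus_client's ASGI app at the conventional /metrics path:

from fastapi import FastAPI
from prometheus_client import make_asgi_app

app = FastAPI()

# Expose all registered metrics at /metrics for Prometheus to scrape
app.mount("/metrics", make_asgi_app())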

Grafana Dashboard

Visualize metrics in Grafana:

  • Inference latency over time
  • GPU utilization heatmap
  • Request rate by endpoint
  • Error rate trends

Troubleshooting

High Latency

Symptoms: Response times greater than 1 second

Solutions:

  1. Check GPU utilization (nvidia-smi)
  2. Verify model is loaded correctly
  3. Enable request batching if it is not already enabled
  4. Consider model quantization
  5. Add more replicas

Out of Memory

Symptoms: CUDA out-of-memory errors (e.g. CUDA_ERROR_OUT_OF_MEMORY)

Solutions:

  1. Reduce batch size
  2. Enable GPU memory growth
  3. Use smaller model
  4. Upgrade GPU with more memory
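
For step 2, memory growth is enabled as shown in the GPU Support section. Alternatively, the process can be given a hard VRAM cap so that memory pressure surfaces predictably; a sketch (the 4096 MB limit is an assumption, tune it to your GPU):

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Cap this process at ~4GB of VRAM instead of letting TensorFlow reserve it all
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=4096)],
    )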

Low Throughput

Symptoms: Less than 10 req/s on GPU

Solutions:

  1. Enable request batching
  2. Optimize data preprocessing
  3. Use async I/O
  4. Check for blocking operations
  5. Profile with TensorFlow Profiler
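
For step 5, a trace can be captured around a handful of predictions and inspected in TensorBoard's Profiler tab; a minimal sketch (the log directory, model, and dummy_input are placeholders):

import tensorflow as tf

# Capture a short profiling trace around a few inference calls
tf.profiler.experimental.start("/tmp/ai-service-profile")
for _ in range(10):
    model.predict(dummy_input)
tf.profiler.experimental.stop()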