Performance and GPU
GPU Support
Automatic Detection
The AI Service automatically detects and uses a GPU when one is available:

import tensorflow as tf

# Check GPU availability
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    logger.info(f"GPU detected: {len(gpus)} device(s)")
    # Configure GPU memory growth so TensorFlow does not claim all VRAM at startup
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
else:
    logger.info("No GPU detected, using CPU")

Environment Variable
Force CPU mode even if a GPU is available:

USE_GPU=false uv run uvicorn src.main:app
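How the service honors this flag internally is not shown here; a minimal sketch of one way to do it at startup, assuming src.main reads the variable before loading the model:

import os
import tensorflow as tf

# Hide all GPUs from TensorFlow when USE_GPU=false (assumed startup hook)
if os.environ.get("USE_GPU", "true").lower() in ("false", "0", "no"):
    tf.config.set_visible_devices([], "GPU")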
GPU Requirements

Minimum:
- CUDA 11.2+
- cuDNN 8.1+
- 4GB GPU memory
Recommended:
- CUDA 12.0+
- cuDNN 8.9+
- 8GB+ GPU memory
- NVIDIA GPU (Tesla T4, V100, A100)
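To confirm that the installed TensorFlow build matches these CUDA/cuDNN requirements, its build info can be queried; a quick diagnostic snippet, not part of the service itself:

import tensorflow as tf

# Report the CUDA/cuDNN versions this TensorFlow build was compiled against
build = tf.sysconfig.get_build_info()
print("CUDA:", build.get("cuda_version"), "cuDNN:", build.get("cudnn_version"))
print("GPUs visible:", tf.config.list_physical_devices("GPU"))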
Performance Benchmarks
Inference Latency
Latency by configuration:
- CPU (4 cores): p50: 350ms, p95: 600ms, p99: 800ms
- GPU (T4): p50: 80ms, p95: 120ms, p99: 150ms
- GPU (V100): p50: 45ms, p95: 70ms, p99: 90ms
- GPU (A100): p50: 30ms, p95: 50ms, p99: 65ms
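Figures like these can be reproduced with a simple client-side probe; a rough sketch, assuming the service exposes a /predict endpoint on localhost:8080 and that sample_payload() (a hypothetical helper) builds a valid request body:

import time
import numpy as np
import requests

def sample_payload():
    # Hypothetical helper: build one valid /predict request body
    return {"features": [0.0] * 32}

latencies = []
for _ in range(500):
    start = time.perf_counter()
    requests.post("http://localhost:8080/predict", json=sample_payload(), timeout=5)
    latencies.append((time.perf_counter() - start) * 1000)

for p in (50, 95, 99):
    print(f"p{p}: {np.percentile(latencies, p):.0f} ms")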
Throughput
Requests per second by configuration:
- CPU (4 cores): 10 req/s (20 concurrent users)
- GPU (T4): 50 req/s (100 concurrent users)
- GPU (V100): 100 req/s (200 concurrent users)
- GPU (A100): 150 req/s (300 concurrent users)
Memory Usage
Memory requirements:
- CPU Mode: 1.5GB total (500MB model + 1GB runtime)
- GPU Mode: 2GB RAM + 3GB VRAM (500MB model in RAM, 2GB in VRAM, 1.5GB runtime RAM, 1GB runtime VRAM)
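If VRAM is tight, TensorFlow can also be capped to a fixed budget instead of growing on demand; a sketch using a 3GB limit to match the figures above (the limit is an example, not a service default):

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Cap TensorFlow to ~3GB of VRAM on the first GPU (example value)
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=3072)],
    )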
Optimization Strategies
Model Optimization
TensorFlow Lite

Convert to TFLite for faster inference on CPU:

# Convert model
python scripts/convert_to_tflite.py \
    --model models/v2.1.0/model/best_model.h5 \
    --output models/v2.1.0/model/model.tflite
# Reduces size by ~4x
# Faster CPU inference by ~2x

Quantization

Reduce model size and improve performance:
# Post-training quantization (INT8 weights via dynamic-range quantization)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
# Size: 50MB → 12MB
# Accuracy drop: <1%
# Inference speed: +40%
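The converted model is then served through the TFLite interpreter rather than the Keras API; a minimal sketch of a single inference with it (input shape and dtype come from the model itself):

import numpy as np
import tensorflow as tf

# Load the converted model and allocate its tensors
interpreter = tf.lite.Interpreter(model_path="models/v2.1.0/model/model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Run one inference with a dummy input of the expected shape and dtype
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
prediction = interpreter.get_tensor(output_details[0]["index"])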
Batching

Process multiple requests in a single batch:
import numpy as np

# Instead of predicting one request at a time:
for request in requests:
    prediction = model.predict(request.features[np.newaxis, :])

# Use batching:
batch = np.stack([r.features for r in requests])
predictions = model.predict(batch)  # ~3x faster
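In the HTTP path the same idea is typically applied as server-side micro-batching: requests that arrive within a short window are grouped and predicted together. A sketch of one way to do this with asyncio; the class, window, and batch size are illustrative, not part of the service:

import asyncio
import numpy as np

class MicroBatcher:
    """Collect requests for a few milliseconds, then run one batched predict."""

    def __init__(self, model, max_batch=32, window_ms=10):
        self.model = model
        self.max_batch = max_batch
        self.window_ms = window_ms
        self.queue = asyncio.Queue()

    async def predict(self, features):
        # Called per request; resolves once the batched prediction is ready
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((features, future))
        return await future

    async def run(self):
        # Background task: drain the queue in small, time-bounded batches
        loop = asyncio.get_running_loop()
        while True:
            items = [await self.queue.get()]
            deadline = loop.time() + self.window_ms / 1000
            while len(items) < self.max_batch and loop.time() < deadline:
                try:
                    items.append(await asyncio.wait_for(
                        self.queue.get(), deadline - loop.time()))
                except asyncio.TimeoutError:
                    break
            batch = np.stack([features for features, _ in items])
            # Run the blocking predict off the event loop
            predictions = await loop.run_in_executor(None, self.model.predict, batch)
            for (_, future), pred in zip(items, predictions):
                future.set_result(pred)

A background task started in the lifespan hook (asyncio.create_task(batcher.run())) drains the queue, and each request handler simply awaits batcher.predict(features).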
Caching

Cache predictions for frequently seen inputs:
from functools import lru_cache
import numpy as np

@lru_cache(maxsize=1000)
def cached_predict(features_key: tuple):
    # The hashable key is the feature vector as a tuple of floats
    features = np.array(features_key)[np.newaxis, :]
    return model.predict(features)

# Instant response for cached predictions
# Useful for demo/test accounts

Model Warm-up
Pre-warm the model at startup to avoid cold-start latency:
from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Run one dummy prediction so the first real request avoids cold-start latency
    dummy_input = create_dummy_input()
    _ = predictor.predict(dummy_input)
    logger.info("Model warmed up")
    yield

Scaling
Horizontal Scaling
Add more service replicas:
# Kubernetes
kubectl scale deployment ai-service --replicas=5
# Docker Compose
docker-compose up --scale ai-service=5

Load Balancing
Distribute requests across replicas:
# Kubernetes Service
apiVersion: v1
kind: Service
metadata:
  name: ai-service
spec:
  selector:
    app: ai-service
  type: LoadBalancer
  ports:
    - port: 8080
      targetPort: 8080

Auto-scaling
Scale based on CPU/GPU usage:
# Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Monitoring
Key Metrics
Latency:
- Request processing time
- Model inference time
- Data preprocessing time
Throughput:
- Requests per second
- Concurrent requests
- Queue depth
Resource Usage:
- CPU utilization
- GPU utilization
- Memory consumption
- GPU memory
Accuracy:
- Prediction confidence distribution
- Model version usage
- Error rates
Prometheus Metrics
Export metrics for monitoring:
from prometheus_client import Counter, Histogram

# Request metrics
inference_duration = Histogram(
    'inference_duration_seconds',
    'Time spent in model inference',
    ['model_version']
)
prediction_count = Counter(
    'predictions_total',
    'Total number of predictions',
    ['model_version', 'status']
)

# Usage
with inference_duration.labels(model_version='v2.1.0').time():
    prediction = model.predict(model_input)
prediction_count.labels(model_version='v2.1.0', status='success').inc()
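For Prometheus to scrape these metrics, the service has to expose them over HTTP. One common way with FastAPI is to mount prometheus_client's ASGI app; the /metrics path is a convention, not a documented endpoint of this service:

from fastapi import FastAPI
from prometheus_client import make_asgi_app

app = FastAPI()

# Expose all registered prometheus_client metrics at /metrics
app.mount("/metrics", make_asgi_app())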
Grafana Dashboard

Visualize metrics in Grafana:
- Inference latency over time
- GPU utilization heatmap
- Request rate by endpoint
- Error rate trends
Troubleshooting
High Latency
Symptoms: Response times greater than 1 second
Solutions:
- Check GPU utilization (nvidia-smi)
- Verify the model is loaded correctly
- Enable batching if not already
- Consider model quantization
- Add more replicas
Out of Memory
Symptoms: CUDA_OUT_OF_MEMORY errors
Solutions:
- Reduce batch size
- Enable GPU memory growth
- Use smaller model
- Upgrade GPU with more memory
Low Throughput
Symptoms: Less than 10 req/s on GPU
Solutions:
- Enable request batching
- Optimize data preprocessing
- Use async I/O
- Check for blocking operations
- Profile with TensorFlow Profiler