Performance and GPU
GPU Support
Automatic Detection
The AI Service automatically detects and uses a GPU when one is available:

import tensorflow as tf

# Check GPU availability
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    logger.info(f"GPU detected: {len(gpus)} device(s)")
    # Configure GPU memory growth so TensorFlow does not claim all VRAM at startup
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
else:
    logger.info("No GPU detected, using CPU")

Environment Variable
Force CPU mode even if a GPU is available:

USE_GPU=false uv run uvicorn src.main:app
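How the service honors this flag internally is not shown here; a minimal sketch of one way to do it at startup, assuming src.main reads the variable before loading the model:

import os
import tensorflow as tf

# Hide all GPUs from TensorFlow when USE_GPU=false (assumed startup hook)
if os.environ.get("USE_GPU", "true").lower() in ("false", "0", "no"):
    tf.config.set_visible_devices([], "GPU")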
GPU Requirements

Minimum:
- CUDA 11.2+
- cuDNN 8.1+
- 4GB GPU memory
Recommended:
- CUDA 12.0+
- cuDNN 8.9+
- 8GB+ GPU memory
- NVIDIA GPU (Tesla T4, V100, A100)
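To confirm that the installed TensorFlow build matches these CUDA/cuDNN requirements, its build info can be queried; a quick diagnostic snippet, not part of the service itself:

import tensorflow as tf

# Report the CUDA/cuDNN versions this TensorFlow build was compiled against
build = tf.sysconfig.get_build_info()
print("CUDA:", build.get("cuda_version"), "cuDNN:", build.get("cudnn_version"))
print("GPUs visible:", tf.config.list_physical_devices("GPU"))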
Performance Benchmarks
Inference Latency
Latency by configuration:
- CPU (4 cores): p50: 350ms, p95: 600ms, p99: 800ms
- GPU (T4): p50: 80ms, p95: 120ms, p99: 150ms
- GPU (V100): p50: 45ms, p95: 70ms, p99: 90ms
- GPU (A100): p50: 30ms, p95: 50ms, p99: 65ms
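Figures like these can be reproduced with a simple client-side probe; a rough sketch, assuming the service exposes a /predict endpoint on localhost:8080 and that sample_payload() (a hypothetical helper) builds a valid request body:

import time
import numpy as np
import requests

def sample_payload():
    # Hypothetical helper: build one valid /predict request body
    return {"features": [0.0] * 32}

latencies = []
for _ in range(500):
    start = time.perf_counter()
    requests.post("http://localhost:8080/predict", json=sample_payload(), timeout=5)
    latencies.append((time.perf_counter() - start) * 1000)

for p in (50, 95, 99):
    print(f"p{p}: {np.percentile(latencies, p):.0f} ms")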
Throughput
Requests per second by configuration:
- CPU (4 cores): 10 req/s (20 concurrent users)
- GPU (T4): 50 req/s (100 concurrent users)
- GPU (V100): 100 req/s (200 concurrent users)
- GPU (A100): 150 req/s (300 concurrent users)
Memory Usage
Memory requirements:
- CPU Mode: 1.5GB total (500MB model + 1GB runtime)
- GPU Mode: 2GB RAM + 3GB VRAM (500MB model in RAM, 2GB in VRAM, 1.5GB runtime RAM, 1GB runtime VRAM)
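If VRAM is tight, TensorFlow can also be capped to a fixed budget instead of growing on demand; a sketch using a 3GB limit to match the figures above (the limit is an example, not a service default):

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Cap TensorFlow to ~3GB of VRAM on the first GPU (example value)
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=3072)],
    )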
Optimization Strategies
Model Optimization
TensorFlow Lite

Convert to TFLite for faster inference on CPU:

# Convert model
python scripts/convert_to_tflite.py \
    --model models/v2.1.0/model/best_model.h5 \
    --output models/v2.1.0/model/model.tflite
# Reduces size by ~4x
# Faster CPU inference by ~2x

Quantization

Reduce model size and improve performance:
# Post-training quantization (INT8 weights via dynamic-range quantization)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
# Size: 50MB → 12MB
# Accuracy drop: <1%
# Inference speed: +40%
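The converted model is then served through the TFLite interpreter rather than the Keras API; a minimal sketch of a single inference with it (input shape and dtype come from the model itself):

import numpy as np
import tensorflow as tf

# Load the converted model and allocate its tensors
interpreter = tf.lite.Interpreter(model_path="models/v2.1.0/model/model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Run one inference with a dummy input of the expected shape and dtype
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)
interpreter.invoke()
prediction = interpreter.get_tensor(output_details[0]["index"])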
Batching

Process multiple requests in a single batch:
import numpy as np

# Instead of predicting one request at a time:
for request in requests:
    prediction = model.predict(request.features[np.newaxis, :])

# Use batching:
batch = np.stack([r.features for r in requests])
predictions = model.predict(batch)  # ~3x faster
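In the HTTP path the same idea is typically applied as server-side micro-batching: requests that arrive within a short window are grouped and predicted together. A sketch of one way to do this with asyncio; the class, window, and batch size are illustrative, not part of the service:

import asyncio
import numpy as np

class MicroBatcher:
    """Collect requests for a few milliseconds, then run one batched predict."""

    def __init__(self, model, max_batch=32, window_ms=10):
        self.model = model
        self.max_batch = max_batch
        self.window_ms = window_ms
        self.queue = asyncio.Queue()

    async def predict(self, features):
        # Called per request; resolves once the batched prediction is ready
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((features, future))
        return await future

    async def run(self):
        # Background task: drain the queue in small, time-bounded batches
        loop = asyncio.get_running_loop()
        while True:
            items = [await self.queue.get()]
            deadline = loop.time() + self.window_ms / 1000
            while len(items) < self.max_batch and loop.time() < deadline:
                try:
                    items.append(await asyncio.wait_for(
                        self.queue.get(), deadline - loop.time()))
                except asyncio.TimeoutError:
                    break
            batch = np.stack([features for features, _ in items])
            # Run the blocking predict off the event loop
            predictions = await loop.run_in_executor(None, self.model.predict, batch)
            for (_, future), pred in zip(items, predictions):
                future.set_result(pred)

A background task started in the lifespan hook (asyncio.create_task(batcher.run())) drains the queue, and each request handler simply awaits batcher.predict(features).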
Caching

Cache predictions for frequently seen inputs:
from functools import lru_cache
import numpy as np

@lru_cache(maxsize=1000)
def cached_predict(features_key: tuple):
    # The hashable key is the feature vector as a tuple of floats
    features = np.array(features_key)[np.newaxis, :]
    return model.predict(features)

# Instant response for cached predictions
# Useful for demo/test accounts

Model Warm-up
Pre-warm the model at startup to avoid cold-start latency:
from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Run one dummy prediction so the first real request avoids cold-start latency
    dummy_input = create_dummy_input()
    _ = predictor.predict(dummy_input)
    logger.info("Model warmed up")
    yield

Scaling
Horizontal Scaling
Add more service replicas:
# Kubernetes
kubectl scale deployment ai-service --replicas=5
# Docker Compose
docker-compose up --scale ai-service=5

Load Balancing
Distribute requests across replicas:
# Kubernetes Service
apiVersion: v1
kind: Service
metadata:
  name: ai-service
spec:
  selector:
    app: ai-service
  type: LoadBalancer
  ports:
    - port: 8080
      targetPort: 8080

Auto-scaling
Scale based on CPU/GPU usage:
# Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Monitoring
Key Metrics
Latency:
- Request processing time
- Model inference time
- Data preprocessing time
Throughput:
- Requests per second
- Concurrent requests
- Queue depth
Resource Usage:
- CPU utilization
- GPU utilization
- Memory consumption
- GPU memory
Accuracy:
- Prediction confidence distribution
- Model version usage
- Error rates
Prometheus Metrics
Export metrics for monitoring:
from prometheus_client import Counter, Histogram

# Request metrics
inference_duration = Histogram(
    'inference_duration_seconds',
    'Time spent in model inference',
    ['model_version']
)
prediction_count = Counter(
    'predictions_total',
    'Total number of predictions',
    ['model_version', 'status']
)

# Usage
with inference_duration.labels(model_version='v2.1.0').time():
    prediction = model.predict(model_input)
prediction_count.labels(model_version='v2.1.0', status='success').inc()
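For Prometheus to scrape these metrics, the service has to expose them over HTTP. One common way with FastAPI is to mount prometheus_client's ASGI app; the /metrics path is a convention, not a documented endpoint of this service:

from fastapi import FastAPI
from prometheus_client import make_asgi_app

app = FastAPI()

# Expose all registered prometheus_client metrics at /metrics
app.mount("/metrics", make_asgi_app())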
Grafana Dashboard

Visualize metrics in Grafana:
- Inference latency over time
- GPU utilization heatmap
- Request rate by endpoint
- Error rate trends
Troubleshooting
High Latency
Symptoms: Response times greater than 1 second
Solutions:
- Check GPU utilization (nvidia-smi)
- Verify the model is loaded correctly
- Enable batching if not already
- Consider model quantization
- Add more replicas
Out of Memory
Symptoms: CUDA_OUT_OF_MEMORY errors
Solutions:
- Reduce batch size
- Enable GPU memory growth
- Use smaller model
- Upgrade GPU with more memory
Low Throughput
Symptoms: Less than 10 req/s on GPU
Solutions:
- Enable request batching
- Optimize data preprocessing
- Use async I/O
- Check for blocking operations
- Profile with TensorFlow Profiler