
# DevOps in the Age of AI (Part Deux)
Oct 7, 2024 · by Chiefkemist
## Model Lifecycle
```mermaid
graph LR
    A[Model Training] --> B[Model Saving]
    B --> C[Model + API Packaging - Docker Container]
    C --> D[Serve via API]
```
## Model File Format Conversions (Optional)
```mermaid
graph LR
    A(TensorFlow H5 Model) --> B(Convert to ONNX) --> C(ONNX Model)
    D(PyTorch PT Model) --> E(Convert to ONNX) --> F(ONNX Model)
    G(Python Pickle Model) --> G
```
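For the TensorFlow/Keras path, the tf2onnx package ships a CLI that handles the conversion in a single command. Here is a minimal sketch, assuming tf2onnx is installed via pip and the model is saved as model.h5 (a hypothetical filename); the PyTorch path has no equivalent one-liner and typically goes through torch.onnx.export instead:

```sh
# Convert a Keras H5 model to ONNX with the tf2onnx CLI
pip install tf2onnx
python -m tf2onnx.convert --keras model.h5 --output model.onnx
```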
## Prior Art
### Open Source Models
#### Ollama
```sh
brew install ollama   # install the Ollama CLI and server
ollama pull llama3.2  # download the llama3.2 model weights
ollama serve          # start the API server (port 11434 by default)
```
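With the server running, a quick smoke test over Ollama's HTTP API looks like this (it listens on port 11434 by default):

```sh
# One-shot completion against the local Ollama server
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```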
#### LlamaCpp
```sh
brew install llama.cpp
llama-server --hf-repo hugging-quants/Llama-3.2-1B-Instruct-Q8_0-GGUF \
  --hf-file llama-3.2-1b-instruct-q8_0.gguf \
  -c 2048
```
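llama-server exposes an OpenAI-compatible HTTP API, so it can be smoke-tested the same way (assuming the default port 8080):

```sh
# Chat completion against the local llama.cpp server
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```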
### Ollama in Docker
```dockerfile
FROM ollama/ollama:0.3.12

# Listen on all interfaces, port 8080
ENV OLLAMA_HOST 0.0.0.0:8080

# Store model weight files in /models
ENV OLLAMA_MODELS /models

# Reduce logging verbosity
ENV OLLAMA_DEBUG false

# Never unload model weights from the GPU
ENV OLLAMA_KEEP_ALIVE -1

# Store the model weights in the container image
ENV MODEL gemma2:9b
RUN ollama serve & sleep 5 && ollama pull $MODEL

# Start Ollama
ENTRYPOINT ["ollama", "serve"]
```
Supported variables:

- `MODEL` (build variable)
- `OLLAMA_HOST` (runtime variable)
- `OLLAMA_NUM_PARALLEL` (runtime variable)
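Before publishing anywhere, the image can be exercised locally; a sketch, with `ollama-gemma` as a placeholder tag:

```sh
# Build the image; the gemma2:9b weights are baked in at build time
docker build -t ollama-gemma .

# Run it, overriding one of the runtime variables
docker run -p 8080:8080 -e OLLAMA_NUM_PARALLEL=4 ollama-gemma
```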
### LlamaCpp in Docker
```dockerfile
FROM ghcr.io/ggerganov/llama.cpp:server

# Create directories for the server and models
RUN mkdir -p /app/models

# Download the model file into /app/models (repo/file match the llama-server example above)
ADD https://huggingface.co/hugging-quants/Llama-3.2-1B-Instruct-Q8_0-GGUF/resolve/main/llama-3.2-1b-instruct-q8_0.gguf /app/models/llama-3.2-1b-instruct-q8_0.gguf

EXPOSE 8080

# Command to run the server when the container starts
ENTRYPOINT ["llama-server", "-m", "/app/models/llama-3.2-1b-instruct-q8_0.gguf", "-c", "2048"]
```
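The same local smoke test works here; llama-server also serves a /health endpoint, which doubles as a container readiness probe (`llamacpp-server` is a placeholder tag):

```sh
docker build -t llamacpp-server .
docker run -d -p 8080:8080 llamacpp-server

# Should return a JSON status once the model has loaded
curl http://localhost:8080/health
```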
## Let's Port to Dagger and Publish to Google Container Registry
### Dagger
```sh
brew install dagger
```

Example:

```sh
dagger call --interactive function-name \
  --project-path=./path-to-project-in-repo \
  --src-dir=https://user:$GITHUB_TOKEN@github.com/user/reponame#branchname \
  --image-name="gcr.io/organization/project/image-name"
```
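The image name above points at gcr.io, so Docker needs push credentials for Google Container Registry first. A minimal sketch, assuming the gcloud CLI is installed and authenticated:

```sh
# One-time: let Docker (and therefore Dagger) push to gcr.io using gcloud credentials
gcloud auth configure-docker

# Token consumed by the --src-dir URL in the example above
export GITHUB_TOKEN=<your-github-token>
```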
## Deploy UI App
```sh
npm run build  # build the production bundle
cd client
fly launch     # create and deploy the Fly.io app
```
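Once `fly launch` has created the app, subsequent releases only need a deploy (assuming flyctl is installed and you are logged in):

```sh
# Rebuild and ship a new release of the existing Fly.io app
fly deploy
```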