name: check-logs
description: Query and analyze logs using Grafana Loki for the Kagenti platform, search for errors, and investigate issues
Check Logs Skill
This skill helps you query and analyze logs from the Kagenti platform using Loki via Grafana.
When to Use
- User asks "show me logs for X"
- Investigating errors or failures
- After deployments to check for issues
- Debugging pod crashes or restarts
- Analyzing application behavior
What This Skill Does
- Query Logs: Search logs by namespace, pod, container, or log level
- Error Detection: Find errors and warnings in logs
- Log Aggregation: View logs across multiple pods
- Time-based Queries: Query logs for specific time ranges
- Log Patterns: Detect common issues from log patterns
Examples
Query Logs in Grafana UI
Access Grafana: https://grafana.localtest.me:9443
Navigate: Explore → Select Loki datasource
Log Dashboard: https://grafana.localtest.me:9443/d/loki-logs/loki-logs
Query Examples in Grafana Explore:
# All logs from observability namespace
{kubernetes_namespace_name="observability"}
# Logs from specific pod
{kubernetes_pod_name=~"prometheus.*"}
# Logs with errors
{kubernetes_namespace_name="observability"} |= "error"
# Logs from last 5 minutes with level=error
{kubernetes_namespace_name="observability"} | json | level="error"
# Count errors per namespace
sum by (kubernetes_namespace_name) (count_over_time({kubernetes_namespace_name=~".+"} |= "error" [5m]))
Query Logs via CLI (Loki HTTP API)
# Query Loki for recent errors in observability namespace
# Note: "date -u -v-5M" below is BSD/macOS syntax; on Linux use $(date -u -d '5 minutes ago' +%s)
kubectl exec -n observability deployment/grafana -- \
curl -s -G 'http://loki.observability.svc:3100/loki/api/v1/query_range' \
--data-urlencode 'query={kubernetes_namespace_name="observability"} |= "error"' \
--data-urlencode 'limit=100' \
--data-urlencode 'start='$(date -u -v-5M +%s)000000000 \
--data-urlencode 'end='$(date -u +%s)000000000 | python3 -m json.tool
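The query_range endpoint returns JSON; if jq is available on the workstation, the log lines can be flattened out directly. This is a minimal sketch assuming the standard Loki response shape, where data.result[].values holds [timestamp, line] pairs (without start/end, Loki typically defaults to roughly the last hour):
# Same query, but flatten the JSON response into plain log lines (requires jq locally)
kubectl exec -n observability deployment/grafana -- \
  curl -s -G 'http://loki.observability.svc:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={kubernetes_namespace_name="observability"} |= "error"' \
  --data-urlencode 'limit=100' \
  | jq -r '.data.result[].values[][1]'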
Check Logs for Specific Pod
# Get logs for a specific pod using kubectl
kubectl logs -n observability deployment/prometheus --tail=100
# Get logs from previous container (if crashed)
kubectl logs -n observability pod/prometheus-xxx --previous
# Follow logs in real-time
kubectl logs -n observability deployment/grafana -f --tail=20
# Get logs from specific container in pod
kubectl logs -n observability pod/alertmanager-xxx -c alertmanager --tail=50
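When a workload runs several replicas, a label selector pulls logs from all matching pods in one call; the label key/value below is an assumption and should be swapped for whatever labels the Kagenti workloads actually carry:
# Get logs from every pod matching a label, prefixed with the pod name
kubectl logs -n observability -l app.kubernetes.io/name=grafana \
  --all-containers=true --prefix --tail=50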
Search for Errors Across Platform
# Get recent error logs from all namespaces
# (kubectl logs needs a pod or selector, so iterate over each pod in the namespace)
for ns in observability keycloak oauth2-proxy istio-system kiali-system; do
  echo "=== Errors in $ns ==="
  for pod in $(kubectl get pods -n "$ns" -o name); do
    kubectl logs -n "$ns" "$pod" --all-containers=true --tail=50 2>&1
  done | grep -i "error\|fatal\|exception" | head -5
  echo
done
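A small variation of the same loop reports just an error count per namespace, which helps decide where to dig first (a rough sketch, counting matches in the last 50 lines of each pod):
# Count error-like lines per namespace (rough signal only)
for ns in observability keycloak oauth2-proxy istio-system kiali-system; do
  count=$(for pod in $(kubectl get pods -n "$ns" -o name); do
    kubectl logs -n "$ns" "$pod" --all-containers=true --tail=50 2>/dev/null
  done | grep -ci "error\|fatal\|exception")
  echo "$ns: $count"
done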
Check Logs for Failed Pods
# Find pods with issues and check their logs
kubectl get pods -A | grep -E "Error|CrashLoop|ImagePull" | while read ns pod rest; do
echo "=== Logs for $pod in $ns ==="
kubectl logs -n $ns $pod --tail=30 --previous 2>/dev/null || kubectl logs -n $ns $pod --tail=30
echo
done
Query Log Volume by Namespace
# In Grafana Explore (Loki datasource)
sum by (kubernetes_namespace_name) (
rate({kubernetes_namespace_name=~".+"}[5m])
)
Search for Specific Error Pattern
# Find connection errors
{kubernetes_namespace_name="observability"} |~ "connection (refused|timeout|reset)"
# Find authentication failures
{kubernetes_namespace_name=~"keycloak|oauth2-proxy"} |~ "auth.*fail|unauthorized|forbidden"
# Find OOM kills
{kubernetes_namespace_name=~".+"} |~ "OOM|out of memory|oom.*kill"
Log Levels and Filtering
Standard Log Levels
- error: Critical errors requiring attention
- warn/warning: Warnings that may indicate issues
- info: Informational messages
- debug: Detailed debugging information
- trace: Very detailed trace information
Filter by Log Level
# Only errors
{kubernetes_namespace_name="observability"} | json | level="error"
# Errors and warnings
{kubernetes_namespace_name="observability"} | json | level=~"error|warn"
# Everything except debug
{kubernetes_namespace_name="observability"} | json | level!="debug"
Common Log Queries for Platform Components
Prometheus Logs
kubectl logs -n observability deployment/prometheus --tail=100
# Check for scrape errors
kubectl logs -n observability deployment/prometheus | grep -i "scrape\|error"
Grafana Logs
kubectl logs -n observability deployment/grafana --tail=100
# Check for datasource errors
kubectl logs -n observability deployment/grafana | grep -i "datasource\|error"
Keycloak Logs
kubectl logs -n keycloak statefulset/keycloak --tail=100
# Check for authentication errors
kubectl logs -n keycloak statefulset/keycloak | grep -i "auth\|login\|error"
Istio Proxy (Sidecar) Logs
# Check sidecar logs for a specific pod
POD=$(kubectl get pod -n observability -l app=alertmanager -o jsonpath='{.items[0].metadata.name}')
kubectl logs -n observability $POD -c istio-proxy --tail=50
AlertManager Logs
kubectl logs -n observability deployment/alertmanager -c alertmanager --tail=100
# Check for notification errors
kubectl logs -n observability deployment/alertmanager -c alertmanager | grep -i "notif\|error\|fail"
Log Analysis Patterns
Detect Crash Loops
# Find pods restarting frequently (with -A, RESTARTS is column 5)
kubectl get pods -A | awk 'NR>1 && $5+0 > 5'
# Check logs before crash
kubectl logs -n <namespace> <pod-name> --previous | tail -50
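kubectl can also sort pods by restart count directly, which avoids counting columns by hand (this looks only at the first container in each pod):
# List all pods ordered by restart count of their first container (highest last)
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount'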
Find HTTP Errors
{kubernetes_namespace_name=~".+"} |~ "HTTP.*[45]\\d{2}"
Find Timeout Errors
{kubernetes_namespace_name=~".+"} |~ "timeout|timed out|deadline exceeded"
Find Database Connection Issues
{kubernetes_namespace_name=~".+"} |~ "database.*error|connection.*refused|SQL.*error"
Troubleshooting with Logs
Issue: Service Not Starting
- Check pod events: kubectl describe pod <pod-name> -n <namespace>
- Check container logs: kubectl logs <pod-name> -n <namespace>
- Check init container logs: kubectl logs <pod-name> -n <namespace> -c <init-container> (see the sketch below for listing init container names)
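If the init container's name is unknown, it can be read from the pod spec first (a small helper sketch using the same placeholders):
# List the init container names for a pod
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.initContainers[*].name}{"\n"}'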
Issue: High Error Rate
- Query error logs: {kubernetes_namespace_name="X"} |= "error" (set the time range in Grafana)
- Group errors by component: sum by (kubernetes_pod_name) (count_over_time({...} |= "error" [5m]))
- Identify patterns in the error messages
Issue: Performance Degradation
- Check for warnings: {kubernetes_namespace_name="X"} |= "warn"
- Look for timeout messages
- Check for resource exhaustion messages
Grafana Loki Dashboard Features
Loki Logs Dashboard: https://grafana.localtest.me:9443/d/loki-logs/loki-logs
Features:
- Namespace filter: Select specific namespace
- Pod filter: Filter by pod name
- Log level: Filter by error/warn/info/debug
- Time range: Select time window
- Log volume graphs: See log rate over time
- Log table: Browse actual log lines
Panels:
- Log Volume by Level: See errors vs warnings over time
- Log Volume by Namespace: Compare activity across namespaces
- Logs per Second: Current log ingestion rate
- Log Lines: Actual log content with search
Related Documentation
- Loki Documentation
- LogQL Query Language
- CLAUDE.md Troubleshooting
- Alert Runbooks (many of them reference specific logs)
Pro Tips
- Use time ranges: Always specify a time range to limit the amount of data scanned
- Filter early: Add namespace/pod filters before log level filters (more efficient)
- Use regex carefully: Complex regex can be slow on large log volumes
- Check both current and previous: For crashed pods, use --previous
- Tail first: Use --tail=N to limit output, then increase if needed (see the example below)
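As a concrete example of combining the time-range and tail tips, kubectl accepts --since and --tail on the same call:
# Last 10 minutes of logs, capped at 100 lines
kubectl logs -n observability deployment/grafana --since=10m --tail=100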
🤖 Generated with Claude Code