Published: January 2, 2025
Logging is expensive. A typical 50-service SaaS generates 1TB of logs per day. At Datadog's $0.10/GB ingestion plus $1.27 per million log events, that's $3,000/month for ingestion alone, and upwards of $50,000/year in total.
Grafana Loki changes the game: it indexes only a small amount of label metadata rather than the log content, so you pay mostly for object storage (~$0.02/GB in S3). That same 1TB/day runs roughly $600/month in storage and under $1,000/month all-in, an 80-90% reduction.
Traditional logging (ELK, Splunk): every log line is full-text indexed. Indexed storage runs $1-$5/GB, plus the compute needed to serve queries against it.
Loki's approach: Index only metadata (labels like service=api, level=error). Compress and store raw logs in object storage. Query by label, then grep through compressed chunks.
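A minimal sketch of that split (the labels and log line below are illustrative, not from a real system): only the stream's label set is indexed, while the line itself lives in compressed chunks.

# Indexed: the stream's label set only
{service="api", level="error", namespace="production"}

# Stored but not indexed: the raw line, compressed inside the stream's chunks
2025-01-02T10:15:03Z payment failed user_id=12345 order_id=9876

# A query narrows by label first, then greps the matching chunks
{service="api", level="error"} |= "payment failed"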
┌──────────────┬────────────┬────────────────────┬────────────┐
│ Volume       │ Loki (S3)  │ Elasticsearch      │ Datadog    │
├──────────────┼────────────┼────────────────────┼────────────┤
│ 100GB/day    │ $60/mo     │ $500-$1,000/mo     │ $1,500/mo  │
│ 500GB/day    │ $300/mo    │ $3,000-$5,000/mo   │ $7,500/mo  │
│ 1TB/day      │ $600/mo    │ $7,000-$12,000/mo  │ $15,000/mo │
│ 5TB/day      │ $3,000/mo  │ $40,000-$60,000/mo │ $75,000/mo │
└──────────────┴────────────┴────────────────────┴────────────┘

Loki cost breakdown (1TB/day):
- S3 storage: 30TB/mo × $0.02/GB = $600/mo
- Compute: 3x c5.2xlarge = $300/mo
- Total: ~$900/mo (vs $15,000 for Datadog)
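To plug in your own volume, here is a back-of-the-envelope sketch in Python; the $0.02/GB storage price and $300/month compute figure are assumptions carried over from the breakdown above, not quotes.

# Rough monthly Loki-on-S3 cost; the table's Loki column is the storage component,
# this adds a flat compute estimate on top.
def loki_monthly_cost(gb_per_day: float,
                      retention_days: int = 30,
                      s3_price_per_gb: float = 0.02,     # approximate S3 storage price
                      compute_per_month: float = 300.0,  # small read/write/backend fleet
                      ) -> float:
    stored_gb = gb_per_day * retention_days  # data sitting in the bucket (pre-compression)
    return stored_gb * s3_price_per_gb + compute_per_month

for gb_per_day in (100, 500, 1024):
    print(f"{gb_per_day} GB/day -> ~${loki_monthly_cost(gb_per_day):,.0f}/mo")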
# Install Loki with S3 backend
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-distributed \
  --namespace logging --create-namespace \
  --values loki-values.yaml
# loki-values.yaml
loki:
  auth_enabled: false
  storage:
    type: s3
    bucketNames:
      chunks: loki-chunks
      ruler: loki-ruler
    s3:
      region: us-east-1
  schema_config:
    configs:
      - from: 2024-01-01
        store: tsdb              # Latest index format
        object_store: s3
        schema: v13
        index:
          prefix: index_
          period: 24h
  limits_config:
    retention_period: 720h       # 30 days
    max_query_length: 721h
    max_query_lookback: 720h
    ingestion_rate_mb: 50
    ingestion_burst_size_mb: 100
  compactor:
    working_directory: /data/compactor
    shared_store: s3
    compaction_interval: 10m
    retention_enabled: true
    retention_delete_delay: 2h
    retention_delete_worker_count: 150

# Deploy with 3 components for scalability
write:
  replicas: 3    # Ingesters
read:
  replicas: 3    # Query frontend
backend:
  replicas: 2    # Compactor, query scheduler
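A quick way to sanity-check the install (the gateway service name below assumes the loki-distributed chart with release name loki; confirm the real name with kubectl get svc -n logging):

# All components should reach Running/Ready
kubectl get pods -n logging

# Port-forward the gateway and hit Loki's readiness and labels endpoints
kubectl port-forward -n logging svc/loki-loki-distributed-gateway 3100:80 &
curl -s http://localhost:3100/ready                 # expect: ready
curl -s http://localhost:3100/loki/api/v1/labels    # indexed label names, once logs arrive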
Don't keep all logs forever. Use tiered retention based on value:
Hot tier
What: All logs, full-text searchable
Cost: $0.023/GB
Use case: Active debugging, recent incidents

Warm tier
What: ERROR/WARN only, sampled DEBUG
Cost: $0.018/GB (moves to Infrequent Access after 30 days)
Use case: Compliance, post-mortems

Cold tier
What: ERROR only, aggregated metrics
Cost: $0.004/GB
Use case: Audits, legal holds
# Implement tiered retention with S3 Lifecycle
# s3-lifecycle.json
{
  "Rules": [
    {
      "Id": "MoveToIA",
      "Status": "Enabled",
      "Transitions": [
        { "Days": 7,  "StorageClass": "STANDARD_IA" },
        { "Days": 30, "StorageClass": "GLACIER_IR" }
      ],
      "Expiration": { "Days": 365 },
      "Filter": { "Prefix": "loki/" }
    }
  ]
}

# Apply lifecycle policy
aws s3api put-bucket-lifecycle-configuration \
  --bucket loki-chunks \
  --lifecycle-configuration file://s3-lifecycle.json

One caveat: the compactor in loki-values.yaml deletes chunks after retention_period (720h, i.e. 30 days), so they would never reach the Glacier tier. If you rely on the lifecycle rules for long-term retention, raise retention_period (e.g. 8760h) or disable compactor retention and let the lifecycle Expiration do the deleting.
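S3 lifecycle rules act on whole objects, so on their own they can't express the per-level policy above (keep ERROR longer than DEBUG). Loki's per-stream retention can; a minimal sketch, with illustrative periods, building on the compactor retention already enabled in loki-values.yaml:

limits_config:
  retention_period: 720h           # default for everything else
  retention_stream:
    - selector: '{level="debug"}'
      priority: 1
      period: 168h                 # drop DEBUG after 7 days
    - selector: '{level="error"}'
      priority: 2
      period: 8760h                # keep ERROR for a year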
Loki's performance depends on low-cardinality labels. High cardinality = slow queries + high costs.
Bad labels (high cardinality):
user_id=12345
request_id=abc-def-ghi
email=user@example.com
Problem: Millions of unique label values = millions of streams and index entries = slow queries + high memory
Good labels (low cardinality):
service=api
level=error
namespace=production
cluster=us-east-1
Result: ~100 unique label combinations (streams) = fast queries + low memory
High-cardinality data belongs in the log message, not in labels:
# Good: Search by label, filter by content
{service="api", level="error"} |= "user_id=12345"

# Bad: Using user_id as label (DON'T DO THIS)
{service="api", level="error", user_id="12345"}  # Creates millions of streams!
# Fast: Narrows to specific log streams first
{service="api", level="error"} |= "payment failed"

# Slow: Scans every stream (Loki rejects a bare {}, so a match-everything
# selector like this is the "all logs" equivalent)
{namespace=~".+"} |= "payment failed"

# Fastest: Pre-aggregate into metrics
sum by (service) (
  count_over_time({level="error"}[5m])
)
# Convert logs to metrics for dashboards

# Error rate by service
# ({service=~".+"} stands in for "all logs"; Loki requires a non-empty matcher)
sum by (service) (rate({level="error"}[5m]))
  /
sum by (service) (rate({service=~".+"}[5m]))

# P99 latency from structured logs
quantile_over_time(0.99,
  {service="api"} | json | unwrap duration [5m]
) by (service)

# Top 10 error messages
topk(10,
  sum by (msg) (
    count_over_time({level="error"} | json [1h])
  )
)
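If dashboards re-run these queries on every refresh, the Loki ruler can pre-compute them as recording rules and remote-write the results to Prometheus. A sketch, assuming the loki-ruler bucket configured earlier and a remote_write target you already operate (with auth_enabled: false the tenant is "fake"):

# recording-rules.yaml, loaded into the ruler's rule storage for the tenant
groups:
  - name: log-derived-metrics
    interval: 1m
    rules:
      - record: service:log_error_rate:5m
        expr: |
          sum by (service) (rate({level="error"}[5m]))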
DEBUG logs are 80% of volume, 5% of value. Sample them aggressively:
# Python: Sample DEBUG logs at 1%
import logging
import random

class SamplingFilter(logging.Filter):
    def filter(self, record):
        if record.levelno == logging.DEBUG:
            return random.random() < 0.01  # Keep 1% of DEBUG
        return True  # Keep all ERROR/WARN/INFO

# Attach the filter to the handler: a filter added to the root logger only applies
# to records logged directly on it, not to records propagated from child loggers
handler = logging.StreamHandler()
handler.addFilter(SamplingFilter())
logging.getLogger().addHandler(handler)

# Result: ~80% volume reduction, all errors kept
# Bad: Unstructured (hard to query)
logger.info(f"User {user_id} purchased {product} for ${amount}")

# Good: Structured JSON (easy to query)
# (assumes the handler is configured with a JSON formatter, e.g. python-json-logger)
logger.info("Purchase completed", extra={
    "user_id": user_id,
    "product": product,
    "amount": amount,
    "payment_method": "stripe"
})

# Output in Loki:
# {"level":"info","msg":"Purchase completed","user_id":123,"product":"widget","amount":49.99}

# Query specific purchases
{service="api"} | json | amount > 100 | line_format "{{.user_id}}: ${{.amount}}"
# Track ingestion rate (bytes/sec) by service
sum by (service) (
  rate(loki_distributor_bytes_received_total[5m])
)

# Estimated monthly storage cost at $0.02/GB
# (current ingest rate extrapolated to 30 days, before chunk compression)
sum(rate(loki_distributor_bytes_received_total[1h])) * 86400 * 30 / 1024 / 1024 / 1024 * 0.02

# Top 5 noisiest services
topk(5,
  sum by (service) (
    rate(loki_distributor_lines_received_total[1h])
  )
)

# Query performance (slow queries to optimize)
histogram_quantile(0.99,
  sum by (le) (rate(loki_query_frontend_duration_seconds_bucket[5m]))
)
Bottom line: Loki + smart retention = 80-90% cost savings vs traditional logging. Start with one service this week. You'll never go back.