Cost, Control, and Chaos: My Journey of Hosting Databases on K8s for a High-Volume Startup

The Optimizer
5 min read · Dec 7, 2024


In the fast-paced world of startups, managing costs and performance is a constant balancing act. Over the past few years, I’ve navigated the complexities of hosting databases for a startup where data volume was high and growing steadily. We started with managed cloud solutions, moved to Aurora, and finally settled on a self-hosted database in a Kubernetes (K8s) cluster; each phase of that journey taught me valuable lessons. Here’s how it unfolded.

The Problem: High Costs of Managed Solutions

Initially, we relied on managed RDS for our non-production environments, but the costs quickly spiralled out of control. Moving to Aurora seemed like a logical step to reduce expenses while maintaining scalability, but it turned out to be another misstep: costs were still high, and we lacked the flexibility needed for fine-grained optimizations.

This realization pushed us to explore self-hosting databases in a Kubernetes cluster.

The Solution: Hosting Databases in Kubernetes

Tools Used

Database Management Tools

  • pgBackRest — For automated backups and point-in-time recovery.
  • PGTune — For generating optimized PostgreSQL configuration settings based on workload.
  • PgBouncer — A lightweight connection pooler for PostgreSQL.
  • pgaudit — For enabling comprehensive audit logging in PostgreSQL.

Monitoring and Observability Tools

  • Prometheus — For monitoring system metrics (CPU, memory, etc.).
  • Grafana — For real-time visualization of metrics and setting up alerts.
  • Postgres Exporter — For exporting PostgreSQL metrics to Prometheus.

Kubernetes-Specific Tools

  • Kubernetes StatefulSets — For deploying stateful applications like databases.
  • Kubernetes Secrets — For securely storing sensitive information (e.g., database credentials).
  • Kubernetes NetworkPolicies — For isolating the database from other deployments.
  • Kubernetes Probes — For setting up health checks (readiness and liveness probes).
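
To make these pieces concrete, here is a trimmed sketch of the kind of StatefulSet that ties them together. The image, names, and sizes are illustrative, not our exact manifest:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:15          # illustrative; pin the exact version you run
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials   # Kubernetes Secret with the superuser password
                  key: password
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata   # subdirectory so initdb doesn't trip over lost+found
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 200Gi            # size depends on your data volume and storage class

The serviceName above assumes a matching headless Service, omitted here for brevity. A single replica keeps things simple; high availability was an explicit non-goal (more on that trade-off later).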

Okay, let’s get back to the story.

Hosting a database in Kubernetes required meticulous planning and execution. Here’s how I tackled each challenge:

1. Migration Challenges

  • Avoiding Data Errors: Migrating from managed databases to self-hosted solutions required precision. I ensured there were no errors or duplicate rows by performing incremental data dumps and running rigorous data validation scripts (a rough sketch of such a Job follows this list).
  • Roles and User Management: Transitioning user roles and permissions was critical. I recreated roles, ensuring compatibility with our applications.
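
As mentioned in the first bullet, the dump-and-validate step can be wrapped in a one-off Kubernetes Job. A minimal sketch, assuming the target database and roles already exist and (for brevity) that the same password works on both ends; host, database, and table names are placeholders:

apiVersion: batch/v1
kind: Job
metadata:
  name: pg-migrate-and-validate
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: postgres:15          # same major version as the target server
          env:
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: password
          command: ["/bin/sh", "-c"]
          args:
            - |
              set -e
              # Dump from the managed source and restore into the in-cluster target
              pg_dump -h old-rds-host -U app -d appdb -Fc -f /tmp/appdb.dump
              pg_restore -h postgres -U app -d appdb --no-owner /tmp/appdb.dump
              # Crude validation: compare row counts on a key table
              SRC=$(psql -h old-rds-host -U app -d appdb -tAc "SELECT count(*) FROM orders")
              DST=$(psql -h postgres -U app -d appdb -tAc "SELECT count(*) FROM orders")
              [ "$SRC" = "$DST" ] && echo "row counts match: $SRC" || { echo "MISMATCH: $SRC vs $DST"; exit 1; }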

2. Performance Tuning

Managing a database in Kubernetes necessitated deep attention to database performance:

  • Start Big, Scale Down: I started with high-capacity instances and reduced resources gradually as usage patterns stabilized.
  • Monitor Checkpoints and WAL Points: Proper checkpointing reduced disk I/O spikes, and monitoring Write-Ahead Logs (WAL) helped prevent replication lag.
  • Choose Throughput and IOPS Wisely: I provisioned storage carefully, balancing cost with performance.
  • Optimize Database Parameters: Using tools like PGTune, I optimized critical PostgreSQL settings:
max_connections = 200                  # hard cap on client connections; PgBouncer sits in front
shared_buffers = 4GB                   # PostgreSQL's own cache, roughly a quarter of available RAM (the usual PGTune heuristic)
effective_cache_size = 12GB            # planner hint for total memory available for caching
maintenance_work_mem = 1GB             # speeds up VACUUM, CREATE INDEX, and similar maintenance
checkpoint_completion_target = 0.9     # spread checkpoint writes to avoid I/O spikes
wal_buffers = 16MB                     # WAL held in memory before being flushed to disk
random_page_cost = 1.1                 # SSD-backed storage, so random reads are cheap
effective_io_concurrency = 200         # concurrent I/O requests SSD storage can handle
work_mem = 5242kB                      # per-sort/hash memory; kept modest with many connections
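
In Kubernetes, one convenient way to deliver these settings is a ConfigMap that gets mounted into the pod and pulled in with an include directive in postgresql.conf, or passed as individual -c flags on the postgres command line. A minimal sketch:

apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-tuning
data:
  tuning.conf: |
    shared_buffers = 4GB
    effective_cache_size = 12GB
    checkpoint_completion_target = 0.9
    # ...plus the rest of the settings listed above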

3. Operational Best Practices

  • Observability: Monitoring CPU, memory, and storage usage was essential. I relied on tools like Prometheus and Grafana for real-time insights.
  • Connection Pooling: PgBouncer helped manage connections efficiently, reducing contention and improving throughput.
  • Audit Logging: Using pgaudit, I enabled comprehensive audit logging for better observability and compliance.
  • Vacuum and Analyze: I relied on PostgreSQL’s autovacuum settings to automatically reclaim dead space and keep query statistics up to date.
  • Health Checks: Kubernetes probes ensured the database was always responsive, restarting pods when necessary.
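
For the health checks, pg_isready inside the container is usually enough. Roughly what the probe fragment of the container spec looked like (thresholds are illustrative):

readinessProbe:
  exec:
    command: ["pg_isready", "-U", "postgres"]
  initialDelaySeconds: 10
  periodSeconds: 10
livenessProbe:
  exec:
    command: ["pg_isready", "-U", "postgres"]
  initialDelaySeconds: 30
  periodSeconds: 15
  failureThreshold: 5     # give a busy database some slack before restarting the pod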

4. Backups and Recovery

  • Automated Backups: I set up pgBackRest and WAL-G to automate backups and enable point-in-time recovery (a CronJob sketch follows this list).
  • Anomaly Alerts: Alerts for unusual activity, such as high WAL generation or long-running queries, helped maintain stability.
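
The nightly backup ran as a CronJob. A rough sketch, assuming the pgBackRest stanza and repository are already configured via pgbackrest.conf inside the image (schedule, stanza, and image name are illustrative):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: pgbackrest-nightly
spec:
  schedule: "0 2 * * *"          # nightly backup at 02:00
  concurrencyPolicy: Forbid      # never let two backups overlap
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              image: our-registry/pgbackrest:latest   # illustrative; we baked pgBackRest into our own image
              command: ["pgbackrest", "--stanza=main", "backup"]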

5. Security Considerations

  • Restricted Exposure: The database was never exposed to the internet; all access went through private networking, enforced with a NetworkPolicy (sketched after this list).
  • User Permissions: I adhered to the principle of least privilege, carefully managing user roles.
  • Secrets Management: Kubernetes Secrets ensured sensitive information like credentials was stored securely.
  • Audit Logging: Capturing access and query logs provided both security and operational insights.
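
The "never exposed" part was backed by a NetworkPolicy that only admits traffic from the application pods. Labels here are illustrative:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: postgres-allow-app-only
spec:
  podSelector:
    matchLabels:
      app: postgres
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              tier: backend        # only application pods may open connections
      ports:
        - protocol: TCP
          port: 5432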

6. Cost Optimization with Reserved Instances

To further reduce costs, I transitioned to Reserved Instances for Kubernetes nodes hosting our database workloads. This strategy cut costs significantly while maintaining predictable performance levels. Combining reserved instances with resource right-sizing yielded substantial savings.
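
Reserved capacity only pays off if the database pods actually land on the reserved nodes, so the StatefulSet’s pod spec pinned them with a nodeSelector (the label is illustrative):

nodeSelector:
  workload-class: reserved-db     # node group covered by reserved instances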

7. Upgrades and Maintenance

Upgrading PostgreSQL was a critical aspect of this journey:

  • Reading Release Notes: Before every upgrade, I studied release notes to identify breaking changes, such as changes in authentication mechanisms.
  • Extension Compatibility: Verified that extensions like pg_stat_statements or pgaudit were compatible with the target PostgreSQL version before upgrading.
  • Planned Downtime: Upgrades were carefully scheduled to minimize disruption.
  • Testing Upgrades: Each upgrade was first tested in a non-production environment to ensure compatibility.

The Trade-offs

Hosting a database in Kubernetes has its downsides:

  • Lack of High Availability: Our setup was not designed for HA, which we accepted as a trade-off for cost savings.
  • Operational Overhead: Managing backups, monitoring, and upgrades required a well-defined checklist and standard operating procedures (SOPs).

The Benefits

Despite the challenges, the benefits were significant:

  • Cost Savings: We achieved a 70% reduction in costs compared to managed solutions.
  • Full Control: Self-hosting allowed us to fine-tune every aspect of the database, from parameters to backup schedules.
  • Faster Restarts: Without the constraints of a managed control plane, database restarts completed faster.
  • Customizations: We had full access to WAL logs and could implement custom workflows.

Lessons Learned

  1. Plan Your Migration: Data integrity and role management are critical when transitioning from managed to self-hosted solutions.
  2. Performance Monitoring is Crucial: Regularly tune parameters, monitor long-running queries, and adjust resources based on load.
  3. Prioritize Security: Limit exposure and enforce strict user permissions.
  4. Prepare for Operational Overhead: Develop clear SOPs to manage backups, upgrades, and incidents.
  5. Embrace Observability: Invest in robust monitoring to stay ahead of potential issues.
  6. Test Thoroughly: Every migration, upgrade, or configuration change must be tested.

This journey has been a blend of challenges and rewards. Self-hosting a database may not be for everyone, but for our high-volume startup, it provided the cost control and customization we needed to thrive.

Thanks for reading my blog. Feel free to hit me up for any AWS/DevOps/Open Source-related discussions.

Manoj Kumar — LinkedIn.
