Cost, Control, and Chaos: My Journey of Hosting Databases on K8s for a High-Volume Startup
In the fast-paced world of startups, managing costs and performance is a constant balancing act. Over the past few years, I’ve navigated the complexities of hosting databases for a startup where data volume was high and growing steadily. Starting with managed cloud solutions, transitioning to Aurora, and finally settling on a self-hosted database in a Kubernetes (K8s) cluster, each phase of this journey has taught me valuable lessons. Here’s how it unfolded.
The Problem: High Costs of Managed Solutions
Initially, we relied on managed RDS for our non-production environments, but the costs quickly spiralled out of control. Moving to Aurora seemed like a logical step to reduce expenses while maintaining scalability. However, Aurora turned out to be another misstep — costs were still high, and we lacked the flexibility needed for fine-grained optimizations.
This realization pushed us to explore self-hosting databases in a Kubernetes cluster.
The Solution: Hosting Databases in Kubernetes
Tools Used
Database Management Tools
- pgBackRest — For automated backups and point-in-time recovery.
- PGTune — For generating optimized PostgreSQL configuration settings based on workload.
- PgBouncer — A lightweight connection pooler for PostgreSQL.
- pgaudit — For enabling comprehensive audit logging in PostgreSQL.
Monitoring and Observability Tools
- Prometheus — For monitoring system metrics (CPU, memory, etc.)
- Grafana — For real-time visualization of metrics and setting up alerts.
- Postgres Exporter — For exporting PostgreSQL metrics to Prometheus.
Kubernetes-Specific Tools
- Kubernetes StatefulSets — For deploying stateful applications like databases.
- Kubernetes Secrets — For securely storing sensitive information (e.g., database credentials).
- Network policies — For isolating the database from other deployments.
- Kubernetes Probes — For setting up health checks (readiness and liveness probes).
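To make these building blocks concrete, here is a minimal sketch of how they fit together in a single manifest. This is illustrative only: the name, image tag, storage size, and Secret name are placeholders, not our actual configuration.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432
          envFrom:
            - secretRef:
                name: postgres-credentials   # credentials kept in a Kubernetes Secret
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 100Gi   # placeholder; size to your data volume
```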
Okay, let’s get back to the story.
Hosting a database in Kubernetes required meticulous planning and execution. Here’s how I tackled each challenge:
1. Migration Challenges
- Avoiding Data Errors: Migrating from managed databases to self-hosted solutions required precision. I ensured no errors or duplicate data by performing incremental data dumps and running rigorous data validation scripts.
- Roles and User Management: Transitioning user roles and permissions was critical. I recreated roles, ensuring compatibility with our applications.
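The core of those validation scripts was comparing what landed in the target against the source. A minimal sketch of the idea, with hardcoded dicts standing in for the per-table row counts you would actually pull from each database (via `SELECT count(*)` or `pg_stat_user_tables`):

```python
# Sketch: flag any table whose row count differs between source and target.
# In real use, `source` and `target` come from queries against both databases;
# the dicts below are illustrative stand-ins.

def diff_row_counts(source: dict, target: dict) -> list:
    """Return (table, source_count, target_count) tuples for every mismatch."""
    mismatches = []
    for table, count in source.items():
        if target.get(table) != count:
            mismatches.append((table, count, target.get(table)))
    # Tables that exist only in the target are also suspicious.
    for table in target.keys() - source.keys():
        mismatches.append((table, None, target[table]))
    return mismatches

source = {"users": 10_000, "orders": 52_340}
target = {"users": 10_000, "orders": 52_339}  # one row lost in transit
print(diff_row_counts(source, target))
```

Row counts alone will not catch corrupted values, so in practice this is paired with checksums or spot queries on critical tables.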
2. Performance Tuning
Managing a database in Kubernetes necessitated deep attention to database performance:
- Start Big, Scale Down: I started with high-capacity instances and reduced resources gradually as usage patterns stabilized.
- Monitor Checkpoints and WAL Points: Proper checkpointing reduced disk I/O spikes, and monitoring Write-Ahead Logs (WAL) helped prevent replication lag.
- Choose Throughput and IOPS Wisely: I provisioned storage carefully, balancing cost with performance.
- Optimize Database Parameters: Using tools like PGTune, I optimized critical PostgreSQL settings:
max_connections = 200
shared_buffers = 4GB
effective_cache_size = 12GB
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 5242kB
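The values above are what a rule-of-thumb calculator produces for roughly a 16 GB node. A rough sketch of those heuristics (illustrative only; PGTune's actual formulas vary by workload type and PostgreSQL version):

```python
# Sketch of common sizing heuristics for a dedicated PostgreSQL node.
# These are rules of thumb, not PGTune's exact algorithm.

def suggest_settings(ram_gb: int, max_connections: int = 200) -> dict:
    ram_kb = ram_gb * 1024 * 1024
    shared_buffers_kb = ram_kb // 4        # ~25% of RAM for the buffer pool
    effective_cache_kb = ram_kb * 3 // 4   # ~75% of RAM assumed cached (incl. OS cache)
    # work_mem: split the remaining RAM across potential concurrent sort/hash
    # operations, with headroom for parallel workers.
    work_mem_kb = (ram_kb - shared_buffers_kb) // (max_connections * 3) // 4
    return {
        "shared_buffers": f"{shared_buffers_kb // (1024 * 1024)}GB",
        "effective_cache_size": f"{effective_cache_kb // (1024 * 1024)}GB",
        "work_mem": f"{work_mem_kb}kB",
    }

print(suggest_settings(16))  # matches the values listed above for a 16 GB node
```

Note that `work_mem` is per sort/hash operation, not per connection, which is why the divisor is deliberately conservative.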
3. Operational Best Practices
- Observability: Monitoring CPU, memory, and storage usage was essential. I relied on tools like Prometheus and Grafana for real-time insights.
- Connection Pooling: PgBouncer helped manage connections efficiently, reducing contention and improving throughput.
- Audit Logging: Using pgaudit, I enabled comprehensive audit logging for better observability and compliance.
- Vacuum and Analyze: I automated vacuuming and analyzing through PostgreSQL’s autovacuum settings to reclaim space and keep query statistics up to date.
- Health Checks: Kubernetes probes ensured the database was always responsive, restarting pods when necessary.
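As a sketch, those health checks can be expressed as exec probes running `pg_isready`; the timings below are illustrative and should be tuned to your workload:

```yaml
# Illustrative container probes for the PostgreSQL container.
livenessProbe:
  exec:
    command: ["pg_isready", "-U", "postgres"]
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  exec:
    command: ["pg_isready", "-U", "postgres"]
  initialDelaySeconds: 5
  periodSeconds: 5
```

A failed liveness probe restarts the pod; a failed readiness probe only takes it out of the Service endpoints, which is the safer default while the database is recovering or replaying WAL.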
4. Backups and Recovery
- Automated Backups: I set up pgBackRest and WAL-G to automate backups and enable point-in-time recovery.
- Anomaly Alerts: Alerts for unusual activity, such as high WAL generation or long-running queries, helped maintain stability.
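For reference, a pgBackRest setup at its simplest is a small config file plus WAL archiving on the PostgreSQL side. The stanza name, paths, and retention below are placeholders, not our production values:

```ini
# Illustrative pgbackrest.conf
[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=2
start-fast=y

[main]
pg1-path=/var/lib/postgresql/data
```

On the PostgreSQL side, setting `archive_command = 'pgbackrest --stanza=main archive-push %p'` ships each completed WAL segment to the repository, which is what makes point-in-time recovery possible between full backups.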
5. Security Considerations
- Restricted Exposure: The database was never exposed to the internet, and all access was secured through private networking.
- User Permissions: I adhered to the principle of least privilege, carefully managing user roles.
- Secrets Management: Kubernetes Secrets ensured sensitive information like credentials was stored securely.
- Audit Logging: Capturing access and query logs provided both security and operational insights.
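The network isolation piece maps naturally to a NetworkPolicy. A minimal sketch, where the `app: postgres` and `role: backend` labels are illustrative, allowing only labelled application pods to reach port 5432:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: postgres-isolation
spec:
  podSelector:
    matchLabels:
      app: postgres        # applies to the database pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: backend   # only application pods may connect
      ports:
        - protocol: TCP
          port: 5432
```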
6. Cost Optimization with Reserved Instances
To further reduce costs, I transitioned to Reserved Instances for the Kubernetes nodes hosting our database workloads. This strategy cut costs significantly while maintaining predictable performance. Combining Reserved Instances with resource right-sizing yielded substantial savings.
7. Upgrades and Maintenance
Upgrading PostgreSQL was a critical aspect of this journey:
- Reading Release Notes: Before every upgrade, I studied release notes to identify breaking changes, such as changes in authentication mechanisms.
- Extension Compatibility: I verified that extensions like pg_stat_statements and pgaudit were compatible with the target PostgreSQL version before upgrading.
- Planned Downtime: Upgrades were carefully scheduled to minimize disruption.
- Testing Upgrades: Each upgrade was first tested in a non-production environment to ensure compatibility.
The Trade-offs
Hosting a database in Kubernetes has its downsides:
- Lack of High Availability: Our setup was not designed for HA, which we accepted as a trade-off for cost savings.
- Operational Overhead: Managing backups, monitoring, and upgrades required a well-defined checklist and standard operating procedures (SOPs).
The Benefits
Despite the challenges, the benefits were significant:
- Cost Savings: We achieved a 70% reduction in costs compared to managed solutions.
- Full Control: Self-hosting allowed us to fine-tune every aspect of the database, from parameters to backup schedules.
- Faster Restarts: Without the constraints of managed environments, database restarts were faster.
- Customizations: We had full access to WAL logs and could implement custom workflows.
Lessons Learned
- Plan Your Migration: Data integrity and role management are critical when transitioning from managed to self-hosted solutions.
- Performance Monitoring is Crucial: Regularly tune parameters, monitor long-running queries, and adjust resources based on load.
- Prioritize Security: Limit exposure and enforce strict user permissions.
- Prepare for Operational Overhead: Develop clear SOPs to manage backups, upgrades, and incidents.
- Embrace Observability: Invest in robust monitoring to stay ahead of potential issues.
- Test Thoroughly: Every migration, upgrade, or configuration change must be tested.
This journey has been a blend of challenges and rewards. Self-hosting a database may not be for everyone, but for our high-volume startup, it provided the cost control and customization we needed to thrive.
Thanks for reading my blog. Feel free to hit me up for any AWS/DevOps/Open Source-related discussions.
Manoj Kumar — LinkedIn.