# Azure Kubernetes Node Addition Runbook
## Overview
This runbook provides step-by-step instructions for adding new Azure Virtual Machines to an existing Kubernetes cluster installed via Kubespray.
## Prerequisites
- Access to Azure CLI with appropriate permissions
- SSH access to the new VM
- Access to the existing Kubernetes cluster
- Kubespray installation directory
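A quick pre-flight sketch to confirm these prerequisites; the repo path follows this runbook's layout and may need adjusting for your checkout:
```bash
# Pre-flight checks (assumes the freeleaps-ops repo layout used in this runbook)
az account show -o table                   # Azure CLI is authenticated
kubectl cluster-info                       # kubectl can reach the existing cluster
ls freeleaps-ops/3rd/kubespray/scale.yml   # Kubespray checkout is present
```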
## Pre-Installation Checklist
### 1. Verify New VM Details
```bash
# Get VM details from Azure (-d/--show-details is required to populate the IP fields)
az vm show -d --resource-group <RESOURCE_GROUP> --name <VM_NAME> --query "{name:name,ip:publicIps,privateIp:privateIps}" -o table
```
### 2. Verify SSH Access
```bash
# Test SSH connection to the new VM
ssh wwwadmin@mathmast.com@<VM_PRIVATE_IP>
# You will be prompted for the account password
```
### 3. Verify Network Connectivity
```bash
# From the new VM, test connectivity to existing cluster
ping <EXISTING_MASTER_IP>
```
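ICMP is frequently blocked by Azure NSGs, so a failed ping is not conclusive on its own. Probing the API server port directly (6443 by default) is a stronger connectivity signal; a minimal sketch:
```bash
# Test TCP reachability of the Kubernetes API server (default port 6443)
nc -zv <EXISTING_MASTER_IP> 6443
```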
## Step-by-Step Process
### Step 1: Update Ansible Inventory
1. **Navigate to Kubespray directory**
```bash
cd freeleaps-ops/3rd/kubespray
```
2. **Edit the inventory file**
```bash
vim ../cluster/ansible/manifests/inventory.ini
```
3. **Add the new node to the appropriate group**
For a worker node:
```ini
[kube_node]
# Existing nodes...
prod-usw2-k8s-freeleaps-worker-nodes-06 ansible_host=<NEW_VM_PRIVATE_IP> ansible_user=wwwadmin@mathmast.com host_name=prod-usw2-k8s-freeleaps-worker-nodes-06
```
For a master node:
```ini
[kube_control_plane]
# Existing nodes...
prod-usw2-k8s-freeleaps-master-03 ansible_host=<NEW_VM_PRIVATE_IP> ansible_user=wwwadmin@mathmast.com etcd_member_name=freeleaps-etcd-03 host_name=prod-usw2-k8s-freeleaps-master-03
```
### Step 2: Verify Inventory Configuration
1. **Check inventory syntax**
```bash
ansible-inventory -i ../cluster/ansible/manifests/inventory.ini --list
```
2. **Test connectivity to new node**
```bash
ansible -i ../cluster/ansible/manifests/inventory.ini kube_node -m ping -kK
```
### Step 3: Run Kubespray Scale Playbook
1. **Execute the scale playbook**
```bash
# Run from the Kubespray directory entered in Step 1 (freeleaps-ops/3rd/kubespray)
ansible-playbook -i ../cluster/ansible/manifests/inventory.ini scale.yml -kK -b
```
**Note**:
- `-k` prompts for SSH password
- `-K` prompts for sudo password
- `-b` enables privilege escalation
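If key-based SSH and passwordless sudo are already configured for the `wwwadmin@mathmast.com` account, the password prompts can be dropped. Ansible's standard `--limit` flag also restricts the run to the new host so existing nodes are left untouched; a sketch under those assumptions:
```bash
# Non-interactive variant (assumes SSH keys and passwordless sudo are in place)
ansible-playbook -i ../cluster/ansible/manifests/inventory.ini scale.yml -b \
  --limit=<NEW_NODE_NAME>
```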
### Step 4: Verify Node Addition
1. **Check node status**
```bash
kubectl get nodes
```
2. **Verify node is ready**
```bash
kubectl describe node <NEW_NODE_NAME>
```
3. **Check node labels**
```bash
kubectl get nodes --show-labels
```
### Step 5: Post-Installation Verification
1. **Test pod scheduling**
```bash
# Create a test pod to verify scheduling
kubectl run test-pod --image=nginx --restart=Never
kubectl get pod test-pod -o wide
```
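The generic test pod above may be scheduled onto any node. To confirm the new node itself accepts workloads, the pod can be pinned with a `spec.nodeName` override (a sketch; the pod name is arbitrary):
```bash
# Pin a test pod to the new node via spec.nodeName, then clean up
kubectl run test-pod-pinned --image=nginx --restart=Never \
  --overrides='{"spec": {"nodeName": "<NEW_NODE_NAME>"}}'
kubectl get pod test-pod-pinned -o wide
kubectl delete pod test-pod test-pod-pinned
```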
2. **Check node resources**
```bash
# Requires the metrics-server addon; "Metrics API not available" means it is not installed
kubectl top nodes
```
3. **Verify node components**
```bash
# Expect at least kube-proxy plus the CNI agent (e.g. calico-node, depending on the
# network plugin Kubespray deployed) to be Running on the new node
kubectl get pods -n kube-system -o wide | grep <NEW_NODE_NAME>
```
## Troubleshooting
### Common Issues
#### 1. SSH Connection Failed
```bash
# Verify VM is running
az vm show -d --resource-group <RESOURCE_GROUP> --name <VM_NAME> --query "powerState"
# Check network security groups
az network nsg rule list --resource-group <RESOURCE_GROUP> --nsg-name <NSG_NAME>
```
#### 2. Ansible Connection Failed
```bash
# Test with verbose output
ansible -i ../cluster/ansible/manifests/inventory.ini kube_node -m ping -kK -vvv
```
#### 3. Node Not Ready
```bash
# Check node conditions
kubectl describe node <NEW_NODE_NAME>
# Check kubelet logs (kubelet runs as a systemd service on Kubespray nodes, not as a pod)
ssh wwwadmin@mathmast.com@<VM_PRIVATE_IP> 'sudo journalctl -u kubelet -n 100 --no-pager'
```
#### 4. Pod Scheduling Issues
```bash
# Check node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
# Check node capacity
kubectl describe node <NEW_NODE_NAME> | grep -A 10 "Capacity"
```
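If an unexpected taint turns out to be blocking scheduling, it can be removed with kubectl's trailing-dash syntax (substitute the actual taint key reported above):
```bash
# Remove a blocking taint (the trailing "-" deletes it)
kubectl taint nodes <NEW_NODE_NAME> <TAINT_KEY>-
```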
### Recovery Procedures
#### If Scale Playbook Fails
1. **Clean up the failed node**
```bash
kubectl delete node <NEW_NODE_NAME>
```
2. **Reset the VM**
```bash
# Restart the VM (note: a restart alone does not wipe a partial Kubespray install;
# see the remove-node.yml sketch after step 3 for a fuller cleanup)
az vm restart --resource-group <RESOURCE_GROUP> --name <VM_NAME>
```
3. **Retry the scale playbook**
```bash
# From freeleaps-ops/3rd/kubespray, using the same inventory path as Step 3
ansible-playbook -i ../cluster/ansible/manifests/inventory.ini scale.yml -kK -b
```
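If retries keep failing, upstream Kubespray ships a `remove-node.yml` playbook that drains the node and cleans up its Kubernetes components more thoroughly than a VM restart; a sketch using this repo's inventory path:
```bash
# From freeleaps-ops/3rd/kubespray: fully clean the failed node before retrying
# (the playbook asks for confirmation before removing the node)
ansible-playbook -i ../cluster/ansible/manifests/inventory.ini remove-node.yml \
  -e node=<NEW_NODE_NAME> -kK -b
```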
#### If Node is Stuck in NotReady State
1. **Check kubelet service**
```bash
ssh wwwadmin@mathmast.com@<VM_PRIVATE_IP> 'sudo systemctl status kubelet'
```
2. **Restart kubelet**
```bash
ssh wwwadmin@mathmast.com@<VM_PRIVATE_IP> 'sudo systemctl restart kubelet'
```
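If kubelet keeps failing after a restart, the container runtime is a common culprit; a sketch assuming Kubespray's default containerd runtime:
```bash
# Check the container runtime and recent kubelet errors (assumes containerd)
ssh wwwadmin@mathmast.com@<VM_PRIVATE_IP> \
  'sudo systemctl status containerd && sudo journalctl -u kubelet -n 50 --no-pager'
```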
## Security Considerations
### 1. Network Security
- Ensure the new VM is in the correct subnet
- Verify network security group rules allow cluster communication
- Check firewall rules if applicable
### 2. Access Control
- Use SSH key-based authentication when possible
- Limit sudo access to necessary commands
- Monitor node access logs
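To move from password to key-based SSH, the Azure CLI can push a public key to the VM's admin account; a sketch assuming a key at the default path:
```bash
# Push a public key to the VM (assumes ~/.ssh/id_rsa.pub exists locally)
az vm user update --resource-group <RESOURCE_GROUP> --name <VM_NAME> \
  --username wwwadmin@mathmast.com --ssh-key-value "$(cat ~/.ssh/id_rsa.pub)"
```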
### 3. Compliance
- Ensure the new node meets security requirements
- Verify all required security patches are applied
- Check compliance with organizational policies
## Monitoring and Maintenance
### 1. Node Health Monitoring
```bash
# Set up monitoring for the new node
kubectl get nodes -o wide
kubectl top nodes
```
### 2. Resource Monitoring
```bash
# Monitor resource usage
kubectl describe node <NEW_NODE_NAME> | grep -A 5 "Allocated resources"
```
### 3. Log Monitoring
```bash
# Follow kubelet logs on the node (kubelet is a systemd service, not a pod)
ssh wwwadmin@mathmast.com@<VM_PRIVATE_IP> 'sudo journalctl -u kubelet -n 100 -f'
```
## Rollback Procedures
### If Node Addition Causes Issues
1. **Cordon the node**
```bash
kubectl cordon <NEW_NODE_NAME>
```
2. **Drain the node**
```bash
kubectl drain <NEW_NODE_NAME> --ignore-daemonsets --delete-emptydir-data
```
3. **Remove the node**
```bash
kubectl delete node <NEW_NODE_NAME>
```
4. **Update inventory**
```bash
# Remove the node from inventory.ini
vim ../cluster/ansible/manifests/inventory.ini
```
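After editing, the inventory can be re-validated the same way as in Step 2; for a clean teardown of Kubernetes components on the removed VM, the `remove-node.yml` sketch from the recovery section can be reused.
```bash
# Confirm the removed host no longer appears in the parsed inventory
ansible-inventory -i ../cluster/ansible/manifests/inventory.ini --list | grep <NEW_NODE_NAME>
# (no output expected once the entry is gone)
```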
## Documentation
### Required Information
- VM name and IP address
- Resource group and subscription
- Node role (worker/master)
- Date and time of addition
- Person performing the addition
### Post-Addition Checklist
- [ ] Node appears in `kubectl get nodes`
- [ ] Node status is Ready
- [ ] Pods can be scheduled on the node
- [ ] All node components are running
- [ ] Monitoring is configured
- [ ] Documentation is updated
## Emergency Contacts
- **Infrastructure Team**: [Contact Information]
- **Kubernetes Administrators**: [Contact Information]
- **Azure Support**: [Contact Information]
---
**Last Updated**: [Date]
**Version**: 1.0
**Author**: [Name]