
Flink High Availability Cluster Deployment

Overview

This project uses the Apache Flink Kubernetes Operator to deploy a high-availability Flink cluster with persistent storage and automatic failover.

Component Architecture

  • JobManager: 2 replicas with high availability configuration
  • TaskManager: 3 replicas for distributed processing
  • High Availability: Kubernetes-based HA with persistent storage
  • Checkpointing: Persistent checkpoints and savepoints storage

File Description

flink-operator-v2.yaml

Flink Kubernetes Operator deployment configuration:

  • Operator deployment in flink-system namespace
  • RBAC configuration for cluster-wide permissions
  • Health checks and resource limits
  • Enhanced CRD definitions with additional printer columns

flink-crd.yaml

Custom Resource Definitions for Flink:

  • FlinkDeployment CRD
  • FlinkSessionJob CRD
  • Required for Flink Operator to function

ha-flink-cluster-v2.yaml

Production-ready HA Flink cluster configuration:

  • 2 JobManager replicas with HA enabled
  • 3 TaskManager replicas with anti-affinity rules
  • Persistent storage for HA data, checkpoints, and savepoints
  • Memory and CPU resource allocation
  • Exponential delay restart strategy
  • Proper volume mounts and storage configuration
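
The settings listed above map onto a FlinkDeployment resource roughly as follows. This is a minimal sketch, not the contents of ha-flink-cluster-v2.yaml: the image tag, flinkVersion, the ServiceAccount name `flink`, and the HA storage path are assumptions for illustration, and the helper name `render_flink_deployment` is hypothetical.

```shell
# Sketch only: prints an illustrative FlinkDeployment manifest to stdout.
# Image, flinkVersion, serviceAccount and storageDir are assumptions, not
# values copied from ha-flink-cluster-v2.yaml.
render_flink_deployment() {
  cat <<'EOF'
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: ha-flink-cluster-v2
  namespace: freeleaps-data-platform
spec:
  image: flink:1.17
  flinkVersion: v1_17
  serviceAccount: flink
  flinkConfiguration:
    # Key name depends on the Flink version; older releases use "high-availability"
    high-availability.type: kubernetes
    high-availability.storageDir: file:///opt/flink/ha-data
  jobManager:
    replicas: 2
    resource:
      cpu: 0.5
      memory: "1024m"
  taskManager:
    replicas: 3
    resource:
      cpu: 0.5
      memory: "2048m"
EOF
}
```

Running `render_flink_deployment` prints the manifest so it can be compared side by side with the real file before anything is applied.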

simple-ha-flink-cluster.yaml

Simplified HA Flink cluster configuration:

  • Uses ephemeral storage to avoid PVC binding issues
  • Basic HA configuration
  • Minimal resource requirements
  • Recommended for development and testing

flink-storage.yaml

Storage and RBAC configuration:

  • PersistentVolumeClaims for HA data, checkpoints, and savepoints
  • ServiceAccount and RBAC permissions for Flink cluster
  • Azure Disk storage class configuration with correct access modes

flink-rbac.yaml

Enhanced RBAC configuration:

  • Complete permissions for Flink HA functionality
  • Both namespace-level and cluster-level permissions
  • Includes watch permissions for HA operations

Deployment Steps

1. Install Flink Operator

# Apply Flink Operator configuration
kubectl apply -f flink-operator-v2.yaml

# Verify operator installation
kubectl get pods -n flink-system

2. Create Storage Resources (Optional; needed only for the production cluster)

# Apply storage configuration
kubectl apply -f flink-storage.yaml

# Verify PVC creation
kubectl get pvc -n freeleaps-data-platform

3. Deploy Flink Cluster

# Option A: Deploy with persistent storage (production)
kubectl apply -f ha-flink-cluster-v2.yaml

# Option B: Deploy with ephemeral storage (development/testing)
kubectl apply -f simple-ha-flink-cluster.yaml

# Check deployment status
kubectl get flinkdeployments -n freeleaps-data-platform
kubectl get pods -n freeleaps-data-platform -l app=flink
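
Checking the status once may catch the cluster while it is still starting. One way to wait for it is to poll the `status.jobManagerDeploymentStatus` field that the operator writes on the FlinkDeployment. The helper below is a sketch; the function name `wait_for_flink`, the retry count, and the interval are assumptions.

```shell
# Sketch: poll a FlinkDeployment until the operator reports the JobManager
# deployment as READY, or give up after a number of attempts.
# Usage: wait_for_flink <name> [namespace] [tries] [interval-seconds]
wait_for_flink() {
  name="$1"
  namespace="${2:-freeleaps-data-platform}"
  tries="${3:-30}"
  interval="${4:-10}"
  while [ "$tries" -gt 0 ]; do
    status="$(kubectl get flinkdeployment "$name" -n "$namespace" \
      -o jsonpath='{.status.jobManagerDeploymentStatus}' 2>/dev/null)"
    echo "jobManagerDeploymentStatus: ${status:-unknown}"
    if [ "$status" = "READY" ]; then
      return 0
    fi
    tries=$((tries - 1))
    sleep "$interval"
  done
  return 1
}
```

For example, `wait_for_flink ha-flink-cluster-v2` polls every 10 seconds for up to 30 attempts and returns non-zero if the cluster never becomes ready.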

High Availability Features

  • JobManager HA: 2 JobManager replicas with Kubernetes-based leader election
  • Persistent State: Checkpoints and savepoints stored on persistent volumes
  • Automatic Failover: Failed jobs are restarted with an exponential-delay restart strategy, so the wait between attempts grows after each failure
  • Pod Anti-affinity: Ensures components are distributed across different nodes
  • Storage Persistence: HA data, checkpoints, and savepoints persist across restarts
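
To make the exponential-delay behaviour concrete, the sketch below prints the wait before each restart attempt. The initial delay (1 s), multiplier (2), and cap (60 s) are illustrative assumptions, not values read from the cluster manifests, and the calculation ignores the jitter and backoff reset that Flink also applies.

```shell
# Sketch: delays (in seconds) before the first N restart attempts under an
# exponential-delay strategy. All parameter values are assumptions.
# Usage: backoff_delays [initial] [multiplier] [max] [attempts]
backoff_delays() {
  delay="${1:-1}"
  multiplier="${2:-2}"
  max="${3:-60}"
  attempts="${4:-8}"
  out=""
  i=1
  while [ "$i" -le "$attempts" ]; do
    out="${out:+$out }$delay"
    delay=$((delay * multiplier))
    # Cap the delay at the configured maximum
    if [ "$delay" -gt "$max" ]; then
      delay="$max"
    fi
    i=$((i + 1))
  done
  echo "$out"
}

backoff_delays 1 2 60 8   # prints: 1 2 4 8 16 32 60 60
```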

Network Configuration

  • JobManager: Port 8081 (Web UI), 6123 (RPC), 6124 (Blob Server)
  • TaskManager: Port 6121 (Data), 6122 (RPC), 6126 (Metrics)
  • Service Type: ClusterIP for internal communication

Storage Configuration

  • HA Data: 10Gi for high availability metadata
  • Checkpoints: 20Gi for application checkpoints
  • Savepoints: 20Gi for manual savepoints
  • Storage Class: azure-disk-std-ssd-lrs
  • Access Mode: ReadWriteOnce (Azure Disk limitation)
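
As an illustration of how these values fit together, the sketch below renders a PersistentVolumeClaim with the storage class and access mode listed above. The helper name `render_pvc` and the claim name `flink-ha-data` are hypothetical; the actual claim names are defined in flink-storage.yaml.

```shell
# Sketch: print a PVC manifest for the given (hypothetical) claim name and size.
# Usage: render_pvc <claim-name> <size>
render_pvc() {
  name="$1"
  size="$2"
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${name}
  namespace: freeleaps-data-platform
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: azure-disk-std-ssd-lrs
  resources:
    requests:
      storage: ${size}
EOF
}

render_pvc flink-ha-data 10Gi
```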

Monitoring and Operations

  • Health Checks: Built-in readiness and liveness probes
  • Web UI: Accessible through JobManager service
  • Metrics: Exposed on port 8080 for Prometheus collection
  • Logging: Centralized logging through Kubernetes
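
Because the JobManager service is ClusterIP, the Web UI is not reachable from outside the cluster by default. A port-forward is one way to reach it from a workstation. The sketch assumes the REST service is named `<cluster-name>-rest`, which is the usual convention but should be confirmed with `kubectl get svc -n freeleaps-data-platform`; the helper name `flink_ui_forward` is hypothetical.

```shell
# Sketch: forward local port 8081 to the JobManager REST/Web UI service.
# Assumes the service is named "<cluster-name>-rest".
# Usage: flink_ui_forward [cluster-name] [namespace]
flink_ui_forward() {
  cluster="${1:-ha-flink-cluster-v2}"
  namespace="${2:-freeleaps-data-platform}"
  kubectl port-forward "svc/${cluster}-rest" 8081:8081 -n "$namespace"
}
```

While `flink_ui_forward` is running, the UI is available at http://localhost:8081, and `curl http://localhost:8081/overview` returns a JSON cluster summary.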

Configuration Details

High Availability Settings

  • Type: kubernetes (native Kubernetes HA)
  • Storage: Persistent volume for HA metadata
  • Cluster ID: ha-flink-cluster-v2
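
With Kubernetes HA, Flink keeps leader-election and job metadata in ConfigMaps in the cluster's namespace. The sketch below lists them by label. The label pair `app=<cluster-id>,configmap-type=high-availability` is what Flink's Kubernetes HA services normally set, but treat it as an assumption and verify it with `kubectl get configmaps --show-labels`; the helper name is hypothetical.

```shell
# Sketch: list the ConfigMaps created by Flink's Kubernetes HA services.
# The label selector is an assumption; confirm it with --show-labels.
# Usage: list_ha_configmaps [cluster-id] [namespace]
list_ha_configmaps() {
  cluster_id="${1:-ha-flink-cluster-v2}"
  namespace="${2:-freeleaps-data-platform}"
  kubectl get configmaps -n "$namespace" \
    -l "app=${cluster_id},configmap-type=high-availability"
}
```

If `list_ha_configmaps` returns nothing while the cluster is running, HA is probably not active or the RBAC permissions for ConfigMaps are missing.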

Checkpointing Configuration

  • Interval: 60 seconds
  • Timeout: 10 minutes
  • Min Pause: 5 seconds
  • Backend: Filesystem with persistent storage
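
Expressed as Flink configuration keys, the checkpointing settings above would look roughly like this. It is a sketch of the `flinkConfiguration` entries, not a copy of the cluster manifest; the savepoint directory in particular is an assumed path, and the helper name `checkpoint_config` is hypothetical.

```shell
# Sketch: print the checkpointing settings as Flink configuration keys.
# The savepoints directory is an assumed path.
checkpoint_config() {
  cat <<'EOF'
execution.checkpointing.interval: 60s
execution.checkpointing.timeout: 10min
execution.checkpointing.min-pause: 5s
state.backend: filesystem
state.checkpoints.dir: file:///opt/flink/checkpoints
state.savepoints.dir: file:///opt/flink/savepoints
EOF
}

checkpoint_config
```

The minimum pause guarantees at least 5 seconds between the end of one checkpoint and the start of the next, even if a checkpoint takes longer than the 60-second interval.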

Resource Allocation

  • JobManager: 0.5 CPU, 1024 MB memory (HA); 1.0 CPU, 1024 MB memory (Simple)
  • TaskManager: 0.5 CPU, 2048 MB memory (HA); 2.0 CPU, 2048 MB memory (Simple)
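
To see what the HA profile asks of the cluster in total, multiply the per-pod figures by the replica counts (2 JobManagers, 3 TaskManagers). The sketch below does the arithmetic in millicores and MB; the helper name `cluster_totals` is hypothetical.

```shell
# Sketch: total CPU (millicores) and memory (MB) requested by a Flink cluster.
# Usage: cluster_totals <jm-replicas> <jm-cpu-m> <jm-mem-mb> <tm-replicas> <tm-cpu-m> <tm-mem-mb>
cluster_totals() {
  cpu=$(( $1 * $2 + $4 * $5 ))
  mem=$(( $1 * $3 + $4 * $6 ))
  echo "${cpu}m CPU, ${mem} MB memory"
}

# HA profile: 2 x (0.5 CPU, 1024 MB) + 3 x (0.5 CPU, 2048 MB)
cluster_totals 2 500 1024 3 500 2048   # prints: 2500m CPU, 8192 MB memory
```

These totals can be compared with the allocatable capacity shown by `kubectl describe nodes`.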

Troubleshooting

Common Issues and Solutions

1. PVC Binding Issues

# Check PVC status
kubectl get pvc -n freeleaps-data-platform

# PVC stuck in Pending state - usually due to:
# - Insufficient storage quota
# - Wrong access mode (ReadWriteMany not supported by Azure Disk)
# - Storage class not available

# Solution: Use ReadWriteOnce access mode or ephemeral storage
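
When several claims exist, it helps to filter for the ones that are actually stuck. The sketch below prints only the PVC names whose STATUS column is Pending; the helper name `pending_pvcs` is hypothetical.

```shell
# Sketch: print the names of PVCs that are still Pending in a namespace.
# Relies on STATUS being the second column of `kubectl get pvc` output.
# Usage: pending_pvcs [namespace]
pending_pvcs() {
  namespace="${1:-freeleaps-data-platform}"
  kubectl get pvc -n "$namespace" --no-headers 2>/dev/null \
    | awk '$2 == "Pending" { print $1 }'
}
```

Each name it prints can be passed to `kubectl describe pvc`, whose Events section usually states the reason. `kubectl get storageclass azure-disk-std-ssd-lrs` confirms whether the storage class exists.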

2. Pod CrashLoopBackOff

# Check pod status
kubectl get pods -n freeleaps-data-platform -l app=flink

# Check pod logs
kubectl logs <pod-name> -n freeleaps-data-platform

# Check pod events
kubectl describe pod <pod-name> -n freeleaps-data-platform
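
For a pod in CrashLoopBackOff, the current container has often just restarted and its log is nearly empty; the useful output is in the previous container instance. The sketch below fetches that log and the last termination reason. The helper name `crash_logs` is hypothetical, and it inspects only the first container in the pod.

```shell
# Sketch: show logs from the previous (crashed) container and why it exited.
# Usage: crash_logs <pod-name> [namespace]
crash_logs() {
  pod="${1:?pod name required}"
  namespace="${2:-freeleaps-data-platform}"
  # Last 100 log lines from the container instance that crashed
  kubectl logs "$pod" -n "$namespace" --previous --tail=100
  # Termination reason and exit code of the first container (e.g. OOMKilled 137)
  kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
}
```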

3. ServiceAccount Issues

# Verify ServiceAccount exists
kubectl get serviceaccount -n freeleaps-data-platform

# Check RBAC permissions
kubectl get rolebinding -n freeleaps-data-platform
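
Listing role bindings shows that they exist, not what they allow. Since Kubernetes HA depends on ConfigMap access, a more direct check is to ask the API server whether the ServiceAccount may perform each verb. The sketch below uses `kubectl auth can-i` with impersonation; the helper name `check_ha_rbac` is hypothetical, and the ServiceAccount name must be supplied because this README does not state it.

```shell
# Sketch: check whether a ServiceAccount may manage ConfigMaps in a namespace.
# Requires permission to impersonate the ServiceAccount.
# Usage: check_ha_rbac <serviceaccount-name> [namespace]
check_ha_rbac() {
  sa="${1:?ServiceAccount name required}"
  namespace="${2:-freeleaps-data-platform}"
  for verb in get list watch create update delete; do
    printf '%s configmaps: ' "$verb"
    kubectl auth can-i "$verb" configmaps -n "$namespace" \
      --as="system:serviceaccount:${namespace}:${sa}"
  done
}
```

Any line ending in `no` points at a missing rule in the Role or ClusterRole bound to that ServiceAccount.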

4. Storage Path Issues

# Ensure storage paths match volume mounts
# For persistent storage: /opt/flink/ha-data, /opt/flink/checkpoints
# For ephemeral storage: /tmp/flink/ha-data, /tmp/flink/checkpoints
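
To confirm that the paths really exist inside a running pod and see which filesystem backs them, the directories can be inspected with `kubectl exec`. This is a sketch: the helper name `check_mounts` is hypothetical, and it assumes `df` is available in the Flink container image.

```shell
# Sketch: show the filesystem backing the HA and checkpoint directories.
# Pass /tmp/flink as the base directory for the ephemeral-storage cluster.
# Usage: check_mounts <pod-name> [namespace] [base-dir]
check_mounts() {
  pod="${1:?pod name required}"
  namespace="${2:-freeleaps-data-platform}"
  base="${3:-/opt/flink}"
  kubectl exec "$pod" -n "$namespace" -- df -h "${base}/ha-data" "${base}/checkpoints"
}
```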

Diagnostic Commands

# Check Flink Operator logs
kubectl logs -n flink-system -l app.kubernetes.io/name=flink-kubernetes-operator

# Check Flink cluster status
kubectl describe flinkdeployment <cluster-name> -n freeleaps-data-platform

# Check pod events
kubectl get events -n freeleaps-data-platform --sort-by='.lastTimestamp'

# Check storage status
kubectl get pvc -n freeleaps-data-platform
kubectl describe pvc <pvc-name> -n freeleaps-data-platform

# Check operator status
kubectl get pods -n flink-system
kubectl logs -n flink-system deployment/flink-kubernetes-operator

Important Notes

  1. Storage Limitations: Azure Disk storage class only supports ReadWriteOnce access mode
  2. ServiceAccount: Ensure the correct ServiceAccount is specified in cluster configuration
  3. Resource Requirements: Verify cluster has enough CPU/memory for all replicas
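  4. Network Policies: If NetworkPolicies are enforced in the namespace, they may need to allow traffic between JobManager and TaskManager pods on the ports listed under Network Configuration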
  5. Ephemeral vs Persistent: Use ephemeral storage for development/testing, persistent for production

Development/Testing Deployment

# 1. Deploy operator
kubectl apply -f flink-operator-v2.yaml

# 2. Wait for operator to be ready
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=flink-kubernetes-operator -n flink-system --timeout=300s

# 3. Deploy simple HA cluster (no persistent storage)
kubectl apply -f simple-ha-flink-cluster.yaml

# 4. Monitor deployment
kubectl get flinkdeployments -n freeleaps-data-platform
kubectl get pods -n freeleaps-data-platform -l app=flink

Production Deployment

# 1. Deploy operator
kubectl apply -f flink-operator-v2.yaml

# 2. Deploy storage resources
kubectl apply -f flink-storage.yaml

# 3. Deploy production HA cluster
kubectl apply -f ha-flink-cluster-v2.yaml

# 4. Monitor deployment
kubectl get flinkdeployments -n freeleaps-data-platform
kubectl get pods -n freeleaps-data-platform -l app=flink