Nicolas c9f681e44b deploy flink		2025-08-21 10:50:07 +08:00
flink-crd.yaml	deploy flink	2025-08-21 10:50:07 +08:00
flink-operator-v2.yaml	deploy flink	2025-08-21 10:50:07 +08:00
flink-rbac.yaml	deploy flink	2025-08-21 10:50:07 +08:00
flink-storage.yaml	deploy flink	2025-08-21 10:50:07 +08:00
ha-flink-cluster-v2.yaml	deploy flink	2025-08-21 10:50:07 +08:00
README.md	deploy flink	2025-08-21 10:50:07 +08:00
simple-ha-flink-cluster.yaml	deploy flink	2025-08-21 10:50:07 +08:00
values.yaml	fix(flink): update resource requests and limits for jobmanager and taskmanager	2025-07-04 11:03:26 +08:00

README.md

Flink High Availability Cluster Deployment

Overview

This project uses the Apache Flink Kubernetes Operator to deploy a high-availability Flink cluster with persistent storage and automatic failover.

Component Architecture

JobManager: 2 replicas with high availability configuration
TaskManager: 3 replicas for distributed processing
High Availability: Kubernetes-based HA with persistent storage
Checkpointing: Persistent storage for checkpoints and savepoints

File Description

1. flink-operator-v2.yaml

Flink Kubernetes Operator deployment configuration:

Operator deployment in flink-system namespace
RBAC configuration for cluster-wide permissions
Health checks and resource limits
Enhanced CRDs with additional printer columns

2. flink-crd.yaml

Custom Resource Definitions for Flink:

FlinkDeployment CRD
FlinkSessionJob CRD
Required for the Flink Operator to function

3. ha-flink-cluster-v2.yaml

Production-ready HA Flink cluster configuration:

2 JobManager replicas with HA enabled
3 TaskManager replicas with anti-affinity rules
Persistent storage for HA data, checkpoints, and savepoints
Memory and CPU resource allocation
Exponential delay restart strategy
Proper volume mounts and storage configuration

4. simple-ha-flink-cluster.yaml

Simplified HA Flink cluster configuration:

Uses ephemeral storage to avoid PVC binding issues
Basic HA configuration
Minimal resource requirements
Recommended for development and testing only

5. flink-storage.yaml

Storage and RBAC configuration:

PersistentVolumeClaims for HA data, checkpoints, and savepoints
ServiceAccount and RBAC permissions for Flink cluster
Azure Disk storage class configuration with correct access modes

6. flink-rbac.yaml

Enhanced RBAC configuration:

Complete permissions for Flink HA functionality
Both namespace-level and cluster-level permissions
Includes watch permissions for HA operations

Deployment Steps

1. Install Flink Operator

# Apply Flink Operator configuration
kubectl apply -f flink-operator-v2.yaml

# Verify operator installation
kubectl get pods -n flink-system
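
A pod listing is only a snapshot. To block until the operator is actually available, a small wrapper around `kubectl rollout status` can help. This is a sketch: it assumes the Deployment is named `flink-kubernetes-operator` (the name used in the diagnostic commands in this README), and the function name and the 180-second default are illustrative choices.

```shell
#!/usr/bin/env bash
# Wait until the operator Deployment reports a completed rollout.
# Assumption: the Deployment is called flink-kubernetes-operator.
wait_for_operator() {
  local namespace="${1:-flink-system}"
  local timeout="${2:-180s}"
  kubectl rollout status deployment/flink-kubernetes-operator \
    -n "$namespace" --timeout="$timeout"
}
```

Call it as `wait_for_operator` for the defaults, or `wait_for_operator flink-system 300s` on a slow cluster. A non-zero exit status means the rollout did not finish in time.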

2. Create Storage Resources (production only; skip for the ephemeral-storage cluster)

# Apply storage configuration
kubectl apply -f flink-storage.yaml

# Verify PVC creation
kubectl get pvc -n freeleaps-data-platform
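
To turn the PVC check into something a script can act on, the phases can be filtered for anything that is not `Bound`. The helper names below are hypothetical; only the `kubectl get pvc` call and its JSONPath output come from kubectl itself.

```shell
#!/usr/bin/env bash
# Print name and phase for every PVC in a namespace, tab-separated, one per line.
list_pvc_phases() {
  local namespace="$1"
  kubectl get pvc -n "$namespace" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
}

# Read "name<TAB>phase" lines on stdin and print the names that are not Bound.
unbound_pvcs() {
  awk -F'\t' 'NF && $2 != "Bound" { print $1 }'
}
```

Usage: `list_pvc_phases freeleaps-data-platform | unbound_pvcs`. Empty output means every claim is bound.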

3. Deploy HA Flink Cluster

# Option A: Deploy with persistent storage (production)
kubectl apply -f ha-flink-cluster-v2.yaml

# Option B: Deploy with ephemeral storage (development/testing)
kubectl apply -f simple-ha-flink-cluster.yaml

# Check deployment status
kubectl get flinkdeployments -n freeleaps-data-platform
kubectl get pods -n freeleaps-data-platform -l app=flink
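
The status columns can also be read directly from the FlinkDeployment resource. The sketch below assumes the `status.jobManagerDeploymentStatus` and `status.lifecycleState` fields of the operator's v1beta1 API; check them against the CRD installed in your cluster. The function names are illustrative.

```shell
#!/usr/bin/env bash
# Print "<jobManagerDeploymentStatus> <lifecycleState>" for a FlinkDeployment.
flink_deployment_state() {
  local name="$1" namespace="$2"
  kubectl get flinkdeployment "$name" -n "$namespace" \
    -o jsonpath='{.status.jobManagerDeploymentStatus} {.status.lifecycleState}'
}

# Succeed only when the JobManager deployment status is READY.
flink_deployment_ready() {
  local state
  state="$(flink_deployment_state "$1" "$2")" || return 1
  [ "${state%% *}" = "READY" ]
}
```

Usage, assuming the resource is named after the cluster ID listed in this README: `flink_deployment_ready ha-flink-cluster-v2 freeleaps-data-platform && echo ready`.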

High Availability Features

JobManager HA: 2 JobManager replicas with Kubernetes-based leader election
Persistent State: Checkpoints and savepoints stored on persistent volumes
Automatic Failover: Failed jobs restart under the exponential-delay restart strategy; a standby JobManager takes over through leader election
Pod Anti-affinity: Ensures components are distributed across different nodes
Storage Persistence: HA data, checkpoints, and savepoints persist across restarts
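
One way to confirm that anti-affinity is doing its job is to list which node each pod landed on and flag nodes that host more than one. This is a sketch with hypothetical helper names; the label selector is passed in because the right one depends on how your manifests label JobManager and TaskManager pods.

```shell
#!/usr/bin/env bash
# Print "pod<TAB>node" for every pod matching a label selector.
list_pod_nodes() {
  local namespace="$1" selector="$2"
  kubectl get pods -n "$namespace" -l "$selector" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.nodeName}{"\n"}{end}'
}

# Read "pod<TAB>node" lines on stdin and print each node hosting more than one pod.
# Pods that are not scheduled yet (empty node) are ignored.
nodes_with_multiple_pods() {
  awk -F'\t' 'NF == 2 && $2 != "" { count[$2]++ }
              END { for (n in count) if (count[n] > 1) print n }' | sort
}
```

Usage: `list_pod_nodes freeleaps-data-platform app=flink | nodes_with_multiple_pods`. Whether a shared node is a problem depends on how the anti-affinity rules are scoped, for example per component rather than across the whole cluster.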

Network Configuration

JobManager: Port 8081 (Web UI), 6123 (RPC), 6124 (Blob Server)
TaskManager: Port 6121 (Data), 6122 (RPC), 6126 (Metrics)
Service Type: ClusterIP for internal communication
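
Because the Service is ClusterIP, the Web UI is not reachable from outside the cluster without a port-forward. The sketch below assumes the REST Service follows Flink's `<cluster-id>-rest` naming convention; confirm the actual name with `kubectl get svc -n freeleaps-data-platform` first. The function names are illustrative.

```shell
#!/usr/bin/env bash
# Name of the REST/Web UI Service for a cluster.
# Assumption: the Service is named "<cluster-id>-rest".
rest_service_name() {
  printf '%s-rest\n' "$1"
}

# Forward local port 8081 to the Web UI port of the REST Service.
forward_web_ui() {
  local cluster="$1" namespace="$2"
  kubectl port-forward "svc/$(rest_service_name "$cluster")" 8081:8081 -n "$namespace"
}
```

Usage: run `forward_web_ui ha-flink-cluster-v2 freeleaps-data-platform` and open http://localhost:8081 while the command is running.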

Storage Configuration

HA Data: 10Gi for high availability metadata
Checkpoints: 20Gi for application checkpoints
Savepoints: 20Gi for manual savepoints
Storage Class: azure-disk-std-ssd-lrs
Access Mode: ReadWriteOnce (Azure Disk limitation)
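
For reference, a claim with these settings could look like the manifest printed by the sketch below. The storage class, access mode, and sizes are the ones listed above; the claim name in the usage line is hypothetical, and the real definitions live in flink-storage.yaml.

```shell
#!/usr/bin/env bash
# Print a PersistentVolumeClaim manifest.
# Arguments: claim name (hypothetical), size, namespace.
render_pvc() {
  local name="$1" size="$2" namespace="$3"
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${name}
  namespace: ${namespace}
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: azure-disk-std-ssd-lrs
  resources:
    requests:
      storage: ${size}
EOF
}
```

Usage: `render_pvc flink-checkpoints 20Gi freeleaps-data-platform | kubectl apply --dry-run=client -f -` validates the manifest without creating anything.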

Monitoring and Operations

Health Checks: Built-in readiness and liveness probes
Web UI: Accessible through JobManager service
Metrics: Exposed on port 8080 for Prometheus collection
Logging: Centralized logging through Kubernetes

Configuration Details

High Availability Settings

Type: kubernetes (native Kubernetes HA)
Storage: Persistent volume for HA metadata
Cluster ID: ha-flink-cluster-v2

Checkpointing Configuration

Interval: 60 seconds
Timeout: 10 minutes
Min Pause: 5 seconds
Backend: Filesystem with persistent storage
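
The high-availability and checkpointing settings above map onto Flink configuration keys roughly as follows. This is a sketch, not a copy of ha-flink-cluster-v2.yaml: the key names are the ones used by recent Flink releases (older releases spell some of them differently), and the paths are the persistent-storage paths listed under Troubleshooting.

```shell
#!/usr/bin/env bash
# Print a flinkConfiguration fragment for the HA and checkpointing settings.
# Assumption: key names follow recent Flink releases.
render_flink_configuration() {
  cat <<'EOF'
flinkConfiguration:
  high-availability.type: kubernetes
  high-availability.storageDir: file:///opt/flink/ha-data
  state.checkpoints.dir: file:///opt/flink/checkpoints
  execution.checkpointing.interval: 60s
  execution.checkpointing.timeout: 10min
  execution.checkpointing.min-pause: 5s
EOF
}
```

To compare this with what is actually deployed, run `kubectl get flinkdeployment ha-flink-cluster-v2 -n freeleaps-data-platform -o jsonpath='{.spec.flinkConfiguration}'`.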

Resource Allocation

JobManager: 0.5 CPU and 1024 MB memory (HA cluster); 1.0 CPU and 1024 MB memory (simple cluster)
TaskManager: 0.5 CPU and 2048 MB memory (HA cluster); 2.0 CPU and 2048 MB memory (simple cluster)
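
Multiplying these per-pod figures by the replica counts gives the capacity the HA cluster needs in total. The sketch below does that arithmetic for 2 JobManagers and 3 TaskManagers; the function name is illustrative.

```shell
#!/usr/bin/env bash
# Sum the CPU (millicores) and memory (MB) requested by the HA cluster,
# using the replica counts and per-pod figures listed in this README.
total_requests() {
  local jm_replicas=2 jm_cpu_m=500 jm_mem_mb=1024
  local tm_replicas=3 tm_cpu_m=500 tm_mem_mb=2048
  local cpu_m=$(( jm_replicas * jm_cpu_m + tm_replicas * tm_cpu_m ))
  local mem_mb=$(( jm_replicas * jm_mem_mb + tm_replicas * tm_mem_mb ))
  echo "cpu=${cpu_m}m memory=${mem_mb}MB"
}

total_requests   # prints: cpu=2500m memory=8192MB
```

That is 2.5 CPU and 8 GB before any operator or system overhead. `kubectl describe nodes` shows the allocatable capacity to compare against.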

Troubleshooting

Common Issues and Solutions

1. PVC Binding Issues

# Check PVC status
kubectl get pvc -n freeleaps-data-platform

# PVC stuck in Pending state - usually due to:
# - Insufficient storage quota
# - Wrong access mode (ReadWriteMany not supported by Azure Disk)
# - Storage class not available

# Solution: Use ReadWriteOnce access mode or ephemeral storage
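
A Pending claim is not always an error. If the storage class uses the `WaitForFirstConsumer` volume binding mode, Kubernetes delays binding until a pod that uses the claim is scheduled. The sketch below reads the mode and explains what it implies; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Print the volumeBindingMode of a storage class.
storage_class_binding_mode() {
  kubectl get storageclass "$1" -o jsonpath='{.volumeBindingMode}'
}

# Translate a binding mode into what to expect from a Pending claim.
explain_binding_mode() {
  case "$1" in
    WaitForFirstConsumer)
      echo "Pending is expected until a pod that uses the claim is scheduled" ;;
    Immediate)
      echo "the claim should bind without a pod; inspect its events" ;;
    *)
      echo "unknown binding mode: $1" ;;
  esac
}
```

Usage: `explain_binding_mode "$(storage_class_binding_mode azure-disk-std-ssd-lrs)"`.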

2. Pod CrashLoopBackOff

# Check pod status
kubectl get pods -n freeleaps-data-platform -l app=flink

# Check pod logs
kubectl logs <pod-name> -n freeleaps-data-platform

# Check pod events
kubectl describe pod <pod-name> -n freeleaps-data-platform
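
For a pod that keeps restarting, the current container's log is often empty; the useful output is in the previous, crashed container, which `kubectl logs --previous` retrieves. The sketch below collects that for every crash-looping Flink pod. The helper names and the 50-line tail are illustrative.

```shell
#!/usr/bin/env bash
# Read `kubectl get pods --no-headers` output on stdin and print the names
# of pods whose STATUS column is CrashLoopBackOff.
crashlooping_pods() {
  awk '$3 == "CrashLoopBackOff" { print $1 }'
}

# Print the tail of the previous container's log for each crash-looping pod.
previous_logs() {
  local namespace="$1" pod
  kubectl get pods -n "$namespace" -l app=flink --no-headers | crashlooping_pods |
    while read -r pod; do
      echo "=== $pod ==="
      kubectl logs "$pod" -n "$namespace" --previous --tail=50
    done
}
```

Usage: `previous_logs freeleaps-data-platform`.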

3. ServiceAccount Issues

# Verify ServiceAccount exists
kubectl get serviceaccount -n freeleaps-data-platform

# Check RBAC permissions
kubectl get rolebinding -n freeleaps-data-platform
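
Listing the role bindings shows that they exist, not that they grant enough. Kubernetes-based HA stores leader election and metadata in ConfigMaps, so one direct test is to ask the API server whether the ServiceAccount may work with them. This is a sketch: the function name is hypothetical, and the ServiceAccount name `flink` in the usage line is an assumption, so substitute the one from your cluster configuration.

```shell
#!/usr/bin/env bash
# Ask whether a ServiceAccount may perform a verb on ConfigMaps in a namespace.
# Prints "yes" or "no" (output of `kubectl auth can-i`).
can_sa() {
  local verb="$1" namespace="$2" service_account="$3"
  kubectl auth can-i "$verb" configmaps -n "$namespace" \
    --as="system:serviceaccount:${namespace}:${service_account}"
}
```

Usage: `for v in get list watch create update delete; do echo "$v: $(can_sa "$v" freeleaps-data-platform flink)"; done`.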

4. Storage Path Issues

# Ensure storage paths match volume mounts
# For persistent storage: /opt/flink/ha-data, /opt/flink/checkpoints
# For ephemeral storage: /tmp/flink/ha-data, /tmp/flink/checkpoints
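
To confirm that the configured paths really exist inside a running container, they can be listed through `kubectl exec`. This is a minimal sketch with a hypothetical function name; it assumes the container image provides `ls`.

```shell
#!/usr/bin/env bash
# List the given directories inside a pod; a missing path makes `ls` fail.
check_paths() {
  local pod="$1" namespace="$2"
  shift 2
  kubectl exec "$pod" -n "$namespace" -- ls -ld "$@"
}
```

Usage: `check_paths <pod-name> freeleaps-data-platform /opt/flink/ha-data /opt/flink/checkpoints`.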

Diagnostic Commands

# Check Flink Operator logs
kubectl logs -n flink-system -l app.kubernetes.io/name=flink-kubernetes-operator

# Check Flink cluster status
kubectl describe flinkdeployment <cluster-name> -n freeleaps-data-platform

# Check pod events
kubectl get events -n freeleaps-data-platform --sort-by='.lastTimestamp'

# Check storage status
kubectl get pvc -n freeleaps-data-platform
kubectl describe pvc <pvc-name> -n freeleaps-data-platform

# Check operator status
kubectl get pods -n flink-system
kubectl logs -n flink-system deployment/flink-kubernetes-operator

Important Notes

Storage Limitations: The Azure Disk storage class only supports the ReadWriteOnce access mode, so each volume can be mounted read-write by only one node at a time
ServiceAccount: Ensure the correct ServiceAccount is specified in cluster configuration
Resource Requirements: Verify cluster has enough CPU/memory for all replicas
Network Policies: May need adjustment for inter-pod communication
Ephemeral vs Persistent: Use ephemeral storage for development/testing, persistent for production

Quick Start (Recommended for Testing)

# 1. Deploy operator
kubectl apply -f flink-operator-v2.yaml

# 2. Wait for operator to be ready
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=flink-kubernetes-operator -n flink-system --timeout=180s

# 3. Deploy simple HA cluster (no persistent storage)
kubectl apply -f simple-ha-flink-cluster.yaml

# 4. Monitor deployment
kubectl get flinkdeployments -n freeleaps-data-platform
kubectl get pods -n freeleaps-data-platform -l app=flink

Production Deployment

# 1. Deploy operator
kubectl apply -f flink-operator-v2.yaml

# 2. Deploy storage resources
kubectl apply -f flink-storage.yaml

# 3. Deploy production HA cluster
kubectl apply -f ha-flink-cluster-v2.yaml

# 4. Monitor deployment
kubectl get flinkdeployments -n freeleaps-data-platform
kubectl get pods -n freeleaps-data-platform -l app=flink