[Monitoring] Add auto-allocation of ECS resources #4407

Open · wants to merge 31 commits into master

Conversation

MinhThieu145
Contributor

This pull request adds functionality to automatically monitor and scale Fargate workers. The feature helps optimize resource allocation and improves the overall performance of the application.
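At a high level, the helpers added in this PR could be wired together roughly as sketched below. This is only an orientation sketch: the driver function name, the task-definition naming, and the exact arguments passed to get_new_resource_limit are assumptions and are not taken from the diff.

# Hypothetical driver loop, for illustration only
def monitor_and_scale_workers(challenges):
    for challenge in challenges:
        # 1. Pull recent utilization metrics from CloudWatch
        cpu_metrics = get_cpu_metrics_for_challenge(challenge)
        memory_metrics = get_memory_metrics_for_challenge(challenge)

        # 2. Look up the current limits from the ECS task definition
        #    (the "_task_def" naming here is an assumption)
        task_def = "{}_task_def".format(challenge.queue)
        current_cpu = current_worker_limit(task_def, "cpu")
        current_memory = current_worker_limit(task_def, "memory")

        # 3. Decide the new limits (argument order is a guess based on the
        #    names used inside get_new_resource_limit)
        new_cpu = get_new_resource_limit(cpu_metrics, current_cpu, "CPU", task_def)
        new_memory = get_new_resource_limit(memory_metrics, current_memory, "Memory", task_def)

        # 4. A real implementation would register a revised task definition
        #    and update the ECS service with the new limits here.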

Comment on lines 57 to 84
def get_cpu_metrics_for_challenge(
    challenge,
    cluster_name=COMMON_SETTINGS_DICT["CLUSTER"],
    range_days=1,
    period_seconds=300,
):
    """
    Get the CPU Utilization of the worker in the challenge.
    """

    cloudwatch_client = get_boto3_client("cloudwatch", aws_keys)

    start_time = datetime.utcnow() - timedelta(days=range_days)
    end_time = datetime.utcnow()
    queue_name = challenge.queue
    service_name = "{}_service".format(queue_name)

    response = cloudwatch_client.get_metric_statistics(
        Namespace="AWS/ECS",
        MetricName="CPUUtilization",
        Dimensions=[
            {"Name": "ClusterName", "Value": cluster_name},
            {"Name": "ServiceName", "Value": service_name},
        ],
        StartTime=start_time,
        EndTime=end_time,
        Period=period_seconds,
        Statistics=["Average", "Maximum", "Minimum"],
    )

    return response["Datapoints"]
Contributor Author

Extract CPU utilization from workers with CloudWatch
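For orientation, a possible way to consume the returned datapoints; the challenge variable and the printed values are hypothetical, and the datapoint shape follows the Statistics requested above plus the Timestamp and Unit fields CloudWatch returns.

# Hypothetical usage; `challenge` is assumed to be a Challenge model instance
datapoints = get_cpu_metrics_for_challenge(challenge, range_days=1, period_seconds=300)
# Each datapoint is a dict like:
# {"Timestamp": datetime(...), "Average": 42.0, "Maximum": 80.0,
#  "Minimum": 10.0, "Unit": "Percent"}
for point in sorted(datapoints, key=lambda d: d["Timestamp"]):
    print(point["Timestamp"], point["Average"])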

Comment on lines 90 to 119
def get_memory_metrics_for_challenge(
    challenge,
    cluster_name=COMMON_SETTINGS_DICT["CLUSTER"],
    range_days=1,
    period_seconds=300,
):
    """
    Get the Memory Utilization of the worker in the challenge.
    """

    cloudwatch_client = get_boto3_client("cloudwatch", aws_keys)

    start_time = datetime.utcnow() - timedelta(days=range_days)
    end_time = datetime.utcnow()
    queue_name = challenge.queue
    service_name = "{}_service".format(queue_name)

    response = cloudwatch_client.get_metric_statistics(
        Namespace="AWS/ECS",
        MetricName="MemoryUtilization",
        Dimensions=[
            {"Name": "ClusterName", "Value": cluster_name},
            {"Name": "ServiceName", "Value": service_name},
        ],
        StartTime=start_time,
        EndTime=end_time,
        Period=period_seconds,
        Statistics=["Average", "Maximum", "Minimum"],
    )

    return response["Datapoints"]
Contributor Author

Extract memory utilization from workers with CloudWatch

Comment on lines 122 to 151
def get_storage_metrics_for_challenge(
    challenge,
    cluster_name=COMMON_SETTINGS_DICT["CLUSTER"],
    range_days=1,
    period_seconds=300,
):
    """
    Get the Storage Utilization of the worker in the challenge.
    """

    from datetime import datetime, timedelta

    cloudwatch_client = get_boto3_client("cloudwatch", aws_keys)

    start_time = datetime.utcnow() - timedelta(days=range_days)
    end_time = datetime.utcnow()
    queue_name = challenge.queue
    service_name = "{}_service".format(queue_name)

    response = cloudwatch_client.get_metric_statistics(
        Namespace="ECS/ContainerInsights",
        MetricName="EphemeralStorageUtilized",
        Dimensions=[
            {"Name": "ClusterName", "Value": cluster_name},
            {"Name": "ServiceName", "Value": service_name},
        ],
        StartTime=start_time,
        EndTime=end_time,
        Period=period_seconds,
        Statistics=["Average"],
    )

    return response["Datapoints"]
Contributor Author

Extract storage utilization from workers with CloudWatch
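One caveat worth flagging: EphemeralStorageUtilized lives in the ECS/ContainerInsights namespace, so it is only published when Container Insights is enabled on the cluster, and it reports gigabytes used rather than a percentage. A minimal sketch of enabling Container Insights with boto3 (the cluster name comes from the same settings dict used above):

# Enable Container Insights so ECS/ContainerInsights metrics get published
ecs_client = get_boto3_client("ecs", aws_keys)
ecs_client.update_cluster_settings(
    cluster=COMMON_SETTINGS_DICT["CLUSTER"],
    settings=[{"name": "containerInsights", "value": "enabled"}],
)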

Comment on lines 158 to 169
def current_worker_limit(task_definition_name, metrics):
    ecs_client = get_boto3_client("ecs", aws_keys)
    try:
        response = ecs_client.describe_task_definition(
            taskDefinition=task_definition_name
        )
    except Exception as e:
        print(f"Error retrieving task definition: {str(e)}")
        return {}

    task_definition = response.get("taskDefinition", {})
    return task_definition.get(metrics, 0)
Contributor Author

Extracts the current resource limit for workers; applies only to CPU and memory.
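For illustration, a possible call pattern; the task definition name is a placeholder, while "cpu" and "memory" are the top-level keys ECS uses for Fargate task-level limits.

# Hypothetical task definition name
task_def = "some_challenge_queue_task_def"
current_cpu = current_worker_limit(task_def, "cpu")        # e.g. "512" (CPU units)
current_memory = current_worker_limit(task_def, "memory")  # e.g. "1024" (MiB)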

    return task_definition.get(metrics, 0)


def get_new_resource_limit(
Contributor Author

This function adjusts a worker's resource limit based on how heavily the resource is being used:

  • Lower the limit: if average utilization is 25% or below, the limit is cut in half.
  • Raise the limit: if average utilization is 75% or above, the limit is doubled.

There is also room to add more complex logic later if needed.
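To make the thresholds concrete, here is a compact restatement of those rules; _next_limit is a hypothetical helper name used only for illustration and does not appear in the diff.

# Illustrative only: compact restatement of the thresholds described above
def _next_limit(average, current_limit):
    if average <= 25:
        return str(int(current_limit) // 2)  # cut the limit in half
    if average >= 75:
        return str(int(current_limit) * 2)   # double the limit
    return current_limit                     # leave it unchanged

# _next_limit(18, "1024") -> "512"
# _next_limit(90, "1024") -> "2048"
# _next_limit(50, "1024") -> "1024"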

Comment on lines 178 to 180
    average_points = [d["Average"] for d in metrics if "Average" in d]
    # Divide by the number of datapoints that actually carry an "Average" value
    # and guard against an empty metrics list
    average = sum(average_points) / len(average_points) if average_points else 0
Contributor Author

Calculates the mean of the "Average" values across all returned data points (each data point covers one reporting period, 300 seconds by default). This gives the overall average utilization percentage over the query window.
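As a quick illustration with made-up datapoints (the values are invented, not from CloudWatch):

# Hypothetical datapoints, values invented for illustration
metrics = [
    {"Average": 20.0, "Maximum": 35.0, "Minimum": 10.0},
    {"Average": 30.0, "Maximum": 50.0, "Minimum": 15.0},
    {"Average": 10.0, "Maximum": 20.0, "Minimum": 5.0},
]
average_points = [d["Average"] for d in metrics if "Average" in d]
average = sum(average_points) / len(average_points) if average_points else 0
# average == 20.0, at or below the 25% threshold, so the limit would be halved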

Comment on lines 182 to 203
    # Apply separate logic based on whether the metric is CPU or Memory
    if metric_name == "CPU":
        # CPU-specific scaling logic
        if average <= 25:
            # if average is 25% or lower, scale down
            print(
                f"Scaling down {service_name} due to low {metric_name} utilization"
            )
            new_limit = str(int(current_metric_limit) // 2)
        elif average >= 75:
            # if average is 75% or higher, scale up
            print(
                f"Scaling up {service_name} due to high {metric_name} utilization"
            )
            new_limit = str(int(current_metric_limit) * 2)
        else:
            # no scaling action required
            print(
                f"No scaling action required for {service_name} based on {metric_name} utilization"
            )
            new_limit = current_metric_limit
        return new_limit
Contributor Author

Logic to find new limit for CPU

Comment on lines 204 to 223
    elif metric_name == "Memory":
        if average <= 25:
            # if average is 25% or lower, scale down
            print(
                f"Scaling down {service_name} due to low {metric_name} utilization"
            )
            new_limit = str(int(current_metric_limit) // 2)
        elif average >= 75:
            # if average is 75% or higher, scale up
            print(
                f"Scaling up {service_name} due to high {metric_name} utilization"
            )
            new_limit = str(int(current_metric_limit) * 2)
        else:
            # no scaling action required
            print(
                f"No scaling action required for {service_name} based on {metric_name} utilization"
            )
            new_limit = current_metric_limit
        return new_limit
Contributor Author

Logic to find new limit for Memory

@codecov-commenter

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.30%. Comparing base (96968d6) to head (341ccc4).
Report is 1110 commits behind head on master.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4407      +/-   ##
==========================================
- Coverage   72.93%   69.30%   -3.63%     
==========================================
  Files          83       20      -63     
  Lines        5368     3574    -1794     
==========================================
- Hits         3915     2477    -1438     
+ Misses       1453     1097     -356     

see 64 files with indirect coverage changes


@gchhablani changed the title from "Add auto monitor and scale for Fargate workers" to "[Monitoring] Add auto-allocation of ECS resources" on Aug 28, 2024
@gchhablani
Collaborator

Thanks for this PR! Made a few changes.
