Evalai: [Backend] Add feature to stop/delete code upload infrastructure

Created on 8 Feb 2021  ·  6 Comments  ·  Source: Cloud-CV/EvalAI

Description

We have an automated setup that creates a code upload worker cluster in the host's AWS account, along with all the services required to evaluate code upload challenge submissions. We want to add automated support for stopping/deleting the code upload worker infrastructure. This requires:

  • [ ] Deleting the code upload worker running on EvalAI Fargate
  • [ ] Deleting the EKS cluster nodegroup
  • [ ] Deleting the EKS cluster
  • [ ] Deleting the CloudWatch log groups
  • [ ] Deleting the VPC and its associated subnets, security group and route table
  • [ ] Deleting IAM roles
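
The checklist above implies an ordering: AWS rejects deletion of a resource that others still depend on, so the nodegroup has to go before the EKS cluster, subnets before the VPC, and attached policies before the IAM role. The following is a minimal sketch of that ordering, not EvalAI's implementation. `ClusterResources` and its field names are hypothetical stand-ins for whatever is stored on `ChallengeEvaluationCluster`, it assumes the Fargate worker runs as an ECS service, and each client argument is assumed to behave like the matching `boto3.client(...)`. A real VPC may also have other attachments (for example an internet gateway) that this sketch does not cover.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ClusterResources:
    """Hypothetical stand-in for values read from ChallengeEvaluationCluster."""
    ecs_cluster: str
    worker_service: str
    eks_cluster: str
    nodegroup: str
    log_groups: List[str]
    vpc_id: str
    subnet_ids: List[str]
    security_group_id: str
    route_table_id: str
    # IAM role name -> ARNs of the managed policies attached to that role
    role_policies: Dict[str, List[str]] = field(default_factory=dict)


def delete_code_upload_infra(ecs, eks, logs, ec2, iam, res: ClusterResources) -> None:
    """Delete resources so that dependents go before what they depend on.

    Each client is expected to behave like boto3.client("<service name>").
    """
    # 1. Remove the worker service (assumed here to be an ECS service on Fargate).
    ecs.delete_service(cluster=res.ecs_cluster, service=res.worker_service, force=True)

    # 2. An EKS cluster cannot be deleted while it still has nodegroups,
    #    so wait for the nodegroup deletion to finish first.
    eks.delete_nodegroup(clusterName=res.eks_cluster, nodegroupName=res.nodegroup)
    eks.get_waiter("nodegroup_deleted").wait(
        clusterName=res.eks_cluster, nodegroupName=res.nodegroup
    )
    eks.delete_cluster(name=res.eks_cluster)
    eks.get_waiter("cluster_deleted").wait(name=res.eks_cluster)

    # 3. CloudWatch log groups have no dependents in this list.
    for name in res.log_groups:
        logs.delete_log_group(logGroupName=name)

    # 4. Everything inside the VPC goes before the VPC itself.
    for subnet_id in res.subnet_ids:
        ec2.delete_subnet(SubnetId=subnet_id)
    ec2.delete_route_table(RouteTableId=res.route_table_id)
    ec2.delete_security_group(GroupId=res.security_group_id)
    ec2.delete_vpc(VpcId=res.vpc_id)

    # 5. A role can only be deleted once its managed policies are detached.
    for role_name, policy_arns in res.role_policies.items():
        for arn in policy_arns:
            iam.detach_role_policy(RoleName=role_name, PolicyArn=arn)
        iam.delete_role(RoleName=role_name)
```

Passing the clients in as arguments keeps the function easy to exercise with stub objects when no AWS credentials are available.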

Metadata for these services is stored in the ChallengeEvaluationCluster model; you can find the code here. The codebase for starting these services is available here.

The approach for this feature is to:

  • [ ] Add a Django admin action for the challenge model which will trigger resource deletion
  • [ ] Create multiple Celery tasks which will handle deletion of each service
  • [ ] Trigger the service deletion using boto3 APIs, using the ChallengeEvaluationCluster model to get the resource IDs
  • [ ] Send an email to the challenge host when all services are stopped successfully
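
The control flow behind this approach can be sketched in plain Python: run one deletion step per service in order, stop at the first failure, and notify the host only if every step succeeded. This is an illustration under assumptions rather than the planned implementation. `run_teardown`, the step names, and the `notify_host` callback are all hypothetical; in EvalAI each step would presumably be a Celery task started from the Django admin action, and `notify_host` would send the email.

```python
from typing import Callable, List, Tuple


def run_teardown(
    steps: List[Tuple[str, Callable[[], None]]],
    notify_host: Callable[[str], None],
) -> List[Tuple[str, str]]:
    """Run deletion steps in order and report the outcome of each one.

    steps: ordered (service name, zero-argument deletion callable) pairs.
    notify_host: called once, and only if every step succeeded.
    """
    report: List[Tuple[str, str]] = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            # Stop here: later steps may depend on this one having succeeded.
            report.append((name, "failed: {}".format(exc)))
            return report
        report.append((name, "deleted"))

    notify_host("All code upload services were stopped successfully.")
    return report
```

Stopping at the first failure leaves the remaining resources untouched, so the action can be retried once the cause is fixed.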
Labels: GSOC-2021, backend, hard-to-fix

All 6 comments

@Ram81 May I take this one?

@Ram81 Would it make sense to store eks_role_name, nodegroup_name (and some other values) in the ChallengeEvaluationCluster model as well? We need to refer back to those values in order to delete the resources.

Currently we can derive them at deletion time, just as we do at creation time:

...
nodegroup_name = "{0}-nodegroup".format(challenge_obj.title.replace(" ", "-"))
...

However, this may create an issue if these values were to change in the future.
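
To make the concern concrete, here is a small sketch using the same formatting expression as the snippet above. The function name `derive_nodegroup_name` and the challenge titles are made up for illustration.

```python
def derive_nodegroup_name(title: str) -> str:
    # Same expression as in the snippet above, wrapped in a function.
    return "{0}-nodegroup".format(title.replace(" ", "-"))


# Name used when the nodegroup was created.
created_name = derive_nodegroup_name("My Challenge")

# If the title is edited afterwards, re-deriving the name at deletion time
# no longer matches the resource that actually exists in AWS.
rederived_name = derive_nodegroup_name("My Challenge 2021")

print(created_name)                    # My-Challenge-nodegroup
print(rederived_name)                  # My-Challenge-2021-nodegroup
print(created_name == rederived_name)  # False
```

Storing the name at creation time avoids this mismatch, because deletion then reads back exactly what was created.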

@Ram81 May I take this one?

@ShauryaAg sure, go ahead.

@Ram81 Would it make sense to store eks_role_name, nodegroup_name (and some other values) in the ChallengeEvaluationCluster model as well? We need to refer back to those values in order to delete the resources.

Yes, we can add those to the model.

However, this may create an issue if these values were to change in the future.

These values won't be changed by anyone. They will be populated by an automated task and set to null when we trigger the delete task.

@Ram81 I have made the necessary changes for

  • [x] Deleting the code upload worker running on EvalAI Fargate
  • [x] Deleting the EKS cluster nodegroup
  • [x] Deleting the EKS cluster
  • [x] Deleting the CloudWatch log groups
  • [x] Deleting the VPC and its associated subnets, security group and route table
  • [x] Deleting IAM roles
  • [ ] Add a Django admin action for the challenge model which will trigger resource deletion
  • [x] Create multiple Celery tasks which will handle deletion of each service
  • [x] Trigger the service deletion using boto3 APIs, using the ChallengeEvaluationCluster model to get the resource IDs
  • [ ] Send an email to the challenge host when all services are stopped successfully

But I still need an AWS API key for testing purposes. Is there any way I can get one?

@ShauryaAg You can open the PR, and I can pull it and test it with AWS resources. If you want to test it yourself, you can sign up for a free-tier AWS account to test VPC, subnet and IAM creation and deletion. I don't think you'll get access to spin up EKS clusters, but that should be fine.

@Ram81 I have created a draft PR for the same. Please test it with the AWS API keys.
