Ckan: NetKAN Bot Re-Architecture

Created on 19 Jun 2019  Â·  58Comments  Â·  Source: KSP-CKAN/CKAN

History

NetKAN started its life as a Perl script in late 2014, followed by a single python/bash script, another Perl script which then inspired NetKAN-bot. Which is a modular Perl application with a bunch of libraries and a bunch of scripts to do things like regular indexing, webhooks and mirroring to archive.org.

Problems

NetKAN-bot has served its purpose since writing it around 4 years ago, however with the growth of the number of mods along with now generating the majority of the metadata via the bot it is really starting to creak under the pressure. The application from a build perspective is reasonably straight forward, the infrastructure is not and makes deploying changes a bit of a pain.

Rationale

One of the things that really drove the development for NetKAN-bot was ensuring the metadata produced complied with our schema. @HebaruSan has done a bunch of excellent work in #2788, meaning we could drop a of logic making for a far more simple indexer. Splitting these up and using a FIFO queue also means we don't have to be so defensive about our git repo getting messed up as we don't need to check outputs or even touch the file system if the payload and file on the disk are identical.

Initial Brain Dump

NetKAN bot will be replaced by a bunch of distinct containers with Specific roles.

Inflator

@DasSkelett / @HebaruSan - How much work do you think would be in this? I'm happy to do the Docker/AWS archicture as I pretty much breathe this stuff at work, C# is not something I do a lot of.

  • Processes SQS queue

    • Incomming Message example

      Payload: ModIdentifier Attr Refresh: True/False (this is to signify a repo pull is required)

    • Outgoing Message example

      Payload: CKAN JSON Output (if success) Attr Identifier: ModIdentifier Attr Success: True/False Attr Error Message: Message/Exception if there is one Attr Check Time: timestamp

Initially there will be 2 incomming and 2 outgoing

Incomming:

  • Webhooks
  • Scheduled

Outgoing:

  • Master
  • Staged

Implementation

Will require some new features added to NetKAN:

  • Ability to run as a foreground process, that watches an SQS queue
  • Read sqs queues, prioritise webhooks
  • Netkan Repo pull changes when refresh is set true on an sqs message, a call out to shell would be fine at a minmum
  • Upon inflation publish output to sqs master, unless it's marked for staging
  • Manage cache aging. Currently it's a sceduled task, but netkan could age out files and when refresh is set it could mark for refresh of the cached download.

Assumptions that can be made

  • As long as it'll run in fg, packaging into a docker container is trivial
  • Cache storage path will be provided
  • SQS Access will not require keys, the instance will be granted access via IAM

Indexer

Takes care of processing the outgoing messages, check if the payload is different, update, commit, push + udpate status.

The idea behind having a separate queue is to avoid the current situation of needing to be real careful of our repo state, if we process them seperately we can manage this much much easier.

It will also update the status, likely a DynamoDB table. Which we can in the interim produce a json file periodically and upload to S3.

Scheduler

Periodically check instance cpu credit levels, submit all NetKANs to the queue for inflation.

WebHook Processing

This may actually be a lambda script behind load balancer or it may reside on the instance (load balancers are kinda expensive for this small use case). But it will take care of receiving inflation requests and creating an SQS Message on the webhook queue and CKAN-meta changes can spawn messages for the Mirroring to take place.

Mirror processing

This will take care of processing the mirror queue and submitting applicable mods to archive.org for preservation.

Summary

It's likely we can achieve this in a few steps and the idea will be to build automated processes for deployment of changes.

  1. Minimal Operation

    • NetKAN Daemon mode with Git/SQS Processing

    • Indexer

    • Scheduler

  2. Webhooks

    • They can quite happily run as they are until we've built something to process them

  3. Mirroring

    • Along with the webhooks, they'll happily run as they are for now

Extra

  • Consider how we want to impelment repo splitting
  • Consider how to backfill x releases
Architecture Discussion needed Infrastructure

Most helpful comment

A quick update, as I haven't been idle. But I am getting smashed at work, we're doing a pretty big backend re-architecture, so that's absorbing a lot of cycles!

However, I have a full integration test done. I can fire a netkan at the Incomming queue, the inflator does its thing, throws it at the outgoing queue, the index picks it up processes it, stages it if necessary and updates dynamodb with the current status.

I've also got the initial docker file and compose file done for dev/production building, there are only some minor tweaks to clean up some things there.

I have had a brief look at the C#, whilst I'm pretty sure I could figure it out; I would highly appreciate If someone could knock off these things for netkan, I'll submit a PR with a dockerfile and container build process after:

  • Config via Environment Variables:

    • NETKAN_QUEUES

    • GITHUB_TOKEN

    • NETKAN_CACHE

  • x_netkan_staging_reason - If this can be added as an attribute on the outgoing message that'd be super! I noticed it missing when redoing the PR code and it was one of @HebaruSan 's earlier PRs to the bots branch :slightly_smiling_face:

Keep being awesome everyone :heart:

All 58 comments

How much work do you think would be in this? I'm happy to do the Docker/AWS archicture as I pretty much breathe this stuff at work, C# is not something I do a lot of.

I have not heard of SQS before, so for me this project would start with a lot of learning.

Fortunately @HebaruSan, SQS is pretty straight forward. You could leave "get_messages" "send_message" stubs and I'd almost be able to add the relevant code to do the things.

Can we use this re-architecting opportunity to make some headway on #1155? The current bot's "statelessness" effectively prevents us from tracking a release date property, which has been a pretty common enhancement request. If we're going to change the structure of the bot and the interface between the infrastructure and netkan.exe, we might as well do it once instead of twice.

One question:

Initially there will be 2 incomming and 2 outgoing [SQS queues]

Incomming:

* Webhooks

* Scheduled

Outgoing:

* Master

* Staged

Why two outgoing queues? Aren't they both processed by the indexer?
Wouldn't an Attr Staged: True/False or something suffice?

It will also update the status, likely a DynamoDB table. Which we can in the interim produce a json file periodically and upload to S3.

Dare you breaking my indexing alert script, took a long time to get it working... :)

Can we use this re-architecting opportunity to make some headway on #1155? The current bot's "statelessness" effectively prevents us from tracking a release date property, which has been a pretty common enhancement request.

Thinking on this, we already technically have the information in the status.json. Maybe we want to periodically generate an updated.json ala download counts? The statelessness is a really good and reliable thing, we can guarantee whatever is produced by the binary you download, will be produced by the bot. Though as we're decoupling, we could think about putting that into the metadata (but I don't think we should put that in the first pass).

If we're going to change the structure of the bot and the interface between the infrastructure and netkan.exe, we might as well do it once instead of twice.

Actually I don't necessarily agree here. It's an ambitious task to replace and one of the things that is a problem right now is that it's really hard to iterate and push changes into production. Utilising queues and decoupling the processes means we don't have to care about our state, we can just recycle our containers and they'll pickup where they left off.

My aim is to make it so when someone has a great idea and builds a change, pushing it out is very easy, along with rolling it back being also very easy. Or if we need to rebuild the environment, how it is built isn't locked up inside my head.

Why two outgoing queues? Aren't they both processed by the indexer?
Wouldn't an Attr Staged: True/False or something suffice?

Excellent question.

Incomming
It's not super necessary (and may not be at all necessary if I figure out if it's possible), but I intend on using SQS FIFO, it has built in dedupe and a bunch of other reliability features. However I couldn't see a way to force an item to the front and I really want a webhook job to happen in near real time, so basically the indexer will check the webhook queue first before checking the scheduled queue.

Outgoing
I want to be able to process staged commits after processing all the commits to master. I may even do this with a separate indexing container. Basically to reduce all the branch changing logic I've had to build into the current bot. It's a really complex task mid index run and likely responsible for some of the tree corruption that required writing a script to periodically fix.

Dare you breaking my indexing alert script, took a long time to get it working... :)

I wouldn't dare break such a thing! I promise there will be an endpoint to replace it, but initially we'll just produce the same thing. The infrastructure to do more than a json file on S3 costs money and one of my aims is to make the resources we use more efficient. Moving to SQS + making our scheduling smarter we can run as frequently we have the spare compute resources for, instead of carefully balancing and mostly failing. If we get a bunch of funding to "Make this run faster please", it'll be a case of opening up the taps and it will run faster without any code changes.

If someone would like to have a crack at the C# stuff, I'll create a couple of queues and generate API keys with access to them.

The default strategy for the SDK is to cycle through the different auth methods, so if you set the right environment variables it will pick up those automatically and in production it will also do the right things. So you can pretty much focus on just getting the SDK working and adding the SQS logic and the auth will "just work".

Utilising queues and decoupling the processes means we don't have to care about our state, we can just recycle our containers and they'll pickup where they left off.

Question about this, the "Inflator" will need a persistent download cache to maintain anything like the rates of processing we enjoy today. Would it have that?

Thinking on this, we already technically have the information in the status.json.

Or in git; here are all the times this .ckan has been modified:

$ git log --format='%aI' -- Astrogator/Astrogator-v0.9.2.ckan
2019-06-13T20:03:50+00:00
2019-05-30T04:03:37+00:00

(Except that the bot currently does a shallow checkout of CKAN-meta; we'd need to do a full checkout to be able to retrieve the original release date.)

Question about this, the "Inflator" will need a persistent download cache to maintain anything like the rates of processing we enjoy today. Would it have that?

Yep. I intend to create a shared cache on the local fs that the the inflator, archiver etc can access. Managing cross container ephemeral data is something I already do a lot in my day job, so it's a pretty easy thing to solve. Netkan can already take a cache path, pretend that path will exist and point to the right place. We can leave cache management out of Netkan for now and keep with a scheduled task to clean it up (super trivial thing to write).

If we ever want to have multiple instances processing this data, then we'll have multiple caches, but I don't see that as a big problem. We could also use something like EFS, for such a small amount of data it wouldn't be expensive and we're not expecting ludicrously high I/O, so performance wouldn't be an issue. Either way, problem for the future.

Or in git; here are all the times this .ckan has been modified

That's a neat idea. I think we should populating historical data as a separate exercise as it'll be a one off (like how we back filled all the hashes then uploaded compatible mods to the archive). I think we could add an indexed field to the spec and resulting metadata. As we're processing messages from a queue instead of spitting things out to a FS, checking git status and committing if there are changes there is scope to be a little smarter with our comparison. It does mean that the bot is no longer producing 100% the same data every time.

Either way, with a newer more flexible architecture I'm comfortable moving that to a future us problem. We can definitely focus down to a design for that specific feature and come up with something that is reliable and achieves the stated goal.

If someone would like to have a crack at the C# stuff, I'll create a couple of queues and generate API keys with access to them.

You could leave "get_messages" "send_message" stubs and I'd almost be able to add the relevant code to do the things.

I've started a branch built on top of #2788 to refactor Netkan to allow driving it with a queue. The messaging stubs are here:

https://github.com/HebaruSan/CKAN/blob/feature/netkan-sqs/Netkan/Processors/QueueHandler.cs

OMG You are amazing! I'm setting a reminder to look at this on Sunday when I have some time.

Just a brief comment, one of the hugely awesome things about the SDK is authentication resolution is automatic. You can see the constructor here and what the resolution order is here. It's very unlikely we'd want to hard code a set of credentials deep into the application. What it means is for testing you can use an api key via environment variables or .aws/credentials file and in production we can allow the instance access via the IAM Instance Profile without changing anything in the application.

So... are you saying to use the constructor without the key parameters? I'm still completely guessing how SQS works, mostly going by sample programs as I can find them.

Yup, looking at the list of examples on the first link it covers that scenario. It's a pretty common pattern for AWS.

Does the first link go where you meant it to? I get the root node of a huge "AWS SDK for .NET Version 3 API Reference" list which doesn't have any authentication stuff noticeably on it.
https://docs.aws.amazon.com/sdkfornet/v3/apidocs/Index.html

EDIT: Possibly resolved this...

What about the queue URL? Is that a constant, or built into the same config file, or do we have to retrieve it at run time based on the queue name?

I'm going to look it up at startup based on the name. Partly because the above brain dump already lists out what the names are likely to be.

EDIT: Think I found the answers to these...

Does ReceiveMessage remove the messages from the queue, or will they still be there until deleted explicitly with DeleteMessage?

Sounds like they stay till deleted.

And now that I'm thinking about it, how would multiple clients coordinate to distribute messages among themselves?

I think this is what VisibilityTimeout is for, you can specify a number of seconds for the returned messages to be hidden, after which they reappear.

Design question:
The format of the incoming message:

Payload: ModIdentifier
Attr Refresh: True/False (this is to signify a repo pull is required)

... implies that the scheduler must have a checked out copy of the NetKAN repo (because it must know all the current mod identifiers). But also that the inflator must also have a copy of the same NetKAN repo (because only the identifier is sent, so it must get the metadata from outside the message, and because it can be told to do a pull).

This seems duplicative. What do you think of the idea of the scheduler sending the JSON from the .netkan file as the body of the message? Then the infrastructure would only need one copy of the NetKAN repo (in the scheduler), and a signaling mechanism for repo pulls would no longer be needed (because the scheduler could send the right version of the JSON if it knows a pull is needed).

Does the first link go where you meant it to? I get the root node of a huge "AWS SDK for .NET Version 3 API Reference" list which doesn't have any authentication stuff noticeably on it.
https://docs.aws.amazon.com/sdkfornet/v3/apidocs/Index.html

Gah, silly page doesn't put the links in the top and I was rushing it out of my brain! https://docs.aws.amazon.com/sdkfornet/v3/apidocs/items/SQS/TSQSClient.html

We can use environment variables for everything, so in theory the client constructor doesn't require anything passed to it (this is how I do it in python all the time and the C# _looks_ the same, but YMMV).

These are the keys generally all that's needed

AWS_ACCESS_KEY_ID – Specifies an AWS access key associated with an IAM user or role.
AWS_SECRET_ACCESS_KEY – Specifies the secret key associated with the access key. This is essentially the "password" for the access key.
AWS_DEFAULT_REGION – Specifies the AWS Region to send the request to.

The region we use for everything is: us-west-2

I'm going to look it up at startup based on the name. Partly because the above brain dump already lists out what the names are likely to be.

The queue urls should be consistent and I will provide them to the container likely via an environment variable (this information is _very_ easy to pull from the queue and pass to the container configuration)
https://sqs.us-east-1.amazonaws.com/80398EXAMPLE/MyTestQueue

One thing I didn't consider is that every time we Check, Add, Delete or Send to a queue it is considered a message, which adds to our total messages sent. So we may want to only have an inbound/outbound queue as every time we check one it'll use a message (they're inexpensive and we will be able to use a smaller instance so we will save money overall), from a first pass standpoint it might be acceptable to not over complicate the queues too much. We can use long polling to reduce the amount of unnecessary messages we consume checking for new ones.

This seems duplicative. What do you think of the idea of the scheduler sending the JSON from the .netkan file as the body of the message? Then the infrastructure would only need one copy of the NetKAN repo (in the scheduler), and a signaling mechanism for repo pulls would no longer be needed (because the scheduler could send the right version of the JSON if it knows a pull is needed).

That's actually a really good idea and it means we don't have to worry about git operations inside of netkan. Leaving the cache management outside it as well means the only thing we're adding is SQS and the rest of the operations already exist. Wrap the inflate in a try/catch on success send the resulting ckan, send the details as the payload instead. Set a success attribute with True/False as applicaable for each scenario.

Tomorrow (in about 18 hours or so) I plan to generate some queues. I will start by finishing my template for creating the dev queues + iam permissions to be able to send/receive from them. After I'll hack together some python and test the interactions.

That's actually a really good idea and it means we don't have to worry about git operations inside of netkan.

Cool, I will update my prototype code to receive the JSON in the body and not look for a Refresh attribute. I have been pushing updates here as I learn more about SQS:

https://github.com/HebaruSan/CKAN/blob/feature/netkan-sqs/Netkan/Processors/QueueHandler.cs

So I've spent a little more time reading the documentation and doing some testing, we have some capacity to be a little smart how we handle messages. Now as far as webhooks go, we've never really guaranteed them to be instant, but they are currently instant. It might be worth running two instances of the inflator, one to process the webhooks and one for the scheduler. I also think a singular queue for the inflated metadata and setting a 'staged' attribute makes more sense. It was hard before because of the monolithic nature of the indexer before, which not something we need to be concerned about now.

|SQS Action|Messages|
|---|---|
|Scheduler Polling (long poll 20 seconds)|131400|
|Webhook Polling (long poll 20 seconds)|131400|
|Inflated Polling (long poll 20 seconds)|131400|
|Index Scheduled (2000 netkans, 10/batch send, bi hourly)|73000|
|Scheduled Processed (2000 netkans, 1/batch send, bi hourly)|730000|
|Scheduled Deleted (2000 netkans, 1/batch send, bi hourly)|730000|
|Inflated Scheduled (2000 netkans, 1/batch send, bi hourly)|730000|
|Processed (2000 netkans, 1/batch receive, bi hourly)|730000|
|Scheduled Deleted (2000 netkans, 1/batch, bi hourly)|730000|
|Total|4117200|

If we don't do batch deletes we'll end up around the 5 million message mark including webooks (which is only a couple of dollars), if we process everything in the max batch size of 10 we can almost do all our processing within the free tier of a million requests per month.

Either way we'll be in a much better place and can iterate once the basics are working.

I also whipped up a quick python3 script for submitting NetKANs to a queue

https://gist.github.com/techman83/0604f4dc9f849aac605e46f494de9403

It requires boto3, but the rest are part of stdlib

The 2 queues I created for testing:

  • InboundDev.fifo
  • OutboundDev.fifo

I thought I read somewhere that FIFOs are more expensive than regular queues that don't guarantee the ordering. Could that be another cost-saving opportunity? While an orderly alphabetical processing sequence would be nice, I don't think it would be mission critical.

I'm doing this:

export AWS_DEFAULT_REGION='us-west-2'

But my test command still says:

9879 [1] FATAL CKAN.NetKAN.Program (null) - No RegionEndpoint or ServiceURL configured
9880 [1] FATAL CKAN.NetKAN.Program (null) -   at Amazon.Runtime.ClientConfig.Validate () [0x0001f] in <3f355c0f859241b59b26361b00881687>:0 
  at Amazon.Runtime.AmazonServiceClient..ctor (Amazon.Runtime.AWSCredentials credentials, Amazon.Runtime.ClientConfig config) [0x00037] in <3f355c0f859241b59b26361b00881687>:0 
  at Amazon.SQS.AmazonSQSClient..ctor () [0x0000b] in <3f355c0f859241b59b26361b00881687>:0 
  at CKAN.NetKAN.Processors.QueueHandler..ctor (System.String cacheDir, System.Boolean overwriteCache, System.String githubToken, System.Boolean prerelease) [0x00028] in <3f355c0f859241b59b26361b00881687>:0 
  at CKAN.NetKAN.Program.Main (System.String[] args) [0x00079] in <3f355c0f859241b59b26361b00881687>:0 

Is a region end point different from a region? I'll try hard coding it in C# for now, but obviously it would be best to get the environment variables working...

Deleted my previous comment; it's working now, I can retrieve the queue URLs.

Design suggestion, propose adding another attribute to the Incoming messages:

Attr Releases: Integer (number of releases to inflate, default 1)

This could be used to allow the downstream CKAN-meta committer container to re-request more releases when a module has changed, to catch batch-uploaded backport releases.

OK, I think it's working now. There should be some sample output messages in OutboundDev.fifo.

I thought I read somewhere that FIFOs are more expensive than regular queues that don't guarantee the ordering. Could that be another cost-saving opportunity? While an orderly alphabetical processing sequence would be nice, I don't think it would be mission critical.

They are, about 10 cents per million requests more than standard queues. The ordering isn't actually what lead me to picking them was the delivery guarantees and deduplication. Means we don't have to think about reliability of our messaging and just act on what's received. Which at this small scale saves a signification amount of effort.

Design suggestion, propose adding another attribute to the Incoming messages:

Attr Releases: Integer (number of releases to inflate, default 1)

This could be used to allow the downstream CKAN-meta committer container to re-request more releases when a module has changed, to catch batch-uploaded backport releases.

Sounds great! How we trigger that can be a future thing to ponder.

I've started the dev work for the indexer etc. Just in a repo on my GH until I figure out how I want to do it.

https://github.com/techman83/NetKAN-Infra/tree/initial_dev

Re-opened as netkan is only part of the picture :)

Proof of concept for populating a release_date property in Python:

https://gist.github.com/HebaruSan/665599c5c8fb9733c23c8aa64f29bc48

It takes the target filename as a parameter and the latest contents on standard input. Prints the latest contents plus a release_date property to standard output.

If the file exists and has a release_date already, the existing value is pulled forward into the new version. If it exists and doesn't have a release_date, we get the file creation date from git. If it doesn't exist, use now as the release_date. Assumes it's running in the root of CKAN-meta with a full clone (all history available).

Neat! I reckon we should open a few feature enhancements separate to the re-architecture. The task is relatively ambitious already and we'll be in a much better place to make improvements going forward.

We also would stepping away from NetKAN always able to produce exactly what is in our metadata, so I reckon we should thrash the idea out a little with space to do so in a targeted way. That isn't to suggest I think its a bad idea, I just want to defer thinking about it for now :slightly_smiling_face:

Yeah, I checked your NetKAN-Infra repo for a spot to integrate this, but I didn't find one, so I thought I would share it while things are still in the early stages in case there are design decisions coming up that might determine how easy it is.

I had a long term to-do item to attempt this in Perl, which probably still isn't impossible, but this rewrite project motivated me to finally clear up the murky details.

EDIT: Actually, I think it's not quite possible in Perl, because the "old bot" overwrites the old file before we know what the filename will be. The new framework specifically gives us the opportunity to interject some logic between the inflation and the writing of the file, which makes this possible.

We might raise a specific thing in the code project relating specifically to the code. But a lot of the decisions around this aren't technical and impact Client, Spec, Mission, along with Indexing.

Definitely appreciate sharing, a good proof of concept can help drive decisions. The architecture of the Perl NetKAN-bot would make anything a real hack to implement. Not so much Perl being the problem, rather it was constructed to be very careful about Metadata reliability, as NetKAN reliability wasn't what it is today.

More food for thought: In theory at least, crawling SpaceDock modules without a $vref property is unnecessary, because we'll be notified of changes in those modules by the web hook. I count 789 such modules; excluding them would significantly cut down the bot's work load. (339 SpaceDock modules with a $vref would remain.) Something to consider for the Scheduler component.

(The presence or absence of a $vref matters because the URL property of a .version file allows the mod author to specify a remote location for version info that the web hook does not know about. The bot needs to react when authors make changes to those files.)

Neat. I've always wanted to get smarter with the scheduled indexing. We can probably include stuff like that.

Full pass daily, more frequent low impact checks (HEAD download urls, github release info, ignore spacedock). Much much easier to do with the scheduling component when we get there :slightly_smiling_face:

Another possibly feasible wish-list item: Give us an easy way to purge a specific file from the bot's cache! This would allow us to resolve issues like KSP-CKAN/NetKAN#7270 while "remov[ing] [you] from the process as much as possible" :wink: .

I have some ideas :stuck_out_tongue:

Smarter handling of epochs would be great as well. After we inflate a module, the committer component could check whether it has a higher version number than the existing releases and if not, either raise an error or auto-increment the epoch and try again. Maybe add a new netkan property x_netkan_allow_out_of_order to suppress this for repositories that are known to do backports.

This is one of the big sticking points to any effort to make the bot "set it and forget it".

EDIT: Alternate strategy to accomplish the same thing:
The scheduler component could scan the existing versions of a module, find the highest one, and send it to the inflator in a new HighestVersion attribute. Then the inflator would inflate as normal, and if the generated version is less than HighestVersion, increment the epoch itself and try again immediately.

auto-increment the epoch and try again.

This should be saved in the .netkan and commited to the netkan repo too. Else the netkan stays as is and the bot auto-increments a mod possibly multiple times every run (which will get resource heavy), without anyone of us ever knowing of it.
Alternatively, don't commit it, but stage it as a PR. the only thing we would have to do is check it and merge it.

If we don't do it via PRs, we should add another way for the bot to inform us.
An alert infrastructure probably could be used for other stuff too, like (new) inflation errors. I'm currently running a hand-crafted Python script on my PC, pulling the status.json every 10 minutes and comparing the errors with those of the previous run, and if there are any, it triggers a notification.
It works, but a different way (I would prefer via e-mail) would also be nice.

(which will get resource heavy)

Not necessarily; since only the version property would need to change, we don't have to re-do the entire inflation. We could do something as simple as this:

        private static string incrementVersionEpoch(string version)
        {
            string upstreamVersion = version;
            int epoch = 0;
            if (version.Contains(":"))
            {
                string[] pieces = version.Split(":".ToCharArray(), 2);
                epoch = int.Parse(pieces[0]);
                upstreamVersion = pieces[1];
            }
            return $"{epoch+1}:{upstreamVersion}";
        }

        private static string ensureVersionHigherThan(string version, string minLimit)
        {
            ModuleVersion minVersion = new ModuleVersion(minLimit);
            ModuleVersion modVersion = new ModuleVersion(version);
            while (modVersion < minVersion)
            {
                version = incrementVersionEpoch(version);
                modVersion = new ModuleVersion(version);
            }
            return version;
        }

The approximate cost per increment would be one string split, one string concatenation, and parsing one string into a ModuleVersion.

These are all wonderful ideas and I very much don't want to squash them. But I'm also conscious that a re-architecture is a reasonable undertaking. I think we should be very liberal in opening new feature issues for individual things, so we can discuss and flesh them out.

One of the goals for this issue is to make it so we're no longer stuck. Our current code, whilst for the most part pretty robust (always dying instead of something far worse #979), making and deploying changes is hard. The new architecture will be for the most part, automatic, so if we have good ideas getting them out the door will be not much more than a merge away.

I've re-written the main commit logic, have some solid ideas for the branching/staging and need to investigate producing our status. @HebaruSan the work on queuing is beautiful, I'm super excited to get this into production as watching NetKANs go into the queue, go into the other queue, get processed by the indexer is awesome!

An alert infrastructure probably could be used for other stuff too, like (new) inflation errors. I'm currently running a hand-crafted Python script on my PC, pulling the status.json every 10 minutes and comparing the errors with those of the previous run, and if there are any, it triggers a notification.
It works, but a different way (I would prefer via e-mail) would also be nice.

This is pretty doable. I have some ideas for after the basics have been done. Prior to those are some billing triggers, so I can be alerted and worst case services can be shut off if limits are reached. It's reasonably safe right now as the instances run out of credits and that's it. But with SQS/DynamoDB there is scope for something to spin out of control unnoticed, which isn't ideal.

I'm across the basics with DynamoDB, handling status was one of the last things to check off the list. It should do what we need and be able to continue producing the status files as is for now.

(which will get resource heavy)

Not necessarily; since only the version property would need to change, we don't have to re-do the entire inflation. We could do something as simple as this:

[...]

The approximate cost per increment would be one string split, one string concatenation, and parsing one string into a ModuleVersion.

But this still leaves the problem that we will have outdated netkan data in the NetKAN repository, doesn't it?

But this still leaves the problem that we will have outdated netkan data in the NetKAN repository, doesn't it?

Actually I think this method could replace manual management of epochs and allow us to retire the x_netkan_epoch property. There would no longer be a need for it.

EDIT: I'll start a branch/PR for this. Let's redirect this convo over there to avoid cluttering this issue.

Oh wow. I could not for the life of me figure out how the web hook translated a SpaceDock mod number to a netkan identifier. Turns out SpaceDock does that!!

https://github.com/KSP-SpaceDock/SpaceDock-Notify/blob/e7c2881d01948b5fc98e06de04df59af3763a23d/app.py#L31-L47

This is maybe something we could take more ownership of. Having another system pull and parse our metadata and search it is a bit of abstraction leakage. If there will be new web hook code, maybe there should be a /update/spacedock/<number> route.

Wow indeed. That is rough! I think initially, I'll bring across the current webhooks. As I'm conscious about not dragging this out too much. But I absolutely think we can do better.

Current state of play:

  • Indexing logic re-implemented in python (full integration test hasn't been done, but all the logic has tests and does what it should be doing.)
  • Inflator testing (It works super nicely)
  • Cloudformation pieces mostly complete (at least the harder bits, containers/instances/clusters I'm well across)

Things left to do

  • Add ability for environment variables to populate commandline defaults (I've had a brief look, but requires more than a coffee breaks worth of time to wrap my head around C#)

    • NETKAN_QUEUES

    • GITHUB_TOKEN

    • NETKAN_CACHE

  • Have travis build a docker file when building netkan
  • Finish the container builds for the infrastructure
  • Write the scheduler
  • Finish the cloudformation templates
  • Write a process to export status -> status.json for netkan-status

It probably seems like a lot, but most of the effort has been rewriting the indexing logic. Which is much much simpler now. I'll get that tested and knock out the rest of the bits and pieces.

Am looking forward to replacing the old infrastructure. It'll make it much easier for us to make iterative improvements!

A quick update, as I haven't been idle. But I am getting smashed at work, we're doing a pretty big backend re-architecture, so that's absorbing a lot of cycles!

However, I have a full integration test done. I can fire a netkan at the Incomming queue, the inflator does its thing, throws it at the outgoing queue, the index picks it up processes it, stages it if necessary and updates dynamodb with the current status.

I've also got the initial docker file and compose file done for dev/production building, there are only some minor tweaks to clean up some things there.

I have had a brief look at the C#, whilst I'm pretty sure I could figure it out; I would highly appreciate If someone could knock off these things for netkan, I'll submit a PR with a dockerfile and container build process after:

  • Config via Environment Variables:

    • NETKAN_QUEUES

    • GITHUB_TOKEN

    • NETKAN_CACHE

  • x_netkan_staging_reason - If this can be added as an attribute on the outgoing message that'd be super! I noticed it missing when redoing the PR code and it was one of @HebaruSan 's earlier PRs to the bots branch :slightly_smiling_face:

Keep being awesome everyone :heart:

Config via Environment Variables:

Can you not specify these in the command line? I was assuming this would look like:

netkan.exe --queues $QUEUES --github-token $TOKEN --cachedir $DIR

I will take care of x_netkan_staging_reason now.

Can you not specify these in the command line? I was assuming this would look like:

That involves invoking a shell. Docker runs the command directly. Now, there are a myriad of ways to achieve it, but the application consuming environment variables is the cleanest way.

It would mean that you could launch netkan in queue mode with a docker run command using the command line switches.

I'll sort something in the interim.

I did! Running as a plain command runs it in a shell.

FROM mono:5.20.1
RUN useradd -ms /bin/bash netkan
USER netkan
WORKDIR /home/netkan
ADD netkan.exe .
ENTRYPOINT /usr/bin/mono netkan.exe --queues $QUEUES \
  --github-token $GH_Token --cachedir ckan_cache -v  

Needs some automation work, but we have a docker hub organisation!

https://hub.docker.com/r/kspckan/inflator

Don't worry @hebarusan it'll all become apparent what docker is doing to really speed up our deployment to production :)

I also found out AWS SQS has released virtual queues! Certainly not something we need to worry about right now, but it's something I really desired when I was thinking about when how to handle Webhooks.

I'll be moving the new repo under the organisation account in the next couple of days and getting Travis sorted to build and submit the containers.

The SpaceDock folks were kind enough to teach me enough docker to be dangerous while I was looking at that PR, and I definitely see the potential of this now.

When do we start writing the scheduler? :grin:

Haha, I know right! It has already begun! KSP-CKAN/NetKAN-Infra@2e6f2909af838e488d182fa464719dff2c6913f7

I don't expect it to take too long to write. I'm gunna keep it pretty simple to begin with and then we can add some smarts later.

@DasSkelett you'll be happy to know status will be taken care of via KSP-CKAN/NetKAN-Infra@18e3efa370aad1e9372203f98647fc49805fd724 and KSP-CKAN/NetKAN-Infra@f6f259ce5e38adce269842c868d884a7e06a9e4d

I'd love to replace this at some point, but frontend stuff isn't my forte. I know at work we're doing some super cool stuff with GraphQL + React. Essentially if a query doesn't change it is cached until such time it gets invalidated, essentially meaning out new frontend rarely needs to hit the backend. Which would be really nifty for exposing the DynamoDB stuff without needing to scale up the read credits.

It. Is. Done!

Also we have our first commit from the new infrastructure!

https://github.com/KSP-CKAN/CKAN-meta/commit/c8ae9226ed67cdf13df6f9b54d9bf9b571e8e1e7

.. I need to fix the bots gitconfig it would seem :wink:

Was this page helpful?
0 / 5 - 0 ratings

Related issues

Zuthal picture Zuthal  Â·  5Comments

mynameis1234 picture mynameis1234  Â·  7Comments

DasSkelett picture DasSkelett  Â·  5Comments

ihsoft picture ihsoft  Â·  5Comments

DasSkelett picture DasSkelett  Â·  4Comments