I have managed to restart the host on the provider (AWS) but after 20 minutes it is not responding on the ssh port (although it is pingable)
Server now responding to the ssh port. Unfortunately the backend Node.js process appears to be repeatedly crashing and restarting so the service is not yet responsive.
From the backend logs -
error: Forever detected script exited with code: 0
error: Script restart attempt #182
12:08:31 PM - warn: Cannot find the config file: --configFile=/dev/mongodb/credentials/trssConf.json
12:08:32 PM - error: Exception in database query: message=Cannot read property 'collection' of undefined, stack=TypeError: Cannot read property 'collection' of undefined
at new TestResultsDB (/home/jenkins/openjdk-test-tools/TestResultSummaryService/Database.js:257:23)
at EventHandler.processBuild (/home/jenkins/openjdk-test-tools/TestResultSummaryService/EventHandler.js:19:37)
at Timeout._onTimeout (/home/jenkins/openjdk-test-tools/TestResultSummaryService/backend.js:11:13)
at listOnTimeout (internal/timers.js:531:17)
at processTimers (internal/timers.js:475:7)
12:08:32 PM - error: Exception in database query: message=Cannot read property 'collection' of undefined, stack=TypeError: Cannot read property 'collection' of undefined
at new BuildListDB (/home/jenkins/openjdk-test-tools/TestResultSummaryService/Database.js:278:23)
at EventHandler.monitorBuild (/home/jenkins/openjdk-test-tools/TestResultSummaryService/EventHandler.js:57:37)
at Timeout._onTimeout (/home/jenkins/openjdk-test-tools/TestResultSummaryService/backend.js:12:13)
at listOnTimeout (internal/timers.js:531:17)
at processTimers (internal/timers.js:475:7)
(node:3779) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'collection' of undefined
at new AuditLogsDB (/home/jenkins/openjdk-test-tools/TestResultSummaryService/Database.js:285:23)
at EventHandler.processBuild (/home/jenkins/openjdk-test-tools/TestResultSummaryService/EventHandler.js:42:23)
at Timeout._onTimeout (/home/jenkins/openjdk-test-tools/TestResultSummaryService/backend.js:11:13)
at listOnTimeout (internal/timers.js:531:17)
at processTimers (internal/timers.js:475:7)
(node:3779) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:3779) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
(node:3779) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'collection' of undefined
at new AuditLogsDB (/home/jenkins/openjdk-test-tools/TestResultSummaryService/Database.js:285:23)
at EventHandler.monitorBuild (/home/jenkins/openjdk-test-tools/TestResultSummaryService/EventHandler.js:78:23)
at Timeout._onTimeout (/home/jenkins/openjdk-test-tools/TestResultSummaryService/backend.js:12:13)
at listOnTimeout (internal/timers.js:531:17)
at processTimers (internal/timers.js:475:7)
(node:3779) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2)
Machine became unresponsive at 02:29:28 (based on the kernel messages) with an out-of-memory situation.
I am trying to recover the system, but it looks like configuration files etc. may have been stored in the dynamic /dev filesystem: the snippet in the previous comment references --configFile=/dev/mongodb/credentials/trssConf.json, and I can see entries in the shell history for things like mkdir -p /dev/mongodb/data. That slightly worries me, since it suggests this is not a dynamically created area that will get regenerated somehow. (After reboot there is no /dev/mongodb on the machine.)
There is no information on restarting the mongodb service on https://github.com/AdoptOpenJDK/openjdk-test-tools/tree/master/TestResultSummaryService
I have tried starting mongo - it first had a problem with /dev/mongodb/log not existing, then not being owned by the correct user. After resolving that, systemctl start mongod appears to have got it working, but I'm still at a loss as to where the /dev/mongodb/credentials is supposed to come from.
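For next time, the two failures above (log directory missing, then owned by the wrong user) could be caught with a small pre-start check. This is only a sketch, assuming GNU coreutils `stat`; `dir_owned_by` is a made-up helper name, and the account name in the usage comment is an assumption that depends on how MongoDB was packaged.

```shell
#!/bin/sh
# Sketch: verify a directory exists and is owned by the expected user
# before starting a service that writes to it.
# dir_owned_by is an illustrative helper, not part of MongoDB or TRSS.
dir_owned_by() {
    dir=$1
    user=$2
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir" >&2
        return 1
    fi
    owner=$(stat -c %U "$dir")   # GNU stat: %U prints the owner's user name
    if [ "$owner" != "$user" ]; then
        echo "$dir is owned by $owner, expected $user" >&2
        return 1
    fi
}

# Example (the account name "mongod" is an assumption; substitute
# whichever user the mongod unit actually runs as):
#   dir_owned_by /path/to/mongodb/log mongod && systemctl start mongod
```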
Current status: mongodb, TRSSBackend and TRSSFrontend services are showing as active (running) but connections to the server (SSH or the nginx on 443) are not possible. Reason currently unknown (external firewall?) - I was lucky to have been able to get in while connections were allowed.
@llxia Need your input on the /dev/mongodb directory and also likely some doc updates if mongodb has to be started manually separately from the front and back end services.
Looks like all the database stuff had been stored on a ramdrive and is therefore lost and will need to be rebuilt.
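A check along these lines would flag a data directory that sits on a RAM-backed filesystem before a reboot wipes it. It is a rough sketch, assuming a Linux host with `findmnt` from util-linux; the function names are invented for illustration.

```shell
#!/bin/sh
# Sketch: warn when a directory lives on a RAM-backed filesystem.
# Function names are illustrative, not part of TRSS or MongoDB.

# Succeeds if the given filesystem type is RAM-backed (lost on reboot).
is_volatile_fstype() {
    case "$1" in
        tmpfs|devtmpfs|ramfs) return 0 ;;
        *) return 1 ;;
    esac
}

# Prints the filesystem type backing a path (needs util-linux findmnt).
fstype_of() {
    findmnt -n -o FSTYPE --target "$1"
}

# Prints a warning and fails if the path is on a volatile filesystem.
check_persistent() {
    fstype=$(fstype_of "$1") || return 2
    if is_volatile_fstype "$fstype"; then
        echo "WARNING: $1 is on $fstype; contents will not survive a reboot" >&2
        return 1
    fi
    echo "$1 is on $fstype"
}
```

Running `check_persistent` against the MongoDB data directory and the directory holding trssConf.json should be enough to confirm that neither lives under /dev.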
Machine now has 16GB of swap (equal to the amount of RAM) and a 160GB /data partition that we can use for the results database.
AWS moved the IP address on the host. After it rebooted there was still a log entry with the old IP address but it subsequently switched. PR in for inventory change.
https://trss.adoptopenjdk.net address now pointing to the new IP address
MongoDB is now running on a persistent filesystem (The new /data) so we should be back in action ...
@llxia Can you update the documentation to cover the setup of MongoDB and how to restart it etc.?
We use the standard commands to install MongoDB, and restart it with systemctl restart mongod. The only thing is that if a user/password is set for DB access, then TRSS needs to know it (in trssConf.json). In the above case, MongoDB started correctly, but TRSS could not find trssConf.json, so it could not connect to MongoDB.
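In the logs above, the missing config file only produced a warning, after which the backend failed on `collection` of undefined and forever kept restarting it. A pre-start check could make that failure obvious. The following is a minimal sketch; `require_config` is a hypothetical helper, not something TRSS ships.

```shell
#!/bin/sh
# Sketch: refuse to continue when the config file is absent or unreadable.
# require_config is a hypothetical helper, not part of TRSS.
require_config() {
    if [ ! -f "$1" ]; then
        echo "config file not found: $1" >&2
        return 1
    fi
    if [ ! -r "$1" ]; then
        echo "config file not readable: $1" >&2
        return 1
    fi
}

# Example: call it with the same path that is passed to --configFile
# and only restart the service when it succeeds:
#   require_config /path/to/trssConf.json || exit 1
```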
I will update the readme.
Can this be closed now (nice job on the rescue BTW)?
I was holding off until we have the documentation updated with details of what goes into the trssConf.json on the production server so we don't hit so many problems next time (keeping this open stops us from forgetting about it...)
The format of trssConf.json is documented in https://github.com/AdoptOpenJDK/openjdk-test-tools/tree/master/TestResultSummaryService#configure-file
If we need a backup copy of trssConf.json that is used in the production server, we can store it somewhere else. But I do not think we should put user/password in the readme.
Absolutely agree passwords shouldn't be in there (although we can store that elsewhere), but things like the data directory that we've set for mongo should be, along with the other specifics of the production server setup such as the location of the config file. The docs just say that you should provide a --configfile option, but for the production server it's fixed to /data/db/trssConf.json in /etc/init.d/TRSSBackend and TRSSFrontend, so we should state that, as you'd never want to set it anywhere else on the production server.
The docs just say that you should provide a --configfile option, but for the production server it's fixed to /data/db/trssConf.json
This is because we have the forever services created for TRSS. During service creation, we can specify the --configfile option. The forever-service only needs to be created once, at the beginning of the machine configuration, or again when the options change. Maybe we should add this into the playbook?
forever-service install TRSSFrontend -e "NODE_ENV=production" -f " --workingDir /home/jenkins/openjdk-test-tools/TestResultSummaryService" --script /home/jenkins/openjdk-test-tools/TestResultSummaryService/frontend.js -o " --configFile=/data/db/credentials/trssConf.json"
forever-service install TRSSBackend -e "NODE_ENV=production NODE_OPTIONS=--max_old_space_size=4096 " -f " --workingDir /home/jenkins/openjdk-test-tools/TestResultSummaryService " --script /home/jenkins/openjdk-test-tools/TestResultSummaryService/backend.js -o " --configFile=/data/db/credentials/trssConf.json"
Is the information on forever mentioned anywhere else? We should definitely add those two commands into the Deployment Instructions section of the doc.
And yes, I would agree that since we have all of the code to start mongo and nginx in the TRSS playbooks, we should add the backend/frontend service setup there too. @Haroon-Khel Can you look at doing this please?
The information was mentioned in the previous issue and I did a demo/recording a while back with more up to date information.
Just to be clear, the steps should be in the following order:
Correct me if I'm wrong, but I do not think Step 2 can be in the playbook as it contains credentials. If we want to put Step 4 in the playbook (without Step 2), then we should start with an empty trssConf.json file. An admin can then create the user/password in MongoDB and update trssConf.json manually later.
I do not think Step 2 can be in the playbook as it contains credentials
Ansible has various mechanisms to inject credentials into playbooks. For example, there's ansible-vault. And there are variables that can be set when invoking ansible-playbook by using -e.
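As a rough illustration of those two mechanisms, the invocations could look like the sketch below. The playbook path and the variable name `trss_db_password` are hypothetical, and the wrapper functions exist only to keep the example self-contained.

```shell
#!/bin/sh
# Sketch: two ways to supply a database password to a playbook run.
# The playbook name and the variable trss_db_password are hypothetical.

# Option 1: keep secrets in a file encrypted with `ansible-vault encrypt`
# and let ansible-playbook prompt for the vault password at run time.
deploy_with_vault() {
    ansible-playbook "$1" --ask-vault-pass
}

# Option 2: pass the value as an extra variable at invocation time.
# Note: a value given on the command line can end up in shell history
# and in process listings.
deploy_with_extra_var() {
    ansible-playbook "$1" -e "trss_db_password=$2"
}
```

Either way the credentials stay out of the repository, which is the concern raised about putting Step 2 in the playbook.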
Before sinking hours into updating the playbooks, please consider the best approach for https://github.com/AdoptOpenJDK/openjdk-infrastructure/issues/1689.
@aahlenst To be clear my primary goal here is to ensure that what we have in production at the moment is documented along with the other setup instructions before putting time into moving it.