I want to be able to connect to our institution's cluster using VS Code Remote SSH with the server running on a compute node instead of the login node. The preferred workflow is to SSH into the login node, then use a command to allocate a job and spin up an interactive shell on a compute node, and run any further tasks from there. VS Code Remote SSH doesn't appear to have a feature that facilitates this workflow. I want to be able to inject the spin-up command immediately after SSH'ing into the cluster, but before the VS Code server is set up on the cluster, and before any other tasks are run.
I managed to modify the extension.js file in the following way:
CTRL+F -> "bash"
Change the string literal "bash" to "bash -c \"MY_COMMAND bash\""
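To show the shape of that change, here is a minimal local sketch of the wrapping idea. `my_launcher` is a hypothetical stand-in for `srun` with resource flags; nothing here touches a real cluster, it only demonstrates that replacing the bootstrap string makes the server setup run inside whatever wraps the inner `bash`:

```shell
#!/bin/sh
# my_launcher stands in for "srun <resource flags>" on a real cluster.
# VS Code normally starts its bootstrap via plain `bash`; rewriting the
# string literal makes that bootstrap run inside the wrapping command.
my_launcher() { echo "allocating job..."; "$@"; }
my_launcher bash -c 'echo "server bootstrap would run here"'
```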
I've confirmed that this correctly starts the VS Code Remote SSH server on a compute node. Now I am running into a port-forwarding issue, possibly related to issue #92. Our compute nodes have the ports used by VS Code Remote SSH disabled, so there isn't an easy way around this issue.
Thanks for the hard work on this so far! This extension has extraordinary potential. Being able to run and modify a Jupyter notebook remotely on our cluster, while using intellisense and gitlens, AND conda environment detection and dynamic swapping, all in a single application for FREE is incredible.
Our compute nodes have the ports used by VS Code Remote SSH disabled, so there isn't an easy way around this issue.
Do you mean that port forwarding for ssh is disabled on that server? Or are you able to forward some other port over an ssh connection to that server?
Port forwarding for SSH is not disabled on any part of our cluster. I am not intentionally attempting to forward any other ports to the server. I was using remote.SSH.enableDynamicForwarding and remote.SSH.useLocalServer. Your questions gave me the idea to disable those options. I can't determine whether that has helped, because my earlier assertion was incorrect: I can't get the server to run on a compute node.
To address that issue, and to clarify our workflow some, we are using Slurm. It is highly preferred to have tasks running within a job context so that login node resources aren't consumed. To do that, we create a job using srun (or one of its siblings) with appropriate resource request parameters. Any commands we want to run are provided as the final argument to srun. All calls to srun must include a command, apparently because it uses execve() to invoke them; if no command is passed, srun fails with an error message. With that in mind, setting up the VS Code server on the remote would have to be funneled through a call to srun. Any other method of invocation (such as bash -c) results in commands being run outside the job context, and thus on the login node. Naively modifying the bash invocation does not work, apparently because srun never receives any arguments. It isn't clear to me how the server installer gets invoked and set up, so I can't offer any suggestions.
As a side note, it is also possible to pass the arguments --pty bash to srun to get a terminal within the job context on a node allocated for that job. Looking at #1671, specifically here, it seems like it should be possible to adjust the invocation of bash -ilc to do additional things (found by ctrl+F). I've tried testing this, but as far as I can tell that code is never called (I used echo for debugging).
What code do you mean by "that code"? I don't think the issue you point to is related.
We run the installer script essentially like echo <installer script here> | ssh hostname bash. There is an old feature request to be able to run a custom script before running the installer. I am not sure whether that would help you here, is there a way with Slurm to run a command, then have the rest of the same script run in a job context?
It sounds more like you need a way to wrap the full installer script in a custom command, like srun "<installer script here>". Is that right?
Yes to your last question, ideally with the ability to customize the wrapping command.
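A local sketch of that piped-installer wrapping (fake_srun is a hypothetical stand-in for `srun` plus flags; locally it just runs its argument so the sketch works without a cluster):

```shell
#!/bin/sh
# fake_srun stands in for `srun <flags>` on a real cluster; here it
# simply runs the given command in the current shell.
fake_srun() { "$@"; }
installer='echo "installer script running"'
# Today the script is piped straight into bash over ssh, roughly
#   echo "$installer" | ssh host bash
# The wrapped variant funnels the same stdin through the job launcher:
echo "$installer" | fake_srun bash
```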
This would be an important feature for vscode-remote. I am currently trying to use vscode to run some interactive python code on a shared cluster, and the only way of doing it is by using the srun command of slurm. I'll try to find a workaround, but I think there really is a use case for this feature request.
I've got the same issue, but with using LSF instead of SLURM.
As @roblourens points out here: https://github.com/microsoft/vscode-remote-release/issues/1829#issuecomment-553525298
just running the install script and starting the server only solves half the problem. Once the server is started, I surmise that VS Code will still try SSHing directly into the desired (login-restricted) machine to discover which port the VS remote server picked, as well as to start the new terminals that show up in the GUI.
Basically, the only way this can work is if all subprocesses for servers and user terminals are strictly forked children from the original seed shell acquired from LSF/SLURM/whatever job manager you are using. A hacky workaround may be to use something like Paramiko to start a mini-SSH server from the seed shell and then login to this mini server directly from VS Code (assuming there isn't a firewall blocking you, but obviously reverse SSH tunnels can be used to get around that).
Another possible resolution to this issue is by enabling a direct connection to the remote server.
That is, the user would:
- Launch vscode-server on a remote (possibly login-restricted) host.
- Enter the remote server address and port in vscode, and connect to it.
That way, no ssh is required at all and it can work on login-restricted hosts.
A slight variant on this: I would like to be able to get the target address for SSH from a script (think cat'ing a file that is semi-frequently updated with the address of a dynamic resource). Currently I am using a ProxyCommand configured in sshconfig, but that has the disadvantage of requiring a second process.
I want to be able to connect to our institution's cluster using VS Code Remote SSH with the server running on a compute node instead of the login node. The preferred workflow is to SSH into the login node, then use a command to allocate a job and spin up an interactive shell on a compute node, and run any further tasks from there. VS Code Remote SSH doesn't appear to have a feature that facilitates this workflow. I want to be able to inject the spin-up command immediately after SSH'ing into the cluster, but before the VS Code server is set up on the cluster, and before any other tasks are run.
@wwarriner Is the issue you are referring to the same one as the one on this stack overflow SO question?
It sounds like we are having a similar problem: when I spin up an interactive job and try to run my debugger, it doesn't work because VS Code goes back to the head node and tries to run things there.
The problem is more serious than I thought. Not only can I not run the debugger in the interactive session, I can't even "Run Without Debugging" without it switching to the Python Debug Console on its own. That means I have to run things manually with python main.py, but that won't let me use the variable pane... which is a big loss! (I was already willing to give up breakpoints by using pdb, which I wasn't a big fan of, but OK, fine while things get fixed...)
What I am doing is switching my terminal to the conoder_ssh_to_job one and then clicking the Run Without Debugging button (or ^F5 or Control + fn + F5), and although I made sure to be on the interactive session at the bottom in my integrated window, it goes by itself to the Python Debugger window/pane, which is not connected to the interactive session I requested from my cluster...
Am I reading this right that currently the only way to have the language server run on a compute node rather than the head/login node is to modify extension.js? Or is there a different preferred solution? I'm also getting weird port conflicts when I modify extension.js.
(I'm also using slurm and the python language server eating up 300GB on the head node disrupts the whole department).
I'm curious if this is on the roadmap for the near future. With my university going entirely remote for the foreseeable future, being able to use this extension to work on the cluster would be absolutely amazing.
Yes, I also want this feature a lot with universities going remote due to COVID-19
Another possible resolution to this issue is by enabling a direct connection to the remote server.
That is, the user would:
- Launch vscode-server on a remote (possibly login-restricted) host.
- Enter the remote server address and port in vscode, and connect to it.
That way, no ssh is required at all and it can work on login-restricted hosts.
how do you do that? Have you tried it?
No capacity to address this in the near future but I am interested to hear how the cluster setup works for other users - if anyone is not using slurm/srun as described above please let me know what it would take to make this work for you.
I put this to settings.json:
"terminal.integrated.shellArgs.linux": [
"-c",
"export FAF=FEF ; exec $SHELL -l",
]

After that, every Linux shell will have the FAF env variable (what I wanted); furthermore, thanks to exec, no new process is created!
I hope this will be useful for someone :D !
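The exec trick can be reproduced locally; a minimal sketch showing that the exported variable survives into the replacement shell:

```shell
#!/bin/sh
# Export a variable, then exec a new shell in place of the current
# process. exec replaces the process rather than forking, so only one
# process exists and the child shell inherits the variable.
sh -c 'export FAF=FEF; exec sh -c "echo FAF=\$FAF"'
```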
I guess this is related. I would like VS code clients (e.g., julia client) to have an option to start in the Slurm job I am currently in and not in the login node.
I am able to get the Julia language server by having added
ml >/dev/null 2>&1 && ml julia
to my ~/.bashrc.
For Slurm jobs, I have to use the ijob command line. It would be great to at least be able to start the Julia client from the job shell as a Julia client session.
One issue with that approach is that it starts Julia from the shell and not the client so it misses out on a few features such as vscodeddisplay for being able to display tabular data.
I worked on this for over a day and may have a somewhat working solution, inspired by @Nosferican's idea, to run the command-line job from within the julia client. I didn't have to add anything to my ~/.bashrc for it to work.
One caveat, though: as he said, I couldn't view dataframes using the vscodedisplay function, nor am I able to view plots. But I suppose one hacky workaround for plots is to save them and open them up alongside in vscode itself. The screenshot below shows how it worked:

This was using julia, but I'm sure similar setup could be followed through python/R, i.e. by invoking shell command features and running srun from within julia/python/r, like this:
srun -c --pty julia
As @Nosferican said though, and as shown in my screenshot, images couldn't be displayed. Any ideas?
P.S. BTW, before trying this out, I tried all sorts of other ways to get around this today, e.g. by adding this to my settings.json:
"terminal.integrated.shellArgs.linux": [
"-c",
"srun -c 6 --pty bash",
]
Also tried to work around it by using a tmux setup running on a compute node, hoping any new julia/python/r instance would also be using the same instance. The tmux setup would be something like this: https://github.com/julia-vscode/julia-vscode/issues/426
But using that method, I could only get python to execute code in the terminal; it doesn't work for its interactive jupyter view, nor for r and julia.
Don't know enough about vscode's integrated terminal setup to manipulate the ports either.
Another use case would be to transfer code to the server while sshing into it:
run an rsync command ... automatically on the local machine before opening the connection
Any chance of getting this out of backlog and into a milestone, @roblourens? It would be amazing to be able to use this extension.
Similar problem: I tried to request an interactive shell on my cluster via my login node. Unfortunately this causes the vs code extension to time out.
I'm using LSF and I run:
bsub -Is "zsh"
in my .bashrc.
I'm guessing it's a port forwarding problem between the client and the server-hosted extension files?
Has there been any progress on this? Can we now ssh directly to an interactive session and have it work? (https://stackoverflow.com/questions/60141905/how-to-run-code-in-a-debugging-session-from-vs-code-on-a-remote-using-an-interac)
@roblourens There seem to be about 37 non-bugs in the backlog milestone. Could you give a rough estimate of how high this issue ranks in terms of priorities? For example, next release, not the one after, or end of the year?
Has there been any progress on this? Can we now ssh directly to an interactive session and have it work? (https://stackoverflow.com/questions/60141905/how-to-run-code-in-a-debugging-session-from-vs-code-on-a-remote-using-an-interac)
Update:
There's a huge problem with this approach. Please see the discussion below by @Nosferican.
I confirm the answer in the StackOverflow works for me!
Thank you and the author of the answer!
Although I found there might need to be a bit of modification from the original answer: mainly, I think we need to add username@ before the login server name (sorry, I'm not able to comment there since my StackOverflow account is new).
A recap of the procedure:
1. Request an interactive session (salloc for slurm), and get the computing node assigned.
2. ssh -J [email protected] username@nodeXXX. The -J option is the command-line form of the "ProxyJump" option in the ~/.ssh/config file, which will look like:

Host MyCluster
    HostName nodeXXX
    ProxyJump [email protected]
    User username
A reminder: the key is to set the ~/.ssh/config correctly, be aware of the jump node's name. And remember to change the nodeXXX name every time to the computing node assigned.
p.s. It originally didn't work on one of my clusters somehow. But after I use the SSH key file, and specify the IdentityFile in the ~/.ssh/config, the problem was solved.
So, I suggest using an SSH key and setting the ~/.ssh/config as:
Host MyCluster
HostName nodeXXX
ProxyJump [email protected]
User username
IdentityFile ~/.ssh/my_key
This saves you from entering your password twice every time anyway.
I tried the solution. I am able to start VS Code on the computing node, but it returns a shell on the computing node that is not in the Slurm job. Is there a way to have the VS Code shell / language servers step into the job?
I tried the solution. I am able to start VS Code on the computing node, but it returns a shell on the computing node that is not in the Slurm job. Is there a way to have the VS Code shell / language servers step into the job?
Sorry, I'm not sure what you mean. Did you try to open the Explorer in VSCode and work on some code scripts? I think the language server will step in automatically when you work on a certain code script.
Aye. The solution works in the sense I can connect to the compute nodes but I am not inside the Slurm job so I don't have access to the resources it allocated for it. I can start coding and the language server steps in but I am now consuming resources on that node that would not be tracked by the cluster job manager through Slurm.
The solution isn't super practical for me as the nodes get allocated with arbitrary names
Ideally this proxy jumping would be automated
Aye. The solution works in the sense I can connect to the compute nodes but I am not inside the Slurm job so I don't have access to the resources it allocated for it. I can start coding and the language server steps in but I am now consuming resources on that node that would not be tracked by the cluster job manager through Slurm.
OMG you are right! Forgive me, I didn't even know if the job is being done inside the Slurm allocation! (Though I notice the abnormal low usage when I use seff to check the job. Now I feel guilty of my illegitimate use.)
Could you please tell me how to check whether the job is running inside the Slurm allocation?
The easiest way to check would be to look for the Slurm environmental variables that get set by default (https://slurm.schedmd.com/srun.html#lbAI). For example, if inside a Slurm job, it should have the environmental variable SLURM_JOB_ID (plus a bunch of others).
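That check can be scripted; a small sketch that reports whether the current shell is inside an allocation (it runs anywhere, since it only inspects the environment):

```shell
#!/bin/sh
# If Slurm launched this shell inside a job, SLURM_JOB_ID is set.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    echo "inside Slurm job $SLURM_JOB_ID"
else
    echo "not inside a Slurm job"
fi
```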
Now I'm confused. I tested inside VSCode (inside Python, for example). It turns out I can get the SLURM_JOB_ID env variable. Moreover, when I run my job in parallel, I get at max exactly the number of CPUs I applied for. So, maybe it is working as expected?
I am not sure how it would work based on the pipeline. I can potentially have different Slurm jobs in the same compute node. If I SSH to that node, how would it know which job to pick from for executing my code?
Now I'm confused. I tested inside VSCode (inside Python, for example). It turns out I can get the SLURM_JOB_ID env variable. Moreover, when I run my job in parallel, I get at max exactly the number of CPUs I applied for. So, maybe it is working as expected?
Do you happen to know whether pam_slurm_adopt is configured in your cluster?
Currently an alternative is to run the VS Code OSS version for the cluster (it only has access to open-source extensions and not the Marketplace, due to not being a Microsoft build).

Should be able to do the same with self-hosted code spaces.
Now I'm confused. I tested inside VSCode (inside Python, for example). It turns out I can get the SLURM_JOB_ID env variable. Moreover, when I run my job in parallel, I get at max exactly the number of CPUs I applied for. So, maybe it is working as expected?
Do you happen to know whether pam_slurm_adopt is configured in your cluster?
Sorry, I don't know. I only know I'm not able to ssh into nodes on which I do not have a running job.
So it's probably configured in your cluster. In this case, there is no problem sshing into a slurm job (via vscode or otherwise). The problem begins when such a configuration doesn't exist (for example in LSF or another scheduler).
@Nosferican We've started working with VS Code OSS. It's a shame that the extension list is smaller due to the Marketplace limitation. I don't see Marketplace access changing; perhaps MS will make their remote dev server FOSS at some point.
I'd like to build on the recent discussions about SSHing into nodes. We have pam_slurm_adopt configured in our cluster. The challenge for us is that each node has a name based on its numerical acquisition order. Which node to SSH into isn't known before the job is created and can't be selected by the user. Currently the workflow would have to be: (1) create a job; (2) get the node number; (3) send that information back to local VSCode; (4) use the node as the SSH target for remote dev; (5) start remote dev. Certainly this could be managed manually each time the user wishes to connect VSCode to the cluster, but that is clunky, inflexible and error prone. I haven't been able to figure out how to automate the process in my free time, perhaps someone with more skill could figure this out?
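As one possible shape for that automation, here is a hedged local sketch. get_allocated_node is a hypothetical stand-in for querying Slurm (e.g. squeue -j "$SLURM_JOB_ID" -h -o %N on a real cluster), and the host/user names are made up; the idea is to regenerate the SSH config stanza after each allocation so the local VS Code always targets the current node:

```shell
#!/bin/sh
# get_allocated_node stands in for asking Slurm which node the job got,
# e.g. squeue -j "$SLURM_JOB_ID" -h -o %N on a real cluster.
get_allocated_node() { echo "node042"; }

node=$(get_allocated_node)
# Emit a stanza pointing "MyCluster" at the freshly assigned node;
# redirecting this into a file Include'd from ~/.ssh/config would keep
# the local VS Code target up to date automatically.
cat <<EOF
Host MyCluster
    HostName $node
    ProxyJump user@login.cluster.example
    User user
EOF
```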
I wanted to run the vscode python debugger on an interactive Slurm job on a remote server. I tried to edit the extension.js file but it didn't work for me.
Here is my workaround:
1. Create a file named bash somewhere, for example /home/myuser/pathto/bash, and chmod +x bash.
2. Put salloc [your desired options for the interactive job] in the bash file.
3. Set "terminal.integrated.automationShell.linux": "/home/myuser/pathto/bash" in settings.json and save it (use the absolute path; for example, ~/pathto/bash didn't work for me).
Now every time you run the debugger it will first ask for the interactive job, and the debugger will run on it. But take into consideration that this also applies to tasks you run in tasks.json.
You can also use srun instead of salloc, for example srun --pty -t 2:00:00 --mem=8G -p interactive bash
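A minimal sketch of what that wrapper file could contain (the flags are illustrative, and the salloc stub below exists only so the sketch runs without a cluster; on a real system the file would just be the shebang plus the salloc line):

```shell
#!/bin/sh
# On a real cluster the wrapper at /home/myuser/pathto/bash would be:
#
#   #!/bin/bash
#   salloc -t 2:00:00 --mem=8G
#
# Stubbed version so this sketch is runnable without Slurm:
salloc() { echo "salloc: Granted job allocation 1953 (stub)"; }
salloc -t 2:00:00 --mem=8G
```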
@asalimih works, but is there a way to automatically scancel the job when finished, or reclaim an existing one (probably more complicated)?
@asalimih works, but is there a way to automatically scancel the job when finished, or reclaim an existing one (probably more complicated)?
This was for interactive debugging. When you run the debugger it will open a terminal, so whenever you're finished debugging you can close the terminal bash and the job will be automatically canceled.
In order to connect and disconnect from an interactive job without it being canceled, I guess you can use tmux like here. Here is my suggestion but I haven't tried it.
1. tmux new -s debug_session
2. salloc [your desired options for the interactive job]
3. tmux attach -t debug_session
Now if you start the debugger it will run inside that interactive job inside the tmux session (I guess :) ) and closing the terminal won't stop the job.
Since today, the debugger is unusable. :-(
This is because calling the debugger does not wait for source activate; conda activate xyz to finish, so the script runs with the (wrong) "base" environment. What could have changed that? Any ideas how to solve it?
salloc: Granted job allocation 1953
(base) bash-4.4$ /usr/bin/env /.../user/miniconda3/envs/plot/bin/python /...user/.vscode-server/extensions/ms-python.python-2020.12.422005962/pythonFiles/lib/python/debugpy/launcher 43457 -- /...user/osse_analysis/main.py
source /...user/miniconda3/bin/activate
conda activate plot
-> fails
I also noticed that vscode now reports Your service running on port 12345 is available. I am using @asalimih's strategy.
Edit: Don't I have to call srun --jobid=1234 /bin/bash after salloc to grab the allocation and run it on the compute node?