Sagemaker-python-sdk: SSH into a SageMaker instance for debugging purposes

Created on 10 Aug 2018 · 12Comments · Source: aws/sagemaker-python-sdk

I am trying to connect to a SageMaker instance over SSH from my local machine, but I cannot find a way to do it. This seems like important functionality, either for debugging (through PyCharm) or for uploading files with SCP. Is there any way to do this?

question

mklissa

👍9

All 12 comments

SageMaker doesn't support SSH access to running jobs or endpoints. There are a couple of ways to get files into your instances:

add the files to the set of input data you specify when you launch your job (or add to model files for a hosting endpoint)
modify your training (or inference) code to download the files when your code is run.

There's currently no way to do remote debugging of a training job. You might be able to do this by using a customized container to run your job in local mode.
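The second option above (downloading files from inside the training code) could look roughly like the sketch below. It assumes the files live in S3 and that boto3 is available in the container; the function name `fetch_extra_files` and its parameters are hypothetical, and the S3 client is passed in so the helper can be exercised without AWS credentials.

```python
import os


def fetch_extra_files(s3_client, bucket, keys, dest_dir):
    """Download each S3 object in `keys` into `dest_dir` and return the local paths.

    `s3_client` is expected to behave like boto3.client("s3"); this helper is
    an illustrative sketch, not part of the SageMaker SDK.
    """
    os.makedirs(dest_dir, exist_ok=True)
    local_paths = []
    for key in keys:
        # Keep only the file name, dropping any S3 "folder" prefix.
        local_path = os.path.join(dest_dir, os.path.basename(key))
        # boto3's S3 client exposes download_file(Bucket, Key, Filename).
        s3_client.download_file(bucket, key, local_path)
        local_paths.append(local_path)
    return local_paths
```

At the top of a training script this would be called as something like fetch_extra_files(boto3.client("s3"), "my-example-bucket", ["debug/extra_config.json"], "/tmp/extra"), where the bucket and key are placeholders and the job's IAM role must be allowed to read them.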

jesterhazy on 10 Aug 2018

👍4

If you have another instance that you can ssh into from both the instance and your local machine, then you can tunnel through and achieve ssh access. I'm using this for the same purpose of SCPing stuff in and out.

For example, assuming "bastion" is the additional middle instance:

# run this command from within a terminal on your notebook instance (New -> Terminal), pushes port 22 to bastion's locally accessible port 10022
sh-4.2$ ssh user@bastion -R 10022:localhost:22 -f -N

# run this command from your local machine, pulls port 10022 of the bastion to local machine port 10022
[you@yourmachine]$ ssh user@bastion -L 10022:localhost:10022 -f -N

# now you can ssh or scp as you'd like, using the localhost port 10022 as the target
[you@yourmachine]$ ssh localhost -p 10022 -l ec2-user

You'll of course have to take care of authentication in the right directions (e.g. create private keys and add to authorized_keys as applicable).
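For anyone who wants to script the three commands above, here is a minimal Python sketch that only builds the argument lists. The function names are hypothetical; the ports and the ec2-user login are the ones used in the commands above.

```python
import shlex


def reverse_tunnel_cmd(bastion, bastion_port=10022, local_ssh_port=22):
    """Run on the notebook instance: publish its port 22 on the bastion."""
    forward = f"{bastion_port}:localhost:{local_ssh_port}"
    return ["ssh", bastion, "-R", forward, "-f", "-N"]


def local_forward_cmd(bastion, port=10022):
    """Run on the local machine: pull the bastion's port to the same local port."""
    return ["ssh", bastion, "-L", f"{port}:localhost:{port}", "-f", "-N"]


def connect_cmd(port=10022, user="ec2-user"):
    """Run on the local machine once both tunnels are up."""
    return ["ssh", "localhost", "-p", str(port), "-l", user]


def as_shell(cmd):
    """Render an argument list as a copy-pasteable shell command."""
    return shlex.join(cmd)
```

Each list can be handed to subprocess.run(cmd, check=True) on the appropriate machine, or printed with as_shell(...) to paste into a terminal.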

yonatanp on 27 Sep 2018

👍2

Are you planning to implement this feature? If so, how?

You could add a new IAM Allow Statement sagemaker:CreateSSHTraining that would permit ssh by using a new configuration option, e.g.

tf_estimator.fit(inputs=input, ssh_pub_key='~/.ssh/id_rsa.pub')

The locally installed SageMaker CLI would take care of uploading the SSH public key using the current user's AWS credentials.

SageMaker would then create a proxy/endpoint that is automatically firewalled to the source IP from which the training was launched (e.g. the current laptop). The only purpose of this endpoint (a random subdomain) would be to expose port 22 (or another random port) to the current user.

Finally, it would print the randomly generated SSH endpoint to stdout so the user can copy and paste it to SSH into the training instance. The training instance could automatically pause before shutting down to give the user time to connect. This could be made configurable, since the user will most likely set a breakpoint with the Python debugger anyway.

The link would expire once the user shuts down the training instance.

Perhaps this feature is worth a new resource type: instead of polluting the Trainings SM resource, it would create an SSHTrainings resource.

elgalu on 8 Jun 2019

👍8

@mklissa I know this is quite late, but it looks like AWS has thought about your particular use case: Tutorial: Set Up PyCharm Professional with a Development Endpoint. It works through AWS Glue's ability to create a development endpoint. However, it looks like it only supports Python 2.7.

kot-behemoth on 30 Jul 2019

👍1

AWS does not natively support SSH-ing into SageMaker notebook instances, but nothing really prevents you from setting up SSH yourself.

The only problem is that these instances do not get a public IP address, which means you have to either create a reverse proxy (with ngrok, for example) or connect through a bastion host.

Steps to make the ngrok solution work:

1. Download ngrok with curl https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip > ngrok.zip
2. Unzip it with unzip ngrok.zip
3. Create a free ngrok account to get permission for TCP tunnels
4. Run ./ngrok authenticate with your token
5. Start the tunnel with ./ngrok tcp 22 > ngrok.log & (the & puts it in the background)
6. The log file will contain the URL, so you know where to connect
7. Create the ~/.ssh/authorized_keys file (on SageMaker) and paste in your public key (likely ~/.ssh/id_rsa.pub from your computer)
8. Connect by calling ssh -p <port_from_ngrok_logfile> ec2-user@<host_from_ngrok_logfile> (the host ngrok assigns to you will be in ngrok.log)
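Reading the host and port out of ngrok.log by hand gets tedious, so here is a small sketch that pulls the first tcp:// URL out of the log text and turns it into an ssh command. The exact log layout is an assumption (the sample line in the usage note is illustrative only), the helper names are hypothetical, and ec2-user is assumed to be the login user on the notebook instance.

```python
import re

# Matches a tunnel URL such as tcp://<host>:<port> anywhere in the log text.
TCP_URL = re.compile(r"tcp://(?P<host>[\w.-]+):(?P<port>\d+)")


def find_tunnel_endpoint(log_text):
    """Return (host, port) for the first tcp:// URL in the log, or None."""
    match = TCP_URL.search(log_text)
    if match is None:
        return None
    return match.group("host"), int(match.group("port"))


def ssh_command(log_text, user="ec2-user"):
    """Build the ssh command line for the tunnel found in the log text."""
    endpoint = find_tunnel_endpoint(log_text)
    if endpoint is None:
        raise ValueError("no tcp:// tunnel URL found in the log")
    host, port = endpoint
    return f"ssh -p {port} {user}@{host}"
```

For a log line that looks something like url=tcp://0.tcp.ngrok.io:12345, ssh_command(open("ngrok.log").read()) would give a command pointing at that host and port. If the log stays empty, the tunnel URL has to be copied from ngrok's own output instead.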

If you want to automate it, I suggest using lifecycle configuration scripts.

Another good trick is to wrap downloading, unzipping, authenticating and starting ngrok into an executable in /usr/bin, so you can just call it from the SageMaker console if the tunnel dies.

It's a little bit too long to explain completely how to automate it with lifecycle scripts, but I've written a detailed guide at https://biasandvariance.com/sagemaker-ssh-setup/.

mariokostelac on 7 May 2020

👍7 ❤2

Thank you @mariokostelac! I used the most recent ngrok and needed to change two things:

Change the command to ./ngrok authtoken <AUTHTOKEN>.
Keep ngrok in the foreground (writing to a log file did not work).

daysm on 3 Nov 2020

This can also be solved via https://docs.aws.amazon.com/systems-manager/latest/userguide/managed_instances.html by setting up the SageMaker machine as if it were an on-prem computer that AWS SSM can manage; one can then ssh/scp/tunnel into it.

laptop> $ aws ssm start-session --region=eu-central-1 --target i-083ee1e47a95416c3

Starting session with SessionId: lgallucci-0d662d7d50462b043

ec2> $ nvidia-smi
Thu Nov 19 08:58:45 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   34C    P8    14W / 150W |      0MiB /  7618MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
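The registration step itself is not shown above, so here is a rough sketch of how it might be done with boto3, under the assumption that the machine is registered through an SSM hybrid activation. The helper names and the description string are hypothetical; create_activation is the boto3 SSM client call, and the register command follows the form given in the Systems Manager documentation. The SSM agent must already be installed on the machine, and the IAM role must be one that Systems Manager is allowed to assume.

```python
def create_hybrid_activation(ssm_client, iam_role, instance_name):
    """Create a one-machine hybrid activation and return (id, code).

    `ssm_client` is expected to behave like boto3.client("ssm").
    """
    response = ssm_client.create_activation(
        Description="SageMaker machine managed through SSM",  # free-form text
        DefaultInstanceName=instance_name,
        IamRole=iam_role,
        RegistrationLimit=1,  # allow exactly one machine to register
    )
    return response["ActivationId"], response["ActivationCode"]


def register_command(activation_id, activation_code, region):
    """Shell command to run on the machine being registered."""
    return (
        "sudo amazon-ssm-agent -register "
        f'-code "{activation_code}" -id "{activation_id}" -region "{region}"'
    )
```

After the agent has registered and restarted, the machine should show up as a managed instance, and its ID is what gets passed to --target in aws ssm start-session.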

elgalu on 19 Nov 2020

👀2

How do I know my SageMaker Studio notebook target id?

ghost on 17 Dec 2020

This can also be solved via https://docs.aws.amazon.com/systems-manager/latest/userguide/managed_instances.html by setting up the SageMaker machine as if it were an on-prem computer that AWS SSM can manage; one can then ssh/scp/tunnel into it.

This is great, thanks a lot for that information. I'll try to set it up soon.

mariokostelac on 18 Jan 2021

@hanan-vian SM doesn't give you any target id; you have to do everything yourself, as if it were some computer box in your basement (so to speak). It would be great if the SageMaker team realized the potential of this use case and did the integration automatically, maybe some day.

elgalu on 18 Jan 2021

@elgalu If I understand you correctly, I have to start an EC2 instance with a Deep Learning AMI?
I cannot use this together with Estimator.fit() using the SDK?