I am trying to connect to a SageMaker instance over SSH from my local machine, but I cannot find a way to do it. This seems like an important feature, either for debugging (through PyCharm) or for uploading files with SCP. Is there any way to do this?
SageMaker doesn't support SSH access to running jobs or endpoints. There are a couple of ways to get files into your instances:
There's currently no way to do remote debugging of a training job. You might be able to do this by using a customized container to run your job in local mode.
If you have another instance that you can SSH into from both the notebook instance and your local machine, you can tunnel through it to get SSH access. I'm using this for the same purpose: SCPing files in and out.
For example, assuming "bastion" is the intermediate instance:
# run this command from within a terminal on your notebook instance (New -> Terminal), pushes port 22 to bastion's locally accessible port 10022
sh-4.2$ ssh user@bastion -R 10022:localhost:22 -f -N
# run this command from your local machine, pulls port 10022 of the bastion to local machine port 10022
[you@yourmachine]$ ssh user@bastion -L 10022:localhost:10022 -f -N
# now you can ssh or scp as you'd like, using the localhost port 10022 as the target
[you@yourmachine]$ ssh localhost -p 10022 -l ec2-user
You'll of course have to take care of authentication in the right directions (e.g. generate key pairs and add the public keys to authorized_keys as applicable).
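Instead of the second -L tunnel, OpenSSH 7.3 or later can hop through the bastion directly with ProxyJump, which also makes scp work with a short alias. A minimal sketch, assuming the reverse tunnel from the notebook instance is already listening on the bastion's port 10022; the helper ssh_config_bastion and the alias sagemaker-nb are hypothetical names:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an ~/.ssh/config stanza that reaches the notebook
# instance through the reverse tunnel on the bastion (ProxyJump needs OpenSSH 7.3+).
ssh_config_bastion() {
    local host_alias="$1"        # name you will type, e.g. sagemaker-nb
    local bastion="$2"           # e.g. user@bastion
    local port="${3:-10022}"     # bastion port opened by `ssh -R` on the notebook
    local user="${4:-ec2-user}"  # login user on the notebook instance
    cat <<EOF
Host ${host_alias}
    # "localhost" is resolved on the bastion, i.e. the end of the -R tunnel
    HostName localhost
    Port ${port}
    User ${user}
    ProxyJump ${bastion}
    # keep the notebook's host key separate from other "localhost" entries
    HostKeyAlias ${host_alias}
EOF
}
```

Usage: ssh_config_bastion sagemaker-nb user@bastion >> ~/.ssh/config, after which ssh sagemaker-nb or scp data.csv sagemaker-nb:SageMaker/ should go through the bastion without a separate -L step.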
Are you planning to implement this feature? How?
You could add a new IAM permission, sagemaker:CreateSSHTraining, that (when allowed) would permit SSH through a new configuration option, e.g.
tf_estimator.fit(inputs=input, ssh_pub_key='~/.ssh/id_rsa.pub')
The locally installed SageMaker CLI would take care of uploading the SSH public key using the current user's AWS credentials.
SageMaker would then create a proxy endpoint that is automatically firewalled to the source IP the training was launched from (e.g. the current laptop). The sole purpose of this endpoint (on a random subdomain) is to expose port 22 (or another random port) to the current user.
Finally, it would print the randomly generated SSH endpoint to stdout so the user can copy and paste it to SSH into the training instance. The training instance could automatically pause before shutting down to give the user time to SSH in, though this could be made configurable; most likely the user will set a breakpoint with the Python debugger anyway.
The link would expire once the user shuts down the training instance.
Perhaps this feature deserves a new resource type: instead of polluting the SageMaker Trainings resource, it would create an SSHTrainings resource.
@mklissa I know this is quite late, but it looks like AWS has thought about your particular use case: Tutorial: Set Up PyCharm Professional with a Development Endpoint. It works through AWS Glue's ability to create development endpoints. However, it looks like it only supports Python 2.7.
AWS does not natively support SSH-ing into SageMaker notebook instances, but nothing really prevents you from setting up SSH yourself.
The only problem is that these instances do not get a public IP address, which means you have to either create a reverse proxy (with ngrok, for example) or connect to them via a bastion box.
Steps to make the ngrok solution work:
1. curl https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip > ngrok.zip
2. unzip ngrok.zip
3. Authenticate ./ngrok with your token.
4. ./ngrok tcp 22 > ngrok.log & (the & puts it in the background)
5. Open (or create) the ~/.ssh/authorized_keys file on SageMaker and paste in your public key (likely ~/.ssh/id_rsa.pub from your computer).
6. ssh -p <port_from_ngrok_logfile> ec2-user@<host_from_ngrok_logfile> (ngrok assigns the host and port; both appear in ngrok.log)

If you want to automate it, I suggest using lifecycle configuration scripts.
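Reading the host and port out of ngrok.log (step 6) can be scripted. A minimal sketch, assuming the log contains a tunnel URL such as url=tcp://0.tcp.ngrok.io:12345; with ngrok v2 such lines are typically written when it is started with --log=stdout, but formats vary between versions, so check your own log first. The function names are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helpers; they assume ngrok.log contains a URL like
# url=tcp://0.tcp.ngrok.io:12345 (verify against your own ngrok version).

# Print HOST:PORT of the last tcp:// tunnel URL found in the log.
ngrok_endpoint() {
    local log_file="$1"
    grep -o 'tcp://[^[:space:]"]*' "$log_file" | tail -n 1 | sed 's|^tcp://||'
}

# Print the ssh command for that endpoint (login user defaults to ec2-user).
ngrok_ssh_command() {
    local log_file="$1"
    local user="${2:-ec2-user}"
    local endpoint host port
    endpoint="$(ngrok_endpoint "$log_file")"
    if [ -z "$endpoint" ]; then
        echo "no tcp:// URL found in $log_file" >&2
        return 1
    fi
    host="${endpoint%:*}"   # everything before the last colon
    port="${endpoint##*:}"  # everything after the last colon
    echo "ssh -p $port $user@$host"
}
```

Run ngrok_ssh_command ngrok.log on the notebook instance and paste the printed command into a terminal on your laptop. If the log turns out to be empty, the ngrok v2 agent also reports active tunnels on its local web interface (http://127.0.0.1:4040/api/tunnels), which may be easier to query.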
Another good trick is wrapping the downloading, unzipping, authenticating and starting of ngrok into a single command in /usr/bin, so you can just call it from the SageMaker console if ngrok dies.
It's a little bit too long to explain completely how to automate it with lifecycle scripts, but I've written a detailed guide on https://biasandvariance.com/sagemaker-ssh-setup/.
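One piece that fits naturally into a lifecycle configuration script is re-installing your public key, since AWS documents that only /home/ec2-user/SageMaker persists across notebook instance restarts. A minimal sketch of an idempotent step; add_authorized_key is a hypothetical helper, not part of any AWS tooling:

```shell
#!/usr/bin/env bash
# Hypothetical helper for a lifecycle "on start" script: append a public key
# to authorized_keys only if it is not already there, with permissions sshd accepts.
add_authorized_key() {
    local pub_key="$1"
    local auth_file="${2:-/home/ec2-user/.ssh/authorized_keys}"
    local ssh_dir
    ssh_dir="$(dirname "$auth_file")"
    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"
    touch "$auth_file"
    chmod 600 "$auth_file"
    # -q quiet, -x whole-line match, -F fixed string (keys contain +, / and =)
    if ! grep -qxF -- "$pub_key" "$auth_file"; then
        printf '%s\n' "$pub_key" >> "$auth_file"
    fi
}
```

Lifecycle scripts run as root, so after calling add_authorized_key with your key (the contents of your local ~/.ssh/id_rsa.pub), you may also want chown -R ec2-user:ec2-user /home/ec2-user/.ssh so the files belong to the login user. Running the step on every start is safe because a key that is already present is not appended again.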
Thank you @mariokostelac! I used the most recent ngrok and needed to change two things:
./ngrok authtoken <AUTHTOKEN>
This can also be solved via https://docs.aws.amazon.com/systems-manager/latest/userguide/managed_instances.html by registering the SageMaker machine as if it were an on-prem computer that AWS SSM can manage; then one can ssh/scp/tunnel into it.
laptop> $ aws ssm start-session --region=eu-central-1 --target i-083ee1e47a95416c3
Starting session with SessionId: lgallucci-0d662d7d50462b043
ec2> $ nvidia-smi
Thu Nov 19 08:58:45 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M60 On | 00000000:00:1E.0 Off | 0 |
| N/A 34C P8 14W / 150W | 0MiB / 7618MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
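Once the machine is managed by SSM, plain ssh and scp can also be carried over Session Manager using the AWS-StartSSHSession document. A sketch, assuming the AWS CLI and its Session Manager plugin are installed on the laptop, sshd is running on the instance, and your public key is already in its authorized_keys; ssh_config_ssm is a hypothetical helper and eu-central-1 simply mirrors the region used above:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an ~/.ssh/config stanza that tunnels SSH through
# SSM Session Manager for any target whose ID looks like i-* or mi-*.
ssh_config_ssm() {
    local region="${1:-eu-central-1}"
    cat <<EOF
Host i-* mi-*
    # %h is the SSM target ID you pass to ssh, %p the remote port (22 by default)
    ProxyCommand sh -c "aws ssm start-session --region ${region} --target %h --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"
EOF
}
```

Usage: ssh_config_ssm eu-central-1 >> ~/.ssh/config, then ssh ec2-user@<target-id> or scp file.txt ec2-user@<target-id>:/tmp/. SSM only provides the transport here; SSH key authentication still applies.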
How do I find the target ID of my SageMaker Studio notebook?
This is great, thanks a lot for that information. I'll try to set it up soon.
@hanan-vian SM doesn't give you any target ID; you have to do everything yourself, as if it were a computer box in your basement (so to speak). It would be great if the SageMaker team recognized the potential of this use case and someday did the integration automatically.
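If the machine was registered through an SSM hybrid activation, it should show up as a managed instance (ID prefixed mi-), which can then be looked up with the AWS CLI. A sketch; list_managed_instances is a hypothetical wrapper, and the chosen output fields are an assumption about what you want to see:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: list SSM-managed (non-EC2) machines as "ID  hostname".
list_managed_instances() {
    local region="${1:-eu-central-1}"
    # ResourceType is "ManagedInstance" for hybrid-activated machines
    aws ssm describe-instance-information \
        --region "$region" \
        --query "InstanceInformationList[?ResourceType=='ManagedInstance'].[InstanceId,ComputerName]" \
        --output text
}
```

The ID printed in the first column is what goes into aws ssm start-session --target.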
@elgalu if I understand you correctly, I have to start an EC2 instance with a Deep Learning AMI?
So I cannot use this together with Estimator.fit() from the SDK?
@philschmid we are discussing SSH access in SageMaker Studio/Notebooks in this thread. With EC2 you can already ssh, it's solved there.