I have a 4-host deployment to prove the concept of this solution. Scaling up when peak hours begin seems to work fine, but on scale-down the process only manages the first VM in the pool and then appears to time out/fail. This is the output from my automation job, where it just hangs until the next job starts; then both jobs fail within a few minutes.
Starting WVD tenant hosts scale optimization: Current Date Time is: 03/18/2020 15:45:56
It is Off-peak hours
Starting to scale down WVD session hosts ...
Processing hostpool FullDesktop
Checking session host: wvd-1.domain.local
# of sessions: 0 and status: Available
Checking session host: wvd-2.domain.local
# of sessions: 0 and status: Available
Checking session host: wvd-3.domain.local
# of sessions: 0 and status: Available
Checking session host: wvd-4.domain.local
# of sessions: 0 and status: Available
Stopping Azure VM: wvd-1 and waiting for it to complete ...
Azure VM has been stopped: wvd-1 ...
When I check the portal, wvd-1 has successfully de-allocated but the other three are still running. I've deployed the Win10 1909 w/ Office Pro Plus image from the gallery and have performed no further config (I deployed the host pool from the gallery, which configures the hosts via DSC for use in a WVD pool). Are there any other configurations I need to carry out on the hosts for this to work?
There are no errors or warnings logged against the failed automation task, but this is listed as an exception:
The running command stopped because the preference variable "ErrorActionPreference" or common parameter is set to Stop: ActivityId: Powershell commands to diagnose the failure: Get-RdsDiagnosticActivities -ActivityId
Experiencing the exact same issue.
I will add that I was able to change $ErrorActionPreference in the runbook to "Continue", which surfaced the following error.
Get-RdsSessionHost : ActivityId: Powershell commands to diagnose the failure: Get-RdsDiagnosticActivities -ActivityId At line:679 char:27 + ... nHostInfo = Get-RdsSessionHost -TenantName $TenantName -HostPoolName ... + ~~~~~~~~~~~~~ + CategoryInfo : FromStdErr: (Microsoft.RDInf...tRdsSessionHost:GetRdsSessionHost) [Get-RdsSessionHost], RdsPowerShellException + FullyQualifiedErrorId : DefaultNoRdsError,Microsoft.RDInfra.RDPowershell.SessionHost.GetRdsSessionHost
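For anyone hitting the same wall, a minimal way to surface the underlying error is to relax the preference variable in the runbook and feed the returned ActivityId to the diagnostics cmdlet. A sketch using the classic WVD cmdlets; the tenant and host pool names are placeholders, and the ActivityId must come from your own error output:

```powershell
# Let non-terminating errors print instead of stopping the runbook
$ErrorActionPreference = 'Continue'

# Placeholder names - replace with your own tenant and host pool
$TenantName   = 'MyTenant'
$HostPoolName = 'FullDesktop'

# Re-run the failing call and capture any error records
$SessionHosts = Get-RdsSessionHost -TenantName $TenantName -HostPoolName $HostPoolName -ErrorVariable RdsErr

# If the service returned an ActivityId, query the diagnostics for it
if ($RdsErr) {
    Get-RdsDiagnosticActivities -ActivityId '<activity-id-from-error>' -Detailed
}
```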
@townendk Thank you for the detailed feedback. We are actively investigating and will get back to you soon.
@townendk Please find a similar issue here; you can also find a few suggestions/troubleshooting steps from the product team.
Thanks All.
VikasPullagura-MSFT - this is not the same issue, I've read the one described in your post but in that scenario the session hosts do not heartbeat. Mine seem to have an appropriate heartbeat. Thank you though!
@viswanadham-k can you please check and add your comments on this.
I've done some further testing this morning.
1- Overnight, all VMs in the pool were powered off (manually) and de-allocated in the portal. The overnight tasks all ran successfully, but they didn't power anything on. My "MinimumNumberOfRDSH" variable is set to '1', so I expected the script to boot one of them back up for me, but perhaps I'm misunderstanding the behaviour in off-peak time(?)
2- Upon peak time starting, the script correctly booted up the first VM in my pool.
3- I logged in two concurrent sessions to this (my current threshold is for one user per vCPU, and the VM had 2x vCPU)
4- The next script run successfully identified that my VM was at capacity with two users on it, so the second VM in the pool was booted up.
5- I logged off one of my users, but both VMs stayed up and running during peak time (I believe this is by design, scaling down only happens out of hours(?))
6- Peak time ended, and the job fired and correctly identified that my second VM host should be powered off. The VM has de-allocated in the portal but the script again is hanging at this stage:
Starting WVD tenant hosts scale optimization: Current Date Time is: 03/19/2020 09:34:44
It is Off-peak hours
Starting to scale down WVD session hosts ...
Processing hostpool FullDesktop
Checking session host: wvd-2.domain.local
# of sessions: 0 and status: Available
Checking session host: wvd-1.domain.local
# of sessions: 1 and status: Available
Stopping Azure VM: wvd-2 and waiting for it to complete ...
Azure VM has been stopped: wvd-2 ...
The task seems to hang at this point until the next scheduled occurrence fires. Whether that's 15 minutes as per the default, or 2 hours, the result is the same: the script hangs, then as soon as the next job fires they both immediately fail within a few seconds of each other.
@ChristianMontoya Can you check and add your comments.
I am experiencing the exact same output as above as well.
Adding @RoopChevuri as well for comment.
Just wondering if there is an update on this. Thank you.
Hi @townendk & @Jimbos10
Please let me know your availability today. I will send a meeting invitation, then connect and assist you to resolve the error. Let me know your email addresses so I can send the meeting invitation.
Thanks
I have time now if you like, up until 3 PM EST.
Did you guys manage to talk about this? I'm tied up this week (with WVD deployments funnily enough) but I should hopefully have time for a meeting later in the week
Not yet, we did work on it, but I am sure that he is very busy given the current situation.
Hi, I worked with Microsoft and this appears to be an issue with the automation account. My automation account was created in East US, which doesn't appear to work. When we created the resource group and the automation account in West US 2, it started working. They are going to continue working on the issue with the EUS DC.
I wanted to add that I worked with Viswanadham, who was super knowledgeable and very easy to work with.
Hi @Jimbos10, thank you for the update.
@Viswanadham-k - I've been experiencing the same in West Europe, so not sure if that helps the troubleshooting.
Just a quick follow-up: Viswanadham found another issue with a variable that was reporting different values in different datacenter zones. The updated code should now be available, according to Viswanadham. Thanks again for your help, Viswanadham!
@viswanadham-k is there a plan for this code update to be reflected on the github repository referenced in this document?
Hi @thefonz3h
We have already pushed the latest code to the GitHub repository. If you have already deployed the WVD scaling tool and want the latest code changes, please run the script "createazureautomationaccount.ps1"; it will pull in the latest code.
Thanks
Hi @viswanadham-k I did a fresh deployment today getting the latest version of the script and it still failed with the same issue. I edited the script and could still see a variable set to WestUS. I tried switching this to WestEurope but it had no impact.
Do you have a direct download link for the latest version of this script so I can check it against my version?
Many thanks
@viswanadham-k Can you please check on this.
@RoopChevuri Please add your comments.
Hi @thefonz3h
We have fixed the issues you listed here.
Please follow the steps below to apply the latest changes in your environment:
1. Open the link below and copy the basicScale.ps1 code:
https://github.com/Azure/RDS-Templates/blob/ptg-wvdautoscaling-automation/wvd-templates/wvd-scaling-script/basicScale.ps1
2. Open your Automation account resource in the Azure portal.
3. Click on Runbooks.
a. Select the "WVDAutoScaleRunbook".
b. Paste the code into the runbook editor.
c. Click the Save button.
d. Click the Publish button.
If you are still facing the issue, please let me know.
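The same update can also be scripted rather than pasted through the portal, using the Az.Automation cmdlets. A hedged sketch; the resource group, account, and runbook names are assumptions based on this thread and must match your own deployment:

```powershell
# Assumed names - adjust to your deployment
$ResourceGroup = 'WVDAutoScaleRG'
$AccountName   = 'WVDAutoScaleAccount'
$RunbookName   = 'WVDAutoScaleRunbook'

# Download the latest basicScale.ps1 from the branch referenced above
$Url = 'https://raw.githubusercontent.com/Azure/RDS-Templates/ptg-wvdautoscaling-automation/wvd-templates/wvd-scaling-script/basicScale.ps1'
Invoke-WebRequest -Uri $Url -OutFile .\basicScale.ps1

# Overwrite the existing runbook and publish it in one step
Import-AzAutomationRunbook -ResourceGroupName $ResourceGroup -AutomationAccountName $AccountName `
    -Name $RunbookName -Type PowerShell -Path .\basicScale.ps1 -Force -Published
```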
Thanks,
Hi @viswanadham-k, thank you for the update. I have updated my automation account script and can now see that the scale-down job no longer fails.
This may warrant a separate thread, but now that this seems to be working OK, could I please get clarification on what the script SHOULD be doing? I've found it does not scale down hosts that have users logged into them. I was expecting those users to be ejected after a grace period, but the scaling script just seems to ignore hosts with active user sessions.
Many thanks for the support so far!
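If a grace period is what you want, rather than simply skipping occupied hosts, the classic WVD module does have cmdlets for messaging users and forcing a logoff that a custom scale-down step could use. This is only a sketch of that idea, not what the shipped script does; the tenant, host pool, and host names are placeholders:

```powershell
# Placeholder names - replace with your own values
$TenantName   = 'MyTenant'
$HostPoolName = 'FullDesktop'
$HostName     = 'wvd-1.domain.local'

# Find every active session on the host being drained
$Sessions = Get-RdsUserSession -TenantName $TenantName -HostPoolName $HostPoolName |
    Where-Object { $_.SessionHostName -eq $HostName }

# Warn each user before the shutdown
foreach ($Session in $Sessions) {
    Send-RdsUserSessionMessage -TenantName $TenantName -HostPoolName $HostPoolName `
        -SessionHostName $HostName -SessionId $Session.SessionId `
        -MessageTitle 'Maintenance' `
        -MessageBody 'This host shuts down in 15 minutes. Please save your work and log off.'
}

Start-Sleep -Seconds 900   # the grace period

# Log off anyone still connected, then the VM can be deallocated safely
foreach ($Session in $Sessions) {
    Invoke-RdsUserSessionLogoff -TenantName $TenantName -HostPoolName $HostPoolName `
        -SessionHostName $HostName -SessionId $Session.SessionId -NoUserPrompt
}
```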
Hi @townendk
Thank you for giving feedback.
Can you please close this issue.
Thanks
Hi,
I was experiencing the same issue. Apparently the hosts' status when they are deallocated has changed from "NoHeartBeat" to "Unavailable". I changed to the new script suggested at https://github.com/Azure/RDS-Templates/blob/ptg-wvdautoscaling-automation/wvd-templates/wvd-scaling-script/basicScale.ps1.
Now some of the hosts were shut down, but not all of those that should have been. For the hosts left online, the script identified that it was off-peak hours and started the scale-down process, but it kept trying to shut down the wrong host.
Also, the hosts that stayed on during off-peak hours (and were supposed to be down) were apparently left with "Allow new connections" disabled, so no users were able to connect in the morning.
See sample output below.
It is Off-peak hours
Starting to scale down WVD session hosts ...
Processing hostpool AZC-WVD-Prod
Checking session host: AZCWVD-9.hl.local
# of sessions: 0 and status: Available
Checking session host: AZCWVD-2.hl.local
# of sessions: 1 and status: Available
Checking session host: AZCWVD-7.hl.local
# of sessions: 1 and status: Available
Stopping Azure VM: AZCWVD-0 and waiting for it to complete ...
Azure VM has been stopped: AZCWVD-0 ...
HostpoolName: AZC-WVD-Prod, TotalRunningCores: 8 NumberOfRunningHosts: 2
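On the "no users were able to connect in the morning" symptom: if hosts end up stuck in drain mode after off-peak hours, new sessions can be re-enabled per host with the classic cmdlet. A sketch; the tenant name is a placeholder, the pool and host names are taken from the output above:

```powershell
# Re-allow new sessions on a host the script left in drain mode
Set-RdsSessionHost -TenantName 'MyTenant' -HostPoolName 'AZC-WVD-Prod' `
    -Name 'AZCWVD-2.hl.local' -AllowNewSession $true
```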