Openshift-ansible: Intentionally dumb question: Which branch/tag is expected to Just Work? (With AWS)

Created on 23 Jun 2017 · 25 Comments · Source: openshift/openshift-ansible

Description

I am finding it very hard to find a variant of this repository that works as described.

I have been quietly trying to set up an OpenShift Origin cluster in our AWS account all week, you know, in the background of other work. So it's not terribly time consuming, but I just want to point out that I think I have tried about 50 variations at this point, and _not one_ has resulted in a working OpenShift cluster.

This is what I have been trying:

1. Check out release-1.* and/or openshift-ansible-3.*, try to set up cluster.
2. Watch as some error comes up.
3. Try a clean way to fix this error. OK, that doesn't work, so try a hackjob to fix the error. That surfaces another error. (Interwoven: jumping around between GitHub issues and StackOverflow.)
4. Tear down cluster. Repeat.

I have read the README and the subsection on AWS, looked at many issues, and cannot understand why this is so elusive.

I am very experienced with AWS automation in general. Very experienced with Python. Pretty experienced with various config mgmt tools. Somewhat experienced with Kubernetes, including setting up and configuring clusters with kops and kube-aws. So with all that relevant experience, you'd think I could approach OpenShift-Ansible pretty well... but it's been surprisingly rocky. If I was not so keenly interested for specific reasons, I would have given up already and moved on to other things.

I hope it is clear, I am not trying to be a jerk, only trying to bring in unfiltered feedback because I think the project aims to be accessible and fairly easy to adopt, so it is in the interest of the project to raise this feedback.

I am sure I am doing something wrong...

But... I am filing one GitHub issue here, visibly... but I would say, there are probably a lot of others who have hit issues like this, and just given up and moved on. Maybe this can be smoothed out.

Versions

# "installed" openshift-ansible by git clone then checking out tags.
$ git checkout release-1.5
Already on 'release-1.5'
Your branch is up-to-date with 'github/release-1.5'.

$ git describe
openshift-ansible-3.5.84-1-6-ga23bf82b

$ ansible --version
ansible 2.3.0.0
  config file = <pwd>/ansible.cfg
  configured module search path = Default w/o overrides
  python version = 2.7.13 (default, Apr  4 2017, 08:47:57) [GCC 4.2.1 Compatible Apple LLVM 8.1.0 (clang-802.0.38)]

$ python --version
Python 2.7.13

$ pip freeze | grep boto
boto==2.47.0
boto3==1.4.2
botocore==1.5.45

$ pip freeze | grep -i ssl
backports.ssl-match-hostname==3.5.0.1
passlib==1.6.2
pyOpenSSL==17.0.0

Steps To Reproduce

I am not totally sure; to me it looks like these are just in broken state upon checkout, but maybe I am doing something wrong.

Expected Results

A basically working OpenShift Origin cluster; no red text / failed / unreachable from Ansible.

Observed Results

For every variant I have tried, I get >=1 instance failed or unreachable. This is usually a master node. I am not totally sure but I _think_ that it is just an early error and there could be more things wrong after getting past this error.

Here is an example for release-1.5:

TASK [openshift_docker_facts : Set docker facts] *********************************************************
fatal: [mflo-topaz-openshiftcluster-node-compute-a702a]: FAILED! => {
    "failed": true
}

MSG:

{{ hostvars[groups.oo_first_master.0].openshift.common.portal_net }}: 'dict object' has no attribute 'openshift'

fatal: [mflo-topaz-openshiftcluster-node-compute-d9506]: FAILED! => {
    "failed": true
}

MSG:

{{ hostvars[groups.oo_first_master.0].openshift.common.portal_net }}: 'dict object' has no attribute 'openshift'

fatal: [mflo-topaz-openshiftcluster-node-infra-d33a1]: FAILED! => {
    "failed": true
}

MSG:

{{ hostvars[groups.oo_first_master.0].openshift.common.portal_net }}: 'dict object' has no attribute 'openshift'

PLAY RECAP ***********************************************************************************************
localhost                  : ok=80   changed=17   unreachable=0    failed=0
mflo-topaz-openshiftcluster-master-24827 : ok=2    changed=0    unreachable=1    failed=0
mflo-topaz-openshiftcluster-master-2c771 : ok=71   changed=12   unreachable=0    failed=1
mflo-topaz-openshiftcluster-node-compute-3fb91 : ok=2    changed=0    unreachable=1    failed=0
mflo-topaz-openshiftcluster-node-compute-a702a : ok=68   changed=12   unreachable=0    failed=1
mflo-topaz-openshiftcluster-node-compute-d9506 : ok=68   changed=12   unreachable=0    failed=1
mflo-topaz-openshiftcluster-node-compute-e21de : ok=2    changed=0    unreachable=1    failed=0
mflo-topaz-openshiftcluster-node-infra-19d5e : ok=2    changed=0    unreachable=1    failed=0
mflo-topaz-openshiftcluster-node-infra-d33a1 : ok=68   changed=12   unreachable=0    failed=1
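
When comparing many runs like the one above, it can help to pull the failed and unreachable hosts out of the PLAY RECAP mechanically. The following is a minimal sketch, not part of openshift-ansible: the function names `parse_recap` and `problem_hosts` are hypothetical, and the regular expression assumes the four-counter recap format shown in this report (newer Ansible versions may print additional counters, which this would not match).

```python
import re

# Matches recap lines in the format shown above, e.g.
# "host : ok=2    changed=0    unreachable=1    failed=0"
RECAP_LINE = re.compile(
    r"^(?P<host>\S+)\s*:\s*ok=(?P<ok>\d+)\s+changed=(?P<changed>\d+)\s+"
    r"unreachable=(?P<unreachable>\d+)\s+failed=(?P<failed>\d+)\s*$"
)


def parse_recap(text):
    """Return {host: {"ok": int, "changed": int, "unreachable": int, "failed": int}}."""
    results = {}
    for line in text.splitlines():
        match = RECAP_LINE.match(line.strip())
        if match:
            counts = match.groupdict()
            host = counts.pop("host")
            results[host] = {key: int(value) for key, value in counts.items()}
    return results


def problem_hosts(recap):
    """Return (hosts with failed tasks, unreachable hosts), each sorted by name."""
    failed = sorted(h for h, c in recap.items() if c["failed"] > 0)
    unreachable = sorted(h for h, c in recap.items() if c["unreachable"] > 0)
    return failed, unreachable


# Three lines copied from the recap in this report.
SAMPLE = """\
localhost                  : ok=80   changed=17   unreachable=0    failed=0
mflo-topaz-openshiftcluster-master-24827 : ok=2    changed=0    unreachable=1    failed=0
mflo-topaz-openshiftcluster-master-2c771 : ok=71   changed=12   unreachable=0    failed=1
"""

if __name__ == "__main__":
    failed, unreachable = problem_hosts(parse_recap(SAMPLE))
    print("failed:", failed)
    print("unreachable:", unreachable)
```

Separating the two lists matters here because unreachable hosts (ok=2) point at SSH or networking, while failed hosts got much further and point at the playbooks themselves.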

I can put examples from other versions, looking for advice on which are useful to provide. Thanks :)

Additional Information

OS: macOS Sierra 10.12.5
unmodified inventory file, but did set some environment variables; obfuscated but correspond to real things:
- ec2_vpc_subnet='subnet-xyz' (an existing subnet. this subnet is public, i.e. has a NAT gateway. I have used this subnet successfully with AWS ECS, but more importantly also with a Kubernetes cluster set up w/ kops)
- ec2_keypair='foobar' (an existing keypair from my aws account. I believe I did have to add this to my ssh-agent to get past some initial ssh errors.)
- ec2_security_groups=sg-xyz (existing SG that allows All Traffic from our office IPs, we use this all the time, same SG works with Kubernetes cluster set up via kops, so I don't have a good reason to distrust this)
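
Since a missing variable tends to surface only as an obscure error deep into a run, a small pre-flight check before invoking bin/cluster can rule that out. This is a minimal sketch under assumptions: the three variable names are the ones listed above, whether that list is complete for a given branch's README_AWS is not verified here, and `missing_vars` is a hypothetical helper, not something shipped in the repository.

```python
import os

# Names taken from the list above; completeness for any given branch is an assumption.
REQUIRED_VARS = ("ec2_vpc_subnet", "ec2_keypair", "ec2_security_groups")


def missing_vars(environ=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty, in order."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        raise SystemExit("Set these before running bin/cluster: " + ", ".join(missing))
    print("All expected environment variables are set.")
```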

The command I have been running, for each branch, is:

bin/cluster create aws $CLUSTER_ID --deployment-type origin

I got that command from README_AWS, which is also where I got the instructions on which vars to set.

I want this ticket to be less about my particular issue, and more about wondering if we can make an improvement to the README or something that would avoid issues like this. Let me know what I can do to help more.

lifecycle/rotten

hangtwenty

❤3 👍3 😄1

All 25 comments

Thanks for the issue, seriously. This sounds like one of those interesting cases where several people could benefit from a clear-cut, well-documented solution. So, again, thanks.

Can you share with us your inventory file? As much as you're comfortable with. You can mask hostnames or any other sensitive information as you please. It might help us get to the root of what's going on. Maybe there exists a variable that should be set but we haven't made that clear enough.

tbielawa on 23 Jun 2017

Another thing that may help: please provide the exact ansible-playbook command you have tried running, and make sure you've been running with at least one -v in ansible-playbook, i.e.:

$ ansible-playbook -vv -i my-hosts playbooks/foo/bar.yml

The extra v will assist with introspection.

Looking at the play recap, it seems that you're not getting very far into the runs at all. A normal cluster init will reach into the thousands of OK/changed tasks. As you said, this must be something simple or silly that you shouldn't have to be dealing with.

edit: grammar

tbielawa on 23 Jun 2017

@tbielawa Per the README I haven't been running ansible-playbook actually, I have been running the bin/cluster aws ... commands. ~I was unclear how to flip that into verbose mode, any tips?~ Just realized -vv is supported there so I will re-run with that.

(BTW the bin/cluster interface is kind of picky... i.e. it has to be bin/cluster -vv aws instead of bin/cluster aws -vv ... Wonder if it would be an easier, more flexible interface if done with docopt. Just a side note, speaking of adoptability/accessibility.)
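
To illustrate the flag-ordering point, here is a minimal sketch using Python's standard argparse rather than docopt. It does not reflect how bin/cluster is actually implemented (that was not checked); the program name `cluster-sketch`, the `provider` and `cluster_id` arguments, and the `verbosity` helper are all hypothetical. The idea is to register -v both on the top-level parser and, via a shared parent parser, on the subcommand, using separate destinations so the counts can simply be added.

```python
import argparse


def build_parser():
    # Flags shared by subcommands live in a parent parser (add_help=False so
    # the subcommand does not end up with two -h options).
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("-v", "--verbose", dest="sub_verbose",
                        action="count", default=0)

    parser = argparse.ArgumentParser(prog="cluster-sketch")
    # The same flag on the top-level parser, stored under a different name so
    # neither parser's default can overwrite the other's value.
    parser.add_argument("-v", "--verbose", dest="top_verbose",
                        action="count", default=0)

    subparsers = parser.add_subparsers(dest="provider", required=True)
    aws = subparsers.add_parser("aws", parents=[common])
    aws.add_argument("cluster_id")
    return parser


def verbosity(argv):
    """Total verbosity, no matter which side of the subcommand -v was given."""
    args = build_parser().parse_args(argv)
    return args.top_verbose + args.sub_verbose


if __name__ == "__main__":
    print(verbosity(["-vv", "aws", "demo"]))
    print(verbosity(["aws", "-vv", "demo"]))
```

With only the top-level registration, the second form would be rejected as an unrecognized argument, which matches the pickiness described above.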

Since I am using AWS I am under the impression that it's using the ec2.py dynamic inventory file, in this repository. Is there something else I can dump for pasting here, that would give you the kind of info you are after?

hangtwenty on 24 Jun 2017

More output. This is release-1.5; not sure why it is slightly different this time vs. last (I _may_ have done 1 local edit to hack & populate short_version in the previous run). But in this case I can assure you that my git tree is unmodified from the release-1.5 tag.

$ git status --verbose
On branch release-1.5
Your branch is up-to-date with 'github/release-1.5'.
#...

$ git log
commit a23bf82b8f58b8e4d0ee57b16415b0a380d64d19
Merge: c9d908b3 b4711a16
Author: Scott Dodson <[email protected]>
Date:   Wed Jun 21 13:46:56 2017 -0400

TASK [openshift_master_facts : Set Default scheduler predicates and priorities] **************************
<SNIP>/openshift-ansible/roles/openshift_master_facts/tasks/main.yml:113
fatal: [mflo-topaz-openshiftcluster-master-1b331]: FAILED! => {
    "failed": true
}

MSG:

An unhandled exception occurred while running the lookup plugin 'openshift_master_facts_default_predicates'. Error was a <class 'ansible.errors.AnsibleError'>, original message: Unknown short_version {{ openshift_pkg_version | default('') }}


PLAY RECAP ***********************************************************************************************
localhost                  : ok=76   changed=16   unreachable=0    failed=0
mflo-topaz-openshiftcluster-master-1b331 : ok=91   changed=14   unreachable=0    failed=1
mflo-topaz-openshiftcluster-node-compute-b758e : ok=46   changed=9    unreachable=0    failed=0
mflo-topaz-openshiftcluster-node-compute-fd562 : ok=46   changed=9    unreachable=0    failed=0
mflo-topaz-openshiftcluster-node-infra-3788d : ok=46   changed=9    unreachable=0    failed=0

gist coming shortly

hangtwenty on 24 Jun 2017

I put the full verbose logs in a private pastebin and invited you by your profile email @tbielawa . I scrubbed it a little and I don't think there is anything sensitive, or I wouldn't even share at all 😆 but just to be a little better I did it this way. https://gitlab.com/hangtwenty/openshift-ansible-pastebin/ | https://gitlab.com/hangtwenty/openshift-ansible-pastebin/raw/master/paste-release-1.5.log

hangtwenty on 24 Jun 2017

@tbielawa were you able to access the logs?

hangtwenty on 27 Jun 2017

@hangtwenty re: https://gitlab.com/hangtwenty/openshift-ansible-pastebin/raw/master/paste-release-1.5.log

The page you're looking for could not be found.

https://gitlab.com/hangtwenty/openshift-ansible-pastebin/

doesn't show any files or anything :-\

tbielawa on 27 Jun 2017

@tbielawa I tried to invite you by the profile email on your GitHub profile; you had an active GitLab account for that email (GitLab told me). Are you logged in? If this doesn't work, is there another way I can share with you/maintainers but not with the open internet? Like I said, I really do not think it is sensitive, I am just erring on the side of caution.

hangtwenty on 27 Jun 2017

You could encrypt it to my GPG key.

$ gpg --recv-keys 0333AE37
$ gpg -e --armor -r 0333AE37 <LOGFILE>

and send the <LOGFILE>.asc to [email protected]

tbielawa on 27 Jun 2017

Great call. Sent

hangtwenty on 27 Jun 2017

:+1: Got it. Thanks. I'll be checking it out today!

tbielawa on 28 Jun 2017

Didn't forget about you. Other work side-tracked me yesterday. I'll make another attempt today.

tbielawa on 29 Jun 2017

👍1

I'm gonna add, this is quite a pain. I've done quite some hacking to get it.....sorta working. My biggest issue has been with dnsmasq/skydns. Still trying to get that automated.

But I'd also appreciate a "just works" branch as to me its currently a bit, well, broken on AWS.

ataahua on 12 Jul 2017

> My biggest issue has been with dnsmasq/skydns. Still trying to get that automated.

Can you help me understand what problems you're having with that? I'd like to fix them, maybe best in a separate issue; just @ me and I'll make sure to look at it.

sdodson on 12 Jul 2017

I'm still unclear on what tag or release I should even try - what is expected to be stable. It isn't totally explicit and concrete in the docs. (I get the impression that the most official/expected-stable are release-1.2 and others following that pattern - that's the one tag explicitly linked in the README. But then release-1.4 and release-1.5 both seem broken ... and there is recent activity on these tags (the tags move?) so I don't really understand what's considered most stable.)

hangtwenty on 12 Jul 2017

Speaking of release-1.2 since it is called out explicitly in the README, that one fails _really_ early on

ERROR! Syntax Error while loading YAML.


The error appears to have been in '<SNIPPED>/openshift-ansible/roles/rhel_subscribe/tasks/main.yml': line 56, column 9, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

  when: "{{ deployment_type in [ 'enterprise', 'atomic-enterprise', 'openshift-enterprise' ] and }}"
        not openshift.common.is_atomic | bool
        ^ here

... when the readme links to a tag that fails so quickly, and it is from a syntax error... it really throws off a newcomer to this tool

hangtwenty on 12 Jul 2017

@hangtwenty thanks for the honest feedback, it'll go towards improving our README. Let's break down what we have here and see how to make it clearer

README.md - Getting the correct version

> The master branch tracks our current work in development and should be compatible with the Origin master branch (code in development).

Presently this means that the master branch is tracking work on OpenShift Container Platform 3.6, which is not yet a released product. In reality, even with our clusters of test machines and masses of test engineers, this branch may not be stable.

> In addition to the master branch, we maintain stable branches corresponding to upstream Origin releases, e.g.: we guarantee an openshift-ansible 3.2 release will fully support an origin 1.2 release.
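
The branch-to-version correspondence quoted above (and the `git describe` output earlier in this thread, where release-1.5 reports openshift-ansible-3.5.84-...) can be written down as a quick sanity check on a checkout. This is a minimal sketch: only the 1.2/3.2 and 1.5/3.5 pairs are confirmed in this thread, so treating release-1.N as openshift-ansible-3.N for every N is an assumption, and both function names are hypothetical.

```python
import re


def expected_ansible_prefix(branch):
    """Map 'release-1.N' to the tag prefix 'openshift-ansible-3.N.'.

    Assumes the 1.N <-> 3.N pattern holds for every N; only 1.2 and 1.5
    are confirmed in this thread.
    """
    match = re.fullmatch(r"release-1\.(\d+)", branch)
    if match is None:
        raise ValueError("not a release-1.N branch name: {!r}".format(branch))
    return "openshift-ansible-3.{}.".format(match.group(1))


def describe_matches_branch(branch, describe_output):
    """True if `git describe` output carries the prefix expected for the branch."""
    return describe_output.strip().startswith(expected_ansible_prefix(branch))


if __name__ == "__main__":
    # Value copied from the Versions section of this report.
    print(describe_matches_branch("release-1.5",
                                  "openshift-ansible-3.5.84-1-6-ga23bf82b"))
```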

Unless you want the latest hotness (which, as stated, is for an unfinished product) then you want to select a release-1.x branch. From the response I see from you now which just appeared on my screen, and your original post, it sounds like you have attempted to use several branches but are still receiving errors in them.

You're doing the right thing. I believe that we may be missing some quality assurance steps.

> The most recent branch will often receive minor feature backports and fixes. Older branches will receive only critical fixes.

You observed this and tried to use the advertised 'stable' 1.4 and 1.5 branches ("But then release-1.4 and release-1.5 both seem broken"), and it did not work for you.

# from a 1.2 release branch you saw:
ERROR! Syntax Error while loading YAML.
...

This is completely an error on our side. Our CI testing is less rigorous in older branches because we generally get them to a stable place by the release cut date and then minimize the amount of changes we backport to them over time. We hope that people always use the latest stable release branches; however, as you have pointed out, that is not working correctly for you either.

I imagine in this case that we backported some kind of fix which was marked critical and our reduced test coverage allowed a flakey YAML file to enter. One area in which we could improve this would be to ensure that at least our tox testing runs on the older branches. Our latest Travis configuration would have caught that error immediately and refused to allow the code to merge.

So which branch should you use?

You should use the release-1.5 branch. If you have an error with that branch, such as:

An unhandled exception occurred while running the lookup plugin 'openshift_master_facts_default_predicates'. Error was a <class 'ansible.errors.AnsibleError'>, original message: Unknown short_version {{ openshift_pkg_version | default('') }}

...then we are obliged to look into it and come up with a solution. We are working on finalizing our master branch (3.6 release) presently, so our ability to devote time is in flux.

BTW, I think the error with openshift_pkg_version was fixed recently. I am uncertain off the top of my head if that fix has been backported into the stable release-1.5 branch. (assuming I'm remembering everything correctly)

tbielawa on 12 Jul 2017

I agree with you completely. We are advertising guaranteed support but we
are not delivering it in this case.

tbielawa on 12 Jul 2017

@tbielawa Thanks so much for explaining, this all makes sense (everything you said). Good call on breaking down each piece of information about the versions too. I will think about some clarifications and comment or submit a PR

hangtwenty on 12 Jul 2017

PR submissions instantly earn you Tim Points you can later turn in for
"luxurious prizes".

tbielawa on 12 Jul 2017

😄1

Re:

> BTW, I think the error with openshift_pkg_version was fixed recently. I am uncertain off the top of my head if that fix has been backported into the stable release-1.5 branch. (assuming I'm remembering everything correctly)

On latest release-1.5 I still get,

An unhandled exception occurred while running the lookup plugin 'openshift_master_facts_default_predicates'. Error was a <class 'ansible.errors.AnsibleError'>, original message: Unknown short_version {{ openshift_pkg_version | default('') }}

hangtwenty on 12 Jul 2017

@hangtwenty
Take a look at https://github.com/openshift/openshift-ansible/issues/3397

I had managed to resolve this issue by including "std_include.yml" in "/playbooks/aws/openshift-cluster/config.yml"

- include: ../../common/openshift-cluster/std_include.yml
  tags:
    - always

This way initialize_openshift_version.yml is included which sets up the openshift_pkg_version

Looks like the fix was contributed to the master branch; I'm not sure if it was backported to release-1.5.

hbhargav on 14 Aug 2017

Any updates on this? I tried release-1.4 and got as far as #3397, but then I get:

TASK [openshift_version : fail] **********************************************************
fatal: [openshift-doot-master-b7055]: FAILED! => {"changed": false, "failed": true, "msg": "Package origin not found"}
    to retry, use: --limit @/home/kynan/workspace/openshift-ansible/playbooks/aws/openshift-cluster/launch.retry

PLAY RECAP *******************************************************************************
localhost                  : ok=78   changed=15   unreachable=0    failed=0   
openshift-doot-master-b7055 : ok=47   changed=7    unreachable=0    failed=1   
openshift-doot-node-compute-9c2d0 : ok=37   changed=6    unreachable=0    failed=0   
openshift-doot-node-compute-cc865 : ok=37   changed=6    unreachable=0    failed=0   
openshift-doot-node-infra-6c416 : ok=37   changed=6    unreachable=0    failed=0   

ACTION [create] failed: Command 'ansible-playbook  -i inventory/aws/hosts -e 'num_masters=1 enable_excluders=false num_nodes=2 cluster_id=openshift-doot cluster_env=dev num_etcd=0 num_infra=1 deployment_type=origin' playbooks/aws/openshift-cluster/launch.yml' returned non-zero exit status 2

ublubu on 6 Nov 2017

👍1

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot on 19 May 2020

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot on 18 Jun 2020
