Apache Airflow version: 1.10.10 (also present on current master)
Environment:
uname -a): 4.14.173-137.229.amzn2.x86_64 amazon linux 2What happened:
A subprocess launched by s3_file_transform was killed with Out-of-memory.
Consequently, a SIGKILL was issued by operating system.
The operator however records a _success_ for this task, even though the subprocess _failed_ .
There was no log-entry of any failure in the s3_file_transform operator.
Subsequent tasks of the DAG were launched, as of cause airflow itself did not catch, that a previous step actually did not successfully run.
As a result, the DAG produced worthless results.
What you expected to happen:
According to https://docs.python.org/3/library/subprocess.html#subprocess.CompletedProcess.returncode
a _negative_ exit code can returned by the subprocess module of python that is used in the operator. This happens in this very situation, as:
A negative value -N indicates that the child was terminated by signal N (POSIX only).
However
https://github.com/apache/airflow/blob/1ed171bfb265ded8674058bdc425640d25f1f4fc/airflow/providers/amazon/aws/operators/s3_file_transform.py#L157 checks _only_ if the exit code is >0, whereas by above mentioned documentation something can be unsuccessful when exit code<0 is encountered.
This is completely expected and POSIX compliant - simply that check seems incorrect.
A successful command returns a 0, while an unsuccessful one returns a non-zero value that usually can be interpreted as an error code
I would recommend to change it to
if process.returncode:
(which means, it will catch any non-zero return code)
(a pull request will be supplied referencing this issue)
How to reproduce it:
s3_file_transform operatorAnything else we need to know:
How we found it
We found this "lapse" by adding several debug messages to subprocess - and we noticed that after fetching data, it just.... stopped.
The issue disappeared when we increased the instance system to m5.xlarge the issue disappeared.
We then changed the above statement, and suddenly saw, that a SIGKILL stopped the process.
Other operators and subprocess use
Several other operators seem to use subprocess correctly.
https://github.com/apache/airflow/search?q=returncode+%3E+0&unscoped_q=returncode+%3E+0
Usage of checking whether the exit code is >0 seems to be also done in the gcs operator
Thanks for opening your first issue here! Be sure to follow the issue template!
Nice. Would you like to make a PR for that @blackw1ng? it's rather easy to fix and I am happy to review it.
Sorry - i wasnt entirely sure, if this project accepts PRs without an issue - so i basically simultaneously did both. Thanks for the quick review (of this rather trivial fix) - i guess this can be resolved as "fixed".