A user on chat tried to import the WP XML file from their self-hosted WordPress installation. On their end the import never finishes, but on my end it shows as finished. I've downloaded the XML file and tried to import it again myself.
I ran it through the terminal and got this error:
warning: failed to load external entity "rogerbowman-wordpress-2017-04-12.xml"
When I import the file, I get this error:
Sorry, there has been an error.
This does not appear to be a WXR file, missing/invalid WXR version number
This was fixed in p2UL9c-3pY-p2, with this comment:
There were some weird characters in some comments in the import file. I was able to find and fix them using xmllint.
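For anyone without xmllint at hand, here is a minimal Python sketch of how such characters could be located. It assumes the offending characters are control characters that XML 1.0 disallows; the function name and the sample string are made up for illustration and are not part of the importer.

```python
import re

# Assumption: the "weird characters" are C0 control characters, which XML 1.0
# forbids except for tab (0x09), line feed (0x0A), and carriage return (0x0D).
ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def find_illegal_chars(text):
    """Return a (line, column, codepoint) tuple for each forbidden character."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in ILLEGAL_XML_CHARS.finditer(line):
            # Columns are 1-based to match what most editors display.
            hits.append((line_no, match.start() + 1, hex(ord(match.group()))))
    return hits

# Hypothetical fragment standing in for a comment in an export file.
sample = "<wp:comment_content>buy now\x01</wp:comment_content>\n<title>ok</title>\n"
for line_no, column, codepoint in find_illegal_chars(sample):
    print(f"line {line_no}, column {column}: {codepoint}")
```

For a real export file, the text would need to be read first, for example with open(path, encoding="utf-8", errors="replace").read().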
I've had a few cases similar to this, where it shows as finished on our end, but the user sees an indicator saying it's still importing.
Just noting another case of this.
The restapi_import_manager_data isn't being cleared or updated properly when the import gets stuck. Figuring out which conditions lead to that state and fixing them is likely the root-cause fix here.
I think 849487-zen is a similar case. I got the same error messages:
In Terminal:
warning: failed to load external entity "pogadajmyopeerelu.wordpress.2017-12-15.xml"
When I import the file, I get this error:
Sorry, there has been an error.
This does not appear to be a WXR file, missing/invalid WXR version number
Another instance here.
https://en.forums.wordpress.com/topic/import-167/#post-3166764
I tested the export file from the original report and confirmed that the importer gets stuck when I try to import it. I get a spinner that never goes away.
When I check the import job, I get the same error message reported above:
Sorry, there has been an error.
This does not appear to be a WXR file, missing/invalid WXR version number
I checked the import file and the underlying issue is ASCII control characters (SOH or <0x01>) included in spam comments in the export file. In another import file reported here, there was a single character (SYN or <0x16>) in a comment in the export file. Removing these characters resolves the issue so the import succeeds.
Could we handle these ASCII characters (e.g. find and strip them) before trying to complete the import?
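To make the suggestion concrete, here is a minimal sketch in Python of a strip step that would run before parsing. It is not the importer's code: the function name and the sample fragment are hypothetical, and it assumes the only problem characters are the control characters XML 1.0 disallows.

```python
import re
import xml.etree.ElementTree as ET

# C0 control characters that XML 1.0 does not allow (tab, LF, and CR are legal).
ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def strip_illegal_chars(text):
    """Return the cleaned text and the number of characters removed."""
    return ILLEGAL_XML_CHARS.subn("", text)

# Made-up fragment standing in for an export file with SOH and SYN in a comment.
raw = "<comment>spam\x01 text\x16</comment>"

try:
    ET.fromstring(raw)
except ET.ParseError as err:
    # The unmodified input is not well-formed XML, so the parser refuses it.
    print("raw input rejected:", err)

cleaned, removed = strip_illegal_chars(raw)
print(removed, "character(s) removed; parsed text:", ET.fromstring(cleaned).text)
```

Returning the removal count alongside the cleaned text would let the importer log or surface that the file was altered, rather than changing user content silently.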
cc @Automattic/delta-samus and @Automattic/tanooki (fyi as you're working on importers)
I removed the [Pri] High label as this doesn't seem to occur frequently and the issue has existed for some time. Delta does have this on our radar now though, so we will look into how/when we can address this.
@bisko @pablinos @andfinally Do you think this is a relatively quick fix, and something we might be able to work into one of our current/upcoming importer projects?
@mattsherman I wouldn't call it a "relatively quick fix", as there are some edge cases that could undermine our working theory. For example, broken characters that appear in serialized data (post meta and such) can cause issues with the imported data. We have to be careful to account for those cases.
That said, we can try to scope out the fix and put a preliminary version in place to test with until we've caught most issues, then ship it to the general public.
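A small Python sketch of why serialized data is a concern, assuming the problem is the byte-length prefix that PHP's serialize() writes for strings (s:<length>:"<value>";). Stripping a character from the value without updating the prefix leaves a string that no longer matches its declared length. The helpers are hypothetical and only handle a single top-level serialized string, not arrays or nested values.

```python
import re

# Matches only the simplest case: one PHP-serialized string,
# written as s:<byte length>:"<payload>";
SERIALIZED_STRING = re.compile(rb'^s:(\d+):"(.*)";$', re.DOTALL)

def length_is_consistent(serialized):
    """True if the declared byte length equals the actual payload length."""
    match = SERIALIZED_STRING.match(serialized)
    if match is None:
        raise ValueError("not a simple serialized string")
    return int(match.group(1)) == len(match.group(2))

def rewrite_length(serialized):
    """Recompute the length prefix from the payload that is actually present."""
    match = SERIALIZED_STRING.match(serialized)
    if match is None:
        raise ValueError("not a simple serialized string")
    payload = match.group(2)
    return b's:%d:"%s";' % (len(payload), payload)

# Hypothetical post meta value containing an SOH byte.
original = b's:6:"ab\x01cde";'
stripped = original.replace(b"\x01", b"")

print(length_is_consistent(original))   # prefix and payload agree
print(length_is_consistent(stripped))   # prefix is now one byte too long
print(rewrite_length(stripped))         # prefix recomputed after stripping
```

Real post meta is usually nested arrays, so a production fix would more likely need to unserialize, clean, and reserialize than patch prefixes with a regular expression.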
Thanks @bisko.
Tracking backend investigation and work here: https://github.com/Automattic/samus-private/issues/34
I'm going to close this one as I'm not sure this is still happening. We've made quite a few improvements to the Calypso importers, especially around broken characters that occur in some XML files.