The sample_texts() method is supposed to return sample documents but if an implementation derives from TextCorpus and overrides get_texts(), this is completely ignored by the implementation of sample_texts() which uses getstream() only.
This applies to the most recent develop (58d560b545e6df4cfc5fd3879f8647ba3a7a0e3b) and maste
(885430d136c87c613ab58ad6b1dc55fee47a43c7) branches.
Hello @johann-petrak
sample_texts use getstream only, but what's a problem to define getstream?The commits just identify the versions of the repository I had been looking at for later. Essentially, it means I am talking of the version as of today.
As I said in the original report the problem is that the documentation claims that in order to make things work correctly, get_texts() need to get implemented. One reasonably would then expect that by doing this the method sample_texts() will also work correctly, using that implementation, when in fact it does not. That would be the whole point of inheritance.
I cannot see a reason why sample_texts() should not use get_texts() especially since the subclass may also choose to not use the preprocess_text method or not in the same way. If the implementation of get_texts() is what control the behaviour, all methods should be based on it and not some methods on arbitrary or default behaviour that would need to change the whole way of how work is distributed.
@johann-petrak I partially agree with you, the main difference between get_texts and getstream in pre-processing step (subclass should override it user don't want to use it). Unfortunately, if we try to fix it now - we'll break backward compatibility.
From the other hand, I don't see any serious problems with the current implementation (but looks slightly strange, I agree).
CC: @piskvorky @vlejd wdyt about it?
Make sample_texts() use the overridden text_texts() would be the correct way and expected way for this to work, but if the risk of breaking backwards compatibility is too high then at least it would be necessary to document this and advise overriding/re-implementing sample_texts() if necessary.
I did not really need sample_texts() I merely used it when I was testing my code and when it returned something unexpected I thought my code was broken and needed some time to figure out what is going on.
Looks like a bug to me.
@menshikh-iv what backward compatibility risks do you see here?
@piskvorky I read code more carefully and found that sample_text repeat same work as get_texts (like pre-processing), need to test it and compare, but probably, this will be fine (and don't break something), but I'm not 100% sure now.
Only one thing: this will be slower (because in the current implementation, sample_text using preprocess_text only if this document will be yielded, but if we switched to get_texts, we'll pre-process documents always, that can seriously affect to performance in the case when we have large dataset.
I met it too.
for text in wiki.get_texts():
AttributeError: 'tuple' object has no attribute 'get_texts'
I hope you can fix it, thanks for your hard work