ORC Bloom filter support is broken in the latest Presto release. Either we should fix this in the affected earlier release branches as well, or we should state in the release notes that queries on ORC tables with Bloom filters will not take advantage of them.
The support has been broken since Presto release 0.214.
After the StreamId changes in the readBloomFilterIndexes method of the StripeReader class, the Bloom filter no longer skips ORC row groups that cannot satisfy the predicate. This is due to a coding bug: the line below always returns null.
StripeReader.java
List<HiveBloomFilter> bloomFilters = bloomFilterIndexes.get(entry.getKey());
@kevinwilfong
@dain Please have a look. This has an impact on Presto ORC performance.
@dilipkasana Thanks for reporting this performance regression. To confirm, do you believe that it was introduced in this commit: https://github.com/prestodb/presto/commit/87f7e51c99963787f71027e749cf1e10815180fb (PR #11319)? Any chance you'd be interested in submitting a fix?
CC: @wenleix
CC: @yingsu00 @oerling @tdcmeehan
Yes, this was introduced in commit 87f7e51.
@dilipkasana Dilip, would you like to submit a PR with the fix?
I have already submitted a pull request for this: https://github.com/prestodb/presto/pull/12901
@dilipkasana
"After the StreamId changes in the readBloomFilterIndexes method of the StripeReader class, the Bloom filter no longer skips ORC row groups that cannot satisfy the predicate. This is due to a coding bug: the line below always returns null."
I am curious why that is, since StreamId just contains column, sequence (always 0 for ORC), and streamKind (which should be the same for the same column), right?
There might be something wrong with StreamId that keeps it from working as a HashMap key, although I didn't see anything obviously wrong with its hashCode and equals methods.
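For discussion, here is a minimal sketch of one way a lookup like this could miss every time even when hashCode and equals are correct: the map is populated with keys of one stream kind and queried with keys of another. StreamKey below is a hypothetical stand-in, not the real StreamId, and the stream kind names are used only as illustrative strings. I have not confirmed that this is what happens in StripeReader, so a test against the actual reader would still be needed.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

final class StreamKeyLookupSketch
{
    // Hypothetical composite key (column, sequence, streamKind); NOT the real StreamId.
    static final class StreamKey
    {
        final int column;
        final int sequence;
        final String streamKind;

        StreamKey(int column, int sequence, String streamKind)
        {
            this.column = column;
            this.sequence = sequence;
            this.streamKind = streamKind;
        }

        @Override
        public boolean equals(Object o)
        {
            if (this == o) {
                return true;
            }
            if (!(o instanceof StreamKey)) {
                return false;
            }
            StreamKey other = (StreamKey) o;
            return column == other.column
                    && sequence == other.sequence
                    && streamKind.equals(other.streamKind);
        }

        @Override
        public int hashCode()
        {
            return Objects.hash(column, sequence, streamKind);
        }
    }

    // Builds a map keyed the way a Bloom filter index might be: by the BLOOM_FILTER stream.
    static Map<StreamKey, List<String>> bloomFiltersFor(int column)
    {
        Map<StreamKey, List<String>> bloomFilterIndexes = new HashMap<>();
        bloomFilterIndexes.put(new StreamKey(column, 0, "BLOOM_FILTER"), List.of("filter-for-row-group-0"));
        return bloomFilterIndexes;
    }

    public static void main(String[] args)
    {
        Map<StreamKey, List<String>> bloomFilterIndexes = bloomFiltersFor(3);

        // Same column and sequence, but a different stream kind: the key is not equal, so get() misses.
        System.out.println(bloomFilterIndexes.get(new StreamKey(3, 0, "ROW_INDEX"))); // null

        // A key that matches all three fields finds the entry.
        System.out.println(bloomFilterIndexes.get(new StreamKey(3, 0, "BLOOM_FILTER"))); // [filter-for-row-group-0]
    }
}
```

If something like this is the cause, the key and its equals and hashCode are behaving as designed; the mismatch would be in which key is used for the lookup.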
Ignoring sequence would cause the Bloom filter to not work correctly for DWRF flat maps.
Echoing Maria's comment on the PR and Wenlei's comment here, could you add a test that demonstrates the problem? It's not immediately obvious to me why that line would always return null. I'm also concerned this fix wouldn't work correctly for flat maps.