Presto: Infinite loop during planning in 0.193

Created on 1 Feb 2018  ·  17Comments  ·  Source: prestodb/presto

Hi,

I use presto 0.193 + HDP 2.5.3 + hive connector.

Infinite loop during planning for the following query occurs.

This error doesn't occur in presto 0.192.

According to the 0.193 release note(https://prestodb.io/docs/current/release/release-0.193.html), Infinite loop has already been fixed, but it seems not to fix.

Fix an infinite loop during planning for queries containing non-trivial predicates.

I'm sorry not to create simple query which reproduces the issue...

...
FROM
(
    SELECT
...
    FROM
        (SELECT * FROM partitioned_table1 WHERE dt = date_format(date_add('day', -1, now()), '%Y%m%d') AND ...) aa
    JOIN
        (SELECT * FROM partitioned_table2 WHERE dt = date_format(date_add('day', -1, now()), '%Y%m%d')) bb
    ON
        aa.some_id = bb.id
    JOIN    
        (SELECT * FROM partitioned_table3 WHERE dt = date_format(date_add('day', -1, now()), '%Y%m%d') AND ...) cc
    ON
        aa.some_id = cc.id
    JOIN
        (SELECT * FROM partitioned_table4 WHERE dt = date_format(date_add('day', -1, now()), '%Y%m%d') AND ...) dd
    ON
        aa.some_id = dd.some_id
    CROSS JOIN
        (
            SELECT
...
            FROM
                non_partitioned_table
            WHERE
                yyyymmdd <= date_format(date_add('day', -1, now()), '%Y%m%d')
            AND
                (end_date_of_month_yn = 'Y' OR yyyymmdd = date_format(date_add('day', -1, now()), '%Y%m%d'))
        ) t
) m
JOIN
(
    SELECT
        substring(dt,1,6) AS yyyymm,
        some_id,
        sum(IF(platform='ALL', pv, 0)) AS pv_total
    FROM 
        partitioned_table5
    GROUP BY
        substring(dt,1,6),
        some_id
) a
ON
    m.yyyymm = a.yyyymm
AND
    m.some_id = a.some_id
LEFT JOIN
(
    SELECT
        substring(dt,1,6) AS yyyymm,
        some_id,
        sum(IF(platform='ALL', pv, 0)) AS pv_total
    FROM 
        partitioned_table5
    GROUP BY
        substring(dt,1,6),
        some_id
) b
ON
    m.yyyymm_1m_ago = b.yyyymm
AND
    m.some_id = b.some_id
LEFT JOIN
(
    SELECT
        substring(dt,1,6) AS yyyymm,
        some_id,
        count(*) AS cnt
    FROM
        partitioned_table6
    WHERE
        status = 1
    GROUP BY
        substring(dt,1,6),
        some_id
) c
ON
    m.yyyymm = c.yyyymm
AND
    m.some_id = c.some_id
LEFT JOIN
(
    SELECT
        substring(dt,1,6) AS yyyymm,
        some_id,
        count(*) AS cnt
    FROM
        partitioned_table6
    WHERE
        status = 1
    GROUP BY
        substring(dt,1,6),
        some_id
) d
ON
    m.yyyymm_1m_ago = d.yyyymm
AND
    m.some_id = d.some_id
LEFT JOIN
    partitioned_table7 e
ON
    m.yyyymmdd_end = e.dt
AND
    m.some_id2 = e.some_id3
LEFT JOIN
    partitioned_table7 f
ON
    m.yyyymmdd_end_1m_ago = f.dt
AND
    m.some_id2 = f.some_id3
WHERE
    m.yyyymm = '201801'

Thanks

Most helpful comment

Yes!
I confirmed martint:picklayout1 branch solved my problem although PickTableLayoutForPredicate metrics are not so different from 0.193.

My tables have about 500 - 2000 partitions.

All 17 comments

In 0.193 we fixed https://github.com/prestodb/presto/issues/9362. Does disabling iterative optimizer help? If so can you please post an EXPLAIN result as gist. Can you please tell me what are the JMX Rule stats before and after running your query?

Thank you for comment!

Does disabling iterative optimizer help? If so can you please post an EXPLAIN result as gist.

Yes, I posted.
https://gist.github.com/wyukawa/181599ddb1b923b6d2cb4bb3f0e6f996

What about JMX stats? Can you leave predicates in ScanFilterProject?

JMX stats after I start presto cluster and execute query

When I try in 0.192, JMX stats is here

ScanFilterProject

                                                        - ScanFilterProject[table = ..., originalConstraint = ((("dt" = CAST('20180131' AS varchar)) AND (CAST("..." AS integer) <> 1)) AND (CAST("..." AS integer) = 1)), filterPredicate = (("..." = "$literal$tinyint"(BIGINT '1')) AND ("is_
                                                        - ScanFilterProject[table = ..., originalConstraint = (("dt_146" = CAST('20180131' AS varchar)) AND (CAST("status" AS integer) = 1)), filterPredicate = ("status" = "$literal$tinyint"(BIGINT '1'))] => [id_143:bigint, $hashvalue_689:bigint]
                                                    - ScanFilterProject[table = ..., originalConstraint = ((("dt_273" = CAST('20180131' AS varchar)) AND (CAST("auth_267" AS integer) = 0)) AND (CAST("..." AS integer) <> 0)), filterPredicate = (("auth_267" = "$literal$tinyint"(BIGINT '0')) AND ("domai
                                                                        - ScanFilterProject[table = ...., originalConstraint = ("substring"("dt_409", BIGINT '1', BIGINT '6') = CAST('201801' AS varchar)), filterPredicate = ("substring"("dt_409", BIGINT '1', BIGINT '6') = CAST('201801' AS varchar))
                                        - ScanFilterProject[table = ..., originalConstraint = ((("yyyymmdd" <= CAST('20180131' AS varchar)) AND (("end_date_of_month_yn" = CAST('Y' AS varchar)) OR ("yyyymmdd" = CAST('20180131' AS varchar)))) AND ("yyyymm" = CAST('201801' AS varchar))), filterPredicate = ((
                                                - ScanFilterProject[table = ..., originalConstraint = ((CAST("status_464" AS integer) = 1) AND ("substring"("dt_468", BIGINT '1', BIGINT '6') = CAST('201801' AS varchar))), filterPredicate = (("status_464" = "$literal$tinyint"(BIGINT '1')) AND ("substring"(
                                    - ScanFilterProject[table = ..., originalConstraint = (CAST("status_509" AS integer) = 1), filterPredicate = ("status_509" = "$literal$tinyint"(BIGINT '1'))] => [substring_521:varchar, ...:bigint]
- ScanFilterProject[table = ..., originalConstraint = ("dt_543" <= CAST('20180131' AS varchar)), filterPredicate = ("dt_543" <= CAST('20180131' AS varchar))] => [...:bigint, ....:bigint, ....:double, dt_543:varchar, $hashvalue_728:bigint]

Are there informative?
If you want more information, let me know.

Thanks

Thanks, it gives me a lot!

Can you take a look at other rules stats? Here, stats you posted says that PickTableLayout was called 24 times with 10 success, max time of rule application was about 68ms. Please look for a rule which either have really high number of Time.count and hits or/and really long Time.max. It looks like this time it is a different rule.

Just for test reason, can you please increase experimental.iterative-optimizer-timeout to something like an hour, to see if IterativeOptimizer ends eventually. That would mean there is no cycle, but there is something to optimize in optimizer.

How many partitions do you have in the tables you are using? There is a fix for recent issue: https://github.com/prestodb/presto/pull/9879 which fixed planing time for queries with tables with thousands of partitions.

Can test if #9879 solves your problem?

Yes!
I confirmed martint:picklayout1 branch solved my problem although PickTableLayoutForPredicate metrics are not so different from 0.193.

My tables have about 500 - 2000 partitions.

I confirmed martint:picklayout1 branch solved my problem

Great!

PickTableLayoutForPredicate metrics are not so different from 0.193.

It is expected, that stats are OK.

I believe that this will be merged soon and well as 0.194 will be soon released because of that.

Fixed in 0.194.

i think we're still seeing this on 0.194-SNAPSHOT, where queries stay in PLANNING and never progress, built from e42d339 using three partitions and FileHiveMetastore on s3. running around 20,000 total partitions. with the top level partition having highest cardinality, effectively id varchar, ymd varchar, h varchar. performing a query seems to endlessly perform ListObjects against the s3 bucket even long after the query has been cancelled, the only thing that kills it is a restart or the iterative optimiser timeout, disabling the iterative optimiser doesn't seem to help for us.

i just realised this issue may not be the right place, as the previous build we were seeing this on was 0.167

@arussellsaw indeed a lot has changed since 0.167. Would you mind reporting new issue for the problem you described? Did it work in 0.193? If possible, try to get some more information (JMX Rule stats before and after running your query, partitioning columns cardinality, etc.).

@wyukawa.

Can you tell me how to export the jmx of preto and view it with some tools?

Now, I need to connect the jmx catalog with the command “./presto --schema current --catalog jmx --server localhost:8000” and select tables one by one.

Thanks.

for example,

SELECT * FROM jmx.current."com.facebook.presto.sql.planner.iterative:name=iterativeoptimizer,rule=picktablelayoutforpredicate" LIMIT 100

@wyukawa

Thanks. What viewer are you using?

You can also use Java VisualVM but you may need to set jmx.rmiregistry.port in this case.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

tesseract2048 picture tesseract2048  ·  3Comments

jeremy-degroot picture jeremy-degroot  ·  3Comments

tomz picture tomz  ·  3Comments

zsaltys picture zsaltys  ·  4Comments

shigechuanqi picture shigechuanqi  ·  3Comments