dvc 🚀 - Optimize stage collection for large repos

@danfischetti Could you please post a cProfile log from your specific use case so we can see more clearly what is slowing you down? :slightly_smiling_face:

efiop on 25 Oct 2019

@danfischetti Also, did the change from yesterday make any difference for you?

efiop on 25 Oct 2019

I just updated to 0.65.0 and I don't see any difference.

In [7]: cProfile.run("repo.collect_stages()")
^C         329107531 function calls (317924667 primitive calls) in 113.090 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    33595    0.019    0.000    0.048    0.000 <frozen importlib._bootstrap>:997(_handle_fromlist)
        1    0.000    0.000  113.095  113.095 <string>:1(<module>)
    99792    0.029    0.000    0.051    0.000 <string>:12(__new__)
     5764    0.022    0.000    0.980    0.000 __init__.py:100(loadd_from)
    17293    0.004    0.000    0.004    0.000 __init__.py:110(tree)
215138/5764    0.225    0.000    0.447    0.000 __init__.py:319(convert_to_unicode)
     3389    0.008    0.000    0.217    0.000 __init__.py:350(dvc_walk)
     3388    0.007    0.000   49.910    0.015 __init__.py:371(_filter_out_dirs)
     3444    3.412    0.001   49.902    0.014 __init__.py:373(filter_dirs)
        1    0.042    0.042  113.095  113.095 __init__.py:382(collect_stages)
     5764    0.015    0.000    0.316    0.000 __init__.py:453(relpath)
     8151    0.045    0.000    1.234    0.000 __init__.py:52(_get)
     5764    0.024    0.000    0.856    0.000 __init__.py:60(_get)
     5764    0.014    0.000    1.305    0.000 __init__.py:71(loadd_from)
    89676    0.101    0.000    0.221    0.000 _collections_abc.py:879(__iter__)
    13915    0.010    0.000    0.103    0.000 _collections_abc.py:966(append)
   130306    0.046    0.000    0.046    0.000 _weakrefset.py:70(__contains__)
   130306    0.055    0.000    0.101    0.000 abc.py:178(__instancecheck__)
    13915    0.018    0.000    0.020    0.000 base.py:115(_check_requires)
    11528    0.003    0.000    0.003    0.000 base.py:127(scheme)
     5764    0.005    0.000    0.009    0.000 base.py:142(cache)
    71962    0.038    0.000    0.400    0.000 base.py:150(supported)
    71962    0.052    0.000    0.362    0.000 base.py:158(supported)
    13915    0.019    0.000    0.086    0.000 base.py:419(_validate_output_path)
    13915    0.057    0.000    1.450    0.000 base.py:72(__init__)
    13915    0.042    0.000    0.082    0.000 base.py:92(__init__)
     5764    0.002    0.000    0.002    0.000 codecs.py:259(__init__)
     5764    0.005    0.000    0.007    0.000 codecs.py:308(__init__)
    17386    0.015    0.000    0.025    0.000 codecs.py:318(decode)
    17386    0.003    0.000    0.003    0.000 codecs.py:330(getstate)
    28183    0.005    0.000    0.005    0.000 comments.py:100(__init__)
    28183    0.004    0.000    0.004    0.000 comments.py:108(set_block_style)
    28183    0.011    0.000    0.011    0.000 comments.py:126(__init__)
    65153    0.024    0.000    0.024    0.000 comments.py:132(add_kv_line_col)
    13915    0.007    0.000    0.007    0.000 comments.py:159(add_idx_line_col)
    13915    0.015    0.000    0.027    0.000 comments.py:182(ca)
    28183    0.030    0.000    0.060    0.000 comments.py:269(fa)
   135434    0.084    0.000    0.165    0.000 comments.py:304(lc)
    28183    0.020    0.000    0.121    0.000 comments.py:311(_yaml_set_line_col)
    65153    0.038    0.000    0.114    0.000 comments.py:316(_yaml_set_kv_line_col)
    13915    0.008    0.000    0.026    0.000 comments.py:320(_yaml_set_idx_line_col)
    33594    0.146    0.000   16.608    0.000 comments.py:353(copy_attributes)
    34016    0.043    0.000    3.137    0.000 comments.py:381(__init__)
    89676    0.040    0.000    0.040    0.000 comments.py:385(__getsingleitem__)
    22419    0.010    0.000    0.010    0.000 comments.py:410(__len__)
    13915    0.023    0.000    0.079    0.000 comments.py:414(insert)
     8504    0.005    0.000    0.005    0.000 comments.py:423(extend)
     8504    0.004    0.000    0.004    0.000 comments.py:46(__init__)
     8504    0.042    0.000   16.621    0.002 comments.py:476(__deepcopy__)
    19679    0.004    0.000    0.004    0.000 comments.py:563(__init__)
    84832    0.051    0.000    0.089    0.000 comments.py:610(__iter__)
    39358    0.048    0.000    0.048    0.000 comments.py:635(__init__)
    13915    0.002    0.000    0.002    0.000 comments.py:64(items)
   210012    0.085    0.000    0.106    0.000 comments.py:744(__getitem__)
   130306    0.126    0.000    0.201    0.000 comments.py:754(__setitem__)
   232430    0.106    0.000    0.106    0.000 comments.py:773(__contains__)
    48499    0.023    0.000    0.082    0.000 comments.py:777(get)
    31207    0.029    0.000    0.034    0.000 comments.py:794(__delitem__)
    84832    0.026    0.000    0.026    0.000 comments.py:814(__iter__)
    84832    0.020    0.000    0.020    0.000 comments.py:819(_keys)
    39358    0.016    0.000    0.016    0.000 comments.py:824(__len__)
    19679    0.010    0.000    0.014    0.000 comments.py:898(items)
19679/5764    0.128    0.000   17.277    0.003 comments.py:942(__deepcopy__)
    13916    0.002    0.000    0.002    0.000 compat.py:178(<lambda>)
     5764    0.012    0.000    0.021    0.000 compat.py:252(version_tnf)
    89676    0.064    0.000    0.120    0.000 compat.py:266(__getitem__)
149985/5764    0.584    0.000   33.287    0.006 composer.py:109(compose_node)
   121802    0.341    0.000    2.292    0.000 composer.py:142(compose_scalar_node)
     8504    0.064    0.000   24.126    0.003 composer.py:161(compose_sequence_node)
19679/5764    0.241    0.000   32.750    0.006 composer.py:194(compose_mapping_node)
    28183    0.006    0.000    0.006    0.000 composer.py:228(check_end_doc_comment)
     5764    0.005    0.000    0.008    0.000 composer.py:33(__init__)
   817820    0.527    0.000    1.524    0.000 composer.py:40(parser)
   449955    0.295    0.000    0.859    0.000 composer.py:47(resolver)
     5764    0.022    0.000   34.501    0.006 composer.py:70(get_single_node)
     5764    0.013    0.000   33.408    0.006 composer.py:95(compose_document)
     5764    0.012    0.000   36.254    0.006 constructor.py:106(get_single_data)
   121802    0.084    0.000    0.097    0.000 constructor.py:1063(construct_scalar)
     5764    0.016    0.000    1.707    0.000 constructor.py:114(construct_document)
   104510    0.055    0.000    0.154    0.000 constructor.py:1266(construct_yaml_str)
149985/45826    0.256    0.000    1.420    0.000 constructor.py:128(construct_object)
     8504    0.034    0.000    1.093    0.000 constructor.py:1281(construct_rt_sequence)
    19679    0.043    0.000    0.050    0.000 constructor.py:1306(flatten_mapping)
19679/5764    0.203    0.000    1.579    0.000 constructor.py:1393(construct_mapping)
    17008    0.024    0.000    1.201    0.000 constructor.py:1528(construct_yaml_seq)
39358/11528    0.151    0.000    1.665    0.000 constructor.py:1538(construct_yaml_map)
    28183    0.031    0.000    0.119    0.000 constructor.py:1546(set_collection_style)
    17292    0.009    0.000    0.042    0.000 constructor.py:1729(construct_yaml_bool)
    65153    0.028    0.000    0.062    0.000 constructor.py:254(check_mapping_key)
    17292    0.016    0.000    0.033    0.000 constructor.py:443(construct_yaml_bool)
     5764    0.016    0.000    0.041    0.000 constructor.py:60(__init__)
     5764    0.005    0.000    0.034    0.000 constructor.py:75(composer)
9997354/5764    7.516    0.000   17.306    0.003 copy.py:132(deepcopy)
  7456429    0.526    0.000    0.526    0.000 copy.py:190(_deepcopy_atomic)
  2311178    2.280    0.000    7.435    0.000 copy.py:210(_deepcopy_list)
    33594    0.069    0.000    0.345    0.000 copy.py:219(_deepcopy_tuple)
    33594    0.024    0.000    0.273    0.000 copy.py:220(<listcomp>)
100782/67188    0.981    0.000   15.075    0.000 copy.py:236(_deepcopy_dict)
  2540925    0.979    0.000    1.345    0.000 copy.py:252(_keep_alive)
    67188    0.212    0.000   15.901    0.000 copy.py:268(_reconstruct)
   134376    0.037    0.000    0.122    0.000 copy.py:273(<genexpr>)
    13915    0.011    0.000    0.014    0.000 copy.py:66(copy)
    67188    0.028    0.000    0.039    0.000 copyreg.py:87(__newobj__)
   512368    0.131    0.000    0.131    0.000 error.py:30(__init__)
     5764    0.003    0.000    0.005    0.000 events.py:112(__init__)
   121802    0.081    0.000    0.178    0.000 events.py:125(__init__)
   201224    0.075    0.000    0.075    0.000 events.py:17(__init__)
   149985    0.068    0.000    0.124    0.000 events.py:42(__init__)
    28183    0.019    0.000    0.045    0.000 events.py:51(__init__)
     5764    0.005    0.000    0.008    0.000 events.py:80(__init__)
     5764    0.004    0.000    0.006    0.000 events.py:93(__init__)
     5764    0.016    0.000    0.034    0.000 fnmatch.py:48(filter)
        8    0.000    0.000    0.000    0.000 future.py:47(__del__)
     5764    0.005    0.000    0.042    0.000 genericpath.py:16(exists)
     5764    0.006    0.000    0.020    0.000 genericpath.py:27(isfile)
     5764    0.012    0.000    0.021    0.000 genericpath.py:69(commonprefix)
    11528    0.174    0.000    0.229    0.000 glob.py:114(_iterdir)
    34584    0.017    0.000    0.067    0.000 glob.py:145(has_magic)
    11528    0.005    0.000    0.005    0.000 glob.py:152(_ishidden)
     5764    0.003    0.000    0.003    0.000 glob.py:22(iglob)
17292/5764    0.035    0.000    0.501    0.000 glob.py:39(_iglob)
     5764    0.018    0.000    0.285    0.000 glob.py:79(_glob1)
    11528    0.004    0.000    0.005    0.000 glob.py:82(<genexpr>)
     5764    0.007    0.000    0.067    0.000 glob.py:85(_glob0)
     5764    0.010    0.000    0.514    0.000 glob.py:9(glob)
     3388    0.003    0.000    0.005    0.000 ignore.py:57(__call__)
     3388    0.002    0.000    0.002    0.000 ignore.py:58(<listcomp>)
     3388    0.005    0.000    0.010    0.000 ignore.py:76(__call__)
    13915    0.068    0.000    1.171    0.000 local.py:20(_parse_path)
     5764    0.005    0.000    0.024    0.000 local.py:41(fspath)
    13915    0.040    0.000    0.128    0.000 local.py:52(__init__)
    11528    0.019    0.000    0.044    0.000 main.py:167(reader)
  2008844    0.316    0.000    0.387    0.000 main.py:176(scanner)
  1259907    0.802    0.000    1.404    0.000 main.py:185(parser)
     5764    0.012    0.000    0.028    0.000 main.py:207(composer)
     5764    0.019    0.000    0.073    0.000 main.py:215(constructor)
   642674    0.441    0.000    0.785    0.000 main.py:225(resolver)
     5764    0.043    0.000   36.737    0.006 main.py:316(load)
     5764    0.016    0.000    0.381    0.000 main.py:375(get_constructor_parser)
     5764    0.056    0.000    0.638    0.000 main.py:61(__init__)
     5764    0.018    0.000    0.582    0.000 main.py:615(official_plug_ins)
     5764    0.002    0.000    0.002    0.000 main.py:619(<listcomp>)
    19679    0.013    0.000    0.036    0.000 nodes.py:117(__init__)
   149985    0.061    0.000    0.061    0.000 nodes.py:15(__init__)
   121802    0.073    0.000    0.120    0.000 nodes.py:81(__init__)
    28183    0.020    0.000    0.034    0.000 nodes.py:92(__init__)
27320/3394    0.041    0.000    0.199    0.000 os.py:277(walk)
   199584    0.097    0.000    0.116    0.000 parse.py:109(_coerce_args)
    99792    0.140    0.000    0.499    0.000 parse.py:359(urlparse)
    99792    0.133    0.000    0.233    0.000 parse.py:392(urlsplit)
      569    0.000    0.000    0.002    0.000 parse.py:83(clear_cache)
   199584    0.017    0.000    0.017    0.000 parse.py:98(_noop)
     5764    0.007    0.000    0.017    0.000 parser.py:101(__init__)
    11528    0.014    0.000    0.014    0.000 parser.py:108(reset_parser)
     5764    0.003    0.000    0.011    0.000 parser.py:118(dispose)
  2008844    1.021    0.000    1.682    0.000 parser.py:122(scanner)
     5764    0.003    0.000    0.011    0.000 parser.py:129(resolver)
   466611    0.343    0.000   28.242    0.000 parser.py:136(check_event)
   149985    0.032    0.000    0.032    0.000 parser.py:150(peek_event)
   201224    0.063    0.000    0.935    0.000 parser.py:158(get_event)
     5764    0.021    0.000    0.792    0.000 parser.py:173(parse_stream_start)
    11528    0.031    0.000    0.208    0.000 parser.py:185(parse_implicit_document_start)
     5764    0.018    0.000    0.076    0.000 parser.py:203(parse_document_start)
     5764    0.018    0.000    0.080    0.000 parser.py:236(parse_document_end)
    19679    0.012    0.000    1.351    0.000 parser.py:319(parse_block_node)
   130306    0.089    0.000    6.997    0.000 parser.py:327(parse_block_node_or_indentless_sequence)
   149985    0.996    0.000    8.246    0.000 parser.py:335(parse_node)
    22419    0.075    0.000    1.971    0.000 parser.py:532(parse_indentless_sequence_entry)
    19679    0.028    0.000    1.671    0.000 parser.py:555(parse_block_mapping_first_key)
    84832    0.330    0.000    7.132    0.000 parser.py:561(parse_block_mapping_key)
    65153    0.317    0.000   17.919    0.000 parser.py:587(parse_block_mapping_value)
    27830    0.022    0.000    0.401    0.000 path_info.py:29(__new__)
    19679    0.011    0.000    0.067    0.000 path_info.py:57(__fspath__)
     5764    0.002    0.000    0.019    0.000 path_info.py:60(fspath)
   142121    0.044    0.000    0.057    0.000 pathlib.py:282(splitroot)
    41715    0.258    0.000    0.469    0.000 pathlib.py:51(parse_parts)
    41715    0.137    0.000    0.668    0.000 pathlib.py:629(_parse_args)
    41715    0.047    0.000    0.730    0.000 pathlib.py:649(_from_parts)
    19679    0.018    0.000    0.026    0.000 pathlib.py:672(_format_parsed_parts)
    41715    0.005    0.000    0.005    0.000 pathlib.py:679(_init)
    19679    0.030    0.000    0.056    0.000 pathlib.py:689(__str__)
    13885    0.011    0.000    0.362    0.000 pathlib.py:897(__rtruediv__)
    13915    0.003    0.000    0.003    0.000 pathlib.py:915(is_absolute)
    11528    0.018    0.000    0.029    0.000 posixpath.py:102(split)
    18942    0.018    0.000    0.035    0.000 posixpath.py:142(basename)
    23056    0.041    0.000    0.071    0.000 posixpath.py:152(dirname)
     3387    0.004    0.000    0.030    0.000 posixpath.py:166(islink)
     5764    0.008    0.000    0.047    0.000 posixpath.py:176(lexists)
  9191693   29.449    0.000   46.949    0.000 posixpath.py:329(normpath)
    42736    0.048    0.000    0.438    0.000 posixpath.py:367(abspath)
   142704    0.039    0.000    0.064    0.000 posixpath.py:39(_get_sep)
     5764    0.032    0.000    0.199    0.000 posixpath.py:444(relpath)
     5764    0.003    0.000    0.003    0.000 posixpath.py:466(<listcomp>)
     5764    0.004    0.000    0.004    0.000 posixpath.py:467(<listcomp>)
     5764    0.004    0.000    0.007    0.000 posixpath.py:50(normcase)
    42736    0.028    0.000    0.062    0.000 posixpath.py:62(isabs)
    46442    0.108    0.000    0.159    0.000 posixpath.py:73(join)
     5764    0.003    0.000    0.010    0.000 re.py:231(compile)
     5764    0.007    0.000    0.007    0.000 re.py:286(_compile)
   529870    0.073    0.000    0.073    0.000 reader.py:101(stream)
    11528    0.016    0.000    0.213    0.000 reader.py:109(stream)
  4754654    0.934    0.000    0.939    0.000 reader.py:132(peek)
   379321    0.247    0.000    0.287    0.000 reader.py:140(prefix)
   520803    1.687    0.000    1.735    0.000 reader.py:163(forward)
   512368    0.584    0.000    0.784    0.000 reader.py:178(get_mark)
     5764    0.014    0.000    0.188    0.000 reader.py:187(determine_encoding)
     5974    0.006    0.000    0.017    0.000 reader.py:218(_get_non_printable_ascii)
     5974    0.004    0.000    0.021    0.000 reader.py:236(_get_non_printable)
     5974    0.004    0.000    0.025    0.000 reader.py:244(check_printable)
    23313    0.022    0.000    0.083    0.000 reader.py:258(update)
    11738    0.020    0.000    0.132    0.000 reader.py:293(update_raw)
     5764    0.011    0.000    0.025    0.000 reader.py:79(__init__)
    11528    0.014    0.000    0.014    0.000 reader.py:87(reset_reader)
     5764    0.006    0.000    0.010    0.000 resolver.py:115(__init__)
   436323    0.264    0.000    0.782    0.000 resolver.py:124(parser)
   149985    0.039    0.000    0.039    0.000 resolver.py:218(descend_resolver)
   149985    0.032    0.000    0.032    0.000 resolver.py:241(ascend_resolver)
     5764    0.009    0.000    0.020    0.000 resolver.py:319(__init__)
    46112    0.101    0.000    0.173    0.000 resolver.py:327(add_version_implicit_resolver)
     5764    0.001    0.000    0.001    0.000 resolver.py:335(get_loader_version)
   243604    0.169    0.000    0.909    0.000 resolver.py:344(versioned_resolver)
   149985    0.318    0.000    1.391    0.000 resolver.py:357(resolve)
   436323    0.248    0.000    1.031    0.000 resolver.py:382(processing_version)
  6133737    0.720    0.000    0.722    0.000 scanner.py:144(reader)
   121802    2.762    0.000    7.814    0.000 scanner.py:1517(scan_plain)
   186955    0.160    0.000    0.910    0.000 scanner.py:156(scanner_processing_version)
   121802    0.418    0.000    1.056    0.000 scanner.py:1594(scan_plain_spaces)
  1295888    1.388    0.000   19.170    0.000 scanner.py:1756(check_token)
   396047    0.315    0.000    2.883    0.000 scanner.py:1770(peek_token)
  2008844    1.937    0.000   16.845    0.000 scanner.py:1780(_gather_comments)
   316909    0.608    0.000    2.413    0.000 scanner.py:1805(get_token)
   206634    0.491    0.000    1.027    0.000 scanner.py:1843(scan_to_next_token)
   271787    0.244    0.000    0.463    0.000 scanner.py:1912(scan_line_break)
  4082841    2.298    0.000    5.929    0.000 scanner.py:197(need_more_tokens)
   206634    0.626    0.000   13.296    0.000 scanner.py:214(fetch_more_tokens)
  3783113    0.841    0.000    0.841    0.000 scanner.py:326(next_possible_simple_key)
  3989747    2.828    0.000    3.093    0.000 scanner.py:342(stale_possible_simple_keys)
   121802    0.271    0.000    0.549    0.000 scanner.py:362(save_possible_simple_key)
    84832    0.036    0.000    0.060    0.000 scanner.py:386(remove_possible_simple_key)
   212398    0.172    0.000    0.282    0.000 scanner.py:404(unwind_indent)
    79068    0.025    0.000    0.027    0.000 scanner.py:429(add_indent)
    11528    0.027    0.000    0.083    0.000 scanner.py:440(fetch_stream_start)
     5764    0.015    0.000    0.065    0.000 scanner.py:449(fetch_stream_end)
    65153    0.036    0.000    0.036    0.000 scanner.py:56(__init__)
    13915    0.293    0.000    0.376    0.000 scanner.py:561(fetch_block_entry)
    65153    0.346    0.000    0.796    0.000 scanner.py:617(fetch_value)
     5764    0.007    0.000    0.071    0.000 scanner.py:67(__init__)
   121802    0.142    0.000    8.517    0.000 scanner.py:739(fetch_plain)
    13915    0.013    0.000    0.027    0.000 scanner.py:760(check_document_start)
     5790    0.003    0.000    0.003    0.000 scanner.py:768(check_document_end)
    13915    0.009    0.000    0.013    0.000 scanner.py:776(check_block_entry)
    65153    0.094    0.000    0.435    0.000 scanner.py:789(check_value)
   121802    0.123    0.000    0.770    0.000 scanner.py:808(check_plain)
  6485203    1.171    0.000    1.557    0.000 scanner.py:85(flow_level)
    11528    0.016    0.000    0.100    0.000 scanner.py:90(reset_scanner)
53625/20031    0.080    0.000    3.076    0.000 schema.py:103(validate)
    53625    0.068    0.000    0.095    0.000 schema.py:111(<listcomp>)
   716348    0.618    0.000    0.955    0.000 schema.py:196(_priority)
   272439    0.263    0.000    1.307    0.000 schema.py:20(__init__)
   429301    0.122    0.000    0.122    0.000 schema.py:217(__init__)
   149281    0.098    0.000    0.365    0.000 schema.py:225(_dict_key_priority)
567067/5764    1.264    0.000    4.393    0.001 schema.py:245(validate)
   272439    0.345    0.000    1.044    0.000 schema.py:25(code)
    22419    0.014    0.000    2.811    0.000 schema.py:254(<genexpr>)
    33594    0.033    0.000    0.033    0.000 schema.py:295(<genexpr>)
   544878    0.271    0.000    0.670    0.000 schema.py:31(uniq)
    19679    0.035    0.000    0.088    0.000 schema.py:312(<genexpr>)
   220210    0.065    0.000    0.092    0.000 schema.py:370(__hash__)
   544878    0.263    0.000    0.399    0.000 schema.py:38(<listcomp>)
   544878    0.068    0.000    0.068    0.000 schema.py:39(<genexpr>)
   272439    0.041    0.000    0.041    0.000 schema.py:40(<genexpr>)
     8504    0.016    0.000    0.028    0.000 schema.py:74(__init__)
     8504    0.015    0.000    2.972    0.000 schema.py:86(validate)
     8504    0.012    0.000    0.017    0.000 schema.py:93(<listcomp>)
     5764    0.014    0.000   37.391    0.006 stage.py:14(load_stage_fd)
     5764    0.009    0.000    0.012    0.000 stage.py:157(__init__)
    30470    0.035    0.000    0.074    0.000 stage.py:202(is_valid_filename)
     5764    0.029    0.000    4.882    0.001 stage.py:355(validate)
     5764    0.002    0.000    0.007    0.000 stage.py:570(_check_dvc_filename)
     5764    0.007    0.000    0.055    0.000 stage.py:580(_check_file_exists)
     5764    0.005    0.000    0.031    0.000 stage.py:585(_check_isfile)
     5764    0.010    0.000    0.034    0.000 stage.py:590(_get_path_tag)
     5764    0.147    0.000   62.820    0.011 stage.py:598(load)
    11528    0.009    0.000    0.012    0.000 tokens.py:137(__init__)
   322673    0.067    0.000    0.067    0.000 tokens.py:16(__init__)
   121802    0.072    0.000    0.097    0.000 tokens.py:241(__init__)
   353596    0.096    0.000    0.222    0.000 tokens.py:56(comment)
   169664    0.066    0.000    0.180    0.000 tokens.py:61(move_comment)
     5764    0.003    0.000    0.116    0.000 tree.py:45(open)
     5764    0.004    0.000    0.046    0.000 tree.py:49(exists)
     5764    0.005    0.000    0.025    0.000 tree.py:57(isfile)
     3389    0.004    0.000    0.246    0.000 tree.py:61(walk)
    57799    0.007    0.000    0.007    0.000 util.py:35(<lambda>)
    57799    0.056    0.000    0.074    0.000 util.py:40(__getattribute__)
   220083    0.047    0.000    0.047    0.000 {built-in method __new__ of type object at 0x7fe0472981c0}
    17386    0.010    0.000    0.010    0.000 {built-in method _codecs.utf_8_decode}
     3387    0.001    0.000    0.001    0.000 {built-in method _stat.S_ISLNK}
     5764    0.001    0.000    0.001    0.000 {built-in method _stat.S_ISREG}
   369139    0.031    0.000    0.031    0.000 {built-in method builtins.callable}
        1    0.000    0.000  113.095  113.095 {built-in method builtins.exec}
  2807363    0.534    0.000    0.534    0.000 {built-in method builtins.getattr}
  7064428    1.465    0.000    1.465    0.000 {built-in method builtins.hasattr}
   220210    0.027    0.000    0.027    0.000 {built-in method builtins.hash}
 15174575    1.070    0.000    1.070    0.000 {built-in method builtins.id}
 16204105    2.353    0.000    2.454    0.000 {built-in method builtins.isinstance}
   783536    0.099    0.000    0.099    0.000 {built-in method builtins.issubclass}
 12008460    0.837    0.000    0.855    0.000 {built-in method builtins.len}
     5764    0.002    0.000    0.002    0.000 {built-in method builtins.max}
     5764    0.005    0.000    0.005    0.000 {built-in method builtins.min}
    45810    0.049    0.000    0.351    0.000 {built-in method builtins.next}
   188708    0.087    0.000    0.087    0.000 {built-in method builtins.setattr}
    33594    0.090    0.000    0.455    0.000 {built-in method builtins.sorted}
     5764    0.107    0.000    0.114    0.000 {built-in method io.open}
  9551462    1.071    0.000    1.121    0.000 {built-in method posix.fspath}
     5764    0.013    0.000    0.013    0.000 {built-in method posix.getcwd}
     9151    0.064    0.000    0.064    0.000 {built-in method posix.lstat}
     9152    0.053    0.000    0.053    0.000 {built-in method posix.scandir}
    11528    0.050    0.000    0.050    0.000 {built-in method posix.stat}
  1914109    0.183    0.000    0.183    0.000 {built-in method sys._getframe}
   412844    0.089    0.000    0.089    0.000 {built-in method sys.intern}
    67188    0.076    0.000    0.076    0.000 {method '__reduce_ex__' of 'object' objects}
   467898    0.071    0.000    0.091    0.000 {method 'add' of 'set' objects}
109256682    6.549    0.000    6.549    0.000 {method 'append' of 'list' objects}
     1138    0.001    0.000    0.001    0.000 {method 'clear' of 'dict' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    31207    0.005    0.000    0.005    0.000 {method 'discard' of 'set' objects}
     5974    0.004    0.000    0.004    0.000 {method 'encode' of 'str' objects}
    75167    0.012    0.000    0.012    0.000 {method 'endswith' of 'str' objects}
   178451    0.020    0.000    0.020    0.000 {method 'extend' of 'list' objects}
    11388    0.003    0.000    0.003    0.000 {method 'find' of 'str' objects}
 20556039    2.051    0.000    2.051    0.000 {method 'get' of 'dict' objects}
    84832    0.022    0.000    0.022    0.000 {method 'insert' of 'list' objects}
   181395    0.078    0.000    0.078    0.000 {method 'is_dir' of 'posix.DirEntry' objects}
    28183    0.010    0.000    0.010    0.000 {method 'issubset' of 'set' objects}
   167970    0.020    0.000    0.020    0.000 {method 'items' of 'dict' objects}
  9605612    2.222    0.000    2.222    0.000 {method 'join' of 'str' objects}
    17292    0.004    0.000    0.004    0.000 {method 'lower' of 'str' objects}
    27830    0.008    0.000    0.008    0.000 {method 'lstrip' of 'str' objects}
    69327    0.076    0.000    0.076    0.000 {method 'match' of '_sre.SRE_Pattern' objects}
    36971    0.042    0.000    0.102    0.000 {method 'pop' of 'collections.OrderedDict' objects}
   540768    0.107    0.000    0.107    0.000 {method 'pop' of 'list' objects}
    11738    0.081    0.000    0.109    0.000 {method 'read' of '_io.TextIOWrapper' objects}
    41715    0.005    0.000    0.005    0.000 {method 'reverse' of 'list' objects}
    53526    0.013    0.000    0.013    0.000 {method 'rfind' of 'str' objects}
    34584    0.009    0.000    0.009    0.000 {method 'rstrip' of 'str' objects}
    34584    0.045    0.000    0.045    0.000 {method 'search' of '_sre.SRE_Pattern' objects}
   322784    0.051    0.000    0.051    0.000 {method 'setdefault' of 'dict' objects}
  9244935    4.773    0.000    4.773    0.000 {method 'split' of 'str' objects}
 18501289    2.396    0.000    2.396    0.000 {method 'startswith' of 'str' objects}
     5974    0.007    0.000    0.007    0.000 {method 'translate' of 'bytes' objects}
    33594    0.014    0.000    0.014    0.000 {method 'update' of 'dict' objects}

I interrupted the call after about 2 minutes.
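
The listing above is ordered by standard name, which scatters the expensive calls across the whole output. A minimal sketch of capturing the same kind of profile sorted by cumulative time, using only the standard-library `cProfile` and `pstats` modules. `slow_work` is a hypothetical stand-in for `repo.collect_stages()`, and `profile_by_cumtime` is an illustrative helper, not part of dvc:

```python
import cProfile
import io
import pstats


def slow_work():
    # Hypothetical stand-in for repo.collect_stages().
    return sum(i * i for i in range(100_000))


def profile_by_cumtime(func, limit=20):
    """Run func under cProfile and return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func()
    except KeyboardInterrupt:
        # Keep whatever was collected if the run is interrupted with Ctrl-C.
        pass
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show only the `limit` most expensive entries by cumulative time.
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


report = profile_by_cumtime(slow_work)
print(report)
```

Sorting by cumulative time puts call chains such as `stage.py:598(load)` and `__init__.py:373(filter_dirs)` near the top instead of leaving them buried alphabetically.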

danfischetti on 25 Oct 2019

@danfischetti Got it, so it looks like I misunderstood you yesterday :slightly_frowning_face: (we found a major bug thanks to that, so yay :tada: :slightly_smiling_face:).

Please correct me if I'm wrong, but AFAIK you are only using dvc through the API, right? In that case, a quick workaround would be to simply monkeypatch Repo.check_modified_graph with a no-op, e.g.:

from dvc.repo import Repo

repo = Repo(".")
# Skip the project-wide graph check by replacing it with a no-op.
repo.check_modified_graph = lambda *args: None
repo.add("something")
repo.checkout("other.dvc")

I need to stress that this is pretty dangerous: your new repo.add() might overlap with some existing dvc-file (another dvc-file that lists the same file you are adding as an output), and you might get unexpected results from repo.checkout() (no target).
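
One way to limit that risk is to keep the patch confined to the calls that need it. A minimal sketch using the standard-library `unittest.mock.patch.object`; `FakeRepo` is a hypothetical stand-in for dvc's `Repo` so the snippet runs on its own, and `skip_graph_check` is an illustrative helper, not a dvc API:

```python
from contextlib import contextmanager
from unittest import mock


class FakeRepo:
    """Hypothetical stand-in for dvc's Repo, used only to keep this runnable."""

    def check_modified_graph(self, new_stages):
        raise RuntimeError("expensive graph check ran")

    def add(self, target):
        self.check_modified_graph([target])
        return target


@contextmanager
def skip_graph_check(repo):
    """Replace repo.check_modified_graph with a no-op inside the with block."""
    noop = lambda *args, **kwargs: None
    with mock.patch.object(repo, "check_modified_graph", noop):
        yield repo


repo = FakeRepo()
with skip_graph_check(repo):
    added = repo.add("something")  # the check is skipped here

# Outside the block the original method is back in place.
try:
    repo.add("something")
    check_restored = False
except RuntimeError:
    check_restored = True
```

The overlap risk described above still applies to everything done inside the `with` block; the context manager only ensures the check is not left disabled for later calls.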

A proper solution would be to try to optimize this by, for example, caching the results of stage collection in particular directories based on their mtime and doing some other related things. But that would only work if most of your directories are not being constantly updated, of course. Could you please talk a bit more about the scenario you have? What are the requirements? How long of an execution time for check_modified_graph would be acceptable for you? How wide and how deep of a file tree do you have? Are dvc-files scattered all over that tree, or do you have large chunks that don't have them?

efiop on 25 Oct 2019

Though, thinking about it, operations like add/fetch/checkout/etc. that belong to the data management part of dvc shouldn't really care about DAG relations. At most, they could check that there are no overlapping outputs when you are doing things like dvc checkout, since that would create a race condition (but for this we don't have to collect the DAG for the whole project). This is a really interesting thing to consider. I need to think about it a bit more, but for now, it does look promising. 🤔

For example, say you have big.dvc and small.dvc that both point to data. Then if we lift this current restriction, we would be able to do neat things like

dvc checkout big.dvc
./myscript.sh data
dvc checkout small.dvc
./myscript.sh data

(note that this also effectively replaces a currently hidden dvc tag feature)
but when you try to

dvc checkout

it should probably raise an error, because small.dvc and big.dvc overlap, so the content of data would depend on the order of underlying linking.
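
A minimal sketch of such an overlap check, assuming each dvc-file's outputs are already available as a plain list of paths. The function name and the input mapping are hypothetical and not taken from dvc's code:

```python
from collections import defaultdict


def find_overlapping_outputs(stage_outputs):
    """Return {output path: [dvc-files]} for outputs claimed by more than one dvc-file."""
    owners = defaultdict(list)
    for stage_file, outputs in stage_outputs.items():
        for out in outputs:
            owners[out].append(stage_file)
    return {out: sorted(stages) for out, stages in owners.items() if len(stages) > 1}


# Hypothetical layout from the example above: two dvc-files pointing at "data".
stage_outputs = {
    "big.dvc": ["data"],
    "small.dvc": ["data"],
    "other.dvc": ["models"],
}

overlaps = find_overlapping_outputs(stage_outputs)
print(overlaps)  # {'data': ['big.dvc', 'small.dvc']}
```

This only compares exact paths. Catching nested outputs (for example `data` and `data/file`) would need a prefix comparison on top of it.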

efiop on 25 Oct 2019

Currently we really only use the API, but that is partially because a lot of the cli primitives are so painfully slow. Some simple things would be nice to do without having to drop into python and script it.

The bulk of our directory structure is organized by "scene", which is a logical unit of video and image data corresponding to a point in time. There are 100s of scenes and each scene has a few dozen files associated with it. Oftentimes data associated with a scene will be associated with a particular model, and that file type is updated across many scenes at once. This workflow is managed by our API, but sometimes we want to fix a typo in a single file, where the workflow should just be "dvc add " or "dvc checkout ". If we know the specific file being added or checked out, I don't think these DAG checks are buying us anything.

danfischetti on 25 Oct 2019

> Currently we really only use the API, but that is partially because a lot of the cli primitives are so painfully slow. Some simple things would be nice to do without having to drop into python and script it.

@danfischetti Were you able to pinpoint the parts that are slow for you there compared to the API? Also, have you tried the 0.65.0 CLI? It should be pretty similar to the API these days. We've improved the startup time quite significantly in recent versions, so the gap between CLI and API should have shrunk.

> The bulk of our directory structure is organized by "scene", which is a logical unit of video and image data corresponding to a point in time. There are 100s of scenes and each scene has a few dozen files associated with it. Oftentimes data associated with a scene will be associated with a particular model, and that file type is updated across many scenes at once. This workflow is managed by our API, but sometimes we want to fix a typo in a single file, where the workflow should just be "dvc add " or "dvc checkout ". If we know the specific file being added or checked out, I don't think these DAG checks are buying us anything.

Thanks for clarifying! Would that monkeypatch workaround be suitable for you for now? In the meantime, we'll consider the idea from https://github.com/iterative/dvc/issues/2671#issuecomment-546518272 , as it has the potential to be a simple, effective and, most of all, correct solution for all of us. 🙂

Thank you so much for the great feedback! 🙂

efiop on 25 Oct 2019

I have tried the 0.65.0 cli, the reason it's slower than the API is we're skipping the high level "add" operation and are manually creating and saving Stage objects. Otherwise the api would be just as slow due to the collect_stages call.

Yes, I think the idea mentioned in that comment would work. DAG checks are totally appropriate when doing a top-level checkout; skipping them only when a specific file is requested would suit our needs.

danfischetti on 26 Oct 2019

👍2

Discussed this offline with @efiop and @dmpetrov and on 1-1 with @shcheklein and there is no consensus on lifting DAG checks even for dvc add command. The core consideration is that people should be able to:

git clone ... && dvc pull

and continue their or their teammates' work without any complications. Ultimately we want both correctness (as it is enforced now) and performance.

Suor on 3 Nov 2019

I'll do some research on ways to optimize or cache this.

Suor on 3 Nov 2019

So here is my benching so far (21k add stages):

| task | time |
| ------------------- | ----------- |
| list | 0.65s |
| list + mtime/size | 0.85s |
| list + read | 1.16s |
| parse yamls | 48.5s |
| create stages | 69.6s |
| stages (no schema) | 59.0s |
| build graph | 69.6s |

The majority of time is taken by 2 things:

- YAML parsing (47s)
- Schema validation (10.5s)

The rest is split between path manipulations, deepcopies, outs and deps creation mostly.

Suor on 3 Nov 2019

@Suor thanks, great summary! We effectively need to get rid of all of the last 4 to make it usable.

shcheklein on 3 Nov 2019

Switching from ruamel.yaml back to PyYAML cuts parsing time in half - 24.6s instead of 48.5s. But that doesn't preserve comments so stages can't be safely dumped.

Suor on 3 Nov 2019

@shcheklein if we cache stages then building graph is not the issue. The issues that remain:

- still slow on empty cache
- need to make cache cross-python, cross-dvc

Suor on 3 Nov 2019

@Suor cache might be a solution. But it still takes time to build it. We'll need to do checks to ensure that it's still valid in case someone manually changes DVC-file. We'll have to think about things like atomicity, etc.

shcheklein on 3 Nov 2019

@shcheklein we can cache either by a (filename, mtime, size) tuple or even by file contents (reading which is fast enough), so someone manually changing a DVC-file is not an issue.
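A minimal sketch of the (filename, mtime, size) idea, not DVC's actual code. The names `load_cached`, `parse` and `stats` are hypothetical, and `json.loads` stands in for the expensive YAML parse plus schema validation. A manual edit changes the file's mtime or size, so it produces a new key and forces a re-parse.

```python
import json
import os

# Hypothetical in-memory cache: (abs path, mtime_ns, size) -> parsed content.
# Entries for old versions of a file are never evicted in this sketch.
_cache = {}
stats = {"parses": 0}  # counts real parses, only to make the effect visible


def parse(text):
    """Stand-in for the expensive YAML parse + schema validation."""
    stats["parses"] += 1
    return json.loads(text)


def load_cached(path):
    """Return parsed content, re-parsing only when mtime or size changed."""
    st = os.stat(path)
    key = (os.path.abspath(path), st.st_mtime_ns, st.st_size)
    if key not in _cache:
        with open(path, encoding="utf-8") as f:
            _cache[key] = parse(f.read())
    return _cache[key]
```

A persistent version would also have to deal with the cross-python and cross-dvc concerns raised above, which this sketch ignores.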

Another thing I see is that we are using a pure-Python YAML parser. PyYAML somewhat supports wrapping libyaml, which should speed things up. Here is how you install it though:

python setup.py --with-libyaml install

So no luck with using it as a dep)

Suor on 3 Nov 2019

manually changing

is only one problem in supporting a cache; there will be trickier ones. So a cache might be a solution, but unfortunately quite a complicated one.

shcheklein on 3 Nov 2019

So the update on yaml libs:

| library | time |
| -----------------|------ |
| ruamel.yaml | 48.5s |
| PyYAML | 25.6s |
| PyYAML (libyaml) | 3.9s |

To use PyYAML with libyaml on debian based linuxes:

sudo apt install libyaml-dev
pip install PyYAML

So that is achievable via deps. We might want to use this strategy:

- parse stages with PyYAML on read, store unparsed text
- if we need to dump the stage, then parse the text with ruamel.yaml, apply the diff and dump_file

This way we'll make it faster for most scenarios without caching. Rewriting all the stages looks like a rare scenario.
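The read half of that strategy could look roughly like this. It is a sketch under assumptions: `load_stage_fast` and the returned dict layout are made up for illustration, and the ruamel.yaml re-parse on dump is left out. The `CSafeLoader` import only succeeds when PyYAML was built against libyaml, hence the fallback.

```python
import yaml

try:
    # Available only when PyYAML was built against libyaml.
    from yaml import CSafeLoader as FastLoader
except ImportError:
    # Pure-Python fallback: slower, same parsed result.
    from yaml import SafeLoader as FastLoader


def load_stage_fast(text):
    """Parse quickly for reading; keep the raw text for a later round-trip dump."""
    return {"data": yaml.load(text, Loader=FastLoader), "raw_text": text}
```

Keeping `raw_text` around is what would let a comment-preserving parser be applied later, and only to the few stages that actually get rewritten.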

Not sure we can do anything with validation besides caching. This is single call:

Schema(Stage.SCHEMA).validate(convert_to_unicode(d))

So the only thing we can do besides caching is replacing validation lib altogether.

Suor on 3 Nov 2019

👍1

I don't see how this specific optimization solves the problem, @Suor . But it def complicates all the logic and most likely packaging for different platforms.

shcheklein on 3 Nov 2019

It solves (48.5 - 3.9) / 69.6 ~ 64% of the problem, even with an empty cache. It will be about 33% of the problem without the libyaml C lib. Both the logic and packaging complications will be quite limited.
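For reference, the two percentages can be reproduced from the benchmark tables earlier in the thread. The variable names are only for illustration.

```python
# Timings in seconds, taken from the benchmark tables in this thread.
total = 69.6           # building the graph for 21k stages
ruamel = 48.5          # parsing with ruamel.yaml
pyyaml = 25.6          # parsing with pure-Python PyYAML
pyyaml_libyaml = 3.9   # parsing with PyYAML + libyaml

saved_with_libyaml = (ruamel - pyyaml_libyaml) / total
saved_without_libyaml = (ruamel - pyyaml) / total

print(f"{saved_with_libyaml:.0%}")     # 64%
print(f"{saved_without_libyaml:.0%}")  # 33%
```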

Suor on 3 Nov 2019

What do you suggest? Implementing cache only?

Suor on 3 Nov 2019

64% of the problem

so, it complicates everything but does not solve the problem

What do you suggest?

Add an option for people who manage a large number of DVC-files to disable the check for now. There should not be a penalty for them and we should unblock the workflow. Also, it will give us more information, namely whether there is a potential problem in not performing this check. It should work in < 1s. Then see what we can do: allow this setup in general and/or use a cache or something else.

shcheklein on 3 Nov 2019

Benched using voluptuous instead of using schema for validation, it works about 13x faster (it precompiles the schema into python function). This will strip another 14% of the problem, making it 8x faster combined with libyaml/PyYAML thing. There are other possible optimizations there.
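To illustrate why precompiling helps, here is a toy version of the idea. This is not how voluptuous is implemented, and `compile_schema` plus the three-key schema are hypothetical. The point is only that the schema is walked once, up front, and every validation afterwards is a plain function call.

```python
def compile_schema(schema):
    """Turn a {key: type} mapping into a validator function, built once."""
    checks = tuple(schema.items())
    allowed = frozenset(schema)

    def validate(d):
        extra = set(d) - allowed
        if extra:
            raise ValueError(f"unexpected keys: {sorted(extra)}")
        for key, typ in checks:
            if key in d and not isinstance(d[key], typ):
                raise ValueError(f"{key!r} must be {typ.__name__}")
        return d

    return validate


# Hypothetical, heavily simplified stage schema, compiled a single time.
validate_stage = compile_schema({"md5": str, "cmd": str, "outs": list})
```

With thousands of stages, `validate_stage` is then reused for each parsed dict instead of rebuilding the schema object per call.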

so, it complicates everything but does not solve the problem

It's not that black and white. Making it faster will benefit everyone, will make it under 1s for whoever it is now under 8s, so the problem will be solved for them.

Skipping check is not that rosy either:

- if you still need to load many stages, it will remain slow
- all the other ops besides dvc add will remain slow
- you'll need to avoid any code constructing the graph accidentally (this includes future code)

Suor on 3 Nov 2019

It's not that black and white.

I think this specific ticket and use case is black and white indeed. At least, I don't see how suggested optimizations can help. It will be a few minutes to just add a file, right?

It's a good question whether there are other cases with thousands of DVC-files and what the requirements are there. It would answer the question of whether we need to do a middle-ground optimization with some potential complications in supporting it.

Skipping check is not that rosy either.

I'm not saying that this is the solution I like; I just don't see how else to unblock the workflow for the team and save us some time to come up with a better one if possible.

shcheklein on 3 Nov 2019

It will be a few minutes to just add a file, right?

For me it's 1 minute now, before optimizations, so it's like 8 sec after optimizations; I have 21k files. It's much longer for @danfischetti. It took 113s for 5764 stages there, so they are probably more complicated than mine.

@danfischetti can you copy-paste the YAML of a typical stage of yours here? Also, what's your directory structure? Are the DVC-files spread over the tree?

Suor on 4 Nov 2019

👍1

Btw, guys, how about we start with simply making dvc checkout some.dvc not collect all stages? The reason it does that in the first place is that it is trying to clean up old unused links that we no longer have DVC-files pointing to, which actually only makes sense on dvc checkout (without arguments). That should speed up dvc checkout for specific targets by a lot. With dvc add it is not that simple, as we are creating a new DVC stage there, which risks colliding with other stages, so we will have to optimize DAG collection there :slightly_frowning_face:

efiop on 5 Nov 2019

For the record: created https://github.com/iterative/dvc/pull/2750 to disable DAG checks for dvc checkout for specific targets.

efiop on 7 Nov 2019

👍1

So after all optimizations merged we have 8.8s instead of 69.6s to collect graph for 21k simple stages, it is a 7.9x speedup. Here is what takes time:

| task | time | delta | what is added to prev line |
|----------------|------|-------|-------------------------------------|
| list + read | 1.1s | | (includes startup time) |
| ... + parse | 3.1s | +2.0s | PyYAML/libyaml parsing |
| ... + validate | 4.5s | +1.2s | validation/coercion with voluptuous |
| collect stages | 7.2s | +2.7s | stage/dep/out object creation |
| check stages | 8.2s | +1.0s | check dups, overlaps, etc |
| collect graph | 8.8s | +0.6s | graph creation (incl. import nx and cycle check) |

I would say even if we cache graph we can get at best 1.5s, if we cache stages individually - 2.5s.

Since the issue for topic starter is not urgent I suggest stopping with this for now.

Suor on 3 Dec 2019

🚀2 ❤2 👍2

@danfischetti could you please give it a try? does it solve the problem with your repo?

shcheklein on 4 Dec 2019

👍1

Closing due to inactivity.

efiop on 7 Jan 2020

I have also encountered this issue

tushar-dadlani on 27 Feb 2020

dvc version

DVC version: 0.82.8
Python version: 3.7.5
Platform: Darwin-18.7.0-x86_64-i386-64bit
Binary: True
Package: osxpkg
Filesystem type (workspace): ('apfs', '/dev/disk1s1')

tushar-dadlani on 27 Feb 2020

For the record: @tushar-dadlani is running into dvc pull something.dvc collecting all of the stages for too long without any progress bars or anything. We need to at least add a progress bar, but should also consider not collecting stages when we are given a specific stage as a target already.

efiop on 27 Feb 2020

@efiop we can also consider some further optimizations into stage collection. We stopped this because the initial user was non-responsive.

Suor on 27 Feb 2020

👍1

The issue is that we need to stick closer to a real problem, as in my test scenario it's generally OK.

Suor on 27 Feb 2020

@tushar-dadlani the thing is, this works well for my test scenario. I may try to invent new ones, but it would make much more sense to look at your case. Can you provide a cut-up anonymized copy of your repo:

cd <your-repo>
mkdir ../repo-copy; cp -r * ../repo-copy  # skipping .dvc and .git here

cd ../repo-copy
find . -type f -not -name \*.dvc -exec sh -c 'echo ERASED > {}' \;

cd ..
tar czvf repo.tar.gz repo-copy/; rm -rf repo-copy

Then attach repo.tar.gz here or to the slack channel. This will replicate all the dir/file structure as well as stages and pipelines, which should be enough to reproduce and optimize it.

Suor on 28 Feb 2020

👍1

@tushar-dadlani the thing is, this works well for my test scenario. I may try to invent new ones, but it would make much more sense to look at your case. Can you provide a cut-up anonymized copy of your repo:

```shell
cd <your-repo>
mkdir ../repo-copy; cp -r * ../repo-copy # skipping .dvc and .git here

cd ../repo-copy

# This line should have some warning: if it is run in a different location
# than the one you assume, it can be disastrous.
find . -type f -not -name \*.dvc -exec sh -c 'echo ERASED > {}' \;

cd ..
tar czvf repo.tar.gz repo-copy/; rm -rf repo-copy
```

Then attach repo.tar.gz here or to the slack channel. This will replicate all the dir/file structure as well as stages and pipelines, which should be enough to reproduce and optimize it.
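One way to reduce the risk raised in the warning is to do the copy and the erasing in a single step that only ever writes inside a freshly created directory. The following is a sketch, not an official DVC helper: `make_anonymized_copy` is a hypothetical name, it skips `.git` and `.dvc` directories at any depth (slightly broader than the shell `cp -r *`), and it leaves the `tar` step to the caller.

```python
import os
import shutil


def make_anonymized_copy(repo_dir, copy_dir):
    """Copy repo_dir to a new copy_dir, then blank every non-.dvc file in the copy."""
    repo_dir = os.path.abspath(repo_dir)
    copy_dir = os.path.abspath(copy_dir)

    # Guard rails: never write into an existing path or into the repo itself.
    if os.path.exists(copy_dir):
        raise FileExistsError(f"refusing to touch existing path: {copy_dir}")
    if os.path.commonpath([repo_dir, copy_dir]) == repo_dir:
        raise ValueError("copy_dir must be outside repo_dir")

    shutil.copytree(
        repo_dir, copy_dir, ignore=shutil.ignore_patterns(".git", ".dvc")
    )

    # Only files under copy_dir are overwritten; the original repo is read-only here.
    for root, _dirs, files in os.walk(copy_dir):
        for name in files:
            if not name.endswith(".dvc"):
                with open(os.path.join(root, name), "w") as f:
                    f.write("ERASED\n")
    return copy_dir
```

Because the erasing loop walks `copy_dir` by absolute path, running the script from the wrong working directory cannot overwrite the original data.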

tushar-dadlani on 4 Mar 2020

👍1

Great point @tushar-dadlani ! Thanks for providing the test repo 🙏 We are able to reproduce the problem, looking into it right now.

efiop on 5 Mar 2020

So my implementation using tries gives:

Tries:
  654.77 ms in collect stages
   26.63 ms in dups/overlaps
   19.35 ms in stages in outs
  188.29 ms in build graph
  370.05 ms in check acyclic
    1.45 s in _collect_graph(Repo: '/home/suor/proj...)

Old code:
  650.43 ms in collect stages
   27.48 ms in dups/overlaps
   35.23 ms in stages in outs
    3.02 s in build graph
  400.53 ms in check acyclic
    4.33 s in _collect_graph(Repo: '/home/suor/proj...)

for 1320 stages. Will test on @tushar-dadlani's data and create a PR.
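The comment above does not show the implementation, so the following is only a hypothetical sketch of why a trie helps here. With output paths stored in a trie keyed by path component, finding which outputs contain a given path costs one walk of that path's depth, instead of a comparison against every other stage. `find_overlaps` and the sample paths are made up for illustration.

```python
_END = object()  # sentinel key marking a node where an output path ends


def find_overlaps(paths):
    """Return (parent, child) pairs where one output path contains another."""
    trie = {}
    for path in paths:
        node = trie
        for part in path.split("/"):
            node = node.setdefault(part, {})
        node[_END] = path

    overlaps = []
    for path in paths:
        node = trie
        # Walk only the proper prefixes of the path, one component at a time.
        for part in path.split("/")[:-1]:
            node = node[part]
            if _END in node:
                overlaps.append((node[_END], path))
    return overlaps


print(find_overlaps(["data/scene1", "data/scene1/img.png", "data/scene2/img.png"]))
# [('data/scene1', 'data/scene1/img.png')]
```

The same structure can answer "which stage produces the directory this dependency lives in", which is the kind of lookup needed while building the graph.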

Suor on 5 Mar 2020

🎉1

This is what I have for @tushar-dadlani's repo:

   26.49 s in collect stages
    1.32 s in dups/overlaps
  782.49 ms in stages in outs
    7.23 s in build graph
   18.80 s in check acyclic
   54.83 s in _collect_graph(Repo: '/home/suor/proj...)

Build graph is not the biggest anymore. The only way to make it fast is probably caching. Making some commands avoid building a graph and making full collection is also a good idea.

Old code:

   25.61 s in collect stages
    1.72 s in dups/overlaps
    1.74 s in stages in outs
^C 2997.81 s in build graph  # interrupted 
 3027.11 s in _collect_graph(Repo: '/home/suor/proj...)

Suor on 5 Mar 2020

🚀2

@danfischetti @tushar-dadlani Guys, please also take a look at https://github.com/iterative/dvc/pull/3490 , which attempts to introduce the hack that @danfischetti originally asked for.

efiop on 17 Mar 2020
