I posted issue #2666 yesterday, which was closed, but the patch did not solve my underlying issue. The DAG is constructed using the stages from the `repo.collect_stages()` method, and this method is what is very slow for me with a large repo. So I would like to ask again whether you could make the collection of every stage before any operation optional, so it doesn't take hours to `dvc add` a single file with no dependencies.
@danfischetti Could you please post a cProfile log from your specific use case so we can see more clearly what is slowing you down? :slightly_smiling_face:
@danfischetti Also, did the change from yesterday make any difference for you?
I just updated to 0.65.0 and I don't see any difference.
In [7]: cProfile.run("repo.collect_stages()")
^C 329107531 function calls (317924667 primitive calls) in 113.090 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
33595 0.019 0.000 0.048 0.000 <frozen importlib._bootstrap>:997(_handle_fromlist)
1 0.000 0.000 113.095 113.095 <string>:1(<module>)
99792 0.029 0.000 0.051 0.000 <string>:12(__new__)
5764 0.022 0.000 0.980 0.000 __init__.py:100(loadd_from)
17293 0.004 0.000 0.004 0.000 __init__.py:110(tree)
215138/5764 0.225 0.000 0.447 0.000 __init__.py:319(convert_to_unicode)
3389 0.008 0.000 0.217 0.000 __init__.py:350(dvc_walk)
3388 0.007 0.000 49.910 0.015 __init__.py:371(_filter_out_dirs)
3444 3.412 0.001 49.902 0.014 __init__.py:373(filter_dirs)
1 0.042 0.042 113.095 113.095 __init__.py:382(collect_stages)
5764 0.015 0.000 0.316 0.000 __init__.py:453(relpath)
8151 0.045 0.000 1.234 0.000 __init__.py:52(_get)
5764 0.024 0.000 0.856 0.000 __init__.py:60(_get)
5764 0.014 0.000 1.305 0.000 __init__.py:71(loadd_from)
89676 0.101 0.000 0.221 0.000 _collections_abc.py:879(__iter__)
13915 0.010 0.000 0.103 0.000 _collections_abc.py:966(append)
130306 0.046 0.000 0.046 0.000 _weakrefset.py:70(__contains__)
130306 0.055 0.000 0.101 0.000 abc.py:178(__instancecheck__)
13915 0.018 0.000 0.020 0.000 base.py:115(_check_requires)
11528 0.003 0.000 0.003 0.000 base.py:127(scheme)
5764 0.005 0.000 0.009 0.000 base.py:142(cache)
71962 0.038 0.000 0.400 0.000 base.py:150(supported)
71962 0.052 0.000 0.362 0.000 base.py:158(supported)
13915 0.019 0.000 0.086 0.000 base.py:419(_validate_output_path)
13915 0.057 0.000 1.450 0.000 base.py:72(__init__)
13915 0.042 0.000 0.082 0.000 base.py:92(__init__)
5764 0.002 0.000 0.002 0.000 codecs.py:259(__init__)
5764 0.005 0.000 0.007 0.000 codecs.py:308(__init__)
17386 0.015 0.000 0.025 0.000 codecs.py:318(decode)
17386 0.003 0.000 0.003 0.000 codecs.py:330(getstate)
28183 0.005 0.000 0.005 0.000 comments.py:100(__init__)
28183 0.004 0.000 0.004 0.000 comments.py:108(set_block_style)
28183 0.011 0.000 0.011 0.000 comments.py:126(__init__)
65153 0.024 0.000 0.024 0.000 comments.py:132(add_kv_line_col)
13915 0.007 0.000 0.007 0.000 comments.py:159(add_idx_line_col)
13915 0.015 0.000 0.027 0.000 comments.py:182(ca)
28183 0.030 0.000 0.060 0.000 comments.py:269(fa)
135434 0.084 0.000 0.165 0.000 comments.py:304(lc)
28183 0.020 0.000 0.121 0.000 comments.py:311(_yaml_set_line_col)
65153 0.038 0.000 0.114 0.000 comments.py:316(_yaml_set_kv_line_col)
13915 0.008 0.000 0.026 0.000 comments.py:320(_yaml_set_idx_line_col)
33594 0.146 0.000 16.608 0.000 comments.py:353(copy_attributes)
34016 0.043 0.000 3.137 0.000 comments.py:381(__init__)
89676 0.040 0.000 0.040 0.000 comments.py:385(__getsingleitem__)
22419 0.010 0.000 0.010 0.000 comments.py:410(__len__)
13915 0.023 0.000 0.079 0.000 comments.py:414(insert)
8504 0.005 0.000 0.005 0.000 comments.py:423(extend)
8504 0.004 0.000 0.004 0.000 comments.py:46(__init__)
8504 0.042 0.000 16.621 0.002 comments.py:476(__deepcopy__)
19679 0.004 0.000 0.004 0.000 comments.py:563(__init__)
84832 0.051 0.000 0.089 0.000 comments.py:610(__iter__)
39358 0.048 0.000 0.048 0.000 comments.py:635(__init__)
13915 0.002 0.000 0.002 0.000 comments.py:64(items)
210012 0.085 0.000 0.106 0.000 comments.py:744(__getitem__)
130306 0.126 0.000 0.201 0.000 comments.py:754(__setitem__)
232430 0.106 0.000 0.106 0.000 comments.py:773(__contains__)
48499 0.023 0.000 0.082 0.000 comments.py:777(get)
31207 0.029 0.000 0.034 0.000 comments.py:794(__delitem__)
84832 0.026 0.000 0.026 0.000 comments.py:814(__iter__)
84832 0.020 0.000 0.020 0.000 comments.py:819(_keys)
39358 0.016 0.000 0.016 0.000 comments.py:824(__len__)
19679 0.010 0.000 0.014 0.000 comments.py:898(items)
19679/5764 0.128 0.000 17.277 0.003 comments.py:942(__deepcopy__)
13916 0.002 0.000 0.002 0.000 compat.py:178(<lambda>)
5764 0.012 0.000 0.021 0.000 compat.py:252(version_tnf)
89676 0.064 0.000 0.120 0.000 compat.py:266(__getitem__)
149985/5764 0.584 0.000 33.287 0.006 composer.py:109(compose_node)
121802 0.341 0.000 2.292 0.000 composer.py:142(compose_scalar_node)
8504 0.064 0.000 24.126 0.003 composer.py:161(compose_sequence_node)
19679/5764 0.241 0.000 32.750 0.006 composer.py:194(compose_mapping_node)
28183 0.006 0.000 0.006 0.000 composer.py:228(check_end_doc_comment)
5764 0.005 0.000 0.008 0.000 composer.py:33(__init__)
817820 0.527 0.000 1.524 0.000 composer.py:40(parser)
449955 0.295 0.000 0.859 0.000 composer.py:47(resolver)
5764 0.022 0.000 34.501 0.006 composer.py:70(get_single_node)
5764 0.013 0.000 33.408 0.006 composer.py:95(compose_document)
5764 0.012 0.000 36.254 0.006 constructor.py:106(get_single_data)
121802 0.084 0.000 0.097 0.000 constructor.py:1063(construct_scalar)
5764 0.016 0.000 1.707 0.000 constructor.py:114(construct_document)
104510 0.055 0.000 0.154 0.000 constructor.py:1266(construct_yaml_str)
149985/45826 0.256 0.000 1.420 0.000 constructor.py:128(construct_object)
8504 0.034 0.000 1.093 0.000 constructor.py:1281(construct_rt_sequence)
19679 0.043 0.000 0.050 0.000 constructor.py:1306(flatten_mapping)
19679/5764 0.203 0.000 1.579 0.000 constructor.py:1393(construct_mapping)
17008 0.024 0.000 1.201 0.000 constructor.py:1528(construct_yaml_seq)
39358/11528 0.151 0.000 1.665 0.000 constructor.py:1538(construct_yaml_map)
28183 0.031 0.000 0.119 0.000 constructor.py:1546(set_collection_style)
17292 0.009 0.000 0.042 0.000 constructor.py:1729(construct_yaml_bool)
65153 0.028 0.000 0.062 0.000 constructor.py:254(check_mapping_key)
17292 0.016 0.000 0.033 0.000 constructor.py:443(construct_yaml_bool)
5764 0.016 0.000 0.041 0.000 constructor.py:60(__init__)
5764 0.005 0.000 0.034 0.000 constructor.py:75(composer)
9997354/5764 7.516 0.000 17.306 0.003 copy.py:132(deepcopy)
7456429 0.526 0.000 0.526 0.000 copy.py:190(_deepcopy_atomic)
2311178 2.280 0.000 7.435 0.000 copy.py:210(_deepcopy_list)
33594 0.069 0.000 0.345 0.000 copy.py:219(_deepcopy_tuple)
33594 0.024 0.000 0.273 0.000 copy.py:220(<listcomp>)
100782/67188 0.981 0.000 15.075 0.000 copy.py:236(_deepcopy_dict)
2540925 0.979 0.000 1.345 0.000 copy.py:252(_keep_alive)
67188 0.212 0.000 15.901 0.000 copy.py:268(_reconstruct)
134376 0.037 0.000 0.122 0.000 copy.py:273(<genexpr>)
13915 0.011 0.000 0.014 0.000 copy.py:66(copy)
67188 0.028 0.000 0.039 0.000 copyreg.py:87(__newobj__)
512368 0.131 0.000 0.131 0.000 error.py:30(__init__)
5764 0.003 0.000 0.005 0.000 events.py:112(__init__)
121802 0.081 0.000 0.178 0.000 events.py:125(__init__)
201224 0.075 0.000 0.075 0.000 events.py:17(__init__)
149985 0.068 0.000 0.124 0.000 events.py:42(__init__)
28183 0.019 0.000 0.045 0.000 events.py:51(__init__)
5764 0.005 0.000 0.008 0.000 events.py:80(__init__)
5764 0.004 0.000 0.006 0.000 events.py:93(__init__)
5764 0.016 0.000 0.034 0.000 fnmatch.py:48(filter)
8 0.000 0.000 0.000 0.000 future.py:47(__del__)
5764 0.005 0.000 0.042 0.000 genericpath.py:16(exists)
5764 0.006 0.000 0.020 0.000 genericpath.py:27(isfile)
5764 0.012 0.000 0.021 0.000 genericpath.py:69(commonprefix)
11528 0.174 0.000 0.229 0.000 glob.py:114(_iterdir)
34584 0.017 0.000 0.067 0.000 glob.py:145(has_magic)
11528 0.005 0.000 0.005 0.000 glob.py:152(_ishidden)
5764 0.003 0.000 0.003 0.000 glob.py:22(iglob)
17292/5764 0.035 0.000 0.501 0.000 glob.py:39(_iglob)
5764 0.018 0.000 0.285 0.000 glob.py:79(_glob1)
11528 0.004 0.000 0.005 0.000 glob.py:82(<genexpr>)
5764 0.007 0.000 0.067 0.000 glob.py:85(_glob0)
5764 0.010 0.000 0.514 0.000 glob.py:9(glob)
3388 0.003 0.000 0.005 0.000 ignore.py:57(__call__)
3388 0.002 0.000 0.002 0.000 ignore.py:58(<listcomp>)
3388 0.005 0.000 0.010 0.000 ignore.py:76(__call__)
13915 0.068 0.000 1.171 0.000 local.py:20(_parse_path)
5764 0.005 0.000 0.024 0.000 local.py:41(fspath)
13915 0.040 0.000 0.128 0.000 local.py:52(__init__)
11528 0.019 0.000 0.044 0.000 main.py:167(reader)
2008844 0.316 0.000 0.387 0.000 main.py:176(scanner)
1259907 0.802 0.000 1.404 0.000 main.py:185(parser)
5764 0.012 0.000 0.028 0.000 main.py:207(composer)
5764 0.019 0.000 0.073 0.000 main.py:215(constructor)
642674 0.441 0.000 0.785 0.000 main.py:225(resolver)
5764 0.043 0.000 36.737 0.006 main.py:316(load)
5764 0.016 0.000 0.381 0.000 main.py:375(get_constructor_parser)
5764 0.056 0.000 0.638 0.000 main.py:61(__init__)
5764 0.018 0.000 0.582 0.000 main.py:615(official_plug_ins)
5764 0.002 0.000 0.002 0.000 main.py:619(<listcomp>)
19679 0.013 0.000 0.036 0.000 nodes.py:117(__init__)
149985 0.061 0.000 0.061 0.000 nodes.py:15(__init__)
121802 0.073 0.000 0.120 0.000 nodes.py:81(__init__)
28183 0.020 0.000 0.034 0.000 nodes.py:92(__init__)
27320/3394 0.041 0.000 0.199 0.000 os.py:277(walk)
199584 0.097 0.000 0.116 0.000 parse.py:109(_coerce_args)
99792 0.140 0.000 0.499 0.000 parse.py:359(urlparse)
99792 0.133 0.000 0.233 0.000 parse.py:392(urlsplit)
569 0.000 0.000 0.002 0.000 parse.py:83(clear_cache)
199584 0.017 0.000 0.017 0.000 parse.py:98(_noop)
5764 0.007 0.000 0.017 0.000 parser.py:101(__init__)
11528 0.014 0.000 0.014 0.000 parser.py:108(reset_parser)
5764 0.003 0.000 0.011 0.000 parser.py:118(dispose)
2008844 1.021 0.000 1.682 0.000 parser.py:122(scanner)
5764 0.003 0.000 0.011 0.000 parser.py:129(resolver)
466611 0.343 0.000 28.242 0.000 parser.py:136(check_event)
149985 0.032 0.000 0.032 0.000 parser.py:150(peek_event)
201224 0.063 0.000 0.935 0.000 parser.py:158(get_event)
5764 0.021 0.000 0.792 0.000 parser.py:173(parse_stream_start)
11528 0.031 0.000 0.208 0.000 parser.py:185(parse_implicit_document_start)
5764 0.018 0.000 0.076 0.000 parser.py:203(parse_document_start)
5764 0.018 0.000 0.080 0.000 parser.py:236(parse_document_end)
19679 0.012 0.000 1.351 0.000 parser.py:319(parse_block_node)
130306 0.089 0.000 6.997 0.000 parser.py:327(parse_block_node_or_indentless_sequence)
149985 0.996 0.000 8.246 0.000 parser.py:335(parse_node)
22419 0.075 0.000 1.971 0.000 parser.py:532(parse_indentless_sequence_entry)
19679 0.028 0.000 1.671 0.000 parser.py:555(parse_block_mapping_first_key)
84832 0.330 0.000 7.132 0.000 parser.py:561(parse_block_mapping_key)
65153 0.317 0.000 17.919 0.000 parser.py:587(parse_block_mapping_value)
27830 0.022 0.000 0.401 0.000 path_info.py:29(__new__)
19679 0.011 0.000 0.067 0.000 path_info.py:57(__fspath__)
5764 0.002 0.000 0.019 0.000 path_info.py:60(fspath)
142121 0.044 0.000 0.057 0.000 pathlib.py:282(splitroot)
41715 0.258 0.000 0.469 0.000 pathlib.py:51(parse_parts)
41715 0.137 0.000 0.668 0.000 pathlib.py:629(_parse_args)
41715 0.047 0.000 0.730 0.000 pathlib.py:649(_from_parts)
19679 0.018 0.000 0.026 0.000 pathlib.py:672(_format_parsed_parts)
41715 0.005 0.000 0.005 0.000 pathlib.py:679(_init)
19679 0.030 0.000 0.056 0.000 pathlib.py:689(__str__)
13885 0.011 0.000 0.362 0.000 pathlib.py:897(__rtruediv__)
13915 0.003 0.000 0.003 0.000 pathlib.py:915(is_absolute)
11528 0.018 0.000 0.029 0.000 posixpath.py:102(split)
18942 0.018 0.000 0.035 0.000 posixpath.py:142(basename)
23056 0.041 0.000 0.071 0.000 posixpath.py:152(dirname)
3387 0.004 0.000 0.030 0.000 posixpath.py:166(islink)
5764 0.008 0.000 0.047 0.000 posixpath.py:176(lexists)
9191693 29.449 0.000 46.949 0.000 posixpath.py:329(normpath)
42736 0.048 0.000 0.438 0.000 posixpath.py:367(abspath)
142704 0.039 0.000 0.064 0.000 posixpath.py:39(_get_sep)
5764 0.032 0.000 0.199 0.000 posixpath.py:444(relpath)
5764 0.003 0.000 0.003 0.000 posixpath.py:466(<listcomp>)
5764 0.004 0.000 0.004 0.000 posixpath.py:467(<listcomp>)
5764 0.004 0.000 0.007 0.000 posixpath.py:50(normcase)
42736 0.028 0.000 0.062 0.000 posixpath.py:62(isabs)
46442 0.108 0.000 0.159 0.000 posixpath.py:73(join)
5764 0.003 0.000 0.010 0.000 re.py:231(compile)
5764 0.007 0.000 0.007 0.000 re.py:286(_compile)
529870 0.073 0.000 0.073 0.000 reader.py:101(stream)
11528 0.016 0.000 0.213 0.000 reader.py:109(stream)
4754654 0.934 0.000 0.939 0.000 reader.py:132(peek)
379321 0.247 0.000 0.287 0.000 reader.py:140(prefix)
520803 1.687 0.000 1.735 0.000 reader.py:163(forward)
512368 0.584 0.000 0.784 0.000 reader.py:178(get_mark)
5764 0.014 0.000 0.188 0.000 reader.py:187(determine_encoding)
5974 0.006 0.000 0.017 0.000 reader.py:218(_get_non_printable_ascii)
5974 0.004 0.000 0.021 0.000 reader.py:236(_get_non_printable)
5974 0.004 0.000 0.025 0.000 reader.py:244(check_printable)
23313 0.022 0.000 0.083 0.000 reader.py:258(update)
11738 0.020 0.000 0.132 0.000 reader.py:293(update_raw)
5764 0.011 0.000 0.025 0.000 reader.py:79(__init__)
11528 0.014 0.000 0.014 0.000 reader.py:87(reset_reader)
5764 0.006 0.000 0.010 0.000 resolver.py:115(__init__)
436323 0.264 0.000 0.782 0.000 resolver.py:124(parser)
149985 0.039 0.000 0.039 0.000 resolver.py:218(descend_resolver)
149985 0.032 0.000 0.032 0.000 resolver.py:241(ascend_resolver)
5764 0.009 0.000 0.020 0.000 resolver.py:319(__init__)
46112 0.101 0.000 0.173 0.000 resolver.py:327(add_version_implicit_resolver)
5764 0.001 0.000 0.001 0.000 resolver.py:335(get_loader_version)
243604 0.169 0.000 0.909 0.000 resolver.py:344(versioned_resolver)
149985 0.318 0.000 1.391 0.000 resolver.py:357(resolve)
436323 0.248 0.000 1.031 0.000 resolver.py:382(processing_version)
6133737 0.720 0.000 0.722 0.000 scanner.py:144(reader)
121802 2.762 0.000 7.814 0.000 scanner.py:1517(scan_plain)
186955 0.160 0.000 0.910 0.000 scanner.py:156(scanner_processing_version)
121802 0.418 0.000 1.056 0.000 scanner.py:1594(scan_plain_spaces)
1295888 1.388 0.000 19.170 0.000 scanner.py:1756(check_token)
396047 0.315 0.000 2.883 0.000 scanner.py:1770(peek_token)
2008844 1.937 0.000 16.845 0.000 scanner.py:1780(_gather_comments)
316909 0.608 0.000 2.413 0.000 scanner.py:1805(get_token)
206634 0.491 0.000 1.027 0.000 scanner.py:1843(scan_to_next_token)
271787 0.244 0.000 0.463 0.000 scanner.py:1912(scan_line_break)
4082841 2.298 0.000 5.929 0.000 scanner.py:197(need_more_tokens)
206634 0.626 0.000 13.296 0.000 scanner.py:214(fetch_more_tokens)
3783113 0.841 0.000 0.841 0.000 scanner.py:326(next_possible_simple_key)
3989747 2.828 0.000 3.093 0.000 scanner.py:342(stale_possible_simple_keys)
121802 0.271 0.000 0.549 0.000 scanner.py:362(save_possible_simple_key)
84832 0.036 0.000 0.060 0.000 scanner.py:386(remove_possible_simple_key)
212398 0.172 0.000 0.282 0.000 scanner.py:404(unwind_indent)
79068 0.025 0.000 0.027 0.000 scanner.py:429(add_indent)
11528 0.027 0.000 0.083 0.000 scanner.py:440(fetch_stream_start)
5764 0.015 0.000 0.065 0.000 scanner.py:449(fetch_stream_end)
65153 0.036 0.000 0.036 0.000 scanner.py:56(__init__)
13915 0.293 0.000 0.376 0.000 scanner.py:561(fetch_block_entry)
65153 0.346 0.000 0.796 0.000 scanner.py:617(fetch_value)
5764 0.007 0.000 0.071 0.000 scanner.py:67(__init__)
121802 0.142 0.000 8.517 0.000 scanner.py:739(fetch_plain)
13915 0.013 0.000 0.027 0.000 scanner.py:760(check_document_start)
5790 0.003 0.000 0.003 0.000 scanner.py:768(check_document_end)
13915 0.009 0.000 0.013 0.000 scanner.py:776(check_block_entry)
65153 0.094 0.000 0.435 0.000 scanner.py:789(check_value)
121802 0.123 0.000 0.770 0.000 scanner.py:808(check_plain)
6485203 1.171 0.000 1.557 0.000 scanner.py:85(flow_level)
11528 0.016 0.000 0.100 0.000 scanner.py:90(reset_scanner)
53625/20031 0.080 0.000 3.076 0.000 schema.py:103(validate)
53625 0.068 0.000 0.095 0.000 schema.py:111(<listcomp>)
716348 0.618 0.000 0.955 0.000 schema.py:196(_priority)
272439 0.263 0.000 1.307 0.000 schema.py:20(__init__)
429301 0.122 0.000 0.122 0.000 schema.py:217(__init__)
149281 0.098 0.000 0.365 0.000 schema.py:225(_dict_key_priority)
567067/5764 1.264 0.000 4.393 0.001 schema.py:245(validate)
272439 0.345 0.000 1.044 0.000 schema.py:25(code)
22419 0.014 0.000 2.811 0.000 schema.py:254(<genexpr>)
33594 0.033 0.000 0.033 0.000 schema.py:295(<genexpr>)
544878 0.271 0.000 0.670 0.000 schema.py:31(uniq)
19679 0.035 0.000 0.088 0.000 schema.py:312(<genexpr>)
220210 0.065 0.000 0.092 0.000 schema.py:370(__hash__)
544878 0.263 0.000 0.399 0.000 schema.py:38(<listcomp>)
544878 0.068 0.000 0.068 0.000 schema.py:39(<genexpr>)
272439 0.041 0.000 0.041 0.000 schema.py:40(<genexpr>)
8504 0.016 0.000 0.028 0.000 schema.py:74(__init__)
8504 0.015 0.000 2.972 0.000 schema.py:86(validate)
8504 0.012 0.000 0.017 0.000 schema.py:93(<listcomp>)
5764 0.014 0.000 37.391 0.006 stage.py:14(load_stage_fd)
5764 0.009 0.000 0.012 0.000 stage.py:157(__init__)
30470 0.035 0.000 0.074 0.000 stage.py:202(is_valid_filename)
5764 0.029 0.000 4.882 0.001 stage.py:355(validate)
5764 0.002 0.000 0.007 0.000 stage.py:570(_check_dvc_filename)
5764 0.007 0.000 0.055 0.000 stage.py:580(_check_file_exists)
5764 0.005 0.000 0.031 0.000 stage.py:585(_check_isfile)
5764 0.010 0.000 0.034 0.000 stage.py:590(_get_path_tag)
5764 0.147 0.000 62.820 0.011 stage.py:598(load)
11528 0.009 0.000 0.012 0.000 tokens.py:137(__init__)
322673 0.067 0.000 0.067 0.000 tokens.py:16(__init__)
121802 0.072 0.000 0.097 0.000 tokens.py:241(__init__)
353596 0.096 0.000 0.222 0.000 tokens.py:56(comment)
169664 0.066 0.000 0.180 0.000 tokens.py:61(move_comment)
5764 0.003 0.000 0.116 0.000 tree.py:45(open)
5764 0.004 0.000 0.046 0.000 tree.py:49(exists)
5764 0.005 0.000 0.025 0.000 tree.py:57(isfile)
3389 0.004 0.000 0.246 0.000 tree.py:61(walk)
57799 0.007 0.000 0.007 0.000 util.py:35(<lambda>)
57799 0.056 0.000 0.074 0.000 util.py:40(__getattribute__)
220083 0.047 0.000 0.047 0.000 {built-in method __new__ of type object at 0x7fe0472981c0}
17386 0.010 0.000 0.010 0.000 {built-in method _codecs.utf_8_decode}
3387 0.001 0.000 0.001 0.000 {built-in method _stat.S_ISLNK}
5764 0.001 0.000 0.001 0.000 {built-in method _stat.S_ISREG}
369139 0.031 0.000 0.031 0.000 {built-in method builtins.callable}
1 0.000 0.000 113.095 113.095 {built-in method builtins.exec}
2807363 0.534 0.000 0.534 0.000 {built-in method builtins.getattr}
7064428 1.465 0.000 1.465 0.000 {built-in method builtins.hasattr}
220210 0.027 0.000 0.027 0.000 {built-in method builtins.hash}
15174575 1.070 0.000 1.070 0.000 {built-in method builtins.id}
16204105 2.353 0.000 2.454 0.000 {built-in method builtins.isinstance}
783536 0.099 0.000 0.099 0.000 {built-in method builtins.issubclass}
12008460 0.837 0.000 0.855 0.000 {built-in method builtins.len}
5764 0.002 0.000 0.002 0.000 {built-in method builtins.max}
5764 0.005 0.000 0.005 0.000 {built-in method builtins.min}
45810 0.049 0.000 0.351 0.000 {built-in method builtins.next}
188708 0.087 0.000 0.087 0.000 {built-in method builtins.setattr}
33594 0.090 0.000 0.455 0.000 {built-in method builtins.sorted}
5764 0.107 0.000 0.114 0.000 {built-in method io.open}
9551462 1.071 0.000 1.121 0.000 {built-in method posix.fspath}
5764 0.013 0.000 0.013 0.000 {built-in method posix.getcwd}
9151 0.064 0.000 0.064 0.000 {built-in method posix.lstat}
9152 0.053 0.000 0.053 0.000 {built-in method posix.scandir}
11528 0.050 0.000 0.050 0.000 {built-in method posix.stat}
1914109 0.183 0.000 0.183 0.000 {built-in method sys._getframe}
412844 0.089 0.000 0.089 0.000 {built-in method sys.intern}
67188 0.076 0.000 0.076 0.000 {method '__reduce_ex__' of 'object' objects}
467898 0.071 0.000 0.091 0.000 {method 'add' of 'set' objects}
109256682 6.549 0.000 6.549 0.000 {method 'append' of 'list' objects}
1138 0.001 0.000 0.001 0.000 {method 'clear' of 'dict' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
31207 0.005 0.000 0.005 0.000 {method 'discard' of 'set' objects}
5974 0.004 0.000 0.004 0.000 {method 'encode' of 'str' objects}
75167 0.012 0.000 0.012 0.000 {method 'endswith' of 'str' objects}
178451 0.020 0.000 0.020 0.000 {method 'extend' of 'list' objects}
11388 0.003 0.000 0.003 0.000 {method 'find' of 'str' objects}
20556039 2.051 0.000 2.051 0.000 {method 'get' of 'dict' objects}
84832 0.022 0.000 0.022 0.000 {method 'insert' of 'list' objects}
181395 0.078 0.000 0.078 0.000 {method 'is_dir' of 'posix.DirEntry' objects}
28183 0.010 0.000 0.010 0.000 {method 'issubset' of 'set' objects}
167970 0.020 0.000 0.020 0.000 {method 'items' of 'dict' objects}
9605612 2.222 0.000 2.222 0.000 {method 'join' of 'str' objects}
17292 0.004 0.000 0.004 0.000 {method 'lower' of 'str' objects}
27830 0.008 0.000 0.008 0.000 {method 'lstrip' of 'str' objects}
69327 0.076 0.000 0.076 0.000 {method 'match' of '_sre.SRE_Pattern' objects}
36971 0.042 0.000 0.102 0.000 {method 'pop' of 'collections.OrderedDict' objects}
540768 0.107 0.000 0.107 0.000 {method 'pop' of 'list' objects}
11738 0.081 0.000 0.109 0.000 {method 'read' of '_io.TextIOWrapper' objects}
41715 0.005 0.000 0.005 0.000 {method 'reverse' of 'list' objects}
53526 0.013 0.000 0.013 0.000 {method 'rfind' of 'str' objects}
34584 0.009 0.000 0.009 0.000 {method 'rstrip' of 'str' objects}
34584 0.045 0.000 0.045 0.000 {method 'search' of '_sre.SRE_Pattern' objects}
322784 0.051 0.000 0.051 0.000 {method 'setdefault' of 'dict' objects}
9244935 4.773 0.000 4.773 0.000 {method 'split' of 'str' objects}
18501289 2.396 0.000 2.396 0.000 {method 'startswith' of 'str' objects}
5974 0.007 0.000 0.007 0.000 {method 'translate' of 'bytes' objects}
33594 0.014 0.000 0.014 0.000 {method 'update' of 'dict' objects}
I interrupted the call after about 2 minutes.
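Because the listing above is ordered by standard name, the expensive calls are hard to pick out by eye. A minimal sketch of how a profile like this could be captured and sorted by cumulative time, using only the standard library; `collect_stages_stub` is a hypothetical stand-in for `repo.collect_stages()` and is not DVC code:

```python
import cProfile
import io
import pstats


def collect_stages_stub(n=2000):
    """Hypothetical stand-in for repo.collect_stages(); it just burns some CPU."""
    return [str(i).split("0") for i in range(n)]


def profile_top(func, limit=10):
    """Run func under cProfile and return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # "cumulative" puts the callers that dominate the run at the top.
    stats.sort_stats("cumulative").print_stats(limit)
    return buf.getvalue()


report = profile_top(collect_stages_stub)
print(report)
```

Read that way, the largest cumulative entries in the log above, below `collect_stages` itself, are `stage.py:598(load)` at 62.820s and `__init__.py:371(_filter_out_dirs)` at 49.910s.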
@danfischetti Got it, so it looks like I misunderstood you yesterday :slightly_frowning_face: (we found a major bug thanks to that, so yay :tada: :slightly_smiling_face:).
Please correct me if I'm wrong, but AFAIK you are only using dvc through the API, right? In that case, a quick workaround would be to simply monkeypatch `Repo.check_modified_graph` with a noop. E.g.
from dvc.repo import Repo

repo = Repo(".")
repo.check_modified_graph = lambda *args: None
repo.add("something")
repo.checkout("other.dvc")
I need to stress that this is pretty dangerous, as your new `repo.add()` might overlap with some existing dvc-file (another dvc-file that has the same file you are adding listed as an output), and you might get unexpected results from `repo.checkout()` (no target).
A proper solution would be to try to optimize this by, for example, caching the results of stage collection in particular directories based on their mtime and doing some other related things. But that would only work if most of your directories are not being constantly updated, of course. Could you please talk a bit more about the scenario you have? What are the requirements? How long an execution time for `check_modified_graph` would be acceptable for you? How wide and how deep is your file tree? Are dvc-files scattered all over that tree, or do you have large chunks that don't have them?
Though, thinking about it, operations like add/fetch/checkout/etc. that belong to the data management part of dvc shouldn't really care about DAG relations. At most, they could check that there are no overlapping outputs when you are doing things like `dvc checkout`, since that would create a race condition (but for this we don't have to collect the DAG for the whole project). This is a really interesting thing to consider. Need to think about it a bit more, but for now, it does look promising.
For example, say you have `big.dvc` and `small.dvc` that both point to `data`. Then, if we lift this current restriction, we would be able to do neat things like
dvc checkout big.dvc
./myscript.sh data
dvc checkout small.dvc
./myscript.sh data
(note that this also effectively replaces the currently hidden `dvc tag` feature)
but when you try to
dvc checkout
it should probably raise an error, because `small.dvc` and `big.dvc` overlap, so the content of `data` would depend on the order of the underlying linking.
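The distinction between a targeted and a bare checkout can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the dict of dvc-file name to output paths is a made-up structure rather than DVC's internal representation, `find_overlapping_outputs` is a hypothetical helper, and only exact path matches are treated as overlaps (a real check would also have to consider nested paths).

```python
from collections import defaultdict


def find_overlapping_outputs(stage_outputs):
    """Map each output path claimed by more than one dvc-file to those dvc-files.

    stage_outputs: dict of dvc-file name -> list of output paths
    (illustrative structure, not DVC's internal representation).
    """
    owners = defaultdict(list)
    for stage_file, outs in stage_outputs.items():
        for out in outs:
            owners[out].append(stage_file)
    return {out: files for out, files in owners.items() if len(files) > 1}


stages = {"big.dvc": ["data"], "small.dvc": ["data"], "other.dvc": ["model"]}

# A targeted checkout involves a single dvc-file, so nothing can collide.
targeted = find_overlapping_outputs({"big.dvc": stages["big.dvc"]})

# A bare checkout involves every dvc-file, so the shared output is ambiguous.
everything = find_overlapping_outputs(stages)
print(everything)  # {'data': ['big.dvc', 'small.dvc']}
```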
Currently we really only use the API, but that is partially because a lot of the cli primitives are so painfully slow. Some simple things would be nice to do without having to drop into python and script it.
The bulk of our directory structure is organized by "scene", which is a logical unit of video and image data corresponding to a point in time. There are 100s of scenes and each scene has a few dozen files associated with it. Oftentimes data associated with a scene will be associated with a particular model, and that file type across many scenes is updated at once. This workflow is managed by our API, but sometimes we want to change a typo in a single file, where the workflow should just be "dvc add " or "dvc checkout ". If we know the specific file being added or checked out, I don't think these DAG checks are buying us anything.
> Currently we really only use the API, but that is partially because a lot of the cli primitives are so painfully slow. Some simple things would be nice to do without having to drop into python and script it.

@danfischetti Were you able to pinpoint the parts that are slow for you there compared to the API? Also, have you tried the 0.65.0 CLI? It should be pretty similar to the API these days. We've improved the startup time quite significantly in recent versions, so the gap between CLI and API should have shrunk.
> The bulk of our directory structure is organized by "scene", which is a logical unit of video and image data corresponding to a point in time. There are 100s of scenes and each scene has a few dozen files associated with them, oftentimes data associated with a scene will be associated with a particular model, and that file type across many scenes is updated at once. This workflow is managed by our API, but sometimes we want to change a typo in a single file, where the workflow should just be "dvc add " or "dvc checkout ". If we know the specific file being added or checked out i dont think these DAG checks are buying us anything.

Thanks for clarifying! Would that monkeypatch workaround be suitable for you for now? In the meantime, we'll consider the idea from https://github.com/iterative/dvc/issues/2671#issuecomment-546518272 , as it has the potential to be a simple, effective and, most of all, correct solution for all of us. 🙂
Thank you so much for the great feedback! 🙂
I have tried the 0.65.0 CLI. The reason it's slower than the API is that we're skipping the high-level "add" operation and are manually creating and saving Stage objects. Otherwise the API would be just as slow due to the `collect_stages` call.
Yes, I think the idea mentioned in that comment would work. DAG checks are totally appropriate when doing a top-level checkout; only skipping them when a specific file is requested would suit our needs.
Discussed this offline with @efiop and @dmpetrov, and 1-1 with @shcheklein, and there is no consensus on lifting DAG checks even for the `dvc add` command. The core consideration is that people should be able to:
git clone ... && dvc pull
and continue their own or their teammates' work without any complications. Ultimately we want both correctness (as it is enforced now) and performance.
I'll do some research on ways to optimize or cache this.
So here is my benching so far (21k add stages):
| task | time |
| ------------------- | ----------- |
| list | 0.65s |
| list + mtime/size | 0.85s |
| list + read | 1.16s |
| parse yamls | 48.5s |
| create stages | 69.6s |
| stages (no schema) | 59.0s |
| build graph | 69.6s |
The majority of time is taken by 2 things:
- parsing YAML
- validating the schema
The rest is split between path manipulations, deepcopies, outs and deps creation mostly.
@Suor thanks, great summary! We need to get rid of effectively all of the last 4 to make it usable.
Switching from `ruamel.yaml` back to `PyYAML` cuts parsing time in half: 24.6s instead of 48.5s. But that doesn't preserve comments, so stages can't be safely dumped.
@shcheklein if we cache stages then building graph is not the issue. The issues that remain:
@Suor cache might be a solution. But it still takes time to build it. We'll need to do checks to ensure that it's still valid in case someone manually changes DVC-file. We'll have to think about things like atomicity, etc.
@shcheklein we can cache either by a `(filename, mtime, size)` tuple or even by file contents (reading which is fast enough), so someone manually changing a DVC-file is not an issue.
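A minimal sketch of what caching by `(filename, mtime, size)` could look like, assuming an in-memory cache and an arbitrary `parse` callable. `StageCache` and its methods are hypothetical names for illustration, not DVC code, and persistence and atomicity are deliberately left out:

```python
import os
import tempfile


class StageCache:
    """Cache parsed results keyed by (path, mtime, size)."""

    def __init__(self, parse):
        self._parse = parse   # expensive function: text -> parsed object
        self._entries = {}    # path -> (key, parsed)
        self.misses = 0

    def load(self, path):
        st = os.stat(path)
        key = (path, st.st_mtime_ns, st.st_size)
        cached = self._entries.get(path)
        if cached is not None and cached[0] == key:
            return cached[1]  # file unchanged, skip parsing
        self.misses += 1
        with open(path, encoding="utf-8") as fobj:
            parsed = self._parse(fobj.read())
        self._entries[path] = (key, parsed)
        return parsed


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.dvc")
    with open(path, "w", encoding="utf-8") as fobj:
        fobj.write("md5: abc\n")
    cache = StageCache(parse=str.splitlines)
    first = cache.load(path)
    second = cache.load(path)        # served from the cache
    with open(path, "w", encoding="utf-8") as fobj:
        fobj.write("md5: abcdef\n")  # a manual edit changes the size
    third = cache.load(path)         # key changed, so the file is re-parsed
    misses = cache.misses
```

A manual edit that leaves both mtime and size unchanged would slip through this key, which is where keying by file contents would be the stricter option.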
Another thing I see is that we are using a pure-Python YAML parser. PyYAML somewhat supports wrapping libyaml, which should speed things up. Here is how you install it though:
python setup.py --with-libyaml install
So no luck with using it as a dep)
> manually changing

is only one problem in supporting a cache; there will be trickier ones. So a cache might be a solution, but unfortunately quite a complicated one.
So the update on yaml libs:
| library | time |
| -----------------|------ |
| ruamel.yaml | 48.5s |
| PyYAML | 25.6s |
| PyYAML (libyaml) | 3.9s |
To use PyYAML with libyaml on debian based linuxes:
sudo apt install libyaml-dev
pip install PyYAML
So that is achievable via deps. We might want to use this strategy:
- `PyYAML` on read, store the unparsed text
- `ruamel.yaml` on dump: parse the stored text, apply the diff and dump_file

This way we'll make it faster for most scenarios without caching. Rewriting all the stages looks like a rare scenario.
Not sure we can do anything with validation besides caching. This is a single call:
Schema(Stage.SCHEMA).validate(convert_to_unicode(d))
So the only thing we can do besides caching is replacing validation lib altogether.
I don't see how this specific optimization solves the problem, @Suor . But it def complicates all the logic and most likely packaging for different platforms.
It solves `(48.5 - 3.9) / 69.7 ~ 64%` of the problem, even with an empty cache. It will be about 33% of the problem without the libyaml C lib. Both the logic and the packaging complications will be quite limited.
What do you suggest? Implementing cache only?
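The two percentages can be reproduced from the timings quoted in this thread. A small worked check follows; the variable names are illustrative, and the 69.7s total is taken as written in the comment above:

```python
# Timings quoted in this thread, in seconds (21k add stages).
total = 69.7          # full graph collection, as used in the comment above
ruamel = 48.5         # parsing with ruamel.yaml
pyyaml = 25.6         # parsing with pure-Python PyYAML
pyyaml_libyaml = 3.9  # parsing with PyYAML backed by libyaml

# Share of the total run time removed by switching the parser.
saved_with_libyaml = (ruamel - pyyaml_libyaml) / total
saved_without_libyaml = (ruamel - pyyaml) / total

print(f"{saved_with_libyaml:.0%}")     # 64%
print(f"{saved_without_libyaml:.0%}")  # 33%
```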
> 64% of the problem

so, it complicates everything but does not solve the problem

> What do you suggest?

Add an option for people who manage a large number of DVC-files to disable the check for now. There should not be a penalty for them, and we should unblock the workflow. Also, it will give us more information, namely whether there is a potential problem in not performing this check. It should work in < 1s. Then see what we can do: allow this setup in general and/or use a cache or something else.
Benched using voluptuous instead of schema for validation; it works about 13x faster (it precompiles the schema into a Python function). This will strip another 14% of the problem, making it 8x faster combined with the libyaml/PyYAML thing. There are other possible optimizations there.
> so, it complicates everything but does not solve the problem

It's not that black and white. Making it faster will benefit everyone; it will bring it under 1s for whoever is now under 8s, so the problem will be solved for them.
Skipping the check is not that rosy either:
- `dvc add` will remain slow

> It's not that black and white.

I think this specific ticket and use case is black and white indeed. At least, I don't see how suggested optimizations can help. It will be a few minutes to just add a file, right?
It's a good question whether there are other cases with thousands of DVC-files and what the requirements are there. It would answer the question of whether we need to do a middle-ground optimization, with some potential complications in supporting it.
> Skipping check is not that rosy either.

I'm not saying that this is the solution I like; I just don't see how else to unblock the workflow for the team and save us some time to come up with a better one, if possible.
> It will be a few minutes to just add a file, right?

For me it is 1 minute now, before optimizations, so it's like 8 sec after optimizations; I have 21k files. It's much longer for @danfischetti. It took 113s for 5764 stages there; they are probably more complicated than mine.
@danfischetti can you copy paste a yaml of your typical stage here? Also what's your directory structure? Do dvc files spread over the tree?
Btw, guys, how about we start with simply making `dvc checkout some.dvc` not collect all stages? The reason it does that in the first place is that it is trying to clean up old unused links that no longer have DVC-files pointing to them, which actually only makes sense on `dvc checkout` (without arguments). That should speed up `dvc checkout` for specific targets by a lot. With `dvc add` it is not that simple, as we are creating a new DVC stage there, which risks colliding with other stages, so we will have to optimize DAG collection there :slightly_frowning_face:
For the record: created https://github.com/iterative/dvc/pull/2750 to disable DAG checks for `dvc checkout` for specific targets.
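The idea of skipping the full scan for targeted checkouts can be sketched roughly as follows. This is not the code from the PR; `stages_for_checkout`, `load_stage` and `collect_all_stages` are hypothetical names used only for illustration.

```python
def stages_for_checkout(targets, load_stage, collect_all_stages):
    """Return (stages, cleanup): what to check out and whether to clean stale links."""
    if targets:
        # Specific targets: load only those DVC-files and skip the full scan.
        return [load_stage(target) for target in targets], False
    # No targets: a full collection is needed to find links that no
    # DVC-file points to anymore.
    return collect_all_stages(), True


# Stand-ins for the real loaders, recording how often the full scan runs.
full_scans = []


def load_stage(path):
    return {"path": path}


def collect_all_stages():
    full_scans.append(1)
    return [{"path": "a.dvc"}, {"path": "b.dvc"}]


stages, cleanup = stages_for_checkout(["some.dvc"], load_stage, collect_all_stages)
```

In this sketch the expensive collection only runs when no targets are passed, which matches the observation that cleanup of unused links only makes sense for a checkout without arguments.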
So after all optimizations merged we have 8.8s instead of 69.6s to collect graph for 21k simple stages, it is a 7.9x speedup. Here is what takes time:
| task | time | delta | what is added to prev line |
|----------------|------|-------|-------------------------------------|
| list + read | 1.1s | | (includes startup time) |
| ... + parse | 3.1s | +2.0s | PyYAML/libyaml parsing |
| ... + validate | 4.5s | +1.2s | validation/coercion with voluptuous |
| collect stages | 7.2s | +2.7s | stage/dep/out object creation |
| check stages | 8.2s | +1.0s | check dups, overlaps, etc |
| collect graph | 8.8s | +0.6s | graph creation (incl. import nx and cycle check) |
I would say even if we cache graph we can get at best 1.5s, if we cache stages individually - 2.5s.
Since the issue for topic starter is not urgent I suggest stopping with this for now.
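The per-step costs and the overall speedup can be rechecked from the cumulative column of the table above. This is a small sketch using only the numbers quoted in the comment. Computed this way, the validate step comes out as 1.4s rather than the +1.2s listed, so one of those two figures is presumably a typo in the original table.

```python
# Cumulative timings (seconds) copied from the table above.
cumulative = [
    ("list + read", 1.1),
    ("... + parse", 3.1),
    ("... + validate", 4.5),
    ("collect stages", 7.2),
    ("check stages", 8.2),
    ("collect graph", 8.8),
]

# The cost of each step is the difference between consecutive cumulative values.
deltas = {}
previous = 0.0
for task, total in cumulative:
    deltas[task] = round(total - previous, 1)
    previous = total

# Before / after for collecting the graph of 21k simple stages.
speedup = round(69.6 / 8.8, 1)

for task, delta in deltas.items():
    print(f"{task:<15} +{delta:.1f}s")
print(f"speedup: {speedup}x")
```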
@danfischetti could you please give it a try? does it solve the problem with your repo?
Closing due to inactivity.
I have also encountered this issue
dvc version
DVC version: 0.82.8
Python version: 3.7.5
Platform: Darwin-18.7.0-x86_64-i386-64bit
Binary: True
Package: osxpkg
Filesystem type (workspace): ('apfs', '/dev/disk1s1')
For the record: @tushar-dadlani is running into `dvc pull something.dvc` collecting all of the stages for too long without any progress bar or anything. We need to at least add a progress bar, but should also consider not collecting stages when we are already given a specific stage as a target.
@efiop we can also consider some further optimizations to stage collection. We stopped this because the initial user was non-responsive.
The issue is that we need to stick more closely to a real problem, since in my test scenario it's generally OK.
@tushar-dadlani the thing is, this works well for my test scenario. I may try to invent new ones, but it would make much more sense to look at your case. Can you provide a cut-down, anonymized copy of your repo:
cd <your-repo>
mkdir ../repo-copy; cp -r * ../repo-copy # skipping .dvc and .git here
cd ../repo-copy
find . -type f -not -name \*.dvc -exec sh -c 'echo ERASED > {}' \;
cd ..
tar czvf repo.tar.gz repo-copy/; rm -rf repo-copy
Then attach repo.tar.gz here or to the Slack channel. This will replicate all the dir/file structure as well as stages and pipelines, which should be enough to reproduce and optimize it.
Great point @tushar-dadlani! Thanks for providing the test repo 🙏 We are able to reproduce the problem and are looking into it right now.
So my implementation using tries gives:
Tries:
654.77 ms in collect stages
26.63 ms in dups/overlaps
19.35 ms in stages in outs
188.29 ms in build graph
370.05 ms in check check_acyclic
1.45 s in _collect_graph(Repo: '/home/suor/proj...)
Old code:
650.43 ms in collect stages
27.48 ms in dups/overlaps
35.23 ms in stages in outs
3.02 s in build graph
400.53 ms in check acyclic
4.33 s in _collect_graph(Repo: '/home/suor/proj...)
for 1320 stages. Will test on @tushar-dadlani's data and create a PR.
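The thread does not show the trie-based implementation, so the following is only a sketch of the general technique under stated assumptions. Output paths are stored in a trie keyed by path components, and each dependency is resolved with a single walk down the trie instead of a comparison against every output. `PathTrie`, `build_edges` and the sample stages are hypothetical, and the sketch only covers dependencies that equal an output or lie inside an output directory.

```python
from pathlib import PurePosixPath


class PathTrie:
    """Trie keyed by path components; a value sits on the node ending a path."""

    def __init__(self):
        self.children = {}
        self.value = None

    def insert(self, path, value):
        node = self
        for part in PurePosixPath(path).parts:
            node = node.children.setdefault(part, PathTrie())
        node.value = value

    def prefixes(self, path):
        """Yield values stored at `path` or at any of its parent directories."""
        node = self
        for part in PurePosixPath(path).parts:
            node = node.children.get(part)
            if node is None:
                return
            if node.value is not None:
                yield node.value


def build_edges(stages):
    """Return (consumer, producer) pairs for `stages`: name -> (deps, outs)."""
    outs = PathTrie()
    for name, (_deps, stage_outs) in stages.items():
        for out in stage_outs:
            outs.insert(out, name)
    edges = set()
    for name, (deps, _outs) in stages.items():
        for dep in deps:
            # One walk down the trie per dep instead of a scan over all outs.
            for producer in outs.prefixes(dep):
                edges.add((name, producer))
    return edges


# Hypothetical three-stage pipeline, for illustration only.
stages = {
    "prepare.dvc": ([], ["data/clean"]),
    "train.dvc": (["data/clean/train.csv"], ["model.pkl"]),
    "eval.dvc": (["model.pkl", "data/clean/test.csv"], ["metrics.json"]),
}
edges = build_edges(stages)
```

Each lookup costs roughly the depth of the path rather than the number of outputs, which is the kind of change that would explain the drop in the "build graph" timing.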
This is what I have for @tushar-dadlani's repo:
26.49 s in collect stages
1.32 s in dups/overlaps
782.49 ms in stages in outs
7.23 s in build graph
18.80 s in check acyclic
54.83 s in _collect_graph(Repo: '/home/suor/proj...)
Building the graph is no longer the biggest cost. The only way to make it fast is probably caching. Making some commands avoid building the graph and doing a full collection is also a good idea.
Old code:
25.61 s in collect stages
1.72 s in dups/overlaps
1.74 s in stages in outs
^C 2997.81 s in build graph # interrupted
3027.11 s in _collect_graph(Repo: '/home/suor/proj...)
@danfischetti @tushar-dadlani Guys, please also take a look at https://github.com/iterative/dvc/pull/3490 , which attempts to introduce the hack that @danfischetti originally asked for.