Vscode-jupyter: In telemetry data, hash all imports and not just known ones so that in the future new packages can be found.

Created on 20 Mar 2019 · 11 comments · Source: microsoft/vscode-jupyter

Right now we return the package name for imports if it comes from a known list.

This doesn't work if we want to look for new ones in the future.

Instead this should use a hash of the package name.
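
For illustration, a minimal sketch of what that could look like on the TypeScript side (the regex, function names, and choice of SHA-256 are assumptions, not the extension's actual code):

```typescript
import { createHash } from 'crypto';

// Hash a package name so telemetry never carries the raw string.
function hashPackageName(packageName: string): string {
    return createHash('sha256').update(packageName).digest('hex');
}

// Scan up to the first 1000 lines for `import X` / `from X import ...`
// statements and return the hashes of every imported package name.
function hashImports(text: string, maxLines = 1000): string[] {
    const importRe = /^\s*(?:import\s+(\w+)|from\s+(\w+)\s+import\b)/;
    const hashes = new Set<string>();
    for (const line of text.split(/\r?\n/).slice(0, maxLines)) {
        const match = importRe.exec(line);
        const pkg = match?.[1] ?? match?.[2];
        if (pkg) {
            hashes.add(hashPackageName(pkg));
        }
    }
    return [...hashes];
}
```

On the analysis side, new packages then show up as unfamiliar hashes that can be matched later once a package name becomes known.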


All 11 comments

Some thoughts on perf:

  • Getting all the document content into memory, splitting it into lines, and processing each line one by one can be slow and CPU intensive (regex and hashing).

    • Yes, it might not be much, but the more files you have open, the larger the files, and the more frequently this is done, the worse it gets.

    • Assume a user has auto save on; now every few seconds we're performing this operation (splitting the text, regex processing line by line).

    • All of this happens in the main thread that all other extensions share.

  • Event handlers are not debounced

    • With auto save on, as and when you make changes, all of these events will queue up.

    • I.e. if we make changes to the same file 5 times in x seconds, we don't need to process it 5 times. If the user is still making changes, let's wait until they are done (e.g. debounce for 10s), then process the file once (see the sketch below).
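
A minimal sketch of the per-document debounce (the names and the 10s delay are illustrative):

```typescript
// One pending timer per document URI; each new save resets that document's timer,
// so the file is processed only after the user has been idle for the full delay.
const pendingTimers = new Map<string, NodeJS.Timeout>();

function debouncePerDocument(uri: string, process: () => void, delayMs = 10_000): void {
    const existing = pendingTimers.get(uri);
    if (existing) {
        clearTimeout(existing);
    }
    pendingTimers.set(uri, setTimeout(() => {
        pendingTimers.delete(uri);
        process();
    }, delayMs));
}

// e.g. onDidSaveTextDocument(doc => debouncePerDocument(doc.uri.toString(), () => processImports(doc)));
```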

Proposal

  • Perform this (as much as possible) out of proc:

    • Language Server (do everything there; however, not everyone has this enabled)

    • Use jedi (let jedi do the parsing, tokenizing, etc.) via the symbol provider.

      • Then in TS all we do is parse the JSON and handle a much smaller subset of the text.

    • Preferred approach: send the file name(s) to a Python script, let the Python script parse everything, and send back a JSON array of the module names (see the sketch after this list).

      • This way, all text processing (reading the document), parsing, identifying modules, hashing, etc. is done out of proc.
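
A rough sketch of the TypeScript side of the preferred approach (hash_imports.py and the JSON contract are hypothetical):

```typescript
import { execFile } from 'child_process';

// Hand the file path to a Python script and read back a JSON array of hashed
// module names, keeping all parsing, identification, and hashing out of proc.
function getHashedImports(pythonPath: string, scriptPath: string, filePath: string): Promise<string[]> {
    return new Promise((resolve, reject) => {
        execFile(pythonPath, [scriptPath, filePath], (error, stdout) => {
            if (error) {
                reject(error);
                return;
            }
            try {
                resolve(JSON.parse(stdout) as string[]);
            } catch (parseError) {
                reject(parseError);
            }
        });
    });
}
```
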
Impact:

  • As we chew more CPU time, other extensions can get adversely affected.
  • As we chew more CPU time during document save, features in our own extension such as format on save & on-save actions (sort imports on save) can get adversely affected.

    • VSC could end up abandoning those operations if they take more than x ms (we know today that this happens with format on save, as VSC only allocates 750ms for the code to format the text and provide the text edits).

    • I.e. as we chew CPU for other operations, the likelihood of these existing operations timing out increases.

We have to unhash them on the other side to determine whether or not it's a specific package.

https://github.com/Microsoft/vscode-python/blob/master/news/3%20Code%20Health/4852.md

Is this correct? Can we unhash them, or do we compare against hashes of known packages?

There's no unhashing. It's comparing against known hashes. Like the unit test does.
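
In other words, the analysis side is a lookup against precomputed hashes, roughly like this sketch (the package list is illustrative):

```typescript
import { createHash } from 'crypto';

// Precompute the hash of every package we care about, then look up each
// incoming telemetry hash; a miss means a package we don't know about yet.
const knownPackages = ['numpy', 'pandas', 'matplotlib'];

const hashToPackage = new Map(
    knownPackages.map(pkg => [createHash('sha256').update(pkg).digest('hex'), pkg] as const)
);

function identifyPackage(telemetryHash: string): string | undefined {
    return hashToPackage.get(telemetryHash);
}
```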

We should measure the perf impact of this before we go off and design something more complicated. It only looks at the first 1K of lines.

Text is already in memory as far as I can tell, so we don't add any extra overhead by reading it.

Spawning a python process may take more cpu time than applying a regex against 1000 lines of text. I'll measure it to find out though.

@DonJayamanne debouncing sounds like a good idea. No need to queue up a ton of these if the file is just being saved a bunch. I'll add that too while I look at the perf.

> It only looks at the first 1K of lines.

You might want to use the lineAt method of the document object.
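
E.g., a small sketch against the VS Code API, assuming we keep the 1K-line cap:

```typescript
import { TextDocument } from 'vscode';

// Walk the first N lines via lineAt rather than getText().split('\n'),
// avoiding a full copy of the document text.
function* firstLines(document: TextDocument, maxLines = 1000): Generator<string> {
    const count = Math.min(document.lineCount, maxLines);
    for (let i = 0; i < count; i += 1) {
        yield document.lineAt(i).text;
    }
}
```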

> Spawning a python process may take more cpu time than applying a regex against 1000 lines of text

The key difference is that it's out of proc, i.e. the extension process's time isn't used. Chewing extension process time is where we need to be careful.

I didn't mean the time to run the python code, I mean simply the time to call proc.spawn (and hook up the stdio handlers and read from them).

Iterating over 1000 lines is really damn fast. Maybe faster than the amount of code that runs when calling proc.spawn.

I'm profiling now, and for a single save of a 3000-line file, the profiler doesn't even register the time for the import scan. I have to save 3 or 4 times to get anything to show up for it in the profile.

Our code lens parsing is a much bigger part of on save.

And this is with debug bits. Minimized bits are likely even faster but it's hard to tell what the CPU time is from.

Here's a detailed analysis (of debug bits):

[profiler screenshot]

Essentially, most of the time spent doing this work (on 1000 lines), and it's not much time, just 5ms, is in generating the hashes. We could do the hash generation in the Python code, but it doesn't seem worth it.

Reparsing code lenses averages around 24ms per save on this same file.

And there's a Python spawn in the profile: the spawn itself takes 9ms, so this approach is actually faster.

I'll do the debounce though.

  • I think we should still use document.lineAt(<line number/index>).text rather than reading the entire text and splitting the lines; it should be more efficient, even though the difference may be negligible.

  • Would be good to track the total time taken for this processing, as we generally do in most places when tracking telemetry (a sketch follows below).

    • Will also tell us how often this is done (how often we run this code).
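
A minimal sketch of that tracking (the event name and telemetry helper are hypothetical stand-ins for the extension's own utilities):

```typescript
// Stand-in for the extension's telemetry helper.
declare function sendTelemetryEvent(eventName: string, measures: { elapsedMs: number }): void;

// Time the import-hashing pass and report the duration; how often the event
// fires also tells us how frequently this code runs.
function hashImportsWithTiming(hashImports: () => string[]): string[] {
    const started = Date.now();
    const hashes = hashImports();
    sendTelemetryEvent('HASHED_PACKAGE_IMPORTS', { elapsedMs: Date.now() - started });
    return hashes;
}
```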