Tensorboard: Tensorboard Projector - cosine distance "Nearest points in the original space" not correct

Created on 12 Jul 2019 · 12 Comments · Source: tensorflow/tensorboard

Environment information (required)

``````
--- check: autoidentify
INFO: diagnose_tensorboard.py version 393931f9685bd7e0f3898d7dcdf28819fef54c43

--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=6, micro=8, releaselevel='final', serial=0)
INFO: os.name: nt
INFO: os.uname(): N/A
INFO: sys.getwindowsversion(): sys.getwindowsversion(major=10, minor=0, build=17763, platform=2, service_pack='')

--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: None

--- check: installed_packages
INFO: installed: tensorboard==1.13.1
INFO: installed: tensorflow-gpu==1.13.1
INFO: installed: tensorflow==1.14.0
WARNING: conflicting installations: ['tensorflow', 'tensorflow-gpu']
INFO: installed: tensorflow-estimator==1.13.0

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '1.13.1'

--- check: tensorflow_python_version
INFO: tensorflow.__version__: '1.13.1'
INFO: tensorflow.__git_version__: "b'v1.13.1-0-g6612da8951'"

--- check: tensorboard_binary_path
INFO: which tensorboard: b'F:\Desktop\Thesis\Python3.6\Scripts\tensorboard.exe\r\n'

--- check: readable_fqdn
INFO: socket.getfqdn(): 'DESKTOP-LD8UUFN.home'

--- check: stat_tensorboardinfo
INFO: directory: C:\Users\josch\AppData\Local\Temp.tensorboard-info
INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=61361544923004089, st_dev=3506408066, st_nlink=1, st_uid=0, st_gid=0, st_size=24576, st_atime=1562950451, st_mtime=1562950451, st_ctime=1560964117)
INFO: mode: 0o40777

--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['F:\Python3.6\lib\site-packages']; bad_roots (0): []

--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py==0.7.1
astor==0.8.0
attrs==19.1.0
backcall==0.1.0
bleach==3.1.0
boto==2.49.0
boto3==1.9.171
botocore==1.12.171
certifi==2019.6.16
chardet==3.0.4
colorama==0.4.1
cycler==0.10.0
decorator==4.4.0
defusedxml==0.6.0
docutils==0.14
entrypoints==0.3
gast==0.2.2
gensim==3.7.3
google-pasta==0.1.7
grpcio==1.21.1
h5py==2.9.0
idna==2.8
ipykernel==5.1.1
ipython==7.5.0
ipython-genutils==0.2.0
ipywidgets==7.4.2
jedi==0.13.3
Jinja2==2.10.1
jmespath==0.9.4
joblib==0.13.2
jsonschema==3.0.1
jupyter==1.0.0
jupyter-client==5.2.4
jupyter-console==6.0.0
jupyter-core==4.5.0
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
kiwisolver==1.1.0
Markdown==3.1.1
MarkupSafe==1.1.1
matplotlib==3.1.0
mistune==0.8.4
mock==3.0.5
nbconvert==5.5.0
nbformat==4.4.0
notebook==5.7.8
numpy==1.16.4
pandas==0.24.2
pandocfilters==1.4.2
parso==0.4.0
pickleshare==0.7.5
pip==18.1
prometheus-client==0.7.0
prompt-toolkit==2.0.9
protobuf==3.8.0
Pygments==2.4.2
pyparsing==2.4.0
pyrsistent==0.15.2
python-dateutil==2.8.0
pytz==2019.1
pywinpty==0.5.5
pyzmq==18.0.1
qtconsole==4.5.1
requests==2.22.0
s3transfer==0.2.1
scikit-learn==0.21.2
scipy==1.3.0
Send2Trash==1.5.0
setuptools==41.0.1
six==1.12.0
sklearn==0.0
smart-open==1.8.4
tensorboard==1.13.1
tensorflow==1.14.0
tensorflow-estimator==1.13.0
tensorflow-gpu==1.13.1
termcolor==1.1.0
terminado==0.8.2
testpath==0.4.2
tornado==6.0.2
traitlets==4.3.2
urllib3==1.25.3
wcwidth==0.1.7
webencodings==0.5.1
Werkzeug==0.15.4
wheel==0.33.4
widgetsnbextension==3.4.2
wrapt==1.11.2
xlrd==1.2.0
``````

Issue description

I am currently visualizing word embeddings (shape=60,300) from my TensorFlow model in the TensorBoard Projector, and I am having trouble with the cosine distance.

The displayed distances distort the results and don't match the real cosine distances.

This was a test run with different category embeddings:

TensorBoard: (image not preserved)
sklearn: (image not preserved)

Both use the same data and the results are not even close.

Is TensorBoard reducing the dimensions of the vectors, so that the label "Nearest points in the original space" is incorrect?

projector usability bug

Source

JoschuaXner

👍2

All 12 comments

cc @dsmilkov @nsthorat I think this functionality has been in the projector for a while, so I doubt it's changed lately. Any thoughts?

nfelt on 16 Jul 2019

Hi!

Yes, the functionality hasn't changed. Couple of notes:

- To make fast projections, the Projector projects high-dimensional data down to 200 dimensions (randomly chosen). Is 60,300 the dimensionality of your data, or the number of points? This random projection could lead to loss of information.
- Make sure to turn off Sphereize Data (checkbox in the left panel), which shifts the points and makes them unit norm. While this affects the absolute values of the cosine distances, it shouldn't affect the ranking of the neighbors though.

dsmilkov on 20 Jul 2019

👀2

Hey, thanks for your reply.

I wanted to visualize 60 vectors with 300 dimensions each.

I unchecked the checkbox, but that doesn't change the ranking list.

The reduction to 200 dimensions only affects the 3D/2D PCA visualisation, right? And not the cosine distances?

JoschuaXner on 20 Jul 2019
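
One way to probe the question above is to compare the cosine neighbor list in the full space with the list after a reduction to 200 dimensions. The sketch below is only an illustration under stated assumptions: the random data stands in for the real embeddings, `cosine_neighbors` is a hypothetical helper, and "randomly chosen" is interpreted here as keeping a random subset of 200 coordinates, which may not be what the Projector actually does.

```python
import numpy as np

def cosine_neighbors(vectors, query, k):
    """Return the indices of the k rows closest to row `query` by cosine distance."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    distances = 1.0 - unit @ unit[query]
    distances[query] = np.inf  # exclude the query point itself
    return np.argsort(distances)[:k]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(60, 300))  # stand-in for 60 embeddings of 300 dimensions

# Assumption: the reduction keeps a random subset of 200 of the 300 coordinates.
kept = rng.choice(300, size=200, replace=False)
reduced = vectors[:, kept]

full_nn = cosine_neighbors(vectors, query=0, k=5)
reduced_nn = cosine_neighbors(reduced, query=0, k=5)

# Fraction of the top-5 neighbors shared by both lists.
overlap = len(set(full_nn) & set(reduced_nn)) / 5
print("full:", full_nn, "reduced:", reduced_nn, "overlap:", overlap)
```

If the two lists differ noticeably on the real embeddings, a dimensionality reduction before the neighbor search would be a plausible explanation for the mismatch; if they agree, the cause is more likely elsewhere.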

I got the expected results for my data with sklearn by computing the cosine distances in the original 300-dimensional vector space and then using the distance matrix to reduce the 300 dimensions to 2 with t-SNE, so I can visualize it with matplotlib.

I was able to reproduce these results with different training cycles.

So either TensorBoard Projector doesn't calculate the cosine distances correctly, or the term "Nearest points in the original space" is wrong/misleading.

JoschuaXner on 20 Jul 2019
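
For readers who want a reference ranking to compare against the Projector's neighbor list, the following is a minimal sketch of a cosine-distance ranking in the original space using only NumPy. The function names, the toy vectors, and the labels are illustrative assumptions, not code from the issue; on real data the distance matrix should agree with `sklearn.metrics.pairwise.cosine_distances` up to floating-point error.

```python
import numpy as np

def cosine_distance_matrix(vectors):
    """Pairwise cosine distances (1 - cosine similarity) between all rows."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T

def rank_neighbors(vectors, labels, query_label, k=5):
    """Return the k nearest labels to `query_label` with their cosine distances."""
    dist = cosine_distance_matrix(vectors)
    q = labels.index(query_label)
    order = [i for i in np.argsort(dist[q]) if i != q][:k]
    return [(labels[i], float(dist[q, i])) for i in order]

# Toy data (hypothetical): four 2-D vectors with hand-checkable angles.
vectors = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]])
labels = ["a", "b", "c", "d"]

ranking = rank_neighbors(vectors, labels, "a", k=3)
print(ranking)  # "b" (45 degrees) first, then "c" (90 degrees), then "d" (180 degrees)
```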

Hi,
Does anyone know how Sphereize Data is calculated?
When you hover over Sphereize Data, it says:

The data is normalized by shifting each point by the centroid and making it unit norm

How is the centroid calculated?
Which feature space is used: the original or the PCA one?

Thanks for the help.

mbenami on 11 Aug 2019

Hi, the space is the original embedding (not the PCA). The unit normalization is done by dividing each component of the embedding by the magnitude of that embedding.

dsmilkov on 13 Aug 2019

Thanks, @dsmilkov.
How is the centroid for shifting each point calculated?
How many centroids are there? One for each bit in the embedding space?

mbenami on 18 Aug 2019

There is only one centroid vector (k-dimensional) for a k-dim embedding space. It's computed by averaging all the embeddings.

dsmilkov on 19 Aug 2019

Thanks.
So the calculation on N vectors will be:

centroid_vector = np.mean(N_vectors, axis=-1)
N_vectors = N_vectors - centroid_vector
N_vectors = N_vectors / np.linalg.norm(N_vectors,axis=-1)

am I right?

mbenami on 19 Aug 2019

Yes, with a tiny detail that you will have to call np.linalg.norm(N_vectors,axis=-1,keepdims=True) so the division broadcasting works in the last line of code.

dsmilkov on 19 Aug 2019

👍4

> Hi!
>
> Yes, the functionality hasn't changed. Couple of notes:
>
> To make fast projections, the Projector projects high-dimensional data down to 200 dimensions (randomly chosen). Is 60,300 the dimensionality of your data, or the number of points? This random projection could lead to loss of information.
>
> Make sure to turn off Sphereize Data (checkbox in the left panel), which shifts the points and makes them unit norm. While this affects the absolute values of the cosine distances, it shouldn't affect the ranking of the neighbors though.

@dsmilkov, I am afraid that the Sphereize Data option does change the ranking by Euclidean distance, based on the dataset (256 dimensions) I tested.

Tensorboard version: 2.1.0

FangliangBai on 7 Mar 2020

👍2
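
The observation about Euclidean distance can be reproduced on a tiny hand-made example. The sketch below assumes that sphereizing means "subtract the centroid, then scale each point to unit norm", as described earlier in the thread; `sphereize` and `nearest_euclidean` are hypothetical helpers, and the four 2-D points are chosen so that the centroid is exactly the origin and only the normalization step matters.

```python
import numpy as np

def sphereize(vectors):
    """Shift by the centroid (mean over points), then scale each row to unit norm."""
    centered = vectors - vectors.mean(axis=0)
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)

def nearest_euclidean(vectors, query):
    """Index of the row closest to row `query` by Euclidean distance."""
    dist = np.linalg.norm(vectors - vectors[query], axis=1)
    dist[query] = np.inf  # exclude the query point itself
    return int(np.argmin(dist))

# The coordinates sum to zero in each column, so the centroid is (0, 0).
points = np.array([[1.0, 0.0], [3.0, 0.0], [1.0, 1.0], [-5.0, -1.0]])

before = nearest_euclidean(points, 0)            # point 2 is at distance 1
after = nearest_euclidean(sphereize(points), 0)  # points 0 and 1 both become (1, 0)
print(before, after)
```

Unit-normalizing discards the lengths of the vectors, so two points in the same direction collapse onto each other and the Euclidean neighbor order can change. Normalization by itself leaves cosine similarity unchanged; the centroid shift, when the centroid is not at the origin, changes the angles between points and can therefore affect cosine rankings as well.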

@mbenami

I know it's been almost a year, but I just stumbled on this thread and I don't think centroid_vector is computed correctly. If your N_vectors has shape (# of vectors, dimension), then it should be axis=0, not axis=-1, when computing the centroid, e.g.

import numpy as np

embeddings = np.random.rand(100, 768)

# Correct: averaging over the vectors (axis=0) gives a k-dimensional centroid
# vector for a k-dimensional vector space
centroid = np.mean(embeddings, axis=0)
assert centroid.shape == (768,)

# Incorrect: axis=-1 averages within each vector and yields one scalar per vector
centroid = np.mean(embeddings, axis=-1)
assert centroid.shape == (100,)
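
Putting the two corrections from this thread together (the centroid over `axis=0` and `keepdims=True` for the norm), the whole calculation can be sketched as follows. This is an illustration of the description given in the thread, not the Projector's actual source code, and the random `embeddings` array is placeholder data.

```python
import numpy as np

embeddings = np.random.default_rng(0).random((100, 768))  # (number of vectors, dimension)

# One k-dimensional centroid: average over the vectors, not over the components.
centroid = embeddings.mean(axis=0)                         # shape (768,)
centered = embeddings - centroid                           # broadcasts across rows

# keepdims=True keeps the norms as a column so the division broadcasts row-wise.
norms = np.linalg.norm(centered, axis=-1, keepdims=True)   # shape (100, 1)
sphereized = centered / norms

# Without keepdims the norms have shape (100,), which cannot broadcast
# against an array of shape (100, 768).
try:
    centered / np.linalg.norm(centered, axis=-1)
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

print(centroid.shape, norms.shape, broadcast_failed)
```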