``````
--- check: autoidentify
INFO: diagnose_tensorboard.py version 393931f9685bd7e0f3898d7dcdf28819fef54c43
--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=6, micro=8, releaselevel='final', serial=0)
INFO: os.name: nt
INFO: os.uname(): N/A
INFO: sys.getwindowsversion(): sys.getwindowsversion(major=10, minor=0, build=17763, platform=2, service_pack='')
--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: None
--- check: installed_packages
INFO: installed: tensorboard==1.13.1
INFO: installed: tensorflow-gpu==1.13.1
INFO: installed: tensorflow==1.14.0
WARNING: conflicting installations: ['tensorflow', 'tensorflow-gpu']
INFO: installed: tensorflow-estimator==1.13.0
--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '1.13.1'
--- check: tensorflow_python_version
INFO: tensorflow.__version__: '1.13.1'
INFO: tensorflow.__git_version__: "b'v1.13.1-0-g6612da8951'"
--- check: tensorboard_binary_path
INFO: which tensorboard: b'F:\Desktop\Thesis\Python3.6\Scripts\tensorboard.exe\r\n'
--- check: readable_fqdn
INFO: socket.getfqdn(): 'DESKTOP-LD8UUFN.home'
--- check: stat_tensorboardinfo
INFO: directory: C:\Users\josch\AppData\Local\Temp.tensorboard-info
INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=61361544923004089, st_dev=3506408066, st_nlink=1, st_uid=0, st_gid=0, st_size=24576, st_atime=1562950451, st_mtime=1562950451, st_ctime=1560964117)
INFO: mode: 0o40777
--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['F:\Python3.6\lib\site-packages']; bad_roots (0): []
--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py==0.7.1
astor==0.8.0
attrs==19.1.0
backcall==0.1.0
bleach==3.1.0
boto==2.49.0
boto3==1.9.171
botocore==1.12.171
certifi==2019.6.16
chardet==3.0.4
colorama==0.4.1
cycler==0.10.0
decorator==4.4.0
defusedxml==0.6.0
docutils==0.14
entrypoints==0.3
gast==0.2.2
gensim==3.7.3
google-pasta==0.1.7
grpcio==1.21.1
h5py==2.9.0
idna==2.8
ipykernel==5.1.1
ipython==7.5.0
ipython-genutils==0.2.0
ipywidgets==7.4.2
jedi==0.13.3
Jinja2==2.10.1
jmespath==0.9.4
joblib==0.13.2
jsonschema==3.0.1
jupyter==1.0.0
jupyter-client==5.2.4
jupyter-console==6.0.0
jupyter-core==4.5.0
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
kiwisolver==1.1.0
Markdown==3.1.1
MarkupSafe==1.1.1
matplotlib==3.1.0
mistune==0.8.4
mock==3.0.5
nbconvert==5.5.0
nbformat==4.4.0
notebook==5.7.8
numpy==1.16.4
pandas==0.24.2
pandocfilters==1.4.2
parso==0.4.0
pickleshare==0.7.5
pip==18.1
prometheus-client==0.7.0
prompt-toolkit==2.0.9
protobuf==3.8.0
Pygments==2.4.2
pyparsing==2.4.0
pyrsistent==0.15.2
python-dateutil==2.8.0
pytz==2019.1
pywinpty==0.5.5
pyzmq==18.0.1
qtconsole==4.5.1
requests==2.22.0
s3transfer==0.2.1
scikit-learn==0.21.2
scipy==1.3.0
Send2Trash==1.5.0
setuptools==41.0.1
six==1.12.0
sklearn==0.0
smart-open==1.8.4
tensorboard==1.13.1
tensorflow==1.14.0
tensorflow-estimator==1.13.0
tensorflow-gpu==1.13.1
termcolor==1.1.0
terminado==0.8.2
testpath==0.4.2
tornado==6.0.2
traitlets==4.3.2
urllib3==1.25.3
wcwidth==0.1.7
webencodings==0.5.1
Werkzeug==0.15.4
wheel==0.33.4
widgetsnbextension==3.4.2
wrapt==1.11.2
xlrd==1.2.0
``````
I am currently visualizing word embeddings (shape=60,300) from my TensorFlow model in the TensorBoard Projector and I am having trouble with the cosine distance.
The displayed distances distort the results and don't match the real cosine distances.
This was a test run with different category embeddings:
TensorBoard:

sklearn:

Both use the same data and the results are not even close.
Is TensorBoard reducing the dimensionality of the vectors, so that the label "Nearest points in the original space" is incorrect?
cc @dsmilkov @nsthorat I think this functionality has been in the projector for a while, so I doubt it's changed lately. Any thoughts?
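Hi!
Yes, the functionality hasn't changed. Couple of notes:
- To make fast projections, the Projector projects high-dimensional data down to 200 dimensions (randomly chosen). Is 60,300 the dimensionality of your data, or the number of points? This random projection could lead to loss of information.
- Make sure to turn off Sphereize Data (checkbox in the left panel), which shifts the points and makes them unit norm. While this affects the absolute values of the cosine distances, it shouldn't affect the ranking of the neighbors though.
Hey, thanks for your reply.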
I wanted to visualize 60 vectors with 300 dimensions each.
I unchecked the checkbox, but that doesn't change the ranking list.
The reduction to 200 dimensions only affects the 3D/2D PCA visualisation, right? And not the cosine distances?
I got the expected results for my data with sklearn by computing the cosine distances in the original 300-dimensional vector space and using the distance matrix to reduce the 300 dimensions to 2 with t-SNE, so I can visualize it with matplotlib.
I was able to reproduce these results with different training cycles.
So either the TensorBoard Projector doesn't calculate the cosine distances correctly, or the term "nearest points in the original space" is wrong or misleading.
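For anyone who wants to reproduce the comparison outside the Projector, here is a minimal NumPy-only sketch of a cosine nearest-neighbor ranking in the original space. It is not the Projector's code and not the sklearn script used above. The `embeddings` array is random placeholder data, the helper names are made up for illustration, and reading "randomly chosen" as "keep 200 of the 300 columns" is only an assumption about what the Projector does.

```python
import numpy as np


def cosine_distance_matrix(vectors):
    """Pairwise cosine distances (1 - cosine similarity) between rows."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T


def nearest_neighbors(vectors, query_idx, k=5):
    """Indices of the k rows closest to row query_idx, excluding itself."""
    dist = cosine_distance_matrix(vectors)[query_idx]
    order = np.argsort(dist, kind="stable")
    return [int(i) for i in order if i != query_idx][:k]


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(60, 300))  # placeholder for the real 60 x 300 data

# Ranking computed from all 300 dimensions.
full = nearest_neighbors(embeddings, query_idx=0)

# Assumption: "randomly chosen" means keeping 200 of the 300 columns.
kept = rng.choice(300, size=200, replace=False)
reduced = nearest_neighbors(embeddings[:, kept], query_idx=0)

print("original space:    ", full)
print("200 random columns:", reduced)
```

If the two printed lists differ for real data, the dimensionality reduction alone would be enough to explain a ranking that does not match the full-space one.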
Hi
Does someone know how the Sphereize data calculated?
when you hover on the Sphereize data its says
The data is normalized by shifting each point by the centroid and making
it unit norm
how the centroid point has been calculated?
what is the space of the features? the original or PCA?
thanks for the help
Hi, the space is the original embedding (not the PCA). The unit normalization is done by dividing each component of the embedding by the magnitude of that embedding.
Thanks, @dsmilkov
How is the centroid used for shifting each point calculated?
How many centroids are there? One for each bit in the embedding space?
There is only one centroid vector (k-dimensional) for a k-dim embedding space. It's computed by averaging all the embeddings.
Thanks
So the calculation on N vectors would be:
centroid_vector = np.mean(N_vectors, axis=-1)
N_vectors = N_vectors - centroid_vector
N_vectors = N_vectors / np.linalg.norm(N_vectors,axis=-1)
Am I right?
Yes, with a tiny detail that you will have to call np.linalg.norm(N_vectors,axis=-1,keepdims=True) so the division broadcasting works in the last line of code.
Hi!
Yes, the functionality hasn't changed. Couple of notes:
- To make fast projections, the Projector projects high-dimensional data down to 200 dimensions (randomly chosen). Is 60,300 the dimensionality of your data, or the number of points? This random projection could lead to loss of information.
- Make sure to turn off Sphereize Data (checkbox in the left panel), which shifts the points and makes them unit norm. While this affects the absolute values of the cosine distances, it shouldn't affect the ranking of the neighbors though.
@dsmilkov, I am afraid that the Sphereize Data option does change the ranking by Euclidean distance, based on the dataset (256 dimensions) I tested.
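A small sketch of why this can happen, using hand-picked 2D toy points rather than the 256-dimensional dataset mentioned above. The `sphereize` helper follows the recipe described in this thread (subtract the mean point, then scale to unit norm) and is an assumption about the Projector's behavior, not its actual code. `neighbor_order` is a hypothetical helper written for this illustration.

```python
import numpy as np


def sphereize(points):
    # Assumed recipe from this thread: subtract the mean point,
    # then scale every row to unit norm.
    centered = points - points.mean(axis=0)
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)


def neighbor_order(points, query_idx, metric):
    """Other row indices sorted from nearest to farthest from row query_idx."""
    q = points[query_idx]
    if metric == "cosine":
        norms = np.linalg.norm(points, axis=1) * np.linalg.norm(q)
        dist = 1.0 - (points @ q) / norms
    else:  # "euclidean"
        dist = np.linalg.norm(points - q, axis=1)
    others = [i for i in range(len(points)) if i != query_idx]
    return sorted(others, key=lambda i: dist[i])


# Toy data chosen by hand for illustration only.
cos_points = np.array([[10.0, 1.0], [20.0, 2.5], [9.0, 5.0]])
euc_points = np.array([[1.0, 0.0], [1.2, 0.5], [5.0, 0.0], [-7.2, -0.5]])

cos_raw = neighbor_order(cos_points, 0, "cosine")
cos_sph = neighbor_order(sphereize(cos_points), 0, "cosine")
euc_raw = neighbor_order(euc_points, 0, "euclidean")
euc_sph = neighbor_order(sphereize(euc_points), 0, "euclidean")

print("cosine:   ", cos_raw, "->", cos_sph)
print("euclidean:", euc_raw, "->", euc_sph)
```

For these points the nearest neighbor of point 0 changes under both metrics. Shifting by the centroid changes the angles between points, which affects cosine rankings, and scaling to unit norm discards vector lengths, which affects Euclidean rankings.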
Tensorboard version: 2.1.0
@mbenami
I know it's been almost a year, but I just stumbled on this thread and I don't think centroid_vector is computed correctly. If your N_vectors has shape (# of vectors, dimension), then it should be axis=0, not axis=-1, when computing the centroid, e.g.
import numpy as np

embeddings = np.random.rand(100, 768)
# Correct: compute the k-dimensional centroid vector for a k-dimensional vector space
centroid = np.mean(embeddings, axis=0)
assert centroid.shape == (768,)
# Incorrect: axis=-1 averages within each vector, giving one scalar per vector
centroid = np.mean(embeddings, axis=-1)
assert centroid.shape == (100,)  # not (768,), so this is not a centroid
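Putting the corrections from this thread together (centroid over `axis=0`, norm with `keepdims=True`), the whole calculation could look like the sketch below. It mirrors the description given here, not the Projector's source, and the function name `sphereize` is made up for this example.

```python
import numpy as np


def sphereize(vectors):
    """Shift rows by their centroid, then scale each row to unit norm.

    vectors: array of shape (number of vectors, dimension).
    """
    # One k-dimensional centroid: the mean over all vectors (axis=0).
    centroid = vectors.mean(axis=0)
    centered = vectors - centroid
    # keepdims=True gives shape (n, 1), so the division broadcasts per row.
    norms = np.linalg.norm(centered, axis=-1, keepdims=True)
    return centered / norms


embeddings = np.random.default_rng(0).random((100, 768))
sphered = sphereize(embeddings)

print(sphered.shape)
print(np.linalg.norm(sphered, axis=-1)[:3])
```

A vector that coincides exactly with the centroid would have norm zero and produce NaNs, so real data may need a guard for that case.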