[x] v2.2.6
[x] Linux
Transparently encode&decode edges' labels
Insert an edge with a _label like /color/red.
It will log a _warning_ and then an _exception_ like:
WARNING: $ANSI{green {db=db}} Requested command 'create edge type '%2Fcolor%2Fred' as subclass of 'E'' must be executed outside active transaction: the transaction will be committed and reopen right after it. To avoid this behavior execute it outside a transaction
javax.script.ScriptException: java.lang.IllegalArgumentException: Invalid field name 'out_%2Fcolor%2Fred'. Character '%' is invalid
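The class and field names in the messages above look like ordinary URL percent-encoding of the label. A minimal sketch with Python's standard library, assuming that is what the importer applies (this is not OrientDB's own code), reproduces the rejected field name:

```python
from urllib.parse import quote, unquote

label = "/color/red"

# safe="" forces "/" to be encoded too; by default quote() leaves it alone.
encoded = quote(label, safe="")

# Assumption: the outgoing-edge field is named "out_" + <edge class name>,
# as the exception message suggests.
field_name = "out_" + encoded

print(encoded)
print(field_name)
print(unquote(encoded) == label)
```

The encoded name still contains '%', which is the character the field-name check rejects.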
Can you try encoding the label as `/color/red`? (using backticks)
Hi @lvca,
I think I am missing something... My dataset is in GraphSon format thus the edges are represented like:
{"_id":1315003,"_type":"edge","_outV":62550035,"_inV":14823714,"_label":"/color/red/"}
Are you suggesting to modify all the entries to be like ... ,"_label":"`/color/red/`"}?
Thanks,
If you can try it and it's working, we could fix the problem by always using backtick in the importer.
I'll try; I'm writing the encoder now.
BTW
While waiting for a reply I tried replacing all / with _ and the loading phase went further but then suddenly stopped due to:
javax.script.ScriptException: com.orientechnologies.orient.core.exception.ODatabaseException: RecordId cannot support cluster id major than 32767. Found: 32768
Any ideas on that? (or shall I better open a new issue?)
EDIT: Should I prefix each _id with 1: ?
Thank you so much
RecordId cannot support cluster id major than 32767. Found: 32768
If you are running into this, it means you are creating too many classes (or clusters). ODB is limited to 32676 cluster Ids. If you are running on a system with multiple cores, then the total number of classes available to you is 32676 divided by the number of cores, because ODB creates the same number of clusters per class as there are cores. As an example, if the ODB server is on an 8 core machine, you'll only have 4084 classes available.
Scott
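The arithmetic in the comment above can be sketched as follows. This is only an illustration: the function name is hypothetical, the limit is the figure quoted in the comment (the error message itself reports 32767), and clusters already taken by built-in classes are ignored.

```python
def classes_available(cluster_limit: int, clusters_per_class: int) -> int:
    """Whole number of classes that fit if each class takes clusters_per_class cluster ids."""
    return cluster_limit // clusters_per_class


# Figure quoted in the comment above; the error message reports 32767.
CLUSTER_LIMIT = 32676

# One cluster per core for every class, on an 8-core machine.
print(classes_available(CLUSTER_LIMIT, 8))
```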
@smolinari thanks for joining.
I have 24 cores; does that mean I should have only 1361 classes?
Why does it start complaining only at '32676' and not 1361?
Is it correct to say that OrientDB creates a class for each different value of _label?
Puh, 24 cores? Big machine... so that is a significant number and yes, that means you only have 1361 classes at your disposal. You can change this behavior by running ALTER DATABASE MINIMUMCLUSTERS 1 before you start loading data, which will then give you the full 32676 available classes. This could be at the cost of future performance, however.
As for _label creating a class: I am not sure what _label is exactly. If it is supposed to be a property of an edge, I'd say there is an issue in the code. If every _label should actually be a completely new set of edges, then yes, it should be creating a class.
Why does it start complaining only at '32676' and not 1361?
Because ODB really only concerns itself with the number of clusters available.
Scott
For the first point, thanks I will try.
As for the _labels, I am just guessing... I have plenty of messages like this in the logs (each with a different "...type '..'"):
WARNING: $ANSI{green {db=db}} Requested command 'create edge type '%2Fcolor%2Fred' as subclass of 'E'' must be executed outside active transaction: the transaction will be committed and reopen right after it. To avoid this behavior execute it outside a transaction
To be sure of the format, I also tried loading the dataset with another engine (Neo4j) and had no issue there: I just typed g.load("myfile.json") and waited a few hours.
Martin
@lvca it does not work.
The backticks are themselves encoded to %60 and thus raise the same problem:
Character '%' is invalid
I am not really familiar with the internals of OrientDB but I bet that re-applying the patch reverted here https://github.com/orientechnologies/orientdb/commit/c72d4c637c33388ed0856c0ad2b65c015fbc91a5 should fix the problem.
Please note that at this point it is not only the / character encoding that is broken but the whole encoding logic.
@smolinari I think I am missing something about the role that _label could actually have in OrientDB. I am going to try changing all _label to label (so they should be just plain properties) and see how this works. I will report back.
Hmm... looking at what GraphSon is all about, it looks like _label should actually be the edge class name and not a property. GraphSon is mixing schema (according to ODB's way of doing graphs) with the data. I'd venture to say that ODB's version of the Gremlin API and the g.load() function need looking at.
Scott
@smolinari I do not know if I made a step forward or backward...
Now I get javax.script.ScriptException: java.lang.StringIndexOutOfBoundsException: String index out of range: 0 ...
That is above my knowledge. Sorry.
Scott
@smolinari you've already done much!
Thank you =)
@MartinBrugnara any chance to have your file, so we can work on it internally? If it's not a problem, please send it to support -- at-- - orientdb.com.
@lvca Sure, my scripts will eventually all be released as open source.
The dataset is a (5GB plain text) sample from freebase (I tried different samples before reporting).
This is the Dockerfile:
# Derived from https://github.com/orientechnologies/orientdb-docker/blob/master/2.2/Dockerfile
FROM java:openjdk-8-jdk-alpine
ENV ORIENTDB_VERSION 2.2.6
ENV ORIENTDB_DOWNLOAD_URL http://central.maven.org/maven2/com/orientechnologies/orientdb-community/$ORIENTDB_VERSION/orientdb-community-$ORIENTDB_VERSION.zip
RUN apk add --update unzip curl bash && rm -rf /var/cache/apk/*
RUN \
curl -o orientdb.zip $ORIENTDB_DOWNLOAD_URL \
&& unzip orientdb.zip -d /opt/ \
&& rm orientdb.zip \
&& rm -rf /orientdb/databases/* \
&& ln -s /opt/orientdb-community-${ORIENTDB_VERSION} /orientdb
ENV PATH /orientdb/bin:$PATH
RUN chmod +x /orientdb/bin/*
#http://orientdb.com/docs/2.0/orientdb.wiki/Performance-Tuning.html
# RAM: 128G Cores: 24
RUN sed -i'.bak' 's@-Xms32m -Xmx512m@-Xms1G -Xmn128M -Xmx60G -Dstorage.diskCache.bufferSize=65536 -server -XX:+AggressiveOpts -XX:CompileThreshold=200@' "/orientdb/bin/gremlin.sh"
#VOLUME ["/orientdb/backup", "/orientdb/databases", "/orientdb/config"]
VOLUME ["/exp"]
#OrientDb binary, http
EXPOSE 2424 2480
WORKDIR /exp
CMD ["gremlin.sh", "-e", "/tmp/query"]
And the query itself:
import com.tinkerpop.blueprints.impls.orient.OrientGraph
g = new OrientGraph("plocal:/db/orient_freebase")
g.loadGraphSON("/exp/freebase.json")
g.shutdown()
System.exit(0)
EDIT: upgrade to 2.2.7 would not be an issue.
Which file in https://developers.google.com/freebase/?
Sorry for the delay.
It seems we started from the big one, but then we also cleaned it up, parsed and converted it... so here comes the link to the smallest sample I have (5GB).
Thank you so much,
Martin
Thanks for the file. I waited a long time, but it looks like there are tons of vertices and I cannot get to the point where edges are imported. Could you provide a shorter version with just some vertices and edges, so it's quick to debug? Thanks in advance.
By the way, on 2.2.8-SNAPSHOT the graphson format is supported by the OrientDB console. You can even import the tar.gz file directly. Furthermore, a status is printed so you know how far the import has got.
It definitely seems to be making progress; I will let it work overnight and report back tomorrow.
Thanks.
Cool the console interface!
With default options it failed (due to JVM heap) at ~3*10e6 nodes.
With the same 60GB heap and 60GB cache it loaded the nodes in a reasonable time, but then it exploded on the '%' encoding issue.
I am now trying loading the dataset with all the 'not supported chars' already escaped.
Preface: The sample has V: 28413023 E: 31475362
The console logs:
Imported 28400000 graph elements: 28400000 vertices and 0 edges. Parsed 2.13GB (uncompressed)ed)..
So I suppose it loads the nodes and reaches the edges (looking at the previous error on '%').
Then it dies when it's time to save the edges with:
Error: java.lang.StringIndexOutOfBoundsException: String index out of range: 0
I checked my datasets three times to make sure they do not contain null keys or empty values.
I also tried with another (different, bigger, 20G) sample and it always crashes with
Error: java.lang.StringIndexOutOfBoundsException: String index out of range: 0
This is how I execute the query:
echo "Loading $DATASET" >> /exp/results.txt
echo "CREATE DATABASE PLOCAL:/srv/db ;QUIT" | "$ORIENTDB_HOME"/bin/console.sh
echo "CONNECT PLOCAL:/srv/db admin admin ;IMPORT DATABASE ${DATASET} -format=graphson ;QUIT" | time "$ORIENTDB_HOME"/bin/console.sh >> /exp/results
These are the only modifications to the console.sh file:
ENV JOPTS '-Xms1G -Xmn128M -Xmx60G -Dstorage.diskCache.bufferSize=65536 -server -XX:+AggressiveOpts -XX:CompileThreshold=200'
RUN sed -i'.bak' "s@-Xmx512m @@" "$ORIENTDB_HOME/bin/console.sh"
RUN sed -i'.bak' "s@#JAVA_OPTS=-Xmx1024m@JAVA_OPTS=\"${JOPTS}\"@" "$ORIENTDB_HOME/bin/console.sh"
The only thing I have in mind is that the _path attribute is required on edges and that, instead of reporting the error, the loader dies badly (checking this assumption while writing).
@smolinari I would like to avoid confusion.. so I ask again to be sure:
Is it right to claim the following?
"With orientdb one can have at most 32676 different kind of edges; meaning that one can have at most 32676 distinct values for the '_label' attribute."
Hey Martin,
Would you be able to extract just 10 vertices and a couple of edges from that file, so I can look at it?
Sure,
I am out of office now.
I'll try to do it tomorrow.
Thanks,
-- Martin
On Fri, 26 Aug 2016, Luca Garulli wrote:
Subject: Re: [orientechnologies/orientdb] Error encoding edge label (#6577)
Hey Martin,
Would you be able to extract just 10 vertices and a couple of edges from that file, so I can look at it?
Thanks. With a small file I can fix this in minutes.
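One way to cut such a small file is sketched below. It assumes the GraphSON elements have already been parsed into Python dicts, that vertices carry an _id field like the edges shown earlier, and that the function name and defaults are purely illustrative.

```python
def small_sample(vertices, edges, n_vertices=10, n_edges=2):
    """Pick a few edges first, then make sure their endpoints are among the vertices."""
    picked_edges = list(edges[:n_edges])

    # Every picked edge must have both endpoints in the sample.
    needed = {e["_outV"] for e in picked_edges} | {e["_inV"] for e in picked_edges}
    picked_vertices = [v for v in vertices if v["_id"] in needed]

    # Fill up with unrelated vertices until the requested size is reached.
    for v in vertices:
        if len(picked_vertices) >= n_vertices:
            break
        if v["_id"] not in needed:
            picked_vertices.append(v)

    return picked_vertices, picked_edges
```

Picking the edges first avoids ending up with a sample whose edges point at vertices that were left out.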
@lvca I identified the problem: it seems something (I do not know if it's the console loader or the db itself) is not happy with unlabelled edges: you have to specify _label.
Summarizing
In order to load the sample I had to:
1) Substitute all / with _ (due to the issue with escaping, fixable w/ the #5424 - #5558 hack)
javax.script.ScriptException: java.lang.IllegalArgumentException: Invalid field name 'out_%2Fcolor%2Fred'. Character '%' is invalid
2) Do not use the _label property but store _labels_ as attributes, because OrientDB does not support more than 32676 / #cores different (values for) labels (https://github.com/orientechnologies/orientdb/issues/6577#issuecomment-241009521).
RecordId cannot support cluster id major than 32767. Found: 32768
3) Add a fake label xoxo to keep the loader happy (please log something more meaningful there).
Error: java.lang.StringIndexOutOfBoundsException: String index out of range: 0
4) Use the console loader instead of gremlin (to have acceptable loading time)
real 4h 19m 13s
user 30h 28m 16s [24 cores]
sys 12m 53.10s
5) Rewrite my queries to look for the attribute instead of the label for traversing.
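Steps 1 to 3 amount to a per-edge rewrite of the GraphSON records. A minimal sketch, assuming each edge is already a parsed dict; the function name is hypothetical, the attribute name label and the placeholder xoxo are the ones mentioned in this thread:

```python
def rewrite_edge(edge, placeholder="xoxo", attribute="label"):
    """Return a copy of a GraphSON edge with its label moved to a plain attribute."""
    out = dict(edge)
    original = out.get("_label", "")

    # Step 1: avoid the escaping issue by replacing "/" with "_".
    # Step 2: keep the label as an ordinary property instead of an edge class.
    out[attribute] = original.replace("/", "_")

    # Step 3: the loader still wants some _label, so give every edge the same one.
    out["_label"] = placeholder
    return out


edge = {"_id": 1315003, "_type": "edge", "_outV": 62550035,
        "_inV": 14823714, "_label": "/color/red/"}
print(rewrite_edge(edge))
```

With a single shared _label, only one edge class gets created, so the cluster limit no longer depends on the number of distinct labels.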
Please:
Could someone kindly confirm I understood correctly the issue at point 2 as expressed here: https://github.com/orientechnologies/orientdb/issues/6577#issuecomment-242748570 ?
Thank you,
Martin
@MartinBrugnara
"With orientdb one can have at most 32676 different kind of edges; meaning that one can have at most 32676 distinct values for the '_label' attribute."
If ODB is loading the edges and using _label to define edge classes, then yes, you will be limited in this respect. If, however, you just load the edges under the E class and _label is just an attribute, then no. You can load 9 million trillion edges. :smile:
Scott
@smolinari
I think I like the E class..
How should I modify the loading?
Should I prepend each edge label with E: to have something like:
{"_id":1315003,"_type":"edge","_outV":62550035,"_inV":14823714,"_label":"E:/color/red/"}
At this point, should I also get rid of the escaping?
What I am looking for is to make a query like this work:
// groovy - gremlin 2
def sample_edges(g, n) {
return g.E.shuffle().next(n).collect { edge -> [
source: edge.outV[uid_field].next(),
target: edge.inV[uid_field].next(),
// I suppose this is a property right?
label: edge.label
]};
}
Thank you!!!
About (1) I've just fixed it in 2.2.x branch.
About (2) How many different edge classes do you have? If you set this at the beginning:
alter database minimumclusters 1
The maximum number of clusters is 32K. But if you have many edge classes (hundreds or more) with only a few instances each, then having the label as an attribute is a good idea. I'd like to support this in the importer so that the next user doesn't run into all these problems. This is already supported by the OrientDB API, but not by the console.
Something like:
import database /temp/db.tar.gz -format=graphson -useClassForEdgeLabel=false
About (3) you don't need it anymore, because of solution (2).
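To answer the question of how many edge classes the data would produce, the distinct _label values can be counted before importing. A rough sketch, assuming the elements are available as parsed dicts; the function name is hypothetical:

```python
from collections import Counter


def label_histogram(elements):
    """Count how many edges use each _label value; non-edge elements are skipped."""
    return Counter(e["_label"] for e in elements
                   if e.get("_type") == "edge" and "_label" in e)


elements = [
    {"_id": 1, "_type": "vertex"},
    {"_id": 2, "_type": "edge", "_outV": 1, "_inV": 1, "_label": "/color/red/"},
    {"_id": 3, "_type": "edge", "_outV": 1, "_inV": 1, "_label": "/color/red/"},
    {"_id": 4, "_type": "edge", "_outV": 1, "_inV": 1, "_label": "/color/blue/"},
]
hist = label_histogram(elements)
print(len(hist))           # number of distinct labels, i.e. would-be edge classes
print(hist.most_common())  # labels with few instances are candidates for an attribute
```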
@lvca
About (1) Is there a snapshot I can use?
About (2)
alter database minimumclusters 1
import database /temp/db.tar.gz -format=graphson -useClassForEdgeLabel=false
Is label instead of _label recognized as a label without a class or as a simple attribute?
@smolinari
I think I misunderstood you: I tried to prefix all labels with #E: but it dies with:
javax.script.ScriptException: com.orientechnologies.orient.core.exception.ODatabaseException: RecordId cannot support cluster id major than 32767. Found: 32769
Yeah, unfortunately, I am not sure what the Gremlin import function is doing, though I think I can say with a good bit of certainty prefixing the label to create edges won't really help.
To explain the "E" class, it is the standard edge super class in ODB. If you create a class and it extends or inherits from the "E" class, then it will also automatically become an edge class. For instance, if you had
CREATE CLASS MyEdgeClass EXTENDS E
You'd be creating the "MyEdgeClass" class and since that class extends "E", it also can only hold edges. Also, if you create edges, but do not have a distinct class identified in the SQL, then the edges will automatically be stored under the "E" edge super class.
The same goes for the standard class "V", which is the super class for vertices.
CREATE CLASS MyVertexClass EXTENDS V
More info about the "V" and "E" default super classes here: http://orientdb.com/docs/2.2/Tutorial-Document-and-graph-model.html
Creating Edges here: http://orientdb.com/docs/2.2/SQL-Create-Edge.html
And Vertexes here: http://orientdb.com/docs/2.2/SQL-Create-Vertex.html
Scott
(1) Look at last snapshots: https://oss.sonatype.org/content/repositories/snapshots/com/orientechnologies/orientdb-community/2.2.9-SNAPSHOT/
Sorry for disappearing, but other critical things required my attention.
I've now updated to the latest version: 2.2.13
But the inconsistency seems to remain.
Here comes a small example with / in nodes' attributes and edges' labels.
The query SELECT * shows how the strings are stored escaped but also how they are returned escaped.
Using gremlin (bundled in the installation) to query on the attributes or labels returns no elements.
Thanks,
Martin
** Data have been loaded with the combination of your suggestions:
echo "CREATE DATABASE PLOCAL:/srv/db ;ALTER DATABASE minimumclusters 1 ;QUIT" | "$ORIENTDB_HOME"/bin/console.sh
echo "CONNECT PLOCAL:/srv/db admin admin ;IMPORT DATABASE ${DATASET} -format=graphson ;QUIT" | time "$ORIENTDB_HOME"/bin/console.sh
Forwarded to @luigidellaquila