Azure-cosmos-dotnet-v2: [Graph] String filters

Created on 10 Apr 2018 · 14 Comments · Source: Azure/azure-cosmos-dotnet-v2

Great job with the API so far.
While not an official Gremlin spec feature, I think string filters would be very useful.
They were discussed a bit in #413; to keep that thread on topic, I created this issue to track string filters.

Feedback portal issue: Add support for Text Search Predicates

Official TinkerPop issue: TINKERPOP-2041
TinkerPop PR: https://github.com/apache/tinkerpop/pull/944
Of course this hasn't been added to the specification yet, but a few other graph database providers have implemented some form of string filtering:

DSE Graph
TitanDB
JanusGraph
(And Sql2Gremlin achieves this with lambda.)

My particular use case is filtering names.

Fuzzy search is probably not as easy to implement as the others, but a simple contains filter or regex filter would go a long way toward solving simple text searches.

There was some mention in #413 of an announcement which would detail the string functions that are being worked on. Any news on that?

EDIT:
Added relevant feedback portal link.
Added link to official tinkerpop issue.

EDIT 2:
We now have text predicates!

BenjaBobs

👍8

Most helpful comment

I've been in contact with @LuisBosquez about this and we agreed to take the conversation out in the open. :)
To summarize, the team is working on two string predicates, which are up for discussion:

startsWith()
contains()

Here are my two cents on the subject:
Almost everywhere text predicates are offered, there is some option to set case sensitivity.
Given how most Gremlin steps are defined, I think the best fit would be something like this:

default GraphTraversal<S,E> startsWith(String value) // defaults to case insensitive
default GraphTraversal<S,E> startsWith(String value, boolean caseSensitive)
default GraphTraversal<S,E> contains(String value) // defaults to case insensitive
default GraphTraversal<S,E> contains(String value, boolean caseSensitive)

Such that with the following data

[
    {
        "id": "875a2805-bdc7-4ed9-81ab-9e619bf2dfeb",
        "label": "person",
        "type": "vertex",
        "properties": {
            "name": [
                {
                    "id": "1bca3ec4-96dc-427f-9535-2ded44230109",
                    "value": "John Doe"
                }
            ]
        }
    },
    {
        "id": "fc8cffe9-a349-464d-a283-58a411b0e4e8",
        "label": "person",
        "type": "vertex",
        "properties": {
            "name": [
                {
                    "id": "8b0642ac-a27a-4a6f-8cd7-f95c3fc78a2a",
                    "value": "Jane Doe"
                }
            ]
        }
    }
]

I would expect the following:

g.V().has('name', startsWith('John'))

[
    {
        "id": "875a2805-bdc7-4ed9-81ab-9e619bf2dfeb",
        "label": "person",
        "type": "vertex",
        "properties": {
            "name": [
                {
                    "id": "1bca3ec4-96dc-427f-9535-2ded44230109",
                    "value": "John Doe"
                }
            ]
        }
    }
]

g.V().has('name', startsWith('john'))

[
    {
        "id": "875a2805-bdc7-4ed9-81ab-9e619bf2dfeb",
        "label": "person",
        "type": "vertex",
        "properties": {
            "name": [
                {
                    "id": "1bca3ec4-96dc-427f-9535-2ded44230109",
                    "value": "John Doe"
                }
            ]
        }
    }
]

g.V().has('name', startsWith('john', true))

[]

g.V().has('name', contains('Doe'))

[
    {
        "id": "875a2805-bdc7-4ed9-81ab-9e619bf2dfeb",
        "label": "person",
        "type": "vertex",
        "properties": {
            "name": [
                {
                    "id": "1bca3ec4-96dc-427f-9535-2ded44230109",
                    "value": "John Doe"
                }
            ]
        }
    },
    {
        "id": "fc8cffe9-a349-464d-a283-58a411b0e4e8",
        "label": "person",
        "type": "vertex",
        "properties": {
            "name": [
                {
                    "id": "8b0642ac-a27a-4a6f-8cd7-f95c3fc78a2a",
                    "value": "Jane Doe"
                }
            ]
        }
    }
]

g.V().has('name', contains('doe'))

[
    {
        "id": "875a2805-bdc7-4ed9-81ab-9e619bf2dfeb",
        "label": "person",
        "type": "vertex",
        "properties": {
            "name": [
                {
                    "id": "1bca3ec4-96dc-427f-9535-2ded44230109",
                    "value": "John Doe"
                }
            ]
        }
    },
    {
        "id": "fc8cffe9-a349-464d-a283-58a411b0e4e8",
        "label": "person",
        "type": "vertex",
        "properties": {
            "name": [
                {
                    "id": "8b0642ac-a27a-4a6f-8cd7-f95c3fc78a2a",
                    "value": "Jane Doe"
                }
            ]
        }
    }
]

g.V().has('name', contains('doe', true))

[]
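
To make the expected results above easy to check, here is a minimal sketch that models the proposed semantics in plain Python. Nothing in it is a TinkerPop or Cosmos DB API: `starts_with`, `contains`, `has_name` and the reduced vertex dictionaries are hypothetical stand-ins, and the case-insensitive default is the assumption from the proposed signatures.

```python
from typing import Callable, Dict, List

# Sample vertices from the example data, reduced to the fields used here.
VERTICES: List[Dict[str, str]] = [
    {"id": "875a2805-bdc7-4ed9-81ab-9e619bf2dfeb", "label": "person", "name": "John Doe"},
    {"id": "fc8cffe9-a349-464d-a283-58a411b0e4e8", "label": "person", "name": "Jane Doe"},
]

Predicate = Callable[[str], bool]


def starts_with(value: str, case_sensitive: bool = False) -> Predicate:
    """Hypothetical startsWith(value[, caseSensitive]); case insensitive by default."""
    if case_sensitive:
        return lambda s: s.startswith(value)
    needle = value.casefold()
    return lambda s: s.casefold().startswith(needle)


def contains(value: str, case_sensitive: bool = False) -> Predicate:
    """Hypothetical contains(value[, caseSensitive]); case insensitive by default."""
    if case_sensitive:
        return lambda s: value in s
    needle = value.casefold()
    return lambda s: needle in s.casefold()


def has_name(predicate: Predicate) -> List[str]:
    """Rough stand-in for g.V().has('name', predicate); returns the matching names."""
    return [v["name"] for v in VERTICES if predicate(v["name"])]


if __name__ == "__main__":
    print(has_name(starts_with("john")))        # case-insensitive prefix match
    print(has_name(starts_with("john", True)))  # case-sensitive prefix match
    print(has_name(contains("doe")))            # case-insensitive substring match
    print(has_name(contains("doe", True)))      # case-sensitive substring match
```

The case-insensitive branch uses `str.casefold()` rather than `lower()`; that is a choice made for this sketch, not something the proposal specifies.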

BenjaBobs on 28 Jun 2018

👍5

All 14 comments

@olivertowers Any news on this? Alternatively I'm looking at doing my own string indexing and seeing if I can work out a naive contains filter myself using Gremlin, or maybe saving everything on Azure Search as well, but I'd rather not because that would mean increased overhead.

BenjaBobs on 1 May 2018

As for fuzzy search, I don't know enough about the subject to discuss implementation details, but if you were to consider it, I found an article describing an edit-distance function that supposedly runs in near-linear time (APPROXIMATING EDIT DISTANCE IN NEAR-LINEAR TIME) and an article testing various methods (NIKITA'S BLOG - Fuzzy string search).

A possible Gremlin syntax could then be something like

default GraphTraversal<S,E> fuzzyContains(String value, int maxDistance)

With results akin to

g.V().has('name', fuzzyContains('Johnny', 1000))

[
    {
        "id": "875a2805-bdc7-4ed9-81ab-9e619bf2dfeb",
        "label": "person",
        "type": "vertex",
        "properties": {
            "name": [
                {
                    "id": "1bca3ec4-96dc-427f-9535-2ded44230109",
                    "value": "John Doe"
                }
            ]
        }
    },
    {
        "id": "fc8cffe9-a349-464d-a283-58a411b0e4e8",
        "label": "person",
        "type": "vertex",
        "properties": {
            "name": [
                {
                    "id": "8b0642ac-a27a-4a6f-8cd7-f95c3fc78a2a",
                    "value": "Jane Doe"
                }
            ]
        }
    }
]

g.V().has('name', fuzzyContains('Johnny', 20))

[
    {
        "id": "875a2805-bdc7-4ed9-81ab-9e619bf2dfeb",
        "label": "person",
        "type": "vertex",
        "properties": {
            "name": [
                {
                    "id": "1bca3ec4-96dc-427f-9535-2ded44230109",
                    "value": "John Doe"
                }
            ],
        }
    }
]

g.V().has('name', fuzzyContains('Johnny', 0))

(where a distance of 0 would practically be the same as string equals.)

[]
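
One way to picture what a `fuzzyContains(value, maxDistance)` predicate could compute is approximate substring matching with edit distance. The sketch below uses the classic quadratic dynamic-programming formulation, not the near-linear approximation from the article mentioned above, and the names `fuzzy_distance` and `fuzzy_contains` are hypothetical. It is case sensitive, and the distances it produces are plain character edits, so the thresholds 1000 and 20 used in the queries above are only illustrative.

```python
def fuzzy_distance(text: str, pattern: str) -> int:
    """Smallest edit distance between `pattern` and any substring of `text`.

    Classic O(len(text) * len(pattern)) dynamic programme for approximate
    substring matching: a match may start anywhere in `text` at no cost.
    """
    m = len(pattern)
    # col[i] = cost of turning pattern[:i] into the best substring of the
    # text seen so far that ends at the current position.
    col = list(range(m + 1))  # no text consumed yet: delete i pattern chars
    best = col[m]
    for ch in text:
        new = [0] * (m + 1)  # new[0] = 0: a match may start at this position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new[i] = min(
                col[i - 1] + cost,  # match or substitute
                col[i] + 1,         # extra character in the text
                new[i - 1] + 1,     # pattern character missing from the text
            )
        col = new
        best = min(best, col[m])
    return best


def fuzzy_contains(text: str, pattern: str, max_distance: int) -> bool:
    """Hypothetical fuzzyContains: true if some substring is within max_distance edits."""
    return fuzzy_distance(text, pattern) <= max_distance


if __name__ == "__main__":
    for name in ("John Doe", "Jane Doe"):
        print(name, fuzzy_distance(name, "Johnny"))
```

In this sketch a distance of 0 reduces to a plain case-sensitive substring test, and "John Doe" is within 2 edits of "Johnny" because "John" matches after dropping the trailing "ny".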

BenjaBobs on 28 Jun 2018

Great suggestions @BenjaBobs
@LuisBosquez Any progress on these ?

syedhassaanahmed on 5 Sep 2018

There is discussion here on how to approach text predicates in TinkerPop itself:

https://lists.apache.org/thread.html/c52542a00c89e2f06f73636efd0a26c9e2ac8436bb41b3ade1b7931b@%3Cdev.tinkerpop.apache.org%3E

input is welcome.

spmallette on 17 Sep 2018

Update: The tinkerpop issue https://github.com/apache/tinkerpop/pull/944 was merged and is slated for version 3.4.0.

BenjaBobs on 8 Oct 2018

"the team is working on two string predicates"

Any update on this?

adrianknight89 on 21 Feb 2019

I haven't heard from them in a long time. Last time I checked (early January) they hadn't implemented string predicates yet.

BenjaBobs on 22 Feb 2019

Just to be official, TinkerPop did release Text Predicates on 3.4.0 back at the very start of January.

http://tinkerpop.apache.org/docs/3.4.0/upgrade/#_text_predicates

so from a TinkerPop perspective it is an available feature.
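
For readers comparing this with the proposal earlier in the thread: the predicates released in 3.4.0 are named `startingWith`, `endingWith`, `containing` and their `not...` counterparts on `TextP`, and as far as I can tell from the linked docs they are case sensitive, with no case-sensitivity flag. The sketch below only models those semantics in plain Python; the `TEXT_P` table and `matching` helper are hypothetical and do not call any TinkerPop or Cosmos DB library.

```python
from typing import Callable, Dict, List

# Plain-Python model of the six TextP predicates from TinkerPop 3.4.0.
# In Gremlin the corresponding query would look like:
#   g.V().has('name', TextP.startingWith('John'))
TEXT_P: Dict[str, Callable[[str, str], bool]] = {
    "startingWith": lambda s, v: s.startswith(v),
    "endingWith": lambda s, v: s.endswith(v),
    "containing": lambda s, v: v in s,
    "notStartingWith": lambda s, v: not s.startswith(v),
    "notEndingWith": lambda s, v: not s.endswith(v),
    "notContaining": lambda s, v: v not in s,
}

# Names from the sample data used earlier in the thread.
NAMES: List[str] = ["John Doe", "Jane Doe"]


def matching(predicate: str, value: str) -> List[str]:
    """Return the sample names accepted by the named predicate (case sensitive)."""
    test = TEXT_P[predicate]
    return [name for name in NAMES if test(name, value)]


if __name__ == "__main__":
    for predicate in TEXT_P:
        print(predicate, matching(predicate, "John"))
```

Under this model `startingWith('john')` matches nothing in the sample data, which is the behaviour the proposal reserved for `caseSensitive = true`.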

spmallette on 22 Feb 2019

👍1

Then we just need to wait for Microsoft to implement it according to the spec.

BenjaBobs on 22 Feb 2019

Hello everyone, we have started the deployment of this feature. We implemented the TextP-based functions described in this document: http://tinkerpop.apache.org/docs/3.4.0/reference/#a-note-on-predicates

Depending on your region, you might see this feature within the next few weeks.

LuisBosquez on 25 Feb 2019

👍1

@spmallette @LuisBosquez
Awesome work with the text predicates! Any chance for a fuzzyContaining? 😃

BenjaBobs on 26 Feb 2019

The discussion started rather widely in TinkerPop on what text predicates to initially include. The set that was established matched the widest number of databases that allowed for this sort of thing. I suppose additional ones could be proposed/considered in the future.

spmallette on 26 Feb 2019

And rightfully so, because if not carefully designed to be flexible enough we'll end up with many standards.

The newly added predicates are a great addition and make many things possible. While advanced predicates such as fuzzy and regex are probably nice-to-haves (really nice-to-haves 😄 ), DSE Graph, Titan Graph and Janus Graph have already flavoured their implementation of Gremlin with their own variation of syntax.

BenjaBobs on 26 Feb 2019
