Spacy: I think this is a better way of storing the data, instead of the current methods

Created on 11 Oct 2019 · 6 Comments · Source: explosion/spaCy

I'm not really familiar with the spaCy library, but I don't think the way spaCy manipulates the 'text' is correct. Here are my supporting reasons:

The 're' (regular expression) module
Although the 're' module offers the best high-performance ways to manipulate text, as far as I can tell spaCy doesn't take advantage of it.
Performance
Just running this code takes spaCy about 1 second on an 'intel i5 k8' computer, which is slow:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('this is my text')
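The one-second figure covers two separate steps, loading the model and processing the text, so it may help to time them apart. The sketch below is only an illustration: `timed` is a hypothetical helper (not part of spaCy), and it is demonstrated on a stand-in workload so it runs without spaCy installed.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload (assumption: any callable can be measured the same way).
result, seconds = timed(sorted, [3, 1, 2])
print(f"sorted took {seconds:.6f}s")
```

With spaCy and the model installed, `timed(spacy.load, 'en_core_web_sm')` followed by `timed(nlp, 'this is my text')` would report the two costs separately. Loading happens once per process, so it is worth measuring apart from per-text processing.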

Once you have 'doc', you need to iterate over it to get the tokens or token attributes.
I think a dictionary is a much better way to accomplish that.

Something like this:

import re


def load(model):
    # process the model... (placeholder: 'model' is not used in this sketch)
    def wrapper(text):
        # wrapper: process the 'text' with the given 'model'...

        # FIX: refine the pattern so tokens contain only letters, dots, and colons.
        # r'\S+' already excludes spaces, new lines ('\n'), and tabs.
        for mch in re.finditer(r'\S+', text):
            # the position of the word
            position = (mch.start(), mch.end())

            yield {
                position: {
                    'name': mch.group(),
                    'head': '',
                    'pos': '',
                }
            }
    return wrapper


nlp = load('en')
doc = nlp('this is my text')

for obj in doc:
    print(obj)
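One detail worth checking in any span-based lookup like this: a word used as its own search pattern also matches inside longer words, so repeated searches can report spans that are not separate tokens. The sketch below contrasts a per-word search with a single `\S+` pass; the helper names `spans_per_word` and `spans_single_pass` are made up for illustration.

```python
import re


def spans_per_word(text):
    """Search the whole text once per whitespace-separated word."""
    spans = []
    for word in text.split():
        # re.escape keeps characters such as '.' or '(' from acting as regex syntax
        for m in re.finditer(re.escape(word), text):
            spans.append((m.start(), m.end()))
    return spans


def spans_single_pass(text):
    """Scan the text once and take each run of non-whitespace characters."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


text = "this is my text"
per_word = spans_per_word(text)
single = spans_single_pass(text)
print(per_word)  # includes (2, 4): the "is" inside "this"
print(single)
```

The single pass is also linear in the length of the text, whereas the per-word search rescans the text once for every word.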

NOT IN THE EXAMPLE (extra functionality):
After that, any time the user requests an attribute of a token, the program would add it to the dictionary, under that token's index, inside the 'head' field.
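A minimal sketch of that on-demand idea, under stated assumptions: `LazyTokenTable` and `toy_compute` are hypothetical names, attributes are stored directly in each token's dictionary rather than in a 'head' field, and the toy callback stands in for whatever real analysis would produce the values.

```python
import re


class LazyTokenTable:
    """Maps (start, end) spans to attribute dicts that are filled on first access."""

    def __init__(self, text, compute):
        self.text = text
        self._compute = compute  # callback: (text, span, attr) -> value
        self._table = {
            (m.start(), m.end()): {} for m in re.finditer(r"\S+", text)
        }

    def get(self, span, attr):
        attrs = self._table[span]
        if attr not in attrs:
            # computed once, then stored in the token's dictionary
            attrs[attr] = self._compute(self.text, span, attr)
        return attrs[attr]


calls = []


def toy_compute(text, span, attr):
    """Toy stand-in for real analysis; records each call it receives."""
    calls.append((span, attr))
    start, end = span
    if attr == "name":
        return text[start:end]
    if attr == "length":
        return end - start
    raise KeyError(attr)


table = LazyTokenTable("this is my text", toy_compute)
first = table.get((0, 4), "name")
second = table.get((0, 4), "name")  # served from the stored value
print(first, len(calls))
```

The trade-off is that each attribute is computed only when someone asks for it, at the cost of a dictionary lookup on every access.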

>>> Please tell me if it's a good idea so I can open a pull request to the library...

PROgramJEDI

👎2

All 6 comments

honnibal on 11 Oct 2019

👍2 👎1

I'm not sure exactly which assumptions you've taken a wrong turn on, but it really doesn't sound like you have an accurate understanding of what the library does. I don't think you should spend time making a PR.

In fact, I think I understand well enough what the library is about.

In general, the library uses common machine learning algorithms and protocols, such as 'NLP'. The library analyzes the templates and allows users to access the result.

If you can explain it to me without being patronizing or flippant, I'd love to know more.

By the way, the library is slow as hell, and I have suggested ways to improve efficiency and speed.

And I don't think you tried to understand what I tried to do.

PROgramJEDI on 11 Oct 2019

👎1

I'm not sure exactly which assumptions you've taken a wrong turn on, but it really doesn't sound like you have an accurate understanding of what the library does. I don't think you should spend time making a PR.

Please explain to me: what is the purpose of the library?

PROgramJEDI on 11 Oct 2019

👎1

I'm not sure exactly which assumptions you've taken a wrong turn on, but it really doesn't sound like you have an accurate understanding of what the library does. I don't think you should spend time making a PR.

Please explain to me: what is the purpose of the library?

https://spacy.io/

Gonzalo933 on 11 Oct 2019

👍1

I'm not sure exactly which assumptions you've taken a wrong turn on, but it really doesn't sound like you have an accurate understanding of what the library does. I don't think you should spend time making a PR.

I don't understand what the matter is. The library should first parse the text and then tokenize it, and it should do so quickly and efficiently. That doesn't seem to be the current situation.

I think the current method is not efficient...

PROgramJEDI on 13 Oct 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.