Hi, I am trying to use a file with more than 8,000 entries (10-20 letter words, one per line, 132K) to get their corresponding lines in a big file (645,151 lines, 76M). I use:
rg -w -f query_file target_file
I get the error: Compiled regex exceeds size limit of 10485760 bytes.
How can I configure it to allow rg to run without the limit?
You can't, as of now, without re-compiling and hard-coding a new limit.
I would be interested to hear how well 8,000 entries perform. As of now, they are fed into a single regex as a single alternation. If they are all plain text strings and not patterns, then it might be better to use Aho-Corasick instead (which ripgrep should do, it just doesn't yet).
If you were so inclined, the limit is here: https://github.com/BurntSushi/ripgrep/blob/master/grep/src/search.rs#L67
FYI, you can't "remove" the limit (but you can set it arbitrarily high).
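To illustrate what "a single alternation" means here, the following is a minimal sketch, not ripgrep's actual code. The function name `build_alternation` is hypothetical, and wrapping each line in a non-capturing group is an assumption made for the illustration. It shows that the pattern text grows with every line of the query file, which is why thousands of entries can push the compiled regex past a fixed size limit.

```rust
/// Hypothetical helper (not ripgrep's implementation): join every line of a
/// pattern file into one alternation, wrapping each line in a non-capturing
/// group so that a `|` inside a line cannot leak into its neighbours.
fn build_alternation(pattern_file: &str) -> String {
    pattern_file
        .lines()
        .map(|line| format!("(?:{})", line))
        .collect::<Vec<_>>()
        .join("|")
}

fn main() {
    // Two of the sample query lines from this thread.
    let queries = "AACY020304192.1.1216\nAACY020311045.179.1696\n";
    let pattern = build_alternation(queries);
    println!("{}", pattern);
    // The pattern text alone grows linearly with the number of entries.
    println!("pattern length: {} bytes", pattern.len());
}
```

Note that the lines are used as regex patterns as-is in this sketch, so a `.` in an identifier matches any character rather than a literal dot.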
Thanks, I have run it with a 640-line file (11K) without problems and very fast. The lines of the file are like these:
AACY020304192.1.1216
AACY020311045.179.1696
AACY020524983.300.1803
AACY020558197.3281.4799
AACY023705044.289.1515
AACY023835985.1.1221
I am fairly new to coding, so excuse me, but I need a "for dummies" how-to for setting the limit very high. Basically, what should I do and where?
I have a server with 64 cores, 128 Mb RAM.
sorry
The limit probably should be exposed as a flag so that it's a knob you can turn easily. That feature is not currently available, so the only way for you to do it is to change the source code of ripgrep and recompile it. Briefly:
$ git clone https://github.com/BurntSushi/ripgrep
$ cd ripgrep
... try to compile it to make sure your env is set up right
$ cargo build --release
... change the limit in the source code
$ $EDITOR grep/src/search.rs
... compile again
$ cargo build --release
... the ripgrep binary is in target/release
$ ./target/release/rg blah blah blah
In order for the above to work, you will need to install Rust. See: https://rustup.rs/
OK, solved, thanks.
I ran a file with 11,724 lines in under 1:20 h, not bad compared to several hours with grep.
I changed the "size_limit" from 10 to 1000:
impl Default for Options {
    fn default() -> Options {
        Options {
            case_insensitive: false,
            case_smart: false,
            line_terminator: b'\n',
            size_limit: 1000 * (1 << 20),
            dfa_size_limit: 10 * (1 << 20),
        }
    }
}
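As a quick check of the arithmetic in that snippet: `1 << 20` is one mebibyte, so the original `10 * (1 << 20)` is exactly the 10485760 bytes quoted in the error message. The sketch below only reproduces that calculation; the helper name `limit_bytes` is made up for the example.

```rust
/// Hypothetical helper: convert a limit expressed in MiB into bytes,
/// mirroring the `N * (1 << 20)` expressions in the snippet above.
fn limit_bytes(mib: u64) -> u64 {
    mib * (1u64 << 20)
}

fn main() {
    // Default size limit: the number reported in the error message.
    assert_eq!(limit_bytes(10), 10_485_760);
    // The raised limit from the edited source (1000 MiB).
    assert_eq!(limit_bytes(1000), 1_048_576_000);
    println!("default: {} bytes", limit_bytes(10));
    println!("raised:  {} bytes", limit_bytes(1000));
}
```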
Cheers
Can you try increasing the dfa limit as well? Might speed things up too
I'm going to re-open this because I think this limit should be configurable without re-compiling ripgrep.
Nice! Down to seconds now!
Wow. My guess is that you were previously exhausting the cache space of the DFA, so it probably did a lot of thrashing or dropped down to one of the (much) slower NFA engines.
Just a comment (I have not tried resetting the limit, as detailed above): I ran into this exact issue today, comparing a ~15K list (individual words on separate lines) to another file (~272K; ditto). Large, I know, but grep trounced that task, whereas ripgrep failed:
time rg -Fxv -f my_stopwords.txt my_list | sort | uniq > \
~/projects/nlp/tfidf/output/kw_not_sw
Compiled regex exceeds size limit of 10485760 bytes.
Command exited with non-zero status 1
0:01.58
time grep -Fxv -f my_stopwords.txt my_list | sort | uniq > \
~/projects/nlp/tfidf/output/kw_not_sw
0:00.14
Arch Linux x86_64; 32MB RAM + swap + tmpfs; ripgrep v.0.7.1; grep (GNU grep) v.3.1
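For this specific `-Fxv -f` task (print the lines of one file that are not exactly equal to any line of another), no regex is needed at all. The sketch below uses a plain hash-set lookup instead. It is not how grep or ripgrep implement these flags; the function name `lines_not_in` is hypothetical, and the inputs are passed as in-memory strings to keep the example self-contained.

```rust
use std::collections::HashSet;

/// Hypothetical helper: return the lines of `input` that do not exactly
/// match any line of `stopwords` (fixed strings, whole-line, inverted).
fn lines_not_in<'a>(stopwords: &str, input: &'a str) -> Vec<&'a str> {
    // One hash lookup per input line, independent of the stopword count.
    let stop: HashSet<&str> = stopwords.lines().collect();
    input
        .lines()
        .filter(|line| !stop.contains(*line))
        .collect()
}

fn main() {
    let stopwords = "the\nof\n'tis\n";
    let list = "cell\nthe\ngene\nof\nrna\n";
    for line in lines_not_in(stopwords, list) {
        println!("{}", line);
    }
}
```

Building the set costs time proportional to the stopword list, and each input line is then checked with a single lookup, so the total work stays roughly linear in the size of both files.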
@victoriastuart Increase the size limit using --regex-size-limit.
Hi Andrew; thank you for the project/code, comment -- appreciated. Love ripgrep (v. fast, generally)! :-D
Some observations (just a FYI; I'm happy with using grep, here):
$ time rg --regex-size-limit 100M -Fxv -f \
/mnt/Vancouver/Programming/data/nlp/stopwords/my_stopwords.txt \
/mnt/Vancouver/untitled | sort | uniq > ~/projects/nlp/tfidf/output/kw_not_sw
Compiled regex exceeds size limit of 104857600 bytes.
Command exited with non-zero status 1
0:01.55
$ time rg --regex-size-limit 200M -Fxv -f \
/mnt/Vancouver/Programming/data/nlp/stopwords/my_stopwords.txt \
/mnt/Vancouver/untitled | sort | uniq > ~/projects/nlp/tfidf/output/kw_not_sw
^C ## manually terminated
Command terminated by signal 2
21:07.17 ## 21 min
$ time grep -Fxv -f \
/mnt/Vancouver/Programming/data/nlp/stopwords/my_stopwords.txt \
/mnt/Vancouver/untitled | sort | uniq > ~/projects/nlp/tfidf/output/kw_not_sw
0:00.13 ## ~0.1 sec
Regarding the ripgrep experiments:
* Intel Core i7-4790 CPU @ 3.60 GHz x 4 cores (hyper-threaded to 8 threads):
* only 1 of 4 CPU/threads used @100% at any one time (rest ~@10-20%;
includes other background processes, I presume).
* Memory (RAM; swap) usage unchanged throughout.
* Noted:
* https://github.com/tiehuis/ripgrep/commit/593895566512f45b4dbdec78de2c46a7a91e8e18
* The default limit is 10M. This should only be changed on very large regex
inputs where the (slower) fallback regex engine may otherwise be used
You need to increase --dfa-size-limit too.
Noting https://github.com/BurntSushi/ripgrep/issues/497 ,
time rg --regex-size-limit 200M --dfa-size-limit 10G -Fxv -f \
/mnt/Vancouver/Programming/data/nlp/stopwords/my_stopwords.txt \
/mnt/Vancouver/untitled > ~/projects/nlp/tfidf/output/kw_not_sw
^C Command terminated by signal 2
33:58.19 ## manually-terminated at t ~34 min
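The runs above show that `--regex-size-limit 100M` is reported as 104857600 bytes, i.e. the `M` suffix means 1 << 20. The sketch below reproduces that mapping. It is an illustration only, not ripgrep's argument parser: the name `parse_size` is hypothetical, and treating `K` and `G` as 1 << 10 and 1 << 30 is an assumption extrapolated from the observed `M` behaviour.

```rust
/// Hypothetical helper: turn a size argument such as "100M" into bytes.
/// Only the `M` mapping is confirmed by the error message in this thread;
/// `K` and `G` are assumed to follow the same binary convention.
fn parse_size(arg: &str) -> Option<u64> {
    let arg = arg.trim();
    let (digits, shift) = match arg.chars().last()? {
        'K' => (&arg[..arg.len() - 1], 10),
        'M' => (&arg[..arg.len() - 1], 20),
        'G' => (&arg[..arg.len() - 1], 30),
        _ => (arg, 0),
    };
    let n: u64 = digits.parse().ok()?;
    // Return None instead of wrapping if the result overflows u64.
    n.checked_mul(1u64 << shift)
}

fn main() {
    for arg in ["100M", "200M", "10G"] {
        println!("{} -> {:?} bytes", arg, parse_size(arg));
    }
}
```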
@victoriastuart thanks! Is it possible for you to share the data you are searching and your regex queries as well? Or tell me how to get it? Or can you reproduce it on publicly available data?
Hi ... it's my own data (private), but simply lists of words; e.g.:
#wed
'd
'll
'm
're
's
't
'tis
'twas
've
**
+summary
--
-141
-2
-200c
-a
-casts
-determination
-immunocompatible
-induced
-researchers
-resistant
-sensitive
-β
...
.b
/diaphanous
/pax7
0-10
0-μm
0.02
0.2
0.5
000
000-fold
00000002-0030
00000004-0010
[ ... SNIP!! ... ]
zzzxpster
zörnig
zünd
zürich
~130
~130,000
²-actin
µm
ß
ángel
école
òscar
öaw
ûlo
α
α-helical
αβ
β
β-sandwich
β-sheets
β-subunit
β1
β2
β3
γ-carboxyl
μm
−141
```
['cell', 'gene', 'rna', 'cancer', 'protein', 'disease', 'dna', 'mouse', 'human', 'tumor', 'expression', 'genetic', 'function', 'mrna', 'tissue', 'brain', 'memory', 'genome', 'mirna', 'mechanism', 'molecular', 'stem', 'rnai', 'mutation', 'pathway', 'breast', 'blood', 'micrornas', 'cellular', 'lncrna', ...]
```
formatted for processing (~15K lines, this particular output):
```
cell
gene
rna
cancer
protein
disease
dna
mouse
human
tumor
expression
genetic
function
mrna
tissue
brain
memory
genome
mirna
mechanism
molecular
stem
rnai
mutation
pathway
breast
blood
micrornas
cellular
lncrna
[ ... SNIP!! ... ]
free-energy
bldg
edn
br
lookup
sciencemag
453-466
ac9erisnzlnjed
aczdv
csb28z6ji
g7bq
q8uv1gph4msxsoqw8ar
xad
zooew
tt
phylogenet
```