Gitea: Indexer returns no results for some terms

Created on 23 Jan 2020 · 27 Comments · Source: go-gitea/gitea

  • Gitea version (or commit ref): 1.11.0-rc1
  • Git version: 2.24.1
  • Operating system: Debian testing
  • Database (use [x]):

    • [ ] PostgreSQL

    • [ ] MySQL

    • [ ] MSSQL

    • [X] SQLite

  • Can you reproduce the bug at https://try.gitea.io:

    • [ ] Yes (provide example URL)

    • [ ] No

    • [X] Not relevant

Description

I enabled the indexer. It has been running for a couple of days since then. I am able to search and get some results, but some terms return no results on the Code search page, while grep finds tens of matches for the same terms.

For the term "tool_set" in the Code search page I get No source code matching your search term found.

Grepping the same code base (even after deleting the comment lines):

find -type f -name "*.py" -exec grep -i 'tool_set' {} \; | sed '/#/d' | wc -l
44
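For anyone without the shell tools at hand, the same count can be approximated in Python. This is only a sketch of the pipeline above: it walks `*.py` files, matches the term case-insensitively, and skips any line that contains `#`, as `sed '/#/d'` does. The function name `count_matching_lines` is made up for illustration.

```python
from pathlib import Path


def count_matching_lines(root, term, pattern="*.py"):
    """Count lines containing `term` (case-insensitive), skipping lines with '#'."""
    needle = term.lower()
    total = 0
    for path in Path(root).rglob(pattern):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going on files with odd encodings.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line in handle:
                # Mirrors: grep -i TERM | sed '/#/d'
                if needle in line.lower() and "#" not in line:
                    total += 1
    return total
```

Calling `count_matching_lines(".", "tool_set")` from the repository root should give a number comparable to the `wc -l` figure.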


[indexer]
REPO_INDEXER_ENABLED = true
ISSUE_INDEXER_PATH: indexers/issues.bleve
REPO_INDEXER_PATH: indexers/repos.bleve
UPDATE_BUFFER_LEN: 20
MAX_FILE_SIZE: 1048576
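A quick way to sanity-check a snippet like this is to load it with a generic INI parser. The sketch below uses Python's `configparser`, which is not the parser Gitea itself uses, so it only shows that the section reads cleanly with both `=` and `:` as delimiters; the function name `read_indexer_settings` is hypothetical.

```python
import configparser

# The [indexer] section exactly as posted above.
APP_INI_SNIPPET = """
[indexer]
REPO_INDEXER_ENABLED = true
ISSUE_INDEXER_PATH: indexers/issues.bleve
REPO_INDEXER_PATH: indexers/repos.bleve
UPDATE_BUFFER_LEN: 20
MAX_FILE_SIZE: 1048576
"""


def read_indexer_settings(text):
    """Parse the [indexer] section and return a few typed values."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the upper-case key names as written
    parser.read_string(text)
    section = parser["indexer"]
    return {
        "enabled": section.getboolean("REPO_INDEXER_ENABLED"),
        "repo_path": section["REPO_INDEXER_PATH"],
        "max_file_size": section.getint("MAX_FILE_SIZE"),
    }
```

If the code indexer were switched off, `enabled` would come back as `False`, which is the first thing worth ruling out when searches return nothing.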

All 27 comments

The indexer itself can handle your case. I've specifically tested with tool_set and it was indexed correctly when I ran the indexer from scratch. The indexer is having some problems, however, because I'm getting errors in the log I can't pinpoint like:

2020/01/22 21:04:05 ...ndexer/code/queue.go:39:processRepoIndexerOperationQueue() [E] indexer.Index: exit status 1
        /home/gprandi/src/code.gitea.io/gitea/modules/indexer/code/queue.go:39 (0x187f5f7)
                processRepoIndexerOperationQueue: log.Error("indexer.Index: %v", err)
        /home/gprandi/go/src/runtime/asm_amd64.s:1357 (0x46f5d0)
                goexit: BYTE    $0x90   // NOP

Which clogs the indexer queue. If I restart the instance and commit new changes to the repository, the indexer seems to pick them up correctly.

The indexer is expected to take a "long time" to build, but not _days_. It took a couple of minutes to build my indexes from scratch on 327 MB of repositories.

That is interesting. Where is a good place to see the indexer having issues? I grepped the Gitea log but there is not much that I can see:

https://paste.debian.net/hidden/3591c5ca/

I also wonder if there is a limit to the size of the indexer db; mine is at 285 MB now and I have many repos in there.

My log configuration in app.ini:

[log]
MODE             = file
MAX_DAYS         = 15
LEVEL = Info
ROUTER           = file
ROUTER_LOG_LEVEL = Trace
STACKTRACE_LEVEL = Error
XORM = file
REDIRECT_MACARON_LOG = true

[log.file.xorm]
FILE_NAME = xorm.log

(It's a little redacted, so maybe not all options make sense)

This separates the SQL (XORM) log from the other logs, making everything cleaner. I've also set up a trace to every error, so I know exactly where every log is produced.

To get a meaningful log I stopped Gitea and deleted the repos.bleve directory to force the system to rebuild them when restarted. You'll know it finished when it stops growing (which is _not necessarily_ when the log says it does... in fact my log was not useful about that).
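Since the log does not reliably say when the rebuild is done, one option is to poll the index file and wait until its size stops changing. The following is a minimal sketch of that idea; the function name, the polling interval, and the number of stable checks are all assumptions, and the path has to be adjusted to wherever the index is stored on your instance.

```python
import os
import time


def wait_until_stable(path, interval=30.0, stable_checks=3, sleep=time.sleep):
    """Return the file size once it has stayed unchanged for `stable_checks` polls."""
    last_size = os.path.getsize(path)
    unchanged = 0
    while unchanged < stable_checks:
        sleep(interval)
        size = os.path.getsize(path)
        if size == last_size:
            unchanged += 1
        else:
            # Still growing (or shrinking): start counting again.
            unchanged = 0
            last_size = size
    return last_size
```

A stable size is only a hint that indexing has finished; an indexer that is stuck would look the same, so it is worth checking the log for errors as well.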

Then I edited a file using the web UI, and when the indexer attempted to do its thing, it crashed.

(NOTE: your paste doesn't say much, unfortunately)

@guillep2k what's the gitea version?

@lunny I've tested on master as of today. (53f9dbfc7bd322a439bd6c6582d69506c7244384)

I also wonder if there is a limit to the size of the indexer db; mine is at 285 MB now and I have many repos in there.

BTW, the indexes of my prod instance are 1.3GB from 1.4GB of repositories (working fine on Gitea 1.10.3).

@guillep2k I will test with the latest rc2 from today. I will delete the database and force it again.

Btw is there a way to force the indexer while gitea is running?

Btw is there a way to force the indexer while gitea is running?

If by force you mean rebuild all, no, there isn't. But files are re-indexed with each commit (only the affected files, the whole file is re-indexed, not just the diff).

Hmm the latest rc2 fails on me with

2020/01/22 22:52:07 .../xorm/session_raw.go:78:queryRows() [I] [SQL] SELECT `name` FROM `user` WHERE `id`=? LIMIT 1 []interface {}{1} - took: 26.358µs
2020/01/22 22:52:07 .../xorm/session_raw.go:78:queryRows() [I] [SQL] SELECT `name` FROM `user` WHERE `id`=? LIMIT 1 []interface {}{1} - took: 44.862µs
2020/01/22 22:52:07 .../xorm/session_raw.go:78:queryRows() [I] [SQL] SELECT `name` FROM `user` WHERE `id`=? LIMIT 1 []interface {}{1} - took: 34.792µs
2020/01/22 22:52:07 .../xorm/session_raw.go:78:queryRows() [I] [SQL] SELECT `name` FROM `user` WHERE `id`=? LIMIT 1 []interface {}{1} - took: 27.177µs
2020/01/22 22:52:07 ...exer/code/indexer.go:54:func2() [I] PID: 3759700 Initializing Repository Indexer at: /opt/gitea/indexers/repos.bleve
2020/01/22 22:52:07 ...er/issues/indexer.go:142:func2() [I] PID 3759700: Initializing Issue Indexer: bleve
2020/01/22 22:52:07 .../xorm/session_raw.go:78:queryRows() [I] [SQL] SELECT `pull_request`.`id` FROM `pull_request` WHERE (status=?) []interface {}{1} - took: 158.714µs
2020/01/22 22:52:07 .../xorm/session_raw.go:78:queryRows() [I] [SQL] SELECT `id`, `repo_id`, `hook_id`, `uuid`, `type`, `url`, `signature`, `payload_content`, `http_method`, `content_type`, `event_type`, `is_ssl`, `is_delivered`, `delivered`, `is_succeed`, `request_content`, `response_content` FROM `hook_task` WHERE (is_delivered=?) []interface {}{false} - took: 213.028µs
2020/01/22 22:52:07 routers/init.go:122:GlobalInit() [I] SQLite3 Supported
2020/01/22 22:52:07 routers/init.go:46:checkRunMode() [I] Run Mode: Production
2020/01/22 22:52:07 ...ndexer/code/bleve.go:228:Close() [D] Closing repo indexer
2020/01/22 22:52:07 ...ndexer/code/bleve.go:235:Close() [I] PID: 3759700 Repository Indexer closed
2020/01/22 22:52:07 ...exer/code/indexer.go:63:func2() [F] PID: 3759700 Unable to initialize the Repository Indexer at path: /opt/gitea/indexers/repos.bleve Error: error parsing mapping JSON: unexpected end of JSON input
        mapping contents:

        /go/src/code.gitea.io/gitea/modules/indexer/code/indexer.go:63 (0x124255f)
        /usr/local/go/src/runtime/asm_amd64.s:1357 (0x466c70)


Could you find the file rupture_sharded_meta.json in the indexer directory?

There is no rupture_sharded_meta.json

find -L -type f|grep -i rupt
./indexers/issues.bleve/rupture_meta.json
./indexers/repos.bleve/rupture_meta.json

@gerroon could you paste the content of those two files?

cat issues.bleve/rupture_meta.json repos.bleve/rupture_meta.json

{"version":1}{"version":4}
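Because `cat` runs the two files together, it can be easier to read each `rupture_meta.json` separately. The sketch below assumes the layout shown by the `find` output above (one `rupture_meta.json` inside each `*.bleve` directory, holding a JSON object with a `version` key); the function name is made up.

```python
import json
from pathlib import Path


def read_index_versions(indexers_dir):
    """Map each *.bleve directory name to the version in its rupture_meta.json."""
    versions = {}
    for meta in sorted(Path(indexers_dir).glob("*.bleve/rupture_meta.json")):
        with meta.open(encoding="utf-8") as handle:
            # .get() returns None if the file has no "version" key.
            versions[meta.parent.name] = json.load(handle).get("version")
    return versions
```

For the output pasted above, this would report version 1 for `issues.bleve` and version 4 for `repos.bleve`.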

OK, I deleted all the indexer data and installed the latest nightly (v1.11.0-rc2). The database grew to 3 GB:

-rw-r--r-- 1 git git   47 Jan 23 00:04 index_meta.json
-rw-r--r-- 1 git git   13 Jan 23 00:04 rupture_meta.json
-rw------- 1 git git 3.0G Jan 23 08:27 store

However, it still can't find tool_set.

I did another search for builtin. It returned about 30 results across all the Gitea-controlled repos. Since I do not have clones of all the repos, I ran the same search with grep in the largest one I have cloned. The difference is huge, not even close (30 vs 542).

grep -ir "builtin" *|wc -l
542

One thing I am seeing is this process:

183.27 K/s 0.00 B/s 0.00 % 95.49 % gitea web -c /opt/gitea/custom/conf/app.ini

It is constantly reading (holding 99% of the system I/O) without writing, and it never finishes whatever it is doing. The database store file was last updated about 4 hours ago. So whatever is being read from disk is not being written back?

Here is the lsof for gitea

   1    unix                            33206 type=STREAM
    2    unix                            33206 type=STREAM
    3     REG       0x30     869488   66071775 /media/DRIVE/_TEMP/LOG/gitea/gitea.log
    4 a_inode        0xe          0       8828 [eventpoll]
    5     REG       0x30      31205   66071776 /media/DRIVE/_TEMP/LOG/gitea/macaron.log
    6     REG       0x30          0   66071777 /media/DRIVE/_TEMP/LOG/gitea/router.log
    7     REG       0x30     696800   66071778 /media/DRIVE/_TEMP/LOG/gitea/xorm.log
    8     REG      0x822    2433024    6197630 /media/DRIVEB/opt/gitea/data/gitea.db
    9     REG      0x822          0    6167383 /media/DRIVEB/opt/gitea/data/queues/issue_indexer/LOCK
   10     REG      0x822      28139    6167384 /media/DRIVEB/opt/gitea/data/queues/issue_indexer/LOG
   11    IPv6                                  *:3000
   12     REG      0x822      39378    6163834 /media/DRIVEB/opt/gitea/data/queues/issue_indexer/000102.log
   13     REG      0x822        110    6163840 /media/DRIVEB/opt/gitea/data/queues/issue_indexer/MANIFEST-000103
   14     REG      0x822      15305    6209772 /media/DRIVEB/opt/gitea/data/queues/issue_indexer/000037.ldb
   15     REG      0x822        127    6207818 /media/DRIVEB/opt/gitea/data/queues/issue_indexer/000002.ldb
   16     REG      0x822          0    6167673 /media/DRIVEB/opt/gitea/data/queues/task/LOCK
   17     REG      0x822      26545    6197256 /media/DRIVEB/opt/gitea/data/queues/task/LOG
   18     REG      0x822          0    6164917 /media/DRIVEB/opt/gitea/data/queues/task/000084.log
   19     REG      0x822         70    6164929 /media/DRIVEB/opt/gitea/data/queues/task/MANIFEST-000085
   20     REG      0x822        127    6167385 /media/DRIVEB/opt/gitea/data/queues/task/000002.ldb
   21     REG       0x30    1048576   66074727 /media/DRIVE/GITEA/indexers/issues.bleve/store
   22     REG       0x30 3211452416   66074729 /media/DRIVE/GITEA/indexers/repos.bleve/store
   23    IPv6                                  localhost:3000->localhost:43982
  cwd     DIR       0x30         58     400815 /media/DRIVE/REPO/GITEA
  mem     REG       0x2b              66074727 /media/DRIVE/GITEA/indexers/issues.bleve/store (path dev=0,48)
  mem     REG       0x2b              66074729 /media/DRIVE/GITEA/indexers/repos.bleve/store (path dev=0,48)
  rtd     DIR      0x825       4096          2 /
  txt     REG      0x822   82951528    6056650 /media/DRIVEB/opt/gitea/gitea



It would be useful to have some logs for the time span of your tests.

EDIT: (I mean, for context)

I would like to, but there is a lot of personal information in the logs: a lot about my projects, issues, wikis, etc. If you can tell me what specifically you are looking for, such as crashes, I can definitely provide it. But I am not seeing any of those there.

I think I've found an important bug! But it should only manifest itself as repos not being _updated_ (creation of indexes from scratch should not be affected).

As for the error message in my instance:

2020/01/22 21:04:05 ...ndexer/code/queue.go:39:processRepoIndexerOperationQueue() [E] indexer.Index: exit status 1

I've been debugging and it turns out this error is expected as I have one corrupt repo, so git show-ref -s returns.... a silent exit status of 1. I believe this should not affect the indexing of other repos, because the error is logged and the indexer just continues processing its queue.
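To check a repository by hand for this kind of silent failure, the command can be run directly and its exit status inspected. This is only a sketch: the function names are hypothetical, it is not how Gitea invokes git internally, and `check_show_ref` needs `git` on the PATH.

```python
import subprocess


def is_silent_failure(returncode, stdout, stderr):
    """True when a command failed without printing anything at all."""
    return returncode != 0 and not stdout.strip() and not stderr.strip()


def check_show_ref(repo_path):
    """Run `git show-ref -s` in repo_path; return (exit_code, silent_failure)."""
    result = subprocess.run(
        ["git", "show-ref", "-s"],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit is what we want to inspect, not raise on
    )
    silent = is_silent_failure(result.returncode, result.stdout, result.stderr)
    return result.returncode, silent
```

Running `check_show_ref` over each repository directory would point out the ones that produce `exit status 1` with no accompanying message.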

About the bug I've mentioned, I'll post a PR momentarily.

Sounds good.

I just started from scratch again. This time I added an include-files list to limit the scope, since I am mostly interested in txt and py files (my repos have a lot of binary files too). I will report back if that does any good.

OK, that did not work perfectly either. Here is the result from the Gitea Code search page for builtin.transform. I am only including results from the same repo for both Code search and grep.

One speculation I can make is that Code search seems to return only one result per file (compare it to the grep search), which may be one of the culprits, if not the whole problem.

MayaConfigV3_2/fa_hotkeys.py
View File

     {"properties":
      [("name", 'builtin.transform'),
       ],

QWER/QWER_Industry_Keymap.py
View File

     {"properties":
      [("name", 'builtin.transform'),
       ],

and here is from the terminal

grep -ir "builtin.transform" *

MayaConfigV3_2/fa_hotkeys.py:536:      [("name", 'builtin.transform'),
MayaConfigV3_2/fa_hotkeys.py:2547:      [("name", 'builtin.transform'),
MayaConfigV3_2/fa_hotkeys.py:2708:      [("name", 'builtin.transform'),
MayaConfigV3_2/fa_hotkeys.py:2715:      [("name", 'builtin.transform'),
QWER/QWER_Industry_Keymap.py:1307:      [("name", 'builtin.transform'),
QWER/QWER_Industry_Keymap.py:1314:      [("name", 'builtin.transform'),
QWER/QWER_Industry_Keymap.py:1321:      [("name", 'builtin.transform'),
QWER/QWER_Industry_Keymap.py:2050:      [("name", 'builtin.transform'),
QWER/QWER_Industry_Keymap.py:2057:      [("name", 'builtin.transform'),
QWER/QWER_Industry_Keymap.py:2064:      [("name", 'builtin.transform'),
QWER/QWER_Industry_Keymap.py:5407:      [("name", 'builtin.transform'),
QWER/QWER_Industry_Keymap.py:7050:      [("name", 'builtin.transform'),

It is still not reporting anything about "tool_set" for the repo I listed above, but see what ack returns for that repo:

ack tool_set * | wc -l
3849

Oh! 🤦‍♂️

The indexer indexes only the first instance of any term _per file_. It's not meant to be a full text search.
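This is why the numbers from grep and from Code search cannot be compared directly: grep counts matching lines, while the search page lists matching files. A rough sketch that reports both figures for a local clone (the function name is made up, and it is a plain substring match, not what the indexer does internally):

```python
from pathlib import Path


def count_lines_and_files(root, term, pattern="*.py"):
    """Return (matching_lines, matching_files) for a case-insensitive term."""
    needle = term.lower()
    line_hits = 0
    file_hits = 0
    for path in Path(root).rglob(pattern):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        hits = sum(1 for line in text.splitlines() if needle in line.lower())
        line_hits += hits
        if hits:
            # A file counts once, however many lines match inside it.
            file_hits += 1
    return line_hits, file_hits
```

For the builtin.transform example above, grep printed 12 lines spread over 2 files, and Code search showed exactly those 2 files.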

Interesting. Then maybe it is not even going to return partial results?

Here tool_set_by_name returns 2 results from the whole of Gitea. Meanwhile grep can return many for a single repo. Maybe that explains, in some way, why "tool_set" returns none?

It _should_ return a result per file where it occurs, as long as it's in master (or whatever branch is your default) and "indexable" (i.e. not filtered out by your settings or ... ahem ... _perhaps your files are marked as executable_?). 😳

as per @guillep2k

May this https://github.com/go-gitea/gitea/issues/9190#issuecomment-571563226 be related to this issue?

Re-checked https://github.com/go-gitea/gitea/issues/9190#issuecomment-571563226 behavior with latest upstream version
1.12.0+dev-174-g5b17bb8f3
Seems working now!
Repo index was updated after git push

@vvrein I'm gonna close this as Fixed by #9965 and #9957
