I'm reopening issue #9486 because I see the same behaviour with the final 1.5.0 version.
__System info:__
Branch | Build Time | Commit | Version
-- | -- | -- | --
1.5 | "" | "6ac835404e7e64ea7299a6eebcce1ab1ef15fe3c" | "1.5.0"
__OS:__
Red Hat Enterprise Linux Server release 6.4 (Santiago)
or
CentOS Linux release 7.2.1511 (Core)
__Steps to reproduce:__
Start the same instance (same data, same configuration) with 1.3.5, 1.4.2, and 1.5.0 and compare the startup logs:
[I] 2018-03-09T13:35:40Z InfluxDB starting, version 1.3.5, branch HEAD, commit 9d9001036d3585cf21925c13a57881bc6c8dcc7e
[I] 2018-03-09T13:35:40Z Go version go1.8.3, GOMAXPROCS set to 12
[I] 2018-03-09T13:35:40Z Using configuration at: /tech/influxdb/conf/influxdb.conf
...
[I] 2018-03-09T13:35:55Z /tech/influxdb/var/data/influxdb/data/telegraf/default/2323 opened in 8.036002793s service=store
...
[I] 2018-03-09T13:35:57Z Listening on HTTP:0.0.0.0:9001 service=httpd
[I] 2018-03-09T14:00:28Z InfluxDB starting, version 1.4.2, branch 1.4, commit 6d2685d1738277a1c2672fc58df7994627769be6
[I] 2018-03-09T14:00:28Z Go version go1.9.2, GOMAXPROCS set to 12
[I] 2018-03-09T14:00:28Z Using configuration at: /tech/influxdb/conf/influxdb.conf
...
[I] 2018-03-09T14:00:44Z /tech/influxdb/var/data/influxdb/data/telegraf/default/2323 opened in 8.537640076s service=store
...
[I] 2018-03-09T14:00:47Z Listening on HTTP:0.0.0.0:9001 service=httpd
ts=2018-03-09T13:43:45.145501Z lvl=info msg="InfluxDB starting" log_id=06jktnzG000 version=1.5.0 branch=1.5 commit=6ac835404e7e64ea7299a6eebcce1ab1ef15fe3c
ts=2018-03-09T13:43:45.145557Z lvl=info msg="Go runtime" log_id=06jktnzG000 version=go1.9.2 maxprocs=12
...
ts=2018-03-09T13:54:22.970753Z lvl=info msg="Opened shard" log_id=06jktnzG000 service=store trace_id=06jktnzl000 op_name=tsdb_open path=/tech/influxdb/var/data/influxdb/data/telegraf/default/2323 duration=251247.336ms
...
ts=2018-03-09T13:54:33.583256Z lvl=info msg="Listening on HTTP" log_id=06jktnzG000 service=httpd addr=0.0.0.0:9001 https=false
__Expected behavior:__
Similar startup times across all versions.
__Actual behavior:__
Release | Total startup time | Shard 2323 open time
-- | -- | --
1.3.5 | 17s | 8.036002793s
1.4.2 | 19s | 8.537640076s
1.5.0 | 11min | 251247.336ms (≈ 4min 11s)
We also tested 1.5.0 with tsi1 enabled (after running the influx_inspect buildtsi command) and got the same result.
@ptitou Can you provide some details about the hardware? e.g. How many cores, how much RAM, and are you running SSDs?
Does this occur on every startup or just the first time?
Hardware info:
CPU: 1x Intel(R) Xeon(R) E5-2620 0 @ 2.00GHz (6 cores / 12 threads)
RAM: MemTotal 65862900 kB (64GB)
Disk: hardware RAID5 of 3 HDDs (1TB total)
It occurs on every startup, even on other servers (we tested on a VM with SSD and saw the same duration).
@ptitou How many series do you have in the shard?
Hi,
I'm working on this issue with @ptitou.
See below for more details on the series in that shard:
Summary:
File: /tech/influxdb/var/data/influxdb/data/telegraf/default/2323/000002441-000000005.tsm
Time Range: 2018-01-08T00:00:00Z - 2018-01-14T23:59:55Z
Duration: 167h59m55s Series: 17126 File Size: 2147502592
File: /tech/influxdb/var/data/influxdb/data/telegraf/default/2323/000002441-000000006.tsm
Time Range: 2018-01-08T00:00:00Z - 2018-01-14T23:59:55Z
Duration: 167h59m55s Series: 217919 File Size: 2147702233
File: /tech/influxdb/var/data/influxdb/data/telegraf/default/2323/000002441-000000007.tsm
Time Range: 2018-01-08T00:00:00Z - 2018-01-14T23:59:55Z
Duration: 167h59m55s Series: 147602 File Size: 2147633909
File: /tech/influxdb/var/data/influxdb/data/telegraf/default/2323/000002441-000000008.tsm
Time Range: 2018-01-08T00:00:00Z - 2018-01-14T23:59:55Z
Duration: 167h59m55s Series: 17395 File Size: 273550172
@max3163 Thank you. Can you start up influxd with some additional flags and SIGINT (CTRL-C) it after it loads the shards?
$ influxd run -cpuprofile cpu.pprof -memprofile mem.pprof
Then zip the cpu.pprof and mem.pprof files up and drop them here in this GitHub issue.
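For anyone unfamiliar with the files being requested: cpu.pprof and mem.pprof are standard Go pprof profiles. Below is a minimal sketch of how a Go program produces such profiles with runtime/pprof, purely for context; the influxd flags above do roughly the equivalent internally, and this is not influxd's code.

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"runtime"
	"runtime/pprof"
)

func main() {
	// Start a CPU profile that records samples until stopped.
	cpuFile, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	if err := pprof.StartCPUProfile(cpuFile); err != nil {
		log.Fatal(err)
	}

	// Do the work being profiled; here we simply wait for SIGINT (Ctrl-C),
	// mirroring the "start influxd, then interrupt it" instructions above.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt)
	<-sig

	pprof.StopCPUProfile()
	cpuFile.Close()

	// Write a heap profile describing current memory allocations.
	memFile, err := os.Create("mem.pprof")
	if err != nil {
		log.Fatal(err)
	}
	runtime.GC() // get up-to-date allocation statistics first
	if err := pprof.WriteHeapProfile(memFile); err != nil {
		log.Fatal(err)
	}
	memFile.Close()
}
```

The resulting files can be inspected locally with `go tool pprof`, which is how the maintainers read the attachments in this thread.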
Also, 1.5.0 takes MUCH more RAM, even when using the tsi1 index. I guess I'll have to switch back to 1.4.3 =\
Please find attached the pprof files as requested:
We also see twice the RAM usage on InfluxDB 1.5:

@max3163 @gabrielmocan Thanks for the info. We've consistently seen a drop in memory usage so I'm curious as to why your memory has gone up. I'm taking a look at the profiles now.
@gabrielmocan One thing to note is that enabling TSI will only affect new shards. You can build TSI indexes for existing shards using the buildtsi utility: http://docs.influxdata.com/influxdb/v1.5/tools/influx_inspect/#influx-inspect-buildtsi
@max3163 It looks like it's still using the inmem index from the profile: https://rawgit.com/benbjohnson/020eff9976185bc186f5d0cf7b469fa1/raw/17fde3e2394dcf3231088c32c083a3a8972a1bbd/mem.svg
Can you show the output of tree on your shard's data directory?
@benbjohnson I've migrated from 1.4.3 and the shards were already tsi1 indexed.
@gabrielmocan Unfortunately, you'll need to rebuild the TSI indexes when migrating to 1.5. Can you try that and let me know if it improves your memory usage?
@benbjohnson
As indicated in the title of the issue, yes, the problem occurs when we use the "inmem" engine.
We ran only one test with the tsi1 index on another instance (migrating the data with the influx_inspect tool as described in the docs), but we saw the same loading time.
Sorry, I don't have access to the server right now; I'll only be able to send you the output of the tree command on Monday.
@max3163 I apologize. I got sidetracked on the TSI side discussion. Disregard my tree request.
Sorry for disrupting the topic guys.
@benbjohnson
No problem! Don't hesitate to ask for more information if needed.
@max3163 I'm still looking into the issue. From the CPU profile it looks like there's not much CPU usage. The profile duration was 10.63s but there were only 62.24s of samples, which makes me think it's waiting on disk. Do you have startup logs for your SSD drives? Also, are these drives connected directly or mounted over NFS?
@benbjohnson The disks are attached to the server via a Dell RAID5 controller. I confirm there is plenty of iowait during startup on 1.5.0.
Here is a graph where we start v1.4.2 and then v1.5.0:

I'm also attaching a zip with the cpu.pprof and mem.pprof files for the 1.4.2 startup, to compare with 1.5.0, check whether concurrent accesses are generating the iowait, and see where the 1.5.0 startup process differs.
v1.4.2.zip
If we look at the disk I/O graph and add up the total data read, we notice that InfluxDB 1.5.0 reads about 480GB of data while the files on disk total about 240GB (half as much)!
Maybe that's a lead for the higher memory usage in 1.5.0 compared to 1.4.2?
I see large, constant read I/O: 122MB/s and 955 IOPS on a 1000GB GP2 EBS volume.
This is very slow.
1.4.3 startup vs 1.5.0 startup


@ptitou @szibis Do you have full startup logs that you can share? Also, how frequently are you adding new series? 1.5.0 uses a new series file that is persisted to disk; it lets us scale to a higher number of series, and we memory-map the data into the process.
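For context on the "memory-map the data into the process" part: mapping a file means its pages are loaded from disk lazily, on first access, rather than read up front. A minimal Go sketch of the general technique follows (using golang.org/x/sys/unix); it is illustrative only, not InfluxDB's actual series-file code.

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// mmapFile maps a file read-only into the process address space.
// No data is read yet; pages are faulted in from disk on first access.
func mmapFile(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close() // the mapping stays valid after the fd is closed

	fi, err := f.Stat()
	if err != nil {
		return nil, err
	}

	return unix.Mmap(int(f.Fd()), 0, int(fi.Size()), unix.PROT_READ, unix.MAP_SHARED)
}

func main() {
	// Usage: pass the path of any file, e.g. a TSM or series file.
	data, err := mmapFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer unix.Munmap(data)
	fmt.Printf("mapped %d bytes\n", len(data))
}
```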
Here are the startup log files for v1.4.2 and v1.5.0:
influxdb_log.zip
Also, here is the number of series for the last 7 days:

The tests were made on a snapshot of the production database, so there is no new data between the startup tests.
@ptitou Can you send the output of this command? It could be slow if your series data isn't being held in-memory. That doesn't seem like the case since you have 64GB of RAM but I want to double check.
$ find $DATADIR -type d | grep _series | xargs ls -l
Also, are you running this in a container or restarting the host between startups?
Here is the output of the find command:
find.zip
The database is running on a physical server; the host is not restarted between startups, and no container or VM is involved.
@ptitou Sorry for the delay. We're having trouble reproducing the issue. Are you able to share any of your data files? Or can you run influx_inspect report? In particular, shard qualif/default/1773 seems to have a huge disparity between 1.4.3 & 1.5.0 (7s vs 360,000s).
I don't have shard qualif/default/1773, but I ran the command on another shard, first with v1.5.0 and then with v1.4.3. Here is the result:
Shard open time from the startup logs:
1.4.2 : /tech/influxdb/var/data/influxdb/data/qualif/default/2175 5.114586342s
1.5.0 : /tech/influxdb/var/data/influxdb/data/qualif/default/2175 152163.862ms
v1.5.0
-bash-4.1$ /tech/influxdb/influxdb-1.5.0/usr/bin/influx_inspect report 2175
DB RP Shard File Series New (est) Min Time Max Time Load Time
qualif default 2175 000000438-000000003.tsm 26154 26243 2017-11-27T00:00:00.070737193Z 2017-12-03T23:59:59.027678227Z 4.669501721s
qualif default 2175 000000438-000000004.tsm 12563 12183 2017-11-27T00:00:00.56750589Z 2017-12-03T23:59:59.027678227Z 4.486162495s
qualif default 2175 000000438-000000005.tsm 56064 56584 2017-11-27T00:00:00.070737193Z 2017-12-03T23:59:59.027678227Z 3.656177153s
Summary:
Files: 3
Time Range: 2017-11-27T00:00:00.070737193Z - 2017-12-03T23:59:59.027678227Z
Duration: 167h59m58.956941034s
Statistics
Series:
- qualif (est): 95010 (100%)
Total (est): 95010
Completed in 12.85635712s
1.4.3
-bash-4.1$ /tech/influxdb/influxdb-1.4.2-1/usr/bin/influx_inspect report 2175
DB RP Shard File Series New (est) Min Time Max Time Load Time
qualif default 2175 000000438-000000003.tsm 26154 26243 2017-11-27T00:00:00.070737193Z 2017-12-03T23:59:59.027678227Z 7.092426ms
qualif default 2175 000000438-000000004.tsm 12563 12183 2017-11-27T00:00:00.56750589Z 2017-12-03T23:59:59.027678227Z 4.421906ms
qualif default 2175 000000438-000000005.tsm 56064 56584 2017-11-27T00:00:00.070737193Z 2017-12-03T23:59:59.027678227Z 14.351637ms
Summary:
Files: 3
Time Range: 2017-11-27T00:00:00.070737193Z - 2017-12-03T23:59:59.027678227Z
Duration: 167h59m58.956941034s
Statistics
Series:
- qualif (est): 95010 (100%)
Total (est): 95010
Completed in 95.538358ms
I don't know if that helps you. If you want a shard, do you have an FTP server where I can upload this one (6GB of data)?
@ptitou
Please send an email to [email protected] so we can share upload information.
@ptitou OK, this is weird. I've tried on several machines and I can't reproduce the issue. I even tried it on a limited 1 vcpu/1GB RAM machine:
$ wget https://dl.influxdata.com/influxdb/releases/influxdb-1.5.1_linux_amd64.tar.gz
$ tar xzvf influxdb-1.5.1_linux_amd64.tar.gz
$ influxdb-1.5.0-1/usr/bin/influx_inspect report 2175
DB RP Shard File Series New (est) Min Time Max Time Load Time
root 2175 000000438-000000003.tsm 26154 26243 2017-11-27T00:00:00.070737193Z 2017-12-03T23:59:59.027678227Z 6.088269ms
root 2175 000000438-000000004.tsm 12563 12183 2017-11-27T00:00:00.56750589Z 2017-12-03T23:59:59.027678227Z 3.050018ms
root 2175 000000438-000000005.tsm 56064 56584 2017-11-27T00:00:00.070737193Z 2017-12-03T23:59:59.027678227Z 16.402941ms
Summary:
Files: 3
Time Range: 2017-11-27T00:00:00.070737193Z - 2017-12-03T23:59:59.027678227Z
Duration: 167h59m58.956941034s
Statistics
Series:
- (est): 95010 (100%)
Total (est): 95010
Completed in 184.327883ms
$ wget https://dl.influxdata.com/influxdb/releases/influxdb-1.4.3_linux_amd64.tar.gz
$ tar xzvf influxdb-1.4.3_linux_amd64.tar.gz
$ influxdb-1.4.3-1/usr/bin/influx_inspect report 2175
DB RP Shard File Series New (est) Min Time Max Time Load Time
root 2175 000000438-000000003.tsm 26154 26243 2017-11-27T00:00:00.070737193Z 2017-12-03T23:59:59.027678227Z 7.225517ms
root 2175 000000438-000000004.tsm 12563 12183 2017-11-27T00:00:00.56750589Z 2017-12-03T23:59:59.027678227Z 6.398312ms
root 2175 000000438-000000005.tsm 56064 56584 2017-11-27T00:00:00.070737193Z 2017-12-03T23:59:59.027678227Z 10.399071ms
Summary:
Files: 3
Time Range: 2017-11-27T00:00:00.070737193Z - 2017-12-03T23:59:59.027678227Z
Duration: 167h59m58.956941034s
Statistics
Series:
- (est): 95010 (100%)
Total (est): 95010
Completed in 186.647269ms
I accidentally fired up an Ubuntu instance instead of CentOS but I'm firing up a CentOS 7 instance right now to double check.
OK, it looks like it is an issue on CentOS. I'm not seeing quite the same disparity but it's still large. 1.4.3 is about 200ms and 1.5.0 is 2s.
We've got the same result on Red Hat Enterprise Linux Server release 6.4 (Santiago) and on CentOS Linux release 7.2.1511 (Core), as mentioned in the first comment, if that helps.
@ptitou I tracked it down to a change in madvise() that only seems to be affecting CentOS/RHEL. We are reverting it here: https://github.com/influxdata/influxdb/pull/9612
Thanks for all the help tracking this issue down.
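For readers wondering how an madvise() change can cause this: madvise passes an access-pattern hint to the kernel for a memory-mapped region, and a hint like MADV_RANDOM disables readahead, which can turn a mostly sequential startup scan into many small reads. Here is a minimal Go sketch of applying such a hint to a mapped file, assuming the hint in question was MADV_RANDOM or similar (the linked PR has the exact change); this is illustrative code, not InfluxDB's.

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	f, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	data, err := unix.Mmap(int(f.Fd()), 0, int(fi.Size()), unix.PROT_READ, unix.MAP_SHARED)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer unix.Munmap(data)

	// MADV_RANDOM tells the kernel to expect random access and to skip
	// readahead for this mapping (MADV_NORMAL restores the default).
	// With readahead disabled, a sequential scan of the mapping, like the
	// loop below, is served by many small disk reads instead of prefetches.
	if err := unix.Madvise(data, unix.MADV_RANDOM); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Touch one byte per page to force every page to be faulted in.
	var sum byte
	for i := 0; i < len(data); i += 4096 {
		sum += data[i]
	}
	fmt.Printf("scanned %d bytes (byte sum %d)\n", len(data), sum)
}
```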
👍 Thanks for your perseverance! I will try the patch and let you know if it works for us!
Hi, I have tested 1.5.1 on Debian and I also see long restart times. InfluxDB goes through all the shards and the disk is 100% busy reading. Opening each file takes from several seconds to more than a minute, as shown in the logs:
ts=2018-03-21T22:21:28.949409Z lvl=info msg="Opened file" log_id=06zc7cNl000 engine=tsm1 service=filestore path=/appdata/influxdb/data/service/collectd_service/3211/000009897-000000007.tsm id=4 duration=44701.636ms
ts=2018-03-21T22:21:28.949478Z lvl=info msg="Opened file" log_id=06zc7cNl000 engine=tsm1 service=filestore path=/appdata/influxdb/data/service/collectd_service/3211/000009897-000000015.tsm id=12 duration=35986.227ms
ts=2018-03-21T22:21:28.960476Z lvl=info msg="Opened file" log_id=06zc7cNl000 engine=tsm1 service=filestore path=/appdata/influxdb/data/service/collectd_service/3211/000009897-000000012.tsm id=9 duration=44415.333ms
ts=2018-03-21T22:21:29.066193Z lvl=info msg="Opened file" log_id=06zc7cNl000 engine=tsm1 service=filestore path=/appdata/influxdb/data/service/collectd_service/3211/000009897-000000013.tsm id=10 duration=44520.889ms
ts=2018-03-21T22:21:29.505564Z lvl=info msg="Opened file" log_id=06zc7cNl000 engine=tsm1 service=filestore path=/appdata/influxdb/data/service/collectd_service/3211/000009897-000000017.tsm id=14 duration=36529.187ms
ts=2018-03-21T22:21:52.160085Z lvl=info msg="Opened shard" log_id=06zc7cNl000 service=store trace_id=06zc7crl000 op_name=tsdb_open path=/appdata/influxdb/data/service/collectd_service/2444 duration=132423.805ms
ts=2018-03-21T22:21:52.160776Z lvl=info msg="Opened shard" log_id=06zc7cNl000 service=store trace_id=06zc7crl000 op_name=tsdb_open path=/appdata/influxdb/data/service/collectd_service/3183 duration=199630.454ms
ts=2018-03-21T22:21:52.161794Z lvl=info msg="Opened shard" log_id=06zc7cNl000 service=store trace_id=06zc7crl000 op_name=tsdb_open path=/appdata/influxdb/data/service/collectd_service/3211 duration=68156.950ms
The first attempt was a restart with index-version = "tsi1" after running influx_inspect buildtsi -datadir data -waldir wal offline. The current attempt is running with index-version = "inmem" after deleting all the index directories. Both cases take hours to restart.
Is there a reason why it tries to open all the shards on restart? The dataset is 2.8TB, with 2 million series distributed across 10 databases of different sizes.

Just as an additional note, in case it is helpful for someone else: the screenshot below shows the difference between restarting v1.5.1 with tsi1 vs inmem. The latter takes less time, consumes less memory, and looks more stable; the first restart with tsi1 took longer and started swapping at the end.

@ptitou
Sorry for the noob question here, but may I ask what query you used to get the total count of series like shown in your screenshot?
@markusressel
The query is:
SELECT last("numSeries") AS "numSeries" FROM "database" WHERE ("hostname" = '$host') AND $timeFilter GROUP BY time($__interval) fill(null)
It uses the _internal datasource.
I tested with today's nightly release and everything works fine!
Do you plan to release a version 1.5.2 with this commit (as discussed in issue #9614), or do we have to wait for release 1.6?
@ptitou Sorry for the delay. This fix is in 1.5.2.