I161117 06:53:54.818822 122 util/log/clog.go:899 [config] file created at: 2016/11/17 06:53:54
I161117 06:53:54.818822 122 util/log/clog.go:899 [config] running on machine: hadoop241
I161117 06:53:54.818822 122 util/log/clog.go:899 [config] binary: CockroachDB beta-20161103 (linux amd64, built 2016/11/03 15:03:54, go1.7.3)
I161117 06:53:54.818822 122 util/log/clog.go:899 [config] arguments: [/home/wen/clusterthree/cockroach start --insecure --host= --store=node1 --port=8000 --http-port=7000]
What did you do?
I was doing stress testing; after that, I ran a SELECT count on a table, bag_index.
What did you expect to see?
The count returned, as for a normal SELECT operation.
What did you see instead?
The SELECT had no response.
Examples:
The SELECT has no response:
[email protected]:8000> SELECT count(*) FROM bag_index;
Other tables are fine:
[email protected]:8000> SELECT count(*) FROM bag_index_compute;
+----------+
| count(*) |
+----------+
|    71314 |
+----------+
(1 row)
After restarting this node, the table is okay now:
[email protected]:8000> select count(*) from bag_index;
+----------+
| count(*) |
+----------+
|   123612 |
+----------+
(1 row)
Here are some logs
[wen@hadoop241 logs]$ less cockroach.hadoop241.wen.log.INFO.2016-11-15T09_45_50+08_00.3327
I161115 09:45:50.648868 1 util/log/clog.go:988 [config] file created at: 2016/11/15 09:45:50
I161115 09:45:50.648868 1 util/log/clog.go:988 [config] running on machine: hadoop241
I161115 09:45:50.648868 1 util/log/clog.go:988 [config] binary: CockroachDB beta-20161103 (linux amd64, built 2016/11/03 15:03:54, go1.7.3)
I161115 09:45:50.648868 1 util/log/clog.go:988 [config] arguments: [/home/wen/clusterthree/cockroach start --insecure --host= --store=node1 --port=8000 --http-port=7000]
I161115 09:45:50.648868 1 util/log/clog.go:988 line format: [IWEF]yymmdd hh:mm:ss.uuuuuu goid file:line msg
I161115 09:45:50.648867 1 cli/start.go:299 CockroachDB beta-20161103 (linux amd64, built 2016/11/03 15:03:54, go1.7.3)
I161115 09:45:50.756303 1 cli/backtrace.go:40 backtrace disabled: stat /opt/backtrace/bin/ptrace: no such file or directory
I161115 09:45:50.756366 1 cli/start.go:315 starting cockroach node
W161115 09:45:50.756421 1 server/server.go:154 [n?] running in insecure mode, this is strongly discouraged. See --insecure.
I161115 09:45:50.756657 1 gossip/gossip.go:237 [n?] initial resolvers: []
W161115 09:45:50.756678 1 gossip/gossip.go:1055 [n?] no resolvers found; use --join to specify a connected node
I161115 09:45:50.766629 1 storage/engine/rocksdb.go:326 opening rocksdb instance at "node1"
I161115 09:45:51.360771 1 server/config.go:443 1 storage engine initialized
I161115 09:45:51.364954 1 base/node_id.go:62 [n1] NodeID set to 1
W161115 09:45:51.630882 1 storage/store.go:766 [n1] storeMu: mutex held by github.com/cockroachdb/cockroach/pkg/storage.(*Store).Start for 265.797643ms (>100ms):
goroutine 1 [running]:
runtime/debug.Stack(0x1d25710, 0x1d25750, 0x3b)
/usr/local/go/src/runtime/debug/stack.go:24 +0x79
github.com/cockroachdb/cockroach/pkg/util/syncutil.ThresholdLogger.func1(0xfd7c00b)
/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/timedmutex.go:65 +0xe1
github.com/cockroachdb/cockroach/pkg/util/syncutil.(*TimedMutex).Unlock(0xc4203977b8)
/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/timedmutex.go:92 +0x80
github.com/cockroachdb/cockroach/pkg/storage.(*Store).Start(0xc420397500, 0x7f6a4b6b9dc0, 0xc420213c80, 0xc420465cb0, 0xc4205ecc70, 0x1)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:1052 +0x520
github.com/cockroachdb/cockroach/pkg/server.(*Node).initStores(0xc4201a0780, 0x7f6a4b6b9dc0, 0xc420213c80, 0xc4201461f0, 0x1, 0x1, 0xc420465cb0, 0x0, 0x0, 0x7f6a4b6b9dc0)
/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:418 +0x213
github.com/cockroachdb/cockroach/pkg/server.(*Node).start(0xc4201a0780, 0x7f6a4b6b9dc0, 0xc420213c80, 0x232c420, 0xc4203c5f60, 0xc4201461f0, 0x1, 0x1, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:340 +0x11d
github.com/cockroachdb/cockroach/pkg/server.(*Server).Start(0xc420202000, 0x7f6a4b6b9dc0, 0xc420213c80, 0xc42046b7c0, 0x1)
/go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:575 +0xccc
github.com/cockroachdb/cockroach/pkg/cli.runStart(0x2316b40, 0xc42044f680, 0x0, 0x5, 0x7fffb067c83f, 0x4)
/go/src/github.com/cockroachdb/cockroach/pkg/cli/start.go:324 +0x920
github.com/cockroachdb/cockroach/pkg/cli.maybeDecorateGRPCError.func1(0x2316b40, 0xc42044f680, 0x0, 0x5, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/cli/error.go:30 +0x5a
github.com/spf13/cobra.(*Command).execute(0x2316b40, 0xc42044f630, 0x5, 0x5, 0x2316b40, 0xc42044f630)
/go/src/github.com/spf13/cobra/command.go:632 +0x23e
github.com/spf13/cobra.(*Command).ExecuteC(0x2312740, 0xc42000e220, 0xebcdac937f56b809, 0xc42000e258)
/go/src/github.com/spf13/cobra/command.go:722 +0x367
github.com/spf13/cobra.(*Command).Execute(0x2312740, 0xc42000e240, 0xebcdac937f56b809)
/go/src/github.com/spf13/cobra/command.go:681 +0x2b
github.com/cockroachdb/cockroach/pkg/cli.Run(0xc42000c0f0, 0x6, 0x6, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/cli/cli.go:95 +0x6d
main.main()
/go/src/github.com/cockroachdb/cockroach/main.go:37 +0xe1
I161115 09:45:51.631284 1 storage/store.go:1173 [n1] [n1,s1]: failed initial metrics computation: [n1,s1]: system config not yet available
I161115 09:45:51.631321 1 server/node.go:433 [n1] initialized store [n1,s1]: {Capacity:105689415680 Available:57766645760 RangeCount:0}
I161115 09:45:51.631455 1 server/node.go:318 [n1] node ID 1 initialized
I161115 09:45:51.631566 1 gossip/gossip.go:280 [n1] NodeDescriptor set to node_id:1 address:<network_field:"tcp" address_field:"hadoop241:8000" > attrs:<> locality:<>
I161115 09:45:51.631713 1 storage/stores.go:296 [n1] read 2 node addresses from persistent storage
I161115 09:45:51.632068 1 server/node.go:563 [n1] connecting to gossip network to verify cluster ID...
I161115 09:45:51.632098 1 server/node.go:584 [n1] node connected via gossip and verified as part of cluster {"104a7b1e-070a-4472-94fd-6c8b26850864"}
I161115 09:45:51.632131 1 server/node.go:368 [n1] node=1: started with [[]=node1] engine(s) and attributes []
I161115 09:45:51.632172 1 server/server.go:605 [n1] starting http server at hadoop241:7000
I161115 09:45:51.632188 1 server/server.go:606 [n1] starting grpc/postgres server at hadoop241:8000
I161115 09:45:51.632382 1 server/server.go:607 [n1] advertising CockroachDB node at hadoop241:8000
I161115 09:45:51.774141 168 gossip/client.go:126 [n1] node 1: started gossip client to hadoop243:8001
I161115 09:45:54.579276 201 storage/raft_transport.go:423 [n1] raft transport stream to node 2 established
I161115 09:45:54.789133 161 storage/raft_transport.go:423 [n1] raft transport stream to node 3 established
I161115 09:45:54.802167 145 sql/event_log.go:95 [n1] Event: "node_restart", target: 1, info: {Descriptor:{NodeID:1 Address:{NetworkField:tcp AddressField:hadoop241:8000} Attrs: Locality:} ClusterID:{UUID:104a7b1e-070a-4472-94fd-6c8b26850864} StartedAt:1479174351632104074}
I161115 09:45:56.082623 231 storage/replica_raftstorage.go:574 [n1,s1,r116/1:/System/tsd/cr.store.{mu…-ra…}] [n1,s1,r116/1:/System/tsd/cr.store.{mu…-ra…}]: with replicaID 1, applying Raft snapshot at index 24327 (id=989003f9, encoded size=16, 25 rocksdb batches, 4048 log entries)
I161115 09:45:57.265915 326 storage/replica_raftstorage.go:574 [n1,s1,r1/1:/{Min-System/tsd/c…}] [n1,s1,r1/1:/{Min-System/tsd/c…}]: with replicaID 1, applying Raft snapshot at index 159280 (id=9d6b22b8, encoded size=16, 41 rocksdb batches, 3124 log entries)
I161115 09:45:57.550411 231 storage/replica_raftstorage.go:577 [n1,s1,r116/1:/System/tsd/cr.store.{mu…-ra…}] [n1,s1,r116/1:/System/tsd/cr.store.{mu…-ra…}]: with replicaID 1, applied Raft snapshot in 1.468s
W161115 09:45:57.551112 231 storage/replica.go:528 [n1,s1,r116/1:/System/tsd/cr.store.{mu…-ra…}] raftMu: mutex held by github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRaftRequest for 1.468760541s (>500ms):
goroutine 231 [running]:
runtime/debug.Stack(0x1d28bf8, 0x1d28c38, 0x48)
/usr/local/go/src/runtime/debug/stack.go:24 +0x79
github.com/cockroachdb/cockroach/pkg/util/syncutil.ThresholdLogger.func1(0x578b81dd)
/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/timedmutex.go:65 +0xe1
github.com/cockroachdb/cockroach/pkg/util/syncutil.(*TimedMutex).Unlock(0xc42026d1f8)
/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/timedmutex.go:92 +0x80
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRaftRequest(0xc420397500, 0x7f6a4b6b9dc0, 0xc420212db0, 0xc4203b4d78, 0x3f459234f9039098, 0x3e23945eaf8aba90, 0x74, 0xc420212a20, 0x2e, 0x30, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2923 +0xc88
github.com/cockroachdb/cockroach/pkg/storage.(*Store).HandleSnapshot(0xc420397500, 0xc4203b4d20, 0x233e340, 0xc4207621f0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2567 +0x6ac
github.com/cockroachdb/cockroach/pkg/storage.(*RaftTransport).RaftSnapshot.func1.1(0x233e340, 0xc4207621f0, 0xc420079500, 0xc4203caf38, 0xc4203caf40)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/raft_transport.go:352 +0x14d
github.com/cockroachdb/cockroach/pkg/storage.(*RaftTransport).RaftSnapshot.func1(0x7f6a4b6b9dc0, 0xc42026ff50)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/raft_transport.go:353 +0x45
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc420465cb0, 0x1ffde26, 0x19, 0x162, 0x0, 0x0, 0xc420212930, 0x7f6a4b6b9dc0, 0xc42026ff50)
/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:264 +0xdf
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:265 +0x234
I161115 09:45:58.131184 326 storage/replica_raftstorage.go:577 [n1,s1,r1/1:/{Min-System/tsd/c…}] [n1,s1,r1/1:/{Min-System/tsd/c…}]: with replicaID 1, applied Raft snapshot in 0.865s
W161115 09:45:58.131592 326 storage/replica.go:528 [n1,s1,r1/1:/{Min-System/tsd/c…}] raftMu: mutex held by github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRaftRequest for 865.770425ms (>500ms):
goroutine 326 [running]:
runtime/debug.Stack(0x1d28bf8, 0x1d28c38, 0x48)
/usr/local/go/src/runtime/debug/stack.go:24 +0x79
github.com/cockroachdb/cockroach/pkg/util/syncutil.ThresholdLogger.func1(0x339a9bb9)
/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/timedmutex.go:65 +0xe1
github.com/cockroachdb/cockroach/pkg/util/syncutil.(*TimedMutex).Unlock(0xc42026c778)
/go/src/github.com/cockroachdb/cockroach/pkg/util/syncutil/timedmutex.go:92 +0x80
github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRaftRequest(0xc420397500, 0x7f6a4b6b9dc0, 0xc4207343c0, 0xc420e5c058, 0xb0416caeb8226b9d, 0xccb6d6981cf7f5bd, 0x1, 0x26d30f8, 0x0, 0x0, ...)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2923 +0xc88
github.com/cockroachdb/cockroach/pkg/storage.(*Store).HandleSnapshot(0xc420397500, 0xc420e5c000, 0x233e340, 0xc420e961d0, 0x0, 0x0)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2567 +0x6ac
github.com/cockroachdb/cockroach/pkg/storage.(*RaftTransport).RaftSnapshot.func1.1(0x233e340, 0xc420e961d0, 0xc420079500, 0xc4207d7f38, 0xc4207d7f40)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/raft_transport.go:352 +0x14d
github.com/cockroachdb/cockroach/pkg/storage.(*RaftTransport).RaftSnapshot.func1(0x7f6a4b6b9dc0, 0xc420785c50)
/go/src/github.com/cockroachdb/cockroach/pkg/storage/raft_transport.go:353 +0x45
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc420465cb0, 0x1ffde26, 0x19, 0x162, 0x0, 0x0, 0xc420785e90, 0x7f6a4b6b9dc0, 0xc420785c50)
/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:264 +0xdf
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:265 +0x234
I161115 09:45:58.660700 346 storage/replica_raftstorage.go:574 [n1,s1,r132/1:{-}] [n1,s1,r132/1:{-}]: with replicaID 1, applying Raft snapshot at index 7329 (id=d2ed84e5, encoded size=16, 13 rocksdb batches, 2494 log entries)
I161115 09:45:58.848294 346 storage/replica_raftstorage.go:577 [n1,s1,r132/1:/System/tsd/cr.node.{sq…-tx…}] [n1,s1,r132/1:/System/tsd/cr.node.{sq…-tx…}]: with replicaID 1, applied Raft snapshot in 0.188s
I161115 09:45:59.559487 297 storage/replica_raftstorage.go:574 [n1,s1,r131/1:{-}] [n1,s1,r131/1:{-}]: with replicaID 1, applying Raft snapshot at index 16085 (id=5d7ff75c, encoded size=16, 18 rocksdb batches, 4282 log entries)
I161115 09:45:59.823922 297 storage/replica_raftstorage.go:577 [n1,s1,r131/1:/System/tsd/cr.{node.…-store…}] [n1,s1,r131/1:/System/tsd/cr.{node.…-store…}]: with replicaID 1, applied Raft snapshot in 0.264s
I161115 09:46:00.526882 333 http2_server.go:236 transport: http2Server.HandleStreams received bogus greeting from client: "GET / HTTP/1.1\r\nHost: 19"
I161115 09:46:01.633268 163 server/status/runtime.go:228 runtime stats: 611 MiB RSS, 130 goroutines, 103 MiB/237 MiB/362 MiB GO alloc/idle/total, 77 MiB/111 MiB CGO alloc/total, 0.00cgo/sec, 0.00/0.00 %(u/s)time, 0.00 %gc (25x)
I161115 09:46:01.635334 143 storage/store.go:3732 [n1,s1] sstables (read amplification = 4):
bag_index

CREATE TABLE "crebusiness.bag_index" (
sharding_code STRING(4) NULL,
ticket_all_no STRING(32) NOT NULL,
ticket_sign STRING(1) NULL,
ticket_no STRING(12) NULL,
form_code STRING(3) NULL,
ticket_type STRING(1) NULL,
trans_type STRING(2) NULL,
print_date STRING(8) NULL,
print_time STRING(6) NULL,
send_man_name STRING(64) NULL,
send_province_code STRING(2) NULL,
send_city_code STRING(4) NULL,
send_area_code STRING(6) NULL,
send_man_address STRING(128) NULL,
send_man_phone STRING(20) NULL,
send_man_fax STRING(20) NULL,
send_man_mobile STRING(20) NULL,
send_man_id_card STRING(18) NULL,
send_man_email STRING(64) NULL,
receive_man_name STRING(64) NULL,
receive_province_code STRING(2) NULL,
receive_city_code STRING(4) NULL,
receive_area_code STRING(6) NULL,
receive_man_address STRING(128) NULL,
receive_man_phone STRING(20) NULL,
receive_man_fax STRING(20) NULL,
receive_man_mobile STRING(20) NULL,
receive_man_id_card STRING(18) NULL,
receive_man_email STRING(64) NULL,
goods_name STRING(20) NULL,
num STRING(11) NULL,
weight STRING(11) NULL,
declare_price STRING(11) NULL,
passenger_num STRING(4) NULL,
train_code STRING(8) NULL,
train_date STRING(8) NULL,
train_ticket_no STRING(20) NULL,
ticket_station_code STRING(5) NULL,
create_datetime STRING(14) NULL,
update_datetime STRING(14) NULL,
cre_sheet_no STRING(32) NULL,
row_id INT NULL,
CONSTRAINT "primary" PRIMARY KEY (ticket_all_no),
UNIQUE INDEX idx_row_id (row_id),
INDEX idx_update_datetime (update_datetime),
INDEX idx_create_datetime (create_datetime),
INDEX idx_print_date (print_date, print_time),
FAMILY "primary" (sharding_code, ticket_all_no, ticket_sign, ticket_no, form_code, ticket_type, trans_type, print_date, print_time, send_man_name, send_province_code, send_city_code, send_area_code, receive_province_code),
FAMILY fam_1_send_man_address_send_man_phone_send_man_fax_send_man_mobile_receive_city_code (send_man_address, send_man_phone, send_man_fax, send_man_mobile, receive_city_code),
FAMILY fam_2_send_man_id_card_send_man_email_receive_man_name_receive_area_code_receive_man_phone_receive_man_id_card (send_man_id_card, send_man_email, receive_man_name, receive_area_code, receive_man_phone, receive_man_id_card),
FAMILY fam_3_receive_man_address_receive_man_fax_receive_man_mobile_goods_name_num (receive_man_address, receive_man_fax, receive_man_mobile, goods_name, num),
FAMILY fam_4_receive_man_email_weight_declare_price_passenger_num_train_code_train_date_train_ticket_no_ticket_station_code_create_datetime_row_id (receive_man_email, weight, declare_price, passenger_num, train_code, train_date, train_ticket_no, ticket_station_code, create_datetime, row_id),
FAMILY fam_5_update_datetime_cre_sheet_no (update_datetime, cre_sheet_no)
)
On 2016-11-14 it had this issue (the SELECT count for bag_index was not responding at that time, 17:34):
E161114 17:34:05.899731 214 storage/queue.go:568 [n1,raftlog] on [n1,s1,r115/1:/Table/59/1/"102120{1K0…-9K3…}]: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 4 desc = context deadline exceeded
E161114 17:37:40.921581 214 storage/queue.go:568 [n1,raftlog] on [n1,s1,r115/1:/Table/59/1/"102120{1K0…-9K3…}]: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 4 desc = context deadline exceeded
E161114 17:43:53.169373 629354 server/admin.go:278 namespace information_schema with ParentID 0 not found
E161114 17:43:53.170946 629669 server/admin.go:278 namespace pg_catalog with ParentID 0 not found
E161114 17:43:54.184960 621246 server/admin.go:278 namespace information_schema with ParentID 0 not found
E161114 17:43:54.192395 621247 server/admin.go:278 namespace pg_catalog with ParentID 0 not found
E161114 17:50:20.995330 214 storage/queue.go:568 [n1,raftlog] on [n1,s1,r115/1:/Table/59/1/"102120{1K0…-9K3…}]: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 4 desc = context deadline exceeded
E161114 17:55:11.022894 214 storage/queue.go:568 [n1,raftlog] on [n1,s1,r115/1:/Table/59/1/"102120{1K0…-9K3…}]: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 4 desc = context deadline exceeded
Here is 2016-11-14's graph (for version beta-20161103; as of this morning, SELECT count for bag_index still gave no response, and restarting that node made it okay):
[graph images attached in the original issue]
Table dump and all log files:
http://www.files.com/shared/582a7fe44d75b/2016-11-15_dump_and_logs.zip
Thanks for the detailed report, @chinglinwen. From the logs, it looks like you're running a 3 node cluster. Because the data is replicated it would be helpful to know what was happening to the other nodes (you only included the logs from one of them).
Another bit to look at is the /debug/requests page when you encounter the problem. This is an internal debugging tool that allows us to see what is going on with an active operation. You have to navigate to <host>:<adminUI-port>/debug/requests manually (there is no direct link from the admin UI).
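For example, one could capture that page from the shell (a hypothetical invocation, not from the original thread, assuming the HTTP port 7000 used in this report):
# Save the active-requests debug page served on the admin UI (HTTP) port.
curl -o debug_requests.html http://hadoop241:7000/debug/requests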
Okay, I will collect that tomorrow morning when I have access to the servers.
The logs and requests info are here: logs_and_debug_requests_2016-11-18.zip
@chinglinwen I don't see a query for SELECT count(*) FROM bag_index in the data you sent. I do see a query for SELECT count(ticket_all_no) FROM bag_index that took 37 seconds to execute. It looks like your table is fairly large. Perhaps not in number of rows, but in the number of rows times the size of each row. Is that correct? CockroachDB doesn't currently handle large aggregations very efficiently. The current implementation essentially performs a sequential table scan and you can see that in node1/_debug_requests_10s.html log. This is being addressed very soon by the distributed SQL work.
If there was a query for SELECT count(*) FROM bag_index that never finished, it would show up in the "active" section of the /debug/requests page.
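As an illustration of that sequential scan (a sketch using the sql.sh wrapper from this report; this command is not from the original thread, and the exact EXPLAIN output varies by version):
# Show the plan for the aggregation; it reduces to a full scan of the
# primary index, which is why a large table takes so long to count.
echo "EXPLAIN SELECT count(*) FROM bag_index;" | ./sql.sh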
I'll try to verify it again later (next Monday).
Also, I may have mixed up two clusters (on one of them a DELETE FROM a table got no response for a long time, so this time even 37 seconds made it feel unresponsive).
Any operation that does a full table scan, such as the count(*) query or deleting or dropping a table, can currently take a long time. I likely would have considered a query that takes 37 seconds to be unresponsive as well.
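In the meantime, a possible workaround (my suggestion, not from the thread) is to bound the scan with an indexed range so only a slice of the table is read, e.g. via the idx_create_datetime index defined above:
# Count one day's rows; the range predicate on create_datetime lets the
# scan be constrained to a slice of idx_create_datetime instead of the
# whole table (assuming the planner chooses that index).
echo "SELECT count(*) FROM bag_index WHERE create_datetime >= '20161101' AND create_datetime < '20161102';" | ./sql.sh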
@petermattis Got it, thanks.
@petermattis Considering the table size shown on the UI page is 5.5 GB, I think it is large.
I calculated the table size myself; it's about 666 MB (460 bytes per row × 1,519,113 rows / 1024 / 1024 ≈ 666 MB), which is very different from the UI page's table size (see the sketch after the transcript below):
[wen@hadoop241 clusterthree]$ echo "SELECT * FROM bag_index LIMIT 20;" | ./sql.sh > out
[wen@hadoop241 clusterthree]$ while read l; do echo "$l" | wc; done < out
1 2 8
1 42 607
1 33 460
1 33 460
1 33 460
1 33 460
1 33 460
1 33 460
1 33 460
1 33 460
1 33 460
1 33 460
1 33 460
1 33 460
1 33 460
1 33 460
1 33 460
1 33 460
1 33 460
1 33 460
1 33 460
1 33 460
[wen@hadoop241 clusterthree]$ echo "SELECT count(*) FROM bag_index;" | ./sql.sh
1 row
count(*)
1519113
[wen@hadoop241 clusterthree]$
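A back-of-the-envelope reconciliation of the two figures (my own hypothesis, not confirmed in the thread): the admin UI number likely includes 3x replication plus the four secondary indexes and MVCC overhead, while 666 MB is a single logical copy of the primary data:
# ~666 MiB for one copy of the primary data; ~2 GiB once replicated 3x,
# before counting the secondary indexes and MVCC versions.
echo $(( 460 * 1519113 / 1024 / 1024 )) MiB      # single copy of primary data
echo $(( 3 * 460 * 1519113 / 1024 / 1024 )) MiB  # with 3x replication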

I'm now seeing SELECT COUNT(*) over 1.5M rows return results (with DistSQL) in 7 seconds on "cold" data and 2 seconds on "hot" data (i.e., successive runs of the same query). This is on a 6-node cluster with the TPC-H dataset (scale factor 1), which takes ~1.7 GB on disk including denormalizations.
Performance varies with schema: the lineitem table has 6 million rows but takes about the same time cold (~8s) as the orders table, which has 1.5 million rows.
Closing this issue, as performance has improved to the point where it is unlikely to be judged as returning no response, although it's not blazingly fast either. If the non-response was genuinely a hang due to a bug (as opposed to very slow execution), there have been enough changes since that we should re-open if it recurs. We are also continuously load-testing TPC-H queries, which are much more complex than the COUNT(*), so we should catch any bugs quickly.