Hi,
Recently we migrated our https://thulya.com production database from MSSQL to TIDb to experiment. TiDb is definitely the best amongst several SQL and NoSQL systems we have tried/used. The installation using Ansible is so easy and seamless. We have setup the cluster with total of 8 nodes and 4 in each DC (multi DCs) and 2 replicas in each DC. Its working brilliantly for the write queries and with less than 100ms for simple select queries. However for some 'optimised' join queries its taking between 800ms to 8secs (obviously not because of cross DC but within the DC itself). We are thinking to move the Join queries to Elasticsearch or CrateDb to improve the performance.
I have come across Vitess recently. It seems being used by Google and looked at some presentations which talked that even complex joins perform efficiently (with 2ms average and in most cases less than 100ms for complex joins?)
I am wondering the architectural differences between TiDb and Vitess. Have you done any performance comparison between TiDb and Vitess?
Thanks and regards,
Diwakar
Because of the range-sharding, many joins/inserts/updates/deletes become cross-node, which is slow/inneficient. I'm assuming all the fast stuff that Vitess does are when querying/joining data that resides on the same node (so partitioned by same key). This works the same with CitusDB for example.
Assuming TiDB will introduce hash partitioning, it should be able to do the same. Currently, range-sharding offers better scalability (all rows from client_id can grow infinitely compared to a single node when hashing by client_id).
@diwakarpp Thanks for your feedback!
We do not have any performance comparison with Vitess. You can find some benchmark result here including sysbench/tpc-h.
As you said, some join queries are slow. Would you like to show us your schemas and queries? Maybe we could give you some advice.
Hi, @diwakarpp I'm the architect of TiDB, and I think I'm quite familiar with Vitess too.
from the architecture we can see, Vitess is a MySQL sharding solution, like a proxy on top of MySQL, because of that, it's really hard for vitess to handle cross-shard join or subquery (that means you need to build a full-featured SQL optimizer to do that.), it's OK for some complex queries that happen in a single shard and they perform efficiently. Here's a unsupported cases list of Vitess, https://github.com/vitessio/vitess/blob/master/data/test/vtgate/unsupported_cases.txt
Hi @ddorian , @shenli and @c4pt0r thanks a lot for the detailed inputs. I think Vitess has many pitfalls in their approach and architecture. We have done good decision by migrating our db to TiDb rather than Vitess. TiDb seems to be quite stable and resilient and we will fine tune it to handle the complex joins (e.g: conditional match and inner join between 8 tables or 10 tables in some instances. May we will split the queries into more than one to handle the complexity).
I am not sure if any of the below have been configured by us incorrectly. Could you pl share your thoughts to confirm?
(1) One of our DC is London based and second DC is Kansas US based. We are thinking to move US based DC to France so the DCs will remain close to each other to avoid network latency during replication. Currently each DC has two replicas. Would the join queries are impacted because of the distance between DCs? Although technically the copies of data is fully self contained within the DC, would the join queries span to multiple DCs during execution which could cause delay?
(2) In our cluster config, we have ignored the Rack field for the pd servers installation and used just the Zone and Host fields. Is this fine in general? Would it impact the performance in any way?
[pd_servers:vars]
location_labels = ["zone","host"]
(3) We will share some complex join queries shortly just to see if we made any errors or to reduce the complexity or add indexes as per your suggestion.
(4) Currently we did not enable big logs.
4.1 If we enable bin logs, will it impact the performance of the cluster or queries?
4.2. Is there any query that we can run from the mysql client to fetch the binlogs from the cluster or should we read it from each node?
4.3. Also, how long the bin logs will be present for our data sync with Elasticsearch ?
4.4. During our data sync, how to maintain a pointer for the protobuf data or clearing the bin log data to avoid redundant syncing?
Hi, I would appreciate your reply for the above aspects. Do you want me to create separate thread for the above aspects? Thanks and regards.
I know you guys might be very busy or not available currently. Its a concern that there are no replies even after 2 business days since my above questions. This is a concern for us to use TiDb in production. Whereas Vitess team are responding to our queries quicker and in most cases instantly.
Sorry for the late response, I was on a business trip yesterday. I'm preparing the answers now, will post soon.
@diwakarpp
(1) One of our DC is London based and second DC is Kansas US based. We are thinking to move US based DC to France so the DCs will remain close to each other to avoid network latency during replication. Currently each DC has two replicas. Would the join queries are impacted because of the distance between DCs? Although technically the copies of data is fully self contained within the DC, would the join queries span to multiple DCs during execution which could cause delay?
TiDB splits data in multiple regions(96MB), every region has 3 replicas by default, PD will place the replcas in different machines/racks/DCs, but only leader replica will handle the traffic, we can specify all the region leaders in one datacenter, so JOINs will not fetch data from different DCs.
(2) In our cluster config, we have ignored the Rack field for the pd servers installation and used just the Zone and Host fields. Is this fine in general? Would it impact the performance in any way?
It matters, but it only affects the scheduling strategy and HA, if you have multiple DCs (at least 3), PD will place replicas across different DCs, but if you don't have enough DC, it means there's a DC will contain more than 1 replicas, so if you ignore the Rack field, there's a chance that PD will place 2 replicas in the same Rack, if all the physical machines in this Rack are out of service (for example, power off), that means some data will lost the majority of replicas.
(3) We will share some complex join queries shortly just to see if we made any errors or to reduce the complexity or add indexes as per your suggestion.
You can send the queries to me via email:[email protected], we can review it for you.
(4) Currently we did not enable big logs.
4.1 If we enable bin logs, will it impact the performance of the cluster or queries?
No, it will not impact the perforamance, it's asynchornos process.
4.2. Is there any query that we can run from the mysql client to fetch the binlogs from the cluster or should we read it from each node?
We have a offical tool called, tidb-binlog, it's a seperate component, it will automaticlly fetch binlog from each tidb node, and sort them with transaction order. But it's not an open-source yet, it's in our Enterprise Edition, but if you guys have insterest in tidb-binlog, we're happy to offer.
4.3. Also, how long the bin logs will be present for our data sync with Elasticsearch ?
Most cases, in seconds.
4.4. During our data sync, how to maintain a pointer for the protobuf data or clearing the bin log data to avoid redundant syncing?
Inside tidb-binlog, we have a kafka for caching the binlogs. every binlog has a transaction id, and the transaction id is monotone increasing (it's guarantee by our transaction model), as I mentioned, tidb-binlog will sort them in transaction id, and maintain the pointer.
Hi @c4pt0r , thank you very much for your detailed reply. Its helpful and much appreciated.
Could you pl confirm the below so we will proceed to configure and test them?
(1) How to configure Ansible or TiDb yaml to specify all the region leaders within the same datacenter (for the joins to work within the same DC)? Could you pl give any write up/reference for this one.
We have two DCs. In our Ansible inventory.ini config, we have the following:
[pd_servers:vars]
location_labels = ["zone","host"]
max-replicas = 2
[pd_servers]
Here we have 4 server IPs specified as PD servers.
Can you pl give an example inventory.ini entries to specify the region leaders for each datacenter(DC) separately?
(4.2) I am interested to use tidb-binlog for syncing the data from TiDb to Elasticsearch/Crate. Could you pl send me the access/download link to try it out. Please enable the license for us to use it for 1 to 2 years. Once we start generating revenue, we will be happy to go for the paid/enterprise license too. By the way, please message me privately with the enterprise license cost for the tidb-binlog.
@diwakarpp
For question 1:
You can use pd-ctl to add scheduling strategy to make all the leaders in the same IDC. Please refer to the document. In this case, I think you can use scheduler add evict-leader-scheduler storeID to evict leaders from the other IDC.
BTW: I recommend you to set max-replicas to 3 instead of 2.
For question 2:
Please refer to the tidb-binlog manual. Although tidb-binlog is not open-source, you can use it free.
@diwakarpp Happy to hear that about your interest in enterprise edition! Please feel free to contact me via email, [email protected] :)
thank you @c4pt0r and @shenli . I do not know how to use pd-ctl, I will refer your docs and apply the scheduling strategy to restrict the queries and join queries to within the DC. At the time of asking the original question we were in a rush for the production deadline. Now we have rescheduled it for August end to do the stability checks. We will apply the pd-ctl and let you know if any questions or help needed. @c4pt0r I will ping you privately on the enterprise edition :+1: Cheers.
@diwakarpp cool, looking forward to it, feel free to ping us if you have any question
This issue is resolved now. I am closing this post. Thanks.
@c4pt0r You guys have any roadmap with hash-sharding and putting data with same shard_id across different tables in the server ? This way you get the benefits of both range & hash sharding (unlimited scale on range sharding, better joins/transactions on hash sharding when using same hash_value).
As @ddorian mentioned, combining range sharding and hash sharding both would be a smart combination to speed up the join queries. I'm not sure about the cons though. @c4pt0r @shenli any thoughts/plans for this?
@diwakarpp @ddorian Thanks for your advice.
We don't have a plan for it. In TiDB, data is managed in the unit of region. There is no guarantee that data with the same hash value will be settled in the same region. I think the hierarchical schema used in Google F1 is a better way.
@shenli the problem with hierarchical is that:
parent_id on primary-key (while with hash-partition, you can hash by parent_id, and primary-key=(id))@ddorian With Hierarchical schema we could choose streaming join operator. It is faster and more memory efficient.
thank you @shenli Just for discussion on the join query performance, looks like cratedb guys used Block Hash Joins and they explained it in a recent article here. I guess it could be a overkill for TiDB performance fine tuning as TiDB and CrateDB both are completely different systems in the core engine level but just thought to share it for your team's reference and analysis if required. Cheers.
@shenli The final article Part 3 about performance improvement using Hash joins seems to have been posted by Crate guys. Interestingly (1) The locality approach (2) The modulo distribution approach both seem to have been discussed. This is just for a reference! Thanks and regards.
Most helpful comment
Hi, @diwakarpp I'm the architect of TiDB, and I think I'm quite familiar with Vitess too.
from the architecture we can see, Vitess is a MySQL sharding solution, like a proxy on top of MySQL, because of that, it's really hard for vitess to handle cross-shard join or subquery (that means you need to build a full-featured SQL optimizer to do that.), it's OK for some complex queries that happen in a single shard and they perform efficiently. Here's a unsupported cases list of Vitess, https://github.com/vitessio/vitess/blob/master/data/test/vtgate/unsupported_cases.txt