Linguist: The state of SQL detection

Created on 11 Jun 2020 · 11Comments · Source: github/linguist

Problem

Linguist's detection of languages in the SQL family is currently quite inconsistent. The classifier seems to be quite biased towards PL/pgSQL, classifying a lot of SQL variants as PL/pgSQL (examples: mysql, oracle). This has been the subject of a few issues before: SQLite vs PLpgSQL, MySQL vs PLpgSQL, TSQL vs PLSQL or TSQL vs PLpgSQL, PLpgSQL vs SQLPL.

Furthermore, current heuristics for the .sql extension seem to imply that PostgreSQL' dialect of SQL is PL/pgSQL, IBM DB2's dialect is SQL PL and Oracle's dialect is PL/SQL. But neither of these are true. All of them are extensions to SQL included that are just a subset of their SQL dialect. In practice, most PostgreSQL code out there does not use PL/pgSQL at all.

It is a bit tricky to disambiguate properly between all of them, but I think we should change towards falling back to SQL more often.

Possible Solutions

I'm not sure what's the best course of action, but I think it could include some of the following options. I will go ahead and create PRs for some of them, but I'd like to hear your opinions on the bigger picture.

Heuristics fallback to SQL, no classifier

We currently have a very broad expression to trigger the classifier, as the final rule for .sql:

https://github.com/github/linguist/blob/2a6d952df870807034c3830738f95ad33222aca0/lib/linguist/heuristics.yml#L459-L461

Any of these strings may appear in virtually every SQL dialect. In particular BEGIN, which is standard SQL. Given that it is unlikely that the classifier does well here, it may be better to just fall back to SQL:

  # Fallback to SQL
  - language: SQL

After all, seeing some PL/pgSQL code classified as SQL should not be that surprising. However, seeing it as PL/SQL, SQL PL or T-SQL is much more misleading.

Changes:

[x] .sql falls back to SQL #4888

Remove all patterns that are not exclusive to one SQL dialect

Reinforcing the previous point, we may remove all patterns that are not exclusive to one language. For example, the following statement, which is valid MySQL, is currently classified as T-SQL by the heuristics:

SELECT IF TRUE 1 ELSE 2 END IF;

And the following is classified as PL/SQL:

SELECT SYSDATE();

Changes in this direction:

[x] Relax T-SQL #4885 #4913
[x] Relax PL/SQL #4886
[x] Relax SQL PL #4887

Remove all patterns that are not exclusive to one PL

Stop considering any PostgreSQL as PL/pgSQL, Oracle as PL/SQL, etc. For example, I think that Oracle's XMLType is not necessarily part of PL/SQL, just part of the broader Oracle SQL.

Create a language group for SQL

Ok, maybe this is too extreme, but given the amount of misclassifications, it could even make sense to create a language group for SQL?

Create a specialized strategy

Create a specialized strategy for SQL dialect detection, with more advanced parsing than just regexps. Probably tons of work.

Source

smola

👍3

Most helpful comment

This all seems reasonable to me. SQL does seem like a tough nut to crack and while i do agree that SQL should be used as a fall back more often (better to classify a square as a rectangle than misidentify a rectangle as a square), it will be unpleasant to not have the syntax highlighting you expect when viewing the file in the webapp. That last part may not be solvable currently and it is certainly outside the scope here, just wanted it said. =D

beau-witter on 14 Jun 2020

👍4

All 11 comments

Given that it is unlikely that the classifier does well here, it may be better to just fall back to SQL:

Looks good to me :+1:

Ok, maybe this is too extreme, but given the amount of misclassifications, it could even make sense to create a language group for SQL?

Yeah, that seems reasonable to me if we can't find an efficient way to distinguish between all SQL variants.

Let's involve peeps who contribute all the SQL variants in the discussion. /cc @Shoelace contribute most of the variants, @beau-witter contributed the T-SQL variant.

pchaigno on 14 Jun 2020

beau-witter on 14 Jun 2020

👍4

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

stale[bot] on 16 Aug 2020

This issue is still relevant because SQL code still gets mis-detected.

iomintz on 16 Aug 2020

👍3

@iomintz Could you post some examples of mis-detected SQL?

smola on 19 Aug 2020

The examples you gave in your OP still hold.

iomintz on 19 Aug 2020

@iomekam There's a pending PR (#4888) that should fix many issues. If you have any specific examples with URLs to files you are interested in, I can look into them.

smola on 30 Aug 2020

4888 has been merged. Can this be closed?

pchaigno on 6 Sep 2020

@pchaigno Yes, I think so. There is still the open question about creating a language group for SQL. Maybe we could reconsider this depending on follow up issues in the months after v7.11.0 deployment at GitHub?

smola on 6 Sep 2020

👍2

Maybe we could reconsider this depending on follow up issues in the months after v7.11.0 deployment at GitHub?

:+1:

pchaigno on 6 Sep 2020

sorry i'm a bit late.. but i am happy with all of the merged changes.
i agree that error on the side of generic SQL is preferable to choosing a wrong dialect.