Linguist's detection of languages in the SQL family is currently quite inconsistent. The classifier seems to be quite biased towards PL/pgSQL, classifying a lot of SQL variants as PL/pgSQL (examples: mysql, oracle). This has been the subject of a few issues before: SQLite vs PLpgSQL, MySQL vs PLpgSQL, TSQL vs PLSQL or TSQL vs PLpgSQL, PLpgSQL vs SQLPL.
Furthermore, current heuristics for the .sql extension seem to imply that PostgreSQL' dialect of SQL is PL/pgSQL, IBM DB2's dialect is SQL PL and Oracle's dialect is PL/SQL. But neither of these are true. All of them are extensions to SQL included that are just a subset of their SQL dialect. In practice, most PostgreSQL code out there does not use PL/pgSQL at all.
It is a bit tricky to disambiguate properly between all of them, but I think we should change towards falling back to SQL more often.
I'm not sure what's the best course of action, but I think it could include some of the following options. I will go ahead and create PRs for some of them, but I'd like to hear your opinions on the bigger picture.
We currently have a very broad expression to trigger the classifier, as the final rule for .sql:
https://github.com/github/linguist/blob/2a6d952df870807034c3830738f95ad33222aca0/lib/linguist/heuristics.yml#L459-L461
Any of these strings may appear in virtually every SQL dialect. In particular BEGIN, which is standard SQL. Given that it is unlikely that the classifier does well here, it may be better to just fall back to SQL:
# Fallback to SQL
- language: SQL
After all, seeing some PL/pgSQL code classified as SQL should not be that surprising. However, seeing it as PL/SQL, SQL PL or T-SQL is much more misleading.
Changes:
.sql falls back to SQL #4888Reinforcing the previous point, we may remove all patterns that are not exclusive to one language. For example, the following statement, which is valid MySQL, is currently classified as T-SQL by the heuristics:
SELECT IF TRUE 1 ELSE 2 END IF;
And the following is classified as PL/SQL:
SELECT SYSDATE();
Changes in this direction:
Stop considering any PostgreSQL as PL/pgSQL, Oracle as PL/SQL, etc. For example, I think that Oracle's XMLType is not necessarily part of PL/SQL, just part of the broader Oracle SQL.
Ok, maybe this is too extreme, but given the amount of misclassifications, it could even make sense to create a language group for SQL?
Create a specialized strategy for SQL dialect detection, with more advanced parsing than just regexps. Probably tons of work.
Given that it is unlikely that the classifier does well here, it may be better to just fall back to SQL:
Looks good to me :+1:
Ok, maybe this is too extreme, but given the amount of misclassifications, it could even make sense to create a language group for SQL?
Yeah, that seems reasonable to me if we can't find an efficient way to distinguish between all SQL variants.
Let's involve peeps who contribute all the SQL variants in the discussion. /cc @Shoelace contribute most of the variants, @beau-witter contributed the T-SQL variant.
This all seems reasonable to me. SQL does seem like a tough nut to crack and while i do agree that SQL should be used as a fall back more often (better to classify a square as a rectangle than misidentify a rectangle as a square), it will be unpleasant to not have the syntax highlighting you expect when viewing the file in the webapp. That last part may not be solvable currently and it is certainly outside the scope here, just wanted it said. =D
This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.
This issue is still relevant because SQL code still gets mis-detected.
@iomintz Could you post some examples of mis-detected SQL?
The examples you gave in your OP still hold.
@iomekam There's a pending PR (#4888) that should fix many issues. If you have any specific examples with URLs to files you are interested in, I can look into them.
@pchaigno Yes, I think so. There is still the open question about creating a language group for SQL. Maybe we could reconsider this depending on follow up issues in the months after v7.11.0 deployment at GitHub?
Maybe we could reconsider this depending on follow up issues in the months after v7.11.0 deployment at GitHub?
:+1:
sorry i'm a bit late.. but i am happy with all of the merged changes.
i agree that error on the side of generic SQL is preferable to choosing a wrong dialect.
Most helpful comment
This all seems reasonable to me. SQL does seem like a tough nut to crack and while i do agree that SQL should be used as a fall back more often (better to classify a square as a rectangle than misidentify a rectangle as a square), it will be unpleasant to not have the syntax highlighting you expect when viewing the file in the webapp. That last part may not be solvable currently and it is certainly outside the scope here, just wanted it said. =D