When uploading data to Wikidata, OpenRefine checks for common issues in the uploaded data, and reports these to the user before the upload. Many of these checks rely on Wikidata's own constraint system, which lets Wikidata contributors specify how each Wikidata property should be used (for instance by providing a regular expression for its format).
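To make the format example above concrete, here is a minimal sketch of what a regular-expression format check could look like. It is an illustration only: the class name `FormatConstraintSketch`, the method `findViolations` and the sample pattern are assumptions made for this example, not OpenRefine's actual code or a real Wikidata property's pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical, simplified format check; not OpenRefine's actual implementation. */
final class FormatConstraintSketch {

    /** Returns the values that do not fully match the given regular expression. */
    static List<String> findViolations(String regex, List<String> values) {
        Pattern pattern = Pattern.compile(regex);
        List<String> violations = new ArrayList<>();
        for (String value : values) {
            // matches() requires the whole value to match, not just a substring.
            if (!pattern.matcher(value).matches()) {
                violations.add(value);
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        // Made-up pattern for illustration: four digits, a dash, four digits.
        String regex = "\\d{4}-\\d{4}";
        List<String> values = Arrays.asList("1234-5678", "12345678", "x1234-5678");
        System.out.println("Violations: " + findViolations(regex, values));
    }
}
```

A check like this runs entirely on the local machine once the pattern is known, which is why it can be evaluated before anything is uploaded.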

The Wikidata extension in OpenRefine only supports some of the constraints that Wikidata uses. This means that some problems in data imports can go undetected and get flagged up as constraint violations later on in Wikidata itself.
We could implement more constraint checks. This could include constraints defined in Wikidata but also other generic checks such as those implemented in #2103.
Some constraints are expensive to check as they require communicating with Wikidata itself. Since constraint checks are run in real time (to provide quick feedback to the user), we should be careful not to add any expensive operations in new constraint checks.
The architecture of constraint checks in OpenRefine can evolve, for instance to accommodate more expensive checks transparently, to report better warnings to the user, or to better handle multiple constraint declarations of the same type on the same property. The current design is not set in stone.
There is also an interest in developing a generic data validation system, not specific to Wikidata, where all sorts of issues could be reported (think validation against any tabular schema, for instance as defined by the Data Package or CSVW specs).
_This is a proposed Outreachy project in 2020. If you are not planning to apply for an internship via Outreachy, we kindly ask that you do not work on this task yet, in order to leave the floor to potential interns._
Also, regarding validating against a Type or Statement and "checks", using Schemas and Shape Expressions (ShEx) for validation/rules is something that many folks inside and outside the Wikidata ecosystem have been talking about for a long time and been experimenting with, such as myself and @VladimirAlexiev
Lots of discussion on approaches here:
https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Schemas
Example of a tool (other APIs, such as pyshexy, are mentioned in the discussion linked above):
https://tools.wmflabs.org/shex-simple/wikidata/packages/shex-webapp/doc/shex-simple.html?data=Endpoint:%20https://query.wikidata.org/sparql&hideData&manifest=[]&textMapIsSparqlQuery&schemaURL=%2F%2Fwww.wikidata.org%2Fwiki%2FSpecial%3AEntitySchemaText%2FE10&shape-map=SELECT%20?item%20WHERE%20{%20?item%20wdt:P31%20wd:Q5%20}%20LIMIT%205
There are of course tasks and issues in Wikidata Phabricator that would help (linking Schemas in statements) and some in the larger Epic (Adding new datatypes to Wikidata)
Have we got an issue for more general validation in OR?
General validation? Not really, but we could use #1727 (Design how type validation and other constraint validations should be handled)
...which was begun to try to push along research on validation (we can cut out the Data Package mention within it). The rest of the many designs are in my head and in others', but not formally written down for us. A lot of this kind of work I've done outside of OpenRefine as part of my job as a Data Architect, where the data modeling with 3GPP systems for telecom was very strict and standards-based (otherwise your cell phone wouldn't even work). But here, with Wikidata and OpenRefine working with general Statements, Types, and Annotations, there is a need for conforming and validating Schema on both sides. ShEx is just one plausible solution, with lots of experimenting needed, and many folks worry about performance during validation runs, depending on what is asked of the Wikidata backend and indexes.
We already have, inside and outside the Wikidata community, some very well defined use cases for various validation and constraint needs. The questions are "where and how" we define them, "who" is going to be impacted in terms of performance (user, service, or both), and then how to minimize that impact.
I think it's worth having a general community meetup on this topic of validation at some Wikidata conf. or virtually online to discuss "what is possible". This is a wide topic with system-level impacts for data producers like Wikidata, Google, GLAM, OCLC and Schema producers like Uni-Leipzig, etc. @danbri might even want to be slightly involved in discussion here, dunno.
> Have we got an issue for more general validation in OR?
I don't think we have got one yet - it might make sense to move some of the discussion there to keep this focused on Wikidata constraints.
Is it possible to use WD's constraint system itself rather than reimplementing constraints?
SHEX is not yet used on a large scale in WD... Also, some specialized constraints (eg Contemporary With) are not implementable in SHEX.
> Is it possible to use WD's constraint system itself rather than reimplementing constraints?
That is something that @lucaswerkmeister and I have talked about when the Wikidata extension was first released. Currently, WD's constraint system can only check violations for statements that are already saved in WD: https://phabricator.wikimedia.org/T194194.
Even if that limitation was lifted, we would need to think twice before migrating to it. Having a local implementation of the constraints makes it possible to check for issues in real time, even for relatively big edit batches, which would probably not be doable if we have to issue one HTTP request for each statement in the batch (see more discussion on the phabricator task above).
Also, some of the issues we report do not correspond to any WD constraint (see for instance #2103 which added a bunch of new ones).
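The performance argument above (local checks versus one HTTP request per statement) can be illustrated with a small sketch. Everything here is an assumption made for the example: the class `CachedConstraintSketch`, the stand-in `fetchRegex` function (which takes the place of a remote lookup), and the property ID `P_DEMO`. It only shows the general idea of fetching a constraint definition once per property and then checking every statement locally; it is not how OpenRefine's extension is actually structured.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Pattern;

/** Hypothetical sketch: fetch each property's format pattern once, then check locally. */
final class CachedConstraintSketch {
    private final Map<String, Pattern> cache = new HashMap<>();
    // Stand-in for a remote lookup of a property's format pattern (assumption).
    private final Function<String, String> fetchRegex;
    private int fetchCount = 0;

    CachedConstraintSketch(Function<String, String> fetchRegex) {
        this.fetchRegex = fetchRegex;
    }

    /** Checks one value locally; the (simulated) remote fetch happens at most once per property. */
    boolean isValid(String propertyId, String value) {
        Pattern pattern = cache.computeIfAbsent(propertyId, pid -> {
            fetchCount++; // counts simulated remote calls
            return Pattern.compile(fetchRegex.apply(pid));
        });
        return pattern.matcher(value).matches();
    }

    int getFetchCount() {
        return fetchCount;
    }

    public static void main(String[] args) {
        // The lambda simulates the remote service; the pattern is made up.
        CachedConstraintSketch checker = new CachedConstraintSketch(pid -> "[A-Z]{2}\\d+");
        String[] batch = {"AB12", "ab12", "CD3"};
        for (String value : batch) {
            System.out.println(value + " valid: " + checker.isValid("P_DEMO", value));
        }
        System.out.println("Simulated remote fetches: " + checker.getFetchCount());
    }
}
```

With this shape, the number of remote calls grows with the number of distinct properties in a batch rather than with the number of statements.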
> SHEX is not yet used on a large scale in WD... Also, some specialized constraints (eg Contemporary With) are not implementable in SHEX.
Yes, there has been some interest in "implementing SHEX in OpenRefine" but I am yet to see a clear use case for it. It is surely not a drop-in replacement for the WD constraint system at the moment, at least.
@wetneb I want to start my contribution for Outreachy, but it is a bit intimidating to start off with. Also, I wasn't able to comprehend how to proceed with this task.
Hi @TejaswiKarasani,
The idea behind the contribution phase is not that you start working on the Outreachy project directly - the intention is more that you get familiar with the environment, by making much smaller contributions which will help you get up to speed.
We have a collection of good first issues: those are reasonably small tasks that you can tackle. This will give you the opportunity to set up your development environment and clarify any issues about the workflow to contribute to OpenRefine.
To understand better what this task is about, I encourage you to try out OpenRefine and its Wikidata integration by yourself. Try following tutorials such as this one:
https://www.wikidata.org/wiki/Wikidata:Tools/OpenRefine/Editing/Tutorials/Basic_editing
Once you have done both of these things, you should be in a better position to apply for this task. Let me know if you have any specific questions in the process :)
Ok @wetneb :)
@wetneb
I first forked it (https://github.com/OpenRefine/OpenRefine) into my GitHub and then cloned it (https://github.com/TejaswiKarasani/OpenRefine) onto my desktop.

When I use refine.bat build, I can just see the following but can't install anything

I am facing the following errors while setting up the project in Eclipse, even though I did install the Maven dependencies too:
Missing artifact com.codeberry.jdatapath:jdatapath:jar:alpha2
Missing artifact com.colloquial:arithcode:jar:1.1
Missing artifact com.wcohen:secondstring:jar:20100303
Missing artifact edu.mit.simile:butterfly:jar:1.0.2
Missing artifact edu.mit.simile:vicino:jar:1.1
Missing artifact marc4j:marc4j:jar:2.4
Missing artifact net.sf.opencsv:opencsv:jar:2.4-SNAPSHOT
The container 'Maven Dependencies' references non existing library 'C:\Users\TejaswiKarasani.m2\repository\edu\mit\simile\butterfly\1.0.2\butterfly-1.0.2.jar'
The container 'Maven Dependencies' references non existing library 'C:\Users\TejaswiKarasani.m2\repository\marc4j\marc4j\2.4\marc4j-2.4.jar'
The project cannot be built until build path errors are resolved
Hi everyone! I'm Hammad, a final year Software Engineering student from Pakistan and an Outreachy 2020 aspirant. I was looking at OpenRefine's project: "Implement more constraint checks in OpenRefine's Wikidata extension" and it seemed very interesting to me. Having an affinity for Java and some experience with Wikidata, I feel like I can do well. I'd already introduced myself on Gitter but thought I'd do it here too. Looking forward to working with you all!
@TejaswiKarasani This seems to be due to the fact that refine.bat cannot find Maven on your computer. @thadguidry might be able to help for this (perhaps somewhere else though, as this is unrelated to this issue). I am working on lifting this limitation: #2365. The problem should be fixed in a few days (hopefully). Thanks for your patience!
@madham32 welcome to the project!
Thanks for your response @wetneb . I will contact @thadguidry through email :)
@TejaswiKarasani 1st things...
On Windows, as a developer... NEVER CLONE ONTO YOUR DESKTOP. It is bad practice and will cause you a lot of grief. Install things under your C:\Users\TejaswiKarasani folder, or just under a new folder under C:\ or another drive you have.
It is also always wise to set up a nice github_projects folder under your C:\Users\TejaswiKarasani folder and then clone OpenRefine into that folder (and any other projects you might clone). The GitHub Desktop Windows client is now very nice to use if you are not experienced with Git within Eclipse or other IDEs or the command line version.
The Maven cache path that you have C:\Users\TejaswiKarasani.m2\repository\... is not ideal. This is causing Maven to have problems because of default Maven pathing on Windows installations.
Instead it should look like this: C:\Users\TejaswiKarasani\.m2\repository\...
Maven cache folder on Windows should default to a %HOMEPATH%\.m2 folder and not what you have currently.

So the above tells me your Maven installation on Windows is not proper.
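One way to see which folder Maven would use by default is to print what Java reports as the user's home directory. Maven's default local repository is `${user.home}/.m2/repository`, where `user.home` is the Java system property (a custom `settings.xml` can override this). The sketch below is only a diagnostic aid; the class name `MavenRepoPathSketch` and its helper method are made up for this example.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical diagnostic: prints where Maven's default local repository is expected. */
final class MavenRepoPathSketch {

    /** Builds the default local repository path for the given home directory. */
    static Path defaultLocalRepository(String userHome) {
        return Paths.get(userHome, ".m2", "repository");
    }

    public static void main(String[] args) {
        // user.home is the Java system property Maven uses for its default location.
        String userHome = System.getProperty("user.home");
        Path repository = defaultLocalRepository(userHome);
        System.out.println("user.home = " + userHome);
        System.out.println("Expected local repository = " + repository);
        System.out.println("Directory exists = " + Files.isDirectory(repository));
    }
}
```

If the printed home directory looks wrong, that points at the environment rather than at OpenRefine itself.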
Ensure your %HOMEPATH% variable is not malformed or overwritten somehow. Ideally it would just be your C: drive then Users then username, similar to mine:

You might need to clean up and delete that folder TejaswiKarasani.m2 and ensure you do not delete your C:\Users\TejaswiKarasani folder (Windows 10 will give you a warning).
Once Maven is cleaned up (you no longer have a .m2 or .TejaswiKarasani.m2 folder under your C:\Users\ folder), you can attempt to install Maven again (NOT TO YOUR DESKTOP), under Program Files or a new folder you make like C:\maven. Ensure you have set your %MAVEN_HOME% variable properly (you probably didn't read the Maven installation instructions well?). Follow our guide inside the refine.bat echo information that you can read here: https://github.com/OpenRefine/OpenRefine/blob/master/refine.bat#L226

Maven's installation should also set its bin folder automatically in your System Path as well.
But you can override if you have a custom installation path:

Don't you love the simplicity that Windows brings to your life? :-)
@thadguidry thanks a ton it did run :)

I've set up OpenRefine and was looking over some good-first-issue issues, but I'm having a hard time understanding the codebase and where to get started. I was wondering if there are any pointers to where I can start working, or some particularly easy-to-understand issues or code I can start with, because right now I'm feeling kinda lost. Thanks!
@madham32 Try going through the wiki once, there is also this document there on how to write an extension, and they kinda explain the codebase structure in it.
Closing as @darecoder implemented all constraint checks that can be executed quickly. Great job!