Hello,
I've seen a repo that gets OpenRefine running on Spark. Is it something you could somehow integrate?
https://github.com/andreybratus/RefineOnSpark
Thanks.
Regards,
Yann
Yann, thanks for reaching out. As you can see, it is on our roadmap under _Performance Improvements_.
However, this is a huge milestone, and we first need to break it down into concrete steps and define the architecture. Is it something you would be able to help us with?
Hello @magdmartin,
OK, I did not see that before. I'll have a look at it, code-wise, but I'm a bit rusty TBH. Do you have a view on how to do that? Is the RefineOnSpark approach something you'd accept, or a completely different architecture?
In the long term, I'd be in favour of a different architecture that really breaks operations down into individual processes that can be run on different nodes in parallel.
I totally agree. OpenRefine should be able to work on a sample of the data (when it is too big to fit in memory), create a set of operations to perform on a project, and apply those operations on a remote framework like Spark/YARN/Flink/Apache Beam/etc.
In order to consider the process finished, some validation rules need to be set. If those rules are not satisfied, another sample of the data (the part that doesn't satisfy those checks) is reported to the user so that it can be fixed with additional operations.
Thus, the very first step towards this goal is to be able to export the Refine history in a common format/language (independent of GREL/Jython/Python/etc.) and replay all those operations in a batch program.
I suggest creating a proper external library (e.g. OpenRefine-Language or OpenRefine-History) that would be imported by OpenRefine and by any other framework that needs to replay the history (somehow).
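For illustration, a minimal sketch of what such an engine-independent operation/replay API could look like (all of the names below are hypothetical, not existing OpenRefine classes; rows are modelled as simple column-to-value maps):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/**
 * Hypothetical sketch only: an engine-independent operation plus a naive sequential
 * replayer. A Spark/Flink/Beam runner would replace the stream below with its own
 * parallel map over the dataset.
 */
public interface RefineOperation {

    /** Apply the operation to a single row; must be side-effect free so it can run in parallel. */
    Map<String, Object> apply(Map<String, Object> row);

    /** Replay an exported history (an ordered list of operations) over an in-memory table. */
    static List<Map<String, Object>> replay(List<RefineOperation> history,
                                            List<Map<String, Object>> rows) {
        List<Map<String, Object>> current = rows;
        for (RefineOperation op : history) {
            current = current.stream().map(op::apply).collect(Collectors.toList());
        }
        return current;
    }
}
```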
The way I see things, we should implement various things:
(Side talk: working in Parquet locally is definitely a thing to consider: https://tech.blue-yonder.com/efficient-dataframe-storage-with-apache-parquet/)
So, in the end, using Spark would be:
Am I wrong?
I think there is more than just translating GREL to Spark: it is the operations themselves that need translating. I will try to come up with a more concrete proposal soon.
OK. Regarding the architecture:
@fpompermaier maybe the right way to map the history would be to map it to Beam, which would then be a bridge to many other frameworks (Spark, Flink, ...).
WDYT?
I'd keep things separated. But this is just my opinion.
The problem right now is that the history operations are quite hard to translate into a batch process.
Maybe GREL would be quite easy to translate, but Jython and other languages are trickier...
The first thing I'd try is to extract GREL and the history-related classes into a separate project and see whether this lib could be imported and used to replay the project history in a separate batch engine (like Spark or Beam).
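As a rough sketch of that idea: if GREL and the history classes were packaged as a plain library, a Spark driver could load a project export and replay each serialised operation as an ordinary map over the rows. Only the Spark API calls below are real; ReplayOnSpark, applyOperation and the file names are placeholders.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Sketch only: replaying one (placeholder) history step over a row-per-line dataset.
public class ReplayOnSpark {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("refine-history-replay").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> rows = sc.textFile("project-rows.csv");           // one row per line
            JavaRDD<String> cleaned = rows.map(ReplayOnSpark::applyOperation); // runs in parallel on the cluster
            cleaned.saveAsTextFile("project-rows-cleaned");
        }
    }

    // Stand-in for a single replayed operation, e.g. a GREL "value.trim()" on a column.
    private static String applyOperation(String row) {
        return row.trim();
    }
}
```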
@YannBrrd regarding your architecture diagram, the front-end and back-end are currently not independent in OpenRefine.
We recently received funding from Google News Labs and we plan to use those funds to modernize the OpenRefine architecture, including:
@wetneb did I miss something?
No, that's correct! Anyway, this is a huge task, so it's great to have many voices chiming in :)
@magdmartin if we follow @fpompermaier's suggestion, we could tackle the two tasks in parallel:
And because I count in ternary:
ALL ... our overall architecture plan is to offer alternative processing via BEAM transforms (which opens up every possibility, including raw file manipulations if you have to, all the way up to cloud-run transforms on current tech stacks). But as @magdmartin has stated, we plan to first work on separating out our backend completely and aligning it for BEAM. That's our high-level plan.
As far as replaying operations goes, we have to research a bit more, but regardless, any sequence of operations will use runners within BEAM and follow its current Execution Model: https://beam.apache.org/documentation/execution-model/
As far as OTHER work that we plan to do... we want the local desktop experience of OpenRefine to use far fewer resources in order to handle MORE data and quicker operations within the bounds of the local heap or even native heap. (FYI, Blazegraph and others did similar work to maximize performance by using native heap as much as possible, wherever possible: https://jira.blazegraph.com/browse/BLZG-533. Using more native heap where it makes sense is something we will eventually explore as well.)
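(For anyone unfamiliar with the term: "native heap" here simply means memory allocated outside the JVM heap, for example via direct buffers, so it does not count against -Xmx. A tiny illustration of the general technique, not anything OpenRefine does today:)

```java
import java.nio.ByteBuffer;

// Minimal illustration: a direct buffer lives in native memory, outside the JVM heap.
public class OffHeapExample {
    public static void main(String[] args) {
        ByteBuffer offHeap = ByteBuffer.allocateDirect(64 * 1024 * 1024); // 64 MB of native memory
        offHeap.putInt(0, 42);                                            // write without touching the Java heap
        System.out.println("capacity: " + offHeap.capacity() + ", first int: " + offHeap.getInt(0));
    }
}
```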
And one more side note... moving to Java 9 will help us drop a few bytes per String in our data model.
// Excerpt from the JDK 9 java.lang.String internals: when compact strings are enabled,
// char[] data that fits in Latin-1 is stored as one byte per character.
if (COMPACT_STRINGS) {
    byte[] val = StringUTF16.compress(value);
https://www.sitepoint.com/inside-java-9-part-ii/#compactstrings
I think that all the confusion arises from the fact that this funding received from Google to modernize the OpenRefine architecture is not very clear (or well documented).
Apart from a very brief discussion on the OpenRefine Google Group, there's almost nothing about who "we"/"ALL" refers to, who is leading this effort, and whether the community can be involved in this decision process/analysis. @magdmartin excluded, all the other committers are new to OpenRefine: it would be nice to know who is now involved in the project and what their responsibilities or educational backgrounds are.
Is there any requirements/design document that the community can see and discuss, or is everything already decided?
The only real change I saw is that OpenRefine now has some milestones set, but there is no clear idea whether those milestones are strict or not.
"we" includes "you". :) This is open source. A community of volunteers. But Yes we have teams https://github.com/orgs/OpenRefine/teams and there is a technical steering committee (which includes me, Jacky, Antonin, and Owen...which we refer to as CORE https://github.com/orgs/OpenRefine/teams/core ) to basically help guide a bit. But OpenRefine can take many forms in the case of extensions and where the community wants to go with it.
Anyone can request access to a team. Being added to a team depends on a few important considerations; for addition to CORE, these include knowing our history, goals, and codebase, and previous contributions to our codebase. But NOT being on the CORE team doesn't mean you have no voice in matters of OpenRefine; quite the contrary, actually...
Technical discussions and online meetings can happen anytime the community desires or has the time. Our CORE team has had multiple online meetings and these are announced in our OpenRefine Dev Mailing list.
The CORE team and I put together the Milestones for review and have historically set the agenda for OpenRefine features, but we are not the only voice... the entire community has a voice... which includes you. :)
So at this point, you're probably asking... "Hey Thad, how can I get more involved with OpenRefine and technical discussions?", and I would reply... "Right here, as well as on our Dev mailing list, as well as in our community hangout sessions". And as soon as we set up the non-profit organization, we intend to have a big community online hangout session discussing... "the future".
UPDATE: I've added the "Google Sponsored Projects" https://github.com/OpenRefine/OpenRefine/projects to our GitHub Projects page to make things clearer for you.
UPDATE2: And the Milestones are not really strict. We're all just a bunch of volunteers with limited time and now getting some reimbursement from Google for some of that time on certain features.
I don't have access to either https://github.com/orgs/OpenRefine/teams or https://github.com/orgs/OpenRefine/teams/core.
I was subscribed to the user group but not to dev, so that problem is fixed now.
Thanks for the link to https://github.com/OpenRefine/OpenRefine/projects, it's very useful.
@fpompermaier Hmm... yeah, it seems teams are not publicly shown even though I have all of them set to Visible. https://help.github.com/articles/about-teams/ But teams are not for discussion; they are actually for permissions and project communication with "@mentions". If anyone wants to be part of "technical discussions", then you simply subscribe to our DEV mailing list. Easy.
@fpompermaier see also our CONTRIBUTING document. I am also migrating the draft Governance model in #1435
Is Refine-on-Spark (https://github.com/andreybratus/RefineOnSpark) currently the only way to run OpenRefine on Spark? If so, has anyone had success with it?
@tcbuzor @wetneb @fpompermaier @jackyq2015 What do you think of Delta Lake as a potential new storage backend? It includes Time Travel (read all the way down), similar to our History/Undo. Delta Lake was recently open-sourced by the Databricks folks: https://databricks.com/blog/2019/04/24/open-sourcing-delta-lake.html
https://docs.delta.io/latest/quick-start.html
And the idea of multiple users working on an OpenRefine project could be supported with https://docs.delta.io/latest/concurrency-control.html
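To make the Time Travel idea concrete, here is a small sketch of reading an earlier version of a Delta table from Spark with the documented versionAsOf option; the path and the idea of storing a project grid there are only assumptions for illustration.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Sketch: every write to a Delta table creates a new version, and older versions can
// be read back, which is roughly the shape of an undo/redo-style history.
public class DeltaTimeTravelDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("delta-time-travel")
                .master("local[*]")
                .getOrCreate();

        String table = "/tmp/openrefine-project-grid"; // hypothetical project grid location

        Dataset<Row> current = spark.read().format("delta").load(table); // latest version
        Dataset<Row> original = spark.read().format("delta")
                .option("versionAsOf", 0)                                // "undo" back to version 0
                .load(table);

        System.out.println("rows now: " + current.count() + ", rows at version 0: " + original.count());
        spark.stop();
    }
}
```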
Hi @thadguidry ,
sorry for the late response but I was on vacation last week.
The Delta Lake solution (including the Time Travel feature) looks very promising and could provide a good backend for OpenRefine [1].
However, I'm still convinced that the very first step towards any possible integration with an external processing engine should be the translation of the current History into a more general "wrangling language" that is easily "mappable" to a Spark/Flink/Beam job (I don't think that the current history could be directly translated into a Spark/Flink/Beam job; if that is feasible, then the definition of a general wrangling language could be skipped).
What do you think?
[1] The OpenRefine backend should become something like Apache Livy (https://livy.apache.org/get-started/) in order to handle multiple user sessions/jobs and ensure a proper locking mechanism for the sources/sinks in use.
@fpompermaier There can certainly be improvement in our History, and we have a few issues that ask for that. As far as working with Apache Beam and using it where it makes sense... many of OpenRefine's current commands (those that take a single input and produce a single output) would be mapped to Beam DoFns and use its high-level MapElements. Now, how our History JSON needs to be improved to help support Beam pipelines... that is research work that anyone can do :)
The main thing to be aware of is that state needs to go away, and those commands or operations that can be "Beamed" will need to be written well and follow Beam's user code requirements for transforms.
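As an illustration of that shape, here is a minimal, hypothetical Beam pipeline in Java where a single-input/single-output "operation" is expressed as a stateless MapElements; the trim() stands in for a real OpenRefine command, and the file names are placeholders.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

// Sketch only: one stateless row-wise "operation" as a Beam transform.
public class BeamOperationSketch {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("ReadRows", TextIO.read().from("rows.csv"))
         .apply("TrimCells", MapElements
                 .into(TypeDescriptors.strings())
                 .via((String row) -> row.trim()))   // pure function: no shared state
         .apply("WriteRows", TextIO.write().to("rows-cleaned"));

        p.run().waitUntilFinish();
    }
}
```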
Btw, if you want to experiment and research more... let me know... we have thousands of dollars of credits for Google Cloud. But I'd rather see effort first put into local performance improvements before dealing with cloud performance improvements. Local Spark and maybe Delta Lake should be the first research area; even just a proof of concept to see whether our MassEditCommand would work with Delta Lake would be very useful research.
In our company I'm not the OpenRefine expert, but I'm pretty skilled with Flink (and some Spark/Beam).
If you set up a dedicated fork with a first example of what you would like to test, I could try implementing the Beam stuff. What do you think? Does it make sense?
@fpompermaier I've recently been enjoying Apache Drill, wondering more about it in #1469, and have put some comments there.
Just ran across Kylo, which has data wrangling and uses Spark, NiFi, and Livy on the backend.
@indolering Kylo unfortunately doesn't offer interactive expression dialogs or faceting and a few other things that users of OpenRefine like. Kylo does, however, scale massively to apply transforms across your Spark cluster, and it uses a pluggable Transform Engine that defaults to the Spark Shell, if I am not mistaken. https://github.com/Teradata/kylo/blob/master/ui/ui-app/src/main/resources/static/js/feed-mgr/visual-query/wrangler/query-engine.ts
If you don't need to do much "investigation" while wrangling, then Kylo is a good fit.
If you have no idea what your data looks like, you'll probably want to use a tool like OpenRefine to begin to figure out what shape the data is in, inspect outliers, etc. and if the data isn't that large, refine and clean it up within OpenRefine and export it.
I will work full time on this in 2020. See the roadmap and the GitHub project. I will follow @fpompermaier's suggestion to isolate the data model, GREL and the operations in a standalone library, to facilitate its reuse independently of OpenRefine as a web application.
Closed #1468 as a duplicate of this. Some brainstorming about this happened in this document: https://docs.google.com/document/d/1WT8nCYdNUU14y39IJJlB9fqnxhO0XZnjgHZljEdw-60/edit?usp=sharing
A first prototype of what this could look like is available in the spark-prototype branch. Here is a description of its architecture, with instructions for how to try it out: https://hackmd.io/f8czl6cNT4uxgvAMwI-NkA?view
I've tried to build it on Ubuntu 18.04 LTS with Maven 3.3.9 and the build fails on "OpenRefine - testing" with the following error:
Failed to execute goal org.apache.maven.plugins:maven-install-plugin:2.5.2:install-file (install-or-model) on project or-testing: The artifact information is incomplete or not valid:
[ERROR] [0] 'version' is missing.
[ERROR] -> [Help 1]
Maybe the Maven version should be higher?
Hi Flavio, thanks for trying it out and sorry about that. I have Maven 3.6.0… I'd be surprised if your Maven was too old; it must be a different issue.
Can you try again from a fresh copy of the branch (make sure you do git pull and ./refine clean)? Otherwise, here are packaged artifacts for Linux and Windows:
http://pintoch.ulminfo.fr/f91d182316/openrefine-linux-spark.tar.gz
http://pintoch.ulminfo.fr/369852f6ab/openrefine-win-spark.zip
Still the same problem. The Linux archive instead gives me this error:
17:29:52.076 [ refine_server] Starting Server bound to '127.0.0.1:3333' (0ms)
17:29:52.077 [ refine_server] refine.memory size: 1400M JVM Max heap: 1407188992 (1ms)
17:29:52.083 [ refine_server] Initializing context: '/' from '/media/fp/SSD/projects/oblivion/openrefine-spark/webapp' (6ms)
17:29:52.428 [ refine_server] Creating new workspace directory /home/fp/.local/share/openrefine (345ms)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/media/fp/SSD/projects/oblivion/openrefine-spark/server/target/lib/slf4j-log4j12-1.7.18.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/media/fp/SSD/projects/oblivion/openrefine-spark/webapp/WEB-INF/lib/slf4j-log4j12-1.7.18.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17:29:52.933 [..apache.spark.util.Utils] Your hostname, fp-flavio resolves to a loopback address: 127.0.1.1; using 192.168.234.52 instead (on interface eno1) (505ms)
17:29:52.934 [..apache.spark.util.Utils] Set SPARK_LOCAL_IP if you need to bind to another address (1ms)
17:29:52.994 [..ache.spark.SparkContext] Running Spark version 2.4.4 (60ms)
17:29:53.185 [..p.util.NativeCodeLoader] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable (191ms)
17:29:53.299 [..ache.spark.SparkContext] Submitted application: OpenRefine (114ms)
17:29:53.390 [..e.spark.SecurityManager] Changing view acls to: fp (91ms)
17:29:53.390 [..e.spark.SecurityManager] Changing modify acls to: fp (0ms)
17:29:53.391 [..e.spark.SecurityManager] Changing view acls groups to: (1ms)
17:29:53.391 [..e.spark.SecurityManager] Changing modify acls groups to: (0ms)
17:29:53.392 [..e.spark.SecurityManager] SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(fp); groups with view permissions: Set(); users with modify permissions: Set(fp); groups with modify permissions: Set() (1ms)
17:29:53.660 [..apache.spark.util.Utils] Successfully started service 'sparkDriver' on port 39519. (268ms)
17:29:53.686 [..g.apache.spark.SparkEnv] Registering MapOutputTracker (26ms)
17:29:53.711 [..g.apache.spark.SparkEnv] Registering BlockManagerMaster (25ms)
17:29:53.713 [..ckManagerMasterEndpoint] Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information (2ms)
17:29:53.714 [..ckManagerMasterEndpoint] BlockManagerMasterEndpoint up (1ms)
17:29:53.725 [..torage.DiskBlockManager] Created local directory at /tmp/blockmgr-0bb1c998-ffbe-4b6f-a415-ea9759326e5f (11ms)
17:29:53.754 [..rage.memory.MemoryStore] MemoryStore started with capacity 625.2 MB (29ms)
17:29:53.769 [..g.apache.spark.SparkEnv] Registering OutputCommitCoordinator (15ms)
17:29:53.847 [.._project.jetty.util.log] Logging initialized @1882ms (78ms)
17:29:53.906 [..ect.jetty.server.Server] jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown (59ms)
17:29:53.919 [..ect.jetty.server.Server] Started @1954ms (13ms)
17:29:53.947 [..erver.AbstractConnector] Started ServerConnector@302fec27{HTTP/1.1,[http/1.1]}{0.0.0.0:4040} (28ms)
17:29:53.947 [..apache.spark.util.Utils] Successfully started service 'SparkUI' on port 4040. (0ms)
17:29:53.973 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@43ed0ff3{/jobs,null,AVAILABLE,@Spark} (26ms)
17:29:53.974 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@56db847e{/jobs/json,null,AVAILABLE,@Spark} (1ms)
17:29:53.975 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@740abb5{/jobs/job,null,AVAILABLE,@Spark} (1ms)
17:29:53.976 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@5fe8b721{/jobs/job/json,null,AVAILABLE,@Spark} (1ms)
17:29:53.977 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@551a20d6{/stages,null,AVAILABLE,@Spark} (1ms)
17:29:53.978 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@578524c3{/stages/json,null,AVAILABLE,@Spark} (1ms)
17:29:53.979 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@64c2b546{/stages/stage,null,AVAILABLE,@Spark} (1ms)
17:29:53.980 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@4cc547a{/stages/stage/json,null,AVAILABLE,@Spark} (1ms)
17:29:53.981 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@7555b920{/stages/pool,null,AVAILABLE,@Spark} (1ms)
17:29:53.981 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@4152d38d{/stages/pool/json,null,AVAILABLE,@Spark} (0ms)
17:29:53.982 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@3591009c{/storage,null,AVAILABLE,@Spark} (1ms)
17:29:53.982 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@5398edd0{/storage/json,null,AVAILABLE,@Spark} (0ms)
17:29:53.983 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@b5cc23a{/storage/rdd,null,AVAILABLE,@Spark} (1ms)
17:29:53.983 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@5cc5b667{/storage/rdd/json,null,AVAILABLE,@Spark} (0ms)
17:29:53.984 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@61edc883{/environment,null,AVAILABLE,@Spark} (1ms)
17:29:53.985 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@758f4f03{/environment/json,null,AVAILABLE,@Spark} (1ms)
17:29:53.985 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@182f1e9a{/executors,null,AVAILABLE,@Spark} (0ms)
17:29:53.986 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@6928f576{/executors/json,null,AVAILABLE,@Spark} (1ms)
17:29:53.986 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@660e9100{/executors/threadDump,null,AVAILABLE,@Spark} (0ms)
17:29:53.986 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@69f63d95{/executors/threadDump/json,null,AVAILABLE,@Spark} (0ms)
17:29:53.992 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@9cd25ff{/static,null,AVAILABLE,@Spark} (6ms)
17:29:53.993 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@59a67c3a{/,null,AVAILABLE,@Spark} (1ms)
17:29:53.993 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@5003041b{/api,null,AVAILABLE,@Spark} (0ms)
17:29:53.994 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@23a9ba52{/jobs/job/kill,null,AVAILABLE,@Spark} (1ms)
17:29:53.995 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@ca27722{/stages/stage/kill,null,AVAILABLE,@Spark} (1ms)
17:29:53.996 [..apache.spark.ui.SparkUI] Bound SparkUI to 0.0.0.0, and started at http://flavio-pc.fp.local:4040 (1ms)
17:29:54.110 [..spark.executor.Executor] Starting executor ID driver on host localhost (114ms)
17:29:54.224 [..apache.spark.util.Utils] Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37457. (114ms)
17:29:54.225 [..ttyBlockTransferService] Server created on flavio-pc.fp.local:37457 (1ms)
17:29:54.227 [..rk.storage.BlockManager] Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy (2ms)
17:29:54.251 [..rage.BlockManagerMaster] Registering BlockManager BlockManagerId(driver, flavio-pc.fp.local, 37457, None) (24ms)
17:29:54.255 [..ckManagerMasterEndpoint] Registering block manager flavio-pc.fp.local:37457 with 625.2 MB RAM, BlockManagerId(driver, flavio-pc.fp.local, 37457, None) (4ms)
17:29:54.259 [..rage.BlockManagerMaster] Registered BlockManager BlockManagerId(driver, flavio-pc.fp.local, 37457, None) (4ms)
17:29:54.259 [..rk.storage.BlockManager] Initialized BlockManager: BlockManagerId(driver, flavio-pc.fp.local, 37457, None) (0ms)
17:29:54.403 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@667e34b1{/metrics/json,null,AVAILABLE,@Spark} (144ms)
17:29:54.457 [ refine] Starting OpenRefine Spark [3df2c35]... (54ms)
17:29:54.457 [ refine] initializing FileProjectManager with dir (0ms)
17:29:54.457 [ refine] /home/fp/.local/share/openrefine (0ms)
17:29:54.461 [ FileProjectManager] Failed to load workspace from any attempted alternatives. (4ms)
17:29:54.508 [ butterfly] Error loading special module manager (47ms)
java.lang.ClassNotFoundException: org.openrefine.extension.database.DatabaseModuleImpl
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
at edu.mit.simile.butterfly.Butterfly.createModule(Butterfly.java:671)
at edu.mit.simile.butterfly.Butterfly.configure(Butterfly.java:412)
at edu.mit.simile.butterfly.Butterfly.init(Butterfly.java:308)
at org.mortbay.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:440)
at org.mortbay.jetty.servlet.ServletHolder.doStart(ServletHolder.java:263)
at org.openrefine.RefineServer.configure(Refine.java:291)
at org.openrefine.RefineServer.init(Refine.java:203)
at org.openrefine.Refine.init(Refine.java:109)
at org.openrefine.Refine.main(Refine.java:103)
[... rest removed by @wetneb to make the thread shorter...]
Oops, sorry about that, it is because I haven't removed the extensions from the package. You can either delete the webapp/extensions folder or download this new version:
- http://pintoch.ulminfo.fr/a1c912deb7/openrefine-linux-spark.tar.gz
- http://pintoch.ulminfo.fr/ed00c4c173/openrefine-win-spark.zip
I removed the extensions folder and now it works! And it's also pretty fast... good job!
I have a problem visualizing the Spark UI because some files cannot be loaded; for example, I get the following errors:
Error for /static/timeline-view.css (0ms)
java.lang.NoSuchMethodError: javax.servlet.http.HttpServletRequest.isAsyncSupported()Z
at org.spark_project.jetty.servlet.DefaultServlet.sendData(DefaultServlet.java:943)
at org.spark_project.jetty.servlet.DefaultServlet.doGet(DefaultServlet.java:532)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584)
at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.spark_project.jetty.server.Server.handle(Server.java:539)
at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333)
at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108)
at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:748)
11:44:10.759 [...servlet.ServletHandler] Error for /static/vis.min.css (0ms)
java.lang.NoSuchMethodError: javax.servlet.http.HttpServletRequest.isAsyncSupported()Z
at org.spark_project.jetty.servlet.DefaultServlet.sendData(DefaultServlet.java:943)
at org.spark_project.jetty.servlet.DefaultServlet.doGet(DefaultServlet.java:532)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.spark_project.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584)
at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.spark_project.jetty.server.Server.handle(Server.java:539)
at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333)
at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108)
at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:748)
11:44:10.760 [..etty.server.HttpChannel] //localhost:4040/static/vis.min.css (1ms)
java.lang.NoSuchMethodError: javax.servlet.http.HttpServletRequest.isAsyncStarted()Z
at org.spark_project.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:688)
at org.spark_project.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at org.spark_project.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at org.spark_project.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at org.spark_project.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.spark_project.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:493)
at org.spark_project.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at org.spark_project.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.spark_project.jetty.server.Server.handle(Server.java:539)
at org.spark_project.jetty.server.HttpChannel.handle(HttpChannel.java:333)
at org.spark_project.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at org.spark_project.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
at org.spark_project.jetty.io.FillInterest.fillable(FillInterest.java:108)
at org.spark_project.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at org.spark_project.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at org.spark_project.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at org.spark_project.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:748)
11:44:10.761 [..etty.server.HttpChannel] //localhost:4040/static/bootstrap.min.css (1ms)
java.lang.NoSuchMethodError: javax.servlet.http.HttpServletRequest.isAsyncStarted()Z
Argh, that's pretty annoying: it's a conflict between Spark's Jetty and Butterfly's Jetty. Somehow, on my machine, Spark gets the right Jetty. That means we need to upgrade Butterfly's Jetty to a new version (and, in the future, migrate away from it, of course).
Interestingly, through Visual Studio Code and using bash, I get these version issues ...
[INFO] --- maven-install-plugin:2.5.2:install-file (install-or-model) @ or-testing ---
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for OpenRefine 3.3-SNAPSHOT:
[INFO]
[INFO] OpenRefine ......................................... SUCCESS [ 0.204 s]
[INFO] OpenRefine - utilities ............................. SUCCESS [ 0.602 s]
[INFO] OpenRefine - model ................................. SUCCESS [ 0.690 s]
[INFO] OpenRefine - testing ............................... FAILURE [ 0.178 s]
[INFO] OpenRefine - GREL .................................. SKIPPED
[INFO] OpenRefine - main .................................. SKIPPED
[INFO] OpenRefine - server ................................ SKIPPED
[INFO] OpenRefine - packaging ............................. SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.908 s
[INFO] Finished at: 2020-02-12T09:51:54-06:00
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-install-plugin:2.5.2:install-file (install-or-model) on project or-testing: The artifact information is incomplete or not valid:
[ERROR] [0] 'version' is missing.
[ERROR]
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn <args> -rf :or-testing
Error: Error while running maven task 'compile'
thadg@DESKTOP-2EMESME MINGW64 /e/GitHub Repos/openrefine (spark-prototype)
but if I build directly from a CMD window (setting my MAVEN_HOME directly in the Windows env) and type "refine build", then it compiles and can run, but then I see a few class naming issues. We will be completely changing to org.openrefine eventually, right?
09:53:30.088 [..apache.spark.util.Utils] Successfully started service 'sparkDriver' on port 55286. (686ms)
09:53:30.106 [..g.apache.spark.SparkEnv] Registering MapOutputTracker (18ms)
09:53:30.121 [..g.apache.spark.SparkEnv] Registering BlockManagerMaster (15ms)
09:53:30.122 [..ckManagerMasterEndpoint] Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information (1ms)
09:53:30.123 [..ckManagerMasterEndpoint] BlockManagerMasterEndpoint up (1ms)
09:53:30.136 [..torage.DiskBlockManager] Created local directory at C:\Users\thadg\AppData\Local\Temp\blockmgr-b193aa84-b2da-4e07-89ec-c57f377ec4c8 (13ms)
09:53:30.159 [..rage.memory.MemoryStore] MemoryStore started with capacity 660.0 MB (23ms)
09:53:30.171 [..g.apache.spark.SparkEnv] Registering OutputCommitCoordinator (12ms)
09:53:30.225 [.._project.jetty.util.log] Logging initialized @5869ms (54ms)
09:53:30.266 [..ect.jetty.server.Server] jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown (41ms)
09:53:30.294 [..ect.jetty.server.Server] Started @5938ms (28ms)
09:53:30.311 [..erver.AbstractConnector] Started ServerConnector@5b84f14{HTTP/1.1,[http/1.1]}{0.0.0.0:4040} (17ms)
09:53:30.311 [..apache.spark.util.Utils] Successfully started service 'SparkUI' on port 4040. (0ms)
09:53:30.323 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@588307f7{/jobs,null,AVAILABLE,@Spark} (12ms)
09:53:30.324 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@43f0c2d1{/jobs/json,null,AVAILABLE,@Spark} (1ms)
09:53:30.325 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@5fb65013{/jobs/job,null,AVAILABLE,@Spark} (1ms)
09:53:30.325 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@7a1f45ed{/jobs/job/json,null,AVAILABLE,@Spark} (0ms)
09:53:30.326 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@1744a475{/stages,null,AVAILABLE,@Spark} (1ms)
09:53:30.326 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@444cc791{/stages/json,null,AVAILABLE,@Spark} (0ms)
09:53:30.326 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@1c5c616f{/stages/stage,null,AVAILABLE,@Spark} (0ms)
09:53:30.327 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@ca93621{/stages/stage/json,null,AVAILABLE,@Spark} (1ms)
09:53:30.328 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@6a48a7f3{/stages/pool,null,AVAILABLE,@Spark} (1ms)
09:53:30.328 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@3f985a86{/stages/pool/json,null,AVAILABLE,@Spark} (0ms)
09:53:30.328 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@57a2ed35{/storage,null,AVAILABLE,@Spark} (0ms)
09:53:30.329 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@12ffd1de{/storage/json,null,AVAILABLE,@Spark} (1ms)
09:53:30.329 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@3d278b4d{/storage/rdd,null,AVAILABLE,@Spark} (0ms)
09:53:30.330 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@4096aa05{/storage/rdd/json,null,AVAILABLE,@Spark} (1ms)
09:53:30.330 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@9d3c67{/environment,null,AVAILABLE,@Spark} (0ms)
09:53:30.330 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@6c806c8b{/environment/json,null,AVAILABLE,@Spark} (0ms)
09:53:30.331 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@6dfcffb5{/executors,null,AVAILABLE,@Spark} (1ms)
09:53:30.331 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@184fb68d{/executors/json,null,AVAILABLE,@Spark} (0ms)
09:53:30.331 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@71d8cfe7{/executors/threadDump,null,AVAILABLE,@Spark} (0ms)
09:53:30.332 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@1e530163{/executors/threadDump/json,null,AVAILABLE,@Spark} (1ms)
09:53:30.336 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@14d8444b{/static,null,AVAILABLE,@Spark} (4ms)
09:53:30.336 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@204e90f7{/,null,AVAILABLE,@Spark} (0ms)
09:53:30.337 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@20a05b32{/api,null,AVAILABLE,@Spark} (1ms)
09:53:30.337 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@2ab5afc7{/jobs/job/kill,null,AVAILABLE,@Spark} (0ms)
09:53:30.338 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@4dc8c0ea{/stages/stage/kill,null,AVAILABLE,@Spark} (1ms)
09:53:30.339 [..apache.spark.ui.SparkUI] Bound SparkUI to 0.0.0.0, and started at http://host.docker.internal:4040 (1ms)
09:53:30.426 [..spark.executor.Executor] Starting executor ID driver on host localhost (87ms)
09:53:30.591 [..apache.spark.util.Utils] Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 55297. (165ms)
09:53:30.592 [..ttyBlockTransferService] Server created on host.docker.internal:55297 (1ms)
09:53:30.594 [..rk.storage.BlockManager] Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy (2ms)
09:53:30.610 [..rage.BlockManagerMaster] Registering BlockManager BlockManagerId(driver, host.docker.internal, 55297, None) (16ms)
09:53:30.615 [..ckManagerMasterEndpoint] Registering block manager host.docker.internal:55297 with 660.0 MB RAM, BlockManagerId(driver, host.docker.internal, 55297, None) (5ms)
09:53:30.618 [..rage.BlockManagerMaster] Registered BlockManager BlockManagerId(driver, host.docker.internal, 55297, None) (3ms)
09:53:30.618 [..rk.storage.BlockManager] Initialized BlockManager: BlockManagerId(driver, host.docker.internal, 55297, None) (0ms)
09:53:30.700 [...handler.ContextHandler] Started o.s.j.s.ServletContextHandler@f324455{/metrics/json,null,AVAILABLE,@Spark} (82ms)
09:53:30.744 [ refine] Starting OpenRefine Spark [3df2c35]... (44ms)
09:53:30.744 [ refine] initializing FileProjectManager with dir (0ms)
09:53:30.744 [ refine] C:\Users\thadg\AppData\Roaming\OpenRefine (0ms)
java.lang.IllegalArgumentException: Invalid type id 'com.google.refine.preference.TopList' (for id type 'Id.class'): no such class found
at com.fasterxml.jackson.databind.jsontype.impl.ClassNameIdResolver._typeFromId(ClassNameIdResolver.java:66)
at com.fasterxml.jackson.databind.jsontype.impl.ClassNameIdResolver.typeFromId(ClassNameIdResolver.java:48)
at com.fasterxml.jackson.databind.jsontype.impl.TypeDeserializerBase._findDeserializer(TypeDeserializerBase.java:154)
at com.fasterxml.jackson.databind.jsontype.impl.AsPropertyTypeDeserializer._deserializeTypedForId(AsPropertyTypeDeserializer.java:108)
at com.fasterxml.jackson.databind.jsontype.impl.AsPropertyTypeDeserializer.deserializeTypedFromObject(AsPropertyTypeDeserializer.java:93)
at com.fasterxml.jackson.databind.deser.AbstractDeserializer.deserializeWithType(AbstractDeserializer.java:131)
at com.fasterxml.jackson.databind.deser.impl.TypeWrappedDeserializer.deserialize(TypeWrappedDeserializer.java:42)
at com.fasterxml.jackson.databind.ObjectMapper._readValue(ObjectMapper.java:3708)
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2005)
at com.fasterxml.jackson.databind.ObjectMapper.treeToValue(ObjectMapper.java:2476)
at org.openrefine.preference.PreferenceStore.loadObject(PreferenceStore.java:119)
at org.openrefine.preference.PreferenceStore.setEntries(PreferenceStore.java:103)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at com.fasterxml.jackson.databind.deser.impl.MethodProperty.deserializeAndSet(MethodProperty.java:97)
at com.fasterxml.jackson.databind.deser.BeanDeserializer.vanillaDeserialize(BeanDeserializer.java:258)
at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:125)
at com.fasterxml.jackson.databind.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:520)
at com.fasterxml.jackson.databind.deser.impl.MethodProperty.deserializeAndSet(MethodProperty.java:95)
at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:222)
at com.fasterxml.jackson.databind.ObjectReader._bindAndClose(ObjectReader.java:1513)
at com.fasterxml.jackson.databind.ObjectReader.readValue(ObjectReader.java:1181)
at org.openrefine.io.FileProjectManager.loadFromFile(FileProjectManager.java:385)
at org.openrefine.io.FileProjectManager.load(FileProjectManager.java:364)
at org.openrefine.io.FileProjectManager.<init>(FileProjectManager.java:98)
at org.openrefine.io.FileProjectManager.initialize(FileProjectManager.java:83)
at org.openrefine.RefineServlet.init(RefineServlet.java:149)
at javax.servlet.GenericServlet.init(GenericServlet.java:244)
at edu.mit.simile.butterfly.Butterfly.init(Butterfly.java:180)
at org.mortbay.jetty.servlet.ServletHolder.initServlet(ServletHolder.java:440)
at org.mortbay.jetty.servlet.ServletHolder.doStart(ServletHolder.java:263)
at org.openrefine.RefineServer.configure(Refine.java:291)
at org.openrefine.RefineServer.init(Refine.java:203)
at org.openrefine.Refine.init(Refine.java:109)
at org.openrefine.Refine.main(Refine.java:103)
java.lang.IllegalArgumentException: Invalid type id 'com.google.refine.preference.TopList' (for id type 'Id.class'): no such class found
at com.fasterxml.jackson.databind.jsontype.impl.ClassNameIdResolver._typeFromId(ClassNameIdResolver.java:66)
at com.fasterxml.jackson.databind.jsontype.impl.ClassNameIdResolver.typeFromId(ClassNameIdResolver.java:48)
<snip>
Hi @thadguidry and @fpompermaier,
Sorry about these version issues, the Maven configuration was just not pretty at all. It should be fixed now (I rebased the spark-prototype branch, so you might want to delete it on your side before pulling).
I also fixed @felixlohmeier's issue with viewing the Spark UI (hopefully!), which required migrating Butterfly to Jetty 9.
Packaged versions are here:
@wetneb The build completes successfully now (with the CMD window, but not in Visual Studio Code... I'll figure that out when I have more time).
The UI does launch and Text Facet seems to work!
Is Sort supposed to work? I get a busy pointer... and then a fallout of errors in the console.
I also get various other issues; for example, when I CTRL-C and try to shut it down, this happens...
16:52:40.328 [..he.spark.ContextCleaner] Cleaned accumulator 147 (1ms)
16:52:40.328 [..he.spark.ContextCleaner] Cleaned accumulator 65 (0ms)
16:52:40.330 [..torage.BlockManagerInfo] Removed broadcast_6_piece0 on host.docker.internal:59437 in memory (size: 2.7 KB, free: 660.0 MB) (2ms)
16:52:40.334 [..he.spark.ContextCleaner] Cleaned accumulator 161 (4ms)
16:52:41.246 [ refine] POST /command/core/get-importing-job-status (912ms)
16:52:41.256 [..ache.spark.SparkContext] Starting job: count at GridState.java:219 (10ms)
16:52:41.256 [...scheduler.DAGScheduler] Got job 9 (count at GridState.java:219) with 1 output partitions (0ms)
16:52:41.256 [...scheduler.DAGScheduler] Final stage: ResultStage 9 (count at GridState.java:219) (0ms)
16:52:41.257 [...scheduler.DAGScheduler] Parents of final stage: List() (1ms)
16:52:41.257 [...scheduler.DAGScheduler] Missing parents: List() (0ms)
16:52:41.257 [...scheduler.DAGScheduler] Submitting ResultStage 9 (MapPartitionsRDD[24] at mapValues at ImporterUtilities.java:267), which has no missing parents (0ms)
16:52:41.258 [..rage.memory.MemoryStore] Block broadcast_9 stored as values in memory (estimated size 4.1 KB, free 660.0 MB) (1ms)
16:52:41.260 [..rage.memory.MemoryStore] Block broadcast_9_piece0 stored as bytes in memory (estimated size 2.2 KB, free 660.0 MB) (2ms)
16:52:41.261 [..torage.BlockManagerInfo] Added broadcast_9_piece0 in memory on host.docker.internal:59437 (size: 2.2 KB, free: 660.0 MB) (1ms)
16:52:41.262 [..ache.spark.SparkContext] Created broadcast 9 from broadcast at DAGScheduler.scala:1161 (1ms)
16:52:41.262 [...scheduler.DAGScheduler] Submitting 1 missing tasks from ResultStage 9 (MapPartitionsRDD[24] at mapValues at ImporterUtilities.java:267) (first 15 tasks are for partitions Vector(0)) (0ms)
16:52:41.262 [..duler.TaskSchedulerImpl] Adding task set 9.0 with 1 tasks (0ms)
16:52:41.263 [..cheduler.TaskSetManager] Starting task 0.0 in stage 9.0 (TID 9, localhost, executor driver, partition 0, PROCESS_LOCAL, 8031 bytes) (1ms)
16:52:41.263 [..spark.executor.Executor] Running task 0.0 in stage 9.0 (TID 9) (0ms)
16:52:41.266 [..spark.executor.Executor] Finished task 0.0 in stage 9.0 (TID 9). 623 bytes result sent to driver (3ms)
16:52:41.267 [..cheduler.TaskSetManager] Finished task 0.0 in stage 9.0 (TID 9) in 4 ms on localhost (executor driver) (1/1) (1ms)
16:52:41.267 [..duler.TaskSchedulerImpl] Removed TaskSet 9.0, whose tasks have all completed, from pool (0ms)
16:52:41.267 [...scheduler.DAGScheduler] ResultStage 9 (count at GridState.java:219) finished in 0.009 s (0ms)
16:52:41.268 [...scheduler.DAGScheduler] Job 9 finished: count at GridState.java:219, took 0.011662 s (1ms)
16:52:41.272 [ refine] GET /command/core/get-csrf-token (4ms)
16:52:41.635 [ refine] POST /command/core/load-language (363ms)
16:52:41.647 [ refine] GET /command/core/get-preference (12ms)
16:52:41.647 [ refine] GET /command/core/get-preference (0ms)
16:52:41.661 [ refine] GET /command/core/get-project-metadata (14ms)
16:52:41.680 [ refine] GET /command/core/get-models (19ms)
16:52:41.742 [ refine] GET /command/core/get-history (62ms)
16:52:41.745 [ refine] POST /command/core/get-rows (3ms)
16:52:41.751 [ refine] GET /command/core/get-history (6ms)
16:52:41.774 [..ache.spark.SparkContext] Starting job: count at GridState.java:219 (23ms)
16:52:41.775 [...scheduler.DAGScheduler] Got job 10 (count at GridState.java:219) with 1 output partitions (1ms)
16:52:41.775 [...scheduler.DAGScheduler] Final stage: ResultStage 10 (count at GridState.java:219) (0ms)
16:52:41.775 [...scheduler.DAGScheduler] Parents of final stage: List() (0ms)
16:52:41.776 [...scheduler.DAGScheduler] Missing parents: List() (1ms)
16:52:41.776 [...scheduler.DAGScheduler] Submitting ResultStage 10 (MapPartitionsRDD[24] at mapValues at ImporterUtilities.java:267), which has no missing parents (0ms)
16:52:41.780 [..rage.memory.MemoryStore] Block broadcast_10 stored as values in memory (estimated size 4.1 KB, free 660.0 MB) (4ms)
16:52:41.784 [..rage.memory.MemoryStore] Block broadcast_10_piece0 stored as bytes in memory (estimated size 2.2 KB, free 660.0 MB) (4ms)
16:52:41.787 [..torage.BlockManagerInfo] Added broadcast_10_piece0 in memory on host.docker.internal:59437 (size: 2.2 KB, free: 660.0 MB) (3ms)
16:52:41.787 [..ache.spark.SparkContext] Created broadcast 10 from broadcast at DAGScheduler.scala:1161 (0ms)
16:52:41.788 [...scheduler.DAGScheduler] Submitting 1 missing tasks from ResultStage 10 (MapPartitionsRDD[24] at mapValues at ImporterUtilities.java:267) (first 15 tasks are for partitions Vector(0)) (1ms)
16:52:41.788 [..duler.TaskSchedulerImpl] Adding task set 10.0 with 1 tasks (0ms)
16:52:41.789 [..cheduler.TaskSetManager] Starting task 0.0 in stage 10.0 (TID 10, localhost, executor driver, partition 0, PROCESS_LOCAL, 8031 bytes) (1ms)
16:52:41.789 [..spark.executor.Executor] Running task 0.0 in stage 10.0 (TID 10) (0ms)
16:52:41.792 [..spark.executor.Executor] Finished task 0.0 in stage 10.0 (TID 10). 580 bytes result sent to driver (3ms)
16:52:41.792 [..cheduler.TaskSetManager] Finished task 0.0 in stage 10.0 (TID 10) in 3 ms on localhost (executor driver) (1/1) (0ms)
16:52:41.792 [..duler.TaskSchedulerImpl] Removed TaskSet 10.0, whose tasks have all completed, from pool (0ms)
16:52:41.793 [...scheduler.DAGScheduler] ResultStage 10 (count at GridState.java:219) finished in 0.016 s (1ms)
16:52:41.793 [...scheduler.DAGScheduler] Job 10 finished: count at GridState.java:219, took 0.019746 s (0ms)
16:52:41.805 [..ache.spark.SparkContext] Starting job: take at GridState.java:200 (12ms)
16:52:41.806 [...scheduler.DAGScheduler] Got job 11 (take at GridState.java:200) with 1 output partitions (1ms)
16:52:41.806 [...scheduler.DAGScheduler] Final stage: ResultStage 11 (take at GridState.java:200) (0ms)
16:52:41.806 [...scheduler.DAGScheduler] Parents of final stage: List() (0ms)
16:52:41.807 [...scheduler.DAGScheduler] Missing parents: List() (1ms)
16:52:41.807 [...scheduler.DAGScheduler] Submitting ResultStage 11 (MapPartitionsRDD[32] at filter at GridState.java:200), which has no missing parents (0ms)
16:52:41.809 [..rage.memory.MemoryStore] Block broadcast_11 stored as values in memory (estimated size 5.1 KB, free 660.0 MB) (2ms)
16:52:41.810 [..rage.memory.MemoryStore] Block broadcast_11_piece0 stored as bytes in memory (estimated size 2.7 KB, free 660.0 MB) (1ms)
16:52:41.811 [..torage.BlockManagerInfo] Added broadcast_11_piece0 in memory on host.docker.internal:59437 (size: 2.7 KB, free: 660.0 MB) (1ms)
16:52:41.811 [..ache.spark.SparkContext] Created broadcast 11 from broadcast at DAGScheduler.scala:1161 (0ms)
16:52:41.812 [...scheduler.DAGScheduler] Submitting 1 missing tasks from ResultStage 11 (MapPartitionsRDD[32] at filter at GridState.java:200) (first 15 tasks are for partitions Vector(0)) (1ms)
16:52:41.812 [..duler.TaskSchedulerImpl] Adding task set 11.0 with 1 tasks (0ms)
16:52:41.813 [..cheduler.TaskSetManager] Starting task 0.0 in stage 11.0 (TID 11, localhost, executor driver, partition 0, PROCESS_LOCAL, 8031 bytes) (1ms)
16:52:41.814 [..spark.executor.Executor] Running task 0.0 in stage 11.0 (TID 11) (1ms)
16:52:41.818 [..spark.executor.Executor] Finished task 0.0 in stage 11.0 (TID 11). 1358 bytes result sent to driver (4ms)
16:52:41.819 [..cheduler.TaskSetManager] Finished task 0.0 in stage 11.0 (TID 11) in 6 ms on localhost (executor driver) (1/1) (1ms)
16:52:41.819 [..duler.TaskSchedulerImpl] Removed TaskSet 11.0, whose tasks have all completed, from pool (0ms)
16:52:41.819 [...scheduler.DAGScheduler] ResultStage 11 (take at GridState.java:200) finished in 0.012 s (0ms)
16:52:41.820 [...scheduler.DAGScheduler] Job 11 finished: take at GridState.java:200, took 0.014586 s (1ms)
16:52:41.827 [..ache.spark.SparkContext] Starting job: count at GridState.java:219 (7ms)
16:52:41.827 [...scheduler.DAGScheduler] Got job 12 (count at GridState.java:219) with 1 output partitions (0ms)
16:52:41.827 [...scheduler.DAGScheduler] Final stage: ResultStage 12 (count at GridState.java:219) (0ms)
16:52:41.828 [...scheduler.DAGScheduler] Parents of final stage: List() (1ms)
16:52:41.828 [...scheduler.DAGScheduler] Missing parents: List() (0ms)
16:52:41.828 [...scheduler.DAGScheduler] Submitting ResultStage 12 (MapPartitionsRDD[28] at filter at Engine.java:118), which has no missing parents (0ms)
16:52:41.830 [..rage.memory.MemoryStore] Block broadcast_12 stored as values in memory (estimated size 4.8 KB, free 660.0 MB) (2ms)
16:52:41.832 [..rage.memory.MemoryStore] Block broadcast_12_piece0 stored as bytes in memory (estimated size 2.5 KB, free 660.0 MB) (2ms)
16:52:41.833 [..torage.BlockManagerInfo] Added broadcast_12_piece0 in memory on host.docker.internal:59437 (size: 2.5 KB, free: 660.0 MB) (1ms)
16:52:41.834 [..ache.spark.SparkContext] Created broadcast 12 from broadcast at DAGScheduler.scala:1161 (1ms)
16:52:41.834 [...scheduler.DAGScheduler] Submitting 1 missing tasks from ResultStage 12 (MapPartitionsRDD[28] at filter at Engine.java:118) (first 15 tasks are for partitions Vector(0)) (0ms)
16:52:41.834 [..duler.TaskSchedulerImpl] Adding task set 12.0 with 1 tasks (0ms)
16:52:41.835 [..cheduler.TaskSetManager] Starting task 0.0 in stage 12.0 (TID 12, localhost, executor driver, partition 0, PROCESS_LOCAL, 8031 bytes) (1ms)
16:52:41.836 [..spark.executor.Executor] Running task 0.0 in stage 12.0 (TID 12) (1ms)
16:52:41.838 [..spark.executor.Executor] Finished task 0.0 in stage 12.0 (TID 12). 580 bytes result sent to driver (2ms)
16:52:41.838 [..cheduler.TaskSetManager] Finished task 0.0 in stage 12.0 (TID 12) in 3 ms on localhost (executor driver) (1/1) (0ms)
16:52:41.838 [..duler.TaskSchedulerImpl] Removed TaskSet 12.0, whose tasks have all completed, from pool (0ms)
16:52:41.839 [...scheduler.DAGScheduler] ResultStage 12 (count at GridState.java:219) finished in 0.010 s (1ms)
16:52:41.839 [...scheduler.DAGScheduler] Job 12 finished: count at GridState.java:219, took 0.012490 s (0ms)
16:53:08.129 [ ProjectManager] Saving all modified projects ... (26290ms)
16:53:08.144 [..equenceFileRDDFunctions] Saving as sequence file of type (NullWritable,BytesWritable) (15ms)
16:53:08.213 [..nfiguration.deprecation] mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir (69ms)
16:53:08.216 [..oopMapRedCommitProtocol] Using output committer class org.apache.hadoop.mapred.FileOutputCommitter (3ms)
16:53:08.242 [..ache.spark.SparkContext] Starting job: runJob at SparkHadoopWriter.scala:78 (26ms)
16:53:08.243 [...scheduler.DAGScheduler] Got job 13 (runJob at SparkHadoopWriter.scala:78) with 1 output partitions (1ms)
16:53:08.244 [...scheduler.DAGScheduler] Final stage: ResultStage 13 (runJob at SparkHadoopWriter.scala:78) (1ms)
16:53:08.244 [...scheduler.DAGScheduler] Parents of final stage: List() (0ms)
16:53:08.244 [...scheduler.DAGScheduler] Missing parents: List() (0ms)
16:53:08.244 [...scheduler.DAGScheduler] Submitting ResultStage 13 (MapPartitionsRDD[34] at saveAsObjectFile at GridState.java:235), which has no missing parents (0ms)
16:53:08.252 [..rage.memory.MemoryStore] Block broadcast_13 stored as values in memory (estimated size 66.9 KB, free 659.9 MB) (8ms)
16:53:08.257 [..rage.memory.MemoryStore] Block broadcast_13_piece0 stored as bytes in memory (estimated size 24.3 KB, free 659.9 MB) (5ms)
16:53:08.257 [..torage.BlockManagerInfo] Added broadcast_13_piece0 in memory on host.docker.internal:59437 (size: 24.3 KB, free: 660.0 MB) (0ms)
16:53:08.258 [..ache.spark.SparkContext] Created broadcast 13 from broadcast at DAGScheduler.scala:1161 (1ms)
16:53:08.258 [...scheduler.DAGScheduler] Submitting 1 missing tasks from ResultStage 13 (MapPartitionsRDD[34] at saveAsObjectFile at GridState.java:235) (first 15 tasks are for partitions Vector(0)) (0ms)
16:53:08.258 [..duler.TaskSchedulerImpl] Adding task set 13.0 with 1 tasks (0ms)
16:53:08.259 [..cheduler.TaskSetManager] Starting task 0.0 in stage 13.0 (TID 13, localhost, executor driver, partition 0, PROCESS_LOCAL, 8031 bytes) (1ms)
16:53:08.260 [..spark.executor.Executor] Running task 0.0 in stage 13.0 (TID 13) (1ms)
16:53:08.295 [..oopMapRedCommitProtocol] Using output committer class org.apache.hadoop.mapred.FileOutputCommitter (35ms)
16:53:08.317 [..spark.executor.Executor] Exception in task 0.0 in stage 13.0 (TID 13) (22ms)
java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\thadg\AppData\Roaming\OpenRefine\2562870451033.project\initial\grid\_temporary\0\_temporary\attempt_20200213165308_0034_m_000000_0\part-00000
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:762)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:661)
at org.apache.hadoop.fs.ChecksumFileSystem$1.apply(ChecksumFileSystem.java:501)
at org.apache.hadoop.fs.ChecksumFileSystem$FsOperation.run(ChecksumFileSystem.java:482)
at org.apache.hadoop.fs.ChecksumFileSystem.setPermission(ChecksumFileSystem.java:498)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:467)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.io.SequenceFile$Writer.<init>(SequenceFile.java:1132)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:271)
at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:528)
at org.apache.hadoop.mapred.SequenceFileOutputFormat.getRecordWriter(SequenceFileOutputFormat.java:63)
at org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.initWriter(SparkHadoopWriter.scala:230)
at org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:120)
at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$write$1(SparkHadoopWriter.scala:83)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
16:53:08.318 [..spark.executor.Executor] Not reporting error to driver during JVM shutdown. (1ms)
16:53:11.129 [..erver.AbstractConnector] Stopped ServerConnector@7fad8c79{HTTP/1.1,[http/1.1]}{127.0.0.1:3333} (2811ms)
16:53:11.129 [..se.jetty.server.session] node0 Stopped scavenging (0ms)
16:53:11.133 [ ProjectManager] Saving all modified projects ... (4ms)
Sort is not supposed to work, I have disabled it. I have added CSV/TSV export just for the sake of having an exporter too.
Concerning your exception, thanks, it is interesting. It might be Windows-specific since this seems to be about a failure in a chmod command.
CTRL-C shutdown works correctly, without exceptions, when it doesn't interrupt a Spark executor, so we might need to look into that.
07:50:51.214 [...scheduler.DAGScheduler] Final stage: ResultStage 31 (count at GridState.java:219) (0ms)
07:50:51.214 [...scheduler.DAGScheduler] Parents of final stage: List() (0ms)
07:50:51.215 [...scheduler.DAGScheduler] Missing parents: List() (1ms)
07:50:51.215 [...scheduler.DAGScheduler] Submitting ResultStage 31 (MapPartitionsRDD[23] at mapValues at ImporterUtilities.java:267), which has no missing parents (0ms)
07:50:51.216 [..rage.memory.MemoryStore] Block broadcast_31 stored as values in memory (estimated size 4.1 KB, free 659.9 MB) (1ms)
07:50:51.220 [..rage.memory.MemoryStore] Block broadcast_31_piece0 stored as bytes in memory (estimated size 2.2 KB, free 659.9 MB) (4ms)
07:50:51.221 [..torage.BlockManagerInfo] Added broadcast_31_piece0 in memory on host.docker.internal:50496 (size: 2.2 KB, free: 659.9 MB) (1ms)
07:50:51.221 [..ache.spark.SparkContext] Created broadcast 31 from broadcast at DAGScheduler.scala:1161 (0ms)
07:50:51.222 [...scheduler.DAGScheduler] Submitting 1 missing tasks from ResultStage 31 (MapPartitionsRDD[23] at mapValues at ImporterUtilities.java:267) (first 15 tasks are for partitions Vector(0)) (1ms)
07:50:51.222 [..duler.TaskSchedulerImpl] Adding task set 31.0 with 1 tasks (0ms)
07:50:51.224 [..cheduler.TaskSetManager] Starting task 0.0 in stage 31.0 (TID 31, localhost, executor driver, partition 0, PROCESS_LOCAL, 30852 bytes) (2ms)
07:50:51.224 [..spark.executor.Executor] Running task 0.0 in stage 31.0 (TID 31) (0ms)
07:50:51.228 [..spark.executor.Executor] Finished task 0.0 in stage 31.0 (TID 31). 666 bytes result sent to driver (4ms)
07:50:51.246 [..cheduler.TaskSetManager] Finished task 0.0 in stage 31.0 (TID 31) in 23 ms on localhost (executor driver) (1/1) (18ms)
07:50:51.247 [..duler.TaskSchedulerImpl] Removed TaskSet 31.0, whose tasks have all completed, from pool (1ms)
07:50:51.247 [...scheduler.DAGScheduler] ResultStage 31 (count at GridState.java:219) finished in 0.031 s (0ms)
07:50:51.248 [...scheduler.DAGScheduler] Job 31 finished: count at GridState.java:219, took 0.034720 s (1ms)
07:50:52.068 [ refine] GET /command/core/get-csrf-token (820ms)
07:50:52.071 [ refine] POST /command/core/cancel-importing-job (3ms)
07:52:00.236 [..ache.spark.SparkContext] Invoking stop() from shutdown hook (68165ms)
07:52:00.241 [..erver.AbstractConnector] Stopped Spark@7e27dc80{HTTP/1.1,[http/1.1]}{0.0.0.0:4040} (5ms)
07:52:00.256 [..apache.spark.ui.SparkUI] Stopped Spark web UI at http://host.docker.internal:4040 (15ms)
07:52:00.286 [..utTrackerMasterEndpoint] MapOutputTrackerMasterEndpoint stopped! (30ms)
07:52:00.329 [..rage.memory.MemoryStore] MemoryStore cleared (43ms)
07:52:00.330 [..rk.storage.BlockManager] BlockManager stopped (1ms)
07:52:00.331 [..rage.BlockManagerMaster] BlockManagerMaster stopped (1ms)
07:52:00.334 [..mmitCoordinatorEndpoint] OutputCommitCoordinator stopped! (3ms)
07:52:00.347 [..ache.spark.SparkContext] Successfully stopped SparkContext (13ms)
07:52:00.361 [..til.ShutdownHookManager] Shutdown hook called (14ms)
07:52:00.361 [..til.ShutdownHookManager] Deleting directory C:\Users\thadg\AppData\Local\Temp\spark-cb4a0eff-5aff-4208-976d-8c9994e3be3d (0ms)
07:52:03.202 [..erver.AbstractConnector] Stopped ServerConnector@7fad8c79{HTTP/1.1,[http/1.1]}{127.0.0.1:3333} (2841ms)
07:52:03.203 [..se.jetty.server.session] node0 Stopped scavenging (1ms)
07:52:03.251 [...handler.ContextHandler] Stopped o.e.j.w.WebAppContext@4e928fbf{OpenRefine,/,null,UNAVAILABLE}{E:\GitHub Repos\openrefine\main\webapp} (48ms)
Terminate batch job (Y/N)? y
When it first starts for me and runs OK... I notice a few other things in the console: do we need winutils.exe for Hadoop in the Windows build/package?
E:\GitHub Repos\openrefine>set CLASSPATH="server\classes;server\target\lib\*"
E:\GitHub Repos\openrefine>"E:\Java\jdk-11.0.6.10-hotspot\\bin\java.exe" -cp "server\classes;server\target\lib\*" -Xms1400M -Xmx8000M -Drefine.memory=8000M -Drefine.max_form_content_size=1048576 -Drefine.port=3333 -Drefine.host=127.0.0.1 -Drefine.webapp=main\webapp -Djava.library.path=server\target\lib/native/windows org.openrefine.Refine
07:56:13.155 [...eclipse.jetty.util.log] Logging initialized @480ms to org.eclipse.jetty.util.log.Slf4jLog (0ms)
07:56:13.165 [ refine_server] Starting Server bound to '127.0.0.1:3333' (10ms)
07:56:13.166 [ refine_server] refine.memory size: 8000M JVM Max heap: 8388608000 (1ms)
07:56:13.202 [ refine_server] Initializing context: '/' from 'E:\GitHub Repos\openrefine\main\webapp' (36ms)
07:56:13.240 [..pse.jetty.server.Server] jetty-9.4.26.v20200117; built: 2020-01-17T12:35:33.676Z; git: 7b38981d25d14afb4a12ff1f2596756144edf695; jvm 11.0.6+10 (38ms)
07:56:17.137 [..dardDescriptorProcessor] NO JSP Support for /, did not find org.eclipse.jetty.jsp.JettyJspServlet (3897ms)
07:56:17.146 [..se.jetty.server.session] DefaultSessionIdManager workerName=node0 (9ms)
07:56:17.147 [..se.jetty.server.session] No SessionScavenger set, using defaults (1ms)
07:56:17.149 [..se.jetty.server.session] node0 Scavenging every 660000ms (2ms)
07:56:17.186 [...handler.ContextHandler] Started o.e.j.w.WebAppContext@4e928fbf{OpenRefine,/,file:///E:/GitHub%20Repos/openrefine/main/webapp/,AVAILABLE}{E:\GitHub Repos\openrefine\main\webapp} (37ms)
07:56:17.206 [..erver.AbstractConnector] Started ServerConnector@7fad8c79{HTTP/1.1,[http/1.1]}{127.0.0.1:3333} (20ms)
07:56:17.207 [..pse.jetty.server.Server] Started @4538ms (1ms)
07:56:17.208 [ refine_server] Failed to use jdatapath to detect user data path: resorting to environment variables (1ms)
07:56:17.209 [ refine_server] Failed to use jdatapath to detect user data path: resorting to environment variables (1ms)
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/E:/GitHub%20Repos/openrefine/main/webapp/WEB-INF/lib/spark-unsafe_2.12-2.4.4.jar) to method java.nio.Bits.unaligned()
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
07:56:18.635 [..ache.spark.SparkContext] Running Spark version 2.4.4 (1426ms)
07:56:18.887 [..p.util.NativeCodeLoader] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable (252ms)
07:56:18.926 [..pache.hadoop.util.Shell] Failed to locate the winutils binary in the hadoop binary path (39ms)
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:378)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:393)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:386)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:116)
at org.apache.hadoop.security.Groups.<init>(Groups.java:93)
at org.apache.hadoop.security.Groups.<init>(Groups.java:73)
In the last version you sent I can't import a TSV anymore, I get the following errors:
15:10:14.118 [ refine] POST /command/core/create-importing-job (6ms)
15:10:14.130 [ refine] POST /command/core/importing-controller (12ms)
15:10:14.177 [..etty.server.HttpChannel] /command/core/importing-controller (47ms)
java.lang.NullPointerException
at org.openrefine.importing.ImportingUtilities.rankFormats(ImportingUtilities.java:898)
at org.openrefine.importing.ImportingUtilities.loadDataAndPrepareJob(ImportingUtilities.java:149)
at org.openrefine.importing.DefaultImportingController.doLoadRawData(DefaultImportingController.java:119)
at org.openrefine.importing.DefaultImportingController.doPost(DefaultImportingController.java:88)
at org.openrefine.commands.importing.ImportingControllerCommand.doPost(ImportingControllerCommand.java:67)
at org.openrefine.RefineServlet.service(RefineServlet.java:201)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at org.eclipse.jetty.servlet.ServletHolder$NotAsyncServlet.service(ServletHolder.java:1395)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:755)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1617)
at org.mortbay.servlet.UserAgentFilter.doFilter(UserAgentFilter.java:78)
at org.mortbay.servlet.GzipFilter.doFilter(GzipFilter.java:131)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1596)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:545)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:590)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1607)
at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1297)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:485)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1577)
at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1212)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
at org.eclipse.jetty.server.Server.handle(Server.java:500)
at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:383)
at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:547)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:375)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:270)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
15:10:15.126 [ refine] POST /command/core/get-importing-job-status (949ms)
15:10:16.126 [ refine] POST /command/core/get-importing-job-status (1000ms)
15:10:17.126 [ refine] POST /command/core/get-importing-job-status (1000ms)
15:10:18.126 [ refine] POST /command/core/get-importing-job-status (1000ms)
15:10:19.127 [ refine] POST /command/core/get-importing-job-status (1001ms)
15:10:20.127 [ refine] POST /command/core/get-importing-job-status (1000ms)
15:10:21.127 [ refine] POST /command/core/get-importing-job-status (1000ms)
15:10:22.126 [ refine] POST /command/core/get-importing-job-status (999ms)
15:10:23.126 [ refine] POST /command/core/get-importing-job-status (1000ms)
15:10:24.127 [ refine] POST /command/core/get-importing-job-status (1001ms)
15:10:25.127 [ refine] POST /command/core/get-importing-job-status (1000ms)
15:10:26.127 [ refine] POST /command/core/get-importing-job-status (1000ms)
15:10:27.127 [ refine] POST /command/core/get-importing-job-status (1000ms)
15:10:28.127 [ refine] POST /command/core/get-importing-job-status (1000ms)
15:10:29.125 [ refine] POST /command/core/get-importing-job-status (998ms)
15:10:30.125 [ refine] POST /command/core/get-importing-job-status (1000ms)
15:10:31.128 [ refine] POST /command/core/get-importing-job-status (1003ms)
^C15:10:31.232 [..ache.spark.SparkContext] Invoking stop() from shutdown hook (104ms)
@fpompermaier I just ran a local build again and imported a TSV just fine. Will you be able to get yourself set up with a dev environment soon, so you can pull the spark-prototype branch directly and do builds and testing?
@fpompermaier could it be because you are importing a file? Only the clipboard import is supposed to work so far.
@wetneb Doh! I thought you said you added import/export?
I have added CSV/TSV export just for the sake of having an exporter too.
Oh, so just export. And to load, we have to use clipboard. Correct?
OK, the clipboard doesn't help for loading large datasets... I was able to preview this large CSV, but loading it was still creating tons of Spark job stages, so there is something about loading large files no matter what: issue #2314
Yes, again, at this stage I am primarily looking for feedback on the overall architecture: no performance optimizations have been done yet, so in many cases performance will be worse than in OpenRefine 3.x.
The world is moving faster than we can keep up!
20x faster than Apache Spark (with GPU hardware if users have it)
https://databricks.com/session/accelerating-machine-learning-workloads-and-apache-spark-applications-via-cuda-and-nccl
https://docs.blazingdb.com/docs
https://www.infoq.com/news/2020/02/apache-spark-gpus-rapids/
Hmm, it might be interesting for a student to try a POC with BlazingSQL or other open source software, if they have an NVIDIA GPU like I do.
I am not sure this would be very suitable for student projects - at least not if they want to base it on my current work on spark, since it is far from finished.
But why not as a research project and POC? I mean, they could do the same as you have done. I've seen incredible talent out there, and have even pulled some of them into Ericsson later as full-time employees. I am aware there's a lot to dive into here, but I'd rather encourage research in this space if there's a desire. It's an idea, and if someone is talented enough or has the passion... why hold them back?
As to the question of whether we want to classify this as valuable research for the OpenRefine project or not: I think it's valuable, and even folks like Siren and @fpompermaier might be interested here, which is why I brought it up.
I will not hold anyone back from playing with this - I would just not advertise it as a project that would be appropriate for someone who just starts to discover the project.
Ah, my advertisement on the Gitter chat room... OK, sure, I can delete those comments on there. You are probably right.
DONE.
Appreciate the feedback, Antonin, I really do, thanks!
@wetneb @fpompermaier So this is interesting... Apache Beam now has schema support for handling type information.
With the introduction of schemas, a new format for handling type information, Beam is heading in a similar direction as Flink with its type system which is essential for the Table API or SQL. Speaking of, the next Flink release will include a Python version of the Table API which is based on the language portability of Beam. Looking ahead, the Beam community plans to extend the support for interactive programs like notebooks. TFX, which is built with Beam, is a very powerful way to solve many problems around training and validating machine learning models.
I haven't looked deeply into it, but it was mentioned in the conclusion at the bottom of this article: https://flink.apache.org/ecosystem/2020/02/22/apache-beam-how-beam-runs-on-top-of-flink.html
Sorry for the late feedback... Yes, Apache Beam has a very interesting feature called the SDK Harness that allows external code to be called "natively" with very low overhead (apart from the gRPC serialization cost between functions). I agree with @wetneb that this project is very complex and requires a huge effort to be fulfilled. Unfortunately this year I have a very tight schedule, so I don't have much time to write code, but I could contribute to some discussions if you want (maybe also with some small contributions, hopefully, but I can't guarantee anything).
Moreover, the Apache Beam SDK is very different from Spark and Flink; I've tried it a couple of times and didn't like it at all. By contrast, I'm very experienced with Flink and have worked with it a lot (since 2012).
That said, Apache Beam is (at least theoretically) a perfect fit for this task, because you could then easily switch from Spark to Flink or whatever (see the sketch below), but you also have to take into account that not every feature is supported by every engine. I would undertake Apache Beam only if we get some sort of support from the Google guys (if I'm not wrong, @thadguidry said there was some interest from Google in this direction); otherwise it's probably worth continuing the work that @wetneb has drafted.
However, I think it's a very heavy task for a single person.
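To make the engine-switching point concrete, here is a very rough, untested sketch of what replaying a single history operation as a Beam transform could look like. The "trim every cell" operation and the file names are made-up stand-ins, not anything that exists in OpenRefine today; the same code runs on the Direct, Spark or Flink runner just by passing a different --runner option:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class ReplayOperationOnBeam {
    public static void main(String[] args) {
        // The runner (DirectRunner, SparkRunner, FlinkRunner, ...) is chosen via
        // --runner=... on the command line; the pipeline code itself does not change.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        PCollection<String> rows = p.apply("ReadRows", TextIO.read().from("project-grid.csv"));

        // Hypothetical stand-in for one OpenRefine history operation:
        // trim whitespace in every cell of a naively-split CSV line.
        PCollection<String> trimmed = rows.apply("TrimCells", ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                String[] cells = c.element().split(",", -1);
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < cells.length; i++) {
                    if (i > 0) sb.append(',');
                    sb.append(cells[i].trim());
                }
                c.output(sb.toString());
            }
        }));

        trimmed.apply("WriteRows", TextIO.write().to("project-grid-trimmed"));
        p.run().waitUntilFinish();
    }
}
```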
Spark SQL is a newer API than the RDD one, but it still allows interoperability with RDDs, working with DataFrame functions, and creating your own untyped user-defined aggregate functions as well as type-safe versions (a rough interop sketch follows below). I see a lot in the Spark SQL API that I like. I think this requires much more investigation.
@wetneb Additionally, I've left a few more comments on the Spark instructions doc https://hackmd.io/f8czl6cNT4uxgvAMwI-NkA?view
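Here is the interop sketch I mean; the schema, the sample values and the class name are invented purely for illustration, and nothing here is wired into OpenRefine:

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class RddDataFrameInteropSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("rdd-dataframe-interop").master("local[*]").getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Pretend these are simplified project rows coming from the RDD layer.
        JavaRDD<Row> rdd = jsc.parallelize(Arrays.asList(
                RowFactory.create("Aachen", 245885L),
                RowFactory.create("Berlin", 3644826L)));

        // Invented schema, just to show the RDD -> DataFrame direction.
        StructType schema = new StructType()
                .add("city", DataTypes.StringType)
                .add("population", DataTypes.LongType);
        Dataset<Row> df = spark.createDataFrame(rdd, schema);

        // Once it is a DataFrame, users (or the UI) could run plain SQL over it.
        df.createOrReplaceTempView("grid");
        spark.sql("SELECT city FROM grid WHERE population > 1000000").show();

        // ...and back again, if an operation is easier to express on the RDD side.
        JavaRDD<Row> backToRdd = df.javaRDD();
        System.out.println(backToRdd.count());

        spark.stop();
    }
}
```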
Hi Thad, using the SQL API layer of Spark is blocked by their lack of support for User Defined Types (UDT). The UDT API has been private for years and unfortunately this is unlikely to change soon.
Concretely, why would you want us to use this API? Which user workflows are you thinking about? Because we would need to use UDTs to embed cell data in DataFrames, SQL querying of project data would probably be fairly tedious for users.
For now I am sticking to the RDD layer. If UDTs become public later on, it should be easy to migrate to Datasets (SQL) from here. Going the other way around is more complicated since the RDD API is simpler. Also, sticking to RDDs makes it potentially possible to migrate to Beam, where the RDD runner is more mature than the Dataset one.
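To illustrate the constraint: at the RDD layer we can keep arbitrary Java objects as values without any UDT machinery. The Cell and RowData classes below are hypothetical, heavily simplified stand-ins (not the real model classes), just to show the shape of the thing:

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class RddCellSketch {
    // Hypothetical, heavily simplified stand-ins for the real model classes.
    static class Cell implements Serializable {
        final Object value;   // string, number, date, ...
        final String reconId; // reconciliation info attached to the cell, possibly null
        Cell(Object value, String reconId) { this.value = value; this.reconId = reconId; }
    }

    static class RowData implements Serializable {
        final List<Cell> cells;
        RowData(List<Cell> cells) { this.cells = cells; }
    }

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "rdd-cells");

        // Grid as (rowIndex, row) pairs: the values are plain Java objects, which
        // the RDD API accepts directly, whereas a Dataset would need a (currently
        // private) UDT to hold a Cell.
        JavaPairRDD<Long, RowData> grid = JavaPairRDD.fromJavaRDD(sc.parallelize(Arrays.asList(
                new Tuple2<>(0L, new RowData(Arrays.asList(new Cell("Douglas Adams", "Q42")))),
                new Tuple2<>(1L, new RowData(Arrays.asList(new Cell("Ada Lovelace", null)))))));

        // A row-wise operation is just a map over the values.
        JavaPairRDD<Long, RowData> upperCased = grid.mapValues(row -> new RowData(
                row.cells.stream()
                        .map(c -> new Cell(c.value instanceof String
                                ? ((String) c.value).toUpperCase() : c.value, c.reconId))
                        .collect(Collectors.toList())));

        System.out.println(upperCased.count());
        sc.stop();
    }
}
```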
It would make it easier to expose pivoting, unions, and joins (see the sketch at the end of this comment): https://databricks.com/blog/2018/11/01/sql-pivot-converting-rows-to-columns.html
But maybe we can do something similar with RDDs, some Java-fu, and dialogs?
Exposing a SQL query opens up the doors for lots of folks in GLAM.
When you say "cell data", I think your stuck in a corner of "strongly typed objects in Java".
You and I both know there are ways to break out of that rut.
I really think we should have a team call on this issue and pull Tom in. Commenting is just not letting my Data Architect world align well with your Programmatic expertise. And I think a conference call will help both sides understand better.
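To make the pivot point above concrete, here is a minimal, self-contained sketch. The columns and values are invented, and it assumes the grid could one day be exposed as a DataFrame, which is exactly the open question:

```java
import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PivotSketch {
    // Simple bean so createDataFrame can infer a schema from it.
    public static class Sale implements Serializable {
        private String country;
        private int year;
        private long amount;
        public Sale() {}
        public Sale(String country, int year, long amount) {
            this.country = country; this.year = year; this.amount = amount;
        }
        public String getCountry() { return country; }
        public void setCountry(String country) { this.country = country; }
        public int getYear() { return year; }
        public void setYear(int year) { this.year = year; }
        public long getAmount() { return amount; }
        public void setAmount(long amount) { this.amount = amount; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("pivot-sketch").master("local[*]").getOrCreate();

        // Invented long-format data, just to show the SQL-style reshaping.
        Dataset<Row> sales = spark.createDataFrame(Arrays.asList(
                new Sale("UK", 2018, 10L),
                new Sale("UK", 2019, 12L),
                new Sale("FR", 2018, 7L)), Sale.class);

        // One call turns rows into columns, the kind of reshaping GLAM folks ask for.
        sales.groupBy("country").pivot("year").sum("amount").show();

        spark.stop();
    }
}
```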
OK, thanks @wetneb for today's explanation prior to our call with Samuel Klein (The Underlay). I now understand that you have fewer concerns than I do about changing the shape of data through RDD transformations. If we were to address separation of concerns by partitioning data into subsets like GridState, Recon, and Metadata (instead of single objects for cell values), then there would be a cost to joining and updating those subsets, but with Spark that cost is very minimal compared to a traditional RDBMS. But since you have minimal concerns with RDDs for now, as you explained, I will trust you fully here, and we can continue the current approach in the spark-prototype branch.
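For reference, a rough, runnable sketch of the join cost in question, with hypothetical grid/recon pair RDDs keyed by row index (none of this reflects the actual classes in the spark-prototype branch):

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class JoinCostSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "join-cost");

        // Hypothetical split of project data into separate keyed subsets.
        JavaPairRDD<Long, String> grid = JavaPairRDD.fromJavaRDD(sc.parallelize(Arrays.asList(
                new Tuple2<>(0L, "Douglas Adams"),
                new Tuple2<>(1L, "Ada Lovelace"))));
        JavaPairRDD<Long, String> recon = JavaPairRDD.fromJavaRDD(sc.parallelize(Arrays.asList(
                new Tuple2<>(0L, "Q42"),
                new Tuple2<>(1L, "Q7259"))));

        // Any operation that needs both subsets pays a join: cheap compared to an
        // RDBMS, but still a shuffle unless the two RDDs share a partitioner.
        JavaPairRDD<Long, Tuple2<String, String>> joined = grid.join(recon);
        joined.collect().forEach(t ->
                System.out.println(t._1() + " -> " + t._2()._1() + " / " + t._2()._2()));

        sc.stop();
    }
}
```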
@thadguidry I am not sure about pinning this (and the SDC issue): for me, pinned issues should be general issues people should be aware of (for instance, a repository looking for a maintainer, or a very important issue people keep reporting by creating duplicates). This particular issue is not particularly insightful either, and we should not try to drag new contributors to it, as they do not really have a way to engage with my effort from it…
@wetneb Sure. Makes sense. Appropriate how you need then. I was just kinda trying it out to see the viz.
@wetneb Does it make sense to bump the spark-prototype branch to use Apache Spark 3.0, which just got released? https://github.com/OpenRefine/OpenRefine/blob/ce300cdcbbfe6c01dc9cf3def6fcf98c876ded2b/or-spark/pom.xml#L18
I have been thinking about it, I will definitely do that before a release - do not expect anything significantly different though, the RDD API has remained pretty much the same AFAICT.
@wetneb Concerning bigger data and joins... I found this interesting application of using IR and historical workflows to build computational graphs that determine the logic of UDFs, which can then be used as strategies for automatic persistent partitioning. My thought is that this might be useful later for building good partitioning strategies for workloads that reconcile against Wikidata and other large data such as Commons, Wikipedia, and institutional metadata, just to name a few.
https://arxiv.org/pdf/2006.16529.pdf
Of course, getting workflow history from clients would be an OPT-IN, but it could provide useful knowledge for those working with Wikidata and doing large-scale processing and analysis with OpenRefine, allowing automatic partitioning so performance is closer to optimal. Interestingly, the paper mentions Juneau for Jupyter Notebooks, which a friend mentioned in passing, but I have no experience with it. Anyway, food for thought on a possible future state for OpenRefine on Spark.
@wetneb Hmm, it looks like we still need to contend with Hadoop's binaries, as I described in my comment above:
10:21:28.400 [..pache.hadoop.util.Shell] Failed to locate the winutils binary in the hadoop binary path (39ms)
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
A large CSV file imported on my Windows 10 machine, but notice the Hadoop exception after it completed all the stages and the final job.
We can see that we'll also need to deal with task reduction later, but for now we need to focus on the Hadoop binaries and the best way to package them as part of the Refine build process.
Any thoughts on how best to incorporate the Hadoop binaries into the build for any kind of user, including me on Windows? (A possible workaround sketch follows after the log below.)
11:12:29.705 [ refine] GET /command/core/get-models (27ms)
11:12:29.731 [ refine] POST /command/core/get-all-preferences (26ms)
11:12:29.800 [ refine] GET /command/core/get-history (69ms)
11:12:29.804 [ refine] POST /command/core/get-rows (4ms)
11:12:29.816 [ refine] GET /command/core/get-history (12ms)
11:12:37.570 [..cheduler.TaskSetManager] Stage 182 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (7754ms)
11:13:11.944 [..cheduler.TaskSetManager] Stage 183 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (34374ms)
11:13:20.840 [..cheduler.TaskSetManager] Stage 184 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (8896ms)
11:13:56.412 [..cheduler.TaskSetManager] Stage 185 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (35572ms)
11:14:30.975 [..cheduler.TaskSetManager] Stage 186 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (34563ms)
11:14:39.780 [..cheduler.TaskSetManager] Stage 187 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (8805ms)
11:14:48.543 [..cheduler.TaskSetManager] Stage 188 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (8763ms)
11:14:57.057 [..cheduler.TaskSetManager] Stage 189 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (8514ms)
11:15:06.596 [..cheduler.TaskSetManager] Stage 190 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (9539ms)
11:15:15.281 [..cheduler.TaskSetManager] Stage 191 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (8685ms)
11:15:24.048 [..cheduler.TaskSetManager] Stage 192 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (8767ms)
11:15:32.443 [..cheduler.TaskSetManager] Stage 193 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (8395ms)
11:15:41.209 [..cheduler.TaskSetManager] Stage 194 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (8766ms)
11:15:49.819 [..cheduler.TaskSetManager] Stage 195 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (8610ms)
11:15:58.609 [..cheduler.TaskSetManager] Stage 196 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (8790ms)
11:16:07.505 [..cheduler.TaskSetManager] Stage 197 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (8896ms)
11:16:17.914 [..cheduler.TaskSetManager] Stage 198 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (10409ms)
11:16:26.981 [..cheduler.TaskSetManager] Stage 199 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (9067ms)
11:16:36.050 [..cheduler.TaskSetManager] Stage 200 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (9069ms)
11:16:44.498 [..cheduler.TaskSetManager] Stage 201 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (8448ms)
11:16:53.376 [..cheduler.TaskSetManager] Stage 202 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (8878ms)
11:17:02.339 [..cheduler.TaskSetManager] Stage 203 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (8963ms)
11:17:11.565 [..cheduler.TaskSetManager] Stage 204 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (9226ms)
11:17:20.203 [..cheduler.TaskSetManager] Stage 205 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (8638ms)
11:17:28.906 [..cheduler.TaskSetManager] Stage 206 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (8703ms)
11:17:37.573 [..cheduler.TaskSetManager] Stage 207 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (8667ms)
11:21:40.251 [..cheduler.TaskSetManager] Stage 208 contains a task of very large size (257401 KB). The maximum recommended task size is 100 KB. (242678ms)
11:22:09.829 [..spark.executor.Executor] Exception in task 2.0 in stage 208.0 (TID 260) (29578ms)
java.io.IOException: (null) entry in command string: null chmod 0644 C:\Users\thadg\AppData\Roaming\OpenRefine\2164326080469.project\initial\grid\_temporary\0\_temporary\attempt_20210105112131_0108_m_000002_0\part-00002.gz
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:762)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:859)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:842)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:661)
at org.apache.hadoop.fs.ChecksumFileSystem$1.apply(ChecksumFileSystem.java:501)
at org.apache.hadoop.fs.ChecksumFileSystem$FsOperation.run(ChecksumFileSystem.java:482)
at org.apache.hadoop.fs.ChecksumFileSystem.setPermission(ChecksumFileSystem.java:498)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:467)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135)
at org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.initWriter(SparkHadoopWriter.scala:230)
at org.apache.spark.internal.io.SparkHadoopWriter$.executeTask(SparkHadoopWriter.scala:120)
at org.apache.spark.internal.io.SparkHadoopWriter$.$anonfun$write$1(SparkHadoopWriter.scala:83)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:830)
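Regarding the packaging question above, here is one possible workaround sketch (nothing here is wired into the build; the helper and the bundled directory layout are hypothetical): ship the Hadoop Windows binaries with the Windows package and point hadoop.home.dir at them before the SparkContext is created, since Hadoop's Shell resolves %HADOOP_HOME%\bin\winutils.exe from that system property or from the HADOOP_HOME environment variable.

```java
import java.io.File;

public class WindowsHadoopSetup {

    /**
     * Hypothetical helper, to be called before the SparkContext is created.
     * 'nativeDir' would be a directory bundled with the Windows package that
     * contains bin\winutils.exe (and hadoop.dll).
     */
    public static void configure(File nativeDir) {
        boolean isWindows = System.getProperty("os.name", "").toLowerCase().contains("win");
        boolean alreadySet = System.getProperty("hadoop.home.dir") != null
                || System.getenv("HADOOP_HOME") != null;
        if (isWindows && !alreadySet && new File(nativeDir, "bin/winutils.exe").exists()) {
            // Hadoop's Shell class looks for %HADOOP_HOME%\bin\winutils.exe using
            // either the HADOOP_HOME environment variable or this system property.
            System.setProperty("hadoop.home.dir", nativeDir.getAbsolutePath());
        }
    }
}
```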