(Pardon my poor English)
After using DL4J and some parts of DataVec (namely image loading and transforming), I would like to share my experience. I hope the feedback is useful.
I started out by reading and modifying the examples available on GitHub. The documentation is really evolving into an excellent source of information, and keeps lowering the threshold for beginners.
But it proved pretty hard to build an ML pipeline for my production environment. For example, I wanted to modify (or at least understand the details of) the image loading capabilities of DataVec. Although it is very powerful and fast, there are so many indirections and layers of abstraction that building something custom was very time consuming.
There is the OpenCV helper, the ND4J concepts, the Writable abstractions, and the layers of Readers.
There is little guidance from the types either. Everything is basically an INDArray (or Mat, or Writable).
A DataSet may have labels, and a DataSetIterator iterates over datasets, which in turn have minibatches.
To read images, convert them to matrices and feed them to a NN, you must know about:
RecordDataSetIterator
ImageRecordReader
ParentPathLabelGenerator
FileSplit
ImageTransform
ImagePreProcessingScaler
Now in the end all of these are required in one way or another, but today you need pretty intimate knowledge of every part to understand how they work together.
One particular case I wanted to explore was feeding the histograms of images into the network. I wasn't (then) able to reuse the parts that were useful, so I ended up writing the pipeline from scratch, creating temp files and such (which probably was a good idea, but can maybe be solved more elegantly).
But what if I could have approached the problem more naïvely? After all, I have input, I want to transform it, and I want to feed it into the network.
Input into a Neural net could just be a Stream
Minibatching could be a question of how many items you pull from the stream.
Now, the stream would have to be read repeatedly, so some stream provider could form the input interface: RecordProvider
Providing input to the neural net would then be a simple matter of mapping the contents of the stream to ultimately emit Records.
A DSL could look like this:
Supplier
RecordProvider
RecordProvider.from(paths)
.map(path->Imageloader.load(path))
.map(img->ImageTransform.crop(img,200,200))
.map(img->new Record(img.flatten,img.getPath.filename))
FittedRecordProvider
TrainedNetwork tn = fitted.train(nn);
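To make the idea of a re-readable stream with pull-based minibatching a bit more concrete, here is a minimal sketch in plain Java 8. It is only an illustration under assumptions: this `RecordProvider` class, its `from`, `map` and `epoch` methods, and the use of strings instead of images are all made up for the example and are not part of DL4J or DataVec.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Stream;

// Hypothetical sketch: none of these names exist in DL4J or DataVec.
final class RecordProvider<T> {
    // A Supplier is used because a Java stream can only be consumed once;
    // every epoch asks the supplier for a fresh stream.
    private final Supplier<Stream<T>> source;

    private RecordProvider(Supplier<Stream<T>> source) {
        this.source = source;
    }

    static <T> RecordProvider<T> from(List<T> items) {
        return new RecordProvider<T>(items::stream);
    }

    // Transformations are only recorded here; they run each time the stream is reopened.
    <R> RecordProvider<R> map(Function<? super T, ? extends R> f) {
        return new RecordProvider<R>(() -> source.get().map(f));
    }

    // One pass over the data, cut into minibatches by pulling batchSize items at a time.
    List<List<T>> epoch(int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        try (Stream<T> stream = source.get()) {
            Iterator<T> it = stream.iterator();
            while (it.hasNext()) {
                List<T> batch = new ArrayList<>(batchSize);
                while (it.hasNext() && batch.size() < batchSize) {
                    batch.add(it.next());
                }
                batches.add(batch);
            }
        }
        return batches;
    }
}
```

Calling `epoch(2)` on a provider with five items would give three batches (2, 2 and 1 items), and calling it again replays the same transformations on a new stream. A real version would probably hand out batches lazily instead of collecting them into a list.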
This approach is simple to learn (basic stream architecture from Java 8) but can be developed further.
If one wants to reuse the training pipeline to test incoming values, the transformation pipeline should be refactored out into its own component. It could look like this:
Pipeline
Pipeline.newPipeline()
.map(path->Imageloader.load(path))
.map(img->ImageTransform.crop(img,200,200))
.map(img->new Record(img.flatten,img.getPath.filename))
Model
pipeline.fit(paths).train(nn);
Score s = model.score(path);
Further improvements could be made: what if the test inputs are images that come over the wire? They cannot be converted through the same pipeline!
We could split the pipeline in two:
Pipeline
Pipeline.identity()
.map(path->Imageloader.load(path))
.label(image -> new Label(image.getPath()))
Pipeline
Pipeline.newPipeline()
.map(img->ImageTransform.crop(img,200,200))
.map(img->new Record(img.flatten,img.label))
Pipeline
Pipeline.newPipeline()
.map(bytes->new BufferedImage(bytes))
.map(bi->new LabelledImage(bi, label))
Model m = loadPipeline.then(imgTransform).fit(paths).train(nn)
Score s = m.score(imgDeserialize.then(imgTransform).transform(image))
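The splitting and re-joining of pipelines can be sketched with nothing more than function composition. The `Pipeline` class below is hypothetical (it only mirrors the names used in the pseudocode above), and the steps work on strings and numbers as stand-ins for image loading and cropping.

```java
import java.util.function.Function;

// Hypothetical sketch: Pipeline is not an existing DL4J or DataVec type.
final class Pipeline<A, B> {
    private final Function<A, B> steps;

    private Pipeline(Function<A, B> steps) {
        this.steps = steps;
    }

    static <A> Pipeline<A, A> identity() {
        return new Pipeline<A, A>(Function.identity());
    }

    // Append a single transformation step.
    <C> Pipeline<A, C> map(Function<? super B, ? extends C> f) {
        return new Pipeline<A, C>(steps.andThen(f));
    }

    // Append a whole pipeline, so that a shared tail can follow different heads.
    <C> Pipeline<A, C> then(Pipeline<? super B, ? extends C> next) {
        return new Pipeline<A, C>(steps.andThen(next.steps));
    }

    B transform(A input) {
        return steps.apply(input);
    }
}
```

With this shape, a "load from path" head and a "deserialize from bytes" head can both be followed by the same transformation tail:

```java
Pipeline<String, Integer> load = Pipeline.<String>identity().map(String::length);
Pipeline<byte[], Integer> deserialize = Pipeline.<byte[]>identity().map(b -> b.length);
Pipeline<Integer, Double> shared = Pipeline.<Integer>identity().map(n -> n / 2.0);

double fromDisk = load.then(shared).transform("abcd");
double fromWire = deserialize.then(shared).transform(new byte[6]);
```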
I realize that there are many obstacles to solve, and also that this approach is very similar to Spark and Flink. But I believe a simple and fluent API has a strong case: it's excellent for learning and prototyping, and lowers the threshold for putting the model into production too.
In some sense this approach is an extension to the fluent and simple builder approach you have to building the Networks. I would love to see ND4J going in this direction!
Further reading:
http://nealford.com/memeagora/2013/01/22/why_everyone_eventually_hates_maven.html This is actually a really nice writeup on the concept of _contextual_ vs _composable_, where DataVec's approach today is contextual, but I propose a more composable approach.
https://github.com/treo/ingest/wiki/Original-Idea Paul Dubs' take on the same issue.
I hope this makes sense and is of at least some value.
I actually started a proposal by trying to wrap the existing MultiLayerNetwork, but it took me too much time trying to figure out the inner workings of minibatching. There are also a lot of optimizations at work that I didn't want to mess up, so I stuck to the original DataVec.
Thanks for the thoughts! This does sound like it's going in the right direction. We should include this discussion in the rewrite @AlexDBlack is planning for other parts of DataVec.
BTW, one of the major issues is the different image representations we need to deal with. In particular we need the data in a Mat to use operations from OpenCV and in an INDArray for operations in ND4J. I started working on that in JavaCV using the concept of "frame converters". INDArray could be wrapped up in that framework as well... DataVec sort of does the same thing with Writable. We could start writing "converters" for those, maybe. It's not something anyone has ever tried to do before. People just keep doing it the ad hoc way, or entities like Google just rewrite everything from scratch a la TensorFlow (ta-dah, no need for OpenCV), but I would personally like to make things work together instead.
@saudet I think you could get very far usability-wise if you could model the transformations as Function somehow. And if you could use types to wrap the INDArrays so you get domain semantics, it would read a lot better (and catch an awful lot of bugs). For example, you could have every image type as its own class.
Ok, let's try to start designing this. Most operations from OpenCV need 8-bit image data, but ND4J needs floating point data. What would you do to bridge the gap?
Maybe one could start to build conversion functions that use OpenCV operations, and then apply a ND4J conversion as a final step?
OpenCV conversion functions would then be
Function
and then one could make a ND4J converter
Function
This way you could get the best of both worlds. (Use of OpenCV and easy chaining)
A further step could be to create types for the different versions of images one could have, for example GrayScale, Color8Bit etc.
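As a rough illustration of typed images plus chained converters, here is a sketch that uses plain arrays instead of Mat and INDArray. All names (`Gray8Image`, `FloatImage`, `ImageConverters`) are hypothetical; in a real setup the 8-bit steps would delegate to OpenCV and the final step would produce an INDArray.

```java
import java.util.function.Function;

// Hypothetical domain type for 8-bit grayscale data (the kind most OpenCV operations expect).
final class Gray8Image {
    final int width;
    final int height;
    final byte[] pixels; // one unsigned byte per pixel, row-major

    Gray8Image(int width, int height, byte[] pixels) {
        if (pixels.length != width * height) {
            throw new IllegalArgumentException("pixel count does not match width * height");
        }
        this.width = width;
        this.height = height;
        this.pixels = pixels;
    }
}

// Hypothetical domain type for floating point data (the kind a network expects).
final class FloatImage {
    final int width;
    final int height;
    final float[] values; // scaled to the range [0, 1]

    FloatImage(int width, int height, float[] values) {
        this.width = width;
        this.height = height;
        this.values = values;
    }
}

final class ImageConverters {
    private ImageConverters() {}

    // An 8-bit to 8-bit step; stands in for an OpenCV operation.
    static final Function<Gray8Image, Gray8Image> INVERT = img -> {
        byte[] out = new byte[img.pixels.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) (255 - (img.pixels[i] & 0xFF));
        }
        return new Gray8Image(img.width, img.height, out);
    };

    // The final conversion step; stands in for the ND4J converter.
    static final Function<Gray8Image, FloatImage> TO_FLOAT = img -> {
        float[] out = new float[img.pixels.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = (img.pixels[i] & 0xFF) / 255f;
        }
        return new FloatImage(img.width, img.height, out);
    };
}
```

Because the types differ, `INVERT.andThen(TO_FLOAT)` compiles while `TO_FLOAT.andThen(INVERT)` does not, which is the kind of bug the compiler could catch for free.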
Ok, and say some people need to use BufferedImage, and others Bitmap on Android, and PIX for Tesseract, and Tensor for TensorFlow, and ... well you get the idea. Are you suggesting to write conversion functions for every possible combination of format and type?
A valid point. But do you not already provide different conversions for the different image types today? Would adding typing not just require us to factor out the different functionalities into more fine-grained classes?
I also think that if one provides a simple API, then users will have no trouble providing their own pipelines for TensorFlow etc., or no?
I would be up for allowing people to provide some sort of lambdas if we could persist that in some way. What I'm worried about here is: Everything in datavec should be persistable (and not via some sort of one off java serialization hacks)
@prange The very reason why you're having problems with usability is because of the fact that we don't have all that functionality. If you're up to working on making everything work together, that would be great. :)
@agibsonccc You mean as a json structure for example?
Hm, I am just reading about Spark ML, to get inspiration on how other frameworks have solved this. It is pretty similar to the suggestions here, although they use the Spark data structures. I believe there is great value in having a simple (naïve even, although optimized) implementation included alongside DL4J; it helps understanding, testing and adapting to different use cases that do not want to use Spark (as an example). I'll keep looking into this the next couple of weeks.
Apart from persistence, what are other requirements you would have?
Performance would be the next major requirement :)
Ok here comes a first take on a training pipeline.
The first example is training of a VAE
List<AnomalyDetectInput> data =
prepareData();
RecordSource<AnomalyDetectInput> records =
RecordSources.fromIterable(data);
loadEpoch(records)
.bind(authDataConverter::fit)
.bind(domainConverter ->
loadEpoch(records)
.map(transformer(domainConverter))
.map(transformer(UnlabeledRecord::new))
.map(batch(128))
.map(shuffle())
.map(toDataset())
.apply(anomalyAutoencoderNetwork(domainConverter.size), MLTask::train)
.map(TrainedAnomalyDetector.createModel(domainConverter))
.map(d -> eval(d, data))
.repeat(() -> StopCondition.maxElapedTime(Duration.ofSeconds(20)))
.onExecution(pair -> System.out.println("Score is " + pair.getSecond()))
.repeat(() -> StopCondition.times(10))
.map(Pair::getFirst)
).execute();
The general idea is that epochs are delivered by a vanilla java.util.stream.Stream, encapsulated by a RecordSource.
The actual work is done by a datatype MLTask that, when executed, performs some potentially long-running work. The API for MLTask contains standard manipulation methods like map and bind (could also be called append or andThen) and repeat.
The idea is that one first constructs an MLTask and supplies all the transformations, and then calls execute to actually make the whole thing run, possibly in the background.
In some cases (for example fitting) one needs to iterate through the epoch first, and then use the fitted transformer in some later step; that is what _bind_ is for.
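For readers who have not met this style before, the following is a minimal sketch of a lazy task type with map, bind, repeat and execute. It is not the MLTask used in the examples here; `LazyTask` and its `repeat(int)` signature are assumptions made for illustration, and it runs synchronously rather than in the background.

```java
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch of a lazy task; the real MLTask may be implemented quite differently.
final class LazyTask<A> {
    private final Supplier<A> work;

    private LazyTask(Supplier<A> work) {
        this.work = work;
    }

    static <A> LazyTask<A> of(Supplier<A> work) {
        return new LazyTask<A>(work);
    }

    // Transform the result; nothing runs until execute() is called.
    <B> LazyTask<B> map(Function<? super A, ? extends B> f) {
        return new LazyTask<B>(() -> f.apply(work.get()));
    }

    // Use the result of this task (for example a fitted transformer) to build the next task.
    <B> LazyTask<B> bind(Function<? super A, LazyTask<B>> f) {
        return new LazyTask<B>(() -> f.apply(work.get()).execute());
    }

    // Run everything built so far the given number of times and keep the last result.
    LazyTask<A> repeat(int times) {
        return new LazyTask<A>(() -> {
            A last = null;
            for (int i = 0; i < times; i++) {
                last = work.get();
            }
            return last;
        });
    }

    A execute() {
        return work.get();
    }
}
```

The point of `bind` is visible in the signature: the function receives the finished value of the first task and returns a new task, so a second pass over the data can depend on what the first pass computed.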
@prange sorry for the late response, just saw this - yes as json or the like.
The next example is training a net with convolution layers:
Path home = new File(System.getProperty("user.home")).toPath().resolve("safetynet");
Path train = home.resolve("img/canvas/train");
Path test = home.resolve("img/canvas/test");
int height = 150;
int width = 150;
int channels = 3;
Function<Path, String> getLabel = path ->
path.getParent().getFileName().toString();
loadEpoch(RecordSources.files(train))
.map(transformer(getLabel))
.bind(labels -> collect(categories(), labels, LabellerFitting::fit))
.bind(labeller -> {
MLPipe<Stream<Path>, Stream<DataSet>> loadImagesPipe =
paths ->
paths
.map(transformer(LabelledRecord.toLabelledRecord(getLabel)))
.map(LabelledRecord.transformFeatures(new LoadImage()))
.map(LabelledRecord.transformFeatures(ImageTransformers.randomCrop(width, height, System.currentTimeMillis())))
.map(LabelledRecord.transformFeatures(ImageTransformers.toINDArray()))
.map(LabelledRecord.transformLabels(labeller.labelEncoder()))
.map(batch(128))
.map(toLabelledDataset());
return
loadEpoch(RecordSources.files(train))
.pipe(loadImagesPipe)
.apply(buildNet(width, height, channels, labeller.labels().length()), MLTask::train)
.bind(net ->
Evaluate.evaluate(
loadEpoch(RecordSources.files(test)).pipe(loadImagesPipe).executeAndAwait(),
net,
labeller.labels()
))
.repeat(() -> StopCondition.maxElapedTime(Duration.ofMinutes(10)))
.map(Pair::getSecond);
})
.execute()
.thenAccept(System.out::println);
The API fulfills the performance requirement: it builds around Streams, which are very efficient, so the bottlenecks here are just the transformations provided by the user.
_But_ it fails the serialization requirement miserably. This API is built for maximum flexibility and user-provided transformations, hence it is not possible to supply JSON (or other) serialization out of the box. The solution is either to limit the transformations to supplied functions that have serde instances available to them, or to supply a serialization API that makes it easy for users to embed their own serde instances, or of course some combination.
Being able to serialize this model is of course very important and I'll have to solve that. That being said, I think that a super-simple API built around streams will lower the bar significantly for developers just starting to look into DL4J.
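One way the "limit the transformations to supplied functions" option could look is a registry of named steps, so that the pipeline itself is just an ordered list of names. This is only a sketch under assumptions: `NamedSteps` is a made-up class, the comma-separated string stands in for JSON, and steps without parameters are the easy case.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: steps are referred to by name, so the pipeline is plain data.
final class NamedSteps {
    private final Map<String, UnaryOperator<double[]>> registry = new LinkedHashMap<>();

    void register(String name, UnaryOperator<double[]> step) {
        registry.put(name, step);
    }

    // The serialized form is only the ordered step names; a real version might emit JSON.
    static String serialize(List<String> stepNames) {
        return String.join(",", stepNames);
    }

    // Rebuild an executable pipeline by looking each name up in the registry.
    UnaryOperator<double[]> deserialize(String spec) {
        UnaryOperator<double[]> pipeline = x -> x;
        for (String name : spec.split(",")) {
            UnaryOperator<double[]> step = registry.get(name);
            if (step == null) {
                throw new IllegalArgumentException("Unknown step: " + name);
            }
            UnaryOperator<double[]> previous = pipeline;
            pipeline = x -> step.apply(previous.apply(x));
        }
        return pipeline;
    }
}
```

The trade-off is the one described above: arbitrary lambdas are no longer allowed in the pipeline, only steps that both the training and the production side have registered under the same name.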
@prange Definitely don't disagree with you there - the reason we have this as persistable though is because it means you don't have to worry about rewriting code for production. You're trading developer productivity for write literally only once. That's a hard sell for us right now (it's a huge selling point in our product). It's not a horrible idea - per se. I like the intent and the api is great. What we'd want is something that prevents folks from having to write code twice though.
The main thing that needs to be persistable are the transforms steps. The original source can come from wherever.
Yes, I agree 100%. Without serialization this API is just something devs can play with, but not deploy anywhere. I'll continue to work on this and will post here as I progress...
I think I have found a solution that strikes the balance between flexibility and simplicity:
//Training
PipelineConfig config =
loadEpoch(RecordSources.files(train))
.map(transformer(getLabel))
.bind(labels -> collect(categories(), labels, LabellerFitting::fit))
.bind(labeller -> {
MLPipe<Stream<Path>, Stream<DataSet>> loadImagesPipe =
paths ->
paths
.map(transformer(LabelledRecord.toLabelledRecord(getLabel)))
.map(LabelledRecord.transformFeatures(new LoadImage()))
.map(LabelledRecord.transformFeatures(ImageTransformers.randomCrop(width, height, System.currentTimeMillis())))
.map(LabelledRecord.transformFeatures(ImageTransformers.toINDArray()))
.map(LabelledRecord.transformLabels(labeller.labelEncoder()))
.map(batch(128))
.map(toLabelledDataset());
return
loadEpoch(RecordSources.files(train))
.pipe(loadImagesPipe)
.apply(buildNet(width, height, channels, labeller.labels().size()), MLTask::train)
.bind(net ->
Evaluate.evaluate(
loadEpoch(RecordSources.files(test)).pipe(loadImagesPipe).executeAndAwait(),
net,
labeller.labels()
))
.repeat(() -> StopCondition.maxElapedTime(Duration.ofMinutes(10)))
.map(pair -> new PipelineConfig(pair.getFirst(), pair.getSecond()).set("labels", labeller.labels()));
})
.executeAndAwait();
config.store(configPath);
//Production pipeline (probably in another environment)
PipelineConfig loadedConfig = PipelineConfig.load(configPath);
MLPipe<String, String> classifyPipe =
loadImgAsBase64 ->
loadImgAsBase64
.map(fromClient -> fromClient.substring(22))
.map(ImageTransformers.fromBase64)
.map(ImageTransformers.removeTransparency)
.map(ImageTransformers.convertRGBToInput)
.map(img -> loadedConfig.net.output(img))
.map(new OneHotCategoryLabelDecoder(loadedConfig.getSet("labels")));
//Classification task
String someBase64EncodedImg = "";
String imageClass =
MLTask.value(someBase64EncodedImg).pipe(classifyPipe).executeAndAwait();
The idea is that when the net is "trained" you store it in a PipelineConfig, together with an evaluation and custom config parameters. The config is serialized and shipped to the production environment, where it is used to build the production pipeline.
This way we make the serialization of the net and parameters explicit and obvious. But we keep the flexibility to build pipelines in pure code. When going via the config object it is also obvious that one needs to load some data to be able to build the production pipeline, and it will feel natural to the developer.
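To show the store/load round trip for the custom parameters only, here is a small sketch built on java.util.Properties. It is not the PipelineConfig from the example: `LabelConfig` is a made-up name, it does not store the net or the evaluation, and the comma-separated encoding assumes labels that contain no commas.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch: only the custom parameters of a pipeline config, not the trained net.
final class LabelConfig {
    private final Properties values = new Properties();

    // Assumes that the items contain no commas.
    LabelConfig set(String key, List<String> items) {
        values.setProperty(key, String.join(",", items));
        return this;
    }

    List<String> getList(String key) {
        return Arrays.asList(values.getProperty(key, "").split(","));
    }

    // Written at the end of training.
    void store(Path path) throws IOException {
        try (Writer out = Files.newBufferedWriter(path, StandardCharsets.UTF_8)) {
            values.store(out, "pipeline config");
        }
    }

    // Read when the production pipeline is built.
    static LabelConfig load(Path path) throws IOException {
        LabelConfig config = new LabelConfig();
        try (Reader in = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            config.values.load(in);
        }
        return config;
    }
}
```

The production side then only needs the file and the decoding code, for example `LabelConfig.load(configPath).getList("labels")`, which keeps the hand-over between the two environments explicit.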
I am pretty certain we will use this "pattern" in my current project.
Edit:
My experience with ML in production is that the training and production pipelines are completely different, they only share some common values. I suspect others have different approaches. Our take is that the net is just another function in our domain, and the main concern is to convert data into a format that is useful for that function. I think this might look different when frameworks like Flink and Spark are involved.
/cc @AlexDBlack
@prange so the idea with datavec is you have 1 dsl with several "backends" that basically solves the "training and production are different pipelines" - that comes back to it being persistent. So basically what happens now is you have a production pipeline you "load" a Transform process. You save it during training, load it during production and run it on an "executor" which is basically your execution backend.
The "theory" there is you have your single box model training, you can save the output of that and just "load" it or slightly modify it post load for production.
Aha! Now I understand better the intent of DataVec. I guess my target audience is more towards users in environments where Spark (et al.) is not needed/wanted. (Although to me the approaches look very similar.) For example, in the case above we read training data from disk, but in production the model is used by a plain webserver (with great success btw); we do not need Spark in our environment. Without the confinements of a streaming platform the DataVec API feels limiting and cumbersome in a vanilla Java runtime.
I see that the DataVec API has a lot of convenience for transforming data, and I would guess that most of it can be represented as simple functions that would be reusable in any environment.
(Btw, I am not trying to be rude here. My English is insufficient to express humbleness while trying to provide feedback; I really want to contribute.)
Next steps:
I am probably going to use the approach I outlined above in my current ongoing project, and also in the future to test and build DL4J nets.
I could create a GitHub project separately and maintain and document it there. If it gets noticed or even popular you can consider it again in the future...
It's currently out of scope for me to go deeper into integrations with backends, although I might in the near future (Flink project coming up...)
However, I am not a very big fan of building around frameworks, partly because one has to keep up with updates and versions, partly because a lot of functionality will eventually overlap, and partly because of the loss of flexibility. But as you have said, that's a tradeoff you cannot make at this moment.