Plots2: Machine Learning based projects

Created on 18 Jan 2019 · 66 Comments · Source: publiclab/plots2

Currently, our spam system is completely manual, but
I think that, instead of manually reviewing similar content/posts, we can use
machine learning algorithms to ease the task.

discussion planning

NeuralMonk

Most helpful comment

Hey everyone,
I have done my research on the given ideas and devised the following plan.
@jywarren, it is definitely a good idea to create a new repository for machine learning based projects and to set up a separate server for data analysis that accesses data via the API.

We can host a Flask server that works this way:

It takes a screenshot of the image,
Feeds it to the input of the model,
Takes the output of the model and shows it on the web page.

Goal: Automatically label aerial imagery

Tagging,
Semantic segmentation.

Implementing the machine learning model in simple steps:

Collect pairs of images and labels,
Write a program that predicts labels for given images (the model),
Let the computer automatically tune parameters to mimic the examples (learning).

The lengthy task: collecting pairs of aerial images and labels

One important yet rarely discussed aspect of using machine learning for aerial image interpretation is the source of the data.
Since labelling images is a very time-consuming process, the datasets have been small in both aerial image applications and general image labelling work. Hence, obtaining good sources of accurately labelled data is important for both evaluating existing approaches and training systems that are likely to work under varying conditions.
In some domains, hand-labelling data in order to train a classifier is not necessary because the label information is often readily available. For example, in the case of road detection (Semantic segmentation), the locations of existing roads are typically known because they are useful for navigation and not just as target labels in a machine learning task.
The abundance of accurately labelled data for road detection makes it a very good candidate for evaluating existing aerial image interpretation systems as well as the application of machine learning techniques.

For buildings, Google Maps can provide the locations of a substantial portion of the buildings in almost any major city. This type of data can act as a source of noisy labels, which are correct with very high probability when they indicate the presence of an object and with lower, but still substantially high, probability when they indicate the absence of an object. Training a classifier on large amounts of this type of noisy data with a robust loss function can potentially produce a much better detector than by using a much smaller set of accurate labels. At present, there seem to be no applications of robust estimators to aerial image data with noisy labels.
For object classes such as cars, or for areas where Google Maps has neither accurate nor complete map information, the options seem to be hand-labelling the data or using crowdsourcing tools like Zooniverse (https://www.zooniverse.org/), which can help us build the dataset.

In a classification task, small translations or rotations can be applied to the input images, but in order to apply the same idea to image labelling one must be able to realistically transform both the image and the labels. On a road detection task, applying rotations to each training case before it is processed has been shown to help prevent overfitting.

So we need to start building our own dataset to get better results.
We can do it manually, and I'd like to volunteer to do that with the help of a Python script.
Alternatively, platforms like Zooniverse can be used to create the dataset:
https://help.zooniverse.org/getting-started/

The most important part is the data. A larger and more accurate sample will lead to better results.
The primary obstacle is class imbalance in the dataset, which makes detecting rare labels a difficult task.

Tagging

This is almost the same task as the one I suggested earlier for text. The difference is that the dataset now consists of images, so we need to use a CNN. There is a great blog post by Adit on how CNNs actually work for image classification: https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/

The machine learning model

Residual Network (ResNet), which is a major breakthrough in CNNs:

It allows training models with hundreds of layers for greater accuracy.
Layers compute the residual (delta) between input and output.

Why does it work?

Each layer has less work to do (no copying).
Skip connections allow the gradient to flow more easily.

To understand this more deeply, you can go through this intuitive blog post: https://wiseodd.github.io/techblog/2016/10/13/residual-net/
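
As a rough illustration of the residual idea described above, here is a minimal sketch in plain Python. The function name `residual_block` and the toy `residual_fn` are hypothetical and only stand in for real convolutional layers; the point is that the block outputs the input plus a learned delta, so a layer that learns nothing still passes its input through unchanged.

```python
def residual_block(x, residual_fn):
    """Return x + F(x), where F (residual_fn) only models the delta."""
    delta = residual_fn(x)
    return [xi + di for xi, di in zip(x, delta)]


def zero_residual(x):
    # A layer that has learned nothing: the delta is zero everywhere.
    return [0.0 for _ in x]


def small_residual(x):
    # Toy stand-in for a learned transformation (an assumption for illustration).
    return [0.1 * xi for xi in x]


features = [1.0, -2.0, 3.0]
identity_out = residual_block(features, zero_residual)
adjusted_out = residual_block(features, small_residual)
```

With `zero_residual` the block behaves as the identity, which is why very deep stacks of such blocks remain trainable.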

Our approach to making our model better

1. Instead of softmax, use the sigmoid activation function

2. Optimize tag thresholds to maximize the F2 score

Often we try to find the optimal threshold for the F2 score by trial and error. Instead, a brute-force search on a local validation set can give really good results on the leaderboard (LB) without much overfitting to the local score. Basically, you try every candidate threshold on the local validation set, take the best-performing one, and apply it to the test set.
We also know that the best threshold is vastly different for each class. This means we can get a big improvement by setting a different threshold for each class.
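
The brute-force, per-class threshold search described above could look roughly like this. It is a sketch in plain Python under stated assumptions: the helper names (`fbeta`, `best_threshold`, `per_class_thresholds`), the candidate grid, and the tiny validation data are all made up for illustration, and the scores are assumed to be independent per-tag sigmoid outputs.

```python
def fbeta(y_true, y_pred, beta=2.0):
    """F-beta score for one class; beta=2 weights recall higher than precision."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_threshold(y_true, scores, candidates):
    """Try every candidate threshold and keep the one with the highest F2."""
    best_t, best_f = candidates[0], -1.0
    for t in candidates:
        preds = [s >= t for s in scores]
        f = fbeta(y_true, preds)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f


def per_class_thresholds(rows_true, rows_scores, candidates):
    """Run the search independently for each class (column)."""
    n_classes = len(rows_true[0])
    thresholds = []
    for c in range(n_classes):
        col_true = [row[c] for row in rows_true]
        col_scores = [row[c] for row in rows_scores]
        thresholds.append(best_threshold(col_true, col_scores, candidates)[0])
    return thresholds


# Hypothetical validation data: 4 samples, 2 tags, sigmoid scores per tag.
val_true = [[1, 0], [1, 1], [0, 1], [0, 0]]
val_scores = [[0.90, 0.05], [0.42, 0.80], [0.37, 0.75], [0.10, 0.20]]
grid = [i / 100 for i in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95
thresholds = per_class_thresholds(val_true, val_scores, grid)
```

The thresholds found on the validation set would then be applied unchanged to the test set.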

Using a pretrained model

A very common trick in ML, also known as transfer learning: instead of training your model from a random initialization, you initialize it with the parameters of a similar model that has already been trained on a different dataset. This is basically a great head start.

Simply put, a pre-trained model is a model created by someone else to solve a similar problem. Instead of building a model from scratch, you use the model trained on the other problem as a starting point.

For example, suppose you want to build a self-driving car. You can spend years building a decent image recognition algorithm from scratch, or you can take the Inception model (a pre-trained model) from Google, which was trained on ImageNet data, to identify the objects in your pictures.

A pre-trained model may not be 100% accurate in your application, but it saves the huge effort required to re-invent the wheel.

Augment the labelled dataset using lossless image transformations.

The more data the better, so we can, for example, rotate each image by 90 degrees left and right, which increases the size of our dataset.
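
A minimal sketch of this kind of lossless augmentation, in plain Python on nested lists (a real pipeline would more likely operate on image arrays). The names `rot90_clockwise` and `augment_rot90` are hypothetical. For tagging, the tag set of each rotated copy stays the same; for pixel-wise labels, the optional `mask` argument shows that the label image has to be rotated together with the input.

```python
def rot90_clockwise(grid):
    """Rotate a 2D grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def augment_rot90(image, mask=None):
    """Return the image (and mask, if given) at 0, 90, 180 and 270 degrees."""
    samples = []
    img, msk = image, mask
    for _ in range(4):
        samples.append((img, msk))
        img = rot90_clockwise(img)
        msk = rot90_clockwise(msk) if msk is not None else None
    return samples


tile = [[1, 2],
        [3, 4]]
labels = [[0, 1],
          [0, 0]]
augmented = augment_rot90(tile, labels)
```

Each original sample yields four training samples, and no pixel values are interpolated or lost.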

Tune the learning rate (LR) manually
It is very important to find which LR gives the best performance.
Ensembling of 3 model architectures (optional)
1. ResNet 5x
2. Inception 5x
3. DenseNet 5x

Or we can also do well with "ConvNet101" alone;
it depends on what resources we have.
Ensembling is a good ML approach, but it gives only a small boost in F2 score and takes about 15 times more computation than ConvNet101.

Semantic segmentation

Basically "semantic segmentation" attempts to partition the image into semantically meaningful parts, and to classify each part into one of the pre-determined classes. You can also achieve the same goal by classifying each pixel (rather than the entire image/segment). In that case, you are doing pixel-wise classification, which leads to the same end result but in a slightly different path.
To understand it more deeply, check this very insightful blog post:
https://www.jeremyjordan.me/semantic-segmentation/

ResNet-based FCN architecture
Fine-tune a pre-trained model
Use IR-R-G images as input
Make predictions using a sliding window, because the network can only handle 256x256 inputs
Ensemble by averaging five models
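
The sliding-window step mentioned above can be sketched as follows. This only computes the window coordinates; running the model on each window and merging overlapping predictions is left out. The function names and the stride of 128 are assumptions chosen for illustration, and only the 256x256 window size comes from the plan above.

```python
def window_starts(length, tile, stride):
    """Start offsets so that windows of size `tile` cover the range [0, length)."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)  # clamp the last window to the border
    return starts


def sliding_windows(height, width, tile=256, stride=128):
    """Return (top, left, bottom, right) boxes covering the whole image."""
    boxes = []
    for top in window_starts(height, tile, stride):
        for left in window_starts(width, tile, stride):
            bottom = min(top + tile, height)
            right = min(left + tile, width)
            boxes.append((top, left, bottom, right))
    return boxes


boxes = sliding_windows(600, 512)
```

Overlapping windows (stride smaller than the window size) make it possible to average predictions near window borders, at the cost of more model evaluations.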

Other ideas for future work.

1. Detection of oil spills

Detecting oil spills accurately using a CNN is a very tough task, because some natural phenomena look similar from space and a small sample size does not help. We need SAR images to detect oil spills correctly, because in SAR images an oil spill appears as a dark formation that can be detected easily. The following may prove useful:

Fully convolutional network (FCN)
FCN-GoogleNet
FCN-ResNets
Deep neural autoencoder

2. Detection and mapping of plastic

We should be able to detect plastic with a trained object detection model. While labelling the data, we need to create specific labels for plastic and no-plastic, so that our CNN can learn from thousands of labelled examples of plastic pieces and finally tell what is a piece of plastic and what is not. We may also be able to detect different types of plastic objects, such as rope, toys, etc.

3. Air pollution

When somebody uploads a geotagged image to MapKnitter, we can look up the PM2.5 level and the air quality for that location using the following link: https://aqicn.org/map/india/#@g/19.9884/80.5078/5z
So we can classify whether or not the air is polluted in the given region.

But predicting future air pollution patterns is itself a major machine learning task.

PM2.5 refers to the tiny particles in the air that reduce visibility and cause the air to appear hazy. Its level is affected by meteorological and traffic factors and by the burning of fossil fuels; industrial parameters such as power plant emissions also play a significant role in air pollution.

The required dataset

Temperature
Wind speed
Dew point
Pressure
PM2.5 concentration
Classified data samples (polluted or not)

Our system does two tasks:
1) Detect the level of PM2.5 at a given location
2) Predict the PM2.5 value for a particular date
2.1) Logistic regression to predict whether the air is polluted or not
2.2) Autoregression to predict a future value of PM2.5 based on previous PM2.5 readings
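
To make step 2.2 more concrete, here is a minimal first-order autoregression (AR(1)) sketch in plain Python, fitted by ordinary least squares. The readings are synthetic numbers invented for illustration, not real PM2.5 data, and the function names are hypothetical. A real model would likely use more lags and the weather features listed above.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1].

    Assumes at least three readings that are not all identical.
    """
    x_prev = series[:-1]
    x_next = series[1:]
    n = len(x_prev)
    mean_prev = sum(x_prev) / n
    mean_next = sum(x_next) / n
    cov = sum((a - mean_prev) * (b - mean_next) for a, b in zip(x_prev, x_next))
    var = sum((a - mean_prev) ** 2 for a in x_prev)
    phi = cov / var
    c = mean_next - phi * mean_prev
    return c, phi


def forecast(series, steps, c, phi):
    """Roll the fitted model forward from the last observed reading."""
    value = series[-1]
    predictions = []
    for _ in range(steps):
        value = c + phi * value
        predictions.append(value)
    return predictions


# Synthetic readings generated from x[t] = 10 + 0.5 * x[t-1].
readings = [100.0, 60.0, 40.0, 30.0, 25.0, 22.5]
c, phi = fit_ar1(readings)
next_values = forecast(readings, 2, c, phi)
```

On this synthetic series the fit recovers the generating coefficients, which is a quick sanity check before trying it on real sensor data.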

Since our plan is quite extensive, I'd like to begin working on it as soon as possible. I'd like to invite your input on this. Primarily: should I start the project on Zooniverse, or should I start labelling the data manually?

Thanks, everyone.

NeuralMonk on 27 Feb 2019

🚀1 ❤1 👍1

All 66 comments

Great idea.
@jywarren I want to add a couple more ideas. I know they are not core mission-driven projects, and we must focus on those before addressing these less important issues. But just to brainstorm a little:

[ ] Content Based Tag Recommendation System (suggested by Jeff)
[ ] Anomalous Spam Detection System (as suggested by @SKashyapD)
[ ] Recommendation Systems for posts (@Saurabh19126848_twitter suggestion on Gitter chat)
[ ] Sentiment analysis (@Saurabh19126848_twitter suggestion on Gitter chat)
[ ] Tag Suggestions by Natural Language Processing on nodes (suggested by me)

I am highly in favour of automating our services. The main problem is the absence of ML libraries for Rails. Most of the above are based on algorithms such as Isolation Forest, Naive Bayes, BBN, CNN, ANN etc., which are widely implemented in Python, not in Rails. Writing libraries from scratch does not make sense at all.
So we also need to take this into consideration.

SidharthBansal on 19 Jan 2019

I would love to participate in any of these projects! I worked with Recommendation Systems and Sentiment Analysis during my degree. But I don't know any libraries for Rails, though. I've only worked with libraries in Python and R before.

milaaraujo on 19 Jan 2019

Same here. I would love to work on these projects. Some are in
my current semester curriculum, but they are heavily based on Python and R.

SidharthBansal on 19 Jan 2019

We could set up a Flask server.

NeuralMonk on 19 Jan 2019

Hi everyone, just dropping in here to say that making a Flask server for the data science stuff is the correct approach here. Essentially, you would need a separate server crunching the numbers and acting as an interface to the models. This Flask server would need to run in a separate container, and I volunteer to make the appropriate changes to the docker-compose config to make sure this floats. Looking forward to assisting people in implementing the above cool features on the website.

ryzokuken on 20 Jan 2019

🎉3

Tag Prediction
Suggest tags based on the content of the posts published on the Public Lab
website.

1. Real World / Business Objectives and Constraints
  1.1 Predict as many labels as possible correctly.
  1.2 No strict latency constraint.
  1.3 Cost of errors would be a bad customer experience.
2. Machine Learning problem
  2.1 Data
  Training the machine learning model requires lots of data, which can be
  collected through the API.
  Data Field Explanation
  Id - Unique identifier for each question
  Title - The question's title
  Body - The body of the question
  Tags - The tags associated with the question (all lowercase, should
  not contain tabs '\t' or ampersands '&')
  2.2 Mapping the real-world problem to a Machine Learning Problem
    2.2.1 Type of Machine Learning Problem
    It is a multilabel classification problem.
    Multilabel classification: Multilabel classification assigns to each sample
    a set of target labels. This can be thought of as predicting properties of a
    data-point that are not mutually exclusive, such as topics that are
    relevant for a document. A text might be about any of religion, politics,
    finance or education at the same time or none of these.
    Credit: http://scikit-learn.org/stable/modules/multiclass.html
    2.2.2 Performance metric
    Micro-Averaged F1-Score (Mean F Score): The F1 score can be
    interpreted as a harmonic mean of the precision and recall, where an F1
    score reaches its best value at 1 and worst score at 0. The relative
    contribution of precision and recall to the F1 score are equal. The formula
    for the F1 score is:
    F1 = 2 * (precision * recall) / (precision + recall)
    In the multi-class and multi-label case, this is an average of the F1
    scores of the individual classes, weighted according to the averaging mode.
    'micro f1 score':
    Calculate metrics globally by counting the total true positives, false
    negatives and false positives.
    'macro f1 score':
    Calculate metrics for each label, and find their unweighted mean. This does
    not take label imbalance into account.
    2.2.3 Machine Learning Objectives and Constraints
    1. Maximize the micro-averaged F1 score.
    2. Try out multiple strategies for multi-label classification.
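
The difference between the micro and macro averages described in 2.2.2 can be illustrated with a small plain-Python sketch. The function names and the four example posts are made up for illustration; in practice scikit-learn's `f1_score` with `average='micro'` or `average='macro'` would normally be used instead.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*P*R / (P + R), written in terms of the raw counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def label_counts(y_true, y_pred, label):
    """Count true positives, false positives and false negatives for one tag."""
    tp = fp = fn = 0
    for true_tags, pred_tags in zip(y_true, y_pred):
        in_true = label in true_tags
        in_pred = label in pred_tags
        if in_true and in_pred:
            tp += 1
        elif in_pred:
            fp += 1
        elif in_true:
            fn += 1
    return tp, fp, fn


def micro_macro_f1(y_true, y_pred, labels):
    counts = [label_counts(y_true, y_pred, label) for label in labels]
    # Macro: average the per-label F1 scores, every label counts equally.
    macro = sum(f1_from_counts(*c) for c in counts) / len(labels)
    # Micro: pool the counts over all labels, then compute a single F1.
    total_tp = sum(c[0] for c in counts)
    total_fp = sum(c[1] for c in counts)
    total_fn = sum(c[2] for c in counts)
    micro = f1_from_counts(total_tp, total_fp, total_fn)
    return micro, macro


# Hypothetical posts: true tag sets versus predicted tag sets.
true_tags = [{"water", "air"}, {"water"}, {"water"}, {"air"}]
pred_tags = [{"water"}, {"water"}, {"water"}, {"water"}]
micro, macro = micro_macro_f1(true_tags, pred_tags, ["water", "air"])
```

In this toy case the rare tag "air" is never predicted, which pulls the macro score down much more than the micro score.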

3. Exploratory Data Analysis
3.1 Using Pandas with SQLite to load the data
3.2 Analysis of tags
3.3 Cleaning and preprocessing
1. Sample data points
2. Separate code from body
3. Remove special characters from question title and description
4. Remove stop words
5. Remove HTML tags
6. Convert all characters to lowercase
7. Use SnowballStemmer to stem the words
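
A rough sketch of the cleaning steps in 3.3, using only the Python standard library. The stop-word list is a tiny illustrative subset, code is assumed to be wrapped in `<code>` tags, and the order of the steps is one reasonable choice rather than the only one. Stemming is left out here because SnowballStemmer comes from NLTK, which would be an extra dependency.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a complete list).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "for", "on"}

CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def preprocess(body):
    """Return (cleaned_text, code_blocks) for one post body."""
    code_blocks = CODE_RE.findall(body)   # separate code from body
    text = CODE_RE.sub(" ", body)
    text = TAG_RE.sub(" ", text)          # remove HTML tags
    text = text.lower()                   # convert to lowercase
    text = NON_ALPHA_RE.sub(" ", text)    # remove special characters
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens), code_blocks


cleaned, code = preprocess("<p>The Sensor is <b>Noisy</b>!</p><code>x = 1</code>")
```

The cleaned text is what would be passed on to the TF-IDF vectorizer, while the extracted code can be kept as a separate field.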

4. Machine Learning Models
4.1 Converting tags for multilabel problems
4.2 Split the data into train and test sets (80:20)
4.3 Featurizing data with a TF-IDF vectorizer
4.4 Applying Logistic Regression/SVM with a OneVsRest classifier

5. Testing the model
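
Steps 4.1 and 4.2 can be sketched in plain Python as below. The helper names are hypothetical; with scikit-learn one would typically use `MultiLabelBinarizer` and `train_test_split` for the same purpose. The example posts and the fixed seed are assumptions for illustration.

```python
import random


def binarize_tags(tag_lists):
    """Turn lists of tags into 0/1 rows, one column per distinct tag."""
    vocab = sorted({tag for tags in tag_lists for tag in tags})
    index = {tag: i for i, tag in enumerate(vocab)}
    rows = []
    for tags in tag_lists:
        row = [0] * len(vocab)
        for tag in tags:
            row[index[tag]] = 1
        rows.append(row)
    return vocab, rows


def split_train_test(items, test_fraction=0.2, seed=0):
    """Shuffle with a fixed seed and hold out a fraction for testing."""
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)
    n_test = int(round(len(items) * test_fraction))
    test = [items[i] for i in order[:n_test]]
    train = [items[i] for i in order[n_test:]]
    return train, test


vocab, rows = binarize_tags([["water", "air"], ["air"], []])
train, test = split_train_test(list(range(10)))
```

Each column of `rows` then becomes the target of one binary classifier in the OneVsRest setup from step 4.4.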

Sources / useful links
YouTube: https://youtu.be/nNDqbUhtIRg
Research paper: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tagging-1.pdf
Research paper: https://dl.acm.org/citation.cfm?id=2660970&dl=ACM&coll=DL

NeuralMonk on 25 Jan 2019

🚀1

I really love your research, but it's important to get input from @jywarren
on whether or not the organisation is aiming to bring ML into current projects.
Sooner or later we will need to enable ML, but it depends on core mission
projects too.
So Jeff can best guide us on whether these should be discussed further now or
taken up later on.
Thanks everyone.

SidharthBansal on 25 Jan 2019

Hello everyone!
Please let me know whether I should start working on this, since it will take a
lot of time and effort on my part.
Or, if you want me to work on something else, please let me know.

NeuralMonk on 5 Feb 2019

Hi, thanks to everyone for your input here! I think there are some potential use cases for machine learning across the Public Lab ecosystem! But perhaps we need to do a bit more detailed brainstorming on individual examples. For example, I'm not sure that running a containerized Flask server as part of the plots2 codebase makes sense, because it dramatically expands the setup complexity of the project (we had an issue with this in a previous project to run a Solr container), but perhaps it could make sense to develop it in a separate repository?

Could such a separate server for data analysis access data via the API?

Of the brainstormed applications, I'm hesitant on the spam one -- I like the basic premise, but to me it seems more sustainable and less 'reinvent the wheel' to look at an existing library or service for spam identification, like Akismet or something. I'm sure others have worked on this problem, and I'm less sure we could provide something unique that would be competitive.

On the other hand, I'd love to think about places in the PL ecosystem where machine learning would present a really unique benefit that supports our overall mission.

Would Spectral Workbench be one of those places?

I note a mention of neural networks for trying to solve an issue here: https://github.com/publiclab/spectral-workbench.js/issues/56#issuecomment-457179753 (although seems that should be broken into its own issue)
@Lucaszw emailed me some time back with the idea of using machine learning to apply appropriate tags to spectra in SpectralWorkbench. That also seems interesting!

On MapKnitter, would it be plausible to scan images and try to identify features and tag accordingly?

The Vision API at Google Cloud can do some pretty interesting things there: https://cloud.google.com/vision/

Although in this test it didn't seem to find anything in this aerial photo except that it was an aerial photo 😄

Perhaps one approach here might be to begin a Zooniverse project using MapKnitter data: https://www.zooniverse.org/lab

Then that could be used as training data to develop a machine learning approach to identifying, say, areas of high risk of spills, pollution, etc.

Terrapattern tried doing something kind of like this: https://qz.com/764746/terrapattern-open-source-satellite-photo-search-tool/

http://www.terrapattern.com/about

That could be a really interesting approach, and I like the idea of using the MapKnitter image set to help an ML approach get better at identifying pollution.

Note that Terrapattern also uses OpenStreetMap tags to train its model. Perhaps we could correlate MapKnitter images with any OSM tags which overlap with the images shown, although there might not be too many.

Anyhow, these are some ideas that get a bit at the environmental mission of Public Lab, and might make for an interesting set of possible projects that wouldn't necessarily live IN the plots2 codebase, but could be really powerful tools for our community.

jywarren on 7 Feb 2019

❤2 👍1

This is a really great example of using machine learning to identify environmental issues: https://skytruth.org/2019/02/using-machine-learning-to-map-the-footprint-of-fracking-in-central-appalachia/

It also gets at some of the challenges, and discusses how to use existing manually categorized datasets as a training set, OR to use existing databases to correlate with imagery to train a model. Great work, @skytruth!

jywarren on 14 Feb 2019

Hey everyone, and thanks @jywarren for your wonderful input. Your proposed ideas are very cool and interesting.
I have already started reading and researching about them. It will take me about a week to find out how things are supposed to be done.
Thanks, everyone.

NeuralMonk on 14 Feb 2019

We can host a Flask server that works this way:

It takes a screenshot of the image,
Feeds it to the input of the model,
Takes the output of the model and shows it on the web page.

Goal: Automatically label aerial imagery

Tagging,
Semantic segmentation.

Implementing the machine learning model in simple steps:

Collect pairs of images and labels,
Write a program that predicts labels for given images (the model),
Let the computer automatically tune parameters to mimic the examples (learning).

The lengthy task: collecting pairs of aerial images and labels

Tagging:

The machine learning model

Residual Network (ResNet), which is a major breakthrough in CNNs:

It allows training models with hundreds of layers for greater accuracy,
Its layers compute the residual (delta) between input and output.

Why does it work?

Each layer has less work to do (no copying),
Skip connections allow the gradient to flow more easily.

For a deeper understanding, see this intuitive blog post: https://wiseodd.github.io/techblog/2016/10/13/residual-net/
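A minimal sketch of the residual idea described above, assuming a toy two-layer block with plain Python lists instead of a real CNN. The names `residual_block`, `w1` and `w2` are hypothetical and only illustrate that the block outputs the input plus a learned delta:

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def matvec(w, v):
    """Multiply matrix w (list of rows) by vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def residual_block(x, w1, w2):
    """Return x + F(x), where F(x) = w2 @ relu(w1 @ x) is the learned residual."""
    fx = matvec(w2, relu(matvec(w1, x)))
    # The skip connection adds the untouched input back onto the residual.
    return [xi + fi for xi, fi in zip(x, fx)]


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
# With zero residual weights the block is exactly the identity mapping.
print(residual_block(x, zeros, zeros))  # prints [1.0, -2.0, 3.0]
```

Because the block starts out close to an identity mapping, stacking many of them does not force each layer to relearn how to copy its input, which is the "no copying" point above.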

Our approach to making our model better

1. Instead of softmax, use the sigmoid activation function, since one image can carry several tags at once.

2. Optimize the tag threshold to maximize the F2 score.
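The two points above could be sketched roughly as follows for a single tag. This is an illustration under assumptions, not the method from any particular project: the probabilities and labels are made up, and `best_threshold` is a hypothetical helper that scans a fixed grid of thresholds.

```python
import math


def sigmoid(z):
    """Map a raw model score (logit) to an independent per-tag probability."""
    return 1.0 / (1.0 + math.exp(-z))


def f_beta(y_true, y_pred, beta=2.0):
    """F-beta score; beta=2 weights recall more heavily than precision."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def best_threshold(y_true, probs, beta=2.0):
    """Scan thresholds 0.05, 0.10, ..., 0.95 and keep the first one with the top score."""
    best_t, best_score = None, -1.0
    for i in range(5, 100, 5):
        t = i / 100
        preds = [1 if p >= t else 0 for p in probs]
        score = f_beta(y_true, preds, beta)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Made-up example: ground truth and sigmoid outputs for one tag on four images.
y_true = [1, 1, 0, 0]
probs = [0.9, 0.42, 0.33, 0.1]
print(best_threshold(y_true, probs))
```

With several tags, the same search would be run once per tag, so each tag gets its own cut-off rather than a shared 0.5.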

Using pretrained model

A pre-trained model may not be 100% accurate in your application, but it saves the huge effort required to reinvent the wheel. Let me show this to you with a recent example.

Augment the labelled dataset using lossless image transformations.

The more data the better, so we can, for example, rotate each image by 90 degrees left and right, which increases the size of our dataset.
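As a rough sketch of such lossless augmentation, assuming an image is represented as a plain list of pixel rows (a real pipeline would more likely use an array library), the four 90-degree rotations and their mirror images give up to eight variants per image. The helper names are hypothetical:

```python
def rot90(img):
    """Rotate a 2-D grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def fliplr(img):
    """Mirror a 2-D grid left to right."""
    return [row[::-1] for row in img]


def augment(img):
    """Return the 8 rotation/flip variants of img, the original included."""
    variants = []
    current = [list(row) for row in img]
    for _ in range(4):
        variants.append(current)
        variants.append(fliplr(current))
        current = rot90(current)
    return variants


tiny = [[1, 2],
        [3, 4]]
for variant in augment(tiny):
    print(variant)
```

For tagging, every variant keeps the tags of the original image. For semantic segmentation, the same transformation would have to be applied to the label mask as well.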

Tune the learning rate (LR) manually
It is very important to find which LR gives the best performance.
Ensembling of 3 model architectures (optional)
1. ResNet 5x
2. Inception 5x
3. DenseNet 5x
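One simple way to ensemble, shown here only as a sketch, is to average the per-tag probabilities that each model predicts for the same image. The numbers below are made up, and `ensemble_average` is a hypothetical helper, not part of any library:

```python
def ensemble_average(per_model_probs):
    """Average per-tag probabilities across models.

    per_model_probs: one list of tag probabilities per model, all the same length.
    """
    n_models = len(per_model_probs)
    return [sum(tag_probs) / n_models for tag_probs in zip(*per_model_probs)]


# Made-up outputs of three models for the tags (road, water) on one image.
predictions = [
    [0.80, 0.10],
    [0.70, 0.30],
    [0.90, 0.20],
]
print(ensemble_average(predictions))
```

Averaging tends to smooth out the individual mistakes of each architecture, at the cost of running every model at prediction time.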

Semantic segmentation

ResNet-based FCN architecture
Fine-tune a pre-trained model
Use IR, R, G images as input
Make predictions using a sliding window because the network can only handle 256x256 inputs
Ensemble average of five models
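The sliding-window step could be sketched as below. This only computes where the 256x256 windows go. The model call itself is left out, and the choice to shift the last window back so it stays inside the image (which creates some overlap) is an assumption, not something stated in the thread:

```python
def window_origins(size, tile=256):
    """Start offsets of tiles covering [0, size); the last tile is shifted back to fit."""
    if size <= tile:
        return [0]
    origins = list(range(0, size - tile, tile))
    origins.append(size - tile)
    return origins


def sliding_windows(height, width, tile=256):
    """Top-left (y, x) corners of all tiles needed to cover a height x width image."""
    return [(y, x)
            for y in window_origins(height, tile)
            for x in window_origins(width, tile)]


# A 600 x 512 image needs 3 rows and 2 columns of 256 x 256 windows.
for corner in sliding_windows(600, 512):
    print(corner)
```

Each window would then be cropped, passed through the network, and written back into a full-size label map. Where windows overlap, the predictions can be averaged.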

Other ideas for future work

1. Detection of oil spills

Fully convolutional network (FCN)
FCN-GoogleNet
FCN-ResNets
Deep neural autoencoder

2. Detection and mapping of plastic

Air pollution

But predicting future air pollution patterns is itself a major machine learning task.

The required dataset

Temperature
Wind speed
Dew point
Pressure
PM2.5 concentration
Classified data samples (polluted or not)
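A possible record layout for such a dataset is sketched below. The field names, units and CSV column names are all assumptions made for illustration, and the sample row is invented:

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class AirSample:
    temperature_c: float   # assumed unit: degrees Celsius
    wind_speed_ms: float   # assumed unit: metres per second
    dew_point_c: float     # assumed unit: degrees Celsius
    pressure_hpa: float    # assumed unit: hectopascals
    pm25_ugm3: float       # assumed unit: micrograms per cubic metre
    polluted: bool         # the class label (polluted or not)


def load_samples(csv_text):
    """Parse CSV text with one column per AirSample field into AirSample records."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        AirSample(
            temperature_c=float(row["temperature_c"]),
            wind_speed_ms=float(row["wind_speed_ms"]),
            dew_point_c=float(row["dew_point_c"]),
            pressure_hpa=float(row["pressure_hpa"]),
            pm25_ugm3=float(row["pm25_ugm3"]),
            polluted=row["polluted"].strip().lower() == "true",
        )
        for row in reader
    ]


# Invented example row, only to show the expected shape of the data.
example = (
    "temperature_c,wind_speed_ms,dew_point_c,pressure_hpa,pm25_ugm3,polluted\n"
    "21.5,3.2,12.0,1013.0,85.0,true\n"
)
print(load_samples(example))
```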

thanks, everyone

NeuralMonk on 27 Feb 2019

🚀1 ❤1 👍1

Hi! This is a lot of information - thanks for compiling it! I wanted to ask a few things first --

With such a complex system, perhaps we should do some diagramming to show what the parts of the system are, and what are the potential ways to fulfill each part -- we could start with a diagram template like the one linked here, that was used to generate the plots2 data model: https://github.com/publiclab/plots2/blob/master/doc/DATA_MODEL.md
I'm really interested in good integration with existing efforts -- what portions of systems like Terrapattern and others are re-usable, or could we at least remain compatible with? https://github.com/CreativeInquiry/terrapattern
For buildings, Google Maps can provide the locations of a substantial portion of the buildings in almost any major city. -- I'd even prefer OpenStreetMap, which Terrapattern uses, and is an open source data source which we could also encourage people to contribute to in order to improve the training! See how to query here: https://github.com/publiclab/leaflet-environmental-layers/issues/50 and also a lot about more data sources to draw from in https://github.com/publiclab/leaflet-environmental-layers/ !
For the PM air quality data, do you think perhaps it's possible that there is no visible sign of air quality issues in MapKnitter images? or if you're not using images to correlate, but just data, there may be other models to look to first.

I hope this helps!

jywarren on 27 Feb 2019

Oh, and also, starting a Zooniverse project would be GREAT! @zengirl2 may be interested in this too.

jywarren on 27 Feb 2019

Thanks @jywarren for the great input and for making things clearer and more interesting.

Yes, it is a little complex, and I will try to break things down in a simpler way. I have started working on this and will try to complete it as soon as possible.
For now we can do the semantic segmentation part, which can help the model predict tags like ROAD, BUILDING, WATER, TREES, VEGETATION, because there is freely available data such as
https://project.inria.fr/aerialimagelabeling/
and we can use OpenStreetMap: http://openstreetmapdata.com/
so we can start on this.
Using open source is always fun.
From images alone we can only find out whether or not the image is hazy, but with the location of the image we can look up the PM2.5 value for that location.

NeuralMonk on 6 Mar 2019

Zooniverse sounds great! I guess you should create a team first and add me (and @zengirl2 or anyone else who is interested too) and I could then flesh out the rest of the project.

Hope this sounds good?

NeuralMonk on 6 Mar 2019

oh very cool, yes that sounds good! Can you email me with your email or
Zooniverse username at [email protected]?

jywarren on 6 Mar 2019

@SKashyapD Hey there--I do have a strong interest in Zooniverse, but I'm still behind on a fan project I'm working on. So, you can include me, but I won't be able to do much right now.

Zengirl2 on 6 Mar 2019

[Image: untitled diagram]
This is the simplest way to show how things are going to work; each block has its own technical details. Please create a repository and I will explain every technical detail there.

Thanks @jywarren for creating the Zooniverse project.
The Zooniverse project looks great. I have started working on it, but I need to know a few things first to make it better and clearer:
- What are we specifically looking for (core mission)?
- Which labels are we going to use to create our database?
- Is there anything important you want to mention?

Should I start working on the semantic segmentation part?

Thanks, everyone.

NeuralMonk on 6 Mar 2019

Thanks @Zengirl2 for showing interest. Any kind of contribution will be great.
@jywarren please add @Zengirl2 to our Zooniverse project.

NeuralMonk on 6 Mar 2019

@SKashyapD I originally had interest in using Zooniverse to go through possible pollution from hurricanes. They have started to do projects for hurricanes (although not with the pollution I would like). I was at the point of having conversations with two people from Zooniverse about learning to use their content system. I believe I may even have a video tutorial that they sent me.

Zengirl2 on 6 Mar 2019

I am really excited to complete the Zooniverse project and the semantic segmentation part. @jywarren, please give me some input so that I can start working; I will try to complete all this as soon as possible.
@Zengirl2, please share that tutorial video; it will help me a lot.

NeuralMonk on 9 Mar 2019

@SKashyapD Here are the links for some helpful info about setting up projects on Zooniverse (this was based on a specific example of flood/hurricane I had been asking about).

Doc Explanation
https://docs.google.com/document/d/1W5y5Iq6WY5OpP6P4kcHrE6od0tGBFhO0huXvXHJJCzs/edit?usp=sharing

Youtube video
https://www.youtube.com/watch?v=_bcu5tJDjPY

Zengirl2 on 11 Mar 2019

Thanks @Zengirl2 for providing the resources.
@jywarren please let me know when you are finished. I am already working on some prerequisites that will help us in the future.

NeuralMonk on 14 Mar 2019

I think @Zengirl2's idea for core mission is great -- identify specific types of pollution from aerial photos -- and we can start with whatever is a good initial training set.

I added @Zengirl2 to the zooniverse! Thank you!

jywarren on 15 Mar 2019

There are lots of Hurricane Harvey images linked to from posts on this page: https://publiclab.org/wiki/harvey#Questions -- i hope that helps!

jywarren on 15 Mar 2019

@jywarren That sounds amazing. I will start working on this immediately.
Should I make a Summer of Code proposal for image labelling of MapKnitter images using semantic segmentation and tagging?

NeuralMonk on 15 Mar 2019

Sure, if you're interested in submitting a proposal, that would be great!

jywarren on 15 Mar 2019

Hello everyone,

1. @jywarren I have done some work on our Zooniverse project and uploaded a random aerial picture. Check our workflow; it looks great.
[Image: Screenshot from 2019-03-22 00-47-36]
2. We need to upload some data to start classifying. A CSV file with longitude and latitude information would be great.
3. After uploading the data we need to share our project link as widely as possible. Announcing it on the Public Lab website would be a great start.
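A CSV like the one mentioned in point 2 could be generated along these lines. This is only a sketch: the column names `filename`, `latitude` and `longitude` are assumptions rather than a confirmed Zooniverse format, and the file name and coordinates are invented.

```python
import csv
import io


def build_manifest(records):
    """Build CSV text from (filename, latitude, longitude) tuples."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["filename", "latitude", "longitude"])
    for filename, lat, lon in records:
        # Six decimal places is roughly 0.1 m of precision, plenty for aerial tiles.
        writer.writerow([filename, f"{lat:.6f}", f"{lon:.6f}"])
    return buf.getvalue()


# Invented example entry.
print(build_manifest([("harvey_001.jpg", 29.7604, -95.3698)]))
```

Keeping the coordinates next to each file name means classifications can later be mapped back to a location.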

There is still a lot of work to be done, and I am trying to do it as soon as possible while maintaining content quality.

@Zengirl2 please review it; suggesting a few tags would be great.

Thanks @jywarren, I am working on the proposal.
I have also contacted a main contributor of Terrapattern, and he provided some very useful links,
such as the OpenStreetMap dataset they used. We discussed the technical approaches they tried, and it is really helping me see our project more clearly.

Cheers!
Thanks, everyone.

NeuralMonk on 21 Mar 2019

@Zengirl2 do you have any idea how many images we can classify in 2-3 months?

@jywarren if we are able to classify enough images on Zooniverse, then after building a neural network model on OpenStreetMap data we can add our Zooniverse data to classify environmental issues, which is known as batch processing in machine learning.

NeuralMonk on 21 Mar 2019

This is very interesting! Would you be interested in trying to use some of
the imagery from Hurricane Harvey that was posted in the link I shared?
Cool!

jywarren on 21 Mar 2019

I downloaded the set of images from this link.
Should I upload the whole dataset or just a few images?
@jywarren

NeuralMonk on 21 Mar 2019

@SKashyapD This is awesome! I was going to suggest the same as @jywarren about images. Can you explain exactly what you mean by tags in this case? Do you mean tags for people to find the info or tags of more examples of pollution or other indicators that interest us? Also, as far as how many images we can classify--do you mean when people look at them on Zooniverse to mark what they find or do you mean some process before the images are loaded into Zooniverse? Sorry, I'm really an Arduino hardware person that understood Zooniverse could be a possibility, but didn't really have the programming knowledge to make it happen. You are really bringing a dream of mine to life! :joy:

Zengirl2 on 21 Mar 2019

Thanks @Zengirl2 for such a wonderful reply.
1. The images are too big, which will make the tagging task difficult. Should I crop them first, or is there another way to do this?
2. Yes, tags for types of pollution, indicators, or anything else you find useful.
3. Actually, I wanted to know what response from volunteers we can expect.

Thanks, everyone.

NeuralMonk on 21 Mar 2019

Hello everyone,

I uploaded 147 images from Hurricane Harvey after slicing them and removing irrelevant ones.
I will try to add more images soon. @Zengirl2 please review it.
@jywarren can I add something to the research section?
To update the Team section, can you please provide your portfolio links, @jywarren @Zengirl2? That will make our project look more promising.

Thanks

NeuralMonk on 30 Mar 2019

@SKashyapD Where can I view the work on the Zooniverse project (besides the screen grab earlier)? Would I have received an invite? Also, I think if I remember correctly Zooniverse can include our project on Zooniverse (rather than just a public project)--that's how you get a lot of action on it. Have you decided which way you are going to categorize it?

Zengirl2 on 30 Mar 2019

@SKashyapD I sent a note to my contacts at Zooniverse letting them know you are working on a project. Also, my user name on Zooniverse is @Zengirl2 as well. :)

Zengirl2 on 30 Mar 2019

I think I sent you an invite, @zengirl2!

jywarren on 30 Mar 2019

@jywarren I have sent my Summer of Code proposal to your email and would like to hear your feedback,
since we are pressed for time.

thank you.

NeuralMonk on 30 Mar 2019

@Zengirl2 to edit the project you can go through this link too.
What are the criteria to get selected as a Zooniverse project?
Which type of categorization are you talking about? Categorization of the dataset?

NeuralMonk on 30 Mar 2019

@jywarren and @SKashyapD - when I log into Zooniverse it is not showing that I'm connected to any projects. Jeff, I remember seeing where you said you were going to invite me, but I don't remember getting any email about it. Can you see what name you used to add me?

@SKashyapD what I was talking about as far as whether this is a Zooniverse project or private project is listed under lab policies.

Zengirl2 on 31 Mar 2019

Please check your email; you may have received the invitation, because your username is the same in the project, @Zengirl2.

NeuralMonk on 1 Apr 2019

For categorization of the project, @jywarren may be better placed to answer.
Do we have enough volunteers for the classification task?

NeuralMonk on 1 Apr 2019

@SKashyapD Hey, just got the email today. Will look at the project tonight when I get home :unicorn:

Zengirl2 on 1 Apr 2019

@SKashyapD I had a chance to look at the project and it is coming along fine. I noticed that when I chose to mark an image, it did not give me another image once I had completed it. Was this because it is not yet live? Or have you not attached a file of images yet? Anyway, here are my comments:

If this is just a test, it is fine that it is not a full blown Zooniverse project. Just sending the link to the Public Lab community once this is live is good.
Usually a Zooniverse project only takes on marking an image for one or two things. We are asking more by having many types of pollution. I know just trying to identify oil sheen from an image is difficult, so we probably need to develop a tutorial. Also, a gas company flare--would that be considered pollution? These are some of the things a tutorial can make more understandable :). In fact, the original image you used as an example earlier before you sent the link for the project was great--perhaps that can be used for the tutorial.
We should probably make it more clear why we are trying to do this work, so maybe filling out the field guide section would be a good idea as well.

Zengirl2 on 2 Apr 2019

@Zengirl2 I have fixed that problem; it is working properly now. Please take another look.

I will make a tutorial as soon as possible, and I will add more images too.
Can you point me to an exemplary tutorial or anything else that could help make ours better?
While preparing my Summer of Code proposal I did some research on why we are doing this, so can I add a few things, @jywarren?
Can you provide your bio or something similar to help me create the Team section, @jywarren @Zengirl2? It will help make our project look good.

Thank you!

NeuralMonk on 6 Apr 2019

Hey @SKashyapD--your images are working correctly now :tada:

Great example of a similar project and tutorial (it has already completed but you can still view it)
Tutorial Details - I know some important things we were talking about identifying was sheen on water from oil spills, damaged infrastructure (like large oil tanks that get ripped open or toppled from hurricanes), flares (the flames from stacks from gas companies) and I'm wondering if we can identify tar on beaches? Maybe that counts as oil spill, too.
Drawing "Mining" - I was having difficulties using this--do you need to make more than two points? It said "2 of 0 required drawn" when I tried it.
Classification section - This seems to be a summary of the places identified by the symbols/drawings, but not sure where/how I'm supposed to input any information (like for instance if I knew there was a gas plant in a location).
Pretty Stuff - The hurricane project example I gave you earlier helps to show how to make a project attractive/needed. I'm thinking we may be able to get a photo for the front page that looks more like hurricane devastation. I believe we have images already on Public Lab's site that could be useful, so I'll try to find one. This also affects the message on the top of the project...maybe something like "We need your help recognizing pollution from aerial images so we can prepare for future disasters". Also, where you have the quote about "destroying oceans" maybe we can give more detail about how hurricanes and other disasters cause pollution of air, water and soil for living things in surrounding areas long after the initial event. Also, the ability to identify pollution from aerial images helps to hold companies accountable for preparation and remediation. Think Skytruth :)
My bio (you can use my pic from Github--let me know if you need it larger)- Leslie is a user and educator of open source hardware and volunteers with Public Lab to help others investigate their environmental concerns. She is currently working on a Master's of Environmental Studies with a focus on Conservation Tech at University of Pennsylvania.

Zengirl2 on 8 Apr 2019

I saw machine learning and I wanted to chime in. Of the original list that @SidharthBansal compiled from the different sources of requests, I wanted to add that we had been discussing the tag recommendation tangentially on the website (the code part of the conversation has moved to GitHub). At any rate, to @jywarren's comment above regarding not 'reinventing the wheel', there are some recommendation engines in Ruby that I recommended (har har) in my comment here.

skilfullycurled on 10 Apr 2019

Sorry for the delay @Zengirl2

I started working on the tutorial, and thanks for the resources.
For drawing mining I selected the polygon tool because it will help us map mining areas better (you can draw any required shape).
In the classification section I will add an extra section for notes like this.
I will update a few sections of the project to make it more appealing.
Thanks for the bio, @Zengirl2.

Can you please point me to a few resources for more images, @jywarren @Zengirl2?

Thanks everyone!

NeuralMonk on 17 Apr 2019

🎉1 👍1

Thanks @skilfullycurled for taking the initiative. You can check this; it may help: recommendify

NeuralMonk on 17 Apr 2019

Hi @SKashyapD -- can you help me find your SoC proposal? Did it get posted?

jywarren on 18 Apr 2019

Hey, @jywarren
Yes, it got posted on the Public Lab website and you have also reviewed it: SoC proposal
Is there any problem?

Thank you!

NeuralMonk on 18 Apr 2019

No, no, thank you for taking the initiative on an ML thread, @SKashyapD! I'm not sure I'll be able to take that much more initiative on the implementation of a tag recommendation system since I don't have lots of experience in programming with Ruby, however, I really want to second your idea of having a server for this.

One thing I would like to do is to piggy back on your initiative and eventually start a conversation about how to grow a community around ML and data science now that the stats downloads page is coming along. More on that later, I have to actually get back to my own data science project!

skilfullycurled on 18 Apr 2019

@skilfullycurled that would be great.

NeuralMonk on 30 Apr 2019

Hi @jywarren @SKashyapD @Zengirl2, can we close this issue, or does anyone want to update it? Thanks!

grvsachdeva on 1 Jul 2019

Hello everyone
Not now.
I will start working on this after summer break.

Thank you

NeuralMonk on 1 Jul 2019

👍1

Cool!

grvsachdeva on 1 Jul 2019

@skilfullycurled @jywarren what other small projects can we start to grow the community around data science?

thanks!

NeuralMonk on 27 Aug 2019

I am a newbie in the ML area.
I was going through the problem statement as it was interesting.
Thanks for posting in detail for better understanding.
Thanks @skilfullycurled @jywarren @SKashyapD @Zengirl2

budema6 on 27 Aug 2019

@SKashyapD, thank you for keeping the conversation alive! I'll rejoin as soon as I can. I just started school in a new program so although PL GitHub conversations are one of my favorite ways to feel like I'm doing programming work but actually just avoiding it (not a joke), I should probably finish my work first. Still, I was too happy about the conversation to not join in. : )

One project comes immediately to mind:

SPAM: There are currently two problems. The first is the spam we currently get from sign-ups and postings, and the second is spam accounts that were made before there was a more robust moderation system. There's a period of time (can't recall what it is) where there are literally ~300,000 accounts. I think Public Lab is awesome, but that seems a tad inflated. ; ) Additionally, I believe that when users are moderated as spam, they are not removed from the database.

Spam isn't the most exciting task but it'd have a real impact. A) Moderating spam is a huge resource drain. B) Unless we're able to filter out spam accounts, there really can't be good data science because the data won't be good.

This project has two quasi-FTO's. They aren't FTO's according to the actual definition, but the problem contains some "hello worlds" of data science. The first would be good for someone who is comfortable with Ruby (I don't think you have to be awesome at it, I hardly knew Python in my first data science class) but wants to get started in data science. And the second is for someone who is comfortable with the fundamental exploratory data analysis tasks and wants to try a simple ML exercise.

I've been collecting a data set of spam.

Project 1: Exploratory data analysis. I started #5450 to discuss non-ML ways to detect spam, and I came up with some guidelines simply by exploring the data. These guidelines could become more robust with more exploratory analysis of a larger dataset. This would be a good way to get familiar with the SciRuby library collection and the fundamentals of data science (using Ruby notebooks, dataframes, selecting data, aggregating results, plotting etc.) As I said, I've been collecting a dataset of spam, but we also need a way to identify past spam because I'm sure the markers have changed over time.

Project 2: Creating a spam/ham classifier. This is why I started the collection actually, so that we'd have enough for the spam part. The harder thing is collecting data for people who are in the ham category. So that's sort of in the Project 1 category, but after we have enough of both, then there are plenty of tutorials for someone to have a nice learning experience.
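To give a feel for what a first spam/ham classifier might look like, here is a minimal naive Bayes sketch. It is written in Python rather than Ruby purely for illustration, the four training texts are invented, and a real version would be trained on the collected spam dataset plus a matching ham dataset:

```python
import math
from collections import Counter


def tokenize(text):
    """Very crude tokenizer: lower-case and split on whitespace."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def fit(self, samples):
        """samples: iterable of (text, label) pairs with label 'spam' or 'ham'."""
        for text, label in samples:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokenize(text))

    def predict(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.LABELS:
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior plus smoothed log likelihood of every word in the text.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


# Invented toy data, only to show the workflow.
model = NaiveBayes()
model.fit([
    ("buy cheap pills online", "spam"),
    ("cheap loans click here", "spam"),
    ("balloon mapping workshop this weekend", "ham"),
    ("spectrometry kit calibration notes", "ham"),
])
print(model.predict("cheap pills here"))
print(model.predict("balloon mapping notes"))
```

A sketch like this assumes both classes appear in the training data, which is why collecting enough ham examples matters as much as collecting spam.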

skilfullycurled on 27 Aug 2019

My pleasure @budema6, I'm excited about developing a community so it's really thanks to you for your interest!

skilfullycurled on 27 Aug 2019

Update: I now have enough spam if ever anyone wants to take on training a spam/ham classifier for the site. If I recall, I've seen a number of Jupyter notebooks that do this in Ruby. Of course, the data has to be parsed, and we need a ham dataset as well. In any event...

skilfullycurled on 13 Dec 2019

Hey! This topic really interests me and I have made some Natural Language Processing projects with Python and the spaCy library. I'd love to help out and try applying NLP to spam detection. I'm no expert, but I think I could help :smile:

Uzay-G on 11 Jan 2020

@Uzay-G, thanks for reviving this thread. I'm not sure when/how but I'm thinking it might be a good idea to try to have a call. It just seems like there's enough interest in general, and it might be good to just meet each other and see if we can organize ourselves. I'd sort of like to see this become a tool topic just like balloon mapping or spectrometry. And, perhaps at some point even have a separate PL repo for projects the same way mapknitter does.

Anyone at @publiclab/connectors, how are we handling developer open calls these days?

skilfullycurled on 12 Jan 2020

Hi :smile:, this issue has been automatically marked as stale because it has not had recent activity. Don't worry you can continue to work on this and ask @publiclab/reviewers to add "work in progress" label :tada: . Otherwise, it will be closed if no further activity occurs in 5 days -- but you can always re-open it if you like! :100: Thank you for your contributions :raised_hands: :balloon:.

stale[bot] on 7 Oct 2020

Sorry about the stalebot message here, it was a mistake! 😅 Can't seem to delete due to a GitHub API issue... strange. Carry on!

jywarren on 8 Oct 2020
