Plots2: set up scraper for Public Lab's twitter account

Created on 30 May 2017  ·  17 Comments  ·  Source: publiclab/plots2

background

Twitter is a microblogging platform launched in 2006 (no, no, jk jk, for real now)...

Twitter gives limited access to your timeline events. For instance, here's what the exported archive looks like when viewed in a browser -- it only shows what you tweeted or retweeted, not the activity on each tweet or what other people are saying about you:
screenshot of archive

And here is what analytics.twitter.com shows you in digested "month" overviews. Pretty good, but still doesn't capture who is talking about you, although it does give a count of total mentions:

twitter-2017-01

what we want, what we really really want

  • all mentions of publiclab across the twitterverse, including @publiclab, #publiclab, plain publiclab, or even public lab -- in a CSV file.
  • extract all hashtags from the above set
  • snapshot of current followers to see who we've lost (how would we do comparisons over time?)
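As a rough illustration of the first two items, here is a minimal Python sketch that filters tweet text for the publiclab variants listed above and counts the hashtags in the matching set. The sample tweets and the function names (`mentions_publiclab`, `extract_hashtags`) are made up for the example, and fetching the tweets from Twitter is deliberately left out.

```python
import re
from collections import Counter

# Matches "@publiclab", "#publiclab", "publiclab" and "public lab", case-insensitively.
# The word boundaries keep it from matching inside longer words.
MENTION_RE = re.compile(r"[@#]?\bpublic\s?lab\b", re.IGNORECASE)
HASHTAG_RE = re.compile(r"#(\w+)")


def mentions_publiclab(text):
    """Return True if the text contains any of the publiclab variants."""
    return MENTION_RE.search(text) is not None


def extract_hashtags(texts):
    """Count hashtags (lowercased, without the '#') across a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts


# Hypothetical sample data standing in for collected tweets.
tweets = [
    "Loving the new kits from @PublicLab! #DIYscience",
    "public lab meetup tonight #openhardware #DIYscience",
    "Nothing to see here #random",
    "The republiclab typo should not match",
]

matching = [t for t in tweets if mentions_publiclab(t)]
hashtag_counts = extract_hashtags(matching)
```

Note that this pattern is intentionally narrow: it would not catch a plural such as "Public Labs", so the regex would need loosening if those should count as mentions.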

I can implement it in Python. Implementing it in Ruby should also be relatively simple, although it would still take time. If it would help, I could write a Python version for someone to translate, but there's little point in that since we could just as easily run the Python script directly. It just wouldn't fit as neatly into the Public Lab codebase. An alternative is for the Python process to produce a parsed data structure (CSV, for example) which Ruby then loads into the database.
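To make the CSV hand-off concrete, here is a small sketch of the Python side using only the standard library. The column names in `FIELDS` and the sample row are assumptions for illustration, not an agreed schema.

```python
import csv
import io

# Hypothetical column layout; the real one would be decided with the dev team.
FIELDS = ["tweet_id", "created_at", "screen_name", "text"]


def write_tweets_csv(rows, out):
    """Write a list of dicts to a file-like object as CSV with a header row."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Made-up sample row; the csv module quotes the comma and the double quotes.
sample = [
    {
        "tweet_id": "1",
        "created_at": "2017-05-30T12:00:00Z",
        "screen_name": "example_user",
        "text": 'Hello, "Public Lab"!',
    }
]

buffer = io.StringIO()
write_tweets_csv(sample, buffer)
csv_text = buffer.getvalue()
```

In practice `out` would be a file opened with `newline=""`. Because the output is ordinary quoted CSV, the Ruby side should be able to read it with Ruby's standard CSV library.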

Resources:

https://github.com/sferik/twitter/tree/master/examples
https://robots.thoughtbot.com/ruby-wrapper-for-twitter-search-api (contains example with Active Record and Cron)

Basic steps:

  • [ ] Decide on a data structure for the SQL database
  • [ ] Create tables given structure
  • [ ] Set up an application on Twitter to authenticate oneself and obtain auth keys (the process might be slightly different when it's your own timeline)
  • [ ] Create a script to authenticate and collect the data
  • [ ] Create scripts to parse data and put in appropriate tables
  • [ ] Create scripts to analyze/manipulate/aggregate data (e.g. compare number of followers since last query)
  • [ ] Create a cron job (or something similar) to run the script
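One possible shape for the table and comparison steps, sketched with Python's built-in `sqlite3` module. The table name, columns, and follower IDs are all assumptions for illustration; the production database and schema are still undecided. Storing one row per follower per snapshot date also answers the "comparisons over time" question: any two dates can be compared with a set difference.

```python
import sqlite3

# Hypothetical schema: one row per follower per snapshot date.
SCHEMA = """
CREATE TABLE IF NOT EXISTS follower_snapshots (
    taken_on    TEXT NOT NULL,
    follower_id TEXT NOT NULL,
    PRIMARY KEY (taken_on, follower_id)
);
"""


def save_snapshot(conn, taken_on, follower_ids):
    """Store the follower IDs seen on a given date."""
    conn.executemany(
        "INSERT OR IGNORE INTO follower_snapshots (taken_on, follower_id) VALUES (?, ?)",
        [(taken_on, fid) for fid in follower_ids],
    )
    conn.commit()


def load_snapshot(conn, taken_on):
    """Return the set of follower IDs stored for a given date."""
    rows = conn.execute(
        "SELECT follower_id FROM follower_snapshots WHERE taken_on = ?",
        (taken_on,),
    )
    return {row[0] for row in rows}


def compare_snapshots(conn, earlier, later):
    """Report which followers were lost and gained between two dates."""
    before = load_snapshot(conn, earlier)
    after = load_snapshot(conn, later)
    return {"lost": before - after, "gained": after - before}


# In-memory database with made-up follower IDs, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
save_snapshot(conn, "2017-05-01", ["alice", "bob", "carol"])
save_snapshot(conn, "2017-06-01", ["bob", "carol", "dave"])
changes = compare_snapshots(conn, "2017-05-01", "2017-06-01")
```

A cron job (or similar) would call something like `save_snapshot` on each run, with the follower IDs coming from the collection script.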

Other to-dos:

  • Logging features
  • Email notifications of failures
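A minimal sketch of how logging and failure notification could wrap a scheduled run. The `notify` callable is a placeholder: here it just appends to a list, whereas a real version might send mail (for example via `smtplib`). The logger name and `run_with_reporting` are likewise hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("twitter_scraper")  # hypothetical logger name


def run_with_reporting(job, notify):
    """Run job(); log the outcome and call notify(message) if it raises."""
    try:
        result = job()
    except Exception as exc:
        # logger.exception records the message together with the traceback.
        logger.exception("Scraper run failed")
        notify(f"Twitter scraper failed: {exc}")
        return None
    logger.info("Scraper run succeeded")
    return result


# Demonstration with a stand-in notifier that just collects messages.
sent_messages = []


def failing_job():
    raise RuntimeError("rate limit exceeded")


run_with_reporting(failing_job, sent_messages.append)
ok_result = run_with_reporting(lambda: 42, sent_messages.append)
```

Passing the notifier in as an argument keeps the wrapper testable without a mail server.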

Bonus to-dos:

CC @ebarry

All 17 comments

great! thanks for posting, i added in screenshots above.

I added a help-wanted label, but if you think this isn't ready for input from folks, you could switch it to a break-me-up one. Could be good to also note exactly what twitter account, i.e. https://twitter.com/publiclab

Thanks @jywarren, awesome idea, didn't know we had that!

There isn't a priority on having this feature fully integrated into the site so I think right now I'm imagining that this could be done in stages, something like the following:

  1. I write something in Python which would create csv's of data for internal usage.
  2. Ruby code is added to put csv's into the database.
  3. Ruby code is added to replace the processing and parsing of the data
  4. Ruby code is added to replace the getting of the data from Twitter.

I think the questions for the dev team that would help define the help wanted or which parts would need to be broken up are:

  • What are your feelings about having Python code running side by side with the rest of the site?
  • How do you feel about having some separate Ruby scripts running alongside the site?
  • Or, would you not want any code (Ruby or Python) interacting with the site unless it was built out in accordance with current style guides (e.g. with models, views, controllers, etc.)?

Hi all! If we are thinking about generating statistics about our tweets and social media, there is already an open source server application at https://github.com/loklak/loklak_server. It seems to be the most appropriate solution, as we can deploy our own server with it, and it enables us to collect and share a large number of tweets. We can look into it further to learn about its usage and other features.

This will be awesome if it works!

I'm so glad you pointed it out. I can't tell you how much data I've missed out on gathering because I was too lazy to go to the computer to set up a script.

And...

They have a Python API!

I'll give it a run on my own machine when I get a chance.

Hi, @skilfullycurled -- just thoughts on these:

> Ruby code is added to put csv's into the database.

Do you need to store additional data in the PublicLab.org database? Or is this part of a gradual plan to integrate this effort with the automated processes already on the site?

> Ruby code is added to replace the processing and parsing of the data

If you can share a python script or even pseudocode of the kind of processing you're doing, we can help shape a set of issues to accomplish this, either in a separate script or via the API or an expansion of the API.

> Ruby code is added to replace the getting of the data from Twitter.

Not sure if this is necessary -- do we need to store Twitter data in our website's database? But happy to brainstorm on this.

Thanks!

Yes, although Ruby is nice, i guess there doesn't seem to be anything strongly motivating us to actually integrate this function with the PublicLab.org codebase -- it could be run as a standalone system at a subdomain or in a folder, and not have to add the complexity of integration, right? I def. appreciate the interest in a public facing tool, though!

loklak does look really great!

I'll not respond to your first comment since your second comment captures it exactly. As long as the dev team is okay with it running as a standalone feature then no, there's no reason to integrate it. That would lower the complexity for everyone for sure.

So I think the plan then is to build out something separate (loklak or otherwise) that can deliver data through a feed so that in the future it can be front facing and others can work with the data if they so choose. If one day it becomes advantageous to integrate it more directly, then we can implement Operation Ruby Spear.

Thanks everyone!

I guess I can remove the help-wanted label for now if you're investigating loklak_server as an option?

Sure. Although for those who stumble upon this thread, as always, help is welcome!

Hi :smile:, this issue has been automatically marked as stale because it has not had recent activity. Don't worry you can continue to work on this and ask @publiclab/reviewers to add "work in progress" label :tada: . Otherwise, it will be closed if no further activity occurs in 5 days -- but you can always re-open it if you like! :100: Thank you for your contributions :raised_hands: :balloon:.

She commented "#TwitterAintDead," thereby removing the 'stale' label.

Liz is right. We can't let these bots dictate what we're not going to work on! They disrupt our faith in institutions. Sow discord and spread misinformation. But to quote former President George W. Bush: _'...There's an old saying in Tennessee...Fool me once, shame on ... shame on you. Fool me... You can't get fooled again!'_. So, I'll be damned if I'm going to let some bot waltz in here and seed the insidious idea that these issues won't get closed if we don't actively work on them.

picard_the_line

Hey Benjamin, what's our working definition of "_active_"? 😃

That's just the point! Again, to quote GWB, [we're] the decider!

It seems like this will take some active discussion, so let's either file another issue here, or put it on the agenda for an open call. We have to stay pro-active on being pro-active.

In the meantime (and I say this with sincerity) what is it that we still want from this? Is this still simply about data? Are there new metrics (Instagram comes to mind) that would give a better picture?
