Fec-cms: Import FEC Record data into CMS

Created on 29 Aug 2016 · 19 Comments · Source: fecgov/fec-cms

Once the data has all been scraped and the new templates have been completed, import all FEC Record data into the Wagtail database.

Criteria for completion:

[x] All FEC Records are included in the database as instances of the record page template

For each record, these properties should be populated:

[x] Title
[x] Published date (the original date it was published)
[x] Categories
[x] Keywords (will need to add a model property for this)
[x] Monthly issue (will need to add a model property)
[x] Author(s): These are database models, so we'll need to create new instances for each author
[x] Author roles
[x] Body content: Ideally the HTML would be imported into a RichText field (rather than an HTML field) to make it easier for people to edit, but it's not the end of the world if that isn't possible

Back-end

noahmanger

All 19 comments

Not sure if this is meant to just be an initial import, but there are summaries of AOs and MURs that we want to display / link to from the canonical pages for those legal resources.

So if during import we could extract summaries tied to a particular AO/MUR number, it'd help that effort.

porta-antiporta on 29 Aug 2016

@porta-antiporta I am happy to pair with someone on legal or hand this off if someone volunteers as tribute.

LindsayYoung on 29 Aug 2016

:hand: I can pair.

adborden on 30 Aug 2016

@ccostino added more detail to the issue.

noahmanger on 16 Dec 2016

👍1

Thanks!

ccostino on 16 Dec 2016

I've made some improvements to the scraper to help integrate it with the app some more; next up is to build the importer of the data, which I believe to be just a matter of the following:

Adding the necessary fields to the DigestPage model that aren't already there.
Building a small importer module that loads the JSON file and attempts to import each record in it and save it as a DigestPage:
- Account for authors while processing (we have a DigestPageAuthor model); only add new ones if they don't already exist.
- If a record fails to import, make a note of it and the error(s) but continue with the import process.
- Display the results/errors at the end so folks can troubleshoot records that didn't import successfully.
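To make the steps above concrete, here is a minimal sketch of that import loop in plain Python. It is not the actual importer: `save_page` and `known_authors` are hypothetical stand-ins for saving a Wagtail page and looking up existing DigestPageAuthor instances, and the record keys (`title`, `body`, `authors`) are assumptions about the scraped JSON.

```python
import json


def import_records(json_text, save_page, known_authors):
    """Import each record, collecting failures instead of stopping.

    save_page and known_authors are hypothetical stand-ins for the
    page model and the author lookup; they are not part of fec-cms.
    """
    imported, errors = [], []
    for index, record in enumerate(json.loads(json_text)):
        try:
            # Reuse an existing author entry; only add one if it is new.
            authors = [
                known_authors.setdefault(name, {"name": name})
                for name in record.get("authors", [])
            ]
            save_page(
                title=record["title"],
                body=record.get("body", ""),
                authors=authors,
            )
            imported.append(record["title"])
        except Exception as exc:
            # Note the failure and carry on with the remaining records.
            errors.append((index, repr(exc)))
    return imported, errors


def report(imported, errors):
    """Build a summary so failed records can be troubleshot afterwards."""
    lines = [f"Imported {len(imported)} record(s); {len(errors)} failed."]
    lines += [f"  record #{index}: {message}" for index, message in errors]
    return "\n".join(lines)
```

Catching the exception per record is what keeps a single bad entry from aborting the whole run; the collected errors are only shown once the loop has finished.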

One question for you though, @noahmanger: Do we have a sense of whether or not these fields might be needed for some of the other page types that we have? If so, would it make more sense to add them to our base ContentPage model?

ccostino on 20 Dec 2016

Hey — I should have spotted this sooner, but we actually have a dedicated page template for Record articles. It's called Record Page in Wagtail.

https://beta.fec.gov/admin/pages/4/add_subpage/

emileighoutlaw on 20 Dec 2016

Yes, we do! Thanks for catching that, @emileighoutlaw. I should've caught it myself, since I was dealing with that model while addressing the failing tests issues...

Should we be using the RecordPage instead of DigestPage in this case?

ccostino on 20 Dec 2016

Yep, we should use RecordPage. Let's start by just adding any of these unique fields to that model first. Also, @LindsayYoung wrote a load script that's over in this branch https://github.com/18F/fec-cms/pull/483 that might be worth checking out.

noahmanger on 20 Dec 2016

Are the categories the same as the other models? In other words, would they follow the same validation rules as say the press releases or are they a completely different set of categories in this case?

ccostino on 21 Dec 2016

@LindsayYoung and I also caught up this morning: I'm going to incorporate the work done in #483 as a part of this new PR to get all of the importers organized (this will exclude the data itself that is in that PR, though).

ccostino on 21 Dec 2016

@ccostino they're different categories. You should be able to get the unique values from the scraped data.
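As a rough sketch of pulling those unique values out, assuming the scraped records are a list of dicts and that the category sits under a `category` key (both are assumptions, not confirmed by the scraper output):

```python
def unique_categories(records, field="category"):
    """Return the sorted distinct category values found in the records.

    The field name is a hypothetical default; adjust it to the real data.
    """
    values = set()
    for record in records:
        value = record.get(field)
        if value:
            # Strip stray whitespace so "Reporting " and "Reporting" match.
            values.add(value.strip())
    return sorted(values)
```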

noahmanger on 21 Dec 2016

👍1

Thanks! I just wanted to see if anything was specific vs. what might be shared with other page types.

ccostino on 21 Dec 2016

FYI on this: there are no author roles in these pages or in the scraped data, so they will all default to author.

ccostino on 23 Dec 2016

👍1

Okay, I have this mostly working! Details on usage and findings are in the PR. :-)

The big remaining piece is seeing whether we can get the HTML to import correctly; I'm not sure it's possible.

ccostino on 23 Dec 2016

The latest commits in the PR modify the import script behavior to import the body content as rich text instead of raw HTML (it's still HTML under the hood, though). I've provided an option to toggle between the two just in case.

There is one last thing I've noticed with the record pages. Not all of them have authors associated with them, but it looks like they all have a byline in them, e.g., (Posted MM/DD/YY; By: First Last). If we do have an author associated with the article, there is a separate part of the template that renders that information, so in those cases the author appears twice in the article (the date is also rendered elsewhere in the template, at the top right of the page).

@noahmanger: Would it make sense to try and remove these bylines and in doing so, extract the author information from it if we don't already have it? Or is what we have good/close enough at this point?
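For reference, a rough sketch of what that byline extraction could look like, assuming every byline follows the exact `(Posted MM/DD/YY; By: First Last)` shape described above. The pattern and the `split_byline` helper are illustrative only and are not taken from the PR.

```python
import re
from datetime import datetime

# Matches e.g. "(Posted 12/23/16; By: Jane Doe)"; the format is assumed.
BYLINE = re.compile(
    r"\(Posted\s+(\d{1,2}/\d{1,2}/\d{2});\s*By:\s*([^)]+?)\s*\)"
)


def split_byline(body):
    """Return (body_without_byline, author_name, posted_date).

    If no byline is found, the body comes back unchanged with None values.
    """
    match = BYLINE.search(body)
    if match is None:
        return body, None, None
    posted = datetime.strptime(match.group(1), "%m/%d/%y").date()
    cleaned = (body[:match.start()] + body[match.end():]).strip()
    return cleaned, match.group(2), posted
```

The extracted name would only be used when the record has no author already, which avoids rendering the author twice.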

ccostino on 28 Dec 2016

@ccostino awesome stuff. I'd say if that latter step isn't too much trouble it'd be good to do, but it's not critical.

noahmanger on 28 Dec 2016

Okay, I'll spend a bit of time seeing what I can accomplish and time box it at an hour or so.

ccostino on 28 Dec 2016

Aside from exploring this author and byline piece, the rest of this work is complete! I'll be spending just a bit more time with it tomorrow to see if I can get something simple working for it.

ccostino on 28 Dec 2016
