Once the data has all been scraped and the new templates have been completed, import all FEC Record data into the Wagtail database.
Criteria for completion:
For each record, these properties should be populated:
Not sure if this is meant to be just an initial import, but there are summaries of AOs and MURs that we want to display / link to from the canonical pages for those legal resources.
So if during import we could extract summaries tied to a particular AO/MUR number, it'd help that effort.
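A minimal sketch of what that extraction could look like, assuming citations appear in the text as "AO 2016-03" (year-sequence) and "MUR 6543" (integer). The `title` and `body` keys and both function names are hypothetical, not taken from the scraper:

```python
import re
from collections import defaultdict

# Assumed citation formats: "AO 2016-03" (year-sequence) and "MUR 6543" (integer).
CITATION_RE = re.compile(r"\b(AO)\s+(\d{4}-\d{1,3})\b|\b(MUR)\s+(\d{3,5})\b")


def find_legal_citations(text):
    """Return the distinct (kind, number) pairs mentioned in ``text``, in order."""
    found = []
    for match in CITATION_RE.finditer(text):
        kind = match.group(1) or match.group(3)
        number = match.group(2) or match.group(4)
        if (kind, number) not in found:
            found.append((kind, number))
    return found


def index_summaries(records):
    """Map each cited AO/MUR to the titles of the records that mention it."""
    # "title" and "body" are hypothetical keys for a scraped record.
    index = defaultdict(list)
    for record in records:
        for citation in find_legal_citations(record["body"]):
            index[citation].append(record["title"])
    return dict(index)
```

The resulting index could then be used to look up which Record articles summarize a given AO or MUR when rendering its canonical page.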
@porta-antiporta I am happy to pair with someone on legal or hand this off if someone volunteers as tribute.
:hand: I can pair.
@ccostino added more detail to the issue.
Thanks!
I've made some improvements to the scraper to help integrate it with the app some more; next up is to build the importer of the data, which I believe to be just a matter of the following:
- Add the fields to the DigestPage model that aren't already there.
- Create each DigestPage, including its authors (DigestPageAuthor model); only add new ones if they don't already exist.

One question for you though, @noahmanger: Do we have a sense of whether or not these fields might be needed for some of the other page types that we have? If so, would it make more sense to add them to our base ContentPage model?
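The "only add new ones if they don't already exist" step is a get-or-create pattern. In the real importer this would presumably go through Django's `get_or_create` on the author model; the sketch below is an in-memory stand-in so it runs without a database. The `AuthorRegistry` class and the name normalization are assumptions for illustration only:

```python
class AuthorRegistry:
    """In-memory stand-in for an author table, keyed by normalized name."""

    def __init__(self):
        self._by_key = {}

    @staticmethod
    def _key(name):
        # Collapse whitespace and ignore case so "First  Last" == "first last".
        return " ".join(name.split()).lower()

    def get_or_create(self, name):
        """Return (author, created), mirroring the shape of Django's get_or_create."""
        key = self._key(name)
        if key in self._by_key:
            return self._by_key[key], False
        author = {"name": " ".join(name.split())}
        self._by_key[key] = author
        return author, True

    def __len__(self):
        return len(self._by_key)
```

Running the import twice over the same scraped data should then leave the author count unchanged, which is a cheap way to check the importer is idempotent.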
Hey — I should have spotted this sooner, but we actually have a dedicated page template for Record articles. It's called Record Page in Wagtail.
Yes, we do! Thanks for catching that, @emileighoutlaw, I should've caught that myself as I was dealing with that model while addressing the failing tests issues...
Should we be using the RecordPage instead of DigestPage in this case?
Yep, we should use RecordPage. Let's start by just adding any of these unique fields to that model first. Also, @LindsayYoung wrote a load script that's over in this branch https://github.com/18F/fec-cms/pull/483 that might be worth checking out.
Are the categories the same as the other models? In other words, would they follow the same validation rules as say the press releases or are they a completely different set of categories in this case?
@LindsayYoung and I also caught up this morning: I'm going to incorporate the work done in #483 as part of this new PR to get all of the importers organized (this will exclude the data itself that is in that PR, though).
@ccostino they're different categories. You should be able to get the unique values from the scraped data.
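A quick way to pull those unique values, assuming the scraped data loads as a list of dicts. The `category` key and the `category_counts` name are hypothetical; swap in whatever the scraper actually emits:

```python
from collections import Counter


def category_counts(records, field="category"):
    """Count each distinct, non-empty category value in the scraped records."""
    counts = Counter()
    for record in records:
        # Treat a missing key, None, and whitespace-only strings as "no category".
        value = (record.get(field) or "").strip()
        if value:
            counts[value] += 1
    return counts
```

Sorting the keys of the result gives the list of choices for the model, and the counts make one-off typos in the source data easy to spot.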
Thanks! I just wanted to see if anything was specific vs. what might be shared with other page types.
FYI on this - there are no author roles for these pages and the data that was scraped, so they will all default to author.
Okay, I have this mostly working! Details on usage and findings are in the PR. :-)
The big remaining piece is seeing whether we can get the HTML to import correctly; I'm not sure yet that it's possible.
The latest commits in the PR modify the import script behavior to import the body content as rich text instead of raw HTML (it's still HTML under the hood, though). I've provided an option to toggle between the two just in case.
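Roughly, the toggle amounts to choosing which block type wraps the same HTML string. This is only a sketch of the idea, not the code in the PR: the `build_body` name, the `as_rich_text` flag, and the block names `paragraph` and `html` are assumptions, and the list of `type`/`value` dicts is the JSON shape a Wagtail StreamField serializes to:

```python
import json


def build_body(html, as_rich_text=True):
    """Serialize body HTML as a one-block stream in the JSON shape StreamField stores."""
    # "paragraph" and "html" are assumed block names, not confirmed from the model.
    block_type = "paragraph" if as_rich_text else "html"
    return json.dumps([{"type": block_type, "value": html}])
```

Either way the stored value is the same HTML; only the block type, and therefore how the editor and template treat it, changes.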
There is one last thing I've noticed with the record pages. Not all of them have authors associated with them, but it looks like they all have a byline in them, e.g., (Posted MM/DD/YY; By: First Last). If we do have an author associated with the article, there is a separate part of the template that renders that information, so in those cases the author appears twice in the article (the date is also rendered elsewhere in the template, at the top right of the page).
@noahmanger: Would it make sense to try and remove these bylines and in doing so, extract the author information from it if we don't already have it? Or is what we have good/close enough at this point?
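For reference, a rough sketch of what the byline removal and extraction might look like, assuming every byline follows the `(Posted MM/DD/YY; By: First Last)` shape. The regex details and the `extract_byline` name are my own guesses and would need checking against the real articles:

```python
import re
from datetime import datetime

# Matches e.g. "(Posted 01/15/16; By: First Last)"; the exact punctuation is assumed.
BYLINE_RE = re.compile(
    r"\(\s*Posted\s+(?P<date>\d{1,2}/\d{1,2}/\d{2})\s*;\s*By:\s*(?P<author>[^)]+?)\s*\)",
    re.IGNORECASE,
)


def extract_byline(body):
    """Return (body_without_byline, posted_date, author); the last two are None if absent."""
    match = BYLINE_RE.search(body)
    if match is None:
        return body, None, None
    posted = datetime.strptime(match.group("date"), "%m/%d/%y").date()
    # Cut out only the first matched byline and trim leftover whitespace.
    cleaned = (body[: match.start()] + body[match.end():]).strip()
    return cleaned, posted, match.group("author")
```

The extracted author would only be used when the record has no author already, so articles that do have one would simply lose the duplicate byline.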
@ccostino awesome stuff. I'd say if that latter step isn't too much trouble it'd be good to do, but it's not critical.
Okay, I'll spend a bit of time seeing what I can accomplish and time box it at an hour or so.
Aside from exploring this author and byline piece, the rest of this work is complete! I'll be spending just a bit more time with it tomorrow to see if I can get something simple working for it.