Boostnote: plugin to scrape website & convert HTML to markdown

Created on 30 Sep 2017  ·  16 Comments  ·  Source: BoostIO/Boostnote


I'm enjoying boostnote after switching from evernote & quiver.app - thank you to everyone who has contributed to this promising open source tool.

I keep a "code" notebook for technical notes-to-self and today I wanted to add a "clipping" of a blog post to it. I wasn't sure what the best way was (sometimes I try copying-and-pasting directly from the browser, which worked OK in quiver's rich-text note mode... but rtf, gross), so I tried out a few tools for automatically converting from HTML to markdown.

pandoc can fetch content directly from a URL given on the command line, and can convert to/from HTML, markdown, and many other formats. Install on osx with brew install pandoc, then:

pandoc -f html --normalize --wrap=none -t markdown_github+backtick_code_blocks+autolink_bare_uris -o output.md <URL>

as a handy fish shell function:

❯ function panscrape --description='usage: panscrape [URL] > blog_clipping.md'
      pandoc -f html --normalize --wrap=none -t markdown_github+backtick_code_blocks+autolink_bare_uris $argv
  end
❯ funcsave panscrape
❯ panscrape "https://shapeshed.com/command-line-utilities-with-nodejs/" > clipping.md

# or to copy directly to system clipboard
❯ panscrape "https://shapeshed.com/command-line-utilities-with-nodejs/" | pbcopy

Pandoc does an OK job but is definitely not perfect, so some manual editing of the output may be necessary, for instance deleting header & footer content.
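The header and footer cleanup can sometimes be scripted rather than done by hand. The following is a minimal sketch, assuming the article body sits between two recognisable lines; `trimClipping` and both patterns are hypothetical names chosen for illustration, and nothing here is part of pandoc.

```javascript
// Keep only the lines from the first line matching `start` up to (but not
// including) the first later line matching `end`.
// Assumption: the caller picks both patterns per site; they are not universal.
function trimClipping(markdown, start, end) {
  const lines = markdown.split("\n");
  const from = lines.findIndex((line) => start.test(line));
  if (from === -1) return markdown; // start marker not found: leave untouched
  const rest = lines.slice(from);
  const to = rest.findIndex((line, i) => i > 0 && end.test(line));
  return (to === -1 ? rest : rest.slice(0, to)).join("\n").trim();
}

// Example: drop the navigation line above the title and everything from
// the comments heading onward.
const clipped = [
  "Home | About",
  "",
  "# Post title",
  "",
  "Body text.",
  "",
  "## Comments",
  "",
  "Footer",
].join("\n");
console.log(trimClipping(clipped, /^# /, /^## Comments/));
```

Pages differ a lot, so the patterns usually need adjusting per site, and manual editing remains the fallback.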

[Image: html-md-conversion-pandoc-boostnote]

If you don't want to install anything, fuckyeahmarkdown.com seems to have an alright hosted converter.

feature request

Add a command (plugin?) to Boostnote that takes a URL as input, scrapes the page, converts the html to markdown, and creates a new note filled with the result.

Starting points:

  • node-europa "is a Node.js module for converting HTML into valid Markdown that uses the Europa Core engine."
  • scrape-markdown CLI tool based on node-europa

    • note: npm package is out-of-date and does not work; install from source repo with npm install github:evangoer/scrape-markdown

    • run locally ./node_modules/.bin/scrape-markdown [URL]

I would be happy to help with implementation.
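To make the requested flow concrete (URL in, scraped HTML, markdown out, new note), here is a rough sketch. It is not how Boostnote implements the feature: `htmlToMarkdown` is a toy regex converter standing in for a real library such as node-europa, the note shape returned by `buildNote` is hypothetical, and `clipUrl` assumes the global `fetch` available in Node.js 18+.

```javascript
const FENCE = "`".repeat(3);

// Toy converter for a handful of tags. Assumption: regexes are good enough
// for a demo; a real plugin should use a proper HTML-to-markdown library.
function htmlToMarkdown(html) {
  return html
    .replace(/<(script|style|head)\b[\s\S]*?<\/\1>/g, "") // drop non-content
    .replace(/<pre><code>([\s\S]*?)<\/code><\/pre>/g,
      (_, code) => "\n" + FENCE + "\n" + code.trim() + "\n" + FENCE + "\n")
    .replace(/<h([1-3])>(.*?)<\/h\1>/g,
      (_, level, text) => "\n" + "#".repeat(Number(level)) + " " + text + "\n")
    .replace(/<a href="(.*?)">(.*?)<\/a>/g, "[$2]($1)")
    .replace(/<(strong|b)>(.*?)<\/\1>/g, "**$2**")
    .replace(/<(em|i)>(.*?)<\/\1>/g, "*$2*")
    .replace(/<code>(.*?)<\/code>/g, "`$1`")
    .replace(/<p>([\s\S]*?)<\/p>/g, "\n$1\n")
    .replace(/<[^>]+>/g, "")      // strip any remaining tags
    .replace(/&lt;/g, "<").replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"').replace(/&amp;/g, "&")
    .replace(/\n{3,}/g, "\n\n")   // collapse runs of blank lines
    .trim();
}

// Hypothetical note shape; Boostnote's real note format may differ.
function buildNote(url, html) {
  const titleMatch = html.match(/<title>(.*?)<\/title>/);
  return {
    title: titleMatch ? titleMatch[1] : url,
    content: htmlToMarkdown(html) + "\n\nSource: " + url,
  };
}

// Fetch the page and turn it into a note.
async function clipUrl(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error("Request failed with status " + response.status);
  }
  return buildNote(url, await response.text());
}

// Usage:
// clipUrl("https://shapeshed.com/command-line-utilities-with-nodejs/")
//   .then((note) => console.log(note.content));
```

Keeping the converter separate from the fetch step means the conversion library can be swapped without touching the rest.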

IssueHunt Summary

awolf81 has been rewarded.

Backers (Total: $100.00)

  • boostio ($100.00)

Submitted pull requests

  • #3099 Html to md feature

Labels: feature request, rewarded on issuehunt

All 16 comments

I use the Copy as Markdown extension for Chrome. I find it very convenient.

Thanks for sharing! Looks like copy-as-markdown uses reMarked.js internally, another option besides node-europa for the putative Boostnote plugin.

Screenshot comparing reMarked.js vs pandoc - reMarked has trouble parsing the code blocks for some reason:
[Image: html-md-conversion-pandoc-remarked_vs_pandoc]

reMarked code blocks fenced with 'true'?

This is funny. I couldn't figure out why the reMarked demo was fencing code blocks with 'true'. I think it's just a mistake in how the reMarker object is configured on the demo page:

// code blocks will be delimited with the string 'true'
var reMarker = new reMarked({gfm_code: true});

// this is what we want
// try it by pasting into the console at reMarked demo site
var reMarker = new reMarked({gfm_code: "```"});
reMarker.render(document.getElementById('html-inp').value)
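One plausible mechanism for the 'true' fences, sketched under an assumption: if a renderer uses the option value directly as the fence delimiter, JavaScript string concatenation turns the boolean into the text "true". `renderCodeBlock` below is a hypothetical stand-in and is not taken from the reMarked.js source.

```javascript
// Hypothetical renderer, NOT reMarked.js code: the option value is used
// as-is for the fence, so a non-string value gets stringified on concatenation.
function renderCodeBlock(code, options) {
  const fence = options.gfm_code;
  return fence + "\n" + code + "\n" + fence;
}

// Boolean option: both fence lines come out as the word "true".
console.log(renderCodeBlock("chmod u+x yourscript", { gfm_code: true }));

// String option: the fence lines are three backticks, as intended.
console.log(renderCodeBlock("chmod u+x yourscript", { gfm_code: "`".repeat(3) }));
```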

example reMarked.js output w/ {gfm_code: "```"}:

The basics
----------

To create an executable Node.js script all you need is a Node.js shebang at the top of the script and then some code to execute.

```
#!/usr/bin/env node

console.log('hello world');
```

Assuming you are on a UNIX like system you can do this to make the script executable

```
chmod u+x yourscript
```

Now you can run it and you should see ‘hello world’ printed.

```
./yourscript
hello world
```

Handling arguments
------------------

As you get beyond basic scripts you’ll want to pass arguments into the script. The arguments passed to a script are available as `process.argv`.

If you pass arguments to the simple example above and add `console.log(process.argv)` you’ll see the arguments are available as an array. For example if you run

conclusion

I think reMarked.js - when properly configured - produces better output compared with pandoc, and possibly node-europa.

I just found copycat, and am testing it against copy as markdown (no affiliation). Combined with One Tab, my research aka open tabs aka browsing history are becoming useful articles and lists.

@kazup01 has boosted this issue with $100. Visit this issue on Issuehunt

@stormburpee has started working. Visit this issue on Issuehunt

@stormburpee has submitted output. Visit this issue on Issuehunt

Hey guys, feel free to take a look at the pull request I made for this feature over at #1981.
Based on the URL you suggested in the original post, it works great. I've also been testing it with a bunch of other websites that I look at, including ones you probably wouldn't expect to work.

In the issue I've attached a few example photos for you to see.

@rokt33r has stopped working. Visit this issue on Issuehunt

@kazup01 cancelled funding, $100, of this issue. Visit this issue on Issuehunt

@boostio funded this issue with $100. Visit this issue on Issuehunt

@edokan has started working. Visit this issue on Issuehunt

Would a web clipper (like Evernote's browser extension) be a better solution for this?

@zerox-dg has rewarded $90.00 to @awolf81. See it on IssueHunt

  • :moneybag: Total deposit: $100.00
  • :tada: Repository reward(0%): $0.00
  • :wrench: Service fee(10%): $10.00

This feature is now available as of 0.13.0, when creating a new note:

[Image]

sweet!
