I am reviving some projects - this blog as well as some open source work. They desperately need some TLC…
Since this blog has travelled through different hosting options and technologies, I still had some legacy posts formatted in HTML. I’ve taken the plunge and refactored them into much cleaner Markdown syntax.
This turned out to be easier than expected. With a couple of good libraries to lean on, I wrote a quick Node application to do the dirty work. This aspect of the Node.js community is the same reason why I fell in love with Ruby in the first place - the wealth of small, well-crafted libraries/gems/packages that focus on solving specific problems elegantly.
We’ll design the app with stream input and output, taking HTML input from stdin and writing the resulting Markdown to stdout. In this way we can use it in conjunction with other tools, true to the Unix tools philosophy.
First, let’s start with the package.json file. We can create one with npm init and fill in the blanks:
| |
We make use of these great packages:
- get-stdin: gets stdin as a string or buffer
- turndown: converts HTML to Markdown using JavaScript
- turndown-plugin-gfm: a plugin for turndown that enables GitHub Flavoured Markdown
- gray-matter: a library that parses different types of front matter
Note the bin section in the package.json file. This allows us to run npm link, which saves us from typing node index.js every time we want to run the app. If we decided to publish this application as an official npm package, we would also have to set the preferGlobal flag to true so that users get a warning if they install the package without the --global flag [further reading].
Here is the index.js file marked for execution in all its (quick and dirty) glory:
| |
First we retrieve all the input from stdin. The turndown package is nicely customizable, and I’ve set the output styles to my own preferences wherever they differ from the defaults.
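To make the "retrieve all the input" step concrete, here is a minimal sketch of collecting a whole stream into one string using only Node built-ins. It is an illustration of the idea, not the get-stdin API; `readAll` is a hypothetical helper name introduced for this example.

```javascript
// Sketch only: read every chunk from a readable stream and return the
// combined contents as a UTF-8 string. `readAll` is a hypothetical name.
async function readAll(stream) {
  const chunks = [];
  for await (const chunk of stream) {
    // Chunks may arrive as Buffers or strings; normalise to Buffers so that
    // multi-byte characters split across chunks are decoded correctly.
    chunks.push(Buffer.from(chunk));
  }
  return Buffer.concat(chunks).toString('utf8');
}
```

In a command-line filter this would be called as readAll(process.stdin), with the converted result passed to process.stdout.write once the promise resolves.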
We ask the turndown library to use the gfm plugin to support GitHub Flavoured Markdown. At the moment, the turndown library strips the newlines out of Jekyll front matter. This issue provides a simple workaround: use the gray-matter library to parse the front matter.
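The idea behind the workaround is to keep the front matter away from the HTML converter and only convert the body. The sketch below shows that idea in plain JavaScript so it runs without any dependencies. Both `splitFrontMatter` and `convertPost` are hypothetical helpers written for this illustration; they are not part of gray-matter, which handles far more formats and edge cases, and `convert` stands in for whatever HTML-to-Markdown function is in use.

```javascript
// Hypothetical helper (not gray-matter's API): split a leading
// "---" delimited front-matter block from the rest of the text.
function splitFrontMatter(text) {
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?/.exec(text);
  if (!match) {
    return { frontMatter: null, body: text };
  }
  return { frontMatter: match[1], body: text.slice(match[0].length) };
}

// Convert only the body, then put the untouched front matter back on top.
// `convert` is assumed to map an HTML string to a Markdown string.
function convertPost(text, convert) {
  const { frontMatter, body } = splitFrontMatter(text);
  const markdown = convert(body);
  return frontMatter === null
    ? markdown
    : `---\n${frontMatter}\n---\n\n${markdown}`;
}
```

Because the front matter never passes through the converter, its line breaks survive exactly as they were written.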
Now that we have the input and output mechanisms ready, we can write
| |
The output will be:
| |
Let’s pipe a sample blog post with front matter in:
| |
Output:
| |
Beautiful!
We can now convert our posts in bulk by iterating over all the HTML posts:
| |
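For readers who would rather stay in Node for the bulk pass, the following is a minimal sketch under a few assumptions: all posts live in one directory, they end in .html, and `convert` is any function that turns an HTML string into Markdown. `convertDirectory` is a hypothetical name for this example. The originals are deliberately left in place so each result can be checked before anything is deleted.

```javascript
const fs = require('fs');
const path = require('path');

// Sketch only: write a sibling .md file for every .html file in `dir`.
// `convert` is assumed to map an HTML string to a Markdown string.
function convertDirectory(dir, convert) {
  const written = [];
  for (const name of fs.readdirSync(dir).sort()) {
    const ext = path.extname(name);
    if (ext.toLowerCase() !== '.html') continue; // skip non-HTML files
    const source = path.join(dir, name);
    const target = path.join(dir, path.basename(name, ext) + '.md');
    fs.writeFileSync(target, convert(fs.readFileSync(source, 'utf8')));
    written.push(target);
  }
  // Return the list of files written so the caller can review them.
  return written;
}
```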
And then we dutifully proceed with QA on each converted post before deleting the original. This code will not handle some of the edge cases that I was coding up in HTML back in 2005. To be honest, I’m not sure whether this was intentionally bad markup or a sign of the scars I received while fighting with WordPress, but inline styles for italic and bold text and self-closing paragraph tags (<p/>) are some examples. Those cases are rare, so I chose to handle them manually rather than diving into the insanity that is HTML parsing.
Photo by Pankaj Patel on Unsplash




