5.21.2013

HTML Scraper

I haven't posted in a long time, but today I wrote something that I'll likely find useful again. It can definitely be improved, so if you find this and know what I should do differently, speak up.

It parses HTML into a Clojure tree, then walks that tree and extracts the plain text as it goes. You can define replacements for specific tags and entities, or blacklist tags you want to ignore completely.
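
To make the idea concrete, here is a minimal sketch of the tree walk. It is an illustration under assumptions, not the actual scraper code. It skips the parsing step and starts from a hand-written tree in the `{:tag ... :content [...]}` shape that `clojure.xml` produces, and it assumes text nodes still contain raw entities. The names `blacklist`, `tag-replacements`, `entity-replacements`, and `node->text` are all hypothetical.

```clojure
(require '[clojure.string :as str])

;; Hypothetical configuration, invented for this sketch.
(def blacklist #{:script :style :head})           ; tags to skip entirely
(def tag-replacements {:br "\n" :p "\n\n"})       ; text emitted in place of a tag
(def entity-replacements {"&amp;" "&" "&lt;" "<" "&gt;" ">" "&nbsp;" " "})

(defn replace-entities
  "Replaces known entities in one pass; unknown entities are left untouched."
  [s]
  (str/replace s #"&[a-zA-Z]+;" (fn [m] (get entity-replacements m m))))

(defn node->text
  "Walks a node (a string or a {:tag ... :content ...} map) and returns its plain text."
  [node]
  (cond
    (string? node)                    (replace-entities node)
    (contains? blacklist (:tag node)) ""
    :else (str (get tag-replacements (:tag node) "")
               (apply str (map node->text (:content node))))))

;; A small hand-written tree standing in for parsed HTML.
(def sample
  {:tag :html
   :content [{:tag :head :content [{:tag :title :content ["ignored"]}]}
             {:tag :body
              :content [{:tag :p :content ["Fish &amp; chips"]}
                        {:tag :script :content ["alert(1)"]}
                        {:tag :p :content ["Done" {:tag :br :content nil} "now"]}]}]})

(println (str/trim (node->text sample)))
;; prints:
;; Fish & chips
;;
;; Done
;; now
```

Doing the entity replacement in a single regex pass avoids decoding something like `&amp;lt;` twice. Keeping the blacklist and the replacement maps as plain data means you can change the behaviour without touching the walk itself.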