Importing (a LOT of) HTML Documents with MODx CMS/CMF

3 Jul 2008

Yesterday was the first time in about a year and a half of using MODx CMS/CMF that I needed to use the Import HTML function (in the Manager, via Tools -> Import HTML). I was not only impressed by how fast it handled a large batch of HTML files, but ecstatic with the end results.

The task at hand: port an older, hand-coded HTML site over to a dynamic content management system, ideally without a ton of data entry (by an error-prone human, anyway), since the project’s scope didn’t account for that. The scope did, however, require that all the original body and page title content (with its original markup) remain intact, for the sake of the natural SEO positioning the site’s content had accumulated over the years.

220 Pages of HTML Content.

Um, no. I was pretty much going to have to either enter the data myself or find some MySQL database import method, and I preferred not to have to Google around for yet another “solution” (hehe).

Once I’d given the Import HTML feature a test run on a fresh dev install and gotten good results, I was 75% finished with the task in under an hour. 

All that was left was to prep the HTML pages by batch removing unwanted code (like navigation and header/footer/callout/etc).

It went like this (on a Mac; Windows users could easily follow suit with some Windows-ey yucky stuff):

  1. Grab all the HTML files from the original webserver (since I didn’t have FTP access) using CocoaWget (a Mac/Linux GUI for wget).
  2. Create an OS X Automator workflow to open all the HTML files with Coda, go to Edit -> Find, type the opening and closing tags of the section to remove (pausing to insert a Wildcard symbol between them), and replace the match with an HTML comment that just said <!-- Welcome to your new home, MODx. --> !!bonus -> NO REGEX UGLY FUNK REQUIRED !!
  3. Voila, all 220 HTML files successfully imported in 0.21 seconds as MODx documents, with the page title as the document title and everything else within the HTML body tag imported into the MODx CONTENT system TV.
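
For anyone without Coda and Automator handy, the cleanup in step 2 can be roughed out as a small script. This is only a sketch of the same idea, not the workflow I actually ran: the function names `strip_between` and `clean_directory` are made up for illustration, and the markers you pass in are assumptions you would swap for whatever tags wrap the junk in your own pages. It sticks to plain string searching, so still no regex funk.

```python
from pathlib import Path

# The placeholder comment from step 2.
PLACEHOLDER = "<!-- Welcome to your new home, MODx. -->"


def strip_between(text, start, end, replacement=PLACEHOLDER):
    """Replace every span from `start` through the next `end` with `replacement`."""
    out = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break
        finish = text.find(end, begin + len(start))
        if finish == -1:
            break  # opening marker with no closing marker: leave the rest alone
        out.append(text[pos:begin])
        out.append(replacement)
        pos = finish + len(end)
    out.append(text[pos:])
    return "".join(out)


def clean_directory(folder, start, end):
    """Rewrite every .html file under `folder` in place; return how many changed."""
    changed = 0
    for path in sorted(Path(folder).rglob("*.html")):
        # latin-1 maps each byte to one character and back, so bytes outside
        # the removed spans are written out exactly as they were read.
        original = path.read_bytes().decode("latin-1")
        cleaned = strip_between(original, start, end)
        if cleaned != original:
            path.write_bytes(cleaned.encode("latin-1"))
            changed += 1
    return changed
```

Something like `clean_directory("mirrored_site", '<div id="nav">', "</div>")` would then clean a whole folder, where both the folder name and the markers are hypothetical. Note that it stops at the first closing marker it finds, so a block with nested tags of the same kind needs a more specific closing marker. Run it on a copy of the files first.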

1 Response to Importing (a LOT of) HTML Documents with MODx CMS/CMF

amik

June 10th, 2010 at 8:23 am

I love modx. Good idea to use coda to remove the code you dont need!
