Screen Scraping

At my PHP meetup the other night some of the folks were discussing upgrade paths for content management systems, especially Drupal. They noted that there isn’t yet a good upgrade path from the previous version to the most recent one, and they described all of the manual steps that would have to be taken to migrate the data from one version to the next. The biggest technical consideration is whether or not you have direct access to the database. If that access is possible, you can go in behind the scenes, move the data into the new database and its new table structure, and effect whatever changes and transformations are needed by hand. The same thing can be done using custom back-end code. If that access is not possible, then they talked about using a process known as screen scraping.
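A direct migration of that sort boils down to reading rows from the old schema, transforming them, and writing them into the new one. Here’s a minimal sketch, assuming made-up table and column names (a real Drupal migration would target its actual schema, and would most likely run over MySQL rather than SQLite):

```python
# Minimal sketch of a direct-database migration. The databases, tables,
# and columns here are hypothetical stand-ins for illustration only.
import sqlite3

old = sqlite3.connect("old_site.db")   # hypothetical source database
new = sqlite3.connect("new_site.db")   # hypothetical target database

# Read rows out of the old table structure...
rows = old.execute("SELECT nid, title, body FROM node_old").fetchall()

# ...apply whatever transformations are needed, then write them
# into the new table structure by hand.
new.executemany(
    "INSERT INTO node_new (id, title, body_html) VALUES (?, ?, ?)",
    [(nid, title.strip(), body) for nid, title, body in rows],
)
new.commit()
```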

There are numerous methods for using JavaScript and other automated tools (e.g., Selenium, which I used briefly) to manipulate user screens in web-based systems, and similar tools exist for many other kinds of systems. You navigate to screens that display the desired data, which can then be read automatically on a field-by-field basis. This works when the entry and display elements are tagged with unique identifiers, which is certainly the case in HTML pages. The hassle with this method is that it’s slow, and you need a way to ensure that all of the data in the system can be exposed. That may or may not be a simple thing to do.
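As a sketch of what this looks like with Selenium’s Python bindings, assuming a hypothetical page whose display elements carry unique HTML ids:

```python
# Field-by-field screen scraping with Selenium. The URL and element
# ids are hypothetical; a real system would have its own identifiers.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    # Navigate to a screen that displays the desired data.
    driver.get("https://example.com/records/42")   # hypothetical URL

    # Read the display elements one field at a time via their unique ids.
    record = {
        "name":  driver.find_element(By.ID, "customer-name").text,
        "email": driver.find_element(By.ID, "customer-email").text,
    }
    print(record)
finally:
    driver.quit()
```

The slow part is exactly what the loop over records would add: every record means another page load, which is why full migrations done this way take so long.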

The FileNet ECM/BPM tools I used back in 1993 and 1994 incorporated mainframe terminal windows (it supported several kinds, sometimes connected directly by Telnet across TCP/IP, other times using IBM 3270 emulation over dedicated hardware) that could be screen scraped in the same way. That tool required entry and display fields to be defined by position on an 80×25-character screen. I don’t recall personally attempting complete migrations using that method, but I imagine that others must have done so. The capability was more often used to supplement ongoing activities, although over time such a procedure can effect a reasonably complete soft migration.
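Position-based scraping of that sort amounts to slicing a fixed-size text buffer by row, column, and length. A rough illustration, with coordinates invented for the example:

```python
# Illustrative sketch of position-based field extraction from an
# 80x25 terminal screen buffer. The field coordinates are made up.
def field(screen: str, row: int, col: int, length: int) -> str:
    """Extract a field from the screen by row, column, and length."""
    lines = screen.splitlines()
    return lines[row][col:col + length].strip()

# Suppose the account number occupies row 3, columns 10-19
# (a hypothetical layout):
#   account = field(screen_text, 3, 10, 10)
```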

I think this reflects the idea that the central issues in computing haven’t changed much since its earliest days. As discussed in the book Facts and Fallacies of Software Engineering, every new innovation is hailed as massively transformative but over time proves to yield a marginal improvement in some limited problem domain. The biggest areas of endeavor now seem to be managing complexity and balancing information technology costs, both fixed and ongoing, across every life cycle phase, weighing performance, reliability, and storage to achieve the lowest systemic cost of ownership consistent with required performance.

People recruit based on long checklists of specific tools because it’s easy to do, seems objective, and is amenable to automation. The question, however, is whether you really want to recruit for specific tools or for the ability to solve the larger problems at a higher level.

The bottom line is that the more things change, the more they stay the same.
