Word Processing to HTML
Workshop
Introduction
Word processing is optimized to deliver to one device: a printer. On the web, the units of measurement, the font choices, the output space, and the underlying technology are all different. Getting from Word to HTML is not straightforward.
A word processing document is a self-contained information space. A web page is always part of a larger context. So not only the structure but also the goal of the information will always be different. In other words, the best choice is nearly always to think in terms of re-creating the information rather than converting the document.
SaveAs HTML
The steps are simple: open the document and choose SaveAs HTML.
If the content is just headings and paragraphs, it may look perfectly okay. But the source code is almost never correct. The document almost never validates.
Editing the generated HTML consistently is nearly impossible. Your best bet is to keep the original word processing document on your desktop, edit it there, then SaveAs HTML again and overwrite the previous version. This means you will always have at least two versions floating around.
If the document has more complex structures, it becomes increasingly difficult for the software to produce a reliable conversion. And, of course, editing the result becomes even more difficult.
But there are even more serious problems.
First, your document probably needs to fit into your departmental template. SaveAs simply cannot do that. And pasting the code almost never works.
Second, unless you know how to add them, your document will have no meta tags. This means it will be less likely to be found by search engines.
Third, conversion of special characters, including very common ones such as an em dash or apostrophe, is rarely done properly, causing them to display as nonsense strings.
Fourth, you’ll have to add alt attributes, table summaries, and other accessibility features by hand.
In short, it doesn't work very well and even if it did, you wouldn't want to use it.
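The third problem above, special characters turning into nonsense strings, is easy to reproduce. The sketch below assumes one common cause: the file is saved in one encoding (UTF-8 here) and read back in another (Windows-1252 here). Which mismatch a given word processor and browser actually produce is not stated above, so treat this as an illustration only; `misread` is a hypothetical helper name.

```python
# Illustration only: assumes the nonsense strings come from an encoding
# mismatch (text saved as UTF-8 but read as Windows-1252).
def misread(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


if __name__ == "__main__":
    # Em dash and curly apostrophe, written as escapes to keep the file ASCII.
    for symbol in ["\u2014", "\u2019"]:
        print(repr(symbol), "->", repr(misread(symbol)))
```

Each single character comes back as three unrelated characters, which is the kind of garbage readers see in place of a dash or apostrophe.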
Screen Scrape as an Alternative
This works better, but still takes work. It's easy: just copy from the document and paste into your text editor, then add markup.
Because you scrape only the bits you want, you can actually consider restructuring and rewriting the information. This is usually a Good Thing.
It's adding the markup that's tedious, of course. But now you don't have any superfluous code and you can do the markup the way you want, including using any departmental stylesheets in play.
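Some of the tedium can be scripted. As a minimal sketch, assuming the pasted text separates paragraphs with blank lines, the hypothetical helper `paragraphs_to_html` below wraps each paragraph in a p element and escapes the characters that are significant in HTML. Headings, lists, and links would still need to be marked up by hand.

```python
import re
from html import escape


def paragraphs_to_html(text: str) -> str:
    """Wrap each blank-line-separated block of pasted text in <p> tags."""
    blocks = re.split(r"\n\s*\n", text.strip())
    html_lines = []
    for block in blocks:
        # Collapse line breaks and runs of spaces inside a paragraph.
        body = " ".join(block.split())
        if body:
            # escape() converts &, < and > so they display literally.
            html_lines.append(f"<p>{escape(body, quote=False)}</p>")
    return "\n".join(html_lines)


if __name__ == "__main__":
    pasted = "First paragraph.\n\nFish & chips\nare good."
    print(paragraphs_to_html(pasted))
```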
Watch out for special characters, though. Make sure you create character entities for them.
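Creating the entities does not have to be done by hand. The sketch below uses Python's standard `html.entities.codepoint2name` table, which covers the HTML 4 named entities, and falls back to a numeric reference for anything without a name. The function name `to_entities` is hypothetical, and the sketch deliberately leaves plain ASCII (including `&`, `<`, and `>`) untouched.

```python
from html.entities import codepoint2name


def to_entities(text: str) -> str:
    """Replace every non-ASCII character with an HTML character entity."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)  # plain ASCII passes through unchanged
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")  # named entity
        else:
            out.append(f"&#{cp};")  # numeric fallback
    return "".join(out)


if __name__ == "__main__":
    # "cafe" with an accented e, an em dash, and a curly apostrophe.
    print(to_entities("caf\u00e9 \u2014 it\u2019s"))
```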
Special Considerations
- Alt attributes
- Lists
- Headings
- Special characters (quotes, apostrophes, em dash, foreign symbols, special symbols)
- Images (you will have to copy them to the server and link to them)
- URLs and email addresses (turn them into links with href)
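The last item in the list above can be partly automated. The following is a rough sketch, not a complete URL or email parser: the two patterns are simplified assumptions, and `linkify` is a hypothetical helper name. It wraps anything that looks like an http(s) URL or an email address in an anchor with an href.

```python
import re

# Simplified patterns (assumptions): real URLs and addresses can be messier.
LINK_RE = re.compile(
    r'(?P<url>https?://[^\s<>"]+)|(?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)'
)


def _to_anchor(match: re.Match) -> str:
    url = match.group("url")
    if url:
        # Keep sentence punctuation that trails the URL outside the link.
        trimmed = url.rstrip(".,;:!?)")
        tail = url[len(trimmed):]
        return f'<a href="{trimmed}">{trimmed}</a>{tail}'
    addr = match.group("email")
    return f'<a href="mailto:{addr}">{addr}</a>'


def linkify(text: str) -> str:
    """Turn bare URLs and email addresses in text into HTML links."""
    return LINK_RE.sub(_to_anchor, text)


if __name__ == "__main__":
    print(linkify("See http://example.com/page. Mail jane@example.edu."))
```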
Practical Steps
- Paste into Notepad, HTML-Kit, or another text editor
- Or use Paste Special in Dreamweaver
- In your word processor, turn off Smart Quotes (or use Replace)
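If the text has already been pasted, the Replace step can also be done with a short script. This is a minimal sketch that assumes only the four curly quote characters need straightening; the names `SMART_TO_PLAIN` and `straighten_quotes` are hypothetical.

```python
# Map the four "smart" quote characters to their plain ASCII equivalents.
SMART_TO_PLAIN = str.maketrans({
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote / apostrophe
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
})


def straighten_quotes(text: str) -> str:
    """Replace curly quotes with straight ones; leave everything else alone."""
    return text.translate(SMART_TO_PLAIN)


if __name__ == "__main__":
    print(straighten_quotes("\u201cIt\u2019s fine,\u201d she said."))
```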
Forms
This is a very large topic. The basic issue is this: form design is fundamentally different on paper than on a computer.
With paper, you tend to design so that everything fits on one page. Even on a multipage form, you try to make sure the page breaks at a logical place. The size of a box limits the amount of information that can go into the box. There’s no way to present a list of options, unless it’s a very short list. There’s no data validation.
A web form can be multiple screens long. It can do conditional branching, so that different questions are asked based on previous responses; it can be interactive. It can provide supplemental information. It can do data validation. It can even be multimedia.
To put it another way: a paper form is a document; a web form is a process.
The worst thing you can do is to take that paper form and “put it on the web.” That paper-based form is nearly always inadequate. But because form design is hard, and reviewing business processes is even harder, people dodge the whole topic by just putting the paper form up as a PDF.
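To make the idea of a form as a process concrete, here is a small sketch of two things a web form can do that paper cannot: data validation and conditional branching. The field names (`name`, `email`, `is_student`, `student_id`) and both functions are invented for illustration, and the email check is deliberately loose.

```python
import re


def validate(answers: dict) -> list:
    """Return a list of error messages; an empty list means no errors."""
    errors = []
    if not answers.get("name", "").strip():
        errors.append("Name is required.")
    # Loose check: something@something.something, with no spaces.
    if not re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", answers.get("email", "")):
        errors.append("Email address looks invalid.")
    return errors


def next_question(answers: dict):
    """Conditional branching: pick the next question from earlier answers."""
    if "is_student" not in answers:
        return "is_student"
    if answers["is_student"] and "student_id" not in answers:
        return "student_id"  # only students are asked for an ID
    if "email" not in answers:
        return "email"
    return None  # nothing left to ask


if __name__ == "__main__":
    print(next_question({"is_student": True}))
    print(validate({"name": "", "email": "not-an-address"}))
```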
Special Characters
Here is a list of special characters that often pop up in word processing documents:
| Name | "Untranslated" | Character Entity | Symbol |
|---|---|---|---|
| em dash | – | `&mdash;` | — |
| cent | ¢ | `&cent;` | ¢ |
| yen | ¥ | `&yen;` | ¥ |
| copyright | © | `&copy;` | © |
| one-quarter | ¼ | `&frac14;` | ¼ |
| cedilla | ç | `&cedil;` | ¸ |
| accented e | é | `&eacute;` | é |
| Euro | € | `&euro;` | € |
| left double quotes | “ | `&ldquo;` | “ |
| apostrophe | ‘ | `&rsquo;` | ’ |
And so on. There are literally hundreds of symbols, and each of them has a corresponding character entity code.
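One way to check which symbol an entity code stands for is to decode it. The sketch below uses Python's standard `html.unescape` on the standard HTML entity names for the symbols in the table above; the list `ENTITIES` and the helper `decode_all` are hypothetical names chosen for this example.

```python
from html import unescape

# Standard HTML named entities for the symbols listed in the table.
ENTITIES = [
    "&mdash;", "&cent;", "&yen;", "&copy;", "&frac14;",
    "&cedil;", "&eacute;", "&euro;", "&ldquo;", "&rsquo;",
]


def decode_all(entities: list) -> dict:
    """Map each entity code to the character it represents."""
    return {entity: unescape(entity) for entity in entities}


if __name__ == "__main__":
    for entity, symbol in decode_all(ENTITIES).items():
        print(entity, "->", symbol)
```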
