If you read CAT tool forums at ProZ, Translators Café, Yahoo Groups and elsewhere, you will have noticed that questions about rogue or junk tags come up most weeks. Of course, if you’ve just opened a file in Studio and are faced with a plague of tags in each segment – even interspersed between characters – you may not have the patience to take a deep breath and search for the answer. A new desperate plea for help is sent out.
Signs & Symptoms
Trados inserts tags within words. Is this a bug?
Studio is riddled with rogue tags. Urgent!
Tag soup in the Editor window. Help!
The good news is that it isn’t a problem with Trados Studio. You’ve got a file that looks like a decent Word document, but in actual fact it was originally a scanned image converted into text using OCR software.
Optical Character Recognition (OCR) is a lifesaver when you need an editable file, but if you don’t pre-process the result you will end up with a tag every time the OCR application thinks there is a change in font, size or spacing. That can happen between every single character in a word.
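To see why the tag count explodes, here is a minimal Python sketch. It models a word as a list of runs, each with its own formatting, and counts the points where the formatting changes; each change stands in for a tag that a CAT tool would have to display. The `Run` class, its property names and the sample values are invented for illustration and are not how Word or Studio store formatting internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """A stretch of text sharing one set of formatting (hypothetical model)."""
    text: str
    font: str = "Arial"
    size: float = 11.0
    spacing: float = 0.0  # character spacing in points, made-up values

    def fmt(self):
        return (self.font, self.size, self.spacing)


def count_format_changes(runs):
    """Count boundaries between neighbouring runs whose formatting differs."""
    return sum(1 for a, b in zip(runs, runs[1:]) if a.fmt() != b.fmt())


def normalise(runs, font="Arial", size=11.0):
    """Force one font, one size and normal spacing on every run."""
    return [Run(r.text, font, size, 0.0) for r in runs]


def merge(runs):
    """Join neighbouring runs that now share identical formatting."""
    merged = []
    for r in runs:
        if merged and merged[-1].fmt() == r.fmt():
            merged[-1] = Run(merged[-1].text + r.text, *r.fmt())
        else:
            merged.append(r)
    return merged


# One word as an OCR tool might leave it: four runs, tiny differences.
ocr_word = [
    Run("T", spacing=0.1),
    Run("ra", spacing=-0.05),
    Run("d", size=11.5),
    Run("os", spacing=0.2),
]

print(count_format_changes(ocr_word))                    # 3
print(count_format_changes(merge(normalise(ocr_word))))  # 0
```

The differences are invisible on the page, yet each one splits the word. Once font, size and spacing are made uniform, the four runs collapse into one and nothing is left to tag.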
It’s not until you open the file in Studio that you see the extent of the problem.
Close the file in Studio and go back to Word where you can clean it up. There are different methods for doing this. Have a read through them and decide which one suits your particular document:
1. Clear the formatting. If the text has little formatting worth keeping (bold, font size, bullet points, etc.), the easiest solution is to clear it all. Select the whole document (Ctrl+A), then go to the Home tab / Font / Clear Formatting. That will leave you with plain text and no tags.
2. Define the basic formatting. If you want to keep some formatting, such as bold, italics and tables, just get rid of the main culprits by defining font, size and spacing. Again, select all (Ctrl+A), then open the Font dialog (Ctrl+D) and select one font (e.g. Arial) and one size (e.g. 11). If you’re using Word 2010, go to the Advanced tab of the Font dialog, select 100% for the scale and Normal for the spacing, and make sure kerning is disabled. If you’re using Word 2007, you’ll find these settings in the Character Spacing tab.
3. Use a macro. If you want to automate the second method, here’s a quick macro to do it:
Click on the macro to download it in .doc format.
If you want to learn more about creating macros yourself, check out how to Record or run a macro in Microsoft Office Help and the Macros and VBA section at Word MVPs.
4. Use CodeZapper. The last, all-in-one solution is a well-known set of macros created by David Turner called CodeZapper. I find it very useful when the formatting is complex and I need the final layout to look just like the original. It is a .dot file that can be simply copied into your Word start-up folder. For €20, you’ll solve a lot of headaches.
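For readers who prefer scripting to macros, the second method can also be sketched outside Word. The Python below is an illustration only: it is not the macro offered in this post, nor what CodeZapper does. It takes a fragment of WordprocessingML (the XML inside a .docx file) and, for every run, removes the character-level properties that OCR tools tend to scatter around (`w:spacing`, `w:w` for scale, `w:kern`, `w:position`), then sets one font and one size while leaving bold and italics alone. The sample XML and the function name `normalise_runs` are made up for the example; a real document would also need unzipping, attention to styles and a backup copy.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)


def q(name):
    """Return the namespace-qualified name of a w: element or attribute."""
    return "{%s}%s" % (W, name)


# Run properties that are dropped or replaced: font, size, character
# spacing, scale (w:w), kerning and raised/lowered position.
RESET = {q(n) for n in ("rFonts", "sz", "szCs", "spacing", "w", "kern", "position")}


def normalise_runs(root, font="Arial", size_pt=11):
    """Give every run one font and size; keep bold, italics and the rest."""
    for run in list(root.iter(q("r"))):
        rpr = run.find(q("rPr"))
        if rpr is None:
            rpr = ET.Element(q("rPr"))
            run.insert(0, rpr)  # run properties come first inside a run
        # Only w:rPr children are touched, so paragraph-level
        # w:spacing (inside w:pPr) is left alone.
        for child in list(rpr):
            if child.tag in RESET:
                rpr.remove(child)
        fonts = ET.Element(q("rFonts"), {q("ascii"): font, q("hAnsi"): font})
        rpr.insert(0, fonts)
        # w:sz is measured in half-points, so 11 pt is stored as 22.
        ET.SubElement(rpr, q("sz"), {q("val"): str(int(size_pt * 2))})
    return root


# A made-up bold word split into two runs by spacing, scale and kerning.
SAMPLE = (
    '<w:p xmlns:w="%s">'
    '<w:r><w:rPr><w:b/><w:spacing w:val="-3"/><w:sz w:val="23"/></w:rPr><w:t>Tra</w:t></w:r>'
    '<w:r><w:rPr><w:b/><w:w w:val="98"/><w:kern w:val="2"/></w:rPr><w:t>dos</w:t></w:r>'
    "</w:p>" % W
)

paragraph = normalise_runs(ET.fromstring(SAMPLE))
print(ET.tostring(paragraph, encoding="unicode"))
```

After the call, both runs carry the same properties (bold, Arial, 11 pt), so there should no longer be a formatting difference between them that needs a tag.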
What can you do to prevent this happening in the future?
In Studio, make sure you’ve ticked the box “Skip advanced font formatting (tracking, kerning, etc.)” for .docx files. This makes Studio ignore all the formatting specified in the Advanced tab of the Font dialog in Word 2010 (the Character Spacing tab in Word 2007).
If you import a PDF straight into Studio, you do so at your own risk! A better solution is to use a PDF converter and process your file in Word before you add it to Studio. Another option – launched in 2011 – is Adobe Export PDF, which has the advantage of preserving headers, footers and bullet points better than other applications, but the disadvantage of not being able to customise settings in advance.
Also, bear in mind that the PDF file type in Studio is only for editable PDFs. Studio won’t cope with a PDF that is actually a scanned image, so you’ll have to use an OCR program for these files, such as ABBYY FineReader or OmniPage. [Edited to add: Studio now imports all PDFs, including non-editable ones, from SDL Trados Studio 2015 onwards.]
A last but very important point is to decide whether you’re going to charge more for translating a PDF or scanned image than an editable file. Should you charge by the hour or add a surcharge to your normal rate? Should you charge per source word or would it be better to use the target word count for these jobs? What do you recommend?
Image attribution: Thanh