If you read CAT tool forums at ProZ, Translators Café, Yahoo Groups and elsewhere, you will have noticed that questions about rogue or junk tags come up most weeks. Of course, if you’ve just opened a file in Studio and are faced with a plague of tags in each segment – even interspersed between characters – you may not have the patience to take a deep breath and search for the answer. A new desperate plea for help is sent out.
Signs & Symptoms
Trados inserts tags within words. Is this a bug?
Studio is riddled with rogue tags. Urgent!
I have a rash of tags. How can I get rid of them?
Tag soup in the Editor window. Help!
The good news is that it isn’t a problem with Trados Studio. You’ve got a file that looks like a decent Word document, but in actual fact it was originally a scanned image converted into text using OCR software.
Optical Character Recognition is a life saver if you need to work on an editable file, but if you don’t pre-process the file you will end up with tags every time the OCR application thinks there is a change in font, size or spacing. And that can be between every single character in a word.
It’s not until you open the file in Studio that you see the extent of the problem.
Close the file in Studio and go back to Word where you can clean it up. There are different methods for doing this. Have a read through them and decide which one suits your particular document:
1. Clear the formatting. If the text doesn’t have much basic formatting (bold, font size, bullet points, etc.) then the easiest solution is to clear all the formatting. Select the whole document (Ctrl+A) then go to Home [tab] / Font / Clear formatting. That will leave you with plain text and no tags.
2. Define the basic formatting. If you want to keep some formatting, such as bold, italics, and tables, just get rid of the main culprits by defining font, size and spacing. Again, select all (Ctrl+A), then open the Font dialog (Ctrl+D) and select one font (e.g. Arial) and one size (e.g. 11). If you’re using Word 2010, go to the Advanced tab of the Font dialog and select 100% for the scale, Normal for spacing and make sure kerning is disabled. If you’re using Word 2007 you’ll find these settings in the Character Spacing tab.
3. Use a macro. If you want to automate the second method, here’s a quick macro to do it:
Click on the macro to download it in .doc format.
If you want to learn more about creating macros yourself, check out how to Record or run a macro in Microsoft Office Help and the Macros and VBA section at Word MVPs.
4. Use CodeZapper. The last, all-in-one solution is a well-known set of macros created by David Turner called CodeZapper. I find it very useful when the formatting is complex and I need the final layout to look just like the original. It is a .dot file that can be simply copied into your Word start-up folder. For €20, you’ll solve a lot of headaches.
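If you would rather script the clean-up than click through Word, here is a minimal sketch in Python of the idea behind the second method. It is not the macro mentioned above, and it doesn't set a font or size; it only strips the character scale, spacing, kerning and position settings from the text runs in the main body of a .docx file. It assumes the document XML uses Word's usual `w:` prefix, it ignores headers, footers and styles, and the function names are my own invention. Always work on a copy of your file.

```python
import re
import zipfile

# Run properties live inside <w:rPr>...</w:rPr>; paragraph spacing lives in
# <w:pPr>, so restricting the clean-up to rPr blocks leaves it untouched.
RPR = re.compile(r"<w:rPr>.*?</w:rPr>", re.DOTALL)

# Character spacing, scale, kerning and raised/lowered position settings.
NOISE = re.compile(r"<w:(?:spacing|w|kern|position)\b[^>]*/>")


def strip_advanced_font_formatting(document_xml: str) -> str:
    """Remove spacing/scale/kerning/position settings from every run."""
    return RPR.sub(lambda m: NOISE.sub("", m.group(0)), document_xml)


def clean_docx(src_path: str, dst_path: str) -> None:
    """Copy a .docx, cleaning only word/document.xml (assumed to be UTF-8)."""
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                xml = data.decode("utf-8")
                data = strip_advanced_font_formatting(xml).encode("utf-8")
            dst.writestr(item, data)
```

A call such as `clean_docx("scanned.docx", "scanned_clean.docx")` (both file names are placeholders) writes a cleaned copy that you can then open in Word to check before adding it to Studio. Font and size changes between characters are not touched by this sketch, so you may still need one of the four methods above.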
What can you do to prevent this happening in the future?
In Studio, make sure you’ve activated the box “Skip advanced font formatting (tracking, kerning, etc.)” for .docx files. This actually ignores all the formatting specified in the Advanced Tab of Word 2010 (Character Spacing tab in Word 2007).
If you import a PDF straight into Studio, you do so at your own risk! A better solution is to use a PDF converter and process your file in Word before you add it to Studio. Another option – launched in 2011 – is Adobe Export PDF, which has the advantage of preserving headers, footers and bullet points better than other applications, but the disadvantage of not being able to customise settings in advance.
Also, bear in mind that the PDF file type in Studio is only for editable PDFs. Studio won’t cope with a PDF that is actually a scanned image, so you’ll have to use an OCR program for these files, such as ABBYY FineReader or OmniPage. [Edited to add: Studio now imports all PDFs, including non-editable ones, from SDL Trados Studio 2015 onwards.]
A last but very important point is to decide whether you’re going to charge more for translating a PDF or scanned image than an editable file. Should you charge by the hour or add a surcharge to your normal rate? Should you charge per source word or would it be better to use the target word count for these jobs? What do you recommend?
Image attribution: Thanh
Excellent post, Emma! Clear, informative and really useful. Thanks also for the heads-up about Code Zapper.
Thank you very much for the very useful article. I personally learnt some methods for getting rid of these annoying tags. Many thanks.
This is great stuff, Emma, thank you very much!
I recommend Dave Turner’s CodeZapper to everyone who has a license for MS Word. It’s the best tool available for cleaning up rogue tags in most texts. Registered users also receive automatic updates by e-mail, and Dave continues to refine his macros.
With regard to charges, it is really time that all translation providers – agencies and freelancers – routinely include tags in their cost calculations. Trados Studio counts tags, and a “word or character count weighting” can be decided for tags. If you choose to count one tag as one word (actual testing has shown that the time burden imposed by a tag is at least at this level, perhaps closer to two words), then 200 tags in a document with a lot of formatting will count as 200 additional words on the invoice. Cleanup work for poorly formatted OCR documents received can also be compensated with such a calculation – after all, it costs time to do the cleanup by any method.
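To make the arithmetic concrete, here is a small sketch in Python of a tag-weighted count. The function name and the 3,000-word document are invented for illustration; only the 200 tags and the one-to-two-words weighting come from the comment above.

```python
def billable_words(word_count: int, tag_count: int, tag_weight: float = 1.0) -> float:
    """Word count for invoicing, with each tag counted as `tag_weight` words."""
    return word_count + tag_count * tag_weight


# Hypothetical 3,000-word document containing 200 tags.
print(billable_words(3000, 200))        # one tag counted as one word
print(billable_words(3000, 200, 2.0))   # one tag counted as two words
```

With a weighting of one word per tag the invoice is based on 3,200 words, and with two words per tag on 3,400.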
Hi Kevin, I like your idea of charging for tags, and if you’ve actually persuaded your clients to pay for them, I’m even more impressed. The mere mention of charging by the hour or adding a surcharge often results in an original editable file suddenly being found, or certainly more effort being put into the OCR, so having the option of charging per tag gives us another, perhaps fairer, card to play.
Thanks for dropping by with your comment,
Reblogged this on multifarious and commented:
A great article Emma… this question does come up quite a lot and you’ve done an excellent job of handling it.
Paul’s hyperlink to multifarious is defunct. Perhaps someone would like to update it?
Thanks for pointing this out, Jim. I’ve updated the link to Paul’s home page.
Thank you Emma, I thought I was going insane! It’s my first project with Studio 2011.
Brilliant! You have just saved my day. All the bad tags have vanished.
I finally had enough extraneous tags and googled to see if there was a solution. Apparently there are several. Thanks for the information!
Besides CodeZapper which you mentioned, there is also Document Cleaner [ http://www.translatortools.net/word-doccleaner.html ], a set of commands for pre-formatting of documents converted from PDF files. Besides the ability to reformat Word documents to remove unnecessary tags caused by formatting, it can also remove various types of bookmarks which are another source of tags.
Document Cleaner is part of the TransTools for Word add-in [ http://www.translatortools.net/word-about.html ], which integrates with Word in a similar way to CodeZapper. TransTools is distributed free of charge.
Thanks for mentioning another option, Stanislav. I think that you’re the developer of Document Cleaner, is that right? I remember trying it a while back and thinking it would be useful for people who work in Word (rather than a standalone application like Studio), but removing bookmarks would certainly be helpful for Studio files. I’ll give it another try soon. Thanks, Emma
I’m pretty new to the translation industry and I find your article very informative!
May I ask you a question?
I converted a PDF source file into a word file, and the sentences are all chopped like this:
Original: I converted a PDF source file into a word file, and the sentences are chopped like this:
Chopped word file: I converted a PDF source file into a word file, and the
sentences are chopped like this:
It was a 25,000-word automobile user’s manual and there were tons of these chopped sentences all over it. I managed to translate everything in Word (good grief!), manually removing the unneeded spaces. I’m just wondering if I could have done it better with some technique.
I’m new to SDL Trados and have never used it before. I purchased Trados 2011 a couple of days ago and am trying to figure out how to use it.
I have tried my Adobe Acrobat Professional and it worked just fine; the text didn’t get chopped!! How amazing. For the user’s manual project mentioned above, I innocently used the PDF-to-Word file provided by the client company, thinking it was the best possible conversion, which it wasn’t. I have learned from this incident that I need to convert PDF files myself even if the company provides me with an already converted file. Thank you anyway for your ideas, Emma!
But I still have one problem: is there any good way to eliminate unneeded hyphens between words like this? audio and elec- tronic devices (I want to eliminate the – between elec and tronic)
I tried Word’s plain “replace” function, but it didn’t work because there are many other hyphens in the document that are necessary (unlike these hyphens in the middle of words). If you know any good way to eliminate only the unneeded ones, please give me some advice! Thank you so much in advance.
Hi Yukiko, Yes, as you’ve discovered, it’s best to control the PDF to Word conversion yourself. I certainly don’t leave it to Studio in the case of editable PDFs. Or to an agency.
RE: hyphens. Try replacing Optional Hyphens under Special in the F&R dialog in Word.
Hope that works!
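If the text has already been copied out of Word as plain text, both problems in this thread (chopped lines and stray hyphens) can also be tackled with a short script. The following is a rough sketch in Python, not a replacement for the Find & Replace tip above. The function names and the small list of "suspended hyphen" words are my own assumptions, it only handles unaccented Latin letters, and it will wrongly glue together genuine compounds that happen to be split with a space (e.g. "well- known"), so check the result.

```python
import re

# Words that often follow a legitimate suspended hyphen ("pre- and post-sale").
SUSPENDED = {"and", "or", "to"}


def rejoin_lines(text: str) -> str:
    """Join a line to the previous one when the previous line does not end in
    sentence punctuation and the new line starts with a lowercase letter."""
    out = []
    for line in text.splitlines():
        line = line.strip()
        if (out and out[-1] and line
                and not out[-1].endswith((".", "!", "?", ":"))
                and line[0].islower()):
            out[-1] += " " + line
        else:
            out.append(line)
    return "\n".join(out)


def join_broken_hyphens(text: str) -> str:
    """Turn 'elec- tronic' into 'electronic'; leave 'well-known' alone."""
    def fix(match):
        head, tail = match.group(1), match.group(2)
        if tail.lower() in SUSPENDED:
            return match.group(0)  # keep "pre- and ..." unchanged
        return head + tail
    return re.sub(r"([A-Za-z]+)-\s+([a-z]+)", fix, text)


print(join_broken_hyphens("audio and elec- tronic devices"))
```

The hyphen pattern only fires when the hyphen is followed by whitespace, which is why ordinary hyphenated words in the document are not affected.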
Very helpful indeed. Thanks a lot for your great suggestions.
Importing a PDF straight into Studio isn’t possible. My Editor window came up blank. I could only pull it in after pushing it through OCR and saving it as a .docx file.
Hi uskapp. That’s probably because your PDF was a scanned image, not editable text. Studio can’t handle it in that case.
That’s strange, because in their webinars they say PDF is one of the formats that Studio 2014 can handle.
I agree, Ulla, that it’s not very transparent at all. Studio Help doesn’t mention it either. Basically, you can put editable PDFs into Studio, but not image-based PDFs. I’ll add a note in my blog post to clarify this.
Reblogged this on Kliping dari WordPress and commented:
How to remove tags in SDL Trados Studio, in English.
Getting rid of the purple things 😀
I haven’t had any tag problems so far, but since I have only been using Trados for a month, I know this will come in handy, sooner or later! Thank you Emma. You made it all sound so simple!
Thank you for sharing your precious Trados experience with us. Could you please write a short article some time explaining how segmentation rules actually work in Studio 2014? I’m quite puzzled by them and can’t work out how to define them properly.
I have even considered proposing that the SDL developers implement the following idea:
Store every manual split of a segment, and every manual merge of segments, as a segmentation rule.
In my case that would be a great relief and would save me time and unnecessary effort.
Thanks for your suggestion. I’ll add segmentation rules to my list of blog ideas.
Glad you liked the article, Laurence. I hope you don’t need to put it into practice very often!
Great! Thank you so much!!!!!
Comments on this post are now closed. If you need help with Studio issues, I recommend the Studio User Community (use your SDL credentials to sign up), the Studio User group on Yahoo Groups or the SDL Trados support forum on ProZ.
If you’d like my advice about solving specific Studio issues or personal guidance as you start out with Studio, please click the contact tab above to arrange a consultancy session.