Not working with Russian

Post by trustfm » Fri Jul 10, 2009 5:17 pm

VISTA + RUSSIAN

Without any modification of Vista, I just loaded a Russian subtitle.
As you can see, the program does not recognize the Russian fonts (top right, because I have Greek fonts enabled in the regional settings). But if I select Russian as the font charset, I get the rendered result below:

[Image: rendered result]

I do not know if the rendered image is correct...

Post by citizen doctor » Mon Jul 13, 2009 6:43 pm

After seeing your working examples, I spent some hours today trying to understand what is going on. I was able to get both Dutch and Russian working, but not with UTF-8. I hadn't tried Russian before in a non-UTF-8 encoding because the text editor I use didn't have an older Cyrillic character encoding. Later, I didn't think to try this with Dutch, either. Today I dug out Microsoft WordPad, an old non-Unicode application, to save the example subtitles in a non-UTF-8 format.

I got the Dutch working when I saved the Dutch example in ANSI encoding. In this and many older encodings, every character is encoded with one byte only. The bytes 0x00-0x7f are pretty much the same for all single-byte encodings (English and some control characters). The range 0x80-0xff is used for non-English characters. There are many variations, usually a different character set (or code page) for each language, although sometimes more than one language is squeezed into the space available. It turns out that the ANSI code page includes many characters used in Western languages that share most of their characters with English (like Dutch). In the ANSI encoding, the Dutch character ë is encoded as the byte 0xeb.

So, if a text file contains the byte 0xeb, this byte will be decoded as ë if the code page is ANSI, л if the code page is set to Russian (it is probably the Windows code page CP-1251), and λ if the code page is set to Greek (probably CP-1253). When I set my Vista locale to Russian and use Microsoft WordPad (non-Unicode), I can paste my subtitles into WordPad and save them in CP-1251, and Txt2Vobsub both displays and renders them properly.
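Here is a small Python sketch of that one-byte, three-meanings behaviour. It assumes that "ANSI" on a Western system means Windows code page CP-1252, and that Python's `cp1252`, `cp1251` and `cp1253` codecs match what Windows uses; neither assumption comes from Txt2Vobsub itself.

```python
# The same single byte, decoded under three Windows code pages.
# Assumption: "ANSI" on a Western locale corresponds to cp1252.
raw = b"\xeb"

decoded = {cp: raw.decode(cp) for cp in ("cp1252", "cp1251", "cp1253")}

for code_page, char in decoded.items():
    print(f"0x{raw.hex()} as {code_page}: {char}")
```

The byte never changes; only the code page chosen for decoding decides which letter appears.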

The character ë is encoded as the two bytes 0xc3 0xab in UTF-8. As far as I can tell, Txt2Vobsub can't decode this. It tries to decode it (and all Russian characters) as if it were two separate single-byte characters, no matter what character set is chosen.
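To illustrate what that looks like, here is a rough Python sketch that takes UTF-8 bytes and decodes them one byte per character, the way a single-byte code page would. This only imitates the symptom I see; I don't know how Txt2Vobsub actually reads files internally.

```python
# UTF-8 stores non-ASCII characters in two or more bytes.
utf8_dutch = "ë".encode("utf-8")    # two bytes: 0xc3 0xab
utf8_russian = "л".encode("utf-8")  # two bytes: 0xd0 0xbb

# Decoding those bytes with a single-byte code page yields two
# unrelated characters per letter instead of the original one.
dutch_as_ansi = utf8_dutch.decode("cp1252")
russian_as_cp1251 = utf8_russian.decode("cp1251")

print(utf8_dutch.hex(), "->", dutch_as_ansi)
print(utf8_russian.hex(), "->", russian_as_cp1251)
```

Whatever charset is selected, each letter comes out as two wrong characters, which matches what I see in the program.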

If there is some way to get Txt2Vobsub to work with UTF-8, I would like to know, but I can get Txt2Vobsub to work now for what I need it for.

Thanks for your help.