String Encoding

Everything about development and the OpenMW source code.
Locked
User avatar
lgromanowski
Site Admin
Posts: 1193
Joined: 05 Aug 2011, 22:21
Location: Wroclaw, Poland
Contact:

String Encoding

Post by lgromanowski »

Zini wrote: I started working on implementing the game mechanics (so I can add the missing script instructions). As a first step I replaced the hard-coded attribute names in the stats window with values from GMSTs.

But here we have a problem: The encoding of those GMST values is apparently not UTF-8, which can cause OpenMW to crash. IMHO what should be done here is to convert the encoding upon loading from the esm/esp, so we can simply can forget about it afterwards. The best place to do so, would be components/esm/esm_reader.hpp.

I am not that experienced with character encodings though (while I understand the theory, I have never worked with anything but UTF-8 or plain ASCII on a modern operating system). Therefore I would prefer if someone else could look into this. For now i have disabled the GMST lookup again. As a replacement the code is using the ID of the GMST as label.
ape wrote: I've looked into that and created a function for this issue.

I'm using Iconv[1], it's included in the gnu libc and it's also
available for windows[2].

I've tested it with the german version of mw on linux and
only by checking the output with the esmtool.

The encoding seems to be "ISO-8859-1" but i do not
know it exactly.

Here's the code: http://pastebin.com/CKy117dG


[1] http://en.wikipedia.org/wiki/Iconv
[2] http://gnuwin32.sourceforge.net/packages/libiconv.htm
Zini wrote: From what I know about common encodings on Windows at the time, when MW was released, your guess about ISO-8859-1 is mostly likely correct.

I presume it was inevitable to add another dependency, though we will have to find or write a CMake file for iconv first, before we can use this library in OpenMW (unless it is already available as part of the cmake distribution).

A few comments about the code:

- Your assumption, that you can get away with inputSize*2 is most likely wrong. I haven't looked into the documentation of iconv yet, but from what I see I guess you should be able to handle unknown output sizes by using a fixed sized buffer and call iconv repeatedly (that's how it is usually done with these C-style interfaces).

- Error messages should go to std::cerr instead of std::cout.

- Also, please lose the new[]/delete[] crap. That's pretty much universally seen as bad practise in C++. We have std::vector.

Code: Select all

std::vector<char> inputBuffer (input.begin(), input.end());

...

if (iconv(cd, &inputBuffer[0], &inputBytesLeft, &outputBuffer[0], &outputBytesLeft) == (size_t)-1)

...

Anyway, pretty good start.
Zini wrote: Found an Findiconv.cmake file: http://gitorious.org/gammu/debian/blobs ... conv.cmake

No idea if it is any good though. My CMake skills aren't that deep.
ape wrote: Thanks for your comments. Here's the improved version:

http://pastebin.com/WkEaCA08
Your assumption, that you can get away with inputSize*2 is most likely wrong. I haven't looked into the documentation of iconv yet, but from what I see I guess you should be able to handle unknown output sizes by using a fixed sized buffer and call iconv repeatedly (that's how it is usually done with these C-style interfaces).
I implemented the calling iconv repeatedly. Does it come close to your imagination?

Maybe i'll take a look at cmake to include iconv.
Zini wrote: Looks reasonable at a first glance. Wanna try to integrate it into OpenMW? I see you already have forked the project on GitHub. So, what you need to do now would be:

- update your local copy of the fork (fetch, merge, and don't forget the submodule update)
- Put the cmake file into openmw/cmake
- Adjust openmw/CMakeLists.txt (find_package and include_directories)
- Adjust openmw/apps/openmw/CMakeLists.txt (target_link_libraries)
- Integrate your function into openmw/components/esm/esm_reader.hpp

btw. I see you are using pastebin. Nothing wrong with that, but GitHub has an integrated code snippet feature. It's called Gist (reachable from your github home page). Maybe it would be better if we keep all openmw stuff in one place.

Edit: Oops! I was wrong about the GitHub account. That was ap0. ;) Maybe you could make a GitHub account? (it's free)
ape wrote: No, i didn't forked it, i just checked it out. Maybe you mistook me for ap0.

But i will try it after i have read me into git :)
ape wrote: I've pushed the code to my fork (apreiml). I'll looking for some other task now :).
Zini wrote: Thanks. Will pull and test later.

If you are looking for things to do, here is a list: http://openmw.org/forum/viewtopic.php?f=13&t=35
Or pick something else you would like to do.
Zini wrote: Well, it works at least partially. I don't get crashes anymore. But non-ASCII characters are still not rendered. I guess it is the font. Maybe it does only contain ASCII characters.

I pushed a change, that removed the attribute-name workaround. If you want to see for yourself, you can pull from my fork.
git remote zini [url=git://github.com/zinnschlag/openmw.git]git://github.com/zinnschlag/openmw.git[/url]
git fetch zini
git merge zini/master
ape wrote: Yeah, but how do i get some text to screen. Are there any
console commands implemented. I've not looked into the mwscript
code right now.

It seems that there was an api change between libiconv v1 and v2.
Unfortunately i found only libiconv-1 for winxp.
I wasn't abel to compile openmw under windows because of some
mygui errors so i'll look into that task later.
Zini wrote: If you take my updated codebase, all you have to do is to press i. That will open the stats window, which lists attribute names according to the GMST settings given in the master file.
ape wrote: Hmm, i've tried to write the names to stdout and pipe them to file.
The enca tool says they are proper utf-8 encoded.

I also guess it could be a font problem, maybe it's too small.
Zini wrote: No. That is not a question of size. Fonts usually don't support all Unicode characters. Each font only provides a subset. And the subset in this font is too small. Conclusion: Someone needs to find us a better font.

Edit: Unless that is what you meant by "too small". That expression is kinda ambiguous.
ape wrote: I noticed that some cell names contains characters like â?? and they weren't proper converted to utf-8, so i changed the encoding code page (in my fork) from ISO-8859-1 to WINDOWS-1252 which is slightly different and seems to work better.
Zini wrote: That is so typical Microsoft. Always requiring special treatment. :(
nicolay wrote: Great work. I have used iconv earlier and it would have been my first choice for conversions as well. It handles just about everything, which means that it will also work for French, German, Polish, Russian, Chinese etc. versions of the Morrowind dataset as long as we set it up to use the right input encodings.

BTW there seem to have been some snafu in the Git merges here. Zini's latest commits are included twice, but under different hashes. Did someone do a rebase? If so, please try to avoid rebasing already commited code in the future as that tend to mess things up.
nicolay wrote: There's an alternative to including iconv in the program itself. That is to use iconv to generate our own lookup tables and our own string translation routines. That way we can include just the finished tables we need directly in the code, rather than using an external library.

I've made some prototype code for this already. I'll write it up as a component once I get the time. Should actually be faster too.
nicolay wrote: Right, my alternative plan worked beautifully. The latest commit uses an internal lookup table for conversion (see components/to_utf8/) rather than an external library. The table is generated by an external program which uses iconv and is included, but since the table itself is included in the repos nobody really needs to bother with this program.

As a result:
- character conversion to UTF-8 should now work also on Windows
- iconv is no longer a dependency, on any platform
- string loading is now much faster (and uses less memory)
- ASCII strings (a whole 99.95% or strings are pure ASCII) are passed through untouched, a huge part of the reason for the speedup

I've tested this and from all I can see it works without any problems. The system can also be extended to support additional input encodings besides Windows-1252, such as those used for central and eastern European versions.
Zini wrote: This is finished then? (btw. works for me except for the bad font)

Then we can close the issue (#2) on github?
nicolay wrote: Correct, thanks for the reminder (the issue is now closed). I think we should put the font issue on the roadmap for the next release as well.
fallenwizard wrote: OpenMW immediately segfaults after the update. I tried to enable the debugging symbols but cmake ignores it for whatever reason.
ape wrote:
fallenwizard wrote:OpenMW immediately segfaults after the update. I tried to enable the debugging symbols but cmake ignores it for whatever reason.
Try to start openmw with the --nosound flag.
fallenwizard wrote:
ape wrote:
fallenwizard wrote:OpenMW immediately segfaults after the update. I tried to enable the debugging symbols but cmake ignores it for whatever reason.
Try to start openmw with the --nosound flag.
Data directory: /home/fw/openmw/data
zsh: segmentation fault ./openmw --nosound
It's not the fault of the mpg123 memory corruption bug because I have the most actual version of mpg123 installed. It is the fault of the iconv integration, something went wrong.
nicolay wrote:
fallenwizard wrote:Data directory: /home/fw/openmw/data
zsh: segmentation fault ./openmw --nosound
Ow, not good. Does it always crash at this point? That is before the ESMs are even loaded, so it's strange that this update should cause you problems (all the changed code is in the ESM reader.)

Could you try pinpointing exactly where it crashes? The only thing of significance that happens between that printout and the BSA loading is OGRE configuration. Does the Ogre window pop up before the segfault?
fallenwizard wrote: Yes, it crashes immediately after the .bsa loading. No OGRE window shows up
ape wrote: Can you post the backtrace?

To generate a backtrace run openmw with db:

Code: Select all

gdb openmw
then type "run" to run the program and after the crash type "bt".
fallenwizard wrote:
ape wrote:Can you post the backtrace?

To generate a backtrace run openmw with db:

Code: Select all

gdb openmw
then type "run" to run the program and after the crash type "bt".
I tried to enable the debugging symbols but cmake ignores it for whatever reason.
gdb without debugging symbols is worthless
nicolay wrote: Ok. There are a couple more things to try. One is a complete recompile, in case this is a compile foul-up (the linker linking in out-of-date files, I've had many segfaults caused by this before, although this shouldn't happen with cmake.) BTW you haven't upgraded OGRE or changed Ogre settings lately?

If that doesn't help, then there is always good old "printf debugging" - ie. adding in printouts around where the crash happens to pinpoint its exact location (I would start around line 208 of apps/openmw/engine.cpp).
fallenwizard wrote:
nicolay wrote:Ok. There are a couple more things to try. One is a complete recompile, in case this is a compile foul-up (the linker linking in out-of-date files, I've had many segfaults caused by this before,
I am always rebuilding openmw completly after a git pull
although this shouldn't happen with cmake.) BTW you haven't upgraded OGRE or changed Ogre settings lately?
no
If that doesn't help, then there is always good old "printf debugging" - ie. adding in printouts around where the crash happens to pinpoint its exact location (I would start around line 208 of apps/openmw/engine.cpp).
I tried this and this line is the culprit:

Code: Select all

    mOgre.configure(!isFile("ogre.cfg"), plugCfg, false);
EDIT: Nevermind, it was my fault. I forgot to edit plugins.cfg to point to the correct path of the OGRE libraries
Locked