Improved File Format for omwaddon files

EvilEye
Posts: 33
Joined: 12 Feb 2014, 13:45
Gitlab profile: https://gitlab.com/Assumeru

Re: Improved File Format for omwaddon files

Post by EvilEye »

bmw wrote: 29 May 2020, 03:13 That's also more or less what I had in mind, unless I'm misunderstanding anything.
You're not misunderstanding anything. I think the coincidence is amusing though.
bmw wrote: 29 May 2020, 03:13 and the only reason for such exact reproductions would be for validating the translation tool.
Which I like the sound of, but I'll admit I can't think of a real reason to implement such a feature besides that.
bmw wrote: 29 May 2020, 03:13 I've already implemented splitting out scripts into their own files (the implementation could use improvement, but it's early days yet), and my thoughts were to have support for include directives so that you could have a main file which includes a bunch of other files, or directories full of files. I don't think grouping record types into directories should be a required thing, but it could certainly be an optional way to structure a project.
Cool. Using include directives would also solve the issue of maintaining record order (though again, that's more about reproducibility.)
lambda
Posts: 70
Joined: 11 Sep 2016, 17:10

Re: Improved File Format for omwaddon files

Post by lambda »

Shooting from the hip here so feel free to ignore, but why not think of the game data as a database and an esp as a series of SQL statements?
Chris
Posts: 1625
Joined: 04 Sep 2011, 08:33

Re: Improved File Format for omwaddon files

Post by Chris »

bmw wrote: 29 May 2020, 03:13 As for problems such as line endings, formatting, encoding and large file sizes, those are the same problems that software development has always had to deal with, and the solutions are no different than any other situation.
Line endings, formatting, and encoding are only an issue when dealing with text-based files that can be edited in any old text editor. Binary files created by specialized tools don't have this problem like text files do. Similarly, binary files take up less space than text files by being able to utilize the storage better (you aren't limited to readable characters and strings to declare information).
bmw wrote: 29 May 2020, 03:13 Large file sizes can be mitigated by breaking them up into smaller, meaningfully structured files, and using include directives of some sort to link the files together.
What I mean by that is taking up more storage, and having to read in more to get the same information. Breaking up into smaller files would actually be worse here given the way disks store files (the physical size allocated on disk for a file is larger than the file because they're stored in set-sized blocks; if a disk uses 4KB blocks, for example, it doesn't matter whether files are 32, 56, 1084, or 3050 bytes, each one is going to take up 4KB regardless). Additionally, separate smaller files will be less efficient since the system needs to open multiple files and jump around on the disk to read them for the same information.
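To put rough numbers on the block-allocation point, here is a minimal Python sketch. It assumes a simple model where every non-empty file occupies a whole number of fixed-size blocks; real filesystems may behave differently (some store very small files inline, for instance), so treat the figures as illustrative only. The function name `allocated_size` is made up for this example.

```python
def allocated_size(file_size: int, block_size: int = 4096) -> int:
    """Bytes occupied on disk when space is handed out in whole blocks.

    Simplified model: ignores inline storage of tiny files, compression,
    and any other filesystem-specific behaviour.
    """
    if file_size <= 0:
        return 0
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size


# The example sizes from the post above.
sizes = [32, 56, 1084, 3050]
for size in sizes:
    print(f"{size:>5} bytes of data -> {allocated_size(size)} bytes on disk")

total_data = sum(sizes)
total_disk = sum(allocated_size(s) for s in sizes)
print(f"total: {total_data} bytes of data, {total_disk} bytes on disk")
```

Under this model, the same four payloads concatenated into one file would fit in two 4KB blocks instead of four.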
bmw wrote: 29 May 2020, 03:13 Inline comments also help significantly in improving the readability of text files. Encoding should be standardized (the yaml spec calls for utf-8, utf-16 or utf-32, for example). Most markup parsers will ignore errant formatting, and DOS line endings are just Unix line endings with extra trailing whitespace on each line.
Which puts extra burdens on the engine and editor to correctly parse the text and figure out what you mean, keeping any extra information (i.e. comments) it otherwise has no use for, rather than simply reading/writing the pertinent data as-is. Just need to look at the trouble the launcher had trying to keep comments in the config files it reads/writes for the active plugins.
Ferk wrote: 29 May 2020, 12:34 Being able to track the history of the changes in the file through version control tools that rely on text-based content is already a big plus.
That's only possible if the editor guarantees consistent ordering of information. For instance, if it writes out a weapon record first followed by a book record, then on some update it starts writing out the book record first followed by the weapon record. To the editor and engine, it makes no difference what order these things are in, but a version control diff would make a mess of it. It's additional things the editor has to worry about that it otherwise wouldn't need to.
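As a sketch of what "consistent ordering" would mean for an editor in practice: sort the records by a stable key before writing them out, and sort the keys inside each record too. The record layout below (dicts with `type` and `id` fields) is an assumption for illustration, not any existing editor's format.

```python
import json


def dump_records(records: list[dict]) -> str:
    """Serialize records in a deterministic order so diffs stay small.

    Records are sorted by (type, id) and keys within each record are
    sorted as well, so the output does not depend on the order the
    editor happened to hold them in memory.
    """
    ordered = sorted(records, key=lambda r: (r["type"], r["id"]))
    return json.dumps(ordered, indent=2, sort_keys=True)


# Hypothetical records; field names are invented for this example.
weapon = {"type": "WEAP", "id": "iron_sword", "damage": 5}
book = {"type": "BOOK", "id": "bk_example", "title": "An Example"}

# Same text regardless of the order the editor hands the records over.
first = dump_records([weapon, book])
second = dump_records([book, weapon])
print(first == second)
```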
ponyrider0
Posts: 14
Joined: 11 Jun 2019, 23:53

Re: Improved File Format for omwaddon files

Post by ponyrider0 »

You all may be interested in the current proof-of-concept github repository we have for development of the Morroblivion v065: https://github.com/ponyrider0/Morroblivion-JSON

In short, this is the full source-code of the Morroblivion ESM with all changes from v064 to the latest v065 developer build stored and tracked as separate commits. Instead of YAML, I am using a JSON-like format, with modifications on legal JSON for easier text processing / version control of certain records like Script and Book records. Currently, it is only a one-way process going from the binary TES4 ESM to decompiled JSON source code. Decompiler source code is here: https://github.com/ponyrider0/ESM2JSON-Scripts.

My goal is to use the TES4 ESM exporter code from this branch of openmw (https://github.com/ponyrider0/openmw) to produce a full ESM compiler/decompiler development and modding system. The inspiration and template for this system is the WeiDU system used for modding of Infinity Engine games: https://weidu.org/. For those unfamiliar with WeiDU, it is a comprehensive byte-decompiler/-compiler, translation manager, regular expression/string processor, binary patcher and mod management system. One thing that seems to be missing from the WeiDU Infinity Engine modding ecosystem is integration with an open-source version control system like git.


Some lessons learned so far:
- Each ESM record must be stored as an individual file in order for git to be able to properly identify, track and merge changes made at the record level. The first iteration of the JSON source-code format combined all records of a given type into a single text file; for example, all activator records were stored in ACTI.json, all alchemy records in ALCH.json, etc. This resulted in efficient physical disk space usage and short decompilation times. However, as mentioned in the previous post by Chris, it was very difficult for git to identify and track changes made in multiple records unless exact, consistent sorting of the records was done each time. And it was impossible to merge source code from a small ESP patch file that contained a subset of records into a larger ESM master file that contained a superset of records. These issues were completely resolved by storing each record in its own file.

- Currently, each record is stored individually with filenames based on their TES4 FormID and organized into file directories based on their record type (ACTI, ALCH, etc.). Adapting this system to use TES3 string-based record IDs should be straightforward as long as the files stay organized in file directories based on record type. With this design, ESP patch files can be developed in separate branches or forks and merged into an ESM master using standard git pull-requests. In fact, the Morroblivion-JSON repository has several branches containing decompiled versions of v064, v065, and several ESP hotfixes which were then successfully merged together using the built-in git branch-merge feature rather than using a separate ESM/ESP based merging tool.

- The biggest issues with the current design have already been partly mentioned on this thread: compiler/decompiler performance and disk space efficiency. The decompilation time is very heavily bottlenecked by the OS filesystem performance. There are approximately 70,000 total records in Morrowind + Tribunal + Bloodmoon. There are ~450,000 total records in Morroblivion ESM (this is because all worldspace/map objects are stored as subrecords in their parent CELL record in TES3, but are stored as full-fledged records in a CELL subfolder in TES4). The current decompilation process uses a TES4Edit-based script which takes about 4 hours to generate the ~450,000 files on Windows NTFS. Based on performance monitor data, I think most of the time is bottlenecked by the Windows filesystem, but I won't know for sure until I port the TES4Edit-based decompilation scripts to python or C++. Regarding disk space usage and efficiency, the current full JSON-source code tree of Morroblivion v065 ESM contains 431,390 files, with a total data-size of 956 MB, and taking up 2.17 GB of disk space. I could probably improve these performance and efficiency numbers by combining all worldspace/object records into a combined JSON file, but then I would lose granularity in tracking those changes within git. I will have to evaluate the cost/benefits for the future.

- Another issue with the current design is that completely deleting a record/file in an ESP can not be propagated to a master with a simple file-tree copy procedure. My current plan is to leverage the ESM format's "Delete" flag bit to mark a record as deleted, then these files can be purged in a post-processing step at any point after merging into the master repository.

- My eventual plan is to replace all hard-coded 32-bit FormIDs in the filename as well as in the record data with string-based record IDs. Then, these can be dynamically resolved into FormIDs when compiling the source code into TES4 engine binary format... or left as string-based record IDs when compiling for TES3/OpenMW engine!

- A pre-/post-processing step should be done after decompilation and before recompilation. This deals with the issues related to whitespace, line-ending, formatting and character encoding which have previously been mentioned on this thread. Example: the string buffers from script and book subrecords should be converted from a byte-packed array into line-by-line text format so that git and diff tools can identify and track the changes in more granular (line-by-line) chunks of data. The general JSON source-code can also be formatted with proper nested indentations for easier readability when manual merges need to be done. During the compilation stage, standard JSON minify and other off-the-shelf pre-processing operations can be used to standardize formatting, clean and strip comments from the source-code so that it is a legal JSON format. The legalized JSON files can then be parsed by existing JSON processing libraries.

- Additional pre-/post-processing steps can be added to do things similar to WeiDU mods, such as a script-based mod package/installer which modifies all strings at mod-installation time using regular expressions. Or a mod which searches for all creature records of a certain type and modifies the attributes by a specific mathematical formula. WeiDU mods can even do very advanced things like decompile in-game scripts from an ESM, then search the decompiled scripts for specific instructions and insert/replace those instructions with new ones, then recompile the scripts into an active ESM file.
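The one-record-per-file layout and the "Delete"-flag purge step from the lessons above can be sketched in Python roughly as follows. This is not the actual ESM2JSON-Scripts code: the field names (`type`, `form_id`, `deleted`), the file naming scheme and the helper names are all assumptions made for the example.

```python
import json
from pathlib import Path


def record_path(root: Path, record: dict) -> Path:
    """Directory per record type, filename from the 32-bit FormID.

    Hypothetical layout, e.g. <root>/ACTI/0001A2B3.json
    """
    return root / record["type"] / f'{record["form_id"]:08X}.json'


def write_record(root: Path, record: dict) -> Path:
    """Write one record to its own file with stable formatting."""
    path = record_path(root, record)
    path.parent.mkdir(parents=True, exist_ok=True)
    text = json.dumps(record, indent=2, sort_keys=True) + "\n"
    path.write_text(text, encoding="utf-8")
    return path


def purge_deleted(root: Path) -> list[Path]:
    """Post-processing step: remove files whose record is flagged deleted.

    The "deleted" key stands in for the ESM "Delete" flag bit here.
    """
    removed = []
    for path in sorted(root.glob("*/*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        if record.get("deleted"):
            path.unlink()
            removed.append(path)
    return removed
```

With a layout like this, a patch that touches three records shows up in git as three changed files, and a record deleted by a patch survives the merge as a flagged file until `purge_deleted` is run on the master tree.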


So what is the ultimate point of this proof-of-concept? To demonstrate that a massive mod like Morroblivion can be developed in a distributed, collaborative manner by leveraging existing version control tools already used by the open-source community. To explore and experiment with new ways of developing and deploying mods when not constrained by a single binary ESM format which is locked to one game engine.


So the current to-do list:
1) Adapt the TES4 export code and script byte-compiler from https://github.com/ponyrider0/openmw (aka modexporter) to complete the decompiler/compiler system.
2) Port the current TES4Edit-based decompiler scripts from https://github.com/ponyrider0/ESM2JSON-Scripts to python or C++.
3) Implement Record ID to FormID dynamic resolver, probably basing it on the FormID lookup system from the above branch of openmw/modexporter.
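For item 3, a Record ID to FormID resolver could look something like the Python sketch below. It relies on the usual TES4 convention that the top byte of a FormID is the plugin's load-order index and the lower 24 bits identify the object; everything else (the class name, starting the counter at 1, treating IDs as case-insensitive) is an assumption for illustration and not how openmw/modexporter does it.

```python
class FormIdResolver:
    """Assign 32-bit FormIDs to string record IDs at compile time.

    Hypothetical helper: the first object ID and the case-insensitive
    lookup are assumptions, not taken from any existing tool.
    """

    def __init__(self, mod_index: int, first_object_id: int = 1):
        if not 0 <= mod_index <= 0xFF:
            raise ValueError("mod index must fit in one byte")
        self.mod_index = mod_index
        self.next_object_id = first_object_id
        self.table: dict[str, int] = {}

    def resolve(self, record_id: str) -> int:
        """Return the FormID for record_id, allocating one if needed."""
        key = record_id.lower()
        if key not in self.table:
            if self.next_object_id > 0xFFFFFF:
                raise OverflowError("ran out of 24-bit object IDs")
            self.table[key] = (self.mod_index << 24) | self.next_object_id
            self.next_object_id += 1
        return self.table[key]


resolver = FormIdResolver(mod_index=0x01)
sword = resolver.resolve("iron_sword")
print(f"{sword:08X}")
```

Persisting `table` between builds would be needed to keep FormIDs stable across releases, since save games refer to them.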


Ideas for the future:
- Adapt openmw/modexporter to export directly to JSON format.
- Adapt the JSON compiler to output TES3 binary compatible format.
- Automate the toolchain and integrate with existing ESM/ESP editors like openmw-cs, TES Construct Set, Morrowind Enchanted Editor, TES4Edit, etc.
- Automate the github repository to compile and package nightly builds of ESM files.
- Create a JSON importer to read JSON format into openmw data structures.
- Create a new mod format that is based on decompiler/string processor/compiler scripts similar to WeiDU mods.
- Dynamically resolve Record ID to FormID at game engine start-up to circumvent 255 maximum mod limit seen in TES4/TES5 engines.
Last edited by ponyrider0 on 30 May 2020, 20:52, edited 1 time in total.
AnyOldName3
Posts: 2668
Joined: 26 Nov 2015, 03:25

Re: Improved File Format for omwaddon files

Post by AnyOldName3 »

If you've extended the JSON format, and haven't done it in a completely insane way, it's probably still valid YAML. You might not need to do any legalising transformations if you just parse it with something else.

Also, Git lets you register a command to turn a binary file into a text version and back that it'll use for diffs and patches. If the single-file approach is significantly faster, but the multi-file one diffs better, you might be able to get the best of both worlds. I'm not sure of the details of this, though, so it may turn out to be useless in this situation.
bmw
Posts: 81
Joined: 04 Jan 2019, 19:42

Re: Improved File Format for omwaddon files

Post by bmw »

I'm going to respond to ponyrider0 in a separate post, as this is already quite long.
lambda wrote: 29 May 2020, 16:54 Shooting from the hip here so feel free to ignore, but why not think of the game data as a database and an esp as a series of SQL statements?
That is an interesting idea, particularly since it would likely be quite efficient to implement (the SQL libraries would deal with most of the work, and we'd basically just be writing up a schema).

One issue, I think, would be performance, though I'm not experienced enough with databases to provide an estimate of what the difference in speed would be.
The other thing I can think of is that SQL scripts are a lot less user-friendly than markup, which takes a lot away from this idea.
Chris wrote: 29 May 2020, 23:03 Line endings, formatting, and encoding are only an issue when dealing with text-based files that can be edited in any old text editor. Binary files created by specialized tools don't have this problem like text files do.
True, you don't have those problems when people are forced to use one of two editors that can handle the files, but again, these are old problems that programmers have had to deal with for a long time when handling source code, and aren't hard to deal with. Even if it were a problem though, saying that you're restricted to using a specific encoding, specific line endings, etc. wouldn't be any worse than the current situation where you're restricted to a specific editor.
Sure, we may get the odd problem with someone attempting to do something weird, but in the end isn't having a more accessible format that can be handled by a wider range of tools better (particularly when almost all of the work is done for us already if we make use of an existing standard like yaml and its libraries)?
Chris wrote: 29 May 2020, 23:03 What I mean by that is taking up more storage, and having to read in more to get the same information. Breaking up into smaller files would actually be worse here given the way disks store files (the physical size allocated on disk for a file is larger than the file because they're stored in set-sized blocks; if a disk uses 4KB blocks, for example, it doesn't matter whether files are 32, 56, 1084, or 3050 bytes, each one is going to take up 4KB regardless). Additionally, separate smaller files will be less efficient since the system needs to open multiple files and jump around on the disk to read them for the same information.
I was more talking about the issue of "The main benefit of being text-based, human-readability, quickly goes out the window with non-small files", as large text files are a pain to edit, but splitting them up and documenting things well solves this.
You are certainly correct that text files take up more space and are slower to read, but as I mentioned before, there's no reason why we can't have both a text format that's optimized for editing and development, and a binary equivalent format that is optimized for parsing which the text format can be transcoded into for use at runtime.
Chris wrote: 29 May 2020, 23:03 Which puts extra burdens on the engine and editor to correctly parse the text and figure out what you mean, keeping any extra information (i.e. comments) it otherwise has no use for, rather than simply reading/writing the pertinent data as-is. Just need to look at the trouble the launcher had trying to keep comments in the config files it reads/writes for the active plugins.
It should only put extra burdens on the markup parser/encoder library that you're using. Many of these support maintaining formatting and comments, and if we have to improve certain libraries that don't, it's only going to make things better for others who might be in a similar situation (sadly it looks like yaml-cpp and yaml-rust both lack support for preserving comments at the moment, but this support will eventually be written in, and we could even contribute to that).
Chris wrote: 29 May 2020, 23:03 That's only possible if the editor guarantees consistent ordering of information. For instance, if it writes out a weapon record first followed by a book record, then on some update it starts writing out the book record first followed by the weapon record. To the editor and engine, it makes no difference what order these things are in, but a version control diff would make a mess of it. It's additional things the editor has to worry about that it otherwise wouldn't need to.
True, but this is basically the same issue as preserving formatting, and good parsing/encoding libraries should be able to manage preserving order in lists (given that, if you specify that something is a list, it should have a consistent order).
What you're talking about is, I think, mainly going to occur if you have someone editing esps in their current form, and trying to convert them into the new text format, which would cause all sorts of problems (mostly due to formatting, structure, and comments. It seems to me that the ordering issue would be really easy to fix in the editor). That's not the purpose of this system though. The idea of having the text format is that, if you use it, it would be what you are using for development, and if you exclusively edit the text format you shouldn't have such problems (and the CS could be modified to handle the text format, preserving comments, structure and formatting).
You wouldn't decompile source code and expect your changes to it to be merged into the original project's source (though admittedly our case is less extreme).
ponyrider0
Posts: 14
Joined: 11 Jun 2019, 23:53

Re: Improved File Format for omwaddon files

Post by ponyrider0 »

AnyOldName3 wrote: 30 May 2020, 14:15 If you've extended the JSON format, and haven't done it in a completely insane way, it's probably still valid YAML. You might not need to do any legalising transformations if you just parse it with something else.
I am using some non-standard, hard-coded scripts to pre-/post-process the JSON files to work with comments and multi-line strings. These behaviors are all hard-coded based on Record type, but transitioning to YAML standard comments and multi-line formatting makes a lot of sense. Thanks.
AnyOldName3 wrote: 30 May 2020, 14:15 Also, Git lets you register a command to turn a binary file into a text version and back that it'll use for diffs and patches. If the single-file approach is significantly faster, but the multi-file one diffs better, you might be able to get the best of both worlds. I'm not sure of the details of this, though, so it may turn out to be useless in this situation.
Thanks, I will try to look into that further. Unfortunately, I think it will still be bottlenecked by the filesystem creation of one record per file -- even if this process is done dynamically.
AnyOldName3
Posts: 2668
Joined: 26 Nov 2015, 03:25

Re: Improved File Format for omwaddon files

Post by AnyOldName3 »

You might be able to output all of those separate files to stdout with some kind of separator between them to say where one file stops and the next starts. That'd make it only one actual file, and it'd exist in memory instead of on the filesystem.
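A rough Python sketch of that idea: pack the per-record "files" into one text stream with a separator line in front of each, and split the stream back apart on the other side. The separator string is arbitrary, and the sketch assumes no line of record content ever starts with it; a real tool would need escaping or a length prefix.

```python
SEPARATOR = "@@@ file: "  # arbitrary marker chosen for this example


def pack(files: dict[str, str]) -> str:
    """Join named text blobs into one stream, one separator line each."""
    chunks = []
    for name in sorted(files):
        body = files[name]
        if not body.endswith("\n"):
            body += "\n"  # keep the next separator at the start of a line
        chunks.append(f"{SEPARATOR}{name}\n{body}")
    return "".join(chunks)


def unpack(stream: str) -> dict[str, str]:
    """Inverse of pack(): split the stream back into named blobs."""
    files: dict[str, str] = {}
    current = None
    for line in stream.splitlines(keepends=True):
        if line.startswith(SEPARATOR):
            current = line[len(SEPARATOR):].rstrip("\n")
            files[current] = ""
        elif current is None:
            raise ValueError("content before the first separator line")
        else:
            files[current] += line
    return files
```

Writing the result of `pack()` to stdout (for example with `sys.stdout.write`) gives a single stream that never touches the filesystem as individual files.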
ponyrider0
Posts: 14
Joined: 11 Jun 2019, 23:53

Re: Improved File Format for omwaddon files

Post by ponyrider0 »

lambda wrote: 29 May 2020, 16:54 Shooting from the hip here so feel free to ignore, but why not think of the game data as a database and an esp as a series of SQL statements?
The game data is already represented on disk and in memory as a "database". All data is stored as records that are organized into separate tables based on the record type. In TES3, the primary key is a string-based Record ID. In TES4 and original TES5, the primary key is a 32bit unsigned integer FormID. However, a major hurdle is that many record types store data hierarchically with arbitrary numbers of subrecords per record. These subrecords would have to be converted into multiple tables to fully represent it in a relational database model -- alternatively, it could just be chunked into a BLOB and ignored by the database system.
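To make the relational mapping a bit more concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It takes the "multiple tables" route with one generic child table that keeps each subrecord's position, tag and raw bytes, which is halfway between a full relational model and the BLOB fallback. The table and column names, and the subrecord tags in the usage note, are illustrative assumptions rather than a proposed schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE record (
    type TEXT NOT NULL,
    id   TEXT NOT NULL,
    PRIMARY KEY (type, id)
);
CREATE TABLE subrecord (
    record_type TEXT    NOT NULL,
    record_id   TEXT    NOT NULL,
    position    INTEGER NOT NULL,  -- preserves subrecord order
    tag         TEXT    NOT NULL,
    data        BLOB    NOT NULL,  -- raw payload, opaque to SQL
    PRIMARY KEY (record_type, record_id, position),
    FOREIGN KEY (record_type, record_id) REFERENCES record (type, id)
);
"""


def open_database() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_record(conn, rec_type, rec_id, subrecords):
    """Insert a record and its ordered (tag, bytes) subrecords."""
    with conn:  # commit on success, roll back on error
        conn.execute("INSERT INTO record VALUES (?, ?)", (rec_type, rec_id))
        conn.executemany(
            "INSERT INTO subrecord VALUES (?, ?, ?, ?, ?)",
            [(rec_type, rec_id, i, tag, data)
             for i, (tag, data) in enumerate(subrecords)],
        )


def subrecords_of(conn, rec_type, rec_id):
    """Return the (tag, bytes) pairs of one record in original order."""
    rows = conn.execute(
        "SELECT tag, data FROM subrecord "
        "WHERE record_type = ? AND record_id = ? ORDER BY position",
        (rec_type, rec_id),
    )
    return rows.fetchall()
```

For example, `add_record(conn, "BOOK", "bk_example", [("NAME", b"bk_example"), ("TEXT", b"Hello")])` stores an arbitrary number of subrecords without needing a dedicated table per subrecord kind.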

Representing a mod file as a series of SQL statements is basically what is already done by WeiDU mods. Certain operations like generating or modifying records based on a procedural algorithm can be very efficiently represented as a few lines of code. One of the downsides of this approach is the processing time required to generate those records at mod-installation time or at game-launch time. Another downside is the inefficiency in storing blocks of data that can not be procedurally generated. In that case, representing the records as raw table data would be more efficient than trying to encode the data as a series of SQL insert statements. Probably the most significant downside to mods based on SQL statements is a steep learning curve to create mod packages. Edited, see WeiDU discussion below: In the Infinity Engine/WeiDU modding community, this steep learning curve has resulted in more than one mod being created with game editors by one person who then relies on another person to help them with WeiDU script coding to package their mod into an installer.
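A toy illustration of "a mod as a series of SQL statements", again with Python's `sqlite3`. The `creature` table, its columns and the values are invented for the example. The procedural change (double the health of every animal) is one line of SQL, while every non-procedural addition still needs its own `INSERT`, which is the storage inefficiency mentioned above.

```python
import sqlite3

# Stand-in for the master file's data; all names and values are made up.
BASE = """
CREATE TABLE creature (
    id     TEXT PRIMARY KEY,
    kind   TEXT NOT NULL,
    health INTEGER NOT NULL
);
INSERT INTO creature VALUES
    ('rat',   'animal', 20),
    ('guar',  'animal', 50),
    ('ghost', 'undead', 40);
"""

# Stand-in for a plugin: a procedural edit, an addition and a deletion.
PLUGIN = """
UPDATE creature SET health = health * 2 WHERE kind = 'animal';
INSERT INTO creature VALUES ('bonewalker_new', 'undead', 60);
DELETE FROM creature WHERE id = 'ghost';
"""


def apply_plugin(base_sql: str, plugin_sql: str) -> dict[str, int]:
    """Load the base data, run the plugin, return health by creature id."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(base_sql)
    conn.executescript(plugin_sql)
    rows = conn.execute("SELECT id, health FROM creature ORDER BY id")
    return dict(rows)


result = apply_plugin(BASE, PLUGIN)
print(result)
```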

AnyOldName3 wrote: 30 May 2020, 18:56 You might be able to output all of those separate files to stdout with some kind of separator between them to say where one file stops and the next starts. That'd make it only one actual file, and it'd exist in memory instead of on the filesystem.
Ah, got it! Yes, if git can perform branch-merge operations: copying data from multiple "files" of one branch and merging them on a file-by-file level with multiple "files" of another branch without actually needing to write to disk, then it definitely would work. However, that is definitely beyond my expertise -- if anyone can help with this, please let me know. Thanks.
Last edited by ponyrider0 on 31 May 2020, 18:19, edited 2 times in total.
bmw
Posts: 81
Joined: 04 Jan 2019, 19:42

Re: Improved File Format for omwaddon files

Post by bmw »

ponyrider0 wrote: 30 May 2020, 07:11 Currently, each record is stored individually with filenames based on their TES4 FormID...
I think this is mostly a problem with the fact that, if I understand correctly, you are using this system more for tracking changes and aren't modifying the json directly. Modifying the editor to produce more consistent output would help, but in the end, the best way would be to have the editor also handle the text format, which should allow it to produce changes which are sane even for multi-record files.
ponyrider0 wrote: 30 May 2020, 07:11 The biggest issues with the current design have already been partly mentioned on this thread: compiler/decompiler performance and disk space efficiency...
Writing 450 000 files can't be particularly fast, but this sort of thing should be able to be done in much less than 4 hours.

With deltaplugin (my prototype tool for handling such a text-based format, written in rust) I'm able to process 169 plugins into a merged plugin in 6 seconds, if you include the time it takes to dump them to yaml and read them back, or 3 seconds if the dump is skipped and it only produces the merged plugin. It goes up to about 28 and 20 seconds if I flush caches to force the system to re-read them from the HDD (the plugins in total take up 372M of disk space, or 85M of data in the yaml dump of the records, noting that only a little more than half of the record types can currently be handled by the tool. It's also worth noting that yaml data takes up less space than json (fewer separators), and that the format I'm using seems to be both more compact in general than the Morroblivion-JSON structure, and also is extra compact due to not duplicating information from overridden records, so 85M of data here is more information than it would be in your Morroblivion-JSON format).
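The "not duplicating information from overridden records" part can be illustrated with a small Python sketch. This is not deltaplugin's actual algorithm or data model; it only assumes that each plugin is a mapping from record ID to the fields it sets, and that later plugins in the load order override earlier ones field by field.

```python
def merge_plugins(plugins: list[dict[str, dict]]) -> dict[str, dict]:
    """Merge plugins in load order, later field values winning.

    Each plugin only lists the fields it changes, so an override does
    not have to repeat the rest of the record. Hypothetical model for
    illustration only.
    """
    merged: dict[str, dict] = {}
    for plugin in plugins:
        for record_id, fields in plugin.items():
            merged.setdefault(record_id, {}).update(fields)
    return merged


# Made-up data: the patch stores only the one field it changes.
master = {"iron_sword": {"damage": 5, "weight": 12.0, "value": 20}}
patch = {"iron_sword": {"damage": 7}}

merged = merge_plugins([master, patch])
print(merged)
```

Note that `merged` is built from fresh dicts, so `master` is left unchanged.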
ponyrider0 wrote: 30 May 2020, 07:11 Another issue with the current design is that completely deleting a record/file in an ESP can not be propagated to a master with a simple file-tree copy procedure. My current plan is to leverage the ESM format's "Delete" flag bit to mark a record as deleted, then these files can be purged in a post-processing step at any point after merging into the master repository.
It should if you use a git commit or a patch file, though. Why are you copying file trees to make changes to the master? Is it because you're modifying the esm, exporting a new tree, and copying it onto the old one? I would think you could just replace the old tree with the new one (or use something like "rsync -a --delete" to apply the changes).
ponyrider0 wrote: 30 May 2020, 07:11 - My eventual plan is to replace all hard-coded 32-bit FormIDs in the filename as well as in the record data with string-based record IDs. Then, these can be dynamically resolved into FormIDs when compiling the source code into TES4 engine binary format... or left as string-based record IDs when compiling for TES3/OpenMW engine!
Dynamically created FormIDs sound like a good idea for supporting the later games. I've just been working with Morrowind plugins, so I hadn't considered how to handle them, but that should reduce the requirement for getting this sort of format to be transcoded into plugins compatible with the later games.
ponyrider0 wrote: 30 May 2020, 07:11 So what is the ultimate point of this proof-of-concept? To demonstrate that a massive mod like Morroblivion can be developed in a distributed, collaborative manner by leveraging existing version control tools already used by the open-source community. To explore and experiment with new ways of developing and deploying mods when not constrained by a single binary ESM format which is locked to one game engine.
I think it's great that you're doing this sort of thing already, and thanks for your input!
To be honest, I think a lot of the problems you're running into are due to not having a complete system, in that you can't yet work directly with the json files, and aren't really that relevant when considering a system that can.