Yeah, the visualizer is console-based, so you can start feeling like a Real Hacker from the movies right about now. But let’s take a look at the tree structure first:



[-] [root]
  [.] @magic = 59 4b 43 30 30 31 00 00
  [.] @magic2 = 18 00 00 00 00 00 00 00
  [.] @unknown1 = 8388634
  [.] @unknown2 = 4660

One can use the arrow keys in the visualizer to walk through all these fields, “Tab” to jump to the hex viewer and back, and “Enter” to open closed tree nodes, show the instances (we’ll talk about them later) and view the hex dumps full screen. To make the article easier to read, I won’t be showing full screenshots of the visualizer, only the interesting parts of the tree as text.

Ok, that was data01.ykc, let’s check out data02.ykc:



[-] [root]
  [.] @magic = 59 4b 43 30 30 31 00 00
  [.] @magic2 = 18 00 00 00 00 00 00 00
  [.] @unknown1 = 560390478
  [.] @unknown2 = 28400

and data03.ykc:



[-] [root]
  [.] @magic = 59 4b 43 30 30 31 00 00
  [.] @magic2 = 18 00 00 00 00 00 00 00
  [.] @unknown1 = 219734364
  [.] @unknown2 = 58440

Not that much, actually. There’s no file directory or anything of the sort. Let’s note the sizes of the original container files and see if any of these values could be offsets or pointers inside the file:

data01.ykc — 8393294 @unknown1 = 8388634

data02.ykc — 560418878 @unknown1 = 560390478

data03.ykc — 219792804 @unknown1 = 219734364
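A hunch like this is easy to double-check by hand. Here’s a minimal Ruby sketch, outside of Kaitai Struct entirely, that rebuilds data01.ykc’s 24-byte header from the values above and compares the suspect field with the container size:

```ruby
# Rebuild data01.ykc's 24-byte header from the values we saw above and
# unpack the two trailing u4 fields ('V' = 32-bit little-endian).
header = "YKC001\x00\x00".b +         # magic
         [0x18, 0].pack('V2') +       # magic2
         [8388634, 4660].pack('V2')   # unknown1, unknown2

unknown1, unknown2 = header[16, 8].unpack('V2')
file_size = 8393294                   # size of data01.ykc on disk

puts "unknown1 = #{unknown1}, file size = #{file_size}"
# unknown1 falls just short of the file size: a strong hint it's an offset
puts "plausible offset" if unknown1 > 0 && unknown1 < file_size
```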

Wow, looks like we’ve hit the bullseye. Let’s check out what’s happening at that offset in the file:

meta:
  id: ykc
  application: Yuka Engine
  endian: le
seq:
  - id: magic
    contents: ["YKC001", 0, 0]
  - id: magic2
    contents: [0x18, 0, 0, 0, 0, 0, 0, 0]
  - id: unknown_ofs
    type: u4
  - id: unknown2
    type: u4
instances:
  unknown3:
    pos: unknown_ofs
    size-eos: true

We’ve added another field named “unknown3”. However, this time it’s not in the “seq” section, but in the “instances” section. It’s actually the very same thing, but “instances” is used to describe fields that don’t go in sequence: they can be anywhere in the stream, and thus require a note of the position (“pos”) to start parsing from. So, our “unknown3” starts at “unknown_ofs” (“pos: unknown_ofs”) and spans up to the end of the file (= stream, so “size-eos: true”). As we have no idea what we’ll get there, so far it will be read just as a byte stream. Nothing too fancy, but let’s take a look:



[-] [root]
  [.] @magic = 59 4b 43 30 30 31 00 00
  [.] @magic2 = 18 00 00 00 00 00 00 00
  [.] @unknown_ofs = 8388634
  [.] @unknown2 = 4660
  [-] unknown3 = 57 e7 7f 00 0a 00 00 00 18 00 00 00 13 02 00 00…

Hey, note the length of that “unknown3”. It matches “unknown2” exactly. So it turns out that the first header of a YKC file is actually not the header itself, but a reference to some other point in the file where the real header can be found. Let’s fix our .ksy file to add this knowledge:

meta:
  id: ykc
  application: Yuka Engine
  endian: le
seq:
  - id: magic
    contents: ["YKC001", 0, 0]
  - id: magic2
    contents: [0x18, 0, 0, 0, 0, 0, 0, 0]
  - id: header_ofs
    type: u4
  - id: header_len
    type: u4
instances:
  header:
    pos: header_ofs
    size: header_len

Nothing too complex here: we’ve just renamed the “unknown” fields to have more meaningful names and replaced “size-eos: true” (which means reading everything up to the end of the file) with “size: header_len” (which specifies the exact number of bytes to read). That’s probably pretty close to the original idea. Load it up once again, and now let’s focus on the field we’ve named “header”. It looks something like this in data01.ykc:

000000: 57 e7 7f 00 0a 00 00 00 18 00 00 00 13 02 00 00

000010: 00 00 00 00 61 e7 7f 00 0b 00 00 00 2b 02 00 00

000020: db 2a 00 00 00 00 00 00 6c e7 7f 00 11 00 00 00

000030: 06 2d 00 00 92 16 00 00 00 00 00 00 7d e7 7f 00

in data02.ykc:

000000: d1 2b 66 21 0c 00 00 00 18 00 00 00 5a 04 00 00

000010: 00 00 00 00 dd 2b 66 21 14 00 00 00 72 04 00 00

000020: 26 1a 00 00 00 00 00 00 f1 2b 66 21 16 00 00 00

000030: 98 1e 00 00 a8 32 00 00 00 00 00 00 07 2c 66 21

in data03.ykc:

000000: ec 30 17 0d 26 00 00 00 18 00 00 00 48 fd 00 00

000010: 00 00 00 00 12 31 17 0d 26 00 00 00 60 fd 00 00

000020: 0d 82 03 00 00 00 00 00 38 31 17 0d 26 00 00 00

000030: 6d 7f 04 00 d0 85 01 00 00 00 00 00 5e 31 17 0d

At first glance, it doesn’t make any sense at all. On second thought, though, a sequence of repeating bytes catches the eye: `e7 7f` in the first file, `2b 66` in the second, and `30 17` with `31 17` in the third. It looks very much like we’re dealing with fixed-length records here, 0x14 (20 decimal) bytes long. This hypothesis agrees with the header lengths in all three files too: 4660, 28400, and 58440 are all divisible by 20. Let’s give it a try:

meta:
  id: ykc
  application: Yuka Engine
  endian: le
seq:
  - id: magic
    contents: ["YKC001", 0, 0]
  - id: magic2
    contents: [0x18, 0, 0, 0, 0, 0, 0, 0]
  - id: header_ofs
    type: u4
  - id: header_len
    type: u4
instances:
  header:
    pos: header_ofs
    size: header_len
    type: header
types:
  header:
    seq:
      - id: entries
        size: 0x14
        repeat: eos

Check out what happened with the “header” instance here. It’s still positioned at “header_ofs” and has a size of “header_len” bytes, but it’s no longer a mere byte array. It now has its own type, “type: header”. This means that we can specify a custom type, and that type will be used to process the given field. Here it is, right below, in the “types:” section. As you might have already guessed, a type follows exactly the same format as the main file (i.e. the root): one can use the same “seq” section to specify a sequence of subfields, “instances”, its own subtypes (“types”), and so on.

So, we’ve specified a “header” type, which consists of a single field, “entries”. We know that each of these needs to be 0x14 bytes long (“size: 0x14”), and we demand that it be repeated as long as possible (i.e. up to the end of the stream, hence “repeat: eos”).

By the way, note that the concept of a “stream” here is not the same as before, when “stream” effectively meant “whole file”. This time we’re dealing with a substructure that has a fixed size (“size: header_len”), so the repetition will be limited by that size anyway. We can rest assured that if there were anything beyond that length, it wouldn’t contaminate this structure of ours.
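To see why “repeat: eos” is safe here, consider a hand-written sketch of the same logic over a size-limited substream (plain Ruby with StringIO, not generated code; the 60-byte buffer stands in for a “size: header_len” slice):

```ruby
require 'stringio'

# The parent type carves out a fixed-size substream; "repeat: eos"
# keeps reading 20-byte records until *that substream* runs out,
# not until the end of the whole file.
substream = StringIO.new("x" * 60)   # pretend header_len = 60

entries = []
entries << substream.read(20) until substream.eof?

puts entries.size   # 60 / 20 = 3 records
```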

Ok, let’s give it a try:



[-] header
  [-] @entries (233 = 0xe9 entries)
    [.] 0 = 57 e7 7f 00|0a 00 00 00|18 00 00 00|13 02 00 00|00 00 00 00
    [.] 1 = 61 e7 7f 00|0b 00 00 00|2b 02 00 00|db 2a 00 00|00 00 00 00
    [.] 2 = 6c e7 7f 00|11 00 00 00|06 2d 00 00|92 16 00 00|00 00 00 00
    [.] 3 = 7d e7 7f 00|14 00 00 00|98 43 00 00|69 25 00 00|00 00 00 00
    [.] 4 = 91 e7 7f 00|15 00 00 00|01 69 00 00|d7 12 00 00|00 00 00 00
    [.] 5 = a6 e7 7f 00|12 00 00 00|d8 7b 00 00|27 3f 07 00|00 00 00 00

Now it starts to make some sense, doesn’t it? It really looks like a repeated structure. Let’s check out the second file too:



[-] header
  [-] @entries (1420 = 0x58c entries)
    [.] 0 = d1 2b 66 21|0c 00 00 00|18 00 00 00|5a 04 00 00|00 00 00 00
    [.] 1 = dd 2b 66 21|14 00 00 00|72 04 00 00|26 1a 00 00|00 00 00 00
    [.] 2 = f1 2b 66 21|16 00 00 00|98 1e 00 00|a8 32 00 00|00 00 00 00
    [.] 3 = 07 2c 66 21|16 00 00 00|40 51 00 00|a2 16 00 00|00 00 00 00
    [.] 4 = 1d 2c 66 21|16 00 00 00|e2 67 00 00|89 c4 00 00|00 00 00 00
    [.] 5 = 33 2c 66 21|16 00 00 00|6b 2c 01 00|fa f5 00 00|00 00 00 00

By the way, do you see those “(233 = 0xe9 entries)” and “(1420 = 0x58c entries)”? It’s plausible that these are the numbers of files in the archives. Our first archive is relatively small (8 MiB); dividing it by 233 files yields 36022 bytes per file on average. Looks legit for a bunch of scripts, configs, etc. The second archive is the largest (560 MiB); having 1420 files yields 394661 bytes per file, which looks ok for stuff like images, voice files, etc.
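The back-of-the-envelope math above, sketched in Ruby (the sizes and entry counts are the ones quoted in this article):

```ruby
# Average bytes per file in each container; integer division is
# good enough for a ballpark figure.
containers = {
  'data01.ykc' => [8393294, 233],      # [size in bytes, entry count]
  'data02.ykc' => [560418878, 1420],
}

containers.each do |fn, (size, count)|
  puts "#{fn}: #{count} files, ~#{size / count} bytes per file"
end
```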

`57 e7 7f 00`, `61 e7 7f 00`, `6c e7 7f 00` and so on look very much like an increasing sequence of integers. What could it mean? In the second file it’s `d1 2b 66 21`, `dd 2b 66 21`, `f1 2b 66 21`. Hang on a sec, I think I’ve seen this somewhere already. Let’s roll back to the beginning of our work: that’s it! These values are close to the full length of the file, so it looks like offsets yet again. Ok, let’s try to describe the structure of these 20-byte records. Judging by the looks, I’d say these are 5 integers each. We’ll describe another type named “file_entry”. Giving full listings becomes a bother, so if you’ll excuse me, I won’t copy-paste the whole file from now on and will just show you the changed “types” section:

types:
  header:
    seq:
      - id: entries
        repeat: eos
        type: file_entry
  file_entry:
    seq:
      - id: unknown_ofs
        type: u4
      - id: unknown2
        type: u4
      - id: unknown3
        type: u4
      - id: unknown4
        type: u4
      - id: unknown5
        type: u4

No new .ksy features are tackled here. We’ve added “type: file_entry” for the entries and described this subtype as 5 sequential u4 integers. Checking it out in the visualizer:

Any thoughts? Here’s an idea: “unknown3” is a pointer to the beginning of a file in our archive, and “unknown4” is most likely the length of that file. The arithmetic checks out: 24 + 531 = 555, and 555 + 10971 = 11526. That is, the files simply go sequentially in the container. One might also note the same for “unknown_ofs” and “unknown2”: 8382295 + 10 = 8382305, 8382305 + 11 = 8382316. That means “unknown2” is the length of some other subrecord which begins at the “unknown_ofs” offset. “unknown5” always seems to be equal to 0.
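The contiguity observation can be verified mechanically. A quick sketch over the suspected (offset, length) pairs copied from the first entries of data01.ykc:

```ruby
# (unknown3, unknown4) = suspected (offset, length) of each file body,
# copied from the first entries shown in the visualizer.
bodies = [[24, 531], [555, 10971]]

bodies.each_cons(2) do |(ofs, len), (next_ofs, _)|
  # end of one file == start of the next: files are stored back to back
  puts "#{ofs} + #{len} = #{ofs + len}, next entry starts at #{next_ofs}"
end
```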

Come on, let’s add some special magic into “file_entry” to read these blocks of data, i.e. the record at (unknown_ofs; unknown2) and the file body at (unknown3; unknown4). It would look like this:

file_entry:
  seq:
    - id: unknown_ofs
      type: u4
    - id: unknown_len
      type: u4
    - id: body_ofs
      type: u4
    - id: body_len
      type: u4
    - id: unknown5
      type: u4
  instances:
    unknown:
      pos: unknown_ofs
      size: unknown_len
      io: _root._io
    body:
      pos: body_ofs
      size: body_len
      io: _root._io

Actually, we’ve done that trick with “instances” before, so it’s not that new. The only really new “magic” thing here is the “io: _root._io” specification. What does it do?

Do you remember when I mentioned that KS has a concept of a “stream” that’s being read, and that if you effectively limit that “stream” by offset and size while parsing a substructure, it’s no longer the same “stream” that equaled the whole file at the very beginning? That’s the case here. Without this “io” specification, “pos: body_ofs” would try to seek to the “body_ofs” position in the stream that corresponds to our “file_entry” record, which is only 20 bytes long; that’s not what we want (not to mention that it would result in an error). So we need some special magic to specify that we want to read not from the current IO stream, but from the IO stream that corresponds to the whole file, namely “_root._io”.
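For the curious, here’s roughly what such an instance does under the hood. This is a hand-written illustration, not actual ksc output: seek to an absolute position in the root stream, read, and restore the previous position so sequential parsing can continue undisturbed.

```ruby
require 'stringio'

# Read `len` bytes at absolute position `pos` of `io`, then put the
# stream back where it was.
def read_instance(io, pos, len)
  saved = io.pos
  io.seek(pos)
  data = io.read(len)
  io.seek(saved)
  data
end

# 24 bytes of header up front, a file name stored further down:
root_io = StringIO.new("HEADERHEADERHEADERHEADERstart.yks\x00")
puts read_instance(root_io, 24, 9)   # reads "start.yks"
```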

Ok, what have we got with all that?



[-] @entries (233 = 0xe9 entries)
  [-] 0
    [.] @unknown_ofs = 8382295
    [.] @unknown_len = 10
    [.] @body_ofs = 24
    [.] @body_len = 531
    [.] @unknown5 = 0
    [-] unknown = 73 74 61 72 74 2e 79 6b 73 00
    [-] body = 59 4b 53 30 30 31 01 00 30 00 00 00…
  [-] 1
    [.] @unknown_ofs = 8382305
    [.] @unknown_len = 11
    [.] @body_ofs = 555
    [.] @body_len = 10971
    [.] @unknown5 = 0
    [-] unknown = 73 79 73 74 65 6d 2e 79 6b 67 00
    [-] body = 59 4b 47 30 30 30 00 00 40 00 00 00…

It’s easier to check out with the interactive visualizer, but even in this static shot it’s easy to tell that `73 74 61 72 74 2e 79 6b 73 00` is an ASCII string. Checking out the string representation, it turns out to be “start.yks” with a trailing zero byte. And `73 79 73 74 65 6d 2e 79 6b 67 00` is actually “system.ykg”. Bingo, these are the file names. And now we’re quite sure they’re strings, not just some bytes. Let’s mark it up:

file_entry:
  seq:
    - id: filename_ofs
      type: u4
    - id: filename_len
      type: u4
    - id: body_ofs
      type: u4
    - id: body_len
      type: u4
    - id: unknown5
      type: u4
  instances:
    filename:
      pos: filename_ofs
      size: filename_len
      type: str
      encoding: ASCII
      io: _root._io
    body:
      pos: body_ofs
      size: body_len
      io: _root._io

The new stuff here is “type: str”, which means that the bytes we’ve captured must be interpreted as a string, and “encoding: ASCII”, which specifies the encoding (we’re not really sure yet, but so far it’s been ASCII). Visualizer again:



[-] header
  [-] @entries (233 = 0xe9 entries)
    [-] 0
      [.] @filename_ofs = 8382295
      [.] @filename_len = 10
      [.] @body_ofs = 24
      [.] @body_len = 531
      [.] @unknown5 = 0
      [-] filename = "start.yks\x00"
      [-] body = 59 4b 53 30 30 31 01 00 30 00 00 00…
    [-] 1
      [.] @filename_ofs = 8382305
      [.] @filename_len = 11
      [.] @body_ofs = 555
      [.] @body_len = 10971
      [.] @unknown5 = 0
      [-] filename = "system.ykg\x00"
      [-] body = 59 4b 47 30 30 30 00 00 40 00 00 00…
    [-] 2
      [.] @filename_ofs = 8382316
      [.] @filename_len = 17
      [.] @body_ofs = 11526
      [.] @body_len = 5778
      [.] @unknown5 = 0
      [-] filename = "SYSTEM\\black.PNG\x00"
      [-] body = 89 50 4e 47 0d 0a 1a 0a 00 00 00 0d…

Now, isn’t that nice? Looks like a job well done to me. You can even select individual file bodies, press “w” in the visualizer, type some name, and export these binary blocks as local files. But this is tiresome, and it’s not exactly what we wanted: we wanted to extract all the files at once, keeping their original filenames.

Showdown time

Let’s make a script for that. What do we do to transform our format description into code? That’s where Kaitai Struct shines: you don’t need to retype all those type specifications into code manually. You just get the ksc compiler and run:

ksc -t ruby ykc.ksy

and you’ve got yourself a nice and shiny “ykc.rb” file in your current folder: a library that you can plug in and use straight away. Ok, but how do we do that? Let’s start with something simple, like listing the files on screen:

require_relative 'ykc'

Ykc.from_file('data01.ykc').header.entries.each { |f|
  puts f.filename
}

Cool, huh? Here we go: two lines of code (four, if you count the “require” and the block terminator), and we’ve got that huge listing pumping:

start.yks

system.ykg

SYSTEM\black.PNG

SYSTEM\bt_click.ogg

SYSTEM\bt_select.ogg

SYSTEM\config.yks

SYSTEM\Confirmation.yks

SYSTEM\confirmation_load.png

SYSTEM\confirmation_no.ykg

SYSTEM\confirmation_no_load.ykg

...

Let’s go through what’s going on here step by step:

Ykc.from_file(…) creates a new object of the Ykc class (generated from our .ksy description) by parsing a file from the local filesystem; the fields of this object get filled with whatever is described in the .ksy

.header selects the “header” field in Ykc, thus returning an instance of the Ykc::Header class, which corresponds to the “header” type in the .ksy

.entries selects the “entries” field in the header, returning an array of instances of the Ykc::FileEntry class

.each { |f| … } is a typical Ruby way to do something with each element of a collection

puts f.filename just outputs the string in the “filename” field of a FileEntry to stdout, that is, the screen

It shouldn’t be very hard to write a mass extraction script, but I want to note a couple of things first:

There are path specifications in the “filename” field, and they use “\” (backslash) as the folder separator, because the archive was originally created on a Windows system. If we attempt to create such a path on a UNIX system, it will obediently create a directory with backslashes in its name, so it’s a good idea to convert these “\” into “/” before calls like mkdir_p.

File names are actually zero-terminated (yeah, it looks like C, alright). That’s invisible when you just dump them on screen, but it may become a problem when you try to create a file with a “\0” in its name.

If you look a bit further into the listing, you’ll encounter stuff like this:

"SE\\00050_\x93d\x98b\x82P.ogg\x00"

"SE\\00080_\x83J\x81[\x83e\x83\x93.ogg\x00"

"SE\\00090_\x83`\x83\x83\x83C\x83\x80.ogg\x00"

"SE\\00130_\x83h\x83\x93\x83K\x83`\x83\x83\x82Q.ogg\x00"

"SE\\00160_\x91\x96\x82\xE8\x8B\x8E\x82\xE9\x82Q.ogg\x00"

Do you remember the beginning of the article, when I said that those crazy Japanese programmers use Shift-JIS? That’s exactly it: these file names contain Japanese characters. Let’s change “encoding: ASCII” to “encoding: SJIS” in our filename type description. Don’t forget to recompile ksy → rb, and voilà:

SE\00050_電話１.ogg

SE\00080_カーテン.ogg

SE\00090_チャイム.ogg

SE\00130_ドンガチャ２.ogg

SE\00160_走り去る２.ogg

Even if you don’t read Japanese, you can use something like Google Translate to see that 電話 is actually “phone”, so chances are “SE\00050” is the sound of a phone ringing.
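The transcoding the generated parser now does for us can be reproduced by hand in plain Ruby; the bytes below are taken from one of the listing lines above:

```ruby
# Raw Shift-JIS filename bytes as stored in the archive, with the
# trailing NUL; .b gives a binary (ASCII-8BIT) string literal.
raw = "SE\\00090_\x83`\x83\x83\x83C\x83\x80.ogg\x00".b

# Strip the terminator, tag the bytes as Shift_JIS, transcode to UTF-8.
name = raw.strip.force_encoding('Shift_JIS').encode('UTF-8')

puts name   # SE\00090_チャイム.ogg
```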

Ultimately, our extraction script will look like this:

require 'fileutils'
require_relative 'ykc'

EXTRACT_PATH = 'extracted'

ARGV.each { |ykc_fn|
  Ykc.from_file(ykc_fn).header.entries.each { |f|
    filename = f.filename.strip.encode('UTF-8').gsub("\\", '/')
    dirname = File.dirname(filename)
    FileUtils::mkdir_p("#{EXTRACT_PATH}/#{dirname}")
    File.write("#{EXTRACT_PATH}/#{filename}", f.body)
  }
}

That’s a bit more than 2 lines, but nothing fancy goes on here either. We grab a list of command line arguments (this way, you can run it using something like ./extract-ykc *.ykc), and, again, for every container file we’re iterating over all file entries. We clean up the file name (stripping trailing zero, encoding it to UTF-8 and replacing backslashes with forward slashes), derive directory name (dirname), create the folder if it doesn’t exist (mkdir_p) and, finally, we dump the `f.body` contents there to a file.
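The filename cleanup step is the only subtle part, so here it is isolated, with a synthetic input (the same chain of transformations as in the script above):

```ruby
# A Windows-style, NUL-terminated path as it comes out of the archive:
fn = "SYSTEM\\black.PNG\x00"

clean = fn.strip            # drop the trailing "\0" (strip removes NULs too)
          .encode('UTF-8')  # transcode (a no-op for ASCII-only names)
          .gsub("\\", '/')  # backslashes -> forward slashes

puts clean                  # SYSTEM/black.PNG
puts File.dirname(clean)    # SYSTEM
```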

Our job is complete. You can run the script and see what we get. As we predicted, the images really are in .png format (you can view them with any image viewer), and the music and sounds are in .ogg (so you can listen to them with any player). For example, here are the backgrounds we’ve got in the BG folder:

BG folder unpacked: backgrounds

And TA folder contains sprites which are overlaid over these backgrounds. For example, Mika looks like that:

TA/MIKA folder unpacked: sprites for Mika character

I can give out a little secret: in many Japanese VNs, “standing” sprites are named “ta” or “tati” / “tachi” / “tatsu” / “tatte”. That’s because the Japanese word for “standing” is 立って (tatte), and “to stand up” is 立ち上がる (tachiagaru). That usually contrasts with “fa” or “face” sprites, which are used for the avatar portraits in the dialogue text box and show only the character’s face.

That’s it for today. The two major things left to reverse engineer here are the “yks” and “ykg” files: probably “yks” is the script of the game, and “ykg” is some auxiliary graphics or animation format. Let’s try to tackle them next time.