Extracting Japanese Text from CatSystem2 .cst Files


Views: 951. Published: 2026-02-23 23:24

Last edited: 2026-03-03 21:39

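Usage

python script.py <folder>

The code is released under the MIT license.

This write-up was produced with DeepSeek and GPT-5.3, so read it with some caution.

The code is at the bottom 😘

Original code author: Ephylm411. Reference: https://zhuanlan.zhihu.com/p/623697843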

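1. Background: .cst file structure

According to the analysis by the original code author, Ephylm411, a .cst file is structured as follows.

Outer wrapper (compression layer)

File header (16 bytes)
- 8 bytes: fixed magic string CatScene
- 4 bytes: size of the compressed data
- 4 bytes: size of the decompressed data
Data body: zlib-compressed data; decompressing it yields the .bin file.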
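
To make the outer layer concrete, here is a minimal sketch that wraps a made-up payload and unwraps it again in memory. The names `wrap` and `unwrap` are hypothetical, and the explicit little-endian format `<8sII` is an assumption (the tool at the bottom of this post uses native byte order, which gives the same layout on x86).

```python
import struct
import zlib

MAGIC = b'CatScene'

def wrap(payload: bytes) -> bytes:
    """Build the 16-byte header followed by the zlib-compressed payload."""
    packed = zlib.compress(payload)
    return MAGIC + struct.pack('<II', len(packed), len(payload)) + packed

def unwrap(blob: bytes) -> bytes:
    """Check the header fields and return the decompressed payload."""
    magic, packed_size, raw_size = struct.unpack_from('<8sII', blob)
    if magic != MAGIC:
        raise ValueError('not a CatScene file')
    body = blob[16:]
    if packed_size != len(body):
        raise ValueError('compressed size mismatch')
    raw = zlib.decompress(body)
    if raw_size != len(raw):
        raise ValueError('decompressed size mismatch')
    return raw

# Made-up payload standing in for a real .bin file.
sample = b'made-up payload standing in for a .bin file'
blob = wrap(sample)
restored = unwrap(blob)
```

Both size fields are checked before the payload is trusted, so a truncated or re-compressed file fails early instead of producing a half-parsed script.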

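Inner structure (the decompressed `.bin` file)

File header (16 bytes): four uint32 values
- h0: total length of the data body (excluding the file header)
- h1: number of statement groups (each group contains several statements)
- h2: start offset of the offset table, relative to the end of the file header (the start of the data body)
- h3: start offset of the statement block, relative to the same point
Data body: three consecutive parts
1. Statement table (length = h2)
  One 8-byte entry per group, holding two uint32 values:
  - d10: number of statements in this group
  - d11: global index of the first statement in this group
2. Offset table (length = h3 - h2)
  One uint32 per 4 bytes. The number of entries equals the total number of statements, that is, the sum of d10 over all groups.
  Each offset gives the start position of the corresponding statement inside the statement block (relative to the start of the block).
3. Statement block (length = h0 - h3)
  Format of each statement:
  - [0]: always 0x01 (start-of-statement marker)
  - [1]: statement type (0x20 dialogue text, 0x21 character name, 0x02 wait for input, 0x30 control command, and so on)
  - [2..]: content (Shift-JIS encoded, terminated by 0x00)

Key relations:

h1 * 8 = h2 (statement table size = 8 bytes per group × number of groups)
Sum of the statement counts of all groups = number of offset-table entries = number of statements in the statement block
Each statement runs from its start offset up to and including its terminating 0x00.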
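
The layout can be sketched on a synthetic file. The code below builds a tiny .bin from made-up statements and parses it back. It assumes little-endian fields and offsets counted from the end of the 16-byte header, which is how `FormatCST.__init__` in the source below treats them; `build_bin` and `parse_bin` are hypothetical names, not part of the tool.

```python
import struct

def build_bin(groups):
    """Build a synthetic .bin from a list of groups, each a list of statements."""
    table, first = b'', 0
    for g in groups:
        table += struct.pack('<II', len(g), first)   # d10, d11
        first += len(g)
    offsets, block = b'', b''
    for stmt in (s for g in groups for s in g):
        offsets += struct.pack('<I', len(block))     # start inside the block
        block += stmt
    h2 = len(table)
    h3 = h2 + len(offsets)
    h0 = h3 + len(block)
    return struct.pack('<4I', h0, len(groups), h2, h3) + table + offsets + block

def parse_bin(data):
    """Split a .bin into statements, checking the size relations on the way."""
    h0, h1, h2, h3 = struct.unpack_from('<4I', data)
    body = data[16:]
    if h0 != len(body) or h1 * 8 != h2:
        raise ValueError('header does not match body')
    counts = [d10 for d10, _ in struct.iter_unpack('<II', body[:h2])]
    offsets = [o for (o,) in struct.iter_unpack('<I', body[h2:h3])]
    if sum(counts) != len(offsets):
        raise ValueError('statement count does not match offset table')
    block = body[h3:]
    statements = []
    for start in offsets:
        end = block.index(b'\x00', start + 2) + 1    # include the terminator
        statements.append(block[start:end])
    return statements

# Made-up sample: one name, one dialogue line, one wait-for-input statement.
groups = [
    [b'\x01\x21' + 'テスト'.encode('cp932') + b'\x00',
     b'\x01\x20' + 'こんにちは'.encode('cp932') + b'\x00'],
    [b'\x01\x02\x00'],
]
parsed = parse_bin(build_bin(groups))
```

Searching for the terminator from `start + 2` skips the marker and type bytes, so a statement with empty content (such as type 0x02) is still three bytes long.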

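2. Code walkthrough

2.1 Helper functions

cst2bin(datcst) / bin2cst(datbin)
Handle the outer compression layer. cst2bin checks the magic string and both size fields, then zlib-decompresses and returns the raw .bin data; bin2cst compresses .bin data and prepends a fresh header.
warn() / printWarnings()
Collect and print non-fatal warnings (for example unreachable data regions or unknown type codes).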

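2.2 Core class `FormatCST`

Parses the internal structure of the .bin data and provides iteration, modification, and repacking.

`__init__(self, fc)`

Reads the .cst from the file object fc and decompresses it into bin data.
Parses the bin header to obtain h0, h1, h2, h3.
Splits the body into three parts: b1 (statement table), b2 (offset table), b3 (statement block).
Verifies integrity:
- The second value of each statement-table entry must equal the running total of the statement counts before it, so the start indices are contiguous.
- The statement count must match the number of offset-table entries.
Parses the statement block: locates each statement through the offset table, reads its type and content, and stores it in the list self.d3 (the full binary data of each statement, including the leading 0x01 and the trailing 0x00).
Along the way it checks for regions that no offset reaches and emits a warning for each.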

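`iter()` / `next(skp=True)` / `get(skp=True)`

iter() resets the iterator.
next(skp) moves to the next statement. With skp=True it skips statements that do not need translation and stops only at types 0x20 and 0x21, at 0x30 commands that begin with "scene ", and at every statement that follows an fselect command (the fselect command itself is skipped).
get(skp) returns the content of the current statement. With skp=True the two leading bytes and the trailing 0x00 are stripped; otherwise they are rendered as visible markers such as <\x01>, <\x20>, and <\x00> around the content, which is useful for debugging.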
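
The skipping rule is easier to see in isolation. The generator below is a standalone sketch of the skp=True filter, written against a plain list of statements rather than the `FormatCST` class; the function name `translatable` and the sample command strings are made up for illustration.

```python
def translatable(statements):
    """Yield (index, content) for the statements kept when skp=True."""
    after_fselect = False
    for i, stmt in enumerate(statements):
        kind, content = stmt[1], stmt[2:-1]
        if after_fselect or kind in (0x20, 0x21):
            yield i, content
        elif kind == 0x30 and content.startswith(b'scene '):
            yield i, content
        elif kind == 0x30 and content == b'fselect':
            after_fselect = True   # the command itself is skipped

# Made-up statements; the command texts are placeholders, not real script.
statements = [
    b'\x01\x30scene title\x00',
    b'\x01\x30bg 0 room\x00',
    b'\x01\x21name\x00',
    b'\x01\x20hello\x00',
    b'\x01\x02\x00',
    b'\x01\x30fselect\x00',
    b'\x01\x301 choice_a\x00',
]
kept = [i for i, _ in translatable(statements)]
```

Once `fselect` has been seen, everything after it is kept regardless of type, which is why the choice lines end up in the exported text.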

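`rep(bn)`

Replaces the content of the current statement with the byte string bn, keeping the first two bytes and the trailing 0x00.

`pac()`

Rebuilds the offset table and the statement block and returns the new .bin data (without the outer compression layer).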
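
As a rough illustration of why the offset table has to be regenerated after a replacement, the sketch below replaces one statement and recomputes the offsets. `rebuild` is a hypothetical helper that mirrors the idea of pac(), not the method itself, and little-endian packing is assumed.

```python
import struct

def rebuild(statements):
    """Return (offset_table, statement_block) for a list of full statements."""
    offsets, block = b'', b''
    for stmt in statements:
        offsets += struct.pack('<I', len(block))   # start of this statement
        block += stmt
    return offsets, block

statements = [b'\x01\x20short\x00', b'\x01\x20tail\x00']
# Replace the content of statement 0, keeping the 2-byte prefix and the 0x00.
statements[0] = statements[0][:2] + b'a longer line' + b'\x00'

offsets, block = rebuild(statements)
starts = [o for (o,) in struct.iter_unpack('<I', offsets)]
```

The second statement now starts later than before, so any offset copied from the original file would point into the middle of the first statement.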

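2.3 Text encoding

SCENE_ENCODING = 'cp932': the game stores its text as Shift-JIS (CP932).
TXT_ENCODING = 'utf-8-sig': exported text files use UTF-8 with BOM, which Chinese-language editors handle well.
scene_bytes_to_text() / text_to_scene_bytes() convert between scene bytes and Python strings.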
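
A short round trip shows what the two encodings do in practice. This is only a sketch using a temporary file and a made-up line of text; the file name is arbitrary.

```python
import os
import tempfile

SCENE_ENCODING = 'cp932'
TXT_ENCODING = 'utf-8-sig'

scene_bytes = 'こんにちは'.encode(SCENE_ENCODING)   # as stored in the .cst
text = scene_bytes.decode(SCENE_ENCODING)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.cst.txt')
    # Export: the utf-8-sig codec writes the BOM automatically.
    with open(path, 'w', encoding=TXT_ENCODING) as f:
        f.write(text + '\n')
    with open(path, 'rb') as f:
        raw = f.read()
    # Import: the same codec strips the BOM again on reading.
    with open(path, 'r', encoding=TXT_ENCODING) as f:
        line = f.read().splitlines()[0]

has_bom = raw.startswith(b'\xef\xbb\xbf')
round_trip = line.encode(SCENE_ENCODING)
```

Reading with `utf-8-sig` matters: reading the same file as plain `utf-8` would leave the BOM character at the start of the first line, and it cannot be encoded to CP932.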

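2.4 Single-file functions

depacst_file(sc, sb): decompresses a .cst into a .bin.
unpacst_file(sc, st, skp): extracts text into a .txt. The skp parameter controls whether only translatable text is exported.
repacst_file(sc, st, sd): repacks the edited .txt into a new .cst.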

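2.5 Batch functions

extract_dir(src_dir, skp): processes every .cst in a directory and writes the text to its scene_txt subdirectory.
depacst_dir(src_dir): decompresses every .cst in a directory into its scene_bin subdirectory.
unpacst(skp) / depacst() / repacst(): operate on the default directories scene_cst, scene_txt, scene_bin, and scene_dst.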

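2.6 Command-line parsing and main flow

Several invocation forms are supported:

python script.py [tag] [path]

tag:

0: extract text (skip non-translatable statements); this is the default when tag is omitted
1: repack (for a single file or the default directories; not supported for an arbitrary directory path)
2: decompress to .bin
3: extract all statements (including control commands, nothing skipped)

path: a single .cst file or a directory containing .cst files. When omitted, the default directory scene_cst is used.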
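
How the arguments map to a tag and a path can be sketched with a function that takes the argument list directly, which makes it easy to try out. `parse_args` is a hypothetical stand-in for the tool's parse_arg(), which reads sys.argv instead.

```python
def parse_args(argv):
    """Return (tag, path) for an argument list that excludes the script name."""
    tag, path = 0, None
    if argv:
        if argv[0] in ('0', '1', '2', '3'):
            tag = int(argv[0])
            if len(argv) >= 2:
                path = argv[1]
        else:
            # Anything that is not a tag is treated as the path, with tag 0.
            path = argv[0]
    return tag, path

examples = {
    'no arguments': parse_args([]),
    'tag only': parse_args(['1']),
    'path only': parse_args(['01.cst']),
    'tag and path': parse_args(['3', 'scene_cst']),
}
```

A first argument that is not one of the four tags is always read as a path, so a mistyped tag such as `4` ends up reported as a missing file rather than as an invalid tag.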

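4. Notes

Length limit
The game engine allocates a fixed-length buffer for each statement (sized by the original Japanese text), so the replacement text must not take more bytes than the original statement, or the game will crash when reading it. rep() replaces the content directly without checking the length, so the translator has to ensure that the new text is no longer than the original in bytes. What counts is the CP932 byte length after repacking: there a Chinese character that CP932 can represent takes 2 bytes, the same as a Japanese one, so a translation with a similar character count is relatively safe. The 3 bytes per Chinese character in the UTF-8 .txt are not what ends up in the .cst, so measuring the UTF-8 length overstates the size.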
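
Since rep() does not check lengths, a small pre-check before repacking can catch overlong lines. The helper below is a hypothetical addition, not part of the tool, and the sample strings are made up.

```python
SCENE_ENCODING = 'cp932'

def fits(original: bytes, translated: str) -> bool:
    """True if the translated line needs no more CP932 bytes than the original."""
    return len(translated.encode(SCENE_ENCODING)) <= len(original)

# Original content bytes as they would come out of the .cst (without markers).
original = 'おはようございます'.encode(SCENE_ENCODING)

short_fits = fits(original, 'Good morning')
long_fits = fits(original, 'Good morning to you all')
```

The comparison is done on encoded bytes rather than on character counts, because ASCII letters take one byte and full-width characters take two.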

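Encoding
The game uses CP932 (Shift-JIS) internally; text is exported as UTF-8 with BOM only to make editing easier. On repacking, the program converts the UTF-8 text back to CP932 and raises an error for any character that has no CP932 mapping (for example certain special symbols), reporting the number of the offending line.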
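
To find problem characters before repacking, each character can be test-encoded on its own. This is a minimal sketch; `unencodable` is a hypothetical helper and the sample lines are made up.

```python
def unencodable(line: str, encoding: str = 'cp932'):
    """Return the characters of line that the target encoding cannot represent."""
    bad = []
    for ch in line:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad

clean = unencodable('こんにちは、世界')
dirty = unencodable('Hello 😀 world')
```

Running this over every line of an edited .txt gives a list of characters to replace, instead of stopping at the first failure as the repacking step does.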

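Special statements

0x02 statements carry no content and need no translation.
0x30 control commands should generally be left alone, but the content of scene commands and of the choices introduced by fselect needs translating; with skp=True the program keeps scene commands and every statement after an fselect.
fselect choices usually sit at the end of the file, and their content is translatable.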

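Warnings and errors

The program prints warnings (unreachable data regions, unknown type codes, and so on); they do not stop processing.
If the text file has fewer or more lines than there are statements to replace, a warning is issued but the new file is still written: with too few lines the remaining statements keep their original text, and surplus lines are ignored.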

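Directory layout
When no path is given, a scene_cst folder must exist in the working directory. Output directories are created automatically in batch mode. A file or directory path can also be passed directly.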

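5. Summary

This tool is a localization aid for games built on the CatSystem2 engine (such as NEKOPARA) and can extract and replace the text in their scripts. Its core job is to parse the .cst structure correctly and keep the offset table consistent with the statement block. The code hard-codes some assumptions (such as the meaning of the statement types) but works well on the games it is known to handle. Text length limits and encoding conversion are the main difficulties when localizing games of this kind.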

Source code (Python):

# Original code author: Ephylm411
# Reference: https://zhuanlan.zhihu.com/p/623697843
# Improvements: Kinotern and GPT-5.3 Codex
import os, sys
import struct
import zlib


warnings = []

def warn(value):
    warnings.append(value)

def clearWarnings():
    global warnings
    warnings = []

def printWarnings(sc):
    global warnings
    for wi in warnings:
        print('[WARNING]', sc, wi)
    clearWarnings()


def cst2bin(datcst):
    # CST outer wrapper: magic + compressed size + decompressed size + zlib data.
    tag, sizcst, sizbin = struct.unpack_from('8sII', datcst)
    if tag != b'CatScene':
        raise Exception('Label Mismatch')
    datcst = datcst[16:]
    if sizcst != len(datcst):
        raise Exception('Size Ante Decompress Mismatch')
    datbin = zlib.decompress(datcst)
    if sizbin != len(datbin):
        raise Exception('Size Post Decompress Mismatch')
    return datbin

def bin2cst(datbin):
    # Re-wrap raw scene binary data as a CST.
    datcst = zlib.compress(datbin)
    return b'CatScene' + struct.pack('II', len(datcst), len(datbin)) + datcst


class FormatCST:
    def __init__(self, fc):
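        # Decode and split the CST inner data:
        # - b1: segment mapping table (statement table)
        # - b2: fragment offset table
        # - b3: fragment content area (statement block)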
        b = cst2bin(fc.read())
        (h0, self.h1, self.h2, self.h3), b = struct.unpack_from('4I', b), b[16:]
        self.b1, b2, b3 = b[:self.h2], b[self.h2:self.h3], b[self.h3:]
        if h0 != len(b) or self.h1 * 8 != self.h2 or (self.h3 - self.h2) % 4 != 0:
            raise Exception('Integrity Constraint 0 Violated')

        it = struct.iter_unpack('II', self.b1)
        flag = True
        self.n1 = 0
        while flag:
            try:
                d10, d11 = next(it)
                if d11 != self.n1:
                    flag = False
                self.n1 += d10
            except StopIteration:
                break
        if not flag or self.n1 * 4 != self.h3 - self.h2:
            raise Exception('Integrity Constraint 1 Violated')

        it = struct.iter_unpack('I', b2)
        d2 = []
        while True:
            try:
                d2.append(*next(it))
            except StopIteration:
                break
        if self.n1 != len(d2):
            raise Exception('Integrity Constraint 2 Violated')
        
        ofs = 0
        self.d3 = []
        for i in range(self.n1):
            if ofs < d2[i]:
                warn('Unaccessible Fragment Offset 0x{0:08X}'.format(ofs))
                ofs = d2[i]
            if ofs > d2[i]:
                raise Exception('Overflow Offset 0x{0:08X}'.format(ofs))
            try:
                d30, d31, d32 = struct.unpack_from('3B', b3, ofs)
            except Exception:
                raise Exception('Content Truncated')
            if d30 != 0x01:
                raise Exception('Invalid Offset 0x{0:08X}'.format(ofs))
            if d31 not in (0x02, 0x20, 0x21, 0x30):
                warn('Unknown Code 0x01{1:02X} Offset 0x{0:08X}'.format(ofs, d31))
            ofs += 3
            while d32 != 0x00:
                try:
                    d32, = struct.unpack_from('B', b3, ofs)
                except Exception:
                    raise Exception('Content Truncated')
                ofs += 1
            self.d3.append(b3[d2[i] : ofs])
        if ofs < len(b3):
            warn('Unaccessible Fragment Offset 0x{0:08X}'.format(ofs))
        
    def iter(self):
        # Reset the iteration state.
        self.idx = -1
        self.fslc = False

    def next(self, skp = True):
        # Move to the next fragment. With skp=True only visible text items are kept.
        self.idx += 1
        if skp:
            while self.idx < self.n1:
                if self.fslc:
                    break
                d31, = struct.unpack_from('B', self.d3[self.idx], 1)
                if d31 in (0x20, 0x21):
                    break
                if d31 == 0x30 and self.d3[self.idx][2:8] == b'scene\x20':
                    break
                if d31 == 0x30 and self.d3[self.idx][2:] == b'fselect\x00':
                    self.fslc = True
                self.idx += 1
        if self.idx >= self.n1:
            raise StopIteration
    
    def get(self, skp = True):
        # Return the text body of the current fragment.
        if skp:
            return self.d3[self.idx][2:-1]
        else:
            return b'<\\x01><\\x' + bytes('{0:02X}'.format(self.d3[self.idx][1]), encoding = 'utf-8') + b'>' + self.d3[self.idx][2:-1] + b'<\\x00>'
    
    def rep(self, bn):
        # Replace the current fragment's text, keeping the 2-byte prefix and the trailing null byte.
        self.d3[self.idx] = self.d3[self.idx][:2] + bn + b'\x00'

    def pac(self):
        # Rebuild the offset table and the content area.
        b2, b3 = b'', b''
        ofs = 0
        for i in range(self.n1):
            b2 += struct.pack('I', ofs)
            b3 += self.d3[i]
            ofs += len(self.d3[i])
        b0 = struct.pack('4I', self.h3 + ofs, self.h1, self.h2, self.h3)
        return b0 + self.b1 + b2 + b3


pathcst = 'scene_cst'
pathbin = 'scene_bin'
pathtxt = 'scene_txt'
pathdst = 'scene_dst'
# Script text encoding of the game (NEKOPARA typically uses CP932 / Shift-JIS)
SCENE_ENCODING = 'cp932'
# Export text encoding: UTF-8 with BOM (utf-8-sig)
TXT_ENCODING = 'utf-8-sig'


def scene_bytes_to_text(bn):
    return bn.decode(SCENE_ENCODING)


def text_to_scene_bytes(st):
    return st.encode(SCENE_ENCODING)


def cst_to_txt_name(sc_name):
    # Uniform text naming: 01.cst -> 01.cst.txt
    return sc_name + '.txt'


def read_txt_lines(st):
    # Try UTF-8 with BOM first; fall back to CP932 for legacy text files.
    try:
        with open(st, 'r', encoding = TXT_ENCODING, newline = '') as f:
            return f.read().splitlines()
    except UnicodeDecodeError:
        warn('TXT is not UTF-8; fell back to reading as CP932')
        with open(st, 'r', encoding = SCENE_ENCODING, newline = '') as f:
            return f.read().splitlines()


def depacst_file(sc, sb = None):
    # Single-file mode: .cst -> .bin
    clearWarnings()
    if sb is None:
        sb = os.path.splitext(sc)[0] + '.bin'

    f = open(sc, 'rb')
    try:
        b = cst2bin(f.read())
    finally:
        f.close()

    f = open(sb, 'wb')
    f.write(b)
    f.close()

    printWarnings(os.path.basename(sc))
    return sb


def unpacst_file(sc, st = None, skp = True):
    # Single-file mode: .cst -> .txt
    clearWarnings()
    if st is None:
        st = cst_to_txt_name(sc)

    f = open(sc, 'rb')
    try:
        c = FormatCST(f)
    finally:
        f.close()

    # The with-block closes the output file even if decoding a statement fails.
    with open(st, 'w', encoding = TXT_ENCODING, newline = '\r\n') as f:
        c.iter()
        while True:
            try:
                c.next(skp)
            except StopIteration:
                break
            f.write(scene_bytes_to_text(c.get(skp)))
            f.write('\n')

    printWarnings(os.path.basename(sc))
    return st


def repacst_file(sc, st = None, sd = None):
    # Single-file mode: .cst + .txt -> .new.cst
    clearWarnings()
    if st is None:
        st = cst_to_txt_name(sc)
        # Legacy naming compatibility: 01.txt
        if not os.path.exists(st):
            st_old = os.path.splitext(sc)[0] + '.txt'
            if os.path.exists(st_old):
                warn('Legacy TXT name detected; renaming to *.cst.txt is recommended')
                st = st_old
    if sd is None:
        sd = os.path.splitext(sc)[0] + '.new.cst'

    f = open(sc, 'rb')
    try:
        c = FormatCST(f)
    finally:
        f.close()

    lines = read_txt_lines(st)
    c.iter()
    li = 0
    while True:
        try:
            c.next()
        except StopIteration:
            break
        if li >= len(lines):
            warn('Lack of Text')
            break
        try:
            bn = text_to_scene_bytes(lines[li])
        except UnicodeEncodeError as e:
            raise Exception('Text Encode Error Line {0}: {1}'.format(li + 1, e))
        c.rep(bn)
        li += 1
    if li < len(lines):
        warn('Unused Text Lines: {0}'.format(len(lines) - li))

    f = open(sd, 'wb')
    f.write(bin2cst(c.pac()))
    f.close()

    printWarnings(os.path.basename(sc))
    return sd


def parse_arg():
    # Command-line usage:
    # - python 1.py
    # - python 1.py 0|1|2|3
    # - python 1.py xxx.cst
    # - python 1.py 0|1|2|3 xxx.cst
    # - python 1.py folder
    # - python 1.py 0|2|3 folder
    tag = 0
    src = None
    if len(sys.argv) >= 2:
        if sys.argv[1] in ('0', '1', '2', '3'):
            tag = int(sys.argv[1])
            if len(sys.argv) >= 3:
                src = sys.argv[2]
        else:
            src = sys.argv[1]
    return tag, src


def resolve_src(src):
    # Resolve the input path: either a direct path or a path relative to pathcst.
    if src is None:
        return (None, None)
    if os.path.isfile(src):
        return ('file', src)
    if os.path.isdir(src):
        return ('dir', src)
    cst = os.path.join(pathcst, src)
    if os.path.isfile(cst):
        return ('file', cst)
    cst_dir = os.path.join(pathcst, src)
    if os.path.isdir(cst_dir):
        return ('dir', cst_dir)
    return (None, None)


def extract_dir(src_dir, skp = True):
    # Batch-extract the .cst files in a directory to <dir>/scene_txt.
    lis = os.listdir(src_dir)
    csts = [name for name in lis if name.endswith('.cst')]
    dst_dir = os.path.join(src_dir, 'scene_txt')
    if not os.path.exists(dst_dir):
        os.makedirs(dst_dir)
    s0, s1 = 0, len(csts)
    for sc in csts:
        scp = os.path.join(src_dir, sc)
        stp = os.path.join(dst_dir, cst_to_txt_name(sc))
        try:
            unpacst_file(scp, stp, skp)
            s0 += 1
        except Exception as e:
            print('[ERROR]', sc, e)
    return (s0, s1)


def depacst_dir(src_dir):
    # Batch-decompress the .cst files in a directory to <dir>/scene_bin.
    lis = os.listdir(src_dir)
    csts = [name for name in lis if name.endswith('.cst')]
    dst_dir = os.path.join(src_dir, 'scene_bin')
    if not os.path.exists(dst_dir):
        os.makedirs(dst_dir)
    s0, s1 = 0, len(csts)
    for sc in csts:
        scp = os.path.join(src_dir, sc)
        sbp = os.path.join(dst_dir, os.path.splitext(sc)[0] + '.bin')
        try:
            depacst_file(scp, sbp)
            s0 += 1
        except Exception as e:
            print('[ERROR]', sc, e)
    return (s0, s1)


def depacst():
    liscst = os.listdir(pathcst)
    if not os.path.exists(pathbin):
        os.makedirs(pathbin)
    s0, s1 = 0, 0
    for sc in liscst:
        if not sc.endswith('.cst'):
            continue
        s1 += 1
        sp = os.path.join(pathcst, sc)
        sb = os.path.join(pathbin, sc[:-3] + 'bin')
        try:
            depacst_file(sp, sb)
        except Exception as e:
            print('[ERROR]', sc, e)
            continue

        s0 += 1
    return (s0, s1)


def unpacst(skp = True):
    liscst = os.listdir(pathcst)
    if not os.path.exists(pathtxt):
        os.makedirs(pathtxt)
    s0, s1 = 0, 0
    for sc in liscst:
        if not sc.endswith('.cst'):
            continue
        s1 += 1
        sp = os.path.join(pathcst, sc)
        st = os.path.join(pathtxt, cst_to_txt_name(sc))
        try:
            unpacst_file(sp, st, skp)
        except Exception as e:
            print('[ERROR]', sc, e)
            continue

        s0 += 1
    return (s0, s1)


def repacst():
    liscst = os.listdir(pathcst)
    listxt = os.listdir(pathtxt)
    if not os.path.exists(pathdst):
        os.makedirs(pathdst)
    s0, s1 = 0, 0
    for st in listxt:
        if not st.endswith('.txt'):
            continue
        if st.endswith('.cst.txt'):
            sc = st[:-4]
        else:
            # Legacy naming compatibility: 01.txt -> 01.cst
            sc = st[:-4] + '.cst'
        if sc not in liscst:
            print('[WARNING] Original CST File Missing: ' + sc)
            continue
        s1 += 1
        sp = os.path.join(pathcst, sc)
        stp = os.path.join(pathtxt, st)
        sdp = os.path.join(pathdst, sc)
        try:
            repacst_file(sp, stp, sdp)
        except Exception as e:
            print('[ERROR]', sc, e)
            continue

        s0 += 1
    return (s0, s1)


if __name__ == '__main__':
    tag, src = parse_arg()
    if tag not in (0, 1, 2, 3):
        print('Invalid Parameter')
        sys.exit(1)

    srct, src = resolve_src(src)
    if src is not None and srct == 'file':
        try:
            if tag == 0:
                unpacst_file(src)
            if tag == 1:
                repacst_file(src)
            if tag == 2:
                depacst_file(src)
            if tag == 3:
                unpacst_file(src, skp = False)
            print('1 / 1 completed')
        except Exception as e:
            print('[ERROR]', os.path.basename(src), e)
            sys.exit(1)
        sys.exit()

    if src is not None and srct == 'dir':
        if tag == 0:
            s0, s1 = extract_dir(src)
        elif tag == 2:
            s0, s1 = depacst_dir(src)
        elif tag == 3:
            s0, s1 = extract_dir(src, False)
        else:
            print('[ERROR] Directory mode only supports tag 0/2/3')
            sys.exit(1)
        print('%d / %d completed' % (s0, s1))
        sys.exit()

    if len(sys.argv) >= 2 and sys.argv[1] not in ('0', '1', '2', '3'):
        print('[ERROR] CST Path Missing:', sys.argv[1])
        sys.exit(1)

    if tag == 0:
        s0, s1 = unpacst()
    if tag == 1:
        s0, s1 = repacst()
    if tag == 2:
        s0, s1 = depacst()
    if tag == 3:
        s0, s1 = unpacst(False)
    print('%d / %d completed' % (s0, s1))
