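mdx/mdd files are dictionary files loaded by MDict, GoldenDict and similar programs. Sometimes we want to do our own layout, which means unpacking the mdx/mdd and extracting its text, images, audio and other data. A Python script can do this.
I used Python 2.7 together with readmdict.py, ripemd128.py and pureSalsa20.py.
1. Run python readmdict.py
It reports the error: LZO compression support is not available
2. I opened readmdict.py in an editor, had a look, and commented out the following lines: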
try:
    import lzo
except ImportError:
    lzo = None
    print("LZO compression support is not available")
Then I ran python readmdict.py again:
C:\Python27\readmdict>python readmdict.py
Try Brutal Force on Encrypted Key Blocks
Traceback (most recent call last):
  File "readmdict.py", line 649, in <module>
    mdx = MDX(args.filename, args.encoding, args.substyle, args.passcode)
  File "readmdict.py", line 503, in __init__
    MDict.__init__(self, fname, encoding, passcode)
  File "readmdict.py", line 105, in __init__
    self._key_list = self._read_keys_brutal()
  File "readmdict.py", line 399, in _read_keys_brutal
    key_list = self._decode_key_block(key_block_compressed, key_block_info_list)
  File "readmdict.py", line 205, in _decode_key_block
    if lzo is None:
NameError: global name 'lzo' is not defined
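Commenting out the import only removes the name lzo, which the script still uses later, so I restored the file.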
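The NameError comes from how the script treats lzo as an optional dependency: the try/except block is what binds the name lzo (to None when the import fails), and later code tests "if lzo is None". A minimal sketch of the same pattern follows. The helper optional_import and the made-up module name are assumptions for illustration and are not part of readmdict.py.

```python
import importlib


def optional_import(name):
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# zlib ships with Python, so this name is bound to a real module.
zlib_mod = optional_import("zlib")
# A made-up module name: the import fails, but the name is still bound (to None).
missing_mod = optional_import("surely_not_an_installed_module_xyz")

if missing_mod is None:
    print("optional module is not available, the feature is skipped")
```

Because the name always exists, code further down can check it safely instead of crashing, which is why deleting the block makes things worse rather than better.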
3. Check in the interpreter whether lzo can be imported:
>>> import lzo
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named lzo
>>> exit()
So the lzo module is not installed.
4. Installing it with pip did not work either:
C:\Python27\Scripts>pip install lzo
Collecting lzo
c:\python27\lib\site-packages\pip-7.1.2-py2.7.egg\pip\_vendor\requests\packages\urllib3\util\ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'ProtocolError('Connection aborted.', error(10054, ''))': /simple/lzo/
c:\python27\lib\site-packages\pip-7.1.2-py2.7.egg\pip\_vendor\requests\packages\urllib3\util\ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
Could not find a version that satisfies the requirement lzo (from versions: )
No matching distribution found for lzo
You are using pip version 7.1.2, however version 8.1.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.
C:\Python27\Scripts>
5. According to Stack Overflow, it cannot be installed directly on Windows: http://stackoverflow.com/questions/7517075/how-can-i-install-python-lzo-1-08
Following the answer there, I downloaded python_lzo-1.11-cp27-none-win32.whl from http://www.lfd.uci.edu/~gohlke/pythonlibs/#python-lzo (the cp27 build, because I am on Python 2.7).
Then I installed it with pip install python_lzo-1.11-cp27-none-win32.whl:
C:\Python27\Scripts>pip install python_lzo-1.11-cp27-none-win32.whl
Processing c:\python27\scripts\python_lzo-1.11-cp27-none-win32.whl
Installing collected packages: python-lzo
Successfully installed python-lzo-1.11
If the wheel was built for a different Python version, pip reports:
C:\Python27\Scripts>pip install python_lzo-1.11-cp35-none-win32.whl
python_lzo-1.11-cp35-none-win32.whl is not a supported wheel on this platform.
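The "not a supported wheel" message depends on the tags embedded in the wheel file name (name, version, Python tag, ABI tag, platform tag). The sketch below only splits the file name and compares the Python tag with the running interpreter. The function names are hypothetical, and this is a simplification rather than pip's actual compatibility check, which also looks at the ABI and platform tags.

```python
import sys


def parse_wheel_filename(filename):
    """Split a wheel file name into its dash-separated tags."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file name: %r" % filename)
    parts = filename[:-len(".whl")].split("-")
    if len(parts) == 6:
        # optional build tag between the version and the Python tag
        name, version, _build, python_tag, abi_tag, platform_tag = parts
    elif len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError("unexpected wheel file name: %r" % filename)
    return {
        "name": name,
        "version": version,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def current_cpython_tag():
    """CPython tag of the running interpreter, e.g. 'cp27' for Python 2.7."""
    return "cp%d%d" % sys.version_info[:2]


info = parse_wheel_filename("python_lzo-1.11-cp27-none-win32.whl")
print("%s %s" % (info["python"], info["platform"]))
print("matches this interpreter: %s" % (info["python"] == current_cpython_tag()))
```

On the Python 2.7 used in this post the cp27 wheel matches, while the cp35 wheel does not.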
6. Extract the data from the mdx with Python:
C:\Python27\readmdict>python readmdict.py
======== C:/Python27/readmdict/test.mdx ========
Number of Entries : 34151
Compact : No
Compat : No
GeneratedByEngineVersion : 1.2
Description : <font color=blue size=5><b>[Chinese title: Collins Cobuild Advanced Learner's English Dictionary; the Windows console printed the rest of this Chinese description as mojibake, so it is omitted here]</b></font>
RequiredEngineVersion : 1.2
Format : Html
Encrypted : No
Encoding : UTF-16
StyleSheet :
Title : Title (No HTML code allowed)
KeyCaseSensitive : No
DataSourceFormat : 107
C:\Python27\readmdict>
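The header printed above shows GeneratedByEngineVersion 1.2, and the comments in readmdict.py say that engine versions below 2.0 use LZO, which is why the lzo module was required for this file. The following sketch reads only the header, following the layout used in _read_header (4-byte big-endian length, header text, 4-byte little-endian adler32). The function names, the explicit utf-16-le decoding and the sample bytes are my assumptions for illustration.

```python
import re
import struct
import zlib
from io import BytesIO


def read_mdx_header(f):
    """Parse the header attributes from a binary file object."""
    # 4 bytes, big-endian: length of the header text in bytes
    size = struct.unpack(">I", f.read(4))[0]
    raw = f.read(size)
    # 4 bytes, little-endian: adler32 checksum of the header bytes
    checksum = struct.unpack("<I", f.read(4))[0]
    if checksum != (zlib.adler32(raw) & 0xffffffff):
        raise ValueError("header checksum mismatch")
    # Assumption: UTF-16 little-endian text ending with '\x00\x00'
    text = raw[:-2].decode("utf-16-le")
    return dict(re.findall(r'(\w+)="(.*?)"', text, re.DOTALL))


def needs_lzo(header):
    """Per the readmdict.py comments: engine < 2.0 uses LZO, else zlib."""
    return float(header["GeneratedByEngineVersion"]) < 2.0


# Synthetic header for demonstration; not taken from a real dictionary.
sample_text = u'<Dictionary GeneratedByEngineVersion="1.2" Encoding="UTF-16" Encrypted="No"/>'
sample_raw = sample_text.encode("utf-16-le") + b"\x00\x00"
sample_file = BytesIO(
    struct.pack(">I", len(sample_raw))
    + sample_raw
    + struct.pack("<I", zlib.adler32(sample_raw) & 0xffffffff)
)

header = read_mdx_header(sample_file)
print(header["Encoding"])
print(needs_lzo(header))
```

For a real file, open it with open(path, 'rb') and pass the file object in place of sample_file.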
7. test.txt now appears in the directory that contains the mdx file.
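For re-typesetting, the exported text file can be read back into entries. Judging from the extraction code in readmdict.py, each entry is written as the key, a CRLF, the value, and then a line containing only </>. The parser below is a sketch under that assumption: parse_mdict_source is a hypothetical name, and it would mis-split a value that itself contains a "</>" line.

```python
def parse_mdict_source(data):
    """Split exported bytes into (key, value) pairs."""
    entries = []
    for chunk in data.split(b"</>\r\n"):
        if not chunk.strip():
            continue  # skip the empty piece after the last separator
        key, _, value = chunk.partition(b"\r\n")
        entries.append((key, value.rstrip(b"\r\n")))
    return entries


# Synthetic sample in the same layout; not real dictionary content.
sample = (
    b"apple\r\n<b>apple</b> a round fruit\r\n</>\r\n"
    b"pear\r\n<b>pear</b> another fruit\r\n</>\r\n"
)
for key, value in parse_mdict_source(sample):
    print(key.decode("utf-8") + " -> " + value.decode("utf-8"))
```

To use it on the real output, read the file in binary mode, for example parse_mdict_source(open('test.txt', 'rb').read()).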
Appendix:
readmdict.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# readmdict.py
# Octopus MDict Dictionary File (.mdx) and Resource File (.mdd) Analyser
#
# Copyright (C) 2012, 2013, 2015 Xiaoqiang Wang <xiaoqiangwang AT gmail DOT com>
#
# This program is a free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, version 3 of the License.
#
# You can get a copy of GNU General Public License along this program
# But you can always get it from http://www.gnu.org/licenses/gpl.txt
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
from struct import pack, unpack
from io import BytesIO
import re
import sys
from ripemd128 import ripemd128
from pureSalsa20 import Salsa20
# zlib compression is used for engine version >=2.0
import zlib
# LZO compression is used for engine version < 2.0
try:
    import lzo
except ImportError:
    lzo = None
    print("LZO compression support is not available")
# 2x3 compatible
if sys.hexversion >= 0x03000000:
    unicode = str
def _unescape_entities(text):
    """
    unescape offending tags < > " &
    """
    text = text.replace(b'&lt;', b'<')
    text = text.replace(b'&gt;', b'>')
    text = text.replace(b'&quot;', b'"')
    text = text.replace(b'&amp;', b'&')
    return text
def _fast_decrypt(data, key):
    b = bytearray(data)
    key = bytearray(key)
    previous = 0x36
    for i in range(len(b)):
        t = (b[i] >> 4 | b[i] << 4) & 0xff
        t = t ^ previous ^ (i & 0xff) ^ key[i % len(key)]
        previous = b[i]
        b[i] = t
    return bytes(b)

def _mdx_decrypt(comp_block):
    key = ripemd128(comp_block[4:8] + pack(b'<L', 0x3695))
    return comp_block[0:8] + _fast_decrypt(comp_block[8:], key)

def _salsa_decrypt(ciphertext, encrypt_key):
    s20 = Salsa20(key=encrypt_key, IV=b"\x00"*8, rounds=8)
    return s20.encryptBytes(ciphertext)

def _decrypt_regcode_by_deviceid(reg_code, deviceid):
    deviceid_digest = ripemd128(deviceid)
    s20 = Salsa20(key=deviceid_digest, IV=b"\x00"*8, rounds=8)
    encrypt_key = s20.encryptBytes(reg_code)
    return encrypt_key

def _decrypt_regcode_by_email(reg_code, email):
    email_digest = ripemd128(email.decode().encode('utf-16-le'))
    s20 = Salsa20(key=email_digest, IV=b"\x00"*8, rounds=8)
    encrypt_key = s20.encryptBytes(reg_code)
    return encrypt_key
class MDict(object):
    """
    Base class which reads in header and key block.
    It has no public methods and serves only as code sharing base class.
    """
    def __init__(self, fname, encoding='', passcode=None):
        self._fname = fname
        self._encoding = encoding.upper()
        self._passcode = passcode
        self.header = self._read_header()
        try:
            self._key_list = self._read_keys()
        except:
            print("Try Brutal Force on Encrypted Key Blocks")
            self._key_list = self._read_keys_brutal()

    def __len__(self):
        return self._num_entries

    def __iter__(self):
        return self.keys()

    def keys(self):
        """
        Return an iterator over dictionary keys.
        """
        return (key_value for key_id, key_value in self._key_list)

    def _read_number(self, f):
        return unpack(self._number_format, f.read(self._number_width))[0]

    def _parse_header(self, header):
        """
        extract attributes from <Dict attr="value" ... >
        """
        taglist = re.findall(b'(\w+)="(.*?)"', header, re.DOTALL)
        tagdict = {}
        for key, value in taglist:
            tagdict[key] = _unescape_entities(value)
        return tagdict
    def _decode_key_block_info(self, key_block_info_compressed):
        if self._version >= 2:
            # zlib compression
            assert(key_block_info_compressed[:4] == b'\x02\x00\x00\x00')
            # decrypt if needed
            if self._encrypt & 0x02:
                key_block_info_compressed = _mdx_decrypt(key_block_info_compressed)
            # decompress
            key_block_info = zlib.decompress(key_block_info_compressed[8:])
            # adler checksum
            adler32 = unpack('>I', key_block_info_compressed[4:8])[0]
            assert(adler32 == zlib.adler32(key_block_info) & 0xffffffff)
        else:
            # no compression
            key_block_info = key_block_info_compressed
        # decode
        key_block_info_list = []
        num_entries = 0
        i = 0
        if self._version >= 2:
            byte_format = '>H'
            byte_width = 2
            text_term = 1
        else:
            byte_format = '>B'
            byte_width = 1
            text_term = 0
        while i < len(key_block_info):
            # number of entries in current key block
            num_entries += unpack(self._number_format, key_block_info[i:i+self._number_width])[0]
            i += self._number_width
            # text head size
            text_head_size = unpack(byte_format, key_block_info[i:i+byte_width])[0]
            i += byte_width
            # text head
            if self._encoding != 'UTF-16':
                i += text_head_size + text_term
            else:
                i += (text_head_size + text_term) * 2
            # text tail size
            text_tail_size = unpack(byte_format, key_block_info[i:i+byte_width])[0]
            i += byte_width
            # text tail
            if self._encoding != 'UTF-16':
                i += text_tail_size + text_term
            else:
                i += (text_tail_size + text_term) * 2
            # key block compressed size
            key_block_compressed_size = unpack(self._number_format, key_block_info[i:i+self._number_width])[0]
            i += self._number_width
            # key block decompressed size
            key_block_decompressed_size = unpack(self._number_format, key_block_info[i:i+self._number_width])[0]
            i += self._number_width
            key_block_info_list += [(key_block_compressed_size, key_block_decompressed_size)]
        #assert(num_entries == self._num_entries)
        return key_block_info_list
    def _decode_key_block(self, key_block_compressed, key_block_info_list):
        key_list = []
        i = 0
        for compressed_size, decompressed_size in key_block_info_list:
            start = i
            end = i + compressed_size
            # 4 bytes : compression type
            key_block_type = key_block_compressed[start:start+4]
            # 4 bytes : adler checksum of decompressed key block
            adler32 = unpack('>I', key_block_compressed[start+4:start+8])[0]
            if key_block_type == b'\x00\x00\x00\x00':
                key_block = key_block_compressed[start+8:end]
            elif key_block_type == b'\x01\x00\x00\x00':
                if lzo is None:
                    print("LZO compression is not supported")
                    break
                # decompress key block
                header = b'\xf0' + pack('>I', decompressed_size)
                key_block = lzo.decompress(header + key_block_compressed[start+8:end])
            elif key_block_type == b'\x02\x00\x00\x00':
                # decompress key block
                key_block = zlib.decompress(key_block_compressed[start+8:end])
            # extract one single key block into a key list
            key_list += self._split_key_block(key_block)
            # notice that adler32 returns signed value
            assert(adler32 == zlib.adler32(key_block) & 0xffffffff)
            i += compressed_size
        return key_list
    def _split_key_block(self, key_block):
        key_list = []
        key_start_index = 0
        while key_start_index < len(key_block):
            # the corresponding record's offset in record block
            key_id = unpack(self._number_format, key_block[key_start_index:key_start_index+self._number_width])[0]
            # key text ends with '\x00'
            if self._encoding == 'UTF-16':
                delimiter = b'\x00\x00'
                width = 2
            else:
                delimiter = b'\x00'
                width = 1
            i = key_start_index + self._number_width
            while i < len(key_block):
                if key_block[i:i+width] == delimiter:
                    key_end_index = i
                    break
                i += width
            key_text = key_block[key_start_index+self._number_width:key_end_index]\
                .decode(self._encoding, errors='ignore').encode('utf-8').strip()
            key_start_index = key_end_index + width
            key_list += [(key_id, key_text)]
        return key_list
    def _read_header(self):
        f = open(self._fname, 'rb')
        # number of bytes of header text
        header_bytes_size = unpack('>I', f.read(4))[0]
        header_bytes = f.read(header_bytes_size)
        # 4 bytes: adler32 checksum of header, in little endian
        adler32 = unpack('<I', f.read(4))[0]
        assert(adler32 == zlib.adler32(header_bytes) & 0xffffffff)
        # mark down key block offset
        self._key_block_offset = f.tell()
        f.close()
        # header text in utf-16 encoding ending with '\x00\x00'
        header_text = header_bytes[:-2].decode('utf-16').encode('utf-8')
        header_tag = self._parse_header(header_text)
        if not self._encoding:
            encoding = header_tag[b'Encoding']
            if sys.hexversion >= 0x03000000:
                encoding = encoding.decode('utf-8')
            # GB18030 > GBK > GB2312
            if encoding in ['GBK', 'GB2312']:
                encoding = 'GB18030'
            self._encoding = encoding
        # encryption flag
        #   0x00 - no encryption
        #   0x01 - encrypt record block
        #   0x02 - encrypt key info block
        if header_tag[b'Encrypted'] == b'No':
            self._encrypt = 0
        elif header_tag[b'Encrypted'] == b'Yes':
            self._encrypt = 1
        else:
            self._encrypt = int(header_tag[b'Encrypted'])
        # stylesheet attribute if present takes form of:
        #   style_number # 1-255
        #   style_begin  # or ''
        #   style_end    # or ''
        # store stylesheet in dict in the form of
        # {'number' : ('style_begin', 'style_end')}
        self._stylesheet = {}
        if header_tag.get('StyleSheet'):
            lines = header_tag['StyleSheet'].splitlines()
            for i in range(0, len(lines), 3):
                self._stylesheet[lines[i]] = (lines[i+1], lines[i+2])
        # before version 2.0, number is 4 bytes integer
        # version 2.0 and above uses 8 bytes
        self._version = float(header_tag[b'GeneratedByEngineVersion'])
        if self._version < 2.0:
            self._number_width = 4
            self._number_format = '>I'
        else:
            self._number_width = 8
            self._number_format = '>Q'
        return header_tag
    def _read_keys(self):
        f = open(self._fname, 'rb')
        f.seek(self._key_block_offset)
        # the following numbers could be encrypted
        if self._version >= 2.0:
            num_bytes = 8 * 5
        else:
            num_bytes = 4 * 4
        block = f.read(num_bytes)
        if self._encrypt & 1:
            if self._passcode is None:
                raise RuntimeError('user identification is needed to read encrypted file')
            regcode, userid = self._passcode
            if isinstance(userid, unicode):
                userid = userid.encode('utf8')
            if self.header[b'RegisterBy'] == b'EMail':
                encrypted_key = _decrypt_regcode_by_email(regcode, userid)
            else:
                encrypted_key = _decrypt_regcode_by_deviceid(regcode, userid)
            block = _salsa_decrypt(block, encrypted_key)
        # decode this block
        sf = BytesIO(block)
        # number of key blocks
        num_key_blocks = self._read_number(sf)
        # number of entries
        self._num_entries = self._read_number(sf)
        # number of bytes of key block info after decompression
        if self._version >= 2.0:
            key_block_info_decomp_size = self._read_number(sf)
        # number of bytes of key block info
        key_block_info_size = self._read_number(sf)
        # number of bytes of key block
        key_block_size = self._read_number(sf)
        # 4 bytes: adler checksum of previous 5 numbers
        if self._version >= 2.0:
            adler32 = unpack('>I', f.read(4))[0]
            assert adler32 == (zlib.adler32(block) & 0xffffffff)
        # read key block info, which indicates key block's compressed and decompressed size
        key_block_info = f.read(key_block_info_size)
        key_block_info_list = self._decode_key_block_info(key_block_info)
        assert(num_key_blocks == len(key_block_info_list))
        # read key block
        key_block_compressed = f.read(key_block_size)
        # extract key block
        key_list = self._decode_key_block(key_block_compressed, key_block_info_list)
        self._record_block_offset = f.tell()
        f.close()
        return key_list
    def _read_keys_brutal(self):
        f = open(self._fname, 'rb')
        f.seek(self._key_block_offset)
        # the following numbers could be encrypted, disregard them!
        if self._version >= 2.0:
            num_bytes = 8 * 5 + 4
            key_block_type = b'\x02\x00\x00\x00'
        else:
            num_bytes = 4 * 4
            key_block_type = b'\x01\x00\x00\x00'
        block = f.read(num_bytes)
        # key block info
        # 4 bytes '\x02\x00\x00\x00'
        # 4 bytes adler32 checksum
        # unknown number of bytes follows until '\x02\x00\x00\x00' which marks the beginning of key block
        key_block_info = f.read(8)
        if self._version >= 2.0:
            assert key_block_info[:4] == b'\x02\x00\x00\x00'
        while True:
            fpos = f.tell()
            t = f.read(1024)
            index = t.find(key_block_type)
            if index != -1:
                key_block_info += t[:index]
                f.seek(fpos + index)
                break
            else:
                key_block_info += t
        key_block_info_list = self._decode_key_block_info(key_block_info)
        key_block_size = sum(list(zip(*key_block_info_list))[0])
        # read key block
        key_block_compressed = f.read(key_block_size)
        # extract key block
        key_list = self._decode_key_block(key_block_compressed, key_block_info_list)
        self._record_block_offset = f.tell()
        f.close()
        self._num_entries = len(key_list)
        return key_list
class MDD(MDict):
    """
    MDict resource file format (*.MDD) reader.
    >>> mdd = MDD('example.mdd')
    >>> len(mdd)
    208
    >>> for filename,content in mdd.items():
    ...     print filename, content[:10]
    """
    def __init__(self, fname, passcode=None):
        MDict.__init__(self, fname, encoding='UTF-16', passcode=passcode)

    def items(self):
        """Return a generator which in turn produce tuples in the form of (filename, content)
        """
        return self._decode_record_block()

    def _decode_record_block(self):
        f = open(self._fname, 'rb')
        f.seek(self._record_block_offset)
        num_record_blocks = self._read_number(f)
        num_entries = self._read_number(f)
        assert(num_entries == self._num_entries)
        record_block_info_size = self._read_number(f)
        record_block_size = self._read_number(f)
        # record block info section
        record_block_info_list = []
        size_counter = 0
        for i in range(num_record_blocks):
            compressed_size = self._read_number(f)
            decompressed_size = self._read_number(f)
            record_block_info_list += [(compressed_size, decompressed_size)]
            size_counter += self._number_width * 2
        assert(size_counter == record_block_info_size)
        # actual record block
        offset = 0
        i = 0
        size_counter = 0
        for compressed_size, decompressed_size in record_block_info_list:
            record_block_compressed = f.read(compressed_size)
            # 4 bytes: compression type
            record_block_type = record_block_compressed[:4]
            # 4 bytes: adler32 checksum of decompressed record block
            adler32 = unpack('>I', record_block_compressed[4:8])[0]
            if record_block_type == b'\x00\x00\x00\x00':
                record_block = record_block_compressed[8:]
            elif record_block_type == b'\x01\x00\x00\x00':
                if lzo is None:
                    print("LZO compression is not supported")
                    break
                # decompress (bytes literal, matching the MDX reader)
                header = b'\xf0' + pack('>I', decompressed_size)
                record_block = lzo.decompress(header + record_block_compressed[8:])
            elif record_block_type == b'\x02\x00\x00\x00':
                # decompress
                record_block = zlib.decompress(record_block_compressed[8:])
            # notice that adler32 return signed value
            assert(adler32 == zlib.adler32(record_block) & 0xffffffff)
            assert(len(record_block) == decompressed_size)
            # split record block according to the offset info from key block
            while i < len(self._key_list):
                record_start, key_text = self._key_list[i]
                # reach the end of current record block
                if record_start - offset >= len(record_block):
                    break
                # record end index
                if i < len(self._key_list)-1:
                    record_end = self._key_list[i+1][0]
                else:
                    record_end = len(record_block) + offset
                i += 1
                data = record_block[record_start-offset:record_end-offset]
                yield key_text, data
            offset += len(record_block)
            size_counter += compressed_size
        assert(size_counter == record_block_size)
        f.close()
class MDX(MDict):
    """
    MDict dictionary file format (*.MDX) reader.
    >>> mdx = MDX('example.mdx')
    >>> len(mdx)
    42481
    >>> for key,value in mdx.items():
    ...     print key, value[:10]
    """
    def __init__(self, fname, encoding='', substyle=False, passcode=None):
        MDict.__init__(self, fname, encoding, passcode)
        self._substyle = substyle

    def items(self):
        """Return a generator which in turn produce tuples in the form of (key, value)
        """
        return self._decode_record_block()

    def _substitute_stylesheet(self, txt):
        # substitute stylesheet definition
        txt_list = re.split('`\d+`', txt)
        txt_tag = re.findall('`\d+`', txt)
        txt_styled = txt_list[0]
        for j, p in enumerate(txt_list[1:]):
            style = self._stylesheet[txt_tag[j][1:-1]]
            if p and p[-1] == '\n':
                txt_styled = txt_styled + style[0] + p.rstrip() + style[1] + '\r\n'
            else:
                txt_styled = txt_styled + style[0] + p + style[1]
        return txt_styled

    def _decode_record_block(self):
        f = open(self._fname, 'rb')
        f.seek(self._record_block_offset)
        num_record_blocks = self._read_number(f)
        num_entries = self._read_number(f)
        assert(num_entries == self._num_entries)
        record_block_info_size = self._read_number(f)
        record_block_size = self._read_number(f)
        # record block info section
        record_block_info_list = []
        size_counter = 0
        for i in range(num_record_blocks):
            compressed_size = self._read_number(f)
            decompressed_size = self._read_number(f)
            record_block_info_list += [(compressed_size, decompressed_size)]
            size_counter += self._number_width * 2
        assert(size_counter == record_block_info_size)
        # actual record block data
        offset = 0
        i = 0
        size_counter = 0
        for compressed_size, decompressed_size in record_block_info_list:
            record_block_compressed = f.read(compressed_size)
            # 4 bytes indicates block compression type
            record_block_type = record_block_compressed[:4]
            # 4 bytes adler checksum of uncompressed content
            adler32 = unpack('>I', record_block_compressed[4:8])[0]
            # no compression
            if record_block_type == b'\x00\x00\x00\x00':
                record_block = record_block_compressed[8:]
            # lzo compression
            elif record_block_type == b'\x01\x00\x00\x00':
                if lzo is None:
                    print("LZO compression is not supported")
                    break
                # decompress
                header = b'\xf0' + pack('>I', decompressed_size)
                record_block = lzo.decompress(header + record_block_compressed[8:])
            # zlib compression
            elif record_block_type == b'\x02\x00\x00\x00':
                # decompress
                record_block = zlib.decompress(record_block_compressed[8:])
            # notice that adler32 return signed value
            assert(adler32 == zlib.adler32(record_block) & 0xffffffff)
            assert(len(record_block) == decompressed_size)
            # split record block according to the offset info from key block
            while i < len(self._key_list):
                record_start, key_text = self._key_list[i]
                # reach the end of current record block
                if record_start - offset >= len(record_block):
                    break
                # record end index
                if i < len(self._key_list)-1:
                    record_end = self._key_list[i+1][0]
                else:
                    record_end = len(record_block) + offset
                i += 1
                record = record_block[record_start-offset:record_end-offset]
                # convert to utf-8
                record = record.decode(self._encoding, errors='ignore').strip(u'\x00').encode('utf-8')
                # substitute styles
                if self._substyle and self._stylesheet:
                    record = self._substitute_stylesheet(record)
                yield key_text, record
            offset += len(record_block)
            size_counter += compressed_size
        assert(size_counter == record_block_size)
        f.close()
if __name__ == '__main__':
    import sys
    import os
    import os.path
    import argparse
    import codecs

    def passcode(s):
        try:
            regcode, userid = s.split(',')
        except:
            raise argparse.ArgumentTypeError("Passcode must be regcode,userid")
        try:
            regcode = codecs.decode(regcode, 'hex')
        except:
            raise argparse.ArgumentTypeError("regcode must be a 32 bytes hexadecimal string")
        return regcode, userid

    parser = argparse.ArgumentParser()
    parser.add_argument('-x', '--extract', action="store_true",
                        help='extract mdx to source format and extract files from mdd')
    parser.add_argument('-s', '--substyle', action="store_true",
                        help='substitute style definition if present')
    parser.add_argument('-d', '--datafolder', default="data",
                        help='folder to extract data files from mdd')
    parser.add_argument('-e', '--encoding', default="",
                        help='text encoding of the mdx file (read from the header if omitted)')
    parser.add_argument('-p', '--passcode', default=None, type=passcode,
                        help='register_code,email_or_deviceid')
    parser.add_argument("filename", nargs='?', help="mdx file name")
    args = parser.parse_args()

    # use GUI to select file, default to extract
    if not args.filename:
        import Tkinter
        import tkFileDialog
        root = Tkinter.Tk()
        root.withdraw()
        args.filename = tkFileDialog.askopenfilename(parent=root)
        args.extract = True

    if not os.path.exists(args.filename):
        print("Please specify a valid MDX/MDD file")

    base, ext = os.path.splitext(args.filename)

    # read mdx file
    if ext.lower() == os.path.extsep + 'mdx':
        mdx = MDX(args.filename, args.encoding, args.substyle, args.passcode)
        if type(args.filename) is unicode:
            bfname = args.filename.encode('utf-8')
        else:
            bfname = args.filename
        print('======== %s ========' % bfname)
        print(' Number of Entries : %d' % len(mdx))
        for key, value in mdx.header.items():
            print(' %s : %s' % (key, value))
    else:
        mdx = None

    # find companion mdd file
    mdd_filename = ''.join([base, os.path.extsep, 'mdd'])
    if os.path.exists(mdd_filename):
        mdd = MDD(mdd_filename, args.passcode)
        if type(mdd_filename) is unicode:
            bfname = mdd_filename.encode('utf-8')
        else:
            bfname = mdd_filename
        print('======== %s ========' % bfname)
        print(' Number of Entries : %d' % len(mdd))
        for key, value in mdd.header.items():
            print(' %s : %s' % (key, value))
    else:
        mdd = None

    if args.extract:
        # write out glos
        if mdx:
            output_fname = ''.join([base, os.path.extsep, 'txt'])
            tf = open(output_fname, 'wb')
            for key, value in mdx.items():
                tf.write(key)
                tf.write(b'\r\n')
                tf.write(value)
                if not value.endswith(b'\n'):
                    tf.write(b'\r\n')
                tf.write(b'</>\r\n')
            tf.close()
            # write out style
            if mdx.header.get('StyleSheet'):
                style_fname = ''.join([base, '_style', os.path.extsep, 'txt'])
                sf = open(style_fname, 'wb')
                sf.write(b'\r\n'.join(mdx.header['StyleSheet'].splitlines()))
                sf.close()
        # write out optional data files
        if mdd:
            datafolder = os.path.join(os.path.dirname(args.filename), args.datafolder)
            if not os.path.exists(datafolder):
                os.makedirs(datafolder)
            for key, value in mdd.items():
                fname = key.decode('utf-8').replace('\\', os.path.sep)
                dfname = datafolder + fname
                if not os.path.exists(os.path.dirname(dfname)):
                    os.makedirs(os.path.dirname(dfname))
                df = open(dfname, 'wb')
                df.write(value)
                df.close()
pureSalsa20.py
#!/usr/bin/env python
# coding: utf-8
"""
pureSalsa20.py -- a pure Python implementation of the Salsa20 cipher, ported to Python 3
v4.0: Added Python 3 support, dropped support for Python <= 2.5.
// zhansliu
Original comments below.
====================================================================
There are comments here by two authors about three pieces of software:
comments by Larry Bugbee about
Salsa20, the stream cipher by Daniel J. Bernstein
(including comments about the speed of the C version) and
pySalsa20, Bugbee's own Python wrapper for salsa20.c
(including some references), and
comments by Steve Witham about
pureSalsa20, Witham's pure Python 2.5 implementation of Salsa20,
which follows pySalsa20's API, and is in this file.
Salsa20: a Fast Streaming Cipher (comments by Larry Bugbee)
-----------------------------------------------------------
Salsa20 is a fast stream cipher written by Daniel Bernstein
that basically uses a hash function and XOR making for fast
encryption. (Decryption uses the same function.) Salsa20
is simple and quick.
Some Salsa20 parameter values...
design strength 128 bits
key length 128 or 256 bits, exactly
IV, aka nonce 64 bits, always
chunk size must be in multiples of 64 bytes
Salsa20 has two reduced versions, 8 and 12 rounds each.
One benchmark (10 MB):
1.5GHz PPC G4 102/97/89 MB/sec for 8/12/20 rounds
AMD Athlon 2500+ 77/67/53 MB/sec for 8/12/20 rounds
(no I/O and before Python GC kicks in)
Salsa20 is a Phase 3 finalist in the EU eSTREAM competition
and appears to be one of the fastest ciphers. It is well
documented so I will not attempt any injustice here. Please
see "References" below.
...and Salsa20 is "free for any use".
pySalsa20: a Python wrapper for Salsa20 (Comments by Larry Bugbee)
------------------------------------------------------------------
pySalsa20.py is a simple ctypes Python wrapper. Salsa20 is
as it's name implies, 20 rounds, but there are two reduced
versions, 8 and 12 rounds each. Because the APIs are
identical, pySalsa20 is capable of wrapping all three
versions (number of rounds hardcoded), including a special
version that allows you to set the number of rounds with a
set_rounds() function. Compile the version of your choice
as a shared library (not as a Python extension), name and
install it as libsalsa20.so.
Sample usage:
from pySalsa20 import Salsa20
s20 = Salsa20(key, IV)
dataout = s20.encryptBytes(datain) # same for decrypt
This is EXPERIMENTAL software and intended for educational
purposes only. To make experimentation less cumbersome,
pySalsa20 is also free for any use.
THIS PROGRAM IS PROVIDED WITHOUT WARRANTY OR GUARANTEE OF
ANY KIND. USE AT YOUR OWN RISK.
Enjoy,
Larry Bugbee
bugbee@seanet.com
April 2007
References:
-----------
http://en.wikipedia.org/wiki/Salsa20
http://en.wikipedia.org/wiki/Daniel_Bernstein
http://cr.yp.to/djb.html
http://www.ecrypt.eu.org/stream/salsa20p3.html
http://www.ecrypt.eu.org/stream/p3ciphers/salsa20/salsa20_p3source.zip
Prerequisites for pySalsa20:
----------------------------
- Python 2.5 (haven't tested in 2.4)
pureSalsa20: Salsa20 in pure Python 2.5 (comments by Steve Witham)
------------------------------------------------------------------
pureSalsa20 is the stand-alone Python code in this file.
It implements the underlying Salsa20 core algorithm
and emulates pySalsa20's Salsa20 class API (minus a bug(*)).
pureSalsa20 is MUCH slower than libsalsa20.so wrapped with pySalsa20--
about 1/1000 the speed for Salsa20/20 and 1/500 the speed for Salsa20/8,
when encrypting 64k-byte blocks on my computer.
pureSalsa20 is for cases where portability is much more important than
speed. I wrote it for use in a "structured" random number generator.
There are comments about the reasons for this slowness in
http://www.tiac.net/~sw/2010/02/PureSalsa20
Sample usage:
from pureSalsa20 import Salsa20
s20 = Salsa20(key, IV)
dataout = s20.encryptBytes(datain) # same for decrypt
I took the test code from pySalsa20, added a bunch of tests including
rough speed tests, and moved them into the file testSalsa20.py.
To test both pySalsa20 and pureSalsa20, type
python testSalsa20.py
(*)The bug (?) in pySalsa20 is this. The rounds variable is global to the
libsalsa20.so library and not switched when switching between instances
of the Salsa20 class.
s1 = Salsa20( key, IV, 20 )
s2 = Salsa20( key, IV, 8 )
In this example,
with pySalsa20, both s1 and s2 will do 8 rounds of encryption.
with pureSalsa20, s1 will do 20 rounds and s2 will do 8 rounds.
Perhaps giving each instance its own nRounds variable, which
is passed to the salsa20wordtobyte() function, is insecure. I'm not a
cryptographer.
pureSalsa20.py and testSalsa20.py are EXPERIMENTAL software and
intended for educational purposes only. To make experimentation less
cumbersome, pureSalsa20.py and testSalsa20.py are free for any use.
Revisions:
----------
p3.2 Fixed bug that initialized the output buffer with plaintext!
Saner ramping of nreps in speed test.
Minor changes and print statements.
p3.1 Took timing variability out of add32() and rot32().
Made the internals more like pySalsa20/libsalsa .
Put the semicolons back in the main loop!
In encryptBytes(), modify a byte array instead of appending.
Fixed speed calculation bug.
Used subclasses instead of patches in testSalsa20.py .
Added 64k-byte messages to speed test to be fair to pySalsa20.
p3 First version, intended to parallel pySalsa20 version 3.
More references:
----------------
http://www.seanet.com/~bugbee/crypto/salsa20/ [pySalsa20]
http://cr.yp.to/snuffle.html [The original name of Salsa20]
http://cr.yp.to/snuffle/salsafamily-20071225.pdf [ Salsa20 design]
http://www.tiac.net/~sw/2010/02/PureSalsa20
THIS PROGRAM IS PROVIDED WITHOUT WARRANTY OR GUARANTEE OF
ANY KIND. USE AT YOUR OWN RISK.
Cheers,
Steve Witham sw at remove-this tiac dot net
February, 2010
"""
import sys
assert(sys.version_info >= (2, 6))
if sys.version_info >= (3,):
    integer_types = (int,)
    python3 = True
else:
    integer_types = (int, long)
    python3 = False
from struct import Struct
little_u64 = Struct( "<Q" ) # little-endian 64-bit unsigned.
# Unpacks to a tuple of one element!
little16_i32 = Struct( "<16i" ) # 16 little-endian 32-bit signed ints.
little4_i32 = Struct( "<4i" ) # 4 little-endian 32-bit signed ints.
little2_i32 = Struct( "<2i" ) # 2 little-endian 32-bit signed ints.
_version = 'p4.0'
#----------- Salsa20 class which emulates pySalsa20.Salsa20 ---------------
class Salsa20(object):
    def __init__(self, key=None, IV=None, rounds=20 ):
        self._lastChunk64 = True
        self._IVbitlen = 64 # must be 64 bits
        self.ctx = [ 0 ] * 16
        if key:
            self.setKey(key)
        if IV:
            self.setIV(IV)
        self.setRounds(rounds)

    def setKey(self, key):
        assert type(key) == bytes
        ctx = self.ctx
        if len( key ) == 32: # recommended
            constants = b"expand 32-byte k"
            ctx[ 1],ctx[ 2],ctx[ 3],ctx[ 4] = little4_i32.unpack(key[0:16])
            ctx[11],ctx[12],ctx[13],ctx[14] = little4_i32.unpack(key[16:32])
        elif len( key ) == 16:
            constants = b"expand 16-byte k"
            ctx[ 1],ctx[ 2],ctx[ 3],ctx[ 4] = little4_i32.unpack(key[0:16])
            ctx[11],ctx[12],ctx[13],ctx[14] = little4_i32.unpack(key[0:16])
        else:
            raise Exception( "key length isn't 32 or 16 bytes." )
        ctx[0],ctx[5],ctx[10],ctx[15] = little4_i32.unpack( constants )

    def setIV(self, IV):
        assert type(IV) == bytes
        assert len(IV)*8 == 64, 'nonce (IV) not 64 bits'
        self.IV = IV
        ctx=self.ctx
        ctx[ 6],ctx[ 7] = little2_i32.unpack( IV )
        ctx[ 8],ctx[ 9] = 0, 0 # Reset the block counter.

    setNonce = setIV # support an alternate name

    def setCounter( self, counter ):
        assert( type(counter) in integer_types )
        assert( 0 <= counter < 1<<64 ), "counter < 0 or >= 2**64"
        ctx = self.ctx
        ctx[ 8],ctx[ 9] = little2_i32.unpack( little_u64.pack( counter ) )

    def getCounter( self ):
        return little_u64.unpack( little2_i32.pack( *self.ctx[ 8:10 ] ) ) [0]

    def setRounds(self, rounds, testing=False ):
        assert testing or rounds in [8, 12, 20], 'rounds must be 8, 12, 20'
        self.rounds = rounds

    def encryptBytes(self, data):
        assert type(data) == bytes, 'data must be byte string'
        assert self._lastChunk64, 'previous chunk not multiple of 64 bytes'
        lendata = len(data)
        munged = bytearray(lendata)
        for i in range( 0, lendata, 64 ):
            h = salsa20_wordtobyte( self.ctx, self.rounds, checkRounds=False )
            self.setCounter( ( self.getCounter() + 1 ) % 2**64 )
            # Stopping at 2^70 bytes per nonce is user's responsibility.
            for j in range( min( 64, lendata - i ) ):
                if python3:
                    munged[ i+j ] = data[ i+j ] ^ h[j]
                else:
                    munged[ i+j ] = ord(data[ i+j ]) ^ ord(h[j])
        self._lastChunk64 = not lendata % 64
        return bytes(munged)

    decryptBytes = encryptBytes # encrypt and decrypt use same function
#--------------------------------------------------------------------------
def salsa20_wordtobyte( input, nRounds=20, checkRounds=True ):
    """ Do nRounds Salsa20 rounds on a copy of
            input: list or tuple of 16 ints treated as little-endian unsigneds.
        Returns a 64-byte string.
        """
    assert( type(input) in ( list, tuple ) and len(input) == 16 )
    assert( not(checkRounds) or ( nRounds in [ 8, 12, 20 ] ) )

    x = list( input )

    def XOR( a, b ): return a ^ b
    ROTATE = rot32
    PLUS = add32

    for i in range( nRounds // 2 ):
        # These ...XOR...ROTATE...PLUS... lines are from ecrypt-linux.c
        # unchanged except for indents and the blank line between rounds:
        x[ 4] = XOR(x[ 4],ROTATE(PLUS(x[ 0],x[12]), 7));
        x[ 8] = XOR(x[ 8],ROTATE(PLUS(x[ 4],x[ 0]), 9));
        x[12] = XOR(x[12],ROTATE(PLUS(x[ 8],x[ 4]),13));
        x[ 0] = XOR(x[ 0],ROTATE(PLUS(x[12],x[ 8]),18));
        x[ 9] = XOR(x[ 9],ROTATE(PLUS(x[ 5],x[ 1]), 7));
        x[13] = XOR(x[13],ROTATE(PLUS(x[ 9],x[ 5]), 9));
        x[ 1] = XOR(x[ 1],ROTATE(PLUS(x[13],x[ 9]),13));
        x[ 5] = XOR(x[ 5],ROTATE(PLUS(x[ 1],x[13]),18));
        x[14] = XOR(x[14],ROTATE(PLUS(x[10],x[ 6]), 7));
        x[ 2] = XOR(x[ 2],ROTATE(PLUS(x[14],x[10]), 9));
        x[ 6] = XOR(x[ 6],ROTATE(PLUS(x[ 2],x[14]),13));
        x[10] = XOR(x[10],ROTATE(PLUS(x[ 6],x[ 2]),18));
        x[ 3] = XOR(x[ 3],ROTATE(PLUS(x[15],x[11]), 7));
        x[ 7] = XOR(x[ 7],ROTATE(PLUS(x[ 3],x[15]), 9));
        x[11] = XOR(x[11],ROTATE(PLUS(x[ 7],x[ 3]),13));
        x[15] = XOR(x[15],ROTATE(PLUS(x[11],x[ 7]),18));

        x[ 1] = XOR(x[ 1],ROTATE(PLUS(x[ 0],x[ 3]), 7));
        x[ 2] = XOR(x[ 2],ROTATE(PLUS(x[ 1],x[ 0]), 9));
        x[ 3] = XOR(x[ 3],ROTATE(PLUS(x[ 2],x[ 1]),13));
        x[ 0] = XOR(x[ 0],ROTATE(PLUS(x[ 3],x[ 2]),18));
        x[ 6] = XOR(x[ 6],ROTATE(PLUS(x[ 5],x[ 4]), 7));
        x[ 7] = XOR(x[ 7],ROTATE(PLUS(x[ 6],x[ 5]), 9));
        x[ 4] = XOR(x[ 4],ROTATE(PLUS(x[ 7],x[ 6]),13));
        x[ 5] = XOR(x[ 5],ROTATE(PLUS(x[ 4],x[ 7]),18));
        x[11] = XOR(x[11],ROTATE(PLUS(x[10],x[ 9]), 7));
        x[ 8] = XOR(x[ 8],ROTATE(PLUS(x[11],x[10]), 9));
        x[ 9] = XOR(x[ 9],ROTATE(PLUS(x[ 8],x[11]),13));
        x[10] = XOR(x[10],ROTATE(PLUS(x[ 9],x[ 8]),18));
        x[12] = XOR(x[12],ROTATE(PLUS(x[15],x[14]), 7));
        x[13] = XOR(x[13],ROTATE(PLUS(x[12],x[15]), 9));
        x[14] = XOR(x[14],ROTATE(PLUS(x[13],x[12]),13));
        x[15] = XOR(x[15],ROTATE(PLUS(x[14],x[13]),18));

    for i in range( len( input ) ):
        x[i] = PLUS( x[i], input[i] )
    return little16_i32.pack( *x )
#--------------------------- 32-bit ops -------------------------------
def trunc32( w ):
    """ Return the bottom 32 bits of w as a Python int.
        This creates longs temporarily, but returns an int. """
    w = int( ( w & 0x7fffFFFF ) | -( w & 0x80000000 ) )
    assert type(w) == int
    return w

def add32( a, b ):
    """ Add two 32-bit words discarding carry above 32nd bit,
        and without creating a Python long.
        Timing shouldn't vary.
    """
    lo = ( a & 0xFFFF ) + ( b & 0xFFFF )
    hi = ( a >> 16 ) + ( b >> 16 ) + ( lo >> 16 )
    return ( -(hi & 0x8000) | ( hi & 0x7FFF ) ) << 16 | ( lo & 0xFFFF )

def rot32( w, nLeft ):
    """ Rotate 32-bit word left by nLeft or right by -nLeft
        without creating a Python long.
        Timing depends on nLeft but not on w.
    """
    nLeft &= 31  # which makes nLeft >= 0
    if nLeft == 0:
        return w

    # Note: now 1 <= nLeft <= 31.
    #     RRRsLLLLLL   There are nLeft RRR's, (31-nLeft) LLLLLL's,
    # =>  sLLLLLLRRR   and one s which becomes the sign bit.
    RRR = ( ( ( w >> 1 ) & 0x7fffFFFF ) >> ( 31 - nLeft ) )
    sLLLLLL = -( (1<<(31-nLeft)) & w ) | (0x7fffFFFF>>nLeft) & w
    return RRR | ( sLLLLLL << nLeft )

# --------------------------------- end -----------------------------------
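
The signed 32-bit helpers at the end of pureSalsa20.py are the hardest part of the file to read. As a rough cross-check, and not as part of the original module, the sketch below copies add32 and rot32 and compares them with the more common formulation that masks to an unsigned 32-bit value and then converts back to a signed int. The names MASK32, to_signed, add32_ref and rot32_ref are illustrative and do not exist in pureSalsa20.py.

```python
import random

MASK32 = 0xFFFFFFFF  # illustrative constant, not in pureSalsa20.py

def to_signed(w):
    """Return the signed 32-bit int that has the same low 32 bits as w."""
    w &= MASK32
    return w - (1 << 32) if w & 0x80000000 else w

def add32(a, b):
    """Copy of add32 from pureSalsa20.py: 32-bit add on signed ints."""
    lo = (a & 0xFFFF) + (b & 0xFFFF)
    hi = (a >> 16) + (b >> 16) + (lo >> 16)
    return (-(hi & 0x8000) | (hi & 0x7FFF)) << 16 | (lo & 0xFFFF)

def rot32(w, nLeft):
    """Copy of rot32 from pureSalsa20.py: rotate left on signed ints."""
    nLeft &= 31
    if nLeft == 0:
        return w
    RRR = ((w >> 1) & 0x7fffFFFF) >> (31 - nLeft)
    sLLLLLL = -((1 << (31 - nLeft)) & w) | (0x7fffFFFF >> nLeft) & w
    return RRR | (sLLLLLL << nLeft)

def add32_ref(a, b):
    """Reference version (assumption of this sketch): add, mask, re-sign."""
    return to_signed((a + b) & MASK32)

def rot32_ref(w, n):
    """Reference version (assumption of this sketch): unsigned rotate, re-sign."""
    n &= 31
    u = w & MASK32
    return to_signed(((u << n) | (u >> (32 - n))) & MASK32)

# Compare both formulations on random signed 32-bit inputs.
rng = random.Random(2010)
for _ in range(2000):
    a = to_signed(rng.getrandbits(32))
    b = to_signed(rng.getrandbits(32))
    n = rng.randrange(32)
    assert add32(a, b) == add32_ref(a, b)
    assert rot32(a, n) == rot32_ref(a, n)
```

If the loop finishes without an AssertionError, the two formulations agreed on every sampled input. The signed style matters on Python 2, where it keeps intermediate values from being promoted to long; on Python 3 the masked style gives the same bit patterns.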
ripemd128.py
"""
ripemd128.py - A simple ripemd128 library in pure Python.

Supports both Python 2 (versions >= 2.6) and Python 3.

Usage:
    from ripemd128 import ripemd128
    digest = ripemd128(b"The quick brown fox jumps over the lazy dog")
    assert(digest == b"\x3f\xa9\xb5\x7f\x05\x3c\x05\x3f\xbe\x27\x35\xb2\x38\x0d\xb5\x96")
"""
import struct

# follows this description: http://homes.esat.kuleuven.be/~bosselae/ripemd/rmd128.txt

def f(j, x, y, z):
    assert(0 <= j and j < 64)
    if j < 16:
        return x ^ y ^ z
    elif j < 32:
        return (x & y) | (z & ~x)
    elif j < 48:
        return (x | (0xffffffff & ~y)) ^ z
    else:
        return (x & z) | (y & ~z)

def K(j):
    assert(0 <= j and j < 64)
    if j < 16:
        return 0x00000000
    elif j < 32:
        return 0x5a827999
    elif j < 48:
        return 0x6ed9eba1
    else:
        return 0x8f1bbcdc

def Kp(j):
    assert(0 <= j and j < 64)
    if j < 16:
        return 0x50a28be6
    elif j < 32:
        return 0x5c4dd124
    elif j < 48:
        return 0x6d703ef3
    else:
        return 0x00000000
def padandsplit(message):
    """
    Returns a two-dimensional array X[i][j] of 32-bit integers, where j ranges
    from 0 to 15.
    First pads the message so that its length in bytes is congruent to
    56 (mod 64): a byte 0x80 is appended, followed by 0x00 bytes until the
    length is congruent to 56 (mod 64). Then appends the little-endian
    64-bit representation of the original length in bits. Finally, splits the
    result into 64-byte blocks, which are further parsed as 32-bit integers.
    """
    origlen = len(message)
    padlength = 64 - ((origlen - 56) % 64) # minimum padding is 1!
    message += b"\x80"
    message += b"\x00" * (padlength - 1)
    message += struct.pack("<Q", origlen*8)
    assert(len(message) % 64 == 0)
    return [
        [
            struct.unpack("<L", message[i+j:i+j+4])[0]
            for j in range(0, 64, 4)
        ]
        for i in range(0, len(message), 64)
    ]
def add(*args):
    return sum(args) & 0xffffffff

def rol(s,x):
    assert(s < 32)
    return (x << s | x >> (32-s)) & 0xffffffff

r = [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15,
      7, 4,13, 1,10, 6,15, 3,12, 0, 9, 5, 2,14,11, 8,
      3,10,14, 4, 9,15, 8, 1, 2, 7, 0, 6,13,11, 5,12,
      1, 9,11,10, 0, 8,12, 4,13, 3, 7,15,14, 5, 6, 2]
rp = [ 5,14, 7, 0, 9, 2,11, 4,13, 6,15, 8, 1,10, 3,12,
       6,11, 3, 7, 0,13, 5,10,14,15, 8,12, 4, 9, 1, 2,
      15, 5, 1, 3, 7,14, 6, 9,11, 8,12, 2,10, 0, 4,13,
       8, 6, 4, 1, 3,11,15, 0, 5,12, 2,13, 9, 7,10,14]
s = [11,14,15,12, 5, 8, 7, 9,11,13,14,15, 6, 7, 9, 8,
      7, 6, 8,13,11, 9, 7,15, 7,12,15, 9,11, 7,13,12,
     11,13, 6, 7,14, 9,13,15,14, 8,13, 6, 5,12, 7, 5,
     11,12,14,15,14,15, 9, 8, 9,14, 5, 6, 8, 6, 5,12]
sp = [ 8, 9, 9,11,13,15,15, 5, 7, 7, 8,11,14,14,12, 6,
       9,13,15, 7,12, 8, 9,11, 7, 7,12, 7, 6,15,13,11,
       9, 7,15,11, 8, 6, 6,14,12,13, 5,14,13,13, 7, 5,
      15, 5, 8,11,14,14, 6,14, 6, 9,12, 9,12, 5,15, 8]
def ripemd128(message):
    h0 = 0x67452301
    h1 = 0xefcdab89
    h2 = 0x98badcfe
    h3 = 0x10325476
    X = padandsplit(message)
    for i in range(len(X)):
        (A,B,C,D) = (h0,h1,h2,h3)
        (Ap,Bp,Cp,Dp) = (h0,h1,h2,h3)
        for j in range(64):
            T = rol(s[j], add(A, f(j,B,C,D), X[i][r[j]], K(j)))
            (A,D,C,B) = (D,C,B,T)
            T = rol(sp[j], add(Ap, f(63-j,Bp,Cp,Dp), X[i][rp[j]], Kp(j)))
            (Ap,Dp,Cp,Bp)=(Dp,Cp,Bp,T)
        T = add(h1,C,Dp)
        h1 = add(h2,D,Ap)
        h2 = add(h3,A,Bp)
        h3 = add(h0,B,Cp)
        h0 = T
    return struct.pack("<LLLL",h0,h1,h2,h3)
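def hexstr(bstr):
    # bytearray() yields ints on both Python 2 and Python 3; iterating a
    # Python 2 str directly yields 1-char strings, which "{0:02x}" rejects.
    return "".join("{0:02x}".format(b) for b in bytearray(bstr))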
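
The padding rule described in the padandsplit docstring can be checked on its own. The sketch below repeats only the padding step, without the split into 32-bit words, and prints the padded length for a few message lengths around the 56-byte boundary. The helper name md_pad is hypothetical and is not part of ripemd128.py.

```python
import struct

def md_pad(message):
    """Pad as padandsplit() does, but return the padded bytes unsplit.

    md_pad is an illustrative helper for this sketch only.
    """
    origlen = len(message)
    # Always between 1 and 64 bytes of padding, so the 0x80 marker always fits.
    padlength = 64 - ((origlen - 56) % 64)
    return (message
            + b"\x80"
            + b"\x00" * (padlength - 1)
            + struct.pack("<Q", origlen * 8))  # original length in bits

for n in (0, 55, 56, 64, 120):
    padded = md_pad(b"a" * n)
    assert len(padded) % 64 == 0
    print(n, len(padded))
# prints:
# 0 64
# 55 64
# 56 128
# 64 128
# 120 192
```

A 55-byte message still fits in one 64-byte block (55 + 1 + 8), while a 56-byte message spills into a second block because the 0x80 marker and the 8-byte length field no longer fit in the first.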