This document describes how I convert a Thai PDF file to plain text.

On GNU/Linux:

  • use the command pdftotext (bundled with xpdf or poppler-utils)
  • put the following configuration in the file ~/xpdfrc-thai-unicode (the home directory here is /home/sysadmin; adjust the paths to match your own),
include /etc/xpdf/xpdfrc
unicodeMap      UTF-8-Thai         /home/sysadmin/UTF-8-Thai.unicodeMap
  • generate the file /home/sysadmin/UTF-8-Thai.unicodeMap with the following Perl script, which is based on the code in xpdf/UTF8.h and writes one map line per block of 0x40 code points,
$ perl -e '
# U+0000..U+007F: one byte, same value as the code point
print "0000 007f 00\n";
# U+0080..U+07FF: two bytes
for ($u=0x0080; $u<0x0800; $u+=0x40)
{ printf("%04x %04x %02x%02x\n", $u, $u+0x40-1, 0xc0+($u>>6), 0x80+($u&0x3f) ); }
# U+0800..U+FFFF: three bytes
for ($u=0x0800; $u<0x10000; $u+=0x40)
{ printf("%04x %04x %02x%02x%02x\n", $u, $u+0x40-1, 0xe0+($u>>12), 0x80+($u>>6 & 0x3f), 0x80+($u&0x3f) ); }
# U+10000..U+10FFFF: four bytes
for ($u=0x10000; $u<0x110000; $u+=0x40)
{ printf("%06x %06x %02x%02x%02x%02x\n", $u, $u+0x40-1, 0xf0+($u>>18), 0x80+($u>>12 & 0x3f), 0x80+($u>>6 & 0x3f), 0x80+($u&0x3f) ); }
' > ~/UTF-8-Thai.unicodeMap
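  • optionally, spot-check the generated file. The following is only a sketch: the function name utf8_hex is made up for illustration, and it assumes a POSIX shell whose arithmetic accepts hexadecimal constants. It recomputes the UTF-8 bytes of a single code point with the same shifts and masks as the Perl script, so the result can be compared with the line that starts at that code point,

```shell
# utf8_hex is a hypothetical helper: print the UTF-8 bytes of one code point
# (given as hex digits, e.g. 0e00) in the hex form used by the map file.
utf8_hex() {
  u=$(( 0x$1 ))
  if [ "$u" -lt $(( 0x80 )) ]; then
    printf '%02x\n' "$u"
  elif [ "$u" -lt $(( 0x800 )) ]; then
    printf '%02x%02x\n' $(( 0xc0 + (u >> 6) )) $(( 0x80 + (u & 0x3f) ))
  elif [ "$u" -lt $(( 0x10000 )) ]; then
    printf '%02x%02x%02x\n' $(( 0xe0 + (u >> 12) )) \
      $(( 0x80 + ((u >> 6) & 0x3f) )) $(( 0x80 + (u & 0x3f) ))
  else
    printf '%02x%02x%02x%02x\n' $(( 0xf0 + (u >> 18) )) \
      $(( 0x80 + ((u >> 12) & 0x3f) )) $(( 0x80 + ((u >> 6) & 0x3f) )) \
      $(( 0x80 + (u & 0x3f) ))
  fi
}

utf8_hex 0e00   # first code point of the Thai block, prints e0b880
utf8_hex f700   # first code point of the f700 map line, prints ef9c80
```

    The generated file should then contain the lines "0e00 0e3f e0b880" and "f700 f73f ef9c80"; grep -n '^f700 ' ~/UTF-8-Thai.unicodeMap shows the latter together with its line number.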
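  • edit the file /home/sysadmin/UTF-8-Thai.unicodeMap as shown in the following patch, which is derived from the mapping in /usr/share/xpdf/thai/TIS-620.unicodeMap and makes the private-use code points U+F700 to U+F71A come out as characters from the standard Thai block,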
--- -   2008-09-22 18:46:28.681553000 +0700
+++ /home/sysadmin/UTF-8-Thai.unicodeMap        2008-09-22 18:36:41.000000000 +0700
@@ -985,7 +985,16 @@
 f640 f67f ef9980
 f680 f6bf ef9a80
 f6c0 f6ff ef9b80
-f700 f73f ef9c80
+f700 e0b890
+f701 f704 e0b8b4
+f705 f709 e0b988
+f70a f70e e0b988
+f70f e0b88d
+f710 e0b8b1
+f711 e0b98d
+f712 f717 e0b987
+f718 f71a e0b8b8
+f720 f73f ef9ca0
 f740 f77f ef9d80
 f780 f7bf ef9e80
 f7c0 f7ff ef9f80
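  • to read the range lines in the patch, it helps to expand them. The sketch below assumes that a "start end bytes" line assigns the given bytes to the start code point and counts upwards by one for each following code point, which is how the generated UTF-8 lines are meant to be read as well; the function name expand_range is made up for illustration and only handles ranges whose last byte does not overflow,

```shell
# expand_range is a hypothetical helper: list the individual mappings that a
# three-field "start end bytes" line stands for, assuming the output bytes
# count upwards from the given value across the range.
expand_range() {
  start=$(( 0x$1 )); end=$(( 0x$2 )); out=$(( 0x$3 ))
  u=$start
  while [ "$u" -le "$end" ]; do
    printf '%04x %06x\n' "$u" $(( out + u - start ))
    u=$(( u + 1 ))
  done
}

expand_range f701 f704 e0b8b4
# f701 e0b8b4
# f702 e0b8b5
# f703 e0b8b6
# f704 e0b8b7
```

    Here e0b8b4 to e0b8b7 are the UTF-8 encodings of U+0E34 to U+0E37, so the four private-use code points come out as four consecutive Thai vowel signs. The two-field lines in the patch, such as "f700 e0b890", map a single code point.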
  • convert the PDF file with the command,
    $ pdftotext -raw -enc UTF-8-Thai /tmp/1.PDF /tmp/1.txt -cfg ~/xpdfrc-thai-unicode
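  • optionally, check the result for private-use characters that were left unmapped. This is a sketch under assumptions: the function name count_pua is made up for illustration, and it relies on grep matching raw bytes in the C locale. Every code point from U+F700 to U+F73F starts with the UTF-8 bytes ef 9c, so counting the lines that contain those two bytes gives a rough indication,

```shell
# count_pua is a hypothetical helper: count the lines of a text file that
# still contain a code point in U+F700..U+F73F (UTF-8 bytes ef 9c xx).
count_pua() {
  LC_ALL=C grep -c "$(printf '\357\234')" "$1"
}

# Example use on the converted file from the step above:
# count_pua /tmp/1.txt
```

    A count of 0 means no such code point remains; a non-zero count may simply come from U+F71B to U+F73F, which the patch does not remap.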
  • my contribution here is released under a dual license,
    1. GFDL
    2. The same license as xpdf