Thot: a Toolkit for Statistical Machine Translation
Thot is an open source software toolkit for statistical machine translation (SMT). Originally, Thot incorporated tools to train phrase-based models. The new version of Thot also includes a state-of-the-art phrase-based translation decoder as well as tools to estimate all of the models involved in the translation process. In addition, Thot is able to incrementally update its models in real time each time a new sentence pair is presented, by means of online learning (also known as adaptive machine translation).
Thot was created by Daniel Ortiz-Martínez.
A new version of the toolkit has been released with several improvements and new features.
Thot has been coded using C, C++, Python and shell-scripting. Thot is known to compile on Unix-like and Windows (using Cygwin) systems. See the "Documentation and Support" section of these instructions if you experience problems during compilation.
It is released under the GNU Lesser General Public License (LGPL).
To install Thot, first you need to install the autotools (autoconf, autoconf-archive, automake and libtool packages in Ubuntu). If you are planning to use Thot on a Windows platform, you also need to install the Cygwin environment. Alternatively, Thot can also be installed on Mac OS X systems using MacPorts.
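In Ubuntu, for example, these packages can be installed by means of apt:

$ sudo apt install autoconf autoconf-archive automake libtool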
In addition, some of the functionality incorporated in Thot requires the prior installation of third-party software (see below).

Once the autotools are available (as well as other required software such as Cygwin or MacPorts), you can proceed with the installation of Thot by following these steps:
1. Obtain the package using git:

   $ git clone https://github.com/daormar/thot.git

2. cd to the directory containing the package's source code and type ./reconf.

3. Type ./configure to configure the package.

4. Type make to compile the package.

5. Type make install to install the programs and any data files and documentation.

You can remove the program binaries and object files from the source code directory by typing make clean.
By default the files are installed under the /usr/local/ directory (or similar, depending on the OS you use); however, since Step 5 requires root privileges, another directory can be specified during Step 3 by typing:

$ ./configure --prefix=<absolute-installation-path>
For example, if "user1" wants to install the Thot package in the directory /home/user1/thot, the sequence of commands to execute should be the following:
$ ./reconf
$ ./configure --prefix=/home/user1/thot
$ make
$ make install
The installation directory can be the same directory where the Thot package was decompressed.
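If a non-standard prefix is used as in the example above, it may also be convenient to add the installation's bin directory to the PATH. This assumes the standard autotools layout, in which binaries are installed under <prefix>/bin:

$ export PATH=/home/user1/thot/bin:$PATH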
IMPORTANT NOTE: if Thot is being installed in a PBS cluster (a cluster providing qsub and other related tools), it is important that the configure script is executed on the main cluster node, so as to properly detect the cluster configuration (do not execute it in an interactive session).
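For instance, a session on the main node could look as follows (the host name is illustrative; "qsub -I" is the usual way interactive PBS sessions are started):

$ ssh main-node   # log into the main cluster node (illustrative host name)
$ cd thot
$ ./configure     # do not run this from an interactive session (e.g. "qsub -I")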
The Thot configure script can be used to modify the toolkit's behavior. Here is a list of the current installation options:

--enable-ibm2-alig: Thot currently uses HMM-based alignment models to obtain the word alignment matrices required for phrase model estimation. This option replaces the HMM-based alignment models with IBM 2 alignment models, which can be estimated very efficiently without significantly affecting translation quality.

--with-kenlm=<DIR>: installs Thot with the code necessary to combine it with the KenLM library. <DIR> is the absolute path where the KenLM library was installed. See more information below.

--with-casmacat=<DIR>: enables the configuration required for the CasMaCat Workbench. <DIR> is the absolute path where the CasMaCat library was installed. See more information below.

--enable-testing: enables the execution of unit tests using the thot_test tool. Using this option requires the installation of the CppUnit library. For more information, see the third-party software section below.
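For example, several of these options can be combined in a single configure invocation (the paths shown here are illustrative):

$ ./configure --prefix=$HOME/thot --enable-ibm2-alig --with-kenlm=$HOME/kenlm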
Thot can be combined with third-party software, extending its functionality. The currently supported software is listed below.

LevelDB is a key-value storage library providing an ordered mapping from string keys to string values. LevelDB is used in Thot to handle language and translation model parameters from disk. The advantage of doing this is that main memory requirements and loading times for both kinds of models are reduced to virtually zero, at the cost of a small time overhead incurred by disk usage.

One interesting aspect of using LevelDB to access model parameters, not present in the other solutions implemented in the toolkit (including those based on the KenLM and Berkeley DB libraries described below), is that LevelDB allows model parameters to be modified very efficiently; in the models implemented by means of the alternative libraries, modifications are slow or simply not allowed.
To enable LevelDB in Thot it is necessary to install the library in a standard path before executing configure. In operating systems where the apt tool is available, LevelDB can be installed by means of the following command:
$ sudo apt install libleveldb-dev
The KenLM library provides software to estimate, filter and query language models. KenLM has been incorporated into Thot so as to enable access to language model parameters from disk, in a similar way to that described for the LevelDB library. However, KenLM language models are static, in contrast to the dynamic language models implemented by means of LevelDB.
The KenLM library should be downloaded, compiled and installed before executing Thot's configure script. configure should be used with the --with-kenlm=<DIR> option, where <DIR> is the directory where the library was installed. A specific version of KenLM has been created with minor modifications focused on making the package easier to install. The required sequence of commands is as follows:
$ mkdir kenlm ; cd kenlm
$ git clone https://github.com/daormar/kenlm.git repo
$ mkdir build ; cd build
$ cmake -DCMAKE_INSTALL_PREFIX=../ ../repo
$ make
$ make install # Installs the library in the "kenlm" directory
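Once KenLM is installed, Thot can be configured to use it. Assuming the commands above were executed from the home directory (so that the library ended up under $HOME/kenlm), the invocation would be:

$ ./configure --with-kenlm=$HOME/kenlm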
For more information about how to use this functionality, please refer to the Thot manual.
Berkeley DB is a software library providing a high performance database for key/value data. Berkeley DB can be combined with Thot so as to allow access to phrase model parameters from disk. The purpose of this is to reduce main memory requirements and loading times in the same way as was explained above for language models and KenLM.
The Berkeley DB library should be installed in a standard OS directory (such as /usr/local) before the configure script is executed. In systems providing the apt tool, this can be easily achieved with the following command:
$ sudo apt install libdb++-dev
For additional information about how to use this functionality, see the Thot manual.
Thot can be used in combination with the CasMaCat Workbench that is being developed in the project of the same name. See the CasMaCat webpage for specific installation instructions.
CppUnit is a C++ framework for unit testing. The thot_test tool internally uses CppUnit to execute a growing set of unit tests. CppUnit should be installed in a standard OS directory before executing configure. If the apt tool is available in your operating system, CppUnit can be installed by executing the following command:
$ sudo apt install libcppunit-dev
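After CppUnit is available, unit testing can be enabled when configuring the package. The last step below sketches an argument-less thot_test invocation, which is an assumption; refer to the Thot manual for the exact usage:

$ ./configure --enable-testing
$ make
$ thot_test   # assumed invocation; see the Thot manual for details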
Once the package has been installed, it is possible to automatically perform basic checks so as to detect portability errors. For this purpose, the following command can be used:
$ make installcheck
The tests performed by the previous command involve the execution of the main tasks present in a typical SMT pipeline, including training and tuning of model parameters as well as generating translations using the estimated models (see more on this in the Thot manual). The command internally uses the toy corpus provided with the Thot package to carry out the checks.
Due to its strong focus on online and incremental learning, Thot includes its own programs to carry out language and translation model estimation. Specifically, Thot includes tools to work with n-gram language models based on incrementally updateable sufficient statistics. In addition, Thot includes a set of tools and a whole software library to estimate IBM 1, IBM 2 and HMM-based word alignment models. The estimation process can be carried out using batch and incremental EM algorithms. This functionality is not based on the standard GIZA++ software for word alignment model generation.
Additionally, Thot does not use any code from other existing translation tools. In this regard, Thot tries to offer its own view of the process of statistical machine translation, with a strong focus on online learning and also incorporating interactive machine translation functionality. Another interesting feature of the toolkit is its stable and robust translation server.
The Thot toolkit is under development. The original public versions of Thot date back to 2005 [Ortiz-Martínez et al., 2005] and were hosted on SourceForge. These original versions were strongly focused on the estimation of phrase-based models. By contrast, the current version offers several new features that had not been previously incorporated.
A set of specific tools to ease the process of making SMT experiments has been created. Basic usage instructions have been recently added to the Thot manual.
On the other hand, there are some toolkit extensions that will be incorporated in the coming months:
Improved management of concurrency in the Thot translation server (concurrent translation processes are currently handled with mutual exclusion) [STATUS: implementation finished]
Virtualized language models, i.e. accessing language model parameters from disk [STATUS: implementation finished]
Interpolation of language and translation models [STATUS: implementation finished]
Finally, here is a list of known issues with the Thot toolkit that are currently being addressed:
Phrase model training is based on HMM-based word alignment models estimated by means of incremental EM. The current implementation is slow and constitutes a bottleneck when training phrase models from large corpora. One already implemented solution is to carry out the estimation on multiple processors. Another solution is to replace the HMM-based models with IBM 2 models, which can be estimated very efficiently. However, we are also investigating alternative optimization techniques to efficiently execute the estimation process of HMM-based models on a single processor [STATUS: under development, although the code is much faster now]
Log-linear model weight adjustment is carried out by means of the downhill simplex algorithm, which is slow. Downhill simplex will be replaced by a more efficient technique [STATUS: issue solved]
Non-monotonic translation is not yet sufficiently tested, especially with complex corpora such as Europarl [STATUS: under development]
Project documentation is being developed. If you need additional help, please refer to the Thot manual.
Additional information about the theoretical foundations of Thot can be found in [Ortiz-Martínez, 2011]. One interesting feature of Thot, incremental (or online) estimation of statistical models, is also described in [Ortiz-Martínez et al., 2010]. Finally, phrase-level alignment generation functionality implemented in Thot was proposed in [Ortiz-Martínez et al., 2008].
You are welcome to use the code under the terms of the license for research or commercial purposes, however please acknowledge its use with a citation:
Daniel Ortiz-Martínez, Francisco Casacuberta. "The New Thot Toolkit for Fully Automatic and Interactive Statistical Machine Translation". In Proc. of the European Association for Computational Linguistics (EACL): System Demonstrations, Gothenburg, Sweden, April 2014. pp. 45-48.
Here is a BibTeX entry:
@InProceedings{Ortiz2014,
  author    = {Daniel Ortiz-Mart\'{\i}nez and Francisco Casacuberta},
  title     = {The New Thot Toolkit for Fully Automatic and Interactive Statistical Machine Translation},
  booktitle = {Proc. of the European Association for Computational Linguistics (EACL): System Demonstrations},
  year      = {2014},
  month     = {April},
  address   = {Gothenburg, Sweden},
  pages     = {45--48},
}
Daniel Ortiz-Martínez, "Online Learning for Statistical Machine Translation". In Computational Linguistics, 2016. Vol. 42 (1), pp. 121-161. Download
Daniel Ortiz-Martínez, "Advances in Fully-Automatic and Interactive Phrase-Based Statistical Machine Translation". PhD Thesis. Universidad Politécnica de Valencia. 2011. Advisors: Ismael García Varea and Francisco Casacuberta. Download
Daniel Ortiz-Martínez, Ismael García-Varea, Francisco Casacuberta. "Online Learning for Interactive Statistical Machine Translation". In Proc. of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT), pp. 546-554, Los Angeles, US, 2010. Download
Daniel Ortiz-Martínez, Ismael García-Varea, Francisco Casacuberta. "Phrase-level alignment generation using a smoothed loglinear phrase-based statistical alignment model". In Proc. of the European Association for Machine Translation (EAMT), pp. 160-169, Hamburg, Germany, 2008. Best paper award. Download
Daniel Ortiz-Martínez, Ismael García-Varea, Francisco Casacuberta. "Thot: a toolkit to train phrase-based models for statistical machine translation". In Proc. of the Machine Translation Summit (MT-Summit), Phuket, Thailand, September 2005. Download
Thot has been supported by the European Union under the CasMaCat research project. Thot has also received support from the Spanish Government in a number of research projects, such as the MIPRCV project that belonged to the prestigious CONSOLIDER programme.