Thot: a Toolkit for Statistical Machine Translation
Thot is an open source software toolkit for statistical machine translation (SMT). Originally, Thot incorporated tools to train phrase-based models. The new version of Thot also includes a state-of-the-art phrase-based translation decoder as well as tools to estimate all of the models involved in the translation process. In addition, Thot is able to incrementally update its models in real time each time a new sentence pair is presented, by means of online learning (also known as adaptive machine translation).
Thot was created by Daniel Ortiz-Martínez.
A new version of the toolkit has been released with several improvements and new features.
Thot has been coded using C, C++, Python and shell-scripting. Thot is known to compile on Unix-like and Windows (using Cygwin) systems. See the "Documentation and Support" section of these instructions if you experience problems during compilation.
It is released under the GNU Lesser General Public License (LGPL).
To install Thot, first you need to install the autotools (autoconf, autoconf-archive, automake and libtool packages in Ubuntu). If you are planning to use Thot on a Windows platform, you also need to install the Cygwin environment. Alternatively, Thot can also be installed on Mac OS X systems using MacPorts.
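In Ubuntu, for example, these packages can be installed by means of apt:

$ sudo apt install autoconf autoconf-archive automake libtool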
In addition, some of the functionality incorporated in Thot requires the prior installation of third-party software (see below).

Once the autotools are available (as well as other required software such as Cygwin or MacPorts), you can proceed with the installation of Thot by following these steps:
1. Obtain the package using git:

   $ git clone https://github.com/daormar/thot.git

2. cd to the directory containing the package's source code and type ./reconf.

3. Type ./configure to configure the package.

4. Type make to compile the package.

5. Type make install to install the programs and any data files and documentation.

You can remove the program binaries and object files from the source code directory by typing make clean.
By default the files are installed under the /usr/local/ directory (or similar, depending on the OS you use); however, since Step 5 requires root privileges, another directory can be specified during Step 3 by typing:

$ ./configure --prefix=<absolute-installation-path>
For example, if "user1" wants to install the Thot package in the directory /home/user1/thot, the sequence of commands to execute should be the following:
$ ./reconf
$ ./configure --prefix=/home/user1/thot
$ make
$ make install
The installation directory can be the same directory where the Thot package was decompressed.
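If a non-standard prefix is used as in the example above, it may also be convenient to add the installation's bin directory to the PATH. This assumes the standard autotools layout, in which binaries are installed under <prefix>/bin:

$ export PATH=/home/user1/thot/bin:$PATH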
IMPORTANT NOTE: if Thot is being installed in a PBS cluster (a cluster providing qsub and other related tools), it is important that the configure script is executed on the main cluster node, so as to properly detect the cluster configuration (do not execute it in an interactive session).
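For instance, a session on the main node could look as follows (the host name is illustrative; "qsub -I" is the usual way interactive PBS sessions are started):

$ ssh main-node   # log into the main cluster node (illustrative host name)
$ cd thot
$ ./configure     # do not run this from an interactive session (e.g. "qsub -I")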
The Thot configure script can be used to modify the toolkit's behavior. Here is a list of the current installation options:

--enable-ibm2-alig: Thot currently uses HMM-based alignment models to obtain the word alignment matrices required for phrase model estimation. This option replaces the HMM-based alignment models with IBM 2 alignment models, which can be estimated very efficiently without significantly affecting translation quality.

--with-kenlm=<DIR>: installs Thot with the code necessary to combine it with the KenLM library. <DIR> is the absolute path where the KenLM library was installed. See more information below.

--with-casmacat=<DIR>: enables the configuration required for the CasMaCat Workbench. <DIR> is the absolute path where the CasMaCat library was installed. See more information below.

--enable-testing: enables the execution of unit tests using the thot_test tool. Using this option requires the installation of the CppUnit library. For more information, see the third-party software section below.
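For example, several of these options can be combined in a single configure invocation (the paths shown here are illustrative):

$ ./configure --prefix=$HOME/thot --enable-ibm2-alig --with-kenlm=$HOME/kenlm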
Thot can be combined with third-party software, extending its functionality. The currently supported software is listed below.

LevelDB is a key-value storage library providing an ordered mapping from string keys to string values. LevelDB is used in Thot to handle language and translation model parameters from disk. The advantage of doing this is that main memory requirements and loading times for both kinds of models are reduced to virtually zero, at the cost of a small time overhead incurred by disk usage.

One interesting aspect of using LevelDB to access model parameters, not present in the other solutions implemented in the toolkit (including those based on the KenLM and Berkeley DB libraries described below), is that LevelDB allows model parameters to be modified very efficiently; in the models implemented by means of the alternative libraries, modifications are slow or simply not allowed.
To enable LevelDB in Thot it is necessary to install the library in a standard path before executing configure. In operating systems where the apt tool is available, LevelDB can be installed by means of the following command:
$ sudo apt install libleveldb-dev
The KenLM library provides software to estimate, filter and query language models. KenLM has been incorporated into Thot so as to enable access to language model parameters from disk, in a similar way to that described for the LevelDB library. However, KenLM language models are static, in contrast to the dynamic language models implemented by means of LevelDB.
The KenLM library should be downloaded, compiled and installed before executing Thot's configure script. configure should be used with the --with-kenlm=<DIR> option, where <DIR> is the directory where the library was installed. A specific version of KenLM has been created with minor modifications focused on making the package easier to install. The required sequence of commands is as follows:
$ mkdir kenlm ; cd kenlm
$ git clone https://github.com/daormar/kenlm.git repo
$ mkdir build ; cd build
$ cmake -DCMAKE_INSTALL_PREFIX=../ ../repo
$ make
$ make install # Installs the library in the "kenlm" directory
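Once KenLM is installed, Thot can be configured to use it. Assuming the commands above were executed from the home directory (so that the library ended up under $HOME/kenlm), the invocation would be:

$ ./configure --with-kenlm=$HOME/kenlm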
For more information about how to use this functionality, please refer to the Thot manual.
Berkeley DB is a software library providing a high performance database for key/value data. Berkeley DB can be combined with Thot so as to allow access to phrase model parameters from disk. The purpose of this is to reduce main memory requirements and loading times in the same way as was explained above for language models and KenLM.
The Berkeley DB library should be installed in a standard OS directory (such as /usr/local) before the configure script is executed. In systems providing the apt tool, this can be easily achieved with the following command:
$ sudo apt install libdb++-dev
For additional information about how to use this functionality, see the Thot manual.
Thot can be used in combination with the CasMaCat Workbench that is being developed in the project of the same name. See the CasMaCat webpage for specific installation instructions.
CppUnit is a C++ framework for unit testing. The thot_test tool internally uses CppUnit to execute a growing set of unit tests. CppUnit should be installed in a standard OS directory before executing configure. If the apt tool is available in your operating system, CppUnit can be installed by executing the following command:
$ sudo apt install libcppunit-dev
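After CppUnit is available, unit testing can be enabled when configuring the package. The last step below sketches an argument-less thot_test invocation, which is an assumption; refer to the Thot manual for the exact usage:

$ ./configure --enable-testing
$ make
$ thot_test   # assumed invocation; see the Thot manual for details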
Once the package has been installed, it is possible to automatically perform basic checks so as to detect portability errors. For this purpose, the following command can be used:
$ make installcheck
The tests performed by the previous command involve the execution of the main tasks present in a typical SMT pipeline, including training and tuning of model parameters as well as generating translations using the estimated models (see more on this in the Thot manual). The command internally uses the toy corpus provided with the Thot package to carry out the checks.
Due to its strong focus on online and incremental learning, Thot includes its own programs to carry out language and translation model estimation. Specifically, Thot includes tools to work with n-gram language models based on incrementally updateable sufficient statistics. In addition, Thot includes a set of tools and a whole software library to estimate IBM 1, IBM 2 and HMM-based word alignment models. The estimation process can be carried out using batch and incremental EM algorithms. This functionality is not based on the standard GIZA++ software for word alignment model generation.
Additionally, Thot does not use any code from other existing translation tools. In this regard, Thot tries to offer its own view of the process of statistical machine translation, with a strong focus on online learning and also incorporating interactive machine translation functionality. Another interesting feature of the toolkit is its stable and robust translation server.
The Thot toolkit is under development. The original public versions of Thot date back to 2005 [Ortiz-Martínez et al., 2005] and were hosted on SourceForge. These original versions were strongly focused on the estimation of phrase-based models. By contrast, the current version offers several new features that had not been previously incorporated.
A set of specific tools to ease the process of making SMT experiments has been created. Basic usage instructions have been recently added to the Thot manual.
On the other hand, there are some toolkit extensions that will be incorporated in the coming months:
Improved management of concurrency in the Thot translation server (concurrent translation processes are currently handled with mutual exclusion) [STATUS: implementation finished]
Virtualized language models, i.e. accessing language model parameters from disk [STATUS: implementation finished]
Interpolation of language and translation models [STATUS: implementation finished]
Finally, here is a list of known issues with the Thot toolkit that are currently being addressed:
Phrase model training is based on HMM-based word alignment models estimated by means of incremental EM. The current implementation is slow and constitutes a bottleneck when training phrase models from large corpora. One already implemented solution is to carry out the estimation on multiple processors. Another solution is to replace the HMM-based models with IBM 2 models, which can be estimated very efficiently. However, we are also investigating alternative optimization techniques to efficiently execute the estimation process of HMM-based models on a single processor [STATUS: under development, although the code is much faster now]
Log-linear model weight adjustment is carried out by means of the downhill simplex algorithm, which is slow. Downhill simplex will be replaced by a more efficient technique [STATUS: issue solved]
Non-monotonic translation is not yet sufficiently tested, especially with complex corpora such as Europarl [STATUS: under development]
Project documentation is being developed. If you need additional help, please refer to the Thot manual.
Additional information about the theoretical foundations of Thot can be found in [Ortiz-Martínez, 2011]. One interesting feature of Thot, incremental (or online) estimation of statistical models, is also described in [Ortiz-Martínez et al., 2010]. Finally, phrase-level alignment generation functionality implemented in Thot was proposed in [Ortiz-Martínez et al., 2008].
You are welcome to use the code under the terms of the license for research or commercial purposes, however please acknowledge its use with a citation:
Daniel Ortiz-Martínez, Francisco Casacuberta. "The New Thot Toolkit for Fully Automatic and Interactive Statistical Machine Translation". In Proc. of the European Association for Computational Linguistics (EACL): System Demonstrations, Gothenburg, Sweden, April 2014. pp. 45-48.
Here is a BibTeX entry:
@InProceedings{Ortiz2014,
  author    = {Daniel Ortiz-Mart\'{\i}nez and Francisco Casacuberta},
  title     = {The New Thot Toolkit for Fully Automatic and Interactive Statistical Machine Translation},
  booktitle = {Proc. of the European Association for Computational Linguistics (EACL): System Demonstrations},
  year      = {2014},
  month     = {April},
  address   = {Gothenburg, Sweden},
  pages     = {45--48},
}
Daniel Ortiz-Martínez, "Online Learning for Statistical Machine Translation". In Computational Linguistics, 2016. Vol. 42 (1), pp. 121-161. Download
Daniel Ortiz-Martínez, "Advances in Fully-Automatic and Interactive Phrase-Based Statistical Machine Translation". PhD Thesis. Universidad Politécnica de Valencia. 2011. Advisors: Ismael García Varea and Francisco Casacuberta. Download
Daniel Ortiz-Martínez, Ismael García-Varea, Francisco Casacuberta. "Online Learning for Interactive Statistical Machine Translation". In Proc. of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT), pp. 546-554, Los Angeles, US, 2010. Download
Daniel Ortiz-Martínez, Ismael García-Varea, Francisco Casacuberta. "Phrase-level alignment generation using a smoothed loglinear phrase-based statistical alignment model". In Proc. of the European Association for Machine Translation (EAMT), pp. 160-169, Hamburg, Germany, 2008. Best paper award. Download
Daniel Ortiz-Martínez, Ismael García-Varea, Francisco Casacuberta. "Thot: a toolkit to train phrase-based models for statistical machine translation". In Proc. of the Machine Translation Summit (MT-Summit), Phuket, Thailand, September 2005. Download
Thot has been supported by the European Union under the CasMaCat research project. Thot has also received support from the Spanish Government in a number of research projects, such as the MIPRCV project that belonged to the prestigious CONSOLIDER programme.