Thot toolkit for statistical machine translation

View the Project on GitHub daormar/thot

Thot: a Toolkit for Statistical Machine Translation


Thot is an open-source software toolkit for statistical machine translation (SMT). Originally, Thot incorporated tools to train phrase-based models. The new version of Thot now includes a state-of-the-art phrase-based translation decoder as well as tools to estimate all of the models involved in the translation process. In addition, Thot can incrementally update its models in real time, using online learning, each time it is presented with a new sentence pair.

Thot is being developed by Daniel Ortiz-Martínez. Daniel is a researcher in natural language processing at Webinterpret. Formerly, he was a member of the PRHLT research group as well as an assistant professor at the Technical University of Valencia.


A new version of the toolkit has been released with several improvements and new features:


The toolkit includes the following features:

Distribution Details

Thot is written in C, C++, Python and shell script. It is known to compile on Unix-like and Windows (using Cygwin) systems. See the "Documentation and Support" section of these instructions if you experience problems during compilation.

It is released under the GNU Lesser General Public License (LGPL).


Basic Installation Procedure

To install Thot, first you need to install the autotools (autoconf, autoconf-archive, automake and libtool packages in Ubuntu). If you are planning to use Thot on a Windows platform, you also need to install the Cygwin environment. Alternatively, Thot can also be installed on Mac OS X systems using MacPorts.

In addition, some of the code used for corpus pre/post-processing (see the Thot manual for details) is based on the Natural Language Toolkit (NLTK) library. Users who want the pre/post-processing functionality incorporated in Thot will need to install that library as well.
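Before starting, it can help to confirm that the prerequisites are visible from your shell. The following is a minimal sketch, not part of the Thot distribution. The helper name `missing_tools` and the `PYTHON` variable are hypothetical, and the command names checked are assumptions: on Ubuntu the libtool package is expected to provide `libtoolize`, while autoconf-archive only ships macros and has no command to test for.

```shell
#!/bin/sh
# Print the names of the given commands that are NOT found on PATH,
# separated by spaces (prints an empty line if all are found).
missing_tools() {
    missing=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    # Strip the leading space before printing.
    echo "${missing# }"
}

# Build tools named in the instructions (command names are assumptions).
not_found=$(missing_tools autoconf automake libtoolize)
if [ -n "$not_found" ]; then
    echo "Missing build tools: $not_found"
else
    echo "All build tools found"
fi

# NLTK is only needed for the pre/post-processing functionality.
# PYTHON is a hypothetical override for the interpreter name.
PYTHON="${PYTHON:-python3}"
if command -v "$PYTHON" >/dev/null 2>&1 && "$PYTHON" -c "import nltk" 2>/dev/null; then
    echo "NLTK available"
else
    echo "NLTK not found (only needed for pre/post-processing)"
fi
```

The script only reports what it finds; it does not install anything.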

Once the autotools are available (as well as other required software such as Cygwin, MacPorts or the NLTK library), you can install Thot by following these steps:

  1. Obtain the package using git:

    $ git clone

    Or download it as a zip file

  2. cd to the directory containing the package's source code and type ./reconf.

  3. Type ./configure to configure the package.

  4. Type make to compile the package.

  5. Type make install to install the programs and any data files and documentation.

  6. You can remove the program binaries and object files from the source code directory by typing make clean.

By default the files are installed under the /usr/local/ directory (or similar, depending on the OS you use). Since writing to that location in Step 5 typically requires root privileges, a different directory can be specified during Step 3 by typing:

 $ ./configure --prefix=<absolute-installation-path>

For example, if "user1" wants to install the Thot package in the directory /home/user1/thot, the sequence of commands to execute should be the following:

 $ ./reconf
 $ ./configure --prefix=/home/user1/thot
 $ make
 $ make install

The installation directory can be the same directory where the Thot package was decompressed.
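When a custom prefix is used, the installed programs are usually not on the shell's search path. The sketch below assumes the standard autotools layout, in which executables are placed under `<prefix>/bin`; the variable `THOT_PREFIX` and the helper `prepend_to_path` are hypothetical names introduced for this illustration.

```shell
#!/bin/sh
# Installation prefix given to ./configure --prefix.
# The default used here ($HOME/thot) is only an example value.
THOT_PREFIX="${THOT_PREFIX:-$HOME/thot}"

# Prepend a directory to PATH unless it is already listed there.
prepend_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;               # already present: do nothing
        *) PATH="$1:$PATH" ;;
    esac
}

# Assumption: executables are installed under <prefix>/bin.
prepend_to_path "$THOT_PREFIX/bin"
export PATH
```

Adding these lines to a shell startup file such as ~/.profile would make the setting persistent.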

See the "INSTALL" file for more information.

IMPORTANT NOTE: if Thot is being installed on a PBS cluster (a cluster providing qsub and other related tools), the configure script must be executed on the main cluster node so that it properly detects the cluster configuration (do not execute it in an interactive session).
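A quick way to see where you are before running configure is sketched below. It relies on the `PBS_ENVIRONMENT` variable, which PBS and Torque set to `PBS_BATCH` or `PBS_INTERACTIVE` inside jobs; whether your cluster sets it is an assumption you should verify locally. The function name `pbs_session_kind` is hypothetical.

```shell
#!/bin/sh
# Classify the current shell from PBS_ENVIRONMENT:
# "interactive", "batch", or "none" when the variable is unset or unknown.
pbs_session_kind() {
    case "${PBS_ENVIRONMENT:-}" in
        PBS_INTERACTIVE) echo "interactive" ;;
        PBS_BATCH)       echo "batch" ;;
        *)               echo "none" ;;
    esac
}

if [ "$(pbs_session_kind)" = "interactive" ]; then
    echo "Inside an interactive PBS session: run ./configure on the main cluster node instead."
elif command -v qsub >/dev/null 2>&1; then
    echo "qsub found on this node."
else
    echo "qsub not found on this node."
fi
```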

Alternative Installation Options

The Thot configure script can be used to modify the toolkit behavior. Here is a list of current installation options:

Installation Including the CasMaCat Workbench

Thot can be combined with the CasMaCat Workbench that is being developed in the project of the same name. See this webpage to get the specific installation instructions.

Checking Package Installation

Once the package has been installed, basic checks can be run automatically to detect portability errors. For this purpose, the following command can be used:

 $ make installcheck

The tests performed by the previous command involve the execution of the main tasks present in a typical SMT pipeline, including training and tuning of model parameters as well as generating translations using the estimated models (see the Thot manual for details). The command internally uses the toy corpus provided with the Thot package to carry out the checks.
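Because the check runs a whole pipeline, it can be convenient to keep its output in a file for later inspection. The following is a minimal sketch under two assumptions: that it is run from the configured source directory, and that make returns a non-zero exit status when a check fails. The helper `run_and_log` and the file name installcheck.log are hypothetical.

```shell
#!/bin/sh
# Run a command, saving its stdout and stderr to a log file.
# The function returns the exit status of the command itself.
run_and_log() {
    log_file=$1
    shift
    "$@" >"$log_file" 2>&1
}

# Only attempt the check from a directory that contains a Makefile.
if [ -f Makefile ]; then
    if run_and_log installcheck.log make installcheck; then
        echo "installcheck passed"
    else
        echo "installcheck failed; last lines of installcheck.log:"
        tail -n 20 installcheck.log
    fi
fi
```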

Relation with Existing Software

Because of its strong focus on online and incremental learning, Thot includes its own programs for language and translation model estimation. Specifically, Thot includes tools to work with n-gram language models based on incrementally updatable sufficient statistics. Thot also includes a set of tools and a complete software library to estimate IBM 1, IBM 2 and HMM-based word alignment models. The estimation process can be carried out using batch and incremental EM algorithms. This functionality is not based on the standard GIZA++ software for word alignment model generation.
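To illustrate what incrementally updatable sufficient statistics mean for an n-gram language model, here is a minimal sketch for the unsmoothed bigram case. It is a textbook illustration rather than a description of Thot's actual estimator, and the symbols c, c_s and c' are notation introduced only for this example.

```latex
% Maximum-likelihood bigram estimate: the data enter only through counts,
% so the counts c(.) are sufficient statistics for the model.
\[
  p(w_i \mid w_{i-1}) = \frac{c(w_{i-1}\, w_i)}{c(w_{i-1})}
\]

% Incremental update after observing a new sentence s, where c_s(.)
% denotes the counts collected from s alone.
\[
  c'(w_{i-1}\, w_i) = c(w_{i-1}\, w_i) + c_s(w_{i-1}\, w_i),
  \qquad
  c'(w_{i-1}) = c(w_{i-1}) + c_s(w_{i-1})
\]
```

For instance, if c(the house) = 3 and c(the) = 10, then p(house | the) = 3/10 = 0.3. After a new sentence in which "the" occurs once, followed by "house", the counts become 4 and 11, and the estimate becomes 4/11 ≈ 0.364, with no need to revisit the earlier training data.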

Additionally, Thot does not use any code from other existing translation tools. In this regard, Thot tries to offer its own view of the process of statistical machine translation, with a strong focus on online learning and also incorporating interactive machine translation functionality. Another interesting feature of the toolkit is its stable and robust translation server.

Current Status

The Thot toolkit is under development. The original public versions of Thot date back to 2005 [Ortiz-Martínez et al., 2005] and were hosted on SourceForge. These original versions were strongly focused on the estimation of phrase-based models. By contrast, the current version offers several new features that had not been previously incorporated.

A set of specific tools to ease the process of running SMT experiments has been created. Basic usage instructions have recently been added to the Thot manual.

In addition, some toolkit extensions will be incorporated in the coming months:

Finally, here is a list of known issues with the Thot toolkit that are currently being addressed:

Documentation and Support

Project documentation is being developed. This documentation includes:

If you need additional help, you can:

Additional information about the theoretical foundations of Thot can be found in [Ortiz-Martínez, 2011]. One interesting feature of Thot, incremental (or online) estimation of statistical models, is also described in [Ortiz-Martínez et al., 2010]. Finally, phrase-level alignment generation functionality implemented in Thot was proposed in [Ortiz-Martínez et al., 2008].


You are welcome to use the code under the terms of the license for research or commercial purposes. However, please acknowledge its use with a citation:

Daniel Ortiz-Martínez, Francisco Casacuberta. "The New Thot Toolkit for Fully Automatic and Interactive Statistical Machine Translation". In Proc. of the European Association for Computational Linguistics (EACL): System Demonstrations, Gothenburg, Sweden, April 2014. pp. 45-48.

Here is a BibTeX entry:

@InProceedings{Ortiz2014,
  author    = {Daniel Ortiz-Mart\'{\i}nez and Francisco Casacuberta},
  title     = {The New Thot Toolkit for Fully Automatic and Interactive Statistical Machine Translation},
  booktitle = {Proc. of the European Association for Computational Linguistics (EACL): System Demonstrations},
  year      = {2014},
  month     = {April},
  address   = {Gothenburg, Sweden},
  pages     = "45--48",
}


Daniel Ortiz-Martínez. "Online Learning for Statistical Machine Translation". In Computational Linguistics, 2016. Vol. 42 (1), pp. 121-161.

Daniel Ortiz-Martínez. "Advances in Fully-Automatic and Interactive Phrase-Based Statistical Machine Translation". PhD Thesis. Universidad Politécnica de Valencia. 2011. Advisors: Ismael García Varea and Francisco Casacuberta.

Daniel Ortiz-Martínez, Ismael García-Varea, Francisco Casacuberta. "Online Learning for Interactive Statistical Machine Translation". In Proc. of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL-HLT), pp. 546-554, Los Angeles, US, 2010.

Daniel Ortiz-Martínez, Ismael García-Varea, Francisco Casacuberta. "Phrase-level alignment generation using a smoothed loglinear phrase-based statistical alignment model". In Proc. of the European Association for Machine Translation (EAMT), pp. 160-169, Hamburg, Germany, 2008. Best paper award.

Daniel Ortiz-Martínez, Ismael García-Varea, Francisco Casacuberta. "Thot: a toolkit to train phrase-based models for statistical machine translation". In Proc. of the Machine Translation Summit (MT-Summit), Phuket, Thailand, September 2005.


Thot has been supported by the European Union under the CasMaCat research project. Thot has also received support from the Spanish Government in a number of research projects, such as the MIPRCV project that belongs to the CONSOLIDER programme.

Last updated: 20 March 2017