Sessions

Project 1

Deadline Project1 April 26, 2019

Tutorial for theory of IBM

Tutorial for practical Collins

NAACL alignment format

Output format

The results file should include one line for each word-to-word alignment identified by the system. The lines in the results file should follow the format below:

sentence_no position_L1 position_L2 [S

where:

sentence_no represents the id of the sentence within the test file. Sentences in the test data already have an id assigned. (see the examples below)
position_L1 represents the position of the token that is aligned from the text in language L1; the first token in each sentence is token 1. (not 0)
position_L2 represents the position of the token that is aligned from the text in language L2; again, the first token is token 1.
S|P can be either S or P, representing a Sure or Probable alignment. All alignments that are tagged as S are also considered to be part of the P alignments set (that is, all alignments that are considered “Sure” alignments are also part of the “Probable” alignments set). If the S|P field is missing, a value of S will be assumed by default.

The S

P field overlap is optional.

Running example

Consider the two following aligned sentences:

[from the English file]

They had gone .

[from the French file]

Ils etaient alles .

A correct word alignment that will be produced for this sentence is

18 1 1

18 2 2

18 3 3

18 4 4

Which states that all these alignments are from sentence 18, and the English token 1 (“They”) aligns with the French token 1 (“Ils”), the English token 2 (“had”), aligns with the French token 2 (“etaient”), and so on. Note that the punctuation is also aligned (English token 4 (“.”) align with French token (“.”)), and will count towards the final scoring figures.

With missing S

P fields considered by default to be S.