Quantitative Evaluation of Machine Translation Systems: Sentence Level

Palmira Marrafa1 and António Ribeiro2

1 Universidade de Lisboa, Faculdade de Letras
Group of Lexical and Grammatical Knowledge Computation (CLUL)
Avenida 5 de Outubro, 85 – 5º
P–1050–050 Lisboa, Portugal
Palmira.Marrafa@netcabo.pt

2 Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologia, Departamento de Informática
Quinta da Torre, Monte da Caparica
P–2829–516 Caparica, Portugal
ambar@di.fct.unl.pt

Abstract
This paper reports the first results of an on-going research on evaluation of Machine Translation quality. The starting point for this work was the framework of ISLE (the International Standards for Language Engineering), which provides a classification for evaluation of Machine Translation. In order to make a quantitative evaluation of translation quality, we pursue a more consistent, fine-grained and comprehensive classification of possible translation errors and we propose metrics for sentence level errors, specifically lexical and syntactic errors.

Keywords
Machine Translation evaluation, translation quality metrics

Introduction
Much work has been done on the evaluation of Machine Translation in the last ten years (see, for example, Balkan, 1991; Arnold et al., 1993; Vasconcellos, 1994; White et al., 1994; EAGLES, 1996; White and O’Connell, 1996; White, forthcoming). A common goal has been the design of evaluation techniques that yield a more objective evaluation of the quality of Machine Translation systems.

However, the evaluation of Machine Translation has been subjective to a great extent. ISLE (the International Standards for Language Engineering) aims at reducing subjectivity in this domain. It provides a classification of internal and external characteristics of Machine Translation systems to be evaluated in conformity with the ISO/IEC 9126 standard (ISO 1991), which concerns quality characteristics of software products. It assumes the need for a quantitative evaluation leading to the definition of metrics.

However, that classification is not fine-grained enough to evaluate the quality of machine-translated texts with regard to the possible types of translation errors. Thus, in this work, we propose a more consistent, fine-grained and comprehensive classification at the individual sentence level. Our classification takes into account the internal structure of lexical units and syntactic constituents. Moreover, we propose metrics to make an objective quantitative evaluation. These metrics are based on the number of errors found and the total number of possible errors. The structural complexity of the possible errors is also considered in the metrics.

We selected some pertinent characteristics from the ISLE classification to measure the quality of sentence level translations, concerning lexical and syntactic errors, including collocations, fixed and semi-fixed expressions for lexical evaluation. As for syntactic errors, we built a typology of errors.

Our methodology was motivated by English, French and Portuguese parallel texts from the European Parliament sessions and also by translations obtained from two commercial Machine Translation systems.

In the next section, we present a motivation for the refinement of the taxonomy with some examples. After that, we summarise the classification and define the metrics used for the evaluation. In the following section, we discuss some previous work. Finally, we present the conclusions and the future work.
Motivation
ISO (the International Organisation for Standardisation) and IEC (the International Electrotechnical Commission) are the institutions which develop international standards. As for evaluation, an important standard is ISO/IEC 9126 (ISO 1991). This standard distinguishes between internal characteristics, which pertain to the internal workings and structure of the software, and external characteristics, which can be observed when the system is in operation.

The ISLE Classification Framework for Evaluation of Machine Translation1 provides a classification of the internal and the external characteristics of Machine Translation systems to be evaluated in conformity with the ISO/IEC 9126 standard.

Aiming to analyse Machine Translation systems from a user’s point of view, we focussed on the external characteristics. We took the ISLE classification as a starting point for this evaluation.

Ideally, an evaluation of the quality of a Machine Translation system should cover all the different parameters liable to be considered in a translation. However, this is too complex a task to be done at this early stage of our work. Thus, we decided to focus on the sentence level.

1 http://issco-www.unige.ch/staff/andrei/islemteval2/mainclassification.html

The evaluation of this level deals with functionality, in particular accuracy, according to the ISLE classification:

2.2 System external characteristics
  2.2.1 Functionality
    2.2.1.2 Accuracy
      2.2.1.2.2 Individual sentence level
        2.2.1.2.2.1 Morphology
        2.2.1.2.2.2 Syntax (sentence and phrase structure)
      2.2.1.2.3 Types of errors
        2.2.1.2.3.2 Punctuation errors
        2.2.1.2.3.3 Lexical errors
        2.2.1.2.3.4 Syntax errors
        2.2.1.2.3.5 Stylistic errors
Fig. 1: Extract from the ISLE Framework

However, the characteristics listed above are not fine-grained enough for the evaluation. Moreover, the metrics proposed in the ISLE classification do not provide a sufficiently objective evaluation.
Scoring the Quality
We aim at quantifying evaluation as much as possible in order to reduce subjectivity. To this end, we have compiled a systematic list of lexical and syntactic properties which can be a source of translation errors at the sentence level. Refer to the Appendix for the main properties included.

This list is used to compute both the number of possible errors that can occur in a given sentence and the number of errors actually identified in that sentence. The translation quality score is computed with these numbers, as follows:

$$\text{Score} = 1 - \frac{\sum_{e=1}^{n} \text{\#identified error}(e) \times \text{weight}(e)}{\sum_{e=1}^{n} \text{\#possible error}(e) \times \text{weight}(e)} \qquad (1)$$

where e is the error type number. The score is weighted since we assume that not every error has the same impact on the translation quality. It seems fair to take into account how severe errors are. We claim that the weight of each syntactic dependency constraint should be determined as a function of the probability of its occurrence. These probabilities are computed from an analysis of corpora as shown below:

$$\text{weight}(c) = \frac{\text{\#occurrences of constraint}(c)}{\sum_{C=1}^{n} \text{\#occurrences of constraint}(C)} \qquad (2)$$

The weight of constraint c, computed in this way, is equal to the weight assigned to error e.

We should stress that there are some syntactic phenomena which are more difficult to handle than others in some Machine Translation systems because of the expressive power of the systems’ formalisms.

However, in our approach, a syntactic phenomenon which may be difficult to express in a Machine Translation system is not necessarily assigned a high weight just because it is more difficult. Its weight is based on its occurrence frequency in corpora. We are evaluating the translation quality and not the quality of the Machine Translation system, which we take as a black box. That is, as mentioned above, we do not evaluate the system internal characteristics, according to the ISLE Framework.

Lexical errors are not as clearly definable. We believe that their weights should take into account how much they affect the understandability of a sentence. For example, ‘fat ideas’ seems to be more difficult to understand than ‘big ideas’. We claim that WordNets can be used to weight the lexical adequacy. This weight may be computed by measuring the conceptual distance between the node which represents the expected lexical unit and the one which represents the translation obtained. In order to include this weight, we are currently working on a way to tune the formula of the metric presented in (1). To measure the conceptual distance we intend to extend the techniques described in Resnik (1999).2

Notice that determining the number of possible errors is not a trivial task since the identification of all constraints can be quite complex.

An example is given in Fig. 2 and Fig. 3 (both the original text and the Portuguese version of the text were extracted from a document of the European Parliament):

Original text: ‘Texts adopted by the Parliament’
Portuguese version: ‘Textos aprovados pelo Parlamento’
Translation by an MT System: ‘Textos adoptivo por Parlamento’
Fig. 2: Example of a Translation

Agr-num, Agr-num
‘Textos aprovados pelo Parlamento’
Agr-gen, Prep, Agr-gen
Fig. 3: Identification of Constraints.

Considering lexical and syntactic properties, these are the possible errors:
Lexicon
- Five tokens realised: ‘textos’, ‘aprovados’, ‘por’, ‘o’ and ‘Parlamento’;
- One term: ‘textos aprovados’;
Syntax
- Agr-num: agreement – number between ‘Textos’ and ‘aprovados’;
- Agr-gen: agreement – gender between ‘Textos’ and ‘aprovados’;
- Prep: preposition selection: ‘aprovados’ selects ‘por’;
- Order: four3 wrongly ordered tokens;
- Agr-num: agreement – number between ‘o’ and ‘Parlamento’;
- Agr-gen: agreement – gender between ‘o’ and ‘Parlamento’;
- Contractions: ‘por’ (‘by’) + ‘o’ (‘the’) = ‘pelo’.

2 For an alternative approach, see Agirre and Rigau (1995).
3 For n tokens, the highest number of token order errors is n–1, which happens when all tokens are reversed.
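The tally of possible errors listed above can be sketched as follows. This is only an illustration of the counting, not the authors' tooling: the category labels and the helper `max_order_errors` are hypothetical names, and the contraction ‘pelo’ is counted as the two tokens ‘por’ and ‘o’, as footnote 3 assumes.

```python
def max_order_errors(n_tokens):
    """Upper bound on token order errors for n tokens: n - 1 (footnote 3)."""
    return max(n_tokens - 1, 0)


# 'pelo' is split into 'por' + 'o', so the reference has five tokens.
tokens = ["textos", "aprovados", "por", "o", "Parlamento"]

# Possible errors per category, read off the Lexicon/Syntax list above.
possible_errors = {
    "lexicon/token": len(tokens),                   # five realised tokens
    "lexicon/term": 1,                              # 'textos aprovados'
    "syntax/agr_num": 2,                            # Textos-aprovados, o-Parlamento
    "syntax/agr_gen": 2,                            # Textos-aprovados, o-Parlamento
    "syntax/prep": 1,                               # 'aprovados' selects 'por'
    "syntax/order": max_order_errors(len(tokens)),  # four
    "syntax/contraction": 1,                        # 'por' + 'o' = 'pelo'
}

total_possible = sum(possible_errors.values())
print(total_possible)
```

Under these assumptions the categories add up to the 16 possible errors reported for this example.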

Agr-num, Agr-num
‘Textos adoptivo por ø Parlamento’
Agr-gen, Prep, Agr-gen
Fig. 4: Identified Errors

Lexicon
- One unrealised token: ‘o’;
- One wrong token: ‘adoptivo’ instead of ‘adoptado’ (co-occurrence restriction violation);
- One wrong term: ‘textos adoptivo’ (the European Institutions adopted the translation ‘textos aprovados’ for ‘texts adopted’);
Syntax
- Agr-num: agreement – number between ‘Textos’ (plural) and ‘adoptivo’ (singular);
- Prep: preposition selection: ‘aprovados’ selects ‘por’;
- Agr-num: no agreement – number between ‘o’ and ‘Parlamento’;
- Agr-gen: agreement – gender between ‘o’ and ‘Parlamento’;
- Contractions: ‘por’ (‘by’) + ‘o’ (‘the’) = ‘pelo’.

The total number of possible errors found in this short example amounts to 16. This gives an idea of how hard the identification of possible errors in a text may be.

A Simpler Approach
Metrics strictly based on the total number of tokens and on the number of wrong tokens would obviously be much easier to compute.

Along these lines, Bangalore et al. (2000) discuss three metrics based on the number of insertions, deletions and substitutions needed in a generated string to obtain a reference string in the context of generation. Equation (3) shows the simplest one (Bangalore et al., 2000, p. 3):

$$\text{Simple String Accuracy} = 1 - \frac{I + D + S}{R} \qquad (3)$$

where I is the number of insertions, D the number of deletions, S the number of substitutions and R the number of tokens in the string. This metric, which has already been used to measure the quality of Machine Translation systems (Alshawi et al., 1998), penalises misplaced words twice, as pointed out by Bangalore et al. (ibidem), because it counts this error as one deletion and one insertion. As a consequence, the number of insertions and deletions can be larger than the actual number of tokens. Should this be the case, the result of the metric may be negative. To avoid this, the authors treat the misplaced words separately in the formula by adding another variable (M) which counts the number of misplaced tokens.

$$\text{Generation String Accuracy} = 1 - \frac{M + I + D + S}{R} \qquad (4)$$

In spite of the improvement, this metric treats misplaced non-atomic constituents as several misplaced tokens. Thus, the authors recognise the need to include constituency criteria in the design of the metrics. As a matter of fact, creating discontinuities in constituents should be penalised more than scrambling constituents because the level of unacceptability is higher in the former case than in the latter. For example, ‘Texts by adopted the Parliament’ seems worse than ‘by the Parliament texts adopted’. Bearing this in mind, they suggest a third metric, called tree-based accuracy, which sums the score of the simple string accuracy metric, for atomic constituents and tokens, and the score of the generation string accuracy metric, for non-atomic constituents. For this, each sentence is parsed to identify the constituents and its parse tree is compared to the tree of the reference string (the parsing is based on the Penn Treebank).

3 (continued) We assume that one token resulting from two contracted or juxtaposed tokens counts as two distinct tokens. We do this because two non-contracted or juxtaposed tokens may be in the wrong order.
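The two string-based metrics in (3) and (4) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of Bangalore et al.: the edit operations come from a standard Levenshtein alignment with unit costs, R is taken to be the length of the reference string, and a token that appears both among the deletions and among the insertions is treated as one misplaced token (M). All function names are hypothetical.

```python
from collections import Counter


def edit_operations(generated, reference):
    """Return (inserted, deleted, substitutions) for one minimum-cost edit
    script that turns the generated token list into the reference."""
    g, r = len(generated), len(reference)
    # dist[i][j]: edit distance between generated[:i] and reference[:j]
    dist = [[0] * (r + 1) for _ in range(g + 1)]
    for i in range(1, g + 1):
        dist[i][0] = i
    for j in range(1, r + 1):
        dist[0][j] = j
    for i in range(1, g + 1):
        for j in range(1, r + 1):
            cost = 0 if generated[i - 1] == reference[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution
    inserted, deleted, substitutions = [], [], 0
    i, j = g, r
    while i > 0 or j > 0:  # trace one optimal path back to the origin
        same = i > 0 and j > 0 and generated[i - 1] == reference[j - 1]
        if same and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and not same and dist[i][j] == dist[i - 1][j - 1] + 1:
            substitutions += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            deleted.append(generated[i - 1])
            i -= 1
        else:
            inserted.append(reference[j - 1])
            j -= 1
    return inserted, deleted, substitutions


def simple_string_accuracy(generated, reference):
    """Equation (3): 1 - (I + D + S) / R."""
    inserted, deleted, subs = edit_operations(generated, reference)
    return 1 - (len(inserted) + len(deleted) + subs) / len(reference)


def generation_string_accuracy(generated, reference):
    """Equation (4): 1 - (M + I + D + S) / R, with moves counted once."""
    inserted, deleted, subs = edit_operations(generated, reference)
    # Assumption: a token both deleted and inserted was merely misplaced.
    moved = sum((Counter(inserted) & Counter(deleted)).values())
    errors = moved + (len(inserted) - moved) + (len(deleted) - moved) + subs
    return 1 - errors / len(reference)


reference = "Textos aprovados pelo Parlamento".split()
mt_output = "Textos adoptivo por Parlamento".split()
misplaced = "aprovados pelo Parlamento Textos".split()  # hypothetical reordering

print(simple_string_accuracy(mt_output, reference))
print(simple_string_accuracy(misplaced, reference))
print(generation_string_accuracy(misplaced, reference))
```

In the hypothetical reordering, the single misplaced token costs one deletion plus one insertion under (3) but only one move under (4), which is the double penalty discussed above.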
Nevertheless, the tree-based accuracy metric does not take into account the internal structure of constituents except for the linear order. As a consequence, whenever two errors occur in a token this approach just considers them as a single error. For example:

* ‘textos aprovada’
text+MASC+PLU adopted+FEM+SING
Fig. 5: Example of errors inside an NP (gender and number agreement)

In this example, we have two errors. However, the metric above just considers them as one, since the substitution of one token suffices to correct it. This shows that we need to consider the internal structure of the constituents to identify and count all the errors in order to penalise them. Otherwise, some of them may not be penalised. Our approach attempts to be more accurate, avoiding this problem. It considers the internal structure of the constituents, providing a more fine-grained4 typology of errors as presented in the Appendix.
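To make the contrast concrete, the following sketch checks each agreement constraint of the Fig. 5 example separately and feeds the result into the weighted score of (1), with weights derived as in (2). It is an illustration only: the corpus frequencies are invented placeholders, the feature dictionaries are a hypothetical encoding of the glosses in Fig. 5, and ‘aprovadas’ is an added variant with a gender error alone.

```python
from collections import Counter


def constraint_weights(corpus_counts):
    """Equation (2): relative frequency of each constraint in a corpus."""
    total = sum(corpus_counts.values())
    return {c: n / total for c, n in corpus_counts.items()}


def quality_score(identified, possible, weights):
    """Equation (1): 1 - weighted identified errors / weighted possible errors."""
    found = sum(identified.get(e, 0) * weights[e] for e in possible)
    maximum = sum(possible[e] * weights[e] for e in possible)
    return 1 - found / maximum


def agreement_errors(head, modifier):
    """One boolean check per agreement constraint (gender, number)."""
    return Counter(
        f"agr_{feature}"
        for feature in ("gen", "num")
        if head[feature] != modifier[feature]
    )


# Hypothetical corpus frequencies, for illustration only.
weights = constraint_weights({"agr_num": 600, "agr_gen": 400})
possible = {"agr_num": 1, "agr_gen": 1}

textos = {"gen": "MASC", "num": "PLU"}
aprovada = {"gen": "FEM", "num": "SING"}   # Fig. 5: two violated constraints
aprovadas = {"gen": "FEM", "num": "PLU"}   # hypothetical variant: gender only

errors_fig5 = agreement_errors(textos, aprovada)
errors_variant = agreement_errors(textos, aprovadas)

print(sum(errors_fig5.values()), quality_score(errors_fig5, possible, weights))
print(sum(errors_variant.values()), quality_score(errors_variant, possible, weights))
```

A token-level metric would see a single substitution in both cases, whereas the constraint-level count separates the two-error phrase from the one-error variant and lets the weights decide how much each violation costs.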
Conclusions
We believe that the approach presented in this paper is the right way to move towards a trustworthy evaluation of translation quality. Our proposal provides the means for an objective evaluation. It makes use of a fine-grained typology of errors which aims at dealing with boolean criteria. This greatly reduces subjectivity.

4 Depending on the application, we can relax the granularity of the typology of errors. For example, specifier–noun agreement may not be relevant for gisting.
