Large language models "ad referendum": How good are they at machine translation in the legal domain?

Vicent Briva-Iglesias
Gokhan Dogru
João Lucas Cavalheiro Camargo

Abstract

This study evaluates the machine translation (MT) quality of two state-of-the-art large language models (LLMs) against a traditional neural machine translation (NMT) system across four language pairs in the legal domain. It combines automatic evaluation metrics (AEMs) and human evaluation (HE) by professional translators to assess translation ranking, fluency and adequacy. The results indicate that while Google Translate generally outperforms the LLMs in AEMs, human evaluators rate the LLMs, especially GPT-4, as comparable or slightly better at producing contextually adequate and fluent translations. This discrepancy suggests that LLMs have potential in handling specialized legal terminology and context, and it highlights the importance of human evaluation methods in assessing MT quality. The study underscores the evolving capabilities of LLMs in specialized domains and calls for a reevaluation of traditional AEMs to better capture the nuances of LLM-generated translations.

Article Details

How to Cite
Briva-Iglesias, V., Dogru, G., & Cavalheiro Camargo, J. L. (2024). Large language models "ad referendum": How good are they at machine translation in the legal domain? MonTI. Monographs in Translation and Interpreting, (16), 75–107. https://doi.org/10.6035/MonTI.2024.16.02
