Large language models "ad referendum": How good are they at machine translation in the legal domain?
Abstract
This study evaluates the machine translation (MT) quality of two state-of-the-art large language models (LLMs) against a traditional neural machine translation (NMT) system, Google Translate, across four language pairs in the legal domain. It combines automatic evaluation metrics (AEMs) with human evaluation (HE) by professional translators, who assess translation ranking, fluency and adequacy. The results indicate that, while Google Translate generally outperforms the LLMs on AEMs, human evaluators rate the LLMs, especially GPT-4, as comparable or slightly better at producing contextually adequate and fluent translations. This discrepancy suggests that LLMs have potential for handling specialized legal terminology and context, and it highlights the importance of human evaluation in assessing MT quality. The study underscores the evolving capabilities of LLMs in specialized domains and calls for a re-evaluation of traditional AEMs so that they better capture the nuances of LLM-generated translations.
This work is licensed under a Creative Commons Attribution 4.0 International License.