CELGA-ILTEC is an R&D unit of the University of Coimbra (FCT unit ID: 4887), created in 2015 through the merger of two units, ILTEC and CELGA. Its core activities are research and the creation of linguistic resources.
Portuguese Unisyn Lexicon (LUPo): An Accent-Independent Pronunciation Lexicon for Portuguese
Successful speech technologies require the ability to account for variation in the speech signal. Most text-to-speech (TTS) systems are built using data from a single accent, usually the one considered to be the standard accent, or dialect, of a given language. While the users of these technologies represent an ever-widening speaker base, developing separate lexicons to account for regional pronunciation variants is extremely costly. Semi-automatic approaches that exploit regularities between graphemes and phones have yielded good results. However, such systems rarely extend to multiple accents, and they make limited or no use of morphology. Moreover, these projects typically occur in isolation and are governed by private-sector interests that prohibit the sharing of data and tools.
The Portuguese Unisyn Lexicon project (LUPo) is dedicated to delivering an accent-independent lexicon and rule system for generating accent-specific pronunciations in Portuguese. With the consultancy of Susan Fitt, author and developer of the Unisyn Lexicon for English, we will reformulate the methodologies Fitt originally employed in order to adapt this largely successful paradigm to Portuguese. We will also take advantage of the MorDebe database's relational structure and rich lexicographic content to minimize confusability and create a more integrated and better-informed system. Our model will capitalize on direct access to mappings of European and Brazilian Portuguese spelling variants, part-of-speech information, etymological relationships, and a morphological parser.
The end product will be a set of open-source tools for generating accent-specific output for individual lexical entries, along with the ability to produce transcriptions for multi-word texts. Pronunciation models will be included for the European and Brazilian Portuguese standards, plus eight or more actual spoken accents representing the continents of Africa, Asia, Europe, and South America. All deliverables, including cross-dialectal data, phonetic transcriptions, the master lexicon, allophonic rules, and tools, will be documented and made freely available to the research community and the general public via the Portal da Língua Portuguesa knowledge base. The 'Portal' currently receives 4,000 to 4,500 hits from unique users per day and is increasingly regarded as a standard resource for inquiries about the Portuguese language. Including LUPo will greatly enhance the Portal as a pan-Lusophone resource, making it the only one of its kind to provide richly detailed and varied phonetic output for a large number of Portuguese accents. Indeed, it will be the first online resource to provide high-quality phonetic transcription data for regional variants of Portuguese.
Research partners for this project include specialists from Brazil and Europe, representing the fields of phonetics, computational linguistics, lexicography, and Portuguese phonology, morphology, dialectology, and sociolinguistics.
Instituto de Linguística Teórica e Computacional (ILTEC)
3 years (March 2010 through February 2013)
Research Coordinator and Primary Investigator
Simone Ashby (ILTEC)
Sílvia Brandão (Universidade Federal do Rio de Janeiro)
Susan Fitt (formerly of the University of Edinburgh)
Project Initiatives and Objectives
The LUPo project will produce an accent-independent pronunciation lexicon for Portuguese, along with tools for generating accent-specific output for lexical entries and multi-word texts. The proposed tools will feature an interactive mode in which the post-lexical rules used to derive accent-specific transforms are displayed in the output. Users will have the option of accessing the open-source lexicon and tools either as a standalone application or via the Portal da Língua Portuguesa online knowledge base. The 'Portal' module will be accessible as part of the page view for each lexical entry, wherein the user can select a desired accent to view the corresponding transcription for a given word. Online and offline users will also have access to a tool for inputting a fixed amount of text, selecting a desired accent, and generating multi-word transcribed output for that accent, while also having the option to show the rules where they apply. The complete software package will contain documentation in Portuguese and English and be subject to regular updates as improvements are made to the lexicon and tools, and new pronunciation models are introduced. While the aims of the current project will be achieved over a span of three years, ILTEC is committed to ensuring that LUPo continues to evolve as a software application and scholarly database.
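To make the workflow described above more concrete, the following is a minimal Python sketch of how an accent-independent entry might be transformed into accent-specific output, with the applied rules reported alongside the result. It is an illustration only and not the LUPo implementation: the lexicon entries, the abstract symbol `S`, the accent labels `accent_a` and `accent_b`, the rule names, and the functions `transcribe_word` and `transcribe_text` are all hypothetical, and the two toy rules are heavily simplified stand-ins for real post-lexical rules.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One post-lexical rewrite rule (hypothetical format, not LUPo's)."""
    name: str
    pattern: str       # regular expression over the master form
    replacement: str   # accent-specific realization


# Toy accent-independent master lexicon. "S" is an invented abstract
# symbol whose realization is left to each accent's rules.
MASTER_LEXICON = {
    "tia": "tia",
    "mas": "maS",
}

# Toy rule sets, applied in order. Accent labels are placeholders.
ACCENT_RULES = {
    "accent_a": [
        Rule("S-to-esh", r"S", "ʃ"),
    ],
    "accent_b": [
        Rule("t-palatalization", r"t(?=i)", "tʃ"),
        Rule("S-to-s", r"S", "s"),
    ],
}


def transcribe_word(word, accent):
    """Return (transcription, names of rules that actually applied)."""
    form = MASTER_LEXICON[word.lower()]
    applied = []
    for rule in ACCENT_RULES[accent]:
        form, count = re.subn(rule.pattern, rule.replacement, form)
        if count:
            applied.append(rule.name)
    return form, applied


def transcribe_text(text, accent):
    """Transcribe whitespace-separated words one by one."""
    return [transcribe_word(word, accent) for word in text.split()]


if __name__ == "__main__":
    for accent in ACCENT_RULES:
        for form, applied in transcribe_text("tia mas", accent):
            print(accent, form, applied)
```

Keeping a single master form per word and deriving each accent through an ordered rule list is what allows the rules to be displayed with the output: the list of applied rule names is simply collected as the derivation proceeds. A real system would additionally need to handle words missing from the lexicon, rule interactions across word boundaries, and morphological information, none of which this sketch attempts.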
In keeping with these development initiatives, our objectives for the project are as follows: