Spanish SpeechDat Car

| Items | Corpus contents | Item group |
|---|---|---|
| 2 | voice activation keywords | |
| 1 | sequence of 10 isolated digits | |
| 1 | sheet number (4+ digits) | 7 connected digits |
| 1 | spontaneous telephone number (9-11 digits) | 7 connected digits |
| 3 | read telephone numbers | 7 connected digits |
| 1 | credit card number (16 digits) | 7 connected digits |
| 1 | PIN code (6 digits) | 7 connected digits |
| 1 | spontaneous date, e.g. birthday | 3 dates |
| 1 | prompted date, word style | 3 dates |
| 1 | relative and general date expression | 3 dates |
| 2 | word spotting phrases using an application word (embedded) | |
| 4 | isolated digits | |
| 1 | spontaneous, e.g. own forename | 7 spelled words (letter sequences) |
| 1 | spelling of directory city name | 7 spelled words (letter sequences) |
| 4 | real word/name | 7 spelled words (letter sequences) |
| 1 | artificial name for coverage | 7 spelled words (letter sequences) |
| 1 | money amount | |
| 1 | natural number | |
| 1 | spontaneous, e.g. own forename | 7 directory assistance names |
| 1 | city / growing up (spontaneous) | 7 directory assistance names |
| 2 | most frequent cities | 7 directory assistance names |
| 2 | most frequent company/agency | 7 directory assistance names |
| 1 | forename/surname | 7 directory assistance names |
| 9 | phonetically rich sentences | |
| 1 | time of day (spontaneous) | 2 time phrases |
| 1 | time phrase (word style) | 2 time phrases |
| 4 | phonetically rich words | |
| 13 | mobile phone application words | 67 application words |
| 22 | IVR function keywords | 67 application words |
| 32 | car product keywords | 67 application words |
| 2 | additional language-dependent keywords | |
| 10 | prompts for spontaneous speech | |

The database contains 300 speakers, each recorded in two sessions. Speakers are selected according to the following criteria:
- Balance across dialects. The map shows the four dialectal regions defined in this project. A minimum of 60 speakers (or 120 sessions) from each region is mandatory.
- Balance in sex.
- Balance in age. Three age groups (16-30, 31-45 and 46-60) must be equally represented in the database.

Four high-quality audio channels are recorded in the car on the mobile platform PltM. They are stored uncompressed as 16-bit samples at 16 kHz, with the four channels sequentially multiplexed into a single file of unsigned short values.
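A minimal sketch of how such a multiplexed file could be split into its four channels is shown below. It assumes sample-by-sample interleaving (channel 1, 2, 3, 4, then channel 1 again) and little-endian byte order; neither detail is stated on this page, so both should be checked against the database documentation. The function name `demultiplex` is illustrative only.

```python
import array
import sys

NUM_CHANNELS = 4         # from the text: four channels per PltM recording
SAMPLE_RATE_HZ = 16000   # from the text: 16 kHz sampling


def demultiplex(raw: bytes, num_channels: int = NUM_CHANNELS,
                little_endian: bool = True):
    """Split interleaved unsigned 16-bit samples into one array per channel.

    Assumptions (not confirmed by the source): samples are interleaved
    frame by frame, and the byte order is little-endian unless told otherwise.
    """
    frame_bytes = 2 * num_channels
    usable = len(raw) - len(raw) % frame_bytes   # ignore a trailing partial frame
    samples = array.array("H")                   # "H" = unsigned 16-bit integer
    samples.frombytes(raw[:usable])
    if (sys.byteorder == "little") != little_endian:
        samples.byteswap()                       # convert to native byte order
    # Channel k owns every num_channels-th sample, starting at offset k.
    return [samples[ch::num_channels] for ch in range(num_channels)]


if __name__ == "__main__":
    # Two frames of four channels, written as little-endian 16-bit values.
    demo = bytes([1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0, 7, 0, 8, 0])
    for index, channel in enumerate(demultiplex(demo)):
        print(index, list(channel), len(channel) / SAMPLE_RATE_HZ, "s")
```

The duration of each channel in seconds is its sample count divided by 16000.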
One telephone channel is recorded via a GSM mobile phone on the stationary ISDN speech server PltF. These speech files are stored as sequences of 8-bit, 8 kHz A-law uncompressed speech samples (CCITT G.711 recommendation).
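For readers who need linear PCM from the PltF files, the following sketch expands 8-bit A-law codes to signed 16-bit values using the standard G.711 segment decoding. It assumes the files contain raw A-law bytes with no header, which this page does not confirm; the function names are illustrative.

```python
def alaw_to_linear(code: int) -> int:
    """Expand one 8-bit G.711 A-law code to a signed 16-bit linear sample."""
    code ^= 0x55                      # undo the even-bit inversion used by A-law
    mantissa = (code & 0x0F) << 4     # 4-bit quantisation step within the segment
    segment = (code & 0x70) >> 4      # 3-bit segment (chord) number
    if segment == 0:
        magnitude = mantissa + 8
    else:
        magnitude = (mantissa + 0x108) << (segment - 1)
    # In A-law, a set sign bit means a positive sample.
    return magnitude if code & 0x80 else -magnitude


# Precompute all 256 values once so decoding a file is a table lookup.
ALAW_TABLE = [alaw_to_linear(code) for code in range(256)]


def decode_alaw(data: bytes) -> list[int]:
    """Decode a raw A-law byte string (assumed headerless) to linear samples."""
    return [ALAW_TABLE[byte] for byte in data]


if __name__ == "__main__":
    print(decode_alaw(bytes([0xD5, 0x55, 0xAA, 0x2A])))
```

At 8 kHz, one second of speech corresponds to 8000 bytes of A-law data.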
Each prompted utterance is stored in a separate file. Each speech file has an accompanying ASCII SAM label file.
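As a rough illustration, a label file of this kind can be read as a list of `KEY: value` lines. The sketch below assumes that each line starts with a short mnemonic followed by a colon, which is how SAM label files are usually laid out; the sample text and its values are made up for the example and are not taken from the database.

```python
def parse_label_text(text: str) -> dict[str, list[str]]:
    """Collect 'KEY: value' lines into a dict of key -> list of values.

    A list is kept per key because a mnemonic may appear more than once.
    Lines without a colon are skipped.
    """
    fields: dict[str, list[str]] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")   # split on the first colon only
        if not sep or not key.strip():
            continue
        fields.setdefault(key.strip(), []).append(value.strip())
    return fields


if __name__ == "__main__":
    # Hypothetical label content, for illustration only.
    sample = "LB0: uno dos [sta] tres\nXYZ: 12:30\n"
    print(parse_label_text(sample))
```

Splitting on the first colon only keeps values that themselves contain colons, such as times, intact.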
Two types of recordings compose the database: first, wideband recordings (60-7000 Hz) for systems that are installed and operate in the car itself; second, narrowband recordings (300-3400 Hz) for systems that operate centrally outside the car and obtain their spoken input from the driver over the cellular telephone network. Two recording platforms were used:
- A mobile recording platform (PltM) installed inside the car, recording multi-channel speech utterances in a high-bandwidth mode (16 kHz sampling frequency).
- A fixed recording platform (PltF) located at the far-end fixed side of the GSM communication, simultaneously recording the speech utterances coming from the car (8 kHz sampling frequency, A-law encoding).
Multi-channel recordings are performed simultaneously in the car and through the GSM network. The recordings are made through an acoustic front-end (AFE) installed inside the car and connected to the recording platform PltM. Three kinds of AFE are used simultaneously during the recordings: a close-talking microphone, a remote noise-cancelling microphone with 3 hands-free microphones placed at different locations in the car, and a commercial hands-free car kit for GSM radiotelephones in cars. Synchronisation between PltM and PltF is based on DTMF tones emitted from the GSM terminal placed in the car. Data acquisition is performed by dedicated hardware in the PC, and the data are stored directly on hard disk. The recordings are always made on four channels (1 close-talk signal as reference and 3 far-talk signals). The positions of the far-talk microphones are:
- A: at the ceiling of the car near the A-pillar
- B: at the ceiling of the car in front of the speaker behind the sunvisor
- C: at the ceiling of the car over the mid-console (near the rear mirror)
The GSM phone is mounted at the ceiling of the car over the mid-console.
The fixed recording platform, located at the far-end fixed side of the GSM communication, simultaneously records the speech utterances coming from the car. A software package, ADA-K, was developed by UPC. The main characteristics of PltF are:
- Direct connection to an ISDN line
- Recording of speech
- DTMF detection (simultaneously with recording of speech)
- Full duplex operation (record while playing).
A synchronisation and communication protocol between the two platforms is used to:
- Detect whether PltF is still alive during the recordings (and recover from a hang-up);
- Synchronise the recordings on the two platforms;
- Separate the items into individual files.
The protocol comprises a series of beeps and DTMF codes transmitted by PltM to PltF. Each recorded item is preceded by a simultaneous beep on all recording channels, which allows rapid off-line synchronisation of the recordings on both platforms.
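The DTMF detection mentioned above is commonly done with the Goertzel algorithm, which measures signal power at each of the eight standard DTMF frequencies. The sketch below is a generic illustration of that technique, not the method used in ADA-K, which this page does not describe. It simply picks the strongest low-group and high-group frequency and does not reject frames that contain no tone at all.

```python
import math

LOW_FREQS = (697, 770, 852, 941)        # standard DTMF row frequencies (Hz)
HIGH_FREQS = (1209, 1336, 1477, 1633)   # standard DTMF column frequencies (Hz)
KEYPAD = ("123A", "456B", "789C", "*0#D")


def goertzel_power(samples, freq_hz, rate_hz):
    """Return the signal power at freq_hz using the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq_hz / rate_hz)
    s1 = s2 = 0.0
    for x in samples:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def detect_dtmf(samples, rate_hz=8000):
    """Return the keypad symbol whose two tones carry the most power."""
    row = max(range(4), key=lambda i: goertzel_power(samples, LOW_FREQS[i], rate_hz))
    col = max(range(4), key=lambda i: goertzel_power(samples, HIGH_FREQS[i], rate_hz))
    return KEYPAD[row][col]


def synth_dtmf(key, rate_hz=8000, duration_s=0.04):
    """Generate a clean test tone for one key (for demonstration only)."""
    row = next(i for i, keys in enumerate(KEYPAD) if key in keys)
    col = KEYPAD[row].index(key)
    count = int(rate_hz * duration_s)
    return [math.sin(2 * math.pi * LOW_FREQS[row] * n / rate_hz)
            + math.sin(2 * math.pi * HIGH_FREQS[col] * n / rate_hz)
            for n in range(count)]


if __name__ == "__main__":
    print(detect_dtmf(synth_dtmf("5")))
```

A practical detector would add an energy threshold and a minimum tone duration before accepting a symbol, since real recordings contain speech and car noise alongside the tones.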
Seven environment conditions are defined. Every environment is equally represented in the final database.
- car stopped, motor running
- car in town traffic
- car in town traffic, with noisy conditions
- car moving at a low speed with rough road conditions -> freeway, out-of-town roads
- car moving at a low speed with rough road conditions -> freeway, with noisy conditions
- car moving at a high speed with good road conditions (smooth asphalt) -> highway
- car moving at a high speed with good road conditions (smooth asphalt) -> highway, with audio equipment on
In addition, the following information was collected during the recordings:
- Weather conditions: rain, sunshine, wind
- Accessories used during recordings: windscreen wipers, ventilation, fan, radio
- Level of fan: off, low, medium, high
The transcription included in this database is an orthographic, lexical transcription with a few details that represent audible acoustic events (speech and non speech) present in the corresponding waveform files. The extra marks contained in the transcription aid in interpreting the text form of the utterance. Transcriptions were made in two passes: one pass in which words are transcribed, and a second pass in which the additional details are added.
Extra marks indicate mispronunciations, truncations, unintelligible words and extra noises. Symbols for extra noises are:
- [fil]: Filled pause. These sounds can be modelled well by a filled-pause model in speech recognisers. Examples of filled pauses: uh, um, er, ah, mm.
- [spk]: Speaker noise. All kinds of sounds and noises made by the calling speaker that are not part of the prompted text, e.g. lip smack, cough, grunt, throat clear, tongue click, loud breath, laugh, loud sigh.
- [sta]: Stationary noise. This category contains background noise that is not intermittent and has a more or less stable amplitude spectrum. Examples: car noise, road noise, channel noise, GSM noise, voice babble (cocktail-party noise), public place background noise, street noise.
- [int]: Intermittent noise. This category contains noises of an intermittent nature. These noises typically occur only once (like a door slam), or have pauses between them (like phone ringing), or change their colour over time (like music). Examples: music, background speech, baby crying, phone ringing, door slam, doorbell, paper rustle, cross talk.
- [dit]: DTMF tone. This is in fact a special case of [int], but since this sound can be expected in nearly every speech file, a special symbol was defined.
The Spanish database has been transcribed using the software tool UPCRevBD.v1, developed at UPC. Only signals from the close-talk microphone are transcribed.
The lexicon is included in a file. The lexicon file is an alphabetically ordered list of the distinct lexical items (essentially words in our case) that occur in the corpus, with the corresponding pronunciation information. Each distinct word has a separate entry. As the lexicon is derived from the corpus, it uses the same alphabetic encoding for special and accented characters as the transcriptions (ISO-8859). The lexicon includes a frequency count for each entry, e.g. to help indicate rare words whose transcriptions are perhaps less important or reliable.
The pronunciation lexicon was produced after the transcription phase. It contains, alphabetically sorted, all words found in the "LB0:" transcription (one entry for each word), their number of occurrences and the list of their phonemic representations. The words appear in the lexicon exactly as they appear in the transcription. The lexicon is case insensitive.
All the component words have been identified and alphabetically sorted; all fragments, mispronunciations and non-speech events have been removed, and only one occurrence of each word has been selected.
A software tool developed at UPC (SAGA: Spanish Automatic Graphemes to Allophones Transcriber) has been used to convert the transcribed words to phonemic strings in SAMPA phonemic notation.
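The word-list part of this procedure can be sketched as follows: remove the five noise symbols, fold case, count occurrences and sort. This is only an illustration, not the UPC tooling. It does not handle the marks for fragments and mispronunciations, because their syntax is not given on this page, and it produces no phonemic strings, which in the database come from SAGA. Sorting here uses plain code-point order, which may differ from Spanish alphabetical order for accented letters.

```python
import re
from collections import Counter

# The five noise symbols listed in the transcription conventions.
NOISE_MARK = re.compile(r"\[(?:fil|spk|sta|int|dit)\]")


def word_counts(transcriptions):
    """Count case-folded words across transcription lines, ignoring noise marks."""
    counts = Counter()
    for line in transcriptions:
        text = NOISE_MARK.sub(" ", line)
        counts.update(word.lower() for word in text.split())
    return counts


def lexicon_entries(transcriptions):
    """Return (word, frequency) pairs, one per distinct word, sorted by word."""
    return sorted(word_counts(transcriptions).items())


if __name__ == "__main__":
    # Made-up transcription lines, for illustration only.
    lines = ["uno [sta] dos", "Dos tres [fil]"]
    for word, freq in lexicon_entries(lines):
        print(word, freq)
```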
- This multiplexed speech file was recorded in the car.
- This speech file was recorded in the fixed platform.
- Accompanying ASCII SAM label file to the speech file recorded in the car.
- Accompanying ASCII SAM label file to the speech file recorded in the fixed platform.
You can find them here.
This database will be commercially available.
Information: asuncion@gps.tsc.upc.es
This page was last updated on August 19th, 2000