Geoffrey Stewart
Morrison

BSc     MTS     MA     PhD     FCSFS










Michael J Saks, Regents Professor, Sandra Day O’Connor College of Law and Department of Psychology, Arizona State University:

  • Your paradigm article is an excellent piece. It should be a guiding light for quite some time (for anyone interested in being guided). You see both the promise and the challenges more clearly and completely than most of us.

William C Thompson, Professor Emeritus, School of Law, and Department of Criminology, Law & Society, University of California Irvine:

  • Morrison is one of the leading thinkers in the world about problems of forensic inference. Few have his ability to understand and explain forensic statistics.




Other pages:
Forensic Evaluation Ltd
Forensic Data Science Laboratory
Forensic Voice Comparison portal
David Kaye’s blog on forensic science, statistics, and the law





Current & Past Appointments





News


• August 2023

InterForensics: Conferência Internacional de Ciências Forenses

28–31 August 2023

Brasília, Brazil

Keynote presentation:

Advancing a paradigm shift in evaluation of forensic evidence: The rise of forensic data science

In forensic science, the process of evaluation of strength of evidence consists of analysis (i.e., extraction of information from items of interest) and interpretation (i.e., drawing inferences with respect to the meaning of the information extracted by the analysis). Currently, across the majority of branches of forensic science, widespread practice is that analysis is conducted using human perception, and interpretation is conducted using subjective judgement. Even in branches of forensic science in which analysis is conducted using instrumental measurement, interpretation is commonly based on subjective judgement, e.g., by eyeballing graphical representations of the measured values. Human-perception-based analysis and subjective-judgement-based interpretation methods are non-transparent and susceptible to cognitive bias. Across the majority of branches of forensic science, even branches of forensic science in which interpretation is conducted using statistical models, the framework for interpretation of evidence is often logically flawed, and forensic-evaluation systems (the end-to-end combination of analysis and interpretation methods) are often not empirically validated or not adequately empirically validated. A paradigm shift in evaluation of forensic evidence is ongoing in which methods based on human perception and human judgement are being replaced by methods based on relevant data, quantitative measurements, and statistical models; methods that: 1. are transparent and reproducible; 2. are intrinsically resistant to cognitive bias; 3. use the logically correct framework for interpretation of evidence (the likelihood-ratio framework); and 4. are empirically validated under casework conditions. This presentation describes these elements of the new paradigm, impediments to the paradigm shift, and the presenter’s strategy for facilitating and thus advancing the paradigm shift. The presenter argues that this is truly a Kuhnian paradigm shift in the sense that it requires rejection of existing methods and the ways of thinking that underpin them, and rejection of the idea that progress can be made by incremental improvements to existing methods. Instead, it requires the wholesale adoption of an entire constellation of new methods and new ways of thinking.

Slides

Invited presentation:

Speaker identification by lay listeners compared to expert forensic voice comparison based on automatic-speaker-recognition technology

Morrison G.S., Bali A.S., Basu N., Edmond G., Martire K.A., Rosas-Aguilar C., Weber P.

Expert testimony is only admissible if it will assist the trier of fact to make a decision that the trier of fact would not be able to make unaided. This presentation reports on research that addresses the question of whether speaker identification by a judge listening alone or by a jury listening as a collaborative group would be more or less accurate than the output of an expertly operated forensic-voice-comparison system that is based on state-of-the-art automatic-speaker-recognition technology. Individuals listening alone and groups of collaborating listeners made judgements on pairs of recordings, each of which was either a same-speaker pair or a different-speaker pair. The recordings were of adult male Australian English speakers, and reflected the conditions of the recordings in an actual forensic case. Individual listeners came from Australia, from Canada and the United States, and from Chile, Mexico, and Spain, and differed in their degree of familiarity with the language and accent spoken. Collaborating groups of listeners consisted only of listeners from Australia, who were familiar with the language and accent spoken. Listeners’ responses were compared with likelihood ratios output by a forensic-voice-comparison system that is based on state-of-the-art automatic-speaker-recognition technology. The presentation will include reporting of results. Further information about this research is available at https://forensic-voice-comparison.net/speaker-recognition-by-humans/

Slides



• June 2023

International Conference on Forensic Inference and Statistics (ICFIS)

12–15 June 2023

Lund, Sweden

My presentations:

A bi-Gaussian method for calibration of likelihood ratios

The following is a perfectly-calibrated forensic-evaluation system: A system that outputs natural-log likelihood ratios, ln(LR), for which the distribution of ln(LR) in response to different-source inputs and the distribution of ln(LR) in response to same-source inputs are both Gaussian, the two distributions have the same variance, σ², and the means of the different-source and same-source distributions are −σ²/2 and +σ²/2 respectively (hereinafter, we refer to this as a “perfectly-calibrated bi-Gaussian system”). Given such a system, for any LR value, the probability density of the same-source distribution evaluated at the corresponding ln(LR) value divided by the probability density of the different-source distribution evaluated at the corresponding ln(LR) value will equal that LR value.

A perfectly-calibrated bi-Gaussian system for which σ² is larger will have better performance than a perfectly-calibrated bi-Gaussian system for which σ² is smaller. Performance can be measured using the log-likelihood-ratio cost (Cllr). Different empirical sets of system outputs could map to the same Cllr value (a many-to-one mapping), but there is a bidirectional one-to-one mapping between the σ² value of a perfectly-calibrated bi-Gaussian system and its Cllr value.

The proposed bi-Gaussian calibration method consists of the following steps:

1. Using a forensic-evaluation system, calculate uncalibrated LRs for a set of input pairs for which the different-source or same-source status of each pair is known.

2. Calibrate the output of Step 1 using a traditional monotonic calibration method, e.g., logistic regression.

3. Calculate the Cllr value for the output of Step 2.

4. Determine the σ² value for the perfectly-calibrated bi-Gaussian system with the same Cllr value as calculated in Step 3.

5. Ignoring same-source and different-source labels, determine the mapping function from the empirical cumulative distribution of the output of Step 1 to the cumulative distribution of a two-Gaussian mixture corresponding to the perfectly-calibrated bi-Gaussian system with the σ² value determined in Step 4.

6. Using the mapping function from Step 5, map the output of Step 1 to calibrated LRs.

The uncalibrated LR value corresponding to an input pair for which the same-source versus different-source status is in question can be inserted at Step 5, and the calibrated LR value obtained from Step 6.

The presentation will explore the behaviour of the bi-Gaussian method compared to two other calibration methods, logistic regression and pool-adjacent-violators (PAV), and will highlight some potential advantages and limitations.
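A minimal Python sketch of Steps 4–6 follows (my own illustration, not the presenter’s implementation; the function names, integration limits, and search bracket are assumptions):

    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import quad
    from scipy.optimize import brentq

    def cllr_perfect(sigma2):
        # Cllr of a perfectly-calibrated bi-Gaussian system; by symmetry the
        # same-source and different-source terms are equal, so one suffices.
        mu, sigma = sigma2 / 2, np.sqrt(sigma2)
        f = lambda x: np.logaddexp(0.0, -x) / np.log(2) * norm.pdf(x, mu, sigma)
        return quad(f, mu - 40 * sigma, mu + 40 * sigma)[0]

    def sigma2_for_cllr(cllr_value):
        # Step 4: numerically invert the one-to-one sigma^2 <-> Cllr mapping
        # (valid for 0 < Cllr < 1)
        return brentq(lambda s2: cllr_perfect(s2) - cllr_value, 1e-6, 200.0)

    def bi_gaussian_calibrate(lnlr, sigma2):
        # Steps 5-6: map the pooled, unlabelled, uncalibrated ln(LR) values from
        # their empirical CDF onto the CDF of the equal-weight mixture of
        # N(-sigma^2/2, sigma^2) and N(+sigma^2/2, sigma^2), by interpolation.
        lnlr = np.asarray(lnlr, dtype=float)
        p = (np.argsort(np.argsort(lnlr)) + 0.5) / len(lnlr)
        sigma = np.sqrt(sigma2)
        grid = np.linspace(-sigma2 / 2 - 8 * sigma, sigma2 / 2 + 8 * sigma, 10001)
        mix_cdf = 0.5 * (norm.cdf(grid, -sigma2 / 2, sigma) + norm.cdf(grid, sigma2 / 2, sigma))
        return np.interp(p, mix_cdf, grid)  # calibrated ln(LR) values

As described above, the score for the comparison of the items of interest in a case can be pooled with the calibration scores before the mapping, so that it is transformed alongside them.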

Slides

Similarity-score-based likelihood ratios do not take account of typicality

There is confusion in the literature between two kinds of “scores”: those that are uncalibrated likelihood ratios, which take account of both the similarity between the items of questioned and known source and their typicality with respect to a sample of the relevant population, and those that take account of similarity only. The latter we will refer to as “similarity scores”. Uncalibrated likelihood ratios can be converted to calibrated likelihood ratios (this has been common practice in forensic voice comparison since at least as early as 2007). Similarity scores, however, cannot be converted to likelihood ratios that meaningfully address source-level forensic propositions because they do not take account of typicality with respect to a sample of the relevant population.

Using the same data, the presentation will demonstrate the application of a common-source likelihood-ratio model and a similarity-score-based likelihood-ratio model. It will demonstrate that, for equally similar pairs of items, the common-source model results in a lower likelihood-ratio value for a pair of items that is typical with respect to the relevant-population distribution, and results in a higher likelihood-ratio value for a pair of items that is atypical with respect to the relevant-population distribution. In contrast, the presentation will demonstrate that a similarity-score-based model results in the same likelihood-ratio value for both pairs of items. The similarity-score-based model overvalues the typical pair of items and undervalues the atypical pair of items.
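The contrast can be reproduced in a toy univariate-Gaussian version of the two models (my construction, with arbitrary parameter values; the presentation’s actual models may differ):

    import numpy as np
    from scipy.stats import norm, multivariate_normal

    mu, tau2, s2 = 0.0, 4.0, 0.25  # population mean, between-source and within-source variances

    def common_source_lr(x1, x2):
        # Same-source: x1 and x2 share one source mean m ~ N(mu, tau2), so they
        # are jointly Gaussian with between-item covariance tau2.
        # Different-source: independent draws from the marginal N(mu, tau2 + s2).
        num = multivariate_normal.pdf([x1, x2], mean=[mu, mu],
                                      cov=[[tau2 + s2, tau2], [tau2, tau2 + s2]])
        den = norm.pdf(x1, mu, np.sqrt(tau2 + s2)) * norm.pdf(x2, mu, np.sqrt(tau2 + s2))
        return num / den

    def similarity_score_lr(x1, x2):
        # Based only on the difference between the items, ignoring where they
        # sit relative to the population distribution.
        d = x1 - x2
        return norm.pdf(d, 0, np.sqrt(2 * s2)) / norm.pdf(d, 0, np.sqrt(2 * (tau2 + s2)))

    # Two equally similar pairs: one typical (near mu), one atypical (far from mu)
    print(common_source_lr(0.1, -0.1), common_source_lr(4.1, 3.9))        # approx 2.9 vs 17.7
    print(similarity_score_lr(0.1, -0.1), similarity_score_lr(4.1, 3.9))  # identical, approx 4.0

The common-source LR is larger for the atypical pair because two independent draws from the population are unlikely to both be extreme in the same direction; the similarity-score LR cannot make that distinction.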

Slides

What a future forensic-data-science model for fingermark-fingerprint comparison might look like

Morrison (2022, https://doi.org/10.1016/j.fsisyn.2022.100270) described a paradigm shift in evaluation of forensic evidence in which methods based on human perception and subjective judgement are replaced by methods based on relevant data, quantitative measurements, and statistical models / machine-learning techniques. The new paradigm methods are transparent and reproducible, intrinsically resistant to cognitive bias, use the logically correct framework for evaluation of evidence, and are calibrated and validated under casework conditions. The presenter will discuss what a future forensic-data-science model for fingermark-fingerprint comparison might look like, how it might be used by practitioners, and some of the challenges in getting there.

Slides



• May–June 2023

– Meeting of the Forensic Science Committee of the International Organization for Standardization (ISO)

29 May – 2 June 2023

Birmingham, United Kingdom

I hosted this meeting at Aston University



• March 2023

Invited presentation

27 March 2023

Forensic Capability Network Evaluative Opinion for Fingerprints Workshop

[online presentation]

Evaluative opinions for fingermark-fingerprint comparison

Slides


• November 2022

Automatic speaker recognition technology outperforms human listeners in the courtroom

Press release



• November 2022

Chartered Society of Forensic Science Conference

4 November 2022

I gave a presentation:

Speaker identification by listeners compared to expert forensic voice comparison based on state-of-the-art automatic-speaker-recognition technology

Slides

My co-author Dr Nabanita Basu gave a presentation:

Feature-based calculation of likelihood ratios for forensic comparison of fired cartridge cases



• October 2022

Introduction to the likelihood ratio framework for evaluation of forensic evidence

Concepts of likelihood-ratio calculation + Calibration and validation of likelihood-ratio systems

I presented two 1-day Continuing Professional Development (CPD) workshops

Forensic Data Science Laboratory, Aston University, Birmingham, UK

25 & 26 October 2022



• October 2022

ASCLD Forensic Research Committee Lightning Talk

As part of a session on The Future of Forensic Analysis, Interpretation, and Reporting

I presented on A paradigm shift in forensic analysis, interpretation, and reporting: The rise of forensic data science

13 October 2022. Livestream at 13:00 EDT



• May–June 2022

European Academy of Forensic Science Conference

30 May – 3 June 2022

I gave a Workshop:

Calibration and validation of likelihood-ratio systems

1 June 2022, 09:30–11:40

Publications such as Forensic Science Regulator (2021) “Codes of Practice and Conduct: Development of evaluative opinions” and Morrison et al. (2021) “Consensus on validation of forensic voice comparison” have emphasized the importance of both calibrating and validating forensic-evaluation systems that output likelihood ratios. This workshop provides an introduction to both of these (related) topics. Participants will gain an understanding of how to conduct empirical calibration and empirical validation of likelihood-ratio systems, including: an understanding of the meaning of calibration and validation in relation to likelihood ratios; requirements for data used for calibration and validation; the use of statistical models (including logistic regression) to perform calibration; the calculation of the log-likelihood-ratio cost (Cllr) as a validation metric; and the use of Tippett plots to represent validation results and to support (or not) the likelihood-ratio value calculated for the comparison of the items of interest in the case. The workshop will focus on source-level comparison, but the principles can also be applied to other forensic-evaluation tasks. The workshop will focus on systems based on relevant data, quantitative measurements, and statistical models, but the principles can also be applied to systems based on human perception and subjective judgement.
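As a rough illustration of the validation metric covered in the workshop, here is a minimal Cllr function in Python (my sketch, not the workshop materials):

    import numpy as np

    def cllr(lr_same_source, lr_different_source):
        # Log-likelihood-ratio cost: penalizes same-source LRs below 1 and
        # different-source LRs above 1. 0 is perfect; values at or above 1
        # indicate an uninformative or misleading system.
        lr_ss = np.asarray(lr_same_source, dtype=float)
        lr_ds = np.asarray(lr_different_source, dtype=float)
        return 0.5 * (np.mean(np.log2(1 + 1 / lr_ss)) + np.mean(np.log2(1 + lr_ds)))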

Slides

I gave a Keynote Presentation:

Advancing a paradigm shift in evaluation of forensic evidence: The rise of forensic data science

1 June 2022, 13:35–14:05

Currently, across the majority of branches of forensic science, widespread practice is that analysis and interpretation are conducted using human perception and subjective judgement. Human-perception-based analysis and subjective-judgement-based interpretation methods are non-transparent and are susceptible to cognitive bias. Across the majority of branches of forensic science, the framework for interpretation of evidence is often logically flawed, and forensic-evaluation systems (the end-to-end combination of analysis and interpretation methods) are often not empirically validated or not adequately empirically validated. A paradigm shift in evaluation of forensic evidence is ongoing in which methods based on human perception and human judgement are being replaced by methods based on relevant data, quantitative measurements, and statistical models; methods that: 1. are transparent and reproducible; 2. are intrinsically resistant to cognitive bias; 3. use the logically correct framework for interpretation of evidence (the likelihood-ratio framework); and 4. are empirically validated under casework conditions. This presentation describes these elements of the new paradigm, impediments to the paradigm shift, and the presenter’s strategy for facilitating and thus advancing the paradigm shift. The presenter argues that this is truly a Kuhnian paradigm shift in the sense that it requires rejection of existing methods and the ways of thinking that underpin them, and rejection of the idea that progress can be made by incremental improvements to existing methods. Instead, it requires the wholesale adoption of an entire constellation of new methods and new ways of thinking.

Slides

My coauthors gave presentations:

Basu N., Bolton-King R.S., Morrison G.S.: Feature-based calculation of likelihood ratios for forensic comparison of fired cartridge cases

1 June 2022, 12:00–12:20

Morrison’s keynote presentation described a paradigm shift in evaluation of forensic evidence in which methods based on human perception and subjective judgement are replaced by methods based on relevant data, quantitative measurements, and statistical models that calculate likelihood ratios and that are empirically validated under casework conditions. In this presentation we provide an example of the application of the new paradigm to forensic comparison of fired cartridge cases, a common task in forensic firearm examination, a branch of forensic science in which the new paradigm has so far made almost no progress. We describe the building of a database of 3D digital images of the bases of fired cartridge cases, and the development and validation of feature-extraction techniques and feature-based statistical models for calculating likelihood ratios. We focus particularly on the problem of feature extraction, i.e., which features is it best to extract, and from which region of the base of a fired cartridge case is it best to extract them? The features are automatically extracted from the firing-pin impression, the breech-face region, and the whole region of interest (including flowback), using various functional-data-analysis techniques including central moments, Legendre moments, and Zernike moments. The statistical modelling process makes use of methods commonly used in human-supervised-automatic approaches to forensic voice comparison, and the validation makes use of protocols, metrics, and graphics commonly used in human-supervised-automatic approaches to forensic voice comparison.
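For a flavour of the moment-based features, the sketch below (illustrative only; the study’s actual functional-data-analysis pipeline is more elaborate, and the function name is mine) computes 2D central moments of a depth map of a cartridge-case base region:

    import numpy as np

    def central_moments(depth, max_order=3):
        # depth: 2D array of (non-negative) surface heights for the region of
        # interest. Returns central moments mu_pq for 2 <= p + q <= max_order,
        # i.e., shape descriptors invariant to translation of the region.
        h, w = depth.shape
        y, x = np.mgrid[0:h, 0:w].astype(float)
        total = depth.sum()
        xbar, ybar = (x * depth).sum() / total, (y * depth).sum() / total
        return {(p, q): ((x - xbar) ** p * (y - ybar) ** q * depth).sum()
                for p in range(max_order + 1) for q in range(max_order + 1)
                if 2 <= p + q <= max_order}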

Slides

Weber P., Enzinger E., Labrador B., Lozano-Díez A., Ramos D., González-Rodríguez J., Morrison G.S.: Validation of the alpha version of the E3 forensic speech science system (E3FS3) core software tools

1 June 2022, 15:55–16:15

The E3 Forensic Speech Science System (E3FS3) is being developed in collaboration with multiple research and operational forensic laboratories. It is designed for conducting forensic-voice-comparison research and casework. When complete, E3FS3 will include open-code software tools, data-collection protocols, databases, standards and guidelines, standard operating procedures, a library of validation reports, and training for practitioners. The core software tools are based on state-of-the-art automatic-speaker-recognition technology, and are accompanied by detailed documentation explaining which algorithms were implemented and why they were chosen. For maximum transparency, the software is written in Python (a popular free high-level programming language) and the code is extensively commented.  As each component of E3FS3 reaches the stage at which it is suitable for general release, it will be made available via http://E3FS3.forensic-voice-comparison.net/

This presentation describes the design principles for E3FS3, and reports on validations of an alpha version of the core software tools. The validations include a benchmark validation using the forensic_eval_01 data that has previously been used to assess the performance of multiple other forensic-voice-comparison systems. They also include validations conducted under the conditions of the first case for which E3FS3 was used. These include multiple recording conditions, including multiple different questioned-speaker recording durations. Validations were conducted and reported in accordance with the “Consensus on validation of forensic voice comparison”.



• December 2021

– Terrance M Nearey, my former PhD supervisor, passed away on 18 December 2021. My condolences to his family and friends. Terry taught me to be a scientist.



• December 2021

2nd AFORE Webinar on The Validation of Analytical Methods in Forensic Science

1–2 December 2021

Programme

My presentation:

Consensus on validation of forensic-comparison systems in the context of casework

Over a series of rounds of drafting and meetings in 2019–2020, a group of authors developed a consensus on validation of forensic voice comparison. Group members included individuals who had knowledge and experience of validating forensic-voice-comparison systems in research and/or casework contexts, and individuals who had actually presented validation results to courts. They also included individuals who could bring a legal perspective on these matters, and individuals with knowledge and experience of validation in forensic science more broadly. Although the scope was forensic voice comparison, with minor wording changes the resulting statement of consensus would be applicable to validating source-comparison systems in any branch of forensic science. The scope was validation for the purpose of demonstrating whether, in the context of specific cases, forensic-comparison systems that output likelihood ratios are (or are not) good enough for their output to be used in court. In this presentation, I provide an overview of the statement of consensus and underlying concepts. I also discuss my reflections on broader issues related to validation and standards/guidelines.

Slides and video of the presentation



• October 2021

– noteworthy paper published 30 October 2021:

Brook C., Lynøe N., Eriksson A., Balding D. (2021). Retraction of a peer reviewed article suggests ongoing problems within Australian forensic science. Forensic Science International: Synergy. https://doi.org/10.1016/j.fsisyn.2021.100208


• October 2021

A method for calculating the strength of evidence associated with an earwitness’s claimed recognition of a familiar speaker

Presentation (given in Spanish): Congreso de Ciencia Forense, Universidad Autónoma de México

Presenter: Claudia Rosas

8 October 2021

Previous research on speaker recognition by earwitnesses has focused on factors that affect the accuracy of earwitnesses in general. The results of that research have allowed expert witnesses to make generalizations about whether the conditions of a case are more or less likely to lead to correct or incorrect identifications / recognitions. They have also provided guidance on how to design speaker lineups so as to increase accuracy and reduce bias. Previous research has not, however, provided an answer to the key evidential question in a particular case: What is the strength of evidence associated with this particular earwitness’s identification / recognition of this particular speaker under the particular conditions of this case?

This presentation describes and demonstrates a method for assessing the strength of evidence when a witness claims to recognize a voice as the voice of a speaker who is familiar to them. The demonstration is based on the conditions of a real case in which a witness claimed to recognize a voice in a recording (the voice of an offender) as the voice of a person known to him (a suspect). The victim, who was in the boot of a car, made a call to the emergency services using a mobile telephone. The call was recorded at the call centre. The offender’s voice was in the background of the recording (the offender was apparently sitting in the front seat of the car). The part of the recording in which the offender’s voice can be heard lasted approximately three seconds.

The method calculates a Bayes factor that answers the question: What is the probability that a cooperative earwitness would claim to recognize the offender as the suspect if the offender was the suspect, versus what is the probability that the earwitness would claim to recognize the offender as the suspect if the offender was not the suspect but some other speaker from the relevant population? The data for the demonstration were the responses of naïve listeners to recordings of speakers who were familiar to the listeners. The speakers were recorded under conditions reflecting those of the case. The recordings were presented to the listeners in a speaker lineup. In response to each recording, if the listener claimed to recognize the speaker they were asked to write the name of the speaker; otherwise they were to state that they did not recognize the speaker. The Bayes factors were calculated using these data and beta-binomial distributions with Jeffreys priors.
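A minimal sketch of the Bayes-factor calculation in Python (my reading of the method as described above; the paper’s actual models and data handling may differ):

    def beta_binomial_bayes_factor(k_same, n_same, k_other, n_other):
        # k_same / n_same: "recognized as the suspect" claims out of lineup trials
        # in which the speaker was the familiar speaker; k_other / n_other: such
        # claims in trials in which the speaker was another speaker from the
        # relevant population. With a Jeffreys Beta(0.5, 0.5) prior, the
        # posterior-predictive probability of a claim is (k + 0.5) / (n + 1).
        p_same = (k_same + 0.5) / (n_same + 1.0)
        p_other = (k_other + 0.5) / (n_other + 1.0)
        return p_same / p_other

    # Made-up counts: 14 of 20 listeners claimed recognition when the speaker was
    # the familiar speaker, 2 of 40 when it was another speaker:
    # beta_binomial_bayes_factor(14, 20, 2, 40) -> approximately 11.3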

Video

Slides


• June 2021

Symposium on calibration in forensic science

Forensic Data Science Laboratory, Aston Institute for Forensic Linguistics

Date & Time: 3 June 2021, 12:00–15:00 UTC

Location: Online

Presenters: Geoffrey Stewart Morrison, Luciana Ferrer, Daniel Ramos, Peter Vergeer

This symposium brings together some of the leading researchers in the calibration of the likelihood-ratio output of automatic-speaker-recognition systems and of forensic-evaluation systems. They explain what calibration is and why it is important. They present algorithms used for calibrating likelihood-ratio systems, and metrics used for assessing the degree of calibration of likelihood-ratio systems. They discuss aspects of calibration on which there is consensus, aspects on which there is disagreement, and aspects requiring additional research. They also discuss how to encourage wider adoption of calibration of likelihood-ratio systems in forensic practice.

Slides and video recordings of the event



• 2019–2023

My colleagues from Aston University’s Centre for Forensic Linguistics and I were successful in bidding for research funding from the UK Research and Innovation (UKRI) Research England Expanding Excellence in England Fund (E3).
We won over GBP 5.4 M in E3 funding. With additional contributions from Aston University strategic funding, the total comes to over GBP 6 M (~USD 7.75 M).




Publications:

Forensic Science


Bi-Gaussianized calibration of likelihood ratios
Morrison G.S. (2024).
Law, Probability and Risk, 23, mgae004.
https://doi.org/10.1093/lpr/mgae004

software and data at: https://forensic-data-science.net/calibration-and-validation/#biGauss

  • For a perfectly calibrated forensic evaluation system, the likelihood ratio of the likelihood ratio is the likelihood ratio. Conversion of uncalibrated log likelihood ratios (scores) to calibrated log likelihood ratios is often performed using logistic regression. The results, however, may be far from perfectly calibrated. We propose and demonstrate a new calibration method, “bi-Gaussianized calibration”, that warps scores toward perfectly calibrated log-likelihood-ratio distributions. Using both synthetic and real data, we demonstrate that bi-Gaussianized calibration leads to better calibration than does logistic regression, that it is robust to score distributions that violate the assumption of two Gaussians with the same variance, and that it is competitive with logistic-regression calibration in terms of performance measured using log-likelihood-ratio cost (Cllr). We also demonstrate advantages of bi-Gaussianized calibration over calibration using pool-adjacent violators (PAV). Based on bi-Gaussianized calibration, we also propose a graphical representation that may help explain the meaning of likelihood ratios to triers of fact.


Speaker identification in courtroom contexts – Part II:
Investigation of bias in individual listeners’ responses
Basu N., Weber P., Bali A.S., Rosas-Aguilar C., Edmond G., Martire K.A., Morrison G.S. (2023). Forensic Science International, 349, 111768.

https://doi.org/10.1016/j.forsciint.2023.111768

  • In “Speaker identification in courtroom contexts – Part I” individual listeners made speaker-identification judgements on pairs of recordings which reflected the conditions of the questioned-speaker and known-speaker recordings in a real case. The recording conditions were poor, and there was a mismatch between the questioned-speaker condition and the known-speaker condition. No contextual information that could potentially bias listeners’ responses was included in the experiment condition – it was decontextualized with respect to case circumstances and with respect to other evidence that could be presented in the context of a case. Listeners’ responses exhibited a bias in favour of the different-speaker hypothesis. It was hypothesized that the bias was due to the poor and mismatched recording conditions. The present research compares speaker-identification performance between: (1) listeners under the original Part I experiment condition, (2) listeners who were informed ahead of time that the recording conditions would make the recordings sound more different from one another than had they both been high-quality recordings, and (3) listeners who were presented with high-quality versions of the recordings. Under all experiment conditions, there was a substantial bias in favour of the different-speaker hypothesis. The bias in favour of the different-speaker hypothesis therefore appears not to be due to the poor and mismatched recording conditions.

  • See also:


A single test pair does not a method validation make: A response to Kirchhübel et al. (2023)
Morrison G.S. (2023). Science & Justice, 63, 327–329. https://doi.org/10.1016/j.scijus.2023.03.001

preprint

  • In terms of development of methods that are transparent and reproducible, that are intrinsically resistant to cognitive bias, that use the logically correct framework for interpretation of evidence (the likelihood-ratio framework), and that are empirically validated under casework conditions, forensic voice comparison may be one of the most progressive branches of forensic science. Forensic voice comparison is not, however, a monolithic branch of forensic science. The aforementioned progress has been made in the context of human-supervised-automatic methods. Unfortunately, many or most forensic-voice-comparison practitioners, who use auditory-acoustic-phonetic methods, have not joined in this progress. There are calls going back to the late 1960s for forensic-voice-comparison methods to be meaningfully validated under casework conditions, but many or most forensic-voice-comparison practitioners have still not heeded those calls. A recent example appears in Kirchhübel et al. (2023) https://doi.org/10.1016/j.scijus.2023.01.004, which proposes a validation protocol in which the number of test pairs is only one. This does not constitute meaningful validation. If the cost and time necessary to conduct meaningful validations of auditory-acoustic-phonetic methods are such that it is not practical to conduct meaningful validations, then forensic voice comparison should not be performed using auditory-acoustic-phonetic methods.


Forensic voice comparison: Overview
Morrison G.S., Zhang C. (2023). In Houck M., Wilson L., Lewis S., Eldridge H., Lothridge K., Reedy P. (Eds.), Encyclopedia of Forensic Sciences (3rd Ed.), vol. 2, pp. 737–750. Elsevier.
https://doi.org/10.1016/B978-0-12-823677-2.00130-6

Preprint at http://forensic-voice-comparison.net/encyclopedia/

  • In forensic voice comparison, a forensic practitioner analyzes a recording of a speaker of questioned identity and one or more recordings of a speaker of known identity, and compares the analytical results in order to draw an inference that will assist a legal-decision maker to decide whether the recordings are of the same speaker or of different speakers. This entry provides an overview of analytical approaches (including auditory, spectrographic, acoustic-phonetic, and automatic) and interpretive frameworks (including the likelihood-ratio framework) that have been used in forensic voice comparison. It also briefly discusses legal admissibility and validation of forensic voice comparison.


Forensic voice comparison: Human-supervised-automatic approach
Morrison G.S., Weber P., Enzinger E., Labrador B., Lozano-Díez A., Ramos D., González-Rodríguez J. (2023). In Houck M., Wilson L., Lewis S., Eldridge H., Lothridge K., Reedy P. (Eds.), Encyclopedia of Forensic Sciences (3rd Ed.), vol. 2, pp. 720–736. Elsevier.
https://doi.org/10.1016/B978-0-12-823677-2.00182-3

Preprint at http://forensic-voice-comparison.net/encyclopedia/

  • The current entry builds on the entry Forensic voice comparison – Overview. It describes in greater detail the human-supervised-automatic analytical approach to forensic voice comparison in conjunction with the likelihood-ratio interpretive framework. It discusses practitioner tasks, including adoption of relevant hypotheses for the case, assessment of the conditions of the questioned-speaker and known-speaker recordings in the case, and selection of data representing the relevant population and reflecting the conditions for the case. It also describes an example forensic-voice-comparison system based on state-of-the-art automatic-speaker-recognition technology, and validation of that system using a benchmark dataset reflecting the conditions of a real forensic case.


A method to convert traditional fingerprint ACE / ACE-V outputs (“identification”, “inconclusive”, “exclusion”) to Bayes factors
Morrison G.S. (2022). Manuscript submitted for publication.

preprint

Matlab code

  • Fingerprint examiners appear to be reluctant to adopt probabilistic reasoning, statistical models, and empirical validation. The rate of adoption of the likelihood-ratio framework by fingerprint practitioners appears to be near zero. A factor in the reluctance to adopt the likelihood-ratio framework may be a perception that it would require a radical change in practice. The present paper proposes a small step that would require minimal changes to current practice. It proposes and demonstrates a method to convert traditional fingerprint-examination outputs (“identification”, “inconclusive”, and “exclusion”) to well-calibrated Bayes factors. The method makes use of a beta-binomial model, and both uninformative and informative priors.
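A sketch of the conversion (my construction from the abstract, treating each reported category as a binomial outcome with the uninformative Jeffreys prior; the paper’s model, including its informative-prior variant, is richer):

    def category_bayes_factor(category, counts_same_source, counts_different_source):
        # counts_*: dicts mapping "identification" / "inconclusive" / "exclusion"
        # to response counts from validation trials with known ground truth.
        # Beta-binomial posterior-predictive probabilities with a Jeffreys
        # Beta(0.5, 0.5) prior.
        k_s, n_s = counts_same_source[category], sum(counts_same_source.values())
        k_d, n_d = counts_different_source[category], sum(counts_different_source.values())
        return ((k_s + 0.5) / (n_s + 1.0)) / ((k_d + 0.5) / (n_d + 1.0))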


Calculating likelihood ratios
Morrison G.S. (2022). Lecture notes.

document

  • These lecture notes describe calculation of specific-source, common-source, and similarity-score-based likelihood ratios using Gaussian-distribution models. They demonstrate that, since it does not take account of typicality with respect to the relevant population, the similarity-score approach does not result in appropriate likelihood-ratio values.


A response to Busey & Klutzke (2022): Regarding subjective assignment of likelihood ratios
Morrison G.S. (2022). Science & Justice, 63, 61–62.

https://doi.org/10.1016/j.scijus.2022.11.003

preprint

  • Busey & Klutzke (2022) states that “Morrison (2012) has argued that the likelihood ratio need not be quantitative but could be based on the expert’s subjective evaluation.” The statement appears to suggest that Morrison (2012) argued in favour of subjective assignment of likelihood-ratio values. This interpretation of Morrison (2012) is incorrect.


A plague on both your houses: The debate about how to deal with “inconclusive” conclusions when calculating error rates
Morrison G.S. (2022). Law, Probability and Risk.

https://doi.org/10.1093/lpr/mgac015

preprint

  • There is an ongoing debate about how to deal with “inconclusive” conclusions when calculating error rates. The appropriate solution to the problem is to abandon reporting of conclusions as “identification”, “inconclusive”, and “exclusion”, and to adopt the likelihood-ratio framework for interpretation and adopt the log-likelihood-ratio cost (Cllr) as a validation metric.


Speaker identification in courtroom contexts – Part I: Individual listeners compared to forensic voice comparison based on automatic-speaker-recognition technology
Basu N., Bali A.S., Weber P., Rosas-Aguilar C., Edmond G., Martire K.A., Morrison G.S. (2022). Forensic Science International, 341, 111499.

https://doi.org/10.1016/j.forsciint.2022.111499

  • Expert testimony is only admissible in common law if it will potentially assist the trier of fact to make a decision that they would not be able to make unaided. The present paper addresses the question of whether speaker identification by an individual lay listener (such as a judge) would be more or less accurate than the output of a forensic-voice-comparison system that is based on state-of-the-art automatic-speaker-recognition technology. Listeners listen to and make probabilistic judgements on pairs of recordings reflecting the conditions of the questioned- and known-speaker recordings in an actual case. Reflecting different courtroom contexts, listeners with different language backgrounds are tested: Some are familiar with the language and accent spoken, some are familiar with the language but less familiar with the accent, and others are less familiar with the language. Also reflecting different courtroom contexts: In one condition listeners make judgements based only on listening, and in another condition listeners make judgements based on both listening to the recordings and considering the likelihood-ratio values output by the forensic-voice-comparison system.

  • See also:


The opacity myth: A response to Swofford & Champod (2022)
Morrison G.S., Basu N., Enzinger E., Weber P. (2022). Forensic Science International: Synergy, 5, 100275.

https://doi.org/10.1016/j.fsisyn.2022.100275

  • Swofford & Champod (2022) FSI Synergy article 100220 reports the results of semi-structured interviews that asked interviewees their views on probabilistic evaluation of forensic evidence in general, and probabilistic evaluation of forensic evidence performed using computational algorithms in particular. The interview protocol included a leading question based on the premise that machine-learning methods used in forensic inference are not understandable even to those who develop those methods. We contend that this is a false premise.


Forensic comparison of fired cartridge cases: Feature-extraction methods for feature-based calculation of likelihood ratios
Basu N., Bolton-King R.S., Morrison G.S. (2022). Forensic Science International: Synergy, 5, 100272.

https://doi.org/10.1016/j.fsisyn.2022.100272

  • We describe and validate a feature-based system for calculation of likelihood ratios from 3D digital images of fired cartridge cases. The system includes a database of 3D digital images of the bases of 10 cartridges fired per firearm from approximately 300 firearms of the same class (semi-automatic pistols that fire 9 mm diameter centre-fire Luger-type ammunition, and that have hemispherical firing pins and parallel breech-face marks). The images were captured using Evofinder®, an imaging system that is commonly used by operational forensic laboratories. A key component of the research reported is the comparison of different feature-extraction methods. Feature sets compared include those previously proposed in the literature, plus Zernike-moment based features. Comparisons are also made of using feature sets extracted from the firing-pin impression, from the breech-face region, and from the whole region of interest (firing-pin impression + breech-face region + flowback if present). Likelihood ratios are calculated using a statistical modelling pipeline that is standard in forensic voice comparison. Validation is conducted and results are assessed using validation procedures and validation metrics and graphics that are standard in forensic voice comparison.


Advancing a paradigm shift in evaluation of forensic evidence: The rise of forensic data science
Morrison G.S. (2022). Forensic Science International: Synergy, 5, 100270.

https://doi.org/10.1016/j.fsisyn.2022.100270

  • This is a written version of a Keynote Presentation given at the European Academy of Forensic Science Conference (EAFS 2022)

  • Widespread practice across the majority of branches of forensic science uses analytical methods based on human perception, and interpretive methods based on subjective judgement. These methods are non-transparent and are susceptible to cognitive bias, interpretation is often logically flawed, and forensic-evaluation systems are often not empirically validated. I describe a paradigm shift in which existing methods are replaced by methods based on relevant data, quantitative measurements, and statistical models; methods that are transparent and reproducible, are intrinsically resistant to cognitive bias, use the logically correct framework for interpretation of evidence (the likelihood-ratio framework), and are empirically validated under casework conditions.

  • Michael J Saks, Regents Professor, Sandra Day O’Connor College of Law and Department of Psychology, Arizona State University:
    • Your paradigm article is an excellent piece. It should be a guiding light for quite some time (for anyone interested in being guided). You see both the promise and the challenges more clearly and completely than most of us.


A strawman with machine learning for a brain: A response to Biedermann (2022) The strange persistence of (source) “identification” claims in forensic literature
Morrison G.S., Ramos D., Ypma R.J.F., Basu N., de Bie K., Enzinger E., Geradts Z., Meuwly D., van der Vloed D., Vergeer P., Weber P. (2022). (Letter to the Editor). Forensic Science International: Synergy, 4, 100230.

https://doi.org/10.1016/j.fsisyn.2022.100230

  • We agree wholeheartedly with Biedermann (2022) FSI Synergy article 100222 in its criticism of research publications that treat forensic inference in source attribution as an “identification” or “individualization” task. We disagree, however, with its criticism of the use of machine learning for forensic inference. The argument it makes is a strawman argument. There is a growing body of literature on the calculation of well-calibrated likelihood ratios using machine-learning methods and relevant data, and on the validation under casework conditions of such machine-learning-based systems.


Validation of the alpha version of the E3 Forensic Speech Science System (E3FS3) core software tools
Weber P., Enzinger E., Labrador B., Lozano-Díez A., Ramos D., González-Rodríguez J., Morrison G.S. (2022). Forensic Science International: Synergy, 4, 100223.
https://doi.org/10.1016/j.fsisyn.2022.100223

  • This paper reports on validations of an alpha version of the E3 Forensic Speech Science System (E3FS3) core software tools. This is an open-code human-supervised-automatic forensic-voice-comparison system based on x-vectors extracted using a type of Deep Neural Network (DNN) known as a Residual Network (ResNet). A benchmark validation was conducted using training and test data (forensic_eval_01) that have previously been used to assess the performance of multiple other forensic-voice-comparison systems. Performance equalled that of the best-performing system with previously published results for the forensic_eval_01 test set. The system was then validated using two different populations (male speakers of Australian English and female speakers of Australian English) under conditions reflecting those of a particular case to which it was to be applied. The conditions included three different sets of codecs applied to the questioned-speaker recordings (two mismatched with the set of codecs applied to the known-speaker recordings), and multiple different durations of questioned-speaker recordings. Validations were conducted and reported in accordance with the “Consensus on validation of forensic voice comparison”.


Calculation of likelihood ratios for inference of biological sex from human skeletal remains
Morrison G.S., Weber P., Basu N., Puch-Solis R., Randolph-Quinney P.S. (2021). Forensic Science International: Synergy, 3, 100202.
https://doi.org/10.1016/j.fsisyn.2021.100202

  • Related software is available at https://forensic-data-science.net/anthropology

  • It is common in forensic anthropology to draw inferences (e.g., inferences with respect to biological sex of human remains) using statistical models applied to anthropometric data. Commonly used models can output posterior probabilities, but a threshold is usually applied in order to obtain a classification. In the forensic-anthropology literature, there is some unease with this “fall-off-the-cliff” approach. Proposals have been made to exclude results that fall within a “zone of uncertainty”, e.g., if the posterior probability for “male” is greater than 0.95 then the remains are classified as male, and if the posterior probability for “male” is less than 0.05 then the remains are classified as female, but if the posterior probability for “male” is between 0.05 and 0.95 the remains are not classified as either male or female. In the present paper, we propose what we believe is a simpler solution that is in line with interpretation of evidence in other branches of forensic science: implementation of the likelihood-ratio framework using relevant data, quantitative measurements, and statistical models. Statistical models that can implement this approach are already widely used in forensic anthropology. All that is required are minor modifications in the way those models are used and a change in the way practitioners and researchers think about the meaning of the output of those models. We explain how to calculate likelihood ratios using osteometric data and linear discriminant analysis, quadratic discriminant analysis, and logistic regression models. We also explain how to empirically validate likelihood-ratio models.
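For instance, with linear discriminant analysis the likelihood ratio is simply the ratio of the two class-conditional Gaussian densities with a pooled covariance matrix. The sketch below is my illustration, not the paper’s code:

    import numpy as np
    from scipy.stats import multivariate_normal

    def lda_likelihood_ratio(x, X_male, X_female):
        # x: osteometric measurements for the case (1D array, length d);
        # X_male, X_female: reference samples (n_m x d and n_f x d arrays).
        mu_m, mu_f = X_male.mean(axis=0), X_female.mean(axis=0)
        n_m, n_f = len(X_male), len(X_female)
        # Pooled within-class covariance: the equal-covariance LDA assumption
        pooled = ((n_m - 1) * np.cov(X_male, rowvar=False)
                  + (n_f - 1) * np.cov(X_female, rowvar=False)) / (n_m + n_f - 2)
        # LR = p(x | male) / p(x | female); prior odds are left to the trier of fact
        return (multivariate_normal.pdf(x, mu_m, pooled)
                / multivariate_normal.pdf(x, mu_f, pooled))

Quadratic discriminant analysis drops the pooled-covariance assumption, and logistic regression models the posterior log odds (log likelihood ratio plus prior log odds) directly.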


In the context of forensic casework, are there meaningful metrics of the degree of calibration?
Morrison G.S. (2021). Forensic Science International: Synergy, 3, 100157.
https://doi.org/10.1016/j.fsisyn.2021.100157

  • Matlab code

  • Forensic-evaluation systems should output likelihood-ratio values that are well calibrated. If they do not, their output will be misleading. Unless a forensic-evaluation system is intrinsically well-calibrated, it should be calibrated using a parsimonious parametric model that is trained using calibration data. The system should then be tested using validation data. Metrics of degree of calibration that are based on the pool-adjacent-violators (PAV) algorithm recalibrate the likelihood-ratio values calculated from the validation data. The PAV algorithm overfits on the validation data because it is both trained and tested on the validation data, and because it is a non-parametric model with weak constraints. For already-calibrated systems, PAV-based ostensive metrics of degree of calibration do not actually measure degree of calibration; they measure sampling variability between the calibration data and the validation data, and overfitting on the validation data. Monte Carlo simulations are used to demonstrate that this is the case. We therefore argue that, in the context of casework, PAV-based metrics are not meaningful metrics of degree of calibration; however, we also argue that, in the context of casework, a metric of degree of calibration is not required.
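The flavour of the Monte Carlo demonstration can be reproduced in a few lines (my sketch, not the paper’s code): feed perfectly calibrated log LRs to PAV, and the PAV-“recalibrated” Cllr still drops, reflecting overfitting rather than miscalibration.

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    rng = np.random.default_rng(1)
    sigma2, n = 4.0, 400
    llr_ss = rng.normal(sigma2 / 2, np.sqrt(sigma2), n)   # same-source validation LLRs
    llr_ds = rng.normal(-sigma2 / 2, np.sqrt(sigma2), n)  # different-source validation LLRs

    def cllr(ss, ds):
        return 0.5 * (np.mean(np.log2(1 + np.exp(-ss))) + np.mean(np.log2(1 + np.exp(ds))))

    scores = np.concatenate([llr_ss, llr_ds])
    labels = np.concatenate([np.ones(n), np.zeros(n)])
    pav = IsotonicRegression(y_min=1e-6, y_max=1 - 1e-6, out_of_bounds='clip')
    p = pav.fit(scores, labels).predict(scores)  # trained and tested on the same data
    pav_llr = np.log(p / (1 - p))                # equal class sizes: log odds = LLR
    print(cllr(llr_ss, llr_ds), cllr(pav_llr[:n], pav_llr[n:]))
    # The second (PAV) value is smaller even though the input was already
    # perfectly calibrated: the apparent "improvement" is overfitting.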


Reply to Response to Vacuous standards – subversion of the OSAC standards-development process
Morrison G.S., Neumann C., Geoghegan P.H., Edmond G., Grant T., Ostrum R.B., Roberts P., Saks M., Syndercombe Court D., Thompson W.C., Zabell S. (2021). Forensic Science International: Synergy, 3, 100149.
https://doi.org/10.1016/j.fsisyn.2021.100149


Consensus on validation of forensic voice comparison
Morrison G.S., Enzinger E., Hughes V., Jessen M., Meuwly D., Neumann C., Planting S., Thompson W.C., van der Vloed D., Ypma R.J.F., Zhang C., Anonymous A., Anonymous B. (2021). Science & Justice, 61, 299–309.
https://doi.org/10.1016/j.scijus.2021.02.002


Statistical models in forensic voice comparison
Morrison G.S., Enzinger E., Ramos D., González-Rodríguez J., Lozano-Díez A. (2020). In Banks D.L., Kafadar K., Kaye D.H., Tackett M. (Eds.), Handbook of Forensic Statistics (Ch. 20, pp. 451–497). Boca Raton, FL: CRC.
https://doi.org/10.1201/9780367527709

  • Preprint and videos related to chapter: http://forensic-voice-comparison.net/handbook-of-forensic-statistics/

  • This chapter describes a number of signal-processing and statistical-modeling techniques that are commonly used to calculate likelihood ratios in human-supervised automatic approaches to forensic voice comparison. Techniques described include mel frequency cepstral coefficients (MFCCs) feature extraction, Gaussian mixture model - universal background model (GMM-UBM) systems, i-vector - probabilistic linear discriminant analysis (i-vector PLDA) systems, deep neural network (DNN) based systems (including senone posterior i-vectors, bottleneck features, and embeddings / x-vectors), mismatch compensation, and score to likelihood ratio conversion (aka calibration). Empirical validation of forensic voice comparison systems is also covered. The aim of the chapter is to bridge the gap between general introductions to forensic voice comparison and the highly technical automatic speaker recognition literature from which the signal-processing and statistical-modeling techniques are mostly drawn. Knowledge of the likelihood ratio framework for the evaluation of forensic evidence is assumed. It is hoped that the material presented here will be of value to students of forensic voice comparison and to researchers interested in learning about statistical modeling techniques that could potentially also be applied to data from other branches of forensic science.
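As an illustration of the score-to-likelihood-ratio conversion step, here is a sketch under my own assumptions (synthetic scores, near-unregularized logistic regression), not the chapter’s code:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_calibration(scores_ss, scores_ds):
        # Fit score -> posterior log odds, then subtract the training-set prior
        # log odds so that the output is a natural-log likelihood ratio.
        s = np.concatenate([scores_ss, scores_ds]).reshape(-1, 1)
        y = np.concatenate([np.ones(len(scores_ss)), np.zeros(len(scores_ds))])
        m = LogisticRegression(C=1e6).fit(s, y)  # large C: effectively unregularized
        a, b = m.coef_[0, 0], m.intercept_[0]
        prior_log_odds = np.log(len(scores_ss) / len(scores_ds))
        return lambda score: a * score + b - prior_log_odds

    rng = np.random.default_rng(0)  # synthetic same-source and different-source scores
    to_llr = train_calibration(rng.normal(1.0, 1.0, 200), rng.normal(-1.0, 1.0, 200))
    print(to_llr(0.0))  # near 0: a score midway between the two score distributions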


Vacuous standards – subversion of the OSAC standards-development process
Morrison G.S., Neumann C., Geoghegan P.H. (2020). Forensic Science International: Synergy, 2, 206–209.
https://doi.org/10.1016/j.fsisyn.2020.06.005

  • This is an invited Perspectives article.

  • In the context of development of standards for forensic science, particularly standards initially developed by the U.S. Organization of Scientific Area Committees for Forensic Science (OSAC), this perspective paper raises concern about the publication of vacuous standards. Vacuous standards generally state few requirements; the requirements they do state are often vague; compliance with their stated requirements can be achieved with little effort – the bar is set very low; and compliance with their stated requirements would not be sufficient to lead to scientifically valid results. This perspective paper proposes a number of requirements that we believe would be essential in order for a standard on validation of forensic-science methods to be fit for purpose.


A method for calculating the strength of evidence associated with an earwitness’s claimed recognition of a familiar speaker
Rosas C., Sommerhoff J., Morrison G.S. (2019). Science & Justice, 59, 585–596.

https://doi.org/10.1016/j.scijus.2019.07.001

  • Matlab code and data

  • Presentation (given in Spanish): Congreso de Ciencia Forense, Universidad Autónoma de México, 8 October 2021

  • The present paper proposes and demonstrates a method for assessing strength of evidence when an earwitness claims to recognize the voice of a speaker who is familiar to them. The method calculates a Bayes factor that answers the question: What is the probability that the earwitness would claim to recognize the offender as the suspect if the offender was the suspect versus what is the probability that the earwitness would claim to recognize the offender as the suspect if the offender was not the suspect but some other speaker from the relevant population? By “claim” we mean a claim made by a cooperative earwitness not a claim made by an earwitness who is intentionally deceptive. Relevant data are derived from naïve listeners’ responses to recordings of familiar speakers presented in a speaker lineup. The method is demonstrated under recording conditions that broadly reflect those of a real case.


Multi-laboratory evaluation of forensic voice comparison systems under conditions reflecting those of a real forensic case (forensic_eval_01) - Conclusion
Morrison G.S., Enzinger E. (2019). Speech Communication, 112, 37–39.
https://doi.org/10.1016/j.specom.2019.06.007



  • This conclusion to the virtual special issue (VSI) “Multi-laboratory evaluation of forensic voice comparison systems under conditions reflecting those of a real forensic case (forensic_eval_01)” provides a brief summary of the papers included in the VSI, observations based on the results, and reflections on the aims and process. It also includes errata and acknowledgments.


A statistical procedure to adjust for time-interval mismatch in forensic voice comparison
Morrison G.S., Kelly F. (2019). Speech Communication, 112, 15–21.

https://doi.org/10.1016/j.specom.2019.07.001

  • Matlab code and data

  • The present paper describes a statistical modeling procedure that was developed to account for the fact that, in a forensic voice comparison analysis conducted for a particular case, there was a long time interval between when the questioned- and known-speaker recordings were made (six years), but in the sample of the relevant population used for training and testing the forensic voice comparison system there was a short interval (hours to days) between when each of multiple recordings of each speaker was made. The present paper also includes results of empirical validation of the procedure. Although based on a particular case, the procedure has potential for wider application given that relatively long time intervals between the recording of questioned and known speakers are not uncommon in casework.


Introduction to forensic voice comparison
Morrison G.S., Enzinger E. (2019). In Katz W.F., Assmann P.F. (Eds.) The Routledge Handbook of Phonetics (ch. 21, pp. 599–634). Abingdon, UK: Taylor & Francis.
https://doi.org/10.4324/9780429056253-22

  • purchase book

  • preprint: https://research.aston.ac.uk/en/publications/introduction-to-forensic-voice-comparison

  • This chapter provides a brief introduction to forensic voice comparison. It describes different approaches that have been used to extract information from voice recordings: auditory, spectrographic, acoustic-phonetic, and automatic approaches. It also describes different frameworks that have been used to draw inferences from such information: likelihood-ratio, posterior-probability, identification/exclusion/inconclusive, and the UK framework. In addition, the chapter describes empirical validation of forensic voice comparison systems and briefly discusses legal admissibility.


Response to House of Lords inquiry on forensic science
Morrison G.S. (2018-09-14b)


A response to the discussion of score-based models in Aitken (2018) “Bayesian hierarchical random effects models in forensic science”
Morrison G.S. (2018).

  • The comment is posted at Frontiers in Genetics below the original Aitken (2018) article

  • download pdf version

  • Aitken (2018) states that “Score-based approaches have been used for ... speech recognition” and that scores are “based on the similarity of pairwise scores rather than the similarity and rarity of features.” In fact, in the field of forensic speaker recognition the scores used are not similarity-only scores, but scores that take account of both similarity and typicality.


A response to Marquis et al (2017) What is the error margin of your signature analysis?
Morrison G.S., Ballantyne K., Geoghegan P.H. (2018). Forensic Science International, 287, e11–e12.
https://doi.org/10.1016/j.forsciint.2018.03.009



  • Marquis et al (2017) [What is the error margin of your signature analysis? Forensic Science International, 281, e1–e8] ostensibly presents a model of how to respond to a request from a court to state an “error margin” for a conclusion from a forensic analysis. We interpret the court’s request as an explicit request for meaningful empirical validation to be conducted and the results reported. Marquis et al (2017), however, recommends a method based entirely on subjective judgement and does not subject it to any empirical validation. We believe that much resistance to the adoption of the likelihood ratio framework is not to the idea of assessing the relative probabilities (or likelihoods) of the evidence under prosecution and defence hypotheses per se, but to what is perceived to be unwarranted subjective assignment of those probabilities. In order to maximize transparency, replicability, and resistance to cognitive bias, we recommend the use of methods based on relevant data, quantitative measurements, and statistical models. If the method is based on subjective judgement, the output should be empirically calibrated. Irrespective of the basis of the method, its implementation should be empirically validated under conditions reflecting those of the case at hand.


Admissibility of forensic voice comparison testimony in England and Wales
Morrison G.S. (2018). Criminal Law Review, (1), 20–33.

  • download

  • In 2015 the Criminal Practice Directions (CPD) on admissibility of expert evidence in England & Wales were revised. They emphasised the principle that “the court must be satisfied that there is a sufficiently reliable scientific basis for the evidence to be admitted”. The present paper aims to assist courts in understanding from a scientific perspective what would be necessary to demonstrate the validity of testimony based on forensic voice comparison. We describe different technical approaches to forensic voice comparison that have been used in the United Kingdom, and critically review the case law on their admissibility. We conclude that courts have been inconsistent in their reasoning. In line with the CPD, we recommend that courts enquire as to whether forensic practitioners have made use of data and analytical methods that are appropriate and adequate for the case under consideration, and that courts require forensic practitioners to empirically demonstrate the level of performance of their forensic voice comparison system under conditions reflecting those of the case under consideration.


Forensic speech science
Morrison G.S., Enzinger E., Zhang C. (2018). In I. Freckelton, & H. Selby (Eds.), Expert Evidence (Ch. 99). Sydney, Australia: Thomson Reuters.

  • preprint: http://expert-evidence.forensic-voice-comparison.net/

  • A revised, updated, and expanded edition of Morrison (2010) “Forensic voice comparison”. It introduces forensic speech science in a relatively non-technical way, assuming a reader who has no prior knowledge of the subject. As with the previous edition, the revised edition provides an introduction to forensic voice comparison and to speaker recognition by laypeople (e.g., earwitnesses). Compared to the previous edition, the revised edition has a heavier focus on automatic approaches to forensic voice comparison. The revised edition also includes coverage of other areas of forensic speech science, particularly disputed utterance analysis.


The impact in forensic voice comparison of lack of calibration and of mismatched conditions between the known-speaker recording and the relevant-population sample recordings
Morrison G.S. (2018). Forensic Science International, 283, e1–e7.
http://dx.doi.org/10.1016/j.forsciint.2017.12.024



  • In a 2017 New South Wales case, a forensic practitioner conducted a forensic voice comparison using a Gaussian mixture model - universal background model (GMM-UBM). The practitioner did not report the results of empirical tests of the performance of this system under conditions reflecting those of the case under investigation. The practitioner trained the model for the numerator of the likelihood ratio using the known-speaker recording, but trained the model for the denominator of the likelihood ratio (the UBM) using high-quality audio recordings, not recordings which reflected the conditions of the known-speaker recording. There was therefore a difference in the mismatch between the numerator model and the questioned-speaker recording versus the mismatch between the denominator model and the questioned-speaker recording. In addition, the practitioner did not calibrate the output of the system. The present paper empirically tests the performance of a replication of the practitioner’s system. It also tests a system in which the UBM was trained on known-speaker-condition data and which was empirically calibrated. The performance of the former system was very poor, and the performance of the latter was substantially better.

  • Matlab code
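
  • A Matlab implementation is linked above. For orientation only, here is a minimal Python sketch of the GMM-UBM likelihood-ratio calculation described in the abstract; all names are hypothetical and scikit-learn stands in for the original code.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def gmm_ubm_log10_lr(known_feats, population_feats, questioned_feats,
                         n_components=64):
        # Denominator model: UBM trained on a sample of the relevant population.
        # As the paper argues, these data should reflect the conditions of the
        # known-speaker recording.
        ubm = GaussianMixture(n_components=n_components, covariance_type='diag',
                              random_state=0).fit(population_feats)
        # Numerator model: trained on the known-speaker recording. (Production
        # systems usually MAP-adapt this model from the UBM; independent
        # training keeps the sketch short.)
        spk = GaussianMixture(n_components=n_components, covariance_type='diag',
                              random_state=0).fit(known_feats)
        # Average per-frame log-likelihood ratio for the questioned recording.
        # As the paper demonstrates, this raw output still needs to be
        # empirically calibrated before it is reported as a likelihood ratio.
        return (spk.score(questioned_feats) - ubm.score(questioned_feats)) / np.log(10)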


Avoiding overstating the strength of forensic evidence: Shrunk likelihood ratios/Bayes factors
Morrison G.S., Poh N. (2018). Science & Justice, 58, 200–218.

  • When strength of forensic evidence is quantified using sample data and statistical models, a concern may be raised as to whether the output of a model overestimates the strength of evidence. This is particularly the case when the amount of sample data is small, and hence sampling variability is high. This concern is related to concern about precision. This paper describes, explores, and tests three procedures which shrink the value of the likelihood ratio or Bayes factor toward the neutral value of one. The procedures are: (1) a Bayesian procedure with uninformative priors, (2) use of empirical lower and upper bounds (ELUB), and (3) a novel form of regularized logistic regression. As a benchmark, they are compared with linear discriminant analysis, and in some instances with non-regularized logistic regression. The behaviours of the procedures are explored using Monte Carlo simulated data, and tested on real data from comparisons of voice recordings, face images, and glass fragments.
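
  • A minimal sketch of the general idea of shrinking a likelihood ratio toward 1 by clamping it to empirically supportable bounds. This is a crude stand-in for the published ELUB procedure, not a reimplementation of it; the bound heuristic and all values are illustrative only.

    import numpy as np

    def crude_bounds(n_same_source, n_diff_source):
        # With a finite validation set, extreme LRs cannot be empirically
        # supported: as a rough heuristic, the ceiling is of the order of
        # the number of different-source validation comparisons, and the
        # floor the reciprocal of the number of same-source comparisons.
        return 1.0 / (n_same_source + 1), float(n_diff_source + 1)

    def shrink_lr(lr, lower, upper):
        # Values beyond the bounds are pulled back toward the neutral value 1.
        return float(np.clip(lr, lower, upper))

    lower, upper = crude_bounds(n_same_source=200, n_diff_source=5000)
    print(shrink_lr(1e8, lower, upper))   # clamped to 5001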


Score based procedures for the calculation of forensic likelihood ratios – Scores should take account of both similarity and typicality
Morrison G.S., Enzinger E. (2018). Science & Justice, 58, 47–58.

http://dx.doi.org/10.1016/j.scijus.2017.06.005

see also: http://geoff-morrison.net/#ICFIS2014

  • Matlab code

  • Score based procedures for the calculation of forensic likelihood ratios are popular across different branches of forensic science. They have two stages, first a function or model which takes measured features from known-source and questioned-source pairs as input and calculates scores as output, then a subsequent model which converts scores to likelihood ratios. We demonstrate that scores which are purely measures of similarity are not appropriate for calculating forensically interpretable likelihood ratios. In addition to taking account of similarity between the questioned-origin specimen and the known-origin sample, scores must also take account of the typicality of the questioned-origin specimen with respect to a sample of the relevant population specified by the defence hypothesis. We use Monte Carlo simulations to compare the output of three score based procedures with reference likelihood ratio values calculated directly from the fully specified Monte Carlo distributions. The three types of scores compared are: 1. non-anchored similarity-only scores; 2. non-anchored similarity and typicality scores; and 3. known-source anchored same-origin scores and questioned-source anchored different-origin scores. We also make a comparison with the performance of a procedure using a dichotomous “match”/“non-match” similarity score, and compare the performance of 1 and 2 on real data.
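
  • A toy numerical illustration of the paper’s central point: two questioned values that are equally similar to the known source receive identical similarity-only scores, yet their correct likelihood ratios differ because one is more typical of the relevant population than the other. The univariate Gaussian world and all parameter values are hypothetical.

    from scipy.stats import norm

    tau, sigma = 10.0, 1.0     # between-source and within-source standard deviations
    known_mean = 12.0          # known-source mean, well away from the population centre

    def true_lr(y):
        numerator = norm.pdf(y, loc=known_mean, scale=sigma)                  # similarity
        denominator = norm.pdf(y, loc=0.0, scale=(tau**2 + sigma**2) ** 0.5)  # typicality
        return numerator / denominator

    def similarity_only_score(y):
        return -abs(y - known_mean)

    for y in (known_mean - 1.0, known_mean + 1.0):
        print(similarity_only_score(y), true_lr(y))
    # Both values print the same similarity-only score (-1.0) but different
    # true LRs, so no score-to-LR mapping can recover both from that score.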


Response to Request for information on the development of the Organization of Scientific Area Committees (OSAC) for Forensic Science 2.0
Morrison G.S. (2017-10-28)


A response to: “NIST experts urge caution in use of courtroom evidence presentation method”
Morrison G.S. (2017).
http://forensic-evaluation.net/NIST_press_release_2017_10/

  • A press release from the National Institute of Standards and Technology (NIST) could potentially impede progress toward improving the analysis of forensic evidence and the presentation of forensic analysis results in courts in the United States and around the world. “NIST experts urge caution in use of courtroom evidence presentation method” was released on October 12, 2017, and was picked up by the phys.org news service. It argues that, except in exceptional cases, the results of forensic analyses should not be reported as “likelihood ratios”. The press release, and the journal article by NIST researchers Steven P. Lund & Hari Iyer on which it is based, identifies some legitimate points of concern, but makes a strawman argument and reaches an unjustified conclusion that throws the baby out with the bathwater.

  • I also recommend:


Empirical test of the performance of an acoustic-phonetic approach to forensic voice comparison under conditions similar to those of a real case
Enzinger E., Morrison G.S. (2017). Forensic Science International, 277, 30–40.
http://dx.doi.org/10.1016/j.forsciint.2017.05.007



  • In a 2012 case in New South Wales, Australia, the identity of a speaker on several audio recordings was in question. Forensic voice comparison testimony was presented based on an auditory-acoustic-phonetic-spectrographic analysis. No empirical demonstration of the validity and reliability of the analytical methodology was presented. Unlike the admissibility standards in some other jurisdictions (e.g., US Federal Rule of Evidence 702 and the Daubert criteria, or England & Wales Criminal Practice Directions 19A), Australia’s Unified Evidence Acts do not require demonstration of the validity and reliability of analytical methods and their implementation before testimony based upon them is presented in court. The present paper reports on empirical tests of the performance of an acoustic-phonetic-statistical forensic voice comparison system which exploited the same features as were the focus of the auditory-acoustic-phonetic-spectrographic analysis in the case, i.e., second-formant (F2) trajectories in /o/ tokens and mean fundamental frequency (f0). The tests were conducted under conditions similar to those in the case. The performance of the acoustic-phonetic-statistical system was very poor compared to that of an automatic system.


Comments on National Commission on Forensic Science (NCFS) Views on Statistical Statements in Forensic Testimony
Morrison (2016-09-12a)
Morrison (2017-01-04a)
http://forensic-evaluation.net/NCFS_Statistical_Statements_in_Forensic_Testimony/


What should a forensic practitioner’s likelihood ratio be? II
Morrison G.S. (2017). Science & Justice, 57, 472–476.

http://dx.doi.org/10.1016/j.scijus.2017.08.004



  • In the debate as to whether forensic practitioners should assess and report the precision of the strength of evidence statements that they report to the courts, I remain unconvinced by proponents of the position that only a subjectivist concept of probability is legitimate. I consider this position counterproductive for the goal of having forensic practitioners implement, and courts not only accept but demand, logically correct and scientifically valid evaluation of forensic evidence. In considering what would be the best approach for evaluating strength of evidence, I suggest that the desiderata be (1) to maximise empirically demonstrable performance; (2) to maximise objectivity in the sense of maximising transparency and replicability, and minimising the potential for cognitive bias; and (3) to constrain and make overt the forensic practitioner’s subjective-judgement based decisions so that the appropriateness of those decisions can be debated before the judge in an admissibility hearing and/or before the trier of fact at trial. All approaches require the forensic practitioner to use subjective judgement, but constraining subjective judgement to decisions relating to selection of hypotheses, properties to measure, training and test data to use, and statistical modelling procedures to use – decisions which are remote from the output stage of the analysis – will substantially reduce the potential for cognitive bias. Adopting procedures based on relevant data, quantitative measurements, and statistical models, and directly reporting the output of the statistical models will also maximise transparency and replicability. A procedure which calculates a Bayes factor on the basis of relevant sample data and reference priors is no less objective than a frequentist calculation of a likelihood ratio on the same data. In general, a Bayes factor calculated using uninformative or reference priors will be closer to a value of 1 than a frequentist best estimate likelihood ratio. The bound closest to 1 based on a frequentist best estimate likelihood ratio and an assessment of its precision will also, by definition, be closer to a value of 1 than the frequentist best estimate likelihood ratio. From a practical perspective, both procedures shrink the strength of evidence value towards the neutral value of 1. A single-value Bayes factor or likelihood ratio may be easier for the courts to handle than a distribution. I therefore propose as a potential practical solution, the use of procedures which account for imprecision by shrinking the calculated Bayes factor or likelihood ratio towards 1, the choice of the particular procedure being based on empirical demonstration of performance.


Assessing the admissibility of a new generation of forensic voice comparison testimony
Morrison G.S., Thompson W.C. (2017). Columbia Science and Technology Law Review, 18, 326–434.
https://doi.org/10.7916/stlr.v18i2.4022

  • The publisher’s pdf does not have hyperlinks. The text of the preprint is almost identical, and the preprint pdf does have hyperlinks.

  • preprint: https://ssrn.com/abstract=2883767

  • preprint: https://www.newton.ac.uk/files/preprints/ni16053.pdf

  • This article provides a primer on forensic voice comparison (aka forensic speaker recognition), a branch of forensic science in which the forensic practitioner analyzes a voice recording in order to provide an expert opinion that will help the trier-of-fact determine the identity of the speaker. The article begins with an explanation of ways in which human speech varies within and between speakers. It then discusses different technical approaches that forensic practitioners have used to compare voice recordings, and frameworks of reasoning that practitioners have used for evaluating the evidence and reporting its strength. It then discusses procedures for empirical validation of the performance of forensic voice comparison systems. It also discusses the potential influence of contextual bias and ways to reduce this. Building on this scientific foundation, the article then offers analysis, commentary, and recommendations on how courts evaluate the admissibility of forensic voice comparison testimony under the Daubert and Frye standards. It reviews past rulings such as U.S. v. Angleton, 269 F. Supp. 2d 892 (S.D. Tex. 2003) that found expert testimony based on the spectrographic approach inadmissible under Daubert. The article also offers a detailed analysis of the evidence presented in the recent Daubert hearing in U.S. v. Ahmed, et al. 2015 EDNY 12-CR-661, which included testimony based on the newer automatic approach. The scientific testimony proffered in Ahmed is used to illustrate the issues courts are likely to face when considering the admissibility of forensic voice comparison testimony in the future. The article concludes with a discussion of how proponents of forensic voice comparison testimony might meet a reasonably rigorous application of the Daubert standard and thereby ensure that such testimony is sufficiently trustworthy to be used in court.


Forensic voice comparison
Zhang C., Morrison G.S. (2017). In: Sybesma R., Behr W., Gu Y., Handel Z., Huang C.-T. J., Myers J. (Eds.), Encyclopedia of Chinese Language and Linguistics (pp. 256–260). Leiden: Brill.
http://dx.doi.org/10.1163/2210-7363_ecll_COM_000205




A comment on the PCAST report: Skip the “match”/“non-match” stage
Morrison G.S., Kaye D.H., Balding D.J., Taylor D., Dawid P., Aitken C.G.G., Gittelson S., Zadora G., Robertson B., Willis S.M., Pope S., Neil M., Martire K.A., Hepler A., Gill R.D., Jamieson A., de Zoete J., Ostrum R.B., Caliebe A. (2016/2017). Forensic Science International, 272, e7–e9.
http://dx.doi.org/10.1016/j.forsciint.2016.10.018

  • preprint: http://forensic-evaluation.net/PCAST2016/



  • This letter comments on the report “Forensic science in criminal courts: Ensuring scientific validity of feature-comparison methods” recently released by the President’s Council of Advisors on Science and Technology (PCAST). The report advocates a procedure for evaluation of forensic evidence that is a two-stage procedure in which the first stage is “match”/“non-match” and the second stage is empirical assessment of sensitivity (correct acceptance) and false alarm (false acceptance) rates. Almost always, quantitative data from feature-comparison methods are continuously-valued and have within-source variability. We explain why a two-stage procedure is not appropriate for this type of data, and recommend use of statistical procedures which are appropriate.
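
  • A toy illustration of the information discarded by a “match”/“non-match” first stage applied to continuously-valued comparison scores (threshold and score values hypothetical):

    threshold = 0.5
    for score in (0.51, 0.99):
        reported = "match" if score >= threshold else "non-match"
        print(score, "->", reported)
    # Both comparisons are reported as "match" with the same sensitivity and
    # false-alarm rates attached, although one score provides much stronger
    # support than the other. Calculating a likelihood ratio directly from
    # the continuous score preserves that gradation.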


Use of relevant data, quantitative measurements, and statistical models to calculate a likelihood ratio for a Chinese forensic voice comparison case involving two sisters
Zhang C., Morrison G.S., Enzinger E. (2016). Forensic Science International, 267, 115–124.
http://dx.doi.org/10.1016/j.forsciint.2016.08.017



  • Currently, the standard approach to forensic voice comparison in China is the aural-spectrographic approach. Internationally, this approach has been the subject of much criticism. The present paper describes what we believe is the first forensic voice comparison analysis presented to a court in China in which a numeric likelihood ratio was calculated using relevant data, quantitative measurements, and statistical models, and in which the validity and reliability of the analytical procedures were empirically tested under conditions reflecting those of the case under investigation. The hypotheses addressed were whether the female speaker on a recording of a mobile telephone conversation was a particular individual, or whether it was that individual’s younger sister. Known speaker recordings of both these individuals were recorded using the same mobile telephone as had been used to record the questioned-speaker recording, and customised software was written to perform the acoustic and statistical analyses.


Multi-laboratory evaluation of forensic voice comparison systems under conditions reflecting those of a real forensic case (forensic_eval_01) - Introduction
Morrison G.S., Enzinger E. (2016). Speech Communication, 85, 119–126.
http://dx.doi.org/10.1016/j.specom.2016.07.006

  • This article should be open access. If for any reason you can’t access it at the SPECOM site

  • There is increasing pressure on forensic laboratories to validate the performance of forensic analysis systems before they are used to assess strength of evidence for presentation in court. Different forensic voice comparison systems may use different approaches, and even among systems using the same general approach there can be substantial differences in operational details. From case to case, the relevant population, speaking styles, and recording conditions can be highly variable, but it is common to have relatively poor recording conditions and mismatches in speaking style and recording conditions between the known- and questioned-speaker recordings. In order to validate a system intended for use in casework, a forensic laboratory needs to evaluate the degree of validity and reliability of the system under forensically realistic conditions. The present paper is an introduction to a Virtual Special Issue consisting of papers reporting on the results of testing forensic voice comparison systems under conditions reflecting those of an actual forensic voice comparison case. A set of training and test data representative of the relevant population and reflecting the conditions of this particular case has been released, and operational and research laboratories are invited to use these data to train and test their systems. The present paper includes the rules for the evaluation and a description of the evaluation metrics and graphics to be used. The name of the evaluation is: forensic_eval_01


Reply to Hicks et alii (2017) Reply to Morrison et alii (2016) Refining the relevant population in forensic voice comparison – A response to Hicks et alii (2015) The importance of distinguishing information from evidence/observations when formulating propositions
Morrison G.S., Enzinger E., Zhang C. (2017).
http://arxiv.org/abs/1704.07639

  • The present letter to the editor is one in a series of publications discussing the formulation of hypotheses (propositions) for the evaluation of strength of forensic evidence. In particular, the discussion focusses on the issue of what information may be used to define the relevant population specified as part of the different-speaker hypothesis in forensic voice comparison. The previous publications in the series are: Hicks et al. (2015); Morrison et al. (2016); Hicks et al. (2017). The latter letter to the editor mostly resolves the apparent disagreement between the two groups of authors. We briefly discuss one outstanding point of apparent disagreement, and attempt to correct a misinterpretation of our earlier remarks. We believe that at this point there is no actual disagreement, and that both groups of authors are calling for greater collaboration in order to reduce the likelihood of future misunderstandings.


Refining the relevant population in forensic voice comparison - A response to Hicks et alii (2015) The importance of distinguishing information from evidence/observations when formulating propositions
Morrison G.S., Enzinger E., Zhang C. (2016). Science & Justice, 56, 492–497.
http://dx.doi.org/10.1016/j.scijus.2016.07.002



  • Hicks et al. (2015) propose that forensic speech scientists not use the accent of the speaker of questioned identity to refine the relevant population. This proposal is based on a lack of understanding of the realities of forensic voice comparison. If it were implemented, it would make data-based forensic voice comparison analysis within the likelihood ratio framework virtually impossible. We argue that it would also lead forensic speech scientists to present invalid and unreliable strength of evidence statements, and not allow them to conduct the tests that would make them aware of this problem.


Special issue on measuring and reporting the precision of forensic likelihood ratios: Introduction to the debate
Morrison G.S. (2016). Science & Justice, 56, 371–373.
http://dx.doi.org/10.1016/j.scijus.2016.05.002



  • The present paper introduces the Science & Justice virtual special issue on measuring and reporting the precision of forensic likelihood ratios – whether this should be done, and if so how. The focus is on precision (aka reliability) as opposed to accuracy (aka validity). The topic is controversial and different authors are expected to express a range of nuanced opinions. The present paper frames the debate, explaining the underlying problem and referencing classes of solutions proposed in the existing literature. The special issue will consist of a number of position papers, responses to those position papers, and replies to the responses.


What should a forensic practitioner’s likelihood ratio be?
Position Paper in the Science & Justice Virtual Special Issue on measuring and reporting the precision of forensic likelihood ratios
Morrison G.S., Enzinger E. (2016). Science & Justice, 56, 374–379.
http://dx.doi.org/10.1016/j.scijus.2016.05.007



  • Matlab code: combine_imprecise_priors_LR

  • We argue that forensic practitioners should empirically assess and report the precision of their likelihood ratios. Once the practitioner has specified the prosecution and defence hypotheses they have adopted, including the relevant population they have adopted, and has specified the type of measurements they will make, their task is to empirically calculate an estimate of a likelihood ratio which has a true but unknown value. We explicitly reject the competing philosophical position that the forensic practitioner’s likelihood ratio should be based on subjective personal probabilities. Estimates of true but unknown values are based on samples and are subject to sampling uncertainty, and it is standard practice to report the degree of precision of such estimates. We discuss the dangers of not reporting precision to the courts, and the problems with an alternative approach which instead reports a verbal expression corresponding to a pre-specified range of likelihood ratio values. Reporting precision as an interval requires an arbitrary choice of coverage, e.g., a 95% or a 99% credible interval. We outline a normative framework which a trier of fact could use to make non-arbitrary use of the results of forensic practitioners’ empirical calculations of likelihood ratios and their precision.


INTERPOL survey of the use of speaker identification by law enforcement agencies
Morrison G.S., Sahito F.H., Jardine G., Djokic D., Clavet S., Berghs S., Goemans Dorny C. (2016). Forensic Science International, 263, 92–100.
http://dx.doi.org/10.1016/j.forsciint.2016.03.044


Response to forensic science questions posed by President’s Council of Advisors on Science and Technology
Morrison (2015-12-04)
http://forensic-evaluation.net/PCAST2015/


Statement regarding the UK Parliamentary Office of Science and Technology 2015 briefing on Forensic Linguistics
Morrison G.S. (2015-10-21)
http://forensic-evaluation.net/POSTnote509/


A demonstration of the application of the new paradigm for the evaluation of forensic evidence under conditions reflecting those of a real forensic-voice-comparison case.
Enzinger E., Morrison G.S., Ochoa F. (2015/2016). Science & Justice, 56, 42–57.
http://dx.doi.org/10.1016/j.scijus.2015.06.005



  • Audio examples

  • The new paradigm for the evaluation of the strength of forensic evidence includes: The use of the likelihood-ratio framework. The use of relevant data, quantitative measurements, and statistical models. Empirical testing of validity and reliability under conditions reflecting those of the case under investigation. Transparency as to decisions made and procedures employed. The present paper illustrates the use of the new paradigm to evaluate strength of evidence under conditions reflecting those of a real forensic-voice-comparison case. The offender recording was from a landline telephone system, had background office noise, and was saved in a compressed format. The suspect recording included substantial reverberation and ventilation system noise, and was saved in a different compressed format. The present paper includes descriptions of the selection of the relevant hypotheses, sampling of data from the relevant population, simulation of suspect and offender recording conditions, and acoustic measurement and statistical modelling procedures. The present paper also explores the use of different techniques to compensate for the mismatch in recording conditions. It also examines how system performance would have differed had the suspect recording been of better quality.


Mismatched distances from speakers to telephone in a forensic-voice-comparison case
Enzinger E., Morrison G.S. (2015). Speech Communication, 70, 28–41.
http://dx.doi.org/10.1016/j.specom.2015.03.001



  • In a forensic-voice-comparison case, one speaker (A) was standing a short distance away from another speaker (B) who was talking on a mobile telephone. Later, speaker A moved closer to the telephone. Shortly thereafter, there was a section of speech where the identity of the speaker was in question – the prosecution claiming that it was speaker A and the defense claiming it was speaker B. All material for training a forensic-voice-comparison system could be extracted from this single recording, but there was a near-far mismatch: Training data for speaker A were mostly far, training data for speaker B were near, and the disputed speech was near. Based on the conditions of this case we demonstrate a methodology for handling forensic casework using relevant data, quantitative measurements, and statistical models to calculate likelihood ratios. A procedure is described for addressing the degree of validity and reliability of a forensic-voice-comparison system under such conditions. Using a set of development speakers we investigate the effect of mismatched distances to the microphone and demonstrate and assess three methods for compensation.


Calculation of forensic likelihood ratios: Use of Monte Carlo simulations to compare the output of score-based approaches with true likelihood-ratio values
Morrison G.S. (2015). Research Report.
Stable URL: http://geoff-morrison.net/#ICFIS2014

  • Also available at: http://arxiv.org/abs/1612.08165

  • Matlab code

  • A group of approaches for calculating forensic likelihood ratios first calculates scores which quantify the degree of difference or the degree of similarity between pairs of samples, then converts those scores to likelihood ratios. In order for a score-based approach to produce a forensically interpretable likelihood ratio, however, in addition to accounting for the similarity of the questioned sample with respect to the known sample, it must also account for the typicality of the questioned sample with respect to the relevant population. The present paper explores a number of score-based approaches using different types of scores and different procedures for converting scores to likelihood ratios. Monte Carlo simulations are used to compare the output of these approaches to true likelihood-ratio values calculated on the basis of the distribution specified for a simulated population. The inadequacy of approaches based on similarity-only or difference-only scores is illustrated, and the relative performance of different approaches which take account of both similarity and typicality is assessed.


Critique by Dr Geoffrey Stewart Morrison of a forensic voice comparison report submitted by Mr Edward J Primeau in relation to a section of audio recording which is alleged to be a recording of the voice of Dr Marlo Raynolds
(2014-11-30)
http://forensic-evaluation.net/raynolds/


Likelihood ratio calculation for a disputed-utterance analysis with limited available data.
Morrison G.S., Lindh J., Curran J.M. (2014). Speech Communication, 58, 81–90.
http://dx.doi.org/10.1016/j.specom.2013.11.004



  • Erratum

  • Matlab and R code

  • We present a disputed-utterance analysis using relevant data, quantitative measurements, and statistical models to calculate likelihood ratios. The acoustic data were taken from an actual forensic case in which the amount of data available to train the statistical models was small and the data point from the disputed word was far out on the tail of one of the modelled distributions. A procedure based on single multivariate Gaussian models for each hypothesis led to an unrealistically high likelihood ratio value with extremely poor reliability, but a procedure based on Hotelling’s T2 statistic and a procedure based on calculating a posterior predictive density produced more acceptable results. The Hotelling’s T2 procedure attempts to take account of the sampling uncertainty of the mean vectors and covariance matrices due to the small number of tokens used to train the models, and the posterior-predictive-density analysis integrates out the values of the mean vectors and covariance matrices as nuisance parameters. Data scarcity is common in forensic speech science and we argue that it is important not to accept extremely large calculated likelihood ratios at face value, but to consider whether such values can be supported given the size of the available data and modelling constraints.
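
  • A univariate sketch of the paper’s point (the case itself was multivariate): with few training tokens, a plug-in Gaussian assigns vanishingly small density to a data point far out on the tail, which can inflate the likelihood ratio, whereas a posterior-predictive density (a Student’s t under a standard noninformative prior) has heavier tails and behaves more conservatively. All data values are hypothetical.

    import numpy as np
    from scipy.stats import norm, t

    tokens = np.array([1.0, 1.2, 0.9, 1.1, 1.05])   # small training sample
    n, xbar, s = len(tokens), tokens.mean(), tokens.std(ddof=1)
    y = 2.5                                          # far out on the tail

    plug_in = norm.pdf(y, loc=xbar, scale=s)
    post_pred = t.pdf(y, df=n - 1, loc=xbar, scale=s * np.sqrt(1 + 1 / n))
    print(plug_in, post_pred)   # the posterior-predictive density is orders
                                # of magnitude larger at this tail point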


Forensic strength of evidence statements should preferably be likelihood ratios calculated using relevant data, quantitative measurements, and statistical models – a response to Lennard (2013) Fingerprint identification: How far have we come?
Morrison G.S., Stoel R.D. (2014). Australian Journal of Forensic Sciences, 46, 282–292.
http://dx.doi.org/10.1080/00450618.2013.833648

  • Preprint at: http://arxiv.org/abs/2012.12198



  • Lennard (2013) [Fingerprint identification: how far have we come? Aus J Forensic Sci. doi:10.1080/00450618.2012.752037] proposes that the numeric output of statistical models should not be presented in court (except ‘if necessary’/‘if required’). Instead, he argues in favour of an ‘expert opinion’ which may be informed by a statistical model but which is not itself the output of a statistical model. We argue that his proposed procedure lacks the transparency, the ease of testing of validity and reliability, and the relative robustness to cognitive bias that are the strengths of a likelihood-ratio approach based on relevant data, quantitative measurements, and statistical models, and that the latter is therefore preferable.


Distinguishing between forensic science and forensic pseudoscience: Testing of validity and reliability, and approaches to forensic voice comparison.
Morrison G.S. (2014). Science & Justice, 54, 245–256.
http://dx.doi.org/10.1016/j.scijus.2013.07.004



  • In this paper it is argued that one should not attempt to directly assess whether a forensic analysis technique is scientifically acceptable. Rather one should first specify what one considers to be appropriate principles governing acceptable practice, then consider any particular approach in light of those principles. This paper focuses on one principle: the validity and reliability of an approach should be empirically tested under conditions reflecting those of the case under investigation using test data drawn from the relevant population. Versions of this principle have been key elements in several reports on forensic science, including forensic voice comparison, published over the last four-and-a-half decades. The aural–spectrographic approach to forensic voice comparison (also known as “voiceprint” or “voicegram” examination) and the currently widely practiced auditory–acoustic–phonetic approach are considered in light of this principle (these two approaches do not appear to be mutually exclusive). Approaches based on data, quantitative measurements, and statistical models are also considered in light of this principle.


Forensic audio analysis – Review: 2010–2013.
Grigoras C., Smith J.M., Morrison G.S., Enzinger E. (2013). In: NicDaéid, N. (Ed.), Proceedings of the 17th International Forensic Science Managers’ Symposium, Lyon (pp. 612–637). Lyon, France: Interpol.


Effects of telephone transmission on the performance of formant-trajectory-based forensic voice comparison – female voices.
Zhang C., Morrison G.S., Enzinger E., Ochoa F. (2013). Speech Communication, 55, 796–813.
http://dx.doi.org/10.1016/j.specom.2013.01.011



  • In forensic-voice-comparison casework a common scenario is that the suspect’s voice is recorded directly using a microphone in an interview room but the offender’s voice is recorded via a telephone system. Acoustic-phonetic approaches to forensic voice comparison often include analysis of vowel formants, and the second formant is often assumed to be relatively robust to telephone-transmission effects. This study assesses the effects of telephone transmission on the performance of formant-trajectory-based forensic-voice-comparison systems. The effectiveness of both human-supervised and fully-automatic formant tracking is investigated. Human-supervised formant tracking is generally considered to be more accurate and reliable but requires a substantial investment of human labor. Measurements were made of the formant trajectories of /iau/ tokens in a database of recordings of 60 female speakers of Chinese using one human-supervised and five fully-automatic formant trackers. Measurements were made under high-quality, landline-to-landline, mobile-to-mobile, and mobile-to-landline conditions. High-quality recordings were treated as suspect samples and telephone-transmitted recordings as offender samples. Discrete cosine transforms (DCT) were fitted to the formant trajectories and likelihood ratios were calculated on the basis of the DCT coefficients. For each telephone-transmission condition the formant-trajectory system was fused with a baseline mel-frequency cepstral-coefficient (MFCC) system, and performance was assessed relative to the baseline system. The systems based on human-supervised formant measurement always outperformed the systems based on fully-automatic formant measurement; however, in conditions involving mobile telephones neither the former nor the latter type of system provided meaningful improvement over the baseline system, and even in the other conditions the high cost in skilled labor for human-supervised formant-trajectory measurement is probably not warranted given the relatively good performance that can be obtained using other less-costly procedures.
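
  • A sketch of the feature-extraction step described above: fit a discrete cosine transform to a measured formant trajectory and keep the first few coefficients as features. The trajectory values (in Hz) are hypothetical, and the paper’s measurement protocol is more involved.

    import numpy as np
    from scipy.fft import dct

    f2_trajectory = np.array([1800.0, 1750.0, 1650.0, 1500.0, 1350.0, 1250.0, 1200.0])
    n_coeffs = 4
    features = dct(f2_trajectory, norm='ortho')[:n_coeffs]
    print(features)   # coefficient 0 tracks the mean level; higher-order
                      # coefficients capture the shape of the trajectory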


Reliability of human-supervised formant-trajectory measurement for forensic voice comparison.
Zhang C., Morrison G.S., Ochoa F., Enzinger E. (2013). Journal of the Acoustical Society of America, 133, EL54–EL60.
http://dx.doi.org/10.1121/1.4773223

  • Acoustic-phonetic approaches to forensic voice comparison often include human-supervised measurement of vowel formants, but the reliability of such measurements is a matter of concern. This study assesses the within- and between-supervisor variability of three sets of formant-trajectory measurements made by each of four human supervisors. It also assesses the validity and reliability of forensic-voice-comparison systems based on these measurements. Each supervisor’s formant-trajectory system was fused with a baseline mel-frequency cepstral-coefficient system, and performance was assessed relative to the baseline system. Substantial improvements in validity were found for all supervisors’ systems, but some supervisors’ systems were more reliable than others.


Tutorial on logistic-regression calibration and fusion: Converting a score to a likelihood ratio.
Morrison G.S. (2013). Australian Journal of Forensic Sciences, 45, 173–197.
http://dx.doi.org/10.1080/00450618.2012.733025

  • typesetting errata

  • Preprint: https://arxiv.org/abs/2104.08846

  • Logistic-regression calibration and fusion are potential steps in the calculation of forensic likelihood ratios. The present paper provides a tutorial on logistic-regression calibration and fusion at a practical conceptual level with minimal mathematical complexity. A score is log-likelihood-ratio-like in that it indicates the degree of similarity of a pair of samples while taking into consideration their typicality with respect to a model of the relevant population. A higher-valued score provides more support for the same-origin hypothesis over the different-origin hypothesis than does a lower-valued score; however, the absolute values of scores are not interpretable as log likelihood ratios. Logistic-regression calibration is a procedure for converting scores to log likelihood ratios, and logistic-regression fusion is a procedure for converting parallel sets of scores from multiple forensic-comparison systems to log likelihood ratios. Logistic-regression calibration and fusion were developed for automatic speaker recognition and are popular in forensic voice comparison. They can also be applied in other branches of forensic science; a fingerprint/fingermark example is provided.
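
  • A minimal sketch of score-to-log-likelihood-ratio conversion in the spirit of the tutorial; the scores are hypothetical and the tutorial’s own (Matlab) implementation differs in detail.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    same_scores = np.array([2.1, 1.8, 2.5, 1.2, 2.9])      # same-origin training scores
    diff_scores = np.array([-1.0, -0.2, 0.3, -1.5, -0.7])  # different-origin training scores

    X = np.concatenate([same_scores, diff_scores]).reshape(-1, 1)
    y = np.concatenate([np.ones_like(same_scores), np.zeros_like(diff_scores)])

    cal = LogisticRegression(C=1e6).fit(X, y)   # large C: essentially unregularized

    def score_to_log10_lr(score):
        # The fitted linear function gives posterior log odds; with equal
        # numbers of same- and different-origin training scores the prior
        # log odds are zero, so this is the calibrated natural-log LR,
        # divided by ln(10) to express it in base 10.
        return (cal.intercept_[0] + cal.coef_[0, 0] * score) / np.log(10)

    print(score_to_log10_lr(2.0))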


Vowel inherent spectral change in forensic voice comparison.
Morrison G.S. (2013). In G.S. Morrison & P.F. Assmann (Eds.) Vowel inherent spectral change (pp. 263–283). Heidelberg, Germany: Springer-Verlag.
http://dx.doi.org/10.1007/978-3-642-14209-3_11



  • The onset + offset model of vowel inherent spectral change has been found to be effective for vowel-phoneme identification, and not to be outperformed by more sophisticated parametric-curve models. This suggests that if only simple cues such as initial and final formant values are necessary for signaling phoneme identity, then speakers may have considerable freedom in the exact path taken between the initial and final formant values. If the constraints on formant trajectories are relatively lax with respect to vowel-phoneme identity, then with respect to speaker identity there may be considerable information contained in the details of formant trajectories. Differences in physiology and idiosyncrasies in the use of motor commands may mean that different individuals produce different formant trajectories between the beginning and end of the same vowel phoneme. If within-speaker variability is substantially smaller than between-speaker variability then formant trajectories may be effective features for forensic voice comparison. This chapter reviews a number of forensic-voice-comparison studies which have used different procedures to extract information from formant trajectories. It concludes that information extracted from formant trajectories can lead to a high degree of validity in forensic voice comparison (at least under controlled conditions), and that a whole trajectory approach based on parametric curves outperforms an onset + offset model.


The importance of using between-session test data in evaluating the performance of forensic-voice-comparison systems.
Enzinger E., Morrison G.S. (2012). Proceedings of the 14th Australasian International Conference on Speech Science and Technology, Sydney (pp. 137–140). Australasian Speech Science and Technology Association.

  • In this paper we report on a study which demonstrates the importance of using non-contemporaneous test data in evaluating the validity and reliability of forensic-voice-comparison systems. We test four different systems: one MFCC GMM–UBM, one vowel formant-trajectory based, one nasal spectra based, and the fusion of the three systems. Each system is tested on the same set of test recordings, including same-speaker and different-speaker pairs. In one condition, the same-speaker pairs are from contemporaneous (within-session) recordings and in the other they are from non-contemporaneous (between-session) recordings. Within-session testing always overestimated the performance of the systems compared to between-session testing.


Human-supervised and fully-automatic formant-trajectory measurement for forensic voice comparison – Female voices.
Zhang C., Morrison G.S., Enzinger E., Ochoa F. (2012). Laboratory Report. Forensic Voice Comparison Laboratory, School of Electrical Engineering & Telecommunications, University of New South Wales, Sydney, Australia.
Stable URL: http://geoff-morrison.net/#_2012LabRepFormants

  • Acoustic-phonetic approaches to forensic voice comparison often include analysis of vowel formants. Such methods typically depend on human-supervised formant measurement, which is often assumed to be relatively reliable and relatively robust to telephone-transmission-channel effects, but which requires substantial investment of human labor. Fully-automatic formant trackers require minimal human labor but are usually not considered reliable. This study assesses the effect of variability within three sets of formant-trajectory measurements made by four human supervisors on the validity and reliability of forensic-voice-comparison systems in a high-quality v high-quality recording condition. Measurements were made of the formant trajectories of /iau/ tokens in a database of recordings of 60 female speakers of Chinese. The study also assesses the validity of forensic-voice-comparison systems including a human-supervised and five fully-automatic formant trackers under landline-to-landline, mobile-to-mobile, and mobile-to-landline conditions, each of these matched with the same condition and mismatched with the high-quality condition. In each case the formant-trajectory systems were fused with a baseline mel-frequency cepstral-coefficient (MFCC) system, and performance was assessed relative to the baseline system. The human-supervised systems always outperformed the fully-automatic formant-tracker systems, but in some conditions the improvement was marginal and the cost of human-supervised formant-trajectory measurement probably not warranted.


Response to Draft Australian Standard: DR AS 5388.3 Forensic analysis - Part 3 - Interpretation
Morrison G.S., Evett I.W., Willis S.M., Champod C., Grigoras C., Lindh J., Fenton N., Hepler A., Berger C.E.H., Buckleton J.S., Thompson W.C., González-Rodríguez J., Neumann C., Curran J.M., Zhang C., Aitken C.G.G., Ramos D., Lucena-Molina J.J., Jackson G., Meuwly D., Robertson B., Vignaux G.A. (2012).
Stable URL: http://geoff-morrison.net/#_2012DraftStandResp
Stable URL: http://forensic-evaluation.net/australian-standards/#Morrison_et_al_2012


Database selection for forensic voice comparison.
Morrison G.S., Ochoa F., Thiruvaran T. (2012). Proceedings of Odyssey 2012: The Language and Speaker Recognition Workshop, Singapore, 62–77.

  • Defining the relevant population to sample is an important issue in data-based implementation of the likelihood-ratio framework for forensic voice comparison. We present a logical argument that because an investigator or prosecutor only submits suspect and offender recordings for forensic analysis if they sound sufficiently similar to each other, the appropriate defense hypothesis for the forensic scientist to adopt will usually be that the suspect is not the speaker on the offender recording but is a member of a population of speakers who sound sufficiently similar that an investigator or prosecutor would submit recordings of these speakers for forensic analysis. We propose a procedure for selecting background, development, and test databases using a panel of human listeners, and empirically test an automatic procedure inspired by the above. Although the automatic procedure is not entirely consistent with the logical arguments and human-listener procedure, it serves as a proof of concept for the importance of database selection. A forensic-voice-comparison system using the automatic database-selection procedure outperformed systems with random database selection.


Voice source features for forensic voice comparison – an evaluation of the Glottex® software package.
Enzinger E., Zhang C., Morrison G.S. (2012). Proceedings of Odyssey 2012: The Language and Speaker Recognition Workshop, Singapore, 78–85.

  • Errata & Addenda

  • GLOTTEX is a software package which extracts information about voice source properties, including estimates of properties related to physical structures of the vocal folds. It has been proposed that the output of GLOTTEX can be used as part of a forensic-voice-comparison system. We test this using manually labeled segments from a database of voice recordings of 60 female Chinese speakers. Performance was assessed relative to a baseline MFCC GMM-UBM system. GMM-UBM systems based on features extracted by GLOTTEX were combined with the baseline system using logistic-regression fusion. System performance was assessed in three channel conditions: high-quality v high-quality, mobile-to-landline v mobile-to-landline, and mobile-to-landline v high-quality. Substantial improvements over the baseline system were not observed.


What did Bain really say? A preliminary forensic analysis of the disputed utterance based on data, acoustic analysis, statistical models, calculation of likelihood ratios, and testing of validity.
Morrison G.S., Hoy M. C. (2012). Proceedings of the 46th Audio Engineering Society (AES) Conference on Audio Forensics: Recording, Recovery, Analysis, and Interpretation, Denver, CO, 203–207.



  • This paper presents a preliminary analysis of the disputed utterance in Bain v R [2009] NZSC 16. A likelihood ratio is calculated as a strength-of-evidence statement with respect to the question: What is the probability of getting the acoustic properties of the disputed utterance if Bain had said “I shot the prick” versus if he had said “I can’t breathe”. In particular, an acoustic and statistical analysis is conducted on the first segment of the second word to estimate the probability of getting the acoustics of this segment if it were a postalveolar fricative versus if it were a palatal fricative. The validity of the system is tested and ways to improve the analysis are discussed.


Protocol for the collection of databases of recordings for forensic-voice-comparison research and practice.
Morrison G.S., Rose P., Zhang C. (2012). Australian Journal of Forensic Sciences, 44, 155–167.
http://dx.doi.org/10.1080/00450618.2011.630412



  • A protocol for the collection of databases of audio recordings for forensic-voice-comparison research and practice is described. The protocol fulfills the following requirements: (1) The database contains at least two non-contemporaneous recordings of each speaker. (2) The database contains recordings of each speaker using different speaking styles which are typical of speaking styles found in casework, and which are elicited as natural speech. (3) The database is usable for research and casework involving recording- and transmission-channel mismatch. The protocol includes three speaking tasks, (1) an informal telephone conversation, (2) an information exchange task over the telephone, and (3) a pseudo-police-style interview. Technical issues are also discussed.


The likelihood-ratio framework and forensic evidence in court: A response to R v T.
Morrison G.S. (2012). International Journal of Evidence and Proof, 16, 1–29.
http://dx.doi.org/10.1350/ijep.2012.16.1.390



  • Erratum: Table 1, fourth line of numbers should read: “1000–10 000”, not “100–10 000”

  • In R v T the Court concluded that the likelihood-ratio framework should not be used for the evaluation of evidence except ‘where there is a firm statistical base’. The present paper argues that the Court’s opinion is based on misunderstandings of statistics and of the likelihood-ratio framework for the evaluation of evidence. The likelihood-ratio framework is a logical framework and not itself dependent on the use of objective measurements, databases, and statistical models. The ruling is analysed from the perspective of the new paradigm for forensic-comparison science: the use of the likelihood-ratio framework for the evaluation of evidence; a strong preference for the use of objective measurements, databases representative of the relevant population, and statistical models; and empirical testing of the validity and reliability of the forensic-comparison system under conditions reflecting those of the case at trial.
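
  • For reference, the likelihood-ratio framework discussed in this and many of the papers above rests on Bayes’ theorem in its odds form:

    \[
    \underbrace{\frac{p(H_p \mid E)}{p(H_d \mid E)}}_{\text{posterior odds}}
    \;=\;
    \underbrace{\frac{p(E \mid H_p)}{p(E \mid H_d)}}_{\text{likelihood ratio}}
    \times
    \underbrace{\frac{p(H_p)}{p(H_d)}}_{\text{prior odds}}
    \]

    The forensic practitioner’s task is confined to the middle term, the probability of the evidence under the prosecution and defence hypotheses; the prior and posterior odds belong to the trier of fact. Nothing in this logical structure requires measurements, databases, or statistical models, which is the point the paper makes against the Court’s reading.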


Forensic voice comparison using Chinese /iau/.
Zhang C., Morrison G.S., Thiruvaran T. (2011). Proceedings of the 17th International Congress of Phonetic Sciences, Hong Kong, China, 2280–2283.

  • An acoustic-phonetic forensic-voice-comparison system extracted information from the formant trajectories of tokens of Standard Chinese /iau/. When this information was added to a generic automatic forensic-voice-comparison system, which did not itself exploit acoustic-phonetic information, there was a substantial improvement in system validity but a decline in system reliability.


Humans versus machine: Forensic voice comparison on a small database of Swedish voice recordings.
Lindh J., Morrison G.S. (2011). Proceedings of the 17th International Congress of Phonetic Sciences, Hong Kong, China, 1254–1257.

  • A procedure for comparing the performance of humans and machines on speaker recognition and on forensic voice comparison is proposed and demonstrated. The procedure is consistent with the new paradigm for forensic-comparison science (use of the likelihood-ratio framework and testing of the validity and reliability of the results). The use of the procedure is demonstrated using a small database of Swedish voice recordings.


Measuring the validity and reliability of forensic likelihood-ratio systems.
Morrison G.S. (2011). Science & Justice, 51, 91–98.
http://dx.doi.org/10.1016/j.scijus.2011.03.002



  • Matlab Code: CI_calcs 2011-03-30.

  • Throughout 2015 and 2016 this was ranked as the most cited paper published in Science & Justice within the previous 5 years.

  • There has been a great deal of concern recently about validity and reliability in forensic science. This paper reviews for a broad target audience metrics of validity and reliability (accuracy and precision) which have been applied in forensic voice comparison and which are potentially applicable in other branches of forensic science. The metric of validity is the log likelihood-ratio cost (Cllr), and the metric of reliability is an empirical estimate of credible intervals. A revised procedure for the calculation of credible intervals is introduced.
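
  • A minimal sketch of the validity metric, the log-likelihood-ratio cost (Cllr), computed from validation results; the LR values below are hypothetical.

    import numpy as np

    def cllr(same_source_lrs, different_source_lrs):
        ss = np.asarray(same_source_lrs, dtype=float)
        ds = np.asarray(different_source_lrs, dtype=float)
        # Penalises same-source LRs below 1 and different-source LRs above 1.
        # A system that always outputs LR = 1 has Cllr = 1; lower is better.
        return 0.5 * (np.mean(np.log2(1 + 1 / ss)) + np.mean(np.log2(1 + ds)))

    print(cllr([10, 100, 3, 0.5], [0.1, 0.01, 0.2, 2]))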


A comparison of procedures for the calculation of forensic likelihood ratios from acoustic-phonetic data: Multivariate kernel density (MVKD) versus Gaussian mixture model - universal background model (GMM-UBM).
Morrison G.S. (2011). Speech Communication, 53, 242–256.
http://dx.doi.org/10.1016/j.specom.2010.09.005



  • Matlab Code: MVKD_v_GMM-UBM 2010-02-19

  • Two procedures for the calculation of forensic likelihood ratios were tested on the same set of acoustic–phonetic data. One procedure was a multivariate kernel density procedure (MVKD) which is common in acoustic–phonetic forensic voice comparison, and the other was a Gaussian mixture model–universal background model (GMM–UBM) which is common in automatic forensic voice comparison. The data were coefficient values from discrete cosine transforms fitted to second-formant trajectories of /aI/, /eI/, /ou/, /au/, and /OI/ tokens produced by 27 male speakers of Australian English. Scores were calculated separately for each phoneme and then fused using logistic regression. The performance of the fused GMM–UBM system was much better than that of the fused MVKD system, both in terms of accuracy (as measured using the log-likelihood-ratio cost, Cllr) and precision (as measured using an empirical estimate of the 95% credible interval for the likelihood ratios from the different-speaker comparisons).


An issue in the calculation of logistic-regression calibration and fusion weights for forensic voice comparison.
Morrison G.S., Thiruvaran T., Epps J. (2010). Proceedings of the 13th Australasian International Conference on Speech Science and Technology, Melbourne, 74–77.

  • Logistic regression is a popular procedure for calibration and fusion of likelihood ratios in forensic voice comparison and automatic speaker recognition. The availability of multiple recordings of each speaker in the database used for calculation of calibration/fusion weights allows for different procedures for calculating those weights. Two procedures are compared, one using pooled data and the other using mean values from each speaker-comparison pair. The procedures are tested using an acoustic-phonetic and an automatic forensic-voice-comparison system. The mean procedure has a tendency to result in better accuracy, but the pooled procedure always results in better precision of the likelihood-ratio output.
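
  • A sketch of the difference between the two weight-calculation procedures compared in the paper: the pooled procedure passes every score to the calibration/fusion model, while the mean procedure first collapses the scores from each speaker-comparison pair to a single mean value. The data layout is hypothetical.

    import numpy as np

    def mean_per_pair(scores, pair_ids):
        # One training score per speaker-comparison pair.
        scores, pair_ids = np.asarray(scores, dtype=float), np.asarray(pair_ids)
        return np.array([scores[pair_ids == p].mean() for p in np.unique(pair_ids)])

    scores   = [1.9, 2.3, 2.1, -0.5, -0.9]         # several scores per pair
    pair_ids = ["A-B", "A-B", "A-B", "C-D", "C-D"]
    print(mean_per_pair(scores, pair_ids))         # [ 2.1 -0.7]
    # The pooled procedure would instead pass `scores` directly to the
    # logistic-regression weight calculation.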


Forensic voice comparison.
Morrison G.S. (2010). In I. Freckelton, & H. Selby (Eds.), Expert Evidence (Ch. 99). Sydney, Australia: Thomson Reuters.

  • As part of the Expert Evidence series the 100-page Forensic Voice Comparison chapter is aimed first at lawyers, judges, police officers, and potential jury members; however, it is hoped that this chapter will also be of interest to forensic scientists, phoneticians / speech scientists, speech-processing engineers, and students of all these disciplines. It introduces forensic voice comparison in a relatively non-technical way, assuming a reader who has no prior knowledge of the subject. The focus is on the understanding of concepts and the provision of basic knowledge.
    • “Morrison has a very nice writing style and I think he has phrased some of the fundamental matters in a way that is more clearly put than I have ever seen. I think he has done a masterly job.”
      • Dr John S. Buckleton, Principal Scientist, ESR Forensics, Auckland, New Zealand
    • “It is very informative and at the same time easy to read – a rare combination. It’s a great book.”
      • Dr Michael Jessen, Senior Scientist, Department of Speaker and Audio Analysis, Federal Criminal Police Office, Wiesbaden, Germany


Estimating the precision of the likelihood-ratio output of a forensic-voice-comparison system.
Morrison G.S., Thiruvaran T., Epps J. (2010). Proceedings of Odyssey 2010: The Speaker and Language Recognition Workshop, Brno, 63–70.

  • Matlab Code: CI_calcs 2011-03-30.

  • The issues of validity and reliability are important in forensic science. Within the likelihood-ratio framework for the evaluation of forensic evidence, the log-likelihood-ratio cost (Cllr) has been applied as an appropriate metric for evaluating the accuracy of the output of a forensic-voice-comparison system, but there has been little research on developing a quantitative metric of precision. The present paper describes two procedures for estimating the precision of the output of a forensic-comparison system, a non-parametric estimate and a parametric estimate of its 95% credible interval. The procedures are applied to estimate the precision of a basic automatic forensic-voice-comparison system presented with different amounts of questioned-speaker data. The importance of considering precision is discussed.
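
  • A sketch of the two styles of precision estimate described above, applied to repeated likelihood-ratio calculations for the same comparison (values hypothetical; the paper’s procedures are more involved):

    import numpy as np

    log10_lrs = np.log10([8.0, 12.0, 9.5, 15.0, 7.0, 11.0])  # repeated estimates

    nonparametric = np.percentile(log10_lrs, [2.5, 97.5])     # percentile interval
    m, sd = log10_lrs.mean(), log10_lrs.std(ddof=1)
    parametric = (m - 1.96 * sd, m + 1.96 * sd)               # Gaussian in log-LR domain
    print(nonparametric, parametric)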


An empirical estimate of the precision of likelihood ratios from a forensic-voice-comparison system.
Morrison G.S., Zhang C., Rose P. (2011). Forensic Science International, 208, 59–65.
http://dx.doi.org/10.1016/j.forsciint.2010.11.001



  • An acoustic–phonetic forensic-voice-comparison system was constructed using the time-averaged formant values of tokens of 61 male Chinese speakers’ /i/, /e/, and /a/ monophthongs as input. Likelihood ratios were calculated using a multivariate kernel density formula. A separate set of likelihood ratios was calculated for each vowel phoneme, and these were then fused and calibrated using linear logistic regression. The system was tested via cross-validation. The validity and reliability of the results were assessed using the log-likelihood-ratio-cost function (Cllr, a measure of accuracy) and an empirical estimate of the credible interval for the likelihood ratios from different-speaker comparisons (a measure of precision). The credible interval was calculated on the basis of two independent pairs of samples for each different-speaker comparison pair.


Comparación forense de la voz y el cambio de paradigma.
Morrison G.S. (2011). CSIC/UIMP Posgrado Oficial en Estudios Fónicos Cuadernos de Trabajo, 1, 1–38.
[Translation by Curiá C. of: Morrison G.S. (2009). Forensic voice comparison and the paradigm shift. Science & Justice, 49, 298–308.]

  • We are in the midst of a paradigm shift in the sciences concerned with forensic voice comparison. The new paradigm can be characterised as a quantitative implementation of the likelihood-ratio framework together with quantitative evaluation of the validity and reliability of results. During the 1990s the new paradigm was widely adopted for DNA profile comparison, and it has gradually spread to other branches of forensic science, including forensic voice comparison. The present article first describes the new paradigm and then recounts the history of its adoption in forensic voice comparison over the last decade. The paradigm shift is still incomplete, and those working within the new paradigm still represent a minority in the forensic-voice-comparison community.


Forensic voice comparison and the paradigm shift.
Morrison G.S. (2009). Science & Justice, 49, 298–308.
http://dx.doi.org/10.1016/j.scijus.2009.09.002



  • We are in the midst of a paradigm shift in the forensic comparison sciences. The new paradigm can be characterised as quantitative data-based implementation of the likelihood-ratio framework with quantitative evaluation of the reliability of results. The new paradigm was widely adopted for DNA profile comparison in the 1990s, and is gradually spreading to other branches of forensic science, including forensic voice comparison. The present paper first describes the new paradigm, then describes the history of its adoption for forensic voice comparison over approximately the last decade. The paradigm shift is incomplete and those working in the new paradigm still represent a minority within the forensic-voice-comparison community.


Comments on Coulthard & Johnson’s (2007) portrayal of the likelihood-ratio framework.
Morrison G.S. (2009). Australian Journal of Forensic Sciences, 41, 155–161.
http://dx.doi.org/10.1080/00450610903147701



  • In their recent introduction to forensic linguistics, Coulthard & Johnson (2007) include a portrayal of the likelihood-ratio framework for the evaluation of forensic comparison evidence (pp. 203–207). This portrayal includes a number of inaccuracies. The present letter attempts to correct these inaccuracies.


A response to the UK position statement on forensic speaker comparison.
Rose P., Morrison G.S. (2009). International Journal of Speech, Language and the Law, 16, 139–163.
http://dx.doi.org/10.1558/ijsll.v16i1.139




Likelihood-ratio-based forensic speaker comparison using parametric representations of vowel formant trajectories.
Morrison G.S. (2009). Journal of the Acoustical Society of America, 125, 2387–2397.
http://dx.doi.org/10.1121/1.3081384



  • Non-contemporaneous speech samples from 27 male speakers of Australian English were compared in a forensic likelihood-ratio framework. Parametric curves (polynomials and discrete cosine transforms) were fitted to the formant trajectories of the diphthongs /aI/, /eI/, /oU/, /aU/, and /OI/. The estimated coefficient values from the parametric curves were used as input to a generative multivariate-kernel-density formula for calculating likelihood ratios expressing the probability of obtaining the observed difference between two speech samples under the hypothesis that the samples were produced by the same speaker versus under the hypothesis that they were produced by different speakers. Cross-validated likelihood-ratio results from systems based on different parametric curves were calibrated and evaluated using the log-likelihood-ratio cost function (Cllr). The cross-validated likelihood ratios from the best-performing system for each vowel phoneme were fused using logistic regression. The resulting fused system had a very low error rate, thus meeting one of the requirements for admissibility in court.
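
The two families of parametric curves named in the abstract can be illustrated as follows (a minimal Python sketch; the trajectory values and the choice of orders are invented for illustration, not data from the paper):

    import numpy as np
    from scipy.fft import dct

    # Invented F2 trajectory (Hz) sampled at equally spaced time points.
    f2 = np.array([1100.0, 1250.0, 1500.0, 1800.0, 2000.0])
    t = np.linspace(0.0, 1.0, len(f2))

    # Polynomial parameterisation: fit a cubic; its four coefficients
    # become the features describing the shape of the trajectory.
    poly_coeffs = np.polynomial.polynomial.polyfit(t, f2, deg=3)

    # Discrete-cosine-transform parameterisation: keep the first few
    # DCT coefficients as the features.
    dct_coeffs = dct(f2, norm='ortho')[:4]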


Automatic-type calibration of traditionally derived likelihood ratios: Forensic analysis of Australian English /o/ formant trajectories.
Morrison G.S., Kinoshita Y. (2008). Proceedings of Interspeech 2008 (pp. 1501–1504). International Speech Communication Association.

  • A traditional-style phonetic-acoustic forensic-speaker-recognition analysis was conducted on Australian English /o/ recordings. Different parametric curves were fitted to the formant trajectories of the vowel tokens, and cross-validated likelihood ratios were calculated using a single-stage generative multivariate kernel density formula. The outputs of different systems were compared using Cllr, a metric developed for automatic speaker recognition, and the cross-validated likelihood ratios were calibrated using a procedure developed for automatic speaker recognition. Calibration ameliorated some likelihood-ratio results which had offered strong support for a contrary-to-fact hypothesis.
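
The calibration procedure referred to above learns a scale and a shift that map uncalibrated scores to calibrated log likelihood ratios. A simplified Python sketch (function and variable names are mine, and scikit-learn stands in for the Matlab tools actually used; weak regularisation approximates an unregularised fit):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def calibrate(train_scores, train_labels, test_scores):
        # train_labels: 1 = same speaker, 0 = different speakers.
        # Learn weight a and offset b so that a * score + b is an
        # approximately calibrated log LR (assuming equal numbers of
        # same- and different-speaker training comparisons, so that
        # posterior log odds approximate log likelihood ratios).
        model = LogisticRegression(C=1e6)
        model.fit(np.reshape(train_scores, (-1, 1)), train_labels)
        a, b = model.coef_[0, 0], model.intercept_[0]
        return a * np.asarray(test_scores) + b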


Forensic speaker recognition of Chinese /i/ and /y/ using likelihood ratios.
Zhang C., Morrison G.S., Rose P. (2008). Proceedings of Interspeech 2008 (pp. 1937–1940). International Speech Communication Association.

  • A likelihood-ratio-based forensic speaker discrimination was conducted using the mean formant frequencies of Standard Chinese /i/ and /y/ tokens produced by 64 male speakers. The speech data were relatively forensically realistic in that they were relatively extemporaneous, were recorded over the telephone, and were from three non-contemporaneous recording sessions. A multivariate-kernel-density formula was used to calculate cross-validated likelihood ratios comparing all possible same-speaker and different-speaker combinations across sessions. Results were comparable with those previously obtained with laboratory speech in other languages. In general, greater strength of evidence was obtained for recording sessions separated by one week than for recording sessions separated by one month.


Forensic voice comparison using likelihood ratios based on polynomial curves fitted to the formant trajectories of Australian English /aI/.
Morrison G.S. (2008). International Journal of Speech, Language and the Law, 15, 249–266.
http://dx.doi.org/10.1558/ijsll.v15i2.249

Incorrect versions of Figures 3 and 4 were printed in the paper version. These have been corrected in the online version.



  • Earlier studies have indicated that information regarding speaker identity can be extracted from the dynamic spectral properties of diphthongs. Some studies have conducted likelihood-ratio analyses based on simple models of the dynamic formant properties of diphthongs (e.g., dual-target model), and others have used more sophisticated polynomial curve fitting models but have not conducted likelihood-ratio analyses. The present study examines the strength of evidence which can be produced by a likelihood-ratio analysis based on the coefficients of polynomial curves fitted to the formant trajectories of Australian English /aI/ tokens. A cubic polynomial model offers a substantial improvement over the dual-target model.



Speech
Perception
&
Acoustic
Phonetics


Vowel inherent spectral change.
Morrison G.S., Assmann P.F. (Eds.) (2013). Heidelberg, Germany: Springer-Verlag. ISBN: 978-3-642-14208-6 (print) / 978-3-642-14209-3 (online).
http://dx.doi.org/10.1007/978-3-642-14209-3


Theories of vowel inherent spectral change: A review.
Morrison G.S. (2013). In Morrison G.S., Assmann P.F. (Eds.) Vowel inherent spectral change (pp. 31–47). Heidelberg, Germany: Springer-Verlag.
http://dx.doi.org/10.1007/978-3-642-14209-3_3




Perception of natural vowels by monolingual Canadian-English, Mexican-Spanish, and Peninsular-Spanish listeners.
Morrison G.S. (2012). Canadian Acoustics, 40(4), 29–39.
http://jcaa.caa-aca.ca/index.php/jcaa/article/view/2580


Analysis of categorical response data: Use logistic regression rather than endpoint-difference scores or discriminant analysis (L).
Morrison G.S., Kondaurova M.V. (2009). Journal of the Acoustical Society of America, 126, 2159–2162.
http://dx.doi.org/10.1121/1.3216917
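
To make the letter’s recommendation concrete, here is a minimal Python sketch (stimulus values, responses, and names are invented for illustration) of fitting a logistic regression to categorical response data, from which relative cue weights can be read off the fitted coefficients:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Invented stimuli: a spectral cue and a duration cue (columns),
    # with listeners' binary /i/-versus-/I/ responses.
    cues = np.array([[2.0, 60.0], [2.0, 120.0], [3.0, 60.0], [3.0, 120.0],
                     [2.5, 60.0], [2.5, 120.0], [3.5, 60.0], [3.5, 120.0]])
    responses = np.array([0, 0, 1, 1, 0, 1, 1, 1])

    model = LogisticRegression().fit(cues, responses)
    # Each coefficient quantifies the contribution of one cue to the
    # categorisation decision (standardise the cues before comparing).
    print(model.coef_, model.intercept_)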




L1-Spanish speakers’ acquisition of the English /i/–/I/ contrast II: Perception of vowel inherent spectral change.
Morrison G.S. (2009). Language & Speech, 52, 437–462.
http://dx.doi.org/10.1177/0023830909336583



  • supplementary material


Perception of synthetic vowels by monolingual Canadian-English, Mexican-Spanish, and Peninsular-Spanish listeners.
Morrison G.S. (2008). Canadian Acoustics, 36(4), 17–23.
http://jcaa.caa-aca.ca/index.php/jcaa/article/view/2100


L1-Spanish speakers’ acquisition of the English /i/–/I/ contrast: Duration-based perception is not the initial developmental stage.
Morrison G.S. (2008). Language & Speech, 51, 285–315.
http://dx.doi.org/10.1177/0023830908099067



  • supplementary material


Complexity of acoustic-production-based models of speech perception.
Morrison G.S. (2008). Proceedings of Acoustics’08 (pp. 2369–2374). Paris: Société Française d’Acoustique.


Comment on “A geometric representation of spectral and temporal vowel features: Quantification of vowel overlap in three linguistic varieties” [J. Acoust. Soc. Am. 119, 2334–2350 (2006)] (L).
Morrison G.S. (2008). Journal of the Acoustical Society of America, 123, 37–40.
http://dx.doi.org/10.1121/1.2804633

  • prepublication version including footnotes which were removed from the final version because of space constraints

  • Matlab function

  • Matlab function also available at American Institute of Physics Electronic Physics Auxiliary Publication Service (EPAPS): E-JASMAN-123-001801

  • The following article may also be of interest:
    Ghorshi S., Vaseghi S., Yan Q. (2008). Cross-entropic comparison of formants of British, Australian and American English accents. Speech Communication, 50, 564–579. http://dx.doi.org/10.1016/j.specom.2008.03.013


Logistic regression modelling for first- and second-language perception data.
Morrison G.S. (2007). In Solé M. J., Prieto P., Mascaró J. (Eds.), Segmental and prosodic issues in Romance phonology (pp. 219–236). Amsterdam: John Benjamins.

A multinomial logistic regression function is now available in the Matlab Statistics Toolbox. I have provided versions of some of the sample software making use of this function. T. M. Nearey’s software allows for more control over the specification of the logistic regression model; in particular, it allows one to specify diphone-biased models. Matlab is required to run the software. Zipped files which include Nearey’s software are password protected. Contact me to get the password. See also the logistic regression software listed under Databases & Software below.


A cross-dialect comparison of Peninsula- and Peruvian-Spanish vowels.
Morrison G.S., Escudero P. (2007). Proceedings of the 16th International Congress of Phonetic Sciences: Saarbrücken 2007.


Testing theories of vowel inherent spectral change.
Morrison G.S., Nearey T.M. (2007). Journal of the Acoustical Society of America, 122, EL15–EL22.
http://dx.doi.org/10.1121/1.2739111


L1 & L2 production and perception of English and Spanish vowels: A statistical modelling approach.
Morrison G.S. (2006). Doctoral dissertation, University of Alberta, Edmonton, Alberta, Canada.
Stable URL: http://geoff-morrison.net/#_2006PhD


A cross-language vowel normalisation procedure.
Morrison G.S., Nearey T.M. (2006). Canadian Acoustics, 34(3), 94–95.
[Proceedings of the Canadian Acoustical Association Conference 2006, Halifax, Nova Scotia, Canada]


An adaptive sampling procedure for speech perception experiments.
Morrison G.S. (2006). In Proceedings of the Ninth International Conference on Spoken Language Processing: Interspeech 2006 — ICSLP, Pittsburgh, Pennsylvania, USA (pp. 857–860). Bonn, Germany: ISCA.


Methodological issues in L2 perception research, and vowel spectral cues in Spanish listeners’ perception of word-final /t/ and /d/ in Spanish.
Morrison G.S. (2006). In Diaz-Campos M. (Ed.), Selected Proceedings of the 2nd Conference on Laboratory Approaches to Spanish Phonetics and Phonology (pp. 35–47). Somerville, MA: Cascadilla Proceedings Project.


An appropriate metric for cue weighting in L2 speech perception: Response to Escudero & Boersma (2004).
Morrison G.S. (2005). Studies in Second Language Acquisition, 27, 597–606.
https://doi.org/10.1017/S0272263105050266


Dat is what the PM said: A quantitative analysis of Prime Minister Chrétien’s pronunciation of English voiced dental fricatives.
Morrison G.S. (2005). Cahiers linguistiques d’Ottawa, 33, 1–21. Ottawa, Ontario: University of Ottawa, Department of Linguistics.


[Review of Phonetically based phonology by B. Hayes, R. Kirchner, & D. Steriade (Eds.)].
Morrison G.S. (2005). Linguist List, 16, 1400.


Principles for a quantitative speech learning model.
Morrison G.S. (2005, November). Poster presented at the Workshop on Models of L1 and L2 Phonetics/Phonology, Utrecht, The Netherlands.


Towards a Quantitative Speech Learning Model.
Morrison G.S. (2005, May). Poster presented at the 1st ASA Workshop on Second Language Speech Learning, Vancouver, British Columbia, Canada.


An acoustic and statistical analysis of Spanish mid-vowel allophones.
Morrison G.S. (2004). Estudios de Fonética Experimental, 13, 11–37.


Perception and production of Spanish vowels by English speakers.
Morrison G.S. (2003). In Solé M.J., Recasens D., Romero J. (Eds.), Proceedings of the 15th International Congress of Phonetic Sciences: Barcelona 2003 (pp. 1533–1536). Adelaide, South Australia: Causal Productions.


Spanish listeners’ use of vowel spectral properties as cues to post-vocalic consonant voicing in English.
Morrison G.S. (2002). In Collected Papers of the First Pan-American/Iberian Meeting on Acoustics. Mexico, DF: Mexican Institute of Acoustics.


Effects of L1 duration experience on Japanese and Spanish listeners’ perception of English high front vowels.
Morrison G.S. (2002). Master’s thesis, Simon Fraser University, Burnaby, British Columbia, Canada.
Stable URL: http://geoff-morrison.net/#_2002MA


Perception of English /i/ and /I/ by Japanese & Spanish listeners: Longitudinal results.
Morrison G.S. (2002). In Morrison G.S., Zsoldos L. (Eds.), Proceedings of the North West Linguistics Conference 2002 (pp. 29–48). Burnaby, BC: Simon Fraser University Linguistics Graduate Student Association.


Japanese listeners’ use of duration cues in the identification of English high front vowels.
Morrison G.S. (2002). In Larson J., Paster M. (Eds.), Proceedings of the 28th Annual Meeting of the Berkeley Linguistics Society (pp. 189–200). Berkeley, CA: Berkeley Linguistics Society.

  • The thesis and two papers above use discriminant analysis to model perception data; a more appropriate technique would be logistic regression.


Perception of English /i/ and /I/ by Japanese listeners.
Morrison G.S. (2002). In Oh S., Sawai N., Shiobara K., Wojak R. (Eds.), University of British Columbia Working Papers in Linguistics Volume 8: Proceedings of NWLC 2001: Northwest Linguistics Conference (pp. 113–131). Vancouver, BC: University of British Columbia, Department of Linguistics.


[Review of The sounds of language: An introduction to phonetics by H. Rogers].
Morrison G.S. (2001, June). TEAL News, 23–25, 28.


Databases
&
Software


Links to software associated with particular papers are located alongside the references to those papers.

If you use my software for research, please let me know and give me credit in published papers.


E3 database of 3D images of fired cartridge cases.
Bolton-King R.S., Morrison G.S., Basu N., Zhang X.A. (2022). NIST Ballistics Toolmark Research Database.


Forensic database of voice recordings of 500+ Australian English speakers.
Morrison G.S., Zhang C., Enzinger E., Ochoa F., Bleach D., Johnson M., Folkes B.K., De Souza S., Cummins N., Chow D., Szczekulska A. (2015/2021/2022).


Forensic database of audio recordings of 68 female speakers of Standard Chinese.
Zhang C., Morrison G.S. (2011).


Presentation timer
Matlab code providing a countdown, in minutes and seconds, of the presentation period and the question period. [Software release 2012-12-02]
Stable URL: http://geoff-morrison.net/#Timer


Credible interval calculation
Stable URL: http://geoff-morrison.net/#CredInt


Sound file cutter upper.
Morrison G.S. (2010). [Software release 2010-12-02]
Stable URL: http://geoff-morrison.net/#CutUp


train_llr_fusion_robust.m
Morrison G.S. [Software release 2009-07-02]
Stable URL: http://geoff-morrison.net/#TrainFus

Robust version of train_llr_fusion.m from Niko Brümmer’s FoCal Toolbox. This version can handle cases of complete separation.
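
A sketch of the idea in Python (not the Matlab code itself; names and the penalty strength are my own): fusion is a logistic regression over the component systems’ scores, and a mild L2 penalty keeps the weights finite when the training scores are completely separable, the case in which unpenalised logistic regression diverges:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_llr_fusion(scores, labels):
        # scores: (n_comparisons, n_subsystems) array of log LRs from
        # the systems to be fused; labels: 1 = same speaker, 0 = different.
        # A finite C (inverse penalty strength) keeps the fusion weights
        # bounded even under complete separation.
        model = LogisticRegression(C=100.0).fit(scores, labels)
        return lambda s: model.decision_function(np.atleast_2d(s))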


Logistic regression software for speech perception data.
Morrison G.S. (2009). [Software release 2009-03-13].
Stable URL: http://geoff-morrison.net/#LogReg2009

Includes data to run analyses from Morrison & Kondaurova (2009).


FormantMeasurer: Software for efficient human-supervised measurement of formant trajectories.
Morrison G.S.,  Nearey T.M. (2011). [Software release 2011-05-26].
Stable URL: http://geoff-morrison.net/#FrmMes


SoundLabeller: Ergonomically designed software for marking and labelling portions of sound files.
Morrison G.S. (2012). [Software release 2012-07-30].
Stable URL: http://geoff-morrison.net/#SndLbl


Acoustic recording software for speech production experiments.
Morrison G.S. (2008). [Software release 2008-12-23].
Stable URL: http://geoff-morrison.net/#RecSft


Matlab function which draws Tippett plots.
Morrison G.S. (2008, 2009). [Software release 2009-11-18].
Stable URL: http://geoff-morrison.net/#TipPlt
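
Conventions for Tippett plots vary; a common version plots, for each value on the x axis, the cumulative proportion of same-origin and of different-origin log10 likelihood ratios at or above that value. A minimal Python/matplotlib sketch (not the Matlab function itself):

    import numpy as np
    import matplotlib.pyplot as plt

    def tippett_plot(llr_same, llr_diff):
        llr_same, llr_diff = np.asarray(llr_same), np.asarray(llr_diff)
        grid = np.linspace(min(llr_same.min(), llr_diff.min()) - 0.5,
                           max(llr_same.max(), llr_diff.max()) + 0.5, 500)
        # Proportion of comparisons whose log10 LR equals or exceeds
        # each grid value, for each of the two sets of comparisons.
        plt.plot(grid, [np.mean(llr_same >= g) for g in grid],
                 label='same origin')
        plt.plot(grid, [np.mean(llr_diff >= g) for g in grid],
                 label='different origin')
        plt.xlabel('log10 likelihood ratio')
        plt.ylabel('cumulative proportion')
        plt.legend()
        plt.show()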


Matlab implementation of Aitken & Lucy’s (2004) forensic likelihood-ratio software using multivariate-kernel-density estimation.
Morrison G.S. (2007). [Software].
Stable URL: http://geoff-morrison.net/#MVKD
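
Aitken & Lucy’s (2004) formula is a two-level multivariate model accounting for within-source and between-source variation, and is not reproduced here. The following deliberately simplified univariate sketch (all names mine) conveys only the general shape of a kernel-density likelihood ratio: a same-source likelihood in the numerator and a kernel-density model of the relevant population in the denominator:

    import numpy as np
    from scipy.stats import norm, gaussian_kde

    def simple_kd_lr(questioned, known, background):
        # questioned, known: measurements from the two samples being
        # compared; background: measurements from the relevant population.
        q = np.mean(questioned)
        # Numerator: likelihood of the questioned value under a normal
        # model of the known source.
        num = norm.pdf(q, loc=np.mean(known), scale=np.std(known, ddof=1))
        # Denominator: likelihood under a kernel-density model of the
        # between-source (population) distribution.
        den = gaussian_kde(np.asarray(background))(q)[0]
        return num / den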



Popular
Press


Confidential document reveals key human role in gunshot tech
Burke G., Tarm M. (2023-01-20). AP News.

Automatic speaker recognition technology outperforms human listeners in the courtroom
Press release 2022-11-02

La ficción criminal: una perspectiva filosófica [Crime fiction: a philosophical perspective]
2022-07-15
Series: Filosofía de la ciencia [Philosophy of science]
La UNED en La 2 de TVE
Links to the video:
RTVE, UNED, YouTube

Interpol’s new software will recognize criminals by their voices
Dumiak M. (2018-05-16). IEEE Spectrum.

How to own a pool and like it
Leonido T. (2017-05-05). Triple Canopy.

我知道“绑架者”是你 | 罪案遗踪 [I know the “kidnapper” is you | Traces of crime]
Warren G. (2017, April). 知识分子 The Intellectual.

Voice analysis should be used with caution in court
Catanzaro M., Tola E., Hummel P., Viciano A. (2017-01-25). Scientific American.

Quando l’intercettazione è ambigua [When the wiretap is ambiguous]
Catanzaro M., Hummel P., Tola E., Viciano A. (2016-04-13). Le Inchieste - La Repubblica.

Forensic frontiers: Whose voice is that?
Rogers N. (2016-06-11). Science, Vol. 351, Issue 6278, p. 1140.
http://dx.doi.org/10.1126/science.351.6278.1140


Speech forensics: When Hollywood seldom mirrors real-life court cases
Catanzaro M., Viciano A., Hummel P., Tola E. (2015-12-06). Euro Scientist.


Congresso de Criminalística vai até quinta-feira, em Búzios [Criminalistics congress runs until Thursday, in Búzios]
Herrera R., Quintão R. (2015-11-09). RJTV 1ª Edição. TV Globo Rio de Janeiro. Television news report.


¿Qué hay en una voz? / ¿Què hi ha en una veu? [What’s in a voice?]
Catanzaro M., Viciano A., Hummel P., Tola E. (2015-11-08). Dominical (Sunday magazine of the newspaper El Periódico).


Real-life CSI: finding a voice in a haystack: New forensic method identifies voices to help nab criminals
Stirling B. (2015-11-02). New Trail.


Audio expert doubts Tory expert’s analysis
Bryden J. (2014-12-01). The Canadian Press.


A revolution in forensic voice comparison / Una revolución en la comparación forense del habla.
Morrison G.S. (2010-11). 2nd Pan-American/Iberian Meeting on Acoustics Lay Language Papers, Acoustical Society of America World Wide Press Room.


Sounds & Signals: Improving forensic voice comparison. (Podcast report based on an interview with Dr. Geoffrey Stewart Morrison).
Bard S. (2009-06-19). Science Update. Washington, DC: American Association for the Advancement of Science.

Report begins at time stamp 9:17


Forensic voice comparison – Reality not TV.
Morrison G.S. (2009-05). 157th Meeting Lay Language Papers, Acoustical Society of America World Wide Press Room.



Other


Teaching the classical Hebrew stem system: The binyanim
Morrison G.S. (1995). Master’s thesis, Vancouver School of Theology, Vancouver, BC, Canada.
Stable URL: http://geoff-morrison.net/#_1995MTS





Contact






Page last updated 2024-04-13










Any opinions expressed are those of the author and do not necessarily reflect the opinions or policies of any individuals or organisations with which he has been or is currently associated.