Oxford
Principal component analysis of turn-initial words in spoken interactions
The present study investigates turn-initial words in spoken interactions by using principal component analysis. The choice of turn initiators differs significantly in different settings. In formal settings like White House press conferences, the first person pronouns I and we are characteristically employed at the very beginning of utterances, while this tendency is not observed in less formal and more conversational settings. Other initiators relatively frequent in formal settings include: well, the, and no. Furthermore, our findings suggest that female speakers are inclined to link their utterances to previous speakers’ turns by using interpersonal initiators like vocative personal names (e.g. John, Gary, and David). On the whole, however, the factor of gender is much less significant than the factor of the setting.
Translation Style and Ideology: a Corpus-assisted Analysis of two English Translations of Hongloumeng
Hongloumeng by Xueqin Cao (Hsueh-ch‘in Ts'ao) is generally considered one of the greatest classical Chinese novels. Of the nine published English translations known today, the one by Hawkes and Minford (The Story of the Stone, Penguin, 1973–86) and the one by Yang and Yang (A Dream of Red Mansions, Foreign Languages Press in Beijing, 1978–80) are the best known among translators and literary scholars. Over the years, both have been carefully scrutinized and much critiqued. Translators and translation scholars have been engaged in heated debates over salient features of the translations, strategies employed by the translators, the possible effects of the two translations and so on [cf. Liu and Gu (1997) On translation of cultural contents in Hong Lou Meng [in Chinese]. Chinese Translators Journal, 1: 16–19; Wang (2001) A Comparative Study of the English Translations of Poetry in Hong Lou Meng. Xi’an: Shanxi Normal University Press; Feng (2006) On the Translation of Hong Lou Meng [in Chinese]. Shanghai: Shanghai Foreign Language Education Press; Liu (2008) Translating tenor: With reference to the English versions of Hong Lou Meng. Meta, 53(3): 528–48], with the eventual aim of determining which translation better captures the style of the original text or author. As in many debates of a similar nature, no definitive conclusions have been reached despite such intense interest. We believe a corpus-assisted examination [Baker, M. (2000). Towards a methodology for investigating the style of a literary translator. Target, 12(2): 241–66; Baker, M. (1993). Corpus linguistics and translation studies: Implications and applications. In Gill, F., Baker, M., and Tognini-Bonelli, E. (eds), Text and Technology: In Honour of John Sinclair. Amsterdam: Benjamins, pp. 233–50] of the two translations will provide a more convincing analysis and can better describe the differences in style between these two famous translations.
A particular effort is further made to interpret the reasons for the different strategies adopted by the two different pairs of translators in the social, political, and ideological context of the translations.
Variation in noun and pronoun frequencies in a sociohistorical corpus of English
Many corpus linguists make the tacit assumption that part-of-speech frequencies remain constant during the period of observation. In this article, we will consider two related issues: (1) the reliability of part-of-speech tagging in a diachronic corpus and (2) shifts in tag ratios over time. The purpose is both to serve the users of the corpus by making them aware of potential problems, and to obtain linguistically interesting results. We use noun and pronoun ratios as diagnostics indicative of opposing stylistic tendencies, but we are also interested in testing whether any observed variation in the ratios could be accounted for in sociolinguistic terms. The material for our study is provided by the Parsed Corpus of Early English Correspondence (PCEEC), which consists of 2.2 million running words covering the period 1415–1681. The part-of-speech tagging of the PCEEC has its problems, which we test by reannotating the corpus according to our own principles and comparing the two annotations. While there are quite a few changes, the mean percentage of change is very small for both nouns and pronouns. As for variation over time, the mean frequency of nouns declines somewhat, while the mean frequency of pronouns fluctuates with no clear diachronic trend. However, women consistently use more pronouns than men, while men use more nouns than women. More fine-grained distinctions are needed to uncover further regularities and possible reasons for this variation.
Well Connected to Your Digital Object? E-Curator: A Web-based e-Science Platform for Museum Artefacts
This article describes the development of a new virtual research tool for the Arts and Humanities community. The E-Curator project led by Museums and Collections at University College London took a practical, multidisciplinary approach to traceable storage and transmission of three-dimensional (3D) laser scan data sets. The objective was to establish protocols for retrievable data acquisition and processing to facilitate remote web-based access to museum e-artefacts and thereby enhance international scholarship. An Internet-capable 3D visualization tool was designed, using state-of-the-art colour laser scanning technology for digitizing museum objects in combination with a data storage and retrieval solution developed for e-science (Storage Resource Broker). The prototype was developed in discussion with a team of museum curators and conservators who were able to compare the handling of a range of real objects with their virtual copies on-screen. This article explores two case studies of objects recorded with an Arius3D colour laser scanner and a handheld Metris K-Scan laser scanner to illustrate the 3D recording methodology and to highlight how the developed system can complement traditional cataloguing and analysis methods for museum artefacts and enable digital repatriation. Anthropological research based on observations from the E-Curator project also discusses the production, reception, and circulation of 3D digital objects and the networked technology of the digital image.
Getting beyond the common denominator
The impact of technology on the humanities, arts, and social sciences (HASS) is already profound and will only escalate as time goes on. The challenge that HASS scholars face is: will they create the world they are in the process of entering or is this world going to be created for and imposed on them? If the latter, progress in HASS will certainly be impeded; if the former, then cyberinfrastructure in the HASS will be created in the image of those who best understand what they require. This article explores the question of how cyberinfrastructure for the HASS might facilitate the development of digital technologies and applications that will meet the specialized needs of scholarship. Formidable barriers that exist are discussed—especially those peculiar to humanities and social sciences—that impede the development of specialized technologies and applications. Also addressed is the interplay between various players directly involved with developing specialized technologies and applications tailored for the HASS. Finally, changes in academic perspective are discussed that may result in transformations in cyberinfrastructure to improve the likelihood of success in creating specialized technologies and applications. The authors’ observations are based primarily on lessons learned in the development and deployment of the InscriptiFact Image Database along with experiences with InscriptiFact’s sister archival imaging project, the West Semitic Research Project, both at the University of Southern California.
Papyrological investigations: transferring perception and interpretation into the digital world
Deciphering ancient and damaged documents is a complex investigative task that papyrologists routinely undertake to extract meaning from the script. Perception and interpretation play an essential role. In this article, we present methods for transferring to the digital world some of the processes that experts draw upon when interpreting a text, with the ultimate aim of constructing an Interpretation Support System (ISS) for papyrologists. Image-capture and image-processing approaches that reflect real-world perceptual processes have been implemented. In addition, we propose an expansion of a previously built model of papyrological reading and transcription. We make explicit some of the implicit processes involved in an interpretation effort, using an example where papyrologists developed hypotheses for the identification of a puzzling letter form. Two distinct yet not mutually exclusive approaches to the interpretation task have been identified: the kinaesthetic/palaeographical strategy and the cruciverbalistic/philological strategy. The ISS will have to facilitate both approaches. Mechanisms triggering the emergence of working hypotheses of interpretation, which we call percepts, have also been pinpointed; they include skilled vision, scholarly expectations, aspect shifting and local-global oscillations. Working hypotheses being triggered by such mechanisms can then be exposed as an explicit network of sourced percepts; these mechanisms also confer a qualitative well-foundedness to the percepts and hence help us to retrace and assess the rationale leading to a specific interpretation.
Comparing nearly identical treaty texts: a note on the Treaty of Fort Laramie with Sioux, etc., 1851 and Levenshtein's edit distance metric
Vladimir Levenshtein’s edit distance algorithm is used to reveal disparities between the delimiter-stripped text of the Senate-amended Treaty of Fort Laramie with Sioux, etc., 1851, as corrected in a previous study, and the texts of other federal copies of this transaction. All of the latter deviated markedly from that newly created version, reflecting errors of exclusion, the absence of the Senate modification in some transcripts, editorial decisions made by Charles J. Kappler during the preparation of his treaty compilations at the beginning of the twentieth century, and spelling errors. These results confirm that the instrument has never until now been published in its complete formal state. This study may serve as a model for future text analyses that might benefit from the employment of Levenshtein’s metric.
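For readers unfamiliar with the metric, the following is a minimal Python sketch of Levenshtein's edit distance, computed with the usual row-by-row dynamic programme. The helper `strip_delimiters` is hypothetical: it assumes that "delimiter stripped" means removing whitespace and punctuation, which the abstract does not specify.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def strip_delimiters(text: str) -> str:
    # Assumption: delimiters are all non-alphanumeric characters.
    return "".join(ch for ch in text if ch.isalnum())
```

Two transcripts would then be compared with `levenshtein(strip_delimiters(copy_a), strip_delimiters(copy_b))`; a distance of zero means the stripped texts are identical.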
Visualization as a research tool for dialect geography using a geo-browser
Moving from a traditional dialect geography research methodology to one in which data are processed electronically, and where visualization is used as a research tool, can be of great benefit to dialect geography. A working environment offering full support for using visualization as a research tool could take dialect geography into the era of e-Science. Despite the advent of electronic data processing, electronic publishing and Geographic Information Systems (GIS), an analysis of the most important computerized tools for dialect geography research suggests that there is little support for the use of modern data mining and analysis techniques connected to visualization for the analysis and interpretation of dialect data. In this article, we use the electronic publication of two major dialect dictionaries to illustrate the value of visualization as a research tool by showing how visual data mining and combining dialect data with independent data sets applies to dialect geography research. We argue that there is no need for large-scale software development because visualization, as a research tool, is supported to a large extent by geo-browsers such as ‘Google Earth’, which make it possible to flexibly combine and visualize different types of geo-referenced data.
The effect of author set size and data size in authorship attribution
Applications of authorship attribution ‘in the wild’ [Koppel, M., Schler, J., and Argamon, S. (2010). Authorship attribution in the wild. Language Resources and Evaluation. Advanced Access published January 12, 2010:10.1007/s10579-009-9111-2], for instance in social networks, will likely involve large sets of candidate authors and only limited data per author. In this article, we present the results of a systematic study of two important parameters in supervised machine learning that significantly affect performance in computational authorship attribution: (1) the number of candidate authors (i.e. the number of classes to be learned), and (2) the amount of training data available per candidate author (i.e. the size of the training data). We also investigate the robustness of different types of lexical and linguistic features to the effects of author set size and data size. The approach we take is an operationalization of the standard text categorization model, using memory-based learning for discriminating between the candidate authors. We performed authorship attribution experiments on a set of three benchmark corpora in which the influence of topic could be controlled. The short text fragments of e-mail length present the approach with a true challenge. Results show that, as expected, authorship attribution accuracy deteriorates as the number of candidate authors increases and size of training data decreases, although the machine learning approach continues performing significantly above chance. Some feature types (most notably character n-grams) are robust to changes in author set size and data size, but no robust individual features emerge.
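As an illustration of the character n-gram feature type, the sketch below builds n-gram profiles and assigns a fragment to the candidate author whose stored text shares the most n-grams with it. This is a deliberately simplified nearest-neighbour stand-in, not the memory-based learner used in the study; the function names and the overlap measure are assumptions made for the example.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count every overlapping character n-gram in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def overlap(a: Counter, b: Counter) -> int:
    """Number of n-gram occurrences shared by two profiles."""
    return sum((a & b).values())  # & keeps the minimum count per n-gram


def attribute(fragment: str, training: dict, n: int = 3) -> str:
    """Return the candidate author whose training text is most similar.

    training maps an author label to that author's known text."""
    profile = char_ngrams(fragment, n)
    return max(
        training,
        key=lambda author: overlap(profile, char_ngrams(training[author], n)),
    )
```

Adding entries to `training` enlarges the author set, and shortening the stored texts reduces the data per author, which are the two parameters the study varies.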
Computerized Scansion of Ancient Greek Hexameter
A metrical system is the particular rhythm upon which a verse is structured. Classical Ancient Greek poetry demonstrates a wide variety of metrical systems, the most ancient one being the hexameter. Two of the longest and most famous poems of Ancient Greek poetry, namely the Iliad and the Odyssey, were composed in the hexameter by Homer. This system is named ‘hexameter’ because the verse is divided into six sections, and so is the rhythm of reciting it. Each section has a fixed scheme and can therefore contain only two or three syllables in a predefined combination. The aim of this project was the development of a program that automatically scans such a verse using the fewest possible computing and linguistic resources. The term ‘scansion’ denotes the discovery of the particular pattern of the metrical system of the verse: that is, in which positions of the verse the long syllables are located, in which positions the short ones, and how these syllables form the fixed scheme of every section of the verse. Words inside verses were carefully selected to conform to the above standards.
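The fixed scheme of the six sections can be made concrete with a short Python sketch. It relies on the standard textbook description of dactylic hexameter, which is background knowledge rather than a detail taken from the abstract: each of the first five feet is a dactyl (long, short, short) or a spondee (long, long), and the sixth foot has two syllables, the last of either quantity. The function name and the L/S/X notation are assumptions made for the example, and the sketch does not reproduce the program described above.

```python
from itertools import product

DACTYL = "LSS"   # long, short, short
SPONDEE = "LL"   # long, long
FINAL_FOOT = "LX"  # long, then a syllable of either quantity (anceps)


def candidate_patterns(syllable_count: int) -> list:
    """List every hexameter pattern with the given number of syllables."""
    patterns = []
    for feet in product((DACTYL, SPONDEE), repeat=5):
        pattern = "".join(feet) + FINAL_FOOT
        if len(pattern) == syllable_count:
            patterns.append(pattern)
    return patterns
```

Counting the syllables of a verse therefore narrows the search considerably: a verse can have between 12 and 17 syllables, and a scanner only needs to decide which of the few patterns of the right length agrees with the known syllable quantities.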
Extended nearest shrunken centroid classification: A new method for open-set authorship attribution of texts of varying sizes
The nearest shrunken centroid (NSC) methodology, originally developed for high-dimensional genomics problems, was recently applied in a stylometric study. Although NSC has many advantages, stylometric problems usually differ from genomics problems in several important ways: texts are of a wide range of sizes, a large series of texts are often the subjects for classification, and most importantly the set of candidate authors cannot usually be assumed to be closed. Consequently, naïve application of NSC methodology can produce misleading results. We extend the NSC methodology for more general application to stylometry. Reanalysis of the Book of Mormon using the open-set NSC method produced dramatically different results from a closed-set NSC analysis.
Lexical bundles and German bibles
This article examines the use of lexical bundles—repeated word groups of differing lengths—in two German Bible translations and analyzes their use in relation to the comparative readability of the two texts. The texts included in this study are the four Gospels and the book of Acts in Martin Luther’s classic translation of the Bible and the modern translation Hoffnung für alle (Hfa). The study is both quantitative and qualitative in nature, looking at lexical bundle statistics in aggregate and at specific uses of lexical bundles in context. The results indicate that the older translation used many more lexical bundles, but types that were used by the newer version were used more effectively. Overall, however, the older version has the advantage in readability due to its greater use of lexical bundles.
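A lexical bundle in the sense used here, a repeated word group of a given length, can be extracted with a few lines of Python. The sketch below is illustrative only: the `min_count` threshold of two occurrences is an assumption, since the abstract does not state the frequency cut-off used in the study.

```python
from collections import Counter


def lexical_bundles(tokens: list, n: int, min_count: int = 2) -> dict:
    """Return every n-word sequence that occurs at least min_count times,
    mapped to its frequency."""
    grams = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return {gram: count for gram, count in grams.items() if count >= min_count}
```

Running the function for several values of `n` over the tokenized text of each translation gives the bundle types and frequencies that an aggregate comparison would start from.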
Automatically Extracting Typical Syntactic Differences from Corpora
We develop an aggregate measure of syntactic difference for automatically finding common syntactic differences between collections of text. With the use of this measure, it is possible to mine for differences between, for example, the English of learners and natives, or between related dialects. If formulated in advance, hypotheses can also be tested for statistical significance. It enables us to find not only absence or presence, but also under- and overuse of specific constructs. We have applied our measure to the English of Finnish immigrants in Australia to look for traces of Finnish grammar in their English. The outcomes of this detection process were analysed and found to be insightful. A report is included in this article. Besides explaining our method, we also go into the theory behind it, including permutation statistics, and the custom normalizations required for applying these tests to syntactical data. We also explain how to use the software we developed to apply this method to new corpora, and give some suggestions for further research.
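The permutation statistics mentioned above can be sketched as follows. The abstract does not give the test statistic or the normalizations, so this example assumes the simplest case: each text contributes one number (for instance, the relative frequency of a construct), and the statistic is the absolute difference between the two group means. All names are hypothetical.

```python
import random


def permutation_test(a, b, trials: int = 10000, seed: int = 0) -> float:
    """Estimate the p-value for the difference in means between groups a and b
    by repeatedly shuffling the group labels."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible

    def mean_gap(x, y):
        return abs(sum(x) / len(x) - sum(y) / len(y))

    observed = mean_gap(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        if mean_gap(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)
```

A small p-value indicates that a gap as large as the observed one rarely arises when texts are assigned to the two collections at random.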
The Regressive Imagery Dictionary: A test of its concurrent validity in English, German, Latin, and Portuguese
Since the 1970s, the Regressive Imagery Dictionary (RID) has been widely used as a content analysis tool for both psychological and literary research on texts. Today, besides the original English version, it exists in translations for seven other languages. However, the wide-ranging validation studies conducted on the English version have mostly not been replicated for the various translations, hence the validity of these translations must rest for the time being on their concurrent validity with the English original. This article examines the concurrent validity of the German, Latin, and Portuguese translations of the RID. Taking the English RID as a de facto standard, it uses translations of the psalms (N = 150) to check how far the three translations of the RID correspond to the English original in identifying whether there is a significant dominance of primary or secondary process lexis in a text. Overall, compared against the English version, the Latin translation has 77.33% accuracy, the German translation 68%, and the Portuguese translation 56.67%. In terms of the sensitivity and specificity of classification, the Latin translation performs quite well on both measures; in contrast, the German translation is conservative, whilst the Portuguese translation is liberal.
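The three measures reported above can be computed from paired classifications, as in this Python sketch. It assumes one binary decision per psalm, with the English RID supplying the reference label; the function name and the encoding of the labels as booleans are assumptions made for the example, and no data from the study are reproduced.

```python
def classification_scores(reference, predicted) -> dict:
    """Compare a translation's classifications with the reference ones.

    Both arguments are equal-length sequences of booleans, one per text."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)          # true positives
    tn = sum(1 for r, p in pairs if not r and not p)  # true negatives
    fp = sum(1 for r, p in pairs if not r and p)      # false positives
    fn = sum(1 for r, p in pairs if r and not p)      # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }
```

With N = 150, the reported accuracies of 77.33%, 68%, and 56.67% would correspond to agreement on 116, 102, and 85 psalms respectively. A conservative classifier in this sense has high specificity but low sensitivity, and a liberal one the reverse.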