Frameworks and tools for statistical big data in the humanities

Project: Research

Description

Over the last forty or so years, data archives in most advanced countries have assembled very large collections of statistics, including historical data computerised by academic projects. However, these are almost all collections of DATASETS, mostly organised essentially as tables, and data archive catalogues describe each dataset only as a whole, while the detailed structure of the rows and columns is usually described only in highly variable documentation from the original creators. This makes it quite impossible to analyse the history of the UK as a whole by downloading the contents of the UK Data Archive as a whole; and archive search interfaces are designed for social scientists wanting to download datasets for detailed statistical analysis, not for humanists -- or the general public -- wanting answers to specific questions such as "how many domestic servants were there in Bath in 1841?". Even once you find the number you want, how exactly do you cite it?

Two big statistical database projects take a different approach. Firstly, the Great Britain Historical (GBH) GIS, accessed via the Vision of Britain web site, holds 14,099,469 DATA VALUES, covering most aspects of Britain's demographic, economic, social and political history over the last 200 years, each on a separate row of just one table, and each with information on date, geographical area covered and subject matter. Area and subject are recorded through linkage to formal ontologies -- hierarchical lists -- the former linking to computerised boundaries for the reporting units, so data can be both graphed as time series and mapped. Secondly, the US National Science Foundation have funded the Collaborative for Historical Information and Analysis (CHIA), led by the University of Pittsburgh and running from January 2013 until at least December 2015, to assemble historical statistics for the whole world since 1600 by creating an online infrastructure enabling contributors worldwide to insert their data into an information architecture modelled on the GBH GIS. Such "crowd-sourcing" using specialists is the only way really big historical data structures can be created.
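The "one value per row" design described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual GBH GIS schema: the class name, field names, identifiers and ontology entries are hypothetical, and the numbers are invented placeholders rather than real census figures.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class DataValue:
    """One statistical value carrying its own date, area and topic (hypothetical fields)."""
    year: int
    area_id: str
    topic_id: str
    value: float


# Hypothetical ontologies stored as child -> parent links (hierarchical lists).
AREA_PARENT: Dict[str, Optional[str]] = {"GB": None, "ENG": "GB", "BATH": "ENG"}
TOPIC_PARENT: Dict[str, Optional[str]] = {
    "WORK": None,
    "OCCUPATIONS": "WORK",
    "DOMESTIC_SERVANTS": "OCCUPATIONS",
}


def ancestors(node: str, parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Walk up a hierarchy from a node, nearest parent first."""
    chain: List[str] = []
    current = parent_of.get(node)
    while current is not None:
        chain.append(current)
        current = parent_of.get(current)
    return chain


def time_series(rows: List[DataValue], area_id: str, topic_id: str) -> List[Tuple[int, float]]:
    """Select every value for one area and topic, ordered by year."""
    return sorted(
        (r.year, r.value)
        for r in rows
        if r.area_id == area_id and r.topic_id == topic_id
    )


# Invented placeholder rows; a single flat list plays the role of the single table.
ROWS = [
    DataValue(1851, "BATH", "DOMESTIC_SERVANTS", 120.0),
    DataValue(1841, "BATH", "DOMESTIC_SERVANTS", 100.0),
    DataValue(1841, "ENG", "DOMESTIC_SERVANTS", 1000.0),
]

if __name__ == "__main__":
    print(time_series(ROWS, "BATH", "DOMESTIC_SERVANTS"))
    print(ancestors("BATH", AREA_PARENT))
```

Because every row is self-describing, a question about one place, year and topic becomes a simple filter over the table instead of a search through dataset documentation.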

The proposed research would directly fund the architect of the GBH GIS to play a larger role within CHIA, and recruit British and European contributors. However, the main aim here is not to make either the GBH GIS or the CHIA system even bigger, but to make it easier for humanists and the general public to find and work with the contents. This is a relatively small project with three main outputs. Firstly, a better statistical thesaurus or topic ontology -- if you like, a Dewey Decimal Classification for numbers -- providing a shared language for contributors and users. This will be based on an existing thesaurus developed by social scientists but with some broader headings added to help non-specialists get started, and many more variant terms added, enabling users to find out about "jobs" as well as "occupations". Secondly, a user-friendly interface for finding not "datasets" but particular numbers: specifying the area covered by clicking on a map, the period via a graphical timeline, and the topic by selecting progressively narrower terms within the "domain ontology". Where possible, it will let users drill down from simple totals to more detailed data. Thirdly, a separate interface for computers, not people, providing data values either individually or as a long stream, each with complete information about the date, area and topic. One purpose is simply to provide each data value with its own enduring web address (a "Uniform Resource Identifier"), enabling it to be cited with absolute precision and confidence. However, this "data feed" could also be used by search engines, like Google or Bing, to index our content and provide new forms of access; or it could be linked to large-scale analytic engines, enabling other researchers to actually analyse the history of Britain or the world as a whole.
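The third output, the interface for computers, might look roughly like the following sketch. Everything specific here is an assumption made for illustration: the base address `https://example.org/data-value/`, the record field names and the JSON-lines format are hypothetical and are not taken from the project itself.

```python
import json
from typing import Dict, Iterable, Iterator

# Hypothetical base address; the project does not specify its URI scheme here.
BASE_URI = "https://example.org/data-value/"


def value_uri(value_id: int) -> str:
    """Build an enduring, citable address for a single data value."""
    return f"{BASE_URI}{value_id}"


def to_record(row: Dict) -> Dict:
    """Attach the URI to a value together with its date, area and topic."""
    return {
        "uri": value_uri(row["id"]),
        "date": row["date"],
        "area": row["area"],
        "topic": row["topic"],
        "value": row["value"],
    }


def stream_feed(rows: Iterable[Dict]) -> Iterator[str]:
    """Yield one self-describing JSON line per data value (the 'long stream')."""
    for row in rows:
        yield json.dumps(to_record(row), sort_keys=True)


if __name__ == "__main__":
    # Invented placeholder row, not a real statistic.
    sample = [{"id": 1, "date": "1841", "area": "Bath", "topic": "Domestic servants", "value": 100}]
    for line in stream_feed(sample):
        print(line)
```

Serving a single value is then just a lookup by identifier, while the stream form suits search-engine indexing and bulk analysis; in both cases each record stands alone, so it can be cited without reference to any enclosing dataset.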
Status: Active
Effective start/end date: 1/01/14 → …

Funding

Award relations

Frameworks and tools for statistical big data in the humanities

Professor Humphrey Southall, Miss Paula Aucott & Westwood, J.

Arts and Humanities Research Council: £80,217.00

1/01/14 → 31/03/15

Award date: 16/12/13

Funding: R: Research Award

Relations

ID: 2457974