Data & Text Analytics

Data Analytics

Organisations now hold massive quantities of data, and being able to extract knowledge from it in a timely and efficient way is key to maintaining competitive advantage.

Big data – very large and dynamic sets consisting of both structured and unstructured data – are now pervasive and growing rapidly. The information held by organisations is often stored in formats, such as documents or web pages, that make the data they contain difficult to access. Techniques for analysing, interpreting and accessing these large data sets using state-of-the-art analytics are a powerful asset for organisations.

The ACRC provides expertise on data and text analytics to assist organisations in making sense of their data. Areas of interest include the analysis of large data sets and the extraction of data from text. The ACRC builds upon over two decades of work on data and text analytics carried out at the Department of Computer Science in The University of Sheffield.

Big data: Opportunities and Challenges

  • Organisations now have access to unprecedented amounts of data: internal documents, spreadsheets, the World Wide Web, social media …
  • Data contains valuable information, such as customers and their behaviour, and the prediction of future trends
  • Extracting the information is challenging because of the range of formats (structured and unstructured), distributed data and data volume

ACRC Services

  • Help organisations extract valuable information from data by applying the latest data and text analytics techniques
  • Provide access to over two decades of experience in applications of analytics at The University of Sheffield

Areas of Expertise

  • Text analytics
  • Data analytics
  • Big data
  • Information Retrieval

Example Projects

Social Media Analytics

Sentiment Analysis:

Apply state-of-the-art natural language processing techniques to extract sentiment about brands, organisations, products and people from user-generated content in social media, e.g. Twitter.

Discovering Influential Users:

Identify users with high potential to become popular in social media by applying state-of-the-art machine learning techniques to profile and text features. (Lampos et al. 2014)

Big Data Analytics

Discover Hidden Topics:

Automatically identify topics being discussed in large document collections by applying topic modelling. (Aletras and Stevenson, 2013a; Aletras and Stevenson, 2013b; Aletras and Stevenson, 2014a; Aletras and Stevenson, 2014b)

Organise and Visualise Document Collections:

Enhance information access and provide exploratory search by representing large document collections using the latent topics discussed within them. (Agirre et al. 2013; Aletras et al. 2014c)

  • Content-based and Collaborative Product Recommendation: navigate document collections by identifying similar items automatically. (Aletras et al. 2012)
  • Information Extraction: identify structured information within documents (Stevenson and Greenwood, 2005) and analyse scientific publications (Nawab et al. 2013).
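
The idea of navigating a collection by identifying similar items can be illustrated with a minimal sketch. It assumes a simple bag-of-words representation and cosine similarity, which is a generic baseline and not necessarily the method used in the cited work. The function names and the toy collection are hypothetical and invented for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_id, documents, top_n=2):
    """Rank the other documents by similarity to documents[query_id]."""
    vectors = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
    scores = [
        (other_id, cosine_similarity(vectors[query_id], vec))
        for other_id, vec in vectors.items()
        if other_id != query_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical toy collection, invented for illustration only.
documents = {
    "vase": "roman glass vase found in a villa",
    "bowl": "roman glass bowl from a villa excavation",
    "train": "steam train timetable for the northern railway",
}

ranking = most_similar("vase", documents)
print(ranking)
```

Here the "bowl" record ranks first for the "vase" query because the two descriptions share several words, while the "train" record shares none. A realistic system would typically weight terms (for example with TF-IDF) or compare documents through latent topics rather than raw word counts.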

Who We Are

ACRC’s researchers have experience of working with industry and publishing research at top-level international venues.

Mark Stevenson is a Senior Lecturer in Computer Science who has been working in data and text analytics since 1995. He has held positions within industry (Reuters Ltd; British Telecom) and academia (Sheffield, Stanford). He has been awarded more than £2.4 million in grants from a range of sources including UK funding councils, the European Commission and industry. He co-ordinated an EU project (PATHS) which developed information access solutions for large collections of unstructured information provided by the European Library and SMEs. The PATHS consortium consisted of six partners from five European countries.

Dr Judita Preiss is a researcher in the ACRC working on Text Analytics. Judita is working on the following Sheffield-based projects:

  • A DSTL project on Information Processing and Sensemaking: Exploratory Search for Document Collections.
  • The EPSRC-funded grant on Knowledge Discovery.
  • The EU-funded project ACCURAT, concerned with improving the performance of machine translation (MT) using automatically acquired comparable corpora.
  • The Google-funded project on distinguishing common from proper nouns.

Dr Jurica Seva is a researcher in the ACRC working on Text Analytics. Jurica specialises in the field of (big) data mining, analytics and visualisation. His research interests include, but are not limited to, contextual recommendation systems, automatic classification/categorisation, sentiment analysis, topic modelling, machine learning/data mining of (web) textual data and the use of natural language processing techniques in large corpora.

Adam Poulston is a PhD student within the ACRC. Adam is working on Text Analytics, in particular exploring what can be inferred about a person based on what they share on social media.

References

Papers in Refereed Conference Proceedings

  • E. Agirre, N. Aletras, P. Clough, S. Fernando, P. Goodale, M. Hall, A. Soroa and M. Stevenson (2013) PATHS: A System for Accessing Cultural Heritage Collections. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 151–156, Sofia, Bulgaria.
  • Nikolaos Aletras and Mark Stevenson (2014a) Measuring the similarity between automatically generated topics. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers (EACL ’14), pages 22–27, Gothenburg, Sweden.
  • Nikolaos Aletras, Timothy Baldwin, Jey Han Lau and Mark Stevenson (2014c) Representing topics labels for exploring digital libraries. In Proceedings of the International Digital Libraries Conference (DL 2014), London, UK.
  • Nikolaos Aletras and Mark Stevenson (2013a) Evaluating topic coherence using distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers, pages 13–22, Potsdam, Germany.
  • Nikolaos Aletras and Mark Stevenson (2013b) Representing topics using images. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT ’13), pages 158–167, Atlanta, Georgia.
  • Distinguishing Common and Proper Nouns (J. Preiss and M. Stevenson). In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, 2013, pages 80–84.
  • DALE: A Word Sense Disambiguation System for Biomedical Documents Trained using Automatically Labeled Examples (J. Preiss and M. Stevenson). In Proceedings of the 2013 NAACL HLT Demonstration Session, 2013, pages 1–4.
  • Identifying Comparable Corpora Using LDA. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2012, pages 558–562.
  • A System for Large-Scale Acquisition of Verbal, Nominal and Adjectival Subcategorization Frames from Corpora. In Proceedings of ACL, 2007, pages 912–919.
  • Can Anaphoric Definite Descriptions be Replaced by Pronouns? (J. Preiss, C. Gasperin and T. Briscoe). In Proceedings of LREC, 2004.
  • Improving Subcategorization Acquisition using Word Sense Disambiguation (A. Korhonen and J. Preiss). In Proceedings of ACL, 2003, pages 48–55.
  • Using Grammatical Relations to Compare Parsers. In Proceedings of EACL, 2003, pages 291–298.
  • Subcategorization Acquisition as an Evaluation Method for WSD (J. Preiss, A. Korhonen and T. Briscoe). In Proceedings of LREC, 2002, pages 1551–1556.

Papers in Refereed Workshop Proceedings