Organisations now hold massive quantities of data, and being able to extract knowledge from it in a timely and efficient way is key to maintaining competitive advantage.
Big data – very large and dynamic sets consisting of both structured and unstructured data – are now pervasive and growing rapidly. The information held by organisations is often stored in formats, such as documents or web pages, that make the data they contain difficult to access. Techniques for analysing, interpreting and accessing these large data sets using state-of-the-art analytics are a powerful asset for organisations.
The ACRC provides expertise on data and text analytics to assist organisations in making sense of their data. Areas of interest include the analysis of large data sets and the extraction of data from text. The ACRC builds upon over two decades of work on data and text analytics carried out in the Department of Computer Science at the University of Sheffield.
Big data: Opportunities and Challenges
- Organisations now have access to unprecedented amounts of data
- Internal documents, spreadsheets, world wide web, social media …
- Data contains valuable information
- Customers and their behaviour
- Prediction of future trends
- Extracting the information is challenging
- Range of formats (structured and unstructured), distributed data, data volume
- Help organisations extract valuable information from data by applying the latest data and text analytics techniques
- Provide access to over two decades of experience in applications of analytics at Sheffield University
Areas of Expertise
- Text analytics
- Data analytics
- Big data
- Information Retrieval
Social Media Analytics
Sentiment Analysis:
Apply state-of-the-art natural language processing techniques to extract sentiment about brands, organisations, products and persons from user-generated content in social media, e.g. Twitter.
Discovering Influential Users:
Identify users with high potential to become popular in social media by applying state-of-the-art machine learning techniques to profile and text features. (Lampos et al. 2014)
Big Data Analytics
Discover Hidden Topics:
Automatically identify topics being discussed in large document collections by applying topic modelling. (Aletras and Stevenson, 2013a; Aletras and Stevenson, 2013b; Aletras and Stevenson, 2014a; Aletras and Stevenson, 2014b)
Organise and Visualise Document Collections:
Enhance information access and provide exploratory search by representing large document collections using the latent topics discussed within them. (Agirre et al. 2013; Aletras et al. 2014c)
- Content and Collaborative-based Product Recommendation
- Navigate document collections by identifying similar items automatically. (Aletras et al. 2012)
- Information Extraction
- Identification of structured information within documents. (Stevenson and Greenwood, 2005)
- Analysis of scientific publications. (Nawab et al. 2013)
Who We Are
ACRC’s researchers have experience of working with industry and publishing research at top-level international venues.
Mark Stevenson is a Senior Lecturer in Computer Science who has been working in data and text analytics since 1995. He has held positions within industry (Reuters Ltd; British Telecom) and academia (Sheffield, Stanford). He has been awarded more than £2.4 million in grants from a range of sources including UK funding councils, the European Commission and industry. He co-ordinated an EU project (PATHS) which developed information access solutions for large collections of unstructured information provided by the European Library and SMEs. The PATHS consortium consisted of six partners from five European countries.
Dr Judita Preiss is a researcher in the ACRC working on Text Analytics. She works on the following Sheffield-based projects:
- A DSTL project on Information Processing and Sensemaking: Exploratory Search for Document Collections.
- The EPSRC-funded grant on Knowledge Discovery.
- The EU-funded project ACCURAT, concerned with improving the performance of MT using automatically acquired comparable corpora.
- The Google-funded project on distinguishing common from proper nouns.
Dr Jurica Seva is a researcher in the ACRC working on Text Analytics. Jurica specialises in the field of (big) data mining, analytics and visualisation. His research interests include, but are not limited to, contextual recommendation systems, automatic classification/categorisation, sentiment analysis, topic modelling, machine learning/data mining of (web) textual data and the use of natural language processing techniques in large corpora.
Adam Poulston is a PhD student within the ACRC. Adam is working on Text Analytics, in particular exploring what can be inferred about a person based on what they share on social media.
- Exploring Relation Types for Literature-based Discovery (J. Preiss, M. Stevenson, R. Gaizauskas) in Journal of the American Medical Informatics Association, 2015.
- R. Nawab, M. Stevenson and P. Clough (2013) Comparing Medline Citations using Modified N-grams. Journal of the American Medical Informatics Association, 21(1):105-110.
- A Detailed Comparison of WSD Systems: An Analysis of the System Answers for the SENSEVAL-2 English All Words Task in Natural Language Engineering, 2006, 12(3):209-228.
- Probabilistic Word Sense Disambiguation in Journal of Computer Speech and Language, 2004, 18(3):319-337.
- Introduction to the Special Issue on Word Sense Disambiguation (J. Preiss and M. Stevenson) in Journal of Computer Speech and Language, 2004, 18(3):201-207.
Papers in Refereed Conference Proceedings
- E. Agirre, N. Aletras, P. Clough, S. Fernando, P. Goodale, M. Hall, A. Soroa and M. Stevenson (2013) PATHS: A System for Accessing Cultural Heritage Collections. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 151–156, Sofia, Bulgaria.
- Nikolaos Aletras and Mark Stevenson (2014a). Measuring the similarity between automatically generated topics. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers (EACL ’14), pages 22-27, Gothenburg, Sweden.
- Nikolaos Aletras, Timothy Baldwin, Jey Han Lau, and Mark Stevenson (2014c) Representing topics labels for exploring digital libraries. In Proceedings of the International Digital Libraries Conference (DL 2014), London, UK.
- Nikolaos Aletras and Mark Stevenson (2013a) Evaluating topic coherence using distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) — Long Papers, pages 13-22, Potsdam, Germany.
- Nikolaos Aletras and Mark Stevenson (2013b) Representing topics using images. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT ’13), pages 158-167, Atlanta, Georgia.
- Distinguishing Common and Proper Nouns (J. Preiss and M. Stevenson). In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, 2013, pages 80-84.
- DALE: A Word Sense Disambiguation System for Biomedical Documents Trained using Automatically Labeled Examples (J. Preiss and M. Stevenson). In Proceedings of the 2013 NAACL HLT Demonstration Session, 2013, pages 1-4.
- Identifying Comparable Corpora Using LDA. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2012, pages 558-562.
- A System for Large-Scale Acquisition of Verbal, Nominal and Adjectival Subcategorization Frames from Corpora in Proceedings of ACL, 2007, pages 912-919.
- Can Anaphoric Definite Descriptions be Replaced by Pronouns? (J. Preiss, C. Gasperin and T. Briscoe) in Proceedings of LREC, 2004.
- Improving Subcategorization Acquisition using Word Sense Disambiguation (A. Korhonen and J. Preiss) in Proceedings of ACL, 2003, pages 48-55.
- Using Grammatical Relations to Compare Parsers, in Proceedings of EACL, 2003, pages 291-298.
- Subcategorization Acquisition as an Evaluation Method for WSD (J. Preiss, A. Korhonen and T. Briscoe) in Proceedings of LREC, 2002, pages 1551-1556.
Papers in Refereed Workshop Proceedings
- The Effect of Word Sense Disambiguation Accuracy on Literature Based Discovery (J. Preiss and M. Stevenson). In Proceedings of DTMBIO, 2015.
- Seeking informativeness in literature based discovery. In Proceedings of BioNLP, 2014, pages 112-117.
- Nikolaos Aletras and Mark Stevenson (2014b) Labelling topics using unsupervised graph-based methods. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Volume 2: Short Papers (ACL ’14), pages 631-636, Baltimore, Maryland.
- Towards Semantic Literature Based Discovery (J. Preiss, M. Stevenson and M. H. McClure) in AAAI-2012 Fall Symposium on Information Retrieval and Knowledge Discovery in Biomedical Text, 2012.
- University Of Sheffield: Two Approaches to Semantic Text Similarity (S. Biggins, S. Mohammed, S. Oakley, L. Stringer, M. Stevenson and J. Preiss) in *Sem2012, 2012, pages 655-661.
- Scaling up WSD with Automatically Generated Examples (W. Cheng, J. Preiss and M. Stevenson) in BioNLP, 2012.
- Refining the most frequent sense baseline (J. Preiss, J. Dehdari, J. King and D. Mehay) in Proceedings of the Workshop on Semantic Evaluations, 2009.
- HMMs, GRs, and n-grams as lexical substitution techniques — are they portable to other languages? (J. Preiss, A. Coonce and B. Baker) in Proceedings of the Workshop on Natural Language Processing methods and Corpora in Translation, Lexicography, and Language Learning, 2009, pages 21-27.
- M. Stevenson and M. Greenwood (2005) A Semantic Approach to IE Pattern Induction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), pages 379-386, Ann Arbor, MI.
- WSD for Subcategorization Acquisition Task Description (J. Preiss and A. Korhonen) in Proceedings of SENSEVAL-3, 2004, pages 33-36.
- Probabilistic WSD in SENSEVAL-3 in Proceedings of SENSEVAL-3, 2004, pages 213-216.
- The Contribution of Domain-independent Robust Pronominal Anaphora Resolution to Open-Domain Question-Answering (R. Watson, J. Preiss and T. Briscoe) in Symposium on Reference Resolution and its Applications to Question Answering and Summarization, 2003, pages 75-82.
- Intermediate Parsing for Anaphora Resolution? Implementing the Lappin and Leass non-coreference filters (J. Preiss and T. Briscoe) in Proceedings of the Anaphora Workshop at EACL, 2003, pages 1-6.
- Choosing a Parser for Anaphora Resolution in Proceedings of DAARC, 2002, pages 175-180.
- A Comparison of Probabilistic and Non-Probabilistic Anaphora Resolution Algorithms in Proceedings of the student workshop at ACL, 2002, pages 42-47.
- Improving Subcategorization Acquisition with WSD (J. Preiss and A. Korhonen) in Proceedings of the Word Sense Disambiguation workshop, 2002, pages 102-108.
- Anaphora Resolution with Memory Based Learning in Proceedings of CLUK5, 2002, pages 1-9.
- Anaphora Resolution with Word Sense Disambiguation in Proceedings of SENSEVAL-2, 2002, pages 143-146.
- Disambiguating Noun and Verb Senses Using Automatically Acquired Selectional Preferences (D. McCarthy, J. Carroll and J. Preiss) in Proceedings of SENSEVAL-2, 2002, pages 119-122.
- Local versus Global Context for WSD of Nouns in Proceedings of CLUK4, 2001, pages 1-8.