Article Synopsis

  • The study investigates the effectiveness of different document retrieval methods in social science research, comparing conventional keyword approaches with more advanced techniques like query expansion and active supervised learning.
  • Relying on incomplete keyword lists risks biased inferences, so the study tests whether more complex and costly methods actually improve retrieval performance.
  • The findings show that query expansion techniques and topic model-based classification rules tend to decrease rather than increase retrieval performance in most studied settings, while active supervised learning substantially outperforms keyword lists when enough labeled training data is available (e.g. 1000 documents).

Citations

20

Article Abstract

One of the first steps in many text-based social science studies is to retrieve the documents that are relevant for an analysis from large corpora of otherwise irrelevant documents. The conventional approach in social science is to apply a set of keywords and to treat every document that contains at least one of the keywords as relevant. But applying incomplete keyword lists carries a high risk of drawing biased inferences. More complex and costly methods, such as query expansion techniques, topic model-based classification rules, and active as well as passive supervised learning, have the potential to separate relevant from irrelevant documents more accurately and thereby reduce the size of the bias. Yet whether these more expensive approaches increase retrieval performance compared to keyword lists at all, and if so by how much, is unclear because a comparison of these approaches is lacking. This study closes this gap by comparing these methods across three retrieval tasks associated with a data set of German tweets (Linder in SSRN, 2017. 10.2139/ssrn.3026393), the Social Bias Inference Corpus (SBIC) (Sap et al. in Social bias frames: reasoning about social and power implications of language. In: Jurafsky et al. (eds) Proceedings of the 58th annual meeting of the association for computational linguistics. Association for Computational Linguistics, pp 5477-5490, 2020. 10.18653/v1/2020.acl-main.486), and the Reuters-21578 corpus (Lewis in Reuters-21578 (Distribution 1.0). [Data set], 1997. http://www.daviddlewis.com/resources/testcollections/reuters21578/). Results show that query expansion techniques and topic model-based classification rules tend, in most studied settings, to decrease rather than increase retrieval performance. Active supervised learning, however, reaches a substantially higher retrieval performance than keyword lists if it is applied to a sufficiently large set of labeled training instances (e.g. 1000 documents).
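The keyword baseline described in the abstract (a document counts as relevant if it contains at least one keyword) can be sketched in a few lines of Python. This is a minimal illustration, not the study's implementation: the toy documents, the keyword list, the relevance labels, and the function names `keyword_retrieve` and `precision_recall_f1` are all assumptions made for this example. Whole-word, case-insensitive matching is also an assumption, and the abstract does not say which metric the study reports, so precision, recall, and F1 are used here only because they are standard retrieval measures.

```python
import re


def keyword_retrieve(documents, keywords):
    """Return the indices of documents that contain at least one keyword.

    Assumption of this sketch: whole-word, case-insensitive matching.
    """
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return {i for i, doc in enumerate(documents) if pattern.search(doc)}


def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall, and F1 for two sets of document indices."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy corpus; indices 0, 1, and 3 are labeled relevant.
docs = [
    "Refugees arrive at the border",
    "New asylum rules debated in parliament",  # relevant, but no keyword matches
    "Football season opens this weekend",
    "Migration policy and the labour market",
]
relevant = {0, 1, 3}

# Deliberately incomplete keyword list: "asylum" is missing.
keywords = ["refugee", "refugees", "migration"]

retrieved = keyword_retrieve(docs, keywords)
precision, recall, f1 = precision_recall_f1(retrieved, relevant)
print(sorted(retrieved), precision, round(recall, 3), round(f1, 3))
```

In this toy case every retrieved document is relevant, but the document about asylum rules is missed because the list lacks that term. That kind of systematic omission is the sort of bias from incomplete keyword lists that the abstract warns about.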

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC9762672
DOI: http://dx.doi.org/10.1007/s42001-022-00191-7

Publication Analysis

Top Keywords (with occurrence counts)

keyword lists: 12
retrieval performance: 12
comparison approaches: 8
social science: 8
documents relevant: 8
irrelevant documents: 8
query expansion: 8
expansion techniques: 8
techniques topic: 8
topic model-based: 8
