This research summary article is based on the paper 'ALLIE: Active Learning on Large-scale Imbalanced Graphs'.
Social network analysis, financial fraud detection, molecular design, search engines, and recommendation systems all operate on graph-structured data. Graph neural networks (GNNs) have recently emerged as state-of-the-art models on these kinds of datasets, surpassing classical pointwise and pairwise models, because they can learn and aggregate complex relationships across K-hop neighborhoods.
Despite these enticing advantages, GNNs, like other deep learning models, require a considerable amount of labeled data for supervised training. In many domains, obtaining adequate labeled data is time-consuming, labor-intensive, and expensive, which limits the use of GNNs.
Active Learning (AL) is a promising technique for obtaining labels faster and at lower cost, and for training models efficiently. AL dynamically queries candidate samples for labeling so as to maximize model performance under a limited labeling budget. Recent advances in AL on graphs have proven beneficial on various benchmark datasets, such as citation graphs and gene networks.
However, there has been little research on AL approaches for large-scale imbalanced settings (e.g., detecting the small fraction of fake reviews on an e-commerce website). This motivates researchers to consider how to query the most “informative” data to reduce the cost of training GNNs and mitigate the effect of class imbalance.
Training GNNs with AL on imbalanced graphs is not easy. Because underrepresented positive samples are less likely to be selected by standard AL methods, the low prevalence of positive samples prevents these methods from learning the full data distribution. Detecting abusive reviews on a shopping website, for example, can be modeled as a binary classification problem in which positive samples (i.e., abusive reviews) make up a very small proportion of the labeled data.
When an AL model is trained to sample reviews for labeling, it will mostly surface non-abusive reviews, yielding only a modest gain in model performance. Most AL sampling strategies for balancing the class distribution, proposed in natural language processing and computer vision, assume independent and identically distributed data. Because of the rich relational structure and extensive links in graphs, these methodologies are not directly applicable to graph-structured data.
Building an AL method for large-scale graph data is also difficult. Popular social media platforms (such as Facebook and Snapchat) have hundreds of millions of monthly active users, while online e-commerce sites (such as Amazon and Walmart) list millions of products and process billions of transactions. At this scale, it is infeasible to search through all the unlabeled samples in the graph, because the computational complexity of AL techniques grows exponentially with the size of the unlabeled set. It is therefore crucial to reduce the search space of AL algorithms on large-scale graphs.
To address these two problems, the Amazon researchers propose ALLIE, an Active Learning technique for Large-scale Imbalanced graphs that combines AL on graphs with reinforcement learning for accurate and efficient node classification. Using multiple uncertainty measures as criteria, ALLIE can successfully select informative unlabeled samples for labeling. Additionally, the method prioritizes less confident and “underrepresented” samples.
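One common uncertainty criterion of this kind is the entropy of a model's predicted class distribution: the closer the prediction is to uniform, the more informative a label would be. The sketch below is illustrative of entropy-based selection in general, not of ALLIE's exact scoring; the function names are our own.

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(predictions, budget):
    """Pick the `budget` unlabeled nodes whose predictions are most uncertain.

    `predictions` maps node id -> predicted class-probability vector.
    Returns node ids sorted from most to least uncertain.
    """
    ranked = sorted(predictions, key=lambda n: entropy(predictions[n]), reverse=True)
    return ranked[:budget]
```

For example, given `{0: [0.9, 0.1], 1: [0.5, 0.5], 2: [0.7, 0.3]}` and a budget of 1, the selector returns node 1, whose near-uniform prediction carries the highest entropy.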
To adapt the approach to very large graphs, the researchers equip ALLIE with a graph coarsening mechanism that groups related nodes into clusters. The search space of the AL algorithm is then reduced by operating on a representative of the nodes in each cluster. This is the first study to model the imbalance problem with active learning on large-scale graphs.
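The coarsening idea can be sketched as follows. Here the cluster assignment `cluster_of` is assumed to come from some off-the-shelf graph clustering method, and averaging node embeddings into one super-node per cluster is an illustrative aggregation choice, not necessarily the paper's exact construction.

```python
def coarsen(node_embeddings, cluster_of):
    """Collapse each cluster of nodes into one super-node embedding.

    node_embeddings: node id -> embedding vector (list of floats)
    cluster_of:      node id -> cluster id
    Returns cluster id -> mean embedding of its member nodes,
    shrinking the set of candidates the AL policy must search.
    """
    sums, counts = {}, {}
    for node, emb in node_embeddings.items():
        c = cluster_of[node]
        counts[c] = counts.get(c, 0) + 1
        acc = sums.setdefault(c, [0.0] * len(emb))
        for i, v in enumerate(emb):
            acc[i] += v
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}
```

After coarsening, the policy network scores the (much smaller) set of super-nodes instead of every individual unlabeled node, which is what cuts the action space.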
The team’s contributions are as follows:
• Imbalance-aware reinforcement-learning graph policy network: The team uses a reinforcement learning technique to discover a representative subset of the unlabeled dataset by optimizing the classifier's performance, so that the queried nodes better represent the minority class.
• Graph coarsening strategy for handling large-scale graph data: Existing approaches rarely consider scalability, which makes them inefficient in real-world scenarios. The researchers use a graph coarsening approach that shrinks the action space of the policy network, reducing execution time.
• Robust learning for more accurate node classification: The researchers build a node classifier with a focal loss that down-weights well-classified samples, unlike traditional approaches that do not distinguish between majority and minority classes when optimizing the objective function.
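A loss that down-weights well-classified samples, as described in the last bullet, can be sketched in the standard binary focal-loss form of Lin et al.; the `gamma` and `alpha` values below are the commonly used defaults, not necessarily the paper's settings.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross-entropy scaled by (1 - p_t)^gamma.

    p: predicted probability of the positive class, y: true label (0 or 1).
    Easy, well-classified examples (p_t near 1) contribute almost nothing,
    so training focuses on the hard, often minority-class, samples.
    """
    pt = p if y == 1 else 1.0 - p            # probability assigned to the true class
    weight = alpha if y == 1 else 1.0 - alpha  # optional class-balancing factor
    return -weight * (1.0 - pt) ** gamma * math.log(max(pt, 1e-12))
```

A confidently correct prediction (`p = 0.9`, `y = 1`) thus incurs a far smaller loss than a confidently wrong one (`p = 0.1`, `y = 1`), which is exactly the down-weighting behavior the contribution describes.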
ALLIE was tested on both balanced and imbalanced datasets. The balanced datasets are publicly available citation graphs, while the imbalanced dataset comes from a private e-commerce site. The researchers report node classification performance on all of them.
According to the results, ALLIE improves on the best baseline by an average of 2.39% in Macro-F1 and 2.71% in Micro-F1 on the balanced graph datasets. On the e-commerce website dataset, ALLIE improves the positive classes (abusive users and reviews) by an average of 4.75% in precision, 1.96% in recall, and 3.45% in F1 over the best baseline (relative improvements of 10.54%, 3.7%, and 7.71%, respectively). The team also performed a detailed ablation study highlighting the contribution of each component of ALLIE, and additional experiments show that ALLIE outperforms the baselines across a variety of initial training set sizes and query budgets.
In summary, Amazon researchers present ALLIE, a novel active learning framework for large-scale imbalanced graphs. ALLIE uses a graph policy network that queries candidate nodes for labeling by maximizing the long-term performance of the GNN classifier. Compared with many state-of-the-art approaches, ALLIE handles imbalanced data distributions better through its two balancing mechanisms, and its graph coarsening module makes it scalable to large-scale applications. ALLIE's strong performance is demonstrated by experiments on three benchmark datasets and one real-world retail website dataset.