2022

Analyzing classification performance on an unbalanced honeycomb dataset

Author
SERKAN ÖZGÜN
University
SELÇUK UNIVERSITY
Year
2022

Abstract

Beekeeping is an important agricultural activity in Turkey and worldwide. It matters socio-economically because it contributes to the development of rural areas in Turkey, and its products are important food sources for humans. Using the right methods is therefore essential for the sustainability of beekeeping. When producers work without adequate knowledge and without the necessary techniques, the quality and yield of the products suffer. Honey is one of the most important outputs of beekeeping, and its production involves many stages, one of which is the honey harvest. Using the right methods and techniques during harvesting increases the amount and quality of the honey produced. Informed beekeeping also helps preserve the bee colony by avoiding unnecessary brood losses. In this thesis, the detection of 'closed larval cells' on the honeycomb is treated as a classification problem, with the aim of reducing brood losses during the honey harvest. A dataset was created from 38 honeycomb images and labeled with two classes: closed larval cells and others. The ratio between the two labeled classes in the dataset was 1/5. The aim was to improve classification performance by eliminating this imbalance between the classes. To this end, five data-level oversampling approaches that are well known and current in the literature (SMOTE, Borderline-SMOTE1, Borderline-SMOTE2, Safe-Level-SMOTE and DEBOHID) were used. Three classifiers (K-Nearest Neighbor (kNN), Decision Tree and Support Vector Machines (SVM)) were used to measure classification performance on the balanced data.
Classification results were evaluated with the F1-Score, G-Mean and AUC metrics. Classification performance improved on the datasets that were balanced with synthetic data generation methods.
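
To make the balancing step and two of the metrics concrete, here is a minimal sketch in plain Python. It is not the thesis implementation. The toy 2-D points, the function names `smote` and `f1_and_gmean`, and the choice of `k` are all assumptions made for illustration. It shows only basic SMOTE-style interpolation between minority neighbours, not the Borderline, Safe-Level or DEBOHID variants, and it leaves out AUC.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by interpolating between
    a randomly chosen minority point and one of its k nearest minority
    neighbours (basic SMOTE idea; illustrative only)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def f1_and_gmean(y_true, y_pred, positive=1):
    """Return (F1-Score, G-Mean) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return f1, math.sqrt(sensitivity * specificity)


# Hypothetical toy data with the 1/5 class ratio mentioned in the abstract.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority = [(5.0 + 0.1 * i, 5.0 - 0.1 * i) for i in range(20)]

# Oversample the minority class until both classes have the same size.
synthetic = smote(minority, n_new=len(majority) - len(minority), k=3)
balanced_minority = minority + synthetic

# Hypothetical labels and predictions, used only to exercise the metrics.
f1, gmean = f1_and_gmean([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])
```

G-Mean is the geometric mean of sensitivity and specificity, so a classifier that ignores the minority class scores zero however accurate it is on the majority class. This is why it is commonly reported alongside F1-Score on imbalanced data.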