IYRC PAPER ARCHIVE

ESCALATING THE QUANTITY OF MEDICAL DATA USING CTGAN: DIABETES DATASET

by Jihyung Kim
Category: STEM
Abstract – The number of diabetes diagnoses is increasing sharply in the United States. Diabetes is a lifelong disease that can cause serious symptoms such as blurred vision. Collecting medical data requires patient consent and complicated procedures, which makes large datasets hard to obtain. The Conditional Tabular Generative Adversarial Network (CTGAN) can help solve this problem. A GAN is a deep learning model that generates synthetic data; CTGAN follows a very similar procedure but is designed for tabular data. We checked how closely the synthetic data matched the real data using various machine learning and deep learning models. Logistic Regression (LR), Decision Tree (DT), KNN, Gradient Boosting (GB), Light Gradient Boosting Machine (LGBM), Support Vector Classifier (SVC), Gaussian, and Deep Neural Network (DNN) got 40.55%, 38.1%, 44.5%, 39%, 35.35%, 44.05%, 53.65%, 39.25%, and 34.2%, respectively. We applied GridSearch to two models, Random Forest (RF) and LGBM: RF reached a slightly higher accuracy of 77.85%, while LGBM reached 76.65%. We then created a new dataset that combines the synthetic data with a small amount of real data. When we compared the new dataset with the purely real data, the accuracy scores of all models almost doubled, to 100%, 77.3%, 88.5%, 93.9%, 93.05%, 88%, 93.9%, 75.05%, and 65.8%, respectively. Although we had to modify the model to reach a satisfactory result, CTGAN can become a very significant tool for researchers who need a large amount of data.
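The evaluation described in the abstract can be read as: fit a classifier on generated rows, optionally mixed with a few real rows, and score it on real rows. The sketch below illustrates that reading only; it is not the author's code. It uses a small pure-Python KNN in place of the paper's models, and `toy_rows`, `augment`, and the label-noise setting are hypothetical stand-ins for the CTGAN output and the diabetes dataset.

```python
import random


def knn_predict(train, query, k=3):
    """Majority vote over the k training rows closest to `query` (binary labels 0/1)."""
    def sq_dist(row):
        features, _ = row
        return sum((a - b) ** 2 for a, b in zip(features, query))
    nearest = sorted(train, key=sq_dist)[:k]
    positives = sum(label for _, label in nearest)
    return 1 if 2 * positives > len(nearest) else 0


def accuracy(train, test, k=3):
    """Fraction of held-out rows whose label the KNN model predicts correctly."""
    hits = sum(knn_predict(train, features, k) == label for features, label in test)
    return hits / len(test)


def augment(synthetic, real, n_real, seed=0):
    """Synthetic rows plus `n_real` real rows sampled without replacement."""
    rng = random.Random(seed)
    return list(synthetic) + rng.sample(list(real), n_real)


def toy_rows(n, label_noise, rng):
    """Hypothetical stand-in data: one feature, two classes, optional label flips."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        feature = rng.gauss(2.0 * label, 0.5)
        if rng.random() < label_noise:
            label = 1 - label
        rows.append(((feature,), label))
    return rows


if __name__ == "__main__":
    rng = random.Random(42)
    real_train = toy_rows(200, 0.0, rng)
    real_test = toy_rows(200, 0.0, rng)
    # Noisy labels stand in for low-fidelity generated rows (an assumption).
    synthetic = toy_rows(200, 0.4, rng)
    print("synthetic only:", accuracy(synthetic, real_test))
    print("synthetic + real:", accuracy(augment(synthetic, real_train, 100), real_test))
```

Scoring on real rows that were not used for training keeps the comparison between the synthetic-only and the mixed training sets fair; the numbers printed by this toy setup say nothing about the accuracies reported in the abstract.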