(25.02.11) Open Source Data Labeling & Updating Tool Ideation

Genie; 2025. 2. 24. 17:17

→ The primary goal

  • Streamline the process of managing and modifying data so that users can complete their tasks (labeling and updating data) quickly and with minimal effort.
  • Updated following the meeting.

👤 User Stories

| Role | Description |
| --- | --- |
| admin | Manages data organization, user roles, logging (monitoring), reports, and access control. Also handles data labeling, modification, and review. |
| reviewer (annotator) | Responsible for data labeling, modification, and review, ensuring data accuracy and consistency. |

 

| Story | Required Feature | Difficulty | Impact | Note |
| --- | --- | --- | --- | --- |
| As a reviewer, I want to retrieve data by a unique ID or retrieve all data so that I can access necessary data efficiently. | Data Read (MongoDB, NoSQL) | low | high | |
| As a reviewer, I want to update (modify) the sql_query, nl_en_query, and nl_ko_query fields so that I can correct or improve query (SQL, NL) accuracy. | Update Query Fields | low | high | |
| As a reviewer, I want to validate sql_query before execution so that I can prevent syntax (grammar) errors and runtime issues. | SQL Execution Error Check (Grammar Validation) | high | high | |
| As an admin, I want to upload data from my local machine so that I can add new datasets to the system. | Data Upload | medium | high | can rely on AI team |
| As an admin, I want to export selected datasets so that I can use them externally. | Data Export | medium | high | can rely on AI team |
| As a reviewer, I want to view data logs so that I can track changes and see who modified the data. | Versioning & Data Logging | high | high | |
| As a reviewer, I want to perform CRUD operations on labels so that I can manage annotations effectively when updating data. | Label CRUD | low | medium | need to discuss |
| As an admin, I want to sort and group datasets based on various criteria so that I can organize datasets efficiently. | Collaborative Data Sorting and Grouping | high | medium | |
| As an admin, I want to assign specific data groups to reviewers so that they can access only the relevant data and update all assigned data at once. | Data Access Control for Reviewers | medium | medium | |
| As an admin, I want to track modifications and retrieve previous versions so that I can monitor changes and revert if needed. | Versioning & Data Logging | high | medium | |
| As a reviewer, I want to log in and access only the assigned datasets so that I can focus on my tasks and avoid conflicts from simultaneous edits. | Role-Based Access Control | high | medium | |
| As a reviewer, I want to view a list of all available templates so that I can choose and reference them as needed. | Template Reference Function | low | medium | |
| As an admin, I want to search and filter datasets by various criteria so that I can quickly check relevant data. | Data Search & Filtering | medium | medium | |
| As a reviewer, I want to view the latest log for specific data along with the retrieved data so that I can check whether it has been updated and when the update occurred. | Retrieve Data Update Logs | low | medium | |
| As an admin, I want to monitor reviewer progress on data tasks so that I can ensure that tasks are completed on time. | Task Monitoring | medium | medium | |
| As a reviewer, I want to view and edit similar template values together at once so that I can efficiently compare, verify, and update multiple templates with consistent data. | Update Similar Template Values Together at Once | high | medium | |
| As a reviewer, I want to replace specific words with labels so that I can ensure data consistency according to specific standards. | Replace Words with Labels (Labeling) | medium | low | |
| As an admin, I want to register users with different roles (reviewer, admin) so that I can control their access permissions. | (Simple) User Registration & Role Management | low | low | |
| As an admin, I want to retrieve data from the database and download it in a structured format so that I can analyze or share it efficiently. | Data Retrieval from DB | medium | low | |
| As an admin, I want to generate reports on dataset modifications and access logs so that I can monitor system activity and performance. | Report Generation | low | low | |
| As an admin, I want to track the progress of each user (reviewer and admin) so that I can monitor individual contributions and ensure tasks are completed on time. | Track Progress for Each User (Monitoring Progress) | low | low | |

🔑 Key Features

  1. Data Read (MongoDB, NoSQL)
    • The field information (no_sql_template, label, sql_template, direct_question, sql_query, nl_en_query, nl_ko_query, etc.) can be retrieved by a unique ID (DB).
  2. Update sql_query, nl_en_query, and nl_ko_query Fields
    • Label CRUD
      • The Spring Boot application performs CRUD operations on Labels (Tags, Annotations, etc.).
    • All three fields (sql_query, nl_en_query, and nl_ko_query) can be updated with the following options:
      1. Replace words with labels. (Labeling)
      2. Modify the entire field value. (Full Modification of Values)
  3. Template Reference Function
    • A feature to load and reference templates in no_sql_template (Template No.) and sql_template (Template) formats.
  4. SQL Execution Error Check (Grammar Validation)
    • Validate sql_query using a SQL parser (SqlParser, etc).
    • Detect potential issues before execution (e.g., incorrect structure, missing references).
  5. Versioning & Data Logging
    • Introduce version control for data modifications to track and retrieve previous versions from log data.
    • Allow users to revert to previous versions if necessary.
    • This versioning feature will be integrated with modification logging: each version is stored with the logs (e.g., version number with modified timestamp and modification details).
  6. Collaborative Data Sorting and Grouping
    • Admins can sort, group, and classify the data based on various criteria.
    • Admin can monitor users' progress on data updates at once (Monitoring Progress).
    • This feature allows multiple users to label and modify the data simultaneously.

🖼️ Simple Wireframe

  • Rough sketch only.

📔 API Specification (Draft)

1. Data Retrieval (Pagination Included)

1.1 Read All Data

  • Endpoint: GET /api/data
  • Query Parameters:
    • page (optional, default: 1) : Page number
  • Description: Reads all records from the database with pagination, including modification tracking information (see 5. Modification Logging).
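
The paging behavior described above could look roughly like the following. This is a minimal Python sketch, not the Spring Boot implementation: the page size of 20, the `paginate` helper, and the response field names (`totalPages`, `totalItems`, `items`) are all assumptions, since the draft only defines the `page` parameter.

```python
from typing import Any

PAGE_SIZE = 20  # assumed; the draft does not specify a page size


def paginate(records: list[dict[str, Any]], page: int = 1,
             page_size: int = PAGE_SIZE) -> dict[str, Any]:
    """Return one page of records plus paging metadata (pages are 1-based)."""
    if page < 1:
        raise ValueError("page must be >= 1")
    total = len(records)
    total_pages = max(1, -(-total // page_size))  # ceiling division
    start = (page - 1) * page_size
    return {
        "page": page,
        "totalPages": total_pages,
        "totalItems": total,
        "items": records[start:start + page_size],  # empty list past the last page
    }


# 45 sample records spread over three pages of 20, 20, and 5 items.
records = [{"id": str(i), "sql_query": f"SELECT {i}"} for i in range(1, 46)]
first = paginate(records)
last = paginate(records, page=3)
```

A page beyond the last one returns an empty `items` list rather than an error; whether the real API should respond that way is still open.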

1.2 Read Specific Data by no_sql_template

  • Endpoint: GET /api/data/{no_sql_template}
  • Query Parameters:
    • page (same as above)
  • Description: Reads records filtered by a specific no_sql_template with pagination, including modification tracking information (see 5. Modification Logging).

2. Data Modification Tracking

2.1 Retrieve Modification Info

  • Endpoint: GET /api/data/{id}/modifications
  • Description:
    • Returns modification history for a given data entry, including modified fields, timestamps, and details of the changes.
    • The modification history is stored in a separate table (data_modifications) so that every change to the data is preserved and the data can be reset to an earlier state.
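
One way the stored history could support a reset is to undo logged changes newest first. The sketch below is in Python for brevity and assumes each history record has the `modifiedAt` and `modifiedFields` shape shown in 5.1; the `revert_to` helper is hypothetical.

```python
from typing import Any


def revert_to(entry: dict[str, Any], history: list[dict[str, Any]],
              target_time: str) -> dict[str, Any]:
    """Undo every modification logged after target_time, newest first."""
    restored = dict(entry)
    # ISO-8601 timestamps in the same format sort correctly as plain strings.
    for record in sorted(history, key=lambda r: r["modifiedAt"], reverse=True):
        if record["modifiedAt"] <= target_time:
            break  # everything older is part of the state we want to keep
        for change in record["modifiedFields"]:
            restored[change["field"]] = change["oldValue"]
    return restored


current = {"sql_query": "C"}
history = [
    {"modifiedAt": "2025-02-12T10:00:00",
     "modifiedFields": [{"field": "sql_query", "oldValue": "A", "newValue": "B"}]},
    {"modifiedAt": "2025-02-13T09:00:00",
     "modifiedFields": [{"field": "sql_query", "oldValue": "B", "newValue": "C"}]},
]
after_first_edit = revert_to(current, history, "2025-02-12T10:00:00")
original = revert_to(current, history, "2025-02-11T00:00:00")
```

The function returns a copy and leaves the current entry untouched, so a revert can itself be logged as a new modification.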

3. Label CRUD

3.1 Create a Label

  • Endpoint: POST /api/labels
  • Request Body e.g. :
  • { "labelName": "string" }
  • Description: Adds a new label.

3.2 Read All Labels

  • Endpoint: GET /api/labels
  • Description: Read all available labels.

3.3 Update a Label

  • Endpoint: PUT /api/labels/{labelId}
  • Request Body e.g. :
  • { "labelName": "string" }
  • Description: Updates the label name.

3.4 Delete a Label

  • Endpoint: DELETE /api/labels/{labelId}
  • Description: Removes a label.

4. Data Update (Labeling & Full Modification)

4.1 Labeling - Replace Words with Labels

  • Endpoint: POST /api/data/{id}/modify
  • Request Body e.g. :
  • { "field": "sql_query", "indices": [0, 2], "labelId": "1" }
  • Description: Replaces words in a specific field based on provided indices with a given label.
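
The draft does not yet define what an index refers to or how a label appears in the text. The Python sketch below assumes zero-based positions of whitespace-separated words and renders the label as `{labelName}`; both choices, and the `replace_with_label` helper itself, are assumptions for illustration.

```python
def replace_with_label(text: str, indices: list[int], label_name: str) -> str:
    """Replace the words at the given zero-based positions with a label placeholder."""
    words = text.split()  # note: runs of whitespace collapse to single spaces
    for i in indices:
        if not 0 <= i < len(words):
            raise IndexError(f"word index {i} is out of range")
        words[i] = "{" + label_name + "}"
    return " ".join(words)


labeled = replace_with_label("SELECT * FROM DeliveryZones", [3], "table")
```

If exact whitespace must be preserved, character offsets would be a safer index type than word positions.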

4.2 Full Field Modification

  • Endpoint: POST /api/data/{id}/modify-full
  • Request Body e.g. :
  • { "field": "sql_query", "newValue": "SELECT * FROM DeliveryZones" }
  • Description: Replaces the entire field value.

5. Modification Logging

5.1 Store Modification Record

  • Endpoint: POST /api/data/{id}/log-modification
  • Request Body e.g. :
  • {
      "modifiedAt": "2025-02-12T10:00:00",
      "modifiedFields": [
        {
          "field": "sql_query",
          "oldValue": "SELECT * FROM shipping_methods WHERE cost > 20;",
          "newValue": "SELECT * FROM shipping_methods WHERE cost < 20;",
          "dataId": "3"
        },
        {
          "field": "nl_en_query",
          "oldValue": "What are the shipping methods that cost more than $20? Let me see all the details.",
          "newValue": "What are the order quantities that cost more than $20? Let me see all the details.",
          "dataId": "3"
        }
      ],
      "fullModification": false,
      "noUpdate": false
    }
  • Description: Logs the modification timestamp and the modified fields at the same time as the value update. Each record covers label replacements as well as full modifications, stores the old and new values, and indicates whether any change occurred or the update was skipped.
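
A record of this shape could be derived by comparing the stored values with the submitted ones. The following is a minimal Python sketch under two assumptions: only the three query fields are tracked, and `noUpdate` is true exactly when no tracked field changed. The `build_modification_record` helper is hypothetical.

```python
from datetime import datetime
from typing import Any

TRACKED_FIELDS = ("sql_query", "nl_en_query", "nl_ko_query")


def build_modification_record(data_id: str, old: dict[str, Any], new: dict[str, Any],
                              modified_at: datetime,
                              full_modification: bool = False) -> dict[str, Any]:
    """Diff the tracked fields and build a log record in the request-body shape."""
    changes = [
        {"field": f, "oldValue": old.get(f), "newValue": new.get(f), "dataId": data_id}
        for f in TRACKED_FIELDS
        if old.get(f) != new.get(f)
    ]
    return {
        "modifiedAt": modified_at.isoformat(timespec="seconds"),
        "modifiedFields": changes,
        "fullModification": full_modification,
        "noUpdate": not changes,  # assumed meaning: nothing was changed
    }


old = {"sql_query": "SELECT * FROM shipping_methods WHERE cost > 20;", "nl_ko_query": "x"}
new = {"sql_query": "SELECT * FROM shipping_methods WHERE cost < 20;", "nl_ko_query": "x"}
record = build_modification_record("3", old, new, datetime(2025, 2, 12, 10, 0, 0))
skipped = build_modification_record("3", old, old, datetime(2025, 2, 12, 10, 0, 0))
```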

6. Template Reference

6.1 Read Template by ID

  • Endpoint: GET /api/templates/{templateId}
  • Description: Read a specific template.

7. SQL Execution Error Check (Grammar Validation)

7.1 Validate SQL Query

  • Endpoint: POST /api/data/{id}/validate
  • Request Body e.g. :
  • { "sql_query": "SELECT * FROM orders WHERE order_id = ?" }
  • Description:
    • Validates a SQL query string using a SQL parser (e.g., SqlParser).
    • Returns validation results with details about any errors or warnings.
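
As a rough illustration of checking a query without running it, the Python sketch below swaps in SQLite's own parser (via the standard `sqlite3` module) in place of the SqlParser library named above. It compiles the statement with `EXPLAIN` against an in-memory database. The schema, the `validate_sql` helper, and the `valid`/`error` response shape are assumptions, and SQLite's dialect may differ from the target database.

```python
import sqlite3

# Hypothetical schema; real table definitions would come from the target database.
SCHEMA = """
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, quantity INTEGER);
CREATE TABLE shipping_methods (id INTEGER PRIMARY KEY, cost REAL);
"""


def validate_sql(sql_query: str) -> dict:
    """Compile the query in an in-memory SQLite database without executing it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA)
        # Naive: treats every "?" as a placeholder, even inside string literals.
        params = (None,) * sql_query.count("?")
        # EXPLAIN prepares the statement (syntax and name resolution) but does not run it.
        conn.execute("EXPLAIN " + sql_query, params)
        return {"valid": True, "error": None}
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return {"valid": False, "error": str(exc)}
    finally:
        conn.close()


ok = validate_sql("SELECT * FROM orders WHERE order_id = ?")
typo = validate_sql("SELECT * FORM orders")
missing = validate_sql("SELECT * FROM no_such_table")
```

Because the check resolves table names against the schema, it catches missing references as well as grammar errors, but only for tables that the assumed schema knows about.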

8. Versioning Management by Admin

8.1 Create New Version

  • Endpoint: POST /api/data/version
  • Request Body example:
  • { "versionComment": "version 1.0" }
  • Description:
    • This endpoint allows the admin to create a new version of the data with a comment.

8.2 Retrieve Specific Version

  • Endpoint: GET /api/data/version/{versionNumber}
  • Description:
    • Retrieves a specific version of the data by specifying the version number.

9. Collaborative Data Sorting and Grouping

9.1 Sort and Group Data

  • Endpoint: POST /api/data/group
  • Request Body e.g. :
  • { "sortBy": "date_created", "groupBy": "nl_ko_query" }
  • Description:
    • Allows the admin to sort and group data based on specified criteria, such as the natural-language (NL) query language.
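
The sort-then-group behavior might be sketched as follows in Python. The `sort_and_group` helper and the sample records are hypothetical. The example groups by `no_sql_template`, where several records share a value, which is an assumption about how grouping would typically be used.

```python
from collections import defaultdict
from typing import Any


def sort_and_group(records: list[dict[str, Any]], sort_by: str,
                   group_by: str) -> dict[Any, list[dict[str, Any]]]:
    """Sort records by one field, then bucket them by another, keeping the sort order."""
    groups: dict[Any, list[dict[str, Any]]] = defaultdict(list)
    for record in sorted(records, key=lambda r: r[sort_by]):
        groups[record[group_by]].append(record)
    return dict(groups)


records = [
    {"id": "1", "date_created": "2025-02-03", "no_sql_template": "T1"},
    {"id": "2", "date_created": "2025-02-01", "no_sql_template": "T2"},
    {"id": "3", "date_created": "2025-02-02", "no_sql_template": "T1"},
]
groups = sort_and_group(records, sort_by="date_created", group_by="no_sql_template")
```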

9.2 Assign Data Groups to Users

  • Endpoint: POST /api/data/assign-group
  • Request Body e.g. :
  • { "userId": "1", "groupId": "5" }
  • Description:
    • Enables the admin to assign specific data groups to users.
    • Users will only be able to view data assigned to them during data retrieval, filtering the results based on their group assignments.
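
The assignment and the resulting read filter could work roughly as below. This Python sketch assumes every data record carries a `groupId` field and that a user may hold several groups; neither is specified in the draft, and the `GroupAssignments` class is hypothetical.

```python
from typing import Any


class GroupAssignments:
    """In-memory stand-in for the user-to-group assignment store."""

    def __init__(self) -> None:
        self._groups_by_user: dict[str, set[str]] = {}

    def assign(self, user_id: str, group_id: str) -> None:
        self._groups_by_user.setdefault(user_id, set()).add(group_id)

    def visible(self, user_id: str, records: list[dict[str, Any]]) -> list[dict[str, Any]]:
        """Keep only the records whose group is assigned to the user."""
        allowed = self._groups_by_user.get(user_id, set())
        return [r for r in records if r.get("groupId") in allowed]


assignments = GroupAssignments()
assignments.assign("1", "5")
records = [
    {"id": "a", "groupId": "5"},
    {"id": "b", "groupId": "6"},
    {"id": "c", "groupId": "5"},
]
visible_to_user_1 = assignments.visible("1", records)
```

A user with no assignment sees nothing, which is the safer default for access control.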

Notes

  • The goal is to provide clear and structured RESTful endpoints for efficient data retrieval, labeling, updating, and modification tracking.
  • Further refinements (changes) may be necessary based on discussion and technical issues.

🧵 Data Update Pipeline

  1. User modifies the data in Bubble.io.
  2. The modified data is sent from Bubble.io to the Spring Boot application (running in a Docker container, an instance, etc.) via an API request.
  3. Spring Boot performs the following:
    • Retrieves the Synthesized Data (existing data) from MongoDB.
    • Retrieves label information from MySQL, or takes the new value from the update request, to modify the data.
    • Updates the modified data in MongoDB.
    • Stores the modification record (date, modification status, etc.) in a separate collection.
  4. Spring Boot responds with the modified data (updated values + modification record).
  5. Bubble.io receives the API response and updates the UI to display the changes to the user.
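
Steps 3 and 4 can be sketched end to end as follows. This is a simplified Python illustration, not the Spring Boot service: a dict stands in for the MongoDB data collection and a list for the separate modification-record collection. It covers only the full-modification path, with the label lookup left out. The `handle_update` function and the response shape are assumptions.

```python
from datetime import datetime
from typing import Any, Optional

# In-memory stand-ins (assumptions) for the MongoDB collections.
data_collection: dict[str, dict[str, Any]] = {
    "3": {"sql_query": "SELECT * FROM shipping_methods WHERE cost > 20;"},
}
modification_collection: list[dict[str, Any]] = []


def handle_update(data_id: str, field: str, new_value: str,
                  now: Optional[datetime] = None) -> dict[str, Any]:
    """Apply one field update, log it, and return the data plus the log record."""
    now = now or datetime.now()
    existing = data_collection[data_id]        # retrieve the existing data
    old_value = existing.get(field)
    no_update = old_value == new_value
    if not no_update:
        existing[field] = new_value            # update the stored data
    record = {
        "dataId": data_id,
        "modifiedAt": now.isoformat(timespec="seconds"),
        "modifiedFields": [] if no_update else [
            {"field": field, "oldValue": old_value, "newValue": new_value}
        ],
        "noUpdate": no_update,
    }
    modification_collection.append(record)     # store the modification record
    return {"data": dict(existing), "modification": record}  # response to the UI


response = handle_update("3", "sql_query",
                         "SELECT * FROM shipping_methods WHERE cost < 20;",
                         now=datetime(2025, 2, 12, 10, 0, 0))
```

In the real service the data update and the log write would likely need to succeed or fail together, which this sketch does not address.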