Gena Labeling Tool Development: From Planning to PoC
Hello, I am Kim Hyunjin (Andrew), an intern at Gena Co.
In order to create a high-quality dataset for AI training in Gena's text2sql GenaSQL project, which focuses on natural language (NL) to SQL query transformation, I participated in annotating and error-checking the existing dataset.
During this process, I ran into several issues while manually handling CSV files with a Data Analyst on the Gena AI Team. I was given the opportunity to plan and develop a solution to these problems, which led to the creation of the Gena Labeling Tool at the Proof of Concept (PoC) level. I would like to share the development journey with you.
Data Analyst Needs: Inefficient Labeling Process
Through discussions with the data analyst, I was able to identify her pain points. The first was the need for a solution that goes beyond existing open-source tools such as Doccano and Label Studio, whose collaborative features are limited to "labeling functions."
The following challenges needed to be solved:
- The manual editing and error correction of pre-made datasets for text2sql AI training was being done on bulky CSV files, leading to long processing times.
- There was a need for a tool that could help speed up the process by delegating labeling and data management tasks via crowdsourcing in the future.
- The tool should be usable as an internal tool by data analysts for efficient operation.
Based on these points, I further refined the requirements and set the data analyst as the user, which led to the planning and development of the Gena Labeling Tool.
Setting Key Features
Defining User Persona and User Stories
- Admin: Manages data, user roles, logging, reporting, and access control. Can also label, modify, and review data.
- Reviewer (General User): Responsible for labeling, modifying, and reviewing data.
The personas of the two primary users were developed with data analysts, and key features were defined from a developer’s perspective. Admin manages data labeling and updates, while Reviewers use the tool after receiving tasks from Admin.
Through meetings, we organized the candidate features by usage scenario, necessity, difficulty, importance, and the degree of collaboration required with the AI team. Because the internship period was limited, we focused on the most important features first, and PoC-level development covered only the core features.
Thus, for the PoC of the Gena Labeling Tool, we organized the key features around the highest-priority functions for the Admin and the Reviewer.
Key Features
The key features center on file-based data management: data can be uploaded and downloaded as files. Based on the uploaded data, the Admin can create groups and assign them to Reviewers, so that multiple users can perform labeling and modification tasks simultaneously.
Update Data: Event Sourcing Pattern
In a meeting with the AI team and the product team, we decided to adopt the Event Sourcing pattern for the Gena Labeling Tool. The tool allows multiple reviewers to label and update sample data in a dataset, specifically natural language (NL) and SQL query data.
Additionally, since the Admin needs to review and approve the updates made by the reviewers, it was essential to move away from the traditional CRUD approach, where data is overwritten. Instead, Event Sourcing was chosen to track and manage every version of an update. This approach records data changes clearly by type (natural language questions, SQL queries, labels, deletion status, etc.) and makes it possible to restore data to a specific point in time after an incorrect update or mistake.
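To make the contrast with overwrite-style CRUD concrete, here is a minimal Java sketch of an append-only event log that can rebuild a sample's state as of a given time. The names FieldChanged, AppendOnlyLog, and stateAsOf are hypothetical and are not taken from the Gena Labeling Tool code. The sketch also assumes that events are appended in chronological order and kept in memory rather than in a database.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical event type: one change to one field of one sample.
record FieldChanged(long sampleId, String field, String newValue, Instant at) {}

final class AppendOnlyLog {
    private final List<FieldChanged> events = new ArrayList<>();

    // Events are only ever appended; nothing is overwritten or deleted.
    void append(FieldChanged event) {
        events.add(event);
    }

    // Rebuild a sample's state by replaying its events up to and including asOf.
    // Assumes events were appended in chronological order, so later events win.
    Map<String, String> stateAsOf(long sampleId, Instant asOf) {
        Map<String, String> state = new LinkedHashMap<>();
        for (FieldChanged e : events) {
            if (e.sampleId() == sampleId && !e.at().isAfter(asOf)) {
                state.put(e.field(), e.newValue());
            }
        }
        return state;
    }
}
```

Because the log keeps every change, undoing a mistaken update amounts to reading the state as of an earlier timestamp instead of repairing an overwritten row.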
Applying Versioning to Updated Data
The Gena Labeling Tool follows an Event Sourcing pattern with versioning for updates. Consider a dataset uploaded via CSV file: User A and User B both request updates to the same sample data, and the Admin approves updates based on the latest data from a specific user.
In the Gena Labeling Tool, when updating columns (sql_query or natural_question) or labels, each column (or attribute) is stored as a separate event. For instance, updates to sql_query and natural_question are stored as events "b" and "c," respectively.
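The per-column storage described above could be sketched roughly as follows. This is an illustration under assumptions, not the tool's actual code: ColumnEvent, ColumnEventSplitter, the requestedBy field, and the scheme of giving each event the next integer version are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical event: one new value for one column of one sample.
record ColumnEvent(long sampleId, String column, String value, String requestedBy, int version) {}

final class ColumnEventSplitter {
    // Turn one update request (column name -> new value) into one event per column.
    // Versions are assigned sequentially starting at nextVersion (an assumed scheme).
    static List<ColumnEvent> split(long sampleId, String requestedBy,
                                   Map<String, String> changes, int nextVersion) {
        List<ColumnEvent> events = new ArrayList<>();
        int version = nextVersion;
        for (Map.Entry<String, String> change : changes.entrySet()) {
            events.add(new ColumnEvent(sampleId, change.getKey(), change.getValue(),
                                       requestedBy, version++));
        }
        return events;
    }
}
```

With this shape, a request that changes both sql_query and natural_question yields two independent events, so each column can be reviewed and approved on its own.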
While the Event Sourcing pattern typically includes a separate read table, the PoC created only a minimal set of relational tables so that operations could be performed within a single table. The service logic in the Java application handles the Admin's data fetch, returning the latest value of each updated column. Approved events are marked as APPROVED, and only the latest values are stored with an UPDATED status.
Rejected events are not deleted, so users can view and update the data by referencing the versioned events.
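As a rough illustration of the fetch logic, the following Java sketch selects the latest approved value of each column for one sample while leaving rejected events in the history. It is a simplification under assumptions: UpdateEvent, Status, and LatestApprovedReader are hypothetical names, the PENDING status is invented for the example, the UPDATED status mentioned above is not modeled, and events live in an in-memory list rather than in the PoC's table.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical review states; PENDING is an assumption made for this sketch.
enum Status { PENDING, APPROVED, REJECTED }

// Hypothetical versioned event for one column of one sample.
record UpdateEvent(long sampleId, String column, String value, int version, Status status) {}

final class LatestApprovedReader {
    // For one sample, return the highest-version APPROVED value of each column.
    static Map<String, String> latestApproved(List<UpdateEvent> events, long sampleId) {
        Map<String, UpdateEvent> best = new LinkedHashMap<>();
        for (UpdateEvent e : events) {
            if (e.sampleId() != sampleId || e.status() != Status.APPROVED) {
                continue;
            }
            UpdateEvent current = best.get(e.column());
            if (current == null || e.version() > current.version()) {
                best.put(e.column(), e);
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        best.forEach((column, e) -> result.put(column, e.value()));
        return result;
    }

    // Every event for a column, including rejected ones, remains available for reference.
    static List<UpdateEvent> history(List<UpdateEvent> events, long sampleId, String column) {
        List<UpdateEvent> out = new ArrayList<>();
        for (UpdateEvent e : events) {
            if (e.sampleId() == sampleId && e.column().equals(column)) {
                out.add(e);
            }
        }
        return out;
    }
}
```

Filtering at read time is what lets a single table serve both purposes here: the same rows act as the audit trail and as the source of the current values.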
Using ROLE in the MVP: JWT Implementation
As shown in the simple example code above, in the PoC, I focused on implementing a REST API that receives request information, processes it through the service logic, and responds.
However, the PoC did not address separating functionality by role. Therefore, if we were to create an MVP, I think it would be necessary to implement role-based permissions in the Java application. For example, only the Admin should be able to view the update requests made by all Reviewers.
As shown in the graph above, when a user logs in, the user's role is included in the issued JWT. On each request to the Gena Labeling Tool Server, a Spring Security filter verifies this JWT, which enables role-based endpoint control for the MVP.
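The idea of carrying the role inside a signed token can be sketched with the Java standard library alone. This is not the Spring Security setup described above, which an MVP would more likely build with the framework and a JWT library. RoleTokenSketch, the "role" claim name, the ADMIN value, and the regex-based claim lookup are all assumptions made for illustration, and the sketch omits expiry checks and assumes user names contain no quotes or backslashes.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class RoleTokenSketch {
    private static final Base64.Encoder ENC = Base64.getUrlEncoder().withoutPadding();
    private static final Base64.Decoder DEC = Base64.getUrlDecoder();
    // Simplified claim lookup; a real service would use a JSON parser.
    private static final Pattern ROLE = Pattern.compile("\"role\"\\s*:\\s*\"([A-Z_]+)\"");

    private static byte[] hmac(String data, byte[] secret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            return mac.doFinal(data.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    private static String encode(String json) {
        return ENC.encodeToString(json.getBytes(StandardCharsets.UTF_8));
    }

    // Issue an HS256-signed token whose payload carries the user's role.
    static String issue(String username, String role, byte[] secret) {
        String header = encode("{\"alg\":\"HS256\",\"typ\":\"JWT\"}");
        String payload = encode("{\"sub\":\"" + username + "\",\"role\":\"" + role + "\"}");
        String signingInput = header + "." + payload;
        return signingInput + "." + ENC.encodeToString(hmac(signingInput, secret));
    }

    // Return the role only if the signature is valid; otherwise return null.
    static String verifiedRole(String token, byte[] secret) {
        String[] parts = token.split("\\.");
        if (parts.length != 3) {
            return null;
        }
        try {
            byte[] expected = hmac(parts[0] + "." + parts[1], secret);
            // Constant-time comparison of the expected and presented signatures.
            if (!MessageDigest.isEqual(expected, DEC.decode(parts[2]))) {
                return null;
            }
            String json = new String(DEC.decode(parts[1]), StandardCharsets.UTF_8);
            Matcher m = ROLE.matcher(json);
            return m.find() ? m.group(1) : null;
        } catch (IllegalArgumentException e) {
            return null; // malformed Base64
        }
    }

    // Example rule from the text: only the Admin may view all Reviewers' update requests.
    static boolean mayViewAllUpdateRequests(String token, byte[] secret) {
        return "ADMIN".equals(verifiedRole(token, secret));
    }
}
```

Because the role is covered by the signature, a Reviewer cannot promote themselves by editing the payload: the server recomputes the signature and rejects any token that no longer matches.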
Conclusion
The short internship was a great opportunity to collaborate with the team and to plan and develop a tool through regular meetings. It would have been ideal for the project, which started as a PoC, to evolve into the MVP of an actual product, and I regret that time and resource limitations did not allow it. Nevertheless, reaching the first milestone and contributing to a tool that can be used as open source was a valuable experience at Gena Co.