Why I Built This
Most data preprocessing workflows look the same: open a terminal, fire up a Jupyter notebook, write a few lines of pandas to load the CSV, check df.info(), spot the nulls, fix them, normalise a column, and eventually generate a cleaning script. It works, but it means constantly switching context out of your IDE.
I wanted to close that loop. The idea was simple: if you're already working in IntelliJ IDEA, you should be able to load a data file, understand it, clean it, and generate the Python code — without leaving the editor. That's Data Preprocessor.
It's now published on the JetBrains Marketplace and works with IntelliJ IDEA 2023.3 and later.
What the Plugin Does
The plugin adds a Data Preprocessor tool window to the IDE's right-hand sidebar. You can open it directly or right-click any .csv, .xlsx, or .json file in the Project view and choose Open in Data Preprocessor.
Once a file is loaded, it exposes four tabs:
- Preview — a paginated table view of your data
- Profile — per-column statistics (type, null count, null %, unique values, mean, median, std dev, min, max, mode)
- Clean — a point-and-click pipeline builder for transformations
- Code — the auto-generated pandas Python script that mirrors everything you applied
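The Profile tab's per-column statistics are standard descriptive measures. As a minimal sketch of what that computation looks like — hypothetical names, assuming the column's values have already been cast from strings to doubles; this is not the plugin's actual ColumnProfile API:

```java
import java.util.Arrays;

// Hypothetical sketch of per-column statistics; names are
// illustrative, not the plugin's actual ColumnProfile class.
public class ProfileSketch {
    static double mean(double[] v) {
        return Arrays.stream(v).average().orElse(Double.NaN);
    }

    static double median(double[] v) {
        double[] s = v.clone();
        Arrays.sort(s);
        int n = s.length;
        // Even-length arrays average the two middle values.
        return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    // Population standard deviation: sqrt(mean of squared deviations).
    static double stdDev(double[] v) {
        double m = mean(v);
        double sumSq = Arrays.stream(v).map(x -> (x - m) * (x - m)).sum();
        return Math.sqrt(sumSq / v.length);
    }
}
```

Null count, null %, unique values, and mode fall out of a similar single pass over the column.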
The cleaning operations cover the most common preprocessing steps: handling missing values (drop, fill with mean/median/mode, or a custom value), removing duplicates, outlier removal using the IQR fence method, Min-Max or Z-Score normalisation, and type casting between int, float, boolean, and string.
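The IQR fence mentioned above keeps only values inside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. A minimal sketch of that rule — hypothetical names and a linear-interpolation quantile; the plugin's exact quantile method may differ:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the IQR outlier fence; not the plugin's
// actual DataCleaner API.
public class IqrFence {
    // Linear-interpolation quantile over a sorted copy of the data.
    static double quantile(double[] values, double q) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double pos = q * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    static List<Double> removeOutliers(double[] values) {
        double q1 = quantile(values, 0.25);
        double q3 = quantile(values, 0.75);
        double iqr = q3 - q1;
        double lower = q1 - 1.5 * iqr;
        double upper = q3 + 1.5 * iqr;
        List<Double> kept = new ArrayList<>();
        for (double v : values) {
            if (v >= lower && v <= upper) kept.add(v);
        }
        return kept;
    }
}
```

Min-Max and Z-Score normalisation are the same shape: one pass to compute the column statistics, a second pass to rewrite each value.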
When you hit Apply & Generate Code, the Code tab populates with a complete, ready-to-run script. You can save it as a .py file — it opens immediately in the IntelliJ editor — or export the cleaned dataset as a CSV.
The Architecture
The plugin is structured around a clean engine layer that the Swing UI drives:
src/main/java/com/datapreprocessor/
├── model/
│   ├── DataSet.java                    # in-memory tabular model
│   └── ColumnProfile.java              # per-column statistics
├── engine/
│   ├── DataLoader.java                 # CSV/Excel/JSON → DataSet
│   ├── DataCleaner.java                # all transformation logic
│   ├── CodeGenerator.java              # generates pandas script
│   └── DataExporter.java               # CSV and .py file export
├── actions/
│   └── OpenDataFileAction.java
└── toolwindow/
    └── DataPreprocessorToolWindow.java # Swing UI (4 tabs)
DataSet is a straightforward in-memory model — a list of column names and a list of rows as String[] arrays. Keeping everything as strings until the user explicitly casts a column avoids a class of type-inference bugs that show up with mixed-type columns in real-world data.
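A minimal sketch of that string-first model — hypothetical field and method names, not the plugin's actual DataSet class:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the string-first model: everything stays a String
// until the user explicitly casts a column. Names are illustrative.
public class DataSetSketch {
    final List<String> columns = new ArrayList<>();
    final List<String[]> rows = new ArrayList<>();

    int columnIndex(String name) {
        return columns.indexOf(name);
    }

    // Explicit cast: only here do strings become typed values, so a
    // mixed-type column fails loudly instead of being silently
    // mis-inferred at load time.
    List<Double> castToDouble(String column) {
        int idx = columnIndex(column);
        List<Double> out = new ArrayList<>();
        for (String[] row : rows) {
            out.add(Double.parseDouble(row[idx].trim()));
        }
        return out;
    }
}
```

The trade-off is that numeric operations pay a parse cost each time, but for interactively sized files that is cheap compared to the bugs it avoids.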
CodeGenerator builds the pandas script by iterating over the cleaning pipeline — an ordered list of (column, operation) pairs — and emitting the equivalent pandas call for each step. The output always starts with a pd.read_csv(), pd.read_excel(), or pd.read_json() depending on the source file format, and ends with a df.to_csv() export.
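The emission loop can be sketched like this — a hypothetical, cut-down version handling two operations, with illustrative operation names and output; the plugin's actual codegen covers the full operation set and all three read functions:

```java
import java.util.List;

// Hypothetical sketch of pipeline-to-pandas emission; operation names
// and script shape are illustrative, not the plugin's exact output.
public class CodeGenSketch {
    // pipeline: ordered list of {column, operation} pairs.
    static String generate(String sourcePath, List<String[]> pipeline) {
        StringBuilder sb = new StringBuilder("import pandas as pd\n\n");
        sb.append("df = pd.read_csv('").append(sourcePath).append("')\n");
        for (String[] step : pipeline) {
            String col = step[0], op = step[1];
            if (op.equals("drop_nulls")) {
                sb.append("df = df.dropna(subset=['").append(col).append("'])\n");
            } else if (op.equals("fill_mean")) {
                sb.append("df['").append(col).append("'] = df['").append(col)
                  .append("'].fillna(df['").append(col).append("'].mean())\n");
            }
        }
        sb.append("df.to_csv('cleaned.csv', index=False)\n");
        return sb.toString();
    }
}
```

Because the pipeline is an ordered list, the generated script replays the user's clicks in exactly the order they were applied.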
The Hardest Parts
IntelliJ Threading Rules
The IntelliJ Platform has strict threading constraints that catch most first-time plugin developers off guard. Any UI update must happen on the Event Dispatch Thread (EDT). Any file I/O, network access, or anything slow must happen off the EDT — otherwise IntelliJ logs a SlowOperations SEVERE error and, in newer platform versions, throws an exception.
This tripped me up in two places.
The first was saveAsPythonFile() and exportCleanedCsv(). Both were originally writing files directly in button click handlers, which run on the EDT. The fix was wrapping them in a SwingWorker:
new SwingWorker<Void, Void>() {
    @Override
    protected Void doInBackground() throws Exception {
        // Runs on a background thread — safe for file I/O.
        Files.write(targetPath, content.getBytes(StandardCharsets.UTF_8));
        return null;
    }

    @Override
    protected void done() {
        // done() runs back on the EDT; openFile() is deferred via
        // invokeLater so it executes in a write-safe context.
        ApplicationManager.getApplication().invokeLater(() ->
            FileEditorManager.getInstance(project).openFile(virtualFile, true)
        );
    }
}.execute();
The FileEditorManager.openFile() call itself also needs to run in a write-safe context, which is why it's wrapped in invokeLater rather than called directly in done().
The second issue was the OpenDataFileAction. IntelliJ requires actions to declare which thread their update() method runs on — ActionUpdateThread.EDT or ActionUpdateThread.BGT (background thread). Using EDT for an action that checks file types causes a warning. Switching to BGT and using e.getPresentation().setEnabled() instead of directly touching UI components fixed it.
The untilBuild Compatibility Trap
When you publish a plugin, the Gradle IntelliJ plugin requires you to specify sinceBuild and optionally untilBuild in gradle.properties. If you omit untilBuild, the Gradle plugin auto-fills it with 233.* — which caps compatibility at IntelliJ build 233 (2023.3) and blocks update notifications on every newer IDE version.
The fix is to explicitly set it to an empty string:
pluginUntilBuild=
This tells the Marketplace there's no upper bound, so users on 2024.x and 2025.x see updates correctly. The trap is that this is invisible during development — it only surfaces once you're live on the Marketplace and users on newer IDEs can't install updates.
Supporting Excel and JSON Without Breaking CSV
Adding .xlsx and .json support in v1.1.0 seemed straightforward — add format-aware loading in DataLoader, route through a dispatcher, done. The subtle issue was refreshFromDisk(). This method is called when the Reload button is clicked, and it was hardcoded to loadCsv(). When a user loaded an Excel file and hit Reload, it would try to parse binary XLSX data as text and render garbage in the table.
The fix was making load() a format-aware dispatcher that stores the active format on the DataSet, and having refreshFromDisk() call load() instead of loadCsv() directly.
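The shape of that fix can be sketched as follows — hypothetical names and placeholder loader bodies; the point is that the detected format is stored once and Reload goes back through the dispatcher:

```java
// Hypothetical sketch of the format-aware dispatcher: the file
// extension picks the loader, and the chosen format is remembered so
// Reload re-uses the same path instead of assuming CSV.
public class LoaderDispatchSketch {
    enum Format { CSV, XLSX, JSON }

    Format activeFormat;

    Format detectFormat(String path) {
        String lower = path.toLowerCase();
        if (lower.endsWith(".xlsx")) return Format.XLSX;
        if (lower.endsWith(".json")) return Format.JSON;
        return Format.CSV;
    }

    String load(String path) {
        activeFormat = detectFormat(path);  // remembered for Reload
        switch (activeFormat) {
            case XLSX: return "loaded via Excel reader";
            case JSON: return "loaded via JSON reader";
            default:   return "loaded via CSV reader";
        }
    }

    // Reload now routes through load(), not a hardcoded CSV loader.
    String refreshFromDisk(String path) {
        return load(path);
    }
}
```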
Publishing to JetBrains Marketplace
The publishing process is more straightforward than it looks once you get past the initial setup. The key steps:
- Create a JetBrains Marketplace account and generate a publish token
- Set the token as PUBLISH_TOKEN in your environment
- Run ./gradlew publishPlugin
The Gradle plugin handles packaging and upload. JetBrains runs a manual review for new plugins (usually 1–3 business days) before they go live. Updates to existing plugins are published immediately.
One thing worth knowing: always use the Gradle wrapper (./gradlew) and never a system-installed Gradle. The build is pinned to Gradle 8.6 via gradle/wrapper/gradle-wrapper.properties. Running Gradle 9.x from a system install will fail with dependency resolution errors because the IntelliJ Gradle plugin hasn't caught up to 9.x's API changes yet.
What I'd Do Differently
Start with threading constraints in mind. The EDT/BGT rules aren't optional or edge cases — they're enforced by the platform. If I were starting again, I'd wire up SwingWorker for any I/O from day one rather than retrofitting it later.
Stub the UI against a fake DataSet first. I built the Swing UI and the engine in parallel, which meant the UI was always waiting on the engine to stabilise. A fake in-memory DataSet with hardcoded rows would have let me nail the layout and tab interaction before any real data loading worked.
Test with messy real-world files early. The sample data I used during development was clean and well-typed. When I tested with real CSVs — mixed types, inconsistent null representations, extra whitespace in headers — several edge cases surfaced that required changes to the profiling logic.
Wrapping Up
Building an IDE plugin is a different experience from building a web service or a mobile app. The platform is opinionated, the threading model is strict, and the feedback loop during development (run the Gradle sandbox, wait for a full IDE to boot, test, kill, repeat) is slower than hot-reload web development. But the end result is something that genuinely lives inside the tools developers use every day, which feels different from shipping a web app.
The plugin is free, open source under MIT, and available now on the JetBrains Marketplace. If you work with data files in IntelliJ, give it a try.