Skip to content

supermlorg/superml-java

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

25 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

SuperML Logo

By SuperML.org with SuperML.dev

SuperML Java

Build Status Performance Tests Version

A comprehensive, modular Java Machine Learning Framework inspired by scikit-learn, developed by the SuperML community.

Overview

SuperML Java 2.1.0 is a sophisticated 22-module machine learning library for Java that delivers enterprise-grade performance with 400K+ predictions/second and 22/22 modules compiling successfully. The framework provides:

  • 🎯 Supervised Learning: 11+ algorithms including Logistic Regression, Linear Regression, Ridge, Lasso, Decision Trees, Random Forest, XGBoost with lightning-fast training
  • πŸ” Unsupervised Learning: K-Means clustering with k-means++ initialization and advanced convergence criteria
  • βš™οΈ Data Preprocessing: Feature scaling, normalization, encoding, and comprehensive transformation utilities
  • πŸ”§ Model Selection: Cross-validation, hyperparameter tuning (Grid/Random Search), and automated optimization
  • πŸš€ Pipeline System: Seamless chaining of preprocessing and models like scikit-learn
  • πŸ€– AutoML Framework: Automated algorithm selection and hyperparameter optimization with ensemble methods
  • πŸ“Š Dual-Mode Visualization: Professional XChart GUI with ASCII terminal fallback
  • 🌐 Kaggle Integration: One-line training on any Kaggle dataset with automated workflows
  • ⚑ Inference Engine: High-performance model serving with microsecond predictions, caching, and monitoring
  • πŸ“ˆ Comprehensive Metrics: Complete evaluation suite for classification, regression, and clustering
  • πŸ’Ύ Model Persistence: Save/load models with automatic statistics capture and version management
  • πŸ”„ Cross-Platform Export: ONNX and PMML support for enterprise deployment
  • πŸ“± Drift Detection: Real-time model and data drift monitoring with automated alerts
  • πŸ“š Professional Logging: Configurable Logback/SLF4J logging framework
  • 🏭 Production Ready: Enterprise-grade error handling, validation, and concurrent processing

πŸš€ Quick Start

Basic Classification with Visualization

import org.superml.datasets.Datasets;
import org.superml.linear_model.LogisticRegression;
import org.superml.pipeline.Pipeline;
import org.superml.preprocessing.StandardScaler;
import org.superml.model_selection.ModelSelection;
import org.superml.visualization.VisualizationFactory;

// Load data and create pipeline
Datasets.Dataset dataset = Datasets.loadIris();
Pipeline pipeline = new Pipeline()
    .addStep("scaler", new StandardScaler())
    .addStep("classifier", new LogisticRegression());

// Train and evaluate
ModelSelection.TrainTestSplit split = ModelSelection.trainTestSplit(dataset.X, dataset.y, 0.2, 42);
pipeline.fit(split.XTrain, split.yTrain);
double[] predictions = pipeline.predict(split.XTest);

// Professional visualization (GUI + ASCII fallback)
VisualizationFactory.createDualModeConfusionMatrix(split.yTest, predictions, 
    new String[]{"Setosa", "Versicolor", "Virginica"}).display();

AutoML - One Line Training

import org.superml.autotrainer.AutoTrainer;

// Automated algorithm selection and optimization
AutoTrainer.AutoMLResult result = AutoTrainer.autoML(dataset.X, dataset.y, "classification");
System.out.println("Best Algorithm: " + result.getBestAlgorithm());
System.out.println("Best Score: " + result.getBestScore());

🌐 Kaggle Integration

Train on any Kaggle dataset with one line:

import org.superml.datasets.KaggleTrainingManager;
import org.superml.datasets.KaggleIntegration.KaggleCredentials;

KaggleCredentials credentials = KaggleCredentials.fromDefaultLocation();
KaggleTrainingManager trainer = new KaggleTrainingManager(credentials);

// Configure training with model saving
KaggleTrainingManager.TrainingConfig config = new KaggleTrainingManager.TrainingConfig()
    .setSaveModels(true)
    .setModelsDirectory("kaggle_models")
    .setAlgorithms("logistic", "ridge")
    .setGridSearch(true);

List<KaggleTrainingManager.TrainingResult> results = trainer.trainOnDataset("titanic", "titanic", "survived", config);
System.out.println("Best model: " + results.get(0).algorithm);
System.out.println("Model saved to: " + results.get(0).modelFilePath);

⚑ Performance Highlights

SuperML Java 2.1.0 delivers enterprise-grade performance across all 22 modules:

πŸ—οΈ Build & Deployment

  • βœ… 22/22 modules compile successfully (100% build success rate)
  • ⚑ ~4 minute full framework build time
  • πŸ§ͺ 145+ tests pass across all modules with comprehensive coverage
  • πŸ“¦ Production-ready JARs with complete dependency resolution

πŸš€ Runtime Performance

  • ⚑ 400,000+ predictions/second - XGBoost batch inference
  • πŸ”₯ 35,714 predictions/second - Production pipeline throughput
  • βš™οΈ ~6.88 microseconds - Single prediction latency
  • 🧠 Real-time neural networks - MLP/CNN/RNN with epoch-by-epoch training

🎯 Algorithm Benchmarks

  • XGBoost: Lightning-fast training (2.5 seconds) with early stopping & hyperparameter optimization
  • Neural Networks: Full training cycles with comprehensive loss tracking (46 tests passed)
  • Random Forest: Superior accuracy (89%+) with parallel tree construction
  • Linear Models: Millisecond training times with L1/L2 regularization (34 tests passed)

🌟 Advanced Capabilities

  • 🎲 Cross-Validation: Robust 5-fold CV with parallel execution
  • πŸ” AutoML: Automated hyperparameter tuning with grid/random search
  • πŸ“Š Kaggle Integration: Complete competition workflows from data to submission
  • πŸ’Ύ Model Persistence: High-speed serialization with automatic statistics capture
  • πŸ“ˆ Production Monitoring: Real-time drift detection and performance tracking

All benchmarks verified on comprehensive test suite with synthetic and real-world datasets.

πŸ’Ύ Model Persistence

Save and load trained models with automatic training statistics capture:

import org.superml.persistence.ModelPersistence;
import org.superml.persistence.ModelManager;

// Train a model
LogisticRegression model = new LogisticRegression().setMaxIter(1000);
model.fit(X_train, y_train);

// Save with automatic performance evaluation and statistics
ModelPersistence.saveWithStats(model, "my_model", 
                               "Production iris classifier", 
                               X_test, y_test);

// Load model with type safety
LogisticRegression loadedModel = ModelPersistence.load("my_model", LogisticRegression.class);
double[] predictions = loadedModel.predict(X_test);

// The framework automatically captures:
// - Performance metrics (accuracy, precision, recall, F1)
// - Dataset statistics and hyperparameters
// - System information and timestamps

// Manage multiple models with automatic statistics
ModelManager manager = new ModelManager("models");
String savedPath = manager.saveModel(model, "iris");
List<String> allModels = manager.listModels();

🎯 Features

Algorithms (12+ Implementations)

  • Linear Models (6 algorithms):

    • Logistic Regression with automatic multiclass support and L1/L2 regularization
    • Linear Regression with normal equation and closed-form solution
    • Ridge Regression with L2 regularization
    • Lasso Regression with L1 regularization and coordinate descent
    • SGD Classifier/Regressor with stochastic optimization
    • Advanced regularization and convergence strategies
  • Tree-Based Models (5 algorithms):

    • Decision Tree with CART implementation (classification & regression)
    • Random Forest with bootstrap aggregating and parallel training
    • Gradient Boosting with early stopping and validation monitoring
    • Advanced ensemble methods with feature importance
    • Optimized splitting criteria and pruning strategies
  • Clustering (1 algorithm):

    • K-Means with k-means++ initialization, multiple restarts, and convergence monitoring

Data Processing & Pipeline

  • Advanced Preprocessing: StandardScaler, MinMaxScaler, RobustScaler, LabelEncoder
  • Data Management: CSV loading, synthetic data generation, built-in datasets (Iris, Wine, etc.)
  • Pipeline System: Seamless chaining of preprocessing steps and models
  • Feature Engineering: Comprehensive transformation utilities

Model Selection & AutoML

  • Hyperparameter Optimization: Grid Search and Random Search with parallel execution
  • Cross-Validation: K-fold validation with comprehensive metrics and statistical analysis
  • AutoML Framework: Automated algorithm selection, hyperparameter tuning, and ensemble building
  • Parameter Spaces: Discrete, continuous, and integer parameter configurations

Visualization & Monitoring

  • Dual-Mode Visualization: Professional XChart GUI with ASCII terminal fallback
  • Interactive Charts: Confusion matrices, scatter plots, cluster visualizations
  • Performance Monitoring: Real-time inference metrics and model performance tracking
  • Drift Detection: Automated data and model drift monitoring with statistical tests

Production & Enterprise

  • High-Performance Inference: Microsecond predictions with intelligent caching and batch processing
  • Model Persistence: Save/load models with automatic training statistics and metadata capture
  • Cross-Platform Export: ONNX and PMML support for enterprise deployment
  • Kaggle Integration: Direct dataset download and automated competition workflows
  • Professional Logging: Structured logging with Logback and SLF4J
  • Thread Safety: Concurrent prediction capabilities after model training

πŸ“¦ Installation

Maven Dependency (Complete Framework)

<dependency>
    <groupId>org.superml</groupId>
    <artifactId>superml-bundle-all</artifactId>
    <version>2.0.0</version>
</dependency>

Modular Installation (Select Components)

<!-- Core + Linear Models (Minimal) -->
<dependency>
    <groupId>org.superml</groupId>
    <artifactId>superml-core</artifactId>
    <version>2.0.0</version>
</dependency>
<dependency>
    <groupId>org.superml</groupId>
    <artifactId>superml-linear-models</artifactId>
    <version>2.0.0</version>
</dependency>

<!-- Add Visualization -->
<dependency>
    <groupId>org.superml</groupId>
    <artifactId>superml-visualization</artifactId>
    <version>2.0.0</version>
</dependency>

<!-- Add AutoML -->
<dependency>
    <groupId>org.superml</groupId>
    <artifactId>superml-autotrainer</artifactId>
    <version>2.0.0</version>
</dependency>

Build from Source

git clone https://github.com/superml/superml-java.git
mvn clean install

πŸ’» Usage

Basic Classification

import org.superml.datasets.Datasets;
import org.superml.linear_model.LogisticRegression;
import org.superml.tree.RandomForest;
import org.superml.tree.GradientBoosting;
import org.superml.metrics.Metrics;
import org.superml.model_selection.ModelSelection;
import org.superml.preprocessing.StandardScaler;

// Load dataset
Datasets.Dataset dataset = Datasets.loadIris();
ModelSelection.TrainTestSplit split = ModelSelection.trainTestSplit(dataset.X, dataset.y, 0.2, 42);

// Preprocessing
StandardScaler scaler = new StandardScaler();
double[][] XTrainScaled = scaler.fitTransform(split.XTrain);
double[][] XTestScaled = scaler.transform(split.XTest);

// Train multiple models
LogisticRegression lr = new LogisticRegression().setMaxIterations(1000);
RandomForest rf = new RandomForest().setNEstimators(100);
GradientBoosting gb = new GradientBoosting().setNEstimators(100).setLearningRate(0.1);

lr.fit(XTrainScaled, split.yTrain);
rf.fit(XTrainScaled, split.yTrain);
gb.fit(XTrainScaled, split.yTrain);

// Compare performance
double lrAccuracy = Metrics.accuracy(split.yTest, lr.predict(XTestScaled));
double rfAccuracy = Metrics.accuracy(split.yTest, rf.predict(XTestScaled));
double gbAccuracy = Metrics.accuracy(split.yTest, gb.predict(XTestScaled));

System.out.printf("Logistic Regression: %.3f\n", lrAccuracy);
System.out.printf("Random Forest: %.3f\n", rfAccuracy);
System.out.printf("Gradient Boosting: %.3f\n", gbAccuracy);

Cross-Validation and Model Evaluation

import org.superml.model_selection.CrossValidation;

// Basic cross-validation
LogisticRegression classifier = new LogisticRegression();
CrossValidation.CrossValidationResults results = 
    CrossValidation.crossValidate(classifier, X, y);

System.out.println("Accuracy: " + results.getMeanScore("accuracy") + 
                   " Β± " + results.getStdScore("accuracy"));

// Custom cross-validation configuration
CrossValidation.CrossValidationConfig config = 
    new CrossValidation.CrossValidationConfig()
        .setFolds(10)
        .setShuffle(true)
        .setRandomSeed(42L)
        .setMetrics("accuracy", "precision", "recall", "f1");

CrossValidation.CrossValidationResults detailedResults = 
    CrossValidation.crossValidate(classifier, X, y, config);

// Regression cross-validation
Ridge regressor = new Ridge();
CrossValidation.CrossValidationResults regressionResults = 
    CrossValidation.crossValidateRegression(regressor, X, y, 
        new CrossValidation.CrossValidationConfig());

Advanced Hyperparameter Tuning

import org.superml.model_selection.HyperparameterTuning;

// Grid Search for Classification
HyperparameterTuning.TuningResults gridResults = HyperparameterTuning.gridSearch(
    new LogisticRegression(),
    X, y,
    HyperparameterTuning.ParameterSpec.discrete("learningRate", 0.01, 0.1, 0.5),
    HyperparameterTuning.ParameterSpec.discrete("maxIter", 500, 1000, 1500)
);

System.out.println("Best parameters: " + gridResults.getBestParameters());
System.out.println("Best score: " + gridResults.getBestScore());

// Grid Search for Regression
HyperparameterTuning.TuningResults regressionGrid = 
    HyperparameterTuning.gridSearchRegressor(
        new Ridge(),
        X, y,
        HyperparameterTuning.ParameterSpec.discrete("alpha", 0.1, 1.0, 10.0),
        HyperparameterTuning.ParameterSpec.continuous("tolerance", 1e-6, 1e-3, 5)
    );

// Random Search with Custom Configuration
HyperparameterTuning.TuningConfig advancedConfig = 
    new HyperparameterTuning.TuningConfig()
        .setScoringMetric("f1")
        .setCvFolds(5)
        .setParallel(true)
        .setVerbose(true)
        .setRandomSeed(123L);

HyperparameterTuning.TuningResults randomResults = 
    HyperparameterTuning.RandomSearch.search(
        new LogisticRegression(),
        X, y,
        Arrays.asList(
            HyperparameterTuning.ParameterSpec.discrete("learningRate", 0.001, 0.01, 0.1, 0.5),
            HyperparameterTuning.ParameterSpec.integer("maxIter", 100, 2000)
        ),
        advancedConfig
    );

// Parameter Specifications
// Discrete values
HyperparameterTuning.ParameterSpec.discrete("param", "A", "B", "C");

// Continuous range with specified steps
HyperparameterTuning.ParameterSpec.continuous("learning_rate", 0.001, 0.1, 10);

// Integer range
HyperparameterTuning.ParameterSpec.integer("max_depth", 1, 20);

Model Persistence and Management

import org.superml.persistence.ModelPersistence;
import org.superml.persistence.ModelManager;

// Train and save a pipeline
Pipeline pipeline = new Pipeline()
    .addStep("scaler", new StandardScaler())
    .addStep("classifier", new LogisticRegression());

pipeline.fit(X_train, y_train);

// Save with rich metadata
Map<String, Object> metadata = Map.of(
    "accuracy", Metrics.accuracy(y_test, pipeline.predict(X_test)),
    "features", X_train[0].length,
    "samples", X_train.length,
    "created_by", "SuperML_Demo"
);

ModelPersistence.save(pipeline, "production_model", "Main classification pipeline", metadata);

// Later, load and use the model
Pipeline loadedPipeline = ModelPersistence.load("production_model", Pipeline.class);
double[] predictions = loadedPipeline.predict(X_new);

// Model management
ModelManager manager = new ModelManager("models");
List<ModelManager.ModelInfo> models = manager.getModelsInfo();
for (ModelManager.ModelInfo info : models) {
    System.out.println(info); // Shows class, size, save time, etc.
}

πŸš€ Inference Layer

Deploy models in production with high-performance inference capabilities:

import org.superml.inference.InferenceEngine;
import org.superml.inference.BatchInferenceProcessor;

// Create inference engine and load model
InferenceEngine engine = new InferenceEngine();
engine.loadModel("classifier", "models/trained_model.superml");

// Single prediction
double prediction = engine.predict("classifier", features);

// Batch prediction with monitoring
double[] batchPredictions = engine.predict("classifier", batchFeatures);

// Asynchronous inference
CompletableFuture<Double> future = engine.predictAsync("classifier", features);

// Performance metrics
InferenceMetrics metrics = engine.getMetrics("classifier");
System.out.println("Throughput: " + metrics.getThroughputSamplesPerSecond() + " samples/sec");

// Batch processing for large datasets
BatchInferenceProcessor processor = new BatchInferenceProcessor(engine);
BatchResult result = processor.processCSV("input.csv", "output.csv", "classifier");

πŸ“š Documentation

🀝 Contributing

We welcome contributions to SuperML Java! Please see our Contributing Guide for details.

Ways to Contribute

  • Code: Implement new algorithms, improve performance, fix bugs
  • Documentation: Improve guides, add examples, write tutorials
  • Testing: Add test cases, improve coverage, performance testing
  • Community: Help others, report issues, suggest features

Development Setup

git clone https://github.com/superml/superml-java.git
mvn clean compile
mvn test

Code Coverage

SuperML Java includes comprehensive code coverage analysis using JaCoCo:

# Run tests and generate coverage report
mvn clean test jacoco:report

# Use the provided coverage script for detailed analysis
./coverage.sh --summary    # Show coverage summary
./coverage.sh --open       # Open HTML report in browser

Coverage Reports:

  • HTML Report: target/site/jacoco/index.html (visual coverage report)
  • Coverage Summary: Use ./coverage.sh --summary for quick overview
  • Detailed Analysis: See docs/CODE_COVERAGE_REPORT.md

Current Status:

  • -> Multiclass Classification: 85%+ coverage (LogisticRegression, SoftmaxRegression)
  • ⚠️ Tree Algorithms: 0% coverage (new v2.0 features needing tests)
  • ⚠️ Linear Models: 0% coverage (LinearRegression, Ridge, Lasso need tests)

🌟 Community & Support

πŸ† Attribution

SuperML Java is developed and maintained by the SuperML Community:

This project is inspired by scikit-learn and aims to bring the same ease of use and comprehensive functionality to the Java ecosystem.

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License Summary

  • -> Commercial use - Use in commercial projects
  • -> Modification - Modify and distribute
  • -> Distribution - Distribute original or modified
  • -> Private use - Use for private projects
  • ❗ License and copyright notice - Include in all copies
  • ❌ Liability - No warranty provided
  • ❌ Trademark use - SuperML trademarks not included

🎯 Project Status

Build Status Java Version License Version

Current Version: 1.0-SNAPSHOT
Stability: Beta - Core features complete, API may change
Java Compatibility: Java 11+
Dependencies: Minimal - only essential libraries


Made with ❀️ by the SuperML Community | Visit superML.dev for more projects

About

Modular machine learning framework foreign java for ML model training

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages