An End-to-End Machine Learning Pipeline on Time-Series Data

CrazyTomato1 pts0 comments

From Raw Data to Trained Models: An ML Journey | by DolphinDB | MediumSitemapOpen in appSign up<br>Sign in

Medium Logo

Get app<br>Write

Search

Sign up<br>Sign in

From Raw Data to Trained Models: An ML Journey

DolphinDB

10 min read·<br>Nov 12, 2025

Listen

Share

Press enter or click to view image in full size

As data-driven applications demand faster insights, teams need ML platforms that integrate seamlessly with their data infrastructure. DolphinDB provides a complete machine learning environment — combining built-in algorithms, plugin support for popular libraries, and distributed computing capabilities — all within a unified platform designed for time-series and analytical workloads.<br>This article walks through the complete ML workflow in DolphinDB: from data ingestion and preprocessing to model training, evaluation, and deployment. We’ll explore supervised and unsupervised learning, demonstrate distributed training at scale, and conclude with a production financial case study.<br>A First Look<br>Let’s begin with a compact, well-understood dataset to focus on the mechanics of a complete ML workflow in DolphinDB before scaling to production scenarios.<br>We use the wine dataset provided by UCI Machine Learning Repository to train our first random forest classification model.<br>(1) Data Import<br>Download the dataset and save it in file /chapter13/wine.data. Import the data into DolphinDB using the loadText function:<br>wineSchema = table(<br>["Label","Alcohol","MalicAcid","Ash","AlcalinityOfAsh","Magnesium","TotalPhenols",<br>"Flavanoids","NonflavanoidPhenols","Proanthocyanins","ColorIntensity","Hue",<br>"OD280_OD315","Proline"] as name,<br>["INT","DOUBLE","DOUBLE","DOUBLE","DOUBLE","DOUBLE","DOUBLE","DOUBLE","DOUBLE","DOUBLE",<br>"DOUBLE","DOUBLE","DOUBLE","DOUBLE"] as type<br>wine = loadText("/chapter13/wine.data", schema=wineSchema)(2) Data Preprocessing<br>DolphinDB’s randomForestClassifier function requires that the classification labels be integers in the range [0, classNum). The labels in the downloaded wine dataset are 1, 2, 3, so we use the following script to update the labels.<br>update wine set Label = Label - 1Then define function trainTestSplit to split the dataset into training and testing sets with a 7:3 ratio.<br>def trainTestSplit(x, testRatio) {<br>xSize = x.size()<br>testSize = xSize * testRatio<br>r = (0..(xSize-1)).shuffle()<br>return x[r > testSize], x[r (3) Random Forest Classification<br>Perform random forest classification on the training set with function randomForestClassifier. The function has four required parameters:<br>ds: The input data source (usually generated using the sqlDS function).<br>yColName: The column name of the dependent variable in the data source.<br>xColNames: The column names of the dependent variables in the data source.<br>numClasses: The number of classes.<br>model = randomForestClassifier(<br>sqlDS(),<br>yColName=`Label,<br>xColNames=["Alcohol","MalicAcid","Ash","AlcalinityOfAsh","Magnesium","TotalPhenols",<br>"Flavanoids","NonflavanoidPhenols","Proanthocyanins","ColorIntensity","Hue",<br>"OD280_OD315","Proline"],<br>numClasses=3<br>)(4) Prediction and Persistence<br>To predict test data with the trained model, use predict(model, X). Trained model can be persisted to disk using saveModel, and loaded from disk with loadModel.<br>// Predict the test set with the trained model<br>predicted = model.predict(wineTest)<br>// Examine the prediction accuracy<br>sum(predicted == wineTest.Label) \ wineTest.size();<br>// Output: 0.925926<br>// Persist model to disk<br>modelPath = "/chapter13/wineModel.bin"<br>model.saveModel(modelPath)<br>// Persisted model can be loaded from disk<br>model = loadModel(modelPath)Through this example, we can summarize the typical steps for machine learning in DolphinDB: data preprocessing, model building, data prediction, and model persistence.<br>Exploring Different Learning Approaches<br>Machine learning tasks can be broadly categorized into two primary types:<br>Supervised Learning is a type of machine learning where the model is trained on a labeled dataset (i.e., the target or outcome variable is known). Supervised learning is commonly used for risk assessment, image recognition, predictive analytics and fraud detection. Common supervised learning algorithms include linear regression, logistic regression, decision trees, support vector machines, and neural networks.<br>Unsupervised Learning draws inferences from unlabeled datasets, facilitating exploratory data analysis and enabling pattern recognition and predictive modeling. Common algorithms include clustering (e.g., K-means and hierarchical clustering) and dimensionality reduction (e.g., PCA and factor analysis).<br>In the following sections, we will explore detailed examples of how to implement supervised and unsupervised learning tasks in DolphinDB.<br>Supervised Learning: XGBoost Classification<br>Let’s explore supervised learning using XGBoost (eXtreme Gradient Boosting), a powerful ensemble method that combines gradient boosting with sophisticated regularization. XGBoost iteratively trains decision trees, with each...

data learning model double dolphindb supervised

Related Articles