Linear Pipeline For Data Science Workflow

A data science workflow can be iterative and take circuitous paths. Adding to this complexity, the data scientist is usually left to remember how the workflow has progressed over the course of a project.

Speedml memorizes the data science workflow for the data scientist.

It does so using the simple Speedml.eda method. In release 0.9.2 we further optimize the method, making it user configurable and having it update progressively based on the workflow status.

Progressively updating workflow status

Now when you call the Speedml.eda method at the start of your workflow, during pre-processing, and before the model run, it returns a table that progressively hides the metrics that are complete.

Within the same notebook you can scroll to the previous or next EDA result to see what changed as a result of your workflow steps.

This makes a call to Speedml.eda akin to an automatically updating to-do list.
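To illustrate the idea of a checklist that shrinks as work completes, here is a minimal pandas sketch. It is not Speedml's implementation: the function `eda_todo` and the two checks it runs are hypothetical, chosen only to show how completed items drop out of the table.

```python
import pandas as pd

def eda_todo(df):
    """Return a table listing only the checks that still need attention."""
    checks = {
        # Columns that still contain missing values.
        'Nulls': [c for c in df.columns if df[c].isnull().any()],
        # Columns that still hold text rather than numbers.
        'Text': [c for c in df.columns if pd.api.types.is_string_dtype(df[c])],
    }
    # Keep only checks with outstanding columns; completed checks are hidden.
    pending = {name: cols for name, cols in checks.items() if cols}
    return pd.DataFrame({'Results': list(pending.values())},
                        index=list(pending.keys()))

train = pd.DataFrame({'Age': [22.0, None, 35.0],
                      'Sex': ['male', 'female', 'male']})
before = eda_todo(train)   # lists both 'Nulls' and 'Text'

# One workflow step: fill the missing ages.
train['Age'] = train['Age'].fillna(train['Age'].median())
after = eda_todo(train)    # the 'Nulls' row is gone, 'Text' remains
```

Comparing `before` and `after` mirrors scrolling between two EDA results in the same notebook.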

See how this feature works in the notebook Titanic Solution Using Speedml from our GitHub repository.

Pipelining from EDA to pre-processing

The Speedml.eda method now returns a list of features instead of tuples with cardinality. This lets you take the cell output straight into a pandas DataFrame filter or into feature engineering methods like feature.density or feature.labels for the next stage of the workflow.

Cardinality is still available for three bands - high, normal (within threshold), and continuous or unique. For most workflows this information is enough.
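The three bands can be pictured with a small pandas sketch. This is an assumption about how such banding might work, not Speedml's actual rule: the helper `cardinality_band`, its parameter names, and the order in which the checks run are all hypothetical.

```python
import pandas as pd

def cardinality_band(series, high_cardinality=10, unique_ratio=80):
    """Place one feature into a cardinality band (hypothetical helper)."""
    distinct = series.nunique()
    # Distinct values as a percentage of the number of samples.
    percent_unique = 100.0 * distinct / len(series)
    if percent_unique >= unique_ratio:
        return 'continuous or unique'
    if distinct > high_cardinality:
        return 'high'
    return 'normal'

df = pd.DataFrame({
    'PassengerId': range(1, 101),                   # every value distinct
    'Ticket': [f'T{i % 25}' for i in range(100)],   # 25 distinct values
    'Sex': ['male', 'female'] * 50,                 # 2 distinct values
})
bands = {col: cardinality_band(df[col]) for col in df.columns}
```

Here `PassengerId` lands in the continuous or unique band, `Ticket` in the high band, and `Sex` in the normal band.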

The following code demonstrates how we can pipeline results from the Speedml.eda method into the next stage of our workflow for pre-processing the features.

    # Display top 5 samples with text unique features
    sml.train[sml.eda().get_value('Text Unique', 'Results')].head()

    # Convert categorical text features to numeric labels
    text_categoricals = sml.eda().get_value('Text Categorical', 'Results')
    sml.feature.labels(text_categoricals)
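As a rough illustration of what converting text categories to numeric labels involves, the sketch below uses plain pandas. It assumes that "numeric labels" means one integer code per distinct value; the codes Speedml assigns may be ordered differently, and the list `text_categoricals` is a hand-written stand-in for the list returned by EDA.

```python
import pandas as pd

train = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                      'Embarked': ['S', 'C', 'S']})
text_categoricals = ['Sex', 'Embarked']   # stand-in for the EDA result

for col in text_categoricals:
    # factorize assigns integer codes in order of first appearance.
    codes, categories = pd.factorize(train[col])
    train[col] = codes
```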

The Speedml.feature.density method now takes a feature name string or a list of feature name strings as its parameter to create density features for one or more high-cardinality features. This way you can pipe the eda method’s high-cardinality features list to the density method like so.

    # Create density features for high-cardinality text features
    text_high_cardinality = sml.eda().get_value('Text High-cardinality', 'Results')
    sml.feature.density(text_high_cardinality)

That is easy.
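For readers unfamiliar with the term, a density feature can be sketched in pandas as follows. This assumes that density means how often each value occurs in the column; the helper `add_density` and the `_density` column suffix are illustrative names, not a confirmed part of the Speedml API.

```python
import pandas as pd

def add_density(df, features):
    """Add a '<feature>_density' count column for each named feature."""
    if isinstance(features, str):
        features = [features]   # accept a single name or a list of names
    for name in features:
        # Number of rows sharing each distinct value.
        counts = df[name].value_counts()
        df[name + '_density'] = df[name].map(counts)
    return df

train = pd.DataFrame({'Ticket': ['A1', 'B2', 'A1', 'C3', 'A1']})
add_density(train, 'Ticket')
```

Accepting either a string or a list is what allows a list of feature names from an earlier step to be passed in unchanged.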

User configurable EDA rules

Speedml EDA rules are now configurable using the API. You can configure how Speedml analyzes outliers, over-fitting, high-cardinality, unique or continuous features.

    # Display the configuration dictionary
    sml.config

    # Output data path used internally within Speedml methods
    sml.configure('outpath', 'output/')

    # Positive and negative skew within +- this value
    sml.configure('outliers_threshold', 3)

    # #Features / #Samples in train < this value
    sml.configure('overfit_threshold', 0.01)

    # Feature is high-cardinality if categories > this value
    sml.configure('high_cardinality', 10)

    # Feature is unique (continuous) if this percentage of its values do not repeat
    sml.configure('unique_ratio', 80)

Of course, Speedml sets up sensible defaults so you do not have to.

    # Display the configuration dictionary
    sml.configuration()
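The pattern of defaults plus per-option overrides can be sketched in plain Python. The `Config` class below is hypothetical and only mimics the method names used in this post; the values in `DEFAULTS` are copied from the example calls above and are not confirmed to be Speedml's real defaults.

```python
# Hypothetical defaults, copied from the example values shown earlier.
DEFAULTS = {
    'outpath': 'output/',
    'outliers_threshold': 3,
    'overfit_threshold': 0.01,
    'high_cardinality': 10,
    'unique_ratio': 80,
}

class Config:
    """Minimal stand-in for a configurable rule set."""

    def __init__(self):
        # Start from a copy so the shared defaults are never mutated.
        self._values = dict(DEFAULTS)

    def configure(self, option, value):
        # Reject misspelled option names instead of silently adding them.
        if option not in self._values:
            raise KeyError(f'Unknown option: {option}')
        self._values[option] = value

    def configuration(self):
        # Return a copy so callers cannot change settings by accident.
        return dict(self._values)

cfg = Config()
cfg.configure('high_cardinality', 15)   # override one rule, keep the rest
```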

Outlier detection during EDA

The Speedml.eda method now performs automatic outlier detection based on how far the feature values are skewed from a normal distribution. The outlier detection threshold is user configurable like so.

    # Positive and negative skew within +- this value
    sml.configure('outliers_threshold', 3)
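A skew-based check of this kind might look like the pandas sketch below. It assumes that a feature is flagged when the absolute value of its sample skewness exceeds the threshold; the helper `skewed_features` is hypothetical and Speedml's own test may differ.

```python
import pandas as pd

def skewed_features(df, outliers_threshold=3):
    """Return numeric columns whose skew falls outside +/- the threshold."""
    skew = df.select_dtypes(include='number').skew()
    return skew[skew.abs() > outliers_threshold].index.tolist()

train = pd.DataFrame({
    'Fare': [10.0] * 99 + [5000.0],   # one extreme value, strongly right-skewed
    'Age': list(range(100)),          # evenly spread, symmetric
})
flagged = skewed_features(train)
```

Only `Fare` is flagged here, because a single extreme value pulls its skew far past the threshold while `Age` stays symmetric.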