$\begingroup$

I am working in Azure ML Studio and trying to build a regression model to predict a numerical value. Below I describe my features and what I have done so far.

My data has about 3 million rows.

Features:

8 integer features from 1 to 25

2 boolean features with 0 and 1

3 integer features from 1 to 10

2 integer features, one from 0 to 500,000 and the other from 0 to 1,000,000, with about 4,500 unique values

1 integer feature from 20 to 50

1 integer feature from 1 to 15

1 integer feature from 0 to 100

Label:

Integer from 10,000 to 100,000,000 with about 5,000 unique values

What I have done:

Split the dataset into 80% (train) and 20% (test). Then I split the training set again into 60% (actual train) and 40% (validation).

Normalize the features with many unique values (the 4th bullet in the list above).

Train a Boosted Decision Tree Regression (BDTR) model.

Use the Sweep Parameters module to find the best parameter combination.

I also tried Neural Networks and Bayesian Linear Regression, but BDTR gave the best score.

I also tried excluding columns: I started with only a few (the ones I expected to affect the model most) and then added more columns one by one.
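For reference, the two-stage split and the normalization step described above could be sketched outside Azure ML Studio roughly as follows. This is only an illustration in plain Python, not what the Studio modules do internally: the helper names `three_way_split` and `fit_min_max` are hypothetical, min-max scaling is an assumed choice of normalization, and the `column` values are synthetic. The point is that an 80/20 split followed by a 60/40 split leaves 48% of all rows for training, 32% for validation and 20% for testing, and that the scaling range is taken from the training rows only.

```python
import random

def three_way_split(n_rows, test_frac=0.20, val_frac=0.40, seed=0):
    """Return (train, validation, test) index lists for the two-stage split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Stage 1: hold out the test set (20% of all rows).
    n_test = round(n_rows * test_frac)
    test, rest = indices[:n_test], indices[n_test:]
    # Stage 2: split the remaining 80% into validation (40%) and train (60%).
    n_val = round(len(rest) * val_frac)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

def fit_min_max(values):
    """Learn min and max from `values` and return a scaling function."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # guard against a constant column
    return lambda v: (v - lo) / span

train_idx, val_idx, test_idx = three_way_split(1000)
split_sizes = (len(train_idx), len(val_idx), len(test_idx))
print(split_sizes)  # (480, 320, 200)

# Synthetic stand-in for one of the wide-range integer features (0 to 500,000).
column = [(i * 7919) % 500_001 for i in range(1000)]

# Fit the scaler on training rows only, then apply it to the other subsets.
scale = fit_min_max([column[i] for i in train_idx])
train_scaled = [scale(column[i]) for i in train_idx]
val_scaled = [scale(column[i]) for i in val_idx]
```

Validation and test values can fall slightly outside the 0 to 1 range with this approach, because their minimum and maximum were not used to fit the scaler.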

However, the lowest MSE I could achieve was 1,500,000 (and many of the scored values were negative).
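To make the reported numbers easier to interpret, here is a minimal sketch of how MSE, RMSE and the count of negative scored values could be computed. The `y_true` and `y_pred` lists are made-up toy values, not results from the model above, and the helper `mse` is hypothetical. Since RMSE is expressed in the same units as the label, it is often easier to compare against the label range; if the 1,500,000 figure is an MSE in squared label units, the corresponding RMSE would be roughly 1,225.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy values for illustration only (not taken from the actual experiment).
y_true = [10_000, 50_000, 200_000, 1_000_000]
y_pred = [-5_000, 60_000, 180_000, 1_100_000]

error = mse(y_true, y_pred)
rmse = math.sqrt(error)                       # same units as the label
n_negative = sum(1 for p in y_pred if p < 0)  # scored values below zero

print(f"MSE={error:.0f} RMSE={rmse:.1f} negative predictions={n_negative}")

# RMSE implied by the MSE quoted in the question.
reported_rmse = math.sqrt(1_500_000)
```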

So I am wondering what other techniques I could use to improve the model.