Mercari, Japan’s biggest community-powered shopping app, knows one problem deeply. They’d like to offer pricing suggestions to sellers, but this is tough because sellers can list just about anything, or any bundle of things, on Mercari’s marketplace.

In this machine learning project, we will build a model that automatically suggests the right product prices. We are provided with the following information:

train_id — the id of the listing

name — the title of the listing

item_condition_id — the condition of the items provided by the sellers

category_name — category of the listing

brand_name — the name of the brand

price — the price that the item was sold for. This is the target variable that we will predict

shipping — 1 if the shipping fee is paid by the seller, 0 if it is paid by the buyer

item_description — the full description of the item

EDA

The dataset can be downloaded from Kaggle. To validate the results, we only need train.tsv. Let’s get started!

import gc

import time

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

from scipy.sparse import csr_matrix, hstack

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.preprocessing import LabelBinarizer

from sklearn.model_selection import train_test_split, cross_val_score

from sklearn.metrics import mean_squared_error

import lightgbm as lgb

df = pd.read_csv('train.tsv', sep='\t')
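Before moving on, it can help to confirm that the loaded frame has the eight fields described earlier. The snippet below is a minimal sketch of such a check. It reads a tiny in-memory TSV instead of train.tsv so that it runs on its own; the two rows, the category strings, and the names EXPECTED_COLUMNS and sample_tsv are made up for illustration and are not real listings.

```python
import io

import pandas as pd

# Column names taken from the field list above.
EXPECTED_COLUMNS = [
    "train_id", "name", "item_condition_id", "category_name",
    "brand_name", "price", "shipping", "item_description",
]

# Two hypothetical rows standing in for train.tsv (not real Mercari data).
sample_tsv = (
    "train_id\tname\titem_condition_id\tcategory_name\tbrand_name\t"
    "price\tshipping\titem_description\n"
    "0\tExample shirt\t3\tMen/Tops/T-shirts\t\t10.0\t1\tNo description yet\n"
    "1\tExample keyboard\t3\tElectronics/Keyboards\tExampleBrand\t52.0\t0\tWorks well\n"
)

sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# Fail early if any expected column is absent.
missing = [c for c in EXPECTED_COLUMNS if c not in sample.columns]
assert not missing, f"missing columns: {missing}"

# Count missing values per column; an empty brand_name field is read as NaN.
print(sample.isnull().sum())
```

Running the same two checks on df instead of sample would show which of the real columns contain missing values.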

Randomly split the data into train and test sets. We use only the training set for EDA.

msk = np.random.rand(len(df)) < 0.8

train = df[msk]

test = df[~msk]

train.shape, test.shape

((1185866, 8), (296669, 8))
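Note that the mask draws each row independently, so the split is only approximately 80/20 and the exact counts change from run to run. A minimal sketch of a reproducible variant is shown below, assuming a seeded NumPy generator is acceptable; the seed value 42 and the toy frame are illustrative choices, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df (hypothetical values, not Mercari data).
toy = pd.DataFrame({
    "train_id": np.arange(1000),
    "price": np.linspace(3.0, 100.0, 1000),
})

# A seeded generator makes the random mask repeatable across runs.
rng = np.random.default_rng(42)
msk = rng.random(len(toy)) < 0.8  # each row has an 80% chance of landing in train

train_toy = toy[msk]
test_toy = toy[~msk]

# The two parts are disjoint and together cover every row.
print(train_toy.shape, test_toy.shape)
```

If an exact 80/20 split is preferred, train_test_split from sklearn.model_selection, which is already imported above, accepts test_size and random_state arguments.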