What's the difference?

To understand what's causing the difference, we need to study the logic of tree-building algorithms.

The tree building algorithm

At the heart of the tree-building algorithm is a subalgorithm that splits the samples into two bins by selecting a variable and a value. This splitting algorithm considers each feature in turn, and for each feature selects the value that minimizes the impurity of the two resulting bins. We won't get into the details of how impurity is calculated (there's more than one way), except to say that a bin containing mostly positive or mostly negative samples is purer than one containing a mixture. There's a nice visualization of the algorithm in A Visual Introduction to Machine Learning.
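To make that concrete, here's a minimal sketch of such a splitting step (my own illustrative Python, not scikit-learn's or H2O's actual implementation), using Gini impurity and a brute-force search over thresholds:

```python
import numpy as np

def gini(y):
    """Gini impurity of a set of 0/1 labels (0 for an empty set)."""
    if len(y) == 0:
        return 0.0
    p = y.mean()
    return 2 * p * (1 - p)

def best_split(X, y):
    """Try every feature and every observed value as a threshold, and return
    the split that minimizes the weighted impurity of the two bins."""
    n, d = X.shape
    best = (None, None, np.inf)  # (feature index, threshold, weighted impurity)
    for j in range(d):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] < t], y[X[:, j] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (j, t, score)
    return best

X = np.array([[1.0, 0], [2.0, 1], [11.0, 0], [12.0, 1]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))  # feature 0, threshold 11.0, weighted impurity 0.0
```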

In our case, we'd hope that, when the algorithm considers $z$, it would choose to split at $10$. That is, any sample whose value of $z$ is less than $10$ goes into one bin, and any whose value is greater than $10$ goes into the other. It should, in turn, further subdivide the samples assigned to the 'less-than' bin, since we know that some of them are in fact positive.

Binary variables are automatically disadvantaged here, since there is only one way to split the samples: 0s one way, and 1s the other. Low-cardinality categorical variables suffer from the same problem. Another way to look at it: a continuous variable induces an ordering of the samples, and the algorithm can split that ordered list anywhere. A binary variable can only be split in one place, and a categorical variable with $q$ levels can be split in $2^{q-1} - 1$ ways.
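If you want to convince yourself of that count, here's a small sketch (illustrative only) that enumerates the partitions of $q$ levels into two non-empty bins:

```python
from itertools import combinations

def categorical_splits(levels):
    """Enumerate every way to partition a set of levels into two non-empty bins.
    For q levels there are 2**(q-1) - 1 such partitions."""
    levels = list(levels)
    splits = []
    for r in range(1, len(levels)):
        for rest in combinations(levels[1:], r - 1):
            # Pin the first level to the left bin so each partition appears once.
            left = (levels[0],) + rest
            right = tuple(l for l in levels if l not in left)
            splits.append((left, right))
    return splits

# A 4-level variable can be split in 2**3 - 1 = 7 ways; a binary one in just 1.
print(len(categorical_splits(["a", "b", "c", "d"])))  # 7
print(len(categorical_splits([0, 1])))                # 1
```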

An important sidenote: we don't actually have to search all the partitions because there are efficient algorithms for both binary classification and regression that are guaranteed to find the optimal split in linear time — see page 310 of the Elements of Statistical Learning. No such guarantee exists for multinomial classification, but there is a heuristic.
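Roughly, the trick for a binary target is to order the levels by their mean response and only consider splits of that ordering. A sketch of the idea, assuming 0/1 labels and Gini impurity (again illustrative, not any library's actual code):

```python
import numpy as np

def _gini(v):
    """Gini impurity of a set of 0/1 labels (0 for an empty set)."""
    return 2 * v.mean() * (1 - v.mean()) if len(v) else 0.0

def best_categorical_split_binary(levels, y):
    """Sort the levels by their mean response and scan only the q - 1 splits
    of that ordering, instead of all 2**(q-1) - 1 partitions. For a binary
    target, the optimal partition is guaranteed to be among them."""
    levels, y = np.asarray(levels), np.asarray(y)
    uniq = np.unique(levels)
    level_means = np.array([y[levels == u].mean() for u in uniq])
    order = uniq[np.argsort(level_means)]

    best_levels, best_score = None, np.inf
    for k in range(1, len(order)):
        mask = np.isin(levels, order[:k])  # samples sent to the left bin
        left, right = y[mask], y[~mask]
        score = (len(left) * _gini(left) + len(right) * _gini(right)) / len(y)
        if score < best_score:
            best_levels, best_score = set(order[:k]), score
    return best_levels, best_score
```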

Why one-hot encoding is bad bad bad for trees

Predictive Performance

By one-hot encoding a categorical variable, we create many binary variables, and from the splitting algorithm's point of view, they're all independent. This means a categorical variable is already disadvantaged over continuous variables. But there's a further problem: these binary variables are sparse. Imagine our categorical variable has 100 levels, each appearing about as often as the others. The best the algorithm can expect to do by splitting on one of its one-hot encoded dummies is to reduce impurity by $\approx 1\%$, since each of the dummies will be 'hot' for around $1\%$ of the samples.
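You can see the sparsity directly. A quick sketch with a synthetic 100-level variable (the sizes here are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A 100-level categorical variable with roughly uniform level frequencies.
c = rng.integers(0, 100, size=10_000)
dummies = pd.get_dummies(pd.Series(c), prefix="c")

print(dummies.shape)          # (10000, 100): one sparse column per level
print(dummies.mean().mean())  # ~0.01: each dummy is 'hot' for about 1% of rows
```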

The result of all this is that, if we start by one-hot encoding a high-cardinality variable, the tree building algorithm is unlikely to select one of its dummies as the splitting variable near the root of the tree, instead choosing continuous variables. In datasets like the one we created here, that leads to inferior performance.
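Here's roughly what the scikit-learn side of that looks like, on a stand-in dataset loosely modelled on the one described above (not the exact data or model settings used for the results in this post):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 10_000

# Stand-in data: z is informative with a threshold, c is a 100-level
# categorical that determines the label for the low-z samples, and the
# x_i are pure noise.
z = rng.uniform(0, 20, n)
c = rng.integers(0, 100, n)
noise = pd.DataFrame(rng.normal(size=(n, 5)), columns=[f"x_{i}" for i in range(5)])
y = ((z >= 10) | (c < 20)).astype(int)

X_onehot = pd.concat(
    [pd.DataFrame({"z": z}), noise, pd.get_dummies(pd.Series(c), prefix="c")],
    axis=1,
)
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_onehot, y)

# tree_.feature[0] is the feature index used at the root node; the c_* dummies
# rarely appear near the top of the tree.
print("root split:", X_onehot.columns[tree.tree_.feature[0]])
```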

In contrast, by considering all of the levels of $c$ at once, H2O's algorithm is able to select $c$ at the very top of the tree.

Interpretability

The importance score assigned to each feature is a measure of how often that feature was selected, and how much of an effect it had in reducing impurity when it was selected. (We don't consider permutation feature importance here; this might help combat the preference for continuous variables over binary ones, but it will not help with the induced sparsity.)

H2O assigns about $70\%$ of its importance to $c$, and the remaining $30\%$ to $z$. Scikit-learn, in contrast, assigns less than $10\%$ in total to the one-hot encodings of $c$, $30\%$ to $z$, and almost $60\%$ collectively to the $x_i$, features that are entirely unnecessary for perfectly classifying the data!
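If you want to reproduce that kind of breakdown on the scikit-learn side, you can pool the importance of the dummies back onto $c$. Continuing the stand-in sketch above (the exact numbers will differ from those quoted here):

```python
import pandas as pd

# Sum the importance scikit-learn assigns to each c_* dummy back onto the
# original categorical variable, and compare with z and the noise features.
imp = pd.Series(tree.feature_importances_, index=X_onehot.columns)

summary = pd.Series({
    "c (all dummies)": imp.filter(like="c_").sum(),
    "z": imp["z"],
    "x_i (all noise)": imp.filter(like="x_").sum(),
})
print(summary)
```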

Fewer levels, fewer problems

As we discussed, this problem is especially pronounced for high-cardinality categorical variables. If a categorical variable has only a few levels, the induced sparsity is less severe, and its one-hot encoded dummies have a chance of competing with the continuous features.