Extraction of orthogonal features
The lesson from those considerations is the need for features. Each of those relations does not fix the exact position of the object but rather constrains it to some subset of positions. Further orthogonal relations can then be extracted independently and jointly narrow down the position. This is exactly the role of an orthogonal feature basis. For example, to keep things simple, AT may be represented as \( |x|=0.1\) and RIGHT-OF as \( x<0\). Together, the two constraints lead to the solution \( x=-0.1\). First, does the RIGHT-OF relation follow from the AT relation in any way? Let’s try the technique developed earlier.
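As a minimal sketch of this constraint intersection (the encodings of AT and RIGHT-OF are taken from the example above; the explicit candidate enumeration is purely illustrative):

```python
# Each relation constrains x to a subset of positions; intersecting
# the subsets narrows the object down, here to a single point.

def at(x, eps=1e-9):
    return abs(abs(x) - 0.1) < eps    # AT encoded as |x| = 0.1

def right_of(x):
    return x < 0                       # RIGHT-OF encoded as x < 0

candidates = [-0.1, 0.1]               # the solution set of AT alone
print([x for x in candidates if right_of(x)])   # [-0.1]
```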
Take \( f(x)=|x|\). Then \( \vec{\nabla}f=\mathrm{sign}(x)\). This does not work, since we are only in one dimension! Orthogonality here follows from the fact that a single relation does not determine the value of \( x\) completely. Interestingly, when there are two variables, an equation reduces the dimensionality from two to one. When there is only one variable, an equation usually constrains it to a single value, unless it involves inequalities or absolute value signs. Generally, a feature condition selects a subset from the whole set. In the 2D case it is a 1D subspace; in the 1D case it is something else. Is there something like an orthogonal subset in the 1D case? The second feature has to parametrize the remaining subset somehow.
Let’s do more examples. Suppose one feature says \( x/4\in\mathbb{N}\). Usually, this is expressed as \( x=4i\) with a free parameter \( i=0,1,2,\ldots\). There is a reason why it is called free: it can be chosen independently. Are any two conditions that define a nonempty intersection valid candidates for “orthogonal” features? Well, a feature defining the interval
\( I_{1}=[1,17]\) does not seem very orthogonal to one defining \( I_{2}=[0,21]\). That’s because knowing that \( x\in I_{1}\) already tells us much about the probability of \( x\in I_{2}\)! And that’s the crucial criterion. If knowing one feature changes nothing about the probability of the other, then the conditional entropy equals the unconditional one and no information is gained. If this holds in both directions, the mutual information is zero and we have orthogonality. For example, knowing that \( |x|=0.1\) tells you nothing about whether \( x<0\). But essentially the same is true for the statement \( x<0.99\), which is nevertheless a poor feature, since its answer is almost always “yes”. This is strongly related to the “20 questions game”: if you want to find out \( x\), you should ask mutually independent questions whose answers have equal probability. Then the information gain is maximized. These considerations should be the generating process for orthogonal features.
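A quick way to make this criterion concrete is to estimate the mutual information between binary features by sampling. The base distribution below (uniform on \( [0,40]\)) and the second, “orthogonal” pair of questions are assumptions chosen for illustration, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 40, 100_000)   # assumed base distribution

def mutual_info(a, b):
    """Mutual information in bits between two binary features."""
    mi = 0.0
    for va in (0, 1):
        for vb in (0, 1):
            p_ab = np.mean((a == va) & (b == vb))
            p_a, p_b = np.mean(a == va), np.mean(b == vb)
            if p_ab > 0:
                mi += p_ab * np.log2(p_ab / (p_a * p_b))
    return mi

f1 = ((x >= 1) & (x <= 17)).astype(int)   # x in I1 = [1, 17]
f2 = ((x >= 0) & (x <= 21)).astype(int)   # x in I2 = [0, 21]
print(mutual_info(f1, f2))                # clearly > 0: not orthogonal

g1 = (x < 20).astype(int)                 # balanced halving question
g2 = (x % 20 < 10).astype(int)            # independent halving question
print(mutual_info(g1, g2))                # approx. 0: orthogonal
```

The interval pair shares a lot of information, while the two balanced, independent questions share essentially none.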
Thus, in the context of one-dimensional data, when a feature defines a subset, any new feature should cut that subset in half, assuming a uniform distribution. If the distribution is arbitrary, then the posterior probability of finding the number in either half should be equal. This maximum entropy principle sounds like a good idea. Additionally, the description length of the formula defining a feature should be taken into account.
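A small numeric check of this principle, with an exponential sample standing in for an “arbitrary distribution” (the choice of distribution is an assumption): cutting at the posterior median yields the full one bit of entropy, while any other cut yields less.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=100_000)  # arbitrary non-uniform sample

def entropy_of_cut(x, t):
    """Entropy (bits) of the binary feature 1[x < t]."""
    p = np.mean(x < t)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

median = np.median(x)
print(entropy_of_cut(x, median))      # ~1.0 bit: the maximum-entropy cut
print(entropy_of_cut(x, 3 * median))  # noticeably less than 1 bit
```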
Recently, we have developed a way to derive orthogonal features in the two-dimensional case. Suppose we have a feature \( f(x_1, \ldots, x_n)=c\). How shall we construct features orthogonal to it? The idea was to compute the gradient \( \vec{\nabla}f\) and find vectors orthogonal to it through an orthogonalization procedure such as Gram–Schmidt. Integrating the resulting new gradients leads to orthogonal features. For example, let \( \alpha(x,y)=\arctan(y/x)\). Then \( \vec{\nabla}\alpha=\left(\frac{-y}{x^2+y^2}, \frac{x}{x^2+y^2}\right)\). The gradient of the orthogonal feature is then \( \vec{\nabla}d=\left(\frac{x}{x^2+y^2}, \frac{y}{x^2+y^2}\right)\), which integrates to \( d(x,y)=\ln\sqrt{x^2+y^2}\), the distance up to an outer function, the logarithm. Obviously, angle and distance are orthogonal features, since knowledge of either contains no information about the other.
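This derivation is easy to verify symbolically. A sketch using SymPy (assuming it is available); note that in 2D, orthogonalizing against a single gradient reduces to a 90-degree rotation:

```python
import sympy as sp

x, y = sp.symbols('x y', positive=True)

alpha = sp.atan(y / x)                                  # angle feature
grad_alpha = [sp.simplify(sp.diff(alpha, v)) for v in (x, y)]
print(grad_alpha)                  # [-y/(x**2+y**2), x/(x**2+y**2)]

# In 2D, Gram-Schmidt against one vector is just a 90-degree rotation.
grad_d = [grad_alpha[1], -grad_alpha[0]]   # [x/(x**2+y**2), y/(x**2+y**2)]

# Integrating this gradient yields d(x, y) = ln(sqrt(x^2 + y^2)):
d = sp.log(sp.sqrt(x**2 + y**2))
assert all(sp.simplify(sp.diff(d, v) - g) == 0
           for v, g in zip((x, y), grad_d))
print(d)
```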
This works in the 2D case, but not in the 1D case. Why? We should put everything into one framework, with the overarching principle being the maximum entropy principle. The relation to what was just described is the following: on the circle defined by a constant distance, the angle parametrizes a uniform distribution, i.e. if points on the circle are drawn uniformly and their angles are computed, then the angles are also distributed uniformly. However, the construction of an orthogonal feature is not unique, even in the 2D case. After all, \( x>0\) is also an orthogonal feature, but it does not fix a single point. Hence, an additional constraint is needed that defines what kind of feature shall be extracted.
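A quick numeric illustration of that claim, using a rotationally symmetric Gaussian cloud as an assumed stand-in (any rotationally symmetric distribution would do):

```python
import numpy as np

rng = np.random.default_rng(2)
pts = rng.standard_normal((100_000, 2))   # rotationally symmetric cloud
angle = np.arctan2(pts[:, 1], pts[:, 0])
dist = np.hypot(pts[:, 0], pts[:, 1])

# The angle is (approximately) uniform on (-pi, pi] ...
hist, _ = np.histogram(angle, bins=8, range=(-np.pi, np.pi))
print(hist / len(angle))                  # all bins close to 1/8

# ... and knowing the angle says nothing about the distance:
top = dist[angle > 0]                     # condition on half of the angles
print(dist.mean(), top.mean())            # nearly identical means
```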
Furthermore, the set has to have some symmetry in order for orthogonal features to make sense. After all, what would be the orthogonal feature for some irregularly shaped “lake” on the 2D plane? But if the lake has an axial symmetry, like a guitar, then one can talk about one half and the other. So the task is not only to find a separating line with equal probabilities, but one along some symmetry! A symmetry of a set is a transformation that maps the set onto itself. Given a set, one has to find all such transformations. Most probably, someone has thought about that already.
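In this spirit, checking whether a candidate transformation is a symmetry of a finite point set is straightforward. A sketch (the guitar-like point set and the helper `is_symmetry` are hypothetical, chosen for illustration):

```python
import numpy as np

def is_symmetry(points, transform, tol=1e-9):
    """Check whether `transform` maps the finite point set onto itself."""
    mapped = np.array([transform(p) for p in points])
    # Every mapped point must coincide with some original point.
    return all(np.min(np.linalg.norm(points - m, axis=1)) < tol
               for m in mapped)

# A guitar-like set with a mirror axis along y = 0:
guitar = np.array([[0.0, 1.0], [0.0, -1.0],
                   [1.0, 0.5], [1.0, -0.5], [2.0, 0.0]])

print(is_symmetry(guitar, lambda p: np.array([p[0], -p[1]])))  # True: mirror
print(is_symmetry(guitar, lambda p: np.array([-p[0], p[1]])))  # False
```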
It is interesting to think about why symmetries seem important. Of course, they are important because they allow a set to be compressed significantly. After all, if a part of the set is given, the rest can be obtained by the symmetry transformation. For example, if there is a mirror symmetry, only half of the set has to be stored. Or, if there is an axial rotational symmetry in three dimensions, then only the silhouette has to be stored; the rest is obtained by rotation.
Is it a coincidence that the above orthogonalization method has yielded a symmetry of the set? Probably not. However, it is just one special symmetry. It looks like features have some “dimensionality”. The angle feature is 1D-continuous. The mirror symmetry is binary. The octagonal symmetry is 8-fold. A full rotational symmetry in three dimensions is 2D-continuous: the orbit of a point is a sphere.