Home of: [Atelier "FUJIGURUMA"] >> [PageMixer hosted by SourceForge.net]

SEE "For Readers of English Version",
or Japanese version of this page

Symbol comparison

This section explains the reduction of String comparison cost.

Class names

In this document, abbreviated class names are used. Complete names are shown below.

NotationFull name
Attribute jp.ne.dti.lares.foozy.pagemixer.Attribute
HTMLSymbolSet jp.ne.dti.lares.foozy.pagemixer.HTMLSymbolSet
Symbol jp.ne.dti.lares.foozy.pagemixer.Symbol
SymbolSet jp.ne.dti.lares.foozy.pagemixer.SymbolSet
Token jp.ne.dti.lares.foozy.pagemixer.Token

Ordinary design

Simple comparison

PageMixer needs to examine equivalence between two strings as names of Token or Attribute.

Even though "String#equals(Object)" may be used to do so ordinarily, it costs very much because it consists of below steps if specified object is not itself according to the source file of String class in JDK.

  1. examine String-ness of given object by instanceof
  2. downcast Object to String
  3. compare length of string
  4. compare characters step by step

In fact, (1)instanceof and (2)downcast are not so reasonable in the least.

NOTE:Downcast to a class on the heels of instanceof with same one may not cost so much than alone downcast.

Optimization may skip class check of succeeding (down)cast.

Because calculation order of (4)step by step character comparison is in propotion to length of string "O(n)", some other ways shown below also cost as much as String#equals(Object).

Though PageMixer should examine case-insensitively if possible, case-insensitive comparison methods shown above convert to lower letter at execution time. They are not reasonable to process much data.

Comparison of references

One of good ways to examine equivalence between two strings quicklly is comparison of references.

But what should refenrce comparison implementation for examination of strings equivalence do when it detect difference between two references to string ?

It costs very much, if it examines equivalence of string contents by String#equals or same function. But it causes mis-judgment to conclude from difference of references that two strings are different from each other, because there is no good reason for difference between string contents.

Though all strings must be listed up beforehand for above reason, that restriction is not realistic for applicability.

Decision in PageMixer

For equivalence examination of two strings by reference comparison, instead of using String reference directly, PageMixer defines "Symbol" and its derived classes shown as below.

Class diagram
Class diagram

This class design not only makes comparison between known strings efficient, but also makes heterogeneous comparison of known/unknonw strings possible.

Symbol defines "isSameAs(Symbol)" to examine equivalence with another one, but it is defined as "abstract".

"Symbol.Resolved#isSameAs(Symbol)" is implemented as shown below. It just examines equivalence by reference comparison.


    public boolean isSameAs(Symbol symbol){
        return (value_ == symbol.value_);
    }

isSameAs(Symbol) of Symbol.Resolved class

The other hand, "Symbol.Unresolved#isSameAs(Symbol)" is implemented as shown below. It examines equivalence by String#equals(Object), which is not efficient.


    public boolean isSameAs(Symbol symbol){
        return (value_.equals(symbol.value_));
    }

isSameAs(Symbol) of Symbol.Unresolved class

Symbol.Resolved is used for known string, and Symbol.Unresolved is used for unknown one.

Assume that Symbol.Resolved instances are stored into SymbolSet by SymbolSet#add(String) for known strings beforehand.

By SymbolSet#lookup(String), you can get Symbol.Resolved as Symbol if you specify known string to it, and Symbol.Unresolved as Symbol if you specify unknown string.

Comparison between known strings is done by isSameAs(Symbol) of Symbol.Resolved, and result of it depends on (reference) equivalence of two strings. On the other hand, comparison between known and unknown strings will fail certainly, because references to String are different (for Symbol.Resolved) and contents of String are different (for String#equals(Object) in Symbol.Unresolved). Moreover, comparison between unknown strings is done by Symbol.Unresolved, and result of it depends on equivalence of two strings contents.

"this"argument
knownunknown
known reference comparison reference comparison (may fail)
unknown String#equals(Object) (may fail) String#equals(Object)

You must lookup (SymbolSet#lookup(String)) with cased string, which you decide as upper or lower, if you want compare case-insensitively. It seems not to be seriouse even though case conversion costs much, because looking Symbol up is done only in preparation phase not in execution phase.

PageMixer provides "HTMLSymbolSet", which is derived class of SymbolSet, to get Symbol for well known HTML tag and attribute names easily.For example:

                                             // correspond to
Symbol a      = HTMLSymbolSet.SET.A;         // "A"
Symbol html   = HTMLSymbolSet.SET.HTML;      // "HTML"
Symbol img    = HTMLSymbolSet.SET.IMG;       // "IMG"
Symbol color  = HTMLSymbolSet.SET.COLOR;     // "COLOR"
Symbol equiv = HTMLSymbolSet.SET.HTTP_EQUIV; // "HTTP-EQUIV"

Getting Symbol for well known HTML names