The changes to tests are due to LibTimeZone incorrectly interpreting
time stamps in the TZDB. The TZDB will list zone transitions in either
UTC or the zone's local time (which is then subject to DST offsets).
LibTimeZone did not handle the latter at all.
For example:
The following rule is in effect until November 18, 6PM UTC.
America/Chicago -5:50:36 - LMT 1883 Nov 18 18:00u
The following rule is in effect until March 1, 2AM in Chicago time. But
at that time, a DST transition occurs, so the local time is actually
3AM.
America/Chicago -6:00 Chicago C%sT 1936 Mar 1 2:00
This required updating some LibJS spec steps to their latest versions,
as the data expected by the old steps does not quite match the APIs that
are available with the ICU. The new spec steps are much more aligned.
LibLocale was split off from LibUnicode a couple years ago to reduce the
number of applications on SerenityOS that depend on CLDR data. Now that
we use ICU, both LibUnicode and LibLocale are actually linking in this
data. And since vcpkg gives us static libraries, both libraries are over
30MB in size.
This patch reverts the separation and merges LibLocale into LibUnicode
again. We now have just one library that includes the ICU data.
Further, this will let LibUnicode share the locale cache that previously
would only exist in LibLocale.
For SerenityOS, we parse emoji metadata from the UCD to learn emoji
groups, subgroups, names, etc. We used this information only in the
emoji picker dialog. It is entirely unused within Ladybird.
This removes our dependence on the UCD emoji file, as we no longer
need any of its information. All we need to know is the file path to
our custom emoji, which we get from Meta/emoji-file-list.txt.
There are a couple of differences here due to using ICU:
1. Titlecasing behaves slightly differently. We previously transformed
"123dollars" to "123Dollars", as we would use word segmentation to
split a string into words, then transform the first cased character
to titlecase. ICU doesn't go quite that far, and leaves the string
as "123dollars". While this is a behavior change, the only user of
this API is the `text-transform: capitalize;` CSS rule, and we now
match the behavior of other browsers.
2. There isn't an API to compare strings with case insensitivity without
allocating case-folded strings for both the left- and right-hand-side
strings. Our implementation was previously allocation-free; however,
in a benchmark, ICU is still ~1.4x faster.
These allow us to binary search the code point compositions based on
the first code point being combined, which makes the search close to
O(log N) instead of O(N).
This URL library ends up being a relatively fundamental base library of
the system, as LibCore depends on LibURL.
This change has two main benefits:
* Moving AK back more towards being an agnostic library that can
be used between the kernel and userspace. URL has never really fit
that description - and is not used in the kernel.
* URL _should_ depend on LibUnicode, as it needs punnycode support.
However, it's not really possible to do this inside of AK as it can't
depend on any external library. This change brings us a little closer
to being able to do that, but unfortunately we aren't there quite
yet, as the code generators depend on LibCore.
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).
This commit is auto-generated:
$ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
Meta Ports Ladybird Tests Kernel)
$ perl -pie 's/\bDeprecatedString\b/ByteString/g;
s/deprecated_string/byte_string/g' $xs
$ clang-format --style=file -i \
$(git diff --name-only | grep \.cpp\|\.h)
$ gn format $(git ls-files '*.gn' '*.gni')
The only user is currently String::equals_ignoring_case, but LibRegex
will need to do the same case-folded comparison with UTF-32 data. As it
turns out, the comparison works with all Unicode view types without much
fuss.
https://unicode.org/versions/Unicode15.1.0/
This update includes a new set of code point properties, Indic Conjunct
Break. These may have the values Consonant, Linker, or Extend. These are
used in text segmentation to prevent breaking on some extended grapheme
cluster sequences.
Similar to commit 0652cc4, we now generate 2-stage lookup tables for
case conversion information. Only about 1500 code points are actually
cased. This means that case information is rather highly compressible,
as the blocks we break the code points into will generally all have no
casing information at all.
In total, this change:
* Does not change the size of libunicode.so (which is nice because,
generally, the 2-stage lookup tables are expected to trade a bit
of size for performance).
* Reduces the runtime of the new benchmark test case added here from
1.383s to 1.127s (about an 18.5% improvement).
We started generating this data in commit 0505e03, but it was unused.
It's still not used, so let's remove it, rather than bloating the size
of libunicode.so with unused data. If we need it in the future, it's
trivial to add back.
Note we *have* always used the block name data from that commit, and
that is still present here.
We currently fully casefold the left- and right-hand sides to compare
two strings with case-insensitivity. Now, we casefold one code point at
a time, storing the result in a view for comparison, until we exhaust
both strings.
This was preventing some unqualified emoji sequences from rendering
properly, such as the custom SerenityOS flag. We rendered the flag
correctly when given the fully qualified sequence:
U+1F3F3 U+FEOF U+200D U+1F41E
But were not detecting the unqualified sequence as an emoji when also
filtering for emoji-presentation sequences:
U+1F3F3 U+200D U+1F41E
This adds an option to only detect emoji that should always present as
emoji. For example, the copyright symbol (unless followed by an emoji
presentation selector) should render as text.
Emoji sequences in the grapheme segmentation spec are a bit tricky:
\p{Extended_Pictographic} Extend* ZWJ × \p{Extended_Pictographic}
Our current strategy of tracking a boolean to indicate if we are in an
emoji sequence was causing us to break up emoji made of multiple sub-
sequences. For example, in the "family: man, woman, girl, boy" sequence:
U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466
We would break at indices 0 (correctly) and 6 (incorrectly).
Instead of tracking a boolean, it's quite a bit simpler to reason about
emoji sequences by just skipping past them entirely. Note that in cases
like the above emoji, we skip one sub-sequence at a time.