1
0
Fork 0
mirror of https://github.com/LadybirdBrowser/ladybird.git synced 2025-06-11 02:13:56 +09:00
Commit graph

302 commits

Author SHA1 Message Date
Andrew Kaster
2fa9ec20bd LibUnicode: Prefix AK::Duration with AK Namespace 2024-07-18 09:43:38 +01:00
Andrew Kaster
bd97442771 Meta: Add vulkan and vulkan-headers to vcpkg dependencies
Also require a specific ICU version to not run into unexpected problems.
2024-07-06 01:44:58 +02:00
Timothy Flynn
672a555f98 LibCore+LibJS+LibUnicode: Port retrieving time zone offsets to ICU
The changes to tests are due to LibTimeZone incorrectly interpreting
time stamps in the TZDB. The TZDB will list zone transitions in either
UTC or the zone's local time (which is then subject to DST offsets).
LibTimeZone did not handle the latter at all.

For example:

The following rule is in effect until November 18, 6PM UTC.

    America/Chicago -5:50:36 - LMT 1883 Nov 18 18:00u

The following rule is in effect until March 1, 2AM in Chicago time. But
at that time, a DST transition occurs, so the local time is actually
3AM.

    America/Chicago -6:00 Chicago C%sT 1936 Mar 1 2:00
2024-06-26 10:14:02 +02:00
Timothy Flynn
1b2d47e6bb LibJS+LibUnicode: Port retrieving available regional time zones to ICU 2024-06-26 10:14:02 +02:00
Timothy Flynn
4fc0fba646 LibCore+LibJS+LibUnicode: Port retrieving available time zones to ICU
This required updating some LibJS spec steps to their latest versions,
as the data expected by the old steps does not quite match the APIs that
are available with the ICU. The new spec steps are much more aligned.
2024-06-26 10:14:02 +02:00
Timothy Flynn
d3e809bcd4 LibJS+LibUnicode: Port retrieving the system time zone to ICU 2024-06-26 10:14:02 +02:00
Timothy Flynn
c379b35798 LibUnicode: Move helper to convert StringEnumeration to a list to ICU.h
This will be needed outside of UnicodeKeywords.cpp.
2024-06-26 10:14:02 +02:00
Andrew Kaster
a587eafbf4 CMake: Consistently use imported targets for third party dependencies 2024-06-25 17:15:42 -04:00
Timothy Flynn
ebdb92eef6 LibUnicode+Everywhere: Merge LibLocale back into LibUnicode
LibLocale was split off from LibUnicode a couple years ago to reduce the
number of applications on SerenityOS that depend on CLDR data. Now that
we use ICU, both LibUnicode and LibLocale are actually linking in this
data. And since vcpkg gives us static libraries, both libraries are over
30MB in size.

This patch reverts the separation and merges LibLocale into LibUnicode
again. We now have just one library that includes the ICU data.

Further, this will let LibUnicode share the locale cache that previously
would only exist in LibLocale.
2024-06-23 19:52:45 +02:00
Timothy Flynn
9220a89d2f CI+LibUnicode: Remove the UCD from the system 2024-06-22 14:56:39 +02:00
Timothy Flynn
2ba7b4c529 LibUnicode: Remove now-unused code generator facilities 2024-06-22 14:56:39 +02:00
Timothy Flynn
069bed5d47 LibUnicode+LibGfx: Remove superfluous emoji metadata
For SerenityOS, we parse emoji metadata from the UCD to learn emoji
groups, subgroups, names, etc. We used this information only in the
emoji picker dialog. It is entirely unused within Ladybird.

This removes our dependence on the UCD emoji file, as we no longer
need any of its information. All we need to know is the file path to
our custom emoji, which we get from Meta/emoji-file-list.txt.
2024-06-22 14:56:39 +02:00
Timothy Flynn
aa3a30870b LibUnicode: Replace code point bidirectional classes with ICU 2024-06-22 14:56:39 +02:00
Timothy Flynn
e77dafc987 LibUnicode: Replace code point scripts and script extensions with ICU 2024-06-22 14:56:39 +02:00
Timothy Flynn
986ff984cc LibUnicode: Replace code point general categories with ICU 2024-06-22 14:56:39 +02:00
Timothy Flynn
c804bda5fd LibUnicode: Replace code point properties with ICU 2024-06-22 14:56:39 +02:00
Timothy Flynn
ab56b8c8dc LibUnicode: Remove the locale-unaware text segmentation implementation 2024-06-20 13:46:54 +02:00
Timothy Flynn
5cf818e305 LibUnicode: Replace case transformations and comparison with ICUs
There are a couple of differences here due to using ICU:

1. Titlecasing behaves slightly differently. We previously transformed
   "123dollars" to "123Dollars", as we would use word segmentation to
   split a string into words, then transform the first cased character
   to titlecase. ICU doesn't go quite that far, and leaves the string
   as "123dollars". While this is a behavior change, the only user of
   this API is the `text-transform: capitalize;` CSS rule, and we now
   match the behavior of other browsers.

2. There isn't an API to compare strings with case insensitivity without
   allocating case-folded strings for both the left- and right-hand-side
   strings. Our implementation was previously allocation-free; however,
   in a benchmark, ICU is still ~1.4x faster.
2024-06-20 10:59:55 +02:00
Timothy Flynn
8d7216f4e0 LibUnicode: Replace IDNA ASCII conversion with ICU 2024-06-18 21:07:56 +02:00
Timothy Flynn
83475c5380 LibUnicode: Replace Unicode string normalization with ICU
In a benchmark, ICU's implementation was over 3x faster than ours.
2024-06-18 21:07:56 +02:00
Timothy Flynn
1feef17bf7 LibUnicode: Remove completely unused code point name & block name data
These were used for e.g. the Character Map on Serenity, but are not used
at all for Ladybird.
2024-06-18 21:07:56 +02:00
Timothy Flynn
fe3fde2411 AK+LibUnicode: Implement a case-insensitive variant of find_byte_offset
The existing String::find_byte_offset is case-sensitive. This variant
allows performing searches using Unicode-aware case folding.
2024-06-01 07:37:54 +02:00
Andreas Kling
df547bb321 LibUnicode: Avoid redundant UTF-8 validation in AK::String helpers 2024-04-21 19:32:49 +02:00
Idan Horowitz
945c58c7c1 LibUnicode: Generate and use code point composition mappings
These allow us to binary search the code point compositions based on
the first code point being combined, which makes the search close to
O(log N) instead of O(N).
2024-04-06 14:21:04 -04:00
Idan Horowitz
e227bf0f71 LibUnicode: Optimize the canonical composition algorithm implementation
It now takes O(N) time instead of O(N^2) time. Additionally some always
false conditions are removed.
2024-04-06 14:21:04 -04:00
Timothy Flynn
576c2f4f4d LibURL+LibUnicode+LibWebView: Handle punycode directly in LibURL
We had defined punycode handling in LibUnicode when LibURL (AK at the
time) was unable to depend on LibUnicode. This is no longer the case.
2024-03-26 12:25:21 -04:00
Shannon Booth
e800605ad3 AK+LibURL: Move AK::URL into a new URL library
This URL library ends up being a relatively fundamental base library of
the system, as LibCore depends on LibURL.

This change has two main benefits:
 * Moving AK back more towards being an agnostic library that can
   be used between the kernel and userspace. URL has never really fit
   that description - and is not used in the kernel.
 * URL _should_ depend on LibUnicode, as it needs punnycode support.
   However, it's not really possible to do this inside of AK as it can't
   depend on any external library. This change brings us a little closer
   to being able to do that, but unfortunately we aren't there quite
   yet, as the code generators depend on LibCore.
2024-03-18 14:06:28 -04:00
Timothy Flynn
aa0a6d58b2 Userland: Remove LibCore dependency from libraries that do not use it 2024-01-22 08:48:34 -05:00
Ali Mohammad Pur
5e1499d104 Everywhere: Rename {Deprecated => Byte}String
This commit un-deprecates DeprecatedString, and repurposes it as a byte
string.
As the null state has already been removed, there are no other
particularly hairy blockers in repurposing this type as a byte string
(what it _really_ is).

This commit is auto-generated:
  $ xs=$(ack -l \bDeprecatedString\b\|deprecated_string AK Userland \
    Meta Ports Ladybird Tests Kernel)
  $ perl -pie 's/\bDeprecatedString\b/ByteString/g;
    s/deprecated_string/byte_string/g' $xs
  $ clang-format --style=file -i \
    $(git diff --name-only | grep \.cpp\|\.h)
  $ gn format $(git ls-files '*.gn' '*.gni')
2023-12-17 18:25:10 +03:30
Timothy Flynn
43e9dc0500 LibUnicode: Use weak symbols to provide default IDNA defintions
Rather than using #ifdef blocks, update the fallback IDNA definitions to
use weak symbols to match the rest of LibUnicode / LibLocale.
2023-12-10 10:19:14 -05:00
Timothy Flynn
1f0e24bc3b LibUnicode: Fix compilation when ENABLE_UNICODE_DATABASE_DOWNLOAD is OFF 2023-12-10 10:19:14 -05:00
Simon Wanner
58f08107b0 AK+LibUnicode: Add Unicode::create_unicode_url
This is a workaround for the fact that AK::URLParser can't call into
LibUnicode directly.
2023-12-10 08:04:58 -05:00
Simon Wanner
5bcb019106 LibUnicode: Add IDNA::to_ascii
This implements the ToASCII operation of Unicode Technical Standard 46
2023-12-10 08:04:58 -05:00
Simon Wanner
7d9fe44039 LibUnicode: Download and parse IDNA data 2023-12-10 08:04:58 -05:00
Simon Wanner
cfd0a60863 LibUnicode: Add Punycode::encode 2023-12-10 08:04:58 -05:00
Simon Wanner
299d35aadc LibUnicode: Add Punycode::decode 2023-12-10 08:04:58 -05:00
Shannon Booth
d777b279e3 LibUnicode+Tests: Remove now unused to_unicode_*_full methods
Relocating all of the tests for these in LibUnicode over to the AK
String testsuite.
2023-11-28 17:15:27 -05:00
Shannon Booth
6b32a1f18f AK+LibUnicode: Expose TrailingCodePointTransformation in to_titlecase
Relocating the definition of this enum from LibUnicode to AK.
2023-11-28 17:15:27 -05:00
Timothy Flynn
6070df40f3 LibUnicode: Define case-insensitive string comparison more generically
The only user is currently String::equals_ignoring_case, but LibRegex
will need to do the same case-folded comparison with UTF-32 data. As it
turns out, the comparison works with all Unicode view types without much
fuss.
2023-11-08 12:54:26 -05:00
Cr4xy
bbfe0d3a82 LibWeb: Implement text-transform: capitalize 2023-10-03 09:47:17 -04:00
Timothy Flynn
139c575cc9 LibUnicode: Update to Unicode version 15.1.0
https://unicode.org/versions/Unicode15.1.0/

This update includes a new set of code point properties, Indic Conjunct
Break. These may have the values Consonant, Linker, or Extend. These are
used in text segmentation to prevent breaking on some extended grapheme
cluster sequences.
2023-09-15 18:30:26 +02:00
Timothy Flynn
02a8683266 LibUnicode+LibJS: Stop propagating small OOM errors from normalization
This API only perform small allocations, and is only used by LibJS.
2023-09-09 13:03:25 -04:00
Sam Atkins
0d021a63c7 LibUnicode: Generate data for bidirectional character types
This will let us examine code points to determine the rtl/ltr direction
of a piece of text.
2023-08-20 16:21:35 -04:00
Timothy Flynn
456211932f LibUnicode: Perform code point case conversion lookups in constant time
Similar to commit 0652cc4, we now generate 2-stage lookup tables for
case conversion information. Only about 1500 code points are actually
cased. This means that case information is rather highly compressible,
as the blocks we break the code points into will generally all have no
casing information at all.

In total, this change:

    * Does not change the size of libunicode.so (which is nice because,
      generally, the 2-stage lookup tables are expected to trade a bit
      of size for performance).

    * Reduces the runtime of the new benchmark test case added here from
      1.383s to 1.127s (about an 18.5% improvement).
2023-07-28 05:28:50 +02:00
Timothy Flynn
cb128dcf75 LibUnicode: Move the CodePointRangeComparator struct to a public header
Move it out of the generated code so that it may be used by the code
generator itself.
2023-07-26 08:36:20 +02:00
Timothy Flynn
c950f88611 LibUnicode: Stop generating Block property data
We started generating this data in commit 0505e03, but it was unused.
It's still not used, so let's remove it, rather than bloating the size
of libunicode.so with unused data. If we need it in the future, it's
trivial to add back.

Note we *have* always used the block name data from that commit, and
that is still present here.
2023-07-26 08:36:20 +02:00
Timothy Flynn
1393ed2000 AK+LibUnicode: Implement String::equals_ignoring_case without allocating
We currently fully casefold the left- and right-hand sides to compare
two strings with case-insensitivity. Now, we casefold one code point at
a time, storing the result in a view for comparison, until we exhaust
both strings.
2023-03-08 18:57:53 +00:00
Timothy Flynn
f8a0365002 LibUnicode: Detect ZWJ sequences when filtering by emoji presentation
This was preventing some unqualified emoji sequences from rendering
properly, such as the custom SerenityOS flag. We rendered the flag
correctly when given the fully qualified sequence:

    U+1F3F3 U+FEOF U+200D U+1F41E

But were not detecting the unqualified sequence as an emoji when also
filtering for emoji-presentation sequences:

    U+1F3F3 U+200D U+1F41E
2023-03-05 20:21:57 +01:00
Timothy Flynn
42c272c059 LibUnicode: Allow ignoring text presentation emoji in sequence detection
This adds an option to only detect emoji that should always present as
emoji. For example, the copyright symbol (unless followed by an emoji
presentation selector) should render as text.
2023-02-28 13:22:58 +00:00
Timothy Flynn
fa96811a22 LibUnicode: Skip over emoji sequences in grapheme boundary segmentation
Emoji sequences in the grapheme segmentation spec are a bit tricky:

    \p{Extended_Pictographic} Extend* ZWJ × \p{Extended_Pictographic}

Our current strategy of tracking a boolean to indicate if we are in an
emoji sequence was causing us to break up emoji made of multiple sub-
sequences. For example, in the "family: man, woman, girl, boy" sequence:

    U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466

We would break at indices 0 (correctly) and 6 (incorrectly).

Instead of tracking a boolean, it's quite a bit simpler to reason about
emoji sequences by just skipping past them entirely. Note that in cases
like the above emoji, we skip one sub-sequence at a time.
2023-02-25 22:23:39 +01:00