Closed
Created Mar 16, 2016 by FPC Admin account (@fpc_admin, Owner)

Bug in UTF8FindNearestCharStart

Original Reporter info from Mantis: Bart @flyingsheep
  • Reporter name: Bart Broersma

Description:

UTF8FindNearestCharStart returns wrong result if BytePos points to $B8 in this 3-byte sequence $E0 $B8 $9A (which appears to be a valid codepoint: THAI CHARACTER BO BAIMAI, U+0E1A, see: http://unicode.scarfboy.com/?s=U%2b0E1A).

It returns an index pointing to $B8, where it should point to $E0 instead.
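To make the expected result concrete, here is a minimal Python sketch of a "nearest character start" lookup. It is not the LazUtf8 code: the names `seq_len` and `find_nearest_char_start` are hypothetical, positions are 0-based, and returning the position itself when no lead byte covers it is an assumption chosen for this sketch.

```python
def seq_len(lead: int) -> int:
    """Sequence length announced by a lead byte, or 0 if it is not a lead byte."""
    if lead < 0x80:
        return 1
    if lead & 0xE0 == 0xC0:
        return 2
    if lead & 0xF0 == 0xE0:
        return 3
    if lead & 0xF8 == 0xF0:
        return 4
    return 0  # continuation byte (10xxxxxx) or invalid lead


def find_nearest_char_start(data: bytes, byte_pos: int) -> int:
    """Return the 0-based index of the lead byte covering byte_pos.

    Assumption for this sketch: if no lead byte within reach covers
    byte_pos, byte_pos itself is returned.
    """
    if not data:
        return 0
    pos = max(0, min(byte_pos, len(data) - 1))
    start = pos
    steps = 0
    # A UTF-8 sequence has at most 4 bytes, so step back at most 3 times
    # while the current byte is a continuation byte.
    while start > 0 and steps < 3 and data[start] & 0xC0 == 0x80:
        start -= 1
        steps += 1
    # Accept the candidate only if its announced length reaches pos.
    if start + seq_len(data[start]) > pos:
        return start
    return pos


sample = bytes([0xC3, 0xA4, 0xE0, 0xB8, 0x9A])
starts = [find_nearest_char_start(sample, i) for i in range(len(sample))]
print(starts)
```

For the byte at index 3 ($B8) the sketch steps back one byte and lands on $E0 at index 2, which is the result this report expects.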

Steps to reproduce:

Unzip and build the attached sample.
(The sample project has more code than needed for this test, but it will just run the test demonstrating the problem.)

It outputs:
C:\Users\Bart\LazarusProjecten\ConsoleProjecten\bugs\comparestr>compare
Windows: using LazUtf8
$C3 $A4 $E0 $B8 $9A
1: NCS=0 B=C3
2: NCS=0 B=C3
3: NCS=2 B=E0
4: NCS=3 B=B8 Expected: E0
5: NCS=2 B=E0

Additional information:

I was looking for a similar function in LazUtf8 that would return the start of the codepoint only if the codepoint was valid.
The sample project has a function Utf8FindCodepointStart(...): Boolean that does just that.

Run the TestUtf8FindCodepointStart procedure to see the difference in behaviour (with a string that also has invalid codepoints):

C:\Users\Bart\LazarusProjecten\ConsoleProjecten\bugs\comparestr>compare
Windows: using LazUtf8
$C3 $A4 $E0 $B8 $9A $81 $F0
 1 C3 TRUE B=C3 CL=2 Cur-S=0  |  TRUE B=C3 CL=2 Idx=1  |   NCS=0 B=C3
 2 A4 TRUE B=C3 CL=2 Cur-S=0  |  TRUE B=C3 CL=2 Idx=1  |   NCS=0 B=C3
 3 E0 TRUE B=E0 CL=3 Cur-S=2  |  TRUE B=E0 CL=3 Idx=3  |   NCS=2 B=E0
 4 B8 TRUE B=E0 CL=3 Cur-S=2  |  TRUE B=E0 CL=3 Idx=3  |   NCS=3 B=B8
 5 9A TRUE B=E0 CL=3 Cur-S=2  |  TRUE B=E0 CL=3 Idx=3  |   NCS=2 B=E0
 6 81 FALSE   |  FALSE   |   NCS=5 B=81
 7 F0 FALSE   |  FALSE   |   NCS=6 B=F0
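
The validating lookup described above can be sketched in Python as follows. This is an illustration only, not the sample project's Utf8FindCodepointStart: the name `codepoint_start` and its return shape are hypothetical, positions are 0-based, and validity is delegated to Python's strict UTF-8 decoder.

```python
from typing import Optional, Tuple


def codepoint_start(data: bytes, byte_pos: int) -> Optional[Tuple[int, int]]:
    """Return (start, length) of the valid sequence covering byte_pos, else None."""
    if not 0 <= byte_pos < len(data):
        return None
    # The lead byte of a sequence is at most 3 bytes to the left.
    for start in range(byte_pos, max(byte_pos - 4, -1), -1):
        lead = data[start]
        if lead & 0xC0 == 0x80:
            continue  # continuation byte, keep looking left
        if lead < 0x80:
            length = 1
        elif lead & 0xE0 == 0xC0:
            length = 2
        elif lead & 0xF0 == 0xE0:
            length = 3
        elif lead & 0xF8 == 0xF0:
            length = 4
        else:
            return None  # not a legal lead byte
        if start + length <= byte_pos:
            return None  # the sequence ends before byte_pos
        try:
            # Strict decoding rejects truncated, overlong and surrogate forms.
            data[start:start + length].decode("utf-8")
        except UnicodeDecodeError:
            return None
        return (start, length)
    return None


sample = bytes([0xC3, 0xA4, 0xE0, 0xB8, 0x9A, 0x81, 0xF0])
results = [codepoint_start(sample, i) for i in range(len(sample))]
print(results)
```

On the seven-byte test string this yields a start and length for the first five bytes and `None` for the stray $81 and the truncated $F0, which lines up with the TRUE/FALSE and CL columns in the output above.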

Mantis conversion info:

  • Mantis ID: 29851
  • OS: Windows
  • OS Build: Win7
  • Build: r51965
  • Platform: i386
  • Version: 1.7 (SVN)
  • Fixed in version: 1.6.2
  • Fixed in revision: r51973 (#b192fb97)
  • Target version: 1.6.2