How about a standard algorithm/mapping table for grouping unicode chars to known languages and flagging when a unicode string pulls chars from more than one language? (possibly only flagging if a change occurs within a 'word').
A standard way of rendering unicode strings could then be to highlight flagged chars (perhaps with a different colour). More sensitive areas (such as the browser URL bar) could fire additional alerts if any chars were flagged.
Feels like a nice minor RFC to me - anyone see a problem with it?
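For illustration, here's a minimal Python sketch of the flagging idea, using the first word of each Unicode character name as a crude stand-in for a proper script lookup (a real implementation would want the Script property from UAX #24):

```python
import unicodedata

def char_script(ch):
    """Crude script guess: the first word of a Unicode character name
    ("LATIN SMALL LETTER A", "CYRILLIC SMALL LETTER A", ...).
    A real implementation should use the Script property (UAX #24)."""
    name = unicodedata.name(ch, "")
    return name.split()[0] if name else "UNKNOWN"

def is_mixed_script(word):
    """Flag a word whose letters come from more than one script."""
    scripts = {char_script(ch) for ch in word if ch.isalpha()}
    return len(scripts) > 1

print(is_mixed_script("paypal"))        # all Latin letters
print(is_mixed_script("p\u0430ypal"))   # Cyrillic 'а' (U+0430) among Latin
```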
Cool. I think a well-defined algorithm for deciding which characters get flagged is important, both for cross-platform/application support and to validate the algo. (Any bug in the algo is a possible security hole, more so if people come to trust that any dodgy characters are highlighted.)
An exposed API that takes UTF-8 strings and returns the count and/or indices of flagged chars would also be useful. Perhaps it could work as an extension to ICU?
If nothing else, the "change language within a word" should use a unicode-sane definition of 'word', which ICU would give you.
Once you have the two pieces above, adding the colouring etc. should be reasonably straightforward. But it'd be a shame if those two pieces weren't factored out - I don't think widespread adoption would follow otherwise.
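As a rough sketch of the proposed API (count/indices of flagged chars) in Python, using a naive `\w+` split where a real implementation would use ICU's BreakIterator word boundaries (UAX #29):

```python
import re
import unicodedata

def flagged_words(text):
    """Return (index, word) pairs for words that mix scripts.
    The \\w+ split is a naive stand-in for proper Unicode word
    boundaries (UAX #29), which ICU's BreakIterator provides."""
    out = []
    for m in re.finditer(r"\w+", text):
        # Crude per-character script guess from the Unicode name.
        scripts = {unicodedata.name(c, "?").split()[0]
                   for c in m.group() if c.isalpha()}
        if len(scripts) > 1:
            out.append((m.start(), m.group()))
    return out

# "ti\u0430nanmen" hides a Cyrillic а among Latin letters
print(flagged_words("ti\u0430nanmen square"))
```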
A really well-known old attack, far more painfully expressed in internationalized domain names (IDN), where attackers can create pixel-perfect DNS clones of "paypal.com".
Notable also as an example of an attack that the DNSSEC security model does little to combat.
One of the biggest security weaknesses on the Internet is in browser UI (something Aza Raskin should be shouting from the rooftops). Right now, users are "trained" (to dignify what's happening) to look at the browser URL bar for a name and a lock icon.
We don't need new protocols or even changes in old ones so much as we need Mozilla, Microsoft, Google, and Apple to sit down and come up with a standard set of UI idioms that will allow users to quickly and visually "authenticate" a domain name from the cues made available to the browser.
In this case the domain name is legitimate; the homoglyph is in the query string instead. In fact, you can copy "tianаnmen square massacre" from this HN comment and paste it into any search engine to get the same result.
Recent browsers do protect against character spoofing in IDNs, and provide various UI features to make it easy to authenticate the domain name and identity. The same protection isn't applied to the rest of the URI, however.
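The IDN protection mentioned above can be illustrated with Python's built-in `idna` codec (it implements the older IDNA 2003 rules, but the mechanism is the same): a lookalike label cannot keep its form once converted to the ASCII wire format, which is why browsers can fall back to displaying the punycode form.

```python
# A legitimate ASCII domain passes through unchanged; the spoofed one,
# with a Cyrillic а (U+0430) in place of the Latin a, is forced into
# an "xn--" punycode label that no longer looks like the real name.
legit = "paypal.com".encode("idna")
spoof = "p\u0430ypal.com".encode("idna")

print(legit)   # stays plain ASCII
print(spoof)   # first label becomes an xn-- punycode label
```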
Mozilla's Jesse Ruderman gave an interesting talk at the SOUPS conference on a similar class of attacks, some of which do not even require special characters - they are like SQL injection for the English language: http://www.squarefree.com/2010/07/14/untrusted-text-in-secur...
> Right now, users are "trained" (to dignify what's happening) to look at the browser URL bar for a name and a lock icon.
If only that were true.
A payment gateway I stumbled across the other day (http://www.centricom.com/flashdemo/POLi_demo.html) gets the user to download an application which they then use to log into their bank. The intro video instructs the user to check the address bar and the lock icon in the fake browser they've just downloaded. The best part is that the payment gateway goes through all this so it can edit the HTML of the user's bank and inject payment details.
One of the Australian airlines is pushing this instead of Paypal, and it's getting real world use.
You can use this technique to mention people/companies/political parties on Twitter without them being able to easily find the tweet (e.g. the vile racists of the BиP ;-)).
This article presents some interesting consequences of using a single unified mapping from numbers to glyphs for the whole plethora of human languages.
Perhaps instead of Unicode there could have been an established way to change character encoding mid-string? Treating control characters as normal characters (which, as this article demonstrates, we do with the mirror character at least) is nonsensical. Maybe a string could have a primary encoding, with tags placed around sections which use an alternate one.
Under such a scheme, any Cyrillic sections in an English string would be surrounded by "tag" bits. The viewing program could then handle the Cyrillic part according to its capability and known environment:
- An English-only terminal might render it as nonsense English characters.
- A capable viewing program could render proper Cyrillic glyphs according to the Cyrillic encoding.
- A good browser could render the Cyrillic glyphs but highlight them, display them in a different color, or even remove them.
Likewise, right-to-left languages would be implicitly rendered as such according to the capability of the viewing program.
But this has no-doubt already been suggested and rejected by somebody...
It doesn't really buy you anything. Everything you described is already possible with UTF-8, even the first bit about misinterpreting UTF-8 as 8-bit ASCII. You've still described a single mapping of characters to numbers; you've just created a much more inconvenient encoding.
One could argue that a filesystem which contains two files, one in ASCII and one in EBCDIC, is on the whole just a single mapping of characters to numbers. Your point isn't well argued.
What I'm suggesting differs in that it would use consistent tags to indicate deviations from the intended primary language of a string.
Surround sections of a string which should be interpreted differently with tag bits indicating as much. These tags should not be treated as characters, as the mirror character described in the article obviously is. Finally, display routines would easily be able to indicate what's what: any sections of the string tagged with an alternate encoding could be displayed in bold, highlighted, or in a different color. (i.e. as in "ti<cyrillic>a</cyrillic>n<cyrillic>a</cyrillic>nmen square", except the tags would simply be unused bit patterns.)
To accomplish the same thing with a Unicode string, one would have to store a list of ranges of code points which correspond to the string's locale. Then anything which isn't in one of those ranges could be displayed in the alternate color. (i.e. a character outside the range 32-126 would be considered "not EN-US"; for other languages, the code-point ranges might be more complex.)
The scheme I suggested isn't great, but it would avoid having to compare each character against a list of ranges for the current locale.
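For comparison, here's a minimal Python sketch of the Unicode-side range check described above. The 32-126 EN-US range is the one from the comment; other locales would need their own, likely more complex, range lists.

```python
# Allowed code-point ranges for the string's locale: printable ASCII
# (32-126) for EN-US. This is only the simplest case.
EN_US_RANGES = [(32, 126)]

def foreign_indices(s, ranges=EN_US_RANGES):
    """Indices of characters outside every allowed range - the ones a
    renderer would highlight, colour differently, or strip."""
    return [i for i, ch in enumerate(s)
            if not any(lo <= ord(ch) <= hi for lo, hi in ranges)]

print(foreign_indices("ti\u0430nanmen square"))   # the Cyrillic а at index 2
```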
The Unicode "mirror" character mentioned in the article was very interesting -- I had not heard of its existence before. This Google search indicates its potential for hijinks: