File: _docs/Demojibakefier.md

File:	`_docs/Demojibakefier.md`
Role:	Example script
Content type:	`text/markdown`
Description:	Example script
Class:	PHP Common Class Library Set of classes that provides common functionality
Author:	By Caleb
Last change:	Add number formatter.
Date:	5 years ago
Size:	`18,796 bytes`

Documentation for the "Demojibakefier" class.

Intended to normalise the character encoding of a given string to a preferred character encoding when the given string's byte sequences don't match the expectations of the preferred character encoding. Useful in cases where a block of data might conceivably be composed of several different unspecified, unknown encodings.

Why the name?

When a byte sequence doesn't conform to the expectations of a particular character encoding, and an attempt is made to render that byte sequence into readable characters using that particular character encoding, it can sometimes result in the appearance of generic replacement characters and "mojibake" (文字化け).

Wikipedia excerpt:

> Mojibake means "character transformation" in Japanese. The word is composed of 文字 (moji, IPA: [mod͡ʑi]), "character" and 化け (bake, IPA: [bäke̞], pronounced "bah-keh"), "transform".

Related trivia: The word "emoji" has similar etymology.

"Demojibakefier" is a play on the word "mojibake", so named because, ideally, it should eliminate, or at least reduce, the occurrence of replacement characters, mojibake, etc.

What does it do?

Let's start with some sample code to reproduce a potential use-case (for the purpose of the sample code, please assume that it uses UTF-8 encoding).

<?php
/** Japanese placeholder text generated by <https://lipsum.sugutsukaeru.jp/index.cgi>. */
$TextJA = '????????????????????????????
?????????????????????????????????
????????????????????????????????
???????????????????????????????????????????????????????????????????????????
?????????????????????????????????????????????????????????????????????
???????????????????????????';

/**
 * Output some basic HTML to define the character encoding we're using,
 * language, etc.
 */
echo '<!doctype html><html lang="ja-JP"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>';

/**
 * When we echo the placeholder text in its original state, we should see it
 * correctly. We'll also add some linebreaks for better readability.
 */
echo $TextJA . '<br /><br />';

/**
 * We'll convert our placeholder text from UTF-8 to SHIFT-JIS, and then echo
 * it, to intentionally produce output with mixed encodings (something that we
 * generally should never, ever do, but we're doing it to provide an example
 * for what the Demojibakefier class can do).
 */
echo $TextJA_SHIFTJIS = iconv('UTF-8', 'SHIFT-JIS', $TextJA);

/** Output closing HTML tags. */
echo '</body></html>';

When executing the above sample code via a browser request, it should produce something like this (the latter part being completely unintelligible):

> [Sample output: the original Japanese placeholder text, rendered correctly, followed by the same text rendered as unintelligible mojibake.]

In the case of our sample code, we already know that the latter part uses SHIFT-JIS (because we're the ones who converted it from UTF-8 to SHIFT-JIS), so we could simply use iconv() to convert it back to UTF-8, without the need for complex classes, external dependencies, etc. But what about cases where we don't know which character encoding is being used? Sometimes, we can't reliably predict which character encodings some data might contain, due to how that data is sourced or for a variety of other possible reasons. That's where the Demojibakefier can help.
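For reference, the "known encoding" case can be sketched with nothing but PHP's built-in iconv(). This is only an illustration: the short sample string and the variable names are assumptions made for the example, and aren't taken from the class or from the sample code above.

```php
<?php
// Illustrative sample: the word "mojibake" written in Japanese, using Unicode
// escapes so that the example doesn't depend on this file's own encoding.
$Original = "\u{6587}\u{5B57}\u{5316}\u{3051}";

// Convert from UTF-8 to SHIFT-JIS, as the sample code does.
$ShiftJIS = iconv('UTF-8', 'SHIFT-JIS', $Original);

// Because we know the source encoding, converting back is a single call.
$Restored = iconv('SHIFT-JIS', 'UTF-8', $ShiftJIS);

// Compare the round-tripped string against the original.
var_dump($Restored === $Original);
```

The second iconv() call only works because we can name the source encoding. Without that knowledge, there would be nothing to pass as its first argument.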

Let's try the same thing again, but this time, we'll pretend that we don't know which character encoding we've converted the placeholder text to. We'll pretend that the only thing we know is that everything should be using UTF-8. We'll use the Demojibakefier to try to automatically convert it back to UTF-8, without needing to specify which character encoding we're converting from.

<?php
/** All the same stuff as before (up until where we closed our HTML tags). */
$TextJA = '????????????????????????????
?????????????????????????????????
????????????????????????????????
???????????????????????????????????????????????????????????????????????????
?????????????????????????????????????????????????????????????????????
???????????????????????????';
echo '<!doctype html><html lang="ja-JP"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>';
echo $TextJA . '<br /><br />';
echo $TextJA_SHIFTJIS = iconv('UTF-8', 'SHIFT-JIS', $TextJA);

/** Now we'll create a Demojibakefier instance. */
$Demojibakefier = new \Maikuolan\Common\Demojibakefier();

/**
 * Now we'll run our SHIFT-JIS text (remember that we're pretending that we
 * don't know which encoding it uses) through Demojibakefier's guard method.
 * Echo a few linebreaks for better readability and the output of guard.
 */
echo '<br /><br />' . $Demojibakefier->guard($TextJA_SHIFTJIS);

/** And finally, our closing HTML tags. */
echo '</body></html>';

This time, it should produce the same output as before, followed by the output of guard, which should be the same as our original UTF-8 text.

Demojibakefier's constructor.

To use the Demojibakefier, you'll firstly need to instantiate it.

public function __construct(string $NormaliseTo = '');

Demojibakefier's constructor optionally accepts one parameter: The character encoding that it should use whenever trying to normalise data. When omitted, UTF-8 will be used.

After you've instantiated the Demojibakefier, you can start demojibakefying data by using the instance's normalise or guard methods.

supported method.

Returns an array of all the character encoding types known to and supported by the Demojibakefier.

public function supported(): array;

Character encoding types currently known and supported by the Demojibakefier:

- UTF-8
- UTF-16BE
- UTF-16LE
- ISO-8859-1
- CP1252
- ISO-8859-2
- ISO-8859-3
- ISO-8859-4
- ISO-8859-5
- ISO-8859-6
- ISO-8859-7
- ISO-8859-8
- ISO-8859-9
- ISO-8859-10
- ISO-8859-11
- ISO-8859-13
- ISO-8859-14
- ISO-8859-15
- ISO-8859-16
- CP1250
- CP1251
- CP1253
- CP1254
- CP1255
- CP1256
- CP1257
- CP1258
- GB18030
- GB2312
- BIG5
- SHIFT-JIS
- JOHAB
- UCS-2
- UTF-32BE
- UTF-32LE
- UCS-4
- CP437
- CP737
- CP775
- CP850
- CP852
- CP855
- CP857
- CP860
- CP861
- CP862
- CP863
- CP864
- CP865
- CP866
- CP869
- CP874
- KOI8-RU
- KOI8-R
- KOI8-U
- KOI8-F
- KOI8-T
- CP037
- CP500
- CP858
- CP875
- CP1026

Note that the reliability of the Demojibakefier's ability to normalise strings, and of using it to convert a string between two particular character encoding types, can vary significantly, depending on the character encoding types in question and the length and nature of the string in question, among other factors.

Note also that the Demojibakefier isn't a linguistic translator. It isn't designed to test the intelligibility of strings beyond the conformity of their byte sequences to the various character encoding types supported by the class, or beyond the few rudimentary heuristics that it implements (such as the comparative likelihood of particular byte sequences occurring within the kinds of texts that typically utilise particular character encoding types). This has two consequences:

- An entirely unintelligible string could be regarded as already conformant, and therefore potentially not normalised, as long as its byte sequence conforms to that expected by the instance's target character encoding.
- An entirely unintelligible string could theoretically be produced by the Demojibakefier, as long as the provided string doesn't already conform to the instance's target character encoding, conforms to one or more of the other character encoding types supported by the class, passes all heuristics, and happens to read unintelligibly in the character encoding types that it conforms to.
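To make the "unintelligible but conformant" case concrete, here is a minimal sketch using only PHP built-ins. It assumes UTF-8 as the target encoding and uses preg_match with the /u modifier as a stand-in validity check. That check is an assumption chosen for illustration, and isn't necessarily how the class tests conformity.

```php
<?php
// The letter e-acute as UTF-8 (bytes C3 A9), written as an escape to avoid
// depending on this file's own encoding.
$Original = "\u{E9}";

// Wrongly treat those UTF-8 bytes as ISO-8859-1 and convert them to UTF-8 a
// second time. The result is classic double-encoded mojibake.
$Mojibake = iconv('ISO-8859-1', 'UTF-8', $Original);

// The /u modifier makes PCRE reject subjects that aren't valid UTF-8, so an
// empty pattern doubles as a quick validity check.
$IsValidUTF8 = preg_match('//u', $Mojibake) === 1;

// The mangled string differs from the original, yet is still valid UTF-8.
var_dump($Mojibake !== $Original, $IsValidUTF8);
```

A byte-level conformity check alone can't tell such a string apart from legitimate text, because nothing about its bytes is invalid.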

checkConformity method.

Checks for byte sequences that shouldn't normally appear in a specified character encoding (the second parameter) as a means of roughly guessing whether the string (the first parameter) likely conforms to the specified character encoding. The second parameter is optional, defaulting to the instance's default character encoding when omitted (the character encoding provided to the constructor at instantiation, or UTF-8 when none was provided). Returns true when the string conforms (per specs), or false otherwise.

public function checkConformity(string $String, string $Encoding = ''): bool;
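The general idea of a rough conformity check can be sketched with PHP built-ins. The function name roughlyConforms is hypothetical, and the technique shown (PCRE's /u modifier for UTF-8, and a trial decode with iconv() for everything else) is an assumption made for illustration, not the class's actual implementation.

```php
<?php
/**
 * Hypothetical helper (not part of the class): roughly guess whether a
 * string's bytes are valid for the given encoding.
 */
function roughlyConforms(string $String, string $Encoding = 'UTF-8'): bool
{
    if (strtoupper($Encoding) === 'UTF-8') {
        // With the /u modifier, preg_match fails on invalid UTF-8 subjects.
        return preg_match('//u', $String) === 1;
    }

    // For other encodings, see whether iconv can decode the bytes at all.
    // The @ suppresses the notice that iconv raises for illegal input.
    return @iconv($Encoding, 'UTF-8', $String) !== false;
}

// Two Japanese characters ("moji") encoded as UTF-8, then as SHIFT-JIS.
$UTF8 = "\u{6587}\u{5B57}";
$ShiftJIS = iconv('UTF-8', 'SHIFT-JIS', $UTF8);

var_dump(roughlyConforms($UTF8));                   // Checked as UTF-8.
var_dump(roughlyConforms($ShiftJIS));               // SHIFT-JIS bytes checked as UTF-8.
var_dump(roughlyConforms($ShiftJIS, 'SHIFT-JIS'));  // SHIFT-JIS bytes checked as SHIFT-JIS.
```

Note that a check like this is far weaker for single-byte encodings such as ISO-8859-1, where nearly any byte sequence decodes successfully. That is one reason why conformity can only ever be a rough guess.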

weigh method.

Attempts to apply weighting to potential character encoding candidates based on the frequency/occurrence of specific byte sequences and lack thereof within a string. Method is private and thus shouldn't be called by the implementation.

private function weigh(string $String, array &$Arr);

dropVariants method.

Drops candidates belonging to encodings that are outdated subsets or variants of other encodings with valid candidates. Method is private and thus shouldn't be called by the implementation.

private function dropVariants(array &$Arr);

shannonEntropy method.

Calculates the Shannon entropy of a string (the sole accepted parameter). This method isn't used by any current versions of the Demojibakefier, but its use is planned for a future version.

public function shannonEntropy(string $String);
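For readers unfamiliar with the measure, Shannon entropy is the sum of -p × log2(p) over every distinct symbol, where p is that symbol's relative frequency. The sketch below computes it per byte. The function name shannonEntropySketch, the choice of bytes as symbols, and the float return type are all assumptions for illustration, since the documentation doesn't specify how the class computes or returns it.

```php
<?php
/**
 * Hypothetical helper (not the class's own code): Shannon entropy of a
 * string, measured in bits per byte.
 */
function shannonEntropySketch(string $String): float
{
    $Length = strlen($String);
    if ($Length === 0) {
        return 0.0;
    }

    $Entropy = 0.0;

    // count_chars in mode 1 returns only the byte values that occur in the
    // string, each mapped to the number of times it occurs.
    foreach (count_chars($String, 1) as $Count) {
        $Probability = $Count / $Length;
        $Entropy -= $Probability * log($Probability, 2);
    }

    return $Entropy;
}

var_dump(shannonEntropySketch('aaaa')); // One distinct byte: no uncertainty.
var_dump(shannonEntropySketch('abcd')); // Four equally likely bytes.
```

Low values indicate highly repetitive data, while values approaching 8 bits per byte indicate data that looks close to random.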

normalise method.

Attempts to normalise a string (the sole accepted parameter). Returns the normalised string, or returns the string verbatim when it can't be normalised reliably and confidently, when its byte sequence already conforms to the target character encoding, or when it is empty.

public function normalise(string $String): string;

When normalise is called, it immediately resets the Last member and immediately populates the Len member. The Last member is then populated as soon as the Demojibakefier decides which character encoding it thinks the provided string uses (assuming it's able to come to a decision).

guard method.

The Demojibakefier relies heavily upon PHP's iconv() functionality in order to work as intended. If iconv() isn't available, the Demojibakefier won't be able to work properly, and attempting to call the normalise method in such a situation will cause fatal errors to occur. The guard method provides a means of avoiding that problem.

It checks firstly whether iconv() is available, and secondly whether the byte sequence of the provided string (the sole accepted parameter) already conforms to the instance's target character encoding (if it already conforms, the Demojibakefier doesn't need to do anything with the string anyway). If iconv() isn't available, or if the provided string already conforms, the string is immediately returned verbatim. Otherwise (i.e., only if iconv() is available and the string doesn't already conform), guard calls normalise to attempt to normalise the provided string, and returns whatever normalise returns. Calling the guard method is therefore slightly safer than calling the normalise method directly.

public function guard(string $String): string;
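The order of those checks can be sketched as a standalone function. Everything here is hypothetical: guardSketch isn't the class's code, the target encoding is assumed to be UTF-8, the conformity test is approximated with preg_match and the /u modifier, and the normaliser is passed in as a callable so that the example stays self-contained.

```php
<?php
/**
 * Hypothetical stand-in for the logic described above (not the class's own
 * code). $Normalise is whatever would do the real conversion work.
 */
function guardSketch(string $String, callable $Normalise): string
{
    // Without iconv, normalising isn't possible, so return verbatim.
    if (!function_exists('iconv')) {
        return $String;
    }

    // Already valid UTF-8 (the assumed target encoding): nothing to do.
    if (preg_match('//u', $String) === 1) {
        return $String;
    }

    // Only now hand the string over to the normaliser.
    return $Normalise($String);
}

// Illustrative normaliser that simply assumes the input is ISO-8859-1.
$FromLatin1 = function (string $String): string {
    return iconv('ISO-8859-1', 'UTF-8', $String);
};

// A lone 0xE9 byte isn't valid UTF-8, so this string gets converted.
echo guardSketch("caf\xE9", $FromLatin1), "\n";

// This string is already valid UTF-8, so it's returned verbatim.
echo guardSketch("caf\u{E9}", $FromLatin1), "\n";
```

Putting the cheap checks first means the potentially expensive (and potentially failing) normalisation step only runs when it's both possible and necessary.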

Last member.

The Last member is a string populated by the normalise method, and can be used after calling normalise or guard, to determine the most recent character encoding that the Demojibakefier converted a string from.

public $Last = '';

Example usage:

<?php
/** Instantiate the Demojibakefier. */
$Demojibakefier = new \Maikuolan\Common\Demojibakefier();

/** Iterate through an arbitrary array of elements containing presumably wrongly encoded data. */
foreach ($Array as $Element) {

    /** Provide each element to guard. */
    $TestString = $Demojibakefier->guard($Element);

    /** Check whether the result of guard is different to the original string, and whether Last has been populated. */
    if ($TestString !== $Element && $Demojibakefier->Last) {
        // Element was normalised using the encoding specified by Last (do here as you would accordingly).
    } else {
        // Nothing was normalised (do here as you would accordingly).
    }

}

CIDRAM and phpMussel do something similar on the front-end logs page, to inform users when log entry fields have been transformed by the Demojibakefier.

Len member.

The Len member is an integer populated by the normalise method, representing the total length of the provided string (it uses strlen() internally to do this).
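Because strlen() counts bytes rather than characters, the value of Len for multibyte text will be larger than the number of visible characters. A quick illustration, using sample strings chosen for this example only:

```php
<?php
// Four ASCII characters, one byte each.
$Ascii = 'moji';

// Two Japanese characters ("moji"), three bytes each when encoded as UTF-8.
// Written as Unicode escapes to avoid depending on this file's own encoding.
$Japanese = "\u{6587}\u{5B57}";

var_dump(strlen($Ascii));    // int(4)
var_dump(strlen($Japanese)); // int(6)
```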