proposal: x/text/encoding: Handling Encoding Errors by Replacing Visually Similar Unicode Characters in ShiftJIS Encoding #69934

Open
yuki2006 opened this issue Oct 18, 2024 · 4 comments
@yuki2006

Proposal Details

Summary

When encoding Unicode strings to Shift JIS in Go, some characters fail to encode even though they look almost identical to characters that Shift JIS can represent. This is confusing: the text looks the same, yet encoding returns an error. This proposal suggests a normalization step that replaces these characters with their Shift JIS-compatible equivalents before encoding. The transformation is one-way, so the original characters cannot be restored; that is acceptable for our use case.

Background

Shift JIS is a character encoding for the Japanese language but does not support all Unicode characters. Some visually similar characters have different code points and cannot be encoded in Shift JIS, causing encoding errors and confusion.

Examples:

The Unicode character "〜" (U+301C, wave dash) looks similar to "～" (U+FF5E, fullwidth tilde).
The Unicode character "−" (U+2212, minus sign) resembles the standard hyphen-minus "-" (U+002D).
These visually similar characters are often used interchangeably in text but may cause encoding errors when converting to Shift JIS. In our application, it is acceptable that the transformation is not reversible; we prioritize successful encoding over the ability to revert to the original characters.

Proposal

Introduce a normalization function that replaces visually similar Unicode characters, which cannot be encoded in Shift JIS, with their equivalent characters that can be encoded. This function can be integrated into the encoding process or provided as a utility in the golang.org/x/text/encoding/japanese package.

https://go.dev/play/p/OtEWoZmxDzb

package main

import (
	"fmt"

	"golang.org/x/text/encoding/japanese"
	"golang.org/x/text/transform"
)

func main() {
	replacements := map[string]string{
		"〜": "～", // U+301C (Wave Dash) → U+FF5E (Fullwidth Tilde)
		"−": "-", // U+2212 (Minus Sign) → U+002D (Hyphen-Minus)
		"—": "-", // U+2014 (Em Dash) → U+002D (Hyphen-Minus)
		"•": "*", // U+2022 (Bullet) → U+002A (Asterisk)
	}

	encoder := japanese.ShiftJIS.NewEncoder()

	for orig, replacement := range replacements {
		// Check if the original character can be encoded
		_, _, errOrig := transform.String(encoder, orig)
		// Check if the replacement character can be encoded
		_, _, errReplacement := transform.String(encoder, replacement)

		if errOrig == nil {
			fmt.Printf("Mapping may be unnecessary: Original character %q can be encoded.\n", orig)
		} else {
			fmt.Printf("Mapping necessary: Original character %q cannot be encoded: %v\n", orig, errOrig)
		}
		if errReplacement != nil {
			fmt.Printf("Warning: Replacement character %q cannot be encoded: %v\n", replacement, errReplacement)
		}
	}
}

Output

Mapping necessary: Original character "•" cannot be encoded: encoding: rune not supported by encoding.
Mapping necessary: Original character "〜" cannot be encoded: encoding: rune not supported by encoding.
Mapping necessary: Original character "−" cannot be encoded: encoding: rune not supported by encoding.
Mapping necessary: Original character "—" cannot be encoded: encoding: rune not supported by encoding.
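
The check above only reports which characters fail. A minimal sketch of the proposed normalization step itself could look like the following. The function name normalizeForShiftJIS and the table are hypothetical: the table simply reuses the four example pairs from this issue and is not an official or complete mapping. It uses only the standard library, so the result still has to be passed to the Shift JIS encoder afterwards.

```go
package main

import (
	"fmt"
	"strings"
)

// shiftJISLookalikes is a hypothetical, application-specific table built from
// the examples in this issue. It is an assumption, not a standard mapping.
var shiftJISLookalikes = strings.NewReplacer(
	"\u301C", "\uFF5E", // Wave Dash → Fullwidth Tilde
	"\u2212", "-", // Minus Sign → Hyphen-Minus
	"\u2014", "-", // Em Dash → Hyphen-Minus
	"\u2022", "*", // Bullet → Asterisk
)

// normalizeForShiftJIS replaces the look-alike characters listed above.
// The replacement is one-way: the original characters cannot be recovered.
func normalizeForShiftJIS(s string) string {
	return shiftJISLookalikes.Replace(s)
}

func main() {
	in := "10\u301C20 \u2212 5"
	fmt.Printf("%q -> %q\n", in, normalizeForShiftJIS(in))
}
```

Keeping the table outside the encoder leaves the choice of substitutions to each application.
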
@gopherbot gopherbot added this to the Proposal milestone Oct 18, 2024
@ianlancetaylor
Contributor

CC @mpvl

@robpike
Contributor

robpike commented Oct 20, 2024

If this is a wise approach, and it well may be, there should already be an official defining table for how to handle the translation. Go's implementation should not be the one to codify it.

@yuki2006
Author

yuki2006 commented Oct 21, 2024

Thank you for your comment. Indeed, it might not be appropriate for the Go standard library to create a definition table.

In that case, would it be possible to identify which character (and at which position) failed to encode, and furthermore, allow us to specify a fallback when the conversion fails? (It might be convenient if we could specify a callback function, for example.)

https://go.dev/play/p/Jg6oE7cko4i

Postscript: It seems we can identify the location from the second return value of transform.String, n, which is the number of source bytes consumed before the error.

@mattn
Member

mattn commented Oct 21, 2024

Japanese versions of Windows still treat file names as Shift_JIS in some processes. This is not limited to Japan; it is also the case in China and Korea, where double-byte character sets are used. Go, which uses UTF-8 as its internal encoding, has almost no problem when using the Windows wide character API to determine filenames, but when Go uses the command line to control specific filenames, it must handle Shift_JIS. In such cases, we want to use fallback characters to replace characters that exist only in Unicode.
