Input length validation may be inaccurate. #272

Open
Vincent-Stragier opened this issue Aug 16, 2024 · 2 comments

Vincent-Stragier commented Aug 16, 2024

Hi @snjv94,

Looking at the code, I'd say the error is still due to the length of the string. At the moment, the code checks the number of characters in the string (here 1701 and 2053), which is lower than the number of bytes (here 5035 and 6091). So the input passes the length check, but the API then fails to return a valid response. By chance, you still received an answer for the shorter sample, even though it seems to be slightly oversized in bytes.
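
To make the gap concrete, here is a small sketch comparing the two counts. The sample text is made up (it is not the string from the original report), and the 5000-byte ceiling is only what this thread suspects, not a documented limit.

```python
# Hypothetical sample: each of these characters takes 3 bytes in UTF-8.
sample = "日本語" * 1000

n_chars = len(sample)                  # what len(text) measures
n_bytes = len(sample.encode("utf-8"))  # size of the UTF-8 encoding

print(n_chars, n_bytes)

# The current check compares characters against 5000, so it passes...
passes_current_check = 0 <= n_chars < 5000
# ...while the encoded text is well above 5000 bytes.
within_assumed_byte_limit = n_bytes < 5000
```

For text in a 3-byte script, the character count is about a third of the byte count, which matches the ratio between 1701 and 5035 above.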

Note: in my case, I at first received None as the answer for the shorter sample string, because you forgot the return statement in your function.

def translate(self, text: str, **kwargs) -> str:
    """
    function to translate a text
    @param text: desired text to translate
    @return: str: translated text
    """
    if is_input_valid(text, max_chars=5000):
        text = text.strip()
        if self._same_source_target() or is_empty(text):
            return text
        self._url_params["tl"] = self._target
        self._url_params["sl"] = self._source
        if self.payload_key:
            self._url_params[self.payload_key] = text
        response = requests.get(
            self._base_url, params=self._url_params, proxies=self.proxies
        )
        if response.status_code == 429:
            raise TooManyRequests()
        if request_failed(status_code=response.status_code):
            raise RequestError()

As we can see here, len(text) is used and text is a string:

def is_input_valid(
    text: str, min_chars: int = 0, max_chars: Optional[int] = None
) -> bool:
    """
    validate the target text to translate
    @param min_chars: min characters
    @param max_chars: max characters
    @param text: text to translate
    @return: bool
    """
    if not isinstance(text, str):
        raise NotValidPayload(text)
    if max_chars and (not min_chars <= len(text) < max_chars):
        raise NotValidLength(text, min_chars, max_chars)
    return True

A way to solve this issue is probably to use the string encode method to encode the string in UTF-8 as an array of bytes. In the Python documentation about Unicode, we can read the following snippet:

>>> u = chr(40960) + 'abcd' + chr(1972)
>>> u.encode('utf-8')
b'\xea\x80\x80abcd\xde\xb4'
>>> u.encode('ascii')  
Traceback (most recent call last):
    ...
UnicodeEncodeError: 'ascii' codec can't encode character '\ua000' in
  position 0: ordinal not in range(128)
>>> u.encode('ascii', 'ignore')
b'abcd'
>>> u.encode('ascii', 'replace')
b'?abcd?'
>>> u.encode('ascii', 'xmlcharrefreplace')
b'&#40960;abcd&#1972;'
>>> u.encode('ascii', 'backslashreplace')
b'\\ua000abcd\\u07b4'
>>> u.encode('ascii', 'namereplace')
b'\\N{YI SYLLABLE IT}abcd\\u07b4'

This makes me believe len(text) should be replaced by len(text.encode("utf-8", "ignore")) or len(text.encode("utf-8", "xmlcharrefreplace")). It might also help to report both the string length and the byte length to the user when raising the error.
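
A minimal sketch of what that could look like, assuming the limit really applies to UTF-8 bytes (this thread suggests so but does not confirm it). The names is_input_valid_bytes, min_bytes and max_bytes are hypothetical, and the NotValidLength class below is a stand-in with an extended signature, not the library's own exception.

```python
from typing import Optional


class NotValidLength(Exception):
    """Stand-in exception (hypothetical signature) that reports both sizes."""

    def __init__(self, n_chars: int, n_bytes: int, min_bytes: int, max_bytes: int):
        self.n_chars = n_chars
        self.n_bytes = n_bytes
        super().__init__(
            f"text has {n_chars} characters ({n_bytes} UTF-8 bytes); "
            f"expected at least {min_bytes} and fewer than {max_bytes} bytes"
        )


def is_input_valid_bytes(
    text: str, min_bytes: int = 0, max_bytes: Optional[int] = None
) -> bool:
    """Validate the text against a byte limit instead of a character limit."""
    if not isinstance(text, str):
        raise TypeError("text must be a str")
    # The error handler only matters for lone surrogates; ordinary text
    # encodes to the same bytes with or without it.
    n_bytes = len(text.encode("utf-8", "xmlcharrefreplace"))
    if max_bytes and not (min_bytes <= n_bytes < max_bytes):
        raise NotValidLength(len(text), n_bytes, min_bytes, max_bytes)
    return True


# Usage: a short ASCII string passes, a long 3-byte-per-character one does not.
ok = is_input_valid_bytes("hello", max_bytes=5000)
try:
    is_input_valid_bytes("日本語" * 1000, max_bytes=5000)
    error = None
except NotValidLength as exc:
    error = exc
```

Carrying both counts on the exception lets the caller see why a string that looks short enough in characters was still rejected.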

Thanks for the assistance, @Vincent-Stragier

Originally posted by @snjv94 in #169 (comment)

TrexPD commented Aug 19, 2024

So if I understand correctly, it would be enough to use:

text_bytes = text.encode('utf-8')
if max_chars and (not min_chars <= len(text_bytes) < max_chars):
    raise NotValidLength(text, min_chars, max_chars)

Vincent-Stragier commented

I would use the following to avoid raising an encoding error:

text_bytes = text.encode("utf-8", "ignore")
# or
text_bytes = text.encode("utf-8", "replace")
# or
# probably better to avoid getting an underestimated size (still not perfect).
text_bytes = text.encode("utf-8", "xmlcharrefreplace")

if max_chars and (not min_chars <= len(text_bytes) < max_chars):
    raise NotValidLength(text, min_chars, max_chars)
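
For reference, here is a small sketch of how the three handlers differ. As far as I know, the only str content that UTF-8 cannot encode in strict mode is a lone surrogate, so the handlers only change the result for that case. The sample string is made up for illustration.

```python
# Hypothetical input: "ab" followed by a lone surrogate code point.
lone = "ab\ud800"

# Strict mode (the default) refuses to encode the lone surrogate.
try:
    lone.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Byte length reported under each error handler.
sizes = {
    handler: len(lone.encode("utf-8", handler))
    for handler in ("ignore", "replace", "xmlcharrefreplace")
}
print(strict_failed, sizes)

# For ordinary text the handler makes no difference at all.
same_for_normal_text = "日本語".encode("utf-8", "ignore") == "日本語".encode("utf-8")
```

"ignore" drops the offending code point, "replace" substitutes a single "?", and "xmlcharrefreplace" expands it to a numeric character reference, so it gives the largest (most conservative) size of the three.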
