Merge Telegram JSON export into Signal backup #153

Open
korg91 opened this issue Oct 29, 2023 · 32 comments
Labels
enhancement New feature or request

Comments


korg91 commented Oct 29, 2023

I've been wanting to switch all my groups from Telegram to Signal for a long time, but what's holding me (and my friends) back is that I don't want to lose the chat history. I was wondering if it's possible to use signalbackup-tools to merge my Telegram history into my Signal backup?

I see that #19 already mentions this use case, but it's unclear to me whether the system is reliable and usable end-to-end. Also, the specific case of Telegram might be easier to address since Telegram allows users to export the full chat history in JSON.

Unfortunately, it looks like Telegram's JSON export format is not documented. I've reached out to Telegram to ask about this. In the meantime, this thread and this blog post look helpful. Overall I feel the format should be fairly easy to understand even without documentation, but I haven't properly looked into it.

I've just made a small donation for encouragement and to thank @bepaald for all the work that he has already put into the project. Happy to make a bigger donation if he or anybody can implement this :)

Owner

bepaald commented Oct 30, 2023

Hi!

This should indeed be possible, as I mentioned in our email exchange, at least to some useful degree. I have already started working on this (nothing usable yet, just pushed some early code this morning, though locally I have more already), at first by installing Telegram to get some data. I will try to ask around to see if anyone I know also has Telegram so I can actually get some real data (I currently only have messages in an empty group I made with just myself). But that shouldn't be a big problem.

I had found the same links about the JSON format and now found the actual current export to be slightly different. I agree the format is pretty simple, the problem is I can only deal with what I know. For example, I see in my (very small) export the key chats.list.message.type with values service and message, but I have no way of knowing what other types exist. Even worse is if I don't even know a certain key can exist, for example sending a picture creates a key photo, and sending a video creates a key file. This is a matter of trying to get as many different message types/contents as possible. It's not a big problem, with enough feedback and testing we will certainly be able to deal with everything we encounter, but documentation would have been handy. I also noticed some other things that Signal doesn't even support (like 'polls', 'secret chats' and 'channels').
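One way to work around the missing documentation is to survey an export for the `type` values and message-level keys it actually contains. A minimal sketch, assuming the `chats.list[].messages[]` layout mentioned above; `survey_export` and the sample data are illustrative and not part of signalbackup-tools:

```python
from collections import Counter


def survey_export(export):
    """Count message `type` values and message-level keys across all chats."""
    types = Counter()
    keys = Counter()
    for chat in export.get("chats", {}).get("list", []):
        for msg in chat.get("messages", []):
            types[msg.get("type", "<missing>")] += 1
            keys.update(msg.keys())
    return types, keys


# Made-up sample; a real export would be loaded with json.load().
sample = {
    "chats": {
        "list": [
            {
                "name": "Empty group",
                "type": "private_group",
                "messages": [
                    {"id": 1, "type": "service"},
                    {"id": 6, "type": "message", "photo": "p.jpg", "text": "caption"},
                ],
            }
        ]
    }
}

types, keys = survey_export(sample)
print(dict(types))
print(sorted(keys))
```

Running this over several real exports and diffing the key lists would show which message contents are still unhandled.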

Two things I already noticed in my current JSON export:

  • When sending a message with multiple attachments, the exported JSON simply shows multiple messages with one attachment each. You could infer they belong to the same message from the timestamp, but since the timestamp is in full-second resolution, that's not really reliable if you ask me. The imported messages will probably always show a separate message for each attachment.
  • While testing text styling, I noticed that while Telegram supports overlapping styles, they are not properly exported in the JSON. The message body 'This is a test sentence' is exported as
       "text_entities": [
        {
         "type": "bold",
         "text": "This is a test"
        },
        {
         "type": "italic",
         "text": " sentence"
        }
       ]
    
    So after importing, the styling of the message will be changed ('This is a test sentence').
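To illustrate why overlapping styles cannot be recovered, here is a rough sketch of flattening `text_entities` into a plain body plus style ranges. `flatten_entities` is a hypothetical helper, not the tool's actual code, and offsets are counted in Python characters (an importer may need different units):

```python
def flatten_entities(text_entities):
    """Return (body, ranges); each range is (start, length, style) for a styled run."""
    body = ""
    ranges = []
    for entity in text_entities:
        text = entity.get("text", "")
        style = entity.get("type", "plain")
        if style != "plain" and text:
            ranges.append((len(body), len(text), style))
        body += text
    return body, ranges


entities = [
    {"type": "bold", "text": "This is a test"},
    {"type": "italic", "text": " sentence"},
]
body, ranges = flatten_entities(entities)
print(body)
print(ranges)
```

Because the export lists entities as consecutive, non-overlapping runs, each character ends up with at most one style, so any overlap that existed in Telegram is already gone by this point.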

Anyway, this is just to let you know work has started already, but please do not expect it to be done quickly (it will probably be weeks). I will probably have some questions about Telegram in the process. And, especially in the beginning, expect a ton of errors when trying out new code (I will let you know when there is something to test).

I've just made a small donation for encouragement and to thank @bepaald for all the work that he has already put into the project.

That is very generous of you, thank you so much.

@bepaald bepaald added the enhancement New feature or request label Oct 30, 2023
Author

korg91 commented Nov 2, 2023

Hi @bepaald, thanks a lot for starting the work on this already! This all sounds very promising. It's unfortunate that Telegram doesn't provide any documentation, but let's see if they get back to me with that. In the meantime, from your initial exploration it sounds like reverse engineering could work.

The bugs and limitations you mention are nothing serious. The problem with multiple attachments might be a bit annoying for messages that contain many photos, but that's not important and in any case it can probably be fixed easily by merging messages that were sent within e.g. 1 sec.
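The merge-by-timestamp idea could look roughly like the sketch below. It assumes messages are sorted by time, that attachments appear under the `photo` or `file` keys mentioned earlier, and that a `from_id` is present; `group_attachment_messages` is a made-up name. As noted above, one-second resolution means this can occasionally merge messages that were actually sent separately.

```python
def has_attachment(msg):
    """True if the message carries a photo or file attachment."""
    return "photo" in msg or "file" in msg


def group_attachment_messages(messages, window=1):
    """Group consecutive attachment messages from one sender sent within `window` seconds."""
    groups = []
    for msg in messages:
        prev = groups[-1][-1] if groups else None
        if (
            prev is not None
            and has_attachment(msg)
            and has_attachment(prev)
            and msg.get("from_id") == prev.get("from_id")
            and int(msg["date_unixtime"]) - int(prev["date_unixtime"]) <= window
        ):
            groups[-1].append(msg)
        else:
            groups.append([msg])
    return groups


msgs = [
    {"from_id": "user1", "date_unixtime": "100", "photo": "a.jpg"},
    {"from_id": "user1", "date_unixtime": "100", "photo": "b.jpg"},
    {"from_id": "user1", "date_unixtime": "105", "text": "hi"},
    {"from_id": "user2", "date_unixtime": "105", "photo": "c.jpg"},
]
print([len(g) for g in group_attachment_messages(msgs)])
```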

Since Signal does not support 'polls', they can probably be disregarded. 'channels' are essentially 1-way chats -- it might make sense to allow import anyway, but I'd say it's not very important. I'm a bit surprised that we have 'secret chats'. Those are e2ee chats, but afaik they are stored only on-device and I did not expect the extract to contain any data relating to them. Anyway, the vast majority of communications on Telegram happens through standard non-e2ee chats (which is the main issue with Telegram), so again this is not a big problem.

If you need someone with whom to exchange messages on Telegram I'm happy to help, just let me know via email :)

Owner

bepaald commented Nov 6, 2023

I think I could do with a little feedback on this now. If at all possible, maybe you could give it a try? No need to import the result into Signal, just run the tool and look at the output (or paste it here). If there are no major errors, and the tool actually finishes successfully, you could add the --exporthtml option to actually check the output without importing to phone.

Instructions:
signalbackup-tools [input.backup] [passphrase] --importtelegram [result.json]

The tool will need to know which recipient in the backup is yourself. It can usually determine this automatically, but if it fails it will tell you and you will need to add the --setselfid option.

The tool will first scan the JSON file for all recipients and try to match them by name. If it fails to do this for any contacts, you get an error stating so, and you should add the --mapjsoncontacts option.

If you need someone with whom to exchange messages on Telegram I'm happy to help, just let me know via email :)

I might need that, I'll let you know. Currently my JSON file looks something like this:

{
 "about": "Here is the data you requested. Remember: Telegram is ad free, it doesn't use your data for ad targeting and doesn't sell it to others. Telegram only keeps the information it needs to function as a secure and feature-rich cloud service.\n\nCheck out Settings > Privacy & Security on Telegram's mobile apps for the relevant settings.",
 "chats": {
  "about": "This page lists all chats from this export.",
  "list": [
   {
    "name": "Empty group",
    "type": "private_group",
    "id": 40552xxxxx,
    "messages": [
     {
      "id": 6,
      "type": "message",
      "date": "2023-10-19T22:58:17",
      "date_unixtime": "1697749097",
      "from": "Devphone Black",
      "from_id": "user68058xxxxx",
      "photo": "chats/chat_2/photos/photo_2@19-10-2023_22-58-17.jpg",
      "width": 1280,
      "height": 960,
      "text": "caption",
      "text_entities": [
       {
        "type": "plain",
        "text": "caption"
       }
      ]
     },
     {
      "id": 9,
      "type": "message",
      "date": "2023-10-20T08:23:50",
      "date_unixtime": "1697783030",
      "from": "Devphone Black",
      "from_id": "user68058xxxxx",
      "reply_to_message_id": 6,
      "text": [
       "And ",
       {
        "type": "italic",
        "text": "styling"
       },
       "?"
      ],
      "text_entities": [
       {
        "type": "plain",
        "text": "And "
       },
       {
        "type": "italic",
        "text": "styling"
       },
       {
        "type": "plain",
        "text": "?"
       }
      ]
     }
    ]
   }
  ]
 }
}

(Obviously, it has more than just two messages). Depending on the export settings there are other top level data entries (like, 'stories', 'personal_information', 'contacts', etc.), but they are all skipped. I'm sort of assuming the only difference between this group chat and a personal chat is the line "type": "private_group", changes into "type": "personal_chat",. But if not I'd love to get just a snippet with the differences in it.

Some current limitations (some I've mentioned before):

  • Overlapping text styles (while supported in Telegram) are not exported correctly in the JSON file.
  • I think message delivery info is not exported by Telegram. (I will probably default to setting all imported messages as sent+received, but this is not done yet)
  • Any message-types other than 'message' are skipped, but a warning is printed. If some of them are present in the JSON and can be supported, let me know.
  • Multiple attachments currently turn into multiple images. I do plan to merge on timestamp, but that's not done yet. DONE
  • Underline style is not supported by Signal
  • Attachments in quoted messages are not done yet (but I will) DONE
  • Stickers turn into normal image attachments. Signal needs the id of the sticker_pack (among other things), which is not available (and might not exist if the stickers are not in a Signal sticker pack).
  • 'Forwarded from' is not an existing attribute for Signal messages
  • Message reactions (emoji reactions) are not exported by Telegram
  • 'Poll' messages do not exist in Signal and are currently skipped. (it might also be possible to turn them into a text-representation and import as a normal message, but currently they are skipped)

Anyway, I expect some bugs/problems in a first test like this, but things are mostly working for me currently, so some feedback might come in handy. No hurry, whenever you have time.

Thanks!

Author

korg91 commented Nov 14, 2023

@bepaald Sorry I haven't yet found the time to test this out! I'll do it asap (hopefully this week).

In the meantime I have some good news: Telegram got back to me about the JSON documentation and apparently they have one! Very weird that it seems to be poorly indexed by Google...

This and this are also interesting. It's the opposite of our use case: importing chats from other apps into Telegram. Obviously it's not directly useful here, but maybe it's an interesting source of ideas.

Owner

bepaald commented Nov 15, 2023

No worries, there is certainly no hurry.

That's a good resource. I've not studied it fully, but it looks like I'm on the right track. There is a lot more I didn't know existed, but most of those things will not be possible to convert to Signal (either because they don't exist, like HTML5_Game (?), or because they will not be compatible, like payments and gifts).

Very busy at work this period, but I will look closer at it when I have time. Thanks!


ajgrf commented Dec 16, 2023

Does it make more sense to define a simpler JSON format (+attachments) which is easy to incorporate into signalbackup-tools, leaving the conversion for external scripts?

I have a similar need to import chat history from element, and I imagine that implementing an import feature for 1 or 2 apps might lead to a lot of brittle and difficult-to-test code, as well as more feature requests for importing from new apps.

Owner

bepaald commented Dec 16, 2023

Well, it is indeed the intention to convert any other JSON formats to the one supported (either by this tool, or by the user) and let the existing import functionality do the actual work. Just not sure whether to use this Telegram format or try to define my own 'simpler' format. I think the Telegram format is already not too complicated. When supporting the huge number of features most modern messengers do (groups, attachments, quotes, reactions, etc.), some complexity is bound to appear. Defining a new universal JSON format for this seems like a big undertaking.

Currently in favor of using the Telegram format is:

  • It is already implemented (though untested)
  • Telegram is huge (much, much bigger than even Signal is)
  • Its json format is actually well documented

Just last weekend, I used this function to import plain old SMS messages (as exported by adb backup) into a backup (not to restore on phone, just to use the --exporthtml option). Converting Android's sms-json to one compatible with this import function was just 3 lines of sqlite (create table, insert messages into it, select data from it). If I were better at SQL it could probably be done in 1.

Do you find the format overly complex? Do you need help converting the 'element' chat history? What does that format look like?

Note, while the Telegram JSON format has many features, most are not required by this function. At a minimum, only the date_unixtime, text_entities, from and type fields are required for messages and name and type for chats (and they need to exist in the correct objects/arrays).
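For anyone converting another app's history, a minimal export built from just those fields might be generated like this. This is a sketch based only on the field list in this comment: `minimal_export` is a hypothetical helper, the `chats.list` nesting follows the sample export earlier in the thread, and later comments move contact mapping to `from_id`, so newer versions of the tool may need that field as well.

```python
import json


def minimal_export(chat_name, rows, chat_type="personal_chat"):
    """Build a Telegram-style export dict from (unix_seconds, sender, body) rows."""
    messages = [
        {
            "type": "message",
            "date_unixtime": str(ts),  # the sample export stores this as a string
            "from": sender,
            "text_entities": [{"type": "plain", "text": body}],
        }
        for ts, sender, body in rows
    ]
    chat = {"name": chat_name, "type": chat_type, "messages": messages}
    return {"chats": {"list": [chat]}}


rows = [
    (1697749097, "Alice", "Hello"),
    (1697783030, "Bob", "Hi there"),
]
export = minimal_export("Bob", rows)
print(json.dumps(export, indent=1))
```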


ajgrf commented Dec 16, 2023

Sorry, I didn't realize that this feature was mostly complete already. I didn't see any information about it in the readme or help output (using the nix package).

The Telegram format is probably fine, actually. I don't need any help converting to it at this point, but I may be back with questions later. Thanks for the quick response!

Author

korg91 commented Jan 20, 2024

Ok I finally had time to test this! Sorry for the (huge) delay

So far I've only tried to export to HTML and quite a few things already seem to work great! This is quite amazing, great job :)

What worked:

  1. I could import 1-1 and group chats using --mapjsoncontacts. For groups, I had to assign the name of the group to the Signal ID of the group, e.g. "Test Group"=123.
  2. I didn't check every type of message, but I didn't spot any error in the html export.

Here are some issues I faced:

  1. I believe you mistyped the relevant command in your comments. I've had a look at the code and I think that --importtelegramjson is actually --importtelegram. No big deal (but might be good to update your comment for future readers?)
  2. Name-based automatic matching of recipients didn't seem to work but I didn't test it properly, it might actually work.
  3. The list of Telegram contacts produced by the signalbackup-tools (when it cannot identify a match automatically) often includes "double" entries. Specifically, for a contact Bob Smith, it often includes both "Bob" and "Bob Smith". I'm sure this happens when the JSON file contains "name": "Bob" at the beginning of the conversation with Bob Smith, but contains "from": "Bob Smith" in each message. (I speculate that the JSON contains this discrepancy whenever the first message of the conversation was written by Bob and Bob is not in my address book, in which case Telegram automatically names the conversation using only "Bob" rather than "Bob Smith" (where "Bob" is the first name that Bob has set for himself in his account)). In any case, if I map both "Bob" and "Bob Smith" to the same ID in the Signal backup everything seems to work as it should (i.e. all Telegram messages from/to Bob are copied to the Signal conversation with Bob).
  4. The tool is a bit grumpy when the Telegram export contains contacts that are not in the Signal backup. It looks like every Telegram contact must necessarily be assigned to a Signal contact via --mapjsoncontacts, otherwise the import does not even start. Perhaps it would be useful to allow ignoring some contacts? Even better, it would be great to be able to perform something like --listrecipients on the Telegram export and then be able to manually specify the contacts for which the user wants to import the messages (and ignore all the others).
  5. I've noticed that Signal doesn't use a specific label to mark forwarded messages, whereas Telegram does (together with the name of the original sender). Given this difference, I think it would make sense to import such messages prepending something like [Forwarded from <user>]. This could be optional I guess, but imo without this the chat history looks a bit weird and confusing (because of the automatic labeling, when forwarding messages on Telegram I'd generally not start with a message like "hey these are the messages that Alice sent me").
  6. For these tests I've actually used a friend's Telegram export, which is a lot smaller than mine. With that, the import was very fast and seemed to have worked well. When I tried with mine (~15GB), after about 3 hours it was still not done and there was no output (the CPU was still at 100%), so I killed it. Not sure whether this is normal and it would've been successful after a few more hours. I feel this might be because of the large number of contacts in the Telegram export (because of several very big groups with >1K participants). In any case, I think it's not a good idea to try and restore such a big export in one go. The problem is that Telegram allows you to export either one or all chats, you cannot select only a subset. Of course this could be done by manually removing them from the JSON (or by signalbackup-tools).

Overall, I think it would be great if you could add support for single-chat exports and/or for selecting the chats to import from multi-chat exports. It is possible to download single chats from Telegram Desktop by entering the chat and clicking on the three-dot menu > Export chat history. With this, I could try to do more tests with my own chats (which include years of messages) without (hopefully) running into the previous issue.
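Until the tool can select chats itself, the manual route of trimming the JSON could be sketched as follows. It assumes the full-export layout (`chats.list`) shown earlier; `list_chats` and `select_chats` are illustrative helpers, not tool options.

```python
def list_chats(export):
    """Return (index, name, type, message_count) for every chat in a full export."""
    return [
        (i, chat.get("name"), chat.get("type"), len(chat.get("messages", [])))
        for i, chat in enumerate(export["chats"]["list"])
    ]


def select_chats(export, indices):
    """Return a new export dict that keeps only the chats at the given indices."""
    wanted = set(indices)
    kept = [c for i, c in enumerate(export["chats"]["list"]) if i in wanted]
    # Shallow copies only: the message lists are shared with the original,
    # which keeps memory use low for large exports.
    return {**export, "chats": {**export["chats"], "list": kept}}


full = {
    "about": "sample",
    "chats": {
        "about": "sample",
        "list": [
            {"name": "A", "type": "personal_chat", "messages": []},
            {"name": "B", "type": "private_group", "messages": [{"id": 1}]},
        ],
    },
}
print(list_chats(full))
trimmed = select_chats(full, [1])
print([c["name"] for c in trimmed["chats"]["list"]])
```

The trimmed dict can be written back with `json.dump` and passed to the import option in place of the original file.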

Thanks a lot again!

Owner

bepaald commented Jan 21, 2024

Thanks so much for your thorough feedback. Those are all clear and actionable issues, I can definitely work with that (in fact, 1 is already solved! :-) ).

Not sure when I'll have time, but the first thing I'd want to try to deal with is 6. When you say you have a 15GB export, I assume (hope) you mean 15GB for the entire folder including attachments, not the json file itself? Any idea about how big the json was? I'd really like to try and replicate a json causing this, to try and fix it.

Other priorities are supporting single-chat exports and selecting conversations from multi-chat ones (somehow). But the other issues will all be addressed as well I think.

Hopefully I'll have some time tomorrow, but I'll let you know when things have changed enough for testing. Thanks again!

Author

korg91 commented Jan 21, 2024

Awesome, thanks! And yes, the 15GB include media. results.json alone is a bit over 100MB (over 4M lines).

6 is indeed the most urgent one (together with single-chat or selected-chats support in any form) because as long as the issue is there it's quite difficult for me to provide more feedback.

Owner

bepaald commented Jan 24, 2024

Just a quick update, I've been very busy the last few days. With this issue, among other things.

While solving some of the issues you mentioned above I've also done a lot of refactoring, and not a lot of testing, so I hope I didn't break anything that was working before. Hopefully:

  • Large json files now work well. I think I managed to replicate your experience, but with the changes made in parsing, the same file is now read in a second or two.
  • Single-chat-exports should also work now.
  • You can get a list of chats with --listjsonchats [jsonfile] and limit the chats to be imported with --selectjsonchats [list-of-indices]. (I might change the name of that last option at some point).

I have not yet dealt with 5, but that should be very easy to do when I have time. I'd just like to know if it works at all before getting into these kinds of details.

That leaves 2 and 3. If you have more info/more certainty about 2, I'd like to know. Not sure how to deal with 3 to be honest, but I'll think about it.

Author

korg91 commented Jan 28, 2024

I've tried the new commands with both a single-chat export and the 15GB full export; they seem to work flawlessly! Amazing :)

I've found one bug though. While trying to export a group, I got the error message [Error]: Recipient id not found in contactmap (and the export skipped to the next chat). Upon investigation, I've found that the message was sent from a now deleted account (which appears as "Deleted Account" in Telegram). In the json file, the sender appears as "from": null. In this specific case the account was a bot (Telegram support bots), but I think this is irrelevant, I guess the json file would look the same even if the message was sent from a standard account.

I'm not sure what's the best way to address this. Maybe automatically add a "Deleted Account" contact in the Signal backup? But then would that mean that Deleted Account would appear as a member of the imported group? That's not very clean. Perhaps an alternative solution would be to use the selfid for these messages, prepending something like [from: Deleted Account].
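To see how widespread this is before importing, one could scan the export for messages whose sender is null. A small sketch, assuming the full-export layout and that only messages of type `message` are expected to carry a `from` key; `null_sender_ids` is a made-up helper name:

```python
def null_sender_ids(export):
    """Map chat name -> set of from_id values on messages whose "from" is null."""
    result = {}
    for chat in export.get("chats", {}).get("list", []):
        for msg in chat.get("messages", []):
            if msg.get("type") != "message":
                continue
            if "from" in msg and msg["from"] is None:
                result.setdefault(chat.get("name"), set()).add(msg.get("from_id"))
    return result


# Made-up sample data for illustration.
sample = {
    "chats": {
        "list": [
            {
                "name": "Test Group",
                "type": "private_group",
                "messages": [
                    {"type": "message", "from": None, "from_id": "user111"},
                    {"type": "message", "from": "Bob Smith", "from_id": "user222"},
                    {"type": "service"},
                ],
            }
        ]
    }
}
print(null_sender_ids(sample))
```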

Owner

bepaald commented Jan 31, 2024

I've tried the new commands with both a single-chat export and the 15GB full export; they seem to work flawlessly! Amazing :)

Excellent, thanks for the feedback. Did you just try the import function, or also the HTML export to look at the results?

I've found one bug though. While trying to export a group, I got the error message [Error]: Recipient id not found in contactmap (and the export skipped to the next chat). Upon investigation, I've found that the message was sent from a now deleted account (which appears as "Deleted Account" in Telegram). In the json file, the sender appears as "from": null. In this specific case the account was a bot (Telegram support bots), but I think this is irrelevant, I guess the json file would look the same even if the message was sent from a standard account.

I'm not sure what's the best way to address this. Maybe automatically add a "Deleted Account" contact in the Signal backup? But then would that mean that Deleted Account would appear as a member of the imported group? That's not very clean.

Ok, I have been thinking about this, but I'm not sure yet what to do. In the import functions (this, and --importfromdesktop), I require the contacts to exist in the backup file, because I do not think it is possible to create valid new contacts in a backup locally (it needs ids and keys, which I think are generated server side). This case is slightly different, in that the goal is to create an invalid contact, one that should never come in conflict with a future new contact. So, maybe indeed an "Unknown Telegram Contact"? I'd need to investigate whether it is possible to create a contact without any other identifying info (aci, pni, username, keys, phone number, etc), but I think it might be. I don't think this means the contact will appear as a group member. Group membership and message authors are really two separate things: if a member leaves a group, their messages remain in the chat unchanged, and these messages would then appear just like those.

Perhaps an alternative solution would be to use the selfid for these messages, prepending something like [from: Deleted Account].

That's also a possibility, but then the messages would appear as outgoing messages (colored bubble on the right hand side)... that's also not very clean I think.

I'll think about it some more, maybe try some options out next weekend. If you have a strong preference or other ideas, let me know.

Thanks!

Author

korg91 commented Feb 3, 2024

All the tests I've run so far are with import + export to HTML, and I've manually checked the HTML to confirm it looks good. I might've missed some issues, but the ones I reported here are all those that I spotted :) I haven't yet tried to actually restore the backup on the phone, but that should be straightforward, right?

Regarding the "Deleted Account", everything you say makes sense. I think that the possible solutions from best to worst are:

  1. Automatically create an invalid contact, if that works smoothly and is "future proof" (i.e. Signal updates will never complain about the (restored) database).
  2. Instruct the user to manually create an ad hoc contact in Signal and re-create the backup. I think this is totally ok, at least for my use case.
  3. Restore using self ID and prepending something like [from: Deleted Account]. As you say, this is not very clean. But after all the self ID (likely) corresponds to the user who's performing the import, so it wouldn't come "as a surprise" if they are informed at the end of the import.

Owner

bepaald commented Feb 4, 2024

All the tests I've run so far are with import + export to HTML, and I've manually checked the HTML to confirm it looks good. I might've missed some issues, but the ones I reported here are all those that I spotted :)

Ok, good.

I haven't yet tried to actually restore the backup on the phone, but that should be straightforward, right?

Well, if the htmlexport would show problems, the backup almost certainly has problems, so that's a good sign. But that doesn't necessarily work the other way around. I think it should be good, but the html export is certainly less picky than Signal itself.

Regarding the "Deleted Account", everything you say makes sense. I think that the possible solutions from best to worst are:

I agree with this. I'm almost certain option (1) will work, and I don't foresee any problems, but whether or not it's future proof will always be a gamble. The type of contact I would insert does not exist in a natural database as far as I can tell; I've been trying to get one in there. Any recipient will always have at least one of: phone number, aci, pni, group_id. The only two exceptions to this are distribution lists (for sharing stories), which have a specific type that this contact will not have, and the "release channel" recipient, which is Signal's own recipient for posting news and important announcements (it has registered set to true, whereas the fake contact would need this set to false).

A simple workaround would be to give the fake contact a fake phone number, this can naturally occur in the database, and so should be future proof, but seems like an ugly solution (but would be equivalent to option (2), where the user would also have to come up with a fake number: otherwise the contact will not show up in Signal).

I'm postponing the decision a little bit, because I think I have to rewrite the contact mapping first. I've been using the name of the contacts to do the mapping, but I should really be using the from_id. Otherwise having multiple contacts with the same name (which I assume can happen in Telegram, it can in Signal) would cause problems. Also, using the from_id would probably fix the 'double entries' you mentioned in an earlier message. I assume "Bob" and "Bob Smith" will still have the same from_id. You mentioned that messages from the "Deleted Account" had from: null, but do they still have a filled in from_id? I'm hoping yes, that would also be helpful. Then, after rewriting that part, option (2) should be available to you anyway, without any extra coding from me.
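The difference between keying on names and keying on `from_id` can be seen by indexing an export both ways. A rough sketch, not the tool's implementation; `index_senders` and the sample data are hypothetical:

```python
from collections import defaultdict


def index_senders(export):
    """Return (names_by_id, ambiguous_names) built from message-level from/from_id."""
    names_by_id = defaultdict(set)
    ids_by_name = defaultdict(set)
    for chat in export.get("chats", {}).get("list", []):
        for msg in chat.get("messages", []):
            if "from_id" not in msg:
                continue
            names_by_id[msg["from_id"]].add(msg.get("from"))
            ids_by_name[msg.get("from")].add(msg["from_id"])
    # A name shared by more than one from_id cannot be mapped safely by name alone.
    ambiguous = {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}
    return dict(names_by_id), ambiguous


sample = {
    "chats": {
        "list": [
            {
                "name": "Test Group",
                "type": "private_group",
                "messages": [
                    {"type": "message", "from": "Alex", "from_id": "user1"},
                    {"type": "message", "from": "Alex", "from_id": "user2"},
                    {"type": "message", "from": None, "from_id": "user3"},
                ],
            }
        ]
    }
}
names_by_id, ambiguous = index_senders(sample)
print(names_by_id)
print(ambiguous)
```

Keyed by `from_id`, the two contacts named "Alex" stay distinct, and a sender whose name is null still has a usable key.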

I think I'll have time to do the contact mapping tomorrow. Thanks!

Author

korg91 commented Feb 4, 2024

A simple workaround would be to give the fake contact a fake phone number, this can naturally occur in the database, and so should be future proof, but seems like an ugly solution (but would be equivalent to option (2), where the user would also have to come up with a fake number: otherwise the contact will not show up in Signal).

Yeah it's not super clean, but I think it can be fine. I can simply create a contact with number 0000 or something like that. Having something automated would be a bit nicer, but if there's a risk that Signal might complain about the database one day then I feel it's not worth the risk.

I'm postponing the decision a little bit, because I think I have to rewrite the contact mapping first. I've been using the name of the contacts to do the mapping, but I should really be using the from_id. Otherwise having multiple contacts with the same name (which I assume can happen in Telegram, it can in Signal) would cause problems. Also, using the from_id would probably fix the 'double entries' you mentioned in an earlier message. I assume "Bob" and "Bob Smith" will still have the same from_id. You mentioned that messages from the "Deleted Account" had from: null, but do they still have a filled in from_id? I'm hoping yes, that would also be helpful. Then, after rewriting that part, option (2) should be available to you anyway, without any extra coding from me.

Ok sounds great! I can confirm the from_id is preserved even when you have "from": null. Thanks a lot!

Owner

bepaald commented Feb 6, 2024

Just a quick message/question since I'm logged in anyway. I've been redoing the contactmapping, using id's instead of names, but it was a bit more complicated so it's not yet finished. I'm also trying to be more clever about automatically matching Telegram contacts to Signal contacts, but notice I make assumptions occasionally that I don't know are true in Telegram.

From your earlier message:

I'm sure this happens when the JSON file contains "name": "Bob" at the beginning of the conversation with Bob Smith, but contains "from": "Bob Smith" in each message.

I understand a chat name can be different from the contact name. I'd like the program to automatically link these contacts so that only one of them needs to be mapped. I think I can do this if:

  • The conversation with "Bob" is a personal_chat (not a group)
  • I know the from_id of "self".
  • The conversation contains messages from both contacts.

Since personal chats only contain (max) two from_id's, if I know which of those is "self", I know the other one matches whatever name is at the chat-level and I don't need to map that one separately.

But this is only true if — like I was blindly assuming (and as is the case in Signal) — you can only have 1 personal_chat with each contact. However I don't know if this is the case: I don't know why the chat name can be different from the contact name at all (in Signal this is impossible), so maybe it's possible to start multiple personal_chats with the same contact and rename them yourself? For example: "Bob (work related)" and "Bob (after hours)", that would complicate things... Thanks!
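The linking rule described in this comment can be sketched as below, under the stated assumption that each contact has at most one `personal_chat`. `chat_name_aliases` is a hypothetical helper, not the code that was eventually pushed:

```python
def chat_name_aliases(export, self_id):
    """Map each personal-chat name to the from_id of the other participant."""
    aliases = {}
    for chat in export.get("chats", {}).get("list", []):
        if chat.get("type") != "personal_chat":
            continue
        ids = {m["from_id"] for m in chat.get("messages", []) if m.get("from_id")}
        others = ids - {self_id}
        # Only link when both parties have written and exactly one id is not "self".
        if self_id in ids and len(others) == 1:
            aliases[chat.get("name")] = next(iter(others))
    return aliases


sample = {
    "chats": {
        "list": [
            {
                "name": "Bob",
                "type": "personal_chat",
                "messages": [
                    {"type": "message", "from": "Me", "from_id": "user1"},
                    {"type": "message", "from": "Bob Smith", "from_id": "user2"},
                ],
            },
            {
                "name": "Test Group",
                "type": "private_group",
                "messages": [
                    {"type": "message", "from": "Bob Smith", "from_id": "user2"},
                ],
            },
        ]
    }
}
print(chat_name_aliases(sample, "user1"))
```

With this alias in hand, mapping either "Bob" or "Bob Smith" would be enough, since both resolve to the same `from_id`.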

Author

korg91 commented Feb 7, 2024

I'm quite sure you can have only one personal_chat with a contact. I also think that personal chats can have only two from_ids, also accounting for bots. This is because (1) I don't think bots can be invited to a personal_chat and (2) the use of inline bots will appear as a message sent by the from_id of the user who invoked the inline bot (with an additional via @[bot_name] label), not by the from_id of the bot (see here).

I don't know why the chat name can be different from the contact name at all

As I wrote in a previous comment, I believe this happens when the first message of the conversation was written by Bob and Bob is not in my address book, in which case Telegram automatically names the conversation using only "Bob" (where "Bob" is the first name that Bob has set for himself in his account) rather than "Bob Smith" or just "unknown contact" or Bob's phone number (Telegram doesn't even need a phone number to sign up). To clarify: if I then save Bob's contact as "Bob Smith", then the Telegram app names the chat as "Bob Smith". But apparently this is not reflected in the json export (I guess for good reasons).

Overall, I think your assumptions hold :)

@bepaald
Owner

bepaald commented Feb 8, 2024

Thanks! I just pushed the update to use contact id's. It's pretty messy code, not too proud of it, but I hope it works.

There are still many, many (rare-ish) corner cases in which things can go wrong (like when two different contacts have the exact same name, or a contact is called "user122325" while that is also an existing id in the json file). Also, the user-supplied map option may not work if one has contacts with , or = in their names (maybe with escaping?).
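To make the escaping idea concrete, here is a rough Python sketch of what a backslash-escaped map could look like. It assumes the map is written as NAME=ID pairs separated by commas, which this thread implies but does not spell out, and `parse_contact_map` is a hypothetical helper, not something the tool provides.

```python
def parse_contact_map(spec):
    """Parse 'NAME=ID,NAME=ID' into a dict; a backslash escapes the next character."""
    pairs = {}
    key, buf, escaped = None, [], False
    for ch in spec:
        if escaped:
            buf.append(ch)          # take the escaped character literally
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == "=" and key is None:
            key, buf = "".join(buf), []   # first unescaped '=' ends the name
        elif ch == ",":
            if key is None:
                raise ValueError(f"missing '=' in entry {''.join(buf)!r}")
            pairs[key] = "".join(buf)     # unescaped ',' ends the entry
            key, buf = None, []
        else:
            buf.append(ch)
    if escaped:
        raise ValueError("dangling backslash at end of input")
    if key is not None:
        pairs[key] = "".join(buf)
    elif buf:
        raise ValueError(f"missing '=' in entry {''.join(buf)!r}")
    return pairs


# A name containing a comma and a name containing an equals sign.
print(parse_contact_map(r"Smith\, Bob=12,a\=b=7"))
```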

So, the main thing that has changed should be that null contacts are now also available to be mapped, so that should allow you to map them to a specially created contact ("Deleted Telegram contact") as you please.

The program now also attempts to determine when different contacts are really the same, so it should hopefully ask for fewer contacts to be mapped manually. This process is helped by a correctly mapped "self", or when importing from a full export (not single chat).

There is also a new option --jsonprependforward, which will prepend "Forwarded from NAME:" (in italics) to forwarded messages. I could change this message if desired.
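Roughly, the effect on a message body looks like the sketch below. It is a Python illustration of the idea rather than the real implementation; `prepend_forward` is a hypothetical name, and the field names `forwarded_from` and `text_entities` are the ones found in the Telegram export.

```python
def prepend_forward(message):
    """Return the message's text_entities with a 'Forwarded from NAME:' header prepended."""
    entities = list(message.get("text_entities", []))
    name = message.get("forwarded_from")
    if not name:
        return entities  # no forwarded_from value: leave the body unchanged
    header = [
        {"type": "italic", "text": f"Forwarded from {name}:"},
        {"type": "plain", "text": "\n"},
    ]
    return header + entities


msg = {
    "forwarded_from": "Bob",
    "text_entities": [{"type": "plain", "text": "blablabla"}],
}
print(prepend_forward(msg))
```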

Lastly there is the option --preventjsonmapping. If the auto mapping makes a mistake for any reason (for example, multiple contacts with the same name), --preventjsonmapping "Bob Smith" will prevent the auto mapping of that specific name. It will then need to be mapped manually (using a unique identifier such as the id), but at least then the messages will not end up in the wrong thread.

The change was bigger and more complicated than I thought, I hope I didn't break things that were working before...


I also noticed during testing I'm not setting any delivery/read-receipts on the newly imported messages. The Telegram export has no data on this. For me personally, I think I'd like the messages to appear as delivered (but not read) by default. Just because I think most contacts have delivery receipts turned on, but not read receipts? But maybe you (or others) would feel differently? Maybe it should be another command line option?

Also, for group messages, Signal has detailed delivery reports (per group member and with timestamps). These will probably be too difficult to implement correctly as it is not easy (maybe impossible?) to know exactly which contacts were members of the group at the time any message was sent. So even though these messages would appear as delivered/read in the main view, long-tapping a message to look at the details would show nothing. I don't consider this too big of an issue; I think it's pretty rare for people to look at that information anyway.

@korg91
Author

korg91 commented Feb 10, 2024

Ok I've tried to import two chats (one group and one personal) and it seems to be working great! I've spotted only one bug: I think the --jsonprependforward function doesn't like single quotes. On some messages, I get the following error (modified for privacy):

[Error]: During sqlite3_prepare_v2(): near "di": syntax error
         -> Query: "SELECT json_array(json_object('type', 'italic', 'text', 'Forwarded from Bob:'), json_object('type', 'plain', 'text', '
'), json_extract('[{"type":"plain","text":"blablabla\nun po' di blablabla"}]', '$[0]'))"

Also, is --mapjsoncontacts still supposed to work? It looks like it works with group chat names, user names (from), and user IDs (from_id). This is great but I just want to confirm that it is indeed "supported". I ask mainly because it seems to me that --mapjsoncontacts is the only way to import a Telegram group chat to a Signal group chat (with a different name), right?

@bepaald
Owner

bepaald commented Feb 10, 2024

I think the --jsonprependforward function doesn't like single quotes.

Good catch, should be fixed now, thanks!

Also, is --mapjsoncontacts still supposed to work? It looks like it works with group chat names, user names (from), and user IDs (from_id). This is great but I just want to confirm that it is indeed "supported". I ask mainly because it seems to me that --mapjsoncontacts is the only way to import a Telegram group chat to a Signal group chat (with a different name), right?

Yes, to everything. The auto-mapping is definitely not solid enough (and often impossible) for the function to not support manual mapping.

Thanks!

@korg91
Author

korg91 commented Feb 21, 2024

I've tried to import some chats into my actual Signal chats on my phone and it seems to work great! This is really amazing, awesome job!

I think there are still some issues with apostrophes, probably when one is in the "forwarded" field (I'm using --jsonprependforward). See e.g. here:

[Error]: During sqlite3_prepare_v2(): near "s": syntax error
         -> Query: "SELECT json_array(json_object('type', 'italic', 'text', 'Forwarded from Du Rove's Channel:'), json_object('type', 'plain', 'text', '
'))"
[Error]: After sqlite3_step(): malformed JSON
         -> Query: "SELECT json_insert(?, '$[#]', json_extract(?, '$[0]'))"

I also have about 40 lines with this:

[Warning]: Something went wrong merging unknown json contacts

I can't spot any issue in the imported chats though. The warning is a bit weird since (I think) I'm using --mapjsoncontacts specifying all the contacts that are in the chats I'm importing (specified with --selectjsonchats). Is there any way to know more?

Thanks a lot!

@bepaald
Owner

bepaald commented Feb 22, 2024

I think there are still some issues with apostrophes, probably when one is in the "forwarded" field (I'm using --jsonprependforward). See e.g. here:

Ah, stupid of me, I fixed it for the message body before, but it didn't occur to me the same could happen in the contact name. Should be fixed now (poorly tested, I'm late for work :))
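For anyone hitting similar errors in their own scripts: one general way to rule out this class of quoting problem is to bind the strings as parameters instead of pasting them into the SQL text. The sketch below uses Python's sqlite3 module and a build of SQLite with the JSON functions available; it only illustrates the technique and is not how the fix was actually made in the tool's C++ code.

```python
import json
import sqlite3

name = "Du Rove's Channel"  # contact name containing an apostrophe
body = json.dumps([{"type": "plain", "text": "un po' di blablabla"}])

con = sqlite3.connect(":memory:")
# Placeholders keep the strings out of the SQL text, so quotes need no escaping.
row = con.execute(
    "SELECT json_array("
    "json_object('type', 'italic', 'text', ?), "
    "json_object('type', 'plain', 'text', char(10)), "
    "json_extract(?, '$[0]'))",
    (f"Forwarded from {name}:", body),
).fetchone()
con.close()

merged = json.loads(row[0])
print(merged)
```

Doubling each single quote inside the string literal would also work, but binding avoids having to remember it for every field.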

I also have about 40 lines with this:

At first glance I do not think it is a problem (I believe this part only adjusts things to make a cleaner prompt when presenting unknown contacts (if any) to the user), but it is unexpected that the process would ever reach that code. I'll investigate when I'm back from work.

Thanks for the feedback!

@bepaald
Owner

bepaald commented Feb 22, 2024

I had a go at removing the Something went wrong merging unknown json contacts warnings. If they are now indeed gone, they were harmless anyway. If they're still there let me know. Thanks!

@korg91
Author

korg91 commented Mar 18, 2024

Hi! I've finally tried to do the big import of a lot of my messages into my actual Signal, specifying manually the chats to import and the mapping for json contacts. It mostly seems to work well, and that's a 12GB import! 🚀

However, I've encountered a nasty bug with the null contact (a bot). I've created a fake contact Deleted Telegram User with a random number in my address book. This contact does show up in my Signal backup with --listrecipients. So I tried to assign that null contact (actually, multiple null contacts in the same group chat) to the Deleted Telegram User contact via --mapjsoncontacts (using the null contacts' IDs). The import of Telegram data seems to work well, as well as the import of the modified backup into the Signal app. Inside the Signal app, I can also see the images shared by the null contact in the Shared Media menu, and I can see the sender is indeed Deleted Telegram User. However, if I try to click or scroll to any message sent by the Deleted Telegram User contact, the Signal app crashes immediately.

Is there any way I can help you debug this? Btw, I haven't used --preventjsonmapping at all, I've just specified all (I think?) the contacts using --mapjsoncontacts. This should be enough, no? Perhaps it should be possible to have something like --preventjsonmapping all, just to be sure that all the mapping is correctly controlled by --mapjsoncontacts (and an error is raised if not)?

Also, small issue: there's a typo when using --jsonprependforward: the string contains Fowarded instead of Forwarded :)

Thanks a lot!

@bepaald
Owner

bepaald commented Mar 19, 2024

Hi! I've finally tried to do the big import of a lot of my messages into my actual Signal, specifying manually the chats to import and the mapping for json contacts. It mostly seems to work well, and that's a 12GB import! 🚀

Ok, that sounds promising

However, I've encountered a nasty bug with the null contact (a bot). I've created a fake contact Deleted Telegram User with a random number in my address book. This contact does show up in my Signal backup with --listrecipients. So I tried to assign that null contact (actually, multiple null contacts in the same group chat) to the Deleted Telegram User contact via --mapjsoncontacts (using the null contacts' IDs). The import of Telegram data seems to work well, as well as the import of the modified backup into the Signal app. Inside the Signal app, I can also see the images shared by the null contact in the Shared Media menu, and I can see the sender is indeed Deleted Telegram User. However, if I try to click or scroll to any message sent by the Deleted Telegram User contact, the Signal app crashes immediately.

Is there any way I can help you debug this?

Hm, I was hoping that wouldn't happen. As I mentioned before, I have so far not created contacts in backups (because I expect it not to work), but with a contact already in the database I was hoping it would be ok. Apparently not. I would guess the Signal app needs something filled in in the database that is currently missing for this recipient. The question is what (and then, can we do it)? My first thought is the contact may need an 'aci' (used to be called 'uuid'), which would usually be assigned server side upon registration.

I think you should be able to get a crash report by, after letting the app crash, starting it back up and generating a debuglog (settings->help->debuglog). I think the crash (uncaught exception in Signal) should be in there even after restarting. The other option would be to turn on USB debugging on your phone and let the program crash while attached to your computer with adb logcat running. The first option is definitely simpler if it works.

I might try this myself at some point, though currently all my testing phones have expired SIM cards, so I can't receive the registration SMS, so I can't currently restore backups on them. So that will be a while, but I will eventually get around to it if you don't beat me to it.

Btw, I haven't used --preventjsonmapping at all, I've just specified all (I think?) the contacts using --mapjsoncontacts. This should be enough, no? Perhaps it should be possible to have something like --preventjsonmapping all, just to be sure that all the mapping is correctly controlled by --mapjsoncontacts (and an error is raised if not)?

Yes, specifying all contacts should be enough. If you run the tool with --verbose it will print a message whenever it finds a contact on its own, or links contacts together (it will also print a lot of other stuff). (I've just added the last few remaining output-messages for this, so be sure to update if you want to try)

If you specify all contacts, I think there should be no output between ALL CONTACTS IN JSON: (...table), and [FINAL CONTACT MAP] (...list), specifically there should be no lines starting with "Found json contact..." or "Linking contacts...".
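If the verbose output is saved to a file, that check can be automated. The Python sketch below is only an illustration: `automap_lines` is a hypothetical helper, the marker strings are the ones quoted above, and the rest of the sample output (including whatever follows "Found json contact") is made up.

```python
def automap_lines(output):
    """Return auto-mapping lines printed between the contact table and the final map."""
    start_marker = "ALL CONTACTS IN JSON:"
    end_marker = "[FINAL CONTACT MAP]"
    prefixes = ("Found json contact", "Linking contacts")
    inside = False
    hits = []
    for line in output.splitlines():
        if start_marker in line:
            inside = True
        elif end_marker in line:
            inside = False
        elif inside and line.lstrip().startswith(prefixes):
            hits.append(line.strip())
    return hits


sample = """ALL CONTACTS IN JSON:
(...table)
Found json contact (rest of this line is hypothetical)
[FINAL CONTACT MAP]
(...list)
"""
# An empty result would mean every contact came from the manual map.
print(automap_lines(sample))
```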

Also, small issue: there's a typo when using --jsonprependforward: the string contains Fowarded instead of Forwarded :)

Fixed! Thanks!

Thanks a lot!

Thank you once again for your feedback!

@korg91
Author

korg91 commented Apr 3, 2024

Sorry for the delay! Here's the most relevant line I could find in Signal's debug log:

[7.1.3] [main ] [date redacted] E SignalUncaughtException: org.thoughtcrime.securesms.recipients.Recipient$MissingAddressError: Missing address for XXX

where XXX is the Signal recipient ID that I'm trying to use (overzealously redacted). I can send you more lines via email if necessary!

In the meantime, I've just bought a cheap SIM card and used that one to create a Signal account for this purpose. I confirm that everything seems to work with that! I'll report back if I spot any issue :) And I'm happy to help you debug the fake user issue and also to give feedback on the documentation for this feature if you plan to write it.

I also want to sincerely thank you for working on this in the past few months! It's pretty great that everyone now has a tool to bring their Telegram conversations to Signal. I've just sent you a donation, it's just a small gift to show my gratitude for your work and friendliness :)

@bepaald
Owner

bepaald commented Apr 4, 2024

Sorry for the delay! Here's the most relevant line I could find in Signal's debug log:

[7.1.3] [main ] [date redacted] E SignalUncaughtException: org.thoughtcrime.securesms.recipients.Recipient$MissingAddressError: Missing address for XXX

where XXX is the Signal recipient ID that I'm trying to use (overzealously redacted). I can send you more lines via email if necessary!

Hey thanks for reporting back. No worries about the delay.

That error really does seem like an ACI is needed for recipients with messages in the database. To be sure I think I'd need to see a bit more of the error. Usually the initial exception (in this case MissingAddressError) is followed by a number of lines like at some.classes.followed.by.a.function(filename.java:[linenumber]). These are often useful in tracking down the problem, especially since the MissingAddressError exception can be thrown from a number of different functions. I don't think there is ever any user data in those following lines (it just links to source code), but feel free to redact anything you feel necessary. E-mail is also fine of course.
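Pulling those lines out of a long debug log can be scripted. The following Python sketch assumes each stack frame sits on its own log line and contains "at package.Class.method(File.java:N)", possibly after the usual log prefix; that layout, the helper name `extract_trace`, and the sample frame are assumptions, not something confirmed in this thread.

```python
import re

# Matches "at some.package.Class.method(Anything)" anywhere in a line.
FRAME = re.compile(r"\bat\s+[\w.$<>]+\(.*\)")


def extract_trace(log_text, exception="MissingAddressError"):
    """Return the first line naming the exception plus the frame lines right after it."""
    trace = []
    collecting = False
    for line in log_text.splitlines():
        if not collecting:
            if exception in line:
                trace.append(line)
                collecting = True
        elif FRAME.search(line):
            trace.append(line)
        else:
            break  # first non-frame line ends the trace
    return trace


sample = "\n".join([
    "[7.1.3] [main ] [date redacted] E SignalUncaughtException: "
    "org.thoughtcrime.securesms.recipients.Recipient$MissingAddressError: Missing address for XXX",
    "[7.1.3] [main ] [date redacted] E SignalUncaughtException: "
    "\tat some.classes.followed.by.a.function(filename.java:123)",
    "[7.1.3] [main ] [date redacted] I SomeOtherTag: unrelated line",
])
for line in extract_trace(sample):
    print(line)
```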

In the meantime, I've just bought a cheap SIM card and used that one to create a Signal account for this purpose. I confirm that everything seems to work with that! I'll report back if I spot any issue :) And I'm happy to help you debug the fake user issue and also to give feedback on the documentation for this feature if you plan to write it.

Assuming the error log confirms that a Signal contact (real or even fake) cannot be created out of nothing, I think that was indeed the only option (besides just skipping those messages altogether). Very happy to hear everything else seems to be working so far. I hope it stays that way, but indeed let me know if anything comes up.

I'll leave this issue open while I muster up the courage to try and write some documentation for this :) Not my favorite thing to do, especially since I think this function is a bit more complicated for the end-user than most. I'll report back when I have something, so you can take a look at it (if you have time of course), it may be a while though. Initially I'll just quickly link to this issue in the readme.

I also want to sincerely thank you for working on this in the past few months! It's pretty great that everyone now has a tool to bring their Telegram conversations to Signal. I've just sent you a donation, it's just a small gift to show my gratitude for your work and friendliness :)

Thank you for your continuous testing and feedback. Functions like these are practically impossible to implement without valuable feedback. You have been very helpful. And thank you for another generous donation; though not necessary, it is certainly very much appreciated.

@Eco-Gaming

Hello, first of all thank you for your amazing work on this tool! With it, I was able to decrypt my database, assign a new id to a broken recipient, and afterwards encrypt and import the database again (worked perfectly).

Now, I'm trying to write some translation-scripts to import my WhatsApp chat history into Signal. I skimmed this thread a few days ago, I believe you said you wanted to only accept the Telegram json format, correct?
Could you maybe elaborate on which json fields you actually read from the backup file? That way I won't have to "translate" unnecessary fields and can save myself some work.

Here's an example WhatsApp json I have right now (one file per chat, slightly modified export of WhatsApp-Chat-Exporter):

chats/Dave Smith.json
{
    "name": "Dave Smith",
    "type": "android",
    "my_avatar": null,
    "their_avatar": null,
    "their_avatar_thumb": null,
    "status": "Hey there, I am using WhatsApp!",
    "messages": {
        "18455": {
            "from_me": true,
            "timestamp": 1584364650.692,
            "time": "14:17",
            "media": false,
            "key_id": "10E4846396F18700DAE0744E7442EA54",
            "meta": true,
            "data": null,
            "sender": null,
            "safe": false,
            "mime": null,
            "reply": null,
            "quoted_data": null,
            "caption": null,
            "thumb": null,
            "sticker": false
        },
        "18456": {
            "from_me": true,
            "timestamp": 1584364650.688,
            "time": "14:17",
            "media": false,
            "key_id": "37185009D31E3E1398113E376ABB1AD5",
            "meta": false,
            "data": "Hello Dave!",
            "sender": null,
            "safe": false,
            "mime": null,
            "reply": null,
            "quoted_data": null,
            "caption": null,
            "thumb": null,
            "sticker": false
        },
        "18462": {
            "from_me": false,
            "timestamp": 1584365423.0,
            "time": "14:30",
            "media": false,
            "key_id": "3A93117FA88F0C654421",
            "meta": false,
            "data": "Hey, there!<br>Look at this beautiful linebreak and quote.",
            "sender": null,
            "safe": false,
            "mime": null,
            "reply": "37185009D31E3E1398113E376ABB1AD5",
            "quoted_data": "Hello Dave!",
            "caption": null,
            "thumb": null,
            "sticker": false
        },
        "18475": {
            "from_me": true,
            "timestamp": 1584366324.0,
            "time": "14:45",
            "media": true,
            "key_id": "3EB047372D5E4D972CE6",
            "meta": false,
            "data": "WhatsApp/Media/WhatsApp Images/Sent/IMG-20200316-WA0005.jpg",
            "sender": null,
            "safe": false,
            "mime": "image/jpeg",
            "reply": null,
            "quoted_data": null,
            "caption": "Check out this cool bird I saw",
            "thumb": null,
            "sticker": false
        },
        "19929": {
            "from_me": false,
            "timestamp": 1584870473.0,
            "time": "10:47",
            "media": true,
            "key_id": "5E784DFEBC57DA894B81",
            "meta": false,
            "data": "WhatsApp/Media/WhatsApp Audio/AUD-20200322-WA0000.m4a",
            "sender": null,
            "safe": false,
            "mime": "audio/mp4",
            "reply": null,
            "quoted_data": null,
            "caption": null,
            "thumb": null,
            "sticker": false
        }
    }
}

I also have access to the phone number for each chat and could easily add them to the top of this json file, if that helps.

@bepaald
Owner

bepaald commented Apr 21, 2024

Hello, first of all thank you for your amazing work on this tool! With it, I was able to decrypt my database, assign a new id to a broken recipient, and afterwards encrypt and import the database again (worked perfectly).

Hi! Thanks, glad the tool has been useful to you.

Now, I'm trying to write some translation-scripts to import my WhatsApp chat history into Signal. I skimmed this thread a few days ago, I believe you said you wanted to only accept the Telegram json format, correct?

Well, it's not that it has to be Telegram's json format, but I did not want to support many different ones (if we could just convert between them). The Telegram format just so happened to be the first one to be completed (and requested), so for now I'm going with it. If there is some other very prevalent json format, I could always write an option to also accept that and do the conversion internally, instead of having the user do it.

Could you maybe elaborate on which json fields you actually read from the backup file? That way I won't have to "translate" unnecessary fields and can save myself some work.

I'll try. I believe the program expects a json array $.chats.list, from which name, type, id, and messages are used. The last one, messages, is itself a json array from which the following keys are (possibly) used:

if (!d_database.exec("INSERT INTO messages SELECT "
"REPLACE(REPLACE(path, '$.chats.list[', ''), '].messages', '') AS chatidx, "
"json_extract(value, '$.id') AS id, "
"json_extract(value, '$.type') AS type, "
"json_extract(value, '$.date_unixtime') AS date, "
"json_extract(value, '$.from') AS from_name, "
"json_extract(value, '$.from_id') AS from_id, "
"json_extract(value, '$.text_entities') AS body, "
"json_extract(value, '$.reply_to_message_id') AS reply_to_id, "
"json_extract(value, '$.forwarded_from') AS forwarded_from, "
"json_extract(value, '$.photo') AS photo, "
"json_extract(value, '$.width') AS width, "
"json_extract(value, '$.height') AS height, "
"json_extract(value, '$.file') AS file, "
"json_extract(value, '$.media_type') AS media_type, "
"json_extract(value, '$.mime_type') AS mime_type, "
"json_extract(value, '$.poll') AS poll FROM tmp_json_tree"))

Some of those are obviously not required (if the message contains no attachment, there is no mime_type for example), others are required (like from_id and date_unixtime). I'd say just set as many of those as you can for each message.

A short example:

{
 "chats": {
  "about": "This page lists all chats from this export.",    # <-- IGNORED
  "list": [
   {
    "name": "Empty group",
    "type": "private_group",   # <-- MUST BE 'private_group' OR 'personal_chat'
    "id": 4045174149,
    "messages": [
     {
      "id": 14,
      "type": "message",
      "date": "2023-10-30T08:43:01",    # <-- IGNORED
      "date_unixtime": "1698651781",
      "from": "Name of Sender",
      "from_id": "user6805890121",
      "reply_to_message_id": 13,   # <-- REFERENCES AN EARLIER 'id'
      "photo": "chats/chat_1/photos/photo_1@30-10-2023_08-43-01.jpg",
      "width": 1280,
      "height": 960,
      "text": "Here is a picture",    # <-- IGNORED
      "text_entities": [
       {
        "type": "plain",
        "text": "Here is a "
       },
       {
        "type": "bold",
        "text": "picture"
       }
      ]
     },
     {
      <another message from this chat>
     }, ...
    ]
   },
   {
    <another chat>
   }, ...
  ]
 }
}
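Before feeding a hand-made file to the tool, it may help to sanity-check it. The Python sketch below checks only the constraints named in this comment (the two allowed chat types, and from_id and date_unixtime on every message); `check_export` is a hypothetical helper, and the tool itself may well enforce more or different rules.

```python
REQUIRED_MESSAGE_KEYS = ("from_id", "date_unixtime")
ALLOWED_CHAT_TYPES = ("private_group", "personal_chat")


def check_export(data):
    """Return a list of problems found in a parsed full-export dict."""
    problems = []
    for c_idx, chat in enumerate(data["chats"]["list"]):
        label = f"chat {c_idx} ({chat.get('name')!r})"
        if chat.get("type") not in ALLOWED_CHAT_TYPES:
            problems.append(f"{label}: unsupported type {chat.get('type')!r}")
        for msg in chat.get("messages", []):
            for key in REQUIRED_MESSAGE_KEYS:
                if key not in msg:
                    problems.append(f"{label}: message {msg.get('id')} lacks {key!r}")
    return problems


export = {
    "chats": {
        "list": [
            {
                "name": "Empty group",
                "type": "private_group",
                "id": 4045174149,
                "messages": [
                    {"id": 13, "type": "message", "date_unixtime": "1698651700",
                     "from": "Name of Sender", "from_id": "user6805890121"},
                    {"id": 14, "type": "message",
                     "from": "Name of Sender", "from_id": "user6805890121"},
                ],
            }
        ]
    }
}
# Message 14 deliberately lacks date_unixtime.
for problem in check_export(export):
    print(problem)
```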

For single-chat json, you could just leave out the top level $.chats.list, so such a chat would start with

{
 "name": "Empty group",
 "type": etc...

If you need any more description of any of the fields used, let me know and I'll try to explain as best I can. Also, if the WhatsApp json contains any message-types/attributes that are not available in the Telegram json format, but could be supported by Signal, there should be no problem in adding more fields to the json format this tool supports (they will just be ignored if they don't exist (like in a Telegram json)).
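As a starting point for such a translation script, here is a rough Python sketch that turns one chat in the WhatsApp-Chat-Exporter layout shown above into the single-chat form. Every mapping choice in it is an assumption: `SELF_ID`, `OTHER_ID`, and `convert_chat` are hypothetical, `meta` entries are skipped as presumed system messages, image attachments go to `photo` without width/height, everything else goes to `file`, and only one-to-one chats are handled. Whether the tool accepts the result unchanged has not been tested.

```python
import json

SELF_ID = "user1"   # hypothetical from_id for the exporting account
OTHER_ID = "user2"  # hypothetical from_id for the chat partner


def convert_chat(wa_chat, chat_id, self_name="Me"):
    """Turn one WhatsApp-Chat-Exporter chat dict into a single-chat dict."""
    wa_messages = wa_chat["messages"]
    # WhatsApp replies point at key_id; the target format points at a message id.
    id_by_key = {m["key_id"]: int(k) for k, m in wa_messages.items()}
    messages = []
    for k, m in wa_messages.items():
        if m.get("meta"):
            continue  # assumed to be system entries without user content
        from_me = bool(m.get("from_me"))
        out = {
            "id": int(k),
            "type": "message",
            "date_unixtime": str(int(m["timestamp"])),
            "from": self_name if from_me else wa_chat["name"],
            "from_id": SELF_ID if from_me else OTHER_ID,
        }
        if m.get("reply") in id_by_key:
            out["reply_to_message_id"] = id_by_key[m["reply"]]
        if m.get("media"):
            text = m.get("caption") or ""
            if (m.get("mime") or "").startswith("image/"):
                out["photo"] = m["data"]
            else:
                out["file"] = m["data"]
                out["mime_type"] = m.get("mime")
        else:
            text = m.get("data") or ""
        text = text.replace("<br>", "\n")  # the exporter encodes line breaks as <br>
        out["text_entities"] = [{"type": "plain", "text": text}] if text else []
        messages.append(out)
    return {"name": wa_chat["name"], "type": "personal_chat",
            "id": chat_id, "messages": messages}


wa_chat = {
    "name": "Dave Smith",
    "messages": {
        "18456": {"from_me": True, "timestamp": 1584364650.688, "media": False,
                  "key_id": "37185009D31E3E1398113E376ABB1AD5", "meta": False,
                  "data": "Hello Dave!", "reply": None},
        "18462": {"from_me": False, "timestamp": 1584365423.0, "media": False,
                  "key_id": "3A93117FA88F0C654421", "meta": False,
                  "data": "Hey, there!<br>Look at this beautiful linebreak and quote.",
                  "reply": "37185009D31E3E1398113E376ABB1AD5"},
    },
}
print(json.dumps(convert_chat(wa_chat, chat_id=1), indent=1))
```

The two fixed from_id values are enough for a personal chat, since they only need to be distinct and can then be mapped to real Signal recipients with --mapjsoncontacts.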

It's a lot to explain, I hope that is somewhat clear, but let me know if it's not.

Good luck!

@Eco-Gaming

Thank you for your detailed answer. I think that should be all the info I need, I'll look into it in the coming weeks.
