we use crop the transcript badly (by decimating) #800

Open
wassname opened this issue Oct 13, 2024 · 7 comments

@wassname

Describe the bug

On YouTube, this extension uses the subtitles, not the transcript. The subtitles are terrible and lead to the LLM giving poor output.

To Reproduce

  1. Go to https://www.youtube.com/watch?v=IYaNscnE7rc&t=556s
  2. Run ChatGPTBox.
  3. Look at the summary.
  4. Open the summary in a separate window.
  5. Look at the input passed to the summary.
  6. Go back to the video, open the transcript, and compare.

It seems that this extension is using the subtitles, not the transcript. The subtitles often come from a much poorer transcription model, and uncommon words are missed entirely.

For example, for this video, compare the two excerpts below.

Expected behavior

This is part of the transcript available in the UI:

it is now a matter of public record that under pompeo's explicit Direction the CIA Drew up plans to kidnap and to assassinate me within the Ecuadorian Embassy in London and authorized going after my European colleagues subjecting us to theft hacking attacks and the planting of false information my wife and my infant son were also targeted a CIA asset was permanently assigned to track my wife and instructions were given to obtain DNA from my six month-old son's nappy

And this is the subtitle information received in ChatGPTBox:

it is now a matter of public,kidnap and to assassinate me within the,hacking attacks and the planting of,assigned to track my wife and,nappy

As you can see, it's a poor source of information.

Please complete the following information:

  • OS: Linux
  • Browser: Firefox
@wassname
Author

related: #679

@Mohamed3nan
Contributor

Mohamed3nan commented Oct 14, 2024

I believe there is a bug in how the model's context window length is retrieved and in the calculations that follow. For example, I used the web version of ChatGPT (GPT-4o), which probably has an 8k context window, but I received a maxLength of only 900.

I’m trying to understand how it works, but it’s quite complex.

export async function cropText(
  text,
  maxLength = 4000,
  startLength = 400,
  endLength = 300,
  tiktoken = true,
) {
  const userConfig = await getUserConfig()
  const k = modelNameToDesc(
    userConfig.apiMode ? apiModeToModelName(userConfig.apiMode) : userConfig.modelName,
    null,
    userConfig.customModelName,
  ).match(/[- (]*([0-9]+)k/)?.[1]
  if (k) {
    maxLength = Number(k) * 1000
    maxLength -= 100 + clamp(userConfig.maxResponseTokenLength, 1, maxLength - 1000)
  } else {
    maxLength -= 100 + clamp(userConfig.maxResponseTokenLength, 1, maxLength - 1000)
  }
  const splits = text.split(/[,,。??!!;;]/).map((s) => s.trim())
  const splitsLength = splits.map((s) => (tiktoken ? encode(s).length : s.length))
  const length = splitsLength.reduce((sum, length) => sum + length, 0)
  const cropLength = length - startLength - endLength
  const cropTargetLength = maxLength - startLength - endLength
  const cropPercentage = cropTargetLength / cropLength
  const cropStep = Math.max(0, 1 / cropPercentage - 1)
  if (cropStep === 0) return text
  let croppedText = ''
  let currentLength = 0
  let currentIndex = 0
  let currentStep = 0
  for (; currentIndex < splits.length; currentIndex++) {
    if (currentLength + splitsLength[currentIndex] + 1 <= startLength) {
      croppedText += splits[currentIndex] + ','
      currentLength += splitsLength[currentIndex] + 1
    } else if (currentLength + splitsLength[currentIndex] + 1 + endLength <= maxLength) {
      if (currentStep < cropStep) {
        currentStep++
      } else {
        croppedText += splits[currentIndex] + ','
        currentLength += splitsLength[currentIndex] + 1
        currentStep = currentStep - cropStep
      }
    } else {
      break
    }
  }
  let endPart = ''
  let endPartLength = 0
  for (let i = splits.length - 1; endPartLength + splitsLength[i] <= endLength; i--) {
    endPart = splits[i] + ',' + endPart
    endPartLength += splitsLength[i] + 1
  }
  currentLength += endPartLength
  croppedText += endPart
  console.log(
    `input maxLength: ${maxLength}\n` +
      `maxResponseTokenLength: ${userConfig.maxResponseTokenLength}\n` +
      // `croppedTextLength: ${tiktoken ? encode(croppedText).length : croppedText.length}\n` +
      `desiredLength: ${currentLength}\n` +
      `content: ${croppedText}`,
  )
  return croppedText
}
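To see where a small maxLength can come from, the budget arithmetic at the top of cropText can be isolated as a rough sketch. Here `inputBudget` and the local `clamp` are hypothetical stand-ins (the real function gets the description from `modelNameToDesc` and imports `clamp` from elsewhere), and the `maxResponseTokenLength` values passed in are assumptions unless the thread states them.

```javascript
// Hypothetical re-statement of the budget logic in cropText, for illustration only.
// `desc` stands in for the string returned by modelNameToDesc(...).
function clamp(value, lower, upper) {
  return Math.min(Math.max(value, lower), upper)
}

function inputBudget(desc, maxResponseTokenLength, maxLength = 4000) {
  // Same regex as cropText: picks up "16k", "128k", ... from the description.
  const k = desc.match(/[- (]*([0-9]+)k/)?.[1]
  if (k) maxLength = Number(k) * 1000
  // Reserve 100 tokens plus the (clamped) response length.
  return maxLength - (100 + clamp(maxResponseTokenLength, 1, maxLength - 1000))
}

// No "<number>k" in the description, so the 4000 default applies.
console.log(inputBudget('Claude.ai (Web)', 1000))
console.log(inputBudget('ChatGPT (Web, GPT-4o)', 3000)) // 3000 is an assumed setting
console.log(inputBudget('ChatGPT (GPT-4o, 128k)', 1000))
```

Under these assumptions, any description without a "k" suffix silently falls back to the 4000 default, so a large response-length setting can leave only a few hundred tokens for the input.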

@josStorer Perhaps we should simplify it by adding a new key, such as length, to each model entry:

export const Models = {
  chatgptFree35: { value: 'text-davinci-002-render-sha', desc: 'ChatGPT (Web)' },
  chatgptFree4o: { value: 'gpt-4o', desc: 'ChatGPT (Web, GPT-4o)' },
  chatgptFree4oMini: { value: 'gpt-4o-mini', desc: 'ChatGPT (Web, GPT-4o mini)' },
  chatgptPlus4: { value: 'gpt-4', desc: 'ChatGPT (Web, GPT-4)' },
  chatgptPlus4Browsing: { value: 'gpt-4', desc: 'ChatGPT (Web, GPT-4)' }, // for compatibility
  chatgptApi35: { value: 'gpt-3.5-turbo', desc: 'ChatGPT (GPT-3.5-turbo)' },
  chatgptApi35_16k: { value: 'gpt-3.5-turbo-16k', desc: 'ChatGPT (GPT-3.5-turbo-16k)' },
  chatgptApi4o_128k: { value: 'gpt-4o', desc: 'ChatGPT (GPT-4o, 128k)' },
  chatgptApi4oMini: { value: 'gpt-4o-mini', desc: 'ChatGPT (GPT-4o mini)' },
  chatgptApi4_8k: { value: 'gpt-4', desc: 'ChatGPT (GPT-4-8k)' },
  chatgptApi4_32k: { value: 'gpt-4-32k', desc: 'ChatGPT (GPT-4-32k)' },
  chatgptApi4_128k: {
    value: 'gpt-4-turbo',
    desc: 'ChatGPT (GPT-4-Turbo 128k)',
  },
  chatgptApi4_128k_preview: {
    value: 'gpt-4-turbo-preview',
    desc: 'ChatGPT (GPT-4-Turbo 128k Preview)',
  },
  chatgptApi4_128k_1106_preview: {
    value: 'gpt-4-1106-preview',
    desc: 'ChatGPT (GPT-4-Turbo 128k 1106 Preview)',
  },
  chatgptApi4_128k_0125_preview: {
    value: 'gpt-4-0125-preview',
    desc: 'ChatGPT (GPT-4-Turbo 128k 0125 Preview)',
  },
  claude2WebFree: { value: '', desc: 'Claude.ai (Web)' },
  claude12Api: { value: 'claude-instant-1.2', desc: 'Claude.ai (API, Claude Instant 1.2)' },
  claude2Api: { value: 'claude-2.0', desc: 'Claude.ai (API, Claude 2)' },
  claude21Api: { value: 'claude-2.1', desc: 'Claude.ai (API, Claude 2.1)' },
  claude3HaikuApi: {
    value: 'claude-3-haiku-20240307',
    desc: 'Claude.ai (API, Claude 3 Haiku)',
  },
  claude3SonnetApi: { value: 'claude-3-sonnet-20240229', desc: 'Claude.ai (API, Claude 3 Sonnet)' },
  claude3OpusApi: { value: 'claude-3-opus-20240229', desc: 'Claude.ai (API, Claude 3 Opus)' },
  claude35SonnetApi: {
    value: 'claude-3-5-sonnet-20240620',
    desc: 'Claude.ai (API, Claude 3.5 Sonnet)',
  },
  bingFree4: { value: '', desc: 'Bing (Web, GPT-4)' },
  bingFreeSydney: { value: '', desc: 'Bing (Web, GPT-4, Sydney)' },
  moonshotWebFree: { value: '', desc: 'Kimi.Moonshot (Web, 100k)' },
  bardWebFree: { value: '', desc: 'Gemini (Web)' },
  chatglmTurbo: { value: 'GLM-4-Air', desc: 'ChatGLM (GLM-4-Air, 128k)' },
  chatglm4: { value: 'GLM-4-0520', desc: 'ChatGLM (GLM-4-0520, 128k)' },
  chatglmEmohaa: { value: 'Emohaa', desc: 'ChatGLM (Emohaa)' },
  chatglmCharGLM3: { value: 'CharGLM-3', desc: 'ChatGLM (CharGLM-3)' },
  chatgptFree35Mobile: { value: 'text-davinci-002-render-sha-mobile', desc: 'ChatGPT (Mobile)' },
  chatgptPlus4Mobile: { value: 'gpt-4-mobile', desc: 'ChatGPT (Mobile, GPT-4)' },
  chatgptApi35_1106: { value: 'gpt-3.5-turbo-1106', desc: 'ChatGPT (GPT-3.5-turbo 1106)' },
  chatgptApi35_0125: { value: 'gpt-3.5-turbo-0125', desc: 'ChatGPT (GPT-3.5-turbo 0125)' },
  chatgptApi4_8k_0613: { value: 'gpt-4', desc: 'ChatGPT (GPT-4-8k 0613)' },
  chatgptApi4_32k_0613: { value: 'gpt-4-32k', desc: 'ChatGPT (GPT-4-32k 0613)' },
  gptApiInstruct: { value: 'gpt-3.5-turbo-instruct', desc: 'GPT-3.5-turbo Instruct' },
  gptApiDavinci: { value: 'text-davinci-003', desc: 'GPT-3.5' },
  customModel: { value: '', desc: 'Custom Model' },
  ollamaModel: { value: '', desc: 'Ollama API' },
  azureOpenAi: { value: '', desc: 'ChatGPT (Azure)' },
  waylaidwandererApi: { value: '', desc: 'Waylaidwanderer API (Github)' },
  poeAiWebSage: { value: 'Assistant', desc: 'Poe AI (Web, Assistant)' },
  poeAiWebGPT4: { value: 'gpt-4', desc: 'Poe AI (Web, GPT-4)' },
  poeAiWebGPT4_32k: { value: 'gpt-4-32k', desc: 'Poe AI (Web, GPT-4-32k)' },
  poeAiWebClaudePlus: { value: 'claude-2-100k', desc: 'Poe AI (Web, Claude 2 100k)' },
  poeAiWebClaude: { value: 'claude-instant', desc: 'Poe AI (Web, Claude instant)' },
  poeAiWebClaude100k: { value: 'claude-instant-100k', desc: 'Poe AI (Web, Claude instant 100k)' },
  poeAiWebGooglePaLM: { value: 'Google-PaLM', desc: 'Poe AI (Web, Google-PaLM)' },
  poeAiWeb_Llama_2_7b: { value: 'Llama-2-7b', desc: 'Poe AI (Web, Llama-2-7b)' },
  poeAiWeb_Llama_2_13b: { value: 'Llama-2-13b', desc: 'Poe AI (Web, Llama-2-13b)' },
  poeAiWeb_Llama_2_70b: { value: 'Llama-2-70b', desc: 'Poe AI (Web, Llama-2-70b)' },
  poeAiWebChatGpt: { value: 'chatgpt', desc: 'Poe AI (Web, ChatGPT)' },
  poeAiWebChatGpt_16k: { value: 'chatgpt-16k', desc: 'Poe AI (Web, ChatGPT-16k)' },
  poeAiWebCustom: { value: '', desc: 'Poe AI (Web, Custom)' },
  moonshot_v1_8k: {
    value: 'moonshot-v1-8k',
    desc: 'Kimi.Moonshot (8k)',
  },
  moonshot_v1_32k: {
    value: 'moonshot-v1-32k',
    desc: 'Kimi.Moonshot (32k)',
  },
  moonshot_v1_128k: {
    value: 'moonshot-v1-128k',
    desc: 'Kimi.Moonshot (128k)',
  },
}
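One way the "new key" idea could look is sketched below. The `contextLength` field, the `DEFAULT_CONTEXT_LENGTH` constant, and the `getContextLength` helper are all hypothetical names, not part of the current table; the two numeric values simply mirror what the existing regex would derive from "16k" and "8k" in the descriptions.

```javascript
// Hypothetical shape: `contextLength` does not exist in the current Models table.
const Models = {
  chatgptApi35_16k: {
    value: 'gpt-3.5-turbo-16k',
    desc: 'ChatGPT (GPT-3.5-turbo-16k)',
    contextLength: 16000,
  },
  chatgptApi4_8k: { value: 'gpt-4', desc: 'ChatGPT (GPT-4-8k)', contextLength: 8000 },
  // No contextLength given: the lookup below falls back to the default.
  claude2WebFree: { value: '', desc: 'Claude.ai (Web)' },
}

// Assumed default, matching the current maxLength default in cropText.
const DEFAULT_CONTEXT_LENGTH = 4000

function getContextLength(modelKey) {
  return Models[modelKey]?.contextLength ?? DEFAULT_CONTEXT_LENGTH
}

console.log(getContextLength('chatgptApi35_16k'))
console.log(getContextLength('claude2WebFree'))
```

An explicit field would avoid parsing the human-readable description, so renaming a model in the UI could not change the cropping budget by accident.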

@wassname
Author

Side note: I think there are two separate bugs? There's the one where it uses the subtitles instead of the transcript (I have a small example of it above; the length AND contents are different), and the one you are investigating, where it wrongly clips the transcript.

@Mohamed3nan
Contributor

Side note: I think there are two separate bugs? There's the one where it uses the subtitles instead of the transcript (I have a small example of it above; the length AND contents are different), and the one you are investigating, where it wrongly clips the transcript.

The function cropText is cropping the transcript based on the model's context length. From what I understand, if the transcript is too long, it takes a portion from the start and a portion from the end, then performs some calculations to find a balance in between. This is likely why the transcript feels short.
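The "balance in between" can be illustrated with a simplified sketch of the middle-section loop from cropText. The `decimate` helper is a hypothetical extraction of that loop, and the `cropStep` of 3 is chosen only for illustration; the real value depends on the token counts. The input fragments are taken from the transcript excerpt quoted in this thread.

```javascript
// Hypothetical extraction of the currentStep / cropStep logic in cropText.
// Fragments are skipped until the step counter reaches cropStep, then one is kept.
function decimate(fragments, cropStep) {
  const kept = []
  let currentStep = 0
  for (const fragment of fragments) {
    if (currentStep < cropStep) {
      currentStep++ // skip this fragment
    } else {
      kept.push(fragment)
      currentStep -= cropStep
    }
  }
  return kept
}

const excerpt =
  "it is now a matter of public,record that under pompeo's explicit," +
  'Direction the CIA Drew up plans to,kidnap and to assassinate me within the,' +
  'Ecuadorian Embassy in London and,authorized going after my European,' +
  'colleagues subjecting us to theft,hacking attacks and the planting of,' +
  'false information,my wife and my infant son were also,' +
  'targeted a CIA asset was permanently,assigned to track my wife and,' +
  "instructions were given to obtain DNA,from my six-month-old son's,nappy"

// With an assumed cropStep of 3, only every fourth fragment survives.
console.log(decimate(excerpt.split(','), 3).join(','))
```

Because fragments are dropped by position rather than by meaning, the surviving pieces no longer form sentences.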

@Mohamed3nan
Contributor

By the way, if you want a temporary fix for YouTube only, without considering the consequences, you can simply return the prompt here without the cropText call:

return await cropText(
  `Provide a structured summary of the following video in markdown format, focusing on key takeaways and crucial information, and ensuring to include the video title. The summary should be easy to read and concise, yet comprehensive.` +
    `The video title is "${title}". The subtitle content is as follows:\n${subtitleContent}`,
)
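A minimal sketch of that temporary workaround, written as a stand-alone function so it can be run in isolation. The name `buildYoutubePrompt` is hypothetical (in the extension the string is returned inline), and skipping the crop means a long transcript may exceed the model's context window.

```javascript
// Hypothetical stand-alone version of the prompt builder with cropText removed.
// `title` and `subtitleContent` are the same values used in the snippet above.
function buildYoutubePrompt(title, subtitleContent) {
  return (
    `Provide a structured summary of the following video in markdown format, focusing on key takeaways and crucial information, and ensuring to include the video title. The summary should be easy to read and concise, yet comprehensive.` +
    `The video title is "${title}". The subtitle content is as follows:\n${subtitleContent}`
  )
}

console.log(buildYoutubePrompt('Example title', 'first fragment,second fragment'))
```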

@wassname
Author

wassname commented Oct 14, 2024

Hmm let me look in the debugger... yes I see what you mean

An extract from subtitleContent in the debugger, before cropText:

it is now a matter of public,record that under pompeo's explicit,Direction the CIA Drew up plans to,kidnap and to assassinate me within the,Ecuadorian Embassy in London and,authorized going after my European,colleagues subjecting us to theft,hacking attacks and the planting of,false information,my wife and my infant son were also,targeted a CIA asset was permanently,assigned to track my wife and,instructions were given to obtain DNA,from my six-month-old son's,nappy

And croppedText (after applying cropText):

it is now a matter of public,kidnap and to assassinate me within the,hacking attacks and the planting of,assigned to track my wife and,nappy

If anyone would like to reproduce this, here are the full arguments to cropText:

  • maxLength: 2900
  • startLength: 400
  • endLength: 300
  • tiktoken: true
  • model: Claude web (free)

maxResponseTokenLength: 1000
desiredLength: 2893
content: Provide a structured summary of the following video in markdown format,focusing on key takeaways and crucial information,and ensuring to include the video title. The summary should be easy to read and concise,yet comprehensive.The video title is "Assange's 1st public statement after he was released from prison: 'Pled guilty to journalism'". The subtitle content is as follows:
Mr,chairman esteemed members of the,Parliamentary assembly of the Council of,Europe ladies and,gentlemen the transition from years of,confinement in a Maximum Security Prison,to being here before the representatives,of 46 Nations and 700 million people is,a profound and a surreal,shift the experience of isolation for,years in a small,cell is difficult to convey it strips,away one's sense of self leaving only,the raw essence of,existence I am yet not fully equipped to,speak about what I have,endured the Relentless struggle to stay,alive both physically and,mentally nor can I speak yet about the,deaths by hanging murder and medical,neglect of my fellow prisoners,I apologize in advance if my words,falter or if my presentation lacks the,Polish you might expect from such a,distinguished,Forum isolation has taken its,toll which I am trying to unwind and,expressing myself in the setting is a,challenge however the gravity of this,occasion and the weight of the issues at,hand compel me to set aside my,reservations and speak to you,directly I have traveled a long way,literally and figuratively to be before,you,today before our discussion or answering,any questions you might have I wish to,thank pace for its 2020,resolution which stated that my,imprisonment set a dangerous precedent,for journalists and noticed that the UN,special reporter on torture called for,my release,statement expressing concern over,assassination again calling for my,commissioning a renowned rapur Suna I,and conviction and the consequent,case whether they were from,diplomats unions legal and medical,resolutions reports films articles,because without them I never would have,needed because the legal protections of,time I eventually chose Freedom over,sentence with no effective,agreement that I cannot file a case at,over what it did to me as a result of,system worked I am free today after,information from a source I plead guilty,public what that information,weakness the weaknesses of the existing,vulnerable 
as I emerge from the dungeon,period how expressing the truth has been,truth and more,prosecution of me it's Crossing Crossing,exists,people about how the world works so that,us understand where we might,there is,Horrors about programs of assassination,policies the agreements and the,the infamous gun camera footage of a US,Warfare shocked the world but we also,when the US military could deploy lethal,40 Years of my potential 175e sentence,the world's Dirty Wars and secret,change get these fundamentals right and,rest,principles that this assembly stands,home in,and in iand our journalistic and,based in France in Germany and in,class Manning a US intelligence analyst,launched an investigation against me and,bribes to an Informer to steal our legal,and financial services to block our,retribution it admitted at the European,time ultimately this harassment was,department chose not to indict me,publishing or obtaining government,ominous reinterpretation of the US,one of my,dramatically president Trump had been,arms industry executive as CIA director,exposed the cia's infiltration of French,leaders it's spying on the European,whole we revealed the cia's vast,chains its subversion of antivirus,Retribution it is now a matter of public,kidnap and to assassinate me within the,hacking attacks and the planting of,assigned to track my wife and,nappy this is the,additionally corroborated by record,involved the cia's targeting of myself,means provides a rare insight into how,unique what is,and to to judicial investigations in,groundbreaking report on CIA Renditions,unlawful Renditions on European soil,former CIA officer Joshua Schulte was,isolation his windows are blacked out,are more severe than those found in,processes the lack of effective,Mutual legal assistance and Expedition,in my prison cell the former CIA,expedition case against me in response,reopened the investigation against me,witness Manning was,000 a day in a formal attempt to,we usually think of,sources but Manning 
was now a source,and the US government issued a warrant,warrant secret from the public for two,Diplomatic grounds for my,stand a chance unless there are strong,them without this no individual has a,deploy if the situation,asserted a danger Dangerous new Global,rights Europeans and other nationalities,where they,far as the US government is concerned An,do so is a crime with no defense and he,forly asserted that Europeans have no,inevitably follow,Russia but based on the precedent set in,European journalists Publishers or even,Publishers within the European space are,cannot become the norm here as one of,Gathering activities is a threat to,power for asking for receiving and,journalists should not be prosecuted for,[Applause],freedom to speak and the freedom to,happened in my case never happens to,gratitude to this,supported me throughout this ordeal and,tirelessly tirelessly for my,commitment to the protection of,Crossroad I fear that unless,late let us all commit to doing our part,are not silenced by the interests of the,thank you Mr,members of the,of the committee of legal Affairs and,Mr Assange to reply to each of the,questions asked which you feel it would,be,with us for for sharing your testimony,Human Rights um get got any final,I would like to know whether you have,system that works correctly and properly,Security Prison and facing 175e,Human,States that would Release Me from prison,that I not be allowed to take a case in,proceedings nor that,done there will never be a hearing,that uh Pace,journalists here to protect themselves,inevitably be abused by other,to make the situation,NOA uh Mr Assange it is great to have,past is still the um uh the,ask you whether you believe that our,be accepted at the plenary session that,whistleblowers and the uh right to,uh um and your visit here to Parliament,situation in in this regard of the the,I'm not asking very much about your,that I I am here because I believe is an,rolling uh to address the problems of,Security 
journalism is possible within,the Big Wide World outside of house,adjustment um it's not simply the spooky,the change in,the where we once produced,debate,war in Gaza,impunity seems to,it my readaptation,children who have grown up without,are,very,Assange uh could I call upon Mr Klein,your experience does political Asylum,abusers within,it provides a mechanism where,have been hounded,what in the final analysis controls its,history of states that made it difficult,collapse there must,live and to,in my,to there is a big gap in the Asylum,people who are not fleeing their own,the,persecution by the United,China I was not able to apply for Asylum,particular political angle it might have,United States in the UK but there wasn't,Convention as it's implemented in most,would you like to ask a question please,uh Pace did in the last four years um,2020 with your father John chipton uh,assembly made a clear position calling,because it failed in other internet AAL,in the OS but none of those could uh um,sorry perhaps not enough geopolitical,question um,shocking for me that there is a law in,laws that can be applied to other,counter,and the US uh appealed,discrimination that is in the UK,someone on the basis of the nationality,system,Prevail however there is nothing in the,extradition so this is a small,act but it's not clear that it exists in,that you are here uh Bas has done work,this case in your case uh there are,those,prisoner the first part of your question,yes I was a political prisoner,relation to publishing the truth about,the US,offense uh in relation to the,against,small,uh having a ominous feeling and,security contractors that this CIA had,emerged it is a interesting example,targeted an investigative organization,Spain and in particular work done by us,might well now be themselves,of more than 30 current or former US,processes a criminal case in Spain with,embassy lawyers,response to that civil,privilege,defense but that defense is,impunity uh within the 
US,Mr,differently I'm not asking just in the,Effectiveness or impact of what you try,Will,back we were often constrained,secrecy uh that was necessary to protect,extra resources of,approaches,that we,that could be turned in,um I was not from the United Kingdom I,journalist um a very good,what UK Society was about who you could,um Maneuvers that are made uh in that,uh perhaps we U could have chosen,arrest warrant issued by Sweden to what,of,purpose the European arrest warrant,of Muslim,issued was issued by Sweden for a drunk,and say well this repressive legislation,um,broadly Injustice to one person spread,how often restuants are abused,Sweden um,charge but in its Amendment uh to the,to me,to Mr Assange thank you thank you very,questions uh I am like many of my,for doing your job the job that we all,shocking to me and to many of us to see,Europe and of course this raises,you aware before um uh all this were you,journalist were protected uh in Europe,thank,for publishing documents from a number,um we understood that in theory article,Amendment to its,classified information from the United,of harassment legal process I was,Publications was such um that is okay to,possible my naivity was believing in the,reinterpreted for political,class more,changes them which is,the constituent powers of the United,state it was powerful,Constitution us con the US First,law uh,precedence relating to it um,I it got into the Supreme Court of the,what the makeup was of the US Supreme,14,lesson that when a major power,Department of Justice do that,legal um that's,seek seeks um have had their,Mr Assange thank you obviously there's,about the allegation that you were,difficulty with the European arrest,treatment of the extradition treaties,maneuver and I'd like to ask you about,serious nobody denies a b Marsh is an,that and,um nine times more people are exited to,being exed to the UK are,primery case or Reasonable Suspicion,allegation is alleged you do not even,breach human rights that's 
it,judges are compelled to exodite most,um some judges in the UK found in my,not but all,Kingdom,astonishing intellectual back,its,case uh more,narrow section of British Society from,establishment and the UK establishment's,States um whether that's in the,United Kingdom a weapons,United Kingdom's,a long period of,what is good for that cohort and what is,government Bel thank you uh I Mr Mr Lee,afteron statue F statue F statue I,lessons learned from your experience and,process the arbitrary application of a,a an initation to,I didn't quite hear you can you please,your experience and the treatment you,legal,and incitation to silence instead of,that,in some other form of,rather,get uh not Justice seeking its,ourselves in many different,a anti-s slap,against public,reverse liabilities at an early,conduct but I I,that self-interested bureaucrats,use and will expand the,war Wars such as in Gaza and currently,in Lebanon so my questions are how would,you advise a journalist to deal with,this current situation first and the,second is uh what do you think is the,role of parliamentarians in this,regard thank you Mr Assange,I'm sorry I'm getting a bit tired but,uh Kristen perhaps you want to take,the the one what's journalist do about,the well what can be done when we have,horrible stories about uh targeted,killings where we have now have evidence,of that in in in in in the wars it is,the reality of,uh reporting on Wars is more severe than,ever before and it was bad it was bad in,Iraq now it is even worse it is a horror,story it is hard to give out advice for,these journalists how they can deal with,that,situation the only thing we can call out,at least is for an outcry and and,condemnation that this should be going,on because we need information we need,this,information uh there are,no tools to uh to secure individuals in,Gaza that are being followed by drones,and uh are being targeted in Mass,bombing uh there is a little defense on,that but uh the outcry and the,condemnation 
should be there we should,not be silent when this happen thank you,[Music],[Applause],[Music],,

@wassname
Author

wassname commented Oct 14, 2024

I guess the cropping is a wider issue. Skipping arbitrary parts of sentences cannot be the ideal way to crop, since it leads to incoherent text. A better approach is to chunk the text (https://js.langchain.com/v0.1/docs/modules/data_connection/document_transformers/), perhaps keeping the beginning and end, and to tell the model that the text is incomplete so it doesn't misrepresent this to the user.
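A rough sketch of that idea, without LangChain: keep a contiguous run of fragments from the start and from the end, and put an explicit marker where text was dropped. The `cropContiguous` name, the marker wording, the even head/tail split, and the character-based `measure` default are all assumptions for illustration; the extension counts tokens with tiktoken instead.

```javascript
// Hypothetical alternative to decimation: contiguous head + marker + contiguous tail.
function cropContiguous(
  fragments,
  maxLength,
  marker = '[... part of the transcript was omitted here ...]',
  measure = (s) => s.length, // assumption: characters, not tokens
) {
  const cost = (s) => measure(s) + 1 // +1 for the joining comma
  const total = fragments.reduce((sum, f) => sum + cost(f), 0)
  if (total <= maxLength) return fragments.join(',')

  // Split what is left after the marker evenly between head and tail.
  const budget = maxLength - cost(marker)
  const headBudget = Math.floor(budget / 2)
  const tailBudget = budget - headBudget

  const head = []
  let used = 0
  let i = 0
  while (i < fragments.length && used + cost(fragments[i]) <= headBudget) {
    used += cost(fragments[i])
    head.push(fragments[i])
    i++
  }

  const tail = []
  used = 0
  let j = fragments.length - 1
  while (j >= i && used + cost(fragments[j]) <= tailBudget) {
    used += cost(fragments[j])
    tail.unshift(fragments[j])
    j--
  }

  return [...head, marker, ...tail].join(',')
}

const demo = Array.from({ length: 10 }, (_, n) => String(n).repeat(10))
console.log(cropContiguous(demo, 60, '[...]'))
```

The kept text stays readable because whole neighbouring fragments survive, and the marker gives the model a cue that the middle is missing.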

@wassname changed the title from "on youtube we use low quality subtitles not the high quality transcript" to "we use crop the transcript badly (by decimating)" on Oct 29, 2024