support facebook video url download via yt-dlp #469

Draft
wants to merge 1 commit into
base: master


devxpy
Member

@devxpy devxpy commented Sep 23, 2024

Q/A checklist

  • If you add new dependencies, did you update the lock file?
      poetry lock --no-update
  • Run tests
      ulimit -n unlimited && ./scripts/run-tests.sh
  • Do a self code review of the changes - read the diff at least twice.
  • Carefully think about the stuff that might break because of this change - this sounds obvious, but it's easy to forget to do "Go to references" on each function you're changing and see if it's used in a way you didn't expect.
  • The relevant pages still run when you press submit
  • The API for those pages still works (API tab)
  • The public API interface doesn't change if you didn't want it to (check API tab > docs page)
  • Do your UI changes (if applicable) look acceptable on mobile?
  • Ensure you have not regressed the import time unless you have a good reason to do so.
    You can visualize this using tuna:
      python3 -X importtime -c 'import server' 2> out.log && tuna out.log

To measure import time for a specific library:

$ time python -c 'import pandas'

________________________________________________________
Executed in    1.15 secs    fish           external
   usr time    2.22 secs   86.00 micros    2.22 secs
   sys time    0.72 secs  613.00 micros    0.72 secs

To reduce import times, import libraries that take a long time inside the functions that use them instead of at the top of the file:

def my_function():
    import pandas as pd
    ...

Legal Boilerplate

Look, I get it. The entity doing business as “Gooey.AI” and/or “Dara.network” was incorporated in the State of Delaware in 2020 as Dara Network Inc. and is gonna need some rights from me in order to utilize my contributions in this PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Dara Network Inc can use, modify, copy, and redistribute my contributions, under its choice of terms.

@devxpy
Member Author

devxpy commented Sep 23, 2024

Doesn't work yet because fb doesn't do --format bestaudio
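One possible direction, sketched under an assumption the PR does not confirm: if the failure is that Facebook exposes no audio-only stream, yt-dlp's format selector accepts `/`-separated fallbacks, so `bestaudio/best` would fall back to the best combined format when no audio-only one exists. The helper name `build_yt_dlp_audio_cmd` and the output template are hypothetical and not part of this PR; the sketch only builds the command line and does not run it.

```python
def build_yt_dlp_audio_cmd(url: str, output_template: str = "%(id)s.%(ext)s") -> list:
    """Build a yt-dlp command line that prefers audio-only formats.

    Hypothetical helper, not from this PR. "bestaudio/best" asks yt-dlp for
    the best audio-only format and, if none is offered, the best format overall.
    """
    return [
        "yt-dlp",
        "--format", "bestaudio/best",  # fallback chain: audio-only, else best available
        "--output", output_template,   # yt-dlp output filename template
        url,
    ]
```

The resulting list can be passed to `subprocess.run` as-is. The audio track would still need to be extracted from a combined file downstream.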

def is_yt_dlp_able_url(url: str) -> bool:
    f = furl(url)
    return (
        "youtube.com" in f.origin

Check failure

Code scanning / CodeQL

Incomplete URL substring sanitization High

The string "youtube.com" may be at an arbitrary position in the sanitized URL.

Copilot Autofix AI 7 days ago

To fix the problem, we need to ensure that the URL's host is exactly "youtube.com" or a valid subdomain of "youtube.com". This can be achieved by parsing the URL and checking the hostname directly. We will use the urlparse function from the urllib.parse module to extract the hostname and then perform the necessary checks.

  1. Import the urlparse function from the urllib.parse module.
  2. Replace the substring checks with hostname checks using urlparse.
daras_ai_v2/vector_search.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/daras_ai_v2/vector_search.py b/daras_ai_v2/vector_search.py
--- a/daras_ai_v2/vector_search.py
+++ b/daras_ai_v2/vector_search.py
@@ -20,2 +20,3 @@
 from furl import furl
+from urllib.parse import urlparse
 from loguru import logger
@@ -743,13 +744,14 @@
 def is_yt_dlp_able_url(url: str) -> bool:
-    f = furl(url)
+    parsed_url = urlparse(url)
+    hostname = parsed_url.hostname
     return (
-        "youtube.com" in f.origin
-        or "youtu.be" in f.origin
-        or "fb.watch" in f.origin
+        hostname == "youtube.com"
+        or hostname == "youtu.be"
+        or hostname == "fb.watch"
         or (
-            ("facebook.com" in f.origin or "fb.com" in f.origin)
+            (hostname == "facebook.com" or hostname == "fb.com")
             and (
-                "videos" in f.path.segments
-                or "/share/v/" in f.pathstr
-                or "v" in f.query.params
+                "videos" in parsed_url.path
+                or "/share/v/" in parsed_url.path
+                or "v" in parsed_url.query
             )
EOF
Copilot is powered by AI and may make mistakes. Always verify output.
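A note on the patch above: the description asks for the host to be "youtube.com" or a valid subdomain, but the patch compares with `==`, which would reject `www.youtube.com`. A minimal sketch of a check that accepts both, using only the standard library; the helper name `host_matches` and the `ALLOWED_DOMAINS` tuple are illustrative and not taken from the repository:

```python
from urllib.parse import urlparse

# Domains copied from the patch above; the tuple name is hypothetical.
ALLOWED_DOMAINS = ("youtube.com", "youtu.be", "fb.watch")


def host_matches(url: str, domains=ALLOWED_DOMAINS) -> bool:
    """Return True if the URL's host is one of `domains` or a subdomain of one."""
    hostname = urlparse(url).hostname  # lowercased by urlparse; None when absent
    if not hostname:
        return False
    # Anchor the suffix on a dot so "notyoutube.com" does not pass.
    return any(hostname == d or hostname.endswith("." + d) for d in domains)
```

With this, `https://www.youtube.com/watch?v=abc` is accepted, while `https://youtube.com.evil.example/watch` (which passes a plain substring test) and `https://notyoutube.com/watch` (which would pass a bare `endswith("youtube.com")`) are rejected.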
        or "youtu.be" in f.origin
        or "fb.watch" in f.origin
        or (
            ("facebook.com" in f.origin or "fb.com" in f.origin)

Check failure

Code scanning / CodeQL

Incomplete URL substring sanitization High

The string "facebook.com" may be at an arbitrary position in the sanitized URL.

Copilot Autofix AI 7 days ago

To fix the problem, we need to parse the URL and check the host value to ensure it matches the allowed domains correctly. This involves using the urlparse function from the urllib.parse module to extract the hostname and then performing the check. This approach ensures that the check is not bypassed by embedding the allowed host in an unexpected location within the URL.

daras_ai_v2/vector_search.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/daras_ai_v2/vector_search.py b/daras_ai_v2/vector_search.py
--- a/daras_ai_v2/vector_search.py
+++ b/daras_ai_v2/vector_search.py
@@ -743,13 +743,17 @@
 def is_yt_dlp_able_url(url: str) -> bool:
-    f = furl(url)
+    from urllib.parse import urlparse
+    parsed_url = urlparse(url)
+    host = parsed_url.hostname
     return (
-        "youtube.com" in f.origin
-        or "youtu.be" in f.origin
-        or "fb.watch" in f.origin
-        or (
-            ("facebook.com" in f.origin or "fb.com" in f.origin)
-            and (
-                "videos" in f.path.segments
-                or "/share/v/" in f.pathstr
-                or "v" in f.query.params
+        host and (
+            host.endswith("youtube.com")
+            or host == "youtu.be"
+            or host == "fb.watch"
+            or (
+                (host.endswith("facebook.com") or host == "fb.com")
+                and (
+                    "videos" in parsed_url.path
+                    or "/share/v/" in parsed_url.path
+                    or "v" in parsed_url.query
+                )
             )
EOF
Copilot is powered by AI and may make mistakes. Always verify output.
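One behavioural difference worth checking in the patch above: the original code tests membership in `f.path.segments` and `f.query.params`, whereas the patch tests for substrings of the raw path and query strings, so `"v" in parsed_url.query` is true for any query containing the letter "v". A rough standard-library sketch that keeps the segment and parameter semantics; the function name `looks_like_fb_video` is hypothetical, and unlike furl this does not percent-decode path segments:

```python
from urllib.parse import parse_qs, urlparse


def looks_like_fb_video(url: str) -> bool:
    """Path/query part of the check only; the host must be validated separately."""
    parsed = urlparse(url)
    # Whole path segments, e.g. "/page/videos/123/" -> ["page", "videos", "123"]
    segments = [s for s in parsed.path.split("/") if s]
    # Query parameter names mapped to value lists, e.g. "v=1" -> {"v": ["1"]}
    params = parse_qs(parsed.query, keep_blank_values=True)
    return (
        "videos" in segments
        or "/share/v/" in parsed.path
        or "v" in params
    )
```

For example, `https://www.facebook.com/profile.php?view=1` and `https://www.facebook.com/myvideos-page` would both satisfy the substring tests in the patch, but neither is matched here.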
        or "youtu.be" in f.origin
        or "fb.watch" in f.origin
        or (
            ("facebook.com" in f.origin or "fb.com" in f.origin)

Check failure

Code scanning / CodeQL

Incomplete URL substring sanitization High

The string "fb.com" may be at an arbitrary position in the sanitized URL.

Copilot Autofix AI 7 days ago

To fix the problem, we need to ensure that the URL's hostname is properly checked against the allowed hosts. Instead of using a substring match, we should parse the URL and check the hostname directly. This can be done using the urlparse function from the urllib.parse module.

  1. Parse the URL using urlparse.
  2. Extract the hostname from the parsed URL.
  3. Check if the hostname matches any of the allowed hosts.
daras_ai_v2/vector_search.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/daras_ai_v2/vector_search.py b/daras_ai_v2/vector_search.py
--- a/daras_ai_v2/vector_search.py
+++ b/daras_ai_v2/vector_search.py
@@ -743,13 +743,13 @@
 def is_yt_dlp_able_url(url: str) -> bool:
-    f = furl(url)
+    from urllib.parse import urlparse
+    parsed_url = urlparse(url)
+    hostname = parsed_url.hostname
     return (
-        "youtube.com" in f.origin
-        or "youtu.be" in f.origin
-        or "fb.watch" in f.origin
+        hostname in ["youtube.com", "youtu.be", "fb.watch"]
         or (
-            ("facebook.com" in f.origin or "fb.com" in f.origin)
+            hostname in ["facebook.com", "fb.com"]
             and (
-                "videos" in f.path.segments
-                or "/share/v/" in f.pathstr
-                or "v" in f.query.params
+                "videos" in parsed_url.path
+                or "/share/v/" in parsed_url.path
+                or "v" in parsed_url.query
             )
EOF
Copilot is powered by AI and may make mistakes. Always verify output.
@devxpy devxpy assigned devxpy and unassigned devxpy Sep 23, 2024