TakeScreenshot cannot take a full-page screenshot (beyond viewport) #322

Open
andynuss opened this issue Aug 4, 2021 · 15 comments

@andynuss

andynuss commented Aug 4, 2021

The first issue is that after using this snippet to create a use-once agent:

const myagent = new Agent({
  userAgent: ua,
  viewport: {
    screenHeight: 1024,
    screenWidth: 768,
    height: 1024,
    width: 768,
  }
});

and then when done scraping calling:

await myagent.close();

it works the first time after I start the node service that runs this function, but
every subsequent call to the same function gives this error in the node console:

2021-08-04T18:32:42.557Z ERROR [/Users/andy/repos/test-repo/app/node_modules/@secret-agent/client/connections/ConnectionFactory] Error connecting to core {
  error: 'Error: connect ECONNREFUSED 127.0.0.1:63738',
  context: {},
  sessionId: null,
  sessionName: undefined
} Error: connect ECONNREFUSED 127.0.0.1:63738
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1146:16) {
  errno: -61,
  code: 'ECONNREFUSED',
  syscall: 'connect',
  address: '127.0.0.1',
  port: 63738
}

The second surprise is the screenshot itself, taken with:

const scrollHeight: number = await myagent.executeJs(() => {
  return document.scrollingElement.scrollHeight;
});

let buffer: Buffer;
buffer = await myagent.takeScreenshot({
  format: 'png',
  rectangle: {
    scale: 1,
    height: scrollHeight,
    width: 1024,
    x: 0,
    y: 0,
  }
});

I used this url: https://www.whatsmyua.info

The visible text in the screenshot is not centered as one would expect for the page I used, but is more or less
left-justified, and a large portion of the page is clipped even though I used the scrollHeight, which I checked
had not grown after taking the screenshot.


The third problem is that if I call takeScreenshot this way, it fails with an error, even though TypeScript tells me rectangle is optional:

buffer = await myagent.takeScreenshot({
  format: 'png',
});

Hope I didn't do something stupid!

@blakebyrnes
Contributor

blakebyrnes commented Aug 4, 2021

Thanks for reporting.

Can you share any more code or logs from 1? Secret Agent tracks everything in session databases for each "agent session" (https://secretagent.dev/docs/advanced/session)

For 2, can you include your screenshot that got generated?

For 3, I think I broke that while trying to fix a different issue. Thanks for catching it.

@andynuss
Author

andynuss commented Aug 4, 2021

On 1, I found that for some reason this happens when I create an HTTP server that calls my test scraping function, but not when I call the function several times in a row in the same Node.js "thread".

So here is my test function's TypeScript file:

/* eslint-disable no-console */
import { Agent } from 'secret-agent';
import ExecuteJsPlugin from '@secret-agent/execute-js-plugin';
import * as fs from 'fs';

const ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_16_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.165 Safari/537.36';

export async function testUrl(requestUrl: string, imageName: string): Promise<void> {
  const myagent = new Agent({
    userAgent: ua,
    viewport: {
      screenHeight: 1024,
      screenWidth: 768,
      height: 1024,
      width: 768,
    }
  });

  try {
    myagent.use(ExecuteJsPlugin);
    await myagent.goto(requestUrl);
    await myagent.waitForPaintingStable();

    const getScrollHeight = async (): Promise<number> => {
      // eslint-disable-next-line @typescript-eslint/ban-ts-comment
      // @ts-ignore
      const scrollHeight: number = await myagent.executeJs(() => {
        // eslint-disable-next-line @typescript-eslint/ban-ts-comment
        // @ts-ignore
        return document.scrollingElement.scrollHeight;
      });
      console.log('scrollHeight', scrollHeight, 'for', requestUrl);
      return scrollHeight;
    };

    const takeScreenshot = async (scrollHeight: number): Promise<void> => {
      const buffer: Buffer = await myagent.takeScreenshot({
        format: 'png',
        rectangle: {
          scale: 1,
          height: Math.max(scrollHeight, 768),
          width: 1024,
          x: 0,
          y: 0,
        }
      });
      fs.writeFileSync('../screenshots/' + imageName + '.png', buffer, 'binary');
    };

    const height1 = await getScrollHeight();
    await takeScreenshot(height1);

    const height2 = await getScrollHeight();
    if (height2 > height1) {
      console.log('oops: scrollHeight increased after taking screenshot.  Taken too soon?');
      await takeScreenshot(height2);
    }

    console.log(requestUrl + ' ok');
  } finally {
    try {
      await myagent.close();
    } catch (e) {
      console.log('unexpected error closing new Agent', e);
    }
  }
}

// (async() => {
//   await testUrl('https://example.org', 'example');
//   await testUrl('https://www.whatsmyua.info', 'myua');
// })();

@andynuss
Author

andynuss commented Aug 4, 2021

And here is my HTTP server, written in JavaScript, which calls the compiled TypeScript function above:

/* eslint-disable prefer-template */
/* eslint-disable no-console */
const _http = require('http');
const { testUrl } = require('./test');

function ProcessExit(num) {
  process.exit(num);
}

function TextResponse(res, txt) {
  res.writeHead(200, {
    'Content-Type': 'text/plain'
  });
  res.write(txt);
  res.end();
}

function StartServer() {
  console.log('listening for scrape requests');

  const server = _http.createServer((req, res) => {
    let data = '';
    req.on('data', (chunk) => {
      data += chunk;
    });
    req.on('end', () => {
      let json;
      let err;
      try {
        json = JSON.parse(data);
      } catch (e) {
        err = e;
      }
      if (err) {
        console.log('could not parse json request:', err);
        TextResponse(res, 'could not parse json request: ' + err);
      } else if (!json) {
        TextResponse(res, 'falsy json request: ' + json);
      } else if (typeof json.requestUrl !== 'string') {
        TextResponse(res, 'json.requestUrl invalid not a string');
      } else if (typeof json.imageName !== 'string') {
        TextResponse(res, 'json.imageName invalid not a string');
      } else {
        (async() => {
          let err2;
          try {
            await testUrl(json.requestUrl, json.imageName);
          } catch (e) {
            err2 = e;
          }
          if (err2) {
            console.log('testUrl failed:', err2);
            TextResponse(res, 'testUrl failed: ' + err2);
          } else {
            TextResponse(res, 'created image in server: ' + json.imageName);
          }
        })();
      }
    });
  });

  server.setTimeout(0);
  server.listen(8888);
}

(async() => {
  try {
    StartServer();
  } catch (e) {
    console.log(e);
    if (e.stack)
      console.log('' + e.stack);
    ProcessExit(1);
  }
})();

@andynuss
Author

andynuss commented Aug 4, 2021

And here's how I invoked it from Java (by running this standalone Java file a second time while the node service is running):

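import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.HashMap;

// LiteString, Common, StrictJsonEncoder, ExtStream, LengthInputStream, InputStreams
// and HttpStatusCodeException are not JDK classes; they are helpers not shown in this post.
public class SecretAgentProxy {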
  
  private static void Test () throws IOException
  {
    String serverUrl = "http://localhost:8888";
    String requestUrl = "https://www.whatsmyua.info";
    String imageName = "my-image-" + (Common.randomInt(100) + 1);
    HashMap<LiteString, LiteString> json = new HashMap<>(2);
    json.put(LiteString.cons("requestUrl"), LiteString.cons(requestUrl));
    json.put(LiteString.cons("imageName"), LiteString.cons(imageName));
    LiteString sjson = StrictJsonEncoder.encode(json);
    
    // REFACTOR AUG: need to abstract most all of this everywhere I am using it
    // which is a TON of places, call it PostProxy
    //
    byte[] ba;
    ba = ExtStream.toArray(sjson);
    
    LengthInputStream body = null;
    try {
      HttpURLConnection conn = null;
      try {
        conn = (HttpURLConnection)(new URL(serverUrl).openConnection());
        conn.setConnectTimeout(15*1000);
        conn.setReadTimeout(45*1000);
        conn.setDoOutput(true);
        conn.setInstanceFollowRedirects(false);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Accept-Charset", "UTF-8");
        conn.setRequestProperty("Content-Type", "application/json;charset=UTF-8");
        conn.setRequestProperty("Content-Length", Integer.toString(ba.length));
        
        OutputStream os = conn.getOutputStream();
        try {
          os.write(ba);
          os.flush();
        } finally {
          os.close();
        }
        
        int status = conn.getResponseCode();
        if (status != 200)
          throw new HttpStatusCodeException(serverUrl, status);
        body = InputStreams.capture(conn.getInputStream(), true);
        
        try (InputStreamReader reader = new InputStreamReader(body, "UTF-8")) {
          System.out.println(ExtStream.readFully(reader));
        }
      } finally {
        if (conn != null)
          conn.disconnect();
      }
    } finally {
      if (body != null)
        body.close();
    }
  }
  
  public static void main (String[] args)
  {
    try {
      Test();
    } catch (Throwable t) {
      t.printStackTrace();
    }
  }
}

@andynuss
Author

andynuss commented Aug 4, 2021

I'll have to figure out how to take a session trace in a little while if you still need that.

@andynuss
Author

andynuss commented Aug 4, 2021

[screenshot attachment: my-image-67]

@andynuss
Author

andynuss commented Aug 4, 2021

NOTE: in my testUrl function above, waitForPaintingStable() didn't seem to work as well as it should for this fairly simple page. On my machine, the scrollHeight obtained right after waitForPaintingStable() was 1413, but when I read scrollHeight again after taking the screenshot and writing it to a file, it was 1971. That is why I "retake" the screenshot: to make sure this wasn't part of why the screenshot is clipped.
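
A minimal sketch of one way to guard against a scrollHeight that is still growing: poll the value until it stops changing before taking the screenshot. The helper name `waitForStableValue` and its options are hypothetical and not part of SecretAgent; the `read` callback is assumed to be something like the `getScrollHeight` function above.

```typescript
// Hypothetical helper (not a SecretAgent API): polls a numeric reading until it
// has been identical for `stableReads` consecutive reads, or `maxReads` is hit.
async function waitForStableValue(
  read: () => Promise<number>,
  options: { stableReads?: number; intervalMs?: number; maxReads?: number } = {},
): Promise<number> {
  const { stableReads = 3, intervalMs = 250, maxReads = 40 } = options;
  let last = await read();
  let unchanged = 1; // number of consecutive reads that returned `last`
  for (let reads = 1; reads < maxReads && unchanged < stableReads; reads += 1) {
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
    const current = await read();
    unchanged = current === last ? unchanged + 1 : 1;
    last = current;
  }
  // Best effort: if maxReads was reached, the value may still be changing.
  return last;
}
```

Inside testUrl this could be used as `const height = await waitForStableValue(getScrollHeight);`. It only reduces the chance of measuring too early; it cannot know whether the page will grow again later.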

@andynuss
Author

andynuss commented Aug 5, 2021

Here's the session db:

sessions.db.zip

@blakebyrnes
Contributor

I think what's happening is that you are using the default "full-client" SecretAgent "connection", which is built for single-use scrapes, and you're triggering its auto-shutdown when you call close the first time (think about booting up a script and then wanting the whole thing to tear down when you close). I think you'll get more reliable behavior by spinning up a CoreServer and then pointing your agents at that persistent server (SecretAgent already comes with a client/server setup - https://secretagent.dev/docs/advanced/remote). You can run the CoreServer in the same process as your existing server if you want; it doesn't have to be a separate process.

Regarding paintingStable - that event is specifically geared around the page being visible above the fold, not "all content loaded". You can add a "domContentLoaded" trigger to wait for the page to be fully "loaded" as well.

With your screenshot, it seems like your viewport width & height are mismatched in your screenshot rectangle. Could that be why it's showing up with a strange shape? I guess it doesn't explain the x/y..
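
A small sketch of how the rectangle could be derived from the viewport so the two cannot drift apart. The function name `fullPageRectangle` and the two interfaces are hypothetical; the field names (x, y, width, height, scale) are the ones used in the takeScreenshot calls in this thread.

```typescript
// Hypothetical types mirroring the option shapes used in this thread.
interface Viewport {
  width: number;
  height: number;
}

interface ScreenshotRectangle {
  x: number;
  y: number;
  width: number;
  height: number;
  scale: number;
}

// Builds a rectangle as wide as the viewport and as tall as the page,
// never shorter than the viewport itself.
function fullPageRectangle(viewport: Viewport, scrollHeight: number): ScreenshotRectangle {
  return {
    x: 0,
    y: 0,
    width: viewport.width,
    height: Math.max(Math.ceil(scrollHeight), viewport.height),
    scale: 1,
  };
}
```

With a single `viewport` object passed both to `new Agent({ viewport })` and to this function, a swapped width/height shows up in one place only. This only keeps the numbers consistent; it does not by itself make content below the viewport render.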

@andynuss
Author

andynuss commented Aug 6, 2021

Thanks for the help. You were right that I had switched the viewport's width and height. However, after fixing my code with everything you mentioned above, the screenshot is still clipped, even though the height specified in takeScreenshot is always the full scrollHeight of 1971 for this url.

I don't see anything else that could explain the clipping. In fact, it now appears that although the scrollHeight is 1971, and the screenshot image height is indeed 1971 and includes the proper background for the full 1971, the text content inside the DOM is somehow being clipped to the viewport's height of 768. Is this possible?

(Here's the fixed code)

/* eslint-disable no-console */
import { Agent, ConnectionFactory, ConnectionToCore, LocationStatus } from 'secret-agent';
import ExecuteJsPlugin from '@secret-agent/execute-js-plugin';
import * as fs from 'fs';

const ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_16_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.165 Safari/537.36';

let sharedConnection: ConnectionToCore = null;

function getConnection (): ConnectionToCore {
  if (sharedConnection !== null) return sharedConnection;
  sharedConnection = ConnectionFactory.createConnection({
    maxConcurrency: 4,
  });
  return sharedConnection;
}

export async function testUrl(requestUrl: string, imageName: string): Promise<void> {
  const myagent = new Agent({
    userAgent: ua,
    viewport: {
      screenHeight: 768,
      screenWidth: 1024,
      height: 768,
      width: 1024,
    },
    connectionToCore: getConnection(),
  });

  try {
    myagent.use(ExecuteJsPlugin);
    await myagent.goto(requestUrl);
    await myagent.waitForPaintingStable();
    await myagent.activeTab.waitForLoad(LocationStatus.DomContentLoaded);

    const getScrollHeight = async (): Promise<number> => {
      // @ts-ignore
      const scrollHeight: number = await myagent.executeJs(() => {
        // @ts-ignore
        return document.scrollingElement.scrollHeight;
      });
      console.log('scrollHeight', scrollHeight, 'for', requestUrl);
      return scrollHeight;
    };

    const takeScreenshot = async (scrollHeight: number): Promise<void> => {
      const buffer: Buffer = await myagent.takeScreenshot({
        format: 'png',
        rectangle: {
          scale: 1,
          height: Math.max(scrollHeight, 768),
          width: 1024,
          x: 0,
          y: 0,
        }
      });
      fs.writeFileSync('../screenshots/' + imageName + '.png', buffer, 'binary');
    };

    const height1 = await getScrollHeight();
    await takeScreenshot(height1);
    console.log(requestUrl + ' ok');
  } finally {
    try {
      await myagent.close();
    } catch (e) {
      console.log('unexpected error closing new Agent', e);
    }
  }
}

@blakebyrnes
Contributor

Can you see if the latest version helps your screenshot issue if you provide no rectangle?

@blakebyrnes
Contributor

Scratch that. I see it happening. No need to try it

@blakebyrnes
Contributor

NOTE for implementation: it looks like in Chromium you have to change the visualViewport to take a full-page screenshot and then restore it. We need to work out what this means from a detection perspective.

@blakebyrnes blakebyrnes changed the title from "Some unexpected problems trying to scrape a screenshot" to "TakeScreenshot cannot take a full-page screenshot (beyond viewport)" on Aug 23, 2021

@andynuss
Author

andynuss commented Oct 2, 2021

Hi, I was wondering if this has turned out to be difficult to fix from the standpoint of bot detection, since I noticed that the behavior is still the same as of the latest version. What exactly would be the detection exposure if a quick-and-dirty fix were to be done? Is it possible that you could point us to an easy approach and we could take the risk of detection ourselves in some kind of plugin?

@blakebyrnes
Contributor

@andynuss - I just haven't gotten to this. There's a lot of stuff on the plate to do, and this one just hasn't made it to the top of the priorities yet. You could give a plugin a try or a PR - I think for a plugin, you'd just want to be able to set the page to the full length of the page (here's how puppeteer does that: https://github.com/puppeteer/puppeteer/blob/327282e0475b1b680471cce6b9e74ecc14fd6536/src/common/Page.ts#L2664)
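
A rough sketch of the "set the page to its full length, capture, then restore" idea, written against the Chrome DevTools Protocol methods Page.getLayoutMetrics, Emulation.setDeviceMetricsOverride, Page.captureScreenshot and Emulation.clearDeviceMetricsOverride. The `CdpSession` interface and the `captureFullPage` name are hypothetical; how a SecretAgent plugin would obtain such a session is an assumption and is not shown here.

```typescript
// Hypothetical minimal shape of a DevTools-protocol session.
interface CdpSession {
  send(method: string, params?: Record<string, unknown>): Promise<any>;
}

// Temporarily resizes the emulated device to the full content size, captures a
// PNG, and always clears the override afterwards. Returns base64-encoded data.
async function captureFullPage(session: CdpSession): Promise<string> {
  // contentSize describes the full scrollable page, not just the viewport.
  const { contentSize } = await session.send('Page.getLayoutMetrics');
  const width = Math.ceil(contentSize.width);
  const height = Math.ceil(contentSize.height);

  await session.send('Emulation.setDeviceMetricsOverride', {
    width,
    height,
    deviceScaleFactor: 1,
    mobile: false,
  });
  try {
    const { data } = await session.send('Page.captureScreenshot', { format: 'png' });
    return data;
  } finally {
    // Assumption: no other override was active. If the agent already emulates a
    // viewport, that original override would need to be re-applied here instead.
    await session.send('Emulation.clearDeviceMetricsOverride');
  }
}
```

The result could be written to disk with `fs.writeFileSync(path, Buffer.from(data, 'base64'))`. Note that the page can observe the temporary resize, which is the detection exposure raised earlier in this thread.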
