Failing on LinkedIn authwall
When I try to parse LinkedIn, the run fails because the browser ends up on the /authwall page.
Here is the TOML snippet:
[linkedin-linkedin]
title = "Linkedin Test"
url = "https://www.linkedin.com/company/linkedin"
entrySelector = "ul.updates__list li"
titleSelector = "a[data-tracking-control-name='organization_guest_main-feed-card_feed-actor-name']"
linkSelector = "a.main-feed-card__overlay-link"
waitForSelector = "ul.updates__list"
And this is what the GitLab CI job outputs when I pass DEBUG=info:
page.waitForSelector: Timeout 30000ms exceeded.
Call log:
- waiting for locator('ul.updates__list') to be visible
- waiting for "https://www.linkedin.com/authwall?trk=bf&trkInfo=AQESgmzRyuBaPQAAAZWUpE8YiQa_HfFLN-uDfAroAVj6SfYqpBqq0QA69rX0WlPmj-rSWFlyd9KL2l2Qz5yPgLtrff0ED4k7mOdmE5Bz4Vu5Uiep3p6Jj8Vd5AMruzAPEwsh6B8=&original_refe…" navigation to finish...
- navigated to "https://www.linkedin.com/authwall?trk=bf&trkInfo=AQESgmzRyuBaPQAAAZWUpE8YiQa_HfFLN-uDfAroAVj6SfYqpBqq0QA69rX0WlPmj-rSWFlyd9KL2l2Qz5yPgLtrff0ED4k7mOdmE5Bz4Vu5Uiep3p6Jj8Vd5AMruzAPEwsh6B8=&original_refe…"
at fetchPageEntries (/root/.npm/_npx/fa12bb8499fb4525/node_modules/feed-me-up-scotty/dist/run.js:128:20)
at async file:///root/.npm/_npx/fa12bb8499fb4525/node_modules/feed-me-up-scotty/dist/run.js:80:33
at async fetchFeedData (/root/.npm/_npx/fa12bb8499fb4525/node_modules/feed-me-up-scotty/dist/run.js:78:25)
at async file:///root/.npm/_npx/fa12bb8499fb4525/node_modules/feed-me-up-scotty/dist/run.js:12:30
at async run (/root/.npm/_npx/fa12bb8499fb4525/node_modules/feed-me-up-scotty/dist/run.js:10:23) {
name: 'TimeoutError'
}
I did observe that when I use HTTPie in my local terminal to run http https://www.linkedin.com/company/linkedin, it returns the HTML of the page I actually requested.
However, when I do the same with curl https://www.linkedin.com/company/linkedin, it returns a page with some JavaScript that does window.location.href = "https://" + domain + "/authwall?trk=….
Why is this happening? It seems that the internal Firefox is being redirected the same way curl is, while HTTPie somehow gets around it. Is this related to the HTTP User-Agent header? If so, can the User-Agent be overridden?
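To narrow down whether the User-Agent is really the deciding factor, something like the following could be run locally. It is only a rough sketch using the Python standard library: the two User-Agent strings are my guesses at what HTTPie and curl send by default (the version numbers are made up), and the helper names looks_like_authwall, build_request and fetch are hypothetical. It also assumes the redirect page can be recognised simply by the "/authwall" string in the body, as in the curl output above.

```python
import urllib.error
import urllib.request

# Assumption: the redirect page can be spotted by this substring,
# as seen in the JavaScript that curl received.
AUTHWALL_MARKER = "/authwall"


def looks_like_authwall(html: str) -> bool:
    """Return True if the body appears to contain the authwall redirect."""
    return AUTHWALL_MARKER in html


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build a GET request that sends the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url: str, user_agent: str, timeout: float = 30.0) -> str:
    """Fetch the URL with the given User-Agent and return the decoded body."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    url = "https://www.linkedin.com/company/linkedin"
    # Assumed default User-Agent formats; the versions are illustrative only.
    for user_agent in ("HTTPie/3.2.2", "curl/8.5.0"):
        try:
            body = fetch(url, user_agent)
        except urllib.error.URLError as error:
            # Covers HTTP error statuses as well as connection problems.
            print(f"{user_agent}: request failed ({error})")
            continue
        print(f"{user_agent}: authwall redirect = {looks_like_authwall(body)}")
```

If both User-Agent values give the same result here, the difference between HTTPie and curl is probably caused by something else, such as other default headers or cookies, so this check alone would not settle the question.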