Dvc.org: engine: return HTTP 404 when Page404 component is rendered

Created on 15 Oct 2019  路  30Comments  路  Source: iterative/dvc.org

Extracted from https://github.com/iterative/dvc.org/pull/690#issuecomment-541958989

Example: the engine renders a huge number "404", however it already responded with HTTP 200:

image

URL: https://dvc.org/doc/command-reference/bad

It would be ideal if the engine could check for the URL path validity first before responding by rendering the general layout (navigation, etc) with HTTP 200, and render it instead with HTTP 404 (and Page404).

This can aid in checking links by running script in #690. Once 404 is returned, we will be able to distinguish between existing pages and Not found ones.

doc-engine enhancement priority-p2

All 30 comments

And apparently one of these bad URLs from the website, it responds 304 Not Modified, not sure why.

Anyway, I wonder if @iAdramelk has any tips as to how to address these questions?

please, take a look at what URLs exactly does it return 304 (200 if you click it once again, I think - it's just a matter of caching on the browser side), 404. The point here is that it's engine requesting specific *.md files, not a regular dvc.org/doc/something.

@shcheklein Is it possible if we can render Page404 on a different URL? This way we will be able to pass writeHead(404) in server.js. I am not sure about this method though.

@algomaster99 could you elaborate please, not sure I understand the suggestion.

@shcheklein can you tell me in which file are the components mapped to their URL. I don't have much experience in next framework. But with simple react, I'm familiar. Usually the URLs are defined like this <Route path='/' component={App} />. I am looking for something corresponding to this here.
Basically, how does Next map each url to a component?

@algomaster99 the exact .md file to fetch is a convention that is based on the page URL + sidebar.json. Check also helper.js file. It's somewhat related to Next.js, but mostly our own logic that maps URLs to source files.

@shcheklein okay so, / is served my the server. Then rest after is mapped by helper.js, right?

@algomaster99 no, I think dvc.org/doc is mapped (by Next.js, implicitly) to the doc.js that is responsible for everything else after that. Unless I'm missing something.

@shcheklein so I went through the next documentation. I found two possible techniques to solve the problem.

What do you suggest?

so, even if change URL to /doc/404 when we render this wget won't detect this. The only way to detect with wget is either add SSR support (@iAdramelk should have more context on this), or use Selenium to write tests (which might be a good idea anyway, since will be able to test more things, not just broken URLs).

@shcheklein wget has a option to follow redirects but that may not be the best solution. Let's wait for @iAdramelk .

wget has a option to follow redirects but that may not be the best solution

Sorry guys, I don't understand the context of your discussions, but with respect to the statement above I can say that:
wget follows redirects by default. To suppress (or disable) this in the script I have used the option --max-redirect=0 (it is even explained with a comment on the script).

@dashohoxha Your script is fine but when we wget pages like /doc/user-guide/this-does-not-exist, our engine responds with 200 OK only. We need to change it to 404 to check links efficiently.

@shcheklein Shall we do it by defining a condition in our server?

if (/\/doc\/404/i.test(pathname)) {
        res.writeHead(404)
        res.end()
    }

I am not sure how can we render afterwards. Can you help?

@algomaster99 could you please try that?

@shcheklein I will update you after experimenting with it again.

take a look at what URLs exactly does it return 304 (200 if you click it once again, I think - it's just a matter of caching on the browser

Your'e right @shcheklein. This seems to be standard Node server behavior when the browser request includes If-None-Match header (sent if any URL is reloaded, at least in Chrome).

  1. The point here is that it's engine requesting specific *.md files, not a regular dvc.org/doc/something.

Right, there's a routing mechanism. My suggestion is that perhaps this could be evaluated (to determine path validity) before responding 200 with empty layout.

or use Selenium to write tests...

Selenium/ Headless Chrome tests are cool. I wrote a script that loads full dynamic js contents to calculate real web page size some time ago: https://github.com/jorgeorpinel/site-page-size-scraper/. (See full article.) We can still do this separately (what other users do you have in mind Ivan?), but I'm pretty sure Next should be able to return 404 here.

I went through the next documentation. I found two possible techniques...

@algomaster99 I did some searching based on some of the concepts there and found this article. Find the following text, maybe the answer is here: "Next.js by default displays a 404 error page when someone requests a non existing page, but for dynamic routes we need to do it ourselves."

We can create a URL /doc/404...

I don't think redirecting after rendering /doc should be the answer because the user will see the website load the empty layout and then reload itself again into the 404.

@jorgeorpinel I tried that but I am not able to send the response code as 404 from front-end (basically when we check if a requested page exists or not).

@shcheklein @jorgeorpinel Can you tell me how a request is processed? I simply typed http://localhost:3000/doc/use-cases before which I had deleted pages/index.js. But still, the Page404 was displayed. Where is the page getting served from?

Unfortunately I'm not familiar enough with the engine code yet to answer that question but I feel like it should be simple enough to answer. For this reason we've tried tagging @iAdramelk a couple times: I have the impression that he may be the most familiar with the Next.js engine at the moment.

@algomaster99 Sorry, somehow missed mentions.

Basically /doc/* pages now, for legacy reasons, are using non-standard routing mechanism (and we probably should replace it with the standard one to solve this issue:

  1. All /doc/* pages load pages/doc.js file by default. It is configured in the server.js.
  2. Then in shouldComponentUpdate we check for the URL and try to load corresponding md-file.
  3. If there is no such a file, then we render component for 404 error.

With current approach md-file loading happens after actual page load, so there is no way to solve http-headers problem there. Also this is why SSR is not working right now for docs pages.

Better way to render such pages would be by using getInitialProps this also will automatically render _error.js page if some error occurs inside getInitialProps and also by replacing current docs navigation logic with the Next.js component. I'm not totally sure which header will express return in such situation and can't find info in the docs through.

SSR?

@jorgeorpinel Server Side Rendering.

@iAdramelk closing this. please let me know if it's not resolved yet.

The 404 page is now the same as in other parts of the website, but the status code returned is still 200 (only inside /doc bad paths):

image

yep, @iAdramelk, I suppose we should be able to handle this properly, now?

We should be able to do it, yes.

Checked the doc's one more time and it is not this simple it looks. At the time then we run getInitialProps the page has already returned 200 header. So if we want to set correct header we still need to do it in server.js. Basically what we need to do is to run getItemByPath before line 109 here:

https://github.com/iterative/dvc.org/blob/5af60ee1cf0ea0530d5d40b8d935730933f3f8d7/server.js#L108-L110

and return with 404 if there is no such page. But to do so we need to rewrite src/Documentation/SidebarMenu/helper.js in commonjs (and probably move it together with sidebar.json to some directory closer to the root).

@iAdramelk it does not sound bad in terms of complexity, unless I'm missing something. What is your expectation?

@shcheklein Well, it's not that bad, but I initially hoped that we can just return some status from getInitialProps to automate these things. :)

Was this page helpful?
0 / 5 - 0 ratings