Hiding Database Query Strings: How do you get search engines to index your database-driven site?
This article has been archived due to outdated information or poor advice. Read on at your own risk.
The search engines won't index my database-driven pages. I'm calling for help. Your help, you web-savvy person.
Please do not submit webpages with these symbols in the URL: ampersand (&), percent sign (%), equals sign (=), dollar sign ($) or question mark (?). Our spider does not recognize them.
Further, searching AltaVista for web pages whose URL contains "asp", "section", and "id" turns up 3 pages. Circumstantial evidence, but telling nonetheless.
I dreadfully want this site to be indexed...the central purpose of the net.mind section is so that I can share this information with the world! How can I do this?
Someone from the Slashdot article on this subject said:
Why doesn't anyone use the ScriptAlias directive? It does the same thing as query strings, but makes it look nicer, like the rest of the web. You can "say" you're looking at a directory or a .html file, but in reality you are viewing a single script. For an example go to http://store.wolfram.com/. There are no directories on the server side, it's all served off of one script. Yet, to the user, it appears as a hierarchical directory structure, complete with .html files.
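For concreteness, here's roughly what that looks like in Apache's configuration (the script path here is hypothetical): every URL under /display/ invokes one CGI script, which receives the remainder of the path in the PATH_INFO environment variable and parses it itself.

```apache
# Hypothetical httpd.conf fragment: all of /display/... is served by
# one script. A request for /display/hidingquery/ runs display.cgi
# with PATH_INFO set to "/hidingquery/".
ScriptAlias /display /usr/local/apache/cgi-bin/display.cgi
```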
This sounds like a great idea to me! Where I currently have URLs like http://phrogz.dyndns.org/display.asp?nodeid=hidingquery, I could replace them with a sassy-looking (and cleaner) URL like http://phrogz.dyndns.org/display/hidingquery/
Unfortunately, the server is Microsoft's IIS, and ScriptAlias is a directive which works only on Apache Web servers. So I'm back where I started, though a little bit smarter. I have several current ideas on how to get search engines to index my content, and I'd like your feedback:
- Find some way to do the same thing as ScriptAlias on IIS.
This would be ideal--anyone have any leads?
Update: While I haven't received a 100% certain "can't be done" from anyone, a lot of IIS-savvy people have told me they didn't think it could be done. See the bottom of the document for my final choice.
- Create the desired physical hierarchy dynamically.
This wouldn't be too bad--the admin section of this site could physically create a /hidingquery/ directory and write the file inside it which would have code to #include the display.asp page. It would be rather annoying to have my site filling my drive with directories and dumb files just so that search engines can find me, but it's doable.
Update: I've decided that this is the route I'll probably go, though there's no need to mimic the hierarchy of the site. In fact, there's a good reason not to--if I just create a page per node, all in the same directory, then I can re-order the hierarchy internally and not have to update the physical site. However, I just realized--since this site is framed, I'll need to create 3 files for each node--one for the frameset, one for the content page (since I can't use my current system of putting a query string in the individual frames), and one for the side nav corresponding to each page. This choice, though better than the two below, has gotten ugly. (I also will need to create 5 different pages for the top frame.) It's almost enough to make me decide to make the noframe version of the site (which is planned) now and decide that there won't be a framed version available--but I refuse to let stupid technology bully me into making decisions which cause me to stray from my ideal goals.
- Spit flat files out of the database to the server.
As in delete and re-save a .html file whenever it gets updated. Not only do I worry about concurrency issues here (delete the file and re-write it? what if someone's looking?!) but it seems even worse to fill my drive with all the content which is also in the database.
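For what it's worth, the classic dodge for the delete-and-rewrite worry is to build the new file under a temporary name and then swap it into place, so a reader never catches the file half-written. A rough ASP sketch, with hypothetical paths, and an htmlContent variable assumed to already hold the rendered page:

```asp
<%
' Sketch: write the regenerated page to a temp file, then swap it in.
' The swap isn't perfectly atomic on every file system, but the window
' is far smaller than rewriting the live file in place.
Dim fso, f
Set fso = Server.CreateObject("Scripting.FileSystemObject")
Set f = fso.CreateTextFile(Server.MapPath("/pages/hidingquery.tmp"), True)
f.Write htmlContent   ' the full page, already rendered from the database
f.Close
If fso.FileExists(Server.MapPath("/pages/hidingquery.html")) Then
    fso.DeleteFile Server.MapPath("/pages/hidingquery.html")
End If
fso.MoveFile Server.MapPath("/pages/hidingquery.tmp"), _
             Server.MapPath("/pages/hidingquery.html")
%>
```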
- Fool the search engines.
I could maybe write something to spit the entire site into some other directory, which is handed out to search engines, where each page uses some nefarious trick to bounce its visitors to the real site. (Or maybe writes a link onto each page telling visitors that they're not supposed to be there, that they should be somewhere else.) For example, if I wrote out the content of each page with a META REFRESH at the top which bounced users to the real URL, the search engine wouldn't index the real site because of its database-ness, but would index the bouncing page before getting bounced. It's a little bit better than the above solution, insofar as there are no concurrency issues to mess with. But this has all sorts of staleness possibilities--the engines are indexing a time-delayed copy of the site. What if a user doesn't get bounced, and ends up wandering around the fake site? (I suppose I could spit it out with links all over which said "You're NOT supposed to be here, go to this other page instead.") What about the search engines which do index URLs with their full query string? No, I don't think that this one is the answer.
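For illustration, one of these bounce pages might look like the sketch below (using the real URL from above): the spider indexes the static copy, while a browser is whisked off to the live page.

```html
<!-- Sketch of a hypothetical bounce page. A zero-second refresh sends
     browsers to the real database-driven URL; the spider indexes the
     static content before "getting bounced". -->
<html>
<head>
  <title>Hiding Database Query Strings</title>
  <meta http-equiv="refresh"
        content="0;url=http://phrogz.dyndns.org/display.asp?nodeid=hidingquery">
</head>
<body>
  <p>You're NOT supposed to be here; the live page is
     <a href="http://phrogz.dyndns.org/display.asp?nodeid=hidingquery">over
     here</a> instead.</p>
  <!-- ...the page's full content repeated here for the spider... -->
</body>
</html>
```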
What do you think? How should I proceed? How can I make these pages part of the global net.mind, as accessed by search engines?
Resolution: Well, Option "A" didn't work out, so I've decided to do the variation of "B" mentioned above. When a node is added to the site, files are created which store the value of the nodeid, and then transparently #include the common processing page.
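A generated stub might be as simple as the following sketch. It assumes the shared logic has been factored into an include file (here called display.inc, a name I'm making up) which uses a nodeid variable when one is already set, rather than reading the query string:

```asp
<%
' Generated stub for one node: establish the id, then pull in the
' common processing code. "display.inc" is an assumed file name.
nodeid = "hidingquery"
%>
<!--#include virtual="/display.inc"-->
```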
This works fine in my case (though I had to do a bunch of work to make it happen) but it works for me only because I know in advance what variables may be passed to my script, and can hence create pages for each. This is not a truly flexible solution, however, because not all sites will have the luxury of being able to predict in advance what variables will get passed to the script. I don't think it's outrageous to assume that there could exist a script which, taking arbitrary values, produces useful content which should be indexed by search engines. They don't have to index all possible permutations, just whatever ones people happen to link to. *sigh*
Maybe Windows 2000 Server will add a ScriptAlias-like feature.
Edit: Note that this page was first written in 1999; a lot has changed since then, including the fact that 10 years later I finally moved away from IIS to a more reasonable web development platform.
|I think slashdot archives older articles, making them read-only, but making them searchable (I may be wrong). I was thinking along the lines of option B above that you might have a process running that at fixed time intervals would take a "snap-shot" of your dynamic hierarchy and output it to a static, but read-only, hierarchy. This would be searchable, and each page could have a link back to the dynamic, live, version so the person could see the latest comments or add their own.
As for the META REFRESH tag, I think some engines warn that they'll ignore those pages entirely, because that's a way people deceive the engines (fill a page with not-necessarily-relevant keywords, and then bounce the soon-to-be-confused user to an entirely different page.)
BTW, this comment entry box needs to be taller :)
|Option A seems like the best. Perchance you should give Apache for Windows a shot? They admit its performance on NT is not optimal, but they're working on it.
I would opt for B as choice #2. C I dismiss almost entirely - too difficult to implement, too bulky, too kludgy. And D almost begs for the wrath of the net community to be brought to bear on you full force. (At least I would be wrathful if I were to wind up wandering around a fake site, being scolded about being where I'm not supposed to be.)
|Unfortunately, I need to run IIS on this machine for work-related purposes. And, since IRP does a lot of development for NT-only systems, including database work, I want to find a way to solve it using IIS.|
|It may not be graceful or even useful, but, could you make a static page for each of the items you'd like to have indexed including instant redirects to the dynamic version of each page?
Like I said, not graceful. ;)
|That's pretty much Option D, above, which is too gross for the reasons stated. Anyhow, as noted above in the "Resolution" section, I've solved my problem for now, but haven't solved the problem in general. If anyone knows of any way to get IIS for NT to do Option A, or something similar, please let me know!|
|I don't know if you still care about this topic, it being so ancient. However, I was tooling around this morning (my daily random walk on the web before getting down to work), and I came across an article on www.sitepoint.com that addressed this issue. It sparked a memory: didn't gav have to wrestle with this ages ago on phrogz?
The link to the article:
Now, Sitepoint tends to lean in the LAMP direction, so the article is pretty Apache-oriented. However, one of the three suggestions should be doable on IIS: use Error Pages.
Essentially, you have a bunch of "fake" directories for your page URLs. When a browser (or spider) hits the URL, it doesn't exist, so your error page gets called. The error page, though, is an ASP script that parses the "bad" URL and serves the dynamic content based on the phony components in the link.
As the article points out, this is maddeningly inelegant and will make your server error logs into a joke.
The article actually uses Apache's .htaccess files to accomplish this, but from my hazy memory of IIS, it seems to me that this solution is possible.
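If memory serves, when IIS is configured with a URL-type custom error page, it hands that page the failed request as its query string, in the form "404;http://host/path". A hedged ASP sketch of such a handler follows; the variable names and the /display/nodeid path scheme are my own assumptions:

```asp
<%
' Hypothetical custom 404 handler. IIS passes the failed request as a
' query string like: 404;http://phrogz.dyndns.org/display/hidingquery/
Dim raw, path, parts, nodeid
raw  = Request.ServerVariables("QUERY_STRING")
path = Mid(raw, InStr(raw, "//") + 2)   ' drop the "404;http:" prefix
path = Mid(path, InStr(path, "/"))      ' drop the host, keep "/display/..."
parts = Split(path, "/")                ' e.g. "", "display", "hidingquery", ""
If UBound(parts) >= 2 And parts(1) = "display" Then
    nodeid = parts(2)
    Response.Status = "200 OK"          ' a real page, not an error
    ' ...nodeid can now drive the same code display.asp uses...
Else
    Response.Status = "404 Not Found"
End If
%>
```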
Anyway, you solved your dilemma years ago, but I still thought you might find a new alternative interesting to contemplate.
|I tried option B several months back. Quite useful. I have a suggestion for you: why don't you look at IE's offline web synchronization process? I learned from it and am successfully using the findings. What do you think?|
|I've done something very like this by simply setting up a "ghost" site, i.e. the content doesn't exist, then installing a custom 404 (also 403) handler that is a script (.asp) rather than a static file (.htm). Inside this script, I decode the requested URL, retrieve the real content, and return it along with a 200 status. See
|Hey, don't know how old this is, but you might want to just use mod_rewrite if you have Apache... though you probably have IIS, idk. But anyway, it creates fake directories and rewrites them so url.com/value1/value2/ = url.com/file.asp?dir1=value1&dir2=value2, and the directories don't actually exist, but look like they do and the search engine is none the wiser. I'm a n00b and just got this to work for me and it's doing pretty well. Give it a try (instead of actual directories).|
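For anyone reading along with Apache handy, the commenter's suggestion boils down to a couple of lines of .htaccess; here's a sketch using this site's URL scheme as the example (on a real Apache box the target would of course be something other than an .asp page):

```apache
# Hypothetical .htaccess fragment (requires mod_rewrite). A request
# for /display/hidingquery/ is rewritten internally to the real
# query-string URL; the directory never has to exist, and visitors
# and spiders only ever see the clean path.
RewriteEngine On
RewriteRule ^display/([^/]+)/?$ /display.asp?nodeid=$1 [L]
```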