Google Presents Code Search
Nik Cubrilovic
Google today launched Code Search, a search engine and index of source code collected from publicly available sources. Google claims that the new code search engine will be able to surface almost any code that its crawler can reach, but in a few specific searches it failed to locate some code that I had hosted on my own server, though this is sure to improve. It does seem that the Google index of source code is a lot broader than those found at competing sites Krugle and Koders. For instance, Google Code Search will index the content of zip and tarball files on open source sites such as openssl.org, while the other search sites seem to return a lot of results from SourceForge and a few other centralized repositories.
The first thing you notice at Google Code Search is that you can use regular expressions in the query field when searching, and there are a lot of search options to help you further refine what you are looking for. On the front page of Google Code Search there is a nice overview with some pointers on using the service.
To test Google Code Search against both Krugle and Koders, I ran a search for “md5 in C”, hoping to find an implementation of the MD5 hash algorithm in C. In Google, I can specify the implementation language in the search query itself, while in both Krugle and Koders I needed to select the language from a drop-down. Krugle and Koders didn’t seem to filter the results by language very well, as both returned implementations in other languages. One problem here is that the search engines don’t actually know you are looking for a simple implementation of MD5; they are just string-matching against their indexes, so you get some very poor results (such as functions that call an MD5 library). Across the three search engines, I could not find a good, pure MD5 implementation, just a lot of header files and functions that had the string ‘md5’ within them.
All of these search engines have a long way to go before they become a shortcut for developers to find code, especially considering that most developers are already adept at using ordinary search engines to find what they are looking for. Searching for a phrase like “drop-down menu in ajax” won’t return anything useful, so developers who don’t know which specific string within the code they are looking for will have a hard time. Track record would suggest that Google is the company most likely to get this right, by combining the information it has in its main search engine with the source code data for better results (for example, I can see them indexing code examples from MSDN rather easily). This looks like bad news for the startups in this space, who will need to innovate further, but it is good news for Google, a company that hasn’t really been hitting home runs with some of its recent products.






Google has its hands on everything. Hasn’t it done well?
It’s a bit of a different animal, but I find O’Reilly’s Code Search to be easy to use and a great source when I’m looking for a quick example of usage or syntax:
http://labs.oreilly.com/code/
hmmm., this is kinda cool for programmers,
does it only search html?
similar in nature to http://www.krugle.com. wonder what the guys over at krugle are saying this morning!!
the issue with searching/finding code and using a system like this is that you’d really like to know a lot about the quality of the code that you’re using… and right now, no one’s got that part of the solution.
you’d also like to know how you can test the code chunks that you might find.
these systems also don’t permit you to post that you’re looking for a chunk of code to be able to solve some particular issue. mailing lists are good/bad for being able to ask these kinds of questions. but there again, you have no idea as to the skill level of the answer you get from the mailing lists.
You can sure find some interesting stuff with this…for example this search for wordpress config files. That can’t be good. If you need me, I’ll be auditing the security on my SVN repos…
Sounds like the problem of finding an MD5 implementation in C is very similar to the natural language problems identified in the previous post about Powerset.
Raj,
Ha.. good call! Lets wait for Powerset! =)
Jesse - Your search reminds me of all the stuff from johnny @ ihackstuff.com
Just found on Digg an article that has a link which leads to google’s code search results for ‘keygen’ and it looks like google was lucky enough to find a nice little section of code to generate Winzip 9.0 serial numbers.
You have identified the problem really well - it is hard to do a search for code and get meaningful results. Almost a year ago we decided that just plain code search doesn’t cut it. So instead we decided to figure out how to tie together project and code search. Taking your example, I did the following search http://www.krugle.com/kse/proj.....;lang_pr=C. The first hit is the Md5some project (http://www.krugle.com/kse/projects/zar7fsk) which contains exactly what you were looking for - the file md5some.c (http://www.krugle.com/kse/files/cvs/cvs.sourceforge.net/md5some/md5some/md5some.c).
By the way, you mentioned that one needed to use a popup to select the language in Krugle - actually you can either use the popup, or language:c in the query.
that’s scary about the wordpress config file… so how do we block our information from being shown?
Ye I was earlier having some fun with putting together queries that would find passwords in code, password files, admin areas, etc. Google isn’t to blame, they are just making it easier to find this stuff
As for the serial number generators I assume that Google and the others would have some sort of policy in taking that down if requested similar to what they do on the main search engine
Not sure if this is exploitable but in Java when you need to connect to a database it is possible to supply the username and password as parameters.
Not sure if these databases are meant to contain confidential info or how to access them externally but look at this search result for example :-
http://google.co.uk/codesearch.....age.jsp#a0
Check out the line :
connect=DriverManager.getConnection("jdbc:odbc:RegisterDataSource","wasiqr","bhalbabatu");
I’m assuming a good coder would store the MD5 hash or something, but if not that is a big problem since people’s confidential info could be at risk.
Mike - what?? If you published your passwords, you better change them ASAP. If they’re on your servers, they should NOT be visible to Google- and if they are, they’re visible to any old hacker too.
As for the engines themselves… pretty lame. Google Code returns almost only CPAN (perl) results for a query “in ruby”. Krugle and Koders don’t return anything at all for my keyword.
Yes, vanity obliging, I searched for code I wrote for gravatar caching. A normal G search for “ruby” doesn’t work as well as “rails”. Sigh.
Alright - this is cool indeed. For a small company, this is just one way “Extreme Programming”
Oh…. I see, I didn’t notice that it only searched files inside archives (zip, tar, gz). But yeah, I got scared because you have to put your password in that wp-config.php file on the server for it to work just like Utills said and showed in his example
http://labs.oreilly.com/search.....arch+Again
Shows the code to use OpenSSL and Win32 API library calls.
There’s O’Reilly Media’s Code Search: http://labs.oreilly.com/code/ as well. It only covers the code from their books but it has an excellent UI.
I did a search again today to compare Krugle and Google Code search for “shopping cart” and language “java”. Krugle showed 90% Sourceforge project code and Google showed 90% code from variety projects!!! I like variety.
Find more in my blog :
http://smallpanda.wordpress.co.....vs-krugle/
Of course Google is getting into this game. Far be it for them to let a chance pass to get their fingers into another pie. I’m not a programmer. Far from one, actually. But having interviewed quite a few of them - including Steve Larsen from Krugle - I’m pulling for the start-ups to gain some traction here.
(if you’re interested, you can find the conversation I had with Steve at: http://www.guidewireconnection.....tem=161457)
Given that it is a search engine intended for coders it is actually pretty easy to find what you are looking for using the regexp syntax (and correcting your search term to use the correct way to specify programming language):
^int\s+md5 lang:c file:.c$
This will return only .c files (no header files) and does a decent job of returning only function definitions (though it picks up some variable declarations). The problem you were having is a familiar one to C programmers trying to find where a function is defined rather than its declaration or calls to it.
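To see why a query like this narrows things down, here is a rough local sketch in Python of the two filters it combines. This is not how Google Code Search works internally; the file names and contents are made up, the `search` helper is hypothetical, and I have assumed the file filter means "name ends in .c" (with the dot escaped).

```python
import re

# The commenter's line pattern: starts with "int", whitespace, then "md5".
LINE_PATTERN = re.compile(r"^int\s+md5")
# Assumption: the file filter is meant as "name ends in .c" (dot escaped here).
FILE_PATTERN = re.compile(r"\.c$")

# Hypothetical file contents, made up for illustration only.
FILES = {
    "md5.h": "int md5_init(void);\n",
    "md5.c": (
        "int md5_init(void)\n"
        "{\n"
        "    return 0;\n"
        "}\n"
        "int md5_calls = 0;\n"
        "static void demo(void) { md5_init(); }\n"
    ),
}


def search(files):
    """Return (filename, line) pairs that pass both the file and line filters."""
    hits = []
    for name, text in files.items():
        if not FILE_PATTERN.search(name):
            continue  # skip headers and anything else not ending in .c
        for line in text.splitlines():
            if LINE_PATTERN.search(line):
                hits.append((name, line))
    return hits


for name, line in search(FILES):
    print(f"{name}: {line}")
```

The header prototype and the call inside demo() are filtered out, the definition is kept, and the variable `md5_calls` slips through, which is the kind of noise mentioned above.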
I agree with Craig Ogg, it seems that this engine produces more useful results when you write your queries as regular expressions, the same way you would in an actual program.
that’s really a good tool
Have you heard of http://merobase.com? Another cool code search engine
I just don’t see what the benefit is to code search. What does it provide that normal Google doesn’t?
cbmeeks
http://www.codershangout.com
@cbmeeks
Try to find something like
public class Stack {
public void push(int i) {}
public int pop() {}
}
with normal Google. Google’s codesearch seems to make it possible with regex. It’s not possible with Krugle and Koders? Merobase does search this directly.
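As a rough idea of what searching for that "shape" with a regex could look like, here is a small Python sketch run against local strings. It is not the query syntax of Google, Krugle, Koders, or Merobase; the pattern, the `has_stack_shape` helper, and the sample sources are all assumptions made up for illustration.

```python
import re

# Match a class named Stack that declares void push(int ...) and later int pop().
# DOTALL lets ".*?" span line breaks between the class header and the methods.
STACK_SHAPE = re.compile(
    r"class\s+Stack\b.*?\bvoid\s+push\s*\(\s*int\s+\w+\s*\)"
    r".*?\bint\s+pop\s*\(\s*\)",
    re.DOTALL,
)


def has_stack_shape(source):
    """Return True if the source text contains the Stack/push/pop signature shape."""
    return STACK_SHAPE.search(source) is not None


# Hypothetical Java sources, written only to exercise the pattern.
STACK_JAVA = """
public class Stack {
    private int[] items = new int[16];
    private int size;
    public void push(int value) { items[size++] = value; }
    public int pop() { return items[--size]; }
}
"""

QUEUE_JAVA = """
public class Queue {
    public void push(int value) {}
    public int pop() { return 0; }
}
"""

print(has_stack_shape(STACK_JAVA))
print(has_stack_shape(QUEUE_JAVA))
```

A plain keyword search for "Stack push pop" would match both samples equally well; the regex only accepts the one whose class name and method signatures line up.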
search the code for finite automata implementation in ruby